Using altimetry observations combined with GRACE to select parameter sets of a hydrological model in a data-scarce region

. Limited availability of ground measurements in the vast majority of river basins world-wide increases the value of alternative data sources such as satellite observations in hydrological modelling. This study investigates the potential of using remotely sensed river water levels, i.e. altimetry observations, from multiple satellite missions to identify parameter sets for a hydrological model in the semi-arid Luangwa River basin in Zambia. A distributed process-based rainfall–runoff model with sub-grid process heterogeneity was developed and run on a daily timescale for the time period 2002 to 2016. As a benchmark, feasible model parameter sets were identiﬁed using traditional model calibration with observed river discharge data. For the parameter identiﬁcation using remote sensing, data from the Gravity Recovery and Climate Experiment (GRACE) were used in a ﬁrst step to restrict the feasible parameter sets based on the seasonal ﬂuctuations in total water storage. Next, three alternative ways of further restricting feasible model parameter sets using satellite altimetry time series from 18 different locations along the river were compared.

Abstract. Limited availability of ground measurements in the vast majority of river basins world-wide increases the value of alternative data sources such as satellite observations in hydrological modelling. This study investigates the potential of using remotely sensed river water levels, i.e. altimetry observations, from multiple satellite missions to identify parameter sets for a hydrological model in the semi-arid Luangwa River basin in Zambia. A distributed process-based rainfall-runoff model with sub-grid process heterogeneity was developed and run on a daily timescale for the time period 2002 to 2016. As a benchmark, feasible model parameter sets were identified using traditional model calibration with observed river discharge data. For the parameter identification using remote sensing, data from the Gravity Recovery and Climate Experiment (GRACE) were used in a first step to restrict the feasible parameter sets based on the seasonal fluctuations in total water storage. Next, three alternative ways of further restricting feasible model parameter sets using satellite altimetry time series from 18 different locations along the river were compared. In the calibrated benchmark case, daily river flows were reproduced relatively well with an optimum Nash-Sutcliffe efficiency of E NS,Q = 0.78 (5/95th percentiles of all feasible solutions E NS,Q,5/95 = 0.61-0.75). When using only GRACE observations to restrict the parameter space, assuming no discharge observations are available, an optimum of E NS,Q = −1.4 (E NS,Q,5/95 = −2.3-0.38) with respect to discharge was obtained. The direct use of altimetry-based river levels frequently led to overestimated flows and poorly identified feasible parameter sets (E NS,Q,5/95 = −2.9-0.10). Similarly, converting modelled discharge into water levels us-ing rating curves in the form of power relationships with two additional free calibration parameters per virtual station resulted in an overestimation of the discharge and poorly identified feasible parameter sets (E NS,Q,5/95 = −2.6-0.25). However, accounting for river geometry proved to be highly effective. This included using river cross-section and gradient information extracted from global high-resolution terrain data available on Google Earth and applying the Strickler-Manning equation to convert modelled discharge into water levels. Many parameter sets identified with this method reproduced the hydrograph and multiple other signatures of discharge reasonably well, with an optimum of E NS,Q = 0.60 (E NS,Q,5/95 = −0.31-0.50). It was further shown that more accurate river cross-section data improved the water-level simulations, modelled rating curve, and discharge simulations during intermediate and low flows at the basin outlet where detailed on-site cross-section information was available. Also, increasing the number of virtual stations used for parameter selection in the calibration period considerably improved the model performance in a spatial split-sample validation. The results provide robust evidence that in the absence of directly observed discharge data for larger rivers in data-scarce regions, altimetry data from multiple virtual stations combined with GRACE observations have the potential to fill this gap when combined with readily available estimates of river geometry, thereby allowing a step towards more reliable hydrological modelling in poorly gauged or ungauged basins.

Introduction
Reliable models of water movement and distribution in terrestrial systems require sufficient good-quality hydrometeorological data throughout the modelling process. However, the development of robust models is challenged by the limited availability of ground measurements in the vast majority of river basins world-wide . Therefore, modellers increasingly resort to alternative data sources such as satellite data (Lakshmi, 2004;Winsemius et al., 2008;Sun et al., 2018;Pechlivanidis and Arheimer, 2015;Demirel et al., 2018;Zink et al., 2018;Rakovec et al., 2016;Nijzink et al., 2018;Dembélé et al., 2020).
Satellite altimetry observations provide estimates of the water level relative to a reference ellipsoid. For these observations, a radar signal is emitted from the satellite in the nadir direction and reflected back by the Earth's surface. The time difference between sending and receiving this signal is then used to estimate the distance between the satellite and the Earth's surface. As the position of the satellite is known at very high accuracy, this distance can then be used to infer the surface level relative to a reference ellipsoid (Łyszkowicz and Bernatowicz, 2017;Calmant et al., 2009). Satellite altimetry is sensed and recorded along the satellite's track. Altimetry-based water levels can therefore only be observed where these tracks intersect with open water surfaces. For rivers, these points are typically referred to as "virtual stations" (de Oliveira Campos et al., 2001;Birkett, 1998;Schneider et al., 2017;Jiang et al., 2017;Seyler et al., 2013). Depending on the satellite mission, the equatorial inter-track distance can vary between 75 and 315 km, the along-track distance between 173 and 374 m, and the temporal resolution between 10 and 35 d (Schwatke et al., 2015;CNES, 2018;ESA, 2018;Łyszkowicz and Bernatowicz, 2017). Due to this rather coarse resolution, the application of remotely sensed altimetry data is at this moment limited to large lakes or rivers of more than approximately 200 m wide (Getirana et al., 2009;de Oliveira Campos et al., 2001;Biancamaria et al., 2017). Use of altimetry for hydrological models so far also remains rather rare due to the relatively low temporal resolution of the data, with applications typically limited to monthly or longer modelling time steps (Birkett, 1998).
In some previous studies, altimetry data were used to estimate river discharge at virtual stations in combination with routing models (Michailovsky and Bauer-Gottwein, 2014;Michailovsky et al., 2013) or stochastic models . Other studies either directly related river altimetry to modelled discharge (Getirana et al., 2009;Getirana and Peters-Lidard, 2013;Leon et al., 2006;Paris et al., 2016) or they relied on rating curves developed with water-level data from either in situ measurements (Michailovsky et al., 2012;Tarpanelli et al., , 2017Papa et al., 2012) or, alternatively, from altimetry data (Kouraev et al., 2004). In typical applications, radar altimetry data from one single or only a few virtual stations were used for model calibration, validation, or data assimilation. These data were mostly obtained from a single satellite mission, either TOPEX/Poseidson or Envisat (Sun et al., 2012;Getirana, 2010;Liu et al., 2015;Pedinotti et al., 2012;Fleischmann et al., 2018;Michailovsky et al., 2013;Bauer-Gottwein et al., 2015). In previous studies, hydrological models have been calibrated or validated successfully with respect to (satellite-based) river water levels, for example by (1) applying the Spearman rank correlation coefficient (Seibert and Vis, 2016;Jian et al., 2017) or by converting modelled discharge to stream levels using (2) rating curves whose parameters are free calibration parameters in the modelling process (Sun et al., 2012;Sikorska and Renard, 2017) or (3) the Strickler-Manning equation to directly estimate water levels over the hydraulic properties of the river (Liu et al., 2015;Hulsman et al., 2018).
In the Zambezi River basin, altimetry data have been used in previous studies for hydrological modelling (Michailovsky and Bauer-Gottwein, 2014;Michailovsky et al., 2012). These studies used the altimetry data from the Envisat satellite in an assimilation procedure to update states in a Muskingum routing scheme. Including the altimetry data improved the model performance, especially when the model initially performed poorly due to high model complexity or input data uncertainties.
Despite these recent advances in using river altimetry in hydrological studies, exploitation of its potential is still limited. Various previous studies have argued and provided evidence based on observed discharge data that, in a special case of multi-criteria calibration, the simultaneous model calibration to flow in multiple sub-basins of a river basin can be beneficial for a more robust selection of parameter sets and thus for a more reliable representation of hydrological processes and their spatial patterns (e.g. Ajami et al., 2004;Clark et al., 2016;Hrachowitz and Clark, 2017;Hasan and Pradhanang, 2017;Santhi et al., 2008). Hence, there may be considerable value in simultaneously using altimetry data not only from one single satellite mission, but also in combining data from multiple missions, which has not yet been systematically explored. While promising calibration results using data from Envisat were found by Getirana (2010) in tropical and Liu et al. (2015) in snow-dominated regions, altimetry data from multiple sources have not yet been used to calibrate hydrological models in semi-arid regions.
As altimetry observations only describe water-level dynamics, they do not provide direct information on the discharge amount. In an attempt to reduce the uncertainty in modelled discharge arising from the missing information on flow amounts, data from the Gravity Recovery and Climate Experiment (GRACE), which provides estimates of the monthly total water storage anomalies, were used to support model calibration. With GRACE, discharge can be constrained through improved simulation of the rainfall partitioning into runoff and evaporation as illustrated in previous studies (Rakovec et al., 2016;Bai et al., 2018).
Therefore, the overarching objective of this study is to explore the combined information content (cf. Beven, 2008) of river altimetry data from multiple satellite missions and GRACE observations to identify feasible parameter sets for the calibration of hydrological models of large river systems in a semi-arid, data-scarce region.
More specifically, in a step-wise approach we use GRACE observations together with altimetry data from multiple virtual stations to identify model parameters following three different strategies, and we compare model performances to a traditional calibration approach based on in situ observed river discharge. These three strategies compare altimetry observations to (1) modelled discharge by applying the Spearman rank correlation coefficient and to modelled stream levels by converting modelled discharge using (2) rating curves whose parameters were treated as free model calibration parameters and (3) the Strickler-Manning equation to infer water levels directly from hydraulic properties of the river. These three strategies are tested on a distributed processbased rainfall-runoff model with sub-grid process heterogeneity for the Luangwa basin. More specifically, we test the following research hypotheses: (1) the use of altimetry data combined with GRACE observations allows a meaningful selection of feasible model parameter sets to reproduce river discharge depending on the applied parameter identification strategy, and (2) the combined application of multiple virtual stations from multiple satellite missions improves the model's ability to reproduce observed hydrological dynamics.

Site description
The study area is the Luangwa River in Zambia, a tributary of the Zambezi River (Fig. 1). It has a basin area of 159 000 km 2 , which is about 10 % of the Zambezi River basin. The Luangwa basin is poorly gauged, mostly unregulated, and sparsely populated with about 1.8 million inhabitants in 2005 (World Bank, 2010). The mean annual precipitation is around 970 mm yr −1 , potential evaporation is around 1555 mm yr −1 , and river runoff reaches about 100 mm yr −1 (World Bank, 2010). The main land cover consists of broadleaf deciduous forest (55 %), shrubland (25 %), and savanna grassland (16 %) (ESA and UCLouvian, 2010). The irrigated area in the basin is limited to about 180 km 2 , i.e. roughly 0.1 % of the basin area with an annual water use of about 0.7 mm yr −1 , which amounts to < 0.001 % of the annual basin water balance (World Bank, 2010). The landscape varies between low-lying flat areas along the river to large escarpments mostly in the north-west of the basin and highlands with an elevation difference of up to 1850 m (see Fig. 1b and Sect. 3.2 for more information on the landscape classification). During the dry season, the river meanders between sandy banks, while during the wet season from November to May it can cover floodplains several kilometres wide.
The Luangwa drains into the Zambezi downstream of the Kariba Dam and upstream of the Cahora Bassa Dam. The operation of both dams is crucial for hydropower production and flood and drought protection, but is very difficult due to the lack of information from poorly gauged tributaries such as the Luangwa (SADC-WD and Zambezi River Authority, 2008;Schleiss and Matos, 2016;World Bank, 2010). As a result, the local population has suffered from severe floods and droughts (ZAMCOM et al., 2015;Beilfuss and dos Santos, 2001;Hanlon, 2001;SADC-WD and Zambezi River Authority, 2008;Schumann et al., 2016

Gridded data products
Besides the in situ observations, gridded data products were used in this study for topographic description, model forcing (precipitation and temperature), and model parameter selection/calibration (total water storage anomalies), as shown in Table 1. The temperature data were used to estimate the potential evaporation according to the Hargreaves method (Hargreaves and Samani, 1985;Hargreaves and Allen, 2003).
Gravity Recovery and Climate Experiment (GRACE) observations describe monthly total water storage anomalies which include all terrestrial water stores present in the groundwater, soil moisture, and surface water. Two identical satellites observe the variations in the Earth's gravity field to detect regional mass changes which are dominated by variations in the terrestrial water storage once atmospheric effects have been accounted for (Landerer and Swenson, 2012;Swenson, 2012). In this study, processed GRACE observations of Release 05 generated by the CSR (Centre for Space Research), GFZ (GeoForschungsZentrum Potsdam), and JPL (Jet Propulsion Laboratory) were downloaded from   (Swenson, 2012;Swenson and Wahr, 2006;Landerer and Swenson, 2012) n/a means not applicable.
the GRACE Tellus website (https://grace.jpl.nasa.gov/, last access: June 2017). The averages of all three sources were used. The raw data were previously processed by the CSR, GFZ, and JPL to remove atmospheric mass changes using ECMWF (European Centre for Medium-Range Weather Forecasts) atmospheric pressure fields, systematic errors causing north-south-oriented stripes, and high-frequency noise using a 300 km wide Gaussian filter via spatial smoothing (Swenson and Wahr, 2006;Landerer and Swenson, 2012;Wahr et al., 1998). Processed GRACE observations describe terrestrial water storage anomalies in "equivalent water thickness" in centimetres relative to the 2004-2009 timemean baseline. In other words, the water storage anomaly is the water storage minus the long-term mean (Landerer and Swenson, 2012). All gridded information was rescaled to the model resolution of 0.1 • . The temperature and GRACE data were rescaled by dividing each cell of the satellite product into multiple cells such that the model resolution is obtained, retaining the original value. The precipitation was rescaled by taking the average of all cells located within each model cell.

River geometry information
In the Luangwa basin, very limited detailed in situ information was available on the river geometry such as cross sec-Hydrol. Earth Syst. Sci., 24, 3331-3359, 2020 https://doi.org/10.5194/hess-24-3331-2020  (Pandya et al., 2017;Zhou and Wang, 2015). This was done for each virtual station and the basin outlet. Google Earth only provides river geometry information above the river water level. As the Luangwa is a perennial river, parts of the cross section remain submerged throughout the year and are thus unknown. To limit uncertainties arising from this issue, the cross-section geometry for each virtual station was extracted from Google Earth images with the lowest water levels. The dates of these images in general fall in the dry season, with flows at the Great East Road Bridges gauging station on the respective days ranging from 1 % to 4 % relative to the maximum discharge (see Table S3 in the Supplement for the dates of the satellite images and the associated flows at the Great East Road Bridges gauging station). The database underlying the global terrain images in Google Earth originate from multiple, merged data sources with varying spatial resolutions. For the Luangwa basin these include the Shuttle Radar Topography Mission (SRTM) with a spatial resolution of 30 m, Landsat 8 with a spatial resolution of 15 m, and the Satellite Pour l'Observation de la Terre 4/5 (SPOT) with a spatial resolution of 2.5 to 20 m (Smith and Sandwell, 2003;Irons et al., 2012;Drusch et al., 2012). In addition to Google Earth data, the submerged part of the channel cross section was surveyed in the field on 27 April 2018 near the Great East Road Bridges river gauging station at the coordinates 30 • 13 E and 15 • 00 S (Abas, 2018) with an Acoustic Doppler Current Profiler (ADCP).

General approach
The potential of river altimetry for model calibration was tested with a process-based hydrological model for the Luangwa River basin. This model relied on distributed forcing allowing for spatially explicit distributed water storage calculations. The model was run on a daily timescale for the time period 2002 to 2016. To reach the objective of this study, the following distinct parameter identification strategies were compared in a step-wise approach: (1) traditional model calibration to observed river flow as a benchmark; (2) identification of parameter sets reproducing the seasonal water storage anomalies based on GRACE data only; (3a) Altimetry Strategy 1: identification of parameter sets directly based on remotely sensed water levels combined with GRACE data; (3b) Altimetry Strategy 2: identification of parameter sets based on remotely sensed water levels by converting modelled discharges into water levels using calibrated rating curves combined with GRACE data; (3c) Altimetry Strategy 3: identification of parameter sets based on remotely sensed water levels by converting modelled discharges into water levels using the Strickler-Manning equation and including river geometry information (cross section and gradient) extracted from Google Earth combined with GRACE data; (4a) Water level Strategy 1: identification of parameter sets based on daily river water level at the catchment outlet only using the Strickler-Manning equation and including river geometry information extracted from Google Earth combined with GRACE data; and (4b) Water level Strategy 2: identification of parameter sets based on daily river water level at the catchment outlet only using the Strickler-Manning equation and including river geometry information obtained from a detailed field survey with an ADCP combined with GRACE data. Note that (1) is completely independent of (2) to (4), where no discharge data were used for the identification of parameter sets.

Hydrological model structure
In this study, a process-based rainfall-runoff model with distributed water accounting and sub-grid process heterogeneity was developed (Ajami et al., 2004;Euser et al., 2015). The river basin was discretized into a grid with a spatial resolution of 10 × 10 km 2 . Each model grid cell was characterized by the same model structure and parameter sets but forced by spatially distributed, gridded input data (Table 1). Runoff was then calculated in parallel for each cell separately. Subsequently, a routing scheme was applied to estimate the aggregated flow in each grid cell at each time step.
Adopting the FLEX-Topo modelling concept  and extending it to a gridded implementation, each grid cell was further discretized into functionally distinct hydrological response units (HRUs) as demonstrated by Nijzink et al. (2016). Each point within a grid cell was assigned to a response class based on its position in the landscape as defined by its local slope and "Height-above-thenearest-drainage" (HAND; Rennó et al., 2008;Gharari et al., 2011). Similarly to previous studies (e.g. Gao et al., 2016;Nijzink et al., 2016), the response units plateau, hillslope, terrace, and wetland were distinguished. Reflecting earlier work (e.g. Gharari et al., 2011), all locations with a slope of > 4 % were assumed to be hillslope. Locations with lower slopes were then either defined as wetland (HAND < 11 m), terrace (11 m ≤ HAND < 275 m), or plateau (HAND ≥ 275 m); see Fig. 2. Following this classification, wetlands make up p HRU = 8 %, terraces p HRU = 41 %, hillslopes p HRU = 28 %, and plateaus p HRU = 23 % of the total Luangwa River basin area as mapped in Fig. 1b.
Each response class consisted of a series of storage components that were linked by fluxes. The flow generated from each grid cell at any given time step was then computed as the area-weighted flow from the individual response units plus a contribution from the common groundwater component which connects the response units (Fig. 2). Finally, the outflow from each modelling cell was routed to downstream cells to obtain the accumulated flow in each grid cell at any Figure 2. Sketch of the hydrological response units including the thresholds used in this analysis for the slope and HAND (height above nearest drainage) and including their corresponding model structures. This spatial sub-grid discretization was applied to each grid cell. Symbol explanation: precipitation (P ), effective precipitation (P e ), interception evaporation (E i ), plant transpiration (E t ), infiltration into the unsaturated root zone (R u ), drainage to the fast runoff component (R f ), delayed fast runoff (R fl ), lag time (T lag ), groundwater recharge (R r ), upwelling groundwater flux (R GW ), fast runoff (Q f ), groundwater/slow runoff (Q s ).
given time step. For this purpose, the mean flow length of each model grid cell to the outlet was derived based on the flow direction extracted from the digital elevation model. The flow velocity, which was assumed to be constant in space and time, was calibrated. With this information on the flow path length and velocity, the accumulated flow in each grid cell was calculated at the end of each time step. The relevant model equations are given in Table 3. This concept was previously successfully applied in a wide range of environments Gharari et al., 2014;Fovet et al., 2015;Nijzink et al., 2016;Prenner et al., 2018).

Parameter selection procedures
To evaluate the information content and thus the utility of altimetry data for the selection of feasible model parameter sets, a step-wise procedure as specified in detail below was applied ( Table 5). Note that given data scarcity and the related issues of epistemic uncertainties (Beven and Westerberg, 2011;McMillan and Westerberg, 2015) and equifinality (Beven, 2006;Savenije, 2001) we did not aim to identify the "optimal" parameter set in what is frequently considered a traditional calibration approach. In most hydrological applications the available data have limited strength for rigorous model tests (Clark et al., 2015;Gupta et al., 2008;Jakeman and Hornberger, 1993). Thus, to reduce the risk of rejecting good parameters when they should have been accepted (Beven, 2010;Hrachowitz and Clark, 2017), we rather attempted to identify and discard the most implausible parameter sets (Freer et al., 1996) that violate our theoretical understanding of the system or that are inconsistent with the available data (Knutti, 2008). This allowed us to iteratively constrain the feasible parameter space and thus the uncertainty around the modelled hydrograph . To do so, a Monte Carlo sampling strategy with uniform prior parameter distributions was applied to generate 5 × 10 4 model realizations. This random set of solutions was in the following steps used as a baseline and iteratively constrained by identifying parameter sets that do not satisfy prespecified criteria (see below), depending on the data type and source used.

Benchmark: parameter selection based on observed discharge data
As a benchmark, and following a traditional calibration procedure, the model was calibrated with observed daily discharge based on the Nash-Sutcliffe efficiency (E NS,Q , Eq. 1) in Table 4 using all complete hydrological years within the Table 3. Equations applied in the hydrological model. Fluxes (mm d −1 ): precipitation (P ), effective precipitation (P e ), potential evaporation (E p ), interception evaporation (E i ), plant transpiration (E t ), infiltration into the unsaturated zone (R u ), drainage to fast runoff component (R f ), delayed fast runoff (R fl ), groundwater recharge (R r for each relevant HRU and R r,tot combining all relevant HRUs), upwelling groundwater (R GW for each relevant HRU and R GW,tot combining all relevant HRUs), fast runoff (Q f for each HRU and Q f,tot combining all HRUs), groundwater/slow runoff (Q s ), total runoff (Q m ). Storages (mm): storage in interception reservoir (S i ), storage in unsaturated root zone (S u ), storage in groundwater/slow reservoir (S s ), storage in fast reservoir (S f ). Parameters: interception capacity (I max ) (mm), maximum upwelling groundwater (C max ) (mm d −1 ), maximum root zone storage capacity (S umax ) (mm), splitter (W ) (-), shape parameter (β) (-), transpiration coefficient (C e ) (-), time lag (T lag ) (d), reservoir timescales (d) of fast (K f ) and slow (K s ) reservoirs, areal weights (p HRU ) (-), time step ( t) (d). Model parameters are shown in bold letters in the table below. The equations were applied to each hydrological response unit (HRU) unless indicated differently.

Reservoir system Water balance equation Process functions
Interception Unsaturated zone Plateau/terrace: Supporting literature Gharari et al. (2014); Gao et al. (2014); Euser et al. (2015) time period 2002 to 2016 (Nash and Sutcliffe, 1970). These are the years starting in the fall of 2004, 2006, and 2008. To limit the solutions to relatively robust representations of the system while allowing for data and model uncertainty (e.g. Beven, 2006;Beven and Westerberg, 2011), only parameter sets that resulted in E NS,Q ≥ 0.6 were retained as feasible. The hydrological model consisted of 18 free calibration parameters (Table 5, Fig. S1 in the Supplement) whose uniform prior distributions are given in Table S1 in the Supplement with associated parameter constraints as summarized in Table S2.

Parameter selection based on the seasonal water storage (GRACE)
In a next step we assumed that discharge records in the Luangwa basin were absent. The starting assumption thus had to be that all model realizations, i.e. all sampled parameter sets, were equally likely to allow feasible representations of the hydrological system. In a step-wise approach, confronting these realizations with different types of data, we sequentially identified and discarded solutions that were least likely to provide meaningful system representations, thereby gradually narrowing down the feasible parameter space. We first identified and discarded solutions that were least likely to preserve observed seasonal water storage (S tot ) fluc- Spearman rank correlation coefficient σ r Q,mod · σ (r WL,obs ) r Q,mod : ranks of the modelled discharge r WL,obs : ranks of the observed water levels (2) Relative error Euclidian distance over multiple virtual stations Euclidian distance over multiple signatures tuations. To do so, the monthly modelled total water storage (S tot,mod = S i + S u + S f + S s ) relative to the 2004-2009 time-mean baseline in each grid cell was compared to water storage anomalies observed with GRACE where this same time-mean baseline was used (Tang et al., 2017;Fang et al., 2016;Forootan et al., 2019;Khaki and Awange, 2019). The model's skill at reproducing the seasonal water storage, i.e. S tot , was assessed using the Nash-Sutcliffe efficiency E NS,Stot (Eq. 1). Note that E NS,Stot,j was computed at first from the time series of S tot in each grid cell j which were then averaged to obtain E NS,S tot . If no additional data were available, a hypothetic modeller relying on E NS,S tot to calibrate a model may choose only the solution with the highest E NS,S tot or allow for some uncertainty. To mimic this traditional approach and balance it with a sufficient number of feasible solutions to be kept for the subsequent steps, we here identified and discarded the poorest performing 75 % of all solutions in terms of E NS,S tot as unfeasible for the subsequent modelling steps.

Parameter selection based on satellite altimetry data
Next, the remaining feasible parameter sets were used to evaluate their potential to reproduce time series of observed altimetry applying three distinct parameter selection strategies. Assuming again the situation of an ungauged basin (i.e. no time series of river flow available), we kept for each strategy as feasible the respective 1 % best performing parameter sets according to the specific performance metric associated with that strategy.
Altimetry Strategy 1: direct comparison of altimetry data to modelled discharge In the simplest approach, we directly used altimetry data to correlate observed water levels with modelled discharge based on the Spearman rank correlation coefficient (E R,WL ; Spearman, 1904) using Eq. (2) ( Table 4). This strategy, hereafter referred to with subscript WL, i.e. water level, requires the assumption that the relationship between water level and discharge is monotonic. The Spearman rank correlation was applied successfully in previous studies to calibrate a rainfall-runoff model to water-level time series (Seibert and Vis, 2016). As there were multiple virtual stations with water-level data available in this study, the E R,WL was computed at each location simultaneously. The individual values E R,WL were weighted based on the record length of the corresponding virtual stations and then combined into the Euclidean distance as aggregate metric D E,R,WL with Eq. (4).     a 1 , a 2 , a 3 , a 4 , b 1 , b 2 , b 3     In the second strategy, as successfully applied in previous studies (Getirana and Peters-Lidard, 2013;Jian et al., 2017), model parameters were selected based on the models' ability to reproduce water levels by converting the modelled discharge to water levels, assuming these two are related through a rating curve in the form of a power function (Rantz, 1982): where h is the water level, h 0 is a reference water level, and a and b are two additional free calibration parameters determining the shape of the function and lumping combined influences of different river cross-section characteristics such as geometry or roughness. Note that here for each virtual station h 0 is the elevation that corresponds to the water level of the Google Earth image with the lowest flow available, corresponding to the assumption of no-flow at that time. This strategy is hereafter referred to with subscript RC, i.e. rating curve. As river cross sections vary in space, each of the 18 virtual stations would require an individual set of these parameters a and b. To limit the number of additional calibration parameters, we here classified the river cross sections of the 18 virtual stations into four groups (Figs. 1a and 3). For cross sections within each class, i.e. geometrically similar, the same values for a and b were used, resulting in four sets of a and b and thus a total of eight additional calibration parameters. The river cross sections were extracted from global high-resolution terrain data available on Google Earth (see Sect. 2.1.4). The modelled river water levels were evaluated against the observed water levels at each virtual station using the Nash-Sutcliffe efficiency E NS,RC (equivalent to Eq. 1 in Table 4), weighted based on the record length of the corresponding virtual stations and then combined into the Euclidean distance D E,NS,RC as an aggregated performance metric (Eq. 4).

Altimetry Strategy 3: Strickler-Manning equation
As a third strategy, we converted the modelled discharge to river water levels using the Strickler-Manning equation (Manning, 1891): where k is a roughness parameter here treated as a free calibration parameter and assumed constant for all virtual stations, i is the mean channel slope extracted here over a distance of 10 km, and A and R are the river cross-section area and hydraulic radius. Assuming trapezoidal cross sections (see Fig. 4 as an illustrative example), A and R were calculated for each cross section according to where B is the assumed river bed width, i 1 and i 2 are the river bank slopes, d is the water depth, h is the water level, and h 0 is the reference water level, here assumed to be the lowest observed river water level to limit the number of calibration parameters. In contrast to previous studies that use a similar approach but relied on locally observed river cross sections (Michailovsky et al., 2012;Hulsman et al., 2018;Liu et al., 2015), here both the river bed geometries (Fig. 3) at and the channel slopes upstream of the 18 virtual stations were computed using high-resolution terrain data retrieved from Google Earth (see Sect. 2.1.4). Similar data sources were already used in previous studies to extract the river geometry (e.g. Michailovsky et al., 2012;Pramanik et al., 2010;Gichamo et al., 2012). The reader is referred to Table S3 for the values of the variables for each virtual station. This strategy is hereafter referred to with subscript SM, i.e. Strickler-Manning. Equivalent to above, the modelled river water levels were then evaluated against the observed water levels at each virtual station using the Nash-Sutcliffe efficiency E NS,SM (equivalent to Eq. 1), weighted based on the record length of the corresponding virtual stations and then combined into the Euclidean distance D E,NS,SM as an aggregated performance metric (Eq. 4).

Parameter selection based on daily river water level at the basin outlet
For the previous parameter identification strategy (Altimetry Strategy 3), river geometry information was extracted from high-resolution terrain data retrieved from Google Earth, which have a low accuracy. Unfortunately, more accurate cross-section information from in situ surveys was only available at the Great East Road Bridge gauging station, i.e. the basin outlet, where, in turn, no altimetry observations were available. That is why water-level time series were used to illustrate the influence of the cross-section accuracy. As shown in Fig. 5, the Google Earth based above-water cross section at the basin outlet corresponded in general well to the field survey considering that satellite images have limited spatial resolution. However, the in situ measurement also illustrated the relevance of the submerged part of the channel cross section at that location on the day the image was taken (2 June 2008).  . Example of approximating a trapezoidal cross section (black) into the Google Earth based cross-section data (red) for virtual station "VS 4" (see also Fig. 1a and Fig. 3). The reference level is equal to the lowest water level in the river profile.

Water level Strategy 1: river geometry information extracted from Google Earth
First, cross-section information was extracted from global high-resolution terrain data available on Google Earth (subscript GE) and used with the Strickler-Manning equation (Eq. 7) to convert the modelled discharge to water levels. This was combined with GRACE observations to restrict the parameter space in an equivalent way to Sect. 3.3.3. The model performance with respect to river water levels was calculated with the Nash-Sutcliffe efficiency E NS,SM,GE (Eq. 1).

Water level Strategy 2: river geometry information obtained from a detailed field survey
Second, cross-section information obtained from a detailed field survey with an ADCP (subscript ADCP) was used with the Strickler-Manning equation (Eq. 7) to convert the modelled discharge to water levels. This was combined with GRACE observations to restrict the parameter space in an equivalent way to Sect. 3.3.3. The model performance with respect to river water levels was calculated with the Nash-Sutcliffe efficiency E NS,SM,ADCP (Eq. 1).

Model evaluation
For each calibration strategy, the performance of all model realizations was evaluated post-calibration with respect to discharge using seven additional hydrological signatures (e.g. Sawicz et al., 2011;Euser et al., 2013) to assess the skill of the model at reproducing the overall response of the system and thus the robustness of the selected parameters . The signatures included the logarithm of the daily flow time series (hereafter referred to with the subscript log Q), the flow duration curve (FDC), its logarithm (logFDC), the mean seasonal runoff coefficient during dry periods (April-September; RCdry), the mean seasonal runoff coefficient during the wet periods (October-March; RCwet), the autocorrelation function of daily flow (AC) and the rising limb density of the hydrograph (RLD). An overview of these signatures can be found in Table 6, and more detailed explanations in Euser et al. (2013) and references therein. As performance measures for the model to reproduce the individual observed signatures, the Nash-Sutcliffe efficiency (E NS,log Q , E NS,FDC , E NS,log FDC , E NS,AC ; equivalent to Eq. (1) in Table 4) and a metric based on the relative error (E R,RCdry , E R,RCwet , E R,RLD ; equivalent to Eq. 3) were used (Euser et al., 2013). The signatures were combined, with equal weights, into one objective function, which was formulated based on the Euclidian distance D E (Eq. 5) so that a value of 1 indicates a "perfect" model (Schoups et al., 2005).

Parameter selection and model performance
The complete set of all model realizations unsurprisingly resulted in a wide range of model solutions (Fig. 6a), with E NS,Q ranging from −6.4 to 0.78 and with the combined performance metric of all signatures D E ranging from −334 to 0.79 (Fig. 7). With respect to the individual flow signatures, the model performance varied such that the largest range was found in E NS,Q and the smallest in E NS,AC , as visualized in  Table S4. Although containing relatively good solutions, this full set of all realizations clearly also contained many parameter sets that considerably overestimated and/or underestimated flows.

Benchmark: parameter selection based on observed discharge data
For the benchmark case, applying the traditional model calibration approach using discharge data, this parameter selection and calibration strategy resulted in a reasonable model performance, in which the seasonal but also the daily flow dynamics and magnitudes were in general well captured as shown in Fig. 6b. For some years, a number of solutions overestimated flows in the wet season and underestimated flows during the dry season, when the river becomes a small meandering stream with almost annual morphological changes, which is difficult to meaningfully observe. The best performing solution had a calibration objective function of E NS,Q,opt = 0.78 (5/95th percentiles of all feasible solutions E NS,Q,5/95 = 0.61-0.75; Fig. 7 and Table 7). For the post-calibration evaluation of all retained solutions, it was observed that most signatures were well reproduced by the majority of solutions, except for the dry season runoff coefficient (RC dry ; Fig. 7 and Table S4). This resulted in aggregated model performances, combining all signatures, of D E,5/95 = 0.55-0.76, with the above-identified best performing solution (i.e. E NS,Q,opt ) reaching a value of D E,opt = 0.60.

Parameter selection based on the seasonal water storage (GRACE)
Starting from the set of all model realizations (Figs. 6a and 7), and assuming no discharge observations are available, we identified and discarded parameter sets as unfeasible when they did not meet the previously defined criteria to reproduce the seasonal water storage (E NS,S tot ; see Sect. 3.3.2). The range of random model realizations with respect to the total water storage is visualized in Fig. 9. The sub-set of solutions retained as feasible resulted in a significant reduction in the uncertainty around the modelled variables, which is illustrated by the narrower 5/95th percentiles of the solutions compared to the set of all realizations, as shown in Fig. 6c. The feasible solutions with respect to GRACE reached E NS,S tot ,opt = 0.56 (E NS,S tot ,5/95 = 0.45-0.52) (Fig. 7, Table 7). These parameter sets were then used to evaluate the model for the years 2004, 2006, and 2008 used in the benchmark case. While the flow dynamics were captured relatively well, many of the retained solutions considerably overestimated flows across all seasons (Fig. 6c), resulting in a decreased performance with respect to the individual flow signatures; only the dry runoff coefficient (E R,RC dry ) improved significantly compared to the benchmark as shown in Table S4 and Fig. 7. The parame-  Table 7. Summary of the model results: elimination of unfeasible parameter sets and detection of optimal parameter set according to each parameter identification strategy including the corresponding model performance range (E NS,Q , D E ) indicating the model's skill at reproducing the discharge during the benchmark time period. For each strategy, the model efficiency for the optimal parameter set is summarized together with the corresponding performance metrics with respect to discharge (E NS,Q,opt , D E,opt ). For all parameter sets retained as feasible, the maximum (E NS,Q,max , D E,max ) and 5/95th percentiles (E NS,Q,5/95 , D E,5/95 ) of all performance metrics with respect to discharge are summarized. Data sources used for the parameter set selection: (1) all parameter sets (no data), (2) discharge, (3) GRACE, (4) altimetry combined with GRACE (Altimetry Strategy 1), (5) altimetry data using rating curves combined with GRACE (Altimetry Strategy 2), (6) altimetry data using the Strickler-Manning equation combined with GRACE (Altimetry Strategy 3), and (7)  ter set associated with the best performing model with respect to GRACE (E NS,S tot ,opt ) resulted for the benchmark period in E NS,Q = −1.4 (E NS,Q,5/95 = −2.3-0.38) and the corresponding D E,opt = −0.18 (D E,5/95 = −0.58-0.62) with respect to discharge (Fig. 7, Table 7). As illustrated in Figs. 7 and 6c, many parameter sets that resulted in implausible representations of the seasonal signals were eliminated. However, as also indicated by the rather modest values of E NS,Q and D E with respect to discharge, the data source used here obviously contained only limited information to avoid the overpredictions of flow during all wet seasons. The sequence of applying first GRACE and then altimetry, or the reverse, did not affect the identification of feasible parameter sets when using altimetry data as shown in Fig. S8. However, it did affect the selection of the "best" parameter set.

Parameter selection based on satellite altimetry data
Altimetry Strategy 1: directly compare altimetry data to modelled discharge The first approach, Altimetry Strategy 1, resulted in an overestimation of in particular intermediate and low flows as shown in Fig. 6d. The feasible solutions reached an optimum of D E,R,WL,opt = 0.76 (D E,R,WL,5/95 = 0.74-0.75) with respect to altimetry observations. Focusing on the model's skill at reproducing the observed discharge using these feasible parameter sets for the benchmark period, the parameter set associated with the best performing model with respect to altimetry (D E,R,WL,opt ) resulted in a E NS,Q = 0.65 (E NS,Q,5/95 = −2.9-0.10) and D E = 0.63 (D E,5/95 = −0.83-0.50) with respect to discharge (Fig. 7, Table 7). Hence, the parameter set with the highest model performance with respect to altimetry did not perform best with respect to discharge as shown in Table 7 and Fig. S7. While the optimum model performance with respect to discharge was similar to the benchmark, the very wide range in the 5/95th percentiles of the solutions indicated that this strategy has only limited potential to identify implausible parameter sets. This was also the case with respect to the individual flow signatures as shown in Fig. 7 and Table S4.

Altimetry Strategy 2: rating curves
The second approach, Altimetry Strategy 2, also resulted in an overestimation of the flows (Fig. 8e). The feasible solutions reached an optimum of D E,NS,RC,opt = −0.50 (D E,NS,RC,5/95 = −1.0 to −0.77) with respect to altimetry observations. As an example, Fig. S6A in the Supplement visualizes the simulated and observed river water levels at Virtual Station 4 ( Fig. 1), where the model significantly underestimated the stream levels. Focusing on the model's skill at reproducing the discharge using these parameter sets for the benchmark period, the parameter set associated with the best performing model with respect to altimetry (D E,NS,RC,opt ) resulted in E NS,Q = −0.31 (E NS,Q,5/95 = −2.6-0.25) and D E = 0.27 (D E,5/95 = −0.72-0.56) with respect to discharge (Fig. 7, Table 7). Hence, similarly to Altimetry Strategy 1, the best parameter set with respect to altimetry did not perform best with respect to discharge (see Table 7 and Fig. S7). The optimum model performance with respect to discharge was worse compared to the benchmark, and the wide range in the 5/95th percentiles of the solutions indicated this strategy poorly identified the feasible parameter sets. This was also the case with respect to the individual flow signatures as shown in Fig. 7 and Table S4. Only the dry runoff coefficient (E R,RC dry ) improved significantly compared to the benchmark.

Altimetry Strategy 3: Strickler-Manning equation
The third approach, Altimetry Strategy 3, resulted in improved flow predictions compared to the other two strategies using altimetry data (Fig. 8f). Even though the feasible solutions exhibited a very poor ability to reproduce the altimetry data, with an optimum of D E,NS,SM,opt = −1.4 (D E,NS,SM,5/95 = −3.8 to −1.8), the model's skill at reproducing the discharge for the benchmark period using these parameter sets significantly increased compared to the two alternative strategies. As an example, Fig. S6b visualizes the simulated and observed river water levels at Virtual Station 4 (Fig. 1), where the model simulated the stream levels relatively well. The parameter set associated with the best performing model with respect to altimetry (D E,NS,SM,opt ) resulted in E NS,Q = 0.60 (E NS,Q,5/95 = −0.31-0.50) and D E = 0.71 (D E,5/95 = 0.36-0.67) with respect to discharge (Fig. 7, Table 7). While the optimum model performance with respect to discharge was worse compared to the benchmark, the 5/95th percentiles of the solutions were significantly constrained by the removal of many implausible parameter sets. This was valid for the performance with respect to the individual flow signatures (E NS,θ and E R,θ ) and overall flow response (D E ) as shown in Fig. 7 and Table S4. This indicated that, although the model performance with respect to altimetry observations was low, this strategy contained valuable information to considerably constrain the feasible solution space.

Parameter selection based on daily river water level at the basin outlet
Water level Strategy 1: river geometry information extracted from Google Earth The parameter identification strategy "Water level Strategy 1", using cross-section information extracted from Google Earth resulted in a poor simulation of the river water level (Fig. 10a), with an optimal objective function value with respect to river water levels of E NS,SM,GE,opt = −1.8 (E NS,SM,GE,5/95 = −6.8 to −3.1). Focusing on the model's skill at reproducing the discharge using these feasible parameter sets for the benchmark period, the parameter set associated with the best performing model with respect to river water levels (E NS,SM,GE,opt ) resulted in E NS,Q = 0.65 (E NS,Q,5/95 = −0.48-0.60) and D E = 0.77 (D E,5/95 = 0.28-0.70) with respect to discharge (Fig. 7, Table 7). The model performances with respect to the remaining signatures as visualized in Fig. 7 are tabulated in Table S4. As shown in Fig. 8g, the discharge was overestimated in particular during intermediate and low flows. Water level Strategy 2: river geometry information obtained from a detailed field survey The parameter identification strategy "Water level Strategy 2", using cross-section information obtained from a detailed field survey, resulted in improved river water-level simulations (compare Fig. 10a and b) with an optimal ob-jective function value with respect to river water levels of E NS,SM,ADCP,opt = 0.79 (E NS,SM,ADCP,5/95 = 0.60-0.74). The parameter set associated with the best performing model with respect to river water levels (E NS,SM,ADCP,opt ) resulted in E NS,Q = 0.14 (E NS,Q,5/95 = −1.1-0.50) and in D E = 0.55 (D E,5/95 = 0.03-0.67) with respect to discharge (Fig. 7, Ta- Figure 9. Range of random model realizations with respect to the total water storage (grey) including the observation according to GRACE (black). ble 7). The model performances with respect to the remaining signatures as visualized in Fig. 7 are tabulated in Table S4.
Compared to using river geometry information extracted from Google Earth (Water level Strategy 1), the overall model performance with respect to discharge did not increase since the parameter space was already restricted using GRACE data. However, the modelled flow duration curve during intermediate and low flows (compare Fig. 8g with h) and rating curve (Fig. 11) improved significantly when using more accurate geometry information obtained from a detailed field survey covering the cross section that is submerged most of the year, which is thus unlikely to be captured by satellite-based observations. Note that the in situ cross-section information was limited to the submerged part during the time of measurement. The remaining part (water levels > 5 m) was extrapolated, which is likely to explain the larger discrepancies during high flows visible in the flow duration curve (Fig. 8h).

Number of virtual stations used for model calibration and evaluation
In this study, altimetry data were available at 18 virtual stations. However, would the model performance change if more or less virtual stations were used? To answer this question, n random stations were selected for model calibration, while the remaining stations were used for cross-validation (Klemeš, 1986;Gharari et al., 2013;Garavaglia et al., 2017). This was repeated to cover all combinations of n stations and for n = 1, 2. . .17. When applying Altimetry Strategy 3 us- Figure 11. Discharge-water-level graphs for the recorded (black) and modelled discharge and stream levels with the optimal model performance (E NS ) using the Strickler-Manning equation for the discharge-stream-level conversion with cross-section information (a) extracted from Google Earth (Water level Strategy 1) or (b) obtained from a detailed field survey with an ADCP (Water level Strategy 2).
ing altimetry data with the Strickler-Manning equation, this analysis revealed that when increasing the number of calibration stations, the model calibration performance D E,NS,SM gradually decreased, but the ability to meaningfully reproduce the remaining observations which were not used for calibration increased significantly (Fig. 12). Similar results were obtained for Strategies 1 and 2 (compare Fig. 12 with Supplementary Figs. S3 and S4). Also, the model performance with respect to discharge increased when using more virtual stations with an optimum at 7-15 stations depending on the calibration strategy (Fig. S5). This provides evidence that in spite of reduced calibration performance, the simultaneous use of multiple virtual stations can contribute towards more plausible selections of model parameter sets and thus increase the model realism.

Uncertainties and limitations
In the absence of discharge data for hydrological model calibration, as is commonly the case in poorly gauged or ungauged regions, freely and globally available remotely sensed streamwater levels could provide the opportunity to fill this gap, as illustrated in this study as well as in previous studies (e.g. Michailovsky and Bauer-Gottwein, 2014;Pereira-Cardenal et al., 2011;Sun et al., 2012). However, there are several limitations to the approach proposed in this study using altimetry for model calibration.
First, river altimetry data are prone to large uncertainties which increase for smaller river widths as a result of backscatter effects of the surrounding topography (Sulistioadi et al., 2015;Biancamaria et al., 2017;Domeneghetti et al., 2015). Too small rivers could even be missed altogether. In this study, the Luangwa River becomes a small meandering stream in the dry season, resulting in larger altimetry uncertainties. Unfortunately, this uncertainty could not be estimated for the virtual stations used in this study due to data limitations. However, in previous studies in the Zambezi basin, the RMSE relative to in situ stream levels ranged between 0.32 and 0.72 m using Envisat (Michailovsky et al., 2012). Improving altimetry observations such that the uncertainties decrease would improve the identification of feasible parameter sets and simulation of stream levels and flow. However, comparison results between the three altimetrybased calibration strategies are not expected to change since the same altimetry data were used. In other words, Altimetry Strategy 3 is still expected to perform best when decreasing the uncertainties in the altimetry observations.
Second, large uncertainties in the forcing data (precipitation and temperature) with respect to the spatial-temporal variations should not be ignored. This could compromise comparison results between modelled river water levels and altimetry within the basin since it has a low temporal resolution (10 or 35 d). Bias in the precipitation data affects storage calculations and hence the identification of feasible parameter sets based on GRACE (Le Coz and van de Giesen, 2020). This could explain why the flows were frequently overestimated when using GRACE only. In addition, precipitation bias could be compensated through calibration parameters introduced for the discharge-water-level conversion. Therefore, such parameters should be constrained as much as possible. There are also data uncertainties in the cross sections and river gradients extracted from high-resolution terrain data available on Google Earth due to its limited spatial resolution, but more importantly since no information is available below the water surface.
Further, GRACE observations are prone to uncertainties as a result of data (post-)processing, including for example data smoothing (Landerer and Swenson, 2012;Blazquez et al., 2018;Riegger et al., 2012) causing leakage between neighbouring cells of 1 • (≈ 111 km), which are thus not completely independent of each other. Additionally, GRACE observations are more accurate for large areas. Depending on the applied processing scheme, the error is about 2 cm for basins with an area of around 63 000 km 2 (Landerer and Swenson, 2012;Vishwakarma et al., 2018). Also note that due to the coarse temporal resolution, monthly averaged GRACE observations are dominated by slowly changing processes such as the groundwater, soil moisture system, and seasonal variations reflected in all storage components. In addition, open water bodies or wetlands could affect GRACE observations if they are located in or near the basin, for example within a radius of about 300 km, which is the distance often used for data smoothing. In this study, several open water bodies or wetlands were located ≤ 300 km of the Luangwa basin, such as Lake Malawi, Kafue Flats, Cahora Bassa reservoir, Kariba reservoir, Bangweulu, and Tanganyika. These open water bodies and wetlands had a limited impact on the GRACE observations due to limited fluctuations or different temporal variation as illustrated in Fig. 13 for the Cahora Bassa reservoir. These uncertainties in the GRACE observations could influence the identification of plausible parame-Hydrol. Earth Syst. Sci., 24, 3331-3359, 2020 https://doi.org/10.5194/hess-24-3331-2020  ter sets. For example, feasible parameter sets could be discarded incorrectly, which could distort results obtained by calibrating with respect to altimetry and GRACE simultaneously. However, the comparison between the three altimetrybased calibration strategies is not expected to change since the same GRACE data were used. In other words, Altimetry Strategy 3 is still expected to perform best when considering these uncertainties. Uncertainties were not only introduced by the data, but also as a result of assumptions and simplifications. First, the reference level h 0 was assumed to be equal to the lowest river water level observed to limit the number of calibration parameters (Altimetry Strategies 2 and 3, Water level Strategies 1 and 2). However, uncertainties in the altimetry observations as explained previously influence h 0 estimates, which results in a bias between the observed and simulated stream levels in Altimetry Strategies 2 and 3. Second, the roughness was assumed to be constant in time, over the entire cross section, and for all virtual stations throughout the basin (Altimetry Strategy 3). However, this roughness can vary between 15 and 50 m 1/3 s −1 for natural rivers (Vatanchi and Maghrebi, 2019;Chow, 1959), changing the simulated stream levels be-tween 42 % and 75 % in the Luangwa basin, with the low flows being the most sensitive. Third, all 18 virtual stations were grouped based on their cross-section similarity to limit the number of calibration parameters (Altimetry Strategy 2), but differences within each group remain such that the calibration parameters related to the rating curve vary slightly for each virtual station within a group. Fourth, the assumption of a constant flow velocity in space and time affects the timing of the simulated flow and stream levels, influencing the comparison between model results and altimetry observations (all strategies).
Another limitation is the missing information on absolute flow amounts when directly using (satellite-based) river water levels for model calibration using the Spearman rank correlations as a model performance metric (Altimetry Strategy 1; Seibert and Vis, 2016). This resulted here in an overestimation of intermediate and low flows due to the nonlinear relation between stream levels and flows. In contrast, when converting the discharge to streamwater levels, information on absolute flow amounts was included at the cost of introducing additional calibration parameters (Altimetry Strategies 2 and 3), thereby increasing the degrees of freedom and thus the potential for parameter equifinality in the model (Beven, 2006;Sikorska and Renard, 2017;Sun et al., 2012).
Furthermore, it was assumed that the Nash-Sutcliffe efficiency contained sufficient valuable information to describe the model performance with respect to river water level and total water storage when identifying feasible parameter sets. This performance measure is sensitive to the sample size, outliers, bias, and time offset (McCuen Richard et al., 2006). Unfortunately, simulated discharge and stream levels are prone to bias uncertainties as a result of spatio-temporal bias in the rainfall (Le Coz and van de Giesen, 2020). In addition, altimetry observations have a limited sample size for several virtual stations (see Table 2) and are prone to bias due to uncertainties in the reference level h 0 as mentioned before. Moreover, a time offset in the simulated flow can occur as a result of rainfall uncertainties. As a comparison, the model performance with respect to altimetry only reached up to D E,NS,SM = −1.3 for Altimetry Strategy 3, while it reached up to E NS,SM,GE = 0.61 with respect to daily in situ stream levels for Water level Strategy 1. Therefore, additional study is recommended to confirm this assumption and to assess which performance metric(s) would be most suitable. The model performance with respect to discharge was evaluated with respect to multiple hydrological signatures simultaneously (see Table 6) to assess the model's skill at reproducing the internal dynamics of the system. Even though a few of these signatures have some overlapping information content (McMillan et al., 2017), each signature also contains at least some additional information not included in the other signatures. In general, the ambition is to represent a hydrological system as well as possible in a model, which critically required that the model exhibit sufficient ability to simultaneously reproduce multiple flow signatures (Gupta et al., 2008;Euser et al., 2013;Hrachowitz et al., 2014).

Comparison with previous studies
Previous studies have successfully used river altimetry data to calibrate and evaluate rainfall-runoff models using a few virtual stations (Sun et al., 2012;Getirana, 2010;Getirana et al., 2010;Liu et al., 2015). In these studies, the modelled discharge was converted to stream levels by means of a hydraulic model or empirical relations. Our results support several previous findings and added a number of new ones.
Similar to previous studies, the rainfall-runoff model reproduced river flow relatively well when calibrating on remotely sensed streamwater levels, preferably at several virtual stations simultaneously, but discharge-based calibration results performed significantly better (Getirana, 2010). Thus, while river altimetry data cannot fully substitute discharge observations, they at least provide an alternative data source that holds information value where no reliable discharge data are available. In addition, our results suggest that in spite of the typically limited temporal resolution of altimetry observations, these data, when using multiple virtual stations simultaneously, provide enough information to select meaningful model parameter sets (Seibert and Beven, 2009;Getirana, 2010).
Strikingly, only limited studies combined altimetry with GRACE observations in the calibration procedure (Kittel et al., 2018). As altimetry observations only describe waterlevel variations with no information on the flow amounts, GRACE provides additional valuable information to constrain the river discharge by improving the rainfall-runoff partitioning as demonstrated in previous studies (Rakovec et al., 2016;Bai et al., 2018;Dembélé et al., 2020). Combining both data sources in the calibration procedure allowed for a more accurate identification of feasible parameter sets. The model performance range with respect to discharge improved from D E,5/95 = −8.4-0.77 when using only altimetry to D E,5/95 = 0.19-0.75 when combining GRACE and altimetry for Altimetry Strategy 3 (see Fig. S8).
In contrast to previous studies, altimetry data originated from five different satellite missions rather than a single one. As a result, altimetry data were available at 18 locations for the time period 2002 to 2016. This gave the opportunity to analyse the effect of combining different numbers of stations for calibration and evaluation. This study illustrated that better predictions can be achieved when using more virtual stations for calibration. Furthermore, this study demonstrated that in particular the combination of altimetry with information on river geometry (cross section, gradient) proved beneficial for the selection of feasible parameter sets within relatively narrow bounds, comparable to the benchmark using discharge. Using more accurate cross-section information obtained from a detailed field survey rather than Google Earth based estimates improved the water-level simulations, modelled rating curve, and discharge simulations during intermediate and low flows significantly for which onsite cross-section data were available. That is why it is recommended to acquire accurate cross-section information on locations concurring with altimetry overpasses (not done in this study).

Opportunities for future studies
For future studies, it will be interesting to improve Altimetry Strategy 3 using additional data sources. For instance, the combination of altimetry observations with river width estimates derived from Landsat or Sentinel-1/2 (Pekel et al., 2016;Hou et al., 2018) may bear some potential as the combination of the two different hydraulic variables complements each other and increases the temporal sampling (Huang et al., 2018;Tarpanelli et al., 2017;Sichangi et al., 2016). During high flows for example, river width estimates can be more accurate than altimetry observations, especially when floodplains are inundated and small waterlevel changes cause large river width changes. Alternatively, the altimetry observations used here could be combined with river surface water-level slope estimates based on CryoSat observations which provide water-level information at lower temporal resolution (every 369 d) but higher spatial resolution (equatorial inter-track distance of 7.5 km) (Schneider et al., 2017;Jiang et al., 2017). This allows for the estimation of the energy gradient based on stream levels as required in the Strickler-Manning equation, instead of the bed slope based on topography, which proved to be a good first estimate in the absence of more reliable data. In addition, CryoSat observations are available annually such that there can be more overlap with altimetry observations in contrast to topography data. With the upcoming SWOT (Surface Water Ocean Topography) mission, more accurate altimetry observations should be available as well as river slope observations and width. The repeat cycle will be 21 d and across-track resolution between 10 and 60 m, increasing the number of obser-vation points available within a specific area (Biancamaria et al., 2016;Langhorst et al., 2019;Oubanas et al., 2018). As a result, hydrological models can be calibrated with respect to river altimetry and width simultaneously at multiple locations even for small river basins, improving the identification of plausible parameter sets and hence the model realism as illustrated in Sect. 4.2. It will also be very valuable to improve cross-section estimates with respect to the submerged part of the cross section as already explored in previous studies (Domeneghetti, 2016) or to use drone observations to obtain more accurate cross-section information and estimates of the river slope and roughness (Entwistle and Heritage, 2019). By improving the river profile description, the simulated stream levels become more accurate, which is crucial when using this time series for model calibration.
As illustrated with Water level Strategies 1 and 2, improving the cross section resulted in a more accurate rating curve (Fig. 11), stream-level simulation (see Fig. 10), and discharge simulation (Fig. 8). Clearly, it will be interesting to analyse and disentangle different individual sources of uncertainty related to the discharge-water-level conversion from the hydrological model in a more data-rich region (Renard et al., 2010). Unfortunately, this was not possible in this study due to the scarcely available in situ observations in Luangwa. As concluded by Renard et al. (2010), reliable estimates of the data uncertainty are required to disaggregate multiple sources of uncertainty in rainfall-runoff modelling successfully.

Summary and conclusion
This study investigated the potential value of river altimetry observations from multiple satellite missions to identify feasible parameters for a hydrological model of the semiarid and poorly gauged Luangwa River basin. A distributed process-based rainfall-runoff model with sub-grid process heterogeneity was developed on a daily timescale for the time period 2002 to 2016. Various parameter identification strategies were implemented step-wise to assess the potential of satellite altimetry data for model calibration. As a benchmark, when identifying parameter sets with the traditional model calibration strategy using discharge data, the model was able to simulate the flows relatively well (E NS,Q = 0.78, E NS,Q,5/95 = 0.61-0.75). When assuming no discharge observations are available, the feasible parameter sets were restricted with GRACE data only resulting in an optimum of E NS,Q = −1.4 (E NS,Q,5/95 = −2.3-0.38) with respect to discharge. Combining GRACE with altimetry data only from 18 virtual stations focusing on the water-level dynamics resulted in frequently overestimated flows and poorly identified feasible parameter sets (Altimetry Strategy 1, E NS,Q,5/95 = −2.9-0.10). This was also the case when converting modelled discharge to water levels using rating curves (Altimetry Strategy 2, E NS,Q,5/95 = −2.6-0.25). The identification of the feasible parameter sets improved when including river geometry information, more specifically cross section and river gradient extracted from Google Earth, in the dischargewater-level conversion using the Strickler-Manning equation (Altimetry Strategy 3, E NS,Q = 0.60, E NS,Q,5/95 = −0.31-0.50). Moreover, it was shown that more accurate crosssection data improved the water-level simulations, modelled rating curve and discharge simulations during intermediate and low flows for which on-site cross-section information was available. The Nash-Sutcliffe efficiency with respect to river water levels increased from E NS,SM,GE = −1.8 (E NS,SM,GE,5/95 = −6.8 to −3.1) using river geometry information extracted from Google Earth (Water level Strategy 1) to E NS,SM,ADCP = 0.79 (E NS,SM,ADCP,5/95 = 0.60-0.74) using river geometry information obtained from a detailed field survey (Water level Strategy 2). The model performance also improved when increasing the number of virtual stations used for parameter selection. Therefore, in the absence of reliable discharge data as commonly the case in poorly or ungauged basins, altimetry data from multiple virtual stations combined with GRACE observations have the potential to fill this gap if combined with river geometry estimates.
Data availability. Local hydro-meteorological data were provided by WARMA (Water Resources Management Authority in Zambia), ZMD (Zambia Meteorological Department), GRDC (Global Runoff Data Centre), and NOAA (National Oceanic and Atmospheric Administration). Remotely sensed river water levels were obtained from DAHITI, HydroSat, EARPS, and LEGOS.
Author contributions. PH and MH designed the experiment. PH did the analysis. PH and MH wrote the first draft. All the authors discussed the results and contributed to writing the final manuscript.
Competing interests. The authors declare that they have no conflict of interest.
Review statement. This paper was edited by Daniel Viviroli and reviewed by Wenchao Sun and two anonymous referees.