Learning from satellite observations: increased understanding of catchment processes through stepwise model improvement

Abstract. Satellite observations can provide valuable information for a better understanding of hydrological processes and thus serve as valuable tools for model structure development and improvement. While model calibration and evaluation have in recent years started to make increasing use of spatial, mostly remotely sensed information, model structural development largely remains to rely on discharge observations at basin outlets only. Due to the ill-posed inverse nature and the related equifinality issues in the modelling process, this frequently results in poor representations of the spatio-temporal heterogeneity of system-internal processes, in particular for large river basins. The objective of this study is thus to explore the value of remotely sensed, gridded data to improve our understanding of the processes underlying this heterogeneity and, as a
consequence, their quantitative representation in models through a stepwise adaptation of model structures and parameters. For this purpose, a
distributed, process-based hydrological model was developed for the study
region, the poorly gauged Luangwa River basin. As a first step, this
benchmark model was calibrated to discharge data only and, in a post-calibration evaluation procedure, tested for its ability to simultaneously reproduce (1) the basin-average temporal dynamics of
remotely sensed evaporation and total water storage anomalies and (2) their temporally averaged spatial patterns. This allowed for the diagnosis of model structural deficiencies in reproducing these temporal dynamics and spatial patterns. Subsequently, the model structure was adapted in a stepwise procedure, testing five additional alternative process hypotheses that could potentially better describe the observed dynamics and pattern. These included, on the one hand, the addition and testing of alternative
formulations of groundwater upwelling into wetlands as a function of the water storage and, on the other hand, alternative spatial discretizations of the groundwater reservoir. Similar to the benchmark, each alternative model hypothesis was, in a next step, calibrated to discharge only and tested against its ability to reproduce the observed spatio-temporal pattern in evaporation and water storage anomalies. In a final step, all models were re-calibrated to discharge, evaporation and water storage anomalies
simultaneously. The results indicated that (1) the benchmark model (Model A) could reproduce the time series of observed discharge, basin-average evaporation and total water storage reasonably well. In contrast, it poorly represented time series of evaporation in wetland-dominated areas as well as the spatial pattern of evaporation and total water storage. (2) Stepwise adjustment of the model structure (Models B–F) suggested that Model F, allowing for upwelling groundwater from a distributed representation of the groundwater reservoir and (3) simultaneously calibrating the model with respect to multiple variables, i.e. discharge, evaporation and total water storage anomalies, provided the best representation of all these variables with respect to their temporal dynamics and spatial patterns, except for the basin-average temporal dynamics in the total water storage anomalies. It was shown that satellite-based evaporation and total water storage anomaly data are not only valuable for multi-criteria calibration, but can also play an important role in improving our understanding of hydrological processes through the diagnosis of model deficiencies and stepwise model structural improvement.


Abstract. Satellite observations can provide valuable information for a better understanding of hydrological processes and thus serve as valuable tools for model structure development and improvement. While model calibration and evaluation have in recent years started to make increasing use of spatial, mostly remotely sensed information, model structural development largely remains to rely on discharge observations at basin outlets only. Due to the ill-posed inverse nature and the related equifinality issues in the modelling process, this frequently results in poor representations of the spatio-temporal heterogeneity of system-internal processes, in particular for large river basins. The objective of this study is thus to explore the value of remotely sensed, gridded data to improve our understanding of the processes underlying this heterogeneity and, as a consequence, their quantitative representation in models through a stepwise adaptation of model structures and parameters. For this purpose, a distributed, process-based hydrological model was developed for the study region, the poorly gauged Luangwa River basin. As a first step, this benchmark model was calibrated to discharge data only and, in a post-calibration evaluation procedure, tested for its ability to simultaneously reproduce (1) the basin-average temporal dynamics of remotely sensed evaporation and total water storage anomalies and (2) their temporally averaged spatial patterns. This allowed for the diagnosis of model structural deficiencies in reproducing these temporal dynamics and spatial patterns. Subsequently, the model structure was adapted in a stepwise procedure, testing five additional alternative process hypotheses that could potentially better describe the observed dynamics and pattern. These included, on the one hand, the addition and testing of alternative formulations of groundwater upwelling into wet-lands as a function of the water storage and, on the other hand, alternative spatial discretizations of the groundwater reservoir. Similar to the benchmark, each alternative model hypothesis was, in a next step, calibrated to discharge only and tested against its ability to reproduce the observed spatiotemporal pattern in evaporation and water storage anomalies. In a final step, all models were re-calibrated to discharge, evaporation and water storage anomalies simultaneously. The results indicated that (1) the benchmark model (Model A) could reproduce the time series of observed discharge, basin-average evaporation and total water storage reasonably well. In contrast, it poorly represented time series of evaporation in wetland-dominated areas as well as the spatial pattern of evaporation and total water storage.
(2) Stepwise adjustment of the model structure (Models B-F) suggested that Model F, allowing for upwelling groundwater from a distributed representation of the groundwater reservoir and (3) simultaneously calibrating the model with respect to multiple variables, i.e. discharge, evaporation and total water storage anomalies, provided the best representation of all these variables with respect to their temporal dynamics and spatial patterns, except for the basin-average temporal dynamics in the total water storage anomalies. It was shown that satellite-based evaporation and total water storage anomaly data are not only valuable for multi-criteria calibration, but can also play an important role in improving our understanding of hydrological processes through the diagnosis of model deficiencies and stepwise model structural improvement.

Introduction
Traditionally, discharge observations at basin outlets are used for hydrological model development and calibration, which can be a robust strategy in small watersheds with relatively uniform characteristics such as topography and land cover but not for larger, heterogeneous basins (Blöschl and Sivapalan, 1995;Daggupati et al., 2015). As a result, temporal dynamics of discharge may be well reproduced. This, however, does not ensure that the spatial patterns and temporal dynamics of model internal storage and flux variables provide a meaningful representation of their real pattern and dynamics (Beven, 2006b;Kirchner, 2006;Clark et al., 2008;Gupta et al., 2008;Hrachowitz et al., 2014;Garavaglia et al., 2017). Especially in large, poorly gauged basins, this traditional model calibration and testing method is likely to result in a poor representation of spatial variability (Daggupati et al., 2015) due to equifinality and the related boundary flux problem (Beven, 2006b).
An increasing number of satellite-based observations have become available over the last decade, giving us insight into a wide range of hydrology-relevant variables, including precipitation, total water storage anomalies, evaporation, surface soil moisture or river width (Xu et al., 2014;Jiang and Wang, 2019). These data are increasingly used as model forcing or for parameter selection and model calibration (e.g. Li et al., 2015;Mazzoleni et al., 2019;Tang et al., 2019).
Other studies simultaneously calibrated hydrological models with respect to multiple remotely sensed variables but only exploiting basin-average time series, without consideration for spatial patterns (e.g. Milzow et al., 2011;López et al., 2017;Kittel et al., 2018;Nijzink et al., 2018). On the other hand, some studies exclusively calibrated models to spatial patterns of observed variables (Stisen et al., 2011;Koch et al., 2016;Mendiguren et al., 2017;Demirel et al., 2018;Zink et al., 2018). As most satellite-based observations such as evaporation are not measured directly but are themselves a result of underlying models using satellite data as input (Xu et al., 2014), more focus has been recently placed on calibration to the relative spatial variability instead of using absolute magnitudes (Stisen et al., 2011;van Dijk and Renzullo, 2011;Dembélé et al., 2020).
To fully exploit the information content of satellite-based observations, simultaneous model calibration on both tem-poral dynamics and spatial patterns of multiple variables has the potential to improve the representation of spatio-temporal variability and, linked to that, their underlying model internal processes and therefore the model realism (Rientjes et al., 2013;Rakovec et al., 2016;Herman et al., 2018). Strikingly, only a few studies so far have used satellite-based observations to calibrate with respect to the temporal and spatial variation simultaneously (Rajib et al., 2018;Dembélé et al., 2020).
In general, most studies that have made use of remotely sensed data for model applications have exclusively addressed the problem of parameter selection and thus model calibration. However, as models are always abstract and simplified representations of reality, every model structure needs to be understood as a hypothesis to be tested (Clark et al., 2011;Fenicia et al., 2011;Hrachowitz and Clark, 2017). Yet, most studies on model structural improvement have so far only relied on spatially aggregated variables (Fenicia et al., 2008;Kavetski and Fenicia, 2011;Hrachowitz et al., 2014;Nijzink et al., 2016), while spatial data remain rarely used for that purpose (e.g. Fenicia et al., 2016;Roy et al., 2017).
The overall objective of this paper is therefore to explore the simultaneous use of spatial patterns and temporal dynamics of satellite-based evaporation and total water storage observations for a stepwise structural improvement and calibration of hydrological models for a large river system in a semi-arid, data-scarce region. More specifically, we tested the research hypotheses that (1) spatial patterns and temporal dynamics in satellite-based evaporation and water storage anomaly data contain relevant information to diagnose and to iteratively improve on model structural deficiencies and that (2) these data, when simultaneously used with discharge data for calibration, do contain sufficient information for a more robust parameter selection.

Site description
The Luangwa River in Zambia is a large, mostly unregulated tributary of the Zambezi with a length of about 770 km (Fig. 1). This poorly gauged river basin has an area of 159 000 km 2 , which is mostly covered with deciduous forest, shrubs and savanna and where an elevation difference of up to 1850 m can be found between the highlands and lowlands along the river (The World Bank, 2010;Hulsman et al., 2020). In this semi-arid basin, the mean annual evaporation (1555 mm yr −1 ) exceeds the mean annual precipitation (970 mm yr −1 ).
The Luangwa River flows into the Zambezi upstream of the Cahora Bassa dam, which is used for hydropower production and flood and drought protection. The operation of this dam is very difficult since there is only limited information available from the poorly gauged upstream tributaries (SADC, 2008;Schleiss and Matos, 2016). As a result, the lo- cal population has in the past suffered from severe floods and droughts (ZAMCOM et al., 2015;Schumann et al., 2016).

In situ discharge observations
Historical daily in situ discharge data were available from the Zambian Water Resources Management Authority at the Great East Road Bridge gauging station, located at 30 • 13 E and 14 • 58 S (Fig. 1), for the time period 2004 to 2016, yet containing considerable gaps, resulting in a temporal coverage of 31 %.

Spatially gridded observation
Spatially gridded data were used for a topography-based landscape classification into hydrological response units (HRUs; Savenije, 2010), as model forcing (precipitation and temperature) and for parameter selection (evaporation and total water storage; see Table 1).
More specifically, topography was extracted from GMTED with a spatial resolution of 0.002 • (Danielson and Gesch, 2011). Daily precipitation data were extracted from CHIRPS (Climate Hazards Group InfraRed Precipitation with Station) with a spatial resolution of 0.05 • . Monthly temperature data extracted from CRU at a spatial resolution of 1 • were used to estimate the potential evaporation, applying the Hargreaves method (Hargreaves and Samani, 1985;Hargreaves and Allen, 2003). These monthly observations were interpolated to a daily timescale using daily averaged in situ temperature measured at two locations with the coordinates 28 • 30 E, 14 • 24 S and 32 • 35 E, 13 • 33 S. The satellite-based total evaporation data were extracted from WaPOR (Water Productivity Open-access portal; FAO, 2018) version 1.1 as it proved to perform well in African river basins (Weerasinghe et al., 2020). This product was available on 10 d temporal and 250 m spatial resolution. Satellite-based observations on the total water storage anomalies were extracted from the Gravity Recovery and Climate Experiment (GRACE). With two identical GRACE satellites, the variations in the Earth's gravity field were measured to detect regional mass changes, which are dominated by variations in the terrestrial water storage after having accounted for atmospheric and oceanic effects (Landerer and Swenson, 2012;Swenson, 2012). In this study, the long-term bias between the discharge, evaporation (WaPOR) and total water storage anomalies (GRACE) was corrected by multiplying the evaporation by a correction factor of 1.08 to close the long-term water balance.
The gridded information provided for the precipitation, temperature and evaporation was rescaled to the model resolution of 0.1 • . If the resolution of the satellite product was higher than 0.1 • , then the mean of all cells located within each model cell was used. Otherwise, each cell of the satellite product was divided into multiple cells, such that the model resolution was obtained, retaining the original value. In contrast, the modelled total water storage was rescaled to 1 • , the resolution of the GRACE data set, by taking the mean.

Modelling approach
A previously developed and tested (Hulsman et al., 2020) distributed, process-based hydrological model was implemented for the Luangwa basin; see Sect. 3.1 for more information. This benchmark model (Model A) was calibrated with respect to discharge for the time period 2002-2012 and validated for the time period 2012-2016 with respect to discharge, evaporation and total water storage anomalies. Then, the model was calibrated with respect to all above variables, hence discharge, evaporation and total water storage anomalies simultaneously, for the time period 2002-2012 and val- idated with respect to the same variables for the time period 2012-2016. Model deficiencies were then diagnosed for this benchmark model (Model A) based on the results of both calibration strategies.
Next, model structure changes were applied creating Models B-D to improve the deficiencies found in Model A. These changes concerned the groundwater upwelling into the unsaturated zone as explained in Sect. 4.2. The same calibration and validation strategies as applied to Model A were applied to Models B-D. Model improvements were evaluated, and further deficiencies were diagnosed for these models based on the calibration and validation results.
To improve the deficiencies diagnosed in Models B-D, further model structural changes, i.e. increased levels of spatial discretization of the saturated zone as explained in Sect. 4.3, resulted in Models E and F. Similar to the previous models, the same calibration and validation strategies were applied to Models E and F, and model improvements and deficiencies were diagnosed based on the calibration and validation model performances.
The calculation of the model performance with respect to discharge, evaporation and total water storage is explained in Sect. 3.2. The calibration and validation procedures are described in Sect. 3.3 and 3.4.

Benchmark model (Model A)
This model is a process-based hydrological model developed in a previous study by Hulsman et al. (2020) for the Luangwa basin. In this model, the water accounting was distributed by discretizing the basin and using spatially distributed forc-ing data, while the same model structure and parameter set were used for the entire basin. Each 0.1 • × 0.1 • model cell was then further discretized into functionally distinct landscape classes, i.e. hydrological response units (HRUs), inferred from topography ( Fig. 1b) but connected by a common groundwater component (Euser et al., 2015) following the FLEX-Topo modelling concept (Savenije, 2010), which was previously successfully applied in many different and climatically contrasting regions (Gao et al., , 2016Nijzink et al., 2016). Here, the landscape was classified based on the local slope and the height above the nearest drainage (HAND; Rennó et al., 2008) into sloped areas (slope ≥ 4 %), flat areas (slope < 4 %, HAND ≥ 11 m) and wetlands (slope < 4 %, HAND < 11 m). For this purpose, the drainage network was derived from a digital elevation map extracted from GMTED (Sect. 2.1.2) using a flow accumulation map after having burned in a river network map extracted from OpenStreetMap (https://wiki.openstreetmap. org/wiki/Shapefiles, last access: February 2021) to obtain an as accurate as possible drainage network, as done successfully in previous studies (Seyler et al., 2009). According to this classification, the wetland areas covered 8 % of the basin, flat areas 64 % and sloped areas 28 % (Fig. 1).
The model consisted of different storage components schematized as reservoirs representing interception and unsaturated storage, as well as a slow-responding reservoir, representing the groundwater, and a fast-responding reservoir (Fig. 2). The water balance for each reservoir and the associated constitutive equations are summarized in Table 2. The individual model structures of each parallel HRU were very similar. Functional differences between HRUs were thus mostly accounted for by different parameter sets. To allow Figure 2. Schematization of the model structure applied to each grid cell. Symbol explanation: precipitation (P ), effective precipitation (P e ), interception evaporation (E i ), plant transpiration (E t ), infiltration into the unsaturated root zone (R u ), drainage to fast runoff component (R f ), delayed fast runoff (R fl ), lag time (T lag ), groundwater recharge (R r ), upwelling groundwater flux (R GW ), fast runoff (Q f ) and groundwater/slow runoff (Q s ).
for the use of partly overlapping prior parameter distributions while maintaining relationships between parameters of individual HRUs that are consistent with our physical understanding of the system and to limit equifinality, model process constraints Hrachowitz et al., 2014) were applied for several parameters (Table 3). For instance, in the Luangwa basin, the sloped areas are dominated by dense vegetation, suggesting higher interception capacities and larger storage capacities in the unsaturated zone compared to the remaining part of the basin. In addition, for each HRU the model structure was adjusted where necessary to include processes unique to that area. For instance, water percolates and recharges the groundwater system in sloped and flat areas, whereas in wetlands this was assumed to be negligible due to groundwater tables that are very shallow and thus close to the surface.
The runoff was first calculated for each individual grid cell. A simple routing scheme based on the flow direction and constant flow velocity as calibration parameters was applied to estimate the flow at the outlet. In total, this model consisted of 16 calibration parameters with uniform prior distributions and constraints as summarized in Table 3.

First model adaptation: adding groundwater upwelling (Models B-D)
Satellite-based evaporation and total water storage observations were used to evaluate the benchmark model (Model A) with respect to the spatial and temporal variability visually and using model performance metrics as described in Sect. 3.2 to detect model deficiencies in these system-internal variables. The first model adaptation was applied to improve the hydrological model with respect to the deficiencies detected in Model A. Therefore, a detailed description of the reasoning behind the first model adaptation was explained in Sect. 4.2 after having described the deficiencies in Model A in Sect. 4.1.3.
In short, groundwater upwelling (R GW ) was added in wetland areas (see Fig. 2). This upwelling groundwater was made (1) a linear function of the water content in the unsaturated reservoir (Model B; Eq. 9 in Table 2), (2) a linear function of the water content in the slow-responding reservoir (Model C; Eq. 10) and (3) a non-linear function of the water content in the slow-responding reservoir (Model D; Eq. 11). As a result, upwelling water from the saturated zone feeds the unsaturated zone, controlled by the water content in the unsaturated (Model B) or in the saturated zone (Models C and D), and thus increasing the water availability for transpiration from the unsaturated zone in wetland areas. Compared to the benchmark model (Model A), Model B introduces one additional calibration parameter, Model C two and Model D three (Tables 2 and 3).

Second model adaptation: discretizing the groundwater system (Models E and F)
Similar to the first model adaptation, the second model adaptation was developed to improve deficiencies detected in Models B-D. Therefore, a detailed description of the reasoning behind the second model adaptation was explained in Section 4.3 after having described the deficiencies in Models B-D in Sect. 4.2.3. In short, the spatial resolution of the slow-responding reservoir was gradually increased from lumped (Models A-D) to semi-distributed (Model E) and fully distributed (Model F). In Model E, the slow-responding reservoir was divided into four units as visualized in Fig. 1a, whereas in Model F it was further discretized into a grid of 10 × 10 km 2 , equivalent to the remaining parts of the model. For both alternative formulations, Models E and F, the slow reservoir timescales K s remained constant throughout the basin to limit the number of calibration parameters. For both Models E and F, groundwater upwelling was included according to Eq. (10) ( Table 2), hence using Model C as a basis, in- Table 2. Equations applied in the hydrological model. Fluxes [mm d −1 ]: precipitation (P ), effective precipitation (P e ), potential evaporation (E p ), interception evaporation (E i ), plant transpiration (E t ), infiltration into the unsaturated zone (R u ), drainage to fast runoff component (R f ), delayed fast runoff (R fl ), groundwater recharge (R r for each relevant HRU and R r,tot combining all relevant HRUs), groundwater upwelling (R GW for each relevant HRU and R GW,tot combining all relevant HRUs), fast runoff (Q f for each HRU and Q f,tot combining all HRUs), groundwater/slow runoff (Q s ) and total runoff (Q m ). Storages [mm]: storage in interception reservoir (S i ), storage in unsaturated root zone (S u ), storage in groundwater/slow reservoir (S s ) and storage in fast reservoir (S f ). Parameters: interception capacity (I max ) [mm], maximum upwelling groundwater (C max ) [mm d −1 ], maximum root zone storage capacity (S u,max of fast (K f ) and slow (K s ) reservoirs, areal weights for each grid cell (p HRU ) [-] and time step ( t) [d]. Model calibration parameters are shown in bold letters in the table below. The equations were applied to each hydrological response unit (HRU) unless indicated differently.

Discharge
The model performance with respect to discharge was evaluated using eight distinct signatures simultaneously characterizing the observed discharge data (Euser et al., 2013; Hulsman et al., 2020). The model performance measure was based either on the Nash-Sutcliffe efficiency (E NS,θ ; Eq. 28 in Table 4) or the relative error (E R,θ ; Eq. 29) depending on the individual signature. The resulting performance metrics for the eight signatures then included the Nash-Sutcliffe efficiencies of the daily discharge time series (E NS,Q ), its logarithm (E NS,logQ ), the flow duration curve (E NS,FDC ) and its logarithm (E NS,logFDC ), of the autocorrelation function of daily flows (E NS,AC ) and the relative errors of the mean seasonal runoff coefficient during dry and wet periods (E R,RCdry , E R,RCwet ) and of the rising limb density of the hydrograph (E R,RLD ). All these signatures were combined into an overall performance metric based on the Euclidian distance to the "perfect" model (D E,Qcal ; Eq. 31). In absence of more information, and to obtain balanced solutions, all individual performance metrics were equally weighted in Eq. (31). Here, a D E,Qcal = 1 indicates a perfect fit. The discharge data availability was very limited during the validation time period (2012-2016). As a result, hydrological years were not fully captured, resulting in incomplete information on the hydrologic signatures such as rising limb density or autocorrelation function. That is why the overall model performance (D E,Qval ) was calculated using the signatures E NS,Q , E NS,logQ , E NS,FDC and E NS,logFDC , excluding E R,RCdry , E R,RCwet , E R,RLD and E NS,AC . It is therefore important to note that D E,Qcal cannot be meaningfully compared with D E,Qval . Instead, following the overall objective of the analysis, D E,Qval of the different alternative model hy-pothesis were compared to evaluate the differences between the models.

Evaporation and total water storage
The model performance was also evaluated with respect to both the temporal dynamics and the spatial pattern of evaporation and total storage, respectively. For this purpose, satellite-based evaporation data (WaPOR) were used on a 10 d timescale and total water storage anomaly data (GRACE) on a monthly timescale.

Temporal variation
To quantify the models' skill in reproducing the temporal dynamics of evaporation and total water storage anomalies, the respective Nash-Sutcliffe efficiencies (Eq. 28) were used as performance metrics. This performance metric was applied to assess the models' skill in reproducing the basin-average time series of evaporation and total water storage anomalies, i.e. E NS,Basin,E and E NS,Basin,S . Similarly, the models' performance in mimicking the dynamics of evaporation in all grid cells dominated by the wetland HRU was calculated with the Nash-Sutcliffe efficiency (E NS,Wetland,E ). Grid cells were considered as wetland-dominated if they were completely covered by wetlands, hence if p HRU = 1, with p HRU the areal weight of wetland areas within that cell. With respect to evaporation, the flux was normalized first with Eq. (33) to emphasize temporal variations rather than abso- Spatial efficiency µ mean γ fraction of histogram intersection between K and L K observed histogram L modelled histogram n = 100 bins t time step within the dry season with maximum t max lute values in an attempt to reduce bias-related errors in the observation:

Spatial variation
The model performance with respect to the spatial pattern of evaporation and total water storage anomalies was calculated with the spatial efficiency metrics E SP,E and E SP,S (Eq. 30), respectively, which were successfully used in previous studies Koch et al., 2018). The spatial model performance was first calculated for each time step within the dry period which was in September and October and then averaged to obtain the overall model performance (E SP , Eq. 30). The spatial pattern was averaged over the dry season to minimize the effect of precipitation errors.

Multi-variable
The overall potential of the models to simultaneously reproduce the temporal dynamics as well as the spatial patterns of all observed variables, i.e. discharge, evaporation and total water storage anomalies, was tested with the overall model performance metric D E,ESQ . This metric was the Euclidian distance (Eq. 32) of the following individual metrics: the temporal variation of the basin-average evaporation (E NS,Basin,E ) and total water storage anomalies (E NS,Basin,S ), spatial pattern of the evaporation (E SP,E ) and total water storage anomalies (E SP,S ), as well as discharge (D E,Q ). See Table 5 for an overview of all model performance metrics used in this study.

Model calibration
In general, the model was calibrated by first running the model with 5 × 10 4 random parameter sets generated with a Monte Carlo sampling strategy from uniform prior parameter distributions (Table 3). Then, the optimal and 5 % bestperforming parameter sets were selected according to the model performance metric as described in the previous section. The model was calibrated within the time period 2002-2012 with respect to (1) discharge (D E,Qcal ) and (2) all variables simultaneously (D E,ESQcal ). As the objective of this study was to explore the information content of multiple variables using multiple model evaluation criteria for stepwise model structure development and calibration, it was important to use the same parameter sets for all models as a common starting point to rule out the effect of different parameter sets. This was efficiently possible with the Monte Carlo parameter sampling strategy, which, in addition, also allowed for a relatively straightforward and intuitive interpretation and communication of the results.

Model validation
The model was validated with respect to discharge, evaporation and total water storage anomalies for the time period 2012-2016. During validation each variable was evaluated separately, both temporally and spatially. This included the temporal variation of the basinaverage evaporation (E NS,Basin,E ) and total water storage anomalies (E NS,Basin,S ), evaporation in wetland areas (E NS,Wetland,E ), spatial pattern of the evaporation (E SP,E ) and total water storage anomalies (E SP,S ), as well as discharge (D E,Qval ). In addition, the model was evaluated with respect to the overall performance (D E,ESQval ). This was done for the solutions from both calibration strategies.  Fig. 3). As shown in Fig. 4, the main features of the hydrological response were captured reasonably well. However, particularly in the validation period, low flows were somewhat underestimated. Note that in 2013, the observed high flows were probably underestimated due to failures in the recording, which resulted in a truncated top in the hydrograph and flat top in the flow duration curve during the validation time period (Fig. 4) and which affect the validated model performance values (D E,Qval ). The range in the calibrated model performance with respect to each discharge signature separately is visualized in Fig. S1 in the Supplement.
The basin-average evaporation (E NS,Basin,E = 0.54) and total water storage anomalies (E NS,Basin,S = 0.74) were in Figure 3. Model performance with respect to discharge, evaporation and storage for all models. The model is calibrated to discharge (D E,Qcal , darker boxplots in the first column) and validated to the discharge, evaporation and storage (lighter boxplots). The dots represent the model performance using the "optimal" parameter set and the boxplot the range of the best 5 % solutions, both according to discharge (D E,Qcal ). The following performance metrics were used: (1) discharge using the overall model performance metric (D E,Qcal for calibration and D E,Qval for validation), (2) temporal evaporation basin-average (E NS,Basin,E ), (3) temporal evaporation, wetland areas only (E NS,Wetland,E ), (4) spatial evaporation (E SP,E ), (5) temporal storage basin-average (E NS,Basin,S ), (6) spatial storage (E SP,S ) and (7) the combination of evaporation, storage and discharge (combined metric D E,ESQval ). general also reproduced rather well (Figs. S3 and S5). In contrast, the model failed to mimic the evaporation dynamics in wetland-dominated areas as it decreased rapidly to zero in the dry season in contrast to the observations (E NS,Wetland,E = 0.25, Fig. 5). Similarly, the spatial variability in evaporation (E SP,E = 0.17) and water storage anomalies (E SP,S = −0.02) were poorly captured as several areas were over-or underestimated (Figs. 6 and 7). Note that in both figures the normalized evaporation and total water stor-age anomalies were plotted by applying Eq. (33) to emphasize relative spatial differences rather than absolute values.

Multi-variable calibration
Calibrating with respect to multiple variables simultaneously in the second calibration strategy resulted in a reduced model skill in simultaneously reproducing all flow signatures in the validation period with D E,Qval = 0.07 (Table 7, Figs. 8 and 9). Compared to the first calibration strategy, the simulated evaporation did not change significantly with respect to   . Spatial variability of the normalized total water storage anomalies for Model A averaged over all days within the dry season. Panel (a) shows the observation according to GRACE data, (b) the model result using the "optimal" parameter set with respect to discharge (D E,Qcal ) and (c) the difference between the observation and model. the temporal dynamics (E NS,Wetland,E = 0.27, E NS,Basin,E = 0.57) and spatial pattern (E SP,E = −0.18). Evaporation from wetland-dominated areas remained underestimated in the dry season (Fig. 10), and large areas in the basin were still underor overestimated (Fig. 11). The reproduction of the total water storage anomalies decreased though, mostly with respect to the spatial pattern (E SP,S = −0.14, Fig. 12). On the other hand, when looking at the 5th and 95th percentile range instead of the "optimal" parameter set, then an improvement was observed in the spatial pattern in evapora- Figure 8. Model performance with respect to discharge, evaporation and storage for all models. The model is calibrated to multiple variables simultaneously (D E,ESQcal , darker boxplots in the first column) and evaluated with respect to each flux individually (lighter boxplots). The dots represent the model performance using the "optimal" parameter set and the boxplot the range of the best 5 % solutions, both according to D E,ESQcal . The following performance metrics were used: (1) discharge using the overall model performance metric (D E,Qval ), (2) temporal evaporation basin-average (E NS,Basin,E ), (3) temporal evaporation, wetland areas only (E NS,Wetland,E ), (4) spatial evaporation (E SP,E ), (5) temporal storage basin-average (E NS,Basin,S ), (6) spatial storage (E SP,S ) and (7) the combination of evaporation, storage and discharge (combined metric D E,ESQval ).

Model deficiencies
Regardless of the calibration strategy, the benchmark model failed in particular to adequately reproduce evaporation dynamics in wetland-dominated areas. During the dry seasons, the modelled evaporation decreased rapidly to zero in contrast to the observations (Figs. 5 and 10). Partly as a consequence of that, the spatial pattern of evaporation was captured poorly, as illustrated in Figs. 6 and 11. Apart from the wetlands, the modelled average dry season evaporation was also extremely low in the centre of the basin, which did not correspond with the satellite observations. At the same time, the evaporation was significantly overestimated in the southern part of the basin. Also the spatial pattern in total water storage anomalies were poorly represented since the model significantly overestimated storage anomalies in large parts of the basin (Figs. 7 and 12). Note, overestimations in specific regions do not necessarily mean the actual (non-normalized) model values were also higher compared to the observation, but it does mean the model results in this cell/region were high relative to the remainder of the basin compared to observations. This was the case for the evaporation and total water storage, even though it was negative during dry seasons (compare Figs. S8 and S9).

First model adaptation: adding groundwater upwelling (Models B-D)
In the benchmark model (Model A), there was no groundwater upwelling into the wetlands and floodplains around the river channels, similar to many distributed conceptual hydrological models (e.g. Samaniego et al., 2010;Bieger et al., 2017). However, according to field and satellite-based observations, wetland areas remain moist at the end of the dry season, while the remaining areas of the basin become very dry. Given the low elevation of these wetlands above rivers, it is plausible to assume that groundwater from higher parts of the catchment is pushed up into the unsaturated root zone of these wetlands. As a result, water deficits in the unsaturated zone are partly replenished by upwelling groundwater. It thereby can sustain relatively elevated levels of moisture, available for plant transpiration long into the dry season.
To improve the representation of evaporation in the model, the process of upwelling groundwater (R GW ) was added to the model. In principle, it was assumed that the upwelling   . Spatial variability of the normalized total evaporation for Models A, C and F averaged over all days within the dry season. Panels (a, d, g) show the observation according to WaPOR data, (b, e, h) the model result using the "optimal" parameter set with respect to multiple variables (D E,ESQcal ) and (c, f, i) the difference between the observation and model. groundwater is regulated by the head difference between upland groundwater and the groundwater in the wetland. As this information was not available, due to the lack of continuous gradients in the type of model used (Hrachowitz and Clark, 2017), this was done in a simplified way. In three alternative formulations of this hypothesis, the upwelling groundwater was made (1) a linear function of the water content in the unsaturated reservoir (Model B, Eq. 9), (2) a linear function of the water content in the slow-responding reservoir (Model C, Eq. 10) and (3) a non-linear function of the water content in the slow-responding reservoir (Model D, Eq. 11). In other words, in Model B the groundwater upwelling was driven by the water deficit in the unsaturated zone; hence the lower the water content in the unsaturated zone, the higher the groundwater upwelling. In Models C and D, the groundwater upwelling was driven by the water content in the slowresponding reservoir, the groundwater system, such that the higher the water content in the slow-responding reservoir, the higher the groundwater upwelling. As a result of the nonlinear relation between the groundwater upwelling and the water content in the slow-responding reservoir in Model D, the groundwater upwelling increased the most under dry conditions and the least under wet conditions. In Models B-D, the groundwater upwelling flowed into the unsaturated zone until it was saturated, hence until its maximum S u,max was reached (Eq. 12). Model B required one additional calibration parameter, Model C two and Model D three (Tables 2  and 3).

Discharge-based calibration
Following the first calibration strategy, the performances of Models B-D with respect to discharge did not improve significantly for the calibration period (D E,Qcal = 0.75-0.79) compared to Model A, regardless of the model (Table 6, Figs. 3 and S2). For the validation period, Models B and D Figure 12. Spatial variability of the normalized total water storage anomalies for Models A, C and F averaged over all days within the dry season. Panels (a, d, g) show the observation according to GRACE data, (b, e, h) the model result using the "optimal" parameter set with respect to multiple variables (D E,ESQcal ) and (c, f, i) the difference between the observation and model. experienced a pronounced reduction of their ability to adequately reproduce the discharge signatures with D E,Qval = 0.08 and −1.7, respectively, since the flows were mostly underestimated (Fig. S2). On the other hand, Model C showed significant improvements with D E,Qval = 0.81. With respect to the evaporation from wetland-dominated areas, the largest improvements were found for Model D (E NS,Wetland,E = 0.41), where the evaporation did not drop rapidly to zero anymore, even though it was still significantly underestimated in the dry season (Fig. S4). But this came at the cost of decreased simulations of all remaining variables (Table 6, Fig. 3), i.e. the discharge, basin-average evaporation and total water storage and their spatial patterns (Figs. S2-S7). For example Fig. S6 illustrates the poorly simulated temporally averaged dry season evaporation for Model D, which was higher in wetland areas (centre of the basin) compared to the surrounding areas and which was not observed in the satellite-based observations. For Models B and C, the model performances with respect to the remaining variables remained comparable to Model A or even decreased, as can be seen in Table 6 and Fig. 3. As a result, when considering all variables simultaneously, Model C performed the best, with D E,ESQval = 0.32.

Multi-variable calibration
Following the second calibration strategy, Model C experienced the largest increases compared to Model A in its ability to describe features of discharge, with D E,Qval = 0.61, while Model D decreased the most to D E,Qval = −0.08, with the high flows being overestimated and low flows underestimated (Table 7, Figs. 8 and 9). With this calibration strategy, large improvements were observed in the reproduction of the evaporation from wetland-dominated areas for all three Models B-D, especially for Model D, with E NS,Wetland,E = 0.59, where the evaporation was simulated well, even during the Table 6. dry season as it did not decrease rapidly to zero in the dry season compared to Model A (Fig. 10). For Models C and D, the spatial pattern in evaporation and total water storage anomalies improved, albeit moderately (Table 7), as large areas were still under-or overestimated (Figs. S12 and S13), whereas it decreased slightly for Model B. For Models B-D, the basin-average temporal dynamics in evaporation and total water storage anomalies remained similar or decreased slightly (Table 7, Figs. S10 and S11). Overall, when considering the model performance with respect to all variables simultaneously, Model C showed the best performance, with D E,ESQval = 0.33.

Model deficiencies
According to the results, the representation of evaporation strongly benefitted from including upwelling groundwater as a function of the water content in the slow-responding reservoir (Eq. 10; Model C), especially for the second calibration strategy. The incorporation of this flux resulted in increased levels of water supply to the unsaturated zone of wetlands to sustain higher levels of transpiration throughout the dry periods (Fig. 10). But even though the evaporation increased during dry periods, it was still underestimated, especially towards the end of the dry season, due to too large an amount of groundwater upwelling depleting the slow-responding reservoir. The major weakness of the model remained its very limited ability in representing the spatial pattern in evaporation as there were several local clusters of considerable mismatches, both over-and underestimating observed evaporation. This was clearly visible for example in the centre and southern part of the basin (Fig. 11). Also the spatial pattern in the total water storage anomalies remained poorly represented, in spite of some improvements compared to Model A, as they were considerably overestimated in the northern parts of the basin (Fig. 12). This could be a result of deficiencies in the hydrological models or in the satellite-based observations.

Second model adaptation: discretizing the groundwater system (Models E and F)
In all above models, the groundwater layer was simulated as a single lumped reservoir, assuming equal groundwater availability throughout the entire basin. As groundwater processes can occur on relatively large spatial scales, this assumption may be valid for small-scale or mesoscale catchments but not necessarily for larger basins such as the Luangwa basin. This may partly be responsible for the deficiency of all above models to meaningfully reproduce the spatial pattern of the total water storage. Taking Model C as a basis for further model adaptations, two more alternative model hypothesis were formulated. In both models the slowresponding reservoir, representing the groundwater, was spatially discretized. For Model E, the reservoir was split into four units with an area of 15 396-47 239 km 2 , each containing four to six different GRACE cells (see Fig. 1a). In contrast, Model F was formulated with a completely distributed slow reservoir at the resolution of the remaining parts of the model, i.e. 10 × 10 km 2 . In Models E and F, the slow reservoir timescales K s remained constant throughout the basin to limit the number of calibration parameters. Models E and F did not require additional calibration parameters. See Tables 2 and 3 for the corresponding model equations and calibration parameter ranges.

Discharge-based calibration
Following the first calibration strategy, the calibrated and validated model performance with respect to discharge did not change significantly for Model E compared to Model C. For Model F on the other hand, the calibrated model performance increased to D E,Qcal = 0.91 (Table 6, Figs. 3 and S2), but during validation it decreased to D E,Qval = 0.52 compared to Model C as a result of overestimated high flows (Fig. S2). In other words, the discharge simulation was only affected when applying a fully distributed groundwater system (Model F). Also the simulated dynamics of the evaporation improved for Model F, especially for wetlanddominated areas (E NS,Wetland,E = 0.56, Table 6), even though it remained significantly underestimated during the dry season (Fig. S4). But for both models, no improvements in the spatial pattern of evaporation can be observed, with E SP,E = 0.05 and −0.03 for Models E and F, respectively. As shown in Fig. S6, for Model E and F the temporally averaged dry season evaporation was very low in the centre of the basin compared to the remaining part of the basin in contrast to the satellite-based observations. The spatial pattern of total water storage anomalies was at least slightly better mimicked by Model F with E SP,S = 0.08 (Fig. S7), which, in turn, came at the price of a poorer reproduction of the temporal dynamics of the basin-averaged total water storage anomalies (E NS,Basin,S = 0.66, Fig. S5).

Multi-variable calibration
Including multiple variables in the calibration process did not improve the representation of the hydrological response with respect to discharge for Models E and F compared to Model C with D E,Qval = 0.30 and 0.51, respectively (Table 7, Figs. 8 and 9). For both models, the flows were underestimated during low flows and overestimated during high flows (Fig. 9). Also the evaporation from wetland-dominated areas did not improve for both models as it decreased rapidly in the dry season (Fig. 10). On the other hand, the spatial pattern in the evaporation was slightly better mimicked for Model F (E SP,E = 0.23) but still at low performance levels similar to Models A-D, with large areas still being under-or overestimated (Fig. S12). Slight improvements could be observed though for the representation of spatial pattern in total water storage in Models F (E SP,S = 0.09, Fig. S13), albeit modestly. Overall, when considering the model performance with respect to all variables simultaneously, Model F showed the best performance, with D E,ESQval = 0.37.

Model deficiencies
Applying the second calibration strategy, Model F poorly reproduced the evaporation from wetlands (Fig. 10) since the water availability for evaporation decreased rapidly in the dry season due to the limited water availability in the slowresponding reservoir. This was a direct result of the limited connectivity in the distributed groundwater system within the basin and very likely points to the presence of contiguous groundwater systems extending beyond the modelling resolution that sustain dry-season evaporation in wetlands. Strikingly, discretizing the groundwater basin only had limited effects on the spatial pattern in evaporation and total water storage anomalies. Despite their limited improvements, they remained poorly captured as several local clusters were overand underestimated (Figs. 11 and 12).

Discussion
As illustrated in the previous sections, satellite-based evaporation and storage anomaly data were used in an attempt to (1) iteratively improve a benchmark model structure and (2) identify parameter sets with which the model can simultaneously reproduce the temporal dynamics as well as the spatial patterns of multiple flux and storage variables. The results suggested that among the tested models, Models C and F provided the overall best representation of the hydrological processes in the Luangwa basin, following the first and second calibration strategy respectively. The addition of upwelling groundwater alone (Model C) significantly improved the discharge simulations during validation regardless of the calibration strategy and the simulation of evaporation from wetland areas following the second calibration strategy. Discretizing the slow-responding reservoir (Model F) reached reasonable overall performance levels, i.e. D E,ESQval , when calibrating with respect to discharge and its signatures only (Fig. 3), with improved simulations of evaporation from wetland areas. But calibrating with respect to multiple variables proved instrumental as it allowed the spatial pattern of the evaporation to be improved compared to calibrating with respect to discharge (Figs. 11 and S12) while maintaining high levels for the other performance criteria (Fig. 8). In general, it could also be observed that a further discretization of the model led to a better representation of the system especially with respect to the spatial patterns. Nevertheless, while the model structure and calibration strategy did influence the spatial pattern in the evaporation (Figs. S6 and S12) and total water storage anomalies (Figs. S7 and S13), none of the tested mod-els could adequately reproduce the observed spatial pattern, which could be a result of model deficiencies or uncertainties in the satellite-based spatial patterns.
A potential reason for the models' problems in meaningfully describing the spatial pattern of the evaporation was in this study the use of the same parameters within a specific HRU in different model grid cells, as also observed in previous studies . As a result, the simulated spatial pattern was strongly influenced by the catchment classification method into distinct HRUs. In this study, the catchment was classified merely on the basis of topography into flat, sloped and wetland areas, whereas ecosystem diversity could also be considered as an additional layer in the classification. The poor representation of the spatial pattern in total water storage was also partly linked to that. Spatially distributing calibration parameters could improve the modelled spatial pattern, assuming there are sufficient data available to meaningfully constrain the increased number of calibration parameters and thus to avoid elevated equifinality. In a preliminary test, the maximum interception storage (I max ) was spatially distributed using a linear transfer function with LAI (leaf area index) data similar to previous studies (Samaniego et al., 2010;Kumar et al., 2013) and using Model F as a basis. This did not result in obvious improvements as shown in Fig. S14. It was considered outside the scope of this study to analyse additional parameter distribution strategies with the limited data availability in this study region.
Another likely reason for the poorly modelled spatial pattern is the absence of lateral exchange of subsurface water between model grid cells in the tested models, as contiguous groundwater bodies of varying but unknown spatial scale will shape water transfer through the landscape in the real world, which remains unaccounted for in the model. Lateral exchange fluxes are, as any flux, driven by continuous gradients and resistances. However, conceptual-type models, such as the one used in this study, only mimic gradients within grid cells but not between grid cells. As a result, the head difference between neighbouring cells remains unknown, which entails that the direction and magnitude of lateral exchange between cells are unknown. Consequently, these fluxes can only be expressed on the basis of free calibration parameters. However, in this data-scarce region it will not be possible to test whether the additional calibration parameters and the associated exchange fluxes are physically plausible. These unspecified boundary fluxes across grid cells are at the core of the closure problem (Beven, 2006a) and touch on the limits of what can be done in hydrology with our current observational technology and the available data. Therefore, adding lateral exchange flow to the model was considered outside the scope of this study.
In addition, each of the applied data sources have their own uncertainties and bias. These include uncertainties in observed discharge due to rating curve uncertainties (Westerberg et al., 2011;Domeneghetti et al., 2012;Tomkins, 2014) and limited data availability, in precipitation data, often as 976 P. Hulsman et al.: Learning from satellite observations: increased understanding of catchment processes a result of poorly capturing mountainous regions or extreme events on small scales (Hrachowitz and Weiler, 2011;Kimani et al., 2017;Dinku et al., 2018;Le Coz and van de Giesen, 2019), in estimates of total water storage anomalies as a result of data (post-)processing including data smoothing using a radius of for example 300 km, affecting the spatial variability on a basin scale (Landerer and Swenson, 2012;Blazquez et al., 2018), and in evaporation data due to model, input data and parameter estimation uncertainties (Zhang et al., 2016). In general satellite products are a result of models that are prone to uncertainties related to the input data or model conceptualization. Uncertainties in for example the spatial pattern of the precipitation affect the spatial pattern of the evaporation considerably, as shown in Fig. S15. In an ideal situation, the data would be validated with field measurements to assess the error magnitude. However, this was not possible due to data limitations. To compensate for bias errors in the satellite-based evaporation, and to allow for more reliable comparisons with model results, the satellite-based evaporation was adjusted with a correction factor of 1.08 (Sect. 2.1.2). Correcting the precipitation in a similar manner instead of the evaporation did not significantly affect the model results since normalized values were used for model calibration and evaluation (Fig. S16).
The results in this study were sensitive to the choice of performance metrics with respect to the individual variables (discharge, evaporation and total water storage) and all variables combined. For instance, the overall model performance measure D E,ESQval (Eq. 32) was strongly influenced by the validated discharge model performance D E,Qval due to its large range and variation between models compared to the remaining variables, for which the range was smaller and similar for all models (Fig. 8). As a result, the overall model performance measure might not reflect each variable equally well, which affected the choice of best performing model. However, this did not cause the poorly reproduced spatial pattern in the evaporation as it remained poorly modelled also when calibrating only with respect to that variable (E SP,E , Fig. S17). In addition, the histogram component (γ ) in the spatial efficiency metric (E SP , Eq. 30) becomes less meaningful for very coarse resolutions when the river basin consists of only a few grid cells, as was the case for GRACE. It would be interesting to examine the different components in E SP in more detail in future studies to assess the overall suitability of this metric to identify feasible parameter sets across different spatial scales.
Reflecting the results of previous studies, this study found that calibrating to multiple variables including the spatial patterns improved the simulation of the evaporation and storage with some trade-off in the discharge simulation depending on the model structure (Stisen et al., 2011;Rientjes et al., 2013;Demirel et al., 2018;Herman et al., 2018;Dembélé et al., 2020). But in contrast and in addition to previous studies, this study also provided an example, illustrating that spatial data, here evaporation and total water storage, can contain relevant information to diagnose model deficiencies and to therefore enable stepwise model structural improvement. Previous studies have largely relied on discharge observations to improve model structures Fenicia et al., 2016), and only few studies used satellite data (Roy et al., 2017), even though they provide valuable information on the internal processes temporally and spatially, which is not available with discharge data alone (Daggupati et al., 2015;Rakovec et al., 2016). Roy et al. (2017) observed that the simulated evaporation according to the spatially lumped model HYMOD (HYdrological MODel) rapidly dropped to zero in contrast to the satellite product GLEAM (Global Land Evaporation Amsterdam Model) in the Nyangores River basin in Kenya. They improved this simulated evaporation while maintaining good discharge performances by modifying the corresponding equation in HY-MOD such that it was a function of the soil moisture.
While here we focussed on upwelling groundwater and spatial discretization, a promising avenue for future studies may be to evaluate the incorporation of simple formulations of subsurface exchange fluxes between model grid cells. Similarly, a further discretization of HRUs into different land cover and ecosystem types may be worthwhile. In addition, a systematic sensitivity analysis is recommended to explore the influence of individual factors such as model structure and parameters on the spatial and temporal variability of different variables and to further improve the representation of the hydrological processes.

Conclusion
The objective of this paper was to explore the added value of satellite-based evaporation and total water storage anomaly data to increase the understanding of hydrological processes through stepwise model structure improvement and model calibration for large river systems in a semi-arid, data-scarce region. For this purpose, a distributed process-based hydrological model with sub-grid process heterogeneity for the Luangwa River basin was developed and iteratively adjusted. The results suggested that (1) the benchmark model (Model A) calibrated with respect to discharge reproduced observed discharge but also basin-average evaporation and total water storage anomalies rather well, while poorly capturing the evaporation for wetland-dominated areas as well as the spatial pattern of evaporation and total water storage anomalies. Testing five further alternative model structures (Models B-F), it was found that (2) among the tested model hypotheses, Model F, allowing for upwelling groundwater from a distributed representation of the groundwater reservoir and (3) simultaneously calibrating this model with respect to multiple variables, i.e. discharge, evaporation and total water storage anomalies, resulted in marked improvements of the model performance, providing the best simultaneous representation of all these variables with respect to