Calibration analysis for water storage variability of the global hydrological model WGHM

The aim of this study is to provide an improved global simulation of continental water storage variations by calibrating the WaterGAP Global Hydrology Model (WGHM) for 28 of the largest river basins worldwide. Five years (January 2003–December 2007) of satellite-based estimates of the total water storage changes from the GRACE mission were combined with river discharge data in a multiobjective calibration framework that uses the most sensitive WGHM model parameters. The uncertainty and significance of the calibration results were analysed with respect to errors in the observation data. An independent simulation period (January 2008–December 2008) was used for validation. The contribution of single storage compartments to the total water budget before and after calibration was analysed in detail. A multi-objective improvement of the model states was obtained for most of the river basins, with mean error reductions of up to 110 km3/month for discharge and up to 24 mm of a water mass equivalent column for total water storage changes, such as for the Amazon basin. Errors in the phase and signal variability of seasonal water mass changes were reduced. The calibration is shown to primarily affect soil water storage in most river basins. The variability of groundwater storage variations was reduced on a global scale after calibration. Structural model errors were identified from a small contribution of surface water storage including wetlands in river basins with large inundation areas, such as the Amazon or the Mississippi. Our results demonstrate the value of both the GRACE data and the multi-objective calibration approach for improving large-scale hydrological simulations, and they provide a starting-point for improving model structures. The integration of complimentary observation data to further constrain the simulation of single storage compartments is encouraged. Correspondence to: S. Werth (sawerth@gmail.com)


Printer-friendly Version
Interactive Discussion residence time of water in the river basins (Ngo-Duc et al., 2007). Recent regional studies focus on modelling of groundwater storage with land surface models (e.g., Gulden et al., 2007;Lo et al., 2008;Kollet and Maxwell, 2008) but groundwater is still absent in several large-scale or global models. Although the global model WGHM simulates the most important storages compartments, including surface water and groundwater, 5 simulation accuracy of the conceptual model was originally low for river discharge in snow dominated and semi-arid regions. Here, difficulties in the representation of evaporation or snow accumulation appeared (Döll et al., 2003). In response, Hunger and Döll (2008) and Schulze and Döll (2004) improved model equations for both processes. For TWS, however, WGHM still tended to underestimate seasonal TWS variations and 10 phase shifts appeared (Schmidt et al., 2008b(Schmidt et al., , 2006. Güntner et al. (2007) found a regional varying sensitivity of WGHM parameters. Since only one parameter of the original model has globally been calibrated so far, this calls for an extension towards a regional calibration with respect to dominant processes of a river basin.
Theoretical studies propagate an iterative working process of model prediction, 15 model analysis and process understanding (e.g., Fenicia et al., 2008;Savenije, 2009). An evaluation of model predictions should be undertaken by comparisons of simulated states of the water cycle to real-world observations. Model behaviour during tuning processes like data assimilation (e.g., Houtekamer and Mitchell, 1998;Reichle et al., 2002) or model calibration (e.g., Duan et al., 2003;Gupta et al., 2005) provides infor-20 mation on process behaviour and structural model deficits. But, the learning process is especially difficult on the global scale and limited to iterative steps, primarily because of the lack of adequate model forcing and validation data with global coverage and acceptable resolution and accuracy. In this respect, the Gravity Recovery And Climate Experiment (GRACE) is of ex-and oceanic gravity effects, GRACE observations enable temporarily reliable studies of different hydrological processes (like snow and ice, groundwater, soil, surface, as done by Wouters et al., 2008;Niu et al., 2007;Swenson et al., 2008;Papa et al., 2008, respectively) that include different climatic conditions and extreme events for many regions (e.g., Zeng et al., 2008;Seitz et al., 2008) or the water balance itself (Sheffield 5 et al., 2009). Since the first GRACE record became available, large progress has been made in order to improve GRACE data accuracy and, thus, the reliability of water mass variations from GRACE. These include studies on dealiasing (Han et al., 2004), error estimates (Horwath and Dietrich, 2006), development of filter (Swenson and Wahr, 2002) and decorrelation techniques (Kusche, 2007) as well as filter optimization (Werth et al., 2009b). Consequently, GRACE depicts a valuable tool for validation and calibration of large-scale hydrological models (Schmidt et al., 2008a;Güntner, 2009;Lettenmaier and Famiglietti, 2006). Application of GRACE data for large-scale hydrological modelling started out with validation of simulated water storage variations for large river basins or with global coverage (e.g., Ngo-Duc et al., 2007;Syed et al., 2008;Güntner, 15 2009). More recently, promising further steps were made towards the integration of GRACE data into model development and model tuning for particular regions, e.g., the Amazon or Mississippi basin (e.g., Zaitchik et al., 2008;Werth et al., 2009a;Lo et al., 2009). As a subsequent step that makes full use of the global coverage of GRACE, a world-wide integration of TWS variations towards an improved simulation of conti-20 nental TWSV as a whole would be desirable. But many combinations of simulated single storage compartments may lead to a good fit for the integrative GRACE TWS variations with only coarse resolution. Hence, to obtain additional model constraints, higher parameter accuracy (Yapo et al., 1998;Vrugt et al., 2003;Gupta et al., 2005) and to reduce parameter equifinality (Beven and Binley, 1992), the combination with other system states, like river discharge, in a multi-objective method is promising. In addition, using GRACE-based TWSV and river discharge is of particular interest for water balance analyses as both are integrated measures of the hydrological dynamics in a river basin.

4817
In this context, this study makes a step forward in the iterative learning process of large-scale hydrological modelling towards improved global simulation of the continental water cycle and its storage compartments by a multi-objective calibration (Sect. 2.2) of the global model WGHM (Sect. 2.1) against river discharge and GRACE-based estimations (Sect. 2.3) for 28 of the largest and most important river basins world wide 5 (Sect. 3.1).

Global hydrological model
The WaterGAP Global Hydrology Model (WGHM, Döll et al., 2003) simulates the continental water cycle by conceptual formulations of the most important hydrological pro- 10 cesses. WGHM was originally developed by Döll et al. (2003) for water availability studies at the continental scale (Alcamo et al., 2003, e.g.,). But since the model provides estimates of water masses, it may serve for for hydrological analyses of water storage and its global dynamics (Güntner et al., 2007) as well as for individual storage compartments, such as groundwater recharge (Döll and Fiedler, 2008) or storage of 15 surface water bodies (Papa et al., 2008). WGHM was numerously applied for comparison of continental water storage variability to GRACE-based water mass variations (Schmidt et al., 2006(Schmidt et al., , 2008b. The conceptual model equations of WGHM are described in detail by Döll et al. (2003), Kaspar (2004) and Hunger and Döll (2008). In general, if water precipitates as 20 rain it is passed through the storages of interception, surface water (including rivers, reservoirs, lakes and wetlands), soil and groundwater, reduced for evapotranspiration losses. In case of precipitation falling as snow, it accumulates as snow storage and follows the above liquid water cycle after melting. Additionally, human water consumption is considered (Döll et al., 2003). Accumulation of ice or permafrost is not accounted the model was forced by climate data (temperature, cloudiness and number of rain days per month) from the operational forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF). Monthly precipitation input from the Global Precipitation Climatology Centre (GPCC) was used. Precipitation data were corrected for precipitation measurement errors according to Legates and Willmott (1990) following Fiedler 15 and Döll (2007). This model set up formed the reference of the present study and is hereafter called the original model version. Döll et al. (2003) and Hunger and Döll (2008) tuned the original WGHM against long-term river discharge by a runoff coefficient parameter, which determines the fraction of effective precipitation that translates into runoff, depending on the saturation of 20 soil water (Eq. 3, Döll et al., 2003). Both studies noted that calibrating this parameter only was not sufficient for some areas to get acceptable simulation results for river discharge because, for instance, the water balance of lakes and wetlands is not influenced by this calibration approach, and because of other mis-modeled processes. Therefore, this study intends to calibrate WGHM parameters of all important process 25 formulations besides runoff within a river basin (see Sect. 2.2.1). We consider calibrated parameter values as effective values that account for non-resolvable features in a large-scale model such as sub-scale variability, input data errors, model structure errors or simplifications in model equations.

4819
WGHM consists of 36 model parameters. They are explained in detail in the publications of the original model versions while an overview of the 21 relevant WGHM parameters for this study is given below and in Table 1. The admitted parameter ranges for calibration were based on literature data and qualitative reasoning (Kaspar, 2004).
The soil storage capacity depends on the soil type and the land cover and is reg-5 ulated by the root depth parameter. This parameter is calibrated as a multiplicative factor (SL-1), i.e., the particular value for soil storage capacity based on the soil and land cover data in each model cell is multiplied by the value of SL-1 (here in the range of 0.5 to 2, see Table 1). Groundwater storage and outflow is governed by the groundwater baseflow coefficient (GW-1).

10
Surface water transport may on the one hand be calibrated by the river velocity (SW-2). On the other hand, the surface water flow coefficient (SW-5) as well as the maximum range of water levels in lakes (lake depth, SW-3) and wetlands (wetlands depth, SW-4) determine storage rates of surface water bodies and are possible calibration parameter for surface water transport processes. Furthermore, the runoff coefficient 15 parameter, which was tuned against river discharge for the original model versions, is calibrated as a multiplier (SW-1) in this study.
The potential evapotranspiration is computed in WGHM by the approach of Priestley and Taylor (1972) (PT). The equation is adjusted by the PT-coefficient that differentiates between humid (average relative humidity of 60% or more, ER-5) and arid regions 20 (average relative humidity less than 60%, ER-6). The net radiation required as input for the PT-approach is computed according Shuttleworth (1993) (see Döll et al., 2003. Herein, the radiation proportion parameter (ER-1) is used to determine the radiation fraction of the extraterrestrial radiation that reaches the Earth's surface. The radiation fraction may be reduced by cloud cover following a radiation correction parameter (ER-25 2). The actual evaporation of open water can be calibrated by the open water albedo (ER-4) and sublimation of snow by the snow albedo (ER-3). Land surface evapotranspiration is limited by the maximum potential evapotranspiration (MPET, ER-7) parameter (see Döll et al., 2003).

4820
Interception storage capacity depends on three parameters: The maximum canopy water height (MCWH, IN-1) as well as a specific leaf area multiplier (IN-2) and a biomass multiplier (IN-3).
The rates of snow melt and accumulation depend on land cover and elevation. Snow melt is computed in WGHM by a degree-day approach. The degree-day factor depends 5 on the land cover type. It is calibrated in this study by a multiplicative factor (SN-3). Sub-grid variability of elevation within a 0.5 degree model cell is represented in WGHM (100 sub-units per 0.5 • -cell) and elevation effects are accounted for by a temperature gradient (SN-4). Additional effects on snow storage processes can be adjusted by a cell-averaged snow freeze temperature (SN-1) and snow melt temperature (SN-2).

Calibration regions and parameter sensitivity
Due to the limited resolution of GRACE data, the 28 largest and most important river basin worldwide were selected for this study (Fig. 1). Except for Volta in western Africa, all basins are larger than 600 000 km 2 in size (see Table 2). WGHM calibration is 15 carried out separately for each basin. Güntner et al. (2007) showed that WGHM parameter sensitivity for water storage variations varied between the river basins. This inter-basin variability is due to different climatic conditions as well as land surface properties and, thus, varying relevance of different storage processes. Consequently, for each region, only the sensitive pa-20 rameters should be calibrated in order to reduce computational costs and to simplify the interpretation of the calibration results. A sensitivity analysis (SA) against TWSV and river discharge was undertaken (see also Werth et al., 2009a) following the SA approach of Hornberger and Spear (1981). The parameter sensitivity was analyzed by a Latin Hyper-cube sampling for 2000 parameter sets for all 28 river basins. Applied 25 parameter ranges are given in Table 1. The resulting six to eight most sensitive parameters for TWSV and river discharge (Table 3) were used for the regional calibration of 4821 each river basin and non-sensitive parameters were set to their original values (Table 1, col. 3).
The results of the SA confirmed that the subset of sensitive parameters varied considerably between the river basins. While snow parameters are not sensitive in tropical basins, parameters that control surface water transport appeared as particularly sen-5 sitive in basins with important flood plains, such as the Amazon. A broader range of sensitive parameters resulted, for instance, in the Indus river basin which is, on the one hand, dominated by snow storage in the northern mountain area and, on the other hand, high evaporation rates in desert region of the lower Indus. Hence, sensitive parameters belong to these two processes and, e.g., soil water parameters are 10 comparatively less important in the Indus basin. As an example of a river basin that stretches among three different climate regions (cold in the north, subtropical in the southeast and dry in the southwest), important parameters for the Mississippi cover a variety of processes (soil, snow, evaporation, interception and surface water). 15 The multi-objective calibration approach of WGHM was explained in detail by Werth et al. (2009a). Figure 2 and the description below gives an overview. The calibration was done for all 28 river basins in an automated framework for the period 01/2003-12/2007.

Multi-objective calibration approach
Calibration is a widely used optimization technique in hydrological modelling. In an it-20 erative process, different parameter values are tested for their ability to generate model system states that fit well to observations. The best parameter set provides the lowest simulation error or the highest simulation performance expressed by an objective value. Several functions to measure the objective value are possible, like the normalized root mean square error or the correlation coefficient. Within this study, the Nash-Sutcliffe-25 efficiency coefficient (NSC, Nash and Sutcliffe, 1970) is applied. NSC is a simulation performance measure that normalizes the squared difference of a predicted to an observed time series by the sum of squared deviations of the observations to their mean 4822 Introduction

Tables Figures
Back Close

Full Screen / Esc
Printer-friendly Version Interactive Discussion during the period of interest. It ranges from −∞ to 1 (optimal fit), with a value of 0 indicating a simulated time series that performs as well as a model being equal to the mean of the observable. NSC is applied here because it measures errors in phase, amplitude and mean of a simulated time series at the same time.
Within a multi-objective calibration, more than one observation is applied to evalu-5 ate the model simulations, which makes the selection of the best parameter set less trivial. Due to errors in the model structure and the input data (Vrugt et al., 2003), the approach will no longer provide one single optimal parameter set, but lead to a Pareto set of optimal solutions (Gupta et al., 1998). Each Pareto solution provides a better simulation performance than any other Pareto solutions for at least one of the objec-10 tives (but not all objectives). Without additional information on the observations or a defined priority of simulation accuracy, the Pareto solutions are equal. In this study, river discharge and TWSV were applied for the calibration of WGHM and a balanced improvement of simulation performance for both objectives was intended. Therefore, the solution closest to the optimum of the objective values (here a value of NSC=1 for 15 both objectives) was selected as the best parameter set and used for further analyzes. For parameter variation, ranking and archiving the calibration algorithm -Nondominated Sorting Genetic Algorithm-II ( -NSGAII, Kollat and Reed, 2006) was used. The multi-start scheme and the evolutionary strategy of the algorithm (mutation, crossover and selection) enable a global optimization of the parameter values and are 20 able to solve highly non-linear optimization problems. The algorithm is one of the most efficient and effective multi-objective optimization methods used in hydrological modelling (Tang et al., 2006). These features enable a multi-objective calibration for more than one parameter of the non-linear and computational expensive WGHM model system. -NSGAII operators were set to values proposed by Kollat and Reed (2006) and 25 a population size of N=12, an -resolution of 0.05 for both objectives and a generation size of 100 (hence, a maximum of 1200 model evaluations) were used.
In contrast to Werth et al. (2009a) who applied significant signal periods within the GRACE data for their calibration, a calibration against full time series of GRACE TWSV 4823 was undertaken in the present study (see data Sect. 2.3). During the calibration of WGHM, TWSV simulations were filtered in the same way as GRACE data (see Sect. 2.3.2 and Table 2) to ensure equal resolution and a consistent comparison of both data sets.

Discharge data
River discharge data of the most downstream gauging station of each river basin were used (Table 2, col. 4 and Fig. 1). Data were obtained from the Arctic Regional Integrated Hydrological Monitoring System for the Pan-Arctic Land Mass (ArcticRIMS, http://rims.unh.edu), the Environmental Research Observatory for geodynamical, hy-10 drological and biogeochemical control of erosion/alteration and material transport in the Amazon (ORE HYBAM, http://www.ore-hybam.org) and the Global Runoff Data Center (GRDC, grdc.bafg.de).
For the Amazon, Mississippi, Mackenzie, Ob and Yenisei monthly discharge data were available for the GRACE operation period. For all other basins, up-to-date mea-15 surements were not available and mean monthly river discharge (for Jan-Dec) was computed from the most recent period of available data (maximum period of 30 years, see Table 2).
Errors of discharge measurements depend on the individual measurement methods and channel cross sections are likely to vary for the individual stations and time peri-20 ods. Unfortunately, no details are provided from the data centres on the accuracy of discharge measurements. Therefore, the error of discharge measurements was set to a conservative value of 20% for the uncertainty analysis of the calibration results. 4824

GRACE data
The greatest challenge in the application of GRACE-based TWSV is marked by the difficulty of separating error from signal as well as separating signal from the region of interest and its neighbouring regions. The spatial resolution of the GRACE data is limited due to the decreasing sensitivity of the satellites to mass variations with smaller 5 geographical extent. Simulation data of atmospheric and oceanic circulation models are applied to de-alias the gravity fields from sub-monthly circulation effects in both systems. Errors in these de-aliasing data and satellite measurement errors increase the noise in spherical harmonic coefficients particularly for higher degrees of the expansion terms, i.e., higher spatial resolution (e.g., Schmidt et al., 2008a). The error 10 budget is also influenced by signal leakage errors from surrounding areas. As a consequence, the application of filter methods is indispensable to reduce noise in the GRACE data. Nevertheless, the magnitude of errors varies between particular regions and months. Therefore, the user has to decide on an adequate filter method as well as for filter parameter settings to balance and minimize GRACE measurement errors and 15 leakage errors. Filtering in turn may change the final signal properties. Werth et al. (2009b) showed that filter induced amplitude damping and phase shifts in time series of basin-averaged TWSV differs between regions because of varying signal characteristics inside and outside of the river basin and basin shape. Hence, the selection of an optimum filter method is a function of the river basins. For the present study, the  Table 2 for a summary of applied filter methods).

25
The method to transform the spherical harmonic representation of GRACE gravity data to regional averaged water mass variations by Swenson and Wahr (2002) was applied using river basin boundaries as displayed in Fig. 1 Interactive Discussion mined both from the hydrology model WGHM and from the GRACE gravity fields were removed from time series of TWSV in this study. GRACE derived time series of TWSV from different processing centres show significant differences (as for the Lena basin in Fig. 3). These differences are due to different processing strategies, background models or processing software (Schmidt 5 et al., 2008a) and reflect uncertainties in the GRACE data. Consequently, an average of GRACE gravity fields (Level-2 products, most recent version RL04) from three processing centres was used (Flechtner, 2009): the German Research Center for Geosciences (GFZ, until degree 120), the Center for Space Research (CSR, until degree 60) and the Jet Propulsion Laboratory (JPL until degree 120). The three sets of coef- GRACE errors were estimated from the error coefficients of the individual data sets published by the processing centres, i.e., correlated errors as provided by GFZ and 15 CSR (Schmidt et al., 2008a;Wahr et al., 2006). As correlated errors were not available for JPL gravity fields, the confidence interval of JPL coefficient errors is increased to 99% by assuming a normal distribution. This results in a multiplication by ≈2.6 of the formal coefficient errors. The final error estimates for the averaged coefficients from the three processing centres amounts to: (1) Errors in the coefficients are propagated to the basin averages of water storage for each river basin. See Fig. 3 for an example of basin-averaged TWSV derived from the three gravity solutions, the average solution and associated errors.

Uncertainty estimation due to observational errors
The uncertainty of the calibration results due to errors in the calibration data is estimated for each river basin by the following procedure: 1) Selection of the calibration run with the Pareto solution closest to the optimum (see an example for the Lena river basin in Fig. 4). 2) Propagation of GRACE coefficient errors to basin-averaged esti-5 mates of TWSV as well as determination of the 20% discharge error. 3) Generation of 5000 normally distributed samples within the estimated error ranges for the monthly data points of GRACE-based TWSV and monthly river discharge, respectively. The sample size was tested ahead and selected to provide stable statistical results. 4) Estimation of both objective functions (NSC) for each sample against simulated time 10 series of the selected optimal solution, respectively for TWSV and discharge. 5) Determination of the NSC standard deviations for both objectives as the semiaxis for an error ellipse around the selected optimal solution. And 6) Selection of all calibration runs within the error ellipse (see Fig. 4 for the Lena basin).
The described approach determines all Pareto solutions around the selected opti-15 mum and non-Pareto solutions close to the Pareto frontier, which cannot be evaluated to provide a better fit to the observations than the selected Pareto solution if the error range of the observations is considered. The selected cluster of calibration solutions represents the total uncertainty of the calibration results in view of the observation errors.

20
3 Results and discussion

Calibration results
Detailed results for Lena basin (Fig. 4) show a typical objective function response that was found after calibration of most river basins. A clear trade-off exists between both objective functions for TWSV and mean monthly discharge. The best solutions for the Introduction

Tables Figures
Back Close

Full Screen / Esc
Printer-friendly Version Interactive Discussion single objectives are located at the end of the Pareto frontier (crossed dots). Best results for a single objective, however, give an undesirable decrease in the accuracy for the other objective. The selected Pareto optimum (large gray dot) provides a balanced improvement between both objectives. The multi-objective calibration approach also decreases equifinality of the parameter sets, since unacceptable parameter sets for 5 any of the objectives are excluded by the multi-objective evaluation scheme. A more pronounced equifinality for simulating total water storage variations originates from the character of total water storage data. Since GRACE provides no absolute values but only variations of water masses, the same storage variations may be simulated by different model representations with different absolute amounts of water stored in the river 10 basin. This is not the case for river discharge where both absolute values and variations are given by the observation data. Hence, a smaller number of model realizations provides good objective values for evaluation by discharge than by TWSV. The large ellipse around the selected Pareto optimum represents its uncertainty caused by measurement errors in the calibration data. Variations of parameter values or model output 15 for model realizations within this range are not significant for the assumed observation data errors. Nevertheless, a significant improvement was achieved for both objective values relative to the original model for the example of the Lena basin. An overview of the calibration results for all river basins is given in terms of relative root mean squared error (RMSE, Fig. 5). The relative RMSE was computed from the 20 RMSE of time series of mean monthly discharge (circles) and TWSV (squares) against root mean squared (RMS) values of the respective measurements. Absolute values of signal RMS and model RMSE are presented in Table 4. Uncertainty ranges due to observational errors were transferred to RMSE and relative RMSE values and they are indicated by error bars in (Fig. 5)

Simulation of seasonal TWSV
The effects of model calibration on seasonal amplitudes and phases of TWSV are given in Fig. 6. For the most basins, the amplitude was shifted towards the GRACE observations. The strongest improvements for the seasonal amplitude are achieved, e.g., for Amazon, Mackenzie, Niger, Orinoco and Zambezi. For some basins, reduced seasonal 25 phase differences between GRACE and WGHM could be achieved by calibration (e.g., Amazon, Mississippi, Ob and Congo). Only phases could be corrected for Columbia, Danube, Lena, Nelson, Parana and Yenisei. No success for the calibration results 4829 again for Huang He in case of the seasonal signal. For Amur and Orange phases differ strongly between GRACE and WGHM, but TWS does not exhibit a distinct seasonal signal in both basins (not shown).

Parameter values and single storage compartments
A detailed analysis of parameter changes (Fig. 7) and their effects on single storage 5 compartments (Figs. 8-9) is provided below for the example of seven river basins of different continents, climatic conditions and calibration success. Storage in lakes, floodplains and wetlands (denoted surface water) is analyzed separately from water in the river channel (denoted river storage) in the following sections.

Amazon
The better representation of TWSV simulations for the tropical Amazon after multi-criterial calibration is mainly due to a lower river flow velocity (SW-2) in the calibrated model version as well as a larger runoff coefficient (SW-1). The adjustment 15 of both parameters is stable against calibration uncertainty from observation errors (Fig. 7a). The parameter changes cause a longer-lasting storage of more water in the river network which leads to larger and delayed seasonal amplitudes of TWS in line with GRACE observations (Fig. 8a) represented with the calibrated model (Fig. 8a). A slightly increased soil water storage is due to the larger rooting depth (SL-1) in the re-calibrated model. But the rooting depth parameter is highly uncertain and it is not significant relative to the original model as can be seen from the wide spread of parameter values for the Pareto solutions in Fig. 7a. The larger value of the parameter wetland depth (SW-4) has nearly no effect 25 on the storage variability in lakes and wetlands in spite of the large importance of wetlands and floodplains for water storage in the Amazon (e.g., Papa et al., 2008). Surface water storage is mainly attributed to river channel storage in WGHM (Fig. 8a) 4830 although the large inundation areas are taken into account as model input. This may indicate structural model errors in representing surface water exchange processes between floodplains and the channel due to the conceptual model formulations and the cell-based simulation of surface water bodies in WGHM.

Mississippi
The Mississippi basin is located in different climate zones ranging from cold to temperate (Fig. 1) and therefore it shows a more complex contribution of the individual storage compartments (see Fig. 8b) than the Amazon. The most important change 10 in TWSV after model calibration is due to a larger soil storage variability and a longer storage persistence in the early summer, caused by a deeper rooting depth (SL-1). Secondly, a higher snow melt temperature (SN-2) causes an increased snow peak and later melting by one month. The changes for snow and soil storage are supported by a lower radiation proportion absorbed by the surface, which leads to 15 higher snow accumulation as well as a delayed snow melt. These parameter changes for the Mississippi compared to the original model are reliable considering calibration uncertainty (Fig. 7d). An earlier seasonal peak of simulated TWS compared to GRACE data (see Fig. 8b) can possibly be attributed to underestimated groundwater storage that are typically characterized by a later seasonal phase compared to 20 near-surface storage. In fact, studies of (Rodell et al., 2007) and Zaitchik et al. (2008) indicate a higher groundwater volume than represented by WGHM. A change for groundwater was prevented by the missing sensitivity of groundwater parameter for WGHM (B11 in Table 3), which may be due to the overlap with soil storage variations. The groundwater parameters should therefore be included in further calibration studies.

Lena
For the Lena basin, the seasonality of river water storage exhibits an opposite 4831 phase to total storage which is dominated by snow storage variations. This makes a fit of the overall small TWSV amplitude (below 50 mm w.eq. in average) more difficult than for the two previous basins. Model improvements by calibration for this cold, high-latitude basin (Fig. 1) mainly are of temporal nature. The phase of TWSV could be corrected (see also Fig. 6) based on changes of water accumulation in 5 snow, river and soil (Fig. 8c). Due to a larger snow melt temperature (SN-2), snow accumulation lasts nearly one month longer while snow melt finally occurs later but more rapidly in April and May. The larger snow albedo (ER-3) decreases snow sublimation and supports the slightly larger variability of snow storage. In line with the later and faster snow melt in spring, water storage dynamics in river network 10 change accordingly. A larger and later monthly runoff peak also corresponds to the river discharge measurements and is better represented by the calibrated model (see embedded graph in Fig. 8c). Changes in soil storage dynamics due to calibration are of minor importance in the Lena basin but, in general, are characterized by slightly larger seasonal variations with a later phase commensurate to the snow dynamics but 15 also to overall lower evapotranspiration rates caused by smaller radiation proportion (ER-1) and PT-coefficient (ER-5) parameters.

Danube
As for Lena, mainly a phase correction of TWSV was achieved by calibration ( Fig. 6) for the cold and partly temperate (Fig. 1) Danube basin. This resulted in a smaller RMSE of TWSV time series (Fig. 5). While the seasonal amplitude was not changed, a better fit of extreme events like heat waves or floods as observed by  Fig. 8). In the calibrated model, snow is melting faster due to a higher snow melt temperature, hence reducing the snow storage volume. The released water is mainly stored in the soil of which the storage capacity was increased by a larger root 4832 depth parameter after calibration. Also river water is reallocated to the soil where it can remain for longer periods during the spring season than in the quickly draining river network. The smaller river discharge in spring is confirmed by observations (not shown here, due to limited space), hence, a smaller RMSE for mean monthly discharge (Fig. 5). Groundwater storage variations slightly decreased and delayed in 5 the Danube basin.

Zambezi
Increased storage variations in the hot-temperate and partly dry Zambezi basin 10 ( Fig. 1) are due to larger soil, groundwater and surface water storage amplitudes (Fig. 9a). The largely corrected seasonal variability of TWSV (Fig. 6) in the calibrated model originates mainly from less evapotranspiration of surface and soil water as controlled by a smaller PT-coefficient (ER-6) and a smaller maximum potential evapotranspiration (ER-7). As WGHM contains only one soil layer, it may be exhausted too quickly by evapotranspiration in the dry Zambezi region instead of being stored in deeper soil layers. This is supported by the increased groundwater volume, that confirms the high relevance of water exchange with deeper soil zones for Zambezi basin (see also Winsemius et al., 2006a). Surface water volume changes in wetlands increase after calibration and cause longer residence times of water in the Zambezi 20 basin. The importance of this storage mechanism in the Zambezi basin was also found by Winsemius et al. (2006a).

Nelson 25
The seasonality of snow and groundwater storage exhibits a marked anti-phase in the Nelson basin according to the WGHM simulation results (Fig. 9b). This decreases model sensitivity for TWS variations and makes an effective calibration of the individual storage components difficult, since many combinations of different snow 4833 and groundwater states can lead to an equally good fit of simulated to GRACE-based TWSV. In addition, the required smoothing of GRACE data has a huge effect on overall water storage dynamics for this basin (Fig. 9). Major seasonal signals are smoothed out, but remaining TWSV time series correspond reasonably well between GRACE and WGHM. Comparatively small changes occur by model re-calibration relative to the 5 original model.

Congo
TWSV in the Congo (Zaire) basin is dominated by inter-annual patterns such as  (Table 4). The inter-annual variations in TWS mainly derive from soil and groundwater storage (Fig. 9c). For the calibrated model, a larger seasonal variability in soil storage causes a slightly delayed phase of storage variability. This delay appears to be compensated by a negative phase shift in groundwater. As a result, the faster 20 outflow of the groundwater (due to a larger outflow coefficient GW-1) causes a smaller groundwater volume and decreases the inter-annual variation of groundwater storage in the calibrated model. Three of the four basins (Nelson, Orange, Congo) with an unsuccessful calibration for TWS exhibit strong inter-annual variations (see Fig. 9b, c for Nelson and Congo).

25
The inter-annual variations are visible in GRACE as well, but the short period of five years used here may impede the effective calibration of inter-annual changes in total storage variability and its components. Furthermore, for Congo, Nelson and Orange a large trade-off occurs for the Pareto solutions between simulation performance of river 4834 discharge simulation and TWS (not shown). Therefore, calibration difficulties within these basins may also be due to the use of mean monthly discharge values, which neglect inter-annual variations during the calibration period. As a further drawback for Congo, available discharge data are from the period 1954-1983 for this basin.
The water mass variation of the Orange basin, which also exhibit inter-annual varia-5 tions (not shown), are smaller than 12 mm of a water column (see Table 4 and Fig. 6) and for some months below GRACE data accuracy. While inter-annual variations are not relevant for the Yukon basin, similar to Nelson, a clear anti-phase between snow and groundwater storage as well as soil storage causes a small model sensitivity for TWS variations.

Global analysis
A global analysis of simulated TWSV for the calibrated model (see spatial distribution in Fig. 10 and variability of basin averages in Table 5) shows that its variability increased for the most river basins compared to the original model. On the global average (last row of Table 5), TWS variability increased by 7 mm w.eq., which is mainly due to a 15 larger variations of soil, river and surface water storage. Most variability is gained within the tropical and temperate regions, like the Amazon (total 60 mm for the basin average), Congo (9 mm), Niger (14 mm), Mekong (35 mm) as well as for the Mississippi (14 mm). A spatial redistribution between sub-regions for some of these basins is visible in Fig. 10, e.g., Ganges and Parana. A smaller total water budget appears only 20 for basins in cold regions like Mackenzie, St. Lawrence, Volga or Yangtze (Table 5). Some further cold regions like Lena or Ob exhibit an unchanged water budget. This comparison shows that TWS variability in the original WGHM was mainly underestimated in tropical and temperate regions but overestimated in cold regions, similar to the seasonal components (Fig. 6).

25
For the individual basins and storages, largest differences to the original model occur within soil storage, mainly for tropical and temperate regions like Mekong, Mississippi, Orinoco, Volta or Zambezi, which is visible by area distributed TWSV differences to the 4835 original model in the lower Fig. 10 and reflected in the basin-averages (Table 5). Soil has the highest seasonal capacity to store water and contributes most to the gravity signal discovered by GRACE that is usually dominated by seasonal features. Linear structures in the spatial distribution TWSV differences to the original model are mainly due to changes in river storage, being the second most contributor to changes for the 5 basin-averages (Table 5). Very large increase of river water volumes occur in rainy tropical regions of Amazon, Mekong and Orinoco, where a slow discharge in the river network causes a longer maintenance of river water in the basin (see analysis for Amazon in Sect. 3.3 above). In contrast, a decrease of river water volume is visible for temperate and dry regions. Snow storage increases for regions in cold climate zones 10 e.g., Columbia, Ob, Yenisei). But it decreases for cold climates with a warm summer (St. Lawrence, Volga, Danube). In these transition zones, less snow precipitation may be due to global climate warming, that is relevant for the calibration period but not incorporated in the calibration of the original model. Simulated groundwater storage changes decreased on the global scale. A large 15 decrease of groundwater variations occurred for regions with a distinct dry season (Ganges, Niger, Nile) and for some cold regions (St. Lawrence, Volga). Groundwater seasonality is usually delayed compared to soil and surface storage, because groundwater recharge and runoff are temporarily filtered by soil transfer processes. As seen from seasonal phase shift between GRACE and WGHM, water often drains too quickly 20 from river basins compared to GRACE even for the calibrated model version. This may be explained by a too small groundwater recharge and volume in WGHM (e.g. Zambezi or Mississippi). Also the sensitivity of the model to changes in groundwater storage may be superimposed by the soil storage with a different seasonal phase. Therefore, future calibrations against GRACE data should include groundwater timing and volume 25 parameters for each river basin.

Conclusions
This study demonstrates that a multi-objective calibration with TWS variations from the GRACE satellite mission and river discharge enables a world-wide improved simulation of changes in the continental water cycle and its compartments. The presented strategy for improving simulations of continental water storage includes the following key large-scales. The multi-objective calibration of WGHM led to higher simulation accuracy for TWS variations and river discharge for most of the 28 calibrated river basins. Seasonal amplitudes and phases of the water budget for most river basins were improved. A global comparison showed that TWS variability was mostly increased for tropical regions. The 20 highest proportion of the increase occurred for soil storage. An analysis of single storage compartments for seven river basins from different continents and diverse climatic regions revealed reasonable changes within single storage compartments of the calibrated model that contributed to a better representation of TWS variability. Herein, the deviation of the calibrated parameter sets to the original model version and their 25 uncertainty due to measurement errors provide an insight into the control and reliability of the individual process simulations.
For some basins, possible model structural errors are uncovered by the calibration, 4837 e.g., too small wetland volumes in the Amazon and the Mississippi basin. For some basins, errors or limits of the calibration data restrict the calibration success. An update of global river discharge data sets to the GRACE mission period is an urgent need for further progress. As another strategy for the calibration of basins with strong interannual variations and scarce discharge data availability, smaller weights could be given 5 to mean monthly discharge data in the calibration process. However, the model representation of TWS variations inheres more parameter equifinality than river discharge due to the lack of absolute values and the integrative nature and limited spatial resolution of GRACE TWS variations. Consequently, GRACE data alone are not adequate for calibrating water storage state variables in large-scale hydrological models.
Calibration difficulties are also due to the complexity of interaction between single storage components and to the inability to separate these storages with the integrative TWSV data. Many different single storage combinations can lead to similar variations in the total water budget of a river basin. The decrease of model sensitivity for TWS or its components is catalyzed if clear anti-phases occur between storage variations 15 in individual compartments. Since groundwater seasonality deviates mostly from the other storages, it plays an important role in the timing of TWSV in a river basin. A parameter sensitivity analysis for such basins should be undertaken carefully and for future studies, an increased attention should be given to groundwater storage in the calibration process.

20
The improvement of large-scale hydrological models and the validation of GRACE water mass estimates is an iterative process. Model structure errors may complicate the calibration of WGHM with GRACE TWSV. But also limited spatial resolution or regional varying accuracy (e.g., Winsemius et al., 2006b) as well as different smoothing effects between GRACE and modelled data may affect the calibration performance.

25
Therefore, GRACE uncertainties are still an important object of research. Furthermore, due to the general data scarcity of hydrological observations at the global scale, newly developed observation systems like GRACE in turn depend on global model estimations for validation and error reduction. This will complicate the independence of 4838 model re-calibrations and again it limits the application of GRACE for hydrology (and vice versa) in the sense of an iterative learning process.
As consequence to the named difficulties, it is desirable to include further hydrological observations into the calibration and validation process. Only satellite data are in the run for large-scale hydrological modelling and the rise of satellite obser-5 vation systems with global coverage are promising. Groundwater observations are not available with global coverage and this storage is not directly accessible by space techniques. But satellite observations of snow storage (e.g., by MODIS, Parajka and Blöschl, 2008), surface water (Papa et al., 2008) or soil moisture from the future satellite missions such as SMOS and SMAP are applicable for tuning or validation of large 10 scale-hydrological models with more than two objectives.
Due to the large diversity of processes in different regions of manifold climatic conditions, global hydrological modelling is a challenging ambition. The present study expands experiences on representing hydrological processes on the global scale with a particular emphasis on water storage dynamics. The continuation of similar stud-15 ies is motivated by the steadily improved accuracy of GRACE solutions and the future prospect of a GRACE-follow on mission. Longer time series of gravity data will in particular allow focusing on hydrological extremes, inter-annual variations and secular trends in both observations and modelling capabilities.  Res., 112, D07103, doi:10.1029/2006JD007522, 2007 Variations of surface water 25 extent and water storage in large river basins: A comparison of different global data sources, Geophys. Res. Lett., 35, L11401, doi:10.1029/2008GL033857, 2008 2002WR001746, 2003. 4817, 4823 Wahr, J., Swenson, S., Zlotnicki, V., and Velicogna, I.: Time-variable gravity from GRACE: First results, Geophys. Res. Lett., 31, L11501, doi:10.1029/2004GL019779, 2004 Accuracy of GRACE mass estimates, Geophys. Res. Lett., 33, L06401, doi:10.1029/2005GL025305, 2006 Werth, S., Güntner, A., Petrovic, S., and Schmidt, R.: Integration of GRACE mass variations into a global hydrological model, Earth Planet. Sc. Lett., 277, 166-173, 2009a. 4817, 4821, 4822, 4823 Werth, S., Güntner, A., Schmidt, R., and Kusche, J.: Evaluation of GRACE filter tools from a hydrological perspective, Geophys. J. Int., in review, 2009b. 4815, 4817, 4825, 4853 models, J. Hydrol., 204, 83-97, 1998. 4817 Zaitchik, B. F., Rodell, M., andReichle, R. H.: Assimilation of GRACE terrestrial water storage data into a land surface model: Results for the Mississippi River Basin, J. Hydrometeorol., 9, 535-548, 2008. 4817, 4831 Zeng, N., Yoon, J.-H., Marengo, J. A., Subramaniam, A., Nobre, C. A., Mariotti, A., and Neelin, Swenson and Wahr (2002) for an a-priori given maximum error of basin average ∆ max ; II: Swenson and Wahr (2002) computed by by the auto-correlation length G l and standard deviation σ 0 of an exponential signal model; III: decorelation method by Kusche (2007) with the power x of the regularization factor a=10 x of the signal covariance matrix).   A m a zo n B2 A m u r B3 C o lu m b ia B4 D a n u b e B5 G a n g e s B6 H u a n g H e B7 In

Fig. 5.
Simulation performance for the 28 calibrated river basins in terms of relative root mean squared error (rRMSE) for river discharge (circles) and TWSV (squares) of the original (gray) and the calibrated model version (black). See Table 4