Reply on RC5

Kraft et al. present a hybrid global modelling framework combining machine learning with simple water balance equations to simulate the most relevant hydrological states and fluxes at global scale and daily resolution. The model has been simultaneously trained with observational data-sets of total water storage, snow water equivalent, evapotranspiration and runoff. Results are evaluated against the same data-sets (split sample) as well as compared to simulations provided by four global hydrological models. The model intercomparison focuses on the attribution of water storage changes to variations in snow pack, soil moisture and groundwater storage. The authors prove that their hybrid modelling framework is able simulate relevant water storage components and fluxes with similar or better performance than state-of-the-art GHMs. Systematic differences between the models and the implications of their findings are extensively discussed.

The contribution is novel and fits well within the scope of HESS, however requires major revision before potential publication. This especially concerns the results section which contains a large number of (complex) figures that are often difficult to read and/or grasp. The reader is required to go back and forth between the main text and the caption in order to understand what is displayed. These comments are already based on the revised set of figures uploaded by the authors. I would recommend to thoroughly revise the results section, potentially drop a few figures and revise the remaining ones, in order to arrive at a more concise and digestable presentation.
The discussion touches many important aspects, however appears lengthy and repetitive at times. I would recommend revision to make it more concise. Further, the significance of the parameter estimates for process-based modeling is overstated in my opinion. The authors acknowledge that, due to the simple model structure and small number of parameters, parameter estimates will tend to compensate for insufficient or lacking process representations and uncertainty in the input data, which undermines their 'physical meaningfulness' and ability to describe specific processes.

Response:
We would like to thank you for appreciating the novelty of the study and for providing valuable comments.
We received similar comments from the other reviewers regarding the complexity of the manuscript and the figures. We plan to simplify some of the figures and also move some parts of model performance to the appendix. This involves reducing the information density (e.g., Figures 6 and 7, see simplified versions in the supplement to this response, Figures B and C) or removing (e.g., Figure 4) figures and streamlining the discussion section, and unifying the terminology. We will also provide additional information on the machine learning model and training procedure, as several questions have been raised by you and other reviewers.
Regarding the relevance of parameters in process-based models, we wanted to highlight that the model parameters are a critical source of uncertainty as well (e.g., Liu and Gupta, 2007, doi: 10.1029. Several related specific instances are mentioned throughout the text for covering different aspects of "parameter uncertainty," such as equifinality issues, scale mismatch between the estimation of parameters (such as hydraulic conductivity in the lab vs the same at half-degree resolution), discrete vs continuous in space (e.g., vegetation type and a corresponding lookup table based on observations at plant/ecosystem level), and temporally fixed vs varying. In the revision, we will tone down the text wherever possible, while still maintaining the message that there are ample rooms for improvement regarding parameter estimations in processbased models.
Below, we provide detailed answers to your comments.

Specific comments
It remains unclear which thresholds and data-sets were used to identify cells with "high anthropogenic impact". Groundwater abstraction is given as an example, however cells with extensive irrigation (irrespective of source) should be removed due to its effects on soil moisture and evapotranspiration.
It remains unclear if the parameters beta_s and beta_g were estimated by the neural network or were preset. In the latter case, please clarify how these parameters were determined.
Response: They are estimated as free parameters (in the sense of not being connected to data), i.e., they are not predicted by a neural network but rather by the optimizer. The gradient-descent based optimizer receives all parameters of the neural networks (NN) and the global parameters. In each optimization iteration, the optimizer can update the weights of the neural network, as well as the two parameters. We will elaborate on it in the revision.

Response:
The global signal is less affected by data uncertainties and reflects how well the model gets the global-scale patterns. We agree that the global signal can be misleading (e.g., Northern and Southern Hemisphere can cancel out parts of the signal due to opposite seasonality), but the ability to reproduce the regional and global signal is . The best would be to report regional performance together with the cell level performance, but it would add more complexity to the manuscript. So, even some of the regional performance evaluations would actually be moved to the appendix (based on suggestions from other reviewers), and the main text would focus on the global signal. Fig. 3 seems little meaningful since the better part of the common time series (2003)(2004)(2005)(2006)(2007)(2008) was part of the training data-set; particularly since NSE and MSE are closely related. In this regard, it would also be interesting to see a direct comparison of the performances achieved by H2M in the training period and the evaluation period, respectively.

Response:
We decided to use the entire time-series because 1) longer time-series yield a more robust performance 2) the model performances of H2M were very similar in the training, validation, and test period. We checked the latter by comparing the model performance for the different sets. Figure D in the supplement to this response (not contained in the manuscript, we will add it to the appendix) shows the performance (RMSE) of the training, the validation and the test set. From the figure, we see that there is no systematic difference in RMSE across the sets. We assume that this is a result of the regularization (e.g., early stopping, weight decay) and the physical constraints, which avoid overfitting as well.
However, we agree that the performance comparison should be made on the test set. We decided to make a compromise: The splitting of the training, validation and test set in the spatial domain makes a comparison difficult, as we would need to show the GHM performance for the different cells corresponding to the test set separately as well. Thus, we will include all the cells (from training, validation, test), but only use the years 2009 to 2012, which were not used for training.  Response: We agree that it makes little sense to use new metrics here. The phase and variance error are insightful, but the message of the figure is that our model struggles in the same regions as the GHMs, a point that could also be made with NSE, for example. We also consider moving the figure to the appendix in the revision, or to removing it entirely.
The insets in Figs. 5 and 6 severely compromise readability and I'd suggest removing them. Further, the x-axis labels seem to be cut off. Please revise.

Response:
We agree that the figures are hard to read/grasp. We will replace the figures with the global signal and move the original figures (with improved axis labels) to the appendix (see Figures B and C in the supplement to this response). Response: Regarding the abbreviation, we agree. For the calculation of the quantiles, we "mix" all time steps and grid cells and calculate the quantiles for bins of CWD. For example, we filter all the values (space and time) for CWD values from 0 to 10 mm. From these values, we calculate the quantiles (independently per cross validation run). Then we take CWD from 10 to 20 mm, etc. We will improve the caption. Fig. 9: The masking color (black) and the darkest shade of the color scale are hardly distinguishable, please revise. In general, I feel that the figure conveys a similar message as Fig. 8. Given the overall large number of figures in the manuscript, this one could be dropped for conciseness.
Response: True, the masking color is not ideal, and we will change it to white. With regards to the similarity of Fig, 8/9: we will consider removing one of them in the revision. Response: Agreed, we will change the masking color.

Minor comments and technical corrections
Response: Thanks for the technical corrections, we will follow all your suggestions listed below and clarify where needed. Please also note the supplement to this comment: https://hess.copernicus.org/preprints/hess-2021-211/hess-2021-211-AC6-supplement.pdf