This work is distributed under the Creative Commons Attribution 4.0 License.
Towards hybrid modeling of the global hydrological cycle
Martin Jung
Marco Körner
Sujan Koirala
Markus Reichstein
Download
- Final revised paper (published on 23 Mar 2022)
- Preprint (discussion started on 25 May 2021)
Interactive discussion
Status: closed
RC1: 'Comment on hess-2021-211', Anonymous Referee #1, 21 Jun 2021
Kraft et al. present a convincing example of the potential of hybrid modelling. The H2M framework combines a neural network with hydrological model constraints. I definitely see the value of this research line in the context of hydrological modelling, demonstrated in this manuscript by the comparison with several established GHMs. Unfortunately, the manuscript is quite hard to read, especially because in the figures several letters dropped off the axes, which made it a puzzle to find out what was shown where (I did not manage to solve this puzzle for Fig 10); this made it hard to estimate whether all conclusions are robust/valid. Besides, some sections and choices are hard to follow for an average HESS reader with average ML knowledge (as I consider myself that way). Below I indicate this in more detail; hopefully the authors will be able to improve and clarify this in a next version.
Please solve the axes issues for all figures. In Fig 4, for instance, the N is missing on the y-axis, the 40 and 60 have dropped off, and the legend is unclear (my guess is it should be H2M and GHMs). In Fig 5/6, the first letters of the months dropped off the x-axis, and it took a while before I realized the two middle panels show variation over the years (not only because the numbers dropped off; it would be helpful to add a label 'years', in the same way it would be helpful to add a y-axis label TWS or SWE). In Figure 10 I don't know which variable is on which corner of the pyramid. Perhaps updated figures can be uploaded in response to this review, so that other reviewers can use these figures.

Besides the axes issues, I think the figures themselves are anyway a challenging read. There is a very high information density in each figure, but the figures are often not directly showing what is most interesting. For Fig 5/6, for instance, one could consider showing the difference between the models and the observations in a bar plot, rather than their temporal dynamics. Figure 2 is only very briefly introduced, even though the climate regions are extensively used for all figures.
It is unclear why the authors have decided to use the forcing from three different data sources, which makes the study more sensitive to inconsistencies / non-closure of balance, etc. Besides, it remains undiscussed how these data compare to the data used in the eartH2Observe project, because it might explain some of the differences with the GHMs.
Grid cells with large withdrawals have been removed but it is unclear which data source was used to identify cells with groundwater withdrawals.
The procedure with the static input layers is unclear to me. First, they are compressed to 30 (l.95-100 on p4) and then from 30 to 12? (l.140 p7).
In the validation, large negative NSE values were rescaled, but in Table 3 the spatial mean NSE is given. This makes the numbers provided here not comparable to NSE values obtained in other studies. The question is whether it should still be called NSE; this can be misleading. In general, section 3.1 is hard to follow, because it is not directly clear what the spatially averaged signal is - is that averaged globally?

Overall, the manuscript has a very high information density, and a combination of unclear figure axes and sometimes unclear terminology ("spatially averaged signal" as an example, but for instance also expressing soil water as a deficit requires the reader to pay a lot of attention) makes it difficult to completely understand what happens where. As written above, I see the potential of hybrid modelling and of the approach of this study (comparing it to GHMs, exploring NN-identified patterns); it would therefore be a waste if readers gave up reading because it is so challenging. Besides, it makes it difficult to estimate whether the conclusions are robust/valid, so I hope the authors can help me, an average HESS reader, by increasing the clarity and readability.
Lieke Melsen
Wageningen University
Citation: https://doi.org/10.5194/hess-2021-211-RC1
AC3: 'Reply on RC1', Basil Kraft, 09 Aug 2021
Kraft et al. present a convincing example of the potential of hybrid modelling. The H2M framework combines a neural network with hydrological model constraints. I definitely see the value of this research line in the context of hydrological modelling, demonstrated in this manuscript by the comparison with several established GHMs. Unfortunately, the manuscript is quite hard to read, especially because in the figures several letters dropped off the axes, which made it a puzzle to find out what was shown where (I did not manage to solve this puzzle for Fig 10); this made it hard to estimate whether all conclusions are robust/valid. Besides, some sections and choices are hard to follow for an average HESS reader with average ML knowledge (as I consider myself that way). Below I indicate this in more detail; hopefully the authors will be able to improve and clarify this in a next version.
Response:
Thank you very much for the comments and suggestions and for helping us to improve the manuscript. First of all, we want to apologize for the issues with the figures. We understand that this made the review difficult and we appreciate your efforts to "solve the puzzle".
We agree that some things need to be simplified. We plan to reduce the complexity of Figures 5 and 6 (showing only the global signal and moving the current figures to the Appendix), and to remove Figure 4. We will also try to better explain the machine learning aspects. The NSE transformation, used to deal with large negative values, could be avoided by simply truncating the figure axes, which is easier to understand; we will change this accordingly. Furthermore, we will improve the sometimes confusing terminology of CWD and SM, although we still need both terms, as the former is used in H2M while the latter is used when comparing the results to the GHMs.
Below, we provide answers to your comments.
Please solve the axes issues for all figures. In Fig 4, for instance the N is missing on the y-axis, and the 40 and 60 have dropped off, and the legend is unclear (my guess is it should be H2M and GHMs).
Response: We will fix the issues in the next version.
In Fig 5/6, the first letters of the months dropped off the x-axis, and it took a while before I realized the two middle panels show variation over the years (not only because the numbers dropped off; it would be helpful to add a label 'years', in the same way it would be helpful to add a y-axis label TWS or SWE).
Response: We will add the labels 'Month', 'Year', 'TWS', and 'SWE' to the respective axes of Fig. 5 and 6.
In Figure 10 I don’t know which variable is on which corner of the pyramid. Perhaps updated figures can be uploaded in response to this review, so that other reviewers can use these figures.
Response: Labels for Fig. 10 will be fixed.
Besides the axes issues, I think the figures themselves are anyway a challenging read. There is a very high information density in each figure, but the figures are often not directly showing what is most interesting. For Fig 5/6, for instance, one could consider showing the difference between the models and the observations in a bar plot, rather than their temporal dynamics.
Response: We think that showing the dynamics is important for our storyline and would prefer to keep the plots. However, we agree that the figures are very complex and hard to grasp. We therefore decided to only include the global signals of TWS and SWE (Figures B and C in the supplement to this response) and their MSC and IAV components. As the global signal does not tell the full story, we will move the original Figures 5 and 6 (with improved axis labels) to the Appendix.
Figure 2 is only very briefly introduced, even though the climate regions are extensively used for all figures.
Response: We will provide more details on the climatic regions (Figure 2) as you suggest.
It is unclear why the authors have decided to use the forcing from three different data sources, which makes the study more sensitive to inconsistencies / non-closure of balance, etc. Besides, it remains undiscussed how these data compare to the data used in the eartH2Observe project, because it might explain some of the differences with the GHMs.
Response: We agree that the uncertainties in the forcing data have a big impact on the model simulations. In choosing our forcing data, we wanted to stay as close as possible to observations, which unfortunately meant combining different sources. For consistency, we have now also trained our model with WFDEI (as used in eartH2Observe). As the results are very similar, we think that the findings are robust (see Figure A in the supplement to this response). We plan to add the figure to the appendix.
Grid cells with large withdrawals have been removed but it is unclear which data source was used to identify cells with groundwater withdrawals.
Response: We used an ad-hoc solution based on Rodell et al. (2018; https://www.nature.com/articles/s41586-018-0123-1). We removed all regions attributed to groundwater depletion (#7, #12, and #14). We will add this explanation to the manuscript.
The procedure with the static input layers is unclear to me. First, they are compressed to 30 (l.95-100 on p4) and then from 30 to 12? (l.140 p7).
Response: The static data was compressed in a pre-processing step: for each 1-degree grid cell, we had 30 latitude x 30 longitude high-resolution cells for 20 variables, i.e., 18,000 values. Instead of feeding this high-dimensional data into the model, we compressed it using an autoencoder and ended up with a (non-interpretable) vector of 30 values per grid cell. This data was used in the model, but before adding the 30 values as an input to the LSTM, the data was further compressed to 12 values. In this second step, the compression is learned as part of the model training. We will explain this aspect better in the revision.
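As a rough illustration of this two-stage compression, consider the following minimal sketch, in which random matrices stand in for the trained autoencoder and the learned dense layer (this is not the authors' code; only the dimensionalities follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (pre-processing): each 1-degree grid cell holds 30 x 30 high-resolution
# cells for 20 static variables = 18,000 values, compressed by an autoencoder to
# a 30-value code. A random projection stands in for the trained encoder here.
n_cells = 5
raw_static = rng.normal(size=(n_cells, 30 * 30 * 20))   # (5, 18000)
encoder = rng.normal(size=(30 * 30 * 20, 30))           # stand-in for the encoder
static_code = raw_static @ encoder                      # (5, 30)

# Stage 2 (inside the model): a layer learned jointly with the rest of the model
# further compresses the 30-value code to the 12 values fed to the LSTM.
dense = rng.normal(size=(30, 12))                       # stand-in learned layer
static_input = np.tanh(static_code @ dense)             # (5, 12)

print(static_input.shape)  # (5, 12)
```

The key difference between the two stages is when the compression is fitted: the autoencoder is trained once beforehand, while the 30-to-12 layer is optimized together with the LSTM.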
In the validation, large negative NSE values were rescaled, but in Table 3 the spatial mean NSE is given. This makes the numbers provided here not comparable to NSE values obtained in other studies. The question is whether it should still be called NSE; this can be misleading. In general, section 3.1 is hard to follow, because it is not directly clear what the spatially averaged signal is - is that averaged globally?
Response: Agreed, we need to be more precise with the terminology. The spatial mean is the global signal, i.e., each time step is averaged separately across space, leading to a single global time series. As all values for the spatial mean in the table are positive, the transformation that deals with large negative values has no effect there. For the cell-level median, a single negative value was reported; it would be ~-0.8 instead of the reported -0.65. We believe that negative NSE values simply indicate that the predictions are worse than the observed mean, and the magnitude of -0.65 versus -0.8 does not change the interpretation. However, it seems indeed that the transformation of negative NSE causes confusion, and we will use the original NSE without transformation in the revision. The only places this affects are the one value in Table 3 and Figure 3, where large negative values occur in some boxplots. Instead of transforming the negative NSE values, we will truncate the y-axis limits in the figure. Your suggestion is much cleaner than transforming the values, which did not add any information.
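For readers unfamiliar with the metric, the untransformed NSE and the meaning of negative values can be sketched as follows (a generic illustration with made-up numbers, not data from the study):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the model's squared error divided by
    the squared deviation of the observations from their own mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0])
print(nse(obs, obs))                              # perfect model: 1.0
print(nse(obs, np.full(4, obs.mean())))           # mean as predictor: 0.0
print(nse(obs, np.array([4.0, 1.0, 4.0, 1.0])))   # worse than the mean: negative
```

Whether such a negative value is -0.65 or -0.8, the qualitative conclusion is the same: the simulation explains less variance than the observed mean does.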
Overall, the manuscript has a very high information density, and a combination of unclear figure axes and sometimes unclear terminology ("spatially averaged signal" as an example, but for instance also expressing soil water as a deficit requires the reader to pay a lot of attention) makes it difficult to completely understand what happens where. As written above, I see the potential of hybrid modelling and of the approach of this study (comparing it to GHMs, exploring NN-identified patterns); it would therefore be a waste if readers gave up reading because it is so challenging. Besides, it makes it difficult to estimate whether the conclusions are robust/valid, so I hope the authors can help me, an average HESS reader, by increasing the clarity and readability.
Response: We will revisit every figure, unify the terminology, and clarify the machine learning methods. With these changes, the information density can be trimmed. However, this study constitutes a first assessment of hybrid modeling at the global scale, and we want to be fully transparent and comprehensive about the strengths and weaknesses of the approach. Thus, we provide as many insights as possible, allowing the readers to grasp the potential but also the limitations of the approach. The presented results were selected to cover the two major aspects of the study: model performance and different hydrological responses. We are of the opinion that covering both aspects is critical to the value of the study. We hope that the changes proposed above will make the manuscript more readable. In addition, we will make an additional effort to avoid repetition and improve conciseness in the revision. Specifically, we plan to simplify Figures 5 and 6; whenever SM or CWD appears in a figure, we will change the label to SM (-CWD) and add a 'wet -> dry' annotation to facilitate interpretation (Fig. 7, 8, and 9). The text will also be improved where SM and CWD are mentioned.
RC2: 'Comment on hess-2021-211', Anonymous Referee #2, 25 Jun 2021
General comments:
This is an interesting paper demonstrating the feasibility of reproducing the simulations of global hydrological models (GHMs) using a hybrid approach (H2M). The latter is based on a toy model consisting of a series of bulk reservoirs, coupled to a statistical model based on machine learning. Results are encouraging. I am not sure that the comparison of H2M with GHMs is completely fair because the precipitation dataset used to force H2M (GPCP) is based on observations, while the one used to force GHMs is derived from the ERA-Interim reanalysis. Another reason why the comparison may be unfair is that the spatial resolution of GHMs is degraded from 0.5 degree to 1 degree to be compared to the H2M simulations. Also, it should be emphasized that some GHMs are uncalibrated models. I was not able to do a complete review of this work because some Figures are not readable.
Recommendation: major revisions.
Particular comments:
- L. 5 (Abstract): Is H2M a new model developed in this study? What is the added value of this approach with respect to more traditional modeling approaches? What is the meaning of H2M acronym?
- L. 95 (22 static variables): unclear because 4 lines correspond to static variables in Table 1, not 22.
- L. 113: procuct?
- L. 166, 174 (softmax, softplus): not all readers may be familiar with these machine learning technical terms. They should be defined.
- L. 199 (model training): more details should be given on the used machine learning approach. Is a local training (one statistical model for each model grid cell) performed or a global training (all model grid cells together represented by the same statistical model)?
- L. 243 (Table 2): CWD and SStor are written here for the first time and were not defined before. A clear definition should be given. The definition of CWD given in the next paragraph is not clear.
- L. 242 (selection of models): How was model selection made? In Schellekens et al., 10 models are considered.
- L. 284 (Table 3): the period of time for which the comparison was made should be indicated.
- L. 286 (model intercomparison): Could be completed with a water balance Table similar to Tables 7 and 8 in Schellekens et al.
- L. 293: Mean or median scores are not very informative in the case of a non-Gaussian score distribution. Could you plot score histograms instead?
- L. 340 (Amazon basin): the Amazon area was affected by droughts (2005, 2010, 2015). Are these drought events visible in the simulations performed in this study?
Citation: https://doi.org/10.5194/hess-2021-211-RC2
AC2: 'Reply on RC2', Basil Kraft, 03 Jul 2021
Dear reviewer,
Thank you very much for helping us to improve the manuscript. We sincerely apologize for the issues with the Figures and understand that you decided to not provide a full review.
We attached the figures that were not displayed correctly to this comment: AC1: 'Comment on hess-2021-211', Basil Kraft, 03 Jul 2021 | https://doi.org/10.5194/hess-2021-211-AC1.
We would greatly appreciate it if you could have a second look, but we also understand that your time is limited.
Best regards,
the authors
Citation: https://doi.org/10.5194/hess-2021-211-AC2
AC4: 'Reply on RC2', Basil Kraft, 09 Aug 2021
Reply to comment https://doi.org/10.5194/hess-2021-211-RC2
General comments:
This is an interesting paper demonstrating the feasibility of reproducing the simulations of global hydrological models (GHMs) using a hybrid approach (H2M). The latter is based on a toy model consisting of a series of bulk reservoirs, coupled to a statistical model based on machine learning. Results are encouraging. I am not sure that the comparison of H2M with GHMs is completely fair because the precipitation dataset used to force H2M (GPCP) is based on observations, while the one used to force GHMs is derived from the ERA-Interim reanalysis. Another reason why the comparison may be unfair is that the spatial resolution of GHMs is degraded from 0.5 degree to 1 degree to be compared to the H2M simulations. Also, it should be emphasized that some GHMs are uncalibrated models. I was not able to do a complete review of this work because some Figures are not readable.
Recommendation: major revisions.
Response:
We are grateful for the comments and suggestions, which help us to improve the manuscript. Once again, we apologize for the issue with the figures, and we understand that you could not provide a full review under the given circumstances. We also thank you for the second comment after the correction of the figures. Here, we reply to both comments (https://doi.org/10.5194/hess-2021-211-RC2, https://doi.org/10.5194/hess-2021-211-RC3).
Regarding the general comments, we have now added an analysis using the same WFDEI forcing as eartH2Observe to make the comparison as fair as possible. In fact, this was also suggested by reviewer 1. The analysis reveals that the H2M performance is similar when using WFDEI. This additional result will be added to the appendix of the final version (see Figure A in the supplement to this response).
On the “fairness” of the comparison, we agree that the H2M framework is driven by observational data and, by design, should be expected to be closer to the observations. One of the main motivations for hybrid modeling is that it can make use of large amounts of data and learn from them (including data-specific biases).
As you mentioned, most GHMs, not only those in eartH2Observe, are either not calibrated at all or calibrated against river discharge, and are not able/allowed to learn from the observed data. Among the GHMs considered here, only LISFLOOD uses catchment-based runoff for calibration, as mentioned in L 289.
We would also like to emphasize that the evaluation of H2M against the GHMs is intended as benchmarking, which is essential for a qualitative validation of H2M, rather than as a devaluation of the state-of-the-art GHMs, as both approaches have their strengths and weaknesses. We tried to make this clear by stressing that the model performance comparison is not the core component of this study (L 290), but rather illustrates a strength (the local adaptivity) of the approach.
Lastly, regarding the spatial aggregation, we agree that the information in the half-degree simulations is potentially altered by aggregation to one degree. We assume that the differences among the half-degree grid cells within a one-degree cell are much smaller than the spatial variability across the globe. In fact, such aggregations and disaggregations are quite common in CMIP model intercomparisons, where models with different spatial resolutions are evaluated against each other as well as combined to generate model ensembles. Nonetheless, we will add a clarification of this assumption in the revised text.
In summary, we think that you raised essential questions that benefit the manuscript and guide the readers to evaluate the results in the way we intended. We will gladly address these issues in the revised manuscript.
Below, we indicate the answers to your particular comments, which we consider very useful as well.
Particular comments:
L. 5 (Abstract): Is H2M a new model developed in this study? What is the added value of this approach with respect to more traditional modeling approaches? What is the meaning of H2M acronym?
Response: The model was first introduced here: https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-1537-2020 and is now further developed and evaluated in this study. The acronym (hybrid hydrological model, H2M) was used without introduction in the abstract and the benefit of using a hybrid approach could be stated more clearly, which we will fix in the revised version.
L. 95 (22 static variables): unclear because 4 lines correspond to static variables in Table 1, not 22.
Response: We used four data products, some of which represent multiple variables (e.g., land cover fractions of water bodies, wetlands, artificial surfaces, tundra, permanent snow and ice, etc.), 20 variables in total (the text stated 22, but it is actually 20). We will correct and clarify this in the next version.
L. 113: procuct?
Response: Fixed.
L. 166, 174 (softmax, softplus): not all readers may be familiar with these machine learning technical terms. They should be defined.
Response: We will add the definitions to the manuscript.
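For reference, the two functions can be written as follows. These are the standard definitions in a short, numerically stable sketch, not the manuscript's exact formulation:

```python
import numpy as np

def softmax(x):
    """Maps a vector to positive weights that sum to 1; useful for partitioning
    a quantity (e.g., incoming water) among competing pathways."""
    z = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return z / z.sum()

def softplus(x):
    """Smooth, strictly positive function log(1 + e^x); useful for constraining
    a predicted quantity to be non-negative."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))  # overflow-safe form

w = softmax(np.array([2.0, 1.0, 0.1]))
print(w.sum())         # 1.0
print(softplus(0.0))   # log(2) ~ 0.693
```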
L. 199 (model training): more details should be given on the used machine learning approach. Is a local training (one statistical model for each model grid cell) performed or a global training (all model grid cells together represented by the same statistical model)?
Response: It is a global model processing each grid cell individually, i.e., one model learns the dynamics of all pixels. We will make this clearer in the revision.
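The distinction can be sketched as follows (a conceptual toy example, not the actual H2M/LSTM code): one shared parameter set is applied independently to every grid cell's time series, instead of fitting separate parameters per cell.

```python
import numpy as np

rng = np.random.default_rng(1)

n_cells, n_steps, n_features = 4, 10, 3
forcing = rng.normal(size=(n_cells, n_steps, n_features))  # per-cell time series

# "Global" training: ONE parameter vector shared by all grid cells; each cell is
# processed independently, but all cells contribute to updating the same weights.
shared_weights = rng.normal(size=(n_features,))
predictions = forcing @ shared_weights                     # (4, 10), no per-cell parameters

print(predictions.shape)  # (4, 10)
```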
L. 243 (Table 2): CWD and SStor are written here for the first time and were not defined before. A clear definition should be given. The definition of CWD given in the next paragraph is not clear.
Response: We will improve this in the revision.
L. 242 (selection of models): How was model selection made? In Schellekens et al., 10 models are considered.
Response: We only selected the models for which groundwater storage was available (mentioned in L 243).
L. 284 (Table 3): the period of time for which the comparison was made should be indicated.
Response: We will add the time period (2009 to 2014) to the table caption.
L. 286 (model intercomparison): Could be completed with a water balance Table similar to Tables 7 and 8 in Schellekens et al.
Response: We will add a table showing ET, Q, Precip, and delta Storage to the revised version (see Table A in the supplement to this response).
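The bookkeeping behind such a table is the long-term water balance, delta S = P - ET - Q; a minimal sketch with purely illustrative numbers (not values from the study):

```python
# Illustrative annual water balance for one region (mm/yr, made-up numbers):
precip = 1000.0     # P, precipitation
et = 600.0          # ET, evapotranspiration
runoff = 380.0      # Q, runoff
delta_storage = precip - et - runoff   # storage change closing the balance

print(delta_storage)  # 20.0
```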
L. 293: Mean or median scores are not very informative in the case of a non-Gaussian score distribution. Could you plot score histograms instead?
Response: This could be a misunderstanding: we report the performance of the global signal (the 'spatially aggregated mean'; terminology to be improved at the request of reviewer 1) and the cell-level median. The cell-level distribution is also shown, at least for the NSE, in Figure 3. Showing boxplots as in Fig. 3 for all the metrics would be too extensive in our opinion, as the manuscript is already loaded with figures. However, we can add the respective figures at your or the Editor's request.
L. 340 (Amazon basin): the Amazon area was affected by droughts (2005, 2010, 2015). Are these drought events visible in the simulations performed in this study?
Response: Figure 6 shows the Amazon region (T1 S-AM tropical) in detail. Of the years you mentioned, only 2005 and 2010 are covered by our simulations. In both cases, H2M reproduces the GRACE patterns quite well.
Reply to comment https://doi.org/10.5194/hess-2021-211-RC3.
Thank you for the additional comments!
In Figure 10, CWD is indicated as one of the 3 considered variables, while the Figure itself shows SM.
Response: This will be fixed in the revision.
In the whole paper, there is a confusion between CWD and SM.
Response: This issue was also brought up by reviewer 1, and we agree that this aspect needs to be improved. We have to use both terms, as our model simulates CWD while the GHMs simulate SM. Whenever we compare the models, we use the negative CWD dynamics as SM dynamics. We will make this clearer by using better figure labels and by providing interpretations in the text (e.g., "higher CWD (drier soil)" etc.).
L. 249 ("We consider the dynamics of CWD to correspond to SM and thus, the terms are used interchangeably when talking about soil moisture dynamics"): has to be clarified.
Response: We agree that this aspect needs to be improved.
AC1: 'Comment on hess-2021-211', Basil Kraft, 03 Jul 2021
As noted by the reviewers, there were issues with some of the figures in the preprint version. We apologize for any inconvenience caused. The respective figures are added here as an attachment to enable a proper discussion.
A detailed response to the reviewers' comments will follow later.

Kind regards,
the authors
RC3: 'Comment on hess-2021-211', Anonymous Referee #2, 13 Jul 2021
Thanks for the new version of the Figures.
In Figure 10, CWD is indicated as one of the 3 considered variables, while the Figure itself shows SM.
In the whole paper, there is a confusion between CWD and SM.
L. 249 ("We consider the dynamics of CWD to correspond to SM and thus, the terms are used interchangeably when talking about soil moisture dynamics"): has to be clarified.
Citation: https://doi.org/10.5194/hess-2021-211-RC3
AC4: 'Reply on RC2', Basil Kraft, 09 Aug 2021
Reply to comment https://doi.org/10.5194/hess-2021-211-RC2
General comments:
This is an interesting paper demonstrating the feasibility of reproducing the simulations of global hydrological models (GHMs) using a hybrid approach (H2M). The latter is based on a toy model consisting of a series of bulk reservoirs, coupled to a statistical model based on machine learning. Results are encouraging. I am not sure that the comparison of H2M with GHMs is completely fair because the precipitation dataset used to force H2M (GPCP) is based on observations, while the one used to force GHMs is derived from the ERA-Interim reanalysis. Another reason why the comparison may be unfair is that spatial resolution of GHMs is degraded from 0.5 degree to 1 degree to be compared to the H2M simulations. Also, it should be emphasized that some GHMs are uncalibrated models. I was not able to do a complete review of this work because some Figures are not readable.
Recommendation: major revisions.
Response:
We are grateful for the comments and suggestions, which helps us to improve the manuscript. Once again, we apologize for the issue with the figures and we understand that you could not provide a full review under the given circumstances. We also thank you for the second comment after the correction of the figures. Here, we reply to both comments (https://doi.org/10.5194/hess-2021-211-RC2, https://doi.org/10.5194/hess-2021-211-RC3).
Regarding the general comments, we have now added an analysis of the same WFDEI forcings as used in eartH2Observe to make the comparison as fair as possible. In fact, this was also suggested by reviewer 1. The analysis reveals that the H2M performance is similar when using WFDEI. This additional result will be added to the final version in the appendix (see Figure A in the supplement to this response).
On the “fairness” of the comparison, we agree that the H2M modeling framework is driven by the observation data, and by design, it should be expected to be closer to the observation. One of the main motivations for hybrid modeling is that it can make use of large amounts of data, and learn from data (including data-specific biases).
As you mentioned, most of the GHMs, not only those in eartH2Observe, are often either not calibrated at all or calibrated against river discharge and are not able/allowed to learn from the observed data. Within the GHMs, only LISFLOOD uses catchment-based runoff for calibration, as clearly mentioned in L 289.
We would also like to emphasize here that the evaluation of the performance of H2M against the GHMs is actually the benchmarking that is more essential to the qualitative validation of the H2M rather than devaluating the state-of-the-art GHMs, as both approaches have their strengths and weaknesses. We tried to make this clear by stressing that the model performance comparison is not the core component of this study (L 290), but shows the strengths (the local adaptivity) of the approach.
Lastly, regarding the spatial aggregation, we agree that the information in the half degree simulation is potentially altered due to aggregation to one degree. We assume that the differences within the 2 half degree grid cells are much smaller than the spatial variability across the globe. In fact, such aggregation and disaggregations are quite common in CMIP model intercomparisons, where the models with different spatial resolutions are evaluated against each other as well as combined together to generate model ensembles. Nonetheless, we will add a clarification of the assumption in the revised text.
In summary, we think that you raised very essential questions that would benefit the manuscript and guide the readers to evaluate the results in a way we have intended to. We will gladly address the issues in the revised manuscript.
Below, we indicate the answers to your particular comments, which we consider very useful as well.
Particular comments:
L. 5 (Abstract): Is H2M a new model developed in this study? What is the added value of this approach with respect to more traditional modeling approaches? What is the meaning of H2M acronym?
Response: The model was first introduced here: https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-1537-2020 and is now further developed and evaluated in this study. The acronym (hybrid hydrological model, H2M) was used without introduction in the abstract and the benefit of using a hybrid approach could be stated more clearly, which we will fix in the revised version.
L. 95 (22 static variables): unclear because 4 lines correspond to static variables in Table 1, not 22.
Response: We used four data products, some of them representing multiple variables (e.g., land cover fractions of water bodies, wetlands, artificial surfaces, tundra, permanent snow and ice, etc.), 20 variables (used to be 22 in the text but actually it is 20) in total. We will improve this aspect in the next version.
L. 113: procuct?
Response: Fixed.
L. 166, 174 (soltmax, softplus): all readers may not be familiar with these machine learning technical terms. They should be defined.
Response: We will add the definitions to the manuscript.
L. 199 (model training): more details should be given on the used machine learning approach. Is a local training (one statistical model for each model grid cell) performed or a global training (all model grid cells together represented by the same statistical model)?
Response: It is a global model processing each grid-cell individually, i.e., one model learns the dynamics of all pixels. We will make this clearer in the revision.
L. 243 (Table 2): CWD and SStor are written here for the first time and were not defined before. A clear definition should be given. The definition of CWD given in the next paragraph is not clear.
Response: We will improve this in the revision.
L. 242 (selection of models): How was model selection made? In Schellekens et al., 10 models are considered.
Response: We selected only the models for which groundwater storage was available (mentioned in L 243).
L. 284 (Table 3): the period of time for which the comparison was made should be indicated.
Response: We will add the time period (2009 to 2014) to the table caption.
L. 286 (model intercomparison): Could be completed with a water balance Table similar to Tables 7 and 8 in Schellekens et al.
Response: We will add a table showing ET, Q, Precip, and delta Storage to the revised version (see Table A in the supplement to this response).
L. 293: Mean or median scores are little informative in case of non-Gaussian score value statistical distribution. Could you plot score histograms instead?
Response: This could be a misunderstanding: we report the performance of the global signal ('spatially aggregated mean'; terminology to be improved as requested by reviewer 1) and the cell-level median. The cell-level distribution is also shown, at least for the NSE, in Figure 3. Showing boxplots as in Fig. 3 for all metrics would be too extensive in our opinion, as the manuscript is already loaded with figures. However, we can add the respective figures upon your or the Editor's request.
L. 340 (Amazon basin): the Amazon area was affected by droughts (2005, 2010, 2015). Are these drought events visible in the simulations performed in this study?
Response: Figure 6 shows the Amazon region (T1 S-AM tropical) in detail. From the years you mentioned, only 2005 and 2010 are covered by our simulations. In both cases, the H2M models reproduce the GRACE patterns quite well.
Reply to comment https://doi.org/10.5194/hess-2021-211-RC3.
Thank you for the additional comments!
In Figure 10, CWD is indicated as one of the 3 considered variables, while the Figure itself shows SM.
Response: This will be fixed in the revision.
In the whole paper, there is a confusion between CWD and SM.
Response: This issue was also brought up by reviewer 1 and we agree that this aspect needs to be improved. We have to use both terms, as our model simulates CWD while the GHMs simulate SM. Whenever we compare the models, we use the negative CWD dynamics as SM dynamics. We will make this clearer by using better figure labels and by providing interpretations in the text (e.g., “higher CWD (drier soil)” etc.).
L. 249 ("We consider the dynamics of CWD to correspond to SM and thus, the terms are used interchangeably when talking about soil moisture dynamics"): has to be clarified.
Response: We agree that this aspect needs to be improved.
AC4: 'Reply on RC2', Basil Kraft, 09 Aug 2021
RC4: 'Comment on hess-2021-211', Derek Karssenberg, 27 Jul 2021
The manuscript proposes machine learning (with extensive observational data as input) to identify spatially and temporally varying values of parameters in an extremely simple forward simulation model of hydrology. A small number of parameters represents how water fluxes are partitioned over various storages (however they vary in space and time!). Results are evaluated 1) by comparing prediction error with existing global hydrological simulation models, and 2) by exploring spatio-temporal patterns in parameters values and providing possible explanations of these patterns in terms of processes.
The work presented is innovative. The authors have recently published an article on the same model and data sets (Kraft et al, 2020 – referenced in the manuscript), but the current manuscript extends on this by providing a more extensive evaluation of the results (and it seems minor adjustments in the methodology). Because of the innovation I am in the opinion this is a promising manuscript. However, it needs considerable revision in particular regarding the presentation of the material: figures are often very unclear (it is sometimes even unclear what attributes are shown), the text is rather long and could be condensed providing at the same time more focus. Regarding the latter: this seems to be a proof-of-concept paper. It is thus not essential to provide a complete evaluation of the model. Instead, I believe it is more important to properly explain the methodology and key outcomes. In revising the paper I suggest the authors to possibly leave some of the results out (e.g. less figures, simpler figures) – it wouldn’t harm the paper but may make it more accessible.
My main comments are:
Figures (and tables)
Most of the figures are hard to understand. Legends are often missing, panels are included without any explanation of what is shown, variables are plotted for which it has not been explained how they are calculated. This needs considerable improvement. Please enlarge the figures as well. In my detailed comments I pin-point a few things that are not clear, please consider this as examples (not a complete list).
Objective function
The objective function contains four different observational data (terrestrial water storage, evapotranspiration, runoff, snow water equivalent). It is completely unclear how each of these are weighted in the objective function. Do you ‘calibrate’ against standardized values of these attributes? If so, how are these standardized? Please explain. Note that it is to be expected that this weighting has a strong influence on the results, e.g. if more ‘weight’ is assigned in the objective function to runoff, the model will perform better in runoff prediction. I do not expect you to explore different objective functions but at least the objective function needs to be given and it needs to be explained that this is quite arbitrarily chosen. Note that for instance in Bayesian data assimilation ‘weights’ of observations will depend on the uncertainty associated with the observations (high uncertainty -> low weight). This is not the case in your approach.
Training, validation, testing data sets
In machine learning, one uses training and validation sets in the procedural step of model building. Model evaluation then is done using a data set not used for model building (this is often referred to as the test data set in the machine learning world). It seems you are not separating validation and testing data sets. This is an important issue – evaluation of the model (all the performance metrics provided, almost all plots provided with model outcomes) needs to be done on data that are not at all involved in the model building phase. If you are evaluating on the same data as used for building the models, this should be clearly indicated in the manuscript and implications of this extensively discussed.
Context, aim of modelling
I could argue that in your comparison of your approach to modelling (hybrid modelling) and existing global hydrological models you are comparing apples and oranges. The main aim of hybrid modelling (as defined in your manuscript) is prediction (in the statistical sense): estimating variables at time steps for which observational data are not available. Existing global hydrological models however aim at scenario analysis (and oftentimes prediction as well). Scenario analysis may involve evaluating effects of climate change, effects of future changes in water allocation (e.g., irrigation, domestic water use), effects of future changes in land use, etc. Hybrid modelling is not suitable at all for scenario analysis as it almost completely relies on observational data on the system. If the system changes, it won’t work anymore. I may exaggerate somewhat (to make my point clear), and you may disagree in which case I challenge you to convince me otherwise. In any case I suggest you 1) mention this difference in the introduction of the manuscript and 2) discuss this issue in the Discussion section. I consider this important in particular because this is a ‘proof-of-concept’ paper and it is thus important to position this work in the broader context.
Detailed comments (please note my comments on the figures are not complete)
p. 3, end of introduction
The introduction gives a good overview of past work in the domain. However, on line 68-72) it remains somewhat unclear what the contribution of this paper is. I suggest stating this more explicitly and also to state more explicitly what this paper adds compared to your previous publication (Kraft et al., 2020).
Code sharing
I strongly suggest sharing the code of your model on a public repository (e.g., GitHub). It will make your work more credible and will enable other researchers to build on your research.
p. 5, line 113
Clearly state what Q refers to. It is the amount of runoff generated in a pixel (or area of land), not the discharge from the pixel (streamflow). The latter can only be calculated in a spatial model that does channel routing.
p. 5, line 123
Adjust the numbering, is this a nested numbered list?
p. 6, model description
Is this a spatial model, i.e. does it include spatial interactions in the time transition functions? I don’t think so, it is a local (point model). Please state this clearly.
p. 6, figure 1
The caption is too long. Reduce it and explain concepts in the main text.
p. 7, neural network
I am unsure you are building the machine learning model for all cells at once (single model) or for each cell separately (number of models equals number of cells). If the former, the method you propose is, I believe, fully non-spatial (point model for hydrology, identified separately for each cell with observations for that cell). Please explain this clearly.
p. 9, line 196
Why DELTA T instead of T?
p. 9, model training
I am wondering how you train the model. Machine learning models typically do not run forward in time. However, in this application, they are fed by temporally changing data, in a forward timestep approach. How is this done? Please explain. Providing the code would help as well.
Figure 2
Too small.
Section 2.5
Which runs of the global hydrological models were used?
Caption Table 3
What is ‘median-cell level’? Please explain.
Figure 3
Why are you not including results for ET and Q as well?
Figure 4
Explain root phase and variance error in Methods. I am in the opinion however this plot could be left out.
Figure 5.
What are the small panels on the left and right side? What is plotted on the x-axis? The figure is hard to understand. It needs to be simplified (and possibly include more detailed information in a digital supplement or appendix).
Figure 7.
What is represented by each line in the figure? A single location? A single year? ‘quantile of the spatio-temporal distribution’ is not easy to understand.
Figure 8.
What is represented by the colours?
Figure 10.
Too small. Consider selecting runs and plotting these.
Figure 11.
Too small. What is along the x-axis? Months? It runs up to 13.
Discussion section
The discussion is interesting but it is somewhat long. Consider reducing it somewhat focusing on the main things (that are relevant to the research objective and questions).
Derek Karssenberg (reviewer)
Citation: https://doi.org/10.5194/hess-2021-211-RC4
AC5: 'Reply on RC4', Basil Kraft, 09 Aug 2021
The manuscript proposes machine learning (with extensive observational data as input) to identify spatially and temporally varying values of parameters in an extremely simple forward simulation model of hydrology. A small number of parameters represents how water fluxes are partitioned over various storages (however they vary in space and time!). Results are evaluated 1) by comparing prediction error with existing global hydrological simulation models, and 2) by exploring spatio-temporal patterns in parameters values and providing possible explanations of these patterns in terms of processes.
The work presented is innovative. The authors have recently published an article on the same model and data sets (Kraft et al, 2020 – referenced in the manuscript), but the current manuscript extends on this by providing a more extensive evaluation of the results (and it seems minor adjustments in the methodology). Because of the innovation I am in the opinion this is a promising manuscript. However, it needs considerable revision in particular regarding the presentation of the material: figures are often very unclear (it is sometimes even unclear what attributes are shown), the text is rather long and could be condensed providing at the same time more focus. Regarding the latter: this seems to be a proof-of-concept paper. It is thus not essential to provide a complete evaluation of the model. Instead, I believe it is more important to properly explain the methodology and key outcomes. In revising the paper I suggest the authors to possibly leave some of the results out (e.g. less figures, simpler figures) – it wouldn’t harm the paper but may make it more accessible.
Response:
Thank you very much for appreciating the value of the manuscript and providing very relevant comments that will help us to improve the manuscript.
A couple of issues were also raised by the other reviewers, especially the clarity and complexity of the figures. To address that, we will revise the focus of Figures 5 and 6 and show the global means of terrestrial water storage and snow water equivalent (Figures B and C in the supplement to this response). The original figures for the regional variations will be moved to the appendix so that interested readers can still access them. Additionally, we will include only the critical parts of the performance evaluation in the main text, while providing the details in the appendix. We also agree that the proof-of-concept here goes beyond the performance evaluation: it brings hydrological responses into a data-driven prediction, which is rarely the focus of machine-learning-based approaches.
My main comments are:
Figures (and tables)
Most of the figures are hard to understand. Legends are often missing, panels are included without any explanation of what is shown, variables are plotted for which it has not been explained how they are calculated. This needs considerable improvement. Please enlarge the figures as well. In my detailed comments I pin-point a few things that are not clear, please consider this as examples (not a complete list).
Response: As pointed out in the other reviews, the figures were, unfortunately, corrupted on upload. We uploaded the corrected figures and hope that you were able to access them during the review. In the new figures, the missing text and legends were restored, although some axis labels are still missing. We will also try to make the legends more prominent wherever possible. Regarding figure size, the HESS style guide proposes two sizes, one for a single column and one for full text width, which still does not really use the full vertical space. We will try to improve this aspect as well and will coordinate with the HESS editorial team to provide the best possible figure quality when and if the manuscript reaches the final publication phase.
Objective function
The objective function contains four different observational data (terrestrial water storage, evapotranspiration, runoff, snow water equivalent). It is completely unclear how each of these are weighted in the objective function. Do you ‘calibrate’ against standardized values of these attributes? If so, how are these standardized? Please explain. Note that it is to be expected that this weighting has a strong influence on the results, e.g. if more ‘weight’ is assigned in the objective function to runoff, the model will perform better in runoff prediction. I do not expect you to explore different objective functions but at least the objective function needs to be given and it needs to be explained that this is quite arbitrarily chosen. Note that for instance in Bayesian data assimilation ‘weights’ of observations will depend on the uncertainty associated with the observations (high uncertainty -> low weight). This is not the case in your approach.
Response: To avoid too much overlap with the first publication (Kraft et al. 2020), we did not elaborate on the multi-objective approach. In L 219/220, we state: “The five loss terms were dynamically weighted using a self-paced task weighting approach proposed by Kendall et al. (2018)—we refer to Kraft et al. (2020) for more details.” The task weights (found by automatic task weighting) are shown in Appendix B.
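The Kendall et al. (2018) scheme referenced above can be sketched as follows (an illustrative stand-alone version, not the manuscript's code; variable names are ours). Each task loss is scaled by a learned uncertainty, and a log-penalty keeps the uncertainties from growing without bound:

```python
import math

def weighted_multitask_loss(task_losses, log_sigmas):
    """Combine per-task losses with learned uncertainty weights
    (Kendall et al., 2018): total = sum_i( L_i / (2*sigma_i^2) + log(sigma_i) ).

    A task with a high learned uncertainty (large sigma_i) contributes
    less of its loss, but pays the log(sigma_i) penalty, so the weights
    self-balance during training instead of being hand-tuned.
    """
    total = 0.0
    for loss, log_sigma in zip(task_losses, log_sigmas):
        sigma_sq = math.exp(2.0 * log_sigma)
        total += loss / (2.0 * sigma_sq) + log_sigma
    return total
```

In practice the `log_sigmas` are trainable parameters optimized jointly with the network weights; with all of them at zero, the total is simply half the sum of the task losses.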
Training, validation, testing data sets
In machine learning, one uses training and validation sets in the procedural step of model building. Model evaluation then is done using a data set not used for model building (this is often referred to as the test data set in the machine learning world). It seems you are not separating validation and testing data sets. This is an important issue – evaluation of the model (all the performance metrics provided, almost all plots provided with model outcomes) needs to be done on data that are not at all involved in the model building phase. If you are evaluating on the same data as used for building the models, this should be clearly indicated in the manuscript and implications of this extensively discussed.
Response: This aspect is discussed in Section 2.3, Model training (starting at L 200): we split the data in both the temporal and spatial domains into training, validation, and test sets. For the spatial splitting, we enforced a minimum distance of one pixel (110 km at the equator) between cells of different sets, which is a trade-off between data limitations and cross-validation requirements. In the time domain, we split the data into two sets (2002 to 2008 / 2009 to 2014) and use the first for training and the second for validation and testing. We know that this is not ideal, but the data coverage is limited and we tried to find a good compromise. As stated in L 209, more details on the cross-validation are provided in Kraft et al. (2020).
For performance evaluation, we used the test set only (e.g., Table 3), while for the more qualitative evaluation, we decided to use the full time range covered by the observations. Figure D in the supplement to this response (not contained in the manuscript, we will add it to the appendix) shows the performance (RMSE) of the training, the validation, and the test set. From the figure, we see that there is no systematic difference in RMSE across the sets. We assume that this is a result of the regularization (e.g., early stopping, weight decay) and the physical constraints, which are designed to avoid overfitting.
However, to address the issue, we decided to use only the test years (2009 to 2012) for Figure 3, as this makes the comparison fairer, and to explain why we use the full time range (2003 to 2012) for the qualitative evaluation/comparison.
Context, aim of modelling
I could argue that in your comparison of your approach to modelling (hybrid modelling) and existing global hydrological models you are comparing apples and oranges. The main aim of hybrid modelling (as defined in your manuscript) is prediction (in the statistical sense): estimating variables at time steps for which observational data are not available. Existing global hydrological models however aim at scenario analysis (and oftentimes prediction as well). Scenario analysis may involve evaluating effects of climate change, effects of future changes in water allocation (e.g., irrigation, domestic water use), effects of future changes in land use, etc. Hybrid modelling is not suitable at all for scenario analysis as it almost completely relies on observational data on the system. If the system changes, it won’t work anymore. I may exaggerate somewhat (to make my point clear), and you may disagree in which case I challenge you to convince me otherwise. In any case I suggest you 1) mention this difference in the introduction of the manuscript and 2) discuss this issue in the Discussion section. I consider this important in particular because this is a ‘proof-of-concept’ paper and it is thus important to position this work in the broader context.
Response: Thanks for an interesting comment. Perhaps we were not clear enough in the manuscript, but the comparison of H2M with GHMs was intended as a validation of H2M itself rather than as an effort to provide an alternative GHM. The holy grail of hybrid modeling would be to tightly combine the best of machine learning methods and physical models: for example, enforcing mass balance and hydrological responses within a machine learning method, while retaining a data adaptability that physical models often lack due to process simplification, abstraction, and the rigidity of physical/empirical parameters.
Future predictions are a challenge in general, even for physical models, as seen through numerous generations of model intercomparison projects. Even the validity of the "physical" parameters (which are often tuned through iterations of model development) can be an interesting puzzle to solve. Still, compared with GHMs, we cannot state that machine learning methods are better at future prediction. In fact, the data space that machine learning methods are trained on may cover a climate spectrum that already contains part of the projected changes in climate. The particular aspect of "interpolation" vs. "extrapolation" by machine learning methods is discussed well in Jung et al. (2020, https://doi.org/10.5194/bg-17-1343-2020). Lastly, one could argue that the data adaptivity of machine learning methods gives them greater flexibility in dealing with non-linear responses of the land surface to changes in climate (as touched upon in the hydrological responses section of the manuscript). We concede, however, that more research on the applicability of hybrid models to scenario-type analysis is necessary; only such research would allow a convincing argument rather than the speculative one we can make now based on the state of the art. We hope that our proof-of-concept can contribute towards that.
As suggested, we will add relevant aspects of this comment and response in the revision.
Detailed comments (please note my comments on the figures are not complete)
p. 3, end of introduction
The introduction gives a good overview of past work in the domain. However, on line 68-72) it remains somewhat unclear what the contribution of this paper is. I suggest stating this more explicitly and also to state more explicitly what this paper adds compared to your previous publication (Kraft et al., 2020).
Response: We will improve this aspect in the revision.
Code sharing
I strongly suggest sharing the code of your model on a public repository (e.g., GitHub). It will make your work more credible and will enable other researchers to build on your research.
Response: The code and simulations are available online (see Code and data availability statement, L 653) at https://github.com/bask0/h2m.
p. 5, line 113
Clearly state what Q refers to. It is the amount of runoff generated in a pixel (or area of land), not the discharge from the pixel (streamflow). The latter can only be calculated in a spatial model that does channel routing.
Response: The runoff from the GRUN dataset is the streamflow measured at the outlet divided by the catchment area. The authors (Ghiggi et al., 2019) use catchments with a size similar to the grid-cell size of the forcing variables for model training, and state that, at a monthly scale, runoff equals the streamflow-derived estimate. We will add a short explanation in the revised version.
From the original paper (Ghiggi et al., 2019):
“Runoff is defined here as all the water draining from a small land area. Runoff cannot be observed directly, but at a monthly timescale the average catchment runoff can be assumed to equal the monthly streamflow measured at the outlet divided by the catchment area, provided storage of river water (e.g. in dams, reservoirs) and/or river water losses (e.g. river channel and lake evaporation, irrigation) are minimal. Thus, runoff rates (millimetres per month) are obtained by dividing the GSIM river discharge (cubic metres per month) with the station's upstream catchment area (km2). We then select catchments with an area comparable to the grid-cell size of the atmospheric forcing data in order to derive observational estimates of the runoff rate response to changes in atmospheric forcing.”
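The conversion described in the quote is simple arithmetic; a sketch (function and variable names are ours, for illustration only):

```python
def discharge_to_runoff_mm(discharge_m3_per_month, catchment_area_km2):
    """Convert monthly streamflow at the outlet (m^3/month) into a mean
    catchment runoff rate (mm/month), assuming storage and losses are
    negligible, as in Ghiggi et al. (2019).

    1 km^2 = 1e6 m^2 and 1 m of water depth = 1000 mm, so the net
    conversion factor is 1e-3.
    """
    area_m2 = catchment_area_km2 * 1e6
    depth_m = discharge_m3_per_month / area_m2
    return depth_m * 1000.0
```

For example, 10^6 m^3 per month drained from a 100 km^2 catchment corresponds to a runoff rate of 10 mm per month.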
p. 5, line 123
Adjust the numbering, is this a nested numbered list?
Response: The first mention of 1-3 lists the three criteria; the second time, the numbers refer back to these criteria. We agree that this is not needed, and we will improve this.
p. 6, model description
Is this a spatial model, i.e. does it include spatial interactions in the time transition functions? I don’t think so, it is a local (point model). Please state this clearly.
Response: There are no spatial interactions. This is mentioned here and there in the discussion, but we will make this clear in the model description.
p. 6, figure 1
The caption is too long. Reduce it and explain concepts in the main text.
Response: We will shorten the caption.
p. 7, neural network
I am unsure you are building the machine learning model for all cells at once (single model) or for each cell separately (number of models equals number of cells). If the former, the method you propose is, I believe, fully non-spatial (point model for hydrology, identified separately for each cell with observations for that cell). Please explain this clearly.
Response: We train one global model without between-cell interactions, but we do not explain this explicitly in the text. We will, thus, improve this aspect.
p. 9, line 196
Why DELTA T instead of T?
Response: Our notation is a bit imprecise. We actually model the dynamics of T but not its mean, so one could say we model delta T. What we do is remove the mean from T and from S + G + (-C) independently, such that only the dynamics are modeled (because the GRACE T is not absolute terrestrial water storage but only its variations). We will improve the notation. Note that calculating delta T is needed for the comparison with the GRACE observations, as they only provide the variations.
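The mean removal described above amounts to comparing anomalies; a minimal NumPy sketch (names are illustrative, not the manuscript's code):

```python
import numpy as np

def anomalies(x, axis=0):
    """Remove the temporal mean so that only the dynamics remain."""
    return x - x.mean(axis=axis, keepdims=True)

def tws_dynamics(S, G, C):
    """Simulated storage dynamics comparable to GRACE-style TWS variations.

    S (soil), G (groundwater) and C (cumulative water deficit, which enters
    with a minus sign because it is a deficit) each carry an arbitrary
    absolute offset; mean removal cancels those offsets out.
    """
    return anomalies(S + G - C)
```

Because both the simulated and the observed series are reduced to anomalies, only the variations are compared, which matches what GRACE provides.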
p. 9, model training
I am wondering how you train the model. Machine learning models typically do not run forward in time. However, in this application, they are fed by temporally changing data, in a forward timestep approach. How is this done? Please explain. Providing the code would help as well.
Response: The code is available here: https://github.com/bask0/h2m
We do a “many-to-many” prediction with a spin-up (random years from training features) and a warm-up (one year of input features in correct order, just the year before the actual model run) to equilibrate the model states. To understand the sequential processing of the model, we can ignore the “hybrid” part and think of a traditional hydrological model or a plain recurrent neural network (RNN). In both cases, a state (e.g., soil moisture, or the RNN state) is updated at each time step in interaction with the current inputs (e.g., meteorological variables), which yields an updated state (and a couple of output variables). In practice this is just a for loop (https://github.com/bask0/h2m/blob/04f0e15b60fe90d0c897aca2d801289bec158ebe/src/models/hybridmodel_loop.py#L171). The model does not predict into the future, as it does not simulate meteorological forcings, i.e., it needs inputs at time t to simulate response at time t. Figure E in the supplement to this response nicely illustrates the concept. On the left side, we see the “for loop” version, on the right side the unfolded version of the “for loop”. The outputs o are only present up to the last input element x.
We try to improve the description of the sequential processing in the revision.
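The sequential processing described above can be sketched generically as follows (a toy stand-in for illustration, not the H2M code; `reservoir_step` is an invented example of a state-update function):

```python
def run_sequence(step, state, forcings):
    """Run a stateful model forward over a forcing time series.

    `step(state, x_t)` returns (new_state, output_t). This plain "for loop"
    is the structure shared by a recurrent neural network and a conceptual
    hydrological model: the state is updated at each time step in
    interaction with the current inputs.
    """
    outputs = []
    for x_t in forcings:
        state, out = step(state, x_t)
        outputs.append(out)
    return state, outputs

# Toy example: a linear reservoir that releases 10% of its storage per step.
def reservoir_step(storage, precip):
    storage = storage + precip
    runoff = 0.1 * storage
    return storage - runoff, runoff

final_state, runoffs = run_sequence(reservoir_step, 0.0, [1.0, 0.0, 0.0])
```

In H2M, `step` would be the combined neural-network/water-balance update, and the spin-up and warm-up years mentioned above simply prepend extra forcing time steps so the state equilibrates before the evaluated period.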
Figure 2: Too small.
Response: We will enlarge the Figure in the revised version.
Section 2.5: Which runs of the global hydrological models were used?
Response: The ones originally published by Schellekens et al. (2017) that are available at https://wci.earth2observe.eu/
Caption Table 3: What is ‘median-cell level’? Please explain.
Response: It is the median across all grid cells, i.e., the metrics are calculated per grid cell first, and the median is reported. The spatially averaged signal is the global signal (i.e., we take the mean across all grid cells at each time step, then calculate the metrics on the resulting global time series). We will improve the terminology.
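The two aggregation levels can be written down concisely; a sketch using RMSE as the metric (illustrative names, arrays shaped as time × cells):

```python
import numpy as np

def rmse(obs, sim, axis=None):
    return np.sqrt(((obs - sim) ** 2).mean(axis=axis))

def cell_level_median(obs, sim):
    """Metric per grid cell first (over time, axis 0), then the median
    across cells: the 'median-cell level' score."""
    return np.median(rmse(obs, sim, axis=0))

def global_signal_metric(obs, sim):
    """Spatial mean per time step first, then the metric on the resulting
    global time series: the 'spatially aggregated' score."""
    return rmse(obs.mean(axis=1), sim.mean(axis=1))
```

The two numbers answer different questions: spatially compensating errors vanish in the global-signal score but remain visible in the cell-level median.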
Figure 3: Why are you not including results for ET and Q as well?
Response: ET and Q are not direct observations; they are upscaled from local measurements and are better classified as observation-based products. We assume that the performance on these signals is more prone to being affected by upscaling prediction uncertainties than the performance on directly remote-sensed signals; ET and Q are both affected by substantial biases and uncertainties. We do provide ET and Q metrics in Table 3 and show the local biases in Appendix A, Figure A2. And, after all, as you suggested, the manuscript is already quite long.
Figure 4: Explain root phase and variance error in Methods. I am in the opinion however this plot could be left out.
Response: We will consider your suggestion and either explain the metrics or remove the plot in the revision.
Figure 5: What are the small panels on the left and right side? What is plotted on the x-axis? The figure is hard to understand. It needs to be simplified (and possibly include more detailed information in a digital supplement or appendix).
Response: The panels ("inset axes") show the NSE per model (color corresponds to model). If you refer to the outer panels, these show the mean seasonal cycle. In the revision, we will show only the global signal in the main text and move the current figures (with improved axis labels and without the inset axes) to the appendix.
Figure 7: What is represented by each line in the figure? A single location? A single year? ‘quantile of the spatio-temporal distribution’ is not easy to understand.
Response: We "mix" all time steps and grid cells and calculate the quantiles for bins of CWD. For example, we filter all values (in space and time) with CWD between 0 and 10 mm and calculate the quantiles from them (independently per cross-validation run); then we take CWD from 10 to 20 mm, and so on. We will improve the caption in the revision.
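The binning procedure just described can be sketched as follows (pooling all cells and time steps into flat arrays; names and bin width are illustrative):

```python
import numpy as np

def quantiles_per_bin(cwd, values, bin_edges, qs=(0.25, 0.5, 0.75)):
    """Pool all space-time samples, then compute quantiles of `values`
    within each CWD bin (e.g., 0-10 mm, 10-20 mm, ...).

    `cwd` and `values` are flat arrays over all grid cells and time steps;
    bins with no samples yield NaN rows.
    """
    results = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (cwd >= lo) & (cwd < hi)
        if mask.any():
            results.append(np.quantile(values[mask], qs))
        else:
            results.append(np.full(len(qs), np.nan))
    return np.array(results)  # shape: (n_bins, n_quantiles)
```

Each line in a figure built this way then traces one quantile of the pooled spatio-temporal distribution as a function of CWD, rather than a single location or year.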
Figure 8: What is represented by the colours?
Response: The legend may be hard to spot (top left panel), so we will move it outside the panel. We will also mention the color scheme in the caption.
Figure 10: Too small. Consider selecting runs and plotting these.
Response: We will improve the figure size, but we do not understand what "runs" refers to here.
Figure 11: Too small. What is along the x-axis? Months? It runs up to 13.
Response: We will improve axis labels and make the figure larger.
Discussion section: The discussion is interesting but it is somewhat long. Consider reducing it somewhat focusing on the main things (that are relevant to the research objective and questions).
Response: We will consider shortening the discussion.
-
RC5: 'Comment on hess-2021-211', Anonymous Referee #4, 28 Jul 2021
Kraft et al. present a hybrid global modelling framework combining machine learning with simple water balance equations to simulate the most relevant hydrological states and fluxes at global scale and daily resolution. The model has been simultaneously trained with observational data-sets of total water storage, snow water equivalent, evapotranspiration and runoff. Results are evaluated against the same data-sets (split sample) as well as compared to simulations provided by four global hydrological models. The model intercomparison focuses on the attribution of water storage changes to variations in snow pack, soil moisture and groundwater storage. The authors prove that their hybrid modelling framework is able to simulate relevant water storage components and fluxes with similar or better performance than state-of-the-art GHMs. Systematic differences between the models and the implications of their findings are extensively discussed.
The contribution is novel and fits well within the scope of HESS; however, it requires major revision before potential publication. This especially concerns the results section, which contains a large number of (complex) figures that are often difficult to read and/or grasp. The reader is required to go back and forth between the main text and the caption in order to understand what is displayed. These comments are already based on the revised set of figures uploaded by the authors. I would recommend thoroughly revising the results section, potentially dropping a few figures and revising the remaining ones, in order to arrive at a more concise and digestible presentation.
The discussion touches on many important aspects but appears lengthy and repetitive at times. I would recommend revision to make it more concise. Further, the significance of the parameter estimates for process-based modeling is overstated in my opinion. The authors acknowledge that, due to the simple model structure and small number of parameters, parameter estimates will tend to compensate for insufficient or lacking process representations and uncertainty in the input data, which undermines their 'physical meaningfulness' and ability to describe specific processes.
Specific comments
It remains unclear which thresholds and data-sets were used to identify cells with "high anthropogenic impact". Groundwater abstraction is given as an example; however, cells with extensive irrigation (irrespective of source) should also be removed due to its effects on soil moisture and evapotranspiration.
It remains unclear if the parameters beta_s and beta_g were estimated by the neural network or were preset. In the latter case, please clarify how these parameters were determined.
It is my understanding that global, area-weighted averages of TWS, SWE, Q, and ET have been used to calculate the performances reported in Tab. 3 and Fig. 3. If correct, I cannot quite see the value in doing so. Both H2M and the GHMs aim to estimate hydrological states and fluxes in a spatially distributed manner, i.e. numbers based on a global average provide little insight into the models' performances. Further, jumping between global performance and cell median makes Sect. 3.1 rather hard to follow. I'd suggest focusing on the cell median and dropping the global numbers.
The comparison of model performances between H2M and the GHMs in Fig. 3 seems hardly meaningful, since the better part of the common time series (2003-2008) was part of the training data-set; particularly since NSE and MSE are closely related. In this regard, it would also be interesting to see a direct comparison of the performances achieved by H2M in the training period and the evaluation period, respectively.
Fig. 3 is hardly readable, please rescale/revise.
Fig. 4 shows performance metrics that have not been introduced in the methods section or used in the previous figures and tables, which, frankly, is confusing. I'd recommend sticking with the performance metrics used earlier.
The insets in Figs. 5 and 6 severely compromise readability and I'd suggest removing them. Further, the x-axis labels seem to be cut off. Please revise.
Fig. 7: Q is an unfortunate abbreviation for quantile here, since it is used for runoff in other parts of the manuscript. Please revise. On which variable/quantity exactly are the quantiles based? Please clarify.
Fig. 9: The masking color (black) and the darkest shade of the color scale are hardly distinguishable, please revise. In general, I feel that the figure conveys a similar message to Fig. 8. Given the overall large number of figures in the manuscript, this one could be dropped for conciseness.
Fig. 10: The masking color (grey) is also part of the color scale (equal contribution from all three components), please revise.
Minor comments and technical corrections
l. 89: Please rephrase "average content [...] of bulk density".
l. 95: Please rephrase "keep".
l. 127: What percentage of global land area do the remaining 12084 grid cells represent?
Fig. 1 (caption): Is scorr the same as βs?
eq. 15: Shouldn't the equation be qf = αf × win ?
l. 229: Replace with "ratio of modeled and observed standard deviation (SDR)" for clarity.
Tab. 2 (footnote): Should read "interception"?
l. 280-81: "SDE" is not defined in the manuscript, should be "SDR"?
l. 285: Should read "provided"?
l. 307: Should refer to Fig. 5, not Fig. 6?
l. 311: SWEMSE should read SWEMSC and NES should read NSE?
-
AC6: 'Reply on RC5', Basil Kraft, 09 Aug 2021
Kraft et al. present a hybrid global modelling framework combining machine learning with simple water balance equations to simulate the most relevant hydrological states and fluxes at the global scale and daily resolution. The model has been trained simultaneously with observational data-sets of total water storage, snow water equivalent, evapotranspiration and runoff. Results are evaluated against the same data-sets (split sample) and compared to simulations provided by four global hydrological models. The model intercomparison focuses on the attribution of water storage changes to variations in snow pack, soil moisture and groundwater storage. The authors show that their hybrid modelling framework is able to simulate relevant water storage components and fluxes with similar or better performance than state-of-the-art GHMs. Systematic differences between the models and the implications of their findings are extensively discussed.
The contribution is novel and fits well within the scope of HESS; however, it requires major revision before potential publication. This especially concerns the results section, which contains a large number of (complex) figures that are often difficult to read and/or grasp. The reader is required to go back and forth between the main text and the caption in order to understand what is displayed. These comments are already based on the revised set of figures uploaded by the authors. I would recommend thoroughly revising the results section, potentially dropping a few figures and revising the remaining ones, in order to arrive at a more concise and digestible presentation.
The discussion touches on many important aspects but appears lengthy and repetitive at times. I would recommend revision to make it more concise. Further, the significance of the parameter estimates for process-based modeling is overstated in my opinion. The authors acknowledge that, due to the simple model structure and small number of parameters, parameter estimates will tend to compensate for insufficient or lacking process representations and uncertainty in the input data, which undermines their 'physical meaningfulness' and ability to describe specific processes.
Response:
We would like to thank you for appreciating the novelty of the study and for providing valuable comments.
We received similar comments from the other reviewers regarding the complexity of the manuscript and the figures. We plan to simplify some of the figures and move some parts of the model performance evaluation to the appendix. This involves reducing the information density of some figures (e.g., Figures 6 and 7; see the simplified versions in Figures B and C of the supplement to this response), removing others (e.g., Figure 4), streamlining the discussion section, and unifying the terminology. We will also provide additional information on the machine learning model and training procedure, as several questions on these have been raised by you and the other reviewers.
Regarding the relevance of parameters in process-based models, we wanted to highlight that model parameters are a critical source of uncertainty as well (e.g., Liu and Gupta, 2007, doi: 10.1029/2006WR005756). Several related specific instances are mentioned throughout the text to cover different aspects of “parameter uncertainty,” such as equifinality issues, the scale mismatch in parameter estimation (e.g., hydraulic conductivity measured in the lab vs. at half-degree resolution), discrete vs. continuous representation in space (e.g., vegetation type and a corresponding lookup table based on observations at the plant/ecosystem level), and temporally fixed vs. varying parameters. In the revision, we will tone down the text wherever possible, while maintaining the message that there is ample room for improvement regarding parameter estimation in process-based models.
Below, we provide detailed answers to your comments.
Specific comments
It remains unclear which thresholds and data-sets were used to identify cells with "high anthropogenic impact". Groundwater abstraction is given as an example, however cells with extensive irrigation (irrespective of source) should be removed due to its effects on soil moisture and evapotranspiration.
Response: We used an ad-hoc solution based on Rodell et al. (2018) [https://www.nature.com/articles/s41586-018-0123-1]. We removed all regions attributed solely to groundwater depletion (#7, #12, and #14). We will add this explanation in the revision.
It remains unclear if the parameters beta_s and beta_g were estimated by the neural network or were preset. In the latter case, please clarify how these parameters were determined.
Response: They are estimated as free parameters (in the sense of not being connected to input data), i.e., they are not predicted by a neural network but optimized directly. The gradient-descent-based optimizer receives all parameters of the neural networks (NN) together with the global parameters, and in each optimization iteration it can update the weights of the neural network as well as the two parameters. We will elaborate on this in the revision.
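To illustrate what such a joint update could look like, here is a minimal PyTorch sketch. All names, shapes, and the loss are illustrative stand-ins, not the actual H2M code; the point is only that one optimizer receives both the network weights and the global free parameters.

```python
import torch

# Hypothetical stand-in for the neural network component; sizes are arbitrary.
net = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.Tanh(), torch.nn.Linear(16, 4)
)

# Global free parameters (e.g., beta_s, beta_g): plain learnable tensors,
# not predicted from inputs by the network.
beta_s = torch.nn.Parameter(torch.tensor(0.5))
beta_g = torch.nn.Parameter(torch.tensor(0.5))

# A single optimizer receives the network weights AND the global parameters,
# so every gradient step can update all of them jointly.
opt = torch.optim.Adam(list(net.parameters()) + [beta_s, beta_g], lr=1e-3)

x, y = torch.randn(32, 8), torch.randn(32, 4)
for _ in range(3):
    opt.zero_grad()
    # In H2M, beta_s/beta_g would enter the water-balance equations; here they
    # simply scale/shift the network output to keep the sketch self-contained.
    loss = torch.nn.functional.mse_loss(beta_s * net(x) + beta_g, y)
    loss.backward()
    opt.step()
```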
It is my understanding that global, area-weighted averages of TWS, SWE, Q, and ET have been used to calculate the performances reported in Tab. 3 and Fig. 3. If correct, I cannot quite see the value in doing so. Both H2M and the GHMs aim to estimate hydrological states and fluxes in a spatially distributed manner, i.e. numbers based on a global average provide little insight into the models' performances. Further, jumping between global performance and cell median makes Sect. 3.1 rather hard to follow. I'd suggest to focus on cell median and to ditch the global numbers.
Response: The global signal is less affected by data uncertainties and reflects how well the model captures global-scale patterns. We agree that the global signal can be misleading (e.g., the Northern and Southern Hemispheres can cancel out parts of the signal due to opposite seasonality), but the ability to reproduce the regional and global signal is still insightful, and it has been used extensively in previous studies (Trautmann et al., 2018, doi: 10.5194/hess-22-4061-2018; Jung et al., 2010, doi: 10.1038/nature09396; Schellekens et al., 2017, doi: 10.5194/essd-9-389-2017; Humphrey et al., 2018, doi: 10.1038/s41586-018-0424-4). Ideally, we would report regional performance together with the cell-level performance, but this would add more complexity to the manuscript. Instead, some of the regional performance evaluations will be moved to the appendix (based on suggestions from other reviewers), and the main text will focus on the global signal.
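For reference, an area-weighted global average on a regular lat-lon grid is commonly computed with cosine-latitude weights (grid-cell area scales with the cosine of latitude). A minimal sketch with synthetic data, not the actual evaluation code:

```python
import numpy as np

# Cell-center coordinates of a 0.5-degree grid (as used in the manuscript).
lat = np.arange(-89.75, 90, 0.5)
lon = np.arange(-179.75, 180, 0.5)

# Synthetic field standing in for, e.g., a TWS anomaly snapshot.
tws = np.random.default_rng(0).normal(size=(lat.size, lon.size))

# Area weights proportional to cos(latitude), broadcast over longitudes.
w = np.cos(np.deg2rad(lat))[:, None] * np.ones((1, lon.size))

# Area-weighted global mean.
global_mean = np.average(tws, weights=w)
```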
The comparison of model performances between H2M and the GHMs in Fig. 3 seems little meaningful since the better part of the common time series (2003-2008) was part of the training data-set; particularly since NSE and MSE are closely related. In this regard, it would also be interesting to see a direct comparison of the performances achieved by H2M in the training period and the evaluation period, respectively.
Response: We decided to use the entire time series because 1) a longer time series yields a more robust performance estimate, and 2) the model performances of H2M were very similar in the training, validation, and test periods. We checked the latter by comparing the model performance across the different sets. Figure D in the supplement to this response (not contained in the manuscript; we will add it to the appendix) shows the performance (RMSE) for the training, validation, and test sets. From the figure, we see that there is no systematic difference in RMSE across the sets. We assume that this is a result of the regularization (e.g., early stopping, weight decay) and the physical constraints, which also help to avoid overfitting.
However, we agree that the performance comparison should be made on the test set. We decided to make a compromise: the splitting of the training, validation, and test sets in the spatial domain makes a comparison difficult, as we would need to show the GHM performance separately for the cells corresponding to the test set as well. Thus, we will include all the cells (from training, validation, and test) but only use the years 2009 to 2012, which were not used for training.
Fig. 3 is hardly readable, please rescale/revise.
Response: We will increase the figure size.
Fig. 4 shows performance metrics that have not been introduced in the methods section or used in the previous figures and tables which, frankly, is confusing. I'd recommend to stick with the performance metrics used earlier.
Response: We agree that it makes little sense to introduce new metrics here. The phase and variance errors are insightful, but the message of the figure is that our model struggles in the same regions as the GHMs, a point that could also be made with NSE, for example. In the revision, we will consider moving the figure to the appendix or removing it entirely.
The insets in Figs. 5 and 6 severely compromise readability and I'd suggest removing them. Further, the x-axis labels seem to be cut off. Please revise.
Response: We agree that the figures are hard to read/grasp. We will replace the figures with the global signal and move the original figures (with improved axis labels) to the appendix (see Figures B and C in the supplement to this response).
Fig. 7: Q is an unfortunate abbreviation for quantile here since used for runoff in other parts of the manuscript. Please revise. Which variable/quantity are the quantiles exactly based on? Please clarify.
Response: Regarding the abbreviation, we agree. For the calculation of the quantiles, we “mix” all time steps and grid cells and calculate the quantiles for bins of CWD. For example, we filter all the values (space and time) for CWD values from 0 to 10 mm. From these values, we calculate the quantiles (independently per cross validation run). Then we take CWD from 10 to 20 mm, etc. We will improve the caption.
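The binning-and-quantile procedure described above could be sketched as follows. The data here are synthetic and the variable names illustrative (e.g., the quantity binned against CWD is an ET-like variable), not taken from the manuscript:

```python
import numpy as np

# Pool all grid cells and time steps into flat arrays ("mixing" space and time).
rng = np.random.default_rng(0)
cwd = rng.uniform(0, 100, size=10_000)                 # cumulative water deficit (mm)
et = 3.0 * np.exp(-cwd / 50) + rng.normal(0, 0.2, size=cwd.size)

bin_edges = np.arange(0, 110, 10)                      # 10 mm CWD bins: 0-10, 10-20, ...
quantiles = [0.25, 0.5, 0.75]

binned = []
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    mask = (cwd >= lo) & (cwd < hi)                    # all pooled values in this CWD bin
    binned.append(np.quantile(et[mask], quantiles))    # quantiles within the bin
binned = np.array(binned)                              # shape: (n_bins, n_quantiles)
```

In the manuscript this would additionally be done independently per cross-validation run.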
Fig. 9: The masking color (black) and the darkest shade of the color scale are hardly distinguishable, please revise. In general, I feel that the figure conveys a similar message as Fig. 8. Given the overall large number of figures in the manuscript, this one could be dropped for conciseness.
Response: True, the masking color is not ideal, and we will change it to white. Regarding the similarity of Figs. 8 and 9: we will consider removing one of them in the revision.
Fig. 10: The masking color (grey) is also part of the color scale (equal contribution from all three components), please revise.
Response: Agreed, we will change the masking color.
Minor comments and technical corrections
Response: Thanks for the technical corrections, we will follow all your suggestions listed below and clarify where needed.
l. 89: Please rephrase "average content [...] of bulk density".
l. 95: Please rephrase "keep".
l. 127: What percentage of global land area do the remaining 12084 grid cells represent?
Fig. 1 (caption): Is scorr the same as βs?
eq. 15: Shouldn't the equation be qf = αf × win ?
l. 229: Replace with "ratio of modeled and observed standard deviation (SDR)" for clarity.
Tab. 2 (footnote): Should read "interception"?
l. 280-81: "SDE" is not defined in the manuscript, should be "SDR"?
l. 285: Should read "provided"?
l. 307: Should refer to Fig. 5, not Fig. 6?
l. 311: SWEMSE should read SWEMSC and NES should read NSE?