To bucket or not to bucket? Analyzing the performance and interpretability of hybrid hydrological models with dynamic parameterization

Acuña Espinoza, Eduardo; Loritz, Ralf; Álvarez Chaves, Manuel; Bäuerle, Nicole; Ehret, Uwe

doi:https://doi.org/10.5194/hess-28-2705-2024

Articles | Volume 28, issue 12

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 27 Jun 2024

Final revised paper (published on 27 Jun 2024)
Preprint (discussion started on 15 Sep 2023)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2023-1980', Anonymous Referee #1, 06 Oct 2023
General Comments
The authors introduce and analyse a hybrid hydrological model consisting of a conceptual hydrological model and an LSTM data-driven model to estimate time-dependent model parameters from the same inputs as used to drive the conceptual model. The intention is to keep the excellent performance of data-driven approaches that has been demonstrated in recent years, but also to keep or improve the interpretability of such data-driven approaches.
In general, I am in favour of an intensive analysis of such approaches, and think the manuscript is well suited for the readership of HESS, in continuation of a significant number of important papers in this area in the same journal.
It is in general well written and figures support the understanding and flow of the text! However, I have a number of major and minor comments/suggestions that I believe would improve the manuscript and should be addressed before final publication.
The authors motivate their work by a paper of Feng et al., who propose a general framework of hybrid dPL modelling. They use the HBV model as a basis and estimate static and dynamic HBV parameters using catchment parameters and meteorological input (as used to force HBV). This paper extends and slightly varies this approach by analysing simple bucket-based models as well as (what they call) a NonSense model. Dynamic parameters are estimated with an LSTM DL. Research question 1 is “do conceptual models serve as a regionalization mechanism for the dynamic parameterization?” I do think this is an important question (and I miss the reference of Frame et al., 2022 in this context); however, I believe it is not addressed in such a rigorous way as would be needed here. Conceptual models can range over a large range of complexity. What if we would just apply a simple equation relating rainfall to runoff (Q = c(x,t) * P) and allow c to be estimated by an LSTM as suggested? This is the simplest model I can think of, and then I would systematically increase the complexity of the conceptual models.

(Frame, J. M., Kratzert, F., Klotz, D., Gauch, M., Shalev, G., Gilon, O., Qualls, L. M., Gupta, H. V., and Nearing, G. S.: Deep learning rainfall–runoff predictions of extreme events, Hydrol. Earth Syst. Sci., 26, 3377–3392, https://doi.org/10.5194/hess-26-3377-2022, 2022.)

In that procedure I would suggest using a much wider set of catchments and characteristics, covering a broader range of physio-geographical properties and climate conditions (as has been done in plenty of other previous applications), to answer research question 1 in a more general way!
Research question 2 addresses the physical interpretability of conceptual models and whether it is compromised by data-driven dynamic parameterization. Fig. 8 shows some of the parameters for 2 catchments and how they vary in time. I am missing a few points that should be discussed: i) Are the variations of parameters due to structural limitations of the conceptual model component, or are they just needed because of averaging non-linear processes over spatially variable catchment characteristics, or are they compensating for biases in the ERA5 input data? Or all three? What do I learn from Fig. 8? Which weight is assigned to each individual input for driving the variation? ii) How does the methodology compare to “more classical/statistical approaches” such as state- and time-dependent parameter estimation techniques? iii) How does the methodology compare in philosophy and potential to approaches that have been introduced by e.g. Feigl et al. (2022)? What do we learn here in this approach from mistakes?

(Feigl et al., 2022, Learning from mistakes-Assessing the performance and uncertainty in process-based models. Hydrological Processes 36).

Overall, I miss a kind of “surprise” concerning the analysis – could that be emphasized more?

Specific/technical Comments
The following minor comments/suggestions I would like to make:
L9ff: The last part of the abstract is hard to understand/follow – I read it before the rest of text and did not know what is meant.

L20: Reference needed.

L136: how is ETp calculated (maybe one short sentence)?

L161: how do you calculate the gradients for if/then and iterative loops with state updates?

L214: is 855 batches true when you consider that one data point considers 180 previous days as input?

L216: Why not optimize the initial conditions?

L232: this refers to one major comment – when is the model complex enough so that the LSTM is able to produce the full output space just by varying parameters!? Is this already possible with the structure I suggested? When can I see limitations/restrictions?

L265: what is the criterion for overfitting? Have you used ensembles of optimized networks to see how robust results are?

Fig. 6: it is hard to see any differences; perhaps you can enlarge an interesting part of the time series!

L309: I would guess that ERA5-Land data are also computed and not observed quantities. So it is a model state intercomparison!

L329: why looking at average values and not show the distribution?

L385: what has this paper contributed to a better understanding in this context! Be specific!

L402: What is new compared to Feng et al., what are different findings?

L417: States (instead of variables)!?

L421: correlation is a very weak measure of goodness-of-fit, especially when dealing with cyclic data/processes.

Overall, I feel, the manuscript has in general the potential to be a valuable contribution to HESS, however, questions and issues raised in the general comments would need to be addressed and discussed to a significant part before final acceptance.
Citation: https://doi.org/10.5194/egusphere-2023-1980-RC1
- AC1: 'Reply on RC1', Eduardo Acuna, 23 Oct 2023
  
  Response to RC1: Comment on egusphere-2023-1980. Anonymous Referee #1. 06 Oct 2023
  We want to thank the referee for the detailed evaluation of our paper. In this document we answer the questions, comments and suggestions given. We will address those comments individually. For clarity, the original comments posted by the referee are written in italic, while our answers are written in bold.
  The authors introduce and analyse a hybrid hydrological model consisting of a conceptual hydrological model and an LSTM data-driven model to estimate time-dependent model parameters from the same inputs as used to drive the conceptual model. The intention is to keep the excellent performance of data-driven approaches that has been demonstrated in recent years, but also to keep or improve the interpretability of such data-driven approaches.
  In general, I am in favour of an intensive analysis of such approaches, and think the manuscript is well suited for the readership of HESS, in continuation of a significant number of important papers in this area in the same journal.
  It is in general well written and figures support the understanding and flow of the text! However, I have a number of major and minor comments/suggestions that I believe would improve the manuscript and should be addressed before final publication.
  • The authors motivate their work by a paper of Feng et al., who propose a general framework of hybrid dPL modelling. They use the HBV model as a basis and estimate static and dynamic HBV parameters using catchment parameters and meteorological input (as used to force HBV). This paper extends and slightly varies this approach by analysing simple bucket-based models as well as (what they call) a NonSense model. Dynamic parameters are estimated with an LSTM DL.
  We thank the referee for the well-structured summary of our paper until this point.
  • Research question 1 is “do conceptual models serve as a regionalization mechanism for the dynamic parameterization?”
  We assume there was a typo in the word regionalization, as in line 61 of our original manuscript the research question was: “Do conceptual models serve as an efficient regularization mechanism…”. Therefore, we will answer the following comments/suggestions assuming the word regularization.
  • I do think this is an important question (and I miss the reference of Frame et al, 2022 in this context), …
  Frame et al (2022) evaluate the performance of deep learning methods for rainfall-runoff models in predicting extreme events. According to the authors: “The primary objective of this study is to test the hypothesis that data-driven models lose predictive accuracy in extreme events more than models based on process understanding.” To accomplish this objective, they compared the performance of a LSTM network, a mass conservative LSTM (MC-LSTM), a conceptual model (SAC-SMA) and a process-based model (NWM) for predicting extreme events. In their study they showed that the data-driven models were better than conceptual and process-based models at predicting peak flows under almost all conditions.
  Most of their study is dedicated to answering their main objective, which is not directly related to our research. However, in the last paragraph of the conclusions the authors do discuss the differences between pure ML and physics-informed ML. They argue that “there is only one type of situation in which adding any type of constraint (physically based or otherwise) to a data-driven model can add value: if constraints help optimization”.
  Given the relevance of this last paragraph, we will add this reference in a revised version of the manuscript. We will include the reference in the introduction. We thank the referee for pointing out this study.
  • …however, I believe it is not addressed in such a rigorous way as would be needed here. Conceptual models can range over a large range of complexity. What if we would just apply a simple equation relating rainfall to runoff (Q = c(x,t) * P) and allow c to be estimated by an LSTM as suggested? This is the simplest model I can think of, and then I would systematically increase the complexity of the conceptual models.
  The general idea of the hybrid models in our study is to test whether we can reach the performance of data-driven methods while maintaining interpretability and access to untrained variables. A model Q = c(x,t) * P will very likely reach a performance similar to that of a stand-alone LSTM, as the performance will be given by the data-driven part (coefficient c(x,t)). However, we would not gain any interpretability or access to untrained variables. Moreover, the stand-alone LSTM that we are using receives precipitation as an input and can therefore use it to make the discharge prediction, so we argue that the case Q = c(x,t) * P is already covered.
  We evaluated in our study multiple conceptual structures: LSTM+Bucket, LSTM+NonSense and LSTM+SHM, which intended to cover a representative spectrum of conceptual models. The first case (LSTM+Bucket) removed most of the hydrological understanding we normally impose in our process-based model through its components (multiple buckets) and the fluxes between them. With the LSTM+Bucket model we only impose: mass conservation, the idea that some water may not reach the river (evapotranspiration is present) and the idea that the outflow is somehow proportional to the water content of the basin (Q = k*S). Even with this limited information, the LSTM+Bucket model was able to achieve similar performance as the other cases which indicated that the data-driven part can compensate for missing processes and flux interactions. The second case (LSTM+NonSense) allowed us to test that the data-driven part can even compensate for erroneous structure. Finally, the LSTM+SHM model covered the case where a well-structured conceptual model is given. This allowed us to evaluate the interpretability of our conceptual part and the access to untrained variables.
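  The three assumptions listed for the LSTM+Bucket case (mass conservation, an evapotranspiration loss, and outflow proportional to storage, Q = k*S) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `run_bucket`, the evapotranspiration scaling factor, and the order of operations within a time step are assumptions made for illustration, and the per-step parameter series stand in for values that an LSTM would supply.

```python
def run_bucket(precip, pet, k_series, et_factor_series, s0=0.0):
    """Single linear reservoir with time-varying parameters (hypothetical sketch).

    precip, pet        : daily precipitation and potential evapotranspiration
    k_series           : per-step outflow coefficient, expected in [0, 1]
    et_factor_series   : per-step scaling of potential evapotranspiration (assumed)
    s0                 : initial storage
    Returns (discharge list, evapotranspiration list, final storage).
    """
    storage = s0
    discharge, evap = [], []
    for p, e, k, f in zip(precip, pet, k_series, et_factor_series):
        storage += p                 # precipitation enters the bucket
        et = min(storage, f * e)     # evapotranspiration cannot exceed stored water
        storage -= et
        q = k * storage              # outflow proportional to storage: Q = k * S
        storage -= q                 # state update keeps the mass balance closed
        discharge.append(q)
        evap.append(et)
    return discharge, evap, storage


if __name__ == "__main__":
    q, et, s_end = run_bucket(
        precip=[10.0, 0.0, 5.0, 0.0],
        pet=[2.0, 2.0, 2.0, 2.0],
        k_series=[0.1, 0.2, 0.1, 0.3],
        et_factor_series=[1.0, 1.0, 0.5, 1.0],
    )
    print(q, et, s_end)
```

  In this sketch every unit of water is either evaporated, discharged, or left in storage, so sum(P) = sum(ET) + sum(Q) + S_end - S_0 holds for any parameter series. The parameters can change the timing and partitioning of the fluxes, but not the overall balance.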
  Therefore, we argue that we are evaluating our research questions in a rigorous way, as the spectrum of cases that helps us achieve our objectives is covered. Testing other conceptual structures would depend on the specific application and on which untrained variables one is interested in recovering; however, this is not the main objective of our study.
  In that procedure I would suggest using a much wider set of catchments and characteristics, covering a broader range of physio-geographical properties and climate conditions (as has been done in plenty of other previous applications), to answer research question 1 in a more general way!
  About using CAMELS-GB:
  Feng et al. (2022) conducted a study using a similar method in CAMELS-US. We wanted to test our method on a different dataset, which would increase the general testing conditions of studies involving hybrid models.
  About using a subset of CAMELS-GB:
  In our study we used 60 basins and 25 years of data per basin, which is a non-negligible basis for robust conclusions. As we explained in section 2.1, we had several reasons for using a subset of the whole CAMELS-GB dataset.
  First, we wanted to ensure a fair comparison between the models, on an even playing field. Therefore, we removed basins with high anthropogenic impacts, as the process-based models did not consider these effects in their structure. We also considered the fact that we are using a daily resolution, so the basins should have a sufficient size such that the discharge variations can be resolved by daily data. Second, as shown in Figure 1 of the manuscript, the spatial location of the 60 basins covers most of the original range. So even if the overall range of hydroclimatic conditions in the (CAMELS-) UK may not be as wide as in the (CAMELS-) US, we made sure that it was fully represented by our test data set. Third, our performance measurements aligned with the benchmark set by Lees et al. (2021), who trained a data-driven method on the full CAMELS-GB dataset. Lastly, to have good baselines for our study we also calibrated the stand-alone conceptual models. We calibrated the SHM-only, Bucket-only and NonSense-only models for each basin. During this process, to mitigate potential calibration biases that may favor our hybrid models, we calibrated each conceptual model with three different methods: SCE-UA, DREAM and gradient descent. Therefore, using a subset of 60 basins, we did 3 (models) * 3 (calibration methods) * 60 (basins) = 540 model calibrations. Hence, using a subset of the whole CAMELS-GB dataset was important to maintain a reasonable computational cost.
  • Research question 2 addresses the physical interpretability of conceptual models and whether it is compromised by data-driven dynamic parameterization. Fig. 8 shows some of the parameters for 2 catchments and how they vary in time. I am missing a few points that should be discussed: i) Are the variations of parameters due to structural limitations of the conceptual model component, or are they just needed because of averaging non-linear processes over spatially variable catchment characteristics, or are they compensating for biases in the ERA5 input data? Or all three?
  With the methodology we used in this study we were not trying to differentiate which deficiencies in the conceptual model our data-driven part was compensating for. However, the three possibilities that the referee suggested are very likely to be included.
  For example, our experiment with the different conceptual structures indicates that the data-driven part can compensate for structural deficiencies and missing processes. Moreover, as we explained in line 288, in the LSTM+NonSense variation, the LSTM is reducing as much as possible the initial lag caused by the baseflow and interflow modules, which suggests the data-driven part can even “turn-off” parts of the conceptual model that are not useful.
  The possibility that the data-driven part is compensating for the limitations of averaging non-linear processes was discussed in line 367, where we indicated that all our conceptual models are being operated in a lumped manner. Lumped models handle multiple uncertainties and subprocesses by a single parameter, which is indeed a limitation. Therefore, the LSTM can vary the parameters in time to compensate for this limitation and get a better performance. In a similar study, Feng et al (2022), partially covered this problem by using 16 conceptual models parameterized by a LSTM, to consider a semi-distributed version. Moreover, they showed two models, one with static and one with dynamic parameters. The fact that the dynamic parameterization got a better performance may indicate that even with a semi distributed model, there are some deficiencies in the model structure that the LSTM is still able to compensate for.
  Lastly, our model can be compensating for biases in the input data (however, we use CAMELS-GB input data not ERA5). It is known that, due to their structure, data-driven models can compensate for biased input data, and there is no reason to suggest that our model is not doing this.
  Therefore, the data-driven part is compensating for multiple limitations of the model. However, disentangling which particular limitation is being compensated for neither aligns with nor affects our objective of evaluating whether hybrid models maintain interpretability and provide access to untrained variables. With respect to the topic raised by the referee here, we therefore suggest keeping the manuscript as it is.
  • What do I learn from Fig. 8? Which weight is assigned to each individual input for driving the variation?
  The LSTM processes the sequence of input variables using a series of gates (forget, input and output). Through weights, biases, and context-dependent gates the network encodes the information in hidden and cell states to produce an output. However, because of how the information is used, there is no one-to-one attribution of how much each input contributes to each output. Moreover, our study focuses on the interpretability remaining in the conceptual model structure and not on the internal functioning of the data-driven part. Figure 8 allowed us to analyze the time variation of the parameters and link these variations to our hydrological knowledge.
  ii) How does the methodology compare to “more classical/statistical approaches” such as state and time dependent parameter estimation techniques.
  Lan et al. (2020) indicate that the most common approach for dynamic parameterization of hydrological models is calibration for different subperiods. They support this statement by referencing over 20 studies on this subject published in the last 15 years. According to the authors, this method divides the data into subperiods, considering seasonal characteristics or clustering approaches, and proposes a set of parameters for each subperiod. The idea is to capture the temporal variations of the catchment characteristics.
  Our dynamic parameterization technique is also intended to capture the temporal variation of the catchment characteristics. Specifically, we use a recurrent neural network that analyzes a given sequence length, so the proposed parameters are context informed, and reflect the current state of the catchment.
  Therefore, in philosophy our technique is similar to “more classical approaches”. The main difference is that our dynamic parameterization is much more flexible, as a custom parameterization can be proposed for each prediction, and it is not constrained to a typical small set of predefined subperiods. Also, one can include as input of the LSTM any information that is considered useful to make an informed parameter inference, even if this is not used later in the conceptual part of the model.
  We thank the referee for this question. We will include this information in a revised version of the manuscript.
  iii) How does the methodology compare in philosophy and potential to approaches that have been introduced by e.g. Feigl et al. (2022), what do we learn here in this approach from mistakes?
  The study by Feigl et al. (2022) proposes a machine learning technique that maps the residuals of a conceptual model to deficiencies in model structure. This idea is quite interesting, but it differs in philosophy and potential from the method we propose.
  The method of Feigl et al. is based on the hypothesis that the residuals are caused, in part, by deficiencies in the model structure. They then use an ML algorithm to associate these residuals with a specific limitation and modify the structure of the process-based model accordingly. Therefore, the data-driven part is used to analyze the deficiencies, and based on the results of that analysis the structure of the conceptual model is modified.
  In our case, the dynamic parameterization provided by the data-driven part also showed the capability to compensate for deficiencies in the model structure. However, this compensation is done directly through the dynamic parameterization, and there are not intermediate steps to analyze the residuals and map those residuals to changes in the process-based part.
  
  Therefore, even though both methods are using a data-driven method to increase the performance of a process-based model, the idea of how this is done is quite different. We thank the referee for pointing out this reference. We will include this information in a revised version of the manuscript.
  • Overall, I miss a kind of “surprise” concerning the analysis – could that be emphasized more?
  Even though the dynamic parameterization of conceptual models had been applied before by Kraft et al. (2022) and Feng et al. (2022), our study does present novelty:
  
  We applied the hybrid model approach on CAMELS-GB, which to the best of our knowledge had not been done before. With this we increased the application range of the models, which contributed to testing the robustness of the approach.
  To the best of our knowledge, this is also the first time that the capability of LSTM to compensate for structural deficiencies in the process-based model has been tested. The LSTM+Bucket and LSTM+NonSense model allowed us to prove that the hyper flexibility of the data-driven method can overwrite the physical regularization given by the conceptual part. However, we also tested that if a meaningful conceptual model structure is given, physical interpretability can be maintained, which is consistent with previous studies.
  Overall, we argue that there is novelty in the study and that the conclusions we draw from our analysis are consistent with the results.
  Specific/technical Comments
  The following minor comments/suggestions I would like to make:
  • L9ff: The last part of the abstract is hard to understand/follow – I read it before the rest of text and did not know what is meant.
  We will modify the abstract in a revised version of the manuscript.
  • L20: Reference needed.
  We will add a reference in a revised version of the manuscript.
  • L136: how is ETp calculated (maybe one short sentence)?
  ETp is read directly from CAMELS-GB, so we did not calculate it. According to Coxon et al. (2020) ETp was calculated using the Penman–Monteith equation. We will add this information in a revised version of the manuscript.
  • L161: how do you calculate the gradients for if/then and iterative loops with state updates?
  The gradients are calculated using Automatic Differentiation, which is already implemented in PyTorch. This technique does not have a problem with if/then statements as the derivative is calculated depending on the path the if/then statement takes. It is also not a problem to include loops.
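  How a derivative can follow the branch that is actually taken, and pass through a loop with state updates, can be illustrated with a small automatic differentiation sketch based on dual numbers. Note the assumptions: PyTorch uses reverse-mode differentiation rather than the forward mode shown here, and the `Dual` class, the toy model `simulate`, and the threshold `s_max` are hypothetical and not part of the authors' code.

```python
from dataclasses import dataclass


def _lift(x):
    """Wrap plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


@dataclass
class Dual:
    """Value together with its derivative with respect to one chosen parameter."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def simulate(k, precip, s_max=20.0):
    """Toy bucket: total outflow over the period as a function of k."""
    storage = Dual(0.0)
    total_q = Dual(0.0)
    for p in precip:                      # loop with a state update
        storage = storage + p
        if storage.val > s_max:           # branch decided by the current value
            overflow = storage - s_max
            total_q = total_q + overflow
            storage = storage - overflow
        q = k * storage                   # Q = k * S
        storage = storage - q
        total_q = total_q + q
    return total_q


def derivative_wrt_k(k, precip):
    """Seed k with derivative 1 and read off d(total_q)/dk."""
    return simulate(Dual(k, 1.0), precip).der


if __name__ == "__main__":
    print(derivative_wrt_k(0.3, [5.0, 30.0, 0.0, 12.0]))
```

  Only the operations on the executed path contribute to the derivative, so the `if` statement needs no special treatment. The result is the derivative of the piecewise function at the given input, which is why it can be checked against a finite difference as long as the perturbation does not switch branches.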
  • L214: is 855 batches true when you consider that one data point considers 180 previous days as input?
  855 is the number of batches (each batch has 256 elements), while 180 is the sequence length. Therefore, to make a prediction, the LSTM considers the information of the last 180 days. However, this is independent of the number of batches used.
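  The distinction between the number of batches and the sequence length can be sketched as follows. The helper names and the toy series are hypothetical; only the batch size of 256 and the sequence length of 180 come from the reply above. With 855 batches of 256 elements, one epoch holds at most 855 * 256 = 218,880 training samples, and each of those samples is itself a 180-day window.

```python
import math


def make_windows(series, seq_length):
    """One sample per target day: the window of the last `seq_length` values."""
    return [series[i - seq_length + 1: i + 1]
            for i in range(seq_length - 1, len(series))]


def count_samples(n_days, seq_length):
    """Number of target days that have a complete look-back window."""
    return max(0, n_days - seq_length + 1)


def count_batches(n_samples, batch_size):
    """Batches per epoch; the last batch may be only partially filled."""
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    toy_series = list(range(1000))        # hypothetical 1000-day record
    windows = make_windows(toy_series, seq_length=180)
    print(len(windows), count_batches(len(windows), batch_size=256))
```

  The sequence length fixes the shape of each sample, while the number of batches depends only on how many samples exist and how many are grouped per optimization step.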
  • L216: Why not optimizing the initial conditions?
  We are using a warmup period of one year to stabilize the internal states of the conceptual model, therefore optimizing the initial conditions is not necessary.
  • L232: this refers to one major comment – when is the model complex enough so that the LSTM is able to produce the full output space just by varying parameters!? Is this already possible with the structure I suggested? When can I see limitations/restrictions?
  The LSTM+Bucket is the simplest hybrid structure we can think of that still includes some hydrological concepts in the regularization. In the paper we showed that this model already reaches state of the art performance. However, this hybrid model behaves as a LSTM variant and the bucket regularization does not give us any extra information about our hydrological system. Therefore, the simplest model we can think of already produces a full output space, because of the flexibility of the LSTM.
  • L265: what is the criterion for overfitting? Have you used ensembles of optimized networks to see how robust results are?
  We are using dropout and tracking the validation loss during training to avoid overfitting. And yes, we have used ensembles, and the models are quite robust. We presented those results at EGU-2023. However, as this was not directly aligned with the main objective of our paper, we did not include this information.
  • Fig. 6: it is hard to see any differences; perhaps you can enlarge an interesting part of the time series!
  The idea of Figure 6 was exactly this. To show that regardless of the regularization we used, there are some basins in which the simulated time series are almost identical. In a revised version of the manuscript, we can reduce the time series period we show, however this would not change the concept of what we are showing.
  • L309: I would guess that ERA5-Land data are also computed and not observed quantities. So it is a model state intercomparison!
  Yes, ERA5-Land data is also computed. We justified why we used this type of data in section 2.2. In a revised version of the manuscript, we can include the model state intercomparison term.
  • L329: why looking at average values and not show the distribution?
  We used the average as we were trying to give a general similarity metric of the models along all the testing basins. However, in a revised version of the manuscript we can add other metrics.
  • L385: what has this paper contributed to a better understanding in this context! Be specific!
  In line 385 we were doing a recap of previous studies and what motivated our research, but we were not referring to our research yet. From line 392 forward is where we stated the specific conclusions of our paper. There we described in detail the process we followed to answer our two research questions, and the results we obtained.
  • L402: What is new compared to Feng et al., what are different findings?
  In line 402 we indicated that our hybrid model achieved performance similar to that of stand-alone LSTMs and outperformed the conceptual models, and that these findings align with existing literature, including Feng et al. (2022). This paragraph is intended to compare part of our results with existing studies, which we argue is always good practice, because it increases confidence in the robustness of the methods.
  Nevertheless, these are not the only findings we summarized in the conclusions. From line 404 on, we described the specific findings of our study and the answers to the research questions, which to the best of our knowledge had not been studied before.
  • L417: States (instead of variables)!?
  In a revised version of the manuscript, we will use the word states (or state variables) instead of variables.
  • L421: correlation is a very weak measure of goodness-of-fit, especially when dealing with cyclic data/processes.
  We use the correlation metric to compare the dynamics of the unsaturated zone against ERA5-Land data. For this test, following the process proposed by Ehret et al (2020), we normalized the data before comparison (map the values to a 0-1 range), as we are interested in comparing the dynamics of the series and not the specific values. Therefore, given the purpose of the test (comparing the dynamic of the normalized series) we think the correlation coefficient can give us the information we need. Are there other specific metrics that are better suited for this case?
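  The comparison described above (map both series to a 0-1 range, then compute the correlation) can be sketched in a few lines. The function names and the toy values are hypothetical, and the Pearson coefficient is assumed as the correlation measure, since the reply does not state which coefficient was used.

```python
import math


def min_max_normalize(values):
    """Map a series linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant series cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    simulated_state = [12.0, 15.0, 21.0, 18.0, 14.0]   # toy model storage values
    reference = [0.21, 0.24, 0.33, 0.30, 0.22]         # toy reference values
    r = pearson(min_max_normalize(simulated_state),
                min_max_normalize(reference))
    print(round(r, 3))
```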
  Overall, I feel, the manuscript has in general the potential to be a valuable contribution to HESS, however, questions and issues raised in the general comments would need to be addressed and discussed to a significant part before final acceptance.
  We thank the referee for the overall positive evaluation of our manuscript and hope we could address the questions raised in a satisfactory manner.
  
  References
  
  • Coxon, G., Addor, N., Bloomfield, J. P., Freer, J., Fry, M., Hannaford, J., Howden, N. J. K., Lane, R., Lewis, M., Robinson, E. L., Wagener, T., and Woods, R.: CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain, Earth System Science Data, 12, 2459–2483, https://doi.org/10.5194/essd-12-2459-2020, 2020.
  
  • Ehret, U., van Pruijssen, R., Bortoli, M., Loritz, R., Azmi, E., and Zehe, E.: Adaptive clustering: reducing the computational costs of distributed (hydrological) modelling by exploiting time-variable similarity among model elements, Hydrology and Earth System Sciences, 24, 4389–4411, https://doi.org/10.5194/hess-24-4389-2020, 2020.
  
  • Frame, J. M., Kratzert, F., Klotz, D., Gauch, M., Shalev, G., Gilon, O., Qualls, L. M., Gupta, H. V., and Nearing, G. S.: Deep learning rainfall–runoff predictions of extreme events, Hydrology and Earth System Sciences, 26, 3377–3392, https://doi.org/10.5194/hess-26-3377-2022, 2022.
  
  • Feigl, M., Roesky, B., Herrnegger, M., Schulz, K., & Hayashi, M. (2022). Learning from mistakes—Assessing the performance and uncertainty in process‐based models. Hydrological Processes, 36(2), e14515.
  
  • Feng, D., Liu, J., Lawson, K., and Shen, C.: Differentiable, Learnable, Regionalized Process-Based Models With Multiphysical Outputs can Approach State-Of-The-Art Hydrologic Prediction Accuracy, Water Resources Research, 58, e2022WR032404, https://doi.org/10.1029/2022WR032404, 2022.
  
  • Kraft, B., Jung, M., Körner, M., Koirala, S., and Reichstein, M.: Towards hybrid modeling of the global hydrological cycle, Hydrology and Earth System Sciences, 26, 1579–1614, https://doi.org/10.5194/hess-26-1579-2022, 2022.
  
  • Lan, T., Lin, K., Xu, C. Y., Tan, X., and Chen, X.: Dynamics of hydrological-model parameters: mechanisms, problems and solutions, Hydrology and Earth System Sciences, 24, 1347–1366, 2020.
  
  • Lees, T., Buechel, M., Anderson, B., Slater, L., Reece, S., Coxon, G., and Dadson, S. J.: Benchmarking data-driven rainfall-runoff models in Great Britain: a comparison of long short-term memory (LSTM)-based models with four lumped conceptual models, Hydrology and Earth System Sciences, 25, 5517–5534, https://doi.org/10.5194/hess-25-5517-2021, 2021.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1980-AC1
RC2:
'Comment on egusphere-2023-1980', Grey Nearing, 13 Nov 2023

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1980/egusphere-2023-1980-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2023-1980-RC2
- AC2: 'Reply on RC2', Eduardo Acuna, 04 Dec 2023
  
  We want to thank Grey Nearing for the detailed evaluation of our paper. We attach the responses to his questions/comments in a PDF file. We believe that the changes proposed here will increase the quality of the manuscript and hope we addressed the questions raised satisfactorily.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1980-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (07 Dec 2023) by Daniel Viviroli

AR by Eduardo Acuna on behalf of the Authors (07 Feb 2024) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (14 Feb 2024) by Daniel Viviroli

RR by Anonymous Referee #1 (19 Feb 2024)

RR by Grey Nearing (02 Mar 2024)

Suggestions for revision or reasons for rejection

Summary of Review

This is a good paper, and it should be published. I have one remaining comment.

Major Comments

I don’t agree with the reasoning for performing the experiment on a subset of basins.

The reason that the authors state is that they expect conceptual models to not be able to model human-influenced catchments. If there is this (or another) limitation of one type of model in the study, then it seems that this limitation is part of any meaningful comparison. Instead, the approach I would take would be to do the full experiment (on the whole benchmark dataset), and then – if there is a conceptual data split that makes sense – report analyses of results on the full data set and also on that split that you want to highlight.

Even if you restrict your analysis to “near-natural” basins – which, to reiterate, I think is somewhat artificial – you probably should train the LSTM models on the full dataset. 60 catchments is probably not enough for training (see Kratzert et al., 2024).

Additionally, there is an unfortunate consequence of only using a subset of catchments in that, moving forward, if someone wants to benchmark against or build on your results, they only have a subset of the community benchmark to work with.

Minor Comments

Line 265: It seems like the experiments in this paper could/should be run with ensembles. We do not know whether the conceptual models and hybrid models benefit the same way from ensembles as the pure ML models.

Line 290: Hoge (2022) and Kraft (2022) do not show that fusing deep learning models with hydrological mechanistic models can reach state-of-the-art. The reason for this is that they did not test against current state-of-the-art, and instead tested against handicapped LSTM models that perform significantly worse than current state-of-the-art. As an example, Hoge (2022) used LSTM values from a different study (Jiang, 2022), but they did not use even the best-performing LSTM from that paper, let alone the current SOTA LSTM (which is not from Jiang). Kraft only tested against physically based models, not ML models, and there is no physically based hydrology model that is anywhere close to the current SOTA. These papers should not be referenced in the way that they are here.

Line 293: I am not sure that Mendoza et al. discussed dynamic vs. fixed parameters. When they talk about fixed parameters, they mean parameters that are hard-coded in the source code and therefore cannot be calibrated. This is not the same thing as parameters that vary during time series prediction.

ED: Reconsider after major revisions (further review by editor and referees) (06 Mar 2024) by Daniel Viviroli

AR by Eduardo Acuna on behalf of the Authors (16 Apr 2024) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (17 Apr 2024) by Daniel Viviroli

RR by Grey Nearing (06 May 2024)

ED: Publish as is (08 May 2024) by Daniel Viviroli

AR by Eduardo Acuna on behalf of the Authors (10 May 2024) Manuscript

Short summary

Hydrological hybrid models promise to merge the performance of deep learning methods with the interpretability of process-based models. One hybrid approach is the dynamic parameterization of conceptual models using long short-term memory (LSTM) networks. We explored this method to evaluate the effect of the flexibility given by LSTMs on the process-based part.