Closing the data gap: runoff prediction in fully ungauged settings using LSTM

Hashemi, Reyhaneh; Javelle, Pierre; Delestre, Olivier; Razavi, Saman

doi:https://doi.org/10.5194/hess-2023-282

Preprints

https://doi.org/10.5194/hess-2023-282

Preprints

18 Dec 2023

| 18 Dec 2023

Status: this discussion paper is a preprint. It has been under review for the journal Hydrology and Earth System Sciences (HESS). The manuscript was not accepted for further review after discussion.

Closing the data gap: runoff prediction in fully ungauged settings using LSTM

Reyhaneh Hashemi, Pierre Javelle, Olivier Delestre, and Saman Razavi

Abstract. Prediction in ungauged basins (PUB), where flow measurements are unavailable, is a critical need in hydrology and has been a focal point of extensive research efforts in this field over the past two decades. From the perspective of deep learning, PUB can be viewed as a scenario where the generalization capability of a pretrained neural network is employed to make predictions on samples that were not included in its training data set. This paper adopts this view and conducts genuine PUB using long short-term memory (LSTM) networks. Unlike PUB approaches based on k-fold training-test technique, where an arbitrary catchment B is treated as gauged in k−1 rounds and as ungauged in one round, our approach ensures that the sample for which the PUB is conducted (the UNGAUGED sample) is completely independent from the sample used to previously train the LSTMs (the GAUGED sample). The UNGAUGED sample includes 379 catchments from five hydrological regimes: Uniform, Mediterranean, Oceanic, Nivo-Pluvial, and Nival. PUB predictions are conducted using LSTMs trained both at the regime level (using only gauged catchments within a specific regime) and at the national level (using all gauged catchments). For benchmarking the performance of LSTM in PUB, four regionalized variants of the GR4J conceptual model are considered: spatial proximity, multi-attribute proximity, regime proximity, and IQ-IP-Tmin proximity, where IQ, IP, and T_min are the indices defining the five hydrological regimes. To align with the study's fully ungauged context, the IQ index, which is also an input feature for the LSTMs, and the regime classification, crucial for the REGIME LSTMs, are reproduced under ungauged conditions using a regime-informed neural network and an XGBoost multi-class classifier respectively. The results demonstrate the overall superior performance of NATIONAL LSTMs compared to REGIME LSTMs. Among the four regionalization approaches tested for GR4J, the IQ-IP-Tmin proximity approach proves to be the most effective when analyzed on a regime-wise basis. When comparing the best-performing LSTM with the best-performing GR4J model within each regime, LSTMs show superior performance in both the Nival and Mediterranean regimes.

Received: 27 Nov 2023 – Discussion started: 18 Dec 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Download & links

Reyhaneh Hashemi, Pierre Javelle, Olivier Delestre, and Saman Razavi

Status: closed

CC1:
'A curious interpretation', Daniel Klotz, 26 Jan 2024
Dear editor, dear reviewers, & dear authors,

I am writing to let you know that this paper does not faithfully depict the work of Kratzert et al. (2019). I am one of the alii, but I do not write this because of some agenda of mine. As a matter of fact, I also do see room for improvement in our evaluation approaches for PUB --- and many aspects are still unexplored. No, I decided to comment because was intrigued that it possible to misunderstand our work so much to result in the rendition from L. 132ff.

Back then, we did a standard cross-validation (CV) on basis of the catchments. I noticed that the authors do not use the word CV, but it does indeed exists. You can search for it, the approach even has its own Wikipedia page. Anyway, what the mention of k-fold CV should suggest to modelers is that one splits the data into k-folds. Then to predict the values of a given fold j one trains a model on basis of remaining k-1 folds and uses j as validation set. This process is repeated until each fold was used one time as validation set. What this basically gives you are predictions for each basins, where the basins is guaranteed to not be part of the training (forget what the authors write about the probability of a catchment being part of the training). Lastly, we evaluate the performance of each validation fold to obtain a estimate of the performance, as is standard in CV literature. It is well known that CV is almost unbiased in the i.i.d setting, but has high variance. We also did not ensemble the models across folds, but rather build 10 LSTMs per evaluation fold and checked that ensemble.
In hindsight, I can see that explanations from Kratzert et al. (2019) are maybe a bit Spartan. However, I do not see how lent themselves to the exposition given in L. 132ff. For convenience, let me quote the relevant passages from Kratzert et al. (2019):
„Out-of-sample testing was done by k-fold validation, which splits the 531 basins randomly into k = 12 groups of approximately equal size, uses all basins from k-1 groups to train the model, and then tests the model on the single group of holdout basins. This procedure is repeated k = 12 times so that out-of-sample predictions are available from every basin. [...].

For each model type we trained and tested an ensemble of N = 10 LSTM models to match the 10 SCE restarts used to calibrate the SAC-SMA models.“

I do not understand how this descriptions lead to the interpretation of our evaluation procedure as a bagging approach --- which is actually a model selection procedure that has nothing to do with the the evaluation. I therefore hope that you (i.e., the authors) can find some time during the busy admission process on HESS to explain your thought process so that we can do better in the future.
Well, I also hope that you correct your „mistakes“.
All the best,

Daniel

Minor comment: Since I am already add it I’d also like to give some minor things to correct. While skimming I noticed that the terms bagging and transfer learning are used wrongly: Bagging is a bootstrapping approach for building ensembles and that implies that the data is used with resampling --- else it would not be bootstrapping by definition. Transfer learning is about using knowledge gained from one task to improve the performance on another task. It is thus completely distinct from the things discussed here. I cannot tell if the manuscript has more errors because I stopped reading after the description of the Kratzert et al. (2019) evaluation technique.
Citation: https://doi.org/10.5194/hess-2023-282-CC1
- AC1:
  'Reply on CC1', Reyhaneh Hashemi, 14 Feb 2024
  Dear Daniel Klotz,
  
  Thank you for reviewing our paper and your valuable comments. Below, we address your comments and are open to further clarifying or, if necessary, removing any points raised in the main paper.
  
  ** ** ** ** ** ** ** ** ** ** **
  
  The use of the term bagging
  
  ** ** ** ** ** ** ** ** ** ** **
  
  Regarding your comments on the relationship between the approach used in Kratzert et al. [1] and the concept of bagging [2], we regret that the respective text in our paper was not sufficiently clear. Therefore, we would like to clarify a few points here:
  
  Our paper does not claim to categorize Kratzert et al.’s method directly as a bagging technique. Instead, it highlights conceptual similarities between their approach and the bagging method, as understood in the broader context of ensemble learning.
  
  Our paper does not intend to reflect or indicate any biases within Kratzert et al.'s methodology; there are none.
  
  Referring to the definition of "bagging" in Chapter 7 of Goodfellow et al. [3]:
  
  “Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.”
  
  It further explains,
  
  “Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and contains several duplicate examples (on average around two–thirds of the examples from the original dataset are found in the resulting training set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models.”
  
  We observe two conceptual similarities between Kratzert et al.’s approach and bagging: 1. the aggregation of predictions from an ensemble of K different models, and 2. the creation of K different datasets from sampling an original set.
  
  However, two distinctions include: 1. Kratzert et al. avoid replacement in their sampling, ensuring dataset independence via leave-one-out K-fold splitting, and 2. the12 obtained LSTMs are tested on distinct test sets, highlighting a departure from bagging.
  
  Kratzert et al. [1] utilizes 12 distinct LSTM models, each of which is trained on a different subset. This leads to 12 distinct LSTM models with 12 unique bias-variance patterns (albeit the 12 patterns do not interact in any way, since they are tested on different test sets). The reported performances in Kratzert et al. [1] result from 12 bias-variance patterns.
  
  The method used in our paper employs a single LSTM model without any form of result aggregation. The reported performances result from the single existing bias-variance pattern.
  
  There is, therefore, a clear methodological difference between the approach used in Kratzert et al. [1] and our paper.
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  The use of the term “transfer learning”
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  1. The term "transfer learning" is utilized only once in our paper, specifically in Figure 1, to illustrate the act of transferring knowledge from gauged to ungauged spaces. This usage is entirely literal and directly corresponds with the figure’s concept, aiming to depict the transfer of learning between these two spaces.
  
  2. Referring to the definition of “transfer learning” (as a specific deep learning technique and in the sense you are referencing), Goodfellow et al. [2] explains that:
  
  “Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (e.g., distribution P1) is exploited to improve generalization in another setting (say, distribution P2).
  
  In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2. This is typically understood in a supervised learning context, where the input is the same, but the target may be of a different nature.”
  
  Our paper clearly indicates (in Lines 3 and 154; page 6) that it does not engage in any new learning, fine-tuning, or model modification tasks. The term "fine-tuning" is absent from our paper (neither in the Abstract, the conclusion, nor the main text) leaving no ambiguity that our paper's use of the term "transfer" is distinct from the technical "transfer learning" concept you referenced.
  
  3. In the context of Predictions in Ungauged Basins (PUB), "transfer" is an established term used precisely as it appears in our work. Its application in PUB studies, for transferring knowledge, understanding, and model parameterizations from gauged to ungauged catchments, is numerous. Here we refer to only some examples from the reference PUB paper “A decade of Predictions in Ungauged Basins (PUB)—a review” [4].
  
  - This resulted in some success in the development of catchment classification schemes, similarity frameworks and model regionalization methods for TRANSFERRING KNOWLEDGE and improving predictions in ungauged basins.
  
  - unsuitable regionalization techniques to TRANSFER UNDERSTANDING OF HYDROLOGICAL RESPONSE patterns from gauged to ungauged environments due toa lack of comparative studies across catchments and a lack of understanding of the physical principles governing robust regionalization.
  
  - In contrast to the regionalization of flow metrics, regionalization efforts for TRANSFERRING MODEL PARAMETERIZATIONS from gauged to ungauged catchments have a longer tradition.
  
  Should there be an alternative term that aligns (as precisely) with the concept depicted in Figure 1 and retains the literal essence of "transfer" as defined (e.g., "to move someone or something from one place to another" -- Cambridge Dictionary), we are open to considering its replacement for this singular mention.
  
  Sincerely,
  
  Reyhaneh Hashemi
  
  ** ** ** ** **
  
  References
  
  ** ** ** ** **
  
  [1] Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research. 2019 Dec;55(12):11344-54 (https://doi.org/10.1029/2019WR026065).
  
  [2] Breiman L. "Bagging predictors." Machine learning 24 (1996): 123-140.
  
  [3] Goodfellow I, Bengio Y, Courville A. Deep learning. MIT press, 2016.
  
  [4] M. Hrachowitz, H.H.G. Savenije, G. Blöschl, J.J. McDonnell, M. Sivapalan, J.W. Pomeroy, B. Arheimer, T. Blume, M.P. Clark, U. Ehret, F. Fenicia, J.E. Freer, A. Gelfan, H.V. Gupta, D.A. Hughes, R.W. Hut, A. Montanari, S. Pande, D. Tetzlaff, P.A. Troch, S. Uhlenbrook, T. Wagener, H.C. Winsemius, R.A. Woods, E. Zehe & C. Cudennec (2013) A decade of Predictions in Ungauged Basins (PUB)—a review, Hydrological Sciences Journal, 58:6, 1198-1255 (https://doi.org/10.1080/02626667.2013.803183)
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC1
  - CC2: 'Reply on AC1', Daniel Klotz, 15 Feb 2024
    
    Dear Reyhaneh,
    thank you for answering to my concerns and thank you also for posting your answer so swiftly. Often, as discussion on HESS go, it makes little sense to answer timely. You have my admiration for not doing that.
    Also note that I am not a reviewer. At this stage, I am just a “concerned citizen”. As such, I’d like to address some of your answers, and give you the chance for another response. But, to avoid unnecessary fatigue from your side, I promise that I will not answer again should you choose to address this comment here. I wish you best luck!
    
    Regarding what you said about bagging:
    After reading your answer, I now understand that you want to examine conceptual similarities between the approach from Kratzert et al. and bagging. In your answer you did not commit to any changes in the manuscript, however given your goals I would propose to rewrite the relevant paragraphs. I believe many readers will misinterpret them like I did. To give just one example of what I mean (and this is not exhaustive in any shape of form, but illustrative for what I mean): L.132ff of the current manuscript reads:
    “The k-fold validation based framework [...] exhibits the following characteristics: [...] ungauged predictions are made using an out-of-bag (OOB) [...], but with subsampling performed without replacement.”
    I’d like to argue that it is very difficult to for readers to understand that you now want to compare two approaches (namely: bagging and k-fold cross-validation), rather than formulating one approach in terms of the other. I only understood this after reading your answer and would therefore like to suggest to rewrite sentences like this
    Secondly --- and this might indeed suggest a different view of what constitutes a method and what not --- but for me a very important part of what constitutes a bagging approach is the “sampling with replacement” aspect. I already pointed this out in the minor comment section of my previous comment: The sampling form is the core of what that defines bootstrapping.
    
    The text that you cite mentions this as well. If you claim that one does “bagging without replacement” (as you do in the present form of your manuscript) then that is an oxymoron for me, since the very definition of the first excludes the form. In general, I have to say that there exist many ways to build ensembles, both in ML an in general (e.g., boosting, stacking, or implicit ensembles etc.). I do not understand why one would conceptualize our approach as bagging in particular. Please also note, that I already mentioned that k-fold cross-validation gives an almost unbiased estimate while taking advantage of using all the available data, while sample splitting into training/validation and test-set provides an unbiased estimates that is less data efficient. This is a well-known tradeoff.
    Regarding transfer learning:
    
    In my comment I mentioned that you only use the term transfer learning once. And I maintain that you use it wrongly. Or rather. It makes no sense to use it the way you did. At no point in your setup is information that is gained about one setting exploited to get better predictions for another setting. What would your two settings even be? You do not even have two separate evaluations or tasks.
    The definition and explanation that you quote is in direct contrast to the explanations also show that you use the term wrongly. According to what you told me , Goodfellow et al. (2016) wrote “In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in [task] P1 are relevant to the variations that need to be captured for learning [task] P2.”, which clearly implies that the knowledge that one obtained from task P1 should be used to obtain a better performance on task P2 --- like, for example, is done in finetuning (a common approach in transfer learning). You then progress to say that you do not engage in “new learning, fine-tuning or model modifications”, which is another way of saying that you do not engage in transfer learning. That is, you contradicted the very definition that you use in two sentences.
    However, I do agree with the third point that you give, i.e., transfer has a specific meaning in the context of ungauged basins and has many uses in that context. The current version of the manuscript only uses term transfer learning in one figure. Why not change it and use of the established terms to express what you mean (you already mention several examples in your answer anyways) instead of using an ML term wrongly?
    
    Citation: https://doi.org/10.5194/hess-2023-282-CC2
    
    AC2: 'Reply on CC2', Reyhaneh Hashemi, 23 Feb 2024
    
    Dear Daniel,
    
    Thank you for your prompt and valuable feedback. We greatly appreciate your role and contributions to the community.
    
    Please feel free to share further thoughts you may have, and we will do our best to respond comprehensively.
    
    We will await the reviewers' feedback and proceed accordingly by incorporating the envisaged changes highlighted in bold below.
    
    Sincerely,
    
    Reyhaneh Hashemi
    
    ** ** ** ** ** ** ** ** ** ** **
    
    The use of the term “bagging”
    
    ** ** ** ** ** ** ** ** ** ** **
    
    We acknowledge your concerns regarding the potential for misinterpretation in our manuscript's comparison between the bagging method and the approach used by Kratzert et al. [1]. We understand how this could lead to confusion for the reader.
    
    >> Remove the text pertaining to the bagging comparison (and all impacted sentences) from the manuscript to avoid any misunderstanding.
    
    ** ** ** ** ** ** ** ** ** ** ** ** ** **
    
    The use of the term “transfer learning”
    
    ** ** ** ** ** ** ** ** ** ** ** ** ** **
    
    Our previous response, particularly points 1 and 2, aimed to clarify that the term “transfer learning” used in Figure 1 does not correspond to the deep learning technique known as “transfer learning”. Instead, it was intended to denote a literal “transfer of learning”, akin to the examples provided in point 3 for the PUB context: transfer of understanding, knowledge, parameters, etc.
    
    >> To further clarify this, replace it with “transferring of knowledge” in the manuscript’s figure.
    
    ** ** ** ** **
    
    References
    
    ** ** ** ** **
    
    [1] Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research. 2019 Dec;55(12):11344-54 (https://doi.org/10.1029/2019WR026065).
    
    Citation: https://doi.org/10.5194/hess-2023-282-AC2
RC1:
'Comment on hess-2023-282', Anonymous Referee #1, 28 Feb 2024

Good articles share something in common: an easy-to-follow abstract, a comprehensive literature review, clear research objectives, solid methodology, concise results, and in-depth discussion. Judging from these aspects, this article needs a lot of revision in all sections except for the introduction, particularly in the brevity of the article. This is one of the longest articles I have ever reviewed. Much of the article did not need to be stated in detail. Piling to much dispensable information can prevent the reader from grasping the topic of the paper. Objectively, this work makes a lot of effort to select the most effective PUB methods, and the results are interesting. However, there are many issues with the article that need to be revised.

Major:

[1] Abstract needs to be revised to ensure that it is concise and easy to understand enough.

[2] The methodology section needs to be shortened considerably. Piling to much content here will dilute the theme of the article. To do this, compressing current sections 3-6 into one that consists of three subsections: 3.1 predicting of IO, 3.2 Recreating regime classification, and 3.3 Conceptual benchmarking. Specifically, I recommend using only one method that works best for calculating IQ, even though multiple methods have been tested. Also, some figures and text can be placed in the attachment of the paper. The methodology section should not be overloaded with figures.

[3] The discussion section is an essential part of a scientific paper, but it is missing in this article. The discussion is supposed to interpret and elucidate the significance of the study findings, justify their importance and contributions to current scientific literature, and provide specific suggestions for future research. It needs to be added in the revised manuscript.

Minor:

[1] Abstract: Too many abbreviations are used in the text, which reduces the readability of the manuscript.

[2] Lines 41-45: The importance of the PUB should not be placed in this place; it should appear at the beginning of the introduction.

[3] L49: How to define the ‘success’ for various PUB methods?

[4] Lines175-180: Overall, the all three of the authors’ stated contributions are a bit of stretch. Firstly, similar studies have not been performed within the French context cannot be as a novelty of this research. Also, the third contribution needs to be revisited as previous studies (searching by google scholar) have compared the performance of LSTM and GR4J models in runoff simulations.

[5] Figure 3, the legend and symbol colors in Figure 3f need to be redrawn.

[6] Figure 5: It is recommended that Figure 5 be deleted, as similar information is already included in Table 2.

[7] The hydrologic regime classification should be placed where it first appears, i.e. in the caption of Figure 1.

[8] Briefly describe the four parameter regionalisation methods used, and Figures 12-14 should be moved to the supplementary material.

[9] It makes no sense to compare the performance of the best-performing LSTM and GR4J in three typical basins, except to further increase the length of the article. Thus, please remove Figure 19-24 and corresponding description in the paper.

[10] Conclusion only summarizes the main findings, and results that are not relevant to the research objectives need to be removed, such as the content in lines 576-582.

Citation: https://doi.org/10.5194/hess-2023-282-RC1
- AC3:
  'Reply on RC1', Reyhaneh Hashemi, 19 Apr 2024
  Dear Reviewer,
  
  We greatly appreciate the time and effort you have dedicated to providing a very detailed review of our manuscript. Your comments on each part of the paper, suggesting specific modifications, were both clear and very constructive.
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  Objectives and Research Novelties
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  This paper builds upon a previous study conducted in the gauged space using LSTMs (and a training catchment set). A crucial, but currently understated, aspect of our research is its overarching narrative: it tracks the performance drop of LSTM as it transitions from the gauged to the ungauged space (tested on a new catchment set) across various hydrological regimes. Moreover, during this transition, the study compares an LSTM with regime-specialized training to a mixed-regime trained LSTM. Both these elements constitute novel contributions to the field. Previous literature has focused on studies performed exclusively within either gauged or ungauged spaces without such a comparative framework (between regime and national LSTMs).
  
  Given the main narrative of our study, it was necessary to maintain consistency in the models used between the two spaces. This requirement led us to diverge from the k-fold training/testing approach commonly utilized in previous studies for performing PUB using LSTMs. Our departure introduces a methodological distinction from previous studies: we employed a single LSTM model characterized by a unique bias-variance pattern, and all results were produced using this consistent pattern. In contrast, previous PUB studies using LSTMs all employed K different bias-variance patterns (i.e., K different LSTM models) to generate their results.
  
  In the revised manuscript, we will elucidate these points to ensure the paper’s central narrative and its novel contributions are unequivocally presented.
  
  ** ** ** ** ** ** ** ** ** ** ** ** **
  
  Length of the Manuscript
  
  ** ** ** ** ** ** ** ** ** ** ** ** **
  
  In developing the manuscript, we aimed to build upon a previous study, exploring a variety of methods applied across numerous tasks. Our intention was to offer a comprehensive understanding that would enable readers to fully grasp and replicate the methods and results presented, leading to the inclusion of extensive details in the manuscript. We acknowledge, as you rightly pointed out, that these extensive details may have contributed to the manuscript's excessive length, potentially obscuring the overarching narrative of the paper.
  
  We gratefully acknowledge the specific modifications suggested for reducing the manuscript's length. We accept all of these recommendations and will integrate them into the revised manuscript.
  
  Our revisions will specifically focus on:
  
  Abstract: Revising it to make it more concise and easier to follow.
  
  Introduction: Clarifying the objective and elucidating the novelty aspects of the paper.
  
  Methodology: Streamlining it with three distinct subsections: 3.1 Predicting IQ, 3.2 Recreating Regime Classification, and 3.3 Conceptual Benchmarking.
  
  Discussion: Developing a separate discussion section in the revised manuscript to compare our findings with those of previous studies, acknowledging differences in methodology and exploring their implications for future research.
  
  Additional and/or Dispensable Figures and Text: Moving them to supplementary material.
  
  Conclusion: Refining it to more explicitly emphasize the main findings in direct relation to the study's objectives.
  
  
  
  Please find attached a point-to-point response to your comments.
  
  Sincerely,
  
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC3
RC2:
'Comment on hess-2023-282', Ralf Loritz, 08 Mar 2024
Dear Authors, Dear Editor,

This manuscript addresses a relevant and timely topic: how to make predictions in ungauged basins using LSTMs. However, due to the presentation quality, several sections appear underdeveloped and could benefit from further refinement to better convey the contributions and findings of the study. In its current form, it is difficult to discern the novelty and significance of the work.
The discussion surrounding Kratzert et al. 2019's study seems to be either wrong or unnecessarily complicated. It is crucial for the manuscript to accurately represent previous works to effectively build upon them. I suggest providing a more straightforward explanation and comparison to other studies that perform ungauged predictions using LSTMs (Arsenault et al., 2023; Feng et al., 2020; and references within) to ensure readers grasp the differences and advancements made in the current study.
Recommendations for improvement include:
Benchmark replication: Replicating Kratzert et al. 2019's results using the CAMELS-US Dataset is recommended (this can be added to the appendix or supplementary materials). This will not only validate the current approach and your codes but also establish a clearer benchmark for comparison.

Clarification of novel contributions: After replicating the benchmark study, the authors should distinctly outline how their work differs from and improves upon previous efforts. For instance, why is the presented division into gauged and ungauged basins better than a k-fold CV (for me it seems a bit arbitrary)? Why would you use a feature engineering approach if you work with a deep learning model? Clearly articulating the advancements in methodology, the robustness of the approach, and any enhancements in predictive accuracy would be beneficial.

Unfortunately, in its current form, I cannot recommend that the manuscript can be published in HESS. However, as the work clearly has potential, I encourage the authors to resubmit a revised version of the manuscript.
Sincerely,
Ralf Loritz
Citation: https://doi.org/10.5194/hess-2023-282-RC2
- AC4: 'Reply on RC2', Reyhaneh Hashemi, 19 Apr 2024
  
  RC :
  
  " This manuscript addresses a relevant and timely topic: how to make predictions in ungauged basins using LSTMs. However, due to the presentation quality, several sections appear underdeveloped and could benefit from further refinement to better convey the contributions and findings of the study. In its current form, it is difficult to discern the novelty and significance of the work."
  AR :
  
  Dear Dr. Loritz,
  We sincerely appreciate your feedback on our manuscript. While we note the concerns raised about the presentation quality and the perception that several sections appear underdeveloped, identifying and locating these specific areas within the manuscript proves challenging without direct references or examples from the reviewer's comments.
  We have undertaken a thorough review of our manuscript and already outlined a detailed plan to refine the manuscript to ensure that our contributions and findings are presented as clearly and concisely as possible. This plan, guided by the suggestions from Reviewer 1, includes elucidating the paper’s main narrative and novelties, addressing concerns related to the manuscript's length, and incorporating a more explicit discussion to better highlight the significance and implications of our findings.
  
  ** ** ** **
  
  RC :
  
  "The discussion surrounding Kratzert et al. 2019's study seems to be either wrong or unnecessarily complicated. It is crucial for the manuscript to accurately represent previous works to effectively build upon them. I suggest providing a more straightforward explanation and comparison to other studies that perform ungauged predictions using LSTMs (Arsenault et al., 2023; Feng et al., 2020; and references within) to ensure readers grasp the differences and advancements made in the current study."
  
  AR :
  
  In our discussion (https://doi.org/10.5194/hess-2023-282-AC1) with Daniel Klotz (one of the authors of Kratzert et al. 2019) we explain that our presentation of their work is grounded in the methodological difference between the approach used in Kratzert et al (as well as all previous studies) and ours; which is a clear and distinct difference: they used 12 different LSTMs with 12 different bias-variance patterns, and their results are concatenated from an ensemble of 12 LSTMS while we used 1 single with one single bias-variance pattern, no result concatenation takes place.
  
  We have also outlined in our response to them (https://doi.org/10.5194/hess-2023-282-AC2) that we are committed to remove the usage of the term “bagging” to prevent the least misunderstanding about their work and we will focus on clearly explaining the methodological differences between their PUB method and our own.
  
  ** ** ** **
  
  RC :
  
  " Benchmark replication: Replicating Kratzert et al. 2019's results using the CAMELS-US Dataset is recommended (this can be added to the appendix or supplementary materials). This will not only validate the current approach and your codes but also establish a clearer benchmark for comparison. Clarification of novel contributions: After replicating the benchmark study, the authors should distinctly outline how their work differs from and improves upon previous efforts. For instance, why is the presented division into gauged and ungauged basins better than a k-fold CV (for me it seems a bit arbitrary)? Clearly articulating the advancements in methodology, the robustness of the approach, and any enhancements in predictive accuracy would be beneficial."
  
  AR :
  
  We thank the reviewer for this suggestion. However, the primary objective of our paper is not to enhance the PUB results achievable with LSTMs, for which benchmarking against the approach used by Kratzert et al. would indeed be relevant. It appears this aspect of our study's narrative has not been communicated clearly or has been misunderstood in the current manuscript version.
  
  Our research builds upon prior work, employing two classes of LSTMs: one with regime-specialized training and another with mixed training, trained in the gauged space (using a training catchment set). These models are then transferred to the ungauged space (for a new test catchment set) across different hydrological regimes to observe their performance degradation. The paper's main contribution lies in comparing the performance of these two LSTMs during the transition from gauged to ungauged spaces specifically across various hydrological regimes. To preserve the integrity of this comparison, consistency between the models in both spaces is obligatory, necessitating our departure from the k-fold training-testing approach commonly employed in previous studies.
  
  It is important to clarify that our study does not seek to evaluate whether the k-fold training-testing approach is superior or inferior to using a single LSTM model. The k-fold training-testing approach was not utilized because it did not serve the specific objectives of our research. Our paper's narrative is different from that of Kratzert et al. 2019 and other prior PUB studies, which typically develop multiple LSTMs exclusively for PUB application.
  
  In the revised manuscript, we will more clearly articulate these differences in objectives (and thus methodologies) between our work and preceding PUB studies, ensuring the unique contributions of our research are fully understood.
  
  ** ** ** **
  
  RC:
  
  Why would you use a feature engineering approach if you work with a deep learning model?
  
  AR:
  
  Thank you for raising this point about feature engineering. We would like to clarify that in our paper, feature engineering is not applied to any deep learning models. It pertains solely to the XGBoost classifier, which is a machine learning model that should benefit from feature engineering to enhance performance (contrary to deep learning models).
  Sincerely,
  
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC4

Status: closed

CC1:
'A curious interpretation', Daniel Klotz, 26 Jan 2024
Dear editor, dear reviewers, & dear authors,

I am writing to let you know that this paper does not faithfully depict the work of Kratzert et al. (2019). I am one of the alii, but I do not write this because of some agenda of mine. As a matter of fact, I also do see room for improvement in our evaluation approaches for PUB --- and many aspects are still unexplored. No, I decided to comment because was intrigued that it possible to misunderstand our work so much to result in the rendition from L. 132ff.

Back then, we did a standard cross-validation (CV) on basis of the catchments. I noticed that the authors do not use the word CV, but it does indeed exists. You can search for it, the approach even has its own Wikipedia page. Anyway, what the mention of k-fold CV should suggest to modelers is that one splits the data into k-folds. Then to predict the values of a given fold j one trains a model on basis of remaining k-1 folds and uses j as validation set. This process is repeated until each fold was used one time as validation set. What this basically gives you are predictions for each basins, where the basins is guaranteed to not be part of the training (forget what the authors write about the probability of a catchment being part of the training). Lastly, we evaluate the performance of each validation fold to obtain a estimate of the performance, as is standard in CV literature. It is well known that CV is almost unbiased in the i.i.d setting, but has high variance. We also did not ensemble the models across folds, but rather build 10 LSTMs per evaluation fold and checked that ensemble.
In hindsight, I can see that explanations from Kratzert et al. (2019) are maybe a bit Spartan. However, I do not see how lent themselves to the exposition given in L. 132ff. For convenience, let me quote the relevant passages from Kratzert et al. (2019):
„Out-of-sample testing was done by k-fold validation, which splits the 531 basins randomly into k = 12 groups of approximately equal size, uses all basins from k-1 groups to train the model, and then tests the model on the single group of holdout basins. This procedure is repeated k = 12 times so that out-of-sample predictions are available from every basin. [...].

For each model type we trained and tested an ensemble of N = 10 LSTM models to match the 10 SCE restarts used to calibrate the SAC-SMA models.“

I do not understand how this descriptions lead to the interpretation of our evaluation procedure as a bagging approach --- which is actually a model selection procedure that has nothing to do with the the evaluation. I therefore hope that you (i.e., the authors) can find some time during the busy admission process on HESS to explain your thought process so that we can do better in the future.
Well, I also hope that you correct your „mistakes“.
All the best,

Daniel

Minor comment: Since I am already add it I’d also like to give some minor things to correct. While skimming I noticed that the terms bagging and transfer learning are used wrongly: Bagging is a bootstrapping approach for building ensembles and that implies that the data is used with resampling --- else it would not be bootstrapping by definition. Transfer learning is about using knowledge gained from one task to improve the performance on another task. It is thus completely distinct from the things discussed here. I cannot tell if the manuscript has more errors because I stopped reading after the description of the Kratzert et al. (2019) evaluation technique.
Citation: https://doi.org/10.5194/hess-2023-282-CC1
- AC1:
  'Reply on CC1', Reyhaneh Hashemi, 14 Feb 2024
  Dear Daniel Klotz,
  
  Thank you for reviewing our paper and your valuable comments. Below, we address your comments and are open to further clarifying or, if necessary, removing any points raised in the main paper.
  
  ** ** ** ** ** ** ** ** ** ** **
  
  The use of the term bagging
  
  ** ** ** ** ** ** ** ** ** ** **
  
  Regarding your comments on the relationship between the approach used in Kratzert et al. [1] and the concept of bagging [2], we regret that the respective text in our paper was not sufficiently clear. Therefore, we would like to clarify a few points here:
  
  Our paper does not claim to categorize Kratzert et al.’s method directly as a bagging technique. Instead, it highlights conceptual similarities between their approach and the bagging method, as understood in the broader context of ensemble learning.
  
  Our paper does not intend to reflect or indicate any biases within Kratzert et al.'s methodology; there are none.
  
  Referring to the definition of "bagging" in Chapter 7 of Goodfellow et al. [3]:
  
  “Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.”
  
  It further explains,
  
  “Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and contains several duplicate examples (on average around two–thirds of the examples from the original dataset are found in the resulting training set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models.”
  
  We observe two conceptual similarities between Kratzert et al.’s approach and bagging: 1. the aggregation of predictions from an ensemble of K different models, and 2. the creation of K different datasets from sampling an original set.
  
  However, two distinctions include: 1. Kratzert et al. avoid replacement in their sampling, ensuring dataset independence via leave-one-out K-fold splitting, and 2. the12 obtained LSTMs are tested on distinct test sets, highlighting a departure from bagging.
  
  Kratzert et al. [1] utilizes 12 distinct LSTM models, each of which is trained on a different subset. This leads to 12 distinct LSTM models with 12 unique bias-variance patterns (albeit the 12 patterns do not interact in any way, since they are tested on different test sets). The reported performances in Kratzert et al. [1] result from 12 bias-variance patterns.
  
  The method used in our paper employs a single LSTM model without any form of result aggregation. The reported performances result from the single existing bias-variance pattern.
  
  There is, therefore, a clear methodological difference between the approach used in Kratzert et al. [1] and our paper.
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  The use of the term “transfer learning”
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  1. The term "transfer learning" is utilized only once in our paper, specifically in Figure 1, to illustrate the act of transferring knowledge from gauged to ungauged spaces. This usage is entirely literal and directly corresponds with the figure’s concept, aiming to depict the transfer of learning between these two spaces.
  
  2. Referring to the definition of “transfer learning” (as a specific deep learning technique and in the sense you are referencing), Goodfellow et al. [2] explains that:
  
  “Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (e.g., distribution P1) is exploited to improve generalization in another setting (say, distribution P2).
  
  In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2. This is typically understood in a supervised learning context, where the input is the same, but the target may be of a different nature.”
  
  Our paper clearly indicates (in Lines 3 and 154; page 6) that it does not engage in any new learning, fine-tuning, or model modification tasks. The term "fine-tuning" is absent from our paper (neither in the Abstract, the conclusion, nor the main text) leaving no ambiguity that our paper's use of the term "transfer" is distinct from the technical "transfer learning" concept you referenced.
  
  3. In the context of Predictions in Ungauged Basins (PUB), "transfer" is an established term used precisely as it appears in our work. Its application in PUB studies, for transferring knowledge, understanding, and model parameterizations from gauged to ungauged catchments, is numerous. Here we refer to only some examples from the reference PUB paper “A decade of Predictions in Ungauged Basins (PUB)—a review” [4].
  
  - This resulted in some success in the development of catchment classification schemes, similarity frameworks and model regionalization methods for TRANSFERRING KNOWLEDGE and improving predictions in ungauged basins.
  
  - unsuitable regionalization techniques to TRANSFER UNDERSTANDING OF HYDROLOGICAL RESPONSE patterns from gauged to ungauged environments due toa lack of comparative studies across catchments and a lack of understanding of the physical principles governing robust regionalization.
  
  - In contrast to the regionalization of flow metrics, regionalization efforts for TRANSFERRING MODEL PARAMETERIZATIONS from gauged to ungauged catchments have a longer tradition.
  
  Should there be an alternative term that aligns (as precisely) with the concept depicted in Figure 1 and retains the literal essence of "transfer" as defined (e.g., "to move someone or something from one place to another" -- Cambridge Dictionary), we are open to considering its replacement for this singular mention.
  
  Sincerely,
  
  Reyhaneh Hashemi
  
  ** ** ** ** **
  
  References
  
  ** ** ** ** **
  
  [1] Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research. 2019 Dec;55(12):11344-54 (https://doi.org/10.1029/2019WR026065).
  
  [2] Breiman L. "Bagging predictors." Machine learning 24 (1996): 123-140.
  
  [3] Goodfellow I, Bengio Y, Courville A. Deep learning. MIT press, 2016.
  
  [4] M. Hrachowitz, H.H.G. Savenije, G. Blöschl, J.J. McDonnell, M. Sivapalan, J.W. Pomeroy, B. Arheimer, T. Blume, M.P. Clark, U. Ehret, F. Fenicia, J.E. Freer, A. Gelfan, H.V. Gupta, D.A. Hughes, R.W. Hut, A. Montanari, S. Pande, D. Tetzlaff, P.A. Troch, S. Uhlenbrook, T. Wagener, H.C. Winsemius, R.A. Woods, E. Zehe & C. Cudennec (2013) A decade of Predictions in Ungauged Basins (PUB)—a review, Hydrological Sciences Journal, 58:6, 1198-1255 (https://doi.org/10.1080/02626667.2013.803183)
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC1
  - CC2: 'Reply on AC1', Daniel Klotz, 15 Feb 2024
    
    Dear Reyhaneh,
    thank you for answering to my concerns and thank you also for posting your answer so swiftly. Often, as discussion on HESS go, it makes little sense to answer timely. You have my admiration for not doing that.
    Also note that I am not a reviewer. At this stage, I am just a “concerned citizen”. As such, I’d like to address some of your answers, and give you the chance for another response. But, to avoid unnecessary fatigue from your side, I promise that I will not answer again should you choose to address this comment here. I wish you best luck!
    
    Regarding what you said about bagging:
    After reading your answer, I now understand that you want to examine conceptual similarities between the approach from Kratzert et al. and bagging. In your answer you did not commit to any changes in the manuscript, however given your goals I would propose to rewrite the relevant paragraphs. I believe many readers will misinterpret them like I did. To give just one example of what I mean (and this is not exhaustive in any shape of form, but illustrative for what I mean): L.132ff of the current manuscript reads:
    “The k-fold validation based framework [...] exhibits the following characteristics: [...] ungauged predictions are made using an out-of-bag (OOB) [...], but with subsampling performed without replacement.”
    I’d like to argue that it is very difficult to for readers to understand that you now want to compare two approaches (namely: bagging and k-fold cross-validation), rather than formulating one approach in terms of the other. I only understood this after reading your answer and would therefore like to suggest to rewrite sentences like this
    Secondly --- and this might indeed suggest a different view of what constitutes a method and what not --- but for me a very important part of what constitutes a bagging approach is the “sampling with replacement” aspect. I already pointed this out in the minor comment section of my previous comment: The sampling form is the core of what that defines bootstrapping.
    
    The text that you cite mentions this as well. If you claim that one does “bagging without replacement” (as you do in the present form of your manuscript) then that is an oxymoron for me, since the very definition of the first excludes the form. In general, I have to say that there exist many ways to build ensembles, both in ML an in general (e.g., boosting, stacking, or implicit ensembles etc.). I do not understand why one would conceptualize our approach as bagging in particular. Please also note, that I already mentioned that k-fold cross-validation gives an almost unbiased estimate while taking advantage of using all the available data, while sample splitting into training/validation and test-set provides an unbiased estimates that is less data efficient. This is a well-known tradeoff.
    Regarding transfer learning:
    
    In my comment I mentioned that you only use the term transfer learning once. And I maintain that you use it wrongly. Or rather. It makes no sense to use it the way you did. At no point in your setup is information that is gained about one setting exploited to get better predictions for another setting. What would your two settings even be? You do not even have two separate evaluations or tasks.
    The definition and explanation that you quote is in direct contrast to the explanations also show that you use the term wrongly. According to what you told me , Goodfellow et al. (2016) wrote “In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in [task] P1 are relevant to the variations that need to be captured for learning [task] P2.”, which clearly implies that the knowledge that one obtained from task P1 should be used to obtain a better performance on task P2 --- like, for example, is done in finetuning (a common approach in transfer learning). You then progress to say that you do not engage in “new learning, fine-tuning or model modifications”, which is another way of saying that you do not engage in transfer learning. That is, you contradicted the very definition that you use in two sentences.
    However, I do agree with the third point that you give, i.e., transfer has a specific meaning in the context of ungauged basins and has many uses in that context. The current version of the manuscript only uses term transfer learning in one figure. Why not change it and use of the established terms to express what you mean (you already mention several examples in your answer anyways) instead of using an ML term wrongly?
    
    Citation: https://doi.org/10.5194/hess-2023-282-CC2
    
    AC2: 'Reply on CC2', Reyhaneh Hashemi, 23 Feb 2024
    
    Dear Daniel,
    
    Thank you for your prompt and valuable feedback. We greatly appreciate your role and contributions to the community.
    
    Please feel free to share further thoughts you may have, and we will do our best to respond comprehensively.
    
    We will await the reviewers' feedback and proceed accordingly by incorporating the envisaged changes highlighted in bold below.
    
    Sincerely,
    
    Reyhaneh Hashemi
    
    ** ** ** ** ** ** ** ** ** ** **
    
    The use of the term “bagging”
    
    ** ** ** ** ** ** ** ** ** ** **
    
    We acknowledge your concerns regarding the potential for misinterpretation in our manuscript's comparison between the bagging method and the approach used by Kratzert et al. [1]. We understand how this could lead to confusion for the reader.
    
    >> Remove the text pertaining to the bagging comparison (and all impacted sentences) from the manuscript to avoid any misunderstanding.
    
    ** ** ** ** ** ** ** ** ** ** ** ** ** **
    
    The use of the term “transfer learning”
    
    ** ** ** ** ** ** ** ** ** ** ** ** ** **
    
    Our previous response, particularly points 1 and 2, aimed to clarify that the term “transfer learning” used in Figure 1 does not correspond to the deep learning technique known as “transfer learning”. Instead, it was intended to denote a literal “transfer of learning”, akin to the examples provided in point 3 for the PUB context: transfer of understanding, knowledge, parameters, etc.
    
    >> To further clarify this, replace it with “transferring of knowledge” in the manuscript’s figure.
    
    ** ** ** ** **
    
    References
    
    ** ** ** ** **
    
    [1] Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research. 2019 Dec;55(12):11344-54 (https://doi.org/10.1029/2019WR026065).
    
    Citation: https://doi.org/10.5194/hess-2023-282-AC2
RC1:
'Comment on hess-2023-282', Anonymous Referee #1, 28 Feb 2024

Good articles share something in common: an easy-to-follow abstract, a comprehensive literature review, clear research objectives, solid methodology, concise results, and in-depth discussion. Judging from these aspects, this article needs a lot of revision in all sections except for the introduction, particularly in the brevity of the article. This is one of the longest articles I have ever reviewed. Much of the article did not need to be stated in detail. Piling to much dispensable information can prevent the reader from grasping the topic of the paper. Objectively, this work makes a lot of effort to select the most effective PUB methods, and the results are interesting. However, there are many issues with the article that need to be revised.

Major:

[1] Abstract needs to be revised to ensure that it is concise and easy to understand enough.

[2] The methodology section needs to be shortened considerably. Piling to much content here will dilute the theme of the article. To do this, compressing current sections 3-6 into one that consists of three subsections: 3.1 predicting of IO, 3.2 Recreating regime classification, and 3.3 Conceptual benchmarking. Specifically, I recommend using only one method that works best for calculating IQ, even though multiple methods have been tested. Also, some figures and text can be placed in the attachment of the paper. The methodology section should not be overloaded with figures.

[3] The discussion section is an essential part of a scientific paper, but it is missing in this article. The discussion is supposed to interpret and elucidate the significance of the study findings, justify their importance and contributions to current scientific literature, and provide specific suggestions for future research. It needs to be added in the revised manuscript.

Minor:

[1] Abstract: Too many abbreviations are used in the text, which reduces the readability of the manuscript.

[2] Lines 41-45: The importance of the PUB should not be placed in this place; it should appear at the beginning of the introduction.

[3] L49: How to define the ‘success’ for various PUB methods?

[4] Lines175-180: Overall, the all three of the authors’ stated contributions are a bit of stretch. Firstly, similar studies have not been performed within the French context cannot be as a novelty of this research. Also, the third contribution needs to be revisited as previous studies (searching by google scholar) have compared the performance of LSTM and GR4J models in runoff simulations.

[5] Figure 3, the legend and symbol colors in Figure 3f need to be redrawn.

[6] Figure 5: It is recommended that Figure 5 be deleted, as similar information is already included in Table 2.

[7] The hydrologic regime classification should be placed where it first appears, i.e. in the caption of Figure 1.

[8] Briefly describe the four parameter regionalisation methods used, and Figures 12-14 should be moved to the supplementary material.

[9] It makes no sense to compare the performance of the best-performing LSTM and GR4J in three typical basins, except to further increase the length of the article. Thus, please remove Figure 19-24 and corresponding description in the paper.

[10] Conclusion only summarizes the main findings, and results that are not relevant to the research objectives need to be removed, such as the content in lines 576-582.

Citation: https://doi.org/10.5194/hess-2023-282-RC1
- AC3:
  'Reply on RC1', Reyhaneh Hashemi, 19 Apr 2024
  Dear Reviewer,
  
  We greatly appreciate the time and effort you have dedicated to providing a very detailed review of our manuscript. Your comments on each part of the paper, suggesting specific modifications, were both clear and very constructive.
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  Objectives and Research Novelties
  
  ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
  
  This paper builds upon a previous study conducted in the gauged space using LSTMs (and a training catchment set). A crucial, but currently understated, aspect of our research is its overarching narrative: it tracks the performance drop of LSTM as it transitions from the gauged to the ungauged space (tested on a new catchment set) across various hydrological regimes. Moreover, during this transition, the study compares an LSTM with regime-specialized training to a mixed-regime trained LSTM. Both these elements constitute novel contributions to the field. Previous literature has focused on studies performed exclusively within either gauged or ungauged spaces without such a comparative framework (between regime and national LSTMs).
  
  Given the main narrative of our study, it was necessary to maintain consistency in the models used between the two spaces. This requirement led us to diverge from the k-fold training/testing approach commonly utilized in previous studies for performing PUB using LSTMs. Our departure introduces a methodological distinction from previous studies: we employed a single LSTM model characterized by a unique bias-variance pattern, and all results were produced using this consistent pattern. In contrast, previous PUB studies using LSTMs all employed K different bias-variance patterns (i.e., K different LSTM models) to generate their results.
  
  In the revised manuscript, we will elucidate these points to ensure the paper’s central narrative and its novel contributions are unequivocally presented.
  
  ** ** ** ** ** ** ** ** ** ** ** ** **
  
  Length of the Manuscript
  
  ** ** ** ** ** ** ** ** ** ** ** ** **
  
  In developing the manuscript, we aimed to build upon a previous study, exploring a variety of methods applied across numerous tasks. Our intention was to offer a comprehensive understanding that would enable readers to fully grasp and replicate the methods and results presented, leading to the inclusion of extensive details in the manuscript. We acknowledge, as you rightly pointed out, that these extensive details may have contributed to the manuscript's excessive length, potentially obscuring the overarching narrative of the paper.
  
  We gratefully acknowledge the specific modifications suggested for reducing the manuscript's length. We accept all of these recommendations and will integrate them into the revised manuscript.
  
  Our revisions will specifically focus on:
  
  Abstract: Revising it to make it more concise and easier to follow.
  
  Introduction: Clarifying the objective and elucidating the novelty aspects of the paper.
  
  Methodology: Streamlining it with three distinct subsections: 3.1 Predicting IQ, 3.2 Recreating Regime Classification, and 3.3 Conceptual Benchmarking.
  
  Discussion: Developing a separate discussion section in the revised manuscript to compare our findings with those of previous studies, acknowledging differences in methodology and exploring their implications for future research.
  
  Additional and/or Dispensable Figures and Text: Moving them to supplementary material.
  
  Conclusion: Refining it to more explicitly emphasize the main findings in direct relation to the study's objectives.
  
  
  
  Please find attached a point-to-point response to your comments.
  
  Sincerely,
  
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC3
RC2:
'Comment on hess-2023-282', Ralf Loritz, 08 Mar 2024
Dear Authors, Dear Editor,

This manuscript addresses a relevant and timely topic: how to make predictions in ungauged basins using LSTMs. However, due to the presentation quality, several sections appear underdeveloped and could benefit from further refinement to better convey the contributions and findings of the study. In its current form, it is difficult to discern the novelty and significance of the work.
The discussion surrounding Kratzert et al. 2019's study seems to be either wrong or unnecessarily complicated. It is crucial for the manuscript to accurately represent previous works to effectively build upon them. I suggest providing a more straightforward explanation and comparison to other studies that perform ungauged predictions using LSTMs (Arsenault et al., 2023; Feng et al., 2020; and references within) to ensure readers grasp the differences and advancements made in the current study.
Recommendations for improvement include:
Benchmark replication: Replicating Kratzert et al. 2019's results using the CAMELS-US Dataset is recommended (this can be added to the appendix or supplementary materials). This will not only validate the current approach and your codes but also establish a clearer benchmark for comparison.

Clarification of novel contributions: After replicating the benchmark study, the authors should distinctly outline how their work differs from and improves upon previous efforts. For instance, why is the presented division into gauged and ungauged basins better than a k-fold CV (for me it seems a bit arbitrary)? Why would you use a feature engineering approach if you work with a deep learning model? Clearly articulating the advancements in methodology, the robustness of the approach, and any enhancements in predictive accuracy would be beneficial.

Unfortunately, in its current form, I cannot recommend that the manuscript can be published in HESS. However, as the work clearly has potential, I encourage the authors to resubmit a revised version of the manuscript.
Sincerely,
Ralf Loritz
Citation: https://doi.org/10.5194/hess-2023-282-RC2
- AC4: 'Reply on RC2', Reyhaneh Hashemi, 19 Apr 2024
  
  RC :
  
  " This manuscript addresses a relevant and timely topic: how to make predictions in ungauged basins using LSTMs. However, due to the presentation quality, several sections appear underdeveloped and could benefit from further refinement to better convey the contributions and findings of the study. In its current form, it is difficult to discern the novelty and significance of the work."
  AR :
  
  Dear Dr. Loritz,
  We sincerely appreciate your feedback on our manuscript. While we note the concerns raised about the presentation quality and the perception that several sections appear underdeveloped, identifying and locating these specific areas within the manuscript proves challenging without direct references or examples from the reviewer's comments.
  We have undertaken a thorough review of our manuscript and already outlined a detailed plan to refine the manuscript to ensure that our contributions and findings are presented as clearly and concisely as possible. This plan, guided by the suggestions from Reviewer 1, includes elucidating the paper’s main narrative and novelties, addressing concerns related to the manuscript's length, and incorporating a more explicit discussion to better highlight the significance and implications of our findings.
  
  ** ** ** **
  
  RC :
  
  "The discussion surrounding Kratzert et al. 2019's study seems to be either wrong or unnecessarily complicated. It is crucial for the manuscript to accurately represent previous works to effectively build upon them. I suggest providing a more straightforward explanation and comparison to other studies that perform ungauged predictions using LSTMs (Arsenault et al., 2023; Feng et al., 2020; and references within) to ensure readers grasp the differences and advancements made in the current study."
  
  AR :
  
  In our discussion (https://doi.org/10.5194/hess-2023-282-AC1) with Daniel Klotz (one of the authors of Kratzert et al. 2019) we explain that our presentation of their work is grounded in the methodological difference between the approach used in Kratzert et al (as well as all previous studies) and ours; which is a clear and distinct difference: they used 12 different LSTMs with 12 different bias-variance patterns, and their results are concatenated from an ensemble of 12 LSTMS while we used 1 single with one single bias-variance pattern, no result concatenation takes place.
  
  We have also outlined in our response to them (https://doi.org/10.5194/hess-2023-282-AC2) that we are committed to remove the usage of the term “bagging” to prevent the least misunderstanding about their work and we will focus on clearly explaining the methodological differences between their PUB method and our own.
  
  ** ** ** **
  
  RC :
  
  " Benchmark replication: Replicating Kratzert et al. 2019's results using the CAMELS-US Dataset is recommended (this can be added to the appendix or supplementary materials). This will not only validate the current approach and your codes but also establish a clearer benchmark for comparison. Clarification of novel contributions: After replicating the benchmark study, the authors should distinctly outline how their work differs from and improves upon previous efforts. For instance, why is the presented division into gauged and ungauged basins better than a k-fold CV (for me it seems a bit arbitrary)? Clearly articulating the advancements in methodology, the robustness of the approach, and any enhancements in predictive accuracy would be beneficial."
  
  AR :
  
  We thank the reviewer for this suggestion. However, the primary objective of our paper is not to enhance the PUB results achievable with LSTMs, for which benchmarking against the approach used by Kratzert et al. would indeed be relevant. It appears this aspect of our study's narrative has not been communicated clearly or has been misunderstood in the current manuscript version.
  
  Our research builds upon prior work, employing two classes of LSTMs: one with regime-specialized training and another with mixed training, trained in the gauged space (using a training catchment set). These models are then transferred to the ungauged space (for a new test catchment set) across different hydrological regimes to observe their performance degradation. The paper's main contribution lies in comparing the performance of these two LSTMs during the transition from gauged to ungauged spaces specifically across various hydrological regimes. To preserve the integrity of this comparison, consistency between the models in both spaces is obligatory, necessitating our departure from the k-fold training-testing approach commonly employed in previous studies.
  
  It is important to clarify that our study does not seek to evaluate whether the k-fold training-testing approach is superior or inferior to using a single LSTM model. The k-fold training-testing approach was not utilized because it did not serve the specific objectives of our research. Our paper's narrative is different from that of Kratzert et al. 2019 and other prior PUB studies, which typically develop multiple LSTMs exclusively for PUB application.
  
  In the revised manuscript, we will more clearly articulate these differences in objectives (and thus methodologies) between our work and preceding PUB studies, ensuring the unique contributions of our research are fully understood.
  
  ** ** ** **
  
  RC:
  
  Why would you use a feature engineering approach if you work with a deep learning model?
  
  AR:
  
  Thank you for raising this point about feature engineering. We would like to clarify that in our paper, feature engineering is not applied to any deep learning models. It pertains solely to the XGBoost classifier, which is a machine learning model that should benefit from feature engineering to enhance performance (contrary to deep learning models).
  Sincerely,
  
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2023-282-AC4

Reyhaneh Hashemi, Pierre Javelle, Olivier Delestre, and Saman Razavi

Viewed

Total article views: 1,178 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
771	343	64	1,178	64	76

HTML: 771
PDF: 343
XML: 64
Total: 1,178
BibTeX: 64
EndNote: 76

Views and downloads (calculated since 18 Dec 2023)

Month	HTML	PDF	XML	Total
Dec 2023	212	41	5	258
Jan 2024	64	30	3	97
Feb 2024	99	25	9	133
Mar 2024	60	20	7	87
Apr 2024	49	21	15	85
May 2024	42	20	5	67
Jun 2024	22	7	2	31
Jul 2024	22	8	5	35
Aug 2024	17	3	0	20
Sep 2024	14	4	1	19
Oct 2024	8	4	0	12
Nov 2024	12	5	0	17
Dec 2024	13	7	0	20
Jan 2025	13	13	4	30
Feb 2025	21	14	1	36
Mar 2025	17	11	6	34
Apr 2025	14	20	0	34
May 2025	22	14	0	36
Jun 2025	31	45	1	77
Jul 2025	17	26	0	43
Aug 2025	2	5	0	7

Cumulative views and downloads (calculated since 18 Dec 2023)

Month	HTML	PDF	XML	Total
Dec 2023	212	41	5	258
Jan 2024	64	30	3	97
Feb 2024	99	25	9	133
Mar 2024	60	20	7	87
Apr 2024	49	21	15	85
May 2024	42	20	5	67
Jun 2024	22	7	2	31
Jul 2024	22	8	5	35
Aug 2024	17	3	0	20
Sep 2024	14	4	1	19
Oct 2024	8	4	0	12
Nov 2024	12	5	0	17
Dec 2024	13	7	0	20
Jan 2025	13	13	4	30
Feb 2025	21	14	1	36
Mar 2025	17	11	6	34
Apr 2025	14	20	0	34
May 2025	22	14	0	36
Jun 2025	31	45	1	77
Jul 2025	17	26	0	43
Aug 2025	2	5	0	7

Viewed (geographical distribution)

Total article views: 1,132 (including HTML, PDF, and XML) Thereof 1,132 with geography defined and 0 with unknown origin.

Country	#	Views	%

Cited

Latest update: 07 Aug 2025

Short summary

Here, we have tackled the challenge of estimating water flow in areas without direct measurements, a crucial task in hydrology. We have applied deep learning techniques to a large sample of French catchments with various hydrological regimes. We have also compared our approach with traditional methods. We found that incorporating more data improves the accuracy of our deep learning predictions. Notably, our method outperforms traditional approaches in certain regimes, though not universally.


Total:	0
HTML:	0
PDF:	0
XML:	0

Closing the data gap: runoff prediction in fully ungauged settings using LSTM

Viewed

Viewed (geographical distribution)

Cited

1 citations as recorded by crossref.