the Creative Commons Attribution 4.0 License.
HESS Opinions: Never train an LSTM on a single basin
Abstract. Machine learning (ML) has played an increasing role in the hydrological sciences. In particular, certain types of time series modeling strategies are popular for rainfall–runoff modeling. A large majority of studies that use this type of model do not follow best practices, and there is one mistake in particular that is common: training deep learning models on small, homogeneous data sets (i.e., data from one or a small number of watersheds). In this position paper, we show that Long Short-Term Memory (LSTM) streamflow models are best when trained with a large amount of hydrologically diverse data.
Status: final response (author comments only)
-
RC1: 'Comment on hess-2023-275', Marvin Höge, 05 Feb 2024
Review of “HESS Opinions: Never train an LSTM on a single basin” by Frederik Kratzert et al., 2024
Summary
The Opinion paper addresses a common issue and misconception in the application of LSTM models for hydrologic streamflow prediction: often, LSTM models are trained and evaluated on only a few basins or even a single one. This leads to sub-optimal performance, as LSTM models benefit from being trained on a large variety of data, which is increasingly available in hydrology. Therefore, the authors suggest conducting LSTM training with such large-sample datasets as best practice (outlining that additional fine-tuning might be an option for single-basin or small-set applications). Further, the paper focuses on training data diversity and the optimal setup of training sets.
Evaluation and Recommendation
The Opinion paper covers a timely topic since LSTM models are state-of-the-art tools for streamflow prediction and a variety of other tasks in the broader geosciences. With LSTMs being increasingly used (which is also shown in the paper), the presented topic is important and meets the community’s interest.
The manuscript is well written and referenced. The codes that were used are freely available and data sources are referenced. The figures are of good quality. Yet, the current manuscript requires some specifics to be addressed (see below). In particular, section “5 Is hydrological diversity always an asset?” addresses a very important topic – at the same time, it would benefit from some iteration as is also specified in the comments below.
I recommend publication after minor revisions.
Specific comments
l.18-19: “We do not mean top-down vs. bottom-up in the sense discussed by Hrachowitz and Clark (2017).” Please specify briefly their definition of top-down and bottom-up.
l.28-29: “We see no reason why…” Please briefly elaborate on these reasons.
l. 32-34: Please rewrite for clarity, e.g. split up in two sentences.
l.50 ff & Figure 2: It is stated that 400+ catchments from CAMELS were modelled with mHM and VIC for comparison. How many of these overlapped with your 531 basins? There has to be a large overlap anyway, with 671 basins in CAMELS in total, but I think it would be interesting information to know. Or better: why not show the cumulative plots for only the basins that are in both the VIC+mHM set and the LSTM set? This would make the comparison stricter.
l.63: “to 1 (-1)” -> unclear, is this [1 -1]? Please specify.
l.64-65: “size equal to the number of cell states.” -> Please add one or two explanatory sentences. This refers to the model architecture and might not be clear to all readers.
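The architecture point the reviewer asks the authors to explain can be made concrete with a minimal, self-contained sketch of a single LSTM step (hypothetical random weights and sizes, not the authors' trained model): the hidden state is h_t = o_t · tanh(c_t), so every entry is bounded to (-1, 1) and the hidden vector has exactly one entry per cell state.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps the concatenated [x, h_prev] to the 4 gates."""
    n = c_prev.size                              # number of cell states
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, g, o = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c_prev + sig(i) * np.tanh(g)    # cell state update
    h = sig(o) * np.tanh(c)                      # hidden state: each entry in (-1, 1)
    assert h.size == n                           # one hidden entry per cell state
    return h, c

rng = np.random.default_rng(0)
n_cells, n_in = 8, 3
W = rng.normal(size=(4 * n_cells, n_in + n_cells))
b = np.zeros(4 * n_cells)
h, c = np.zeros(n_cells), np.zeros(n_cells)
for _ in range(50):                              # drive with random inputs
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.round(3))
```

However long the sequence, every component of h stays strictly inside (-1, 1), which is the bound the manuscript refers to in l.63-65.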
l.73: “no model captures all of the extremes” -> Agreed. Yet, I wonder whether the chosen basin in Fig. 4 represents "extreme events" well. The hydrograph shows a rather regular pattern with a peak flow ranging roughly between 5 and 15 mm/d every year. I think a more irregular hydrograph with, e.g., one or two intense peaks (or missing peaks in some years) would better illustrate extreme events.
l.83-84: “In other words, even these 531 basins are most likely not enough to train optimal LSTM models for streamflow.” Why do you think that is? Please elaborate. Are there indications for an upper limit of the objective value and a corresponding sufficient training size from research with CARAVAN that could be shown here? (This last point could also be part of conclusions and outlook)
l.85ff: This section covers quite a range of aspects. I suggest restructuring it a little, e.g., into subsections: one that contains Figure 6 and the corresponding text, and one that covers conjectures about the effects of larger datasets and data-split approaches. Nonetheless, these are interesting points to be discussed, since they might also pave the way for further research.
l.87ff: “more is always better, as far as we have seen), and variety refers to the (hydrologic) diversity of data.” -> Both points, volume and variety, are important to point out. There is an issue that I think could be mentioned and discussed in this context as well: class imbalance (even if this is not a classification problem). This might also be part of the explanation behind the effects discussed in the rest of section 5. In the Caravan paper (Kratzert et al., 2023), there is a histogram showing the distribution of catchments over the different climatic zones of the earth and the corresponding distribution in the dataset. There, a class imbalance is visible, which indicates why predictions for certain climatic basin classes are better or worse, since those classes are over- or underrepresented.
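To make the imbalance point concrete: one standard mitigation is inverse-frequency weighting of training samples. The sketch below uses made-up basin counts per climate zone (illustrative only, not the actual Caravan numbers):

```python
import numpy as np

# Hypothetical counts of training basins per climate zone (illustrative only).
zones = np.array(["arid", "temperate", "cold", "tropical"])
counts = np.array([40, 300, 150, 10])

# Inverse-frequency weights: under-represented zones get proportionally larger
# weight, so each zone contributes equally to the expected training loss.
weights = counts.sum() / (len(counts) * counts)
for zone, w in zip(zones, weights):
    print(f"{zone:>9}: {w:5.2f}")
```

Whether such reweighting actually helps LSTM streamflow models is exactly the kind of open question section 5 touches on.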
l.148: “training (1 October 1999 through 30 September 2000), validation” -> mix-up of dates? This is a very short training period.
l.210-211: “without carbonate rocks fraction and the seasonality of precipitation” -> I agree on dropping the seasonality of precipitation. But I would assume that carbonate rock fraction and related karst flow properties might be an important feature to include in the clusters. Did you investigate in which clusters the basins with a carbonate rock fraction larger than 0 (or above a certain threshold) fall? Maybe having a dedicated “karst” cluster might be an option?
l.215: “into 5 groups” -> should be “into 6 groups”, right?
Fig A4: Interesting to see how geographically aligned the different clusters are (apart from small accumulations separated from the bulk of a certain cluster, like the small groups of cluster 2 (orange) and 6 (brown) north of cluster 1 (blue)). With 6 being the detected optimum number of clusters, did you look into the neighboring 5- and 7-cluster solutions with respect to the spatial distribution on the map and the performance gain/loss of the model?
l.249: “Fig ??” -> compilation error, please check reference
Tables and Figures
Fig. 3.: y-axis label on right hand plot not necessary if figure outline kept like this
Fig. 5 could be dropped since its content is also shown in Figure 6 and I think the value of showing the blue line alone is not as significant.
Language
Very good and clear language, only small remarks as follow:
- 63: “bounded bounded” -> bounded
- 69-70: “… in total 10 timesteps of streamflow observations…” -> … in total 10 streamflow observations…
Citation: https://doi.org/10.5194/hess-2023-275-RC1
- AC1: 'Reply on RC1', Frederik Kratzert, 28 Feb 2024
-
CC1: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 05 Feb 2024
Is the scientific field mired in improper methodologies adopted by journal offices?
Sivarajah Mylevaganam
Alumnus, Spatial Sciences Laboratory, Texas A&M University, College Station, USA.
I presume that the lifecycle of a manuscript goes through the steps outlined below.
- Performing a modeling work (e.g., mathematical modeling or physical modeling)
- Writing a manuscript based on what has been derived from the modeling work.
- Submitting the manuscript to a journal office that has advertised to have achieved a high impact factor.
- Reviewing of the manuscript (probably a rigorous review) based on what has been presented in the manuscript by the authors.
- Accepting or rejecting the manuscript based on what has been presented in the manuscript by the authors.
Does step 4 assert that the authors have done the modeling work correctly? In my opinion, even a seasoned reviewer with extensive experience and many qualifications may not be able to catch a bug if the authors are smart enough to present the manuscript in a way that gets through the system of processing. The following questions are pertinent at this level to understand the methodologies adopted by journal offices.
- Is it possible to assert that no PhD students have tweaked modeling results or analysis to attain their titles and ambitions? Would we need hypothesis testing on this?
- Is it possible to assert that faculty members who are on the edge of a knife have never tweaked modeling results or analysis to meet their needs?
- Is it possible to assert that faculty members have never tweaked modeling results or analysis to progress in their careers?
- Is it possible to assert that the faculty members who supervise students are conversant with all the nuts and bolts of the work carried out by the students?
- Is it possible to assert that a manuscript accepted by a journal office with a high impact factor has no tweaked modeling results or analysis?
In my opinion, considering this loophole in the methodology adopted by journal offices, neither advancements in technology nor in scientific theories will yield the best of what is desirable for the betterment of the scientific field.
Acknowledgement and Disclaimer
The author is an alumnus of Texas A&M University, Texas, USA. The views expressed here are solely those of the author in his private capacity and do not in any way represent the views of Texas A&M University, Texas, USA.
-
AC4: 'Reply on CC1', Frederik Kratzert, 28 Feb 2024
-
CC9: 'Reply on AC4', Sivarajah Mylevaganam, 29 Feb 2024
Publisher’s note: the content of this comment was removed on 29 February 2024 after approval of the HESS executive editors since the formulations were inappropriate.
Citation: https://doi.org/10.5194/hess-2023-275-CC9
-
CC2: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 06 Feb 2024
Sivarajah Mylevaganam
Alumnus, Spatial Sciences Laboratory, Texas A&M University, College Station, USA.
The history of hydrological models goes back many decades. The progress made to improve hydrological models through scientific findings has shown an endless road, guiding the next generation of hydrologists and associated specialists in the quest for betterment. In this manuscript, the authors, who are not from the same hydrological basin, have employed an ML model to predict streamflow by training the model with hydrologically diverse data that is spatially and temporally large in extent. Based on the research, the authors draw the concrete conclusion that previous modeling works using ML models (found based on an ad-sense-free search) have failed to understand the underlying philosophy of training and testing ML models.
In my opinion, the current version of the manuscript has many flaws. Moreover, the way the manuscript has been presented gives an impression that the authors are far off from the field of hydrology. Therefore, the current version of the manuscript needs an expert in hydrology to go through it in detail in an unbiased manner. Furthermore, the language used in the manuscript needs to be edited by a language specialist, as the way the manuscript has been presented, and the words that have been coined throughout, are beyond what is expected in a scientific manuscript submitted to a journal office that advertises to have achieved a high impact factor based on rigorous reviews by a panel of experts in the fields of specialization listed in the scope of the journal.
The following comments are posted.
- In my opinion, the abstract of the manuscript needs to be re-written. I would say that an ML model with limited and poor data has been employed in writing the abstract. Moreover, what has been highlighted (“there is one MISTAKE in particular that is common”) is not well documented.
- Line 10-Line 14:
Hydrology models based on machine learning (ML) are different – ML models work best when trained on data from many watersheds (Nearing et al., 2021). This citation needs to be evaluated against the conclusion (i.e., ML models are best when trained with a large amount of hydrologically diverse data) that the authors draw from the manuscript that is submitted to this journal office.
- Line 10-Line 14:
Because ML models are trained with data from multiple watersheds, they are able to learn hydrologically diverse rainfall–runoff responses (Kratzert et al., 2019b). This citation is from 2019; the citation in comment 2 is from 2021. Are the cited manuscripts conveying the same thoughts? If so, the rationale for citing both manuscripts is not well understood.
- Line 17-Line 23:
The paragraph needs to be critically reviewed by a specialist. The sentences need to be evaluated. The paragraph gives an impression that the authors lack fundamental knowledge in the subject. The terminologies and words from an English dictionary are thrown in without understanding their exact meaning.
- The crux of the manuscript that is highlighted by the authors (i.e., LSTM streamflow models are best when trained with a large amount of hydrologically diverse data) is a well-established fact in the scientific field. Basically, the authors are hitting the concept of the SAMPLING SIZE of an experiment. Therefore, instead of going through the painful path of running models to determine the number of basins, it would be wiser to go through some statistical methods to answer this question (i.e., sampling size). In fact, even the current version of this manuscript does not give the exact number that would be required. It is a random number (531) that the authors have ended up with based on what has been analyzed (see Fig. 5).
- Line 30-Line 36:
Does the order of the KEYWORDS have an influence on your search? Is the search from the engine not prioritized by the engine provider based on the business model employed? What was the reason to limit the search to 2021? Based on Fig. 1, it is understood that there are more than 3500 publications. Even if we consider the authors' statement that the review was initiated in September 2022, an iota of incompleteness surfaces.
- Line 25-Line 29:
The authors claim that the use of LSTMs for rainfall–runoff modeling has increased exponentially in the last several years. A figure (Fig. 1) to support the claim is found in the manuscript. As per the figure, a rough estimation considering the heights of the bars gives an indication that around 8500 (=3500+2500+1500+500+400+100) manuscripts have been found on the topic that the authors have investigated. Referring to the previous comment, the authors have considered 100 manuscripts based on the search from a search engine of their interest. In other words, this manuscript is based on 100/8500*100% ≈ 1% of the manuscripts found in the literature. What can be inferred from the training dataset that is employed in reviewing the literature? Will it lead to the conclusion that, similar to an ML model, the limited publications reviewed lead to wrong conclusions?
- The title of the manuscript needs to be assessed by a specialist. What is a basin? What is a watershed? What is a catchment? What is a region? What is the amount of data that a single basin possesses? What is the spatial and temporal extent of the basin that the authors are defining in the title of the manuscript? What is the heterogeneity level of the basin that the authors are defining in the title of the manuscript?
- Refer to Part III
Acknowledgement and Disclaimer
The author is an alumnus of Texas A&M University, Texas, USA. The views expressed here are solely those of the author in his private capacity and do not in any way represent the views of Texas A&M University, Texas, USA.
-
AC5: 'Reply on CC2', Frederik Kratzert, 28 Feb 2024
See our reply to CC1 for a summary reply to all comments from SM.
Citation: https://doi.org/10.5194/hess-2023-275-AC5
-
CC3: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 06 Feb 2024
Sivarajah Mylevaganam
Alumnus, Spatial Sciences Laboratory, Texas A&M University, College Station, USA.
9) Line 90-Line 94:
We have some evidence that there might be ways to construct training sets that could result in better models than simply training on all available streamflow data. We do not have results that support this directly.
These statements are not understood.
10) Line 79-Line 84: Figure 5 shows how test period performance increases as more basins are added to the training set. Performance continues to increase up to the maximum size of the CAMELS data set (531 basins). In other words, even these 531 basins are most likely not enough to train optimal LSTM models for streamflow.
In my opinion, these statements do not make sense. To better understand your methodology, assume that we have 16 basins (not 531 basins as in your analysis), as shown in the figure in the attached PDF file. For simplicity, let us forget the shapes of the basins. Moreover, assume that you have added basins A, B, and C to the training set to derive the associated NSE. Likewise, you add more basins to your training set to have an array of NSE values to show a plot like the one that you have shown in Fig. 5. Based on this figure, is it meaningful to conclude that the performance of the model increases as more basins are added to the training set? From a hydrological point of view, does it make sense to have a streamflow that is sourced by basins A, B, and C? Would the stream network that sources the flow at a location of interest become discontinuous?
The NSE value that you have reported for a training set size of 100 basins (see Fig. 5) may not actually represent the hydrology, although the reported NSE value is very close to the NSE value for a training set size of 531 basins that may represent the actual hydrology. In ML models, you need to understand the theory that governs the system of equations. An in-depth understanding of the system of equations and how they are formulated will lead to understanding the physics. Do the catchment attributes that you have chosen in your analysis play a role in the NSE values that you have reported in Fig. 5?
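For readers following this exchange, the NSE (Nash–Sutcliffe efficiency) discussed here is one minus the ratio of the squared simulation error to the variance of the observations. A minimal sketch with synthetic numbers (purely illustrative):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations.
    1.0 is a perfect fit; 0.0 means no better than predicting the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0])                  # synthetic "observed" flows
assert nse(obs, obs) == 1.0                           # perfect simulation
assert abs(nse(obs, np.full(4, obs.mean()))) < 1e-12  # mean predictor scores 0
```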
Acknowledgement and Disclaimer
The author is an alumnus of Texas A&M University, Texas, USA. The views expressed here are solely those of the author in his private capacity and do not in any way represent the views of Texas A&M University, Texas, USA.
-
AC6: 'Reply on CC3', Frederik Kratzert, 28 Feb 2024
See our reply to CC1 for a summary reply to all comments from SM.
Citation: https://doi.org/10.5194/hess-2023-275-AC6
-
CC4: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 07 Feb 2024
Sivarajah Mylevaganam
Alumnus, Spatial Sciences Laboratory, Texas A&M University, College Station, USA.
11) Line 12-Line 13:
Because ML models are trained with data from multiple watersheds, they are able to learn hydrologically diverse rainfall–runoff responses (Kratzert et al., 2019b) in a way that is useful for example for prediction in ungauged basins (Kratzert et al., 2019a).
First In, First Out (FIFO) is the concept that is implemented in citing (i.e., 2019a should come first). Since I don't know what is being implemented by this journal office, I would let the journal office pay attention to this.
12) Line 21-Line 23:
Then, in the bottom-up approach, after a model is developed, we might work on regionalization strategies to extrapolate parameters and parameterizations to larger areas (e.g., Samaniego et al., 2010; Beck et al., 2016).
Using "e.g." when citing previous research works is considered evasive language in the scientific field. This gives an indication that the authors have failed to document a comprehensive review of the literature.
13) Line 34-Line 36:
We collected these 100 papers for review in September, 2022, nearly three years after the original regional LSTM rainfall–runoff modeling papers (Kratzert et al., 2019a, b) were published.
The critical review of these 100 papers is not found in the current version of the manuscript. Throughout this manuscript, the authors have freely cited their own works. Considering the level of their knowledge in the field of hydrology and the other related disciplines that is reflected in the current version of the manuscript, I would have my reservations about the cited manuscripts.
14) Line 106-Line 107:
We selected a k-means cluster model based on a maximin criterion on silhouette scores, which resulted in a model with 6 clusters ranging from 59 to 195 basins per cluster.
I would say that your clustering methodology is inappropriate for this task as it would completely destroy the stream network and the underlying hydrology. This is one of the reasons why we have HUCs in the datasets that you have used in your analysis. Do you know the exact definition of HUC and the rationale behind developing HUCs? Your Fig.A4 is completely meaningless considering the purpose of the manuscript. I would suggest you have an in-depth look at your Fig.A3 and Fig.A4. What do you learn from those figures?
15) Line 235-Line 240:
For a particular basin of your interest, would you be able to show the values of Ws and the associated hs?
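On the "maximin criterion on silhouette scores" quoted above: the manuscript does not spell out the procedure, but one plausible reading is to pick the number of clusters k that maximizes the minimum per-basin silhouette score. A self-contained numpy sketch on synthetic 2-D data (toy data, not the CAMELS catchment attributes):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; keeps a center in place if its cluster empties."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def silhouette(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b); 0 where undefined."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i]) & (np.arange(len(X)) != i)
        others = set(labels.tolist()) - {int(labels[i])}
        if not same.any() or not others:
            continue
        a = D[i, same].mean()                         # mean intra-cluster distance
        b = min(D[i, labels == j].mean() for j in others)  # nearest other cluster
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated synthetic blobs: the maximin criterion should pick k = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
best_k = max(range(2, 6), key=lambda k: silhouette(X, kmeans(X, k)).min())
print(best_k)
```

This only illustrates the selection criterion on toy data; in the paper the clustering is over catchment attributes and the resulting optimum was 6 clusters.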
Acknowledgement and Disclaimer
The author is an alumnus of Texas A&M University, Texas, USA. The views expressed here are solely those of the author in his private capacity and do not in any way represent the views of Texas A&M University, Texas, USA.
-
AC7: 'Reply on CC4', Frederik Kratzert, 28 Feb 2024
See our reply to CC1 for a summary reply to all comments from SM.
Citation: https://doi.org/10.5194/hess-2023-275-AC7
-
CC5: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 09 Feb 2024
Does the number of basins influence the modeling outcome using ML models?
A systematic methodology to answer the question being raised is presented. This method ensures that the underlying hydrology is preserved as well as possible while answering the question.
Step-1:
Draw an artificial stream network. The stream network shall have one limb or line per basin. In other words, if you have 531 basins in your dataset, your stream network will have 531 limbs or lines.
Step-2:
Use one of the existing techniques to order or name the stream developed in step 1. For example, if you are using Strahler's method to order your stream network, you will end up with a network like the one shown below. There are many methods to order or name a stream network. Therefore, you will have to do some research to determine the best method that is suited for the problem that you are solving.
Step-3:
Run your ML model, considering all the basins that you have in your dataset. Report the coefficient that is of your interest (e.g., NSE).
Step-4:
Run your ML model, considering the basins after removing the smallest stream number. If you are referring to the above figure, the smallest number will be 1. You can randomly remove the streams from the network.
Step-4-1:
Assume that you have 50 basins with "1". You can run your ML model after removing those 50 basins. In other words, you will have 531–50=481 basins in your simulation. Report the coefficient that is of your interest (e.g., NSE).
Step-4-2:
Assume that you have 50 basins with “1”. You can run your ML model after removing one of those 50 basins. You can randomly remove the basin from the network. In other words, you will have 531-1=530 basins in your simulation. You repeat this procedure by removing them one by one. In other words, your final simulation will have 531–50=481 basins in your simulation. Report the coefficient that is of your interest (e.g., NSE). You will have an array of NSEs.
Step-5:
Run your ML model, considering the basins after removing the next smallest stream number. If you are referring to the above figure, the next smallest number will be 2. You can randomly remove the streams from the network.
Step-5-1: Refer to Step-4-1
Step-5-2: Refer to Step-4-2
Step-6:
Continue your simulation work until you reach the largest stream number in your network. If you are referring to the above figure, the largest stream number will be 3.
Step-7:
Plot your NSEs and answer the question that has been raised.
In summary, a reverse algorithm needs to be developed to address the problem. Moreover, being conversant with spatial operations using products from ESRI or other vendors is required.
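Step 2's Strahler ordering can be sketched in a few lines. The network below is hypothetical (the commenter's attached figure is not reproduced here): each key is a stream segment whose value lists its upstream tributaries.

```python
# Hypothetical stream network: segment -> upstream tributaries.
# Headwater segments (empty lists) get Strahler order 1.
upstream = {
    "outlet": ["r1", "r2"],
    "r1": ["h1", "h2"],
    "r2": ["h3", "h4"],
    "h1": [], "h2": [], "h3": [], "h4": [],
}

def strahler(segment):
    """Strahler order: the max of the tributaries' orders, plus 1 if that
    max is attained by two or more tributaries; headwaters are order 1."""
    orders = [strahler(t) for t in upstream[segment]]
    if not orders:
        return 1
    top = max(orders)
    return top + 1 if orders.count(top) >= 2 else top

print(strahler("outlet"))  # two order-2 tributaries join, so the outlet is order 3
```

With such orders computed per basin, the removal experiment in steps 4-6 amounts to filtering the basin list by order before retraining.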
-
AC8: 'Reply on CC5', Frederik Kratzert, 28 Feb 2024
See our reply to CC1 for a summary reply to all comments from SM.
Citation: https://doi.org/10.5194/hess-2023-275-AC8
-
RC2: 'Comment on hess-2023-275', Markus Hrachowitz, 13 Feb 2024
- AC11: 'Reply on RC2', Frederik Kratzert, 04 Mar 2024
-
RC3: 'Comment on hess-2023-275', Anonymous Referee #3, 15 Feb 2024
In this study, rainfall–runoff data from 531 watersheds in the CAMELS dataset were simulated utilizing the Long Short-Term Memory (LSTM) model. The LSTM model, when trained with multiple basins, exhibited superior performance compared to those trained with a single basin. The authors assert that LSTM streamflow models yield optimal results when trained with extensive and hydrologically diverse datasets, encompassing as many watersheds as feasible. Regardless of the specific objectives a researcher may have for training a machine learning (ML)-based rainfall–runoff model, there appears to be no compelling reason not to employ a large-sample dataset for training purposes. Segregating basins into groups based on hydrologically relevant characteristics appears to enhance the performance of models trained with limited data, surpassing models trained on randomly grouped basins of comparable size.
I concur with the authors' viewpoints. The research outlined in the manuscript is methodical and enhances our comprehension of the necessity to utilize data from numerous basins for effectively training LSTM models for rainfall-runoff modeling. The statistical analyses conducted by the authors are robust. Nevertheless, significant revisions are imperative prior to considering this manuscript for acceptance.
My suggestions and critiques are enumerated as follows:
Abstract: The abstract contains excessive information regarding the popularity of ML and LSTM in rainfall-runoff modeling. To align with the traditional structure of scientific papers, I propose including more details regarding the methodology implementation, delineating how the LSTM model's performance improves with a larger dataset, and specifically identifying the most crucial influencing factors for LSTM training in rainfall-runoff modeling.
Section 1: While it is true that ML algorithms vary, it would be beneficial for the authors to elaborate on why LSTM has gained prominence, particularly due to its development for time series prediction, and elucidate why other ANN systems are seldom employed in hydrological studies.
Section 2: Figure 2 presents results that may be surprising to readers. The authors should provide more information on why VIC basin and mHM basin outperform VIC regional and mHM regional. While these findings support the authors' assertions, additional context is required to evaluate their accuracy. Consider transferring relevant information from the appendix to Section 2 for clarification.
Section 3: The utilization of 531 CAMELS basins, encompassing more extreme runoff data, to enhance LSTM prediction is comprehensible. However, clarity is lacking regarding Figure 3 and the content between Lines 61 and 67. Additionally, it remains unclear why there is a limitation on regional models in Figure 2, and how the bounded vector in the LSTM influences runoff prediction as mentioned in Lines 63-64.
Section 4: Would presenting Figure 5 on a logarithmic scale for the x-axis be more appropriate? This aspect requires further consideration.
Section 5: Enhancements to Figure 6 could involve using more contrasting colors to ensure clarity in black-and-white printed versions.
Code and Availability: While the code and data are accessible for download, they remain challenging to execute, especially for those unfamiliar with the NeuralHydrology Python package. I recommend providing supplementary materials to furnish more detailed information on these packages.
Considering the significance of the content, it is advisable to integrate essential information from the appendix into Sections 1-5, reserving detailed model settings for supplementary materials.
Citation: https://doi.org/10.5194/hess-2023-275-RC3
- AC10: 'Reply on RC3', Frederik Kratzert, 28 Feb 2024
-
CC6: 'Comment on hess-2023-275', John Ding, 16 Feb 2024
The concept of summation vs. unit hydrograph
By the title alone of their opinion paper, I admire the frankness of the authors admitting the shortcomings of the LSTM networks when applied to a SINGLE basin.
In the LSTM, the form of the conversion function for the hidden gate h_t is tanh, e.g., Kratzert et al. (2018, Equation 7). As I observed previously, this is similar in shape to a summation or S-curve hydrograph in unit hydrograph theory, e.g., Lees et al. (2022, CC1 and CC2 therein).
For the hidden gate, I suggest the authors consider taking one further step of using the first derivative of the conversion function. This is equivalent to using the form of an instantaneous unit hydrograph or impulse response function in convolution integral, e.g., Ding (1974). This, I believe, will inject some hydrologic realism into the LSTM.
The bottom half of Figure 4 for Buffalo Fork, Wyoming (USGS Gage 13011900) for the single LSTM basin model clearly indicates that an impulse response model having a distinct peak time and magnitude characteristic, e.g., Jeong and Kim (2023, Figure 2), would outperform the LSTM as now configured.
A reconfigured LSTM as suggested above may perform as well as, if not better than, the impulse response ones.
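The suggested step is straightforward analytically: the derivative of tanh is sech²(t) = 1 − tanh²(t), a single-peaked pulse, so the S-curve/instantaneous-unit-hydrograph analogy carries over directly. A small numerical check (illustrative only, not the authors' model):

```python
import numpy as np

t = np.linspace(-6.0, 6.0, 2001)
s_curve = np.tanh(t)            # S-shaped "summation hydrograph" analogue
pulse = 1.0 - np.tanh(t) ** 2   # analytic derivative sech^2(t): a unimodal pulse

# The pulse peaks at t = 0, like an instantaneous unit hydrograph, and its
# area equals the total rise of the S-curve.
area = pulse.sum() * (t[1] - t[0])
assert abs(pulse.max() - 1.0) < 1e-12
assert abs(area - (s_curve[-1] - s_curve[0])) < 1e-6
# A numerical derivative of the S-curve matches the analytic pulse.
assert np.allclose(np.gradient(s_curve, t), pulse, atol=1e-4)
```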
References
Ding, J.Y., 1974. Variable unit hydrograph. Journal of Hydrology, 22(1-2), pp.53-69.
Jeong, M. and Kim, D.H., 2023. Instantaneous physical rainfall–runoff prediction technique using a power–law relationship between time to peak and peak flow of an instantaneous unit hydrograph and the rainfall excess intensity. Journal of Hydroinformatics, 25(2), pp.415-431.
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, https://doi.org/10.5194/hess-22-6005-2018, 2018.
Lees, T., Reece, S., Kratzert, F., Klotz, D., Gauch, M., De Bruijn, J., Kumar Sahu, R., Greve, P., Slater, L., and Dadson, S. J.: Hydrological concept formation inside long short-term memory (LSTM) networks, Hydrol. Earth Syst. Sci., 26, 3079–3101, https://doi.org/10.5194/hess-26-3079-2022, 2022.
Citation: https://doi.org/10.5194/hess-2023-275-CC6
- AC2: 'Reply on CC6', Frederik Kratzert, 28 Feb 2024
-
RC4: 'Comment on hess-2023-275', Juliane Mai, 20 Feb 2024
Review of “HESS Opinions: Never train an LSTM on a single basin” by Kratzert et al.
The opinion piece draws the attention towards the size and diversity of data used to train LSTM models for streamflow. I’m really happy that the authors put this material together to emphasize the good practice of using large and diverse datasets for models that will reliably predict streamflow at ungauged locations and time periods that are beyond the time periods the models were trained on.
I think the manuscript is well structured and nicely written. The arguments are easy to follow and the figures are appropriate to support the statements. I actually do not have any major comments; the list of minor (easy to address) comments is attached below. I would like to finish by congratulating the authors on this nice manuscript. It was a very enjoyable read, and I hope it will be published soon.
Best regards,
Juliane Mai
Minor:
- L 37-38: “It is important to recognize that there is usually no reason in practice to train LSTM streamflow models using data from only a small number of watersheds.” —> Couldn’t it be that one runs into storage limitations when, for example, wanting to use 10,000s of basins? I know this is not the target audience here, but maybe it should be mentioned that there is an upper limit on how many basins one can use and still be able to train a model on a standard machine.
- L 45-52: Section 2: I think it might be helpful to re-iterate that the behaviour the LSTMs show here is the exact opposite compared to the physically-based models (left panel). I’d probably also add panel IDs to the figure (e.g., A and B) and then refer to them in the text. Up to the authors.
- Figure 2: caption: “for models trained on individual basins (basin) vs. on multiple basins (regional)” —> “for models trained on individual basins (basin; orange coloured lines) vs. on multiple basins (regional; blue coloured lines)”
- L 63: “a vector that is bounded bounded to 1 (-1)” —> “a vector that is bounded to 1 (-1)”. Also, what does the “-1” in parenthesis mean? Is the vector bound to the interval [-1,1]? I am guessing the “(-1)” refers to “(limiting)” in the next sentence?! But it is confusing. I’d probably not mention the lower bound because you already made the point clear with the upper limit that can be reached.
- L 65: Missing closing parenthesis after “Appendix B”.
- Figure 3: Wow! This is such an impressive difference!
- Figure 4 (caption and text): You illustrate the simulation of the single-basin model (13011900). I am assuming that the time period you show is the testing period for the model that was trained with this basin only (no spatial transfer), right? It would be helpful to mention this here, as the reader does not know which 10-year period you use for training vs testing (it’s in the appendix I am sure, but it would help the flow of reading).
- Section 3: Caption is missing a question mark at the end.
- Section 3: I would finish that section with a clear statement like: “When you train your LSTM model on a single basin, you will likely not be able to predict any peak event in the future.” This is what I would expect to read as a response to the question you state in the section title.
- Section 4: I think it might be confusing when you talk about splits here given that hydrologists usually use “splits” to use one set for calibration and then the remaining for validation. When you say N splits (I think!) you mean that you repeat training/testing N times, right? Like N independent experiments. E.g., “The 531 basins are, for example, split into 5 groups (each around 107 basins). Then the first group of basins (first split) is trained on the training period (Oct 1999-Sep 2000) and then all basins of the first split are tested during validation period (Oct 1980-Sep 1989). After training, the model is evaluated for the ~107 basins of the first split using the testing period (Oct 1989 to Sep 1999). This experiment is then repeated for each of the remaining 4 splits.” Not sure if that is entirely correct but it took me quite some reading of the appendix to get to this.
- Section 4: Do you think that 531 basins might not be enough because they are not diverse enough? Do you think 531 would be enough when the training period is longer than 10 years? I don’t think any additional experiments are required here but it might be nice to have a bit of discussion here. Otherwise it might be really demotivating to know that even >500 basins are not enough.
- L 97 (and caption figure 6): “the curve in Fig. A2” —> There is no curve in figure A2. Do you mean Figure 5?
- Figure 6 caption: For completeness, describe what the dashed orange/green line mean.
- Section 5: I love the description of the experiment and its discussion here. Also the take-home message at the end is great. This is something that Section 3 (take-home) and Section 4 (description experiment) are slightly lacking.
- Line 128: “good” —> maybe “reliable”?
- Line 129: “We’ve” —> “We have”
- Line 131: “Of course, it is trivial (but most likely uninteresting) to beat improperly trained models.” Love that statement!
- Figure C1: I think it might be helpful to indicate (on x-axis) which the optimal value is for each metric. Table C1 might also be a good place for it. Up to the authors.
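As an aside, the basin-grouping scheme that the comment on Section 4 tries to reconstruct (N groups of basins, each trained and tested as an independent experiment) could be sketched as follows. This is only an illustration of that interpretation; the 5-way random grouping, seed, and basin IDs are assumptions, not the paper's actual setup.

```python
# Illustrative sketch: 531 basins divided into 5 groups, each group an
# independent train/test experiment (group count, seed, and basin IDs are
# assumptions, not the paper's setup).
import random

basins = [f"basin_{i:03d}" for i in range(531)]
random.seed(42)
random.shuffle(basins)

n_splits = 5
splits = [basins[k::n_splits] for k in range(n_splits)]

for k, group in enumerate(splits):
    # Train a model on this group over the training period, then evaluate
    # the same basins on the held-out testing period (independent experiment).
    print(f"split {k}: {len(group)} basins")
```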
Citation: https://doi.org/10.5194/hess-2023-275-RC4 - AC3: 'Reply on RC4', Frederik Kratzert, 28 Feb 2024
-
CC7: 'Comment on hess-2023-275', Tam Nguyen, 27 Feb 2024
The manuscript is well written and the topic is very interesting. I really enjoyed reading this manuscript. The discussion is thorough, providing a deep understanding of LSTMs and good practice. I think there is still room for discussion (related to the questions below). Please feel free to include it if the authors think these points are relevant.
- Why is the performance of physical or conceptual models (e.g., mHM and VIC) when calibrated for a single basin (“basin model”) better than the one calibrated for multiple basins (“regional model”) while that is the opposite with LSTM?
- "Never train an LSTM on a single basin". What makes LSTM so special that we should not train on a single basin? What if we want to model new processes with LSTM and we just have data for a single (or just a few) basin(s)? – we do not have other publicly available data (as for streamflow modeling).
Below are possible discussions regarding these questions (just my opinion; please correct me if I am wrong).
Discussion of question 1: Physically-based (PB, or conceptual) model formulation: The physical processes are conceptualized using mathematical/empirical equations that were based on experimental results or the modeler’s understanding of reality. In this sense, PB models already use some kind of data (because the equations in PB models are based on experimental results or the modeler’s understanding of reality, the equations in PB models are also a kind of data).
PB models: “basin” and “regional” models are often calibrated with a few dozen (or a few hundred) calibration parameters (I am not sure how many calibration parameters there are for the VIC and mHM models that are used in this study?). Because there are uncertainties in the data and the model structure, and an imprecise description of the physical processes via mathematical equations in PB models, the calibration process will adjust calibration parameters to compensate for these errors/uncertainties. However, when applying the model to a large number of basins, the number of parameters might not be large enough to compensate for these errors/uncertainties => leading to a reduced model performance of the “regional” model compared to the “basin” model.
ML models: “basin” and “regional” models both have a high number of “trainable” parameters (similar to the term calibration parameters in PB models). For example, a basin (regional) model with 16 (256) hidden states, 1 layer of LSTM, and a model head consisting of 1 dense layer could have more than 1,400 (250,000) “trainable” parameters (these numbers were roughly calculated for a native LSTM network in PyTorch). Such a model can easily overfit the training data and provide poor performance on test data. With the “regional” model, which sees more data, overfitting might be avoided (but is still possible), yielding better performance on test data than the “basin” model. This could be a reason why the performance of the “regional” model is better than that of the “basin” model with the LSTM network.
Now, if we calibrate “basin” and “regional” PB models by adjusting parameters for individual model grid cells (or for many groups of grid cells based on certain criteria), which results in a much larger number of calibration parameters, up to the level that the number of calibrated parameters of the PB models and the number of trainable parameters of the LSTM are the same, will the PB models behave similarly to the ML models? (I.e., will the performance of the “basin” model be worse than that of the “regional” model?)
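The rough parameter counts quoted above can be sanity-checked with a short sketch. This assumes a single-layer LSTM with 5 input features (the input size is an assumption; PyTorch's `nn.LSTM` stores input weights, recurrent weights, and two bias vectors for each of the four gates) plus a 1-unit dense head.

```python
# Rough trainable-parameter count for a 1-layer LSTM + 1-unit dense head.
# Assumes 5 input features (an assumption for illustration only).
def lstm_param_count(n_inputs: int, n_hidden: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and two biases
    # (PyTorch keeps separate bias_ih and bias_hh vectors).
    gates = 4 * (n_hidden * n_inputs + n_hidden * n_hidden + 2 * n_hidden)
    head = n_hidden * 1 + 1  # dense layer mapping hidden state to streamflow
    return gates + head

print(lstm_param_count(5, 16))   # -> 1489  ("basin" model, >1,400)
print(lstm_param_count(5, 256))  # -> 269569  ("regional" model, >250,000)
```

These totals line up with the "more than 1,400 (250,000)" figures mentioned in the comment.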
Discussion of question 2: What makes LSTMs so special that we should not train on a single basin? Assume that we have one basin and train the model for daily streamflow simulation for 10 years. In this case, we have 3,650 pairs of inputs and outputs – this is already “big data” compared to other models that use a much lower number of data points (e.g., Table 1 in Zhu et al., 2022; https://doi.org/10.1016/j.eehl.2022.06.001). I would be curious to know if this could be because the number of trainable parameters of the LSTM networks is high, or something related to the long short-term memory (that we need more data to “warm up” the model?).
What if we want to model new processes with LSTM and we just have data for a single (or very few) basin(s)? Could we impose our understanding of the process of interest on the structure of the LSTM model, or somehow make LSTM applicable even for a single basin? For example, in Figure 4 (page 5), if we know that there are seasonal and trend components, can we model these two components separately and then combine them, which might improve the model prediction? I would find it very helpful if the authors could give some comments on this. Thank you
Tam
Citation: https://doi.org/10.5194/hess-2023-275-CC7 - AC12: 'Reply on CC7', Frederik Kratzert, 04 Mar 2024
-
CC8: 'Comment on hess-2023-275', Sivarajah Mylevaganam, 28 Feb 2024
Does the number of catchments/watersheds/sub-watersheds/subbasins influence the modeling outcome using ML models?
A systematic methodology to answer the question being raised is presented. This method ensures that the underlying hydrology is preserved as well as possible while answering the question.
Step-1:
Identify the basin of your interest. This should be based on a proper research methodology (Refer to PART IX)
Step-2:
Extract the associated catchments and the stream network using ESRI products or other vendors. These catchments and the stream network cover the basin of your interest. See the attached PDF file.
Step-3:
Use one of the existing techniques to order or name the stream network developed in step 2. For example, if you are using Strahler's method to order your stream network, you will end up with a network like the one shown below. There are many methods to order or name a stream network. Therefore, you will have to do some research to determine the best method that is suited for the problem that you are solving.
Step-4:
Run your ML model, considering all the catchments that you have in your basin. Report the coefficient that is of your interest (e.g., NSE).
Step-5:
Run your ML model, considering the catchments after removing the smallest stream number. If you are referring to the above figure, the smallest number will be 1. You can randomly remove the streams from the network.
Step-5-1:
Assume that you have 500 catchments with "1". Assume that you have 10,000 catchments within your basin. You can run your ML model after removing those 500 catchments. In other words, you will have 10,000–500=9,500 catchments in your simulation. Report the coefficient that is of your interest (e.g., NSE).
Step-5-2:
Assume that you have 500 catchments with "1". Assume that you have 10,000 catchments within your basin. You can run your ML model after removing one of those 500 catchments. You can randomly remove the catchment from the network. In other words, you will have 10,000-1=9,999 catchments in your simulation. You repeat this procedure by removing them one by one. In other words, your final simulation will have 10,000–500=9,500 catchments in your simulation. Report the coefficient that is of your interest (e.g., NSE). You will have an array of NSEs.
Step-6:
Run your ML model, considering the catchments after removing the next smallest stream number. If you are referring to the above figure, the next smallest number will be 2. You can randomly remove the streams from the network.
Step-6-1: Refer to Step-5-1
Step-6-2: Refer to Step-5-2
Step-7:
Continue your simulation work until you reach the largest stream number in your network. If you are referring to the above figure, the largest stream number will be 3.
Step-8:
Plot your NSEs and answer the question that has been raised.
In summary, a reverse algorithm needs to be developed to address the problem. Moreover, being conversant with spatial operations using products from ESRI or other vendors is required.
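The Strahler ordering mentioned in Step-3 can be sketched on a toy network. This is a minimal illustration, not tied to any GIS product: the tree representation (a node-to-tributaries mapping) and the node names are assumptions made for the example.

```python
# Minimal sketch of Strahler stream ordering on a toy network, represented
# as a mapping from each node to its upstream tributaries (assumed layout;
# node names are illustrative only).
def strahler(children: dict[str, list[str]], node: str) -> int:
    kids = children.get(node, [])
    if not kids:
        return 1  # headwater stream is order 1
    orders = [strahler(children, k) for k in kids]
    top = max(orders)
    # Order increases only where two streams of the same (maximum) order meet.
    return top + 1 if orders.count(top) >= 2 else top

# Toy network: two pairs of first-order headwaters join into two
# second-order reaches, which meet at the outlet.
children = {
    "outlet": ["reach_a", "reach_b"],
    "reach_a": ["head_1", "head_2"],
    "reach_b": ["head_3", "head_4"],
}
print(strahler(children, "outlet"))  # -> 3
```

Removing catchments by order (Steps 5-7) then amounts to filtering this mapping by the computed order before each ML re-run.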
-
AC9: 'Reply on CC8', Frederik Kratzert, 28 Feb 2024
See our reply to CC1 for a summary reply to all comments from SM.
Citation: https://doi.org/10.5194/hess-2023-275-AC9
- AC11: 'Reply on RC2', Frederik Kratzert, 04 Mar 2024
Data sets
Results and experimental data Frederik Kratzert, Martin Gauch, Daniel Klotz, and Grey Nearing https://doi.org/10.5281/zenodo.10139248
Model code and software
Code for analyzing model runs Frederik Kratzert, Martin Gauch, Daniel Klotz, and Grey Nearing https://github.com/kratzert/never-paper