Deep learning for monthly rainfall-runoff modelling: a comparison with classical rainfall-runoff modelling across Australia
Stephanie R. Clark
Julien Lerat
Jean-Michel Perraud
Peter Fitch
Abstract. The long short-term memory (LSTM) architecture, a deep learning model designed for time series predictions, is regularly producing reliable results in local and regional rainfall-runoff applications around the world. Recent large-sample-hydrology studies in North America and Europe have shown the LSTM to successfully match conceptual model performance at a daily timestep over hundreds of catchments. Here we investigate how these models perform in producing monthly runoff predictions in the relatively dry and variable conditions of the Australian continent. The monthly timestep matches historic data availability and is important for future water resources planning; however, it provides significantly smaller training data sets than daily time series. In this study, a continental-scale comparison of monthly deep learning (LSTM) predictions to conceptual rainfall-runoff model (WAPABA) predictions is performed on almost 500 catchments across Australia, with performance results aggregated over a variety of catchment sizes, flow conditions, and hydrological record lengths. The study period covers a wet phase followed by a prolonged drought, introducing challenges for making predictions outside of known conditions, challenges that will intensify as climate change progresses. The results show that LSTMs matched or exceeded WAPABA prediction performance for more than two-thirds of the study catchments; the largest performance gains of LSTM versus WAPABA occurred in large catchments; the LSTM models struggled less than the WAPABA models to generalise (i.e., to make predictions under new conditions); and catchments with few training observations due to the monthly timestep did not demonstrate a clear benefit with either WAPABA or LSTM.
Status: final response (author comments only)
RC1: 'Comment on hess-2023-124', Martin Gauch, 03 Jul 2023
Summary: The paper compares a per-basin LSTM with a traditional RR model (WAPABA) on monthly data for basins in Australia and reports the LSTM to perform slightly better than WAPABA. The authors extensively investigate the results with respect to catchment size, flow conditions, and amount of data available. I especially appreciated the detailed evaluation that breaks down the performance differences according to different possible causes.

My main concerns with this paper are related to methodology with the DL model. I believe that the presented comparison would have a different (likely more clear) outcome if the LSTM was trained differently. Please see my detailed comments below.

Major Comments
- We know that single-basin LSTM RR models perform worse than globally trained ones. This has been very clearly shown for daily modeling [1] and I see no reason to believe this would be different for monthly data. In fact, I am quite confident that a well-trained global LSTM would outperform WAPABA more clearly than in the presented study. For instance, the authors state that their LSTM tends to underestimate high-flows (L639), which is exactly what global LSTMs are better at (because a high flow value in one basin is often not a very high value for another basin). There would also be fewer issues with the amount of training data, as the global model would have access to the samples from all basins at once. Another advantage of the global model is that one single model needs less compute to hyperparameter-tune and fit than 500 single-basin ones. The authors even discuss global models (L693) and their expected benefits (L689), so I don't understand why they wouldn't use one. If you think this requires a lot of coding work, I can recommend the NeuralHydrology library, which should allow to run your experiments with no or hardly any code modification (disclaimer: I'm one of the maintainers. This is just a suggestion, it's totally fine by me if you'd like to keep using your code).
- Beyond the issue of how to train the ML model, I think it is questionable whether an LSTM is even the best choice of an ML model here. LSTMs are good for long input sequences with dependencies across many input steps, which is not the case here -- the paper ends up using just 6 time steps. These could easily be fed into a simple feedforward net (or even a random forest or an XGBoost model). Ideally, a paper that claims to investigate DL for monthly RR prediction should also check whether the LSTM is the right tool for this task. To be clear, it might be -- but it might also be no better or worse than a more lightweight and faster feedforward net.
- The authors chose to use no validation period and justify this with an unreported "sensitivity test" (on one basin?). I do not find this convincing: it is unclear to me whether the test set remained untouched until final evaluation after hyperparameter tuning. However, in this case HP-tuning only happened on a single basin (which is far from ideal in itself), which means that at least most of the test set was apparently not touched for validation. Still, I would prefer to see a separate validation period. If lack of data is a concern, the authors could opt for a cross-validation scheme for HP-tuning.
- Open research and reproducibility:
- I was unable to find the actual code and configuration files under the link that is supposed to provide the source code for the paper's experiments. All I found is a notebook with a toy example.
- I would appreciate the authors to provide a ready-made download link to the forcings and streamflow data, rather than pointers to several government sites that leave people to figure out how to find the data from ~500 basins themselves. A single zenodo link would be far easier. If that's not possible (e.g., for license reasons), please provide a script to download the data (and to put it in the correct format if any changes are needed).
Minor Comments
- Overall, and especially in section 3, I think there is too much focus on the performance difference between train and test period. This is not really meaningful for ML models which can easily be trained to NSE >> 0.9 on the training set and still perform well on test data.
- I would transpose table 3 and add columns for the different KGE components.
- Fig. 10: I might be misunderstanding something, but what is your definition of "no flow"? Apparently it is not Q = 0, because the observation axis still shows variation.
- Fig. 12: Would be interesting to look at the observed and predicted hydrographs of the one basin where LSTM is poor and WAPABA is good.
- L42 "In some cases...": I think this formulation is a bit too weak.
- L52: The citations should include the first LSTM paper [3] and L54 should include the paper that showed how to train global LSTMs at CONUS scale [4].
- L58: I'd add [2] here, which I think is a bit closer to the content of the sentence than the other two citations.
- L83: There are many more, I suggest adding an "e.g.," to the list of citations
Typos
- L279 model's
- L619 broken reference
- Several incorrect uses of \citep vs. \citet
[1] Nearing, G., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., et al.: What role does hydrological science play in the age of machine learning?, Water Resources Research, 57, e2020WR028091, https://doi.org/10.1029/2020WR028091, 2021.
[2] Frame, J. M., Kratzert, F., Klotz, D., Gauch, M., Shalev, G., Gilon, O., Qualls, L. M., Gupta, H. V., and Nearing, G.: Deep learning rainfall–runoff predictions of extreme events, Hydrol. Earth Syst. Sci., 26, 3377–3392, https://doi.org/10.5194/hess-26-3377-2022, 2022.
[3] Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, https://doi.org/10.5194/hess-22-6005-2018, 2018.
[4] Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019.
Citation: https://doi.org/10.5194/hess-2023-124-RC1
AC1: 'Reply on RC1', Stephanie Clark, 21 Aug 2023
Summary comment: The paper compares a per-basin LSTM with a traditional RR model (WAPABA) on monthly data for basins in Australia and reports the LSTM to perform slightly better than WAPABA. The authors extensively investigate the results with respect to catchment size, flow conditions, and amount of data available. I especially appreciated the detailed evaluation that breaks down the performance differences according to different possible causes. My main concerns with this paper are related to methodology with the DL model. I believe that the presented comparison would have a different (likely more clear) outcome if the LSTM was trained differently. Please see my detailed comments below.
Response: Thank you for your time in reading our manuscript and providing comments, especially with regards to the deep learning model methodology. Specific comments are addressed below. However, we would like to re-emphasise here that the objective of this study is to establish the minimum performance level that a standard (non-expert) user might expect from off-the-shelf LSTM usage, i.e. to help users of traditional modelling methods who are not necessarily experienced with machine learning understand what they might expect from running a very basic LSTM; the goal is not to maximise performance to cutting-edge machine learning standards. We will emphasise this message more clearly in the introduction to the paper so that readers understand what the goal is.
Major Comments
Comment 1: We know that single-basin LSTM RR models perform worse than globally trained ones. This has been very clearly shown for daily modeling [1] and I see no reason to believe this would be different for monthly data. In fact, I am quite confident that a well-trained global LSTM would outperform WAPABA more clearly than in the presented study. For instance, the authors state that their LSTM tends to underestimate high-flows (L639), which is exactly what global LSTMs are better at (because a high flow value in one basin is often not a very high value for another basin). There would also be fewer issues with the amount of training data, as the global model would have access to the samples from all basins at once. Another advantage of the global model is that one single model needs less compute to hyperparameter-tune and fit than 500 single-basin ones. The authors even discuss global models (L693) and their expected benefits (L689), so I don't understand why they wouldn't use one. If you think this requires a lot of coding work, I can recommend the NeuralHydrology library, which should allow to run your experiments with no or hardly any code modification (disclaimer: I'm one of the maintainers. This is just a suggestion, it's totally fine by me if you'd like to keep using your code).
Response: We are aware that global LSTM models have many benefits over individual models (for all the reasons stated above) and that a global model incorporating these ~500 stations would undoubtedly produce better LSTM results than we have obtained (one co-author is currently running a global model with NeuralHydrology on another project). A global model was not chosen for this study, however, as we endeavoured to make the comparison with individual-catchment WAPABA models as apples-to-apples as possible. This also provides the appropriate scale context for readers who seek to model a single catchment in their study, as is very often the case with traditional modelling studies. A frequently heard reason why researchers do not attempt to use machine learning approaches is the small data size associated with individual catchment time series, and we were interested in demonstrating the lower limits of data availability required to fit an LSTM with individual catchment monthly data sets. Even though fitting a global LSTM over hundreds of catchments may lead to better results, in this study we have shown that similar performance to traditional models can be reached despite the fact that the LSTM was fit using limited data on a single catchment. A global LSTM is proposed for further work in a follow-on project.
Comment 2: Beyond the issue of how to train the ML model, I think it is questionable whether an LSTM is even the best choice of an ML model here. LSTMs are good for long input sequences with dependencies across many input steps, which is not the case here -- the paper ends up using just 6 time steps. These could easily be fed into a simple feedforward net (or even a random forest or an XGBoost model). Ideally, a paper that claims to investigate DL for monthly RR prediction should also check whether the LSTM is the right tool for this task. To be clear, it might be -- but it might also be no better or worse than a more lightweight and faster feedforward net.
Response: It is possible that a feedforward neural network could be used to model the system; however, this structure does not embody the sequential nature of the data, requires an increase in the complexity of the training space, and is not likely to be the optimal choice for time series data. A comparison between individual feedforward networks and LSTMs is outside the scope of this project; however, many papers (e.g., Rahimzad et al., 2021) have shown the LSTM to produce better results for modelling time series compared to feedforward networks and other conventional machine learning models. As the LSTM is the current state-of-the-art method for machine learning rainfall-runoff modelling, this method fits best into the scope of our research question. Furthermore, it is the authors' plan to use this study as a stepping-stone for upcoming LSTM research.
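To make the architecture under discussion concrete, the following is a minimal sketch of a single-catchment monthly LSTM of the kind debated here: a recurrent layer reads a 6-month window of forcings and a linear head predicts runoff for the final month of the window. The feature count, hidden size, and all names are illustrative assumptions, not the study's actual configuration.

```python
import torch
import torch.nn as nn

class MonthlyLSTM(nn.Module):
    """Single-catchment LSTM: maps a 6-month window of forcings
    (e.g. precipitation and PET) to runoff at the final month.
    Sizes are illustrative assumptions, not the study's settings."""

    def __init__(self, n_features: int = 2, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 6, n_features); the LSTM steps through the months in order
        out, _ = self.lstm(x)
        # predict runoff from the hidden state after the last month
        return self.head(out[:, -1, :])
```

Unlike a feedforward net fed the same 6 months as a flat vector, the recurrence imposes the temporal ordering of the inputs rather than leaving the network to learn it.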
Comment 3: The authors chose to use no validation period and justify this with an unreported "sensitivity test" (on one basin?). I do not find this convincing: it is unclear to me whether the test set remained untouched until final evaluation after hyperparameter tuning. However, in this case HP-tuning only happened on a single basin (which is far from ideal in itself), which means that at least most of the test set was apparently not touched for validation. Still, I would prefer to see a separate validation period. If lack of data is a concern, the authors could opt for a cross-validation scheme for HP-tuning.
Response: The omission of a validation period was a deliberate choice based on the like-for-like comparison we are doing. We wanted to train on the same data as WAPABA and predict similarly, with identical duration and dataset size. As mentioned earlier, the objective was not to determine the optimal model configuration, but rather to use a vanilla setup. This approach was followed in the selection of hyperparameters, with a set chosen that yielded reasonable results on a wide range of catchments. The testing set remained completely untouched until final evaluation. Data leakage between the training and testing sets was avoided by splitting the training set for the sensitivity test, so that 80% of the training set was used for training and 20% was used for validation to monitor for over-fitting. The testing set was not used at all during the sensitivity test. Hyperparameters were tuned in a separate process on a subset of catchments, also using only the training set.
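One way to picture the leakage-free split described above is the sketch below: the training period is cut chronologically, with the first 80% used for fitting and the final 20% held out to monitor over-fitting, while the test period is never touched. Function and variable names are illustrative, not taken from the study's code.

```python
import numpy as np

def chronological_split(X: np.ndarray, y: np.ndarray, train_frac: float = 0.8):
    """Split windowed samples in time order (no shuffling), so the
    held-out validation block lies strictly after the fitting block."""
    cut = int(len(X) * train_frac)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])
```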
Comment 4: Open research and reproducibility. I was unable to find the actual code and configuration files under the link that is supposed to provide the source code for the paper's experiments. All I found is a notebook with a toy example. I would appreciate the authors to provide a ready-made download link to the forcings and streamflow data, rather than pointers to several government sites that leave people to figure out how to find the data from ~500 basins themselves. A single zenodo link would be far easier. If that's not possible (e.g., for license reasons), please provide a script to download the data (and to put it in the correct format if any changes are needed).
Response: It is our aim to make the data and code accessible. A link to the code and configuration files will be provided with the revised manuscript. The forcings and streamflow data for the set of basins could be provided in the form of a Zenodo link.
Minor Comments:
- Overall, and especially in section 3, I think there is too much focus on the performance difference between train and test period. This is not really meaningful for ML models which can easily be trained to NSE >> 0.9 on the training set and still perform well on test data.
- I would transpose table 3 and add columns for the different KGE components.
- Fig. 10: I might be misunderstanding something, but what is your definition of "no flow"? Apparently it is not Q = 0, because the observation axis still shows variation.
- Fig. 12: Would be interesting to look at the observed and predicted hydrographs of the one basin where LSTM is poor and WAPABA is good.
- L42 "In some cases...": I think this formulation is a bit too weak.
- L52: The citations should include the first LSTM paper [3] and L54 should include the paper that showed how to train global LSTMs at CONUS scale [4].
- L58: I'd add [2] here, which I think is a bit closer to the content of the sentence than the other two citations.
- L83: There are many more, I suggest adding an "e.g.," to the list of citations
Response: We appreciate the comments you have made here. These ‘minor’ comments will be taken into account when revising the manuscript.
Citation: https://doi.org/10.5194/hess-2023-124-AC1
AC3: 'Reply on AC1', Stephanie Clark, 21 Aug 2023
Apologies, I omitted the reference from the previous comment.
Reference: Rahimzad, M., Moghaddam Nia, A., Zolfonoon, H., et al.: Performance Comparison of an LSTM-based Deep Learning Model versus Conventional Machine Learning Algorithms for Streamflow Forecasting, Water Resour. Manage., 35, 4167–4187, https://doi.org/10.1007/s11269-021-02937-w, 2021.
Citation: https://doi.org/10.5194/hess-2023-124-AC3
RC2: 'Comment on hess-2023-124', Umut Okkan, 12 Aug 2023
General Comments
The paper compares the long short-term memory model (LSTM), a type of recurrent neural network, to a well-known conceptual rainfall-runoff model (WAPABA) using monthly data for basins in Australia. The authors note that the LSTM is just as effective as conceptual models for simulating daily runoff in various regions of the world. Due to the significance of monthly data in water resources planning, they studied the LSTM while establishing a monthly rainfall-runoff relationship, unlike earlier studies. In their comprehensive work, the ability of the LSTM to produce an accurate monthly runoff simulation was tested under different conditions across the Australian continent. Along the way, they analyzed various indices and concluded that the LSTM outperforms WAPABA in the majority of catchments. Even though a great deal of effort went into the calculations, my concerns have nothing to do with the utilization of the LSTM itself (tuning the hyper-parameters regarding its internal architecture, training algorithm, etc.). Consequently, I believe that considering my remarks below can broaden the scope of the study.
Specific Comments
a. Major Comments:
- As they specified, the accuracy of data-driven models trained for runoff simulation is heavily dependent on the quantity of lagging data, and as a result of numerous trials, they decided on a lag of six months. In this case, there are two issues that need to be discussed in the paper. First, are we confident that the LSTM can substitute for a conceptual model since it is so dependent on antecedent data? Additionally, as discussed by Robertson et al. (2013), when a catchment is wet, antecedent runoff does not promptly respond to antecedent precipitation; rather, soil moisture and groundwater storages mostly refill. Under these circumstances, antecedent runoff values may underestimate the actual soil moisture conditions, resulting in relatively low runoff simulations. Other than the median flow (i.e., low and high flow), might this explain the differences between the simulations at different percentiles?
- In addition, it may be conceivable for another machine learning model (standard feedforward neural networks, support vector regression, etc.) to supersede the LSTM, particularly for runoff simulation using 6-month lagged data. In this sense, I also strongly recommend comparing conventional machine learning models against WAPABA. Otherwise, we will just be exploring a fiction revolving around the increasingly popular application of deep learning to monthly runoff data.
- Another issue to be discussed in the paper is the coupled conceptual-machine learning modeling framework. Although several references are provided in Section 4, it is important to clarify what benefits this hybridization brings and what limitations of individual machine learning models it can address. In fact, soil moisture and groundwater recharge outputs derived from calibrated WAPABA model are likely to strengthen the predictors of LSTM.
- Moreover, in a rather limited number of studies, the internal structure of conceptual hydrological models has been replaced with machine learning approaches (see Okkan et al., 2021). Employing LSTM to modify the internal runoff partitioning mechanism of WAPABA is not deemed required in this study. But it would be appropriate to refer to and discuss this kind of nested hybridization in Section 4, just to shed light on potential future studies.
b. Minor Comments and Typos:
1. The figures for grading criteria are appealing, but it would be useful to visualize them on the continent using a GIS tool.
2. The way of citation in some lines does not seem usual (e.g., L83).
3. It should be checked whether the unit of "Inverse K" in Table 2 is mm/day or mm/month.
4. There is no identity in the presentation of references. Some journal names are abbreviated, some are not.
5. Line 619 has the following statement: “Error! Reference source not found”
6. It would be appropriate to give the conceptual diagram of WAPABA.
7. Also, was the warm-up period used while applying WAPABA? To avoid any bias associated with initial storage values, this can be useful.
References
Robertson, D. E., Pokhrel, P., & Wang, Q. J. (2013). Improving statistical forecasts of seasonal streamflows using hydrological model output. Hydrology and Earth System Sciences, 17(2), 579-593.
Okkan, U., Ersoy, Z. B., Kumanlioglu, A. A., & Fistikoglu, O. (2021). Embedding machine learning techniques into a conceptual model to improve monthly runoff simulation: A nested hybrid rainfall-runoff modeling. Journal of Hydrology, 598, 126433.
Citation: https://doi.org/10.5194/hess-2023-124-RC2
AC2: 'Reply on RC2', Stephanie Clark, 21 Aug 2023
Summary comment: The paper compares the long short-term memory model (LSTM), a type of recurrent neural network, to a well-known conceptual rainfall-runoff model (WAPABA) using monthly data for basins in Australia. The authors note that the LSTM is just as effective as conceptual models for simulating daily runoff in various regions of the world. Due to the significance of monthly data in water resources planning, they studied the LSTM while establishing a monthly rainfall-runoff relationship, unlike earlier studies. In their comprehensive work, the ability of the LSTM to produce an accurate monthly runoff simulation was tested under different conditions across the Australian continent. Along the way, they analyzed various indices and concluded that the LSTM outperforms WAPABA in the majority of catchments. Even though a great deal of effort went into the calculations, my concerns have nothing to do with the utilization of the LSTM itself (tuning the hyper-parameters regarding its internal architecture, training algorithm, etc.). Consequently, I believe that considering my remarks below can broaden the scope of the study.
Response: Thank you for your comments on our manuscript especially relating to the integration of machine learning and traditional models in hybrid configurations. These considerations will strengthen the message of our work and tie it into broader contemporary themes in hydrological machine learning.
Major Comments:
Comment 1: As they specified, the accuracy of data-driven models trained for runoff simulation is heavily dependent on the quantity of lagging data, and as a result of numerous trials, they decided on a lag of six months. In this case, there are two issues that need to be discussed in the paper. First, are we confident that the LSTM can substitute for a conceptual model since it is so dependent on antecedent data? Additionally, as discussed by Robertson et al. (2013), when a catchment is wet, antecedent runoff does not promptly respond to antecedent precipitation; rather, soil moisture and groundwater storages mostly refill. Under these circumstances, antecedent runoff values may underestimate the actual soil moisture conditions, resulting in relatively low runoff simulations. Other than the median flow (i.e., low and high flow), might this explain the differences between the simulations at different percentiles?
Response: This is an interesting question. The advantage of the LSTM, in terms of the choice of possible machine learning models, is that it processes antecedent data sequentially to make predictions at the current timestep. Therefore, to predict the runoff at time t, our model considers the precipitation, evapotranspiration, etc. sequentially over the previous 6 months (back to t-6). In this way, the model can learn patterns such as a period of dryness followed by precipitation leading to less runoff (due to refilling of soil moisture and groundwater storages) than the same precipitation occurring at the end of a long, wet period. A lag time of 6 months was determined by trial and error to produce the best predictions, and this is supported by hydrologic knowledge that catchment conditions more than 6 months in the past would likely have little effect on current runoff.
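A sketch of how such 6-month input windows can be assembled from monthly series is shown below. It assumes, for illustration, that the window ends at the prediction month itself; the exact alignment used in the study's code is not specified here, and all names are hypothetical.

```python
import numpy as np

def make_windows(forcings: np.ndarray, runoff: np.ndarray, lag: int = 6):
    """Pair runoff at month t with forcings over months t-lag+1 .. t.

    forcings: (n_months, n_features) array, e.g. precipitation and PET
    runoff:   (n_months,) array of observed monthly runoff
    Returns X of shape (n_samples, lag, n_features) and matching y."""
    X = np.stack([forcings[t - lag + 1 : t + 1]
                  for t in range(lag - 1, len(forcings))])
    y = runoff[lag - 1:]
    return X, y
```

Because each sample carries the ordered 6-month history, a dry spell followed by rain and a rain month at the end of a wet run produce different input sequences even when the final month's forcing values are identical.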
Comment 2: In addition, it may be conceivable for another machine learning model (standard feedforward neural networks, support vector regression, etc.) to supersede the LSTM, particularly for runoff simulation using 6-month lagged data. In this sense, I also strongly recommend comparing conventional machine learning models against WAPABA. Otherwise, we will just be exploring a fiction revolving around the increasingly popular application of deep learning to monthly runoff data.
Response: The comparison of other machine learning models to LSTMs for modelling the rainfall-runoff relationship is outside the scope of this study, though there are many papers in peer-reviewed journals which have done this (e.g., Rahimzad et al., 2021). In most comparisons, the LSTM has been shown to have superior performance, and therefore we did not feel it was necessary to repeat these comparisons.
Comment 3: Another issue to be discussed in the paper is the coupled conceptual-machine learning modeling framework. Although several references are provided in Section 4, it is important to clarify what benefits this hybridization brings and what limitations of individual machine learning models it can address. In fact, soil moisture and groundwater recharge outputs derived from calibrated WAPABA model are likely to strengthen the predictors of LSTM.
Response: Thank you for highlighting this as an important point. We have touched upon this topic in the Discussion section, however it would be interesting to expand the discussion to include the benefits of hybridization in overcoming limitations of these individual machine learning models. A further literature search on this point is a very welcome suggestion that will improve the depth of the manuscript.
Comment 4: Moreover, in a rather limited number of studies, the internal structure of conceptual hydrological models has been replaced with machine learning approaches (see Okkan et al., 2021). Employing LSTM to modify the internal runoff partitioning mechanism of WAPABA is not deemed required in this study. But it would be appropriate to refer to and discuss this kind of nested hybridization in Section 4, just to shed light on potential future studies.
Response: As per the response to the previous comment, this would be an interesting addition to the discussion.
Minor Comments:
- The figures for grading criteria are appealing, but it would be useful to visualize them on the continent using a GIS tool.
- The way of citation in some lines does not seem usual (e.g., L83).
- It should be checked whether the unit of "Inverse K" in Table 2 is mm/day or mm/month.
- There is no identity in the presentation of references. Some journal names are abbreviated, some are not.
- Line 619 has the following statement: “Error! Reference source not found”
- It would be appropriate to give the conceptual diagram of WAPABA.
- Also, was the warm-up period used while applying WAPABA? To avoid any bias associated with initial storage values, this can be useful.
Response: We appreciate the comments you have made here. These ‘minor’ comments will be taken into account when revising the manuscript.
Reference:
Rahimzad, M., Moghaddam Nia, A., Zolfonoon, H., et al.: Performance Comparison of an LSTM-based Deep Learning Model versus Conventional Machine Learning Algorithms for Streamflow Forecasting, Water Resour. Manage., 35, 4167–4187, https://doi.org/10.1007/s11269-021-02937-w, 2021.
Citation: https://doi.org/10.5194/hess-2023-124-AC2