How can we benefit from regime information to make more effective use of long short-term memory (LSTM) runoff models?

Hashemi, Reyhaneh; Brigode, Pierre; Garambois, Pierre-André; Javelle, Pierre

doi:https://doi.org/10.5194/hess-26-5793-2022

Articles | Volume 26, issue 22

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 17 Nov 2022

Reyhaneh Hashemi, Pierre Brigode, Pierre-André Garambois, and Pierre Javelle

Final revised paper (published on 17 Nov 2022)
Preprint (discussion started on 12 Oct 2021)

Interactive discussion

Status: closed

RC1:
'Comment on hess-2021-511', John Quilty, 16 Nov 2021

Dear Authors,

Please find attached my review of your article. I appreciate the opportunity to review this interesting paper.

Best regards,

John Quilty

Citation: https://doi.org/10.5194/hess-2021-511-RC1
- AC1: 'Reply on RC1', Reyhaneh Hashemi, 06 Jan 2022
  
  Dear Dr. Quilty,
  We highly appreciate your interest in reviewing our paper. The full responses to your comments are provided in a supplementary document. Please find it attached.
  
  Best regards,
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2021-511-AC1
RC2:
'Comment on hess-2021-511', Anonymous Referee #2, 14 Dec 2021

Summary of Review:

This paper addresses two research questions related to the use of LSTMs for rainfall-runoff modeling: (1) Does appropriate sequence length depend on hydrological regime, and (2) should LSTM training be done on hydrologically similar basins?

To state my opinion up front, I have run similar experiments (unpublished) and found results that are qualitatively different than what are reported here. There are several technical issues in this paper (overall, the methodology is not appropriate for testing the stated hypotheses), and it might be worth addressing those before we look carefully at the results.

My overall recommendation is to revise the experiment as suggested in one of the comments below. The experimental design that is appropriate to test the (two) hypotheses outlined here is very simple (but somewhat computationally expensive). If the authors were to find similar results using a more appropriate experiment, this would be an interesting study.

Comments:

Hyperparameter tuning was done on LSTMs trained on individual basins. LSTMs trained on individual basins behave fundamentally differently than LSTMs trained on multiple basins, which means that lessons learned from hypertuning on individual basins do not translate to multiple-basin models. Additionally, 15 catchments is not enough for robust hypertuning – we would need to perform hyperparameter tuning on the full (evaluation) dataset (although see a later comment – the experimental design needs to be changed fundamentally). Also, notice that the only portions of the “hypertuning” that were actually used for the other experiments in this paper were (1) discarding the S2 model architecture, and (2) batch size.

There is strong relationship between the dimension of the cell state and the sequence length, and also between the cell state dimension and the ability of the model to generalize (Kratzert et al 2019 shows how the model uses the cell state to map catchment similarity). This parameter was not included in the hyperparameter tuning, and it was also not considered in the experimental design. 64 cell states is smaller than used by most of the previously published work. The hypotheses that are tested here are about the ability of the model to generalize and about memory timescales, both of which are directly controlled by the cell state (more cell states means more ability to have different memory timescales for different hydrological regimes).

It would be interesting (and useful) to know whether there is value in clustering catchments prior to training models, and if so whether we could find correlations between different hyperparameters (e.g., sequence length, cell state dimension) and hydrological regime (the former is a more interesting question than the latter, in my opinion). The way to test this is simple – you separately (and fully) hypertune each model. For example, if you want to test the clustering strategy described in lines 120-125, you would hypertune models separately for each catchment group (considering all of the important LSTM hyperparameters), and as a benchmark you would hypertune a model for all of the catchments combined. Then the results would be directly comparable. After that, you could look at whether there was any relationship between hydrological regime and the “optimal” (hypertuning is never actually optimal) sequence length for that cluster. If you really wanted to train single-basin models (which I suggest you should not do), then these need to be separately (and fully) hypertuned for each basin.
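The design proposed in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: `SEARCH_SPACE`, `tune`, `tune_groups_and_benchmark`, and the `score` callback (which stands in for training a model and returning a validation metric) are hypothetical names, and the grid values are placeholders rather than values from the paper or the review.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]

# Hypothetical search space; placeholder values, not those of the paper.
SEARCH_SPACE: Dict[str, List[int]] = {
    "lookback": [90, 180, 365],
    "hidden_size": [64, 128, 256],
}


def tune(catchments: List[str],
         score: Callable[[List[str], Config], float]) -> Tuple[Config, float]:
    """Exhaustive grid search: return the best configuration and its score
    for one set of catchments (higher score is better)."""
    names = list(SEARCH_SPACE)
    best_cfg: Config = {}
    best_score = float("-inf")
    for values in product(*(SEARCH_SPACE[n] for n in names)):
        cfg = dict(zip(names, values))
        s = score(catchments, cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score


def tune_groups_and_benchmark(
    groups: Dict[str, List[str]],
    score: Callable[[List[str], Config], float],
) -> Dict[str, Tuple[Config, float]]:
    """Tune one model per catchment group, plus a benchmark model tuned on
    all catchments combined (stored under the key "all")."""
    results = {name: tune(members, score) for name, members in groups.items()}
    all_catchments = [c for members in groups.values() for c in members]
    results["all"] = tune(all_catchments, score)
    return results
```

Because every group and the combined benchmark pass through the same search, the selected lookback values can afterwards be compared across regimes on an equal footing. In practice `score` would wrap a full training run, which is where the computational cost mentioned in the review arises.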

I wonder why we are training local models. There is no situation where we would ever use a model trained on a single catchment for any real-world purpose. Additionally, the behavior of the LSTM is fundamentally and qualitatively different when trained on one catchment vs. many, which means that we cannot learn anything general or useful from locally trained models. If there was a specific hypothesis that we wanted to test that required training local models, then this might make sense, but I do not believe this is the case here – we could ask the question about appropriate sequence length on hydrologically grouped models, and asking the question this way would give us a more useful answer. Just a note: Kratzert et al. (in all papers after their 2018 paper) trained single-basin models only to make the point that this is not an appropriate thing to do.

Minor Comments:

The S2 architecture (stacked LSTMs) is interesting, but not related to either of the hypotheses of the study. What was the motivation for testing this and how does it relate to the questions that were motivated in the introduction? I’m not saying to remove it, just give us some reasoning or motivation. Also, when the “complexity” of this model is discussed, you might give us the number of free parameters so that we can get a sense of what the differences are.
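As a rough guide to the parameter counts requested here, the sketch below applies the textbook LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector). The number of input features (`N_INPUTS = 5`) is a hypothetical placeholder, the hidden size of 64 is the value mentioned in this review, and the output layer is not counted. Some frameworks store two bias vectors per gate, which would add a further 4 x hidden size per layer.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Free parameters of one LSTM layer: four gates, each with input
    weights, recurrent weights, and a single bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def stacked_lstm_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Total for a stack in which each layer above the first takes the
    hidden state of the layer below as its input."""
    total = lstm_layer_params(input_size, hidden_size)
    total += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    return total


N_INPUTS = 5  # hypothetical number of input features, not from the paper
single_layer = stacked_lstm_params(N_INPUTS, 64, 1)
two_layers = stacked_lstm_params(N_INPUTS, 64, 2)
print(single_layer, two_layers)  # 17920 50944
```

Under these assumptions, adding a second layer nearly triples the number of recurrent parameters, which gives a sense of the "complexity" gap the comment asks about.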

Line 192: I think this is just a typo. Validation data is used to help find the best and control overfitting (it is explicitly *not* used to help tune weights and biases, except through early stopping).

Line 201: It is a little concerning to have different sized training data records per catchment, especially if some catchments only have 1 year of training data. This is *especially* problematic if we are looking at differences between what data is required to train in different types of catchments.

In line 180 it reads like you are doing sequence-to-one prediction; however, in line 259 you say that you are using a patience of 50 epochs with a maximum of 500 epochs. Typically you only need this many epochs if you are doing sequence-to-sequence training. Regardless, the number of epochs used by previous studies was in the range of 20-50. Have you found that more epochs help (we looked at this carefully in previous studies), or is there something else about your model that is different from previously published work?
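For readers unfamiliar with the stopping rule under discussion, here is a minimal sketch of patience-based early stopping with the two values quoted above (patience of 50 epochs, maximum of 500 epochs). The function name and the idea of passing in a precomputed list of validation losses are assumptions made for illustration; a real training loop would compute each loss on the fly.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(val_losses: Sequence[float],
                         patience: int = 50,
                         max_epochs: int = 500) -> Tuple[int, int]:
    """Return (last epoch run, epoch with the best validation loss), 1-based.

    Training stops once `patience` consecutive epochs have passed without a
    new best validation loss, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch
```

With this rule the number of epochs actually run depends on the validation curve of each model, so it can differ from one catchment to the next.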

Line 291: This is a pretty small list of catchment attributes. Given that catchment attributes are available globally (e.g., HydroAtlas), and this will directly influence the generalizability of a model, why did we use such a limited set of attributes here?

In general, naming experiments with non-descriptive names like R1, R2, P1, etc. makes the paper more difficult to read than is necessary. This means that the reader must always refer back to the text in order to understand each figure. This can be solved simply by naming each of the models/experiments/datasets with descriptive names.

Citation: https://doi.org/10.5194/hess-2021-511-RC2
- AC2:
  'Reply on RC2', Reyhaneh Hashemi, 07 Jan 2022
  
  Dear Anonymous Referee #2,
  We are very appreciative of your review and thoughts. Please find our responses to each individual comment in the attached supplementary document.
  
  Best regards,
  Reyhaneh Hashemi
  
  Citation: https://doi.org/10.5194/hess-2021-511-AC2
  - RC3: 'Reply on AC2', Anonymous Referee #2, 10 Jan 2022
    
    Dear Authors,
    
    I want to respond briefly and partially to these replies before this goes back to the editor's desk. I am doing this because the author responses are not sufficient, and I will have to recommend rejecting the paper on re-review if the authors pursue this line of revisions. If the roles in this process were reversed, I would prefer to know this in advance rather than waiting through another round of review. I do believe this paper could be a nice contribution, if the authors were to make the effort.
    I also want to apologize in advance for the directness of these comments. However, from the authors' replies I see that they are not taking this review process seriously, and I think it is both important and useful to be as direct as possible about the (severe) problems with this paper. To reiterate, I would not take the time to do this if I did not think there was some amount of potential in this paper, however as it stands, the paper is not publishable nor is it (apparently) on track to being publishable.
    Please see comments attached. Thank you.
    
    Citation: https://doi.org/10.5194/hess-2021-511-RC3
    
    AC3: 'Responses to RC3 — addressed to Editor', Reyhaneh Hashemi, 24 Jan 2022
    
    Dear Editor,
    
    We believe that in the document provided in AC2 we have taken seriously into account every single point brought up in the first review (RC2) of Anonymous Referee #2 (R2).
    
    If the editor finds it necessary, we will of course conduct further tests on hyperparameter tuning without difficulty. However, the elements given by R2 in their reviews do not identify exactly which tests they expected, and their attachment has left us perplexed. The recent literature also acknowledges that hyperparameter tuning can be an endless job; for instance, Klotz et al. (2021, under review) also look for a “balance between computational resources and search depth”.
    
    Regarding the hidden unit size hyperparameter, we adopted the value selected in Lees et al. (2021), since they reported having systematically tested larger values for their entire sample (which was in a region close to ours) without observing any performance improvement.
    
    In what follows, we elaborate on two points that we find important:
    
    1. How hyperparameter tuning of LSTM networks is carried out in some recent related studies.
    
    2. Our local training methodology and why our results are not in contradiction with what has been previously done in those studies.
    
    We hope this will help you in making your decision.
    
    Finally, we would like to thank you for taking charge of our paper as well as both referees for their time and reviews.
    
    Best regards,
    
    Authors of hess-2021-511
    
    --------------------------------------
    
    1) LSTM hyperparameter tuning in similar studies
    
    We fully agree with R2 on the importance of hyperparameter tuning. We carefully read their comments but did not find an objective definition, or their own definition, of the standards to achieve. We also carefully read the papers of the authors cited by R2. It appears to us that the level of hyperparameter tuning varies from study to study:
    
    -- Kratzert et al. (2018) report doing a “manual” hyperparameter tuning on “several” catchments from a region (Austria) different and far from the study region, the United States (US): “The specific design of the network architecture, i.e., the number of layers, cell/hidden state length, dropout rate and input sequence length were found through a number of experiments in several seasonal-influenced catchments in Austria”. From this hyperparameter tuning, conducted using a local LSTM and for catchments in Austria, they conclude a 2 layer structure with 20 hidden units. They then use these choices in their local and regional LSTMs for 241 catchments located in the US (CAMELS data set), “without further tuning” even if they acknowledge that it “is something to do in the future”.
    
    R2 highlights in their first review that “there is strong relationship between the dimension of the cell state and the sequence length”. However, in this study (Kratzert et al. (2018)), the length of the input sequence — called lookback in our paper and AC2 — is not varied and it is fixed to 365 days “in order to capture at least the dynamics of a full annual cycle”.
    
    It turns out that, in this study, hyperparameter tuning is not performed in the same manner as required by R2.
    
    -- Kratzert et al. (2019) performed a more elaborate hyperparameter tuning, described briefly in Annex A of their paper, but without presenting any detailed results. They considered the following variations for the following hyperparameters: “Hidden states: 64, 96, 128, 156, 196, 224, 256; Dropout rate: 0.0, 0.25, 0.4, 0.5; Length of input sequence: 90, 180, 270, 365; Number of stacked LSTM layer: 1, 2”. No batch size variations are reported.
    
    They finally chose a one layer structure, with 256 hidden states, a dropout rate of 0.4 and a length of input sequence of 270 [days]. This hyperparameter tuning is performed using only one performance metric and for only one regional LSTM (no local training in this study). However, they then used “the same architecture (apart from the inclusion of a static input gate in the EA-LSTM), which found through hyperparameter optimization” to compare 3 different regional LSTM models for 2 different performance metrics.
    
    It turns out that, in this study as well, hyperparameter tuning is not done in the "for-each-model" fashion that R2 requires.
    
    -- Gauch et al. (2021) conducted an even more complex hyperparameter tuning. In this recent study, three periods are defined, instead of the only two in previous studies (but with a k-fold cross validation using data of the first period). As mentioned by the authors, splitting the data into three periods (calibration, validation, test) is “a widespread calibration strategy for DL models”. Gauch et al. (2021) compared four LSTM type models: Naive_daily, Naive_hourly, sMTS-LSTM and MTS-LSTM. It is clear that the hyperparameters they tuned differ from model to model. Also, the sequence length and hidden unit size parameters are not varied in the hyperparameter tuning of their Naive models and are set to fixed values, contrary to what R2 believes about their strong interconnectedness. Please also note that Kratzert et al. (2019) had previously found a different optimal value (270 [days]) for lookback; although it was obtained in a different setting, the evidence for the importance of varying it was available to Gauch et al. (2021).
    
    It turns out that, in this study as well, hyperparameter tuning is not done in the "equivalently-for-each-model" fashion and for all hyperparameters that R2 requires.
    
    Furthermore, for some unmentioned reason, the number of studied catchments in their study is not the same as previous studies of Kratzert, although the authors also used the same US CAMELS data set.
    
    -- Klotz et al. (2021, under review) used three periods: “training, validation, and testing that are standard in the machine learning community”. However, for some unmentioned reason, they did not choose the same dates and catchments as those taken in Gauch et al. (2021), although they had also used the US CAMELS data set. Apparently, the length of the input sequence (lookback) is not varied — this is not mentioned, but this is at least what one would understand from the preprint paper and the discussion available on HESSD on 2022-01-24.
    
    Hyperparameter tuning is performed for 6 parameters (hidden states, number of components, noise, dropout, batch size and learning rate) and for four models (GMM, CMAL, UMAL and MCD). One LSTM model is added for comparison, but taking the hyperparameters obtained by Kratzert et al. (2019): “We, therefore, also compare a model with the same hyper-parameters as Kratzert et al. (2019), the latter model is labeled LSTMp”.
    
    It turns out that, in this study as well, hyperparameter tuning is not done in the "equivalently-for-each-model" fashion that R2 requires.
    
    Furthermore, the train/validation period of Kratzert et al. (2019) — 1999-10-01 to 2008-09-30 — overlaps the test period of Klotz et al. (2021, under review) — 1995-10-01 to 2005-09-01. There might be some unmentioned reason, but Deep Learning guidelines (Goodfellow et al., 2016) require choosing independent periods.
    
    2) Local (versus regional) training
    
    In the cited studies, authors seem to consider that local training is not useful and regional training using static attributes brings much better performances. According to their comments, this seems to be the opinion of R2 as well.
    
    We would therefore like to highlight two points:
    
    1- Contrary to what seems to be the prevailing view, it turns out that local LSTMs have never been compared to regional LSTMs WITH static attributes (apart, probably, from the “unpublished” results mentioned by R2 in their first review). Indeed, Kratzert et al. (2018) compared local LSTM, regional LSTM WITHOUT static attributes and a third approach: “fine tuning the regional model for each catchment”. The first two approaches gave similar results, while the last approach clearly improved performance. The results of Kratzert et al. (2018) were obtained on a “subset” of the CAMELS data set. Then, when moving to the entire US CAMELS data set for their following studies (Kratzert et al. (2019), Gauch et al. (2021), Klotz et al. (2021, under review)), the authors abandoned both the local training and fine tuning methods, focusing only on improving their regional LSTM models.
    
    2- We tried to explain in AC2 that we used a different methodology for training our LSTM models. We used three independent, sufficiently long intervals (training, validation, test). This made it possible for us to apply an early stopping criterion for each catchment individually. Therefore, in our study, the number of epochs used to locally train the LSTM differs from catchment to catchment, depending on the loss obtained on the validation period.
    
    Kratzert et al. (2018) used a different approach. They had two periods: one for “training-validation” and the other to test their model. They carried out a preliminary test in which their “training-validation” period was divided into two parts: 14 years for local training and 1 year for validation. Based on the mean NSE calculated on the validation period, they chose the same number of epochs for all catchments and re-trained the model locally on the whole “training-validation” period. This approach is fully understandable since the authors argue that “the goal of this study is therefore not to find the best per-catchment model, but rather to investigate the general potential of LSTMs for the task of rainfall–runoff modelling”. Nevertheless, this approach clearly penalizes local LSTMs.
    
    We must state that we find all cited peer reviewed papers excellent and we fully agree with the choices made by their authors. We only intended to stress that, in our opinion, the hyperparameter tuning tests presented in the preprint version of our paper are not in contradiction with what is reported in the existing literature in this field, contrary to what R2 suggests.
    
    References
    
    Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., and Hochreiter, S.: Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network, Hydrol. Earth Syst. Sci., 25, 2045–2062, https://doi.org/10.5194/hess-25-2045-2021, 2021.
    
    Goodfellow, I., Bengio, Y., and Courville, A.: Deep Learning, MIT Press, available at: http://www.deeplearningbook.org, 2016.
    
    Klotz, D., Kratzert, F., Gauch, M., Keefe Sampson, A., Brandstetter, J., Klambauer, G., Hochreiter, S., and Nearing, G.: Uncertainty Estimation with Deep Learning for Rainfall–Runoff Modelling, Hydrol. Earth Syst. Sci. Discuss. [preprint], https://doi.org/10.5194/hess-2021-154, in review, 2021.
    
    Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, https://doi.org/10.5194/hess-22-6005-2018, 2018.
    
    Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019.
    
    Lees, T., Buechel, M., Anderson, B., Slater, L., Reece, S., Coxon, G., and Dadson, S. J.: Benchmarking data-driven rainfall–runoff models in Great Britain: a comparison of long short-term memory (LSTM)-based models with four lumped conceptual models, Hydrol. Earth Syst. Sci., 25, 5517–5534, https://doi.org/10.5194/hess-25-5517-2021, 2021.
    
    Citation: https://doi.org/10.5194/hess-2021-511-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (02 Feb 2022) by Efrat Morin

AR by Reyhaneh Hashemi on behalf of the Authors (16 May 2022) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (02 Jun 2022) by Efrat Morin

RR by John Quilty (14 Jul 2022)

RR by Anonymous Referee #3 (25 Jul 2022)

Suggestions for revision or reasons for rejection

In this paper entitled "How can we benefit from regime information to make use of LSTM runoff models more effectively?", Hashemi et al. developed long short term memory (LSTM) models to assess their capability for runoff modeling according to how long memory (lookback hyperparameter) depends on hydrological regimes (i.e. on the information existing up to annual time scale), how the models are trained (local, regional or "national"-scale training), and in the end, answer the question "what is the most effective way of using LSTM for making runoff predictions?" (quite a broad question).
This paper, which has already undergone a number of modifications by the authors, is overall very well written and organized, with clear objectives. This type of paper certainly deserves being brought to the hydrological community. I have a few concerns, though, that I think should be addressed before the paper is considered for final publication. They meet, to some extent, those already expressed previously by one reviewer. The authors will decide whether they can just use the comments below to modify the text or if additional trials are needed.

Main comment:
It would have been probably better to explore a little deeper the parameter space in my opinion (as emphasized by reviewer 2 previously). At least, should the paper be published, it is mandatory to explain why some important parameters were kept constant and what is the rationale behind this decision: otherwise, my feeling is it will not be of sufficient help to the readership and potential users to use this work as a support to develop their own models, for instance.
I am not saying the values of the parameters are not suited, but without any *strong rationale* (physical or anything else) supporting this choice, it is difficult, in the framework of ML/DL approaches, to justify the selection of just a few values of a limited number of hyperparameters.
For instance, it is not clear why batch size was kept to 128. Or why only 64, 128, 256 hidden units were eventually selected: not less, nothing in between? By the way, is there any specific reason for choosing log2 values? I don't think any numerical constraints would require this in the present context and gaps between successive values are large...
Also, I am wondering if it would have been interesting to use sequence lengths (lookback) up to, say, 4 years: I have not seen what the streamflow time series look like, but for some of them with strong baseflow and high multi-annual variability (as visible in some regimes of fig.4), it might be possible that some useful information is still present further back in time (even more than 2 years), and that the annual scale ("regime") does not necessarily contain all the useful information by itself (quite a number of works have been published on the topic in the past decade).
Without that, it will be probably difficult to provide a meaningful answer to question Q4 about "[...] the most effective way of using LSTM for making runoff predictions", in my opinion...

Minor suggestions:
- Introduction section, line 52: I think that confusing the hysteretic behavior and the memory length of a catchment is not strictly speaking true: the first relates mainly to the lagged response to the input, the second to the time taken by the system to dissipate the information of the input.
- Introduction section, line 54: remove "ground" (!?) and just keep "aquifers".
- Section 3.2: I understand the arguments supporting the choice of classical standardization instead of the usual minmax scaling. Yet, it would be interesting to indicate whether the two types of scaling were tested or not (from the text it seems that no trial was made using minmax scaling but it should be indicated).
- Caption of fig.9 seems to contradict the legend at the top of the figure (which says, for instance, "solid=mean" while the caption says "solid=training").
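The two scaling options contrasted in the Section 3.2 suggestion above can be compared with a short sketch. The function names and the toy series are assumptions made for illustration, and this is not the preprocessing used in the paper. Both scalers are fitted on the training period only and then applied to other data.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(train: Sequence[float], values: Sequence[float]) -> List[float]:
    """Classical standardization: subtract the training mean and divide by
    the training (population) standard deviation."""
    mu, sigma = mean(train), pstdev(train)
    return [(v - mu) / sigma for v in values]


def minmax_scale(train: Sequence[float], values: Sequence[float]) -> List[float]:
    """Min-max scaling: map the training minimum to 0 and the maximum to 1."""
    lo, hi = min(train), max(train)
    return [(v - lo) / (hi - lo) for v in values]


# Toy training series (hypothetical values) and one value outside its range.
train = [0.0, 1.0, 2.0, 3.0, 4.0]
print(minmax_scale(train, [10.0]))  # [2.5], outside the nominal [0, 1] range
print(standardize(train, [10.0]))
```

Both transformations are affine, so they differ only in the offset and scale applied. A value beyond the training extremes falls outside [0, 1] under min-max scaling, which is one reason a trial with both options would be informative to report.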

ED: Publish subject to revisions (further review by editor and referees) (31 Jul 2022) by Efrat Morin

AR by Reyhaneh Hashemi on behalf of the Authors (30 Sep 2022) Author's response Author's tracked changes Manuscript

ED: Publish as is (12 Oct 2022) by Efrat Morin

AR by Reyhaneh Hashemi on behalf of the Authors (13 Oct 2022) Author's response Manuscript

Short summary

Hydrologists have long dreamed of a tool that could adequately predict runoff in catchments. Data-driven long short-term memory (LSTM) models appear very promising to the hydrology community in this respect. Here, we have sought to benefit from traditional practices in hydrology to improve the effectiveness of LSTM models. We discovered that one LSTM parameter has a hydrologic interpretation and that there is a need to increase the data and to tune two parameters, thereby improving predictions.