To date, long short-term memory (LSTM) networks have been successfully applied to a key problem in hydrology: the prediction of runoff. Unlike traditional conceptual models, LSTM models are built on concepts that avoid the need for our knowledge of hydrology to be formally encoded into the model. The question, then, is how we can still make use of our domain knowledge and traditional practices, not to build the LSTM models themselves, as we do for conceptual models, but to use them more effectively. In the present paper, we adopt this approach, investigating how we can use information concerning the hydrologic characteristics of catchments for LSTM runoff models. In this first application of LSTM in a French context, we use 361 gauged catchments with very diverse hydrologic conditions from across France. The catchments have long time series of at least 30 years. Our main directions for investigation include (a) the relationship between LSTM performance and the length of the LSTM input sequence within different hydrologic regimes, (b) the importance of the hydrologic homogeneity of catchments when training LSTMs on a group of catchments, and (c) the interconnected influence of the local tuning of the two important LSTM hyperparameters, namely the length of the input sequence and the hidden unit size, on the performance of group-trained LSTMs. We present a classification built on three indices taken from the runoff, precipitation, and temperature regimes. We use this classification as our measure of homogeneity: catchments within the same regime are assumed to be hydrologically homogeneous. We train LSTMs on individual catchments (local-level training), on catchments within the same regime (regime-level training), and on the entire sample (national-level training). We benchmark local LSTMs using the GR4J conceptual model, which is able to represent the water gains/losses in a catchment. 
We show that LSTM performance is most sensitive to the length of the input sequence in the Uniform and Nival regimes, where the dominant hydrologic process of the regime has clear long-term dynamics; thus, long input sequences should be chosen in these cases. In the other regimes, this level of sensitivity is not found, and in some regimes almost no sensitivity is observed; the input sequence in these regimes therefore does not need to be long. Overall, our homogeneous regime-level training slightly outperforms our heterogeneous national-level training, indicating that both levels of training achieve a similar balance between data adequacy and the complexity of the representation(s) to be learned. We do not, however, exclude a potential role of the regime-informed property of our national LSTMs, which use the previous classification variables as static attributes. Last but not least, we demonstrate that the local selection of the two important LSTM hyperparameters (the length of the input sequence and the hidden unit size) combined with national-level training can lead to the best runoff prediction performance.

Surface-water runoff (referred to hereafter as runoff) is the response of a catchment to its intakes and yields. The reliable prediction of runoff is
essential for the management of many water-related hazards and water resources, and it has been the focus of numerous studies in hydrology over the past
decades. Nevertheless, the accurate prediction of runoff has remained a challenge due to the non-linearity of the many surface and subsurface
processes involved

Conforming to the daily runoff model from

We can decompose the error associated with any deep learning network (including LSTM) into the following three components

In line with classical regionalization

Local and regional LSTMs have already been investigated and compared against multiple conceptual models in several studies. The reader is referred to

Seeking to benefit from traditional methods of hydrologic classification

To identify hydrologic similarity, we present a purely hydrologic classification built on three indices obtained from the analysis of runoff,
precipitation, and temperature regimes. To date, only one other investigation of the data homogeneity component in training LSTMs has been undertaken
in a recently published study conducted in parallel with the present research

The last investigation path in this paper – inspired by the fine-tuning experiment performed by

In pursuing these paths, we apply LSTM to a sample consisting of 361 gauged catchments with very diverse hydrologic conditions from all over France; this paper is the first application of LSTM to the French context. The discharge time series of the catchments are at least 30 years long (between 30 and 60 years). In all experiments, the LSTM is tuned with respect to the lookback and hidden unit size as well as the dropout rate, and three disjoint subsets (training, validation, and test) are used. We also use the non-mass-conservative GR4J conceptual model to benchmark the LSTM.

The remainder of this paper is organized as follows. The next section presents the available data and our hydrologic catchment
classification. Section

Spatial variation in the three indices used for hydrologic catchment classification: IQ, IP, and

The data set used in this study contains time series of hydrometeorological variables and time-invariant catchment attributes. It is a subset of a
larger data set of 4190 French catchments

The catchment sample for this paper includes 361 catchments from all over France with discharge time series ranging from 30 to 60 years. These catchments range in size from 5 to 13 806

The classification proposed in this paper uses readily available data and is inspired by

In this definition, the IQ and IP indices give information on runoff variability and precipitation variability throughout the year,
respectively. Low values of IQ and IP indicate a uniform distribution across the year, whereas high values reflect the presence
of contrasting dry and wet seasons. A low IQ can also imply the presence of groundwater or reservoirs (natural or artificial), which tend to
attenuate runoff fluctuations at the catchment outlet. The
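Because the exact definitions of the indices appear in the paper's equations (elided here), the sketch below is only a hypothetical seasonality measure with the same qualitative behaviour: values near 0 for a uniform monthly distribution and larger values for contrasting dry and wet seasons.

```python
import numpy as np

def seasonality_index(monthly_means):
    """Hypothetical seasonality index: normalized spread of the 12
    interannual monthly means around the annual mean. This is NOT the
    paper's exact IQ/IP definition, only a common formulation with the
    same qualitative behaviour."""
    m = np.asarray(monthly_means, dtype=float)
    return float(np.sum(np.abs(m - m.mean())) / m.sum())

# A uniform regime scores 0; a strongly seasonal one scores much higher.
uniform = seasonality_index([10.0] * 12)
seasonal = seasonality_index([2, 2, 2, 2, 2, 2, 2, 2, 2, 30, 30, 30])
```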

Using the specified indices, the following classification criteria are defined and applied to each catchment in the sample to determine its hydrologic
regime (Fig.

Classification of the catchments into five hydrologic regimes based on five conditions built on

The location of the catchments within each regime is shown in Fig.

Distribution of catchments from each of the five regimes across France: Uniform, Mediterranean, Oceanic, Nivo–Pluvial, and Nival. Each point represents one catchment and is coloured according to its regime.

Interannual monthly (or regime of) runoff (

For each regime, variations in interannual monthly runoff, total precipitation, and mean temperature are presented in Fig.

Stacked bar charts showing the variation in the four physical attributes used in this paper within each regime and the entire sample. The end-to-end segments of each bar correspond to the intervals for each quartile of the physical attribute of interest. The quartiles are computed by taking all 361 catchments into account. The number inside each segment denotes its length.

Stacked bar charts showing the variation in the three climatic attributes used in this paper within each regime and the entire sample. The end-to-end segments of each bar correspond to the intervals for each quartile of the climatic attribute of interest. The quartiles are computed by taking all 361 catchments into account. The number inside each segment denotes its length.

In this paper, we use four physical attributes – surface area (

LSTM networks are a family of recurrent neural networks (RNNs) that address issues of both vanishing and exploding gradients

The standard LSTM involves two feedback connections operating at different timescales: the shallow-level hidden state (

It describes the cell state as a linear self-loop of form

is called the forget gate and has the following properties:

It is a unit analogous in nature to a neuron, meaning that (1) it takes a weighted sum of its inputs (

Its non-linear function is the sigmoid function (

The presence of the term

So far, we have provided the definition of all terms in Eq. (

The network output at time step

It is now clear that

The notation, dimensions (for a single time step), and definition of the different variables in the LSTM's forward pass equations are given in Table
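For concreteness, the forward pass described above can be sketched in NumPy using the standard (textbook) LSTM cell equations, with the sigmoid gates, tanh candidate, and the linear self-loop on the cell state; the variable names and stacked-weight layout below are our own choices, not the paper's notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell. W, U, b hold the
    stacked weights for the forget (f), input (i), candidate (g),
    and output (o) parts, each of size n = hidden unit size."""
    n = h_prev.size
    z = W @ x_t + U @ h_prev + b          # all pre-activations at once
    f = sigmoid(z[0:n])                    # forget gate
    i = sigmoid(z[n:2 * n])                # input gate
    g = np.tanh(z[2 * n:3 * n])            # candidate cell update
    o = sigmoid(z[3 * n:4 * n])            # output gate
    c_t = f * c_prev + i * g               # linear self-loop (cell state)
    h_t = o * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Tiny smoke run: input size 3, hidden unit size 2, sequence length 5.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3)); U = rng.normal(size=(8, 2)); b = np.zeros(8)
h = np.zeros(2); c = np.zeros(2)
for t in range(5):
    h, c = lstm_step(rng.normal(size=3), h, c, W, U, b)
```

Because the hidden state is a sigmoid-gated tanh of the cell state, its components are always bounded in magnitude by 1, while the cell state itself is unbounded.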

The period for which there is a full discharge record differs between the catchments in the sample. To obtain training, validation, and test data sets, the data for each individual catchment are divided into three sets as follows: the most recent period containing 10 years of full discharge records is set as the test period; working backwards, the next period that contains 10 years of full discharge records is set as the validation period; and what remains constitutes the training period, the length of which varies between 10 and 40 years in the sample.
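Under the simplifying assumption of a gap-free daily record (the paper additionally requires the 10-year blocks to contain full discharge records), the chronological split can be sketched as:

```python
import numpy as np

def split_by_period(n_days, years_test=10, years_val=10):
    """Simplified version of the paper's split for one catchment,
    assuming a gap-free daily record: the most recent 10 years form
    the test set, the 10 years before them the validation set, and
    everything earlier the training set."""
    d = 365  # leap days ignored in this sketch
    idx = np.arange(n_days)
    test = idx[-years_test * d:]
    val = idx[-(years_test + years_val) * d:-years_test * d]
    train = idx[:-(years_test + years_val) * d]
    return train, val, test

# A 40-year record yields a 20-year training period.
train, val, test = split_by_period(40 * 365)
```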

As the values for features and the target vary widely, a feature-wise standardization for the features and the target is performed. The
standardization is performed using the mean and the standard deviation of the training data. This form of standardization – where the input data are
centred around 0 and are scaled by the standard deviation – is also used by
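A minimal sketch of this leakage-free standardization, where validation (or test) data are transformed with the training period's statistics:

```python
import numpy as np

def standardize(train, other):
    """Feature-wise standardization using only the training period's
    mean and standard deviation, as in the paper: validation/test data
    are transformed with the training statistics to avoid leakage."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / sigma, (other - mu) / sigma

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
Z_train, Z_val = standardize(X_train, X_val)
```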

Notation, dimensions, and definition of the terms and operators in Eqs. (

The hyperparameters tested for all LSTMs in the paper and their variations.

In this paper, to evaluate runoff prediction performance, we use the Kling–Gupta efficiency (KGE) score
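Assuming the standard 2009 formulation of the score (the paper's exact variant is given in its equations), KGE can be computed as:

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency in its original 2009 form (an assumption;
    the paper may use a later variant): KGE = 1 - sqrt((r-1)^2 +
    (alpha-1)^2 + (beta-1)^2), with r the linear correlation,
    alpha = std(sim)/std(obs), and beta = mean(sim)/mean(obs).
    A perfect simulation gives KGE = 1."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect = kge(obs, obs)       # close to 1
biased = kge(2 * obs, obs)    # doubled flows are heavily penalized
```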

When addressing a research question using a deep learning model, it is important to limit (as much as possible) any potential conclusion biases resulting from the use of a model that has not been hyperparameter tuned. LSTM has, in particular, two interconnected hyperparameters that need to be tuned together: the lookback and hidden unit size. For this purpose, for each LSTM in the paper, we have tested all combinations of all variations in the hyperparameters listed in Table
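The grid itself can be enumerated with a Cartesian product. Below, the dropout rates are those reported in the paper (0, 0.2, and 0.4), while the lookback and hidden-size values are placeholders; only their counts (six and three) match the paper's grid of 54 hyperparameter sets:

```python
from itertools import product

# 6 lookbacks x 3 hidden unit sizes x 3 dropout rates = 54 sets.
lookbacks = [30, 90, 180, 270, 365, 730]   # hypothetical values (6 variations)
hidden_sizes = [64, 128, 256]              # hypothetical values (3 variations)
dropouts = [0.0, 0.2, 0.4]                 # rates stated in the paper

grid = list(product(lookbacks, hidden_sizes, dropouts))
```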

The remainder of this subsection discusses the choice of and variation in the tuning hyperparameters in this paper.

The gradient-based Adam algorithm

The early stopping algorithm implemented in the paper already acts as a regularizer.

List of the dynamic and static features used in different LSTM models in the paper.

Here, the goal is to train an LSTM that takes the past

Depending on whether the LSTM is trained on just a single catchment or on a group of catchments, either the mean squared error (MSE, Eq.
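A sketch of the sequence-to-one sample construction implied here, pairing each day's runoff with the preceding `lookback` days of dynamic features, together with the MSE loss used in local training (variable names are ours):

```python
import numpy as np

def make_sequences(features, target, lookback):
    """Sequence-to-one samples: each sample pairs the past `lookback`
    days of dynamic features with the runoff of the day that follows
    them."""
    X, y = [], []
    for t in range(lookback, len(target)):
        X.append(features[t - lookback:t])
        y.append(target[t])
    return np.stack(X), np.array(y)

def mse(pred, obs):
    """Mean squared error, the loss used for single-catchment training."""
    return float(np.mean((pred - obs) ** 2))

# 400 days of 4 dynamic features with a 90-day lookback -> 310 samples.
rng = np.random.default_rng(2)
feats = rng.normal(size=(400, 4))
runoff = rng.normal(size=400)
X, y = make_sequences(feats, runoff, lookback=90)
```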

Names, training catchments, approaches to the selection of the best hyperparameter set, and features used for the five LSTM models in the paper.

We used the Keras library

The LSTM is trained both locally, using the data from “individual catchments”, and regionally, using the data from “a group of catchments”. In local training, the loss function is the MSE, and only the dynamic features of Table

361

54 group training sessions on the 71 Uniform catchments using the REGIONAL REGIME model,

54 group training sessions on the 62 Mediterranean catchments using the REGIONAL REGIME model,

54 group training sessions on the 101 Oceanic catchments using the REGIONAL REGIME model,

54 group training sessions on the 100 Nivo–Pluvial catchments using the REGIONAL REGIME model,

54 group training sessions on the 27 Nival catchments using the REGIONAL REGIME model,

54 group training sessions on the 361 sample catchments using the REGIONAL NATIONAL model.

This gives a total of 19 818 (
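The stated total is consistent with 54 hyperparameter sets being run for each training configuration: 361 local (SINGLE) configurations plus the six group configurations listed above (five REGIONAL REGIME and one REGIONAL NATIONAL):

```python
# Consistency check on the session count.
local_sessions = 361 * 54   # one SINGLE configuration per catchment
group_sessions = 6 * 54     # 5 REGIONAL REGIME models + 1 REGIONAL NATIONAL
total = local_sessions + group_sessions
assert total == 19818
```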

Conceptual flowchart of how SINGLE, REGIONAL, and HYBRID LSTM models and their submodels (green rounded rectangles) are built based on three decision criteria (orange rhombuses): the training approach, the approach to the selection of the best hyperparameters, and the training catchments.

So far, different local and regional LSTMs have been trained for the 54 hyperparameter sets. Now, the best hyperparameter set must be chosen for the trained LSTMs. For SINGLEs, the only possible approach is to select, for each catchment, its own best set: the hyperparameter set that offers the
best KGE for the validation data. However, for REGIONALs, be they NATIONAL or REGIME, two possibilities exist. We can identify either one best
set for each of the training catchments or one best overall set for the entire model. In this paper, we investigate both approaches. By crossing the
two (local and regional) training approaches with the two approaches to the selection of the best hyperparameter set (as shown in Fig.
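The two selection approaches can be sketched as follows; the scores are hypothetical, and the use of the median as the model-wise aggregate is our assumption, not necessarily the paper's criterion:

```python
import numpy as np

# val_kge[c][s]: validation KGE of catchment c under hyperparameter set s.
# Hypothetical scores for 3 catchments x 4 sets, for illustration only.
val_kge = np.array([[0.70, 0.82, 0.75, 0.61],
                    [0.88, 0.79, 0.90, 0.84],
                    [0.55, 0.60, 0.58, 0.64]])

# Catchment-wise selection: each catchment keeps the set that maximizes
# its own validation KGE.
best_per_catchment = val_kge.argmax(axis=1)

# Model-wise selection: one set for the whole model, here the one with
# the best median KGE across catchments (aggregation is our assumption).
best_overall = int(np.median(val_kge, axis=0).argmax())
```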

The daily lumped GR4J model (Génie Rural à 4 paramètres Journalier;

GR4J is a parsimonious model incorporating only four free parameters. CemaNeige has two parameters and computes snow accumulation and snowmelt as
outputs

Compulsory inputs to the GR4J model consist of daily total precipitation (

Our results showed that the use of a second regularization strategy (dropout rates of 0.2 and 0.4) in conjunction with early stopping would not further improve performance (compared with the use of early stopping alone, i.e. dropout rate of 0). All results presented here correspond to a dropout rate of 0.

In Fig.

LSTM performance variations with respect to the length of input sequences within different regimes for the SINGLE and REGIONAL REGIME models. In each panel, the dashed and dotted lines correspond to the training and validation data respectively. The solid line is the mean of the training and validation lines. Each line plots the median KGE scores (on the

For both models, the curves tend to show a consistent pattern within the various regimes. The median KGE first increases at a certain slope and then, from a specific lookback onwards, the KGE remains largely unchanged or even decreases. Both the slope and the lookback appear to be regime dependent. In the Uniform and Nival regimes, the slope is distinctively pronounced for both models, and we find the highest sensitivity within these two regimes. In the Mediterranean regime, the median KGE varies between 0.81 and 0.85 and between 0.77 and 0.82 for the SINGLE and REGIONAL REGIME models respectively. The initial slope is steeper in this regime than in the Oceanic regime, and KGE stalls at an earlier point. In both regimes, the global sensitivity of performance to lookback size is low. In the Nivo–Pluvial regime, the initial slope is shallow, creating an almost flat pattern that also reflects low global sensitivity with respect to lookback variations. The range of variation in the median KGE is 0.85–0.89 and 0.85–0.88 for the SINGLE and REGIONAL REGIME models respectively.

The continuous tendency for performance to improve with increasing lookback up to lookbacks longer than a year within the Uniform regime, as compared
to the multi-month scale in other regimes, is consistent with the multi-year and multi-month catchment memory scales shown by

Cumulative distribution functions (CDFs) of the KGE scores of the test data for three LSTM models: SINGLE (blue), REGIONAL REGIME (orange), and REGIONAL NATIONAL (green). From top to bottom, the first five panels indicate the CDFs of one of the five regimes: Uniform, Mediterranean, Oceanic, Nivo–Pluvial, and Nival. The last panel corresponds to the distributions of the entire sample.

Figure

Next, homogeneous group training (REGIONAL REGIME) is specifically compared with non-homogeneous group training (REGIONAL NATIONAL). In the Mediterranean catchments, the REGIONAL REGIME model has a lower median KGE than the REGIONAL NATIONAL model, whereas the opposite holds in the Nivo–Pluvial regime. In all other regimes, both training types have almost the same median KGE. In the Nivo–Pluvial regime, the CDF of the REGIONAL REGIME model is shifted entirely towards higher KGE scores. In the Nival regime, although both models have the same median KGE, the CDF curve of the REGIONAL NATIONAL model is shifted towards better KGEs. Overall, when all catchments are considered, the homogeneous group training slightly outperforms the group training with mixed regimes in terms of the median KGE score. However, their CDFs are superposed for high KGEs.

Cumulative distribution functions (CDFs) of the KGE scores of the test data for the group-trained LSTM models: REGIONAL REGIME (orange), REGIONAL NATIONAL (green), HYBRID REGIME (red), and HYBRID NATIONAL (purple). From top to bottom, the first five panels indicate the CDFs for each of the five regimes: Uniform, Mediterranean, Oceanic, Nivo–Pluvial, and Nival. The last panel corresponds to the distributions of the entire sample.

Figure

There is a clear performance improvement from the REGIONAL NATIONAL model to the HYBRID NATIONAL model in almost all regimes as well as overall, both in terms of median KGE scores and in the shift of the CDF curve towards better KGEs. However, in moving from the REGIONAL REGIME model to the HYBRID REGIME model, there is little or no improvement in performance, except in the Mediterranean regime. Of all tested LSTMs, the HYBRID NATIONAL model performs best.

Median KGE scores, within different regimes and overall, for the GR4J model compared to the LSTM models.

Variation in KGE scores with respect to the runoff ratio (

Table

The Uniform and Nival regimes can be distinguished as the two regimes with the cleanest performance–lookback pattern, where performance increases with increasing lookback size. We can relate this to the long-term dynamics of their dominant hydrologic processes: the recharge and discharge of the aquifer and the thawing of accumulated snow.

Uniform catchments occur mainly in areas known to be highly influenced by large aquifers, such as the aquifers of the Seine or the Somme river basins
in the north of France (Fig.

Five examples from the Mediterranean regime, each with a different lookback sensitivity pattern.

In the Mediterranean regime, the performance–lookback pattern is characterized by a narrow spread in the KGE scores for different lookbacks, whereas a clear offset was expected for small lookback values. In this regime, internal states (e.g. soil moisture) do not depend on long antecedent periods, as precipitation tends to generate flash floods and is particularly intense in the autumn (Fig.

In the Oceanic and Nivo–Pluvial regimes, the performance–lookback pattern displays little variation, and there is far less sensitivity to lookback in the median KGE scores. We attribute this to the intermediate-term dynamics of the dominant hydrologic processes in these two regimes.

To answer this question, we need to take SINGLE, REGIONAL REGIME, and REGIONAL NATIONAL LSTMs into account. In the passage from individual catchment (local) training to group (regional) training, we increased the capacity of the model (by adding 10 static attributes) and the size of the data. As a result, LSTM performance improved in almost all regimes and overall. That is, in passing both from local to homogeneous regional training and from local to heterogeneous regional training, the precision that the LSTM gains is “almost” always greater than the generalization it loses. For Uniform, Mediterranean, and (to a lesser extent) Nivo–Pluvial catchments, the passage from local to at least one of the regional LSTMs is a real gain. For the two other regimes, the benefit is less obvious, and performance improvements do not turn out to be significant.

One explanation for the small performance difference between local and regional (homogeneous or heterogeneous) training is that the quantity of available data at the local level is already sufficiently large with respect to the complexity of catchment representations. Thus, the LSTM has already “asymptoted” to an error very close to the minimum possible error. At the regional level, although the amount of data has increased greatly, the result of the gained precision, lost generalization, and varied complexity is not sufficiently positive to push the final error to a point closer to the minimum possible error. Additionally, in local training, selection of the best hyperparameter set is also local (catchment-wise), allowing each catchment to take its own best set.

To answer this question, we need to compare the REGIONAL REGIME model against the REGIONAL NATIONAL model. For almost all regimes as well as overall, when hydrologically similar but fewer catchments are used, median KGE scores are as good as when far more training catchments from various regimes are used. This is interesting for at least two reasons.

First, both models benefit from group training, and their data sets are already several times larger than the local-level data. However, of the two, it is not the model with the greater amount of training data that performs best. For example, in the Nival regime, the (heterogeneous) national model uses a data set 13 times larger than that used by the (homogeneous) regime model. Nevertheless, they have the same median KGE score. The point to note here is that, in passing from the regime level to the national level, we did not increase the data from this particular regime (representation) 13-fold; we added a considerable amount (13 times the regime size) of data from “dissimilar” representations. This is very different from including a large quantity of data from the “similar” representation, as occurs in the passage from local to regime training. Therefore, for non-homogeneous training, there is a “varied”, but not necessarily an added, complexity with respect to the representations.

Second, for both forms of training, the complexity (and learning capacity) of the model is the same – exactly the same model with identical static attributes is used for both forms of training. In regime (homogeneous) training, each REGIME LSTM learns a single representation, whereas the LSTM is exposed to the representations from all regimes in national (non-homogeneous) training.

What appears to be important for both models is whether the varied complexity is shifted towards a simpler or a more difficult learning
representation. In the latter case, it is then important whether there are sufficient data. The complexity of representation(s) appears to vary from
regime to regime. Given our results, we can identify three levels:

The first level is regimes with “self-sufficient” representations where homogeneous training clearly outperforms heterogeneous training. The only instance of this level is found in the Nivo–Pluvial regime. In this regime, the new complexity appears to be shifted towards a “more complex” representation.

The second level is regimes with “self-insufficient” representations, which must have inputs from contrasting/dissimilar representations to be learned by the LSTMs. The only instance of this level is the Mediterranean regime.

The third level is regimes with “neutral” representations for which the addition/removal of contrasting representations has little or no effect on the complexity of the task for LSTM. The Uniform, Oceanic, and Nival regimes exhibit this level of representation. However, if we look at the performance overall, it turns out that almost the same level of data adequacy–representation complexity is achieved in both regime and national training forms.

In our results, we did not observe the performance improvement that

Our results suggest that the performance of an LSTM-based runoff model is controlled by two factors: (1) its training approach and (2) its lookback–hidden unit size tuning. The results of this paper suggest that maximizing the number of training catchments (national-scale training) in conjunction with local selection of the lookback–hidden unit size set gives the best results, both within the regimes and overall. The interesting point is that only the “combination” of the two components of this setting gives the best results; neither of them on its own appears to be a major winning factor: local LSTMs with local lookback–hidden unit size sets did not outperform regional LSTMs, and NATIONAL LSTMs did not outperform REGIME LSTMs. We should also remember that the NATIONAL LSTMs that we tested are regime-informed; thus, we might include this property as the third component of this setting.

We have previously discussed the importance of lookback as a hyperparameter for LSTM. Here, we note the importance of tuning lookback and hidden unit
size at a local scale so that the LSTM can better capture the dynamics of each catchment separately. The relationship between these two
hyperparameters has been previously recognized by

In this study, we have used a sample of 361 gauged catchments in the hydrologically diverse French context. Our goal has been to exploit catchment
hydrologic information when using LSTM-based runoff models. Thus, we have proposed a regime classification built on three hydrologic indices to
identify catchments with similar hydrologic behaviours (representations). We have then trained the LSTM once locally – on individual catchments – and once regionally – on a group of catchments. We have performed the regional training at two scales: (1) at the scale of each hydrologic regime (i.e. only catchments from the same regime have been trained together) and (2) at the national scale (i.e. all 361 catchments have been trained together). For all training passes, we have performed 54 hyperparameter tunings on three hyperparameters: the dropout rate (three variations) as well as the two important LSTM hyperparameters, namely input sequence length (six variations) and hidden unit size (three variations). We have investigated the relationship between the size of an LSTM's input sequence and LSTM performance within different regimes. We have tested a new approach to the selection of the best hyperparameter set for regional LSTMs, and we have examined how different training and hyperparameter selection approaches change the performance of LSTM. For training and evaluation of all local and regional LSTMs, we have used three long completely independent data sets: training (10

In the Uniform and Nival regimes, where there is a clean long-term dominant process, we found a clear performance–lookback pattern, with performance increasing with increasing lookback up to an effective value, which depended on the time scaling of the dominant process. In the Mediterranean regime, characterized by its propensity to generate flash floods, we expected a similar distinct pattern but with a much shorter effective lookback. What we found was a narrow spread of performance scores for different lookbacks. We assumed this to relate to the underlying different temporal dynamics in this regime, given that several catchments in this regime might be locally affected by the presence of karstic geological features.

In the Oceanic and Nivo–Pluvial regimes, we found a largely unchanging performance–lookback pattern, reflecting performance insensitivity to changes in lookback values. This indicates that, in these regimes, adequate performance can be achieved without using large lookbacks.

Whether an LSTM benefits from the passage from local to regional or not depends on (a) the amount of data at the local scale and (b) how it can negotiate the trade-off between the varied complexity of the representation(s) to be learned and the augmented data at the regional scale. If, in the move from local to regional, there is also an increase in model complexity produced, for example, by the inclusion of multiple attributes in the regional model, this trade-off could become harder because the LSTM would need to further trade generalization for precision (due to the more complex model). The passage from local to regime level produced a slightly better performance improvement than did the passage from local to national level.

At the local scale of a single catchment, if the representation to be learned is “smooth” enough to elicit, or if the catchment's data are so abundant that there is no difficulty in eliciting whatever complex representation they contain, the LSTM will already be very close to the minimum possible error. In such cases, there will be “less room” to improve performance by passing to regional LSTMs.

At the regional scale, from the regime (hydrologically homogeneous) level to the national (hydrologically heterogeneous) level, the model capacity is the same. A large quantity of dissimilar data are added, thereby varying the complexity of the new representations to be learned. What appears to be important is whether the varied complexity is shifted towards a simpler or a more difficult learning representation. In the latter case, the issue is then whether there is an adequate quantity of data. Our results showed regime training to perform better overall, but the difference was very slight, and we can consider the two forms of regional training to be equivalent. This means that, for both regime and national training levels, the quantity of data has been adequate and appropriate with respect to the complexity of the representation(s) at that level. Nevertheless, the potential role of our national LSTM's regime-informed property in simplifying the task in the heterogeneous space should not be excluded.

Given the almost equivalent performance of REGIME and regime-informed NATIONAL LSTMs, in choosing between them, we may take into consideration that the former needs less data but requires an external classification – a precise encoding of our knowledge into the right classification. The latter requires a national database but calls for no classification (criterion).

To improve the performance of an LSTM model, two elements were found to be important: the training approach and the lookback–hidden unit size tuning. The best performance was achieved by the HYBRID NATIONAL LSTMs, which combine national training with local tuning of the two hyperparameters (lookback and hidden unit size) and provide regime information through static attributes.

Our findings allow us to identify a number of directions for further research:

The conclusions drawn here have been premised on a single condition concerning the similarity and size of data. References to an “increase in data size” at the national training level designated an increase in the data of dissimilar representations with the increase always falling within the following bands:

A useful step for the improvement of homogeneous training would be to refine the current classification to maximize the number of self-sufficient regimes.

Our hydrologically heterogeneous LSTMs were regime-informed. We encourage verification of the conjecture that an LSTM is able to learn the classification itself if provided with regime information (through classification attributes). A simple way to test this is to train regional LSTMs once with the classification indices included among the static features and once with them excluded, and then to compare the results. This paper does the former but not the latter.

A future research direction could be to explore the relationship between LSTM's optimal lookback and memory-related metrics, such as the catchment forgetting curve

The methods presented in this paper are developed for gauged catchments. A further step would be to extend them to approaches applicable to ungauged catchments – catchments not used in training.

To request access to the results and the codes upon which this study is based, please contact the corresponding author.

The meteorological forcing data are produced and provided by Météo France

RH designed all of the experiments with advice from PB, PAG, and PJ. RH conducted all of the experiments and wrote the manuscript. PB gave guidance on the data and GR4J simulations. PAG assisted RH with the execution of experiments. PJ supervised the work and was in charge of the overall direction. Analysis of the results and revision of the manuscript were carried out collectively.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was granted access to the IDRIS GPU resources under the allocation 2022-AD011013339 made by GENCI. We would like to extend our sincere thanks to Jérémy Verrier for his constant support and help with respect to using INRAE and GENCI's HPC resources. We would also like to thank our handling editor (Efrat Morin), John Quilty, and the two anonymous referees, who provided input that substantially improved this paper.

This paper was edited by Efrat Morin and reviewed by John Quilty and two anonymous referees.