Interactive comment on “Rainfall-Runoff modelling using Long Short-Term Memory (LSTM) networks”

This paper utilizes a data-driven approach, based on recurrent neural networks, to model rainfall-runoff relationships. A novel method is applied to model runoff in catchments in the continental U.S. where gage data and meteorological forcings are available, and results are compared with existing process-based model results, which are used as a benchmark. The LSTM method presented is tested through various experiments in which the network is either trained for individual catchments, trained for large aggregated regional catchments, or handled with a combination approach in which models are initialized based on large catchments and then “fine-tuned” to smaller catchments. This study introduces LSTM as an efficient hydrological modelling approach that is shown to provide predictions of similar quality to an existing process-based model.

characteristics of the sub-surface, are mostly only available for small, experimental watersheds, limiting the model's applicability for larger river basins in an operational context. The high computational costs further limit their application, especially if uncertainty estimations and multiple model runs within an ensemble forecasting framework are required (Clark et al., 2017).
ANNs are especially known to mimic highly non-linear and complex systems well. Therefore, the first studies using ANNs for rainfall-runoff prediction date back to the early 1990s (Daniell, 1991; Halff et al., 1993). Since then, many studies have applied ANNs for modelling runoff processes (see for example Abrahart et al. (2012) and the ASCE Task Committee on Application of Artificial Neural Networks (2000) for a historic overview). However, a drawback of feed-forward ANNs, which have mainly been used in the past, is that for time series analysis any information about the sequential order of the inputs is lost. Recurrent neural networks (RNNs) are a special type of neural network architecture that has been specifically designed to capture temporal dynamics by processing the input in its sequential order (Rumelhart et al., 1986). Carriere et al. (1996) and Hsu et al. (1997) conducted the first studies using RNNs for rainfall-runoff modelling. The former authors tested the use of RNNs under laboratory conditions and demonstrated their potential use for event-based applications. In their study, Hsu et al. (1997) compared an RNN to a traditional ANN. Even though the traditional ANN in general performed equally well, they found that the number of delayed inputs, which are provided as driving inputs to the ANN, is a critical hyperparameter. The RNN, due to its architecture, made the search for this number obsolete. Kumar et al. (2004) also used RNNs for monthly streamflow prediction and found them to outperform a traditional feed-forward ANN.
In recent years, neural networks have gained a lot of attention under the name of Deep Learning (DL). As in hydrological modelling, the success of DL approaches is largely facilitated by improvements in computer technology (especially through graphic processing units, or GPUs; Schmidhuber, 2015) and by the availability of huge datasets (Halevy et al., 2009; Schmidhuber, 2015). While the most well-known applications of DL are in the fields of computer vision (Farabet et al., 2013; Krizhevsky et al., 2012; Tompson et al., 2014), speech recognition (Hinton et al., 2012) and natural language processing, few attempts have been made to apply recent advances in DL to hydrological problems. Shi et al. (2015) investigated a deep learning approach for precipitation nowcasting. Tao et al. (2016) used a deep neural network for bias correction of satellite precipitation products. Recently, Fang et al. (2017) investigated the use of deep learning models to predict soil moisture in the context of NASA's Soil Moisture Active Passive (SMAP) satellite mission. In general, the potential use and benefits of DL approaches in the field of hydrology and water sciences have only recently come into the focus of discussion (Marçais and de Dreuzy, 2017; Shen et al., 2018).
For problems in which the sequential order of the inputs matters, the current state-of-the-art network architecture is the so-called "Long Short-Term Memory" (LSTM) network, which in its initial form was introduced by Hochreiter and Schmidhuber (1997). Through its specially designed architecture, the LSTM overcomes the inability of the traditional RNN to learn long-term dependencies, which represent e.g. storage effects within hydrological catchments that may play an important role for hydrological processes, for example in snow-driven catchments.
Regardless of the hydrological modelling approach applied, any model will typically be calibrated for specific catchments for which observed time series of meteorological and hydrological data are available. The calibration procedure is required because models are only simplifications of real catchment hydrology and model parameters have to effectively represent non-resolved processes and any effects of subgrid-scale heterogeneity in catchment characteristics (e.g. soil hydraulic properties) (Beven, 1995; Merz et al., 2006). The transferability of model parameters (regionalization) from catchments where meteorological and runoff data are available to ungauged or data-scarce basins is one of the ongoing challenges in hydrology (Buytaert and Beven, 2009; He et al., 2011; Samaniego et al., 2010).
The aim of this study is to explore the potential of the LSTM architecture (in the adapted version proposed by Gers et al. (2000)) to describe the rainfall-runoff behavior of a large number of differently complex catchments at the daily time scale.
Additionally, we want to analyze the potential of LSTMs for regionalizing the rainfall-runoff response by training a single model for a multitude of catchments. In order to allow for a more general conclusion about the suitability of our modelling approach, we test this approach on a large number of catchments of the CAMELS data set (Addor et al., 2017b; Newman et al., 2014). This dataset is freely available and includes meteorological forcing data and observed discharge for 671 catchments across the contiguous United States. For each basin, the CAMELS data set also includes time series of simulated discharge from the Sacramento Soil Moisture Accounting Model (Burnash et al., 1973) coupled with the Snow-17 snow model (Anderson, 1973). In our study, we use these simulations as a benchmark to compare our model results with an established modelling approach.
The paper is structured in the following way: in Sect. 2, we briefly describe the LSTM network architecture and the dataset used. This is followed by an introduction to three different experiments: in the first experiment, we test the general ability of the LSTM to model rainfall-runoff processes for a large number of individual catchments. The second experiment investigates the capability of LSTMs for regional modelling, while the last tests whether the regional models can help to enhance the simulation performance for individual catchments. Section 3 presents and discusses the results of our experiments, before we end our paper with a conclusion and outlook for future studies.

The Long Short-Term Memory network

In this section, we introduce the LSTM architecture in more detail. Besides a technical description of the network internals, we added a "hydrological interpretation of the LSTM" at the end of this section (see Sect. 2.1.1) in order to bridge differences between the hydrology and deep learning research communities.

The LSTM architecture is a special kind of recurrent neural network (RNN), designed to overcome the weakness of the traditional RNN in learning long-term dependencies. Bengio et al. (1994) have shown that the traditional RNN can hardly remember sequences with a length of over 10. For daily streamflow modelling, this would imply that we could only use the last 10 days of meteorological data as input to predict the streamflow of the next day. This period is too short considering the memory of catchments, including groundwater, snow or even glacier storages, with lag times between precipitation and discharge of up to several years.
To explain how the RNN and the LSTM work, we unfold the recurrence of the network into a directed acyclic graph (see Fig. 1). The output (in our case discharge) for a specific time step is predicted from the input x = [x_{-n}, ..., x_0], consisting of the last n consecutive time steps of independent variables (in our case daily precipitation, min/max temperature, solar radiation and vapor pressure), which is processed sequentially. In each time step t (-n ≤ t ≤ 0), the current input x_t is processed in the recurrent cells of each layer in the network.
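The sequential processing described above can be sketched in a few lines of NumPy. All sizes, weights and inputs below are illustrative assumptions (random toy values), not the study's trained configuration; the point is only the shape of the computation: a state vector is updated once per time step, and the final state yields a single discharge prediction.

```python
import numpy as np

# Illustrative sizes: 365 time steps, 5 meteorological inputs, state length 20.
n_steps, n_inputs, n_hidden = 365, 5, 20
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights (hidden state)
U = rng.normal(scale=0.1, size=(n_hidden, n_inputs))  # input weights
b = np.zeros(n_hidden)                                # bias vector
w_out = rng.normal(scale=0.1, size=n_hidden)          # hypothetical dense output layer

x = rng.normal(size=(n_steps, n_inputs))  # one input sequence x_{-n}, ..., x_0
h = np.zeros(n_hidden)                    # hidden state, initialized as zeros
for x_t in x:                             # time steps are processed sequentially
    h = np.tanh(W @ h + U @ x_t + b)
q_hat = w_out @ h                         # single discharge prediction from last state
```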
The difference between the traditional RNN and the LSTM lies in the internal operations of the recurrent cell (encircled in Fig. 1), which are depicted in Fig. 2.

In a traditional RNN cell, only one internal state h_t exists (see Fig. 2a), which is recomputed in every time step by the following equation:

h_t = g(W h_{t-1} + U x_t + b),    (1)

where g(·) is the activation function (typically the hyperbolic tangent), W and U are the adjustable weight matrices of the hidden state h and the input x, and b is an adjustable bias vector. In the first time step, the hidden state is initialized as a vector of zeros, and its length is a user-defined hyperparameter of the network. In comparison, the LSTM has (i) an additional cell state or cell memory c_t in which information can be stored, and (ii) three gates that control the information flow within the LSTM cell (three encircled letters in Fig. 2b). The first gate is the forget gate, introduced by Gers et al. (2000). It controls which elements of the cell state vector c_{t-1} will be forgotten (and to which degree):

f_t = σ(W_f h_{t-1} + U_f x_t + b_f),    (2)

where f_t is the resulting vector with values in the range (0, 1), σ(·) represents the logistic sigmoid function, and W_f, U_f and b_f define the set of learnable parameters for the forget gate, i.e. two adjustable weight matrices and a bias vector. As for the traditional RNN, the hidden state h is initialized in the first time step as a vector of zeros, which has a user-defined length.
In the next step, a potential update vector for the cell state is computed from the current input x_t and the last hidden state h_{t-1}:

c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c),    (3)

where c̃_t is a vector with values in the range (-1, 1), tanh(·) is the hyperbolic tangent and W_c, U_c and b_c are another set of learnable parameters.
Additionally, the second gate, the input gate, is computed, defining which information of c̃_t is used (and to what degree) to update the cell state in the current time step:

i_t = σ(W_i h_{t-1} + U_i x_t + b_i),    (4)

where i_t is a vector with values in the range (0, 1), and W_i, U_i and b_i are the set of learnable parameters of the input gate.
With the results of Eqs. (2)-(4), the cell state c_t is updated by the following equation:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,    (5)

where ⊙ denotes element-wise multiplication. Because the vectors f_t and i_t both have entries in the range (0, 1), Eq. (5) can be interpreted as defining which information stored in c_{t-1} will be forgotten (values of f_t of approx. 0) and which will be kept (values of f_t of approx. 1). Similarly, i_t decides which new information stored in c̃_t will be added to the cell state (values of i_t of approx. 1) and which will be ignored (values of i_t of approx. 0). Like the hidden state vector, the cell state is initialized as a vector of zeros in the first time step. Its length corresponds to the length of the hidden state vector.

The third and last gate is the output gate, which controls the information of the cell state c_t that flows into the new hidden state h_t. The output gate is calculated by the following equation:

o_t = σ(W_o h_{t-1} + U_o x_t + b_o),    (6)

where o_t is a vector with values in the range (0, 1), and W_o, U_o and b_o are the set of learnable parameters of the output gate. From this vector, the new hidden state h_t is calculated by combining the results of Eq. (5) and Eq. (6):

h_t = o_t ⊙ tanh(c_t).    (7)

It is in particular the cell state c_t that allows for an effective learning of long-term dependencies. Due to its very simple linear interactions with the remaining LSTM cell, it can store information unchanged over a long period of time steps. During training, this characteristic helps to prevent the problem of exploding or vanishing gradients in the backpropagation step (Hochreiter and Schmidhuber, 1997). As with other neural networks, where one layer can consist of multiple units (or neurons), the length of the cell and hidden state vectors in the LSTM can be chosen freely. Additionally, we can stack multiple layers on top of each other. For this study, we used a 2-layer LSTM network, with each layer having a cell/hidden state length of 20. The output of the last LSTM layer at the last time step is connected through a traditional dense layer to a single output neuron, which computes the final discharge prediction (see Fig. 1 for a schematic image of the network). Between the layers, we added dropout, a technique to prevent the model from overfitting (Srivastava et al., 2014). Dropout sets a certain percentage (10 % in our case) of random neurons to zero during training in order to force the network into a more robust feature learning.
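The gate computations of Eqs. (2)-(7) can be condensed into one NumPy time step. The weight values are random toy parameters and the state length of 20 mirrors the study's setting; this is a minimal sketch of a single LSTM cell, not the trained 2-layer network.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, sigma(z) in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (2)-(7)."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])        # forget gate, Eq. (2)
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])  # candidate update, Eq. (3)
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])        # input gate, Eq. (4)
    c = f * c_prev + i * c_tilde                                   # cell state update, Eq. (5)
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])        # output gate, Eq. (6)
    h = o * np.tanh(c)                                             # new hidden state, Eq. (7)
    return h, c

n_in, n_hid = 5, 20  # 5 forcings, cell/hidden state length of 20
rng = np.random.default_rng(1)
p = {}
for g in "fcio":  # one parameter set per gate/candidate
    p[f"W{g}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    p[f"U{g}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
    p[f"b{g}"] = np.zeros(n_hid)

h = np.zeros(n_hid)  # hidden state, initialized as zeros
c = np.zeros(n_hid)  # cell state, initialized as zeros
for x_t in rng.normal(size=(365, n_in)):  # one 365-day input sequence
    h, c = lstm_step(x_t, h, c, p)
```

Note how Eq. (5) touches the cell state only through element-wise multiplication and addition; this linear path is what lets information survive many time steps.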
Another hyperparameter is the length of the input sequence, which corresponds to the number of days of meteorological input data provided to the network for the prediction of the next discharge value. We kept this value constant at 365 days for this study in order to capture at least the dynamics of a full annual cycle. The specific design of the network architecture, i.e. the number of layers, the cell/hidden state lengths, the dropout rate and the input sequence length, was varied in a number of preceding tests, and the chosen configuration was found to work well.

A hydrological interpretation of the LSTM
Similar to continuous hydrological models, the LSTM processes the input data time step after time step. In every time step, the input data (here meteorological forcing data) are used to update a number of values in the LSTM internal cell states. In comparison to traditional hydrological models, the cell states can be interpreted as storages that are often used for e.g. snow accumulation, soil water content, groundwater storage, etc. Updating the internal cell states (or storages) is regulated through a number of so-called gates: one that regulates the depletion of the storages, a second that regulates the increase of the storages and a third that regulates the outflow of the storages. Each of these gates comes with a set of adjustable parameters that are adapted during a calibration period (referred to as training). During the validation period, updates of the cell states depend only on the input at a specific time step and the states of the last time step (given the learned parameters of the calibration period).
In contrast to hydrological models, the LSTM does not "know" the principle of water/mass conservation or the governing process equations describing e.g. infiltration or evapotranspiration processes a priori. Compared to traditional hydrological models, the LSTM is optimized to predict the streamflow as well as possible and has to learn these physical principles and laws during the calibration process purely from the data.

The calibration procedure
In traditional hydrological models, calibration involves a defined number of iteration steps of simulating the entire calibration period with a given set of model parameters and evaluating the model performance with some objective criterion. The model parameters are, regardless of the applied optimization technique (global and/or local), perturbed in such a way that the maximum (or minimum) of the objective criterion is found. In the training of an LSTM, the adaptable (or learnable) parameters of the network, the weights and biases, are likewise updated in each iteration step depending on a given loss function. In this study we used the mean squared error (MSE) as the objective criterion.
In contrast to most hydrological models, the neural network exhibits the property of differentiability of the network equations. Therefore, the gradient of the loss function with respect to any network parameter can always be calculated explicitly.
This property is used in the so-called backpropagation step, in which the network parameters are adapted to minimize the overall loss. For a detailed description see e.g. Goodfellow et al. (2016).
One iteration step of the LSTM training/calibration is visualized schematically in Fig. 3. One iteration step during the training of LSTMs usually works with a subset (called batch or mini-batch) of the available training data. Figure 4 shows the evolution of the simulated discharge of the LSTM over a number of training epochs. We can see that the network has to learn the entire rainfall-runoff relation from scratch (grey line of random weights) and is able to better represent the discharge dynamics with each epoch.
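The sampling of one mini-batch can be sketched as follows. The batch size, array contents and the convention that each sample pairs a 365-day window of forcings with the discharge of the following day are illustrative assumptions for this sketch, not the study's exact implementation.

```python
import numpy as np

# Illustrative training data: 2000 days of 5 forcings and observed discharge.
rng = np.random.default_rng(2)
n_days, n_inputs, seq_len, batch_size = 2000, 5, 365, 64

forcings = rng.normal(size=(n_days, n_inputs))  # daily meteorological inputs
discharge = rng.normal(size=n_days)             # daily observed discharge

# Draw random window start days, then build (sequence, target) pairs:
starts = rng.integers(0, n_days - seq_len, size=batch_size)
x_batch = np.stack([forcings[s:s + seq_len] for s in starts])  # (batch, 365, 5)
y_batch = discharge[starts + seq_len]                          # next day's discharge
```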
For efficient learning, all input features (the meteorological variables) as well as the output (the discharge) are normalized by subtracting the mean and dividing by the standard deviation (LeCun et al., 2012; Minns and Hall, 1996). For the final prediction, the output of the network is retransformed using the normalization parameters from the calibration period (Fig. 4 shows the retransformed model outputs).
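This normalization and its inverse can be written in two lines; the discharge values below are illustrative. The important detail is that the retransformation reuses the mean and standard deviation of the calibration period.

```python
import numpy as np

# Illustrative calibration-period discharge values.
q_calib = np.array([2.0, 3.5, 1.0, 4.5, 2.5])
mu, sigma = q_calib.mean(), q_calib.std()  # normalization parameters (calibration period)

q_norm = (q_calib - mu) / sigma  # normalized: zero mean, unit variance
q_back = q_norm * sigma + mu     # retransform with the same mu and sigma
```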

The CAMELS data set
The underlying data for our study is the CAMELS data set. The meteorological forcing data is derived from different gridded data products (among them Daymet, Maurer and NLDAS (Xia et al., 2012)) and consists of day length, precipitation, shortwave downward radiation, maximum and minimum temperature, snow-water equivalent and humidity. As a basis for calculating the catchment averages we used the Daymet data, since it has the highest spatial resolution (1 km grid, compared to a 12 km grid for Maurer and NLDAS), and all available meteorological input variables with the exception of the snow-water equivalent and the day length.
The 671 catchments in the dataset are grouped into 18 hydrological units (HUCs) following the U.S. Geological Survey's HUC map (Seaber et al., 1987). These groups correspond to geographic areas that represent the drainage area of either a major river or the combined drainage area of a series of rivers. In our study, we used 241 catchments from the HUCs 01, 03, 11, 17 (see Fig. 5) in order to cover a wide range of different hydrological conditions on the one hand and to limit the computational costs on the other hand. The selected catchments contain snow-driven catchments as well as catchments without any influence of snow. In addition, the four units cover a wide range of climates, containing rather dry catchments with less than 400 mm/year of mean precipitation, as well as catchments with mean precipitation up to 3260 mm/year.
Additionally, the CAMELS data set contains time series of simulated discharge from the calibrated Snow-17 models coupled with the Sacramento Soil Moisture Accounting Model (see Newman et al. (2015) for further details). The models were calibrated with the first 15 hydrological years for which streamflow data is available (in most cases 1 October 1980 until 30 September 1995). We use the exact same period for the training of the LSTM, while the remaining data is used for validation.

Experiment 2: One regional model for each hydrological unit

Our second experiment is motivated by two different ideas. Firstly, DL models really excel when many training data are available (Hestness et al., 2017; Schmidhuber, 2015). A huge training data set allows the network to learn more general and abstract patterns of the input-to-output relationship. In our case, the network has to learn the entire "hydrological model" purely from the available data (see Fig. 4). Therefore, having more than just the data of a single catchment available can help the network to develop a more general understanding of the rainfall-runoff processes. An illustrative example is two similarly behaving catchments, of which one lacks high precipitation events or extended drought periods in the calibration period while having these events in the validation period. Given that the second catchment experienced these conditions in the calibration set, the LSTM could learn the response behavior to those extremes and use this knowledge in the first catchment. Classical hydrological models have the process understanding implemented in the model structure itself and therefore, at least in theory, it is not strictly necessary to have these kinds of events in the calibration period.
The second motivation is the prediction of runoff in ungauged basins, one of the main challenges in the field of hydrology (Blöschl, 2013;Sivapalan, 2003). A regional model that performs reasonably well across all catchments within a region could potentially be a step towards the prediction of runoff for such basins.
Therefore, the aim of the second experiment is to analyze how well the network architecture can generalize (or regionalize) to all catchments within a certain region. For the definition of the regions, we use the HUCs that group the catchments in the CAMELS data set (four in this case). The training data for these regional models is the combined data of the calibration periods of all catchments within the same HUC.
To determine the number of training epochs, we performed the same preliminary experiment as described for Experiment 1.
Across all catchments, the highest mean NSE was achieved after 20 epochs in this case. Thus, for the final training, we train one LSTM for each of the four used HUCs for 20 epochs with the full 15-year calibration period of all catchments within the specific HUC.

Experiment 3: Fine-tuning the regional model for each catchment
In the third experiment, we want to test if the more general knowledge of the regional model (Experiment 2) can help to increase the performance of the LSTM in a single catchment. In the field of DL this is a common approach called fine-tuning (Razavian et al., 2014; Yosinski et al., 2014), where a model is first trained on a huge dataset to learn general patterns and relationships between (meteorological) input data and (streamflow) output data (this is referred to as pre-training). Then, the pre-trained network is further trained for a small number of epochs with the data of a specific catchment alone to adapt the more generally learned processes to this catchment. Loosely speaking, the LSTM first learns the general behavior of the runoff generating processes from a large dataset and is in a second step adapted in order to account for the specific behavior of a given catchment (e.g. the scaling of the runoff response in a specific catchment). In this study, the regional models of Experiment 2 serve as pre-trained models. Therefore, depending on the affiliation of a catchment to a certain HUC, the specific regional model for this HUC is taken as the starting point for the fine-tuning. With the initial LSTM weights from the regional model, the training is continued for a few epochs (ranging from 0 to 20, median 10) with the training data of the specific catchment only. Thus, similar to Experiment 1, we finally have 241 different models, one for each of the 241 catchments. Different from the two previous experiments, we do not use a global number of epochs for fine-tuning. Instead, we used the 14-year/1-year split to determine the optimal number of epochs for each catchment individually.
The reason is that the regional model fits the individual catchments within a HUC to varying degrees. Therefore, the number of epochs the LSTM needs to adapt to a certain catchment before it starts to overfit differs between catchments.

Evaluation metrics
The metrics for model evaluation are the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) and its three decomposed components following Gupta et al. (2009). These are the correlation coefficient of the observed and simulated discharge (r), the variance bias (α) and the total volume bias (β). While all of these measures evaluate the performance over the entire time series, we also use three different signatures of the flow duration curve (FDC) that evaluate the performance in specific ranges of discharge.
Following Yilmaz et al. (2008), we calculate the bias of the top 2 % flows, i.e. the peak flows (FHV), the bias of the slope of the middle section of the FDC (FMS) and the bias of the bottom 30 % low flows (FLV).
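The NSE and its decomposition can be computed directly from the definitions in Nash and Sutcliffe (1970) and Gupta et al. (2009); here α is the ratio of the standard deviations and β the ratio of the means of simulated and observed discharge. The discharge arrays below are illustrative toy values.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - sum of squared errors over observed variance."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def gupta_decomposition(sim, obs):
    """Correlation r, variability ratio alpha, volume ratio beta (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return r, alpha, beta

# Illustrative observed and simulated discharge series.
obs = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
sim = np.array([1.1, 1.9, 2.8, 2.2, 1.0])
```

A perfect simulation yields NSE = 1, r = 1, α = 1 and β = 1; the FDC signatures (FHV, FMS, FLV) would be computed analogously on sorted flow values.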

Because our modelling approach needs 365 days of meteorological data as input for predicting one time step of discharge, we cannot simulate the first year of the calibration period. To be able to compare our models to the SAC-SMA + Snow-17 benchmark model, we recomputed all metrics for the benchmark model for the same simulation periods.

Open source software
Our research heavily relies on open source software. The programming language of choice is Python 3.6 (van Rossum, 1995).

Results and discussion
We now present the results of our experiments and discuss the following points: first, we present how well our LSTM network can model runoff processes of single catchments. To this end, we analyze the results of Experiment 1, for which we trained one network separately for each basin, and compare the results to the SAC-SMA + Snow-17 benchmark model. Then we investigate the potential of LSTMs to learn hydrological behavior at the regional scale. In this context, we compare the performance of the regional models from Experiment 2 against the models of Experiment 1 and discuss their strengths and weaknesses. Lastly, we examine whether our fine-tuning approach enhances the predictive power of our models in the individual catchments. In all cases, the analysis is based on the data of the 241 catchments of the calibration (the first 15 years) and validation (all remaining years available) periods.

Using LSTMs as a hydrological model

Figure 6a shows the spatial distribution of the LSTM performances for Experiment 1 in the validation period. In over 50 % of the catchments, an NSE of 0.65 or above is found, with a mean NSE of 0.63 over all catchments. We can see that the LSTM performs better in catchments with snow influence (HUCs 01 and 17) and catchments with higher mean annual precipitation (again HUCs 01 and 17, but also basins in the western part of HUC 03; see Fig. 5 for the precipitation distribution). The performance deteriorates in the more arid catchments in the center of the CONUS, where no discharge is observed for longer periods of the year. A constant value of discharge (zero in this case) for a high percentage of the training samples seems to be difficult for the LSTM to learn and reproduce. However, if we compare the results for these basins to the benchmark model (Fig. 6b), we see that for most of these dry catchments the LSTM outperforms the latter, meaning that the benchmark model did not yield satisfactory results for these catchments either. In general, the visualization of the differences in the NSE shows that the LSTM performs slightly better in the northern, more snow-influenced catchments, while the SAC-SMA + Snow-17 performs better in the catchments in the south-east. This is a somewhat surprising result, since we were expecting that the correct reproduction of snow accumulation and snowmelt processes might be challenging for the LSTM approach. However, from our results it seems that the model can easily learn these long-term dependencies, i.e. the time lag between precipitation falling as snow during the winter period and runoff generation in spring with warmer temperatures.

The median value of the NSE differences is -0.03, which means that the benchmark model slightly outperforms the LSTM. Based on the mean NSE (0.58 for the benchmark model, compared to 0.63 for the LSTM in this experiment), the LSTM outperforms the benchmark results.
In Fig. 7, we present the cumulative density functions (CDF) for various metrics for the calibration and validation period.
We see that the LSTM and the benchmark model work comparably well for all but the FLV (bias of the bottom 30 % low flows) metric. The underestimation of the peak flows in both models could be expected when using the MSE as the objective function for calibration (Gupta et al., 2009). However, the LSTM underestimates the peaks slightly more strongly than the benchmark model (Fig. 7d). In contrast, the middle section of the FDC is better represented by the LSTM (Fig. 7e). Regarding the performance in terms of the NSE, the LSTM shows fewer negative outliers and thus seems to be more robust. The poorest agreement between the two models is found for the low flows: Figure 7f shows large differences between the LSTM and the SAC-SMA + Snow-17 model regarding the FLV metric. The FLV is highly sensitive to the single minimum flow in the time series, since it compares the area between the FDC and this minimum value in the log-space of the observed and simulated discharge. The discharge from the LSTM model, which has no exponential outflow function like traditional hydrological models, can easily drop to very small numbers or even zero, to which we limited our model output. A rather simple solution for this issue is to introduce just one additional parameter and to limit the simulated discharge not to zero, but to the minimum observed flow from the calibration period. Figure 8 shows the effect of this approach on the CDF of the FLV. We can see that this simple solution leads to better FLV values compared to the benchmark model.
Other metrics, such as the NSE, are almost unaffected by this change, since these low flow values only marginally influence the resulting NSE values (not shown here).
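The low-flow fix described above is a one-line clipping operation; the discharge values and the minimum observed flow below are illustrative toy numbers.

```python
import numpy as np

# Raw LSTM output (illustrative), which may drop to zero during dry periods.
q_sim = np.array([0.00, 0.01, 0.30, 1.20, 0.00])

# The one additional parameter: minimum observed flow of the calibration period.
q_min_calib = 0.05

# Clip the simulation at q_min_calib instead of at zero; higher flows are untouched.
q_sim_fixed = np.maximum(q_sim, q_min_calib)
```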

From the CDF of the NSE in Fig. 7a, we can also observe a trend towards higher values in the calibration period compared to the validation period for both modelling approaches. This is a sign of overfitting, which, in the case of the LSTM, could be tackled by a smaller network size, stronger regularization or more data. However, we want to highlight again that achieving the best model performance possible was not the aim of this study, but rather testing the general ability of the LSTM to reproduce runoff processes.

Figure 9. Difference of the regional model compared to the models from Experiment 1 for each basin regarding the NSE of the validation period. Blue colors (> 0) mean the regional model performed better than the models from Experiment 1, red colors (< 0) the other way around.

LSTMs as regional hydrological model
We now analyze the results of the four regional models that we trained for the four investigated HUCs in Experiment 2.

Figure 10. Cumulative density functions for several metrics of the calibration and validation period of the models from Experiment 1 compared to the regional models from Experiment 2.

Figure 9 shows the difference in the NSE between the model outputs of Experiments 1 and 2. For some basins, the regional model performs significantly worse (dark red) than the individually trained models from Experiment 1. However, from the histograms of the differences we can see that the median is almost zero, meaning that in 50 % of the basins the regional model performs better than the model specifically trained for a single basin. Especially in HUC 01, the regional model performed better for almost all basins (except for two in the far northeast). In general, over all HUCs and catchments, the median difference is -0.001.
From Fig. 10 it is evident that the increased data size of the regional modelling approach (Experiment 2) helps to attenuate the drop in model performance between the calibration and validation period, which could be observed in Experiment 1, probably as a result of overfitting. From the CDF of the NSE (Fig. 10a) we can see that Experiment 2 performed worse for approximately 20 % of the basins, while being comparable or even slightly better for the remaining watersheds. We can also observe that the regional models show a more balanced under- and overestimation, while the models from Experiment 1 as well as the benchmark model tend to underestimate the discharge (see Fig. 10d-f, e.g. the flow variance, the top 2 % flow bias or the bias of the middle flows). This is not too surprising, since we train one model on a range of different basins with different discharge characteristics, where the model minimizes the error between simulated and observed discharge for all basins at the same time.
On average, the regional model will therefore over- and underestimate the observed discharge to a similar degree.
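The bias metrics referred to above can be sketched as follows. These are generic percent-bias definitions in the spirit of the metrics in Fig. 10, not the exact formulas of the benchmark data set; the function names and the `quantile` parameter are our own choices.

```python
import numpy as np

def percent_bias(obs, sim):
    """Overall percent bias: positive values mean overestimation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (sim - obs).sum() / obs.sum()

def top_flow_bias(obs, sim, quantile=0.98):
    """Percent bias restricted to observed flows at or above the given
    quantile, in the spirit of a top-2 % (peak) flow bias."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    mask = obs >= np.quantile(obs, quantile)
    return 100.0 * (sim[mask] - obs[mask]).sum() / obs[mask].sum()
```

A systematic underestimation, as described for the Experiment 1 and benchmark models, would show up as consistently negative values of these metrics across basins.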
The comparison of the performances of Experiments 1 and 2 shows no clear consistent pattern for the investigated HUCs, but reveals a tendency towards higher NSE values in HUC 01 and lower NSE values in HUC 11. The reason for these differences might become clearer once we look at the correlation of the observed discharge time series of the basins within both HUCs (see Fig. 11).
We can see that in HUC 01 (where the regional model performed better for most of the catchments compared to the individual models of Experiment 1) many basins have a strong correlation in their discharge time series. Conversely, for HUC 11 the correlation of the discharge across the basins is in general low, and there exists only one larger group of correlated catchments (basins 0 to 7). The results suggest that a single, regionally calibrated LSTM can generally be better at predicting the discharge of a group of basins than many LSTMs trained separately for each basin within the group, especially when the group's basins exhibit a strong correlation in their discharge behavior.
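The kind of cross-basin correlation analysis underlying Fig. 11 can be sketched with a pairwise Pearson correlation matrix. The synthetic data below are only for illustration: three basins sharing a common hydrological driver (as in HUC 01) versus one unrelated basin (as in the weakly correlated HUC 11 case).

```python
import numpy as np

rng = np.random.default_rng(0)
# three basins driven by a shared signal plus small basin-specific noise,
# and one basin with independent discharge (all shapes: one year of daily values)
shared = rng.normal(size=365)
q = np.stack(
    [shared + 0.1 * rng.normal(size=365) for _ in range(3)]
    + [rng.normal(size=365)]
)  # shape (n_basins, n_timesteps)

# pairwise Pearson correlation between the basins' discharge time series
corr = np.corrcoef(q)
```

Basins 0-2 end up strongly mutually correlated, while basin 3 correlates with none of them; a heat map of `corr` is essentially what Fig. 11 shows for the real HUCs.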

The effect of fine-tuning
In this section, we analyze the effect of fine-tuning the regional model for a small number of epochs on a specific catchment. Figure 12 shows two effects of the fine-tuning process. In the comparison with the model performance of Experiment 1, and from the histogram of the differences (Fig. 12a), we see that in general the pre-training and fine-tuning improves the NSE of the runoff prediction. Comparing the results of Experiment 3 to the regional models of Experiment 2 (Fig. 12b), we can see the biggest improvement in those basins in which the regional models performed poorly (see also Fig. 9). It is worth highlighting that, even though the models in Experiment 3 have seen the data of their specific basins for fewer epochs in total than in Experiment 1, they still perform better on average. Therefore, it seems that pre-training on a bigger dataset before fine-tuning for a specific catchment helps the model to learn general rainfall-runoff processes and that this knowledge is transferable to single basins. It is also worth noting that the group of catchments we used as one region (the HUC) can be quite inhomogeneous regarding their hydrological catchment properties. Figure 13 finally shows that the models of Experiment 3 and the benchmark model perform comparably well over all catchments. The median of the NSE for the validation period is almost the same (0.72 and 0.71 for Experiment 3 and the benchmark model), while the mean for the models of Experiment 3 is about 17 % higher (0.68 compared to 0.58). In addition, more basins have an NSE above 0.8 (27.4 % of all basins compared to 17.4 % for the benchmark model), which is often taken as a threshold value for reasonably well performing models (Newman et al., 2015).
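The pre-training/fine-tuning idea can be illustrated independently of the LSTM architecture. The sketch below uses a linear model trained by gradient descent as a deliberately simple stand-in for the network: it is pre-trained on pooled "regional" data and then fine-tuned for a few epochs on a short record from a single "basin", versus training from scratch on that short record alone. All data and names here are synthetic assumptions, not the paper's setup.

```python
import numpy as np

def train(w, X, y, lr=0.1, epochs=200):
    """Full-batch gradient descent on mean squared error (LSTM stand-in)."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])

# large pooled "regional" record (Experiment 2 analogue)
X_region = rng.normal(size=(500, 3))
y_region = X_region @ w_true + 0.1 * rng.normal(size=500)

# short record of the target basin, with slightly different behaviour
X_basin = rng.normal(size=(30, 3))
y_basin = X_basin @ (w_true + 0.2) + 0.1 * rng.normal(size=30)

w_pre = train(np.zeros(3), X_region, y_region)                # pre-training
w_ft = train(w_pre, X_basin, y_basin, epochs=3)               # fine-tuning
w_scratch = train(np.zeros(3), X_basin, y_basin, epochs=3)    # from scratch
```

With the same small training budget on the basin data, the fine-tuned weights start close to a good solution and end up fitting the basin better than the model trained from scratch, which mirrors the Experiment 3 vs. Experiment 1 comparison.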

Summary and conclusion
In this study, we investigated, in three different experiments, the potential of Long Short-Term Memory networks (LSTMs) for rainfall-runoff modelling, utilizing daily discharge and basin-averaged meteorological data of 241 basins from the CAMELS data set.
In the first experiment, we trained an LSTM for each catchment and achieved results comparable to the well-established SAC-SMA + Snow-17 model. In snow-driven catchments, the LSTM showed good model performance, suggesting that it has no problem learning long-term dependencies between meteorological inputs and discharge. Similar to the SAC-SMA + Snow-17 results, the LSTM model performance deteriorated in arid catchments, but at the same time it outperformed the benchmark model there. Even without any optimization of the network structure and other hyperparameters, the presented LSTM competes well with the performance of the well-established conceptual model.

In the second experiment, we investigated the potential use of LSTMs as regional models. We trained a single LSTM to predict the discharge of all catchments in a region (a HUC) and showed that these regional models can even outperform the models from the first experiment in HUCs with similar hydrological characteristics. However, if the catchments in a HUC behave very differently, the regional model performance deteriorates. Therefore, the grouping or clustering of the catchments for the regional model plays an important role, and different grouping strategies should be investigated in a future study. As the regional models of the second experiment showed encouraging results, we suggest that LSTMs as regional models could be a promising new approach for predictions in ungauged catchments. Therefore, future studies should explore the potential of these regional models by, e.g., training the LSTM only on the data of a sub-group of all available catchments and evaluating the performance on the data of previously unseen catchments (i.e. cross-validation).
In the last experiment, we showed that fine-tuning pre-trained regional models for specific catchments generally increases the model performance compared to the models from our first experiment and to the SAC-SMA + Snow-17 model. These results suggest that the LSTM is able to transfer knowledge learned at the regional scale to individual catchments. Overall, 83.8 % (27.4 %) of all catchments achieved an NSE above 0.55 (0.8) in the validation period. In comparison, the SAC-SMA + Snow-17 model yielded an NSE above 0.55 (0.8) for 81.7 % (17.4 %) of all catchments. Pre-training the model before fine-tuning has the benefit of increasing the amount of training data available for learning the relationship between meteorological inputs and discharge. Thus, it might either compensate for the limited data availability within the catchment of interest or improve predictions for cases not covered within the catchment's calibration period.
For all experiments, the different hyperparameters of the LSTM, such as the number of layers, the size of the cell memory, the dropout rate, or the number of days of meteorological data provided as input, were not specifically optimized but kept constant after an initial screening. For optimal performance, these settings should be systematically investigated using an independent validation set. A hyperparameter search of this scale was beyond the scope of this study and will be an important task for future studies.
Neural networks are often criticized for their "black-box-ness", and not only in the hydrological community. This criticism is justifiable: at least in science, the question of how and why a specific model or method works well (or does not) is important.
Looking behind the scenes is what makes our work, and science in general, attractive. In this context, we want to conclude with a visualization of a very preliminary analysis of a cell state of the applied LSTM. Figure 14 shows the evolution of the value of a single cell state in the LSTM over the period of one year for an arbitrary catchment used in this study, exhibiting snow accumulation and melt in spring. Very surprising and interesting temporal dynamics are evident. The value of the cell state increases as soon as the temperature drops below zero, and a fast depletion is evident as soon as the temperature rises above the freezing point. These seasonal dynamics are exactly what we would expect for snow accumulation and melt at the catchment scale. Thus, the LSTM generated observable snow dynamics within a cell state without being trained to do so, suggesting that there is more to find behind the scenes.
In this way, further analyses of LSTM cell states and comparisons with climate variables as well as (dynamic) catchment characteristics (such as, e.g., groundwater level, soil moisture, or leaf area index) might reveal more process controls and their dependencies. The application of LSTMs and their further development therefore have a high potential to extend data-based mechanistic modelling approaches in the field of hydrology, e.g. rainfall-runoff modelling.
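The cell-state analysis described above requires recording the internal states of the LSTM over time. A minimal sketch of this readout is shown below with a standard LSTM cell implemented from scratch in numpy; the weights here are random rather than trained, so the recorded trajectories illustrate only *how* to extract the states, not the snow-like dynamics of Fig. 14, which emerge only after training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell; gate order in z is i, f, g, o."""
    z = W @ x + U @ h + b
    n = len(c)
    i = sigmoid(z[:n])            # input gate
    f = sigmoid(z[n:2 * n])       # forget gate
    g = np.tanh(z[2 * n:3 * n])   # cell candidate
    o = sigmoid(z[3 * n:])        # output gate
    c = f * c + i * g             # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 8             # e.g. precip/temp/radiation inputs, 8 units
W = 0.1 * rng.normal(size=(4 * n_hidden, n_in))
U = 0.1 * rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
states = []
for t in range(365):                      # one year of daily forcings
    x = rng.normal(size=n_in)             # stand-in for met. inputs of day t
    h, c = lstm_step(x, h, c, W, U, b)
    states.append(c.copy())               # record the cell state each day
states = np.asarray(states)               # shape (365, n_hidden)
```

Plotting one column of `states` against time (and against air temperature) is the kind of inspection that revealed the snow-accumulation-like dynamics in Fig. 14.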