Benchmarking Data-Driven Rainfall-Runoff Models in Great Britain: A comparison of LSTM-based models with four lumped conceptual models

Long short-term memory models (LSTMs) are recurrent neural networks from the emerging field of Deep Learning (DL), which have shown recent promise when predicting time series, especially when data are abundant. Rainfall-runoff modelling presents a challenge, yet accurate hydrological models are vital for flood forecasting, hazard impact assessment, and assessing the potential effects of climate change on floods and water resources. In this study, we compare the performance of two DL-based models, an LSTM and an Entity Aware LSTM (EA LSTM). The DL models were trained using a newly published data set, CAMELS-GB, for a sample of 518 catchments across Great Britain. To identify spatial and seasonal patterns in model performance, we compare the DL models against benchmark outputs from four lumped conceptual models recently configured for rainfall-runoff modelling in Great Britain. Our findings show that the LSTM models simulate discharge with consistently high model performance scores, including in catchments typically considered difficult to model. The LSTM achieves a mean catchment NSE of 0.88 (0.86 for the EA LSTM), which represents a performance improvement of 10%–16% compared with the benchmark conceptual models. Seasonal and spatial patterns indicate that the largest performance improvement relative to the benchmark is in the drier summer months and in drier catchments in the South East of England. By comparing LSTMs with conceptual models, we diagnose possible reasons for their different performance. We suggest that LSTMs offer useful predictive capability for rainfall-runoff modelling in Great Britain and elsewhere and note their value to support process understanding in locations where processes are less well understood.

Physical and conceptual models can struggle with certain catchments and hydrological conditions which do not conform to the assumptions of the analyst's underlying perceptual model. For example, modelling difficulties may arise in catchments where the effects of groundwater or non-diffuse macropore flow dominate (Wheater et al., 2007), where observations poorly constrain the water balance (Beven, 2020), or in river basins where the topographic water catchment is not the appropriate surface across which water is conserved and there are inter-catchment transfers of water through groundwater processes (Liu et al., 2020). Finding generally-applicable model structures has long posed a challenge for hydrological sciences (Linsley, 1982), and models require considerable effort to build, calibrate and maintain. It has been suggested that techniques from machine learning might offer promising predictive capability (Reichstein et al., 2019), particularly in situations where rainfall and river flow data are plentiful, yet where the appropriate perceptual models of surface and subsurface hydrological processes are poorly understood.
Artificial Neural Networks (ANNs) have shown skill when modelling complex and highly nonlinear systems. Due to the stacking of multiple connected layers, these models are often referred to as "Deep Learning" (DL) models. DL methods have been used in hydrology and meteorology since the early 1990s (Daniell, 1991; Halff et al., 1993; Dawson and Wilby, 1998; Wilby et al., 2003; Peel and McMahon, 2020). For many environmental applications, including rainfall-runoff modelling, temporal structure in the data is important. Yet simple, feed-forward neural networks cannot capture information about the sequential nature of time series. By contrast, Recurrent Neural Networks (RNNs) aim to account for temporal dependence using a series of recurrent layers which incorporate new information at each time-step, and pass processed information as input to the next layer in the model. In this way, information is retained in the model over time, a feature which is important when simulating time series with persistence (e.g., in meteorology and hydrology) (Hochreiter et al., 2001). Hochreiter (1991) overcame problems with traditional RNNs by proposing a novel architecture to account for long-term dependencies. Long Short-Term Memory networks (LSTMs) have an explicit memory state which is updated through a series of gates to model these long-term dependencies. LSTMs have been used successfully for speech recognition (Graves et al., 2013) and natural language processing (Wang and Jiang, 2015). More recently, they have been applied in hydrology. What follows is a partial review of recent studies using DL in hydrology. For a more complete picture of the uses of DL techniques in hydrology, the interested reader is referred to Shen (2018), Beven (2020) and Nearing et al. (2020).
Kratzert et al. (2018) showed that an LSTM trained on 241 catchments in the US achieved similar performance metrics to the Sacramento Soil Moisture Accounting model coupled with a snow model across the USA (Burnash et al., 1973). When the LSTM was calibrated on all basins, it outperformed various benchmark models (Kratzert et al., 2019). The authors set out to address how to make predictions in ungauged basins, transferring knowledge from one basin to another. In doing so they introduced a new LSTM architecture, the Entity Aware LSTM (EA LSTM). The "entities" refer to the spatial units (catchments) defined and measured by catchment attributes that are time invariant in the input data, such as topography, mean climate conditions and land-cover characteristics. The EA LSTM learned a high-dimensional vector that represents how the catchment attributes condition the relationship between the dynamic forcings (rainfall, temperature, potential evapotranspiration etc.) and the outputs (specific discharge). Therefore, the EA LSTM explicitly conditions the discharge response to meteorological forcing on time-invariant properties of river catchments, such as soil and topographic attributes. While the EA LSTM offered new potential for interpreting what the model had learned, model performance suffered when compared with that of the LSTM (Kratzert et al., 2019).
The performance of the LSTM for reproducing US streamflow has been further demonstrated by Duan et al. (2020), Feng et al. (2020) and Gauch et al. (2021). Other studies have considered the uses of the LSTM for producing forecasts of soil moisture, also focused on the US. More recent work has begun not only to explore the accuracy of forecasts, but also to use LSTMs to: (i) provide estimates of uncertainty (Klotz et al., 2020); (ii) integrate prior physical knowledge into DL model architectures (Hoedt et al., 2021; Jiang et al., 2020); and (iii) produce predictions at multiple timescales from a single model. Gauch et al. (2020) recently demonstrated that the LSTM can accurately predict discharge at multiple timescales, such as daily and hourly. In contrast with the US National Water Model (Viterbo et al., 2020), the Multi-TimeScale LSTM (MTS LSTM) showed a considerably smaller performance decrease when predicting hourly instead of daily streamflow. Their approach offers the potential to use LSTMs operationally, ingesting data at different temporal frequencies to produce predictions at a desired resolution.
Taken together, these studies demonstrate that LSTM models have credible, often substantially improved, simulation accuracy when compared with traditional conceptual hydrological models. However, they have been predominantly tested in the US hydrological context. In this study we explore the potential for LSTMs to simulate discharge in Great Britain (GB), a temperate climate, using a newly published data set (CAMELS-GB) to train the LSTM models and benchmark their performance against four commonly-used conceptual models. We assess the predictive ability of the LSTM-based models and use them to understand the relationship between model performances and catchment characteristics. We test the LSTM performance under conditions where observations of precipitation and river flow do not close the water balance for the topographic catchment, noting that under these conditions traditional hydrological models often struggle.
Our study poses the following three research questions: (i) How well do LSTM-based models simulate discharge in a temperate climate like Great Britain? (ii) Can the LSTM overcome limitations of previously calibrated models, producing accurate simulations in regions where hydrological models typically struggle to reproduce hydrological outputs? (iii) Are there hydrological processes that data-driven models simulate better than traditional hydrological models, which can be diagnosed by benchmarking model performance? In turn, these research questions structure the remaining sections of the paper. First we describe the LSTMs' training and evaluation methods in detail. Next we present results at the national scale and then grouped by spatial and seasonal patterns of performance. In Section 4 we consider inter-model performance diagnostics and return to discuss the specific hydrological conditions in which LSTMs outperform traditional models. Finally we conclude with an indication of further promising research topics motivated by this study.

Study Region
We focus our benchmarking study on catchments in Great Britain (GB), a temperate and humid region on the eastern edge of the North Atlantic. An overview of the hydro-meteorological conditions can be found in Figure 1.
Figure 1 (Coxon et al., 2020b): (a) mean catchment precipitation (mm day⁻¹); (b) mean catchment potential evapotranspiration (mm day⁻¹); (c) mean catchment discharge (mm day⁻¹); (d) baseflow index, the ratio of the mean daily base flow to daily discharge.
Precipitation is higher in the west and north of GB and lowest in the east and south (Figure 1a), as a result of higher elevation and prevailing winds from the west that bring rainfall from the Atlantic. The wettest areas in the north-west of Scotland average 3500 mm yr⁻¹ of precipitation. Snow fractions are generally very low across GB; however, there are a number of catchments in the Cairngorm mountains in north-east Scotland where the fraction of snow can reach 0.17. The driest areas are found in the South and East of GB, with a minimum of 500 mm yr⁻¹ in the East of England (Coxon et al., 2020b). Furthermore, there are a number of large chalk aquifers in the south east, which cause the hydrograph to respond more slowly to rainfall events (Lane et al., 2019). Seasonally, the highest monthly precipitation totals occur in winter months (DJF), and the least precipitation falls during the summer months (JJA). The temperature patterns enhance the availability of moisture. Evaporation losses are concentrated in the summer, from April to September (Lane et al., 2019). Anthropogenic and land-use changes significantly impact river flows (Prosdocimi et al., 2015; Vicente-Serrano et al., 2019). River discharge is most heavily modified in the south east and midland regions of England, in part due to high population density and a long history of human modifications to the environment.

Data - CAMELS-GB
All data used in this analysis come from the CAMELS-GB data set (Coxon et al., 2020a). CAMELS-GB is a recently-released, large-sample, long-term, daily data set that offers the potential for GB-wide modelling studies. CAMELS-GB collated hydrologically relevant data for 671 GB catchments, between the years of 1970 and 2015. The data set includes daily time series for meteorology and discharge (dynamic data, $X_t$, $y_t$). Also included are catchment attributes (static data, $A$) such as topography, climate, hydrologic signatures, soil and land cover, hydrogeology, and human influence. These features are, in reality, not static over time. However, for the purposes of this study we treat these features as time-invariant. Further information on the variables we used as input to our model can be found in Table 2. The reader is directed to Coxon et al. (2020b) for details of the source of the data, how the data were processed and a discussion of data limitations.

The data set contains novel inputs compared with previous CAMELS (US, Chile, Brazil) data sets (Addor et al., 2017; Alvarez-Garreton et al., 2018; Chagas et al., 2020), such as human attributes, calculated potential evapotranspiration (pet) and uncertainty estimates. We do not use all of these features here. The static attributes we use to train the LSTM models are listed in Table 1. These static attributes were chosen to reproduce the experimental framework of Kratzert et al. (2019); the differences reflect the fact that CAMELS-US and CAMELS-GB have slightly different attributes. The attributes include both catchment properties and climate properties, describing the conditions relevant for rainfall-runoff modelling in different catchments.

An Overview of the LSTM and EALSTM
In this paper, we test two neural network architectures used in other hydrological studies (Kratzert et al., 2019). The first is the LSTM, which has been used in a variety of time-series modelling applications. The second model is the EA LSTM, which conditions the discharge response to meteorological forcings on time-invariant properties of river catchments, such as soil and topographic attributes, treating these time-invariant properties separately. The LSTM captures information that is important over both long and short term time horizons, overcoming a key difficulty with traditional RNNs, which are unable to retain information over longer sequences (Hochreiter, 1991; Bengio et al., 1994).
LSTMs do this by maintaining two state vectors, a cell memory vector that captures slowly evolving processes ($C_t$, Equation 5) and a more quickly evolving state vector, colloquially named the "hidden" vector ($h_t$, Equation 6). The cell memory vector, $C_t$, accounts for longer-term dependencies, and a series of "gates" control the information passing into and out of the memory vector. The hidden state vector ($h_t$) evolves more quickly depending on input information and the output of the memory vector (see Figure 2). The gates include the forget gate ($f_t$), which controls the elements of the cell memory vector that are forgotten (i.e. how long water persists in the system) [...]. The reader is referred to Kratzert et al. (2018) and Kratzert et al. (2019) for comprehensive descriptions of the LSTM and EA LSTM, and their hydrological interpretation.
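For reference, the gates and state updates described above are commonly written as below. This is the standard LSTM formulation rather than a transcription of this paper's own numbered equations; the weight matrices $W$, $U$ and bias vectors $b$ are notation assumed here, $x_t$ denotes the input vector at time $t$, $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{candidate cell update} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t && \text{cell memory} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state}
\end{align}
```

Because every gate output lies between 0 and 1, $f_t$ acts as a per-element retention fraction for $C_{t-1}$, which is what motivates the "how long water persists" reading of the forget gate.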
The EA LSTM was developed specifically for rainfall-runoff modelling (Kratzert et al., 2019). The key difference between the EA LSTM and the LSTM is that the input gate ($i$) is no longer conditional upon the dynamic (time-varying) data. Instead, the input gate is determined solely by the static catchment attributes (Equation 8), and all other gates are solely influenced by the dynamic input data (Equations 7, 9 and 10).
The EA LSTM is described as "entity-aware" because it explicitly learns how to use catchment attributes ($A$) to distinguish between similar dynamic inputs ($X_t$) for different catchments ("entities"). For the EA LSTM, $i$ is determined solely by the catchment attributes (Equation 8). Therefore, each catchment has one unique $hs$-dimensional vector which controls what information should persist in future timesteps. In contrast, the LSTM learns to modify the input gate $i_t$ based upon the meteorological forcing data ($X_t$) and the catchment attributes ($A$). The output of the input gate ($i_t$ or $i$) is a vector of values between 0 and 1, which is learned from data. This vector, also known as an "embedding", translates our catchment attributes into a high-dimensional space that represents catchments in a manner optimised to differentiate between catchment rainfall-runoff behaviours.

For the sake of clarity, it is important to note that both models receive the same information. The LSTM still receives the static catchment attributes. However, rather than affecting only the input gate, the static data can influence all gates, since they are appended to the vector of dynamic inputs ($[X_t, A]$) and so the same information is given to the LSTM at each timestep.
The static attributes are used by the LSTM in the same way as the dynamic data. This offers extra flexibility for the LSTM compared with the EA LSTM, since the LSTM is able to modify the input gate based on information from time-varying data, whereas the EA LSTM is not. We are using the static nature of the data as a constraint on the EA LSTM to reflect the nature of the input data (separated into static and dynamic inputs, see Figure 2).
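The contrast between the two input gates can be illustrated with a minimal sketch. It is not the neuralhydrology implementation: the weights are random stand-ins for learned parameters, the attribute and forcing values are hypothetical, and the recurrent $h_{t-1}$ term of the LSTM gate is omitted for brevity.

```python
import math
import random


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(weights, bias, inputs):
    """One gate: sigmoid(W @ inputs + b), with W given as a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hs = 4                                   # hidden size (64 in the paper)
static = [0.3, -1.2]                     # A: hypothetical catchment attributes
dynamic = [[0.0, 1.5, 0.2],              # X_t: hypothetical [p_t, pet_t, t_t]
           [4.1, 0.9, -0.3]]             # for two consecutive days

# LSTM: the input gate sees [X_t, A], so its value changes every timestep.
w_lstm = random_matrix(hs, len(dynamic[0]) + len(static), rng)
b_lstm = [0.0] * hs
i_lstm = [dense_sigmoid(w_lstm, b_lstm, x_t + static) for x_t in dynamic]

# EA LSTM: the input gate sees only A, so it is computed once per catchment.
w_ea = random_matrix(hs, len(static), rng)
b_ea = [0.0] * hs
i_ea = dense_sigmoid(w_ea, b_ea, static)
```

Here `i_ea` is a single `hs`-dimensional vector per catchment, whereas `i_lstm` holds one vector per day, which is the extra flexibility discussed above.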

Model Training
We used the "neuralhydrology" codebase, written in Python 3.6 (Van Rossum et al., 2007), to train and evaluate the models, found here: github.com/neuralhydrology/neuralhydrology/. The configuration files used to run the models can be found using the links at the end of this article. The predictions and error metrics for the fitted models can be found online at Zenodo, zenodo.org/record/4555820.
The goal of rainfall-runoff modelling is to predict time-varying specific discharge, $y = (y_1, ..., y_T) \in \mathbb{R}^T$, measured in mm day⁻¹, for timesteps $t \in \{1, ..., T\}$ at measuring gauge $n$ of $N$, given hydro-meteorological forcing data, $X = (X_1, ..., X_T)$, and catchment attributes ($A$, Table 2) within the catchment area upstream of the gauge. In the present case for GB, $N = 669$.

The underlying CAMELS-GB data set has 671 station gauges. We train on data from only 669 stations because two basins have missing data in the static attributes; stations 18011 and 26006 have missing mean elevation (elev_mean) and mean drainage path slope (dpsbar). The data set, $\mathcal{D}$, therefore consists of:
$$\mathcal{D} = \{(X_n, A_n, y_n)\}_{n=1}^{N}$$
Our task is to learn a single set of parameters, $\theta$, of a model, $M_\theta$, that minimizes the loss function, $\mathcal{L}(\hat{y}, y)$, globally, and thus accurately simulates discharge for all of the basins across GB:
$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(M_\theta(X, A), y)$$
We train our model using the modified Nash-Sutcliffe Efficiency (NSE) loss as our objective function ($\mathcal{L}$), described in Kratzert et al. (2019). Other objective functions could be used; however, we sought to use the same objective function as the conceptual models we compare against, in order to control the possible sources of performance differences. The NSE describes the squared error loss normalized by the total variance of the observations. In order to account for the fact that some basins will have lower variance than others, we follow Kratzert et al. (2019) and normalize by basin-specific variance. This prevents the loss from being overly weighted towards high-variance catchments.
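The effect of normalizing by basin-specific variance can be sketched as follows. This is an illustration in the spirit of the basin-averaged loss of Kratzert et al. (2019), not the training code used here; the function name, the basin labels, the standard deviations and the value of the stabilising constant `eps` are all assumptions.

```python
def basin_normalised_loss(batch, basin_std, eps=0.1):
    """Mean over samples of squared error scaled by (basin std + eps)^2.

    batch: list of (basin_id, simulated, observed) tuples.
    basin_std: dict mapping basin_id to the standard deviation of observed
               discharge in the training period (hypothetical values below).
    eps: small constant guarding against division by near-zero variance
         (assumed value).
    """
    terms = [((sim - obs) ** 2) / (basin_std[basin] + eps) ** 2
             for basin, sim, obs in batch]
    return sum(terms) / len(terms)


basin_std = {"flashy": 4.0, "chalk": 0.4}
batch = [("flashy", 10.0, 12.0),   # 2 mm/day error in a high-variance basin
         ("chalk", 1.0, 1.2)]      # 0.2 mm/day error in a low-variance basin

# With eps=0 both samples contribute equally (0.25 each), even though the
# raw squared errors differ by a factor of 100.
loss = basin_normalised_loss(batch, basin_std, eps=0.0)
```

Without the scaling, the first sample would dominate a plain mean-squared-error loss, pulling the shared parameters towards high-variance catchments.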
Our input data were taken from CAMELS-GB, described above (Coxon et al., 2020b). We used precipitation, potential evapotranspiration and temperature as dynamic inputs ($X_t = [p_t, pet_t, t_t]$). We used 21 static inputs ($A$), individual features describing the topographic, soil, land-cover, and climatic properties of each catchment. These catchment attributes are described in Table 2. For both LSTM models we pass the final hidden output through a fully connected (linear) layer. This final layer maps our hidden state vector ($\mathbb{R}^{hs}$) to a scalar prediction of specific discharge ($\hat{y}_t$) for that gauge at that daily timestep. We train the LSTM models on 669 gauges with data from our training period (1988-1997), which was chosen to match the conceptual model experiments we compare against (Lane et al., 2019). Furthermore, the training data for both the LSTM based models and the conceptual models come from the same underlying sources. In order to make a fair comparison, all national results shown below are calculated for the 518 gauges that are found in both the CAMELS-GB data and the benchmark data. We then evaluate model performance on all of these basins for our test (evaluation) period (1998-2008).
For each model (LSTM, EA LSTM) we take the average of an ensemble of eight individually-trained models with different random seeds. This strategy accounts for the random initialisation of the network and the stochastic nature of the optimisation algorithm. We used a hidden size ($hs$) of 64 and a final fully connected layer with a dropout rate of 0.4, which aims to avoid overfitting. Dropout works by randomly forcing certain weights in the network to zero ("dropping them out"), forcing the remaining weights to model the discharge without that extra information. This has been found to prevent weights "fixing" the erroneous outputs of other weights, preventing this co-adaptation of weights and, ultimately, encouraging the model to use a simpler and more robust representation of rainfall-runoff processes (Srivastava et al., 2014). We chose the hyper-parameters (dropout rate, hidden size $hs$) based on the choices in previous studies (Kratzert et al., 2019). We used the Adam optimisation algorithm (Kingma and Ba, 2014) and stopped training after 30 epochs. The LSTM ensemble took 10 hours to train. The EA LSTM ensemble took 96 hours to train. All models were trained on a machine with 188GB of RAM and a single NVIDIA V100 GPU.
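The ensemble averaging step can be sketched as a per-timestep mean across members. The helper name and the three-member, three-day example values are hypothetical; the study itself averages eight members.

```python
def ensemble_mean(member_predictions):
    """Average simulated discharge across ensemble members, per timestep.

    member_predictions: list of equal-length lists, one per random seed.
    """
    n_members = len(member_predictions)
    return [sum(values) / n_members for values in zip(*member_predictions)]


# Hypothetical three-day simulations (mm/day) from three differently seeded
# models of the same architecture.
members = [[1.0, 2.0, 4.0],
           [1.2, 1.8, 4.4],
           [0.8, 2.2, 3.6]]
mean_sim = ensemble_mean(members)
```

Performance metrics are then computed on `mean_sim` rather than on any single member, which damps the seed-to-seed variability described above.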

Model Performance Comparisons
The LSTMs learn to represent hydrological processes directly from data. When the LSTMs perform well, a necessary corollary is that the data contain useful information about the hydrological processes. The differences in model performance between the LSTMs and the benchmark hydrological models can be used to determine hydrological processes that are described by the input data, but uncaptured or under-represented by the benchmark hydrological models.

Benchmark Models
We compare the performance of the LSTM based models against a range of lumped, conceptual models. We used predicted discharge time series from Lane et al. (2019), who utilised the FUSE framework to train and evaluate four lumped conceptual models across Great Britain (Clark et al., 2008). The four conceptual models used are: TOPMODEL (Beven and Kirkby, 1979), Variable Infiltration Capacity (VIC) (Liang, 1994), Precipitation-Runoff Modelling System (PRMS) (Leavesley et al., 1983) and SACRAMENTO (Burnash et al., 1973). These conceptual models are often used in operational settings, due to the relative ease of use and lower data requirements when compared with physically-based models (Lane et al., 2019). These conceptual models all explicitly maintain mass balance, and so assume no losses or gains of water other than flow from the catchment outlet or evaporation.
The calibration and evaluation of these models was performed using the same underlying data as in CAMELS-GB, i.e. the National River Flow Archive data (Centre for Ecology and Hydrology, 2016) for the specific discharge ($y_t$), the Centre for Ecology and Hydrology Gridded Estimates of Areal Rainfall, CEH-GEAR, for precipitation (Tanguy, 2014) [...]. The conceptual models were calibrated for individual basins, so the parameters that they use to produce simulations are unique to each basin. This often represents the state-of-the-art for traditional hydrological models. In contrast, the trained LSTM models learn one parameter set for all basins, using all basins to train the models. There is therefore a difference between testing model performance on hold-out data (as we do for the LSTM models) and the published results from Lane et al. (2019), which test model performance on in-sample data that were also used to select the optimum parameters.

The conceptual models were calibrated and evaluated to produce simulated streamflows by Lane et al. (2019). We did not run these benchmarks ourselves. This is important because we have not biased the calibration of these models to favour the deep learning models. We have used the published time-series of model outputs to calculate performance scores for the conceptual models. This allows us to better understand the seasonal and geographical patterns in model performance.
Comparing the LSTM based models against these conceptual models allows us to determine the spatial and temporal patterns in performance, helping to identify flow-regimes where the LSTM models add significant value when simulating GB discharge.

Evaluation Protocol
Each model produces a daily simulated discharge value at each station. Three example hydrographs are shown in Appendix B. The evaluation protocol described below evaluates the overall performance of each model to reproduce the observed hydrograph.

Since no single evaluation metric can fully capture the performance of streamflow simulations across all flow-regimes (Gupta et al., 1998), we use a number of metrics to address the performance of models across the flow regime, outlined below.
We evaluate the goodness-of-fit of the LSTM based models and the conceptual models using six evaluation metrics. The Nash-Sutcliffe Efficiency (NSE; Nash and Sutcliffe, 1970; Equation 13) is perhaps the most widely used performance measure in hydrology (Ewen, 2011). It has been used for many years and there is extensive literature discussing its strengths and weaknesses (Gupta et al., 2009). Owing to the squared term in the definition of NSE, it is more heavily influenced by high flows.
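As a reminder of the definition, the NSE is one minus the ratio of the sum of squared errors to the sum of squared deviations of the observations from their mean. A minimal sketch, with hypothetical discharge values, is:

```python
def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 - SSE / sum of squares about mean(obs)."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((s - o) ** 2 for s, o in zip(sim, obs))
    obs_variability = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variability


obs = [1.0, 2.0, 3.0, 6.0]             # hypothetical observed discharge
perfect = nse(obs, obs)                 # a perfect simulation scores 1
climatology = nse([3.0] * 4, obs)       # predicting the observed mean scores 0
```

A score of 1 is a perfect fit, 0 means the simulation is no better than always predicting the observed mean, and negative values mean it is worse than that baseline.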
The NSE can be decomposed into three components: a correlation term (Equation 14), a bias term (BiasError, Equation 15) and a variability term (SDError, Equation 16) (Gupta et al., 2009). The bias term measures the error in predicting the mean flow. The variability term measures the error in predicting the standard deviation of discharge. We follow Lane et al. (2019) in using these decomposed aspects of NSE to diagnose the underlying causes of accurate or inaccurate simulations in a given catchment.
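The decomposition of Gupta et al. (2009) can be checked numerically. In the sketch below, `r` is the linear correlation, `alpha` the ratio of simulated to observed standard deviation and `beta_n` the mean error scaled by the observed standard deviation, so that NSE = 2·alpha·r - alpha² - beta_n². The exact definitions of BiasError and SDError used in this paper may differ from `beta_n` and `alpha`; the variable names and example values are assumptions.

```python
import math


def nse_components(sim, obs):
    """Correlation, variability and bias terms of NSE (Gupta et al., 2009)."""
    n = len(obs)
    mu_s, mu_o = sum(sim) / n, sum(obs) / n
    # Population (divide-by-n) statistics, so the identity holds exactly.
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in sim) / n)
    sd_o = math.sqrt(sum((o - mu_o) ** 2 for o in obs) / n)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / n
    r = cov / (sd_s * sd_o)          # correlation term
    alpha = sd_s / sd_o              # variability ratio
    beta_n = (mu_s - mu_o) / sd_o    # normalised bias
    return {"r": r, "alpha": alpha, "beta_n": beta_n,
            "nse": 2 * alpha * r - alpha ** 2 - beta_n ** 2}


obs = [1.0, 2.0, 3.0, 6.0]           # hypothetical observed discharge
sim = [1.5, 2.5, 2.5, 5.0]           # hypothetical simulated discharge
parts = nse_components(sim, obs)
```

For these values the recombined score equals the directly computed NSE of 0.875, with `alpha` below 1 showing that the simulation underestimates the variability of the observations.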

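The midsection slope bias of the flow duration curve (%BiasFMS) can be sketched following the general form given by Yilmaz et al. (2008): the difference between simulated and observed log-flow slopes between the 0.2 and 0.7 exceedance probabilities, as a percentage of the observed slope. The function name, the nearest-rank lookup of the exceedance probabilities and the synthetic flows are assumptions, and flows are assumed strictly positive.

```python
import math


def fdc_midslope_bias(sim, obs, m1=0.2, m2=0.7):
    """Percent bias in the log-space slope of the FDC midsection."""
    def log_flow_at(flows, exceedance):
        ranked = sorted(flows, reverse=True)          # highest flow first
        index = int(round(exceedance * (len(ranked) - 1)))
        return math.log(ranked[index])                # assumes flows > 0

    sim_slope = log_flow_at(sim, m1) - log_flow_at(sim, m2)
    obs_slope = log_flow_at(obs, m1) - log_flow_at(obs, m2)
    return 100.0 * (sim_slope - obs_slope) / obs_slope


obs = [float(q) for q in range(1, 102)]   # hypothetical flows 1..101
damped = [math.sqrt(q) for q in obs]      # a simulation with a flatter FDC
bias = fdc_midslope_bias(damped, obs)     # close to -50 (%)
```

A negative value, as for the damped simulation here, indicates a flatter simulated midsection than observed.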
In the flow duration curve metrics, l = 1, 2, ..., L is the index of the flow value within the low-flow segment, defined as 0.7-1.0 flow exceedance probabilities, and m1 and m2 correspond to the lower bound (m1 = 0.2) and upper bound (m2 = 0.7) of the middle section, following Yilmaz et al. (2008).

Results

National Scale Model Performance
The LSTM and EA LSTM models systematically outperform the conceptual lumped models across Great Britain when evaluated using a variety of metrics, with differing levels of performance improvement (see Table 3).
Table 3. Summary of all goodness-of-fit metrics used to benchmark performance against the conceptual models for the validation period 1998-2008 on the 518 stations found in both the CAMELS-GB data (Coxon et al., 2020a) and the FUSE conceptual models (Lane et al., 2019). We have shown the median score. Values that are not significantly different from the best model are highlighted in bold (α = 0.001).
Comparing the median NSE for all catchments, the LSTM (0.88) outperforms all other models, including the EA LSTM (0.86). The slightly lower median NSE for the EA LSTM models is consistent with results from previous studies (Kratzert et al., 2019). The difference between the LSTM based models is small relative to the difference between the LSTM based models and the conceptual models. Of the conceptual models, SACRAMENTO performs best (0.80), followed by ARNOVIC (0.78), PRMS (0.77) and TOPMODEL (0.76).

The CDFs (cumulative distribution functions) of the NSE (Figure 3a) show that the entire distribution of LSTM scores is shifted towards better performances. The LSTM NSE scores are significantly different from all comparison models at α = 0.001 (Wilcoxon signed-rank test). We see the same pattern for the EA LSTM models, where the distribution of NSE scores is also different from all other models at the α = 0.001 level. The performance improvement at the tails is particularly pronounced.
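An empirical CDF of per-catchment scores, and the share of catchments falling below a threshold, can be computed as in the sketch below. The NSE values are hypothetical and chosen only to show how a long lower tail appears in the CDF.

```python
def empirical_cdf(scores):
    """Return sorted scores and the fraction of catchments at or below each."""
    ranked = sorted(scores)
    n = len(ranked)
    return ranked, [(k + 1) / n for k in range(n)]


def fraction_below(scores, threshold):
    """Share of catchments scoring strictly below the threshold."""
    return sum(score < threshold for score in scores) / len(scores)


# Hypothetical per-catchment NSE values for two models
lstm_nse = [0.91, 0.85, 0.88, 0.79]
conceptual_nse = [0.80, -0.20, 0.75, 0.60]

lstm_x, lstm_p = empirical_cdf(lstm_nse)
lstm_tail = fraction_below(lstm_nse, 0.0)
conceptual_tail = fraction_below(conceptual_nse, 0.0)
```

Plotting `lstm_p` against `lstm_x` gives the curve for one model; a curve lying further to the right indicates better scores across the whole sample of catchments.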
Neither the LSTM nor the EA LSTM model has any station gauges with an NSE of less than zero. This is in contrast to the conceptual models.
Looking at different segments of the flow duration curve, the LSTM models outperform the conceptual models at low flows (Table 3). The LSTM shows a much greater performance improvement for the low-flow bias score (%BiasFLV). This can be seen in the empirical CDFs (Figure 3e), where the conceptual models have a large proportion of stations with biases more extreme than -100%. By contrast, the LSTM based models have very few (<5%). This finding is interesting because the performance of the LSTM at low flows was previously identified as an area for further research and future improvement (Kratzert et al., 2019; Gauch et al., 2020) and indeed, we find that the LSTM based models outperform conceptual models on this low-flow metric.
Comparing median scores, the LSTM has lower median bias in the slope of the midsection of the flow duration curve (%BiasFMS) than all models except ARNOVIC. When we consider the CDFs, both LSTMs have much shorter tails than the conceptual models, showing that a greater proportion of catchments have biases closer to zero. The high-flow biases (%BiasFHV) are relatively similar for all models (Figure 3g), although the median scores show that there is a small performance improvement.
Overall, the biases at different flow exceedances suggest that the conceptual models produce adequate simulations for the high flows, but are less able to simulate low flows. The LSTM shows a much smaller performance decline at the low flows and a competitive performance at high flows, suggesting that the LSTMs are more robust to extreme conditions. We also note that the negative bias for the midsection and the upper section of the flow duration curve demonstrates that the LSTM model is conservative in its flow predictions.

Spatial Patterns of Performance
The spatial patterns of model performance show that the LSTM improves simulation of discharge across Great Britain [...]. Unlike the conceptual models, the LSTM had no difficulty in reproducing flows in North-Eastern Scotland. This is likely a result of the LSTM based models accurately simulating catchments in which snow processes are significant. The lack of a snow module in the conceptual models used as a benchmark very likely explains at least part of this difference in performance. Interestingly, the LSTM also performs less well in the South East than elsewhere in GB. An initial hypothesis is that hydrological conditions in the drier catchments with groundwater transfers remain difficult to model, requiring time-varying parameters and more detailed representation of hydrogeological properties. The LSTM partly addresses these challenges (demonstrated by the performance improvement over the conceptual models), but further research should address how the LSTM might be further improved in these low-flow regimes.
Spatial patterns in the biases for different sections of the flow duration curve (Figure 4) also show improvement across GB for the LSTM based models compared with the conceptual models. The low-flow biases (%BiasFLV) for the LSTM and EA LSTM are smaller than those of the conceptual models across GB, although the largest biases can be found in South East England and South-Central Wales. The LSTM and EA LSTM tend to overpredict low flows, the same direction of bias as TOPMODEL, whereas SACRAMENTO, PRMS and ARNOVIC have a negative bias. This means that the LSTM is overpredicting low flows, with a larger bias in the South East. Only the LSTMs show consistent underprediction of the midsection slope of the flow duration curve (%BiasFMS). The slope of the midsection of the flow duration curve reflects a watershed having a "flashy" response (Yilmaz et al., 2008), potentially due to small soil moisture capacity. Therefore, an underprediction of the midsection reflects an underestimation of the "flashiness" of the catchment. The LSTM %BiasFMS is largest for the South [...] LSTM shows a smaller negative bias than the conceptual models, the spatial pattern is very similar. Overall, the LSTM shows considerable improvement across GB, including in the under-performing regions of the South East of England and East Scotland.
The regional performance matrix (Figure 5) shows that while performance varies around GB for the conceptual models, the median regional performance of the LSTM is much more stable, ranging from an NSE of 0.85 (ANG - Anglia) to 0.91 (SWESW - SW England South Wales). The largest difference from the GB average is 0.03 NSE, for SWESW and ANG. In contrast, the conceptual models have much more variable performance across the regions. PRMS, for example, ranges from 0.65 (ANG) to 0.84 (SWESW), a range of 0.19 NSE. The LSTM is more robust to different regional hydrological patterns, showing smaller variability in performance scores, whereas the conceptual models are clearly more capable in certain hydrological regimes than others.
[Figure caption, truncated] [...] from 21 river basin districts to eight regions. The leftmost column is the median score for all GB catchments, which is the same as in Table 3. It is included here for reference.

Seasonal Performance
LSTM-based models reproduce similar seasonal patterns to the conceptual models, although the decline in LSTM summer NSE scores is more muted than the decline in conceptual model NSE scores (see Figure 6). Performance in all seasons is worse in the South East of England, and this pattern is exacerbated in the summer months (JJA). The East-West gradient in model performance can be seen clearly in the seasonal errors of the conceptual models, whereas it is less pronounced for the LSTM-based models, although their performance is also lower in summer (JJA). To visualise the performance improvement of the LSTM compared with the conceptual models and to explore seasonal patterns, we calculated the difference in NSE (∆NSE). For a catchment where the LSTM NSE is larger than the conceptual model NSE, the value is positive, and a more positive value reflects a larger performance improvement. We compare each of the four conceptual models to the LSTM.
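The ∆NSE calculation described above can be sketched as follows. This is an illustrative outline only: the function names, the month-to-season mapping and the choice to compute each seasonal NSE against the mean of the observations within that season are assumptions, and the values in the example are hypothetical rather than results from this study.

```python
from collections import defaultdict

# Meteorological seasons keyed by calendar month (assumed convention).
SEASONS = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
           6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / variance


def seasonal_nse(months, obs, sim):
    """NSE per season, using only the time steps that fall in that season."""
    grouped = defaultdict(lambda: ([], []))
    for month, o, s in zip(months, obs, sim):
        grouped[SEASONS[month]][0].append(o)
        grouped[SEASONS[month]][1].append(s)
    return {season: nse(o, s) for season, (o, s) in grouped.items()}


def delta_nse(lstm_scores, conceptual_scores):
    """Positive values mean the LSTM outperforms the conceptual model."""
    return {season: lstm_scores[season] - conceptual_scores[season]
            for season in lstm_scores}


# Hypothetical single-catchment example with two winter and two summer time steps.
months = [1, 1, 7, 7]
observed = [1.0, 3.0, 2.0, 4.0]
lstm_sim = [1.0, 3.0, 2.0, 3.0]
conceptual_sim = [1.0, 2.0, 2.0, 2.0]
improvement = delta_nse(seasonal_nse(months, observed, lstm_sim),
                        seasonal_nse(months, observed, conceptual_sim))
print(improvement)
```

In this toy case the improvement is larger in JJA than in DJF, mirroring the direction of the seasonal pattern reported here, but the numbers carry no further meaning.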

The seasonal pattern of ∆NSEs shows that the LSTM-based models most improve simulations of discharge in the summer seasons ("JJA" -the green line in Figure 7). The seasonal pattern is consistent for all of the conceptual models that we benchmark against, although the seasonal difference in ∆NSE is largest for TOPMODEL and ARNOVIC.

Discussion
This study benchmarks the performance of the LSTM compared to four commonly used conceptual models and two physically-based models. The comparison with the conceptual models was calculated over a large number of catchments, making the study representative of performance in different regions of Great Britain. The performance of the LSTM demonstrates that there is adequate information in the observational data to accurately simulate discharge behaviours in the various hydrological conditions found in Great Britain. The LSTM performance is likely a conservative estimate owing to the limited training period and the lack of advanced hyperparameter tuning. The simulated time series can be found at: zenodo.org/record/4555820.

In the discussion that follows we will return to our three research questions: (i) How well do LSTM-based models simulate discharge in Great Britain? (ii) Can the LSTM overcome limitations of previously calibrated models, producing accurate simulations in regions where hydrological models typically struggle to reproduce hydrological outputs? (iii) Are there hydrological processes that data-driven models simulate better than traditional hydrological models, which can be diagnosed by benchmarking model performance?

Inter-Model Performances
The LSTM-based models produce accurate simulations of discharge across GB, a temperate region. Two findings from this research confirm and extend the conclusions of previous work. First, the LSTM consistently outperforms the EA LSTM, although the differences in performance are small compared with the difference in performance between the LSTM-based models and the conceptual models. Secondly, LSTM-based models demonstrate state-of-the-art prediction accuracy for discharge modelling (Kratzert et al., 2019; Nearing et al., 2020).
The EA LSTM is constrained to treat information that does not vary over time (catchment attributes) separately from information that varies over time (hydro-meteorological forcings). However, this constraint clearly penalizes performance, as was also found by Kratzert et al. (2019). The underperformance of the EA LSTM relative to the LSTM suggests that the value of the input gate (i) should be combined with time-varying information (X_t) to update the cell memory. The catchment attributes alone are not sufficient to determine what information needs to be passed into the cell memory (Equation 8 compared with Equation 2). In other words, the LSTM learns more about the catchments' hydrological response to rainfall from the hydrographs themselves than from the static catchment attributes. This finding suggesting that the catchment properties, [...]
For all the analyses in this paper, the LSTM-based models are trained on all basins, with a single set of weights for the whole of GB, and tested (evaluated) on out-of-sample time periods. Therefore, these LSTM models are regional models that are able to reproduce behaviours across Great Britain. In contrast, most hydrological models perform best when calibrated on individual basins. This distinction is important because it reflects the situations in which different models will perform best.
The LSTM-based models are most accurate when trained with as much data from as many catchments as possible (Gauch et al., 2021). In contrast, traditional hydrological models, including the lumped conceptual models we use as a benchmark, produce 450 their best simulations when trained on an individual catchment.
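Returning to the input-gate comparison made earlier in this subsection, the difference between the two architectures can be written out explicitly. The notation below follows the formulation in Kratzert et al. (2019) and is an assumption here: the symbols $x_s$ (static catchment attributes), $x_t$ (time-varying inputs), $h_{t-1}$ (previous hidden state), $g_t$ (candidate cell update), $f_t$ (forget gate) and the weight names may not match Equations 2 and 8 term for term.

```latex
% Standard LSTM: the input gate is recomputed at every time step
i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right)

% EA LSTM: the input gate depends only on static attributes, so it is constant in time
i = \sigma\left(W_i x_s + b_i\right)

% Cell-state update (for the EA LSTM, i_t is replaced by the constant i)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
```

Because the EA LSTM gate $i$ never sees $x_t$ or $h_{t-1}$, it cannot adjust how much new information enters the cell state in response to the current forcing, which is the constraint discussed above.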
To further verify that the LSTM-based models are producing comparatively accurate simulations, we took published NSE scores for 13 test basins in the UK for two process-based models (Fatichi et al., 2016), JULES (Clark et al., 2011) and CLASSIC (Crooks et al., 2014). Both models use the same ancillary data, such as discretised soil maps, land cover and flow direction grids. JULES is driven by atmospheric information and computes land surface fluxes directly, conserving mass, energy and momentum. In contrast, CLASSIC-GB runs using precipitation and pre-calculated surface evaporation.
More detail on the hydrological components of these two models can be found in Best et al. (2011) and Crooks et al. (2014). Martínez-de la Torre et al. (2019) simulated the 13 test catchments using the JULES land-surface model. The meteorological data come from CHESS-met (radiation, temperature, humidity, wind speed and pressure) and CEH-GEAR (precipitation), the same data sets used in CAMELS-GB, but the inputs were 1 km gridded data rather than catchment-averaged data. JULES was calibrated and tested on the period 1991-2000. CLASSIC-GB was run at four spatial resolutions (100 km², 25 km², 6.25 km² and 1 km²). We use the results from their experiment at 1 km. The driving data come from the Met Office Rainfall and Evaporation Calculation System (MORECS).
The model simulation was produced for 1980-1983. We are aware that these time periods do not match our test periods, and therefore the following comparison should serve as a preliminary experiment to ascertain whether the LSTM is producing state-of-the-art simulation accuracy in GB. Further research should consider a more complete intercomparison of these models, and consider the robustness of model estimates to uncertain future conditions, as has been explored by Sungmin et al. (2020).
In summary, we have demonstrated that the LSTM models perform well relative to the conceptual models, with consistent improvements across GB. Furthermore, a preliminary comparison with the process-based models adds weight to our suggestion that the LSTM produces state-of-the-art simulation accuracy.

In what hydrological conditions does the LSTM outperform benchmark models?
At the outset we hypothesised that LSTM performance improvement would be largest under the conditions in which conceptual models most often underperform, thus overcoming the limitations of previously calibrated models. The conceptual models struggled to produce good simulations in two geographical regions: the South East of England and the North East of Scotland.
The performance improvement (∆NSE) is indeed largest in the South East of England and North East Scotland (see Figure 8).
North East Scotland is one of the most mountainous regions of GB. The Cairngorm National Park is the only area of GB where snow processes are consistently important, owing to catchments having a higher elevation. The results in Figure 8 show that the LSTM largely overcomes the difficulties in modelling these catchments, since ∆NSE is high. This is most likely due to the cell state (Equation 5) being able to represent longer-term stores and fluxes of water, therefore capturing the melting snow processes.
The South East is a relatively dry area (see Figure 1a), with large chalk aquifers contributing to a high baseflow index (see Figure 1d) and large urban and agricultural areas, contributing to a large anthropogenic signal in the hydrographs. Although the improvement in simulation accuracy compared to the conceptual models is large in the South East, the pattern of raw LSTM NSE shows that the LSTM still underperforms in the South East relative to elsewhere in GB. The seasonal patterns showed that the LSTMs performed worse in summer months, which is the drier period of the year. Consistent with this spatial pattern, aridity is negatively correlated with model performance for all models (Figure 9), although the magnitude of this association is smaller for the LSTM-based models than for the conceptual models.
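An association of this kind between a catchment attribute and model skill can be quantified with a rank correlation. The sketch below computes Spearman's rank correlation using only the standard library. The function names and the aridity and NSE values are hypothetical and chosen purely for illustration; they are not data from this study, and the exact correlation procedure used for Figure 9 is not specified in this section.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical catchments: skill falls monotonically as aridity rises.
aridity = [0.4, 0.6, 0.8, 1.0]
catchment_nse = [0.90, 0.85, 0.80, 0.70]
print(spearman(aridity, catchment_nse))
```

A value closer to zero for one model than for another would indicate that its performance is less strongly tied to the attribute in question.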
We observe consistently poorer performance across all models, conceptual and LSTM, in drier hydrological conditions. We can think of two possible explanations. The first is that the use of NSE as an objective function fails to adequately weight performance in these low-flow regimes (the NSE was the objective function for both the conceptual models and the DL models). The second is that hydrological processes are significantly more complex in these drier regimes. In support of the first explanation, other catchment attributes point to improved modelling of high flows relative to low flows: all model NSE scores show positive correlations with increased discharge (at mean flow, Q5 and Q95) and with increased mean rainfall (p_mean). A future study will consider the impact of different objective functions. In support of the second explanation (increased complexity in arid conditions, and therefore greater modelling difficulty and lower performance scores), the lower catchment "connectivity" in arid conditions could be responsible (Bracken and Croke, 2007). In winter, when soils are saturated, there are a greater number of pathways for water to enter river channels, so connectivity is high. In summer, there is greater resistance to water flow, since water can be absorbed and stored in drier soils, as found in Swiss catchments by van Meerveld et al. (2019), so connectivity is lower. The proposed impact of catchment connectivity on the performance improvement of the LSTM-based models is ultimately speculative, and future work will explore whether the LSTM has learned to represent the concept of connectivity.
Like aridity, the relative cover of cropland (crop_perc) shows a strong negative correlation with conceptual model perfor- [...]
One of the key conditions that conceptual models struggle with is when the catchment water balance does not close. The conceptual models we test here explicitly maintain mass balance. They define the topographic surface water catchment to be the surface over which water is conserved, i.e. the surface water catchment does not leak, and no water enters other than through measured precipitation (for example through undercatch, drifting snow, advection of fog, groundwater, or anthropogenic transfers into or out of the topographic catchment). Therefore, the conceptual models struggle to produce accurate simulations in catchments where the water balance (defined in the data) does not close. The LSTM, in contrast, is free to diagnose inter-catchment transfers (either through anthropogenic or groundwater processes). This was the rationale behind our final research question: benchmarking against the LSTM can help researchers diagnose processes that traditional models could simulate better, given that information exists in the data to describe them.
We plot catchments on two dimensions (Figure 10), their wetness index (P/PE) and their runoff coefficient (Q/P), to identify catchments where water transfers outside of the topographic surface water catchment may be occurring. Points above the horizontal line reflect catchments where the observed discharge is greater than the precipitation input to the catchment. This area of the graph represents catchments where the data have too little water to generate the observed runoff. Points below the curved line are where runoff deficits exceed total PET in a catchment. This area of the graph represents catchments where PET is not large enough to account for the water remaining after runoff is subtracted, i.e. the data have "excess" water (Figure 10).
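The two regions of the plot described above follow from simple inequalities on long-term means: the horizontal line is Q/P = 1, and the curved line is Q/P = 1 - PE/P, since a runoff deficit exceeding PET means P - Q > PE. The sketch below classifies a catchment accordingly. The function name, the labels and the example values are hypothetical, and the inputs are assumed to be positive long-term means in the same units.

```python
def classify_water_balance(precip, pet, discharge):
    """Return (wetness index, runoff coefficient, label) for one catchment.

    precip, pet and discharge are long-term means in the same units
    (for example mm/day) and are assumed to be strictly positive.
    """
    wetness_index = precip / pet
    runoff_coefficient = discharge / precip
    if runoff_coefficient > 1.0:
        # Above the horizontal line: more runoff than precipitation supplied.
        label = "too little water"
    elif runoff_coefficient < 1.0 - pet / precip:
        # Below the curved line: P - Q exceeds PET, so PET cannot explain the loss.
        label = "excess water"
    else:
        label = "balance can close"
    return wetness_index, runoff_coefficient, label


# Hypothetical catchments (precip, PET, discharge), not values from CAMELS-GB.
for name, values in {"A": (2.0, 1.0, 2.5), "B": (4.0, 1.0, 2.0), "C": (2.0, 1.5, 1.0)}.items():
    print(name, classify_water_balance(*values))
```

Catchments labelled "too little water" or "excess water" are the ones where a strictly mass-conserving model cannot reproduce the observed runoff from the supplied forcings alone.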

Interestingly, both the LSTMs and the conceptual models show a performance decline in catchments where the water balance does not close. This suggests either that the LSTM models still struggle with water-limited and energy-limited (low runoff coefficient and low wetness index) catchments or, given that both LSTMs and conceptual models struggle where the data do not meet the water balance constraints, that human impacts on the hydrograph, such as abstraction and effluent returns, are ultimately unpredictable. However, the performance decline for the LSTM is much less pronounced than for the conceptual models, and the LSTM continues to produce simulations with NSE scores greater than 0.6. This suggests there remains more information in the data that the conceptual models are currently unable to utilise.
Figure 10. Scatter plot for the relationship between the wetness index, runoff coefficient and the model NSE score. Each point is a catchment, coloured by the NSE score ranging from 0.8 (lighter) to 1.0 (darker). Points above the horizontal line reflect catchments where the observed discharge is greater than the precipitation input to the catchment. Points below the curved line are where runoff deficits exceed total PET in a catchment; there is therefore "excess water" in the data, since PET cannot explain the leftover water after accounting for runoff.
We tested whether the LSTM was better able to simulate discharge in catchments with "excess" water (i.e. the points below the curved line on the plots in Figure 11). The LSTM is much more robust to these conditions and produces NSE scores that are comparable to the stations where the conceptual models perform best.

Overall, the results demonstrate that the LSTMs are better able than the conceptual models to model conditions where the topographic surface water catchment has excess water. The performances are most improved for the conditions that are furthest from the water balance constraint being met. There are two key findings here. First, catchment transfers may be detectable from the data alone, assuming that the processes are at least in part due to real signals rather than data errors. Second, like that of the conceptual models, LSTM performance declines in these catchments where runoff deficits exceed total PET. This could be either because the NSE is an inappropriate objective function for these catchments, or because hydrological processes in these catchments cannot be modelled with the available data.

Conclusions
In this study we have compared two LSTM based models to four conceptual models over 518 catchments across Great Britain.
We have demonstrated that LSTM-based models trained on a large sample of catchment-averaged hydro-meteorological time- [...]
The spatial patterns of performance demonstrate that while the LSTMs improve simulations most in South East England and North East Scotland, they continue to underperform in South East England relative to elsewhere in GB. When we consider the catchment conditions that are associated with this pattern, it is clear that all models struggle with drier conditions and with catchments where the water balance does not close. This demonstrates the importance of accounting for water balance losses and gains in hydrological modelling across GB, and more research should be directed at quantifying water transfers through groundwater and human management.
We identified a number of hydrological characteristics that correlate with model performance. These correlations are similar in direction for the conceptual models and the LSTMs. However, they differ in magnitude: rank correlation scores are lower for the LSTMs for all of the variables we tested. Overall, we find that the LSTMs are more robust to the diversity of observed hydrological conditions found across GB.
Future research will consider the internal states of the LSTM to identify how the LSTM learns to reproduce hydrological signatures. The LSTM is structured such that its internal states reflect subsurface stores and fluxes of water, such as soil moisture. We expect that the internal states will allow us to explore how the LSTM has learned to distinguish between catchments with different degrees of external influence on the hydrograph. Finally, we will consider the impact of different objective functions on improving simulations of different parts of the flow duration curve.
This work demonstrates the ability of LSTM-based models to accurately simulate the hydrological response to rainfall in Great Britain. We have used a new data set, CAMELS-GB, and demonstrated that there is sufficient information in this data to create state-of-the-art data-driven models based on an LSTM. The results provide a credible baseline for comparison of future models, and we make model predictions from trained LSTMs and error metrics for each catchment available here: zenodo.org/record/4555820.
The FUSE benchmark model simulations are available at: https://data.bris.ac.uk/data/dataset/3ma509dlakcf720aw8x82aq4tm. The neuralhy[...]
[...] periods. Climatology makes a prediction based on the mean discharge for that day of the year. Persistence is equivalent to predicting yesterday's value today, i.e. predicting that the future will be the same as the past. Figure A1 shows that the processes are largely stationary, and the period we use for calibration is similar to the period we use for evaluation. Indeed, the period we use for calibration is slightly easier to predict than the test period, since the benchmark models perform better, i.e. the distribution of catchment NSE scores is shifted towards higher NSE scores during the train period. Furthermore, the conditions for precipitation, PET, temperature and specific discharge are very similar between the train and test periods. The temperatures have warmed slightly and there are slightly more days with zero precipitation; however, it is unlikely that such small changes