Deep learning for monthly rainfall–runoff modelling: a large-sample comparison with conceptual models across Australia

Clark, Stephanie R.; Lerat, Julien; Perraud, Jean-Michel; Fitch, Peter

doi:https://doi.org/10.5194/hess-28-1191-2024

Articles | Volume 28, issue 5

https://doi.org/10.5194/hess-28-1191-2024

Articles | Volume 28, issue 5

Research article

13 Mar 2024

Research article |

| 13 Mar 2024

Deep learning for monthly rainfall–runoff modelling: a large-sample comparison with conceptual models across Australia

Stephanie R. Clark, Julien Lerat, Jean-Michel Perraud, and Peter Fitch

Abstract

A deep learning model designed for time series predictions, the long short-term memory (LSTM) architecture, is regularly producing reliable results in local and regional rainfall–runoff applications around the world. Recent large-sample hydrology studies in North America and Europe have shown the LSTM model to successfully match conceptual model performance at a daily time step over hundreds of catchments. Here we investigate how these models perform in producing monthly runoff predictions in the relatively dry and variable conditions of the Australian continent. The monthly time step matches historic data availability and is also important for future water resources planning; however, it provides significantly smaller training datasets than daily time series. In this study, a continental-scale comparison of monthly deep learning (LSTM) predictions to conceptual rainfall–runoff (WAPABA model) predictions is performed on almost 500 catchments across Australia with performance results aggregated over a variety of catchment sizes, flow conditions, and hydrological record lengths. The study period covers a wet phase followed by a prolonged drought, introducing challenges for making predictions outside of known conditions – challenges that will intensify as climate change progresses. The results show that LSTM models matched or exceeded WAPABA prediction performance for more than two-thirds of the study catchments, the largest performance gains of LSTM versus WAPABA occurred in large catchments, the LSTMs struggled less to generalise than the WAPABA models (e.g. making predictions under new conditions), and catchments with few training observations due to the monthly time step did not demonstrate a clear benefit with either WAPABA or LSTM.

Download & links

Article (PDF, 6227 KB)

Download & links

How to cite.

Received: 17 May 2023 – Discussion started: 30 May 2023 – Revised: 30 Nov 2023 – Accepted: 31 Jan 2024 – Published: 13 Mar 2024

Highlights

A deep learning model (single-layer LSTM) matched or exceeded the performance of a WAPABA rainfall–runoff model in 69 % of study catchments.
Monthly datasets contain enough information to train the LSTMs to this level.
Generalisation to new conditions was found to improve with use of the LSTM, with implications for modelling under climate change.

1 Introduction

With progressively variable climate conditions and the ever-increasing accessibility of hydrologic data, there comes the opportunity to reconsider how available data are being used to efficiently predict streamflow runoff on a large scale. Hydrological researchers are increasingly turning to emerging machine learning techniques such as deep learning to analyse this increasing volume of data, due to the relative ease of extracting useful information from large datasets and producing accurate predictions about future conditions without the need for detailed knowledge about the underlying physical systems. Machine learning models have been shown to be capable of obtaining more information from hydrological datasets than is abstracted with traditional models, due to their automatic feature engineering and ability to effectively capture high-dimensional and long-term relationships (Nearing et al., 2021; Frame et al., 2021). The continually evolving machine learning field will continue to offer novel opportunities that can be harnessed for hydrological data analyses, and it is important to understand how these methods relate to classical models. Here, a basic machine learning model is benchmarked against a traditional conceptual model over a large sample of catchments as a step towards a general understanding of the use of deep learning models as a tool for the task of monthly rainfall–runoff modelling in Australian catchments.

Deep learning models have been shown in many applications to provide accurate hydrological predictions and classifications (Shen et al., 2021; Reichstein et al., 2019; Frame et al., 2022). These models are particularly useful to hydrological studies as they provide the potential to quickly add and remove predictors (Shen, 2018), scale to multiple catchments (Kratzert et al., 2018; Lees et al., 2021), automatically extract useful and abstract information from large datasets (Reichstein et al., 2019; Shen, 2018), make predictions in areas with little or no data (Kratzert et al., 2019; Majeske et al., 2022; Ouma et al., 2022; Choi et al., 2022), and extrapolate proficiently to larger hydrologic events than are seen in the training dataset (Li et al., 2021; Song et al., 2022).

The long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997) is a deep learning model that is gaining popularity in hydrology for daily time series predictions at individual basins or groups of basins due to its ability to efficiently and accurately produce predictions without requiring assumptions about the physical processes generating the data. The LSTM is a type of recurrent neural network (RNN), an extension of the multilayer perceptron that is specifically designed for use with time series data through its sequential consideration of input data. The LSTM further extends the RNN to incorporate gates and memory cells, allowing for input data to be remembered over much longer time periods and for unimportant data to be forgotten from the network. LSTMs make predictions by taking into account both the short and long temporal patterns in a time series as well as incorporating information from exogenous predictors. The data-driven detection of intercomponent, spatial, and temporal relationships by these deep learning models can be of particular benefit when attempting to represent systems in which the physical characteristics are not well defined and the intervariable relationships are complex.

The increasing popularity of the LSTM network in hydrology is due to its ability to capture the short-term interactions between rainfall and runoff, as well as the long-term patterns and interactions arising from longer-frequency drivers such as climate, catchment characteristics, land use, and changing anthropogenic activity. A growing number of publications are applying LSTMs to hydrological simulations and comparing results to process-based or conceptual modelling results.

A gap exists in the literature concerning a comparison of LSTMs and conceptual models at a monthly time step over a large sample of catchments. The conditions in which LSTMs or conceptual models may have an advantage for monthly rainfall–runoff modelling, in a general sense, are not yet understood as most machine learning applications in hydrology are individual-basin case studies (Papacharalampous et al., 2019) at a daily time step or higher frequency (e.g. Li et al., 2021; Yokoo et al., 2022). Though the LSTM has successfully matched conceptual model performance in some large-sample hydrology studies at daily time steps (e.g. in the USA, Kratzert et al., 2019, and the UK, Lees et al., 2021), it is yet unknown how these models compare to conceptual models for monthly runoff predictions in relatively dry conditions such as those characterised by Australian catchments.

Monthly hydrological models are important tools for water resources assessments as hydrologic data have historically been recorded at a monthly or longer frequency based on the schedule of manually collected measurements. Furthermore, the monthly time step is often the most practical for water resources planning with many decisions requiring only monthly streamflow predictions. With their simpler structure, fewer parameters and lower data requirements compared to daily models (Hughes, 1995; Mouelhi et al., 2006), monthly models are also useful tools to investigate uncertainty in rainfall–runoff model structure (Huard and Mailhot, 2008) and to support probabilistic seasonal streamflow forecasting systems (Bennett et al., 2017). Due to data availability, models designed to run on monthly time steps can be used across much larger areas, informing important large-scale water resources decision-making. For these reasons, generalisable models at monthly time steps are vital. However, the monthly time step is traditionally a difficult one to model as it requires extracting both short- and long-term hydrologic processes (Machado et al., 2011). In a machine learning context, the monthly time step differs significantly from the daily time step as it drastically reduces the size of the dataset available for model training (by a factor of 30). As the convergence of machine learning algorithms typically improves with larger datasets, a central research question of this paper is to explore the capacity of the LSTM algorithm to cope with the reduced amount of input data imposed by the monthly time step.

LSTMs have been used to model the rainfall–runoff relationship at a monthly time step in a limited number of localised studies, showing potential for this application on a broader scale. Ouma et al. (2022) used monthly aggregated data due to low data availability in three scarcely gauged basins in the Nzoia River basin, Kenya. Majeske et al. (2022) trained LSTMs with spatially and temporally limited data for three sub-basins of the Ohio River basin, claiming the daily time step was superfluous and cumbersome in some conditions. Lee et al. (2020) found the LSTM network to be adept at preserving long-term memory in monthly streamflow at a single station on the Colorado River over a 97-year study without any weakening of the short-term memory structure. Yuan et al. (2018) used a novel method for parameter calibration in an LSTM for monthly rainfall–runoff estimation at a single station on the Astor River basin in northern Pakistan. Song et al. (2022) found that the LSTM network better reproduced observed monthly runoff and simulated extreme runoff events than a physically based model at five discharge stations in the Yeongsan River basin in South Korea.

Large-sample hydrologic studies that assess methods on a large number of catchments are being increasingly called for in the field of hydrology (Papacharalampous et al., 2019; Mathevet et al., 2020; Gupta et al., 2014). Papacharalampous et al. (2019) compared the performance of a number of statistical and machine learning methods (no LSTM) on 2000 generated time series and over 400 real-world river discharge time series and determined that the machine learning and stochastic methods provided similar forecasting results. Mathevet et al. (2020) compared daily conceptual model performance (no machine learning) for runoff prediction in over 2000 watersheds, determining that performance depended more on catchment and climate characteristics than on model structure. Kratzert et al. (2018) found that individual daily-scale LSTMs were able to predict runoff with accuracies comparable to a baseline hydrological model for over 200 differently complex catchments. Kratzert et al. (2019) found a global LSTM trained on over 500 basins in the United States with daily data produced better individual catchment runoff predictions than conceptual and physically based models calibrated on each catchment individually. Lees et al. (2021) produced a global LSTM to model almost 700 catchments in Great Britain, finding that this model outperformed a suite of benchmark conceptual models, showing particular robustness in arid catchments and catchments where the water balance does not close. Jin et al. (2022) compared machine learning daily rainfall–runoff models to process-based models for over 50 catchments in the Yellow River basin in China. Frame et al. (2021) found that a global LSTM with climate forcing data performed similarly or outperformed a process-based model on over 500 US catchments, and that in catchments where hydrologic conditions are not well understood the LSTM was a better choice.

This study aims to determine the ability of a simple machine learning model (a single-layer LSTM) to match or exceed the performance of a conceptual monthly rainfall–runoff model (the WAPABA model; Wang et al., 2011) for predicting runoff, using inputs derived from easily accessible climate variables. The goal here is not to maximise LSTM performance to cutting-edge machine learning standards but rather to ascertain the minimum performance level that a non-expert user might expect to obtain from basic usage of an LSTM with the input data regularly used in a conceptual model. A frequently heard reason for hydrological researchers not engaging with machine learning approaches is the small data size associated with individual catchment time series, and it is of interest to examine the lower limits of data availability required to fit an LSTM with individual catchment monthly datasets.

A comparison is made on almost 500 basins across Australia, representing a wide variety of catchment types and hydro-climate conditions and with differing amounts of historical data. The prediction performance of the LSTM machine learning models is compared to the WAPABA conceptual models for each individual catchment. The proportion of catchments in which the runoff prediction performance of the conceptual model is met or exceeded by the machine learning model is determined. Conditions under which the machine learning models or the conceptual models may have an advantage are investigated, such as catchment size, flow level, and length of historical record. The central questions of this study are the following:

In general, do LSTMs match conceptual model prediction performance on Australian catchments?
Is the reduced number of data points due to the monthly time step an issue for training an LSTM?
Under what conditions is the LSTM of particular benefit or drawback (catchment size, flow level, amount of training data, etc.)?

The results of this large-sample analysis of LSTM performance over the Australian continent will assist in understanding whether LSTMs are a justifiable alternative to conceptual models for monthly rainfall–runoff prediction in Australia and similar environments, including if monthly datasets are sufficient to produce accurate predictions with the LSTM. Building on the results of this study, further benefits of deep learning could be harnessed through the creation of larger-scale models that encompass climatic, hydrologic, and anthropogenic patterns spanning multiple catchments, allowing for the sharing of information under similar conditions and the potential transfer of knowledge between data-rich and data-scarce regions or models that blend conceptual models into the machine learning network structure.

2 Data and methods

2.1 Data

The catchment and climate data used in this study are from a dataset curated by Lerat et al. (2020), comprising a selection of basins across Australia. The dataset spans all main climate regions of the continent, providing data from a variety of rainfall, aridity, and runoff regimes, as described in Table 1. Catchments where some data were marked as suspicious (e.g. high-flow data with large uncertainties, inconsistencies, suspected errors) or with more than 30 % missing data were excluded. This left 496 catchments in the study, with locations as shown in Fig. 1. The area of the individual catchments ranges from approximately 5 to 120 000 km².

Table 1Characteristics of the study catchments, over the period 1950–2020. PET refers to potential evapotranspiration.

Download Print Version | Download XLSX

https://hess.copernicus.org/articles/28/1191/2024/hess-28-1191-2024-f01

Figure 1Locations of the 496 study catchments, coloured by mean annual rainfall. The three labelled catchments, which will be used as examples in the study, represent a wet catchment (111005 in Northern Queensland), a temperate catchment (204014 in New South Wales), and a dry catchment (609012 in Western Australia).

Observed runoff data were collected from the Bureau of Meteorology's Water Data online portal (http://www.bom.gov.au/waterdata, last access: February 2022), rainfall and temperature data are from the Bureau of Meteorology's AWAP archive (Jones et al., 2009), and potential evapotranspiration data were computed by the Penman equation as part of the AWRA-L landscape model developed jointly by CSIRO and the Bureau of Meteorology (Frost et al., 2018). Rainfall, temperature, and evapotranspiration are averaged from daily grids (5 km×5 km) over each of the catchments.

The runoff records begin between January 1950 and September 1982 and end between October 2016 and June 2020. The number of runoff observations per catchment ranges from 425 to 846 with a median dataset size of 613 observations. The rainfall and potential evapotranspiration data cover the period from 1911 to 2020 continuously. The resulting dataset consists of a set of 496 time series ranging from 37 to 70 years in length, with a median record length of 51 years.

Training and testing data split

The dataset for each catchment is split into two portions for modelling – in machine learning these are referred to as “training” and “testing” sets, corresponding to the traditional “calibration” and “validation” sets used in hydrologic modelling. The training dataset runs from January 1950 (or the start of the station's record if later) to December 1995 for all catchments. The testing dataset begins in January 1996 for all catchments and ends in July 2020 (or at the end of the station's record if sooner). This split is chosen to divide the streamflow records into two relatively even periods but also to distinguish an early wet period from a testing period characterised by the Millennium Drought over south-eastern and eastern Australia (Van Dijk et al., 2013). WAPABA and LSTMs were trained and evaluated using the same data splits, giving identical durations and dataset sizes.

When split into training and testing sets at the beginning of January 1996, between 38 % and 72 % of the data from each catchment becomes the training set. The length of the training data record for individual catchments ranges from 14 to 47 years, with the smallest dataset used for training containing 172 observations. Typically in machine learning, a portion of the training data are held back to be used during the model fitting process to monitor for overfitting and to signal early stopping of training if necessary. Since the training datasets in this study are already small by machine learning standards, this has not been done as it would reduce the number of training observations significantly, as well as lead to a smaller training dataset than used in the WAPABA models. A sensitivity test has been performed to justify this choice, and it was found that training the LSTMs with 20 % of the training data reserved for this task (i.e. with the data split into training (64 %), validation (16 %), testing (20 %)) produced no apparent benefit in prediction performance.

2.2 Models

2.2.1 Deep learning time series models (LSTM)

The long short-term memory network, LSTM (Hochreiter and Schmidhuber, 1997), is an updated recurrent neural network (RNN) specifically designed for deep learning with time series data. The inclusion of gates and memory cells increases the length of time series the LSTM is able to process; three gates (input, output, and forget gates) regulate the flow of information into and out of the memory cell, determining which information from the past is to be retained and which can be forgotten. In this way, each member of the LSTM output becomes a function of the relevant input at previous time steps.

The LSTM network consists of an input layer, one or more hidden layers, and an output layer. The layers are connected by a set of updatable weights, with the same weights applying to all time steps of the data. Memory cells shadow each node on the hidden layer, retaining important information over long time periods. Each node of the input layer represents a variable of the input dataset. Observations are fed into the network along with a pre-specified number of predictor values from previous time steps (known as the lookback length or lag) which are cycled sequentially through the network. Network weights are updated by back-propagating the gradient of the error between the modelled and observed outputs. For detailed information on the mathematical functioning of the LSTM, see Goodfellow et al. (2016) and Kratzert et al. (2018).

In this study, a separate LSTM is trained for each catchment. Input to the LSTMs are monthly averaged measurements of rainfall depth (P), potential evapotranspiration (E), average maximum daily temperature over the month, and net monthly (effective) rainfall (P^∗) computed for month t by summing daily effective rainfall, as shown here:

\begin{matrix} (1) & P_{t}^{*} = \sum_{d = 0}^{d = days (t)} max (0, P_{d} - E_{d}) . \end{matrix}

Standard scaling of the input data is performed per catchment as follows:

\begin{matrix} (2) & {\tilde{X}}_{t} = \frac{X_{t} - μ_{x}}{σ_{x}}, \end{matrix}

where X_t is an input variable for month t, μ_x is its mean, and σ_x is its standard deviation over the training period. The target variable for LSTM training is monthly average runoff. Observed runoff values are scaled by taking the square root and then transforming to the range $[- 1, 1]$ per catchment, as follows:

\begin{matrix} (3) & Y_{t} = 2 \frac{\sqrt{Q_{t}} - Y_{0}}{Y_{1} - Y_{0}} - 1, \end{matrix}

where Q_t is the observed runoff for month t, and Y₀ and Y₁ are the minimum and maximum square-root-transformed flows over the training period, respectively. The square root transform is chosen to be conceptually consistent with the objective function of the WAPABA model calibration (as described below, mean absolute error of the square roots of flows). Note that the same scaling constants ( $μ_{x}, σ_{x}, Y_{0}, Y_{1}$ ) used during LSTM training are also applied to LSTM inputs and targets for the testing period. Using scaling constants only derived from the training data ensures that the training process is not incorporating any information from the testing dataset.

The loss function used for training the LSTM is the mean absolute error (MAE) performed on the transformed runoff, as follows:

\begin{matrix} (4) & L = \sum_{t} | Y_{t} - {\hat{Y}}_{t} |, \end{matrix}

where ${\hat{Y}}_{t}$ is the output of the network for month t, and Y_t is the transformed runoff for the same month.

Hyperparameters or parameters controlling the LSTM training algorithm were selected after a grid search (over 1016 separate runs) on a randomly selected catchment (14207) with a good length data record and tested on a small additional subset of catchments. As the purpose of this study was not to optimise catchment-specific predictions results, a more comprehensive hyperparameter search by catchment was deemed unnecessary. The hyperparameter space searched was the following: initial learning rate δ₀ ( $1 \times 10^{- 3}$ to $1 \times 10^{- 4}$ ), sequence (lookback or lag) length (6, 9, 12, 15, 18, 21, 24 months), and number of hidden nodes (10, 20, 30, 40, 50, 60). The hyperparameter set that performed the best predictions over the training period was selected for use in all LSTMs: 10 nodes on a single hidden layer, run with a sequence length of 6 months, and an initial learning rate δ₀ of 0.0001. Subsequent to this hyperparameter search, the effect of raising the initial learning rate for faster convergence while using input and recurrent dropout to prevent overfitting was investigated on all catchments. Empirically, and counter to our intuition, this never improved training performance, and so the initial learning rate δ₀ of 0.0001 was retained. The learning rate was allowed to vary during training with a patience of three epochs without improvement before multiplying by a factor of 0.2 to obtain a new learning rate. The dataset was divided into 400 steps per epoch for training; data were sent through the model in batches with a weight update after each (an epoch, or iteration, is concluded when the entire dataset has been run through the model once). The LSTM training was implemented using a gradient descent algorithm run for a maximum of 100 epochs. Training was set to stop early if the training error failed to decrease over five consecutive epochs. The LSTMs were implemented with TensorFlow in Python, using numeric seeds to ensure reproducible outcomes.

2.2.2 WAPABA rainfall–runoff models

The WAPABA model is a conceptual monthly rainfall–runoff model introduced by Wang et al. (2011). The model is an evolution of the Budyko framework proposed by Zhang et al. (2008) where water fluxes are partitioned using parameterised curves. The model uses two inputs, mean monthly rainfall and potential evapotranspiration, and operates in five stages. First, input rainfall is split between effective rainfall that will eventually leave the catchment and catchment consumption that replenishes soil moisture and evaporates. Second, catchment consumption is portioned between soil moisture replenishment and actual evapotranspiration. Third, effective rainfall is partitioned between surface water (fast) and groundwater (slow) stores. Fourth, the groundwater store is drained to provide a baseflow contribution. Fifth, the surface water and baseflow are added to obtain the final simulated runoff for the month. The model has five parameters described in Table 2 which interact as depicted in Fig. 2.

Table 2WAPABA model parameters.

Download Print Version | Download XLSX

https://hess.copernicus.org/articles/28/1191/2024/hess-28-1191-2024-f02

Figure 2WAPABA conceptual model schematic.

Deep learning for monthly rainfall–runoff modelling: a large-sample comparison with conceptual models across Australia

2.1 Data

Training and testing data split

2.2 Models

2.2.1 Deep learning time series models (LSTM)

2.2.2 WAPABA rainfall–runoff models

2.3 Performance evaluation

Comparison of performance metrics between catchments using normalised indexes

3.1 Example prediction results

3.2 Large-sample performance summary

3.3 Performance differences at individual catchments

3.4 Prediction performance comparison by catchment or time series characteristics

3.4.1 Catchment size

3.4.2 Flow level

3.4.3 Poorly predicted catchments

3.4.4 Generalising to changing conditions

3.4.5 Historical record length and dataset size

Metrics and models

Future research directions

Feed-forward neural network