Ingesting near-real-time observation data is a critical component of many operational hydrological forecasting systems. In this paper, we compare two strategies for ingesting near-real-time streamflow observations into long short-term memory (LSTM) rainfall–runoff models: autoregression (a forward method) and variational data assimilation. Autoregression is both more accurate and more computationally efficient than data assimilation. Autoregression is sensitive to missing data; however, an appropriate (and simple) training strategy mitigates this problem. We introduce a data assimilation procedure for recurrent deep learning models that uses backpropagation to make the state updates.

Long short-term memory networks (LSTMs) are currently the most accurate and extrapolatable streamflow models available

Autoregression (AR) has been a core component of statistical hydrology for decades

In contrast with statistical autoregressive models, conceptual and process-based rainfall–runoff models typically use data assimilation (DA) to ingest near-real-time streamflow observations. There are a number of different DA methods used in the Earth sciences

Like dynamical systems models, LSTMs have a recurrent state. This means that it is possible to use DA with LSTMs. This would allow ingesting near-real-time observation data without AR, making it possible to train LSTM models that are able to leverage near-real-time streamflow data where and when available. Further, LSTMs are trained with backpropagation, which means that there already exists a gradient chain through the model's tensor network that can be used for implementing certain types of inverse methods required for DA. Similar principles have been applied to update other features in deep learning models for a variety of purposes. For example, backpropagation to update inputs and specific layers has been used as an analytical tool
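The core idea, nudging a recurrent state by gradient descent so that the model output matches an observation, can be illustrated with a toy scalar recurrence (a hypothetical stand-in for the LSTM cell state, not the model used in this paper; a real implementation would obtain the gradient by backpropagation through the network):

```python
# Toy sketch of gradient-based state updating: adjust a recurrent state so
# that the model's output matches an observation.

A, B = 0.9, 0.5  # recurrence coefficients (arbitrary choices)

def step(state, forcing):
    """One step of a scalar linear recurrence; the output is the new state."""
    return A * state + B * forcing

def update_state(state, forcing, obs, lr=0.5, n_iter=50):
    """Gradient descent on the squared mismatch between output and observation."""
    for _ in range(n_iter):
        residual = step(state, forcing) - obs
        grad = 2.0 * residual * A  # analytic gradient of residual**2 w.r.t. state
        state -= lr * grad
    return state

state, forcing, obs = 1.0, 0.2, 2.0
updated = update_state(state, forcing, obs)
# After the update, the model output closely matches the observation.
```

In the LSTM case, the analytic gradient above is replaced by the gradient that backpropagation already provides through the model's tensor network.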

The major concern with statistical approaches (like AR) is that they often do not generalize to new locations or to situations that are dissimilar to the training data

The purpose of this paper is to provide insight into trade-offs between DA and AR for leveraging potentially sparse near-real-time streamflow observation data. AR is easier to implement than DA (simply train a model with autoregressive inputs), and it is also more computationally efficient because it does not require any type of inverse procedure during prediction (e.g., variational optimization, ensembles for estimating conditional probabilities, or high-dimensional particle sampling). Inverse procedures used for DA not only incur significant computational expense but are also sensitive to (hyper)parameters related to things like error distributions, regularization coefficients, and resampling procedures

As a caveat, DA is a large category of very diverse methods. We prefer to define

To allow for direct comparison with previous studies, we tested autoregression and backpropagation-based variational data assimilation using an open community hydrologic benchmark data set that is curated by the US National Center for Atmospheric Research (NCAR). This catchment attributes and meteorological large sample data set

CAMELS includes daily discharge data from the United States Geological Survey (USGS) Water Information System, which are used as training and evaluation target data. CAMELS also includes several daily meteorological forcing data sets (Daymet, NLDAS, Maurer) that are used as model inputs. Following

In total, we trained 46 LSTM models. Twenty-six (26) of these models were trained and tested using a sample split in time (i.e., some years of data were used for training and some years for testing, but all CAMELS basins contributed training data to all models). Twenty (20) of these models were trained and tested using a cross-validation split in space (i.e., some basins were withheld from training and used only for testing). The latter mimics a situation where no streamflow data are available in a given location for training (i.e., an ungauged basin), but data become available at some point during inference. The purpose of these basin-split experiments is less to test a likely real-world scenario than to highlight how the different approaches learn to generalize.

We trained two classes of models using both the time-split and basin-split approaches: simulation models and AR models. Simulation models do not receive lagged streamflow inputs and AR models do. Simulation models are used for baseline benchmarking and also for DA. One (1) simulation model was trained for the time-based train/test split, meaning that a single model was trained on all training data from all 531 basins. Ten (10) simulation models were trained for the basin split – in that case we used a k-fold cross validation approach with

We trained time-split AR models at five different lag times, meaning that the autoregressive streamflow was lagged by 1, 2, 4, 8, or 10 d. We also trained with different fractions of the streamflow data record withheld (as inputs) during the training period (0 %, 25 %, 50 %, 75 %, 100 %). This means that a total of 25 AR models were trained on a time-based train/test split. The reason for training AR models with different missing data fractions becomes apparent when we present results: models trained with some missing lagged streamflow inputs perform better when there are missing data during inference. Lagged streamflow data were withheld at these fractions as random sequences of missing data with mean sequence length of 5 d. For a full description of how data were withheld, see Appendix

We trained and tested basin-split AR models only at a lag time of 1 d and only with a missing data fraction of 50 %. This means we trained a total of 10 basin-split AR models.

We did not consider other types of missing data (i.e., meteorological forcings or basin attributes) because they are not central to the question at hand (how best to use lagged streamflow observations where those are available), and missing meteorological inputs are not common in operational models: most operational hydrology models require meteorological data at every time step, and most meteorological data sets are temporally dense at their native resolution.

DA was performed on the trained simulation models – both the time-split and basin-split models. We performed DA on the 1 time-split simulation model with the same missing data fractions as the time-split AR models: 0 %, 25 %, 50 %, and 75 % (100 % missing data with DA is equivalent to the simulation model without DA). We also performed DA on the 10

Daily meteorological forcing data and static catchment attributes were used as input features for all models, and daily streamflow records were used as training targets with a normalized squared error loss function that does not depend on basin-specific mean discharge (i.e., to ensure that large and/or wet basins are not over-weighted in the loss function):
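A minimal sketch of such a basin-wise normalized squared error (the normalization by a per-basin discharge standard deviation and the `eps` stabilizer are assumptions patterned on common practice in this literature, not necessarily the exact loss used here):

```python
def basin_normalized_mse(y_hat, y_obs, basin_std, eps=0.1):
    """Mean squared error normalized by per-basin discharge variability so
    that large and/or wet basins do not dominate the training loss."""
    return sum(((p - o) / (basin_std + eps)) ** 2
               for p, o in zip(y_hat, y_obs)) / len(y_obs)

# A 10 % error in a large basin and a 10 % error in a small basin incur
# losses of the same order of magnitude.
loss_big = basin_normalized_mse([110.0], [100.0], basin_std=50.0)
loss_small = basin_normalized_mse([1.1], [1.0], basin_std=0.5)
```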

All models were trained using the training and test procedures outlined by

Median NSE (Nash–Sutcliffe efficiency) scores of AR models trained and tested with different fractions of lagged streamflow data withheld. The two subplots show the same results, but organized by the amount of lagged streamflow data withheld during training vs. during testing.

The strategy that we used to deal with missing data in AR models was to replace missing lagged streamflow data with model-predicted streamflow data at the same lag time. This is related to a standard machine learning (ML) technique for training recursive models, discussed in Appendix
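A minimal sketch of this substitution at a lag of 1 d, with a hypothetical one-step model `model_step` (in practice a binary flag indicating whether the lagged input was observed can also be passed as an additional input):

```python
def run_autoregressive(model_step, forcings, lagged_obs, init_pred=0.0):
    """Run an AR model over a time series, substituting the model's own
    prediction for missing (None) lagged streamflow observations."""
    preds, prev_pred = [], init_pred
    for x_t, q_lag in zip(forcings, lagged_obs):
        # Use the observation where available; otherwise fall back on the
        # model's prediction at the same lag.
        ar_input = q_lag if q_lag is not None else prev_pred
        pred = model_step(x_t, ar_input)
        preds.append(pred)
        prev_pred = pred
    return preds

# Usage with a stand-in model (the real model would be the trained LSTM).
toy_model = lambda x, q: 0.6 * x + 0.4 * q
preds = run_autoregressive(toy_model, [1.0, 1.0, 1.0, 1.0],
                           [0.5, None, None, 0.8])
```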

Comparison of per-basin NSEs with an observation lag of 1 d:

The theory behind using backpropagation through tensor networks to perform variational data assimilation is given in Appendix

Data assimilation was performed during the test period on the “simulation” LSTMs outlined in Sect.

Cumulative density function (CDF) plots of per-basin NSE scores.

We used the ADAM optimizer for data assimilation with a dynamic learning rate that started at 0.1 and decreased by a factor of 0.9 (90 %) each time the update step loss failed to decrease. We used an assimilation window of 5 time steps (updating the cell state at time steps
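The scheduler logic can be sketched as follows (plain gradient descent on a toy scalar loss stands in for ADAM and for the real assimilation loss over the update window):

```python
def assimilate(loss_and_grad, state, lr=0.1, decay=0.9, n_iter=100):
    """Gradient-based state update with a learning rate that shrinks by
    `decay` each time the loss fails to decrease."""
    best_loss = float("inf")
    for _ in range(n_iter):
        loss, grad = loss_and_grad(state)
        if loss >= best_loss:
            lr *= decay  # no improvement: reduce the step size
        else:
            best_loss = loss
        state -= lr * grad
    return state

# Toy mismatch between prediction and observation as a function of a
# scalar "cell state", with its minimum at 3.0.
loss_fn = lambda s: ((s - 3.0) ** 2, 2.0 * (s - 3.0))
final_state = assimilate(loss_fn, 0.0)
```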

Following previous studies (cited in Sect.

AR models were trained with fractions of missing data (withheld randomly) between 0 % and 100 %, and tested on data with different fractions of missing data. This was done to understand what effect the training data fraction has on performance. After choosing an appropriate fraction of missing data for training AR models, these models were trained with streamflow inputs that had varying lag times (between 1 and 10 d). Both AR and DA models were tested with different fractions of (randomly) withheld lagged streamflow input data and different lag times; however, all metrics were calculated on all streamflow observations within each basin during the entire test period, even when some of the lagged streamflow data were withheld as inputs.

Median NSE over 531 basins of four models (simulation, AR trained with and without holdout data, and DA) as a function of lag time in days and fraction of missing lagged streamflow data in the test period. The AR and DA models here used 0 % missing data during inference. Notice the different scales on the

Overview of evaluation metrics.

Figure

For the remainder of our experiments (including basin-split experiments), we chose to benchmark AR models trained with 50 % missing lagged streamflow inputs. This represents a compromise between training with too many and too few missing data: these models only degrade below the (median) accuracy of the pure simulation model when the fraction of missing data during inference reaches 90 %.

Table

Median performance metrics over 531 CAMELS basins.

As a point of comparison with previous work,

Comparison of NSE scores between time-split and basin-split models at a 1 d lag time with no training or inference (test) holdout. The left subplot shows median scores (over 531 basins) and the 80 % interval (10th to 90th percentiles), and the right subplot shows distributions (over 531 basins).

Figure

Figure

Results in Fig.

Data assimilation is necessary in order to use certain types of data to “drive” dynamical systems models. For example, if a model is based on some conceptual understanding of a physical system (like a conceptual process-based rainfall–runoff model), then the only way to use observations of system states or outputs is through some type of inverse method. DA is a class of inverse methods that project information onto the states of a dynamical systems model. DA is often complicated to set up (e.g., choosing parameters to represent uncertainty distributions, sampling procedures), and often requires simplifying assumptions that cause significant information loss

It is worth noting that in the experiments presented here, running DA for inference on the 10 yr test period in 531 basins required approximately

DA has one advantage: it does not require that we choose how to withhold inputs to train the model. In cases where there are no target data during inference, AR models have potential to perform worse than a simulation model (see Fig.

To reiterate from the introduction, both DA and AR are broad classes of methods. We do not know of any benchmarking study in the hydrology literature that directly compares different DA methods over large, standardized, public data sets. Most DA methods are based on some type of inverse algorithm, which causes information loss, e.g.,

Streamflow input data were sampled at different missing data fractions for training and testing AR models and for DA. We masked continuous periods of missing data by using two Bernoulli samplers. We sampled “downshifts” and “upshifts” through a time series at different rates (
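A sketch of such a two-state sampler (the rate formulas are assumptions, chosen so that missing runs are geometrically distributed with the target mean length and so that the stationary missing fraction matches the requested value):

```python
import random

def missing_mask(n, frac_missing, mean_run=5.0, seed=0):
    """Boolean mask (True = missing) built from two Bernoulli samplers:
    'downshifts' into and 'upshifts' out of a missing state, giving
    geometrically distributed runs of missing data."""
    rng = random.Random(seed)
    if frac_missing <= 0.0:
        return [False] * n
    if frac_missing >= 1.0:
        return [True] * n
    p_up = 1.0 / mean_run                                # mean missing-run length = mean_run
    p_down = p_up * frac_missing / (1.0 - frac_missing)  # stationary missing fraction = frac_missing
    mask, missing = [], rng.random() < frac_missing
    for _ in range(n):
        mask.append(missing)
        if missing:
            missing = rng.random() >= p_up   # upshift out of the missing state
        else:
            missing = rng.random() < p_down  # downshift into the missing state
    return mask
```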

The strategy we use for handling missing data in AR models is loosely related to a class of techniques commonly used to train recursive neural networks called

Our strategy for dealing with missing data in AR models does not replace the cell state (the recursive state of the LSTM) during training, but does run the risk of over-training to observed lagged streamflow inputs in cases where these data are sparse during inference (Fig.

The method of data assimilation that we will use in this paper is a type of variational assimilation. Variational assimilation works as follows
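In its generic form, variational assimilation seeks a state that balances fidelity to a background (prior) estimate against fit to the observations. With the standard DA notation (background state $x_b$, observations $y$, observation operator $\mathcal{H}$, and background and observation error covariances $\mathbf{B}$ and $\mathbf{R}$; these symbols are generic, not necessarily those defined later in this appendix), the cost function is

```latex
J(x) = \frac{1}{2}\,(x - x_b)^{\mathsf{T}} \mathbf{B}^{-1} (x - x_b)
     + \frac{1}{2}\,\bigl(y - \mathcal{H}(x)\bigr)^{\mathsf{T}} \mathbf{R}^{-1} \bigl(y - \mathcal{H}(x)\bigr),
```

and the analysis state is the minimizer of $J$. In our case, that minimization is carried out by gradient descent using backpropagation through the LSTM.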

Notating observations and states as drawn from distributions

The LSTM is described by the following equations:
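For reference, the standard LSTM cell equations (input gate $i_t$, forget gate $f_t$, cell input $g_t$, output gate $o_t$, cell state $c_t$, hidden state $h_t$; the weight matrices $W$, $U$ and biases $b$ are the usual generic symbols):

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right),\\
g_t &= \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right),\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t,\\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
```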

Model-predicted streamflow values come from a

In any deep learning model, the various weights,

Notice that gradient chains like Eq. (

Gradient chains like Eq. (

The loss functions used for assimilation (i.e.,

Coefficients

The missed peaks metric is calculated by first locating all peaks in the observation and simulation time series that satisfy the following two criteria: (1) observed and simulated peaks must be at least 30 d apart, and (2) peaks must be above the 80th flow percentile in a given basin. Any peak in the observed time series that meets these two criteria and for which there is not a peak in the simulated time series that also meets these criteria within 1 d (before or after) is considered a missed peak. We report the fraction of observed peaks that are missed.
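A sketch of this metric (the greedy highest-first peak selection and the percentile convention are implementation assumptions; the 30 d separation, 80th flow percentile threshold, and 1 d matching window follow the description above):

```python
def find_peaks(series, min_distance=30, pct=0.80):
    """Local maxima above the given flow percentile, kept greedily
    (highest first) so that retained peaks are >= min_distance apart."""
    threshold = sorted(series)[int(pct * (len(series) - 1))]
    candidates = [i for i in range(1, len(series) - 1)
                  if series[i] >= series[i - 1]
                  and series[i] > series[i + 1]
                  and series[i] > threshold]
    peaks = []
    for i in sorted(candidates, key=lambda j: -series[j]):
        if all(abs(i - j) >= min_distance for j in peaks):
            peaks.append(i)
    return sorted(peaks)

def missed_peak_fraction(obs, sim, window=1):
    """Fraction of qualifying observed peaks with no qualifying simulated
    peak within +/- `window` time steps."""
    obs_peaks, sim_peaks = find_peaks(obs), find_peaks(sim)
    missed = [i for i in obs_peaks
              if not any(abs(i - j) <= window for j in sim_peaks)]
    return len(missed) / len(obs_peaks) if obs_peaks else 0.0
```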

Hyperparameter tuning for data assimilation was done with a validation period (1980–1989) that is distinct from both the training (1999–2008) and test periods (1989–1999) outlined in Sect.

Data assimilation hyperparameter tuning grid search and final values.

We used a learning rate scheduler that dropped the learning rate every

In our setup,

We tested whether it was possible to predict where DA or AR might offer the most benefit by using CAMELS catchment attributes

Table of static catchment attributes. Descriptions taken from

Results for this analysis are given in Fig.

Snow fraction was the second strongest predictor of skill for both AR and DA, whereas this was not a strong predictor of skill in the pure simulation model.

The basin attributes that were the most important for determining NSE

Scatterplots,

Figure

The performance difference (NSE score) between (top) autoregression and the baseline simulation and (bottom) data assimilation and the baseline simulation.

This appendix contains figures similar to Fig.

Same as Fig.

Same as Fig.

Same as Fig.

Same as Fig.

Same as Fig.

Same as Fig.

Plug-and-play code to reproduce the experiments reported in this paper is available at

DK, AKS, and GSN had the original idea for backpropagation-based data assimilation. All authors contributed to experimental design. GSN wrote the data assimilation code, performed hyperparameter tuning, and conducted all experiments except correlations with basin attributes, which were done by JMF. MG suggested implementing the input data flag for autoregression (which was transformative for AR skill). GSN, MG, DK, and FK integrated the data assimilation code into the NeuralHydrology codebase. GSN wrote the paper with contributions from all authors.

The contact author has declared that none of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Frederik Kratzert was supported by a Google Faculty Research Award (PI: Sepp Hochreiter). Martin Gauch was supported by the Linz Institute of Technology DeepFlood project. Daniel Klotz was supported by Verbund AG.

This paper was edited by Erwin Zehe and reviewed by Ralf Loritz and one anonymous referee.