Deep learning is becoming an increasingly important way to produce accurate hydrological predictions across a wide range of spatial and temporal scales. Uncertainty estimations are critical for actionable hydrological prediction, and while standardized community benchmarks are becoming an increasingly important part of hydrological model development and research, similar tools for benchmarking uncertainty estimation are lacking. This contribution demonstrates that accurate uncertainty predictions can be obtained with deep learning. We establish an uncertainty estimation benchmarking procedure and present four deep learning baselines. Three baselines are based on mixture density networks, and one is based on Monte Carlo dropout. The results indicate that these approaches constitute strong baselines, especially the former ones. Additionally, we provide a post hoc model analysis to put forward some qualitative understanding of the resulting models. The analysis extends the notion of performance and shows that the model learns nuanced behaviors to account for different situations.

A growing body of empirical results shows that data-driven models perform well in a variety of environmental modeling tasks

The majority of machine learning (ML) and deep learning (DL) rainfall–runoff studies do not provide uncertainty estimates

Currently there exists no single, prevailing method for obtaining distributional rainfall–runoff predictions. Many, if not most, methods take a basic approach where a deterministic model is augmented with some uncertainty estimation strategy. This includes, for example, ensemble-based methods, where the idea is to define and sample probability distributions around different model inputs and/or structures

It is possible to fit DL models such that their own representations intrinsically support estimating distributions while accounting for strongly nonlinear interactions between model inputs and outputs. In this case, there is no requirement to fall back on deterministic predictions that would need to be sampled, perturbed, or inverted. Several approaches to uncertainty estimation for DL have been suggested

Our primary goal is to benchmark several methods for uncertainty estimation in rainfall–runoff modeling with DL. We demonstrate that DL models can produce statistically reliable uncertainty estimates using approaches that are straightforward to implement. We adapted the LSTM rainfall–runoff models developed by

Our secondary objective is to help advance the state of community model benchmarking to include uncertainty estimation. We do so by outlining a basic skeleton for an uncertainty-centered benchmarking procedure. The reason is that it was difficult to find suitable benchmarks for the DL uncertainty estimation approaches we wanted to explore. Ad hoc benchmarking and model intercomparison studies are common

To carve out a skeleton for a benchmarking procedure, we followed the philosophy outlined by

Nonetheless, we believe that good benchmarking is not something that can be done in a responsible way by a single contribution

The current study can be seen as a starting point for this process: we base the setup for a UE benchmark on a large, publicly curated, open dataset that is already established for other benchmarking efforts – namely the Catchment Attributes and MEteorological Large Sample (CAMELS) dataset. Section

CAMELS

We used the same 531 basins from the CAMELS dataset (Fig.

Overview map of the CAMELS basins. The plot shows the mean precipitation estimates for the 531 basins originally chosen by

Benchmarking requires metrics to evaluate. No global, unique metric exists that is able to fully capture model behavior. In fact, even multiple metrics often miss important aspects. The choice of metrics also necessarily depends on the goal of the benchmarking exercise. Post hoc model examination provides a partial remedy to these deficiencies by making the model behavior more tangible. Still, as of now, no canonical set of metrics exists. The ones we employed should be seen as a bare minimum and a starting point. The metrics will need to be adapted and refined over time and from application to application.

The minimal metrics for benchmarking uncertainty estimations need to test whether the distributional predictions are “reliable” and have “high resolution”

All metrics that we report throughout the paper are evaluated on the test data only. With that we follow the thoughts outlined by

Probability plots

A deficit of the probability plot is its coarseness, since it represents an aggregate over time and basins. As such, it provides a general overview but necessarily neglects many aspects of hydrological importance. Many expansions of the analytical range are possible. One that suggested itself was to examine the deviations from the

To motivate why further metrics are required on top of the reliability plot, it is useful to look at the following observation: infinitely many models produce perfect probability plots. One edge-case example is a model that simply ignores the inputs and produces the unconditional empirical data distribution at every time step. Another edge-case example is a hypothetical “perfect” model that produces delta distributions at exactly the observations every time. Both of these models have precision that exactly matches accuracy, and these two models could not be distinguished from each other using a probability plot. Similarly, a model which is consistently under-confident for low flows can compensate for this by being over-confident for higher flows. Thus, to better assess the uncertainty estimations, at least one further dimension of the problem has to be checked: the resolution.
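The first edge case can be made concrete with a small synthetic sketch. It is not the benchmark code: the toy data-generating process, the helper names (`probability_plot`, `max_deviation`, `simulate`), and the threshold grid are assumptions chosen for illustration, and a well-specified conditional model stands in for the hypothetical delta-distribution model. Both models stay close to the 1:1 line of the probability plot, yet the climatological one issues much wider distributions.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, pstdev


def probability_plot(pit_values, thresholds):
    """Observed relative frequency of PIT values at or below each nominal probability."""
    ordered = sorted(pit_values)
    return [bisect_right(ordered, t) / len(ordered) for t in thresholds]


def max_deviation(pit_values, thresholds):
    """Largest absolute distance of the probability plot from the 1:1 line."""
    observed = probability_plot(pit_values, thresholds)
    return max(abs(o - t) for o, t in zip(observed, thresholds))


rng = random.Random(0)


def simulate(n):
    """Toy process (assumption): flow = input signal + unit Gaussian noise."""
    signal = [rng.uniform(0.0, 10.0) for _ in range(n)]
    flow = [s + rng.gauss(0.0, 1.0) for s in signal]
    return signal, flow


_, train_flow = simulate(5000)            # only used to build the climatology
test_signal, test_flow = simulate(2000)

# Model A: the true conditional density N(signal, 1) at every time step.
pit_conditional = [NormalDist(s, 1.0).cdf(y) for s, y in zip(test_signal, test_flow)]

# Model B: ignores the input and always issues the empirical training distribution.
climatology = sorted(train_flow)
pit_climatology = [bisect_right(climatology, y) / len(climatology) for y in test_flow]

thresholds = [i / 10 for i in range(1, 10)]
dev_conditional = max_deviation(pit_conditional, thresholds)
dev_climatology = max_deviation(pit_climatology, thresholds)
spread_conditional = 1.0                  # standard deviation of Model A's prediction
spread_climatology = pstdev(train_flow)   # standard deviation of Model B's prediction

print(f"max deviation  conditional: {dev_conditional:.3f}  climatology: {dev_climatology:.3f}")
print(f"predictive std conditional: {spread_conditional:.2f}  climatology: {spread_climatology:.2f}")
```

The probability plot alone cannot separate the two models in this toy setting; only a measure of spread does.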

Overview of the benchmarking metrics for assessing model resolution. Each metric is applied to the distributional streamflow predictions at each individual time step and then aggregated over all time steps and basins. All metrics are defined in the interval [0,

To assess the resolution of the provided uncertainty estimates, we used a group of metrics (Table

We tested four strategies for uncertainty estimation with deep learning. These strategies fall into two broad categories: mixture density networks (MDNs) and MCD. We argue that these approaches represent a useful set of baselines for benchmarking.

The first class of approaches uses a neural network to mix different probability densities. This class is commonly referred to as MDNs

Illustration of the concept of a mixture density using Gaussian distributions. Plot

Illustration of a mixture density network. The core idea is to use the outputs of a neural network to determine the mixture weights and parameters of a mixture of densities (see Fig.

The output of an MDN is an estimation of a conditional density, since the mixture directly depends on a given input (Fig.

In this study, we tested three different MDN approaches.

One can read this enumeration as a transition from simple to complex: we start with Gaussian mixture components, then replace them with ALD mixture components, and lastly transition from a fixed number of mixture components to an implicit approximation. There are two reasons why we argue that the more complex MDN methods might be more promising than a simple GMM. First, error distributions in hydrologic simulations often have heavy tails. A Laplacian component lends itself to thicker-tailed uncertainty (Fig.

Characterization of distributions that are used as mixture components in our networks. Plot

MCD provides an approach to estimate a basic form of epistemic uncertainty. In the following we provide the intuition behind its application.

Dropout is a regularization technique for neural networks but can also be used for uncertainty estimation

Schematic depiction of the dropout concept.
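The mechanics of MCD can be illustrated with a minimal sketch, assuming a tiny feed-forward network with random weights in place of the trained LSTM. All names (`make_network`, `forward`, `mc_dropout_predict`), the layer sizes, and the dropout rate are hypothetical. The only point is that dropout stays active at prediction time, so repeated forward passes for the same input yield a sample of outputs whose spread serves as an uncertainty estimate.

```python
import math
import random
from statistics import mean, stdev


def make_network(n_in, n_hidden, rng):
    """Random weights standing in for a trained network (assumption)."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    return w1, w2


def forward(x, network, drop_rate, rng):
    """One forward pass with dropout kept active on the hidden layer."""
    w1, w2 = network
    keep = 1.0 - drop_rate
    output = 0.0
    for row, w_out in zip(w1, w2):
        hidden = math.tanh(sum(w * xi for w, xi in zip(row, x)))
        if rng.random() < keep:                # this hidden unit survives the pass
            output += w_out * hidden / keep    # inverted-dropout rescaling
    return output


def mc_dropout_predict(x, network, drop_rate, n_passes, rng):
    """Repeat the stochastic forward pass and summarize the resulting sample."""
    samples = [forward(x, network, drop_rate, rng) for _ in range(n_passes)]
    return mean(samples), stdev(samples), samples


rng = random.Random(1)
network = make_network(n_in=3, n_hidden=16, rng=rng)
x = [0.2, -1.0, 0.5]

center, spread, samples = mc_dropout_predict(x, network, drop_rate=0.25, n_passes=100, rng=rng)
_, spread_off, _ = mc_dropout_predict(x, network, drop_rate=0.0, n_passes=100, rng=rng)
print(f"with dropout: mean={center:.3f}, std={spread:.3f}; without dropout: std={spread_off:.3f}")
```

With the dropout rate set to zero every pass is identical and the spread collapses, which is why the masks have to remain switched on when sampling.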

All models are based on the LSTMs from

In short, our setting was the following. Each model takes a set of meteorological inputs (namely, precipitation, solar radiation, minimum and maximum daily temperature, and vapor pressure) from a set of products (namely, NLDAS, Maurer, and DayMet). As in our previous studies, a set of static attributes is concatenated to the inputs

For all MDNs we introduced an additional hidden layer to provide more flexibility and adapted the network as required (see Appendix

To make the benchmarking procedure work at the most general level, we employed the setup depicted in Fig.

Schemata of the general setup. Vertically the procedure is illustrated for two arbitrary basins,

We performed a post hoc model examination as a complement to the benchmarking to avoid potential blind spots. The analysis has three parts, each one associated with a specific property.

To address accuracy, we used standard performance metrics applied to single-point predictions (such as the Nash–Sutcliffe efficiency, NSE, and the Kling–Gupta efficiency, KGE; Table

Overview of the different single-point prediction performance metrics. The table is adapted from

To get an impression of the model consistency, we looked at the behavioral properties of the mixture densities themselves. The goal was to get some qualitative understanding about how the mixture components are used in different situations. As a prototypical example of this kind of examination, we refer to the study of

MDNs allow a quality check of the given distributional predictions. The basic idea is that predicted distributions are themselves estimations. MDNs provide an estimation of the aleatoric uncertainty in the data, and MCD provides a basic estimation of the epistemic uncertainty. These estimations are not the uncertainties themselves and are therefore subject to uncertainty in their own right. This holds, of course, for all forms of uncertainty estimates, not just for MDNs. However, MDNs provide us with single-point predictions of the

Concretely, we examined how a second-order effect on the estimated uncertainty can be checked with the MCD approach (which provides estimations for some form of epistemic uncertainties), as it can be layered on top of the MDN approaches (which provide estimations of the aleatoric uncertainties). This means that the Gaussian process interpretation by

The probability plots for each model are shown in Fig.

Probability plot benchmark results for the 10-year test period over 531 basins in the continental US. Subplot

Figure

Kernel densities of the basin-wise deviation from the

Benchmark statistics for model precision. These metrics were applied to the distributional predictions at individual time steps. The lowest metric per row is marked in bold. Lower values are better for all statistics (conditional on the model having high reliability). This table also provides statistics of the empirical distribution from the observations (“Obs”) aggregated over the basins as a reference, which are not directly comparable with the model statistics since “Obs” represents an unconditional density, while the models provide a conditional one. The “Obs” statistics should be used as a reference to contextualize the statistics from the modeled distributions.

Lastly, Table

Table

Evaluation of different single-point prediction metrics. Best performance is marked in bold. Information about the inter-basin variability (dispersion) is provided in the form of the standard deviation whenever the mean is used for aggregation and in the form of the distance to the 25 % and 75 % quantiles when the median is used for aggregation.
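The aggregation described in this caption can be sketched as follows. The sketch assumes that a metric such as the NSE has already been computed per basin; the function names and the basin values are made up for illustration, and the use of the sample standard deviation and of "inclusive" quartiles are assumptions, since the source does not specify them.

```python
from statistics import mean, median, quantiles, stdev


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency of one simulated series against observations."""
    mean_obs = mean(observed)
    numerator = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - numerator / denominator


def summarize_by_mean(values):
    """Mean over basins, with the standard deviation as dispersion."""
    return mean(values), stdev(values)


def summarize_by_median(values):
    """Median over basins, with the distances to the 25 % and 75 % quantiles."""
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    return center, center - q25, q75 - center


# Hypothetical basin-wise NSE values, not results from the study.
basin_nse = [0.5, 0.6, 0.7, 0.8, 0.9]
print("mean, std:", summarize_by_mean(basin_nse))
print("median, distance to q25, distance to q75:", summarize_by_median(basin_nse))
```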

Among the uncertainty estimation approaches, the models with asymmetric mixture components (CMAL and UMAL) perform best. UMAL provided the best point estimates. This is in line with the high resolutions of the uncertainty estimation benchmark: the sharpness makes the mean a better predictor of the likelihood's maximum and indicates again that the approach trades reliability for accuracy. That said, even with our naive approach for obtaining single-point estimations (i.e., simply taking the mean), both CMAL and UMAL manage to outperform the model that is optimized for single-point predictions with regard to some metrics. This suggests that it could make sense to train a model to estimate distributions and then recover the best estimates. One possible reason why this might be the case is that single-point loss functions (e.g., MSE) define an implicit probability distribution (e.g., minimizing an MSE loss is equivalent to maximizing a Gaussian likelihood with fixed variance). Hence, using a more nuanced loss function (i.e., one that is the likelihood of a multimodal, asymmetrical, heterogeneous distribution) can improve performance even for the purpose of making non-distributional estimates. In fact, it is reasonable to expect that the results of the MDN approaches can be improved even further by using a more sophisticated strategy for obtaining single-point predictions (e.g., searching for the maximum of the likelihood). The single-point prediction LSTM (

Illustration of second-order uncertainties estimated by using MCD to sample the parameters of the CMAL approach. The upper subplot shows an observed hydrograph and predictive distributions as estimated by CMAL. The lower subplots show the CMAL distributions and distributions from 25 MCD samples of the CMAL model at three selected time steps (indicated by black ovals shown on the hydrograph). The abbreviation “main pred” marks the unperturbed distributional predictions from the CMAL model.

Figure

In this experiment we want to demonstrate an avenue for studying higher-order uncertainties with CMAL. Intuitively, the distributional predictions are estimations themselves and are thus subject to uncertainty. Since the distributional predictions already provide estimates for the prediction uncertainty, we can think of the uncertainty regarding the parameters and weights of the components as a second-order uncertainty. In theory, uncertainties of even higher order are conceivable. Here, as already described in the Methods section, we use MCD on top of the CMAL approach to “stochasticize” the weights and parameters and expose the uncertainty of the estimations. Figure

This section gives an overview of the computational demand required to compute the different uncertainty estimations. All of the reported execution times were obtained on an NVIDIA P100 GPU (16 GB RAM), using the PyTorch library

Generating all the needed samples for the evaluation with MCD and a batch size of 256 would take approximately 36.1 d (since 7500 samples have to be generated for 531 basins and 10 years at a daily resolution). In practice, we could shorten this time to under a week by using considerably larger batch sizes and distributing the computations for different basins over multiple GPUs. In comparison, computing the same number of samples by re-executing the CMAL model would take around 17.4 d. In practice, however, only a single run of the CMAL model is needed, since MDNs provide us with a density estimate from which we can directly sample in a parallel fashion (and without needing to re-execute the model run). Thus, the CMAL model, with a batch size of 256, takes only

Our basic benchmarking scheme allowed us to systematically pursue our primary objective – to examine deep learning baselines for uncertainty predictions. In this regard, we gathered further evidence that deep-learning-based uncertainty estimation for rainfall–runoff modeling is a promising research avenue. The explored approaches are able to provide fully distributional predictions for each basin and time step. All predictions are dynamic: the model adapts them according to the properties of each basin and the current dynamic inputs, e.g., temperature or rainfall. Since the predictions are inherently distributional, the predictions can be further examined and/or reduced to a more basic form, e.g., sample, interval, or point predictions.
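How a distributional prediction is reduced to simpler forms can be sketched for a single time step. The example assumes a Gaussian mixture with made-up weights, means, and standard deviations; in the study these parameters would come from the network, and the helper names are hypothetical. The mean is used as the point prediction, matching the naive choice described for the single-point evaluation.

```python
import random
from statistics import mean


def sample_mixture(weights, means, sigmas, n_samples, rng):
    """Draw samples by first picking a component, then sampling from it."""
    components = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    return [rng.gauss(means[c], sigmas[c]) for c in components]


def empirical_quantile(sorted_sample, q):
    """Simple order-statistic quantile of an already sorted sample."""
    index = min(int(q * len(sorted_sample)), len(sorted_sample) - 1)
    return sorted_sample[index]


# Hypothetical mixture for one basin and one day (illustrative values only).
weights = [0.6, 0.3, 0.1]
means = [1.0, 2.5, 6.0]
sigmas = [0.2, 0.5, 1.5]

rng = random.Random(42)
draws = sorted(sample_mixture(weights, means, sigmas, 10_000, rng))

point_prediction = mean(draws)                   # point form
lower = empirical_quantile(draws, 0.05)          # interval form: 90 % central interval
upper = empirical_quantile(draws, 0.95)
analytic_mean = sum(w * m for w, m in zip(weights, means))

print(f"point: {point_prediction:.2f} (analytic mean {analytic_mean:.2f})")
print(f"90 % interval: [{lower:.2f}, {upper:.2f}]")
```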

The comparative assessment indicated that the MCD approach provided the worst uncertainty estimates. One reason for this is likely the Gaussian assumption of the uncertainty estimates, which seems inadequate for many low- and high-flow situations. There is, however, also a more nuanced aspect to consider: the MDN approaches estimate the aleatoric uncertainty. MCD, on the other hand, estimates epistemic uncertainty, or rather a particular form thereof. The methodological comparison is therefore only partially fair. In general, these two uncertainty types can be seen as orthogonal to each other. They partially co-occur in our setup, since both the epistemic and aleatoric uncertainties are largest for high flow volumes.

Yet within the chosen setup it was observable that the methods that use inherently asymmetric distributions as components outperformed the other ones. That is, CMAL and UMAL performed better than MCD and GMM in terms of reliability, resolution, and the accuracy of the derived single-point predictions. The CMAL approach in particular gave distributional predictions that were very good in terms of reliability and sharpness (and single-point estimates). There was a direct link between the predicted probabilities and hydrologic behavior in that different distributions were activated (i.e., got larger mixture weights) for rising vs. falling limbs. Nevertheless, likelihood-based approaches (for estimating the aleatoric uncertainty) are prone to giving over-confident predictions. We were not able to diagnose this empirically. This might rather be a result of the limits of the inquiry than the non-existence of the phenomenon.

These limits illustrate how challenging benchmarking is. Rainfall–runoff modeling is a complex endeavor. Unifying the diverse approaches into a streamlined framework is difficult. Realistically, a single research group cannot compare the best possible implementations of the many existing uncertainty estimation schemes – which include approaches such as sampling distributions, ensembles, post-processors, and so forth. We therefore wanted not only to examine some baseline models but also to provide the skeleton for a community-minded benchmarking scheme

To summarize, the presented results are promising. Viewed through the lens of community-based benchmarking, we expect progress on multiple fronts: better data, better models, better baselines, better metrics, and better analyses. The road to get there still holds many challenges. Let us overcome them together.

Table

Overview of the general benchmarking setup.

NA stands for not available.

Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions. These are large topics in themselves, and at this stage we refer to

Following these findings, we also add noise as a smoothness regularization for our experiments. Concretely, we decided to use a relative additive noise as a first-order approximation to the sort of noise contamination we expect in hydrological time series. The operation for regularization is
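One plausible reading of "relative additive noise" is that each value is perturbed by zero-mean Gaussian noise whose magnitude scales with the value itself. The sketch below implements that reading only; the exact operation and noise level used in the study are not reproduced here, and `add_relative_noise` and `sigma = 0.1` are placeholders.

```python
import random


def add_relative_noise(series, sigma, rng):
    """Return z + z * eps with eps ~ N(0, sigma), applied to every value z (assumed form)."""
    return [z + z * rng.gauss(0.0, sigma) for z in series]


rng = random.Random(7)
streamflow = [0.0, 0.4, 1.2, 5.0, 12.5]   # made-up values, for illustration only
noisy = add_relative_noise(streamflow, sigma=0.1, rng=rng)
print(noisy)
```

Under this form, zero values stay exactly zero and large values receive proportionally larger perturbations.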

To provide a meaningful comparison, we conducted a hyperparameter search for each of the four conditional density estimators. A hyperparameter search is an extended search (usually computationally intensive) for the best pre-configuration of a machine learning model.

Search space of the hyperparameter search. The search is conducted in two steps: the variables used in the first step are shown in the top part of the table, and the variables used in the second step are shown in the bottom part and are written in bold.

In our case we searched over the combination of six different hyperparameters (see Table

First, we informally searched for sensible general presets.

Second, we trained the models for each combination of the four hyperparameters “hidden size”

Third, we chose the best resulting model and refined it by searching for the best settings for the hyperparameters “batch size” (the number of samples shown per back-propagation step) and “learning rate” (the parameter for the update per batch).
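The staged procedure can be sketched as two consecutive grid searches. Everything concrete in the sketch is an assumption: `toy_validation_loss` replaces the actual training and validation of a model, `placeholder_setting` stands for hyperparameters not named here, and the grids and presets are invented. Only the structure follows the description: fix presets, search the architecture-related settings, then refine batch size and learning rate for the best candidate.

```python
import math
from itertools import product


def toy_validation_loss(config):
    """Stand-in for 'train a model and evaluate it on validation data' (assumption)."""
    return (
        math.log2(config["hidden_size"] / 256) ** 2
        + 100.0 * (config["placeholder_setting"] - 0.1) ** 2
        + (math.log10(config["learning_rate"]) + 3.0) ** 2
        + 0.1 * math.log2(config["batch_size"] / 256) ** 2
    )


def grid_search(base_config, grid, loss_fn):
    """Evaluate every combination in `grid` on top of `base_config`; return the best one."""
    names = list(grid)
    candidates = [
        {**base_config, **dict(zip(names, values))}
        for values in product(*(grid[name] for name in names))
    ]
    return min(candidates, key=loss_fn)


# Step 1: general presets (hypothetical values).
presets = {"batch_size": 128, "learning_rate": 1e-2}

# Step 2: search the first group of hyperparameters with the presets held fixed.
stage_one_grid = {"hidden_size": [64, 128, 256, 512], "placeholder_setting": [0.0, 0.1, 0.2]}
best_architecture = grid_search(presets, stage_one_grid, toy_validation_loss)

# Step 3: refine batch size and learning rate for the best model from step 2.
stage_two_grid = {"batch_size": [128, 256, 512], "learning_rate": [1e-2, 1e-3, 1e-4]}
best_config = grid_search(best_architecture, stage_two_grid, toy_validation_loss)

print(best_config)
```

Splitting the search keeps the number of trained models at the sum of the two grid sizes rather than their product, at the price of possibly missing interactions between the two groups.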

Resulting parameterization from the hyperparameter search.

The results of the hyperparameter search are summarized in Table

Gaussian mixture models

Given the rainfall–runoff modeling context, a GMM models the runoff

The negative logarithm of the likelihood between the training data and the estimated conditional distribution is used as loss:
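As an illustration of such a loss, the following sketch evaluates the negative log-likelihood of a Gaussian mixture. It is a minimal stand-alone version under stated assumptions: the mixture parameters are fixed numbers rather than network outputs, the loss is averaged over time steps (a sum would differ only by a constant factor), and the function names are hypothetical. The log-sum-exp form is a common way to keep the computation numerically stable.

```python
import math


def gaussian_log_pdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def gmm_log_likelihood(y, weights, means, sigmas):
    """Log of the weighted sum of component densities, computed via log-sum-exp."""
    terms = [math.log(w) + gaussian_log_pdf(y, m, s) for w, m, s in zip(weights, means, sigmas)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def gmm_nll(observations, mixture_params):
    """Average negative log-likelihood; each time step has its own (weights, means, sigmas)."""
    total = sum(gmm_log_likelihood(y, *params) for y, params in zip(observations, mixture_params))
    return -total / len(observations)


# Two time steps with hypothetical mixture parameters.
observations = [1.1, 3.0]
mixture_params = [
    ([0.7, 0.3], [1.0, 2.0], [0.3, 0.8]),
    ([0.5, 0.5], [2.5, 4.0], [0.5, 1.0]),
]
print(f"NLL: {gmm_nll(observations, mixture_params):.4f}")
```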

Countable mixtures of asymmetric Laplacian distributions (CMAL for short) are another form of MDN where ALDs are used as a kernel function. The abbreviation is a reference to UMAL since it serves as a natural intermediate stage between GMM and UMAL – as will become clear in the respective section. As far as we are aware, the use of ALDs for quantile regression was proposed by

Uncountable mixture of asymmetric Laplacians

In analogy to the CMAL model equations, these extensions lead to the following equation for the conditional density:
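The general idea of an implicit mixture of ALDs can be illustrated with a rough sketch. It is not the authors' formulation: it assumes the standard ALD parameterization with location `mu`, scale `b`, and asymmetry `tau`, equal weights for all components, and asymmetry values sampled away from 0 and 1. The functions `mu_of_tau` and `b_of_tau` are toy stand-ins for what a network would predict.

```python
import math
import random


def check_loss(u, tau):
    """Quantile ('pinball') loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0.0 else 0.0))


def ald_pdf(y, mu, b, tau):
    """Density of an asymmetric Laplacian with location mu, scale b, asymmetry tau."""
    return tau * (1.0 - tau) / b * math.exp(-check_loss((y - mu) / b, tau))


def implicit_mixture_pdf(y, taus, mu_of_tau, b_of_tau):
    """Equal-weight average of ALD components, one per sampled asymmetry value."""
    return sum(ald_pdf(y, mu_of_tau(t), b_of_tau(t), t) for t in taus) / len(taus)


def mu_of_tau(tau):
    """Toy location function (assumption, in place of a network output)."""
    return 2.0 * tau


def b_of_tau(tau):
    """Toy scale function (assumption, in place of a network output)."""
    return 0.5 + 0.5 * tau


rng = random.Random(3)
taus = [rng.uniform(0.05, 0.95) for _ in range(20)]
for y in (0.0, 1.0, 3.0):
    print(f"p({y}) = {implicit_mixture_pdf(y, taus, mu_of_tau, b_of_tau):.4f}")
```

Because each component places a probability mass of `tau` below its location, varying `tau` lets the mixture represent skewed densities without fixing the number of components in advance.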

Monte Carlo dropout

MCD is trained to estimate the conditional expectation, that is, by minimizing the mean squared error. As such it is quite different from the MDN approaches. It provides an estimation of the epistemic uncertainty and as such does not supply a heterogeneous, multimodal estimate (it assumes a Gaussian form). For evaluation studies of MCD in hydrological fields, we refer to

We will make the code for the experiments and data of all produced results available online. We trained all our machine learning models with the

DK, FK, MG, and GN designed all the experiments. DK conducted all the experiments, and the results were analyzed together with the rest of the authors. FK and MG helped with building the modeling pipeline. FK provided the main setup for the “accuracy” analysis, AKS and GN for the “internal consistency” analysis, and DK for the “estimation quality” analysis. GK and JH checked the technical adequacy of the experiments. GN supervised the manuscript from the hydrologic perspective and SH from the machine learning perspective. All the authors worked on the manuscript.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund and the Global Water Futures Program and enabled by computational resources provided by Compute Ontario and Compute Canada. The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State of Upper Austria. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), the Medical Cognitive Computing Center (MC3), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3, ID: 951847), and AIDD (MSCA-ITN-2020, ID: 956832). Further, we thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, the Audi.JKU Deep Learning Center, the TGW Logistics Group GmbH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google (Faculty Research Award), ZF Friedrichshafen AG, Robert Bosch GmbH, the Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA corporation.

This paper was edited by Jim Freer and reviewed by John Quilty, Anna E. Sikorska-Senoner, and one anonymous referee.