A note on leveraging synergy in multiple meteorological datasets with deep learning for rainfall-runoff modeling

Frederik Kratzert1, Daniel Klotz1, Sepp Hochreiter1, and Grey S. Nearing2,3

1LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Austria
2Upstream Tech, Natel Energy Inc., Alameda, CA, USA
3Department of Geological Sciences, University of Alabama, Tuscaloosa, AL, USA

Correspondence: Frederik Kratzert (kratzert@ml.jku.at), Grey S. Nearing (grey@upstream.tech)

In this context, we would like to point out that, depending on the goal of the modelling exercise, data-driven models can have an inherent advantage over traditional hydrological modeling techniques: a single data-driven model can use multiple forcing products directly. The models can learn to exploit potential synergies in different (imperfect) precipitation data sets (or any other type of model input). In particular, deep learning models such as those used by Kratzert et al. (2019b, a) can take any number of different precipitation and other meteorological inputs at every timestep. Because the different input data sets are used simultaneously in a single nonparametric model, this has the potential to produce more accurate simulations by combining those inputs in spatiotemporally dynamic ways. The goal of this contribution is to test this hypothesis by assessing the model's ability to learn complex and spatiotemporally variable interactions between different precipitation products.

Data
This study uses the Catchment Attributes and Meteorological dataset for Large Sample Studies (CAMELS; Newman et al., 2014; Addor et al., 2017b). CAMELS contains basin-averaged daily meteorological forcing inputs derived from three different gridded data products for 671 basins across CONUS. The three forcing products are (i) DayMet (Thornton et al., 1997), (ii) Maurer (Maurer et al., 2002), and (iii) NLDAS (Xia et al., 2012); the first has 1 km x 1 km spatial resolution and the latter two have one-eighth degree spatial resolution. Although CAMELS includes 671 basins, to facilitate a direct comparison of results with previous studies we used only the subset of 531 basins that were originally chosen for model benchmarking by Newman et al. (2017), who removed all basins with area greater than 2000 km², as well as all basins where there was a discrepancy of more than 10% between different methods of calculating basin area. These 531 basins were used for all experiments in this study except benchmarking against traditional hydrology models (see Sect. 2.4.1), because the benchmark models are only available at 447 of the 531 basins. Behnke et al. (2016) conducted a detailed analysis of eight different precipitation and surface temperature (daily max/min) data products, including the three used by CAMELS. Those authors compared gridded precipitation and temperature values to station data using roughly 4000 weather stations across CONUS. Their findings were that "no data set was 'best' everywhere and for all variables we analyzed" and "two products stood out in their overall tendency to be closest to (Maurer) and farthest from (NLDAS2) observed measurements." Furthermore, they did not find a "clear relationship between the resolution of gridded products and their agreement with observations, either for average conditions ... or extremes" and noted that the "high-resolution DayMet ... data sets had the largest nationwide mean biases in precipitation."
Figure 1 gives an example of disagreement between precipitation products in CAMELS that we hope to capitalize on by training a model with multiple forcing inputs. This figure shows the noisy relationship between the three precipitation products in a randomly-selected basin (USGS ID: 07359610). Deep learning approaches can learn to mitigate the type of noise shown in the scatter plot in the right-hand panel of Fig. 1 and exploit the inherent information by using multiple forcing products simultaneously in a single model.

Figure 1. The left-hand subplot shows the first 100 days of precipitation data from all three products during the test period, and the right-hand subplot shows scatter between the three products over the full test period. The scatter shown in the right-hand subplot is the data uncertainty that we would like to mitigate by using multiple forcings simultaneously in a deep learning rainfall-runoff model.

The left-hand subplot of Fig. 1 also shows a time-shift between DayMet and Maurer precipitation in the same basin. In this particular basin, there appears to be a 1-day shift between DayMet and Maurer, which is common in the CAMELS data set (this shift is apparent in 325 of the 531 basins). This type of shift is common. Behnke et al. (2016), for example, reported that "[b]ecause gridded products differ in how they define a calendar day (e.g., local time relative to Coordinated Universal Time), appropriate lag correlations were applied through cross-correlation analysis to account for the several-hour offset in daily station data." We performed a lag-correlation analysis, the results of which are available under the DOI listed in the data availability section. In addition to the three daily forcing data sets from CAMELS, we used the same 27 catchment attributes as Kratzert et al. (2019a, b), which consist of topological, climatic, vegetation, and soil descriptors (Addor et al., 2017a).
Prior to training any model, all input variables were normalized independently by subtracting the CONUS-wide mean and dividing by the CONUS-wide standard deviation.
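For illustration, this normalization step can be sketched as follows (a minimal example; the per-column statistics computed here stand in for the CONUS-wide means and standard deviations the study actually uses):

```python
import numpy as np

def standardize(inputs, mean, std):
    """Normalize each input variable with precomputed statistics."""
    return (inputs - mean) / std

# Toy example with two variables over five timesteps; the per-column
# statistics below stand in for the CONUS-wide mean and standard deviation.
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [5.0, 50.0]])
z = standardize(x, x.mean(axis=0), x.std(axis=0))
```

After this transformation each variable has zero mean and unit variance over the reference set, so variables with very different units (e.g. precipitation and temperature) enter the network on a comparable scale.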

Models
Long Short-Term Memory networks (LSTMs) are a type of recurrent neural network (Hochreiter, 1991; Hochreiter and Schmidhuber, 1997b; Gers et al., 1999). LSTMs are a type of state-space model that function through a set of input-state-output relationships. Gates, which are activated linear functions, control information flows from inputs and previous states to current state values (called an input gate), from current states to outputs (called an output gate), and also control the timescale of each element of the state vector (called a forget gate). States (usually called cell states) accumulate and store information over time, much like the states of a dynamical system. Technical details of the LSTM model architecture have been described in several previous publications in hydrology journals, and we refer the reader to Kratzert et al. (2018) for a detailed explanation geared towards hydrologists.
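As a rough sketch of the gating mechanism described above (not the implementation used in this study; the function name and stacked-weight layout are our own illustrative choices), a single LSTM timestep can be written in NumPy as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W (4n x m), U (4n x n), and b (4n,) hold the
    stacked weights for the input (i), forget (f), and output (o) gates
    and the cell candidate (g)."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b      # pre-activations for all gates
    i = sigmoid(z[0 * n:1 * n])     # input gate: what enters the cell state
    f = sigmoid(z[1 * n:2 * n])     # forget gate: timescale of each state element
    o = sigmoid(z[2 * n:3 * n])     # output gate: what the cell exposes
    g = np.tanh(z[3 * n:4 * n])     # candidate cell update
    c = f * c_prev + i * g          # cell state accumulates information over time
    h = o * np.tanh(c)              # hidden state (the cell's output)
    return h, c

# With all-zero weights every gate opens halfway, so c = 0.5 * c_prev.
m, n = 3, 4
h1, c1 = lstm_cell(np.ones(m), np.zeros(n), np.ones(n),
                   np.zeros((4 * n, m)), np.zeros((4 * n, n)), np.zeros(4 * n))
```

The cell state c plays the role of the dynamical-system state mentioned above: the forget gate f scales how quickly each state element decays, and the input gate i controls how much new information is written at each step.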

To conduct our analyses, we trained an LSTM model using all three forcing products together, separate LSTM models for each pairwise combination of forcing products (DayMet & Maurer, DayMet & NLDAS, and Maurer & NLDAS), and separate LSTMs for each of the three forcing products individually.
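The seven input configurations enumerated above (three single-forcing, three pairwise, one three-forcing) can be generated programmatically, for example:

```python
from itertools import combinations

products = ["daymet", "maurer", "nldas"]
# All non-empty subsets: three single-forcing, three pairwise, one three-forcing.
configs = [list(c) for r in (1, 2, 3) for c in combinations(products, r)]
```

This yields exactly the seven configurations used in the ablation study.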
For each of these seven input configurations, we trained an ensemble of n = 10 different LSTMs with different randomly initialized weights. We report the statistics from averaging the simulated hydrographs from each of these 10-member ensembles (single-model results are provided in the Appendix). Ensembles are used to account for randomness inherent in the training procedure. The importance of using ensembles for this purpose was demonstrated by Kratzert et al. (2019b).
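The ensemble averaging can be sketched as follows (synthetic data; the variance-reduction effect shown here is the reason ensembling averages out training randomness):

```python
import numpy as np

def ensemble_hydrograph(member_sims):
    """Average the simulated hydrographs of all ensemble members.
    member_sims has shape (n_members, n_timesteps)."""
    return member_sims.mean(axis=0)

# Synthetic check: averaging 10 noisy members reduces the error variance.
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0.0, 10.0, 200)) ** 2          # toy hydrograph
members = truth + 0.1 * rng.standard_normal((10, 200))    # 10 noisy members
avg = ensemble_hydrograph(members)
mse_members = float(np.mean((members - truth) ** 2))      # average member error
mse_ensemble = float(np.mean((avg - truth) ** 2))         # ensemble error
```

With independent member errors, the ensemble mean reduces the error variance by roughly a factor of n.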
We used the same time periods for model training and testing as Kratzert et al. (2019b), which allows us to directly compare the results of this study with the full set of benchmark hydrology models used by that previous study. The training period was from 1 October 1999 to 30 September 2008 (9 years of training data for each catchment) and the test period was from 1 October 1989 to 30 September 1999 (10 years of test data).

Analysis
We examined the experiments described above with three types of analysis. The goal is to provide different illustrations of how the LSTM leveraged multiple forcing products.
- Analysis 1 - Feature Ablation: An ablation study removes parts of the network to gain a better understanding of the model. We adopted this procedure by removing the different meteorological forcing products in a step-wise fashion and comparing the individual results using multiple performance metrics and signatures (see Table 1). To provide context, we also compared the results against a family of conceptual and process-based hydrological models.

- Analysis 2 - Precipitation Uncertainty: We used triple collocation to estimate spatially-varying error characteristics of the three precipitation forcing products, and assessed relationships between these error statistics and the performance of both single- and multiple-forcing LSTMs. These experiments help us understand where we can expect value from using multiple forcing products in a single model.

- Analysis 3 - Sensitivity & Contribution: We performed an input attribution analysis of the trained LSTM models to quantify how the LSTM learned to leverage different forcing products in spatiotemporally dynamic ways.

Analysis 1: Feature Ablation
All LSTM ensembles were trained using a squared-error loss function (the basin-averaged NSE); however, we are also interested in how the models simulate different aspects of the hydrograph. As such, we report a collection of hydrologically-relevant performance metrics outlined in Table 1. These statistics include the standard time-average performance metrics (NSE, KGE), as well as comparisons between observed and simulated hydrologic signatures. The hydrologic signatures we report are the same ones used by Addor et al. (2018). For each signature, we compute the Pearson correlation between the signatures derived from the observed discharge and from the simulated discharge across all basins.
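The NSE metric referenced above can be written compactly, for instance:

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations.
    NSE = 1 for a perfect simulation; NSE = 0 for predicting the mean flow."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Toy observed discharge series for illustration.
q_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
```

A perfect simulation scores 1, and a model that always predicts the observed mean scores 0, which is why NSE is a natural squared-error loss for streamflow.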
To provide a baseline for comparison, the LSTM ensembles were benchmarked against the same family of hydrological models used for benchmarking by Kratzert et al. (2019b). These models are: (i) SAC-SMA (Burnash et al., 1973; Burnash, 1995) coupled with the Snow-17 snow routine (Anderson, 1973), hereafter referred to as SAC-SMA, (ii) VIC (Liang et al., 1994), (iii) FUSE (Clark et al., 2008; Henn et al., 2008) (three different model structures: 900, 902, 904), (iv) HBV (Seibert and Vis, 2012), and (v) mHM (Samaniego et al., 2010; Kumar et al., 2013). Some of these models were calibrated to individual basins and others were regionally calibrated. All of the benchmarks used Maurer forcings, and all were calibrated and validated on the same time periods used in this study. To avoid any potential or implicit bias, we did not run any of our own benchmark models; all models were originally solicited by Kratzert et al. (2019b) from different groups with experience running each individual model. The whole family of benchmark model runs is only available in 447 of the 531 CAMELS catchments chosen by Newman et al. (2017). Thus, while we use the set of 531 catchments for all other parts of this study, we only considered 447 catchments for benchmarking against traditional hydrology models.

Table 1 (excerpt). Hydrologic signatures:
High flow freq.: frequency of high-flow days (>9 times the median daily flow) [Clausen and Biggs (2000); Table 2 in Westerberg and McMillan (2015)]
Low flow dur.: average duration of low-flow events (number of consecutive days <0.2 times the mean daily flow) [Olden and Poff (2003); Table 2 in Westerberg and McMillan (2015)]
Low flow freq.: frequency of low-flow days (<0.2 times the mean daily flow) [Olden and Poff (2003); Table 2 in Westerberg and McMillan (2015)]

Analysis 2: Precipitation Uncertainty

The objective of the second analysis is to demonstrate that the multiple-forcing model learns to leverage patterns in forcing data error structures. Our approach was to relate error characteristics of the different precipitation products to model performance, and to the performance improvements due to using multiple forcing products. We used triple collocation to estimate error characteristics of the different forcing products. Triple collocation is a statistical technique for estimating the error variances of three or more noisy measurement sources without knowing the true values of the measured quantities (Stoffelen, 1998; Scipal et al., 2010). Its major assumptions are that the error models are linear and independent between sources; in particular, that all (three or more) measurement sources are each a combination of a scaled value of the true variable plus some additive random noise:

M_{i,t} = \alpha_i + \beta_i T_t + \varepsilon_{i,t},    (1)

where M_{i,t} are measurement values (here, the modeled precipitation values), subscript i represents the source (DayMet, Maurer, NLDAS), and subscript t represents the timestep in the test period (1 October 1989 to 30 September 1999); T_t is the unobserved true value of total precipitation in a given catchment on a given day; and \varepsilon_{i,t} are i.i.d. measurement errors from any distribution.
The linearity assumption is not appropriate for precipitation data, which are typically assumed to have multiplicative error. Following Alemohammad et al. (2015), we assumed a multiplicative error model for all three precipitation sources, and converted these to linear error models by working with the log-transformed precipitation data:

\ln M_{i,t} = \ln a_i + b_i \ln T_t + \varepsilon_{i,t}.    (2)

Standard triple collocation is then applied, so that estimates of the error variances for each source are:

\sigma_{\varepsilon_i}^2 = C_{i,i} - \frac{C_{i,j} C_{i,k}}{C_{j,k}}    (3)

for all i \neq j \neq k, where C_{i,j} is the covariance between the (log-transformed) time series of source i and source j. Additionally, extended triple collocation (McColl et al., 2014) allows us to derive the correlation coefficients between measurement sources and truth as:

\rho_i^2 = \frac{C_{i,j} C_{i,k}}{C_{i,i} C_{j,k}}.    (4)

This triple collocation analysis was applied separately in each of the 531 CAMELS catchments to obtain basin-specific estimates of the error variances, \sigma_i^2, and truth-correlations, \rho_i, for each of the three precipitation products. Although the assumption that the forcing products have independent error structures (i.e., \varepsilon_{i,t} \perp \varepsilon_{j,t}) is not met in our case, we expect the results to be robust enough for the purpose at hand.
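Under the assumptions above, the (extended) triple collocation estimators can be sketched as follows (a minimal NumPy illustration with synthetic additive noise, as after the log transform; the function name and toy setup are ours, not the authors' code):

```python
import numpy as np

def extended_triple_collocation(m1, m2, m3):
    """Error variances and squared truth-correlations for three collocated
    measurement series with independent errors (McColl et al., 2014)."""
    C = np.cov(np.vstack([m1, m2, m3]))
    idx = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
    # sigma_i^2 = C_ii - C_ij * C_ik / C_jk
    err_var = np.array([C[i, i] - C[i, j] * C[i, k] / C[j, k] for i, j, k in idx])
    # rho_i^2  = C_ij * C_ik / (C_ii * C_jk)
    rho2 = np.array([C[i, j] * C[i, k] / (C[i, i] * C[j, k]) for i, j, k in idx])
    return err_var, rho2

# Synthetic check: one "truth" series plus independent additive noise
# with known noise standard deviations (0.3, 0.5, 0.8).
rng = np.random.default_rng(42)
truth = rng.standard_normal(100_000)
sigmas = (0.3, 0.5, 0.8)
sources = [truth + s * rng.standard_normal(truth.size) for s in sigmas]
err_var, rho2 = extended_triple_collocation(*sources)
```

On this synthetic example the estimated error variances recover the squared noise levels (0.09, 0.25, 0.64) without ever seeing the truth series, which is the property the basin-wise analysis relies on.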

Analysis 3: Sensitivity & Contribution
All neural networks (like LSTMs) are differentiable almost everywhere by design. Therefore, a gradient-based contribution analysis seems natural. However, as discussed by Sundararajan et al. (2017), the naive solution of using local gradients is not a reliable measure of sensitivity, since gradients might be flat even if the model response is heavily influenced by a particular input data source (which is not by necessity a bad property; see, for example, Hochreiter and Schmidhuber, 1997a). This is especially true in neural networks, where activation functions often saturate over portions of the input space; e.g., the sigmoid and hyperbolic tangent activation functions used by LSTMs have close-to-zero gradients at both extremes. Sundararajan et al. (2017) proposed a method of input attribution for neural networks which accounts for this lack of local sensitivity: integrated gradients. Integrated gradients are a path integral of the gradients from some baseline input x' to the actual value of the input, x:

\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\left(x' + \alpha (x - x')\right)}{\partial x_i}\, \mathrm{d}\alpha.    (6)
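The path integral can be approximated with a Riemann sum, as in this sketch (the toy tanh model stands in for a trained network; in practice the gradients would come from automatic differentiation):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Midpoint-rule approximation of integrated gradients:
    IG_i = (x_i - x'_i) * integral_0^1 dF/dx_i (x' + a*(x - x')) da."""
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)  # points on the straight path
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

# Toy differentiable "model" with saturating regions, like an LSTM gate:
f = lambda x: np.tanh(x).sum()              # scalar output
grad_f = lambda x: 1.0 - np.tanh(x) ** 2    # its exact gradient
x = np.array([3.0, -2.0, 0.5])
ig = integrated_gradients(grad_f, x, np.zeros(3))
```

Note that for the saturated input x = 3 the local gradient is nearly zero, yet the integrated gradient correctly attributes a large contribution; the attributions also sum to F(x) - F(x') (the completeness property).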
We used a value of zero precipitation everywhere as the baseline for calculating integrated gradients with respect to the three different precipitation forcings (DayMet, Maurer, NLDAS). We calculated the integrated gradients of each daily streamflow estimate in each CAMELS basin during the 10-year test period with respect to the precipitation inputs from the past 365 days (the look-back period of the LSTM). That is, on day t = T, we calculated 1095 = 3 × 365 integrated gradient values related to the three precipitation products. The relative integrated gradient values quantify how the LSTM combines precipitation products over time, over space, and as a function of lag or lead-time into the current streamflow prediction. In theory, one has to take into account the effect of "explaining away" when analysing the decision process in models (Pearl, 1988; Wellman and Henrion, 1993). However, we assume that, when evaluated over hundreds of basins and thousands of time steps, this effect is largely averaged out, and the analysis therefore provides an indication of the actual information used by the model.

Results: Analysis 1 - Feature Ablation

The feature ablation analysis compared NSE values over the 10-year test period in the CAMELS basins for the seven distinct input combinations. As shown in Fig. 3, the three-forcing model had a median NSE of 0.82 over the 447 basins for which all benchmark models were available. The three-forcing model outperformed all two-forcing models. Similarly, all two-forcing models outperformed all single-forcing models (all improvements were statistically significant at α = 0.001 using the Wilcoxon test). The best single-forcing LSTM had a median NSE of 0.76. This indicates that the LSTM was able to leverage unique information in the precipitation signals (not an unusual finding in the context of machine learning; see, for example, Sutton, 2019).
We also note that the single-forcing LSTM with Maurer inputs outperformed the single-forcing NLDAS model, which agrees with the results of Behnke et al. (2016) who showed that Maurer precipitation was generally more accurate than NLDAS precipitation.

To put these results into context, Fig. 4 compares all LSTMs against the benchmark hydrology models in the 447 basins where simulations of all benchmark models were available. All LSTM models were better than all benchmark models through the entire CDF curve. As a point of reference, the difference in median NSE between the best-performing single-forcing LSTM (DayMet) and the best-performing traditional hydrology model (HBV) was 0.09, and using all three CAMELS forcings increased that improvement over traditional models by another 0.055 (61%). The total improvement in the median NSE of the multi-forcing LSTM over the best-performing hydrology model was 0.143 (21%). Table 2 and Table 3 give the benchmarking results for all metrics and signatures in Table 1. We note that the LSTM approach, while generally providing the best available model, still has approximation difficulties towards the extreme lower end of the runoff distribution.
Looking at the spatial distribution of the performance differences in all basins used for model training (i.e., 531 basins instead of the 447 basins used for the benchmarking described above), it is evident that the three-forcing LSTM outperformed the single-forcing models almost everywhere (Fig. 5). Individual exceptions where "less is more" do, however, exist (e.g., Southern California). Concretely, the three-forcing model was better than the best single-forcing model in 66% of the basins (351 of 531) and had a higher NSE than the individual single-forcing LSTMs in over 80% of the basins.

Results: Analysis 2 - Precipitation Uncertainty
DayMet typically produces lower NSE values in basins where triple collocation reported high DayMet precipitation error variances. This is what we would expect: low skill in basins with high precipitation error. However, we did not see similar patterns with the other two precipitation products (see Fig. 6, where the triple collocation error variances and truth-correlations are plotted against the NSE scores of the single-source models). In fact, the NLDAS LSTM tends to perform worse in basins with lower precipitation error (as estimated by triple collocation). Figure 7 is an adapted version of Fig. 6 that highlights a few high-skill, high-variance NLDAS basins in blue. These basins correspond to a cluster of basins in the Rocky Mountains (Fig. 8) where NLDAS has low correlation with the other two products.

Triple collocation measures (dis)agreement between measurement sources, rather than error variances directly. Figure 9 plots model performance against the individual variances of the precipitation products in each basin. This figure shows that the single-forcing DayMet LSTM tends to perform better in catchments with higher total precipitation variance (not triple collocation error variance), indicating better performance in wetter catchments. This is not true for the other two models, where higher total variance is associated with a higher variance in model skill, indicating a higher proportion of the variance due to measurement error.
To analyse the synergy due to using all forcings in a single LSTM, we compared the NSE improvements in each basin (due to using all three forcing products in the same LSTM) with the log-determinant of the covariance matrix of all three (standardized, log-transformed) precipitation products (Fig. 10). The log-determinant is a proxy for the joint entropy of the three (standardized, log-transformed) products, and increases when there is larger disagreement between the three data sets.
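This disagreement measure can be sketched as follows (a minimal NumPy illustration; the function name and the synthetic data are ours):

```python
import numpy as np

def logdet_cov(precip_logs):
    """Log-determinant of the covariance of standardized (log-transformed)
    precipitation products; larger values indicate larger disagreement.
    precip_logs has shape (n_products, n_timesteps)."""
    z = precip_logs - precip_logs.mean(axis=1, keepdims=True)
    z = z / z.std(axis=1, keepdims=True)     # remove per-product mean and variance
    sign, logdet = np.linalg.slogdet(np.cov(z))
    return logdet

# Three nearly identical products vs. three noisy, disagreeing products.
rng = np.random.default_rng(1)
base = rng.standard_normal(500)
agree = np.vstack([base + 1e-3 * rng.standard_normal(500) for _ in range(3)])
disagree = np.vstack([base + rng.standard_normal(500) for _ in range(3)])
```

Because each series is standardized first, the covariance is effectively a correlation matrix: near-perfect agreement drives the determinant toward zero (a very negative log-determinant), while disagreement pushes it up.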
Unlike in Fig. 9, the variances in Fig. 10 were calculated after removing the mean and overall variance of each log-transformed precipitation product so that the log-determinant of the covariance is not affected by the overall magnitude of precipitation in each catchment (i.e., it does not increase in wetter catchments). With the exception of the anomalous NLDAS basins, Fig. 10 shows that the three-forcing model offers improvements with respect to the single-forcing models when there is larger disagreement between the three data sets.

Figure 10. Fractional increase in NSE from the three-forcing model relative to the single-forcing models plotted against the log-determinant of the covariance matrix of all three (standardized, log-transformed) precipitation products. With the exception of the anomalous NLDAS basins (blue markers), the three-forcing model offers improvements with respect to the single-forcing models when there is larger disagreement between the three data sets. The three-forcing model learned to leverage synergy in these three precipitation products.

Results: Analysis 3 - Sensitivity & Contribution

Figure 11 shows the time- and basin-averaged integrated gradient of the multi-forcing LSTM as a function of lead time. To reiterate from above, the integrated gradient is a measure of input attribution, or sensitivity, such that inputs with higher integrated gradients have a larger influence on model outputs. Integrated gradients shown in Fig. 11 were averaged over all timesteps in the test period, and also over all basins. This figure shows the sensitivity of streamflow at time t = T to each of the three precipitation inputs at times t = T − s, where s is the lag value on the x-axis. The main takeaways from this high-level illustration of the input sensitivities are: (1) the sensitivity of current streamflow to precipitation decays with lead time (i.e., time before present), and (2) the multi-forcing model has learned to ignore the Maurer input at the present timestep.
The reason for the latter is the time shift in the Maurer product illustrated in Fig. 2.

The multi-forcing LSTM learned to combine the different precipitation products in spatiotemporally variable ways. Fig. 11 demonstrates the overall behavior of the multi-forcing LSTM. It is, however, a highly condensed aggregate of a highly nonlinear system. As such, a lot of specific information is lost, as is always the case when nonlinearities are aggregated. Therefore, Fig. 12 details the overall model behavior (through the lens of integrated gradients) by basin, and up to a lead time of s = 3 days prior to present. The model largely ignores Maurer precipitation at the current timestep in most basins, as is already apparent in Fig. 11, but the ratio of the contributions of each product varies between basins. Figure 12 shows the relative contributions of each precipitation product, but the overall importance of precipitation also varies between basins.
Similarly, Fig. 13 shows the spatial extent of the most sensitive contribution over all time steps (left-hand subplot) and the overall sensitivity to all three products combined (right-hand subplot), which is highly correlated with the total (or average) precipitation in the basin. That is, they display the sum of the integrated gradients over time, lag, and product. From the right-hand subplot it becomes evident that precipitation has a larger contribution to the sensitivity of streamflow predictions in wetter basins. Figure 13 also shows the product with the highest overall contribution in each basin. It is possible to break the spatial relationship down even further. Concretely, we examined the spatial distribution of the highest-ranked product as a function of lag time for rising and falling limbs. We can then see that the multi-forcing LSTM learns to combine the different products in very nuanced ways, distinguishing between different memory timescales in different basins for different hydrological conditions (Fig. 14).

Conclusions
The purpose of this paper is to show how LSTMs can leverage different precipitation products in spatiotemporally dynamic ways to improve streamflow simulations. The experiments show that there exist systematic, location- and time-specific differences between precipitation products that can be learned and leveraged by deep learning. As might be expected, the LSTMs tested here tended to improve hydrological simulations more when there was larger disagreement between different precipitation estimates in a given basin.
It is worth comparing these findings with classical conceptual and process-based hydrological models, which treat the precipitation estimate as a single input. Current best practice for using multiple precipitation products is to run an ensemble of hydrological models, such that each forcing data set is treated independently. Deep learning models not only have the ability to use a larger number and variety of inputs than classical hydrology models; they also do not need inputs that represent any given hydrological variable or process, and therefore have the potential to use less highly processed input data.
Future work might focus on building runoff models that take as inputs the raw measurements that were used to create standard precipitation data products. Deep learning provides possibilities not only for improving the quality of regional (Kratzert et al., 2019b) and even ungauged (Kratzert et al., 2019a) simulations, but also potentially for replacing large portions of ensemble-based strategies for uncertainty quantification (e.g., Clark et al., 2016) with multi-input models. There are many ways to deal with uncertainty in traditional hydrological modeling workflows. Arguably, the most common approach is to use ensembles (e.g., Clark et al., 2016). Ensembles can be either opportunistic (i.e., from a set of pre-existing models or data products) or constructed (i.e., sampled from a probability distribution) (Clark et al., 2016), but in either case the idea is to use variability to represent a lack of perfect information. Multi-input deep learning has the potential to provide a fundamentally alternative method for assessing this kind of uncertainty. Future work should additionally focus on producing predictive probabilities with multi-input deep learning models.

Figure 13. The forcing product with the highest overall contribution (sensitivity) in each basin (left-hand subplot), averaged over prediction time step and lag. The alpha value (opacity) of each dot on this map is a relative measure of the fraction of the total integrated gradients of all three precipitation products (summed over time, lag, and product) due to the highest-contributing product. The right-hand subplot shows that the total integrated gradient summed over all three precipitation products is highly correlated with total precipitation in the basin.

Code availability
The code to reproduce all results and figures will be made available at https://github.com/kratzert/multiple_forcing

Data availability

To evaluate the model performance on the peak timing, we used the following procedure. First, we determined peaks in the observed runoff time series by a locality search; that is, potential peaks are defined as local maxima. To reduce the number of peaks and filter out noise, the next step was an iterative process in which, by pairwise comparison, only the larger peak is kept until all remaining peaks are at least 100 time steps apart. This procedure is implemented in SciPy's find_peaks function (Virtanen et al., 2020), which is used in the current work.
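The peak-selection step, together with the timing-error computation described next, can be sketched as follows (a minimal illustration on synthetic data, not the authors' code; the function name is ours):

```python
import numpy as np
from scipy.signal import find_peaks

def peak_timing_error(q_obs, q_sim, distance=100, window=3):
    """Mean |offset| (in timesteps) between each observed peak and the
    highest simulated discharge within +/- `window` days of it."""
    # Local maxima, thinned so kept peaks are >= `distance` steps apart
    # (find_peaks keeps the larger of two peaks that are too close).
    peaks, _ = find_peaks(q_obs, distance=distance)
    errs = []
    for p in peaks:
        lo, hi = max(p - window, 0), min(p + window + 1, q_sim.size)
        errs.append(abs(lo + int(np.argmax(q_sim[lo:hi])) - p))
    return peaks, float(np.mean(errs))

# Synthetic example: the simulation lags the observations by exactly one day.
q_obs = np.zeros(300); q_obs[50] = 1.0; q_obs[200] = 2.0
q_sim = np.zeros(300); q_sim[51] = 1.0; q_sim[201] = 2.0
peaks, err = peak_timing_error(q_obs, q_sim)
```

On this example both observed peaks are detected and the mean timing error is one day, matching the constructed one-day shift.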
Second, we iterated over all peaks and searched for the corresponding peak in the simulated discharge time series. The simulated peak is defined as the highest discharge value inside a window of ±3 days around the observed peak, and the peak timing error is the offset between the observed peak and the simulated peak. The resulting metric is the average offset.

References

Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis, Technische Universität München, 91, 1991.