Technical note: Calculation scripts for ensemble hydrograph separation

Ensemble hydrograph separation has recently been proposed as a technique for using passive tracers to measure catchment transit time distributions and new water fractions, introducing a powerful new tool for quantifying catchment behavior. However, the technical details of the necessary calculations may not be straightforward for many users to implement. We have therefore developed scripts that perform these calculations on two widely used platforms (MATLAB and R), to make these methods more accessible to the community. These scripts implement robust estimation techniques by default, making their results highly resistant to outliers. Here we briefly describe how these scripts work, and offer advice on their use. We illustrate their potential and limitations using synthetic benchmark data.


Introduction
What fraction of streamflow is composed of recent precipitation? Conversely, what fraction of precipitation becomes streamflow promptly? What is the age distribution of streamwater? What is the "life expectancy" of precipitation as it enters a catchment? And how do all of these quantities vary with catchment wetness, precipitation intensity, and landscape characteristics? Questions like these are fundamental to understanding the hydrological functioning of landscapes and characterizing catchment behavior. Ensemble hydrograph separation (EHS) has recently been proposed as a new tool for quantifying catchment transit times, using time series of passive tracers like stable water isotopes or chloride. Benchmark tests using synthetic data have shown that this method should yield quantitatively accurate answers to the questions posed above (Kirchner, 2019), and initial applications to real-world data sets (e.g., Knapp et al., 2019) have demonstrated the potential of this technique. However, it has become clear over the past year that the equations of Kirchner (2019, hereafter denoted K2019) may be difficult for many users to implement in practically workable calculation procedures or computer codes. It has also become clear that robust estimation methods would be a valuable addition to the ensemble hydrograph separation toolkit, given the likelihood of outliers in typical environmental data sets. The present contribution is intended to fill both of these needs, by presenting user-friendly scripts that perform EHS calculations in either MATLAB or R, and that implement robust estimation by default.

(https://doi.org/10.5194/hess-2020-330. Preprint, discussion started 29 July 2020. © Author(s) 2020. CC BY 4.0 License.)
Here we demonstrate these scripts using synthetic data generated by the benchmark model of K2019, which in turn was adapted from the benchmark model of Kirchner (2016). We use these benchmark data instead of real-world observations, because age-tracking in the model tells us what the correct answers are, so that we can verify how accurately these EHS scripts infer water ages from the synthetic tracer time series. The benchmark model consists of two non-linear boxes coupled in series, with a fraction of the discharge from the upper box being routed directly to streamflow, and the rest being routed to the lower box, which in turn discharges to streamflow (for further details, see Kirchner, 2016 and K2019). It should be emphasized that the benchmark model and the ensemble hydrograph separation scripts are completely independent of one another. The benchmark model is not based on the assumptions that underlie the ensemble hydrograph separation method. Likewise, the EHS scripts do not know anything about the internal workings of the benchmark model; they only know the input and output water fluxes and their isotope signatures. Thus the analyses presented here are realistic analogues to the real-world problem of trying to infer the internal functioning of catchments from only their inputs and outputs of water and tracers. Figures 1a and 1b show the simulated daily water fluxes and isotope ratios used in most of the analyses below. The precipitation fluxes are averages over the previous day (to mimic the effects of daily time-integrated precipitation sampling), and the streamflow values are instantaneous values at the end of each day (to mimic the effects of daily grab sampling). We also aggregated these daily values to simulate weekly sampling, using weekly volume-weighted average tracer concentrations in precipitation and weekly spot values in streamflow (representing grab samples taken at the end of each week).
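The weekly aggregation step can be sketched as follows. This is a minimal Python illustration (the published scripts are in MATLAB and R), and the function name and the end-of-week handling of the discharge flux are our own assumptions; the key points are the volume-weighted averaging of precipitation concentrations and the end-of-week spot values for streamflow:

```python
# Illustrative sketch of aggregating daily tracer data to weekly resolution:
# volume-weighted average tracer concentrations in precipitation, and
# end-of-week "grab samples" of streamflow.
import numpy as np

def aggregate_weekly(p, cp, q, cq):
    """p, cp: daily precipitation flux and tracer concentration;
    q, cq: daily discharge flux and tracer concentration."""
    n_weeks = len(p) // 7
    p2 = np.asarray(p, float)[:n_weeks * 7].reshape(n_weeks, 7)
    cp2 = np.asarray(cp, float)[:n_weeks * 7].reshape(n_weeks, 7)
    p_wk = p2.sum(axis=1)                         # total weekly precipitation
    with np.errstate(invalid="ignore", divide="ignore"):
        cp_wk = (p2 * cp2).sum(axis=1) / p_wk     # volume-weighted mean
    q_wk = np.asarray(q, float)[6::7][:n_weeks]   # end-of-week value
    cq_wk = np.asarray(cq, float)[6::7][:n_weeks] # end-of-week grab sample
    return p_wk, cp_wk, q_wk, cq_wk
```

Volume weighting matters here because low-volume rain days should not count as heavily as high-volume rain days in the weekly mean concentration.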
Five percent of the simulated tracer time series were randomly deleted to mimic sampling and measurement failures, and a small amount of random noise was added to mimic measurement errors.
To illustrate the need for robust estimation techniques, and to demonstrate the effectiveness of the robust estimation methods employed in our scripts, we also randomly corrupted the synthetic isotope data with outliers (Fig. 1c). These outliers are intentionally large; for comparison, the entire range of the outlier-free data shown in Fig. 1b lies between the two dashed lines in Fig. 1c. The outliers are also strongly biased (they all deviate downward from the true values), making them harder to detect and eliminate. We make no claim that the size of these outliers, and their frequency in the data set, reflect outlier prevalence and magnitude in the real world (which would be difficult to estimate in practice, without replicate sampling or other independent reference data). Instead, these outliers were simply chosen to be large enough, and frequent enough, that they will substantially distort the results of non-robust analyses. They thus provide a useful test for the robust estimation methods described below.
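The data-corruption procedure can be sketched as follows (Python, illustrative); the noise level, outlier frequency, and outlier magnitude used here are placeholders rather than the values used for Fig. 1:

```python
# Sketch of the synthetic data corruption described above: randomly delete
# a fraction of values, add measurement noise, and inject large, one-sided
# (strongly biased) outliers.
import numpy as np

def corrupt(c, frac_missing=0.05, noise_sd=0.1, frac_outliers=0.05,
            outlier_shift=-20.0, seed=0):
    rng = np.random.default_rng(seed)
    c = np.asarray(c, float).copy()
    n = len(c)
    miss = rng.choice(n, int(frac_missing * n), replace=False)
    c[miss] = np.nan                         # sampling/measurement failures
    c = c + rng.normal(0.0, noise_sd, n)     # measurement noise
    bad = rng.choice(n, int(frac_outliers * n), replace=False)
    c[bad] += outlier_shift                  # one-sided (biased) outliers
    return c
```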

Estimating new water fractions using the function EHS_Fnew
The simplest form of ensemble hydrograph separation seeks to measure the fraction of streamflow that is composed of recent precipitation. Conventional hydrograph separation uses end-member mixing to estimate the time-varying contributions of "event water" and "pre-event water" to streamflow. By contrast, ensemble hydrograph separation seeks to estimate the average fraction of new water in streamflow, averaged over an ensemble of events (hence the name), based on the regression slope between tracer fluctuations in precipitation and discharge (see Fig. 2a):

$$ C_{Q_j} - C_{Q_{j-1}} = F_{\mathrm{new}} \left( C_{P_j} - C_{Q_{j-1}} \right) + \alpha + \varepsilon_j \qquad (1) $$

where F_new is the "event new water fraction" (the average fraction of new water in streamflow during sampling intervals with precipitation), C_Qj and C_Qj-1 are the tracer concentrations in streamflow at time steps j and j-1, C_Pj is the volume-weighted average tracer concentration in precipitation that falls between time j-1 and time j, and the intercept alpha and the error term epsilon can be viewed as subsuming any bias or random error introduced by measurement noise, evapoconcentration effects, and so forth (see Sect. 2 of K2019 for formulae and derivations).

Although ensemble hydrograph separation is rooted in assumptions that are similar to end-member mixing, mathematically speaking it is based on correlations between tracer fluctuations rather than on tracer mass balances. As a result, it does not require that the end-member signatures are constant over time, or that all the end-members are sampled or even known, and it is relatively unaffected by evaporative isotopic fractionation or other biases in the underlying data (see Sect. 3.6 of K2019). Even when new water fractions are highly variable over time, one can show mathematically (and confirm with benchmark tests) that ensemble hydrograph separation will accurately estimate their average (see Sect. 2 and Appendix A of K2019).

As Fig. 2a shows, higher discharges (indicating wetter catchment conditions) may be associated with larger new water fractions, and thus stronger coupling between tracer fluctuations in precipitation and streamflow. Nonetheless, the regression slope in Fig. 2a estimates the average new water fraction over the whole ensemble of sampling intervals. The meaning of "new water" depends on the sampling frequency: if precipitation and streamflow are sampled daily, "new water" means water that fell within the previous day (and thus F_new is expressed in units of day-1), and if they are sampled weekly, "new water" means water that fell within the previous week (and thus F_new is expressed in units of week-1). Because the meaning and dimensions of new water fractions depend on the sampling interval, so do the numerical values, as illustrated by Knapp et al. (2019). In our example, the weekly event new water fraction, calculated from weekly sampling of the daily values in Fig. 1, is 0.443±0.024, which agrees, within error, with the true weekly event new water fraction (0.429±0.017). Astute readers will notice that the weekly new water fraction is not 7 times the daily one, implying that translating between weekly and daily event new water fractions is not just a matter of converting the units. This is partly because weeks rarely consist of seven consecutive daily hydrological events (instead they typically include some days without rain). Thus the relationship between daily and weekly new water fractions will depend on the intermittency of precipitation events. One must also keep in mind that the proportion of new water in streamflow cannot exceed 1, so new water fractions, even when evaluated from low-frequency data, cannot be arbitrarily large.
As explained in Sect. 2 of K2019, there are three main types of new water fractions. First, as noted above, the event new water fraction F_new is the average fraction of new water in streamflow during sampling intervals with precipitation.
Second, the new water fraction of discharge, QF_new, is the average fraction of new water in streamflow during all sampling intervals, with or without precipitation; this will obviously be less than the event new water fraction because periods without precipitation will not contribute any new water to streamflow. Third, the "forward" new water fraction, or new water fraction of precipitation, PF_new, is the average fraction of precipitation that will be discharged to streamflow within the current sampling interval. Both QF_new and PF_new can be derived by re-scaling F_new from Eq. (1) by the appropriate denominators. All three of these new water fractions can also be volume-weighted (to express, for example, the fraction of new water in an average liter of streamflow, rather than on an average day of streamflow), if the regression in Eq. (1) is volume-weighted; these volume-weighted fractions are denoted using an asterisk, as F_new*, QF_new*, and PF_new*.
In our scripts, new water fractions are calculated by the function EHS_Fnew. Users supply EHS_Fnew with vectors of evenly spaced data for the water fluxes P and Q, and tracer concentrations C_P and C_Q, in precipitation and discharge. Users can also specify five options: a) a threshold precipitation rate (in the same units as P), below which precipitation inputs will be ignored, under the assumption that they will mostly have been lost to canopy interception; b) a logical flag (default=false) indicating whether the new water fractions should be volume-weighted; c) a logical flag (default=true) indicating whether the new water fractions should be calculated using robust estimation methods as described in Sect. 2.1 below; d) a logical flag (default=true) indicating whether the standard error estimates should account for serial correlation in the residuals; and e) a point filter vector of logical flags indicating whether individual time steps should be included in the analysis, thus facilitating easy analyses of subsets of the original time series.
The function EHS_Fnew returns estimates of F_new, QF_new, and PF_new and their associated standard errors, with or without volume-weighting depending on whether the volume-weighting flag is set to true or false.
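For readers who want to see the core calculation, here is a minimal Python sketch of the Eq. (1) regression, using ordinary least squares only; the actual EHS_Fnew adds volume weighting, robust estimation, and serial-correlation-aware standard errors, and the function name and argument order here are illustrative:

```python
# Minimal sketch of the ensemble hydrograph separation regression (Eq. 1):
# regress streamflow tracer fluctuations on precipitation tracer
# fluctuations, each expressed relative to the previous streamflow value.
import numpy as np

def event_new_water_fraction(cp, cq, p, p_threshold=0.0):
    cp, cq, p = (np.asarray(a, float) for a in (cp, cq, p))
    x = cp[1:] - cq[:-1]            # precipitation tracer fluctuation
    y = cq[1:] - cq[:-1]            # streamflow tracer fluctuation
    ok = (p[1:] > p_threshold) & np.isfinite(x) & np.isfinite(y)
    A = np.column_stack([x[ok], np.ones(ok.sum())])
    slope, intercept = np.linalg.lstsq(A, y[ok], rcond=None)[0]
    return slope                    # estimate of the event new water fraction
```

In a simple mixing model where each interval's streamflow is a fraction f of same-interval precipitation plus (1 - f) of the previous interval's streamflow, this regression recovers f exactly.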

Robust estimation of new water fractions
The linear regression in Eq. (1), like any least-squares technique, is potentially vulnerable to outliers. Because potential outliers are often present in environmental data, practical applications of ensemble hydrograph separation would benefit from a robust method for estimating new water fractions. Such a method should not only be insensitive to outliers; ideally it should also be statistically efficient (i.e., it should yield reasonable estimates from small samples), and it should be asymptotically unbiased (i.e., it should converge to the conventional regression results when outliers are absent, with a bias near zero for large samples). Figure 2 shows ensemble hydrograph separation scatterplots for the benchmark data. On these axes (precipitation and streamflow tracer fluctuations on the x and y axes, respectively, each expressed relative to the streamflow tracer concentration in the previous time step), the regression slope estimates the event new water fraction F_new. Here we are interested in how outliers affect this regression slope. When outliers are absent (Fig. 2a), the regression slope (0.164±0.006, estimate±standard error) is consistent with the true event new water fraction (0.168±0.005) calculated from water age tracking in the benchmark model.

By contrast, outliers substantially distort the ensemble hydrograph separation plot in Fig. 2b; they extend well beyond the range of the outlier-free data indicated by the gray rectangle, and inflate the estimate of F_new by nearly a factor of three.
Outliers in precipitation tracer concentrations will be displaced left or right from the corresponding true values (in Fig. 2b, these outliers are displaced to the left because they are all negative). Precipitation outliers will thus tend to flatten the regression line. Outliers in streamflow concentrations will appear in two different ways. First, they will be displaced above or below the corresponding true values (in this case, they are only displaced below, because they are all negative). Secondly, they will also appear as strongly correlated deviations on both the x and y axes, because streamflow concentrations at time j-1 are used as reference values for both precipitation concentrations (on the x axis) and streamflow concentrations (on the y axis) at time j. Unlike precipitation outliers, these correlated points will tend to artificially steepen the regression line. Thus, whether outliers steepen or flatten the regression relationships underlying ensemble hydrograph separation will depend on the relative abundance and size of the streamflow outliers and precipitation outliers (relative to each other, and relative to the variability in the true streamflow and precipitation tracer values). In the example shown in Fig. 2b, the outliers have the net effect of artificially steepening the fitted slope, yielding an apparent F_new of 0.430±0.018 that is more than 2.5 times the true value of 0.168±0.005 determined by age tracking in the benchmark model.

Many robust estimation methods will not be effective against outliers like those shown in Fig. 2b, which create points that have great leverage on the slope of the fitted line. This leverage can allow the outliers to pull the line close enough to themselves that they will not be readily detected as outliers. To address this problem, our robust estimation procedure has two parts. The first step is to identify extreme values of both precipitation and streamflow tracer concentrations at the outset, and exclude them by setting them to NA (thus treating them as missing values). This will effectively prevent outliers from exerting strong leverage on the solution. Because the exclusion criterion must itself be insensitive to outliers, we define extreme values as those that lie farther from the median than six times MAD, the median absolute deviation from the median.
The cutoff value of six times MAD was borrowed from the residual downweighting function used in Locally Weighted Scatterplot Smoothing (LOWESS; Cleveland, 1979). Any exclusion criterion may also eliminate points that are not outliers, but simply extreme values. However, unless the underlying distribution has very long tails, the 6·MAD criterion will exclude very few points that are not outliers. If the underlying data follow a normal distribution, for example, the chosen criterion will exclude only the outermost 0.005 percent of that distribution.
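The first step can be sketched in a few lines of Python (illustrative, not the published MATLAB/R code):

```python
# Sketch of the 6*MAD exclusion step: values lying more than six median
# absolute deviations from the median are treated as missing.
import numpy as np

def exclude_extremes(c, n_mad=6.0):
    c = np.asarray(c, float).copy()
    med = np.nanmedian(c)
    mad = np.nanmedian(np.abs(c - med))     # median absolute deviation
    c[np.abs(c - med) > n_mad * mad] = np.nan
    return c
```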
As a second step, we use iteratively reweighted least squares (IRLS: Holland and Welsch, 1977) to estimate the regression slope, and thus the event new water fraction F_new. IRLS iteratively fits Eq. (1) by linear regression, with point weights that are updated after each iteration. Points with unusually large residuals are given smaller weight. In this way, IRLS regressions follow the linear trend in the bulk of the data, giving less weight to points that deviate substantially from that trend. This behavior, which allows IRLS to down-weight outliers, can have undesirable effects in analyses of outlier-free data exhibiting divergent trends. In Fig. 2a, for example, higher flows have steeper trends, with the highest 20 percent of flows (shown in red) exhibiting a much steeper trend than the rest of the data. Because IRLS gives these points relatively less weight, the robust estimate of F_new is 0.126±0.004, 25% less than the true value of 0.168±0.005 determined from age tracking in the benchmark model. Thus, in this case, the robust estimation procedure is somewhat less accurate than ordinary least squares if the data are free of outliers. Conversely, however, the outliers in Fig. 2b have little effect on the robust estimation procedure, which returns an F_new estimate of 0.115±0.005, within 10% of the outlier-free value. This example demonstrates that like any robust estimation procedure, ours is highly resistant to outliers but at the cost of reduced accuracy when outliers are absent, particularly in cases, like Fig. 2a, that superimpose widely differing trends. Robust estimation is turned on by default, but users can turn it off if they are confident that their data are free of significant outliers.
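The second step can be sketched as follows. This Python illustration uses Huber-type weights (Holland and Welsch, 1977, describe several such weight functions); the weight function and tuning constant in the published scripts may differ:

```python
# Sketch of iteratively reweighted least squares (IRLS) for the slope of
# Eq. (1): refit the weighted regression, recompute residuals and a robust
# scale estimate, and down-weight points with large residuals.
import numpy as np

def irls_slope(x, y, n_iter=50, k=1.345):
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x)
    slope = 0.0
    for _ in range(n_iter):
        sw = np.sqrt(w)                       # weighted least squares via scaling
        A = np.column_stack([x * sw, sw])
        slope, intercept = np.linalg.lstsq(A, y * sw, rcond=None)[0]
        r = y - (slope * x + intercept)       # residuals from current fit
        sigma = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-12)
        w = np.minimum(1.0, k * sigma / np.maximum(np.abs(r), 1e-12))
    return slope
```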

Profiling catchment behavior using the function EHS_profile
Visual comparison of the different discharge ranges shown by different colors in Fig. 2a indicates that in these benchmark data, higher discharges are associated with stronger coupling between tracer concentrations in precipitation and streamflow, implying that streamflow contains a larger fraction of recent precipitation. This observation implies that by estimating F_new by regression for each discharge range separately, one can profile how new water fractions vary with discharge (and thus, at least in the benchmark model system, with catchment wetness). As outlined in K2019, this can be accomplished by splitting the original data set into separate ensembles and running EHS_Fnew on each ensemble individually.

Although this can be achieved by applying a series of point filter vectors to isolate each ensemble, here we provide a function, EHS_profile, that automates this process. Users supply EHS_profile with the same data vectors and logical flags needed for EHS_Fnew as described in Sect. 2 above, plus a criterion vector for sub-setting the data and two vectors that define the percentile ranges of this criterion variable to be included in each subset. Many different variables could be chosen as the sub-setting criterion; examples include discharge (or antecedent discharge), precipitation intensity (or antecedent precipitation), day of year, soil moisture, groundwater levels, fractional snow cover, and so forth.

Figure 3 shows new water fraction profiles estimated from the benchmark data. When the tracer data are corrupted by outliers, the non-robust estimates deviate markedly from the benchmark age tracking results (Fig. 3b). By contrast, the robust estimation procedure yields new water fractions (dark blue points) that closely follow the age tracking results (Figs. 3b-e and 4b), at least as long as the fraction of outliers in the data set does not exceed 10 percent. Somewhere between an outlier frequency of 10 and 20 percent, the robust estimation procedure reaches its so-called "breakdown point" (Hampel, 1971), at which it can no longer resist the outliers' effects (see Fig. 3f). This breakdown point is relatively low (for comparison, the breakdown point of the median as an estimator of central tendency is 50 percent) because the outliers introduce highly correlated artifacts into the analysis (see Fig. 2b) and because these particular outliers are very large and very strongly biased (they always lie below the true values). The breakdown point could be raised by tailoring the exclusion criterion (step 1 in our two-step procedure) to these particular outlier characteristics, for example by basing it on deviations relative to the median of the densest 50% of the data, rather than the median of all the data, to counteract the bias in the outliers.
Doing so, however, would violate the principle that the scripts and the data used to test them should be fully independent of one another, as outlined in Sect. 1. In any case, the empirical breakdown point of 10-20 percent identified in Fig. 3 is specific to this particular data set with these particular outlier characteristics, and should not be interpreted as indicating the likely breakdown point in other situations. In general, however, we would expect the robust estimation procedure to be more resistant to outliers that are smaller or less strongly biased.
Astute readers will note that the robust estimates of new water fractions almost exactly match the benchmark age tracking data in the profiles shown here, whereas they underestimated the same age tracking data by roughly 25% in Sect. 2.1 above, where the data were not separated into distinct ranges of discharge or precipitation rates. The difference between these two cases is illuminating. Individual discharge ranges exhibit well-defined relationships between tracer fluctuations in precipitation and streamflow; that is, the individual colored discharge ranges in Fig. 2a show roughly linear scatterplots with well constrained slopes. Thus for these individual discharge ranges, the robust estimates agree with the benchmark "true" values (and the non-robust estimates do too, if the underlying data are free of outliers). However, when these different discharge ranges are superimposed, the robust estimation procedure down-weights the high-discharge points because they follow a different trend from the rest of the data, resulting in an underestimate of the new water fraction averaged over all discharges. Thus users should be aware that our robust estimation procedure (like any such procedure) can be confounded by data in which some points exhibit different behavior than the rest, and are therefore excluded or down-weighted as potential anomalies.
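The profiling idea can be sketched as follows (Python, illustrative; the real EHS_profile also supports volume weighting, robust estimation, and arbitrary criterion variables):

```python
# Sketch of profiling: split the record into percentile ranges of a
# criterion variable (here, discharge) and estimate a new water fraction
# for each subset with the Eq. (1) regression.
import numpy as np

def fnew_profile(p, cp, q, cq, lo=(0, 25, 50, 75), hi=(25, 50, 75, 100)):
    p, cp, q, cq = (np.asarray(a, float) for a in (p, cp, q, cq))
    x = cp[1:] - cq[:-1]
    y = cq[1:] - cq[:-1]
    crit = q[1:]                          # criterion variable: discharge
    slopes = []
    for a, b in zip(lo, hi):
        qa, qb = np.percentile(crit, [a, b])
        ok = ((crit >= qa) & (crit <= qb) & (p[1:] > 0)
              & np.isfinite(x) & np.isfinite(y))
        A = np.column_stack([x[ok], np.ones(ok.sum())])
        slopes.append(np.linalg.lstsq(A, y[ok], rcond=None)[0][0])
    return slopes
```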

Estimating transit time distributions using the function EHS_TTD
One can estimate catchment transit time distributions from tracer time series by extending Eq. (1) to a multiple regression over a series of lag intervals k = 0 … m:

$$ C_{Q_j} - C_{Q_{j-1}} = \sum_{k=0}^{m} \beta_k \left( C_{P_{j-k}} - C_{Q_{j-1}} \right) + \alpha + \varepsilon_j \qquad (2) $$

where the vector of regression coefficients beta_k can be re-scaled to yield different types of transit time distributions as described in Sects. 4.5-4.7 of K2019. Applying Eq. (2) to catchment data is straightforward in principle but tricky in practice, because any rainless intervals will lead to missing precipitation tracer concentrations for a range of time steps j and lag intervals k. Handling this missing data problem requires special regression methods, as outlined in Sect. 4.2 of K2019. Gaps in the underlying data can also lead to ill-conditioning of the covariance matrix underlying the least-squares solution of Eq. (2), leading to instability in the regression coefficients beta_k. This ill-conditioning problem is handled using Tikhonov-Phillips regularization, which applies a smoothness criterion to the solution in addition to the least-squares goodness-of-fit criterion, as described in Sect. 4.3 of K2019.
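The lagged regression and its regularization can be sketched as follows. This Python illustration builds the design matrix for Eq. (2), drops rows with missing values, and adds a second-difference (smoothness) penalty; the scalar penalty weight `lam` is a simplification of the fractional weighting described in K2019:

```python
# Sketch of the Eq. (2) regression with Tikhonov-Phillips regularization:
# lagged precipitation tracer fluctuations as predictors, with a
# second-difference smoothness penalty on the TTD coefficients.
import numpy as np

def ttd_regression(cp, cq, m, lam=1.0):
    cp, cq = np.asarray(cp, float), np.asarray(cq, float)
    rows, ys = [], []
    for j in range(m + 1, len(cq)):
        xr = cp[j - m:j + 1][::-1] - cq[j - 1]     # lags k = 0..m
        if np.all(np.isfinite(xr)) and np.isfinite(cq[j] - cq[j - 1]):
            rows.append(xr)
            ys.append(cq[j] - cq[j - 1])
    X = np.column_stack([np.array(rows), np.ones(len(ys))])  # + intercept
    y = np.array(ys)
    L = np.zeros((max(m - 1, 0), m + 2))           # second-difference operator
    for i in range(m - 1):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]           # (intercept not smoothed)
    beta = np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)
    return beta[:-1]                               # TTD coefficients beta_k
```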
The function EHS_TTD spares users from the practical challenges of implementing these methods. Users supply EHS_TTD with the same data vectors needed for EHS_Fnew as described in Sect. 2 above. Users also specify m, the maximum lag in the transit time distribution, and the fractional weight given to the Tikhonov-Phillips regularization criterion (versus the goodness-of-fit criterion) in determining the regression coefficients beta_k. The default value of this weight is 0.5, following Sect. 4.3 of K2019, which gives the regularization and goodness-of-fit criteria roughly equal weight; if it is set to zero, the regularization criterion is ignored and the estimation procedure becomes equivalent to ordinary least squares. Users can set an optional point filter to filter the data set by discharge time steps (for example, to track the ages of discharge leaving the catchment during high or low flows, regardless of the conditions that prevailed when the rain fell that ultimately became those streamflows). Alternatively, users can set an optional point filter to filter the data set by precipitation time steps (for example, to track the life expectancy of rainwater that falls during large or small storms, regardless of the conditions that will prevail when that rainwater ultimately becomes discharge). It is also possible to set both filters so that both the precipitation and discharge time steps are filtered, but this capability should be used cautiously because it could potentially lead to TTDs being estimated on only a small, and highly fragmented, part of the data set.
The function EHS_TTD returns vectors for the transit time distribution QTTD (the age distribution of streamflow leaving the catchment), the "forward" transit time distribution PTTD (the "life expectancy" distribution of precipitation entering the catchment), and their associated standard errors. If the volume-weighting flag is true, the corresponding volume-weighted distributions (QTTD* and PTTD*) and their standard errors are returned. In all cases, the units are fractions of discharge or precipitation per sampling interval (e.g., day-1 for daily sampling or week-1 for weekly sampling). This difference in units should be kept in mind when comparing results obtained for different sampling intervals.

Robust estimation of transit time distributions
In EHS_TTD, robust estimation of transit time distributions follows a multi-step approach analogous to that used in EHS_Fnew (described in Sect. 2.1 above). We first exclude extreme values of both precipitation and streamflow tracer concentrations using the 6·MAD criterion. We then apply iteratively reweighted least squares (IRLS) to Eq. (2), without regularization; this yields a set of robustness weights, which down-weight points that lie far away from the multidimensional linear trend of the data. These robustness weights are then applied within the Tikhonov-Phillips regularized regression that estimates the transit time distribution. This robust estimation approach requires that we handle the missing data problem in a different way than the one that was documented in Sect. 4.3 of K2019. The necessary modifications are detailed in Appendix A.
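The multi-step robust procedure can be sketched end-to-end (Python, illustrative; the weight function, tuning constants, and missing-data handling in EHS_TTD differ in detail, as described in Appendix A):

```python
# End-to-end sketch of robust TTD estimation: (1) 6*MAD exclusion,
# (2) IRLS without regularization to obtain robustness weights, and
# (3) a weighted Tikhonov-Phillips regularized solve.
import numpy as np

def robust_ttd(cp, cq, m, lam=1.0, n_iter=10, k=1.345):
    cp, cq = np.asarray(cp, float).copy(), np.asarray(cq, float).copy()
    for c in (cp, cq):                             # step 1: 6*MAD exclusion
        med = np.nanmedian(c)
        mad = np.nanmedian(np.abs(c - med))
        c[np.abs(c - med) > 6.0 * mad] = np.nan
    rows, ys = [], []
    for j in range(m + 1, len(cq)):                # build lagged design matrix
        xr = cp[j - m:j + 1][::-1] - cq[j - 1]
        if np.all(np.isfinite(xr)) and np.isfinite(cq[j] - cq[j - 1]):
            rows.append(xr)
            ys.append(cq[j] - cq[j - 1])
    X = np.column_stack([np.array(rows), np.ones(len(ys))])
    y = np.array(ys)
    w = np.ones(len(y))
    for _ in range(n_iter):                        # step 2: IRLS, no smoothing
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        r = y - X @ beta
        sigma = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-12)
        w = np.minimum(1.0, k * sigma / np.maximum(np.abs(r), 1e-12))
    L = np.zeros((max(m - 1, 0), m + 2))
    for i in range(m - 1):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]
    beta = np.linalg.solve(X.T @ (X * w[:, None]) + lam * (L.T @ L),
                           X.T @ (w * y))          # step 3: weighted, regularized
    return beta[:-1]
```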
This robust estimation procedure yields transit time distributions that are highly resistant to outliers (Fig. 5). The gray lines in Fig. 5 show the true transit time distributions of discharge (QTTD) and "forward" transit time distributions of precipitation (PTTD), as determined by age tracking in the benchmark model. These age tracking results are compared to transit time distributions calculated from the tracer time series using EHS_TTD, with and without robust estimation (dark and light symbols, respectively, in Fig. 5). When the tracer time series are outlier-free (Figs. 5a and 5c), both the robust and non-robust estimation procedures accurately estimate these TTDs (i.e., the light and dark blue points closely follow the gray lines). When the tracer time series are corrupted by outliers (Figs. 5b and 5d), the non-robust TTDs (light blue points) deviate substantially from the age tracking results (gray lines), but the robust TTDs (dark blue points) follow the gray lines nearly as well as with the outlier-free data.

Overestimation of uncertainties in humped transit time distributions
The benchmark tests shown in Figs. 2-5 above, like most of those presented in K2019, are based on a benchmark model simulation that yields "L-shaped" TTDs, that is, those in which the peak occurs at the shortest lag. In this section we explore several phenomena associated with the analysis of distributions that are "humped", that is, those that peak at an intermediate lag.
Where tracer data are sufficient to constrain the shape of catchment-scale TTDs, they suggest that humped distributions are rare (Godsey et al., 2010). They are also not expected on theoretical grounds, since precipitation falling close to the channel should reach it quickly and with little dispersion, leading to TTDs that peak at very short lags (Kirchner et al., 2001; Kirchner and Neal, 2013). Nonetheless, humped distributions could potentially arise in particular catchment geometries (Kirchner et al., 2001).

Thus it appears that the TTD error estimates are generally conservative (i.e., they overestimate the true error), but with humped distributions the uncertainties are greatly exaggerated. Numerical experiments (Fig. 7) reveal that this problem arises from the nonstationarity of the transit times in the benchmark model (and, one may presume, in real-world catchment data as well). K2019 (Sect. 4 and Appendix B) showed that ensemble hydrograph separation correctly estimates the average of the benchmark model's nonstationary TTD, as one can also see in Figs. 6 and 8. When this (stationary) average TTD is used to predict streamflow tracer concentrations (which is necessary to estimate the error variance and thus the standard errors), however, it generates nearly the correct patterns of values but not with exactly the right amplitudes or at exactly the right times (see Fig. 7a). This is the natural consequence of estimating a nonstationary process with a stationary statistical model. As a result, the residuals are larger, with much stronger serial correlations, than they would be if the underlying process were stationary (compare Figs. 7a and 7b).

Figure 8 shows that if the same TTDs are estimated from weekly data, the standard errors more accurately approximate the mismatch between the TTD estimates and the true values (i.e., the difference between the blue dots and the gray curves).
Weekly sampling yields much more reasonable standard errors in this case, because the multiple regression residuals are much less serially correlated (see Fig. 7c; the lag-1 serial correlation of the residuals is 0.66, increasing the standard error by only a factor of 2.2). In addition, with daily data the TTD coefficients are estimated for a closely spaced mesh of lag times (with lag intervals of 1 day), and broad TTDs like the ones shown in Fig. 6 do not change much over such short lag intervals. Thus the individual TTD coefficients on such a closely spaced mesh are not well constrained; one could make the TTD stronger at the fifth daily lag and weaker at the sixth daily lag, for example, with little effect on the overall fit to the data. With weekly sampling, the TTD coefficients are more widely separated in time (with lag intervals of 1 week), and thus are less redundant with one another.

These biases are much smaller when the same estimates are derived from weekly data (Fig. 10). This occurs because the input tracers are less correlated over weekly sampling intervals than over daily sampling intervals, and because the TTD is much stronger at short lags on weekly time scales (Fig. 8) than on daily time scales (Fig. 6). In real-world cases, biases like those shown in Fig. 9 may not be obvious, because the correct answer (shown here by the gray line, derived from benchmark model age tracking) will not be known. However, the behavior in Fig. 9b is implausible on hydrological grounds (why should catchments quickly transmit a particularly large fraction of very small precipitation events to the stream?), and the profiles in Figs. 9b and 10b show strongly contrasting patterns. Thus observations like these may help in identifying biased new water fraction estimates, even in cases where the TTD itself has not been quantified.
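The effect of serial correlation on the standard errors can be illustrated with a standard effective-sample-size argument: if the residuals behave like an AR(1) process with lag-1 autocorrelation rho, the effective number of independent residuals is roughly n(1 - rho)/(1 + rho), so standard errors grow by sqrt((1 + rho)/(1 - rho)). This textbook approximation reproduces the factor quoted above; the scripts' actual adjustment may differ in detail:

```python
# Effective-sample-size sketch of the serial-correlation adjustment for
# AR(1)-like residuals with lag-1 autocorrelation rho.
import math

def se_inflation(rho):
    """Factor by which serial correlation inflates regression standard errors."""
    return math.sqrt((1.0 + rho) / (1.0 - rho))
```

For the weekly residuals quoted above (lag-1 serial correlation 0.66), this gives a factor of about 2.2.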

Visualizing catchment nonlinearity using precipitation- and discharge-filtered TTDs
Transit time distributions are typically constructed from the entire available tracer time series for a catchment, as in Figs. 5, 6, and 8. Such TTDs can be considered as averages of catchments' nonstationary transport behavior, as shown in Sect. 4.2 above. However, ensemble hydrograph separation can also be used to calculate TTDs for filtered subsets of the full catchment time series, focusing on either discharge or precipitation time steps that highlight particular conditions of interest.
(In Appendix B we describe the new procedure that EHS_TTD uses for filtering precipitation time steps; this approach yields more accurate results than the one outlined in Sect. 4.2 of K2019.) TTDs from these filtered subsets of the full time series can yield further insights into catchment transport phenomena.

For example, we can map out the nonlinearities that give rise to catchments' nonstationary behavior, by comparing TTDs from subsets of the original time series that represent different catchment conditions (Fig. 11). Larger precipitation events in our benchmark model result in forward transit time distributions with peaks that are higher, earlier, and narrower (Fig. 11a).
A similar progression in peak height, timing, and width is observed in forward TTDs (Fig. 11b) obtained from the benchmark tracer time series by setting the point filter in EHS_TTD to focus on individual ranges of precipitation rates. The backward transit time distributions in the benchmark model (Fig. 11c) differ somewhat from the forward transit time distributions (Fig. 11a), but exhibit a similar shift to higher, earlier, and narrower peaks at higher discharges. This trend is also reflected in backward TTDs (Fig. 11d), which show similar progressions in peak height and shape, reflecting the nonlinearity in the benchmark model storages, which have shorter effective storage times at higher storage levels and discharges. Although the particular results shown in Fig. 11 are generated by a synthetic benchmark model, they illustrate how similar analyses could be used to infer nonlinear transport processes from real-world catchment data. Comparing TTDs representing different levels of antecedent catchment wetness, for example, could potentially be used to determine how much more precipitation bypasses catchment storage during wet conditions. Similarly, TTDs representing different levels of subsequent precipitation (over the following day or week, for example) could potentially be used to determine how effectively such precipitation mobilizes previously stored water. Thus Fig. 11 illustrates how TTDs from carefully selected subsets of catchment tracer time series can be used as fingerprints of catchment response, and as a basis for inferring the mechanisms underlying catchment behavior.
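Conceptually, such a point filter amounts to a boolean mask over time steps. The sketch below (Python, with hypothetical names; the actual EHS_TTD filtering interface differs) illustrates the idea of restricting the analysis to a range of precipitation rates:

```python
import numpy as np

def precip_range_filter(precip, lo, hi):
    """Boolean mask selecting time steps whose precipitation rate
    lies in [lo, hi); the TTD regression can then be restricted to
    (or filtered on) these time steps.  An illustrative sketch only,
    not the actual EHS_TTD point-filter interface."""
    p = np.asarray(precip, dtype=float)
    return (p >= lo) & (p < hi)

# Example: select time steps with moderate-to-large precipitation rates
precip = np.array([0.0, 1.2, 8.5, 0.3, 15.0, 4.1])   # hypothetical rates, mm/d
mask = precip_range_filter(precip, 5.0, 20.0)
print(mask.tolist())   # [False, False, True, False, True, False]
```

Repeating the TTD estimation for several such ranges yields the family of filtered TTDs compared in Fig. 11.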

Choosing the number of TTD lags
An obvious question for users is the number of lags over which the TTD should be estimated. Here there is no fixed rule; the answer will depend on the time scales of interest, the length of the available tracer time series, and the shape of the TTD itself (which of course will not be known in advance). An empirical approach is to compare the results for several different maximum lags, and see where the resulting TTDs agree and where they differ. Figure 12 shows this approach applied to daily and weekly tracer time series, yielding TTDs with contrasting shapes. The upper row (panels a and b) shows L-shaped TTDs estimated from the same synthetic tracer time series that underlie Figs. 1-5, and the lower row shows humped TTDs from the same benchmark model driven by the same inputs, but with different parameters as described above. In each panel, the shorter and longer TTDs (shown in different colors) are generally consistent with one another, except in the case of the 4-lag TTD shown in blue in Fig. 12d. In that case, such a short TTD is evidently unable to capture the shape of the benchmark distribution, as indicated by its deviation from the TTDs of other lengths. One can also see that the last few lags of any TTD can diverge from the TTD shape defined by the other TTDs. In Fig. 12a the last few lags generally deviate downward, and in Fig. 12c they generally deviate upward; thus there appears to be no general rule, except that the last few lags of any TTD estimates should be treated with caution and potentially excluded from analysis.
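One way to carry out this comparison is simply to re-fit the lagged regression for several maximum lags and overlay the results. The sketch below (Python, complete data, ordinary least squares; the actual scripts additionally handle missing values, volume weighting, and robust estimation) recovers a known 4-term TTD from synthetic data and lets the leading coefficients be compared across different maximum lags:

```python
import numpy as np

def estimate_ttd(cp, cq, m):
    """Estimate TTD coefficients beta_0..beta_m by ordinary least
    squares of streamflow tracer concentrations cq against current
    and lagged input tracer concentrations cp (a simplified,
    complete-data sketch of the Eq. 2 regression)."""
    cp = np.asarray(cp, dtype=float)
    cq = np.asarray(cq, dtype=float)
    n = len(cq)
    X = np.column_stack([cp[m - k : n - k] for k in range(m + 1)])
    y = cq[m:]
    X = X - X.mean(axis=0)          # center each lag column
    y = y - y.mean()                # center the response
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic example: white-noise inputs convolved with a known TTD
rng = np.random.default_rng(0)
cp = rng.normal(size=400)
true_ttd = np.array([0.5, 0.3, 0.15, 0.05])
cq = np.convolve(cp, true_ttd)[:400] + 0.01 * rng.normal(size=400)

# Compare TTD estimates for several maximum lags, as in Fig. 12:
for m in (3, 6, 12):
    print(m, np.round(estimate_ttd(cp, cq, m)[:4], 2))
```

Where the estimates for different maximum lags agree (as they do here at the leading lags), the TTD shape can be considered well constrained; where they diverge, more caution is warranted.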
A further observation from Fig. 12 is that TTD estimates from weekly tracer data may be at least as accurate, if not more so, than those calculated from daily tracer data. This may seem surprising, particularly because the time series underlying Fig. 12 are all five years long; thus the daily time series contain 7 times as many individual tracer measurements as the weekly time series. Nonetheless, for several reasons it is not surprising that in this case one could obtain more stable estimates from fewer data points. First of all, in these numerical experiments the precipitation tracer concentrations are serially correlated (as they also often are in the real world); thus there is more redundancy among the daily tracer inputs than among the weekly tracer inputs. Secondly, the precipitation volumes are less variable (in percentage terms) from week to week than they are from day to day, meaning that the weekly calculations use fewer input concentrations that are accompanied by very small water volumes (and that therefore could not have much influence on the real-world catchment). And thirdly, lower sampling frequencies entail TTDs with coefficients at more widely spaced lags, which are thus less redundant with one another and can therefore be better constrained individually. Of course, with lower-frequency sampling one loses the short-lag tail of the TTD, which may be of particular interest. But in cases where this information is not crucial, or where only lower-frequency data are available, it appears that TTDs can be reliably estimated from samples taken weekly, and perhaps at even lower sampling frequencies.

Closing comments
In this short contribution, we have presented scripts that implement the ensemble hydrograph separation approach. We have also illustrated some of its quirks and limitations using synthetic data. These issues have been revealed through benchmark tests that are substantially stricter than many in the literature. One should not assume that other methods have fewer quirks and limitations, unless those methods have been tested with equal rigor.
For example, many benchmark data sets are generated using the same assumptions that underlie the analysis methods that they are used to test. Although the results of such tests often look good, they are unrealistic, because those idealized assumptions are unlikely to hold in real-world cases. For example, the TTD methods presented here would work very well if they were tested against benchmark data generated from a stationary TTD (see Fig. 7b), but this is hardly surprising, since the regression in Eq. 2 assumes stationarity. Such a test, however, is far removed from the real world, in which tracer data typically come from nonstationary catchment systems. Tests with nonstationary benchmarks yield results that are less (artificially) pleasing, but more realistic (e.g., Fig. 7a). These tests also demonstrate an important point, by showing how well the TTD method estimates the average of the time-varying TTDs that are likely to arise in real-world cases (see also Sect. 4 and Appendix B of K2019).
Although these scripts have been tested against several widely differing benchmark data sets (both here and in K2019), we encourage users to test them with their own benchmark data to verify that they are behaving as expected. As the examples presented here show, ensemble hydrograph separation can potentially be applied not only to the high-frequency tracer data sets that are now becoming available, but also to longer-term, lower-frequency tracer data that have been collected through many environmental monitoring programs. We hope that the availability of these scripts facilitates new and interesting explorations of the transport behavior of many different catchment systems.

Author contributions
J.W.K. wrote the R scripts and J.L.A.K. translated them into MATLAB. J.W.K. conducted the benchmark tests, drew the figures, and wrote the first draft of the manuscript. Both authors discussed the results and revised the manuscript.

Data availability
After acceptance of the manuscript, the R and MATLAB scripts and benchmark data sets will be available from a DOI-referenced archive. For review purposes, these files can be obtained from https://www.envidat.ch/#/metadata/ensemble-hydrograph-separation.

Appendix A: Improved solution method for transit time distributions
Ensemble hydrograph separation estimates transit time distributions by a multiple regression of streamflow tracer fluctuations against current and previous precipitation tracer fluctuations (Eq. 2, which is the counterpart to Eq. 1 over multiple lag intervals). Performing this multiple regression with real-world data requires addressing the "missing data problem": precipitation tracer concentrations will be inherently unavailable during time steps in which no precipitation falls, and both precipitation and streamflow tracer concentrations may also be missing due to sampling and measurement failures.
The scripts presented here handle missing data somewhat differently than the procedure outlined in Sect. 4.2 of K2019. In this appendix we outline the new procedure and explain why it is necessary.

Equation (2) can be written as the multiple regression

y_j = α + Σ_ℓ β_ℓ x_{j,ℓ} + ε_j , (A1)

where y_j and x_{j,ℓ} are the streamflow and precipitation tracer concentrations at time step j and lag ℓ, expressed relative to a reference concentration (Eqs. A2 and A3). The conventional least-squares solution to such a multiple regression is usually expressed in matrix form as

β = (Xᵀ X)⁻¹ Xᵀ y , (A4)
where β is the vector of the regression coefficients β_ℓ, y is the vector of the reference-corrected streamflow tracer concentrations y_j, and X is the matrix of the reference-corrected input tracer concentrations x_{j,ℓ}, at each time step j and lag ℓ.
Equation (A4) cannot be applied straightforwardly to real-world catchment data, because it cannot be solved when values of y_j or x_{j,ℓ} are missing. K2019 handled this missing data problem using a variant of Glasser's (1964) "pairwise deletion" approach. In this approach, Eq. (A4) was re-cast in terms of covariances,

β = [cov(x_ℓ, x_ℓ')]⁻¹ [cov(x_ℓ, y)] , (A5)

and these covariances were evaluated only for pairs of non-missing values (see Eqs. 42 and 43 of K2019), as signified by the subscripts in parentheses (i.e., they were only evaluated for time steps where both x_{j,ℓ} and x_{j,ℓ'}, or x_{j,ℓ} and y_j, were non-missing). The solution method presented in K2019 recognized that these covariances must be adjusted to account for the two different reasons that values can be missing. A value of x_{j,ℓ} that is missing because of a sampling or analysis failure represents missing information; it cannot be included in calculating the corresponding covariances, but (barring biases in which values are missing) it should have no systematic effect. By contrast, a value of x_{j,ℓ} that is missing because too little rain fell should dilute the covariances in which it appears, because with trivial precipitation inputs, the missing tracer concentration could not cause any meaningful co-varying change in streamflow tracer concentrations. These considerations required the elements of the covariance matrix in (A5) to be adjusted using prefactors that accounted for the number of precipitation samples that were missing for the two different reasons outlined above (see Eqs. 44-45 and Appendix B of K2019).
Practical experience since the publication of K2019 has revealed at least three important limitations in the approach outlined above (and detailed in Sect. 4.2 and Appendix B of K2019). First, although this approach can work well if values of y_j or x_{j,ℓ} are missing at random, in non-random cases it can lead to the covariances being estimated from inconsistent sets of values. For example, if y_j is missing for some particular j, the corresponding values of x_{j,ℓ} will still be used in estimating the covariance matrix cov(x_ℓ, x_ℓ'). This is advantageous if the values are missing at random, because all the covariances will include as many data pairs as possible. However, the covariance estimates could become inconsistent, with potentially substantial consequences for the solution to Eq. (A5), if the values are missing non-randomly. In our case, the missing values are inherently structured, because a single missing precipitation tracer concentration causes a diagonal line of missing values in X, and a single missing streamflow tracer concentration causes a missing row in X and two missing values in y. The second problem is that our robust estimation procedure depends on iteratively reweighted least squares
(IRLS), which in turn requires us to calculate the regression residuals, which is impossible for any time step that is missing either y_j or any of the x_{j,ℓ}. The third problem is that estimating the uncertainties in the TTD requires the error variance, which again requires calculating the residuals. This last problem can be circumvented by using Glasser's error variance formula (Eq. 52 of K2019), but K2019 warns that this formula can yield implausible results, including negative error variance values (which are of course logically impossible).

Here, rather than removing the missing values and using Glasser's error variance formula, we instead fill in the missing values and calculate the residuals directly by inverting Eq. (A1), thus facilitating both robust estimation using IRLS and direct calculation of the error variance for purposes of uncertainty estimation. The key to this approach is that we subtract the means from y and from each column of X (or subtract the weighted means in the case of volume-weighting), and then fill in the missing values with zeroes. Because each of the variables has already been "centered" to have a mean of zero, the in-filled values of zero will have no effect on the solution of Eq. (A1); in statistical terms, they will exert no leverage.
This approach also has the advantage that the intercept in Eq. (A1) becomes zero and drops out of the problem.
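The centering-and-infilling step can be sketched as follows (Python, illustrative only; NaN marks missing values, and the function name is ours, not that of the scripts):

```python
import numpy as np

def center_and_infill(y, X, weights=None):
    """Subtract (optionally weighted) means from y and from each
    column of X, computed over the non-missing values, then replace
    missing values (NaNs) with zeroes.  Because the variables are
    centered first, the in-filled zeroes exert no leverage on the
    regression.  A sketch of the infilling step, not the full EHS code."""
    y = np.array(y, dtype=float)
    X = np.array(X, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)

    def wmean(v):
        ok = ~np.isnan(v)                      # use only non-missing values
        return np.sum(w[ok] * v[ok]) / np.sum(w[ok])

    y = y - wmean(y)
    for col in range(X.shape[1]):
        X[:, col] = X[:, col] - wmean(X[:, col])
    y[np.isnan(y)] = 0.0                       # in-fill after centering
    X[np.isnan(X)] = 0.0
    return y, X
```

Note that because the non-missing values sum to zero after centering, each in-filled column still has an overall mean of exactly zero, which is what makes the infilled zeroes leverage-free.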
Broadly speaking, the solution proceeds similarly to Sects. 4.2-4.4 of K2019, with several important differences. One is that the covariance matrix now requires different prefactors than those used in K2019 to account for the two different types of missing data, because missing values will affect the covariances differently now that they are being in-filled with zeroes. In principle, a value of x_{j,ℓ} that is missing because of a sampling or analysis failure represents missing information, so it should not alter the covariances in which it appears. However, those covariances will be diluted when the missing value is replaced by zero (since it will add nothing to the cross-products in the numerator of the covariance formula, but will add to the total number in the denominator). The resulting covariances must be re-scaled to reverse this dilution artifact. By contrast, values of x_{j,ℓ} that are missing because no rain fell should dilute the covariances in which they appear, because with no precipitation, they could not cause any co-varying change in streamflow tracer concentrations. Thus replacing these missing values with zeroes correctly dilutes the corresponding covariances.
A sketch of the solution procedure is as follows. First we identify and remove outliers in the precipitation and streamflow tracer concentrations, as described in Sects. 2.1 and 4.1, and use the remaining values to calculate the y_j and x_{j,ℓ} using Eqs. (A2) and (A3). Next we calculate a matrix u_{j,ℓ} of boolean flags that indicate whether a given value of x_{j,ℓ} will be usable or not, according to the criteria outlined in the paragraph above: a value of x_{j,ℓ} is unusable if it is missing (and thus will need to be replaced by zero) and its corresponding value of precipitation is above p_threshold (and thus its contribution to the covariances would not be nearly zero anyway). If either of these conditions is not met, the value of x_{j,ℓ} is usable (potentially with replacement by zero, if it is missing and corresponds to below-threshold precipitation inputs). In more explicit form,

u_{j,ℓ} = 0 if x_{j,ℓ} is missing and its corresponding precipitation exceeds p_threshold, and u_{j,ℓ} = 1 otherwise. (A6)

We then eliminate any rows of y, X, and u for which y_j is missing and/or all of the x_{j,ℓ} are missing, because in such cases Eq. (A1) would have no meaningful solution. Next, we subtract the means (or the weighted means) from y and from each column of X, and replace the missing values of x_{j,ℓ} with zeroes (there will be no missing values of y_j at this stage).
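The flagging logic of Eq. (A6) can be expressed in a few lines (Python, with hypothetical variable names; a sketch of the logic only, not the scripts' implementation):

```python
import numpy as np

def usability_flags(X, precip_lagged, p_threshold):
    """Boolean flag matrix u (Eq. A6): an element x[j, l] is unusable
    (u = 0) only if it is missing AND its corresponding precipitation
    exceeds p_threshold; otherwise u = 1.  Here precip_lagged[j, l]
    holds the precipitation amount corresponding to x[j, l]."""
    X = np.asarray(X, dtype=float)
    precip_lagged = np.asarray(precip_lagged, dtype=float)
    missing = np.isnan(X)
    unusable = missing & (precip_lagged > p_threshold)
    return (~unusable).astype(int)

# Example: one tracer value missing during substantial rain (unusable),
# one missing during trivial rain (usable, to be in-filled with zero)
X = np.array([[1.0, np.nan],
              [np.nan, 2.0]])
P = np.array([[0.1, 5.0],
              [0.1, 5.0]])
print(usability_flags(X, P, 1.0).tolist())   # [[1, 0], [1, 1]]
```

These flags feed directly into the prefactors of Eqs. (A9) and (A10), which rescale the covariances to undo the dilution caused by in-filled zeroes at unusable positions.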

If robust estimation is selected, we then solve the multiple regression in Eq. (A1) using IRLS. We do not use the regression coefficients from the IRLS procedure, but instead use its robustness weights to down-weight anomalous points.
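The IRLS step can be sketched generically as follows (Python; Tukey's bisquare weight function is used here for illustration and may differ from the weight function in the actual scripts):

```python
import numpy as np

def irls_robustness_weights(X, y, n_iter=20, c=4.685):
    """Iteratively reweighted least squares with Tukey's bisquare
    weight function.  Only the final robustness weights are returned
    (to down-weight outliers); the coefficients themselves are
    discarded, as described in the text.  A generic IRLS sketch."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y))
    for _ in range(n_iter):
        Xw = X * w[:, None]                     # weighted design matrix
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        resid = y - X @ beta
        # robust scale estimate from the median absolute deviation (MAD)
        s = 1.4826 * np.median(np.abs(resid - np.median(resid)))
        s = max(s, 1e-12)                       # guard against zero scale
        u = resid / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return w
```

Points with large residuals relative to the robust scale estimate receive weights near zero, so a small number of gross outliers cannot distort the covariances from which the TTD is computed.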
Using the robustness weights w_j from the IRLS procedure, we then calculate the weighted covariances

cov(x_ℓ, y) = Σ_j w_j (x_{j,ℓ} − x̄_ℓ)(y_j − ȳ) / Σ_j w_j , (A7)

cov(x_ℓ, x_ℓ') = Σ_j w_j (x_{j,ℓ} − x̄_ℓ)(x_{j,ℓ'} − x̄_ℓ') / Σ_j w_j , (A8)

where n_eff = (Σ_j w_j)² / Σ_j w_j² is the effective number of equally-weighted points, which accounts for the unevenness of the weights (if all of the weights are equal, n_eff equals n, the length of the vector y). The means shown in Eqs. (A7) and (A8) all equal zero, but they are preserved here so that the formulas can be readily recognized. To account for the contrasting types of missing values as outlined above, we multiply each of the covariances by prefactors η_ℓ and η_ℓℓ', defined as

η_ℓ = n / Σ_j u_{j,ℓ} , (A9)

and

η_ℓℓ' = n / Σ_j u_{j,ℓ} u_{j,ℓ'} . (A10)

With these prefactors, the solution to Eq. (A1) becomes

β = [η_ℓℓ' cov(x_ℓ, x_ℓ')]⁻¹ [η_ℓ cov(x_ℓ, y)] . (A11)

To solve the same problem with Tikhonov-Phillips regularization, we instead solve

β = (C + λ H)⁻¹ [η_ℓ cov(x_ℓ, y)] , (A12)
where C is the covariance matrix with elements η_ℓℓ' cov(x_ℓ, x_ℓ'), H is the Tikhonov-Phillips regularization matrix, and λ controls the relative weight given to the regularization criterion (see Eqs. 49 and 50 of K2019).

To estimate the uncertainties in the regression coefficients, we calculate the residuals by inverting Eq. (A1), recalling that α = 0:

ε_j = y_j − Σ_ℓ β_ℓ x_{j,ℓ} . (A13)

We then calculate the (weighted) residual variance, accounting for both the degrees of freedom and any unevenness in the weights,

s_ε² = [n_eff / (n_eff − (m+1))] Σ_j w_j (ε_j − ε̄)² / Σ_j w_j , (A14)

where the (weighted) mean of the residuals ε̄ should be zero, but it is included here for completeness. We then calculate the standard errors of the regression coefficients, following Eq. (54) of K2019, using

SE(β_ℓ) = sqrt( (s_ε² / n_eff) [C⁻¹]_ℓℓ ) , (A15)

but with the difference that the unevenness in the weighting is already taken into account in Eq. (A14), so the effective sample size is now calculated following Eq. (13) of K2019 as

n_eff = n (1 − ρ) / (1 + ρ) , (A16)

where ρ is the lag-1 (weighted) serial correlation in the residuals ε_j. If the option correcting for serial correlation is set to false, the effective sample size is instead

n_eff = n . (A17)

The regression coefficients and their standard errors are then converted into TTDs and their associated standard errors using Eqs. (55), (60), and (63)-(66) of K2019. Readers should note, however, that these scripts do not explicitly take account of the sampling interval and its units. Thus their results should be interpreted as being in reciprocal units of the sampling interval, e.g., day⁻¹ for daily sampling and week⁻¹ for weekly sampling.
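The core of this prefactor-adjusted solution can be sketched in a few lines (Python, equal weights, hypothetical variable names; an illustration of the bookkeeping, not the scripts themselves):

```python
import numpy as np

def ttd_coefficients(Xz, yz, u):
    """Solve for the TTD regression coefficients from zero-infilled,
    centered data: covariances are computed over all n rows and then
    re-scaled by prefactors eta, which reverse the dilution caused by
    in-filled zeroes at unusable (flagged) positions.  Equal weights
    are assumed for simplicity."""
    Xz = np.asarray(Xz, dtype=float)
    yz = np.asarray(yz, dtype=float)
    u = np.asarray(u, dtype=float)
    n = len(yz)
    C = (Xz.T @ Xz) / n                  # cov(x_l, x_l'), means already zero
    c = (Xz.T @ yz) / n                  # cov(x_l, y), means already zero
    eta_l = n / u.sum(axis=0)            # prefactor for cov(x_l, y)
    eta_ll = n / (u.T @ u)               # prefactor for cov(x_l, x_l')
    return np.linalg.solve(eta_ll * C, eta_l * c)
```

With no flagged values (u all ones), the prefactors are all unity and this reduces to the ordinary least-squares solution on the centered data.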
The technical problem of performing such a calculation can be solved as described in Appendix A above, but this alone will not solve the mathematical problem created by the diagonal stripes of missing values. The mathematical problem is that the […]
where the diagonal sub-matrices are Tikhonov-Phillips regularization matrices (see Eq. 49 of K2019), and the off-diagonal sub-matrices are matrices of zeroes.
Benchmark tests verify that the approach outlined in Eq. (B5) yields much more accurate estimates than the approach outlined in K2019 does. Therefore this approach is employed in EHS_TTD whenever the input data are filtered according to precipitation time steps.

Figure caption (fragment): […] The robust estimates (dark blue symbols) closely follow the true values (gray line), until the outliers become so frequent that the robust estimation algorithm is no longer effective against them (f).

Figure 7 (caption, fragment): […] (shown relative to their lagged reference values, as in the left-hand side of Eq. 2), and fitting residuals (dark blue dots), for the nonstationary benchmark model with a humped time-averaged TTD (a), for the same model with the same parameters but with constant precipitation rates, and therefore a stationary humped TTD (b), and for the same model based on weekly rather than daily sampling (c). The observed and fitted tracer time series are shown relative to the reference tracer concentration (the streamflow concentration beyond the longest TTD lag; see Kirchner, 2019 for details). In (a), the multiple regression fit to the streamflow tracers generally exhibits the correct behavior, but with minor errors in amplitude and timing, resulting in residuals that exhibit strong serial correlation (lag-1 serial correlation of 0.96) and thus greatly exaggerated standard errors of the regression coefficients that define the TTD. By contrast, under a stationary benchmark model (b), achieved by holding the precipitation rate constant at its average value, the multiple regression fit to the streamflow tracers yields much smaller residuals (note the difference in scale) with little serial correlation (lag-1 serial correlation of 0.23). Weekly samples from the nonstationary benchmark model […]

Figure 9 (caption, fragment): […] applied to daily tracer time series, generated by the benchmark model using parameters that generate a humped distribution (see Fig. 6).
Some light blue symbols are invisible where they are overprinted by dark blue symbols. The light gray lines show the true new water fractions, as calculated from water age tracking in the benchmark model. These new water fractions are overestimated by both robust and non-robust methods.
Figure 10 (caption, fragment): […] using robust and non-robust methods (dark and light blue symbols, respectively; error bars indicate one standard error) applied to weekly tracer time series from the benchmark model using parameters that generate a humped distribution (see Fig. 6). Some light blue symbols are invisible where they are overprinted by dark blue symbols. The light gray lines show the true new water fractions, as calculated from water age tracking in the benchmark model. In contrast to the results from daily time series (Fig. 9), weekly tracer time series yield new water fraction profiles that are consistent with the true values determined by age tracking.