Load estimates are more informative than constituent
concentrations alone, as they allow quantification of on- and off-site
impacts of environmental processes concerning pollutants, nutrients and
sediment, such as soil fertility loss, reservoir sedimentation and irrigation
channel siltation. While statistical models used to predict constituent
concentrations have advanced considerably over the last few years,
measures of uncertainty on constituent loads are rarely reported. Loads are
the product of two predictions, constituent concentration and discharge,
integrated over a time period, which makes it difficult to
produce a standard error or a confidence interval. In this paper, a linear
mixed model is used to estimate sediment concentrations. A bootstrap method
is then developed that accounts for the uncertainty in the concentration and
discharge predictions, allowing temporal correlation in the constituent data,
and can be used when data transformations are required. The method was tested
for a small watershed in Northwest Vietnam for the period 2010–2011. The
results showed that confidence intervals were asymmetric, with the highest
uncertainty in the upper limit, and that a load of 6262 Mg year-1 was
estimated for 2010 with the full bootstrap method, with a 95 % confidence
interval of (4331, 12 267) Mg.

The environmental impact of processes such as erosion, sedimentation, eutrophication or degradation of aquatic ecosystems can only be quantified through reliable estimates of sediment, nutrient or pollutant loads (Walling and Webb, 1996). Monitoring constituent concentrations alone does not suffice as these provide information on in-stream quality but offer no means to evaluate outcomes such as reservoir siltation, erosion, soil fertility loss and pollution at the watershed scale – both on- and off-site. Despite abundant literature developing appropriate procedures for load estimates, most studies do not report a measure of uncertainty on the load (Kulasova et al., 2012).

In this paper, we will use the example of one of the most commonly measured constituents, suspended sediment, but the methodology developed is applicable to any constituent load. For suspended sediment, the most frequently used method to estimate loads is the so-called rating curve method (Gao, 2008; Horowitz, 2008). In this approach, the suspended sediment concentration (SSC) is predicted by some form of least squares regression with (often log-transformed) discharge as the explanatory variable. This approach introduces two sources of uncertainty into the load equation: the uncertainty in the sediment concentration equation (the so-called sediment rating curve), and the uncertainty in the discharge, as discharge is usually not measured directly, but rather is predicted from a stage–discharge rating curve, with water level as the predictor variable. Any measure of uncertainty on the constituent load must take into account the uncertainty in both the constituent concentration and the discharge.
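As a toy illustration of the rating-curve approach described above, the power-law relation between discharge and SSC can be fitted by ordinary least squares on the log–log scale. The data, parameter values and noise level below are synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: discharge Q (m^3/s) and sediment concentration SSC (mg/L)
# generated from a power-law rating curve with multiplicative noise.
Q = rng.uniform(0.1, 10.0, size=200)
true_a, true_b = 1.5, 0.8
SSC = np.exp(true_a + true_b * np.log(Q) + rng.normal(0.0, 0.3, size=200))

# Sediment rating curve: log(SSC) = a + b * log(Q), fitted by OLS.
b, a = np.polyfit(np.log(Q), np.log(SSC), 1)
```

With enough data the fitted (a, b) recover the generating values; in practice, the residual scatter around this curve is one of the two error sources that any uncertainty measure on the load must account for.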

Uncertainty of the sediment concentration prediction has been extensively
discussed and, depending on the catchment characteristics, generally good
concentration predictions are obtained with errors smaller than 15 %
(Horowitz, 2008). In some studies, however, the uncertainty is stated to be
considerable. Smith and Croke (2005) for example reported that discharge
only explained a quarter of the variability in the concentration data.
Walling and Webb (1988) suggested that the relationship between discharge
(Q) and sediment concentration may differ between seasons.

Therefore, authors that do report a measure of uncertainty often select a method that is specifically geared towards the application of load estimation at hand and not necessarily applicable to other sites, making it hard to compare results throughout the literature. Harmel et al. (2009), for example, developed a software tool to assess the errors introduced from estimating discharge, sample collection, preservation and storage, and lab analysis. In this tool, each of these sources is considered to be the result of random variability following a normal distribution. The sources of error are assumed to be independent of each other, and it is assumed that the errors follow an additive law, but these assumptions will not apply in all situations.

Moatar and Meybeck (2005) assessed uncertainty on nutrient loads by comparing loads based on a random subsample of measurement times, with a high-resolution load that is considered the “true” load. This approach is suitable for testing different methods and different temporal measurement resolutions of load estimation, but it does not assess the uncertainty of the “true” load, as it is assumed that, with sufficiently high sampling frequency (in their case, daily), the measured load is equivalent to the actual load. More recently, two new candidate approaches have emerged to calculate confidence intervals on loads that have the potential to be generally applicable, regardless of the method used to calculate the load and the distributional assumptions made: bootstrap methods (Mailhot et al., 2008; Rustomji and Wilkinson, 2008; Vigiak and Bende-Michl, 2013) and Bayesian methods that result in credibility intervals (Pagendam et al., 2014; Vigiak and Bende-Michl, 2013).

The bootstrap is a Monte Carlo-type method, where a large number (B) of new
datasets is generated by resampling from the original data and the statistic
of interest is recalculated for each of them, so that its sampling
distribution can be approximated empirically.

To assess uncertainty in constituent loads estimated from continuous
concentration and discharge predictions, we propose a bootstrap-based method
that can be used with transformed data, that accounts for the uncertainty in
both the sediment rating curve and the stage–discharge rating curve, and
that allows for serial correlation in the time series data. We
checked whether any of these requirements can safely be neglected in certain
circumstances, and how they affect the resulting confidence intervals. The
corresponding code in SAS was created and is available online to accommodate
these different scenarios.

Our specific aims were: (i) to establish a generally applicable method to calculate confidence intervals on constituent loads, using bootstrap methods, (ii) to account for serial correlation in the data, (iii) to assess whether or not the effect of the uncertainty on discharge is negligible, (iv) to evaluate how data transformations affect the calculations, and (v) to determine the number of bootstrap replicates required to obtain reliable confidence intervals. Combining these aspects, the proposed method provides a means to assess uncertainty on any type of constituent load which was calculated from continuous constituent concentration and discharge predictions estimated with regression-type methods. The approach thus allows load estimates to be reported with an uncertainty assessment, rather than as a point estimate alone, making them informative to end users and decision makers.

Discharge and suspended sediment concentrations were continuously monitored
for a period of 2 years (1 January 2010–31 December 2011) in a small agricultural
catchment in mountainous Northwest Vietnam. The catchment is located in the
Chieng Khoi commune (21

For the discharge monitoring, water levels were measured every 2 min for the
river station using pressure sensors (EcoTech, Germany). The stage–discharge
relationship was established with the velocity-area method (Herschy, 1995),
where the velocity is measured with a propeller-type current meter (OTT,
Germany) at one or more points in each vertical, depending on the water
depth. The discharge is subsequently derived from the sum of the product of
mean velocity, depth and width between verticals. Discharge measurements were
never taken on the same day, and the closest time interval between two
measurements was 1 week. The estimated discharge

As the irrigation management disturbed the natural relationship between

Rainfall was quantified with a tipping-bucket rain gauge on a weather station (Campbell Scientific, USA). Events were defined based on rainfall data (no pause in precipitation for longer than 30 min) and lag times were added based on cross-correlation analysis as described in Schmitter et al. (2012). A total of 420 rainfall events took place and were monitored during the 2-year study period.

Continuous sediment concentrations were then obtained from a mixed model described in Slaets et al. (2014). The response variable, sediment concentration, was Box–Cox-transformed to stabilize the variance using the SAS macro described in Piepho (2009). The optimal value of the transformation parameter was estimated by maximum likelihood, and the selected transformation was the logarithm (transformation parameter λ = 0). Other transformations, such as the square root, were inspected using residual plots and were found unsuitable for meeting the assumptions of normality and homoscedasticity. Predictor variables were chosen by forward selection based on the Akaike information criterion (AIC). The model uses turbidity and discharge as quantitative predictor variables, and accounts for serial correlation. As surface reservoir irrigation management was present in the watershed, classic variables related to catchment characteristics, such as hysteresis patterns and exhaustion effects, were not suitable predictors of sediment concentration. The predictor variables turbidity and discharge were also log-transformed. All samples from the 2-year study period were used to build the concentration prediction model, and load estimates for both years are thus predicted from the same model with the same parameter estimates. All statistical analyses were performed using the MIXED procedure of SAS 9.4, which can fit linear models with more than one random effect. The covariance structure used to model serial correlation in the present study was a first-order autoregressive (AR(1)) model, selected based on the AIC. Assumptions of normality and homogeneity of variance were checked visually using diagnostic plots.
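The maximum likelihood selection of the Box–Cox parameter can be illustrated in a few lines of numpy (a simplified stand-in for the SAS macro of Piepho (2009); the data are synthetic and log-normally distributed, so the estimated parameter should land near zero, i.e., the log transformation):

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter lam for a sample y > 0."""
    n = len(y)
    ylog = np.log(y)
    if abs(lam) < 1e-8:
        yt = ylog                        # lam = 0 corresponds to the log transform
    else:
        yt = (y ** lam - 1.0) / lam
    # -n/2 * log(sigma^2_hat) plus the Jacobian term (lam - 1) * sum(log y)
    return -0.5 * n * np.log(np.var(yt)) + (lam - 1.0) * ylog.sum()

rng = np.random.default_rng(1)
# Log-normal data: the variance-stabilizing transform is the log (lam = 0).
y = np.exp(rng.normal(2.0, 0.8, size=500))

grid = np.linspace(-1.0, 1.0, 201)
ll = np.array([boxcox_loglik(y, lam) for lam in grid])
lam_hat = grid[ll.argmax()]
```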

Conceptually, the concentration prediction error can also be separated into an underlying latent autoregressive process generating the true concentrations, and an independently distributed measurement error corresponding to white noise in time series data. The white noise is equal to the error that would remain if two measurements were conducted at almost coinciding time points. This variability is typically attributable to measurement error and is known in spatial statistics as a nugget effect. In the MIXED procedure, this effect was fitted by using the local option in the repeated statement.

Validation was performed using 5-fold cross validation, in which the dataset
is split randomly into five parts, and each part is used four times to
calibrate the model, and one time for validation, so that each observation in
the dataset is used for validation once. Pearson's correlation coefficient
(r) between the observed values and the cross-validated predictions was used
as a measure of predictive performance.
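The cross-validation scheme can be sketched as follows (synthetic data and a simple linear model stand in for the mixed model of this study):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# 5-fold cross validation: shuffle indices, split into 5 folds, fit on 4 folds
# and predict the held-out fold, so every observation is predicted exactly once.
idx = rng.permutation(n)
folds = np.array_split(idx, 5)
pred = np.empty(n)
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    b, a = np.polyfit(x[train], y[train], 1)
    pred[test] = a + b * x[test]

# Pearson correlation between observed values and cross-validated predictions
r = np.corrcoef(y, pred)[0, 1]
```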

In the non-parametric bootstrap, a large number (B) of new datasets is generated by resampling with replacement from the original observations, and the statistic of interest is recalculated for each resampled dataset. Resampling individual observations, however, assumes they are independent; for time series data, specialized block bootstrap methods have been developed that resample blocks of consecutive observations to preserve the serial correlation.

Among these specialized methods, no preferred method has emerged from the literature. Furthermore, many of these methods require a vast set of decisions such as for example the block size for which no general recommendation exists. As a consequence, results from different methods are not straightforward to compare. As the goal of the bootstrap is to mimic the original sampling process, however, there is an intuitive choice in the case of event-based sampling: the rainfall events form natural “blocks” or sampling units, which is why water quality models used to predict continuous time series and thus new events should be validated on an event basis, rather than on a sample basis (Lessels and Bishop, 2013). So rather than sampling with replacement from the individual observations (water samples representing a single time point), all samples belonging to one event can be resampled with replacement, thus keeping all observations within one event together and maintaining the serial correlation intact.

On the other hand, base-flow samples are typically taken at fixed time
intervals far apart in time (here every 2 weeks). They can therefore be
considered to be independent and can be resampled by simple random sampling
with replacement, thus bootstrapping individual water samples from single
time points. In a previous model published in Slaets et al. (2014), we
explored the use of several alternative variance–covariance structures to
model the serial correlation. The selected spatial power model unfortunately
caused non-convergence for a large number of the bootstrap replicates when
using it for bootstrap load estimates, and therefore the AR(1) structure was
implemented as it did not have convergence issues. The difference in AIC
between the AR(1) and spatial power models was four points. Therefore the
spatial power model is most likely the best performing model, but there is
still considerable support for the AR(1) model (Burnham and Anderson, 2002).
The spatial power structure with time as the coordinate showed that the
autocorrelation becomes nearly zero for samples taken more than 80 min apart
– which coincides with the average duration of rainfall events. Therefore
the base-flow samples were considered to be independent. An increase in AIC
of one point when fitting a first-order autoregressive covariance
structure confirmed the lack of serial correlation in the base-flow samples.
By resampling events with replacement for the storm-flow samples and
individual observations with replacement for the base-flow samples, each
bootstrap dataset preserves the serial correlation structure of the original
data.

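The two resampling rules can be sketched as follows (the event and base-flow values are invented; in the real application, each bootstrap dataset would be used to refit the sediment rating curve):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset: storm-flow samples grouped by rainfall event, and base-flow
# samples taken two weeks apart (treated as independent observations).
storm_events = {1: [0.8, 1.1, 1.3], 2: [2.0, 2.4], 3: [0.5, 0.6, 0.7, 0.9]}
baseflow = [0.10, 0.12, 0.09, 0.11]

def bootstrap_sample(storm_events, baseflow, rng):
    """One bootstrap replicate: resample whole events with replacement for
    storm flow (keeping within-event serial correlation intact) and
    individual observations with replacement for base flow."""
    event_ids = list(storm_events)
    chosen = rng.choice(event_ids, size=len(event_ids), replace=True)
    storm = [obs for e in chosen for obs in storm_events[e]]
    base = list(rng.choice(baseflow, size=len(baseflow), replace=True))
    return storm + base

rep = bootstrap_sample(storm_events, baseflow, rng)
```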
Flowchart showing the three-step bootstrap mechanism.

This resampling process accounts for the uncertainty that arises from estimating the parameters of the sediment rating curve from a dataset with a limited number of observations. If there were an unlimited number of water samples available, the uncertainty of these parameter estimates would decrease to zero. But it is more realistic to assume that, even if there were a very large number of samples available, there would still remain scatter in the real constituent concentrations around the equation, as the equation simply does not fully explain all the variation in sediment concentration. Sediment loads vary not only with discharge, but also with upstream sediment supply, which in turn depends additionally on geology, soil types, land cover and land use change or management, all influencing sediment quantity and quality (Walling, 1977). Therefore, there is a fundamental reason for the scatter in the data: sediment loads are inherently non-capacity loads. Even if there were an unlimited number of samples available, this would not result in a perfect equation to predict sediment concentration. Therefore this additional uncertainty needs to be taken into account. For the discharge rating curve, if the river bed is stable and the stream bank vegetation does not change, the stage–discharge equation has a high accuracy and it is reasonable to assume the only error in the equation is measurement error; therefore, this additional uncertainty is not a concern.

To introduce this second source of error on the sediment rating curve, Rustomji and Wilkinson (2008) and Vigiak and Bende-Michl (2013) added an additional step to the bootstrap process: a randomly drawn residual from the original regression equation was added to the expected value of the constituent concentration, so that the predicted concentration included both the uncertainty of the parameters of the rating curve due to having a finite sample, and the uncertainty that arises from the fact that sediment concentrations simply cannot be perfectly predicted by any equation, regardless of how large the observed dataset would be. However, by randomly resampling from the residuals, it is assumed that these residuals are independent.

When this assumption does not hold because samples are taken very closely
together in time, as was the case for our dataset, the method can be modified
so that the added errors reflect the temporal autocorrelation. To this end,
the covariance parameter estimates from the original sample can be used as
plug-in estimates. In the present dataset, an AR(1) structure was fitted to
the data (Verbeke and Molenberghs, 2009), resulting in two covariance
parameter estimates: one for the autocorrelation parameter (ρ) and one for
the residual variance (σ²). These plug-in estimates can then be used to
simulate an autocorrelated error series that is added to the concentration
predictions.
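Given plug-in estimates of the AR(1) parameters, an autocorrelated error series can be simulated as follows (the values of rho and sigma2 below are illustrative, not the estimates from this study):

```python
import numpy as np

def simulate_ar1_errors(n, rho, sigma2, rng):
    """Simulate an AR(1) error series with autocorrelation rho and marginal
    variance sigma2, starting from the stationary distribution."""
    e = np.empty(n)
    e[0] = rng.normal(0.0, np.sqrt(sigma2))
    # Innovation variance chosen so the marginal variance stays sigma2.
    innov_sd = np.sqrt(sigma2 * (1.0 - rho ** 2))
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal(0.0, innov_sd)
    return e

rng = np.random.default_rng(0)
errors = simulate_ar1_errors(20000, rho=0.6, sigma2=0.25, rng=rng)
```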

resampling with replacement from the (discharge, level) pairs used to fit the stage–discharge rating curve, to account for the uncertainty in the discharge predictions;

block-bootstrapping the water quality samples used to fit the sediment rating curve, resampling storm events as whole blocks and base-flow samples individually; and

adding an error term to the concentration predictions to account for the residual scatter that is inherent to the sediment concentration.

Finally, these bootstrap instantaneous load estimates can be summed up over
the whole time interval, resulting in B bootstrap estimates of the total
constituent load.
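A minimal sketch of the load summation for one replicate (units and values are illustrative):

```python
import numpy as np

# Instantaneous load = concentration * discharge; the total load over the
# period is the sum over equally spaced time points times the interval length.
# Units here: C in mg/L (= g/m^3), Q in m^3/s, dt in seconds.
dt = 120.0                         # 2 min measurement interval, in seconds
C = np.array([50.0, 80.0, 60.0])   # g/m^3
Q = np.array([0.2, 0.5, 0.3])      # m^3/s

load_g = np.sum(C * Q * dt)        # total load in grams over the 3 intervals
load_Mg = load_g / 1e6             # converted to megagrams (Mg)
```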

If the data are not normally distributed, it can be necessary to transform variables, as was done for this dataset with a Box–Cox transformation. In this case, the variables in question can simply be transformed before starting the bootstrap, and all the bootstrap estimates are obtained on the transformed scale. The back-transformation is then performed in Eq. (9) to obtain load estimates on the original scale. For example, in a typical case where both discharge and sediment concentration need to be log-transformed, the bootstrap predictions of discharge in Eq. (4) and of concentration in Eq. (8) will be on the log scale. These predictions then need to be back-transformed to the original scale using the inverse of the logarithm.

This approach is applicable to any type of data transformation, and thus
offers a flexible framework that can accommodate different methods of
estimating the constituent concentration. However, if a modeled residual
error term is not added to the predictions, the naive back-transformation
yields biased load estimates.

If a data transformation is required and one does not want to explicitly simulate the residual scatter, then a correction factor must be applied to the back-transformed concentration. This correction is needed because the naïve back-transformation (for example, taking the exponent of the predictions if the predictions are on the log scale) does not yield a predicted mean, but rather a predicted median. While medians can be informative measures of central tendency for skewed datasets, they are not appropriate when the objective is to calculate a constituent load: loads are sums over equally spaced time points, and in order to obtain an unbiased estimate of this sum over time intervals, we need to sum up estimates of the expected values, rather than the medians, for each interval.

The required correction factor is specific to the type of data transformation. For a logarithmic transformation, the expected value can be obtained by adding on half of the residual error variance to the predicted concentration on the log scale before back-transforming. For other cases of the Box–Cox transformation, the correction depends on the selected transformation parameter. Solutions for specific examples of the transformation parameter can be found in Freeman and Modarres (2006). As the selected transformation in this dataset was the logarithm, the correction of adding half the residual error variance before back-transforming was compared to the approach where the error is simulated, in order to see how this affects sediment load estimates and the resulting confidence intervals.
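For the log transformation, the correction can be verified numerically (synthetic log-normal concentrations; the values of mu and sigma2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# If log(C) ~ Normal(mu, sigma2), then exp(mu) is the *median* of C, while
# the *mean* is exp(mu + sigma2/2). Loads are sums of expected values, so
# the naive back-transform systematically underestimates them.
mu, sigma2 = 1.0, 0.5
c = np.exp(rng.normal(mu, np.sqrt(sigma2), size=200000))

naive = np.exp(mu)                   # median of C (naive back-transform)
corrected = np.exp(mu + sigma2 / 2)  # mean of C (corrected back-transform)
```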

A straightforward way to calculate a confidence interval (CI) on a parameter after bootstrapping is the bootstrap percentile method (Efron and Tibshirani, 1993). If a 95 % CI is required, the confidence interval would simply be calculated by ordering the bootstrap load estimates from small to large and taking the 2.5th and 97.5th percentiles as the lower and upper limits.
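A minimal sketch of the percentile method (the bootstrap load estimates below are simulated from a right-skewed distribution purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
# Stand-in for B = 2000 bootstrap estimates of the annual load (in Mg),
# drawn here from a right-skewed (log-normal) distribution.
boot_loads = rng.lognormal(mean=np.log(6000), sigma=0.25, size=2000)

# Bootstrap percentile 95 % CI: take the 2.5th and 97.5th percentiles
# of the ordered bootstrap estimates as the lower and upper limits.
lower, upper = np.percentile(boot_loads, [2.5, 97.5])
```

Because the bootstrap distribution is skewed, the resulting interval is asymmetric around the median, as observed for the loads in this study.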

This method was used by Rustomji and Wilkinson (2008) on sediment loads and is transformation-respecting, even when the sample statistic is not normally distributed (Efron and Tibshirani, 1993). This property is important in the case of loads, because data are typically log-transformed. As a confidence interval depends on the tail of the empirical bootstrap distribution, where fewer samples occur, a relatively large number of bootstrap replicates (upward of 500) is usually required to achieve acceptable accuracy (Efron and Tibshirani, 1993). Exactly how many depends on the statistic in question and should be tested empirically for each case: when the process is repeated, the resulting CI should not differ greatly; otherwise the number is too small. In the present dataset, a choice of 2000 bootstrap replicates yielded replicable results.

Improving upon the bootstrap percentile method, Efron and Tibshirani (1993) proposed bias-corrected and accelerated intervals, used by Vigiak and Bende-Michl (2013). Unfortunately, this approach requires an even larger number of bootstrap replicates than the percentile method to sufficiently reduce the Monte Carlo sampling error. This is a disadvantage when working with hydrological time series, as the datasets typically contain a large number of records already. This method then quickly becomes time consuming, and therefore in this paper, preference was given to the more intuitive and less computationally intensive bootstrap percentile method.

The proposed three-step bootstrap process offers an opportunity to assess the
importance of different aspects of the load calculation for the accuracy of
the estimate. By leaving out step 1 (bootstrapping the (discharge, level)
pairs), one assumes that the uncertainty in the stage–discharge rating curve
is negligible.

As the accuracy of the stage–discharge relationship depends on the type of
streambed, the method chosen and the number of measurements taken, this
assumption might also hold true for some watersheds such as the one in this
study, where the relationship had a high coefficient of determination
(R2 = 0.98).

Discharge rating curve plotted on the log-transformed scale showing
the 95 % confidence interval for the regression line (dark grey) and for
new predictions (light grey). The stage–discharge rating curve models
log(discharge) as a linear function of log(water level).

The coefficient of determination of the stage–discharge relationship was
0.98 on the log-transformed scale.

Residual plot for the discharge rating curve, showing studentized residuals versus the predicted discharge (on the log-transformed scale).

Observed versus predicted values of the sediment rating curve.
Predictions are from the linear mixed model with turbidity and discharge as
quantitative predictor variables, shown after 5-fold cross validation.

Residual plot of the sediment concentration prediction model; studentized residuals versus the predicted sediment concentration (on the log-transformed scale).

Annual sediment load estimates (in Mg per year) for the 2 years of the study directly estimated without bootstrapping, and load estimates with 95 % confidence interval limits and interval widths (difference between the upper and lower limits) for the three different bootstrap methods: the full method shown in Fig. 1, the method without modeled error (i.e., leaving out Step 3 in Fig. 1) and the method without bootstrapping discharge (i.e., leaving out Step 1 in Fig. 1) (n/a: not applicable).

The size of the estimated load depended on the method chosen for estimation. First, the load was calculated directly from the model estimates based on the full datasets, without bootstrapping (Direct estimate in Table 1). The sediment concentrations in this case were back-transformed by applying the correction appropriate for log-transformed data, which is to add half the residual error variance before back-transformation. Second, the median of the bootstrap estimates of the sediment load was taken, where, identically to the first case, the concentrations were corrected by adding half the residual error variance before back-transforming (Bootstrap without modeled error in Table 1). Third, the median of the bootstrap estimates was taken for the bootstrap process that included a modeled, autoregressive error term (Full bootstrap method in Table 1).

For this last estimation method, the annual sediment load was estimated to be 6262 Mg in 2010 and 5543 Mg in 2011 (Table 1). When the median from the bootstrap sediment load estimates was taken without modeled error, but rather applying the back-transformation correction, the load was approximately 5 % higher for both annual and monthly load estimates (Table 1 and Fig. 6). The annual loads thus amounted to 6575 Mg in 2010 and 5839 Mg in 2011. Finally, if sediment loads were estimated not by bootstrapping, but directly from the data, then the results were around 10 % lower compared to the first estimates, at 5607 and 4997 Mg, respectively, in 2010 and 2011.

Monthly sediment load estimates (in Mg) for the 2 years of the study, with
95 % confidence interval limits for the three different bootstrap methods.

Histograms of bootstrap load estimates on the original scale (left) and the log scale (right) for 2 study years and for two bootstrap methods: the full method with modeling of the autocorrelated error (“full method”, top), and without modeling of the error (“no md. err.”, bottom).

In all three approaches the difference between the 2 years remained consistent and all estimates were within the bounds of the confidence intervals, both for those calculated by modeling error and those calculated by adding half the variance before back-transformation.

In this particular 2 km

Before looking at the bootstrap confidence intervals, the histograms of the bootstrap load estimates were evaluated (Fig. 7). The histogram of the 2000 bootstrap estimates looked reasonably smooth, so we concluded that this number of replicates was adequate for the percentile bootstrap. When the number of bootstrap replicates is reduced (Fig. 8), the loss of smoothness, especially in the right tail, becomes visible. Tail smoothness of the empirical distribution is a requirement when using the percentile method to obtain confidence intervals (Efron and Tibshirani, 1993). At 500 bootstrap replicates, the center of the distribution displays a lack of smoothness as well, thus affecting not only the confidence interval estimates, but the load estimates as well. For both years, and both for the full method and the method without modeled error, the histograms were skewed to the right, even when the loads were log-transformed. This skewness means that, in the case of our dataset, the assumption of normality would not hold for estimated annual loads.

As a result of the distribution of the loads, the confidence intervals were always asymmetric, with the difference between the upper limit and the estimate around 80 % larger than the difference between the estimate and the lower limit. The width of the intervals – the difference between the upper and lower limits of the interval – varied between years and between methods, while remaining on the same order of magnitude (Table 1). In 2010, the interval was always wider, regardless of which method was chosen, for the annual as well as the monthly loads (Table 1 and Fig. 6). The year 2010 contained a smaller proportion of the samples (73 out of 228), and this could be a cause of the difference. For the monthly loads (Fig. 6), confidence intervals were widest during months with the highest sediment loads (July till October), when excess reservoir water gets exported via the river.

Effect of the number of bootstrap replicates (1500, 1000 and 500) on the smoothness of the resulting empirical distribution for the estimated annual sediment load in 2011.

The bootstrap method affected the width of the confidence interval as well. The monthly and annual intervals resulting from applying a back-transformation correction were consistently wider than those resulting from the bootstrap process that modeled the autocorrelated error: not modeling the error changed the interval (limits in Mg) from (4331, 12 267) to (4372, 14 586) in 2010 and from (3593, 8975) to (3713, 10 410) in 2011 – in both cases an increase in width of about 20 %. The change was due to an increase in the upper bound of the interval, while the lower limits remained very similar. These results show that performing the back-transformation correction is only a very rough method of adjusting the predicted concentrations on the original scale, as this approach does not take the serial correlation in the data into account. For the monthly load estimates, the largest differences in confidence interval width between the full method and the back-transformation without simulated error were in July and August 2010, the months with the highest estimated loads (Fig. 6).

When, rather than applying the full bootstrap method, we did not bootstrap the discharge rating curve (i.e., we left out Step 1 of the process in Fig. 1), the width of the confidence interval decreased, as one less source of error is taken into account. In 2010, this changed the CI from (4203, 11 649) without accounting for uncertainty in the discharge rating curve to (4331, 12 267) when accounting for this uncertainty; and from (3521, 8397) to (3593, 8975) in 2011 – including the discharge uncertainty therefore increased the width by 6 and 9 %, respectively. Similarly, for the monthly load estimates, not bootstrapping the discharge resulted in confidence interval widths up to 37 % smaller than those calculated with the full method (Fig. 6). Months with low flow showed confidence intervals that were as strongly compressed as those of months with high discharge, during which the reservoir spillover was feeding the river (July to October).

Change in the median and 95 % confidence intervals for the sediment load estimate of 2010 (in Mg) when decreasing the coefficient of determination of the discharge rating curve. The bold line indicates the CI width of the real (discharge, level) dataset. The letter “a” corresponds to not bootstrapping the (discharge, level) pairs.

The accuracy of the (discharge, level) dataset was explored by artificially
decreasing the coefficient of determination of the discharge rating curve
and repeating the bootstrap: the confidence interval of the 2010 sediment
load estimate widened as the rating curve became less accurate.

The bootstrap approach in which the concentration prediction error was separated into an underlying latent autoregressive process generating the true concentrations and an independently distributed measurement error (white noise) did not converge for 906 out of 2000 runs. Convergence problems are very common when fitting nugget models, as these models tend to be difficult to fit. AR(1)-type error structures in particular are prone to these issues, as there is an inherent confounding between the parameters of the independent white noise component and the autocorrelated component (Piepho et al., 2015). In a bootstrap setting where convergence was already an issue, adding such an effect was not feasible for our dataset. For exploratory purposes, the nugget model can be fitted to the original dataset without bootstrapping, in order to examine the contributions of the respective error components. This exercise showed that the measurement error variance (0.67) was indeed large compared to the latent process variance (0.09). The measurement error is attributable to sensor error (from both the turbidity sensors and the pressure sensors used for discharge), to the manual grab sampling process, which may not accurately represent the mean concentration across the stream cross section, and to laboratory error in determining the sediment concentrations. The error separation thus indicates that focusing on these factors could yield a substantial improvement in the sediment rating curve.

As was shown in Table 1, the annual sediment load estimates differ depending on the method selected. While it is encouraging that all estimates are within the 95 % confidence interval limits, choosing a different method can lead to anything from an underestimation of 10 % to an overestimation of 20 % compared to the median of the full bootstrap process. Two issues play a role in these differences: the back-transformation of the sediment concentrations, and bias in the estimate of the annual load.

The effect of back-transforming the concentration predictions is visible when comparing the medians of the bootstrap estimates with and without modeled autocorrelated error. When the error was not modeled, the estimate itself increased by around 5 % in both years, corresponding to an absolute increase of around 300 Mg of sediment, and the CI became wider. Essentially, adding half the variance before back-transformation is a very rough way of estimating expected values of concentrations at observed time points – as shown by the larger CI – because it does not take the serial correlation in the data into account. If the naïve back-transformation were applied, without any variance correction, the resulting estimates would be even lower than those obtained by adding half the variance before back-transformation: around 4200 Mg in 2010 and 3700 Mg in 2011, an underestimation of approximately 2000 Mg.

The naïve back-transformation is a relatively common way to back-transform constituent concentrations predicted on the log scale, but it may not be the most appropriate solution when the concentrations are used to calculate loads. The crucial issue is that a load is a sum over time points, which is essentially the same as computing an arithmetic average, and for that we need the expected values of the concentrations at the individual time intervals. If the predicted value on the log scale is simply back-transformed, we estimate medians of the concentrations; this may be appropriate when only the concentrations are of interest, but medians cannot be multiplied by discharge and summed to accurately predict a load.
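The median-versus-mean effect of the naïve back-transformation is easy to verify numerically. The sketch below uses hypothetical log-scale parameters (mu = 3, sigma = 0.8), not values from the study, and ignores serial correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 3.0, 0.8                     # hypothetical log-scale mean and residual SD
conc = np.exp(rng.normal(mu, sigma, 200_000))   # simulated concentrations

naive = np.exp(mu)                       # naive back-transform of the log-scale prediction
corrected = np.exp(mu + sigma**2 / 2)    # parametric lognormal mean correction

print("sample mean (what a load needs):", conc.mean())
print("sample median                  :", np.median(conc))
print("naive back-transform           :", naive)      # recovers the median
print("corrected back-transform       :", corrected)  # recovers the mean
```

The naïve estimate recovers the median of the concentrations, so multiplying it by discharge and summing systematically underestimates the load.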

When the bootstrap process includes a simulated autocorrelated error, its result is not a mean or a median concentration but a simulated realization of the observed process. When simulating the error in the bootstrap is not desired, applying a back-transformation correction is an alternative, but the confidence intervals should then be expected to be wider, as adding half the residual error variance before back-transformation ignores the serial correlation. Another back-transformation correction often used in the literature, Duan's smearing correction, similarly assumes independent and identically distributed errors and is therefore not suitable for datasets with serial correlation (Duan, 1983); in contrast to the two parametric approaches used here, it is a non-parametric correction.
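For comparison, Duan's smearing correction multiplies the naïve back-transformed prediction by the mean of the exponentiated residuals. A minimal sketch with simulated, deliberately i.i.d. data (the setting in which the correction is valid); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log-log rating data with i.i.d. errors (the smearing assumption)
n = 2000
logq = rng.uniform(0.0, 3.0, n)
logc = 1.0 + 0.8 * logq + rng.normal(0.0, 0.5, n)

# Ordinary least squares fit of the rating curve on the log scale
b, a = np.polyfit(logq, logc, 1)       # slope first, then intercept
resid = logc - (a + b * logq)

# Duan's smearing factor: mean of the back-transformed residuals
smear = np.mean(np.exp(resid))

pred_naive = np.exp(a + b * logq)
pred_smeared = pred_naive * smear      # bias-corrected predictions

print("smearing factor:", smear)       # close to exp(0.5**2 / 2) ~ 1.13 here
print("mean observed  :", np.exp(logc).mean())
print("mean corrected :", pred_smeared.mean())
```

Because the factor is an average over the empirical residuals, it requires no distributional assumption, but it does require the residuals to be exchangeable, which serially correlated data violate.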

The back-transformation of the concentration predictions is not the only force at work, however: the direct estimate from the data and the bootstrap median without modeled error are quite far apart, even though both use the same back-transformation correction. Statistics, unless they are very simple (for example a sample mean), typically carry some bias. Bootstrapping can identify and correct this bias even when the true underlying distribution is unknown, which is why the bootstrap estimate generally differs from the direct estimate (Efron and Tibshirani, 1993). Alternative methods exist to remove the bias on load estimates (Ferguson, 1986), but because the correction depends on the variance of the data, such numerical corrections are not generally applicable. Since one needs to bootstrap in any case to produce a CI for the load, taking the median of the bootstrap estimates is a straightforward way to obtain constituent load estimates.
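The mechanics of using the bootstrap to quantify bias and obtain a percentile CI can be sketched on a toy statistic; the data and statistic below are illustrative and unrelated to the sediment model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy skewed sample and a nonlinear (hence generally biased) statistic
x = rng.lognormal(mean=2.0, sigma=1.0, size=300)

def statistic(sample):
    return np.exp(np.log(sample).mean())   # geometric mean

theta_hat = statistic(x)

# Nonparametric bootstrap: resample with replacement, recompute the statistic
B = 2000
boot = np.array([statistic(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

bias = boot.mean() - theta_hat          # bootstrap estimate of the bias
ci = np.percentile(boot, [2.5, 97.5])   # percentile confidence interval

print("direct estimate :", theta_hat)
print("bootstrap median:", np.median(boot))
print("bootstrap bias  :", bias)
print("95 % CI         :", ci)
```

The same resamples thus deliver both a bias diagnosis and an interval, which is why the median of the bootstrap estimates comes at no extra cost once a CI is required.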

The most common data transformations in load estimation are typically other variations of the Box–Cox power family, y(λ) = (y^λ − 1)/λ for λ ≠ 0, of which the log transformation is the limiting case for λ → 0.
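For reference, the Box–Cox power family and its limiting log case can be written in a few lines of code (illustrative only; the example values are arbitrary):

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox power transform: (y**lam - 1)/lam for lam != 0, log(y) for lam == 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([1.0, 10.0, 100.0])
print(boxcox(y, 1.0))    # linear shift: y - 1
print(boxcox(y, 0.5))    # square-root-type transform: 2*(sqrt(y) - 1)
print(boxcox(y, 0))      # the log transform as the lam -> 0 limit
```

Any member of this family introduces the same retransformation problem as the log: back-transformed predictions are not expected values on the original scale.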

Regarding the data transformation, while the sediment concentration was log-normally distributed, the log-transformed load estimates were not normally distributed (Fig. 7, right panel). This non-log-normality of our loads does not affect the viability of the bootstrap approach, as regression-type methods do not require (log-) normality of the observed concentration or load data but rather normality of the residuals of the fitted linear regression models. It does, however, limit the applicability of methods that use the log-normality assumption of the load to estimate a variance for the load. This assumption was made, for example, by Wang et al. (2011) in using the delta method as an alternative way to assess uncertainty on annual sediment load estimates.

The results showed that the CIs are relatively wide and asymmetrical, with much larger uncertainty in the upper limit. Comparing the 2 years, the uncertainty was larger in the year with the higher estimated load (2010), a trend consistent with other studies (Kuhnert et al., 2012; Rustomji and Wilkinson, 2008). Although it is difficult to compare uncertainty calculated with other methods in different catchments, our confidence intervals are of the same order of magnitude as the CIs in those two studies; Kuhnert et al. (2012), for example, calculated 80 % confidence intervals on an annual load of 5232 Mg.

The factors governing the width of a confidence interval are essentially the sample size and the accuracy of the two rating curve estimates, the latter measured in this study by Pearson's correlation between observed and predicted values.

These effects on the CI indicate that overfitting is indeed a concern, even when interpolating within the time series. The risk of overfitting is particularly high for more complex models (Burnham and Anderson, 2002), as the example above demonstrated, and it is not uncommon in load estimation to fit very flexible models (e.g., spline functions, sigmoid functions) and/or a large number of predictor variables to a relatively small dataset. In such cases, bootstrap uncertainty assessment can serve as an additional tool both for model selection and for evaluating model fit. After cross validation the change in percentage variance explained is less pronounced, ranging from 56 to 64 %, implying that cross validation penalizes at least partially for any overfitting. Water quality models, however, are often not validated, and only the goodness of fit on the calibration data is reported.
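The way cross validation penalizes flexible rating-curve models can be demonstrated on synthetic data. The sketch below fits polynomials of increasing degree to a hypothetical log-log rating relationship; the degrees, sample size and noise level are arbitrary choices, not the models used in this study:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical rating-curve data: the true relation is linear on the log scale
logq = rng.uniform(0.0, 3.0, 60)
logc = 1.0 + 0.8 * logq + rng.normal(0.0, 0.6, 60)

def r2(obs, pred):
    """Fraction of variance explained."""
    return 1 - np.sum((obs - pred)**2) / np.sum((obs - obs.mean())**2)

def cv_r2(x, y, degree, k=5):
    """k-fold cross-validated variance explained for a polynomial rating curve."""
    idx = rng.permutation(x.size)
    pred = np.empty_like(y)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        pred[fold] = np.polyval(coef, x[fold])
    return r2(y, pred)

for degree in (1, 4, 8):
    fit = r2(logc, np.polyval(np.polyfit(logq, logc, degree), logq))
    print(f"degree {degree}: in-sample R2 = {fit:.2f}, CV R2 = {cv_r2(logq, logc, degree):.2f}")
```

The in-sample fit can only improve as terms are added, whereas the cross-validated fit stops improving (or degrades) once the extra flexibility only chases noise.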

One would expect that, as the sediment rating curve carries much more uncertainty than the discharge rating curve, excluding the latter would hardly affect the confidence intervals. For our dataset this assumption did not hold: even though the discharge rating curve was highly accurate, excluding its uncertainty from the bootstrap noticeably narrowed the confidence intervals.

This result underscores the importance of error propagation in uncertainty assessments. Even though the discharge rating curve has high accuracy, an estimate of the uncertainty on the load must still account for it, because the discharge error propagates twice: once through the rating curve used to predict the concentration, and once through the multiplication in the instantaneous load equation.

The approach developed in this paper provides a means to assess uncertainty in any type of constituent load calculated from continuous constituent concentration and discharge predictions estimated with regression-type methods. Compared to ordinary least squares regression, the bootstrap yields bias-corrected load estimates, can take serial correlation into account when present, and provides a measure of uncertainty on the load estimate.

The results show that, even when the uncertainty of the discharge rating curve is small, it is important to account for error propagation, since discharge is used both as a predictor variable for the constituent concentration and in the instantaneous load equation. Application of the method in different watersheds, at different spatial and temporal scales, could elucidate whether discharge is an important driver of uncertainty in those settings as well.
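A small Monte Carlo sketch illustrates this double role of the discharge error. All parameters (rating-curve coefficients, a 10 % multiplicative discharge error, the hourly series) are hypothetical; the point is only that adding discharge uncertainty widens the spread of the simulated loads even when the concentration error dominates:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical log-log rating curve c = exp(a + b*log(q)) and a year of hourly
# discharge. The discharge error enters the load twice: through the rating
# curve prediction and through the multiplication in the load equation.
a, b = 1.0, 1.2
q_true = rng.lognormal(mean=1.0, sigma=0.7, size=8760)   # m3/s, hourly values
dt = 3600.0                                              # seconds per time step

def load(q, c_err_sd=0.3, q_err_sd=0.0):
    q_obs = q * np.exp(rng.normal(0.0, q_err_sd, q.size))  # multiplicative Q error
    c = np.exp(a + b * np.log(q_obs) + rng.normal(0.0, c_err_sd, q.size))
    return np.sum(c * q_obs * dt) / 1e9                    # arbitrary mass units

# Monte Carlo spread of the load with and without discharge uncertainty
no_q_err = np.array([load(q_true, q_err_sd=0.0) for _ in range(500)])
with_q_err = np.array([load(q_true, q_err_sd=0.1) for _ in range(500)])

print("CI width, concentration error only:",
      np.ptp(np.percentile(no_q_err, [2.5, 97.5])))
print("CI width, plus 10 % discharge error:",
      np.ptp(np.percentile(with_q_err, [2.5, 97.5])))
```

Because the discharge error is amplified by the rating-curve exponent and then multiplied back into the load, even a modest discharge uncertainty widens the interval noticeably.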

The confidence intervals resulting from our proposed method showed that the uncertainty in the loads is quite large and lies mostly on the upper limit of the estimate, as the intervals were strongly right-skewed. This asymmetry implies that, wherever load estimates are used to assess environmental impact without an accompanying uncertainty assessment, the maximum impact could be severely underestimated.

Additionally, the bootstrap process demonstrated that load estimates are biased downwards when calculated directly from data that required a transformation to stabilize increasing variance. While some alternative bias corrections are available, they are not consistently used, which is another factor contributing to the underestimation of constituent loads reported in the literature thus far. Taking the median of the bootstrap estimates is an easy and generally applicable way to obtain unbiased estimates.

Reporting uncertainty is especially important when water quality models are complex. There has been a great increase in the use of more complex predictive methods for water quality, for example, the use of artificial neural networks, random forests or generalized additive models (Berk, 2008). The advent of these methods makes the consistent reporting of measures of uncertainty even more essential: the more complex a model is, the more prone it is to overfitting (Burnham and Anderson, 2002), as was demonstrated by the inflated confidence intervals when adding predictor variables to the sediment concentration model. Some measure of uncertainty should systematically be shown for any load estimate, and the method developed in this paper provides a flexible framework to do so.

The source code for the bootstrap analysis with the SAS software that was
used for the load estimates and corresponding confidence intervals is freely
available at

The SAS code used to
simulate a dataset with a fixed realized

The fieldwork data in this study were collected within the framework of the Uplands Program collaborative research center, a DFG-funded project in collaboration with Tran Duc Vien at the Hanoi University of Agriculture. The authors gratefully acknowledge the work of the field assistants Do Thi Hoan and Nguyen Duy Nhiem. The laboratory analyses were done at the Central Water and Soil Lab of the Hanoi University of Agriculture by Dang Thi Thanh Hue and Phan Linh, under the supervision of Nguyen Huu Thanh. Finally, we thank the two reviewers for their thoughtful insights on this paper.

Edited by: S. Archfield
Reviewed by: T. Kumke and one anonymous referee