Quantifying input uncertainty in the calibration of water quality models: reshuffling errors via the secant method

Uncertainty in inputs can significantly impair parameter estimation in water quality modeling, necessitating accurate quantification of input errors. However, decomposing input error from model residual error is still challenging. This study develops a new algorithm, referred to as Bayesian error analysis with reshuffling (BEAR), to address this problem. The basic approach requires sampling errors from a pre-estimated error distribution and then reshuffling them with their 10 inferred ranks via the secant method. This approach is demonstrated in the case of total suspended solids (TSS) simulation via a conceptual water quality model. Based on case studies using synthetic data, the BEAR method successfully isolates the input error and parameter error. The results of a real case study demonstrate that even with the presence of model structural error and output data error, the BEAR method can approximate the true input and bring a better model fit through an effective input modification. However, its effectiveness is limited by the assumption that the input uncertainty should be 15 dominant and that the prior information of the input error model can be estimated. The application of the BEAR method in TSS simulation is effective for understanding a range of water quality conditions and the further developed algorithm can be extended to other water quality predictions.


Introduction
For robust water management, uncertainty analysis is of growing importance in water quality modeling (Refsgaard et al., 20 2007). It can provide knowledge of error propagation and the magnitude of uncertainty impacts in model simulations to guide improved predictive performance (Radwan et al., 2004). However, the implementation of uncertainty analysis in water quality models (WQMs) is still challenging due to complex interactions among sources of multiple errors, generally caused by a simplified model structure (structural uncertainty), imperfect observed data (input uncertainty and observation uncertainty in calibration data) and limited parameter identifiability (parametric uncertainty) (Refsgaard et al., 2007). 25 Among them, input uncertainty is expected to be particularly significant in a WQM, interpreted here as the observation uncertainty of any input data. Observation uncertainty is different from other sources of uncertainty in modeling since these uncertainties arise independently of the WQM itself, thus, their properties (e.g. probability distribution family and distribution parameters) can, at least in principle, be estimated prior to the model calibration and simulation by analysis of https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License. the data acquisition instruments and procedures (McMillan et al., 2012). Rode and Suhr (2007) and Harmel et al. (2006) 30 reviewed the uncertainty associated with selected water quality variables based on the empirical quality of observations. The general methodology developed in their studies can be extended to the analysis of other water quality variables. Besides the error coming from the measurement process, the error from surrogated data is another major source of input uncertainty (McMillan et al., 2012). Measurements of water quality variables often lack desirable temporal and spatial resolutions, thus, the use of surrogate or proxy data is necessary for improved inference of water quality parameters (Evans et al., 1997, 35 Stubblefield et al., 2007. For a surrogate error, its probability distribution is easy to estimate from the residuals between the measurements and proxy values. These estimated error distributions are "prior knowledge" of input uncertainty before any model calibration and can serve as the a-priori uncertainty estimation in the modeling process. Input uncertainty can lead to bias in parameter estimation in water quality modeling (Chaudhary and Hantush, 2017, Kleidorfer et al., 2009, Willems, 2008. Improved model calibration requires isolating the input uncertainty from the total 40 uncertainty. However, the precise quantification of time-varying input errors is still challenging when other types of uncertainties are propagated through to the model results. In hydrological modeling, several approaches have been developed to characterize time-varying input errors, and these may hold promise for application in WQMs. The Bayesian total error analysis (BATEA) method provides a framework that has been widely used (Kavetski et al., 2006). Time-varying input errors are defined as multipliers on the input time series and inferred along with the model parameters in the Bayesian 45 calibration scheme. It leads to a high-dimensionality problem, which restricts the application of this approach to the assumption of event-based multipliers (the same multiplier applied to one storm event). In the Integrated Bayesian Uncertainty Estimator (IBUNE) (Ajami et al., 2007) approach, multipliers are not jointly inferred with the model parameters, but sampled from the assumed distribution and then filtered by the constraints of simulation fitting. This approach reduces the dimensionality significantly and can be applied in the assumption of the data-based multiplier (one multiplier for one 50 input data) (Ajami et al., 2007). However, this approach results in an underestimation of the multiplier variance and misidentification of the uncertainty sources (Renard et al., 2009). From the above, a new strategy should be developed to avoid high dimensional computation and ensure the accuracy of error identification.
To complete this goal, this study develops a new algorithm -Bayesian error analysis with reshuffling (BEAR). The derivation and details of the BEAR algorithm in quantifying input errors are described in Sect. 2. Section 3 introduces the 55 build-up/wash-off model (BwMod) to illustrate this approach. Its model input, streamflow, often suffers from observational errors from a rating curve. By comparing the results with other calibration frameworks, the ability of the BEAR method is explored in a synthetic case and a real case. In this way, the new algorithm is tested in a simple situation (with an assumption of true output data and model structure) and in a realistic situation (with the interference of multiple error sources) respectively. Section 4 evaluates the BEAR method and its implementation. Finally, Section 5 outlines the main conclusions 60 and recommendations for this work.
(3) 75 To counter the influence of input errors in a traditional calibration, an appealing approach is to subtract estimated errors p X  from the observed input o X . This is illustrated as the "proposed" approach and the superscript p represents the values in this "proposed" approach. The residual p  will change to If the equivalence between X  and p X  can be ensured for each data point, the modified input p X then becomes the same 80 as the true value * X . The proposed calibration (Eq. (4)) will result in an ideal calibration (Eq. (1)), where the optimal parameters p  will converge to their true values *  and the model residual P  will decrease to zero. Thus, the precise identification of input errors will result in the ideal model parameters and minimized residual error.
Selecting the optimal input error series according to the statistical characteristics of the residual error is the basic theory of current methods (Ajami et al., 2007, Kavetski et al., 2006. The challenge is to optimize the input errors effectively when 85 parameter errors impact this optimization. The BATEA method considers input errors and model parameters as a whole and infers them by taking advantage of the correlation among them. This results in a high dimensionality problem that cannot be avoided (Renard et al., 2009). The IBUNE method makes use of the stochasticity of sampled errors and selects the most https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License. 4 suitable error and parameter sample to minimize the residual error. This is less effective because the probability of cooccurrence of all optimal error/parameter values is very low (Renard et al., 2009). An improved strategy is necessary to 90 properly infer input errors through minimising total model residual error.

The innovation of the secant method
The secant method can be applied to address this problem. This is an iterative process to produce better approximations to the roots of a real-valued equation (Ralston and Jennrich, 1978). Here, the root is the optimal value of each input error and the equation is the corresponding model residual equal to zero. A traditional approach to updating this is impractical because 95 the estimated input error will fully complement the model error and always lead to a zero residual error regardless of the model parameters. More discussion on this is stated in Sect. 4.1.
This study attempts to transform this optimization into the rank domain. Here, the rank is defined as the order of any individual value relative to the other sampled values, and determines the relative magnitude of each error in all data errors.
For most WQM studies, the probability distribution of any input errors () f   can be estimated as per prior information. If 100 there is knowledge of the error distribution, the error value only depends on its rank in this distribution and this error distribution can then constrain the value range of sampled errors. Therefore, the secant method is very useful in the rank domain, where the root turns to the optimal rank of each input error (rather than its value) and the equation is still the corresponding model residual equal to zero. This new approach, referred to as the Bayesian error analysis with reshuffling (BEAR) method, should be implemented in two steps: sampling the errors from the estimated error distribution and 105 reshuffling these sampled errors corresponding to the inferred error ranks via the secant method.
The secant method (Ralston and Jennrich, 1978) can be repeated as , 1 , 2 , , 1 , 1 , 1 , 2 until a sufficiently accurate target value is reached. In this study, the target value is a residual of zero ( , 0 p iq  = ) indicating a perfect model fit (with input errors estimated exactly). Here, , iq k represents the estimated rank for ith input error at the qth 110 iteration, ,1 p iq  − is the residuals corresponding to the input error rank ,1 iq k − . The error rank of each data point is updated respectively via Eq. (5), where i =1,…n. n is the data length and also the number of the estimated errors as these errors are data-based.
After calculating Eq. (5), it is possible that the rank , iq k is out of the rank range (for example, less than 1 or more than n), or not an integer. Sorting , iq k in all the ranks , ( 1,..., ) iq k i n = can address this problem by effectively scaling the calculated ranks 115 https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License.
, iq k to an integer from 1 to n. Thus, in Eq. (5), , iq k should be changed to , iq K , representing the pre-rank. After sorting , iq K for all the errors, the post-rank , iq k will then belong to reasonable values.
From the above, estimating the rank of input errors via the secant method can be described as the following two steps: Update the rank of each input error , iq K ( 1,..., ) in = via the secant method respectively: where k( ) means calculating its rank.

Approximate Bayesian Computation -Sequential Monte Carlo (ABC -SMC)
This study chooses Approximate Bayesian Computation via Sequential Monte Carlo (ABC-SMC) as the calibration scheme. 125 ABC-SMC was first proposed by Sisson et al. (2007) and developed in the research of Toni et al. (2008). The ABC method is especially useful for problems in which the likelihood function is analytically intractable or costly to compute in traditional Bayesian approaches. For formal Bayesian approaches,, the likelihood function must be set carefully to meet the assumption about the residual error distribution, and this setting impacts the parameter estimation (Smith et al., 2015, McInerney et al., 2017, Wu et al., 2019. In the ABC method, setting an objective function is more general allowing for 130 potentially complex input error distributions where the likelihood is difficult to write. In the ABC-SMC approach, the parameter p  is first sampled from a prior distribution ( ) p P  (referred to as population 1). Then it is propagated through a sequence of intermediate distributions

Algorithm and an example of the BEAR method
According to the previous derivations, the algorithm quantifying input errors via the BEAR method is demonstrated in Fig. 1 and an illustrative example is presented in Table 1 and Fig. 2. Based on an ABC-SMC calibration scheme, the BEAR method works by replacing the observed input with a modified input that is obtained through the estimated input error rank 6 via the secant method. In Fig. 1, s refers to the number of the sequential updating populations in the ABC-SMC scheme, which increases until the objective function (measuring the fit between the calibration data and model outputs) of the sth population is less than the final tolerance  . The final tolerance  (i.e. the stopping criterion) is difficult to set before calibration due to the unknown range of objective function values, but in practice, it can be estimated after several population calibrations, according to the actual calculation range of the objective function and the target accuracy. In this 145 study, the calibration stops when 1000 proposed parameter sets are rejected in a row. The first tolerance 1  should be set sufficiently large to start the update. Any intermediate tolerance s  is set as the 30% quantile of the objective function results of the previous population s-1, such that it reduces automatically with a new population calculation.
In each calibration population, the input error ranks are updated over q iterations, where q increases until the objective function is less than tolerance s  . When q=1 and q=2, the input errors are randomly sampled from the estimated error 150 distribution because two sets of samples are prerequisites for the updating via the secant method (Table 1) (6) and (7)), demonstrated in the first two columns in the 3rd and 4th iteration in Table 1. According to the new rank , p iq k , the value with the same rank in the 2nd iteration is the estimated error in the new iteration. For example, the new rank at the 1st 155 time step in the 3rd iteration is 6, and its corresponding value in the 2nd iteration is -0.02, therefore, -0.02 is set as the updated input error at the 1st time step in the 3rd iteration. After the same reshuffling strategy, the re-ranked input errors will then lead to a new series of the modified inputs Note however if the model parameters are far away from the true values, especially in the initial population, iterative updating of the error ranks will have little effect in reducing the model residual. Therefore, the maximum times of iterations 160 should be set, referred to as Q. If q exceeds Q, the algorithm returns to the step resampling the model parameters (seen in Fig. 1). An example of four iterations is demonstrated in Table 1 and Fig. 2.
In the example given in Table 1, before reshuffling errors (i.e. the 1st iteration and 2nd iteration), the input errors do not approach the true values shown in Fig. 2, having much larger objective function results than the 3rd and 4th iteration. After the error reshuffling, the objective function calculated in the 4th iteration is smaller than the result in the 3rd iteration, 165 illustrating that the estimated errors in the 4th iteration are closer to the true values than the 3rd iteration. This is also supported by Fig. 2 where the red line (4th iteration) has a stronger correlation with the black line (true input error) than the yellow line (3rd iteration). From the above, the true input errors can be approximated through updating the error ranks to minimize the objective function of the residuals. https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License.

Comparison with other methods 170
To evaluate the ability of the BEAR method in quantifying input errors, three methods are compared, denoted as method T, D, R. Method "T" is the "traditional" method, regarding the observed input as error-free without identifying input errors (i.e. Eq. (2)), while the other two methods employ a latent variable to counteract the impacts of input error and build the modified input (i.e. Eq. (4)). In method D, "D" refers to the probability "Distribution" of input error, which is additional information considered in the calibration. This error distribution can be estimated before calibration according to the studies in the 175 introduction. Especially in the context of proxy errors, the probability distribution can be easily calculated via the residuals between the measurements and the corresponding proxy values. From this error distribution, potential input errors are randomly sampled and filtered by the minimization of the objective function, which is similar to the basic framework of the IBUNE method (Ajami et al., 2007). Method R represents the BEAR method developed in this study. "R" refers to the "Reshuffling" strategy via the secant method, which is an additional process to that used in method D to improve the input 180 error quantification.

Water quality model: the build-up/wash-off model (BwMod)
This study tests the BEAR algorithm in the context of the build-up/wash-off model (BwMod), which is a group of models to simulate two processes in sediment dynamics, including the build-up of sediments during dry periods and the wash-off 185 process during wet periods. The two formulations were developed in a small-scale experiment (Sartor and Boyd, 1972), while in applications at the catchment scale, the conceptualized parameters largely abandon their physical meanings and the formulations can be considered a "black-box" (Bonhomme and Petrucci, 2017). This study chooses Eq. (8) to describe the build-up process and Eq. (9) to express the wash-off of sediments, representing the non-linear relationship between the wash-off load (output) and the runoff-rate (input). These two equations were applied in the research of Sikorska et al. (2015) 190 and in this study, are written in the MATLAB programming language with the integration of the BEAR method. The time scale is typically set as daily, and the spatial scale is set as the catchment in this study. This version of BwMod has four parameters ( Table 2). The model input is streamflow, which typically comes from the observation of a rating curve. As discussed in the introduction, the error distribution can be estimated prior to the model calibration via a rating curve analysis.
The output of the BwMod is the concentration of total suspended solids (TSS), whose transport can be efficiently simulated 195 by the conceptualization of the build-up/wash-off process (Bonhomme andPetrucci, 2017, Sikorska et al., 2015). Although BwMod is relatively simple compared with process-based WQMs, its nonlinearity and the use of surrogates for the input data can make it a typical WQM scenario to test the BEAR algorithm.
The overall BwMod equations are: X is generated based on two types of input error models: an additive formulation and a multiplicative formulation, and the errors are assumed to follow a 210 normal distribution with mean  as 0.2 and standard deviation (SD)  as 0.5. An additive formulation (denoted as 'add' in Table 3) is suitable to illustrate error generation, while the multiplicative formulation (denoted as 'mul' in Table 3) is specifically applied for errors induced from a log-log regression procedure, which is common for water quality proxy processes (Rode and Suhr, 2007). In the additive formulation, the generated input may be negative. If so, the negative input should be truncated to a positive value. In the multiplicative formulation, the generated input will stay positive. The true 215 output * Y is the simulated TSS concentration via BwMod corresponding to the true input * X and model parameters set as the reference values in . The observed output o Y is assumed to be the same as the true simulation * Y , i.e. without error. In the calibration, the objective function is set as the Mean Squared Error (MSE). Considering the unknown initial sediment loads in real applications, the calibration sets 90 days as a warm-up period to remove the influence of antecedent conditions. 220 Following the algorithm described in Sect. 2.4, the model parameters and the time-varying input errors are estimated. In each population of the ABC-SMC calibration scheme, 50 sets of model parameters are updated. In the first population, the model parameters are sampled from a uniform distribution with the prior range described in Table 2.

9
The prior information about error parameters (i.e.  and  ) contains two conditions: one is fixed as the reference values (denoted as 'fixed' in Table 3), the other one is given the prior range, which needs to infer the error parameters in the 225 calibration (denoted as 'inferred' in Table 3).
To sum up, this study considers four scenarios in the synthetic case, including two sets of synthetic data generating from two input error models and two types of prior information about the error parameter (the details are shown in Table 3). Each scenario is calibrated via method T, method D and method R respectively. Their algorithms are described in Sect. 2.5 and their results are compared in Fig. 3 and Fig. 4. Figure 3 shows the statistical characteristics of the overall estimations. Figure  230 4 demonstrates the temporal dynamics of input estimations and model simulations.
Evaluating the input error quantification, method R always has much higher correlations with the true error series than method D in all calibration scenarios (shown in Fig. 3(3)). When the error parameters are inferred, the estimations of  via method D are smaller than the reference value (shown in Fig. 3(1)). This conclusion has also been reported in the study of Renard et al. (2009). The reason for this is that the randomness of the likelihood function leads to an underestimation of the 235 SD of input errors. Compared with method D, the  estimation via method R is less biased from its true value (shown in Fig. 3 (1)), while the estimation of  is worse via method R (shown in Fig. 3(2)).
For the model simulation, method R always produces the best output fit in all scenarios, supported by the highest red boxplots in Fig. 3(4). Also in Fig. 4, regardless of the calibration scenarios, the output uncertainty bands of method R (red parts) almost overlaps the true output (green line), being much better than method T (pink parts) and method D (blue parts). 240 However, the input uncertainty bands vary depending on the calibration scenarios. When the error parameters are fixed at the reference values (in the scenarios add-fixed and mul-fixed), method R always outperforms the other two methods regardless of input error models, as its Nash-Sutcliffe efficiency coefficient (NSE) are the highest (shown in Fig. 3(5)). In Fig. 4(1) and Fig. 4(3), the input uncertainty bands of method R (red parts) generally converge to the true value (green line), being better than method D (blue parts). Without the reshuffling strategy, Method D even gives worse input estimation and model 245 simulations than method T, demonstrated by the lower blue boxplots than pink boxplots in Fig. 3(5)) and Fig. 3(4). This illustrates that the ill-posed error sources in method D exert a negative impact on the model simulations. When the error parameters are inferred (in the scenarios of add-inferred and mul-inferred), the performance of method R depends on the input error models. For the scenario of add-inferred, method R is still better than other methods, having the biggest NSE (shown in Fig. 3(5)) and the closest error parameter estimation to the reference value (shown in Fig. 3(1) and Fig. 3(2)), 250 although the input uncertainty band is more negatively biased from the true value (green line) than method D in Fig. 4(2).
For the scenario of mul-inferred, the modified inputs via method R are further from the reference value than method D (shown in Fig. 3(5)), which might result from worse  estimations for the input error (shown in Fig. 3(2)).

Case study 2: Real data
The above synthetic case only exhibits input error and parameter error, which focuses on testing the ability of the BEAR 255 method in quantifying time-varying input errors while estimating model parameters. In real-life applications, the impacts of model structural error and output data error cannot be ignored. In order to explore the BEAR method in more general situations, e.g. with other errors' interference, a real case of one catchment located in southeast Wisconsin, USA is demonstrated. Table 4 is a description of the test catchment and data (Baldwin et al., 2013). The daily TSS concentration and streamflow data are collected from the USGS database on National Real-Time Water Quality (https://nrtwq.usgs.gov/). 260 The daily streamflow data in the USGS database comes from a stage-streamflow rating curve, where the stage and streamflow form a log-log linear relationship and the streamflow proxy errors follow a normal distribution with  as 0 and  as 0.103. This prior information is used in the real calibration, denoted as O-fixed scenario in Table 3. Because the BEAR method is implemented under the assumption that the input uncertainty is so significant that other sources of uncertainties can be ignored, another input data source with more significant data uncertainty, the streamflow simulation from a 265 hydrological model, has been considered. This study selects GR4J (Perrin et al., 2003) as the hydrological model and calibrates its parameters with the USGS streamflow data as calibration data. If the USGS streamflow data is regarded as the true input data, the residual error after the model calibration can approximate the data error of GR4J simulation, which follows a normal distribution in log space with  as 0 and  as 0.764. The BwMod calibration using this input data source and the prior information on data error is denoted as S-fixed scenario in Table 3. To explore the ability of the BEAR method 270 in other situations where the prior information about the input error is not sufficient, two scenarios with a wider range of the error parameters has also been considered, denoted as O-inferred and S-inferred in Table 3. The real case is also calibrated via three methods (i.e. method T, method D and method R) and adopts the same setting of the calibration algorithm as the synthetic case. Figure 5 uses several statistics to evaluate the calibration scenarios. For all scenarios in Fig. 5(b1), method R always 275 produces a better fit to the output data than method D, consistent with the synthetic case shown in Fig. 3(4). In Fig. 5(b2), "Reliability" here is the ratio of observations caught by the confidence interval of 5%-95%, and the average width of this interval band is referred to as "Sharpness" (Yadav et al., 2007, Smith et al., 2010. In the S-fixed and S-inferred scenarios with significant input errors, the results of method R show much higher reliability with a larger sharpness. However, in the O-fixed scenario with insignificant input errors (i.e.  =0.103), the reliability and sharpness of method R are smaller than 280 method D. Fig. 5(a) demonstrates that the  estimations vary depending on the calibration methods, but stay almost identical between two data sources. This illustrates that the impacts of other sources of errors significantly impair the error quantification, and their impacts are varied for different methods.
In the real case shown in Fig. 6, method R still produces the best fit to the output and the uncertainty band of the modified input via method D is centered on the observed data. In Fig. 6(c) to the observed streamflow (green line), even in (c3) and (c4) where the input data comes from the simulated streamflow (black line). According to the results of method T in Fig. 6(a), the simulations corresponding to the observed streamflow (in (a1) and (a2)) catch the dynamics of observed TSS concentration better than the simulations corresponding to the simulated streamflow (in (a3) and (a4)). Here, the observed streamflow from the rating curve should be closer to the true input data, 290 and could be regarded as the reference value. Given that, the modified inputs via method R are more reasonable.

The effectiveness of rank estimation
The novelty of the BEAR method lies in transforming a direct error value estimation to an error rank estimation. In a continuous sequence of data, the potential error values have an infinite number of combinations, while the error rank has 295 limited combinations, dependent on the data length. It is far more efficient to estimate the error rank than estimate the error value. Compared with the IBUNE framework (Ajami et al., 2007), the BEAR method additionally infers the error ranks to adjust the order of the sampled errors and reduce their randomness, which significantly improves the accuracy of the error estimation (as demonstrated by much higher NSEs than method D in Fig. 3). The application of the secant method plays an essential role in this by inferring each error rank according to the residual error. 300 Note that modifying each input error according to the corresponding residual error only works in the rank domain. In the value domain, if there is no constraint on the estimated input errors, they will fully compensate for the residual error with the aim of minimizing the objective function and subsequently be overfitted. There are two ways to impose restrictions. One is to regard errors and model parameters as a whole in calibration, resulting in the high dimensional computation (Kavetski et al., 2006). The other is to sample error randomly from the assumed error model IBUNE (Ajami et al., 2007), whose precision 305 cannot be guaranteed. While in the rank domain, the value range of the sampled errors can be effectively limited by the assumed error model.
One thing to note in the rank estimation is that even corresponding to the same rank, the error sampled at different times could be largely different, especially for a small sample size (depending on the data length) or a large standard deviation of the assumed error distribution. This problem can be addressed by selecting the optimal solution from multiple samples 310 according to the minimization of an appropriate objective function. The secant method is a successive approximation algorithm and one single iteration cannot guarantee the optimal results. Considering these two points, the BEAR method set q iterations in the algorithm (Fig. 1). q increasing until the objective function becomes smaller than the tolerance.

The impacts of prior information of input error model
Method D employs the same framework as IBUNE (Ajami et al., 2007), taking advantage of stochastic error samples to 315 modify the input observations. In Fig. 4 and Fig. 6, the uncertainty bands of modified inputs (blue parts) encompass the input observations (black line), illustrating that the intrinsic quality of the input observation determines the algorithm performance.
https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License. Figure 6 demonstrates that if the input error is insignificant in the residual, like in the O-fixed and O-inferred scenarios of the real case, the resultant simulations will fit the observed output (green line) well. Otherwise, the simulations are far away from the observed outputs (black line) due to inaccurate input observations (in the S-fixed and S-inferred scenarios in the real 320 case). As per the finding in the previous study of Renard et al. (2010), if the SD of input errors is inferred with the model parameters, method D will underestimate the SD (Fig. 3(1) and Fig. 5(a2)). If the intrinsic SD of input errors is large, a fixed SD cannot improve the input modification and model simulation, demonstrated by a wider band in Fig. 6(b3) than in Fig.   6(b4). If the SD of input errors is small, the prior information will constrain the impacts of other sources of errors. From the above, the data quality is more important than the availability of prior information for method D, especially when the 325 intrinsic SD of the input error is large.
However, the findings in method R are quite different. Although method R infers the input error by minimizing the model residual error, it is much more effective than method D to minimise the residual errors. For the synthetic case ( Fig. 4(c)) and real case (Fig. 6(c)), the model simulations via method R (red parts) are very close to the output observations (green line). In other words, the estimated input error mainly depends on the output observations. Therefore, in the real case with the same 330 output observation (Fig. 6(c)), the modified inputs are consistent among the scenarios. Given this, the model structure error plays an important role in estimating the input error.
To constrain the impacts of the other sources of error, accurate prior information about the input error model is important in method R. In the synthetic case, fixed scenarios always produce a higher NSE of the modified input ( Fig. 3(5)) and a larger correlation in the estimation error ( Fig. 3(5)) than inferred scenarios. This illustrates that prior information can limit the 335 impacts of model parameter error. In Fig. 6(a1), the modified inputs in the real case are around the reference value (green line), while in Fig. 6(a2), the modified inputs are biased from the reference value (green line). It should be noted that this difference is obvious in the scenarios with insignificant input error (where the model structural error is relatively large).
When the input error is dominant, like the S-inferred scenario, method R becomes more effective to estimate the input error, bringing a more precise estimation of the error SD than the O-inferred scenario and similar results to the S-fixed scenario. 340 To sum up, for method R, an accurate input error model can constrain the adverse impacts of the other sources of errors, especially when the other sources of error are dominant. But for method D, the input data quality is more important than this prior information.

The extension to other modeling scenarios
In this study, the BEAR method was developed in the calibration of BwMod at the daily time scale, whose input and output 345 correspond at each time step. Therefore, in Eq. (6), the model residual ,1 p iq  − and input error rank ,1 iq k − are at the same time step i. If the water quality system exhibits response delays, the time lag between the forcing data and the response (described as the lag) should be considered in the algorithm and Eq. (6) needs to be modified as per Eq. (11).
If the response caused by an input is not instantaneous but exhibits persistence (i.e. occurs over several time steps), the 350 autocorrelation in the output should be addressed to ensure the independence assumption of the rank updating is satisfied.
Current ways to deal with this problem in hydrologic modeling can provide a reference in the potential modification of the BEAR method. The persistence of residual errors can be represented by an autoregressive moving average (ARMA) model (Kuczera, 1983) or autoregressive (AR) model (Schaefli et al., 2007, Bates andCampbell, 2001). However, the ability of these approaches needs further discussion in systems with correlated responses. 355

Conclusion
Taking advantage of the prior information of an input error model, a new method, Bayesian error analysis with reshuffling (BEAR), is developed to approach the time-varying input errors in WQM inference. Through the investigation of synthetic data and real data, this method is shown to be robust and effective. The novelties of this algorithm are: (1) Estimating the error rank rather than directly estimating the error value which significantly improves the effectiveness of input error 360 quantification by reducing the potential search space for input errors. (2) The modification of the secant method links the error rank of each input data to its corresponding residual, which addresses the high dimensionality problem in current calibration methods.
However, the work in this study still identifies a few areas needing to be explored. Firstly, the availability of prior knowledge of the input error model is important. When this information is not reliable or even cannot be estimated, a 365 significant issue is the selection of a suitable error assumption. Thus, a general measure should be found to judge whether an error model is appropriate, especially in real cases where the "true" information is limited. Secondly, extensions of the BEAR method to other water quality modeling scenarios are subject to problems such as delayed and autocorrelated responses. Related studies in hydrologic modeling to deal with the delay and persistency of responses could be references in the modification of the BEAR method. Thirdly, if the sampling and reshuffling strategy is developed within a more 370 comprehensive framework to quantify multiple sources of error, the interactions amongst these error sources might be well identified and the quantification of individual errors might be improved. This study provides a starting point for developing https://doi.org/10.5194/hess-2020-563 Preprint. Discussion started: 12 November 2020 c Author(s) 2020. CC BY 4.0 License.