Quantifying input uncertainty in the calibration of water quality models: reordering errors via the secant method

Wu, Xia; Marshall, Lucy; Sharma, Ashish

doi:https://doi.org/10.5194/hess-26-1203-2022

Articles | Volume 26, issue 5

https://doi.org/10.5194/hess-26-1203-2022

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Special issue:

Frontiers in the application of Bayesian approaches in water...

https://doi.org/10.5194/hess-26-1203-2022

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 26, issue 5

Research article

|

04 Mar 2022

Research article |

| 04 Mar 2022

Quantifying input uncertainty in the calibration of water quality models: reordering errors via the secant method

Xia Wu, Lucy Marshall, and Ashish Sharma

Download

Final revised paper (published on 04 Mar 2022)
Preprint (discussion started on 12 Nov 2020)

Interactive discussion

Status: closed

AC: Author comment | RC: Referee comment | SC: Short comment | EC: Editor comment

- Printer-friendly version

- Supplement

RC1: 'Reviewer comments', Anonymous Referee #1, 09 Dec 2020
- AC1: 'Reply on RC1', Lucy Marshall, 22 Mar 2021
RC2: 'recommend major revisions', Anonymous Referee #2, 10 Dec 2020
- AC2: 'Reply on RC2', Lucy Marshall, 22 Mar 2021
RC3: 'Review', Anonymous Referee #3, 17 Dec 2020
- AC3: 'Reply on RC3', Lucy Marshall, 22 Mar 2021

Peer-review completion

AR: Author's response | RR: Referee report | ED: Editor decision

ED: Reconsider after major revisions (further review by editor and referees) (12 Apr 2021) by Lorenz Ammann

In most cases, I agree with the reviewer’s comments and with the answer of the authors to those comments. There are, however, certain reviewer comments with which I fully agree and which are not yet properly addressed in the answers of the authors. Unfortunately, some of these are quite major comments and I hope that the authors can address them properly. I explain these concerns in more detail in the following.

Most importantly, it is not clear what the consequences of the reordering step in the secant method are on the acceptance probability of the sampler that forms the outer loop (in this case a ABC-SMC algorithm). Changing the acceptance probability of any sampler almost always has consequences on the convergence proofs for that algorithm, i.e. on its ability to sample the intended distribution, which in this case is the posterior. Note that this concern is still there if one uses BEAR together with other samplers such as MCMC methods or SMC algorithms, as mentioned by the authors in their replies. This means that there are doubts about which distribution the BEAR algorithm actually samples. This point was raised by Reviewer 3 in comment 1. The authors try to address this concern in Appendix B of their response, but they do not manage to show that BEAR samples the posterior distribution. Since the εX are sampled anew in each iteration of the outer loop (sampler), we have different chances to reach τs for each s and therefore different chances of accepting the proposed parameter sample (i.e. different acceptance probability). Also the ranks, kq, are re-initialized in each iteration. In the same answer, in Eq. 7, the authors claim that εX are a deterministic function of the parameters, but in the sentence below and in the scheme in Figure 1, they say that they are sampled from f(η). According to Fig. 1, even η changes during the inference.

I see two possibilities: either the authors can provide a mathematical proof that their algorithm samples the posterior distribution, or they clearly state in the manuscript that the resulting distribution of the parameters has an interpretation that is not clear and possibly different from the actual posterior distribution. In that case, the approach cannot be considered a Bayesian approach (which doesn't mean that it cannot be useful). In the second case, I suggest to provide a more thorough synthetic case study, where the model parameters κ, Smax, a and b are known and re-inferred by BEAR, which is not shown at the moment. One thing to investigate in such a case study is also the effect of the temporal resolution (see comments in the next paragraph).

BEAR assumes a direct correspondence between the input and the output at each time-step. This necessitates that the number of input and output time-steps are the same and assumes that the model output at each time-step is only affected by exactly one input value: the one at the same time. However, memory effects introduced by water quality models can be substantial. For example, in the model used by the authors, the differential equation will lead to significant memory and input that happened multiple time-steps ago will affect the model output later on. This also depends on the temporal resolution of the model. The BEAR method will have significant problems with models that show a pronounced memory effect, especially when the temporal resolution of the data is high and one model output is affected by many model inputs.

Hide

AR by Lucy Marshall on behalf of the Authors (05 Aug 2021) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (23 Aug 2021) by Lorenz Ammann

RR by Anonymous Referee #3 (26 Aug 2021)

Suggestions for revision or reasons for rejection

I thank the authors for their detailed revision of the manuscript. Overall, their revision helped clarifying a lot of uncertain aspects of the algorithm, and their responses to my concerns were mostly satisfactory, but work still remains to be done. Unfortunately, I fear that my main concerns about the theoretical foundations and consequences of the error re-ordering remain inadequately addressed. The derivation provided in Appendix B does not sufficiently clarify these points eithers: the core question (how exactly the re-ordering affects the base error distribution’s statistical moments or hyperparameters, and what the consequences are in Bayesian terms) remains unaddressed.

Since I take it that the authors would like to stick with the error re-ordering approach, I would argue that this leaves you with two possible pathways:

1) Provide a thorough theoretical derivation and in-depth investigation of what effect the error re-ordering really has, and what this means in Bayesian terms. After experimenting a bit with error re-ordering myself (see the Python code snippet below), I believe that a good start point might be couplings or measure transport (Pierre E. Jacob has a nice online lecture series on that called Couplings and Monte Carlo), as you seem to convert one distribution (the raw error distribution) into one with different statistical moments, one somehow moulded to the ideal error realizations.

2) Alternatively, you could simply drop the “Bayesian” attribute from your study or replace it with “Pseudo-Bayesian”. This might require that you adjust the acronym. Even without full theoretical justification, your algorithm can still provide a useful heuristic, and maybe that is enough. Even in this case, however, I believe the manuscript would benefit from an isolated analysis of what mechanically happens to the input error distribution when subject to reordering. I have provided a few thoughts on this below.

As a consequence, I would recommend another round of major revisions. If you follow option (2), I recommend specifically to add a new section into the theory/methodology chapter which explores and illustrates the effects of re-ordering errors on the raw error distributions in detail (this is important for the reader’s understanding of the approach– it should be part of the main manuscript, not the Appendix). To make space for this section, you could absorb some of the practical comments in the discussion (specifically, sections 4.1 and 4.3). There are also a lot of tangential comments addressing reviewer concerns throughout the manuscript which could be removed if their key points are addressed in this new section. This might also support the narrative thread of the manuscript by helping you to avoid the need to go on explanatory tangents.

As I don’t want to leave you hang out to dry on such a large and amorphous task, I would specifically suggest exploring a simple example case in this proposed section. Specifically:

• Ignore the model (for the purpose of error reordering this is unnecessary); instead, skip straight to positing some hidden “ideal” sequence of error realizations which perfectly compensate the true residual error (similar to your figure A1); derive the corresponding ranks;

• Use a simple, structured residual error sequence to make it easier to read the induced effect. The sequence doesn’t have to be realistic, merely insightful;

• Use a residual error distribution perfectly adjusted to the structured error you defined. In a synthetic test case, it’s easy to derive an empirical cdf.

• Then explore the consequences of re-ordering for time series of different length or input error distributions of different quality. Discuss the statistical moments after reordering.

I have provided an example Python script for this below and appended some of its result figures for different time series lengths of 10, 100, and 1000 in Figures 1, 2, and 3. In this example, I arbitrarily assumed the true/ideal error realizations to follow a sine curve. Note that something more realistic (like random samples from a Gaussian distribution) would have also worked, but the simplicity of a sine curve makes it significantly easier to read the effect of the re-ordering.

Some thoughts on the results:

As I suspected in major comment #7 for the first round of revisions, the longer the time series, the more likely the method is to achieve a “perfect fit”, so the effect of error re-ordering depends on the length of the time series. You discuss this briefly in the manuscript, but I think that this is among the most important mechanisms of BEAR, so it is worth demonstrating in isolation. For short time series (Figure 1), error re-ordering can already induce some degree of improvements by causing the marginal sample mean to follow the “ideal” error realizations; at the same time, the marginal error standard deviation decreases. This effect is exacerbated for longer time series (T=100, Figure 2, and T=1000, Figure 3). The consequence seems to be that the unordered, raw error distribution is “molded” to the ideal error realizations. I suspect (and you seem to share these suspicions in your responses) that in the limit of an infinitely long time series, error realizations would be compensated perfectly. This is important to discuss for prospective users of your manuscript, as it affects the algorithm’s behaviour in somewhat unexpected ways.

Other interesting things to visualize might be what happens if the error distribution is not perfect (for this, just replace “vals = dist(np.random.uniform(size=(1000,resolution)))” in the code with some other distribution). You already show this indirectly in your models, but demonstrating this effect in isolation rather than through the lens of performance metrics might be a lot clearer. Feel free to take inspiration from my example code or use it directly. I have attached it at the end of this manuscript.
Of course, this code snippet just demonstrates what happens when we are re-ordering error realizations, not how you arrive at the error ranks and their interaction with the parameter inference in the first place (which are potentially additional topics to discuss). I hope that even if you decide to follow option (1), this snippet might give you some ideas on where to start with the Bayesian justification. Good luck!

Specific comments (any line numbers I list correspond to the non-track-changes manuscript):
Line 47: the same multiplier applied to one storm event

Maybe “the same multiplier applied to all time steps of one storm event” might be better, if I understand this part correctly; using “same” for a singular object (“storm event”) sounds a bit strange.

Line 184: the time scale is typically set as daily and the spatial scale is set as the catchment

This is not particularly clear. I assume you simulate in daily timesteps, and aggregate the catchment’s (presumably) surface area into a single spatial unit? If so, it might be better to replace this with “thus, we use daily time steps and consider the catchment a single, homogeneous spatial unit” or something along these lines.

Lines 302: From this point of view, it is more efficient to estimate the error rank than estimate the error value,

This sentence ends on a comma, not a period.

Line 309-311: Besides, to avoid the high-dimension calculation, modifying each input error according to its corresponding residual error only works in the rank domain. In the value domain, if there is no constraint on the estimated input errors, they will fully compensate for the residual error to maximize the likelihood function and subsequently be overfitted

This requires more discussion in the revised manuscript, as it seemingly contradicts what you write in the paragraph immediately prior: In the previous paragraph, you recommend sampling error realizations repeatedly and selecting the optimal realization to overcome “sampling bias” and improve the fit to the actual observations. However, in this paragraph you praise this very same sampling bias for preventing overfit. These are contradictory messages: provided you get the ranks right, if you were to resample an infinite number of times, you would eventually get an error realization which compensates the true error perfectly (even if your input error distribution is a really poor approximation to the “true” error distribution), thus negating your protection against overfit. The fact that this protection against overfit depends on the length of the time series might not be that much of an issue if you interpret your approach merely as a heuristic, but even in this case you need some practical guidelines on when to re-sample for short time series. The proposed dedicated section might help clearing some of this confusion up.

Line 320-322: Thus, unlike formal Bayesian inference, the BEAR method does not update the posterior distribution of the input errors, but identifies the input error through the deterministic relationship between the input error and model parameter.

As far as I can see, this is the first time you mention that BEAR is not a formal Bayesian inference method, so I suspect you would go for option (2). In any case, mentioning this in the discussion for the first time is a bit late: Something this important should be stated earlier, as early as the introduction or abstract. If you add the section exploring the consequences of error reordering, this would also be a good place to elaborate on this.
Aside from this manuscript structuring argument, the statement in these lines is also unfortunately wrong. Refer to the attached figures for demonstration. I also elaborate more on this two comments below.

Line 383: “However, the work in this study still identifies a few areas needing to be explored.”

Nit-pick: This sentence is a bit unwieldy, in my opinion. How about “However, this study identifies a few areas which still need to be explored:”?

Response file / Response to major comment #7: “A point of clarification: in each subsequent iteration of the BEAR algorithm, a new population of input errors are sampled from their a priori distribution. This means that the distribution of errors at each iteration is the same prior to reordering, i.e. the population of errors do not converge to a distribution that has different statistical features (mean, standard deviation, skewness).”

You are of course correct, but I fear your comment misses the point somewhat: While it is undoubtedly true that the raw input error distribution never changes, BEAR does effectively update the input error distribution. In the figures from my code snippet, this can be seen in the bottom right subplots when compared to the centre right subplots. It becomes evident that the both the effective mean and standard deviation (and likely higher statistical moments as well) change dramatically after re-ordering. In essence, the fact that you are re-ordering transforms your raw error distribution and causes you to sample some different latent distribution instead. What this distribution really is remains the key question of your entire approach. If you can figure that out, you’ll be one large step closer to justifying this approach theoretically. =)

In a sense, what you are doing seems distantly related to ideas in measure transport, see for example Marzouk et al. (2016) for an overview. In measure transport, the ultimate goal is to indirectly sample from an (almost) arbitrary target distribution. This is achieved by sampling a simple reference distribution instead (for example a multivariate standard Gaussian), then converting these reference samples through a deterministic function into samples from the target distribution. Of course, finding the correct transformation function is the key objective of this entire endeavour, and consequently its main challenge. In your study, you approach this from the opposite direction: you have some transformation, now you should find out what distribution you are sampling.

In summary, I would say the parallels to your approach are as follows: even though the reference distribution (corresponding to your raw input error distribution) never changes, the pushforward distribution (corresponding to the latent distribution your re-ordered error realizations are effectively sampled from) changes with the transformation function (in your case, the re-ordering according to different error ranks).

Marzouk, Y., Moselhy, T., Parno, M., & Spantini, A. (2016). An introduction to sampling via measure transport. arXiv preprint arXiv:1602.05023; https://arxiv.org/abs/1602.05023.

Example code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from matplotlib.gridspec import GridSpec
import copy

# Set a random seed
np.random.seed(0)

# Create some dummy error values to generate an artificial ranking
resolution = 10 # Adjust this for different time series lengths

# Create a sequence of "ideal" error realizations following a sine curve
x = np.linspace(0,2*np.pi,resolution)
y = np.sin(x)

# Get the sorted indices of these error realizations
ranks = np.argsort(y)

# Create a sorter list which can be used to reshape a sorted list according to
# the ranks we just defined
sorter = np.argsort(ranks)

# Get the statistical moments of this residual error distribution
mean = np.mean(y)
std = np.std(y)

# Create a figure, plot the ranks for each error realization
plt.figure(figsize=(16,6))
gs = GridSpec(nrows=3,ncols=2)
plt.subplot(gs[0,:])
plt.scatter(x,y)
plt.xlabel('data series $t$')
plt.ylabel('ideal residual \n error')
for idx,i in enumerate(ranks):
plt.text(x[i],y[i],str(idx))

# Create a cdf for a sine curve -----------------------------------------------

# Sample the sin cure with high resolution
dummy_sine_samples = np.sin(np.linspace(0,2*np.pi,10000))

# Sort these samples
dummy_sine_samples = np.sort(dummy_sine_samples)

# Search for potential duplicates
sineval,count = np.unique(dummy_sine_samples,return_counts=True)

# Create a cumulative sum for the counts
cumsum = np.cumsum(count,dtype=float)

# Standardize the cumulative sum
cumsum -= cumsum[0]
cumsum /= cumsum[-1]

# Create an interpolation function which takes a quantile as input and returns
# the corresponding sine value
dist = interp1d(
cumsum,
sineval )

# Statistical moments of the raw input error distribution ---------------------

# Draw a thousand realizations of our custom error distribution
vals = dist(np.random.uniform(size=(1000,resolution)))

# Plot the error realizations
plt.subplot(gs[1,0])
for n in range(1000):
plt.plot(x,vals[n,:],color='grey',alpha=0.01)
plt.xlabel('data points $t$')
plt.ylabel('raw input error \n realizations')

# Plot the sine curve for reference
plt.plot(x,y,'r',label = 'ideal error realizations')
plt.legend()

# Calculate the mean and covariance of the samples
mean_1 = np.mean(vals,axis=0)
cov_1 = np.cov(vals.T)

# Plot the mean
plt.subplot(gs[1,1])
plt.plot(x,mean_1,'r')
plt.ylabel('raw input error \n marginal mean',fontdict={'color':'r'})
plt.ylim([-1.1,1.1])
ax2 = plt.gca().twinx()
ax2.plot(x,np.sqrt(np.diag(cov_1)),'b')
ax2.set_ylabel('raw input error \n marginal std',fontdict={'color':'b'})
ax2.set_ylim([0,0.75])

# Statistical moments of the sorted input error distribution ------------------

# Now sort the error realizations
vals_sorted = copy.copy(vals)
vals_sorted = np.sort(vals_sorted,axis=1)

# And redistribute them according to the ranks we determined
for n in range(vals.shape[0]):
vals_sorted[n,:] = vals_sorted[n,sorter]

# Plot the re-ordered error realizations
plt.subplot(gs[2,0])
for n in range(1000):
plt.plot(x,vals_sorted[n,:],color='grey',alpha=0.01)
plt.xlabel('data series $t$')
plt.ylabel('sorted input error \n realizations')

# Plot the sine curve for reference
plt.plot(x,y,'r',label = 'ideal error realizations')
plt.legend()

# Calculate the statistical moments
mean_2 = np.mean(vals_sorted,axis=0)
cov_2 = np.cov(vals_sorted.T)

# Plot the mean
plt.subplot(gs[2,1])
plt.plot(x,mean_2,'r')
plt.ylabel('sorted input error \n marginal mean',fontdict={'color':'r'})
plt.ylim([-1.1,1.1])
plt.xlabel('data series $t$')
ax2 = plt.gca().twinx()
ax2.plot(x,np.sqrt(np.diag(cov_2)),'b')
ax2.set_ylabel('sorted input error \n marginal std',fontdict={'color':'b'})
ax2.set_ylim([0,0.75])

# Save the figure
plt.savefig('consequences_or_error_reordering_resolution_'+str(resolution)+'.png',bbox_inches='tight')

Referee Report: PDF

Hide

RR by Anonymous Referee #1 (13 Sep 2021)

Suggestions for revision or reasons for rejection

I think the manuscript clearly improved over the last iterations. Yet, I still struggle with the discrepancy between expectations that are raised in the title, abstract and introduction, and what is presented in the article, i.e. that BEAR is as comprehensive as BATEA and IBUNE and therefore an alternative to these two. This impression is fostered by phrases like L.53-54: “From the above, a new strategy should be developed to avoid high dimensional computation and ensure the accuracy of error identification.” or in chapter “2.4 Comparison with other methods”, where, first, BATEA is discarded with the plain statement that it might run into dimensionality problems and then BEAR is presented as a modification to IBUNE (Lines 172-173 “The “BEAR” method adds a reordering process into the “IBUNE” method to improve the accuracy of input error quantification”) .

To overcome this problem, the authors should make last modifications:
- From the beginning on present the BEAR method as add-on to SMC or potential extension to existing methods like IBUNE, but not as a stand-alone, equal alternative to BATEA or IBUNE
- Only keep BATEA in the introduction as benchmark reference and for discussion, but not in the comparsion – if BEAR shall be presented as alternative, the authors should have run BATEA as reference. This includes deleting repetitions like lines 311ff in the Discussion chapter 4: “There are two ways to impose restrictions. One is to regard errors and model parameters as a whole in calibration, like the BATEA framework (Kavetski et al., 2006), resulting in a high dimensional computation.” Otherwise, there is the impression as if there was an actual comparison made.

Personally, I find the treatment of the errors still rather arbitrary – by design, the method will minimize residuals between model and observations. I doubt that this is the intention of Bayesian methods. Therefore, it is also no surprise that NSE values from BEAR are often higher (see e.g. Fig. 1 and 2) or the variance of residuals is lower (see Fig. 4).

That said, I do see two points why it still might be worth publishing: 1st) The manuscript addresses the problem “input error” that has not been addressed as much over the last years. Yet, it is a very important one, e.g. especially regarding novel data-driven machine-learning models through which input errors might be propagated without regulation since unlike mechanistic models, they do not contain physical relations that might buffer some part of the input error. 2nd) The authors propose the use of the secant method to address the problem and even if I personally do not find the presented procedure to be the “problem-solver” modelers might look out for, readers can make their mind of whether this method still bears more potential (working with ranks of errors, etc.) or not.

So considering the points above, I would consider the publication of this manuscript as contribution to the discussion about issues and ideas in addressing input errors, but not as a proven framework to tackle them that outperforms existing methods.

Hide

ED: Publish subject to revisions (further review by editor and referees) (20 Sep 2021) by Lorenz Ammann

AR by Lucy Marshall on behalf of the Authors (01 Nov 2021) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (18 Nov 2021) by Lorenz Ammann

RR by Anonymous Referee #1 (03 Dec 2021)

ED: Publish subject to minor revisions (review by editor) (04 Dec 2021) by Lorenz Ammann

I agree with the reviewer comments regarding the specifically mentioned sentences that are confusing and should be deleted.

Regarding the other comment of referring to the BEAR method as pseudo-Bayesian, with the additional clarifications that the authors included in past review rounds, as far as I understand, it would depend on the technical details of how the BEAR algorithm is combined with a certain inference approach whether the final outcome is a formal or an informal Bayesian approach. Therefore, I think the need to strictly refer to the BEAR algorithm as a pseudo-Bayesian approach can be relativized. I therefore suggest the authors keep the current name of the algorithm.

Additional comments:

I suggest the authors rearrange the structure of the manuscript such that the results are presented in a separate Section 4. In Section 3, the description of BwMod, as well as the setup of the case studies can be kept, but all the results should be clearly separated in a following section.

The first paragraph of the conclusion section can be extended, and the second one shortened. The second paragraph is interesting and necessary, but some of the sentences are better placed in the discussion section (and maybe slightly extended).

Specific comments:

Line 35: Replace “For the surrogate error, its probability distribution” by “The probability distribution of the surrogate error”
Line 37: Replace “more” by “larger”
Line 52: Replace “input data” by “input data point”, if this is what is meant here
Line 60: Replace “two synthetic cases and a real case” by “two synthetic and one real case study”. Check also other instances where this change is appropriate.
Line 85: Replace “appealing” by “common”
Line 147: Replace “can be” by “is often”

Hide

AR by Lucy Marshall on behalf of the Authors (14 Dec 2021) Author's response Author's tracked changes Manuscript

ED: Publish as is (17 Dec 2021) by Lorenz Ammann

AR by Lucy Marshall on behalf of the Authors (20 Dec 2021) Author's response Manuscript

Short summary

Decomposing parameter and input errors in model calibration is a considerable challenge. This study transfers the direct estimation of an input error series to their rank estimation and develops a new algorithm, i.e., Bayesian error analysis with reordering (BEAR). In the context of a total suspended solids simulation, two synthetic studies and a real study demonstrate that the BEAR method is effective for improving the input error estimation and water quality model calibration.