This work is distributed under the Creative Commons Attribution 4.0 License.
Comparing the Impacts of Single- and Multi-Objective Optimization on the Parameter Estimation and Performance of a Land Surface Model
Abstract. In land surface models (LSMs), precise parameter specification is crucial to reduce the inherent uncertainties and enhance the accuracy of the simulation. However, due to the multi-output nature of LSMs, the impact of different optimization strategies (e.g., single- and multi-objective optimization) on the optimization outcome and efficiency remains ambiguous. In this study, we applied a revised particle evolution Metropolis sequential Monte Carlo (PEM-SMC) algorithm for both single- and multi-objective optimization of the Common Land Model (CoLM), constrained by latent heat flux (LE) and net ecosystem exchange (NEE) measurements from a typical evergreen needle-leaf forest observation site. The results reveal that the revised PEM-SMC algorithm demonstrates a robust ability to tackle the multi-dimensional parameter optimization challenge for LSMs. The sensitive parameters for different target outputs can exhibit conflicting optimal values, so single-objective optimization improves the simulation performance for a specific objective at the expense of accuracy for the other objectives. For instance, solely optimizing for LE reduced the root-mean-square error (RMSE) between the simulated and observed LE by 20 % but increased the RMSE of NEE by 97 %. Conversely, multi-objective optimization not only ensures that the optimized parameter values are physically sound but also balances the simulation performance for both LE and NEE, as evidenced by decreases in RMSE for LE and NEE of 7.2 W m-2 and 0.19 μmol m-2 s-1, respectively. In conclusion, these findings indicate that comprehensively integrating the various available observational data for multi-objective optimization is preferable for parameter calibration in complex models.
Status: closed
RC1: 'Comment on hess-2024-121', Anonymous Referee #1, 26 Jun 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2024-121/hess-2024-121-RC1-supplement.pdf
- AC1: 'Reply on RC1', Cong Xu, 11 Sep 2024
RC2: 'Comment on hess-2024-121', Jasper Vrugt, 24 Aug 2024
Summary: In this study, the authors applied a revised particle evolution Metropolis sequential Monte Carlo algorithm, called PEM-SMC, to single- and multi-objective optimization of the Common Land Model (CoLM) using measurements of the latent heat flux (LE) and net ecosystem exchange (NEE) from a typical evergreen needle-leaf forest observation site. The authors conclude that the revised PEM-SMC algorithm is a robust method for multi-dimensional parameter estimation of LSMs. The optimal parameters for LE and NEE exhibit a trade-off, necessitating the estimation of these parameters against LE and NEE data simultaneously.
Evaluation: This paper is about an important topic in hydrologic modeling, namely the training of a land surface model so that its simulated output matches as closely and consistently as possible multiple different measurement types. Overall, the paper is reasonably well written, albeit at times using incorrect methodological terminology and language. I have worked quite a bit on Bayesian methods and am always interested in the application of these methods to Earth systems, but I am not particularly excited by, or convinced of, the work presented in this paper. I will summarize my comments next. Based on these comments I recommend a major revision. This gives the authors an opportunity to revise their work. This may involve quite a bit of additional work, as my comments will point out.
General comments:
While reading this manuscript, many thoughts went through my mind. In the first place, it was a trip down memory lane. The presented material reminded me of work done by Luis Bastidas and others about 3 decades ago, and my own work of about 15 years ago. Then, I was surprised to see that the authors do not reference much prior work and present the algorithms as if they invented all this themselves. Furthermore, the authors implemented a Bayesian approach, but they do not verify at all whether the assumptions they made were actually met in practice, nor do they take advantage of their posterior distribution in analyzing model performance. Last but not least, I am not convinced yet that the mutation operator of their PEM-SMC algorithm (which is a crossover step!) leaves the target distribution invariant. I will now address each of these points in more detail. Before I do so, I must apologize for the fact that I heavily advertise my own work in my review below. In general, I do not like doing this, but this proved difficult in the present case. The authors’ paper demonstrates a keen interest in methodology. This methodology has various problems and shortcomings. To point this out, I must unfortunately refer to our own published works. The authors work with MATLAB, so it should be easy to evaluate/test my suggestions and test their method against the “state-of-the-art” methods in MATLAB that I am pointing to. I hope my comments below clarify my concerns.
- The conclusion that land surface models exhibit a trade-off in describing different data types has been well known to the community. In a series of papers in the 1990s, Bastidas and co-authors have convincingly shown that LSM training results in different “optimal” parameter values if trained against different data types. They used a multi-objective optimization framework for this, along with a multi-objective sensitivity analysis method to rank the relative importance of individual parameters in describing different data types. This problem did not resolve with the use of a more complex LSM but is a result of (among others) epistemic uncertainty (model structural errors) and measurement errors of the controlling variables (exogenous variables). This paper uses a Bayesian procedure and arrives at the same conclusions as Bastidas et al.
Thus, fundamentally, the conclusion of the authors is not new. Then, let me look at the methodology used. Does this warrant publication in HESS? Before I move on to the methodology, I would want to clarify that the wording “multiple objective” is erroneous. In essence, the authors are lumping together in their likelihood function two different data streams. They do this by multiplying the likelihoods of LE and NEE. This is not multi-objective, as the authors only have 1 likelihood (= combined likelihood of NEE and LE). With the choice of a Gaussian likelihood, the authors’ method is equivalent to weighted least squares. With a diagonal measurement error covariance matrix where the first n entries are the measurement error variances for the latent heat flux data and the next m entries are for the measurement error variances for the NEE data, you will arrive at exactly the same maximum a-posteriori (MAP) density solution [uniform prior is used]. What the authors did is maximum likelihood estimation with a weighted likelihood function.
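For concreteness, and with the notation here being mine (n LE observations with error standard deviation σ_LE, m NEE observations with σ_NEE), the combined Gaussian log-likelihood reads

$$
\log L(\theta) = -\frac{n}{2}\log\!\left(2\pi\sigma_{\mathrm{LE}}^{2}\right) - \frac{1}{2\sigma_{\mathrm{LE}}^{2}}\sum_{t=1}^{n}\left[\mathrm{LE}_{t}^{\mathrm{obs}}-\mathrm{LE}_{t}(\theta)\right]^{2} - \frac{m}{2}\log\!\left(2\pi\sigma_{\mathrm{NEE}}^{2}\right) - \frac{1}{2\sigma_{\mathrm{NEE}}^{2}}\sum_{t=1}^{m}\left[\mathrm{NEE}_{t}^{\mathrm{obs}}-\mathrm{NEE}_{t}(\theta)\right]^{2},
$$

so for fixed σ_LE and σ_NEE and a uniform prior, maximizing it is identical to minimizing the weighted least-squares objective, i.e. the sum of the squared, variance-scaled LE and NEE residuals.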
- The PEM-SMC algorithm the authors present & modify reminded me of my own work on particle-DREAM (Vrugt et al: Hydrologic data assimilation using particle Markov chain Monte Carlo simulation: Theory, concepts and applications), http://dx.doi.org/10.1016/j.advwatres.2012.04.002. Given the large overlap between the authors’ PEM-SMC method and particle-DREAM I was surprised to see that there is not a single reference in their methodological description to the particle-DREAM paper. It is possible that the authors missed this work, but then their so-called mutation step in Equation 18 is taken directly from the particle-DREAM paper, and/or related papers on DE-MC and the DREAM algorithm. The authors use the same symbols in their paper for the jump-rate, gamma, parents, r1 and r2, etc. as we used in our publications. The authors should properly cite and acknowledge past work.
- Fundamentally, I am not confident that the Metropolis acceptance probability defined after Equation (18) leaves the target distribution invariant. We struggled with this in the Particle-DREAM paper and presented in Appendix B (Page 476) a correct formulation of the acceptance probability so as to ensure ‘detailed balance’. I suggest the authors study this Appendix and determine whether their particle transitions guarantee an exact inference of the target distribution. This requires careful demonstration. As a side note, the PEM-SMC sampler has many elements in common with the ABC-Population Monte Carlo sampler used in Sadegh and Vrugt, (2013: see Appendix A, Page 4845). We also implemented a DREAM resampling step in that algorithm.
- In addition to my previous comment, even if the method is theoretically correct, then one cannot simply make a change to an algorithmic recipe and claim this revised method is robust (as the authors do in their abstract). We cannot conclude anything about robustness using only the posterior distributions of the LE and NEE data. To inspire confidence in the revised algorithm one needs to demonstrate that the method works well on a large range of problems – meaning it converges to the known posterior distribution in a series of benchmark experiments. Otherwise, claims of robustness are meaningless, and we must simply trust that the algorithm correctly infers the target distribution of the LSM parameters.
- Then, one more time about Equation (18): how is the resampling implemented? Are N candidate points generated simultaneously, and then the acceptance probability computed pairwise to determine whether or not to accept the proposals? Or is the implementation sequential, that is, a proposal is created using Equation (18) and then accepted (or not) with Metropolis probability α; if accepted, the proposal replaces its parent, and then the next candidate point is created? The algorithmic recipe suggests the latter approach. The reason I ask is that the first, parallel, implementation does not guarantee reversibility. This proof only holds for the latter, sequential, approach. See the Appendix (pages 444-445) of Vrugt and Ter Braak, https://link.springer.com/article/10.1007/s11222-008-9104-9
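To make the distinction concrete, here is a minimal sketch of the sequential variant I describe (Python, with hypothetical names; `log_post` stands for the tempered log-posterior density of a particle, and the default jump rate is only illustrative):

```python
import numpy as np

def sequential_de_metropolis_sweep(particles, log_post, gamma=0.85, zeta_scale=1e-6, rng=None):
    """One sequential sweep: each accepted proposal immediately replaces its
    parent, so later proposals are built from the partially updated population."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = particles.shape
    logp = np.array([log_post(x) for x in particles])
    for i in range(N):
        # pick two distinct parents r1, r2, both different from particle i
        r1, r2 = rng.choice([j for j in range(N) if j != i], size=2, replace=False)
        zeta = zeta_scale * rng.standard_normal(d)
        proposal = particles[i] + gamma * (particles[r1] - particles[r2]) + zeta
        logp_prop = log_post(proposal)
        # symmetric proposal -> standard Metropolis acceptance ratio
        if np.log(rng.uniform()) < logp_prop - logp[i]:
            particles[i], logp[i] = proposal, logp_prop
    return particles, logp
```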
- How does the algorithm handle the parameter boundaries? What is done to a particle if it is sampled outside the prior parameter ranges (by the authors’ mutation step)? Detailed balance requires that the proposals are assigned a zero likelihood, or better, the candidate points are folded [e.g. see Vrugt and Ter Braak, 2011: https://doi.org/10.5194/hess-15-3701-2011]. Which mechanism did the authors implement?
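For reference, the folding rule I have in mind is simply a modulo operation on the prior hypercube (a sketch; the function name is mine):

```python
import numpy as np

def fold_into_bounds(x, lower, upper):
    """Fold an out-of-bounds candidate point back into [lower, upper] by
    treating every dimension as periodic, which keeps the proposal
    distribution symmetric (cf. Vrugt and Ter Braak, 2011)."""
    return lower + np.mod(x - lower, upper - lower)
```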
- Then, on a related note, the authors do not address the question as to why the machinery of a particle SMC method is required in the first place to infer the posterior parameter distribution. I am most familiar with the DREAM algorithm as I developed this myself with Cajo Ter Braak (2008). I seriously question whether, based on the listed algorithmic variables of S = 100 and N = 200, the SMC algorithm will succeed in generating samples of the target distribution. I hazard to predict that the DREAM algorithm (MATLAB toolbox: DREAM Package) will be computationally more efficient, in large part because the burn-in will be substantially smaller as one only needs to run N = 3 Markov chains. It also does not require a tempering schedule to move particles from a prior to a posterior distribution. This type of bridge sampling only complicates the methodology. Most people are familiar with the DREAM algorithm, so why do we need new machinery if “old methods” can do the job – and, I bet, more efficiently than what is presented in the present paper? Also, the multi-try variant of DREAM allows a further speed-up of the convergence to the target (posterior) distribution, as the chains are evaluated in parallel and multiple proposals are created in each chain in parallel as well. The best proposal is then accepted with a modified Metropolis acceptance probability. This methodology is described in https://doi.org/10.1029/2011WR010608
- The authors assume a normal likelihood for the residuals of the LSM model. Why did they choose a normal likelihood? This is the default choice but must be supported by evidence. This requires a posteriori checking of the residuals of the time series of measured and simulated latent heat fluxes and NEE. I bet the residuals will exhibit autocorrelation and a nonconstant variance, and deviate from normality. This would disqualify the use of the normal likelihood function. Good statistical practice requires the authors to verify that the assumptions about the likelihood function are met in practice. For example, consider Schoups and Vrugt, 2010. https://doi.org/10.1029/2009WR008933
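Such checks are cheap to carry out on the residuals of the MAP simulation; a minimal sketch of the three diagnostics I mention (Python/SciPy, names are mine):

```python
import numpy as np
from scipy import stats

def residual_diagnostics(obs, sim):
    """Posterior checks of the normal-likelihood assumptions on residuals."""
    s = np.asarray(sim, float)
    e = np.asarray(obs, float) - s
    # 1) independence: lag-1 autocorrelation (should be close to zero)
    rho1 = np.corrcoef(e[:-1], e[1:])[0, 1]
    # 2) normality: Shapiro-Wilk test on (at most) the first 5000 residuals
    _, p_normal = stats.shapiro(e[:5000])
    # 3) constant variance: does |e| trend with the simulated flux magnitude?
    slope, _, _, p_slope, _ = stats.linregress(s, np.abs(e))
    return {"lag1_autocorrelation": rho1,
            "shapiro_p_value": p_normal,
            "heteroscedasticity_slope": slope,
            "heteroscedasticity_p_value": p_slope}
```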
- The authors multiply the likelihoods of the latent heat flux and NEE as if these two data types are independent. Is this a valid assumption? As an alternative, one could consider a composite likelihood, where the two likelihoods are additive – one can find applications of this in statistical literature, specifically in the application to spatial data. Then, this would constitute a novelty and justify publication in a highly rated journal such as HESS.
- Then, the toolbox of DREAM in MATLAB (called DREAM package) has built-in distribution-adaptive likelihood functions, such as the generalized likelihood and universal likelihood functions: https://doi.org/10.1016/j.jhydrol.2022.128542. These likelihood functions let the data speak for themselves – and let the residuals determine the most appropriate form of the likelihood function. This includes treatment of heteroscedasticity, correlation and non-normality. This guarantees that the residual properties will match the likelihood assumptions – and inspire much more confidence in the posterior parameter distributions derived from the latent heat flux and NEE data. Certainly, if the authors use a Bayesian approach, as they do in this paper, they must demonstrate that their residual assumptions hold. Otherwise, the parameter estimates are not particularly meaningful.
- I do not understand why the authors use a complicated SMC-based training procedure, whereas there is relatively little they do with the knowledge of the posterior parameter distributions. In the first place, the authors should present the Bayesian predictive distributions (time series of confidence and prediction intervals of the simulated output) in their time series figures (e.g. Figure 4). What is the coverage of the prediction limits? As it stands right now, the authors focus their attention on the MAP solutions and largely ignore parameter uncertainty. A maximum likelihood method would have done the job. Indeed, stochastic gradient descent would have found the MAP solution, and a first-order approximation around this optimum would have given a linear estimate of parameter uncertainty.
- The authors refer to Equation 15 as the “analytical method”. I am not familiar with this terminology. Equation 15 simply states that the measurement error standard deviation is equal to the square root of the sample variance of the residuals. They essentially use this as a sufficient statistic and embed it in their likelihood function. This is a common approach, but the authors’ implementation raises two important questions. A) How was the sample variance of the residuals determined? This requires knowledge of the MAP (maximum a-posteriori density) LSM parameter values. B) The estimate of the sample variance of the residuals will be biased, as the residuals are likely to exhibit serial correlation. The “true” measurement error variance can only be determined after decorrelating this time series of MAP residuals.
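One possible way to do this decorrelation, sketched below under an AR(1) assumption for the MAP residual series (the function and variable names are mine; other pre-whitening conventions exist):

```python
import numpy as np

def prewhitened_residual_std(map_residuals):
    """Standard deviation of AR(1) pre-whitened MAP residuals, i.e. of the
    innovations that remain after removing lag-1 serial correlation."""
    e = np.asarray(map_residuals, float)
    phi = np.corrcoef(e[:-1], e[1:])[0, 1]   # lag-1 autocorrelation coefficient
    innovations = e[1:] - phi * e[:-1]       # decorrelated residual series
    return np.std(innovations, ddof=1)
```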
- The authors refer to their particle resampling step as a mutation step. This is wrong. A mutation is a random alteration to the DNA of a particle. This can happen to any parameter at any time and is fully random. The differential evolution (DE-I) step the authors use (from DE-MC, DREAM or Particle-DREAM) is a crossover step. The proposed changes to the DNA of the particles are based on an underlying mechanism [= Equation 18], augmented with an inconsequential random perturbation, zeta, to claim ergodicity. This is a crossover step wherein the DNA of two parents is combined to generate offspring (candidate particles). Note that in the original DREAM algorithm, there is no mutation step – only a crossover step. Then, why did the authors not consider the use of more than one pair (r1 – r2, r3 – r4, r5 – r6, etc.) of differences to generate candidate points? We have shown that this enhances considerably the variability in the candidate points. Indeed, the pair r1/r2 has a selection probability of 1/N * 1/(N-1), where N denotes the population size; r1, r2, …, r6, on the contrary, has a selection probability of 1/N * 1/(N-1) * 1/(N-2), etc. Thus one can generate a much larger variation in the proposals with multiple pairs. The side effect of this is that you may be able to reduce the population size.
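As an illustration of what a multi-pair crossover proposal looks like (a sketch with my own names; the jump rate follows the common 2.38/sqrt(2*delta*d) choice, with delta pairs and d dimensions):

```python
import numpy as np

def de_crossover_proposal(particles, i, n_pairs=3, zeta_scale=1e-6, rng=None):
    """Candidate point for particle i built from `n_pairs` parent pairs."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = particles.shape
    parents = rng.choice([j for j in range(N) if j != i], size=2 * n_pairs, replace=False)
    r1, r2 = parents[:n_pairs], parents[n_pairs:]
    gamma = 2.38 / np.sqrt(2.0 * n_pairs * d)           # standard DE-MC jump rate
    diff = (particles[r1] - particles[r2]).sum(axis=0)  # sum of pair differences
    zeta = zeta_scale * rng.standard_normal(d)           # small random perturbation
    return particles[i] + gamma * diff + zeta
```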
- Do I understand correctly that after convergence you have N = 200 samples from the target distribution? This is very small for a 17-dimensional parameter estimation problem and does not support an accurate depiction of the posterior parameter distribution. Again, why not just use an adaptive multi-chain Monte Carlo method such as DREAM_ZS or MT-DREAM_ZS? This machinery will do the job for you, while providing as byproduct a) automatic convergence monitoring (univariate and multivariate scale reduction factors, etc.), b) diagnostic checks of the residuals, c) confidence and prediction limits of the Bayesian forecast (simulation) PDF, d) scoring rules (CRPS, LS, QS, etc.) of the predictive distribution and access to distribution-adaptive likelihoods.
- In principle, the authors have access to the predictive distribution of the model, but in their model evaluation resort only to common deterministic measures of model performance. This includes the NSE and RMSE in Equations 19 and 20. Their use entails a large loss of information about model performance. This is just a side note: The authors should consider evaluating the full predictive distribution using scoring rules. I hesitate to advertise my own work, but here it is, Distribution-Based Model Evaluation and Diagnostics: Elicitability, Propriety, and Scoring Rules for Hydrograph Functionals: https://doi.org/10.1029/2023WR036710. This merely serves as a note to alert the authors against the use of deterministic measures of goodness of fit, in case one has access to the full Bayes predictive distribution. This allows for the use of information-theoretic principled metrics – which offer more protection against misinformation, disinformation, etc.
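As a concrete example of such a scoring rule, the continuous ranked probability score of an ensemble (posterior predictive sample) for a single observation can be estimated directly from the members (a sketch; the naming is mine):

```python
import numpy as np

def ensemble_crps(members, y):
    """Empirical CRPS, CRPS = E|X - y| - 0.5 * E|X - X'|, estimated from the
    ensemble members; lower values indicate a better predictive distribution."""
    x = np.asarray(members, float)
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
```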
- Then, the authors use different sensitivity analysis methods to decide which parameters are sensitive and which ones are not. A critical assumption in this analysis is that the parameters are independent. This assumption is convenient but may not be realistic in practice. As an idea, the authors could determine sensitivity using their posterior parameter samples. This would equate to probabilistic variance-based sensitivity analysis and relies on high-dimensional model representation (HDMR), an extension of Sobol to correlated variables. HDMR and HDMR with extended bases are computationally quite demanding, so this may pose problems with their LSM. Nevertheless, given their strong interest in computational methods I thought I’d point at the HDMR/HDMR_EXT toolboxes, which are available in MATLAB. Alternatively, they should consider the multi-criteria sensitivity analysis method of Bastidas, in evaluating sensitivity in the presence of more than one data type. But, realistically, why not do inference on all the parameters and then assess parameter sensitivity from the posterior LSM parameter distribution(s)? One could use measures such as the Kullback-Leibler divergence (= divergence of the logarithmic score, see Vrugt, 2024) to determine the distance between the marginal prior and marginal posterior distribution of each parameter. This is cheap to compute and will convey which parameters are most sensitive and which are not. Indeed, for a parameter that is sensitive one would expect its posterior marginal distribution to be narrow relative to its marginal prior. On the contrary, for a parameter that is insensitive it is common to see that its prior and posterior marginals are nearly equivalent. Maybe this approach simplifies the paper.
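A minimal sketch of that idea, scoring each parameter by the KL divergence of its histogram-binned posterior marginal from its uniform prior (function and variable names are mine):

```python
import numpy as np

def marginal_kl_sensitivity(posterior_samples, lower, upper, bins=20):
    """KL divergence of each posterior marginal from its uniform prior;
    larger values indicate a more sensitive (better constrained) parameter."""
    samples = np.asarray(posterior_samples, float)   # shape (n_samples, n_params)
    q = 1.0 / bins                                   # uniform prior probability per bin
    scores = []
    for k in range(samples.shape[1]):
        counts, _ = np.histogram(samples[:, k], bins=bins, range=(lower[k], upper[k]))
        p = counts / counts.sum()                    # posterior probability per bin
        mask = p > 0
        scores.append(np.sum(p[mask] * np.log(p[mask] / q)))
    return np.array(scores)
```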
- Figure 2: The x-labels are not readable and strangely chosen. If the authors want to demonstrate that the different optimization scenarios lead to different marginal distributions of the parameters, then why not use a single, linearly scaled x-axis and simply plot the 3 distributions on top of it? Thus the same as now, but with a linear scale on the x-axis. This would show immediately the differences between the methods. Then, I have my doubts whether the marginal distributions of the parameters shown are truly Gaussian. This may be an artefact of an insufficient sample of the target distribution, as commented on earlier.
- Figure 4: Maybe this is explained in the text and I missed it, but why are the observations indicated with a pink interval? Noisy observations? Personally, I would prefer plotting the data as discrete points; alternatively, one can think of averaging the data (measurements) so there are fewer data points, and then accompany these time-averaged estimates with error bars. As it stands right now, I see three deterministic simulations and then a wide range of possible data values. If the authors are interested in quantifying measurement uncertainty, then the nonparametric estimator of de Oliveira and Vrugt comes to mind: https://doi.org/10.1029/2022WR032263. Again, this is a side note, but it may help the authors present their data better. Maybe the authors can use this to their advantage.
- Section 4.2: This is not a difference between single and multiobjective methods. All the results correspond to a single objective, albeit with one or more weighted data streams. I commented on this before.
- Then, the authors use the terminology of optimization scenarios. Personally, I do not like the terminology of optimization in a Bayesian context. Optimization focuses on finding the single best solution whereas Bayesian inference is fundamentally different in that it wants to find a distribution of statistically acceptable solutions – this should include the MAP parameter values (= ML with uniform prior) but cannot be considered optimization. Some authors call it Bayesian calibration, I prefer Bayesian training or estimation. I leave this to the authors.
In summary, this paper needs a lot of work before it can be judged to make a significant contribution to the literature. I am sorry for highlighting my own work, but I felt this was relevant and necessary as the authors’ methodology has important flaws and shortcomings and needs to be reconsidered. Amongst others, I believe that a) the authors should properly demonstrate that their method is indeed robust and that the changes made to the PEM-SMC sampler leave the target distribution invariant, b) they must properly recognize past literature contributions, consider how their method differs from past methods, and articulate why we would need the SMC machinery after all to obtain the posterior LSM parameter distribution if current state-of-the-art methods can do this job (and the maximum likelihood method/weighted least squares will do in the current implementation of the authors), c) test the residual properties (diagnostic checks of autocorrelation, distribution and variance homogeneity) so as to demonstrate that the assumptions of the normal likelihood function are indeed met, d) present the confidence and prediction limits of the Bayes predictive distributions, e) clean up their language of single and multiple objectives, mutation and crossover steps, etc., and f) possibly consider more powerful model evaluation metrics rather than the deterministic measures used by the authors. No doubt, these comments and those listed above will involve substantial work. But this should significantly enhance the quality of the work presented in this paper.
I hope my comments are useful to improve your manuscript,
Jasper A. Vrugt
jasper@uci.edu
Citation: https://doi.org/10.5194/hess-2024-121-RC2
AC3: 'Reply on RC2', Cong Xu, 17 Sep 2024
Thank you for evaluating our work and providing valuable feedback. We fully recognize the importance of the improvements you have suggested and are committed to addressing each concern in detail. Please find the detailed responses to each concern in the supplementary file.
RC3: 'Comment on hess-2024-121', Anonymous Referee #3, 27 Aug 2024
Dear Authors,
Thank you very much for doing this work. I think the paper could make a contribution to HESS after some work. I leave here my main comments; I hope they help to improve the manuscript:
1- Literature review: I miss references to the work of Jasper Vrugt, mainly Vrugt et al. (2012) (the DREAM paper). You might also have a look at the work of Carlo Albert on ABC (around 2015, and perhaps the later reply from Vrugt), at Kavetski et al. (2018) and Fenicia et al. (2018) (look through the introductions and references of these two papers), and at the work of Prieto et al. (2021, 2022) on the identification of hydrological mechanisms and the diagnostic metrics therein. Otherwise, the introduction is a bit repetitive and convoluted, yet it misses information, e.g. about the choice or development of different likelihoods (later on you assume a normal Gaussian likelihood, but I do not see the justification) and about model diagnostic metrics and why they are used.
2- Benchmark: I am missing a benchmark for comparison. One good idea might be to use Vrugt's DREAM package as a benchmark; I am aware the authors have everything ready for it to be applied.
3- For the equations, I suggest using proper mathematical notation. At least for me this is helpful, and I am sure it is for readers too. E.g., why is everything italic? I suggest you differentiate vectors, matrices, random variables, etc. (i.e., bold, capital letters, and so on). You can find examples of this in Prieto et al. (2019, 2021, 2022), and I am sure in any paper from Vrugt as well.
4- Posterior pdf: for the posterior of the parameters, please use the full Bayes equation and then state that the left-hand side is proportional to the right-hand side, so that non-Bayesians can follow it. Indeed, this holds because you use only one model (e.g. see Prieto et al., 2021, 2022).
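That is, something along these lines (my notation, with θ the parameters and y the data):

$$
p(\theta \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \theta)\, p(\theta)}{p(\mathbf{y})} \;\propto\; p(\mathbf{y} \mid \theta)\, p(\theta),
$$

where the evidence p(y) is a constant because only one model structure is considered.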
5- Concern: why is the likelihood a normal likelihood? Could you please justify this and then analyse the residuals of the posteriors? Are you also assuming that the variables are independent, so that the likelihoods can be multiplied? I suggest you have a look at ABC here, just in case it can help.
6- Diagnostic metrics: also, could you please take advantage of the probabilistic analysis to evaluate the posterior pdf using probabilistic metrics that look at reliability, precision and bias, i.e. not only deterministic metrics, which inspect only one side of the story. Based on this, the advantages highlighted in the discussion section could be better defended.
7- For me, the text is a bit confusing when talking about multiple objectives; I guess most readers tend to think of multiple objective functions, which is not the case here because there is one likelihood. It is another matter that there are two target variables.
8- Maybe a naive question, but do you need all the sensitivity analysis (SA) methods in the main manuscript?
Once again, thank you for your work.
Hopefully my comments help in improving the paper.
In case the Editor asks for a revised version of the paper, I am very happy to serve as a reviewer of the revised version.
Best Wishes,
Reviewer
Citation: https://doi.org/10.5194/hess-2024-121-RC3
- AC2: 'Reply on RC3', Cong Xu, 15 Sep 2024