Volume 30, issue 9
https://doi.org/10.5194/hess-30-2797-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Multivariate calibration can increase simulated discharge uncertainty and model equifinality
Download
- Final revised paper (published on 11 May 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 09 May 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-1598', Anonymous Referee #1, 02 Jun 2025
  - AC2: 'Reply on RC1', Sandra Pool, 30 Aug 2025
- RC2: 'Comment on egusphere-2025-1598', Tam Nguyen, 12 Jun 2025
  - AC1: 'Reply on RC2', Sandra Pool, 30 Aug 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to revisions (further review by editor and referees) (08 Sep 2025) by Julia Knapp
AR by Sandra Pool on behalf of the Authors (12 Mar 2026)
  - Author's response
  - Author's tracked changes
  - Manuscript
ED: Referee Nomination & Report Request started (24 Mar 2026) by Julia Knapp
RR by Tam Nguyen (05 Apr 2026)
RR by Anonymous Referee #1 (19 Apr 2026)
ED: Publish as is (21 Apr 2026) by Julia Knapp
AR by Sandra Pool on behalf of the Authors (23 Apr 2026)
  - Manuscript
I reviewed the manuscript entitled “Multivariate calibration can increase simulated discharge uncertainty and model equifinality” by Pool et al. This study employs a Monte Carlo approach (100,000 parameter sets) to compare three hydrological model calibration strategies: discharge-only, ET-only, and discharge+ET. The key finding, that incorporating ET into calibration increases flux uncertainty and discharge variability while revealing incompatible parameter spaces for the discharge and ET objectives, is compelling and well supported. However, two issues (one major, one moderate) require resolution before publication (see below); there are also specific comments in the annotated manuscript. Overall, I suggest a major revision.
Major Concern: Definition of "Behavioral Parameters"
The authors define the top 100 parameter sets as “behavioral parameters” based solely on their ranking, without explicitly reporting the performance thresholds associated with these sets. For example:
- What NSE/KGE values correspond to the top 100 sets?
- Does the performance of the 100th set differ significantly from that of the 101st?
- Could expanding the threshold to the top 1,000 sets yield comparable or better performance while reducing equifinality?

Since the “behavioral” classification directly shapes the conclusions (e.g., the parameter space incompatibility in Sect. 5.2), this ambiguity undermines the practical relevance of the results. I therefore suggest adding performance metrics (e.g., NSE/KGE ranges) for the top 100 and top 1,000 parameter sets in a table or figure, and justifying the choice of 100 sets (e.g., via sensitivity tests or a breakpoint analysis of performance versus parameter set rank, as sketched below).
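A minimal sketch of such a breakpoint analysis, assuming only that each of the 100,000 parameter sets has a scalar performance score; the `kge` array below is a randomly generated placeholder, not the study's data:

```python
import numpy as np

# Placeholder scores standing in for the KGE of each of the
# 100,000 Monte Carlo parameter sets (higher is better).
rng = np.random.default_rng(seed=0)
kge = rng.beta(a=8, b=3, size=100_000)

kge_sorted = np.sort(kge)[::-1]  # rank parameter sets from best to worst

# Performance range of the top-N "behavioral" sets for candidate thresholds.
for n in (100, 1_000):
    top = kge_sorted[:n]
    print(f"top {n:>5}: KGE {top.min():.3f} to {top.max():.3f}")

# Breakpoint check: the rank at which performance drops fastest.
drop = -np.diff(kge_sorted[:2_000])  # score lost from one rank to the next
print("largest drop after rank", int(np.argmax(drop)) + 1)
```

Reporting these ranges for several candidate thresholds would let readers judge whether the 100-set cutoff is a natural breakpoint or an arbitrary choice.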
Moderate Concern: Parameter Distribution Analysis (Figure 4)
Figure 4 illustrates parameter distribution overlaps between calibration strategies but lacks methodological transparency: the rationale for selecting the specific parameter pairs shown (e.g., based on sensitivity or correlation) is unclear, and subjective terms like “separation” need quantitative support (e.g., Kolmogorov-Smirnov tests or overlap coefficients). I suggest clarifying how the parameter pairs were chosen and replacing the qualitative descriptions with metrics (e.g., heatmaps of distribution overlap or pairwise correlation matrices) that synthesize relationships across all parameters; a minimal sketch of the suggested statistics follows.
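To make “separation” concrete, here is a minimal sketch of the two suggested statistics, assuming 100 behavioral values of one parameter per strategy; the samples and the names `param_q`/`param_qet` are hypothetical placeholders, not the study's data:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder samples: behavioral values of one model parameter under
# two calibration strategies (synthetic, for illustration only).
rng = np.random.default_rng(seed=1)
param_q = rng.normal(loc=0.4, scale=0.10, size=100)    # discharge-only
param_qet = rng.normal(loc=0.6, scale=0.12, size=100)  # discharge + ET

# Two-sample Kolmogorov-Smirnov test: D near 1 indicates strong separation.
d_stat, p_value = ks_2samp(param_q, param_qet)

# Overlap coefficient from shared-bin density histograms
# (1 = identical distributions, 0 = fully disjoint).
bins = np.linspace(min(param_q.min(), param_qet.min()),
                   max(param_q.max(), param_qet.max()), 30)
h1, _ = np.histogram(param_q, bins=bins, density=True)
h2, _ = np.histogram(param_qet, bins=bins, density=True)
ovl = float(np.sum(np.minimum(h1, h2) * np.diff(bins)))

print(f"KS D = {d_stat:.2f} (p = {p_value:.3f}), overlap = {ovl:.2f}")
```

Computing the overlap coefficient for every parameter, rather than for selected pairs, would yield exactly the kind of heatmap suggested above.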