Articles | Volume 30, issue 9
https://doi.org/10.5194/hess-30-2651-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Introducing the Model Fidelity Metric (MFM) for robust and diagnostic land surface model evaluation
Download
- Final revised paper (published on 06 May 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 27 Dec 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-6212', Anonymous Referee #1, 31 Jan 2026
- AC1: 'Reply on RC1', Zhongwang Wei, 31 Mar 2026
- RC2: 'Comment on egusphere-2025-6212', Anonymous Referee #2, 16 Mar 2026
- AC2: 'Reply on RC2', Zhongwang Wei, 31 Mar 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (06 Apr 2026) by Bo Guo
AR by Zhongwang Wei on behalf of the Authors (07 Apr 2026)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (10 Apr 2026) by Bo Guo
RR by Anonymous Referee #1 (10 Apr 2026)
RR by Anonymous Referee #2 (10 Apr 2026)
ED: Publish as is (13 Apr 2026) by Bo Guo
AR by Zhongwang Wei on behalf of the Authors (20 Apr 2026)
Manuscript
This study introduces the Model Fidelity Metric (MFM) as an alternative to traditional metrics such as NSE and KGE. The method demonstrates practical improvements in specific failure modes, such as error compensation and low-variability conditions, using synthetic tests and the CAMELS dataset. However, the paper requires further improvement in its conceptual explanations, its methodological descriptions, and its treatment of error metrics. See my comments below.
Although the study conducted a sensitivity analysis on the hyperparameters, it does not provide specific guidance on their selection. I suggest supplementing the paper with recommended parameter values or an adaptive selection method to enhance the practical utility of the approach.
Errors in land surface variable estimation are usually complex. In many studies, multiple metrics, such as correlation coefficients and bias, are used together to better understand the sources of these errors. Individually, none of these metrics comprehensively reflects model deficiencies, but used in combination they offer greater flexibility. For example, soil moisture evaluations tend to emphasize correlation and ubRMSE, with less attention paid to bias. For variables such as ET and LAI, strong seasonality often necessitates decomposing the time series into anomalies and seasonal components, which are then evaluated separately. When developing new error metrics, how do you take these conventional practices into account?
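To make the decomposition practice concrete, a minimal sketch of the conventional approach the comment refers to: split a series into a mean seasonal cycle (climatology by position within the period) and anomalies, which are then scored separately. This uses synthetic data and is not taken from the manuscript under review.

```python
import numpy as np

def decompose_seasonal(series, period=365):
    """Split a time series into a mean seasonal cycle and anomalies.

    The climatology is the mean value at each position within the period
    (e.g. day of year); the anomaly is the residual after removing it.
    """
    series = np.asarray(series, dtype=float)
    idx = np.arange(len(series)) % period
    clim = np.array([series[idx == k].mean() for k in range(period)])
    seasonal = clim[idx]          # climatology repeated over the record
    anom = series - seasonal      # deseasonalized anomalies
    return seasonal, anom

# Synthetic two-year "observation": seasonal cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(730)
obs = 3.0 + 2.0 * np.sin(2 * np.pi * t / 365) + 0.3 * rng.standard_normal(730)
seasonal, anom = decompose_seasonal(obs)
```

Evaluating the seasonal and anomaly components separately (e.g. correlation on each) isolates phenological errors from day-to-day variability errors, which a single aggregate metric conflates.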
Compared with traditional error metrics, MFM involves more complex computations. Could the authors clarify the scenarios in which they recommend using this metric?
In addition to evaluating the performance of estimated variables, error metrics are expected to help diagnose potential model deficiencies. While I recognize the advantages of MFM in some cases, how can its results be interpreted to identify specific problems in the model?
Eq (1): what are i and n?
Line 39: I do not think it is appropriate to refer to NSE as the standard metric for LSM evaluation, as this may be misleading. Although NSE is useful for normalizing model performance and enabling cross-basin and cross-model comparisons, it should not be considered inherently better than other metrics. Its application should be determined by the specific variable and purpose, and model errors are often best explained using multiple complementary metrics.
Lines 42-44: It is precisely because of its quadratic formulation and high sensitivity to outliers that NSE is often used in streamflow evaluations with a particular focus on peak flows. Controversial conclusions are more likely the result of applying NSE in inappropriate contexts, rather than an inherent problem with NSE itself.
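To illustrate the point about the quadratic formulation, a small sketch with a synthetic hydrograph (not data from the paper): NSE penalizes a single 50 % peak error far more heavily than a uniform low-flow bias, which is exactly why it is favored when peak flows matter.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - sum((sim-obs)^2) / sum((obs-mean(obs))^2)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0])  # one flood peak

miss_peak = obs.copy()
miss_peak[3] = 5.0            # peak underestimated by 50 %, low flows exact

biased_low = obs + 0.5        # every low flow biased by +0.5
biased_low[3] = 10.0          # peak reproduced exactly
```

Here `nse(miss_peak, obs)` is far lower than `nse(biased_low, obs)`, even though the biased simulation is wrong at six of seven time steps: the squared-error term concentrates the penalty on the peak.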
Line 57: What are these limitations?
Lines 59-61: The correlation term in KGE already helps penalize this issue.
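A minimal sketch of how the correlation term catches a pure timing error (synthetic sinusoidal data, assumed for illustration only): a 30-day phase shift leaves the mean and variance of the simulation identical to the observation, so the bias and variability terms of KGE (2009 formulation) are perfect, and the entire penalty comes from r.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (2009 form) and its three components:
    r = correlation, alpha = std ratio, beta = mean ratio."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    score = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return score, r, alpha, beta

t = np.arange(365)
obs = 3.0 + np.sin(2 * np.pi * t / 365)
sim = 3.0 + np.sin(2 * np.pi * (t - 30) / 365)  # 30-day timing shift only
score, r, alpha, beta = kge(sim, obs)
```

Because the shifted series is a permutation of the original over a full period, alpha and beta equal 1 and the KGE score reduces to r, i.e. the correlation term alone carries the timing penalty.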
Lines 64-67: This statement is rather vague. Could the authors provide a concrete example, for instance, specifying the data or variables involved?
Line 69: Likewise, regarding the “right for the wrong reasons” issue, a concrete example would be helpful. This would allow readers to assess the severity of the problems potentially associated with KGE, rather than relying solely on the authors’ statements.
Line 71: If KGE is highly responsive to such balancing errors, what are the implications in practice? For instance, for simulations with similar KGE values, how large can the peak flow errors be?
Line 103: What applications?
Lines 106-110: The authors claim that highly skewed, non-Gaussian distributions violate the normality assumptions of moment-based metrics such as NSE and KGE, potentially biasing model evaluation. But do NSE and KGE actually require normally distributed data, or is this statement an overgeneralization?
Line 152: While developing metrics less sensitive to error compensation is a worthwhile goal, it is important to recognize that any aggregated metric will inevitably reflect a combination of different error types (e.g., random, systematic, or phase errors). Complete elimination of error compensation within a single metric may therefore be unrealistic.
Line 157: What is p-Error? The definition of the NMAEp metric is presented too abruptly without a clear explanation.
Line 160: Likewise, it is hard to understand what SUSE is and why it is workable for addressing KGE's shortcomings.
Line 234: This is not attributable to skewed data. It is a general artifact of aggregate error metrics that are sensitive to sign cancellation, which can occur with any distribution, including the normal.
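To support the point that sign cancellation is distribution-independent, a sketch with perfectly Gaussian data and zero-mean Gaussian errors (synthetic, for illustration): the mean bias is near zero while the mean absolute error remains large, so the cancellation artifact appears without any skewness.

```python
import numpy as np

rng = np.random.default_rng(42)
obs = rng.normal(5.0, 1.0, 1000)          # normally distributed "observations"
sim = obs + rng.normal(0.0, 1.0, 1000)    # zero-mean Gaussian errors

bias = (sim - obs).mean()                 # near zero: +/- errors cancel
mae = np.abs(sim - obs).mean()            # approx sqrt(2/pi) ~ 0.8: they do not
```

The aggregate bias suggests a near-perfect model while the pointwise error is substantial, which is the sign-cancellation artifact the comment describes, here arising under an exactly normal error distribution.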
Line 244: What are min(S, O) and max(S, O)? Do S and O denote the scaled and original series?
Line 290: Given that the limitations of NSE and KGE are discussed earlier, it is unclear why they are treated as benchmark metrics here. Would it be more appropriate to refer to them as baseline metrics?
Line 303: The introduction of the CAMELS dataset should appear earlier.