Articles | Volume 30, issue 9
https://doi.org/10.5194/hess-30-2651-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Introducing the Model Fidelity Metric (MFM) for robust and diagnostic land surface model evaluation
Download
- Final revised paper (published on 06 May 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 27 Dec 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-6212', Anonymous Referee #1, 31 Jan 2026
- AC1: 'Reply on RC1', Zhongwang Wei, 31 Mar 2026
- RC2: 'Comment on egusphere-2025-6212', Anonymous Referee #2, 16 Mar 2026
- AC2: 'Reply on RC2', Zhongwang Wei, 31 Mar 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (06 Apr 2026) by Bo Guo
AR by Zhongwang Wei on behalf of the Authors (07 Apr 2026)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (10 Apr 2026) by Bo Guo
RR by Anonymous Referee #1 (10 Apr 2026)
RR by Anonymous Referee #2 (10 Apr 2026)
ED: Publish as is (13 Apr 2026) by Bo Guo
AR by Zhongwang Wei on behalf of the Authors (20 Apr 2026)
Manuscript
This study introduces the Model Fidelity Metric (MFM) as an alternative to traditional metrics such as NSE and KGE. The method demonstrates practical improvements in specific failure modes, such as error compensation and low-variability conditions, using synthetic tests and the CAMELS dataset. However, the paper requires further improvement in its conceptual explanations, its methodological descriptions, and its treatment of error metrics. See my comments below.
Although the study conducted a sensitivity analysis on the hyperparameters, it does not provide specific guidance on their selection. I suggest supplementing the paper with recommended parameter values or an adaptive selection method to enhance the practical utility of the approach.
Errors in land surface variable estimation are usually complex. In many studies, multiple metrics, such as correlation coefficients and bias, are used together to better understand the sources of these errors. Individually, none of these metrics comprehensively reflects model deficiencies, but used in combination they offer greater flexibility. For example, soil moisture evaluations tend to emphasize correlation and ubRMSE, with less attention paid to bias. For variables such as ET and LAI, strong seasonality often necessitates decomposing the time series into anomalies and seasonal components, which are then evaluated separately. When developing new error metrics, how do you take these conventional practices into account?
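To make the decomposition practice concrete, a minimal sketch of the conventional approach the comment refers to: split a series into a mean seasonal cycle (climatology by position within the period) and anomalies, which are then scored separately. This uses synthetic data and is not taken from the manuscript under review.

```python
import numpy as np

def decompose_seasonal(series, period=365):
    """Split a time series into a mean seasonal cycle and anomalies.

    The climatology is the mean value at each position within the period
    (e.g. day of year); the anomaly is the residual after removing it.
    """
    series = np.asarray(series, dtype=float)
    idx = np.arange(len(series)) % period
    clim = np.array([series[idx == k].mean() for k in range(period)])
    seasonal = clim[idx]          # climatology repeated over the record
    anom = series - seasonal      # deseasonalized anomalies
    return seasonal, anom

# Synthetic two-year "observation": seasonal cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(730)
obs = 3.0 + 2.0 * np.sin(2 * np.pi * t / 365) + 0.3 * rng.standard_normal(730)
seasonal, anom = decompose_seasonal(obs)
```

Evaluating the seasonal and anomaly components separately (e.g. correlation on each) isolates phenological errors from day-to-day variability errors, which a single aggregate metric conflates.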
Compared with traditional error metrics, MFM involves more complex computations. Could the authors clarify the scenarios in which they recommend using this metric?
In addition to evaluating the performance of estimated variables, error metrics are expected to help diagnose potential model deficiencies. While I recognize the advantages of MFM in some cases, how can its results be interpreted to identify specific problems in the model?
Eq (1): what are i and n?
Line 39: I do not think it is appropriate to refer to NSE as the standard metric for LSM evaluation, as this may be misleading. Although NSE is useful for normalizing model performance and enabling cross-basin and cross-model comparisons, it should not be considered inherently better than other metrics. Its application should be determined by the specific variable and purpose, and model errors are often best explained using multiple complementary metrics.
Lines 42-44: It is precisely because of its quadratic formulation and high sensitivity to outliers that NSE is often used in streamflow evaluations with a particular focus on peak flows. Controversial conclusions are more likely the result of applying NSE in inappropriate contexts, rather than an inherent problem with NSE itself.
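To illustrate the point about the quadratic formulation, a small sketch with a synthetic hydrograph (not data from the paper): NSE penalizes a single 50 % peak error far more heavily than a uniform low-flow bias, which is exactly why it is favored when peak flows matter.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - sum((sim-obs)^2) / sum((obs-mean(obs))^2)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0])  # one flood peak

miss_peak = obs.copy()
miss_peak[3] = 5.0            # peak underestimated by 50 %, low flows exact

biased_low = obs + 0.5        # every low flow biased by +0.5
biased_low[3] = 10.0          # peak reproduced exactly
```

Here `nse(miss_peak, obs)` is far lower than `nse(biased_low, obs)`, even though the biased simulation is wrong at six of seven time steps: the squared-error term concentrates the penalty on the peak.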
Line 57: What are these limitations?
Lines 59-61: The correlation term in KGE already helps penalize this issue.
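A minimal sketch of how the correlation term catches a pure timing error (synthetic sinusoidal data, assumed for illustration only): a 30-day phase shift leaves the mean and variance of the simulation identical to the observation, so the bias and variability terms of KGE (2009 formulation) are perfect, and the entire penalty comes from r.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (2009 form) and its three components:
    r = correlation, alpha = std ratio, beta = mean ratio."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    score = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return score, r, alpha, beta

t = np.arange(365)
obs = 3.0 + np.sin(2 * np.pi * t / 365)
sim = 3.0 + np.sin(2 * np.pi * (t - 30) / 365)  # 30-day timing shift only
score, r, alpha, beta = kge(sim, obs)
```

Because the shifted series is a permutation of the original over a full period, alpha and beta equal 1 and the KGE score reduces to r, i.e. the correlation term alone carries the timing penalty.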
Lines 64-67: This statement is rather vague. Could the authors provide a concrete example, for instance, specifying the data or variables involved?
Line 69: Likewise, regarding the “right for the wrong reasons” issue, a concrete example would be helpful. This would allow readers to assess the severity of the problems potentially associated with KGE, rather than relying solely on the authors’ statements.
Line 71: If KGE is highly responsive to such balancing errors, what are the implications in practice? For instance, for simulations with similar KGE values, how large can the peak flow errors be?
Line 103: What applications?
Lines 106-110: The authors claim that highly skewed, non-Gaussian distributions violate the normality assumptions of moment-based metrics such as NSE and KGE, potentially biasing model evaluation. But do NSE and KGE actually require normally distributed data, or is this statement an overgeneralization?
Line 152: While developing metrics less sensitive to error compensation is a worthwhile goal, it is important to recognize that any aggregated metric will inevitably reflect a combination of different error types (e.g., random, systematic, or phase errors). Complete elimination of error compensation within a single metric may therefore be unrealistic.
Line 157: What is p-Error? The definition of the NMAEp metric is presented too abruptly without a clear explanation.
Line 160: Likewise, it is hard to understand what SUSE is and why it is workable for addressing KGE's shortcomings.
Line 234: This is not attributable to skewed data. It is a general artifact of aggregate error metrics that are sensitive to sign cancellation, which can occur with any distribution, including the normal.
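To support the point that sign cancellation is distribution-independent, a sketch with perfectly Gaussian data and zero-mean Gaussian errors (synthetic, for illustration): the mean bias is near zero while the mean absolute error remains large, so the cancellation artifact appears without any skewness.

```python
import numpy as np

rng = np.random.default_rng(42)
obs = rng.normal(5.0, 1.0, 1000)          # normally distributed "observations"
sim = obs + rng.normal(0.0, 1.0, 1000)    # zero-mean Gaussian errors

bias = (sim - obs).mean()                 # near zero: +/- errors cancel
mae = np.abs(sim - obs).mean()            # approx sqrt(2/pi) ~ 0.8: they do not
```

The aggregate bias suggests a near-perfect model while the pointwise error is substantial, which is the sign-cancellation artifact the comment describes, here arising under an exactly normal error distribution.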
Line 244: What are min(S, O) and max(S, O)? Do S and O denote the scaled and original series?
Line 290: Given that the limitations of NSE and KGE are discussed earlier, it is unclear why they are treated as benchmark metrics here. Would it be more appropriate to refer to them as baseline metrics?
Line 303: The introduction of the CAMELS dataset should appear earlier.