Technical note: Benchmarking large-domain model performance under sampling uncertainty

Gründemann, Gaby J.; Knoben, Wouter J. M.; Song, Yalan; van Werkhoven, Katie; Clark, Martyn P.

doi:10.5194/hess-30-3439-2026

Articles | Volume 30, issue 11

https://doi.org/10.5194/hess-30-3439-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.


Technical note | 05 Jun 2026


Gaby J. Gründemann, Wouter J. M. Knoben, Yalan Song, Katie van Werkhoven, and Martyn P. Clark

Download

Final revised paper (published on 05 Jun 2026)
Supplement to the final revised paper
Preprint (discussion started on 02 Feb 2026)
Supplement to the preprint

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2025-6460', Anonymous Referee #1, 13 Mar 2026

Review of "Technical note: Separating signal from noise in large-domain hydrologic model evaluation - Benchmarking model performance" by Gründemann et al.
The technical note promotes the use of various benchmarks for model performance evaluation, particularly in a large-domain setting (or for large-sample studies) and includes a quantification of sampling uncertainty from different periods through bootstrapping of different hydrological years.
The manuscript is clearly, concisely written and well structured.
Before I can recommend publication, however, I would like to raise the following comments:
Major comments:

- Since this note is all about the benchmarks, I think two ingredients are missing:

1) Please add the benchmarks and their description to the main text and not just to the supplementary material, and ensure that the abbreviations match those in the figures (or vice versa).
2) Each of the benchmarks is essentially a test of how well a model should minimally perform regarding a specific aspect. This is not discussed in detail in the manuscript, but I think providing some examples would really help promote the use of various benchmarks, from very simple ones targeting perhaps the water balance to more complex ones. I would suggest extending the discussion and conclusions accordingly, as well as adding an explanation of which aspect each benchmark targets to the table describing them.
- There is the sampling uncertainty and there is the model uncertainty, but these metrics are also affected by the uncertainty inherent in the observations. It would be worth reminding the reader that this uncertainty can be considerable and can strongly influence the performance metric. For instance, for discharge, there is the rating curve uncertainty, which is not constant but varies with the flow (see for instance Westerberg et al., 2011).

Line-by-line comments:

Abstract

L4 name at least some examples of what is meant by a simple benchmark, i.e. make it more specific

L5-7 these results are valid for the study region and basins but not for other regions; please add that the data set is from the United States, and maybe even mention the NWM

L9 ", though accounting..." this part of the sentence is not clear. Please rephrase.

Main text

L21-25 the words "score", "statistics", "efficiency" and "metrics" are used interchangeably. I would suggest choosing one, where applicable, and using it consistently throughout the manuscript

L22 "and more" remove (there is already "for example" in the same sentence)

L34 ... or further checks are required

L40 "can be " -> "is"

L120 since the benchmarks are the core of this note, Table S1 should be moved to the main text and the abbreviations adjusted accordingly.

L126 "as" -> "that"

L239 Supporting

L239 abbreviation was already introduced in L25

L257 "perform" missing?

L262 which benchmark? please add

L284 remove "and" before "snow"

Figure 2 in the upper panel the lines are not distinguishable in b&w print

Figure 3 Please add the written-out benchmarks in the caption so that the figure can stand alone.

References

Westerberg, I., Guerrero, J. L., Seibert, J., Beven, K. J., & Halldin, S. (2011). Stage‐discharge uncertainty derived with a non‐stationary rating curve in the Choluteca River, Honduras. Hydrological Processes, 25(4), 603-613.

Citation: https://doi.org/10.5194/egusphere-2025-6460-RC1
- AC1: 'Reply on RC1', Wouter Knoben, 06 Apr 2026
  
  Please see our response in the attached PDF.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6460-AC1
RC2: 'Comment on egusphere-2025-6460', Anonymous Referee #2, 16 Mar 2026
The authors provide a nice and easy-to-read study on large-domain hydrological model evaluation. They introduced benchmarks as a valuable tool to investigate where a large domain hydrologic model might still lack in performance. This makes the manuscript, in my opinion, very interesting for a wider hydrological audience. Before publication, however, I think the manuscript would benefit from a bit more analysis on why the model is failing in certain areas. Below, I try to give some constructive feedback for the authors to consider:
General remarks:
I have the feeling that a simple explanation of why a model fails if it is worse than a certain benchmark would help readers to understand the point of this technical note.

I have the feeling the Introduction is not well linked to the rest of the manuscript. I did not get the impression that the questions raised were answered. The manuscript does not give any guidance if a score is indicative or useful for a model, nor does it go into quantifying their uncertainty. Isn’t the point of the manuscript more to find regions and reasons where the model is failing against the suggested benchmarks? I recommend restructuring the introduction accordingly.

Discussion: I would like to see a more in-depth discussion of what it actually means if the benchmark is better than the model. After that, you can go into the analysis of where and why the model might have failed. For this, however, I would recommend putting more emphasis on why the model has failed. Maybe look at it from a model development perspective: what would you need to do to improve the model? Try to give some guidance. For example, correlating your J index against catchment attributes (soil, land use, climate, etc.), and also against the KGE, might give more insights. I acknowledge that the authors already provide some discussion of why the model might fail under certain circumstances, but very little of it is really based on the results of this manuscript; most of it is based on the authors' knowledge and other literature.

Minor comments:
Title: Does it really fit what the manuscript is about? I would suggest something like “Technical note: Benchmarking large-domain hydrological model performance”. If the authors want to mention uncertainty in their title, they should be specific about what kind of uncertainty they are referring to.

Introduction: What exactly are simple benchmarks, where have they been used, what is their benefit, and how do they relate to hydrological signatures?

Section 2.4 Benchmarks: It should be better explained which benchmarks are actually used and why.

Figure 2: Might it be easier to focus on the evaluation period only? And maybe I missed it in the Data and Methods section, but it should be clearly defined what the evaluation period is. Section 2.4 is speaking of a validation period; is this used as a synonym here? If so, it would be better to use only one of the two words throughout the manuscript.

Figure 3: Show what the BMs actually stand for; that’s not too much text for the figure.

Citation: https://doi.org/10.5194/egusphere-2025-6460-RC2
- AC2: 'Reply on RC2', Wouter Knoben, 06 Apr 2026
  
  Please see our full reply in the attached PDF.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6460-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Publish subject to revisions (further review by editor and referees) (15 Apr 2026) by Ralf Loritz

AR by Wouter Knoben on behalf of the Authors (17 Apr 2026): Author's response, Author's tracked changes, Manuscript

ED: Referee Nomination & Report Request started (20 Apr 2026) by Ralf Loritz

RR by Anonymous Referee #1 (20 May 2026)

RR by Anonymous Referee #2 (20 May 2026)

ED: Publish as is (21 May 2026) by Ralf Loritz

AR by Wouter Knoben on behalf of the Authors (21 May 2026)

Short summary

The quality of large-domain hydrologic model simulations is often quantified with so-called accuracy metrics. Here we use simple benchmarks to provide relevant context for these accuracy metrics. Results show that areas where the model cannot beat the benchmarks do not always align with areas where the accuracy metrics are low. This suggests that model improvements are possible in regions where this might not be obvious under more typical model evaluation approaches (i.e., without benchmarks).
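
The idea of putting an accuracy metric in context with a benchmark, while accounting for sampling uncertainty, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic data, and the choice of a day-of-year mean flow as the "simple benchmark" are all assumptions made for illustration, and the benchmarks actually used in the note are described in its supplement. The sketch computes the Kling-Gupta efficiency (KGE) for a model and for the benchmark, then resamples whole years with replacement to see how much the difference between the two scores varies with the choice of years.

```python
import numpy as np


def kge(sim, obs):
    """Kling-Gupta efficiency in its original three-component form."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def bootstrap_kge_difference(model, bench, obs, years, n_boot=1000, seed=0):
    """Resample whole years with replacement (hypothetical helper).

    Returns KGE(model) - KGE(benchmark) for each bootstrap sample.
    `years` holds one year label per time step (e.g. the hydrological year).
    """
    rng = np.random.default_rng(seed)
    model, bench, obs, years = map(np.asarray, (model, bench, obs, years))
    unique_years = np.unique(years)
    idx_by_year = {y: np.flatnonzero(years == y) for y in unique_years}
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        drawn = rng.choice(unique_years, size=unique_years.size, replace=True)
        idx = np.concatenate([idx_by_year[y] for y in drawn])
        diffs[b] = kge(model[idx], obs[idx]) - kge(bench[idx], obs[idx])
    return diffs


# Synthetic example data (assumption: 10 years of 365 daily flows).
rng = np.random.default_rng(42)
n_years, n_days = 10, 365
years = np.repeat(np.arange(2000, 2000 + n_years), n_days)
doy = np.tile(np.arange(n_days), n_years)
seasonal = 5.0 + 3.0 * np.sin(2 * np.pi * doy / n_days)
obs = seasonal + rng.gamma(2.0, 1.0, size=years.size)
model = obs + rng.normal(0.0, 0.5, size=years.size)

# Illustrative benchmark: mean observed flow for each day of the year.
bench = np.array([obs[doy == d].mean() for d in range(n_days)])[doy]

diffs = bootstrap_kge_difference(model, bench, obs, years, n_boot=200)
lo, hi = np.percentile(diffs, [5, 95])
print(f"KGE(model) - KGE(benchmark): 5th-95th percentile = [{lo:.3f}, {hi:.3f}]")
```

If the resulting interval lies entirely above zero, the model beats this particular benchmark regardless of which years happen to be sampled; an interval that spans zero would indicate that the comparison is not robust to sampling uncertainty. Resampling whole years rather than individual days keeps the within-year structure of the flow series intact.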
