<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="methods-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">HESS</journal-id><journal-title-group>
    <journal-title>Hydrology and Earth System Sciences</journal-title>
    <abbrev-journal-title abbrev-type="publisher">HESS</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Hydrol. Earth Syst. Sci.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1607-7938</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/hess-30-3439-2026</article-id><title-group><article-title>Technical note: Benchmarking large-domain model performance under sampling uncertainty</article-title><alt-title>Benchmarking large-domain model performance</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Gründemann</surname><given-names>Gaby J.</given-names></name>
          
        <ext-link>https://orcid.org/0000-0001-7311-7769</ext-link></contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff1">
          <name><surname>Knoben</surname><given-names>Wouter J. M.</given-names></name>
          <email>wouter.knoben@ucalgary.ca</email>
        <ext-link>https://orcid.org/0000-0001-8301-3787</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Song</surname><given-names>Yalan</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-4722-148X</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>van Werkhoven</surname><given-names>Katie</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Clark</surname><given-names>Martyn P.</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>Schulich School of Engineering, University of Calgary, Alberta, Canada</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Civil and Environmental Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>Research Triangle Institute, Research Triangle Park, North Carolina, United States of America</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Wouter J. M. Knoben (wouter.knoben@ucalgary.ca)</corresp></author-notes><pub-date><day>5</day><month>June</month><year>2026</year></pub-date>
      
      <volume>30</volume>
      <issue>11</issue>
      <fpage>3439</fpage><lpage>3453</lpage>
      <history>
        <date date-type="received"><day>23</day><month>December</month><year>2025</year></date>
           <date date-type="rev-request"><day>2</day><month>February</month><year>2026</year></date>
           <date date-type="rev-recd"><day>17</day><month>April</month><year>2026</year></date>
           <date date-type="accepted"><day>21</day><month>May</month><year>2026</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Gaby J. Gründemann et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026.html">This article is available from https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026.html</self-uri><self-uri xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026.pdf">The full text article is available as a PDF file from https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e132">Large-domain hydrologic modeling studies are becoming increasingly common. The evaluation of the resulting models is however often limited to the use of aggregated performance scores that show where model accuracy is higher and lower. Moreover, the inherent uncertainty in such scores (i.e., the sampling uncertainty), stemming from the choice of time periods used for their calculation, often remains unaccounted for. Here we use a collection of simple benchmarks whilst accounting for this sampling uncertainty to provide context for the performance scores of a large-domain hydrologic model. These benchmarks are simple ways of predicting the variable of interest and include, for example, the long-term daily mean flow, daily precipitation scaled by the average rainfall-runoff ratio, and a basic 2-parameter model that represents a catchment's diffusive response to precipitation inputs. Our test case consists of simulations from the National Water Model v3.0 for approximately 4900 basins across the United States. The benchmarks suggest that there are considerable constraints on the model's performance in approximately one-third of the basins used for model calibration and in approximately half of the basins where model parameters are regionalized. Sampling uncertainty has limited impact: in most basins the model is either clearly better or worse than the benchmarks, though numerous cases remain where sampling uncertainty makes it difficult to clearly distinguish between model and benchmark performance. The areas where the benchmarks outperform the model only partially overlap with areas where the model achieves lower performance scores, and this suggests that improvements may be possible in more regions than a first glance at model performance values may indicate. A key advantage of using these benchmarks is that they are easy and fast to compute, particularly compared to the cost of configuring and running the model. 
This makes benchmarking a valuable tool that can complement more detailed model evaluation techniques by quickly identifying areas that should be investigated more thoroughly.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>National Oceanic and Atmospheric Administration</funding-source>
<award-id>NA22NWS4320003</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e144">There is a pressing societal need for predictions of water-related risks across large geographical domains. Consequently, water resources modeling at national, continental and global scales is becoming increasingly common <xref ref-type="bibr" rid="bib1.bibx2 bib1.bibx11 bib1.bibx36 bib1.bibx50 bib1.bibx55" id="paren.1"><named-content content-type="pre">e.g.,</named-content></xref>. Thorough evaluation of such large-domain models is necessary to improve our understanding of the water cycle and our ability to model it accurately, and to ensure the usability and reliability of model simulations for decision making.</p>
      <p id="d2e152">Considerable guidance on model evaluation exists, focusing for example on diagnostics <xref ref-type="bibr" rid="bib1.bibx19 bib1.bibx21" id="paren.2"><named-content content-type="pre">e.g.,</named-content></xref>, multi-variate evaluation <xref ref-type="bibr" rid="bib1.bibx42 bib1.bibx12" id="paren.3"><named-content content-type="pre">e.g.,</named-content></xref>, and multi-objective evaluation <xref ref-type="bibr" rid="bib1.bibx13 bib1.bibx29" id="paren.4"><named-content content-type="pre">e.g.,</named-content></xref>. A common theme across these different approaches to model evaluation is that model performance tends to be quantified through performance metrics such as the Root Mean Squared Error (RMSE), the Nash-Sutcliffe Efficiency (NSE; <xref ref-type="bibr" rid="bib1.bibx35" id="altparen.5"/>) and the Kling-Gupta Efficiency <xref ref-type="bibr" rid="bib1.bibx20" id="paren.6"><named-content content-type="pre">KGE;</named-content></xref>. Such metrics summarize the (mis)match between observations and a model's simulations as a single performance score. These scores are useful because the community has relied on them for a long time and they now function as an informal shared test environment <xref ref-type="bibr" rid="bib1.bibx8" id="paren.7"/>. However, a key challenge remains that the scores calculated by these metrics are difficult to interpret in isolation <xref ref-type="bibr" rid="bib1.bibx47 bib1.bibx46 bib1.bibx25" id="paren.8"><named-content content-type="pre">e.g.,</named-content></xref>, partly because they tend to conflate model performance and flow variability <xref ref-type="bibr" rid="bib1.bibx46 bib1.bibx57 bib1.bibx8" id="paren.9"/>.</p>
      <p id="d2e192">The deliberate use of benchmarks can provide a helpful frame of reference for interpreting efficiency scores such as NSE and KGE, by setting realistic expectations of the possible performance in each basin <xref ref-type="bibr" rid="bib1.bibx47 bib1.bibx46 bib1.bibx32 bib1.bibx39 bib1.bibx48 bib1.bibx4 bib1.bibx25" id="paren.10"/>. A well-known example follows from a specific interpretation of the Nash-Sutcliffe Efficiency <xref ref-type="bibr" rid="bib1.bibx35" id="paren.11"/>:

          <disp-formula id="Ch1.E1" content-type="numbered"><label>1</label><mml:math id="M1" display="block"><mml:mrow><mml:mi mathvariant="normal">NSE</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:msup><mml:mfenced open="(" close=")"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">sim</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mfenced><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

        where <inline-formula><mml:math id="M2" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M3" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">sim</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are observed and simulated streamflow respectively. This equation can be interpreted as a skill score that quantifies how much of the variance in <inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> the model (through <inline-formula><mml:math id="M5" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">sim</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) explains compared to the reference model, the long-term mean flow, <inline-formula><mml:math id="M6" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula>. Although this specific benchmark, <inline-formula><mml:math id="M7" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula>, is often criticized for the limited constraints it imposes on model performance <xref ref-type="bibr" rid="bib1.bibx46" id="paren.12"><named-content content-type="pre">e.g.</named-content></xref>, it provides a useful example of a simple benchmark. By comparing the performance of a model against a (much) simpler alternative way of predicting the variable of interest, it becomes easier to evaluate if and how much better the hydrologic model is.</p>
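      <p>The skill-score interpretation of Eq. (<xref ref-type="disp-formula" rid="Ch1.E1"/>) can be illustrated with a minimal sketch. The function name <monospace>nse</monospace> and the flow values are hypothetical and chosen purely for illustration. Here the mean is computed from the same period in which the score is evaluated, in which case the long-term mean flow scores exactly zero.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def nse(q_obs, q_sim):
    """Nash-Sutcliffe Efficiency following Eq. (1)."""
    mean_obs = sum(q_obs) / len(q_obs)
    # Numerator: squared error of the simulation
    sse_model = sum((o - s) ** 2 for o, s in zip(q_obs, q_sim))
    # Denominator: squared error of the mean-flow reference model
    sse_mean = sum((o - mean_obs) ** 2 for o in q_obs)
    return 1.0 - sse_model / sse_mean


# Hypothetical observations (arbitrary units), for illustration only
q_obs = [1.0, 3.0, 2.0, 6.0, 3.0]
q_mean_benchmark = [sum(q_obs) / len(q_obs)] * len(q_obs)

nse_perfect = nse(q_obs, q_obs)               # perfect simulation
nse_benchmark = nse(q_obs, q_mean_benchmark)  # long-term mean flow as predictor
```
]]></preformat>
      <p>Any simulation with NSE above zero thus explains more of the observed variance than the mean flow does. When the mean is instead taken from a separate calculation period, the benchmark's score in the evaluation period will generally differ from zero.</p>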
      <p id="d2e379">Benchmarks can take various forms, such as regression equations <xref ref-type="bibr" rid="bib1.bibx3" id="paren.13"><named-content content-type="pre">as used in certain land modeling experiments; e.g.,</named-content></xref>, statistics such as persistence or climatology <xref ref-type="bibr" rid="bib1.bibx39 bib1.bibx22" id="paren.14"><named-content content-type="pre">as common in the streamflow forecasting community; e.g.,</named-content></xref>, or different versions of the same model <xref ref-type="bibr" rid="bib1.bibx11" id="paren.15"><named-content content-type="pre">to see if model changes have the desired effect; e.g.,</named-content></xref>. Benchmarking is also commonly seen when models of varying levels of complexity are compared, particularly in current large-domain modeling exercises that contrast the performance of machine learning methods to more traditional hydrologic models <xref ref-type="bibr" rid="bib1.bibx30 bib1.bibx50" id="paren.16"><named-content content-type="pre">e.g.,</named-content></xref>. The main trade-off between different types of benchmarks is the cost of employing the benchmarks compared to what can be learned from them. For example, the cost of comparing an existing hydrologic model against a second one is often prohibitive because configuration is too cumbersome, or run times are too long, but comparing the performance of any model against a simple baseline has been common practice for as long as the Nash-Sutcliffe Efficiency has been in use. Using simple benchmark models, such as the long-term mean, gives some idea of the predictability of the streamflow observations in each basin at negligible computational cost.
Our hypothesis is that comparing the performance of a model against the performance of an ensemble of simple benchmarks can be an effective way to identify cases where the performance of a large-domain model is not as high as it could be, irrespective of the absolute values of the scores, and thus where opportunities for model improvement may exist.</p>
      <p id="d2e403">However, assessing if a model outperforms a benchmark is not always straightforward. Even when ignoring the fact that observational uncertainty may mean that model simulations are being compared to incorrect data <xref ref-type="bibr" rid="bib1.bibx56 bib1.bibx17" id="paren.17"><named-content content-type="pre">e.g.</named-content></xref>, a confounding issue is that performance scores such as NSE and KGE are inherently conditional on the time period for which they are calculated <xref ref-type="bibr" rid="bib1.bibx33 bib1.bibx43 bib1.bibx31 bib1.bibx7 bib1.bibx24" id="paren.18"/>. Both <xref ref-type="bibr" rid="bib1.bibx6" id="text.19"/> and <xref ref-type="bibr" rid="bib1.bibx37" id="text.20"/> show that, depending on the nature of the streamflow observations, a large fraction of the total model error may be concentrated in a disproportionately small number of time steps. In such cases, choosing a different period over which to calculate the scores might give a very different assessment of the performance of the model. This is commonly referred to as sampling uncertainty. Sampling uncertainty can be considerable <xref ref-type="bibr" rid="bib1.bibx31 bib1.bibx7" id="paren.21"/>, and in many cases the scores obtained by different models have uncertainties greater than the differences between them <xref ref-type="bibr" rid="bib1.bibx7 bib1.bibx28" id="paren.22"/>. This complicates the assessment of differences between models, because the models might be statistically indistinguishable, and extends to benchmarking exercises: whether a model outperforms any given benchmark is subject to sampling uncertainty. However, the extent to which sampling uncertainty plays a role in large-domain model benchmarking is currently unknown.</p>
      <p id="d2e427">There is limited work on using benchmarks to provide assessments of large-domain predictability of hydrologic response <xref ref-type="bibr" rid="bib1.bibx48 bib1.bibx25" id="paren.23"/>, particularly while also considering the effect of sampling uncertainty. In this paper, we address this gap and show that evaluating a large-domain water resources model relative to simple benchmarks reveals regions where the model underperforms compared to simple alternatives, even when standard performance metrics suggest acceptable model skill. In Sect. <xref ref-type="sec" rid="Ch1.S2"/> we introduce the model (Sect. <xref ref-type="sec" rid="Ch1.S2.SS1"/>), data (Sect. <xref ref-type="sec" rid="Ch1.S2.SS2"/>) and performance metric (Sect. <xref ref-type="sec" rid="Ch1.S2.SS3"/>) used in the analysis, and provide a more in-depth discussion of benchmarks (Sect. <xref ref-type="sec" rid="Ch1.S2.SS4"/>) and sampling uncertainty (Sect. <xref ref-type="sec" rid="Ch1.S2.SS5"/>). Results are presented in Sect. <xref ref-type="sec" rid="Ch1.S3"/>, separated into an aggregated assessment of model and benchmark performance (Sect. <xref ref-type="sec" rid="Ch1.S3.SS1"/>), an evaluation of the associated sampling uncertainty (Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/>), and a spatial analysis of the results (Sect. <xref ref-type="sec" rid="Ch1.S3.SS3"/>). We briefly discuss our findings in Sect. <xref ref-type="sec" rid="Ch1.S4"/> and present our conclusions in Sect. <xref ref-type="sec" rid="Ch1.S5"/>.</p>
<sec id="Ch1.S1.SS1">
  <label>1.1</label><title>Note on definitions</title>
      <p id="d2e466">In the remainder of the text, we use the following definitions: <list list-type="bullet"><list-item>
      <p id="d2e471"><italic>Statistics</italic>: summary statistics derived from a time series (e.g., the long-term mean of flow observations, the daily median flow).</p></list-item><list-item>
      <p id="d2e477"><italic>Metrics</italic>: specific equations used to summarize model performance into a single number (e.g., the Root Mean Squared Error, the Nash-Sutcliffe Efficiency).</p></list-item><list-item>
      <p id="d2e483"><italic>Performance scores</italic>: values found for a given metric (e.g., the distribution of KGE values obtained when calibrating a given model for a set of basins).</p></list-item></list></p>
</sec>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Data and Methods</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>National Water Model v3.0 retrospective simulations</title>
      <p id="d2e504">We selected simulations from the National Water Model v3.0 (NWMv3.0) as a practical test case for our work, to investigate our hypothesis that deliberate use of benchmarks can help identify areas for model improvement. The National Water Model is used to generate operational forecasts across the United States, and is primarily designed to produce short-range and medium-range (18 h to 10 d) sub-daily streamflow forecasts. These forecasts are available for approximately 3.4 million river reaches, and complement the forecasts made by the various River Forecast Centers for approximately 3800 locations across the United States. The structure and setup of the NWMv3.0 are similar to those of NWMv2.1 <xref ref-type="bibr" rid="bib1.bibx38" id="paren.24"/> and are described in more detail in <xref ref-type="bibr" rid="bib1.bibx11" id="text.25"/>.</p>
      <p id="d2e513">We use the NWMv3.0 simulations from the NOAA National Water Model CONUS Retrospective Dataset for the period 1 January 1980 to 31 December 2022. Note that not all gauges have records for the entire period, and in some cases the period of analysis was thus shorter than the full length for which simulations are available. In the retrospective simulations, parameters for the NWM are obtained through a combination of calibration (i.e., parameter optimisation) on a subset of 1365 lightly regulated basins across CONUS and regionalization (i.e., parameter transfer) to the wider set of basins where either no streamflow observations are available or streamflow is more strongly impacted by water management <xref ref-type="bibr" rid="bib1.bibx11" id="paren.26"/>. The model was calibrated for the period 1 October 2016 to 30 September 2021 (NOAA, personal communication, 2025). In contrast to the setup used for forecasting, retrospective runs do not include data assimilation.</p>
      <p id="d2e519">For computational efficiency, we aggregated the hourly retrospective simulations to daily average values. This is not uncommon <xref ref-type="bibr" rid="bib1.bibx23 bib1.bibx53" id="paren.27"><named-content content-type="pre">e.g.,</named-content></xref>, though we note the model runs operationally at an hourly timestep and is most commonly used to predict flood peaks in basins with a response time well below 24 h. The model skill in simulating diurnal patterns will thus be neither visible nor assessed in this study. Nevertheless, the goal of this work is to demonstrate the use of benchmarks in model evaluation, and the average daily NWM simulations provide a useful test case to do so.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Forcing data and streamflow observations</title>
      <p id="d2e535">Though NWMv3.0 simulations are available without a need to run the model, we need certain meteorological data for the benchmarks used in this work (benchmarks are explained in Sect. <xref ref-type="sec" rid="Ch1.S2.SS4"/>). The Analysis of Record for Calibration (AORC) is an hourly <inline-formula><mml:math id="M8" display="inline"><mml:mo>∼</mml:mo></mml:math></inline-formula> 800 m-resolution gridded meteorological forcing dataset used as input to NWM retrospective simulations <xref ref-type="bibr" rid="bib1.bibx14 bib1.bibx11" id="paren.28"/>, and thus used as input for the benchmarks in this work. We first aggregated the hourly gridded precipitation and 2 m air temperature to hourly basin averages using the areal mean. Precipitation was then aggregated from hourly to daily by summing the hourly amounts for each day from 1 February 1979 to 1 February 2023. For 2 m air temperature (used by the benchmark code to estimate snowfall and snowmelt), we computed the daily mean. Streamflow observations from 1 January 1980 to 31 December 2022 were collected for approximately 4900 GAGES-II gauges for which streamflow simulations are available in the NWMv3.0 retrospective dataset (i.e., the gauge was active during the simulation period, and the stream reach the gauge is on is represented in the NWM) <xref ref-type="bibr" rid="bib1.bibx54" id="paren.29"/>.</p>
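      <p>The temporal aggregation described above can be sketched as follows. This is a minimal illustration, not the processing code used in this work: the helper <monospace>hourly_to_daily</monospace> and the input values are hypothetical, and the sketch assumes the hourly series starts at the beginning of a day and contains only complete days.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def hourly_to_daily(values, how):
    """Aggregate hourly basin-average values to daily values.

    how="sum" suits precipitation amounts, how="mean" suits air temperature.
    """
    if len(values) % 24 != 0:
        raise ValueError("expected complete days of 24 hourly values")
    days = [values[i:i + 24] for i in range(0, len(values), 24)]
    if how == "sum":
        return [sum(day) for day in days]
    if how == "mean":
        return [sum(day) / 24 for day in days]
    raise ValueError("how must be 'sum' or 'mean'")


# Two hypothetical days of hourly basin averages
precip_hourly = [0.0] * 12 + [1.5] * 12 + [0.5] * 24   # mm per hour
temp_hourly = [2.0] * 24 + [-1.0] * 12 + [3.0] * 12    # degrees Celsius

precip_daily = hourly_to_daily(precip_hourly, "sum")   # mm per day
temp_daily = hourly_to_daily(temp_hourly, "mean")      # degrees Celsius
```
]]></preformat>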
</sec>
<sec id="Ch1.S2.SS3">
  <label>2.3</label><title>Model performance quantification</title>
      <p id="d2e561">The Kling-Gupta Efficiency <xref ref-type="bibr" rid="bib1.bibx20" id="paren.30"><named-content content-type="pre">KGE;</named-content></xref> was used to calibrate the NWMv3.0 on hourly timesteps (NOAA, personal communication, 2025):

                <disp-formula specific-use="gather" content-type="numbered"><mml:math id="M9" display="block"><mml:mtable displaystyle="true"><mml:mlabeledtr id="Ch1.E2"><mml:mtd><mml:mtext>2</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mi mathvariant="normal">KGE</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:msqrt><mml:mrow><mml:mo>(</mml:mo><mml:mi>r</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo>(</mml:mo><mml:mi mathvariant="italic">α</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mo>(</mml:mo><mml:mi mathvariant="italic">β</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:msqrt></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E3"><mml:mtd><mml:mtext>3</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi mathvariant="italic">σ</mml:mi><mml:mi mathvariant="normal">s</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi mathvariant="italic">σ</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo><mml:mspace width="2em" linebreak="nobreak"/><mml:mi mathvariant="italic">β</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">s</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi 
mathvariant="normal">o</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mlabeledtr></mml:mtable></mml:math></disp-formula>

          where <inline-formula><mml:math id="M10" display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> is the Pearson correlation coefficient and subscripts o and s indicate observations and simulations, respectively. To stay as close to the NWM setup as possible, we use the Kling-Gupta Efficiency to quantify model performance in the remainder of this paper (though again note that we perform our analysis at daily time steps whereas the NWM was calibrated at hourly resolution). We repeated our analysis with the Nash-Sutcliffe Efficiency <xref ref-type="bibr" rid="bib1.bibx35" id="paren.31"><named-content content-type="post">presented in the Supplement</named-content></xref> to investigate if our conclusions hold for a different metric.</p>
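      <p>Equations (<xref ref-type="disp-formula" rid="Ch1.E2"/>) and (<xref ref-type="disp-formula" rid="Ch1.E3"/>) translate into a few lines of code. The sketch below is illustrative only: the function names and flow values are hypothetical, and the use of the population standard deviation is an assumption (the ratio of standard deviations is unaffected by this choice). Note that the correlation is undefined for a constant simulated series, a case this sketch does not handle.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import statistics


def pearson_r(obs, sim):
    """Pearson correlation coefficient between two equally long series."""
    mu_o, mu_s = statistics.fmean(obs), statistics.fmean(sim)
    cov = sum((o - mu_o) * (s - mu_s) for o, s in zip(obs, sim))
    var_o = sum((o - mu_o) ** 2 for o in obs)
    var_s = sum((s - mu_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def kge(obs, sim):
    """Kling-Gupta Efficiency following Eqs. (2) and (3)."""
    r = pearson_r(obs, sim)
    alpha = statistics.pstdev(sim) / statistics.pstdev(obs)  # variability ratio
    beta = statistics.fmean(sim) / statistics.fmean(obs)     # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical observations and a simulation that doubles every flow value
q_obs = [1.0, 3.0, 2.0, 6.0, 3.0]
q_sim_doubled = [2.0 * q for q in q_obs]

kge_perfect = kge(q_obs, q_obs)
kge_doubled = kge(q_obs, q_sim_doubled)
```
]]></preformat>
      <p>In the second case the timing is perfect (<italic>r</italic> equals 1) but both the variability and bias ratios equal 2, so the score is one minus the square root of two, or approximately −0.41.</p>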
</sec>
<sec id="Ch1.S2.SS4">
  <label>2.4</label><title>Benchmarks</title>
      <p id="d2e704">Hydrologic models are increasingly compared to benchmarks other than the long-term mean flow <xref ref-type="bibr" rid="bib1.bibx48 bib1.bibx27 bib1.bibx53" id="paren.32"><named-content content-type="pre">e.g.</named-content></xref>, but outside the forecasting community <xref ref-type="bibr" rid="bib1.bibx39" id="paren.33"><named-content content-type="pre">see e.g.</named-content></xref> such work is still somewhat limited. Benchmarks also vary in their strengths and weaknesses, and what constitutes a strong benchmark can change regionally <xref ref-type="bibr" rid="bib1.bibx39 bib1.bibx25" id="paren.34"/>.</p>
<sec id="Ch1.S2.SS4.SSS1">
  <label>2.4.1</label><title>Selection</title>
      <p id="d2e727">We therefore compare the performance of the NWM to the performance of an ensemble of simple benchmark models that cover various levels of complexity. A full list of the 17 different benchmark models used in this work can be found in Table <xref ref-type="table" rid="T1"/>. These benchmarks are effectively an “ensemble of opportunity”: they are conveniently available in the <monospace>HydroBM</monospace> package <xref ref-type="bibr" rid="bib1.bibx25" id="paren.35"/> and serve to illustrate the point made in the remainder of this paper. We note that this benchmark ensemble is neither exhaustive, nor is it meant to be. However, as long as more theory-driven benchmark selection methods are lacking (i.e., selecting a specific benchmark for a specific basin, based on the benchmark's suitability for representing the basin's specific flow regime), ensemble benchmarking methods provide an acceptable alternative.</p>
</sec>
<sec id="Ch1.S2.SS4.SSS2">
  <label>2.4.2</label><title>Description</title>
      <p id="d2e746">The main benefit of a benchmark ensemble is that it enables the simultaneous investigation of multiple aspects of model behavior. Each benchmark represents a simple way of predicting the variable of interest (here: streamflow), and thus sets a certain minimum expectation of how well a specific aspect of catchment behavior can be predicted. This in turn can be seen as a test for the model of interest: if the model underperforms compared to the simple alternative, improvements to the modeling chain may be possible. For example, if a model shows consistent bias during low flows but a simple seasonal cycle benchmark does not, this suggests that the flows themselves are relatively stable between years but that the model is somehow unable to replicate this pattern. The benchmark does not immediately point out the underlying causes of the model's bias, but it does show that model performance is not as high as it can be. As shown in Table <xref ref-type="table" rid="T1"/>, the benchmarks cover three different categories.</p>
      <p id="d2e751">The first category covers simple statistics calculated from the streamflow observations, which are then used as a predictor of streamflow on all time steps. These benchmarks quantify the stability of the flow regime in time by using past observations to provide an estimate of how flows at any given point in the future might look, and thus challenge the model to predict deviations from the catchment's typical streamflow behavior. One example is the long-term mean flow which, if used as a predictor of flow, returns a time series of constant values (see Eq. <xref ref-type="disp-formula" rid="Ch1.E1"/>). A second example is the daily mean flow, which characterizes the typical seasonal cycle of the flow regime. If the flow in any given year is different from the typical seasonal regime, the model should be able to predict these deviations. If it does, its performance will be higher than the benchmark's.</p>
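      <p>As a minimal sketch of the daily mean flow benchmark, the code below averages the observed flow per day of year in a calculation period and reuses those averages as the prediction for another period. It is not the <monospace>HydroBM</monospace> implementation: the function name and data are hypothetical, and the handling of leap days and missing data is left out.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def daily_mean_flow_benchmark(cal_doy, cal_flow, eval_doy):
    """Predict flow from the mean observed flow on each day of year."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doy, q in zip(cal_doy, cal_flow):
        sums[doy] += q
        counts[doy] += 1
    climatology = {doy: sums[doy] / counts[doy] for doy in sums}
    # Every evaluation day receives the climatological value for its day of year
    return [climatology[doy] for doy in eval_doy]


# Two hypothetical "years" of three days each as the calculation period
cal_doy = [1, 2, 3, 1, 2, 3]
cal_flow = [1.0, 4.0, 2.0, 3.0, 6.0, 2.0]
eval_doy = [1, 2, 3]

bm_seasonal = daily_mean_flow_benchmark(cal_doy, cal_flow, eval_doy)
```
]]></preformat>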
      <p id="d2e756">The second category covers benchmarks that attempt to account for the influence of precipitation on streamflow. These benchmarks first calculate the average rainfall-runoff ratio (or ratios, in the case of the monthly benchmarks), and then use this ratio to scale incoming precipitation. This approach assumes that the amount of precipitation influences a catchment's streamflow response, but that the ratio of precipitation-to-streamflow conversion does not change markedly throughout time. These benchmarks thus challenge the model to predict deviations from typical rainfall-runoff ratios, which may occur under prolonged drying or anomalously wet conditions. An example is the benchmark that applies average monthly rainfall-runoff ratios to monthly precipitation totals. Despite its coarse temporal resolution (flows within a month are constant), this benchmark performed well in a previous large-domain application <xref ref-type="bibr" rid="bib1.bibx25" id="paren.36"/>.</p>
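      <p>The simplest member of this category, daily precipitation scaled by a single long-term rainfall-runoff ratio, can be sketched as below. The function name and values are hypothetical, and the sketch assumes that precipitation and streamflow are expressed in the same depth units (e.g., mm per day) so that their ratio is dimensionless.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def rainfall_runoff_ratio_benchmark(cal_precip, cal_flow, eval_precip):
    """Scale precipitation by the long-term rainfall-runoff ratio."""
    # Ratio of total streamflow to total precipitation in the calculation period
    ratio = sum(cal_flow) / sum(cal_precip)
    return ratio, [ratio * p for p in eval_precip]


# Hypothetical calculation-period data (mm per day)
cal_precip = [0.0, 10.0, 5.0, 5.0]
cal_flow = [1.0, 4.0, 3.0, 2.0]
# Hypothetical evaluation-period precipitation (mm per day)
eval_precip = [0.0, 8.0, 2.0]

rr_ratio, bm_scaled = rainfall_runoff_ratio_benchmark(
    cal_precip, cal_flow, eval_precip
)
```
]]></preformat>
      <p>Because the ratio is fixed, this benchmark predicts zero flow on days without precipitation, which is one reason the smoothed variants in the third category exist.</p>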
      <p id="d2e762">The benchmarks in the third and most complex category are still rather simple one- and two-parameter models whose parameters are optimized using a brute-force approach. These benchmarks attempt to capture the main components of catchment behavior (i.e., partitioning, delayed response, attenuation of precipitation inputs) in parsimonious and aggregated ways. This approach challenges the model to see if the addition of further degrees of freedom (i.e., having more parameters) leads to an appreciable increase in predictive performance. The most complex benchmark in this category is the two-parameter Adjusted Smoothed Precipitation Benchmark (ASPB) proposed by <xref ref-type="bibr" rid="bib1.bibx46" id="text.37"/>. This benchmark scales incoming precipitation by the long-term rainfall-runoff ratio to simulate precipitation partitioning, smooths the resulting scaled precipitation with a moving window of calibrated length, and then shifts this smoothed response by a calibrated lag value. This provides a two-parameter approximation of the main components of catchment behavior.</p>
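      <p>The three steps of the ASPB (scale, smooth, shift) can be sketched in a few lines of Python. The function below is an illustration only: its name and arguments are hypothetical, the moving window is assumed to be trailing rather than centered, and the first values after the shift are padded with the first smoothed value. The original formulation may differ on these points.</p>
<preformat preformat-type="code">
```python
def adjusted_smoothed_precipitation(precip, rr_ratio, window, lag):
    """Scale precipitation, smooth it with a trailing mean, then delay it."""
    # Step 1: partitioning, via the long-term rainfall-runoff ratio.
    scaled = [rr_ratio * p for p in precip]
    # Step 2: attenuation, via a trailing moving average of `window` steps
    # (shorter at the start of the record, where fewer values are available).
    smoothed = []
    for t in range(len(scaled)):
        chunk = scaled[max(0, t - window + 1):t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    # Step 3: delay, by shifting the series `lag` steps forward in time.
    # Assumption: the first `lag` values repeat the first smoothed value.
    return smoothed[:1] * lag + smoothed[:len(smoothed) - lag]


print(adjusted_smoothed_precipitation([0.0, 10.0, 0.0, 0.0], 0.5, window=2, lag=1))
```
</preformat>
      <p>With a window of one step and a lag of zero this sketch reduces to scaled precipitation only, and with a window of one step and a non-zero lag it reduces to the one-parameter lagged variant.</p>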
</sec>
<sec id="Ch1.S2.SS4.SSS3">
  <label>2.4.3</label><title>Application</title>
      <p id="d2e776">We configure the benchmark models in the same way as a regular model application would be structured: the benchmarks are defined using data from a dedicated calibration period (though “calculation period” is a more accurate description for most benchmarks, because only BM16 and BM17 require parameter calibration) and then used to predict the streamflow in an independent evaluation period. We used the same 5-year time period to calibrate the benchmarks as was used to calibrate the NWMv3.0: from 1 October 2016 to 30 September 2021. In case the observation data were incomplete, we used either 4 or 3 water years within that same 5-year window instead. The evaluation period is all the data from 1 January 1980 to 31 December 2022 that is not used for calibration. The <monospace>HydroBM</monospace> package also includes a simple degree-day-based snow accumulation and melt routine, which we used with default parameters in snow-dominated basins. Parameters for BM16 and BM17 are integer values, here calibrated with the built-in brute force approach that trials all values within the <monospace>HydroBM</monospace> default ranges and selects the parameter (set) that results in the lowest Mean Squared Error between benchmark simulations and observations.</p>
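      <p>The brute-force calibration described above amounts to an exhaustive grid search. The Python sketch below shows the idea under stated assumptions: the function names, the toy lag model and the parameter range are hypothetical and do not reproduce the <monospace>HydroBM</monospace> defaults.</p>
<preformat preformat-type="code">
```python
from itertools import product


def mean_squared_error(sim, obs):
    """Mean Squared Error between two equally long series."""
    return sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs)


def brute_force_calibrate(simulate, param_ranges, obs):
    """Try every integer parameter combination and keep the lowest-MSE one."""
    best_params, best_mse = None, float("inf")
    for params in product(*param_ranges):
        mse = mean_squared_error(simulate(*params), obs)
        if best_mse > mse:  # ties keep the first combination that was tried
            best_params, best_mse = params, mse
    return best_params, best_mse


def lagged(series, lag):
    """Toy one-parameter model: delay a series by `lag` steps, padding with zeros."""
    return [0.0] * lag + series[:len(series) - lag]


# Toy data in which the observed flow peak trails precipitation by two steps.
precip = [0.0, 4.0, 0.0, 0.0, 2.0, 0.0]
obs = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0]
print(brute_force_calibrate(lambda lag: lagged(precip, lag), [range(0, 5)], obs))
```
</preformat>
      <p>Because the parameters are integers with bounded ranges, the search space is small enough that an exhaustive search is affordable and no optimizer settings need to be chosen.</p>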

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e788">Brief explanation of the benchmarks used in this work, based on descriptions provided in <xref ref-type="bibr" rid="bib1.bibx25" id="text.38"/>.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="5.1cm"/>
     <oasis:colspec colnum="3" colname="col3" align="justify" colwidth="11cm"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">ID</oasis:entry>
         <oasis:entry colname="col2">Name</oasis:entry>
         <oasis:entry colname="col3">Description</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col3"><italic>Derived from flow data only</italic>: these benchmarks attempt to account for stable predictability of the flow regime </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM01</oasis:entry>
         <oasis:entry colname="col2">Mean flow</oasis:entry>
         <oasis:entry colname="col3">Long-term mean; benchmark time series has the same flow value for all time steps.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM02</oasis:entry>
         <oasis:entry colname="col2">Median flow</oasis:entry>
         <oasis:entry colname="col3">Long-term median; benchmark time series has the same flow value for all time steps.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM03</oasis:entry>
         <oasis:entry colname="col2">Annual mean flow</oasis:entry>
         <oasis:entry colname="col3">Mean flow per year; benchmark time series consists of a unique flow value computed for each year, assigned to each time step within the year; cannot be used to predict unseen data because the flow values needed to compute the yearly means are not available.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM04</oasis:entry>
         <oasis:entry colname="col2">Annual median flow</oasis:entry>
         <oasis:entry colname="col3">Median flow per year; benchmark time series consists of a unique flow value computed for each year, assigned to each time step within the year; cannot be used to predict unseen data.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM05</oasis:entry>
         <oasis:entry colname="col2">Monthly mean flow</oasis:entry>
         <oasis:entry colname="col3">Mean flow per month; benchmark time series consists of the long-term mean flow value for each month, assigned to each time step within a given month; rough approximation of typical seasonality of the flow regime.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM06</oasis:entry>
         <oasis:entry colname="col2">Monthly median flow</oasis:entry>
         <oasis:entry colname="col3">Median flow per month; benchmark time series consists of the long-term median flow value for each month, assigned to each time step within a given month; rough approximation of typical seasonality of the flow regime.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM07</oasis:entry>
         <oasis:entry colname="col2">Daily mean flow</oasis:entry>
         <oasis:entry colname="col3">Mean flow per day; benchmark time series consists of the long-term mean flow value for each calendar day; smooth approximation of typical seasonality of the flow regime.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">BM08</oasis:entry>
         <oasis:entry colname="col2">Daily median flow</oasis:entry>
         <oasis:entry colname="col3">Median flow per day; benchmark time series consists of the long-term median flow value for each calendar day; smooth approximation of typical seasonality of the flow regime.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col3"><italic>Derived from rainfall-runoff ratios</italic>: these benchmarks attempt to account for the influence of precipitation on runoff </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM09</oasis:entry>
         <oasis:entry colname="col2">Rainfall-runoff ratio to all</oasis:entry>
         <oasis:entry colname="col3">Scales total (i.e., summed) precipitation over the period of interest by the long-term rainfall-runoff ratio and distributes evenly over time steps (single estimated flow value for all time steps); conceptually similar to BM01.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM10</oasis:entry>
         <oasis:entry colname="col2">Rainfall-runoff ratio to annual</oasis:entry>
         <oasis:entry colname="col3">As BM09, but applies the long-term rainfall-runoff ratio to annual precipitation totals.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM11</oasis:entry>
         <oasis:entry colname="col2">Rainfall-runoff ratio to monthly</oasis:entry>
         <oasis:entry colname="col3">As BM09, but applies the long-term rainfall-runoff ratio to monthly precipitation totals.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM12</oasis:entry>
         <oasis:entry colname="col2">Rainfall-runoff ratio to daily</oasis:entry>
         <oasis:entry colname="col3">As BM09, but applies the long-term rainfall-runoff ratio to daily precipitation totals.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM13</oasis:entry>
         <oasis:entry colname="col2">Monthly rainfall-runoff ratio to monthly</oasis:entry>
         <oasis:entry colname="col3">As BM11, but using mean monthly rainfall-runoff ratios.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">BM14</oasis:entry>
         <oasis:entry colname="col2">Monthly rainfall-runoff ratio to daily</oasis:entry>
         <oasis:entry colname="col3">As BM12, but using mean monthly rainfall-runoff ratios.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col3"><italic>Parsimonious models</italic>: these benchmarks attempt to simulate catchment response to precipitation </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM15</oasis:entry>
         <oasis:entry colname="col2">Scaled precipitation</oasis:entry>
         <oasis:entry colname="col3">Attempts to account for precipitation partitioning into streamflow and undefined sink terms (0 parameters). In our application with daily time steps, identical to BM12.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM16</oasis:entry>
         <oasis:entry colname="col2">Adjusted precipitation</oasis:entry>
         <oasis:entry colname="col3">As BM15, adding a calibrated lag value to shift the estimated time series of flows (1 parameter). Attempts to account for lag in catchment response.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">BM17</oasis:entry>
         <oasis:entry colname="col2">Adjusted smoothed precipitation</oasis:entry>
         <oasis:entry colname="col3">As BM16, smoothed by a moving window average of calibrated length (2 parameters total). Attempts to account for lag and attenuation in catchment response.</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
</sec>
<sec id="Ch1.S2.SS5">
  <label>2.5</label><title>Sampling uncertainty</title>
      <p id="d2e1064">Sampling uncertainty can be quantified with bootstrapping methods as implemented in the <monospace>gumboot</monospace> R package <xref ref-type="bibr" rid="bib1.bibx7 bib1.bibx5" id="paren.39"/>. The <monospace>gumboot</monospace> package works by creating a collection of synthetic hydrographs and calculating the score(s) of interest (such as KGE) from the observations and each synthetic hydrograph. We ran <monospace>gumboot</monospace> with the default settings as given in <xref ref-type="bibr" rid="bib1.bibx7" id="text.40"/>. Briefly, this means that <monospace>gumboot</monospace> creates each synthetic hydrograph by dividing the period of record into water years (using October as the starting month and enforcing a minimum of 100 valid values within each water year) and sampling water years with replacement until the record length is reached. Using water years ensures that each sampled period is hydrologically independent, and the synthetic records are thus plausible hydrographs for the basin. With default settings <monospace>gumboot</monospace> returns 1000 synthetic hydrographs and associated NSE and KGE scores. We then define the sampling uncertainty as the difference between the <inline-formula><mml:math id="M11" display="inline"><mml:mn mathvariant="normal">5</mml:mn></mml:math></inline-formula>th and <inline-formula><mml:math id="M12" display="inline"><mml:mn mathvariant="normal">95</mml:mn></mml:math></inline-formula>th percentile of these scores.</p>
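      <p>To make the resampling idea concrete, the following Python sketch resamples whole water years with replacement and derives the width of the 5th-to-95th percentile range of the resulting KGE scores. It is a simplified illustration and not the <monospace>gumboot</monospace> algorithm: the function names are hypothetical, the check on a minimum number of valid values per water year is omitted, and the toy data are invented for this example.</p>
<preformat preformat-type="code">
```python
import math
import random
from datetime import date, timedelta
from statistics import fmean, pstdev, quantiles


def kge(sim, obs):
    """Kling-Gupta Efficiency; assumes obs has a non-zero mean and spread."""
    mu_s, mu_o = fmean(sim), fmean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = fmean([(s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)])
    r = cov / (sd_s * sd_o) if sd_s > 0 else 0.0
    return 1 - math.sqrt((r - 1) ** 2 + (sd_s / sd_o - 1) ** 2 + (mu_s / mu_o - 1) ** 2)


def water_year(d):
    """October-to-September water year, labelled by the year in which it ends."""
    return d.year + 1 if d.month >= 10 else d.year


def bootstrap_kge(dates, sim, obs, n_samples=1000, seed=0):
    """KGE of synthetic records made by resampling whole water years."""
    blocks = {}
    for d, s, o in zip(dates, sim, obs):
        blocks.setdefault(water_year(d), []).append((s, o))
    years = list(blocks.values())
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Draw as many water years as the record holds, with replacement.
        drawn = rng.choices(years, k=len(years))
        pairs = [pair for year in drawn for pair in year]
        scores.append(kge([p[0] for p in pairs], [p[1] for p in pairs]))
    return scores


def sampling_uncertainty(scores):
    """Width of the 5th-to-95th percentile range of the bootstrapped scores."""
    cuts = quantiles(scores, n=20)  # 19 cut points at 5 %, 10 %, ..., 95 %
    return cuts[-1] - cuts[0]


# Toy record: three water years, with a simulation bias that differs per year.
dates = [date(2016, 10, 1) + timedelta(days=i) for i in range(3 * 365)]
obs = [2.0 + math.sin(2 * math.pi * i / 365) for i in range(len(dates))]
sim = [q * (1.0 + 0.1 * (water_year(d) - 2017)) for d, q in zip(dates, obs)]
scores = bootstrap_kge(dates, sim, obs, n_samples=200)
print(round(sampling_uncertainty(scores), 3))
```
</preformat>
      <p>Sampling whole water years rather than individual days keeps each block internally consistent, so that every synthetic record remains a plausible sequence of flows.</p>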
      <p id="d2e1103">We calculate the sampling uncertainty for each basin, for both the NWM simulations and each of the 17 benchmarks. This allows us to report both KGE scores and their associated uncertainty, and from this derive whether the accuracy of NWM simulations can be considered statistically different from the accuracy of the benchmarks. We report those results as Cumulative Distribution Functions (CDFs) that show the scores and uncertainty across the sample. We also report these results on a per-basin basis for the NWM and the best-performing benchmark. In this case, we use the Jaccard index (also known as the ratio of verification, critical success index, and Tanimoto index) to quantify the relative overlap of both uncertainty intervals. Assuming two uncertainty intervals, <inline-formula><mml:math id="M13" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M14" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, defined as the range from the <inline-formula><mml:math id="M15" display="inline"><mml:mn mathvariant="normal">5</mml:mn></mml:math></inline-formula>th (<inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:msup><mml:mi>I</mml:mi><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">05</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>) to the <inline-formula><mml:math id="M17" display="inline"><mml:mn mathvariant="normal">95</mml:mn></mml:math></inline-formula>th (<inline-formula><mml:math id="M18" display="inline"><mml:mrow><mml:msup><mml:mi>I</mml:mi><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">95</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>) percentile estimates of KGE scores for the NWM 
(<inline-formula><mml:math id="M19" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>) and benchmark (<inline-formula><mml:math id="M20" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>):

            <disp-formula id="Ch1.E4" content-type="numbered"><label>4</label><mml:math id="M21" display="block"><mml:mrow><mml:mi>J</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>∩</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>∪</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mtext>overlap</mml:mtext><mml:mtext>span</mml:mtext></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            <disp-formula id="Ch1.Ex1"><mml:math id="M22" display="block"><mml:mrow><mml:mtext>overlap</mml:mtext><mml:mo>=</mml:mo><mml:mo movablelimits="false">max⁡</mml:mo><mml:mspace width="-0.125em" linebreak="nobreak"/><mml:mo mathvariant="italic" mathsize="1.1em">{</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.33em"/><mml:mo movablelimits="false">min⁡</mml:mo><mml:mo>(</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">95</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">95</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:mo movablelimits="false">max⁡</mml:mo><mml:mo>(</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">05</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">05</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo><mml:mo mathvariant="italic" mathsize="1.1em">}</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            <disp-formula id="Ch1.Ex2"><mml:math id="M23" display="block"><mml:mrow><mml:mtext>span</mml:mtext><mml:mo>=</mml:mo><mml:mo movablelimits="false">max⁡</mml:mo><mml:mo>(</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">95</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">95</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:mo movablelimits="false">min⁡</mml:mo><mml:mo>(</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">05</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mn mathvariant="normal">2</mml:mn><mml:mrow><mml:mi mathvariant="normal">p</mml:mi><mml:mn mathvariant="normal">05</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
      <p id="d2e1419">When overlap <inline-formula><mml:math id="M24" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> span, both sampling uncertainty intervals coincide exactly, and the performance of the NWM can be considered indistinguishable from the performance of the benchmark. When overlap <inline-formula><mml:math id="M25" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0, the uncertainty intervals do not overlap, and the performance of the NWM and benchmark simulations can thus be considered to be clearly different. We then need to further distinguish whether the NWM performance can be considered higher or lower than that of the benchmarks. Here we make the simplifying assumption that the <inline-formula><mml:math id="M26" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile score estimate can be used to determine the relative positions of both uncertainty intervals. If the <inline-formula><mml:math id="M27" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile estimate of NWM performance is higher than the <inline-formula><mml:math id="M28" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile estimate of benchmark performance, we consider the NWM to perform better than the benchmark (and vice versa). How much better (or worse) the NWM performs can then be quantified using Eq. <xref ref-type="disp-formula" rid="Ch1.E4"/>. 
High values of <inline-formula><mml:math id="M29" display="inline"><mml:mi>J</mml:mi></mml:math></inline-formula> indicate a large amount of overlap (with complete overlap at <inline-formula><mml:math id="M30" display="inline"><mml:mrow><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula>) between the two distributions (i.e., smaller distinguishable differences), whereas low values of <inline-formula><mml:math id="M31" display="inline"><mml:mi>J</mml:mi></mml:math></inline-formula> indicate a small amount of overlap and clearer differences between the two distributions (no overlap at <inline-formula><mml:math id="M32" display="inline"><mml:mrow><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula>). A schematic overview of the methodology can be found in Fig. <xref ref-type="fig" rid="F1"/>.</p>
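      <p>The overlap calculation and the median-based ranking can be expressed compactly in Python. The sketch below follows the overlap and span definitions given with the Jaccard index. The function names, the tuple layout and the handling of two identical zero-width intervals are assumptions made for this illustration.</p>
<preformat preformat-type="code">
```python
def jaccard_overlap(i1, i2):
    """Relative overlap of two (p05, p95) intervals: overlap divided by span."""
    overlap = max(0.0, min(i1[1], i2[1]) - max(i1[0], i2[0]))
    span = max(i1[1], i2[1]) - min(i1[0], i2[0])
    # Assumption: two identical zero-width intervals count as full overlap.
    return overlap / span if span > 0 else 1.0


def compare(nwm, benchmark):
    """Rank by the 50th percentile and report the overlap of the intervals.

    Both arguments are (p05, p50, p95) tuples of bootstrapped KGE scores.
    """
    j = jaccard_overlap((nwm[0], nwm[2]), (benchmark[0], benchmark[2]))
    better = "NWM" if nwm[1] > benchmark[1] else "benchmark"
    return better, j


# Toy percentile estimates: partly overlapping intervals, NWM median is higher.
print(compare((0.4, 0.6, 0.8), (0.2, 0.4, 0.6)))
```
</preformat>
      <p>A returned index close to zero means the two intervals are well separated, so the ranking by the 50th percentile can be read with more confidence than when the index is close to one.</p>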

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e1501">Schematic overview of methodology. <bold>(a)</bold> Example selection of water years, showing observations as well as NWM simulations (top) and one of the benchmark simulations (bottom) for an arbitrary gauge (USGS ID 01037380). Water years indicated with alternating grey/white blocks. <bold>(b)</bold> Examples of synthetic hydrographs obtained from sampling water years with replacement. Water years indicated with alternating grey/white blocks. <bold>(c)</bold> Schematic representation of the 1000 KGE samples for the NWM and the benchmark, summarized as boxplots. <bold>(d)</bold> Overview of the terminology and method used to quantify relative overlap of the NWM and benchmark KGE samples.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026-f01.png"/>

        </fig>

</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Results</title>
<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Aggregated performance</title>
      <p id="d2e1538">Figure <xref ref-type="fig" rid="F2"/> shows the KGE scores obtained by the NWM as well as the 17 benchmark models, visualized as Cumulative Distribution Functions (CDFs) for straightforward comparison of performance aggregated across all locations. Results are presented for both the calibration period (up to 5 water years of data used, depending on data availability at each gauge) and the evaluation period (up to 37 water years). Calibration performance quantifies data fitting potential (i.e., how well can a given method – model or benchmark – capture the patterns in the data at all in a given basin?). Evaluation performance shows what sort of predictive power that data fit actually has (i.e., how well can a given method capture the underlying processes in a way that leads to accurate predictions for unseen data?).</p>
      <p id="d2e1543">First, for both the calibration and evaluation period, the NWM (black line) reaches higher KGE scores considerably more often than any of the benchmarks (colored lines). However, NWM performance also shows a tendency to decline quickly at lower KGE values, suggesting that there are locations where NWM performance is not as high as that of some of the benchmarks. For calibration, this suggests that the NWM (14 calibrated parameters in NWMv2.1, <xref ref-type="bibr" rid="bib1.bibx11" id="text.41"/>, assumed to be a similar number for the NWMv3.0 calibrations shown here), as may be expected, has greater flexibility than the benchmarks (0 to 2 parameters) to fit to the specific characteristics of the calibration data. For evaluation, the CDFs of both model and benchmark performances show a tendency towards lower scores. This is commonly seen in any modeling study and typically attributed to some degree of overfitting of the model to specifics of the calibration data, or to a change in conditions between calibration and evaluation periods that the model cannot effectively account for. Some benchmarks (e.g., BM11, Fig. <xref ref-type="fig" rid="F2"/>k) show very limited performance change, suggesting that they capture the aggregated catchment response equally well (or poorly) during both data periods. Other benchmarks (e.g., BM07, Fig. <xref ref-type="fig" rid="F2"/>g) show very large performance changes, suggesting that calibration conditions were not sufficient to let the benchmark accurately capture underlying catchment behavior. Compared to the benchmark ensembles, the NWM does not stand out as having particularly large or small performance changes.</p>
      <p id="d2e1553">Second, three benchmarks of note are BM01 (for performing quite poorly), and benchmarks BM07 and BM17 (for performing rather well). BM01 (the mean flow benchmark; Fig. <xref ref-type="fig" rid="F2"/>a) can be found as a (nearly) vertical line at <inline-formula><mml:math id="M33" display="inline"><mml:mrow><mml:mtext>KGE</mml:mtext><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:msqrt><mml:mrow><mml:mo>(</mml:mo><mml:mn mathvariant="normal">2</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:msqrt><mml:mo>≈</mml:mo><mml:mo>-</mml:mo><mml:mn mathvariant="normal">0.41</mml:mn></mml:mrow></mml:math></inline-formula>. This is the traditional choice of benchmark model, derived from the original formulation of the Nash-Sutcliffe Efficiency, and it is the only benchmark that shows no spatial variability at all during calibration (there is some variability during evaluation, because the mean flow calculated from the calibration data is not always close to the actual mean flow during evaluation). Comparison of this CDF to all others highlights the point made by <xref ref-type="bibr" rid="bib1.bibx46" id="text.42"/>: the mean flow is not an equally hard-to-beat benchmark in all basins, and location-specific benchmarks are needed to set more locally appropriate expectations for models <xref ref-type="bibr" rid="bib1.bibx25" id="paren.43"><named-content content-type="pre">see also</named-content></xref>.</p>
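      <p>The value quoted above for the mean flow benchmark follows directly from the definition of the KGE. As a short worked derivation, assume the common notation in which r is the linear correlation between simulations and observations, alpha is the ratio of simulated to observed standard deviation, and beta is the ratio of simulated to observed mean (these symbols are chosen for this illustration). During calibration the benchmark reproduces the observed mean, so beta equals one. A constant series has zero standard deviation, so alpha equals zero. The correlation with a constant series is undefined and is assumed here to be set to zero, following the usual convention: <disp-formula><tex-math>\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2} = 1 - \sqrt{(0-1)^2 + (0-1)^2 + (1-1)^2} = 1 - \sqrt{2} \approx -0.41</tex-math></disp-formula></p>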
      <p id="d2e1594">BM07 (the daily mean flow benchmark; Fig. <xref ref-type="fig" rid="F2"/>g) is computed by taking the mean flow on each Julian day in the calibration period and appending these values to create a year-long time series, which is then repeated for each year of the full simulation period. While its CDF does not cover scores as high as the NWM CDF, this benchmark equally does not lead to KGE scores that are as low as some of those obtained by the NWM: during calibration, the NWM CDF covers a range of (roughly) [<inline-formula><mml:math id="M34" display="inline"><mml:mrow><mml:mi mathvariant="italic">&lt;</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">5</mml:mn></mml:mrow></mml:math></inline-formula>, 1], whereas the CDF of BM07 covers a more restricted range of (roughly) [0, 1]. For unseen data (evaluation) the BM07 CDF does not stand out compared to the other benchmarks, possibly due to the somewhat limited amount of data (maximum 5 years) used to compute the benchmark.</p>
      <p id="d2e1612">BM17 (the adjusted smoothed precipitation benchmark; Fig. <xref ref-type="fig" rid="F2"/>q) is a simple 2-parameter model that aims to capture three dominant facets of catchment functioning: partitioning of incoming precipitation into streamflow and sink terms, as well as time delay and attenuation of the resulting runoff <xref ref-type="bibr" rid="bib1.bibx46" id="paren.44"/>. Its CDF is quite similar to that of the NWM but more constrained; the KGE values for this benchmark are neither as high nor as low as those obtained by the NWM. However, the benchmark requires calibration of only 2 parameters, suggesting that within this experimental setup relatively high KGE scores are obtainable with limited degrees of freedom.</p>
      <p id="d2e1620">In summary, the differences between the NWM and all benchmarks at the lower performance end of the CDF suggest that there are basins where the NWM performance is hindered in some way that the benchmarks are not. At the same time, the NWM obtains higher performance scores than the benchmarks much more often, suggesting that the NWM is able to simulate a wider range of hydrologic behavior with some degree of accuracy than any individual benchmark can. However, note that the CDFs mask the spatial distribution of performance differences. A direct comparison of NWM and benchmark performance will be presented in Sect. 3.3.</p>

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e1625">Cumulative Distribution Function (CDF) plot of the Kling–Gupta Efficiency (KGE) scores for the NWMv3.0 and 17 simple benchmarks, across the full basin sample. For benchmarks 11, 12, 13 and 14, RRR stands for Rainfall Runoff Ratio. <inline-formula><mml:math id="M35" display="inline"><mml:mi>P</mml:mi></mml:math></inline-formula> (benchmarks 11–16) stands for precipitation.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026-f02.png"/>

        </fig>

</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Sampling uncertainty</title>
      <p id="d2e1649">Figure <xref ref-type="fig" rid="F3"/> shows the sampling uncertainty associated with the benchmarks and NWM simulations using data from the evaluation period. To save space, a number of benchmarks have been omitted: BM01 and BM02 (mean and median flow) as well as BM10 (rainfall-runoff ratio to annual) have, in the majority of cases, limited performance and little can be learned from these; BM03 and BM04 (annual mean and median flow) use annual flow statistics as a predictor and can by definition not be used for unseen data; BM09 (rainfall-runoff ratio applied to all timesteps) is conceptually very similar to BM01 and has been omitted for the same reason.</p>
      <p id="d2e1654">As shown in earlier work <xref ref-type="bibr" rid="bib1.bibx7" id="paren.45"/>, the sampling uncertainty of KGE scores can be substantial. In the case of the NWM (black line with grey uncertainty bounds) there is a broad inverse correlation between the KGE score and associated uncertainty bounds, though considerable scatter is present. This emphasizes the strong need to evaluate models while accounting for sampling uncertainty. In numerous basins, the KGE scores obtained by the NWM are strongly conditional on the idiosyncrasies of the evaluation period, and the same model instantiation might be evaluated quite differently if a different time period were to be used. The benchmarks show varying levels of sampling uncertainty. Some are mostly insensitive to data selection (e.g., BM13, BM14), whereas others are either highly sensitive (e.g., BM12, BM16), mostly robust but occasionally sensitive (e.g., BM06, BM08), or somewhere in between (e.g., BM07, BM17). The CDFs and uncertainty bounds should not be directly compared between the different subplots, but a general idea of the widths of these uncertainty intervals is helpful for understanding the results in the next section.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e1662">Cumulative Distribution Function (CDF) plot of the Kling–Gupta Efficiency (KGE) scores of the evaluation period, across the full basin sample. The NWMv3.0 KGE scores are in black, and the KGE scores for the simple benchmarks in colors. Sampling uncertainty (defined as the difference between the <inline-formula><mml:math id="M36" display="inline"><mml:mn mathvariant="normal">5</mml:mn></mml:math></inline-formula>th and <inline-formula><mml:math id="M37" display="inline"><mml:mn mathvariant="normal">95</mml:mn></mml:math></inline-formula>th percentile KGE estimate) in lighter colors. For benchmarks 11, 12, 13 and 14, RRR stands for Rainfall Runoff Ratio. <inline-formula><mml:math id="M38" display="inline"><mml:mi>P</mml:mi></mml:math></inline-formula> (benchmarks 11–16) stands for precipitation.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026-f03.png"/>

        </fig>

</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Spatial patterns</title>
      <p id="d2e1700">While CDFs of performance scores can be helpful to quickly compare performance differences across the full sample of basins, such approaches do not facilitate a basin-by-basin comparison of differences. Figure <xref ref-type="fig" rid="F4"/>a and d therefore show a spatial overview of model and benchmark performance during the evaluation period, using <monospace>gumboot</monospace>'s estimated <inline-formula><mml:math id="M39" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile KGE score for both. For simplicity, we only assess the evaluation performance of the best benchmark in each basin (in other words, Fig. <xref ref-type="fig" rid="F4"/>d is a composite of different benchmarks selected for having the highest <inline-formula><mml:math id="M40" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile KGE score). Both maps confirm the broad statement suggested by the various CDFs, namely that the NWM spans a wider range of performance scores than the benchmarks. The spatial pattern of performance scores shown for the NWM is comparable with that of other modeling studies across this domain <xref ref-type="bibr" rid="bib1.bibx37 bib1.bibx27 bib1.bibx16" id="paren.46"><named-content content-type="pre">e.g.</named-content></xref>: performance is lowest in the drier central regions, and higher along the wetter west coast, the western mountain regions, and east of the <inline-formula><mml:math id="M41" display="inline"><mml:mn mathvariant="normal">100</mml:mn></mml:math></inline-formula>th meridian. Benchmark performance is in many places lower than what is achieved by the NWM, but higher in the regions where the NWM already performs poorly.</p>
      <p id="d2e1737">Figure <xref ref-type="fig" rid="F4"/>b, c, e and f clarify these performance differences by showing the relative overlap of the sampling uncertainty intervals of the NWM and best benchmark. Overlap is quantified with the Jaccard index (Eq. <xref ref-type="disp-formula" rid="Ch1.E4"/>) and separated into cases where the estimated <inline-formula><mml:math id="M42" display="inline"><mml:mn mathvariant="normal">50</mml:mn></mml:math></inline-formula>th percentile KGE score of the NWM is higher than that of the best benchmark (Fig. <xref ref-type="fig" rid="F4"/>b, c; here the NWM outperforms the benchmarks) and vice versa (Fig. <xref ref-type="fig" rid="F4"/>e, f). These results are separated into basins used for calibration of the NWM parameters (Fig. <xref ref-type="fig" rid="F4"/>b, e), and cases where NWM parameters were regionalized (Fig. <xref ref-type="fig" rid="F4"/>c, f). For both sets of plots, the colored stations are complementary: a station plotted in green in Fig. <xref ref-type="fig" rid="F4"/>b (or Fig. <xref ref-type="fig" rid="F4"/>c) will appear as a yellow dot in Fig. <xref ref-type="fig" rid="F4"/>e (or Fig. <xref ref-type="fig" rid="F4"/>f) and vice versa. Note that no overlap (Jaccard index <inline-formula><mml:math id="M43" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0; dark green and bright red) indicates that the distributions of KGE scores are clearly separate (in other words, the NWM score is either clearly higher or lower than the benchmark score), whereas lighter colors indicate that the performance of the NWM and benchmark are closer together.</p>
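      <p>The overlap calculation can be sketched in Python as follows, assuming that each sampling uncertainty interval is summarized by its 5th and 95th percentile KGE estimates and that the Jaccard index is taken as the length of the intersection of the two intervals divided by the length of their union. This is one illustrative reading and not a reproduction of Eq. (<xref ref-type="disp-formula" rid="Ch1.E4"/>); the function name and the example values are hypothetical.</p>
      <preformat>
```python
def jaccard_overlap(interval_a, interval_b):
    """Jaccard index of two closed intervals given as (lower, upper) tuples.

    Assumption: overlap is measured as intersection length / union length.
    """
    lo_a, hi_a = sorted(interval_a)
    lo_b, hi_b = sorted(interval_b)
    # Length of the shared part; zero when the intervals do not overlap.
    intersection = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    # Length of the union: both lengths minus the part counted twice.
    union = (hi_a - lo_a) + (hi_b - lo_b) - intersection
    if union == 0.0:
        # Both intervals are single points: identical points overlap fully.
        return 1.0 if lo_a == lo_b else 0.0
    return intersection / union


# Hypothetical (5th, 95th) percentile KGE estimates, for illustration only.
nwm_interval = (0.25, 0.75)
benchmark_overlapping = (0.5, 1.0)
benchmark_separate = (-0.5, 0.0)

j_partial = jaccard_overlap(nwm_interval, benchmark_overlapping)
j_none = jaccard_overlap(nwm_interval, benchmark_separate)
print(f"partial overlap: J = {j_partial:.2f}")
print(f"no overlap:      J = {j_none:.2f}")
```
</preformat>
      <p>Under these assumptions, a value of 0 corresponds to fully separated intervals and a value of 1 to identical intervals, with intermediate values indicating partial overlap.</p>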
      <p id="d2e1776">Figure <xref ref-type="fig" rid="F4"/>b shows that in approximately 70 % of calibration basins the NWM outperforms the benchmarks. In approximately 75 % of these basins this is a clear improvement (Jaccard index <inline-formula><mml:math id="M44" display="inline"><mml:mrow><mml:mo>≈</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula>). Basins where the KGE distributions of the NWM and best benchmark partly overlap are mostly found in the central mountainous and drier regions. Figure <xref ref-type="fig" rid="F4"/>e shows the remaining 30 % of calibration basins where the benchmarks outperform the NWM. Here too the overlap between the KGE distributions is mostly low, showing that in approximately 60 % of these basins the benchmarks obtain clearly higher scores than the NWM. Clusters of basins where the benchmarks outperform the NWM are mostly concentrated in the interior west (broadly inland of the western coastal mountain ranges until somewhat east of the <inline-formula><mml:math id="M45" display="inline"><mml:mn mathvariant="normal">100</mml:mn></mml:math></inline-formula>th meridian) and the Appalachian Piedmont, with scattered occurrences elsewhere.</p>
      <p id="d2e1800">These patterns are reinforced in Fig. <xref ref-type="fig" rid="F4"/>c and f, which show the performance of the NWM in basins where its parameters were regionalized (i.e., not calibrated). The NWM outperforms the benchmarks in approximately 50 % of basins, located mainly along the western coast and in the humid eastern part of the US. In contrast, the benchmarks perform better in the interior west and the Appalachians, with the appearance of a new cluster of strong performance in central Florida and an increase in scattered basins. Notably, the benchmarks outperform the NWM in almost half of the regionalization basins, with clear regional patterns. Performance distributions do not overlap in almost three-quarters of both cases (<inline-formula><mml:math id="M46" display="inline"><mml:mrow><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula> in 73.2 % and 72.3 % in Fig. <xref ref-type="fig" rid="F4"/>c and f, respectively), suggesting that sampling uncertainty plays only a limited role in our analysis. Importantly, whereas a glance at Fig. <xref ref-type="fig" rid="F4"/>a may suggest that the NWM can be improved in the drier central and western regions where model performance is lower, the benchmarks suggest that improvements may be possible in much more widespread regions (Fig. <xref ref-type="fig" rid="F4"/>e, f).</p>
      <p id="d2e1824">As shown in the Supplement, these findings generally hold when the Nash–Sutcliffe Efficiency (NSE) is used to quantify model and benchmark performance, but with a few important caveats. First, the benchmarks show a tendency towards lower NSE scores and their CDFs are further away from the NWM CDF (Figs. S1, S2). Second, the NWM outperforms the benchmarks in more basins when NSE is used to quantify model performance (the NWM is better in 79.3 % of calibration basins and in 63.4 % of regionalization basins; Fig. S3). This is somewhat surprising, given that the benchmarks are identical in both cases and the NWM was not calibrated on NSE, and points to a need for further work on robust model evaluation practices. Preliminary analysis suggests that these differences are driven by the different sensitivities of NSE and KGE to the bias, variability and correlation components <xref ref-type="bibr" rid="bib1.bibx20 bib1.bibx26 bib1.bibx31" id="paren.47"><named-content content-type="pre">see e.g.,</named-content></xref>. In at least some basins, the benchmarks perform clearly better on bias and much worse on correlation than the NWM, and because correlation errors are weighted more heavily in NSE, this results in a larger difference in NSE scores than in KGE scores.</p>
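      <p>The different sensitivities of the two metrics can be illustrated with a small constructed example. The Python sketch below computes KGE from its correlation, variability-ratio and bias-ratio components and NSE from the sum of squared errors, and applies both to two synthetic simulations: one that is unbiased but poorly correlated (loosely analogous to a benchmark) and one that is perfectly correlated but biased (loosely analogous to a process-based model). All series are hypothetical and chosen for illustration only; the helper functions are not the implementation used in our analysis.</p>
      <preformat>
```python
from math import sqrt
from statistics import fmean, pstdev


def pearson_r(sim, obs):
    """Linear correlation between simulated and observed series."""
    mu_s, mu_o = fmean(sim), fmean(obs)
    cov = fmean([(s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)])
    return cov / (pstdev(sim) * pstdev(obs))


def kge(sim, obs):
    """Kling-Gupta Efficiency from its three components."""
    r = pearson_r(sim, obs)
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = fmean(sim) / fmean(obs)     # bias ratio
    return 1.0 - sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


def nse(sim, obs):
    """Nash-Sutcliffe Efficiency from squared errors."""
    mu_o = fmean(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    sst = sum((o - mu_o) ** 2 for o in obs)
    return 1.0 - sse / sst


# Hypothetical series, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
# Same values as obs in a different order: no bias, but r = 0.8.
sim_unbiased = [2.0, 1.0, 4.0, 3.0, 5.0]
# Constant offset: perfect correlation, but the mean is 20 % too high.
sim_biased = [o + 0.6 for o in obs]

for label, sim in [("unbiased, r = 0.8", sim_unbiased), ("biased, r = 1.0", sim_biased)]:
    print(f"{label}: KGE = {kge(sim, obs):.2f}, NSE = {nse(sim, obs):.2f}")
```
</preformat>
      <p>In this constructed case both simulations obtain the same KGE (0.80), whereas the NSE is 0.60 for the unbiased but poorly correlated series and 0.82 for the biased but well-correlated series. The size of this contrast depends on the coefficient of variation of the observations, so it should be read as an illustration of the mechanism and not as a general result.</p>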

      <fig id="F4" specific-use="star"><label>Figure 4</label><caption><p id="d2e1833">Overview of spatial patterns in model and benchmark performance during the evaluation period. <bold>(a, d)</bold> Estimated 50th percentile KGE score for NWM and best benchmark respectively. <bold>(b, e)</bold> Jaccard index showing overlap between sampling uncertainty intervals where the 50th percentile KGE score for NWM <inline-formula><mml:math id="M47" display="inline"><mml:mi mathvariant="italic">&gt;</mml:mi></mml:math></inline-formula> benchmark, and NWM <inline-formula><mml:math id="M48" display="inline"><mml:mi mathvariant="italic">&lt;</mml:mi></mml:math></inline-formula> benchmark, respectively, for gauges used for model calibration. <bold>(c, f)</bold> Jaccard index showing overlap between sampling uncertainty intervals where the 50th percentile KGE score for NWM <inline-formula><mml:math id="M49" display="inline"><mml:mi mathvariant="italic">&gt;</mml:mi></mml:math></inline-formula> benchmark, and NWM <inline-formula><mml:math id="M50" display="inline"><mml:mi mathvariant="italic">&lt;</mml:mi></mml:math></inline-formula> benchmark, respectively, for gauges used for model regionalization. Histograms show Jaccard index distributions and specifically call out the number of <inline-formula><mml:math id="M51" display="inline"><mml:mrow><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula> cases, where the estimated metric distributions have no overlap. Borders obtained from <xref ref-type="bibr" rid="bib1.bibx10" id="text.48"/>.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/30/3439/2026/hess-30-3439-2026-f04.jpg"/>

        </fig>

</sec>
</sec>
<sec id="Ch1.S4">
  <label>4</label><title>Discussion</title>
      <p id="d2e1904">We demonstrated how simple benchmarks can be used to assess the performance of large-domain hydrologic models. As our test case, we compared the NWMv3.0 daily-averaged retrospective simulation against the performance of 17 simple benchmark models across approximately 4900 basins in the United States. In basins used for model calibration, the benchmarks outperform the NWM in approximately 30 % of basins. The benchmarks perform better primarily in the interior mountainous and drier plains areas in the west, as well as in the Appalachians. This pattern, with the addition of a cluster of basins in central Florida, appears even clearer in basins where the NWM parameters were regionalized, and the benchmarks outperform the NWM in almost 50 % of the basins. These patterns are different from where KGE scores suggest that the model performs poorly (Fig. <xref ref-type="fig" rid="F4"/>a). Based on KGE scores alone, one might conclude that the model does worst in the drier southwestern and central areas, but when performance is compared against benchmarks, more regions stand out as areas where improvements may be possible.</p>
      <p id="d2e1909">These results are broadly consistent with various evaluations of earlier versions of the NWM. For example, <xref ref-type="bibr" rid="bib1.bibx53" id="text.49"/> find that on daily time steps the NWMv2.1 outperforms a daily mean flow benchmark in 80 % of cases, and that NWM performance is better in natural basins than regulated ones <xref ref-type="bibr" rid="bib1.bibx1" id="paren.50"><named-content content-type="pre">i.e., basins where parameters are regionalized, also shown by</named-content><named-content content-type="post">though at hourly time steps</named-content></xref>. In ecological terms, one of the regions where the benchmarks provide better simulations than the NWMv3.0 broadly coincides with the Mediterranean California, North American Deserts, Temperate Sierras, and Great Plains eco-regions <xref ref-type="bibr" rid="bib1.bibx9" id="paren.51"/>. This aligns with results from <xref ref-type="bibr" rid="bib1.bibx23" id="text.52"/>, who found that the performance of the NWMv2.0 can be improved in drier climates with predominantly low vegetation.</p>
      <p id="d2e1928">While more in-depth study is needed to understand the contributing factors, the nature of the benchmarks lets us speculate about potential improvements to the modeling chain. Three main lines of investigation may be worthwhile, focusing on model inputs, model structure, and model parametrization/regionalization. Large-domain parameter estimation has long been an open challenge, but existing <xref ref-type="bibr" rid="bib1.bibx45" id="paren.53"><named-content content-type="pre">e.g.,</named-content></xref> and promising recent advances <xref ref-type="bibr" rid="bib1.bibx49 bib1.bibx52 bib1.bibx15" id="paren.54"><named-content content-type="pre">e.g.,</named-content></xref> have not yet been implemented in most large-domain modeling chains. Regionalization of model parameters is similarly challenging <xref ref-type="bibr" rid="bib1.bibx34 bib1.bibx40 bib1.bibx58" id="paren.55"><named-content content-type="pre">e.g.,</named-content></xref>. The relative success of the benchmarks during both calibration (effective in 30 % of basins) and regionalization (effective in 50 % of basins) may suggest that improvements to parameter optimization and regionalization are possible.</p>
      <p id="d2e1946">The strong regional patterns in where the benchmarks outperform the models suggest solutions may need to be found more locally as well. For example, in the current NWM setup, parameters are regionalized for regulated basins. The NWM currently accounts for the location of more than 5000 reservoirs but does not include any operating rules for these reservoirs. Instead, data assimilation is used to correct and align model states with observations during forecasting for several hundred of these reservoirs <xref ref-type="bibr" rid="bib1.bibx11" id="paren.56"/>. The relative success of the benchmarks in the regulated basins suggests that some aspects of the resulting regulated streamflow are relatively predictable and that implementing a rudimentary reservoir operations module may be possible. Similarly, the relative success of the benchmarks in drier regions may point to a need to account for dry-region processes such as channel infiltration and transmission losses. Improvements to the representation of shallow aquifer systems <xref ref-type="bibr" rid="bib1.bibx44 bib1.bibx51" id="paren.57"><named-content content-type="pre">e.g., in the Northern Appalachian Mountains and Appalachian Piedmont;</named-content></xref>, low-lying coastal areas and wetlands (e.g., central Florida), snow pack dynamics (e.g., the western mountains), and surface depression storage (e.g., the prairie pothole region in North Dakota, South Dakota, Minnesota and Iowa) might also be needed.</p>
      <p id="d2e1958">However, the relative success of the benchmarks in these regions may also point to potential issues with the forcing data <xref ref-type="bibr" rid="bib1.bibx41" id="paren.58"><named-content content-type="pre">see e.g.</named-content><named-content content-type="post">who identify issues with convective summer precipitation in the NWMv2.1 forcing data over Alabama</named-content></xref>. The benchmarks are only minimally (or not at all) constrained by a need to respect mass and energy balances within the system, and will typically produce relatively unbiased simulations with larger variability and correlation errors (see Figs. S8 and S9). The model instead is bound by a need to partition its precipitation input correctly between storage, streamflow and evaporation, and may thus be more vulnerable to biases in the forcing data <xref ref-type="bibr" rid="bib1.bibx11" id="paren.59"><named-content content-type="pre">compare with</named-content><named-content content-type="post">who show that the NWMv2.1 has considerable bias in its simulations</named-content></xref>. Regions where the benchmarks outperform the model may thus also be locations where biases in the forcing data limit the model's ability to produce accurate streamflow simulations.</p>
      <p id="d2e1975">The type of benchmark may give some hints about the kind of problem the model encounters in a given region. Preliminary analysis (Figs. S4–S7) suggests that there are spatial patterns in the type of benchmark that provides the highest accuracy in each region. Streamflow-based benchmarks (Group 1) dominate in the Rocky Mountains, suggesting that the streamflow regimes here are relatively stable year-to-year. Runoff-ratio benchmarks (Group 2) are often the best benchmark in the drier parts of the western CONUS, suggesting that the partitioning of precipitation into streamflow and other components is relatively predictable in these basins, but modulated by the amount of incoming precipitation. The last group of benchmarks (very simple models) are often the most accurate benchmark in the wetter parts of the western CONUS as well as in the east. However, local analysis and comparison of model simulations against the benchmarks remain needed to understand which components of the simulations are better captured by the benchmarks, and what this means for potential improvements to the modeling chain. Particularly with the recent increase in large-sample studies, where results are predominantly shown as maps of performance scores and associated Cumulative Distribution Functions, there is a risk that the performance scores become a goal in themselves while locally poor model performance goes undetected. Benchmarks provide a convenient way of quickly identifying areas where improvements may be possible and, critically, these are not always the same regions where we find lower model performance scores.</p>
</sec>
<sec id="Ch1.S5" sec-type="conclusions">
  <label>5</label><title>Conclusions</title>
      <p id="d2e1987">We used an ensemble of simple benchmarks to provide context for the performance of a large-domain water model. We also accounted for sampling uncertainty in this work, but results suggest that in most basins the differences in performance between the National Water Model v3.0 and the benchmarks are large enough that this is only a minor concern. However, sampling uncertainty remains important in cases where models perform similarly. The benchmarks suggest that there are considerable constraints on the model's performance in approximately one-third of the basins used for model calibration and in approximately half of the basins where model parameters are regionalized. The areas where the benchmarks outperform the model only partially overlap with areas where the model achieves lower KGE scores, and this suggests that improvements may be possible in more regions than a first glance at model performance values may indicate. In cases where the benchmarks outperform the model, the nature of the benchmarks may suggest which elements of the modeling chain could be improved, but it remains difficult to go beyond listing broad hypotheses. In-depth model evaluation thus remains necessary to identify which aspects of the simulations the benchmarks simulate more accurately than the model does, and what this implies for potential ways to improve the model. A key advantage of using these benchmarks is that they are straightforward and fast to compute, particularly compared to the cost of configuring and running the model. This makes benchmarking a valuable tool that can complement more detailed model evaluation techniques by quickly identifying areas that should be investigated more thoroughly.</p>
</sec>

      
      </body>
    <back><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d2e1994">Streamflow observations were obtained on 31 March 2025 from the United States Geological Survey <xref ref-type="bibr" rid="bib1.bibx54" id="text.60"/> (<uri>https://waterdata.usgs.gov/nwis/dv</uri>, last access: 21 March 2025; DOI: <ext-link xlink:href="https://doi.org/10.5066/F7P55KJN" ext-link-type="DOI">10.5066/F7P55KJN</ext-link>). The NOAA National Water Model CONUS Retrospective Dataset was accessed on 28 May 2024 (AORC forcing) and 31 August 2024 (NWMv3.0 simulations) from <uri>https://registry.opendata.aws/nwm-archive</uri>. The benchmarks were calculated using the Python package <monospace>HydroBM</monospace> <xref ref-type="bibr" rid="bib1.bibx25" id="paren.61"/>, and the sampling uncertainty with the R package <monospace>gumboot</monospace> (<uri>https://cran.r-project.org/package=gumboot</uri> (last access: 3 June 2026), <xref ref-type="bibr" rid="bib1.bibx7 bib1.bibx5" id="altparen.62"/>). Intermediate results (CSV files containing the sampling uncertainty values for the National Water Model as well as the benchmarks) and code to create the figures in this manuscript and the Supplement are available on Zenodo (<ext-link xlink:href="https://doi.org/10.5281/zenodo.18028487" ext-link-type="DOI">10.5281/zenodo.18028487</ext-link>, <xref ref-type="bibr" rid="bib1.bibx18" id="paren.63"/>).</p>
  </notes><app-group>
        <supplementary-material position="anchor"><p id="d2e2031">The supplement related to this article is available online at <inline-supplementary-material xlink:href="https://doi.org/10.5194/hess-30-3439-2026-supplement" xlink:title="pdf">https://doi.org/10.5194/hess-30-3439-2026-supplement</inline-supplementary-material>.</p></supplementary-material>
        </app-group><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e2040">Gaby Gründemann: Conceptualization, Methodology, Software, Data Curation, Writing – Review &amp; Editing, Visualization. Wouter Knoben: Conceptualization, Methodology, Software, Data Curation, Writing – Original Draft, Writing – Review &amp; Editing, Visualization. Yalan Song: Data Curation, Software, Writing – Review &amp; Editing. Katie van Werkhoven: Conceptualization, Data Curation, Writing – Review &amp; Editing. Martyn Clark: Conceptualization, Methodology, Supervision, Writing – Review &amp; Editing, Project administration, Funding acquisition.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e2046">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e2052">The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the opinions of NOAA. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e2061">This research has been supported by the National Oceanic and Atmospheric Administration (grant no. NA22NWS4320003).</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e2067">This paper was edited by Ralf Loritz and reviewed by Tobias Houska and one anonymous referee.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Abdelkader et al.(2023)Abdelkader, Temimi, and Ouarda</label><mixed-citation>Abdelkader, M., Temimi, M., and Ouarda, T. B.: Assessing the National Water Model’s Streamflow Estimates Using a Multi-Decade Retrospective Dataset across the Contiguous United States, Water, 15, 2319, <ext-link xlink:href="https://doi.org/10.3390/w15132319" ext-link-type="DOI">10.3390/w15132319</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Arheimer et al.(2020)Arheimer, Pimentel, Isberg, Crochemore, Andersson, Hasan, and Pineda</label><mixed-citation>Arheimer, B., Pimentel, R., Isberg, K., Crochemore, L., Andersson, J. C. M., Hasan, A., and Pineda, L.: Global catchment modelling using World-Wide HYPE (WWH), open data, and stepwise parameter estimation, Hydrol. Earth Syst. Sci., 24, 535–559, <ext-link xlink:href="https://doi.org/10.5194/hess-24-535-2020" ext-link-type="DOI">10.5194/hess-24-535-2020</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Best et al.(2015)Best, Abramowitz, Johnson, Pitman, Balsamo, Boone, Cuntz, Decharme, Dirmeyer, Dong, Ek, Guo, Haverd, Van Den Hurk, Nearing, Pak, Peters-Lidard, Santanello, Stevens, and Vuichard</label><mixed-citation>Best, M. J., Abramowitz, G., Johnson, H. R., Pitman, A. J., Balsamo, G., Boone, A., Cuntz, M., Decharme, B., Dirmeyer, P. A., Dong, J., Ek, M., Guo, Z., Haverd, V., Van Den Hurk, B. J. J., Nearing, G. S., Pak, B., Peters-Lidard, C., Santanello, J. A., Stevens, L., and Vuichard, N.: The Plumbing of Land Surface Models: Benchmarking Model Performance, J. Hydrometeorol., 16, 1425–1442, <ext-link xlink:href="https://doi.org/10.1175/JHM-D-14-0158.1" ext-link-type="DOI">10.1175/JHM-D-14-0158.1</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Beven(2023)</label><mixed-citation>Beven, K.: Benchmarking hydrological models for an uncertain future, Hydrol. Process., 37, e14882, <ext-link xlink:href="https://doi.org/10.1002/hyp.14882" ext-link-type="DOI">10.1002/hyp.14882</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Clark and Shook(2021)</label><mixed-citation>Clark, M. P. and Shook, K.: gumboot: Bootstrap Analyses of Sampling Uncertainty in Goodness-of-Fit Statistics, R package version 1.0.1, <uri>https://github.com/CH-Earth/gumboot</uri> (last access: 4 September 2024), 2021.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Clark et al.(2008)Clark, Slater, Rupp, Woods, Vrugt, Gupta, Wagener, and Hay</label><mixed-citation>Clark, M. P., Slater, A. G., Rupp, D. E., Woods, R. A., Vrugt, J. A., Gupta, H. V., Wagener, T., and Hay, L. E.: Framework for Understanding Structural Errors (FUSE): A modular framework to diagnose differences between hydrological models, Water Resour. Res., 44, <ext-link xlink:href="https://doi.org/10.1029/2007WR006735" ext-link-type="DOI">10.1029/2007WR006735</ext-link>, 2008.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Clark et al.(2021)Clark, Vogel, Lamontagne, Mizukami, Knoben, Tang, Gharari, Freer, Whitfield, Shook, and Papalexiou</label><mixed-citation>Clark, M. P., Vogel, R. M., Lamontagne, J. R., Mizukami, N., Knoben, W. J. M., Tang, G., Gharari, S., Freer, J. E., Whitfield, P. H., Shook, K. R., and Papalexiou, S. M.: The Abuse of Popular Performance Metrics in Hydrologic Modeling, Water Resour. Res., 57, e2020WR029001, <ext-link xlink:href="https://doi.org/10.1029/2020WR029001" ext-link-type="DOI">10.1029/2020WR029001</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Clark et al.(2026)Clark, Knoben, Spieler, Gründemann, Thébault, Vásquez, Wood, Song, Shen, Carney, and Van Werkhoven</label><mixed-citation>Clark, M. P., Knoben, W. J., Spieler, D., Gründemann, G. J., Thébault, C., Vásquez, N. A., Wood, A. W., Song, Y., Shen, C., Carney, S., and Van Werkhoven, K.: Comment on Williams (2025): “Friends don't let friends use NSE or KGE for hydrologic model accuracy evaluation: A rant with data and suggestions for better practice”, Environ. Modell. Softw., 197, 106869, <ext-link xlink:href="https://doi.org/10.1016/j.envsoft.2026.106869" ext-link-type="DOI">10.1016/j.envsoft.2026.106869</ext-link>, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Commission for Environmental Cooperation(1997)</label><mixed-citation>Commission for Environmental Cooperation: Ecological Regions of North America: Toward a Common Perspective, ISBN 2-922305-18-X, <ext-link xlink:href="http://www.cec.org/files/documents/publications/1701-ecological-regions-north-america-toward-common-perspective-en.pdf">http://www.cec.org/files/documents/publications/1701</ext-link> (last access: 29 January 2024), 1997.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Commission for Environmental Cooperation (CEC)(2022)</label><mixed-citation>Commission for Environmental Cooperation (CEC): North American Environmental Atlas – Political Boundaries, Statistics Canada, United States Census Bureau, Instituto Nacional de Estadística y Geografía (INEGI). Ed. 3.0, Vector digital data [<inline-formula><mml:math id="M52" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>:</mml:mo><mml:mn mathvariant="normal">10</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">000</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">000</mml:mn></mml:mrow></mml:math></inline-formula>], <uri>https://www.cec.org/north-american-environmental-atlas/political-boundaries-2021/</uri> (last access: 20 December 2023), 2022.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Cosgrove et al.(2024)Cosgrove, Gochis, Flowers, Dugger, Ogden, Graziano, Clark, Cabell, Casiday, Cui et al.</label><mixed-citation>Cosgrove, B., Gochis, D., Flowers, T., Dugger, A., Ogden, F., Graziano, T., Clark, E., Cabell, R., Casiday, N., Cui, Z., et al.: NOAA's National Water Model: Advancing operational hydrology through continental-scale modeling, J. Am. Water Resour. As., 60, 247–272, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Döll et al.(2024)Döll, Hasan, Schulze, Gerdener, Börger, Shadkam, Ackermann, Hosseini-Moghari, Müller Schmied, Güntner, and Kusche</label><mixed-citation>Döll, P., Hasan, H. M. M., Schulze, K., Gerdener, H., Börger, L., Shadkam, S., Ackermann, S., Hosseini-Moghari, S.-M., Müller Schmied, H., Güntner, A., and Kusche, J.: Leveraging multi-variable observations to reduce and quantify the output uncertainty of a global hydrological model: evaluation of three ensemble-based approaches for the Mississippi River basin, Hydrol. Earth Syst. Sci., 28, 2259–2295, <ext-link xlink:href="https://doi.org/10.5194/hess-28-2259-2024" ext-link-type="DOI">10.5194/hess-28-2259-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Efstratiadis and Koutsoyiannis(2010)</label><mixed-citation>Efstratiadis, A. and Koutsoyiannis, D.: One decade of multi-objective calibration approaches in hydrological modelling: a review, Hydrolog. Sci. J., 55, 58–78, <ext-link xlink:href="https://doi.org/10.1080/02626660903526292" ext-link-type="DOI">10.1080/02626660903526292</ext-link>, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Fall et al.(2023)Fall, Kitzmiller, Pavlovic, Zhang, Patrick, St. Laurent, Trypaluk, Wu, and Miller</label><mixed-citation>Fall, G., Kitzmiller, D., Pavlovic, S., Zhang, Z., Patrick, N., St. Laurent, M., Trypaluk, C., Wu, W., and Miller, D.: The Office of Water Prediction's Analysis of Record for Calibration, version 1.1: Dataset description and precipitation evaluation, J. Am. Water Resour. As., 59, 1246–1272, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Farahani et al.(2025)Farahani, Wood, Tang, and Mizukami</label><mixed-citation>Farahani, M. A., Wood, A. W., Tang, G., and Mizukami, N.: Calibrating a large-domain land/hydrology process model in the age of AI: the SUMMA CAMELS emulator experiments, Hydrol. Earth Syst. Sci., 29, 4515–4537, <ext-link xlink:href="https://doi.org/10.5194/hess-29-4515-2025" ext-link-type="DOI">10.5194/hess-29-4515-2025</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Gauch et al.(2021)Gauch, Kratzert, Klotz, Nearing, Lin, and Hochreiter</label><mixed-citation>Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., and Hochreiter, S.: Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network, Hydrol. Earth Syst. Sci., 25, 2045–2062, <ext-link xlink:href="https://doi.org/10.5194/hess-25-2045-2021" ext-link-type="DOI">10.5194/hess-25-2045-2021</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Gharari et al.(2024)Gharari, Whitfield, Pietroniro, Freer, Liu, and Clark</label><mixed-citation>Gharari, S., Whitfield, P. H., Pietroniro, A., Freer, J., Liu, H., and Clark, M. P.: Exploring the provenance of information across Canadian hydrometric stations: implications for discharge estimation and uncertainty quantification, Hydrol. Earth Syst. Sci., 28, 4383–4405, <ext-link xlink:href="https://doi.org/10.5194/hess-28-4383-2024" ext-link-type="DOI">10.5194/hess-28-4383-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Gründemann et al.(2025)Gründemann, Knoben, Song, van Werkhoven, and Clark</label><mixed-citation>Gründemann, G., Knoben, W., Song, Y., van Werkhoven, K., and Clark, M.: Data for “Separating Signal from Noise in Large-Domain Hydrologic Model Evaluation: Benchmarking model performance under sampling uncertainty”, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.18028487" ext-link-type="DOI">10.5281/zenodo.18028487</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Gupta et al.(2008)Gupta, Wagener, and Liu</label><mixed-citation>Gupta, H. V., Wagener, T., and Liu, Y.: Reconciling theory with observations: elements of a diagnostic approach to model evaluation, Hydrol. Process., 22, 3802–3813, <ext-link xlink:href="https://doi.org/10.1002/hyp.6989" ext-link-type="DOI">10.1002/hyp.6989</ext-link>, 2008.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Gupta et al.(2009)Gupta, Kling, Yilmaz, and Martinez</label><mixed-citation>Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, <ext-link xlink:href="https://doi.org/10.1016/j.jhydrol.2009.08.003" ext-link-type="DOI">10.1016/j.jhydrol.2009.08.003</ext-link>, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Gupta et al.(2012)Gupta, Clark, Vrugt, Abramowitz, and Ye</label><mixed-citation>Gupta, H. V., Clark, M. P., Vrugt, J. A., Abramowitz, G., and Ye, M.: Towards a comprehensive assessment of model structural adequacy, Water Resour. Res., 48, <ext-link xlink:href="https://doi.org/10.1029/2011WR011044" ext-link-type="DOI">10.1029/2011WR011044</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Harrigan et al.(2023)Harrigan, Zsoter, Cloke, Salamon, and Prudhomme</label><mixed-citation>Harrigan, S., Zsoter, E., Cloke, H., Salamon, P., and Prudhomme, C.: Daily ensemble river discharge reforecasts and real-time forecasts from the operational Global Flood Awareness System, Hydrol. Earth Syst. Sci., 27, 1–19, <ext-link xlink:href="https://doi.org/10.5194/hess-27-1-2023" ext-link-type="DOI">10.5194/hess-27-1-2023</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Johnson et al.(2023)Johnson, Fang, Sankarasubramanian, Rad, Kindl Da Cunha, Jennings, Clarke, Mazrooei, and Yeghiazarian</label><mixed-citation>Johnson, J. M., Fang, S., Sankarasubramanian, A., Rad, A. M., Kindl Da Cunha, L., Jennings, K. S., Clarke, K. C., Mazrooei, A., and Yeghiazarian, L.: Comprehensive Analysis of the NOAA National Water Model: A Call for Heterogeneous Formulations and Diagnostic Model Selection, J. Geophys. Res.-Atmos., 128, e2023JD038534, <ext-link xlink:href="https://doi.org/10.1029/2023JD038534" ext-link-type="DOI">10.1029/2023JD038534</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Klotz et al.(2024)Klotz, Gauch, Kratzert, Nearing, and Zscheischler</label><mixed-citation>Klotz, D., Gauch, M., Kratzert, F., Nearing, G., and Zscheischler, J.: Technical Note: The divide and measure nonconformity – how metrics can mislead when we evaluate on different data partitions, Hydrol. Earth Syst. Sci., 28, 3665–3673, <ext-link xlink:href="https://doi.org/10.5194/hess-28-3665-2024" ext-link-type="DOI">10.5194/hess-28-3665-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Knoben(2024)</label><mixed-citation>Knoben, W. J. M.: Setting expectations for hydrologic model performance with an ensemble of simple benchmarks, Hydrol. Process., 38, e15288, <ext-link xlink:href="https://doi.org/10.1002/hyp.15288" ext-link-type="DOI">10.1002/hyp.15288</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Knoben et al.(2019)Knoben, Freer, and Woods</label><mixed-citation>Knoben, W. J. M., Freer, J. E., and Woods, R. A.: Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores, Hydrol. Earth Syst. Sci., 23, 4323–4331, <ext-link xlink:href="https://doi.org/10.5194/hess-23-4323-2019" ext-link-type="DOI">10.5194/hess-23-4323-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Knoben et al.(2020)Knoben, Freer, Peel, Fowler, and Woods</label><mixed-citation>Knoben, W. J. M., Freer, J. E., Peel, M. C., Fowler, K. J. A., and Woods, R. A.: A Brief Analysis of Conceptual Model Structure Uncertainty Using 36 Models and 559 Catchments, Water Resour. Res., 56, e2019WR025975, <ext-link xlink:href="https://doi.org/10.1029/2019WR025975" ext-link-type="DOI">10.1029/2019WR025975</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Knoben et al.(2025)Knoben, Raman, Gründemann, Kumar, Pietroniro, Shen, Song, Thébault, Van Werkhoven, Wood, and Clark</label><mixed-citation>Knoben, W. J. M., Raman, A., Gründemann, G. J., Kumar, M., Pietroniro, A., Shen, C., Song, Y., Thébault, C., van Werkhoven, K., Wood, A. W., and Clark, M. P.: Technical note: How many models do we need to simulate hydrologic processes across large geographical domains?, Hydrol. Earth Syst. Sci., 29, 2361–2375, <ext-link xlink:href="https://doi.org/10.5194/hess-29-2361-2025" ext-link-type="DOI">10.5194/hess-29-2361-2025</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Kollat et al.(2012)Kollat, Reed, and Wagener</label><mixed-citation>Kollat, J. B., Reed, P. M., and Wagener, T.: When are multiobjective calibration trade-offs in hydrologic models meaningful?, Water Resour. Res., 48, <ext-link xlink:href="https://doi.org/10.1029/2011WR011534" ext-link-type="DOI">10.1029/2011WR011534</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Kratzert et al.(2019)Kratzert, Klotz, Herrnegger, Sampson, Hochreiter, and Nearing</label><mixed-citation>Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S.: Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning, Water Resour. Res., 55, 11344–11354, <ext-link xlink:href="https://doi.org/10.1029/2019WR026065" ext-link-type="DOI">10.1029/2019WR026065</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Lamontagne et al.(2020)Lamontagne, Barber, and Vogel</label><mixed-citation>Lamontagne, J. R., Barber, C. A., and Vogel, R. M.: Improved Estimators of Model Performance Efficiency for Skewed Hydrologic Data, Water Resour. Res., 56, e2020WR027101, <ext-link xlink:href="https://doi.org/10.1029/2020WR027101" ext-link-type="DOI">10.1029/2020WR027101</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>Legates and McCabe(2013)</label><mixed-citation>Legates, D. R. and McCabe, G. J.: A refined index of model performance: A rejoinder, Int. J. Climatol., 33, 1053–1056, <ext-link xlink:href="https://doi.org/10.1002/joc.3487" ext-link-type="DOI">10.1002/joc.3487</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>McCuen et al.(2006)McCuen, Knight, and Cutter</label><mixed-citation>McCuen, R. H., Knight, Z., and Cutter, A. G.: Evaluation of the Nash–Sutcliffe Efficiency Index, J. Hydrol. Eng., 11, 597–602, <ext-link xlink:href="https://doi.org/10.1061/(ASCE)1084-0699(2006)11:6(597)" ext-link-type="DOI">10.1061/(ASCE)1084-0699(2006)11:6(597)</ext-link>, 2006.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Merz and Blöschl(2004)</label><mixed-citation>Merz, R. and Blöschl, G.: Regionalisation of catchment model parameters, J. Hydrol., 287, 95–123, <ext-link xlink:href="https://doi.org/10.1016/j.jhydrol.2003.09.028" ext-link-type="DOI">10.1016/j.jhydrol.2003.09.028</ext-link>, 2004.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Nash and Sutcliffe(1970)</label><mixed-citation>Nash, J. and Sutcliffe, J.: River flow forecasting through conceptual models part I – A discussion of principles, J. Hydrol., 10, 282–290, <ext-link xlink:href="https://doi.org/10.1016/0022-1694(70)90255-6" ext-link-type="DOI">10.1016/0022-1694(70)90255-6</ext-link>, 1970.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>Nearing et al.(2024)Nearing, Cohen, Dube, Gauch, Gilon, Harrigan, Hassidim, Klotz, Kratzert, Metzger, Nevo, Pappenberger, Prudhomme, Shalev, Shenzis, Tekalign, Weitzner, and Matias</label><mixed-citation>Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim, A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F., Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T. Y., Weitzner, D., and Matias, Y.: Global prediction of extreme floods in ungauged watersheds, Nature, 627, 559–563, <ext-link xlink:href="https://doi.org/10.1038/s41586-024-07145-1" ext-link-type="DOI">10.1038/s41586-024-07145-1</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Newman et al.(2015)Newman, Clark, Sampson, Wood, Hay, Bock, Viger, Blodgett, Brekke, Arnold, Hopson, and Duan</label><mixed-citation>Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, <ext-link xlink:href="https://doi.org/10.5194/hess-19-209-2015" ext-link-type="DOI">10.5194/hess-19-209-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>NOAA(2025)</label><mixed-citation>NOAA: The National Water Model, <uri>https://water.noaa.gov/about/nwm</uri>, last access: 3 November 2025.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Pappenberger et al.(2015)Pappenberger, Ramos, Cloke, Wetterhall, Alfieri, Bogner, Mueller, and Salamon</label><mixed-citation>Pappenberger, F., Ramos, M. H., Cloke, H. L., Wetterhall, F., Alfieri, L., Bogner, K., Mueller, A., and Salamon, P.: How do I know if my forecasts are better? Using benchmarks in hydrological ensemble prediction, J. Hydrol., 522, 697–713, <ext-link xlink:href="https://doi.org/10.1016/j.jhydrol.2015.01.024" ext-link-type="DOI">10.1016/j.jhydrol.2015.01.024</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Pool et al.(2021)Pool, Vis, and Seibert</label><mixed-citation>Pool, S., Vis, M., and Seibert, J.: Regionalization for Ungauged Catchments – Lessons Learned From a Comparative Large‐Sample Study, Water Resour. Res., 57, e2021WR030437, <ext-link xlink:href="https://doi.org/10.1029/2021WR030437" ext-link-type="DOI">10.1029/2021WR030437</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Quansah et al.(2025)Quansah, Doria, and Fall</label><mixed-citation>Quansah, J., Doria, R., and Fall, S.: Evaluating the Performance of the National Water Model: A Spatiotemporal Analysis of Streamflow Forecasting, Water, 17, 2950, <ext-link xlink:href="https://doi.org/10.3390/w17202950" ext-link-type="DOI">10.3390/w17202950</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx42"><label>Rakovec et al.(2016)Rakovec, Kumar, Attinger, and Samaniego</label><mixed-citation>Rakovec, O., Kumar, R., Attinger, S., and Samaniego, L.: Improving the realism of hydrologic model functioning through multivariate parameter estimation, Water Resour. Res., 52, 7779–7792, <ext-link xlink:href="https://doi.org/10.1002/2016WR019430" ext-link-type="DOI">10.1002/2016WR019430</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx43"><label>Ritter and Muñoz-Carpena(2013)</label><mixed-citation>Ritter, A. and Muñoz-Carpena, R.: Performance evaluation of hydrological models: Statistical significance for reducing subjectivity in goodness-of-fit assessments, J. Hydrol., 480, 33–45, <ext-link xlink:href="https://doi.org/10.1016/j.jhydrol.2012.12.004" ext-link-type="DOI">10.1016/j.jhydrol.2012.12.004</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx44"><label>Rutledge and Mesko(1996)</label><mixed-citation>Rutledge, A. T. and Mesko, T. O.: Estimated hydrologic characteristics of shallow aquifer systems in the Valley and Ridge, the Blue Ridge, and the Piedmont Physiographic Provinces based on analysis of streamflow recession and base flow, Professional Paper 1422-B, United States Geological Survey, <ext-link xlink:href="https://doi.org/10.3133/pp1422B" ext-link-type="DOI">10.3133/pp1422B</ext-link>, 1996.</mixed-citation></ref>
      <ref id="bib1.bibx45"><label>Samaniego et al.(2010)Samaniego, Kumar, and Attinger</label><mixed-citation>Samaniego, L., Kumar, R., and Attinger, S.: Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale, Water Resour. Res., 46, 1–25, <ext-link xlink:href="https://doi.org/10.1029/2008WR007327" ext-link-type="DOI">10.1029/2008WR007327</ext-link>, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx46"><label>Schaefli and Gupta(2007)</label><mixed-citation>Schaefli, B. and Gupta, H. V.: Do Nash values have value?, Hydrol. Process., 21, 2075–2080, <ext-link xlink:href="https://doi.org/10.1002/hyp.6825" ext-link-type="DOI">10.1002/hyp.6825</ext-link>, 2007.</mixed-citation></ref>
      <ref id="bib1.bibx47"><label>Seibert(2001)</label><mixed-citation>Seibert, J.: On the need for benchmarks in hydrological modelling, Hydrol. Process., 15, 1063–1064, <ext-link xlink:href="https://doi.org/10.1002/hyp.446" ext-link-type="DOI">10.1002/hyp.446</ext-link>, 2001.</mixed-citation></ref>
      <ref id="bib1.bibx48"><label>Seibert et al.(2018)Seibert, Vis, Lewis, and van Meerveld</label><mixed-citation>Seibert, J., Vis, M. J. P., Lewis, E., and van Meerveld, H.: Upper and lower benchmarks in hydrological modelling, Hydrol. Process., 32, 1120–1125, <ext-link xlink:href="https://doi.org/10.1002/hyp.11476" ext-link-type="DOI">10.1002/hyp.11476</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx49"><label>Shen et al.(2023)Shen, Appling, Gentine, Bandai, Gupta, Tartakovsky, Baity-Jesi, Fenicia, Kifer, Li, Liu, Ren, Zheng, Harman, Clark, Farthing, Feng, Kumar, Aboelyazeed, Rahmani, Song, Beck, Bindas, Dwivedi, Fang, Höge, Rackauckas, Mohanty, Roy, Xu, and Lawson</label><mixed-citation>Shen, C., Appling, A. P., Gentine, P., Bandai, T., Gupta, H., Tartakovsky, A., Baity-Jesi, M., Fenicia, F., Kifer, D., Li, L., Liu, X., Ren, W., Zheng, Y., Harman, C. J., Clark, M., Farthing, M., Feng, D., Kumar, P., Aboelyazeed, D., Rahmani, F., Song, Y., Beck, H. E., Bindas, T., Dwivedi, D., Fang, K., Höge, M., Rackauckas, C., Mohanty, B., Roy, T., Xu, C., and Lawson, K.: Differentiable modelling to unify machine learning and physical models for geosciences, Nat. Rev. Earth Environ., 4, 552–567, <ext-link xlink:href="https://doi.org/10.1038/s43017-023-00450-9" ext-link-type="DOI">10.1038/s43017-023-00450-9</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx50"><label>Song et al.(2025)Song, Bindas, Shen, Ji, Knoben, Lonzarich, Clark, Liu, Van Werkhoven, Lamont, Denno, Pan, Yang, Rapp, Kumar, Rahmani, Thébault, Adkins, Halgren, Patel, Patel, Sawadekar, and Lawson</label><mixed-citation>Song, Y., Bindas, T., Shen, C., Ji, H., Knoben, W. J. M., Lonzarich, L., Clark, M. P., Liu, J., Van Werkhoven, K., Lamont, S., Denno, M., Pan, M., Yang, Y., Rapp, J., Kumar, M., Rahmani, F., Thébault, C., Adkins, R., Halgren, J., Patel, T., Patel, A., Sawadekar, K. A., and Lawson, K.: High‐Resolution National‐Scale Water Modeling Is Enhanced by Multiscale Differentiable Physics‐Informed Machine Learning, Water Resour. Res., 61, e2024WR038928, <ext-link xlink:href="https://doi.org/10.1029/2024WR038928" ext-link-type="DOI">10.1029/2024WR038928</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx51"><label>Swain et al.(2004)Swain, Mesko, and Hollyday</label><mixed-citation>Swain, L. A., Mesko, T. O., and Hollyday, E. F.: Summary of the hydrogeology of the Valley and Ridge, Blue Ridge, and Piedmont Physiographic Provinces in the eastern United States, Professional Paper 1422-A, United States Geological Survey, <ext-link xlink:href="https://doi.org/10.3133/pp1422A" ext-link-type="DOI">10.3133/pp1422A</ext-link>, 2004.</mixed-citation></ref>
      <ref id="bib1.bibx52"><label>Tang et al.(2025)Tang, Wood, and Swenson</label><mixed-citation>Tang, G., Wood, A. W., and Swenson, S.: On Using AI‐Based Large‐Sample Emulators for Land/Hydrology Model Calibration and Regionalization, Water Resour. Res., 61, e2024WR039525, <ext-link xlink:href="https://doi.org/10.1029/2024WR039525" ext-link-type="DOI">10.1029/2024WR039525</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx53"><label>Towler et al.(2023)Towler, Foks, Dugger, Dickinson, Essaid, Gochis, Viger, and Zhang</label><mixed-citation>Towler, E., Foks, S. S., Dugger, A. L., Dickinson, J. E., Essaid, H. I., Gochis, D., Viger, R. J., and Zhang, Y.: Benchmarking high-resolution hydrologic model performance of long-term retrospective streamflow simulations in the contiguous United States, Hydrol. Earth Syst. Sci., 27, 1809–1825, <ext-link xlink:href="https://doi.org/10.5194/hess-27-1809-2023" ext-link-type="DOI">10.5194/hess-27-1809-2023</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx54"><label>U.S. Geological Survey(2025)</label><mixed-citation>U.S. Geological Survey: U.S. Geological Survey National Water Information System Database, U.S. Geological Survey [data set], <ext-link xlink:href="https://doi.org/10.5066/F7P55KJN" ext-link-type="DOI">10.5066/F7P55KJN</ext-link>, last access: 21 March 2025.</mixed-citation></ref>
      <ref id="bib1.bibx55"><label>Van Jaarsveld et al.(2025)Van Jaarsveld, Wanders, Sutanudjaja, Hoch, Droppers, Janzing, Van Beek, and Bierkens</label><mixed-citation>van Jaarsveld, B., Wanders, N., Sutanudjaja, E. H., Hoch, J., Droppers, B., Janzing, J., van Beek, R. L. P. H., and Bierkens, M. F. P.: A first attempt to model global hydrology at hyper-resolution, Earth Syst. Dynam., 16, 29–54, <ext-link xlink:href="https://doi.org/10.5194/esd-16-29-2025" ext-link-type="DOI">10.5194/esd-16-29-2025</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx56"><label>Westerberg et al.(2011)Westerberg, Guerrero, Seibert, Beven, and Halldin</label><mixed-citation>Westerberg, I., Guerrero, J., Seibert, J., Beven, K. J., and Halldin, S.: Stage‐discharge uncertainty derived with a non‐stationary rating curve in the Choluteca River, Honduras, Hydrol. Process., 25, 603–613, <ext-link xlink:href="https://doi.org/10.1002/hyp.7848" ext-link-type="DOI">10.1002/hyp.7848</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx57"><label>Williams(2025)</label><mixed-citation>Williams, G. P.: Friends don't let friends use Nash-Sutcliffe Efficiency (NSE) or KGE for hydrologic model accuracy evaluation: A rant with data and suggestions for better practice, Environ. Modell. Softw., 194, 106665, <ext-link xlink:href="https://doi.org/10.1016/j.envsoft.2025.106665" ext-link-type="DOI">10.1016/j.envsoft.2025.106665</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx58"><label>Yang et al.(2023)Yang, Li, Qi, Zhang, Yu, and Xu</label><mixed-citation>Yang, X., Li, F., Qi, W., Zhang, M., Yu, C., and Xu, C.-Y.: Regionalization methods for PUB: a comprehensive review of progress after the PUB decade, Hydrol. Res., 54, 885–900, <ext-link xlink:href="https://doi.org/10.2166/nh.2023.027" ext-link-type="DOI">10.2166/nh.2023.027</ext-link>, 2023.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>Technical note: Benchmarking large-domain model performance under sampling uncertainty</article-title-html>
<abstract-html/>
<ref-html id="bib1.bib1"><label>Abdelkader et al.(2023)Abdelkader, Temimi, and
Ouarda</label><mixed-citation>
      
Abdelkader, M., Temimi, M., and Ouarda, T. B.: Assessing the National Water
Model’s Streamflow Estimates Using a Multi-Decade
Retrospective Dataset across the Contiguous United States, Water,
15, 2319, <a href="https://doi.org/10.3390/w15132319" target="_blank">https://doi.org/10.3390/w15132319</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>Arheimer et al.(2020)Arheimer, Pimentel, Isberg, Crochemore,
Andersson, Hasan, and Pineda</label><mixed-citation>
      
Arheimer, B., Pimentel, R., Isberg, K., Crochemore, L., Andersson, J. C. M., Hasan, A., and Pineda, L.: Global catchment modelling using World-Wide HYPE (WWH), open data, and stepwise parameter estimation, Hydrol. Earth Syst. Sci., 24, 535–559, <a href="https://doi.org/10.5194/hess-24-535-2020" target="_blank">https://doi.org/10.5194/hess-24-535-2020</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>Best et al.(2015)Best, Abramowitz, Johnson, Pitman, Balsamo, Boone,
Cuntz, Decharme, Dirmeyer, Dong, Ek, Guo, Haverd, Van Den Hurk, Nearing, Pak,
Peters-Lidard, Santanello, Stevens, and Vuichard</label><mixed-citation>
      
Best, M. J., Abramowitz, G., Johnson, H. R., Pitman, A. J., Balsamo, G., Boone,
A., Cuntz, M., Decharme, B., Dirmeyer, P. A., Dong, J., Ek, M., Guo, Z.,
Haverd, V., Van Den Hurk, B. J. J., Nearing, G. S., Pak, B., Peters-Lidard,
C., Santanello, J. A., Stevens, L., and Vuichard, N.: The Plumbing of
Land Surface Models: Benchmarking Model Performance, J.
Hydrometeorol., 16, 1425–1442, <a href="https://doi.org/10.1175/JHM-D-14-0158.1" target="_blank">https://doi.org/10.1175/JHM-D-14-0158.1</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>Beven(2023)</label><mixed-citation>
      
Beven, K.: Benchmarking hydrological models for an uncertain future,
Hydrol. Process., 37, e14882,
<a href="https://doi.org/10.1002/hyp.14882" target="_blank">https://doi.org/10.1002/hyp.14882</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>Clark and Shook(2021)</label><mixed-citation>
      
Clark, M. P. and Shook, K.: gumboot: Bootstrap Analyses of Sampling Uncertainty
in Goodness-of-Fit Statistics, R package version 1.0.1, <a href="https://github.com/CH-Earth/gumboot" target="_blank">https://github.com/CH-Earth/gumboot</a>
(last access: 4 September 2024), 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>Clark et al.(2008)Clark, Slater, Rupp, Woods, Vrugt, Gupta, Wagener,
and Hay</label><mixed-citation>
      
Clark, M. P., Slater, A. G., Rupp, D. E., Woods, R. A., Vrugt, J. A., Gupta,
H. V., Wagener, T., and Hay, L. E.: Framework for Understanding Structural
Errors (FUSE): A modular framework to diagnose differences between
hydrological models, Water Resour. Res., 44,
<a href="https://doi.org/10.1029/2007WR006735" target="_blank">https://doi.org/10.1029/2007WR006735</a>, 2008.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>Clark et al.(2021)Clark, Vogel, Lamontagne, Mizukami, Knoben, Tang,
Gharari, Freer, Whitfield, Shook, and Papalexiou</label><mixed-citation>
      
Clark, M. P., Vogel, R. M., Lamontagne, J. R., Mizukami, N., Knoben, W. J. M.,
Tang, G., Gharari, S., Freer, J. E., Whitfield, P. H., Shook, K. R., and
Papalexiou, S. M.: The Abuse of Popular Performance Metrics in Hydrologic
Modeling, Water Resour. Res., 57, e2020WR029001,
<a href="https://doi.org/10.1029/2020WR029001" target="_blank">https://doi.org/10.1029/2020WR029001</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>Clark et al.(2026)Clark, Knoben, Spieler, Gründemann, Thébault,
Vásquez, Wood, Song, Shen, Carney, and Van Werkhoven</label><mixed-citation>
      
Clark, M. P., Knoben, W. J., Spieler, D., Gründemann, G. J., Thébault, C.,
Vásquez, N. A., Wood, A. W., Song, Y., Shen, C., Carney, S., and
Van Werkhoven, K.: Comment on Williams (2025): “Friends don't let
friends use NSE or KGE for hydrologic model accuracy evaluation: A rant
with data and suggestions for better practice”, Environ. Modell.
Softw., 197, 106869, <a href="https://doi.org/10.1016/j.envsoft.2026.106869" target="_blank">https://doi.org/10.1016/j.envsoft.2026.106869</a>, 2026.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>Commission for Environmental
Cooperation(1997)</label><mixed-citation>
      
Commission for Environmental Cooperation: Ecological Regions of North
America: Toward a Common Perspective, ISBN 2-922305-18-X,
<a href="http://www.cec.org/files/documents/publications/1701-ecological-regions-north-america-toward-common-perspective-en.pdf" target="_blank">http://www.cec.org/files/documents/publications/1701</a> (last access: 29 January 2024),
1997.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>Commission for Environmental Cooperation
(CEC)(2022)</label><mixed-citation>
      
Commission for Environmental Cooperation (CEC): North American Environmental Atlas – Political Boundaries, Statistics Canada, United States Census Bureau, Instituto Nacional de Estadística y Geografía (INEGI). Ed. 3.0, Vector digital data [1:10,000,000], <a href="https://www.cec.org/north-american-environmental-atlas/political-boundaries-2021/" target="_blank">https://www.cec.org/north-american-environmental-atlas/political-boundaries-2021/</a> (last access: 20 December 2023),
2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>Cosgrove et al.(2024)Cosgrove, Gochis, Flowers, Dugger, Ogden,
Graziano, Clark, Cabell, Casiday, Cui et al.</label><mixed-citation>
      
Cosgrove, B., Gochis, D., Flowers, T., Dugger, A., Ogden, F., Graziano, T.,
Clark, E., Cabell, R., Casiday, N., Cui, Z., et al.: NOAA's National Water
Model: Advancing operational hydrology through continental-scale modeling,
J. Am. Water Resour. As., 60, 247–272,
2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>Döll et al.(2024)Döll, Hasan, Schulze, Gerdener, Börger, Shadkam,
Ackermann, Hosseini-Moghari, Müller Schmied, Güntner, and
Kusche</label><mixed-citation>
      
Döll, P., Hasan, H. M. M., Schulze, K., Gerdener, H., Börger, L., Shadkam, S., Ackermann, S., Hosseini-Moghari, S.-M., Müller Schmied, H., Güntner, A., and Kusche, J.: Leveraging multi-variable observations to reduce and quantify the output uncertainty of a global hydrological model: evaluation of three ensemble-based approaches for the Mississippi River basin, Hydrol. Earth Syst. Sci., 28, 2259–2295, <a href="https://doi.org/10.5194/hess-28-2259-2024" target="_blank">https://doi.org/10.5194/hess-28-2259-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>Efstratiadis and Koutsoyiannis(2010)</label><mixed-citation>
      
Efstratiadis, A. and Koutsoyiannis, D.: One decade of multi-objective
calibration approaches in hydrological modelling: a review, Hydrolog.
Sci. J., 55, 58–78, <a href="https://doi.org/10.1080/02626660903526292" target="_blank">https://doi.org/10.1080/02626660903526292</a>, 2010.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>Fall et al.(2023)Fall, Kitzmiller, Pavlovic, Zhang, Patrick,
St. Laurent, Trypaluk, Wu, and Miller</label><mixed-citation>
      
Fall, G., Kitzmiller, D., Pavlovic, S., Zhang, Z., Patrick, N., St. Laurent,
M., Trypaluk, C., Wu, W., and Miller, D.: The Office of Water Prediction's
Analysis of Record for Calibration, version 1.1: Dataset description and
precipitation evaluation, J. Am. Water Resour. As., 59, 1246–1272, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>Farahani et al.(2025)Farahani, Wood, Tang, and
Mizukami</label><mixed-citation>
      
Farahani, M. A., Wood, A. W., Tang, G., and Mizukami, N.: Calibrating a large-domain land/hydrology process model in the age of AI: the SUMMA CAMELS emulator experiments, Hydrol. Earth Syst. Sci., 29, 4515–4537, <a href="https://doi.org/10.5194/hess-29-4515-2025" target="_blank">https://doi.org/10.5194/hess-29-4515-2025</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>Gauch et al.(2021)Gauch, Kratzert, Klotz, Nearing, Lin, and
Hochreiter</label><mixed-citation>
      
Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., and Hochreiter, S.: Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network, Hydrol. Earth Syst. Sci., 25, 2045–2062, <a href="https://doi.org/10.5194/hess-25-2045-2021" target="_blank">https://doi.org/10.5194/hess-25-2045-2021</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>Gharari et al.(2024)Gharari, Whitfield, Pietroniro, Freer, Liu, and
Clark</label><mixed-citation>
      
Gharari, S., Whitfield, P. H., Pietroniro, A., Freer, J., Liu, H., and Clark, M. P.: Exploring the provenance of information across Canadian hydrometric stations: implications for discharge estimation and uncertainty quantification, Hydrol. Earth Syst. Sci., 28, 4383–4405, <a href="https://doi.org/10.5194/hess-28-4383-2024" target="_blank">https://doi.org/10.5194/hess-28-4383-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>Gründemann et al.(2025)Gründemann, Knoben, Song, van Werkhoven, and
Clark</label><mixed-citation>
      
Gründemann, G., Knoben, W., Song, Y., van Werkhoven, K., and Clark, M.: Data
for “Separating Signal from Noise in Large-Domain Hydrologic Model
Evaluation: Benchmarking model performance under sampling uncertainty”, Zenodo [data set],
<a href="https://doi.org/10.5281/zenodo.18028487" target="_blank">https://doi.org/10.5281/zenodo.18028487</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>Gupta et al.(2008)Gupta, Wagener, and Liu</label><mixed-citation>
      
Gupta, H. V., Wagener, T., and Liu, Y.: Reconciling theory with observations:
elements of a diagnostic approach to model evaluation, Hydrol.
Process., 22, 3802–3813, <a href="https://doi.org/10.1002/hyp.6989" target="_blank">https://doi.org/10.1002/hyp.6989</a>, 2008.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>Gupta et al.(2009)Gupta, Kling, Yilmaz, and
Martinez</label><mixed-citation>
      
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of
the mean squared error and NSE performance criteria: Implications for
improving hydrological modelling, J. Hydrol., 377, 80–91,
<a href="https://doi.org/10.1016/j.jhydrol.2009.08.003" target="_blank">https://doi.org/10.1016/j.jhydrol.2009.08.003</a>, 2009.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>Gupta et al.(2012)Gupta, Clark, Vrugt, Abramowitz, and
Ye</label><mixed-citation>
      
Gupta, H. V., Clark, M. P., Vrugt, J. A., Abramowitz, G., and Ye, M.: Towards a
comprehensive assessment of model structural adequacy, Water Resour.
Res., 48, <a href="https://doi.org/10.1029/2011WR011044" target="_blank">https://doi.org/10.1029/2011WR011044</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>Harrigan et al.(2023)Harrigan, Zsoter, Cloke, Salamon, and
Prudhomme</label><mixed-citation>
      
Harrigan, S., Zsoter, E., Cloke, H., Salamon, P., and Prudhomme, C.: Daily ensemble river discharge reforecasts and real-time forecasts from the operational Global Flood Awareness System, Hydrol. Earth Syst. Sci., 27, 1–19, <a href="https://doi.org/10.5194/hess-27-1-2023" target="_blank">https://doi.org/10.5194/hess-27-1-2023</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>Johnson et al.(2023)Johnson, Fang, Sankarasubramanian, Rad, Kindl
Da Cunha, Jennings, Clarke, Mazrooei, and
Yeghiazarian</label><mixed-citation>
      
Johnson, J. M., Fang, S., Sankarasubramanian, A., Rad, A. M., Kindl Da Cunha,
L., Jennings, K. S., Clarke, K. C., Mazrooei, A., and Yeghiazarian, L.:
Comprehensive Analysis of the NOAA National Water Model: A Call
for Heterogeneous Formulations and Diagnostic Model Selection,
J. Geophys. Res.-Atmos., 128, e2023JD038534,
<a href="https://doi.org/10.1029/2023JD038534" target="_blank">https://doi.org/10.1029/2023JD038534</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>Klotz et al.(2024)Klotz, Gauch, Kratzert, Nearing, and
Zscheischler</label><mixed-citation>
      
Klotz, D., Gauch, M., Kratzert, F., Nearing, G., and Zscheischler, J.: Technical Note: The divide and measure nonconformity – how metrics can mislead when we evaluate on different data partitions, Hydrol. Earth Syst. Sci., 28, 3665–3673, <a href="https://doi.org/10.5194/hess-28-3665-2024" target="_blank">https://doi.org/10.5194/hess-28-3665-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>Knoben(2024)</label><mixed-citation>
      
Knoben, W. J. M.: Setting expectations for hydrologic model performance with an
ensemble of simple benchmarks, Hydrol. Process., 38, e15288,
<a href="https://doi.org/10.1002/hyp.15288" target="_blank">https://doi.org/10.1002/hyp.15288</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>Knoben et al.(2019)Knoben, Freer, and Woods</label><mixed-citation>
      
Knoben, W. J. M., Freer, J. E., and Woods, R. A.: Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores, Hydrol. Earth Syst. Sci., 23, 4323–4331, <a href="https://doi.org/10.5194/hess-23-4323-2019" target="_blank">https://doi.org/10.5194/hess-23-4323-2019</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>Knoben et al.(2020)Knoben, Freer, Peel, Fowler, and
Woods</label><mixed-citation>
      
Knoben, W. J. M., Freer, J. E., Peel, M. C., Fowler, K. J. A., and Woods,
R. A.: A Brief Analysis of Conceptual Model Structure Uncertainty
Using 36 Models and 559 Catchments, Water Resour. Res., 56,
e2019WR025975, <a href="https://doi.org/10.1029/2019WR025975" target="_blank">https://doi.org/10.1029/2019WR025975</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>Knoben et al.(2025)Knoben, Raman, Gründemann, Kumar, Pietroniro,
Shen, Song, Thébault, Van Werkhoven, Wood, and
Clark</label><mixed-citation>
      
Knoben, W. J. M., Raman, A., Gründemann, G. J., Kumar, M., Pietroniro, A., Shen, C., Song, Y., Thébault, C., van Werkhoven, K., Wood, A. W., and Clark, M. P.: Technical note: How many models do we need to simulate hydrologic processes across large geographical domains?, Hydrol. Earth Syst. Sci., 29, 2361–2375, <a href="https://doi.org/10.5194/hess-29-2361-2025" target="_blank">https://doi.org/10.5194/hess-29-2361-2025</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>Kollat et al.(2012)Kollat, Reed, and Wagener</label><mixed-citation>
      
Kollat, J. B., Reed, P. M., and Wagener, T.: When are multiobjective
calibration trade-offs in hydrologic models meaningful?, Water Resour.
Res., 48, <a href="https://doi.org/10.1029/2011WR011534" target="_blank">https://doi.org/10.1029/2011WR011534</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>Kratzert et al.(2019)Kratzert, Klotz, Herrnegger, Sampson,
Hochreiter, and Nearing</label><mixed-citation>
      
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and
Nearing, G. S.: Toward Improved Predictions in Ungauged Basins:
Exploiting the Power of Machine Learning, Water Resour. Res.,
55, 11344–11354, <a href="https://doi.org/10.1029/2019WR026065" target="_blank">https://doi.org/10.1029/2019WR026065</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>Lamontagne et al.(2020)Lamontagne, Barber, and
Vogel</label><mixed-citation>
      
Lamontagne, J. R., Barber, C. A., and Vogel, R. M.: Improved Estimators of
Model Performance Efficiency for Skewed Hydrologic Data, Water Resour.
Res., 56, e2020WR027101, <a href="https://doi.org/10.1029/2020WR027101" target="_blank">https://doi.org/10.1029/2020WR027101</a>,
2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib32"><label>Legates and McCabe(2013)</label><mixed-citation>
      
Legates, D. R. and McCabe, G. J.: A refined index of model performance: A
rejoinder, Int. J. Climatol., 33, 1053–1056,
<a href="https://doi.org/10.1002/joc.3487" target="_blank">https://doi.org/10.1002/joc.3487</a>, 2013.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib33"><label>McCuen et al.(2006)McCuen, Knight, and
Cutter</label><mixed-citation>
      
McCuen, R. H., Knight, Z., and Cutter, A. G.: Evaluation of the
Nash–Sutcliffe Efficiency Index, J. Hydrol. Eng.,
11, 597–602, <a href="https://doi.org/10.1061/(ASCE)1084-0699(2006)11:6(597)" target="_blank">https://doi.org/10.1061/(ASCE)1084-0699(2006)11:6(597)</a>, 2006.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib34"><label>Merz and Blöschl(2004)</label><mixed-citation>
      
Merz, R. and Blöschl, G.: Regionalisation of catchment model parameters,
J. Hydrol., 287, 95–123, <a href="https://doi.org/10.1016/j.jhydrol.2003.09.028" target="_blank">https://doi.org/10.1016/j.jhydrol.2003.09.028</a>,
2004.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib35"><label>Nash and Sutcliffe(1970)</label><mixed-citation>
      
Nash, J. and Sutcliffe, J.: River flow forecasting through conceptual models
part I – A discussion of principles, J. Hydrol., 10,
282–290, <a href="https://doi.org/10.1016/0022-1694(70)90255-6" target="_blank">https://doi.org/10.1016/0022-1694(70)90255-6</a>, 1970.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib36"><label>Nearing et al.(2024)Nearing, Cohen, Dube, Gauch, Gilon, Harrigan,
Hassidim, Klotz, Kratzert, Metzger, Nevo, Pappenberger, Prudhomme, Shalev,
Shenzis, Tekalign, Weitzner, and Matias</label><mixed-citation>
      
Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim,
A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F.,
Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T. Y., Weitzner, D., and
Matias, Y.: Global prediction of extreme floods in ungauged watersheds,
Nature, 627, 559–563, <a href="https://doi.org/10.1038/s41586-024-07145-1" target="_blank">https://doi.org/10.1038/s41586-024-07145-1</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib37"><label>Newman et al.(2015)Newman, Clark, Sampson, Wood, Hay, Bock, Viger,
Blodgett, Brekke, Arnold, Hopson, and Duan</label><mixed-citation>
      
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, <a href="https://doi.org/10.5194/hess-19-209-2015" target="_blank">https://doi.org/10.5194/hess-19-209-2015</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib38"><label>NOAA(2025)</label><mixed-citation>
      
NOAA: The National Water Model,
<a href="https://water.noaa.gov/about/nwm" target="_blank">https://water.noaa.gov/about/nwm</a>, last access: 3 November 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib39"><label>Pappenberger et al.(2015)Pappenberger, Ramos, Cloke, Wetterhall,
Alfieri, Bogner, Mueller, and Salamon</label><mixed-citation>
      
Pappenberger, F., Ramos, M. H., Cloke, H. L., Wetterhall, F., Alfieri, L.,
Bogner, K., Mueller, A., and Salamon, P.: How do I know if my forecasts are
better? Using benchmarks in hydrological ensemble prediction, J.
Hydrol., 522, 697–713, <a href="https://doi.org/10.1016/j.jhydrol.2015.01.024" target="_blank">https://doi.org/10.1016/j.jhydrol.2015.01.024</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib40"><label>Pool et al.(2021)Pool, Vis, and Seibert</label><mixed-citation>
      
Pool, S., Vis, M., and Seibert, J.: Regionalization for Ungauged Catchments
– Lessons Learned From a Comparative Large‐Sample Study,
Water Resour. Res., 57, e2021WR030437, <a href="https://doi.org/10.1029/2021WR030437" target="_blank">https://doi.org/10.1029/2021WR030437</a>,
2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib41"><label>Quansah et al.(2025)Quansah, Doria, and
Fall</label><mixed-citation>
      
Quansah, J., Doria, R., and Fall, S.: Evaluating the Performance of the
National Water Model: A Spatiotemporal Analysis of Streamflow
Forecasting, Water, 17, 2950, <a href="https://doi.org/10.3390/w17202950" target="_blank">https://doi.org/10.3390/w17202950</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib42"><label>Rakovec et al.(2016)Rakovec, Kumar, Attinger, and
Samaniego</label><mixed-citation>
      
Rakovec, O., Kumar, R., Attinger, S., and Samaniego, L.: Improving the realism
of hydrologic model functioning through multivariate parameter estimation,
Water Resour. Res., 52, 7779–7792, <a href="https://doi.org/10.1002/2016WR019430" target="_blank">https://doi.org/10.1002/2016WR019430</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib43"><label>Ritter and Muñoz-Carpena(2013)</label><mixed-citation>
      
Ritter, A. and Muñoz-Carpena, R.: Performance evaluation of hydrological
models: Statistical significance for reducing subjectivity in
goodness-of-fit assessments, J. Hydrol., 480, 33–45,
<a href="https://doi.org/10.1016/j.jhydrol.2012.12.004" target="_blank">https://doi.org/10.1016/j.jhydrol.2012.12.004</a>, 2013.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib44"><label>Rutledge and Mesko(1996)</label><mixed-citation>
      
Rutledge, A. T. and Mesko, T. O.: Estimated hydrologic characteristics of
shallow aquifer systems in the Valley and Ridge, the Blue Ridge, and
the Piedmont Physiographic Provinces based on analysis of streamflow
recession and base flow, Professional Paper 1422-B, United States
Geological Survey, <a href="https://doi.org/10.3133/pp1422B" target="_blank">https://doi.org/10.3133/pp1422B</a>, 1996.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib45"><label>Samaniego et al.(2010)Samaniego, Kumar, and
Attinger</label><mixed-citation>
      
Samaniego, L., Kumar, R., and Attinger, S.: Multiscale parameter
regionalization of a grid-based hydrologic model at the mesoscale, Water
Resour. Res., 46, 1–25, <a href="https://doi.org/10.1029/2008WR007327" target="_blank">https://doi.org/10.1029/2008WR007327</a>, 2010.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib46"><label>Schaefli and Gupta(2007)</label><mixed-citation>
      
Schaefli, B. and Gupta, H. V.: Do Nash values have value?, Hydrol.
Process., 21, 2075–2080, <a href="https://doi.org/10.1002/hyp.6825" target="_blank">https://doi.org/10.1002/hyp.6825</a>, 2007.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib47"><label>Seibert(2001)</label><mixed-citation>
      
Seibert, J.: On the need for benchmarks in hydrological modelling, Hydrol.
Process., 15, 1063–1064, <a href="https://doi.org/10.1002/hyp.446" target="_blank">https://doi.org/10.1002/hyp.446</a>, 2001.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib48"><label>Seibert et al.(2018)Seibert, Vis, Lewis, and van
Meerveld</label><mixed-citation>
      
Seibert, J., Vis, M. J. P., Lewis, E., and van Meerveld, H.: Upper and lower
benchmarks in hydrological modelling, Hydrol. Process., 32, 1120–1125,
<a href="https://doi.org/10.1002/hyp.11476" target="_blank">https://doi.org/10.1002/hyp.11476</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib49"><label>Shen et al.(2023)Shen, Appling, Gentine, Bandai, Gupta, Tartakovsky,
Baity-Jesi, Fenicia, Kifer, Li, Liu, Ren, Zheng, Harman, Clark, Farthing,
Feng, Kumar, Aboelyazeed, Rahmani, Song, Beck, Bindas, Dwivedi, Fang, Höge,
Rackauckas, Mohanty, Roy, Xu, and Lawson</label><mixed-citation>
      
Shen, C., Appling, A. P., Gentine, P., Bandai, T., Gupta, H., Tartakovsky, A.,
Baity-Jesi, M., Fenicia, F., Kifer, D., Li, L., Liu, X., Ren, W., Zheng, Y.,
Harman, C. J., Clark, M., Farthing, M., Feng, D., Kumar, P., Aboelyazeed, D.,
Rahmani, F., Song, Y., Beck, H. E., Bindas, T., Dwivedi, D., Fang, K., Höge,
M., Rackauckas, C., Mohanty, B., Roy, T., Xu, C., and Lawson, K.:
Differentiable modelling to unify machine learning and physical models for
geosciences, Nat. Rev. Earth Environ., 4, 552–567,
<a href="https://doi.org/10.1038/s43017-023-00450-9" target="_blank">https://doi.org/10.1038/s43017-023-00450-9</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib50"><label>Song et al.(2025)Song, Bindas, Shen, Ji, Knoben, Lonzarich, Clark,
Liu, Van Werkhoven, Lamont, Denno, Pan, Yang, Rapp, Kumar, Rahmani,
Thébault, Adkins, Halgren, Patel, Patel, Sawadekar, and
Lawson</label><mixed-citation>
      
Song, Y., Bindas, T., Shen, C., Ji, H., Knoben, W. J. M., Lonzarich, L., Clark,
M. P., Liu, J., Van Werkhoven, K., Lamont, S., Denno, M., Pan, M., Yang, Y.,
Rapp, J., Kumar, M., Rahmani, F., Thébault, C., Adkins, R., Halgren, J.,
Patel, T., Patel, A., Sawadekar, K. A., and Lawson, K.: High‐Resolution
National‐Scale Water Modeling Is Enhanced by Multiscale
Differentiable Physics‐Informed Machine Learning, Water Resour.
Res., 61, e2024WR038928, <a href="https://doi.org/10.1029/2024WR038928" target="_blank">https://doi.org/10.1029/2024WR038928</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib51"><label>Swain et al.(2004)Swain, Mesko, and Hollyday</label><mixed-citation>
      
Swain, L. A., Mesko, T. O., and Hollyday, E. F.: Summary of the hydrogeology of
the Valley and Ridge, Blue Ridge, and Piedmont Physiographic
Provinces in the eastern United States, Professional Paper 1422-A,
United States Geological Survey, <a href="https://doi.org/10.3133/pp1422A" target="_blank">https://doi.org/10.3133/pp1422A</a>, 2004.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib52"><label>Tang et al.(2025)Tang, Wood, and Swenson</label><mixed-citation>
      
Tang, G., Wood, A. W., and Swenson, S.: On Using AI‐Based
Large‐Sample Emulators for Land/Hydrology Model Calibration
and Regionalization, Water Resour. Res., 61, e2024WR039525,
<a href="https://doi.org/10.1029/2024WR039525" target="_blank">https://doi.org/10.1029/2024WR039525</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib53"><label>Towler et al.(2023)Towler, Foks, Dugger, Dickinson, Essaid, Gochis,
Viger, and Zhang</label><mixed-citation>
      
Towler, E., Foks, S. S., Dugger, A. L., Dickinson, J. E., Essaid, H. I., Gochis, D., Viger, R. J., and Zhang, Y.: Benchmarking high-resolution hydrologic model performance of long-term retrospective streamflow simulations in the contiguous United States, Hydrol. Earth Syst. Sci., 27, 1809–1825, <a href="https://doi.org/10.5194/hess-27-1809-2023" target="_blank">https://doi.org/10.5194/hess-27-1809-2023</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib54"><label>U.S. Geological Survey(2025)</label><mixed-citation>
      
U.S. Geological Survey: U.S. Geological Survey National Water Information
System Database, U.S. Geological Survey [data set],
<a href="https://doi.org/10.5066/F7P55KJN" target="_blank">https://doi.org/10.5066/F7P55KJN</a>, last access: 21 March 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib55"><label>Van Jaarsveld et al.(2025)Van Jaarsveld, Wanders, Sutanudjaja, Hoch,
Droppers, Janzing, Van Beek, and Bierkens</label><mixed-citation>
      
van Jaarsveld, B., Wanders, N., Sutanudjaja, E. H., Hoch, J., Droppers, B., Janzing, J., van Beek, R. L. P. H., and Bierkens, M. F. P.: A first attempt to model global hydrology at hyper-resolution, Earth Syst. Dynam., 16, 29–54, <a href="https://doi.org/10.5194/esd-16-29-2025" target="_blank">https://doi.org/10.5194/esd-16-29-2025</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib56"><label>Westerberg et al.(2011)Westerberg, Guerrero, Seibert, Beven, and
Halldin</label><mixed-citation>
      
Westerberg, I., Guerrero, J., Seibert, J., Beven, K. J., and Halldin, S.:
Stage‐discharge uncertainty derived with a non‐stationary rating curve in
the Choluteca River, Honduras, Hydrol. Process., 25, 603–613,
<a href="https://doi.org/10.1002/hyp.7848" target="_blank">https://doi.org/10.1002/hyp.7848</a>, 2011.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib57"><label>Williams(2025)</label><mixed-citation>
      
Williams, G. P.: Friends don't let friends use Nash-Sutcliffe Efficiency
(NSE) or KGE for hydrologic model accuracy evaluation: A rant with data
and suggestions for better practice, Environ. Modell. Softw.,
194, 106665, <a href="https://doi.org/10.1016/j.envsoft.2025.106665" target="_blank">https://doi.org/10.1016/j.envsoft.2025.106665</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib58"><label>Yang et al.(2023)Yang, Li, Qi, Zhang, Yu, and
Xu</label><mixed-citation>
      
Yang, X., Li, F., Qi, W., Zhang, M., Yu, C., and Xu, C.-Y.: Regionalization
methods for PUB: a comprehensive review of progress after the PUB decade,
Hydrol. Res., 54, 885–900, <a href="https://doi.org/10.2166/nh.2023.027" target="_blank">https://doi.org/10.2166/nh.2023.027</a>, 2023.

    </mixed-citation></ref-html>--></article>
