Articles | Volume 30, issue 11
https://doi.org/10.5194/hess-30-3439-2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Technical note: Benchmarking large-domain model performance under sampling uncertainty
Gaby J. Gründemann
Schulich School of Engineering, University of Calgary, Alberta, Canada
Wouter J. M. Knoben
CORRESPONDING AUTHOR
Schulich School of Engineering, University of Calgary, Alberta, Canada
Yalan Song
Civil and Environmental Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America
Katie van Werkhoven
Research Triangle Institute, Research Triangle Park, North Carolina, United States of America
Martyn P. Clark
Schulich School of Engineering, University of Calgary, Alberta, Canada
Related authors
Cyril Thébault, Wouter J. M. Knoben, Nans Addor, Andrew J. Newman, Diana Spieler, Nicolás A. Vásquez, Yalan Song, Gaby J. Gründemann, Shaun Carney, Mukesh Kumar, Katie van Werkhoven, Chaopeng Shen, Andrew W. Wood, and Martyn P. Clark
EGUsphere, https://doi.org/10.5194/egusphere-2025-6083, 2026
Short summary
Reliable river flow predictions guide water supply planning and flood protection. We tested whether selecting or combining many models improves accuracy compared with a single model. We used 78 models and tested them in 559 river basins across the United States. A carefully chosen single model nearly matched more complex multi-model approaches, while combining models gave slightly higher accuracy and lower uncertainty. However, no approach worked best everywhere.
Wouter J. M. Knoben, Ashwin Raman, Gaby J. Gründemann, Mukesh Kumar, Alain Pietroniro, Chaopeng Shen, Yalan Song, Cyril Thébault, Katie van Werkhoven, Andrew W. Wood, and Martyn P. Clark
Hydrol. Earth Syst. Sci., 29, 2361–2375, https://doi.org/10.5194/hess-29-2361-2025, 2025
Short summary
Hydrologic models are needed to provide simulations of water availability, floods, and droughts. The accuracy of these simulations is often quantified with so-called performance scores. A common thought is that different models are more or less applicable to different landscapes, depending on how the model works. We show that performance scores are not helpful in distinguishing between different models and thus cannot easily be used to select an appropriate model for a specific place.
Sacha W. Ruzzante, Wouter J. M. Knoben, Thorsten Wagener, Tom Gleeson, and Markus Schnorbus
Hydrol. Earth Syst. Sci., 30, 2337–2355, https://doi.org/10.5194/hess-30-2337-2026, 2026
Short summary
Common metrics used to evaluate hydrologic models make it relatively easy to achieve high performance scores in highly seasonal catchments. However, we analysed 18 hydrologic models and found that almost all were worse at simulating interannual variability and change in seasonal streamflow regimes. This suggests that climate change impacts on streamflow may not be accurately predicted in highly seasonal tropical, alpine, and polar regions, which are highly vulnerable to climate change.
Nicolás A. Vásquez, Pablo A. Mendoza, Wouter Knoben, Martyn Clark, Tricia Stadnyk, and Naoki Mizukami
EGUsphere, https://doi.org/10.5194/egusphere-2026-1363, 2026
Short summary
Although distributed hydrological models are often calibrated using only streamflow data, this practice may provide unrealistic representations of the water cycle. We show that, while streamflow annual cycles can be reasonably simulated, the seasonality of other key variables - such as evapotranspiration, soil moisture, and snow cover - may be severely misrepresented. Our results highlight the need to assess seasonal patterns of variables beyond streamflow when calibrating hydrological models.
Delanie Williams, Mukesh Kumar, Katie van Werkhoven, Martyn Clark, Christopher Wilson, and Paul Miller
EGUsphere, https://doi.org/10.5194/egusphere-2026-583, 2026
Short summary
Machine learning (ML) models have the potential to be used in place of traditional models in water resource management. However, many agencies and institutions are hesitant to pursue ML models, owing to considerations of familiarity, reliability, interpretability, resources, and existing workflows. Thus, rather than opting for complete retention of traditional models or complete replacement by ML models, this perspective argues for the selective integration of ML within existing models.
David R. Casson, Guoqiang Tang, Nicolás Vásquez, Andrew W. Wood, and Martyn P. Clark
EGUsphere, https://doi.org/10.5194/egusphere-2025-6066, 2026
Short summary
This study generates meteorological ensembles tailored for mountain snow estimation, accounting for topography, wind-driven undercatch, and large-scale atmospheric patterns. Driving a physics-based snow model with these ensembles allowed uncertainty in weather inputs to carry through to estimates of snow accumulation and melt. Across three mountain basins, this led to realistic and reliable snow estimates useful for data assimilation, water supply and forecasting applications.
Peijun Li, Yalan Song, Ming Pan, Kathryn Lawson, and Chaopeng Shen
Hydrol. Earth Syst. Sci., 29, 6829–6861, https://doi.org/10.5194/hess-29-6829-2025, 2025
Short summary
This study explores how combining different model types improves streamflow predictions, especially in data-sparse scenarios. By integrating two highly accurate models with distinct mechanisms and leveraging multiple meteorological datasets, we highlight their unique strengths and set new accuracy benchmarks across spatiotemporal conditions. Our findings enhance the understanding of how diverse models and multi-source data can be effectively used to improve hydrological predictions.
Jiangtao Liu, Chaopeng Shen, Fearghal O'Donncha, Yalan Song, Wei Zhi, Hylke E. Beck, Tadd Bindas, Nicholas Kraabel, and Kathryn Lawson
Hydrol. Earth Syst. Sci., 29, 6811–6828, https://doi.org/10.5194/hess-29-6811-2025, 2025
Short summary
Using global and regional datasets, we compared attention-based models and Long Short-Term Memory (LSTM) models to predict hydrologic variables. Our results show LSTM models perform better in simpler tasks, whereas attention-based models perform better in complex scenarios, offering insights for improved water resource management.
Peter Wagener, Wouter J. M. Knoben, Niels Schütze, and Diana Spieler
EGUsphere, https://doi.org/10.5194/egusphere-2025-5413, 2025
Short summary
Hydrologic models help predict floods and droughts, but how we calibrate them changes what they get right. By testing eight objective functions across many model types and catchments, we found that each highlights different flow behaviours, such as floods, low flows, or water balance. No single approach is best for all flow conditions. Matching the calibration method to the study's purpose, or combining several methods, can make models more applicable to real-world water decisions.
Wouter J. M. Knoben, Cyril Thébault, Kasra Keshavarz, Laura Torres-Rojas, Nathaniel W. Chaney, Alain Pietroniro, and Martyn P. Clark
Hydrol. Earth Syst. Sci., 29, 5791–5833, https://doi.org/10.5194/hess-29-5791-2025, 2025
Short summary
Many existing datasets for hydrologic analysis tend to treat catchments as single spatially homogeneous units focusing on daily data and typically do not support more complex models. This paper introduces a dataset that goes beyond this set-up by (1) providing data at a higher spatial and temporal resolution, (2) specifically considering the data requirements of all common hydrologic model types, and (3) using statistical summaries of the data aimed at quantifying spatial and temporal heterogeneity.
Yuan Yang, Ming Pan, Dapeng Feng, Mu Xiao, Taylor Dixon, Robert Hartman, Chaopeng Shen, Yalan Song, Agniv Sengupta, Luca Delle Monache, and F. Martin Ralph
Hydrol. Earth Syst. Sci., 29, 5453–5476, https://doi.org/10.5194/hess-29-5453-2025, 2025
Short summary
We explore a machine learning-based data integration method that integrates streamflow (Q) and snow water equivalent (SWE) to improve streamflow estimates at various lag times (1–10 d, 1–6 months) and timescales (daily and monthly) over Western US basins. Benefits rank as: integrating Q at the daily scale > Q at the monthly scale > SWE at the monthly scale > SWE at the daily scale. Results highlight the method’s potential for short- and long-term streamflow forecasting in the Western US.
Gab Abramowitz, Anna Ukkola, Sanaa Hobeichi, Jon Cranko Page, Mathew Lipson, Martin G. De Kauwe, Samuel Green, Claire Brenner, Jonathan Frame, Grey Nearing, Martyn Clark, Martin Best, Peter Anthoni, Gabriele Arduini, Souhail Boussetta, Silvia Caldararu, Kyeungwoo Cho, Matthias Cuntz, David Fairbairn, Craig R. Ferguson, Hyungjun Kim, Yeonjoo Kim, Jürgen Knauer, David Lawrence, Xiangzhong Luo, Sergey Malyshev, Tomoko Nitta, Jerome Ogee, Keith Oleson, Catherine Ottlé, Phillipe Peylin, Patricia de Rosnay, Heather Rumbold, Bob Su, Nicolas Vuichard, Anthony P. Walker, Xiaoni Wang-Faivre, Yunfei Wang, and Yijian Zeng
Biogeosciences, 21, 5517–5538, https://doi.org/10.5194/bg-21-5517-2024, 2024
Short summary
This paper evaluates land models – computer-based models that simulate ecosystem dynamics; land carbon, water, and energy cycles; and the role of land in the climate system. It uses machine learning and AI approaches to show that, despite the complexity of land models, they do not perform nearly as well as they could given the amount of information they are provided with about the prediction problem.
Louise Arnal, Martyn P. Clark, Alain Pietroniro, Vincent Vionnet, David R. Casson, Paul H. Whitfield, Vincent Fortin, Andrew W. Wood, Wouter J. M. Knoben, Brandi W. Newton, and Colleen Walford
Hydrol. Earth Syst. Sci., 28, 4127–4155, https://doi.org/10.5194/hess-28-4127-2024, 2024
Short summary
Forecasting river flow months in advance is crucial for water sectors and society. In North America, snowmelt is a key driver of flow. This study presents a statistical workflow using snow data to forecast flow months ahead in North American snow-fed rivers. Variations in the river flow predictability across the continent are evident, raising concerns about future predictability in a changing (snow) climate. The reproducible workflow hosted on GitHub supports collaborative and open science.
Yalan Song, Wouter J. M. Knoben, Martyn P. Clark, Dapeng Feng, Kathryn Lawson, Kamlesh Sawadekar, and Chaopeng Shen
Hydrol. Earth Syst. Sci., 28, 3051–3077, https://doi.org/10.5194/hess-28-3051-2024, 2024
Short summary
Differentiable models (DMs) integrate neural networks and physical equations for accuracy, interpretability, and knowledge discovery. We developed an adjoint-based DM for ordinary differential equations (ODEs) for hydrological modeling, reducing distorted fluxes and physical parameters from errors in models that use explicit and operation-splitting schemes. With a better numerical scheme and improved structure, the adjoint-based DM matches or surpasses long short-term memory (LSTM) performance.
Diogo Costa, Kyle Klenk, Wouter Knoben, Andrew Ireson, Raymond J. Spiteri, and Martyn Clark
EGUsphere, https://doi.org/10.5194/egusphere-2023-2787, 2023
Preprint archived
Short summary
This work helps improve water quality simulations in aquatic ecosystems through a new modeling concept, which we termed “OpenWQ”. It allows tailoring biogeochemistry calculations and integration with existing hydrological (water quantity) simulation tools. The integration is demonstrated with two hydrological models. The models were tested for different pollution scenarios. This paper helps improve interoperability, transparency, flexibility, and reproducibility in water quality simulations.
Luca Trotter, Wouter J. M. Knoben, Keirnan J. A. Fowler, Margarita Saft, and Murray C. Peel
Geosci. Model Dev., 15, 6359–6369, https://doi.org/10.5194/gmd-15-6359-2022, 2022
Short summary
MARRMoT is a piece of software that emulates 47 common models for hydrological simulations. It can be used to run and calibrate these models within a common environment as well as to easily modify them. We restructured and recoded MARRMoT in order to make the models run faster and to simplify their use, while also providing some new features. This new MARRMoT version runs models on average 3.6 times faster while maintaining very strong consistency in their outputs to the previous version.
Wouter J. M. Knoben and Diana Spieler
Hydrol. Earth Syst. Sci., 26, 3299–3314, https://doi.org/10.5194/hess-26-3299-2022, 2022
Short summary
This paper introduces educational materials that can be used to teach students about model structure uncertainty in hydrological modelling. There are many different hydrological models and differences between these models impact their usefulness in different places. Such models are often used to support decision making about water resources and to perform hydrological science, and it is thus important for students to understand that model choice matters.
Short summary
The quality of large-domain hydrologic model simulations is often quantified with so-called accuracy metrics. Here we use simple benchmarks to provide relevant context for these accuracy metrics. Results show that areas where the model cannot beat the benchmarks do not always align with areas where the accuracy metrics are low. This suggests that model improvements are possible in regions where the need for them might not be obvious under more typical model evaluation approaches (i.e., without benchmarks).