Benchmarking high-resolution hydrologic model performance of long-term retrospective streamflow simulations in the contiguous United States
Sydney S. Foks
Aubrey L. Dugger
Jesse E. Dickinson
Hedeff I. Essaid
David Gochis
Roland J. Viger
Yongxin Zhang
Download
- Final revised paper (published on 09 May 2023)
- Supplement to the final revised paper
- Preprint (discussion started on 09 Aug 2022)
- Supplement to the preprint
Interactive discussion
Status: closed
RC1: 'Comment on hess-2022-276', Anonymous Referee #1, 11 Aug 2022
Review of "Benchmarking High-Resolution, Hydrologic Performance of Long-Term Retrospectives in the United States" by Towler et al.
Summary
In this paper, performance of the National Water Model (NWM) v2.1 and the National Hydrologic Model (NHM) v1.0 is evaluated over the United States. These models are different in their internal structure, use different calibration approaches and are run with different meteorological inputs, but are similar in the sense that both are run over a high-resolution spatial grid. Model performance is evaluated with the help of 9 different metrics (e.g. Nash-Sutcliffe, PBIAS) that are calculated using observations and model simulations at 5390 streamflow gauges. Attention focuses mostly on median values in the 5390-member sample, and on differences between both models in various broad regions across the US. There are some recommendations on how to improve both models; most notably by updating the model structures to account for human water use and the impact of lakes and reservoirs.
General comments
Having read this paper, I must admit that I am not entirely sure whether HESS is the right venue for this. Various sentences suggest that this publication is intended as a benchmark for further development of the NWM and NHM. For example:
- [line 25] "This benchmark provides a baseline to document performance and measure the evolution of each model application"
- [line 80] "This paper highlights select results of the benchmarking analysis to document baseline model performance and characterizes overall performance patterns of both models."
- [line 198] "Here, we provide select results, with a focus on documenting baseline model performance and providing insight towards model diagnostics and development."
- [line 315] "here we provide a lower benchmark to gauge the evolution of the NWMv2.1 and NHMv1.0".
This is a great goal that I think should be the standard in any model development exercise (as it is in many other fields), but this kind of benchmarking is of limited interest to anyone who does not actively work with these models. A technical report instead of a journal publication might be more appropriate.
To appeal to a wider (international) journal audience, the proposed benchmarking approach should be of general interest and I think in its current shape it fails to be that. My main concerns are that:
- The selected benchmarking metrics are too one-sided: out of the 9 metrics, 7 either include or are some form of model bias metric. Multiple other relevant aspects of hydrographs and model performance are not captured by these metrics.
- There is no clear way to relate a model's performance on this set of metrics to concrete suggestions for improvement of the model, because it is practically impossible to trace the scores a model obtains on these metrics to how well the model simulates a given hydrological process (though I appreciate that this is not an easy thing to do).
- The model results are presented in a vacuum: there is only very limited discussion of existing literature on benchmarking, there is no comparison of the performance of these two models to the performance of earlier modeling efforts across this domain, and there is no discussion about how high a model must score on any of the 9 metrics to be considered a good/plausible/acceptable/etc model.
- There is almost no guidance (or better yet, software) available for a reader who might want to apply this benchmarking approach to their own simulations, beyond a table that shows references for the 9 metrics and a CSV file that contains the list of gauge IDs.
I believe that these issues can be addressed to a certain extent (see specific comments below), but in its current shape this manuscript mostly describes what performance scores two arbitrary models obtain on a limited selection of model performance metrics, without any context for those scores whatsoever. I don't think that's enough to warrant publication in HESS.
Specific comments
l12. "a benchmark statistical design" - It's unclear to me what this means.
l90. "https://noaa-nwm-retrospective-2-1-pds.s3.amazonaws.com/index.html" - The NWM docs (https://water.noaa.gov/about/output_file_contents) seem to say that output files are in netCDF4 format, but if I follow this link all I can find is .comp files. What are these files and how can a reader open/use them?
l105. "Using the AORC meteorological forcings, NWMv2.1 calibrates a subset of 14 soil, vegetation, and baseflow parameters to streamflow in 1,378 gauged, predominantly natural flow basins. The calibration procedure uses the Dynamically Dimensioned Search algorithm (Tolson and Shoemaker, 2007) to optimize parameters to a weighted Nash-Sutcliffe efficiency (NSE) of hourly streamflow (mean of the standard NSE and log-transformed NSE). Calibration runs separately for each calibration basin, then a hydrologic similarity strategy is used to regionalize parameters to the remaining basins within the model domain." - This needs a reference to indicate where a reader can find further details about this procedure.
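For reference, the weighted objective described in the quoted text could be written roughly as follows; this is a Python sketch, and the epsilon guarding the log transform is an assumption rather than the actual NWM calibration setting.

```python
import numpy as np

def nse(sim, obs):
    """Standard Nash-Sutcliffe efficiency."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def weighted_nse(sim, obs, eps=1e-6):
    """Mean of standard NSE and NSE of log-transformed flows, i.e. the
    weighted objective described in the text. The eps guard against log(0)
    is an assumption; the NWM calibration may treat zero flows differently."""
    return 0.5 * (nse(sim, obs) + nse(np.log(sim + eps), np.log(obs + eps)))
```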
l113. "For the analysis in this work, hourly streamflow is aggregated to daily averages." - Looking at a snapshot of the USGS gauges used for this evaluation approach, observations seem to be available at a sub-daily resolution. Given that the model is run at a 3-hr resolution, and it is known that hydrologic processes of interest can show strong diurnal variation (e.g. evaporation, snowmelt), why are observations and simulations aggregated to daily values?
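For clarity, the aggregation step in question presumably amounts to something like the following pandas sketch (illustrative values only), after which any sub-daily (e.g. diurnal) signal is no longer visible to the metrics:

```python
import pandas as pd

# Hourly simulated streamflow indexed by time (illustrative values only).
hourly = pd.Series(
    [1.2, 1.3, 1.1, 1.4, 1.6, 1.5],
    index=pd.date_range("2016-01-01", periods=6, freq="h"),
)

# Daily-average flow, which is what the metrics are then computed on.
daily = hourly.resample("D").mean()
```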
l148. "The NSE is formulated to emphasize high flows" - This statement seems to contradict the last part of this sentence: "models do not necessarily perform well at reproducing high flows when NSE is used for calibration". Suggest to rephrase this.
l156. "Correlation, standard deviation ratio, and percent bias" - These three are (almost) the constitutive components of the KGE metric, and also appear in the NSE (see e.g. the decomposition of RMSE by Murphy, 1988, https://doi.org/10.1175/1520-0493(1988)116%3C2417:SSBOTM%3E2.0.CO;2). There is likely value in looking at these individual components compared to the aggregated efficiency scores, but this section should state that these metrics are not independent from NSE and KGE.
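To make the overlap explicit, a sketch of the 2009 form of KGE and its components (Pearson correlation, standard-deviation ratio, and mean ratio), which closely mirror the three individual metrics listed here:

```python
import numpy as np

def kge_components(sim, obs):
    """Kling-Gupta efficiency (2009 form) and its three components:
    r (correlation), alpha (std-dev ratio, sim/obs) and beta (mean ratio,
    sim/obs, closely related to percent bias)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    kge = 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
    return kge, r, alpha, beta
```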
l167. "Three hydrologic signatures defined by Yilmaz et al. (2008)" - There are many possible signatures one could choose from and these are sometimes divided into five separate categories (magnitude, frequency, duration, timing and rate of change; e.g. Olden & Poff, 2003, dx.doi.org/10.1002/rra.770). More recently, McMillan (2022; dx.doi.org/10.1002/hyp.14537) created a signature taxonomy that relates signatures to specific hydrologic processes. The selected signatures here exclusively address the magnitude component, without explaining why these other components are not addressed or how a model's performance on any of these signatures might inform which of the model's process representations needs to be improved.
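As an example of how such a magnitude signature is typically computed, a sketch of the high-segment bias of the flow duration curve; the 2 % exceedance threshold follows one common reading of Yilmaz et al. (2008), but the paper's exact definition should be checked.

```python
import numpy as np

def pbias_fhv(sim, obs, h=0.02):
    """Percent bias of the high-flow segment of the flow duration curve
    (flows with exceedance probability <= h). h = 0.02 is an assumption
    following a common Yilmaz et al. (2008) convention."""
    n = max(1, int(np.ceil(h * len(obs))))
    sim_high = np.sort(sim)[::-1][:n]
    obs_high = np.sort(obs)[::-1][:n]
    return 100.0 * (sim_high.sum() - obs_high.sum()) / obs_high.sum()
```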
More generally, out of the 9 presented metrics, 7 metrics are either some form of bias or include a bias component. This seems insufficient spread to me for a "standard metric suite". I believe this selection needs to be expanded quite a bit before these metrics can start to be used for comprehensive model benchmarking.
l170. "big precipitation" - This might be inaccurate phrasing in the case of colder catchments, where flow events might originate from snow/ice melt and not directly from individual precipitation events.
l178. "Foks et al., 2022" - The .csv file in this reference is missing leading zeros for station numbers, which makes searching for them on the USGS website (https://waterdata.usgs.gov/nwis/uv?referred_module=sw&search_criteria=search_site_no&search_criteria=site_tp_cd&submitted_form=introduction) somewhat difficult. E.g. searching for station 1011000 yields no results with the default "exact match" option, whereas 01011000 does show a result. If possible, updating this resource could help others. Adding some guidance on how to obtain these observations in a reasonably efficient manner would be good too.
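For what it is worth, this is a one-line fix on the reader's side; the file and column names below are placeholders, not the actual names in the data release.

```python
import pandas as pd

# Read site numbers as strings so leading zeros are not dropped, then
# pad to the 8-digit length most USGS streamgage IDs use.
gauges = pd.read_csv("gauge_list.csv", dtype={"site_no": str})  # placeholder names
gauges["site_no"] = gauges["site_no"].str.zfill(8)
```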
l191. "For statistical significance, we conduct pairwise testing, specifically the Wilcoxon signed-rank test. The Wilcoxon signed-rank test is a non-parametric alternative to paired t-test. The Wilcoxon signed-rank test is appropriate here since the metrics (particularly the efficiency metrics) contain outliers and are not necessarily normally distributed" - This is unclear to me. What is being compared pair-wise? Why? A reference to point the reader to info about a Wilcoxon signed-rank test would be good too.
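For readers unfamiliar with the test, the comparison presumably pairs the two models' scores at each gauge, along the lines of the following scipy sketch (values made up for illustration):

```python
from scipy.stats import wilcoxon

# Paired per-gauge KGE scores for the two models (made-up values);
# the test asks whether the paired differences are centred on zero.
kge_nwm = [0.61, 0.45, 0.72, 0.10, 0.55, 0.33]
kge_nhm = [0.58, 0.49, 0.66, 0.15, 0.60, 0.30]

stat, p_value = wilcoxon(kge_nwm, kge_nhm)
```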
l202. "median values" - Why are only medians discussed here? How meaningful is that on a 5000+ sample?
l206. "indicating that they are tracking similarly in terms of overall performance" - This may need to be a bit more nuanced. Because these correlations are calculated on ranks and not actual metric scores, I think all this indicates is that these models are similar in where they tend to do relatively better and worse (within their own 5390-member sample). I don't think these ranked correlations indicate that these models are similar in actual performance as measured by the metrics, which is what the text seems to say.
l209. "these three popular efficiency metrics are providing very similar information in terms of overall performance assessments" - Again, I think this may need to be a bit more nuanced. What I believe these correlations show is that relative ranks are similar for these three metrics. In the .csv files I can see that there are still quite large differences in the actual scores on the three metrics. I would suggest to rephrase this paragraph.
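A toy example of the distinction being made here: two metrics can rank the gauges identically (Spearman rho = 1) while the scores themselves differ substantially.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up metric scores at the same four gauges.
nse_scores = np.array([0.2, 0.4, 0.6, 0.8])
kge_scores = np.array([-0.5, 0.1, 0.5, 0.9])

rho, p = spearmanr(nse_scores, kge_scores)         # rho = 1.0: identical ranking
max_gap = np.max(np.abs(nse_scores - kge_scores))  # yet scores differ by up to 0.7
```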
l216. "Figure 2" - Why is the x-axis in this figure capped at KGE = -0.25? Looking at the data in the .csv files I see that KGE scores go as low as KGE = -306 for the NWM, and KGE = -158 for the NHM. This suggests that there is a lot of rather poor model performance that's not shown in this figure. Should that not be discussed as well in a paper intended to set a baseline for model performance?
l219. "Table 4 bins the KGE scores" - A similar question can be asked here: why are these bins defined with a lower bin of KGE < 0.2? There seems to be a lot of variety in model performance below this arbitrary threshold. More generally speaking, what can be learned by binning the data in this way that is not obvious from a figure with four CDFs (one CDF each for west, central, southeast and northeast)? These KGE bin boundaries seem quite arbitrary to me and mask any variety within the bin. It might be cleaner to replace this table with CDFs per region instead.
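A minimal sketch of the suggested alternative (one empirical CDF of KGE per region), assuming the scores are available as a dict of arrays keyed by region name:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_kge_cdfs(kge_by_region):
    """One empirical CDF of KGE per region, as an alternative to binning."""
    fig, ax = plt.subplots()
    for region, scores in kge_by_region.items():
        x = np.sort(np.asarray(scores))
        y = np.arange(1, len(x) + 1) / len(x)
        ax.step(x, y, where="post", label=region)
    ax.set_xlabel("KGE")
    ax.set_ylabel("Fraction of gauges")
    ax.legend()
    return fig
```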
l231. "Relatively good performance is seen in the Southeast" - This paragraph uses fairly arbitrary thresholds to discuss the KGE performance of both models (e.g., anything with KGE < 0.2 is considered poor performance; KGE > 0.8 is implicitly treated as a boundary above which everything is similarly good). Previous publications argue that efficiency scores such as NSE and KGE cannot be viewed in isolation but need to be compared to some form of baseline model, so that one can judge if these NSE/KGE scores are in fact poor or good for a given location (e.g. Seibert, 2001; Schaefli & Gupta, 2007; Pappenberger et al., 2015; Seibert et al., 2018). NSE includes such a benchmark by design (i.e. the mean annual flow - but this is often criticized as being too easy to beat). KGE does not include such a benchmark and therefore needs some other way to provide context. Work using the CAMELS catchments (Knoben et al., 2020) uses a seasonal cycle benchmark and suggests that for certain locations even KGE > 0.9 could be considered a basic requirement for models rather than being indicative of an exceptionally well-performing model. I think the KGE scores discussed in this paragraph need to be given some context, so that there is some objective reason to qualify a given KGE score as "poor", "good" etc. Presenting these scores in isolation does not help the reader understand what kind of model performance they indicate.
The same comment applies to the following paragraphs as well. The presented numbers need some context that gives the reader an objective reason to decide whether those numbers are indicative of good or bad model performance.
Knoben et al.: doi.org/10.1029/2019WR025975
Pappenberger et al.: doi.org/10.1016/j.jhydrol.2015.01.024
Schaefli & Gupta, 2007: doi.org/10.1002/hyp.6825
Seibert, 2001: doi.org/10.1002/hyp.446
Seibert et al.: doi.org/10.1002/hyp.11476
l244. "It is noticeable that many of the sites are in the tails" - I find this hard to grasp from just looking at this figure. Adding a small histogram to the bottom left corner might help.
l315. "here we provide a lower benchmark to gauge the evolution of the NWMv2.1 and NHMv1.0" - This sentence seems to suggest that this publication is mainly intended to benchmark future development of the NWM and NHM. Would a technical report not be a more appropriate venue for this? The kind of information presented in this paper seems useful to those actively working with the NWM or NHM, but may be of somewhat limited interest to the wider hydrological audience.
l317 "The baseline can provide an a priori expectation for what constitutes a “good” model." - I respectfully disagree. This baseline shows the current performance of the NWM and the NHM but it provides no objective reason for calling either a good model. For example, the mean annual flow (NSE = 0; KGE = -0.41) is often used as a rudimentary threshold for model performance. The .csv files with metric values show that the NWM does not outperform the mean annual flow as a predictor in 23% of gauges if NSE is used, and 14% of gauges if KGE is used. Similarly, the NHM does not outperform a mean annual flow in 24% of cases if NSE is used, and 12% of cases if KGE is used. To make the statement that these results are a priori expectations for what constitutes a good model, a much more in-depth comparison of both models against a range of statistical benchmarks (e.g., mean annual flow, seasonal cycle, persistence) and existing model results across this domain (e.g. any number of results based on the CAMELS data, NLDAS [10.1029/2011JD016051], global models [10.5194/hess-24-535-2020]) is needed.
l336. "Results helped to identify potentially missing processes that could improve model performance. PBIAS results showed that for both models, simulated streamflow volumes are overestimated in the West region, particularly for the sites designated as Non-Reference. One primary reason for this may be that water withdrawal for human use is endemic throughout the West and neither model has a thorough representation of these withdrawals. Furthermore, neither model possesses significant representations for lake and stream channel evaporation which, through the largely semi-arid west, can constitute a significant amount of water "loss" to the hydrologic system (Friedrich et al., 2018). Lastly, nearly all western rivers are also subject to some form of impoundment. Even neglecting evaporative, seepage and withdrawal losses from these water bodies, the storage and timed releases of water from managed reservoirs can significantly alter flow regimes from daily to seasonal timescales thereby degrading model performance statistics at gaged locations downstream of those reservoirs" - Upon reading this I cannot help but wonder if PBIAS values were needed at all to determine that these models might be improved by accounting for human water use and the presence of lakes & reservoirs. These seem fairly obvious processes to me when one is working with "two models that have been developed to assess water availability and risks in the United States". Should this even be listed as a discussion/conclusion point, instead of being presented as a known a-priori limitation of these models?
l357. "state-of-the-art" - Without intending to disparage the work that undoubtedly has already gone into creating these models, calling them state-of-the-art seems an overstatement if neither of these water resources assessment tools has a way to account for human interaction with the water cycle.
l354. "Identifying a suite of metrics has an element of subjectivity, but our aim was to identify an initial set of metrics that can be applied to a wide variety of science questions (e.g., see Table 1.1 in Blöschl et al. 2013) and that build on standard practices for evaluation of model application performance within the hydrologic community" - As indicated earlier, with 7 out of 9 metrics focusing on bias I find this set of metrics too limited for even an initial set. Of course there is some subjectivity in selecting metrics, but there is also some existing understanding of which statistical properties of hydrographs might be relevant to look at, how those might be captured in streamflow signatures, and how those signatures might be used to explain how well a model simulates certain, specific processes. This current selection of metrics seems too ad-hoc to me and some deeper literature searching would likely result in a set of metrics with a much wider applicability.
l576. "Table 1" - It would be helpful if equations were added to each row here. The ratio metrics are currently difficult to interpret for the reader, because they cannot know whether these are calculated as sim/obs or obs/sim without looking into other references.
l576. "Table 1" - Why are these bias metrics capped at (-)100?
l642. "Reference (Ref, n= 1,115) and Non-Reference" - A brief explanation of what reference/non-reference means would be helpful. This could be a summary of lines 186-189.
Technical corrections
l162. "modeled and observed" - Is there a word missing that should come after "observed"?
l197. "Using daily observations and simulations from the NWMv2.1 (Towler et al., 2022a) and NHMv1.0 (Towler et al., 2022b) hydrologic modeling applications" - The way the Towler et al. references are inserted in the text implies that they contain the daily time series of observations and simulations, but in reality these references include only the 9 metrics for each gauge. Suggest to clarify this.
l204. "the differences are statistically significant given the large sample size" - Why are some values bold in the NWM column and others in the NHM column? Shouldn't they be bold in both or neither?
l230. "you move" - consider replacing with "one moves"
l241. "better and worse" - is there some text missing here that indicates compared against what these models do better or worse?
l403. "References" - This list is not entirely in alphabetical order.
l557. "https://10.5066/P9DKA9KQ" - Has this link been inserted correctly? When I click it it attempts to take me to a local file location instead of the link the text suggests this is. Unsure if this problem is on my end only, but the link in the Towler reference above this one seems to work fine for me.
l644. "Figure 2" - these figures are quite small. Stacking the subplots vertically would give more space to each figure.
l673. "Figure 8" - these figures are quite small. Stacking the subplots vertically would give more space to each figure.
l687. "Figure 11" - these figures are quite small. Stacking the subplots vertically would give more space to each figure.
Citation: https://doi.org/10.5194/hess-2022-276-RC1
AC1: 'Reply on RC1', Erin Towler, 16 Sep 2022
We appreciate the comments from Referee #1, and wanted to post a short, general response to their main concerns before providing a comprehensive response that addresses every comment point-by-point (which we will do once the Open Discussion is closed and all comments have been received).
The reviewer has raised several constructive, general comments that we can address to increase the impact of our research paper. While this study was spurred by programmatic needs and priorities specific to our model development teams, this feedback helps us to leverage our research efforts to increase the international appeal and general interest.

To this end, one of the main suggestions is to provide performance context for both models. The approach suggested in Knoben et al. (2020) is a good option, whereby the KGE values for each model are compared with KGE values calculated based on a climatological benchmark (e.g., mean annual flow or interannual daily means and/or medians). Adding this type of analysis and more of the existing benchmarking literature would address their main concern #3. Second, with this additional focus on the KGE metric and its context, we could remove some of the other metrics investigated (like NSE and logNSE), which would reduce the number of bias-based metrics (main concern #1). However, we focus on magnitude because of its relevance to our application of water availability, and bias metrics are interpretable and fit-for-purpose in this use case. In our revision, we will make these points more explicit and refer to the suggested literature on other aspects of the hydrograph that could be used for evaluation of other applications.

Another issue raised (main concern #2) is the difficulty of relating a model's performance to model processes, or to concrete suggestions for improvement; however, the reviewer acknowledges that this is widely recognized as not a trivial thing to do. To this point, using the reviewer’s previous suggestion, we can look more closely at the sites that perform worse than the climatological KGE benchmark. We have performed several preliminary analyses to this end, finding that most sites that perform worse than the climatological benchmark are influenced by human activities (i.e., “Non-Reference” sites). This can help shed some light on “how much” improvement we would need at these sites, for example from adding a management module, potentially providing a more concrete goal for model development. This would be of widespread interest, as the hydrologic modeling community grapples with how to account for the anthropogenic influence on watersheds, and most studies to date focus on minimally disturbed sites. This is our current thinking on how we can reshape the paper, and we will provide more details/analysis in our comprehensive response after the Open Discussion closes.

Finally, regarding the reviewer’s concern #4, we could post our codes (we have draft Jupyter notebooks and R codes) and will continue to consider this; one reason we did not post them initially is that most of the metrics we calculated are straightforward, already available in existing R libraries (e.g., hydroGOF), and detailed in Table 1.
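As a rough sketch, one way such an interannual daily-mean benchmark series could be constructed (one possible convention, not necessarily the exact form we would adopt) is:

```python
import pandas as pd

def doy_climatology(obs_daily: pd.Series) -> pd.Series:
    """Interannual day-of-year mean flow, repeated over the full record.
    Scoring this series against the observations (e.g. with KGE) gives the
    climatological benchmark value that model KGEs can be compared to.
    Leap days are handled only crudely here."""
    doy = obs_daily.index.dayofyear
    doy_mean = obs_daily.groupby(doy).mean()
    return pd.Series(doy_mean.loc[doy].values, index=obs_daily.index)
```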
We thank the reviewer for bringing up these points and will continue to think through how to address their main concerns. We look forward to providing a more comprehensive response and an improved revision once the Open Discussion closes.
Citation: https://doi.org/10.5194/hess-2022-276-AC1
AC2: 'General Response to All Reviewers', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC2-supplement.pdf
AC3: 'RC1 Point-by-Point Response', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC3-supplement.pdf
RC2: 'Comment on hess-2022-276', Robert Chlumsky, 29 Sep 2022
I have completed my review of the paper “Benchmarking High-Resolution, Hydrologic Performance of Long-Term Retrospectives in the United States”, Erin Towler et al. The paper presents a benchmark statistical design for the evaluation of process-based hydrologic models over large spatial and temporal scales, and is applied to evaluate the National Water Model v2.1 application of WRF-Hydro and the National Hydrologic Model v1.0 of the Precipitation-Runoff Modeling System.
The paper itself is relatively straightforward in methods and application, including a description of both models, description of the metrics selected for evaluation and the presented comparison of the two models using the metrics selected. The paper draws a number of appropriate conclusions regarding the relative performance of the models spatially and based on flow regime, and is overall very well written and logically presented.
Regarding the comments on paper type, the paper aligns largely with a Technical Report format, though the additional discussion and interpretation of results help move it towards a Research Article.
A number of additional comments and concerns are presented here to help improve the paper.
General Comments
- In the Introduction, mention of previous studies that have addressed the 5,390 USGS gages used in this study would be relevant (have any studies used all of these gages as well?)
- Introduction - It would also be worth mentioning other datasets that have been commonly used in larger-scale benchmark and model intercomparison studies, such as the MOPEX (Duan et al., 2006) and CAMELS dataset (Addor et al., 2017). The Mai et al. (2022) GRIP-GL comparison would also be worth mentioning in the list of recent benchmark and model intercomparison studies.
- Introduction - Any previous studies benchmarking these two hydrologic models would be worth mentioning in the last introductory paragraph (lines 73-81), or mention that this is the first study benchmarking these two models specifically.
- Line 210 – are these three metrics providing very similar information for overall performance assessment in general, or simply because these models are similar and that happens to be the case in this study only? I would be surprised if this conclusion was generalized for very different hydrologic models, and I think this should be carefully rephrased to not overgeneralize from the limited model comparison (i.e. 2 similar models) presented in this study.
- Reference to Knoben et al. (2019) on what a baseline KGE performance is may be useful in interpreting the results, since 0.2 seems somewhat arbitrary. The Knoben et al. paper suggests -0.4 is a more comparable threshold to the NSE=0 interpretation, so perhaps some justification or rationale for using 0.2 is warranted.
- Table 4 – the bolding pattern is confusing to me, since it is meant to represent the maximum number (percent) of sites by KGE category (?), though the Northeast has two bolded numbers, and in the Central region the minimum number is bolded. Similar bolding patterns continue in other Tables and seem to be at least non-intuitive.
- Table 6 – I would suggest a summary column with the average metric across regions to help summarize the results, similar to how Table 5 summarizes results for Ref and non-ref sites. This would have some duplication with Table 5 but I think it is still worth including here as an additional column.
- Figure 4 and lines 241-247 – I think that screening the models with poor initial performance from Figure 4, perhaps as a separate figure, would be more meaningful than comparing relative model performance between a KGE of 0.0 and -0.05. In either case, the models likely don’t capture enough of the observed behaviour for a modeler to care which is better, and this inhibits interpretation of Figure 4 in identifying any real differences between the models. It seems the models will be similar in any case, but I would filter results first.
- Line 268 – I think this statement is actually incorrect, since the lower variability at managed (non-reference) sites should already be normalized by comparison to observed data. My interpretation of this is that the ideal rSD is 1.0, and rSD below 1.0 indicates that the model underestimates the variability of flow. In both cases the models underestimate the variability of flow, in particular for reference sites relative to non-reference or managed sites. This perhaps suggests the models do better at capturing general changes in flow than sudden ones at unregulated reference sites. There is more interesting interpretation to add in this section.
- Line 277-279 – this can be compared with the GRIP-GL study results (Mai et al., 2022) to discuss general trends in Great-Lakes areas
- Line 295 – general comment but an actual histogram plot of the information in Supplemental Table 2 would likely convey this information much better. A simple histogram of frequency vs binned PBIAS_LF, with the four regions either faceted or colour-coded on one plot, would greatly aid the discussion (see the sketch after this list).
- Line 341 – it would be worth elaborating on the value of the passive lake/reservoir representation in the model relative to none. It is interesting that the model with the passive representation (NWMv2.1) does seem to perform slightly better than the NHMv1.0, though it is unclear if that is the reason why or what the improvement in performance would be with a better representation of reservoir operations. This would require some segmentation based on catchments with ‘significant’ reservoir controls, which is not included in this study, though worth discussing briefly here.
- Line 355 – the NWMv2.1 is described to perform better in high-flow-focused metrics than the NHMv1.0. This discussion should be expanded to how this could likely have been known from the model setup initially, since running the model on hourly or subdaily timesteps and aggregating will very likely produce better performance for peak flow metrics than a model that is run at a daily timestep, therefore this result should not be a surprise. This is touched on by mentioning that the latter model is designed for water availability, but I think this point should be emphasized.
- Conclusion – the concluding paragraph ends rather abruptly, a short one or two line paragraph at the end to tie off the accomplishments of the paper and goals for future studies would help to transition the conclusion.
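Regarding the Line 295 comment above, a minimal sketch of the suggested histogram; the column names 'PBIAS_LF' and 'region' are placeholders, not the actual headers of the supplemental files.

```python
import matplotlib.pyplot as plt

def pbias_lf_histogram(df, bins=20):
    """Overlaid histograms of low-flow percent bias, one colour per region."""
    fig, ax = plt.subplots()
    for region, sub in df.groupby("region"):
        ax.hist(sub["PBIAS_LF"], bins=bins, alpha=0.5, label=region)
    ax.set_xlabel("Low-flow PBIAS (%)")
    ax.set_ylabel("Number of gauges")
    ax.legend()
    return fig
```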
Technical Comments:
- I was under the impression that CONUS was an acronym for contiguous United States (not conterminous), though I suppose the definitions are practically the same
- Links in lines 92-93 should be properly cited instead of providing raw urls
- Line 168 – I would rewrite this paragraph slightly to something like: “Three additional hydrologic signatures are included which evaluate performance based on different parts of the flow duration curve (FDC) for high, medium, and low flows. The definitions for these hydrologic signatures as used in this study are consistent with those from Yilmaz et al. (2008). The bias of high flows…“ This will help the readability of the section, otherwise the reader is left wondering which metrics you are porting in from Yilmaz until the whole section is read.
- Line 201 – “…for all 5,390 cobalt gages …”. If these will be called cobalt gages in the paper, this should be used throughout the paper after its definition for consistency
- Line 221 – the line “Both models also have many sites with poor performance” – this can be quantified and merged with the next line, as many sites in a large sample study could mean 100 or 1000. Both models in fact have 30% of their sites with a KGE below 0.2, which is a lot of sites with very poor performance (KGE below 0.2 is likely an ‘unusable’ or ‘untrustworthy’ model for most applications)
- Line 361 – link should be properly cited
References
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017, 2017.
Duan, Q., Schaake, J., Andréassian, V., Franks, S., Goteti, G., Gupta, H. V., et al.: Model parameter estimation experiment (MOPEX): An overview of science strategy and major results from the second and third workshops, Journal of Hydrology, 320(1–2), 3–17, https://doi.org/10.1016/j.jhydrol.2005.07.031, 2006.
Knoben, W., Freer, J., and Woods, R.: Technical note: Inherent benchmark or not? Comparing Nash-Sutcliffe and Kling-Gupta efficiency scores, Hydrol. Earth Syst. Sci. Discuss., 1–7, https://doi.org/10.5194/hess-2019-327, 2019.
Mai, J., Shen, H., Tolson, B. A., Gaborit, É., Arsenault, R., Craig, J. R., Fortin, V., Fry, L. M., Gauch, M., Klotz, D., Kratzert, F., O'Brien, N., Princz, D. G., Rasiya Koya, S., Roy, T., Seglenieks, F., Shrestha, N. K., Temgoua, A. G. T., Vionnet, V., and Waddell, J. W.: The Great Lakes Runoff Intercomparison Project Phase 4: The Great Lakes (GRIP-GL), Hydrol. Earth Syst. Sci., 26, 3537–3572, 2022.
Citation: https://doi.org/10.5194/hess-2022-276-RC2
AC2: 'General Response to All Reviewers', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC2-supplement.pdf
AC4: 'RC2 Point-by-point Response', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC4-supplement.pdf
RC3: 'Comment on hess-2022-276', Anonymous Referee #3, 03 Oct 2022
This is a review of the manuscript “Benchmarking High-Resolution, Hydrologic Performance of Long-Term Retrospectives in the United States” by Towler et al. The manuscript compares the performance of two large-scale hydrologic models in estimating streamflow by comparing against observed streamflow at gauges across the continental United States (CONUS). The performance is evaluated using a number of metrics that are commonly used in streamflow evaluation. The manuscript is well-written and easy to follow. The effort to create benchmarks for CONUS scale streamflow prediction models is commendable, necessary, and of interest to this journal and the hydrologic community. However, the metrics presented here are commonplace and the evaluation/benchmarking workflow is not novel. My biggest criticisms of the study are regarding the consistency of comparing two model outputs (major comment 2) and the use of calibration gauges in evaluation (major comment 3).
The manuscript can still be considered for publication provided the authors sufficiently address my concerns. I, therefore, recommend Major Revision.
Major comments:
- Introduction: The Introduction is missing a comprehensive review of current literature and needs improvement to further clarify the hurdles being overcome by this study and bring out its novelty. Specifically, the last paragraph should have a few sentences summarizing how it is building on previous studies and what shortcomings are being overcome in this specific study. Additionally, for studies mentioned in L 48-65, please mention their drawbacks and how this study aims to overcome them. Also, review of studies regarding statistical design of large-sample benchmarks and intercomparisons has been ignored. The authors should also clarify how the benchmark statistical design used in this study compares to previous studies where large sample intercomparison and/or benchmarking have been carried out. Finally, the National Hydrologic Model is mentioned for the first time in the manuscript in L 75 when the authors are specifying the objectives of the study. The authors should introduce the two models briefly in the Introduction while also mentioning the reasons behind choosing these two specific models.
- L 113: NWM produces hourly streamflow using hourly atmospheric forcings whereas NHM produces daily streamflow using daily forcings. The hydrologic processes in the watersheds are simulated at different temporal scales (hourly vs daily) by the two models. Additionally, many USGS gauges record 15-minute streamflow data. NWM can produce hourly streamflow and takes into account changes in hydrologic variables throughout the day. Averaging a higher-resolution (hourly) streamflow timeseries produced using higher-resolution (hourly) forcing to a coarser resolution is not the equivalent of simulating streamflow at a coarser resolution (daily) from coarse-resolution (daily) forcings, due to the non-linear nature of hydrologic processes. As such, is the comparison of the streamflow produced at two different temporal scales a consistent and fair comparison?
- Calibration: What was the calibration period for the two models? It is unclear from the text if gauges used in calibration were also part of evaluation. If the calibration period overlapped the evaluation period (October 1, 1983, to December 31, 2016), then the gauges used for calibrating either of the models should be removed from the set of gauges used for benchmarking the models. Including these gauges will introduce biases in the evaluation process.
- The study also includes gauges near the coast in the evaluation scheme. USGS gauges do not measure streamflow directly; rather, the water surface elevation (WSE) is measured and converted to streamflow using rating curves. Gauges near the coast can experience backwater from coastal surge traveling up the river and/or tides. In such cases, the rating curves for converting WSE to streamflow are violated and streamflow readings are highly erroneous. As such, should gauges near the coast be included in the evaluation scheme? Additionally, both NWM and NHM do not take into account the interaction between the river and sea/ocean.
- L 327-330: The authors should discuss why these areas are exhibiting poorer/better performance for both the models. They have done a good job of explaining the behavior of PBIAS in L 335-348 and need to similarly delve deeper into the potential causes of the behavior in the efficiency metrics for these regions.
- The authors need to discuss the limitations of this study and future work at the end of the manuscript in more detail. The limitations of the study extend beyond the subjectivity in choosing the performance metrics and their sensitivities. This could be a separate section or a continuation of the Results and Discussion.
Minor Comments:
- Title: is it really the United States if Alaska and the US territories have not been included? Should it be CONUS instead?
- L 177: The study uses 5,390 gauges and 5,389 of those are in GAGES II. So, there is just one gauge that was not part of GAGES II?
- L 191: “For statistical significance …” – statistical significance of what?
- L 350: refer to the appropriate table/figure
- Table 3 can be moved to supplementary information. KGE and NSE (and logNSE) are expected to behave somewhat similarly given their formulations. So this table does not convey anything particularly novel or important.
- Figure 2: There can be further subplots showing the CDF of KGE for the two models by region. This will be more informative than Table 4 which can then be moved to supplementary information.
- Figure 4: Just a suggestion, with there being so many points, it is hard to discern a trend or behavior from the figure. It might help to have region-wise or HUC-unit-wise medians color coded across CONUS. See Figure 8 in https://doi.org/10.1016/j.jhydrol.2022.127470 as an example.
- Please adjust the font size in the figures to make sure the legends, subplot number and lat/long are easily readable (Figures 3, 8, 11)
Citation: https://doi.org/10.5194/hess-2022-276-RC3
AC2: 'General Response to All Reviewers', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC2-supplement.pdf
AC5: 'RC3 Point-by-Point response', Erin Towler, 15 Nov 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-276/hess-2022-276-AC5-supplement.pdf