Volume 29, issue 20
https://doi.org/10.5194/hess-29-5593-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Understanding the relationship between streamflow forecast skill and value across the western US
Download
- Final revised paper (published on 22 Oct 2025)
- Preprint (discussion started on 14 Jan 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2024-4046', Anonymous Referee #1, 17 Feb 2025
  - AC1: 'Reply on RC1', Parthkumar Modi, 24 Mar 2025
- RC2: 'Comment on egusphere-2024-4046', Anonymous Referee #2, 20 Feb 2025
  - AC2: 'Reply on RC2', Parthkumar Modi, 24 Mar 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
ED: Publish subject to revisions (further review by editor and referees) (07 Apr 2025) by Albrecht Weerts
AR by Parthkumar Modi on behalf of the Authors (20 May 2025): Author's response, Author's tracked changes, Manuscript
ED: Referee Nomination & Report Request started (15 Jun 2025) by Albrecht Weerts
ED: Publish as is (17 Jul 2025) by Albrecht Weerts
AR by Parthkumar Modi on behalf of the Authors (01 Aug 2025)
Summary
The manuscript by Modi et al. presents a study on the link between forecast skill and value for a sample of unmanaged, snow-dominated stations in the western United States. The authors focus on the prediction of low AMJJ volumes issued with the ESP method using a distributed model and an LSTM model, or taken directly from the NRCS operational forecasts. Synthetic forecasts, based on streamflow climatology and introducing deviations in the mean and in the standard deviation, serve as a reference to assess errors in the true forecasts and to derive skill and value for controlled forecast errors. Results reveal a symmetry in forecast skill but an asymmetry in forecast value, and the authors discuss the inadequacy of the initially chosen skill metric to explain value.
The paper is of very high quality, well written and very well illustrated. It tackles several scientific objectives, which include comparing the chosen LSTM and distributed models and studying the relationship between skill and value in controlled and real forecast systems. I was unsure to what extent the first objective serves the second, because the paper becomes lengthy with information that is secondary to the skill-value relationship. Nevertheless, I recommend this paper for publication provided that the points below are addressed or commented on.
General comments
Both WRFH and the LSTM generate daily streamflow volumes that are summed to obtain AMJJ volumes. In the case of the LSTM, it is not clear why the model was not trained on AMJJ volumes directly.
Section 2.1.3: The same cost is used for hits and false alarms. One could argue that a false alarm does more damage than just the preventive cost, since it may deteriorate trust in, or the reputation of, the decision-making institutions. This is not accounted for here, but it would be worth discussing.
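For illustration only, the sketch below shows how such an extra false-alarm cost could be folded into a standard static cost-loss expense calculation; the function name `expected_expense` and the `false_alarm_penalty` term are hypothetical and not taken from the manuscript.

```python
def expected_expense(hits, misses, false_alarms, correct_negatives,
                     C, L, false_alarm_penalty=0.0):
    """Mean expense per event in a static cost-loss setting.

    C is the cost of protective action and L the loss from an unprotected event.
    `false_alarm_penalty` is an extra cost (e.g. loss of trust or reputation)
    charged on top of C for each false alarm; setting it to 0 recovers the
    usual assumption that hits and false alarms cost the same.
    """
    n = hits + misses + false_alarms + correct_negatives
    expense = (hits * C                                    # protective action taken, event occurred
               + misses * L                                # no action, event occurred
               + false_alarms * (C + false_alarm_penalty)  # action taken needlessly
               + correct_negatives * 0.0)                  # no action, no event
    return expense / n
```

With a positive false-alarm penalty, the expense of the forecast-based strategy increases even if the contingency table is unchanged, so the derived value decreases; this is the effect that I think would be worth discussing.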
Throughout the results section, and related to Figure 8 and L237, it was not clear to me which tau value was chosen, or whether a range of tau values was used in the assessment of APEVmax. I think this point requires clarification in Section 2.1.3, and potentially reminders in the interpretation of the results.
Section 2.1.1 could benefit from a few clarifications. In particular, the phrases “percentile of dryness” and “driest 2% conditions” were a bit unclear. The variables are listed, but the time step or period to be considered is unclear and would be useful to have for reference. It would also be interesting to add a sentence stating why this methodology was chosen here and why it deviates from the methodology proposed by the USDM. The length of the historical period would also be useful to have at this stage.
Section 2.1.2: A sum may be missing (in the equation or in the text) to compute losses over several forecasts. Related to this, the term n is not defined. Regarding notation, z is rather a probability of exceedance/non-exceedance associated with the quantile y_z. Related to the final discussion on the inadequacy of this skill metric to reflect value, the equal weighting of the three quantiles is probably not suitable, nor does it resemble actual decision-making contexts (which reflect unequal importance of high/low volumes or asymmetrical decision thresholds). Could the authors discuss this? Could another weighting or choice of quantile levels be enough to match the value patterns?
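To illustrate the kind of sensitivity test I have in mind, the sketch below computes a pinball (quantile) loss summed over a set of quantile levels and averaged over n forecasts, with optional per-quantile weights; the function and its arguments are hypothetical and do not follow the manuscript's notation.

```python
import numpy as np

def weighted_quantile_loss(obs, fcst_quantiles, levels, weights=None):
    """Pinball loss summed over quantile levels and averaged over n forecasts.

    obs            : observed AMJJ volumes, shape (n,)
    fcst_quantiles : forecast quantiles y_z, shape (n, k)
    levels         : non-exceedance probabilities z, length k (e.g. [0.25, 0.5, 0.75])
    weights        : optional weights per quantile level; equal weighting if None
    """
    obs = np.asarray(obs, dtype=float)[:, None]
    q = np.asarray(fcst_quantiles, dtype=float)
    z = np.asarray(levels, dtype=float)[None, :]
    w = np.ones(z.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Pinball loss: z * (obs - q) when the observation exceeds the quantile,
    # (1 - z) * (q - obs) otherwise.
    loss = np.where(obs >= q, z * (obs - q), (1.0 - z) * (q - obs))
    return float((loss * w).sum(axis=1).mean())
```

Putting more weight on the lower quantiles (e.g. weights=[0.6, 0.3, 0.1]) would emphasize errors on low volumes and might bring the skill metric closer to the drought-focused value assessment.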
Section 2.3: This section may benefit from some discussion points: about the choice of a normally distributed ensemble, which later appears to be a limitation; about the fact that forecasts often overestimate in dry conditions and underestimate in wet conditions, which is not mimicked here; and about the fact that the applied deviations reflect errors in the mean (bias) or characteristics of the spread (sharpness), whereas the likely important feature here is discrimination, which is not experimented on. Related to Figure 4, a comment on the year-to-year variation in the synthetic forecasts would be helpful. Is it solely due to the exclusion of the forecast year?
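For reference, below is a minimal sketch of one possible way of generating such Gaussian synthetic ensembles, with fractional deviations applied to a leave-one-out climatological mean and standard deviation; this is my reading for illustration only, and the authors' actual construction may well differ (e.g. in how the deviations are defined or how non-negativity is handled).

```python
import numpy as np

def synthetic_forecast(climatology, mean_error=0.0, sd_error=0.0,
                       n_members=30, rng=None):
    """Draw a Gaussian synthetic ensemble for one forecast year.

    climatology : historical AMJJ volumes with the forecast year excluded
    mean_error  : fractional deviation applied to the climatological mean
    sd_error    : fractional deviation applied to the climatological standard deviation
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.mean(climatology) * (1.0 + mean_error)
    sigma = np.std(climatology, ddof=1) * (1.0 + sd_error)
    members = rng.normal(mu, sigma, size=n_members)
    return np.clip(members, 0.0, None)  # AMJJ volumes cannot be negative
```

In a construction of this kind, the year-to-year variation of the synthetic forecast comes only from excluding the forecast year from the climatology, which is why a comment on this point in the text would be helpful.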
Section 2.5: Please clarify the RMAD criterion, in particular how errors between true and synthetic forecasts are calculated given that they are ensemble forecasts.
Throughout the manuscript, and more specifically at L521 “higher forecast skill and value were associated with negative errors in standard deviation”: “negative errors in standard deviation” can be misleading. Changes in sharpness are not, in themselves, errors; they only translate into skill in the absence of bias (seen in the skill matrices). Sharpness is not a performance metric. Here, negative errors in standard deviation rather mean that the ensemble is close to a deterministic forecast, which is associated with high forecast skill if, and only if, the forecasts are unbiased. I recommend changing the phrase “error in standard deviation” throughout the paper, and recommend the following paper: Gneiting, T., Balabdaoui, F., Raftery, A.E., 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 243–268.
Figure 12: I don’t understand why the three models appear in Figure 12a if this figure shows the synthetic forecasts. If my understanding is correct, only observations are used to generate the synthetic forecasts. Could you please clarify? Also L611 “the true LSTM and the corresponding synthetic forecast” has me confused.
The use of the term “skill” in the paper is not always consistent. L630 “better captured in categorical measures than skill”: in the way the word “skill” is used in this study, it is not a metric (as is sometimes the case when skill denotes the comparison of a forecast system's performance with that of a benchmark) but rather a term used to qualify forecast performance. Based on this use of the word “skill”, categorical measures could very well be metrics used to define forecast skill. I suggest rephrasing. Also, L654 “forecast skill generally reflects the accuracy of forecasts” can be unclear, as accuracy can be perceived as one feature of forecast skill.
Key references are well used to corroborate or discuss the results and limits of this work. References on asymmetry in decision-making, on synthetic forecasts, and on the need for adequacy between skill and value metrics can be found in the following works; I leave it to the authors to consider their relevance to their work:
Peñuela, A., Hutton, C., Pianosi, F., 2020. Assessing the value of seasonal hydrological forecasts for improving water resource management: insights from a pilot application in the UK. Hydrology and Earth System Sciences 24, 6059–6073. https://doi.org/10.5194/hess-24-6059-2020
Rougé, C., Peñuela, A., Pianosi, F., 2023. Forecast Families: A New Method to Systematically Evaluate the Benefits of Improving the Skill of an Existing Forecast. Journal of Water Resources Planning and Management 149, 04023015. https://doi.org/10.1061/JWRMD5.WRENG-5934
Crochemore, L., Materia, S., Delpiazzo, E. et al. 2024. A framework for joint verification and evaluation of seasonal climate services across socio-economic sectors. Bulletin of the American Meteorological Society. https://doi.org/10.1175/BAMS-D-23-0026.1
Detailed comments
L58-59: Please review the definition of probabilistic seasonal streamflow forecasts, as it does not necessarily concern volumes, the concept of a season can be unclear, and the cited methods are not always combined.
L65: A definition and references for ESP would be necessary. Consider the following work: Day, G., 1985. Extended Streamflow Forecasting Using NWSRFS. J. Water Resour. Plann. Manage. 111, 157–170.
L66 “more accurate”: I am unsure whether this is about accuracy, since ESP relies on climatology. It is rather about using outputs from dynamical meteorological or climate models instead.
L115-116 “particularly during extreme events like droughts”: references would be needed to support this. Consider the following work: Giuliani, M., Crochemore, L., Pechlivanidis, I., Castelletti, A., 2020. From skill to value: isolating the influence of end user behavior on seasonal forecast assessment. Hydrology and Earth System Sciences 24, 5891–5902. https://doi.org/10.5194/hess-24-5891-2020
L124 “forecasts respond to fundamental statistical measures” Consider reformulating.
L128 and elsewhere: the word “evaluate” may be ambiguous in a paper about forecast value if it is used for both skill and value. “assess” could be a more neutral option.
L128: Section 2.1.2 is rather about defining drought
L135 “fundamental performance metrics” as above, the choice of the adjective “fundamental” is not clear to me. I would suggest reformulating or clarifying.
L196-197 “rate of occurrence”: I suggest introducing the symbol s here.
L224-225: This sentence is key for understanding these 3 parameters. I would suggest placing it earlier in the section.
L234-239: How are negative values accounted for when calculating the area? Is it possible to have negative PEVmax values?
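For concreteness, the sketch below shows two plausible treatments of any negative values when integrating the PEV curve over the cost-loss ratio alpha (clipping them to zero, or letting them offset positive values); which treatment is used, and whether negative values occur at all, is exactly what I would like to see clarified. The function name and arguments are mine, not the manuscript's.

```python
import numpy as np

def apev(alphas, pev_max, clip_negative=True):
    """Area under the PEVmax curve over cost-loss ratios alpha = C/L.

    alphas   : increasing cost-loss ratios in (0, 1)
    pev_max  : PEVmax at each alpha (possibly negative, which is the point in question)
    clip_negative : if True, negative values are set to zero before integrating,
                    so alpha ranges with no value cannot offset valuable ones
    """
    a = np.asarray(alphas, dtype=float)
    v = np.asarray(pev_max, dtype=float)
    if clip_negative:
        v = np.clip(v, 0.0, None)
    # Trapezoidal integration over alpha.
    return float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(a)))
```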
L239: While 0 is the theoretical minimum, 0.9 seems to be the observed maximum. If that is correct, I suggest clarifying this by stating the theoretical maximum (infinity?) before this observed maximum.
L247 “snow-dominated basins (i.e., unmanaged headwater systems)”: the correspondence between the two basin types is not direct. Some snow-dominated basins in areas with altitude gradients can be heavily managed/influenced by hydropower dams. Please clarify.
L269: Give the full name for SWE as this is the first occurrence.
L269: “water-year-to-date” may be worth explaining in its first occurrence as well.
L270: The reference to Table A1 is not clear to me.
L283 “WY2006-2022”: this notation used throughout the manuscript should be explained here.
L290: I suggest citing the number of ensemble members used in practice here
L315 “snowpack information in the form of snow water equivalent”: Based on Figure 5, it seems only to be the case for the LSTM. Is this information also used in the case of WRFH?
L315: The reference to Table A1 does not seem correct.
L329: the probabilities extracted from the forecast ensembles can only be comparable if the ensembles have the same number of members. Here it seems to be the case (Section 2.4.1), but I suggest mentioning this here in Section 2.4 already for clarification.
L350 “~20-30 years”: In Figure 6, 23 years are mentioned, and at L387 and L419 you mention the period 1983-2022 (40 years minus the forecast year). Please clarify.
L377 Here as well, the reference to Table A1 does not seem to match its content.
L389-390: I suggest the term “initial states” instead of “memory states”
Sections 2.4.2 and 2.4.3: it would be helpful to state in these sections the years used for model training/calibration (now in the Appendices), in addition to the years of the historical meteorology inputs and the years for which the forecasts are generated (already clear).
Section 2.5: It is generally advised to use different metrics for calibration/training and verification/validation. It is not clear here which metrics are used for which purpose.
L473: If my understanding is correct, there is a single AMJJ value per year for the period 2006-2022 (17 values). How many years remain once only dry years are selected? Can it really ensure robust results for the rest of the study?
L481 “As errors in mean or standard deviation increase beyond these ranges, forecast skill worsens”: This is arguable. Here the standard deviation reflects the sharpness of the probabilistic forecast. However, sharpness is a forecast characteristic rather than a performance metric.
L483: In the text, values seem to reach 0.9, but in Figure 8, the color scale ends at 0.5.
L496-498: This asymmetry is indeed interesting, and would benefit from some further discussion as to which parameters in the methodology cause this effect (AMJJ variable bounded by 0, tau value, alpha-s relationship and frequency of occurrence below 0.5 for droughts, …).
L532 “Each dot in Fig. 10 represents a basin with colors showing the median skill and value”: is it just skill?
L580: Given that only a type of forecast skill is investigated here, I suggest the following “This skill-value comparison between synthetic and true forecast systems indicates that factors beyond forecast skill, as defined in this study…”
L653: References would be helpful at the end of this sentence.
Figures
Figure 1: Equation numbers preceded by minus signs can be confusing. In the caption, “forecast probabilities are calculated from probabilistic forecasts” is redundant. Do you mean “threshold exceedance probabilities are calculated from probabilistic forecasts”?
Figure 2: The arrows used to indicate the cases when C<0 and C>L point to ranges where C>0 and C<L.
Figure 4: This figure is helpful. It may be worth mentioning in sub-figure (a) that the AMJJ volume for the forecast year is excluded, if that is the case.
Figure 5: “Historical meteorology” and “Basin attributes” are rather vague. I suggest specifying these to better highlight the differences between the three types of modelling/forecasting chains.
Figure 7: The caption should explain the difference between shaded areas (distributions over the 76 basins) and the vertical lines.
Figure 8e: It is not clear why there is a miss based on the time series plot.
Figure 9: “Synthetic errors”: if they are calculated from the true forecasts, these are no longer synthetic errors.
Typos
L63: Give the full name for NRCS
L122 “performance of true forecasts against observations generated in this study” can be unclear as to what is generated in this study.
L123: “models”
L144: “when the AMJJ streamflow volume falls below”
L205 “where the value of”
L232: “REV” instead of PEV.
L260 “with one or fewer”: I am not sure what this means; maybe it is correct.
L320 “statistical forecasts (…) operational forecasts”
L452-454: “during WY2001-2010” appears twice in this sentence.
L456: “for both models”
L458 “These results suggest that the LSTM models, particularly LSTM”
Figure A3 is titled Figure A2 and referenced as Figure A3 in the text.
L473 “only for the drought years only”
L493: “the higher number of false alarms reduces”
L529 and L571 “the three true forecasts”