Flexible forecast value metric suitable for a wide range of decisions: application using probabilistic subseasonal streamflow forecasts
Richard Laugesen
Mark Thyer
David McInerney
Dmitri Kavetski
Download
- Final revised paper (published on 23 Feb 2023)
- Supplement to the final revised paper
- Preprint (discussion started on 21 Mar 2022)
- Supplement to the preprint
Interactive discussion
Status: closed
- RC1: 'Comment on hess-2022-65', Anonymous Referee #1, 27 May 2022
Summary
The manuscript by Laugesen et al. introduces a new metric to assess forecast value by adapting the formulation of a previously existing metric, namely Relative Economic Value, within a flexible value assessment framework based on utility. The method is then exemplified with subseasonal forecasts in the case of the Murray River, Australia, where decisions tend to target high flow values. A sensitivity analysis is carried out in this case study.
The paper, which proposes a new methodology and results of high significance for the forecasting community, is detailed, very didactic and of high quality, and will undeniably be valuable to researchers who wish to carry out advanced and flexible forecast value analyses, involving decision-makers’ levels of risk aversion.
I strongly recommend this paper for publication and list hereafter recommendations for clarification, as well as some minor points and typos.
Comments
L18-21: These two sentences seem a bit contradictory because you first announce value for all lead times, decision types and most levels of risk aversion, but then you nuance your statement beyond the second week, for binary decisions. I suggest nuancing the first statement. In addition, since the Murray-Darling basin case is an example application for the sensitivity analysis rather than a stand-alone evaluation, I would consider these results secondary compared to the advantages of the proposed RUV metric and the results of the sensitivity analysis well described in Section 6.2, which in themselves deserve to be highlighted in the abstract.
L20: “Beyond the second week”: please mention that you are referring to the lead time.
L26 (and throughout the paper): Here the authors refer to the lens of “consumer” impact. The terms “user” and “decision-maker” are also used throughout the paper. Given that there are differences between these terms, I wonder whether the authors could clarify whether they use these three terms interchangeably, or whether they make a distinction. In the former case, are they actually interchangeable? Forecast datasets are increasingly open, and I am not sure whether users are indeed consumers in these cases. In the latter case, could you make the distinction explicit in an evaluation context?
L73-78: Based on these two examples, and purely intuitively, I would tend to consider both types of decision-makers to be risk averse (conservative approach to avoid spending in example 1 and flooding in example 2) but with a different sensitivity to forecast uncertainty. Could the authors elaborate on why they make a direct link between forecast uncertainty and risk aversion?
L95: Maybe reformulate “lead to improved forecast verification”. For instance: “lead to improved forecast verification indicators” or “improved forecast performance”.
L98: “first convert them”
L125: Isn’t it a 2x2 contingency matrix?
L133: The term “outcome” was unclear to me here. I was unsure whether it referred to each combination of possible Action/Event in Table 1. In my understanding, E depends on each information source (reference, forecast, or perfect) but uses all possible outcomes in its weighted mean. The term “outcome” was a bit confusing, while Equation 1 and L138 were perfectly clear. Since the Supplement helped in that matter, I would suggest referring to it here already.
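For instance, a one-line reminder along these lines (my notation, with k indexing the four Action/Event combinations of Table 1) would already help:

$$E = \sum_{k} p_k \, e_k,$$

where $p_k$ is the probability of combination $k$ under the chosen information source and $e_k$ the corresponding expense.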
Equation 1: Could you please add the range within which V should fall (-∞ to 1)?
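That is, writing $E_{\mathrm{ref}}$, $E_{\mathrm{fc}}$ and $E_{\mathrm{perf}}$ for the expected expenses under the reference, the forecast and perfect information (my notation, which may differ from yours):

$$V = \frac{E_{\mathrm{ref}} - E_{\mathrm{fc}}}{E_{\mathrm{ref}} - E_{\mathrm{perf}}}, \qquad V \in (-\infty,\, 1],$$

so that $V = 1$ corresponds to the perfect forecast and $V < 0$ means the reference is more useful than the forecast.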
Equation 2: At this stage o is not defined.
Figure 1: The location of the phrase “Use reference to decide” is, I think, misleading. Based on the explanations (L162-164), it seems that for a cost-loss ratio of 0.5, for instance, the forecast outperforms climatology and should thus be used to decide, with a potential REV reaching about 0.8. Therefore, using the reference for a cost-loss ratio of 0.5 would not allow reaching a REV greater than that of the forecast. However, based on the figure, it seems that it would. Perhaps the arrows are meant to point at the extreme intervals where the reference does perform better, but this is currently not clear.
Additionally, it is not clear whether the arrows linked to “Always act” and “Never act” point at the intervals where climatology < forecast or at the specific points (0;0) and (1;0) (see also the following comment).
Figure 1 (and all value diagrams): If I understand correctly the meaning of α=1 (never worth acting) and α=0 (always worth acting), the decision can be taken regardless of whether the forecast or climatological information is considered. This would mean that the relative economic value should be exactly equal to 0 in both cases (α=1 and α=0). If that is correct, and no other parameter comes into the decision of acting or not, is there a reason why the two points (0;0) and (1;0) are not represented in the value diagram?
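To make the endpoint question concrete, here is a minimal sketch assuming the standard cost-loss closed form for REV (an assumption on my side; the authors' implementation may differ), with hypothetical values of the hit rate H, false-alarm rate F and base rate s:

```python
import numpy as np

# Expected expenses (in units of the loss L) under the standard
# cost-loss model: alpha = C/L, s = climatological base rate.
def rev(alpha, H=0.8, F=0.1, s=0.3):
    e_ref = np.minimum(alpha, s)          # climatology: always act or never act
    e_perfect = s * alpha                 # act only when the event occurs
    e_forecast = alpha * (H * s + F * (1 - s)) + s * (1 - H)
    return (e_ref - e_forecast) / (e_ref - e_perfect)

alpha = np.linspace(0.01, 0.99, 99)       # endpoints deliberately excluded
v = rev(alpha)

# At alpha = 0 and alpha = 1 the denominator e_ref - e_perfect vanishes
# (min(0, s) - 0 = 0 and min(1, s) - s = 0), so V is undefined there.
```

Under this closed form the endpoints come out degenerate (0/0) rather than zero-valued, which may be why they are not plotted; if the authors' formulation behaves the same way, stating this in the caption would remove the ambiguity.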
Equation 4 (and Equation 9): Probabilities being sometimes used with powers, I would suggest placing the index m as a subscript rather than a superscript.
L207-217: I suggest adding an example graph of μ to illustrate your explanation. For instance, I find it hard to picture the concavity of μ, especially in the case of binary decisions.
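For example, a minimal plotting sketch of one common concave choice, an exponential (CARA) utility — an assumption for illustration only, not necessarily the functional form used in the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

# Exponential (CARA) utility: concave for risk aversion A > 0,
# approaching the risk-neutral (linear) case as A -> 0.
c = np.linspace(0.0, 1.0, 200)            # hypothetical economic outcome
for A in [0.5, 2.0, 5.0]:                 # increasing risk aversion
    plt.plot(c, (1.0 - np.exp(-A * c)) / A, label=f"A = {A}")
plt.xlabel("outcome")
plt.ylabel("utility")
plt.legend()
plt.show()
```

A figure along these lines would make the concavity, and how it strengthens with risk aversion, immediately visible even for binary decisions.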
L230: “the absolute value of a specific decision”
Equation 6: Here it is not clear to me why damage does not vary with time (in Appendix A it seems it does). It is also not clear why m, on which E, b and d all depend, appears in parentheses in the case of b and d, but as a subscript in the case of E.
L237 “The damage function relates the streamflow magnitude to the economic damages”: At this stage, you have not mentioned streamflow yet, I would suggest sticking to the term “states of the world”.
L309: The previous section also comprised elements of methodology. Consider changing the name of this section.
Section 4.2: Could you briefly state why you chose this station and basin? To which extent do you expect your results (sensitivities) to differ in a catchment with different hydrometeorological characteristics?
L340-341: Given that you mention a rainfall post-processing step, I would recommend stating “raw streamflow forecasts” (L340) and “the streamflow observations” (L341) to avoid any misunderstanding.
Section 4.3: GR4J also uses temperature or potential evapotranspiration as input. Could you say something about what you used?
L345-346 “flow exceeding the height of a levee”: it would be more intuitive to talk about the “water level exceeding the height of a levee”
L374: “all decision-makers share the same level of risk.”
Table 3: (1) “Experiment 4: Impact of risk aversion on forecast value”; (2) In experiment 5, the decision threshold says “All flow” but the decision type is “Binary”, which is counter-intuitive. “All possible thresholds” might be easier to understand, or “Thresholds from bottom 2% to top 0.04%”.
Figure 4: Here you consider two rather extreme yet probably realistic thresholds for converting the probabilistic forecast into a deterministic one. When reading the results, I was wondering whether moderate thresholds could alleviate the lack of forecast value for high and low cost-loss ratios and provide reasonable value for all cost-loss ratios. Could you answer this by displaying intermediate probability thresholds in this experiment?
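For illustration, this is the kind of sweep I have in mind (a sketch with a hypothetical ensemble; all names and numbers are mine, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
ens = rng.gamma(2.0, 50.0, size=(100, 365))   # hypothetical ensemble: members x days
flow_threshold = 150.0                        # hypothetical decision threshold

# Forecast probability of exceeding the flow threshold at each time step
prob_exceed = (ens > flow_threshold).mean(axis=0)

# Sweep intermediate critical probability thresholds, not only the extremes,
# and derive the deterministic action series for each
for p_crit in (0.1, 0.25, 0.5, 0.75, 0.9):
    act = prob_exceed >= p_crit
    print(f"p_crit = {p_crit}: action taken on {act.mean():.0%} of days")
```

Recomputing the value for each of these intermediate action series would show whether moderate thresholds recover reasonable value across the full range of cost-loss ratios.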
Figure 6: To ease the reading of this figure, whose lines are all solid and drawn in colours of similar intensity, I suggest adding dashed and dotted lines to distinguish the three curves.
Figure 6: Could the authors explain the interesting difference in RUV pattern for the multi-categorical decision (also seen in other decision types) between lead week 2 and lead weeks 3 and 4? Why does value decrease with lead time for low cost-loss ratios (as expected) but increase with lead time (maybe less obviously) for high cost-loss ratios?
Experiment 3: In this experiment, authors look at the variation of value with the lead week. It is also common to look at the influence of the initialization month or season to appreciate the influence of different hydrological conditions on the value. Even though it would mean dividing the total forecast sample into subgroups and reducing significance, I think it could be a valuable addition to Figure 6 to show forecasts initialized in dry and wet conditions separately.
Figure 7: To ease reading, consider adding a horizontal line at y=0 in graphs displaying the overspend.
Experiment 4: It is currently unclear why the third line of Figure 7 is shown, as it is little if at all exploited in the interpretation. Please consider removing it or spending some sentences exploiting this line of the figure.
L520: “making decisions with fixed critical probability thresholds leads to”
Sections 6.1, 6.2 and 6.3: Numbering the paragraphs is unnecessary.
L576: “summarizes”/“summarises”
L680 and Table 5: In the text, you mention that the formulation of Ct depends on the value of p, but in Table 5, the formulation of Ct depends on whether the action is taken or not rather than on p. I could not figure out why. Are the p values you are referring to in both instances different? Could you please clarify this point?
L701: The link to the companion dataset is missing.
- AC1: 'Reply on RC1', Richard Laugesen, 30 Aug 2022
- RC2: 'Comment on hess-2022-65', Anonymous Referee #2, 13 Jul 2022
The paper presents a generalisation of the Relative Economic Value (REV) approach, providing a flexible metric, the “Relative Utility Value” (RUV), which can inform decision-makers about the value of probabilistic subseasonal forecasts. The results show its application and sensitivity to several factors in a case study in Australia.
The paper is well written and demonstrated. I believe it brings novel aspects in the topic of hydrometeorological forecasting, and is an excellent demonstration of how forecast producers and users should work together to enhance the usefulness of skilful forecasts.
I have just some minor general and specific comments, presented below.
General comments:
I think some sentences need to be more carefully revised because they might convey a message that goes beyond the experiments of this paper. For instance, concerning the first sentence of the Conclusions section, I do not believe that, overall, the value of probabilistic forecasts for making (good) decisions has not been established, as the authors say. Many public and private companies are convinced of the value of quantifying uncertainties in real-time forecasting, and that is why this type of forecast has been increasingly produced and used for many operations, from nowcasting to short-term flood forecasting and long-term inflows to reservoirs. Value has not been established (or explicitly calculated) for all lead times and user cases, I agree, but, overall, the forecasting (producer and user) community acknowledges that there is value for decision-making in not being certain (or deterministic) about the unknown future. The added value of the paper, in my opinion, does not lie in bringing “value” into the discussion of forecast verification/evaluation, as this has been done in several papers previously, but in making the framework for assessing it more accessible and flexible, as the title says.
I was also puzzled when the authors say that a decision-maker who is highly exposed to damages should use the reference climatology rather than a forecast based on meteorological numerical models for binary decisions (Conclusions, lines 639-640). This might be the case for the experiment shown (and the case described in the paper), but I doubt flood forecasters (forecasting a threshold exceedance for the next 12-24 hours, for instance) would be able to tell the population they are serving that they will abandon a city located close to a river and leave them with only climatology-based information, instead of investing in a (good) model-based forecasting and alert system, because they are highly exposed to damages. I fully understand that if the potential costs of a flood event are high, and will be incurred if the flood occurs whatever forecast we might deliver, then no forecasting system can save us, and it is better to work on protection (decreasing costs) first. But even in this case, using climatology might not be beneficial either (the problem is elsewhere, not in the type of forecast being used). What I mean is that, outside a more explicitly presented context, some sentences might rather divert a reader from the purposes of the paper. Therefore, I would recommend revising some general affirmative sentences, or at least adding more context to them to avoid misunderstandings.
Another general comment concerns the fact that the paper sets its context on probabilistic subseasonal forecasts (up to 30 days), but much of the demonstration and experiments refer to 1-7 day lead-time forecasts (and many concluding sentences seem to forget this context and generalise to any type of forecast and lead time). In many situations (though I am not sure about the particular catchment of this study), a meteorological (model-based) forecast may show good quality a couple of days ahead (1 to 5 days, for instance) and then be only as skilful as climatology afterwards. How might this difference in forecast quality affect the results here? Is it justified to group these lead times together? Would a (potential) difference in quality explain the negative RUV (lines 412-414), where the authors say that climatology (as a forecast) is more useful than a (meteorological model-based) forecast? (Note: in the end, the decision-maker is always using a forecast, either from a record of historic observations – climatology – or from a coupled atmospheric-hydrologic model.)
Finally, a last overall comment I have is: why is a systematic comparison with REV so important in the development of a novel approach or metric in this topic? Is it because REV is widely used (or supposedly widely used)? How crucial is it as motivation for the study?
Specific comments:
- Introduction: I think the authors could introduce some literature on work done on forecast value, and on the links between forecast quality and value, with respect to inflows to hydropower reservoirs. These works cover a large range of cases and lead times, and also use optimisation-based economic models to link forecast production (quality) to usefulness (economic value). It would be interesting to give this broader view of the topic here, I think, and then better situate the context of the paper (to which the conclusions drawn will specifically apply). Besides the paper mentioned in the discussion (Penuela et al.), some others that might be interesting are: https://doi.org/10.1002/2015WR017864; https://hess.copernicus.org/articles/23/2735/2019/; https://doi.org/10.1029/2019WR025280; https://hess.copernicus.org/articles/25/1033/2021/.
- Line 49: too many “and” words. Please, check.
- Line 50: “better verification implies more value”: I think you refer to “quality” and not “verification”. Please, check.
- Line 88-89: not clear to me. Please, check.
- Line 90, 102: when you refer to “the authors” I am sometimes a bit confused as to whether you mean yourselves or the authors of Matte et al. Please, check.
- Line 192-193: maybe it is not reported in scientific papers, but are you sure it is not commonly used by water managers in practice? Have you conducted a survey or any other study not reported here to assess it (i.e., real-world practices)?
- Line 227-230: again too many “and” words. I found the sentence unclear. Please, check (maybe also correct to “a specific decision”).
- Line 280: I am not fully convinced that information on the amount spent, damages etc. at each time step is something valuable to a user. Is that so? Can you provide examples or a justification for that? I believe that users might be more interested in the long-term performance of a forecast system (in particular when it comes to reservoir operations), while a flood alert user would be interested in performance over the whole flood event duration (and less in each time step). Maybe I misunderstood something here.
- Line 309: I do not think “Methodology” is a good title for the section. I would suggest “Application” or “Experiment”.
- Line 310-311: I guess that by “different decision-makers” you mean “different levels of risk aversion of decision-makers”. I think it is not the persons themselves you are talking about but the theoretical level of risk aversion that you are modifying in the experiments.
- Section 4.1: I think part of it could go to the Introduction.
- Line 339: maybe placing the references in the right place would help the reader (e.g., Perrin et al. after GR4J, and not after RRP-S).
- Line 343: “seamless” has usually another meaning in the literature. It usually refers to a system that forecasts in a coherent and homogeneous way from minutes to hours and months. It is not usually related to performance across scales. Please, check.
- Section 4.4: I think part of it could go to the Introduction (lines 346-354).
- Line 369: what do you mean by “suitable”? How? Based on data?
- Table 3, experiment 4: check typo
- Fig. 4: I am not sure it is needed to show that we come up to the same results. I would suggest putting Experiment 1 and Experiment 2 together.
- Line 437: what do you mean by “ensemble sampling error”? Please, explain.
- Line 458: please, clarify the sentence (see my general comments above) in terms of saying that a “decision-maker should avoid using forecasts” in certain conditions.
- Line 464-466: Does this correspond to reality? Have you discussed the results with the Murray-Darling Basin managers, for instance? It would be interesting to link the mathematical calculations to reality in the field, providing support for some of the statements on the results and the overall conclusions drawn in the paper.
- Fig. 7: I think it should be more commented. The differences we see in the column on the right do not seem to be “moderate”.
- Experiment 5: could you justify the choice of adopting a binary decision and alpha = 0.2 here? Also, why are you showing week 1 if the focus of the paper is on longer-term forecasts?
- Line 510-511: is this a general conclusion? Over any lead time and situation? Not all probabilistic streamflow forecasts are skilful and reliable. Do you mean for the case study of the paper? Please, clarify.
- Lines 513 and 514: I suggest using “developed” and “can be applied”.
- Line 520: Please consider deleting “is”.
- Line 553: I do not understand what you mean by “a single forecast user” (single forecast or single user)? Please, clarify. Also “they” here refers to whom? The users?
- Line 569: by “mitigation” do you mean “real time mitigation of damages”? Sometimes mitigation is more related to “prevention” (out of real time) for some users. Please, clarify.
- Section 6.3: I suggest using “could” instead of “will” when talking about possible future pathways for further research/future works.
- Overall: please check the use (or the absence) of a comma before the word “which”.
- Figures/tables: overall, please check the use of colours in black and white printing (maybe use italics in Table 3 instead of red, for instance; use dotted lines instead of colours in other figures, etc.)
- AC2: 'Reply on RC2', Richard Laugesen, 30 Aug 2022