Evaluating the impact of post-processing medium-range ensemble streamflow forecasts from the European Flood Awareness System
- 1Department of Meteorology, University of Reading, Reading, United Kingdom
- 2European Centre for Medium-range Weather Forecasts, Reading, United Kingdom
- 3Department of Geography and Environmental Science, University of Reading, Reading, United Kingdom
- 4Department of Earth Sciences, Uppsala University, Uppsala, Sweden
- 5Centre of Natural Hazards and Disaster Science, CNDS, Uppsala, Sweden
- 6Department of Mathematics and Statistics, University of Reading, Reading, United Kingdom
- 7Department of Geography, University of Loughborough, Loughborough, United Kingdom
- 8UK Centre for Ecology and Hydrology, Wallingford, United Kingdom
Abstract. Streamflow forecasts provide vital information to aid emergency response preparedness and disaster risk reduction. Medium-range forecasts are created by forcing a hydrological model with output from numerical weather prediction systems. Uncertainties are unavoidably introduced throughout the system and can reduce the skill of the streamflow forecasts. Post-processing is a method used to quantify and reduce the overall uncertainties in order to improve the usefulness of the forecasts. The post-processing method that is used within the operational European Flood Awareness System is based on the Model Conditional Processor and the Ensemble Model Output Statistics method. Using 2 years of reforecasts with daily timesteps, this method is evaluated for 522 stations across Europe. Post-processing was found to increase the skill of the forecasts at the majority of stations, both in terms of the accuracy of the forecast median and the reliability of the forecast probability distribution. This improvement is seen at all lead-times (up to 15 days) but is largest at short lead-times. The greatest improvement was seen in low-lying, large catchments with long response times, whereas for catchments at high elevation and with very short response times the forecasts often failed to capture the magnitude of peak flows. Additionally, the quality and length of the observational time-series used in the offline calibration of the method were found to be important. This evaluation of the post-processing method, and specifically the new information provided on characteristics that affect the performance of the method, will aid end-users in making more informed decisions. It also highlights the potential issues that may be encountered when developing new post-processing methods.
Gwyneth Matthews et al.
Status: closed
-
RC1: 'Comment on hess-2021-539', Anonymous Referee #1, 17 Dec 2021
General comments:
The paper is very well written and very detailed. The topic of the study is certainly of great interest to the forecasting community. The combination of a hydrological uncertainty processor (MCP) and EMOS with the help of a Kalman filter approach is quite novel and attractive. The discussion of the results is, however, a bit too long and could be summarized more concisely, focusing on flood aspects. Therefore, I suggest that it is worth publishing after some minor revision.
There are many EFAS papers available now, and all describe the EFAS system. This could be shortened so that only the differences from the operational settings of EFAS are explained, in particular how the reforecasts are used. The biggest challenge in using reforecasts is the reduced number of ensemble members (11 instead of 51). This small number of members causes difficulties in computing the CRPS for a fair comparison with a CRPS derived from the PDFs of the post-processed forecasts (see, for example, Zamo and Naveau, 2018). However, this problem is not mentioned, and the presented results of the CRPSS should be treated with care.
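The small-ensemble CRPS issue the referee raises can be illustrated with a short sketch. This is not the manuscript's code; the two estimators below follow the standard empirical-CDF form and the "fair" correction discussed by Zamo and Naveau (2018):

```python
import numpy as np

def crps_ecdf(members, obs):
    """Standard ensemble CRPS estimator (empirical-CDF form)."""
    m = len(members)
    spread = np.abs(members[:, None] - members[None, :]).sum()
    return np.mean(np.abs(members - obs)) - spread / (2 * m * m)

def crps_fair(members, obs):
    """Fair CRPS: an unbiased estimate of the CRPS of the distribution the
    m members were drawn from; the m - 1 in the denominator is the fix."""
    m = len(members)
    spread = np.abs(members[:, None] - members[None, :]).sum()
    return np.mean(np.abs(members - obs)) - spread / (2 * m * (m - 1))

# With only 11 members the two estimators differ noticeably, which is the
# referee's point about comparing a raw-ensemble CRPS against a PDF-based CRPS.
members = np.sort(np.random.default_rng(0).normal(size=11))
obs = 0.3
print(crps_ecdf(members, obs), crps_fair(members, obs))
```

The fair estimator is always smaller than the empirical-CDF estimator for a spread ensemble, so a raw 11-member ensemble scored with the uncorrected estimator is penalised relative to a continuous post-processed PDF.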
A recent paper by Skøien et al. (2021) has a similar topic, evaluating post-processing methods for EFAS (EMOS and the application of transformations like the NQT). Therefore, the differences and novelty of this study should be stressed clearly and discussed in more detail. One difference, apart from using the reforecasts and the MCP, is that the EMOS correction parameters are lead-time invariant. This could be stressed more clearly right from the beginning. For me, this lead-time invariance is a big drawback of the proposed method and rather problematic for deriving the total predictive uncertainty, which forces the Kalman filter sometimes to give more weight to the hydrological uncertainties.
Although the analysis of the different aspects, such as catchment size, elevation, regulation, and length of period, is very interesting, it could perhaps be shortened to focus on floods, which is the main topic of EFAS. Also, the detailed analysis of the Kling-Gupta efficiency (KGE) is perhaps too long, since the results of the post-processing methods are probabilistic and the KGE reduces the information content to the mean (median) of the ensembles (or PDF). Therefore, I would suggest that the emphasis of the verification should be the CRPS.
Specific comments
Page 5: Highlighting the differences between the operational setting of EFAS and the setting used in this study should be sufficient. More details can be found in many other papers. However, the calibration period of the LISFLOOD model is missing. Is there an overlap between the period for calibrating the parameters of the hydrological model and the historical period p for the off-line calibration?
In Figure 1, the index of the parameters µ and Σ is ψ, whereas in the caption you use the index Φ.
You mention several times (e.g. line 172, page 7) that the minimum for the off-line calibration is 2 years. However, for the fitting of the GPD you use 1000 values (page 9). So you will need more than 2 years?
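The referee's arithmetic can be made explicit with a back-of-envelope sketch; the 1000-value and 2-year figures are taken from the review itself, and daily values are assumed:

```python
# Figures quoted in the review: a 2-year minimum calibration period (line 172)
# versus 1000 values used to fit the GPD (page 9).
gpd_sample_size = 1000
daily_values_in_two_years = 2 * 365
print(daily_values_in_two_years)  # 730 daily values, fewer than the 1000 required
```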
I would suggest including a list of nomenclature to avoid repetitive descriptions, such as the tilde for the physical space (lines 182, 204, 220, 340) and the timestep notation introduced on page 7 (lines 176-177).
On page 9, line 235, you write that the location parameter a is used for defining the breakpoint, but in Figure 2 the shape parameter c is used as breakpoint.
On page 9, line 251, you write about consistency between the two distributions. What does this mean? How do you check it?
On page 10, line 257, the concentrated likelihood method is mentioned without any further explanation of what this method does. Some more details would be helpful. Also, it is not clear to me how the GPD is weighted (lines 261-262).
The description of the linear approximation (page 11, lines 270-280), on the other hand, is perhaps not necessary.
On page 13, it is not clear to me why you have observations (line 336) and the water balance (line 337) for the period until t, but the forecasts (line 339) only until t-1?
Line 349: …using a MCP method …
Page 14, line 372: in the recent period …
Page 16, line 411: you write that a set of forecasts is used to estimate the two spread correction parameters. How did you choose the size of these sets of forecasts?
In Figure 3 you write CRPS beside the legend bar, but it should be CRPSS?
On page 19, line 483, you write that only 11 reforecasts are available (I suppose this 11 comes from 40 days / 7 days × 2). The number 11 could be misleading, since it happens to coincide with the number of ensemble members mentioned in the next sentence (line 485). Why do you fix it to 40 days? Since there is a discrepancy between the operational setting and this analysis anyhow, you could set q to a longer period to include more reforecasts (e.g. q = 70, ~20 reforecasts).
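The counting behind the referee's suggestion can be sketched in a few lines; twice-weekly reforecast start dates are assumed, as the review supposes:

```python
# Approximate number of reforecast start dates in a trailing window,
# assuming two reforecast initialisations per week.
def reforecasts_in_window(window_days, starts_per_week=2):
    return round(window_days / 7 * starts_per_week)

print(reforecasts_in_window(40))  # roughly 11, matching the number quoted on line 483
print(reforecasts_in_window(70))  # roughly 20, the referee's suggested longer window
```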
In line 486, I don't understand why the mean discharge value is predicted for the previous 6 hours.
The difference between the raw and the post-processed forecasts in Fig. 11a (mentioned on page 35, line 862) is very difficult to see and almost not visible.
Line 880: …uncertainties show a small increase
On page 41 the paragraph from line 1004 – 1010 can be removed
Line 1021 ..greater than..
I have some doubts about your suggestion that very short periods are sufficient (line 1045): the chance that such a short period will capture the variability of the discharge needed for applying the NQT is rather small, and fitting the GPD is almost impossible. Consequently, the back-transformation of the variables from Normal space will always produce poor and very unreliable results for floods.
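The referee's doubt about fitting a GPD from a short record can be made quantitative with a small Monte Carlo sketch. The data are synthetic, not the manuscript's, and the true shape parameter of 0.2 is an arbitrary choice:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)

def shape_estimates(n_exceedances, n_trials=200, true_shape=0.2):
    """Repeatedly fit a GPD to samples of a given size and collect the
    maximum-likelihood estimates of the shape parameter."""
    estimates = []
    for _ in range(n_trials):
        sample = genpareto.rvs(true_shape, size=n_exceedances, random_state=rng)
        c_hat, _, _ = genpareto.fit(sample, floc=0.0)  # location fixed at zero
        estimates.append(c_hat)
    return np.array(estimates)

few = shape_estimates(30)     # roughly what a 2-year record yields above a high threshold
many = shape_estimates(1000)  # the sample size quoted in the manuscript
print(few.std(), many.std())  # small-sample shape estimates scatter far more
```

The spread of the shape estimates shrinks roughly with the square root of the sample size, so tail quantiles back-transformed from a short-record fit inherit a large and hard-to-diagnose uncertainty.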
The citation of Coccia (line 1135) is incomplete. Also, the term "Multi-Temporal" in combination with the MCP (MT-MCP) is mentioned only in lines 153-154 and in the conclusions (line 1054), but is not explained.
Zamo, M., Naveau, P. Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts. Math Geosci 50, 209–234 (2018). https://doi.org/10.1007/s11004-017-9709-7
Skøien, J. O., Bogner, K., Salamon, P., & Wetterhall, F. (2021). On the Implementation of Postprocessing of Runoff Forecast Ensembles. Journal of Hydrometeorology, 22(10), 2731-2749.
- AC1: 'Reply on RC1', Gwyneth Matthews, 24 Feb 2022
-
RC2: 'Comment on hess-2021-539', Anonymous Referee #2, 19 Jan 2022
This paper describes a novel post-processing method applied to the EFAS forecasts, assesses the improvements to forecasts realised by post-processing for a large number of catchments, and investigates factors that influence the performance of the post-processor. The paper is very well structured and written, and the topic is of considerable interest to forecasting researchers and practitioners using the EFAS forecasts.
The paper is comprehensive, covering the post-processing method itself, the improvements across the EFAS domain, and the factors influencing the forecast performance, which necessitates a lengthy manuscript. All aspects presented are of interest; however, I do wonder whether the paper could be separated into two more focussed manuscripts, perhaps one focussing on the novel aspects of the post-processing method and validating its assumptions, and a second on evaluating the benefits and investigating the factors that influence its performance.
More specific comments:
The sample covariance matrix is used to characterise the joint distribution of the historic observations and water balance simulations (equation 7). There are potential issues that may be encountered using this approach, and it would be good to understand whether special treatments have been needed to overcome them. Specific issues that come to mind include: (i) The covariance matrix is computed over a set of historic observations and is likely to have inflated, or spurious, correlations over long lags if the seasonal cycle of streamflow is not considered. These inflated/spurious correlations are likely to lead to inflated variances of conditional predictions. (ii) The authors indicate that there are missing (and possibly zero-valued) observations used in the estimation of the covariance matrices. For large sample covariance matrices such as those estimated in this study, missing observations can lead to covariance matrices that are not positive definite. Have any such issues been identified, and has any special treatment been implemented to deal with them?
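Point (ii) can be illustrated with a minimal sketch on synthetic data (not the study's): pairwise-complete estimation with missing values can yield a covariance matrix that is not positive semi-definite, and one common repair is to clip the negative eigenvalues:

```python
import numpy as np

def pairwise_cov(data):
    """Covariance from pairwise-complete observations (NaN = missing).
    Each entry uses only the rows where both columns are observed, so the
    resulting matrix need not be positive semi-definite."""
    n = data.shape[1]
    cov = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            mask = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            xi, xj = data[mask, i], data[mask, j]
            cov[i, j] = np.mean((xi - xi.mean()) * (xj - xj.mean()))
    return cov

def clip_to_psd(cov):
    """One common repair: set negative eigenvalues to zero."""
    w, v = np.linalg.eigh(cov)
    return (v * np.clip(w, 0.0, None)) @ v.T

nan = np.nan
# Three variables observed in disjoint pairs, engineered so the pairwise
# correlations (x~y: +1, y~z: +1, x~z: -1) are mutually inconsistent.
data = np.array([
    [1, 1, nan], [2, 2, nan], [-1, -1, nan], [-2, -2, nan],   # x and y observed
    [nan, 1, 1], [nan, 2, 2], [nan, -1, -1], [nan, -2, -2],   # y and z observed
    [1, nan, -1], [2, nan, -2], [-1, nan, 1], [-2, nan, 2],   # x and z observed
], dtype=float)

cov = pairwise_cov(data)
print(np.linalg.eigvalsh(cov).min())               # negative: not positive semi-definite
print(np.linalg.eigvalsh(clip_to_psd(cov)).min())  # non-negative after the repair
```

The helper names here are hypothetical; whether EFAS needs such a repair depends on how the manuscript actually handles the missing values, which is exactly what the referee is asking.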
The KGE analysis is performed using the median as a point estimate of the forecast ensemble. The results obtained for the post-processed forecasts, particularly the bias ratios and variability ratios of less than one at long lead times, are not unexpected, as the variance of the forecast median will be considerably more damped than that of the mean. The forecast mean is likely to be a better choice as the point estimate of the forecast ensemble. Some theoretical justification for the use of the ensemble mean with measures of squared error can be found in Gneiting (2011).
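The point the referee makes, following Gneiting (2011), is that the optimal point forecast depends on the scoring function: the mean minimizes expected squared error, the median absolute error. A quick check on a skewed toy ensemble (synthetic, not EFAS data):

```python
import numpy as np

# A skewed toy "ensemble", stand-in for a streamflow forecast distribution.
ens = np.random.default_rng(42).lognormal(mean=0.0, sigma=1.0, size=5000)

# Score every candidate point forecast on a fine grid under both loss functions.
candidates = np.linspace(0.1, 5.0, 2000)
mse = np.array([np.mean((ens - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(ens - c)) for c in candidates])

best_under_mse = candidates[mse.argmin()]  # lands at the ensemble mean
best_under_mae = candidates[mae.argmin()]  # lands at the ensemble median
print(best_under_mse, ens.mean())
print(best_under_mae, np.median(ens))
```

For a right-skewed distribution the two optima differ substantially, which is why a squared-error-based score such as the KGE bias ratio looks pessimistic when fed the forecast median.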
Analysis of forecasts for extreme events such as floods requires careful design to ensure that the performance evaluation is not biased (Lerch, 2017). In this paper, the analysis of peak timing is conditioned on observations exceeding a threshold (the 90th percentile discharge threshold) within the forecast period, and is likely to result in a biased evaluation of forecasts. A more rigorous approach would be to select the events based on forecasts exceeding the threshold. I also believe that rather than evaluating the timing of the peak in the forecast median, which doesn't correspond to the peak in any individual hydrograph, a more representative point estimate of the forecast timing error would be to compare the median (or mean) time to peak across all ensemble members to the timing of the observed peak.
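The selection-bias mechanism (the "forecaster's dilemma" of Lerch et al., 2017) can be reproduced with a toy simulation. Everything here is synthetic and assumes the forecast sees only the predictable part of the flow; none of it uses EFAS data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
signal = rng.gamma(2.0, 1.0, size=n)          # predictable part of the flow
obs = signal + rng.normal(0.0, 0.8, size=n)   # observation = signal + unpredictable noise
fcst = signal                                 # an honest forecast of the predictable part

thr = np.quantile(obs, 0.9)  # a "90th percentile discharge threshold"

# Conditioning on *observed* exceedances preferentially selects cases where the
# noise happened to be positive, so the honest forecast looks biased low.
bias_obs_cond = np.mean(fcst[obs > thr] - obs[obs > thr])
# Conditioning on *forecast* exceedances leaves the noise unselected: no bias.
bias_fcst_cond = np.mean(fcst[fcst > thr] - obs[fcst > thr])
print(bias_obs_cond, bias_fcst_cond)
```

The observation-conditioned sample shows a spurious underforecasting bias even though the forecast is unbiased, which is why the referee recommends selecting events on forecast exceedances instead.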
Line 373 - "values in the recent perion" should be "values in the recent period".
Line 825 - CRPS calculated on deterministic forecasts is equivalent to the absolute error, not the squared absolute error.
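The referee's correction is easy to verify: for a point (one-member) forecast the CRPS integral collapses to the absolute error, since the spread term of the standard ensemble estimator vanishes:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical-CDF CRPS estimator; for m = 1 the spread term is zero."""
    m = len(members)
    spread = np.abs(members[:, None] - members[None, :]).sum() / (2 * m * m)
    return np.mean(np.abs(members - obs)) - spread

# A deterministic forecast of 3.2 for an observation of 4.0:
print(crps_ensemble(np.array([3.2]), 4.0))  # equals the absolute error |3.2 - 4.0|
```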
Figures - The size of multi-panel figures (e.g. Figures 9 and 12) could be increased to better illustrate the detail.
References:
Tilmann Gneiting (2011) Making and Evaluating Point Forecasts, Journal of the American Statistical Association, 106:494, 746-762, DOI: 10.1198/jasa.2011.r10138
Sebastian Lerch, Thordis L. Thorarinsdottir, Francesco Ravazzolo and Tilmann Gneiting (2017) Forecaster's Dilemma: Extreme Events and Forecast Evaluation, Statistical Science, 32(1), 106-127, DOI: 10.1214/16-STS588
- AC2: 'Reply on RC2', Gwyneth Matthews, 24 Feb 2022
Data sets
Post-processed reforecasts of the European Flood Awareness System and related evaluation data Gwyneth Matthews, Christopher Barnard http://dx.doi.org/10.17864/1947.333
Model code and software
Post-processed reforecasts of the European Flood Awareness System and related evaluation data Gwyneth Matthews, Christopher Barnard http://dx.doi.org/10.17864/1947.333