Streamflow forecasts provide vital information to aid emergency response preparedness and disaster risk reduction. Medium-range forecasts are created by forcing a hydrological model with output from numerical weather prediction systems. Uncertainties are unavoidably introduced throughout the system and can reduce the skill of the streamflow forecasts. Post-processing is a method used to quantify and reduce the overall uncertainties in order to improve the usefulness of the forecasts. The post-processing method that is used within the operational European Flood Awareness System is based on the model conditional processor and the ensemble model output statistics method. Using 2 years of reforecasts with daily timesteps, this method is evaluated for 522 stations across Europe. Post-processing was found to increase the skill of the forecasts at the majority of stations in terms of both the accuracy of the forecast median and the reliability of the forecast probability distribution. This improvement is seen at all lead times (up to 15 d) but is largest at short lead times. The greatest improvement was seen in low-lying, large catchments with long response times, whereas for catchments at high elevation and with very short response times the forecasts often failed to capture the magnitude of peak flows. Additionally, the quality and length of the observational time series used in the offline calibration of the method were found to be important. This evaluation of the post-processing method, and specifically the new information provided on characteristics that affect the performance of the method, will aid end users in making more informed decisions. It also highlights the potential issues that may be encountered when developing new post-processing methods.

Preparedness for floods is greatly improved through the use of streamflow forecasts, resulting in less damage and fewer fatalities

Several approaches have been developed to reduce hydrological forecast errors and account for the predictive uncertainty. Improvements to the NWP systems used to force the hydrological model have been shown to reduce the uncertainty in the streamflow forecasts

Post-processing the streamflow forecast allows all uncertainties to be accounted for. Over the past few decades several techniques have been proposed. These techniques can be split into two approaches: (1) methods accounting for the meteorological and hydrological uncertainties separately and (2) lumped approaches which calculate the total combined uncertainty of the forecast. One of the first examples of the former approach was the Bayesian forecasting system which was applied to deterministic forecasts and consists of the Hydrological Uncertainty Processor

Many regression-based methods have been developed to post-process streamflow forecasts because of their relatively simple structure (e.g. quantile regression,

All the methods mentioned above, and many more that have not been mentioned

The EFAS domain covers hundreds of catchments across several hydroclimatic regions with different catchment characteristics. The raw forecasts (i.e. forecasts that have not undergone post-processing) have varying levels of skill across these catchments

Does the post-processing method provide improved forecasts?

What affects the performance of the post-processing method?

The remainder of the paper is set out as follows. In Sect.

The focus of this paper is the evaluation of the post-processing method used operationally to create the product referred to as the “real-time hydrograph” (see Fig.

Version 4 of the EFAS (operational in October 2020) uses the LISFLOOD hydrological model at an increased temporal resolution of 6 h and a spatial resolution of 5 km

Operationally, the medium-range ensemble forecasts are generated twice daily at 00:00 and 12:00 UTC with a maximum lead time of 15 d

The hydrological initial conditions for the streamflow forecasts are determined by forcing LISFLOOD with meteorological observations to create a simulation henceforth referred to as the

This section describes the post-processing method evaluated. Post-processing is performed at stations for which near-real-time and historic river discharge observations are available. The method is motivated by the MT-MCP

Flow chart describing the post-processing method at a station.

In this section notation and definitions used throughout the paper are introduced. The aim of post-processing is to correct the errors and account for the uncertainty that may be present in a forecast. As described in Sect.

The input data shown in Fig.

The raw ensemble forecast that is post-processed is the only data available in the forecast period. This forecast is produced at time

The time series of observations for a single station is denoted by the vector

Similarly, the time series of the water balance simulation, denoted by the vector

The post-processing method utilises the properties of the Gaussian distribution, but discharge values usually have highly skewed non-Gaussian distributions

The NQT defines a one-to-one map between the quantiles of the cumulative distribution function (CDF) of the discharge distribution in physical space,
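As an illustration only, a normal quantile transform can be sketched with an empirical plotting-position CDF standing in for the fitted marginal CDF used operationally; the function names and the Weibull plotting positions are assumptions, not the EFAS implementation.

```python
import numpy as np
from scipy.stats import norm

def nqt_forward(values, climatology):
    """Map discharge values into standard Normal space via the empirical
    CDF of a climatological sample (hypothetical helper; the operational
    method uses a fitted KDE/GPD CDF instead)."""
    clim = np.sort(np.asarray(climatology))
    n = clim.size
    # Weibull plotting positions give non-exceedance probabilities in (0, 1)
    ranks = np.searchsorted(clim, values, side="right")
    p = np.clip(ranks / (n + 1), 1 / (n + 1), n / (n + 1))
    return norm.ppf(p)  # quantile matching into N(0, 1)

def nqt_inverse(z, climatology):
    """Map standard Normal values back to physical discharge by
    interpolating the inverse empirical CDF."""
    clim = np.sort(np.asarray(climatology))
    n = clim.size
    p_grid = np.arange(1, n + 1) / (n + 1)
    return np.interp(norm.cdf(z), p_grid, clim)
```

The forward and inverse maps are (approximately) each other's inverses within the range of the climatological sample; extrapolation beyond that range is exactly the problem the fitted GPD tail addresses.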

The offline calibration (see Fig.

The NQT requires the CDF of the observed and simulated discharge values in physical space, denoted

All values in the time series,

A Gaussian kernel is centred at each data point such that

The kernel density is estimated using a leave-one-out approach such that the density at

To guarantee data points in the tail, the largest 10 values are always assumed to be in the upper tail of the distribution (within the GPD), and the next 990 values (i.e.

For each test value of

The scale parameter,

The shape parameter,

The full distribution is the combination of the KDE and GPD weighted by their contribution to the total density,
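A much-simplified sketch of such a KDE-plus-GPD mixture CDF, assuming a fixed breakpoint at the 11th-largest value rather than the candidate-testing procedure described above, and a rule-of-thumb kernel bandwidth (all names hypothetical):

```python
import numpy as np
from scipy.stats import norm, genpareto

def mixture_cdf(x, sample, n_tail=10):
    """Simplified sketch of a KDE + GPD mixture CDF. The body of the
    distribution uses a Gaussian kernel density; exceedances above a
    breakpoint use a generalised Pareto tail. Unlike the operational
    method, the breakpoint is fixed at the (n_tail+1)-largest value
    instead of being selected from candidate locations."""
    data = np.sort(np.asarray(sample))
    n = data.size
    u = data[-(n_tail + 1)]          # breakpoint (location parameter)
    body = data[data <= u]
    tail = data[data > u] - u        # exceedances over the breakpoint
    w = body.size / n                # weight of the KDE component
    h = 1.06 * body.std() * body.size ** (-0.2)   # rule-of-thumb bandwidth
    c, _, scale = genpareto.fit(tail, floc=0.0)   # GPD shape and scale

    x = np.atleast_1d(np.asarray(x, dtype=float))
    # KDE CDF: average of Gaussian kernel CDFs centred at the body points
    kde = norm.cdf((x[:, None] - body[None, :]) / h).mean(axis=1)
    gpd = np.where(x > u, genpareto.cdf(np.maximum(x - u, 0.0), c, scale=scale), 0.0)
    # weighted combination of the two components (small discontinuity at u
    # is tolerated in this sketch)
    return np.where(x <= u, w * kde, w + (1 - w) * gpd)
```

The weighting by `w` mirrors the idea that each component contributes in proportion to the fraction of the total density it covers.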

Schematic of the distribution approximation method. All data points are shown by the short solid lines. The largest 10 data points are red (always in the upper tail), the next 990 largest data points are blue (tried as the location parameter), and the remaining data points are black. Gaussian kernels (grey dashed lines) are used to calculate the kernel density (purple line). For clarity, only the kernels centred at every 500th data point are plotted. The upper tail is fitted with a generalised type-II Pareto distribution (orange line). The breakpoint (dot-dashed black line) defines the separation between the two distributions. The integral of the probability density function with respect to discharge (the sum of the purple- and orange-shaded areas) equals 1.

Once the variables that define the discharge density distribution, namely

This section describes the calculation of the joint distribution used in the online hydrological uncertainty estimation (see Sect.

Following on from Eq.

The splitting of the observed and simulated variables into two distinct time periods is discussed below. The joint distribution can now be defined in terms of

The joint distribution is denoted

To ensure that the covariance matrix,

As mentioned, the joint distribution is used in the estimation of the hydrological uncertainty in the online part of the post-processing method (see Sect.

These sub-matrices can be further decomposed into the components referring to the observed and simulated variables such that, for example,

This section describes the online correction part of the post-processing method (see Fig.

observations for the station,

the water balance simulation for the grid box containing the station's location,

a set of ensemble streamflow forecasts (from the same system as the forecast

The hydrological uncertainty is quantified using a MCP method which uses the discharge values from the recent period and the joint distribution,

In Sect.

The sub-matrices of the covariance matrix of the joint distribution that were defined in Eq.

By positioning the joint distribution in this way,

The conditional distribution of the unknown discharge values in the forecast period conditioned on the known discharge values in the recent period, denoted
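In standard Normal space the joint distribution is a zero-mean Gaussian, so this conditional distribution follows from the usual multivariate Gaussian identity. A minimal sketch, with the recent-period block ordered first in the covariance matrix (an assumed convention, not necessarily the operational one):

```python
import numpy as np

def conditional_gaussian(cov, y_recent):
    """Condition a zero-mean joint Gaussian (in standard Normal space)
    on known recent-period values. `cov` is partitioned with the
    recent-period block first:
        cov = [[S_rr, S_rf],
               [S_fr, S_ff]]
    Returns the conditional mean and covariance for the forecast period."""
    k = y_recent.size
    S_rr = cov[:k, :k]
    S_rf = cov[:k, k:]
    S_fr = cov[k:, :k]
    S_ff = cov[k:, k:]
    gain = S_fr @ np.linalg.inv(S_rr)   # regression of forecast on recent
    mean = gain @ y_recent              # conditional mean
    covc = S_ff - gain @ S_rf           # conditional covariance
    return mean, covc
```

In the bivariate case with unit variances and correlation rho this reduces to the familiar mean rho*y and variance 1 - rho**2.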

The resulting predicted distribution,

This section describes the part of the online correction that estimates the meteorological uncertainty in the forecast of interest. As stated at the beginning of Sect.

The uncertainty that propagates through from the meteorological forcings is partially captured by the spread of the ensemble streamflow forecast. However, these forecasts are often under-spread, particularly at shorter lead times. The EMOS method

The variance of

The ensemble mean at each lead time and the auto-covariance matrices are calculated for each of the forecasts from the recent period after they have been transformed to the standard Normal space (not including the forecast produced at time

The current forecast,

The update step equations of the Kalman filter

The update step of the Kalman filter is applied to produce a probabilistic forecast in the standard Normal space containing information about both the meteorological and hydrological uncertainties. The distribution of this forecast is denoted
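A minimal sketch of this update step, assuming an identity observation operator so that the hydrological uncertainty distribution acts as the prior and the spread-corrected forecast as the observation (a simplification of the operational formulation):

```python
import numpy as np

def kalman_combine(mu_h, P_h, mu_m, R_m):
    """Combine the hydrological uncertainty distribution N(mu_h, P_h)
    with the meteorological (spread-corrected forecast) distribution
    N(mu_m, R_m) using the Kalman filter update step, all in standard
    Normal space. With an identity observation operator this is the
    precision-weighted average of the two Gaussians."""
    K = P_h @ np.linalg.inv(P_h + R_m)       # Kalman gain
    mu = mu_h + K @ (mu_m - mu_h)            # updated mean
    P = (np.eye(P_h.shape[0]) - K) @ P_h     # updated covariance
    return mu, P
```

The gain makes explicit how the relative sizes of the two uncertainties set the weight each distribution receives: a confident (low-variance) component dominates the combined forecast.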

The

To maintain similarity to the operational system, the station models used in this evaluation are those calibrated for use in the operational post-processing. To avoid an unfair evaluation, station models must have been calibrated using observations from before the evaluation period. An evaluation period of approximately 2 years (from 1 January 2017 to 14 January 2019) was chosen to balance the length of the evaluation period with the number of stations evaluated. Of the 1200 stations post-processed operationally, 610 stations have calibration time series with no overlap with the evaluation period. Additionally, stations were required to have at least 95 % of the daily observations for the evaluation period, reducing the number of stations to 525. A further three stations were removed after a final quality control inspection (see Sect.

Map showing the locations of the 522 stations evaluated. The marker colour shows the continuous ranked probability score (see Sect.

The reforecasts used in this study are a subset of the EFAS 4.0 reforecast dataset

The reforecasts and the operational forecasts (see Sect.

All discharge observations were provided by local and national authorities and collected by the Hydrological Data Collection Centre of the Copernicus Emergency Management Service and are the observations used operationally. The operational quality control process was applied to remove incorrect observations before they were used in this study

The EFAS 4.0 simulation

The evaluation of the post-processing method is performed by comparing the skill of the raw forecasts with the corresponding post-processed forecasts. Since the aim of the post-processing is to create a more accurate representation of the observation probability distribution, all metrics use observations as the “truth” values. As mentioned in Sect.

Example of the real-time hydrograph product for the station in Brehy, Slovakia, on 31 January 2017.

In the real-time hydrograph the darkest shade of blue indicates the forecast median, making it the most natural single-valued summary of the full probabilistic forecast for end users. The ensemble median of the raw forecasts is used in this evaluation because, operationally, the ensemble forecasts are often represented by box plots in which the median at each timestep is shown.

The skill of the forecast median is evaluated using the modified Kling–Gupta efficiency score
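Assuming the standard definition of the modified KGE (correlation, bias ratio, and a variability ratio expressed through coefficients of variation), the score can be sketched as:

```python
import numpy as np

def kge_modified(sim, obs):
    """Modified Kling-Gupta efficiency: combines linear correlation,
    the bias ratio, and the ratio of coefficients of variation.
    A perfect forecast scores 1."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]                              # correlation
    beta = sim.mean() / obs.mean()                               # bias ratio
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # CV ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
```

A skill score can then compare this value against that of a benchmark forecast in the usual (score - benchmark) / (1 - benchmark) form.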

The timing of the peak discharge is an important variable of flood forecasts. The peak-time error (PTE) is used to evaluate the effect of post-processing on the timing of the peak within the forecast. The PTE requires a single-valued forecast trajectory. For the reasons stated in Sect.
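For a single forecast, the PTE can be sketched as the offset between the positions of the forecast and observed maxima within the forecast period; the sign convention (negative for a peak forecast too early) is an assumption:

```python
import numpy as np

def peak_time_error(forecast_median, observations):
    """Peak-time error: difference (in timesteps, here days) between the
    position of the maximum of the single-valued forecast trajectory and
    the position of the observed maximum over the forecast period.
    Negative values mean the peak was forecast too early."""
    return int(np.argmax(forecast_median) - np.argmax(observations))
```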

Two discharge thresholds are shown in the real-time hydrograph: the mean discharge (MQ) and the mean annual maximum discharge (MHQ). Both thresholds are determined using the observations from the historic period. For the post-processed forecasts, the probability of exceedance of the MQ threshold, PoE(MQ), is calculated such that

The relative operating characteristic (ROC) score and ROC diagram

Reliability diagrams are used to evaluate the reliability of the forecast in predicting the exceedance of the two thresholds. Reliability diagrams show the observed frequency vs. the forecast probability for bins of width 0.1 from 0.05 to 0.95. A perfectly reliable forecast would follow the one-to-one diagonal on a reliability diagram. The same combination of stations and lead times is used as with the ROC diagrams.
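The binning described above can be sketched as follows; the helper name and the handling of empty bins are assumptions:

```python
import numpy as np

def reliability_points(prob_forecasts, outcomes, width=0.1):
    """Observed frequency vs. forecast probability for bins of width 0.1
    centred at 0.05, 0.15, ..., 0.95. Returns bin centres, observed
    frequencies (NaN for empty bins), and the number of forecasts per
    bin (the quantity shown by marker size in the diagrams)."""
    p = np.asarray(prob_forecasts, float)
    o = np.asarray(outcomes, float)          # 1 if threshold exceeded, else 0
    edges = np.arange(0.0, 1.0 + width, width)
    centres = edges[:-1] + width / 2
    idx = np.clip(np.digitize(p, edges) - 1, 0, centres.size - 1)
    counts = np.bincount(idx, minlength=centres.size)
    hits = np.bincount(idx, weights=o, minlength=centres.size)
    freq = np.where(counts > 0, hits / np.maximum(counts, 1), np.nan)
    return centres, freq, counts
```

A perfectly reliable forecast yields points on the one-to-one line, i.e. `freq` approximately equal to `centres` in every populated bin.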

A commonly used metric to evaluate the overall performance of a probabilistic or ensemble forecast is the continuous ranked probability score
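For an ensemble forecast the CRPS can be computed from the energy-form identity CRPS = E|X - y| - 0.5 E|X - X'|; the sketch below, including a skill score defined against the raw forecast as the reference, is illustrative rather than the EFAS implementation:

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS for one ensemble forecast and a scalar observation, using
    the standard (not the 'fair') estimator over all member pairs."""
    x = np.asarray(members, float)
    term1 = np.abs(x - obs).mean()                       # E|X - y|
    term2 = 0.5 * np.abs(x[:, None] - x[None, :]).mean() # 0.5 E|X - X'|
    return term1 - term2

def crpss(crps_forecast, crps_reference):
    """Skill score relative to a reference forecast (here the raw
    forecast): positive values mean the forecast improves on it."""
    return 1.0 - crps_forecast / crps_reference
```

For a deterministic forecast (all members identical) the CRPS collapses to the absolute error, which is why it is often described as its probabilistic generalisation.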

For some of the metrics described in Sects.

This section focuses on the overall impact of post-processing at all 522 of the evaluated stations across the EFAS domain and aims to address the research question “Does the post-processing method provide improved forecasts?”

The modified Kling–Gupta efficiency skill score (KGESS) is used to evaluate the impact of post-processing on the forecast median (see Sect.

Comparison of the raw and post-processed forecast medians.

Figure

Figure

Figure

Figure

There is a small decrease in the

Figure

The two factors impacting the ability of the post-processed forecasts to capture the variability of the flow are (1) the extent to which the discharge values in the recent period indicate the upcoming flow and (2) the spread of the raw forecast. In the Kalman filter when the hydrological uncertainty distribution and the meteorological uncertainty distribution are combined (see Sect.

The meteorological uncertainty distribution is the spread-corrected raw forecast and includes the variability due to the meteorological forcings. For floods with meteorological drivers, if the magnitude of the peaks is under-predicted by the raw forecasts, then the post-processed forecasts are also likely to under-predict the magnitude of the peaks. Alternatively, if the raw forecast is unconfident in the prediction of a peak (e.g. only a couple of members predict a peak), then it may not have a sufficient impact within the Kalman filter and the post-processed forecast may not predict the peak regardless of the accuracy of the ensemble members that do predict the peak. The impact of the spread correction is discussed further in Sect.

The ensemble mean is another commonly used single-valued summary of an ensemble forecast

To evaluate the impact of post-processing on the ability of the forecast to predict the timing of the peak flow accurately, the PTE (see Sect.

Histograms showing the probability distribution of peak-time errors for all forecasts where the maximum observation is above the 90th percentile for the station (26 807 forecasts) for raw forecasts (orange) and post-processed forecasts (purple).

Approximately 40 % of the forecast medians of the raw forecasts have no error in the timing of the peak for peaks that occur within lead times of 1 to 5 d. This drops to 37 % for post-processed forecasts. Both sets of forecasts have approximately 60 % of forecasts with timing errors of 1 d or less. However, the post-processed forecasts are more likely to predict the peak too early. For maximum forecast values occurring at lead times of 6 to 10 d, the post-processed forecasts still tend to predict peaks earlier than the raw forecasts. However, for maximum forecast values occurring at lead times of 11 to 15 d, the post-processed forecasts are more likely to predict the peaks several days too late. This suggests that floods forecast at longer lead times by the post-processed forecasts should be considered carefully.

Overall, the impact of post-processing is small but tends towards early prediction of the peak flow at short lead times and late peak predictions at longer lead times. However, there are three main limitations to this analysis. The first is that both sets of forecasts are probabilistic, and therefore the median may not provide an adequate summary of the forecast. Secondly, the evaluation here is forecast based rather than peak based, in that the focus is the timing of the highest discharge value within the forecast period and not the lead time at which a specific peak is predicted accurately. This was intentional, as the twice-weekly production of the reforecasts means that a specific peak does not occur at each lead time. Finally, the combination of forecasts at all the stations means the relationship between the runoff-generating mechanisms and the PTE cannot be assessed.

The ROC diagrams for the MQ and MHQ thresholds (see Sect.

Relative operating characteristic diagrams for

The spread of the raw forecasts is small at short lead times. This is shown by the overlapping of the points in Fig.

Post-processing also accounts for the hydrological uncertainty, allowing for a more complete representation of the total predictive uncertainty. In addition, as shown in Fig.

Relative operating characteristic scores (ROCS) and corresponding skill scores (ROCSS) for the raw and post-processed (pp) forecasts for lead times of 1–5, 6–10, and 11–15 d for the mean flow threshold (MQ) and the mean annual maximum threshold (MHQ).

The ROC diagram for the MHQ threshold (Fig.

Figure

Reliability diagrams for

Both sets of forecasts are consistently below the diagonal in the MHQ reliability diagram (Fig.

The distribution of forecasts (shown by marker size) is more uniform for the post-processed forecasts, particularly at shorter lead times. Since the ensemble reforecasts evaluated have 11 members and the operational forecasts have 73 members, the distribution for operational raw forecasts is expected to be slightly more even as the additional members allow for greater gradation in the probability distribution. The distribution of forecasts is skewed towards low probabilities showing, similarly to the ROC diagrams (Fig.

The continuous ranked probability skill score (CRPSS) is used to evaluate the impact of post-processing on the overall skill of the probability distribution of the forecasts. Figure

The continuous ranked probability skill score (CRPSS) for all 522 stations for lead times of 3, 6, 10, and 15 d. CRPSS values below 0 indicate that the forecast probability distribution is on average less skilful after post-processing and values above 0 indicate added skill after post-processing. Markers are outlined in red if the CRPSS is below 0 and in cyan if the CRPSS is above 0.9.

As was seen with the KGESS for the forecast median, there is a decrease in the improvement offered by post-processing at longer lead times. This can be seen in Fig.

The lack of clustering of the stations with CRPSS values above 0.9 suggests that the magnitudes of the largest corrections are due to station-dependent characteristics. On the other hand, the degraded stations at a lead time of 3 d appear to cluster in three loose regions. In all three regions the degradation is due to high short-duration peaks being captured better by the raw forecasts than the post-processed forecasts. At longer lead times the Spanish catchments are still degraded, but the Scottish stations are not. As discussed in Sect.

Comparing the CRPSS values in Fig.

As mentioned, many of the stations with CRPSS values below 0 at short lead times are degraded due to peak flows being better predicted by the raw forecasts. Therefore, the skill of the forecast at different flow levels is evaluated. Figure

The CRPSS for all 522 stations calculated over the forecasts (approximately 52 forecasts) with flow values in the lowest quartile (Q1) to the highest quartile (Q4). CRPSS values below 0 indicate the forecast probability distribution is on average less skilful after post-processing and values above 0 indicate added skill after post-processing. A log scale is used on the

The improvements for all four quartiles decrease with lead time, as has been seen previously in Figs.

In the previous section the impact of post-processing was shown to vary greatly between stations. The following sections investigate the factors that influence the effect of the post-processing method. The CRPSS is used in this analysis as it provides an assessment of the improvement or degradation to the overall skill of the probabilistic forecast.

To aid the discussion of the key results, some stations are highlighted. See Fig.

Observation time series for 1 year of the evaluation period from October 2017 to October 2018 for six example stations. The forecast medians of the raw and post-processed forecasts are shown for lead times of 3, 6, and 15 d.

Key results and the section that provide more information for each of the six stations used as examples and for which time series are shown in Fig.

This section looks at how meteorological and hydrological uncertainties affect the performance of the post-processing method. As mentioned in Sect.

Density plots showing the station CRPSS for lead times of 6 d

Figure

The purple lines in Fig.

Although the

Alternatively, the hydrological uncertainty distribution may have a greater weight within the Kalman filter. Some peaks at the Daldowie station in winter 2017/2018 are forecast accurately by the raw forecast median (grey boxes in Fig.

For the hydrological errors the

The 15 stations with the largest hydrological uncertainties show a small increase in average CRPSS with increasing hydrological uncertainties. This trend is visualised by the orange line in Fig.

The catchments within the EFAS domain vary greatly in terms of size, location, and flow regime. This section discusses catchment characteristics that impact the performance of the post-processing method, namely upstream area, response time, elevation, and regulation. In Fig.

The CRPSS for all 522 stations at every other lead time with stations categorised by their catchment characteristics.

Figure

There are two reasons why very small catchments must be removed to clearly identify the trend between upstream area and CRPSS. Firstly, most stations with upstream areas (provided by local authorities) smaller than 500 km²

In Fig.

In Fig.

The regulation of rivers via reservoirs and lakes is difficult to model. Raw forecasts for many regulated catchments were found to have a negative correlation with the observations. In this study, a station is considered to be regulated if it is within three grid boxes downstream of a reservoir or lake in the LISFLOOD domain or if data providers have reported that the station is on a regulated stretch of the river. Figure

Violin plot of the CRPSS values for the 480 unregulated stations (green distribution) and the 42 regulated catchments (black lines) at lead times of 3, 6, 10, and 15 d.

At all lead times, the CRPSS values of most regulated stations are above the median of the unregulated stations. Additionally, the mean CRPSS value of the regulated stations is at least 0.1 higher than that of the unregulated stations for all lead times longer than 1 d. The improvement due to post-processing at regulated stations is dependent on whether the reservoir is in the same state during the recent and forecast periods and hence whether the discharge values from the recent period provide useful information about the state of the reservoir. At longer lead times it becomes more likely that the reservoir will have a changed state and therefore that the information provided by the recent discharge values is not useful. However, if the reservoir is in the same state, then the magnitude of the improvement from post-processing can be large. For example, the Porttipahta station in Finland is located at the Porttipahta reservoir, and its time series is shown in Fig.

It is interesting to consider whether other hydrological processes that are difficult to model can be accounted for by post-processing. For example, the peak in the winter and spring in the Daugavpils catchment (see Fig.

The length of the time series used to calibrate the station model varies between stations. The maximum length is dictated by the water balance simulation, which is available from 1 January 1990. However, many stations have shorter time series due to the availability of observations. Figure

At short lead times, longer calibration time series generally lead to greater improvement from post-processing than shorter ones. Longer time series allow the joint distribution between the observations and the water balance simulation to be defined more rigorously, enabling a more accurate conditioning of the forecast on the discharge values from the recent period. For lead times greater than 7 d the CRPSS distributions for all categories are similar. As discussed in Sect.

The CRPSS for all 522 stations at every other lead time with stations categorised by the length of their calibration time series. Very short time series: less than 15 years (63 stations). Short time series: 15 to 20 years (93 stations). Medium time series: 20 to 25 years (119 stations). Long time series: over 25 years (247 stations).

In general, shorter time series tend to be more recent and so benefit from improved river gauging technology; non-stationarity between the calibration and evaluation periods is also less likely to be an issue. The station in Montañana (shown in Fig.

Observations (blue) and water balance simulation (black) time series used in the calibration of the station model for the station in Montañana.

Post-processing is a computationally efficient method of quantifying uncertainty and correcting errors in streamflow forecasts. Uncertainties enter the system from multiple sources, including the meteorological forcings from numerical weather prediction systems (here referred to as meteorological uncertainties) and the initial hydrological conditions and hydrological model (here referred to as hydrological uncertainties). The post-processing method used operationally in the European Flood Awareness System (EFAS) uses a method motivated by the ensemble model output statistics

First, does the post-processing method provide improved forecasts? Our results show that for the majority of stations the post-processing improves the skill of the forecast, with median continuous ranked probability skill scores (CRPSS) between 0.2 and 0.74 across all lead times. This improvement is greatest at shorter lead times of up to 5 d, but post-processing is still beneficial up to the maximum lead time of 15 d. The bias and spread correction provided by the post-processing increased the reliability of the forecasts and increased the number of correctly forecast flood events without increasing the number of false alarms. However, the post-processed forecasts also led to the flood peak often being forecast too early by approximately a day. Although forecasts for flood events at most stations did benefit from post-processing, the greatest improvements were to forecasts for normal flow conditions.

Second, what affects the performance of the post-processing method? Several factors were found to impact the performance of the post-processing method at a station. The post-processing method is more easily able to correct hydrological errors than meteorological errors. This is mainly because no bias correction is performed for the meteorological errors, whereas hydrological errors are bias corrected by conditioning the forecast on the recent observations. Therefore, stations where the errors were primarily due to hydrological errors were improved more. As the hydrological errors tend to be larger than the meteorological errors, this is beneficial; however, more research is required to fully account for biases due to the meteorological forcings as well.

The post-processing method was found to easily account for consistent hydrological biases that were often due to limitations in the model representation of the drainage network. However, the correction of forecast-specific errors (due to initial conditions and meteorological forcings) was largely determined by the response time of the catchment. Therefore, the greatest improvement was seen in catchments larger than 5000 km²

The use of long historic observational time series for the offline calibration is beneficial, particularly for correcting forecast-specific errors. However, time series shorter than 15 years were found to be sufficient for correcting consistent errors in the model climatology even at a lead time of 15 d. The quality of the observations in the historic time series is also important: errors in the time series degraded the performance of the post-processing method and limited the usefulness of the forecasts.

These results highlight the importance of post-processing within the forecasting chain of large-scale flood forecasting systems. They also provide a benchmark for end users of the EFAS forecasts and show the situations when the post-processed forecasts can provide more accurate information than the raw forecasts. These results also highlight possible areas of improvement within the EFAS and the factors that must be considered when designing and implementing a post-processing method for large-scale forecasting systems.

The raw reforecasts (

The supplement related to this article is available online at:

GM, HC, SLD, and CP designed the study. GM and CB created the post-processed reforecast dataset. GM drafted the manuscript and performed the forecast evaluation. All the co-authors contributed to the editing of the manuscript and to the discussion and interpretation of the results.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We thank Paul Smith for the development and operational implementation of the post-processing method. We are grateful for the advice provided by Shaun Harrigan, David Richardson, and Florian Pappenberger. We thank the members of the Water@Reading research group for their advice and support.

This research has been supported by the Engineering and Physical Sciences Research Council (grant nos. EP/R513301/1 and EP/P002331/1), the Natural Environment Research Council (grant no. NE/S015590/1), the NERC National Centre for Earth Observation, and the European Centre for Medium-Range Weather Forecasts.

This paper was edited by Daniel Viviroli and reviewed by two anonymous referees.