An increasing number of flood forecasting services assess and communicate the uncertainty associated with their forecasts. While obtaining reliable forecasts is a key issue, it is a challenging task, especially when forecasting high flows in an extrapolation context, i.e. when the event magnitude is larger than what was observed before. In this study, we present a crash-testing framework that evaluates the quality of hydrological forecasts in an extrapolation context. The experiment set-up is based on (i) a large set of catchments in France, (ii) the GRP rainfall–runoff model designed for flood forecasting and used by the French operational services and (iii) an empirical hydrologic uncertainty processor designed to estimate conditional predictive uncertainty from the hydrological model residuals. The variants of the uncertainty processor used in this study differ in the data transformation they use (log, Box–Cox and log–sinh) to account for heteroscedasticity and the evolution of the other properties of the predictive distribution with the discharge magnitude. Different data subsets were selected based on a preliminary event selection. Various aspects of the probabilistic performance of the variants of the hydrologic uncertainty processor, reliability, sharpness and overall quality were evaluated. Overall, the results highlight the challenge of uncertainty quantification when forecasting high flows. They show a significant drop in reliability when forecasting high flows in an extrapolation context and considerable variability among catchments and across lead times. The increase in statistical treatment complexity did not result in significant improvement, which suggests that a parsimonious and easily understandable data transformation such as the log transformation or the Box–Cox transformation can be a reasonable choice for flood forecasting.
In many countries, operational flood forecasting services (FFS) issue forecasts routinely throughout the year and during rare or critical events. End users are mostly concerned with the largest and most damaging floods, when critical decisions have to be made. For such events, operational flood forecasters must be prepared to deal with extrapolation, i.e. to work on events of a magnitude that they and their models have seldom or never met before.
The relevance of simulation models and their calibration in evolving conditions, such as contrasted climate conditions and climate change, has been studied by several authors. For example,
Addressing the extrapolation issue involves a number of methodological difficulties. Some data issues are specific to the data used for hydrological modelling, such as the rating curve reliability
Even if significant progress has been made and implemented in operational flood forecasting systems
The uncertainty associated with operational forecasts is most often described by a predictive uncertainty distribution. Assessing a reliable predictive uncertainty distribution is challenging because hydrological forecasts yield residuals that show heteroscedasticity, i.e. an increase in the uncertainty variance with discharge, time autocorrelation, skewness etc. Some studies
Various approaches to uncertainty assessment have been developed to assess the uncertainty in hydrological predictions
A first approach intends to model each source of uncertainty separately and to propagate these uncertainties through the modelling chain
In practice, it is necessary to consider the meteorological forecast uncertainty to issue hydrological forecasts. The ensemble approaches intend to account for this source of uncertainty. They are increasingly popular in the research and the operational forecasting communities. An increasing number of hydrological ensemble forecasting systems are in operational use and have proved their usefulness, e.g. the European Flood Awareness System
Multi-model approaches can be used to assess modelling uncertainty
In forecasting mode, data assimilation schemes based on statistical modelling are commonly used to reduce and assess the predictive uncertainty. Some algorithms such as particle filters
Alternatively, numerous post-processors of deterministic or probabilistic models have been developed to account for the uncertainty from sources that are not modelled explicitly. They differ in several aspects
The approaches presented in Sect. 1.2.1 and 1.2.2 are not exclusive of each other. Even when future precipitation is the main source of uncertainty, post-processing is often required to produce reliable hydrological ensembles
Note that many of these approaches use a variable transformation to handle the heteroscedasticity and more generally the evolution of the predictive distribution properties with respect to the forecasted discharge. Some have no calibrated parameter, while others encompass a few calibrated parameters, allowing more flexibility in the predictive distribution assessment. More details on commonly used variable transformations are presented in Sect. 2.1.4.
In this article, we focus on uncertainty assessment with a post-processing approach based on residuals modelling. While the operational goal is to improve the hydrological forecasting, this study does not consider the meteorological forecast uncertainty: it only focuses on the hydrological modelling uncertainty, as these two main sources of uncertainty can be considered independently. The objectives of this article are (i) to present a framework aimed at testing the hydrological modelling and uncertainty assessment in the extrapolation context, (ii) to assess the ability and the robustness of a post-processor to provide reliable predictive uncertainty assessment for large floods when different variable transformations are used, and (iii) to provide guidance for operational flood forecasting system development.
We attempt to answer three questions: (a) can we improve residuals modelling with an adequate variable transformation in an extrapolation context? (b) Do more flexible transformations, such as the log–sinh transformation, help in obtaining more reliable predictive uncertainty assessment? (c) If the performance decreases when extrapolating, is there any driver that can help the operational forecasters to predict this performance loss and question the quality of the forecasts?
Section 2 describes the data, the forecast model, the post-processor and the testing methodology chosen to address these questions. Section 3 presents the results of the numerical experiments that are then discussed in Sect. 4. Finally, a number of conclusions and perspectives are proposed.
We used a set of 154 unregulated catchments spread throughout France (Fig.
Characteristics of the 154 catchments, computed over the 1997–2006 data series.
The set of 154 unregulated catchments used in this study. Average altitude is given in metres above sea level (m a.s.l.).
We used discharge forecasts computed by the GRP rainfall–runoff model. The GRP model is designed for flood forecasting and is currently used by the FFS in France in operational conditions
Since this study only examines the ability of the post-processor to extrapolate uncertainty quantification, the model is fed only with observed rainfall (no precipitation forecast) in order to reduce the impact of the input uncertainty. For the same reason, the model is calibrated in forecasting mode over the 10-year series by minimising the sum of squared errors for a lead time taken as the LT. The results will be presented for four lead times (LT / 2, LT, 2 LT and 3 LT) to cover the different behaviours that can be seen when data assimilation is used to reduce errors in an operational flood forecasting context
We used the empirical hydrological uncertainty processor (EHUP) presented in
The basic idea of the EHUP is to estimate empirical quantiles of errors stratified by different flow groups to account for the variation of the forecast error characteristics with forecast magnitude. Since forecast error characteristics also vary with the lead time when data assimilation is used, the EHUP is trained separately for each lead time.
For each lead time separately, the following steps are used:
Training:
The flow groups are obtained by first ordering the forecast–observation pairs according to the forecasted values and then stratifying the pairs into a chosen number of groups (in this study, we used 20 groups), so that each group contains the same number of pairs. Within each flow group, errors are calculated as the difference between the two values of each forecast–observation pair, and several empirical quantiles (we used 99 percentiles) are calculated in order to characterise the distribution of the error values.
Application:
The predictive uncertainty distribution that is associated with a given (deterministic) forecasted value is defined by adding this forecasted value to the empirical quantiles that belong to the same flow group as the forecasted value.
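The training and application steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `train_ehup` and `apply_ehup`, the dictionary layout and the synthetic data are hypothetical, and the error is assumed to be the observation minus the forecast, so that adding the error quantiles to a forecast gives quantiles of the predicted discharge. Forecasts larger than any training forecast are assigned to the highest-flow group.

```python
import numpy as np

def train_ehup(forecasts, observations, n_groups=20, probs=None):
    """Stratify pairs by forecast value and store error quantiles per flow group."""
    if probs is None:
        probs = np.arange(1, 100) / 100.0  # the 99 percentiles
    forecasts = np.asarray(forecasts, dtype=float)
    observations = np.asarray(observations, dtype=float)
    order = np.argsort(forecasts)
    f_sorted = forecasts[order]
    # Assumed sign convention: error = observation - forecast
    errors = observations[order] - f_sorted
    # Groups with (as near as possible) the same number of pairs
    groups = np.array_split(np.arange(f_sorted.size), n_groups)
    upper_bounds = np.array([f_sorted[idx[-1]] for idx in groups])
    quantiles = np.vstack([np.quantile(errors[idx], probs) for idx in groups])
    return {"probs": probs, "upper_bounds": upper_bounds, "quantiles": quantiles}

def apply_ehup(model, forecast):
    """Return predictive quantiles for a single deterministic forecast."""
    # First group whose largest training forecast is >= the new forecast
    g = np.searchsorted(model["upper_bounds"], forecast, side="left")
    # Beyond the training range: reuse the highest-flow group
    g = min(int(g), len(model["upper_bounds"]) - 1)
    return forecast + model["quantiles"][g]

# Synthetic forecast-observation pairs, for illustration only
rng = np.random.default_rng(0)
fc = rng.gamma(shape=2.0, scale=10.0, size=5000)
obs = fc * np.exp(rng.normal(0.0, 0.2, size=fc.size))
model = train_ehup(fc, obs)
predictive_quantiles = apply_ehup(model, forecast=2.0 * fc.max())
```

Because the error quantiles are simply shifted by the forecast value, two forecasts falling in the same flow group receive predictive distributions of identical shape and width.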
Since this study focuses on the extrapolation case, the validation is achieved with deterministic forecasts higher than the highest one used for the calibration. Therefore, only the highest-flow group of the calibration data is used to estimate the uncertainty assessment (to be used on the control data). This highest-flow group contains the top 5 % pairs of the whole training data, ranked by forecasted values. This threshold is chosen as a compromise between focusing on the highest values and using a sufficiently large number of forecast–observation pairs when estimating empirical quantiles of errors. In extrapolation, when the forecast discharge is higher than the highest value of the training period, the predictive distribution of the error is kept constant, i.e. the same values of the empirical quantiles of errors are used, as illustrated in Fig.
The EHUP can be applied after a preliminary data transformation, and by adding a final step to back-transform the predictive distributions obtained in a transformed space. In previous work, we used the log transformation because it ensures that no negative values are obtained when estimating the predictive uncertainty for low flows
Many uncertainty assessment methods mentioned in the Introduction use a variable transformation to handle the heteroscedasticity of the residuals and account for the variation of the prediction distributions with the magnitude of the predicted variable. Here, we briefly recall a number of variable transformations commonly used in hydrological modelling. Let
Three analytical transformations are often met in hydrological studies: the log, Box–Cox and log–sinh transformations. The log transformation is commonly used
The Box–Cox transformation
The inverse transformation explains the final effect on the uncertainty assessment: the constant probability distribution in the transformed space (provided by the EHUP) will result in a distribution in the untransformed space, whose evolution depends on the behaviour of the inverse data transformation. Here, the Box–Cox transformation provides different behaviours, depending on its parameter value (
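The effect of the inverse transformation can be illustrated numerically. The sketch below uses the standard Box–Cox definition, with the log transformation as its limit when the parameter is zero. The parameter name `lam`, the helper names and the error quantiles of ±0.3 in the transformed space are hypothetical, and the value 0.3 for the parameter is arbitrary.

```python
import numpy as np

def box_cox(q, lam):
    """Standard Box-Cox transformation; lam = 0 gives the log transformation."""
    q = np.asarray(q, dtype=float)
    return np.log(q) if lam == 0 else (q**lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Inverse of box_cox (valid while lam * z + 1 > 0)."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

def interval_width(forecast, err_lo, err_hi, lam):
    """Width, in discharge units, of an interval that is constant in transformed space."""
    z = box_cox(forecast, lam)
    return float(inv_box_cox(z + err_hi, lam) - inv_box_cox(z + err_lo, lam))

# Same transformed-space error quantiles applied to two forecast magnitudes
for lam in (0.0, 0.3, 1.0):
    w_100 = interval_width(100.0, -0.3, 0.3, lam)
    w_200 = interval_width(200.0, -0.3, 0.3, lam)
    print(f"lam={lam}: width ratio when the forecast doubles = {w_200 / w_100:.2f}")
```

With the log transformation the interval width is proportional to the forecast, with a parameter of 1 it does not depend on the forecast, and intermediate parameter values give intermediate growth.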
More recently, the log–sinh transformation has been proposed
In addition to the log transformation used by
Predictive 0.1 and 0.9 quantiles when assessed with no transformation, the log transformation, the Box–Cox transformation with its
Another common variable transformation is the normal quantile transformation
The EHUP is a non-parametric approach based on the characteristics of the distribution of residuals over a training data set. Moreover, the Box–Cox and the log–sinh transformations are parametric and require a calibration step. Therefore, the methodology adopted for this study is a split-sample scheme test inspired by the differential split-sample scheme of
To populate the three data subsets with independent data, separate flood events were first selected by an iterative procedure similar to those detailed by
The number of events and their characteristics vary greatly among catchments, as summarised in Table
Characteristics of the events selected for the lead time (LT) over the 1997–2006 data series.
The selected events were then gathered into three event sets – G1, G2 and G3 – based on the magnitude of their peaks and the number of useful time steps for each test phase (training of the EHUP post-processor, calibration of the variable transformations and evaluation of the predictive distributions): G1 contains the lowest events, while the highest events are in G3.
The selection of the data subsets was tailored to study the behaviour of the post-processing approach in an extrapolation context. The control data subset had to encompass only time steps with simulated discharge values higher than those met during the training and calibration steps. Similarly, the calibration data subset had to encompass time steps with simulated discharge values higher than those of the training subset.
To achieve these goals, only the time steps within flood events were used. We distinguished four data subsets, as illustrated in Fig.
Illustration of the selection of the data subsets for the Ill River at Didenheim (668 km
The discharge thresholds used to populate the D1, D2
Since there is only one parameter for the Box–Cox transformation and there are two parameters for the log–sinh transformation, a simple calibration approach of the transformation parameters was chosen: the parameter space was explored by testing several parameter set values. For the Box–Cox transformation, 17 values for the
Note that the hydrological model was calibrated over the whole set of data (1997–2006) to make the best use of the data set, since this study focuses on the effect of extrapolation on the predictive uncertainty assessment only.
We used a two-step procedure, as illustrated in Fig.
Residuals as a function of the forecast discharges in the transformed space. The horizontal dashed lines represent the 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles of the residuals computed during the training phase of the EHUP post-processor, for the highest-flow group (top 5 % pairs of the training data ranked by forecasted values). The straight horizontal lines represent their use in assessing the predictive uncertainty in extrapolation during
Reliability was first assessed by a visual inspection of the probability integral transform (PIT) diagrams
The overall quality of the probabilistic forecasts was evaluated with the continuous ranked probability score
For operational purposes, the sharpness of the probabilistic forecasts was checked by measuring the mean width of the 80 % predictive intervals. A dimensionless relative-sharpness index was obtained by dividing the mean width by the mean runoff:
In addition to the probabilistic criteria presented above, the accuracy of the forecasts was assessed using the Nash–Sutcliffe efficiency (NSE) calculated with the mean values of the predictive distributions (best value: 1).
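The relative-sharpness index and the NSE described above can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the "mean runoff" is taken to be the mean observed discharge over the evaluated time steps, and `simulated` stands for the mean values of the predictive distributions.

```python
import numpy as np

def relative_sharpness(q10, q90, observed):
    """Mean width of the 80 % predictive interval divided by the mean observed discharge."""
    q10 = np.asarray(q10, dtype=float)
    q90 = np.asarray(q90, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean(q90 - q10) / np.mean(observed))

def nse(simulated, observed):
    """Nash-Sutcliffe efficiency (best value: 1)."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    residual_ss = np.sum((observed - simulated) ** 2)
    total_ss = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - residual_ss / total_ss)

# Illustrative values only
obs = np.array([1.0, 2.0, 3.0, 4.0])
print(relative_sharpness(obs - 0.5, obs + 0.5, obs))
print(nse(obs, obs))
```

A smaller relative-sharpness value means narrower intervals, which is only desirable when the forecasts remain reliable.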
Since the calibration step aims at selecting the most reliable description of the residuals in extrapolation, the
In cases where an equal value of the
Distributions of the
Distribution over the basins of the values of the Box–Cox transformation parameter obtained during the calibration step for the four different lead times.
Figure
Figures
Distribution over the basins of the values of the log–sinh transformation parameters obtained during the calibration step for the four different lead times.
First, we conducted a visual inspection of the PIT diagrams, which convey an evaluation of the overall reliability of the probabilistic forecasts (examples in Fig.
Examples of PIT diagrams obtained with the control data set D3, with different transformations at four locations.
Then the distribution of the
Distributions of the
Scatter plots of the reliability
In operational settings, non-exceedance frequencies of the quantiles of the predictive distribution, which are the lower and upper bounds of the predictive interval communicated to the authorities, are of particular interest. The 80 % predictive interval (bounded by the 0.1 and 0.9 quantiles) is mostly used in France. It is expected that the non-exceedance frequency of the lower bound and the exceedance frequency of the upper bound remain close to 10 % for a reliable predictive distribution. Deviations from these frequencies indicate biases in the estimated quantiles. Figure
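As a minimal sketch of this check, the frequencies at both bounds of the 80 % interval can be counted directly from the observations and the predictive quantiles. The function name and the returned tuple are hypothetical, and strict inequalities are an arbitrary choice for handling ties.

```python
import numpy as np

def bound_frequencies(q10, q90, observed):
    """Return (frequency below the 0.1 quantile,
               frequency above the 0.9 quantile,
               coverage rate of the 80 % interval)."""
    q10 = np.asarray(q10, dtype=float)
    q90 = np.asarray(q90, dtype=float)
    observed = np.asarray(observed, dtype=float)
    below = float(np.mean(observed < q10))
    above = float(np.mean(observed > q90))
    return below, above, 1.0 - below - above

# Illustrative values only: one observation below, one above, eight inside
obs = np.array([0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 10.0])
lower = np.full(10, 2.0)
upper = np.full(10, 8.0)
below, above, coverage = bound_frequencies(lower, upper, obs)
```

For a reliable predictive distribution both frequencies are expected to stay close to 0.1, so a much larger value at the upper bound would point to upper quantiles that are too low.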
Distributions over the catchment set of
In addition to reliability, we looked at other qualities of the probabilistic forecasts, namely the overall performance (measured by the CRPSS) and accuracy (measured by NSE). We also checked their sharpness (relative-sharpness metric). The distributions of four performance criteria are shown for LT in Fig.
Distributions of coverage rate, relative-sharpness, CRPSS and NSE values over the catchment set on the control data set D3, obtained with the different transformations tested (the filled box plots are related to calibrated transformations). The optimal values are represented by the horizontal dashed lines.
For operational forecasters, it is important to be able to predict when they can trust the forecasts issued by their models and when their quality becomes questionable. Therefore we investigated whether the reliability and reliability loss observed in an extrapolation context were correlated with some properties of the forecasts. First, Fig.
Comparison of the
In addition, two indices were chosen to describe the degree of extrapolation: the ratio of the median of the forecasted discharges in D3 over the median of the forecasted discharge in D2
Finally, we found no correlation with the relative accuracy of the deterministic forecasts either. The goodness of fit during the calibration phase cannot be used as an indicator of the robustness of the uncertainty estimation in an extrapolation context (see Supplement for figures).
Overall, the results obtained for the control data set suggest that the log transformation and the fixed Box–Cox transformation (BC
These results could be explained by the fact that the calibration did not result in the optimally relevant parameter set. To investigate whether another calibration strategy could yield better results, we compared the performance on the control D3 data set when the calibration is achieved on the D2
Besides the reduction of heteroscedasticity, many studies use post-processors which are explicitly based on the assumption of a Gaussian distribution and use data transformations to fulfil this hypothesis
We used the Shapiro–Francia test where the null hypothesis is that the data are normally distributed. For each parametric transformation, we selected the parameter set of the calibration grid which obtains the highest
Even if there is no theoretical advantage to using the Gaussian distribution calibrated on the transformed-variable residuals rather than the empirical distribution to assess the predictive uncertainty, we tested the impact of this choice. For each transformation, the predictive uncertainty assessment obtained with the empirical transformed-variable distribution of residuals is compared to the assessment based on the Gaussian distribution whose mean and variance are those of the empirical distribution. Figure
Distributions of the
Investigations on the impact of the choice between the empirical and the Gaussian distributions on the post-processor performance are shown in the Supplement. They show that the choice of the distribution is not the dominant factor.
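The comparison between the two distributions can be sketched as follows, assuming a sample of transformed-variable residuals is available. The function name and the synthetic residuals are hypothetical. The Gaussian distribution is given the mean and the (population) variance of the sample, as described above.

```python
import numpy as np
from statistics import NormalDist

def empirical_and_gaussian_quantiles(residuals, probs=(0.1, 0.5, 0.9)):
    """Quantiles of the residual sample and of the Gaussian with the same mean and variance."""
    residuals = np.asarray(residuals, dtype=float)
    empirical = np.quantile(residuals, probs)
    gauss = NormalDist(mu=float(residuals.mean()), sigma=float(residuals.std()))
    gaussian = np.array([gauss.inv_cdf(p) for p in probs])
    return empirical, gaussian

# Skewed synthetic residuals, for illustration only
rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0, size=10_000)
emp_q, gau_q = empirical_and_gaussian_quantiles(sample)
```

The two sets of quantiles coincide only when the residuals are close to Gaussian in the transformed space; for a skewed sample, the Gaussian median equals the sample mean and therefore differs from the empirical median.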
In most modelling studies, several methodological steps depend on the range of the observations. First, calibration is designed to limit the residual errors in the available historical data. However the largest residuals are often associated with the highest discharge values. It is well known that removing the largest flood events from a data set can significantly modify the resulting calibrated parameter set. This is particularly true with the use of some common criteria, such as quadratic criteria, which strongly emphasise the largest errors
However, to provide robust models for operational purposes, we also need to focus on rare (rarely observed) events, still keeping in mind all the well-known issues associated with working with (too) few data
Even if major floods are rare, it is of the utmost importance that the forecasts issued during such events are reliable to facilitate efficient crisis management. Like Lieutenant Drogo in the Tartar Steppe, who spent his entire life fulfilling his day-to-day duties but waiting in his fortress for the invasion by foes
We use this framework to test the predictive uncertainty assessment using a statistical post-processing of a rainfall–runoff model, based on a variable transformation. The latter has to handle the heteroscedasticity and the evolution of the other predictive uncertainty distribution properties with the discharge magnitude to issue reliable uncertainty assessment, which is very problematic in an extrapolation context. As pointed out by
Using the proposed framework for an evaluation in an extrapolation context, we showed the following:
Using an appropriate variable transformation can significantly improve the predictive distribution and its reliability. However, a performance loss still remains in an extrapolation context with any of the three transformations we tested.
The transformations with more calibrated parameters do not achieve significantly better results than the transformations with no calibrated parameter:
while it allows a flexibility which can theoretically be very attractive in an extrapolation context, the log–sinh transformation is not more reliable in such a context; the uncalibrated log transformation and Box–Cox transformation with the
We did not find any variable significantly correlated with the performance loss in an extrapolation context.
The findings reported herein corroborate the results of
Importantly, these results reveal significant performance losses in some catchments when it comes to extrapolation, whatever variable transformation is used. Even if the scheme tested yields satisfying results in terms of reliability for the majority of catchments, it fails in a significant number of catchments, and further investigations are needed to gain a deeper understanding of when and why failures occur.
We used the framework designed by
Though no variable was found to be correlated to the performance loss, the investigations should be continued using a wider set of variables. First, it may open new perspectives to explain these losses and improve our understanding of the flaws of the hydrological model and of the EHUP. Furthermore, it would be very useful to help operational forecasters to detect the hydrological situations for which their forecasts have to be questioned (in particular during major events when forecasts are made in an extrapolation context).
Furthermore, improving the calibration strategy and using a regionalisation of the predictive distribution assessment, as proposed in
Finally, more studies focusing on the extrapolation context may help to better elucidate the limitations of the modelling (hydrological model structure, calibration, post-processing etc.) and their consequences for practical matters. Such studies should be encouraged as a key to better and more reliable flood forecasting.
The GRP model belongs to the suite of GR models. It combines a production function, which is the same as in the well-known GR4J model, with a routing function, which is a simplified version of the GR4J routing function since it has only one flow branch composed of a unit hydrograph and a quadratic routing store. The tests showed that the performance of the GRP and GR4J structures was similar in forecasting mode.
The GRP model flow chart. After an interception step, the production function splits the net rainfall (
A snow module
Like any GR model, it is parsimonious. It has only three parameters: (a) an adjustment factor of effective rainfall, which contributes to finding a good water balance; (b) the unit hydrograph time base used to account for the time lag between rainfall and streamflow; and (c) the capacity of the routing store, which temporally smooths the output signal.
Its main difference with the other GR models is the implementation “in the loop” of two data assimilation schemes:
a state-updating procedure which modifies the main state of the routing function as a function of the last discharge values;
an output updating based on either an autoregressive model of the multiplicative error or an artificial neural network (multi-layer perceptron) whose inputs are the last discharge value and the last two forecasting errors. In this study, the autoregressive model was used.
The parameters are calibrated in forecasting mode, i.e. with the application of the updating procedures. This model is used by the French flood forecasting services, some hydroelectricity suppliers and canal managers at an hourly time step in order to issue real-time forecasts for lead times ranging from a few hours to a few days at several hundred sites. Recently,
In this study, we used the formulations of the log–sinh transformation chosen by
Depending on the relative values of
Summary of the different behaviours of the log–sinh transformation.
When
If
This happens when
the
Moreover, when
As pointed out by McInerney et al. (2017), when
The
Data for this work were provided by Météo France (climatic data) and the hydrometric services of the French government (flow data). Readers can freely access streamflow observations used in this study at the Banque HYDRO website (
The supplement related to this article is available online at:
All co-authors contributed to the study and the article. LB and FB contributed equally and share co-first authorship.
The authors declare that they have no conflict of interest.
The authors thank the editor Dimitri Solomatine, Kolbjorn Engeland and one anonymous reviewer. Both reviewers provided insightful comments which helped to greatly improve this text. Thanks are also due to Météo-France for providing the meteorological data and Banque HYDRO database for the hydrological data. The contribution of the authors from Université Gustave Eiffel and INRAE was financially supported by SCHAPI (Ministry of the Environment).
This research has been supported by SCHAPI (Ministry of the Environment) (grant nos. 21367400 (2018), 2201132931 (2018) and 2102615443 (2019)).
This paper was edited by Dimitri Solomatine and reviewed by Kolbjorn Engeland and one anonymous referee.