A crash-testing framework for predictive uncertainty assessment when forecasting high flows in an extrapolation context

An increasing number of flood forecasting services assess and communicate the uncertainty associated with their forecasts. While obtaining reliable forecasts is a key issue, it is a challenging task, especially when forecasting high flows in an extrapolation context, i.e. when the event magnitude is larger than what was observed before. In this study, we present a crash-testing framework that evaluates the quality of hydrological forecasts in an extrapolation context. The experimental set-up is based on (i) a large set of catchments in France, (ii) the GRP rainfall–runoff model designed for flood forecasting and used by the French operational services and (iii) an empirical hydrologic uncertainty processor designed to estimate conditional predictive uncertainty from the hydrological model residuals. The variants of the uncertainty processor used in this study differ in the data transformation they use (log, Box–Cox and log–sinh) to account for heteroscedasticity and the evolution of the other properties of the predictive distribution with the discharge magnitude. Different data subsets were selected based on a preliminary event selection. Various aspects of the probabilistic performance of the variants of the hydrologic uncertainty processor (reliability, sharpness and overall quality) were evaluated. Overall, the results highlight the challenge of uncertainty quantification when forecasting high flows. They show a significant drop in reliability when forecasting high flows in an extrapolation context and considerable variability among catchments and across lead times. The increase in statistical treatment complexity did not result in significant improvement, which suggests that a parsimonious and easily understandable data transformation such as the log transformation or the Box–Cox transformation can be a reasonable choice for flood forecasting.

1 Introduction

1.1 The big one: dream or nightmare for the forecaster?
In many countries, operational flood forecasting services (FFS) issue forecasts routinely throughout the year and during rare or critical events. End users are mostly concerned by the largest and most damaging floods, when critical decisions have to be made. For such events, operational flood forecasters must be prepared to deal with extrapolation, i.e. to work on events of a magnitude that they and their models have seldom or never met before. The relevance of simulation models and their calibration in evolving conditions, such as contrasted climate conditions and climate change, has been studied by several authors. For example, Wilby (2005), Vaze et al. (2010), Merz et al. (2011) and Brigode et al. (2013) explored the transferability of hydrological model parameters from one period to another and assessed the uncertainty associated with this parameterisation, while Coron et al. (2012) proposed a generalisation of the differential split-sample test (Klemes, 1986). In spite of its importance in operational contexts, only a few studies have addressed the extrapolation issue for flow forecasting, to the best of our knowledge, with the notable exception of data-driven approaches (e.g. Todini, 2007). Imrie et al. (2000), Cigizoglu (2003) and Giustolisi and Laucelli (2005) evaluated the ability of trained artificial neural networks (ANNs) to extrapolate beyond the calibration data and showed that ANNs used for hydrological modelling may have poor generalisation properties. Singh et al. (2013) studied the impact of extrapolation on hydrological prediction with a conceptual model, and Barbetta et al. (2017) expressed concerns for the extrapolation context defined as floods of a magnitude not encountered during the calibration phase.
Addressing the extrapolation issue involves a number of methodological difficulties. Some data issues are specific to the data used for hydrological modelling, such as the rating curve reliability (Lang et al., 2010). Other well-known issues are related to the calibration process: are the parameters, which are calibrated on a limited set of data, representative or at least somewhat adapted to other contexts? A robust modelling approach for operational flood forecasting, i.e. a method able to provide relevant forecasts in conditions not met during the calibration phase, requires paying special attention to the behaviour of hydrological models and the assessment of predictive uncertainty in an extrapolation context.

1.2 Obtaining reliable forecasts remains a challenging task
Even if significant progress has been made and implemented in operational flood forecasting systems (e.g. Bennett et al., 2014; Demargne et al., 2014; Pagano et al., 2014), some uncertainty remains. In order to achieve efficient crisis management and decision making, communication of reliable predictive uncertainty information is therefore a prerequisite (Todini, 2004; Pappenberger and Beven, 2006; Demeritt et al., 2007; Verkade and Werner, 2011). Hereafter, reliability is defined as the statistical consistency between the observations and the predictive distributions (Gneiting et al., 2007). The uncertainty associated with operational forecasts is most often described by a predictive uncertainty distribution. Assessing a reliable predictive uncertainty distribution is challenging because hydrological forecasts yield residuals that show heteroscedasticity (i.e. an error variance that increases with discharge), temporal autocorrelation, skewness, etc. Some studies (e.g. Yang et al., 2007b; Schoups and Vrugt, 2010) account for these properties in the calibration of hydrological models within a Bayesian framework, using specific likelihood formulations. In an extrapolation context, it is of utmost importance that the predictive uncertainty assessment provides a correct description of the evolution of the predictive distribution properties with the discharge magnitude. Bremnes (2019) showed that the skewness of the wind speed distribution depends on the forecasted wind. Modelled residuals of discharge forecasts often exhibit high heteroscedasticity (Yang et al., 2007a). McInerney et al. (2017) focused their study on representing the error heteroscedasticity of discharge forecasts with respect to simulated streamflow.
To achieve reliable forecasts, a correct description of the heteroscedasticity, either explicitly or implicitly, is necessary.
Various approaches have been developed to assess the uncertainty in hydrological predictions (see e.g. Montanari, 2011). The first step consists of identifying the different sources of uncertainty, or at least the most important ones that have to be taken into account in a given context. In the context of flood forecasting, it is now common to decompose the total uncertainty into its two main components: the input uncertainty (mainly the meteorological forecast uncertainty) and the modelling uncertainty, as proposed by Krzysztofowicz (1999). More generally, the predictive uncertainty due to various sources may be explicitly modelled and propagated through the modelling chain, while the "remaining" uncertainty (from the other sources) may then be assessed by statistical post-processing.

1.2.1 Modelling each source of uncertainty
A first approach intends to model each source of uncertainty separately and to propagate these uncertainties through the modelling chain (e.g. Renard et al., 2010). Following this approach, the predictive uncertainty distribution results from the separate modelling of each relevant source of uncertainty (e.g. in their study on hydrological prediction, Renard et al. (2010) did not have to consider the uncertainty in meteorological forecasts) and from the statistical model specification. While this approach is promising, operational application can be hindered by the challenge of making the hydrological modelling uncertainty explicit, as pointed out by Salamon and Feyen (2009).
In practice, it is necessary to consider the meteorological forecast uncertainty to issue hydrological forecasts. Ensemble approaches aim to account for this source of uncertainty and are increasingly popular in the research and operational forecasting communities. An increasing number of hydrological ensemble forecasting systems are in operational use and have proved their usefulness, e.g. the European Flood Awareness System (EFAS; Ramos et al., 2007; Thielen et al., 2009; Pappenberger et al., 2011, 2016) and the Hydrologic Ensemble Forecast Service (HEFS; e.g. Demargne et al., 2014).
Multi-model approaches can be used to assess modelling uncertainty (Velazquez et al., 2010; Seiller et al., 2017). While promising, this approach requires the implementation and the maintenance of a large number of models, which can be burdensome in operational conditions. There is no evidence that such an approach ensures that the heteroscedasticity of the predictive uncertainty distribution would be correctly assessed.
In forecasting mode, data assimilation schemes based on statistical modelling are commonly used to reduce and assess the predictive uncertainty. Some algorithms such as particle filters (Moradkhani et al., 2005a; Salamon and Feyen, 2009; Abbaszadeh et al., 2018) or the ensemble Kalman filter (Moradkhani et al., 2005b) provide an assessment of the predictive uncertainty as a direct result of data assimilation ("in the loop"). Some of these approaches can explicitly account for the desired properties of the predictive uncertainty distribution, such as heteroscedasticity, through the likelihood formulation.

1.2.2 Post-processing approaches
Alternatively, numerous post-processors of deterministic or probabilistic models have been developed to account for the uncertainty from sources that are not modelled explicitly. They differ in several aspects (see a recent review by Li et al., 2017). Most approaches are conditional: the predictive uncertainty is modelled with respect to a predictor, which most often is the forecasted value (Todini, 2007, 2009). Some methods are based on predictive distribution modelling, while others can be described as "distribution-free", as mentioned by Breiman (2001). Among the former, many approaches are built in a statistical regression framework to assess the total or remaining predictive uncertainty. Examples are the hydrologic uncertainty processor (HUP) in a Bayesian forecasting system (BFS) framework (Krzysztofowicz, 1999; Krzysztofowicz and Maranzano, 2004), the model-conditional processor (MCP; Todini, 2008; Coccia and Todini, 2011; Barbetta et al., 2017), the meta-Gaussian model of Montanari and Grossi (2008) or the Bayesian joint probability (BJP) method (Wang et al., 2009), among others. The latter approaches build a description of the predictive residuals from past error series, for example with data-learning algorithms (Solomatine and Shrestha, 2009). Some related methods are the non-parametric approach of Van Steenbergen et al. (2012), the empirical hydrological uncertainty processor of Bourgin et al. (2014) or the k-nearest neighbours method of Wani et al. (2017). The quantile regression (QR) framework (Weerts et al., 2011; Dogulu et al., 2015; Verkade et al., 2017) lies in between, in that it introduces an assumption of a linear relationship between the forecasted discharge and the quantiles of interest.

1.2.3 Combining different approaches
The approaches presented in Sects. 1.2.1 and 1.2.2 are not mutually exclusive. Even when future precipitation is the main source of uncertainty, post-processing is often required to produce reliable hydrological ensembles (Zalachori et al., 2012; Hemri et al., 2015; Abaza et al., 2017; Sharma et al., 2018). Thus, many operational flood forecasting services use post-processing techniques to assess hydrological modelling uncertainty, while meteorological uncertainty is taken into account separately (Berthet and Piotte, 2014). Post-processors are then trained with "perfect" future rainfall (i.e. equal to the observations). Moreover, even for assessing modelling uncertainty, combining the strengths of several methodologies may improve the results.
Note that many of these approaches use a variable transformation to handle the heteroscedasticity and more generally the evolution of the predictive distribution properties with respect to the forecasted discharge. Some have no calibrated parameter, while others encompass a few calibrated parameters, allowing more flexibility in the predictive distribution assessment. More details on commonly used variable transformations are presented in Sect. 2.1.4.

1.3 Scope
In this article, we focus on uncertainty assessment with a post-processing approach based on residuals modelling. While the operational goal is to improve the hydrological forecasting, this study does not consider the meteorological forecast uncertainty: it only focuses on the hydrological modelling uncertainty, as these two main sources of uncertainty can be considered independently (e.g. Bourgin et al., 2014). Del Giudice et al. (2013) and McInerney et al. (2017) presented interesting comparisons of different variable transformations used for residuals modelling. Yet, their studies do not focus on the extrapolation context. Since achieving a reliable predictive uncertainty assessment in an extrapolation context is a challenging task likely to remain imperfect if the stability of the characteristics of the predictive distributions is not properly ensured, it requires a specific crash-testing framework. The objectives of this article are (i) to present a framework aimed at testing the hydrological modelling and uncertainty assessment in the extrapolation context, (ii) to assess the ability and the robustness of a post-processor to provide reliable predictive uncertainty assessments for large floods when different variable transformations are used and (iii) to provide guidance for operational flood forecasting system development.
We attempt to answer three questions: (a) can we improve residuals modelling with an adequate variable transformation in an extrapolation context? (b) Do more flexible transformations, such as the log-sinh transformation, help in obtaining more reliable predictive uncertainty assessment? (c) If the performance decreases when extrapolating, is there any driver that can help the operational forecasters to predict this performance loss and question the quality of the forecasts?
Section 2 describes the data, the forecast model, the post-processor and the testing methodology chosen to address these questions. Section 3 presents the results of the numerical experiments, which are then discussed in Sect. 4. Finally, a number of conclusions and perspectives are proposed.

2 Data and methods

2.1 Data and models

2.1.1 Catchment set

We used a set of 154 unregulated catchments spread throughout France (Fig. 1), covering various hydrological regimes and forecasting contexts, to provide robust answers to our research questions (Andréassian et al., 2006; Gupta et al., 2014). They represent a large variability in climate, topography and geology in France (Table 1), although their hydrological regimes are little or not at all influenced by snow accumulation. Hourly rainfall, potential evapotranspiration (PE) and streamflow data series were available over the 1997-2006 period. PE was estimated using a temperature-based formula (Oudin et al., 2005). Rainfall and temperature data come from a radar-based reanalysis produced by Météo-France (Tabary et al., 2012). Discharge data were extracted from the national HYDRO streamflow archive (Leleu et al., 2014). When a hydrological model is used to issue forecasts, it is often necessary to compare the lead time to a characteristic time of the catchment (Sect. 2.1.2). For each catchment, the lag time (LT) is estimated as the lag maximising the cross-correlation between rainfall and discharge time series.
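The lag-time estimation can be sketched as follows; `lag_time` is a hypothetical helper written for illustration, not code from the study, and the maximum lag tested is an arbitrary choice:

```python
import numpy as np

def lag_time(rainfall, discharge, max_lag=120):
    """Estimate the catchment lag time (in time steps) as the lag that
    maximises the cross-correlation between rainfall and discharge."""
    r = np.asarray(rainfall, dtype=float)
    q = np.asarray(discharge, dtype=float)
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        if lag == 0:
            c = np.corrcoef(r, q)[0, 1]
        else:
            # correlate rainfall with discharge shifted 'lag' steps later
            c = np.corrcoef(r[:-lag], q[lag:])[0, 1]
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag

# Synthetic example: discharge responds to rainfall with a 5-step delay.
rng = np.random.default_rng(0)
rain = rng.gamma(2.0, 1.0, size=500)
flow = np.concatenate([np.zeros(5), rain[:-5]]) + 0.01 * rng.normal(size=500)
```

On this synthetic series the estimated lag is the imposed 5-step delay.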

2.1.2 Hydrological model
We used discharge forecasts computed by the GRP rainfall–runoff model. The GRP model is designed for flood forecasting and is currently used by the FFS in France in operational conditions (Furusho et al., 2016; Viatgé et al., 2018). It is a deterministic lumped storage-type model that uses catchment areal rainfall and PE as inputs. In forecasting mode (Appendix A), the model also assimilates discharge observations available when issuing a forecast to update the main state variable of the routing function and to update the output discharge. In this study, it is run at an hourly time step, and forecasts are issued for several lead times ranging from 1 to 72 h. More details about the GRP model can be found in Appendix A.
Since only the ability of the post-processor to extrapolate the uncertainty quantification is studied herein, the model is fed with observed rainfall only (no precipitation forecast), in order to reduce the impact of the input uncertainty. For the same reason, the model is calibrated in forecasting mode over the 10-year series by minimising the sum of squared errors for a lead time taken as the LT. The results will be presented for four lead times (LT/2, LT, 2 LT and 3 LT) to cover the different behaviours that can be seen when data assimilation is used to reduce errors in an operational flood forecasting context.

2.1.3 Empirical hydrological uncertainty processor (EHUP)
We used the empirical hydrological uncertainty processor (EHUP) presented in Bourgin et al. (2014). It is a data-based, non-parametric approach to estimating the conditional predictive uncertainty distribution. This post-processor has been compared to other post-processors in earlier studies and proved to provide relevant results (Bourgin, 2014). It is now used by operational FFS in France within the operational tool called OTAMIN (Viatgé et al., 2019).
The main difference with many other post-processors (such as the MCP, the meta-Gaussian processor, the BJP or the BFS) is that no assumption is made about the shape of the uncertainty distribution, which brings more flexibility to represent the various forecast error characteristics encountered in large-sample modelling. We will discuss the impact of this choice in Sect. 4.2. The basic idea of the EHUP is to estimate empirical quantiles of errors stratified by flow groups, to account for the variation of the forecast error characteristics with the forecast magnitude. Since forecast error characteristics also vary with the lead time when data assimilation is used, the EHUP is trained separately for each lead time, in the following steps:

1. Training:

- The flow groups are obtained by first ordering the forecast-observation pairs according to the forecasted values and then stratifying the pairs into a chosen number of groups (in this study, we used 20 groups), so that each group contains the same number of pairs.
- Within each flow group, errors are calculated as the difference between the two values of each forecast-observation pair, and several empirical quantiles (we used 99 percentiles) are calculated in order to characterise the distribution of the error values.

2. Application:

- The predictive uncertainty distribution that is associated with a given (deterministic) forecasted value is defined by adding this forecasted value to the empirical quantiles that belong to the same flow group as the forecasted value.
Since this study focuses on the extrapolation case, the validation is achieved with deterministic forecasts higher than the highest one used for the calibration. Therefore, only the highest-flow group of the calibration data is used to estimate the uncertainty assessment (to be used on the control data). This highest-flow group contains the top 5 % pairs of the whole training data, ranked by forecasted values. This threshold is chosen as a compromise between focusing on the highest values and using a sufficiently large number of forecast-observation pairs when estimating empirical quantiles of errors. In extrapolation, when the forecast discharge is higher than the highest value of the training period, the predictive distribution of the error is kept constant, i.e. the same values of the empirical quantiles of errors are used, as illustrated in Fig. 5.
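The training and application steps, including the constant-extrapolation rule, can be sketched as follows; function and variable names are illustrative and not taken from the operational OTAMIN code:

```python
import numpy as np

def train_ehup(forecast, observed, n_groups=20):
    """Train the EHUP: order forecast-observation pairs by forecasted value,
    stratify them into equal-size flow groups and store 99 empirical error
    percentiles per group."""
    order = np.argsort(forecast)
    f, o = np.asarray(forecast)[order], np.asarray(observed)[order]
    groups = np.array_split(np.arange(len(f)), n_groups)
    upper_bounds = [f[g[-1]] for g in groups]    # highest forecast of each group
    error_quantiles = [np.percentile(o[g] - f[g], np.arange(1, 100))
                       for g in groups]
    return upper_bounds, error_quantiles

def apply_ehup(new_forecast, upper_bounds, error_quantiles):
    """Predictive distribution = forecasted value + error quantiles of its
    flow group; beyond the highest training forecast, the highest group's
    quantiles are kept constant (the extrapolation rule described above)."""
    i = int(np.searchsorted(upper_bounds, new_forecast))
    i = min(i, len(error_quantiles) - 1)         # constant beyond training range
    return new_forecast + error_quantiles[i]

# Toy example with a 10 % positive bias in the observations.
f = np.arange(1.0, 201.0)
o = 1.1 * f
bounds, quants = train_ehup(f, o, n_groups=4)
pred = apply_ehup(1000.0, bounds, quants)        # far beyond the training range
```

In this toy case, the predictive quantiles for a forecast of 1000 are shifted by the errors of the highest-flow group only (between 15.1 and 20.0), illustrating how the error distribution stays frozen in extrapolation.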
The EHUP can be applied after a preliminary data transformation, by adding a final step to back-transform the predictive distributions obtained in the transformed space. In previous work, we used the log transformation because it ensures that no negative values are obtained when estimating the predictive uncertainty for low flows. When estimating the predictive uncertainty for high flows, the data transformation has a strong impact in extrapolation, because the variation of the extrapolated predictive distribution, which is constant in the transformed space, is controlled in the real space by the behaviour of the inverse transformation, as explained below.

2.1.4 The different transformation families
Many uncertainty assessment methods mentioned in the Introduction use a variable transformation to handle the heteroscedasticity of the residuals and account for the variation of the predictive distributions with the magnitude of the predicted variable. Here, we briefly recall a number of variable transformations commonly used in hydrological modelling. Let y and ỹ be the observed and forecasted variables (here, the discharge), respectively, and ε = y − ỹ the residuals. When using a transformation g, we consider instead the transformed residuals ε_g = g(y) − g(ỹ).
Three analytical transformations are often met in hydrological studies: the log, Box-Cox and log-sinh transformations. The log transformation is commonly used (e.g. Morawietz et al., 2011):

g(y) = log(y + a),    (1)

where a is a small positive constant used to deal with y values close to 0. It can be taken as equal to 0 when focusing on large discharge values. This transformation has no parameter to be calibrated. Applying a statistical model to residuals computed on log-transformed variables may be interpreted as using a corresponding model of multiplicative error (e.g. assuming a Gaussian model for residuals of log-transformed discharges is equivalent to a log-normal model of the multiplicative errors y/ỹ). Therefore, it may be adapted to strongly heteroscedastic behaviours. It has been used successfully to assess hydrological uncertainty (Yang et al., 2007b; Schoups and Vrugt, 2010; Bourgin et al., 2014).

The Box-Cox transformation (Box and Cox, 1964) is a classic transformation that is quite popular in the hydrological community (e.g. Yang et al., 2007b; Wang et al., 2009; Hemri et al., 2015; Reichert and Mieleitner, 2009; Singh et al., 2013; Del Giudice et al., 2013):

g(y) = ((y + a)^λ − 1) / λ.    (2)

Here a is taken as being equal to 0 (as for the log transformation); the Box-Cox transformation is then a one-parameter transformation. It makes it possible to cover very different behaviours. The log transformation is a special case of the Box-Cox transformation, obtained in the limit when the calibration results in λ = 0. In contrast, applying the Box-Cox transformation with λ = 1 to the variable y to model the distribution of their residuals is equivalent to applying no transformation (an affine transformation of y).

More recently, the log-sinh transformation has been proposed (Wang et al., 2012; Pagano et al., 2013). It is a two-parameter transformation:

g(y) = β log(sinh((α + y) / β)).    (3)

This transformation provides more flexibility. Indeed, for y ≫ α and y ≫ β, the log-sinh transformation reduces to no transformation, while for α ≪ y ≪ β it is equivalent to the log transformation (Fig. 2).
Thus, with the same parameterisation, it can result in very different behaviours depending on the magnitude of the discharge. Applying no transformation may be intuitively attractive when modelling the distribution of residuals for large discharge values when the variance is no longer expected to increase (homoscedastic behaviour). It is then particularly attractive when modelling predictive uncertainty in an extrapolation context, in order to avoid an excessively "explosive" assessment of the predictive uncertainty for large discharge values. In addition to the log transformation used by Bourgin et al. (2014), in this study we tested the Box-Cox and the log-sinh transformations to explore more flexible ways to deal with the challenge of extrapolating prediction uncertainty distributions (Fig. 2). The impacts of the data transformations used in this study are illustrated in Fig. 3.
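The three transformations can be written compactly as below; the log-sinh form is one common (α, β) parameterisation consistent with the limiting behaviours described above, while the study's exact parameterisation is detailed in its Appendix B:

```python
import numpy as np

def log_transform(y, a=0.0):
    """g(y) = log(y + a); a = 0 when focusing on large discharges."""
    return np.log(y + a)

def box_cox(y, lam, a=0.0):
    """Box-Cox; tends to the log transformation as lam -> 0 and is an
    affine map (equivalent to no transformation) for lam = 1."""
    if lam == 0.0:
        return np.log(y + a)
    return ((y + a) ** lam - 1.0) / lam

def log_sinh(y, alpha, beta):
    """log-sinh: ~log transformation for alpha << y << beta and ~affine
    (no transformation) for y >> beta."""
    return beta * np.log(np.sinh((alpha + y) / beta))
```

For instance, `box_cox(y, 1.0)` simply shifts y by 1, while `log_sinh(100.0, 1.0, 5.0)` is within 10⁻⁶ of the affine value 101 − 5 log 2, illustrating the two regimes.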
Another common variable transformation is the normal quantile transformation (NQT; e.g. Kelly and Krzysztofowicz, 1997). It is a transformation without a calibrated parameter linking a given distribution and the Gaussian distribution, quantile by quantile. While several hydrological processors such as the HUP, MCP and QR encompass the NQT-transformed variables, Bogner et al. (2012) warn against the drawbacks of this transformation, which is by construction not suited for the extrapolation context and requires additional assumptions to model the tails of the distribution (Weerts et al., 2011; Coccia and Todini, 2011). This is why we did not test this transformation in this study focused on the extrapolation context.

Figure 2. The inverse transformation explains the final effect on the uncertainty assessment: the constant probability distribution in the transformed space (provided by the EHUP) will result in a distribution in the untransformed space, whose evolution depends on the behaviour of the inverse data transformation. Here, the Box-Cox transformation provides different behaviours, depending on its parameter value (λ). It ranges from an affine transformation (equivalent to no transformation; red dashed line) to the log transformation (thick green dashed line). With a single parameterisation, the log-sinh transformation can be equivalent to the log transformation for values of y much smaller than the value of its parameter β and equivalent to an affine transformation for large values of y (much higher than β; see Appendix B).
2.2 Methodology: a testing framework designed for extrapolation context assessment

2.2.1 Testing framework
The EHUP is a non-parametric approach based on the characteristics of the distribution of residuals over a training data set, while the Box-Cox and the log-sinh transformations are parametric and require a calibration step. Therefore, the methodology adopted for this study is a split-sample test inspired by the differential split-sample test of Klemes (1986) and based on three data subsets: a data set for training the EHUP, a data set for calibrating the parameters of the variable transformation and a control data set for evaluating the predictive distributions when extrapolating high flows.

2.2.2 Events selection
To populate the three data subsets with independent data, separate flood events were first selected by an iterative procedure similar to those detailed by Lobligeois et al. (2014) and Ficchì et al. (2016): (1) the maximum forecasted discharge of the whole time series was selected; (2) within a 20 d period before (after) the peak flow, the beginning (end) of the event was placed at the preceding (following) time step closest to the peak at which the streamflow is lower than 20 % (25 %) of the peak flow value; and (3) the event was kept if there were fewer than 10 % missing values, if the discharges at the beginning and end of the event were lower than 66 % of the peak flow value and if the peak value was higher than the median value of the time series. The process is then iterated over the remaining data to select all events. A minimum time lapse of 24 h was enforced between two events, ensuring that consecutive events do not overlap and that the autocorrelation between the time steps of two separate events remains limited. The number of events and their characteristics vary greatly among catchments, as summarised in Table 2. Note that the events selected for one catchment can differ slightly for the four lead times considered in this study, because the selection was made using the forecasted discharge and not the observed discharge.
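A simplified sketch of this iterative selection is given below for an hourly series (so the 20 d window is 480 steps). The missing-value and 66 % checks are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def select_events(q, window=480, start_frac=0.20, end_frac=0.25, sep=24):
    """Iteratively select flood events on a discharge series, from the
    largest remaining peak downwards, keeping only peaks above the median
    and enforcing a 24-step separation between events."""
    q = np.asarray(q, dtype=float)
    min_peak = np.median(q)
    available = np.ones(len(q), dtype=bool)
    events = []
    while True:
        masked = np.where(available, q, -np.inf)
        p = int(np.argmax(masked))               # largest remaining peak
        if not np.isfinite(masked[p]) or masked[p] <= min_peak:
            break
        lo = p                                    # walk back to < 20 % of peak
        while lo > 0 and p - lo < window and q[lo - 1] > start_frac * q[p]:
            lo -= 1
        hi = p                                    # walk forward to < 25 % of peak
        while hi < len(q) - 1 and hi - p < window and q[hi + 1] > end_frac * q[p]:
            hi += 1
        events.append((lo, p, hi))
        available[max(0, lo - sep):hi + sep + 1] = False   # 24 h separation
    return events

# Toy series: baseflow of 1 with two triangular flood peaks.
q = np.ones(200)
q[46:55] = [3, 5, 7, 9, 10, 9, 7, 5, 3]          # peak of 10 at t = 50
q[149:152] = [4, 8, 4]                           # peak of 8 at t = 150
```

On this toy series the procedure returns the two peaks, largest first, with their start and end placed at the thresholded time steps.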

2.2.3 Selection of the data subsets
The selected events were then gathered into three event sets, G1, G2 and G3, based on the magnitude of their peaks and the number of useful time steps for each test phase (training of the EHUP post-processor, calibration of the variable transformations and evaluation of the predictive distributions): G1 contains the events with the lowest peaks, while the highest are in G3.
The selection of the data subsets was tailored to study the behaviour of the post-processing approach in an extrapolation context. The control data subset had to encompass only time steps with simulated discharge values higher than those met during the training and calibration steps. Similarly, the calibration data subset had to encompass time steps with simulated discharge values higher than those of the training subset.
To achieve these goals, only the time steps within flood events were used. We distinguished four data subsets, as illustrated in Fig. 4. The subset D1 gathered all the time steps of the events of the G1 group. Then, the set D2 of the time steps of the events of the G2 group was split into two subsets: D2sup gathered all the time steps with forecasted discharge values higher than the maximum met in D1, and D2inf was filled with the other time steps. Finally, D3 was similarly filled with all the time steps of the G3 events with forecasted discharge values higher than the maximum met in D2.
The discharge thresholds used to populate the D1, D2sup and D3 subsets from the events belonging to the G1, G2 and G3 groups were chosen to ensure a sufficient number of time steps in every subset. We chose to set the minimum number of time steps in D3 and D2sup to 720 as a compromise between having enough data to evaluate the methods and keeping the extrapolation range sufficiently large. We lowered this limit to 500 for the top 5 % pairs of D1, since this subset was only used to build the empirical distribution by estimating the percentiles during the training step and not used for evaluating the quality of the uncertainty assessment.
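The stratification into D1, D2inf, D2sup and D3 can be sketched on arrays of forecasted discharges (one value per event time step; names are illustrative):

```python
import numpy as np

def build_subsets(f_g1, f_g2, f_g3):
    """Split the time steps of the G1, G2 and G3 event groups into the
    D1, D2inf, D2sup and D3 subsets, using the thresholds described above
    (forecasted discharge relative to the previous subsets' maxima)."""
    d1 = np.asarray(f_g1, dtype=float)
    f_g2 = np.asarray(f_g2, dtype=float)
    f_g3 = np.asarray(f_g3, dtype=float)
    d2_sup = f_g2[f_g2 > d1.max()]     # G2 steps above the maximum met in D1
    d2_inf = f_g2[f_g2 <= d1.max()]    # remaining G2 steps
    d3 = f_g3[f_g3 > f_g2.max()]       # G3 steps above the maximum met in D2
    return d1, d2_inf, d2_sup, d3

d1, d2_inf, d2_sup, d3 = build_subsets([1, 2, 3], [2, 4, 5], [4, 6, 7])
```

In this small example, D2sup receives the G2 steps above 3 (the D1 maximum) and D3 the G3 steps above 5 (the D2 maximum), so D3 only contains forecasts never seen during training or calibration.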

2.2.4 Calibration and evaluation steps
Since there is only one parameter for the Box-Cox transformation and two parameters for the log-sinh transformation, a simple calibration approach of the transformation parameters was chosen: the parameter space was explored by testing several parameter sets. For the Box-Cox transformation, 17 values of the λ parameter were tested: from 0 to 1 with a step of 0.1, with a refined mesh for the sub-intervals 0-0.1 and 0.9-1. For the log-sinh transformation, a grid of 200 (α, β) values was designed for each catchment based on the maximum value of the forecasted discharge in the D2 subset, as explained in greater detail in Appendix B.
Note that the hydrological model was calibrated over the whole set of data (1997-2006) to make the best use of the data set, since this study focuses on the effect of extrapolation on the predictive uncertainty assessment only.
We used a two-step procedure, as illustrated in Fig. 5. The first step was to calibrate the parametric transformations. For each transformation and for each parameter set, the empirical residuals were computed over D1 and D2sup: the EHUP was trained on the highest-flow group of D1 (see Sect. 2.1.3), and the calibration criterion was computed on D2sup. Indeed, the data transformations have almost no impact on the uncertainty estimation by the EHUP for events of the same magnitude as those of the training subset; therefore, the calibration subset has to encompass events of a larger magnitude (D2sup). The parameter set obtaining the best criterion value was selected. In the second step, the EHUP was trained on the highest-flow group of D1, D2inf and D2sup combined, using the parameter set obtained during the calibration step. Then, the predictive uncertainty distribution was evaluated on the control data set D3. Training the EHUP on the highest-flow group of the union of D1, D2inf and D2sup allows the uncertainty assessment to be controlled from small to large degrees of extrapolation (on D3). Indeed, if we had kept the training on D1 only, we would not have been able to test small degrees of extrapolation on independent data for every catchment (see the discussion in Sect. 3.3).
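The first step of this procedure (the grid search over transformation parameters) might look like the sketch below for the Box-Cox case. The coverage-based criterion and the exact refined grid values are assumptions for illustration, since the text does not specify them:

```python
import numpy as np

# 17-value lambda grid described in the text: step 0.1 on [0, 1], refined
# near 0 and 1 (the refined values below are assumptions).
LAMBDA_GRID = np.concatenate([[0.0, 0.025, 0.05, 0.075],
                              np.arange(0.1, 0.91, 0.1),
                              [0.925, 0.95, 0.975, 1.0]])

def boxcox(y, lam):
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def calibrate_lambda(f_d1, o_d1, f_d2sup, o_d2sup):
    """For each candidate lambda: train the EHUP on the top-5 % flow group
    of D1 (a single group of error percentiles), apply it with constant
    extrapolation on D2sup, and score the 80 % interval coverage."""
    best_lam, best_score = None, -np.inf
    for lam in LAMBDA_GRID:
        top = np.argsort(f_d1)[-max(1, len(f_d1) // 20):]   # top 5 % pairs
        err_q = np.percentile(boxcox(o_d1[top], lam) - boxcox(f_d1[top], lam),
                              np.arange(1, 100))
        pred = boxcox(f_d2sup, lam)[:, None] + err_q         # constant quantiles
        obs = boxcox(o_d2sup, lam)
        inside = (obs >= pred[:, 9]) & (obs <= pred[:, 89])  # 80 % interval
        score = -abs(inside.mean() - 0.8)                    # closeness to 0.8
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam

# Synthetic catchment with multiplicative (heteroscedastic) errors: a
# lambda near 0 (log-like) should be preferred.
rng = np.random.default_rng(42)
f1 = rng.uniform(1, 100, 2000); o1 = f1 * np.exp(0.2 * rng.normal(size=2000))
f2 = rng.uniform(150, 300, 1000); o2 = f2 * np.exp(0.2 * rng.normal(size=1000))
```

With multiplicative errors, large lambda values produce intervals that are too narrow on the higher D2sup flows, so the grid search settles on a small lambda.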

Probabilistic evaluation framework
Reliability was first assessed by a visual inspection of the probability integral transform (PIT) diagrams (Laio and Tamea, 2007; Renard et al., 2010). Since this study was carried out over a large sample of catchments, two standard numerical criteria were used to summarise the results: the α index, which is directly related to the PIT diagram, and the coverage rate of the 80 % predictive intervals (bounded by the 0.1 and 0.9 quantiles of the predictive distributions), used by the French operational FFS (a perfect uncertainty assessment would reach a value of 0.8). The α index is equal to 1 − 2 · A, where A is the area between the PIT curve and the bisector; its value ranges from 0 to 1 (perfect reliability).
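A minimal sketch of these two reliability criteria, assuming the α index is computed from the mean absolute deviation between the sorted PIT values and the uniform quantiles (the usual reading of the area A between the PIT curve and the bisector):

```python
import numpy as np

def alpha_index(pit):
    """alpha = 1 - 2*A, where A is the area between the empirical PIT curve
    and the bisector; 1 means perfect reliability."""
    pit = np.sort(np.asarray(pit, float))
    n = len(pit)
    theoretical = np.arange(1, n + 1) / (n + 1)  # uniform quantiles
    area = np.mean(np.abs(pit - theoretical))
    return 1.0 - 2.0 * area

def coverage_80(q_obs, q10, q90):
    """Fraction of observations inside the 80 % predictive interval;
    a perfect uncertainty assessment would reach 0.8."""
    q_obs = np.asarray(q_obs, float)
    return np.mean((q_obs >= q10) & (q_obs <= q90))
```

The PIT values themselves are simply the predictive cumulative distribution evaluated at the observations.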
The overall quality of the probabilistic forecasts was evaluated with the continuous ranked probability score (CRPS; Hersbach, 2000), which compares the predictive distribution to the observation:

\mathrm{CRPS} = \frac{1}{N} \sum_{t=1}^{N} \int_{-\infty}^{+\infty} \left[ F_t(Q) - H\!\left(Q - Q_{\mathrm{obs},t}\right) \right]^2 \mathrm{d}Q,

where N is the number of time steps, F_t(Q) is the predictive cumulative distribution, H is the Heaviside function and Q_{\mathrm{obs},t} is the observed value. We used a skill score, CRPSS = 1 − CRPS/CRPS_{clim}, to compare the mean CRPS to a reference, here the mean CRPS obtained from the unconditional climatology, i.e. from the distribution of the observed discharges over the same data subset. Values range from −∞ to 1 (perfect forecasts): a positive value indicates that the modelling is better than the climatology. For operational purposes, the sharpness of the probabilistic forecasts was checked by measuring the mean width of the 80 % predictive intervals. A dimensionless relative-sharpness index was obtained by normalising the mean width by the mean runoff:

\mathrm{SI} = 1 - \frac{\frac{1}{N} \sum_{t=1}^{N} \left[ q_{90}(Q_t) - q_{10}(Q_t) \right]}{\overline{Q}_{\mathrm{obs}}},

where q_{90}(Q_t) and q_{10}(Q_t) are the upper and the lower bounds of the 80 % predictive interval for each forecast, respectively, and \overline{Q}_{\mathrm{obs}} is the mean observed runoff. The sharper the forecasts, the closer this index is to 1. In addition to the probabilistic criteria presented above, the accuracy of the forecasts was assessed using the Nash-Sutcliffe efficiency (NSE) calculated with the mean values of the predictive distributions (best value: 1).
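These scores can be sketched for an empirical (ensemble-like) predictive distribution. The CRPS below uses the standard identity CRPS = E|X − y| − 0.5 · E|X − X′| rather than the integral form, and the sharpness index follows the normalised-width reading given above; both are illustrative implementations, not the study's code.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one empirical predictive distribution against one observation,
    using CRPS = E|X - y| - 0.5 * E|X - X'|."""
    x = np.sort(np.asarray(members, float))
    n = len(x)
    term1 = np.mean(np.abs(x - obs))
    # E|X - X'| via the sorted-sample (Gini mean difference) formula
    term2 = 2.0 * np.sum(x * (2 * np.arange(1, n + 1) - n - 1)) / n**2
    return term1 - 0.5 * term2

def crpss(crps_model, crps_clim):
    """Skill score relative to climatology: 1 is perfect, >0 beats climatology."""
    return 1.0 - crps_model / crps_clim

def sharpness_index(q10, q90, mean_runoff):
    """Relative sharpness: 1 minus the mean 80 % interval width over mean runoff."""
    return 1.0 - np.mean(np.asarray(q90) - np.asarray(q10)) / mean_runoff

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency of a deterministic series (best value: 1)."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    return 1.0 - np.sum((q_obs - q_sim)**2) / np.sum((q_obs - q_obs.mean())**2)
```

For a degenerate ensemble (all members equal), the CRPS reduces to the absolute error, which is why the mean CRPS is often read as a probabilistic generalisation of the mean absolute error.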

The calibration criterion
Since the calibration step aims at selecting the most reliable description of the residuals in extrapolation, the α index was used to select the parameter set that yields the highest reliability for each catchment, each lead time and each transformation. While other choices were possible, we followed the paradigm presented by Gneiting et al. (2007): reliability has to be ensured before sharpness. Note that the CRPS could have been chosen, since it can be decomposed as the sum of two terms, reliability and sharpness (Hersbach, 2000). However, in the authors' experience, sharpness is the controlling factor of the CRPS. Moreover, the CRPS values were often quite insensitive to the values of the log-sinh transformation parameters.
In cases where an equal value of the α index was obtained, we selected the parameter set that gave the best sharpness index. For the log-sinh transformation, there were still a few cases where an equal value of the sharpness index was obtained, revealing the lack of sensitivity of the transformation in some areas of the parameter space. For those cases, we chose to keep the parameter set that had the lowest α value and the β value closest to max_D2(Q).

Figure 5. The two-step procedure used to assess the predictive uncertainty in extrapolation: (a) the calibration step of the variable transformation parameters and (b) the evaluation step of the predictive uncertainty. During the calibration step (a), many parameter set values are tested, while only the calibrated set of transformation parameters is used during the evaluation step (b). The vertical dashed lines show the beginning of the extrapolation range. The grey dots are the data pairs which are not selected in D1, D2inf, D2sup or D3: they are not used during the calibration step or the evaluation step. The data used for the EHUP training at each step are sketched by a thick rectangle. In this study, only the top 5 % of the training pairs ranked by forecasted values are used to estimate the residual distribution, since we focus on the extrapolation behaviour of the EHUP. Illustration from the Ill River at Didenheim, 668 km².

Results with the calibration data set D2sup

Figure 6 shows the distributions of the α-index values obtained with the different transformations on the calibration data set (D2sup) for lead times LT/2, LT, 2 LT and 3 LT. The distributions are summarised with box plots. Clearly, not using any transformation leads to poorer reliability than any tested transformation. In addition, we note that, as expected, the calibrated transformations provide better results (although not perfect) than the uncalibrated ones on the calibration data set, and that no noticeable difference can be seen in Fig. 6 between the calibrated Box-Cox transformation (d), the calibrated log-sinh transformation (e) and the best-calibrated transformation (f). Nevertheless, the uncalibrated log transformation and the Box-Cox transformation with the parameter λ set to 0.2 (BC λ=0.2) yield quite reliable forecasts. Comparing the results obtained for the different lead times reveals that less reliable predictive distributions are obtained for longer lead times, in particular for the transformations without a calibrated parameter.

Figures 7 and 8 show the distributions of the parameter values obtained for the Box-Cox and the log-sinh transformations during the calibration step. The distributions vary with lead time. While the log transformation behaviour is frequently chosen for LT/2 and LT, the additive behaviour (corresponding to the use of no transformation; see Sect. 2.1.4) becomes more frequent for 2 LT and 3 LT. A similar conclusion can be drawn for the log-sinh transformation: a low value of α and a high value of β yield a multiplicative behaviour that is frequently chosen for all lead times, but less often for 2 LT and 3 LT than for LT/2 and LT. This explains in particular the loss of reliability seen for the log transformation for 3 LT in Fig. 6. These results reveal that the extrapolation behaviour of the distributions of residuals is complex: it varies among catchments and with lead time because of the strong impact of data assimilation.

Reliability
First, we conducted a visual inspection of the PIT diagrams, which convey an evaluation of the overall reliability of the probabilistic forecasts (examples in Fig. 9). In some cases, the forecasts are reliable (e.g. the Thérain River at Beauvais, 755 km², except if no transformation is used). Alternatively, these diagrams may show quite different patterns, highlighting bias (e.g. the Meu River at Montfort-sur-Meu, 477 km²) or under-dispersion (e.g. the Aa River at Wizernes, 392 km², or the Sioulet River at Miremont, 473 km²; for the latter, the calibration on D2sup leads to log-sinh and Box-Cox transformations equivalent to no transformation, which turns out not to be relevant for the control data set, where the log and the Box-Cox transformations are more reliable).
The distributions of the α-index values in Fig. 10 then reveal a significant loss of reliability compared to the values obtained with the calibration data set (Fig. 6). We note that the log transformation is the most reliable approach for LT/2 and is comparable to the Box-Cox transformation (BC λ=0.2) for LT. With increasing lead time, the BC λ=0.2 transformation becomes slightly better than the other transformations, including the calibrated ones. In addition, comparing the results obtained for the different lead times confirms that it is challenging to produce reliable predictive distributions when extrapolating at longer lead times. Overall, this means that the added value of the flexibility brought by the calibrated transformations does not transfer to an independent extrapolation situation, as illustrated in Fig. 11.
In operational settings, the non-exceedance frequencies of the quantiles of the predictive distribution, which are the lower and upper bounds of the predictive interval communicated to the authorities, are of particular interest. The 80 % predictive interval (bounded by the 0.1 and 0.9 quantiles) is the most used in France. For a reliable predictive distribution, the non-exceedance frequency of the lower bound and the exceedance frequency of the upper bound should both remain close to 10 %. Deviations from these frequencies indicate biases in the estimated quantiles. Figure 12 reveals that, on average, the 0.1 quantile is generally better assessed than the 0.9 quantile, though the latter is generally more sought after for operational purposes. More importantly, the lack of reliability of the log transformation for the 3 LT lead time seen in Fig. 10 appears to be related to an underestimation of the 0.1 quantile, which is stronger than for the other tested transformations, while the 0.9 quantile is less underestimated than with the other transformations. These results highlight that reliability has several complementary facets and that some parts of the predictive distribution can be more reliable than others. In a flood forecasting context, particular attention should be given to the upper part of the predictive distribution.
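Checking the bound frequencies described above is straightforward; a minimal sketch:

```python
import numpy as np

def bound_frequencies(q_obs, q10, q90):
    """Frequency of observations below the 0.1 quantile and above the
    0.9 quantile; both should stay close to 10 % for reliable bounds."""
    q_obs = np.asarray(q_obs, float)
    below = np.mean(q_obs < q10)
    above = np.mean(q_obs > q90)
    return below, above
```

A `below` frequency well above 0.1 flags an overestimated lower bound; an `above` frequency well above 0.1 flags an underestimated upper bound, the case that matters most for flood forecasting.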

Other performance metrics
In addition to reliability, we looked at other qualities of the probabilistic forecasts, namely the overall performance (measured by the CRPSS) and the accuracy (measured by the NSE). We also checked their sharpness (relative-sharpness index). The distributions of the four performance criteria are shown for LT in Fig. 13. We note that the log transformation has the median coverage ratio closest to the target value, at the expense of a lower median relative-sharpness value, because of the larger predictive interval widths caused by the multiplicative behaviour of the log transformation. In addition, the CRPSS and the NSE distributions show limited sensitivity to the variable transformation (as also shown by Woldemeskel et al. (2018) for the CRPS), even if not using any transformation yields slightly better results. This confirms that the CRPSS by itself is not sufficient to evaluate the adequacy of an uncertainty estimation. Similar results were obtained for the other lead times (see the Supplement).

Investigating the performance loss in an extrapolation context
For operational forecasters, it is important to be able to predict when they can trust the forecasts issued by their models and when their quality becomes questionable. We therefore investigated whether the reliability and the reliability loss observed in an extrapolation context were correlated with some properties of the forecasts. First, Fig. 14 shows the relationship between the α-index values obtained with D2sup and those obtained with D3 for three representative transformations. The results indicate that it is not possible to anticipate the α-index values obtained when extrapolating high flows in D3 from the α-index values obtained when extrapolating high flows in D2sup. In addition, two indices were chosen to describe the degree of extrapolation: the ratio of the median of the forecasted discharges in D3 over the median of the forecasted discharges in D2sup, and the ratio of the median of the forecasted discharges in D3 over the discharge for a return period of 20 years (for catchments where an assessment of the vicennial discharge was available in the national database: http://www.hydro.eaufrance.fr, last access: March 2020). In both cases, no trend appears, regardless of the variable transformation used, with Spearman coefficient values (much) lower than 0.33. The reliability can remain high for some catchments even when the magnitude of the events of the control data set is much higher than that of the training data set (see the Supplement for figures).
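The first degree-of-extrapolation index and the rank correlation used above can be sketched as follows; a pure-NumPy Spearman coefficient (ignoring ties) stands in for the statistic reported in the text.

```python
import numpy as np

def extrapolation_degree(q_fcst_d3, q_fcst_d2sup):
    """Degree of extrapolation: median forecasted discharge in D3 over the
    median forecasted discharge in D2sup."""
    return float(np.median(q_fcst_d3) / np.median(q_fcst_d2sup))

def spearman_rho(x, y):
    """Spearman rank correlation (no tie handling; enough for this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.sum(rx * ry) / np.sqrt(np.sum(rx**2) * np.sum(ry**2)))
```

In the study, this correlation is computed across catchments between the degree of extrapolation and the α index on D3; values below 0.33 are read as the absence of a usable trend.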
Finally, we found no correlation with the relative accuracy of the deterministic forecasts either: the goodness of fit during the calibration phase cannot be used as an indicator of the performance in an extrapolation context.

Overall, the results obtained for the control data set suggest that the log transformation and the fixed Box-Cox transformation (BC λ=0.2) can yield relatively satisfactory α-index and coverage ratio values, given their multiplicative or near-multiplicative behaviour in extrapolation. More tapered behaviours, which can be obtained with the calibrated Box-Cox or log-sinh transformations, do not show advantages when extrapolating high flows on an independent data set. In other words, what is learnt during the calibration of the more complex parametric transformations does not yield better results in an extrapolation context. These results could be explained by the fact that the calibration did not result in the most relevant parameter set. To investigate whether another calibration strategy could yield better results, we compared the performance on the control data set D3 when the calibration is achieved on the D2sup data set ("f: best calibrated") and on the D3 data set itself ("g: best reliability"). Option "g" gives the best performance that could be achieved with this model and this post-processor for these catchments (see the discussion in Sect. 4.1). The results shown in Fig. 10 reveal that, even when the best parameter set is chosen among the 217 possibilities tested in this study (17 for the Box-Cox transformation and 200 for the log-sinh transformation), the α-index distributions are far from perfect and reliability decreases with increasing lead time. This suggests that the stability of the distributions of residuals when extrapolating high flows might be a greater issue than the choice of the variable transformation. Nonetheless, the gap between the distributions of the transformations without a calibrated parameter ("b" and "c"), the best-calibrated transformation ("f") and the best performance that could be achieved ("g") highlights that it might be possible to obtain better results with a more advanced calibration strategy. This is, however, beyond the scope of this study and is therefore left for further investigations.

Figure 10. Distributions of the α-index values on the control data set D3, obtained with different transformations for four lead times (the filled box plots represent the calibrated distributions). Box plots (5th, 25th, 50th, 75th and 95th percentiles) synthesise the variety of scores over the catchments of the data set. The optimal values are represented by the horizontal dashed lines.

Figure 11. Scatter plots of the reliability α index obtained with the log transformation and the log-sinh transformation (a) on D2sup in the calibration step and (b) on D3 in the control step.

Empirically based versus distribution-based approaches: does the distribution shape choice impact the uncertainty assessment in an extrapolation context?
Besides the reduction of heteroscedasticity, many studies use post-processors which are explicitly based on the assumption of a Gaussian distribution and use data transformations to fulfil this hypothesis (Li et al., 2017). Examples are the MCP or the meta-Gaussian model; the NQT was designed precisely to achieve this. In their study on autoregressive error models used as post-processors, Morawietz et al. (2011) showed that error models with an empirical distribution for the description of the standardised residuals perform better than those with a normal distribution. We first checked whether the variable transformation helped to reach a Gaussian distribution of the residuals computed with the transformed variables. Then we investigated whether better performance can be achieved using the empirical distributions of the transformed residuals or using Gaussian distributions calibrated on these empirical distributions. We used the Shapiro-Francia test, where the null hypothesis is that the data are normally distributed. For each parametric transformation, we selected the parameter set of the calibration grid which obtains the highest p value. For more than 98 % of the catchments, the p value is lower than 0.018 (0.023) when the Box-Cox transformation (the log-sinh transformation) is used. This indicates that there are only a few catchments for which the normality assumption cannot be rejected. In a nutshell, the variable transformations can stabilise the variance, but they do not necessarily ensure the normality of the residuals. It is important not to overlook this frequently encountered issue in hydrological studies.
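As a simple stand-in for the Shapiro-Francia test (which is not part of the common scientific Python libraries), a skewness/kurtosis-based normality statistic such as Jarque-Bera can illustrate the check performed on the transformed residuals:

```python
import numpy as np

def jarque_bera_stat(x):
    """Jarque-Bera normality statistic from sample skewness and excess
    kurtosis; large values indicate departure from normality. This is a
    stand-in for the Shapiro-Francia test used in the study."""
    x = np.asarray(x, float)
    n = len(x)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z**3)
    kurt = np.mean(z**4)
    return n / 6.0 * (skew**2 + (kurt - 3.0)**2 / 4.0)
```

Applied to transformed residuals, a large statistic (and hence a tiny p value) reproduces the paper's finding: variance stabilisation does not imply normality.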
Even if there is no theoretical advantage to using the Gaussian distribution calibrated on the transformed-variable residuals rather than the empirical distribution to assess the predictive uncertainty, we tested the impact of this choice. For each transformation, the predictive uncertainty assessment obtained with the empirical transformed-variable distribution of residuals is compared to the assessment based on the Gaussian distribution whose mean and variance are those of the empirical distribution. Figure 15 shows the α-index distributions obtained over the catchments for both options in the control data set D3. We note that no clear conclusion can be drawn. No transformation (or identity transformation), which does not reduce the heteroscedasticity at all, benefits from the use of the Gaussian distributions for all lead times. In contrast, the predictive uncertainty assessment based on the empirical distribution with the log transformation is more reliable than the one based on the Gaussian distribution. For short lead times, it is slightly better to use the empirical distributions for the calibrated transformations (Box-Cox and log-sinh), but we observe a different behaviour for longer lead times. For these longer lead times, assessing the predictive uncertainty using the Gaussian distribution fitted to the empirical distributions of transformed residuals obtained with the calibrated log-sinh or Box-Cox transformations is the most reliable option. It is better than using the log transformation with the empirical distribution, but not very different from using the BC λ=0.2 transformation.
Investigations on the impact of the choice between the empirical and the Gaussian distributions on the post-processor performance are shown in the Supplement. They show that the choice of the distribution is not the dominant factor.

A need for a focus change?
In most modelling studies, several methodological steps depend on the range of the observations. First, calibration is designed to limit the residual errors over the available historical data. However, the largest residuals are often associated with the highest discharge values. It is well known that removing the largest flood events from a data set can significantly modify the resulting calibrated parameter set. This is particularly true with some common criteria, such as quadratic criteria, which strongly emphasise the largest errors (Legates and McCabe Jr., 1999; Berthet et al., 2010; Wright et al., 2015). Conversely, it is likely that "unavailable data", such as a physically realistic but (so far) unseen flood, would significantly change the calibration results if they could be included in the calibration data set. Moreover, model conceptualisation (building) itself is often based on the understanding of how a catchment behaves "on average". In some studies, outliers may even be considered disturbing and be discarded (Liano, 1996).
However, to provide robust models for operational purposes, we also need to focus on rare (rarely observed) events, while keeping in mind all the well-known issues associated with working with (too) few data (Anctil et al., 2004; Perrin et al., 2007). For predictive uncertainty assessment, this issue is exacerbated by the seasonality of hydrological extremes (Allamano et al., 2011; Li et al., 2013) for most approaches that rely heavily on data (beyond data-learning approaches, all models which need to be calibrated). Therefore, there is an urgent need to gather and compile data on extreme events (Gaume et al., 2009). Nevertheless, operational forecasters must still prepare themselves to work in an extrapolation context, as pointed out by Andréassian et al. (2012).

Conclusions
Even if major floods are rare, it is of the utmost importance that the forecasts issued during such events are reliable, to facilitate efficient crisis management. Like Lieutenant Drogo in The Tartar Steppe, who spent his entire life fulfilling his day-to-day duties while waiting in his fortress for the invasion of foes (Buzzati, 1940), many forecasters expect and prepare for a major event, even if their routine involves only minor events. That is why a strong concern for the extrapolation context should be encouraged in all modelling and calibration steps. This article proposes a control framework focusing on the forecasting performance in an extrapolation context.
We use this framework to test the predictive uncertainty assessment obtained by a statistical post-processing of a rainfall-runoff model based on a variable transformation. The transformation has to handle the heteroscedasticity and the evolution of the other properties of the predictive uncertainty distribution with the discharge magnitude in order to issue reliable uncertainty assessments, which is very problematic in an extrapolation context. As pointed out by McInerney et al. (2017), the choice of the heteroscedastic error modelling approach has a significant impact on the predictive performance. This is true in an extrapolation context as well.

Main findings
Using the proposed framework for an evaluation in an extrapolation context, we showed the following: a. Using an appropriate variable transformation can significantly improve the predictive distribution and its reliability. However, a performance loss still remains in an extrapolation context with any of the three transformations we tested.
b. The transformations with more calibrated parameters do not achieve significantly better results than the transformations with no calibrated parameter: while its flexibility can theoretically be very attractive in an extrapolation context, the log-sinh transformation is not more reliable in such a context; the uncalibrated log transformation and the Box-Cox transformation with the λ parameter set to 0.2 are robust and compare favourably.
c. We did not find any variable significantly correlated with the performance loss in an extrapolation context.
The findings reported herein corroborate the results of McInerney et al. (2017) within the context of flood forecasting and extrapolation: calibrating the Box-Cox or log-sinh transformation can be counter-productive. We therefore suggest that operational flood forecasting services consider the less flexible but more robust options: using the log transformation or the Box-Cox transformation with its λ parameter set close to 0.2.
Importantly, these results reveal significant performance losses in some catchments when it comes to extrapolation, whatever variable transformation is used. Even if the scheme tested yields satisfying results in terms of reliability for the majority of catchments, it fails in a significant number of catchments, and further investigations are needed to gain a deeper understanding of when and why failures occur.

Limitations and perspectives
We used the framework designed by Krzysztofowicz (1999), which has already been applied in various studies and separates the input uncertainty (mainly the observed and forecasted rainfall) from the hydrological uncertainty. Our study focuses only on the effect of the extrapolation degree on the hydrological uncertainty when using the best available rainfall product. Future work should combine both input uncertainty (rainfall) and hydrological uncertainty to evaluate the impact of using uncertain (forecasted) rainfall in a forecasting context.
Though no variable was found to be correlated with the performance loss, the investigations should be continued using a wider set of variables. First, it may open new perspectives to explain these losses and improve our understanding of the flaws of the hydrological model and of the EHUP. Furthermore, it would be very useful to help operational forecasters detect the hydrological situations in which their forecasts have to be questioned (in particular during major events, when forecasts are made in an extrapolation context).
Furthermore, improving the calibration strategy and using a regionalisation of the predictive distribution assessment, as proposed by Bourgin et al. (2015) and Bock et al. (2018), could help build more robust uncertainty quantification when forecasting high flows.
Finally, more studies focusing on the extrapolation context may help to better elucidate the limitations of the modelling (hydrological model structure, calibration, post-processing, etc.) and their consequences for practical matters. They are to be encouraged as a key to better and more reliable flood forecasting.

Figure A1. The GRP model flow chart. After an interception step, the production function splits the net rainfall (Pn) according to the level of the production store. The effective rainfall (Pr) is the sum of the direct flow and the percolation from this store. A corrective multiplicative coefficient (CORR) is then applied. The flow then runs through a unit hydrograph (time base: TB) and the routing store (capacity: ROUT).

Appendix A: GRP model
The GRP model belongs to the suite of GR models (Michel, 1983). These models are designed to be as simple as possible yet efficient for hydrological studies and various operational needs, resulting in parsimonious model structures (https://webgr.irstea.fr/en/models/a-brief-history/, last access: 1 April 2019). The GRP model is designed for forecasting purposes. It is a deterministic, continuous, lumped storage-type model (Fig. A1). Its inputs are limited to areal rainfall and (interannual) potential evapotranspiration (both may be available in real time). It can be understood as the sequence of two hydrological functions: a production function, which is the same as in the well-known GR4J model developed by Perrin et al. (2003), and a routing function, which is a simplified version of GR4J's routing function, since it includes only one flow branch composed of a unit hydrograph and a quadratic routing store. Tests showed that the performance of the GRP and GR4J structures was similar in forecasting mode.
A snow module (Valéry et al., 2014) may be implemented on top of the model if necessary. Like any GR model, it is parsimonious. It has only three parameters: (a) an adjustment factor of effective rainfall, which contributes to finding a good water balance; (b) the unit hydrograph time base used to account for the time lag between rainfall and streamflow; and (c) the capacity of the routing store, which temporally smooths the output signal.
Its main difference from the other GR models is the implementation "in the loop" of two data assimilation schemes: a state-updating procedure, which modifies the main state of the routing function as a function of the last discharge values, and an output-updating procedure, based either on an autoregressive model of the multiplicative error or on an artificial neural network (a multi-layer perceptron) whose inputs are the last discharge value and the two last forecasting errors. In this study, the autoregressive model was used.
The parameters are calibrated in forecasting mode, i.e. with the application of the updating procedures. This model is used by the French flood forecasting services, some hydroelectricity suppliers and canal managers at an hourly time step in order to issue real-time forecasts for lead times ranging from a few hours to a few days at several hundred sites. Recently, Ficchì et al. (2016) paved the way to a multi-timestep GRP version.
When the argument of the hyperbolic sine is large, sinh behaves like an exponential function, and the log-sinh transformation is then equivalent to an affine transformation. When this argument is small, sinh behaves like the identity, and the log-sinh transformation is then equivalent to a mere log transformation.
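These two limit behaviours can be verified numerically. The parameterisation z = β · log(sinh((α + y)/β)) below is our assumption about the form used in the study; the limits themselves hold for any scaling of the log-sinh family.

```python
import numpy as np

def log_sinh(y, alpha, beta):
    """log-sinh transformation, z = beta * log(sinh((alpha + y) / beta)).
    (The exact parameterisation of the study is assumed here.)"""
    return beta * np.log(np.sinh((alpha + y) / beta))

# Large argument: sinh(x) ~ exp(x)/2, so z ~ alpha + y - beta*log(2)  (affine).
# Small argument: sinh(x) ~ x,        so z ~ beta*log((alpha + y)/beta)  (log).
```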

B3 Calibration
The α and β parameters have the same physical dimension as the y variable. Since this study is dedicated to the extrapolation context, we used the following dimensionless parameterisation to calibrate the variable transformation over various catchments: α and β are compared to the maximum forecasted discharge in the D2sup data subset, with α = γ1 · max_D2(Q) and β = γ2 · max_D2(Q).
In the calibration step, the parameter space is explored on a (γ1, γ2) grid: 18 values of γ1 from 0.01 to 100 and 15 values of γ2 from 0.1 to 100 were tested, excluding combinations leading to the very same behaviours, such as values of γ1 much larger than γ2, which are equivalent to no transformation (additive error model). A total of 200 (γ1, γ2) combinations were tested for each calibration.
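The grid construction can be sketched as follows. The log spacing is an assumption, since only the ranges and counts are given in the text, and the exclusion rule that reduces the full grid to 200 combinations is not reproduced here.

```python
import numpy as np

# Illustrative reconstruction of the (gamma1, gamma2) calibration grid.
# Log-spaced values are a natural (assumed) choice given the wide ranges.
g1 = np.geomspace(0.01, 100.0, 18)   # 18 values of gamma1 in [0.01, 100]
g2 = np.geomspace(0.1, 100.0, 15)    # 15 values of gamma2 in [0.1, 100]
pairs = [(a, b) for a in g1 for b in g2]
# The full grid has 18 * 15 = 270 combinations; the study then discards
# combinations with equivalent behaviours to keep 200 per calibration.
```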
Data availability. Data for this work were provided by Météo-France (climatic data) and the hydrometric services of the French government (flow data). Readers can freely access the streamflow observations used in this study at the Banque HYDRO website (http://www.hydro.eaufrance.fr/, last access: April 2020). Climate data can be retrieved from the Publithèque website of Météo-France (https://publitheque.meteo.fr/, last access: April 2020). These data are not publicly available but can be provided for research and non-profit purposes upon request to Météo-France. Metadata on the catchments used in this study are detailed by Brigode et al. (2020) (https://doi.org/10.15454/UV01P1).
Author contributions. All co-authors contributed to the study and the article. LB and FB contributed equally and share co-first authorship.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. The authors thank the editor Dimitri Solomatine, Kolbjorn Engeland and one anonymous reviewer. Both reviewers provided insightful comments which helped to greatly improve this text. Thanks are also due to Météo-France for providing the meteorological data and Banque HYDRO database for the hydrological data. The contribution of the authors from Université Gustave Eiffel and INRAE was financially supported by SCHAPI (Ministry of the Environment).
Review statement. This paper was edited by Dimitri Solomatine and reviewed by Kolbjorn Engeland and one anonymous referee.