Extreme floods in Europe: going beyond observations using reforecast ensemble pooling

Assessing the rarity and magnitude of very extreme flood events occurring less than twice a century is challenging due to the lack of observations of such rare events. Here we develop a new approach, pooling reforecast ensemble members from the European Flood Awareness System (EFAS), to increase the sample size available to estimate the frequency of extreme local and regional flood events. We assess the added value of such pooling, determine where in Central Europe one might expect the most extreme events, and evaluate how event severity is related to physiographic and meteorological catchment characteristics. We work with a set of 234 catchments from the Global Runoff Data Center matched to EFAS catchments and for which performance of simulated floods is good when compared to observed streamflow. We pool EFAS-simulated flood events for 10 perturbed ensemble members and lead times ranging from 22 to 46 days, where flood events are only weakly dependent (< 0.25 average correlation across lead times). The resulting large ensemble (130 time series instead of one) enables analyses of very extreme events, which occur less than twice a century. We demonstrate that such ensemble pooling produces more robust estimates with considerably reduced uncertainty bounds (by ∼ 80% on average) than observation-based estimates, but may equally introduce biases arising from the simulated meteorology and hydrological model. Our results show that, for a given return period, specific floods are highest in steep, cold, and wet regions and comparably low in regions with strong flow regulation through dams. Furthermore, our pooled flood estimates indicate that the probability of regional flooding is higher in Central Europe and Great Britain than in Scandinavia. We conclude that reforecast ensemble pooling is an efficient approach to increase sample size and to derive robust local and regional flood estimates in regions with good hydrological model performance.

Reforecast ensemble pooling offers an alternative to stochastic models and large climate ensemble experiments, which have traditionally been used to generate large samples of extreme events.

Methods and Materials
To assess the potential value of reforecast ensemble pooling in flood frequency analysis, we use reforecast simulations of streamflow generated by the European Flood Awareness System (EFAS). EFAS combines a weather prediction model with a hydrological model to generate hydrological simulations including streamflow (Figure 1a). First, we pre-process the EFAS data by applying bias correction and identify flood events in simulation runs that were performed for different lead times and perturbed members, to test whether flood samples of different model runs are independent (Figure 1b; details below). Second, we evaluate how the use of a pooled ensemble approach can benefit flood frequency analysis in terms of best estimates and uncertainty. To do so, we derive local and regional flood estimates for Central and Northern Europe (Figure 1c).

Figure 1. (1) Data set derivation from a weather prediction-hydrological model chain, (2) data processing through model and ensemble evaluation, and (3) local and regional frequency analysis by pooling flood events across lead times. AM = annual maxima, POT = peak over threshold.

Study region and data
EFAS provides deterministic and probabilistic medium-range streamflow forecasts and early warning information (Smith et al., 2016). It relies on numerical weather predictions from the European Centre for Medium-Range Weather Forecasts (ECMWF), initial conditions derived using observed meteorological data, and the hydrological model LISFLOOD.
LISFLOOD is a spatially distributed hydrological rainfall-runoff model based on Geographic Information Systems (GIS) developed by the Joint Research Centre (JRC) for operational flood forecasting at the pan-European scale. EFAS v4.0 computes a water balance at a 6-hourly or daily time step for each grid cell (5 km × 5 km) using meteorological forcing data (precipitation, temperature, potential evapotranspiration, and evaporation rates). LISFLOOD represents a variety of processes (snowmelt, soil freezing, surface runoff, soil infiltration, preferential flow, soil moisture redistribution, drainage to the groundwater system, groundwater storage, and groundwater base flow) and routes runoff produced for each grid cell through the river network using a kinematic wave approach. The model was calibrated using the modified Kling-Gupta efficiency metric (Gupta et al., 2009) for the period 1991-2017 for 1137 catchments for which discharge data were available (ECMWF, 2021).
The EFAS reforecasts cover the 20-year period 1999-2018 and are initialized twice a week on Mondays and Thursdays with lead times of up to 46 days at a 6-hourly time step. They are driven with ensemble meteorological reforecasts from ECMWF's numerical weather prediction model. The meteorological ensemble consists of 10 perturbed ensemble runs, which were derived using the same numerical weather prediction model but varying initial conditions.

Model and ensemble evaluation
We select catchments for which model performance is good, i.e. with high modified Kling-Gupta efficiency (E_KG) between simulated and observed streamflow and relative Q95 errors < 10%. The 234 catchments fulfilling these criteria are retained for further analysis (Figure 2).

Data subsetting
The pooled frequency analysis relies on reforecasts of daily streamflow time series generated through EFAS v4.0. For our analysis, we use the 10 perturbed ensemble runs and 24 lead times, a subset of all available lead times (lt) chosen by picking every 8th available lead time: lt = 0, 48, 96, ..., 1104 hours. This subset was chosen as a trade-off between computational feasibility and sample size.
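The lead-time subsetting described above can be sketched as follows (assuming the 6-hourly spacing of the EFAS reforecasts; variable names are illustrative):

```python
# All 6-hourly lead times of the reforecasts, up to 46 days (1104 hours).
lead_times_all = list(range(0, 46 * 24 + 1, 6))

# Keep every 8th lead time: 0, 48, 96, ..., 1104 hours (24 lead times in total).
lead_times = lead_times_all[::8]
```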

Bias correction
Simulated streamflow time series can be biased because of uncertainties introduced through the modeling process. Substantial bias indicates low model fidelity because there is limited agreement between observed and modeled distributions (DelSole and Shukla, 2010). Potential uncertainty sources introducing bias include meteorological input uncertainty due to the use of a numerical weather prediction system and hydrological parameter and model uncertainties. Any such biases must be corrected in order to align the simulated streamflow distributions with observed distributions. To do so, we apply non-parametric quantile mapping, which has been found to be more flexible and suitable than parametric mapping approaches (Gudmundsson et al., 2012), to the daily simulated discharge series using the R-package qmap (Gudmundsson, 2016). We estimate the empirical cumulative distribution functions of the observed and simulated time series (reforecasts for different lead times and perturbations) for regularly spaced quantiles for the period 1999-2011, for which both simulations and observations are available. Then, we apply quantile mapping to the simulated distributions of the whole period 1999-2018 for the different lead times and perturbations. We also tested another commonly applied bias correction procedure, which maps the simulated distribution using the mean bias ratio on extracted extremes (as done in meteorological ensemble pooling studies, e.g. by Kelder et al. (2020)). However, we find that a correction by the mean bias ratio still leads to biased flood estimates, especially for long return periods.
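A minimal Python sketch of the non-parametric quantile mapping step follows (the study uses the R package qmap; this illustrative version maps simulated values onto observed quantiles estimated over a common calibration period, applied separately per lead time and perturbed member):

```python
import numpy as np

def empirical_qmap(sim_cal, obs_cal, sim_apply, n_quantiles=99):
    """Non-parametric quantile mapping: transfer simulated values onto the
    observed distribution using regularly spaced empirical quantiles
    estimated over a common calibration period."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    sim_q = np.quantile(sim_cal, probs)   # simulated quantiles (calibration)
    obs_q = np.quantile(obs_cal, probs)   # observed quantiles (calibration)
    # Piecewise-linear interpolation between quantile pairs; values outside
    # the calibration range are clamped to the outermost quantiles.
    return np.interp(sim_apply, sim_q, obs_q)
```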
A comparison of the cumulative distribution functions of observed and simulated flood events shows that both non-parametric quantile mapping and correction by the mean bias ratio lead to an alignment of simulated with observed distributions and that quantile mapping produces more satisfactory results than correction by the mean bias ratio in most cases, especially for events with high non-exceedance probabilities (Figure 3). We therefore use the series obtained by non-parametric quantile mapping for the pooled flood frequency analysis.

Flood identification
After bias correction, we identify flood events in the time series simulated for different lead times and perturbed members (10 perturbed runs for each lead time). We use two flood extraction procedures: the annual maxima (AM) and peak-over-threshold (POT) approaches.
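The two extraction procedures can be sketched as follows (illustrative threshold and declustering choices; the study's exact event-separation criteria are not reproduced here):

```python
import numpy as np
import pandas as pd

def annual_maxima(q):
    """AM sample: the largest daily discharge of each calendar year."""
    return q.groupby(q.index.year).max()

def peaks_over_threshold(q, threshold, min_gap_days=10):
    """POT sample: peaks above a threshold, with exceedances closer than
    `min_gap_days` merged into a single event (largest value kept)."""
    events = []
    last_time = None
    for t, v in q[q > threshold].items():
        if last_time is not None and (t - last_time).days <= min_gap_days:
            events[-1] = max(events[-1], v)   # same event cluster
        else:
            events.append(v)                   # new independent event
        last_time = t
    return np.array(events)
```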

Stability and independence tests
Using the AM flood samples extracted from different simulation runs, we assess the suitability of the perturbed ensemble streamflow simulations for ensemble pooling by evaluating whether individual simulation runs can be considered independent and stable, i.e. that simulated distributions vary only slightly across lead times (Kelder et al., 2020).
First, we assess model stability, i.e. check whether the generated ensemble exhibits any changes in distribution with lead time. Ideally, a pooled ensemble should only exhibit weak changes in distribution with lead time. Such stability is assessed by comparing the distribution of AM events across different lead times (Figure 4). To evaluate model stability for a large number of catchments, we compute Spearman's correlation between simulated 95% flood quantiles (20-year return period) and the corresponding lead times.
Second, we check whether individual model runs can be considered independent. Ensemble member independence is an important factor determining the increase in effective sample size achieved through ensemble pooling. If all x simulation runs are independent, pooling increases the effective sample size by a factor of x. However, if the x simulation runs show some degree of dependence, pooling increases the sample size only by a factor y < x. We calculate Spearman's rank correlation for pairs of AM time series (following Kelder et al., 2020) (Figure 5). Note that such correlation can be computed directly only for AM and not for peak-over-threshold (POT) series because POT series may differ across ensemble members in the number of events chosen for analysis. For the same reason, the POT events used in our subsequent analyses can be assumed to be more independent than AM events. To illustrate the difference in dependence between AM and POT samples, we indirectly compute Spearman's correlation for pairs of POT time series by using events where at least one of the time series exceeds a threshold. For this correlation analysis, we replace non-exceedances in the paired time series by 0 (not ideal, because this might artificially introduce some dependence).
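The pairwise independence check can be sketched as follows (a simple rank correlation without tie handling, which suffices for continuous AM discharge values; names are illustrative). The same helper can also serve the stability check above by correlating the simulated 95% flood quantile with lead time:

```python
import numpy as np
from itertools import combinations

def spearman_rho(x, y):
    """Spearman rank correlation (ties ignored; AM discharges are continuous)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def median_pairwise_rho(am_members):
    """am_members: (n_members, n_years) array, one AM series per perturbed
    member at a given lead time. Median correlation over all member pairs."""
    rhos = [spearman_rho(am_members[i], am_members[j])
            for i, j in combinations(range(am_members.shape[0]), 2)]
    return float(np.median(rhos))
```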

Frequency analysis
For the frequency analysis, we use the POT instead of AM flood samples to ensure inclusion of relevant events and to reduce the dependence between ensemble members (i.e. runs for different lead times and for different perturbations).

Local frequency analysis
For the local (catchment-specific) frequency analysis, we pool all POT events from the perturbed members of the lead times that can be considered independent, i.e. lead times ≥ 528 hours or 22 days (see Section 3.1). Such pooling increases the sample size available for frequency analysis from 1 × ∼20 events (roughly 1 event chosen per year on average) to 13 lead times × 10 members × 20 events = 2600 events. We fit a theoretical Generalized Pareto distribution (GPD; Coles, 2001) to observed and pooled POT samples using maximum likelihood estimation. We use the fitted distributions to derive best-estimate observed and simulated flood frequency curves using probabilities corresponding to return periods between 1 and 200 years. In addition, we derive 90% confidence intervals for the observed and simulated frequency curves using bootstrapping, i.e. we draw n = 1000 random samples from the observed and simulated samples, respectively, to derive 1000 theoretical flood frequency curves. We then use these resampled frequency curves to compute 90% confidence intervals for the estimated flood frequency curves. We compare simulated to observed flood quantiles corresponding to return periods of T = 5, 10, 20, 50, 100, and 200 years by computing relative differences between simulated and observed quantiles. We furthermore compare the uncertainty of these estimates by computing the relative difference in the range between the 95% and 5% quantiles of the 1000 resampled estimates for each return period.
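The GPD fit and bootstrap interval can be sketched as follows (an illustrative sketch using scipy; the threshold, event rate, and bootstrap settings are placeholders rather than the study's exact configuration):

```python
import numpy as np
from scipy.stats import genpareto

def gpd_return_levels(pot, threshold, events_per_year, return_periods):
    """Fit a GPD to threshold excesses by maximum likelihood and derive
    return levels for the requested return periods (in years)."""
    excess = pot[pot > threshold] - threshold
    shape, _, scale = genpareto.fit(excess, floc=0.0)
    levels = {}
    for T in return_periods:
        # Non-exceedance probability per event for a T-year return period.
        p = 1.0 - 1.0 / (events_per_year * T)
        levels[T] = threshold + genpareto.ppf(p, shape, loc=0.0, scale=scale)
    return levels

def bootstrap_interval(pot, threshold, events_per_year, T,
                       n_boot=1000, level=0.90, seed=0):
    """Confidence interval for the T-year return level via resampling."""
    rng = np.random.default_rng(seed)
    draws = [gpd_return_levels(rng.choice(pot, size=len(pot), replace=True),
                               threshold, events_per_year, [T])[T]
             for _ in range(n_boot)]
    a = (1.0 - level) / 2.0
    return np.quantile(draws, [a, 1.0 - a])
```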
As a reference for these theoretical estimates, we provide empirical return period estimates of the observed flood events using plotting positions, which assign an empirical probability to each event within the sample. To also represent the uncertainty of these empirical estimates, we perform another bootstrap experiment, which derives plotting positions for different samples by removing 1 year at a time. Next, we map flood quantiles estimated for return periods of T = 10, 20, 50, 100, and 200 years for the 234 catchments using the GPD distributions fitted to the pooled POT samples.

To identify physiographical and hydro-meteorological characteristics important for explaining flood quantiles at different return periods, we use linear modeling. We fit linear regression models of different sizes, i.e. with different numbers of explanatory variables, using exhaustive search (James et al., 2013) to predict flood quantiles with a specific return period, e.g. 10 years. For the exhaustive search (i.e. trying all variable combinations), we use a set of potential explanatory variables frequently used to explain flood characteristics, including altitude, catchment area, number of dams in the catchment, mean slope, population count, mean temperature, mean precipitation, mean evapotranspiration, mean snow water equivalent (SWE), mean snowmelt, and mean soil moisture. We computed the mean areal hydro-meteorological characteristics, including precipitation, temperature, evapotranspiration, SWE, snowmelt, and soil moisture (average over 4 layers), using the gridded ERA5-Land data set (ECMWF, 2019). Among the fitted models with different numbers of predictors, we identify the model with the smallest Bayesian Information Criterion (BIC) value for each return period and examine the explanatory variables retained in these models. The sign and magnitude of the retained regression coefficients are used to describe the importance of each predictor in explaining flood quantiles for different return periods.
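The exhaustive-search selection can be sketched as follows (a generic best-subset search with BIC; the predictor names are illustrative, not the study's exact variable set):

```python
import numpy as np
from itertools import combinations

def best_subset_bic(X, y, names):
    """Fit OLS models for every predictor subset and return the subset of
    predictor names minimizing BIC = n*log(RSS/n) + k*log(n), where k
    counts the fitted coefficients including the intercept."""
    n = len(y)
    best_bic, best_names = np.inf, None
    for r in range(1, X.shape[1] + 1):
        for idx in combinations(range(X.shape[1]), r):
            Xd = np.column_stack([np.ones(n), X[:, list(idx)]])
            beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
            rss = float(np.sum((y - Xd @ beta) ** 2))
            bic = n * np.log(rss / n) + (r + 1) * np.log(n)
            if bic < best_bic:
                best_bic, best_names = bic, [names[i] for i in idx]
    return best_names, best_bic
```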

Regional frequency analysis
After performing the local frequency analysis, we look at probabilities of regional flooding. That is, we estimate the return periods of events that affect a certain percentage of catchments within a larger region, i.e. a river basin. We focus on the major river basins in Europe (HydroSHEDS; Lehner and Grill, 2013) and ask what the probability is that 30%, 50%, and 70% of the catchments in each of these river basins are jointly affected by flooding. To compute such regional flooding probabilities, we follow the regional hazard estimation procedure proposed by Brunner et al. (2020b). For each large river basin, we (1) determine the available catchments located within the given region and focus on river basins with at least 5 catchments (out of the 234); (2) identify the number of flood events during which p% of the catchments are jointly flooded using a binary flood event matrix, which indicates for each catchment whether it was affected by specific flood events identified across all catchments, lead times, and perturbed members; and (3) compute the probability of regional flooding using the Weibull plotting position given by p(%) = 100 · x/(n + 1), where n is the total number of events affecting at least one of the stations in the region, and x is the number of events during which p% of the stations were affected by flooding.
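Step (3) can be sketched as follows (illustrative; the event matrix is assumed to already encode, per identified event, which catchments were affected):

```python
import numpy as np

def regional_flood_probability(event_matrix, coverage):
    """event_matrix: binary (n_events, n_catchments) array; a row is one
    flood event, with 1 where a catchment was affected. Returns the
    Weibull plotting-position probability p(%) = 100 * x / (n + 1) that an
    event affects at least `coverage` (fraction) of the catchments."""
    n = event_matrix.shape[0]                 # events affecting >= 1 catchment
    frac = event_matrix.mean(axis=1)          # affected fraction per event
    x = int(np.sum(frac >= coverage))         # events reaching the coverage
    return 100.0 * x / (n + 1)
```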

Ensemble evaluation
After identifying catchments with satisfactory model performance in terms of the EFAS historical runs, we assess the suitability of the streamflow ensemble generated using the perturbed numerical weather predictions and different lead times for ensemble pooling. This assessment focuses on AM instead of POT flood samples because POT event identification can lead to the selection of an unequal number of events across lead times, which makes it impossible to compute correlations. We first consider the stability (i.e. lack of drift across lead times) of annual maxima flood events simulated for 24 lead times ranging from 0 to 46 days for one example catchment (Figure 4). Within each year, flood magnitudes do not differ systematically across lead times (Figure 4a), and cumulative flood distributions seem to be stable across lead times (Figure 4b). We next examine AM (in)dependence across perturbed ensemble members by computing Spearman's rank correlation between pairs of AM series derived for the 10 perturbed ensemble members at each lead time. AM (in)dependence across perturbed ensemble members seems to depend both on the catchment and on lead time (Figure 5). Dependence is relatively high for short lead times and decreases with longer lead times. The strength of the dependence at t = 0, as well as the rate of decrease with increasing lead time, depends on the catchment, as illustrated by the different 'dependence decay' behavior of the four example catchments. While some catchments (e.g. b and d) show correlation values close to 0 for sufficiently long lead times, other catchments (e.g. a) show relatively high correlations of 0.5 even at lead times exceeding one month. That is, individual AM series for different ensemble members may not necessarily be fully independent in a hydrological context.
This finding contrasts with independence tests performed for other types of extremes such as extreme precipitation (Kelder et al., 2020) or wind (Breivik et al., 2014), which found that simulated precipitation and wind extremes can be considered independent after certain lead times. The residual dependence in the case of hydrological simulations is likely caused by the long memory of hydrological systems, which is related to storage processes, e.g. in the soil or the cryosphere, and the persistence of the effects of initial conditions on the model forecasts. Such residual dependence would be expected to be independent of the choice of hydrological model used to translate the independent precipitation series to streamflow. We seek to better understand which types of catchments show high/low ensemble member dependence across lead times.
Therefore, we compute the median Spearman's rank correlation across the 10 ensemble members and 24 lead times for each of the 234 catchments and try to relate this median correlation to a catchment's flood seasonality ratio. The flood seasonality ratio R_F is computed as R_F = Q95,s/Q95,w, where Q95,s represents Q95 in summer (April-September) and Q95,w represents Q95 in winter (October-March). R_F > 1 and R_F < 1 represent catchments with more severe summer than winter floods and more severe winter than summer floods, respectively. We also considered other metrics to explain the median lead time correlation, such as the baseflow index (Ladson et al., 2013) or catchment area, but did not find any meaningful relationships. Median lead time dependence shows clear spatial patterns, with higher dependencies in the Alps and Scandinavia than in the rest of Europe (Figure 6a). These regions with higher median dependencies are characterized by a summer flood regime as they are partly influenced by snowmelt contributions (Berghuijs et al., 2019). Median lead time correlation seems to generally increase with higher seasonality ratios (Figure 6b), i.e. the more summer-/snow-dominated a flood regime is, the higher the AM dependence. However, some of the winter-/precipitation-dominated regimes can also have high dependence values.
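The seasonality ratio can be sketched as follows (assuming a daily discharge series with a datetime index):

```python
import numpy as np
import pandas as pd

def flood_seasonality_ratio(q):
    """R_F = Q95 of summer flows (Apr-Sep) / Q95 of winter flows (Oct-Mar).
    R_F > 1 indicates summer-dominated floods, R_F < 1 winter-dominated."""
    is_summer = q.index.month.isin(range(4, 10))
    q95_summer = np.quantile(q[is_summer], 0.95)
    q95_winter = np.quantile(q[~is_summer], 0.95)
    return float(q95_summer / q95_winter)
```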
The high dependence at low lead times suggests that simulations at lower lead times should be removed before pooling flood events for frequency analysis. In order to determine the lead times to be excluded, we compute the median AM dependence across ensemble members and catchments for each lead time and perform a Pettitt change point test (Ryberg et al.) on this series; lead times before the detected change point (< 528 hours, i.e. 22 days) are excluded from pooling.

Flood frequency analysis
Flood estimates derived from theoretical distributions fitted to pooled peak-over-threshold (POT) flood events from 10 ensemble members and 13 lead times are more robust, i.e. have smaller uncertainty, than flood estimates derived from distributions fitted to a small sample of observed POT events, as illustrated in Figure . The differences between observation- and simulation-based best estimates and uncertainty ranges vary by return period and by catchment (Figure 9). Relative differences between observation- and simulation-derived best estimates are mostly positive, i.e. observed quantiles tend to be larger than simulated quantiles (Figure 9a). These relative differences increase with return period length. Similarly, observation-derived uncertainty ranges are wider than simulation-derived uncertainty ranges, and these differences also increase with return period length (Figure 9b). Both the relative differences in best estimates and in uncertainty bounds depend on catchment area and elevation to some degree (Figure 9c, d). Low-elevation and large catchments generally show lower relative differences in best estimates and uncertainty than high-elevation and small catchments. As the effect of area on relative differences is stronger than the effect of elevation, there are no clear spatial patterns in relative differences between observed and simulated best estimates and uncertainty bounds. Note that the relative differences between simulated and observed best estimates and uncertainty bounds are uncorrelated with model performance, which means that increasing the cut-off threshold for EFAS model performance (level of E_KG) does not necessarily lead to an increase in the similarity between observed and simulated flood estimates.

Figure 9. Relative difference between observed and simulated ((obs − sim)/sim) (c) best estimates and (d) range of uncertainty bounds (Q95 − Q05) for the 100-year return period in relation to catchment area and elevation; larger dots indicate higher catchment elevation.
We now use the best estimates derived by ensemble pooling to map spatial patterns of flood quantiles over Central Europe for different return periods (Figure 10). Flood quantiles are highest in the Alps and Great Britain and lowest in Northern Germany and Scandinavia, independent of the return period. These spatial patterns corroborate previous findings that the Alps and Great Britain are regions with a comparably high number of flood events per year (Mangini et al., 2018) and that observation-based 100-year specific discharge is highest in the Alps, Great Britain, and Norway and lowest along the Atlantic coast (Blöschl et al., 2019). Additionally, we find that local flood quantiles are positively related to the mean slope and mean precipitation of a catchment, and negatively related to the number of dams, temperature, and mean snowmelt of a catchment (Figure 11). In other words, catchments with steep slopes, colder climates, and higher mean precipitation tend to have higher flood quantiles, while catchments with a greater number of dams and higher snowmelt contributions tend to have smaller flood quantiles.

Ensemble pooling can also be used to derive regional flood estimates, i.e. to compute the probability that a certain percentage of catchments within a region, i.e. a large river basin, are jointly flooded (Figure 12). Regional floods with a 30% coverage, i.e. floods affecting at least 30% of the catchments within a region, occur relatively frequently (return periods < 10 years) both in Central Europe and in Scandinavia (Figure 12a). In contrast, regional events with a 50% coverage are more likely in Central Europe (lighter colors) than in Scandinavia (darker colors). Very widespread events with 70% spatial coverage become very rare (return periods > 90 years) in most parts of Europe except in the Weser, Elbe, and Oder river basins.

Figure 12. Probabilities of regional flooding for European river basins with more than five catchments: (a) 30% affected, (b) 50% affected, and (c) 70% affected. Regions with fewer than 5 catchments, where regional flood probabilities could not be determined, are highlighted in grey, and regions not covered by our dataset are displayed in white.

Discussion
Pooling flood events derived from a streamflow reforecast ensemble substantially increases the sample size available for flood frequency analysis. In doing so, it enables the study of very rare extremes absent in relatively short observed time series.
Increasing the sample size also facilitates the study of spatial patterns in the distribution of flood estimates corresponding to long return periods (e.g. 200 years; Figure 10e) and notably reduces uncertainty in most cases (Figure 9b), independent of the original EFAS model performance. Furthermore, it enables the study of rare spatial extremes, i.e. events that may affect multiple catchments at once (Figure 12). Therefore, streamflow reforecast ensemble pooling represents a suitable alternative to stochastic or climate model large ensemble approaches for studying the frequency and magnitude of rare extreme events.
Similar to large ensemble approaches, but in contrast to stochastic approaches, reforecast-based simulation approaches rely on physical representations of the hydrological cycle. Such physical representation may be especially valuable if relationships between different variables are of interest and if one wishes to study the physical drivers of flood events. In contrast, stochastic models have the advantage of being relatively straightforward to implement and are potentially less computationally intensive.

The utility of reforecast pooling rests on the performance of the underlying hydrological simulations. The use of reforecast simulations instead of observations comes at the cost of potentially introducing uncertainty through the simulated meteorological input or the hydrological model itself (structure and parameters) (Clark et al., 2016). These uncertainties may result in biased simulations, which may under- or overestimate the whole or specific parts of the streamflow distribution. Such bias can be partly reduced by using bias correction techniques such as quantile mapping (Figure 3), yet the plausibility of simulated unprecedented events remains difficult to verify. The POT sampling approach has the advantage that, besides event magnitudes and timing, the number of events may also vary across ensemble members. This means that a one-to-one relationship between events extracted from two different ensemble members can no longer be established.
The flood ensemble pooling approach described herein is not limited to the EFAS reforecasts over Europe but could also be applied to other streamflow reforecast modeling systems such as the Global Flood Awareness System (GloFAS; Alfieri et al., 2013) or the Global Flood Forecasting Information System (GLOFFIS; Emerton et al., 2016). Moreover, the pooling approach may be beneficial for other types of hydrological extremes beyond flood frequency analysis, such as droughts. Such an extension would require model evaluation targeted at the variable of interest. Hydrological extremes extracted from streamflow reforecasts may also be used in combination with climatic extremes extracted from meteorological reforecasts to study the frequency of compound events (such as joint pluvial and fluvial flooding), or the drivers of various extremes. In any case, the use of simulated extremes pooling requires careful model evaluation and is likely to require some form of bias correction to ensure the fidelity of extremes.

Conclusions
Pooling of publicly accessible reforecast flood events such as those generated through the European Flood Awareness System (EFAS) can be a useful tool to improve the robustness of flood estimates, particularly for rare events with long return periods. However, as with other extremes (Kelder et al., 2020), such pooling is only effective if simulated floods show little bias and model drift across lead times, and if the floods extracted from different ensemble members are sufficiently independent to increase the effective sample size available for frequency analysis. Bias can be removed using bias correction techniques such as quantile mapping. The degree of dependence is subject to the catchment (with summer-flood-dominated catchments showing higher dependencies than winter-flood-dominated catchments), lead time (with decreasing dependence at longer lead times), and event type (with peak-over-threshold events showing lower dependence than annual maxima events). The higher dependence of summer-flood-dominated catchments suggests that catchment memory through snow storage effects is an important determinant of dependence and that catchments with a more predictable seasonality may have greater member dependence. We recommend pooling peak-over-threshold events from ensemble runs generated for lead times > 22 days, because floods are less dependent on average beyond such a lead time and because dependence is lower for POT than for AM events.
Our application of the pooling approach over 234 European catchments shows that local floods are most extreme in the Alps and Great Britain and least extreme in Scandinavia and Central Europe. It also indicates that regional extreme flood events, in which a large fraction of catchments flood simultaneously, are more likely in Central Europe than in Scandinavia. We conclude that pooled reforecast ensembles are beneficial in studying the probability of extreme and spatially extensive events in the