Flood spatial coherence, triggers, and performance in hydrological simulations: large-sample evaluation of four streamflow-calibrated models

Floods cause extensive damage, especially if they affect large regions. Assessments of current, local, and regional flood hazards and their future changes often involve the use of hydrologic models. A reliable hydrologic model ideally reproduces both local flood characteristics and spatial aspects of flooding under current and future climate conditions. However, uncertainties in simulated floods can be considerable and yield unreliable hazard and climate change impact assessments. This study evaluates the extent to which models calibrated according to standard model calibration metrics such as the widely used Kling–Gupta efficiency are able to capture flood spatial coherence and triggering mechanisms. To highlight challenges related to flood simulations, we investigate how flood timing, magnitude, and spatial variability are represented by an ensemble of hydrological models when calibrated on streamflow using the Kling–Gupta efficiency metric, an increasingly common metric of hydrologic model performance also in flood-related studies. Specifically, we compare how four well-known models (the Sacramento Soil Moisture Accounting model, SAC; the Hydrologiska Byråns Vattenbalansavdelning model, HBV; the variable infiltration capacity model, VIC; and the mesoscale hydrologic model, mHM) represent (1) flood characteristics and their spatial patterns and (2) how they translate changes in meteorologic variables that trigger floods into changes in flood magnitudes. Our results show that both the modeling of local and spatial flood characteristics are challenging as models underestimate flood magnitude, and flood timing is not necessarily well captured. They further show that changes in precipitation and temperature are not always well translated to changes in flood flow, which makes local and regional flood hazard assessments even more difficult for future conditions. From a large sample of catchments and with multiple models, we conclude that calibration on the integrated Kling–Gupta metric alone is likely to yield models that have limited reliability in flood hazard assessments, undermining their utility for regional and future change assessments. We underscore that such assessments can be improved by developing flood-focused, multi-objective, and spatial calibration metrics, by improving flood generating process representation through model structure comparisons and by considering uncertainty in precipitation input.

Abstract. Floods cause extensive damage, especially if they affect large regions. Assessments of current, local, and regional flood hazards and their future changes often involve the use of hydrologic models. A reliable hydrologic model ideally reproduces both local flood characteristics and spatial aspects of flooding under current and future climate conditions. However, uncertainties in simulated floods can be considerable and yield unreliable hazard and climate change impact assessments. This study evaluates the extent to which models calibrated according to standard model calibration metrics such as the widely used Kling-Gupta efficiency are able to capture flood spatial coherence and triggering mechanisms. To highlight challenges related to flood simulations, we investigate how flood timing, magnitude, and spatial variability are represented by an ensemble of hydrological models when calibrated on streamflow using the Kling-Gupta efficiency metric, an increasingly common metric of hydrologic model performance also in flood-related studies. Specifically, we compare how four well-known models (the Sacramento Soil Moisture Accounting model, SAC; the Hydrologiska Byråns Vattenbalansavdelning model, HBV; the variable infiltration capacity model, VIC; and the mesoscale hydrologic model, mHM) represent (1) flood characteristics and their spatial patterns and (2) how they translate changes in meteorologic variables that trigger floods into changes in flood magnitudes. Our results show that both the modeling of local and spatial flood characteristics are challenging as models underestimate flood magnitude, and flood timing is not necessarily well captured. They further show that changes in precipitation and temperature are not always well translated to changes in flood flow, which makes local and regional flood hazard assessments even more difficult for future conditions. From a large sample of catchments and with multiple models, we conclude that calibration on the integrated Kling-Gupta metric alone is likely to yield models that have limited reliability in flood hazard assessments, undermining their utility for regional and future change assessments. We underscore that such assessments can be improved by developing flood-focused, multi-objective, and spatial calibration metrics, by improving flood generating process representation through model structure comparisons and by considering uncertainty in precipitation input. extreme events such as floods (Brunner et al., 2019b;Das and Umamahesh, 2018) and when considering hydrological change. It is therefore challenging to produce statistically reliable estimates of future changes in flood hazard.
A model ideally reproduces different aspects of flooding, including local characteristics such as event magnitude and timing. To obtain such satisfactory flood simulations, hydrological models are often calibrated using one or several objective functions. One widely used metric that is often used in flood studies (e.g., Hundecha and Merz, 2012;Köplin et al., 2014;Vormoor et al., 2015;Wobus et al., 2017) is the Nash-Sutcliffe efficiency (E NS ; Nash and Sutcliffe, 1970) because it is considered integrative compared to others and focuses attention on high flows. However, E NS is formulated so that its optimal value systematically underestimates flow variability (Gupta et al., 2009), undermining the ability of a model to reproduce peak flow values. A related metric, the Kling-Gupta efficiency (E KG ; Gupta et al., 2009), is free from this constraint and may improve simulations of peak flows, especially if the variability-related component of the score is emphasized in calibration . This metric has been frequently used in recent flood modeling studies (e.g., Harrigan et al., 2020;Hirpa et al., 2018;Huang et al., 2018;Thober et al., 2018;Brunner and Sikorska, 2018) and seems to be widely accepted as a suitable choice for flood studies. This acceptance may arise from the general practice of developing models for a range of objectives. However, recent studies have shown that capturing flood magnitude and timing is challenging when such standard calibration metrics are used for parameter estimation (Lane et al., 2019;Brunner and Sikorska, 2018;. In addition to simulating the timing and magnitude of flow at individual catchments, it is also important to realistically reproduce spatial dependencies, i.e., the relationship of flood occurrence across gauging stations (Keef et al., 2013;De Luca et al., 2017;Berghuijs et al., 2019). An over-or underestimation of spatial dependencies across a network of gauging stations in regional flood hazard and risk assessments has been shown to under-or overestimate regional damage, respectively (Lamb et al., 2010;Metin et al., 2020). Prudhomme et al. (2011) have shown for a set of large-scale hydrological models that simulated high-flow episodes are less spatially coherent than observed events. Despite their high relevance for impact, the spatial aspects of flooding have often been overlooked in past simulation studies.
Local and spatial flood characteristics should be reliably simulated, not only under current but also under future climate conditions. However, models calibrated for current conditions may not be transferable in time (Thirel et al., 2015), partly because of a suboptimal representation of flood producing mechanisms. To overcome this transferability problem, the differential split-sample test has been proposed, whereby the model is calibrated and validated on two periods with differing climate conditions (Klemes, 1986;Seibert, 2003).
In this study, we evaluate the extent to which models calibrated according to the model calibration metric E KG are able to capture flood spatial coherence and flood triggering mechanisms. To this end, we first evaluate how well different hydrological models capture local flood events following the current paradigm. Secondly we expand the evaluation by analyzing how well the models capture spatial flood dependence, and finally we evaluate how the models capture flood triggering mechanisms. With this thorough evaluation, we assess which aspects of hydrological models may need to be improved if we want to bring hazard and change impact assessments to a point where we can make more reliable assessments of regional flood hazard and future changes.
For documenting modeling challenges related to floods, we look at the model output of four widely used hydrological models (Addor and Melsen, 2019), namely, the Sacramento Soil Moisture Accounting model (SAC-SMA; Burnash et al., 1973) combined with SNOW-17 (Anderson, 1973, the Hydrologiska Byråns Vattenbalansavdelning model (HBV; Bergström, 1976), the variable infiltration capacity model (VIC; Liang et al., 1994), and the mesoscale hydrologic model (mHM; Kumar et al., 2013;Samaniego et al., 2010). Identifying and documenting model weaknesses regarding regional and future flooding will highlight avenues for future model development and reveal potential deficiencies of a calibration strategy often applied for research studies on floods.

Data and methods
To study how local and spatial flood characteristics are reproduced by hydrological models calibrated on streamflow using the individual calibration metric, E KG , we compare observed to simulated flood event characteristics for a set of 488 catchments in the conterminous United States that have minimal human impact and catchment areas ranging from 4 to 2000 km 2 ( Fig. 1a) (Newman et al., 2015b).
The data set comprises catchments with a wide range of climate and streamflow characteristics, ranging from catchments with intermittent regimes and a very weak seasonality to catchments with a very strong seasonal cycle under the influence of snow (New Year's and melt regimes; Fig. 1b; Brunner et al., 2020b). Observed streamflow time series are available from the U.S. Geological Survey (USGS, 2019).

Model simulations
We use daily streamflow simulations for the period 1981-2008 generated with four well-known hydrological models (Addor and Melsen, 2019) offering different model structures and complexity: the lumped SAC model (Fig. SM 1; Burnash et al., 1973), the lumped HBV model (Fig. SM 2;Bergström, 1976), the lumped version of the VIC model (Fig. SM 3;Liang et al., 1994), and the grid-based, dis-  (Brunner et al., 2020b). tributed mesoscale hydrologic model mHM (Fig. SM 4;Kumar et al., 2013;Samaniego et al., 2010). Basing the study on four different modeling efforts has the advantage of enlarging the sample size from which conclusions can be drawn but the disadvantage that the models were not run as part of a controlled study, with consistent forcings, calibration periods, and parameter selection. The model parameters were calibrated on streamflow observations by minimizing E KG by Melsen et al. (2018) using Sobol-based Latin hypercube sampling (Bratley and Fox, 1988) for SAC, HBV, and VIC and by  for mHM using multi-scale parameter regionalization, whereby the transfer function parameters were identified using the dynamically dimensioned search algorithm (Tolson and Shoemaker, 2007). E KG is defined as where ρ is the correlation between observed and simulated runoff, α is the standard deviation of the simulated runoff divided by the standard deviation of observed runoff, and β is the mean of the simulated runoff, divided by the mean of the observed runoff. s ρ , s α , and s β are scaling parameters enabling a weighting of different components. When used individually, E KG has been found to result in a better performance for annual peak flow simulation than the longstanding and related hydrologic model evaluation metric E NS .
For SAC, Melsen et al. (2018) calibrated and evaluated 18 out of the 35 parameters available in the coupled SNOW-17 and SAC-SMA modeling system, for HBV 15 parameters, for VIC 17 parameters, and for mHM  and  calibrated and evaluated up to 48 parameters. All the models were driven with daily, spatially lumped meteorological forcing data representing current climate conditions: SAC, HBV, and VIC were driven with Daymet meteorological forcing (1 km resolution; Thornton et al., 2012) and mHM with the forcing by Maurer et al. (2002) (12 km resolution), both derived from observed precipitation and temperature. SAC, HBV, and VIC were calibrated and evaluated on the period 1985-2008, while mHM was calibrated on the period 1999-2008 and evaluated on the period 1989-1999. After calibration, all four models were run for the period 1980-2008 (calendar years), whereby the period 1980-1981 was here used for spin-up and therefore discarded from the analysis.
To provide insights with respect to where model performance is better/worse, we provide model evaluation results for five different streamflow regime types, which have been shown to be distinct in their flood behavior: (1) intermittent, (2) weak winter, (3) strong winter, (4) New Year's, and (5) melt ( Fig. 1; Brunner et al., 2020b). Catchments with intermittent regimes experience floods mainly in spring and summer, those with weak winter regimes in winter and spring, those with strong winter regimes in winter, those with a New Year's regime around New Year, and those with a meltdominated regime in spring because of snowmelt.
Model performance in terms of E KG varies spatially and is related to the hydrological regime (Fig. 2). It is overall lowest for catchments with intermittent regimes and a weak seasonality and highest for catchments with a strong seasonality such as a melt and New Year's regime. However, there is a high within-class variability in model performance. The finding that intermittent regimes are challenging to model successfully is well known in hydrology and reproduced in many studies, e.g., Unduche et al. (2018), who show that hydrological modeling on Prairie watersheds is very complex (Hay et al., 2018). Intermittent regimes may suffer in calibration if they rely solely on correlation-type measures because their day-to-day variation is more difficult to reproduce than a more pronounced and regular seasonality. Overall model performance decreases from mHM (median E KG 0.69), over SAC (median E KG 0.63) and VIC (median E KG 0.60) to HBV (median E KG 0.52). In addition to streamflow, we use areal precipitation and simulated soil moisture to explain potential differences in model performance.

Model evaluation for floods
We compare local and spatial flood characteristics extracted from the observed time series to those of the series simulated with the four models for the period 1981-2008 for the five streamflow regimes introduced above. Such a comparison enables identification of flood characteristics whose model representation could potentially be improved. To better understand potential model deficiencies, we look at how models capture flood triggering mechanisms and how they simulate floods under climate conditions different from the current ones.

Flood event identification
Flood events are identified for each of the five time series (one observed, four simulated) using a peak-over-threshold (POT) approach, similar to the one used in Brunner et al. (2019aBrunner et al. ( , 2020b. This approach consists of two main steps and results in two data sets each, which are used for the local and spatial analysis, respectively: (1) POT events (i.e., peak discharges) in individual catchments and (2) event occurrences across all catchments. In Step 1, independent POT events are identified in the daily discharge time series of the individual catchments using the 25th percentile of the corresponding time series of annual maxima as a threshold (Schlef et al., 2019) and by prescribing a minimum time lag of 10 d between events (Diederen et al., 2019). This procedure results in a first quartile of 36, a median of 40, and a third quartile of 47 events identified per basin. In Step 2, a data set consisting of the dates of flood occurrences across all catchments is compiled. This set is converted into a binary matrix which specifies for each catchment (columns) whether or not it is affected by a specific event (rows). We consider a catchment to be affected by a certain event if it experiences an event within a window of ±2 d of that event to take into account travel times. In addition to a binary matrix of all events, we set up seasonal binary matrices (winter: December-February, spring: March-May, summer: June-August, fall: September-November).

Flood characteristics at individual sites
We use the data sets resulting from Step 1, the POT events at individual catchments, to evaluate how well the models reproduce flood statistics at individual sites. We focus on the total number of events n (actual error: n s − n o , where "s" represents simulations and "o" observations), magnitude in terms of mean peak discharge x (relative error: (x s −x o )/x o ), and mean timing (absolute error: circular statistics suitable for defining central tendencies of variables with a circular behavior; Burn, 1997).

Spatial flood dependence
We then use the data sets resulting from Step 2 to evaluate how models reproduce overall and seasonal spatial flood dependence. To do so, we use the connectedness measure introduced by Brunner et al. (2020a), which quantifies the number of catchments with which a specific catchment coexperiences floods. The number of concurrent flood events for a pair of stations is determined based on a data set consisting of the dates of flood occurrences across all catchments. This set is converted into a binary matrix which specifies for each catchment whether or not it is affected by a certain event. The matrix compiled using observed streamflow time series contained 1164 events, among which 258 occur in winter, 291 in spring, 324 in summer, and 291 in fall. Following the definition used by Brunner et al. (2020a), a catchment is connected to another catchment if they share a certain number of events. We used here an event threshold of 1 % of the total or seasonal number of events to define connectedness (all months: 12 events, seasons: 3 events). We computed actual errors in flood connectedness by subtracting observed from simulated connectedness over all seasons and per season.

Flood triggers
To explain potential differences in model performance, we look at the relationship of simulated peak discharge with the two flood triggers precipitation and soil moisture on the day of flood occurrence. We focus on the day of occurrence because time of concentration is typically small for small headwater basins (USDA-NRCS, 2010).

Floods under change
In addition to assessing model performance under current climate conditions, we would like to understand potential, additional challenges arising when interested in future conditions. To do so, we look at how models translate changes in event temperature and precipitation into changes in POT discharge by performing a resampling-based sensitivity analysis. This sensitivity analysis aims at evaluating whether a model is still reliable under climate conditions different from the ones used in model calibration similar to split-sample or differential split-sample calibration and validation schemes (Klemes, 1986;Coron et al., 2012;Refsgaard et al., 2014;Thirel et al., 2015). To perform this sensitivity analysis, we generate surrogate time series of temperature, precipitation, and streamflow for each catchment (Wood et al., 2004;Brunner et al., 2020b). To generate these series, we randomly sample a series of years with replacement in the period 1981-2008, which we use to compose time series consisting of the daily values corresponding to these years for each of the three variables. For each of the surrogate series, we again extract POT flood events using the same procedure as described under Step 1. For each of the extracted events we then determine temperature and precipitation on the day of peak discharge. We use the sets of peak discharge, event temperature, and event precipitation to compute mean event discharge, temperature, and precipitation, which enables the derivation of a relationship between mean POT discharge and the two meteorological variables during events. We repeat the resampling n = 500 times to derive a relationship between changes in mean event temperature and precipitation and changes in mean POT streamflow. This resampling experiment results in a response surface of POT discharge spanned by mean event temperature and mean event precipitation for each catchment. We summarize the results obtained at individual locations by computing horizontal and vertical sensitivity gradients on these reaction surfaces using a linear regression model. The horizontal gradient describes the strength of POT discharge changes in response to event temperature changes, while the vertical gradient describes the strength of change in response to changes in event precipitation. Conducting this experiment for both observed and simulated time series allows for the determination of whether the models react to changes in mean event temperature and precipitation in the same way as the real-world system and are therefore suitable for use in climate change impact assessments of floods. If models produce different climate sensitivities than the ones seen in the observations, the use of models to simulate sets of flood events for future conditions may preclude reliable change assessments.

Flood characteristics at individual sites
Model performance at individual sites with respect to the number of events, event magnitude, and timing varies by model and hydrological regime type (Fig. 3).
For most catchments, the median deviation between the simulated and observed number of flood events lies close to zero (SAC: −3 events, HBV: −1, VIC: −1, mHM: 0). However, the simulations result in over-and underestimations of the number of events depending on the catchment (first and third quartiles for SAC: −9, 4; HBV: −8, 15; VIC: −7, 6; mHM: −6, 6). The overestimation is strongest for HBV, which overestimates the number of events for catchments with intermittent, weak winter, and melt regimes (Brunner et al., 2020b). Event magnitude in terms of peak discharge is generally underestimated for all regime types independent of the model, and also absolute flood timing errors are present in all models. They are the highest in catchments with intermittent regimes with a high variability in flood timing and low in catchments with a New Year's and melt regime, where the flood season is limited to a few months (Brunner et al., 2020a).

Spatial flood dependencies
Over all seasons, most models show a median error close to zero for flood connectedness. Flood connectedness can be over-and underestimated dependent on the catchment by most of the models, while HBV overestimates spatial dependence in most catchments (Fig. 4). Seasonally, most models over-or underestimate spatial dependence in certain regions. In winter, connectedness is overestimated by most models except for VIC, and the strength of over- estimation is strongest for HBV. In spring, most models tend to underestimate spatial dependence except for HBV that results in an overestimation of spatial dependence for catchments with an intermittent regime. Connectedness overestimation by HBV is most pronounced for catchments with an intermittent regime. Otherwise, connectedness over-/underestimation seems to be independent of the regime.

Flood triggers
The differences in model performance regarding local and spatial flood characteristics may be partially explained by differences in their structure and how they transform precipitation into runoff. Figure 5 shows how simulated peak discharge is related to event precipitation, event precipitation plus snowmelt, and simulated soil moisture over all catchments for the four hydrologic models. The SAC and VIC models show similar simulated relationships for all three variable pairs. There is a positive relationship between peak discharge and precipitation and peak discharge and rainfall plus snowmelt; i.e., the higher the precipitation input or rainfall and snowmelt combined, respectively, the higher the resulting peak discharge. This relationship is slightly more expressed for VIC than for SAC. In both models, soil moisture and event magnitude are also positively related with lower peak values, potentially associated with lower soil moisture states than more severe events. The peak dischargeprecipitation relationship of HBV and mHM is less straightforward than the one of SAC and VIC. HBV and mHM also show high discharge when precipitation input is high but may in some cases still produce high discharge values, even for low precipitation inputs. Such low precipitation inputs can also lead to high peak discharge for SAC but to a lesser degree than HBV and mHM. However, peak discharge and rainfall plus snowmelt show a strong linear relationship; i.e., the higher the combined rainfall and snowmelt input to the system, the higher the peak discharge. High flows are in most cases related to nearly full storage states but can occasionally also be triggered when soil moisture is low for SAC and VIC and to a lesser degree for HBV.

Floods under change
In addition to looking at how well local and spatial flood characteristics are represented by models, we look at how changes in temperature and event precipitation are translated into changes in flood flows to assess each model's suitability for climate impact assessments on floods. Our sensitivity analysis shows that the models have difficulty translating changes in event temperature and precipitation into sensitivities of flood flows (Fig. 6), which can be problematic if we would like to use such models in climate change assessments. Generally, flood flows show a relatively low sensitivity to changes in mean event precipitation and temperature. This is in contrast to the behavior for mean flow, which is strongly influenced by changes in mean precipitation as demonstrated in a similar experiment by Brunner et al. (2020b). The much stronger relationship between mean precipitation and flow than between event precipitation and flow might arise because mean flow is a climate signal (Knoben et al., 2018), whereas floods are more an event (higher frequency, short-term) signal. However, some catchments, e.g., Tucca Creek (New Year's regime), show a clear relationship between peak magnitude and both event temperature and precipitation. While these relationships are captured for some catchments (e.g., Blackwater River, weak winter regime or Tucca Creek, New Year's regime), they are not in other catchments. The simulated sensitivities may even point in another direction than the observed ones (e.g., Pacific Creek, melt regime). In the case of melt regimes, the misrepresentation of flood sensitivities by models suggests that they may have difficulty simulating snow-influenced flooding. This relatively poor model performance in capturing observed flood sensitivities can be generalized to the larger set of catchments studied here (Fig. 7). Temperature sensitivities are found to be positive or negative; i.e., an increase in temperature could lead to an increase or decrease of peak flow depending on the catchment. In general, these temperature sensitivities are relatively weak (i.e., gradients are close to zero), which may be the reason why they are difficult to capture. In contrast, precipitation sensitivities are mostly positive; i.e., an increase in event precipitation leads to an increase in peak flow. However, the strength of these sensitivities is underestimated by all models; i.e., a change in precipitation leads to too small a change in peak flow. This underestimation of sensitivity can be understood by the underestimation of flood magnitude in general.

Model performance in simulating floods
The results presented in this study demonstrate that simulating floods using hydrological models calibrated on the popular Kling-Gupta efficiency metric is challenging both at a local and spatial scale. At the local scale, flood timing and magnitude may not be perfectly captured, which can translate into a suboptimal representation of spatial dependencies because space and time are closely related. The challenges related to flood simulations become especially pronounced under climate conditions different from the current ones because additional sources of uncertainty are added to the modeling chain.
Even though the models have been calibrated for the local situation, substantial differences in magnitude and timing were found between observations and simulations. Locally, simulated floods showed smaller magnitudes and had different timing than observed ones, while the number of floods was reproduced relatively well except by the HBV model for catchments with intermittent regimes. The flood magnitude underestimation found for all four models tested is in line with previous studies showing that using E KG individ-ually results in an underestimation of peak flow  due to an underestimation of variability, which will result in an underrepresentation of extremes (Katz and Brown, 1992). Another factor potentially contributing to this underestimation is that the models were forced with spatially lumped instead of distributed data, which may have smoothed the simulated discharge response.
Under the current calibration paradigm, whereby models are calibrated to local discharge conditions using E KG as the objective function, flood connectedness is not accounted for. As a result, flood connectedness is not well captured by the models, as illustrated by the finding that flood connectedness is over-or underestimated depending on the season. The overestimation of spatial dependence in winter for all regimes except the melt regime is likely related to higher simulated than observed snowmelt as high soil moisture and snow availability have been shown to increase spatial flood connectedness (Brunner et al., 2020a). Related to this, the underestimation of spatial connectedness in spring may be related to the subsequent missing snowmelt contributions. Spatial connectedness in summer has been shown to be generally weak due to the occurrence of localized, convective events (Brunner et al., 2020a), which is reflected by most models except for HBV in the case of intermittent and melt regimes. Spatial flood connectedness has also been shown to be weak in fall (Brunner et al., 2020a) but is overestimated by most models. The finding that there is room to improve the representation of spatial flood dependencies is in line with previous studies showing that large-scale hydrological models have a weakness in reproducing regional aspects of floods (Prudhomme et al., 2011).
There are slight variations in performance among models. These variations may result from differences in the representation of flood producing mechanisms, as indicated by distinct behaviors in how the models translate precipitation into runoff. VIC and SAC show more linearity in their event precipitation and peak discharge relationship than HBV and mHM, possibly because VIC and SAC have the capability to generate surface runoff when precipitation intensity exceeds infiltration capacity (Burnash et al., 1973;Liang et al., 1994). In this case, incoming precipitation is directly translated into flood discharge. In contrast, HBV and mHM, the latter of which is based on the HBV model structure (Kumar et al., 2013), does not include a surface runoff component, and all discharge originates in the model stores (Bergström, 1976). This introduces a nonlinearity in the model response and may explain why a smaller precipitation input may still generate high peak flows in these models. These differences in process representation suggest that a "most suitable model" could be identified for a specific application at hand. If one is, e.g., interested in simulating floods in catchments with intermittent regimes, the HBV model does not seem to be an ideal choice because there it simulates too many floods with too small a magnitude. The overestimation of the number of events in catchments with intermittent regimes by HBV may be explained by its fast response to precipitation as expressed through its model parameter β, which introduces nonlinearity to the system (Viglione and Parajka, 2020).
Our climate sensitivity analysis shows that the simulation of floods becomes even more challenging under climate con- ditions different from the current ones as the hydrological models employed in this study have limited capability in reproducing observed hydrologic sensitivities during flooding. These limitations may be related to input uncertainties (Te Linde et al., 2007), insufficient model calibration , or equifinality in process contributions for simulations with (very) similar efficiency scores, leading to an inability to unambiguously identify the appropriate relative process contributions (Khatami et al., 2019).

Potential ways to improve model performance
The results of our model comparison highlight that there is room for improvement regarding the representation of local flood events, spatial flood dependence, and flood producing mechanisms. We discuss here four potential ways for improving model performance: developing flood-tailored calibration metrics, considering spatial aspects in model calibration, improving representation of flood processes, and representing input uncertainties.
A first possibility to improve model performance is to develop calibration metrics tailored to flooding instead of relying on E KG . Our results show that E KG can lead to simulation performance deficits for phenomena of interest, including an underestimation of peak flow, a misrepresentation of timing, and over-or underestimation of seasonal spatial flood connectedness. As is evident in some existing practiceoriented applications of hydrological models (Hogue et al., 2000;Unduche et al., 2018;World Meteorological Organization, 2011), the simulation of floods and other hydro-logic phenomena is likely to be improved by using more tailored model calibration strategies. The representation of streamflow variability could potentially be improved by giving more weight to the variability component of an integrative metric such as the E KG (Pool et al., 2017), whereas the representation of flood magnitude and timing may be improved by giving more weight to the bias and correlation components of the E KG . Alternatively, these characteristics could be optimized explicitly by minimizing the error in key hydrograph signatures related to site-specific flood phenomena. Such flood-focused optimization may similarly to E KG rely on multiple objectives in a scalar function (Gupta et al., 1998;Efstratiadis and Koutsoyiannis, 2010), such as volume error, root-mean-squared error, and peak flow error (Moussa and Chahinian, 2009); E NS and relative peak deviation (Krauße et al., 2012); E KG , peak efficiency, and logarithmic efficiency ; or E KG , peak efficiency, and mean absolute relative error (Sikorska-Senoner et al., 2020). In addition, model performance can potentially be improved by using multiple metrics describing important catchment processes (Madsen, 2003;Dembélé et al., 2020), i.e., flood generating mechanisms such as soil moisture and snowmelt.
A second way to improve model performance is to focus on the spatial representation of extremes, which may be improved by considering spatially distributed features of model response or spatial correlation within a spatial calibration framework. Such a framework could build upon existing spatial verification metrics such as the spatial prediction comparison test used, e.g., to validate precipitation fore-casts (SPCT; Gilleland, 2013), empirical orthogonal functions (EOFs), or Kappa statistics (Koch et al., 2015). For the calibration and evaluation of spatially distributed hydrological models, Koch et al. (2018) recently proposed the SPAtial EFficiency (SPAEF) metric, which reflects three equally weighted components: correlation, coefficient of variation, and histogram overlap. To improve the spatial dependence of floods across different sites, such spatial calibration frameworks would need to include spatial verification metrics focusing on extremes, which could, e.g., be achieved by looking at deviations of simulated from observed F-madograms, which measure extremal dependence (Cooley et al., 2012). Please note, however, that even the use of spatial verification metrics may not overcome the lack of spatial heterogeneity in precipitation or soil moisture data.
A third way of improving model performance is to test whether a model is fit for purpose and to identify model structures which accurately represent relevant flood producing mechanisms. The importance of model structure choice has been highlighted in previous studies both for low-and high-flow events (Melsen and Guse, 2019;Kempen et al., 2020;Knoben et al., 2020) and should depend on the spatial complexity of the phenomenon studied (Hrachowitz and Clark, 2017). However, model structure choice for a specific application is not straightforward, and automatic model structure identification frameworks have only been introduced very recently (Spieler et al., 2020). To improve the representation of flood processes, such frameworks would ideally explicitly consider local and spatial flood characteristics and the representation of different flood generation processes such as rain-on-snow events or flash floods. The representation of rain-on-snow floods for example requires an accurate representation of the energy balance in order to represent factors affecting snowmelt processes such as net radiation and turbulent heat fluxes (Pomeroy et al., 2016;Li et al., 2019) .
A fourth possibility to improve model performance is to address data uncertainty of streamflow observations and of precipitation input. Errors in streamflow measurements caused by stage-discharge rating-curve uncertainty (Coxon et al., 2015;Kiang et al., 2018) influence model calibration and evaluation. To improve uncertainty estimates, such uncertainty should be accounted for by explicitly considering streamflow measurement uncertainty in model calibration (McMillan et al., 2010). In addition, the uncertainty of the precipitation product used to drive a hydrological model can lead to differences in observed and simulated flows (Te Linde et al., 2007;Renard et al., 2011). Precipitation products may show observation uncertainties (Mcmillan et al., 2012) and underestimate extreme rainfall or the spatial dependence of extreme precipitation at different locations because spatial smoothing or averaging during the gridding process reduces variability (Haylock et al., 2008;Risser et al., 2019). Such spatial uncertainty could be accounted for by using probabilistic analyses of precipitation fields (Newman et al., 2015a;Frei and Isotta, 2019). The consideration of such in-put uncertainty is particularly important if we are interested in future changes because of climate model and scenario uncertainty, where precipitation uncertainty is specifically pronounced (Chen et al., 2014;Lopez-Cantu et al., 2020). Even though many of these possibilities have been discussed in previous studies, their consideration in flood analyses is not standard practice.

Conclusions
Our model comparison shows that flood characteristics are not always well captured in hydrological models developed for research studies -even when the models have been calibrated with a calibration metric perceived suitable for flood modeling, the Kling-Gupta efficiency metric (E KG ). The number of flood events was over-or underestimated depending on the catchment, flood magnitudes were underestimated by all models in most catchments, and the ability of the model to accurately reproduce event timing was proportional to the hydroclimatic seasonality. These model deficiencies in reproducing local flood characteristics, especially timing, can lead to a misrepresentation of spatial flood dependencies, particularly in winter, because the temporal and spatial dimensions of flooding are closely linked. Our sensitivity analysis also shows that climate sensitivities of floods, especially to changes in precipitation, are not well represented in models, even if the model can be deemed "well calibrated" according to the E KG metric. These sensitivities are generally underestimated by models independent of the geographical areas considered; i.e., an increase in event precipitation may not be translated into a strong enough increase in flood peak. The misestimation of these sensitivities may undermine the reliability of future flood hazard assessments relying on such models.
The limited capability of the models in reproducing local and spatial flood characteristics and the sensitivity of runoff to precipitation inputs is partly attributed to model structure and partly to a reliance of the calibration on an individual variable (streamflow) and metric (E KG ). While E KG is integrative of certain properties (bias, variance, correlation), it does nonetheless not explicitly focus on high-flow values, their spatial dependencies, or processes generating high-flow values. We conclude that calibration using only an individual model performance metric or variable can result in model implementations that have limited value for specific model applications, such as local and in particular spatial flood hazard analyses and change impact assessments. This study underscores the importance of improving the representation of magnitude, timing, spatial connectedness, and flood generating processes. Potential ways of achieving such improvements include developing flood-focused, multiobjective, and spatial calibration metrics, improving flood generating process representations through model structure comparisons, and reducing uncertainty in precipitation in-put. Such steps are recommended to improve the reliability of flood simulations and ultimately local and regional flood hazard assessments under both current and future climate conditions.
Data availability. Observed streamflow measurements were made accessible by the USGS and can be downloaded via the website at https://waterdata.usgs.gov/nwis/uv/?referred_module=sw (USGS, 2019). Simulated streamflow, precipitation, and storage time series can be requested from Lieke Melsen (lieke.melsen@wur.nl) for the SAC, HBV, and VIC models and from Oldrich Rakovec (oldrich.rakovec@ufz.de) for the mHM model.
Author contributions. MIB and MPC developed the study design. NM, OR, and LAM provided the model simulations and together with MIB, MPC, and WJMK interpreted the model output. AWW assisted with the paper's background and messaging and proposed the climate sensitivity strategy. WK produced the model illustrations. MIB wrote the first draft of the paper, and all co-authors revised and edited the paper.