Identifying sensitivities in flood frequency analyses using a stochastic hydrologic modeling system

This study employs a stochastic hydrologic modeling framework to evaluate the sensitivity of flood frequency analyses to different components of the hydrologic modeling chain. The major components of the stochastic hydrologic modeling chain, including model structure, model parameter estimation, initial conditions, and precipitation inputs were examined across return periods from 2 to 100 000 years at two watersheds representing different hydroclimates across the western USA. A total of 10 hydrologic model structures were configured, calibrated, and run within the Framework for Understanding Structural Errors (FUSE) modular modeling framework for each of the two watersheds. Model parameters and initial conditions were derived from long-term calibrated simulations using a 100 member historical meteorology ensemble. A stochastic event-based hydrologic modeling workflow was developed using the calibrated models in which millions of flood event simulations were performed for each basin. The analysis of variance method was then used to quantify the relative contributions of model structure, model parameters, initial conditions, and precipitation inputs to flood magnitudes for different return periods. Results demonstrate that different components of the modeling chain have different sensitivities for different return periods. Precipitation inputs contribute most to the variance of rare floods, while initial conditions are most influential for more frequent events. However, the hydrological model structure and structure–parameter interactions together play an equally important role in specific cases, depending on the basin characteristics and type of flood metric of interest. This study highlights the importance of critically assessing model underpinnings, understanding flood generation processes, and selecting appropriate hydrological models that are consistent with our understanding of flood generation processes.

relates probability of occurrence to magnitude of a flood. There are numerous approaches to developing these curves, including (1) statistical stream gauge analysis, e.g., calculating the annual exceedance probability (AEP) (National Research Council 1988); (2) 'design storm' rainfall-runoff hydrologic model estimates, where the return period of the flood is equal to the return period of the precipitation (e.g. Packman and Kidd 1980;Boughton and Droop 2003;Reclamation 2006;Wright et al. 2020);35 (3) more complex fully stochastic rainfall-runoff modelling to explicitly represent the impacts of hydrological processes on floods (Rahman et al. 2002;Schaefer and Barker 2002;Nathan et al. 2003;Wright et al. 2014); and (4) analysis of paleoflood records (England et al. 2010). Typically, multiple methods are employed in these analyses to help understand the uncertainty of model results (e.g., England et al. 2014). Many of these methods rely on the assumption of AEP-neutrality, i.e., that a rainfall event has a similar AEP to the flood event. 40 The assumption of AEP neutrality is often not verifiable or violated (e.g. Rahman et al. 2002;Kuczera et al. 2006;Small et al 2006;Pathiraja et al. 2012;Paquet et al. 2013;Ivancic and Shaw 2015;Sharma et al. 2018). One avenue to address this is to perform stochastic rainfall-runoff modelling. In stochastic rainfall-runoff modelling, flood frequency (FF) estimates are typically produced using stochastic event simulations using a single hydrologic model with randomly perturbed model 45 parameters, initial conditions (ICs), and precipitation event forcing scenarios from defined precipitation frequency distributions (Rahman et al. 2002;Paquet et al. 2013;Wright et al. 2020). This modelling chain permits deviations from AEP-neutrality and quantifies the impacts of IC, model parameter, and precipitation event forcing variability in FF estimates.
However, past research on hydrologic model behaviour also emphasises the differences in model performance and responses 50 for various event types given different model parameters and structures across hydroclimates (e.g. Clark et al. 2008;Mendoza et al. 2015;Markstrom et al. 2016;Newman et al. 2017;Mizukami et al. 2019), highlighting the possible need to include multiple model structures in stochastic flood modelling studies. Model structure can vary widely. For example, a model may simply be defined by a loss methodology where an initial and continuous losses are defined at the start of and during the event simulation (e.g. Boughton and Droop 2003), or can be more complex employing various methods to explicitly simulate the 55 dominant hydrological processes (e.g., snow melt, surface runoff generation). Additionally, most methods used to perturb model parameters and meteorological forcings do not allow us to identify which components are the most sensitive in an FF estimate. Therefore, we systematically explored the sensitivity FF estimates to provide a better understanding of which components of the modelling chain have the most impact on FF estimates across example hydroclimatic regimes using basins within the Western United States (USA). 60 To our knowledge, the systematic examination of model structure contributions to variations in flood frequencies is a novel contribution to the flood modelling literature. Previous work has examined uncertainty and sensitivities in statistical methods (e.g. Hosking and Wallis 1986;Stedinger et al 1993;Klemes 2000;Kidson and Richards 2005;Thieken 2005, 2009;Hu et al. 2020), or from probabilistic hydrologic modelling systems (Hashimi et al. 2000;Franchini et al. 2000;Blazkova et al. 2009;Arnaud et al. 2017). Specifically, the companion papers of Hashimi et al. (2000) and Franchini et al. (2000) undertake a one at a time local sensitivity analysis and a full sensitivity analysis using a factorial sampling design to examine basin climate characteristics and hydrologic model parameters impacts on FF estimates. Hashimi et al (2000) find that several parameters related to the basin climate (e.g. average rainfall, storm intermittency) along with several hydrologic model parameters such as the percolation rate have higher sensitivity when considering FF estimates. They also conclude that soil 70 moisture at event onset is the linking mechanism that explains why their particular parameters are the most sensitive. For example, soil moisture states closer to saturation result in larger floods for a given event with wetter soils modulated by a wetter mean climate or lower percolation rates. Franchini et al. (2000) perform a full sensitivity analysis and confirm the local sensitivity results. However, model structure is not systematically varied in Hashimi et al. (2000) and Franchini et al. (2000).

75
The overall goal of this study is to improve both the quality and efficiency of hydrologic risk estimates for infrastructure design. The specific objective is to understand which components of the modelling chain have the largest impact to FF estimates. To address this objective, we ask the following question: What aspects of the modelling chain in stochastic FF analysis have the most sensitivity across a range of return intervals spanning 2-100,000 years? Our null hypothesis is that for rare floods (floods with return periods greater than 50,000 years) the sensitivity related to the precipitation event forcing 80 dominates the total variance of a FF estimate as sketched in Figure 1a, with variance in FF estimates arising from the aforementioned factors: 1) model structure, 2) model parameters, 3) initial conditions, and 4) precipitation event forcing. We postulate that there may be other dominant factors contributing to FF sensitivity outside of precipitation event forcing for rare floods. We explore these components of the modelling chain by: 1) using a multi-hydrologic model ensemble, 2) sampling model parameters across the model structures, 3) sampling model initial conditions that are internally consistent for each model 85 structure from calibrated continuous long-term simulations, and 4) incorporating statistical uncertainty in the distributions that define the precipitation forcing. Further, we explore the impact of precipitation timing using two meteorological sequences; in one, we force the model with a single precipitation event, in the second, we force the model with a single precipitation event and random historical weather after the precipitation event to drive a stochastic (ensemble) event simulation framework. The two different meteorological sequence methodologies were used to mimic different United States agency methodologies 90 (Section 3.1.6). We use the analysis of variance (ANOVA) methodology to examine relative contributions of variance to FF estimates across the return periods of interest for all factors for both meteorological sequences. While the focus of this study was on stochastic rainfall-runoff modelling, the methods and implications discussed here may be applicable to simpler rainfallrunoff modelling as well, such as AEP-neutral model estimates.

Study Basins 95
The Island Park Dam in Idaho and Altus Dam in Oklahoma watersheds are used as representative basins of mountainous snowmelt (Island Park) and semiarid high plains (Altus) hydroclimates, respectively. These basins were selected because not terrain with medium to coarse textured soils and spans an elevation from about 1120 m at the western edge of the basin to 450 m at the eastern outlet. This area contains many topographic features known as playa lakes (closed basins with a low area in the center that may see water storage following heavy rainfall) and thus the total contributing area is smaller than the total area of the watershed. We used the Reclamation estimated contributing area of 5051 km 2 . Much of the basin above Altus Dam is devoted to agriculture with a majority of the land cover consisting of cultivated crops, pasture, and hay production. The 120 drainage basin contains no large forested areas, but there are treed riparian zones along the watercourses and trees in cultivated shelterbelts (Reclamation 2012). Altus Dam is a semi-arid basin that also has a seasonal cycle to precipitation with most occurring in winter through summer, primarily as rainfall. The spring and summer rainfall events are primarily convective in nature with sometimes very intense rainfall rates and high total accumulations over short periods of time that may coincide with peak basin soil moisture in the spring. 125

Modelling Workflow
Our stochastic hydrologic modelling workflow includes the Framework for Understanding Structural Errors (FUSE) hydrologic modelling framework, the Shuffled Complex Evolution (SCE) optimization algorithm, and precipitation frequency https://doi.org/10.5194/hess-2021-49 Preprint. Discussion started: 3 March 2021 c Author(s) 2021. CC BY 4.0 License. distributions from Reclamation. Additionally, we have used the law of total probability (e.g. Tijms 2003;Nathan et al. 2003) 130 and the analysis of variance (ANOVA) method to compute the FF estimates and partition the variance across the workflow components respectively.
For each basin, hydrologic models are configured and calibrated using an ensemble of historical meteorology . Then, long-term continuous simulations are made to generate spun-up initial conditions for event simulations. Event 135 simulations are then performed across hydrologic models, model parameters, initial conditions, and precipitation frequency distribution estimates for two event sequence possibilities. For each precipitation frequency distribution, we split the probability density function into 50 bins and sample 25 events per bin and perform 2500 model simulations for each possible model-parameter-IC-precipitation frequency combination. This follows the total probability theorem methodology used at Reclamation in their stochastic flood modelling. We implemented a factorial experimental design, using all combinations of 140 the 10 hydrologic models, 11 parameter sets, 4 initial condition sets, and 11 precipitation frequency estimates for Island Park Dam (3 precipitation frequency estimates for Altus Dam) for a total 4840 combinations with 2500 model simulations per combination resulting in 12.1 million event simulations for Island Park Dam (referred to as Island Park) and 3.3 million event simulations for Altus Dam (referred to as Altus). The different precipitation frequency estimates come from the fact that this project leveraged previously completed studies for these data. We do not believe this will significantly impact the results, as 145 the ANOVA analysis takes these sampling differences into account.

Hydrologic Model Framework
The FUSE hydrologic modelling system is a freely available, modular modelling framework that enables developing and testing many conceptual hydrologic models in a single computational framework. It incorporates multiple parameterizations for many hydrologic fluxes (or processes) at the individual flux level, with each equation formulated as a function of the model 150 state, each in a separate code module. This allows the numerical solver to be separated from the flux parameterizations so that every FUSE configuration relies on the exact same numerical scheme. FUSE also incorporates a conceptual temperature index snow model, using elevation bands with user specified precipitation and temperature lapse rates to represent seasonal snowpack and changes in meteorology with elevation. Control at the individual flux level is key to understanding how changes in process representation affect the modelling system behavior. See Clark et al. (2008) and Henn et al. (2015) for more details regarding 155 FUSE.
FUSE uses several configuration files in which the user can specify the model decisions for process representation, numerical solver parameters, model calibration options, access to input and output data, etc. The structural modularity in FUSE is underpinned by one file prescribing the equations to be used for each model component. This file can be changed independently 160 from the other model settings, enabling the user to isolate the effects of the model structure decisions on the simulations. FUSE contains the SCE optimization algorithm (Duan et al. 1992) to calibrate any hydrologic structure the user specifies. SCE is a https://doi.org/10.5194/hess-2021-49 Preprint. Discussion started: 3 March 2021 c Author(s) 2021. CC BY 4.0 License.
robust global optimization algorithm that is widely used across the operational and research communities. FUSE uses the network common data format (netCDF) for all input and output data streams (forcing meteorology, any available observations for calibration, calibration results, simulated states and fluxes), with the same file formats regardless of hydrologic model 165 configuration. Overall, the design of the FUSE system allows for easy configuration, calibration, and simulation of multiple hydrologic models for long term continuous simulations or short event simulations.
FUSE is first used to mimic three widely used hydrologic models: Hydrologic Engineering Center-Hydrologic Modelling System (HEC-HMS) model (Bennett 1998), the Variable Infiltration Capacity (VIC) model (Liang et al. 1994), and the 170 SACramento-Soil Moisture Accounting (SAC-SMA) model (e.g. Anderson 2002). This provides a relatable base set of models to operational groups within the USA. Note that the FUSE instantiations of the models only mimic the actual models cited.
FUSE does not use the same numerical solver, some process simplifications are made (particularly for VIC where we simplify evapotranspiration), different parameter estimations schemes are used, and FUSE does not contain the same coding errors as the original models (see Clark et al. 2008 for FUSE details). As a result, when mimicking a pre-existing model using a modular 175 framework, some significant differences between their simulations can exist (Knoben et al., 2019). We then assembled seven other hydrologic model structures by varying particular processes from the three base models for a total of ten structures that we used to compute FF estimates for both basins (see Table 1 for the full list).

FUSE Meteorological Forcing and Calibration
All 10 hydrologic models for both basins were calibrated using the SCE optimization algorithm. We used KGE and RMSE as 180 objective functions because the choice of objective function is subjective and dependent on available data and user needs.
Additionally, recent work has highlighted that careful consideration needs to be given to the choice of objective function for high flow events (Mizukami et al. 2019). Root mean squared error (RMSE) is directly related to Nash-Sutcliffe Efficiency (NSE). Further, it can be shown that RMSE/NSE is made up of three component contributions to the total value: correlation (r), variability ( ), and bias ( ). The Kling-Gupta Efficiency (KGE) is a reformulation of these same components, which 185 allows the user to easily understand their individual contributions to the total KGE value (Gupta et al. 2009) and is shown in where EDs is the scaled Euclidian distance from the ideal point and , , and are scale factors to adjust the weighting of the correlation, variability and bias terms (set to 1 typically). The KGE is also beneficial to use because the scale factors can 190 be adjusted to emphasize the different components of KGE. Here we tested RMSE and KGE calibrations using daily streamflow, and KGE computed using annual peak flow values. We also examined modifying the KGE scale factor from 1 to 5 to emphasize model flow variance in an effort to better capture flood peaks. Inflated values resulted in model behavior A maximum of 40,000 model runs was allowed for the SCE calibration of each model structure and basin. Reconstructed daily inflow data from Reclamation was used for Island Park, while annual peak flow data developed by Reclamation was used for Altus due to lack of better available data for calibration at the time of this study. The impact of these different objective functions and calibration data for the basins will be discussed in Section 4.

200
The meteorological forcing data consisted of a 100-member ensemble of gridded precipitation and temperature at 6 km resolution described in Newman et al. (2015). Observations of precipitation and temperature and the process of projecting point measurements to grids across sometimes complex terrain are inherently uncertain. This ensemble dataset was designed to estimate those uncertainties and provide many plausible historical traces for hydrologic model applications. Each individual member was used to calibrate each hydrologic model, resulting in a 100-member ensemble of calibrated model parameters for 205 each model for each basin (100 ensembles × 10 models × 2 basins). Because of the available observational data, different spin up and calibration periods were used. For Island Park, the hydrologic models were spun up for water years (WY) [1970][1971][1972][1973][1974][1975][1976][1977][1978][1979] and calibrated on WY 1980-2014 (35 WYs), while Altus was spun up for WY 1980-1984and calibrated on WY 1985-2011. Again, while the number of WYs for both catchments is similar, data availability meant that Altus calibration only relied on annual peaks, while for Island Park daily streamflow values were used. 210

Initial Condition Specification
Continuous simulations using the subsampled parameter sets were then performed and full model states were output each day for the full calibration periods for each hydrologic model and basin. These states were sampled to determine the ICs for the event simulations. Sampling initial states from continuous simulations has the advantage of providing ICs that are consistent with the specific hydrologic model and parameter set. Applying random perturbations to an IC may result in unrealistic states 215 and subsequent simulation results.
For Island Park, the focus was on ICs from April through June that had minimal (> 10 mm) snow water equivalent snowpack to represent flood events near the end of the snowmelt season around peak climatological soil moisture storage. For Altus, the focus was on late winter through mid-summer ICs (February-July) when both soil moisture and precipitation event intensity 220 and volumes are around their climatological maximums. For both basins and all models, four ICs were sampled in the top 10 percent, the 90 th , 94 th , 97 th , and 99 th percentiles of total column soil moisture within all validation years and months.

Precipitation Frequency Estimates
Regional frequency analysis (RFA) is a useful method for extending the period of record in environmental datasets by means basic assumption of RFA is that extreme events recorded at stations located within a predetermined homogeneous region can be described by the same probability distribution. By scaling the data by the respective at-site mean (ASM), the user assumes that a single probability distribution is valid for every location within the homogeneous region, while the magnitude can vary spatially.

230
The L-moments method (Hosking and Wallis 1997) is one example of a regional frequency method. The basis of the Lmoments algorithm is that linear combinations of moments can be used to estimate model parameters for extreme value distributions. The moments of interest (also referred to as L-statistics) include L-CV, L-skewness, and L-kurtosis and are computed for every site utilized in an analysis. Regional moments are developed using weighted averages of the site-specific moments, where the weight is proportional to period of record. The regional L-moments are then used to define the regional 235 growth curve (RGC), a dimensionless quantile function that represents the cumulative distribution function of the frequency distribution valid for all sites located within the homogenous region. Site-specific precipitation-frequency estimates (Qi(F); Equation 2) are developed by scaling the RGC (q(F)) by a site-specific ASM (μi), allowing the magnitudes of precipitationfrequency estimates to vary spatially across the region of interest.

240
Reclamation (2015) developed median and uncertainty precipitation-frequency curves for the Island Park watershed using a regional L-moments approach combined with Latin hypercube resampling procedures. More specifically, the authors used annual maximum two-day precipitation totals from 45 stations in a homogeneous region surrounding the Island Park watershed 245 to estimate parameters of the four-parameter Kappa distribution. The authors used Latin-hypercube sampling methods in R via the "qnorm" function to perform Monte Carlo sampling to create 300 parameter sets using variations in five parameters: at-site mean, regional L-Cv, regional L-skew, regional L-kurtosis, and areal-reduction factor. Results from this analysis include 11 frequency distributions 5 th , 14 th , 23 rd , 32 nd , 41 st , 50 th , 59 th , 68 th , 77 th , 85 th , and 95 th percentiles. Kappa parameters from Reclamation (2015) are reproduced in Table 2. During stochastic simulations performed here, we force two-day historical 250 precipitation events to equal basin-average magnitudes sampled from the two-day precipitation frequency curve valid over the Island Park watershed, while retaining the spatial precipitation structure from the historical event.
Similarly, Reclamation (2012) developed precipitation-frequency estimates including median and uncertainty bounds for the Altus watershed using a regional L-moments approach combined with Latin hypercube sampling procedures. The authors 255 focused on annual maximum one-day (or 24-hour) precipitation totals recorded at 482 stations with at least five years of data and used Latin hypercube sampling to produce 150 parameter sets based on variations in the same five parameters listed above: at-site mean, regional L-Cv, regional L-skewness, regional L-kurtosis, and areal-reduction factor. The report provides all precipitation-frequency estimates in the form of fourth-order polynomials, with coefficients reproduced in Table 3 Island Park simulations, we force one-day historical precipitation events to equal basin-average magnitudes sampled from the 260 one-day precipitation frequency curve valid over the Altus watershed, while retaining the spatial precipitation structure from the historical event.

Event Sequencing
Some stochastic modelling studies at Reclamation force the rainfall-runoff model with a precipitation event (e.g., two-day event) followed by no precipitation for the remaining simulation time (Reclamation 2018). The lack of additional precipitation 265 after the primary precipitation event is not based on any physical reasoning, thus we examine both dry and historical meteorological sequences after the primary precipitation event (two-day at Island Park and one-day at Altus). The forcing event lengths differ because of the differing meteorology driving floods in the two basins. Again, more intense shorter duration convective precipitation events primarily cause flooding at Altus, while longer duration precipitation events associated with extratropical cyclones is the primary flood driver at Island Park. In the dry meteorological sequence, we set precipitation to 270 zero after the primary precipitation event. In the historical meteorology setup, we randomly sample ensemble member meteorology sequences using the same start date corresponding to the sampled IC for the simulation period only. In other words the ICs are taken from a continuous simulation and define the event start date, but we then randomly sample from all 100 members to redefine the event period. In both cases, the primary precipitation forcing (two-day and one-day) is forced to equal sampled values from the precipitation frequency curve. Future work should examine event sequencing in greater detail, 275 particularly to quantify the impacts of possible future circulation changes on FF estimates and sensitivities.

ANOVA
As noted above, the total probability theorem is used to compute modelled basin runoff at return periods of 2, 5, 10, 20, 50 100, 500, 1,000, 5,000, 10,000, 50,000, and 100,000 years from the stochastic simulations for all model, parameter, IC, and precipitation distribution combinations, for both event sequences. An ANOVA analysis is then performed on the runoff values 280 for all the return periods for both event sequences and basins. The ANOVA framework is a computationally frugal way to estimate individual component contributions to the total variance of a variable such as runoff by relying on a sum of squares decomposition. ANOVA analyses have been widely used in hydrometerology to separate the components of future climate changes (Hawkins and Sutton, 2009;Lehner et al., 2020) and to determine which elements of the model chain contribute most to the spread of the projected changes in streamflow (Bosshard et al., 2013;Addor et al. 2014;Breuer et al., 2017;Chegwidden 285 et al., 2019).
By estimating the fractional (relative) variance contributions of each factor and all two factor interactions, we identified the pieces of the modelling workflow which contribute most to FF sensitivity for each return period. We used the 'anovan' MATLAB function as implemented in MATLAB version 9.8.0.1380330 (2020a) Update 2. This function allows for N-way 290 ANOVA computations with mixed continuous and categorical predictors, and specification of the interaction terms to be https://doi.org/10.5194/hess-2021-49 Preprint. Discussion started: 3 March 2021 c Author(s) 2021. CC BY 4.0 License. estimated (https://www.mathworks.com/help/stats/anovan.html). Here we specify model structure and parameters as categorial predictors and precipitation event forcing and initial conditions as continuous predictors. Precipitation event forcing and initial condition values are normalized before the ANOVA analysis was performed. Finally, we perform a subsampling and bootstrapping of the effects that have more samples than the effect with the fewest samples (e.g. for Island Park ICs have 295 4 samples, precipitation frequency distributions have 11 samples) following Bosshard et al. (2013). Disparate sample sizes can bias the fractional variance estimates, overestimating the contributed variance for effects with more samples. Performing subsampling with bootstrapping (n=1000) alleviates the bias (Bosshard et al. 2013).

Model Calibration
When examining daily flow time series, the KGE and RMSE daily metrics produce more realistic simulations than the KGE 300 interval metric as seen in Figure 3. This is a somewhat expected result as the interval metric contains no time information (correlation) on the daily scale. The daily KGE metric based calibration outperforms the daily RMSE based calibration, where the daily RMSE based calibration underestimates the flow variance (not shown) in agreement with past studies (Gupta et al. 2009). The KGE interval metric-based calibration represents the peak flows well (with some overrepresentation) but has large differences in event recession curves with overestimation of flow in the days and weeks immediately following high flow 305 events. This erroneous recession curve representation would result in very different volume-based floods versus daily metricbased calibrations.
Given the above calibration characteristics and the available calibration data at Island Park (daily flow) and Altus (annual peak flow), daily KGE was selected as the calibration metric for Island Park and interval KGE as the calibration metric for Altus. 310 Daily KGE provides the best all-around simulation when considering daily peak flows as well as volume integrations over days to weeks at Island Park. For Altus, calibrating to yearly peak flows using KGE provided a better overall peak flow calibration than RMSE calculated using annual peak flows, likely due to the reformulated weighting of bias and variance as compared to RMSE. Again, these results agree with Mizukami et al. (2019), which examined some of the same calibration metrics using multiple hydrologic models and hundreds of basins across the contiguous United States. They found that KGE 315 outperforms RMSE (or NSE) based calibrations and that peak flow metrics do outperform KGE for peak flow simulation but result in much degraded daily model performance with sometimes severe modelled flow biases. Figure 4 highlights the final CDF of the calibrated KGE for all ten models for Island Park (Fig. 4a) and Altus (Fig. 4b). It is not possible to make direct performance comparisons between the models at the two basins given that the KGE values are 320 based on daily (Island Park) and annual peak runoff (Altus). However, in a broad sense, model behavior at Island Park is much more constrained than Altus based on the relative ranges of calibration scores for each basin (different x-axis ranges from left https://doi.org/10.5194/hess-2021-49 Preprint. Discussion started: 3 March 2021 c Author(s) 2021. CC BY 4.0 License.
to right panels). These differences informed the model parameter sampling strategies and show that the model behavior at Island Park is more constrained than at Altus along the model parameter dimension.

FUSE Parameter Set Selection 325
The 100 parameter sets available for each model and basin were subsampled for the final FF event simulations. Because Island Park had more available data for calibration, the final calibrated model performance was very similar across the 100 members for all 10 hydrologic models. Therefore, 11 parameter sets spanning the full range of model performance were sampled for each hydrologic model using the 1st, 10th, 20th, 30th, 40th, 50th, 60th, 70th 80th, 90th and 99th percentiles of the cumulative density function (CDF) of the calibration objective function. For Altus, the calibrated model behavior was less constrained 330 due to the much smaller amount of calibration data available. Therefore, the 10 best calibrated parameter sets for each hydrologic model were used, which constrained model parameter induced differences in model behavior, but still not to the same level as Island Park.

Sensitivity Analysis
The ANOVA analysis was performed using the full complement of FF estimates for both basins and precipitation event forcing 335 sequences. All fractional variance contributions are normalized by the total variance in the FF estimate for each return period such that if a component has a fractional variance of 0.5 that component contributes half of the total variance for that return period. The plots represent the 2, 5, 10, 50, 100, 1,000, 10,000, 50,000, and 100,000-year return periods. For Figures 6 through   11, the dry meteorological sequence is always in panel a) and the historical meteorology event sequence is always in panel b), and the color coding follows Figure 1. Interaction terms are a blend of the two primary components (e.g. model structure-340 model parameter interactions are red-orange).
Normalized FF plots including all possible effect combinations for both models are shown in Figure 5. Annual exceedances at Island Park in the mean follow a nearly linear trend on the semi-log X-axis plot with the range of possible values having relatively higher spread at larger return intervals (Fig. 5a), which is consistent with the hydrology of Island Park being a less 345 flashy more snowmelt flow dominated basin. The normalized FF curve at Altus is highly non-linear even with a semi log Xaxis with little flow for many small return periods (Fig. 5b). Sharp increases in flood responses after roughly the 500 year return period are seen with normalized spread larger than at Island Park for the largest return periods (50-100,000 years).

Island Park
ANOVA results using all available model structures, sampled parameter sets, sampled ICs, and sampled precipitation 350 frequency distributions for Island Park are shown in Figure 6. When all model structures are included ICs dominate the frequent events less than about 5000 years, while the precipitation frequency distribution is the most important for rarer events. Model structure consistently contributes roughly 20% of the variance and is generally the second most important effect across all return periods, outside of 1,000 -10,000 year flood where ICs, precipitation frequency curves, and model structure vary in leading, secondary or tertiary importance depending on the dry or wet meteorological sequencing. Model parameters and 355 interaction terms contribute roughly 10% of the variance for less frequent events for both meteorological sequences. For rare floods with return periods larger than 50,000 years, event precipitation is about twice as important as model structure and 3 times more important than ICs for dry sequences after the event input, while for historical meteorological sequencing, the event forcing is only about 1.5x more important than model structure for 100,000 year floods. 360 Figure 7 presents the fractional variance contributions for Island Park using the three base models: HEC-HMS, VIC, and SAC-SMA. Similar to all models, ICs and the precipitation frequency distribution specification are the most important for frequent and extreme events, respectively. Model structure is the second most important contributor for frequent events, but for return periods larger than 1,000 years, model structure-parameter interactions become as or more important than model. Again, moving from the dry to historical meteorological sequence decreases the variance contribution of precipitation frequency 365 distributions, and increases the importance of model structure, model structure-parameter interactions, and ICs across all return periods (compare Figure 7a to 7b). This is somewhat counter intuitive but may be related to the fact that soil states can strongly influence recession curve characteristics and additional non-extreme precipitation event forcing is either stored or released within the 14 day volume integration depending on model structure, parameters, or ICs.

370
Using a different combination of the ten possible model structures results in a slightly different conclusion. The set of simulations presented in Figure 8 represents the set of three hydrologic models that generates the largest flood responses to larger precipitation event forcing. Overall, the precipitation frequency distribution specification is still the most important at extreme events, and ICs are most important for very frequent events, but model structure contributes a larger fraction of the total variance across all return periods and is often of similar magnitude to either ICs or precipitation frequency distribution 375 changes ( Figure 8). Here we see that moving from the dry to historical meteorological sequence increases the importance of model structure (compare Figure 8a to 8b). This is because these three model structures have more variation between each other given additional precipitation input than the variability in runoff changes due to ICs. Differences in surface runoff versus subsurface storage and slower baseflow appear to be driving the model structure variability and is discussed more in Section 6. 380

Altus
The ANOVA results using all available model structures, sampled parameter sets, sampled ICs, and sampled event forcings for Altus are shown in Figure 9. Similarly to Island Park, ICs are most important for frequent events, while precipitation event forcing is most important for rarer events. Two differences are of note here. First, precipitation event forcing is generally more important across return periods at Altus versus Island Park. Second while model structure is slightly less important, model parameters and model structure-parameter interactions are of similar importance to model structure, such that the combination of model structure and parameter effects and interactions is as important as precipitation event forcing for both meteorological sequences. Finally it is evident that meteorological sequencing is inconsequential at Altus, which makes intuitive sense given the single day peak flow metric for Altus versus the 14 day integrated volume metric at Island Park. 390 The ANOVA results for Altus using the three base models show a similar picture as for Island Park. ICs almost always contribute the most variance for frequent events (less than a few hundred years) and the precipitation frequency distributions are the most important for larger events ( Figure 10). However, the precipitation frequency distributions are even more important for Altus than at Island Park particularly for the historical meteorological sequence, as they contribute around 50% 395 of the total variance for 50,000-100,000 year events as compared to around 30% at Island Park. Model structure and model structure-model parameter interactions are of secondary importance across essentially all return periods. Again, moving from dry to historical meteorological sequencing does not change the picture significantly at Altus (compare Fig 10a to 10b), which is expected as the flood metric is the single day maximum flow and generally single day maximum flow is directly related to the extreme precipitation flood event input and not subsequent smaller events. 400 Further examination of multiple model combinations at Altus revealed that using the two most disparate model responses, SAC-SMA (Model #3) and the SAC-SMA/HEC-HMS combination (Model #6) models results in substantial increase in importance of model parameters and model structuremodel parameter interactions (Figure 11). In fact, model structuremodel parameter interactions contribute the most variance across all return periods in this case. Additionally, model structure 405 and model parameter effects contribute similar variance to the precipitation frequency distributions. Again, moving from dry to historical meteorological sequencing does not substantially change the message here as expected (compare Figure 11a to 11b). For this case the model responses are starkly different, such that it may be possible to rule out one of the model structures as plausible, however model structure selection work is outside the scope of this study.

Discussion 410
The results of this study demonstrate that workflow and methodological decisions impact hydrologic model behavior and the final variance estimates of a FF study. This suggests that careful consideration of the various components of stochastic flood modelling should be undertaken. To our knowledge, the inclusion of model structure into FF estimate sensitivity analysis is a novel contribution to our understanding of stochastic flood modelling systems. We reaffirm that calibration metrics only constrain model behavior for components of the hydrograph most related to the calibration metric (e.g. Mendoza et al. 2015, 415 Mizukami et al. 2019. For streamflow-based calibration, KGE is a robust metric that provides balanced model behavior across all components of the hydrograph because of its formulation and should be used over RMSE/NSE if possible. Furthermore, calibration metrics focusing on high flow only generally result in degraded model performance for other parts of the hydrograph such as the recession curve, in agreement with Mizukami et al. (2019). The implication for this work is that calibrated hydrologic models using RMSE/NSE may have inferior performance for longer duration volume flood metrics 420 because of substantial biases introduced during calibration that was not designed to constrain flow volumes.
The ANOVA results demonstrate that ICs contribute the most variance for frequent events and the precipitation frequency distribution specification contributes the most variance for extreme events. One area for future study is the specification of the precipitation frequency distribution and uncertainty estimates. Here we relied on previously published precipitation 425 frequency results, as developing new estimates is outside the scope of this study. However, it is possible that the specification of the distribution and the uncertainty estimation methodology could have an impact on subsequent analysis steps.
Furthermore, the precipitation frequency distribution methods differ across the basins, which is an inconsistency with unquantified impacts. Normalization of the precipitation inputs before the ANOVA analysis possibly mitigates these potential issues, but further exploration could be undertaken in future work. 430 Additionally, model structure, model parameters, and model structure-parameter interactions may have important contributions across the return periods depending on the flood metric and basin. In this study all ten model structures are treated as equally plausible. Future stochastic FF studies should consider model structure in their experimental design with thought given to constraining the model structure ensemble to plausible model configurations using available techniques (Jakeman and 435 Hornberger 1993;Gupta et al. 2012). Model parameter and model structuremodel parameter interactions are more important at Altus, where the available calibration data limited the ability for calibration to constrain model performance. Consideration of model parameter variations should be taken into account when scoping projects with little calibration data available.
Differences in model total storage and subsequent runoff generation drive the different flood responses across both basins. 440 Figure 12 shows the average model response for models #1 (HEC-HMS) and #3 (SAC-SMA) for a subset of precipitation event forcings for Island Park. The change in storage and cumulative runoff are normalized by the total precipitation event forcing to highlight storage and runoff efficiency differences between the two models. Note the precipitation event forcing occurs on days 1 and 2. Models with high event-based runoff ratios generate runoff more readily and have smaller subsurface storages, while models with lower event runoff ratios allow for more infiltration and storage. Model #3 stores about 60% more 445 of the precipitation event than model #1 and ends up generating 25% less cumulative runoff than model #1. These differences are more important for basins using integrated flood metric such as Island Park here, as responsive models generate larger volumes while the other models store more of the precipitation event forcing and release it over longer periods of time. This point should be the focus of additional study and provides one physical process comparison to identify the appropriate model structures for a given basin.
While the focus of this study was on stochastic rainfall-runoff modelling for FF studies, there are potentially broader applications to hydrologic modelling for any purpose, including planning, design, or restoration often focused on more frequent floods up to extreme events for risk analysis. For example, stochastic rainfall-runoff modelling is data and labor intensive, thus less intensive methods are frequently used, most commonly AEP-neutral assumptions of precipitation return period being 455 equal to flood return period. Even in those studies, model selection, parameterization, initial conditions, calibration, and forcing still play an important role in model outcome. Additionally, examining a range of return periods rather than just extreme floods was intentional to help inform a broader range of applications beyond those focused on risk for large dams where only extreme events are relevant. Understanding of sensitivity in rainfall-runoff modelling, whether stochastic or not, is important for flood studies. The results of this study can help guide model selection and development and provide a better 460 understanding of variance in a variety of flood studies.

Conclusions
The key generalizable conclusions are: 1) ICs and precipitation frequency distributions contribute the most variance in the stochastic flood modelling chain for frequent and extreme events respectively. 465 2) Hydrological model structure can be equally important, particularly for multi-day volume flood metrics. This highlights the need to critically assess assumptions underpinning models, understand basin flood generation processes and develop methods to select appropriate models. This includes the re-examination of the AEP neutral assumption and shifting to model process parameterizations that are most plausible for the study catchment.
3) Model parameter and model structure-parameter interactions can be important if the model parameter space is not 470 well constrained during calibration. 4) Confirming many other studies (e.g. Gupta et al. 2009, Mizukami et al. 2019, the Kling-Gupta Efficiency (KGE) results in better hydrologic model performance than NSE (or RMSE) for calibration of extreme events and volume integrated flood metrics, and is more flexible for application specific uses through the use of user specified component weights. 475

Data Availability
The watershed characteristics, reconstructed flow data, and precipitation frequency estimates are available by request from the United States Bureau of Reclamation. The FUSE model output data are available at https://data.usbr.gov.