This study employs a stochastic hydrologic modeling framework to evaluate the sensitivity of flood frequency analyses to different components of the hydrologic modeling chain. The major components of the chain, namely model structure, model parameter estimation, initial conditions, and precipitation inputs, were examined across return periods from 2 to 100 000 years at two watersheds representing different hydroclimates across the western USA. A total of 10 hydrologic model structures were configured, calibrated, and run within the Framework for Understanding Structural Errors (FUSE) modular modeling framework for each of the two watersheds. Model parameters and initial conditions were derived from long-term calibrated simulations using a 100-member historical meteorology ensemble. A stochastic event-based hydrologic modeling workflow was then developed using the calibrated models, in which millions of flood event simulations were performed for each basin. The analysis of variance method was used to quantify the relative contributions of model structure, model parameters, initial conditions, and precipitation inputs to flood magnitudes for different return periods. Results demonstrate that the different components of the modeling chain have different sensitivities for different return periods. Precipitation inputs contribute most to the variance of rare floods, while initial conditions are most influential for more frequent events. However, the hydrologic model structure and structure–parameter interactions together play an equally important role in specific cases, depending on the basin characteristics and the flood metric of interest. This study highlights the importance of critically assessing model underpinnings, understanding flood generation processes, and selecting hydrologic models that are consistent with our understanding of flood generation processes.

Understanding flood risk is important to support infrastructure design and operations. Hydrologic hazard curves and flood hydrographs are used to evaluate hydrologic risks for a given facility, e.g., a dam. A hydrologic hazard curve relates the probability of occurrence to the magnitude of a flood. There are numerous approaches to developing these curves, including (1) statistical stream gauge analysis, e.g., calculating the annual exceedance probability (AEP; National Research Council, 1988), (2) design storm rainfall–runoff hydrologic model estimates, where the return period of the flood is equal to the return period of the precipitation (e.g., Packman and Kidd, 1980; Boughton and Droop, 2003; Swain et al., 2006; Wright et al., 2020), (3) more complex fully stochastic rainfall–runoff modeling to explicitly represent the impacts of hydrological processes on floods (Rahman et al., 2002; Schaefer and Barker, 2002; Nathan et al., 2003; Wright et al., 2014), and (4) an analysis of paleo-flood records (England et al., 2010). Typically, multiple methods are employed in these analyses to evaluate the uncertainty of model results (e.g., England et al., 2014). Many of these methods rely on the assumption of AEP neutrality, i.e., that a rainfall event has a similar AEP to the flood event.

The assumption of AEP neutrality is often not verifiable or justified (e.g., Rahman et al., 2002; Kuczera et al., 2006; Small et al., 2006; Pathiraja et al., 2012; Paquet et al., 2013; Ivancic and Shaw, 2015; Sharma et al., 2018; Yu et al., 2019). One way to address this is to perform stochastic rainfall–runoff modeling. In stochastic rainfall–runoff modeling, flood frequency (FF) estimates are typically produced using stochastic event simulations using a single hydrologic model with randomly perturbed model parameters, initial conditions (ICs), and precipitation input forcing scenarios from defined precipitation frequency distributions (Rahman et al., 2002; Paquet et al., 2013; Wright et al., 2020). This modeling chain permits deviations from AEP neutrality and quantifies the impacts of ICs, model parameters, and precipitation input forcing variability in FF estimates.

However, past research on hydrologic model behavior also emphasizes the differences in model performance and responses for various event types given different model parameters and structures across hydroclimates (e.g., Clark et al., 2008; Mendoza et al., 2015; Markstrom et al., 2016; Newman et al., 2017; Mizukami et al., 2019), highlighting the possible need to include multiple model structures in stochastic flood modeling studies. Model structure can vary widely. For example, a model may simply be defined by a loss methodology, where initial and continuing losses are defined at the start of and during the event simulation (e.g., Boughton and Droop, 2003), or it can be more complex, employing various methods to explicitly simulate the dominant hydrological processes (e.g., snowmelt and surface runoff generation). Additionally, most methods used to perturb model parameters and meteorological forcings do not allow us to identify which components of an FF estimate are the most sensitive. Therefore, we systematically explored the sensitivity of FF estimates to provide a better understanding of which components of the modeling chain have the most impact on FF estimates across example hydroclimatic regimes, using basins within the western USA.

To our knowledge, the systematic examination of model structure contributions to variations in flood frequencies is a novel contribution to the flood modeling literature. Previous work has examined uncertainty and sensitivities in statistical methods (e.g., Hosking and Wallis, 1986; Stedinger et al., 1993; Klemes, 2000; Kidson and Richards, 2005; Merz and Thieken, 2005, 2009; Hu et al., 2020) or from probabilistic hydrologic modeling systems (Hashemi et al., 2000; Franchini et al., 2000; Blazkova et al., 2009; Arnaud et al., 2017; Peleg et al., 2017; Zhu et al., 2018). The companion papers of Hashemi et al. (2000) and Franchini et al. (2000) undertake a one-at-a-time local sensitivity analysis and a full sensitivity analysis using a factorial sampling design to examine the impacts of basin climate characteristics and hydrologic model parameters on FF estimates. Hashemi et al. (2000) find that several parameters related to the basin climate (e.g., average rainfall and storm intermittency), along with several hydrologic model parameters such as the percolation rate, have higher sensitivity than other model parameters and climate characteristics when considering FF estimates. They also conclude that the soil moisture at event onset is the linking mechanism that explains why their particular parameters are the most sensitive. For example, soil moisture states closer to saturation result in larger floods for a given event, with wetter soils modulated by a wetter mean climate or lower percolation rates. Franchini et al. (2000) perform a full sensitivity analysis and confirm the local sensitivity results. However, model structure is not systematically varied in Hashemi et al. (2000) and Franchini et al. (2000). In a related study, Zhu et al. (2018) perform an analysis of variance on a spatially distributed stochastic hydrologic model to show that initial conditions have a strong influence on flood frequency estimates.

The overall goal of this study is to improve the quality of hydrologic risk estimates for infrastructure design. The specific objective is to understand which components of the modeling chain have the largest impact on FF estimates. To address this objective, we ask the following question: what aspects of the modeling chain in stochastic FF analysis have the most sensitivity across a range of return intervals spanning 2–100 000 years?

Given that variance in FF estimates arises from (1) model structure, (2) model parameters, (3) initial conditions, and (4) precipitation event forcing, our null hypothesis is that, for rare floods (floods with return periods greater than 50 000 years), the sensitivity related to the precipitation event forcing dominates the total variance of an FF estimate (Fig. 1a). We postulate that there may be other dominant factors contributing to FF sensitivity outside of precipitation event forcing for rare floods. We explore these components of the modeling chain by (1) using a multi-hydrologic model ensemble, (2) sampling model parameters across the model structures, (3) sampling model initial conditions that are internally consistent for each model structure from calibrated continuous long-term simulations, and (4) incorporating statistical uncertainty in the distributions that define the precipitation forcing. Furthermore, we explore the impact of meteorology specification within the event simulation by performing two sets of event simulations using exactly the same precipitation inputs. In the first, we force the model with a single precipitation input followed by zero precipitation; in the second, we force the model with a single precipitation event and random historical weather after the defined precipitation input to drive a stochastic (ensemble) event simulation framework, which better represents real-world conditions. The two different meteorological sequence methodologies were used to mimic different USA agency methodologies (Sect. 3.6). We use the analysis of variance (ANOVA) methodology to examine the relative contributions of variance to FF estimates across the return periods of interest for all factors for both meteorological sequences. While the focus of this study was on stochastic rainfall–runoff modeling, the methods and implications discussed here may be applicable to simpler methods such as AEP-neutral model estimates.

Figure 1. Conceptual relative variance contributions from initial conditions (blue), model parameters (red), model structure (orange), and precipitation event forcing (green) across return periods (larger return periods towards the right).

The Island Park Dam (Idaho) and Altus Dam (Oklahoma) watersheds are used as representative basins of mountainous snowmelt (Island Park) and semiarid high plains (Altus) hydroclimates, respectively. These basins were selected because they are not only representative of the dominant hydroclimates of the western USA but have also been the subject of past flood studies, for which basin delineations, observed streamflow, and precipitation frequency distributions were developed by Reclamation (2016a).

Island Park (Fig. 2a) is located on the Henrys Fork River, approximately 56 km north of Ashton, Idaho, and water stored at Island Park is used locally for irrigation. The Island Park watershed covers roughly 1297 km².


Altus Dam is on the North Fork Red River about 27 km north of the city of
Altus, OK. The purposes of the dam and reservoir are to provide irrigation
storage for lands in southwestern Oklahoma, flood control on the North Fork
of the Red River, an augmented municipal water supply for the city of Altus,
fish and wildlife conservation benefits, and recreation. The watershed
extends from Altus Dam in Oklahoma westward to Amarillo, Texas (Fig. 2b).
The watershed consists of generally rolling terrain, with medium to coarse
textured soils, and spans an elevation from about 1120 m at the western edge
of the basin to 450 m at the eastern outlet. This area contains many
topographic features known as playa lakes (closed basins with a low area in
the center that may see water storage following heavy rainfall), and thus, the
total contributing area is smaller than the total area of the watershed. We
used the Reclamation (2012) estimated contributing area of 5051 km².

Our stochastic hydrologic modeling workflow includes the Framework for Understanding Structural Errors (FUSE) hydrologic modeling framework (Sect. 3.2), the shuffled complex evolution (SCE) optimization algorithm (Sect. 3.2), and precipitation frequency distributions from Reclamation (2012, 2016a) (Sect. 3.5). Additionally, we have used the law of total probability (e.g., Tijms, 2003; Nathan et al., 2003; Sects. 3.1 and 3.7) and the ANOVA method (Sect. 3.7) to compute the FF estimates and partition the variance across the workflow components, respectively. Figure 3 provides a workflow diagram describing our stochastic hydrologic modeling system.

Workflow diagram of the complete stochastic modeling system
described in Sect. 3. The calibration, initial condition generation, and
precipitation-input-generation steps are shown in more detail in the upper
two gray panels, which then feed to the lower right gray panel. Merging of
the flood event precipitation, with specified zero-precipitation days or
historical meteorology, is shown in the lower right panel. Color coding
follows Fig. 1.

For each basin, hydrologic models are configured and calibrated using an ensemble of historical meteorology (Newman et al., 2015). To simplify the experimental design, we chose to represent the basins using watershed or lumped hydrologic models, removing the need to calibrate a distributed hydrologic model and add other dimensions accounting for the spatial variability of rainfall and ICs (e.g., Zhu et al., 2018). Examination of how varying spatial representations of watersheds impact FF estimates could be the subject of further research. Long-term continuous simulations are used to generate spin-up initial conditions for event simulations (shown in the upper left panel of Fig. 3 and discussed further in Sects. 3.3–3.4). Event simulations are then performed across hydrologic models, model parameters, initial conditions, and precipitation frequency distribution estimates for two meteorology sequence possibilities, namely flood event precipitation only and flood event precipitation plus historical meteorology, as discussed in Sect. 3.6. For each precipitation frequency distribution, we split the probability density function into 50 bins, sample 25 events per bin, and perform 2500 model simulations for each possible model parameter–IC–precipitation frequency combination. This study follows the methodology used by Reclamation (e.g., Reclamation, 2012, 2016b) for the stochastic flood modeling, as shown in the upper right panel of Fig. 3, and provides for uniform sampling across the precipitation frequency distribution, including extremely rare precipitation events. These precipitation inputs are then used in the event simulations, with the precipitation frequency distributions representing 1 d events for Altus and 2 d events for Island Park Dam; Island Park Dam precipitation inputs are randomly split across the 2 d.
Note that performing long-integration continuous simulations using a stochastic rainfall simulator is another valid approach (e.g., Calver et al., 1999; Yu et al., 2019) that could be used within our general ANOVA framework, where direct specification of ICs and precipitation frequency curves would be replaced by specification of a stochastic rainfall (and other meteorological variables) generator (Yu et al., 2019). Finally, flood events are defined as 14 d volume floods for Island Park and single-day peak flows at Altus.
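The stratified event sampling described above (binning the precipitation frequency distribution and drawing events within each bin) can be sketched as follows. This is an illustrative sketch only: the log-spaced AEP binning, the Gumbel-type quantile function, and all numeric values are assumptions, not the Reclamation implementation.

```python
import numpy as np

def stratified_event_sample(quantile_fn, n_bins=50, n_per_bin=25, rng=None):
    """Draw flood event precipitation depths stratified across a frequency curve.

    Bins are log-spaced in annual exceedance probability (AEP) here so that
    extremely rare events are sampled; this binning choice is illustrative.
    quantile_fn maps a non-exceedance probability to a precipitation depth.
    """
    rng = np.random.default_rng(rng)
    aep_edges = np.logspace(-6, 0, n_bins + 1)     # AEP from 1e-6 up to 1
    depths, bin_probs = [], []
    for lo, hi in zip(aep_edges[:-1], aep_edges[1:]):
        aep = rng.uniform(lo, hi, size=n_per_bin)  # uniform draws inside the bin
        depths.append(quantile_fn(1.0 - aep))      # convert AEP to non-exceedance
        bin_probs.append(hi - lo)                  # probability mass of the bin
    return np.array(depths), np.array(bin_probs)

# Hypothetical Gumbel-type quantile function standing in for a fitted curve
quantile = lambda p: 20.0 - 8.0 * np.log(-np.log(np.asarray(p)))
depths, probs = stratified_event_sample(quantile, rng=42)
print(depths.shape)   # (50, 25) sampled events per factor combination
```

Each sampled depth carries the probability mass of its bin, which is what the total probability computation in Sect. 3.7 needs.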

We implemented a factorial experimental design, using all combinations of the 10 hydrologic models (Sect. 3.2), 11 parameter sets (10 for Altus Dam; Sect. 4.1), four initial condition sets (Sect. 3.4), and 11 precipitation frequency estimates for Island Park Dam (three precipitation frequency estimates for Altus Dam; Sect. 3.5) for a total of 4840 combinations, with 2500 model simulations per combination, resulting in 12.1 million event simulations for Island Park Dam (hereafter referred to as Island Park) and 3.3 million event simulations for Altus Dam (hereafter referred to as Altus). The different numbers of precipitation frequency estimates arise because this project leveraged previously completed studies for these data. We do not believe that this significantly impacts the results, as the ANOVA analysis takes these sampling differences into account.
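As a quick arithmetic check, the Island Park factorial design quoted above can be enumerated directly:

```python
from itertools import product

# Island Park factor levels: 10 models, 11 parameter sets, 4 IC sets,
# and 11 precipitation frequency estimates (Sect. 3)
combos = list(product(range(10), range(11), range(4), range(11)))
print(len(combos))          # 4840 factor combinations
print(len(combos) * 2500)   # 12100000 event simulations
```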

The FUSE hydrologic modeling system is a freely available, modular modeling framework that enables the development and testing of many conceptual hydrologic models in a single computational framework. It incorporates multiple parameterizations for many hydrologic fluxes (or processes) at the individual flux level, with each equation formulated as a function of the model state and implemented in a separate code module. This allows the numerical solver to be separated from the flux parameterizations so that every FUSE configuration relies on exactly the same numerical scheme. FUSE also incorporates a conceptual temperature index snow model, using elevation bands with user-specified precipitation and temperature lapse rates to represent seasonal snowpack and changes in meteorology with elevation. Control at the individual flux level is key to understanding how changes in process representation affect the modeling system behavior. Clark et al. (2008) and Henn et al. (2015) provide more details on the FUSE modeling framework.
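A temperature index snow scheme of the general kind described above can be sketched as follows. This is not FUSE's actual formulation; the degree-day factor, melt threshold, and lapse rate are illustrative values only.

```python
def temperature_index_melt(t_band_c, swe_mm, ddf_mm_per_degc_day=3.0, t_melt_c=0.0):
    """Potential snowmelt for one elevation band and one day (mm).

    The degree-day factor (ddf) and melt threshold are illustrative, not
    FUSE defaults: melt scales linearly with temperature above the threshold.
    """
    melt = max(0.0, ddf_mm_per_degc_day * (t_band_c - t_melt_c))
    return min(melt, swe_mm)  # cannot melt more snow than is present

def lapse_to_band(t_basin_c, dz_m, lapse_c_per_km=-6.5):
    """Adjust a basin reference temperature to a band dz metres higher."""
    return t_basin_c + lapse_c_per_km * dz_m / 1000.0

t_band = lapse_to_band(4.0, 800.0)                 # band 800 m above reference
print(temperature_index_melt(t_band, swe_mm=50.0))  # 0.0: the band is below freezing
```

Running the same day through several elevation bands is what lets a lumped model retain high-elevation snowpack while the low basin melts out.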

FUSE uses several configuration files in which the user can specify the model decisions for process representation, numerical solver parameters, model calibration options, access to input and output data, etc. The structural modularity in FUSE is underpinned by one file prescribing the equations to be used for each model component. This file can be changed independently from the other model settings, enabling the user to isolate the effects of the model structure decisions on the simulations. FUSE has been coupled with the SCE optimization algorithm (Duan et al., 1993) to calibrate any hydrologic structure the user specifies. SCE is a robust global optimization algorithm that is widely used across the operational and research communities. FUSE uses the network common data format (netCDF) for all input and output data streams (forcing meteorology, any available observations for calibration, calibration results, and simulated states and fluxes), using the same file format regardless of hydrologic model configuration. Overall, the design of the FUSE system enables easy configuration, calibration, and simulation of multiple hydrologic models for long-term continuous simulations or short event simulations.

FUSE is first used to mimic three widely used hydrologic models, namely the Hydrologic Engineering Center–Hydrologic Modelling System (HEC-HMS) model (Bennett, 1998), the variable infiltration capacity (VIC) model (Liang et al., 1994), and the SACramento–Soil Moisture Accounting (SAC-SMA) model (e.g., Anderson, 2002). This provides a relatable base set of models to operational groups within the USA. Note that the FUSE instantiations of the models only mimic the actual models cited. FUSE does not use the same numerical solver, some process simplifications are made (particularly for VIC, where we simplify evapotranspiration), different parameter estimation schemes are used, and FUSE does not contain the same coding errors as the original models (see Clark et al., 2008, for FUSE details). As a result, when mimicking a pre-existing model using a modular framework, some substantial differences between the simulations can exist (Knoben et al., 2019). We then assembled seven other hydrologic model structures by varying particular processes from the three base models for a total of 10 structures that we used to compute FF estimates for both basins (see Table 1 for the full list). Again, we use a watershed or lumped spatial configuration of FUSE in this study.

All 10 hydrologic models for both basins were calibrated using the SCE
optimization algorithm (Fig. 3). We used both the Kling–Gupta efficiency (KGE) and the root mean squared error (RMSE) as objective
functions because the choice of objective function is subjective and
dependent on available data and user needs. Additionally, recent work has
highlighted that careful consideration needs to be given to the choice of
objective function for high-flow events (Mizukami et al., 2019). Root mean
squared error (RMSE) is directly related to the Nash–Sutcliffe efficiency (NSE).
Furthermore, it can be shown that RMSE/NSE comprises three component
contributions to the total value (Murphy, 1988; Clark et al., 2021), namely
correlation, conditional bias, and unconditional bias.
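For reference, the two calibration metrics can be written compactly. The following is a generic implementation of KGE (Gupta et al., 2009) and NSE, not the FUSE/SCE calibration code used in the study:

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency; note NSE = 1 - RMSE**2 / var(obs)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.mean((sim - obs) ** 2) / np.mean((obs - obs.mean()) ** 2)

obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(kge(obs, obs), nse(obs, obs))   # both approximately 1.0 for a perfect simulation
```

The KGE reformulation weights correlation, variability, and bias explicitly, which is why it behaves differently from RMSE/NSE during calibration.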

FUSE hydrologic processes (far left column) and the various selected process representations for the 10 hydrologic models.

A maximum of 40 000 model runs was allowed for the SCE calibration of each model structure and basin. Reconstructed daily inflow data from Reclamation (2016b) were used for Island Park, while annual peak flow data developed by Reclamation (2012) were used for Altus, due to the lack of better data available for calibration at the time of this study. The impact of these different objective functions and calibration data for the basins is discussed in Sect. 4.

The meteorological forcing data consisted of a 100-member ensemble of
gridded precipitation and temperature at 6 km resolution, as described in Newman
et al. (2015). Observations of precipitation and temperature and the process
of projecting point measurements to grids across sometimes complex terrain
are inherently uncertain. This ensemble data set was designed to estimate
those uncertainties and provide many plausible historical traces for
hydrologic model applications. The ensemble precipitation and temperature
grids are merged to the watershed scale using fractional areal weighting for
all meteorological grid cells that intersect a basin polygon. Watershed-scale forcing derived from each individual meteorological ensemble member
was used to calibrate each hydrologic model, resulting in a 100-member
ensemble of calibrated model parameters for each model and basin.

Continuous simulations were performed using the optimized model parameters and full model states were output each day for the full calibration periods for each hydrologic model and basin. These states were sampled to determine the ICs for the event simulations. Sampling initial states from continuous simulations has the advantage of providing ICs that are consistent with the specific hydrologic model and parameter set versus applying random IC perturbations. Applying random perturbations to an IC may result in unrealistic states and subsequent simulation results.
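Drawing ICs from archived daily states can be sketched as follows; the state archive layout, the April–June window, and the number of draws are illustrative assumptions, not the study's actual file handling.

```python
import datetime as dt
import numpy as np

def sample_initial_conditions(state_archive, months=(4, 5, 6), n_draws=4, rng=None):
    """Draw event ICs from daily model states saved during a continuous run.

    state_archive maps a date to a full model state vector, so every sampled
    IC is internally consistent with the model and parameter set that
    produced it (unlike random state perturbations).
    """
    rng = np.random.default_rng(rng)
    candidates = [d for d in sorted(state_archive) if d.month in months]
    chosen = rng.choice(len(candidates), size=n_draws, replace=False)
    return [(candidates[i], state_archive[candidates[i]]) for i in chosen]

# Tiny synthetic archive: one state vector (e.g., [SWE, soil moisture]) per day
archive = {dt.date(1993, m, day): np.array([100.0 - m, 0.3])
           for m in range(1, 13) for day in (1, 15)}
ics = sample_initial_conditions(archive, rng=0)
print(len(ics))   # 4 internally consistent (date, state) pairs
```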

For Island Park, the focus was on ICs from April through June that had
minimal (

Regional frequency analysis (RFA) is a useful method for extending the period of record in environmental data sets by means of a space-for-time substitution, where additional information in space supplements the lack of information in time. The basic assumption of RFA is that extreme events recorded at stations located within a predetermined homogeneous region can be described by the same probability distribution. By scaling the data by the respective at-site mean (ASM), the user assumes that a single probability distribution is valid for every location within the homogeneous region, while the magnitude can vary spatially.

The L-moments method (Hosking and Wallis, 1997) is one example of a regional
frequency method. The basis of the L-moments algorithm is that linear
combinations of moments can be used to estimate model parameters for extreme
value distributions. The moments of interest (also referred to as
L-statistics) include L-CV, L-skewness, and L-kurtosis and are computed for
every site utilized in an analysis. Regional moments are developed using
weighted averages of the site-specific moments, where the weight is
proportional to the record length. The regional L-moments are then used to
define the regional growth curve (RGC), a dimensionless quantile function
that represents the cumulative distribution function of the frequency
distribution valid for all sites located within the homogenous region.
Site-specific precipitation–frequency estimates are then obtained by rescaling the dimensionless RGC by the at-site mean.
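A minimal sketch of the sample L-moment computations described above, using the probability-weighted-moment estimators of Hosking and Wallis (1997); the record-length-weighted average mirrors the regional pooling step. The example data are synthetic.

```python
import numpy as np

def sample_l_moments(x):
    """Sample L-moments via probability-weighted moments (needs n >= 4).

    Returns the mean (l1), L-CV, L-skewness, and L-kurtosis.
    """
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2 / l1, l3 / l2, l4 / l2

def regional_moment(site_values, record_lengths):
    """Record-length-weighted average of at-site L-statistics."""
    w = np.asarray(record_lengths, float)
    return np.sum(w * np.asarray(site_values)) / w.sum()

l1, lcv, t3, t4 = sample_l_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(l1, lcv)   # 3.0 and 1/3 for this symmetric sample
```

The regional L-statistics computed this way define the RGC, and multiplying the RGC by a site's at-site mean recovers the site-specific quantile curve.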

Reclamation (2016a) developed median and uncertainty precipitation–frequency curves for the Island Park watershed using a regional L-moments approach combined with Latin hypercube resampling procedures. More specifically, the authors used annual maximum 2 d precipitation totals from 45 stations in a homogeneous region surrounding the Island Park watershed to estimate parameters of the four-parameter Kappa distribution. The authors used Latin hypercube sampling methods in R, via the qnorm function, to perform Monte Carlo sampling and create 300 parameter sets from variations in five parameters, namely the at-site mean, regional L-CV, regional L-skew, regional L-kurtosis, and the areal reduction factor (ARF), which accounts for converting point precipitation frequency estimates to basin-average precipitation frequency estimates. For Island Park, the authors of that study applied a constant ARF for all exceedance probabilities, even though ARFs have been shown to vary by exceedance probability (e.g., Bell, 1976). More specifically, Reclamation (2016a) multiplied the point-specific precipitation–frequency curve by a constant ARF of 0.85, which they estimated using historical point and basin-average precipitation depths available in HMR 55A (Hansen et al., 1988) and HMR 57 (Hansen et al., 1994). Results from the Island Park analysis include 11 frequency distributions (5th, 14th, 23rd, 32nd, 41st, 50th, 59th, 68th, 77th, 85th, and 95th percentiles). Kappa parameters from Reclamation (2016a) are reproduced in Table 2. During the stochastic simulations performed here, we force 2 d historical precipitation inputs to equal the basin-average magnitudes sampled from the 2 d precipitation frequency curve valid over the Island Park watershed.
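The Latin hypercube step can be sketched generically as follows. The stratified uniform samples are mapped through normal quantiles (cf. R's qnorm, as in Reclamation, 2016a), but the means and standard deviations below are made-up placeholders, not the fitted values from that study.

```python
import numpy as np
from statistics import NormalDist

def latin_hypercube(n, d, rng=None):
    """Basic Latin hypercube sample: one point per probability stratum per dimension."""
    rng = np.random.default_rng(rng)
    u = np.empty((n, d))
    for j in range(d):
        strata = rng.permutation(n)                # one stratum index per sample
        u[:, j] = (strata + rng.uniform(size=n)) / n
    return np.clip(u, 1e-12, 1.0 - 1e-12)          # guard the quantile mapping below

# Illustrative (made-up) means and standard deviations for the five sampled
# quantities: at-site mean, regional L-CV, L-skew, L-kurtosis, and ARF.
means = np.array([60.0, 0.15, 0.10, 0.12, 0.85])
sds = np.array([5.0, 0.02, 0.03, 0.02, 0.05])

u = latin_hypercube(300, 5, rng=1)
z = np.array([[NormalDist().inv_cdf(v) for v in row] for row in u])
param_sets = z * sds + means                       # cf. R's qnorm mapping
print(param_sets.shape)                            # (300, 5)
```

Each of the 300 rows then defines one candidate precipitation frequency curve, from which the percentile curves are extracted.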

Similarly, Reclamation (2012) developed precipitation–frequency estimates, including median and uncertainty bounds, for the Altus watershed, using a regional L-moments approach combined with Latin hypercube sampling procedures. The authors focused on annual maximum 1 d (or 24 h) precipitation totals recorded at 482 stations with at least 5 years of data and used Latin hypercube sampling to produce 150 parameter sets based on variations in the same five parameters listed above, namely the at-site mean, regional L-CV, regional L-skewness, regional L-kurtosis, and ARF. The ARF for Altus was developed using a linear relationship between point and basin-average storm totals, using 12 different storms that impacted the Altus watershed identified in HMR 51 (Schreiner and Riedel, 1978), HMR 52 (Hansen et al., 1982), and HMR 55A (Hansen et al., 1988). The Altus report provides all precipitation–frequency estimates in the form of fourth-order polynomials, with coefficients reproduced in Table 3. Similar to the Island Park simulations, we force 1 d historical precipitation events to equal basin-average magnitudes sampled from the 1 d precipitation frequency curve valid over the Altus watershed. Median precipitation frequency curves for both basins, including the 5th and 95th statistical sampling percentiles, are shown in Fig. 4.
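Evaluating a fourth-order polynomial frequency curve is straightforward; the coefficients and the choice of abscissa below are hypothetical stand-ins (the real Altus coefficients and their defined abscissa are given in Reclamation, 2012, and Table 3).

```python
import numpy as np

# Hypothetical fourth-order coefficients, highest power first; the actual
# Altus values are tabulated in Table 3.
coeffs = [0.05, -0.10, 0.60, 1.20, 2.00]

def precip_from_curve(return_period_yr):
    """Precipitation depth from a fourth-order polynomial frequency curve.

    The abscissa z = log10(return period) is assumed here for illustration.
    """
    z = np.log10(return_period_yr)
    return np.polyval(coeffs, z)      # Horner evaluation of the polynomial

for T in (2, 100, 10_000):
    print(T, float(precip_from_curve(T)))
```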

Precipitation frequency estimates for Altus (1 d; blue) and Island Park (2 d; black), including the median (solid) and the 5th and 95th percentiles (dashed).

Parameters used to define the four-parameter Kappa distribution. Table is reproduced from Table 4.5 in Reclamation (2016a).

Polynomial coefficients (fourth order) that describe the lower, median, and upper precipitation frequency curves for Altus. Table reproduced from Table 5.7 in Reclamation (2012).

Some stochastic modeling studies at Reclamation force the rainfall–runoff model with a specified precipitation input (e.g., 2 d input from a precipitation frequency distribution), followed by no precipitation for the remaining simulation time (Reclamation, 2018). The lack of additional precipitation after the specified precipitation input is not based on any physical reasoning; thus, we examine both zero and historical precipitation sequences after the specified precipitation input (2 d at Island Park and 1 d at Altus; Sects. 2 and 3.5). In the specified meteorological sequence, we set precipitation to zero after the specified precipitation input. In the historical meteorology sequence, we randomly sample ensemble member meteorology using the same event start date from the selected IC. In both meteorological sequences, the specified precipitation forcing is exactly the same. Future work should examine event sequencing in greater detail, particularly to quantify the impacts of possible future circulation changes on FF estimates and sensitivities.
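The two meteorological sequences can be sketched as a simple concatenation; the array shapes, units, and toy ensemble below are illustrative assumptions rather than the study's forcing pipeline.

```python
import numpy as np

def build_event_forcing(event_precip, n_days_after, historical=None, rng=None):
    """Concatenate the specified flood event precipitation with a tail.

    The tail is either zeros (the "specified" sequence) or a randomly drawn
    historical ensemble trace (the "historical" sequence); both sequences
    share the identical event input, as in the experiment design.
    """
    event_precip = np.asarray(event_precip, float)
    if historical is None:                         # specified sequence
        tail = np.zeros(n_days_after)
    else:                                          # historical sequence
        rng = np.random.default_rng(rng)
        member = rng.integers(len(historical))     # random ensemble member
        tail = np.asarray(historical[member][:n_days_after], float)
    return np.concatenate([event_precip, tail])

event = [40.0, 25.0]                               # 2 d event, e.g., Island Park
ens = [np.full(14, 1.0), np.full(14, 3.0)]         # toy 2-member ensemble
print(build_event_forcing(event, 14))              # event then 14 zero days
print(build_event_forcing(event, 14, historical=ens, rng=0))
```

Because only the tail differs, any difference between the resulting flood distributions is attributable to the post-event meteorology.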

As noted in Sect. 3.1, the total probability theorem is used to compute modeled basin runoff at return periods of 2, 5, 10, 20, 50, 100, 500, 1000, 5000, 10 000, 50 000, and 100 000 years from the stochastic simulations for all model, parameter, IC, and precipitation distribution combinations for both meteorology sequences. To do this, the distribution of flood event runoff was divided into 50 bins, following the division of the precipitation frequency distributions (Fig. 3; upper right box). All event simulations exceeding the modeled runoff threshold in a given bin were counted, and the probability of exceedance of that runoff threshold was then computed as the summation, over bins, of the probability of precipitation inputs occurring in that specific bin times the fraction of flood events exceeding the threshold. We then used linear interpolation to estimate the runoff at the specific return periods listed above. Again, 14 d volumes are used at Island Park and maximum daily runoff is used at Altus to define the flood event.
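The total probability computation can be sketched as follows; the two-bin toy data are invented for illustration, and the real workflow uses 50 bins and linear interpolation to the target return periods.

```python
import numpy as np

def exceedance_curve(bin_probs, bin_floods, thresholds):
    """Total-probability estimate of P(flood > q) from binned event simulations.

    bin_probs[j] is the probability mass of precipitation bin j and
    bin_floods[j] the simulated flood metrics for that bin's events; the
    exceedance fraction within each bin is weighted by the bin's mass.
    """
    p_exc = []
    for q in thresholds:
        p = sum(pj * np.mean(np.asarray(f) > q)
                for pj, f in zip(bin_probs, bin_floods))
        p_exc.append(p)
    return np.array(p_exc)

# Toy example: two bins, where the rarer bin produces the larger floods
probs = [0.99, 0.01]
floods = [[5.0, 8.0, 10.0, 12.0], [30.0, 45.0, 60.0, 80.0]]
p = exceedance_curve(probs, floods, thresholds=[9.0, 25.0])
print(p)            # P(flood > 25) = 0.01: only rare-bin events exceed it
print(1.0 / p[1])   # corresponding return period in years
```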

An ANOVA analysis is then performed on the runoff values for all the return
periods for both meteorology sequences and basins. The ANOVA framework is a
computationally frugal way to estimate individual component contributions to
the total variance of a variable such as runoff by relying on a sum of
squares decomposition. ANOVA analyses have been widely used in
hydrometeorology to separate the components of future climate changes
(Hawkins and Sutton, 2009; Lehner et al., 2020) and to determine which
elements of the model chain contribute most to the spread of the projected
changes in streamflow (Bosshard et al., 2013; Addor et al., 2014; Breuer et
al., 2017; Chegwidden et al., 2019). By estimating the fractional (relative)
variance contributions of each factor and all (two) factor interactions, we
identified the pieces of the modeling workflow which contribute most to FF
sensitivity for each return period. We used the anovan MATLAB function, as
implemented in MATLAB version 9.8.0.1380330 (R2020a) update 2. This function
allows for N-way ANOVA computations with mixed continuous and categorical
predictors and specification of the interaction terms to be estimated.
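The sum-of-squares decomposition underlying the ANOVA can be sketched for a balanced two-factor case; this is a generic illustration in Python, not the anovan call used in the study, and the synthetic data are invented so that one factor dominates.

```python
import numpy as np

def two_way_anova_fractions(y):
    """Fractional variance contributions from a balanced two-factor experiment.

    y[i, j, k] is replicate k for level i of factor A (e.g., model structure)
    and level j of factor B (e.g., precipitation curve). Returns the share of
    the total sum of squares for A, B, the A-B interaction, and the residual.
    """
    y = np.asarray(y, float)
    I, J, K = y.shape
    grand = y.mean()
    a = y.mean(axis=(1, 2)) - grand                        # factor A main effects
    b = y.mean(axis=(0, 2)) - grand                        # factor B main effects
    ab = y.mean(axis=2) - grand - a[:, None] - b[None, :]  # interaction effects
    ss_a = J * K * np.sum(a ** 2)
    ss_b = I * K * np.sum(b ** 2)
    ss_ab = K * np.sum(ab ** 2)
    ss_tot = np.sum((y - grand) ** 2)
    ss_res = ss_tot - ss_a - ss_b - ss_ab                  # within-cell variability
    return np.array([ss_a, ss_b, ss_ab, ss_res]) / ss_tot

# Synthetic runoff where factor B carries nearly all of the signal
rng = np.random.default_rng(0)
b_true = np.array([0.0, 1.0, 2.0, 3.0])
y = b_true[None, :, None] + 0.1 * rng.standard_normal((3, 4, 5))
fracs = two_way_anova_fractions(y)
print(fracs)   # factor B's share dominates; all four shares sum to 1
```

The study's ANOVA extends this to four factors and their two-factor interactions, with the fractions normalized per return period.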

When examining daily flow time series, the KGE and RMSE based on daily flow produce more realistic simulations than the KGE using annual peak flow, as seen in Fig. 5. This is a somewhat expected result, as the annual peak flow metric contains no timing information (correlation) at the daily scale. The daily KGE-based calibration outperforms the daily RMSE-based calibration, where the daily RMSE-based calibration underestimates the flow variance (not shown), in agreement with past studies (Gupta et al., 2009). The annual peak-flow-based KGE calibration represents the peak flows well (with some overestimation) but has large differences in event recession curves, with overestimation of flow in the days and weeks immediately following high-flow events. This erroneous recession curve representation would result in very different volume-based floods versus daily metric-based calibrations.

Island Park calibration period runoff for water year (WY) 1993, with RMSE using daily flow, KGE using daily flow, and KGE using annual maximum flow.

Given the above calibration characteristics and the available calibration data at Island Park (daily flow) and Altus (annual peak flow), daily KGE was selected as the calibration metric for Island Park and annual peak-flow-based KGE as the calibration metric for Altus (Sect. 3.3 defines these calibration metrics). Daily KGE provides the best all-around simulation when considering daily peak flows and volume integrations over days to weeks at Island Park. For Altus, calibrating to yearly peak-flow-based KGE provided a better overall peak flow calibration than RMSE calculated using annual peak flows, likely due to the reformulated weighting of bias and variance as compared to RMSE. Again, these results agree with Mizukami et al. (2019), who examined some of the same calibration metrics using multiple hydrologic models and hundreds of basins across the contiguous USA. They found that KGE outperforms RMSE-based (or NSE-based) calibrations, and that peak flow metrics do outperform KGE for peak flow simulation but result in a much degraded daily model performance with sometimes severe modeled flow biases.

Figure 6 shows the final cumulative distribution function (CDF) of the calibrated KGE for all 10 models for Island Park (Fig. 6a) and Altus (Fig. 6b). It is not possible to make direct performance comparisons between the models at the two basins, given that the KGE values are based on daily runoff (Island Park) and annual peak runoff (Altus). However, in a broad sense, model behavior at Island Park is much more constrained than at Altus, based on the relative ranges of calibration scores for each basin.

The 100 parameter sets available for each model and basin were subsampled for the final FF event simulations. Because Island Park had more available data for calibration (Sect. 3.3), the final calibrated model performance was very similar across the 100 members for all 10 hydrologic models. Therefore, 11 parameter sets spanning the full range of model performance were sampled for each hydrologic model, using the 1st, 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th, and 99th percentiles of the cumulative distribution function (CDF) of the calibration objective function. For Altus, the calibrated model behavior was less constrained due to the much smaller amount of calibration data available. Therefore, the 10 best-calibrated parameter sets for each hydrologic model were used, which constrained model-parameter-induced differences in model behavior, though still not to the same level as at Island Park.
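A percentile-based selection like the Island Park subsampling can be sketched as follows; `subsample_by_percentile` is a hypothetical helper (the paper does not publish its selection code) that, given the 100 calibration scores, returns the member indices nearest the requested CDF percentiles.

```python
import numpy as np

def subsample_by_percentile(scores,
                            pcts=(1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99)):
    """Return indices of the ensemble members whose calibration scores fall
    closest to the requested percentiles of the score CDF."""
    scores = np.asarray(scores, float)
    targets = np.percentile(scores, pcts)
    # Nearest-score selection: one member per requested percentile.
    return [int(np.argmin(np.abs(scores - t))) for t in targets]
```

Applied to 100 calibrated KGE values, this yields 11 members spanning the worst-to-best performance range, so the downstream event simulations sample the full spread of plausible parameter behavior rather than only the best fits.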

The ANOVA was performed using the full complement of FF estimates for both basins (Sect. 3.7) and both meteorological sequences (Sect. 3.6). All fractional variance contributions are normalized by the total variance in the FF estimate for each return period, such that a component with a fractional variance of 0.5 contributes half of the total variance for that return period. The plots represent the 2, 5, 10, 50, 100, 1000, 10 000, 50 000, and 100 000 year return periods. For Figs. 8–13, the specified input meteorological sequence is always in panel (a), the historical input meteorological sequence is always in panel (b), and the color coding follows Fig. 1. Interaction terms are a blend of the two primary components (e.g., model structure–model parameter interactions are red–orange).
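For readers unfamiliar with the decomposition, the normalization can be illustrated with a two-factor sketch (the full study uses four factors plus interactions; this is not the authors' code). For a balanced factorial array of flood quantiles, the main-effect and interaction sums of squares partition the total variance exactly, and dividing by the total gives fractional contributions of the kind plotted in Figs. 8–13.

```python
import numpy as np

def fractional_variance(q):
    """Two-factor fixed-effects ANOVA decomposition of a factorial array of
    flood quantiles q[i, j] (factor A levels i, factor B levels j).
    Returns the fractions of total variance for A, B, and the A-B
    interaction; the three fractions sum to 1 for a balanced design."""
    q = np.asarray(q, float)
    grand = q.mean()
    a_eff = q.mean(axis=1) - grand   # main effect of factor A
    b_eff = q.mean(axis=0) - grand   # main effect of factor B
    # Interaction residual after removing the grand mean and main effects.
    inter = q - grand - a_eff[:, None] - b_eff[None, :]
    total = ((q - grand) ** 2).sum()
    n_a, n_b = q.shape
    frac_a = n_b * (a_eff ** 2).sum() / total
    frac_b = n_a * (b_eff ** 2).sum() / total
    frac_ab = (inter ** 2).sum() / total
    return frac_a, frac_b, frac_ab
```

A purely additive response (no interaction) gives `frac_ab` of zero, which is why a large interaction term in the figures signals that model structure and parameters cannot be interpreted independently.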

First, normalized FF plots, including all possible effect combinations for both basins, are shown in Fig. 7. Annual exceedances at Island Park, on average, follow a nearly linear trend on the semi-log plot.

Normalized (by maximum possible flood runoff) FF curves, with the median in red, the interquartile range (25th–75th percentiles) in dark gray, the 10th–90th percentile spread in medium gray, and the minimum to maximum spread in light gray.

ANOVA results using all available model structures, sampled parameter sets, sampled ICs, and sampled precipitation frequency distributions for Island Park are shown in Fig. 8. When all model structures are included, ICs dominate the frequent floods with return periods less than about 5000 years, while the precipitation frequency distribution is the most important for rarer floods. This agrees with other recent studies examining the IC contribution to FF estimates using ANOVA or similar frameworks (Peleg et al., 2017; Zhu et al., 2018). Model structure consistently contributes roughly 20 % of the variance and is generally the second most important effect across all return periods, except in the 1000–10 000 year range, where ICs, precipitation frequency curves, and model structure vary in leading, secondary, or tertiary importance, depending on the meteorological sequencing. Model parameters and interaction terms contribute roughly 10 % of the variance for less frequent floods for both meteorological sequences. For rare floods with return periods larger than 50 000 years, precipitation input is about twice as important as model structure and 3 times more important than ICs for the specified meteorological sequence, while, for historical meteorological sequencing (Sect. 3.6), the precipitation input is only about 1.5 times as important as model structure.

Island Park fractional variance contributions using all 10 model structures for the (a) specified and (b) historical meteorological sequences.

Figure 9 presents the fractional variance contributions for Island Park using the following three base models: HEC-HMS, VIC, and SAC-SMA (Sect. 3.2; Table 1). As in the all-model case, ICs and the precipitation frequency distribution specification are the most important for frequent and extreme floods, respectively. Model structure is the second most important contributor to frequent floods, but for return periods larger than 1000 years, model structure–parameter interactions become as or more important than model structure alone. Again, moving from the specified to the historical meteorological sequence decreases the variance contribution of precipitation frequency distributions and increases the importance of model structure, model structure–parameter interactions, and ICs across all return periods (compare Fig. 9a to b). This is somewhat counterintuitive but may be related to the fact that soil states can strongly influence recession curve characteristics; the additional precipitation in the historical meteorological sequence (beyond the zero input that follows the specified event in the specified sequence) is either stored or released within the 14 d flood volume integration period, depending on model structure, parameters, and ICs.

Island Park fractional variance contributions using the three base models, namely HEC-HMS (model 1), VIC (model 2), and SAC-SMA (model 3), for the (a) specified and (b) historical meteorological sequences.

Using a different combination of the 10 possible model structures leads to a slightly different conclusion. The set of simulations presented in Fig. 10 uses the three hydrologic models that generate the largest flood responses to large precipitation event forcing. Overall, the precipitation frequency distribution specification is still the most important for extreme floods, and ICs are most important for very frequent floods, but model structure contributes a larger fraction of the total variance across all return periods and is often of similar magnitude to either ICs or precipitation frequency distribution changes (Fig. 10). Here, moving from the specified to the historical meteorological sequence increases the importance of model structure (compare Fig. 10a to b). This is because these three model structures (Table 1) vary more among themselves, given additional precipitation input, than runoff varies due to ICs. Differences in surface runoff versus subsurface storage and slower baseflow appear to drive the model structure variability and are discussed further in Sect. 6.

Island Park fractional variance contributions for the three most responsive model structures (i.e., structures associated with the largest runoff/precipitation ratio), namely HEC-HMS (model 1), the HEC variant (model 4), and a SAC-SMA/HEC-HMS combination (model 6), for the (a) specified and (b) historical meteorological sequences.

The ANOVA results for Altus, using all available model structures, sampled parameter sets, sampled ICs, and sampled event forcings, are shown in Fig. 11. Similar to Island Park, ICs are most important for frequent floods, while precipitation event forcing is most important for rarer floods. There are two notable differences. First, precipitation frequency distributions are generally more important across return periods at Altus than at Island Park. Second, while model structure is slightly less important, model parameters and model structure–parameter interactions are of similar importance to model structure, such that the combination of model structure and parameter effects and their interactions is as important as precipitation frequency distributions across both meteorological sequences (Sect. 3.6). Finally, meteorological sequencing is inconsequential at Altus, which makes intuitive sense given the single-day peak flow metric at Altus versus the 14 d integrated volume metric at Island Park (Sects. 3.1 and 3.7).

Altus fractional variance contributions using all 10 model structures for the (a) specified and (b) historical meteorological sequences.

The ANOVA results for Altus using the three base models show a picture similar to that for Island Park. ICs almost always contribute the most variance for frequent floods (return periods less than a few hundred years), and the precipitation frequency distributions are the most important for larger floods (Fig. 12). However, the precipitation frequency distributions are even more important at Altus than at Island Park, particularly for the historical meteorological sequence, contributing around 50 % of the total variance for 50 000–100 000 year floods, as compared to around 30 % at Island Park. Model structure and model structure–model parameter interactions are of secondary importance across essentially all return periods. Again, moving from specified to historical meteorological sequencing does not change the picture significantly at Altus (compare Fig. 12a to b). This is expected, as the flood metric is the single-day maximum flow, which is generally driven directly by the extreme precipitation event input rather than by subsequent smaller precipitation inputs.

Altus fractional variance contributions using the three base models, namely HEC-HMS, VIC, and SAC-SMA, for the (a) specified and (b) historical meteorological sequences.

Further examination of multiple model combinations at Altus revealed that using the two most disparate model responses, i.e., the SAC-SMA (model 3) and SAC-SMA/HEC-HMS combination (model 6) models, results in a substantial increase in the importance of model parameters and model structure–model parameter interactions (Fig. 13). In fact, model structure–model parameter interactions contribute the most variance across all return periods in this case. Additionally, model structure and model parameter effects contribute variance similar to the precipitation frequency distributions. Again, moving from specified to historical meteorological sequencing does not substantially change the message, as expected (compare Fig. 13a to b). For this case, the model responses are starkly different, such that it may be possible to rule out one of the model structures; however, model structure selection is outside the scope of this study.

Altus fractional variance contributions for the two most disparate flood responses, namely SAC-SMA (model 3) and a SAC-SMA/HEC-HMS combination (model 6), for the (a) specified and (b) historical meteorological sequences.

The results of this study demonstrate that workflow and methodological decisions impact hydrologic model behavior and the final variance estimates of an FF study, suggesting that the various components of stochastic flood modeling warrant careful consideration. To our knowledge, the inclusion of model structure in FF estimate sensitivity analysis is a novel contribution to our understanding of stochastic flood modeling systems. We reaffirm that calibration metrics only constrain model behavior for the components of the hydrograph most related to the calibration metric (e.g., Mendoza et al., 2015; Mizukami et al., 2019). For streamflow-based calibration, the formulation of KGE provides balanced model behavior across all components of the hydrograph, and KGE should be used over RMSE/NSE if possible. Furthermore, calibration metrics focusing only on high flows generally result in degraded model performance for other parts of the hydrograph, such as the recession curve, in agreement with Mizukami et al. (2019). The implication for this work is that hydrologic models calibrated using RMSE/NSE may perform poorly for longer-duration volume flood metrics because of substantial biases introduced by a calibration that was not designed to constrain flow volumes.

The ANOVA results demonstrate that ICs contribute the most variance for frequent floods, and the precipitation frequency distribution specification contributes the most variance for extreme floods. One area for future study is the specification of the precipitation frequency distribution and its uncertainty estimates. Here we relied on previously published precipitation frequency results, as developing new estimates is outside the scope of this study. However, it is possible that the specification of the distribution and the uncertainty estimation methodology could affect subsequent analysis steps. Furthermore, the precipitation frequency distribution methods differ between the basins, which is an inconsistency with unquantified impacts. Normalization of the precipitation inputs before the ANOVA possibly mitigates these potential issues, but further exploration could be undertaken in future work. Drier ICs could also be included and may further increase the importance of ICs, particularly for basins that may experience large precipitation events on drier soils, such as arid environments with specific soil conditions, or for common floods with return periods of less than a few hundred years (Yu et al., 2019). The addition of four drier ICs to the factorial experimental design would, in this case, result in roughly another 15 million simulations, or 210 million simulation days, which may be non-trivial, depending on the computational resources available.
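The cost estimate above follows from simple factorial arithmetic; `added_cost` is an illustrative helper (not from the paper), the 14 d event length is taken from the flood volume integration window described in the text, and the ~3.75 million event simulations per IC level is implied by the quoted totals rather than stated directly.

```python
def added_cost(extra_levels, sims_per_level, event_days):
    """Extra event simulations and simulation-days incurred by adding levels
    to one factor of a factorial experimental design."""
    sims = extra_levels * sims_per_level
    return sims, sims * event_days

# Four extra drier ICs at ~3.75 million event simulations per IC level
# (implied by the text's "roughly 15 million simulations"), with each event
# simulated over a 14 d window:
sims, sim_days = added_cost(4, 3_750_000, 14)
# sims == 15_000_000 and sim_days == 210_000_000, matching the text's
# "another roughly 15 million simulations, or 210 million simulation days".
```

Because every added factor level multiplies the whole design, such back-of-the-envelope checks are worth doing before committing to an enlarged factorial experiment.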

Additionally, model structure, model parameters, and model structure–parameter interactions can contribute substantially across return periods, depending on the flood metric and basin. In this study, all 10 model structures are treated as equally plausible. Future stochastic FF studies should consider model structure in their experimental design, with thought given to constraining the model structure ensemble to plausible model configurations using available techniques (Jakeman and Hornberger, 1993; Gupta et al., 2012). Model parameter effects and model structure–model parameter interactions are more important at Altus, where the limited calibration data restricted the ability of the calibration to constrain model performance. Model parameter variations should therefore be taken into account when scoping projects with little available calibration data.

Change in storage (black lines) and cumulative runoff (red lines) normalized by flood event precipitation input for model 1 (solid) and model 3 (dashed) at Island Park for one precipitation frequency distribution bin, using the median (50th percentile) precipitation frequency distribution.

Differences in model total storage and subsequent runoff generation drive the different flood responses across both basins. Figure 14 shows the average model response for models 1 (HEC-HMS) and 3 (SAC-SMA) for a subset of precipitation event forcings at Island Park. The change in storage and the cumulative runoff are normalized by the total precipitation input to highlight storage and runoff efficiency differences between the two models. Note that the precipitation input occurs on days 1 and 2. Models with high event-based runoff ratios generate runoff more readily and have smaller subsurface storages, while models with lower flood event runoff ratios allow for more infiltration and storage. Model 3 stores about 60 % more of the precipitation event than model 1 and generates 25 % less cumulative runoff. These differences matter more for basins using an integrated flood metric, such as Island Park here, as responsive models generate larger flood volumes, while less responsive models store more of the precipitation event forcing and release it over longer periods of time. This point should be the focus of additional study, as it provides one physical process comparison for identifying appropriate model structures for a given basin.

While the focus of this study was on stochastic rainfall–runoff modeling for FF studies, there are potentially broader applications to hydrologic modeling for any purpose, including planning, design, or restoration, which often focus on more frequent floods, up to the extreme floods used in risk analysis. For example, stochastic rainfall–runoff modeling is data and labor intensive, so less intensive methods are frequently used, most commonly the AEP-neutral assumption that the precipitation return period equals the flood return period. Even in those studies, model selection, parameterization, initial conditions, calibration, and forcing still play an important role in the model outcome. Additionally, examining a range of return periods, rather than just extreme floods, was intentional, to help inform a broader range of applications beyond those focused on risk for large dams, where only extreme floods are relevant. Understanding the sensitivities of rainfall–runoff modeling, whether stochastic or not, is important for flood studies. The results of this study can help guide model selection and development and provide a better understanding of variance in a variety of flood studies.
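The Fig. 14 diagnostic can be sketched as a simple event water balance; `storage_runoff_fractions` is a hypothetical helper (not the authors' code) that neglects evaporative losses over the short event window and infers the storage change as the precipitation-minus-runoff residual. The two daily series below are invented to contrast a responsive, runoff-efficient model with a storage-dominated one.

```python
import numpy as np

def storage_runoff_fractions(precip, runoff):
    """Cumulative runoff and implied storage change, each normalized by the
    total event precipitation (evaporation neglected over the event)."""
    precip = np.asarray(precip, float)
    runoff = np.asarray(runoff, float)
    p_total = precip.sum()
    cum_runoff = np.cumsum(runoff) / p_total        # event runoff efficiency
    storage = np.cumsum(precip - runoff) / p_total  # water-balance residual
    return cum_runoff, storage

# Hypothetical daily series (mm): precipitation on days 1 and 2, as in Fig. 14.
precip = np.array([50.0, 50.0, 0.0, 0.0, 0.0])
responsive = np.array([20.0, 35.0, 10.0, 5.0, 2.0])  # runs off readily
storing = np.array([5.0, 15.0, 10.0, 8.0, 6.0])      # infiltrates and stores
cum_r, sto_r = storage_runoff_fractions(precip, responsive)
cum_s, sto_s = storage_runoff_fractions(precip, storing)
# cum_r[-1] = 0.72 versus cum_s[-1] = 0.44: the responsive model converts far
# more of the event into runoff within the integration window, while the
# storing model retains the residual for slower release.
```

Tracking both curves against the precipitation input is exactly the kind of physical process comparison suggested above for screening model structures in a given basin.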

The key generalizable conclusions are as follows:

ICs and precipitation frequency distributions contribute the most variance in the stochastic flood modeling chain for frequent and extreme floods, respectively.

Hydrological model structure can be equally important, particularly for multi-day volume flood metrics. This highlights the need to critically assess assumptions underpinning models, understand basin flood generation processes, and develop methods to select appropriate models. This includes the re-examination of the AEP-neutral assumption and shifting to model process parameterizations that are most plausible for the study catchment.

Model parameter and model structure–parameter interactions can be important if the model parameter space is not well constrained during calibration.

Confirming many other studies (e.g., Gupta et al., 2009; Mizukami et al., 2019), the Kling–Gupta efficiency (KGE) results in better hydrologic model performance than NSE (or RMSE) when calibrating to extreme events and volume-integrated flood metrics, and it is more flexible for application-specific uses through user-specified component weights.

The FUSE model is available at

The watershed characteristics, reconstructed flow data, and precipitation frequency estimates are available on request from the United States Bureau of Reclamation. The FUSE model output data are available at

All authors contributed to the experimental design. AJN and MS performed the computations, while AJN, AGS, and KDH prepared the figures and tables. AJN led preparation of the paper with contributions from all authors.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation (grant no. 1852977). We
gratefully acknowledge high-performance computing support from Cheyenne
(

This research has been supported by the Bureau of Reclamation (grant no. R16AC00039) and the National Science Foundation (grant no. 1852977).

This paper was edited by Elena Toth and reviewed by Daniel Wright and two anonymous referees.