Exploring the impact of forcing error characteristics on physically based snow simulations within a global sensitivity analysis framework

Physically based models provide insights into key hydrologic processes but are associated with uncertainties due to deficiencies in forcing data, model parameters, and model structure. Forcing uncertainty is enhanced in snowaffected catchments, where weather stations are scarce and prone to measurement errors, and meteorological variables exhibit high variability. Hence, there is limited understanding of how forcing error characteristics affect simulations of cold region hydrology and which error characteristics are most important. Here we employ global sensitivity analysis to explore how (1) different error types (i.e., bias, random errors), (2) different error probability distributions, and (3) different error magnitudes influence physically based simulations of four snow variables (snow water equivalent, ablation rates, snow disappearance, and sublimation). We use the Sobol’ global sensitivity analysis, which is typically used for model parameters but adapted here for testing model sensitivity to coexisting errors in all forcings. We quantify the Utah Energy Balance model’s sensitivity to forcing errors with 1 840 000 Monte Carlo simulations across four sites and five different scenarios. Model outputs were (1) consistently more sensitive to forcing biases than random errors, (2) generally less sensitive to forcing error distributions, and (3) critically sensitive to different forcings depending on the relative magnitude of errors. For typical error magnitudes found in areas with drifting snow, precipitation bias was the most important factor for snow water equivalent, ablation rates, and snow disappearance timing, but other forcings had a more dominant impact when precipitation uncertainty was due solely to gauge undercatch. Additionally, the relative importance of forcing errors depended on the model output of interest. Sensitivity analysis can reveal which forcing error characteristics matter most for hydrologic modeling.


Introduction
Physically based models allow researchers to test hypotheses about the role of specific processes in hydrologic systems and how changes in environment (e.g., climate, land cover) may impact key hydrologic fluxes and states (Barnett et al., 2008;Deems et al., 2013;Leavesley, 1994;Clark et al., 2011b).Due to the complexity of processes represented, these models usually require numerous meteorological forcing inputs and model parameters.Most inputs are not measured at the locations of interest and require estimation; hence, large uncertainties may propagate from hydrologic model inputs to outputs.Despite ongoing efforts to quantify forcing uncertainties (e.g., Bohn et al., 2013;Flerchinger et al., 2009;Clark and Slater, 2006) and to develop methodologies for incorporating uncertainty into modeling efforts (e.g., Slater and Clark, 2006;He et al., 2011a;Kavetski et al., 2006a;Kuczera et al., 2010), many analyses continue to ignore uncertainty.These often assume either that all forcings, parameters, and structure are correct (Pappenberger and Beven, 2006) or that only parametric uncertainty is important (Vrugt et al., 2008).Neglecting uncertainty in hydrologic modeling reduces confidence in hypothesis tests (Clark et al., 2011b), thereby limiting the usefulness of physically based models.
Previous work on forcing uncertainty in snow-affected regions has yielded basic insights into how forcing errors propagate to model outputs and which forcings introduce the most uncertainty in specific outputs.However, these studies have typically been limited to (1) empirical/conceptual models (He et al., 2011a, b;Raleigh and Lundquist, 2012;Shamir and Georgakakos, 2006;Slater and Clark, 2006), (2) errors for a subset of forcings (e.g., precipitation or temperature only) (Burles and Boon, 2011;Dadic et al., 2013;Durand and Margulis, 2008;Lapo et al., 2015;Xia et al., 2005), (3) model sensitivity to choice of forcing parameterization (e.g., longwave) without considering uncertainty in parameterization inputs (e.g., temperature and humidity) (Guan et al., 2013), and (4) simple representations of forcing errors (e.g., Kavetski et al., 2006a, b).The last is evident in studies that only consider single types of forcing errors (e.g., bias) and single distributions (e.g., uniform) and examines errors separately (Burles and Boon, 2011;Koivusalo and Heikinheimo, 1999;Raleigh and Lundquist, 2012;Xia et al., 2005).Lapo et al. (2015) show that biases have a greater impact than random errors on modeled snow water equivalent and surface temperature but their analysis only considers longwave and shortwave forcings and considers errors separately.Examining uncertainty in one factor at a time remains popular but fails to explore the uncertainty space adequately, ignoring potential interactions between forcing errors (Saltelli and Annoni, 2010;Saltelli, 1999).In contrast, global sensitivity analysis explores the uncertainty space more compre-hensively by considering uncertainty in multiple factors at the same time.
The purpose of this paper is to use global sensitivity analysis to assess how specific forcing error characteristics influence outputs of a physically based snow model.To our knowledge, no previously published study has investigated this topic in snow-affected regions.It is unclear how (1) different error types (bias vs. random errors), (2) different error distributions, and (3) different error magnitudes across all forcings affect model output.The impact of forcing errors on models can be tested by corrupting forcings with specified characteristics (e.g., artificial biases and random errors) and quantifying the impact on model outputs (e.g., Oudin et al., 2006;Spank et al., 2013), but we are unaware of any detailed studies that have done this type of experiment for all meteorological forcings commonly required for physically based snow models.We hypothesize that (1) model outputs are more sensitive to biases than random errors in forcing variables, (2) the assumed probability distribution for biases will alter the relative ranking of importance in forcing errors, and (3) the magnitude of forcing biases will have a strong influence on which forcing errors are most important.
In our view, it is important to clarify the relative impact of specific error characteristics on modeling applications, so as to prioritize future research directions, improve understanding of model sensitivity, and to address questions related to network design.For example, given budget constraints, is it better to invest in a heating apparatus for a radiometer (to minimize bias due to frost formation on the radiometer dome) or in a higher quality radiometer (to minimize random errors associated with measurement precision)?Additionally, it is important to contextualize different meteorological data errors, as these errors are usually studied independently of each other (e.g., Flerchinger et al., 2009;Huwald et al., 2009), and it is unclear how they compare in terms of model sensitivity.
The overarching research question is how do assumptions regarding forcing error characteristics impact our understanding of uncertainty in physically based model output?Using the Sobol ' (1990) global sensitivity analysis framework, we investigate how artificial errors introduced into high-quality, observed forcings (temperature, precipitation, wind speed, humidity, shortwave radiation, and longwave radiation) at four sites in contrasting snow climates propagate to four snow model outputs (snow water equivalent, ablation rates, snow disappearance timing, and sublimation) that are important to cold region hydrology.We select a single model structure and set of parameters to clarify the impact of forcing uncertainty on model outputs.Specifically, we use the physically based Utah Energy Balance (UEB) snow model (Mahat and Tarboton, 2012;Tarboton and Luce, 1996) because it is computationally efficient.The presented framework could be extended to other models.
, Qso is measured reflected shortwave radiation (W m −2 ), and T surf is measured snow surface temperature (K).b Note that P data were adjusted with a multiplier (see Sect. 2) prior to conducting the sensitivity analysis.

Study sites and data
We selected four seasonally snow covered study sites (Table 1) in distinct snow climates (Sturm et al., 1995;Trujillo and Molotch, 2014).The sites included (1) the Imnavait Creek (IC, 930 m) site (Euskirchen et al., 2012;Kane et al., 1991;Sturm and Wagner, 2010), located in the tundra north of the Brooks Range in Alaska, USA, (2) the maritime Col de Porte (CDP, 1330 m) site (Morin et al., 2012) in the Chartreuse Range in the Rhône-Alpes of France, (3) the intermountain Reynolds Mountain East (RME, 2060 m) sheltered site (Reba et al., 2011) in the Owyhee Range in Idaho, USA, and (4) the continental Swamp Angel Study Plot (SASP, 3370 m) site (Landry et al., 2014) in the San Juan Mountains of Colorado, USA.We selected these sites because of the quality and completeness of the forcing data and because they spanned contrasting climates (Table 1), allowing us to check for potential climate dependencies in sensitivity to forcing errors.Generalization of the results with climate was not possible due to the low sample size of sites.
The sites had high-quality observations of model forcings at hourly time steps.Serially complete published data sets are available at CDP, RME, and SASP (see citations above).At IC, data were available from multiple co-located stations (Griffin et al., 2010;Bret-Harte et al., 2010a, b, 2011b, c, a;Sturm and Wagner, 2010).These data were quality controlled, and gaps in the data were filled as described in Raleigh (2013).
We considered only 1 year for analysis at each site (Table 1) due to the high computational costs of the experiment.Measured evaluation data (e.g., snow water equivalent, SWE) at daily resolution were used only for qualitative assessment of model output.SWE was observed at snow pil-lows at IC and RME.At CDP, a cosmic ray detector collected SWE data.At SASP, acoustic snow depth data were converted to daily SWE using density inferred from nearby Snow Telemetry (SNOTEL) (Serreze et al., 1999) sites and local snow pit measurements (Raleigh, 2013).
Before conducting the sensitivity analysis, we adjusted the available precipitation data at each site with a multiplicative factor to correct for potential undercatch errors (e.g., Goodison et al., 1998;Rasmussen et al., 2012;Yang et al., 2000) and to ensure the base model simulation with all observed forcings reasonably represented observed SWE.Several studies have demonstrated the necessity of precipitation adjustments for realistic SWE simulations, even at wellinstrumented sites (e.g., Hiemstra et al., 2006;Reba et al., 2011;Schmucki et al., 2014).Precipitation adjustments were most necessary at IC, where windy conditions preclude effective measurements (Yang et al., 2000).In contrast, only modest adjustments were necessary at the other three sites because they were located in sheltered clearings and because some corrections were already applied to the published data.We considered adjustment multipliers ranging from 0.5 to 2.5 (increments of 0.05) and selected the multiplier that yielded the lowest root mean squared error between observed and modeled SWE.Precipitation multipliers were 1.6 at IC and 1.15 at SASP, and 0.9 at CDP and RME.The undercatch errors at IC were consistent with the 61-68 % of the undercatch errors found by Yang et al. (2000) for Wyoming-type gauges in wind-blown regions.
The initial discrepancies between modeled and observed SWE (prior to applying the above precipitation multipliers) may have resulted from deficiencies in the measured forcings, model parameters, model structure, and measured verification data, and justification of our decision to apply pre-cipitation multipliers was warranted.Manual observations of SWE (e.g., snow surveys, snow pits) generally supported the automatically collected SWE observations (no figures shown) and thus differences between observed and modeled SWE did not likely stem from issues in the verification data.Sites where we decreased the precipitation data (CDP and RME) were also the warmer sites and experienced more mixed rain-snow events in the winter.Hence, we considered multiple hypotheses to explain the SWE differences at these sites: (1) the choice of rain-snow parameterization, (2) the choice of parameters (e.g., threshold temperatures) for the rain-snow parameterization, and (3) the quality of the forcing data (e.g., precipitation).For these warmer sites, an exploratory analysis revealed that either (1) or (3) could explain the SWE differences, but auxiliary data (e.g., precipitation phase data) were not available to discriminate these hypotheses.Choosing a different rain-snow parameterization might minimize the SWE differences at the warmer sites but would not rectify the SWE differences at the colder sites (IC and SASP) where most winter precipitation falls as snow.Therefore, the most straightforward and consistent approach was to adjust the precipitation data and to leave the native UEB parameterizations intact.It was beyond the scope of this study to optimize model parameters and unravel the relative contributions of uncertainty for factors other than the meteorological forcings.Nevertheless, we suggest these precipitation adjustments minimally affected the sensitivity analysis, as we did not quantitatively compare the model outputs to the observed response variables (e.g., SWE).

Model and output metrics
The UEB is a physically based, one-dimensional snow model (Mahat and Tarboton, 2012;Tarboton and Luce, 1996;You et al., 2014).UEB represents processes such as snow accumulation, snowmelt, albedo decay, surface temperature variation, liquid water retention and refreezing, and sublimation.Due to the one-dimensional structure of the model, UEB does not account for lateral mass transfer of snow (e.g., windinduced snow drifting) and therefore these processes must be represented in other model components (e.g., precipitation uncertainty; see Sect. 3.2.3).UEB has a single bulk snow layer and an infinitesimally thin surface layer for energy balance computations at the snow-atmosphere interface.UEB tracks state variables for snowpack energy content, SWE, and a dimensionless snow surface age (for albedo computations).We ran UEB at hourly time steps with six forcings: air temperature (T air ), precipitation (P ), wind speed (U ), relative humidity (RH), incoming shortwave radiation (Q si ), and incoming longwave radiation (Q li ).We used fixed parameters across all scenarios (Table 2).We initialized UEB during the snow-free period; thus, model spin-up was unnecessary.With each UEB simulation, we calculated four summary output metrics: (1) peak (i.e., maximum) SWE, (2) mean ablation rate, (3) snow disappearance date, and (4) total annual snow sublimation.The first three metrics are important for the timing and magnitude of water availability and identification of the snowpack regime (Trujillo and Molotch, 2014), while the fourth impacts the partitioning of annual P into runoff and evapotranspiration.We calculated the snow disappearance date as the first date when 90 % of peak SWE had ablated, similar to other studies that use a minimum SWE threshold for defining snow disappearance (e.g., Schmucki et al., 2014).The mean ablation rate was calculated in the period between peak SWE and snow disappearance and was taken as the absolute value of the mean of all SWE decreases.

Forcing error scenarios
To test how error characteristics in forcings affect model outputs, we examined five scenarios (Fig. 1, Table 3) with different assumptions regarding error types, distributions, and magnitudes (i.e., error ranges).In the first scenario, only bias (normally distributed for additive errors or lognormally distributed for multiplicative precipitation errors) was introduced into all forcings at a level of high uncertainty (based on values observed in the field; see Sect.3.2.3below).This scenario was named NB, where N denotes normal (or lognormal) error distributions and B denotes bias only.The remaining scenarios were identical to NB except one aspect was changed: scenario NB+RE considered both bias and random errors (RE) in all forcings, scenario UB considered uniformly distributed biases in all forcings, scenario NB_gauge considered precipitation error magnitudes associated with gauge undercatch, and scenario NB_lab considered error magnitudes for all forcings at minimal values (i.e., specified instrument accuracy as found in a laboratory).Constructed in this way (Fig. 1), we could test model sensitivity to (1) bias vs. random errors by comparing NB and NB+RE, (2) error distributions by comparing NB and UB, and (3) error magni- a B: bias, RE: random errors.Biases are additive (b i = 0, Eq. 5) for all forcings except P , which has multiplicative bias (b i = 1).b Probability distributions were truncated in instances when introduction of errors caused non-physical forcing values (see Sect. 3.3.5).c The high upper P bias (300 %) mimics cases where snowfall data collected in an area of drift deposition are assumed (incorrectly) to represent other basin locations.d Uncertainty ranges in this scenario are based primarily on manufacturer's specified accuracy for typical sensors deployed at SNOTEL sites (NRCS Staff, personal communication, 2013).We assume the P storage gauge has the same accuracy as a typical tipping bucket gauge.e We neglect P undercatch errors in the lab uncertainty scenario.
tudes by comparing NB (high forcing uncertainty) to both NB_gauge (moderate uncertainty in precipitation but high uncertainty for all other forcings) and NB_lab (low forcing uncertainty).

Error types
Forcing data inevitably have some (unknown) combination of bias and random errors.However, hydrologic sensitivity analyses have tended to focus more on bias with little or no attention to random errors (Raleigh and Lundquist, 2012), whereas data assimilation methods often focus on random errors but assume bias does not exist (e.g., Dee, 2005).Rarely distributed biases with error magnitudes found in the field.NB+RE is the same as NB but also considers random errors.UB is the same as NB but considers uniformly distributed errors instead.NB_gauge is the same as NB but with reduced precipitation uncertainty (typical difference between precipitation gauge and snow pillow).NB_lab is the same as NB but considers laboratory error magnitudes.
is there any consideration of interactions between these error types.As a recent example, Lapo et al. (2015) tested biases and random errors in Q si and Q li forcings, finding that biases generally introduced more variance in modeled SWE than random errors.Their experiment considered biases and random errors separately (i.e., no error interactions allowed) and examined only a subset of the required forcings (i.e., radiation).Here, we examined coexisting biases in all forcings in NB, UB, NB_gauge, and NB_lab, and coexisting biases and random errors in all forcings in NB+RE.
Table 3 shows the assignment of error types for the five scenarios.We relied on studies that assess errors in measurements or estimated forcings to identify typical characteristics of biases and random errors.Published bias values were more straightforward to interpret than random errors because common metrics, such as root mean squared error (RMSE) and mean absolute error (MAE), encapsulate both systematic and random errors.Hence, when defining random errors, the published RMSE and MAE served as qualitative guidelines.

Error distributions
In their recent review of global sensitivity analysis applications in hydrological modeling, Song et al. (2015) identified the selection of probability distributions (this section) and ranges (Sect.3.2.3)as among the most important considerations.While it is common practice in sensitivity analysis to assume a uniform distribution when sampling model parameters (e.g., Campolongo et al., 2011;Rosero et al., 2010), this may fail to represent the real distribution of errors in meteorological forcing data, as the uniform distribution implies that extreme and small biases are equally probable.It is more likely that real error distributions more closely resemble non-uniform distributions, with higher probability of smaller biases and lower probability of more extreme biases (e.g., normal distributions).Investigators in other fields (e.g., Foscarini et al., 2010;Touhami et al., 2013)  ies broadly suggest that the grouping of most important factors may be similar under different distribution assumptions, particularly in cases when interactions are minimal, but the relative ranking of factors within those groups may vary depending on the distribution.Here we test how the assumed probability distribution influences the sensitivity of a snow model to forcing errors.We designed the UB scenario with the naive hypothesis that the probability distribution of biases was uniform for all six meteorological variables.In contrast, error distributions (Table 3) were assumed non-uniform (described below) in scenarios NB, NB+RE, NB_gauge, and NB_lab.Unfortunately, error distributions are reported less frequently than error statistics (e.g., bias, RMSE) in the literature.We assumed that T air and RH errors follow normal distributions (Mardikis et al., 2005;Phillips and Marks, 1996), as do Q si and Q li errors.Conflicting reports over the distribution of U indicated that errors may be approximated with a normal (Phillips and Marks, 1996), a lognormal (Mardikis et al., 2005), or a Weibull distribution (Jiménez et al., 2011).For simplicity, we assumed that U errors were normally distributed.Finally, we assumed P errors followed a lognormal distribution to account for snow redistribution due to wind drift/scour (Liston, 2004) or to account for precipitation gauge undercatch (Durand and Margulis, 2007).Error distributions were truncated in cases when the introduced errors violated physical limits (e.g., negative U ; see Sect.3.3.5).

Error magnitudes
We considered three magnitudes of forcing uncertainty (Table 3): levels of uncertainty found (1) in the field for all forcings (i.e., NB), (2) in the field for all forcings except precipitation (which has uncertainty due to precipitation gauge undercatch, i.e., NB_gauge), and (3) in a controlled laboratory setting (i.e., NB_lab).These cases were considered because they sampled realistic errors (NB and NB_gauge) and minimum errors (NB_lab).We expected that the error ranges exerted a major control on model uncertainty and sensitivity, as demonstrated in several prior sensitivity analyses (see review of Song et al., 2015).
Consideration of error magnitudes was achieved in each scenario by assigning a range to each error probability distribution (see Sect. 3.2.2 and Table 3).While non-uniform distributions (e.g., normal) are typically described by measures other than the range (e.g., mean and variance), we scaled these distributions (see Sect. 3.3.5 for details) such that they were bounded within a specified range.This convention was necessary to ensure that differences between scenarios NB and UB were due solely to the shape of the error probability distributions, and not due to differences in both distribution shape and the domain.Additionally, this followed the typical practice of sensitivity analysis where the range specifies the domain of the distribution.
P error ranges spanned both undercatch (e.g., Rasmussen et al., 2012) and wind drift/scour errors in NB, NB+RE, and UB but only undercatch errors in NB_gauge.We assumed that P biases due to gauge undercatch in NB_gauge ranged from −10 to +10 % because Meyer et al. (2012) found 95 % of SNOTEL sites (often in forest clearings) had observations of accumulated P within 20 % of peak SWE.Results of NB, NB+RE, and UB were thus most relevant to areas with prominent snow redistribution (e.g., alpine zone), whereas NB_gauge results were more relevant to areas with minimal wind drift errors.It could be argued that uncertainty due to snow drift processes is a structural issue and not a source of forcing error; however, this distinction depends strongly on what type of model is considered.This process is clearly a structural component for snow models with explicit (e.g., three dimensional models with dynamic wind transport, Lehning et al., 2006) or implicit (e.g., onedimensional models with probabilistic subgrid snow variability routines, Clark et al., 2011a) treatment of snow redistribution.However, when a one-dimensional snow model is applied at length scales shorter than drift process length scales (as assumed here with UEB), it is not possible to account for snow drift in a structural sense.Therefore, we treat drifting snow as a form of precipitation error in NB, NB+RE, and UB.Because UEB lacks dynamic wind redistribution, accumulation uncertainty was not linked to U errors but instead to P errors (e.g., drift factor, Luce et al., 1998).
In contrast, scenario NB_lab assumed laboratory levels of uncertainty (i.e., measurement accuracy) for each forcing.Skiles et al. ( 2012) considered a similar scenario in their sensitivity analysis of the SNOBAL model (Marks and Dozier, 1992;Marks et al., 1992) to instrument accuracy at SASP, finding a 5-day range in uncertainty in modeled snow disappearance, with longwave uncertainty having the greatest impact.An emerging sensitivity analysis (Sauter and Obleitner, 2015) with the CROCUS model (Brun et al., 1992) applied on the Kongsvegen Glacier (Svalbard) indicates that longwave measurement uncertainty has an approximately comparable effect on modeled snow depth as ±25 % precipitation uncertainty and is the most dominant influence on the modeled energy balance and turbulent heat flux (relative to the measurement uncertainty of other forcings).Here we build on these efforts to examine how instrument accuracy impacts modeled snow variables in a variety of seasonal snow climates.In reality, laboratory uncertainty levels vary with the type and quality of sensors, as well as related accessories (e.g., radiation shield for the temperature sensor), which we did not explicitly consider.Because the actual sensors available varied between sites (Table 1) and we needed consistent errors across sites within scenario NB_lab, we assumed that the manufacturers' specified accuracy of meteorological sensors at a typical SNOTEL site were representative of minimum uncertainties in forcings because of the widespread use of SNOTEL data in snow studies.While we used the specified accuracy for idealized P measurements in NB_lab, we note that the instrument uncertainty of ±3 % was likely unrepresentative of errors likely to be encountered.For example, corrections applied to the P data (see Sect. 2) exceeded this uncertainty by factors of 3-20.

Sensitivity analysis
Numerous approaches that explore uncertainty in numerical models have been developed in the literature of statistics (Christopher Frey and Patil, 2002), environmental modeling (Matott et al., 2009), and optimization/calibration of hydrology and earth systems models (Beven and Binley, 1992;Duan et al., 1992;Kavetski et al., 2002Kavetski et al., , 2006a, b;, b;Kuczera et al., 2010;Razavi and Gupta, 2015;Song et al., 2015;Vrugt et al., 2009Vrugt et al., , 2008)).Among these, global sensitivity analysis is an elegant platform for testing the impact of input uncertainty on model outputs and for ranking the relative importance of inputs while considering coexisting sources of uncertainty.Global methods are ideal for non-linear models (e.g., snow models).The Sobol' (1990) (hereafter Sobol') method is a robust global method based on the decomposition of variance (see below).We investigate Sobol', as it is often the baseline for testing sensitivity analysis methods (Herman et al., 2013;Li et al., 2013;Rakovec et al., 2014;Tang et al., 2007).

Overview: model conceptualization and sensitivity
One can visualize any hydrology or snow model (e.g., UEB) as where Y is a matrix of model outputs (e.g., SWE), M is the model operator, F is a matrix of forcings (e.g., T air , P , U ), and θ is an array of model parameters (e.g., Table 2).The goal of sensitivity analysis is to determine which input factors (F and θ ) are most important to specific outputs (Y) (Matott et al., 2009).Sensitivity analyses often focus more on the model parameter array (θ ) than on the forcing matrix (Foglia et al., 2009;Herman et al., 2013;Li et al., 2013;Nossent et al., 2011;Rakovec et al., 2014;Rosero et al., 2010;Rosolem et al., 2012;Tang et al., 2007;van Werkhoven et al., 2008).However, recent analyses have considered other input factors and sources of uncertainty (e.g., Baroni and Tarantola, 2014;Schoups and Hopmans, 2006).Here, we extend the sensitivity analysis framework to forcing uncertainty by creating k new parameters (φ 1 , φ 2 , . . ., φ k ) that specify forcing uncertainty characteristics (Vrugt et al., 2008) and reformulate Eq. ( 1) as By fixing the original model parameters (Table 2), we focus solely on the influence of forcing errors on model outputs (Y).Note it is possible to consider uncertainty in both forcings and parameters in this framework.

Sobol' sensitivity analysis
Sobol' sensitivity analysis uses variance decomposition to attribute output variance to input uncertainty.First-order and higher-order sensitivities can be resolved; here, only the total-order sensitivities were examined (see below) for clarity and because the resulting first-order sensitivity indices were typically comparable to the total-order sensitivity indices (e.g., 83 % of all cases had total-order and first-order indices within 10 % of each other), suggesting minimal error interactions.The Sobol' method is advantageous in that it is model independent, can handle non-linear systems, and is among the most robust sensitivity methods (Saltelli and Annoni, 2010;Saltelli, 1999).The primary limitation of Sobol' is that it is computationally intensive, requiring a large number of samples to account for variance across the full parameter space.A key assumption to the Sobol' approach used in this paper (see Sect. 3.3.3) is that the factors are independent; hence, our analysis does not consider cases of correlated errors (e.g., a positive measurement bias in T air that causes a negative RH bias).Frameworks have been proposed for the case of correlated factors (e.g., forcing errors) in a sensitivity analysis (e.g., Kucherenko et al., 2012)

Sensitivity indices and sampling
Within the Sobol' global sensitivity analysis framework, the total-order sensitivity index (S Ti ) describes the variance in model outputs (Y ) due to a specific forcing error (φ i ), including both unique (i.e., first-order) effects and all interactions with all other parameters: where E is the expectation (i.e., average) operator, V is the variance operator, and φ ∼i signifies all parameters except φ i .The latter expression defines S Ti as the variance remaining in Y after accounting for variance due to all other parameters (φ ∼i ).S Ti values have a range of [0, 1].Interpretation of S Ti values was straightforward because they explicitly quantified the variance introduced to model output by each parameter (i.e., forcing errors).As an example, an S Ti value of 0.7 for bias parameter φ i on output Y j indicates 70 % of the output variance was due to bias in forcing i (including unique effects and interactions).
A number of numerical methods are available for evaluating sensitivity indices, and most adopt a Monte Carlo approach (Saltelli et al., 2010).Evaluation of Eq. ( 3) requires two sampling matrices, which we refer to as matrices A and B (Fig. 2a).To construct A and B, we first specified the number of samples (N) in the parameter space and the number of parameters (k), depending on the error scenario (Table 3).Selecting input factor samples for these two matrices was achieved using the quasi-random Sobol' sequence (Saltelli and Annoni, 2010).The sequence can be approximated as a uniform distribution in the range [0, 1]. Figure 2a shows input factor samples from an example Sobol' sequence in two dimensions.For each scenario and site, we generated a (N × 2k) Sobol' sequence matrix with quasirandom numbers in the [0, 1] range, and then divided it in two parts such that matrices A and B were each distinct (N × k) matrices.Calculation of S Ti required perturbing factors; therefore, a third Sobol' matrix (A (i) B ) was constructed from A and B. In matrix A (i) B , all columns were from A, except the ith column, which was from the ith column of B, resulting in a (kN × k) matrix (Fig. 2a).Section 3.3.5 provides specific examples of this implementation.From Eq. (3), we compute S Ti as (Jansen, 1999;Saltelli et al., 2010) where f (A) is the model output evaluated on the A matrix, f (A is the model output evaluated on the A (i) B ma-trix where the ith column is from the B matrix, and i designates the parameter of interest.Evaluation of S Ti required N (k + 2) simulations at each site and scenario.

Bootstrapping of sensitivity indices
To test the reliability of S Ti , we used bootstrapping with replacement across the N (k + 2) outputs, similar to Nossent et al. (2011).The mean and 95 % confidence interval were calculated using the Archer et al. (1997) percentile method and 10 000 samples.For all cases, final S Ti values (i.e., computed sensitivity indices with all samples considered) were close to the mean bootstrapped values (i.e., 99 % had a difference of less than 0.001 and no difference was greater than 0.003), suggesting convergence.Thus, we report only the mean and 95 % confidence intervals of the bootstrapped S Ti values.

Workflow and error introduction
Figure 2 shows the workflow for creating the Sobol' A, B, and B matrices, mapping input factor samples to errors, applying errors to the original forcing data, executing the model and saving outputs, and calculating S Ti values.The workflow was repeated at all sites and scenarios.Each step is described in more detail below.
-Step 1: generate an initial (N × 2k) Sobol' matrix (with N and k values for each scenario; Table 3), separate into A and B, and construct A (i) B (Fig. 2a).NB+RE had k = 12 (six bias and six random error parameters).All other scenarios had k = 6 (all bias parameters).
-Step 2: in each simulation, map the input factor sample of each forcing error parameter (φ i ) to the specified error distribution and range (Fig. 2b, Table 3).Here we treat the input factor samples as quantiles, which allows us to map these to errors via different probability distributions.For a uniform distribution, the quantile values scale linearly between the specified lower and upper error ranges (Fig. 2b).This linear scaling is not possible for normal (or lognormal) distributions (due to differences in distribution shape) and we therefore map the quantile values to normal (or lognormal) distributions scaled within the specified range.We begin by generating a probability distribution of random numbers with specified mean = 0 and standard deviation of 1 for the case of a normal distribution, and with specified mean = 20 and standard deviation of 0.5 for the case of a lognormal distribution.The random numbers of the distribution are normalized in the [0, 1] range by subtracting the minimum value and dividing by the maximum value, and then quantiles of these normalized values are computed.The final step of the mapping is to multiply the normalized quantile by the specified range of uncertainty and adding the lower bound value.For example, -Step 3: in each simulation, perturb (i.e., introduce artificial errors) the observed time series of the ith forcing (F i ) with bias (all scenarios) or both bias and random errors (NB+RE only) (Fig. 2c): where F i is the perturbed forcing time series, φ B,i is the bias parameter for forcing i, b i is a binary switch indicating multiplicative bias (b i = 1) or additive bias (b i = 0), φ RE,i is the random error parameter for forcing i, R is a time series of randomly distributed noise (normal distribution, mean = 0) scaled in the [−1, 1] range, and c i is a binary switch indicating whether random errors are introduced (c i = 1 in scenario NB+RE and c i = 0 in all other scenarios).For T air , U , RH, Q si , and Q li , b i = 0; for P , b i = 1.The decision to treat biases as multiplicative for P but additive for all other forcings was made based on practical considerations (e.g., multiplicative biases in T air are difficult to interpret) and on convention of past studies that report forcing errors.However, we note this is somewhat subjective, as errors in some forcings (e.g., radiation) have been reported in both conventions.For P , U , and Q si , we restricted random errors to periods with positive values.Similar to other sensitivity analyses (e.g., Baroni and Tarantola, 2014), we checked F i for non-physical values (e.g., negative Q si ) and set these to physical limits.This was most common when perturbing U , RH, and Q si ; negative values of perturbed P only occurred when random errors were considered (Eq.5).Due to this resetting of non-physical errors, the error distribution was truncated (i.e., it was not always possible to impose extreme errors).Additional tests (not shown) suggested that distribution truncation changed sensitivity indices minimally (i.e., < 5 %) and thus we assumed this truncation did not alter the relative ranking of forcing errors.
-Step 4: input the N(k + 2) perturbed forcing data sets into UEB (Fig. 2d).At each site, NB+RE required 140 000 simulations, whereas the other four scenarios each required 80 000 simulations, for a total of 1 840 000 simulations in the analysis.The doubling of k in NB+RE did not result in twice as many simulations because the number of simulations scaled as N (k + 2).
-Step 5: save the model outputs for each simulation (Fig. 2e).The outputs included daily time series of SWE, and four summary outputs including peak SWE, mean ablation rate, snow disappearance date, and total snow sublimation.
-Step 6: calculate S Ti for each forcing error parameter and model output (Fig. 2f) based on Sects.3.3.3and 3.3.4.Prior to calculating S Ti , we screened the model outputs for cases where UEB simulated too little or too much snow (which can occur with perturbed forcings); this was an essential step to ensure meaningful results.
Other studies (e.g., Pappenberger et al., 2008) have also applied screening methods to model output prior to calculating sensitivity indices.For a valid simulation, we required a minimum peak SWE of 50 mm, a minimum continuous snow duration of 15 days, and identifiable snow disappearance.We rejected samples that did not meet these criteria to avoid meaningless or undefined metrics (e.g., peak SWE in ephemeral snow or snow disappearance for a simulation that did not melt out).
The number of rejected samples varied with site and scenario (Table 4).On average, 94 % passed the requirements.All cases had at least 86 % satisfactory samples, except in UB at SASP, where only ∼ 34 % met the requirements.In this case, the most common reason for rejecting a simulation was that too much snow was simulated, such that it never disappeared by the end of the model run.The rejected runs were characterized by high (positive) precipitation biases and low (negative) biases in T air , Q si , and Q li .Despite this attrition, S Ti values still converged in all cases.4 Results

Propagation of forcing uncertainty to model outputs
Figure 3 shows density plots of daily SWE from UEB at the four sites and five forcing error scenarios (Fig. 1, Table 3), while Fig. 4 summarizes the model outputs.As a reminder, NB assumed normal (or lognormal) biases at field level uncertainty.The other scenarios were the same as NB, except NB+RE considered both biases and random errors, UB considered uniform distributions, NB_gauge considered gauge undercatch biases in precipitation, and NB_lab considered lower error magnitudes in all forcings (i.e., laboratory level uncertainty).Large uncertainties in SWE were evident, particularly in NB, NB+RE, and UB (Fig. 3a-l).The large range in modeled SWE within these three scenarios often translated to large ranges in mean ablation rates (Fig. 4e-h), snow disappearance dates (Fig. 4i-l) and total sublimation (Fig. 4mp).In contrast, SWE and output uncertainties in NB_gauge and NB_lab were comparatively small (Figs.3m-t, 4).Model output ranges were generally larger in NB_gauge than in NB_lab.The envelope of SWE simulations in NB_lab more tightly encompassed observed SWE at all sites, except during early winter at IC (Fig. 3m), which was possibly due to initial P data quality and redistribution of snow to the snow pillow site.
NB and NB+RE generally yielded similar SWE density plots (Fig. 3a-h) but NB+RE yielded a slightly higher frequency of extreme SWE simulations.NB and NB+RE also had very similar (but not equivalent) mean outputs values and ensemble spreads at all sites except IC (Fig. 4).This initial observation suggested that random errors in the forcings had minimal impact on model behavior at CDP, RME, and SASP.NB+RE and NB model outputs were slightly different at IC  3).The number of model simulations in the density plots varies with the site and scenario (see Table 4).The density plots were constructed using 100 bins in the SWE dimension with relative frequency tabulated in each bin each day.Note the frequency color bar is on a logarithmic scale.Sites are arranged from top to bottom in order of increasing elevation and decreasing latitude.Scenarios are defined as normally distributed bias (NB), normally distributed bias and random errors (NB+RE), uniformly distributed bias (UB), normally distributed bias with precipitation gauge uncertainty (NB_gauge), and normally distributed bias at laboratory error magnitudes (NB_lab).
(particularly for the ablation rates), indicating that random errors had some influence there, and this was possibly due to the low snow accumulation (∼ 200 mm peak SWE observed) at that site and brief snowmelt season (less than 10 days in the observations).
NB and UB yielded generally very different model outputs (Figs. 3, 4).The only difference in these two scenarios was the assumption regarding error distribution (Table 3).Uniformly distributed forcing biases (scenario UB) yielded a relatively uniform ensemble of SWE simulations (Fig. 3i-l), larger mean values of peak SWE and ablation rates, and later snow disappearance, as well as larger uncertainty ranges in all outputs (Fig. 4).At some sites, UB also had a higher frequency of simulations where seasonal sublimation was negative (i.e., condensation).
Contrasting NB and NB_gauge, NB_gauge had a lower uncertainty range in SWE and slightly higher mean peak SWE at all sites (Figs. 3, 4).With the exception of RME, the ranges in ablation rates in NB_gauge were at least 50 % smaller than in NB (Fig. 4e-h).Snow disappearance ranges were marginally smaller in NB_gauge relative to NB (Fig. 4i-l).Finally, sublimation ranges were very similar between NB and NB_gauge (Fig. 4m-p).Relative to NB, NB_lab had smaller uncertainty ranges in all model outputs (Figs. 3, 4), an expected result given the lower magnitudes in forcing errors in NB_lab (Table 3).Likewise, NB_lab SWE simulations were generally less biased than NB, relative to observations (Fig. 3).NB_lab generally had higher mean peak SWE and ablation rates and later mean snow disappearance timing than NB (Fig. 4).

Model sensitivity to forcing error characteristics
Total-order sensitivity indices (S Ti ) were calculated for four summary variables of model output (peak SWE, mean ablation rates, snow disappearance dates, and total sublimation) and for daily SWE output at all sites and error scenarios.Ex-amination of the total-order indices with sample size indicated that most indices stabilized after evaluating the model at 3000-5000 samples (no figures shown).Below we sequentially compare sensitivity indices from different scenarios to scenario NB to test the impact of differences in error characteristics (type, probability distribution, and magnitudes).

Impact of error types
We first focus on sensitivity to forcing bias, as this error type was common to scenarios NB and NB+RE.Figure 5 shows the computed total-order sensitivity indices from the two scenarios (with sensitivities to biases and random errors shown separately in NB+RE).Both NB and NB+RE showed that UEB peak SWE was most sensitive to P bias at all sites (Fig. 5a-d).In both scenarios, P bias was also the most important factor for ablation rates and snow disappearance at all sites (Fig. 5e-l).For ablation rates in NB, T air bias was the next most important factor (after P bias) at CDP, while biases in Q si and Q li were secondarily important at RME (Fig. 5f, g).For ablation rates at IC in NB+RE, most types of errors had some baseline influence (i.e., S Ti = 0.5) on model sensitivity (Fig. 5e).In both NB and NB+RE, biases in the radiation terms were of secondary importance to snow disappearance timing (Fig. 5i-k).In contrast to the other three model outputs, sublimation in NB and NB+RE was insensitive to P bias and the most important factors varied somewhat between sites and scenarios (Fig. 5m-p).In both scenarios, sublimation was most sensitive to RH bias at IC and U bias at SASP.At CDP and RME, sublimation was most sensitive to RH bias in NB; however, in NB+RE, sublimation was most sensitive to Q li bias at CDP and to T air bias at RME (Fig. 5n, o).In both scenarios, biases in T air , Q si , or Q li were generally of secondary importance for sublimation.
We hypothesized that the snow model outputs would have higher sensitivity to biases than to random errors in the forcings.The results of our analysis generally supported this hypothesis.Across all outputs and sites, S Ti values for random errors were always less than or comparable to the smallest S Ti bias values, and the most important factor was always a bias term (Fig. 5).Furthermore, there was typically high correspondence between NB and NB+RE (bias terms only) in terms of identifying the most important forcing error (e.g., P bias in peak SWE and ablation rates at all sites, Fig. 5a-h).The main exceptions were snow disappearance at IC (Fig. 5i), and sublimation at CDP and RME (Fig. 5n,  o), where the two scenarios identified different errors as the most important factor.However, even in these exceptional cases, the two scenarios yielded similar groupings of more important vs. least important errors.For example, biases in T air and RH were important to sublimation at RME in both Hydrol.Earth Syst.Sci., 19, 3153-3179, 2015 www.hydrol-earth-syst-sci.net/19/3153/2015/ scenarios (Fig. 5o), though they distinguished these sensitivities differently (i.e., NB found the RH bias was more important whereas NB+RE found the T air bias was more important).
While there was general correspondence between NB and NB+RE (bias terms), sensitivity indices were not identical across cases, due to interactions between biases and random errors in NB+RE.Random errors changed model sensitivity to biases, and the change in sensitivity was more notable (i.e., absolute change exceeding 0.10) for ablation rates and snow disappearance at IC (Fig. 5e, i) and sublimation at all sites (Fig. 5m-p).Random errors amplified model sensitivity to biases in some cases (e.g., U bias in all sublimation scenarios) but diminished model sensitivity to biases in other cases (e.g., RH bias in all sublimation scenarios).Because consideration of second-order sensitivity indices was beyond the scope of the study, we were unable to determine which specific interactions were important in terms of error types, and leave this topic for future work.

Impact of probability distribution of errors
We hypothesized that the assumed probability distribution of errors would alter the relative hierarchy of forcing biases.However, the results did not consistently support this hypothesis (Fig. 6).In all cases, scenarios NB and UB identified the same factor as the most important and similar factors as the least important at all sites.Specifically, P bias was most important for peak SWE, ablation rates, and snow disappearance at all sites in both scenarios (Fig. 6a-l).The only exception was in scenario UB at IC, where ablation rates had similar sensitivity to P bias and U bias.In both scenarios, T air bias was the second most important factor for peak SWE and ablation rates at the warmest site, CDP.Both scenarios showed that RH bias was the least important factor to snow disappearance at all four sites (Fig. 6i-l).Finally, both NB and UB showed that P bias was least important for sublimation (in contrast to the other model outputs) and that RH and U biases were among the most sensitive factors for sublimation (Fig. 6m-p).More specifically, sublimation was most sensitive to RH bias at IC, CDP and RME, and to U bias at SASP (Fig. 6m-p).
For a few specific forcings and outputs, the selected probability distribution played a role in model sensitivity to that type of forcing bias.For example, assumption of a uniform probability distribution (UB) for forcing errors enhanced the sensitivity of sublimation to U and RH biases but reduced sublimation sensitivity to Q si and Q li biases at all sites (Fig. 6m-p).In contrast, assuming a normal distribution (NB) of biases yielded the opposite results.Additionally, modeled ablation rates at IC were notably more sensitive to forcing biases (precipitation excluded) in scenario UB than in NB.

Impact of error magnitude
We hypothesized that the relative magnitude of forcing errors would exert a strong control on model sensitivity.Comparing NB to NB_gauge and to NB_lab generally supported this hypothesis (Fig. 7).The contrast in S Ti values between scenarios NB, NB_gauge, and NB_lab implied that the specified ranges of forcing errors was a critical determinant of model sensitivity.
While P bias was the most important factor at all sites in NB for peak SWE, ablation rates, and snow disappearance, P bias was never the most important factor for these model outputs in NB_gauge and in many cases was among the least important errors (Fig. 7a-l).In NB_gauge, peak SWE was most sensitive to RH bias at IC, to T air bias at CDP and RME, and to Q li bias at SASP (Fig. 7a-d).Ablation rates in NB_gauge were most sensitive to T air bias at CDP and to Q li bias at IC, RME, and SASP (Fig. 7e-h).Snow disappearance was also most sensitive to Q li bias at all four sites in NB_gauge (Fig. 7i-l).However, for sublimation at all sites, NB and NB_gauge yielded very similar sensitivities to forcing biases (Fig. 7m-p).Specifically, in both NB and NB_gauge, modeled sublimation was most sensitive to RH bias at IC, CDP, and RME and to U bias at SASP (Fig. 7m-p).The similarity in sublimation sensitivity indices between NB and NB_gauge emerged because these scenarios only differed in terms of P uncertainty (Table 3) and because P bias was not important to modeled sublimation.The contrast between sensitivity indices in these two scenarios and for these four outputs illustrated that model sensitivity may depend on both the magnitudes of uncertainty for specific forcings and on the output of interest.
Whereas NB_gauge demonstrated that reducing the magnitude of forcing uncertainty in one factor (i.e., precipitation) was sufficient to change which factors were most and least important, NB_lab showed that changing the magnitude of forcing uncertainty in all terms could yield a substantially different pattern of model sensitivity (Fig. 7).As a primary example, scenarios NB and NB_lab did not agree on whether P bias or Q li bias was the most important factor for peak SWE, ablation rates, and snow disappearance dates at all four sites (Fig. 7a-l).For sublimation, NB_lab sensitivity indices indicated that Q li bias was most important, whereas RH bias (IC, CDP, and RME) and U bias (SASP) were most important in NB (Fig. 7m-p).Across all sites and outputs in NB_lab, Q li bias was consistently the most important factor (Fig. 7).In one sense, this was surprising, given that the bias magnitudes were lower for Q li than for Q si (Table 3).However, the albedo of snow minimizes the amount of energy transmitted to the snowpack from Q si , thereby rendering Q si errors less important than Q li errors.Additionally, the non-linear nature of the model may enhance the role of Q li through interactions with other factors.The general lack of importance in P bias in NB_lab (main exception was peak SWE at IC, Fig. 7a) was due to the discrepancy between the laboratory-specified accuracy for P gauges and typical errors encountered in the field.

Relative controls of forcing error characteristics on SWE sensitivity
The above results sequentially compared sensitivity indices from different error scenarios to NB in order to ascertain how different assumptions regarding error types, probability distributions, and magnitudes translated to changes in model sensitivity.To summarize the relative controls of these three forcing error characteristics on model sensitivity, we calculated daily sensitivity indices of modeled SWE to forcing biases at each site and scenario (Fig. 8).This final analysis was conceptually different than the previous analyses (Figs.5-7) in terms of the model output considered.Whereas the previous analyses computed sensitivity indices for summative model outputs (e.g., peak SWE, total sublimation), the final analysis recalculated sensitivity indices for SWE each day.This approach allowed us to examine how SWE model sensitivity changed as a function of time within the snow season.
Comparing the broad patterns in the time varying S Ti values across the five scenarios, it was evident that error magnitudes were the greatest determinant in model sensitivity to forcing errors through the snow season (compare Fig. 8a-l with m-t).NB, NB+RE, and UB exhibited similar patterns, with high S Ti in P bias throughout the year and with the other forcing biases yielding low S Ti values in the winter and increasing S Ti values in the spring and early summer for some forcings (Fig. 8a-l).In contrast, NB_gauge and NB_lab (Fig. 8m-t) had lower S Ti values for P bias and more coherent changes in S Ti values that were more synchronized with the specific part of the snow season.
After error magnitudes, the next most important determinant to model sensitivity was the probabilistic distribution of forcing errors (compare Fig. 8a-d and i-l).Relative to NB, UB tended to yield lower S Ti values for P bias.UB also had higher S Ti values for biases in T air , Q li , and Q si as time progressed at IC, CDP, and RME (Fig. 8i-k).Finally, the addition of random errors was least important to model sensitivity, as the evolution of S Ti bias values was very similar between NB and NB+RE at most sites (compare Fig. 8a-d Figure 7. Same as Fig. 5, but comparing S Ti values from scenarios NB, NB_gauge, and NB_lab to test model sensitivity as a function of error magnitudes.NB considers normally distributed bias at error magnitudes found in the field.NB_gauge has lower precipitation uncertainty (gauge undercatch) than NB but is otherwise identical.NB_lab considers normally distributed bias at error magnitudes found in the laboratory.and e-h).Random errors mattered the most to modeled SWE at IC, but random errors only changed S Ti values (on average) by less than 10 %.

Discussion
Here we examined the sensitivity of physically based snow simulations to forcing error characteristics (i.e., types, probability distributions, and magnitudes) using the Sobol' global sensitivity analysis.A key result is that among these three characteristics, the magnitude of biases had the most significant impact on UEB simulations (Figs. 3, 4) and on model sensitivity (Figs. 7,8).The assumed probability distribution of biases was important in that it increased the range of model outputs (compare NB and UB in Fig. 4) but, surprisingly, this usually translated to only modest changes in model sensitivity to forcing errors (Figs. 6,8).Random errors were usually less important than biases.Although random errors changed model sensitivity to biases through error interactions, this effect was only large in specific conditions (e.g., ablation rates at IC; Fig. 5e), and the snow model was never more sensitive to random errors than to biases (Fig. 5).Below we discuss these three error characteristics (in order of importance, as suggested by the results), place forcing uncertainty in the context of structural uncertainty, and identify limitations of the analysis and future research directions.

Ranges of error magnitudes
The results supported our hypothesis that the magnitude of biases strongly influences the relative importance of forcing errors.The three magnitudes of uncertainty considered (NB, NB_gauge, and NB_lab) all resulted in different patterns in model sensitivity to forcing biases, and these patterns also varied with the output of interest (Fig. 7).Modeled peak SWE, ablation rates, and snow disappearance were consistently sensitive to P bias in scenario NB and to Q li bias in scenario NB_lab, but there was less consistency in the dominant forcing errors across these three outputs in sce-  nario NB_gauge.While peak SWE, ablation rates, and snow disappearance dates had similar sensitivities to forcing errors (particularly to P biases), sublimation exhibited notably different sensitivity to forcing errors.P bias was frequently the least important factor for sublimation, in contrast to the other model outputs.Biases in RH, U , and T air were often the major controls on modeled sublimation in NB, NB+RE, UB, and NB_gauge, while Q li bias controlled modeled sub-limation in NB_lab.These field results partially agree with the sensitivity analysis of Lapp et al. (2005), who showed the most important forcings for sublimation in the Canadian Rockies were U and Q si .However, they did not consider Q li in their sensitivity analysis, so the experiments are not exactly comparable.These results suggest that no single forcing is important across all modeled variables and that model sensitivity strongly depends on the output of interest.
The dominant effect of P bias on modeled peak SWE, ablation rates, and snow disappearance in the field scenarios (e.g., NB) confirmed previous reports that P uncertainty is a major control on snowpack dynamics (Durand and Margulis, 2008;He et al., 2011a;Schmucki et al., 2014).It was surprising that P bias was often the most critical forcing error for ablation rates in these scenarios (Figs. 5,6).Prior investigations into the relative importance of forcings to ablation were typically framed for a snowpack at the end of winter, such that P uncertainty was not considered (e.g., Zuzel and Cox, 1975).The results here showed that ablation rates were highly sensitive to P bias and this is likely because it controlled the timing and length of the ablation season.Positive P bias extends the fraction of the ablation season in the warmest summer months when ablation rates and radiative energy approach maximum values, whereas negative P bias truncates the fraction of ablation in the warm season.Trujillo and Molotch (2014) reported a similar result based on SNOTEL observations.The contrast between scenarios NB, NB_gauge, and NB_lab highlights that selection of the error ranges is a critical step in sensitivity analysis.However, we recognize that there is some subjectivity in the specification of these ranges.Quantification of errors in forcing estimation methods is best achieved through comparisons with surface observations (e.g., Bohn et al., 2013;Flerchinger et al., 2009), but it remains challenging to specify error ranges with confidence (Song et al., 2015).Key considerations controlling the ranges and impacts of forcing errors include the representativeness of the forcing data (e.g., reanalysis, numerical weather model output, extrapolated surface measurements) in the study area, the length scale of dominant processes (e.g., snow drifting), and the configuration of the snow model (e.g., spatial scale, complexity).Here we selected ranges in the field scenarios to encompass errors encountered across a variety of possible forcing data sources (Table 3), but ultimately the appropriate ranges must be tailored to the specific application.This supports the need for continual evaluation of forcing data sets across a variety of climates and environmental conditions.

Probability distribution of errors
The results did not universally support our hypothesis that the assumed probability distribution of biases was important to the relative ranking of forcing errors.The relative consistency in the dominant forcing errors between NB and UB may have emerged because the probability distributions of all six forcing biases varied together between these two scenarios (i.e., all forcing biases were uniform in UB and either normal or lognormal in NB).While we did not conduct additional tests, we suspect that changing the probability distribution of just a single forcing error (e.g., T air bias) from normal to uniform would have uniquely enhanced model sensitivity to that particular forcing error (Touhami et al., 2013).
The similarity of results between scenarios NB and UB conform to findings in previous studies (e.g., Foscarini et al., 2010;Touhami et al., 2013) where uniform and normal distributions identified similar factors as the most important.These previous studies imply that greater differences in sensitivity indices (as a function of distribution) will emerge when factor interactions are more prominent.The case with the strongest error interactions here (i.e., ablation rates at IC) also yielded the largest differences in sensitivity indices between scenarios NB and UB, which is consistent with the prevailing logic.

Error types
The results were consistent with our hypothesis that the snow model is more sensitive to biases than to random errors in the forcings.While previous investigations supported this idea for shortwave and longwave forcings in physically based snow models (i.e., Lapo et al., 2015), the current study showed that biases are more important than random errors for all commonly required meteorological forcings (and not just irradiances).The model was more sensitive to biases and less sensitive to random errors due to the systematic nature of biases.In contrast, the effect of random errors tended to cancel out when integrating model outputs over long periods.Our selected model outputs were generally a function of several months of mass and energy exchange in the snowpack, thereby ensuring minimization of effects from random errors.Random errors only had a greater impact on ablation rates at IC (Fig. 5e) and this was because the relatively brief snowmelt period presented an opportunity for the random errors to not cancel out.Hence, the model may have greater sensitivity to random errors for other model outputs not considered here that integrate over relatively short timescales (e.g., snowmelt over a single day).

Contextualizing forcing and structural uncertainties
Our central argument at the onset was that forcing uncertainty may be comparable to parametric and structural uncertainty in snow-affected catchments.To support our argument and to place our results in context, we compare our results at CDP in 2005-2006 to Essery et al. (2013), who assessed the impact of structural uncertainty in a suite of local snowpack processes (i.e., snow compaction, fresh snow density, snow albedo evolution, surface heat and moisture fluxes, snow cover fraction, snow hydrology, and thermal conductivity) on SWE simulations from 1701 physically based snow models at the same site/year.Figure 9 compares the 95 % uncertainty ranges in peak SWE, ablation rates, and snow disappearance in NB, NB_gauge, and NB_lab to the ranges found across the 1701 snow models of Essery et al. (2013).
From the comparisons at this site, it is clear that the uncertainty associated with drifting snow (i.e., scenario NB) overwhelms the structural uncertainty in local snowpack processes for all three model outputs.As discussed previously, it could be argued that the uncertainty due to drifting snow is a structural issue (not a forcing issue) and that this does not represent the uncertainty of sheltered areas where drifting snow is less important.Hence, NB_gauge may be a better determinant of the level of uncertainty that can be attributed unambiguously to errors in forcing data.In that case, the output uncertainty range due to model forcing is still larger than that due to the structural uncertainty (as considered by Essery et al., 2013) in the cases of peak SWE and snow disappearance but is smaller for ablation rates (Fig. 9).As expected, the case of forcing uncertainty in NB_lab yields the lowest range in model outputs at CDP (Fig. 9), though it is interesting to note that the uncertainty in peak SWE due to structural uncertainty (90 mm) is only marginally larger than that due to the specified instrument accuracy (60 mm).These compar-isons illustrate that forcing uncertainty cannot be discounted and that the magnitude of forcing uncertainty is a critical factor in how forcing uncertainty compares to other sources of uncertainty (e.g., structural).This resonates with the recent work of Magnusson et al. (2015), who found that uncertainty in the P forcing was a greater determinant of model performance than structural considerations.

Caveats and future research
Limitations of the analysis are that the impact of forcing error characteristics on model behavior is evaluated through the lens of a single sensitivity analysis method and a single snow model.It is possible that alternative sensitivity analysis methods might yield different results than the Sobol' method, as suggested in previous studies (e.g., Pappenberger et al., 2008).Likewise, we recognize it is possible that different snow models may yield different sensitivities to forcing uncertainty.As one example, both Koivusalo and Heikinheimo (1999) and Lapo et al. (2015) found UEB (Tarboton and Luce, 1996) and SNTHERM (SNow THERmal Model) (Jordan, 1991) exhibited significant differences in radiative and turbulent heat exchange.As another example, the role of U bias on snowpack formation may vary strongly depending on the snow model configuration.Because of the lack of wind transport in UEB, we lumped snow drift uncertainty into P uncertainty via a "drift factor" formulation (Luce et al., 1998) and this could not account for the role of wind in snow drift/scour processes (Mott and Lehning, 2010;Winstral et al., 2013).This convention would be unnecessary for a model that explicitly models this process (e.g., the SNOWPACK model, Lehning et al., 2006), and for this type of model we would expect the role of U bias to be enhanced (relative to UEB) for outputs such as peak SWE and snow disappearance timing.While sensitivity may vary with model selection in these examples, there is also evidence suggesting that similar results may emerge when using different snow models for a similar type of error scenario.Despite using different models, a somewhat different suite of forcing variables, and slightly different error ranges, our NB_lab experiment corroborated independent reports that Q li measurement uncertainty was the most important to both modeled snow disappearance (Skiles et al., 2012) and sublimation/latent heat exchange (Sauter and Obleitner, 2015).Our analysis demonstrated this result was consistent across four snow climates and this result was apparent in four different model outputs (Fig. 7).The implication here is that more work is needed to better understand how different snow models respond to forcing uncertainty.
Generalizing the relationship between model sensitivity and site climate is a research topic of high interest.Although we found similarities in model sensitivity to specific forcing errors across sites (e.g., high sensitivity to P bias in peak SWE, ablation rates, and snow disappearance in NB, NB+RE, and UB; Fig. 8a-l), we note that the sites exhib-ited some differences in sensitivity when P uncertainty was reduced to gauge levels (Fig. 8m-p).Additionally, the sites exhibited differences in the relative importance of secondary forcing errors (Figs. 6, 7).There may be interesting linkages between climate and model sensitivity but we were unable to generalize relationships between site geo-characteristics and sensitivity indices because of the relatively low number of sites represented here (n = 4 sites, 1 year each) and the confounding number of differences between sites.A much larger population of snow measurement sites is required in order to test relationships between sensitivity indices and site characteristics, and this is an important avenue of future research.A successful example of relating climate characteristics to sensitivity indices when many study sites and years are available can be found in van Werkhoven et al. (2008).
While the Sobol' method is often considered the "baseline" method in global sensitivity analysis, we note the limitation is that it comes at a relatively high computation cost (1 840 000 simulations across four sites and five error scenarios) and it may be prohibitive for many modeling applications (e.g., for models of higher complexity and dimensionality).For context, the typical time required for a single simulation was 1.4 s, resulting in a total computational expense of 720 h (30 days) across all scenarios.Examination of the convergence rates indicated that most sensitivity indices stabilized after one-third of the simulations completed and hence the same results could have been found using significantly fewer simulations (no figures shown).Ongoing research is developing new sensitivity analysis methods that compare well to Sobol' but with reduced computational demands (e.g., Cukier et al., 1973;Morris, 1991;Rakovec et al., 2014), and is continuing to compare how different methods classify sensitive factors differently (Pappenberger et al., 2008;Tang et al., 2007).We expect that detailed sensitivity analyses that concurrently consider uncertainty in forcings, parameters, and structure in a hydrologic model will be more feasible in the future with better computing resources and advances in sensitivity analysis methods.
The question remains "what can be done about forcing errors in hydrologic modeling?First, the results suggest model-based hypothesis testing must account for uncertainties in forcing data.The results also highlight the need for continued research in constraining P uncertainty in snow-affected catchments.Progress is being achieved with advanced pathways for quantifying snowfall precipitation such as NWP models (Rasmussen et al., 2011(Rasmussen et al., , 2014) ) and through systematic intercomparisons of precipitation and snow gauges (e.g., Solid Precipitation Intercomparison Experiment, http://www.rap.ucar.edu/projects/SPICE/).However, in a broader sense, the hydrologic community should also consider whether deterministic forcings (i.e., single time series for each forcing) are a reasonable practice for physically based models, given the large uncertainties in both future (e.g., climate change) and historical data (especially in poorly monitored catchments) and the complexities of hydrologic systems (Gupta et al., 2008).We suggest that probabilistic model forcings (e.g., Clark and Slater, 2006), which have a legacy in data assimilation methods (e.g., precipitation uncertainty Durand and Margulis, 2007), present one potential path forward where measures of forcing uncertainty can be explicitly included in the forcing data sets.The challenges are (1) to ensure statistical reliability in our understanding of forcing errors and (2) to assess how best to input probabilistic forcings into current model architectures.

Conclusions
Application of the Sobol' sensitivity analysis framework across sites in contrasting snow climates reveals that forcing uncertainty can significantly impact model behavior in snow-affected catchments.Model output uncertainty due to forcings can be comparable to or larger than model uncertainty due to model structure.Furthermore, this work demonstrates that sensitivity analysis can be applied to understand the role of specific error characteristics in model behavior.Key considerations in model sensitivity to forcing errors are the magnitudes of forcing errors and the outputs of interest.For the physically based snow model tested, random errors in forcings are generally less important than biases, and the probability distribution of biases is relatively less important to model sensitivity than the magnitude of biases.
The analysis shows how forcing uncertainty might be included in a formal sensitivity analysis framework through the introduction of new parameters that specify the characteristics of forcing uncertainty.The framework could be extended to other physically based models and sensitivity analysis methodologies and could be used to quantify how uncertainties in model forcings and parameters interact.Based on this framework, it would be interesting to assess the interplay between coexisting uncertainties in forcing errors, model parameters, and model structure, and to test how model sensitivity changes in relation to all three sources of uncertainty.

Figure 1 .
Figure1.Scenarios of interest and the type, distribution, and magnitude of errors considered in each.NB considers normally (or lognormally) distributed biases with error magnitudes found in the field.NB+RE is the same as NB but also considers random errors.UB is the same as NB but considers uniformly distributed errors instead.NB_gauge is the same as NB but with reduced precipitation uncertainty (typical difference between precipitation gauge and snow pillow).NB_lab is the same as NB but considers laboratory error magnitudes.
have tested how distribution assumptions (uniform vs. normal) change their computed measures of model sensitivity.These stud-

Figure 2 .
Figure 2. Conceptual diagram showing methodology for imposing errors on the forcings with error parameters (φ i ) within the Sobol' sensitivity analysis framework, and workflow for model execution and calculation of sensitivity indices on model outputs (Y ).

Figure 3 .
Figure 3. Observed (black line) and modeled SWE (color density plot) at the four sites across the five uncertainty scenarios (see Fig. 1 and Table3).The number of model simulations in the density plots varies with the site and scenario (see Table4).The density plots were constructed using 100 bins in the SWE dimension with relative frequency tabulated in each bin each day.Note the frequency color bar is on a logarithmic scale.Sites are arranged from top to bottom in order of increasing elevation and decreasing latitude.Scenarios are defined as normally distributed bias (NB), normally distributed bias and random errors (NB+RE), uniformly distributed bias (UB), normally distributed bias with precipitation gauge uncertainty (NB_gauge), and normally distributed bias at laboratory error magnitudes (NB_lab).

Figure 4 .
Figure 4. Distributions of model outputs (rows) at the four study sites (columns) arranged by scenario.For each scenario, the circle is the mean and the whiskers show the range encompassing 95 % of the simulations (see Table 4 for number of simulations for each site and scenario).The dashed lines in (a)-(d) and (i)-(l) are the observed values.Axes are matched between sites for a given model output; note that the range in scenario UB in (d) is truncated by the axes' limits (upper value = 3030 mm).

Figure 5 .
Figure 5. Model sensitivity as a function of forcing error type.Shown are the total-order sensitivity indices (S Ti ) of four model response variables (columns) at the four sites (rows) from scenarios NB and NB+RE.In NB+RE, bias and random error parameters are shown separately.NB+RE considers normally distributed bias and random errors, while NB considers normally distributed bias only.The bar indicates the mean (bootstrapped) sensitivity indices and associated 95 % confidence intervals.

Figure 6 .
Figure 6.Same as Fig. 5, but comparing S Ti values from scenarios NB and UB to test model sensitivity as a function of error probability distribution.UB considers uniformly distributed bias, while NB considers normally distributed bias.

Figure 8 .
Figure 8. Variation of daily SWE sensitivity to forcing bias based on site (columns) and error scenario (rows).The normalized range (where 1 = maximum SWE) in modeled SWE is shown (gray area) for context.Sensitivity indices in the early and late part of the snow season were screened out, as a high number of simulations with SWE = 0 yielded invalid sensitivity indices.

Figure 9 .
Figure 9. Uncertainty ranges (95 % intervals) in (a) peak SWE, (b) ablation rates, and (c) snow disappearance date at CDP in WY2006 for three forcing uncertainty scenarios and the Essery et al. (2013) structural uncertainty.

Table 1 .
Basic characteristics of the snow study sites, ordered from left to right by increasing elevation.

Table 2 .
UEB model parameters used in all simulations and sites.

Table 3 .
Details of error types, distributions, and uncertainty ranges for the five scenarios.
those applications for future work.Below, we provide a brief summary of the Sobol' sensitivity analysis methodology implemented here but note that further details can be found inSaltelli et al. (2010).

Table 4 .
Number of samples (N) and model simulations (in parentheses) meeting the requirements for minimum peak SWE and snow duration and valid snow disappearance dates at each site (rows) in each scenario (columns).The number of model simulations scaled as N (k + 2), where k = 12 in scenario NB+RE and k = 6 in all other scenarios.When a simulation was rejected, all related simulations (based on resampling) were also rejected.