Using an ensemble of artificial neural networks to convert snow depth to snow water equivalent over Canada

Canada’s water cycle is driven mainly by snowmelt. Snow water equivalent (SWE) is the snow-related variable that is most commonly used in hydrology, as it expresses the total quantity of water (solid and liquid) stored in the snowpack. Measurements of SWE are, however, expensive and not continuously accessible in real time. This motivates a search for alternative ways of estimating SWE from measurements that are more widely available and continuous over time. SWE can be calculated by multiplying snow depth with the bulk density of the snowpack. Regression models proposed in the literature first estimate snow density and then calculate SWE. More recently, a novel approach to this problem has been developed and is based on an ensemble of multilayer perceptrons (MLPs). Although this approach compared favourably with existing regression models, snow density values at the lower and higher ends of the range remained inaccurate. Here, we improve upon this recent method for determining SWE from snow depth. We show the general applicability of the method through the use of a large data set of 234 779 snow depth-density-SWE records from 2878 non-uniformly distributed sites across Canada. These data cover almost four decades of snowfall. First, it is shown that the direct estimation of SWE produces better results than the estimation of snow density followed by the calculation of SWE. Second, optimizing MLP parameters separately for each snow climate class further improves the accuracy of SWE estimates. A comparison with commonly used regression models reveals that the ensemble of MLPs proposed here leads to noticeably more accurate estimates of SWE. This study thus shows that delving deeper into artificial neural network theory helps improve SWE estimation and that using a greater number of MLP parameters could lead to even further improvements.

In contrast to Broxton et al. (2019), the available snow measurements were spatially distant and temporally irregular. The ANN incorporated meteorological data as additional explanatory variables to support the estimation of density. Odry et al. (2020) used an ensemble of multilayer perceptrons (MLPs, a type of ANN) to provide estimates, at least in part, of the uncertainty associated with converting snow depth to density. A comparison of the ANN model of Odry et al. (2020) with the regression models of Jonas et al. (2009) and Sturm et al. (2010) in a leave-one-out set-up showed that the ANN model provided the most accurate results. However, Odry et al. (2020) also noted that all three models performed relatively poorly for very low or very high snow densities. Given that high densities generally correspond to the beginning of the melt period, it is particularly crucial to obtain accurate SWE values for this period of the year.
This study is a follow-up to the work of Odry et al. (2020). Here we test means of improving their initial model. Specifically, we verify two hypotheses: (1) using SWE instead of density as the target variable for the ANN will produce more accurate estimates of SWE; (2) in-depth testing of different ANN structures will improve estimates of SWE. We also take the opportunity to apply the proposed snow depth-SWE conversion framework to the entire area of Canada and test its applicability in a broader context. Canada is characterized by a range of climates associated with different snow classes.
The remainder of the paper is organized as follows. In Sect. 2, we present the main background elements and the essential literature review. This review includes the mathematical theory of MLPs, the applied snow climate classes (Sturm et al., 2009), both statistical regression models used for comparison against our neural network models, and the performance assessment metrics. Section 3 presents the experimental protocol, including the available data and input variables and the procedure for determining the MLP architecture. Results are presented in Sect. 4, and our conclusions are summarized in Sect. 5.
2 Literature review

2.1 The multilayer perceptron as a basic tool

Following Goodfellow et al. (2016), a multilayer perceptron (MLP) is a fully connected, feed-forward artificial neural network, meaning that information can only travel in one direction within the network. Being fully connected entails that no initial assumptions need to be made in regard to the data.
The goal is to approximate a desired function f such that y = f(θ, x), where y is the target variable (in our case SWE), x represents the input variables, and θ represents some parameters, namely weights and biases. The use of non-linear activation functions in each neuron enables the network to approximate non-linear functions f.
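For illustration, the forward pass of such a network can be sketched as follows; the layer sizes and tanh hidden activation are illustrative choices, not the architecture fitted in this study:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected feed-forward network (MLP).

    `weights`/`biases` hold one (W, b) pair per layer; tanh is applied in
    the hidden layers and the output layer is linear, as is usual for
    regression targets such as SWE.
    """
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)      # non-linear activation in each hidden neuron
    W, b = weights[-1], biases[-1]
    return W @ a + b                # linear output layer for regression

# Tiny example: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (3, 2)), rng.normal(0, 0.1, (1, 3))]
biases = [np.zeros(3), np.zeros(1)]
y = mlp_forward([0.5, -1.0], weights, biases)
```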
To determine the parameters, one can perform an optimization over the space of the weights and biases by using a training data set. To do so, the parameters are commonly initialized at random and close to zero by following either a uniform or Gaussian distribution. For regression problems, the mean square error (MSE) is used as the objective function in the optimization. Goodfellow et al. (2016) show a derivation of the MSE from the maximum likelihood estimator. During training, the training data set is presented multiple times to the model, and the term epoch is used to denote one run over the entire training data set.
It is recommended that between epochs the order of the data points of the training set is permuted. This prevents the model from learning features in the order of insertion. The training of an MLP is related to an ordinary optimization. The difference is that we optimize directly on a training data set with the goal that the measure of performance on a validation data set is also optimized. Commonly, there will be a turning point where the training error continues to decrease, but the validation error starts to increase. For a single deterministic MLP, this turning point would be chosen as the best number of epochs for the model.
When training an ensemble of MLPs, however, it is ideal to maintain diversity among the ensemble members to cover the range of uncertainty pertaining to the target variable. If all members of the ensemble were to be trained until the aforementioned turning point, they would all become very similar, and it would become pointless to use an ensemble in the first place. For this reason, we adopt here the protocol proposed by Boucher et al. (2010) and rather monitor the performance of the entire ensemble at each epoch (see Sect. 4.1).
During the optimization, a non-convex objective function is minimized; this function contains multiple minima having a similar performance. The objective function with respect to the parameters is non-convex because of several symmetric configurations of a neural network: exchanging the associated bias and weights of one neuron with those of another neuron in the same layer entails the same results. Furthermore, ANN applications usually use input variables that are related to each other. The interchangeability of dependent input variables results in multiple parameter sets having a similar performance. The number of dimensions in the optimization is equal to the number of parameters, and it is also argued by Goodfellow et al. (2016) that local minima are rare in high-dimensional spaces. Dauphin et al. (2014) point out that with increasing dimensions, the ratio of the number of saddle points to the number of local minima increases exponentially. Therefore, an optimization method is needed that does not become stuck in saddle points.

The concept of stochastic gradient methods has been introduced to avoid saddle points. These methods are based on the ordinary steepest gradient method; however, rather than using the entire training data set (batch) at once, a number of data records (or "minibatches") are taken for each iteration during the optimization. Consequently, the optimization surface is slightly altered at each iteration. This can even sometimes help to escape shallow local minima. For the interested reader, Goodfellow et al. (2016) provide an overview of various optimization algorithms.
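A minimal sketch of minibatch stochastic gradient descent, assuming a user-supplied gradient function; the least-squares example at the bottom is purely illustrative:

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, Y, lr=0.01, batch_size=100, epochs=5, seed=0):
    """Plain stochastic (minibatch) gradient descent.

    `grad_fn(theta, xb, yb)` returns the gradient of the objective on one
    minibatch; taking a different random subset of records at each update
    slightly alters the optimization surface, which can help escape
    saddle points and shallow local minima.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # shuffle records between epochs
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[idx], Y[idx])
    return theta

# Example: fit y = 2x by least squares; grad is that of the MSE wrt the slope
X = np.linspace(-1, 1, 200)
Y = 2.0 * X
grad = lambda t, xb, yb: np.mean(2.0 * (t * xb - yb) * xb)
theta = minibatch_sgd(grad, 0.0, X, Y, lr=0.5, batch_size=20, epochs=50)
```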

The related studies of Broxton et al. (2019) and Odry et al. (2020) both used the second-order Levenberg-Marquardt algorithm. Broxton et al. (2019) worked with the Matlab deep learning toolbox, and Odry et al. (2020) used an in-house Matlab toolbox built by Dr Kris Villez, derived from the Matlab deep learning toolbox. The conversion of the codes into Python revealed that the toolbox retains only parameter updates that decrease the error of the objective function, which conceals multiple failed update steps. Further discussion of the drawbacks of the Matlab toolbox is available in Kwak et al. (2011). Because of the exponentially increasing dimensions of the optimization for larger neural networks, we only consider first-order optimizers, as these are computationally more efficient. Goodfellow et al. (2016), in their overview of optimization methods, conclude that stochastic gradient methods with adaptive learning rates are the best choice for a first-order optimizer. Schaul et al. (2014) compare stochastic gradient methods by testing them on small-scale problems. The algorithms RMSProp and AdaDelta produce good results. The drawback of RMSProp is the need for a global learning rate, which must be defined by the user. To address this problem, Zeiler (2012) introduced the algorithm AdaDelta, which eliminates the requirement of this global learning rate.
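The AdaDelta update of Zeiler (2012) can be sketched as below; the running averages of squared gradients and squared updates replace the user-defined global learning rate, and the quadratic toy problem is only a demonstration:

```python
import numpy as np

def adadelta_step(theta, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta parameter update (Zeiler, 2012)."""
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1 - rho) * grad**2                    # running avg of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # step with self-tuned magnitude
    Edx2 = rho * Edx2 + (1 - rho) * dx**2                    # running avg of squared updates
    return theta + dx, (Eg2, Edx2)

# Minimize f(x) = x^2 starting from x = 3; the gradient is 2x
theta, state = 3.0, (0.0, 0.0)
for _ in range(2000):
    theta, state = adadelta_step(theta, 2.0 * theta, state)
```

Note how no learning rate appears anywhere: the ratio of the two running averages sets the step size.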
There are several reasons that support the use of an ensemble rather than a single model. First, random parameter initialization can end up in different local minima on the parameter surface with similar performance. This situation is related to the concept of equifinality, introduced by Bertalanffy (1968) for open systems stemming from the work of the biologist Hans Driesch. Therefore, using an ensemble accounts for the uncertainty of the model parameters. Second, the ensemble offers the possibility of probabilistic simulations. Therefore, probabilistic evaluation methods, introduced in Sect. 2.4, can be used. This leads to greater insight into model performance. Third, the different members of the ensemble can be used in a Kalman filter or a particle filter for data assimilation in a hydrological model.

Finally, in regard to the architecture of the MLP, the universal approximation theorem states that any continuous function can be approximated by a feed-forward neural network with a single hidden layer under mild assumptions. This theorem was first proven by Cybenko (1989) for the sigmoid activation function. Hornik (1991) showed that the theorem holds independently of the choice of activation function, provided that the output layer is linear, which is the case for regression problems. The theorem does not make any claims about guidelines for the architecture or about the learning ability of the model.

In ANN applications, therefore, a perfect approximation of the function is not obtained. Rather, there is a trade-off between computational cost, the time to test different architecture compositions, and the accuracy of the approximation.
According to Goodfellow et al. (2016), the most common activation functions are the hyperbolic tangent (tanh) and the rectified linear unit (ReLU) functions. The calculation of the derivative of tanh is computationally expensive, and the derivative of the tanh function for large values is almost zero, which slows down learning. The derivative of the ReLU function is easy to compute, and the derivative stays constant for large positive values. However, the derivative is zero for negative values, which stops the training within the optimization. This phenomenon is known as "dying neurons". The Leaky ReLU function can tackle this issue by assigning a slightly positive slope, typically 0.01, to the negative part of the ReLU function.
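The three activation functions and the derivative behaviour described above can be written out explicitly:

```python
import numpy as np

def tanh(x):        return np.tanh(x)
def d_tanh(x):      return 1.0 - np.tanh(x) ** 2       # ~0 for large |x|: slow learning

def relu(x):        return np.maximum(0.0, x)
def d_relu(x):      return (x > 0).astype(float)       # 0 for x < 0: "dying neurons"

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)                 # small slope keeps gradients alive

x = np.array([-5.0, -0.5, 0.5, 5.0])
```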

2.2 Snow classification
Because of the very large territory covered by this study, and in accordance with the findings of Sturm et al. (2010) and Bormann et al. (2013), we expect that training a different ensemble of MLPs for each snow class will result in an improved accuracy for estimates of SWE compared to training one ensemble over the data for Canada as a whole. Sturm et al. (2009) propose a global seasonal snow classification system that separates the world map into seven different snow classes at a resolution of 0.5° × 0.5°. Snow can vary in character (e.g. snow found in areas close to coasts consists mainly of wet snow, whereas snow located in the continental interior is characterized by dry snow). Therefore, the data within one snow class should be systematically more consistent and may show less variability than taking the snow data as a whole.
This can help in our study, as the MLP must capture fewer features, which may lead to a better optimization and, in turn, improved performance. The snow classes across Canada are presented in Fig. 1.
Because of the coarse resolution of the snow classification map, some stations close to the coast are classified as water.
Furthermore, the snow class ice only contains 124 records. For these two classes, we assign the affected stations to the nearest (in terms of physical distance) accepted snow class.
The empirical cumulative distribution function (ECDF) of snow depth, SWE, and snow density for each snow class is depicted in Fig. 2. The number of records in each snow class is presented in Fig. 2b. The ephemeral snow class is found only on Vancouver Island, which also explains the few records for this class. Furthermore, the high annual precipitation and marked variability in elevation across Vancouver Island explain the very high variability of snow depth, SWE, and snow density for this class. When comparing the taiga and tundra snow classes, one can observe that snow depth is lower in the tundra snow class owing to the lower total precipitation in northern Canada. Nonetheless, the density distribution of the tundra snow class is higher; this may reflect the wind densification effect, which is relatively more important than other processes for the tundra snow class. The maritime, mountain, and ephemeral snow classes have a higher density distribution than the prairie and taiga snow classes because of the greater oceanic influence on the maritime and ephemeral regions. This influence results in wetter snow and, therefore, higher snow densities within these latter regions. Most records of the mountain snow class are found in the Rocky Mountains along the border of British Columbia and Alberta, an area influenced by the Pacific Ocean, thereby producing a higher density snow. A more thorough analysis of the data revealed that the high number of records in the maritime and mountain snow classes is due to records from snow pillows in British Columbia, collected between 1996 and

2.3 Regression models
In this section we introduce two regression models from the literature, which we use as benchmarks to compare with our MLP-based conversion model, using the performance indicators described in Sect. 2.4.

2.3.1 The Sturm model
Sturm et al. (2010) proposed a model to convert snow depth into the bulk density of the snow cover. The estimated density is then used to calculate SWE. As observed by Dickinson and Whiteley (1972), the range of snow density is smaller than the range of snow depth. Therefore, Sturm et al. (2010) claim that estimating the snow density is probably the most accurate way to estimate SWE. The regression model is formulated as:

ρ_s = (ρ_max − ρ_0)[1 − exp(−k_1 SD_obs − k_2 DOY_obs)] + ρ_0,    (1)
where [ρ_max, ρ_0, k_1, k_2] are parameters, SD_obs is the snow depth, and DOY_obs is the day of year counted from 01 October, defined so that its value is 0 on 01 January (and negative during the autumn). A different regression model is run for each snow class. The snow classes are defined by Sturm et al. (2009). The parameters for each snow class are obtained by carrying out an optimization on a training data set associated with that snow class. The RMSE between the estimated and the measured snow density is used as the objective function in the optimization.

2.3.2 The Jonas model

Jonas et al. (2009) proposed a regression model that splits the data into months. This model also estimates density, and from this density, it calculates the SWE. The regression is defined as:

ρ_s = a · SD_obs + b + offset_reg,    (2)
where a and b are the parameters of the linear regression, SD_obs is the observed snow depth, and offset_reg is a region-specific parameter. The data set is split by month and into three elevation classes, denoted by x < 1400 m, 1400 m ≤ x < 2000 m, and x ≥ 2000 m, where x is the elevation of the site. After that, the parameters a and b are derived by fitting a linear regression to each portion of the data set. This process produces 36 independent linear regression models. After solving for the parameters of the linear regressions, we perform a simulation without offset_reg. The region-specific parameter offset_reg is thus the average of the model residual between the simulated and observed density of samples in a given region. In our study, the snow classes defined by Sturm et al. (2009) are used as these regions. Note that the separation into the snow classes for the offset_reg calculation is applied independently after the linear regressions are performed for each month and elevation class. The offset_reg parameter eliminates the regional bias.

2.4 Performance assessment metrics

An ensemble is said to be reliable if the relative frequency of the event, for a given simulation probability, is equal to the simulation probability.
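The two regression benchmarks above can be sketched as follows; the Sturm parameter values used in the assertions are illustrative placeholders, not the fitted per-class values:

```python
import numpy as np

def sturm_density(sd_obs, doy_obs, rho_max, rho_0, k1, k2):
    """Sturm et al. (2010): bulk density from snow depth and day of year."""
    return (rho_max - rho_0) * (1.0 - np.exp(-k1 * sd_obs - k2 * doy_obs)) + rho_0

def elevation_class(elev_m):
    """Jonas et al. (2009) elevation classes: <1400 m, 1400-2000 m, >=2000 m."""
    return 0 if elev_m < 1400 else (1 if elev_m < 2000 else 2)

def fit_jonas(density, sd, month, elev):
    """One linear fit (a, b) per (month, elevation class): 12 x 3 = 36 regressions."""
    cls = np.array([elevation_class(e) for e in elev])
    params = {}
    for m in range(1, 13):
        for c in range(3):
            mask = (month == m) & (cls == c)
            if mask.sum() >= 2:
                a, b = np.polyfit(sd[mask], density[mask], 1)
                params[(m, c)] = (a, b)
    return params

def regional_offset(density, sd, month, elev, params):
    """offset_reg: mean residual (observed - simulated) over one region's samples."""
    cls = np.array([elevation_class(e) for e in elev])
    sim = np.array([params[(m, c)][0] * s + params[(m, c)][1]
                    for m, c, s in zip(month, cls, sd)])
    return float(np.mean(density - sim))
```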

2.4.1 The ignorance score
Roulston and Smith (2002) introduced the ignorance score, which is defined as:

IGN = −log_2 f_M(o),

where f_M is the simulated pdf, and o ∈ R is the observation.
It assigns the minimum value of zero if the probabilistic forecast predicts the observation with a probability equal to 1, indicating a perfect simulation. Therefore, the ignorance score is a strictly proper scoring rule. To assess the model, the average of the ignorance scores over all n records is taken, defined as:

IGN = −(1/n) Σ_{j=1}^{n} log_2 f_{M_j}(o_j),

where f_{M_j} is the pdf derived from the ensemble of the j-th simulation, and o_j is the j-th observation.
In the empirical case, i.e. without distribution fitting, we denote the sorted predicted ensemble with m members by {s_1, s_2, ..., s_m}. The cumulative distribution function is simply the staircase function over the members. From this, we construct a pdf by assigning a probability to each area between two adjacent members. If the observation coincides with a member, the larger probability of the two adjacent areas is taken. Furthermore, f_M(x) is set to 0.001 if the observation lies outside of the ensemble.
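A sketch of the empirical ignorance score; assigning probability 1/(m − 1) to each interval between adjacent members is an assumption of this sketch, since the exact interval probability is not specified above:

```python
import numpy as np

def ignorance_score(ensemble, obs, floor=0.001):
    """Empirical ignorance score -log2 f_M(obs) from an m-member ensemble."""
    s = np.sort(np.asarray(ensemble, dtype=float))
    m = len(s)
    if obs < s[0] or obs > s[-1]:
        return -np.log2(floor)                   # observation outside the ensemble
    dens = (1.0 / (m - 1)) / np.diff(s)          # density on each interior interval
    i = np.searchsorted(s, obs, side="right") - 1
    if obs == s[i]:                              # observation coincides with a member:
        f = dens[max(i - 1, 0):i + 1].max()      # take the larger adjacent density
    else:
        f = dens[min(i, m - 2)]
    return -np.log2(f)
```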

2.4.2 The continuous ranked probability score
Given that the ignorance score evaluates the simulated probability function only at the point of observation, no information about the area surrounding the observation or the shape of the probability function is included. The continuous ranked probability score (CRPS) addresses this drawback by working directly on the cdf. The CRPS of one record is defined as:

CRPS = ∫_{−∞}^{∞} [F_M(x) − H(x − o)]² dx,

where F_M is the cdf derived from the ensemble, o ∈ R is the observation, and H is the Heaviside step function, i.e. the cdf derived from the observation. The integrand is the squared difference between the cdf derived from the ensemble and the cdf derived from the observation. The CRPS takes values in R_0^+, where zero is assigned if the predicted cdf is equal to the cdf derived from the observation. This is the perfect simulation. The CRPS is thus a strictly proper scoring rule. When calculating the empirical CRPS, the staircase function over the members within the ensemble is taken as the cdf.
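The empirical CRPS can be computed by integrating the squared difference between the two cdfs piecewise, since both are constant between breakpoints:

```python
import numpy as np

def crps_empirical(ensemble, obs):
    """CRPS with the staircase function over the ensemble members as the cdf."""
    s = np.sort(np.asarray(ensemble, dtype=float))
    m = len(s)
    xs = np.sort(np.append(s, obs))    # breakpoints of the piecewise-constant integrand
    total = 0.0
    for a, b in zip(xs[:-1], xs[1:]):
        mid = 0.5 * (a + b)
        F = np.searchsorted(s, mid, side="right") / m   # staircase cdf at the midpoint
        H = 1.0 if mid >= obs else 0.0                  # cdf of the observation
        total += (F - H) ** 2 * (b - a)
    return total
```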
Hersbach (2000) presents a decomposition of the empirical CRPS into a reliability part and the potential CRPS. We denote the ensemble of one record by {x_1, ..., x_m}. The reliability part is then defined as:

Reli = Σ_{i=1}^{m−1} ḡ_i (ō_i − p_i)², with p_i = i/m,

where ḡ_i is the average width of the bin [x_i, x_{i+1}] and ō_i is the relative frequency of the observation being smaller than the midpoint of the bin [x_i, x_{i+1}]. To get the reliability score of the model, the average over all records is taken.
Small values of the reliability portion testify to the high reliability of the ensemble simulation. The reliability portion is highly related to the rank histogram, introduced in the next section (Sect. 2.4.3). The difference is that the reliability part takes the spread of the ensemble into account, which is not incorporated into the rank histogram. When the reliability part is subtracted from the empirical CRPS, the potential CRPS remains. The potential CRPS is the same as the CRPS when the reliability part is equal to zero, which indicates a perfectly reliable model. The potential CRPS is related to the average spread of the ensemble. A narrow ensemble leads to a small potential CRPS.

https://doi.org/10.5194/hess-2020-566 Preprint. Discussion started: 18 November 2020. © Author(s) 2020. CC BY 4.0 License.

2.4.3 The rank histogram
The rank histogram was developed independently and almost simultaneously by Anderson (1996), Hamill and Colucci (1997), and Talagrand et al. (1997). It determines the rank of each observation within the associated predicted ensemble. The ranks of all observations are then presented in a histogram. A perfectly reliable ensemble forecast shows a flat rank histogram. Furthermore, the rank histogram can reveal information about bias and under- and overfitting. For instance, if the rank histogram shows a decreasing trend with higher ranks, the model has a positive bias. An increasing trend with higher ranks shows a negative bias. Furthermore, a u-shaped rank histogram indicates a too-small ensemble spread. This limited spread is related to an overconfident model, which means that the model ignores some parts of the uncertainty. Several reasons explain this, including the fit of the model. If a model is trained too long, it is trained too specifically on the training data. Consequently, the model performs well on the training data, but it lacks generality, meaning that the model is incapable of simulating an unseen data set.

In contrast, a bell shape indicates a too-large ensemble spread. Note that a flat histogram is a necessary condition for a model to be reliable, but this is not a sufficient condition. Examples and further discussion are given by Hamill (2001).
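The construction of a rank histogram reduces to counting, for each record, how many ensemble members fall below the observation; a minimal sketch:

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """Rank of each observation within its m-member ensemble (ranks 0..m).

    A flat histogram is a necessary (not sufficient) condition for a
    reliable ensemble; u-shapes indicate too little spread, bell shapes
    too much, and monotone trends indicate bias.
    """
    ensembles = np.asarray(ensembles, dtype=float)
    observations = np.asarray(observations, dtype=float)
    m = ensembles.shape[1]
    ranks = (ensembles < observations[:, None]).sum(axis=1)  # members below each obs
    return np.bincount(ranks, minlength=m + 1)

# Example: 3 records, each with a 4-member ensemble
ens = np.array([[1.0, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
obs = np.array([0.5, 2.5, 5.0])
hist = rank_histogram(ens, obs)
```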

2.4.4 The reliability diagram
The reliability diagram is a visual tool for determining the reliability of an ensemble forecast. The relative frequency of the observation, given the probability of the ensemble, is plotted against the probability of the ensemble.

To construct a reliability diagram, we work with the empirical staircase function over the ensemble members as cdf.
Let us assume that we have k bins, denoted by b_1, ..., b_k. These bins represent a probability and, therefore, their sizes are bs_1, ..., bs_k ∈ (0, 1]. Furthermore, all bins are centred on 0.5, therefore indicating an interval around the median. Thus, a bin b_i is indicated by the interval [s_i, e_i]. We find the s_i- and e_i-quantiles within the ensemble distribution, denoted by s_en,i and e_en,i, respectively. Interpolation is used if s_i and e_i lie between the steps. Therefore, the interval [s_en,i, e_en,i] shows the interval around the median of the ensemble distribution, which has a probability equal to bs_i. Then, we determine the relative frequency of the observations found within this interval, denoted by o_i. The points (bs_i, o_i) are plotted against the line of the identity function. If all points fall on that line, the model is perfectly reliable, meaning that the relative frequency of the observation is equal to the simulated frequency.

2.4.5 Skill scores
Skill scores enable comparing simulated model outputs of different magnitudes. The skill score SS is defined as:

SS = (score_s − score_ref) / (score_perf − score_ref),    (7)

where score_s is the score of the actual simulation, score_ref is the score of a reference simulation, and score_perf is the score of a perfect simulation. Climatology or persistence is often used as a reference simulation. A skill score takes values in (−∞, 1]. A value of 1 indicates that the actual simulation is a perfect simulation. A negative skill score indicates that the actual simulation is less accurate than the reference simulation.

Because of the lack of continuous SWE measurements, a variation of climatology is taken as a reference simulation. Records from the training and validation data sets at the same location and within a time window of ±15 days around the date of the corresponding observation are used to build up a reference ensemble having 20 members. For 77 % of the testing records, an ensemble from climatology could be obtained in this manner. In the calculation of the skill score in Eq. (7), the mean over the individual scores is taken within the fraction. This enables us to use only the portion of records where we found the 20 members to calculate the reference scores.
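Equation (7) is a one-liner in code; the default score_perf = 0 holds for error scores such as RMSE, MAE, and the CRPS:

```python
def skill_score(score_sim, score_ref, score_perf=0.0):
    """SS = (score_s - score_ref) / (score_perf - score_ref).

    SS = 1 is a perfect simulation; SS < 0 means the simulation is
    less accurate than the reference.
    """
    return (score_sim - score_ref) / (score_perf - score_ref)
```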

2.4.6 Sensitivity score
Following Olden and Jackson (2002), we set input-output records as references where the output is the 20th, 40th, 60th, and 80th percentiles of the target variable SWE in the validation data set. Because single records can deviate from normality, we take the average of 1000 records around the percentile to maintain the generality of the reference record. Subsequently, one input variable is perturbed at a time by taking uniformly distributed values between the maximum and minimum values in the validation set of the considered input variable. The perturbed inputs are inserted into the network, and the perturbed output is generated. The sensitivity score is then given by the RMSE between the perturbed and reference output, averaged over the reference records.

3 Experimental protocol

3.1 Data

Environment and Climate Change Canada (ECCC) and the Ministère de l'Environnement et de la Lutte contre les changements climatiques (MELCC) provided a Canada-wide snow data set, which includes SWE, snow depth, and snow density. For the remainder of this paper, this data set will be denoted as the Canadian historical snow survey (CHSS).
For this study, we use the above-mentioned CHSS snow data, collected from 01 January 1980 to 16 March 2017. It consists of 234 779 measurements from 2878 stations. Figure 3 presents some characteristics of the CHSS data set. The distributions of SWE and SD resemble right-skewed gamma distributions. Snow density is almost normally distributed in a range from 50 kg m⁻³ to 600 kg m⁻³, with a few outliers at higher values. This distribution is related to the retrieval of two different data sets. The data from ECCC are bounded by the interval [50, 600] for snow density, reflecting snow having a density of up to 600 kg m⁻³. Above this threshold, snow begins to transform into ice. The higher densities encountered in the MELCC data are due to the presence of an ice layer in the snow pack or to measurement artifacts. Figure 4 shows the location of sites within the CHSS data set.

Temperatures are adjusted for the elevation difference between each station and the corresponding grid point of the ERA5 reanalysis:

t_adj = t_orig + λ (elev_station − elev_grid),

where t_orig is the original temperature of ERA5, and elev_station and elev_grid are the elevations of the station and the grid point of the ERA5 reanalysis, respectively. We apply a constant lapse rate of λ = −6 °C km⁻¹, a value consistent with studies in the Rocky Mountains (e.g. Dodson and Marks, 1997; Bernier et al., 2011) and the global mean environmental lapse rate of −6.5 °C km⁻¹ (Barry and Chorley, 1987).

3.2 Explanatory variables
From the total precipitation, temperature, and snow density data, we calculate the following explanatory variables. This initial pool of variables is based on Odry et al. (2020).

- the number of days since the beginning of winter,
- the number of days without snow since the beginning of winter,
- the number of freeze-thaw cycles; the thresholds for freezing and thawing are set at −1 °C of the maximum and at 1 °C of the minimum temperature, respectively,
- the degree-day index, i.e. the accumulation of positive daily temperatures since the beginning of the winter,
- the snow pack aging index, i.e. the mean number of days since the last snowfall weighted by the total solid precipitation on the day of the snowfall,
- the number of layers in the snowpack, estimated from the timeline and intensity of solid precipitation; a new layer is considered to be created if there is a three-day gap since the last snowfall,
- the accumulated solid precipitation since the beginning of the winter,
- the accumulated solid precipitation during the last n days,
- the accumulated total precipitation during the last n days,
- the mean average temperature during the last n days (the average temperature is taken as the mean of the maximum and minimum temperatures).
We set the beginning of winter as 01 September because the seasonal distribution of the CHSS data set has the first snow records starting from mid-September, with the exception of some outliers. The separation of precipitation into solid and liquid parts is done by the logistic regression:

p(t_av) = 1 / (1 + exp(α + β t_av)),

where p(t_av) is the probability of snow, dependent on the average temperature t_av, and α and β are fitted coefficients. Jennings et al. (2018) showed, by using precipitation data from the Northern Hemisphere, that this logistic regression model outperforms any fixed temperature threshold.
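A sketch of this phase partition; the coefficients α and β below are illustrative placeholders, not the values fitted by Jennings et al. (2018):

```python
import numpy as np

def snow_probability(t_av, alpha=-0.5, beta=1.0):
    """p(t_av): probability that precipitation falls as snow (logistic model)."""
    return 1.0 / (1.0 + np.exp(alpha + beta * np.asarray(t_av, dtype=float)))

def split_precipitation(p_total, t_av):
    """Split total precipitation into (solid, liquid) parts."""
    p = snow_probability(t_av)
    return p * p_total, (1.0 - p) * p_total
```

With β > 0, the snow probability tends to 1 for cold days and to 0 for warm days, and the two parts always sum to the total precipitation.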

3.3 Input uncertainty
We target the uncertainty of the measurements of the input variable snow depth. According to the WMO (2014)

3.4 Tested characteristics
After setting up a reference MLP with the characteristics shown in Table 2, one single characteristic is tested at a time. In this reference set-up, all 12 possible explanatory variables from Sect. 3.2 are taken. The number of neurons in the hidden layer is derived from the rules of thumb proposed by Heaton (2008) and Hecht-Nielsen (1989). The number of members is set at 20 because this number showed consistent results during several trials.

Table 2 (excerpt). Characteristics of the reference MLP:
Batch size: 100
Shuffling data before each epoch: no
Number of members in the ensemble: 20
Input uncertainty of snow depth: no

Table 3. A summary of the tests performed to obtain the final MLP architecture. The parameter initialization 1 is given by Goodfellow et al. (2016), where m indicates the number of inputs; the parameter initialization 2 is given by Glorot and Bengio (2010), where m and n indicate the number of inputs and outputs, respectively; the parameter initialization 3 is our suggestion.

3.5 Input variable selection
To determine the importance of the input variables, we carry out a sensitivity analysis. The sensitivity score introduced in Sect. 2.4.6 is used to determine the order of importance of the input variables. Subsequently, a stepwise reduction of the input variables is performed.

When adding neurons to the hidden layer, the complexity and the number of parameters (the weights and biases) of the model are increased. This complicates the optimization and therefore entails a longer training time to obtain the desired parameters.

4 Results
The results are divided into three sections. In Sect. 4.1, we discuss the determination of the MLP architecture, following mainly the tests summarized in Table 3.

4.1 Determination of the MLP architecture

4.1.1 Comparison of SWE and snow density as the target variable
A comparison of the two target variables is performed by using the reference MLP architecture presented in Table 2; the corresponding scores are shown in Fig. 5a. Furthermore, in Fig. 5e the reliability part of the CRPS increases later when SWE is the target variable than when the target variable is density (Fig. 5b). This behaviour is also consistent with the ignorance score, which has its minimum at epoch 20 in Fig. 5f, whereas in Fig. 5c, the ignorance score for snow density increases from the beginning. This is consistent with the rank histograms in Fig. 6. The higher variability of SWE has a positive effect on the rank histogram, meaning that the spread of the ensemble remains sufficiently large up to epoch ten.
From these results, we can derive that the lower variability of snow density results in too little spread of the ensemble after only one epoch. This observation leads us to proceed with SWE as the target variable. Nonetheless, even with SWE, the spread of the ensemble becomes too narrow too quickly. This issue is addressed by incorporating input uncertainties, as we show in Sect. 4.1.2.

4.1.2 Input uncertainty
We apply input uncertainty because we cannot train for a sufficiently long period to obtain the best error scores measured by RMSE, MAE, and CRPS. This inability is due to a loss of reliability and overfitting. This type of overfitting is related to the ensemble and does not describe the overfitting of a single MLP. To better understand this, we examine the RMSE in Fig. 5, which is related to the MSE, the objective function in the optimization of the MLP. The RMSE continues to decrease beyond 15 epochs (the optimal trade-off between accuracy and reliability). Therefore, the single MLP remains in an underfit state. We use the reference set-up presented in Table 2. Adding regularization would shift the MLP to a more underfit state by simplifying the network. Table 4 shows that incorporating input uncertainty decreases the ignorance score, whereas the other measures are unaffected. Therefore, input uncertainty widens the spread of the ensemble. Further improvements can likely be achieved by incorporating the input uncertainty of the meteorological data through use of the ERA5 ensemble.

Table 4 presents all results for the tested characteristics listed in Table 3. Changing the activation function from tanh in the reference set-up to ReLU or Leaky ReLU does not produce any improvement. Using the ReLU function as the activation function in the network shows a similar but delayed behaviour in the performance scores. In terms of accuracy, we obtain results similar to the tanh case through longer learning. Nonetheless, the increase of the reliability part is not delayed as much as the decrease in accuracy. This entails a worse trade-off between accuracy and reliability. When the activation function is set to the Leaky ReLU function, we observe almost no change relative to the ReLU case, and analogous reasoning applies.
For the RMSProp algorithm, we test global learning rates of 0.1, 0.01, 0.001, and 0.0001. The best result is produced with a learning rate of 0.0001. A learning rate of 0.001 produces an earlier decrease in the accuracy scores and an earlier increase in the reliability part. The trials with learning rates of 0.1 and 0.01 converge at higher values for RMSE and MAE: higher learning rates can over-jump desirable minima and end up in worse ones. In summary, the global learning rate must be small enough to achieve the best results. Even lower learning rates entail the same behaviour at a later stage; therefore, the same trade-off between reliability and accuracy can be obtained if training time is increased. When we compare RMSProp to AdaDelta, we note no improvement. Consequently, the AdaDelta method is preferred because it requires no learning rates to be adjusted.

Table 4. Comparison of all tested characteristics worth considering. Parameter initializations 2 and 3 are as in Table 3. The number of epochs is selected so that the trade-off between reliability and accuracy of the ensemble is increased. The underlined numbers in bold indicate an improvement throughout the testing. "Combo" combines parameter initialization 3, shuffled data, and input uncertainty, and is used as the final set-up.

A test of shuffling the data produces an almost identical result to the reference set-up; however, it eliminates the bias, which is almost zero beyond 15 epochs.
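The learning-rate behaviour described above follows from the RMSProp update rule, s ← ρs + (1 − ρ)g², w ← w − η g/√(s + ε). A minimal sketch on a toy one-parameter objective (invented for illustration, not the paper's network) shows why RMSProp's gradient normalisation makes each step roughly of magnitude η: a very small learning rate barely moves within a fixed training budget (the "later behaviour" noted above), while a moderate one reaches the minimum.

```python
import numpy as np

def rmsprop_minimize(lr, steps=500, rho=0.9, eps=1e-8):
    """Minimise the toy objective f(w) = (w - 3)^2 with RMSProp:
    s <- rho*s + (1 - rho)*g^2 ;  w <- w - lr*g/sqrt(s + eps)."""
    w, s = 0.0, 0.0
    for _ in range(steps):
        g = 2.0 * (w - 3.0)               # gradient of f at w
        s = rho * s + (1.0 - rho) * g ** 2
        w -= lr * g / np.sqrt(s + eps)
    return w

# RMSProp normalises the gradient, so each step has magnitude ~lr.
print(rmsprop_minimize(0.0001))   # still far from the minimum at 3
print(rmsprop_minimize(0.01))     # close to 3 within the same budget
```

With the same number of steps, the trajectory for the smaller learning rate is simply a delayed version of the larger one, which is why the same trade-off can be recovered by training longer.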
In regard to the parameter initialization, equation 1 in Table 3 corresponds, in our application, to a uniform distribution U(−0.29, 0.29). This narrow interval results in a too-narrow ensemble, which entails a high value for the reliability part and for the ignorance score. The second trial, using parameter initialization 2 (in our case U(−0.68, 0.68)), shows very similar behaviour to the reference set-up but with a lower number of epochs. This matches the interval of the random initialization being narrower than that of the reference set-up: the spread of the ensemble narrows more quickly, resulting in an early increase of the reliability part and an early minimum of the ignorance score at ten epochs. The third trial, using parameter initialization 3, shows delayed behaviour. This initialization yields a better score for the reliability part of the CRPS but no improvement in the accuracy scores. We conclude that, to maintain a reliable ensemble, the interval of the uniform distribution must be sufficiently large. Nonetheless, increasing the interval further does not offer much improvement in the trade-off between reliability and accuracy.
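The link between the width of the initialization interval and the initial ensemble spread can be sketched as follows (the tiny 1-50-1 tanh network, seed, and member count are arbitrary illustrations, not the paper's architecture): drawing each member's weights from a wider uniform distribution produces a visibly wider spread of untrained-network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_output_spread(half_width, n_members=20, n_hidden=50):
    """Initialise n_members tiny 1-50-1 tanh networks with weights drawn
    from U(-half_width, half_width) and measure the spread of their
    outputs for a fixed input, before any training."""
    x = 1.0
    outputs = []
    for _ in range(n_members):
        w1 = rng.uniform(-half_width, half_width, n_hidden)  # input->hidden
        w2 = rng.uniform(-half_width, half_width, n_hidden)  # hidden->output
        outputs.append(np.dot(w2, np.tanh(w1 * x)))
    return np.std(outputs)

# A wider initialisation interval gives a wider initial ensemble spread.
print(ensemble_output_spread(0.29) < ensemble_output_spread(0.68))  # True
```

The spread scales roughly with the square of the half-width here, which is consistent with the narrow U(−0.29, 0.29) interval producing a too-narrow ensemble.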

Input variable selection
The results of the sensitivity scores for each variable are presented in Table 5. Recall that this sensitivity score is the average RMSE over the four reference percentiles and all ensemble members (see Sect. 2.4.6); its value is therefore proportional to the importance of a variable for the conversion model. As shown in Table 5, snow depth is, unsurprisingly, the most important input variable.

Table 5. Sensitivity scores of the input variables:
Average temperature of the last six days: 63.9
Number of layers in snow pack: 102.4
Total precipitation in the last ten days: 102.5
Average age of the snow cover: 116.9
Accum. positive degrees since the beginning of winter: 124.9
Accum. solid precipitation since the beginning of winter: 160.2
Number of days since the beginning of winter: 184.6
Total solid precipitation in the last ten days: 191.0
Days without snow since the beginning of winter: 193.2
Snow depth: 972.0

Figure 7 illustrates how gradually reducing the number of input variables affects the SWE estimation error and the corresponding CRPS and ignorance score. This reduction follows Table 5; the least influential input variable (snow density from ERA5) is removed first, and so on. Figure 7a and b show that the RMSE, MAE, and CRPS do not increase significantly as long as the six most influential variables are retained; a larger increase (worsening) is only observed below this number. This pattern reflects the mutual dependence of the input variables: all of them, except snow density and snow depth, are derived from the same meteorological data, so adding more variables does not provide more information to the model. This result is consistent with the study of Odry et al. (2020). The reliability part of the CRPS (Fig. 7b) begins to increase below ten input variables. However, the ignorance score in Fig. 7c increases as soon as even one variable is eliminated. Thus, more input variables help widen the spread of the ensemble.
Nevertheless, the ignorance score tends to flatten as the number of input variables increases. Because we want to preserve the spread of the ensemble and obtain a good portrait of the uncertainty of the SWE estimation, we use all 12 variables as inputs in the final set-up. In Fig. 8f, we observe no specific trend. Note that the thresholds occur earlier for the ignorance score than for the reliability part of the CRPS. Consequently, one can assume that the reliability will also increase for higher numbers of hidden neurons for three, four, and five training epochs.
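A permutation-style stand-in for such a sensitivity score can be sketched as follows (the paper's actual score averages the RMSE over four reference percentiles and all ensemble members; the toy model, data, and names here are invented): shuffling one input column at a time and measuring the resulting RMSE ranks the variables by how much the model depends on them.

```python
import numpy as np

rng = np.random.default_rng(1)

def sensitivity_score(model, X, y, var_idx, n_repeats=5):
    """Permutation-style sensitivity: average RMSE after shuffling one
    input column. A higher RMSE means the variable matters more.
    (Simplified stand-in for the paper's percentile-based score.)"""
    scores = []
    for _ in range(n_repeats):
        Xp = X.copy()
        rng.shuffle(Xp[:, var_idx])          # destroy this variable's signal
        pred = model(Xp)
        scores.append(np.sqrt(np.mean((pred - y) ** 2)))
    return float(np.mean(scores))

# Toy model in which column 0 (think: snow depth) dominates the output.
model = lambda X: 3.0 * X[:, 0] + 0.1 * X[:, 1]
X = rng.normal(size=(200, 2))
y = model(X)
print(sensitivity_score(model, X, y, 0) > sensitivity_score(model, X, y, 1))  # True
```

As in Table 5, the dominant variable receives a sensitivity score far above the others.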

Figure 7. Stepwise reduction of input variables, ordered according to their influence as determined in Table 5.

Figure 8. Results of ensembles using the set-up defined in Sect. 4.1 for different numbers of training epochs and neurons in the hidden layer.

The earlier increase in the ignorance score can be explained by the rank histograms for 1-3 training epochs in Fig. 9. A bell shape in the rank histogram indicates that the spread of the ensemble is too wide, resulting in overall low values of the empirical pdf. Because the ignorance score is computed using the logarithm of the probability density corresponding to the observation, overdispersed ensembles are strongly penalized by this score.
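The rank histogram diagnostic can be computed in a few lines (the Gaussian ensemble and observations below are synthetic, invented purely for illustration): the rank of each observation among the sorted ensemble members is tallied, and an overdispersed ensemble concentrates the counts in the middle ranks, producing the bell shape discussed above.

```python
import numpy as np

rng = np.random.default_rng(7)

def rank_histogram(ensemble, obs):
    """Count the rank of each observation within its ensemble members.
    A flat histogram indicates a reliable ensemble; a hump (bell) in the
    middle indicates overdispersion, a U shape underdispersion."""
    ranks = (ensemble < obs[:, None]).sum(axis=1)       # rank in 0..n_members
    n_members = ensemble.shape[1]
    return np.bincount(ranks, minlength=n_members + 1)

obs = rng.normal(0.0, 1.0, size=5000)
# Overdispersed ensemble: member spread (3.0) far exceeds obs spread (1.0).
ens = rng.normal(0.0, 3.0, size=(5000, 20))
hist = rank_histogram(ens, obs)
print(hist[10] > hist[0])   # True: middle ranks dominate (bell shape)
```

Because the observations rarely fall near the extremes of such a wide ensemble, the probability density assigned to them is low everywhere, which is exactly what the logarithmic ignorance score penalizes.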

https://doi.org/10.5194/hess-2020-566. Preprint. Discussion started: 18 November 2020. © Author(s) 2020. CC BY 4.0 License.

The formula of the reliability part of the CRPS in Eq. (6) shows that the difference between the observed frequency, given the predicted distribution, and the predicted probability is squared. Therefore, if this difference is small, it dominates the reliability part and results in a small value even with a large ensemble spread. Thus, the reliability score penalizes the outliers in the trials having more than four or five epochs.
This analysis finalizes the first MLP model, for Canada as a whole. Our final choice of MLP architecture is the set-up derived in Sect. 4.1, with the number of epochs set to five and the number of neurons in the hidden layer set to 120. This provides the best trade-off between accuracy and reliability, because 120 hidden neurons is the point where accuracy stagnates and where the ignorance score increases for five epochs. Except for the case of one training epoch, the reliability part of the CRPS is poorest for five epochs, but five epochs also lead to an acceptable rank histogram in Fig. 9e. The ensembles trained for two and three epochs show a bell shape, indicating overdispersion, which is an unacceptable condition.

Determining the number of epochs and the number of hidden neurons for a specific MLP model for each snow class
We repeat the same procedure as in Sect. 4.1.5, this time run separately for each snow class. Table 6 shows the optimal combination of the number of hidden neurons and the number of epochs for each snow class. The behaviour is similar to the "single model" case covering Canada. However, as the amount of available training data decreases, the number of required training epochs increases. In the present case, the amount of available data varies between the different snow classes, some having only short records. In the end, with the model structure determined above (Sect. 4.1.5), we obtain an ensemble of 20 MLPs for each snow class. This represents the second group of MLP models (one per snow class) that we compare with the single MLP (covering Canada as a whole) and the regression models.
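The per-class set-up can be sketched as follows (the class names listed and the hyper-parameter values are placeholders for illustration, not the tuned values of Table 6): each snow class gets its own configuration and its own 20-member ensemble trained on that class's subset of the data.

```python
# Hypothetical per-class hyper-parameters (placeholder values).
per_class_config = {
    "tundra":    {"hidden": 80,  "epochs": 8},
    "taiga":     {"hidden": 100, "epochs": 6},
    "maritime":  {"hidden": 120, "epochs": 5},
    "ephemeral": {"hidden": 60,  "epochs": 12},
}

def train_class_models(data_by_class, build_mlp, n_members=20):
    """Train an ensemble of n_members MLPs per snow class, each class
    using its own hyper-parameters and its own subset of the data."""
    models = {}
    for snow_class, cfg in per_class_config.items():
        X, y = data_by_class[snow_class]
        models[snow_class] = [
            build_mlp(cfg["hidden"], cfg["epochs"], X, y)
            for _ in range(n_members)
        ]
    return models

# Demo with a stand-in "trainer" that just records its configuration.
dummy_data = {c: (None, None) for c in per_class_config}
models = train_class_models(dummy_data, lambda h, e, X, y: (h, e))
print(len(models["maritime"]))  # 20 members in the maritime ensemble
```

Keeping the training loop generic over a `build_mlp` callable makes it easy to swap the single-model configuration in for comparison.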

MLP model performance on a testing set
In this section, we use the testing data set to evaluate the performance of the single MLP ensemble model covering Canada as a whole and of the separate models for each snow class (multiple MLP ensembles). Table 7 reveals that the multiple MLP ensemble model has a better overall performance. The reliability diagram and the rank histogram also indicate a more reliable ensemble. Figure 11 shows the distribution of the residuals between the simulated ensemble median and the observations. The distribution of the residuals for the multiple MLP model is narrower, indicating less error overall. Both distributions are approximately symmetrical around zero, which corroborates the MBE being almost zero (Table 7). For the single MLP model, 50 % of the errors are smaller than ±20 mm and 90 % are smaller than ±80 mm; for the multiple MLP model, 50 % of the errors are smaller than ±18 mm and 90 % are smaller than ±65 mm.
Next, we apply skill scores to ensure a valid comparison between SWE estimates of differing magnitudes. The climatology of SWE from the CHSS data set is used as the reference simulation (see Sect. 2.4.5).
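The skill scores used here all share the same generic form, SS = 1 − S_model / S_reference, with climatology as the reference: positive values mean the model beats climatology, and 1 is a perfect score. A small sketch with invented numbers:

```python
import numpy as np

def skill_score(score_model, score_reference):
    """Generic skill score: 1 - S_model / S_ref. Positive values mean the
    model beats the reference (here, SWE climatology); 1 is perfect."""
    return 1.0 - score_model / score_reference

obs   = np.array([120.0, 250.0, 90.0, 310.0])   # SWE observations (mm), invented
model = np.array([110.0, 240.0, 100.0, 300.0])  # model estimates (mm), invented
clim  = np.full_like(obs, obs.mean())           # climatology reference

rmse = lambda sim: np.sqrt(np.mean((sim - obs) ** 2))
print(round(skill_score(rmse(model), rmse(clim)), 2))  # 0.89
```

The same formula applies unchanged to MAE, CRPS, and the reliability and ignorance scores reported in Fig. 12.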
The separation into different snow classes reveals that, as expected, the multiple MLP model eliminates bias in all snow classes, as shown in Fig. 12a. Furthermore, for all snow classes, the two MLP models perform better than climatology, as indicated by positive MAE, RMSE, and CRPS skill scores in Fig. 12a and b.
The model performs better within the ephemeral snow class in terms of accuracy, although the results need to be taken with some reservations; the low amount of data and the high variability of SWE in this snow class induce a rather poor reference simulation (climatology).
The reliability part of the CRPS in Fig. 12b shows a large improvement of the multiple MLP model for the ephemeral snow class. In contrast, the ignorance score in Fig. 12c shows a slightly poorer skill score for this same snow class, which would require a deeper analysis to explain. The tundra snow and taiga snow classes have the lowest skill scores in terms of both accuracy and reliability. In particular, the reliability part of the CRPS in Fig. 12b shows a large decrease of the skill score. As the variability of SWE is low in these areas, as presented in Fig. 2, the reference simulation based on climatology produces a reliable ensemble and thus causes a poor reliability skill score for the actual simulation.

The maritime snow class shows slightly better accuracy than the mountain snow class. This difference may relate to the complexity of snow accumulation patterns in mountainous regions, owing to the high spatial and temporal variability of all physical processes and variables in these areas. In addition, the temperature correction explained in Sect. 3.2 induces larger errors where elevation is highly variable. The improved reliability skill score for the multiple MLP model is explained by fewer outliers for the mountain and maritime snow classes. Overall, both classes show similar rank histograms for the single and multiple models. The slight decrease in the ignorance skill score for the maritime snow class can be explained by the different average spreads of the simulated and reference ensembles.
In the next step, the testing data set is divided into elevation classes from 0 m to 2100 m, with a step size of 300 m and one class for sites above 2100 m. The results are not shown, as the main conclusions are identical to those obtained from the separation into snow classes. Both types of MLP ensembles (single and multiple) outperform climatology.

Second, an analysis over the course of the year shows improved accuracy of the MLP models compared with climatology for all months except July, August, and September. During these three months, there is generally very little snow across the country and, consequently, very little data. Additionally, the beginning of winter, subjectively set to 1 September, causes a reset to zero for several explanatory variables, while the temperature, snow depth, and SWE data remain within their usual ranges; this greatly complicates the proper training of the MLPs. Overall, except for these three problematic months, the accuracy and reliability remain relatively constant throughout the year, with slightly improved performance in spring and early summer compared with climatology.

Comparison of MLP models with regression models
The regression models are trained and validated using the same perturbed snow depth data sets that are used to optimize the MLP models in Sect. 3.3. The testing data set is then used for comparison purposes in this section.

An ensemble simulation could be produced by applying the deterministic regression models to the perturbed snow depth data, yielding multiple members. However, given that the spread of the simulated ensemble of the MLP models is explained mainly by the various parameter initializations, this approach entails an ensemble that is too narrow when regression models are used. This would lead to a comparison of models that has little meaning: the optimization of the Sturm model could be initialized with different parameter sets, whereas the Jonas model comes with a fixed set of parameters. Therefore, we compare the models using exclusively deterministic evaluation techniques.
In addition to the regression models, we also consider a simple benchmark model, which takes the average observed snow density over the training and validation data sets as a constant snow density and then calculates SWE for the testing data set. The results are presented in Table 8. Both MLP ensemble models outperform the simple benchmark as well as the Sturm and Jonas models. Figure 14a shows that the benchmark model often underestimates SWE, which also explains its negative bias in Table 8.
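The benchmark model amounts to a one-line conversion (the mean density value below is invented for illustration; the actual value comes from the training and validation data): since 1 kg m⁻² of water corresponds to 1 mm of SWE, SWE in mm is depth in metres times bulk density in kg m⁻³.

```python
import numpy as np

def benchmark_swe(depth_cm, mean_density_kg_m3):
    """Constant-density benchmark: SWE (mm) = depth (m) * density (kg/m^3),
    since 1 kg/m^2 of water equals 1 mm of SWE."""
    return depth_cm / 100.0 * mean_density_kg_m3

mean_density = 280.0                       # kg m^-3, placeholder value
depths = np.array([50.0, 150.0, 300.0])    # snow depths (cm)
print(benchmark_swe(depths, mean_density)) # [140. 420. 840.]
```

Its simplicity is the point: any model that cannot beat a constant-density conversion adds no information beyond snow depth itself.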

The Jonas and Sturm models explain approximately 7 % more SWE variability than the benchmark. Figure 14b and c reveal that outliers with high values occur for lower- and medium-range SWE observations. Numerous SWE observations above 2500 mm are consistently overestimated but follow the identity line. Compared with the two MLP models, the Sturm and Jonas models perform better at these higher values. Compared with the benchmark model, the single and multiple MLP models explain approximately 15 % and 16 % more SWE variability, respectively. Figure 14d and e show that, up to 2500 mm, the estimates of both MLP models mainly follow the identity line, with some outliers at approximately 500 mm. Above 2500 mm, the model appears to have an upper boundary that results in underestimation. This upper boundary is less restrictive for the multiple MLP model, which performs better for the high SWE records. The resulting R² values for the benchmark and Sturm models are similar to those in the study of Odry et al. (2020), which considered only the province of Quebec. The MLP model in that study achieved an R² value of 0.919; we improve on this by 5 % and 6 % for the single and multiple MLP models, respectively.

Conclusions
In this study, we test two hypotheses. The first hypothesis holds that using SWE rather than density as the target variable for the ANN will produce more accurate estimates of SWE. The second hypothesis states that in-depth testing of different ANN structures will further improve these SWE estimates. As part of this second hypothesis, we investigate whether the ANN model must be trained specifically for different regions, as determined by snow climate classes (Sturm et al., 2009), or whether the model could be trained only once, for Canada as a whole. Furthermore, we use existing regression models, developed for the same purpose, as benchmarks to obtain a better perspective on how our model performs.
Our snow-depth-to-SWE model uses snow depth, estimated snow density, and other explanatory variables derived from meteorological data as inputs. The available snow data include snow depth, SWE, and snow density measurements from across Canada, collected over almost 40 years. We then use an ensemble of multiple MLPs to address the issue of random parameter initialization during optimization. The approach also provides a probabilistic estimate, giving greater insight into model performance. A trade-off between reliability and accuracy is used as a means of evaluation, which gives a more comprehensive analysis of SWE-estimation models.
Many previous models (e.g. Odry et al., 2020; Jonas et al., 2009; Sturm et al., 2010; Painter et al., 2016; Broxton et al., 2019) determine SWE by estimating snow density and then calculating SWE from snow density and snow depth. This study investigates a direct estimation of SWE. This approach shows a slight increase in accuracy and a large gain in reliability compared with indirect estimates. Consequently, the remainder of the study uses SWE as the target variable rather than snow density.
In our investigation of model structures, we built two models. One model uses a single MLP ensemble for all of Canada; the other trains a separate MLP ensemble for each snow class.

A sensitivity analysis reveals that a greater number of input variables increases the reliability of the ensemble. Therefore, adding more variables could further improve the model's reliability. Incorporating variables of a type other than meteorological ones may also help increase accuracy. Finally, this study shows optimal performance for networks having large numbers of neurons in the hidden layer, far above the commonly used rules of thumb. This motivates looking into network structures with multiple layers. Montúfar (2014) showed that the number of neurons a network with one hidden layer needs in order to approximate a function computed by a deep network with multiple layers increases exponentially.
Therefore, deep networks are possibly more efficient. Goodfellow et al. (2016) provide multiple examples that support this claim empirically. As a result, numerous means of improving SWE estimates from snow depth using machine learning techniques may remain, and the methods proposed here could be refined. It could also be useful to investigate the application of other types of machine-learning algorithms, such as random forests.

Code and data availability. The code, including some testing data, is available from Ntokas (2020). The whole data set, excluding data from partners that do not permit publication (affecting 2 % of the data), is available on Harvard Dataverse; the link can be found in the ReadMe file of Ntokas (2020). Note that users must fill in a form regarding the terms and conditions of use.
which K.N. then translated to Python and modified. The original idea for this work was from Marie-Amélie Boucher, who also guided the work throughout and edited the manuscript significantly. Camille Garnaud was involved in the guidance for this project as a representative of ECCC and provided a final proofread of the manuscript before its submission.
Competing interests. The authors declare no competing interests.