Evaluation of random forests for short-term daily streamflow forecasting in rainfall and snowmelt-driven watersheds

Abstract

In the past decades, data-driven machine learning (ML) models have emerged as promising tools for short-term streamflow forecasting. Among other qualities, the popularity of ML models for such applications is due to their relative ease of implementation, less strict distributional assumptions, and competitive computational and predictive performance. Despite the encouraging results, most applications of ML for streamflow forecasting have been limited to watersheds where rainfall is the major source of runoff. In this study, we evaluate the potential of random forests (RF), a popular ML method, to make streamflow forecasts at 1-day lead time at 86 watersheds in the Pacific Northwest. These watersheds cover diverse climatic conditions and physiographic settings and exhibit varied contributions of rainfall and snowmelt to their streamflow. Watersheds are classified into three hydrologic regimes: rainfall-dominated, transient, and snowmelt-dominated, based on the timing of center-of-annual flow volume. RF performance is benchmarked against naïve and multiple linear regression (MLR) models and evaluated using four criteria: coefficient of determination, root mean squared error, mean absolute error, and Kling-Gupta efficiency (KGE). Model evaluation scores suggest that RF performs better in snowmelt-driven watersheds compared to rainfall-driven watersheds. The largest improvement in forecasts, compared to benchmark models, is found among rainfall-driven watersheds. RF performance deteriorates with increasing catchment slope and soil sandiness. We note disagreement between two popular measures of RF variable importance and recommend jointly considering these measures with the physical processes under study. These and other results presented provide new insights for effective application of RF-based streamflow forecasting.


Introduction
Nearly all aspects of water resource management, risk assessment, and early-warning systems for floods rely on accurate streamflow forecasts. Yet streamflow forecasting remains a challenging task due to the dynamic nature of runoff in response to spatial and temporal variability in rainfall and catchment characteristics. Therefore, development of skillful and robust streamflow models is an active area of study in hydrology and related engineering disciplines.
While physical models remain a common and powerful tool for predicting streamflow, ML models are gaining popularity due to some of their unique qualities and potential advantages. Compared with the often labor-intensive and computationally expensive task of parameterizing a physical model (Tolson and Shoemaker, 2007; Boyle et al., 2000), ML models are data-driven. […] flow and different streamflow contributions of rainfall and snowmelt. Drainage basin factors such as topography, vegetation, and soil can affect the response time and mechanisms of runoff (Dingman, 2015). Few studies have attempted to account for or report these effects on model performance. Without such consideration, it is difficult to determine whether a data-driven model can be generalized to watersheds not included in the given study. Therefore, our objectives are (1) to examine and compare the performance of RF in a number of watersheds across hydrologic regimes and (2) to explore the role of catchment characteristics in model performance that has been overlooked in previous studies.
In practice, RF can be trained to forecast streamflow at various timescales, depending on the input variables provided. Rasouli et al. (2012) forecasted streamflow at 1-7 day lead times using three ML models and data from combinations of climate indices and local meteo-hydrologic observations. The authors concluded that models with local observations as predictors were generally best at shorter lead times, while models with local observations plus climate indices were best at longer lead times of 5-7 days. Also, the skillfulness of all three models decreased with increasing lead time. In our study, we focused on 1-day lead time forecasting and therefore did not include long-term climate information. At longer lead times, changes in weather conditions would likely exert much greater control on runoff and the performance of the model.
We select RF to forecast streamflow for two reasons. First, RF has been reported to deliver high performance in short-term streamflow forecasts (Mosavi et al., 2018; Papacharalampous and Tyralis, 2018; Li et al., 2019; Shortridge et al., 2016), making it a good candidate for our study. Second, RF allows for some level of interpretability. This is delivered through two measures of the predictive contribution of variables: mean decrease in accuracy (MDA) and mean decrease in node impurity (MDI). These two measures have been widely used as means for variable selection in classification and regression studies in bioinformatics (Chen and Ishwaran, 2012), remote sensing classification (Pal, 2005), and flood hazard risk assessment (Wang et al., 2015).

[Figure 2: schematic of growing a regression RF, showing bootstrapped trees with binary splits (X_i ≥ t vs. X_i < t), averaging of all tree predictions, and the parameters ntree, nodesize, and mtry.]

Methodology

Random forests
Proposed by Breiman (2001), RF is a supervised, non-parametric algorithm within the decision tree family that comprises an ensemble of decorrelated trees to yield predictions for classification and regression tasks. Non-parametric methods such as RF do not assume any particular family for the distribution of the data (Altman and Bland, 1999). Since a single decision tree can produce high variance and is prone to noise (James et al., 2013), RF addresses this limitation by generating multiple trees, where each tree is built on a bootstrapped sample of the training data (Fig. 2). Each time a binary split is made in a tree (also known as a split node), a random subset of predictors (drawn without replacement) from the full set of predictor variables is considered (Fig. 2). One predictor from these candidates is used to make the split such that the expected sum of variances of the response variable in the two resulting nodes is minimized. This randomization in generating the subset of features prevents one or more particularly strong predictors from being repeatedly chosen at each split, which would result in highly correlated trees (Breiman, 2001). After all the trees are grown, the forest makes a prediction on a new data point by running it through all trees. In the end, the forest casts a majority vote on a class label for classification tasks, or produces a value for regression tasks by averaging all tree predictions. Breiman (2001) provided full details on RF and its merits. The randomForest package in R developed by Liaw et al. (2002) was used for model training and validation in our study. The step-by-step procedure for building a regression RF follows.
Algorithm 1: Building a regression RF. Step 1: Draw ntree bootstrap samples from the training set, each the same size as the training sample; ntree denotes the number of trees in the forest.
Step 2: At each binary node split, a subset of mtry predictors X_i is randomly selected from the p-dimensional predictor space \Omega_p, such that X_i \in \Omega_p for i \in \{1, \dots, mtry\}, mtry < p.
Step 3: The single best combination of predictor X_i and threshold t is selected to split the observations y_j into binary regions R_1 = \{ y_j \mid X_i < t \} and R_2 = \{ y_j \mid X_i \ge t \} that minimizes

\sum_{y_j \in R_1} (y_j - \hat{y}_{R_1})^2 + \sum_{y_j \in R_2} (y_j - \hat{y}_{R_2})^2,

where \hat{y}_{R_1} is the mean of observations in R_1 and \hat{y}_{R_2} is the mean of observations in R_2.
Step 4: Repeat Steps 2-3 until every terminal region contains fewer than nodesize observations.

Due to sampling with replacement, some observations may not be selected during the bootstrap. These are referred to as out-of-bag (OOB) samples and are used to estimate the error of the tree on unseen data. It has been estimated that approximately 37 % of samples constitute the OOB data (Huang and Boutros, 2016). An average OOB error is calculated as each tree is added, providing an estimate of the performance gain. The OOB error can be particularly sensitive to the number of random predictors used at each split (mtry) and the number of trees (ntree) (Huang and Boutros, 2016). Generally, predictive performance improves (OOB error decreases) as ntree increases. However, recent research has shown that, depending on the dataset, there is a limit beyond which growing additional trees does not improve performance (Oshiro et al., 2012).
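The ≈37 % OOB figure follows from bootstrap arithmetic: the probability that a given observation is never drawn in n draws with replacement is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 for large n. A minimal stdlib-Python sketch (illustrative only, not the study's R code):

```python
import random

def oob_fraction(n, seed):
    """Fraction of the n observations absent from one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # sample n indices with replacement
    return 1 - len(drawn) / n

# Average over many bootstrap replicates; should approach 1 - 1/e ~ 0.368
n = 1000
avg = sum(oob_fraction(n, seed=s) for s in range(200)) / 200
print(round(avg, 3))  # close to 0.368
```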
It has been advised that mtry be set to no larger than 1/3 of the total number of predictors for optimal regression prediction (Liaw et al., 2002); this is also the default value in the randomForest function in R and is widely adopted in the literature. Nevertheless, Huang and Boutros (2016) found that this value is dataset-dependent and could be tuned to improve the performance of RF.
Bernard et al. (2009) argued that the number of relevant predictors highly influences the optimal mtry value. In this study, we select the optimal mtry using an exhaustive search strategy, in which all possible values of mtry are considered, using the R package caret (Kuhn et al., 2008). While all parameters might affect the performance of RF, we chose to focus on two parameters, ntree and mtry, for a number of reasons. First, these two parameters were originally introduced by Breiman (2001) in the development of the RF algorithm. Second, ntree is generally not tuned for an optimum but rather set sufficiently high (Oshiro et al., 2012; Probst et al., 2019) for RF to achieve near-optimal performance. Furthermore, empirical results in previous work suggest that mtry is the most influential RF parameter (Bernard et al., 2009; Van Rijn and Hutter, 2018; Probst et al., 2019). Figure 2 illustrates the step-by-step operating principle of growing an RF and the relevant parameters.
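To illustrate the exhaustive-search idea, the sketch below picks the mtry value with the lowest out-of-bag MAE from a table of candidate scores. The scores here are invented for illustration (loosely echoing the medians reported in Table 3); in the actual study each score would come from training an RF per candidate, e.g. via caret:

```python
# Hypothetical OOB MAE per candidate mtry; values are illustrative only.
oob_mae = {1: 0.0151, 2: 0.0140, 3: 0.0127, 4: 0.0116, 5: 0.0079,
           6: 0.0102, 7: 0.0121, 8: 0.0135}

best_mtry = min(oob_mae, key=oob_mae.get)  # exhaustive search: evaluate every candidate
print(best_mtry)  # 5
```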

In addition to assessing a model's overall predictive ability, there is also interest in understanding the contribution of each predictor variable to model performance. There are two built-in measures for assessing variable importance in RF: mean decrease in accuracy (MDA) and mean decrease in node impurity (MDI). Both were developed by Breiman (Breiman et al., 1984; Breiman, 2001). After all trees are grown, the OOB data from training are used to compute the first measure. At each tree, the mean squared error (MSE) between predicted and observed values is calculated. Then the values of each of the p predictors are randomly permuted with the other predictor variables held constant. The difference between the previous and new MSE is averaged over all trees. This is the predictor variable's MDA (Liaw et al., 2002), and values are reported as the percent difference in MSE. The procedure is repeated for each predictor variable. Given a strong association between a predictor and the response variable, breaking that bond would result in a large prediction error (i.e., a large MDA). The MDA value can be negative when a predictor has no predictive power and adds noise to the model. Strobl et al. (2007), however, cautioned that permutation-based measures such as MDA can show a bias towards correlated predictor variables by overestimating their importance, particularly in high-dimensional data sets.
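The permutation logic behind MDA can be sketched in a few lines of stdlib Python. Here the "trained model" is a hypothetical stand-in that uses only the first predictor, so permuting the first column inflates the MSE while permuting the irrelevant second column leaves it unchanged:

```python
import random

rng = random.Random(0)
# Toy data: y depends on x1 only; x2 carries no signal
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(500)]
y = [2 * x1 + rng.gauss(0, 0.05) for x1, _ in X]

def model(x1, x2):   # stand-in for a trained regressor (hypothetical)
    return 2 * x1

def mse(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

base = mse([model(*x) for x in X], y)

importances = []
for j in range(2):                         # permute one predictor at a time
    col = [x[j] for x in X]
    rng.shuffle(col)
    Xp = [(col[i], x[1]) if j == 0 else (x[0], col[i]) for i, x in enumerate(X)]
    importances.append(mse([model(*x) for x in Xp], y) - base)

# MDA-style scores: large for the informative x1, ~0 for the irrelevant x2
print([round(v, 3) for v in importances])
```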
The second measure, MDI, quantifies the average error reduction each time a predictor is selected to make a split during training. It is based on the principle that a binary split occurs only when the residual errors (or impurity) of the two descendent nodes are less than that of their parent node. The higher the value, the more important the predictor.

Benchmark models
We benchmark the performance of RF during the validation period against multiple linear regression (MLR) and a simple naïve model using the Pearson correlation coefficient (r) between forecasted and observed values for each model. In the naïve model, we assume a "minimal-information" scenario in which the best estimate of streamflow for the next day is the observed value from the current day (Gupta et al., 1999). Its r, in this case, is the 1-day autocorrelation coefficient of the time series and measures the strength of persistence. We train and verify the MLR model using the same data sets and predictors supplied to the RF model.
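The naïve benchmark's r is simply the lag-1 autocorrelation of the series. A stdlib-Python sketch (illustrative, not the study's code):

```python
import math

def lag1_autocorr(q):
    """Pearson r between q[t] and q[t+1]: the skill of the naive persistence forecast."""
    x, y = q[:-1], q[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A smooth, persistent series has lag-1 autocorrelation near 1
flow = [10 + i * 0.1 for i in range(100)]
print(round(lag1_autocorr(flow), 3))
```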

Performance evaluation criteria
Different model performance criteria exist, and each provides unique insight into the correspondence between forecasted and observed streamflow values. While r and its square, the coefficient of determination (R²), are often used, Legates and McCabe Jr (1999) discussed the limitations of these two measures, which were reported to be especially oversensitive to extreme values or outliers. The authors recommended that absolute error measures (i.e., root mean squared error or mean absolute error) and goodness-of-fit measures, such as the Nash-Sutcliffe efficiency (NSE), can provide a more reliable and conservative assessment of the models. The Kling-Gupta efficiency (KGE) is a relatively new metric developed for hydrologic models that addresses several shortcomings diagnosed in NSE. For these reasons, we selected the following four criteria to evaluate RF performance: R², RMSE, MAE, and KGE. These criteria cover various aspects of model performance and also provide intuitive interpretations, as explained in the remainder of this section.
R² can be interpreted as the proportion of the variance in the observed values that can be explained by the model. Values range between 0 and 1, where 1 indicates the model is able to explain all variation in the observed dataset.
MAE provides the average magnitude of the errors in the model's predictions without considering their direction (underestimation or overestimation):

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|,

where N is the total number of observations during the validation period, and \hat{y}_i and y_i are the forecasted and observed values at day i, respectively.
RMSE is the standard deviation of the residuals between predictions and observations:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2}.

It is more sensitive to large errors due to the squaring operation. Both MAE and RMSE scores range between 0 and ∞, where a score of 0 indicates a perfect match between predicted and observed data. The standardization of streamflow measurements (described in Sect. 3) allows comparison of MAE and RMSE across gauges.
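A minimal sketch of the two error measures in stdlib Python; the toy example shows RMSE's extra sensitivity to a single large error:

```python
import math

def mae(forecast, obs):
    return sum(abs(f - o) for f, o in zip(forecast, obs)) / len(obs)

def rmse(forecast, obs):
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, obs)) / len(obs))

obs = [1.0, 2.0, 3.0, 4.0]
fc  = [1.0, 2.0, 3.0, 8.0]          # one large error
print(mae(fc, obs), rmse(fc, obs))  # RMSE (2.0) exceeds MAE (1.0): outlier sensitivity
```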
The KGE metric ranges between −∞ and 1. While there currently is no definitive KGE scale, Knoben et al. (2019) showed that KGE values between −0.41 and 1 indicate the model improves upon the mean-flow benchmark, which assumes the predicted streamflow values equal the mean of all observations. This benchmark is independent of the basin: always predicting the time-mean flow ("climatology") yields KGE = 1 − √2 ≈ −0.41 for any data set. A KGE value of 1 indicates the model can perfectly reproduce the observations. KGE is calculated as

\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2}, \qquad \alpha = \frac{\sigma_{\hat{y}}}{\sigma_y}, \qquad \beta = \frac{\mu_{\hat{y}}}{\mu_y},

where r is the Pearson correlation coefficient, α is a measure of relative variability in the forecasted and observed values, and β represents the bias; \sigma_{\hat{y}} and \sigma_y are the standard deviations of the forecasts and observations, and \mu_{\hat{y}} and \mu_y are the forecast and observation means, respectively.
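The KGE decomposition above can be sketched directly in stdlib Python (using population standard deviations, an implementation detail we assume here):

```python
import math

def kge(forecast, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    n = len(obs)
    mf, mo = sum(forecast) / n, sum(obs) / n
    sf = math.sqrt(sum((f - mf) ** 2 for f in forecast) / n)
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    r = sum((f - mf) * (o - mo) for f, o in zip(forecast, obs)) / (n * sf * so)
    alpha, beta = sf / so, mf / mo          # variability ratio and bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [1.0, 3.0, 2.0, 4.0, 5.0]
print(round(kge(obs, obs), 3))                    # 1.0: perfect forecast
print(round(kge([o + 1 for o in obs], obs), 3))   # constant bias only: 1 - 1/3
```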
In hydrological forecasting, one might be interested in the ability of the model to capture extreme events rather than its overall performance. This is particularly relevant in flood risk assessment and flood forecasting, where floods are associated with discharge exceeding a high percentile (typically ≥ 90th) (Cayan et al., 1999). The definition of "extreme" depends on the objective of the study. Here, we adopt the peak-over-threshold method. For the validation period, we calculated the 90th, 95th, and 99th percentile streamflow values at each watershed and treated these as thresholds; if an observed daily streamflow exceeded a threshold, it was considered an extreme event. We measure the ability of RF to capture these events using two additional criteria: probability of detection (POD) and false alarm rate (FAR), calculated as in Karran et al. (2013):

\mathrm{POD} = \frac{n(\hat{y} \ge \omega \;\wedge\; y \ge \omega)}{n(y \ge \omega)}, \qquad \mathrm{FAR} = \frac{n(\hat{y} \ge \omega \;\wedge\; y < \omega)}{n(y < \omega)},

where n(\cdot) counts the days in the validation period meeting the stated condition and ω is a specified threshold.
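A stdlib-Python sketch of POD and FAR for a single threshold ω, following the contingency counts described above (toy data, not the study's):

```python
def pod_far(forecast, obs, omega):
    """POD and FAR for exceedances of threshold omega (peak-over-threshold)."""
    hits = sum(1 for f, o in zip(forecast, obs) if o >= omega and f >= omega)
    events = sum(1 for o in obs if o >= omega)
    false_alarms = sum(1 for f, o in zip(forecast, obs) if o < omega and f >= omega)
    non_events = sum(1 for o in obs if o < omega)
    return hits / events, false_alarms / non_events

obs = [1, 5, 2, 9, 3, 8, 1, 7]   # 4 exceedances of omega = 5
fc  = [1, 6, 2, 4, 3, 9, 8, 7]   # 3 hits, 1 miss, 1 false alarm
print(pod_far(fc, obs, 5))        # (0.75, 0.25)
```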

Study area and data

Watersheds in the Pacific Northwest Hydrologic Region
In this study, we focus on watersheds in the Pacific Northwest Hydrologic Region (Fig. 1). This region covers an area of 836,517 km², encompassing all of Washington, six other states, and British Columbia, Canada. For the purpose of maintaining consistency in monitoring protocols and data, we only consider watersheds in US territory. The Columbia River and its tributaries make up the majority of the drainage area, traveling more than 2000 km; an extensive network of more than 100 hydroelectric dams and reservoirs has been built along these river channels. Hydropower in the Columbia River Basin supplies approximately 70 percent of Pacific Northwest energy (Payne et al., 2004). Flood control is also an important aspect of reservoir operation in this region.
The north-south running Cascade Mountain Range divides the region into eastern and western parts and strongly influences the regional climate. The windward (west) side of the mountains receives an ample amount of winter precipitation compared to the leeward (east) side. When temperatures fall near the freezing point, precipitation comes in the form of snow and provides water storage for the dry summer months. Summers tend to be cool and comparatively dry. East of the Cascades, summer rainfall results from rapidly building thunderstorms and convective events that can produce flash floods (Mass, 2015). West of the Cascades, proximity to the ocean creates a more moderate climate with a narrower seasonal temperature range than the inland areas, particularly in winter. Spatial trends and variations in annual mean temperature, total precipitation, drainage area, and elevation of the watersheds are shown in Fig. 3.

Streamflow
Our analysis uses streamflow data available through the USGS National Water Information System (NWIS) (https://waterdata.usgs.gov/nwis/sw). From NWIS, we selected daily streamflow time series for gauges using the following criteria: 1) continuous operation during the 10-year period between 2009 and 2018, 2) less than 10 percent missing data, and 3) location in watersheds with "natural" flow that is minimally interrupted by anthropogenic intervention. The third criterion was met using the GAGES-II (Geospatial Attributes of Gages for Evaluating Streamflow) dataset (Falcone, 2011) classification to identify watersheds with least-disturbed hydrologic conditions that represent natural flow. We performed additional screening by computing the correlation coefficient between each gauge and the mean basin streamflow and removed those with a correlation of less than 0.5. We also excluded small creeks with drainage areas less than 50 km². In total, 86 watersheds were selected (Fig. 1).
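The screening criteria can be expressed as a simple filter. The gauge records below are hypothetical, used only to illustrate the missing-data, correlation, and drainage-area screens:

```python
# Hypothetical gauge records: (id, fraction missing, r with mean basin flow, area km^2)
gauges = [
    ("A", 0.02, 0.80, 400),
    ("B", 0.15, 0.90, 600),   # too much missing data
    ("C", 0.05, 0.40, 900),   # weak correlation with basin mean
    ("D", 0.01, 0.75, 30),    # drainage area below 50 km^2
    ("E", 0.03, 0.65, 2500),
]

selected = [gid for gid, missing, r, area in gauges
            if missing < 0.10 and r >= 0.5 and area >= 50]
print(selected)  # ['A', 'E']
```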
Following the methodology proposed by Wenger et al. (2010), the watersheds were further grouped into three classes of hydrologic regime based on the timing of center-of-annual flow, defined as the date by which half of the total annual flow volume has passed. As this classification has been widely applied in the region (Mantua et al., 2009; Elsner et al., 2010; Vano et al., 2015), we adopted this partition in our study for two reasons.
First, as Regonda et al. (2005) pointed out, the classification provides a summary of information about type and timing of precipitation, timing of snowmelt, and the contribution of these hydro-climatic variables to streamflow. This helps us assess model performance in consideration of sources of runoff. Second, the classification provides a basis to generalize the results to other watersheds that are not part of the study.

On average, records at these watersheds have less than 3 percent missing data during the 2009-2018 period. The drainage areas of the watersheds range between 51 km² and 3355 km², and the mean elevations range from 239 m to 2509 m, estimated from a 30-m resolution digital elevation model (Table 1).

Precipitation
Daily precipitation observations were obtained from the AN81d PRISM dataset (Di Luzio et al., 2008). This gridded dataset has a resolution of 4 km, covers the entire continental US from January 1981 to present, and is updated every 6 months. Best-estimate gridded values are derived using all available data from a number of station networks ingested by the PRISM Climate Group. A combination of climatologically aided interpolation (CAI) and radar interpolation was used in developing the PRISM dataset. In our study, watershed daily precipitation time series were constructed by computing the arithmetic mean of precipitation values at all grid points that fall within the given watershed.

Snow water equivalent and temperatures
SWE is defined as the depth of water that would be obtained if a column of snow were completely melted (Pan et al., 2003).
Daily SWE data were retrieved from 201 SNOTEL stations in HUC 17. These stations are part of a network of over 800 sites located in remote, high-elevation mountain watersheds in the western U.S. The elevations of these stations range from 128 m to 3142 m. At SNOTEL sites, SWE is measured by a snow pillow, a pressure-sensitive pad that weighs the snowpack and records the reading via a pressure transducer. As the temperature shift is the primary trigger for snowmelt, daily maximum temperature (TMAX) and minimum temperature (TMIN) from SNOTEL sensors were also retrieved and included as predictors of streamflow. The obtained data reflect the last measurement recorded for the respective day at each site.
We note that the in situ data from these stations cannot capture the spatial variability of snow accumulation, and computing an area-averaged snowpack value from point observations remains a challenging task (Mote et al., 2018). The SNOTEL averages therefore represent first-order estimates of snow coverage and temperature conditions.

Predictor selection
Future daily mean streamflow (Q_{t+1}) is the response variable in our study. We attempt to explain the variability in Q_{t+1} using eight relevant predictors from the three datasets (Table 2). Preliminary processing showed moderate to strong non-linear temporal correlation between daily streamflow and the pentad at each gauge.
We also derived two variables from the available data: the sum of 3-day precipitation (P3_t) and snowmelt (SD_t). The 3-day precipitation sum accounts for large winter storms that can last several days and often result in surges in streamflow.
SD_t was calculated as the difference between SWE at day t and day t − 1. Soil moisture is also a relevant variable in streamflow modeling, as it controls the partitioning of precipitation between infiltration and runoff (Aubert et al., 2003). However, soil moisture data are often limited and incomplete, especially at the daily interval.

[Table 2: predictor variables, including the sum of 3-day precipitation P3_t = P_t + P_{t−1} + P_{t−2} (mm, derived from PRISM) and the SWE pentad PEN_t.]

Table 3. The optimized parameter mtry using the exhaustive-search strategy (mtry = {1, 2, 6, 7, 8} were considered but not found to be optimal at any gauge).
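The two derived predictors are simple rolling operations on the raw series. A stdlib-Python sketch with made-up data (note that SD_t = SWE_t − SWE_{t−1} is negative when the snowpack loses water, i.e., melt):

```python
def derive(precip, swe):
    """3-day precipitation sum P3_t and daily SWE change SD_t = SWE_t - SWE_{t-1}."""
    p3 = [sum(precip[max(0, t - 2): t + 1]) for t in range(len(precip))]
    sd = [None] + [swe[t] - swe[t - 1] for t in range(1, len(swe))]  # undefined at t=0
    return p3, sd

precip = [0.0, 5.0, 3.0, 0.0, 2.0]        # mm/day, illustrative
swe    = [100.0, 98.0, 98.0, 95.0, 95.0]  # mm, illustrative
p3, sd = derive(precip, swe)
print(p3)  # [0.0, 5.0, 8.0, 8.0, 5.0]
print(sd)  # [None, -2.0, 0.0, -3.0, 0.0]
```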

Parameter tuning
As mentioned in Sect. 2, the error rate in RF can be sensitive to two parameters: the number of trees ntree and the number of randomly selected predictors available for splitting at each node, mtry. We tested RF on training data sets of 30 randomly chosen watersheds and observed that the reduction in out-of-bag MAE is negligible beyond 2000 trees. We therefore set ntree = 2000 for all 86 watersheds. mtry, on the other hand, was tuned empirically using an exhaustive search approach. Results (Table 3) show that the optimal mtry values are {3, 4, 5}, with median MAE of 0.0127, 0.0116, and 0.0079, respectively. The optimal mtry at each gauge was then used in both training and validating the model. Because the number of predictors in our study is relatively small, the computational burden of the exhaustive search was manageable. As the number of candidates grows, a random search strategy (Probst et al., 2019), in which values are drawn randomly from a specified space, can be more computationally efficient.

Figure 6. Boxplots of the Pearson correlation coefficient between forecasted and observed values for three models (RF, naïve, and MLR) across three flow regimes. Two-sample Wilcoxon rank-sum significance tests are performed and p-values (in black) are included for each pair of models.

Figure 6 shows the distributions of the Pearson correlation coefficient (r) between forecasted and observed values obtained from the three models: RF, naïve, and MLR. Non-parametric, two-sample Wilcoxon rank-sum tests (Wilcoxon et al., 1970), which assess whether the values obtained from two separate groups are systematically different from one another, suggest that the pair-wise differences in r values between RF and the other two models are statistically significant

(p < 0.05) in two flow regimes. RF is observed to outperform both the naïve and MLR models in rainfall-driven and transient watersheds. Among snowmelt-driven watersheds, the three models yield similar correlation coefficients (p > 0.05). In Fig. 7a and 7b, we observe that most points lie to the left of the 1-to-1 line, suggesting that RF outperforms the naïve model at individual watersheds in the rainfall-driven and transient regimes. We also discern that large improvement, defined as the positive difference in r values between RF and the naïve model, tends to occur where persistence is lower. This suggests that application of RF would be most beneficial at watersheds where next-day streamflow is less dependent on the conditions of the current day. Among snowmelt-driven watersheds, the data points lie on the 1-to-1 line, indicating that the three models show marginal differences in r values. As Mittermaier (2008) pointed out, the choice of reference can affect the perceived performance of a forecast system. Our pair-wise comparisons highlight the fact that evaluation of data-driven models should be performed in consideration of the autocorrelation structure in the data (Hwang et al., 2012).

Evaluation of RF overall performance
We next evaluated the overall performance of RF across the three flow regimes using four criteria: R², KGE, MAE, and RMSE (Table 4, Fig. 8). Here, we observe a trend in the R², KGE, MAE, and RMSE scores similar to the r-value trend in Fig. 6: RF performs better in snowmelt-dominated than in rainfall-dominated watersheds (higher R² and KGE, lower MAE and RMSE).
Snowmelt-dominated watersheds have the smallest range of R² values across the three groups. This may suggest that there is less variability in flow behavior at individual gauges in this group, consistent with the observed data, where the hydrographs of snowmelt-driven watersheds tend to be less flashy than those of rainfall-driven watersheds. Not surprisingly, the transitional group has the largest spread in R² values, as watersheds in this group share characteristics with the other two groups.
Because RMSE is more sensitive to large errors than MAE, the difference between the two scores represents the extent to which outliers are present in the error values (Legates and McCabe Jr, 1999). In the rainfall-driven and transient groups, the shapes of the boxplot distributions remain fairly consistent between the two error scores, suggesting that the distribution of large errors is similar to that of mean errors in these watersheds (Fig. 8). Among snowmelt-driven watersheds, the MAE scores are heavily skewed towards 0 while the RMSE scores are more evenly spread, and we observe a noticeably wider interquartile range (difference between first and third quartiles) in the RMSE plot compared to the MAE plot. This indicates that RF can still be susceptible to underestimation or overestimation in watersheds where the mean error is relatively low.
In Table 4, […] at all watersheds in our study. Our results are comparable to findings in Tongal and Booij (2018), where the authors compared the performance of RF, SVM, and ANN in simulating daily discharge with baseflow separation at four rivers in California and Washington. Although the authors did not classify these basins, it can be inferred that three of the rivers were rainfall-driven and one was snowmelt-driven. The RF model in their study produced KGE scores of 0.41, 0.81, and 0.92 for the rainfall-driven basins (without baseflow separation). However, our KGE scores for snowmelt-fed watersheds (with a median of 0.94) are higher compared to the reported 0.55 in their study.

Figure 9. The probability of detection (POD) plotted against the false alarm rate (FAR) for three extreme thresholds: the 90th, 95th, and 99th percentiles. Thin black lines connect values from the same watershed. (Vertical axis) Number of times RF correctly forecasted events that exceeded the threshold divided by the total number of exceedances. (Horizontal axis) Number of times RF incorrectly forecasted events that exceeded the threshold divided by the total number of non-exceedances.

RF performance on extreme streamflows
We also examine the model's capacity to forecast extreme events because of their potentially high impact and associated flood risks in this region. The ability of RF to correctly detect extreme flows exceeding the 90th, 95th, and 99th percentile thresholds (defined as the POD) for each watershed is plotted against the FAR in Fig. 9. A threshold point falling below the no-skill line indicates that the model yields a higher FAR than POD and is considered to have no predictive power for that threshold. Note that, because only three discrete thresholds are plotted, Fig. 9 differs from a conventional ROC curve: the points do not trace the full range from 0 to 1. RF expectedly becomes less skillful in its forecasts as the magnitude of the events increases. The model tends to perform better among snowmelt-dominated watersheds (higher POD, lower FAR) compared to those in the transient and rainfall-driven groups. At the 95th threshold, RF correctly forecasts at least 50 percent of the extreme events (POD > 0.5) at most watersheds. At the 99th threshold, the difference in RF's ability to forecast extreme streamflow among the three flow regimes becomes less obvious. While few studies have examined complex diurnal hydrologic responses in high-elevation catchments (Graham et al., 2013), our results suggest that large surges in streamflow sustained by spring and early summer snowmelt can be difficult to predict, even at 1-day lead time, and remain an ongoing research subject (Ralph et al., 2014; Cho and Jacobs, 2020). In our study, we observe that high POD is accompanied by low FAR at the same threshold. This may suggest that RF is skillful in its forecasts of extreme events.

Analysis of variable importance
Variable importance is a useful feature both for understanding the underlying process of the current model and for generating insights for variable selection in future studies (Louppe et al., 2013). RF quantifies variable importance through two measures: MDA and MDI (Fig. 10). In both measures, a higher value indicates that a variable contributes more to model accuracy. Intuitively, streamflow from the previous day is shown to be the most important variable due to persistence. This is reflected across the three flow regimes and both measures. We also observe that 3-day precipitation tends to have more predictive power than 1-day precipitation. Maximum temperature and minimum temperature share similar contributions, with minimum temperature tending to receive slightly higher scores. Among snowmelt-dominated watersheds (Fig. 10c and 10f), we anticipate that the snow indices (SD_t and SWE_t) contribute more to the prediction than precipitation, and this is also reflected. Surprisingly, pentad comes third and fourth in MDI and MDA, respectively. This supports the long-term snowpack memory of daily streamflow (Zheng et al., 2018) and can be useful in real-time prediction. Precipitation does not seem to contribute significantly to the model's accuracy in the snowmelt-dominated group. Although PRISM precipitation data include both rainfall and snowfall, it is likely that the majority of fallen precipitation in these high-altitude watersheds is stored as snow on the surface and does not immediately contribute to runoff. Li et al. (2017) estimated that 37 % of the precipitation falls as snow in the western US, yet snowmelt is responsible for 70 % of the total runoff in mountainous areas. It is still very surprising to observe such a low contribution of the precipitation variable to RF model accuracy.
Nevertheless, we observe general agreement between the two measures in the ranking of the variables in the snowmelt-driven group.
In the transient and rainfall-dominated groups, there is noticeable disagreement between the two criteria. Precipitation (P_t) and 3-day precipitation (P3_t) tend to rank lower in the MDA measure (Fig. 10a and 10b) compared to MDI (Fig. 10d and 10e).

Specifically, in the rainfall-dominated group, 3-day precipitation and precipitation are placed 2nd and 3rd based on median MDI, compared to 4th and 7th in MDA. Maximum and minimum temperatures, on the other hand, tend to be more important in the MDA calculation than in MDI. In Shortridge et al. (2016), an RF model was used to predict streamflow at five rain-fed rivers in Ethiopia. A similarly calculated MDA in that study suggested precipitation was less important (7.71 %) than temperature (12.74 %). A linear model in the same study, however, considered the coefficient for precipitation to be significant (p << 0.01) while the temperature coefficient was not (p = 0.08). In Obringer and Nateghi (2018), the authors predicted daily reservoir levels in three reservoirs in Indiana, Texas, and Atlanta using RF and other ML techniques. Precipitation was reported as the least important variable and ranked behind dew point temperature and humidity. Inspecting the probability density functions of our predictors, we suspect that for variables that are heavily skewed and zero-inflated (e.g., precipitation), permutation-based MDA may underestimate their importance compared to variables that are more normally distributed, such as maximum and minimum temperatures. In our precipitation data (both training and validation), at least 30 percent of the daily observations are 0 across the watersheds. There is a high likelihood that a day with zero precipitation ends up with the same value during the shuffling process, thus potentially weakening the randomization used to compute MDA. While we did not perform additional simulations to confirm whether the MDA and MDI measures are sensitive to highly skewed and zero-inflated variables, this can be a topic of future research. Strobl et al. (2007) showed that RF variable importance measures can be unreliable in situations where potential predictor variables vary in their scale of measurement.
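Both measures can be reproduced with standard tooling. The sketch below (Python/scikit-learn rather than the R used in our study, on synthetic data) contrasts impurity-based MDI with permutation-based MDA, and includes a zero-inflated predictor as a stand-in for precipitation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
# Zero-inflated "precipitation" (zero on ~70 % of days) and a smooth "temperature".
precip = rng.gamma(2.0, 1.0, n) * (rng.random(n) > 0.7)
temp = rng.normal(10.0, 5.0, n)
flow = 2.0 * precip + 0.5 * temp + rng.normal(0, 0.5, n)

X = np.column_stack([precip, temp])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, flow)

mdi = rf.feature_importances_                      # mean decrease in impurity
mda = permutation_importance(rf, X, flow,          # mean decrease in accuracy
                             n_repeats=10, random_state=0).importances_mean
print("MDI:", mdi)
print("MDA:", mda)
```

Comparing the two printed vectors on data with varying degrees of zero-inflation is one way the sensitivity question raised above could be explored.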
Temperature predictors receiving higher MDA can also be due to an identified bias whereby permutation-based importance measures overestimate the true contribution of correlated variables (Gregorutti et al., 2017). There is also an ongoing discussion regarding the stability of both measures across different simulated datasets (Calle and Urrea, 2010; Nicodemus, 2011; Ishwaran and Lu, 2019). Although results from MDI make more sense in our case, we suggest RF users exert caution when interpreting outputs from these two measures. We also computed Pearson correlation coefficients between the KGE scores and selected catchment characteristics (Table 1 in the Supplement). The results are shown in Table 5.
Figure 11. KGE scores plotted against (a) the average percent of slope and (b) the average percent of sand in soil at each watershed. Best-fit lines were determined using simple linear regression. Pearson correlation coefficients were computed with associated significance.

There is a strong negative correlation between the KGE scores and the average percent of slope (Fig. 11a). As a steeper hillslope is often associated with faster surface and subsurface water movement during event-flow runoff, this can result in shorter response times. We observe a similar trend between the KGE scores and the percent of sand in the soil (Fig. 11b), where RF performs worse in watersheds with higher hydraulic conductivity (i.e., higher sand content). This could be a result of rapid subsurface flow from the soil profile enabled by soil macropores in mountainous forested areas (Srivastava et al., 2017), where subsurface flow is the predominant mechanism. Without a quantification of the partitioning of discharge into surface and subsurface flow at individual watersheds, it is difficult to determine the relative importance of subsurface runoff mechanisms in regulating streamflow and how these may have affected RF performance. The findings, however, suggest that RF performance can deteriorate at watersheds with quick-response runoff when supplied with 1-day-delayed observation data.
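The correlations behind Fig. 11 are plain Pearson coefficients with associated significance tests. A sketch on hypothetical values (Python/SciPy rather than R; the arrays below are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical KGE scores and average basin slopes (percent) for ten watersheds.
kge   = np.array([0.95, 0.93, 0.90, 0.88, 0.85, 0.82, 0.80, 0.76, 0.72, 0.65])
slope = np.array([5.0, 8.0, 12.0, 15.0, 20.0, 24.0, 30.0, 35.0, 42.0, 50.0])

r, p = pearsonr(kge, slope)
print(f"r = {r:.2f}, p = {p:.4f}")  # strongly negative r with a small p-value
```

With monotonically decreasing KGE against increasing slope, r is close to −1 and the two-sided p-value is well below 0.05.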

It appears that stream density and the amount of vegetation cover may also affect the performance of RF, but the relationships are not statistically significant at α = 0.05. Aspect eastness, drainage area, and basin compactness are not determining factors in the variability of the KGE scores. We also explored the impact of land use and land cover, which can be represented by the extent of impervious cover in each watershed. However, because we only selected unregulated watersheds that experienced minimal human disruption during the initial screening, most watersheds have very little impervious cover (less than 5 %). We note that these selected characteristics are not meant to be exhaustive, but rather representative of the various types of factors that could help explain the variability in model performance. Furthermore, an alternative approach to Pearson's correlation is to use ANOVA to test for the marginal significance of each catchment variable to KGE while accounting for their interactions. Because our objective is not to make inferences on KGE based on these variables and an ANOVA analysis can be complicated to interpret, we chose to compute correlation coefficients.

Limitations and future research
There are some notable limitations in our study, as well as in RF in general. The classification of watersheds into three flow regimes was based on the timing of the climatological mean of the annual flow volume, which can fluctuate from year to year. This is particularly true for the watersheds in the transient group, where streamflow is contributed by a mixture of runoff from winter rainfall and springtime snowmelt and the inter-annual variability is tremendous in both magnitude and timing (Lundquist et al., 2009). Therefore, the membership of the classified watersheds in this group can vary. In fact, Mantua et al. (2009) discussed the future shift of transient runoff watersheds towards rainfall-dominated in Washington State. Because we trained RF using the same input variables for all watersheds regardless of flow regime and calculated the performance criteria separately, the classification does not alter the results at individual watersheds.
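The center-of-annual-flow-volume timing underlying this classification can be computed as the flow-weighted mean day of the year. A sketch on synthetic hydrographs (Python; the Gaussian-shaped series are illustrative, not real streamflow):

```python
import numpy as np

def center_of_volume_day(daily_flow):
    """Flow-weighted mean day of the year: the timing measure
    used to classify flow regimes."""
    q = np.asarray(daily_flow, float)
    days = np.arange(1, len(q) + 1)
    return np.sum(days * q) / np.sum(q)

# A late-peaking (snowmelt-like) vs. an early-peaking (rainfall-like) year.
t = np.arange(1, 366)
snowmelt_like = np.exp(-0.5 * ((t - 250) / 30.0) ** 2)
rainfall_like = np.exp(-0.5 * ((t - 90) / 30.0) ** 2)
print(center_of_volume_day(snowmelt_like))  # near day 250
print(center_of_volume_day(rainfall_like))  # near day 90
```

Because this date is a yearly statistic, a watershed whose climatological value sits near a regime boundary can plausibly switch groups, which is the membership instability noted above.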
In this study, we used estimated precipitation from PRISM, which is an interpolation product that combines data from various rain gauges across multiple networks. Despite the errors and uncertainty this may introduce, we believe the use of a spatially distributed product better represents the areal estimate of precipitation over a watershed than a single rain gauge measurement. In real-time forecasting, this would not be feasible due to the added time needed to compile and process such data. Similarly, we provided the RF model with a basin-averaged SWE from SNOTEL stations as an estimate of snowpack condition. Using more spatially consistent SWE data, such as the Snow Data Assimilation System product (Pan et al., 2003), would potentially improve model accuracy. As our results indicate that RF can produce reasonable forecasts, potential future research could explore the sensitivity of the model to satellite-derived snow products and even include a t + 1 precipitation forecast as a predictor in the model.
An inherent limitation of RF is the lack of direct uncertainty quantification in prediction. In our case, the streamflow forecast from RF does not come with a standard error comparable to that provided by traditional regression models, and hence there is no direct way to place probabilistic confidence intervals on the predictions. Methods to estimate confidence intervals have been proposed by Wager et al. (2014), Mentch and Hooker (2016), and Coulston et al. (2016), but they are not widely applied. For future work, computation of confidence intervals on RF predictions would be useful in addressing and understanding uncertainty.
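Short of the formal estimators cited above, one simple heuristic is to read empirical quantiles off the per-tree predictions of the ensemble. A sketch (Python/scikit-learn, synthetic data; this spread reflects ensemble disagreement and is not a calibrated confidence interval):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 500)

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)

x_new = np.array([[5.0]])
# Collect each tree's prediction and summarize the ensemble spread.
per_tree = np.array([t.predict(x_new)[0] for t in rf.estimators_])
lo, hi = np.percentile(per_tree, [5, 95])
print(f"point: {per_tree.mean():.3f}, 90% spread: [{lo:.3f}, {hi:.3f}]")
```

Quantile regression forests take a related but more principled approach by retaining the full conditional distribution in each leaf.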

Conclusions
Accurate streamflow forecasts have extensive applications across disciplines, from water resources planning to engineering design. In this study, we assessed the ability of RF to make daily streamflow forecasts at 86 watersheds in the Pacific Northwest Hydrologic Region. Key results are summarized below:
- Based on the KGE scores (ranging from 0.62 to 0.99), we show that RF is capable of producing skillful forecasts across all watersheds.
- RF performs better in snowmelt-dominated watersheds, which can be attributed to stronger persistence in the streamflow time series. The largest improvements in forecasts compared to the naïve model are found among rainfall-dominated watersheds.
- The two approaches for measuring predictor importance yield noticeably different results. We recommend that interpretation of these two measures be coupled with an understanding of the physical processes and how these processes are connected.

- Increases in slope steepness and soil sand content are found to degrade RF performance in two of the flow regime groups. This demonstrates that catchment characteristics can cause variability in model performance and should be considered in both predictor selection and model evaluation.
Considering the current and future vulnerabilities of the Pacific Northwest to flooding caused by extreme precipitation and significant snowmelt events (Ralph et al., 2014), skillful streamflow forecasts can have important implications. Due to its practical applications, RF and RF-based algorithms continue to gain popularity in hydrological studies (Tyralis et al., 2019).
Given the promising results from our study, RF can be used as part of an ensemble of models to achieve better generalization and accuracy, not only in streamflow forecasting but also in other water-related applications in this region.
Code and data availability. Example code for building a random forests model in R and the data are available at https://github.com/leopham95/RandomForestStreamflowForecast