Evaluation and interpretation of convolutional-recurrent networks for regional hydrological modelling

Deep learning has emerged as a useful tool across geoscience disciplines; however, there remain outstanding questions regarding the suitability of unexplored model architectures and how to interpret model learning for regional-scale hydrological modelling. Here we use a convolutional-recurrent network, a deep learning approach for learning both spatial and temporal patterns, to predict streamflow at 226 stream gauges across the region of southwestern Canada. The model is forced by gridded climate reanalysis data and trained to predict observed daily streamflow between 1979 and 2015. To interpret the model's learning of both spatial and temporal patterns, we introduce two experiments with evaluation metrics to track the model's response to perturbations in the input data. The model performs well in simulating daily streamflow over the testing period, with a median Nash-Sutcliffe Efficiency (NSE) of 0.68 and 35% of stations having NSE > 0.8. When predicting streamflow, the model is most sensitive to perturbations in the input data prescribed near and within the basins being predicted, demonstrating that the model automatically learns to focus on physically realistic areas. When uniformly perturbing input temperature timeseries to obtain relatively warmer and colder input data, the modelled freshet timing and intensity change in accordance with the transition timing from below- to above-freezing temperatures. The results demonstrate the suitability of a convolutional-recurrent network architecture for spatiotemporal hydrological modelling, making progress towards interpretable deep learning hydrological models.


Introduction
The use of deep learning (DL) has gained traction in geophysical disciplines (Bergen et al., 2019; Gagne II et al., 2019; Ham et al., 2019), including hydrology, providing alternative or complementary approaches to supplement traditional process-based modelling (Hussain et al., 2020; Kratzert et al., 2018, 2019a, 2019b; Marçais and de Dreuzy, 2017; Shen, 2018; Van et al., 2020). While substantial progress has been made towards distributed process-based hydrological models, input and target data are becoming available at increasingly higher spatiotemporal resolution, leading to greater computational requirements and human labour (Marsh et al., 2020). DL in hydrology has emerged in recent years as an active field of exploration in efforts to maximize the use of growing in situ and remote sensing datasets (Kratzert et al., 2018; Shen, 2018; Shen et al., 2018).

https://doi.org/10.5194/hess-2021-113 Preprint.

Early applications of machine learning in hydrology date back to the 1990s, with artificial neural network (ANN) models used for rainfall-runoff modelling (Maier and Dandy, 1996, 2000; Zealand et al., 1999). ANN models aim to approximate functions that connect input data (e.g. weather data), represented by input neurons, to output or target data (e.g. streamflow data), represented by output neurons, through a series of hidden layers, each containing hidden neurons. The training of these models, i.e. the tuning of model parameters in the functions interconnecting each layer, aims to minimize the distance between model output and observed target data. However, ANNs struggle to represent more complex temporal or spatial information, and so ANN models are often highly tuned for a specific basin and cannot be applied elsewhere.
In contrast, the design of DL, generally referring to large multi-layer networks applied directly to large, raw datasets, has led to both improved representation of more complex functions and better learning of spatial and/or temporal patterns within the data (Shen, 2018).
As a type of DL architecture, long short-term memory (LSTM) networks are designed to learn sequential relationships on a range of scales. Learning is achieved by including four gates through which information can flow, the outputs of which interact with the memory state of the network. LSTMs have had particular success in natural language processing (NLP), including applications of text prediction (Karpathy et al., 2015), language translation, image captioning (Kiros et al., 2014), and video-to-text conversion (Venugopalan et al., 2015). In hydrology, Kratzert et al. (2018) demonstrated the effectiveness of LSTMs for rainfall-runoff modelling, using the previous 365 days of basin-averaged weather variables to predict the next day of streamflow at a stream gauge station. They showed that a single LSTM can be trained on hundreds of basins and then fine-tuned for either each region or each basin, oftentimes outperforming standard hydrological models. Additionally, LSTM models trained on many basins have been shown to also outperform standard hydrological models for prediction at ungauged basins, demonstrating the potential for LSTM models to be used as regional hydrological models (Kratzert et al., 2019a). However, while addressing the need to learn complex sequential information, this approach does not explicitly consider spatial information, and as such has been primarily used as a lumped hydrological model with basin-averaged values or point observations as input.

Another type of DL architecture, the convolutional neural network (CNN), is specifically designed to learn spatial information. Learning is achieved through convolving an input with a layer of filters made up of trainable parameters. The development of CNNs was largely driven by image classification applications (Krizhevsky et al., 2012).
In the geosciences, CNNs have gained popularity only relatively recently, with applications including long-term El Niño forecasting (Ham et al., 2019), precipitation downscaling (Vandal et al., 2017), and urban water flow forecasting (Assem et al., 2017). Importantly, CNNs have been combined with LSTMs to encode both spatial and temporal information. Sequential CNN-LSTM models have been used to map input videos to class labels or text captions, where frames in the video are passed first through a CNN, the output of which is then passed through an LSTM (Donahue et al., 2017). Alternatively, LSTM models with convolutional rather than fully connected (or 'dense') layers have also been used to encode spatiotemporal information for applications including precipitation nowcasting (Shi et al., 2015). In hydrology, CNN (and particularly combined CNN-LSTM) models have seen fewer applications to date as compared to the LSTM approach, with recent work developing 1D CNNs for rainfall-runoff modelling (Hussain et al., 2020; Van et al., 2020).

Historically, hydrological model development has emphasized understanding and incorporating physical processes in order to improve model performance (Freeze and Harlan, 1969; Hedstrom and Pomeroy, 1998; Marsh et al., 2020; Painter et al., 2016; Pomeroy and Gray, 1990). Considering the emphasis on process-based modelling within the hydrological community (Bahremand, 2016) and the multifaceted challenges surrounding water management (Milly et al., 2008; Wheater and Gober, 2013), it is important that DL-based hydrological models are interpretable and trustworthy in addition to being successful in simulating accurate streamflow. How a model is interpreted, and what it means to interpret a DL model, depends on the model architecture (e.g. ANN, CNN, LSTM), the task the model is performing (e.g. regression, classification), and the research questions being asked with the model.
Several methods to interpret DL models have been developed, as outlined below.
One approach to interpret CNN models is to visualize the regions in the input that are most important for decision making, which can be done for both classification and regression problems. Techniques such as class activation mapping (CAM) and gradient class activation mapping (Grad-CAM) utilize deep feature maps to determine the regions most important for classification (Selvaraju et al., 2020). Another technique, layerwise relevance propagation (LRP), backpropagates from a single output neuron through the trained network to identify the input region which is most relevant for determining the value of the output neuron (Bach et al., 2015). For LRP, the propagation rules used depend on model architecture (Arras et al., 2019; Bach et al., 2015; Toms et al., 2020). In contrast to these 'white-box' approaches that interpret the model through explicit use of the model parameters, 'black-box' approaches do not utilize internal network states for interpretation. For example, techniques such as occlusion (Zeiler and Fergus, 2014) and randomized image sampling explanation (RISE) (Petsiuk et al., 2018) iteratively gray- or zero-out subsets of an input image and measure how a model's predictions change due to this perturbation. Occlusion and RISE can identify the area in the input where the model's predictions are most sensitive to perturbation, which can be interpreted as being the most important information for the model to have in order to make its prediction.
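The occlusion idea can be sketched in a few lines of Python. This is a minimal illustration, not the cited implementations; the scalar `toy_model` and the patch size are hypothetical stand-ins:

```python
import numpy as np

def occlusion_map(model, image, patch=4):
    """Slide a zeroed-out patch over the image and record how much the
    model's scalar output changes at each patch location."""
    h, w = image.shape
    base = model(image)
    heat = np.zeros((h, w))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # gray/zero out the patch
            # A larger change means this region mattered more for the prediction.
            heat[i:i + patch, j:j + patch] = abs(model(occluded) - base)
    return heat

# Toy model for which only the top-left quadrant of the input matters:
toy_model = lambda img: img[:8, :8].sum()
heat = occlusion_map(toy_model, np.ones((16, 16)))
```

As expected, the resulting heat map is nonzero only where occlusion actually changes the toy model's output, i.e. in the top-left quadrant.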
Recurrent networks can be challenging to interpret, as the relevance of any feature in the network depends on the processing of previous features. LSTMs have often been interpreted by analysing their internal states (Shen, 2018). For example, Karpathy et al. (2015) visually inspect cell states of an LSTM trained for natural language processing applications to identify states which track various recognizable text features, such as quotations and line length. Most states, however, were found to be uninterpretable (Karpathy et al., 2015). A similar approach has been taken for interpreting LSTMs in hydrology; for example, Kratzert et al. (2018) discuss cell states as being comparable to storages in traditional hydrological models. They show that the evolution of one cell state closely resembles the dynamics of a snowpack, increasing when temperatures are below freezing and quickly depleting when temperatures rise above freezing (Kratzert et al., 2018). More recently, LRP has been adapted for the LSTM architecture (Arras et al., 2019); however, to our knowledge there are no examples of its use in the geoscientific literature.
https://doi.org/10.5194/hess-2021-113 Preprint. Discussion started: 11 March 2021. © Author(s) 2021. CC BY 4.0 License.

We aim to create a relatively simple and interpretable DL model which maps spatiotemporal weather fields, represented by gridded climate data at a relatively coarse (~75 km) spatial resolution, to streamflow at multiple stream gauge stations across a region. By explicitly encoding spatial information, we aim to develop a DL analogue of a distributed hydrological model which predicts streamflow on a regional scale without the need for climate downscaling. Our specific objectives for this paper are to: 1) evaluate how well the sequential convolutional-recurrent model performs when predicting daily streamflow simultaneously at multiple stream gauge stations across a region, 2) investigate if the model has learned to focus on the areas within or near the watersheds where streamflow is being predicted, and 3) investigate if the model has learned physically interpretable temporal links which drive the timing and intensity of the spring freshet in snowmelt-dominated basins. The first objective is related to evaluating the accuracy of the model's predictions, while the latter two objectives relate to model interpretability. We do not undertake an exhaustive parameter search to create the best or most complex model; rather, we develop a model with relatively few predictor variables which is sufficient for achieving these objectives.

The paper is structured in the following way: in Sect. 2, we discuss the study region. In Sect. 3, we outline the datasets used and detail our decision making for choosing the input and output variables. In Sect. 4, we outline our methods, including the architecture, training, and evaluation of the model, and describe the experiments developed for interpreting the model's learning. In Sect. 5, we present and discuss the results of our analysis, and we present a summary and conclusions in Sect. 6.

Study region
We chose the south-central domain of Western Canada as our study region, containing large sections of the provinces of British Columbia (BC) and Alberta (AB) (Fig. 1). This region contains a range of hydroclimate regimes, allowing our modelling approach to be evaluated across a range of conditions. In winter, relatively warm and moist Pacific air is advected into the study region, leading to frequent rainfall events at low elevations along the west coast of British Columbia, where the maritime climate is wetter and milder compared to much of the rest of the study region. While most precipitation typically falls as rain at lower elevations, substantial winter snowfall leads to seasonally continuous snow and substantial glacier coverage at higher elevations in the coastal region (Trubilowicz et al., 2016). Cooler winter conditions through much of the rest of the province allow for the accumulation of a seasonal snowpack (Moore et al., 2010). In contrast, winters in Alberta are colder and drier given the influence of Arctic air masses. Substantial snowfall can occur in Alberta when comparably moist Pacific air crosses the Rockies and interacts with cold and dry Arctic air (Sinclair and Marshall, 2009; Vickers et al., 2001), but most precipitation in Alberta falls as rain in the spring and early summer. The seasonal streamflow characteristics are described in Sect. 3.2.

Streamflow data
We use daily streamflow data (in m³ s⁻¹) extracted from Environment and Climate Change Canada's Historical Hydrometric Data website (HYDAT) (Environment and Climate Change Canada, 2018). We only use stations which are upstream of dams (measuring naturalized flows) and which are currently active. Many stream gauges do not record data every day of the year, so we only select stream gauges which have no more than 40% of daily data missing between 1979 and 2015 and for which no more than 1 year is missing more than 40% of data. For all missing data, we fill the daily value with the average value of that day between 1979 and 2015. If all years are missing that day (which is true for some stations which do not record in low-flow periods), we fill the missing day with the minimum seasonal streamflow. The threshold of 40% is chosen to allow for relatively dense spatial coverage of stations across the study region and is acceptable considering that most missing data are during low-flow seasons when rainfall and snowmelt are not strongly driving streamflow dynamics. Allowing one year with greater than 40% missing data is acceptable because it substantially increases the station density. There are 279 stream gauge stations in Alberta which are active and have naturalized flow. Of these, 120 meet the aforementioned criteria; however, only 66 meet the stricter criterion of having all years with less than 40% missing data. In BC, there are 288 active and naturalized stream gauges; of these, 145 meet the less strict criteria and 108 meet the stricter criteria. Missing data are a common feature of geoscientific datasets and pose a challenge for the use of machine learning models (Karpatne et al., 2019), and so creating a suitably large training dataset may require pre-processing steps like those outlined above. We further restrict the study region to stations south of 56° N because stream gauge density is greater below this latitude.
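The gap-filling rules above can be sketched as follows. This is a minimal NumPy illustration with hypothetical variable names, not the authors' code; a 365-day calendar is assumed, matching the model's input length:

```python
import numpy as np

def fill_missing(flow, doy, min_flow):
    """Fill NaNs in a daily streamflow series with the multi-year mean of
    that calendar day; if a day is missing in every year, fall back to a
    seasonal minimum flow. `flow` and `doy` (day of year, 1..365) are 1-D."""
    flow = flow.copy()
    for d in range(1, 366):
        sel = doy == d
        vals = flow[sel]
        if np.all(np.isnan(vals)):
            flow[sel] = min_flow          # day never observed in any year
        else:
            clim = np.nanmean(vals)       # multi-year mean of this calendar day
            flow[sel] = np.where(np.isnan(vals), clim, vals)
    return flow

# Two synthetic years: day 10 missing in year 1, day 5 missing in both years.
doy = np.tile(np.arange(1, 366), 2)
flow = np.ones(730)
flow[9] = np.nan
flow[365 + 9] = 3.0
flow[4] = np.nan
flow[369] = np.nan
filled = fill_missing(flow, doy, min_flow=0.5)
```

Here the missing day 10 is filled with the other year's value (3.0), while day 5, absent in every year, receives the fallback minimum flow.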
Out of the total 265 stations that meet the less strict criteria, 226 stream gauge stations are south of 56° N. North of 56° N, 15% of stream gauges span 33% of the study region, and so this restriction reduces memory and computational requirements and increases the spatial stream gauge density. As our goal is not to develop a model at continental scale, but to investigate how well the model can learn different streamflow regimes, the above-described data selection process is justified for our application. The data from each of the 226 stations are normalized so that each station's streamflow has a mean of zero and unity variance over the training period.

Streamflow clusters
To optimally facilitate the investigation of the model's learning with the methods described in Sect. 4, we cluster the stations in our study region into subdomains based on both similarity in streamflow regime and proximity in space. The clustering input (observation) for each station is a vector where the first 365 dimensions are the seasonal streamflow, the next 182 dimensions are repeated values of latitude, and the final 182 dimensions are repeated values of longitude (Fig. A1 in Appendix A). This clustering input is designed to give seasonal streamflow and geographic location similar weight in the clustering algorithm, where approximately one half of the input is seasonal streamflow, one quarter is latitude, and one quarter is longitude. By clustering in this way, the stream gauges that belong to the same cluster are likely to have similar streamflow and experience similar climatic conditions. Seasonal streamflow is normalized at each station to have zero mean and unity variance, while latitude and longitude are each normalized across all stations to have a mean of zero and unity variance. We use agglomerative hierarchical clustering with Ward's method (Hastie et al., 2009) to identify six subdomains or clusters of streamflow (Fig. 2). The number of clusters chosen (six in this case) is determined from the dendrogram (Fig. A2). We refer to the clusters as north-western, north-eastern, central, southern, eastern, and coastal, as labelled in Fig. 2.

There are key differences between the streamflow regimes identified by the clustering (Fig. 2). Only the lower-elevation coastal stream gauges are characterized by low summer flows and high winter flows driven by winter rainfall events; all other clusters differ from one another largely in the timing and intensity of both the spring freshet (the first streamflow peak in a year) and a second rainfall-driven peak occurring in spring, summer, or fall.
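Building the clustering input and cutting the Ward dendrogram can be sketched as below. This is a minimal example with random stand-in station data (20 stations rather than 226); only the vector layout and the clustering method follow the text:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_stations = 20                                  # 226 in the paper; small here
seasonal = rng.normal(size=(n_stations, 365))    # normalized seasonal streamflow
lat = rng.normal(size=n_stations)                # normalized latitude
lon = rng.normal(size=n_stations)                # normalized longitude

# 365 seasonal values + 182 repeats of latitude + 182 repeats of longitude,
# giving streamflow ~1/2 of the weight and each coordinate ~1/4.
X = np.hstack([seasonal,
               np.repeat(lat[:, None], 182, axis=1),
               np.repeat(lon[:, None], 182, axis=1)])

Z = linkage(X, method="ward")                    # agglomerative, Ward's method
labels = fcluster(Z, t=6, criterion="maxclust")  # cut dendrogram at 6 clusters
```

With real data, inspecting the dendrogram (as in Fig. A2) rather than fixing `t=6` up front is what motivates the choice of six clusters.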
The eastern and north-eastern clusters are characterized by a relatively early spring freshet, followed by rainfall-driven streamflow peaks in early summer. The southern, central, and north-western stations are characterized by a later and more sustained spring runoff, in part due to a longer-lasting snowpack which accumulates from the relatively higher rates of winter precipitation in British Columbia.

Weather data
As input weather variables to the model, we select daily fields of precipitation, maximum temperature, and minimum temperature, extracted from the ERA5 reanalysis (Hersbach et al., 2020) from the European Centre for Medium-Range Weather Forecasts (ECMWF). Data are downloaded at 6-hourly temporal resolution and 0.75° x 0.75° spatial resolution for the time period 1979–2015. Our selection of variables is based on the assumption that the combination of precipitation and temperature is sufficient for estimating both how precipitation contributes to streamflow as rainfall, and the onset, intensity, and longevity of the spring freshet through the seasonal accumulation and ablation of a snowpack. We recognize that the underlying physics which governs streamflow throughout the year is more complex than these comparably simple assumptions (e.g. interactions between surface water and groundwater (Hayashi et al., 2016), evapotranspiration (Penman and Keen, 1948), snow redistribution from wind (Pomeroy and Li, 2000)); however, we assume that temperature and precipitation from reanalysis data can act as proxies from which most of the information can be inferred (e.g. Essou et al., 2016). While additional variables could be used as climatic drivers of streamflow (e.g. solar radiation, evaporation, wind), we opt for a simpler model with fewer input variables as a proof of concept and to achieve our goals stated in Sect. 1.
ERA5 reanalysis was chosen as it is globally available from 1979 through the present (once complete, it will be available from 1950 onwards) and has been shown to compare well against other reanalysis products (Hersbach et al., 2020). Importantly, the precipitation output from ERA5 has been found to typically outperform the earlier ERA-Interim reanalysis in the northern Great Plains region, which experiences a similar climate to the Prairie region in our study area (Xu et al., 2019). We downloaded total precipitation (variable name: 'tp'; parameter ID: 228) and near-surface air temperature (variable name: '2m Temperature'; parameter ID: 500011), from which daily total precipitation, daily maximum temperature, and daily minimum temperature are calculated.
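Reducing the 6-hourly downloads to the three daily fields can be sketched as below. The arrays are random stand-ins for the ERA5 't2m' and 'tp' fields; only the four-steps-per-day grouping is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
# One year of 6-hourly fields on a small (time, lat, lon) grid.
t2m = rng.normal(270.0, 10.0, size=(365 * 4, 12, 32))   # K, 4 steps per day
tp = rng.exponential(0.001, size=(365 * 4, 12, 32))     # m per 6 h

# Group the four 6-hourly steps of each day, then reduce over that axis.
t2m_daily = t2m.reshape(365, 4, 12, 32)
tmax = t2m_daily.max(axis=1)                        # daily maximum temperature
tmin = t2m_daily.min(axis=1)                        # daily minimum temperature
precip = tp.reshape(365, 4, 12, 32).sum(axis=1)     # daily total precipitation
```

Each resulting field has one value per day per grid cell, ready to be stacked as the three channels of a daily weather image (Sect. 4).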

Methods
Here we summarize our methods before providing details of each key step. We use a sequential CNN-LSTM model to map weather predictors to streamflow at multiple stream gauge stations simultaneously. As input data, we use the past 365 days of weather, covering the whole study region, in order to predict streamflow of the next day at N stream gauge stations (Fig. 3 and Table 1). The CNN learns the spatial features in each day of the input, while the LSTM learns the temporal relationship between these features in order to predict streamflow. After the model is trained, we evaluate its performance against the observed streamflow over a testing period, which is independent of the training period. Finally, we introduce two experiments to investigate the model's learning. The first experiment is focused on interpreting the learning of spatial links between the predictors and streamflow, while the second is focused on the learning of links between temperature and the snowmelt-driven freshet. This section provides a brief overview of both the CNN and LSTM architectures, followed by a description of our CNN-LSTM model design and training, and finally a description of the metrics and experiments developed for model evaluation and interpretation.
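The end-to-end data flow just described can be traced with stand-in linear maps. Random matrices replace the trained CNN, LSTM, and dense layers here; only the array shapes (365 frames of a 12 x 32 grid with 3 channels, feature vectors of length 32, hidden state of length 80, 226 stations) follow the model described below:

```python
import numpy as np

rng = np.random.default_rng(4)
video = rng.normal(size=(365, 12, 32, 3))  # 365 daily frames, 12x32 grid, 3 channels

# CNN stand-in: each frame -> a spatial feature vector of length 32.
cnn = rng.normal(size=(12 * 32 * 3, 32))
features = video.reshape(365, -1) @ cnn          # shape (365, 32)

# LSTM stand-in: the feature sequence -> a final hidden state of length 80.
lstm = rng.normal(size=(32, 80))
h_final = np.tanh(features @ lstm).mean(axis=0)  # shape (80,)

# Dense stand-in: hidden state -> next-day streamflow at 226 stations.
dense = rng.normal(size=(80, 226))
streamflow = h_final @ dense                     # shape (226,)
```

The real CNN and LSTM are of course nonlinear and trained (Sects. 4.1-4.3); this sketch only shows how one weather video is reduced to one day of multi-station streamflow.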

CNN overview
The CNN is constructed using two main types of layers: convolutional and pooling. Convolutional layers are made up of multiple filters, which are constructed from trainable weights. Each filter convolves across an input layer to produce an output image which is then passed through a nonlinear activation function. Mathematically, a single output neuron can be calculated from a single filter as:

z = f\left(\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{k=1}^{D} w_{ijk}\, x_{ijk} + b\right), (1)

where z is the value of one neuron in the output layer, f is the nonlinear activation function, w_{ijk} are the weights of the filter, x_{ijk} is the region of the input layer, b is the bias value of the output neuron, and W, H, and D correspond to the width (e.g. number of pixels along the x-direction of the image), height (e.g. number of pixels along the y-direction of the image), and depth (e.g. number of channels of the image) of the input, respectively. Pooling layers reduce image resolution, which reduces the memory requirements of the network; for example, a 2x2 max-pooling layer will reduce the number of pixels by a factor of 4 by outputting only the maximum value of each 2x2 region of the input. CNN architectures often have a repeating structure of several convolutional layers followed by a pooling layer. Through training, the convolutional layers learn the spatial features present, with more abstract features being learned at deeper layers, and the pooling layers reduce images to smaller and smaller sizes. The output feature vector is encoded with the learned spatial information from the input.
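The single-neuron computation of Eq. (1) can be written directly in code. This is a sketch, with ReLU standing in for the unspecified activation f:

```python
import numpy as np

def conv_neuron(x_patch, w, b, activation=lambda z: np.maximum(z, 0.0)):
    """One output neuron of a convolutional layer: elementwise product of a
    W x H x D input patch with a filter of the same shape, summed, plus a
    bias, passed through a nonlinear activation (ReLU here)."""
    return activation(np.sum(w * x_patch) + b)

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 3, 3))   # one 3x3 patch of a 3-channel input
w = rng.normal(size=(3, 3, 3))   # one 3x3x3 filter
out = conv_neuron(x, w, b=0.1)
```

A full convolutional layer evaluates this at every patch position for every filter, producing one output channel per filter.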

LSTM overview
The LSTM network output is determined by the interaction between two internal states: the cell state c_t, which acts as the memory of the network, and the hidden state h_t, which is an intermediate output of the network. Both states are updated at each time step (1 ≤ t ≤ T) by a series of gates through which information can flow: the forget gate f_t, the input gate i_t, the potential cell update \tilde{c}_t, and the output gate o_t. Each time step of the input is concatenated with the hidden state as calculated in the prior time step before being passed through the network; in this way, learned information from previous time steps is used to calculate the next output. In the following equations, weights (W) and biases (b) are the learnable parameters in the network:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), (2)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), (3)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), (4)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), (5)

where x_t is the input vector to the LSTM at time t, tanh is the hyperbolic tangent function, σ is the sigmoid function, and square brackets indicate concatenation. The cell state at time t is determined by the prior cell state and the interactions with the outputs of the forget gate, input gate, and potential cell update, while the hidden state at time t is determined by the new cell state and the output gate:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ \tilde{c}_t, (6)
h_t = o_t ⊙ \tanh(c_t), (7)

where ⊙ denotes elementwise multiplication. The final hidden state, h_T, is passed through a dense layer constructed of fully connected neurons. The activation of this dense layer is linear, and so the final output is a linear transformation of the final hidden state:

y_{pred} = W_d h_T + b_d. (8)

This output, y_{pred}, is a vector of normalized streamflow for a single day at multiple stream gauge stations, encoded with the learned sequential information from the input series.
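One LSTM time step (Eqs. 2-7) can be implemented compactly by stacking the four gate weight matrices into a single matrix. This is an illustrative NumPy sketch with small, randomly initialized weights, not the trained model:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b, units):
    """One LSTM time step: the four gates are computed from the
    concatenation [h_{t-1}, x_t], then the cell and hidden states update."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ np.concatenate([h_prev, x_t]) + b   # all four gates at once
    f = sigmoid(z[0 * units:1 * units])         # forget gate (Eq. 2)
    i = sigmoid(z[1 * units:2 * units])         # input gate (Eq. 3)
    c_tilde = np.tanh(z[2 * units:3 * units])   # potential cell update (Eq. 4)
    o = sigmoid(z[3 * units:4 * units])         # output gate (Eq. 5)
    c = f * c_prev + i * c_tilde                # cell state update (Eq. 6)
    h = o * np.tanh(c)                          # hidden state update (Eq. 7)
    return h, c

rng = np.random.default_rng(3)
units, n_in = 8, 4
W = rng.normal(scale=0.1, size=(4 * units, units + n_in))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)
for t in range(5):                              # run a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b, units)
```

Because h is the product of a sigmoid output and tanh(c), every component of the hidden state is bounded in (-1, 1).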

Sequential CNN-LSTM architecture
An overview of the model architecture is shown in Fig. 3, and information on each layer is presented in Table 1. We use a sequential CNN-LSTM model in order to simultaneously map the previous 365 days of temperature and precipitation to the next day of streamflow at multiple stream gauges throughout the study region (i.e. days 1 through 365 of weather are used to predict day 366 of streamflow). Daily weather images are constructed where the height and width of the image correspond to the number of grid cells along latitude and longitude, respectively, and with three channels corresponding to normalized maximum temperature (T_max), minimum temperature (T_min), and precipitation (P). Yearly weather videos are constructed from the past 365 days of weather images, where each frame in the video is a weather image. One year-long weather video is used as an input to predict the next day of streamflow at the 226 stream gauge stations. Each frame in the video is passed independently through the CNN, which converts each of the 365 frames into a feature vector of length 32. This feature vector is a representation of the learned spatial features found in that frame of weather. There are 365 feature vectors generated from one year-long weather video, since there are 365 days in the input video. This series of feature vectors is then passed through an LSTM, which learns the sequential relationship between the learned spatial features and outputs a final hidden vector, h_T (Equation 7), of length 80. This hidden vector contains information on the sequential relationships between the spatial features and is next passed through a dense layer with linear activation to connect to the final output neurons (y_{pred}, Equation 8).

Training the DL model requires a balance of having sufficient complexity to learn the mapping from weather to streamflow, but without being so complex that the model overfits to the training set and performs poorly on the validation set.
With that in mind, we designed this architecture (Table 1) considering the following: 1) the number of pooling layers is limited by the relatively small input images (spatial size of 12 x 32); 2) the number of filters in the deepest layer of the CNN determines the length of the spatial feature vector (the input to the LSTM); and 3) the number of parameters in a single LSTM layer scales linearly with the length of the spatial feature vector and quadratically with the number of LSTM units. In addition to these general guiding principles, we found that a single LSTM layer with more units performed better than multiple LSTM layers with fewer units, as had previously been used when predicting streamflow at a single station (Kratzert et al., 2018). The success of a single LSTM layer with more units is likely because we map the LSTM hidden state to multiple stream gauges (a higher-dimensional space) rather than a single neuron, and so more units are required for this mapping to work well. Additionally, we found that including 32 filters of size 1x1 as the first layer improved model performance.
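The scaling noted in point 3 follows from the standard LSTM parameter count: four gates, each with a weight matrix over [h_{t-1}, x_t] plus a bias. A short sketch (the 80-unit, length-32 configuration follows the architecture described above; the helper name is ours):

```python
def lstm_param_count(units, input_len):
    """Trainable parameters in one LSTM layer: four gates, each with a
    weight matrix over the concatenation [h_{t-1}, x_t] and a bias vector."""
    return 4 * (units * (units + input_len) + units)

# 80 units fed by length-32 spatial feature vectors:
n = lstm_param_count(80, 32)
```

Doubling the feature length adds 4 * units * input_len parameters (linear), whereas doubling the units grows the units-squared term (quadratic), which is why the feature-vector length is the cheaper knob to turn.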

Training
We use fine-tuning (Razavian et al., 2014) to train our model in two steps:
1. Bulk training: a model is trained on all 226 stream gauge stations in the region.
2. Fine-tuning: the bulk model is further trained on stations from each of the six clusters (Fig. 1) in the following way. The bulk model is copied six times, with one copy used for each cluster, but the last dense layer in the bulk model is removed and replaced with a new dense layer which has as many neurons as stations in that cluster. Weights in the new dense layer are randomly initialized. Each fine-tuned model is then trained further on only the stations in that cluster.
For both bulk and fine-tuned models, early stopping is used to reduce overfitting. We use a dropout layer with a dropout rate of 0.1 between the CNN and LSTM layers for regularization (Srivastava et al., 2014). We use batch sizes of 64, a learning rate of 10⁻⁴, mean squared error loss, and Adam optimization (Kingma and Ba, 2017). For all models, we use 1979–2000 for training, 2001–2010 for validation, and 2011–2015 for testing. We use the Keras (Chollet, 2015) and Tensorflow (Abadi et al., 2016) libraries in Python (Van Rossum and Drake, 2009), and we use Google Colab for access to a cloud GPU. In total, we train 10 bulk models and further fine-tune each one, yielding an ensemble of 10 bulk models and an ensemble of 10 fine-tuned models per cluster. Training a single bulk model on a single cloud GPU in Colab takes on the order of tens of minutes.

Evaluation of model performance
We evaluate how well streamflow is simulated by the bulk and fine-tuned models with the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970). For each station, we calculate NSE over the test period for both the bulk and fine-tuned models, using the ensemble mean as the final model output. NSE is defined as:

NSE = 1 - \frac{\sum_{t=1}^{T} (Q_m^t - Q_o^t)^2}{\sum_{t=1}^{T} (Q_o^t - \bar{Q}_o)^2}, (9)

where T is the total number of time steps in the test series, Q_m^t is the modelled streamflow for that station at time t, Q_o^t is the observed streamflow for that station at time t, and \bar{Q}_o is the mean observed streamflow for that station over the whole test period. The overall performance of both the bulk and fine-tuned models is evaluated by the median NSE of all stations over the test period. When NSE = 1, the modelled streamflow is exactly equal to the observed streamflow, while NSE < 0 indicates very poor model performance, as more variability would be captured by representing the streamflow with its mean value than with the modelled streamflow.
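Equation (9) translates directly into code (a minimal sketch; the function name is ours):

```python
import numpy as np

def nse(modelled, observed):
    """Nash-Sutcliffe Efficiency (Eq. 9): 1 minus the ratio of the model's
    squared error to the variance of the observations about their mean."""
    modelled = np.asarray(modelled, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return 1.0 - (np.sum((modelled - observed) ** 2)
                  / np.sum((observed - observed.mean()) ** 2))

obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
```

A perfect simulation gives NSE = 1, predicting the observed mean at every step gives NSE = 0, and anything worse than the mean is negative, matching the interpretation in the text.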

Spatial perturbations
We interpret the model's learning of spatial links by testing the following hypothesis: if the model is learning physical processes that drive streamflow at a given stream gauge station, then the modelled streamflow at that station should be most sensitive to perturbations in the watershed or vicinity of that station, and less sensitive to perturbations further from that station.
To test this hypothesis, we perturb small spatial regions of the input weather video and determine how sensitive the predicted streamflow of each cluster of stream gauge stations (Fig. 1) is to the areas which are perturbed. To evaluate the regions of the input space which are most important for streamflow predictions at each stream gauge, we take the following steps, each of which will be discussed in more detail:
1. Define a gaussian perturbation centred at a randomly selected point and add it to every channel and frame of the input weather video.
2. Pass the perturbed video through the trained model and calculate the absolute difference between the unperturbed and perturbed modelled streamflow at each stream gauge.
3. Quantify the importance of the perturbed area by defining a sensitivity heat map for each stream gauge station.
4. Iterate through steps 1–3 until the mean sensitivity heat maps converge, then calculate the mean heat map for each streamflow cluster.
5. Compare the sensitivity within or near each cluster's watersheds to the sensitivity far from them.

Steps 1–4 are similar to the occlusion (Zeiler and Fergus, 2014) and RISE (Petsiuk et al., 2018) algorithms in that we iteratively perturb the input and generate sensitivity heat maps based on how the output changes. The RISE approach zeroes out portions of the input image, which here would be equivalent to setting a portion of the input to the mean weather values, since the input variables are normalized to have zero mean; therefore, the difference between the perturbed and unperturbed input would depend on how close the input variables are to their mean values on each day. Instead, here we perturb the input by adding or subtracting a 2D gaussian from the input video, which alters each day in the input regardless of whether it is near the mean or not. We developed this method, as opposed to using already established methods such as occlusion, RISE, or LRP, because our method is both agnostic of model architecture and grounded in a physical understanding of the key processes taking place (i.e. the perturbations add synthetic warm/wet or cold/dry areas, and we determine if and how this changes streamflow in the perturbed basins).
In step 1, we define a Gaussian perturbation, $p$, as:

$$p(x, y) = s\,A \exp\left( -\frac{(x - x_0)^2}{2\sigma_x^2} - \frac{(y - y_0)^2}{2\sigma_y^2} \right)$$

where $x$ and $y$ are longitude and latitude (in degrees), $x_0$ and $y_0$ are the longitude and latitude of a randomly selected point within the study domain (in degrees), $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian in the x- and y-directions (in degrees), and $s$ is a multiplicative factor which has equal probability of being either 1 or -1 for each perturbation. The Gaussian has an amplitude $A$ of 1, and the standard deviations are 1.5 pixels in both the x- and y-directions. The amplitude determines the strength of the perturbation and the standard deviations determine its extent. The amplitude was chosen to be 1, equating to a maximum perturbation of a single standard deviation of each input variable. $\sigma_x$ and $\sigma_y$ were chosen so that the Gaussian perturbation is small relative to both the height and width of the input weather frame, but larger than a majority of basins. This perturbation is added to every channel (predictor variable) and frame in the input video. An example of a perturbation, a perturbed maximum temperature field, and the perturbed streamflow response is shown in Fig. A3.
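Step 1 can be sketched as follows. This is a hypothetical illustration, assuming a 0.75° grid (so 1.5 pixels corresponds to 1.125°) and a simple (days, lat, lon, channels) video layout; it is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def gaussian_perturbation(lons, lats, amplitude=1.0, sigma=1.5 * 0.75):
    """2D Gaussian bump of amplitude 1 centred on a randomly chosen grid
    point, with sign s = +1 or -1 chosen with equal probability."""
    x0, y0 = rng.choice(lons), rng.choice(lats)
    s = rng.choice([-1.0, 1.0])
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    return s * amplitude * np.exp(
        -((lon_grid - x0) ** 2 + (lat_grid - y0) ** 2) / (2.0 * sigma ** 2)
    )

# The same 2D field is added to every frame and every channel of the video.
lons = np.arange(-130.0, -110.0, 0.75)
lats = np.arange(48.0, 60.0, 0.75)
video = np.zeros((365, lats.size, lons.size, 3))  # (days, H, W, channels)
p = gaussian_perturbation(lons, lats)
perturbed = video + p[None, :, :, None]
```

Because the bump is centred on a grid point, the maximum absolute perturbation equals the amplitude, i.e. one standard deviation of each normalized input variable.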
In step 2, we pass the perturbed video through the trained model and calculate the absolute value of the difference between the unperturbed and the perturbed modelled streamflow for each stream gauge. From this difference at each station, we can quantify how important the perturbed area is for the model's decision making. We quantify the importance in step 3 by defining a sensitivity heat map for each stream gauge station:

$$S_i(x, y) = \left| Q_m^i - Q_{m,p}^i \right| \, \left| p(x, y) \right|$$

where $S_i(x, y)$ is the sensitivity heat map of stream gauge $i$, $Q_m^i$ is the unperturbed modelled streamflow at stream gauge $i$, $Q_{m,p}^i$ is the perturbed modelled streamflow at stream gauge $i$, and $p(x, y)$ is the perturbation. Each perturbation produces 226 different sensitivity heat maps, corresponding to one heat map for each stream gauge station. In step 4, we iterate through the first three steps for each day in the test series until the mean sensitivity heat maps converge, here taken as when the relative error between heat maps of subsequent perturbations falls below 0.5%. Then, for each streamflow cluster, we calculate the mean sensitivity heat map across all stream gauges in the cluster, all iterations, and all days in the test series as:

$$\bar{S}_c(x, y) = \frac{1}{N_c N_p T} \sum_{i=1}^{N_c} \sum_{k=1}^{N_p} \sum_{t=1}^{T} S_{i,k,t}(x, y)$$

where $N_c$ is the number of stream gauges in cluster $c$, $N_p$ is the number of perturbation iterations, and $T$ is the number of days in the test series. Finally, in step 5, we identify the values in the cluster heat map which are either: 1) within the watershed or within 1 pixel of distance from the watershed boundaries, or 2) further than 1 pixel in distance from the watershed boundaries (where 1 pixel has size 0.75° x 0.75°). If the model is focusing on the areas which are within or near the cluster's basins, then we expect the sensitivity within or near the basins to have a higher mean and a substantially different distribution than the sensitivity outside or far from the basins.
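Steps 2-3 amount to a forward pass with and without the perturbation, followed by a per-station heat map. A schematic sketch with a toy stand-in for `model` (the real model is the trained CNN-LSTM, and reading the heat map as |ΔQ| weighted by the perturbation footprint is our interpretation of the garbled source equation):

```python
import numpy as np

def sensitivity_maps(model, video, perturbation):
    """One heat map per station: the absolute change in modelled streamflow
    caused by the perturbation, spread over the perturbation's footprint."""
    q_unpert = model(video)                                  # (n_stations,)
    q_pert = model(video + perturbation[None, :, :, None])   # same shape
    dq = np.abs(q_pert - q_unpert)
    return dq[:, None, None] * np.abs(perturbation)[None, :, :]

# Toy stand-in: two "stations" whose flow is a weighted sum over the video.
model = lambda v: np.array([v.sum(), 2.0 * v.sum()])
video = np.zeros((2, 3, 4, 1))           # (days, H, W, channels)
maps = sensitivity_maps(model, video, np.ones((3, 4)))
print(maps.shape)                        # (2, 3, 4): one map per station
```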
To evaluate how different the distributions of sensitivity are, we calculate the Kolmogorov-Smirnov D-statistic (Chakravarti et al., 1967) between the distribution of sensitivity within/near the watersheds and the distribution outside/far from the watersheds. Additionally, we characterize the heat maps by the value A, here defined as the area fraction of the sensitivity heat map which exceeds the half-maximum sensitivity. Smaller values of A (closer to 0) indicate that the model is focused on a smaller portion of the input area, while larger values of A (closer to 1) indicate that larger portions of the input video are important for the model's prediction at that station. D and A are calculated from the ensemble mean heat map of each cluster and each station, respectively.
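Both summary statistics are simple to compute from a heat map and a watershed mask. A NumPy-only sketch, under the assumption that D compares the empirical CDFs of the inside/near versus outside/far sensitivity samples (the two-sample KS statistic is written out explicitly here rather than taken from a library):

```python
import numpy as np

def d_statistic(inside, outside):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance
    between the empirical CDFs of the two sensitivity samples."""
    grid = np.sort(np.concatenate([inside, outside]))
    cdf_in = np.searchsorted(np.sort(inside), grid, side="right") / inside.size
    cdf_out = np.searchsorted(np.sort(outside), grid, side="right") / outside.size
    return float(np.max(np.abs(cdf_in - cdf_out)))

def area_fraction(heat_map):
    """A: fraction of the heat map exceeding half its maximum sensitivity."""
    return float(np.mean(heat_map > 0.5 * heat_map.max()))

# Fully separated samples give D = 1; in the toy heat map below, two of the
# four pixels exceed half the maximum, so A = 0.5.
print(d_statistic(np.zeros(4), np.ones(4)))   # 1.0
heat = np.array([[1.0, 0.4], [0.6, 0.2]])
print(area_fraction(heat))                    # 0.5
```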

Temperature perturbations
We assume that the transition from below- to above-freezing temperatures is strongly related to the onset of snowmelt and thus the timing of the freshet. While this assumption is a simplification of processes dictated by the surface energy balance, the use of positive temperatures as an indicator of snow warming and melt is a common assumption of positive-degree-day models in simulating snow and glacier melt across multiple spatial scales (e.g. Hock, 2003; Radic et al., 2014). Therefore, for interpreting the model's learning, we introduce the following hypothesis: if the model is learning physical processes which drive streamflow over the course of one year, and since snowmelt is a key contributor to streamflow, then the modelled freshet should occur once temperatures in the forcing data have transitioned from below- to above-freezing. To test the hypothesis, we add a spatially uniform temperature perturbation, ΔT, to both the maximum and minimum temperature channels, i.e. the same temperature change in degrees Celsius is added to every pixel and every day in the test period. With this perturbation we create a new test set which has either warmer or colder temperature channels than the original, but the same precipitation channel. We pass this new test set through the model and compute the mean seasonal flow for each cluster, where the mean is derived across all years in the test set and all stations in the cluster. We perform these steps for the range -5 °C ≤ ΔT ≤ 5 °C with an increment of 1 °C to test how the modelled streamflow responds under a range of warmer or cooler conditions. Then, for each cluster region and each temperature perturbation, we identify when the 30-day running means of daily maximum and minimum temperature transition from below- to above-freezing:

$$d_{T,\mathrm{max}} = \min\left\{ d : \overline{T}_{\mathrm{max}}(d) > 0\,^{\circ}\mathrm{C} \right\}, \qquad d_{T,\mathrm{min}} = \min\left\{ d : \overline{T}_{\mathrm{min}}(d) > 0\,^{\circ}\mathrm{C} \right\}$$

where $d_{T,\mathrm{max}}$ and $d_{T,\mathrm{min}}$ indicate the day when the maximum and minimum temperatures, respectively, first warm above freezing, and $\overline{T}$ denotes the 30-day running mean.
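The transition day for a perturbed temperature series follows directly from the running mean. A hypothetical illustration with a synthetic annual cycle (not the reanalysis data), showing that a uniform +2 °C perturbation moves the transition earlier:

```python
import numpy as np

def transition_day(temps, window=30):
    """First day (index of the window's last day) on which the `window`-day
    running mean of daily temperature (deg C) rises above freezing."""
    running = np.convolve(temps, np.ones(window) / window, mode="valid")
    above = np.nonzero(running > 0.0)[0]
    return int(above[0]) + window - 1 if above.size else None

# Synthetic annual cycle, coldest at day 0; a +2 degC uniform perturbation
# (as in the experiment) shifts the transition to earlier in the year.
days = np.arange(365)
temps = -15.0 * np.cos(2.0 * np.pi * days / 365.0)
print(transition_day(temps + 2.0) < transition_day(temps))  # True
```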
The timing of a freshet has been previously defined in different ways, each with the goal of indicating when spring snowmelt is strongly contributing to streamflow (e.g. Vincent et al., 2015; Zhang et al., 2001). For each cluster and temperature perturbation, we define the freshet timing, $d_{\mathrm{freshet}}$, as the day when the 30-day running mean of modelled streamflow rises to halfway between the winter minimum flow ($Q_{\mathrm{min}}$) and the spring maximum flow ($Q_{\mathrm{max}}$):

$$d_{\mathrm{freshet}} = \min\left\{ d : \overline{Q}(d) \geq Q_{\mathrm{min}} + \tfrac{1}{2}\left( Q_{\mathrm{max}} - Q_{\mathrm{min}} \right) \right\}$$

where $\overline{Q}$ denotes the 30-day running mean of modelled streamflow. For each cluster and temperature perturbation, we also define the intensity or magnitude of the freshet to be the spring maximum flow, $Q_{\mathrm{max}}$. By perturbing temperatures in the range -5 °C ≤ ΔT ≤ 5 °C, we can track how well the model learns the links between the temperature transitions and the intensity and timing of the snowmelt-driven freshets.
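The freshet-timing definition can be sketched the same way. A toy step hydrograph is used here (with window=32 chosen purely so the arithmetic is exact in floating point; the paper uses 30 days), and the helper assumes a snowmelt-type regime where the running-mean minimum precedes the maximum:

```python
import numpy as np

def freshet_timing(q, window=30):
    """Day when the running mean of flow first reaches halfway between its
    minimum (winter low flow) and its maximum (spring peak)."""
    running = np.convolve(q, np.ones(window) / window, mode="valid")
    q_min, q_max = running.min(), running.max()
    halfway = q_min + 0.5 * (q_max - q_min)
    return int(np.nonzero(running >= halfway)[0][0]) + window - 1

# Flow steps from 0 to 32 at day 180; the running mean reaches the halfway
# level once half of the averaging window lies in the high-flow period.
q = np.concatenate([np.zeros(180), np.full(185, 32.0)])
print(freshet_timing(q, window=32))  # 195
```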

Evaluation of NSE
For each station, we derive ensemble-mean streamflow for the bulk model runs and fine-tuned model runs. The median fine-tuned NSE calculated over the test period is 0.68, and 35% of stream gauges have NSE > 0.8 (Fig. 4a). We compare the performance of the bulk versus fine-tuned models by looking at the difference in NSE between the bulk and fine-tuned models (ΔNSE), evaluated across stations for each cluster (Fig. 5a) and in space (Fig. A4a). We find that overall, there is a small increase in NSE, with only a median ΔNSE = 0.02. The best performing stations are those in the central, southern, and north-western clusters, all of which have snowmelt dominated streamflow regimes throughout BC (Fig. 2). For these clusters, which represent a majority of stations, there is relatively little change in NSE between the bulk and fine-tuned models (Fig. 5a). The eastern cluster, which is made up of stations in the relatively arid Prairie region, has the worst overall performance and shows only slight improvements after fine-tuning. The coastal cluster, which is made up of rainfall dominated stations along the west coast, has a relatively narrow range of NSE and shows the largest improvement from fine-tuning. The north-eastern cluster, which is characterized by comparable snowmelt- and rainfall-driven peaks in spring and summer, respectively, also shows a notable improvement from fine-tuning. Importantly, the median NSE is relatively consistent across model runs in the fine-tuned ensemble, with a range of only 0.05 across all 10 fine-tuned model runs. This result indicates that in terms of NSE, the fine-tuned model runs perform similarly as evaluated across the whole region.

To illustrate the model's performance in simulating different streamflow regimes, we compare the fine-tuned model output between a station with a snowmelt dominated regime and a station with a rainfall dominated regime (Fig. 6). The snowmelt dominated station is well simulated by the ensemble mean (NSE = 0.87), capturing the timing and magnitude of many daily- or weekly-scale streamflow peaks; however, the ±2σ interval is not consistently narrow throughout the year (Fig. 6a). Rather, it is smallest in the low-flow periods and freshet, and larger during the recession over spring and summer.
The modelled streamflow for the rainfall dominated station yields a lower NSE than the snowmelt dominated station (NSE = 0.59), despite the ensemble mean correctly modelling the timing of most rainfall-induced onsets, peaks, and decays.
However, the peak magnitude in streamflow is often under- or over-estimated, particularly for the largest observed peaks (Fig. 6b). The ±2σ interval is relatively narrow throughout the year, indicating that the 10 fine-tuned models output relatively similar streamflow.

Figure 6: Examples of observed and modelled streamflow for one year at two stations of different streamflow regimes.
The ensemble mean is the mean across the 10 model runs, and the shaded area is plus/minus two standard deviations across the 10 model runs. The station with the fifth-highest NSE is chosen in each of the southern and coastal clusters, which are snowmelt- and rain-dominated clusters, respectively. An arbitrary year in the test set is chosen. We chose the fifth-best performing station to show more typical model performance for these clusters.

Spatial perturbations
Stream gauge stations, watershed boundaries, and sensitivity heat maps for each cluster are displayed in Fig. 7. The central, southern, and coastal regions are generally most sensitive in the areas near and within the watersheds of the cluster, which means that information near the watersheds is most important for the model to predict streamflow. The models' predictions are relatively insensitive to perturbations further from the watersheds, as indicated by the low sensitivity values in Alberta for the coastal cluster and in northern British Columbia for the southern and central clusters. This result indicates that information far from the watersheds is less important for the models' decision making. In contrast, the eastern and north-eastern clusters are sensitive both within the watersheds of the cluster and in a second sensitive area along the west coast of British Columbia. These findings indicate that models for these latter clusters are relying on links across space and time which may be present between the input and output datasets, but which are not physically driving streamflow; consequently, long-term forecasting is likely not appropriate, as these links may not hold in the future.

The comparison of sensitivity between regions within/near the watersheds and regions outside/far from the watersheds is summarized for each cluster (Fig. 8). The steps to calculate the D-statistic for one cluster are illustrated in a separate figure. The D-statistic is further evaluated by comparing the bulk and fine-tuned models (Fig. 5b). All clusters except the north-eastern cluster show an increase in D through fine-tuning; in particular, southern stations show the largest increase in D. Because D is a metric which indicates how different the inside/outside sensitivity distributions are from one another, the widespread increase in D through fine-tuning indicates that fine-tuning helps the model separate more relevant information (within/near watershed regions having higher sensitivity) from less relevant information (outside/far from watershed regions having lower sensitivity). We also compare the values of A between the bulk and fine-tuned models per cluster (Fig. 5c) and in space (Fig. A4b).
Here, all clusters have their mean A decrease with fine-tuning, and the median difference in A between the bulk and fine-tuned models is ΔA = -0.09, meaning that the fine-tuned models are sensitive to smaller areas of the input relative to the bulk model. While the central cluster shows only a minor decrease in A, all other clusters show a larger decrease. Because A is a metric which indicates the fraction of the input area that is most sensitive to perturbation, the widespread decrease in A indicates that the process of fine-tuning generally helps the model to focus on smaller areas of the input space.
It has previously been shown that fine-tuning an LSTM-based hydrological model can lead to a moderate improvement in performance which is heterogeneous in space (Kratzert et al., 2018). Here we build on this understanding to show that fine-tuning also substantially influences what the model is learning. Comparing the results between clusters in terms of NSE, D, and A, we find that the process of fine-tuning does not impact model performance equally across all clusters (Fig. 5). Specifically, the central cluster is the least impacted by fine-tuning, as indicated by the relatively small differences in NSE, D, and A between the bulk and fine-tuned models. The bulk model focuses on large areas of the input, and since the central cluster spans a relatively large area of the input space (the largest of all the clusters), the bulk model is already effective in learning the weather-to-streamflow mapping, and further fine-tuning does not substantially change its learning. On the other hand, the coastal cluster is more impacted by fine-tuning, with simulated streamflow more closely matching observations (increase in NSE) and more realistic learning (increase in D and a substantial decrease in A). In other words, fine-tuning has made the model focus on smaller regions near the watersheds, which has led to better performance. Considering the processes which drive streamflow, fine-tuning has minimal impact on NSE at the southern and central clusters (which have melt-dominated flow), and most improves NSE at the coastal, north-western, north-eastern, and eastern clusters (where a rain-driven flow peak is present in the seasonal hydrograph) (Fig. 2 and Fig. 5a). Seeing as fine-tuning also more substantially decreases A at these latter clusters, and that rainfall occurs over smaller spatial scales than snowmelt, we suggest that the process of fine-tuning allows the model to better learn rainfall-runoff processes.
Interestingly, the model run which performs best according to NSE is not necessarily the one which has best learned to focus within the watersheds being predicted. All models in the fine-tuned ensemble achieve relatively similar performance as evaluated by the median NSE (range of 0.05 across the 10 models), while there is a much larger range of median A (range of 0.25 across the 10 models). Furthermore, NSE and A are not significantly correlated (correlation coefficient = 0.05 with p-value > 0.1). In fact, the model with the highest median NSE has the lowest median A (NSE = 0.69 and A = 0.41). The range in A across the ensemble and the lack of significant correlation between NSE and A indicate that while all model runs can output streamflow to a similar degree of accuracy, the internal learning differs among model runs.

Temperature perturbations
The intensity and timing of the modelled freshet, as well as the timing of the transition from below- to above-freezing temperature, reveal characteristic patterns of snowmelt for the snowmelt dominated clusters (Fig. 9). When no temperature perturbation is added (ΔT = 0 °C), the snowmelt driven streamflow in the southern, central, north-western, north-eastern, and eastern clusters shows a large increase after temperatures rise above freezing. For these snowmelt driven streamflow regimes, a positive temperature perturbation (ΔT > 0 °C) advances the timing of the freshet's onset while the intensity of the freshet decreases (Fig. 10). A possible physical interpretation of this result is that a warmer climate would lead to both a smaller fraction of precipitation falling as snow rather than rain and a shorter cold season, leading to a thinner seasonal snowpack, an earlier onset of melt, and less water to drive streamflow in spring. Similarly, a decrease in temperature (ΔT < 0 °C) delays the timing of the freshet's onset while the intensity of the freshet increases (Fig. 10). These results are consistent with the rationale that a colder climate would lead to both a larger fraction of precipitation falling as snow rather than rain and a longer cold season, building a deeper snowpack which can deliver a larger volume of water to streamflow in spring. Notably, in the more rainfall driven coastal and north-western clusters, fall and winter streamflow (approximately before April and after December; Fig. 9d and 9e) is suppressed when temperatures are lowered and enhanced when temperatures are raised (ΔT > 0 °C), which can physically be explained by more precipitation falling as rain, leading to a faster streamflow response.
Importantly, while the timing of the freshet onset relative to the timing of above-freezing maximum/minimum temperatures may not be the same for all clusters (Fig. 9 and Fig. 10), all clusters respond similarly to a change in the timing of this temperature transition, and this response is consistent with a physical understanding of the drivers of streamflow.

Discussion
Our model achieves comparable performance to previous studies which have used deep learning to model streamflow across a region from meteorological inputs; for example, LSTMs have achieved a median NSE of 0.72 across 241 catchments in the United States (Kratzert et al., 2018), with the worst performance in the more arid regions (similar to our model's poor performance in the eastern cluster). Additionally, Kratzert et al. (2019) achieved a median NSE of 0.63 across 531 catchments in the United States and found that model performance improved (median NSE of 0.74 across the same catchments) when catchment characteristics were included in the input to incorporate information related to climate, topography, vegetation, and the subsurface. The CNN-LSTM modelling framework introduced here could be extended to include spatially distributed variables which are constant in time (such as topography and permeability), for example by using these variables as additional channels in each frame of the input video.
One of our key findings is that the model performs well (high NSE values) in all clusters except for the eastern cluster.
We compare our results with findings from process-based models previously used to predict streamflow at 45 of the same stations as in our study (Table 2). We identify studies which modelled streamflow at daily temporal resolution and evaluated performance using NSE or correlation between observed and modelled streamflow over at least a one-year period. In total, we selected 45 stations for this comparison, keeping in mind that this is not an exhaustive comparison of all studies across the region, nor do we claim that the identified studies are directly comparable with our results, as each defines calibration and validation periods differently. As our goals are to explore the CNN-LSTM model architecture and interpret its decision making, and not necessarily to outperform existing process-based models, we do not attempt an exhaustive station-by-station comparison against process-based models. The intercomparison shows that the CNN-LSTM model performance is at least similar to, and often better than, many process-based models in the literature for the southern, central, coastal, north-western, and north-eastern clusters (Table 2). Our model achieves higher values of NSE or correlation at 37 of the 45 identified stream gauge stations, indicating generally good performance in all clusters except the eastern cluster. It is notable that the CNN-LSTM model achieves this performance with only coarse-resolution temperature and precipitation as input. The eastern cluster is unique among our six clusters in terms of our model's poor performance, the region's hydrology, and the lack of studies in the literature at the same locations and with the same metrics for comparison. Our model's inability to successfully simulate streamflow for the eastern cluster (Fig. 5a) could be due to the unique hydrological conditions in the Prairie region, which cannot be learned from the past 365 days of temperature and precipitation alone.
Prairie topography is characterized by small surface depressions which result in intermittent water connectivity and variably sized drainage basins (Shaw et al., 2012). Rainfall and snowmelt are stored in upstream depressions rather than contributing to streamflow when the depressions are not full, and so there may not be a substantial streamflow response even if there is rainfall or snowmelt within the basin. Additionally, there is hysteresis between contributing area and water storage within these depressions, meaning that the area which contributes to streamflow is determined by both the presence of depressional storage and whether the storage is increasing (wetting) or decreasing (drying) (Shook and Pomeroy, 2011). Because storage in ponds can vary substantially on both seasonal and decadal timescales (Hayashi et al., 2016; Shaw et al., 2012), it is likely that the CNN-LSTM model has insufficient input information to learn the complex rainfall-runoff behaviour observed in the eastern basins (e.g. both the seasonal and decadal fluctuations of depressional storage cannot be learned from only a seasonal-scale input time series). Several studies have developed process-based models for stations in the Prairie region other than ours, which have outperformed our DL-based approach. While our model had a median NSE = 0.17, process-based models achieve higher values, for example NSE > 0.4 (Mengistu and Spence, 2016; Unduche et al., 2018) and NSE > 0.7 (Muhammad et al., 2019).
Considering the complexity of hydrology in real catchments and the dependence of streamflow on locally resolved processes, it is notable that the CNN-LSTM model achieves good streamflow simulation with only temperature and precipitation forcing data. The model's performance is likely limited by its inability to learn processes beyond those which can be inferred from the streamflow response to temperature and precipitation, such as advective fluxes (e.g. wind transport of snow), evaporative fluxes (e.g. sublimation of the snowpack, evapotranspiration), and interactions between ground and surface water. Additionally, our model uses forcing data at a comparably coarse spatial resolution (0.75° x 0.75°, or ~75 km) relative to the studies identified in Table 2 (e.g. 0.0625° x 0.0625° in Shrestha et al. (2012); 10 km resolution in Eum et al. (2017)). While it is notable that the CNN-LSTM model achieves good performance without climate downscaling, it is also possible that the absence of locally resolved meteorological information limits the model's learning and performance. Nevertheless, the model could still be successful in regions where such processes matter less than those which can be inferred from coarse-resolution temperature and precipitation alone. For example, winters are generally warmer and wetter in British Columbia (where the model performs better) than in Alberta (where the model performs worse), which may limit the importance of processes such as blowing snow transport and sublimation by increasing cohesion of the snowpack (Pomeroy and Gray, 1990). Additionally, characteristics such as topography and subsurface geology play an important role in determining how surface and ground water interact. However, since geology and topography remain essentially constant through the study period, the controls they exert on streamflow should also remain consistent through the study period.
As such, their role in determining streamflow at each station may be encoded as part of the mapping between the meteorological forcing and the streamflow response, so long as it is consistent through the study period. For the model to learn the mapping between meteorological forcing and streamflow, a sufficiently long data record is necessary for training. The CNN-LSTM architecture presented here predicts streamflow at multiple stations simultaneously, so the number of observations used for training is decreased by a factor of N compared to using an LSTM model at each station individually, where N is the number of stream gauge stations (e.g. for each year of daily streamflow at N stations, the CNN-LSTM model is trained with 365 observations, while the N individual LSTM models would together be trained with N × 365 observations). For a given architecture, a smaller training set means fewer computational requirements to train a single epoch; however, this is only beneficial so long as the training dataset remains large enough to achieve good model performance. There may be regions of interest where the streamflow record is not as extensive as in our study region, and training only within the region of interest may not produce sufficiently accurate streamflow predictions. A potential solution could be to use transfer learning, with a CNN-LSTM model pre-trained in a region with a sufficiently long streamflow record and then transferred to the new region of interest.

Summary and conclusions
This study investigated the applicability of a sequential CNN-LSTM model for regional scale hydrological modelling, where the model was forced by gridded climate data to predict streamflow at multiple stream gauge stations simultaneously.
We focused on using a relatively simple deep learning model, with the input data represented by temperature and precipitation reanalysis for the period 1979-2015 at a relatively coarse spatial resolution (0.75° x 0.75°). In particular, we investigated how well the model learns different streamflow regimes and how physically realistic the model's learning is for each streamflow regime. To reach these goals, the model was trained, validated, and tested on a set of stream gauge stations across Western Canada, initially partitioned into six clusters based on similarity in both seasonal streamflow regime and proximity in space. A set of metrics was introduced to evaluate the model's performance and to investigate its learning, in particular the learning of spatial and temporal processes taking place in the model's weather-to-streamflow mapping. We summarize the major findings as follows:

1) The model successfully simulated streamflow at multiple stations simultaneously, with a median NSE of 0.68 and 35% of stations having NSE > 0.8. The best model performance was for stations with snowmelt dominated streamflow in British Columbia, and the worst performance was for the eastern cluster of stations in the Prairie region.
The poor performance in the Prairie region may be due to the importance of processes which are underrepresented or absent in the training data, such as processes occurring over longer-than-annual timescales or at smaller spatial scales, or which cannot be inferred from a single year of temperature and precipitation alone.
2) For a majority of stations, the model was most sensitive to perturbations in the input data prescribed near and within the basins being predicted, demonstrating that the model's spatial learning focused on areas where the physical drivers of streamflow occur. For the eastern and north-eastern clusters, the model was sensitive to perturbations relatively far (> 500 km) from the watersheds where streamflow was being predicted, thus linking the streamflow to distant weather fields. In these cases, the model may be more appropriate for short-term rather than long-term prediction, as the learned links over far distances may not hold in the future.

3) To investigate the learning of temporal patterns, we focused on the timing and intensity of the spring freshet. By uniformly perturbing the temperature input to drive the model with climates warmer and colder than present, the model responded by changing the intensity and timing of the freshet in accordance with the timing of the transition from below- to above-freezing temperatures.

4) Fine-tuning by streamflow regime led to modest improvements in model performance as evaluated by NSE, but allowed the model to focus on areas near and within the watersheds where streamflow is being predicted. We conclude that fine-tuning is most beneficial for directing the model to focus on processes taking place at relatively smaller spatiotemporal scales (e.g. rainfall-driven as opposed to snowmelt-driven streamflow).
The CNN-LSTM model presented here explicitly incorporates both spatial and temporal information for predicting streamflow across a region. In addition to successfully simulating streamflow across a range of streamflow regimes, we are able to interpret key aspects of the model's learning. Interpretability of model learning builds trust in the model's predictions, opening further applications whether for prediction, as a complement to process-based models, or in further exploration of processes which are poorly understood or represented in process-based models.

[Figure caption fragment] The mean sensitivity evaluated over the test period for the ensemble of models, with red indicating more sensitive and blue indicating less sensitive. C) The sensitivity distributions (within/near in pink and not within/near in blue), calculated by kernel density estimation (KDE) using Gaussian kernels and Scott's rule for kernel bandwidth selection. D) The Kolmogorov-Smirnov D-statistic is calculated from the sensitivity cumulative density functions (CDFs).