Resampling and ensemble techniques for improving ANN-based high streamflow forecast accuracy

Data-driven flow forecasting models, such as Artificial Neural Networks (ANNs), are increasingly used for operational flood warning systems. However, flow distributions are highly imbalanced, resulting in poor prediction accuracy on high flows, both in terms of amplitude and timing error. Resampling and ensemble techniques have been shown to improve model performance on imbalanced datasets such as streamflow. In this research, we systematically evaluate and compare three resampling techniques, random undersampling (RUS), random oversampling (ROS), and the synthetic minority oversampling technique for regression (SMOTER), and four ensemble techniques, randomised weights and biases, bagging, adaptive boosting (AdaBoost), and least squares boosting (LSBoost), on their ability to improve high flow prediction accuracy using ANNs. The methods are implemented both independently and in combined, hybrid techniques. While some of these combinations have been explored in the broader machine learning literature, this research contains many of the first applications of these algorithms to the imbalance problem inherent in flood and high flow forecasting models. Specifically, the implementation of ROS, and new approaches for SMOTER, LSBoost, and SMOTER-AdaBoost, are presented in this research. Data from two Canadian watersheds (the Bow River in Alberta and the Don River in Ontario), representing distinct hydrological systems, are used as the basis for the comparison of the methods. The models are evaluated on overall performance and on high flows. The results of this research indicate that resampling produces marginal improvements to high flow prediction accuracy, whereas ensemble methods produce more substantial improvements, with or without a resampling method. Compared to simple ANN flow forecast models, the use of ensemble methods is recommended to reduce the amplitude and timing error in highly imbalanced flow datasets.

Finally, there is considerable evidence that ensemble-based and resampling techniques can improve prediction accuracy on infrequent samples such as high flows (Galar et al., 2012). Ensemble methods, such as bootstrap aggregating (bagging) and boosting, are known for their ability to improve model generalisation. Such methods are widely used in classification studies and are increasingly being adapted for regression tasks (Moniz et al., 2017b). However, ensemble methods alone do not directly address the imbalance problem, as they typically do not explicitly consider the distribution of the target dataset. Thus, ensemble methods are often combined with preprocessing strategies to address the imbalance problem (Galar et al., 2012). Resampling is a common preprocessing technique that can be used to create a more uniformly distributed target dataset or to generate synthetic samples.

Improvements to flow forecasting models have been identified as a key strategy for mitigating flood damage (Khan et al., 2018). The following section provides descriptions of the two watersheds under study. The parametrisation of the single ANN models used to predict flow in each watershed (referred to as the base models) is then described. The output of the base models is used to exemplify the inability of these ANNs to accurately predict high flows (from both an amplitude and a temporal error perspective) and to illustrate the imbalance problem.
The Don River, illustrated in Fig. 1 (b), begins in the Oak Ridges Moraine and winds through the Greater Toronto Area until it meets Lake Ontario in downtown Toronto. The 360 km² Don River watershed is heavily urbanised, meaning that the high flows seen in the river are largely attributable to direct runoff following intense rainfall events. Its urbanised landscape has also contributed to periodic historical flooding (Toronto and Region Conservation Authority, 2020a). Persistent severe flooding (most recently in 2005 and 2013) has motivated calls for further mitigation strategies such as improved flow forecast models and early warning systems (Nirupama et al., 2014).
The histograms in Figure 2 illustrate the highly imbalanced domains of the target flow for both rivers. A high flow threshold (Θ_HF) is defined, which is used to distinguish between typical and high flows. Flow values greater than the threshold are referred to as high flows (q_HF), while flows below the threshold are referred to as typical flows (q_TF). Target flow statistics for the Bow and Don Rivers are provided for the complete flow distribution, as well as for the q_TF and q_HF subsets, in Table 1.

https://doi.org/10.5194/hess-2020-430 Preprint. Discussion started: 9 October 2020. © Author(s) 2020. CC BY 4.0 License.

The utilisation of a fixed threshold for distinguishing between common and rare samples is used both in flow forecasting (Crochemore et al., 2015; Razali et al., 2020; Fleming et al., 2015) and in more general machine learning studies focused on the imbalance problem (Moniz et al., 2017a). In this research, the high flow threshold is simply and arbitrarily taken as the 80th percentile value of the observed flow. The threshold value is ideally derived from the physical characteristics of the river (i.e., the stage at which water exceeds the bank or the water level associated with a given return period); unfortunately, this site-specific information is not readily available for the subject watersheds. An important consideration when selecting a Θ_HF value is that it produces a sufficient number of high flow samples; too few samples risks overfitting and poor generalisation. The distinction between typical and high flows is used in some of the resampling techniques in Sect. 3.1 and for assessing model performance in Sect. 3.4.

Base model description
The base models, also referred to as the base learners, for both systems use upstream hydro-meteorological inputs (water level, precipitation, and temperature) to predict the downstream water level (the target variable). The multi-layer perceptron (MLP) ANN is used as the base model for this study and the selected model hyperparameters are summarised in Table 2. The MLP-ANN was chosen as the base model because it is the most commonly used machine learning architecture for predicting water resources variables in river systems (Maier et al., 2010). The base models use a hidden layer of 10 neurons; a grid-search of different hidden layer sizes indicated that larger numbers of hidden neurons have little impact on the ANN performance. Thus, to prevent needlessly increasing model complexity, a small hidden layer is favoured. The number of training epochs is determined using early-stopping (also called stop-training), which is performed by dividing the calibration data into training and validation subsets; the training data are used to tune the ANN weights and biases, whereas the validation performance is used to determine when to stop training (Anctil and Lauzon, 2004). For this study, the optimum number of epochs is assumed to be reached if the error on the validation set increases for 5 consecutive epochs. Early-stopping is a common technique for achieving generalisation and preventing overfitting (Anctil and Lauzon, 2004). […] used, which is common practice for ANN-based flow forecasting models (Snieder et al., 2020; Abbot and Marohasy, 2014; Fernando et al., 2009; Banjac et al., 2015).
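The base-model setup above (a small MLP with validation-based early stopping) can be sketched as follows. This is an illustrative reconstruction in Python using scikit-learn, not the study's MATLAB implementation; the synthetic autoregressive series and all variable names are ours.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic autoregressive "flow" series: persistence plus noise.
n = 500
q = np.cumsum(rng.normal(0.0, 1.0, n))
X = np.column_stack([q[:-2], q[1:-1]])  # lagged inputs q(t-1), q(t)
y = q[2:]                               # target q(t+1)

base_model = MLPRegressor(
    hidden_layer_sizes=(10,),  # small hidden layer, as in the study
    early_stopping=True,       # hold out a validation subset
    validation_fraction=0.2,
    n_iter_no_change=5,        # stop after 5 non-improving epochs
    max_iter=2000,
    random_state=0,
)
base_model.fit(X, y)
pred = base_model.predict(X)
```

The `early_stopping` option mirrors the stop-training procedure: a fraction of the calibration data is withheld and training halts once the validation error stops improving for 5 consecutive epochs.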
The Partial Correlation (PC) input variable selection (IVS) algorithm is used to determine the most suitable inputs for each model from the larger candidate set (He et al., 2011; Sharma, 2000). Previous research on the Don and Bow Rivers found that PC is generally capable of removing non-useful inputs in both systems, achieving reduced computational demand and improved model performance (Snieder et al., 2020). The simplicity and computational efficiency of the PC algorithm make it an appealing IVS algorithm for this application. The 25 most useful inputs amongst all the candidates listed in Table 3, determined by the PC algorithm, are used in the models for each watershed. A complete list of selected inputs is shown in Appendix A.
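A greedy partial-correlation selection of the kind described above can be sketched as follows: at each step, the candidate most correlated with the target, after the influence of already-selected inputs is removed via regression residuals, is added. This is our simplified illustration, not the exact algorithm of He et al. (2011) or Sharma (2000).

```python
import numpy as np

def residual(v, Z):
    """Residual of v after least-squares projection onto columns of Z."""
    if Z.shape[1] == 0:
        return v - v.mean()
    A = np.column_stack([Z, np.ones(len(v))])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def pc_ivs(X, y, n_select):
    """Greedy IVS by partial correlation with the target."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        Z = X[:, selected]
        ry = residual(y, Z)
        scores = [abs(np.corrcoef(residual(X[:, j], Z), ry)[0, 1])
                  for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy check: the target depends on candidate columns 0 and 2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.normal(size=300)
selected = pc_ivs(X, y, 2)  # expected to recover columns 0 and 2
```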
The Bow and Don River base models produce coefficients of Nash-Sutcliffe efficiency (CE) greater than 0.95 and 0.75, respectively. These scores are widely considered by hydrologists to indicate good performance (Crochemore et al., 2015). […] Crochemore et al. (2015) also describe the heteroscedastic nature of flow prediction models. This region of high flows also exhibits amplitude errors in excess of 1 m, casting doubt on the suitability of these models for flood forecasting applications. In Fig. 5 (b and c), the normalised inverse frequency of each sample point is plotted against the flow gradient, illustrating how the most frequent flow values typically have a low gradient with respect to the forecast lead time, given by (q_{t+L} − q_t)/L. Note that the inverse frequency is determined using 100 histogram bins. Thus, when such a relationship exists, it is unsurprising that model output predictions are similar to the most recent autoregressive input variable. Previous work that analysed trained ANN models for both subject watersheds demonstrates that the most recent autoregressive input variable is the most important variable for accurate flow predictions (Snieder et al., 2020). Without accounting for the imbalanced nature of flow data, data-driven models are prone to inadequate performance similar to that of the base models described above. Consequently, such models may not be suitable for flood related applications such as early flood warning systems. The following section describes and reviews resampling and ensemble methods, which are proposed as solutions to the imbalance problem.
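The diagnostic described above can be reproduced in a few lines: the flow gradient over the lead time, (q_{t+L} − q_t)/L, and the normalised inverse frequency of each flow value from a 100-bin histogram. The skewed synthetic series below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.gamma(2.0, 10.0, 2000)  # skewed, flow-like series
L = 6                           # forecast lead time (steps)

# Flow gradient with respect to the forecast lead time.
gradient = (q[L:] - q[:-L]) / L

# Normalised inverse frequency from a 100-bin histogram of the flows.
counts, edges = np.histogram(q[:-L], bins=100)
bin_idx = np.clip(np.digitize(q[:-L], edges[1:-1]), 0, 99)
inv_freq = 1.0 / counts[bin_idx]
inv_freq /= inv_freq.max()      # normalise to (0, 1]
```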
3 Review and description of methods for handling imbalanced target datasets

Many strategies have been proposed for handling imbalanced domains, which can be broadly categorised into three approaches: specialised preprocessing, learning methods, and combined methods (Haixiang et al., 2017; Moniz et al., 2018). According to a comprehensive review of imbalanced learning strategies (Haixiang et al., 2017), resampling and ensemble methods are among the most popular techniques employed. Specifically, a review of 527 papers on imbalanced classification (Haixiang et al., 2017) found that a resampling technique was used 156 times. From the same review, 218 of the 527 papers used an ensemble technique such as bagging or boosting. Many of the studies reviewed used combinations of available techniques and often propose novel hybrid approaches that incorporate elements from several algorithms. Since it is impractical to compare every unique algorithm that has been developed for handling imbalanced data, the scope of this research adheres to relatively basic techniques and combinations of resampling and ensemble methods. The following sections describe the resampling and ensemble methods used in this research. The review attempts to focus on hydrological studies featuring each of the methods; however, where this is not possible, examples from other fields are presented.
First, it is important to distinguish between the data imbalance addressed in this study and cost-sensitive imbalance. Imbalance in datasets can be characterised as a combination of two factors: imbalanced distributions of samples across the target domain and imbalanced user interest across the domain. Target domain imbalance is related solely to the native distribution of samples, while cost-sensitivity occurs when costs vary across the target domain. While both types of imbalance are relevant to the flow forecasting application of this research, cost-sensitive methods are complex and typically involve developing a relationship between misprediction and tangible costs, for example, property damage (Toth, 2016). Cost-sensitive learning is outside the scope of this research, which is focused on reducing high flow errors due to the imbalanced nature of the target flow data.

Resampling techniques
Resampling is widely used in machine learning to create subsets of the total available data with which to train models. Resampling is conducted for two purposes in this research: ensemble methods (discussed in Sect. 3.2) use repeated resampling to generate diversity among ensemble members (Brown et al., 2005), and preprocessing resampling changes the training data distribution to influence model performance across the target domain (Moniz et al., 2017a). The following sections discuss the use of resampling as a preprocessing technique.

Random undersampling

RUS is performed by subsampling a number of frequent cases equal to the number of infrequent cases, such that there is an even number in each category, achieving a more balanced distribution compared to the original set. As a result, all of the rare cases are used for training, while only a fraction of the normal cases is used. RUS is intuitive for classification problems; for two-class classification, the majority class is undersampled such that the number of samples drawn from each class is equal to the number of samples in the minority class (Yap et al., 2014). However, RUS is less straightforward for regression, as it requires continuous data first to be categorised, so as to allow for an even number of samples to be drawn from each category.
Categories must be selected appropriately such that they are continuous across the target domain and each category contains a sufficient number of samples to allow for diversity in the resampled dataset (Galar et al., 2013). Undersampling is scarcely used in flow forecasting applications, despite seeing widespread use in classification studies. Ruhana et al. (2014) demonstrate an application of fuzzy-based RUS for categorical flood risk classification with support vector machines (SVMs), which is motivated by the imbalanced nature of flood data. RUS is found to outperform both ROS and the synthetic minority oversampling technique (SMOTE) on average across 5 locations.
In this research, N available flow samples are categorised into N_TF typical and N_HF high flows based on the threshold Θ_HF. The undersampling scheme draws N_HF samples with replacement from each of the subsets, such that there is an equal number of each flow category. RUS can be performed with or without replacement; the former provides greater diversity when resampling is repeated several times, and thus this approach is selected for the present research.

Random oversampling
ROS simply consists of oversampling rare samples, thus modifying the training sample distribution through duplication (Yap et al., 2014). ROS is procedurally similar to RUS, also aiming to achieve a common number of frequent and infrequent samples.

Instead of subsampling the typical flows, high flows are resampled with replacement so that the number of samples matches that of the typical flow set. The duplication of high flows in the training dataset increases their relative contribution to the model's objective function during calibration. Compared to undersampling, oversampling has the advantage that all of the majority class samples are utilised. The drawback of this approach is an increased computational cost. There are few examples of ROS applications in the water resources literature; studies tend to favour SMOTE, which is discussed in the following section. Saffarpour et al. (2015) use oversampling to address the class imbalance of binary flood data; surprisingly, oversampling was found to decrease classification accuracy compared to the raw training dataset. Recently, Zhaowei et al. (2020) applied oversampling to vehicle traffic flow forecasting, as a response to the imbalance of the training data.
For ROS, as with RUS, N available flow samples are categorised into N_TF typical and N_HF high flows based on the threshold Θ_HF. The oversampling scheme draws N_TF samples with replacement from each of the subsets, such that there is an equal number of each flow category. ROS is distinguished from RUS in that it produces a larger sample set that inevitably contains duplicated high flow values.
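The two schemes above can be sketched side by side. This is a minimal illustration with a synthetic flow series and the 80th-percentile threshold; the function names are ours, not the study's.

```python
import numpy as np

def split_indices(y, theta_hf):
    """Indices of typical (q_TF) and high (q_HF) flows."""
    tf = np.flatnonzero(y <= theta_hf)
    hf = np.flatnonzero(y > theta_hf)
    return tf, hf

def rus(y, theta_hf, rng):
    """Random undersampling: draw N_HF samples from each subset."""
    tf, hf = split_indices(y, theta_hf)
    return np.concatenate([rng.choice(tf, size=len(hf), replace=True),
                           rng.choice(hf, size=len(hf), replace=True)])

def ros(y, theta_hf, rng):
    """Random oversampling: draw N_TF samples from each subset."""
    tf, hf = split_indices(y, theta_hf)
    return np.concatenate([rng.choice(tf, size=len(tf), replace=True),
                           rng.choice(hf, size=len(tf), replace=True)])

rng = np.random.default_rng(3)
y = rng.gamma(2.0, 10.0, 1000)          # skewed, flow-like target
theta_hf = np.percentile(y, 80)         # 80th-percentile threshold

idx_rus = rus(y, theta_hf, rng)         # smaller, balanced index set
idx_ros = ros(y, theta_hf, rng)         # larger, balanced index set
```

Both return balanced index sets; RUS shrinks the training set to twice the high-flow count, while ROS enlarges it to twice the typical-flow count.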

Synthetic minority oversampling technique for regression
SMOTER is a variation of the SMOTE classification resampling technique introduced by Chawla et al. (2002) that bypasses excessive duplication of samples by generating synthetic samples, which, unlike duplicates, create diversity within the ensembles. SMOTE is widely considered an improvement over simple ROS, as the increased diversity helps prevent overfitting (Ruhana et al., 2014). For a given sample, SMOTE generates synthetic samples by randomly selecting one of its k nearest points, determined using k-nearest neighbours (KNN), and sampling a value at a linear distance between the two neighbouring points.
The original SMOTE algorithm was developed for classification tasks; Torgo et al. (2013) developed the SMOTER variation, which is an adaptation of SMOTE for regression. SMOTER uses a fixed threshold to distinguish between 'rare' and 'normal' points. In addition to oversampling synthetic data, SMOTER also randomly undersamples normal values to achieve the desired ratio between rare and normal samples. The use of SMOTE in the development of models that predict streamflow has only recently been attempted. Atieh et al. (2017) use two methods for generalisation, Dropout and SMOTER, applied to ANN models that predict flow duration curves for ungauged basins. They found that SMOTER reduced the number of outlier predictions, and both approaches improved the performance of the ANN models. Wu et al. (2020) used SMOTE resampling in combination with AdaBoosted sparse Bayesian models. The combination of these methods resulted in improved model accuracy compared to previous studies using the same dataset. Razali et al. (2020) used SMOTE with various Bayesian network and machine learning techniques, including decision trees, KNN, and SVM. Each technique is applied to a highly imbalanced classified flood dataset (flood flow and non-flood flow categories); the SMOTE decision tree model achieved the highest classification accuracy. SMOTE decision trees have also been applied for estimating the pollutant removal efficiency of bioretention cells. Wang et al. (2019a) found that decision trees developed with SMOTE had the highest accuracy for predicting pollutant removal rates; the authors attribute the success of SMOTE to its ability to prevent the majority class from dominating the fitting process. Sufi Karimi et al. (2019) employ SMOTER resampling for stormwater flow prediction models. Their motivation for resampling is flow dataset imbalance and data sparsity. Several configurations are considered with varying degrees of oversampled synthetic and undersampled data. The findings of the study indicate that increasing the oversampling rate tends to improve model performance compared to the non-resampled model, while increasing the undersampling rate produces a marginal improvement. Collectively, these applications of SMOTE affirm its suitability for mitigating the imbalance problem in the flow forecasting models featured in this research.
SMOTER is adapted in this research following the method described by Torgo et al. (2013). One change in this adaptation is that rare cases are determined using the Θ_HF value, instead of a relevancy function. Similarly, only high values are considered as 'rare', instead of both low and high values, as in the original algorithm. Oversampling and undersampling are performed at rates of 400% and 0%, respectively, so as to obtain an equivalent number of normal and rare cases.
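A hedged sketch of this SMOTER adaptation follows: rare cases are flows above Θ_HF, and each synthetic case linearly interpolates, in both input and target space, between a rare sample and one of its k nearest rare neighbours. This is our simplification of Torgo et al. (2013), not their exact code, and the 400% oversampling rate is expressed by the `oversample_rate=4` default.

```python
import numpy as np

def smoter(X, y, theta_hf, oversample_rate=4, k=5, seed=0):
    """Generate synthetic rare (high-flow) cases by interpolation."""
    rng = np.random.default_rng(seed)
    rare = np.flatnonzero(y > theta_hf)
    Xr, yr = X[rare], y[rare]
    # Pairwise distances among rare cases in input space.
    d = np.linalg.norm(Xr[:, None, :] - Xr[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]       # k nearest rare neighbours
    Xs, ys = [], []
    for _ in range(oversample_rate):
        for i in range(len(rare)):
            j = nn[i, rng.integers(k)]      # random neighbour
            g = rng.random()                # interpolation gap in [0, 1)
            Xs.append(Xr[i] + g * (Xr[j] - Xr[i]))
            ys.append(yr[i] + g * (yr[j] - yr[i]))
    return np.vstack(Xs), np.array(ys)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = rng.gamma(2.0, 10.0, 200)
theta_hf = np.percentile(y, 80)
X_syn, y_syn = smoter(X, y, theta_hf)       # synthetic high-flow cases
```

Because every synthetic target interpolates between two values above the threshold, all generated cases remain in the high-flow category.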

Ensemble-based techniques
Ensembles are collections of models with diverse error distributions. Diversity in ensembles is achieved through a variety of methods, including varying the initial set of model parameters, varying the model topology, varying the training algorithm, and varying the training data (Sharkey, 1996; Brown et al., 2005). Ensembles are typically combined to form discrete predictions (Sharkey, 1996; Shu and Burn, 2004) or used to estimate the uncertainty attributable to the source of ensemble diversity (Tiwari and Chatterjee, 2010; Abrahart et al., 2012).
Model ensembles are defined in a variety of ways within the water resources literature. The term ensemble is widely used to describe a collection of numerical models whose divergent predictions are caused by uncertain initial conditions; numerical weather predictions are a common application of such ensembles (Leutbecher and Palmer, 2008). Ensemble Streamflow Prediction (ESP) refers to streamflow prediction as a counterpart to dynamic hydrological prediction; ESP models are based on historical data and are typically used when dynamic hydrological data are unavailable (Harrigan et al., 2018; Tanguy et al., 2017). Finally, within the machine learning literature, an ensemble of learners simply refers to any collection of data-driven models (Valentini and Masulli, 2002; Dietterich, 2000). While these definitions are not mutually exclusive, the latter definition is the one used throughout this research.
The predictions of multiple ensemble members may or may not be combined. In the latter case, the multiple predictions can be used to form a spread of predictions. Ensemble members are most commonly combined through simple averaging, though more complex combiners are sometimes used (Shu and Burn, 2004; Zaier et al., 2010). Ensembles that are combined to produce discrete predictions have been proven to outperform single models by reducing model bias and variance, thus improving overall model generalisability (Brown et al., 2005). This has led to their widespread application in hydrological modelling (Abrahart et al., 2012).
There are many distinct methods for creating ensembles. The purpose of this paper is not to review all ensemble algorithms, but rather to compare four ensemble methods that commonly appear in the literature: randomised weights and biases, bagging, adaptive boosting, and gradient boosting. While several studies have provided comparisons of ensemble methods, none have explicitly studied their effects on high flow prediction, nor their combination with resampling strategies, which is common in applications outside of flow forecasting.
Methods that aim to improve generalisability have shown promise in achieving improved prediction on high flows, which may be scarcely represented in training data. However, to the knowledge of the authors, no research has explicitly evaluated the efficacy of ensemble-based methods for improving high flow accuracy. Applications of ensemble methods for improving performance on imbalanced target variables have been thoroughly studied in the classification literature. Several classification studies have demonstrated how ensemble techniques can improve prediction accuracy for imbalanced classes (Galar et al., 2012; López et al., 2013; Díez-Pastor et al., 2015b, a; Błaszczyński and Stefanowski, 2015). Such methods are increasingly being adapted for regression problems (Moniz et al., 2017b, a), which is typically achieved by projecting continuous data into a classification dataset (Solomatine and Shrestha, 2004).

Randomised weights and biases
Randomised weights and biases is one of the simplest ensemble-based methods. In this method, ensemble members are only distinguished by the randomisation of the initial parameter values (i.e., the initial weights and biases for the ANNs in this research) used for training. An ensemble of ANNs is trained, each member having a different randomised set of initial weights and biases; thus, when trained, each ensemble member may converge to different final weight and bias values. Ensemble members are combined through averaging. This technique is often used, largely to alleviate variability in training outcomes and uncertainty associated with the initial weight and bias parameterisation (Shu and Burn, 2004; de Vos and Rientjes, 2005; Fleming et al., 2015; Barzegar et al., 2019). Despite its simplicity, this method has been demonstrated to produce considerable improvements in performance compared to a single ANN model, even outperforming more complex ensemble methods (Shu and Burn, 2004). The weights and biases of each ANN are initialised using the default initialisation function in MATLAB, and an ensemble size of 20 is used.
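The scheme can be sketched as follows: identical MLPs that differ only in their random seed, combined by simple averaging. scikit-learn stands in for the MATLAB ANNs; only 5 members are trained here for brevity (the study uses 20), and the data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)

ensemble = []
for seed in range(5):
    # Same topology and data; only the random initialisation
    # (and batch shuffling) differs between members.
    member = MLPRegressor(hidden_layer_sizes=(10,), max_iter=500,
                          random_state=seed)
    ensemble.append(member.fit(X, y))

# Combine members through simple averaging.
pred = np.mean([m.predict(X) for m in ensemble], axis=0)
```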

Bagging
Bagging is a widely used ensemble method first introduced by Breiman (1996). Bagging employs the bootstrap resampling method, which consists of sampling with replacement, to generate subsets of data on which to train ensemble members.
The ensemble members are combined through simple averaging to form discrete predictions. Bagging is a proven ensemble method in flood prediction studies and has been widely applied and refined, for both spatial and temporal prediction, since its introduction by Breiman (1996). Chapi et al. (2017) […]. Ouarda and Shu (2009) compared stacking and bagging ANN models against parametric regression for estimating low flow quantiles for summer and winter seasons and found higher performance in ANN models (single and ensemble) compared to traditional regression models. Cannon and Whitfield (2002) applied bagging to MLP-ANN models for predicting flow and found that bagging helped create the best performing ensemble neural network. Shu and Burn (2004) evaluated six approaches for creating ANN ensembles for regional flood frequency analysis, including bagging combined with either simple averaging or stacking; bagging resulted in higher performance compared to the basic ensemble method. In a later study, Shu and Ouarda (2007) used bagging and simple averaging to create ANN ensembles for estimating regional floods at ungauged sites. Implementing bagging is uncomplicated; the algorithm is described in its original publication (Breiman, 1996). This research uses a bagging ensemble of 20 members.
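Bagging reduces to a short loop: each member is trained on a bootstrap sample (drawn with replacement) and the members are averaged. In this illustrative sketch, a linear least-squares learner stands in for the ANN base learner; all names and data are ours.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; stand-in for the ANN."""
    A = np.column_stack([X, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bagging(X, y, n_members=20, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        members.append(fit_linear(X[idx], y[idx]))
    return members

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

members = bagging(X, y)
# Combine through simple averaging.
pred = np.mean([predict_linear(c, X) for c in members], axis=0)
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```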

Adaptive boosting
The AdaBoost algorithm was originally developed by Freund and Schapire (1996) for classification problems. The algorithm has undergone widespread adaptation, and its popularity has led to the development of many subvariations, which typically introduce improvements in performance and efficiency, or extensions to regression problems. This study uses the AdaBoost.RT variation (Solomatine and Shrestha, 2004; Shrestha and Solomatine, 2006). Broadly put, the AdaBoost algorithm begins by training an initial model. The following model in the ensemble is trained using a resampled or reweighted training set, based on the residual error of the previous model. This process is typically repeated until the desired ensemble size is achieved or a stopping criterion is met. Predictions are obtained by a weighted combination of the ensemble members, where model weights are a function of their overall error. Similar to bagging, there are many examples of AdaBoost applications for flow prediction. Solomatine and Shrestha (2004) compared various forms of AdaBoost against bagging in models predicting river flows and found AdaBoost.RT to outperform bagging. In a later study, the same authors compared the performance of AdaBoosted M5 tree models against ANN models for various applications, including predicting river flows in a catchment; they found higher performance in models that used the AdaBoost.RT algorithm compared to single ANNs (Shrestha and Solomatine, 2006). Liu et al. (2014) used AdaBoost.RT for calibrating process-based rainfall-runoff models and found improved performance over the single model predictions. Wu et al. (2020) compared boosted ensembles against bagged ensembles for predicting hourly streamflow and found that the combination of AdaBoost (using resampling) and Bayesian model averaging gave the highest performance.
The variant of AdaBoost in this research follows the AdaBoost.RT algorithm proposed by Solomatine and Shrestha (2004) and Shrestha and Solomatine (2006). This algorithm has three hyperparameters. The relative error threshold parameter is selected as the 80th percentile of the residuals of the base learner, and 20 ensemble members are trained. AdaBoost can be performed using either resampling or reweighting (Shrestha and Solomatine, 2006); resampling is used in this research as it has been found to typically outperform reweighting (Seiffert et al., 2008). Recently, several studies have independently proposed a modification to the original AdaBoost.RT algorithm that adaptively calculates the relative error threshold value for each new ensemble member (Wang et al., 2019b; Li et al., 2020). This modification was generally found to be detrimental to the performance of the models in the present research; thus, the static error threshold described in the original algorithm was used (Solomatine and Shrestha, 2004).
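A hedged sketch of AdaBoost.RT with resampling follows: samples whose absolute relative error exceeds the threshold φ keep their weight while the rest are down-weighted, so hard samples are drawn more often for the next member, and members are combined with weights derived from their error rates. A linear learner stands in for the ANN, and details such as the power parameter follow common descriptions of the algorithm, not the study's code.

```python
import numpy as np

def fit_linear(X, y):
    A = np.column_stack([X, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def adaboost_rt(X, y, phi, n_members=20, power=2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    D = np.full(n, 1.0 / n)                 # sample weights
    members, betas = [], []
    for _ in range(n_members):
        idx = rng.choice(n, size=n, p=D)    # resampling variant
        coef = fit_linear(X[idx], y[idx])
        are = np.abs((predict_linear(coef, X) - y) / y)
        eps = D[are > phi].sum()            # weighted "error rate"
        beta = max(eps, 1e-12) ** power
        # Down-weight easy samples; hard samples keep their weight.
        D = np.where(are <= phi, D * beta, D)
        D /= D.sum()
        members.append(coef)
        betas.append(beta)
    w = np.log(1.0 / np.array(betas))       # member combination weights

    def predict(Xq):
        preds = np.array([predict_linear(c, Xq) for c in members])
        return (w[:, None] * preds).sum(axis=0) / w.sum()
    return predict

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = 50.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)
phi = 0.01                                  # relative error threshold
model = adaboost_rt(X, y, phi)
pred = model(X)
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```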

Least squares boosting
LSBoost is a variant of gradient boosting, an algorithm that involves training an initial model, followed by a sequence of models that are each trained to predict the residuals of the previous model in the sequence. This is in contrast to the AdaBoost method, which uses the model residuals to inform a weighted sampling scheme for subsequent models. The prediction at a given training iteration is calculated by the weighted summation of the already trained models from the previous iterations.

For LSBoost, the weighting is determined by a least-squares loss function; other variants of gradient boosting use different loss functions (Friedman, 2000).
Gradient boosting algorithms have previously been used to improve efficiency and accuracy in flow forecasting applications. Ni et al. (2020) use the gradient boosting variant XGBoost, which uses Decision Trees (DTs) as the base learners, in combination with a Gaussian Mixture Model (GMM) for streamflow forecasting. The GMM is used to cluster streamflow data, and an XGBoost ensemble is fit to each cluster. Clustering streamflow data into distinct subsets for training is an old concept (Wang et al., 2006); it has a similar objective to the resampling methods employed in this research, which is to change the training sample distribution. The combination of XGBoost and GMM is found to outperform standalone SVM models. Erdal and Karakurt (2013) developed gradient boosted regression trees and ANNs for predicting daily streamflow and found gradient boosted ANNs to have higher performance than their regression tree counterparts. Worland et al. (2018) […]. The implementation of LSBoost in this research is unchanged from the original algorithm (Friedman, 2000). The algorithm has two hyperparameters: the learning rate, which scales the contribution of each new model, and the number of boosts. A learning rate of 1 and an ensemble size of 20 are used.
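The residual-fitting loop at the heart of LSBoost can be sketched as follows, with a linear learner standing in for the ANN base learner; the learning rate of 1 and 20 boosts mirror the hyperparameters stated above, but the code itself is our illustration, not the study's implementation.

```python
import numpy as np

def fit_linear(X, y):
    A = np.column_stack([X, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def lsboost(X, y, n_boosts=20, learning_rate=1.0):
    members = []
    residual = y.copy()
    for _ in range(n_boosts):
        coef = fit_linear(X, residual)   # fit the current residuals
        members.append(coef)
        residual = residual - learning_rate * predict_linear(coef, X)

    def predict(Xq):
        # Ensemble prediction: scaled sum of all members.
        return learning_rate * sum(predict_linear(c, Xq) for c in members)
    return predict

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
model = lsboost(X, y)
pred = model(X)
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```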

The resampling and training strategies reviewed above can be combined to further improve model performance on imbalanced data; numerous algorithms have been proposed in the literature that embed resampling schemes in ensemble learning methods. Galar et al. (2012) describe a taxonomy and present a comprehensive comparison of such algorithms for classification problems. Many of these algorithms are effectively minor improvements or refinements of popular approaches. Rather than implementing every unique algorithm for training ensembles, this study employs a systematic approach that combines preprocessing resampling and ensemble training algorithms in a modular fashion; such combinations are referred to as 'hybrid methods'. Hybrid methods aim to achieve the benefits of both standalone methods: improved performance on high flows while maintaining good generalisability. Thus, every permutation of resampling (RUS, ROS, and SMOTER) and ensemble methods (RWB, Bagging, AdaBoost, and LSBoost) is evaluated in this research, resulting in twelve unique hybrid methods. For resampling combinations with RWB ensembles, the resampling is performed once; diversity is thus only obtained from the initialisation of the ANN. This combination is equivalent to evaluating each resampling technique individually, providing a basis for comparison with resampling repeated for each ensemble member, as used in the other ensemble-based configurations. For combinations of resampling with Bagging, AdaBoost, and LSBoost, the resampling procedure is repeated when training each new ensemble member. One non-intuitive hybrid case is the combination of SMOTER with AdaBoost, because the synthetically generated samples do not have predetermined error weights.

The hyperparameters for each of the resampling and ensemble methods employed in this study are listed in Table 4. Every ensemble uses the ANN described in Sect. 2.2 as the base learner. The hyperparameters of the base learner are kept the same throughout all of the ensemble methods to allow for a fair comparison (Shu and Burn, 2004), excluding, of course, the number of epochs, which is determined through validation stop-training.
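The modular hybrid scheme can be sketched as follows: a resampling step (ROS here) is re-applied before training each ensemble member, and member predictions are averaged. This is an illustrative Python sketch, not the authors' MATLAB code; the base learner is a placeholder mean predictor standing in for the ANN, the high flow threshold of 10 is arbitrary, and for the Bagging hybrid each member would additionally train on a bootstrap sample.

```python
import random

def random_oversample(data, hf_threshold, rng):
    """ROS: duplicate randomly chosen high-flow samples until the high-flow and
    typical-flow classes are balanced. `data` is a list of (inputs, flow) pairs."""
    high = [d for d in data if d[1] >= hf_threshold]
    typical = [d for d in data if d[1] < hf_threshold]
    if not high or len(high) >= len(typical):
        return list(data)
    extra = [rng.choice(high) for _ in range(len(typical) - len(high))]
    return data + extra

def train_member(data):
    """Placeholder base learner (stands in for an ANN): predicts the mean target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def hybrid_ensemble(data, resample, hf_threshold, n_members=5, seed=0):
    """Resampling is repeated for each member, so ensemble diversity comes from
    the randomness of the resampling step itself."""
    rng = random.Random(seed)
    members = [train_member(resample(data, hf_threshold, rng))
               for _ in range(n_members)]
    return lambda x: sum(m(x) for m in members) / len(members)

# Twelve flows, three of which (>= 10) are 'high': ROS shifts each member's
# training distribution towards the high flows, raising its predictions.
flows = [(float(i), float(i)) for i in range(1, 13)]
model = hybrid_ensemble(flows, random_oversample, hf_threshold=10.0)
```

Swapping in a different `resample` or ensemble loop reproduces the other hybrid permutations evaluated in this research.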

Model implementation and evaluation
All aspects of this work are implemented in MATLAB 2020a. The Neural Network Toolbox is used to train the base ANN models. The resampling and ensemble algorithms used in this research were programmed by the authors and are available upon request.

Performance assessment
The challenges of training models on imbalanced datasets outlined in Sect. 1 and of evaluating model performance are one and the same: many traditional performance metrics (e.g., MSE, CE) are biased towards the most frequent flows, and the metrics are insensitive to changes in high flow accuracy. In fact, despite their widespread use, these metrics are criticised in the literature. For example, ANN models for sunspot prediction produced a lower RMSE (equivalent to CE when used on datasets with the same observed mean) compared to conventional models, yet were found to have no predictive value (Abrahart et al., 2007). Similarly, CE values may be misleadingly favourable if there is significant observed seasonality (Ehret and Zehe, 2011). CE is also associated with the underestimation of large peak flows, volume balance errors, and undersized variability (Gupta et al., 2009; Ehret and Zehe, 2011). Zhan et al. (2019) suggest that CE is sensitive to peak flows due to the square term. This assertion is correct when comparing two samples; however, when datasets are imbalanced, the errors of typical flows overwhelm those of high flows. Ehret and Zehe (2011) evaluate the relationship between phase error and RMSE using triangular hydrographs; their study shows that RMSE is highly sensitive to minor phase errors, and that when a hydrograph has both a phase and an amplitude error, RMSE is much more sensitive to overpredictions than to underpredictions.

https://doi.org/10.5194/hess-2020-430 Preprint. Discussion started: 9 October 2020 © Author(s) 2020. CC BY 4.0 License.

The coefficient of efficiency (CE), commonly known as the Nash-Sutcliffe efficiency, is given by the following formula:

CE = 1 - \frac{\sum_{t=1}^{n} \left( q(t) - \hat{q}(t) \right)^2}{\sum_{t=1}^{n} \left( q(t) - \bar{q} \right)^2}    (1)

where q is the observed flow, \hat{q} is the predicted flow, and \bar{q} is the mean observed flow.
The persistence index (PI) is a measure similar to CE, but instead of normalising the sum of squared error of a model by the observed variance, it normalises by the sum of squared error between the target variable and itself, lagged by the lead time of the forecast model (referred to as the naive model). Thus, the CE and PI range from an optimum value of 1 to -∞, with values of 0 corresponding to models that are indistinguishable from the observed mean and the naive model, respectively. The PI measure overcomes some of the weaknesses of CE, such as a misleadingly high value for seasonal watersheds. Moreover, PI is effective in identifying when models become over-reliant on autoregressive inputs, as the model predictions will resemble those of the naive model. PI is given by the following formula:

PI = 1 - \frac{\sum_{t=L+1}^{n} \left( q(t) - \hat{q}(t) \right)^2}{\sum_{t=L+1}^{n} \left( q(t) - q(t-L) \right)^2}    (2)

where L is the lead time of the forecast.
In order to quantify changes in model performance on high flows, both the CE and PI measures are calculated separately for typical flows (TF) and high flows (HF) (Crochemore et al., 2015). The resampling methods are expected to improve the high flow CE at the expense of CE for typical flows, while ensemble methods are expected to produce an outright improvement in model generalisation, reflected by a reduced loss in performance between the calibration and test data partitions. Thus, the objective of these experiments is to find model configurations with improved performance on high flows while maintaining strong performance overall. TF and HF performance metrics are calculated based only on the respective observed flows. For example, the CE for high flows is calculated by:

CE_{HF} = 1 - \frac{\sum_{t} \left( q_{HF}(t) - \hat{q}(t) \right)^2}{\sum_{t} \left( q_{HF}(t) - \bar{q}_{HF} \right)^2}    (3)

where q_{HF} is given by:

q_{HF}(t) = q(t) \quad \forall \, t : q(t) \geq q_{T}    (4)

with q_{T} denoting the high flow threshold. The performance for CE_TF, PI_HF, and PI_TF are calculated in the same manner, substituting q_{TF}(t) for q_{HF}(t) in Eq. 4 for TF calculations, and using Eq. 2 in place of Eq. 1 for PI calculations.
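The metrics above translate directly into code. The following Python sketch (an illustration, not the authors' MATLAB implementation) computes CE, PI, and the HF subset used for the HF variants of both metrics.

```python
# CE normalises the model's squared error by the spread about the observed
# mean; PI normalises it by the error of a naive model lagged by the forecast
# lead time. The HF variants apply the same formulas to the subset of time
# steps whose observed flow meets the HF threshold.

def ce(obs, pred):
    """Coefficient of efficiency (Nash-Sutcliffe)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    var = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / var

def pi(obs, pred, lead):
    """Persistence index relative to a naive model lagged by `lead` steps."""
    sse = sum((obs[t] - pred[t]) ** 2 for t in range(lead, len(obs)))
    naive = sum((obs[t] - obs[t - lead]) ** 2 for t in range(lead, len(obs)))
    return 1.0 - sse / naive

def hf_subset(obs, pred, threshold):
    """Keep only the time steps whose observed flow reaches the HF threshold."""
    pairs = [(o, p) for o, p in zip(obs, pred) if o >= threshold]
    return [o for o, _ in pairs], [p for _, p in pairs]
```

A perfect forecast yields CE = PI = 1; predicting the observed mean yields CE = 0, and reproducing the naive (lagged) model yields PI = 0.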

K-fold cross-validation
The entire available dataset is used for both training and testing through the use of KFCV, a widely used cross-validation method (Hastie et al., 2009; Bennett et al., 2013; Solomatine and Ostfeld, 2008; Snieder et al., 2020). Ten folds are used in total: eight folds for calibration and two for testing. Of the eight calibration folds, six are used for training while two are used for early-stopping. When performance is reported as a single value, it refers to the mean model performance of the respective partition across the K folds. It is important to distinguish between the application of KFCV for evaluation (as used in this research) and the use of KFCV for producing ensembles, in which an ensemble of models is trained based on a KFCV data partitioning scheme (Duncan, 2014).
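The partitioning described above can be sketched as follows. The pairing of folds per round is one plausible rotation consistent with the description (the exact fold assignment used by the authors is not specified); with two folds tested per round, five rounds cover the entire dataset.

```python
# Sketch of the 10-fold scheme: each round holds out two folds for testing,
# and the remaining eight calibration folds are split into six training folds
# and two early-stopping folds.

def kfcv_rounds(n_samples, k=10):
    fold_size = n_samples // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    rounds = []
    for r in range(k // 2):
        test_ids = {2 * r, 2 * r + 1}
        test = [i for f in sorted(test_ids) for i in folds[f]]
        calib = [folds[i] for i in range(k) if i not in test_ids]
        stop = calib[0] + calib[1]                 # two early-stopping folds
        train = [i for f in calib[2:] for i in f]  # six training folds
        rounds.append((train, stop, test))
    return rounds
```

Reported single-value performance would then be the mean metric of the relevant partition across the five rounds.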

Results
This section provides a comparison of the performance of each of the methods described throughout Sect. 3, applied to the Bow and Don Rivers. Based on the CE values in Figs. 6-7 and Tables 5-6, the majority of the Bow and Don River models achieve "acceptable" prediction accuracy (as defined by Mosavi et al. (2018)). Values of CE_TF and CE_HF are both lower than the CE, which is to be expected, as the flow variance of each subset is lower than that of the entire set of flows. For the Bow River models, the CE_HF values also have higher variability compared to the overall CE and CE_TF, as shown in Fig. 6a. In contrast, for the Don River models, the difference between CE, CE_TF, and CE_HF is less pronounced; whereas the CE (for the entire dataset) is typically higher, as expected, than both the CE_TF and CE_HF, the difference between CE_TF and CE_HF is low, as demonstrated in the mean and range of the box-whisker plots in Fig. 7a. Unlike the Bow River, the Don River does not exhibit notable seasonality, resulting in a smaller difference between the HF and TF.
Values of PI are typically lower than those of CE for both watersheds. The Bow River models obtain PI values centred around 0 (see Fig. 6b), while for the Don River models, the PI_TF values are lower (see Fig. 7b) than the PI (for the entire dataset) and the PI_HF.
These lower PI_TF values are due to the low variability (steadiness) of the Don River TFs (see Fig. 4); as a result, the sum of squared error between the naive model and the observed flows is also low, reducing the PI value. The low value of PI_TF is attributed to the quality of the naive model, not the inaccuracy of the ANN counterpart. Note that PI_HF values are typically slightly higher than the overall PI: during high flows, there is greater variability, so the naive model is less accurate, resulting in a higher PI score.

Comparison of resampling and ensemble methods
This section provides a more detailed comparison of performance across the different resampling and ensemble methods. As expected, all three resampling methods (RUS, ROS, and SMOTER) typically increase HF performance, often at the expense of TF performance. Based on the results shown in Table 5, the SMOTER variations provide the highest HF performance for the Bow River. The SMOTER-RWB CE_HF is 0.72, an increase from 0.617 for the base model, whereas the SMOTER-Bagging PI_HF is 0.144, compared to -0.175 for the base model. These indicators suggest that the HF prediction accuracy has improved slightly using these SMOTER variations. The results shown in Table 6 indicate that the ensemble methods improve performance relative to the base model regardless of whether they are combined with a resampling strategy. Thus, using such ensembles is highly recommended for improved model performance across all flows.
The LSBoost models have the poorest HF performance of all the ensemble methods studied. This is consistent across all resampling methods and both watersheds. In contrast, the change in performance for CE_TF and PI_TF is less detrimental when using LSBoost, suggesting that this method is not well-suited to improving HF performance. The LSBoost models are slightly overfitted, despite utilising stop-training for calibrating the ANN ensemble members. This is indicated by the degradation in performance between the calibration and test datasets, a change which is larger than that seen in the other ensemble models.
This is most noticeable for the RUS-LSBoost models for both the Bow and the Don Rivers, which are more prone to overfitting than the other models due to the smaller number of training samples. The CE decreases from 0.97 to 0.902 for the Bow River and from 0.835 to 0.715 for the Don River; none of the other models that use RUS exhibit such a gap between training and test performance.

One reason that the improvements made by the boosting methods (AdaBoost and LSBoost) are not more substantial may be the use of ANNs as base learners. ANNs typically have more degrees of freedom than the decision trees most commonly used as base learners; thus, the additional complexity offered by boosting does little to improve model predictions. Nevertheless, these methods still tend to improve performance over the base model case. Ensembles of less complex models, such as regression trees, are expected to produce relatively larger improvements over the single model predictions.

Limitations and future work
A limitation of this study is the lack of a systematic, case-by-case hyperparameter optimisation of the models. The base learner hyperparameters (e.g., topology, activation function) were held constant across all ensemble members. Likewise, the ensemble hyperparameters were not optimised, but simply tuned using an ad hoc approach. A systematic approach to hyperparameter optimisation for each model would likely yield improved model performance; however, hyperparameter optimisation on such a scale would be very computationally expensive. Similarly, the selection of the HF threshold may affect CE_HF and PI_HF performance, and the sensitivity of model performance to this threshold should be explored.
This study featured resampling and ensemble methods for improving prediction accuracy on an imbalanced target dataset, i.e., the high flows. In addition to imbalanced target data, flood forecasting applications commonly have imbalanced costs; for example, underprediction is typically more costly than overprediction. The use of cost functions, such as asymmetric weighting applied to underpredictions and overpredictions, has been shown to reduce the underprediction of flooding (Toth, 2016). Many cost-sensitive ensemble techniques (e.g., Galar et al. (2012)) have yet to be explored in the context of flood forecasting models and should be the focus of future work.
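The asymmetric weighting idea can be illustrated with a short sketch; the 3:1 weighting below is an arbitrary illustrative choice, not a value from Toth (2016).

```python
# Asymmetric squared-error cost: underpredictions (p < o) are penalised more
# heavily than overpredictions, discouraging a model trained on this loss
# from missing flood peaks.

def asymmetric_sse(obs, pred, under_weight=3.0, over_weight=1.0):
    total = 0.0
    for o, p in zip(obs, pred):
        w = under_weight if p < o else over_weight
        total += w * (o - p) ** 2
    return total
```

Underpredicting a flow of 10 by 2 costs 3 × 2² = 12, while overpredicting by the same amount costs only 4, so minimising this loss biases the model away from underprediction.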

Conclusions
This study evaluated the efficacy of resampling and ensemble techniques for improving the performance of high flow forecasting models for two Canadian watersheds: the Bow River in Alberta and the Don River in Ontario. This research attempts to address the widespread problem of poor performance on high flows when using data-driven approaches such as ANNs. Improving performance on high flows is essential for model applications such as early flood warning systems. Three resampling (RUS, ROS, and SMOTER) and four ensemble techniques (RWB, Bagging, AdaBoost, and LSBoost) are implemented as part of ANN flow forecasting models for both watersheds. These methods are implemented independently and combined in hybrid approaches in order to assess their efficacy for improving high flow performance. Contributions include proposing the use of ROS in the water resources field, an adapted application of SMOTER, and new implementations of LSBoost with ANNs and SMOTER-AdaBoost. Resampling methods generally produce only a small improvement in high flow performance, based on CE and PI, with the SMOTER variation providing the most consistent improvements. Ensemble methods produced more sub-