Multi-step ahead daily inflow forecasting using ERA-Interim reanalysis dataset based on gradient boosting regression trees

Inflow forecasting plays an essential role in reservoir management and operation. The impacts of climate change and human activities make accurate inflow prediction increasingly difficult, especially for longer lead times. In this study, a new hybrid inflow forecast framework, with ERA-Interim reanalysis data as input and adopting gradient boosting regression trees (GBRT) and the maximal information coefficient (MIC), was developed for multi-step ahead daily inflow forecasting. Firstly, the ERA-Interim reanalysis dataset provides enough information for the framework to describe inflow at longer lead times. Secondly, MIC can identify an effective feature subset from massive features that significantly affect inflow, so that the framework can avoid over-fitting, distinguish key attributes from unimportant ones and provide a concise understanding of inflow. Lastly, GBRT is a prediction model in the form of an ensemble of decision trees with a strong ability to capture the nonlinear relationships between input and output at long lead times. The Xiaowan hydropower station, located in Yunnan Province, China, is selected as the study area. Four evaluation criteria, the mean absolute error (MAE), the root mean square error (RMSE), the Nash-Sutcliffe efficiency coefficient (NSE) and the Pearson correlation coefficient (CORR), were used to evaluate the established models using historical daily inflow data (1/1/2017-31/12/2018). The performance of the presented framework was compared to that of artificial neural network (ANN), support vector regression (SVR) and multiple linear regression (MLR) models. The experimental results indicate that the developed method generally performs better than the other models and significantly improves the accuracy of inflow forecasting at lead times of 5-10 days. The reanalysis data also enhance the accuracy of inflow forecasting, except for one-day-ahead forecasts.

in-en, 2018). The total amount of LEQDW in Yunnan and Sichuan Provinces increased from 1.5 billion kWh in 2011 to 45.6 billion kWh in 2016, an average annual growth rate of 98.0%. In recent years, owing to the increased number of hydropower stations and installed hydropower capacities, the problem of discarded water caused by inaccurate inflow forecasting has become increasingly serious, which has also had a negative impact on the development of hydropower in China. The main challenge in inflow forecasting caused by climate change and human activities at present is low accuracy, especially for longer lead times (Badrzadeh et al., 2013; El-Shafie et al., 2007). Meanwhile, because streamflow varies with climate change and human activities, inflow forecasting models often need to be rebuilt and their parameters recalibrated against the actual inflow and meteorological data within one or two years. To address these problems, a variety of models and approaches have been developed. These approaches can be divided into three categories: statistical methods (Valipour et al., 2013), physical methods (Duan et al., 1992; Wang et al., 2011; Robertson et al., 2013), and machine learning methods (Chau et al., 2005; Liu et al., 2015; Rajaee et al., 2019; Zhang et al., 2018; Yaseen et al., 2019; Fotovatikhah et al., 2018; Mosavi et al., 2018; Chau, 2017). Each method has its own conditions and scope of application.
Statistical methods are usually based on historical inflow records and mainly include the autoregressive model, the autoregressive moving average (ARMA) model and the autoregressive integrated moving average (ARIMA) model (Lin et al., 2006). Statistical methods assume that the inflow series is stationary and that the relationship between input and output is simple. However, real inflow series are complex, nonlinear and chaotic (Dhanya and Kumar, 2011), making it difficult to obtain high-accuracy predictions using statistical models. Physical methods, which have clear mechanisms, are implemented using theories of inflow generation and confluence. These methods can reflect the characteristics of a catchment but are very strict with initial conditions and input data (Bennett et al., 2016). Meanwhile, when used for flood forecasting these methods have a shorter lead time and cannot provide long-term forecasting results due to input uncertainty. Machine learning methods, which have a strong ability to handle the nonlinear relationship between input and output and have recently shown excellent performance in inflow prediction, are widely used for medium- and long-term inflow forecasts. In particular, several studies have shown that artificial neural networks (ANN) (Rasouli et al., 2012; Cheng et al., 2015; El-Shafie and Noureldin, 2011) and support vector regression (SVR) (Tongal and Booij, 2018; Luo et al., 2019; Moazenzadeh et al., 2018) are two powerful models for inflow prediction. However, these models still have some inherent disadvantages. For example, ANN is prone to being trapped in local minima, and both ANN and SVR suffer from over-fitting and reduced generalization performance.
In recent years, gradient boosting regression trees (GBRT) (Fienen et al., 2018; Friedman, 2001), a nonparametric machine learning method based on a boosting strategy and decision trees, has been developed, applied in the traffic (Zhan et al., 2019) and environmental (Wei et al., 2019) fields, and shown to alleviate the problems mentioned above. Thus, GBRT is selected for daily inflow prediction with lead times of 1-10 days in this paper. Compared with ANN and SVR, GBRT also has two other advantages. Firstly, GBRT can rank features according to their contribution to model scores, which is of great significance for reducing the complexity of the model. Secondly, GBRT is a white-box model and can be easily interpreted. To the best of our knowledge, GBRT has not previously been used for daily inflow prediction with lead times of 1-10 days. For comparison purposes, ANN, SVR and multiple linear regression (MLR) have been employed to forecast daily inflow and are considered as benchmark models in this study.
In addition to forecasting models, a vital reason why many approaches cannot attain higher accuracy for inflow predictions is that inflow is influenced by various factors (Yang et al., 2019), such as rainfall, temperature, humidity, etc. Thus, it is very difficult to select appropriate features for inflow forecasting. Current feature selection methods for inflow forecasting mainly follow two methodologies. The first is the model-free method (Bowden et al., 2005; Snieder et al., 2019), which employs a correlation coefficient criterion (He et al., 2011; Badrzadeh et al., 2013; Siqueira et al., 2018; Pal et al., 2013) to characterise the correlation between a potential model input and the output variable. The second is the model-based method (Snieder et al., 2019), which usually utilizes the model together with search strategies to determine the optimal input subset. Common search strategies include forward selection, backward elimination, etc. (May et al., 2011). The correlation coefficient has limited ability to capture nonlinear relationships, and exhaustive searches tend to impose a high computational burden. In order to select effective inputs accurately and quickly, the maximal information coefficient (MIC) (Reshef et al., 2011) is used to select input factors for inflow forecasting. MIC is a robust measure of the degree of correlation between two variables and has attracted a lot of attention from academia (Zhao et al., 2013; Ge et al., 2016; Lyu et al., 2017; Sun et al., 2018). In addition, sufficient potential input factors are a prerequisite for obtaining reliable and accurate prediction results, and it is not enough to use only the antecedent inflow series as the input of the model. To enhance the accuracy of inflow forecasting and acquire a longer lead time, increasing amounts of meteorological forecasting data have been used for inflow forecasting (Lima et al., 2017; Fan et al., 2015; Rasouli et al., 2012).
However, with extended lead times, the errors of forecast data continuously increase because the variables obtained by numerical weather prediction (NWP) systems are also affected by complex factors (Mehr et al., 2019). Moreover, with the continuous improvement of forecasting systems, it is difficult to obtain consistent, long series of forecast data (Verkade et al., 2013). To mitigate these problems, the reanalysis data generated by ERA-Interim (European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis Interim) (Dee et al., 2011), which has been shown to be one of the best reanalyses of atmospheric circulation and elements (Kishore et al., 2011), has been used as an input. The reanalysis data, which are the result of assimilating observed data with forecast data, have smaller errors than either observed or forecast data alone. ERA-Interim provides a global climate reanalysis from 1979 to date, produced by a fixed version of an NWP system. The fixed version ensures that there are no spurious trends caused by an evolving NWP system. Therefore, meteorological reanalysis data satisfy the need for long sequences of consistent data and have been used for the prediction of wind speeds (Stopa and Cheung, 2014) and solar radiation (Ghimire et al., 2019; Linares-Rodríguez et al., 2011).
This study aims to provide a reliable inflow forecasting framework with longer lead times for daily inflow forecasting. The framework adopts the ERA-Interim reanalysis dataset as the input, which ensures that ample information is supplied to describe inflow. MIC is used to select appropriate features to avoid over-fitting and the waste of computing resources caused by feature redundancy. GBRT, which is robust to outliers and has strong nonlinear fitting ability, is used as the prediction model to improve inflow forecasting accuracy at longer lead times. This paper is organized as follows: Section 2 describes the case study and the collected data. Section 3 introduces the theory and process of the methods used, including MIC and GBRT. Section 4 presents the results and discussion, followed by the conclusions in Section 5.

Study area and collected data
The Xiaowan Hydropower Station in the lower reaches of the Lancang River is chosen as the study site (Fig. 2). The Xiaowan Hydropower Station is the main controlling hydropower station on the Lancang River, which makes it a meaningful case study. The Lancang River is approximately 2000 km long and has a drainage area of 113,300 km² above the Xiaowan Hydropower Station. The Lancang River, also known as the Mekong River, originates on the Tibetan Plateau and runs through China, Myanmar, Laos, Thailand, Cambodia, and Vietnam.
The major source of water flowing into the Lancang River in China is melting snow on the Tibetan Plateau (Mekong River Commission, 2005). We collected the ERA-Interim reanalysis dataset and observed daily inflow and rainfall data for Xiaowan for 8 years (January 2011 to December 2018). Fig. 3 depicts the daily inflow series. The data from January 2011 to December 2014 (1461 days, approximately 50% of the whole dataset), from January 2015 to December 2016 (731 days, approximately 25%) and from January 2017 to December 2018 (730 days, approximately 25%) are used as the training, validation and testing sets, respectively. The reanalysis dataset can be downloaded from https://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/ and is provided every 12 hours on a spatial grid of 0.25° × 0.25°. Based on expert knowledge and the available literature, 26 near-surface variables (Table A1) from the reanalysis data are considered as potential predictors for inflow forecasting, including the total precipitation (tp), the 2 meter temperature (t2m), the total column water (tcw), etc. More details about the ERA-Interim dataset are presented in Appendix A.
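As a small illustration, the chronological 50/25/25 split described above can be sketched as follows (a hedged sketch; the function and variable names are ours, not from the paper):

```python
from datetime import date

def chronological_split(records, train_end, valid_end):
    """Split (date, value) pairs chronologically, without shuffling,
    mirroring the paper's partition: training up to the end of 2014,
    validation up to the end of 2016, testing afterwards."""
    train = [(d, v) for d, v in records if d <= train_end]
    valid = [(d, v) for d, v in records if train_end < d <= valid_end]
    test = [(d, v) for d, v in records if d > valid_end]
    return train, valid, test

# Toy series with one point per year, for illustration only.
records = [(date(y, 7, 1), float(y)) for y in range(2011, 2019)]
train, valid, test = chronological_split(records, date(2014, 12, 31), date(2016, 12, 31))
# → 4 training points (2011-2014), 2 validation points, 2 testing points
```

Keeping the split strictly chronological avoids leaking future information into the training set, which matters for autocorrelated daily inflow.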

Feature scaling and feature selection
Feature scaling is necessary for machine learning methods, and all features are scaled to the range between 0 and 1 before taking part in the calculation, as follows:

x_scale = (x_original − x_min) / (x_max − x_min)

where x_scale and x_original indicate the scaled and original data, respectively, and x_max and x_min represent the maximum and minimum of the corresponding series, respectively.
A reasonable selection of input variables can reduce the computational burden and improve the prediction accuracy of the model by removing redundant feature information and reducing the dimensionality of the features. If too many features are selected, the model becomes very complex, which causes trouble when adjusting parameters, resulting in over-fitting and difficult convergence. Moreover, natural patterns in the data will be blurred by noise (Zhao et al., 2013). On the other hand, if irrelevant features are chosen, noise is added to the model, hindering the learning process. MIC is employed to select inputs from the candidate predictors in the reanalysis data. The lagged inflow and rainfall series are identified by the partial autocorrelation function (PACF) and the cross-correlation function (CCF). The corresponding 95% confidence interval is used to identify significant correlations. Furthermore, when the correlation coefficient declines slowly and does not fall within the confidence interval, a trial-and-error procedure is used to determine the optimum lag, i.e., starting from one lag and then modifying the external inputs by successively adding one more lagged time series into the inputs (Amiri, 2015; Shoaib et al., 2015).

Methodology

Feature selection via maximal information coefficient
The calculation of MIC is based on the concept of mutual information (MI) (Kinney and Atwal, 2014). For a random variable X, such as observed inflow, the entropy of X is defined as

H(X) = −Σ_x p(x) log p(x)

where p(x) is the probability density function of X = x. Furthermore, for another random variable Y, such as observed rainfall, the conditional entropy of X given Y may be evaluated from the following expression

H(X|Y) = −Σ_x Σ_y p(x, y) log p(x|y)

where H(X|Y) is the remaining uncertainty of X given knowledge of Y, and p(x, y) and p(x|y) are the joint probability density and the conditional probability of X = x and Y = y, respectively. The reduction of the original uncertainty of X, due to the knowledge of Y, is called the MI (Amorocho and Espildora, 1973; Chapman, 1986), defined by

I(X; Y) = H(X) − H(X|Y)

The calculation of MIC is divided into three steps. Consider a given dataset D, including variables X and Y with a sample size n. Firstly, draw the scatter plot of X and Y and partition it with an x-by-y grid G; let D|G denote the distribution of D induced by G.
Secondly, for each grid resolution x-by-y, compute the maximum MI over all x-by-y grids and normalise it by log min(x, y); these normalised values form the characteristic matrix. Lastly, MIC is defined as the maximum value of the characteristic matrix over all grids whose size satisfies an upper bound that is a function of the sample size, B = n^0.6. We perform feature selection from the ERA-Interim reanalysis dataset in two steps via MIC. First, compute the MIC between each reanalysis variable and the observed inflow. Then, sort the features by MIC in descending order and determine the optimum inputs using a trial-and-error procedure, i.e., starting from the top feature and then successively adding one more feature to the model inputs. The selected k features from the reanalysis data are used as part of the input to the model.
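A simplified, self-contained sketch of MI and MIC may clarify the procedure. Note that this toy version uses only equal-width bins, whereas Reshef et al. (2011) optimise the grid partitions, so it can underestimate MIC for some relationships; all function names are ours:

```python
import math
from collections import Counter

def grid_mi(xs, ys, nx, ny):
    """Mutual information I(X;Y) after equal-width binning into an
    nx-by-ny grid (natural log)."""
    def bin_idx(v, lo, hi, k):
        return 0 if hi == lo else min(int((v - lo) / (hi - lo) * k), k - 1)
    bx = [bin_idx(v, min(xs), max(xs), nx) for v in xs]
    by = [bin_idx(v, min(ys), max(ys), ny) for v in ys]
    n = len(xs)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log((c / n) / (px[i] / n * py[j] / n))
               for (i, j), c in pxy.items())

def mic(xs, ys):
    """Maximum, over all grids with nx*ny <= n**0.6, of the MI normalised
    by log(min(nx, ny)), so that 0 <= MIC <= 1."""
    n, best = len(xs), 0.0
    bound = n ** 0.6
    for nx in range(2, int(bound) + 1):
        for ny in range(2, int(bound) + 1):
            if nx * ny <= bound:
                best = max(best, grid_mi(xs, ys, nx, ny) / math.log(min(nx, ny)))
    return best

# Candidate predictors would then be ranked by MIC against observed inflow:
# ranked = sorted(candidates, key=lambda name: mic(data[name], inflow), reverse=True)
```

A noiseless linear relationship scores close to 1, while a constant (uninformative) variable scores 0.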

Gradient boosting regression trees
GBRT is an ensemble model that combines two algorithms: the decision tree algorithm and the boosting algorithm. The decision tree, which is robust to outliers, is used as the primitive model, and the boosting algorithm is used as the integration rule to improve inflow forecasting accuracy.

The decision tree
The decision tree in this paper refers to decision tree learning used in computer science, which is one of the predictive modelling approaches used in machine learning. A decision tree consists of branch nodes (the tree structure) and leaf nodes (the tree output).
Suppose a training dataset is given in a feature space with N features, each with n samples: {(X1, y1), (X2, y2), …, (Xn, yn)}, where Xi = (x1, x2, …, xN), i = 1, 2, …, n. In the input space where the training set is located, each region is recursively divided into two subregions, and the output value of each subregion is used to construct a binary decision tree. The top-down cyclic branch learning of the decision tree adopts a greedy algorithm in which each branch node only considers its own objective function. By traversing all features and all segmentation points of each feature, the best feature j and segmentation point s can be found by minimizing the squared loss:

min_{j,s} [ min_{c1} Σ_{Xi ∈ R1(j,s)} (yi − c1)² + min_{c2} Σ_{Xi ∈ R2(j,s)} (yi − c2)² ]

where yi is the observed value, R1(j, s) and R2(j, s) are the two regions produced by the partition, and c1 and c2 are the output values of R1(j, s) and R2(j, s), respectively. Fig. 4 shows an example of a decision tree model with a maximum depth of 3 and 5 leaf nodes. If the threshold of the loss is set as the only stopping condition of the decision tree, over-fitting problems arise easily. Hence, we set the following parameters to alleviate the over-fitting problem of the decision tree model: the maximum depth of the tree, the minimum number of samples required to split an internal node, the minimum number of samples required at a leaf node, and the number of leaf nodes. These are also the parameters optimized when using the decision tree.
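The exhaustive split search described above can be sketched as follows (a minimal illustration; the function name and toy data are ours):

```python
def best_split(X, y):
    """Exhaustive search over every feature j and threshold s for the split
    minimising the total squared loss of two constant-output regions."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] <= s]
            right = [y[i] for i, row in enumerate(X) if row[j] > s]
            if not left or not right:
                continue
            c1, c2 = sum(left) / len(left), sum(right) / len(right)
            loss = (sum((v - c1) ** 2 for v in left)
                    + sum((v - c2) ** 2 for v in right))
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Two flow regimes in a single toy feature:
X, y = [[1.0], [2.0], [10.0], [11.0]], [1.0, 1.0, 9.0, 9.0]
j, s, loss = best_split(X, y)
# → j = 0, s = 2.0, loss = 0.0
```

Applying this search recursively to the two resulting regions, until one of the stopping parameters above is hit, yields the binary regression tree.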

The boosting algorithm
The idea of gradient boosting originated in an observation by Breiman (Breiman, 1997) and can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed (Friedman, 2001; Mason et al., 2000). The boosting algorithm used here is described below. Given a training dataset with n samples {(X1, y1), (X2, y2), …, (Xn, yn)}, a squared loss function L(y, f(X)) is used to train the decision trees. The core of the GBRT algorithm is the iterative training of decision trees on residuals. The iterative training process of GBRT with M decision trees is as follows:
1) Initialize the model with the constant that minimizes the loss, f0(X) = arg min_c Σi L(yi, c), i.e., the mean of the observations for the squared loss.
2) For the m-th (m = 1, 2, …, M) decision tree:
a) For each sample i (i = 1, 2, …, n), use the negative gradient of the loss function under the current model in place of the residual, r_mi = −[∂L(yi, f(Xi))/∂f(Xi)] evaluated at f = f_{m−1}, which for the squared loss is simply yi − f_{m−1}(Xi).
b) Fit a regression tree to the residuals r_mi, obtaining the leaf node regions R_mt (t = 1, 2, …, T), where T is the number of leaf nodes.
c) For each leaf region t = 1, 2, …, T, calculate the best fitting value c_mt = arg min_c Σ_{Xi ∈ R_mt} L(yi, f_{m−1}(Xi) + c).
d) Update the model by adding the obtained fitting values to the previous ones: f_m(X) = f_{m−1}(X) + Σ_t c_mt I(X ∈ R_mt).
According to the above introduction to GBRT, the parameters of GBRT can be divided into two categories: boosting parameters and tree parameters. The boosting parameters are the learning rate and the number of weak learners (learning_rate and n_estimators). The learning rate shrinks the gradient step; the smaller its value, the more iterations are required for training, which increases the overall training time. There are four tree parameters: max_leaf_nodes, min_samples_leaf, min_samples_split and max_depth. Hence, GBRT has six parameters controlling model complexity (Fienen et al., 2018), which we adjusted by a trial-and-error procedure.
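Steps 1-2 above can be illustrated with a toy, single-feature implementation that uses stumps as the weak learners (a simplified sketch under our own naming, not the scikit-learn implementation the paper relies on):

```python
def fit_stump(x, r):
    """Depth-1 regression tree (stump) on one feature: pick the threshold s
    minimising the squared loss of two constant leaves."""
    best = None
    for s in sorted(set(x))[:-1]:
        left = [r[i] for i in range(len(x)) if x[i] <= s]
        right = [r[i] for i in range(len(x)) if x[i] > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        loss = sum((v - c1) ** 2 for v in left) + sum((v - c2) ** 2 for v in right)
        if best is None or loss < best[0]:
            best = (loss, s, c1, c2)
    _, s, c1, c2 = best
    return lambda v, s=s, c1=c1, c2=c2: c1 if v <= s else c2

def gbrt_fit(x, y, n_estimators=200, learning_rate=0.1):
    f0 = sum(y) / len(y)                      # step 1: constant initial model
    pred, trees = [f0] * len(y), []
    for _ in range(n_estimators):             # step 2: one tree per iteration
        resid = [y[i] - pred[i] for i in range(len(y))]  # a) negative gradient of squared loss
        tree = fit_stump(x, resid)            # b)-c) leaf values are residual means
        trees.append(tree)
        pred = [pred[i] + learning_rate * tree(x[i]) for i in range(len(y))]  # d) update
    return lambda v: f0 + learning_rate * sum(t(v) for t in trees)

x, y = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0]
model = gbrt_fit(x, y)
# model(1.0) ≈ 1.0 and model(4.0) ≈ 3.0 once boosting converges
```

Each iteration shrinks the remaining residual by a factor of (1 − learning_rate), which is why a small learning rate demands more trees.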

Evaluation criteria of the models
It is critical to define model performance carefully and to evaluate it by comparing the forecasted and fitted values of the model with historical data. The root mean squared error (RMSE) and the mean absolute error (MAE) are the most commonly used criteria for assessing model performance; they are calculated using Eq. (10) and Eq. (11), respectively:

MAE = (1/n) Σ_{i=1}^{n} |Q̂_i − Q_i|   (10)

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Q̂_i − Q_i)² )   (11)

where Q̂_i and Q_i are the estimated and observed inflow at time i, respectively, and n is the number of samples. The RMSE is more sensitive to extremes in sample sets and is therefore used to evaluate the model's ability to simulate flood peaks.
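A minimal sketch of these two criteria (function names and toy values are ours):

```python
import math

def mae(obs, est):
    """Mean absolute error between observed and estimated inflow."""
    return sum(abs(q_hat - q) for q, q_hat in zip(obs, est)) / len(obs)

def rmse(obs, est):
    """Root mean squared error; penalises flood-peak misses more heavily."""
    return math.sqrt(sum((q_hat - q) ** 2 for q, q_hat in zip(obs, est)) / len(obs))

obs = [100.0, 200.0, 300.0]   # hypothetical observed inflow, m3/s
est = [110.0, 190.0, 330.0]   # hypothetical forecasts
# mae(obs, est) ≈ 16.67 and rmse(obs, est) ≈ 19.15; RMSE exceeds MAE because
# the single large error (30) dominates the squared term.
```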
The Pearson correlation coefficient (CORR) is a measure of the strength of the association between the observed and forecasted inflow series; it is calculated according to Eq. (12):

CORR = Σ_{i=1}^{n} (Q_i − Q̄_o)(Q̂_i − Q̄) / sqrt( Σ_{i=1}^{n} (Q_i − Q̄_o)² · Σ_{i=1}^{n} (Q̂_i − Q̄)² )   (12)

where Q̄ is the mean of the estimation series and Q̄_o is the mean of the observed series. The range of CORR considered here is between 0 and 1, and values close to 1 demonstrate a perfect estimation result.
The Kling-Gupta efficiency (KGE) combines correlation, variability and bias:

KGE = 1 − sqrt( (CORR − 1)² + (σ_s/σ_o − 1)² + (μ_s/μ_o − 1)² )

where σ_o is the standard deviation of the observed values, σ_s is the standard deviation of the inflow estimation, μ_o is the mean of the observed series and μ_s is the mean of the inflow estimation series. The percent bias in high-flow volume (BHV) measures the error over the largest flows:

BHV = 100 × Σ_{h=1}^{H} (Q̂_h − Q_h) / Σ_{h=1}^{H} Q_h

where h = 1, 2, …, H indexes the inflows with exceedance probabilities lower than 0.02. In this paper, the inflow threshold at an exceedance probability of 0.02 is 1722 m³/s.
The Index of Agreement (IA) (Willmott, 1981) plays a significant role in evaluating the degree of agreement between the observed series and the inflow estimation series. Similar to CORR, its range is between 0 (no agreement at all) and 1 (perfect fit).
It is given by:

IA = 1 − Σ_{i=1}^{n} (Q_i − Q̂_i)² / Σ_{i=1}^{n} ( |Q̂_i − Q̄_o| + |Q_i − Q̄_o| )²

where Q̄_o is the mean of the observed series.
In GBRT, we measure the relevance of observed inflow and rainfall at different lags to the observed inflow at the time of forecast via the partial autocorrelation function (PACF) and the cross-correlation function (CCF) (Badrzadeh et al., 2013), and select appropriate lags as model predictors by hypothesis tests and trial-and-error procedures. Then, data pre-processing and feature scaling are carried out for the selected predictors. Next, the dataset is divided into training, validation and testing sets according to the lengths specified in advance (Section 2.2). A grid search algorithm, an exhaustive search over all candidate parameter combinations, is used to optimize the model parameters for each lead time based on the evaluation of the validation set (Chicco and Davide, 2017). Lastly, the prediction results are evaluated on the testing set.
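The remaining evaluation criteria in this section (CORR, KGE, BHV and IA) can be sketched together; the KGE expression below follows the standard Kling-Gupta form, which we assume is the one intended here, and the 1722 m³/s default is the paper's 0.02 exceedance threshold:

```python
import math

def corr(obs, est):
    n = len(obs)
    mo, me = sum(obs) / n, sum(est) / n
    num = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs) * sum((e - me) ** 2 for e in est))
    return num / den

def kge(obs, est):
    n = len(obs)
    mo, me = sum(obs) / n, sum(est) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    se = math.sqrt(sum((e - me) ** 2 for e in est) / n)
    r = corr(obs, est)
    return 1 - math.sqrt((r - 1) ** 2 + (se / so - 1) ** 2 + (me / mo - 1) ** 2)

def bhv(obs, est, threshold=1722.0):
    """Percent bias over the flows above the high-flow threshold."""
    pairs = [(o, e) for o, e in zip(obs, est) if o > threshold]
    return 100.0 * sum(e - o for o, e in pairs) / sum(o for o, _ in pairs)

def ia(obs, est):
    mo = sum(obs) / len(obs)
    num = sum((o - e) ** 2 for o, e in zip(obs, est))
    den = sum((abs(e - mo) + abs(o - mo)) ** 2 for o, e in zip(obs, est))
    return 1 - num / den

# A perfect forecast scores CORR = KGE = IA = 1 and BHV = 0;
# a uniform 10% over-forecast of the high flows gives BHV = 10.
```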
Compared with GBRT, GBRT-MIC adds the reanalysis data selected via MIC (Section 3.1) as model input. Moreover, GBRT-MIC also calculates the importance of the features according to the prediction results and ranks the features (Louppe, 2014). Multi-step forecasting is difficult owing to the accumulation of errors, reduced accuracy and increased uncertainty. Reviewing the current state of multi-step ahead forecasting, there are mainly two strategies for single-output multi-step forecasting, namely the static (direct) multi-step forecast and the recursive multi-step forecast (Bontempi et al., 2013; Taieb et al., 2012). The recursive forecast strategy is biased when the underlying model is nonlinear and is sensitive to the estimation error, since estimated values, instead of actual ones, are increasingly used further into the future (Bontempi et al., 2012). Thus, the static multi-step forecasting strategy is employed in this paper. Since the static strategy does not use any approximated values to compute the forecasts, it is not prone to any accumulation of errors. The model structures of GBRT and GBRT-MIC are as follows:
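The static (direct) strategy described above amounts to building one training set, and hence one independently trained model, per lead time, as in this sketch (the helper name and toy data are ours, not the authors' code):

```python
def make_direct_datasets(series, n_lags, horizons):
    """Static (direct) multi-step strategy: one (X, y) dataset per lead
    time h. Inputs are the same lagged values; only the target shifts
    h steps ahead, so no predicted value is ever fed back as an input."""
    datasets = {}
    for h in horizons:
        X, y = [], []
        for t in range(n_lags - 1, len(series) - h):
            X.append(series[t - n_lags + 1 : t + 1])  # the n_lags most recent values
            y.append(series[t + h])                   # the value h steps ahead
        datasets[h] = (X, y)
    return datasets

series = list(range(10))   # toy daily series
datasets = make_direct_datasets(series, n_lags=3, horizons=[1, 5])
# datasets[1] pairs [0, 1, 2] with 3; datasets[5] pairs [0, 1, 2] with 7
```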

Experimental results and discussion
In order to compare with GBRT-MIC, the ANN-MIC, SVR-MIC and MLR-MIC models, obtained by replacing GBRT in the framework with ANN, SVR and MLR, respectively, are also employed for inflow forecasting at lead times of 1-10 days. As mentioned previously, six indices, i.e. the MAE, RMSE, CORR, KGE, BHV and IA, are calculated to evaluate the performance of the models on the testing set. We also explored the feature importance based on the GBRT-MIC model (Louppe, 2014). All computations in this paper were performed on a ThinkPad P1 workstation containing an Intel Core i7-9850H CPU at 2.60 GHz with 16.0 GB of RAM, using Python 3.7.10 (Python Software Foundation, 2020) and the scikit-learn package (Pedregosa et al., 2011).

Feature selection

Fig. 6 shows the PACF, the CCF and the corresponding 95% confidence interval from lag 1 to lag 12. The PACF shows significant autocorrelation at lags one and four (Fig. 6(a)); thus, the one- and four-day-lagged inflow series are selected as model inputs. The CCF between inflow and rainfall gradually decreases with increasing time lag (Fig. 6(b)) and does not fall within the 95% confidence interval. Therefore, a trial-and-error procedure is used to determine the optimal selection of lagged rainfall series. Thirteen input structures are tried (Table 1) and the trial results are shown in Fig. A1. The results indicate that the 7th input structure performs best. Accordingly, the rainfall series from one- to six-day lags are selected as model inputs. As mentioned previously, based on the MIC between inflow and each reanalysis variable (Table A1), a trial-and-error procedure is used to determine the optimal input subset. Twenty-six input structures are tried (Table 2) and the trial results are shown in Fig. A2. The results show that the 8th input structure performs best, and thus the No. 1 to 8 predictors in Table A1 are selected as model inputs.
Finally, a total of 16 variables, including 8 observed variables and 8 reanalysis variables, are selected as the model inputs (Table 3). As shown in Table 3, No. 9 to 18 are reanalysis variables and the MIC of the selected reanalysis variables ranges from 0.643 to 0.847. Furthermore, No. 9 and No. 13 to 16 are variables related to temperature. Soil temperature level 3 (No. 9) is the temperature of the soil in layer 3 (28-100 cm, the surface being at 0 cm). The temperature of the snow layer (No. 13) gives the temperature of the snow layer from the ground to the snow-air interface. No. 10 to 12 are variables related to the water content of the atmosphere. The 2 meter dewpoint temperature (No. 10) is a measure of the humidity of the air; combined with temperature and pressure, it can be used to calculate the relative humidity. The total column water vapour (No. 11) is only the total amount of water vapour, a fraction of the total column water. Total column water (No. 12) is the sum of water vapour, liquid water, cloud ice, rain and snow in a column extending from the surface of the Earth to the top of the atmosphere. Volumetric soil water layer 1 (No. 19) is the volume of water in soil layer 1. In summary, all the selected predictors are interpretable and have a good physical connection with inflow.

Hyperparameter optimization
For machine learning methods, hyperparameters are a kind of parameter that is set before training and cannot be learned directly from the regular training process. In order to improve the performance of the models, it is imperative to tune their hyperparameters. Grid search is employed to tune the hyperparameters of GBRT, GBRT-MIC, ANN-MIC and SVR-MIC.
Following the available literature (Badrzadeh et al., 2013; Rasouli et al., 2012), L-BFGS, an optimizer in the family of quasi-Newton methods, is used as the training algorithm of the ANN, and the number of hidden layers is fixed to 3. Two further parameters, namely the activation function and the number of nodes per hidden layer, need to be adjusted. A range of 2-20 neurons and four commonly used activation functions (Table 4) are evaluated (Fig. 7); ANNs with fewer nodes tend to obtain lower errors. The optimal parameter combination for each lead time is listed in Table 5. It can be seen that the optimal number of nodes is 2, 3 or 4 and the optimal activation function is either the tanh or the logistic function.
For SVR, according to Lin et al. (2006) and Dibike et al. (2001), the radial basis function (RBF) outperforms other kernel functions for runoff modelling, and thus the RBF is used as the kernel function in this study. Three parameters need to be adjusted. Firstly, an appropriate tuning range for each parameter is determined by a trial-and-error procedure. Then, to arrive at an optimal choice of these parameters, the MAE is used as the objective of the grid search. The optimal tuning parameters of SVR are shown in Table 5.
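The exhaustive grid search used for all models can be sketched generically as follows (a minimal illustration; the toy objective merely stands in for a trained model's validation MAE):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive search over all parameter combinations; `evaluate`
    returns a validation error (e.g. the MAE) for one combination."""
    names = sorted(param_grid)
    best_params, best_err = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        err = evaluate(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# Toy quadratic objective standing in for a model's validation MAE.
best, err = grid_search({"max_depth": [2, 3, 4], "max_leaf_nodes": [4, 8]},
                        lambda p: (p["max_depth"] - 3) ** 2 + (p["max_leaf_nodes"] - 8) ** 2)
# → best == {"max_depth": 3, "max_leaf_nodes": 8}, err == 0
```

The cost grows multiplicatively with the number of values per parameter, which is why the GBRT parameters below are tuned in two stages rather than all at once.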
As mentioned earlier, six parameters of GBRT need to be adjusted. In order to obtain an optimal parameter combination as quickly as possible, we optimize the parameters in two steps. Firstly, n_estimators and learning_rate are fixed to 100 and 0.1, respectively, and the four tree parameters (max_leaf_nodes, min_samples_leaf, max_depth and min_samples_split) are tuned, generating 40000 models at each lead time. Secondly, after the tree parameters are determined, learning_rate is reduced to 0.01 and n_estimators is determined by grid search. To accommodate the computational burden, all models are distributed among about 12 central processing units (CPUs); the total wall time for the runs is about 7 hours for GBRT-MIC and GBRT. Table 6 lists the optimal tuning parameters of GBRT and GBRT-MIC.

Inputs comparison

Fig. 8 illustrates the performance indices of GBRT and GBRT-MIC on the testing set (2017/01/01-2018/12/31) at lead times of 1-10 days. It is obvious that the reanalysis data selected by MIC greatly improve the GBRT forecasts at both short and long lead times. In particular, at the longer lead times the prediction of GBRT-MIC significantly outperforms GBRT. In Fig. 8(a), the MAE of GBRT-MIC decreases from 175 to 172 (a decrease of 1.74%) for two-day ahead forecasting and from 273 to 237 (a decrease of 13.18%) for ten-day ahead forecasting compared with GBRT. In Fig. 8(b), the RMSE of GBRT-MIC achieves 1.4% and 10.6% reductions for two- and ten-day ahead forecasting, respectively, compared with GBRT. In Fig. 8(c), 8(d) and 8(f), the CORR, KGE and IA of GBRT-MIC increase by 0.2%, 2.2% and 1.0% for two-day ahead forecasting and by 3.4%, 7.8% and 2.2% for ten-day ahead forecasting, respectively. Fig. 8(e) compares the BHV of GBRT and GBRT-MIC, indicating that the reanalysis data can enhance the forecasting of extreme values. Fig. 9(a) shows the five-day ahead forecasted inflow of GBRT-MIC and GBRT versus the observed inflow in the testing set. The slopes of the fitted lines of GBRT-MIC and GBRT are 0.89 and 0.81, respectively, which also demonstrates that GBRT-MIC obtains more accurate inflow forecasts than GBRT. Fig. 9(b) illustrates the distribution of the forecast errors of GBRT and GBRT-MIC. The prediction errors of both models are approximately normally distributed, which suggests that the errors contain little information not extracted by the models, and the errors of GBRT-MIC are more concentrated around 0 than those of GBRT. Fig. 9(c) provides the forecasted inflow time series (from the testing set) of GBRT-MIC and GBRT at a lead time of five days. It can be seen that GBRT-MIC performs better than GBRT, especially for extreme values. This reveals that the problem of inaccurate extreme-value prediction that arises in areas with concentrated rainfall for the GBRT model can be mitigated by incorporating the reanalysis data identified by MIC.

Model comparison
GBRT-MIC, SVR-MIC and ANN-MIC with the obtained optimal model parameters are employed for inflow forecasting from one to ten days ahead. Summarized results for the training and testing sets are presented in Table 7 and Table 8, respectively. To avoid local minima, 50 ANN-MIC models are trained for each lead time and the median of their predictions gives the final prediction. It is clear from Table 7 that GBRT-MIC is more efficient on the training set than the other models at lead times of 1-10 days, which demonstrates that GBRT-MIC has a powerful fitting ability. Meanwhile, all machine learning models obtain better forecasts than MLR-MIC, which cannot capture nonlinear relationships. It should be noted that ANN-MIC performs best for extreme values in terms of BHV on the training set.
As shown in Table 8, GBRT-MIC performs best on the testing set at lead times of 4-10 days in terms of all six indices. At a lead time of ten days, the KGE of GBRT-MIC even reaches 0.8317. At lead times of 1-3 days, the three machine learning models achieve comparable performance but all outperform MLR-MIC; the machine learning models can acquire enough information to perform forecasting at short lead times (1-3 days).
The performance indices of the four models on the testing set (2017-2018) at lead times of 1-10 days are presented in Fig. 10. The results indicate that the performance of all four models deteriorates (higher MAE, RMSE and BHV, and lower CORR, KGE and IA) as the lead time increases. As mentioned earlier, the four models perform equally well for one- to three-day ahead forecasting, whereas significant differences among their performances appear once the lead time exceeds three days. GBRT-MIC produces much higher CORR, KGE and IA, and lower MAE, RMSE and BHV, than the other three models for four- to ten-day ahead forecasting, except that ANN-MIC performs close to GBRT-MIC for ten-day ahead forecasting. It should be noted that SVR-MIC performs worst according to BHV and KGE, which demonstrates that SVR-MIC cannot capture extreme values. On the contrary, GBRT-MIC significantly outperforms the other models in terms of BHV at lead times of 1-10 days, indicating that GBRT-MIC captures extreme values best among all models developed in this paper.
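The six evaluation criteria can be computed as in the sketch below. KGE follows Gupta et al. (2009) and IA is Willmott's index of agreement; the BHV definition here (percent bias over the top 2% of observed flows) is an assumption, as the paper does not restate its exact formula in this section.

```python
import numpy as np

def metrics(obs, pred):
    """Sketch of the six evaluation criteria; BHV is assumed to be the
    percent bias over the top 2% of observed flows."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mae = np.mean(np.abs(pred - obs))
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    corr = np.corrcoef(obs, pred)[0, 1]
    # KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    alpha = pred.std() / obs.std()          # variability ratio
    beta = pred.mean() / obs.mean()         # bias ratio
    kge = 1 - np.sqrt((corr - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    # Willmott's index of agreement
    ia = 1 - np.sum((pred - obs) ** 2) / np.sum(
        (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    # Percent bias over the highest observed flows (assumed top 2%)
    high = obs >= np.quantile(obs, 0.98)
    bhv = 100 * np.sum(pred[high] - obs[high]) / np.sum(obs[high])
    return dict(MAE=mae, RMSE=rmse, CORR=corr, KGE=kge, IA=ia, BHV=bhv)
```

For a perfect forecast, MAE, RMSE and BHV are 0 while CORR, KGE and IA equal 1, matching the direction of each index described above.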

Feature importance
A benefit of using gradient boosting is that, after the boosted trees are constructed, a relative importance score for each feature can be obtained to estimate its contribution to inflow forecasting. Fig. 11 shows the feature importance based on GBRT-MIC for lead times of one and ten days. The one-day-lag observed inflow (Q_{t-1}) is more important at shorter lead times (Fig. 11(a)), which demonstrates that the historical observed values are essential to inflow forecasting at shorter lead times.
The features from the reanalysis data (e.g., stl3_{t-10} and d2m_{t-10}) have a high relative importance at longer lead times (Fig. 11(b)). Based on the analysis of the concepts of stl3_{t-10} and tcw_{t-10} (Section 4.1), we infer that the temperature near the ground affects the inflow by influencing snowmelt, which is consistent with the fact that the Lancang River is a snowmelt river. The ten-day-lag observed inflow (Q_{t-10}) is also very important, which indicates the long memory of the inflow series (Salas, 1993). Meanwhile, it is found that the reanalysis data provides important information for inflow forecasting at longer lead times.
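Extracting such scores from a fitted scikit-learn gradient boosting model is straightforward, as sketched below. The data are synthetic and the three feature names are illustrative stand-ins for the paper's predictors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 300
# Hypothetical predictors: lagged inflow plus two reanalysis variables.
q_lag1 = rng.gamma(2.0, 50.0, size=n)       # Q_{t-1}, lagged observed inflow
stl3 = rng.normal(280.0, 5.0, size=n)       # soil temperature, level 3
d2m = rng.normal(275.0, 5.0, size=n)        # 2 m dewpoint temperature
X = np.column_stack([q_lag1, stl3, d2m])
# Synthetic target: inflow driven mainly by lagged inflow, partly by stl3.
y = 0.8 * q_lag1 + 2.0 * (stl3 - 280.0) + rng.normal(scale=5.0, size=n)

gbrt = GradientBoostingRegressor(random_state=0).fit(X, y)
names = ["Q_{t-1}", "stl3_{t-10}", "d2m_{t-10}"]
for name, score in zip(names, gbrt.feature_importances_):
    print(f"{name}: {score:.3f}")
```

The scores sum to one, so they can be read directly as the relative share of each feature's contribution across all trees, which is how Fig. 11 ranks the predictors.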

Conclusion
In this study, GBRT-MIC is employed to make inflow forecasts at lead times of 1-10 days, and ANN-MIC, SVR-MIC and MLR-MIC are developed for comparison. The reanalysis data selected by MIC, the antecedent inflow and the rainfall records selected by PACF and CCF are used as predictors to drive the models. The models are compared using six evaluation criteria: MAE, RMSE, CORR, KGE, BHV and IA. It is shown that GBRT-MIC, ANN-MIC and SVR-MIC outperform MLR-MIC at lead times of 1-10 days, and GBRT-MIC performs best at lead times of 4-10 days, especially for forecasting extreme values.
Comparing the forecasts of GBRT and GBRT-MIC, we conclude that GBRT-MIC can provide more accurate and reliable inflow forecasts at lead times of 1-10 days, and that the reanalysis data selected by MIC greatly improves the GBRT forecasts, especially at lead times of 4-10 days. In addition, the feature importance obtained from GBRT-MIC demonstrates that soil temperature, the total column water vapour and the dewpoint temperature near the ground contribute to the prediction accuracy of inflow at longer lead times.
In summary, the developed framework that integrates GBRT and reanalysis data selected by MIC performs inflow forecasting well at lead times of 1-10 days. The results of this study can assist power stations in making power generation plans 7-10 days in advance in order to reduce LEQDW and flood disasters.
Another direction for improving the results could be to consider heuristic methods (for example, the Grey Wolf Optimizer) for model parameter optimization, which could search a wider range of hyperparameters and find optimal parameters more quickly.

Appendix A
The ERA-Interim is a global atmospheric reanalysis product of ECMWF, produced with a data assimilation system called the Integrated Forecast System (IFS). The system includes a 4-dimensional variational analysis (4D-Var) with a 12-hour analysis window. The spatial resolution of the dataset is approximately 80 km (0.72°) on 60 vertical levels from the surface up to 0.1 hPa (Berrisford et al., 2011). Reanalysis meteorological products at resolutions from 0.125° to 2.5° are generated by interpolation. Reanalysis meteorological products from the ERA-Interim, such as rainfall, maximum and minimum temperatures, and wind speed, at 0.25° (latitude) × 0.25° (longitude) spatial and 12-hour temporal resolution for the study period 2011-2018 were downloaded from the ECMWF webpage.
13 input structures from the observed data are tried and 50 trials are performed for each input structure. The results (Fig. A1) show that the 7th input structure is the optimal input subset for GBRT. Figure A1. Trial results of 13 input structures from observed data.
26 input structures from the reanalysis data are tried and 50 trials are performed for each input structure. The results (Fig. A2) show that the 8th input structure is the optimal input subset for GBRT-MIC. Figure A2. Trial results of 26 input structures from reanalysis data.

Author contributions
S.L. carried out the study design, the analysis and interpretation of data, and drafted the manuscript. Z.L. and B.L. participated in the study design, data collection, analysis of data, and preparation of the manuscript. C.C. and Z.Z. carried out 405 the experimental work and the data collection and interpretation. X.J. participated in the design and coordination of experimental work, and acquisition of data. All authors read and approved the final manuscript.
Table 3. List of inputs of GBRT-MIC. There are two types, observed and reanalysis variables. The reanalysis variables are available twice a day, at 00:00 UTC and 12:00 UTC. A cumulative variable (e.g., total column water) is the sum of the two periods and an instantaneous variable (e.g., 2 metre dewpoint temperature) is the mean of the two periods.