Assessing the predictive capability of randomized tree-based ensembles in streamflow modelling


Introduction
Streamflow processes are complex non-linear hydrological phenomena exhibiting a high degree of spatial and temporal variability. Their accurate characterization plays an important role in any decision-making process concerned with water availability, such as water reservoir planning and management, operation of hydropower plants and irrigation systems, management of urban water supply systems, and many others. Two main approaches to streamflow modelling and prediction can be discerned in the hydrological literature (e.g. Beck, 1991; Wheater et al., 1993; Young, 2003): the hypothetico-deductive (or bottom-up) approach, according to which the physical mechanisms that contribute to streamflow formation in the hydrological cycle are either conceptualized in a simplified lumped representation (conceptual models) or mathematically reproduced as a system of partial differential equations (physically-based models); and the inductive (or top-down) approach, in which the mapping from the space of predictor variables (e.g. precipitation, temperature) to that of the response variables (i.e. streamflow) is inferred directly from observational data using a more general class of models (data-driven or metric models).
Depending on the objective of the modelling exercise, one approach can be more appropriate than the other. The complexity of conceptual and physically-based models is key to improving our understanding of hydrological processes and gives them a clear advantage in "what-if" or scenario analyses. However, the high number of parameters and states these models include, particularly to characterize spatial variability, often results in mis-calibration and over-parameterization (e.g. Jakeman and Hornberger, 1993; Beven, 2001), ultimately limiting the model's predictive capability and operational value. Data-driven models combine high predictive potential with a more compact representation, generally with considerably fewer parameters and state variables, which makes them well suited to the computational burden of optimization-based decision-making (e.g. Castelletti et al., 2010). Yet their effective identification requires long data records, and their typically black-box nature, which reveals very little of the internal structure, is often a deterrent to their systematic use in operational hydrology, though some successful attempts have been made to extract understandable insights from these model structures (e.g. Young and Beven, 1994; Babovic and Keijzer, 2002; See et al., 2008).
Data-driven models applied to streamflow modelling include traditional ARMA models (e.g. Rasmussen et al., 1996, and references therein) and all their extensions, transfer function models (e.g. Young, 2006), and data-based mechanistic (DBM) models (Young, 2003; Romanowicz et al., 2008). Methods from data mining, machine learning and artificial intelligence have also gained a good reputation in operational hydrology (Solomatine and Ostfeld, 2008). Among them, artificial neural networks, first used for streamflow modeling by Hsu et al. (1995), are the most popular choice (see the reviews by Maier and Dandy, 2000; Shamseldin et al., 2002). Other data-driven approaches widely tested in hydrological modeling (e.g. see the comparative analysis by Elshorbagy et al., 2010a, b) include fuzzy rule-based systems (e.g. Hundecha et al., 2001) and support vector machines (e.g. Lin et al., 2006). All these data-driven model families are based on a parameterization of the input-output relationship and are built by a two-stage identification process: first, the model structure is selected (including, when relevant, model input selection); then the parameters are estimated with appropriate automatic algorithms. A wrong selection of the model structure can have a significant impact on the predictive capability of the identified model, even when the parameters are estimated optimally within the selected family of functions.
A less traditional data-driven approach that is receiving increasing attention in the hydrological literature (e.g. Laaha and Blöschl, 2006; Sauquet and Catalogne, 2011; Bachmair and Weiler, 2012) is represented by decision trees, in particular Classification And Regression Trees (CART, Breiman et al., 1984), which are the simplest form of a decision tree. CART are non-parametric regressors with tree-like structures obtained by recursively partitioning the input space into mutually exclusive regions. The terminal regions (leaves) are associated with a constant output value obtained as the average of the output data falling in each leaf. CART have two advantages over most of the above-mentioned data-driven approaches. First, they avoid the need to find potentially complicated parametric functions, thus reducing the potential for a model structural component in the prediction error (Iorgulescu and Beven, 2004). Second, the tree structure can readily be interpreted as a cascade of "if-then" rules between combinations of inputs and the output, so CART can give better insight into the model's internal structure and the underlying physical processes (Iorgulescu and Beven, 2004; Wei and Watkins Jr., 2011). CART have been shown to perform well compared with other data-driven models in a number of applications (Dawson et al., 2000; Iorgulescu and Beven, 2004; Vezza et al., 2010). Yet they suffer from a double drawback. (i) The predicted output is composed of discrete values and the streamflow is reconstructed as a piecewise constant function. To ensure good predictive accuracy, the number of output classes (tree leaves) must be very high, but this increases the risk of overfitting the observed data and reduces the model's generalization ability (Ho, 1995; Breiman, 1996). (ii) The partitioning process is performed deterministically by exhaustively comparing all the possible combinations of input values to select the best-performing partition. This makes the computational requirements grow rapidly with the input space dimensionality; indeed, the optimal training of a decision tree is NP-hard (Hyafil and Rivest, 1976).
The first weakness can be resolved in two ways. One idea is to replace averaging in the tree leaves by fitting a linear regression function to the data, thus obtaining a continuous representation of the output. This approach, mostly known as M5 tree-modelling, was first introduced by Quinlan (1992) and applied to hydrological problems by Solomatine and Dulal (2003), Solomatine and Xue (2004), Bhattacharya and Solomatine (2005), Stravs and Brilly (2007), and Jothiprakash and Kote (2011). Another idea is to use an ensemble method (e.g. bagging, Breiman, 1996, or boosting, Freund and Schapire, 1996) to build a forest of regression trees. The underlying concept of ensembles is that multiple model predictions aggregated into one ensemble output yield better predictive performance than any of the constituent models (Dietterich, 2000). The adoption of tree ensembles for hydrological modeling has been reported by Snelder et al. (2009) and Erdal and Karakurt (2013). Both contributions show that tree ensembles remarkably advance the predictive capability of CART and generally compare favorably with other data-driven approaches.
Unfortunately, neither M5 nor CART ensembles help in reducing the computational burden associated with the optimal deterministic tree building process they incorporate. Rather, model identification is made even more computationally intensive by the general increase in the number of operations to be performed by the training algorithm. Recently, randomization methods have been shown to be an effective companion of ensemble tree methods (e.g. Geurts, 2002, and references therein). In fact, ensemble methods benefit greatly from diversity in the constituent models (Kuncheva and Whitaker, 2003), and the injection of randomness is a way of producing more or less diversified ensembles (Ho, 1995). In particular, the direct randomization of the individual tree growing method seems to be more productive for the ensemble, in terms of both accuracy and computational requirements, than the optimality of traditional induction algorithms, such as those in M5 and CART (Geurts, 2002).
Several approaches have been developed based on the direct randomization of the tree growing method (e.g. Bagging predictors, Breiman, 1996; Random Subspace, Ho, 1998; Random forests, Breiman, 2001; PERT, Cutler and Guohua, 2001). Lately, the extremely randomized trees developed by Geurts et al. (2006) (Extra-Trees in short) have been empirically demonstrated to outperform most of the other randomized and deterministic methods in terms of both prediction accuracy (more specifically, variance and bias reduction) and computational efficiency. Extra-Trees are ensembles of totally randomized trees: in the process of building a tree, they randomize both the input variables and the splitting values considered in creating a partition, and they create a forest of trees to compensate for the randomization by averaging the outcomes of the constituent trees. The combination of averaging and randomization ensures (i) modeling flexibility/accuracy (i.e. the ability to characterize strongly nonlinear relationships), (ii) computational efficiency (and thus scalability to large datasets), and (iii) scalability with respect to input dimensionality. In addition, (iv) Extra-Trees, like several other tree-based ensemble methods (Jong et al., 2004), can be exploited to infer the relative importance of the input variables and to order them accordingly (Wehenkel, 1998; Fonteneau et al., 2008). This provides an ex-post interpretation of the model and makes the model more understandable and credible to users than other data-driven approaches.
In this paper we explore the applicability of Extra-Trees to streamflow modeling and comprehensively analyze their advantages and disadvantages in terms of prediction, explanation and computational efficiency. Specifically, we adopt a four-step assessment procedure including (i) random sampling of the observational dataset to ensure a robust evaluation of the model performance (Elshorbagy et al., 2010a); (ii) multi-criteria assessment of the model performance (Hwang et al., 2012, and references therein) to consistently validate the model behavior under different flow conditions; (iii) comparative assessment of predictive accuracy and computational efficiency against tree-based methods (M5 and CART) already tested in water-related applications and other traditional data-driven approaches (ANNs and multiple linear regression); (iv) uncertainty analysis of the model residuals.
The numerical analysis is conducted on two streamflow modelling problems with different spatial domains, hydro-meteorological features, and temporal dynamics. Marina catchment, Singapore, is a relatively small urban catchment with a very short time of concentration, considerably altered by human intervention and subject to a tropical climate; the Canning River, Western Australia, is a large river basin, predominantly natural, characterized by a Mediterranean climate and modeled with a daily time step.

Extremely randomized trees (Extra-Trees)
Tree-based regressors are structured as a hierarchical cascade of rules able to predict numerical values of the output (Breiman et al., 1984). The process of building the nodes and branches forming a tree is based on the partitioning of the input space into mutually exclusive regions according to a pre-defined splitting criterion, progressively narrowing down the size of the regions. Eventually, when the number of instances in a region becomes smaller than a preassigned value (or their values vary only slightly), the partitioning of that region stops and a leaf is created. Whenever a new instance is fed into the tree, a specific path is followed according to the splitting rules defined in the tree-building procedure, and the predicted output is then obtained by aggregating the values stored in the leaf. The splitting criterion, the termination test, the number of trees grown, and the rule adopted to associate a numerical value with each leaf are the key features differentiating the many tree-based methods available in the literature.
At one extreme, CART is a fully deterministic single-tree method; at the other, Extra-Trees are a totally randomized ensemble method, as explained next.


Model building
Extra-Trees differ substantially from traditional deterministic and randomized methods in two respects. First, in the process of building a tree, the selection of the input and of the splitting value used to split a node is randomized, i.e. it is made (at least partly) independently of the output variable. Second, an ensemble of M trees is created to compensate for the effect of randomization, and the outcome of the ensemble is the average of the individual tree outputs. Nodes are split using the following rule: K alternative inputs (cut-directions) are randomly selected and, for each one, a random splitting value (cut-point) is chosen; a score is then associated with each cut-direction, and the one maximizing the variance reduction, according to the adopted splitting criterion, is used to split the node.
The termination test that determines when to stop partitioning a node is based on the number of instances within the node. When this number is smaller than a user-defined value n_min, the algorithm stops partitioning the node and a leaf is created (Geurts et al., 2006). Each leaf is then assigned a value, obtained as the average of the target values associated with the inputs falling in that leaf. The estimates produced by the M trees are finally aggregated by arithmetic averaging (see Table 1 for a tabular version of the Extra-Trees building algorithm). The rationale behind the approach is that using the original training dataset (instead of a bootstrap replica, as in the Bagging method; Breiman, 1996) minimizes bias, while the combined use of randomization and ensemble averaging reduces the variance of the model output (Geurts et al., 2006).
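The building procedure described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study (which relies on a compiled C++ package): the function names, the dictionary-based tree layout, the uniform sampling of cut-points between the node minimum and maximum, and the use of relative variance reduction as the score are all assumptions made for illustration.

```python
import random
from statistics import fmean, pvariance


def variance_reduction(y, y_left, y_right):
    """Relative variance reduction obtained by splitting y into two parts."""
    total = pvariance(y)
    if total == 0.0:
        return 0.0
    n = len(y)
    within = (len(y_left) * pvariance(y_left) + len(y_right) * pvariance(y_right)) / n
    return (total - within) / total


def build_tree(X, y, K, n_min, rng):
    """Grow one randomized tree on the full training set (no bootstrap)."""
    # Termination test: too few instances in the node, or constant output.
    if len(y) < n_min or min(y) == max(y):
        return {"leaf": fmean(y)}
    best = None
    # Draw K random cut-directions and one random cut-point for each.
    for i in rng.sample(range(len(X[0])), K):
        values = [row[i] for row in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # this input is constant in the node and cannot split it
        cut = rng.uniform(lo, hi)
        left = [k for k, v in enumerate(values) if v < cut]
        right = [k for k, v in enumerate(values) if v >= cut]
        if not left or not right:
            continue
        score = variance_reduction(y, [y[k] for k in left], [y[k] for k in right])
        if best is None or score > best[0]:
            best = (score, i, cut, left, right)
    if best is None:
        return {"leaf": fmean(y)}
    _, i, cut, left, right = best
    return {
        "input": i,
        "cut": cut,
        "left": build_tree([X[k] for k in left], [y[k] for k in left], K, n_min, rng),
        "right": build_tree([X[k] for k in right], [y[k] for k in right], K, n_min, rng),
    }


def build_ensemble(X, y, M, K, n_min, seed=0):
    """Grow M randomized trees, all on the same training data."""
    rng = random.Random(seed)
    return [build_tree(X, y, K, n_min, rng) for _ in range(M)]


def predict_tree(tree, x):
    """Follow the splitting rules down to a leaf and return its stored average."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["input"]] < tree["cut"] else tree["right"]
    return tree["leaf"]


def predict(ensemble, x):
    """Ensemble prediction: arithmetic average of the individual tree outputs."""
    return fmean([predict_tree(tree, x) for tree in ensemble])
```

With K = 1 only one random split is drawn per node, so the score plays no role in choosing among candidates and the tree structure is largely independent of the output; larger K lets the variance-reduction score filter the random candidates.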

Hyperparameters
The three hyper-parameters M, K, and n_min characterizing the model building algorithm affect the ensemble performance and the overall method efficiency in different ways. Increasingly high values of M reduce the variance of the final estimate (Breiman, 2001), but also add considerably to the computational requirements of the building algorithm, so the final choice depends on a trade-off between the desired model accuracy and the available computing power. K can be chosen in the interval [1, ..., n], with n being the number of input variables, and controls the level of randomness in the tree building process. The smaller K, the stronger the randomization of the trees and the weaker the dependence of their structure on the values of the output variable in the training dataset. In the extreme case, when K is equal to 1, the splits (cut-directions and cut-points) are chosen totally independently of the output variable and the method builds totally randomized trees. As empirically demonstrated by Geurts et al. (2006), the optimal value of K for regression problems is equal to the number n of inputs, so that every input is considered as a candidate cut-direction at each node. Finally, the threshold n_min is used to balance bias and variance reduction. Large values of n_min lead to small trees, with high bias and small variance; conversely, low values of n_min lead to fully-grown trees, which may over-fit the data. The optimal tuning of n_min can depend on the level of noise in the training dataset: the noisier the outputs, the higher the optimal value of n_min. Although this tuning might require some experiments, Geurts et al. (2006) have shown that a value of n_min between 5 and 50 is a robust choice in a broad range of typical conditions.

Computational requirements
From the computational point of view, the complexity of the Extra-Trees building procedure is on the order of |D| · log(|D|), with |D| being the number of input-output observations in the training dataset D. The computational time increases linearly with M and K, and decreases logarithmically for increasing values of n_min, meaning that the approach remains computationally efficient even though it is based on the construction of a tree ensemble. This is because the splitting rule is very simple compared with other splitting rules that locally optimize the cut-points, such as those adopted by CART and M5.


Input ranking
The particular structure of Extra-Trees can be exploited to rank the importance of the n input variables in explaining the selected output behavior. This approach, as originally proposed by Wehenkel (1998), is based on the idea of scoring each input variable by estimating the relative variance reduction it can be associated with when the training dataset D is propagated over the M different trees composing the ensemble. More precisely, the relevance G(x_i) of the i-th input variable x_i in explaining the output y can be evaluated as follows:

$$G(x_i) = \frac{\sum_{\tau=1}^{M} \sum_{j=1}^{\Omega} \delta(\nu_j, x_i)\, \Delta_{\mathrm{var}}(\nu_j)}{\sum_{\tau=1}^{M} \sum_{j=1}^{\Omega} \Delta_{\mathrm{var}}(\nu_j)}$$

where ν_j is the j-th non-terminal node in the τ-th tree, Ω is the number of non-terminal nodes in the tree, δ(ν_j, x_i) is equal to 1 if the variable x_i is used to split the node ν_j (and 0 otherwise), and Δ_var(ν_j) (or Δ_var(s_i, D)) is the variance reduction associated with node ν_j (see Table 1). Finally, the input variables {x_1, x_2, ..., x_n} are sorted by decreasing values of their relevance (see Table 2 for a tabular version of the input ranking algorithm).
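A minimal sketch of the ranking step is given below. It assumes that the relevance of an input is the variance reduction accumulated over all non-terminal nodes that split on it, normalized by the total variance reduction over the ensemble, so that the scores sum to one. The function name and the representation of each tree as a list of (input index, variance reduction) pairs are hypothetical choices made for illustration.

```python
def rank_inputs(trees, n_inputs):
    """Score and rank inputs from per-node variance reductions.

    trees: list of trees; each tree is a list of (input_index, delta_var)
           pairs, one pair per non-terminal node (hypothetical layout).
    Returns (relevance, ranking), where relevance[i] is the share of the
    total variance reduction attributed to input i and ranking lists the
    input indices by decreasing relevance.
    """
    explained = [0.0] * n_inputs
    for tree in trees:
        for input_index, delta_var in tree:
            explained[input_index] += delta_var
    total = sum(explained)
    relevance = [g / total if total > 0 else 0.0 for g in explained]
    ranking = sorted(range(n_inputs), key=lambda i: relevance[i], reverse=True)
    return relevance, ranking


# Two small hypothetical trees over three inputs.
trees = [[(0, 6.0), (1, 2.0)], [(0, 1.0), (2, 1.0)]]
relevance, ranking = rank_inputs(trees, n_inputs=3)
```

In this toy example input 0 accounts for 7 of the 10 units of variance reduction and is therefore ranked first.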

Marina catchment
Marina catchment feeds the homonymous reservoir located in the heart of Singapore. The reservoir, created in late 2008 with the construction of a tidal barrier, has a surface area of 2.45 km² and an active storage of about 3.2 × 10^6 m³, and is operated for flood control and drinking water supply (Galelli et al., 2013). Five main tributaries discharge water into the reservoir, draining a catchment of ≈ 100 km² (almost 15 % of the land area of Singapore) and producing a mean annual inflow of about 150 × 10^6 m³ with a typical tropical pattern. The catchment includes one of the most densely populated and urbanized regions in Singapore and south-east Asia (Xie, 2006), and its drainage system consists of concrete-lined canals, which make the time of concentration extremely short (≈ 1 h) and the base flow almost null. Because of the high-intensity rainfall events characterizing the region (Selvalingam et al., 1987), discharges occur in high peaks over short periods of a few hours (see Fig. 1, upper panel).
The available dataset consists of hourly rainfall and inflow measurements over the period 1 April 2009-31 December 2011, for a total of 24 120 data points (see Table 3 for the descriptive statistics of the output variable). The selection of the most significant time-lags is performed by means of the Mutual Information (MI) criterion (e.g. Hejazi and Cai, 2009, and references therein), which singled out an input set composed of three time-lags for each variable, namely [y_{t−1}, y_{t−2}, y_{t−3}, r_{t−1}, r_{t−2}, r_{t−3}], with y_{t−1} and r_{t−1} denoting the inflow and rainfall in the time interval [t − 1, t]. The streamflow modelling exercise is then performed over a prediction horizon of 1 h.
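Assembling the lagged input set from the raw series can be sketched as follows. The function name and the list-based layout are hypothetical, and the sketch assumes that the two series are aligned sample by sample and that the target paired with the lags at index t is the inflow at index t.

```python
def lagged_dataset(y, r, n_lags=3):
    """Build input rows [y_{t-1..t-n}, r_{t-1..t-n}] and targets y_t.

    y, r: equally long sequences of inflow and rainfall observations.
    The first n_lags samples are dropped because their lags are unavailable.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(y)):
        y_lags = [y[t - k] for k in range(1, n_lags + 1)]
        r_lags = [r[t - k] for k in range(1, n_lags + 1)]
        inputs.append(y_lags + r_lags)
        targets.append(y[t])
    return inputs, targets


# Toy series: five hourly inflow and rainfall values.
X, target = lagged_dataset([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

Each row of X then holds the six predictors in the order listed above, and the corresponding entry of target is the value to be predicted one step ahead.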

Canning River
The second dataset is taken from the Canning River basin, a major tributary of the Swan River in Western Australia. The river drains a catchment area of ≈ 850 km [...] September. The combination of this rainfall pattern and land use gives the river an ephemeral nature (Young, 2002), with practically no flow during the summer period. As discussed in Young et al. (1997), a data analysis indeed shows a strong non-linear correlation between the rainfall and the river flow (Fig. 1).
For the present analysis the dataset consists of daily rainfall, temperature and flow measurements available for the period 1 January 1977-31 December 1987, for a total of 4017 data points (Table 3). As for the former dataset, the most significant input variables are selected with the MI criterion. According to this criterion, two time-lags for each variable, namely [y_{t−1}, y_{t−2}, r_{t−1}, r_{t−2}, T_{t−1}, T_{t−2}] (with T_{t−1} denoting the average temperature in the time interval [t − 1, t]), are selected to predict the flow one day ahead.

Setting the experiments
The quantitative assessment of Extra-Trees is performed using a four-step procedure:
Random sampling. To ensure a robust evaluation of the model performance (Elshorbagy et al., 2010a), the two datasets are randomly sampled (without replacement) 100 times, in order to create at each sampling exercise a training/cross-validation subset and a testing subset, containing two thirds and one third of the available data, respectively. Ten different groups (each composed of training/cross-validation and testing subsets) are then selected based on their statistical properties, namely the mean and standard deviation of the output variable. Ten different models are identified on the 10 data groups, with each model finally evaluated on the corresponding testing subset.
Model evaluation. The Extra-Trees evaluation is based on multi-assessment criteria (Hwang et al., 2012) [...] at moderate flow values. This assessment is completed by a graphical analysis of the scatter plots and hydrographs.
Comparative assessment. The best Extra-Trees ensemble so identified is compared against several machine learning modeling methods, including tree-based methods (M5 model trees and CART) and ANNs. To facilitate the comparison, Multiple Linear Regression (MLR) models are employed as baseline references.
Uncertainty analysis. To estimate the uncertainty associated with model predictions, the residuals of the 10 testing subsets are computed and aggregated in a single dataset, to which a probability distribution is fitted. In the benchmarking exercise, a two-sample Kolmogorov-Smirnov test is then performed to compare the distributions of model residuals. In particular, residuals are tested under the null hypothesis that they are from the same continuous distribution: two sets of residuals are considered significantly different if the null hypothesis is rejected at the 5 % significance level (p-value < 0.05).
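The two-sample Kolmogorov-Smirnov comparison used in the last step can be sketched in pure Python as below. This is not the routine used in the study: the function names are hypothetical, and the p-value relies on the standard asymptotic Kolmogorov series, which is only an approximation for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # empirical CDF of a at x
        fb = bisect_right(b, x) / len(b)  # empirical CDF of b at x
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value for statistic d with sample sizes n and m."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the value is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def significantly_different(res_a, res_b, alpha=0.05):
    """Reject 'same continuous distribution' when the p-value is below alpha."""
    d = ks_statistic(res_a, res_b)
    return ks_pvalue(d, len(res_a), len(res_b)) < alpha
```

Applied to the aggregated residuals of two models, a result of True corresponds to rejecting the null hypothesis at the 5 % level.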

Prediction
The predictive potential of Extra-Trees is assessed for different values of M, K, and n_min. The sensitivity analysis is performed by running an extensive number of training/cross-validation and testing experiments on the selected 10 data groups of each dataset. As explained in Sect. 2.1, the value of K is fixed equal to the number n of input variables, which is 6 for both the Marina and Canning datasets. [...] criterion is given in Fig. 4. For both the Marina and Canning datasets, the larger the number M of trees in the forest, the higher the variance reduction. The reduction in the variance has a positive effect on the Extra-Trees estimation error and is reflected in the reduced distance between observed and predicted values as M grows from 1 to 100. Since the computation time increases linearly with M, a balance must be found between accuracy and time requirements. The saturation effect (Fig. 4c, d) might help in deciding a proper value (see also Castelletti et al., 2010): the performance improvement for values of M greater than 200-300 is negligible. The value of n_min determines the number of leaves in a tree and, thus, the ensemble's overall trade-off between bias and variance. As shown in Figs. 2 and 3, reducing n_min has a positive effect on all the assessment criteria. This effect holds down to a value of n_min of about 5. Indeed, when this threshold is reached, the model building algorithm produces fully grown trees, with the consequent risk of over-fitting the data (i.e. lower bias but higher variance in the model output).
In summary, the sensitivity analysis shows that Extra-Trees provide reasonably good performance over a broad range of parameter values: the value of M should be as large as possible, though a saturation effect is reached for M greater than 200-300, while n_min, as already discussed by Geurts et al. (2006), should lie between 5 and 15. For the subsequent analysis (i.e. input ranking and benchmarking), a parameterization with M and n_min equal to 500 and 5, respectively, is finally chosen.

Explanation
As anticipated, the Extra-Trees model building algorithm implicitly allows the model inputs to be ranked in terms of their relevance in explaining the output. This is useful for the ex-post physical interpretation of the cause-effect relationships captured by the model. The ranking is run on the ensemble selected at the end of the model building process. In particular, an ensemble is cross-validated on the selected 10 data groups of each dataset, and the inputs are sorted in decreasing order according to the ranking algorithm described in Sect. 2.4. The results, obtained as the average relative contribution (over the 10 data groups), are reported in Tables 4 and 5.
As for Marina catchment, the measured rainfall r_{t−1} and antecedent flow y_{t−1} are the most important variables, accounting for about 80 % of the ensemble total variance. The measured rainfall r_{t−1} is ranked in first position, with a relative score of almost 67 %. This high relevance is due to the hydraulic characteristics of Marina catchment, which is drained by concrete-lined canals with an almost null base flow: high flow peaks are mainly driven by rainfall, so the cumulated precipitation in the previous hour becomes the most relevant information for the model output. Because of the short time of concentration (approximately one hour), the measured precipitation and antecedent flow with 2 and 3 time-lags are less important.
The Canning River drains a large, natural catchment forced by a Mediterranean climate. As illustrated in Table 5, the antecedent flows with 1 and 2 time-lags are the most relevant variables (87 % of the ensemble output), followed by rainfall and temperature.

Benchmarking
The best Extra-Trees ensemble identified in the model building process is compared against M5 model trees, CART, ANNs and MLR. The same experimental setting and datasets used for the Extra-Trees are adopted in this benchmarking exercise in order to guarantee a rigorous and unbiased comparison.

Models implementation
The MatLab toolbox M5PrimeLab (Jekabsons, 2010) is used to implement the M5 model trees in the different case studies and relative data groups. Pruning and smoothing are accounted for as suggested in Jothiprakash and Kote (2011); in particular, the smoothing coefficient is optimized via trial-and-error in the range [0, 20] (Wang and Witten, 1997). The other parameters requiring manual tuning are the split threshold [...]
CART are implemented with the MatLab Statistics Toolbox, which relies on the original algorithm proposed by Breiman et al. (1984). Similarly to the other tree-based methods adopted in this study (i.e. Extra-Trees and M5), the minimum number of training samples one node may represent is heuristically optimized in the range [2, 1000]. Pruning is adopted to compute the full tree and the optimal sequence of pruned subtrees, thus minimizing the risk of over-fitting the cross-validation data.
The MatLab Neural Network Toolbox is adopted to set up the ANNs, whose parameters are optimized by means of the Levenberg-Marquardt algorithm. For each of the 10 data groups (of each case study), the ANN cross-validation process is repeated 100 times with 100 different initializations of the random weights. The best-performing parameterization in terms of RMSE is then selected as representative of a data group. As for the ANN architecture, the number of input nodes corresponds to the number of input variables (thus 6 for both the Marina and Canning River case studies), while the number of hidden nodes is heuristically optimized in the range [1, 10].
MLR models are also implemented in MatLab, and calibrated using least-squares.
For each machine learning method considered in this study, this implementation eventually leads to 10 models (for each case study) developed and tested using the corresponding unseen data groups.

Results and analysis
As discussed in Sects. 3.1 and 4.2, the Marina catchment dataset is characterized by a weak autocorrelation in the hourly inflow to the reservoir. This is the reason why providing the antecedent flow as an input to predict future discharges adds little to the information available to the different models. Rather, the limiting factor for the model performance seems to be the capability of exploiting the correlation between the future inflows and the measured rainfall and flow. This is confirmed by the results reported in Table 6. Extra-Trees and M5 outperform the other models with respect to all the multi-assessment criteria. In this specific comparison, Extra-Trees and M5 are, de facto, comparable over the whole range of flows, as shown by the NS and RRMSE values. Extra-Trees and M5 are also comparable in terms of MAE, which indicates the goodness of fit at moderate flow values. Yet M5 stands out as the best-performing model when accounting for the RMSE, which measures the model performance relevant to high flows. This behavior can probably be explained by considering the different model architectures: M5 trees have linear models in the final (pruned) leaves, and this allows them to extrapolate over unseen events; the Extra-Trees prediction corresponds to the average of the output values associated with the inputs falling in a specific leaf, and this can limit their extrapolation capabilities. The third model family in order of performance is ANNs, while the worst results are obtained by CART and MLR. The low CART performance can again be explained by the model architecture: the CART model building algorithm provides an optimal partitioning of the input space (with respect to the standard deviation reduction of the output variables; see Breiman et al., 1984), but the prediction associated with each leaf is simply the average of the output values associated with the inputs falling in that leaf. As a consequence, a CART structure can be seen as a classification of the different flow regimes registered in the training/cross-validation data group, and this can limit the overall model predictive capabilities, as confirmed by the scatter plots and hydrograph shown in Figs. 5 and 6. This does not occur with Extra-Trees, since the model building algorithm improves the performance of a single model by ensemble averaging.
Unlike Marina catchment, the Canning River dataset shows a stronger autocorrelation in the flow process, and this enhances the information content at the disposal of the different models. As shown in Table 7, the models are characterized by more comparable performance, although Extra-Trees and M5 stand out as the best-performing models. This result is confirmed by a graphical analysis of the scatter plots and hydrograph (Figs. 5 and 6).
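For reference, three of the criteria discussed above (RMSE, MAE and the Nash-Sutcliffe efficiency, NS) can be computed with the short sketch below, using their standard textbook definitions. The function names are hypothetical, and RRMSE is left out because its exact definition is not restated in this section.

```python
import math


def rmse(obs, pred):
    """Root mean squared error; squaring gives large (peak-flow) errors more weight."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error; every error contributes in proportion to its size."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def nash_sutcliffe(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / spread


# Toy check: a prediction that is uniformly one unit too high.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]
scores = (rmse(obs, pred), mae(obs, pred), nash_sutcliffe(obs, pred))
```

Because RMSE squares the errors, a model that misses a few flow peaks is penalized more heavily by RMSE than by MAE, which is consistent with reading the two criteria as indicators of high-flow and moderate-flow performance, respectively.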


Residuals analysis
The Logistic probability distribution, with different parameters α and β, is found to best fit the residuals of the different models on both case studies. Although the residuals are characterized by the same distribution, a graphical analysis shows a substantial difference in the estimated parameters (Fig. 7). All the model residuals have a symmetrical distribution, but Extra-Trees and M5 have the smallest predictive uncertainty. These two models are followed by CART and ANNs, which show a lower probability of null residuals and a more prominent kurtosis. The linear model residuals are statistically comparable to the CART residuals for the Canning River case study, while they show an asymmetrical distribution for the Marina catchment dataset. This means that the MLR model residuals are biased, and the model is statistically prone to underestimating the inflow. This difference in the pdf parameterizations is confirmed by the two-sample Kolmogorov-Smirnov test: the p-value is null for all the combinations of model residuals, which indicates that the model residuals may represent different distributions.

Computational requirements
All the cross-validation and testing experiments for M5, CART, ANNs and MLR are carried out in the MATLAB 7.10.0 (R2010a) environment, running on a 2.4 GHz Intel Core 2 Duo with 4 GB of RAM. The experiments for Extra-Trees are carried out using a compiled C++ package running on the same machine. Table 8 shows that, when the different models are applied to the Canning River case study, the computational requirements are quite limited: Extra-Trees and M5 require, for example, 78.40 and 32.21 s, respectively, for the cross-validation process of a single data group consisting of 2560 samples (1280 in testing). The computational requirements of ANNs are smaller, but it is necessary to account for the 100 random initializations (a single initialization takes 8.24 s).
On the other hand, the application of these models to the Marina catchment problem, characterized by a much larger number of samples ( 16 in testing), shows a different picture. The CPU time needed to cross-validate an ensemble of 500 Extra-Trees (with $n_{\min}$ = 5) increases to 1008.88 s, while the time spent on M5 is 1788.30 s. The Extra-Trees model building algorithm is roughly 45 % faster than the M5 one. Apart from the specific model implementation (the C++ executable may be faster than the MATLAB environment), the reason for this important difference lies in the rule adopted when splitting a node during the building process. The M5 building procedure examines all possible splits by exhaustive search (and then chooses the one that maximizes the standard deviation reduction of the output variable), while the Extra-Trees model building algorithm explores only K cut-directions (with K equal to the number of input variables) with the corresponding splitting values. Although an ensemble of trees is built, the overall computational burden remains limited because of the simple splitting rule.
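The effect of the splitting rule can be made concrete with a back-of-the-envelope sketch. The node size and number of inputs below are hypothetical, and the count for the exhaustive search is an upper bound that assumes all input values in the node are distinct. The two CPU times are those quoted in the text.

```python
def exhaustive_candidates(n_samples, n_inputs):
    """Upper bound on the splits examined at one node by an exhaustive search:
    one threshold between each pair of consecutive sorted values, per input."""
    return n_inputs * (n_samples - 1)

def extra_trees_candidates(k):
    """Extra-Trees draw one random cut-point for each of the K selected inputs."""
    return k

# Hypothetical node: 10 000 samples and 5 inputs (not taken from the paper).
n_samples, n_inputs = 10_000, 5
print(exhaustive_candidates(n_samples, n_inputs), extra_trees_candidates(n_inputs))

# CPU times quoted in the text for the Marina catchment cross-validation (s).
t_extra_trees, t_m5 = 1008.88, 1788.30
time_saving = 1.0 - t_extra_trees / t_m5
print(f"relative time saving: {time_saving:.1%}")
```

With these figures the relative saving comes out at about 44 %, in line with the "roughly 45 %" quoted above.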

Conclusions
Extra-Trees have been evaluated for their predictive accuracy, explanation ability and computational performance against other very popular data-driven methods in a streamflow modeling exercise. The analysis was conducted numerically on two hydrological datasets. Results show that (i) Extra-Trees provide good performance on both datasets in terms of different assessment criteria, and their performance is numerically equivalent to that of the best-performing model identified during the benchmarking exercise (i.e. M5); (ii) despite their ensemble nature, Extra-Trees outperform the other methods in terms of computational efficiency when adopted on large datasets (good scalability), such as the Marina catchment; finally, (iii) Extra-Trees provide a physically interpretable ranking of the input variables in terms of relevance in explaining the output. It can also be observed that, being a non-parametric method, Extra-Trees do not require any parameter optimization, and they provide good performance over a broad range of hyper-parameters. In addition, the combined use of randomization and ensemble averaging is aimed at minimizing the output variance without the need for any a-posteriori processing, such as pruning and smoothing (adopted for M5). This has two advantages: it further simplifies the model identification and it adds to the computational efficiency of Extra-Trees.
In conclusion, Extra-Trees are a valid alternative to traditional parametric data-driven methods, such as ANNs, and to other non-ensemble tree-based approaches. They can be adopted for any hydrological problem (as they provide performance equivalent to that achievable with parametric methods), and are recommended for computationally intensive problems. These include modeling of large datasets and input selection: large datasets are becoming more frequent in several hydrological applications, such as the modeling of urban hydrological processes, where the short time of concentration of urban catchments requires adopting a very short sampling/modelling time (e.g. one hour in the Marina catchment), thus largely adding to the dimensionality of the training and testing datasets.
The Extra-Trees capabilities are tested on two streamflow modelling problems with different spatial domains and hydro-meteorological features: the Marina catchment is a relatively small urban catchment, considerably altered by human intervention and subject to a tropical climate; the Canning River watershed is a large basin, predominantly natural, characterized by a mediterranean climate.
... where woodland is the predominant land use. The climate shows a mediterranean pattern, characterized by warm and dry summers and cool, wet winters. The long-term average annual rainfall for the catchment is ≈ 900 mm, mostly falling between May and ...
..., aimed at describing the model behavior under different flow conditions. The criteria considered are (i) the Nash-Sutcliffe (NS) criterion and (ii) the Relative Root Mean Squared Error (RRMSE), which are normalized statistics providing a description of the model behaviour over the whole range of flow conditions; (iii) the Root Mean Squared Error (RMSE), which measures the goodness of fit relevant to high flows; and (iv) the Mean Absolute Error (MAE), which indicates the goodness of fit at moderate flow values.
25 values for M and $n_{\min}$ are sampled in the domains [1, 1000] and [2, 1000], leading to 625 different parameterizations. The extreme cases are: (i) a single Extra-Tree with large leaves (i.e. M = 1, $n_{\min}$ = 1000) or a fully-grown tree (i.e. M = 1, $n_{\min}$ = 2), (ii) a large forest composed of small or fully-grown trees (M = 1000 with $n_{\min}$ = 1000 or 2, respectively). The values of the multi-assessment criteria as a function of M and $n_{\min}$ are illustrated in Figs. 2 and 3, while a graphical analysis of the parameters' effect on the NS ...
... and the minimum number of training samples one node may represent. The former is explored in the range [0.05, 0.20], the latter in the range [2, 1000].

Table 1.
Tabular version of the Extra-Trees building algorithm. Input: an output variable y, n inputs {$x_1, x_2, \dots, x_n$} and a training dataset D composed of |D| input-output observations. Output: a single Extremely Randomized Tree. The algorithm is repeated M times to produce an ensemble.
Step 1. Select K input variables {$x_1, x_2, \dots, x_K$} at random among the n candidate inputs.
Step 2. For each selected input variable $x_i$ (with i = 1, ..., K):
Step 2a. Compute the minimum and maximum value of $x_i$ in D, denoted as $x_{i,\min}^{D}$ and $x_{i,\max}^{D}$.
Step 2b. Randomly select a cut-point $s_i$ in the interval $[x_{i,\min}^{D}, x_{i,\max}^{D}]$.
Step 2c. Return the split $x_i < s_i$.
Step 3. Among the K splits {$s_1, s_2, \dots, s_K$}, select the split $s^*$ such that $\Delta_{\mathrm{var}}(s^*, D) = \max_{i=1,\dots,K} \Delta_{\mathrm{var}}(s_i, D)$, where:
- $\Delta_{\mathrm{var}}(s_i, D)$ is the variance reduction, defined as
$$\Delta_{\mathrm{var}}(s_i, D) = \frac{\mathrm{var}\{y|D\} - \frac{|D_l(x_i)|}{|D|}\,\mathrm{var}\{y|D_l(x_i)\} - \frac{|D_r(x_i)|}{|D|}\,\mathrm{var}\{y|D_r(x_i)\}}{\mathrm{var}\{y|D\}},$$
- $D_l(x_i)$ and $D_r(x_i)$ are the two subsets of D satisfying the conditions $x_i < s_i$ and $x_i \geq s_i$,
- |D| is the number of samples in D, and $|D_l(x_i)|$ and $|D_r(x_i)|$ are the numbers of samples in $D_l(x_i)$ and $D_r(x_i)$.
Step 4. According to $s^*$, split the set D into the subsets $D_l(x_i)$ and $D_r(x_i)$, and return the (non-terminal) node $\nu_j$.
Step 5. For the subset $D_l(x_i)$ (and $D_r(x_i)$), verify the following conditions:
- $|D_l(x_i)|$ (or $|D_r(x_i)|$) is lower than $n_{\min}$ (minimum cardinality).
- All input variables {$x_1, x_2, \dots, x_n$} are constant in $D_l(x_i)$ (or $D_r(x_i)$).
- The output variable is constant in $D_l(x_i)$ (or $D_r(x_i)$).
If one of the conditions in Step 5 is satisfied, the subset is a leaf (labelled with the average of the output variable values). Otherwise, Steps 1-5 are repeated by replacing D with $D_l(x_i)$ (or $D_r(x_i)$).

Table 2.
Tabular version of the Extra-Trees input ranking algorithm. Input: an output variable y, n inputs {$x_1, x_2, \dots, x_n$} and a training dataset D composed of |D| input-output observations. Output: ranking of the input variables (sorted by decreasing values of their relevance), and an ensemble of M Extra-Trees.
Step 1. Assign to each input variable $x_i$ (with i = 1, ..., K) a score $G(x_i)$ equal to 0.
Step 2. Define suitable values for M, K and $n_{\min}$ and build an ensemble of Extra-Trees (as described in Table 1). At each split node $\nu_j$, update the score $G(x_i)$ of the selected input variable $x_i$.
Step 3. Normalize the score $G(x_i)$ of each input variable, and sort these values in decreasing order.
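The building and ranking procedures of Tables 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the compiled C++ package used in the experiments. Two points are assumptions: the variance reduction is normalized by the node variance, and the score update adds the variance reduction weighted by the number of samples in the node, since the update equation is not reproduced in the text. The toy dataset and all function names are hypothetical.

```python
import random
from statistics import mean, pvariance

def variance_reduction(y, y_left, y_right):
    """Relative variance reduction of a split (Table 1, Step 3)."""
    total = pvariance(y)
    if total == 0.0 or not y_left or not y_right:
        return 0.0
    within = (len(y_left) * pvariance(y_left)
              + len(y_right) * pvariance(y_right)) / len(y)
    return (total - within) / total

def build_tree(X, y, K, n_min, rng, scores):
    """Grow one randomized tree and accumulate the input scores G(x_i)."""
    varying = [i for i in range(len(X[0])) if len({r[i] for r in X}) > 1]
    # Step 5: small, input-constant or output-constant subsets become leaves.
    if len(y) < n_min or not varying or len(set(y)) == 1:
        return mean(y)
    best = None
    for i in rng.sample(varying, min(K, len(varying))):   # Step 1
        column = [r[i] for r in X]
        s = rng.uniform(min(column), max(column))         # Steps 2a-2b
        left = [j for j, v in enumerate(column) if v < s]  # Step 2c
        right = [j for j, v in enumerate(column) if v >= s]
        gain = variance_reduction(y, [y[j] for j in left], [y[j] for j in right])
        if best is None or gain > best[0]:                # Step 3
            best = (gain, i, s, left, right)
    gain, i, s, left, right = best
    if not left or not right:  # degenerate cut-point: stop here
        return mean(y)
    scores[i] += gain * len(y)  # assumed score update (see lead-in)
    return (i, s,                                          # Step 4
            build_tree([X[j] for j in left], [y[j] for j in left], K, n_min, rng, scores),
            build_tree([X[j] for j in right], [y[j] for j in right], K, n_min, rng, scores))

def predict_tree(node, x):
    """Follow the splits down to a leaf and return its average output."""
    while isinstance(node, tuple):
        i, s, left, right = node
        node = left if x[i] < s else right
    return node

def extra_trees(X, y, M, K, n_min, seed=0):
    """Build M trees and return them with the normalized relevance scores."""
    rng = random.Random(seed)
    scores = [0.0] * len(X[0])
    trees = [build_tree(X, y, K, n_min, rng, scores) for _ in range(M)]
    total = sum(scores)
    relevance = [g / total for g in scores] if total > 0 else scores
    return trees, relevance

def predict(trees, x):
    """Ensemble prediction: average of the single-tree predictions."""
    return mean(predict_tree(t, x) for t in trees)

# Toy data: the output depends on input 0 only; input 2 is constant.
X = [[float(k), float((7 * k) % 5), 1.0] for k in range(40)]
y = [10.0 * k for k in range(40)]
trees, relevance = extra_trees(X, y, M=20, K=3, n_min=5)
ranking = sorted(range(3), key=lambda i: relevance[i], reverse=True)
print(ranking, [round(r, 3) for r in relevance])
```

In this toy example the informative input is ranked first and the constant input receives a null score, because it is never eligible for a split. Predictions for inputs far outside the training range remain within the range of the training outputs, as expected from leaf averaging.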

Table 3.
Descriptive statistics of the output variable for the Marina catchment and Canning River datasets.

Table 4.
Input ranking results for the Marina catchment dataset (average over 10 data groups).

Table 5.
Input ranking results for the Canning River dataset (average over 10 data groups). The initial variance is 2958.69.

Table 6.
k-fold cross-validation (with k = 10) and testing results of Extra-Trees and benchmarking models for the Marina catchment dataset.

Table 7.
k-fold cross-validation (with k = 10) and testing results of Extra-Trees and benchmarking models for the Canning River dataset.

Table 8.
Comparison of k-fold cross-validation (with k = 10) and testing CPU time for Extra-Trees, M5, CART, ANNs and MLR for the Marina catchment and Canning River datasets. The estimates refer to a single data group (of the 10) composing each dataset.