Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology-Part 1: Concepts and methodology

A comprehensive data driven modeling experiment is presented in a two-part paper. In this first part, an extensive data-driven modeling experiment is proposed. The most important concerns regarding the way data driven modeling (DDM) techniques and data were handled, compared, and evaluated, and the basis on which findings and conclusions were drawn are discussed. A concise review of key articles that presented comparisons among various DDM techniques is presented. Six DDM techniques, namely, neural networks, genetic programming, evolutionary polynomial regression, support vector machines, M5 model trees, and K-nearest neighbors are proposed and explained. Multiple linear regression and na ı̈ve models are also suggested as baseline for comparison with the various techniques. Five datasets from Canada and Europe representing evapotranspiration, upper and lower layer soil moisture content, and rainfall-runoff process are described and proposed, in the second paper, for the modeling experiment. Twelve different realizations (groups) from each dataset are created by a procedure involving random sampling. Each group contains three subsets; training, cross-validation, and testing. Each modeling technique is proposed to be applied to each of the 12 groups of each dataset. This way, both prediction accuracy and uncertainty of the modeling techniques can be evaluated. The description of the datasets, the implementation of the modeling techniques, results and analysis, and the findings of the modeling experiment are deferred to the second part of this paper. Correspondence to: A. Elshorbagy (amin.elshorbagy@usask.ca)


Introduction
Data driven modeling (DDM) techniques have been in use for nearly two decades for hydrological modeling, prediction, and forecasting.Many articles reporting the application of various techniques to various hydrological case studies are available in literature.Yet, data driven techniques are still facing some classical opposition because of multiple reasons inherit in such techniques (e.g., lack of transparency and difficulty of reproducing the results).Hydroinformatics researchers started to identify problems of data driven modeling (Maier and Dandy, 2000;Elshorbagy and Parasuraman, 2008;Solomatine and Ostfeld, 2008) and tried to suggest some solutions or modeling guidelines and frameworks.Cherkassky et al. (2006) have listed the quality of the datasets, choosing robust learning methods that can handle heterogeneous data, and the need for uncertainty estimates associated with predictions as some of the main issues and challenges facing computational intelligence in earth sciences.
There is no doubt that more scientific rigour should have been maintained in the applications and use of data driven techniques in earth sciences.Bowden et al. (2005) explored different techniques for input determination for neural network models in water resources applications, showing a comparative performance of different methodologies for determining input variables.Abrahart et al. (2008) have used the example of neural network applications to highlight the shortcomings of the present approach, and how to build stronger foundations.Apparently, their argument can be easily generalized to apply to other data driven techniques.In fact, the modeling shortcomings and ambiguity inherit in A. Elshorbagy et al.: Experimental investigation of the predictive capabilities DDM techniques are less than the ones introduced by the way such techniques were presented in earth sciences literature.One of the fundamental means to assess a modeling technique is to evaluate it against other modeling techniques, whether conceptual or data driven ones.One can observe that in the literature of data driven hydrology, the modeling comparative studies are usually impaired due to the less-than-comprehensive approach adopted.With few exceptions, the following problems can be noticed: (i) Only one or two modeling techniques have been used at a time in a single study; (ii) if more techniques were employed, then only one or two datasets have been used for the applications.This leads to conclusions that are based on the unique characteristics of such dataset (Abrahart et al., 2008); (iii) Datasets were split into two subsets for training and testing, where the models were tested iteratively using the testing data subset.This means, in fact, that the testing data are used, at least implicitly, during training.In this case, the generalization ability of the developed model is questionable; and (iv) when datasets were correctly split into three subsets for training, cross-validation, and testing, only one random realization of the three subsets was used.Such use of a single realization of the dataset makes it difficult to assess the predictive uncertainty and the effect of the split approach on the adopted models.
The above-mentioned deficiencies, in addition to other requirements identified by Abrahart et al. (2008) including the need for testing the models over a range of conditions, the reasoning behind the data splitting, and the need for designing repeatable experiments and reproducible findings, are the motives behind this study.The aim of this study is to evaluate and test the predictive abilities of six DDM techniques on five different case studies of rainfall-runoff, evapotranspiration, and soil moisture content.Multiple random realizations of the three subsets of each dataset will be created and used with each and every modeling technique.The techniques will be evaluated against multiple linear regression models and, when applicable, naïve models, which assume no change compared to the long term trend or mean value.Both prediction accuracy and uncertainty will be evaluated.The prediction accuracy refers to the overall match between the measured and the predicted values, whereas the prediction uncertainty quantifies the variability and the distribution of the model residuals (errors) around the mean error.The authors intend to make all datasets used in this study available for all interested researchers to test the results and conduct further studies.The authors hope and aim that this study could serve as a benchmark study for assessing future proposed modeling, optimization, and input processing methods or techniques.
This study is presented in two companion papers.This first part consists of, after this introduction, a section that briefly summarizes some of the key comparative studies in hydrology literature, followed by a section explaining the study methodology and the experimental set up.The fourth section describes the modeling techniques adopted in this study as well as the implementation tools.The last section of this first part is a general summary.The second part (Elshorbagy et al., 2010) begins with a brief introduction section that is followed by a section contains a description of study sites, the collected data, and how five different case studies (datasets) representing various hydrological processes were created from three sites.This section also explains how the methodology was applied and how inputs for the various case studies were selected.The third section reports on the implementation details and parameter values, when applicable, of each modeling techniques for the various datasets.Results of the various techniques and analysis are presented in the fourth section.A general discussion and guidelines are presented in Sect.5, whereas the conclusions and findings of the entire study are presented in the last section.

Comparative hydrological modeling studies using data driven modeling techniques
The number of studies that reported some sort of comparison between various DDM techniques in hydrology is very large, and it is beyond the possibility of being summarized here (for presentation of some of the latest advances see, for example, the volume edited by Abrahart et al., 2008).However, some key and representative studies are presented here.Solomatine and Siek (2006) presented an algorithm, which facilitates incorporation of domain knowledge into one particular type of modular model (model tree).They employed the M5flex algorithm to two hourly and daily rainfall-runoff datasets as well as five widely used benchmark datasets-Autompg, Bodyfat, CPU, Friedman, and Housing (Blake and Mertz, 1998).They compared the M5flex method with global artificial neural networks (ANNs) and other local M5 model tree modeling methods (M5, M5opt).They concluded that M5flex delivered high performance because of the use of additional domain knowledge for determining the best split attributes and values.Solomatine and Xue (2004) showed that both M5 model tree technique and ANNs perform similarly for flood forecasting problems in the upper reach of the Huai River in China, but the model trees have certain advantages in terms of transparency in the model structure over ANNs.Sivapragasm et al. (2007) found that there is no significant difference in the prediction accuracy between GP and ANNs for forecast of daily flows, but genetic programming (GP) has an advantage of identifying the optimum inputs.Makkeasorn et al. (2008) compared GP and ANN models for forecasting river discharges.The findings indicated that GP-derived streamflow forecasting models were generally favored for forecasting over ANNs.Further, the most forward looking GP-derived models can even perform a 30-day streamflow forecast ahead of time with a reasonable estimation accuracty.Jayawardena et al. (2005) compared the Hydrol.Earth Syst.Sci., 14, 1931Sci., 14, -1941Sci., 14, , 2010 www.hydrol-earth-syst-sci.net/14/1931/2010/ GP technique in modeling rainfall-runoff process to more conventional modelling approaches.They used the GP technique to predict the runoff from three catchments in Hong Kong and two catchments in southern China, and showed that the GP technique evolved simple models that enabled the quantification of the significance of different input variables for prediction.Parasuraman et al. (2007) used two hourly evapotranspiration (ET) datasets to compare between GP and ANNs for prediction of ET.Not much difference was found, with regard to the prediction accuracy, between the two techniques.Wu et al. (2007) applied a modular support vector machine (SVM) model, termed distributed SVR (D-SVR), with a two step Genetic Algorithm parameter optimization method, to carry out prediction of water level in a river.The D-SVR method disaggregates the original training set into a couple of subsets, and then generates a local SVR for each subset independently.Wu et al. (2007) evaluated the performance of D-SVR against the predictions from linear regression (LR), nearest neighbor (NN) method, and genetic algorithm-based ANN (ANN-GA) methods.The proposed D-SVR model can predict the water level better when compared with the other models.However LR model performed better in comparison with NN, ANN-GA models, which was attributed to the linear mapping relation between input and output variables that restricts the power of NN and ANN.In their study, Lin et al. ( 2006) employed an SVM model to predict long-term flow discharges in Manwan Hydropower scheme in Tibet.It was found through comparison of results with ARMA and ANN models that the SVM model can provide more accurate predictions of long term flow discharges.Further, Lin et al. (2006) concluded that SVM has its distinct capabilities and advantages, such as the error tolerance, in identifying hydrological time series comprising nonlinear characteristics.In their study, C ¸imen (2008) applied SVMs for the estimation of suspended sediment concentration/load.The observed streamflow and suspended sediment data of two rivers in the USA, which have been already used in earlier studies using ANNs, were considered.It was found that the negative sedimentation estimates, which were encountered using ANNs, did not happen during the application of SVMs.Khan and Coulibaly (2006) examined the application of the SVM and successfully demonstrated the ability of SVMs to predict the mean monthly lake water level up to 12 months ahead.SVM was found to be more advantageous than ANNs, which prescribes more number of controlling parameters.Khan and Coulibaly (2006) deduced that SVM proved to be more competitive and promising compared to the widely used ANNs and conventional seasonal multiplicative autoregressive (SAR) models.Behzad et al. (2008) compared SVM with ANN and ANN-GA models for prediction of daily runoff of Bakhtiyari River watershed in Iran.They considered available climate information as model inputs.They concluded that the prediction accuracy of SVM was at least as good as that of ANN and ANN-GA models in some cases, and better in some other cases.Furthermore, Behzad et al. (2008) found that SVM converges considerably faster compared to other models.Wu et al. (2008) demonstrated the feasibility of SVM for forecasting of soil water content in Purple hilly area located in Southwest University in Chongqing.They compared the predictions from SVM with ANNs, and showed that the results from the SVM predictor significantly outperformed other baseline predictors such as ANNs.Giustolisi and Savic (2006) found that Evolutionary Polynomial Regression (EPR) was more accurate than GP for extracting a symbolic expression for Chezy resistance coefficient.Elshorbagy and El-Baroudy (2009) differentiated between equation-based GP and program-based GP.They further compared GP with EPR technique using a highly nonlinear dataset (soil moisture content).It was found that program-based GP outperformed EPR in its prediction accuracy.More importantly, Elshorbagy and El-Baroudy (2009) demonstrated the need for adopting multiple data driven modeling techniques and tools (modeling environments) to obtain reliable predictions.This brief literature review shows that findings and conclusions were sometimes seemingly contradictory with regard to the superiority of one technique over the other.Apparently such findings should be viewed as data-specific, and thus, lack generality and strong support for cause-effect relationships.

Methodology and experimental setup
In order to achieve the objectives of this paper with regard to the comparative predictive performance of various DDM techniques, first, a set of distinctive modeling techniques were identified.The selected techniques are (i) artificial neural networks (ANNs); (ii) genetic programming (GP); (iii) evolutionary polynomial regression (EPR); (iv) support vector machines (SVM); (v) M5 model trees; and (vi) K-nearest neighbors (K-nn).To facilitate the comparison and allow for performance evaluation in light of easily understandable and widely recognized techniques, multiple linear regression (MLR) models and/or naïve models were employed as base line references.
Second, five different case studies representing different hydrological processes or variables (actual evapotranspiration, soil moisture content, and rainfall-runoff) were selected.The datasets present a wide range of challenges to data driven techniques because of their various levels of complexity, embedded feedback mechanism (such as the evapotranspiration process), and nonlinearity.The datasets will be explained in more details in a later section of this paper.Third, for each dataset, model inputs were either identified in this research or were pre-selected based on previous studies.Even though appropriate model inputs were secured for this study, the identification of the optimum inputs was not given an extraordinary emphasis since the focus of this research is inter-technique comparison.As long as the inputs are the same for the various modeling techniques, an unbiased analysis can be conducted toward achieving the objectives of this study.
Fourth, split samples from each dataset were prepared for the modeling experiment.Each set of the five datasets was randomly sampled 100 times without replacement, such that every time the dataset is split into three distinct subsets: training, which contains one half of the total data instances; crossvalidation, which contains one sixth of the data instances, and testing, which contains one third of the data instances.Twelve different groups (three subsets each) out of the 100 groups were selected based on the statistical properties of the output variable (e.g., runoff).The aim was to select the samples where the mean and the standard deviation values of the three subsets (training, cross validation, and testing) are similar or, at least, the differences are minimal.The crossvalidation subset was used for stopping the model training and selecting the best model, whereas the testing subset was kept completely unseen during the training process.Twelve different models were developed based on the 12 data groups (the best model based on cross-validation was picked every time), and each model was tested using the corresponding testing subset.These procedures were repeated using the six different data driven modeling techniques, applied to each of the five different datasets.The results of this experiment allows for investigating ensemble outputs from each modeling techniques, average and range of possible prediction accuracy, and predictive uncertainty.Minimizing the squared error was the error function used with all adopted techniques.
Fifth, the predictive accuracy of the various models and techniques were evaluated using the root mean squared error (RMSE), the mean absolute relative error (MARE), the mean bias (MB), and the correlation coefficient (R).The authors believe that these four error statistics, along with the visual comparison between observed and predicted values, are sufficient to reveal any significant differences among the various modeling techniques with regard to their prediction accuracy.The formulae of the error measures are presented in Eqs.(1-4) below.
Where N represents the number of instances presented to the model; O i and P i represent observed and predicted counterparts; and O and P represent the mean of the corresponding variables.However, sometimes conflicting results regarding the performance of various models may arise due to the use of various error measures (Dawson et al., 2007;Elshorbagy et al., 2000).In this study, a supplemental error measure that combines the effects of the four error measures in one indicator is proposed.The new indicator, called the ideal point error (IPE) is based on identifying the ideal point in the four dimensional space that each model aims to reach.The coordinates of the ideal point should be: (RMSE = 0.0; MARE = 0.0; MB = 0.0; R = 1.0).The IPE (Eq.5) measures how far the model is from the ideal point.All individual error measures are given equal relative weights, and all are normalized using the maximum error so that the final IPE value for each model ranges between 0.0 for the best model and 1.0 for the worst model.
Where i denotes model (i) and j denotes technique (j ).
Sixth, the predictive uncertainty of the models was assessed using the model residuals, the difference between the observed and the predicted values.For each dataset and each modeling technique, the residuals are computed for all 12 models representing the range of possible residuals.The residuals of the 12 models are merged in one set of presumably random variable, and a probability distribution was fit to this variable.
Seventh, the gamma test was conducted to assist in gaining some insight into the predictability of the output variables using nonlinear smooth functions, and possibly some leads into the process of selecting appropriate modeling techniques for a particular case study.The main idea of the gamma test (test) is estimating the variance of the noise on the output variable, which could be an estimate of the best mean squared error that a smooth model can achieve for the corresponding output.The test was implemented using winGamma (Jones et al., 2001) that assumes that non-determinism in a smooth model from inputs to outputs is due to the presence of statistical noise on the outputs: Where f is a smooth function and is noise, and that the variance of the noise Var( ) is bounded.The -test is based on et al., 1997).Delta (δ) and γ functions can be defined as follows: Hydrol.Earth Syst.Sci., 14,[1931][1932][1933][1934][1935][1936][1937][1938][1939][1940][1941]2010 www.hydrol-earth-syst-sci.net/14/1931/2010/ Where p is a pre-selected value, and y L(i,k) is the corresponding output value for the k nearest neighbors of X i in Eq. ( 7) (Stefánsson et al., 1997).A least squares regression line can be constructed for the p points (δ N (k), γ N (k)) where can be computed: The intercept on the vertical axis is the value (Jones et al., 2001).As δ N (k) approaches zero, γ N (k) approaches Var( ) in probability.In addition to , three other useful statistics can be calculated: (i) the gradient A, which is the slope of the regression line that indicates the complexity of the system (steeper gradient indicates greater complexity) (Evans and Jones, 2002), (ii) the V-ratio, which is a scale invariant noise estimate where is divided by the variance of the output variable.A V-ratio close to zero indicates high degree of predictability of the output variable, and (iii) the M-test, which is the size of data that is possibly required to produce a stable asymptote of .The value might be estimated for scaled (normalized) or unscaled dataset, but the gradient will be more informative if estimated based on scaled dataset.In general, if the inputs have inconsistent units, it is advisable to conduct the -test using the scaled data (Jones et al., 2001).

Artificial neural networks (ANNs)
ANN is a method of computation and information processing motivated by the functional units of the human brain, namely neurons.Since abundant information on ANNs is available in literature (e.g., Haykin, 1999; ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000), the description of ANNs herein is brief, and limited to the needs of this study.According to Haykin (1999), a neural network is a massively parallel distributed information processing system that is capable of storing the experiential knowledge gained by the process of learning, and of making it available for future use.Mathematically, ANNs are universal approximators with an ability to solve large-scale complex problems such as time series forecasting, pattern recognition, nonlinear modeling, classification, and control.This is achieved by identifying the relationships among given patterns.
Feedforward neural networks (FFNNs) are the most widely adopted network architecture for the prediction and forecasting of hydrological variables (Minns and Hall, 1996;Maier and Dandy, 2000;Dibike and Solomatine, 2001).Typically, FFNNs consist of three layers: input layer, hidden layer, and output layer.The number of nodes in the input layer corresponds to the number of inputs considered for modeling the output.The input layer is connected to the hidden layer with weights that determine the strength of the connections.The number of nodes in the hidden layer(s) indicates the complexity of the problem being modeled.The hidden layer nodes come with an activation function, which helps in nonlinearly transforming the inputs into an alternative space where the training samples are linearly separable (Brown and Harris, 1994).Detailed review of ANNs and their application in hydrology can be found in Maier and Dandy (2000) and in ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000).
The FFNNs adopted in this study make use of the tansigmoidal activation function in the hidden layer and the linear activation function in the output layer.While the tansigmoidal activation function squashes the input between −1 and 1, the linear activation function calculates the neurons output by simply returning the value passed to it.One of the important issues in the development of neural networks model is the determination of optimal number of hidden neurons that can satisfactorily capture the nonlinear relationship existing between the input variables and the output.The number of neurons in the hidden layer is usually determined by trial-and-error method with the objective of minimizing the cost function (typically, the error on crossvalidation dataset) (ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000).Levenberg-Marquardt back propagation algorithm is used for training the FFNNs in this study.

Genetic programming (GP)
Genetic Programming (GP), introduced by Koza (1992), is an evolutionary algorithm based on the concepts of natural selection and genetics.GP extends the search of genetic algorithms for optimal set of parameters to include the model space, so that both the model structure and the associated model parameters can be optimized simultaneously.Genetic symbolic regression (GSR) is a special application of GP in the area of symbolic regression, where the objective is to find a mathematical expression in symbolic form, which provides an optimal fit between a finite sample of values of the independent variable and its associated values of the dependent variable (Koza, 1992).GSR can be considered as an extension of numerical regression problems, where the objective is to find the set of numerical coefficients that best fits a predefined model structure (linear, quadratic, or polynomial).Nevertheless, GSR does not require the functional form to be defined a priori, as GSR involves finding the optimal mathematical expression in symbolic form (both the discovery of the correct functional form and the appropriate numerical coefficients) that defines the predictand-predictor relationship.GSR is sometimes referred to as equation-based GP.Another form of GP is program-based GP, where the explicit equation may not be necessarily produced, but rather a program (code) is the final output.Elshorbagy and El-Baroudy (2009)  that program-based GP can be more effective than equationbased GP with regard to its prediction accuracy.GPLAB (Silva, 2005), a GP toolbox for MATLAB that provides the evolved equation in the form of a parse tree is an example of an equation-based GP tool, whereas Discipulus (Francone, 2001), used in this study, is an example of a program-based GP tool.Genetic Programming (GP) is a widely used machine learning (ML) technique; it uses a tree-like structure, as decision trees, to represent its concepts and its interpreter as a computer program.Therefore, some authors even considered it to be a superset of all other ML representations; this may enable GP to produce any solution that is produced by any other ML system (Banzhaf et al., 1998).It uses different genetic operators such as crossover and mutation, together with beam search to reach candidate solutions from the overall population of solutions.Although GP is computationally intensive, like most machine learning techniques, it has its own limitations.The major problem is the deterioration of the prediction ability of the developed model with longer prediction horizon, which is a common problem in any modeling method.The adverse consequences of this problem can be mitigated by combining GP technique with knowledgebased techniques that depend on the accumulated knowledge of the process under consideration.This will enhance the quality of the developed models and add to the understanding of the complicated hydrological processes (Babovic and Keijzer, 2002).
Several applications of the GP technique in hydrology exist in the literature.Parasuranam et al. (2007a) explored the utility of GP to develop explicit models for eddy covariance-measured actual evapotranspiration.Babovic and Keijzer (2002) addressed the utility of GP in developing rainfall-runoff models on the basis of hydro-meteorological data, as well as in combination with other conventional models, i.e. conceptual models.It was reported that the GP models gave more insights into the functional relationships between different input variables resulting in more robust models.Parasuraman et al. (2007b) used GP to evolve pedotransfer functions (PTFs) for estimating the saturated hydraulic conductivity (K s ) from soil texture (sand, silt, and clay) and the bulk density.Similarly, Jayawardena et al. (2005) compared the GP technique in modeling rainfall-runoff process to the traditional modeling approaches.They used the GP technique to predict the runoff from three catchments in Hong Kong and two catchments in southern China, and showed that the GP technique evolved simple models that enabled the quantification of the significance of different input variables for prediction.In literature, there was an emphasis on GP's ability to produce explicit equations, but in this research program-based GP is employed to utilize the full predictive ability of the technique.
For GP implementation, the first step is to define the functional and terminal sets, along with the objective function and the genetic operators.The functional set and the termi-nal set are the main building blocks of GP, and hence, their appropriate identification is central in developing a robust GP model.The functional set consists of basic mathematical operators {+, −, *, /, sin, exp, . . .} that may be used to form the model.The choice of the operators considered in the functional set depends upon the degree of complexity of the problem to be modeled.The terminal set consists of independent variables and constants.The constants can either be physical constants (e.g.Earth's gravitational acceleration, specific gravity of fluid) or randomly generated constants.Different combinations of functional and terminal sets are used to construct a population of mathematical models (or programs).Each model (individual) in the population can be considered as a potential solution to the problem.Genetic operators include crossover and mutation, and they are discussed in detail later in this section.Once the functional and terminal sets are defined, the next step is to generate the initial population for a given population size.The initial population can be generated in a multitude of ways, including, the full method, grow method, and ramped half-and-half method.In the full method, the new trees are generated by assigning non-terminal nodes until a pre-specified initial maximum tree depth is reached, and the last depth level is limited to the terminal nodes.In the grow method, each new node is randomly chosen between the terminals and the non-terminals, with the terminals making up the nodes at the initial maximum tree depth.The ramped half-and-half method is a combination of the full and the grow methods.For each depth level considered, half of the individuals are initialized using the full method and the other half using the grow method.The ramped half-and-half method is shown to produce highly diverse trees, both in terms of size and shape (Koza, 1992), and thereby provides a good coverage of the search space.More information on the different methods of generating the initial population can be found in Koza (1992).Once initialized, the fitness of each individual (mathematical model) in the population is evaluated based on the selected objective function.The better the fitness of an individual, the greater is the chance of the individual breeding into the next generation.In this study, root mean squared error is used as the objective function, and a lower value of RMSE indicates better fitness.At each generation, new sets of models are evolved by applying the genetic operators: crossover and mutation (Koza, 1992;Babovic and Keijzer, 2000).These new models are termed offspring, and they form the basis for the next generation.
After the fitness of the individual models in the population is evaluated, the next step is to carry out selection.The objective of the selection process is to create a temporary population called the mating pool, which can be acted upon by genetic operators: crossover and mutation.Selection can be carried out by several methods like truncation selection, tournament selection, and roulette wheel selection.As roulette wheel selection is one of the most commonly used methods including Koza (1992), it has been adopted in this study.
Hydrol.Earth Syst.Sci., 14, 1931Sci., 14, -1941Sci., 14, , 2010 www.hydrol-earth-syst-sci.net/14/1931/2010/ Roulette wheel is constructed by proportioning the space in a roulette wheel based on the fitness of each model in the population.The selection process ensures that the models with better fitness have more chance of breeding into the next generation.Crossover is carried out by initially choosing two parent models from the mating pool, and selecting random crossover points for each of the parents.Based on the selected crossover points, the corresponding sub-tree structures are swapped between the parents to produce two different offspring with different characteristics.The number of models undergoing crossover depends upon the chosen probability of crossover (P c ). Mutation involves random alteration of the parse tree at the branch or node level.This alteration is done based on the probability of mutation (P m ).For an overview of different types of computational mutations, readers are referred to Babovic and Keijzer (2000).While the role of the crossover operator is to generate new models, which did not exist in the old population, the mutation operator guards the search against premature convergence by constantly introducing new genetic material into the population.

Evolutionary polynomial regression (EPR)
Evolutionary Polynomial Regression (EPR) is another data driven and machine learning technique that models time series or regression-type data containing information about physical processes (Giustolisi and Savic, 2006).EPR combines the power of evolutionary algorithms with numerical regression to develop polynomial models combining the independent variables together with the user-defined function as follows (Laucelli at al., 2005): where Ŷ is the EPR-estimated dependent variable, F (.) is the polynomial function constructed by EPR, X is the independent variables' matrix, f (.) is a user-defined function, a i is the coefficient of the i-th term in the polynomial, a o is the bias and m is the total number of the polynomial terms.Inclusion of the user-defined function is provided to enhance the characterization of the response (dependant) variable.As the developers of the EPR tool state "EPR is a two-stage technique for constructing symbolic models: (i) structure identification; and (ii) parameter estimation", where it uses genetic algorithm (GA) simple search method to search in the model structure space.EPR uses the least squares (LS) method to estimate the parameters of the selected model structure based on the performed GA search.Applications of EPR are found in Savic et al. (2006), Doglioni et al. (2008), Elshorbagy and el-Baroudy (2009), and Giustolisi et al. (2007).The search proceeds by using the standard GA operators, crossover and mutation; noting that this type of search is not exhaustive as it is practically impossible to conduct such search on an infinite search space (Laucelli et al., 2005).Even though EPR might be viewed as a subset of GP, its reported good performance while emphasizing the polynomial structure makes it a potential candidate for this study.This study makes use of the EPR toolbox (Laucelli et al., 2005), which is based on "homonymous modeling methodology based on a hybrid evolutionary paradigm".It is a multiobjective implementation of EPR in the sense that it produces several models, which are the best trade-off, considering fitness to training data vs. parsimony.The EPR tool performs three types of regression, i.e. dynamic, static, and classification.Dynamic modeling is used to model systems that have memory, or in other words, when the present state of the system depends on the previous states of other input variables.On the other hand, static systems are systems that are not influenced by the previous states of input variables.Classification modeling is a special type of static modeling in which the model output is an integer (Laucelli et al., 2005).The readers may refer to the user manual for the details of the EPR toolbox and the different components of its simple interface (Laucelli et al., 2005).

Support vector machine (SVM)
The foundation for the subject of Support Vector Machines has been largely developed by Vapnik in the 1960s and 1970s (Vapnik, 1998, see also Cherkassky and Mulier, 2007) and it is now gaining popularity due to many attractive features.Its formulation embodies the Structural Risk Minimisation (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimisation (ERM) principle, employed by many of the other modelling techniques.SRM minimises an upper bound on the expected risk, as opposed to ERM that minimises the error on the training data.It is this difference that is claimed to provide SVM with a greater ability to generalise, which is a principal goal in statistical learning.
SVM algorithm was first developed to solve the classification problem, but was extended to the domain of regression problems.In regression and time series prediction applications, excellent performances were obtained (Muller et al., 1997;Mattera and Haykin, 1999;Dibike et al., 2001).The goal of ε-SV regression (Vapnik, 1995) is to find a function f (x) that has at most ε deviation from the actually obtained targets y i for all the training data, and at the same time, is as flat as possible.In case of linear functions f , where •, • denotes the dot product in X. Flatness in this case means seeking small w, which can be ensured by minimizing the Euclidean norm, i.e., w 2 .Sometimes, it is not possible to approximate all pairs (x i ,y i ) with ε precision.So, it is possible to allow for some errors in the form of slack variables ζ i ,ζ * i .The problem can be written as a convex optimization problem: The constant C > 0 determines the tradeoff between the flatness of f and the amount up to which deviations larger than ε are tolerated.Fig. 1 presents an example of SV regression with the ε-tube in which errors are ignored -leading to better model generalization.
A Lagrange function from both the objective function and the corresponding constraints can be constructed by introducing a dual set of variables (Müller et al., 1997): where α i , α * i , η i , η * i ≥ 0. Finally, w can be written as follows: This is called Support Vector expansion, i.e. w can be completely described as a linear combination of the training patterns x i .The above discussion is based only on linear SVM regression.For nonlinear regression, the SVM has a great advantage that can represent the nonlinear function in an arbitrary number of dimensions efficiently through a defined Kernel.The idea is to map the training input vector x i into a higher dimensional space (called feature space) or hyperplane, by the function , while the regression for x remains linear.Thus, the procedure is the same as the linear SVM except changing the dot product x i ,x by (x i ), (x) .The Kernel function: K(x i ,x) = (x i ), (x) can assume any form.Many Kernels are being proposed by researchers; however, the most common ones are: Where γ , τ, and d are Kernel parameters.
In this study, the SVM implementation within WEKA 3.6.0Software (Bouckaert et al., 2008;Witten and Frank, 2005) has been used.

Model trees
Model trees (or M5 model trees) are a relatively new machine learning technique introduced by Quinlan (1992) who also suggested the algorithm that uses information theory to build them -the M5 algorithm.This is effectively a piece-wise linear regression model.A complex modelling problem can be solved by dividing it into a number of simple tasks and building simple models for each of them.
A model tree (MT) belongs to a class of modular models, which uses the "hard" (i.e.yes-no) splits of input space into regions progressively narrowing the regions of the input space.Thus model tree is a hierarchical (or tree-like) modular model that has splitting rules in non-terminal nodes and the expert models at the leaves of the tree.In M5 model trees, the expert models are simple linear regression equations derived by fitting to the non-intersecting data subsets.Once these models are formed recursively in the leaves of the hierarchical tree, then prediction with the new input vector consists of the two steps: (i) attributing the input vector to a particular subspace by following the tree; and (ii) running the corresponding model.A brief description of the model tree algorithm is presented below.
The first step in building a model tree is to determine which input variable is the best to split the training set.The splitting criterion (i.e.selection of the input variable and splitting value of the input variable) is based on treating the standard deviation of the target values that reach a node as a measure of the error at that node, and calculating the expected reduction in error as a result of testing each input variable at that node.The expected error reduction, which is called standard deviation reduction, SDR, is calculated by After examining all possible splits by exhaustive search, M5 chooses the one that maximizes SDR.The splitting of the training examples is done recursively to the subsets.The splitting process terminates when the target values of all the examples that reach a node vary only slightly, or only a few instances remain (this is a user-defined parameter).This division often produces over-elaborate structures leading to overfitting models.They can be pruned back, for instance by replacing a subtree with a single model in a leaf.Additionally, 'smoothing' may be also performed to compensate for the sharp discontinuities that will inevitably occur between the adjacent linear models at the leaves of the pruned tree.In smoothing, the outputs from adjacent linear equations are updated in such a way that their difference for the neighboring input vectors belonging to the different leaf models will be smaller.Details of the pruning and smoothing process can be found in Witten and Frank (2000).Figure 2 presents an example of model tree.
As compared to other machine learning techniques, model tree learns efficiently and can tackle tasks with very high dimensionality -up to hundreds of variables.The main advantage of model tree is that results are transparent and interpretable.During the last years several authors have shown the effectiveness of the M5 machine learning method in rainfall-runoff modelling (see, e.g., Solomatine and Dulal, 2003;Solomatine and Siek, 2006;Stravs and Brilly, 2007).

K-nearest neighbors
The K-nearest neighbors (K-nn) technique is one of the simplest forms of instance-based learning, which can be treated as plain memorization (Witten and Frank, 2005).Once a set of training instances has been memorized, one encoun-tering a new (testing) instance, the memory is searched for the training instance that most closely resembles the testing instance.Instead of creating rules (or continuous function approximation surface), K-nn technique works directly from the examples themselves.Each new instance is compared with existing ones using a distance metric, and the closest existing distance is used to assign the output to the new instance.Usually, more than one nearest neighbors is used.Standard Euclidean distance (or any other distance measure) is used as a metric to represent "resemblance".When multiple nearest neighbors are employed, the output of the testing instance can be based either on simple average, weighted average, or any more sophisticated function.In this study, the simplest method, which is the average value of the K-nearest neighbors, is used.An apparent drawback to instance-based representation is that it does not make explicit the structures that are learnt.Instances do not really describe the patterns in data.Karlsson and Yakowitz (1987); Parasuraman and Elshorbagy (2007); and Solomatine et al. (2008) presented some hydrological prediction case studies using K-nn technique.

Summary
Data driven modeling techniques, and in particular machine learning techniques, have addressed and solved many issues in hydrological modeling but also caused questions and concerns to be raised.The most important concerns are regarding the way DDM techniques are handled, compared, evaluated, and the basis on which findings and conclusions were drawn.The sub-optimal approach in designing modeling experiments, the use and the split of datasets, the exclusive use of techniques and case studies, and writing research articles from the standpoint of advocating certain techniques have contributed to the problem.In this first part of the twopart paper, a concise but comprehensive review of key articles that presented comparisons among various data driven A. Elshorbagy et al.: Experimental investigation of the predictive capabilities modeling techniques was summarized.It was concluded that findings were usually dataset-specific, to some extent contradictory, and thus, difficult to generalize.A comprehensive data driven modeling experiment was proposed and explained.Six data driven modeling techniques, namely, neural networks, genetic programming, evolutionary polynomial regression, support vector machines, M5 model trees, and Knearest neighbors were proposed and briefly explained.Multiple linear regression and naïve models were also suggested as baseline for comparison with the various techniques.
Five different case studies representing three different hydrological processes or variables (evapotranspiration, soil moisture, and rainfall-runoff) from Canada and Europe were proposed for the modeling experiment, and described in the second paper.The central step of the methodology is creating 12 different realizations (groups) from each dataset by random sampling.Each group contains three subsets; training, cross-validation, and testing.Each technique was proposed to be applied to each of the 12 groups of each dataset.This methodology was designed to evaluate both prediction accuracy and uncertainty of the various techniques on a wide range of possibilities that allow for comprehensive testing the modeling capabilities of these techniques.This study focused on regression-type problems.No consideration was given to time series or real time prediction cases.The second part (Elshorbagy et al., 2010) describes the datasets and addresses the application of the proposed methodology through the input selection and the implementation of the various techniques.Results, analysis, and discussion of the findings of this study are presented in the second paper as well.
Edited by: R. Merz
Illustration of splitting in a model tree.where, T represents set of examples that reach the splitting node, T 1 , T 2 ,. . ., represents the subset of T that results from splitting the node according to the chosen input variable, sd represents standard deviation, |T i |/|T | is the weight that represents the fraction of the examples belonging to subset T i .