Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology – Part 2 : Application

Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology – Part 2: Application A. Elshorbagy, G. Corzo, S. Srinivasulu, and D. P. Solomatine Centre for Advanced Numerical Simulation (CANSIM), Department of Civil & Geological Engineering, University of Saskatchewan, S7N 5A9 Saskatoon, SK, Canada Department of Hydroinformatics & Knowledge Management, UNESCO-IHE Institute for Water Education, Delft, The Netherlands Water Resources Section, Delft University of Technology, Delft, The Netherlands Received: 29 October 2009 – Accepted: 7 November 2009 – Published: 19 November 2009 Correspondence to: A. Elshorbagy (amin.elshorbagy@usask.ca) Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
The research methodology explained in the first part of this two-companion paper was implemented in the sequence presented earlier.First, inputs of the various models were identified.A mixed approach of input selection was adopted since identification of optimum inputs was not in itself one of the objectives of this study.The next section describes the five different datasets.The two soil moisture datasets (Elshorbagy and Parasuraman, 2008) and a reduced hourly version of the evapotranspiration (AET) dataset (Parasuraman and Elshorbagy, 2008;Parasuraman et al., 2007) were used in earlier studies.This study benefited from the input structure identified in the earlier studies, and sometimes (e.g., the case of the evapotranspiration dataset) enhanced the input structure by considering more inputs identified using the mutual information content.
A. Elshorbagy et al.: Experimental investigation of the predictive capabilities in this study.The SWSS is currently the largest operational tailings dam in the world, holding approximately 435 million cubic meters of material, covering 25 km 2 , and standing approximately 40 m high with a 20H:1V side-slope ratio.Soils consist of mine tailings sand overlain with 0.4 to 0.8 m of topsoil that is a mixture of peat and secondary mineral soil with a clay loam texture.Both vegetation species and composition vary across the SWSS, with dominant groundcover including horsetail (Equisetum arvense), fireweed (Epilobium angustifolia), sow thistle (Sonchus arvense), and white and yellow sweet clover (Melilotus alba, Melilotus officinalis).Tree and shrub species include Siberian larch (Larix siberica), hybrid poplar (Populus sp.hybrid), trembling aspen (Populus tremuloides), white spruce (Picea glauca), and willow (Salix sp.).For the SWSS facility, the ground-water table is located well below the rooting zone, at a depth between 0.8-1.0m, and hence does not directly contribute to the evapotranspiration process.Accurate estimation of actual evapotranspiration from the reconstructed watersheds is of vital importance as it plays a major role in the water-balance of the system, which links directly to ecosystem restoration strategies.The weather station located on top of the SWSS facility measured the air temperature (AT) ( • C), ground temperature (GT) ( • C), net radiation (NR) (W/m 2 ), relative humidity (RH), and wind speed (WS) (m/s).Turbulent fluxes of heat and water vapor were measured using a CSAT3 sonic anemometer and thermometer (Campbell Scientific) and an LI-7500 CO2/H2O gas analyzer (Li-Cor).Ground heat flux was measured using a cm3 radiation and energy balance (REBS) ground heat flux plate placed at 0.05 m depth.In EC technique, the covariance of vertical wind speed with temperature and water vapor is used to estimate the sensible heat (H) and latent heat (LE) fluxes (Parasuraman and Elshorbagy, 2008).More information on the EC technique can be found in Drexler et al. (2004).Raw turbulence measurements were made at 10 Hz and fluxes were calculated using 30-min block averages with a 2-D coordinate rotation.
The half hourly EC-measured LE flux (the product of the latent heat of vaporization and evapotranspiration) at the SWSS facility for two growing seasons (from 3 May to 21 Sep 2005 and from 27 May 9 Sep 2006) is considered in this study.The total precipitation during the two seasons is 275 mm and 265 mm, respectively and the average day-time reference evaporation rate is 0.27 mm/h.Nevertheless for modeling purposes, the day time (08:00 h -20:00 h) evapotranspiration alone is considered.After eliminating records of missing data, the remaining number of data instances were 5,307 data points.Since evapotranspiration is commonly perceived as being highly dependent on climatic variables, the EC-measured LE flux is modeled as a function of NR, AT, GT, RH, and WS, as well as possible combinations of these variables.The descriptive statistics of the datasets used for training, cross validation, and testing are presented in Table 1.The coefficient of variation (CV) of different variables during training, cross validation, and testing are comparable.The AET dataset, and each dataset, was randomly sampled 100 times; creating 100 realizations of the dataset with three split samples (training, cross-validation, and testing) created from every dataset realization.

Soil moisture content
Over the years, several large scale soil cover (reconstructed watersheds) experiments have been conducted to assess the performance of different reclamation strategies in northern Alberta, Canada, by studying the basic mechanisms that control the moisture movement within these covers.In particular, three experimental soil covers (D1, D2, and D3) were established in the year 1999.The experimental covers were constructed over the saline-sodic overburden with thickness of 0.50 m, 0.35 m, and 1.0 m, comprising a thin layer of peat mineral mix over varying thickness of secondary (glacial/till) soil.Cover D1 consists of 20 cm of peat overlying 30 cm of till, and it is considered for this study.The soil cover has an area of 1 ha (approximately 200 m long and 50 m wide), with a 5:1 slope (5 horizontal to 1 vertical).This reconstructed watershed, compared to natural watersheds, is not stable during its initial stages, and hence evolves over time to achieve hydro-sustainability.In order to track the evolution (hydrological changes) of the watershed, intensive instrumentations were installed in the watershed.Each watershed has an individual soil station located at the middle of the slope, which measures the volumetric soil moisture content of the upper peat (SMP) and the lower till (SMT) layers, twice a day.Soil moisture is measured using TDR principles with model CS615 (Boese, 2003).The TDR sensors were Hydrol.Earth Syst.Sci., 14, 1943Sci., 14, -1961Sci., 14, , 2010 www.hydrol-earth-syst-sci.net/14/1943/2010/ installed laterally into the soil profile.Watershed D1 has eight TDR sensors installed over a depth range of 0.05 m to 1.00 m.Hourly values of soil temperature of peat (STP) and till (STT) layers are measured using thermisters buried in the watershed at the depth ranges corresponding to the TDR sensors.Consequently, D1 has eight soil temperature sensors.A weather station located in the mid-slope measures air temperature (AT), and precipitation (P ).Similarly, Bowen station located at the mid-slope measures net-radiation (NR) and energy fluxes.All the meteorological variables are measured in an hourly scale.More details on the field instrumentation program and the data collected can be found in Boese (2003) and Elshorbagy et al. (2007).
Average daily values of precipitation, air temperature, soil temperature (STP and STT), net radiation (NR), soil moisture (SMP and SMT) as well as possible combinations of them, are considered for modeling purposes.The ground temperature and soil moisture contents are depth averaged for each layer (upper peat and lower till).As the soil stratum is frozen during the winter, only summer (May-September) time data of years 2000 till 2006 are considered.The total number of instances available for modeling purposes was 972 data points.As the reconstructed watersheds evolve over time to achieve hydro-sustainability, the freeze-thaw cycles and decomposition of highly organic peat layer increases the porosity of the soil and consequently increase infiltration rates (Haigh, 2000).Hence, modeling the moisture dynamics of such evolving watersheds would be adding to the already challenging task of modeling soil moisture.The descriptive statistics of the datasets used for training, cross validation, and testing are presented in Table 2 for the peat and the till layer datasets, respectively.For modeling purposes, two datasets were generated from the site; one for predicting SMP and the other for SMT.The same set of inputs was used in both datasets.The coefficient of variation (CV) of different variables during training, cross validation, and testing are comparable (Table 2).

Rainfall-runoff
The rainfall-runoff dataset used in this study is taken from the Ourthe subcatchment, which is a subcatchment of River Meuse The river basin covers part of France, Belgium and The Netherlands (Fig. 1).The area analyzed in this research is approximately 22 000 km 2 , from Borgharen-dorp (Jeker basin on the Netherlands border) to Meuse source-St Mihiel (Lorraine basin in France).This meso-scale catchment system has been widely explored with data driven and expert knowledge (de Wit, 2001;Tu et al., 2005).
The greater part of the discharge of the River Meuse is supplied by its tributaries.Groundwater, precipitation and artificial extractions constitute the discharge to a smaller extent (de Wit, 2001).The Meuse has a great number of tributaries, varying greatly in their sizes.The largest is the Ourthe, with a contributing area of 3.626 km 2 .The Ourthe subcatchment The average travel time from upstream to downstream is one day (Berger, 1992).More information about the hydrological properties of the basin and the data validation are referred to Berger (1992) andDe Wit (2007).The daily rainfall and runoff data of the Ourthe subcatchment from 11 January, 1988 till 31 December 1998 (4.008 data points) were used for modeling purposes in this study.Two distinct datasets were created: (i) the first is a dataset where only rainfall data were used as model inputs to predict the runoff; and (ii) the second is the same dataset but the preceding time step (t-1) runoff, in addition to the rainfall data, were used as inputs to predict the runoff at time t.The descriptive statistics of the variables that are used as model outputs in this study are presented in Table 3.
Figure 2 presents the inputs identified for the AET case study using AMI method.For the two rainfall-runoff datasets, the Average Mutual Information (AMI) method was used to identify the inputs for predicting the daily runoff (Fig. 3).The inputs-output of the five case studies are presented in Table 4.One should note that in light of the focus of this study, which is the comparative analysis of various data driven techniques, the important criterion is to use the same set of inputs across all adopted models.After inputs have been identified, each dataset was randomly sampled 100 times; creating 100 realizations of the dataset with three split samples (one half for training, one sixth for cross-validation, and one third for testing) created from every dataset realization.Figure 4 shows an example of this process for the peat moisture dataset.Similar process was conducted with each one of the five case studies.Based on the similarity of the statistical properties (mean and standard deviation) of the three split samples, the best 12 realizations of each dataset are identified for the modeling exercise in this study.
3 Model implementation

Artificial neural networks (ANNs)
The Levenberg-Marquardt algorithm was used for training all neural network models using the MATLAB Neural Networks toolbox.For each of the 12 dataset realizations of a case study, the ANN was executed 200 times with 200 different random weight initializations.The best model of the 200 runs was identified as the best ANN model.The cross validation sub dataset was used to stop the training process.Accordingly, 12 ANN models were developed and tested using the corresponding unseen dataset.In all optimum ANN models, the number of input nodes was equivalent to the number of inputs, and all networks had one output node.The number of hidden nodes ranged from three to 13, with an average number of seven hidden nodes in single hidden layer ANNs.

Genetic programming (GP)
Discipulus Software (Francone, 2001) was used to implement the program-based GP to all datasets.GP was applied to the various dataset realizations similar to the way followed with ANNs.The addition, subtraction, multiplication, comparison, conditions, division, and trigonometric operators were allowed.The program size varied from 80-512 bits, with population size of 500 and generations without improvement up to 300.The probabilities of mutation and crossover were 30% and 50%, respectively.The program was allowed to run for at least two hours.The authors experimented with the run time and observed that improvement could be almost negligible beyond two hours.Similar to the case of ANN applications, 12 non-dominated GP models were developed and tested on the corresponding testing set of each case study.

Evolutionary polynomial regression (EPR)
The EPR Toolbox (Laucelli et al., 2005) was used to implement the static EPR technique to all datasets, following the same experimental steps adopted with the ANNs and the GP techniques.The EPR Toolbox allows for many choices in terms of the polynomial types, functions used within the polynomial terms, and the number of terms and exponents.
In this study, the default number of terms (up to five) was used whereas a comprehensive search among the possible combinations of polynomial types and functions was conducted.Accordingly, 12 non-dominated EPR models were developed and tested on the corresponding testing set of each case study.The EPR type and function developed for each case study are presented in Table 5.

Support vector machine (SVM)
WEKA 3.6.0Software (Bouckaert et al., 2008) was used in this study to implement the SVM to all datasets, following the same experimental steps adopted with the previous techniques.SVM models with linear, polynomial, and radial basis function (RBF) kernels were tested on all datasets.With the exception of the Rainfall -runoff II case study, the RBF kernel was found to provide the best predictive performance.
In case of the rainfall-runoff II case study, both linear and RBF kernels were almost on par.Therefore, SVM with RBF kernel was adopted in this study.The constant C (Elshorbagy et al., 2010, Part 1) and the kernel parameter γ were optimized from an exponential range of the following values: 0.0313; 0.0625; 0.125; 0.25; 0.50; 1.00; 2.00; 4.00; 8.00; and 16.00.Non-dominated 12 SVM models were developed and tested on the corresponding testing set of each case study.

M5 model trees
WEKA 3.6.0Software (Bouckaert et al., 2008) was used in this study to implement the M5 model trees to all datasets, following the same experimental steps adopted with the previous techniques.The tree pruning coefficient was optimized during the execution of the models to minimize the average squared error.A range of values from 3-30 was tested in this study.12 M5 model tree models were developed and tested on the corresponding testing set of each case study.

K-nearest neighbors (K-nn)
WEKA 3.6.0Software (Bouckaert et al., 2008) was used in this study to implement the K-nn technique to all datasets, following the same experimental steps adopted with the previous techniques.The number of the nearest neighbors was optimized during the execution of the models to minimize the average squared error.A range of values from 1-50 neighbors was tested in this study.Accordingly, 12 K-nn models were developed and tested on the corresponding testing set of each case study.The ranges of the optimum numbers of nearest neighbors for each case study are presented in Table 6.

Evapotranspiration case study
The performance of the various techniques applied to the half-hourly actual evapotranspiration (AET) case study is provided in Table 7.The best, the worst, and the average of the performances of the 12 models of all techniques are shown.It is certainly useful to judge techniques based on the range of performances (difference between the best and the worst models), however, if a single value is needed, then one has to rely on the average performance.Table 7 supports the idea that in most cases, it is not possible to find a technique that dominates others with respect to all error measures.But if a technique is better than the rest with respect to two different error measures (e.g., RMSE and R), this can be a strong indication of the superiority of such a technique.In the AET case study, GP, SVM, M5 model trees, and K-nn techniques can be identified as the best techniques, followed by EPR, in terms of the predictive accuracy.The performance of the ANNs was worse than the linear regression (MLR) technique in this particular case study.This highlights the important fact that the half-hourly AET data were captured reasonably well in a linear relationship considering the provided model inputs.Therefore, a technique that forces nonlinear structures on the input-output relationship (ANNs) may not be favorable in all cases.Certainly, the AET data are not strictly linear; that is why local and/or modular linear models (e.g., M5 and K-nn) could be optimum choices.
Since all 12 models of each technique are non-dominated models and represent possible performances of the technique under consideration, the output of all 12 models are integrated in one set and presented in Fig. 5.The figure shows the scatter plots of observed vs. predicted AET data.The scatter around the 45-degree line supports the conclusion made earlier regarding the performances of the various techniques.However, the plots allow to make two additional observations; first, all techniques were less successful in predicting high values.The tips of the data plumes were always below the 45-degree line.This might be an indication that the ideal inputs that can describe all dynamics of the process for this case study have not been optimally identified.The SVM (Fig. 5d) was more successful than other techniques in approaching the high values.The M5 model trees and MLR (Fig. 5e and g) were the least successful in this regard.Table 8 shows the ideal point error (IPE) measure calculated for all techniques.The IPE statistic, integrating all four error measures in one indicator, lends another support to the conclusions made earlier.Except the ANNs, all other techniques have close performances, with the possibility of identifying the SVM, GP, M5, and K-nn; followed by EPR as better techniques than the rest.The utility of the idea of adopting multiple models (12 in this study) based on different random realizations of the datasets to evaluate    8).
Based on the outputs of the 12 non-dominated models of each technique, the predictive uncertainty of the various techniques can be easily analyzed.The residuals (predicted value minus observed value) of the 12 models were integrated in one set to conduct probabilistic analysis.Frequency curves were constructed for the residuals of each technique.@RISK Software (Palisade Corporation, 2005) was used to fit the best probability distribution from a selection of more than 15 possible distributions.The best-found probability distributions of the residuals of the various techniques are shown in Fig. 6.The Logistic (α, β) distribution was found to fit the residuals of all modeling techniques, with different values of location parameter α and scale parameter β.Ideally, the best technique is the one that has residuals represented by the narrowest, symmetrical, and tallest (has the highest probability value at zero residuals) probability distribution.Such a distribution implies the smallest level of predictive uncertainty, which could be translated to the highest level of reliability.Figure 6 reveals that, not only in terms of the predictive accuracy, but also the predictive uncertainty SVM is the best, flollowed by GP, K-nn, M5 and EPR.Clearly, the ANN technique leads to the most uncertain results with the widest range of residuals, whereas the MLR is occupying the middle position.
The Kolmogorov-Smirnov (KS) nonparametric test was conducted on the model residuals of all techniques to test the null hypothesis that the model residuals of any two techniques are sampled from the same distribution.The test was conducted at the default significance level of p = 0.05.The matrix of the p-values is given as Table 9.With the exception of K-nn and M5 techniques, there is strong statistical evidence that the residuals of the various techniques are stemming from different distributions.Even though the visual assessment of Fig. 6 shows that the SVM, M5, and K-nn are very similar, the KS test indicates that the SVM performs differently.

Peat (upper layer) soil moisture case study
The performance of the various techniques applied to the daily soil moisture data of the upper peat layer (SMP) case study is provided in Table 10.Unlike the evapotranspiration case study, Table 10 shows that both ANNs and GP techniques can be considered superior to other modeling techniques due to their domination with respect to the four error measures.It has to be noted that in case of soil moisture content, low values of the RMSE and the MARE might be misleading because the entire dataset is limited to a narrow Hydrol.Earth Syst.Sci., 14,[1943][1944][1945][1946][1947][1948][1949][1950][1951][1952][1953][1954][1955][1956][1957][1958][1959][1960][1961]2010 www.hydrol-earth-syst-sci.net/14/1943/2010/  range (0.30-0.55) of values (Table 3).In this case, the R statistic becomes the most important indicator (Elshorbagy and Parasuraman, 2008).For example, if an average-all model is constructed just by assuming that the best predictor is the average soil moisture value of all observations in the training dataset, the predicted value will be always 0.442.In this case, the RMSE and the MARE values are 0.05 and 0.10, respectively, but the R statistic value is almost zero; indicating an extremely poor model.Accordingly, ANNs and ANNs GP EPR SVM M5 K-nn MLR ANNs 1 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 GP 1 0.0001 0.0000 0.0000 0.0000 0.0000 EPR 1 0.0000 0.0000 0.0000 0.0000 SVM 1 0.0000 0.0000 0.0000 M5 1 0.051 0.0000 K-nn 1 0.0000 MLR 1 GP are the best modeling techniques for this case study (producing the R values of 0.60 and 0.61, respectively), followed by the K-nn and the M5 techniques.The MLR is clearly inferior to other techniques, which points to the possibility that the SMP dataset is a highly nonlinear dataset.The authors believe that this is a major reason for the relative success of ANNs in this case study compared to the previous (AET) case study.The moisture storage effect (Elshorbagy and El-Baroudy, 2009;Elshorbagy and Parasuraman, 2008) attributes to the nonlinearity of the process.Techniques that can handle highly nonlinear data (ANNs and GP) were quite successful, followed closely by local/modular models (M5 and K-nn).Even though the EPR technique was relatively close to the K-nn and M5, the performance of the SVM technique was the poorest with an R value of 0.44; slightly higher than the MLR.The scatter plots (Fig. 7) show clearly that the error measures, including the IPE (Table 8), reflect only the average overall performance of the models, and favored models that produce scatter with less dispersion (e.g., GP and EPR).However, the plots reveal that ANNs outperforms other techniques where, at least, the trend of the higher range of peat moisture values was captured better than the other techniques could do.Similar to the AET case study, frequency curves were constructed for the residuals of each technique (Fig. 8).Interestingly, the best-found probability distributions of the residuals of the various techniques differed.The LogLogistic    (γ , β, α) probability distribution was found to fit the residuals of SVM and K-nn, and M5 modeling techniques, Logistic (α, β) for ANNs, Lognormal (µ, σ ) for GP, Beta (α1, α2) for EPR and MLR techniques.This reflects the fact that the adopted modeling techniques are different in the way that they predict the output and minimize the errors, even if their average overall error values are close.The frequency curves reflect the considerable outperformance of the ANNs, K-nn, M5, and SVM over other more uncertain and biased techniques, such as MLR and the EPR techniques.An important observation here is the lower uncertainty of the SVM technique.The small uncertainty of the SVM technique reflected by the probability distribution is affected by the narrow range of residuals and small overall RMSE, however, the SVM models are poor in capturing the trend of the SMP data -this is indicated by the lower R value.
The Kolmogorov-Smirnov nonparametric test was conducted on the model residuals of all techniques to test the null hypothesis that the model residuals of any two techniques are sampled from the same distribution.The matrix of the p-values is given in Table 11.There is strong statistical evidence that the residuals of the various techniques are stemming from different populations.ANNs 1 0.0000 0.0021 0.0000 0.0000 0.0156 0.0000 GP 1 0.0001 0.0000 0.0015 0.0000 0.0000 EPR 1 0.0000 0.0003 0.0000 0.0069 SVM 1 0.0000 0.0000 0.0000 M5 1 0.0000 0.0000 K-nn 1 0.0000 MLR 1

Till (lower layer) moisture case study
The till moisture case study (SMT) is similar to the previous case study with regard to the small variability in the dataset, and the nonlinear response to the climatic variables due to the large storage effect.Table 3 shows that the variability (CV) of the till moisture data is half of that of the peat moisture data, whereas the skew in the till moisture dataset is nearly double that of the peat moisture.The error measures shown in Table 12 (and in particular the R statistic) reveal that Knn, GP, and ANNs are better candidates than other modeling techniques based on the same argument mentioned earlier regarding the R statistic.Similar to the previous case study, SVM and MLR techniques were the lowest in the rank with regard to the prediction accuracy.The small variability, combined with the high nonlinearity, of the SMT dataset contributed to the relative success of the K-nn technique in this particular case study.The failure of the MLR is an indicator of the potential utility of the ANNs for modeling the SMT.
Frequency curves were constructed for the residuals of each technique (Fig. 9) to investigate the predictive uncertainty.The graph in this case provides useful and more insightful view of the predictive reliability of the various techniques.The K-nn, GP, ANNs, and the SVM are clearly less uncertain and less skewed than EPR and other linear techniques (M5 and MLR) in this case study.The best-found probability distributions of the residuals of the various techniques differed across techniques.The LogLogistic (γ , β, α) distribution was found to fit the residuals of SVM and Knn, and ANNs modeling techniques, Logistic (α, β) for GP, Lognormal (µ, σ ) for EPR and MLR, and ExtremeValue (a, b) for M5.This reflects the fact that some of the adopted modeling techniques are really different in the way that they predict the output and minimize the errors, whereas some similarity is identified among the ANNs, K-nn, and SVM techniques.This similarity is only in terms of approaching the optimum solution, and leaving model residuals to be similarly distributed, but not necessarily in the distribution parameters.Similar to the SMP case study, less uncertainty with the use of the SVM is due to model residuals that stay around the mean, and thus, reduce the variability and the average error.This should not be confused with the poor accuracy of capturing trends in the data (low R value in Table 12 and even high IPE value in Table 8).
The Kolmogorov-Smirnov nonparametric test was conducted on the model residuals (raw data; not fitted distribution) of all techniques to test the null hypothesis that the model residuals of any two techniques are sampled from the same population.The matrix of the p-values is given as Table 13.The K-S test reveals that there is no evidence to reject the hypothesis in the case of the EPR and M5, and also GP and M5.The visual analysis of Fig. 9 confirms the finding regarding EPR and M5; however, M5 and GP are visually different.The reason is that the graph presents the best-fit distributions that should be used to make conclusions regarding the potential of the techniques and their possible performance on untested cases in the future.The K-S is a nonparametric test that relies on the cumulative frequency of the sample itself.For the rest of the adopted techniques, there is strong statistical evidence that the residuals of the various techniques are stemming from different populations.

Rainfall-runoff case study I
The performance of the various techniques applied to the daily rainfall-runoff I (R-R I) case study is provided in Table 14.In this case study, the preceding runoff was not used as an input for the models, therefore, the information content can be considered limited (only rainfall of the current and the three preceding days were used).The performances of all techniques were almost on par as shown by close values of average RMSE and R (Table 14) as well as close values of the IPE indicator (Table 8).Nonetheless, one can observe that M5, GP, and MLR were slightly better and less biased (lower MB values) than the other techniques.In a situation like this R-R I case study, where the information content itself is limited; it may not be possible to differentiate among the various modeling techniques.The limiting factor for the prediction accuracy becomes the information content rather The best-found probability distributions of the residuals of the various techniques did not differ.The Logistic (α, β) probability distribution, with different parameter values for each technique, was found to fit the residuals of all modeling techniques.This reflects the fact that the adopted modeling techniques produce residuals that have similar nature, and that all techniques were similar in the way that they predict the output and minimize the errors (Fig. 10).Even though the visual analysis of Fig. 10 shows almost no practical differences among the various probability distributions, the pvalues of the K-S test (Table not shown because all values are zeros) indicate that there is strong evidence to reject the null hypothesis.Based on the K-S test, the model residuals of the various techniques could represent different distributions.There is no contradiction between the K-S test results and the visual test because a slight shift on the graph might be translated to a statistically significant difference.

Rainfall-runoff case study II
This rainfall-runoff II (R-R II) case study is the same as the previous R-R I dataset with one difference; that is the preceding runoff was used as an additional input.In such a strongly autocorrelated series as the daily runoff, providing the preceding runoff as an input to predict the current runoff make strong information content at the disposal of the predictive models.Even though the MLR technique may not be suitable for this case study because one of the inputs (preceding runoff) is autocorrelated, it is used to show how much information can be a captured by a global linear model.In addition to this, a naïve model for predicting the daily runoff was developed just by using the preceding runoff value as an estimate of the current runoff.The performance of the various techniques applied to the daily R-R II case study is provided in Table 15.GP, M5, and EPR, followed by the MLR, techniques are better choices than the other techniques for this case studies.They provide the lowest RMSE, MARE, MB, and the highest R values.The IPE indicator in Table 8 also mostly supports this finding.Expectedly, the presence of the preceding runoff as an input in this case study makes the input-output relationship more globally linear than nonlinear.The superiority of the MLR over the ANNs supports this idea.Instance-based leaning techniques that use simple average of the nearest neighbors (K-nn) may not be a good choice.K-nn found almost most of the information within a range of very small number of neighbors (average of 3 neighbors, Table 6), but the failure to regress the information weakens the input-output relationship.The information capture in linear models could be even enhanced by local/modular techniques, such as the M5 model trees.Figure 11 shows the scatter plots of observed vs. predicted runoff II data.The scatter around the 45-degree line supports the conclusion made earlier regarding the superiority of the GP, M5, and EPR, and the inferiority of K-nn, ANNs, and SVM techniques.The success of GP, EPR, and M5 across all ranges of the dataset is noticeable (Fig. 11b, c, e).With the exception of the SVM and naïve models, the best-found probability distributions of the residuals of the various techniques did not differ.The Logistic (α, β) probability distribution, with different parameter values for each technique, was found to fit the residuals of ANNs, GP, EPR, M5, K-nn, and MLR techniques, whereas Normal (µ, σ ) was found to fit the residuals of the SVM and the naïve models.In spite of the similarity in the best-fit distribution, the parameters were completely different even visually (Fig. 12).All modeling techniques produced symmetrical distributions of model residuals, but GP, EPR, and M5 possess the smallest predictive uncertainty.The p-values of the K-S test (Table 16) indicate that there is strong evidence to reject the null hypothesis.Based on the K-S test, the model residuals of the various techniques could represent different distributions.ANNs GP EPR SVM M5 K-nn MLR ANNs 1 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 GP 1 0.0003 0.0000 0.0000 0.0000 0.0000 EPR 1 0.0000 0.0000 0.0000 0.0000 SVM 1 0.0000 0.0000 0.0000 M5 1 0.0000 0.0000 K-nn 1 0.0000 MLR 1

Discussion
After evaluating the various data driven modeling (DDM) techniques from both perspectives of prediction accuracy and uncertainty, one of the means to gain further insight into their modeling capabilities is to compare the performance deterioration in the testing phase to that in the training phase.Less deterioration may indicate a higher level of reliability and less uncertainty about the technique's performance in future and untested applications.The percent deterioration is calculated for each technique by dividing the difference between training and testing performance by the training performance.A negative percent means that the performance of the technique during the testing phase was better than that during the training phase.M5 are comparable to other techniques in terms of prediction accuracy and uncertainty, they deserve to be given preference as candidate modeling techniques.One of the fundamental questions of this research study is whether there are real differences among the techniques under consideration with regard to their predictive capabilities.The results and analysis show that serious evaluation of the various techniques has to rely on multiple ways, such as the average overall error represented by multiple error measures, scatter plots of the observed vs. predicted outputs, probabilistic analysis of the model residuals, and statistical tests of the significance of the differences among the residuals of various models/techniques.As an example, the SVM technique performed well on the peat moisture case study in terms of the overall average error measures and the probability distribution of the residuals, however, the scatter plots reveal that the models were not behavioral; i.e., could not capture the trend of the phenomenon at all.On the other hand, the superiority of the ANNs over other techniques on the same dataset was revealed by the scatter plots.The analysis presented in the previous section shows that SVM, M5, K-nn, and GP techniques were the best candidates for modeling the evapotranspiration case study.In the peat moisture case study, ANNs, GP, and followed by K-nn, M5, and EPR provided the best performances, whereas ANNs, GP, and K-nn were the best for modeling the till moisture dataset.Even though the K-S test show that the difference between the residuals of GP and M5 was insignificant, this should be treated with caution.The test compares the residuals but fail to assess the difference in the R statistic, which is the key indicator in this particular case study.M5 was not successful in this nonlinear dataset.For the rainfall-runoff I dataset, all techniques were on par, and perhaps there is no need for a sophisticated nonlinear model.In the last case study (rainfall-runoff II), that has an autoregressive term and hence can be described by less non-linear mappings, GP, M5, and EPR were obviously better than the other techniques.
Neural networks could be one of the optimum modeling choices for highly nonlinear case studies (e.g., peat and till soil moisture), but could be completely dominated by other techniques as it was the case for the AET and the rainfallrunoff II case study, depending on the level of linearity in the dataset.M5 is an excellent choice for linear and some nonlinear dataset; it performed poorly only in the till moisture dataset.EPR, though it was not a top choice except in the rainfall-runoff II case study, was never completely dominated by other methods, and sometimes it was among the best techniques.The excellent generalization ability (minimum performance deterioration during the testing phase) of the EPR adds to its potential for hydrological applications.However, in nonlinear datasets, EPR was always less successful than GP.GP was the only technique that was always either the top model or, at least, among the best models regarding both prediction accuracy and uncertainty.The ability of GP to adapt the structural complexity of the generated model/program to the dataset could be one of the main reasons of its superb predictive capability.The SVM seems to be significantly affected by the choice of kernels.In this study, the RBF kernel was chosen based on its performance on the cross validation sample of most case studies (four out of five cases).In the linear rainfall-runoff II case study, when a linear kernel was tested, the prediction accuracy, represented by RMSE, MARE, and R, improved by 20-25%.
Two limitations of this study have to be noted.First, the effect of the model inputs on the predictive capabilities was not investigated.Adding more important inputs, or removing some of them, affects the degree of linearity/nonlinearity of the input-output relationship, and thus, the model performance.Such an effect may differ from one technique to the other.Second, some capabilities of the various techniques and tools were not, and perhaps cannot be, thoroughly covered.The Discipulus software for GP was run for almost two hours each time.It was observed that allowing from 24-48 h of run could slightly improve the results.The EPR tool allows for multiobjective optimization, rather than just minimizing the squared error, but it was not tried in this study.Also instance-based techniques (K-nn) could be further improved using weighted average or regression of the nearest neighbors.ANNs could be trained using Bayesian regularization algorithm (Demuth and Beale, Hydrol. Earth Syst. Sci., 14, 1943-1961, 2010 www.hydrol-earth-syst-sci.net/14/1943/2010/In this experiment, the main objective was to investigate the predictive capabilities of the various data driven techniques.However, a brief ensemble prediction analysis was conducted in this study.For every case study (e.g., AET), ensemble prediction was calculated by averaging the predicted output values from the six modeling techniques.This process was repeated for each of the 12 dataset realizations.A summary of the ensemble prediction accuracy is provided in Table 19.In the cases of the AET and the P-R I dataset, ensemble predictions were not different from the results of the best individual technique.However, ensemble predictions were better than the best individual techniques in the cases of SMP and SMT.Interestingly, in the case of the P-R II case study, the best individual technique (GP) performed better than the ensemble.Apparently, when GP performed notably better than the other techniques, the results of the ensemble (averages of all techniques) will not improve the prediction accuracy (Table 19).
The non-parametric Gamma test ( -test) (Chuzhanova et al., 1998;Evans and Jones, 2002, and recently applied in hydrology by Remesan et al., 2008) was conducted to gain insight into the predictability and the complexities of the modeled processes, and possible leads into selection of suitable modeling techniques.The statistic was calculated for every dataset using the original training and cross-validation subsets as one integrated subset (all unique points).The Vratio, gradient, and the M-test were all calculated using the scaled data (zero mean and 0.5 standard deviation as specified by WinGamma software).The statistic was calculated using the unscaled data to facilitate the comparison with the error variance of the various modeling techniques (Table 18).The following observations can be made: (i) for the AET case study, the error variance of the linear regression technique (2302) was already lower than the statistic; indicating that complex nonlinear model (e.g., ANNs) may not be necessary.The low gradient value of 0.041 shows that a noncomplex smooth function can be used for modeling the AET process, whereas the reasonably low V-ratio indicates that there is high predictability in the output variable.GP, shown to perform well on all case studies, achieved the lowest error variance.Even though it is lower than the estimated , but when it is divided by the AET variance (Table 1), the ratio is 0.23; similar to the V-ratio.; (ii) for the R-R I case study, similar to the AET, there is no need for nonlinear complex model, especially in light of the high V-ratio that indicates low level of predictability.The low level of predictability is attributed to the lack of appropriate inputs, which was rectified in the R-R II case study.All techniques were found to perform on par.The slight superiority of the M5 (ratio of error variance to output variance is 0.44), which is a modular linear technique can be attributed to the fact that it does not produce a smooth function.This is something that the -test may not capture well; (iii) similar conclusions can be made for the R-R II case study.Nonlinear techniques, such as ANNs, will not perform well.The very low V-ratio that indicates very high predictability might be achieved by techniques that can outperform MLR, yet have the ability to adapt to linear situations.As expected GP, EPR, and M5 performed extremely well in this case; (iv) both SMP and SMT case studies, the MLR technique failed to achieve the estimated value, and actually produced ratios of error variance to output variance of 1.0 and 0.8, respectively.This finding points to the possibility that more complex nonlinear models are needed.As the results of this study show, in addition to GP, the ANNs and K-nn were relatively more successful in the SMP and SMT case studies.However, it should be noted that -test relates well to the model performance with regard to the squared error, but in cases where the criterion of performance is the R statistic, the test may not be the optimum tool; (v) the M-test indicates the number of data points that is perhaps needed to achieve the accuracy indicated by the V-ratio.It can be noticed from Table 18 that the size of the datasets needed for developing nonlinear models for the peat and till soil moisture are slightly more than what was used in this study.For the other three case studies, the size of the training datasets exceeded the M-test.
The -test may assist in the selection of the appropriate modeling techniques by applying first multiple linear regression models and evaluating the residuals against the -test values.Decision can be made regarding the need for a complex nonlinear technique.If there is a need for such technique, then ANNs and K-nn (in addition to GP, for example) should be seriously considered.If it is concluded that complex nonlinear techniques are not needed, then improvement of results can be sought using GP, EPR, and M5.
When complex nonlinear techniques are not needed, and the predictability is low (i.e., high V-ratio) significant improvement may not be at all possible.

Conclusions
Neural networks (ANNs) that have hidden nodes with nonlinear transfer functions may impose on the data a model with complexity level that is higher than that needed by many hydrological data.The results of the experiment conducted in this research study show that ANNs were a sub-optimal choice for the actual evapotranspiration (AET) and the two rainfall-runoff case studies.In the nonlinear case studies (peat and till soil moisture), ANN models were the most successful ones.In general, genetic programming (GP) was the most successful technique due to its ability to adapt the model complexity to the modeled data.Evolutionary polynomial regression (EPR) performance could be close to the GP with datasets that are more linear than nonlinear.Support vector machines (SVM) are sensitive to the kernel choice and if appropriately selected, the performance of SVM can improve.M5 model trees performs very well with linear and semi linear data, which cover wide range of hydrological situations.In nonlinear case studies, ANNs, K-nearest neighbors (K-nn), and GP could be more successful than other modeling techniques.K-nn was also successful in linear situations, and it deserves more attention as a potential modeling technique for hydrological applications.
The results of this study show that a winner modeling technique cannot be easily declared.DDM techniques should be applied in ensemble fashion.Multiple groups (realizations) of each dataset should be randomly generated, by sampling without replacement, and should be divided into three split samples of training, cross-validation for stopping the training phase, and testing for applying the model once.Developing multiple non-dominated models of each technique, based on the multiple realizations of the dataset, allows for evaluating the predictive accuracy and uncertainty in a comprehensive way.Multiple overall average error measures, frequency distributions of model residuals, and scatter plots of observed vs. predicted data should be all used as one package to evaluate the predictive capabilities of the modeling techniques.Gamma test can be used as a guide to assist in the selection of the appropriate modeling technique for a particular dataset.Further studies can be built on the experiment presented in this research to evaluate other data driven techniques and to study the impact of input selection and input pre-processing on the relative predictive capabilities of the techniques.
s) AT: air temperature ( • C); GT: ground temperature ( • C); NR: net radiation (W/m 2 ); Sum(NR −4 ): the cumulative net radiation over the preceding four time steps; RH: relative humidity; WS: wind speed (m/s); P : precipitation (mm); STP: depth averaged soil temperature of the upper peat layer ( • C); STT: depth averaged soil temperature of the lower till layer ( • C); Sum(P −6 ): the cumulative precipitation over the preceding six time steps (mm); Sum(AT −6 ): the cumulative air temperature over the preceding six time steps ( • C); SM P : depth averaged soil moisture content of the upper peat layer; SM T : depth averaged soil moisture content of the lower till layer; and Q t : the runoff (m 3 /s).

Fig. 1 .
Fig. 1.The Meuse river basin and the sub-basins upstream of Borgharen.Sub-basin 10 (Ourthe) is used in the case study.

Fig. 2 .Fig. 3 .
Fig. 2. Average mutual information and correlation of inputs-output for the evapotranspiration case study.

Fig. 4 .
Fig. 4. Statistical properties of the training/cross-validation/testing subsets for 100 random realizations.T r : Training; V a : Validation; T e : Testing.

Fig. 6 .
Fig. 6.Probability distribution of the 12 model residuals of all techniques (evapotranspiration case study).

Fig. 8 .
Fig. 8. Probability distribution of the 12 model residuals of all techniques (peat moisture case study).

Fig. 9 .
Fig. 9. Probability distribution of the 12 model residuals of all techniques (till moisture case study).

Fig. 10 .
Fig. 10.Probability distribution of the 12 model residuals of all techniques (rainfall-runoff I case study).

Fig. 12 .
Fig. 12. Probability distribution of the 12 model residuals of all techniques (rainfall-runoff II case study).

Table 1 .
Descriptive statistics of the AET dataset.

Table 2 .
Descriptive statistics of the daily peat and till moisture datasets.
has large discharges rising fast.Through its nature and location, close to the Dutch border, the Ourthe is also the most important Meuse tributary for flood forecasts.In its upper course, the Ourthe consists of two branches: the Ourthe

Table 3 .
Descriptive statistics of the output variables of all datasets.

Table 5 .
EPR type and functions of all case studies.

Table 6 .
The optimum number of nearest neighbors (K-nn) of the 12 models in each case study.

Table 7 .
Testing results of all models applied to the evapotranspiration dataset.
various techniques presents itself through Tables7 and 8.If the modeler picks, for example, the best model of one technique and compares it with the worst model of another technique, a different and perhaps biased conclusion might be made regarding the performance of these techniques.The best ANN model with IPE value of 0.31 is much better than the worst EPR model with IPE value of 0.37 (Table

Table 8 .
IPE testing results of all models applied to all datasets.

Table 9 .
The p-values of the two samples K-S test on the model residuals (evapotranspiration).

Table 10 .
Testing results of all models applied to the Peat moisture dataset.

Table 11 .
The p-values of the two samples K-S test on the model residuals (peat moisture).

Table 12 .
Testing results of all models applied to the Till moisture dataset.

Table 13 .
The p-values of the two samples K-S test on the model residuals (till moisture).

Table 14 .
Testing results of all models applied to the Rainfall-Runoff I .

Table 15 .
Testing results of all models applied to the Rainfall-Runoff II.

Table 16 .
The p-values of the two samples K-S test on the model residuals (rainfall-runoff II).

Table 18 .
The gamma test results on all case studies.

Table 19 .
Accuracy results of ensemble prediction based on average output values of six techniques.