Multi-criteria validation of artificial neural network rainfall-runoff modeling

In this study we propose a comprehensive multi-criteria validation test for rainfall-runoff modeling by artificial neural networks (ANNs). The study applies 17 global statistics and 3 additional non-parametric tests to evaluate the ANNs. The weakness of global statistics for the validation of ANNs is demonstrated by rainfall-runoff modeling of the Plasjan Basin in the western region of the Zayandehrud watershed, Iran. Although the global statistics showed that the multi-layer perceptron with 4 hidden layers (MLP4) is the best ANN for the basin compared with the other MLP networks and the empirical regression model, the non-parametric tests illustrate that neither the ANNs nor the regression model is able to reproduce the probability distribution of observed runoff in the validation phase. However, the MLP4 network is the best network for reproducing the mean and variance of the observed runoff based on the non-parametric tests. The performance of the ANNs and the empirical model was also examined for low, medium and high flows. Although the MLP4 network gives the best performance among the ANNs for low, medium and high flows based on different statistics, the empirical model shows better results. However, none of the models is able to simulate the frequency distribution of low, medium and high flows according to the non-parametric tests. This study illustrates that modelers should select appropriate and relevant evaluation measures from the set of existing metrics based on the particular requirements of each individual application.
Correspondence to: R. Modarres (r m5005@yahoo.com)


Introduction
The rainfall-runoff relationship is an important issue in hydrology and a common challenge for hydrologists. Due to the tremendous spatial and temporal variability of watershed characteristics such as snowpack, soil moisture, hydraulic conductivity, watershed slope and seasonal rainfall, the rainfall-runoff relationship is usually a nonlinear process. Since the middle of the 19th century, hydrologists have applied different methods to rainfall-runoff modeling, and many models have attempted to describe the physical processes involved (e.g. mathematical-physical lumped or distributed models).
Over the last decade, there has been tremendous growth in interest in a class of techniques that operate in a manner analogous to biological neuron systems, i.e. artificial neural networks (ANNs). Because ANNs are capable of capturing the non-linearity in the rainfall-runoff process compared with other modeling approaches (Hsu et al., 1995), ANN models have been applied in hydrology and in the context of rainfall-runoff modeling (Smith and Eli, 1995; Dawson and Wilby, 1998; Tokar and Markus, 2000; Zhang and Govindaraju, 2003; Kumar et al., 2005). These studies have demonstrated that ANN models can be flexible enough to simulate rainfall-runoff processes successfully.
Various types of neural network models are available for rainfall-runoff modeling. Feedforward artificial neural networks (FFANNs) maintain a high level of research interest due to their ability to map any function to an arbitrary degree of accuracy. This has been demonstrated theoretically for both the radial basis function (RBF) network and the popular multilayer perceptron (MLP) network (Harpham and Dawson, 2005). The primary goal of ANN modeling is the prediction or forecasting of hydrological variables, e.g. runoff prediction. In this case, the data are divided into two sets prior to model building: the training set and the validation set. The validation set is kept aside to evaluate the accuracy of the model derived from the training set. In the validation phase, the model output is compared with the actual output using statistical measures such as the root-mean-square error (RMSE) and the coefficient of correlation (CORR).
However, the equality of the probabilistic characteristics of the observed and simulated runoff is usually ignored in the validation test. This equality is important because the simulated runoff should reflect the relevant hydrological characteristics of the observed runoff in terms of both magnitude and frequency. For example, when flow duration curves are drawn, the observations are arranged in order of magnitude, beginning with rank 1 for the largest. Therefore, the simulated runoff should reproduce the probabilistic behavior of the observed runoff, especially for both upper and lower extreme values.
In this regard, the objectives of this study are twofold. In the first step, we develop an effective ANN model for the rainfall-runoff relationship in the study area and verify the models by global statistics such as the root-mean-square error (RMSE), the coefficient of correlation and the coefficient of efficiency. In the second step, non-parametric tests for the equality of the mean, variance and probability distribution of the observed and simulated runoff are used to validate the rainfall-runoff models and to compare the outcome with that of the global statistics.

Study area and data
In this study, the most popular FFANN architecture, i.e. the MLP, is used for rainfall-runoff modeling of the main upstream basin of the Zayandehrud watershed in the western region of Isfahan Province in the center of Iran. The Zayandehrud watershed has two main basins, Ghaleh Shahrokh and the Plasjan Basin. These two basins connect directly to the Zayandehrud Dam, which provides the water supply for Isfahan Province. The input and output variables for the ANN are the daily rainfall and runoff of the Plasjan Basin (Fig. 1). The data set includes the Plasjan daily streamflow time series and three daily rainfall time series from stations within the basin for the period 1978-2000. The daily streamflow of Plasjan is given in Fig. 2.

Multi-layer perceptron
In this study, the multilayer perceptron architecture assumes that the unknown function (rainfall-runoff) is represented by a multilayer feedforward network of sigmoid units. An ANN model with n input neurons ($x_1, \dots, x_n$), h hidden neurons ($w_1, \dots, w_h$) and m output neurons ($z_1, \dots, z_m$) is considered in this study. The function that this model calculates is

$$z_k = g\left(\sum_{j=1}^{h} \alpha_{jk}\, f\left(\sum_{i=1}^{n} \beta_{ij} x_i + \tau_j\right) + \varepsilon_k\right) \qquad (1)$$

where g and f are activation functions; i, j and k index the input, hidden and output layers, respectively; $\tau_j$ is the bias for neuron $w_j$ and $\varepsilon_k$ is the bias for neuron $z_k$; $\beta_{ij}$ is the weight of the connection from neuron $x_i$ to $w_j$ and $\alpha_{jk}$ is the weight of the connection from neuron $w_j$ to $z_k$. The hyperbolic tangent sigmoid function is used in this study as the activation function for the hidden nodes. The function can be written as

$$f(s_i) = \tanh(s_i) = \frac{e^{s_i} - e^{-s_i}}{e^{s_i} + e^{-s_i}}$$

where $s_i$ is the weighted sum of all incoming information and is also referred to as the input signal.
The major advantage of the MLP is that it is less complex than other artificial neural networks such as the radial basis function (RBF) network, and has the same nonlinear input-output mapping capability (Coulibaly and Evora, 2007). The training of the MLP involves finding an optimal weight vector for the network. The objective function of the training process is

$$E = \sum_{p=1}^{N} \sum_{k=1}^{M} \left(t_{kp} - z_{kp}\right)^2$$

where N is the number of training data pairs, M is the number of output nodes, $t_{kp}$ is the desired value of the kth output node for input pattern p, and $z_{kp}$ is the kth element of the actual output associated with input p (Antar et al., 2006).
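The forward computation described in this section can be illustrated with a short sketch. It is not the code used in the study: the function names `mlp_forward` and `sum_squared_error` are hypothetical, only one hidden layer is shown, and the output activation g is assumed to be the identity. The sum-of-squares form of the training objective is likewise an assumption about its normalization.

```python
import math


def mlp_forward(x, beta, tau, alpha, eps):
    """Forward pass of a one-hidden-layer MLP.

    x     : list of n inputs
    beta  : n x h weights, beta[i][j] connects input i to hidden neuron j
    tau   : h hidden biases
    alpha : h x m weights, alpha[j][k] connects hidden neuron j to output k
    eps   : m output biases
    The hidden activation f is tanh; the output activation g is assumed
    to be the identity (an assumption, not stated in the text).
    """
    hidden = [
        math.tanh(sum(beta[i][j] * x[i] for i in range(len(x))) + tau[j])
        for j in range(len(tau))
    ]
    return [
        sum(alpha[j][k] * hidden[j] for j in range(len(hidden))) + eps[k]
        for k in range(len(eps))
    ]


def sum_squared_error(targets, outputs):
    """Sum over patterns p and output nodes k of (t_kp - z_kp)^2."""
    return sum(
        (t - z) ** 2
        for t_p, z_p in zip(targets, outputs)
        for t, z in zip(t_p, z_p)
    )


# One input, one hidden neuron, one output: z = 2 * tanh(1 * x + 0) + 0
z = mlp_forward([1.0], beta=[[1.0]], tau=[0.0], alpha=[[2.0]], eps=[0.0])
```

A network with several hidden layers, such as MLP4, would repeat the hidden-layer step once per layer before the output sum.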

Model development
The daily observations were divided into training, validation and cross-validation sets prior to model building. The cross-validation set is used to avoid overfitting during training. In this study, 60, 25 and 15% of the data were used for training, validation and cross-validation, respectively.
It is worth noting that the method used to divide the data has a significant impact on the results. In other words, the network may be trained on only low or only high flow samples and reach great precision for the training set but fail to simulate flows outside the range of the training data (Tokar and Johnson, 1999; Shahin et al., 2000). In this study, the rainfall and runoff data were randomized prior to training the network to avoid this problem. The randomization of input data was emphasized by many researchers, such as Bras and Rodríguez-Iturbe (1985) and Ochoa-Rivera et al. (2002), for hydrologic variables with a large degree of variability and uncertainty. They stated that using only historical data as inputs to an ANN may result in a scarcely documented response. However, randomization may lead to losing the historical memory of the basin when the ANN is applied to streamflow time series forecasting.
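A minimal sketch of a randomized 60/25/15% split is given below. The function name `split_random`, the fixed seed and the rounding rule are illustrative assumptions, not details taken from the study.

```python
import random


def split_random(records, fractions=(0.60, 0.25, 0.15), seed=0):
    """Shuffle the records and split them into training, validation
    and cross-validation subsets using the given fractions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    cross = shuffled[n_train + n_valid:]  # remainder, about 15%
    return train, valid, cross


train, valid, cross = split_random(range(100))
```

Because the records are shuffled as whole input-output pairs, each subset samples the full range of flows, at the cost of breaking the temporal order of the series.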
Table 1. The rainfall and runoff variables used to construct the neural network, with the cross-correlation coefficients (CCC) and autocorrelation coefficients (ACC).

Variable                                                         CCC     ACC
x(1): R1(t-1), daily rainfall of station 1 at a lag of 1 day     0.133   -
x(2): R1(t-2), daily rainfall of station 1 at a lag of 2 days    0.119
x(3): R2(t-1)

In the first step, we select the input data for the MLP networks. According to the autocorrelation properties of the daily rainfall and runoff time series and the cross-correlation between the daily rainfall and runoff series, different input variables can be used for the ANN. However, due to the possibility of zero rainfall and runoff in the Zayandehrud basin, the initial efforts to construct the ANN showed that data transformation is necessary to reduce the variance of the rainfall and runoff time series. In this study, we apply standardized rainfall and runoff time series to construct the ANN. After trial and error, the following standardized variables were selected as input and output data of the ANN. The cross-correlation coefficients (CCC) between streamflow and the selected rainfall variables and the autocorrelation coefficients (ACC) of the streamflow time series at different lags are also given in Table 1. All the coefficients are significant at the 1% level.
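The input selection step can be sketched as follows. The helper names `standardize` and `lagged_correlation` are hypothetical, standardization is assumed to mean subtracting the mean and dividing by the standard deviation, and the short series are made-up numbers used only to exercise the code.

```python
import math
import statistics


def standardize(series):
    """Return (value - mean) / standard deviation for each value."""
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(v - mean) / sd for v in series]


def lagged_correlation(x, y, lag):
    """Pearson correlation between x(t - lag) and y(t).

    With x = rainfall and y = streamflow this is a cross-correlation
    coefficient; with x = y = streamflow it is an autocorrelation.
    """
    xs = x[:len(x) - lag]
    ys = y[lag:]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data: flow at day t is twice the rain at day t-1
rain = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
flow = [9.0, 2.0, 6.0, 4.0, 10.0, 8.0]
ccc_lag1 = lagged_correlation(rain, flow, lag=1)
flow_std = standardize(flow)
```

Correlation coefficients are unchanged by standardization, so the CCC and ACC values can be computed on either the raw or the standardized series.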
The output of the model is the streamflow discharge of the Plasjan River ($Q_t$) at the outlet of the basin. We tested different MLP architectures and found that the MLP with one hidden layer (i.e. MLP1) is not appropriate, while the other MLPs (MLP2, MLP3, MLP4 and MLP5) are suitable networks for modeling the rainfall-runoff relationship of the Plasjan Basin. The training samples were presented in random order and the Levenberg-Marquardt back-propagation algorithm, as the most efficient algorithm (Ramirez-Beltran and Montes, 2002), was used to train the neural networks. Training was stopped at 1000 epochs. The learning rate was set from 0.7 to 0.1 and the learning rule is momentum. Each hidden layer of each MLP network contained 7 hidden units. The performance of these networks is depicted in Fig. 3a-d, which shows the network-estimated streamflow against the observed validation data set.

Empirical model
In order to compare the ANNs with an empirical model, we also develop a multiple linear regression (MLR) model for the rainfall-runoff relationship. The discharge of the Plasjan River ($Q_t$) is selected as the dependent variable and the input variables of the ANN are selected as independent variables. The best-fit model is estimated using a stepwise procedure and selected based on the highest coefficient of determination ($R^2$) and a residual test for normality. Finally, the following regression model is estimated: The performance of the regression model is depicted in Fig. 4 for the validation data set.
It is noted that Unal et al. (2004) validated simulation models by using statistical characteristics such as the average, standard deviation, skewness coefficient, autocorrelation coefficient, and maximum and minimum values, and performance criteria such as relative error, absolute error, frequency of success, and ranges of relative and absolute errors. Liu et al. (2003) validated the results of ANN models with the root-mean-square error and the determination coefficient.
Very recently, Aksoy and Dahamsheh (2009) used a multi-criteria validation of ANN models developed for Jordan, using graphical and numerical measures including the forecasted and observed time series, the scatter diagram, the residual time series between forecast and observation, the mean absolute and relative errors between forecast and observation, and the dimensionless mean absolute error and dimensionless mean relative error between forecast and observation. Additionally, the following performance measures were adopted: the determination coefficient to quantify the linearity between forecast and observation, the mean square error, the mean absolute error, and a and b (the slope and the intercept) of the best-fit line of the scatter diagram between forecast and observation.
As there is no single definitive evaluation test, it is important to apply a multi-criteria assessment of ANN skill (Dawson et al., 2002; Kumar et al., 2005). These statistics are summarized in a recent paper by Dawson et al. (2007) and can be calculated automatically on the Hydrotest website at http://www.hydrotest.org.uk. We apply 17 criteria, which are listed in Appendix A. The reader is referred to Dawson et al. (2007) for the mathematical formulation of these criteria.
These error statistics are given for the different MLP networks in Table 2. It is evident that the MLP4 network is better than all other networks. According to some criteria, i.e. MARE, ME, MRE and MSRE, the regression model performs better than the MLP4 network. However, these unbounded criteria do not necessarily show that the regression model is preferable, because a low score on these criteria does not necessarily indicate a good model in terms of accurate forecasts, since positive and negative errors tend to cancel each other out.
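The cancellation of positive and negative errors can be made concrete with a small sketch of four of the global statistics: the mean error (ME), mean absolute error (MAE), root-mean-square error (RMSE) and coefficient of efficiency (CE). The function name `global_statistics`, the sign convention (observed minus simulated) and the four-value series are assumptions made for illustration only; the reference formulations are those of Dawson et al. (2007).

```python
import math


def global_statistics(observed, simulated):
    """Return ME, MAE, RMSE and CE for paired observed/simulated flows.

    Errors are taken as observed minus simulated (an assumed sign
    convention). CE is one minus the ratio of the squared-error sum to
    the squared deviations of the observations from their mean.
    """
    n = len(observed)
    errors = [o - s for o, s in zip(observed, simulated)]
    mean_obs = sum(observed) / n
    sse = sum(e ** 2 for e in errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sse / n),
        "CE": 1.0 - sse / sum((o - mean_obs) ** 2 for o in observed),
    }


# Every simulated value is off by one unit, alternately high and low
stats = global_statistics([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
```

In this example ME is zero although every single estimate is wrong by one unit, while MAE and RMSE both equal one and CE is only 0.2. A signed criterion near zero therefore says little about forecast accuracy on its own.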

Statistical validation
Although the above error statistics provide relevant information on the overall performance of the models, they do not provide specific information about model performance at high or low flows, which is of critical importance in flood or low-flow contexts. This study proposes other criteria to evaluate the performance of ANNs, especially for the rainfall-runoff relationship. These criteria are divided into the following graphical and numerical tests:

Graphical tests
In this section we compare the box-plots and probability plots of the observed and computed flows. The probability plots of the observed and simulated streamflow are constructed by Blom's method (Blom, 1958), which is based on the fractional rank of the observations. The parameters of the probability function are estimated by the maximum likelihood method. These tests are useful for visual comparison of the upper or lower tail of the distribution of the observed and estimated streamflow. The box-plots of observed and estimated streamflow for the different MLP networks and the regression model are illustrated in Fig. 5. From the box-plots, it is clear that the MLP4 network and the regression model most closely match the observed streamflow, especially for high flows.
The probability plots for the observed and MLP4-estimated streamflow reveal that the two distributions are more similar under a normal distribution (Fig. 6) than under a gamma distribution (Fig. 7), because the lower tail of the gamma distribution is very different for observed and estimated streamflow. The gamma probability plots for the MLP2 and MLP5 networks are also presented in Fig. 8. It is clear that the networks are not able to reproduce the probability distribution of the observed streamflow and that there is a significant difference in both the upper and lower tails of the quantile distribution of streamflow. The probability plots of the streamflow estimated by the regression model are also presented in Fig. 9. The normal probability plot (Fig. 9a) is similar to the normal probability plots of the observed streamflow and the MLP4 network (Fig. 6a and b, respectively). However, the normal and gamma probability plots for regression and observed streamflow are different, particularly for the lower tail of the distribution. These probability plots illustrate that neither the MLP networks nor the regression model is able to simulate the probability distribution of the observed streamflow (see also Table 3 and Sect. 6.2.2).
Although the MLP4 network seems to be better than the other networks, it does not achieve much better results than the regression model for rainfall-runoff modeling of the Zayandehrud basin. It would therefore be wise to check the validity of the ANNs by the statistical measurements presented in the following section.
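Blom's plotting position assigns the observation of rank i in a sample of size n the cumulative probability (i - 3/8)/(n + 1/4). A minimal sketch for a normal probability plot follows. The function name `blom_normal_plot` is hypothetical, ties receive consecutive ranks rather than averaged (fractional) ranks, which is a simplification, and the three flow values are made up.

```python
from statistics import NormalDist


def blom_normal_plot(values):
    """Return (value, probability, normal quantile) triples.

    Ranks run from 1 (smallest) to n (largest); the probability is
    Blom's plotting position (rank - 3/8) / (n + 1/4).
    """
    n = len(values)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, value in enumerate(sorted(values), start=1):
        p = (rank - 0.375) / (n + 0.25)
        points.append((value, p, standard_normal.inv_cdf(p)))
    return points


points = blom_normal_plot([5.0, 1.0, 3.0])
```

Plotting the sorted values against the normal quantiles gives a straight line when the sample is close to normal; departures in the first or last few points expose differences in the lower or upper tail.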

Statistical tests
In this section, we suggest useful statistical tests to evaluate the performance of the ANNs and to compare these ANNs with each other. These statistical methods include non-parametric tests to compare the mean, standard deviation and cumulative distribution function (CDF) of observed and estimated streamflow. Khan et al. (2006) used these statistics

Non-parametric test for the difference of two population means
The Wilcoxon rank sum method (Conover, 1980) is a robust non-parametric method for constructing a hypothesis test p-value for $\mu_1 - \mu_2$ (the difference of two population means). At any significance level greater than the p-value, one rejects the null hypothesis, and at any significance level less than the p-value one does not reject the null hypothesis. For example, if the p-value is 0.04, one rejects the null hypothesis at a

Non-parametric test for the equality of two population variances
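The equality of two population variances can be tested using Levene's test. The hypotheses for Levene's test can be defined as (Khan et al., 2006):

$$H_0: \sigma_1 = \sigma_2 = \dots = \sigma_k$$
$$H_a: \sigma_i \neq \sigma_j \quad \text{for at least one pair } (i, j)$$

In performing Levene's test, a variable X with sample size N is divided into k subgroups, where $N_i$ is the sample size of the ith subgroup, and the Levene test statistic is defined as:

$$W = \frac{(N - k)}{(k - 1)} \, \frac{\sum_{i=1}^{k} N_i \left(\bar{Z}_{i.} - \bar{Z}_{..}\right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i} \left(Z_{ij} - \bar{Z}_{i.}\right)^2}$$

where $Z_{ij}$ is defined as:

$$Z_{ij} = \left| X_{ij} - \tilde{X}_i \right|$$

where $\tilde{X}_i$ is the median of the ith subgroup, $\bar{Z}_{i.}$ is the group mean of the $Z_{ij}$ and $\bar{Z}_{..}$ is the overall mean of the $Z_{ij}$. Levene's test rejects the hypothesis that the variances are equal if

$$W > F_{(\alpha,\, k-1,\, N-k)}$$

where $F_{(\alpha,\, k-1,\, N-k)}$ is the upper critical value of the F distribution with k − 1 and N − k degrees of freedom at a significance level of α.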
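The median-based Levene statistic can be computed directly from its definition, as in the sketch below. The function name `levene_w` and the two small groups are illustrative assumptions; the critical value of the F distribution is not computed here and would be taken from tables or a statistics package.

```python
import statistics


def levene_w(groups):
    """Levene's W using absolute deviations from each group median."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |X_ij - median of group i|
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(x - med) for x in g])
    z_group_means = [statistics.fmean(zi) for zi in z]
    z_overall = sum(sum(zi) for zi in z) / n_total
    between = sum(
        len(zi) * (m - z_overall) ** 2 for zi, m in zip(z, z_group_means)
    )
    within = sum(
        (v - m) ** 2 for zi, m in zip(z, z_group_means) for v in zi
    )
    return (n_total - k) / (k - 1) * between / within


# Two made-up groups; the second has twice the spread of the first
w = levene_w([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
```

For two groups (observed and simulated streamflow) k = 2, so W is compared with the upper critical value of F with 1 and N − 2 degrees of freedom.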

Non-parametric test for equality of CDFs of two populations
The Kolmogorov-Smirnov (K-S) non-parametric test (Conover, 1980) is used to compare the cumulative distribution functions (CDFs) of the observed and simulated streamflow series. Suppose $F_1(x)$ and $F_2(x)$ are the CDFs of two samples of a variable x. The null hypothesis and the alternative hypothesis concerning their CDFs are:

$$H_0: F_1(x) = F_2(x) \quad \text{for all } x$$
$$H_a: F_1(x) \neq F_2(x) \quad \text{for at least one value of } x$$

and the test statistic, Z, is defined as

$$Z = \max_x \left| F_1(x) - F_2(x) \right|$$

which is the maximum vertical distance between the distributions $F_1(x)$ and $F_2(x)$. If the test statistic is greater than the critical value, the null hypothesis is rejected.
To evaluate the performance of the MLP networks, we apply the tests in two cases. First, the observed and simulated streamflow time series are compared for the overall validation test. For the second case, the percentiles of the observed and simulated streamflow time series are compared in order to check the validity of the ANNs for the prediction of high, medium and low streamflows. The streamflow time series are divided into the first 0-25% (P1), the second 25-75% (P2) and the third 75-100% (P3) percentiles.
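The maximum vertical distance between two empirical distribution functions can be sketched as follows. The function names are hypothetical, and the critical value uses the common large-sample approximation $\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}\,\sqrt{(n_1+n_2)/(n_1 n_2)}$, which is an assumption about how the test would be applied rather than the procedure reported in this study.

```python
import bisect
import math


def ks_statistic(sample1, sample2):
    """Maximum absolute difference between two empirical CDFs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    d = 0.0
    for x in s1 + s2:  # the maximum is attained at an observed value
        f1 = bisect.bisect_right(s1, x) / n1  # share of s1 <= x
        f2 = bisect.bisect_right(s2, x) / n2  # share of s2 <= x
        d = max(d, abs(f1 - f2))
    return d


def ks_critical_value(n1, n2, alpha=0.05):
    """Large-sample approximation of the two-sample critical value."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))


d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
reject = d > ks_critical_value(4, 4)
```

With only four values per sample the approximation is crude and the difference is not significant; the daily validation series used in practice are much longer.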
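One way to form the three percentile groups is sketched below. It assumes that the observed-simulated pairs are ranked by the observed flow and then cut at the 25% and 75% ranks; the text does not state whether the groups were formed this way or by ranking each series separately, so this choice and the function name `percentile_groups` are assumptions.

```python
def percentile_groups(observed, simulated):
    """Split paired flows into P1 (0-25%), P2 (25-75%) and P3 (75-100%)
    according to the rank of the observed flow."""
    pairs = sorted(zip(observed, simulated))  # ascending observed flow
    n = len(pairs)
    lower, upper = n // 4, (3 * n) // 4
    return {
        "P1": pairs[:lower],       # low flows
        "P2": pairs[lower:upper],  # medium flows
        "P3": pairs[upper:],       # high flows
    }


observed = [1, 2, 3, 4, 5, 6, 7, 8]
simulated = [10 * q for q in observed]
groups = percentile_groups(observed, simulated)
```

Each group can then be passed to the same global statistics and non-parametric tests that are used for the full validation series.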
Table 3 indicates the results of the non-parametric tests at the 95% confidence level for the first case. It is evident that none of
For comparing high, medium and low flows in the second case, the streamflow time series are divided into three percentile groups and the above non-parametric tests are applied to each group. Table 4 presents the global statistics of the networks for each percentile group. The values highlighted in bold in this table indicate the "best" model out of the five models when assessed using each particular evaluation metric. For example, according to the IoAD criterion, the MLP4 network gives the best performance for the third percentile.
Hydrol. Earth Syst. Sci., 13, 411-421, 2009 www.hydrol-earth-syst-sci.net/13/411/2009/
For the first percentile, i.e. the low flows, the MLP4 network performs better than the other networks based on most of the criteria. However, for some criteria such as AME, PDIFF and PEP, the MLP2 is better than the other MLP networks. These criteria measure the error of the highest output between the modeled and the observed data set, which is not suitable for low-flow error measurement. For the second percentile, the same results can be seen for the MLP4 and MLP3 networks. However, for the third or upper percentile, which shows the efficiency of the model for estimating high flows, the MLP4 is the best network. Jain and Srinivasulu (2004) also mentioned that high flows can be effectively modeled by MLP networks. However, they concluded that for medium and low flow simulation by ANNs, the use of a genetic algorithm (GA) may be advantageous because the watershed condition is much more complex and dynamic for low flows than for high flows.
On the other hand, the regression model seems to be more effective than the MLP networks for rainfall-runoff modeling according to almost all criteria and for the different percentiles. The regression model scores well in terms of most of the metrics. However, the MLP is still better than the regression model in terms of PDIFF and PEP. In other words, the MLP4 network estimates high flows more accurately than the regression model, while the regression model performs better than the MLP4 for medium and low flows. The results for the total data (Table 2) also indicated the better performance of the MLP4 network over the regression model for high flows.
Table 5 presents the results of the non-parametric tests for the three percentile groups. It is found that the MLP2 is still an inadequate model of the rainfall-runoff relationship for the Plasjan River because all p-values are below 0.05, i.e. the hypotheses of equal mean, variance and distribution are all rejected.
The MLP3 network can reproduce the mean of the observed streamflow for the second and third percentiles, but the network is weak in simulating the standard deviation and the probability distribution of the observed streamflow because the p-values are below 0.05. The MLP4 network gives the best simulation results for the mean and standard deviation of the observed streamflow, whereas the MLP5 network also fails to reproduce the mean and standard deviation of the observed streamflow. On the other hand, the regression model is similar to the MLP4 network.
However, the Kolmogorov-Smirnov test demonstrates that neither the ANNs nor the regression model can reproduce the probability distribution of streamflow in the validation phase of the modeling. Although the MLP4 network and the regression model are able to simulate the mean and standard deviation of the observed streamflow, they could not reproduce its probability distribution.
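The decision rule based on p-values can be illustrated with a sketch of the Wilcoxon rank sum test. It uses the large-sample normal approximation without tie or continuity corrections, which is a simplification and not necessarily the procedure used to produce Table 5; the function name `rank_sum_pvalue` and the two made-up samples are illustrative only.

```python
import math
from statistics import NormalDist


def rank_sum_pvalue(a, b):
    """Two-sided p-value of the Wilcoxon rank sum test
    (normal approximation, no tie or continuity correction)."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # extend j over a run of tied values, then assign the midrank
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[t] = midrank
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_far = rank_sum_pvalue(list(range(20)), list(range(100, 120)))
p_near = rank_sum_pvalue([1, 3, 5, 7], [2, 4, 6, 8])
```

Here `p_far` is far below 0.05, so the hypothesis of equal location is rejected at the 5% level, whereas `p_near` is well above 0.05 and the hypothesis is not rejected.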

Conclusion and summary
Artificial neural networks are powerful tools for modeling nonlinear relationships in hydrology such as the rainfall-runoff relationship. The validation phase of neural network modeling plays an important role in testing the efficiency of the model. Global statistics are the common methods used in this phase. However, the findings reported in this paper show that global statistics broadly reflect the accuracy of the model but are insufficient indicators of the best ANN because they do not capture the mean, standard deviation and probability distribution of the observed streamflow. This paper also illustrates the dangers of relying on one metric alone to evaluate and select among different models.
Although the multi-layer perceptron with four hidden layers was selected as the best neural network based on the global statistics, it failed to reproduce the probability distribution of the observed streamflow. The MLP4 network also gives better results than the regression model for the entire testing data set.
However, it is important to reproduce streamflow statistics such as the mean, standard deviation and probability distribution for high, medium and low flows. Depending on the objective of the ANN, i.e. flood or low-flow simulation or forecasting, it is very important to check the accuracy of the ANN output separately for each flow range in future studies. For example, the best ANN in this study, MLP4, gives better estimates for high flows than for low flows. But the MLP4 network is not able to reproduce the probability functions of the different percentiles according to the Kolmogorov-Smirnov test. Although the regression model is better than the ANNs based on different criteria, it is also unable to reproduce the probability distribution of the observed streamflow.
In general, the findings of this study indicate that, for the validation phase of an ANN, the common global statistics are not sufficient and relying on one measure is not appropriate. A multi-criteria assessment based on different global and non-parametric tests is essential for verifying and selecting an optimum ANN. One should use a range of methods to evaluate the models. This study also shows the advantage of applying empirical, physical or conceptual models together with ANNs, because some of these models may give better results with a simpler modeling procedure than ANNs.
Edited by: J. Liu

Fig. 1. Location of the Zayandehrud watershed in Isfahan Province and the location of the Plasjan Basin. The rainfall stations (black circles) and the Plasjan hydrometric station (black triangle) are also shown inside the Plasjan Basin.


Fig. 4. Scatter plot of observed versus simulated streamflow (m³/s) with the regression model.

Fig. 5. Comparison of box-plots of observed runoff and simulated runoff by MLP networks.

Fig. 6. Normal cumulative probability plots for (a) observed and (b) MLP4 simulated streamflow.

Fig. 7. Gamma cumulative probability plots for (a) observed and (b) MLP4 simulated streamflow.

Fig. 9. Normal (a) and Gamma (b) cumulative probability plots for simulated streamflow by the regression model.

Table 2. Performance indices for MLP and regression models.
It is noted thatUnal et al. (

Table 3. Test results (p-values) of non-parametric methods for the difference between observed and ANN- and regression-simulated streamflow data at the 95% confidence level.

Table 4. Performance indices for MLP and regression models for the different percentile groups (P1, P2 and P3).

Table 5. Test results (p-values) of non-parametric methods for the difference between observed and ANN- and regression-simulated streamflow percentile groups at the 95% confidence level.