Ice breakup forecast in the reach of the Yellow River: the support vector machines approach

Accurate lead-time forecasting of ice breakup is one of the key aspects of ice flood prevention and loss reduction. In this paper, a new data-driven model based on statistical learning theory was employed for ice breakup prediction. The model, known as the Support Vector Machine (SVM), follows the principle of minimizing the structural risk rather than the empirical risk. In order to estimate the appropriate parameters of the SVM, the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM-UA) algorithm is applied through an exponential transformation. A case study was conducted in the reach of the Yellow River. Results from the proposed model showed promising performance compared with that of an artificial neural network, so the model can be considered an alternative and practical tool for ice breakup forecasting.


Introduction
Ice flooding is a common phenomenon in the rivers of northern China every year. It often occurs during the thaw period, when river discharge increases due to snowmelt, causing the forces on an ice cover to exceed its strength. Ice jams are often formed when the broken ice is transported along the river and accumulates in the river bed. This can cause many hazards, such as bridge or levee failure, structural damage to dams, and erosion of the riverbed and banks (Massie et al., 2002). Ice breakup prediction can be used to increase warning time and to minimize the damage caused by ice floods or ice jams. Therefore, it is of great importance to make accurate ice breakup forecasts.
Ice breakup prediction is very difficult because breakup results from a complex interaction between hydrologic, hydraulic, and meteorological processes. Due to the complexity of the physical processes, which cannot currently be described with deterministic models, there are no reliable methods to predict the date of breakup with a significant lead time (White, 2003). Many models have been developed for predicting ice breakup, including empirical methods, mathematical models, and statistical models.
White (2003) provided examples of existing breakup ice jam prediction methods and discussed their potential advantages and disadvantages. Most advances in river ice hydrology have been reviewed elsewhere (Beltaos, 2000; Morse and Hicks, 2005). Beltaos (2003) calculated the threshold flows that can result in significant jam flooding using a numerical model named RIVJAM. Among statistical models, logistic regression has been widely used for predicting breakup and ice jam occurrence (White, 1996). Massie et al. (2002) pointed out that breakup ice jam prediction models had historically been limited to classical empirical single-variable threshold-type analyses and statistical methods; they employed a neural network for ice jam prediction and improved the forecast accuracy. Instead of using a neural network alone, fuzzy logic and artificial neural networks were combined for modeling the maximum water level during river ice breakup for both flood and non-flood event years (Mahabir et al., 2006). A fuzzy optimization neural network approach was also developed for forecasting freeze-up and break-up dates (Chen and Ji, 2005). In China, the index method was first used for ice breakup forecasting in the 1950s; it is an empirical method with the disadvantages of short lead time and poor accuracy. To solve these problems, the empirical correlation method was developed based on certain physical processes; it improved the forecast accuracy but was usually restricted by its application conditions. In the 1990s, some mathematical models were developed that are effective and efficient, but they often fail because some parameters cannot be obtained in practice, which limits their wide application.
In summary, although physically based models have several advantages, they are complex and require many hypotheses, so black-box models have gained much popularity in recent years. A black-box model does not require understanding of the complex physical process; the only task is to establish the relationship between the input and output variables, which makes it easy to use. In the past decade, the artificial neural network (ANN), one of the best-known black-box models, has been increasingly applied to various hydrological problems because it can deal with complex nonlinear processes (ASCE, 2000b). However, there are some limitations to the wide application of ANNs, such as the selection of the network architecture and the training algorithm, and, most seriously, that training does not lead to a global or unique solution (ASCE, 2000a).
In this paper, an alternative data-driven model based on statistical learning theory, called the Support Vector Machine (SVM), is used for the task of ice breakup prediction. It was originally used for classification purposes and was later extended to regression and prediction (Huang et al., 2005; Ye et al., 2005; Niu et al., 2006; Sun and Yang, 2006). The SVM follows the principle of structural risk minimization. The key property of the SVM is that it prevents overfitting (or overtraining) and that its solution is always unique and globally optimal. For selecting the optimal SVM parameters, the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm is employed.
The remainder of the paper is organized as follows: Sect. 2 provides a brief introduction to the SVM and its parameter identification method (MOSCEM). A description of the study area and the data used is presented in Sect. 3. In Sect. 4, the implementation of the SVM for ice breakup prediction is described and the results are compared with those of an ANN. The paper closes with a summary and some problems to be studied.
2 Support vector machine for regression (SVR)

Introduction to SVR
Here, a brief description of SVM for regression is given. Detailed descriptions of SVR can be found in Vapnik (1995).
The SVR maps the input data x into a high-dimensional feature space F by a linear or nonlinear mapping, in which the training data may exhibit linearity, and then performs linear regression in this feature space. Given a training data set {(x_i, y_i)}, i = 1, ..., n, where x_i denotes the input vector, y_i denotes the output value and n is the total number of data pairs, the aim is to identify a regression function y = f(x) that can accurately predict the outputs {y_i} corresponding to a new set of input examples. The regression estimation is to construct the function

  f(x) = ω · φ(x) + b,   (1)

where φ(·) is a nonlinear function by which x is mapped into the feature space and b denotes the bias. Linear regression is performed in the high-dimensional feature space by minimizing the regularized risk with the ε-insensitive loss function,

  R(ω) = (1/2)‖ω‖² + C Σ_{i=1}^{n} L_ε(y_i, f(x_i)),   (2)

where (1/2)‖ω‖² is the regularization term, C is a positive constant, and the ε-insensitive loss function is

  L_ε(y, f(x)) = 0 if |y − f(x)| ≤ ε;  |y − f(x)| − ε otherwise.   (3)

To estimate ω and b, Eq. (2) is converted to the primal problem given by Eq. (4):

  minimize  (1/2)‖ω‖² + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
  subject to  y_i − ω · φ(x_i) − b ≤ ε + ξ_i,
              ω · φ(x_i) + b − y_i ≤ ε + ξ_i*,
              ξ_i, ξ_i* ≥ 0,   (4)

where ξ_i and ξ_i* are slack variables that specify the upper and the lower training errors subject to an error tolerance ε. In this optimization problem, most data examples are expected to lie inside the ε-tube, as shown in Fig. 1. If a data example (x_i, y_i) is outside the tube, an error ξ_i or ξ_i* exists. From Eq. (4), two characteristics of the SVR can be seen: (i) the training error is minimized by minimizing ξ_i and ξ_i*, and (ii) ‖ω‖²/2 is minimized to increase the flatness of f(x). Equation (4) is a standard optimization problem that can be solved by applying Lagrange theory. By introducing the Lagrange multipliers α_i, α_i*, Eq. (4) can be converted to the following dual form:

  maximize  −(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − ε Σ_{i=1}^{n} (α_i + α_i*) + Σ_{i=1}^{n} y_i (α_i − α_i*)
  subject to  Σ_{i=1}^{n} (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C.   (5)

By solving the optimization problem above, the Lagrange multipliers α_i, α_i* are calculated, and the optimal weight vector of the regression hyperplane is

  ω = Σ_{i=1}^{n} (α_i − α_i*) φ(x_i).   (6)

Thus, the regression function can be written as

  f(x) = Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x) + b.   (7)

Herein, K(x_i, x) is called the kernel function; its value is the inner product of the images φ(x_i) and φ(x) of the two vectors in the feature space. Any function that satisfies Mercer's condition (Vapnik, 1998) can be used as the kernel function.
By using the different kernel functions listed in Table 1, the SVR algorithm can construct a variety of learning machines.
In Eq. (5), only some of the (α_i − α_i*) take non-zero values; the corresponding data points are called the support vectors. That is, these data points lie on or outside the ε-tube (Fig. 1).
The parameters of the SVR are the cost constant C, the radius of the insensitive tube ε, and the kernel parameters. These parameters are mutually dependent. The parameter C controls the smoothness of the approximation function. A greater C value indicates that the objective is mainly to minimize the empirical risk, which makes the learning machine more complex; conversely, a smaller C value makes the learning machine yield a poorer approximation. The parameter ε also affects the smoothness and complexity of the approximation function. In addition, ε determines the number of support vectors, because it controls the accuracy of the approximation function: smaller values of ε may lead to more support vectors and result in a more complex learning machine, and vice versa. It is therefore critical to set appropriate SVR parameters.
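This trade-off can be illustrated with a short sketch (ours, not from the paper), using scikit-learn's SVR as a stand-in implementation: the same noisy data are fitted with a narrow and a wide ε-tube, and the resulting support vectors are counted.

```python
# Illustrative sketch (not the paper's code): how epsilon affects
# the number of support vectors in an RBF-kernel SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.05 * rng.standard_normal(40)

# A narrow tube leaves most points outside it, so most points
# become support vectors; a wide tube swallows most points and
# yields a sparser (simpler) machine.
narrow = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)
wide = SVR(kernel="rbf", C=10.0, epsilon=0.3).fit(X, y)

print(len(narrow.support_), len(wide.support_))
```

The narrow-tube model retains at least as many support vectors as the wide-tube one, matching the statement above that smaller ε leads to more support vectors.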

SVR parameters selection
Many algorithms have been reported in the literature for SVM parameter selection, such as the genetic algorithm (GA), particle swarm optimization (PSO) and the Shuffled Complex Evolution algorithm (SCE-UA) (Duan, 1992). However, single performance criteria, no matter how carefully chosen, are often insufficient to measure all the characteristics of the training error (Vrugt et al., 2003). In this study, an effective and efficient algorithm for multiobjective optimization, entitled the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm and developed by Vrugt et al. (2003), was employed for SVM parameter selection. MOSCEM is an improvement over the Shuffled Complex Evolution (SCE-UA) algorithm, using the Pareto dominance concept to evolve the population toward Pareto-optimal solutions and merging in the strengths of the Metropolis algorithm.
In this section, the implementation framework of the MOSCEM algorithm (Vrugt et al., 2003) is briefly introduced for optimizing the following multiobjective function:

  minimize F(θ) = {F_1(θ), F_2(θ), ..., F_M(θ)},   (8)

in which F_i(θ) is a single objective function and θ is the parameter set to be optimized. The algorithm proceeds in the following steps:

(1) Generate sample: generate s samples {θ_1, θ_2, ..., θ_s} randomly from the feasible parameter space and compute the multiobjective vector F(θ_i) at each point θ_i.
(2) Ranking points: compute the fitness f_i of each individual of the sample using the Pareto concept, sort the s points in order of decreasing fitness value, and store them in an array D[1:s, 1:n+M+1], where n is the number of parameters and the remaining M+1 columns store the multiobjective vector and the fitness value.
(3) Partition into complexes: partition the s points of D into p complexes, and initialize in each complex k a parallel sequence S^k.

(4) Sequence evolution: a new candidate point in each sequence k is generated using a multivariate normal distribution centered on the current draw of sequence k and augmented with the covariance structure induced by the points in complex k. The Metropolis acceptance rule is used to decide whether the offspring
should be added to the current sequence or not. If it is accepted, the worst member of the current complex k is replaced with it; otherwise, the worst member is replaced with the last member of S^k. Finally, when a predefined number of iterations is reached, new complexes are formed by a shuffling process.
(5) Shuffle complexes: unpack all complexes C back into D, recalculate the fitness of all points, and sort the points in order of increasing fitness value.
(6) Check the convergence criteria. Once the convergence criteria are satisfied, stop; otherwise, go to step (3).
The MOSCEM algorithm guarantees convergence toward the Pareto set of solutions, which reflects the model structure uncertainty, so probability forecasts can be made rather than the point forecasts of traditional methods. In the MOSCEM algorithm, three algorithmic parameters must be defined by the user: the maximum number of iterations t, the population size s, and the number of complexes p.
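The Pareto-dominance ranking used in step (2) can be sketched as follows. This is a simplified dominance count, not the exact fitness assignment of MOSCEM, and all function names are ours; a point dominates another if it is no worse in every objective and strictly better in at least one, assuming all objectives are minimized.

```python
# Simplified Pareto-dominance ranking sketch (illustrative only).
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(points):
    """Rank 1 = non-dominated front; each extra dominator adds 1."""
    return [1 + sum(dominates(q, p) for q in points) for p in points]

# Four two-objective points: only (3, 3) is dominated (by (2, 2)).
objs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
print(pareto_rank(objs))  # → [1, 1, 2, 1]
```

Sorting by such a rank (best first) is what places the non-dominated points at the top of the array D before the complexes are formed.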

Study area and data used
The ice breakup prediction at the Bayangaole gauging station was performed in this study; the station is located in the Inner Mongolia section of the Yellow River, as shown in Fig. 2. The breakup date is calculated from the reference date (1 May); for example, a breakup date of 46 days means that the river ice broke up 46 days before 1 May, i.e. on 15 March. By correlation analysis, three factors are selected as forecast factors: the accumulated positive air temperature from the date that the temperature rises above zero to the break-up date, the average water level, and the average streamflow during the break-up period. The data used for the ice breakup forecast were collected by Ji (2002). The first 29 samples are used for training and the remaining 5 samples for validation.
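As a sanity check on this convention (ours, not from the paper), the breakup value can be converted back to a calendar date by plain subtraction from the reference date; note that plain subtraction gives 16 March for a value of 46, one day off the 15 March quoted above, which suggests the paper uses an inclusive day-counting convention.

```python
# Convert "days before 1 May" to a calendar date by plain
# subtraction (may differ by one day from the paper's convention).
from datetime import date, timedelta

def breakup_date(days_before_may1, year=2001):
    """Calendar date that lies the given number of days before 1 May."""
    return date(year, 5, 1) - timedelta(days=days_before_may1)

print(breakup_date(46))  # 2001-03-16 by plain subtraction
```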

Prediction model
The implementation of the SVM for forecasting ice breakup can be generalized in the following form:

  Y = f(X, θ),   (9)

where Y is the N×1 vector of model predictions, X is the N×3 matrix of input variables, and θ is a vector of n unknown SVM parameters. In this case, Y is the annual ice breakup date, X includes the following input variables: accumulated daily positive air temperature, average water level and average streamflow, and θ refers to the two SVM parameters (C and ε) and the kernel function parameter.
The most widely used kernel function is the Gaussian radial basis function (RBF), which has one parameter σ:

  K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)).   (10)

Once the suitable SVM parameters are selected, training can be conducted for forecasting. The SVM software developed by Steve Gunn (2001) was used in this study.
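The RBF kernel above can be transcribed directly as follows (a hypothetical helper of ours; note that some SVM libraries parameterize the same kernel via γ = 1/(2σ²) instead of σ):

```python
# Gaussian RBF kernel with a single width parameter sigma,
# matching K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
import numpy as np

def rbf_kernel(x_i, x_j, sigma):
    d2 = np.sum((np.asarray(x_i, float) - np.asarray(x_j, float)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# The kernel of a vector with itself is always 1 and decays toward
# 0 as the inputs move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.2199))  # 1.0
```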

Data preprocessing
Before training the SVM, the data (both the input and output data) should be preprocessed, for two reasons. Firstly, preprocessing ensures that all variables receive equal attention during the training process. Secondly, preprocessing is important for the efficiency of training algorithms (Wang et al., 2006). In general, there are two types of preprocessing methods. The first is to rescale the data to a small interval, for example, [−1, 1], [−0.9, 0.9], [0.1, 0.9] or [0, 1]. The second is to standardize the data by subtracting the mean value and then dividing by the standard deviation, that is, to rescale the data to a Gaussian distribution with a mean of 0 and a standard deviation of 1. The advantage of using the rescaled interval [0.1, 0.9] is that extreme events occurring outside the range of the calibration data may be accommodated (Dawson and Wilby, 1999). In this study, all data are rescaled to the interval [0.1, 0.9] using the equation

  x_i′ = 0.1 + 0.8 (x_i − x_min) / (x_max − x_min),   (11)

where x_i′ is the scaled value, x_i is the original value, and x_min and x_max are the minimum and maximum values of x, respectively.
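A minimal sketch of this [0.1, 0.9] rescaling and its inverse (function names are ours):

```python
# Linear rescaling onto [0.1, 0.9] and its inverse, as described above.
import numpy as np

def rescale(x, x_min, x_max, lo=0.1, hi=0.9):
    """Map x linearly from [x_min, x_max] onto [lo, hi]."""
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)

def unscale(s, x_min, x_max, lo=0.1, hi=0.9):
    """Invert rescale() to recover the original units."""
    return x_min + (s - lo) * (x_max - x_min) / (hi - lo)

x = np.array([10.0, 20.0, 30.0, 40.0])
s = rescale(x, x.min(), x.max())
print(s)  # values span 0.1 ... 0.9
```

The inverse mapping is needed after prediction, since the SVM output is produced on the rescaled interval and must be converted back to breakup days.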

Performance measures
Many indices for evaluating model forecasts have been reported in the literature. The following numerical performance statistics are used here to evaluate the forecasting results.
1. Mean absolute error (MAE):

  MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|   (12)

4. Mean relative error (MRE):

  MRE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| / y_i   (15)

In Eqs. (12)-(15), y_i is the observed value, ŷ_i is the predicted value, ȳ and the mean predicted value are the corresponding mean values, and n is the total number of predicted values.
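The two recoverable measures above can be computed as follows (a sketch with our own helper names; the paper's remaining indices, Eqs. 13-14, are not reproduced here):

```python
# Mean absolute error and mean relative error, per Eqs. (12) and (15).
import numpy as np

def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean(np.abs(obs - pred))

def mre(obs, pred):
    """Mean relative error, relative to each observed value."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean(np.abs(obs - pred) / np.abs(obs))

obs = [40.0, 50.0, 60.0]   # observed breakup days
pred = [42.0, 49.0, 57.0]  # forecast breakup days
print(mae(obs, pred), mre(obs, pred))  # mae → 2.0, mre → ~0.04
```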

Prediction steps
(3) SVM training using the MOSCEM algorithm. Two objective functions were considered, in which n is the number of testing samples.
The MOSCEM algorithm software developed by Vrugt et al. (2003) was used in this study. For this case, the values of the three parameters were selected by running the MOSCEM algorithm; the optimal values are C = 75.35, ε = 0.0375 and σ = 0.2199.
(4) Prediction using the SVM with the optimal parameters. To prevent overtraining and improve the generalization ability, the Bayesian regularization method was used for training the ANN. No unified theory exists for determining the number of hidden layer neurons; a trial-and-error procedure is generally used to determine the optimal number. In this study, four hidden layer neurons were selected, yielding the best performance.

Results and discussions
To facilitate comparison, the same training and testing samples are used for the SVM and the ANN. Figure 3 depicts the training and testing performance of the two models, and scatter plots of observed versus forecast ice breakup day are shown in Fig. 4. Forecast results are listed in Table 2, and Table 3 compares the performance indices obtained from the forecasting results of the SVM and ANN. The results show that the SVM outperforms the ANN considerably in both the training and testing periods, which also suggests that the SVM is more suitable than the ANN for prediction from small samples. Figure 4 shows that the training and testing errors of the SVM are smaller than those of the ANN.
The major drawback of the ANN, compared with the SVM, is that it does not lead to a global or unique solution; this is mainly due to differences in the initial weights of the ANN. Furthermore, the SVM training process always seeks a globally optimal solution and avoids overfitting, which eventually leads to better generalization performance than neural network models. Despite these superior features, the SVM also has some limitations.
(1) In order to construct an efficient SVM model, the hyper-parameters must be selected properly; otherwise, the model may over-fit or under-fit, and different parameter sets may make a great difference in performance. In this study, the MOSCEM algorithm is used for hyper-parameter selection, and the forecast results show the effectiveness of the algorithm. However, experience shows that in order to find the optimal hyper-parameter set, the number of random samples should be set to
a large number; when the number of training samples is very large, this becomes very time consuming. Therefore, the effectiveness and efficiency of the MOSCEM algorithm depend on its own parameters, and in order to obtain ideal performance these parameters should be determined by trial and error.
(2) Many functions can be used as the kernel function for the SVM, as Table 1 shows. In this study, only the radial basis kernel function is used, and no comparison is made of the performance of different kernel functions for the SVM; this should be future research work.
(3) In this study, due to data limitations, there are only 34 samples for training and validation. As more data become available, the model should be further investigated.

Conclusions
In this study, the SVM is employed for ice breakup prediction at the Bayangaole gauging station. To build a reliable forecasting model, the MOSCEM algorithm is implemented for selecting the optimal SVM parameters in a multiobjective framework; this multi-criteria method can be used to determine the optimal model structure. The SVM outperforms the ANN considerably in both the training and testing periods, which suggests that the SVM is more suitable than the ANN for prediction from small samples. Furthermore, the SVM training process always seeks a globally optimal solution and avoids overfitting, which eventually leads to better generalization performance than neural network models, so the SVM model appears to be more suitable for ice breakup forecasting. Due to the limitation of the data set, only 34 samples were used in this work; as more data become available, the proposed SVM-based model should be further investigated in ongoing studies.
Fig. 1. The ε-insensitive loss setting for SVR: (1/2)‖ω‖² is the regularization term (the Euclidean norm of ω), L_ε(·) is the ε-insensitive loss function that measures the empirical risk, and C is a positive constant that determines the trade-off between the model complexity and the degree to which errors larger than ε are tolerated. The points lying on or outside the ε-bound of the decision function are support vectors (black points). On the right, the ε-insensitive loss function is shown, in which the slope is determined by C.
The Inner Mongolia reach lies in the far north of the Yellow River basin, at an altitude of more than 1000 m, and has a long, cold winter. The reach usually has an ice period of about 4-5 months every year. Due to variations in the timing and strength of cold wave intrusions and the influence of wind, the annual ice flood date varies greatly. Ice breakup most often occurs in the second and last ten days of March, and seldom in the first ten days of April. Ice floods occur frequently every spring and bring huge losses; with the development of the social economy, the losses will become even greater.

Comparing model

Artificial neural network (ANN) was also used for comparison in this study. The key problems in constructing the ANN are determining the number of hidden layer neurons and the training algorithm. A three-layer ANN was constructed, with the same input and output variables as the SVM. A tan-sigmoid transfer function and a linear transfer function were used for the hidden layer and the output layer, respectively. The training epoch was set to 500.

Fig. 2. The location of the Bayangaole gauging station.

Fig. 3. Model performances for breakup prediction during the training and testing periods.

Fig. 4. Scatter plots of forecast versus observed ice breakup date for the training and testing data sets: (a) SVM, (b) ANN.

Table 1.
Kernel functions for SVR.

Table 2 .
Forecasting results of the SVM and ANN.

Table 3 .
Comparison of the forecasting results from each model.