Earth System Sciences Seasonal prediction of winter extreme precipitation over Canada

Abstract. For forecasting the maximum 5-day accumulated precipitation over the winter season at lead times of 3, 6, 9 and 12 months over Canada from 1950 to 2007, two nonlinear and two linear regression models were used, where the models were support vector regression (SVR) (nonlinear and linear versions), nonlinear Bayesian neural network (BNN) and multiple linear regression (MLR). The 118 stations were grouped into six geographic regions by K-means clustering. For each region, the leading principal components of the winter maximum 5-d accumulated precipitation anomalies were the predictands. Potential predictors included quasi-global sea surface temperature anomalies and 500 hPa geopotential height anomalies over the Northern Hemisphere, as well as six climate indices (the Nino-3.4 region sea surface temperature, the North Atlantic Oscillation, the Pacific-North American teleconnection, the Pacific Decadal Oscillation, the Scandinavia pattern, and the East Atlantic pattern). The results showed that in general the two robust SVR models tended to have better forecast skills than the two non-robust models (MLR and BNN), and the nonlinear SVR model tended to forecast slightly better than the linear SVR model. Among the six regions, the Prairies region displayed the highest forecast skills, and the Arctic region the second highest. The strongest nonlinearity was manifested over the Prairies and the weakest nonlinearity over the Arctic.


Introduction
Extreme precipitation events, responsible for economic loss and ecological damage, impact agriculture, energy use and human activity.There has been enhanced interest in recent years on the apparent increase in the frequency and/or severity of extreme precipitation events for many regions, which might be related to the increasing concentrations of greenhouse gases (Easterling et al., 2000;Groisman et al., 2005).Though the long-term trend of extreme precipitation events was generally not significant in most areas of Canada (Zhang et al., 2001;Kunkel, 2003), the establishment of an accurate and timely extreme event monitoring and prediction system is still of prime importance for alleviating the potential impacts posed by climate variations and extreme weather.
Numerous previous studies have shown that the El Niño-Southern Oscillation (ENSO), centered in the tropical Pacific, plays an important role in North American climate variability, especially during the winter season (Barnston, 1994;Shabbar and Barnston, 1996;Goddard et al., 2001;Wu et al., 2005;Shabbar, 2006).Besides ENSO, other circulation patterns, such as the North Atlantic Oscillation (NAO), Pacific-North American (PNA) teleconnection, Pacific Decadal Oscillation (PDO) etc. have been found to show influences on precipitation over the Northern Hemisphere (Hsieh et al., 2006;Wu et al., 2006a, Bonsai et al., 2006;Lorenzo et al., 2008;Lin et al., 2008), and may contribute skill in seasonal precipitation forecasts.Most seasonal forecasts focus on predicting the seasonal mean of the precipitation instead of seasonal statistics of extreme precipitation events.Such seasonal extreme statistics are noisier and more non-Gaussian Published by Copernicus Publications on behalf of the European Geosciences Union.
Z. Zeng et al.: Seasonal prediction of winter extreme precipitation compared to the seasonal mean (where the averaging of data reduces noise and renders the distribution more Gaussian due to the central limit theorem), hence seasonal extreme statistics may be even harder to predict.
One commonly used technique for seasonal predictions is the empirical or statistical approach, using linear statistical methods such as correlation, regression (Ward and Folland, 1991), and canonical correlation analysis (Shabbar and Barnston, 1996).More recently, machine learning methods such as neural networks (Haupt et al., 2009;Hsieh, 2009) have been introduced for nonlinear regression and nonlinear canonical correlation analysis (Wu et al., 2006a, b;Cannon and Hsieh, 2008).Neural networks have been applied to downscale seasonal mean precipitation from global climate models (Tolika et al., 2007) and to infer the influence of climate indices on seasonal mean precipitation (Pasini and Langone, 2010), while a wavelet-neuro-fuzzy method has been developed for daily precipitation forecasts (Partal and Kisi, 2007).The advantage of nonlinear methods to linear methods is generally far less evident for climate applications than for weather applications, since averaging nonlinear daily relations produces near-linear seasonal relations as a consequence of the central limit theorem (Yuval and Hsieh, 2002;Hsieh and Cannon, 2008).A seasonal extreme statistic like the maximum amount of precipitation accumulated over 5 consecutive days in the winter season does not involve extensive averaging as in the computation of the seasonal mean, thereby avoiding the linearization effect of the central limit theorem.Hence despite their potentially higher noise-tosignal level than the seasonal mean, seasonal extreme statistics may be more suited than the seasonal mean for nonlinear forecasting by machine learning methods.
Neural network (NN) methods, generally regarded as forming the first wave of breakthrough in machine learning, became popular in the late 1980s for nonlinear regression problems, whereas kernel methods (e.g.support vector regression, SVR) arrived in a second wave in the second half of the 1990s (Bishop, 2006;Hsieh, 2009).SVR has two advantages over NN models -it avoids the multiple minima problem associated with nonlinear optimization used in NN models, and robust error norms are used in SVR instead of the non-robust mean squared error (MSE) norm, allowing SVR to better handle datasets with outliers.The use of a suitable nonlinear kernel function in SVR allows it to be fully nonlinear, while the use of a linear kernel function restricts SVR to a linear model.Nevertheless, the linear SVR model is different from the multiple linear regression (MLR) model, since the robust error norm is used in SVR but not in MLR.Applications of SVR to hydrological problems include Dibike et al. (2001), Khan andCoulibaly (2006), Bürger et al. (2007) and Anandhi et al. (2008).
In this paper, we have a four-way comparison of forecast skills from nonlinear SVR, linear SVR, Bayesian NN (BNN) and MLR.The objective is to see how robust and non-robust structures as well as nonlinear and linear capa-bility in the models affect forecast skills when the predictand is the very noisy and non-Gaussian winter extreme precipitation anomaly.The description of the data and the forecasting methods are given in Sects. 2 and 3, respectively.Section 4 presents the results of forecasting the winter extreme precipitation over Canada, followed by the conclusion in Sect. 5.

Data
Monthly extended reconstructed sea surface temperature (SST) data (ERSST version 3 -Smith et al., 2008) were obtained from the National Oceanic and Atmospheric Administration (NOAA) with a spatial resolution of 2 • ×2 • for the period 1950-2007; while monthly 500 hPa geopotential height (Z500) data with 2.5 • ×2.5 • horizontal resolution from the National Centers for Environmental Prediction (NCEP) reanalysis were used in this study for the same period (Kalnay et al., 1996).We only used SST data within the zonal band between 30 • S and 70 • N, and Z500 data over the North Hemisphere (20 • N-90 • N), quite similar to Shabbar and Barnston (1996).To reduce memory need, the SST data were averaged into 6 • ×4 • grids with 1020 spatial points, and the Z500 data into 5 • ×5 • grids with 1008 spatial points.Seasonal SST and Z500 anomalies were obtained by removing the climatological seasonal cycle from the monthly mean data and filtering them using a 3-month running mean.After standardizing the anomalies, time-lagged copies of the data were stacked (i.e. the original copy, plus copies timelagged by 3, 6 and 9 months were assembled together) and treated as a new enlarged dataset to be compacted by principal component analysis (PCA).This PCA process, called space-time PCA, singular spectrum analysis or extended empirical orthogonal function (EEOF) analysis, was performed on the SST and Z500 (standardized) anomalies separately, each having 5 leading principal components (PCs) retained, explaining 52% and 37% of the variance in the SST and Z500 anomalies, respectively.These will be referred to as the SST-PCs and Z500PCs below.
Monthly climate indices for the Niño-3.4region SST (NINO), the North Atlantic Oscillation (NAO), the Pacific-North American (PNA) teleconnection, the Scandinavia (SCA) pattern, and the East Atlantic (EA) pattern -were downloaded from the website of Climate Prediction Center (CPC), NOAA.The description of listed indices can also be found from the CPC site (http://www.cpc.ncep.noaa.gov/data/teledoc/telecontents.shtml).Monthly values of the Pacific Decadal Oscillation (PDO) were obtained from the Joint Institute for the Study of the Atmosphere and Ocean, University of Washington (http://jisao.washington.edu/pdo/PDO.latest).
Daily 5-d total precipitation records were obtained from 461 climate stations in Canada for the 1900-2007 period.Only stations with data covering at least the period of 1950-2007 were considered as candidates for the analysis.This period was selected to maximize the number of stations while attempting to maintain the longest possible records.
In addition, stations with more than 5% missing data over 1950-2007 were not used.Under these conditions, only 118 stations qualified for further study.For each station, its monthly maximum was first calculated from the daily 5-d accumulated precipitation data, which suggested the heaviest precipitation event during that month.The climatological seasonal cycle, i.e. the average of the maximum 5-d total precipitation for each calendar month over the years 1950-2007, was removed from the monthly maximum 5-d total precipitation to give the monthly extreme precipitation anomalies.Then the 3-month maximum of the anomalies was taken to be the seasonal extreme precipitation anomaly.
Only winter (December to February) data from 1950/1951 to 2006/2007 (57 winters) were analyzed here.The reason that the maximum 5-d total precipitation instead of the daily extreme is used here because this study focuses on the extreme events related to low-frequency signals of largescale variations in the atmosphere-ocean system.In addition, larger-scale impacts, such as floods from heavy precipitation are mostly due to multi-day episodes.Maximum 5-d rainfall has also been chosen as one of the standard seasonal extreme precipitation indices by the European Union STARDEX project (STAtistical and Regional dynamical Downscaling of Extremes for European regions) (http://www.cru.uea.ac.uk/projects/stardex/).In view of the diversity of the Canadian climate, we classified the 118 stations into groups using K-means clustering (Zhang et al., 2001;Whitfield et al., 2002).The 118×118 elements of the intercorrelation matrix among station precipitation, which assumes the internal spatial coherence of precipitation variability does not change with time, and the 118×6 elements of the correlation matrix between station precipitation and the six climate indices, which reflects the relationship between seasonal extreme precipitation and large-scale atmospheric teleconnection and SST in- dices, were taken as inputs to the K-means algorithm.The Euclidean distance was used in cluster analysis to measure dissimilarity between stations.The number of clusters was varied from 2 to 8, and 6 was chosen because of its spatial consistency and clear physical/geographical interpretation.Some clustering methods (e.g.Rao and Srinivas, 2006) allow the number of clusters to be chosen objectively.
Figure 1 presents the spatial distribution of the Canadian stations, with their membership in the six clusters shown by different symbols.The cluster analysis has divided the Canadian domain into six geographic regions.The Pacific coastal region (R1), under the influence of warm ocean currents and moisture-laden winds, receives the most rain and snow during winter.In the Cordilleran region (R2), the warm, moist Pacific air is forced to rise over the mountains, cools and falls on the western slopes in sizeable amounts of precipitation as rain at lower altitudes and snow at higher ones; however, the eastern slopes and central plateau region are arid.The Prairies (R3) receive considerably less precipitation than most other parts of Canada, often being dry for long periods.For the Arctic region (R4), it is extremely cold with very low precipitation.The Great Lakes region (R5) receives rather uniform precipitation through the year with heavy snowfalls in winter.On the Atlantic coast (R6), extremely cold air masses are modified by oceanic influences, which also cause considerable snow and precipitation in winter.The number of stations for each cluster/region and the corresponding mean precipitation, i.e. the 3-month means of the 5-d total precipitation over all winters and over all stations in each region, are shown in Table 1, where the mean precipitation was 78.5 mm over the Pacific coast and 63.8 mm over the Atlantic coast, much larger than the 8.8 mm over the Arctic region in winter.
For each region, we applied PCA to the seasonal extreme precipitation anomalies, and preserved the leading PCs.Table 1 summarized the explained variance by the first few PCs retained for each region in column 4. For example, the 7 leading PCs for the Pacific coastal region (R1) account for 85% of total variance of the precipitation anomalies.Each PC was then chosen as the predictand for a forecast model.The seasonal extreme precipitation anomaly forecasts were reconstructed by summing the forecasted PCs multiplied by their corresponding empirical orthogonal function (EOF) spatial patterns.

Support vector regression
Support vector machines were originally designed for classification problems (Vapnik, 1995).They were then extended to nonlinear regression problems (Vapnik et al., 1997;Bishop, 2006).Here we describe the essence of support vector regression (SVR).
Let x denote the m inputs or predictors and y denote the single output variable or predictand.By introducing a nonlinear mapping function φ, the nonlinear regression problem between x and y can be converted to a linear regression problem between φ and y, i.e.
where , denotes the inner product, and w and b are the regression coefficients obtained by minimizing the error between f and the observed values of y.To measure this error, instead of the commonly used mean squared error norm, SVR uses the -insensitive error norm defined by i.e. when the difference between f and y is smaller than , the error is ignored, whereas when the difference between f and y is large, the error approximates the mean absolute error, which unlike the mean squared error, is robust to outliers in the data.The w and b coefficients are estimated by minimizing the regularized error function R using sample data (x i ,y i ), where with C and prescribed parameters (commonly referred to as hyperparameters), and N the sample size.The second term is called the regularization (or weight penalty) term, and when a small value of C is used, the regularization term becomes prominent relative to the first term, and the minimization of R forces the w coefficients to have small magnitude, thereby limiting model complexity.
The conversion of a nonlinear regression problem to a linear regression problem (Eq. 1) eliminates the need for nonlinear optimization, which has to deal with the presence of multiple local minima in the error function, as in the case of NN methods.However, φ(x) may be a very high (or even infinite) dimensional vector, hence solving the linear regression problem may be prohibitively expensive.In SVR, a kernel trick is used, which is to replace the inner product φ(x),φ(x ) in the solution algorithm by a kernel function K(x,x ), which does not involve handling the unwieldy φ(x).The minimization of Eq. ( 3) involves Lagrange multipliers, and the final regression estimate can be expressed as in the form (Bishop, 2006) Performance of the SVR model depends on the choice of the kernel function and the hyperparameters.In this study, we used the linear kernel, K(x,x i )= x,x i , and the Gaussian or radial basis function (RBF) kernel, K(x,x i )=exp − x−x i 2 /(2σ 2 ) , with the hyperparameter σ controlling the width of the Gaussian function.When the linear kernel is used, the SVR performs robust linear regression, whereas with the RBF kernel, the SVR model performs robust nonlinear regression.We used the SVR codes by Chang and Lin (2001), downloadable from the LibSVM website (http://www.csie.ntu.edu.tw/∼ cjlin/libsvm).The hyperparameters, C, , and σ (for the RBF kernel) can be tuned instead of predefined subjectively.

Bayesian neural network (BNN)
As NN models are now commonly used in hydrology (Solomatine and Ostfeld, 2008), we will only briefly outline the approach used in our study.An NN model is trained from a data set (x,y), with x the predictors and y the predictand, by adjusting network parameters or weights w so as to minimize a regularized error function where the first term is the parameter C times the mean squared error, while the second term is the regularization term.A small C will strongly suppress the magnitude of w found by the optimization process, thereby yielding a less complex (i.e. less nonlinear) model.The best value for C is commonly chosen upon validating the model performance over independent data not used in training the model.With the optimal C, the model should be neither overfitting nor underfitting the data.An alternative to using validation to find the best value for C is BNN (MacKay, 1992), a neural network designed based on a Bayesian probabilistic formulation.The idea of BNN is to treat the network parameters or weights as random variables, obeying an assumed prior distribution.Once observed data are available, the prior distribution is updated to a posterior distribution using Bayes' theorem.BNN automatically determines the optimal value of C without the need of validation data (Bishop, 2006).In this study, the BNN model used was from the NETLAB toolbox (Nabney, 2002), with a standard mapping function f , i.e. a layer of hyperbolic tangent mapping followed by linear mapping.As NN suffers from multiple minima in E, an ensemble of 30 BNN models was built from random initial weights, and the mean of the forecasts from the 30 ensemble members was taken as the final forecast of the BNN model.

Cross-validation
For seasonal forecasting, the sample size to the number of predictors is relatively small, since we have 5 SSTPCs, 5 Z500PCs and 6 climate indices as predictors.Hence PCA is again applied to these predictor time series to further reduce the number of predictors.An additional advantage of PCA is to produce uncorrelated predictors.To determine p, the optimal number of PCs to retain as predictors, crossvalidation is needed.In an n-fold cross-validation procedure, the data record is divided into n segments, a segment is reserved as validation data, and the other segments as training data.The model is trained using the training data, then validated or tested on the independent data in the validation segment.By rotating the validation segments, the entire data record can be used for validation (Bishop, 2006).As mentioned earlier, for each region (as determined by the cluster analysis), PCA was applied to the seasonal extreme precipitation anomalies for all the stations in that region, yielding the predictand PCs.Cross-validation is also needed to determine nPC, the optimal number of predictand PCs to retain for each region (Table 1).
For the SVR model, we used the Cherkassky and Ma (2004) approach to estimate the value of the hyperparameters, and then use a finer grid search to pinpoint the optimal values of the hyperparameters under cross-validation.To use independent data to test or verify the model forecasts, a second round of cross-validation is needed, hence a double cross-validation procedure (see the Appendix for details).

Forecast skill scores
To evaluate model performance on forecasting the seasonal extreme precipitation, we reported the Pearson correlation coefficient (CORR), the Willmott index of agreement (IOA) between the observed and model-predicted values, the skill score based on the mean absolute error (MAE) of the forecast, and Skill V =SD p /SD o , the ratio of the standard deviation (SD) of the model predictions to that of the observations.All four skill scores are used because they indicate different components of model error.While CORR is a common measure of the linear dependence between the forecast and the observation, it does not take forecast bias into account, thus it is possible for a forecast with large errors to still have a good CORR score.IOA is defined as (Willmott, 1982): where N is the number of samples at the station, O i and P i are, respectively, the observed and predicted values for the ith sample, O is the average of the observed values, and 0≤IOA≤1, with 1 being perfect score.IOA has been proposed as an alternative to CORR, but it is sensitive to the difference between the mean of P i and O as well as the difference between the standard deviation of P i and that of O i .MAE measures the mean absolute error between the observed and predicted values, i.e.
MAE is considered a more natural and superior measure of average error than the commonly used root mean squared error (Willmott and Matsuura, 2005).To compare forecasting performance across different regions, instead of MAE, we used the MAE skill score (MAESS), defined by MAESS=1−MAE/MAE c , where is the MAE of the climatological forecasts.The MAESS is positive (negative) when the accuracy of the forecasts is greater (less) than the accuracy of the climatological forecasts.The Skill V score is used to measure how close the predicted standard deviation approaches the observed one, with the perfect score being 1.

Forecast results
The cross-validated forecast scores averaged over all stations in each region at lead times of 3, 6, 9 and 12 months using the MLR, SVR with linear kernel (SVR-L), nonlinear SVR with RBF kernel (SVR-R) and BNN models are shown in Figs.2-7 for the six regions.
For the Pacific coastal area (Fig. 2), CORR, IOA and MAESS showed that in general the SVR-R model tended to do slightly better than the SVR-L model, and both did better than the MLR and BNN models.Only the SVR-R model attained slightly positive MAESS for most lead times (Fig. 2c), while both linear models and BNN displayed negative MAESS, indicating that they underperformed climatological forecasts.The relatively poor performance of MAESS relative to CORR can be partly explained by the Skill V score shown in Fig. 2d, which shows all models dramatically under-predicting the magnitude of the anomalies.Ironically, BNN had the best Skill V scores as well as the worst MAESS among the four models.The reason is that BNN being a non-robust nonlinear model is easily overfitted to the very noisy data.Even with an ensemble average to alleviate overfitting, BNN generated relatively large amplitude forecasts compared to the other three models which are less

Conclusions
SVR models, with linear and RBF kernels, have been applied to predict the seasonal extreme precipitation anomalies in winter over Canada.In general, the robust SVR models clearly outperformed the non-robust MLR and BNN models in terms of forecast skills, thereby demonstrating the value of models with robust error norms for dealing with the very noisy and non-Gaussian winter extreme precipitation data.Meanwhile the performance of the nonlinear SVR model (SVR-R) tended to be slightly better than the linear SVR model (SVR-L), with the exception of the Arctic region, which seemed to lack a nonlinear signal.The strongest nonlinearity was found over the Prairies according to the difference in the forecast performance between the SVR-R and SVR-L models.This indicates that in the Prairies, gains in forecast skills came not only from using a robust error norm, but also from the nonlinear influence of climate fluctuations such as ENSO and other teleconnections on the extreme precipitation.
Arctic winter precipitation is different from that in the other five regions as all of Arctic winter precipitation is snow with relatively low water content that is easily moved by strong winds.There are many occurrences of strong winds in the Arctic winter where snow is advected from other areas under clear skies, which causes biases in catchments.This may explain why Arctic winter precipitation appears more linear than the precipitation in other regions.
Comparing the skill levels of the six regions, we found highest skill in the Prairies, presumably due to the strong nonlinear signal there, followed by the Arctic (despite the lack of a nonlinear signal), then the Pacific coastal region, followed by the Cordillera region, and finally by the low skill regions of the Atlantic coast and the Great Lakes, where presumably the lack of a strong ENSO signal there contributed to the low skills (Shabbar et al., 1997;Shabbar, 2006).Overall, the forecast skills attained by the various models were modest -even when the correlation skill was positive, the MAE skill was often below that from climatological forecasts.
A disadvantage of nonlinear methods such as SVR and BNN is that it is generally futile to determine the contribution of forecast skill from individual predictors when there are many predictors.The compression of predictors by PCA further made determining the contributions from individual predictors infeasible.

Appendix A Double cross-validation
The procedure of Sect.3.3 actually involves two rounds of cross-validation, an outer round (CV1) and an inner round (CV2) (Fig. A1).For CV1, the first 5 yr of data were reserved for forecast testing, and the remaining data were used as training data.Forecast testing was only done on the middle 3 yr of the 5-yr data segment to alleviate the leakage of low-frequency signals from the training data to the adjacent test data.We repeated the above process by moving the 5-yr window of test data forward by 3 yr each time until the whole record was used for forecast testing.
On the training data, a 7-fold cross-validation (CV2) was implemented to determine the optimal values for p (the optimal number of predictor PCs), nPC (the optimal number of predictand PCs) and the hyperparameters: First the Cherkassky and Ma (2004) estimates were used for the hyperparameters, and the optimal p was estimated in CV2.Then a finer grid search for the optimal hyperparameter values and for the optimal nPC was undertaken in CV2.The model trained with these optimal p, nPC and hyperparameters was then used to forecast the test/validation data under CV1.For BNN, the optimal number of hidden neurons to use in a neural network model was found from CV2.

Fig. 1 .
Fig. 1.Spatial distribution of the Canadian stations, with different symbols used to indicate the six geographic regions determined by a cluster analysis.The shading illustrates the Canadian topography.

Fig. 8 .
Fig. 8. Spatial distribution of the forecast correlation skills of the SVR-R model at individual stations over Canada at lead times of (a) 3, (b) 6, (c) 9 and (d) 12 months.

Fig. 9 .
Fig. 9. Difference between the forecast correlation skills of the nonlinear SVR model (SVR-R) and that of the linear SVR model (SVR-L) at lead times of (a) 3, (b) 6, (c) 9 and (d) 12 months.The two numbers beside each panel give the number of stations where the SVR-R correlation is higher (lower) than that of the SVR-L model, as indicated by the +(−) sign.

Fig. 10 .
Fig. 10.Difference between the forecast correlation skills of the SVR-R model and that of the MLR model at lead times of (a) 3, (b) 6, (c) 9 and (d) 12 months.The two numbers beside each panel give the number of stations where the SVR-R correlation is higher (lower) than that of the MLR model, as indicated by the +(−) sign.

Fig. A1 .
Fig. A1.Schematic diagram illustrating the double cross-validation procedure.In the outer round (CV1), the training data are shown in grey and the 3-year validation data shaded.The 1-year data segments (shown in white) bridging the training data and the validation data are not used, to avoid autocorrelation leaking information from the training data to the adjacent validation data.The validation data segment is moved repeatedly in 3-year increments from the start of the data record to the end in this cross-validation loop, so forecast performance is validated over the whole record.Meanwhile in the inner loop (CV2), the training data from CV1 are assembled and divided into 7 segments, with 6 used for training and one (shaded) for validation.Again the training and validation segments are rotated in the loop so all segments are eventually used for validation to determine the optimal model values/hyperparameters.The optimal model determined from CV2 is then used to forecast over the 3-year validation segment in CV1.

Table 1 .
Number of stations, mean winter precipitation, and percentage variance of the winter extreme precipitation anomalies explained by the first several PCs (with nPC being the number of PCs chosen as predictands based on cross-validation), for each of the six regions over Canada.