Downscaling of surface moisture flux and precipitation in the Ebro Valley ( Spain ) using analogues and analogues followed by random forests and multiple linear regression

In this paper, reanalysis fields from the ECMWF have been statistically downscaled to predict from large-scale atmospheric fields, surface moisture flux and daily precipitation at two observatories (Zaragoza and Tortosa, Ebro Valley, Spain) during the 1961–2001 period. Three types of downscaling models have been built: (i) analogues, (ii) analogues followed by random forests and (iii) analogues followed by multiple linear regression. The inputs consist of data (predictor fields) taken from the ERA-40 reanalysis. The predicted fields are precipitation and surface moisture flux as measured at the two observatories. With the aim to reduce the dimensionality of the problem, the ERA-40 fields have been decomposed using empirical orthogonal functions. Available daily data has been divided into two parts: a training period used to find a group of about 300 analogues to build the downscaling model (1961–1996) and a test period (1997– 2001), where models’ performance has been assessed using independent data. In the case of surface moisture flux, the models based on analogues followed by random forests do not clearly outperform those built on analogues plus multiple linear regression, while simple averages calculated from the nearest analogues found in the training period, yielded only slightly worse results. In the case of precipitation, the three types of model performed equally. These results suggest that most of the models’ downscaling capabilities can be attributed to the analogues-calculation stage. Correspondence to: G. Ibarra-Berastegi (gabriel.ibarra@ehu.es)


Introduction
Global Climate Models (GCM) and Numerical Weather Forecast (NWF) models solve discretized versions of the primitive equations that govern the evolution of the climate system and, particularly, its atmospheric component.Due to technical limitations of current supercomputers, GCM are currently run using resolutions ranging from about 4 to 1.5 degrees for climate simulations and most recent NWF operate with a spatial resolution as high as T1279 L91 (at ECMWF), with a 16 km resolution globally (0.1 degree).
Statistical downscaling is one of the approaches that have been developed and used to regionalize the outputs from GCM and NWF i.e. to construct region-or site-specific scenarios.The rationale for this is based on the fact that the GCM and NWF are not able to simulate surface variables with enough accuracy on regional and local scales.Statistical downscaling consists in a search in the observed data for a statistical relationship or transfer function between the surface climate variable to be downscaled (predictand) and potential predictors, which are frequently the large-scale upper air variables.Statistical downscaling is very often used in the development of regional climate change, scenarios but it has also been used for NWF, particularly for short time ranges with (Fernandez-Ferrero et al., 2009, 2010;Hamill and Whitaker, 2006;Voisin et al., 2010 and references therein).
The analogue downscaling approach tries to identify from historical records, synoptic conditions that are similar to the current atmospheric state as described by GCM and NWF G. Ibarra-Berastegi et al.: Downscaling of surface moisture flux and precipitation in the Ebro Valley (Spain) models.The idea is that historical records at a given location obtained under past similar atmospheric conditions, can now be expected to be a good estimate for current observations at the same place.The Euclidean distance is usually employed (Matulla et al., 2008) to compare current and past synoptic conditions and then identify those which are close (similar) to the current state of the atmosphere.In order to reduce the dimensionality of the phase space and to help in the finding of good analogues, empirical orthogonal functions (EOF) are usually calculated from raw synoptic data.It is recommended to retain a number of EOF representing 80 %-90 % of the overall variability (Matulla et al., 2008).
The analogue technique has often been used as a downscaling method (Zorita and von Storch, 1999;Fernández and Sáenz, 2003;Timbal and Jones, 2008) and it performs similarly when compared with Canonical Correlation Analysis (CCA) while outperforming nonlinear approaches like Classification and Regression Trees (CART) and neural networks (NN) (Zorita and von Storch, 1999).
For downscaling purposes, several regression techniques -linear or not -have been also widely used.In this line, for surface temperature, a classical CCA seems to perform better than other linear methods (Huth, 2002).Among nonlinear methods is worth mentioning the use of different types of NN which are reported to outperform linear models (Weichert and Bürger, 1998;Miksovsky and Raidl, 2005;Davy et al., 2010).Other works suggest that for statistical downscaling of temperature, linear models cannot be beaten by NN (Huth et al., 2008).
A combination of analogues with other techniques like CCA, NN or a Bayesian model for precipitation downscaling but at 6-hourly level and in forecasting mode has also yielded good results (Fernández-Ferrero et al., 2009, 2010).In general, several studies exist which as a first step, have used an initial search of analogues followed by some calibration step (bayesian or not) (Benestad, 2008(Benestad, , 2010;;Hamill and Whitaker, 2006;Voisin et al., 2010).The use of bayesian techniques for calibration purposes of ensemble forecasts is the focus of other studies focused on multivariate gaussian data (Stephenson et al., 2005).It is well known that analogue-based statistical downscaling models cannot produce non-observed record-breaking output values and some calibration techniques have been proposed to overcome this problem (Benestad, 2010).
In the last decade, a great number of machine learning algorithms have been developed (Witten and Frank, 2005).Among them, Random Forests (RF) has become increasingly popular for several reasons, being the most important one its ability to model nonlinear relationships.
When compared with other techniques also intended to deal with nonlinear regression, in the case of similar downscaling problems the literature shows that RF and a type of NN, the multilayer perceptron, perform similarly (Eccel et al., 2007).
RF is based on the CART technique, in which a tree is built to relate a set of inputs to a group of output variables.The relationship between inputs and outputs may be highly nonlinear if taken as a whole, but if analyzed at different ranges, the nature of the relationships inside each range may be modelled in a simpler way.For this reason, in a CART the feature space is recursively splitted into a set of regions in which a simple model like a constant (Hastie et al., 2001) or a simple linear regression can be fitted.As the splitting process progresses starting from the root, the tree is divided at subsequent nodes into branches and sub-branches until reaching the leafs (a more homogeneous region where a simple regression between inputs and outputs is more evident).The final stage implies going backwards to the tree obtained and prune it using cost-complexity pruning algorithms.(Hastie et al., 2001).
Random forests are built using CART but adding two layers of randomness (Liaw and Wiener, 2002): 1.If a number of bootstrap samples from original data are drawn, a regression tree can be obtained with each sample, thus obtaining a "forest".
2. In standard trees, each node is split using the best split among all variables.In a random forest, for each tree, each node is split using the best among a subset of m predictors randomly chosen at that node.
Each tree is grown to the largest extent possible.There is no pruning.The most relevant parameters in running the RF algorithm are (i) m the number of predictors made available to each node and (ii) the number of trees grown Being p the number of candidate predictors, values of m like p/3 or p 0.5 are usually adopted.Nevertheless, results do not strongly rely on the choice of m (Breiman, 2001).The number of trees necessary for good performance grows with the number of predictors.The best way to determine how many trees are necessary is to compare predictions made by a forest to predictions made by a subset of a forest.When the error obtained with the subset is the same as the error obtained with the full forest, this means that the forest has got enough trees (Breiman, 2001).
RF can be used for classification and regression purposes.In the case of regression, the output of the RF model will simply be the average of the outputs obtained in each tree.RF are reported to clearly outperform CART (Siroky, 2009), being one of RF's most important advantages when compared with other techniques, that overfitting never takes place (Siroky, 2009).The relative importance of each predictor on the predictand is estimated at each tree by calculating the increase in the mean square error due to permuting a given predictor.If that regressor has no predictive value for the response, it should not make a difference if its values are randomly permuted before the predictions are generated (Grömping, 2009).Then, differences due to these permutations in all the Hydrol.Earth Syst.Sci., 15, 1895Sci., 15, -1907Sci., 15, , 2011 www.hydrol-earth-syst-sci.net/15/1895/2011/ trees are averaged and normalized by the standard error.This is known as mean decrease in accuracy of the predictor.
In the case of the Iberian Peninsula (IP), precipitation shows an important spatial variability with typical yearly values ranging from 300 to 3000 mm depending on the location.The variability of precipitation is associated with several atmospheric patterns, depending on the location or the season of the year: the most important patterns influencing Iberian precipitation are East Atlantic (EA), North Atlantic Oscillation (NAO); Southern Oscillation Index (SOI) and Scandinavia (SCAND) (Rodriguez-Puebla et al., 1998;Fernández et al., 2003).
Although in precipitation downscaling there is little consensus as to the choice of atmospheric predictor variables (Wilby and Wigley, 2000), the most common input variables have been 850 and 500-hPa heights.In other works, additional variables like 500-hPa temperature, 700 hPa levels, 500-hPa relative humidity, 500-hPa wind components, 1000-500-hPa thickness and 850-hPa temperature fields to mention a few, have also been used (Cavazos, 2000;Wilby and Wigley, 2000;Schoof and Pryor, 2001).In more recent works, the predictors for the downscaling procedure include mean sea level pressure, geostrophic wind field, specific humidity and moisture flux (Wei et al., 2010) or total precipitable water (Timbal and Jones, 2008).A thorough analysis of different combinations of predictors applied to short range quantitative precipitation forecast was presented by Fernandez-Ferrero et al. (2009).
In the case of precipitation, using highly nonlinear techniques like radial basis function (RBF) NN instead of more simple linear models, only represents a marginal improvement (Weichert and Bürger, 1998).Other works also suggest that in the case of precipitation, a linear model of downscaling can perform equally or even better than highly nonlinear NN (Trigo et al., 2001).A comparison between NN and linear downscaling for daily temperature and precipitation showed that NN performed better, although the results with precipitation have been significantly poorer (Schoof and Pryor, 2001).As reported in the literature, performance tends to be better in winter (Timbal and Jones, 2008;Wei et al., 2010) and mid latitudes (Cavazos and Hewitson, 2005).Typical correlation coefficients between predictions and observations may reach values around 0.7 for downscaling of monthly averages and about 0.5 for daily precipitation.
The technique of analogues have been used for precipitation downscaling in the IP with similar results and are reported to outperform NN (Zorita and Von Storch, 1999).Analogues followed by a bias-correcting heuristic formula has also been used for precipitation downscaling (Timbal and Jones, 2008).Generally speaking, the combination of classification schemes like Cluster Analysis (Schoof and Pryor, 2001), or self-organizing maps (Cavazos and Hewitson, 2005) followed by regression techniques (linear or not) seem to yield the best results for precipitation downscaling.A comprehensive review of methods and predictors used can be found in the literature (Wilby and Wigley, 2000).
At a hemispheric scale, changes in the vertically integrated moisture transport are connected to the different phases of the NAO (Hurrell, 1995).Regarding the moisture transport on the Iberian Peninsule (IP), the NAO and EA patterns are known to play a key role (Fernandez et al., 2003).The two most important moisture source regions affecting the IP are in a tropical-subtropical North Atlantic corridor that extends from the Gulf of Mexico to the IP, and the IP itself and the surrounding Mediterranean (Gimeno et al., 2010).The importance of these two source areas varies throughout the year, and also with respect to different climatic regions inside the IP.
Some works have also focused on the downscaling of different moisture variables.A set of downscaling models built on multiple linear regression showed that in the case of humidity variables, the most efficient predictors are the lowmedium tropospheric air humidity variables (up to 500 hPa) and adding circulation and/or temperature variables to the predictors could only bring a marginal or even no improvement over the downscaling models based on the humidity variables only (Huth, 2005).The results show that in the case of water vapour pressure and dew point temperature, results are better than those obtained for relative humidity and dew point depression (Huth, 2005).Like in the case of precipitation, a comparison of water vapour downscaling methods using linear regression and RBF NN, showed that a nonlinear model like RBF outperfoms linear models, although not dramatically (Wilby and Wigley, 2000).
In this study, a combination of analogues, RF and linear regression techniques are applied, analyzed and compared for downscaling purposes in the Ebro Valley (Spain) (Fig. 1).The main objective of the study is to compare the performance of different techniques for the downscaling of moisture-related variables.All of the techniques share a common core, the search of analogues in the space spanned by the EOF of the predictor fields, but they differ in the way the output from the analogue phase is postprocessed.
The target variables are the moisture flux and precipitation.Changes in the variables involved in the water cycle in the area of interest (Ebro Valley) like moisture flux and precipitation, are likely to have an impact on future changes regarding the overall water cycle in the Mediterranean Sea.Gaining a better knowledge on downscaling techniques applied to the variables involved in the water cycle, may contribute to a more accurate regionalization of future projections as described by large-scale climatic models (Mariotti, 2010).
Precipitation is one of the most commonly used variables for climate change studies.The reason is that the evolution in the amount of precipitation over a country is crucial for a whole set of natural and economic processes.
In the case of the surface moisture flux (in this paper, denoted as the vector q 10 with zonal and meridional components q x and q y ), it is a variable closely related with the hydrological cycle over any area.Most of the humidity transport takes place through low-level atmospheric layers and is affected by several surface characteristics such as topography, land surface and others that vary in the smallest scales.In general, a complete study of the moisture cycle over the region would require the analysis of vertically integrated moisture transports and water column, which are the quantities related through conservation laws with moisture sinks and sources such as precipitation or evaporation (Berbery and Rasmusson, 1999).
Unfortunately, historical data from rawinsonde observations are quite scarce, and the study cannot be currently easily extended to vertically integrated moisture transports in order to close the atmospheric hydrological cycle through the whole troposphere.Thus, for this study the moisture fluxes at the surface are considered as an initial problem before further studies allow an analysis of tropospherically integrated transports.Additionally, moisture flux at the low levels of the troposphere is a variable obtained from the product of two original variables forecast by a NWF or a GCM model.It is interesting to check the ability of the downscaling model to forecast it, since there are other variables, such as the wind speed that involve multiplication of original variables and are of special interest, for instance in wind energy.
The previous experiences gained by the research group in the combination of analogues and linear/nonlinear techniques (Fernández and Sáenz, 2003;Fernández-Ferrero et al., 2009, 2010) has inspired the methodology used in this paper.In this work, analogues, linear regression and random forests have been used to build statistical downscaling models for surface moisture flux and precipitation at two observatories (Zaragoza and Tortosa, Ebro Valley, Spain) corresponding to the period 1961-2001.The models have been built at a daily time scale.
Section 2 presents the data used in the study and the methodology applied to perform and evaluate the forecasts.Results are presented in Sect.3. Finally, conclusions and prospects for future work are presented in Sect. 4.

Data
For q 10 and precipitation downscaling purposes, the largescale observed predictors have been derived from the European Centre Medium-Range Weather Forecast (ECMWF), corresponding to the ERA-40 (Uppala et al., 2005) and ERA-Interim reanalysis, downloaded from ECMWF's MARS server with a 1.125 • × 1.125 • resolution.ERA-40 and ERA-Interim reanalysis are projected onto the same grid.The studied area is a rectangle of 35 gridpoints with latitudes in the range (39.375 • N, 43.875 • N) and longitudes between −3.375 • E and 3.375 • E, that is, the North-Eastern part of the Iberian Peninsule, and more specifically, the Ebro valley (Fig. 1).In order to check the sensitivity of the results to the size of the domain, two additional domains with 90 and 9 gridpoints have been used, and additional information on them can be found in the discussion paper, whilst results from these domains will only briefly presented here.Considering the spatial resolution of the original dataset, the daily frequency (that is the large sample used) and the high number of vertical levels used, larger domains could not be used due to memory constraints in the available computers.However, previous results from the authors suggest (Fernandez and Saenz, 2003;Figs. 13 and 14;and Table 1) that very big domains do not imply a real advantage when using analogues in the EOF space.The reason is that the EOF select the directions of the phase space with the maximum variability, which, for very large domains, can happen in areas very remote to the placement of the observatories where the forecast is done.
The original predictors as defined in ERA-40 and ERA-Interim reanalysis grid were as follows: geopotential (Z), temperature (T ), zonal wind speed (U ), meridional wind speed (V ) and relative humidity (RH) defined at the following 5 levels: 300, 500, 700, 850 and 1000 hPa, Additional variables considered were mean sea level pressure (MSL), surface pressure (SP), zonal wind speed at 10 m (U 10), meridional wind speed at 10 m (V 10), temperature at 2 m (T 2) and dew point temperature at 2 m (D2).The selection of predictors is consistent with previous results of the authors regarding the optimal set of predictors (Fernandez-Ferrero et al., 2009).Finally, for all these variables at the 35 gridpoints, data were available at 00:00, 06:00, 12:00 and 18:00 GMT.That made a total of 4200 daily predictor variables in the area considered.Due to their different nature and ranges, all predictors were rescaled to have a mean of zero and a standard deviation of one (Imbert and Benestad, 2005) With the aim to reduce the dimensionality of ERA-40 and ERA-Interim data, Empirical Orthogonal Functions (EOF) were calculated, selecting for each variable a varying number of EOF under the condition that the retained fraction of variance was at least 80 %.With this criterium, the number of EOF for each variable ranged from 5 to 28, thus notoriously reducing the number of predictors from 4200 to a final global amount of 79.These 79 leading EOF were calculated for ERA-40 (Table 1) and ERA-Interim (Table 2).The four values corresponding to the same days (00Z, 06Z, 12Z and 18Z) are considered as four samples at different times during the computation of the EOF in the same way that data with time delays is used during the computation of extended EOF (Weare and Nasstrom, 1982).Surface observations at two locations (Zaragoza and Tortosa) were obtained from the European Climate Assessment (ECA) dataset http://eca.knmi.nl(Klein Tank et al., 2002).The variables obtained from ECA repository were temperature, relative humidity, mean sea level pressure, precipitation and wind speed and direction from which zonal and meridional components (U s and V s ) were computed.Mean sea level pressures at Tortosa (48 m altitude.)and Zaragoza (247 m altitude) were corrected to surface pressure (SP) by assuming that the vertical temperature profile corresponds to the adiabatic lapse rate.Combining SP with observed temperature and relative humidity, specific humidity (q) values (kg water vapour/kg air) at surface level were calculated at both locations using the Clausius-Clapeyron equation.Zonal and meridional moisture fluxes (q x and q y expressed as kilograms of water vapour per square meter per second) were calculated as follows: (1) where ρ represents the density of air as a function of temperature.
In Zaragoza, the meridional component of q 10 (q y ) accounts for 29.3 % of the overall variance associated to the q 10 vector, while in Tortosa, the meridional component represents as much as 67.3 %.The most important values of the variables measured in Zaragoza and Tortosa can be seen in Table 3.
In a similar way as described above, (Clausius-Clapeyron followed by Eqs. ( 1) and (2) fed with U 10 and V 10 instead of U s and V s ), zonal and meridional components of q 10 were calculated on the ERA-40 and ERA-Interim gridpoints using T 2, D2, SP and ρ.For comparison purposes, zonal and 1.55 0.00 8.30 q x (kg vapour m −2 s −1 ) −0.0016 −0.034 0.03 q y (kg vapour m −2 s −1 ) 0.001 −0.037 0.045 meridional components of q 10 (q x and q y ) at the nearest gridpoints to Zaragoza (13.6 km away) and Tortosa (65.1 km away) were considered.The reason is that any downscaling effort on q 10 should yield better results that the values of q 10 as calculated from both reanalyses at the geographically closest gridpoints.Additionally, daily precipitation data were also retrieved from Global Precipitation Climatology Project (GPCP) http://jisao.washington.edu/data/gpcp/daily/which provides gridded precipitation data with a 1 • × 1 • resolution (Huffman et al., 2001, Adler et al., 2003).
The precipitation data from the GPCP dataset (Adler et al., 2003;Huffmann et al., 2001) is used because previous studies (Lucarini et al., 2007) have already found inadequacies in precipitation data from ECMWF reanalyses.Precipitation observations are not directly assimilated during the preparation of reanalyses and therefore, they are produced by the model used in the data assimilation system.Therefore, it is better to use precipitation data from other sources like GPCP.
In the GPCP grid, the closest gridpoints to Zaragoza and Tortosa are located respectively 59 and 36 km away from the observatories.Persistence and precipitation data as given by the GPCP dataset at these two gridpoints were used as additional reference values.Again, the idea behind the use of these data is that any downscaling effort on precipitation could only be justified if better results than local persistence and/or raw GPCP data at these two nearest gridpoints were obtained.That is, it can not be assumed that observations are error free, particularly for a magnitude such as precipitation.If two different observational estimations of precipitation (ECA and GPCP) differ in a given amount, it can not be expected that a particular downscaling model could be more accurate than any of the observational datasets with respect to the other is.

Building the models
The original database consisted of 14 975 daily cases spanning from 1961 through 2001 and was splitted into a training (years 1961-1996, 13 149 cases) and a test period (years 1997-2001, 1826 cases).In both the training and the test periods, daily ERA-40 and surface ECA values were available.Additionaly, for the test period ERA-Interim data were also available.
In this study, downscaling models have been built for two locations (Zaragoza and Tortosa) and three variables: zonal and meridional components of q 10 and precipitation.The steps followed in all cases have been: 1.For each of the 1826 days belonging to the test dataset, the 300 nearest cases among the 13 149 days corresponding to the training database are selected.The nearest cases are those with the smallest Euclidean distance to the current case as defined in the 79-dimensional hyperspace corresponding to the historical ERA-40 data (Table 1).The reason for choosing a number of 300 analogues was to allow for a reasonable number of cases (4-5) for each of the 79 candidate predictors, thus avoiding overfitting when applying linear regression at the following regression stage.
2. With these 300 analogues, two downscaling regression models are built using as candidate predictors the 79 ERA-40's principal components and, as predictand, the chosen variable (any of the the two components of q 10 or precipitation in Zaragoza or Tortosa).Two techniques are used to build the models: random forests (RF, with m = 9 predictors) and multiple linear regression (MLR, stepwise regression).
Being 79 the number of predictors, several values of m as suggested by the literature were tested with RF: m = 9 (79  10,20,30,40,50,60,70.For these ten candidate values of m, the correlation coefficients between observations and predictions were computed for all the variables and also, the most influential inputs were identified.The correlation coefficients obtained with the different candidate values of m were not different at a 95 % confidence level for all the variables studied.Also, the relevant inputs identified were the same -as could be expected from CART-based RF (Grömping, 2009) -, so for the sake of simplicity m = 9 was the final value selected for this study.
At a preliminary stage of this work, several values of the number or trees were considered and it could be seen that the test set error did not decrease beyond approximately 100 trees.As suggested by the literature (Breiman, 2002;Peters et al., 2007) and since RF produces trees very rapidly, 1000 was the number of trees finally selected for all predictands.The predicted values was the average of the outputs from the 1000 trees (Breiman, 2001).
As mentioned above, two regression models for each predictand are fitted on the 300 nearest cases.In all cases, the candidate predictors are the 79 EOF from ERA-40 reanalysis and the predictand, the surface variable (q x , q y or precipitation in Zaragoza or Tortosa as derived from ECA observations).One of the models is fitted using a MLR and the other using random forests.Due to the gaps in the ECA database, some historical records of the predictands corresponding to the most similar 300 days identified in the atmospheric circulation analogues (reanalysis), are not present.For this reason, regression is carried out using a set of cases in which predictors and predictand are present that typically ranges between 250 and 300.For each case belonging to the test dataset, two models (RF and MLR) are fitted in this way.
3. Using the 79 EOF corresponding to every day of the test dataset as inputs, the two models previously fitted (RF and MLR) on the 250-300 most similar historic records were used to calculate an estimated value of the chosen variable for the same day in Zaragoza or Tortosa.To test the sensitivity of both techniques (RF and MLR) to the use of ERA-40 or ERA-Interim analyses, once the two models have been fitted on ERA-40 data, both models are run with two different sets of inputs: (i) the 79 EOF from ERA-40 (models denoted as RF ERA-40 and MLR ERA-40) and (ii) the 79 EOF obtained with ERA-Interim (RF ERA-Interim and MLR ERA-Interim).Additionally, a plain average obtained from ECA values corresponding to the 250-300 most similar daily cases identified, is used to build an dditional analogue-type downscaling model (denoted as "Analogues" model).
4. Finally, as mentioned above, the most evident estimations of q 10 and precipitation were also considered.In the case of zonal and meridional components of q 10 , (q x , q y ,) the values directly calculated using ERA-40 and ERA-Interim reanalyses raw data at the geographically nearest points to Zaragoza and Tortosa.In the case of precipitation, the idea is the same but two other references were used.The first one was the GPCP satellite and rain gauge merged precipitation data set.The second one was just to consider the persistence of levels from the previous day.
It is worth mentioning that the sensitivity of this methodology to (i) the domain size and (ii) the use of "mixed EOF" (Benestad et al., 2002) instead of independent EOF, was intensivily tested.The results show that for this area of the Iberian Peninsule, changing the area covered by the domain by an order of magnitude (in km 2 ) does not have an important impact on results.This is due to the fact that the most influential variables used for downscaling exhibit a small spatial variability.As a consequence, using independent EOF or mixed EOF obtained from variables that do not have an important spatial variability, does not have an important effect on results either (for in-depth details, please see the discussion forum of HESS corresponding to this paper).

Evaluation of models
All the models have been evaluated and intercompared using the 1826 observations corresponding to the test dataset.Being the objective of this paper to assess the overall performance of the models mentioned above, a set of statistical indicators has been chosen to compare observations and models' predictions: i. Correlation coefficient (R).
ii. Ratio of standard deviations (RSD) between observations and model's standard deviations.A good model should show values near one.
With the aim to summarize in a visual manner models' performance according to this group of three statistical indicators (R, RSD and RMSE), Taylor diagrams (Taylor, 2001) have been plotted.Additionally, three more statistical indicators have also been considered:
the correlation coefficient, an additional indicator called index of agreement has been proposed (Wilmott, 1981(Wilmott, , 1982;;Wilmott et al., 1985) and adopted for this study.It is an indicator of the overall agreement between model and observations that ranges between 0 and 1 (perfect model).
In all cases, bootstrap resampling (1000 samples extracted from the test data set) has been used to calculate 95 % confidence intervals corresponding to these statistical indicators.Likewise, for model intercomparison purposes, 95 % confidence intervals have also been calculated to assess differences between two models' performances as described by the set of indicators metioned above.
The relative importance of input variables is to be assessed in two different ways, depending on the nature of the model.In the case of RF models, input relevance is related to the increase of the mean square error MSE (%) due to the random permutation of the input variable's values.For linear regression models, the most important variables will be the ones that after being incorporated into the equation, are responsible for the highest increases in the overall determination coefficient R 2 (Grömping, 2009).The most influential predictors with one technique or another (MLR or RF) are roughly the same, as could be expected from the literature (Grömpig, 2009).

Results
The main results can be seen in Tables 4-5 and Figs.2-7.
In both locations (Zaragoza and Tortosa) models' best performance for precipitation (R ∼ 0.5) are notoriously poorer than for any of the two components of q 10 (R ∼[0.8,0.85]).
The reason for this is that the nature of precipitation is much more intermitent and dependent of very local factors than in the case of q 10 .It is worth mentioning that models fitted with ERA-40, when they are fed with ERA-Interim data, experiment a deterioration in performance which is more evident in analogues followed by MLR.In the case of analogues+RF, the degradation is much smaller being a reasonable option for the future fitting RF models on ERA-40 data (covers the period from mid-1957 to mid-2002) and use them with the more recently available ERA-Interim (from 1989 onwards).For zonal and meridional components of q 10 at both locations, the comparison of the downscaling using raw ERA-40 or ERA-Interim data at the nearest gridpoints, indicates that ERA-Interim values tend to be more accurate than those yielded by ERA-40.The lack of continuity between both reanalyses suggests that the procedures and algorithms used in ERA-Interim provide a better description of at least q 10 .Even though they perform worse than RF ERA-40 and MLR ERA40 models, both reanalyses' direct outputs yield reasonable estimates of q 10 , particularly for the meridional component.The overall performance of raw ERA-Interim predictions, particularly, in Hydrol.Earth Syst.Sci., 15,[1895][1896][1897][1898][1899][1900][1901][1902][1903][1904][1905][1906][1907]2011 www.hydrol-earth-syst-sci.net/15/1895/2011/ the meridional component of q 10 at Zaragoza and Tortosa, compares well with that obtained by analogues.Zaragoza (13.6 km) is closer than Tortosa (65.1 km) to its corresponding nearest gridpoint and this can explain why results at Zaragoza are somewhat better than those at Tortosa.
Analyzing in detail the downscaling of q 10 in Zaragoza, it can be seen that RF ERA-40 gives the best results with values of R of 0.86 (zonal) and 0.85 (meridional).However, the comparison with MLR-ERA40 (Table 4)  MLR ERA-40 (Fig. 2).In the case of the meridional component of q 10 (29.3 % of variance), RMSE, RM and D are not different at a 95 % confidence level, while RSD and RM are better for the linear MLR ERA40 model.Only the correlation coefficient is higher for RF ERA-40 (Fig. 3).
In the case of the zonal component of q 10 in Tortosa (Fig. 4) (Table 5), RF ERA-40 and MLR ERA-40 at a 95 % confidence level do not have different values of FA2, RM, and D. R and RMSE are better for RF ERA-40 while MLR ERA-40 has got a better RSD indicator.For the meridional component of q 10 , (Fig. 5) (Table 5), RF ERA40 slightly outperforms MLR ERA40 in all the statistical indicators, except for RSD.In all the cases mentioned above, it can be seen that analogues only have a little worse performance when compared with RF ERA-40 and MLR ERA-40.Therefore, it can be concluded that obtaining analogues represents the most efficient step in the model building process and accounts for the most important part of the goodnes of fit.Using RF or MLR at a next stage adds a much smaller improvement.The employment at this second stage of RF, a recently developed machine learning algorithm intended to deal with highly non linear mechanisms like the ones known to be involved in atmospheric processes, does not represent a net improvement when compared with the classical MLR model.
The most influential variables in the downscaling of q 10 at both locations, have been identified, as mentioned above, by computing the increase (%) of MSE due to the permutation of their data when acting as inputs for the RF.In the 1826 RF models fitted, the most frequently selected as influential
inputs have been the first EOF of V 10, the first EOF of T2 and the first and second EOF of D2 The interpretation is as follows.As can be seen in q 10 roses at both locations (not shown) the direction of q 10 at both locations is mainly zonal.The meridional component of surface q 10 affects moisture fluxes due to the fact that meridionally flowing air must cross mountain ranges over central Iberian Peninsula, thus suffering a strong Föhn effect.Similarly, warmer air leads to a higher moisture flux as explained by the Clausius-Clapeyron equation.The variability in the dew-point temperature enclosed in the two leading EOF reflect the main temporal (1st EOF) and spatial changes (2nd EOF) of the humidity in the air.
Regarding precipitation at both locations, for RF ERA-40, MLR ERA-40 and analogues, the statistical indicators R, RMSE and FA2 do not differ at a 95 % confidence level.In both locations, the model of analogues can represent better the observed mean, MLR ERA-40 exhibits the best RSD and RF ERA-40 has the best D value.Therefore, it can be concluded that, if globally considered, RF ERA-40 and MLR ERA-40 models do not represent any significant improvement over the plain use of analogues (Tables 4-5).At both locations, these three models outperform GPCP predictions and also persistence.The results are similar to those found in the literature (see Introduction) and all models tend to underpredict extreme values (please, see discussion forum on HESS website).The most important input variables for precipitation downscaling have been the first EOF of relative humidity RH, the first EOF of meridional wind speed V Hydrol.Earth Syst.Sci., 15,[1895][1896][1897][1898][1899][1900][1901][1902][1903][1904][1905][1906][1907]2011 www.hydrol-earth-syst-sci.net/15/1895/2011/
and the second EOF of dew-point temperature D2.Similarly, precipitation depends on the availability of moisture and the saturation of air (leading EOF of relative humidity), the direction (northern/southern) from which the air flows (1st EOF of V ) and local effects accounted for by the second EOF of D2.

Conclusions and future outlook
The results of the statistical downscaling carried out for q 10 and precipitation indicate that analogues represent a powerful and at the same time, easy-to-use tool.In most cases, their results outperform raw predictions derived from reanalysis data (ERA-40, ERA-Interim and GPCP) at the nearest gridpoints or persistence of daily precipitation.Incorporating a second regression stage represents a clear though not overwhelming improvement.However, using at this second stage either a classical linear model or a more sophisticated tool like RF, does not make a net difference.As mentioned in the Introduction, similar effects with other nonlinear techniques like NN have been described in the literature.The explanation for this may lay in the fact that the combination of a great number of highly nonlinear mechanisms involved in the downscaling might result in a linearization of the overall effect.A further explanation already pointed out in the literature (Zorita and Von Storch, 1999) might be that most of the overall nonlinearity involved, has been already captured and described at the stage of the selection of analogues.As a conclusion, the combination of two techniques which are quite easy to use and implement like analogues and multi- ple linear regression, can be the best compromise between accuracy and simplicity in similar downscaling problems.However, an important shortcome of MLR is the lack of continuty between ERA-40 and ERA-Interim which introduces a heavy degradation in model's performance, thus not making possible the use of an important set of historical records and learning periods, currently available only as ERA-40.Instead, RF models fitted on ERA-40 data and fed with ERA-Interim, experiment a much smaller deterioration.The reason might be that MLR models heavily rely on the coefficients of an equation and the differences between ERA-40 and ERA-Interim input data tend to be amplified by the values of the coefficientes.However, the nature of the regression with RF is different and is based on CART where homogeneous areas in the final leafs are sought as the trees split at the different stages.In this sense, it can be expected that similar homogeneous areas can be described either using ERA-40 or ERA-Interim EOF (Tables 1-2) as inputs, since leading EOF from both reanalysis are likely to be describing the same physical effects on the studied area.Anyway, before considering it as definitive, this result should be confirmed for other variables, since it might happen that it cannot be generalized to other cases with, for instance, larger biases, as could happen when using GCM output in a perfect-prog approach Further research is currently being carried out by this group in two directions: (i) to find a relationship between ERA-Interim reanalysis and moisture flux at several locations on the Ebro Valley but at different upper levels (ii) to apply the methodology followed so far to other variables involved in the water cycle.All the calculations have been carried out using the freely available software R (R core development team, 2009) and apart from the core module, four specific packages have been used: (i) "FactoMineR" for calculation of EOF (Husson et al., 2008), (ii) "MASS" for multiple linear regression (Venables and Ripley, 2002), (iii) "randomForest" for RF (Liaw and Wiener, 2002) and (iv) "plotrix" for Taylor diagrams (Lemon, 2006).
iv. Fraction of two (FA2).FA2 indicates the fraction of cases in which the ratio between observations and model's predictions falls in the range [0.5-2].v.Ratio of means (RM) between observations and model's averages.vi.Index of agreement D. In order to overcome some of the widely reported problems associated to the plain use of www.hydrol-earth-syst-sci.net/15/1895/2011/ Hydrol.Earth Syst.Sci., 15, 1895-1907
1906 G. Ibarra-Berastegi et al.: Downscaling of surface moisture flux and precipitation in the Ebro Valley (Spain)

Table 2 .
The 79 leading EOF for ERA-Interim.