Simplifying a hydrological ensemble prediction system with a backward greedy selection of members – Part 1: Optimization criteria

Hydrological Ensemble Prediction Systems (HEPS), obtained by forcing rainfall-runoff models with Meteorological Ensemble Prediction Systems (MEPS), have been recognized as useful approaches to quantify uncertainties of hydrological forecasting systems. This task is complex both in terms of the coupling of information and computational time, which may create an operational barrier. The main objective of the current work is to assess the degree of simplification (reduction of the number of hydrological members) that can be achieved with a HEPS configured using 16 lumped hydrological models driven by the 50 weather ensemble forecasts from the European Centre for Medium-range Weather Forecasts (ECMWF). Here, Backward Greedy Selection (BGS) is proposed to assess the weight that each model must represent within a subset that offers similar or better performance than a reference set of 800 hydrological members. These hydrological models’ weights represent the participation of each hydrological model within a simplified HEPS which would issue realtime forecasts in a relatively short computational time. The methodology uses a variation of the k-fold cross-validation, allowing an optimal use of the information, and employs a multi-criterion framework that represents the combination of resolution, reliability, consistency, and diversity. Results show that the degree of reduction of members can be established in terms of maximum number of members required (complexity of the HEPS) or the maximization of the relationship between the different scores (performance). Correspondence to: D. Brochero (darwin.brochero.1@ulaval.ca)


Introduction
In hydrology, as in many applications, it is accepted that there is no superior model for every application under all circumstances (Duan et al., 2007;Alpaydin, 2010).Today, the availability of the Meteorological Ensemble Prediction Systems (MEPS) and its subsequent coupling with multiple hydrological models offers the possibility of building Hydrological Ensemble Prediction Systems (HEPS) relying on a large number of members.But the complexity of such HEPS becomes an operational burden when one has to evaluate several hundreds of scenarios at each time step.
To provide an idea of the complexity that can be achieved in HEPS, represented for example by the number of members to handle, it is worth mentioning the principal areas of uncertainty associated with the hydrological process (Schaake et al., 2007) as follows: -Uncertainty from the meteorological data: in this case, the MEPS are responsible for providing this information.Different centres around the world are currently working on this issue, for example the TIGGE initiative consists of ensemble forecast data from ten global centres, for a total of 259 members (TIGGE, Bougeault et al., 2010).In relation to this, Bao et al. (2011) have shown that a HEPS comprised of meteorological members derived from multiple meteorological centres may actually perform better as compared to an ensemble derived from a single meteorological model.
-Uncertainty from the rainfall-runoff model: each hydrological model combines two important elements regarding the uncertainty associated with the hydrological Published by Copernicus Publications on behalf of the European Geosciences Union.
process: the initialization uncertainty (i.e. the initial state of the model) and the model uncertainty (from parameter identification to model conceptualization).In this regard, the methodology proposed by Beven and Binley (1992) provides the evaluation of parameter uncertainty from the point of view of equifinality.For example Pappenberger et al. (2005) have shown the advantages in HEPS to flood inundation predictions coupling MEPS with both hydrological and hydraulic models that have been evaluated at the same time with the GLUE methodology.
Another way of conceptualizing the uncertainty of the model focuses on a multi-model approach, making good use of the resources invested in the development of dozens of hydrological models.For instance, Velázquez et al. (2011) have shown, based on the database of the present paper, that the ensemble predictions produced by a combination of several hydrological model structures and meteorological ensembles have higher skill and reliability than ensemble predictions given either by a single hydrological model fed by weather ensemble predictions or by several hydrological models driven by a deterministic meteorological forecast.Cloke and Pappenberger (2009) have already highlighted the computational demand of using MEPS for flood forecasting as one of the main points to overcome in the future, either by new technologies (stochastic chip technology) or by efficient use of computing clusters.Thus, the selection of hydrological members as part of a simplified model can be useful given the computational cost of running models and creating ensembles.Vrugt et al. (2008) have suggested the selection of hydrological models as an additional task that can be run based on the results of the post-processing using Bayesian Model Averaging (BMA) in a multi-criteria framework.
As a compromise, researchers have attempted to cluster MEPS for flood prediction in various ways: by lagging ensembles and deriving representative members through hierarchical clustering over the domain of interest, and thus to produce a reduced ensemble set at higher resolution (Marsigli et al., 2001); by analyzing the relation between atmospheric circulation patterns and extreme discharges (Ebert et al., 2007), or by establishing, in a deterministic way ("best match" approach), the location of the forecast that is the most similar to the rainfall pattern of the catchment (Xuan et al., 2009).
Here, we propose the selection of hydrological members directly in the HEPS with a technique called Backward Greedy Selection under the different scores presented in Sect. 2. In the case of MEPS with interchangeable members (the case presented here), the selection is oriented to evaluate the hydrological models participation inside a subset of a few members.
The HEPS under study is formed of 16 lumped hydrological models forced by the 50 meteorological inputs of the ECWMF EPS, leading to a grand-ensemble of 800 members.
This approach was tested in 10 catchments located in France for a period of seventeen months (from March 2005 to July 2006).Another important feature of the HEPS at hand is the short duration of the series.This has been highlighted by several authors as a negative point in the evaluation of system performance in the case of extreme events (Renner et al., 2009;Cloke and Pappenberger, 2009).This condition imposes the use of resampling and recombination techniques in the proposed methodology shown in Sect.3.
Other studies that focused on periods of analysis very similar to the one used in this paper have also proven the usefulness of the ECMWF EPS.For example Rousset et al. (2007) evaluated hundreds of French catchments from 4 September 2004 to 31 July 2005 showing that the information given by the ensemble forecast is useful for flood warning and water management agencies.Similarly, Thirel et al. (2008), in a comparative analysis of short-range meteorological forecasts from the ECMWF EPS and PEARP EPS of Météo-France under the scheme of SIM coupling, analysed the competence jurisdiction of each of the two EPS from 11 March 2005 to 30 September 2006, showing that the ECMWF EPS seemed best suited for low flows and large basins while the PEARP EPS was best suited for floods and small basins.
We do emphasize that the results shown in this first phase focused primarily on the analysis of the scores in the process of selecting hydrological members.Furthermore, we evaluated the notion of interchangeability of the MEPS and HEPS members, concluding that the participation of the hydrological models in the subset of selected members is sufficient to guide the members' selection, as shown below in Sect. 4. Finally, conclusions are drawn and a guideline for future work is given in Sect. 5.

Verification statistics for ensemble forecasts
Following the guidelines given by Cloke and Pappenberger (2009), we consider several metrics in the selection of hydrological members with BGS.We thus quote some of the features that are evaluated in probabilistic forecasting.The reader is referred to Murphy (1993) and Wilks (2005) for a detailed description of these features.
-Bias: correspondence between mean forecast and mean observation.
-Reliability: correspondence between conditional mean observation and conditioning forecast, averaged over all forecasts.
-Resolution: degree to which the forecasts sort the observed events into groups that are different from one another.It is related to reliability, in that both are concerned with the properties of the conditional distributions of the observations given the forecasts.
-Consistency: degree to which the ensembles apparently include the observations being predicted as equiprobable members.
Additionally, we propose the use of the diversity concept studied in machine learning, i.e. the members should be as correct as possible, and when they make errors, these errors should be complementary (Kuncheva, 2004).Thus, the scores used in this research have been chosen because they quantify different aspects of the ensemble prediction's quality.
In some cases, it is necessary to establish a priori a probabilistic distribution function that fits systematically the prevision ensembles for each time step.In the hydrological community, it is accepted that an adjustment of the gamma distribution makes more sense than a normal distribution given the asymmetry in the distribution of precipitation and discharge (Vrugt et al., 2008); however, the gamma function evaluation involves a distribution which is more complex than the normal distribution which has explicit mathematical expressions.Székely (2003) proposes Monte Carlo techniques for the adjustment of any distribution to the ensembles.
For this study, some simulations were performed to evaluate differences between normal and gamma distributions in the case of the Continuous Ranked Probability Score (CRPS) and the Ignorance Score (IGNS).The results showed minor variations in contrast with a high computational cost.It is nonetheless important to note that this similarity is evaluated inside the ensembles with previsions varying between 30 and 800 hydrological members, as detailed below; in small samples it is expected that the results represent the expected asymmetry of information.
Note that the CRPS can be evaluated directly from the cumulative distribution of observed frequencies (Hersbach, 2000).However, considering the computational cost in evaluating this score thousands of times, a normal distribution was assumed.
The mathematical notation of each element in the scores, explained below, is drawn from Appendix A.

Continuous ranked probability score (CRPS)
The CRPS simultaneously evaluates reliability, resolution, and uncertainty (Hersbach, 2000;Gneiting and Raftery, 2007).Smaller values indicate better performance.Its minimal value of zero is only achieved in the case of a perfect deterministic forecast.Note that the CRPS has the dimension of the observation o t .Its mean value is equivalent to the mean absolute error for a deterministic forecast (Hersbach, 2000).Assuming that the forecast ensembles (y t ) are normally distributed, the CRPS at the time t is defined by Eq. ( 1) (Gneiting and Raftery, 2007):

Ignorance score (IGNS)
Proposed by Good (1952) as the logarithmic score, the IGNS is given by Eq. ( 2): This score is described in detail by Roulston and Smith (2002).It is used to evaluate the sharpness or spread (Vrugt et al., 2006).It severely penalizes the bias, since positioning the observation in forecast regions of low probability lead to values that tend to infinity.It is defined simply as the logarithm of the ensemble probability density function (f (y t )) at the point corresponding to the observation (o t ).Smaller values indicate better performance.
The logarithmic score involves a harsh penalty for low probability events and therefore is highly sensitive to extreme cases (Gneiting and Raftery, 2007).To rule out the possibility that the results solely reflect the effect of a few outliers, we analysed trimmed means of the IGNS series excluding the highest and lowest 2% data values, following Weigend and Shi (2000).Infinite values were replaced by the next worst non-infinite value, following Boucher et al. (2010).

Reliability diagram -mean square error (RD MSE )
Given that m denotes the different M thresholds of probability to assess, the reliability of the system can be directly measured from the comparison of these M thresholds with the conditional probability of observation as a function of the forecast (o m ).Since observation of the event is dichotomous (r t = 1 if the event occurred and r t = 0 otherwise) such conditional probability or relative frequency observed ōm is given by Eq. (3): where N is the number of forecast-observation pairs used in verification.The goal is to have well-calibrated forecast systems where the relative frequency is essentially equal to the probability of the forecast, i.e. ōm ≈ I m (Wilks, 2005).The plot of the conditional probability versus the probability of the forecast (I m ) is called the reliability diagram.In this study, as discussed later in Sect.3.3, it is necessary to establish a single target value, so the Mean Square Error between the probability forecast and the observed frequency in the Reliability Diagram (RD MSE ), as suggested by Wilks (2005), is evaluated by Eq. ( 4): These distances are all small for well-calibrated forecasts.

Normalized deviation of the rank histogram from flatness (δ ratio)
The reliability, consistency and bias of the ensemble are evaluated in this score.That is, the rank histogram is used to evaluate whether the ensembles apparently include the observations being predicted as equiprobable members.The rank histogram is a graphical approach that was devised independently by Anderson (1996); Hamill and Colucci (1997) and Talagrand et al. (1997).The rank of the observations within each ensemble is evaluated and then plotted in the form of a histogram.In the case of equality of observation with one or more of the ensemble members, the rank is chosen randomly.For a reliable system, over all d + 1 members, the number of elements in each interval of the rank histogram (S c ) has an expected value N/(d + 1), while the deviation ( ) of the histogram from flatness is measured by Eq. ( 5) (Talagrand et al., 1997): A reliable system has an expectation of 0 = dN d+1 .The δ ratio (δ = / 0 ), proposed by Talagrand et al. (1997) is used as a measure of the reliability of an ensemble prediction system for a scalar variable.A value of δ that is considerably larger than 1 is a proof of unreliability.
Given the difficulty of assessing the probabilistic nature of the studied HEPS, the use of the rank histogram is totally dependent upon eventually relaxing the ensemble members distribution, such as has been proposed by some authors (see Sect. 2c in Anderson, 1996 andSect. 3a in Hamill andColucci, 1997).Velázquez et al. (2011) showed that the reliability of the studied HEPS improved in two ways: first with the combination of all perturbed members from ECMWF EPS and the 16 hydrological models studied, and second, by increasing the lead time.A common feature is that the higher the observed dispersion, the greater the HEPS reliability.

Median of coefficients of variation (MDCV)
The standard deviation is a classical measure of dispersion; however, it preserves the magnitude of the observed variable, complicating the joint interpretability of the results of the 10 basins in evaluation.So, the coefficient of variation (CV) as a dimensionless measure is useful in comparing different data sets with respect to central location and dispersion (Kottegoda and Rosso, 2009).
In this research, the analysis of the HEPS dispersion, through CV (results are omitted in this article), showed an increase proportional to the lead time, so the first lead time has a mean CV of 0.05 while longer lead times (e.g. 9 days), reached a mean value of 0.6.Note that CV is calculated for each time step.However, the mean CV is not a good measure of location in the skewed CV series evaluated for each basin.The MeDian of the Coefficients of Variation (MDCV), given by Eq. ( 6), turns out to be a much better measure: The hypothesis under the maximization of the MDCV is that a gain in dispersion should increase the reliability of the HEPS.

Combined criterion (CC)
Selecting only one criterion may give a partial view of the forecast performance and even be misleading.The combination of several metrics into one diagram has already been evaluated (Taylor, 2001), but is inappropriate for this study because a scalar objective value is required for the selection procedure.So, we propose the following guidelines to define the CC: -The combination should assign weights to each of the scores as a direct measure prioritizing some of the characteristics of the HEPS in evaluation.Additionally, these weights, in a general framework, offer the possibility of constructing a trade-off among different objectives.In our case, weights were used only to give priority to the reliability in the selection, because Velázquez et al. (2011) showed that this was the most influential aspect in the evaluation of the HEPS studied here.For this reason the weight assigned to the reliability corresponds to twice that of the other factors, which have a unit weight.
-Each score in the selected ensemble of hydrological members (se subscript) is normalized from the division by the corresponding score in the initial 800-member ensemble (ie subscript), placing each component on the same scale.
-All scores except the MDCV function are oriented for minimization.However, the IGNS has the peculiarity of having negative values, making it necessary to establish a threshold (z 1 ) in the normalization so as to manipulate the duality of having a positive (or negative) score in the selection and a negative (or positive) score in the 800-member set.Thus, we establish z 1 = −2, since the preliminary analysis of selection under different scenarios (different catchments and number of members to be selected) showed minimum values for this score of about −1.5.With regard to the MDCV function, a threshold of z 2 = 1 is used to change the orientation since the objective is to maximize dispersion, as testing different scenarios showed maximum values of about 0.8.

Elements to compare the performance of members' selection (NS, G NS , G SC )
Note that the CC could be used to compare the performance of the members' selection with respect to the 800-member set.So, in a general framework, if all features of the ensemble forecast have the same importance, one members' selection with equal performance to the 800-member set will lead to a CC equal to 5, values lower than 5 indicate a selection of higher performance than the base set of 800 members, and values greater than 5 indicate the detriment of any feature of the 800-member set.Hereafter this particular condition of unit weights in the CC will be called Normalized Sum (NS).This distinction is important to display the priority that can be defined a priori to any feature in the members' selection training with BGS.In this way, it is possible to define a gain index for the scores balance with respect to 5 (Eq.8): It is possible that the NS evaluated in the selected sets with BGS hides undesirable effects on the balance of the scores, for example to substantially improve one score with respect to the other score(s).To check this condition, a gain index for each score is also proposed: A positive index indicates superior performance of the selected set.The absolute value in the denominator is needed to assess the performance of IGNS, which can have positive and negative values.

Experimental set-up
Figure 1 shows the selection procedure applied to the 800member HEPS.The main elements of the methodology are described below.

Database: 800-member HEPS
Database details can be found in Velázquez et al. (2011).The study is conducted over 10 French catchments with a typical response time of 3 days.These catchments represent a large variety of hydro-climatic conditions (Fig. 2  The 50 perturbed forecasts from ECMWF was provided at a 0.5 • × 0.5 • lat/lon grid resolution.A detailed description of the ECWMF EPS model can be found in Molteni et al. (1996) or Buizza (2005).Forecasts are issued at 12:00 UTC and extend over 240 h.Rainfall amounts were accumulated at 24 h time steps, starting at 0 h to match with observed daily data, which resulted in nine daily lead times.No bias removal or disaggregation was performed.For each catchment, areal mean rainfall forecasts were computed by averaging the rainfall amounts of each grid above the catchment, weighted by the percentage of the catchment area inside the grid.
The sixteen hydrological models are lumped models and correspond to various conceptualizations of the rainfallrunoff transformation at the catchment scale.Some original model structures were modified.Thus, to avoid unfair comparisons of models, they will be referred to hereafter as HM## (Table 2).It is beyond the scope of this article to present these models.References with a detailed explanation of each model structure can be found in Velázquez et al. (2011).
On the other hand, analysis of the median coefficient of variation (MDCV), as a measure of the diversity of the HEPS, revealed the following characteristics: -The variability is low at least for the first three days of predictions (MDCV < 0.12), many models showing no variability (i.e. the same response for all members).As shown by Velázquez et al. (2011), part of this difficulty may be inherited from the meteorological ensembles, which are not reliable prior to about a 3-day lead time.
More importantly, it is believed that not including uncertainties associated with the hydrological initial conditions at the onset of the forecasts takes its toll on reliability, at least for the first few time steps of the hydrological predictions, i.e. until the mean characteristic response time scale of the studied catchments (3 days) is reached.
-As for the incremental variability, it depends on the forecast horizon.MDCV for 4 to 9-day predictions reached between 0.2 and 0.6, respectively.
Consequently, the results presented in this paper are strictly based on the 9-day forecast horizon.This decision is justified on the variability within the ensemble forecasts as well as on the fact that the selection of hydrological members as a method of simplifying HEPS should be unique regardless of the forecast horizon.The companion paper (Brochero et al., 2011) assesses the transferability of the 9-day members' selection to other forecast horizons.

Resampling technique
In some algorithms, such as the BGS, the overfitting1 is highlighted as a structural problem.So, one method for improving generalization which is called early stopping (Hudson and Demuth, 2011;Alpaydin, 2010), well-known in the neural network community, is used in the methodology proposed here.
In this technique, the available data is divided into three subsets.The first subset is the training set, which is used in BGS for sequentially removing the members.The second subset is the validation set.The error on the validation set is monitored during the training process.The validation error normally decreases during the initial phase of training, as does the training set error.However, when the selection begins to overfit the data, the error on the validation set typically begins to rise.When the validation error increases for a specified number of members, the training is stopped.The test set error is not used during training, but it is used to compare different models.
The need to define three subsets to run the BGS and the short length of the series impose the use of resampling techniques such as k-fold cross-validation, which maximizes the utilization of the available information.
Moreover, one notes the high degree of linear correlation exhibited in the first lags of the correlogram of the flow series at hand (e.g. in the 80 % of the catchments evaluated, the correlation using a lag of three days was greater than 0.82).So, the choice of the training and validation data should be directed in order to temporarily avoid near data to form the two subsets.For example, suppose that the linear correlation between o t and o t+1 is equal to 0.8 and that the selection of members has been trained in o t and validated in o t+1 .The validation could consequently be highly contaminated by the effect of the correlation between data.Correlation contamination is avoided by forming training and validation subsets from groups of 10 consecutive data (blocks) rather than from individual data.It is important to note that contrarily to standard hydrology applications, the order of the events is not important in the BGS process.
Here, the dataset is divided into 5 equal-sized parts in order to create 5 experiments.In each experiment, a part is kept out for testing, while the remaining four parts, a priori divided in blocks, are randomly combined to form training and validation subsets.The detailed process develops in two steps: -Step 1: Data and test set configuration.The test set is set-up from simple cut-offs to "guarantee" statistical independence with the training-validation process.To build the test set, the series is subdivided into five folds, each of which corresponds to the test set of each experiment.For example, if N denotes the length of the series, the test set of the first experiment corresponds to the first fold (i = 1 to N/5 ), similarly the test set of the fifth experiment will be the last fold (i = 4N/5 to N ).Thus, strong linear correlation between trainingvalidation and the test dataset is limited only to the values situated near the cut-off line.
-Step 2: Blocks' selection of the training and validation sets.The remaining 4 parts are grouped into k blocks of consecutive pairs of observations-ensemble forecast, then randomly choosing 75 % of the blocks for the training set and the remaining 25 % sets for the validation set.

Backward Greedy Selection (BGS)
In Machine Learning, the evaluation of multiple models for simulation or prediction of an event, and to further select those which together enhance or simplify a condition for adjustment, is known as an overproduce and select.In a general context of selection, numerous methods have been developed.There are greedy selection methods (Backward or Forward Selection) but also methods such as integer programming and evolutionary algorithms.
Here, BGS and the idea of subdividing the data into three subsets to improve the generalization are applied.For its implementation it is necessary to define the error function "E" (that it is one of the given statistical scores shown in Sect.2) and the minimum number of members.With regard to the minimum number of members, which was arbitrarily defined as 30 here, the choice is mainly due to the high availability of initial members (800), for example with 30 hydrological members a level of compression of information equivalent to 96.25 % is reached.It is certain that if the selection task had started with a pool of 50 members, then the minimum number of members could have been defined as 10, for example.Moreover, the minimum number of members is just a stopping criterion of selection with BGS because the number of members to define as optimal should focus on specific analysis in each basin.
The members' elimination mechanism begins with all members (d) and removes them one by one, at each step removing the one that decreases the error the most (or increases it the least).The removal mechanism is as follows: 1.It begins with a subdivision of the dataset (χ) into training (χ t ), validation (χ v ), and test set (χ p ).
2. The reference set G d , containing all of the original d members, is presented.
The hydrological member "y j " corresponds to the one that, when it is removed, has the greater impact on the training set error E (i.e.minimise train error the most).
It is important to note that E must be a scalar or single value.
The reference set is then updated by removing the y j member in G.
G iter = G iter+1 \y j 4. At this point, the error E in the validation set χ v , excluding the y j member, is evaluated.Backward Greedy Selection is a local search procedure that does not guarantee finding the optimal subset.For example, y x and y p by themselves may not be pertinent but together they may decrease the error substantially.But, because the algorithm is greedy and removes hydrological members one by one, it may not be able to detect this.Here, the BGS is executed with a resampling technique explained in Sect.3.2.

Combination of results
The variability of each experiment set-up in the crossvalidation step increases the probability of reaching different hydrological member' selections.So, it is necessary to determine an integration mechanism for a global solution for each catchment.Here, the importance of each hydrological member y i within the ensemble is then assumed as being directly proportional to the iteration number at which it was eliminated during the selection process in each experiment (iter . The combined ranking is thus the mean rank of elimination as defined in Eq. ( 10): For example, if the rank of elimination of the hydrological member y i is 50, 60, 200, 10, and 150 in the five experiments, then the mean rank of elimination is equal to 94.Finally, the final selection (s) of the nm2 best members corresponds to the members which have the highest mean rank of elimination given by Eq. ( 11): It should be noted that another possibility to integrating the results could have been based on the frequency of selection of each hydrological member of the ensemble, and later to elect the members with the highest frequency, but as this integration leads to a low performance, these results are omitted from this paper.

Interpretability of hydrological members' selection
In the case of MEPS in which the members are not perfectly interchangeable (e.g.Meteorological Service of Canada -MSC, TIGGE database), the selection of hydrological members with BGS focuses directly on the combinations of hydrological members that maintain or improve characteristics of the super ensemble of reference.But in the HEPS driven by a MEPS with interchangeable members (e.g.ECMWF EPS), the selection should be directed more clearly to a method of selection and weighting of hydrological models based on their participation in the final selected subset.Therefore, we can create a new simplified high-performance HEPS using the same proportion of the hydrological members associated with a random choice of the meteorological members.
For example if the final selection shows that the simplified HEPS should consist of ten members for the hydrological model "A" and thirty members for the hydrological model "B", then we should expect to achieve a high performance HEPS if we randomly pick ten meteorological members to evaluate the hydrological model "A" and thirty meteorological members, randomly chosen once again, to assess the hydrological model "B".Section 4.3 presents such an analysis.
With respect to the IGNS score, mean values are generally negative, which shows that on average the system has an acceptable bias.Finally, in terms of CRPS, Velázquez et al. (2011) show in detail the efficiency of CRPS in this 800-member HEPS.
Note that results discussed in this paper correspond to a "pseudo test dataset" for comparing the performance between different scores in the process of selecting hydrological members, since the data used to minimize all error functions are exactly the same.
It is a "pseudo test dataset" because there is a high probability that the data used in testing (the complete series) have been used in the BGS training process, becoming the indicator of an optimistic estimator of the selection (Diamantidis et al., 2000); however, we do emphasize that the first part of this research focuses on an analysis of scores in the BGS process with the subsequent integration of results, and the second phase presented in a companion paper (Brochero et al., 2011) shows a rigorous test of generalization in time and space.
Validation results were omitted mainly because they have a trend similar to the training ones, except for some experiments where the random distribution of the training and validation sets was not statistically homogeneous.
In order to illustrate the interchangeability of the members of the ECMWF EPS and equiprobability of this system, Sect.4.3 shows both the performance of the subset found with the BGS and the boxplot diagrams of 200 random experiments of 50 members, with and without the guidance of the BGS solution.

Selection performance
An example of the results obtained is shown in Fig. 3, which compares the 30-member and the 800-member results for the M0421510 catchment, after an optimization based on the δ criterion.In general the 30-member scores are better or as good as the reference set.
We stress the fact that the selection task focuses on the participation of the hydrological models.For instance, Fig. 3e shows that the selected hydrological members make use of 13 of the 16 available lumped models, however, the strong participation of the models 3, 7, 9, and 14 is displayed, which is an interesting combination of hydrological models, especially taking into account the much poorer performance of the 16-member multi-model approach driven by the deterministic prediction (Table 3) and knowing that these hydrological models are not of equal quality with regards to MSE performance.This suggests that the selection favoured a diversity of errors.Specifically, Fig. 3a shows that the 30-member CRPS equals the reference value.Also, taking into account that the CRPS generalizes the Mean Absolute Error (MAE) for a point forecast (Gneiting and Raftery, 2007), it is important to stress that the CRPS values are always lower than the MAE values, when the deterministic counterpart was taken as the mean of each daily ensemble, in agreement with results obtained by other authors (Boucher et al., 2009;Velázquez et al., 2011).Another remarkable feature of CRPS is its direct relationship with the flow magnitude; the shapes of the CRPS and of the hydrograph are similar.A direct strategy of optimization could then focus on removing the hydrological members that have a large impact in the daily extreme CRPS values.Note also that the selection not only preserves the mean CRPS (0.16) but also the structure of the CRPS series.
Figure 3b shows that the 30-member 4 % trimmed mean ignorance score (−1.01) has also improved over the initial value (−0.99).Regarding the time structure of the IGNS, it is observed that both the 30-member and 800-member values have many extreme values which suggest low assessments of the predictive distribution of the ensembles, i.e. a bias problem in the forecasts (note that a value of 4.5 corresponds to an evaluation of the pdf near 0.0442).
With regard to the reliability diagram, Fig. 3c shows a considerable agreement improvement (1.09e-3) over the initial value (1.74e-3).This gain in reliability may be traced back to the optimization criterion used: the δ ratio, which is entirely based on the integration of the whole range in terms of corresponding verifications (observations).Similarly, Fig. 3d reveals that the rank histograms have a nearly uniform distribution, even if the first rank reflects a slight bias.Those imperfections demonstrate the difficulty inherent in minimizing the δ ratio.
At the end of the selection process, the (MDCV) has slightly decreased, from 0.37 to 0.35.This confirms that optimization with the δ criterion seeks diversity of the ensemble forecasts in the correct way, not necessarily maximizing the MDCV.
Figure 3e illustrates the occurrence of each lumped model from the 30-member ensemble.A wide selection of models alone could justify the multi-model approach advocated here.Results show that 13 models out of 16 were selected in this case, and that no models were selected more than 7 times.
Taking into account the detailed analysis for the 30member selections and the global analysis performed for each of the catchments, the combined criterion leads to the best BGS results.The next section presents this analysis.However, the issue of the optimal number of hydrological members remains somehow blurred.So, Fig. 4 revisits that question in terms of the gain index based on NS defined in Eq. ( 8). Figure 4 emphasizes that the 30-member selection always displays a positive gain index.However, one should keep in mind that the optimal number of members should be based on an individual analysis of the different scores balance, i.e. evaluating that the normalized sum does not mask the detriment in a score(s) with gains made in other.
On the other hand, to reflect the BGS performance in the selection, Fig. 4 also presents the NS evaluation with 200 random selections of 30, 50, 100, 200, and 400 members in terms of gain index defined in Eq. ( 8).It is clear that BGS selection with positive gains are always obtained -improving the balance of the scores.Otherwise in random experiments the percentiles 10, 25, 50, 75, and 90 are shown generally in the range of a negative gain index (i.e. a detriment to the balance of the criteria).This tendency is obviously stronger in random selections of 100 or fewer hydrological members where the probability of taking the most representative hydrological responses is lower.It is important to note how even in the random selection of 200 and 400 members (25 % and 50 % of the 800 hydrological members) the NS in 75 % of the evaluations shows a negative gain index.
To check each score individually, Table 4 shows the median of 200 random selections for basin H36 optimized with the combined criterion.The random selections pick 50 hydrological members to evaluate each score in a standardized fashion, that is, dividing the score obtained in the selection subset by the reference score of all 800 members of base (see each component in Eq. 7 without weight parameter).
Table 4 shows an analysis to evaluate the sensitivity of the scores with respect to the selection of hydrological members in the database under study.So, it is possible to point out the following: -In the hydrological members' selection, the greatest challenge is selecting a small set of members, for example 30 or 50.
-CRPS indifference to the selection of members, and to a lesser extent, both the low variability of the IGNS and the MDCV function.-The hydrological members' selection presents its greatest challenges in maintaining or improving reliability and the consistency of the ensemble represented by the δ ratio, as shown in Table 3.Therefore, to define the combined criterion, such as an error term in BGS, the reliability term (RD MSE ) has more weight to guide the optimization in this way.At this point it should be noted that consistency has a direct relationship with reliability, although ensemble consistency does not necessarily imply that probability forecasts constructed from the ensemble are reliable in the sense of conditional outcome relative frequencies being equal to the forecast  (Wilks, 2011).
Finally, Table 5 shows detailed results for each score in the selection process with BGS for the basin H36.It shows that in the BGS methodology, with the combined criterion as error function, is not detrimental to any of the scores.Instead, gains in the balance scores (normalized sum) are mainly due to optimization of system reliability while preserving the quality of the other scores.

Scores interaction in the selection
Table 6 summarizes results for more catchments and optimization criteria.The 30-member comparison is based on a normalized sum (Sect.2.7).In this way, a value of NS lower than 5 indicates an overall improved performance.Performance for all criteria are also given in Table 6 for completeness, and the best optimization criterion for each catchment is identified in bold letters.
Overall, the combined criterion (CC) offers an effective and direct rule, finding balance between features offered by each of the criteria.However, it is important to point out the two cases for which the δ criterion provides a slightly better optimum.This reflects the limitations of the BGS technique or the effects of the combination of results, because if the objective function (CC) is equal to the criterion used to compare results obtained with different objectives, the CC should obviously always find the best solution within the vision of a global optimization tool.
The δ ratio criterion, based on a rank histogram which is the most common approach for evaluating whether a collection of ensemble forecasts for a scalar predictand satisfies the consistency condition (Wilks, 2005), comes to a close second.It led to the best performance for two catchments and to the second best performance for five other catchments.This is particularly interesting considering the simplicity of this approach with respect to the combined approach.In addition, the δ criterion favoured the highest average participation of hydrological models.
The CRPS and IGNS led to a poorer selection, to the point that they were not considered further after experimenting with the first four catchments allowing an economy in computational time.The CRPS showed low variability, so it is not very sensitive to changes in the selection of hydrological members, as was shown in Tables 4 and 5.The IGNS demonstrated a negative relationship with reliability, leading to poor performance in terms of the reliability diagram (RD) and δ ratio.They are also correlated, optimizing one criterion often favouring the improvement of the other one.
Specifically the behaviour of the optimization of each score could also be described from the following relationships observed in Table 6: -Optimization based on CRPS is detrimental to the reliability.For example, it had the effect of increasing RD by a factor of 10, for catchment Q25.The CRPS also decreases diversity of the members (MDCV), except for catchment B31 where it remained stable.
-The combined criterion (CC) leads to stable CRPS values.The most remarkable gains come in terms of RD, as provided in the weights definition of the Eq. ( 7).With reference to δ ratio, evaluations reveal the difficulty in maintaining the stability of this criterion, but the differences between the selection and the reference set are not pronounced.As for the MDCV, the diversity is in most cases maintained or improved.The IGNS performance is often slightly decreased.In conclusion, the CC promotes overall good performance, increasing the reliability of the system (decrease of the RD MSE score) and ensuring the stability in the other scores.
-Selection based on the RD score is detrimental to the CRPS.As for reliability, there are some cases for which the error increases.This condition is surprising given that the combined criterion always achieved reductions of this error, but it could not last under the assumption of a greater weight of this score in the combination because the relationship is constant, which highlights the interaction between the scores as a mechanism implicit in the reduction of error reliability (RD MSE ).The δ ratio is never improved, while diversity (MDCV) is lost except in three cases (B31, Q25, and U25) where interestingly the MDCV increased (theoretically consistent effect).Finally, the IGNS shows a negative trend to the minimization of the RD.
-By definition, the δ ratio focuses on the reliability and the consistency of the ensemble.In fact, it leads to better reliability performance in terms of RD, than when the selection is optimized with RD itself.The δ ratio also preserves the resolution of the forecast, as shown by the CRPS and IGNS results.All of this is accompanied by a slight loss in performance in terms of δ ratio, which can be explained by the direct relationship of this score with the number of members.However, this dependency rather than becoming an obstacle in the selection stands as a logical consequence of the system, since statistically a better performance is expected from a system that combines a larger number of members (Alpaydin, 2010).Finally, with respect to MDCV, it is shown once again that diversity, hypothetically represented by MDCV, fluctuates between values that indicate the extent to which such diversity needs to be maintained in the ensemble.
-When the selection process focuses on the maximization of MDCV, the relationship with CRPS, the IGNS and δ ratio is always negative.However, there are four cases in which the reliability is improved by increasing the diversity index, from which it follows that while reliability improves the resolution drops.
In summary, the interaction of different scores, as seen from the 30-member selection, shows that the optimization focused on scores that mainly define the resolution of the ensemble (CRPS, IGNS) has a negative impact on the reliability, consistency, and ensemble diversity.It also reveals that if the selection is based only on a reliability view, the ensemble loses resolution and consistency.Maximization of the MDCV is in general detrimental to the other criteria, but sometimes improves reliability, a condition that can easily Hydrol.Earth Syst.Sci., 15,[3307][3308][3309][3310][3311][3312][3313][3314][3315][3316][3317][3318][3319][3320][3321][3322][3323][3324][3325]2011 www.hydrol-earth-syst-sci.net/15/3307/2011/  be understood from a theoretical point of view.The δ ratio improves reliability while maintaining resolution.The combined approach stands out as the most balanced criterion.
The above analysis focused exclusively on 30-member selections.However, a global vision requires the analysis of the evolution of the scores as the number of hydrological members is reduced.Such an analysis is specific to each catchment, so as an example, Fig. 5 shows evolution diagrams of the various scores as a function of the number of members, for the catchment A79.
In order to assess the joint evolution of all scores the gain index defined by Eq. ( 9) was used.Figure 5a  RD optimization (Fig. 5b) is surprisingly unfavourable to the δ ratio (negative gain index), which is related to the indifference of the RD with respect to the location of the observation within the ensemble, while this location analysis creates a solid indicator of the system consistency.Likewise, it is remarkable that the normalized sum for RD is equal to 4.96 when the number of hydrological members is equal to 100.This is strictly because loss in consistency (negative gain index in the δ ratio of 40 %) and resolution (IGNS equivalent to losses of 10 %) is balanced by a positive gain of about 50 % in RD.
The δ ratio (Fig. 5c) displays a gradual overall improvement of individual scores in a selection of about 70 hydrological members, when the various scores show a tendency to decrease in performance.At this point it is important to note that the normalized sum (NS) reached 4.53.
Figure 5d shows that criteria focusing on the resolution and the consistency have a negative relationship with the maximization of the diversity (MDCV), overall gains are achieved only when the number of hydrological members is greater than 400.
The combined criterion (Fig. 5f) improves collective performance of all scores in the selection, with an optimal number of hydrological members of 70 for this catchment, coinciding with the interaction shown in the minimization of the δ ratio (Fig. 5c).Scores tend to lose quality afterwards.Table 7 groups the 100-member scores following optimization with the combined score and the δ ratio, the two best ones.These values confirm the superiority of the combined score, leading to the smallest NS for all catchments, mainly because of the great influence on minimizing reliability.This also maximizes MDCV to such an extent that it allows a proper balance between reliability, resolution, and consistency.It is also remarkable that for 8 catchments out of 10, the δ ratio is minimized even more than when the optimization is focused on the δ ratio itself.Optimization based on the δ ratio also improved scores over the initial 800member values (NS < 5) for 9 catchments out of 10.This single criterion is thus also very appealing, especially because it makes use of all 16 models in its selection.
Additionally, the δ ratio can be highlighted as a simple optimization criterion, which for 100 % of the catchments, makes use of the participation of all hydrological models in the formation of the solution, which is not the case for the optimization with the CC.

Interchangeability of MEPS members as input of hydrological models
In order to illustrate the interchangeability of the members of the ECMWF EPS and equiprobability of this system, Fig. 6a shows that a random selection oriented only with the hydrological models' participation in the BGS has a chance to have even better performance than the 800-member HEPS upper 90 % (top of the box diagram).These box plots are constructed by retaining the participation of hydrological models in the response but with a random selection of members of the MEPS.On the other hand, Fig. 6b shows the same kind D. Brochero et al.: Simplifying a hydrological ensemble prediction system, Part 1 of results under different random selections but without considering the participation of hydrological models found with BGS.
Figure 6 highlights three main aspects: high-performance solutions based on the proportion given by the BGS, low variability, and high performance of the BGS solutions.
The performance of selections based on the proportion of members found in the BGS solution is evident in Fig. 6a.So, it is demonstrated that the proportion of members for a hydrological model is generally a sufficient criterion to reduce the number of members while improving the balance of the scores represented by the normalized sum.For comparison, Fig. 6b illustrates the system response to random selections without any a priori guidance, showing that in all cases the normalized sum is greater than 5 and have recurring extremes greater than 7.
Regarding the variability of the normalized sum evaluated in random selections guided by the BGS solution, it can be seen that the interquartile range (Q 3 − Q 1 ) is at worst equal to 0.3 (catchment H36), which is a much lower value than for the purely random selection, as shown in Fig. 6b where the latter interquartile range is equal to 0.6.
The generalization of the BGS method is discussed in detail in the companion paper, where the temporal and spatial generalization is evaluated for a nearby catchment.However, Fig. 6a shows that the catchments H36 and J85 obtained combinations with a normalized sum lower than those obtained with the BGS method (see only cross points at the bottom in Fig. 6a), which can be associated with the integration of experiments carried out in a subdivision database for each catchment or the BGS algorithm structure -it is known that the classical BGS algorithm is unable to detect the collective influence of the variables.

Conclusions
Previous results on the number of hydrological members and the HEPS conformation (Velázquez et al., 2011) have shown, based on the database of the present paper, that the ensemble predictions produced by a combination of several hydrological model structures and meteorological ensembles (800member set) have higher skill and reliability than ensemble predictions given either by a single hydrological model fed by weather ensemble predictions (50-member set) or by several hydrological models driven by a deterministic meteorological forecast (16-member set).So, our goal was focused on at least replicating the good quality of the 800-member set with fewer hydrological members.
Hydrological member selection is justified by the computational cost to issue a hydrological forecast based on the combination of meteorological models and hydrological models.In this line, the selection of hydrological members without sacrificing the quality of a forecast stands out as an operational option.Results presented here support the idea that selecting HEPS members is viable.It is in general even possible to expect a better balance of scores in the subset of selected hydrological members than in the original much larger ensemble, based on standard scores such as the CRPS, the IGNS, the reliability diagram, and the δ ratio.The diversity, sought in the multi-model approach with MEPS, may also be maintained in the final selection.
The simplification of the HEPS can be addressed from two points of view: as a function of the maximum simplification of the number of hydrological members or as a function of the maximization of the balance of the scores.Simplification of the number of hydrological members involves the definition of a limit ensuring statistical consistency of the scores assessed.A trade-off exists between the number of hydrological members and the level of improvement in scores.For example, in this study, the best balance of scores is achieved with a number of members fluctuating between 30 and 100, maximizing the qualities of the system: reliability, consistency, resolution, and diversity.So in the worst case this corresponds to a 87.5 % (700 members/800 members) compression level.The ultimate level of compression is in fact a compromise between the gain index and the complexity of the system.The ultimate decision should be established according to the requirements and the operational capacity of the hydrological probabilistic forecast system.
The evaluation of five individual scores as criteria for optimizing the selection process revealed the complexity of the relationship between them.In many situations, improving one score is achieved at the expense of another score.Therefore, the design of a combined criterion (CC) led to an important methodological improvement that integrates many characteristics of each score.The δ ratio is the best single optimization criterion, not very distant to the achievements of the combined criterion (CC).
The CRPS is often the primary score used for evaluating HEPS performances.However, results here indicate that it is not a good for hydrological members' selection in this case of study.In fact, it was often possible to preserve or minimize the CRPS using other objective criteria.Likewise, the centralization of the selection process in the IGNS heavily penalized the reliability and the consistency of the system.With respect to the MDCV, the uncontrolled maximization of this parameter, which describes diversity, leads to a deterioration of the other sought qualities of the system.There exists a threshold beyond which the system abruptly loses reliability, resolution, and consistency.On the other hand, experiments showed that both the δ ratio and the CC improve the balance of the scores.
The proposed methodology is part of the so-called datadriven models, so the design is independent of the database, in this case the evolution of MEPS or hydrological models.Precisely this point stands out as one of the advantages of the proposed methodology, since the selection of hydrological members could be implemented in any desired combination between any MEPS (e.g.ECMWF EPS, MSC, US National Centers for Environmental Prediction -NCEP) and hydrological models.
The cross-validation, a vital part of the proposed methodology, systematically deals with the issue of the short length of the series.However, it is widely applicable to any length of condition series.
Finally, the encouraging results of this study will lead to an interest in testing other global search (non-greedy) tools such as evolutionary algorithms.

Fig. 2 .
Fig. 2. Selected catchments are identified with the first three digits of each code used in Table 1.The other delimited basins are part of the study of results' generalization shown in Brochero et al. (2011).

Fig. 3 .
Fig. 3. Comparison between the initial ensemble (800 members) and the ensemble selected (30 members) for the lead time 9 in the catchment M0421510.(a) Figure above: observed flow; figure below: CRPS, x-axis indicates day/month.Note the correspondence between higher observed flows and higher mean CRPS.(b) Figure above: observed flow; figure below: IGNS.Note that there is no full correspondence between the higher IGNS and higher observed flow, x-axis indicates day/month.(c) Reliability diagram error (RD MSE based on vertical distances between the points).(d) Rank histogram for the 30 selected hydrological members.The horizontal dashed line indicates the frequency (N/d + 1) attained by a uniform distribution.(e) Occurrences of the employed models in the final solution of 30 hydrological members.

Fig. 4 .
Fig. 4. Evolution of the normalized sum (NS) in terms of gain index for the lead time 9, optimization with the combined criterion.Logarithmic scale on the x-axis.The normalized sum equal to 5 represents the performance of the initial 800-member ensemble.The thin red line represents the normalized sum under different numbers of members found with BGS.Symbols for the 200 random selection experiments: the blue vertical line identifies the interquartile range, white circles represent the median and yellow diamonds corresponding to the percentiles 10 and 90.

Fig. 5 .
Fig. 5. Evolution of the gain index for each score under different optimization schemes in the basin A79 for the lead time 9.A logarithmic scale is used on the x-axis.The chosen optimization criterion in the selection is shown at the top of each subfigure.The lower part of each subfigure indicates the values of the normalized sum (NS) for the number of hydrological members shown on the x-axis.
and 5e clearly show that an optimization based on resolution of the system (CRPS or IGNS) is detrimental to the reliability.Figure 5 also highlights the correspondence of CRPS and IGNS throughout the selection process, when the optimization is focused on one or the other.

Fig. 6 .
Fig.6.Backward Greedy Selection (BGS) and Box-plots in 200 random experiments of 50 hydrological members for the lead time 9. On each box, the central mark is the median, the edges of the box are the 25th (Q 1 ) and 75th percentiles (Q 3 ), the whiskers or limits to consider the outliers extend fromQ 1 − 1.5 × (Q 3 − Q 1 ) to Q 3 + 1.5 × (Q 3 − Q 1 )points (but not necessarily correspond to observed data), and the outliers are plotted individually as cross points.(a) Random selection oriented with the frequency observed in the BGS to check the interchangeability in the 800 member-set, (b) Random selection without any guidance to check the BGS performance.
Fig.6.Backward Greedy Selection (BGS) and Box-plots in 200 random experiments of 50 hydrological members for the lead time 9. On each box, the central mark is the median, the edges of the box are the 25th (Q 1 ) and 75th percentiles (Q 3 ), the whiskers or limits to consider the outliers extend fromQ 1 − 1.5 × (Q 3 − Q 1 ) to Q 3 + 1.5 × (Q 3 − Q 1 )points (but not necessarily correspond to observed data), and the outliers are plotted individually as cross points.(a) Random selection oriented with the frequency observed in the BGS to check the interchangeability in the 800 member-set, (b) Random selection without any guidance to check the BGS performance.

Table 1 .
Main characteristics of the studied basins (mean annual values) based on a 36 year length of the series.
and Table 1), and were evaluated over a period of 17 months (from March 2005 to July 2006).Temperature, rainfall, and flow data are available at a daily time step over the period extending from 1970 to 2005, and P: precipitation, ET: potential evapotranspiration, Q: flow.wereused for the calibration and validation of the hydrological models.Observed data for the period 11 March 2005 to 31 July 2006 was used only for the evaluation of the forecasts.The forecast verification period is thus independent of the calibration/validation period.Rainfall data come from the meteorological analysis system SAFRAN of Météo-France (seeQuintana-Seguí et al., 2008 for details).They consist of rainfall accumulated at a daily time step and available for the entire country of France at an 8 × 8-km grid resolution.Daily streamflow data come from the French database Banque Hydro (http://www.hydro.eaufrance.fr/).The length of available observed streamflow time series varies according to the catchment, with, on average, 29 years of available daily data for the catchment dataset used here.
Brochero et al.:Simplifying a hydrological ensemble prediction system, Part 1 5.The subset G nmin of the selected members is achieved, then the whole selection process is analysed on the training and validation results.

Table 3 .
Performance for the 16-member ensemble (16 hydrological models are driven by the deterministic forecast from ECMWF EPS) and the 800-member ensemble (16 hydrological models are driven by the 50 perturbed members forecast from ECMWF EPS) for a 9-day forecast time horizon.Hereafter, RD MSE values are expressed on a 10 −3 basis.

Table 4 .
Median of 200 random selections in catchment H36 for the lead time 9.The scores are presented in a standardized fashion and oriented at the minimization coinciding with the formulation of each component of the combined criterion (Eq.7).

Table 5 .
Results of BGS in catchment H36 for the lead time 9 with the combined criterion as error function.The scores are presented in a standardized fashion and oriented at the minimization coinciding with the formulation of the Eq. 7.

Table 6 .
Selection of 30 hydrological members based on different scores in all the basins or the lead time 9. NHM indicates the number of hydrological models participating in the solution.

Table 7 .
Selection of 100 hydrological members based on the combined (CC) and δ criteria.NHM indicates the number of hydrological models participating in the solution.
Brochero et al.:Simplifying a hydrological ensemble prediction system, Part 1R(y i )Mean rank of elimination of the y i hydrological member sFinal selection of the nm best hydrological members in the selection process xp Iteration number at which was eliminated the y i hydrological member during the selection process in the xp experiment www.hydrol-earth-syst-sci.net/15/3307/2011/ Hydrol.Earth Syst.Sci., 15, 3307-3325, 2011 D.