Performance and reliability of multimodel hydrological ensemble simulations based on seventeen lumped models and a thousand catchments

This work investigates the added value of ensembles constructed from seventeen lumped hydrological models against their simple average counterparts. It is thus hypothesized that there is more information provided by all the outputs of these models than by their single aggregated predictors. For all available 1061 catchments, results showed that the mean continuous ranked probability score of the ensemble simulations were better than the mean average error of the aggregated simulations, confirming the added value of retaining all the components of the model outputs. Reliability of the simulation ensembles is also achieved for about 30% of the catchments, as assessed by rank histograms and reliability plots. Nonetheless this imperfection, the ensemble simulations were shown to have better skills than the deterministic simulations at discriminating between events and non-events, as confirmed by relative operating characteristic scores especially for larger streamflows. From 7 to 10 models are deemed sufficient to construct ensembles with improved performance, based on a genetic algorithm search optimizing the continuous ranked probability score. In fact, many model subsets were found improving the performance of the reference ensemble. This is thus not essential to implement as much as seventeen lumped hydrological models. The gain in performance of the optimized subsets is accompanied by some improvement of the ensemble reliability in most cases. Nonetheless, a calibration of the predictive distribution is still needed for many catchments. Correspondence to: J. A. Veĺazquez (juan-alberto.velazquez.1@ulaval.ca)


Introduction
In hydrology, traditional approaches focus on a single model thought to be the best possible for a given application.In opposition, multimodel combination aims at extracting as much information as possible from a group of existing models.The idea is that each model of the group provides specific information that might be combined to produce a better overall simulation.This concept has been widely tested because no hydrological model could yet be identified as the "best" model in all circumstances (Oudin et al., 2006).
Indeed, the selection of a "best" model for a given application is a complex task.For instance, Marshall et al. (2005) proposed a method in which hydrological models may be compared in a Bayesian framework accounting for model and parameter uncertainty, while Clark et al. (2008) proposed a Framework for Understanding Structural Errors (FUSE) in order to diagnose differences in hydrological model structures.The latter approach allowed the elaboration of 79 different model structures combining components of 4 existing hydrological models.Results lead the authors concluding that it is unlikely that a single model structure may provide the best streamflow simulation for basins of different climate regimes.A framework called Modular Modeling System (MMS) has been developed by the US Geological Survey to develop a variety of physical processes models that can be coupled with management models for a wide range of operational issues (Leavesley et al., 1996).MMS uses a library that contains compatible modules for simulating a variety of water, energy and biochemical processes.In such framework, a model is created by selectively coupling the most appropriated process algorithms from the library to create the "optimal" model for the desired application.
In multimodel combination, Shamseldin et al. (1997) compared three combinational methods over five rainfall-runoff models and eleven catchments.The methods were the simple model average (SMA), the weighted average, and artificial neural networks.Results showed that the combined outputs were more accurate than the best single one.Later, Georgakakos et al. (2004) tested a multimodel approach over six catchments.Combined outputs were constructed with both calibrated and uncalibrated distributed model simulations, using the SMA.Results confirmed the better performance of the combined series over individual ones; furthermore, the authors claimed that multimodel simulations should be considered as an operational tool.Ajami et al. (2006) examined yet another method of combination, namely the multimodel superensemble of Krishnamurti et al. (1999), using outputs from seven distributed models.They found that more sophisticated combination techniques may further improve simulation accuracy, that at least four models are required to obtain consistent multimodel simulations, and that the multimodel accuracy is related to the accuracy of the individual member models (longer dataset and more models might then improve multimodel combination results).Viney et al. (2009) compared predictions for one catchment exploiting ten models of different model types, covering lumped, semi-distributed, and fully distributed models combined in many ways.Their results differ from Ajami et al. (2006) in that the best ensembles are not necessarily those containing the best individual models.For the same catchment and models as Viney et al. (2009), Boorman et al. (2007) suggested that a number of at least 6 models are required for a multimodel ensemble to ensure good model performance and that any number above six may not considerably improve the performance of the ensemble.
Other studies have applied the Generalized Likelihood Uncertainty Estimation (GLUE) methodology (Beven and Binley, 1992) in order to assess the uncertainty associate with the predictions.This procedure works with multiple sets of parameters values and allows differentiating sets of values that may be equally likely as simulators of a catchment.At the heart of GLUE is the concept of rejecting non-behavioural models and weighting the behavioural ones for ensemble simulations.Recently, Liu et al. (2009) have proposed a methodology for identifying behavioural models avoiding the subjective choice of a threshold based on a global goodness of fit index, replacing it by a condition for every time step based on an observation error set prior.An application of the GLUE methodology to account uncertainty in model parameter, model structure and data is presented by Krueger et al. (2010), however, the understanding of data uncertainties often remains incomplete (e.g.rainfall input).Another multimodel combinational method has been proposed by Oudin et al. (2006) who resorted to two different parameterizations of the same model.
The Ensemble Bayesian Model Averaging (BMA) has been proposed for multimodel combination (Raftery et al., 2003(Raftery et al., , 2005)).In this framework, the probability density function (pdf) of the quantity of interest predicted by the BMA is essentially a weighted average of individual pdf's predicted by a set of individual models that are centered around their forecasts.The weights assigned to each of the models reflect their contribution to the forecast skill over the training period.Typically, the ensemble mean outperforms all or most of the individual members of the ensemble (Raftery et al., 2005).BMA has been successfully applied in streamflow prediction (Duan et al., 2007), groundwater hydrology (Neuman, 2003), soil hydraulic (Wöhling and Vrugt, 2008) and surface temperature, and sea level pressure (Vrugt et al., 2008).However, Vrugt et al. (2007) report no advantage when comparing multimodel BMA and Ensemble Kalman filtering (Evensen, 1994).
In meteorology, the DEMETER project aimed developing a multi-model ensemble-based system for seasonal to interannual prediction, which relies on seven global atmosphere -ocean coupled models, each running from an ensemble of initial conditions.The evaluation demonstrates the enhanced reliability and skill of the multimodel ensemble over a more conventional single-model ensemble approach (Palmer et al., 2004;Hagerdon et al., 2005).Output from the DEMETER multimodel system has been also applied to malaria prediction models (Jones et al., 2010).
An alternative idea, which is gaining ground, combines models through optimization.For example, Devineni et al. (2008) proposed an algorithm combining streamflow forecast from individual models based on their skill, as assessed from the rank probability score.The methodology assigns larger weights to models leading to better predictability under similar prediction conditions.This multimodel combination has been tested over a single catchment, combining two statistical models.Seven multimodel combinations techniques were tested and results showed that developing optimal model combinations contingent on the predictor lead to improve predictability.
Multimodel combination has also been applied in an operational context.Loumagne et al. (1995) combined model outputs using weights adapted to the state of the flood forecasting system.This procedure proved to be more effective than choosing the best model at each time step.Coulibaly et al. (2005) combined three structurally different hydrologic models to improve the accuracy of a daily reservoir inflow forecast based on the weighted average method.They found that model combination can offer an alternative to the daily operational updating of the models, providing a cost-effective solution to operational hydrology.Marshall et al. (2006Marshall et al. ( , 2007) ) used a hierarchical mixture of experts (HME) allowing changes in the model structure, depending on the state of the catchment.The approach was tested on Australian catchments by combining the results from two models structures in the first case and two parameterizations of a conceptual model in the second case.Results showed that the HME improves performance over any model taken alone.
The view shared by the above studies is the production of improved hydrological simulations through the aggregation of a group of outputs into a single predictor.The present study hypothesizes that there is more value exploiting all the outputs of this group than the single aggregated one, following the philosophy of meteorological ensemble prediction (Schaake et al., 2007).All the members of the ensemble are then used to fit a probability density function (the predictive distribution), and are useful to evaluate confidence intervals for the outputs, the probability of streamflow being above a certain threshold value, and more.In other words, an ensemble allows appreciating the simulation uncertainty.He et al. (2009) used a coupled atmospheric-hydrologichydraulic system driven by the TIGGE (THORPEX Interactive Grand Global Ensemble) ensemble forecast (seven meteorological agencies) for flood warning in the River Severn catchment located in Wales.Another study is presented by He et al. (2010) which used predictions from six meteorological agencies, for the Huai River catchment in China, to drive a hydrological model forecasting the July-September 2008 flood event.Their results established the TIGGE multimodel as a promising tool for discharge forecasts.
The present study aims assessing the added value of ensembles constructed from seventeen lumped hydrological models (the probabilistic simulations) against their simple average counterparts (the deterministic simulations).It resorts to 1061 French daily streamflow time series extending over a ten-year period, in order to generalize conclusions.The probabilistic performance based on all seventeen outputs is first compared to the deterministic one.Then the reliability of the ensembles is assessed as well as their operational value in terms of hit rate and false alarm rate.Further ensemble performance improvement is finally sought through model selection: subsets of the seventeen lumped hydrological model outputs are objectively constructed using a genetic search algorithm optimizing the Continuous Ranked Probability Score.
The methodology is described in the next section.Results are presented in Sect.3, while conclusions are given in Sect. 4.

Methodology
Catchments and models are presented along scores and tools used to evaluate the performance and reliability of the ensembles.The genetic search algorithm is described last.

Catchments and models
Deterministic and probabilistic streamflow simulations from seventeen hydrological models are analyzed on 1061 French catchments.The dataset was built by Le Moine (2008) and   (Durand et al., 1993;Quintana-Segui et al., 2008).Potential evapotranspiration is estimated from air temperature, using the radiation-based formulation proposed by Oudin et al. (2005).The first half of the time series is used for calibration, while the second half is used for validation.All results provided herein concern the validation sub-dataset.
All seventeen hydrological models are of low to moderate complexity: the number of parameters ranging from 4 to 13.Table 2 lists the tested model structures along with the number of optimized parameters and stores for their tested version.Most of these models were used by Perrin et al. (2001) and Mathevet (2005).All model structures were applied in a lumped mode, which means that catchments were not split into sub-catchments or grids but considered as a single unit.Although some of the test catchments are quite large, this does not seem to be a real limitation for the application of lumped models, as shown by the results by Merz et al. (2009).Obviously, some specific conditions or events may be better modelled using semi-distributed or fully distributed spatial schemes (see e.g.Jaun et al., 2008), but the modelling scheme proposed here can be applied with other model types.Further discussion on the impact of the spatial scheme on model performance can be found in Andréassian et al. (2004) or Smith et al. (2004).
These lumped models correspond to various conceptualizations of the rainfall-runoff transformation at the catchment scale.They all include a soil moisture accounting procedure but with various formulations (linear or non linear, possibly with several soil layers).They also include transfer functions to account for the travel time and different pathways of water at the catchment scale.These functions includes from 1 to 5 linear or non linear stores, and unit hydrographs or pure time delays.Some of the models include a non conservative function (correction factors of inputs or groundwater exchange functions) used to adjust the water balance.All the models were applied in the same conditions, i.e. ran at a daily time step using the same rainfall and potential evapotranspiration inputs and calibrated with the same procedure.This single application framework provides more comparable results between model structures.This is one of the reasons why the original model structures were modified as they sometimes had specificities that did not match this framework.Note that the objective here was not to evaluate the original structures but to have a variety of conceptualizations.To avoid confusion with the original models from which they are derived, only 4 letter acronyms are used in Table 2 and identification numbers will be used in the text and figures.Model's structure description is available from authors.
Calibration was performed using a local search procedure, as described by Edijatno et al. (1999), applied in combination with a pre-screening of the parameter space as proposed by Mathevet (2005).This pre-screening provides a likely starting point for the search algorithm and limits the risks to be trapped in local optima.Mathevet (2005) showed that this approach is competitive for this type of models, in terms of efficiency and effectiveness, when compared with more sophisticated global search procedures.

Performance and reliability
Deterministic simulations were aggregated using the simple average method (SMA).This is the simplest procedure for combining outputs from an ensemble of individual models (Shamseldin et al., 1997).Ensembles were constructed in different forms.First, a simple pooling of all seventeen model outputs was considered.Then, subsets of the seventeen lumped hydrological model outputs were identified objectively using the genetic search algorithm described in Sect.2.3 and the Continuous Ranked Probability Score as the objective function.Finally, subsets of eight models, selected according to their deterministic performance, were tested for comparison.

The Absolute Error criteria
The evaluation of the performance of the deterministic simulations is based on the absolute error (AE), a linear scoring rule that describes the average magnitude of the errors without considering their direction.The main advantage of the AE over alternative deterministic scores is that it can be directly compared to the Continuous Ranked Probability Score -described next -of the probabilistic simulations (Gneiting and Raftery, 2007).It thus provides a way to compare the performance of ensemble simulations against the performance of deterministic simulations, for each individual catchment.
The CRPS is defined as: where F t is the cumulative predictive distribution function for the time t, x is the predicted variable (here streamflow) and x t is the corresponding observed value.The function H {x ≥ x t } is the Heaviside function which equals 1 for simulated values larger than the observed value and 0 for simulated values lower than the observation.The CRPS is positive and a zero value indicates a perfect simulation.An analytical solution of Eq. ( 1) exists for normal predictive distributions (Gneiting and Raftery, 2007).However, because the normality of the predictive distribution is not always true in the present study, a Monte Carlo approximation to Eq. ( 1) has been used instead (Székely et al., 2003;Gneiting et al., 2007): Where X and X are independent vectors consisting of 1000 random values from a gamma distribution adjusted to the predictive function F t .As already mentioned, an interesting property of the CRPS is that it reduces to the AE score in the case of a deterministic simulation (Gneiting and Raftery, 2007).However, because the score for a specific forecastobservation pair, at a certain time, cannot be interpreted, we rather consider for each station the average of all individual scores as a measure of the quality of the simulation system, thus comparing mean AE (MAE) and mean CRPS (CRPS), which values are directly proportional to the magnitude of the observations.We also aim to evaluate the performance gain in terms of CRPS that may bring the optimization procedure.Based on the skill score (e.g.Wilks, 1995), the percentage of improvement over the reference is given by: gain

Reliability
Reliability refers to the statistical consistency between simulations and observations.For instance, a reliable 90% confidence interval calculated using the predictive distribution function should contain the observed value in 9 cases out of 10 on average.On the other hand, the potential CRPS corresponds to the best possible CRPS value that could be obtained with the database and the particular simulation system that is used, if the latter were perfectly reliable.Because of the complex nature of the CRPS, other means of assessing the reliability is often used in parallel, such as the rank histogram and the reliability diagram.Unreliable simulations can be misleading and should be used with caution, if at all.Many methods for post processing the probabilistic forecasts from ensembles have been proposed, such as the ensemble dressing (i.e., kernel density) approaches (Roulston and Smith, 2003;Wang and Bishop, 2005;Fortin et al., 2006), Bayesian model averaging (Raftery et al., 2005), nonhomogeneous Gaussian regression (Gneiting et al., 2005), logistic regression techniques (Hamill et al., 2004(Hamill et al., , 2006)), analog techniques (Hamill et al., 2006), forecasting assimilation (Stephenson et al., 2005), statistical postprocess calibration approach (Wood and Schaake, 2008), variance inflation method (Johnson and Bowler, 2009), the simple binning technique (Stensrud and Yussouf, 2007) and several others.However, these procedures were not considered in the present work.
The reliability of the predictive distribution can be visually assessed using the rank histogram (Talagrand et al., 1999;Hamill, 2001).To construct it, the observed value x t is added to the ensemble simulation.That is, if the simulation has n members, the new set consists of n + 1 values.Then, the rank associated with the observed value is determined.This operation is repeated for all simulations and corresponding observations in the archive.The rank histogram is obtained by constructing the histogram of the resulting N ranks.The interpretation of the rank histogram is based on the assumptions that all the members of the ensemble simulation along with the observations are independent and identically distributed; under these hypotheses, if the predictive distribution is well calibrated, then the rank histogram should be close to uniformity (equally distributed).An asymmetrical histogram is usually an indication of a bias in the mean of the simulations.If the rank histogram is symmetric and "U" shaped, it may indicate that the predictive distribution is under-dispersed.If it has an arch form, the predictive distribution may be over-dispersed.
Because it is not practical to present all 1061 rank histograms, results will be synthesised using the ratio δ metric proposed by Candille and Talagrand (2005): a numerical indicator reflecting the squared deviation from flatness in individual rank histograms.It is given by δ = 0 (4) where: and s k is the number of elements in the kth interval of the rank histogram.For a reliable system, s k has an expectation of N/(n + 1).Then, 0 is the ratio that would be obtained by a perfectly reliable system: leading to a target value of δ = 1.Of course, a perfectly reliable system is a theoretical concept.In practice, a system is declared unreliable whenever its δ value is quite larger than 1 (Candille et al., 2005).However, the exact δ threshold, above which a system may be declared unreliable, has to be established for each investigation, notably because the δ metric is proportional to the length of the time series (the threshold value adopted here will be discussed later on).Some applications of the δ metric include evaluating the degree of reliability of meteorological ensembles by comparing δ values according to their series lengths (e.g.Jiang et al. 2009).
Alternatives to the rank histogram exist, such as the QQplot (e.g.Thyer et al. 2009).They remained unexplored here.The reliability diagram is another approach used to graphically represent the performance of probability simulations of dichotomous events.A reliability diagram consists of the plot of observed relative frequency as a function of simulation probability and the 1:1 diagonal represents the perfect reliability line (Wilks, 1995).In the present study, ten confidence intervals have been calculated with nominal confidence level of 5% to 95%, with an increment of 5% for each emitted simulation.Then, for each simulation and for each confidence interval, it was established whether or not each confidence interval covered the observation.This is repeated for all simulation-observation pairs and its mean is then plotted (Boucher et al., 2009).Verification results can be quite sensitive to sampling variability in some cases (Bradley et al., 2003;Clark et al., 2006).To assess this situation, we assigned confidence limits to the reliability diagram using a bootstrap technique.

Hit over threshold criteria
The relative operating characteristic (ROC) curve (Peterson et al., 1954;Mason, 1982) plots the probability of detection (POD) versus the probability of false detection (POFD), which are given by: POD = hits hits + misses (7) POFD = false alarms correct negatives + false alarms (8) The four combinations of simulations (yes or no) and observations (yes or no), called the joint distribution, are: hit (the event simulation to occur and did occur), miss (the event simulation not to occur, but did occur), false alarm (event simulation to occur, but did not occur) and correct negative (event simulation not to occur and did not occur) (e.g.Wilks, 1995).The area under the ROC curve characterizes the quality of a simulation system's ability to correctly anticipate the occurrence or non occurrence of the events.In constructing a ROC curve, simulations are expressed in binary as "warnings" or "not warnings" indicating whether or not the defined event is expected to occur.The ROC area ranges from 0 to 1, 0.5 indicating no skill and 1 being the perfect score.ROC measures the ability of the simulation to discriminate between two alternative outcomes, thus measuring resolution.It is not sensitive to bias in the simulation, so says nothing about reliability.A biased simulation may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the simulation through calibration.
The ROC is thus a basic decision-making criterion that can be considered as a measure of potential usefulness (WMO, 2002).

Genetic algorithm
Genetic algorithm is a technique for optimization of problems or systems.It is inspired from biology, more specifically by genetic codes, where solutions are typically translated into binary code string.The search of optimal solution is regulated by rules based on Darwin's theory on the survival of the fittest, by which the strings are allowed to survive from one generation (i.e.iteration) to another and to trade part of their genetic material with other strings depending of their robustness as defined by the objective function (e.g.Anctil et al., 2006).The present work uses genetic algorithm to identify model subsets optimizing the Continuous Ranked Probability Score.The rules of reproduction, crossover and mutation employed here are well described in Goldberg (1989).
The coded string consists of seventeen elements or positions, each one representing a specific model: 0 values identify models that are not used, while 1 values identify models that are retained.A total of 131 054 combinations of at least two models can be generated from a pool of seventeen candidates.The processes of reproduction, crossover and mutations regulate the search in the domain of all these possible combinations, where the objective function is the inverse squared CRPS.At each generation, 50 combinations are thus investigated.From the initial generation, 20 others are created, leading to the consideration of 1000 model subsets.This search is repeated over all 1061 catchments.
As already mentioned, the first half of the time series is used for optimization, while the second half is used for validation.All results provided herein concern strictly the validation sub-dataset.

Individual model performance
MAE values are used to compare individual model performance, based on their frequency of occurrence in the top 5 ranking for each catchment (Fig. 2).There are clear differences between models.Some of them are more frequently in the top five, such as models 1, 2 and 3, while others are rarely present, such as model 17 and 16 -note that Fig. 2 justifies the model ordering in Table 2.The selected seventeen models thus offer a wide range of individual performance.

Comparison of deterministic and probabilistic simulations
The main scope of the present study is to answer the following question: is there more valuable information in the ensemble simulations than in the deterministic ones?This question is first tackled by comparing the CRPS and the MAE values for the C0 reference ensemble formed by all seventeen models.In Fig. 3, all 1061 catchments lead to a CRPS value lower than the MAE ones, confirming the added value of retaining all the components of the ensembles over their average deterministic values.Note that simulations for each catchment have been standardized by their corresponding mean streamflow observation to facilitate comparison between them.However, it remains possible that some individual models surpass in performance the C0 reference ensemble.Indeed such situations occur quite frequently when relying on deterministic simulations, which provides the lowest MAE for only 38% of the catchments; while for example model 1 surpasses the performance of all the other models including the deterministic simulation in 21% of the catchments (Fig. 4a).The performance gain following the usage of the SMA aggregating multiple model outputs is thus not as universal as proposed by Shamseldin et al. (1997) or Georgakakos et al. (2004).However, the situation gets considerably better when using the probabilistic ensemble simulations (i.e.keeping all individuals model outputs), which improves on the performance of all individual models in 96% of the catchments (Fig. 4b).These striking results confirm the superiority of the probabilistic approach over the deterministic one.The next question concerns the reliability of the ensemble simulations, as assessed by the rank histograms and the reliability plots.Figure 5 presents some examples of rank histograms in order to interpret their corresponding ratio δ values.As mentioned earlier, a threshold δ value has to be established for each experimental set-up because this metric is proportional to the length of the time series.From Fig. 5, it is assessed that, for the simulation system and series length at hand, a ratio δ value of about 20 may be used as a practical upper limit of reliability (Fig. 5c), while value of about 100 is without a doubt under-dispersed (Fig. 5f).This is confirmed by the corresponding reliability diagrams presented for three discharge thresholds, larger than 0 (Fig. 6), than quantile 50 (Fig. 7) and than quantile 75 (Fig. 8) of the observation time series.It is noted that, for some catchments (e.g.catchments 224 and 292), there is an improvement in the reliability of the ensembles for larger discharges.Now considering the entire database, the cumulative frequency of the ratio δ in Fig. 9 shows that reliability is achieved for about one third of the catchments (δ values below 20) and that the system is clearly unreliable for at least 20% of the catchments (δ values larger than 100), the other cases being debatable.An operating simulation system based on the C0 reference ensembles would thus need to include the calibration of the predictive distribution for an important number of catchments, in order to improve their reliability.
Nonetheless the reliability imperfection of our simulation system, its ability to discriminating between events and nonevents is next confronted to the same ability of the deterministic simulations.For that purpose, ROC scores were calculated for threshold values respectively corresponding to quantiles 10, 25, 50, 75 and 90 of the observation time series.Results are shown in Fig. 10.It can be noted that the probabilistic ROC scores are superior to the deterministic ones in almost all cases.This proves again the superiority of the ensemble philosophy over the aggregation philosophy, at least for better event detections, even if the produced ensemble could in many cases be further improved by the application of a calibration procedure.It is also noteworthy that   the predictive distributions are skilled for the large majority of the catchments (ROC values superior to 0.5) and that the system is better at detecting larger events such as quantiles 50 or higher, than low flow events such as quantile 10.For the latter case, the probabilistic simulations largely improve over the deterministic ones that prove to be unskilled for many catchments.

Looking for optimized model ensembles
Could the system performance be further improved through model selection?A genetic search algorithm is used to answer that question, objectively optimizing the CRPS value for each catchment.Such analysis will also help answer some other subsidiary questions like: Are seventeen models enough or too many to produce an operational ensemble?Are all models equally useful to the ensemble subsets or only the ones that perform better individually?Does any gain in performance through optimization come at the cost of a loss of reliability?
The optimization procedure described in Sect.2.3 was applied to all catchments.Many model subsets showed improved performance over the C0 reference ensemble.More specifically, improvements were found for 1057 out of the 1061 catchments, which represent 99.6% of the database.The gain in terms of CRPS resulting from the performed optimization is shown in Fig. 11 (see Eq. (3) where the reference value is C0).The gain varies from 0.3% to 93% with a median value of 5.5%.There is also a gain in the quality of the ensemble's reliability as seen in Fig. 12 that draws the initial ratio δ values against the ones of the optimized subsets: an improvement was obtained in 86% of the cases.However, those gains are not large enough to solve the under dispersion issue of the produced ensembles.A calibration procedure is thus still needed for most catchments.Figure 13a shows the relative frequency of selection of the models in the best subset ensemble of each catchment.When compared to Fig. 2, which showed the frequency of occurrence in the top five ranking, it may be deduced that all models are useful contributors to subset ensembles that outperform the C0 reference ensemble.Nonetheless, the optimization procedure does somehow favour models that lead to the best individual performance, namely 1, 2, 3, 4 and 5, then 7, 8 and 9. Furthermore, no links could be established between the level of complexity of the models (number of optimized parameters and storages) and their usefulness for the optimized subsets.
Figure 13b presents the relative frequency of the number of models in these subsets over all catchments.From 7 to 10 models are deemed sufficient to construct ensembles with improved performance.Figure 14 provides yet another view of the optimized subsets, where they are categorized by number of models, which varies from 2 to 16. Boxplots were produced in order to illustrate the variability of the 1061 CRPS values (standardized with their corresponding mean streamflow observation as in Fig. 3).In general, results show that there exist many subset sizes that improve on the C0 reference performance obtained by pooling all seventeen lumped model outputs (the median for the best optimized combination is 0.1850 and the median for C0 is 0.1976).Furthermore, these subsets are superior to the ones constructed with the best eight individual models (C1 with a median of 0.1965) and with the worst 8 individual models (C2 with a median value of 0.2240).This latter result supports the finding of Viney et al. (2009) that the best ensembles are not necessarily those containing the best individual models, but it seems that the inclusion of some good models is essential.Figure 15 shows an example of rank histograms of one catchment for different ensembles of models (C0, C1, C2 and the optimized subset Coptim).It illustrates the improvement of the spread after the optimization procedure.

Conclusions
The main scope of this work was to compare the added value of ensembles constructed from seventeen lumped hydrological models against their simple average counterparts.Ensembles are probabilistic simulations that allow appreciating the uncertainty according to the spread of their predictive distribution at each time step.For example, they may be used to evaluate confidence intervals for the outputs or probabilities of the streamflow being above a certain threshold value.Conversely, the simple average of the seventeen lumped outputs leads to a single aggregated predictor, which provides no specific information about its uncertainty.
For all 1061 catchments, results showed that the CRPS of the ensembles were lower than the MAE of the aggregated simulations, confirming the added value of retaining all the components of the ensembles over their aggregated deterministic values.Furthermore, the probabilistic simulations outperfom all individual models in 96% of the catchments, while the same occurs for only 38% of the catchments in the case of the aggregated deterministic simulations.
Reliability of the simulation ensembles is achieved for about 30% of the catchments.An operating simulation system would thus need to include a calibration of the predictive distributions in order to improve their reliability.In spite of this imperfection, the ensembles were shown to be skilled at discriminating between events and non-events, based on the ROC scores, especially for larger streamflows.Again, the comparison between probabilistic and deterministic skills was favorable to the probabilistic approach.Genetic algorithm was next used to identify model subsets optimizing the CRPS.Many model subsets were found improving the performance of the reference ensemble.In most cases, from 7 to 10 models selected among the 17 available models were deemed sufficient to construct ensembles with improved performance.However, even if an important dis- parity was noticed between the individual performances of the available models, all of them appeared in many optimized subsets.Furthermore, the optimized subsets were found superior to the ones constructed with the best eight individual models, which means that the best ensembles are not necessarily those containing the best individual models.The gain in performance of the optimized subsets is accompanied by an improvement of the ensemble reliability in 86% of the cases.Nonetheless, a calibration procedure is still needed for many catchments.
More sophisticated aggregation methods may also have been tested, as discussed in the introduction.They may have improved the performance (MAE) of our deterministic simulations, as suggested by the results of previous studies.However, the calibration of the predictive distribution should also improve the performance (CRPS) of the probabilistic simulation.
All in all, this work advocates the increased usage of multiple hydrological models for performance improvement and for uncertainty assessment.However, more work is needed concerning model selection and the sought after diversity that brings the essence of model ensembles: reliability.Some scientific questions remain unanswered and need to be investigated in the future: 1. How much model selection influences multimodel performance and reliability?We suggest constructing multimodel ensembles by using different types of models (e.g.distributed, lumped and even neuronal network models).We theorize that such variety may also improve multimodel ensembles, as the results obtained with different lumped model structures of this study.
2. How uncertainty in initial conditions, meteorological data, and model structure propagates during hydrological forecasting?More research assessing all sources of uncertainty should be carried and emergent tools like particle filtering (e.g.Moradkhani et al., 2005) may help identify the uncertainty sources that should be dealt with in priority.
Acknowledgements.The authors acknowledge the fruitful revisions of Martyn Clark, Massimiliano Zappa and an anonymous reviewer.Editor (Jim Freer) is also thanked for his constructive comments.Financial support for this work has been provided by the Institut EDS, MITACS, and CONACYT-México.The authors also thank Météo-France for making the meteorological data available for this study.
Edited by: J. Freer

Fig. 2 .
Fig. 2. Relative frequency of occurrence in the top 5 ranking, based on individual MAE values for all catchments.

Fig. 3 .
Fig. 3. Mean probabilistic and deterministic scores comparison.Catchments are ordered according to their MAE value.

Fig. 6 .
Fig.6.Reliability plots for the same catchments as in Fig.5, for all time series.Dashed lines depict the 95% confidence interval.

Fig. 7 .
Fig. 7.Reliability plots for the same catchments as in Fig.5, for discharge larger than quantile 50 of the observation series.Dashed lines depict the 95% confidence interval.

Fig. 8 .
Fig.8.Reliability plots for the same catchments as in Fig.5, for discharge larger than quantile 75 of the observation time series.Dashed lines depict the 95% confidence interval.

Fig. 13 .
Fig. 13.Relative frequency of (a) the presence of each model in the optimized subset, and (b) the number of models in these subsets, for all 1061 catchments.

Fig. 14 .
Fig. 14.Box plot of the CRPS over the 1061 catchments, as a function of the number of models per optimized subsets.

Table 1 .
Characteristics of the 1061catchment dataset.

Table 2 .
Models identification and characteristics.