Technical Note : Combining Quantile Forecasts and Predictive Distributions of Stream-flows

Abstract. The enhanced availability of many different hydro-meteorological modelling and forecasting systems raises the issue of how to optimally combine this great deal of information. Especially the usage of deterministic and probabilistic forecasts with sometimes widely divergent predicted future stream-flow values makes it even more complicated for decision makers to sift out the relevant information. In this study multiple stream-flow forecast information will be aggregated based on several different predictive distributions, resp. quantile forecasts. For this combination the Bayesian Model Averaging (BMA) 5 approach, the Nonhomogeneous Gaussian Regression (NGR), also known as Ensemble Model Output Statistic (EMOS) model and a novel method called Beta transformed Linear Pooling (BLP) will be applied. By the help of the Quantile Score (QS) and the Continuous Ranked Probability Score (CRPS), the combination results for the Sihl river in Switzerland with about five years of forecast data will be compared and the differences between the raw and the optimally combined forecasts will be highlighted. The results demonstrate the importance of applying proper forecast combination methods for decision makers in 10 the field of flood and water resources management.


Introduction
The combination, or aggregation, of differing probability distributions into a single one could result in beneficial effects, since the differences between various forecast systems provide a better understanding of the uncertainty about the target quantities, and the aggregates may reflect more accurately the information.However, the biggest advantage of aggre-gation is that the forecaster is not forced to decide a priori which forecast system is the most reliable at the actual point of issuing a forecast, because the combination method will be optimized at each forecast run by taking into consideration the quality of the forecast from previous time steps.Thus, the data themselves will automatically lead to the optimal decision incorporating all available information about the different deficiencies and strengths of the individual forecast systems.
In econometrics and related disciplines, the combination of forecasts has a long tradition starting with Bates and Granger (1969) suggesting the use of empirical weights derived from "out of sample" forecast variances.An overview of the last 40 years of forecast combination in the field of economics can be found in Wallis (2011).Thompson (1977) was one of the first who outlined the advantages of forecast combinations in meteorology and Shamseldin et al. (1997) showed different methods of combining the output of different hydrological models.In Abrahart and See (2002) different combination methods for hydrological forecast models are compared.Diks and Vrugt (2010) compare different model averaging approaches, showing that a simple regression method could result in improvements comparable to more sophisticated methods.
In general the challenge of model combination is that, apart from the simple model averaging methodologies, different weights need to be assigned according to the quality of the forecast of the preceding days and periods.A frequently used method for model averaging and forecast combination is the method of Bayesian model averaging (BMA) introduced by Min and Zellner (1993) and Raftery et al. (1997), where the weights are based on posterior model probabilities within a Bayesian framework.The BMA method has been Published by Copernicus Publications on behalf of the European Geosciences Union.
In Gneiting et al. (2005) and Gneiting et al. (2007) the term "calibration" is used to describe the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the events that materialize.A state of the art calibration and bias correction method is non-homogeneous Gaussian regression (NGR), also known as the ensemble model output statistic (EMOS) technique of Gneiting et al. (2005).It fits a single parametric predictive probability density function (pdf) using summary statistics from the (multi-model) ensemble and corrects simultaneously for biases and dispersion errors.Also, NGR has been applied many times successfully for calibrating and combining hydro-meteorological ensemble forecasts (see for example Hemri et al., 2014).
The Beta-transformed linear pooling (BLP) approach, which has been developed recently by Ranjan and Gneiting (2010) and Gneiting and Ranjan (2013) for combining predictive distributions, will be tested and compared with the NGR and the BMA in this study.To the author's knowledge the BLP and the associated estimation of weights, which assign relative importance to the individual predictive distributions, have not been applied to hydrological forecasts so far.
Before the combination methods are applied, the errors of the hydrological model are corrected in order to minimize the difference between the last available observation and the predictions at the time of initialization of the forecast.This process of error correction is later on called post-processing, since it starts after completing the hydrological simulations and predictions given meteorological observations or forecasts.Depending on the post-processing method, quantiles or pdfs for future streamflows will be derived for each single forecast time step.Whereas quantile regression (QR) methods (Koenker, 2005) and modifications of them will lead to predictions of quantiles, a predictive pdf can be derived for example by the recently developed waveVARX method (Bogner and Pappenberger, 2011) directly.For more details of these post-processing methods, the reader is referred to Bogner et al. (2016), whereas the objective of this paper will be the analysis of combination methods of forecasts.In the next section the three combination methods and the applied verification measures will be described.After the presentation of the data and the results, the outcome of the comparison will be discussed and summarized in the conclusions.

Methods
Three different combination methods have been applied to the flood forecasting system for the Sihl River at station Zurich (Switzerland), where two meteorological forecasts, the 16-member COSMO-LEPS (Montani et al., 2011) and the deterministic C7 system (produced at MeteoSwiss with ≈ 7 km resolution) are implemented (a detailed description can be found in Addor et al., 2011, Ronco et al., 2015, and Liechti et al., 2016).
In a first step the hydrological modelling errors of all these forecasts will be minimized, using a QR method in combination with neural networks (QRNN, Taylor, 2000;Cannon, 2011).This will result in direct estimates of the inverse cumulative density function (i.e. the quantile function), which in turn allows the derivation of the predictive uncertainty (see for example Weerts et al., 2011, López López et al., 2014, and Dogulu et al., 2015, where the application of the QR in order to estimate predictive uncertainties (PUs) is outlined).If the number of estimated quantiles within the domain {0 < τ < 1} is sufficiently large, the resulting distribution could be considered to be continuous.In this study the number of quantiles is set to nine with probability levels τ = 0.01, 0.1, 0.2, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99.In Quiñonero Candela et al. (2006) the cdf or pdf is constructed by combining step interpolation of probability densities for specified τ -quantiles with exponential lower and upper tails, which will be called the empirical method (EMP).Alternatively the pdf could be constructed by monotone re-arranging of the τ -quantiles and estimating a log-normal distribution (LN) to these quantiles for each lead time t.The advantage of the quantile re-arranging and the distribution fitting is 2fold and efficiently prevents known problems occurring with QR: firstly it eliminates the problem of crossing of different quantiles (i.e. the unrealistic but possible outcome of the non-linear optimization problem yielding lower quantiles for higher streamflow values - Chernozhukov et al., 2010 -e.g. the value of the 0.90 quantile is higher than the value of the 0.95 quantile), and secondly it permits the extrapolation to extremes not included in the training sample (Bowden et al., 2012).
This QRNN method will be applied to each ensemble member of the COSMO-LEPS forecasts, resulting in 16 forecasts of quantiles, and to the C7 forecasts.Lichtendahl et al. (2013) have examined averaging quantiles of continuous distributions given by multiple information sources rather than averaging probabilities.Both approaches of probability and quantile averaging have been applied in this paper for averaging the post-processed ensemble prediction system (EPS) based streamflow forecasts in order to get one predictive pdf or quantile forecast.Before applying the probability averaging approach, a pdf has been constructed by the LN method, i.e. a log-normal distribution has been fitted to the re-arranged τ -quantiles.
Thus, in total there are five different forecasts available after post-processing, two based on the application of the QRNN method for the COSMO-LEPS with probability averaging (p.aver.)or quantile averaging (q.aver.),two postprocessed C7 forecasts based on QRNN with the EMP and the LN approach, and one forecast based on the waveVARX method.Additionally the raw COSMO-LEPS forecast will be included in the following combination procedures as well (see Fig. 1).
Three different methods will be tested for optimally combining these six forecast models (M1, . .., M6), which allow us to assign different weights to the raw and five postprocessed forecasts.For the application of the first two methods, BMA and NGR, the streamflow values have been transformed to the normal space by the help of the normal quantile transformation ( Van der Waerden, 1952, 1953a, b).

Bayesian model averaging (BMA)
If the combination is calculated within a Bayesian framework by using weights corresponding to the posterior model probabilities, it is usually referred to as BMA and follows from direct application of Bayes' theorem as explained in e.g.Min and Zellner (1993) and Raftery et al. (1997).
In Raftery et al. (2005) the statistical BMA model is extended to dynamical forecast models, where each forecast and/or ensemble member is represented by a probabilistic distribution for which a weight is assigned based on the past performance of each individual forecast.These weights are used to combine all distributions into one single mixture distribution.Therefore the BMA predictive model of the quantity of interest y is given by where h m is the posterior probability (i.e.weight) of forecast k m , the best forecast derived from its performance in the training period, and the conditional pdf of y on k m , g m (y|k m ), given that k m is the best forecast in the ensemble with m = 1, . .., M members, or models.The transformation of the streamflow values to the normal space beforehand allows the application of the BMA method based on mixtures of univariate normal distributions.In the work of Wang et al. (2012) and Schepen and Wang (2015) variants of the BMA method have been applied, which allow the direct usage of the cdfs for estimating the weighting parameters.However, in this study these BMA approaches have not been implemented, and the estimated medians (τ = 0.5) from the five post-processing methods and from the raw COSMO-LEPS are taken as input only in order to allow better comparison with the following NGR approach.

Non-homogeneous Gaussian regression (NGR)
Another possibility to address underdispersion and forecast bias is the use of the NGR method, also known as EMOS, and is based on multiple linear regression for linear variables, such as temperature or streamflows, and logistic regression for binary variables, such as precipitation occurrence or freezing.More information about the MOS technique can be found for example in Glahn and Lowry (1972) and Wilks (1995).Its extension for ensembles is explained in Gneiting et al. (2005) and a brief summary of this method is given hereafter.Let y denote again the variable of interest (e.g.streamflow) and let k 1 , . .., k M be the corresponding forecast of the M ensemble members or models.If N (µ, σ 2 ) denotes a Gaussian density with mean µ and variance σ 2 , the NGR predictive distribution is given by where Thus the predictive mean is equal to the regression estimates with coefficients a 0 , . .., a m , b 0 , and b 1 and forms a bias-corrected weighted average of the different forecasts (ensemble members), whereas the predictive variance depends linearly on the variance of the forecast models (ensemble members).Although modifications for the NGR exist for non-normal distributed variates (see for example Baran, 2014;Baran and Lerch, 2015), the streamflow values have been transformed to the normal space for comparison reasons and the medians (τ = 0.5) from the five post-processing methods and from the raw COSMO-LEPS are taken as input as in the BMA method.

Beta-transformed linear pool (BLP)
In Ranjan and Gneiting ( 2010) it has been stated that any non-trivially weighted average of distinct probability forecasts will be uncalibrated and lack sharpness, even when the individual forecasts have been calibrated.Hence they suggested a composite of the traditional linear pool with a beta transform.The aggregation method introduced by Ranjan and Gneiting (2010) and Gneiting and Ranjan (2013) considers the BLP for a set of predictive cdfs F 1 , . .., F M as for y ∈ R, where B α,β denotes the cdf of the standard Beta distribution with parameters α > 0 and β > 0 and ω 1 , . .., ω M are non-negative weights that sum to 1.The BLP density forecast for the component densities f i , . .., f M then is  likelihood method.The log-likelihood function for the BLP model (Eq.4) is (ω 1 , . .., ω M ; α, β) where B is the classical Beta function.This BLP approach has been applied now to combine the different forecast systems.The quantiles resulting from the QRNN method (models M1, M4, and M5) forecasts have been converted to pdfs by applying the LN method (by fitting a log-normal distribution to the re-arranged τ quantiles).

Verification
Although probability and quantile forecasts are both probabilistic products, the former is expressed in terms of a probability (e.g. that a certain threshold will be exceeded) and the latter is given by a quantile for a particular probability level of interest (Bouallègue et al., 2015).Since the outputs of the QRNN model are quantiles, it is reasonable to evaluate the performance with a skill score which has been developed for predictive quantiles (Koenker and Machado, 1999;Friederichs and Hense, 2007), known as the quantile score (QS).It is based on an asymmetric piecewise linear function, the so-called check function, ρ τ y i − q τ,i , which is a function of the probability level τ (0 < τ < 1) and the error between the observation y i and the quantile forecast q τ,i for i = 1, . .., N , where N is the sample size.The check function is defined as ρ τ y i − q τ,i = τ y i − q τ,i ∀y i ≥ q τ,i (τ − 1) y i − q τ,i ∀y i < q τ,i and the QS results as the mean of the check function with penalties 1 − τ and τ for under-and over-forecasting (see The CRPS compares the forecast probability distribution with the observation and both are represented as cdfs.If F is the predictive cdf and y is the verifying observation, Gneiting and Ranjan (2011) showed that the CRPS can be defined equivalently as a standard form, Thus, in the standard form (Eq. 8) an ensemble of predictions can be converted into a piecewise constant cdf with jumps at the different models (ensemble members), and I {.} is a Heaviside step function, with a single step from 0 to 1 at the observed value of the variable.The equivalence of Eqs. ( 8)-( 9) was noted by Laio and Tamea (2007).For the quantile forecast q τ = F −1 (τ ), the integrand in Eq. ( 9) equals the quantile score, i.e. the mean of the check function (Eq.6).That means the CRPS corresponds to the integral of the QS over all thresholds, or likewise the integral of the QS over all probability levels (Laio andTamea, 2007 andGneiting andRanjan, 2011).Hence, the CRPS averages over the complete range of forecast thresholds and probability levels, whereas the QS looks at specific τ -quantiles; thus, it is more efficient in revealing deficiencies in different parts of the distributions, especially with respect to the tails of the distribution.Both verification measures are negatively oriented, meaning the smaller the better.

Results
COSMO-LEPS and C7 forecasts are available from 24 February 2010 to 27 April 2016 once a day with hourly time resolution, which have been post-processed in order to derive predictive distributions and quantile forecasts.To calibrate and validate the post-processing parameters (QRNN and waveVARX), the data sets of available hourly observations and corresponding simulations have been split into two halves (calibration period: 2010-2012; validation period: 2013-2016).The results of the validation, which are not shown due to lack of space, highlight the improvements of the QRNN method (similar to the results shown in Bogner et al., 2016).
The weighting parameters of the combination methods are estimated by applying a moving window with a size of 7 days (168 h) for optimization.Different window sizes have been The six forecasts are the QRNN method for the COSMO-LEPS with quantile averaging (QRNN-CL-q.)-M1, probability averaging (QRNN-CL-p.)-M2, the waveVARX(-CL) method -M3, the raw COSMO-LEPS (CL) forecast -M4, the two post-processed C7 forecasts based on QRNN with the EMP -M5, and the LN approach -M6.tested as well, but 7 days was chosen finally as a trade-off between computing time and efficiency.In Fig. 2 an example of the temporal evolution of the hourly weights for a lead time of 48 h for the three combination methods is shown.
Before the forecast skill of the three combination methods are compared, the statistical consistency between the predictive cdf and the observations are analysed with the help of the probability integral transform (PIT) as proposed by Dawid (1984) (see Fig. 3).In the case of well-calibrated fore- casts, the sequence of PIT values will follow a uniform distribution U (0, 1).U-shaped PIT histograms indicate underdispersed forecasts with too little spread on average, and inverse U-shaped histograms correspond to overdispersed forecasts (see for example Gneiting et al., 2007;Laio and Tamea, 2007).
The question now is whether there are significant differences between the three combination methods.Therefore the QS has been applied at first to highlight possible differences between the combination methods in more detail.
In Fig. 4 the results of the QS at four lead times for the raw COSMO-LEPS (C-L, black line) and for the three combination methods BLP (red line), NGR (green line), and BMA (blue line) are shown and compared to the QS results of the raw C-L (black circles).Additionally, a simple quantile mapping (QM) is applied (cyan diamonds) to the raw C-L forecasts in order to evaluate the positive effect of using more complex methods.Thereby the cdf of the raw forecast is matched to the cdf of the observations.As mentioned in Zhao et al. (2017), QM is highly effective for bias correction, but ensemble spread reliability problems cannot be solved properly.
In Fig. 5 the CRPS results of the six forecast models are shown in comparison to the BLP in order to demonstrate the motivation of aggregating these systems.As can be seen clearly, the combined forecast outperforms each of the individual forecasts in view of the CRPS.
The CRPS for the raw C-L, the QM approach and the three combination methods is shown in Fig. 6.

Discussion
So far most of the studies comparing the results of the BMA and the NGR approach have not found any preference (see for example Williams et al., 2014).In this paper these two methods are checked against the BLP, which has not been used for hydrological purposes until now.In a first step the weights derived for each individual, raw and post-processed, forecast system are compared.The pattern of these optimized weights in Fig. 2 shows rather vague similarities between the three combination methods.The BLP and the NGR are in general more spiky, with rapid changes between consecutive hours.This could result from problems in convergence from the optimization algorithm applied for estimating the parameters ("constrOptim" in R R Core Team, 2016).
In general the weights show some periodicity, which indicates that some models are more appropriate to be used in   certain seasons and for certain flow conditions during a year.However, the limited amount of data does not allow us to draw clear conclusions.
The results of the PIT clearly indicate that all three combinations result in well-calibrated forecasts with close to uniform histograms.In Fig. 3 the examples for the 48 h forecast are given, highlighting the heavy underdispersiveness of the raw forecasts.The same behaviour is visible for almost all lead times; however, the raw COSMO-LEPS forecasts get less underdispersed with increasing lead time, since the spread and the uncertainty in the ensemble increase.
The analysis of the QS (Fig. 4) shows slightly better results for the BLP, followed by the NGR and BMA.The raw COSMO-LEPS (C-L) and the QM are much worse, especially for smaller lead times.It is interesting to see that the QS of the raw C-L follows a straight line for smaller lead times (6 and 12 h) in the same manner as one would expect from deterministic forecasts, because of the underdispersiveness of the C-L at the beginning of the forecast horizon.The slope of this line is an indicator of the size of the (positive) bias.The QM at a lead time of 6 h is also a straight line, however, with an opposite but much smaller and negative slope (bias) in comparison to the raw C-L.With increasing lead times the QS of the raw C-L and the QM forecasts come closer to the combined forecasts for probability levels between 0.1 and 0.5.This is most probably caused by the increased spread of the ensemble.However, for a lead time of 24 and 48 h, the raw C-L forecasts still show the worst behaviour at higher flows, whereas the QM method performs at a lead time of 48 h almost as well as the combination methods, apart from the forecasts around the median.
As already stated previously, the comparison of the CRPS of the different post-processed methods and the aggregated ones (e.g.BLP) clearly identifies the advantage of combination (Fig. 5).The CRPS, i.e. the integral of the QS, for the different combination methods (Fig. 6) confirms the results of the QS.In general the results of the BLP are slightly better than the NGR and BMA results.It seems that for those periods of lead times, where the BLP is not superior (e.g.around 20 h), the optimization routines had problems on convergence.However, further analysis will be necessary.The comparison with the QM approach confirmed the results of Zhao et al. (2017), since the forecast quality did not show any improvements at the first lead times because of the underdispersiveness of the raw C-L.Thus, the more complex combination by far outperforms the QM method.

Conclusions
Combination is an essential tool for improving the forecast quality.The different methods are all more or less equally suited.Although the BLP showed slightly better results, the straightforward application and the low computational costs of the NGR make this method an equally good alternative, at least for this case study.The parameter estimation of the BMA and the BLP could get quite time-consuming and sometimes results in suboptimal solutions, which could degrade the gain of applying combination methods.

Figure 1 .
Figure1.Set of six different forecast models available for combination, five post-processed plus one raw forecast.For the quantile averaging (M1) and the probability averaging (M2) method, an example of averaging two ensemble members is indicated.

Figure 3 .
Figure 3. Probability integral transform (PIT) of the raw and three combined forecasts at a lead time of 48 h.

Figure 4 .
Figure 4. Quantile score (QS) for various lead times and the three combination methods in comparison to the raw COMSO-LEPS and a simple quantile mapping (QM) approach.

Figure 6 .
Figure 6.CRPS of the raw and combined forecasts.