Interactive comment on “ Post processing r infall forecasts from numerical weather prediction models for short term streamflow forecasting ”

Sub-daily ensemble rainfall forecasts that are bias free and reliably quantify forecast uncertainty are critical for flood and short-term ensemble streamflow forecasting. Post- processing of rainfall predictions from numerical weather prediction models is typically required to provide rainfall forecasts with these properties. In this paper, a new approach to generate ensemble rainfall forecasts by post-processing raw numerical weather prediction (NWP) rainfall predictions is introduced. The approach uses a simplified version of the Bayesian joint probability modelling approach to produce forecast probability distributions for individual locations and forecast lead times. Ensemble forecasts with appropriate spa- tial and temporal correlations are then generated by linking samples from the forecast probability distributions using the Schaake shuffle. The new approach is evaluated by applying it to post- process predictions from the ACCESS-R numerical weather prediction model at rain gauge locations in the Ovens catch- ment in southern Australia. The joint distribution of NWP predicted and observed rainfall is shown to be well described by the assumed log-sinh transformed bivariate normal dis- tribution. Ensemble forecasts produced using the approach are shown to be more skilful than the raw NWP predictions both for individual forecast lead times and for cumulative to- tals throughout all forecast lead times. Skill increases result from the correction of not only the mean bias, but also biases conditional on the magnitude of the NWP rainfall prediction. The post-processed forecast ensembles are demonstrated to successfully discriminate between events and non-events for both small and large rainfall occurrences, and reliably quan- tify the forecast uncertainty. Future work will assess the efficacy of the post-processing method for a wider range of climatic conditions and also investigate the benefits of using post-processed rainfall fore- casts for flood and short-term streamflow forecasting.


Introduction
Forecasts of streamflow are valuable to a range of users.Forecasts of potential flood conditions provide emergency and water managers with the opportunity to plan mitigation strategies and responses such as evacuations (Roulin, 2007;Penning-Rowsell et al., 2000;Blöschl, 2008).Forecasts of within bank streamflow events, such as freshes and low flow conditions, allow water managers to optimise water distribution, minimise potential damage to private property and maximise environmental benefits in regulated streams (George et al., 2011).All these water management actions can potentially have a range of costs and benefits and therefore forecast users require an indication of forecast uncertainty to allow the risks associated with management decisions to be assessed.
Forecasting streamflows requires estimates of the catchment wetness at the forecast time and predictions of the weather conditions, particularly rainfall, during the forecast period.Neither of these components can be known precisely at the time a forecast is made and therefore both are sources of streamflow forecast uncertainty.Hydrological models used to transform observed and forecast rainfall to streamflow simulations also introduce uncertainties in streamflow forecasts through their simplified representations of the true hydrological processes (Pokhrel et al., 2013;Gupta et al., 2006).In this paper we focus on methods of quantifying the uncertainty associated with predictions of rainfall during the forecast period.

D. E. Robertson et al.: Post-processing rainfall forecasts from numerical weather prediction models
In Australia, numerical weather prediction (NWP) models provide forecasts of weather conditions for lead times of up to 10 days.However, raw output that is publicly available from Australian NWP models is deterministic and often contains systematic errors (Shrestha et al., 2013).These errors can emerge from two major sources (Ebert, 2001).Finescale physical processes are parameterized in NWP models in order to run them at the relatively coarse spatial and vertical resolutions necessary for routine operational applications.NWP models also require the initial conditions of the atmosphere and land/sea surface to be specified for each forecast.Both the model parameterizations and initial conditions are potential sources of systematic forecast errors.Outside Australia, ensemble predictions systems have been developed to reduce systematic errors and quantify forecast uncertainty by producing multiple runs of the NWP model with varying initial conditions or model parameterizations.However the spread of the ensemble is commonly too narrow and therefore not reliable in a probabilistic sense (Hamill and Colucci, 1997;Santos-Muñoz et al., 2010;Clark et al., 2011).
Statistical calibration or post-processing methods are frequently applied to correct biases and produce forecasts that reliably quantify uncertainty.Many methods use some form of probability model to post-process forecasts for a single forecast period and location (Wilks, 2006;Schaake et al., 2007;Wu et al., 2011;Kleiber et al., 2010;Sloughter et al., 2007;Glahn and Lowry, 1972;Hamill et al., 2004).A common approach for meteorological applications is to use a two part probability model, where the probability of precipitation occurrence is post-processed using logistic regression and the rainfall amount modelled using a Gamma distribution conditioned on the raw NWP output (Sloughter et al., 2007).There are numerous variants of this approach using different transformations for NWP predicted rainfall, and observed rainfall and levels of complexity in the logistic regression and Gamma distribution conditioning models (Hamill et al., 2004;Sloughter et al., 2007).Generalising the approach requires a considerable number of parameters and risks overfitting.For hydrological applications, methods which model the joint distribution of NWP rainfall predictions and their corresponding observations have been developed (for example, Wu et al., 2011;Schaake et al., 2007).These joint distribution modelling methods have complex parameterizations and require the appropriate transformations for data normalisation or marginal distributions to be selected at each location.
Post-processed NWP rainfall predictions produced by applying a probability model to each forecast period and location separately will not contain the appropriate spatial and temporal correlation structures necessary for streamflow forecasting applications (Clark et al., 2004;Schaake et al., 2007;Wu et al., 2011).Statistical post-processing methods which explicitly model spatial and temporal correlations structures are typically computationally expensive and are yet to be widely adopted for operational streamflow forecasting applications.To overcome these computational challenges, Clark et al. (2004) described the "Schaake shuffle" which produces ensemble forecasts by linking samples from discretely post-processed forecasts to follow historically observed spatial and temporal correlation patterns.
Recently, the Bayesian joint probability (BJP) modelling approach (Wang and Robertson, 2011;Wang et al., 2009) has successfully post-processed seasonal rainfall predictions from the global climate model (POAMA) effectively removing biases and reliably quantifying forecast uncertainty (Wang et al., 2012a;Charles et al., 2011).The formulation of the BJP modelling approach is similar to the methods described by Wu et al. (2011) and Schaake et al. (2007), and therefore it may also be useful for post-processing sub-daily rainfall predictions.The advantage of the BJP modelling approach is that it provides a highly flexible probability model with relatively few parameters, through its use of a parametric transformation for data normalisation and variance stabilisation, and Bayesian parameter inference methods.However, sub-daily rainfall totals have a more highly skewed distribution and considerably greater intermittency of precipitation than seasonal rainfall totals, and therefore the performance of the approach may be limited due to shortcomings in the parametric transformation and the treatment of precipitation intermittency as a problem of censored data.
The objective of this study is twofold.Firstly we assess whether the BJP modelling approach can be effectively used to post-process sub-daily rainfall predictions from a deterministic NWP model for single forecast lead times.Secondly we assess the performance of ensemble rainfall forecasts produced by linking samples from the post-processed probabilistic forecasts using the Schaake shuffle, demonstrating that the post-processed forecasts are more skilful than the raw output from the NWP and that the forecast uncertainty is reliably quantified.
The remainder of the paper is structured as follows.The next section describes the NWP predictions and observed data used in this study.Section 3 describes the implementation of the BJP modelling approach for post-processing subdaily rainfall predictions and methods used to check model assumptions and verify forecasts.Section 4 presents results for model checking and forecast verification.In Sect.5, we discuss the potential limitations of the method and the current application, and identify possible extensions.Section 6 provides a summary of the paper and draws conclusions.

Study catchment and data
For this study we focus on the Ovens catchment in south-east Murray Darling Basin in Australia.A continuous flood and short-term flow forecasting system is being developed for the catchment because it provides a significant source of unregulated inflow to the Murray River and has several urban centres that have experienced significant economic damage from flooding.The time of concentration to the catchment outlet is of the order of four to five days; however the time of concentration to some flood sensitive areas within the catchment can be less than 24 h and therefore hydrological models are run at sub-daily time steps (Pagano et al., 2011).
Hourly observed precipitation data were obtained from the operational flood forecasting database of Australian Bureau of Meteorology for 33 rain gauges located in the Ovens catchment (Fig. 1).Carboor Upper is highlighted in Fig. 1 as many of the results presented focus on this site.Mean annual rainfall at the 33 gauges locations varies between 550 mm, near the catchment outlet, and 1950 mm in the catchment headwaters.An historical archive of hourly precipitation data is available from September 1991.However as the data are observations used operationally, the archive contains missing records for some locations and times.Rain gauge data were used for this study rather than the subcatchment rainfall used for real-time forecasting.This was done to limit the influence of artefacts resulting from missing data that are introduced by the interpolation techniques currently in operational use.
Rainfall predictions were obtained from the Australian Community Climate Earth-System Simulator (ACCESS).Several variants of the ACCESS model are used to form the Australian Parallel Suite (APS), which is the basis of numerical weather prediction in Australia (Australian Bureau of Meteorology, 2010).For this study we use predictions from the regional ACCESS model (ACCESS-R) which is run every 12 h (00:00 and 12:00 UTC) at a 37.5 km resolution out to a lead time of 72 h.ACCESS-R data are available at 1 h intervals.The domain of the regional ACCESS model extends from 65 • S, 65 • E to 17.125 • N, 184.625 • E and boundary conditions are sourced from the global AC-CESS model, which runs at approximately 80 km resolution.Hindcasts for the ACCESS suite of models are not available.An archive of real-time predictions for a 20 month period (approximately 600 forecasts) extending from January 2010 to August 2011 is available.While a longer record is desirable it is unlikely to be available for operational forecasting applications in Australia.
In operational conditions, streamflow forecasts are issued once a day at 23:00 UTC (09:00 LST -local standard time).For this study we use the most recently issued NWP prediction (12:00 UTC) that is available when the streamflow forecasts are made.This means that the first eleven hours of NWP rainfall predictions are neglected and post-processing is applied to NWP predictions between 11 and 72 h after the time of forecast issue.Forecasts for these periods are subsequently referred to as lead times 0 to 60 h, where lead time 0 forecasts are for the hour commencing 23:00 UTC on the day the forecast is issued.

Post-processing NWP model rainfall predictions
We apply a modified version of the BJP modelling approach to post-process raw NWP rainfall predictions for individual forecast lead times.Full details of the BJP modelling approach are provided in Wang et al. (2009) and Wang and Robertson (2011) here we present a brief overview to highlight the differences between the original implementation and the application used in this study.In contrast to Wang et al. (2009) and Wang and Robertson (2011) our formulation is for a bivariate problem where a single predictor and single predictand are used.The model predictor (y 1 ), in this case NWP rainfall predictions for a single lead time, and predictand (y 2 ), in this case observed rainfall, are arranged as a column vector y = y 1 y 2 .
For this study we apply log-sinh transformations (Wang et al., 2012b) to normalize the variables and stabilize their variances rather than the Yeo-Johnson transformation (Yeo and Johnson, 2000) where The set of model parameters (θ) describe the transformation, using two parameters (α and β), mean (µ) and standard deviation (σ ) for each predictor and predictand, and correlation coefficients (r).All model parameters are reparameterized to ease parameter inference.Reparameterizations of model parameters are described in the Appendix.
The original formulation of the BJP modelling approach for seasonal forecasting infers model parameters and their uncertainties using Markov chain Monte Carlo methods to sample from the posterior parameter distribution p (θ|Y OBS ), where Y OBS = y 1 OBS , y 2 OBS , . . ., y n OBS and y t OBS is the observed predictor and predictand data for event t, t = 1, 2, . . ., n. Formulation of the posterior parameter distribution is detailed in the Appendix.
For operational short-term forecasting applications considerably more data are available to infer model parameters than for seasonal forecasting applications.This will reduce parameter uncertainty, and computational resources required to infer parameter uncertainties using a large data set may not necessarily be available in real-time.Therefore, in this study we obtain a single set of model parameters that gives the maximum a posteriori (MAP) solution.
We obtain the MAP solution for the joint distribution of model parameters using a stepwise approach.We obtain the parameters describing the MAP solution of the log-sinh transformed normal distribution for the marginal distribution of each predictor and predictand separately.We find the MAP solution using the shuffled complex evolution algorithm (Duan et al., 1994) to ensure that a global optimum is found.We then use the parameters describing the MAP solution for the marginal distributions of the predictors and predictands in the joint distribution and infer the matrix of transformed correlation coefficients that describe the MAP solution for the joint log-sinh transformed bivariate normal distribution.
To produce a probabilistic forecast using as single set of parameters, the transformed bivariate normal distribution is conditioned on the predictor value using the procedure described by Wang and Robertson (2011).Where a predictor value is equal to the censoring threshold, data augmentation is used to generate a value less than the censoring threshold and the joint distribution is conditioned on the augmented predictor value (Wang and Robertson, 2011;Robertson and Wang, 2012).We draw 1000 samples from the conditional distribution to represent the forecast probability distribution.If the predictor value is equal to the censor threshold and data augmentation is required, then a different augmented predictor value is used for each sample drawn.
The models have a single predictor (NWP rainfall predictions for a single lead time) and a single predictand (observed rainfall).Different censoring thresholds are used for the predictor and predictand to reflect the differing precisions of available data.The censoring threshold for observed rainfall is 0.2 mm which is the minimum measurable rainfall amount for the majority of operational tipping bucket rain gauges.Observed rainfall data contained values less than 0.2 mm which resulted from data regularisation procedures and therefore these data were not considered reliable observations.The censoring threshold for NWP rainfall predictions is set to 0.01 mm.A lower threshold is used for NWP rainfall predictions because they represent average rainfall over a large spatial extent.Therefore, rainfall predictions lower than the minimum measurable amount is likely to result in measurable rainfall at some specific locations.A nonzero threshold was imposed in the NWP rainfall predictions because the data contained some very small values that were found to be artefacts of numerical processing methods.
Models were established for three-hour rainfall accumulations.Separate models were established to post-process NWP rainfall predictions for each forecast lead time and rain gauge location.These modelling methods were informed by previous analysis which showed that the skill of predictions of three hour rainfall accumulations is greater than for one hour rainfall accumulations; there is a diurnal cycle in the mean bias of the NWP, and the correlation between observed and NWP rainfall is spatially variable and decreases with lead time (Shrestha et al., 2013).
These post-processed probabilistic forecasts of three hour rainfall accumulations (for lead times of 0-60 h) are random samples from independent probability distributions and hence ensemble members created by linking these samples in a simple manner will not contain appropriate spatial and temporal correlation structures.We apply the Schaake shuffle (Clark et al., 2004) to generate ensembles with appropriate spatial and temporal correlations from the post-processed probabilistic forecasts.The Schaake shuffle uses many historically observed time series for a period corresponding to the probabilistic forecasts as the basis for the spatial and temporal correlation structures.Time series of observation ranks are obtained by ranking the observations within each time step and location.An ensemble member is then constructed using one time series of observation ranks.For each time step the observation rank is replaced with the sample of the corresponding rank from the probabilistic forecast.The full ensemble is constructed by repeating this process for all time series of observation ranks.

Model checking
The proposed post-processing method makes assumptions about the form of the marginal and joint distributions of observed and predicted rainfall.It is necessary to establish that the assumed log-sinh transformed bivariate normal distribution is consistent with observations.We check two aspects of the assumed distribution in fitting mode: (1) the consistency of observed and modelled marginal distributions of the predictor and predictand; (2) the consistency of modelled and observed correlation coefficients.
To assess the consistency of the observed and modelled marginal distributions, the joint model is fitted to all available data using the procedure described in the previous section.The marginal distributions are then derived numerically as follows.A set of sample vectors is drawn from the fitted joint model of predictors and predictands.The number of samples in the set is equal to the number of observations used in model fitting.A cumulative distribution marginal is then produced for the predictor and predictand.This cumulative marginal distribution reflects only one realisation from the fitted joint model.Multiple, in this case 1000, realisations of the cumulative marginal distribution are then generated to represent the uncertainty associated with taking a limited set of samples from the fitted joint distribution.The median and the [0.05, 0.95] uncertainty bands of the cumulative marginal distributions are then extracted from the multiple realisations and compared with observed data in a probability plot.Comparisons are made in both the transformed and untransformed space.
A similar procedure is used to assess the consistency between the modelled and observed correlation coefficients.A set of sample vectors is drawn from the fitted joint distribution of predictor and predictand.The number of samples in the set is identical to the number of observations used in model fitting.The modelled correlation coefficient between the predictor and predictand is computed from the set of sample vectors.This correlation coefficient represents only a single realisation from the fitted joint distribution.Uncertainty in the modelled correlation coefficient is estimated by generating 1000 sets of sample vectors from the joint distribution and computing the correlation for each set.The median and [0.05, 0.95] uncertainty bands of the modelled correlation coefficients are then extracted and compared to the observed value.Kendall's rank correlation coefficient is used as it is more appropriate for variables that are highly skewed and contain many zero values than the more commonly used Pearson's correlation coefficient.

Forecast verification
The quality of the post-processed rainfall forecasts is assessed using a leave-one-month-out cross-validation procedure.The procedure is implemented by inferring parameters of the joint distribution using all available data with the exception of one month.Rainfall for all the events in the leftout month are then forecast and compared to corresponding observations.This procedure is used to ensure that the forecasts are verified independent of model fitting and a similar number of data are used to fit the model as will be available operationally.
Many aspects of the performance of the post-processed ensemble rainfall forecasts need to be assessed.The performance of forecasts is assessed for individual forecast lead times and for cumulative forecast totals.This enables the performance of the post-processing probability model and the efficacy of the Schaake shuffle ensemble generation method to be assessed separately.Aspects of forecast performance that are assessed include: skill, bias, discrimination and reliability.We also assess the correlation structure of the postprocessed forecasts to establish the efficacy of the Schaake shuffle.

Forecast skill
Forecast skill is a measure of the quality of a set of forecasts relative to a baseline or reference set of forecasts (Jolliffe and Stephenson, 2003).Skill scores describe the percentage reduction in a measure of forecast error relative to a reference forecast and therefore characterise the benefit of using the forecast of interest rather than the reference forecast.In this study, the continuous ranked probability score (CRPS; Hersbach, 2000) is used as the measure of forecast error and the reference forecast is climatology.The climatology reference forecast is the cross-validation marginal distribution of observed rainfall.We compare the CRPS skill score of the raw NWP rainfall predictions and post-processed rainfall forecasts.For the raw deterministic NWP rainfall predictions, the CRPS reduces to the mean absolute error.
In addition to assessing the overall or unconditional skill of the post-processed forecasts we also assess how the skill varies with the size of the forecast event.We undertake this conditional skill assessment by computing skill scores conditioned on forecast mean exceeding a range of thresholds from 0.2 to 5 mm.For the conditional skill scores we estimate the sampling uncertainty through bootstrap resampling (Shrestha et al., 2013) and present the [0.05, 0.95] confidence intervals.

Forecast bias
Forecast bias is the average difference between the mean of the probabilistic forecast and corresponding observation.Biases in rainfall forecasts will potentially be amplified in streamflow forecasts and therefore it is important that rainfall forecast have minimal bias.Forecast bias, as a percentage of the observed value, is assessed for the raw NWP predictions and post-processed forecasts for individual forecast lead times and cumulative forecast totals.We also assess the conditional bias, and sampling uncertainty, of the postprocessed forecasts by computing forecast bias conditioned on forecast mean exceeding a range of thresholds from 0.2 to 5 mm.

Forecast discrimination
Significant streamflow events primarily result from significant rainfall events.Therefore, it is important for rainfall forecasts to be able to identify significant rainfall events when they occur.The relative operating characteristic (ROC) assesses the ability to discriminate between events and nonevents.The ROC plots the hit rate against the false alarm rate for a range of probability thresholds.For unskilled forecasts a ROC plot will follow a diagonal line, where as perfect forecasts will a ROC plot will travels vertically from the origin to the top left of the diagram and then horizontally to the top right.Note that unskilled forecasts from a forecast discrimination sense are not the same as climatology forecasts used as a reference for the CRPS skill score, rather they imply that the forecast event probabilities are random.Here, ROC plots are used to assess forecast discrimination for two important forecast events, the event of rainfall less than 0.2 mm and the event of rainfall greater than 5 mm.Forecast discrimination is assessed for individual forecast lead times and for cumulative forecast totals.To understand how the forecast discrimination varies for forecast events ranging between 0.2 and 5 mm we compute the area under the ROC curve for a range of threshold over that interval and also estimate the uncertainties using bootstrap resampling.

Forecast reliability
Forecast reliability is concerned with the statistical consistency between the forecast probability distributions and the observed frequency of associated events (Toth et al., 2003).The reliability of the forecast probability of an event of rainfall less than 0.2 mm and the forecast probability of an event of greater than 5 mm are assessed using reliability diagrams (Wilks, 2006).We produce reliability diagrams using forecasts for individual forecast lead times and for cumulative forecast totals.The reliability diagram for individual forecast lead times assesses the reliability of forecasts made using individual post-processing models.We assess the reliability of pooled forecasts for day 1 (lead times of 0-21 h) and for day 2 (lead times of 24-45 h).The reliability diagrams for the cumulative forecast totals assesses the ability of the Schaake shuffle to restore the appropriate correlation structure of the forecast ensembles.We assess the reliability of forecast total rainfall for for day 1 (lead times of 0-21 h) and for day 2 (lead times of 24-45 h).

Forecast correlations
The Schaake shuffle is applied to ensure that the forecast ensembles have the appropriate correlation structures.We check that the temporal correlations in the ensemble forecast are appropriate by comparing the lag-1 Kendall correlation of the post-processed forecasts before and after the Schaake shuffle to the corresponding observed lag-1 Kendall correlation for all forecast lead times.

Results
Model fitting and forecast verification results were obtained for all 33 rain gauges in the Ovens catchment.Here we focus the presentation of results on a single rain gauge (site 82163 Carboor Upper, shown in Fig. 1), which is located near the centre of the catchment.

Model fitting
Figure 2 presents the modelled and observed marginal distributions in both the transformed and untransformed space for a single location and forecast lead time.The modelled and observed marginal distributions appear to be consistent both in the transformed and untransformed space.The majority of observed values generally lie within the 90 % uncertainty band and observed values falling both above and below the modelled median marginal distribution.Results for other forecast lead times at this site and other sites are not shown but are comparable to the results for this site.
Figure 3 presents the fitted and observed correlations between NWP predicted and observed rainfall for all forecast lead times at a single site in the Ovens catchment.The modelled correlations appear to be consistent with observed values.The number of observed correlations lying outside the 90 % uncertainty band is consistent with expectations as one observed correlation lies above the 90 % uncertainty band and one lies below.Results for other sites in the Ovens catchment are not shown, but are comparable to those presented in Fig. 3.
The model checking results shown in Figs. 2 and 3 suggest that the log-sinh transformed bivariate normal distribution is consistent with observed data and therefore appropriate for modelling the joint distribution of NWP predicted and observed rainfall.periods.The raw NWP predictions have negative skill for some individual periods, suggesting that it would be better to use a climatology forecast.However, post-processing produces rainfall forecasts with positive skill for all lead times out to 57 h.Forecast skill is highest for rainfall predictions for the 3-6 h lead time and displays a gradual decline with increasing lead time.Post-processing results in marked improvements in skill over the raw NWP predictions, with the skill of the post-processed forecast being on average 37 % higher than the raw NWP predictions.The skill of post-processed forecasts of cumulative rainfall totals (Figs. 4 and 5), increases for the first three lead times and then remains relatively stable at a CRPS skill score of approximately 50 % out to 57 h.The raw NWP rainfall predictions display similar behaviour, but skill scores are approximately 20 % lower than the post-processed forecasts for all lead times.The skill of the cumulative forecasts is greater than forecasts for individual periods because errors in individual periods will tend to compensate for each other.

Forecast skill
Figure 5 presents the skill of the post-processed forecasts conditioned on the forecast mean exceeds a range of thresholds ranging between 0 and 5 mm for 0-3 and 30-33 h lead times.The skill of the post-processed forecasts tends to increase as the conditioning threshold increases.In parallel, the sample size reduces rapidly and consequently the uncertainty of the skill estimates grows.This suggests that the skill of the post-processed forecasts appears to be consistent over the range of thresholds assessed.

Forecast bias
Figure 6 presents the bias in the raw NWP rainfall predictions and post-processed forecasts as a function of lead time.The post-processed forecasts display little forecast bias at any lead time.Bias in the raw NWP rainfall predictions tends to be cyclical and can be as great as 50 % of the observed mean.The cyclic nature of biases in the raw NWP rainfall predictions is likely the product of the limited ability of NWP models to describe the diurnal cycle.Post-processing methods can overcome this limitation, provided that they are developed in a manner that allows for diurnal variations in forecast performance, as done here.
Correction of the forecast bias will be the greatest contribution to improvements in forecast skill.Figure 6 displays the correction to the mean bias, however, bias correction using the BJP modelling approach is more sophisticated than just correcting the mean bias.Using different marginal distributions, and particularly transformations, for the raw NWP rainfall predictions and observed data allows for a non-linear bias correction (Fig. 7).This results in improvements in forecast skill that are greater than those that would be achieved by just correcting the mean bias.
As expected from the previous analysis, biases in the post-processed cumulative rainfall ensembles are minimal throughout the entire forecast period (Fig. 8).The biases for the raw NWP predictions decrease for lead times up to 9 h and then are relatively stable near zero.However, the magnitude of biases in the post-processed ensemble forecasts is nearly always smaller than in the raw forecasts.
The bias of the post-processed forecasts conditioned on the forecast median for 0-3 and 30-33 h lead time forecasts is presented in Fig. 8.The bias begins to depart from close to zero for events where the forecast median exceeds approximately 2.5 mm.However, for nearly all threshold values the confidence limits intersect the black line depicting zero bias and suggesting that the departures from zero may be solely due to sampling uncertainties.

Forecast discrimination
Forecast discrimination is assessed using plots of the relative operating characteristic (ROC).The ability of the postprocessed forecasts to discriminate between events and nonevents varies with lead time and the event being considered (Fig. 9).At shorter lead times, the ROC curves for forecasts of individual periods tend to approach the top left corner of the plot, while at longer lead times they are closer to the diagonal.This suggests that forecasts for shorter lead times have a greater ability to discriminate between events and non-events than forecasts for longer lead times.The contrast in forecast discrimination with lead time is stronger for the high rainfall events (precipitation > 5 mm) than for the event of rainfall less than 0.2 mm.This suggests that as lead time increases the probability of high rainfall in the post-processed forecasts becomes less informative and less strongly correlated observed high rainfall events.However, the ROC curves do not approach the diagonal line at any lead time, which suggests the post-processed forecasts are always skilful.This is supported by the skill scores presented earlier.
The ROC curves for cumulative forecast rainfall totals display significantly less spread than the curves for individual forecast lead times.For the event of rainfall less than 0.2 mm, the forecast discrimination is stronger for shorter lead times than for longer lead times.However, for the events of greater than 5 mm, there are no clear differences in forecast discrimination with lead time.
Figure 10 presents the area under the ROC curve for a spectrum of event magnitudes for 0-3 and 30-33 h lead times.The area under the ROC curve tends to remain constant, given the sampling uncertainty, with increasing event size.This suggests that the skill of the forecasts is not related to the size of the forecast event.

Reliability
Figure 11 presents reliability diagrams for the probability of rainfall exceeding two thresholds for individual forecast lead times pooled for lead times in day 1 and day 2. The reliability diagrams illustrate that the forecast probability of a rainfall event of less than 0.2 mm appears to be reliable, with the observed relative frequencies closely following the line reflecting perfect reliability.The forecast probability of a rainfall event of greater than 5 mm also appears to be reliable for day 1.For day 2 the forecast probability of a rainfall event of greater 5 mm appears to be less reliable.However, very few forecasts have a probability of rainfall exceeding 5 mm that falls into the upper bin of this diagram, and therefore, there is considerable sampling uncertainty associated with the observed frequencies.This sampling uncertainty is highlighted by the wide confidence intervals in Fig. 11.
Figure 12 presents reliability diagrams for two probability thresholds for 24 h rainfall totals for day 1 and day 2. The reliability diagrams show that the observed relative frequencies follow the diagonal line reflecting perfect reliability.For the forecast probability of 24 h forecast rainfall totals exceeding than 5 mm, small deviations from the diagonal line occur for both day 1 and day 2 for several bins.The number of samples in these bins is small and therefore subject to considerable sample variability as depicted by the confidence intervals.Overall, the forecasts of 24 h rainfall totals appear to be reliable.
The probabilistic forecasts of 24 h rainfall totals are produced by summing individual ensemble members.These forecasts will only be reliable if the forecasts for individual periods are reliable and the ensemble members have the appropriate temporal correlation structures.The temporal correlations in the ensemble members were introduced using the Schaake shuffle.Here we have demonstrated that the probability distributions of forecasts for both individual periods and cumulative totals are reliable and therefore the temporal correlations introduced by the Schaake shuffle seem appropriate.

Forecast correlations
Figure 13 presents the lag-1 Kendall correlation of the ensemble rainfall forecasts, before and after application of the Schaake shuffle, and of the corresponding observations.The lag-1 correlations of the probabilistic forecasts before application of the Schaake shuffle are close to zero, which is expected given that these forecasts are random samples from independent probability distributions.After application of the Schaake shuffle, the lag-1 correlations of the ensemble forecasts are significantly larger and close to those of the observations.The lag-1 correlations of the ensemble forecasts are expected to be lower than those of the observations because the majority of the forecasts have a larger proportion of zero values than the observations.These zero values will tend to reduce lower magnitude of the correlation coefficients.

Further discussion
High quality forecasts of sub-daily rainfall are critical for forecasting streamflows, particularly floods, in small and rapidly responding catchments.The marginal distributions of sub-daily raw NWP rainfall predictions differ from those of the observations and therefore post-processing is necessary.Observed rainfall displays a diurnal cycle, with maximum mean rainfall occurring between 03:00 and 09:00 p.m. LT, while the raw NWP predictions display little diurnal cycle.Poorly representing the timing and magnitude of the diurnal cycles, particularly in precipitation, is a known problem with many NWP models and is commonly related to the representation and parameterization of convective processes (Evans and Westra, 2012;Dai and Trenberth, 2004).Therefore, it may be more appropriate to condition the post-processing of NWP rainfall predictions on the type of rainfall rather than lead time.However, previous analysis found that errors in NWP rainfall predictions could not be predicted by synoptic or rainfall types for Australian conditions (Roux et al., 2012).
One of the major challenges for developing and evaluating short-term streamflow forecasting systems, and particularly post-processing methods for rainfall predictions, in Australia is the limited availability of retrospective NWP predictions from the ACCESS suite of models.The lack of retrospective NWP predictions imposes some limitations on this study and the conclusions that can be drawn.Significant streamflow events, including floods, result from significant rainfall events and therefore the ability to forecast significant rainfall events is critical.Few large, flood causing, rainfall events exist in the record of ACCESS predictions used in this study.Post-processing methods that use parametric modelling, such as the one used in this study, can be used to extrapolate relationships beyond the range of data used to fit the model and produce post-forecasts for rare events.However, the quality of these extrapolated forecasts cannot be comprehensively assessed.The reliability diagram for the probability of precipitation exceeding 5 mm for day two forecasts provides an example of this problem where the number of samples in the high forecast probability bins is very small and therefore no conclusive statement about the reliability of these forecasts can be made.In the extreme case, such as in arid zones, it is possible that during the period of available retrospective NWP predictions no rainfall is observed or predicted for some forecast periods.This has the potential to prevent the establishment of a model and as a result post-processing of NWP rainfall predictions may not be possible.Therefore, forecasts of extreme rainfall events need to be used with caution and methods need to be further developed to handle situations where there are insufficient non-zero rainfall observations and predictions to establish a post-processing model.
The post-processing approach described in this paper models only the concurrent relationship between raw NWP predicted and observed rainfall to produce a rainfall forecast.It assumes that the temporal correlation in mean rainfall at different time periods is adequately described by the raw NWP predictions and does not make use of the temporal or spatial lag correlations in rainfall observations.In addition, if the NWP predictions have consistent errors in the timing or spatial location of rainfall events, then the current approach will not necessarily produce the most skilful rainfall forecasts.To accommodate both these possibilities in a post-processing method requires a more sophisticated model where multiple forecast lead times are included in a single model.The simplified BJP modelling approach used here can potentially be adapted to produce forecasts  for multiple periods from a single model.However, it would require strong parameterization of the correlation matrix to limit the risk of overfitting.Such an approach is attractive as it would remove the need to use the Schaake shuffle to create ensembles from separately post-processed probability distributions, as the spatial and temporal correlations would be explicitly modelled.Stronger assumptions about all model parameters may also be able to deal with situations where little or no rainfall is observed or predicted for some forecast lead times.
In this study, the post-processing method has only been applied to a catchment in the temperate zone of southern Australia.In this catchment, rainfall is predominantly produced by large-scale synoptic systems moving across the catchment.Large-scale synoptic systems are better predicted by NWP models because they tend to evolve relatively slowly and occur on spatial scales that are resolved by the models   Reliability diagrams for the probability of 24 h forecast rainfall totals being less than 0.2 mm and the probability of 24 h forecast rainfall totals exceeding than 5 mm for day 1 (lead times 0-21 h) and for day 2 (lead times 24-45 h) at site 82163 Carboor Upper (1 : 1 dashed line, perfectly reliable forecast; circles, observed relative frequency; vertical lines, [0.05, 0.95] uncertainty intervals; insert, number of events in each of the different forecast probability ranges).(Roux and Seed, 2011;Roux et al., 2012).NWP models tend not to predict rainfall from convective systems well because these processes evolve rapidly and commonly occur on spatial scales finer than those resolved by the model.In areas where substantial rainfall is produced by convective systems, the raw NWP rainfall predictions may not be sufficiently correlated with rain gauge observations to produce skilful rainfall forecasts using the method described in this paper.Further work is proposed to assess the efficacy of the postprocessing method for catchments experiencing a range of climatic conditions in Australia.
The motivation for post-processing NWP rainfall predictions is to produce bias free ensemble rainfall forecasts that can be used for ensemble streamflow forecasting.Using bias free ensemble rainfall forecasts to force an initialised hydrological model has the potential to increase the number of lead times for which skilful streamflow forecasts can be produced.Assessing the benefits of using ensemble rainfall forecasts for streamflow forecasting is beyond the scope of the current study, but will be the subject of future investigations.Part of these investigations will include examining the temporal resolution at which post-processed rainfall forecasts are most skilful and which lead to the most skilful streamflow forecasts.

Summary and conclusions
Sub-daily ensemble rainfall forecasts that are bias free and reliably quantify forecast uncertainty are critical for flood and short-term ensemble streamflow forecasting.The raw output from numerical weather prediction models typically does not provide rainfall forecasts with these properties and therefore some form of post-processing is required.In this paper we describe a new approach to generate ensemble rainfall forecasts by post-processing raw NWP rainfall predictions.The approach uses a simplified version of the Bayesian joint probability modelling approach, which was designed for seasonal streamflow forecasting, to produce forecast probability distributions for individual locations and forecast lead times.Ensemble forecasts with appropriate spatial and temporal correlations are then generated by linking samples from the forecast probability distributions using the Schaake shuffle.
We apply the approach to post-process rainfall predictions from the ACCESS-R numerical weather prediction model at rain gauge locations in the Ovens catchment in southern Australia.We demonstrate that the assumed log-sinh transformed bivariate normal distribution is appropriate for modelling the joint distribution of NWP predicted and observed rainfall.The method is shown to produce ensemble forecasts that are more skilful than the raw NWP predictions both for individual forecast lead times and for cumulative forecast totals.Skill increases result from the correction of not only the mean bias, but also biases conditional on the magnitude of the NWP rainfall prediction.The post-processed forecast ensembles are demonstrated to successfully discriminate between events and non-events for both small and large rainfall occurrences, and reliably quantify the forecast uncertainty.
This study has assessed the post-processing approach for conditions where rainfall is principally due to large-scale synoptic systems.Further work is proposed to assess the efficacy of the post-processing method for catchments experiencing a range of climatic conditions in Australia, particularly in areas where significant rainfall is the result of convective processes.Future investigations will also assess the benefits of using post-processed rainfall forecasts for flood and short-term streamflow forecasting and examine the temporal resolution at which rainfall post-processing is most effective.
A uniform prior is specified for both of the transformation parameters; however, because these parameters are not directly estimated it is necessary to apply the Jacobian of the reparameterization to the uniform prior p(ln α) = J α→ ln α p(α), where the Jacobian determinant of the reparameterization (J α→ ln α ) is given by A more elaborate prior for the pair of (m the Jacobian determinant of the reparameterization J s 2 →s * from s 2 to s * is given by J s 2 →s * = ds 2 ds * = s 2 ; the Jacobian determinant of the reparameterization (J m→m * ) from m to m * is given by and p µ, σ 2 takes the simplest form of priors commonly used for normal distribution mean and variance (Wang and Robertson, 2011;Gelman et al., 1995) p µ, σ 2 ∝ 1 σ 2 .
The prior for the reparameterized correlation coefficient is related to the prior for the original correlation coefficient by p(ϕ) = J r→ϕ p(r), where J r→ϕ is the Jacobian determinant for the transform from r to ϕ, and Wang et al. (2009) use a marginally uniform prior is used for the correlation matrix, which for the bivariate case reduces to p(r) ∝ 1.

Figure 4 Fig. 2 .
Figure 4 presents the CRPS skill scores of the raw NWP predictions and post-processed rainfall forecasts for individual

Fig. 3 .
Fig. 3. Observed (red dots) and modelled median (vertical lines, representing [0.05, 095] uncertainty range) correlation coefficients between NWP forecast and observed precipitation for post-process post-processing models covering lead times from 0 to 57 h at site 82163 Carboor Upper.

Fig. 4 .
Fig. 4. Variation in CRPS skill score of ensemble rainfall forecasts for individual periods (left panel) and cumulative rainfall totals (right panel) with lead time at site 82163 Carboor Upper.

Fig. 5 .
Fig. 5. CRPS skill scores (solid black line) and [0.05, 0.95] confidence intervals (red and blue dashed lines) conditional on the forecast mean exceeding threshold rainfall for ensemble rainfall forecasts at site 82163 Carboor Upper at lead times of 0-3 h (left panel) and 30-33 h (right panel).The number of events over which the skill scores are computed are given by the black dashed lines.

Fig. 6 .
Fig. 6.Percentage bias for individual forecast lead times (left panel) and cumulative forecast totals (right panel) as a function of lead time at site 82163 Carboor Upper.

Fig. 7 .
Fig. 7. Ensemble mean plotted against the raw NWP prediction for lead time 0 forecasts at site 82163 Carboor Upper showing the nonlinear nature of bias correction (1 : 1 solid line).

Fig. 8 .
Fig. 8. Percentage bias (solid black line) and [0.05, 0.95] confidence intervals (red and blue dashed lines) conditional on the forecast mean exceeding threshold rainfall for ensemble rainfall forecasts at site 82163 Carboor Upper at lead times of 0-3 h (left panel) and 30-33 h (right panel).The number of events over which the skill scores are computed are given by the black dashed lines.

Fig. 9 .
Fig. 9.Relative operating characteristics at all lead times for individual forecast lead time and cumulative forecast totals for events of rainfall less than the minimum observable and events greater than 5 mm at site 82163 Carboor Upper.

Fig. 10 .
Fig. 10.Area under the ROC curve (solid black line) and [0.05, 0.95] confidence intervals (red and blue dashed lines) for a spectrum of threshold rainfall events for ensemble rainfall forecasts at site 82163 Carboor Upper at lead times of 0-3 h (left panel) and 30-33 h (right panel).The number of events over which the skill scores are computed are given by the black dashed lines.

Fig. 11 .
Fig.11.Reliability diagrams for the probability of a rainfall event of less than 0.2 mm and the probability of a rainfall event of greater than 5 mm for individual forecast lead times pooled for day 1 (lead times 0-21 h) and for day 2 (lead times 24-45 h) at site 82163 Carboor Upper (1 : 1 dashed line, perfectly reliable forecast; circles, observed relative frequency; vertical lines [0.05, 0.95] uncertainty interval; insert, number of events in each of the different forecast probability ranges).

Fig
Fig. 12.Reliability diagrams for the probability of 24 h forecast rainfall totals being less than 0.2 mm and the probability of 24 h forecast rainfall totals exceeding than 5 mm for day 1 (lead times 0-21 h) and for day 2 (lead times 24-45 h) at site 82163 Carboor Upper (1 : 1 dashed line, perfectly reliable forecast; circles, observed relative frequency; vertical lines, [0.05, 0.95] uncertainty intervals; insert, number of events in each of the different forecast probability ranges).

Fig. 13 .
Fig.13.Lag-1 Kendall correlation coefficients for ensemble forecasts before and after application of Schaake shuffle and for observations (solid lines, median for all forecast events and observed; shaded bands, [0.05, 0.95] intervals computed from all forecast events).