Streamflow forecasting is prone to substantial uncertainty due to
errors in meteorological forecasts, hydrological model structure, and
parameterization, as well as in the observed rainfall and streamflow data
used to calibrate the models. Statistical streamflow post-processing is an
important technique available to improve the probabilistic properties of the
forecasts. This study evaluates post-processing approaches based on three
transformations – logarithmic (Log), log-sinh (Log-Sinh), and Box–Cox with

Hydrological forecasts provide crucial supporting information on a range of water resource management decisions, including (depending on the forecast lead time) flood emergency response, water allocation for various uses, and drought risk management (Li et al., 2016; Turner et al., 2017). The forecasts, however, should be thoroughly verified and proved to be of sufficient quality to support decision-making and to meaningfully benefit the economy, environment, and society.

Sub-seasonal and seasonal streamflow forecasting systems can be broadly classified as dynamic or statistical (Crochemore et al., 2016). In dynamic modelling systems, a hydrological model is usually developed at a daily time step and calibrated against observed streamflow using historical rainfall and potential evaporation data. Rainfall forecasts from a numerical climate model are then used as an input to produce daily streamflow forecasts, which are then aggregated to the timescale of interest and post-processed using statistical models (e.g. Bennett et al., 2017; Schick et al., 2018). In statistical modelling systems, a statistical model based on relevant predictors, such as antecedent rainfall and streamflow, is developed and applied directly at the timescale of interest (Robertson and Wang, 2009, 2011; Lü et al., 2016; Zhao et al., 2016). Hybrid systems that combine aspects of dynamic and statistical approaches have also been investigated (Humphrey et al., 2016; Robertson et al., 2013a).

Examples of operational services based on the dynamic approach include the Australian Bureau of Meteorology's dynamic modelling system (Laugesen et al., 2011; Tuteja et al., 2011; Lerat et al., 2015); the Hydrological Ensemble Forecast Service (HEFS) of the US National Weather Service (NWS) (Brown et al., 2014; Demargne et al., 2014); the Hydrological Outlook UK (HOUK) (Prudhomme et al., 2017); and the short-term forecasting European Flood Alert System (EFAS) (Cloke et al., 2013). Examples of operational services based on a statistical approach include the Bureau of Meteorology's Bayesian Joint Probability (BJP) forecasting system (Senlin et al., 2017).

Dynamic and statistical approaches have distinct advantages and limitations. Dynamic systems can potentially provide more realistic responses in unfamiliar climate situations, as it is possible to impose physical constraints in such situations (Wood and Schaake, 2008). In comparison, statistical models have the flexibility to include features that may lead to more reliable predictions. For example, the BJP model uses climate indices (e.g. NINO3.4), which are typically not used in dynamic approaches. That said, the suitability of statistical models for the analysis of non-stationary catchment and climate conditions is questionable (Wood and Schaake, 2008).

Streamflow forecasts obtained using hydrological models are affected by uncertainties in rainfall forecasts, observed rainfall and streamflow data, as well as by uncertainties in the model structure and parameters. Progress has been made towards reducing biases and characterizing the sources of uncertainty in streamflow forecasts. These advances include improving rainfall forecasts through post-processing (Robertson et al., 2013b; Crochemore et al., 2016), accounting for input, parametric, and/or structural uncertainty (Kavetski et al., 2006; Kuczera et al., 2006; Renard et al., 2011; Tyralla and Schumann, 2016), and using data assimilation techniques (Dechant and Moradkhani, 2011). Although these steps may improve some aspects of the forecasting system, a predictive bias may nonetheless remain. Such bias can be reduced via post-processing, which, if successful, will improve forecast accuracy and reliability (Madadgar et al., 2014; Lerat et al., 2015).

This study focuses on improving streamflow forecasting at monthly and
seasonal timescales using dynamic approaches, more specifically, by
evaluating several forecast post-processing approaches. Post-processing of
streamflow forecasts is intended to remove systematic biases in the mean,
variability, and persistence of uncorrected forecasts, which arise due to
inaccuracies in the downscaled rainfall forecasts (e.g. errors in downscaling
forecast rainfall from a grid with

A number of post-processing approaches have been investigated in the literature, including quantile mapping (Hashino et al., 2007) and Bayesian frameworks (Pokhrel et al., 2013; Robertson et al., 2013a), as well as methods based on state-space models and wavelet transformations (Bogner and Kalas, 2008). Wood and Schaake (2008) used the correlation between forecast ensemble means and observations to generate a conditional forecast. Compared with the traditional approach of correcting individual forecast ensembles, the correlation approach improved forecast skill and reliability. In another study, Pokhrel et al. (2013) implemented a BJP method to correct biases, update predictions, and quantify uncertainty in monthly hydrological model predictions in 18 Australian catchments. The study found that the accuracy and reliability of forecasts improved. More recently, Mendoza et al. (2017) evaluated a number of seasonal streamflow forecasting approaches, including purely statistical, purely dynamical, and hybrid approaches. Based on analysis of catchments contributing to five reservoirs, the study concluded that incorporating catchment and climate information into post-processing improves forecast skill. While the above review mainly focused on post-processing of sub-seasonal and seasonal forecasts (as it is the main focus of the current study), post-processing is also commonly applied to short-range forecasts (e.g. Li et al., 2016) and to long-range forecasts up to 12 months ahead (Bennett et al., 2016).

In most streamflow post-processing approaches, a residual error model is applied to quantify forecast uncertainty. Most residual error models are based on least squares techniques with weights and/or data transformations (e.g. Carpenter and Georgakakos, 2001; Li et al., 2016). In order to produce post-processed streamflow forecasts, a daily scale residual error model is used in the calibration of hydrological model parameters, and a monthly/seasonal-scale residual error model is used as part of streamflow post-processing to quantify the forecast uncertainty. In a recent study, McInerney et al. (2017) concluded that residual error models based on Box–Cox transformations with fixed parameter values are particularly effective for daily scale streamflow predictions using observed rainfall, yielding substantial improvements in dry catchments. This study investigates whether these findings generalize to monthly and seasonal forecasts using forecast rainfall.

An important aspect of this work is its focus on general findings applicable over diverse hydro-climatological conditions. Most of the studies in the published literature use a limited number of catchments and case studies to test prospective methods. Dry catchments, characterized by intermittent flows and frequent low flows, pose the greatest challenge to hydrological models (Ye et al., 1997; Knoche et al., 2014). Yet the provision of good-quality forecasts across a large number of catchments is an essential attribute of national-scale operational forecasting services, especially in large countries with diverse climatic and catchment conditions, such as Australia.

This paper develops streamflow post-processing approaches suitable for use in
an operational streamflow forecasting service. We pose the following aims.

Aim 1: evaluate the value of streamflow forecast post-processing by comparing forecasts with no post-processing (hereafter called “uncorrected” forecasts) against post-processed forecasts.

Aim 2: evaluate three post-processing schemes based on residual error models with data transformations recommended in recent publications, namely the Log, Box–Cox (McInerney et al., 2017), and Log-Sinh (Wang et al., 2012) schemes, for monthly and seasonal streamflow post-processing.

Aim 3: evaluate the generality of results over a diverse range of hydro-climatic conditions, in order to ensure the recommendations are robust in the context of an operational streamflow forecasting service.

The rest of the paper is organized as follows. The forecasting methodology is described in Sect. 2 and application studies are described in Sect. 3. Results are presented in Sect. 4, followed by discussions and conclusions in Sects. 5 and 6 respectively.

Schematic of the dynamic streamflow forecasting system used in this study. A similar approach is used by the Australian Bureau of Meteorology for its monthly and seasonal streamflow forecasting service.

The streamflow forecasting system adopted in this study is based on the Bureau of Meteorology's dynamic modelling system (Fig. 1). Daily rainfall forecasts are input into a daily rainfall–runoff model to produce “uncorrected” daily streamflow forecasts. These streamflow forecasts are then aggregated in time and post-processed to produce monthly and seasonal streamflow forecasts, which are issued each month. Two steps are involved: calibration and forecasting, discussed below.

The GR4J rainfall–runoff model (Perrin et al., 2003) is used as it has been
proven to provide (on average) good performance across a large number of
catchments ranging from semi-arid to temperate and tropical humid (Perrin et
al., 2003; Tuteja et al., 2011). GR4J is a lumped conceptual model with four
calibration parameters: maximum capacity of the production store

In the calibration step, the daily rainfall–runoff model is calibrated to observed daily streamflow using observed rainfall (Jeffrey et al., 2001) as forcing. The calibration of the parameters is based on the weighted least squares likelihood function, similar to that outlined in Evin et al. (2014). Markov chain Monte Carlo (MCMC) analysis is used to estimate posterior parametric uncertainty (Tuteja et al., 2011). Following MCMC analysis, 40 random sets of GR4J parameters are retained and used in the forecast step. A cross-validation procedure is implemented to verify the forecasts, as described in Sect. 3.4. The calibration and cross-validation are computationally intensive; therefore, we use the High Performance Computing (HPC) facility at the National Computational Infrastructure (NCI) in Australia.

Prior to the forecast period, observed rainfall is used to force the
rainfall–runoff model. During the forecast period, 166 replicates of daily
downscaled rainfall forecasts from the Bureau of Meteorology's global climate
model, namely the Predictive Ocean Atmosphere Model for Australia, POAMA-2,
are used (see Sect. 3.2 for details on POAMA-2). These rainfall forecasts are
inputted into GR4J and propagated using the 40 GR4J parameter sets to obtain
6640 (166

The streamflow post-processing method used in this work consists of fitting a statistical model to the streamflow forecast residual errors, defined by the differences between the observed and forecast streamflow time series over a calibration period. Typically these errors are heteroscedastic, skewed, and persistent. Heteroscedasticity and skew are handled using data transformations (e.g. the Box–Cox transformation), whereas persistence is represented using autoregressive models (e.g. the lag-one autoregressive model, AR(1); Wang et al., 2012; McInerney et al., 2017). We begin by describing the two major steps of the streamflow post-processing procedure (Sects. 2.3.2 and 2.3.3), and then describe the transformations under consideration (Sect. 2.4).

The parameters of the streamflow post-processing model are calibrated as
follows.

Step 1: compute the transformed forecast residuals for month or season

Step 2: compute the
standardized residuals:

The standardization process in Eq. (2) aims to account for seasonal
variations in the distribution of residuals. The quantities

Step 3: assume the standardized residuals are described by a first-order
autoregressive (AR(1)) model with Gaussian innovations:

The parameters

Once the streamflow post-processing scheme is calibrated, the post-processed
streamflow forecasts for a given period are computed. For a given ensemble
member

Step 1: sample the innovation

Step 2: generate the standardized residuals

Step 3: compute the normalized residuals

Step 4: back-transform each normalized residual

Note that the above algorithm may occasionally generate negative streamflow predictions, which we reset to zero. In addition, the algorithm can generate predictions that exceed historical maxima; such predictions could in principle also be “adjusted” a posteriori, though we do not attempt such an adjustment in this study. These aspects are discussed further in Sect. 5.6.
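As an illustration, the four forecasting steps can be sketched in Python for a single ensemble member and a single forecast period. This is a minimal sketch under stated assumptions, not the operational implementation: the function name `postprocess_member` and all argument names are hypothetical, the residual is assumed to be additive in transformed space (transformed observation minus transformed forecast), and the standardization is assumed to use a month- or season-specific mean and standard deviation.

```python
import numpy as np


def postprocess_member(q_fc, nu_prev, mu, s, rho, sigma, z, z_inv, rng):
    """Generate one post-processed streamflow value (hypothetical helper).

    q_fc    : uncorrected streamflow forecast for the period
    nu_prev : standardized residual from the previous period
    mu, s   : assumed mean and standard deviation of the transformed
              residuals for the month or season being forecast
    rho     : AR(1) coefficient
    sigma   : standard deviation of the Gaussian innovations
    z, z_inv: transformation and its inverse
    rng     : numpy random Generator
    """
    y = rng.normal(0.0, sigma)      # Step 1: sample the innovation
    nu = rho * nu_prev + y          # Step 2: AR(1) standardized residual
    eta = mu + s * nu               # Step 3: undo the standardization
    q = z_inv(z(q_fc) + eta)        # Step 4: back-transform to flow space
    return max(float(q), 0.0), nu   # negative predictions are reset to zero


# Usage: one replicate with a log transformation and made-up parameter values.
rng = np.random.default_rng(42)
q_post, nu_new = postprocess_member(
    q_fc=10.0, nu_prev=0.3, mu=0.05, s=0.4, rho=0.6, sigma=0.8,
    z=np.log, z_inv=np.exp, rng=rng,
)
```

The returned standardized residual is passed back in as `nu_prev` for the next period, which is how the AR(1) persistence is carried forward.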

The observed streamflow and median streamflow forecasts are transformed in Step 1 of streamflow post-processing (Sect. 2.3.2), to account for the heteroscedasticity and skewness of the forecast residuals. We consider three transformations, namely the logarithmic, Log-Sinh, and Box–Cox transformations.

The logarithmic (Log) transformation is

The Log-Sinh transformation (Wang et al., 2012) is

The Box–Cox (BC) transformation (Box and Cox, 1964) is

The Log transformation is a simple and widely used transformation; McInerney et al. (2017) reported that in daily scale modelling it produced the best reliability in perennial catchments (from a set of eight residual error schemes, including standard least squares, weighted least squares, BC, Log-Sinh, and reciprocal transformations). However, the Log transformation performed poorly in ephemeral catchments, where its precision was far worse than in perennial ones.

The Log-Sinh transformation is an alternative to the Log and BC transformations, and was proposed by Wang et al. (2012) to improve precision at higher flows. The Log-Sinh approach has been extensively applied to water forecasting problems (see, for example, Del Giudice et al., 2013; Robertson et al., 2013b; Bennett et al., 2016). However, in daily scale streamflow modelling of perennial catchments using observed rainfall, the Log-Sinh scheme did not improve on the Log transformation: its parameters tend to calibrate to values for which the Log-Sinh transformation effectively reduces to the Log transformation (McInerney et al., 2017).

Finally, the BC transformation with fixed

In the remainder of the paper, the term “uncorrected forecasts” refers to streamflow forecasts obtained using the steps in Sect. 2.2.3, and the term “post-processed forecasts” refers to forecasts based on a streamflow post-processing model, which includes the standardization and AR(1) model from Sect. 2.3 as well as a transformation (Log, Log-Sinh, or BC0.2) from Sect. 2.4. As the post-processing schemes considered in this work differ solely in the transformation used, they will be referred to as the Log, Log-Sinh, and BC0.2 schemes.
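For readers who wish to experiment with the three transformations, the following Python sketch implements their commonly used forms together with the inverses needed for back-transformation. The functional forms follow the cited references (Box and Cox, 1964; Wang et al., 2012); the function names, the optional offset `c`, and the parameter values in the usage lines are illustrative assumptions rather than the settings used in this study. The default `lam=0.2` reflects the BC0.2 label.

```python
import numpy as np


def log_transform(q, c=0.0):
    """Log transformation; c is an assumed optional offset."""
    return np.log(q + c)


def log_inverse(z, c=0.0):
    return np.exp(z) - c


def log_sinh(q, a, b):
    """Log-Sinh transformation: (1/b) * log(sinh(a + b*q))."""
    return np.log(np.sinh(a + b * q)) / b


def log_sinh_inverse(z, a, b):
    # sinh(a + b*q) = exp(b*z)  =>  q = (arcsinh(exp(b*z)) - a) / b
    return (np.arcsinh(np.exp(b * z)) - a) / b


def box_cox(q, lam=0.2, c=0.0):
    """Box-Cox transformation with fixed power lam (lam != 0)."""
    return ((q + c) ** lam - 1.0) / lam


def box_cox_inverse(z, lam=0.2, c=0.0):
    return (lam * z + 1.0) ** (1.0 / lam) - c


# Usage with made-up flows and parameter values.
flows = np.array([0.5, 10.0, 200.0])
z_bc = box_cox(flows)                    # BC0.2
z_ls = log_sinh(flows, a=0.01, b=0.01)   # illustrative a and b only
```

Note that `np.sinh` overflows for very large arguments, so a production implementation would need a numerically stable form of the Log-Sinh transformation.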

Locations of the 300 catchments used in this study. The catchments are classified as dry or wet based on the aridity index. The Köppen climate classifications for Australia are shown. The Dieckmans Bridge catchment (site id: 145010A), used as a representative catchment in Fig. 8, is indicated by the red circle.

The empirical case study is carried out over a comprehensive set of 300
catchments with locations shown in Fig. 2. The
figure also shows the Köppen climate zones. These catchments are selected as
representative of the diverse hydro-climatic conditions across Australia.
The catchment areas range from as small as 6 km

In each catchment, data from 1980 to 2008 are used. Observed daily rainfall data were obtained from the Australian Water Availability Project (AWAP) (Jeffrey et al., 2001). Potential evaporation and observed streamflow data were obtained from the Bureau of Meteorology.

Catchment-scale rainfall forecasts are estimated from daily downscaled
rainfall forecasts produced by the Bureau of Meteorology's global climate
model, namely the Predictive Ocean Atmosphere Model for Australia (POAMA-2)
(Hudson et al., 2013). The atmospheric component of POAMA-2 uses a spatial
scale of approximately 250

The performance of the post-processing schemes is evaluated separately in dry
versus wet catchments. In this work, the classification of catchments into
dry and wet is based on the aridity index (AI) according to the following
equation:

Catchments with AI

Schematic of the cross-validation framework used for forecast verification, applied with the 1-year validation period corresponding to the year 1990 (following Tuteja et al., 2016).

The forecast verification is carried out using a moving-window
cross-validation framework, as shown in Fig. 3. We use 5 years of data
(1975–1979) to warm up the model and apply data from 1980 to 2008 for
calibration in a cross-validation framework based on a 5-year moving window.
Suppose we are validating the streamflow forecasts in year

The performance of uncorrected and post-processed streamflow forecasts is evaluated using reliability and sharpness metrics, as well as the CRPSS (see Sect. 3.5.3). Note that the Bureau of Meteorology uses Root Mean Squared Error (RMSE) and Root Mean Squared Error in Probability (RMSEP) scores in the operational service in addition to CRPSS; however, these metrics have not been considered in this study.

Forecast performance (verification) metrics are computed separately for each forecast month. To facilitate the comparison and evaluation of streamflow forecast performance in different streamflow regimes, the high- and low-flow months are defined using long-term average streamflow data calculated for each month. The 6 months with the highest average streamflow are classified as “high-flow” months, and the remaining 6 months are classified as “low-flow” months. The performance metrics listed below are computed for each month separately; the indices denoting the month are excluded from Eqs. (10), (11), and (12) below to avoid cluttering the notation.
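The high-/low-flow classification described above can be sketched as follows. The function name `classify_flow_months` is hypothetical, and ties in the long-term monthly means (not discussed in the text) are broken here in favour of the earlier calendar month.

```python
import numpy as np


def classify_flow_months(monthly_means):
    """Label each calendar month (1-12) as 'high' or 'low' flow.

    monthly_means: 12 long-term average streamflow values, January first.
    The 6 months with the highest averages are labelled 'high'.
    """
    means = np.asarray(monthly_means, dtype=float)
    if means.shape != (12,):
        raise ValueError("expected 12 monthly averages")
    # Stable sort on negated values: descending order, ties keep month order.
    order = np.argsort(-means, kind="stable")
    high = {int(i) + 1 for i in order[:6]}
    return {m: ("high" if m in high else "low") for m in range(1, 13)}


# Usage with made-up long-term means for a winter-dominated catchment.
labels = classify_flow_months([3, 2, 2, 4, 9, 15, 20, 18, 14, 8, 5, 4])
```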

The reliability of forecasts is evaluated using the probability integral
transform (PIT) (Dawid, 1984; Laio and Tamea, 2007). To evaluate and compare
reliability across 300 catchments, the

The sharpness of forecasts is evaluated using the ratio of inter-quantile
ranges (IQRs) of streamflow forecasts and a historical reference (Tuteja et
al., 2016). The following definition is used:

An IQR

The CRPS metric quantifies the difference between a forecast distribution and
observations, as follows (Hersbach, 2000):

The CRPS summarizes the reliability, sharpness, and bias attributes of the
forecast (Hersbach, 2000). A “perfect” forecast – namely a point
prediction that matches the actual value of the predicted quantity – has
CRPS

The IQR and CRPSS metrics are defined as skill scores relative to a reference forecast. In this work, we use the climatology as the reference forecast, as it represents the long-term climate condition. To construct these “climatological forecasts”, we used the same historical reference as the operational seasonal streamflow forecasting service of the Bureau of Meteorology. This reference is resampled from a Gaussian probability distribution fitted to the observed streamflow transformed using the Log-Sinh transformation (Eq. 7). This approach leads to more stable and continuous historical reference estimates than sampling directly from the empirical distribution of historical streamflow, and can be computed at any percentile (which facilitates comparison with forecast percentiles). Although the choice of a particular reference affects the computation of skill scores, it does not affect the ranking of post-processing models when the same reference is used, which is the main aim of this paper.
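To make the skill-score construction concrete, the sketch below computes an ensemble CRPS, the corresponding skill score against a reference, and a ratio of inter-quantile ranges. Several choices are assumptions rather than details taken from this study: the CRPS uses the standard empirical ensemble estimator, the skill score is expressed in percent as 100 × (1 − CRPS_forecast / CRPS_reference), and the 99 % inter-quantile range is taken between the 0.5th and 99.5th percentiles. All function names are hypothetical.

```python
import numpy as np


def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble forecast for one observation."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    # Pairwise member spread; needs O(n^2) memory, so subsample large ensembles.
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2


def crpss(crps_forecast, crps_reference):
    """CRPS skill score in percent; positive values beat the reference."""
    return 100.0 * (1.0 - crps_forecast / crps_reference)


def iqr_ratio(fc_members, ref_members, coverage=99.0):
    """Forecast inter-quantile range as a percentage of the reference range.

    Values below 100 indicate a forecast sharper than the reference.
    """
    lo = (100.0 - coverage) / 2.0
    hi = 100.0 - lo
    fc_width = np.percentile(fc_members, hi) - np.percentile(fc_members, lo)
    ref_width = np.percentile(ref_members, hi) - np.percentile(ref_members, lo)
    return 100.0 * fc_width / ref_width


# Usage with synthetic ensembles (illustrative numbers only).
rng = np.random.default_rng(1)
forecast = rng.normal(50.0, 5.0, size=500)
climatology = rng.normal(45.0, 15.0, size=500)
skill = crpss(crps_ensemble(forecast, 52.0), crps_ensemble(climatology, 52.0))
sharpness = iqr_ratio(forecast, climatology)
```

In practice the CRPS values would be averaged over all forecasts for a given month before the skill score is formed.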

When evaluating forecast performance, a focus on any single metric can lead to misleading interpretations. For example, two forecasts might have similar sharpness, yet if one of these forecasts is unreliable it can lead to an over- or under-estimation of the risk of an event of interest, which in turn can lead to a sub-optimal decision by forecast users (e.g. a water resources manager).

Given inevitable trade-offs between individual metrics (McInerney et al.,
2017), it is important to consider multiple metrics jointly rather than
individually. Following the approach suggested by Gneiting et al. (2007), we
consider a forecast to have “high skill” when it is reliable and sharper
than climatology. To determine the “summary skill” of the forecasts in each
catchment, we evaluate the total number of months (out of 12) in which
forecasts are reliable (i.e. with a

A table providing the percentage of catchments with high and low summary skills is used to summarize the forecast performance of a given post-processing scheme. To identify any geographic trends in the forecast performance, the summary skills are plotted on a map. The summary skills together with individual skill score values are used to evaluate the overall forecast performance, and are presented separately for wet and dry catchments, as well as separately for high- and low-flow months.
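The summary skill calculation can be sketched as below. The sketch takes the per-month reliability outcome as a precomputed Boolean, and assumes that "sharper than climatology" corresponds to an inter-quantile range ratio below 100 %. The threshold of 10 months for high summary skill follows the definition given with Table 1; the function names are hypothetical.

```python
def summary_skill(reliable, iqr_percent):
    """Count the months (out of 12) with high-skill forecasts.

    reliable    : 12 Booleans, True where the forecast is reliable
    iqr_percent : 12 sharpness values in percent of the climatological range
    """
    if len(reliable) != 12 or len(iqr_percent) != 12:
        raise ValueError("expected one value per calendar month")
    return sum(1 for r, s in zip(reliable, iqr_percent) if r and s < 100.0)


def has_high_summary_skill(n_months):
    """High summary skill: high-skill forecasts in 10-12 months of the year."""
    return n_months >= 10


# Usage with made-up monthly results for one catchment.
reliable = [True] * 11 + [False]
iqr_percent = [60.0] * 10 + [120.0, 80.0]
n = summary_skill(reliable, iqr_percent)
```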

Performance of monthly forecasts in terms of CRPSS, reliability (PIT

Performance of seasonal forecasts in terms of CRPSS, reliability
(PIT

Results for monthly and seasonal streamflow forecasts are now presented. Section 4.1 compares the uncorrected and post-processed streamflow forecast performance. Section 4.2 evaluates the performance of post-processed streamflow forecasts obtained using the Log, Log-Sinh, and BC0.2 schemes. The CRPSS, reliability, and sharpness metrics are presented in Figs. 4 and 5 for monthly and seasonal forecasts respectively.

Distributions of differences in the monthly forecast performance metrics of the Log and Log-Sinh schemes compared to the BC0.2 scheme.

Distributions of differences in the seasonal forecast performance metrics of the Log and Log-Sinh schemes compared to the BC0.2 scheme.

Initial inspection of results found considerable overlap in the performance metrics achieved by the error models. To determine whether the differences in metrics are consistent over multiple catchments, the Log and Log-Sinh schemes are compared to the BC0.2 scheme. This comparison is presented in Figs. 6 and 7 for monthly and seasonal forecasts respectively. The BC0.2 scheme is taken as the baseline because inspection of Figs. 4 and 5 suggests that the BC0.2 scheme has better median sharpness than the Log and Log-Sinh schemes, over all the catchments and for both high- and low-flow months individually.

Seasonal streamflow forecast time series (blue line) and observations (red dots) at Dieckmans Bridge catchment (site id: 145010A). The shaded area shows the 99 % prediction limits.

Seasonal streamflow forecast skill scores at Dieckmans Bridge catchment, computed from the time series shown in Fig. 8 for 6 high-flow months and 6 low-flow months.

The streamflow forecast time series and corresponding skill for a single representative catchment, Dieckmans Bridge, are presented in Figs. 8 and 9 respectively.

The summary skills of the monthly and seasonal forecasts are presented in Figs. 10 and 11. The figures include a histogram of summary skills across all catchments to enable comparison between the uncorrected and post-processing approaches.

Summary skill of monthly forecasts obtained using the Log, Log-Sinh, and BC0.2 schemes across 300 Australian catchments. The performance of uncorrected forecasts is also shown. The summary skill is defined as the number of months where high-skill forecasts (i.e. forecasts that are reliable and sharper than climatology) are obtained. The inset histogram shows the percentage of catchments in each performance category and also serves as the colour legend.

Summary skill of seasonal forecasts obtained using the Log, Log-Sinh, and BC0.2 schemes across 300 Australian catchments. See Fig. 10's caption for details.

In terms of CRPSS, the largest improvement as a result of post-processing
(using any of the transformations considered here) occurs in dry catchments.
This finding holds for both monthly (Fig. 4c) and seasonal forecasts
(Fig. 5c). For example, when post-processing is implemented, the median CRPSS
of monthly forecasts in dry catchments increases from approximately 7 %
(high-flow months) and

In terms of reliability, the performance of uncorrected streamflow forecasts is poor, with about 50 % of the catchments being characterized by unreliable forecasts at both the monthly and seasonal timescales (Figs. 4 and 5, middle row). In comparison, post-processing using the three transformation approaches produces much better reliability, achieving reliable forecasts in more than 90 % of the catchments.

In terms of sharpness, the uncorrected forecasts and the BC0.2 post-processed forecasts are generally sharper than forecasts generated using the other transformations (Figs. 4g and 5g). The use of post-processing achieves much better sharpness than uncorrected forecasts for low-flow months, particularly in dry catchments. For example, for monthly forecasts in low-flow months in dry catchments (Fig. 4i), the median IQR99 of the uncorrected forecasts is greater than 200 %, whereas the corresponding values for post-processed forecasts range between 40 % and 100 %. Similarly, for seasonal forecasts, post-processing approaches improve the median sharpness from 150 % (uncorrected forecasts) to 50 %–110 % (Fig. 5i).

In terms of CRPSS, Figs. 4a–c and 5a–c show considerable overlap in the boxplots corresponding to all three post-processing schemes, in both wet and dry catchments. This finding suggests little difference in the performance of the post-processing schemes, and is further confirmed by Figs. 6a–c and 7a–c, which show boxplots of the differences between the CRPSS of the Log and Log-Sinh schemes versus the CRPSS of the BC0.2 scheme. Across all catchments, the distribution of these differences is approximately symmetric with a mean close to 0. In dry catchments, the BC0.2 slightly outperforms the Log scheme for high-flow months and the Log-Sinh scheme slightly outperforms the Log scheme for low-flow months. Overall, these results suggest that none of the Log, Log-Sinh, or BC0.2 schemes is consistently better in terms of CRPSS values.

In terms of reliability, post-processing using any of the three
post-processing schemes produces reliable forecasts at both monthly and
seasonal scales, and in the majority of the catchments (Figs. 4 and 5, middle
row). The median

In terms of sharpness, the BC0.2 scheme outperforms the Log and Log-Sinh schemes. This finding holds in all cases (i.e. high-/low-flow months and wet/dry catchments), for both monthly and seasonal forecasts (Figs. 4 and 5, bottom row). The plot of differences in the sharpness metric (Figs. 6 and 7, bottom row) highlights this improvement. In half of the catchments, during both high- and low-flow months, the BC0.2 scheme improves the IQR99 by 30 % (or more) compared to the Log and Log-Sinh schemes. In dry catchments, the improvements are larger than in wet catchments. For example, in dry catchments during high-flow months, the BC0.2 scheme improves on the IQR99 of the Log and Log-Sinh schemes by 40 %–60 % in over half of the catchments, and by as much as 170 %–190 % in a quarter of the catchments.

To illustrate these results, a streamflow forecast time series at Dieckmans
Bridge catchment (site id: 145010A) is shown in Fig. 8 and performance
metrics calculated over 6 high-flow months and 6 low-flow months are shown in
Fig. 9. This catchment is selected as it is broadly representative of typical
results obtained across the wide range of case study catchments. The period
in Fig. 8 (2003–2007) is chosen because it highlights the difference in
forecast interval between the uncorrected and post-processing approaches. The
figure indicates that in terms of reliability, the uncorrected forecast has a
number of observed data points outside the 99 % predictive range
(Fig. 8a). This is an indication that the forecast is unreliable. This
finding can be confirmed from the corresponding

Performance of post-processing schemes, expressed as the percentage of catchments with high and low summary skill. Results shown for monthly and seasonal forecasts. A catchment with “high summary skill” is defined as a catchment where “high-skill” forecasts are achieved in 10–12 months out of the year; “high-skill” forecasts are defined as forecasts that are reliable and sharper than climatology.

Figures 10 and 11 show the geographic distribution of the summary skill of the uncorrected and post-processing approaches for monthly and seasonal forecasts respectively. Recall that the summary skill represents the number of months with streamflow forecasts that are both reliable and sharper than climatology. Table 1 provides a summary of the percentage of catchments with high and low summary skill for the uncorrected and post-processing approaches for monthly and seasonal forecasts (see Sect. 3.5.5).

The findings for forecasts at the monthly scale are as follows (Fig. 10 and
Table 1).

Uncorrected forecasts perform worse than post-processing techniques in the sense that they have low summary skill in the largest percentage of catchments (16 %). The percentage of catchments where high summary skill is achieved by uncorrected forecasts is 40 %.

Post-processing forecasts with the Log and Log-Sinh schemes reduce the percentage of catchments with low summary skills from 16 % to 2 % and 7 % respectively. However, the percentage of catchments with high summary skill also decreases (in comparison to uncorrected forecasts), from 40 % to 33 % for both the Log and Log-Sinh schemes.

Figure 10: the improvement achieved by the BC0.2 scheme (compared to the Log/Log-Sinh schemes) is most pronounced in New South Wales (NSW) and in the tropical catchments in Queensland (QLD) and the Northern Territory (NT). The few catchments where the BC0.2 scheme does not achieve a high summary skill are located in the north and north-west of Australia.

At the seasonal scale (Fig. 11 and Table 1), uncorrected forecasts have the largest percentage (19 %) of catchments with low summary skill and a relatively small percentage (9 %) of catchments with high summary skill.

Post-processing forecasts with the Log and Log-Sinh schemes reduce the percentage of catchments with low summary skill from 19 % to 18 % and 17 % respectively. The percentage of catchments with high summary skill increases from 9 % to 12 % and 22 % respectively.

Post-processing with the BC0.2 scheme once again provides the best performance: it produces forecasts with low summary skill in only 2 % of the catchments, and achieves high summary skill in 54 % of the catchments. As seen in Fig. 11, similar to the case of monthly forecasts, the biggest improvements for seasonal forecasts occur in the NSW and Queensland regions of Australia.

Overall, Table 1 shows that, across all schemes, BC0.2 results in a smaller percentage of catchments with low summary skill and a larger percentage of catchments with high summary skill. It can also be seen that the summary skills of post-processing approaches are lower for seasonal forecasts than for monthly forecasts.

Sections 4.1–4.3 show that post-processing achieves major improvements in reliability, as well as in CRPSS and sharpness, particularly in dry catchments. Although all three post-processing schemes under consideration provide improvements in some of the performance metrics, the BC0.2 scheme consistently produces better sharpness than the Log and Log-Sinh schemes, while maintaining similar reliability and CRPSS. This finding holds for both monthly and, to a lesser degree, seasonal forecasts. Of the three post-processing schemes, the BC0.2 scheme yields the largest improvement in the percentage of catchments and the number of months in which the post-processed forecasts are reliable and sharper than climatology.

A comparison of uncorrected and post-processed streamflow forecasts was provided in Sect. 4.1. Uncorrected forecasts have reasonable sharpness (except in dry catchments), but suffer from low reliability: uncorrected forecasts are unreliable at approximately 50 % of the catchments. In wet catchments, poor reliability is due to overconfident forecasts, which appears to be a common concern in dynamic forecasting approaches (Wood and Schaake, 2008). In dry catchments, uncorrected forecasts are both unreliable and exhibit poor sharpness. Post-processing is thus particularly important to correct for these shortcomings and improve forecast skill. In this study, all post-processing models provide a clear improvement in reliability and sharpness, and the value of post-processing is more pronounced in dry catchments than in wet catchments (Figs. 4 and 5). This finding can be attributed to the challenge of capturing key physical processes in dry and ephemeral catchments (Ye et al., 1997), as well as the challenge of achieving accurate rainfall forecasts in arid areas. In addition, the simplifications inherent in any hydrological model, including the conceptual model GR4J used in this work, might also be responsible for the forecast skill being relatively lower in dry catchments than in wet catchments. Whilst using a single conceptual model is attractive for practical operational systems, there may be gains in exploring alternative structures for ephemeral catchments (e.g. Clark et al., 2008; Fenicia et al., 2011). We intend to explore such alternative model structures for difficult ephemeral catchments. In such dry catchments, the hydrological model forecasts are particularly poor and leave a lot of room for improvement: post-processing can hence make a big difference to the quality of results.
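The overconfidence described above can be illustrated with probability integral transform (PIT) values. The following is a minimal sketch only: the reliability metric used in this study is not defined in this section, and the function names, the `edge` threshold and the toy numbers are assumptions made for illustration. For a reliable forecast the PIT values are roughly uniform on [0, 1]; an overconfident (too narrow) ensemble places too many observations in its tails, so PIT values pile up near 0 and 1.

```python
def pit_values(ensembles, observations):
    """PIT of each observation: fraction of ensemble members not exceeding it."""
    return [
        sum(1 for member in ens if member <= obs) / len(ens)
        for ens, obs in zip(ensembles, observations)
    ]


def tail_fraction(pits, edge=0.1):
    """Share of PIT values within `edge` of either end of [0, 1] (hypothetical diagnostic)."""
    return sum(1 for p in pits if p <= edge or p >= 1 - edge) / len(pits)


# Toy example (made-up numbers): three very narrow ensembles and the
# corresponding observations, two of which fall outside the ensemble range.
narrow_ensembles = [[9.9, 10.0, 10.1]] * 3
observations = [5.0, 15.0, 10.0]

pits = pit_values(narrow_ensembles, observations)
print(pits)                 # PIT values of 0 or 1 flag observations outside the ensemble
print(tail_fraction(pits))  # values well above 2 * edge suggest overconfidence
```

With uniformly distributed PIT values, `tail_fraction` would be close to `2 * edge`; markedly larger values are one simple symptom of forecasts that are too narrow.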

We now discuss the large differences in sharpness between the BC0.2 scheme
and the Log and Log-Sinh schemes. The Log-Sinh transformation was designed
by Wang et al. (2012) to improve the reliability and sharpness of
predictions, particularly for high flows, and has worked well as part of the
statistical modelling system for operational streamflow forecasts by the
Bureau of Meteorology. The Log-Sinh transformation has a variance stabilizing
function that (for certain parameter values) tapers off for high flows. In
theory, this feature can prevent the explosive growth of predictions for high
flows that can occur with the Log and Box–Cox transformations (especially
when

McInerney et al. (2017) found that, when modelling perennial catchments at
the daily scale, the Log-Sinh scheme did not achieve better sharpness than
the Log scheme. Instead, the parameters for the Log-Sinh scheme tended to converge
to values for which the tapering off of the Log-Sinh transformation function
occurs well outside the range of simulated flows, effectively reducing the
Log-Sinh scheme to the Log scheme. In contrast, the Box–Cox transformation
function with a fixed

Our findings in this study confirm the insights of
McInerney et al. (2017) – namely that the Log-Sinh scheme
produces comparable sharpness to the Log scheme – across a wider range of
catchments. This finding indicates that insights from modelling residual
errors at the daily scale apply at least to some extent to streamflow
forecast post-processing at the monthly and seasonal scales. Note the minor
difference in the treatment of the offset parameter

The goal of the forecasting exercise is to maximize sharpness without sacrificing reliability (Gneiting et al., 2005; Wilks, 2011; Bourdin et al., 2014). The study results show that relying on a single metric for evaluating forecast performance can lead to sub-optimal conclusions. For example, if one considers the CRPSS metric alone, all post-processing schemes yield comparable performance and there is no basis for favouring any single one of them. However, once sharpness is taken into consideration explicitly, the BC0.2 scheme can be recommended due to substantially better sharpness than the Log and Log-Sinh schemes.
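To make the role of the CRPSS in this argument concrete, the sketch below computes an empirical CRPS for ensemble forecasts and the corresponding skill score relative to a reference (e.g. climatology). It uses the standard ensemble estimator CRPS = E|X − y| − 0.5 E|X − X′|; the function names and the toy numbers are assumptions for illustration and do not reproduce the implementation used in this study.

```python
from itertools import product


def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast: E|X - y| - 0.5 * E|X - X'|."""
    n = len(members)
    term_obs = sum(abs(x - obs) for x in members) / n
    term_spread = sum(abs(x - y) for x, y in product(members, repeat=2)) / (n * n)
    return term_obs - 0.5 * term_spread


def crpss(forecasts, references, observations):
    """Skill score 1 - CRPS(forecast) / CRPS(reference), summed over all cases."""
    crps_f = sum(crps_ensemble(f, o) for f, o in zip(forecasts, observations))
    crps_r = sum(crps_ensemble(r, o) for r, o in zip(references, observations))
    return 1.0 - crps_f / crps_r


# Toy example (made-up numbers): two forecast cases against a fixed climatology.
climatology = [10.0, 20.0, 30.0, 40.0, 50.0]
forecasts = [[28.0, 30.0, 32.0], [22.0, 25.0, 28.0]]
observations = [30.0, 25.0]

print(crpss(forecasts, [climatology, climatology], observations))
```

Because the CRPS blends accuracy, reliability and spread into a single number, two schemes with quite different interval widths can obtain similar CRPSS values, which is why sharpness is examined separately in this study.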

Similarly, comparisons based solely on CRPSS might suggest reasonable
performance of the uncorrected forecasts: 55 %–80 % of months have
CRPSS

A number of challenges and questions remain with regard to selecting the performance verification metrics for specific forecasting systems and applications. An important question is how to include user needs in a forecast verification protocol. This could be accomplished by tailoring the evaluation metrics to the requirements of users. Another key question is the extent to which measures of forecast skill correlate with the economic and/or social value of the forecast. This challenging question was investigated by Murphy and Ehrendorfer (1987) and Wandishin and Brooks (2002), who found the relationship between quality and value of a forecast to be essentially nonlinear: an increase in forecast quality may not necessarily lead to a proportional increase in its value. This question requires further multi-disciplinary research, including human psychology, economic theory, communication and social studies (e.g. Matte et al., 2017; Morss et al., 2010).

When designing an operational forecast service for locations with streamflow regimes as diverse and variable as in Australia (Taschetto and England, 2009), it is essential to thoroughly evaluate multiple modelling methods over multiple locations to ensure the findings are sufficiently robust and general. This was the major reason for considering the large set of 300 catchments in our study. This set-up also yields valuable insights into spatial patterns in forecast performance. For example, the Log and Log-Sinh schemes perform relatively well in catchments in south-eastern Australia, and relatively poorly in catchments in northern and north-eastern Australia (Figs. 10 and 11). In contrast, the BC0.2 scheme performs well across the majority of the catchments in all regions included in the evaluation. The evaluation over a large number of catchments in different hydro-climatic regions is clearly beneficial to establish the robustness of post-processing methods. Restricting the analysis to a smaller number of catchments would have led to less conclusive findings.

The empirical results clearly show that the BC0.2 post-processing scheme improves forecast sharpness (precision) while maintaining forecast accuracy and reliability. As discussed below, this improvement in forecast quality offers an opportunity to improve operational planning and management of water resources.

The management of water resources, for example, deciding which water source to use for a particular purpose or allocating environmental flows, requires an understanding of the current and future availability of water. For water resources systems with long hydrological records, water managers have devised techniques to evaluate current water availability, water demand, and losses. However, one of the main unknowns is the volume of future system inflows. Streamflow forecasts provide crucial information to water managers and users regarding the future availability of water, thus helping reduce uncertainty in decision making. This information is particularly valuable for supporting decisions during drought events. In this study, forecast performance is evaluated separately for high- and low-flow months – providing a clearer indication of predictive ability for flows that are above and below average respectively. A detailed evaluation of forecasts for more extreme drought events is challenging as these events are correspondingly rarer. Limited sample size makes it difficult to make conclusive statements: e.g. if we focus on the lowest 5 % of historical data with a 30-year record, we may only have roughly 1.5 samples for each month/season. The uncertainty arising from limited sample size requires further development of forecast verification techniques, potentially adapting some of the approaches used by Hodgkins et al. (2017).
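The sample-size arithmetic quoted above can be reproduced with a few lines of code. This is a back-of-the-envelope sketch; the function names are hypothetical, and the target of 10 events is an arbitrary value chosen only to show how quickly the required record length grows.

```python
def expected_samples(record_years, lower_percent):
    """Expected number of values below a percentile for one calendar month or season."""
    return record_years * lower_percent / 100


def years_needed(target_samples, lower_percent):
    """Record length (years) needed to expect `target_samples` such events."""
    return target_samples * 100 / lower_percent


print(expected_samples(30, 5))  # 1.5 events per month/season in a 30-year record
print(years_needed(10, 5))      # 200.0 years to expect 10 such events
```

Each calendar month (or season) contributes one value per year, so a 30-year record offers only 30 values per month, of which the lowest 5 % amount to 1.5 on average.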

There are several opportunities to further improve the seasonal streamflow forecasting system. This section describes avenues related to specialized treatment of zero flows and high-flow forecasts, uncertainty analysis of post-processing model parameters, and the use of data assimilation (state updating).

The post-processing approaches used in this work do not make special provision for zero flows in the observed data. Robust handling of zero flows in statistical models, especially in arid and semi-arid catchments, is an active research area (Wang and Robertson, 2011; Smith et al., 2015), and advances in this area are certainly relevant to seasonal streamflow forecasting.

A similar challenge is associated with the forecasting of high flows, as the post-processing approaches used in this work can produce streamflow predictions that exceed historical maxima. The IQR ratio used to assess forecast sharpness will detect unreasonably long tails (i.e. extremes) in the predictive distributions and hence can indirectly identify instances of unreasonably high-flow forecasts. Further research is needed to develop techniques to evaluate the realism of forecasts that exceed historical maxima.
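The idea behind the IQR ratio, and a direct check against the historical maximum, can be sketched as follows. The exact definition of the IQR ratio used in this study is not given in this section, so the form below (width of a central forecast interval divided by the width of the same interval for climatology, with a 99 % coverage chosen here as an assumption) is only illustrative, as are the function names.

```python
def quantile(values, p):
    """Linear-interpolation quantile of a sample, for 0 <= p <= 1."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def iqr_ratio(forecast, climatology, coverage=0.99):
    """Width of the central forecast interval relative to climatology (assumed form)."""
    tail = (1.0 - coverage) / 2.0
    width_f = quantile(forecast, 1 - tail) - quantile(forecast, tail)
    width_c = quantile(climatology, 1 - tail) - quantile(climatology, tail)
    return width_f / width_c


def fraction_above_historical_max(forecast, observed_history):
    """Share of forecast ensemble members exceeding the largest observed flow."""
    hist_max = max(observed_history)
    return sum(1 for x in forecast if x > hist_max) / len(forecast)


# Toy example (made-up numbers): one long-tailed ensemble member.
history = [5.0, 8.0, 12.0, 20.0, 35.0]
forecast = [6.0, 9.0, 11.0, 14.0, 90.0]

print(iqr_ratio(forecast, history))                      # > 1: wider than climatology
print(fraction_above_historical_max(forecast, history))  # 0.2
```

Under this assumed definition, a ratio below 1 indicates a forecast sharper than climatology, while a long upper tail inflates the ratio. The exceedance fraction complements it by flagging directly how much of the predictive distribution lies above anything observed so far.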

Another area for further investigation is the identifiability of parameters

Finally, the forecasting system used in this study does not employ data assimilation to update the states of the GR4J hydrological model. Gibbs et al. (2018) showed that monthly streamflow forecasting benefits from state updating in catchments that exhibit non-stationarity in their rainfall–runoff dynamics. Note that data assimilation of ocean observations has been implemented in the climate model (POAMA2) used for the rainfall forecast (Yin et al., 2011) (see Sect. 3.2 for additional details).

This study focused on developing robust streamflow forecast post-processing schemes for an operational forecasting service at the monthly and seasonal timescales. For such forecasts to be useful to water managers and decision-makers, they should be reliable and exhibit sharpness that is better than climatology.

We investigated streamflow forecast post-processing schemes based on residual
error models employing three data transformations, namely the logarithmic
(Log), log-sinh (Log-Sinh), and Box–Cox with

The following empirical findings are obtained.

Uncorrected forecasts (no post-processing) perform poorly in terms of reliability, resulting in a mischaracterization of forecast uncertainties.

All three post-processing schemes substantially improve the reliability of streamflow forecasts, both in terms of the dedicated reliability metric and in terms of the summary skill given by the CRPSS.

Of the post-processing schemes considered in this work, the BC0.2 scheme is found to be best suited for operational application. The BC0.2 scheme provides the sharpest forecasts without sacrificing reliability, as measured by the reliability and CRPSS metrics. In particular, the BC0.2 scheme produces forecasts that are both reliable and sharper than climatology at substantially more catchments than the alternative Log and Log-Sinh schemes.

The data underlying this research can be accessed from the following links:
observed rainfall data (

The authors declare that they have no conflict of interest.

Data for this study are provided by the Australian Bureau of Meteorology. This work was supported by Australian Research Council grant LP140100978 with the Australian Bureau of Meteorology and South East Queensland Water. We thank the anonymous reviewers for constructive comments and feedback that helped us substantially improve the paper.

Edited by: Albrecht Weerts

Reviewed by: two anonymous referees