For streamflow forecasting, rainfall–runoff models are often augmented with updating procedures that correct forecasts based on the latest available observations of streamflow. A popular approach for updating forecasts is to use autoregressive (AR) models, which exploit the “memory” in hydrological model simulation errors. AR models may be applied to raw errors directly or to normalised errors. In this study, we demonstrate that AR models applied in either way can sometimes cause over-correction of forecasts. When an AR model is applied to raw errors, the over-correction usually occurs when streamflow is rapidly receding. When an AR model is applied to normalised errors, the over-correction usually occurs when streamflow is rapidly rising. In addition, when parameters of a hydrological model and an AR model are estimated jointly, the AR model applied to normalised errors sometimes degrades the stand-alone performance of the base hydrological model. This is not desirable for forecasting applications, as forecasts should rely as much as possible on the base hydrological model, with updating only used to correct minor errors. To overcome the adverse effects of the conventional AR models, a restricted AR model applied to normalised errors is introduced. We show that the new model reduces over-correction and improves the performance of the base hydrological model considerably.

Rainfall–runoff models are widely used to generate streamflow forecasts, which provide essential information for flood warning and water resource management. For streamflow forecasting, rainfall–runoff models are often augmented by updating procedures that correct streamflow forecasts based on the latest available observations of streamflow and their departures from model simulations. Model errors reflect limitations of the hydrological models in reproducing physical processes as well as inaccuracies in data used to force and evaluate the models.

The most popular updating approach uses autoregressive (AR) models, which exploit the “memory” – more precisely the autocorrelation structure – of errors in hydrological simulations (Morawietz et al., 2011). Essentially, AR updating uses a linear function of the known errors at previous time steps to anticipate errors in a forecast period. Forecasts are then updated according to these anticipated errors. AR updating is conceptually simple and yet generally leads to significantly improved forecasts (World Meteorological Organization, 1992). AR updating has been shown to provide equivalent performance to more sophisticated non-linear and non-parametric updating procedures (Xiong and O'Connor, 2002).
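The idea of AR updating can be illustrated with a minimal sketch, assuming a first-order AR model with a single coefficient. The function name, the argument names and the numbers below are illustrative only and are not taken from the study.

```python
def ar1_update(sim_now, sim_prev, obs_prev, rho, lead=1):
    """Illustrative AR(1) update of a simulated streamflow value.

    sim_now  : model simulation for the forecast time step
    sim_prev : model simulation at the last time step with an observation
    obs_prev : observed streamflow at that time step
    rho      : lag-1 autoregressive coefficient (assumed known here)
    lead     : number of time steps ahead of the last observation
    """
    prev_error = obs_prev - sim_prev               # known error at the previous step
    anticipated_error = (rho ** lead) * prev_error  # linear function of the known error
    return sim_now + anticipated_error


# The model under-estimated by 4 units at the previous step, so half of that
# error (rho = 0.5) is added to the current simulation.
updated = ar1_update(sim_now=12.0, sim_prev=10.0, obs_prev=14.0, rho=0.5)
```

With an absolute coefficient below one, the anticipated error shrinks geometrically with lead time, which is why forecasts at long lead times depend mainly on the base simulation.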

In rainfall–runoff modelling, model errors are generally heteroscedastic (i.e. they have heterogeneous variance over time) (Xu, 2001; Kavetski et al., 2003; Pianosi and Raso, 2012) and non-Gaussian (Bates and Campbell, 2001; Schaefli et al., 2007; Shrestha and Solomatine, 2008). In many applications (Seo et al., 2006; Bates and Campbell, 2001; Salamon and Feyen, 2010; Morawietz et al., 2011), AR models are applied to normalised errors that are considered homoscedastic and Gaussian. Normalisation is often achieved through variable transformation by using, for example, the Box–Cox transformation (Thyer et al., 2002; Bates and Campbell, 2001; Engeland et al., 2010) or, more recently, the log-sinh transformation (Wang et al., 2012; Del Giudice et al., 2013). In other applications (Schoups and Vrugt, 2010; Schaefli et al., 2007), AR models are applied directly to raw errors, but residual errors of the AR models may be explicitly specified as heteroscedastic and non-Gaussian.

There is no agreement on whether it is better to apply an AR model to normalised or raw errors. Recent work by Evin et al. (2013) found that an AR model applied to raw errors may lead to poor performance with exaggerated uncertainty. They demonstrated that such instability can be mitigated by applying an AR model to standardised errors (raw errors divided by standard deviations). Here, standardisation has a similar effect to normalisation in that it homogenises the variance of the errors (but does not consider the non-Gaussian distribution of errors). Conversely, Schaefli et al. (2007) pointed out that when an AR model is jointly estimated with a hydrological model, there is a clear advantage in applying an AR model to raw errors rather than normalised (or standardised) errors. Schaefli et al. (2007) found that using raw errors leads to more reliable parameter inference and uncertainty estimation, because the mean error is close to zero and therefore the simulations are free of systematic bias. The same is not necessarily true when applying an AR model to normalised errors.

In this study, we evaluate AR models applied to both raw and normalised errors in four Australian catchments and three United States (US) catchments. We show that when estimated jointly with a hydrological model, the AR model applied to normalised errors sometimes degrades the stand-alone performance of the base hydrological model. We also show that both of these conventional AR models can sometimes cause over-correction of forecasts. We introduce a restricted AR model applied to normalised errors and demonstrate its effectiveness in overcoming the adverse effects of the conventional AR models.

A hydrological model is a function of forcing variables (precipitation and
potential evapotranspiration), initial catchment state,

In this study, we firstly examine two first-order AR error models:

An AR error model applied to normalised errors (referred to as

An AR error model applied to raw errors (referred to as

Both the AR-Norm and AR-Raw models represent the lag-1 autocorrelation by an AR process and both employ the log-sinh transformation. However, the way the log-sinh transformation is applied differs between the two models. The AR-Norm model first applies the log-sinh transformation to the observed and model simulated streamflow, and then assumes that the error in the transformed space follows an AR(1) process. In contrast, the AR-Raw model essentially assumes that the error in the original space follows an AR(1) process and only applies the log-sinh transformation to fit the asymmetric and non-Gaussian error distribution.
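The difference between the two models can be sketched for the deterministic (median) part of the update. This is a sketch under stated assumptions rather than the study's implementation: the log-sinh form used here, z = (1/b) ln(sinh(a + b q)), is the form commonly given for this transformation (Wang et al., 2012), and the parameter values, function names and argument names are hypothetical.

```python
import math

A, B = 0.01, 0.01  # hypothetical log-sinh parameters, chosen only for illustration


def log_sinh(q):
    """Assumed log-sinh form: z = (1/b) * ln(sinh(a + b*q))."""
    return math.log(math.sinh(A + B * q)) / B


def inv_log_sinh(z):
    """Inverse of the assumed form: q = (asinh(exp(b*z)) - a) / b."""
    return (math.asinh(math.exp(B * z)) - A) / B


def ar_norm_median(sim_now, sim_prev, obs_prev, rho):
    """AR(1) on the error in transformed space, followed by back-transformation."""
    z = log_sinh(sim_now) + rho * (log_sinh(obs_prev) - log_sinh(sim_prev))
    return inv_log_sinh(z)


def ar_raw_median(sim_now, sim_prev, obs_prev, rho):
    """AR(1) on the error in the original (untransformed) space."""
    return sim_now + rho * (obs_prev - sim_prev)
```

In this sketch the AR-Raw update has the same size whatever the current simulation is, whereas the size of the AR-Norm update in streamflow units depends on where the current simulation sits on the transformation curve.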

The medians of the updated streamflow forecast (referred to as

We will demonstrate in Sect. 4 that the AR-Norm and AR-Raw models can
sometimes cause over-correction of forecasts. Motivated to overcome the
potential for over-correction, we introduce a modification of the AR-Norm
model, called the restricted AR-Norm model (referred to as

The AR-Norm, AR-Raw and RAR-Norm models are each calibrated jointly with the
hydrological model. The method of maximum likelihood is used to estimate the
error model parameters

for AR-Norm

for AR-Raw

for RAR-Norm

We use daily data from four Australian catchments and three catchments from the US (Fig. 1, Table 1). Australian streamflow data are taken from the Catchment Water Yield Estimation Tool (CWYET) data set (Vaze et al., 2011). Australian rainfall and potential evaporation data are derived from the Australian Water Availability Project (AWAP) data set (Jones et al., 2009). All data for the US catchments come from the Model Intercomparison Experiment (MOPEX) data set (Duan et al., 2006). The selected US catchments are amongst the 12 catchments used by Evin et al. (2014) to compare joint and postprocessor approaches to estimating hydrological uncertainty, and allow us to compare results with that study (the other catchments used by Evin et al. (2014) are influenced by snowmelt, which is not considered in the hydrological model used in this study). The Abercrombie River and the Guadalupe River intermittently experience periods of very low (to zero) flow, while the other rivers flow perennially (Table 1). Such dry catchments are challenging for hydrological simulations and error modelling. All catchments have high-quality streamflow records with very few missing data.

Map of US (top panel) and Australian (bottom panel) catchments.

Catchment characteristics.

We forecast daily streamflow with the GR4J rainfall–runoff model
(Perrin et al., 2003). We apply updating procedures to correct
these forecasts. All results presented in this paper are based on
cross-validation to ensure the results can be generalised to independent
data. We use different cross-validation schemes for the Australian and US
catchments, because of the shorter streamflow records available for the
Australian catchments:

For the Australian catchments, we use data from 1992 to 2005 (14 years) to generate 14-fold cross-validated streamflow forecasts. The data from 1990 to 1991 are only used to warm up the GR4J model. For a given year, we leave out the data from that year and the following year when estimating the parameters of GR4J and error models. For example, if we wish to forecast streamflows at any point in 1999, we leave out data from 1999 and 2000 when we estimate parameters. The removal of data from the following year (2000) is designed to minimise the impact of hydrological memory on model parameter estimation. We then generate streamflow forecasts in that year (1999) with model parameters estimated from the remaining data.
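The leave-out scheme for the Australian catchments can be sketched as follows. The function and variable names are illustrative, and the sketch only builds the sets of years; parameter estimation itself is not shown.

```python
def training_years(target_year, first_year=1992, last_year=2005):
    """Years used for parameter estimation when forecasting target_year."""
    # Leave out the target year and the year that follows it.
    left_out = {target_year, target_year + 1}
    return [y for y in range(first_year, last_year + 1) if y not in left_out]


# One fold per forecast year, 1992 to 2005 inclusive.
folds = {year: training_years(year) for year in range(1992, 2006)}
```

For the final year (2005) the following year lies outside the record, so only one year is removed in that fold.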

For the US catchments we follow the split-sampling validation scheme suggested by Evin et al. (2014) to make our results comparable to that study: (1) an 8 year calibration (9 September 1973–26 November 1981) (i.e. 3000 days) with an 8 year warm-up period and (2) a 17 year validation (27 November 1981–1 May 1998) (i.e. 6000 days) with an 8 year warm-up period.

The first adverse effect of the conventional AR models is over-correction of errors in updating as streamflows are rising. By over-correction, we mean that the AR model updates the hydrological model simulations too much. Over-correction is difficult to define precisely; however, we will demonstrate the concept with two examples in the Mitta Mitta catchment: the first example illustrates over-correction by the AR-Norm model, and the second example illustrates over-correction by the AR-Raw model.

An example of over-correction caused by the AR-Norm model in the Mitta Mitta catchment. Dashed lines: forecasts from the base hydrological model (i.e. without error updating). Solid lines: forecasts with error updating.

The fraction of instances where

Forecast streamflows for the Orara catchment for an example 1 year
period. The top panel shows streamflows forecast with the AR-Norm model; the
bottom panel shows streamflows forecast with the RAR-Norm model. Dashed
lines: forecasts from the base hydrological model (i.e. without error
updating). Solid lines: forecasts with error updating. Tick marks on the

To illustrate the problem of over-correction caused by the AR-Norm model,
Fig. 2 presents a 1 week time series for the Mitta Mitta catchment, showing
streamflow forecasts with GR4J before error updating (referred to as
streamflow forecast with the

Figure 3 shows instances of possible over-correction by the AR-Norm model,
identified by the condition

Figure 4 presents a time series for the Orara catchment that shows the instances susceptible to over-correction for the AR-Norm model. These instances all occur when the streamflow rises. The RAR-Norm model effectively rectifies the problem of over-correction caused by the AR-Norm model. We note that there is nothing that forces the instances susceptible to over-correction identified by the AR-Norm model to be the same as those identified by the RAR-Norm models, because the two models are calibrated independently (and therefore base hydrological model simulations may be different). However, the restriction defined in the RAR-Norm model is largely applied to the instances where the AR-Norm model is susceptible to over-correction.

The second adverse effect of conventional AR models is over-correction of forecasts as streamflows recede. An example is presented in Fig. 5 where the AR-Raw model causes over-correction. Here, the base hydrological model over-estimates the receding hydrograph on 5 October 1993. The magnitude of the error update given by the AR-Raw model cannot adjust according to the value of the forecast. As a result, the AR-Raw model updates the forecast on 6 October 1993 by a large amount, resulting in serious under-estimation (the forecast streamflow is nearly zero), and an artificial distortion of the hydrograph. (We note that we have seen this problem become much worse in unpublished experiments of forecasts made for several time steps into the future, sometimes resulting in forecasts of zero flows during large floods.) In contrast, the AR-Norm model performs better in this example, giving a smaller magnitude of error update by recognising that the hydrograph is moving downward. It is generally true that in applying the AR-Raw model, over-correction may occur when the streamflow is receding. The RAR-Norm model produces updated streamflow similar to the AR-Norm model when the hydrograph recedes rapidly and avoids the over-correction by the AR-Raw model on 6 October 1993.
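The mechanism can be reproduced with a small numerical sketch. The streamflow values, the AR coefficient and the log-sinh parameters are invented for illustration and are not the values behind Fig. 5; the log-sinh form z = (1/b) ln(sinh(a + b q)) is assumed.

```python
import math

A, B = 0.01, 0.01  # hypothetical log-sinh parameters


def log_sinh(q):
    """Assumed log-sinh form: z = (1/b) * ln(sinh(a + b*q))."""
    return math.log(math.sinh(A + B * q)) / B


def inv_log_sinh(z):
    """Inverse of the assumed log-sinh form."""
    return (math.asinh(math.exp(B * z)) - A) / B


# Hypothetical recession: the model over-estimated by 100 units yesterday
# and simulates a much lower flow today.
sim_prev, obs_prev, sim_now, rho = 150.0, 50.0, 95.0, 0.9

# AR-Raw: the update is fixed by yesterday's raw error, whatever today's flow is.
raw_updated = sim_now + rho * (obs_prev - sim_prev)

# AR-Norm: the update is made in transformed space and then back-transformed.
z = log_sinh(sim_now) + rho * (log_sinh(obs_prev) - log_sinh(sim_prev))
norm_updated = inv_log_sinh(z)
```

With these numbers the AR-Raw update removes 90 units from a simulated flow of 95, leaving a forecast close to zero, while the AR-Norm update is smaller because it scales with the lower simulated flow.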

Figure 6 provides more examples of the over-correction caused by the AR-Raw model from a longer time-series plot for the Abercrombie catchment. There are three clear instances of over-correction, all occurring on the time step immediately after large peaks in observed streamflows. The RAR-Norm model works better than the AR-Raw model to avoid the three instances of over-correction for the Abercrombie catchment. Overall, the RAR-Norm model takes a conservative position when streamflow changes rapidly, either rising or falling. When streamflow changes rapidly, it is difficult to anticipate the magnitude of forecast error. Accordingly, the conventional AR models are prone to over-correction in such instances.

An example of over-correction caused by the AR-Raw model in the Mitta Mitta catchment. Dashed lines: forecasts from the base hydrological model (i.e. without error updating). Solid lines: forecasts with error updating.

Forecast streamflows for the Abercrombie catchment for the period between 1 August 1997 and 15 September 1997. The top panel shows streamflows forecast with the AR-Raw model; the bottom panel shows streamflows forecast with the RAR-Norm model. Dashed lines: forecasts from the base hydrological model (i.e. without error updating). Solid lines: forecasts with error updating. Grey shading denotes instances of over-correction caused by the AR-Raw model.

NSE of streamflows forecast with the AR-Norm, AR-Raw and RAR-Norm models (colours). Performance of the corresponding base hydrological models is shown by hatched blocks.

The third adverse effect of conventional AR error models is poor stand-alone performance of the base hydrological model (GR4J). As noted above, the parameters of the base hydrological model are estimated jointly with each error model. For streamflow forecasting, we expect to obtain a reasonably accurate forecast from the base hydrological model followed by an updating procedure as an auxiliary means of improving the forecast accuracy. At lead times of many time steps (e.g. streamflow forecasts generated from medium-range rainfall forecasts) the magnitude of AR error updates becomes rapidly smaller (tending to zero), and thus the performance of the base hydrological model is crucial for realistic forecasts at longer lead times. While we only investigate forecasts at a lead time of one time step in this study, we aim to develop methods that can be applied to forecasts at longer lead times. Furthermore, if the base hydrological model does not replicate important catchment processes realistically, the performance of the hydrological model outside the calibration period may be less robust.

Figure 7 presents the Nash–Sutcliffe efficiency (NSE) (Nash and Sutcliffe,
1970) calculated from the base hydrological model and the error models. When
the AR-Norm model is used, the forecasts from the base hydrological model are
very poor for the Orara catchment (NSE

In general, the AR-Raw base hydrological model performs as well as or better than the AR-Norm base hydrological model. The AR-Raw base hydrological model is notably better than the AR-Norm base hydrological model in the Abercrombie and Orara catchments (Fig. 7). This suggests that more robust performance can be expected of base hydrological models when AR models are applied to raw errors.

The RAR-Norm model generally improves the performance of the AR-Norm base hydrological model to a level similar to the AR-Raw base hydrological model (Fig. 7). The improvement over the AR-Norm base hydrological model is especially evident for the Orara (Figs. 4 and 7) and Abercrombie catchments (Fig. 7).
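For reference, the NSE values compared above follow the standard definition of Nash and Sutcliffe (1970): one minus the ratio of the sum of squared forecast errors to the sum of squared deviations of the observations from their mean. A minimal sketch, with illustrative function and variable names:

```python
def nse(observed, forecast):
    """Nash-Sutcliffe efficiency of a forecast series against observations."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    obs_variability = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / obs_variability


obs = [1.0, 2.0, 3.0, 4.0]
perfect = nse(obs, obs)            # forecasts equal to the observations
climatology = nse(obs, [2.5] * 4)  # forecasts equal to the observed mean
```

An NSE of 1 indicates a perfect forecast, 0 indicates a forecast no better than the observed mean, and negative values indicate a forecast worse than the observed mean.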

Comparison of the NSE calculated at (a) the receding limb and (b) the rising limb of the hydrograph for three different error models.

Comparison of the observed streamflows (

We note that for the AR-Norm models, the updated forecasts are not always better than forecasts generated by the base hydrological models. For the Tarwin and Guadalupe catchments, AR-Norm forecasts are not as good as the forecasts generated by the AR-Norm base hydrological model. This points to a tendency to overfit the parameters to the calibration period, resulting in the error model undermining the performance of the base hydrological model under cross-validation. Such a lack of robustness is highly undesirable in forecasting applications, where the hydrological models should be able to operate in conditions that differ from those experienced during calibration. Note that this problem also occurs in the RAR-Norm model (Guadalupe) and in the AR-Raw model (Abercrombie, Guadalupe), but to a much smaller degree.

In general, the updated forecasts from the RAR-Norm model show similar or better forecast accuracy, as measured by NSE, than both the AR-Raw model and the AR-Norm model (Fig. 7). We note that the Orara catchment is an exception: here the AR-Raw model shows slightly better performance than the RAR-Norm model. Conversely, the RAR-Norm model shows notably better performance than both the AR-Norm and AR-Raw models in the Abercrombie and Guadalupe catchments. This suggests the RAR-Norm model may work better in intermittently flowing catchments, although further testing is required to establish that this is true for a greater range of catchments.

We further evaluate the NSE of the three different error models calibrated
when streamflows are receding (i.e.

PIT uniform probability plots. Curves on the diagonal indicate perfectly reliable forecasts.

We have shown that over-corrections can lead to inaccurate deterministic forecasts, and we now discuss the consequences for the probabilistic predictions given by each of the error models. We assess probabilistic forecast skill with skill scores derived from two probabilistic verification measures: the continuous ranked probability score (CRPS) and the root mean square error in probability (RMSEP) (denoted by CRPS_SS and RMSEP_SS, respectively) (Wang and Robertson, 2011). Both skill scores are calculated with respect to a reference forecast. The reference forecast is generated by resampling historical streamflows: for a forecast issued for a given month/year (e.g. February 1999), we randomly draw, with replacement, a sample of 1000 daily streamflows that occurred in that month (e.g. February) in other years (e.g. years other than 1999). Table 3 compares these two skill scores calculated for all catchments. The RAR-Norm model performs best across the range of skill scores and catchments, attaining the highest CRPS_SS in 4 of the 7 catchments and the highest RMSEP_SS in 4 of the 7 catchments. Even where the RAR-Norm model is not the best-performing model, it performs very similarly to the best-performing model. Interestingly, the AR-Raw model tends to outperform the AR-Norm model in CRPS_SS while the reverse is true for RMSEP_SS. The CRPS tests how appropriate the spread of uncertainty is for each probabilistic forecast, while RMSEP puts little weight on this. The results suggest that while the median forecasts of the AR-Norm model tend to be slightly more accurate than those of the AR-Raw model, the forecast uncertainty is represented slightly better by the AR-Raw model.
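The CRPS-based skill score and the resampled reference forecast can be sketched as follows. This is a sketch under stated assumptions, not the study's code: the CRPS is computed with the standard ensemble estimator, the skill score is expressed as a fraction (it may equally be reported as a percentage), and all function and argument names are hypothetical.

```python
import random


def crps_ensemble(members, obs):
    """Standard ensemble estimator of the CRPS for a single forecast."""
    n = len(members)
    accuracy = sum(abs(x - obs) for x in members) / n
    spread = sum(abs(x - y) for x in members for y in members) / (2 * n * n)
    return accuracy - spread


def skill_score(score, score_ref):
    """Fractional improvement over the reference (1 = perfect, 0 = no better)."""
    return (score_ref - score) / score_ref


def reference_ensemble(same_month_flows_other_years, size=1000, seed=0):
    """Draw daily flows, with replacement, from the same month in other years."""
    rng = random.Random(seed)
    return rng.choices(same_month_flows_other_years, k=size)
```

In practice the score would be averaged over all forecasts, for both the model and the reference, before the skill score is computed.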

To understand better how reliably the forecast uncertainty is quantified by each model, we produce probability integral transform (PIT) uniform probability plots (Wang and Robertson, 2011) in Fig. 9. There are two main points to draw from these plots. First, the curves are very similar for all error models (a partial exception is the San Marcos catchment, where the AR-Raw model is slightly closer to the one-to-one line than the other models). This demonstrates that, in general, the models produce similarly reliable uncertainty distributions. Second, all models show an inverted S-shaped curve, which indicates that the uncertainty ranges are too wide. This underconfidence is a result of using a Gaussian distribution to characterise the error. The Gaussian distribution is not flexible enough to represent the high degree of kurtosis in the distribution of the residuals after error updating (partly because the errors become very small after updating). We are presently experimenting with other distributions in order to address this issue, and will seek to publish this work in future. For the purposes of the present study, we conclude that the three error models are similarly reliable.
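The quantities behind a PIT uniform probability plot can be sketched as follows, assuming each forecast is available as an ensemble. The PIT value is then approximated by the fraction of ensemble members not exceeding the observation; the function names and the plotting positions i/(n+1) are illustrative choices.

```python
def pit_value(members, obs):
    """Empirical forecast CDF evaluated at the observation."""
    return sum(1 for x in members if x <= obs) / len(members)


def pit_plot_points(pit_values):
    """Pairs of (theoretical uniform quantile, sorted PIT value) for plotting."""
    n = len(pit_values)
    ordered = sorted(pit_values)
    return [((i + 1) / (n + 1), p) for i, p in enumerate(ordered)]


points = pit_plot_points([0.75, 0.25, 0.5])
```

For reliable forecasts the PIT values are uniformly distributed, so the points fall along the one-to-one line.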

Comparison of the skill scores based on CRPS and RMSEP (denoted by CRPS_SS and RMSEP_SS) for three different error models.

For streamflow forecasting, rainfall–runoff models are often augmented with an updating procedure that corrects the forecast using information from recent simulation errors. The most popular updating approach uses autoregressive (AR) models that exploit the “memory” in model errors. AR models may be applied to raw errors directly or to normalised errors.

We demonstrate three adverse effects of AR error updating procedures on seven catchments. The first adverse effect is possible over-correction on the rising limb of the hydrograph. The AR-Norm model can tend to over-correct at the peak or on the rising limb of a hydrograph, because error updating can be (overly) amplified by the back-transformation. The second adverse effect is the tendency to over-correct receding hydrographs. This tendency is most prevalent in the AR-Raw model, which can fail to recognise that a large error update may not be appropriate for small streamflows.

The third adverse effect is that the stand-alone performance of the base hydrological model can be poor when the parameters of the rainfall–runoff model and the error model are jointly estimated. We show that poor base hydrological model performance is particularly prevalent in the AR-Norm model. The poor performance appears to occur in catchments with highly skewed streamflow observations (the intermittent Abercrombie River, and the Orara River, a catchment in a subtropical climate). For example, in the Orara River, the base hydrological model tends to greatly over-estimate streamflows, and then relies on the error updating to correct the over-estimates. This is not desirable in real-time forecasting applications for two major reasons. First, modern streamflow forecasting systems often extend forecast lead times with rainfall forecast information (Bennett et al., 2014). The magnitude of AR updating decays with lead time, and forecasts at longer lead times rely heavily on the performance of the base hydrological model. Second, hydrological models are designed to simulate various components of natural systems, such as baseflow processes or overland flow. In theory, simulating these processes correctly will allow the model to perform well for climate conditions that may substantially differ from those experienced during the parameter estimation period. If the hydrological model parameters do not reflect the natural processes for a given catchment, the hydrological model may be much less robust outside the parameter estimation period.

We note that the poor performance of the hydrological model may be specific to the GR4J model, and may not occur in other hydrological models. Evin et al. (2014) estimated hydrological model and error model parameters jointly using GR4J and another hydrological model, HBV, for the three US catchments tested here. While they did not assess the performance of the base hydrological models, they found that HBV tended to perform more robustly when combined with different error models. It is possible that we may have achieved more stable base model performance had we used HBV or another hydrological model. We note, however, that our conclusions can probably be generalised to other hydrological models that do not offer robust base model performance under joint parameter estimation (e.g. GR4J). Because the RAR-Norm model limits the range of updating that can be applied, it will tend to rely more heavily on the base hydrological model, and therefore will tend to favour parameter sets that encourage good stand-alone performance of the base model. For those hydrological models that already produce robust base model performance under joint parameter estimation (perhaps HBV), RAR-Norm is unlikely to undermine this performance for the same reasons. We see some evidence of this in our experiments with GR4J: when the performance of the base hydrological model is already strong relative to the updated forecasts for the AR-Norm and AR-Raw models (e.g. the Tarwin, Mitta Mitta, or Guadalupe catchments), the RAR-Norm base hydrological model also performs strongly.

The tendency of the AR-Norm model to over-correct rising streamflows is probably generic. In particular, transformations other than the log-sinh transformation may still lead to over-correction at the peak of the hydrograph. The proof in Appendix B shows that if a transformation satisfies some conditions (the first derivative is positive and the second derivative is negative), it will tend to correct more for higher forecast streamflows and can cause the problem of over-correction. The conditions given by Appendix B are generally true for many other transformations used for data normalisation and variance stabilisation in hydrological applications, such as the logarithm transformation or the Box–Cox transformation with the power parameter less than 1.
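This behaviour can be checked numerically with the logarithm transformation, which has a positive first derivative and a negative second derivative. The sketch below applies the same transformed-space update at a low and a high simulated streamflow and compares the size of the resulting corrections in streamflow units; the function name and the numbers are illustrative.

```python
import math


def update_magnitude(sim_now, delta_z, forward=math.log, inverse=math.exp):
    """Size, in streamflow units, of the correction produced by a
    transformed-space update delta_z applied at simulated flow sim_now."""
    return abs(inverse(forward(sim_now) + delta_z) - sim_now)


delta_z = 0.5                                 # same transformed-space update
low = update_magnitude(10.0, delta_z)         # applied at a low simulated flow
high = update_magnitude(200.0, delta_z)       # applied at a high simulated flow
```

For the logarithm the correction is proportional to the simulated flow, so a simulated flow 20 times larger receives a correction 20 times larger. The same ordering holds for a negative update.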

We use joint parameter inference to calibrate hydrological model and error model parameters, in order to address the true nature of underlying model errors. Inferring parameters of the error model and the base hydrological model independently – i.e. first inferring parameters of the base hydrological model, holding these constant and then inferring the error model parameters – relies on simplified and often invalid error assumptions (it assumes independent, homoscedastic and Gaussian errors), but nonetheless could be a pragmatic alternative to the joint parameter inference to reduce computational demands. The over-correction of conventional AR models is independent of the parameter inference, whether the error and base hydrological model parameters are inferred jointly or independently.

In order to mitigate the adverse effects of conventional AR updating procedures, we introduce a new updating procedure called the RAR-Norm model. The RAR-Norm model is a modification of the AR-Norm model: in most instances it operates as the AR-Norm model, but in instances of possible over-correction it relies on the error in untransformed streamflows at the previous time step. That is, RAR-Norm is essentially a more conservative error model than AR-Norm: in situations where streamflows change rapidly, it opts to update with whichever error (transformed or untransformed) is smaller. This forces greater reliance on the base hydrological model to simulate streamflows accurately, leading to more robust performance in the base hydrological model. The RAR-Norm model clearly outperforms the AR-Norm model in both the updated and base model forecasts, as well as ameliorating the problem of over-correcting rising streamflows. The RAR-Norm model's advantage over the AR-Raw model is less clear: both the base hydrological model and the updated forecasts produced by the AR-Raw model perform similarly to (or sometimes slightly better than) the RAR-Norm model. However, the RAR-Norm model clearly addresses the problem of over-correcting receding streamflows that occurs in the AR-Raw model. As we show, this type of over-correction can seriously distort event hydrographs, and cause forecasts of near zero streamflows when reasonably substantial streamflows are observed. While these instances are not very common, the failure in the forecast is a serious one. As noted earlier, the over-correction of receding streamflows is likely to be exacerbated when producing forecasts at lead times of more than one time step. Accordingly, we contend that the RAR-Norm model is preferable to both the AR-Norm and AR-Raw models for streamflow forecasting applications.
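One reading of this restriction can be sketched for the median of the updated forecast. The sketch rests on several assumptions: the log-sinh form z = (1/b) ln(sinh(a + b q)), hypothetical parameter values, and a restricted update equal to the raw error at the previous time step whenever the AR-Norm update would exceed that error in magnitude. The function names are illustrative and the exact formulation used in the study may differ.

```python
import math

A, B = 0.01, 0.01  # hypothetical log-sinh parameters, for illustration only


def log_sinh(q):
    """Assumed log-sinh form: z = (1/b) * ln(sinh(a + b*q))."""
    return math.log(math.sinh(A + B * q)) / B


def inv_log_sinh(z):
    """Inverse of the assumed log-sinh form."""
    return (math.asinh(math.exp(B * z)) - A) / B


def ar_norm_median(sim_now, sim_prev, obs_prev, rho):
    """AR(1) update in transformed space, back-transformed to streamflow."""
    z = log_sinh(sim_now) + rho * (log_sinh(obs_prev) - log_sinh(sim_prev))
    return inv_log_sinh(z)


def rar_norm_median(sim_now, sim_prev, obs_prev, rho):
    """Restricted update: never correct by more than the previous raw error."""
    candidate = ar_norm_median(sim_now, sim_prev, obs_prev, rho)
    prev_raw_error = obs_prev - sim_prev
    if abs(candidate - sim_now) <= abs(prev_raw_error):
        return candidate                 # behaves as the AR-Norm model
    return sim_now + prev_raw_error      # possible over-correction: restrict


# Rapid rise: a small previous error would be amplified by back-transformation.
rising = rar_norm_median(sim_now=200.0, sim_prev=10.0, obs_prev=20.0, rho=0.9)
# Recession: the AR-Norm update is smaller than the previous raw error.
receding = rar_norm_median(sim_now=95.0, sim_prev=150.0, obs_prev=50.0, rho=0.9)
```

In the rising case the restricted update is capped at the previous raw error of 10 units, while in the receding case the result is identical to the AR-Norm update.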

For brevity we only show the case of the AR-Norm model; analogous arguments
can be used to prove the cases of the AR-Raw and RAR-Norm models. The
streamflow ensemble forecast

We will show analytically that the AR-Norm model gives a larger magnitude of the error update for a higher forecast streamflow.

Firstly, we will show that the first derivative of the log-sinh transform

Next, we will derive the difference in magnitudes of the error update
between low and high forecast streamflows. For notational
simplicity, we rewrite

This work is part of the WIRADA (Water Information Research and Development Alliance) streamflow forecasting project funded under CSIRO Water for a Healthy Country Flagship. We would like to thank Durga Shrestha (CSIRO) for valuable suggestions that led to substantial strengthening of the manuscript. We would like to thank two reviewers, Bettina Schaefli and Mark Thyer, for their careful reviews and valuable recommendations, which have improved the quality of this manuscript considerably. Edited by: G. Di Baldassarre