Accuracy of reservoir inflow forecasts is instrumental for maximizing the
value of water resources and benefits gained through hydropower generation.
Improving hourly reservoir inflow forecasts over a 24 h lead time is
considered within the day-ahead (Elspot) market of the Nordic exchange
market. A complementary modelling framework presents an approach for
improving real-time forecasting without needing to modify the pre-existing
forecasting model, but instead formulating an independent additive or
complementary model that captures the structure the existing operational
model may be missing. We present here the application of this principle for
issuing improved hourly inflow forecasts into hydropower reservoirs over
extended lead times, and the parameter estimation procedure reformulated to
deal with bias, persistence and heteroscedasticity. The procedure presented
comprises an error model added on top of an unalterable constant parameter
conceptual model. This procedure is applied in the 207 km

Hydrologic models can deliver information useful for management of natural resources and natural hazards (Beven, 2009). They are important components of hydropower planning and operation schemes where it is essential to estimate future reservoir inflows and quantify the water available for power production on a daily basis. The identification and representation of the significant responses of hydrologic systems have been diverse among hydrologists. Different hydrologists have incorporated their perceptions of the functioning of hydrologic systems into their models and come up with several rival models; some of them process based and others data based (for thorough reviews of the historic development of hydrologic modelling refer to Todini, 2007 and Beven, 2012). These models can be grouped into two main classes, conceptual and data-driven models.

Lumped conceptual hydrologic models are the most commonly used models in operational forecasting. Models of this class use sets of mathematical expressions to provide a simplified generalization of the complex natural processes of the hydrologic systems in the headwater areas of reservoirs. Application of such models conventionally requires estimating the model parameters by conditioning them to observed hydrologic data. Unlike conceptual models, data-driven models establish a mathematical relationship between input and output data without any explicit attempt to represent the physical processes of the hydrologic system. Reconciling the two modelling approaches and combining the advantages of both (Todini, 2007) has produced some example applications in forecasting systems where the two modelling approaches are harmoniously used for improving the reliability of hydrologic model outputs (e.g. Abebe and Price, 2003; Solomatine and Shrestha, 2009).

The usefulness of a model for operational prediction is determined by how accurately it reproduces the observed hydrologic behaviour of the study area. In operational applications, evaluating how well the models capture rainfall–runoff processes, especially the snow accumulation and melting process in cold regions, is important because the extent to which the models accurately reproduce the reservoir inflows can significantly influence the efficiency of the hydropower reservoir operation and subsequently the power price. Application of hydrologic models for reproducing historic records can suffer from inadequacy in model structure, incorrect model parameters, or erroneous data. Nevertheless, despite failing to reproduce the observed hydrographs exactly, they enable simulation of the hydrologic characteristics of a study catchment to a fair degree of accuracy. It gets more challenging when using the models in the operational set-up for forecasting the unknown future based only on the known past, which the model might not capture accurately. In the context of the Norwegian hydropower systems, being unable to predict future reservoir inflows accurately has negative consequences for the power producers. Norway's energy producers have to pledge the amount of energy they will produce for the next 24 h in the day-ahead market, and if they are unable to provide the pledged amount of energy, the chance of incurring losses is very high. Estimation of future reservoir inflows (be it long- or short-term) involves estimating the actual (initial) state of the basin, forecasting the basin inputs during the lead time, and describing the water movement during the lead time (Moll, 1983). Hence, the quality of a hydrologic forecast depends on the accuracy achieved and methodology selected in implementing each of these aspects.

In this study, we intend to use conceptual and data-driven models complementarily. A conceptual model with calibrated model parameters is used as the fundamental model that approximately captures dominant hydrologic processes and forecasts the behaviour of the catchment deterministically. A data-driven model is then formulated on the residuals, the difference between observations and predictions from the conceptual model. By studying the whole set of residuals and exploring the information they contain, important information that describes the inadequacies of the conceptual model can be extracted. In general, this kind of information can be used for improving either the conceptual model itself or the prediction skill of a forecasting system. Emulating the practice in most Norwegian hydropower reservoir operators, we stick to the latter purpose with the aim of enhancing the performance of a hydropower reservoir inflow forecasting system. According to Kachroo (1992), data-driven models defined on the residuals from a conceptual model can expose whether the conceptual model is adequate to identify essential relationships exhibited in the input–output data series. Data-driven models can establish the mathematical relationship that describes the persistence revealed in the residual time series, which is caused by failure of the conceptual model to capture all the physical processes exactly. Thus, in the operational sense, the data-driven models can play a complementary role by adjusting output of the conceptual model whenever the conceptual model needs corrective adaptation (e.g. Serban and Askew, 1991; World Meteorological Organization, 1992).

Several example applications can be found in the scientific literature on using conceptual and data-driven models complementarily. For instance, Toth et al. (1999) compared the performance improvements that six autoregressive integrated moving average (ARIMA)-based error models brought to streamflow forecasts from a conceptual model, in order to identify the best error model and data requirements. Shamseldin and O'Connor (2001) coupled a multi-layer neural network model on top of a conceptual rainfall–runoff model to improve the accuracy of streamflow forecasts without interfering with the operation of the conceptual model. Similarly, Madsen and Skotner (2005) developed a procedure for improving operational flood forecasts by combining error models (linear and non-linear) and a general filtering technique. Xiong and O'Connor (2002) investigated the performance of four error-forecast models, namely, the single autoregressive, the autoregressive threshold, the fuzzy autoregressive threshold and the artificial neural network updating models, for improving real-time flow forecasts and compared their results. Likewise, Goswami et al. (2005) examined the forecasting skill of eight error-modelling-based updating methods. A recent review on the application of error models and other data assimilation approaches for updating flow forecasts from conceptual models can be found in Liu et al. (2012).

As reviewed above, the principle of complementing conceptual models with data-driven models has enjoyed applications in real-time hydrologic forecasting since the 1990s. The methodological contribution of the present work is a reformulation of the parameter estimation procedure for the data-based model. We recognize that the bias, persistence and heteroscedasticity seen in the residuals from the conceptual model reflect structural inadequacy of the conceptual model to capture the catchment processes and, hence, are important in defining how the residual series is dealt with. Accordingly, we describe the reservoir inflows in a transformed space and present an iterative algorithm for estimating parameters of the data-driven model and the transformation parameters jointly.

Two main features distinguish application aspects of the present paper from previously published work built on the same concept of complementing conceptual models with data-driven models. First, it attempts to provide hourly reservoir inflows of improved accuracy 24 h ahead. The earlier papers mainly succeeded in improving forecasts for forecast lead times up to six time steps or incorporated a scheme to update the forecast system at an interval of six time steps. Second, an attempt is made in what follows to produce a probabilistic forecast by estimating the uncertainty of the error model, rather than only the deterministic estimate. This enables forecasting an ensemble of reservoir inflows, thereby allowing a risk-based paradigm for hydropower generation to be put to use. Reasons as to why hydrologic forecasts should be probabilistic and the potential benefits therein are presented and explained in Krzysztofowicz (2001). Krzysztofowicz (1999) described a methodology for probabilistic forecasting via a deterministic hydrologic model. Li et al. (2013) presented a review of scientific papers that provide various regression and probabilistic approaches for assessing performance of hydrologic models during calibration and uncertainty assessment. Smith et al. (2012) provide a good example of producing probabilistic forecasts based on deterministic forecast outputs. In this paper, the improvement levels achieved are evaluated deterministically using the same or similar metrics as past studies, and probabilistically using (i) the containing ratio (Xiong et al., 2009), which is also referred to as reliability score (e.g. Renard et al., 2010) and (ii) the probability integral transform (PIT) plot. The PIT plot is similar to the predictive Q–Q plot (e.g. Thyer et al., 2009) but assesses, in terms of the percentiles, how close a continuous random variable transformed by its own cumulative distribution function (cdf) is to a uniform distribution.
We emphasise here that taking into account uncertainties emanating from various recognized sources and describing the degree of reliability of the inflow forecasts has important benefits. According to Montanari and Brath (2004), the Bayesian forecasting system (BFS) and the generalized likelihood uncertainty estimation (GLUE) are the popular methods for inferring the uncertainty in hydrologic modelling. Yet, the scope of producing probabilistic inflow forecasts in this study is limited to attaching a certain probability to the deterministic forecasts, which are common in the Norwegian hydropower industry, based on analysis of the statistical properties of the error series from the conceptual model, and assessing its degree of reliability.

In the next section, the complementary model set-up is formulated and the performance evaluation criteria are provided. An example application is presented in the subsequent section. This includes a description of the study area and data used, findings from the evaluation of the complementary set-up and its components during calibration and validation, and results of forecasting skill assessment using deterministic and reliability metrics. Finally, concluding remarks are provided.

The widely applied conceptual hydrologic model, HBV (Hydrologiska Byråns Vattenbalansavdelning) (Bergström,
1995), is used in this study. The version used allows for dividing the study catchment
up into 10 elevation zones. A deterministic HBV model with already calibrated
model parameter values was assumed to take the role of the operational
hydrologic models Norwegian hydropower companies commonly use for
forecasting reservoir inflows. In the operational set-up, the air temperature
and precipitation input over the forecast lead time are obtained from the
Norwegian Meteorological Institute (

The error model aims at exploiting the bias, persistence and heteroscedasticity in the residuals and estimating the errors likely to occur in the forecast lead time. Forecasting the error in the lead time is regarded as a two-step process: offline identification and estimation of the error model, and error predictions based on most recent information.

An error model that captures the structure the process model is missing should lead to a zero-mean homoscedastic residual series from the modelling framework. In order to identify the right structure and establish a parsimonious model that adequately describes the data, we diagnose the residuals and address the bias, persistence and heteroscedasticity the series might exhibit as follows.

First and foremost, we transform the observed (

The discrepancy (

In order to provide improved hourly reservoir inflow forecasts over a
24 h lead time, the error-forecasting model takes the form of Eq. (3). In
order to overcome lack of observed residuals encountered for forecast
lead time (

Parameters of the AR model can be set to the corresponding Yule–Walker
estimates of

Values of

The residuals series from the transformed inflow data are calculated (

Perform an optimization for the error-model parameters (

Adjust (

In addition to visual evaluation of the hydrographs, performance of the
present procedure is robustly analysed using deterministic and reliability
metrics. The root mean square error (RMSE), relative error (RE) and the
Nash–Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) are employed to
evaluate efficiency of the models during calibration and validation
deterministically. Evaluations are made with respect to varying forecast
lead times and season-wise as well. Among the three statistical performance
criteria, the RE (Eq. 5) measures the relative error between the total
observed and predicted inflow volume. For a good simulation the value of RE
is expected to be close to zero. Quantifying the relative error (RE) of the
simulations/forecasts is important because it indicates how the inaccuracies
affect a hydropower company's ability to deliver the amount of energy it has
pledged to provide to the energy market. Therefore, special attention is
given to the less aggregate version of RE, which we refer to as
percentage volume error (hereafter PVE) and describe as follows.
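
As a rough illustration of the three aggregate criteria named above (RMSE, RE and NSE), the following minimal sketch computes them for two short series. The function names, the sample values and the sign convention for RE (positive meaning the predicted volume falls short of the observed volume) are assumptions made for illustration; the paper's own equations are not reproduced here, and the PVE is deliberately left out.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observed and predicted inflows."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)


def relative_error(obs, pred):
    """Relative error between total observed and predicted volume, in percent.

    Assumed sign convention: positive values indicate underestimation
    of the total inflow volume.
    """
    total_obs = sum(obs)
    return 100.0 * (total_obs - sum(pred)) / total_obs


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - (squared error sum / observed variance sum)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


# Hypothetical hourly inflows (mm), for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 2.0, 3.0]
print(rmse(observed, predicted), relative_error(observed, predicted),
      nse(observed, predicted))
```

An NSE of 1 and an RE of 0 would correspond to a perfect match; an NSE below 0 would mean the mean of the observations predicts better than the model.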

Location, characteristics and potential evapotranspiration estimates of the study catchment.

Observed and predicted reservoir inflow hydrographs during calibration (left-column panels) and validation (right-column panels) of the conceptual model.

Another useful way of assessing the forecasting skill of the
complementary set-up is through uncertainty analysis. An interval forecast
(Chatfield, 2000) can be constructed by specifying an upper and lower limit
between which the future reservoir inflow is expected to lie with a certain
probability (1

The Krinsvatn catchment is located in Nord Trøndelag County in mid-north
Norway. It comprises an area of 207 km

Observed hourly data of 11 water years (September 2000 to August 2011) were split into three sets used for warming-up (2000), calibrating (2001–2005) and validating (2006–2010) the conceptual and the error models alike. Observed precipitation and temperature data of two meteorological stations (i.e. Svar-Sliper and Mørre-Breivoll) in neighbouring catchments are used. Discharge data for the catchment are derived from water level records at the Krinsvatn gauge station. Romanowicz et al. (2006) outline the advantages of direct use of water-level information in hydrologic forecasting. Rating curve uncertainties and their influence on the accuracy of flood predictions have been very well documented (e.g. Sikorska et al., 2013; Aronica et al., 2006; Pappenberger et al., 2006; Petersen-Overleir et al., 2009). Krinsvatn is considered a stable discharge measurement site with few external influences, and the rating curve was updated in 2004. This study, however, considers the uncertainty of the rating curve to be one of the factors contributing to the total error expressed in Eq. (2) and does not address it separately.

The catchment is divided into 10 elevation zones in the HBV model set-up. Input data used are hourly areal precipitation, air temperature, and potential evapotranspiration. The model is run on an hourly time step for the water years 2000 to 2005 with the last 5 water years being used for model calibration. Calibration is carried out using the shuffled complex evolution algorithm (Duan et al., 1993), with the NSE between the observed and predicted flows as an objective function. A description of the model parameters along with the corresponding optimized values is provided in Table 1.

Model parameters and corresponding optimized values.

Summary of overall and seasonal performance of the conceptual model during the calibration (September 2001 to August 2005) and validation (September 2006 to August 2011) periods.

The simulation and observed reservoir inflow hydrographs shown in Fig. 2
indicate a certain level of agreement for most of the calibration and
validation periods, which the statistical evaluations (Table 2) agree with.
The overall hourly reservoir inflow predictions during calibration and
validation show efficiency of NSE

PVE counts of the six PVE classes (i.e.

Stacked-column plots of (1) PVE counts of the six absolute PVE
classes (

Details of the extent to which the reservoir inflows are under- and
overestimated can be seen in Fig. 3c and d. The fraction of time the
simulated inflows exhibited under- and overestimation during calibration is
51.9 and 46.8 %, respectively. In the validation period, the reservoir
inflows are underestimated about 65.6 % of the time compared to
overestimation in 33.4 % of the time. This is also revealed in the
findings from statistical metrics in Table 2, which disclose the bias in the
model. Yet, the results in Fig. 3 further reveal that the model predictions
deviate from the observations at high discharges. For example, during the
validation period 59.2 % of the time observations exceeded the
predictions by magnitudes of more than 10 %. Such information is useful
because direct evaluation of observed and predicted values explains the
implications of model performance on the planning and operation of a
hydropower system better than an aggregated variance-based statistic. From
an operational management point of view, considerable underestimation of
reservoir inflows can have both short-term and long-term effects on the
operation of a hydropower system. In the short-term, the company could be
forced to release unvalued water especially when the reservoir water level
is close to its maximum capacity. Hence, the high percentage of
underestimations that occur in the autumn and spring seasons (during
calibration and validation) should not be tolerated because the inflows in
the autumn and spring seasons are very important. On the other hand,
substantial overestimation of reservoir inflows can at least expose any
Norwegian hydropower company to undesirable expenses due to obligations to
match the power supply it has failed to deliver by dealing with other
producers in the intra-day physical market (Elbas). Although overestimation
does not seem to be a pertinent issue, Fig. 3d unmasks that the inflows are
overestimated by a magnitude

Following the example of Xu (2001), a Kolmogorov–Smirnov test is applied to
residuals of the conceptual model. The test revealed that the residuals are
not normally distributed. The maximum deviation between the theoretical and
the sample lines is 0.130, which is larger than the Kolmogorov–Smirnov critical
value of 0.008 at significance level

Plots of

Whether the residual series is homoscedastic is diagnosed visually by plotting the residuals versus the predicted reservoir inflows (Fig. 4a). With respect to the horizontal axis, the scattergram does not remain symmetric for the entire range of predicted inflows. The residuals show high variability and possible systematic bias when inflows are less than 3.5 mm while the opposite is true when the inflows exceed 3.5 mm. Inflows of magnitudes between 3.5 and 5.5 mm seem to be underestimated, while overestimation is visible when the inflow rates are greater than 5.5 mm. However, as can be seen from Fig. 2, inflows of magnitude up to 3 mm represent reservoir inflows during the rise of the hydrographs including all peak inflows for all hydrologic years except 2005 and 2010. Hence, except for the possible systematic bias during low flows, the inference from the scatter plot is inconclusive to support or dismiss the issue of predominant underestimation revealed in the model performance evaluation. Moreover, hourly inflows of magnitudes higher than 3 mm are rare and occurred about 0.1 % of the time over the calibration and validation period.

Plots of autocorrelation and partial autocorrelation functions of the residual time series (Fig. 4b and c) indicate a strong time persistence structure in the error series. Rapid decaying of the partial autocorrelation function confirms the dominance of an autoregressive process, which the gradually decaying pattern of the autocorrelation function also suggests. Thus, in order to obtain a Gaussian series, it is important to address issues of heteroscedasticity and serial correlation in the residual series. As the current study aims at utilising the persistent structure in the residuals for supplementing the forecasting system, the corrective action to be taken only aims at removing the heteroscedasticity. A successful way to do it is through transformation of the flow data (e.g. Engeland et al., 2005). As outlined in the methodology section, the reservoir inflows (both observed and predicted) are transformed while estimating parameters of the error model.
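
The diagnosis described above can be sketched numerically. The following minimal example estimates the sample autocovariance of a residual series and applies the Levinson-Durbin recursion, which yields both the Yule–Walker AR coefficients and the partial autocorrelations. It is an illustrative sketch only: the function names are hypothetical, it works on untransformed residuals, and it is not the joint estimation procedure used in this study.

```python
def autocovariance(x, max_lag):
    """Biased sample autocovariance of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(acov, order):
    """Solve the Yule-Walker equations recursively.

    Returns (phi, pacf, noise_var), where phi[j - 1] is the AR coefficient
    for lag j, pacf[k - 1] is the partial autocorrelation at lag k, and
    noise_var is the innovation variance of the fitted AR(order) model.
    """
    phi, pacf = [], []
    var = acov[0]
    for k in range(1, order + 1):
        # Reflection coefficient = partial autocorrelation at lag k.
        acc = acov[k] - sum(phi[j] * acov[k - 1 - j] for j in range(k - 1))
        kappa = acc / var
        # Update the lower-order coefficients, then append the new one.
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        pacf.append(kappa)
        var *= 1.0 - kappa ** 2
    return phi, pacf, var


# Theoretical autocovariances of an AR(1) process with coefficient 0.8
# (unit variance); the lag-2 partial autocorrelation should vanish.
phi, pacf, noise_var = levinson_durbin([1.0, 0.8, 0.64], order=2)
print(phi, pacf, noise_var)
```

A partial autocorrelation that drops to near zero after lag p, as in this toy case after lag 1, is the pattern that points to an AR(p) structure.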

In accordance with the findings from the ACF and PACF plots discussed in
Sect. 3.3.2, AR models of up to an order of

Stacked-column plots of

Calibration efficiencies calculated for the error model using the RMSE, RE and
NSE metrics are 0.096,

Imitating operational application of forecasting models in the Norwegian hydropower system, reservoir inflows for the day-ahead market (Elspot) are estimated using the presented forecasting system. The system has to run once a day at an hourly time step, some time before 12:00 LT after retrieving the latest observations, and the inflow forecasts are issued for the next 24 hourly time steps beginning from 12:00 LT. The overall performance of the complementary model in forecasting the reservoir inflows during the calibration and validation periods is discussed first, followed by an evaluation of its forecasting skill with respect to forecast lead times. Evaluation of the forecast skill presented in this paper is based on assessment of forecasts made for the period between September 2006 and August 2011, as the data sets from September 2000 to August 2006 are used for calibrating the system.
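
The daily forecasting cycle described above can be illustrated with a minimal sketch. It assumes that the error is defined as observation minus conceptual prediction, that an AR(p) model with known coefficients has already been identified, and that errors beyond the last observation are forecast recursively from earlier forecasts. The function names and numbers are hypothetical, and the flow transformation used in this study is omitted for brevity.

```python
def forecast_errors(recent_errors, phi, lead_time):
    """Recursively forecast AR(p) errors over the lead time.

    recent_errors: latest observed errors, oldest first (length >= p).
    phi: AR coefficients, phi[0] applying to the most recent error.
    """
    history = list(recent_errors)
    forecasts = []
    for _ in range(lead_time):
        nxt = sum(c * history[-1 - j] for j, c in enumerate(phi))
        forecasts.append(nxt)
        history.append(nxt)  # forecast errors stand in for unobserved ones
    return forecasts


def complementary_forecast(conceptual, recent_errors, phi):
    """Add the forecast errors to the conceptual-model inflow forecasts."""
    errors = forecast_errors(recent_errors, phi, len(conceptual))
    return [q + e for q, e in zip(conceptual, errors)]


# Hypothetical 3 h example: AR(1) with coefficient 0.5, last observed error 0.4 mm.
print(complementary_forecast([1.0, 1.0, 1.0], [0.4], [0.5]))
```

Because each step multiplies the previous error forecast by coefficients of a stationary AR model, the correction decays towards zero with lead time, which is consistent with the diminishing improvement at longer lead times discussed later in the paper.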

Ratio between occurrence frequency of low PVE (

Summary of relative seasonal RMSE reductions as a function of forecast lead time (minimum, mean and maximum values computed from corresponding computations for the hydrologic years 2006–2010).

Assessment of the overall forecasting skill of the complementary set-up shows significant improvement in forecast accuracy. The RMSE and NSE statistical criteria computed between forecasted and observed inflows are 0.095 and 0.896, respectively. RMSE values for the autumn, winter, spring and summer forecasts are 0.094, 0.090, 0.132 and 0.044, respectively, and the corresponding NSE values are 0.904, 0.905, 0.859 and 0.873.

Demonstrating the capability of the complementary set-up to reduce the bias revealed in the simulated forecasts from the conceptual model, which was pointed out in the previous section, the 24 h lead-time forecasts exhibited a low-level underestimation bias with RE equal to 3.8 %. The degree of bias in the inflow forecasts differed seasonally. The RE computed for each season, in decreasing order, is summer (10.2 %), spring (4.6 %), autumn (2.9 %) and winter (0.7 %). The relatively higher bias in the spring and autumn forecasts can be related to runoff generation in the Krinsvatn catchment due to snowmelt or occurrence of precipitation in the form of rainfall, which can affect the persistence structure in the residual series obtained from the conceptual model.

Stacked-column plots in Fig. 5 display the occurrence level of each of the
six PVE classes in the residual series between forecasts and observations.
Visual comparison of stacked-column plots of Fig. 5 and Fig. 3 shows
reduction in PVE count of the high PVE classes and increase in PVE counts of
low PVE classes; e.g. PVE count for the PVE class

Relative reductions in RMSE between forecasts from the complementary set-up and the simulated forecasts from the conceptual model are computed. Detailed results for each season of the hydrologic years between 2006 and 2010 are presented in Table 4. The results are also summarized in terms of the minimum, mean and maximum relative RMSE reduction as shown in Fig. 6. Excluding forecasts in autumn and winter seasons of the 2006 water year, relative RMSE reductions are observed in forecasts of short and long lead times. As expected, in all four seasons, the achieved level of improvement in forecast accuracy is high for short lead times and diminishes gradually with increased lead time. Results show that the accuracy of the reservoir inflow forecasts in the spring and summer seasons is improved over the entire range of the forecast lead time. Likewise, reduction in RMSE is observed for all autumn and winter inflow forecasts except for the water years 2006 and 2007, respectively.
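
For concreteness, the relative RMSE reduction can be sketched as the percentage decrease of the complementary set-up's RMSE relative to that of the conceptual model. This exact form is an assumption, since the defining equation is not given in the text, and the function name and values are hypothetical.

```python
def relative_rmse_reduction(rmse_conceptual, rmse_complementary):
    """Percentage RMSE reduction achieved by the complementary set-up.

    Assumed form: 100 * (RMSE_conceptual - RMSE_complementary) / RMSE_conceptual.
    Positive values mean the error model improved the forecast; negative
    values mean the correction was counterproductive.
    """
    return 100.0 * (rmse_conceptual - rmse_complementary) / rmse_conceptual


# Hypothetical RMSE values (mm) at a short and a long lead time.
print(relative_rmse_reduction(0.20, 0.10))  # improvement
print(relative_rmse_reduction(0.10, 0.12))  # deterioration (negative)
```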

In order to get insight on the improvement level in a unit directly related
to hydropower production, the change in PVE count of each PVE class is
calculated. Change in PVE count of a given absolute PVE class is the
difference between the PVE counts for the complementary set-up and that for
the conceptual model. The results are summarized as shown in Fig. 7. The
figure shows that the PVE counts of high-magnitude absolute PVE classes are
reduced and the opposite is true for those of the smaller absolute PVE
classes. For instance, regardless of the type of discrepancy (under- or
overestimation) noted, the change in PVE counts of the absolute PVE of the
class

Relative RMSE reductions (%) in reservoir inflows forecast as a function of forecast lead time.

Change in number of occurrence of the six absolute PVE classes
(

Observed hydrograph (broken lines) and the forecasted 95 % confidence interval.

Calculation of the relative RMSE reduction and the change in PVE counts agree that the forecast accuracy is improved through the complementary set-up. The assessments further revealed that the degree of improvement weakens with increased forecast lead time. However, the relative RMSE reduction computations indicate that on some occasions the simulated inflow forecasts turn out to be better. The relative RMSE reduction values for lead times longer than 20 h (Table 4) show that complementing the conceptual model with an error model is counterproductive in autumn and winter seasons of the water years 2007 and 2006, respectively.

Computation of the CR for the entire forecast reveals that 95.8 % of the observations are inside the 95 % prediction interval. The inflow hydrographs (Fig. 8) confirm that most of the observed inflows are contained in the specified uncertainty bounds.
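
The containing ratio is simply the share of observations that fall inside the prediction interval. The sketch below pairs it with a symmetric Gaussian interval of constant standard deviation; that interval construction, the function names and the numbers are assumptions for illustration only and do not reproduce the transformed-space bounds used in this study.

```python
def gaussian_interval(forecasts, sigma, z=1.96):
    """Symmetric interval forecast +/- z * sigma (z = 1.96 for about 95 %).

    Assumes Gaussian forecast errors with constant standard deviation sigma.
    """
    lower = [f - z * sigma for f in forecasts]
    upper = [f + z * sigma for f in forecasts]
    return lower, upper


def containing_ratio(obs, lower, upper):
    """Percentage of observations bracketed by the prediction interval."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return 100.0 * inside / len(obs)


# Hypothetical forecasts and observations (mm).
forecasts = [1.0, 1.0, 1.0, 1.0]
observations = [1.1, 0.9, 1.5, 1.0]
lower, upper = gaussian_interval(forecasts, sigma=0.2)
print(containing_ratio(observations, lower, upper))
```

For a well-calibrated 95 % interval the ratio should be close to 95 %. Values well above that suggest bounds that are too wide, and values below it suggest bounds that are too narrow.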

The percentage of observation points falling within the forecasted 95 % confidence interval varies from season to season and across hydrologic years (see Fig. 9a). All observed winter and summer inflows are bracketed in the 95 % uncertainty bound at least 95 % of the time. In general, the winter season is more of a snow accumulation period, and a closer inspection of the hydrographs (see Fig. 8) reveals that the summer hydrographs cover the recession and base flow portions of the annual hydrographs. Thus, these seasons exhibit a better persistence structure and more predictable discrepancies between simulated forecasts from the conceptual model and the observations. As Goswami et al. (2005) argued, the persistence structure in residual series primarily arises from the dynamic storage effects of a catchment system.

The desired percentage of autumn observations is contained in the 95 %
prediction interval in the years 2006, 2008 and 2010. In the years 2007 and 2009, however, only 93.2 and 93.8 % of the observed autumn
inflows are bracketed in the estimated 95 % prediction intervals,
respectively. Reliability score (CR) calculations for the spring season
indicate that the percentage of observation points falling in the desired
prediction interval is below 95 % in the hydrologic years
2009 and 2010 (i.e. 93.8 and 89.2 %, respectively). Unlike winter
and summer inflows, autumn and spring flows mostly cover portions of the
hydrograph corresponding to the rising limb or high-flow regime (see
Fig. 8). While physical factors contributing to the increase in quick flow into
the reservoir are precipitation incidents (in the form of rainfall) and
melting of snow in the headwaters, comprehension of this concept and its
encapsulation into the HBV model leaves control of the catchment response to
two threshold values (TX and TS; see Table 1 for description). Employing
such simple threshold values to govern initiation of the runoff generation
process based on air temperature measurement at a given time step obviously
involves more sources of uncertainty (i.e. measurement, model structure and
model parameters). For instance, we assume the input air temperature at a
given time step is erroneously recorded to be higher than TX and/or TS due
to measurement error. Subsequently, the model will partition the
precipitation as rainfall and initiate melting of snow, which the
observation does not reveal. This kind of misclassification of precipitation
and/or misrepresentation of snow accumulation and melting processes can
simply occur due to the error in the input temperature record. Because of
this, the persistence in the errors between simulated forecasts from the
conceptual model and the observations can get weaker. According to Goswami
et al. (2005), some degree of persistence in the model input (i.e. rainfall)
is another primary source of the persistence characteristic of observed flow
series. Even though the least CR calculated for the autumn and spring seasons
are by no means too bad (i.e.

The fraction of observed inflows bounded within the estimated prediction interval decreases with increased lead time (Fig. 9b). The reliability scores for all 24 forecast lead times approximately fulfil the requirement of containing 95 % of the observations. For lead times beyond 19 h, the exact CR values are slightly lower than 95 %, with a minimum of 94.8 % at a forecast lead time of 24 h.

Summary of seasonal containing ratio (95 % prediction interval) during reservoir inflow forecasting (September 2006 to August 2011)

Reliability score (containing ratio CR) for 95 % prediction
interval for

Findings from evaluation of the forecast skill of the complementary set-up using deterministic and probabilistic metrics support each other. The present procedure is able to improve the accuracy of reservoir inflow forecasts, and the level of improvement decreases as the forecast lead time increases. Deterministic evaluation of the performance of the forecast system indicates that the concept of complementing the conceptual model with a simple error model is not always effective. As discussed earlier, on some occasions the present method can become counterproductive in forecasting inflows when the forecast lead time is beyond 20 h. Similarly, detailed assessment of the reliability (Table 5) shows that the CR of the forecasting system can get below 95 % at forecast lead times less than 17 h; e.g. at a forecast lead time of 9 h, only 89 % of the observed spring inflows of the 2006 water year are bracketed in the 95 % prediction interval. It can also be noted that for shorter forecast lead times, the percentage of observations contained in the prediction bounds exceeds 95 %. Although a greater proportion of observations falling in the prediction bound is desirable, a high CR at short forecast lead times might indicate too wide a bandwidth. This, along with a CR that declines with increased lead time, might suggest invalidity of the assumptions behind computation of the bounds (e.g. Smith et al., 2012). The two issues at stake here are the Gaussian assumption on the basis of which the prediction bounds were constructed, and the model identification and parameter estimation approach implemented. In order to assess the former, we conducted the PIT uniformity probability test.
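
A minimal sketch of how PIT values and the points of a PIT plot could be computed is given here, assuming a Gaussian predictive distribution with a given mean and standard deviation at each time step. The function names and the plotting position i/(n + 1) are illustrative assumptions and not necessarily those used in this study.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def pit_values(obs, means, stds):
    """Evaluate each predictive cdf at the corresponding observation."""
    return [normal_cdf(o, m, s) for o, m, s in zip(obs, means, stds)]


def pit_plot_points(pit):
    """Pairs (theoretical uniform quantile, sorted PIT value).

    Points close to the 1:1 line indicate a predictive distribution
    that is consistent with the observations.
    """
    n = len(pit)
    return [((i + 1) / (n + 1), p) for i, p in enumerate(sorted(pit))]


def max_deviation(points):
    """Largest vertical distance between the PIT points and the 1:1 line."""
    return max(abs(u - p) for u, p in points)


# Hypothetical observations with unit-variance predictive distributions.
pit = pit_values([0.0, 1.0, -1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
print(pit_plot_points(pit), max_deviation(pit_plot_points(pit)))
```

PIT points that form an S-shape around the 1:1 line generally indicate predictive bounds that are too wide or too narrow, while a consistent offset to one side indicates bias.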

From an operational hydrology point of view, we concur with the opinion of
Thyer et al. (2009) that the toughest goodness-of-fit test the complementary
framework has to pass is whether the predictive distribution is consistent
with the observed inflow, which the PIT uniform probability plots (PIT
plots) evaluate directly. This involves deriving at each time step the

Comparison of the transformed

The parameter (AR model coefficient(s) and transformation parameters) estimation technique we employed (Sect. 2.2.2) follows a pseudo multi-objective optimization approach, which includes minimizing the sum of squares of the residuals and ensuring a homoscedastic residual series. We first employed the least squares (LS) method to estimate the parameters associated with several AR models (of orders 1 to 3). Since the unit of the inflows (and hence of the errors) in the transformed space depended on the transformation parameters, and the inclusion of the transformation parameters in the calibration problem made it difficult to identify the optimal model among the candidate AR models, we resorted to the dimensionless KS statistic. The KS metric served as a relative quantitative measure to discriminate between candidate models by measuring how close to constant the residual variances are. As a result, the selected AR model is suboptimal in terms of yielding the least discordance between predictions and observations. Putting aside the issue of the (in)validity of the Gaussian assumption, we demonstrate that the shortcomings that the probabilistic metrics revealed in the present LS- and KS-based model, which we refer to as the LS–KS model, are not unique to the implemented parameter estimation approach. In order to verify this, we set up an AR model whose coefficients and transformation parameters were estimated by Gaussian maximum likelihood (GML).
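The least squares step for a single candidate AR order can be sketched as below. This is a simplified illustration only: it assumes the residual series has already been transformed and bias-corrected, it omits the transformation parameters and the KS-based selection, and the function names and the synthetic series are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ar_least_squares(errors, order):
    """AR coefficients [phi_1, ..., phi_p] that minimize the sum of squared
    one-step-ahead residuals of e[t] = sum_k phi_k * e[t - k]."""
    if len(errors) <= order:
        raise ValueError("series too short for the requested order")
    rows = [[errors[t - k] for k in range(1, order + 1)]
            for t in range(order, len(errors))]
    targets = errors[order:]
    # Normal equations (X^T X) phi = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(order)]
           for i in range(order)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(order)]
    return solve_linear(xtx, xty)


# Synthetic noise-free AR(2) series (hypothetical coefficients 0.6 and 0.2).
series = [1.0, 0.5]
for _ in range(28):
    series.append(0.6 * series[-1] + 0.2 * series[-2])

phi = fit_ar_least_squares(series, order=2)
print([round(c, 6) for c in phi])
```

In a fuller set-up, this fit would be repeated for each candidate order and each trial set of transformation parameters, with a dimensionless criterion used to compare the candidates.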

An AR(2) model was identified with coefficients and transformation
parameters:

In the present study, a forecasting system comprising additively set-up conceptual and simple error models is presented. Parameters of the conceptual model were left unaltered, as they are in most operational set-ups, and the data-driven model was arranged to forecast the corrections to be made to the outputs of the conceptual model in order to provide more accurate inflow forecasts into hydropower reservoirs several hours ahead.
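The additive principle can be illustrated with the minimal sketch below. It assumes that the error is defined as observed minus simulated inflow, that an AR model with known coefficients describes the errors, and that no transformation or bias correction is needed. These simplifications, the function names and the numbers are assumptions made for the example.

```python
def forecast_errors(last_errors, phi, horizon):
    """Iterate an AR(p) model forward over the forecast horizon.

    last_errors is ordered oldest to newest and must hold at least p values.
    """
    if len(last_errors) < len(phi):
        raise ValueError("need at least as many past errors as AR coefficients")
    history = list(last_errors)
    forecasts = []
    for _ in range(horizon):
        # phi[0] multiplies the most recent error, phi[1] the one before, ...
        nxt = sum(c * history[-k] for k, c in enumerate(phi, start=1))
        history.append(nxt)
        forecasts.append(nxt)
    return forecasts


def complementary_forecast(simulated, last_errors, phi):
    """Conceptual-model output plus the forecast error, lead time by lead time."""
    errors = forecast_errors(last_errors, phi, len(simulated))
    return [s + e for s, e in zip(simulated, errors)]


# Hypothetical conceptual-model inflows for the next 3 h and the latest
# two observed errors (oldest first), with assumed AR(2) coefficients.
simulated = [10.0, 10.0, 10.0]
print(complementary_forecast(simulated, [4.0, 2.0], [0.5, 0.25]))
# [12.0, 11.5, 11.25]
```

For a stationary AR model the forecast correction shrinks towards zero as the lead time grows, which is consistent with an improvement that diminishes at longer lead times.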

Application to the Krinsvatn catchment revealed that the present procedure could effectively improve forecast accuracy over a 24 h lead time. This demonstrates that the efficiency of a flow forecasting system can be enhanced by setting up a data-driven model to complement a conceptual model operating in simulation mode. Furthermore, the current study reveals that analysing the characteristics of the residuals from the conceptual model is important, and that heteroscedastic behaviour should be addressed before identifying the error model and estimating its parameters. Compared to past studies that applied data-driven and conceptual models in a complementary way, the present procedure is successful in providing acceptably accurate forecasts for extended lead times. It also outlines a procedure for extracting useful information from the bias, the persistence and the heteroscedasticity that the residual series from the conceptual model exhibited, although the assumption that the residuals from the modelling framework are random failed to hold.

Results also indicate that probabilistic forecasts can be obtained from deterministic models by constructing the uncertainty of the complementary set-up from the predictive uncertainty of the simple error model. The uncertainty bound seems to satisfy the reliability requirement of containing about 95 % of the observations in the prediction interval when evaluated over the entire forecasting period. Its reliability with respect to forecast lead time also appears satisfactory for all 24 forecast lead times in terms of containing the desired percentage of observations. Nevertheless, detailed assessment revealed that the degree of reliability of the forecasts varies from season to season and from one hydrologic year to another. Given that the error model essentially makes use of the persistence structure in the residuals from the conceptual model, the present procedure seems to be unable to capture transitions in the hydrograph errors from over- to underestimation (and vice versa). On the one hand, the degree of reliability of the forecasts was found to decline with longer lead times, and the deterministic metrics (RMSE and PVE) confirmed the same. On the other hand, reliability assessment using the PIT plots revealed that, regardless of season and lead time, the uncertainty bands appear to be somewhat wider than they should be. The PIT plots highlighted the challenge associated with forecasting confidence intervals using the LRVE or similar methods, which estimate the model error variance from the historical residuals.

In order to address these challenges, a future development could be to explore methodologies for handling seasonal variability in the structure of the residual series. Updating the error models periodically could be one solution, but care must be taken if the selected updating method makes a Gaussian assumption. Another alternative would be to explore more complex stochastic models for the residuals that use exogenous predictor variables, either observed directly (much like the seasonal reservoir inflow forecasting models described in Sharma et al., 2000) or obtained as state variables simulated by the conceptual model (like the Hierarchical Mixtures of Experts framework in Marshall et al., 2006 and Jeremiah et al., 2013). Formulation of these models will also offer better insight into the deficiencies that exist within the HBV conceptual model, thereby allowing further improvement to reduce the structural errors present. A subsequent study (Gragne et al., 2015) attempts to address some of these issues using a filter updating procedure, which periodically assimilates inflow measurements into the error-forecasting model, and explores the potential of a data assimilation technique for improving model forecast accuracy and constraining forecast uncertainty without significant computational costs.

Another interesting topic for future investigation is the intercomparison of the probabilistic forecasts presented in the current paper with those from popular methods such as the Bayesian forecasting system, the generalized likelihood uncertainty estimation and the Bayesian recursive estimation. We believe this would enable identification of the most effective and reliable probabilistic forecasting method that can also be implemented in an operational set-up.

This work was supported by the Norwegian Research Council through the project Updating Methodology in Operational Runoff Models (192958/S60) and the consortium of Norwegian hydropower companies led by Statkraft. The hydrological data used in the project were retrieved from the database of the Norwegian Water Resources and Energy Directorate (NVE). The meteorological data were obtained from Trønderenergi AS and we thank Elena Akhtari for making them available to us. We would like to acknowledge the assistance of Keith Beven in the preparation of this manuscript. We thank the editor and two anonymous reviewers for their constructive comments, which helped improve the manuscript. Edited by: E. Toth