Assessment of the potential forecasting skill of a global hydrological model in reproducing the occurrence of monthly flow extremes

As an initial step in assessing the prospect of using global hydrological models (GHMs) for hydrological forecasting, this study investigates the skill of the GHM PCR-GLOBWB in reproducing the occurrence of past extremes in monthly discharge on a global scale. Global terrestrial hydrology from 1958 until 2001 is simulated by forcing PCR-GLOBWB with daily meteorological data obtained by downscaling the CRU dataset to daily fields using the ERA-40 reanalysis. Simulated discharge values are compared with observed monthly streamflow records for a selection of 20 large river basins that represent all continents and a wide range of climatic zones. We assess model skill in three ways, each of which contributes different information on the potential forecasting skill of a GHM. First, the general skill of the model in reproducing hydrographs is evaluated. Second, model skill in reproducing significantly higher and lower flows than the monthly normals is assessed in terms of skill scores used for forecasts of categorical events. Third, model skill in reproducing flood and drought events is assessed by constructing binary contingency tables for floods and droughts for each basin. The skill is then compared to that of a simple estimation of discharge from the water balance (P − E). The results show that the model has skill in all three types of assessments. After bias correction, the model skill in simulating hydrographs improves considerably and is higher than that of the climatology for most basins. The skill is highest in reproducing monthly anomalies. The model also has skill in reproducing floods and droughts, with a markedly higher skill for floods. The model skill far exceeds that of the water balance estimate. We conclude that the prospect of using PCR-GLOBWB for monthly and seasonal forecasting of the occurrence of hydrological extremes is positive.
We argue that this conclusion applies equally to other similar GHMs and LSMs, which may show sufficient skill to forecast the occurrence of monthly flow extremes.

Given the capability of GHMs to quantify streamflow, their relevance for integrated water resources management of large river basins has been recognized (Refsgaard, 2001). Reliable and timely forecasts of extremes in streamflow can help mitigate flood and drought risks and optimize water allocations to different sectors and sub-regions. The application of GHMs could be particularly promising for developing regions of the world where no effective flood and drought early warning systems are in place. However, up to now large-scale hydrological models have rarely been used for river flow forecasting, mainly because appropriate routing of river discharge is not included, and forecasting systems are limited to higher-resolution national or regional domains (e.g. the European LISFLOOD system with a grid resolution of 5 × 5 km; De Roo et al., 2000).
In this paper we investigate the skill of the global hydrological model PCR-GLOBWB in reproducing the occurrence of past extremes in the monthly discharges of 20 large rivers of the world that represent all continents and a wide range of climatic zones. The motivation for the paper is twofold. The first objective is to present our evaluation of PCR-GLOBWB as an initial step in assessing the prospect of using a GHM for forecasting hydrological extremes. The second is to identify a methodology that can serve as a benchmark verification procedure for hydrological forecasting. This procedure uses methods and skill scores that were developed primarily for the verification of meteorological forecasts.
Global terrestrial hydrology is simulated for a historical period from 1958 until 2001, by forcing PCR-GLOBWB with a meteorological data set produced by combining the ERA-40 reanalysis (Uppala et al., 2005) and Climatic Research Unit (CRU) data from the University of East Anglia (New et al., 2000). The use of a historical meteorological dataset implies that the hydrological forecasts are not affected by forecasting uncertainty in the forcing and the propagation thereof with increasing lead times. In this sense, the results presented here are indicative of the maximum skill that can currently be achieved by this and similar GHMs given the associated errors in forcing, discharge observations, model structure and parameterization.
We assess the skill of PCR-GLOBWB in reproducing hydrological extremes in three ways. First, a general verification of simulated hydrographs is carried out. Second, model skill in reproducing significantly higher and lower flows than the monthly normals is assessed by constructing categorical contingency tables and applying skill scores used in meteorology for forecasts of ordinal categorical events. Third, model skill in reproducing flood and drought events is assessed by applying verification measures for forecasts of binary events, where floods and droughts are defined in terms of discharge values being higher or lower than discharges associated with a given return period. The model skill quantified in terms of these three sets of skill scores is then compared with the skill obtained by a simple estimation of discharge from the water balance (P − E) over each basin.
We use discharge observations from the Global Runoff Data Center (GRDC) reference dataset, which contains monthly discharges for most basins. Consequently, the forecasting skill that we assess in this study is indicative of the potential skill that could be achieved in monthly and seasonal forecasting, rather than medium-range forecasting.
While the discharge simulations of other GHMs and LSMs have previously been compared to discharge observations, the novelty of this work is the evaluation of the ability of a GHM to reproduce the occurrence of anomalous flows and past flood and drought events with skill measures used in the verification of meteorological forecasts, in the prospective context of operational hydrological forecasting.
The rest of this paper is set up as follows: Sect. 2 describes the GHM PCR-GLOBWB, the historical simulation, the meteorological forcing, as well as the discharge data used for skill assessment. Section 3 describes the assessment of skill in reproducing hydrographs, anomalous flows, and floods and droughts. Results are presented and discussed in Sect. 4, followed by conclusions in the last section.

Hydrological model
PCR-GLOBWB (PCRaster Global Water Balance) is a hydrological model that simulates the terrestrial part of the global water cycle (Van Beek and Bierkens, 2009; Bierkens and Van Beek, 2009). It is coded in PCRaster, a high-level computer language for constructing environmental models (Wesseling et al., 1996). PCR-GLOBWB is fully distributed and operates on a regular grid with a cell size of 0.5° × 0.5° (ca. 55 × 55 km at the Equator). Meteorological forcing is applied on a daily time step and assumed to be constant over the grid cell. Sub-grid variability is taken into account in the representation of short and tall vegetation, open water, different soil types, saturated area, surface runoff, interflow and groundwater discharge.
PCR-GLOBWB is a "leaky-bucket" type of model that calculates the water balance for every grid cell by tracking the transfer of water between the atmosphere and the cell, through stores within each cell, and laterally, as discharge, from one cell to the next. The model calculates the storages and fluxes of water, and simulates the generation of runoff and its propagation as discharge through the river network. Precipitation falls either as snow or rain depending on atmospheric temperature. It can be intercepted by vegetation and added to the finite canopy storage, which is subject to open water evaporation. Snow accumulates when the temperature is lower than 0 °C and melts when it is higher. Snow melt is added to rain and throughfall; it is stored in the available pore space in the snow cover, or reaches the top soil layer. Part of this water is transformed into surface runoff and the remainder infiltrates into the soil through two vertically stacked soil layers and an underlying groundwater layer. Water is exchanged between these layers following Darcy's law and the resulting soil moisture is subject to evapotranspiration. The remaining water contributes to lateral drainage as interflow from the soil layers or baseflow from the groundwater reservoir. The total drainage, which consists of surface runoff, interflow and baseflow, is routed through the drainage network of rivers, lakes and wetlands, based on DDM30 (Döll and Lehner, 2002), using the kinematic wave approach. An extensive description of PCR-GLOBWB can be found in Van Beek and Bierkens (2009) and Van Beek et al. (2011).

Meteorological data set
The meteorological variables required to force PCR-GLOBWB are daily values of precipitation, evapotranspiration and temperature. In the absence of direct estimates of actual evapotranspiration, the model can be forced with values of potential evapotranspiration calculated from temperature, radiation, cloud cover, vapour pressure and wind speed.
In order to force PCR-GLOBWB with daily meteorological data at 0.5° resolution, the monthly fields of the CRU TS 2.1 data set (New et al., 2000) have been downscaled to daily fields using the ERA-40 reanalysis (Uppala et al., 2005). Precipitation fields are downscaled multiplicatively, while an additive correction is used for temperature. Reference potential evapotranspiration is calculated first on a monthly basis, based on monthly cloud cover and vapour pressure deficit from CRU TS 2.1 as well as radiation and wind speed from CRU CLIM 1.0 (New et al., 2002). Reference evapotranspiration is converted to crop-specific potential evapotranspiration using crop factors derived following FAO guidelines. Finally, potential evapotranspiration is downscaled multiplicatively to daily values using ERA-40 temperature fields. The methodology used to calculate potential evaporation for the different land surfaces in PCR-GLOBWB and the downscaling of the meteorological data are described in detail by Van Beek (2008). The resulting meteorological data set is limited to the period 1958-2001.

Simulated and observed discharge time series
The simulated discharge time series represent non-regulated flow. Twenty large river basins are selected for the comparison of simulated and observed time series on the basis of three criteria. The first is to represent all the continents, a wide range of climate zones and latitudes, as well as a variety of precipitation regimes. The second criterion is the availability of observed monthly streamflow records for at least part of the period 1958-2001. The third criterion is to focus on developing regions, which would benefit most from operational seasonal forecasting. The selected basins can be seen in Fig. 1.
The discharge data for most of the selected basins are obtained from the Global Runoff Data Center (GRDC, 2007). When GRDC data are not available, records from the Global River Discharge Database, RivDis 1.1 (Vörösmarty et al., 1998), are used. The period of record for the discharge values reported in the GRDC and RivDis databases varies widely from basin to basin (Table 1). Simulated daily discharges for the model grid cells corresponding to gauging stations are aggregated into monthly values, since this is the temporal resolution at which observed discharge data are available for validation. The simulated and observed discharge time series are used in the assessment of skill as described in the following section.

Measuring the skill in reproducing hydrographs
The performance of the model in hydrograph simulation is assessed in terms of verification measures used in forecasting of continuous variables, without applying thresholds. For this assessment, the most commonly applied statistical measure, the mean squared error (MSE), is calculated for each river basin. In order to judge the predictive skill, the raw MSE scores are transformed into MSE skill scores (MSESS). The MSESS provides a relative measure of the quality of the simulation compared to the mean climatology as a low-skill alternative method of estimation. Here climatology refers to the long-term mean of the available monthly discharge records for each of the 12 months of the year. The MSESS is defined as:

MSESS = 1 − MSE / MSE_cl,    (1)

where MSE_cl is the MSE of the climatology with respect to the observations. The range of values that the MSESS can take is [−∞, 1], with the maximum value of 1 indicating perfect skill, a value of 0 indicating a model skill equivalent to the climatology, and a negative value implying that the model performs worse than the climatology. Additionally we use the coefficient of determination (R²) and the Nash-Sutcliffe coefficient of efficiency (NS), which are often employed in the validation of hydrological models. These coefficients provide a measure of the model skill relative to the long-term mean, independent of the climatology. NS takes on values in [−∞, 1] and R² in [0, 1], with higher values indicating higher skill.
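As an illustration, the MSESS and NS computations described above can be sketched in a few lines of Python. This is a minimal sketch: the function names are ours, and `months` is assumed to give the calendar month (1-12) of each value.

```python
import numpy as np

def msess(sim, obs, months):
    """MSE skill score: model MSE relative to the MSE of the monthly
    climatology of the observations (1 = perfect, 0 = same skill as
    the climatology, negative = worse than the climatology)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    months = np.asarray(months, int)  # calendar month, 1..12
    # long-term mean of the observations for each calendar month
    clim = np.array([obs[months == m].mean() for m in range(1, 13)])
    mse_model = np.mean((sim - obs) ** 2)
    mse_clim = np.mean((clim[months - 1] - obs) ** 2)
    return 1.0 - mse_model / mse_clim

def nash_sutcliffe(sim, obs):
    """Nash-Sutcliffe efficiency: model MSE relative to the variance
    of the observations around their long-term mean."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
```

By construction, feeding the monthly climatology itself as the "simulation" yields an MSESS of exactly zero, which is the no-skill reference used throughout this section.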
Bias due to errors in the meteorological forcing, discharge records, model parameters, or simplifying assumptions can severely degrade the quality of the output of a hydrological model (Hashino et al., 2007). This is true for our simulations as well. We applied these skill measurement methods to both non-bias-corrected and bias-corrected simulation results. Verification with non-bias-corrected data better reflects potential shortcomings in the skill of the GHM and provides the opportunity to compare our simulations with the results of other studies which use non-bias-corrected data, such as the Water Model Intercomparison Project (WaterMIP), which quantifies and explains the differences in the results of five GHMs and six LSMs (Haddeland et al., 2011). Verification with bias-corrected data, on the other hand, is relevant for the assessment of forecasting skill, which is the ultimate purpose of this study. It provides an indication of the maximum skill that can be achieved when the systematic bias due to model errors or forcing is eliminated, as is generally the case in operational forecasting.
In this study a simple method of a posteriori bias correction is carried out. An a priori correction by basin-specific calibration has a stronger physical basis than an a posteriori adjustment of the model output. On the other hand, given the time, data and computational capacity required for model calibration, simple post-processing has the advantage of being far more straightforward and transparent. The post-processing method we employed is as follows: the bias is calculated for each pair of simulated and observed values. The calculated biases are grouped into the 12 months of the year, and a mean bias is calculated for each of these 12 months. Every discharge value is then corrected by subtracting the mean bias of the corresponding month from the simulated monthly discharge value.
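The post-processing steps above can be sketched as follows. This is an illustrative implementation; the names are ours, and `months` is assumed to give the calendar month (1-12) of each value.

```python
import numpy as np

def monthly_bias_correct(sim, obs, months):
    """Subtract, from each simulated value, the mean simulation-minus-
    observation bias of its calendar month."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    months = np.asarray(months, int)  # calendar month, 1..12
    corrected = sim.copy()
    for m in range(1, 13):
        sel = months == m
        # mean bias of this calendar month over all years
        corrected[sel] -= (sim[sel] - obs[sel]).mean()
    return corrected
```

After this correction the mean bias of each calendar month is zero by construction, while the interannual variability of the simulation is left untouched.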

Measuring the skill in reproducing anomalous flows
In order to analyze whether the model is capable of reproducing higher or lower flows than usual for a given month, the discharge time series are transformed into categorical events defined in terms of three categories of high, normal and low flow. The analysis is carried out for two different sets of categories. For the first set, high flow is defined as discharge values above the 75th percentile for the month in question; normal flow between the 75th and the 25th percentile; and low flow below the 25th percentile. For the second set, the 90th and the 10th percentiles are used. Thresholds are identified separately for simulated and observed discharge. This approach eliminates any systematic under- or overestimation in the simulations and thus removes the need for bias correction. The skill in simulating these three classes is assessed by constructing categorical contingency tables and applying skill scores for ordinal categorical events.
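The transformation into flow categories might be sketched as follows. As described above, the percentile thresholds are computed per calendar month from the series itself, so simulations and observations each use their own thresholds; the function and variable names are ours.

```python
import numpy as np

def flow_categories(q, months, lower=25, upper=75):
    """Classify monthly discharges as low (0), normal (1) or high (2)
    flow relative to percentile thresholds computed per calendar month
    from the same series."""
    q = np.asarray(q, float)
    months = np.asarray(months, int)  # calendar month, 1..12
    cats = np.ones(q.shape, dtype=int)  # default: normal flow
    for m in range(1, 13):
        sel = months == m
        lo, hi = np.percentile(q[sel], [lower, upper])
        cats[sel & (q < lo)] = 0
        cats[sel & (q > hi)] = 2
    return cats
```

Applying this to both series and cross-tabulating the resulting category pairs gives the 3 × 3 contingency table on which the categorical skill scores are computed; passing `lower=10, upper=90` gives the second, more extreme set of categories.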
Here we use the Gerrity Scores (GS) (Gerrity, 1992), a subset of the Gandin and Murphy (GM) family of equitable scores for deterministic categorical forecasts (Gandin and Murphy, 1992). The criterion of equitability is based on the principle that random forecasts or constant forecasts of the same single category receive a no-skill score (Murphy and Daan, 1985). GM scores use a scoring matrix which represents the reward or penalty accorded to each pair of simulation and observation on the contingency table. In contrast to other equitable scores such as the Heidke skill score (Heidke, 1926) and Peirce's skill score (PSS) (Hanssen and Kuipers, 1965), the GM family considers differences in the relative sample probabilities of the categories when apportioning a reward or penalty (Livezey, 2003). A correct forecast of a low-probability category is rewarded more than that of a high-probability category. Likewise, failure to forecast a rare event receives a lighter penalty than failure to forecast a common event.
GS and LEPSCAT scores (Potts et al., 1996) are the two subsets of the GM family that are appropriate for the specific case of ordinal categories, defined as ranges of a continuous variable such as discharge. In this study, GS are preferred since they are recommended by Livezey (2003) for ordinal categorical events, on the practical basis of being more convenient to use than LEPSCAT. GS assign higher penalties as the discrepancy between simulated and observed categories increases. This score takes on the maximum value of 1 for perfect skill, and the value of 0 for no skill. The value of GS for a categorical forecast with K categories is given by Eq. (2):

GS = Σ_{i=1}^{K} Σ_{j=1}^{K} p_ij s_ij,    (2)

where the relative sample frequency p_ij of each outcome on the K × K contingency table is multiplied by the corresponding scoring factor s_ij (i, j = 1, ..., K) from a scoring matrix S with relative levels of rewards and penalties, and the values are summed. The elements s_ij of the scoring matrix S are given by Eq. (3):

s_ii = (1 / (K − 1)) [ Σ_{r=1}^{i−1} a_r^{−1} + Σ_{r=i}^{K−1} a_r ],
s_ij = (1 / (K − 1)) [ Σ_{r=1}^{i−1} a_r^{−1} − (j − i) + Σ_{r=j}^{K−1} a_r ],  1 ≤ i < j ≤ K,    (3)

with s_ji = s_ij, where a_r = (1 − Σ_{q=1}^{r} p_q) / (Σ_{q=1}^{r} p_q) and p_q is the sample probability of the observed category q.
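The Gerrity score can be computed from a contingency table as in the following sketch, which follows the scoring-matrix formulation of Gerrity (1992). The function name and the table convention (rows: observed category, columns: simulated category) are our own.

```python
import numpy as np

def gerrity_score(table):
    """Gerrity skill score for a K x K contingency table of counts
    (rows: observed category, columns: simulated category)."""
    table = np.asarray(table, dtype=float)
    K = table.shape[0]
    p = table / table.sum()        # joint relative frequencies p_ij
    p_obs = p.sum(axis=1)          # observed marginal probabilities
    # odds ratios a_r from cumulative observed probabilities, r = 1..K-1
    cum = np.cumsum(p_obs)[:-1]
    a = (1.0 - cum) / cum
    b = 1.0 / (K - 1)
    s = np.empty((K, K))           # symmetric scoring matrix S
    for i in range(K):
        for j in range(i, K):
            s[i, j] = b * (np.sum(1.0 / a[:i]) - (j - i) + np.sum(a[j:]))
            s[j, i] = s[i, j]
    return float(np.sum(p * s))
```

The equitability property can be checked directly: a table in which every observation is assigned to one constant forecast category scores exactly 0, while a purely diagonal table scores exactly 1.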

Measuring the skill in reproducing floods and droughts
Floods and droughts are regarded as simple binary events defined as exceedances of threshold discharges. For some rivers a monthly time scale may seem too coarse to correctly predict flood sizes. However, when we limit ourselves to forecasting monthly flows in terms of binary events, these will certainly be indicative of an increased probability of floods for large rivers. It can be seen in Appendix A that at the gauging station Lobith on the Rhine, throughout the years with available records during the period from 1815 to 2008, extreme daily discharges almost always coincide with large monthly discharges. When the annual maxima of daily discharge are plotted against the mean discharge of the month in which the daily maximum occurred, the resulting points cluster along a straight line (see Fig. A1), with daily maxima higher than monthly mean values, as would be expected. Moreover, Fig. A2 shows that for most of the years, the month in which the annual maximum daily discharge occurred is also the month of maximum monthly flow, or directly precedes or succeeds this month. Since the Rhine is the smallest of the 20 rivers in this study, and given the fact that it has a rather complex regime, one can infer that the same assumption holds for the other, larger basins as well. Decision thresholds for a basin may be defined using various hydrological and economic criteria. A comprehensive approach with verification over the full range of possible thresholds for each basin is beyond the scope of this study. Therefore, a single set of decision thresholds for floods and droughts, common to all river basins, is selected that can reasonably distinguish between the usual and extreme states of each basin. The flood and drought thresholds used in this study correspond to 5-yr return periods for each river.
The discharges corresponding to the 5-yr flood and drought events have been derived using the Annual Maximum Series method.
The choice of 5-yr return periods for floods as well as droughts is made on the basis of two considerations.On the one hand, events with return periods of a few years do not reflect the long-term variability, and do not represent unusually extreme states of a river.On the other hand, the limited availability of discharge observations does not allow the estimation of rare events beyond a fraction of the record length.Five years in this case appears to be a practical return period for the assessment of model skill in reproducing both types of hydrological extremes observed in all basins, the record lengths for which are given in Table 1.For the two basins with the longest records, i.e. the Danube and the Mississippi, we repeat the analysis for return periods of ten years.
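A 5-yr threshold from an annual maximum series might be estimated as in the sketch below. Note that the text only states that the Annual Maximum Series method is used; the choice of a Gumbel distribution fitted by the method of moments is our assumption, made purely for illustration.

```python
import numpy as np

def gumbel_quantile(annual_maxima, return_period):
    """Discharge with the given return period, estimated from an
    annual maximum series via a Gumbel distribution fitted by the
    method of moments (an assumed, illustrative choice)."""
    x = np.asarray(annual_maxima, dtype=float)
    scale = x.std(ddof=1) * np.sqrt(6.0) / np.pi
    loc = x.mean() - 0.5772156649 * scale  # Euler-Mascheroni constant
    p_exceed = 1.0 / return_period         # annual exceedance probability
    return loc - scale * np.log(-np.log(1.0 - p_exceed))
```

An analogous fit to the annual minima would give the corresponding drought threshold; as expected, the estimated quantile grows with the return period.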
Similar to the approach used in the construction of categorical tables described in Sect. 3.2, for the construction of binary tables the thresholds for observations and simulations are identified separately in order to decrease the effect of any systematic under- or overestimation. The skill in simulating these binary events is assessed by considering four equitable skill scores: the Heidke skill score (HSS) (Heidke, 1926), Peirce's skill score (PSS) (Hanssen and Kuipers, 1965), Gilbert's skill score (GSS) (Schaefer, 1990) and the odds ratio skill score (ORSS) (Stephenson, 2000). As stated in Sect. 3.2, the criterion of equitability is based on the principle that random forecasts or constant forecasts of the same single category receive a no-skill score (Murphy and Daan, 1985). Two of these four equitable scores, namely HSS and GSS, are markedly dependent on the sample climate. The sample climate, defined as the sample estimate of the unconditional probability of occurrence of an event, is purely a characteristic of the observations with no direct relevance to skill assessment (Mason, 2003). Since dependence on the sample climate makes a skill score unjustifiably sensitive to variations in the observed climate and therefore unreliable, HSS and GSS are excluded in this study. The remaining two equitable scores, PSS and ORSS, are independent of the sample climate and are recommended in several studies (McBride and Ebert, 2000; Stephenson, 2000; Göber et al., 2004). However, ORSS is also excluded because the presence of a zero in any cell of the contingency table renders this skill score inappropriate (Livezey, 2003). On the basis of these considerations, PSS is preferred in this study.
The possible values of the PSS are within the range [−1, 1] and its true no-skill value is 0; negative values imply less skill than a random prediction. The PSS for floods and droughts for each basin is calculated in terms of the cell counts of the relevant contingency tables according to the formula:

PSS = (ad − bc) / [(a + c)(b + d)],    (4)

where a, b, c and d represent the cell counts for each of the possible outcomes of hit, false alarm, miss and correct rejection, respectively.
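In code, the PSS follows directly from the four cell counts (an illustrative sketch; the function name is ours):

```python
def peirce_skill_score(hits, false_alarms, misses, correct_rejections):
    """Peirce skill score (Hanssen-Kuipers discriminant) from the
    cell counts of a 2x2 contingency table."""
    a, b, c, d = hits, false_alarms, misses, correct_rejections
    # hit rate minus false-alarm rate: a/(a+c) - b/(b+d)
    return (a * d - b * c) / ((a + c) * (b + d))
```

This form makes the properties discussed below easy to see: the score is 0 whenever there are no hits and no false alarms, and it reaches 1 when either the misses or the false alarms vanish while the other off-diagonal cell is also zero.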

Measuring added skill over a simple water balance estimate
In order to demonstrate the added value of running a complex hydrological model over a simple estimation of the water balance, the MSESS (non-bias-corrected), GS and PSS are applied to an alternative set of monthly discharge values at the outlet of each basin. These discharge values are computed as follows: monthly actual evapotranspiration (E) is subtracted from the precipitation (P) on a monthly basis, and the result is then aggregated over the drainage network, including downstream losses due to open water evaporation, to obtain the instantaneous monthly discharge. This P − E estimate incorporates the same information from the climatic forcing, but ignores the hydrological information on stores and fluxes that leads to temporal and spatial redistribution. Comparing the skill of the model results with that of this estimate shows the added value of the routing and the hydrology, while both suffer from the same poor climatological forcing.
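A deliberately simplified sketch of such a water-balance estimate is given below. Unlike the estimate used in this study, it ignores downstream open-water evaporation losses and assumes fixed 30-day months; all names and units are illustrative.

```python
import numpy as np

SECONDS_PER_MONTH = 30 * 86400  # crude fixed-length month

def pe_discharge(precip_m, aet_m, cell_area_m2):
    """Basin-outlet monthly discharge (m^3 s^-1) estimated by simple
    summation of cell-wise P - E volumes: monthly precipitation and
    actual evapotranspiration depths (m) over the basin's cells."""
    net_depth = np.asarray(precip_m, float) - np.asarray(aet_m, float)
    volume = np.sum(net_depth * np.asarray(cell_area_m2, float))
    return volume / SECONDS_PER_MONTH
```

Because this estimate carries no storage or routing, its monthly series responds instantaneously to the forcing, which is precisely the deficiency that the skill comparison with the full model exposes.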

Skill in reproducing hydrographs
The results of the historical simulation and the observed discharge time series for the selected rivers are presented in Fig. 2 for visual inspection. The differences between the simulations and observations can be attributed to several errors, such as those in the meteorological forcing, discharge records, model parameters, or simplifying assumptions. The possible model errors are discussed in depth in Van Beek and Bierkens (2009) and Van Beek et al. (2011). Three groups of rivers present a large discrepancy between the simulations and observations. The first group is the Arctic rivers, such as the Lena and Mackenzie, and snow- and glacier-dominated rivers such as the Indus. Undercatch in the CRU snowfall amounts reported by Fiedler and Döll (2007) results in a large underestimation of the spring discharge after the start of snowmelt. The second group consists of those basins with heavy regulation and large amounts of withdrawal for irrigation and consumption, such as the Murray, Zambezi and Parana. The routing scheme in the current version of PCR-GLOBWB simulates natural discharge and does not include reservoir operations and withdrawals. Therefore the simulated natural flow on these heavily regulated rivers is in disagreement with the measured discharge. Although it is one of the most heavily regulated rivers, the Nile does not show this discrepancy since measurements of natural flow upstream of the High Dam are available for comparison. The last group consists of rivers in the tropics, which show either overestimation, as in Africa, or underestimation, as in the Amazon. This is mostly attributable to the low station coverage over the tropics in the CRU dataset and, to a lesser extent, to poor precipitation forecasts in ERA-40 (Troccoli and Kållberg, 2004).
The improvement in predictive skill due to the correction of bias can be seen in the discharge time series before and after the bias correction (Figs. 2 and 3), as well as in the reliability diagrams (Fig. 4). It can be observed from these figures that bias correction greatly improves the results. This improvement is documented quantitatively in Table 2, which shows the MSE skill scores for the selected basins, both before and after the bias correction. Table 2 shows that without a bias correction, the MSESS for the majority of basins are negative. The improvement in the MSESS due to the correction varies widely, but is quite high in general, yielding a skill higher than that of the climatology for most basins. The three basins where the highest skill is observed are the Yangtze, the Rhine and the Mississippi, with MSESS above 0.70. The model performs worse than the climatology in four basins. It is interesting to note that the three basins with the worst performance, namely the Niger, the Nile, and the Congo, are all African rivers. The fourth basin with negative skill is the Amazon. The relatively low skill in the Amazon and other monsoon-dominated basins, such as the Indus and the Mekong, can be explained to a certain degree by the fact that for such basins the climatology is already a good estimate of the expected discharge, so that it is difficult to perform better. The relatively high values of R² and NS for these basins, which are also presented in Table 2, indicate that the model performance is not poor in monsoon-dominated basins, provided that it is evaluated using measures independent of the climatology.

Skill in reproducing anomalous flows
A complete summary of the joint distribution of categorical simulations and observations for the selected basins is presented in 3 × 3 contingency tables (Tables 3 and 4). These tables provide the basis for the calculation of the Gerrity Scores for each basin. As can be seen in Table 5, all the resulting values of GS are positive, indicating that the model has skill in reproducing categorical events. In general, GS values are higher for reproducing the 75th- and 25th-percentile flows than for the 90th and the 10th, as the skill is expected to decrease for more extreme flows.
The same three rivers with the highest skill in simulating exact discharges, namely the Yangtze, the Rhine and the Mississippi, again have the highest scores for categorical events. The model performance in categorical simulations for the African rivers, the Niger, the Nile, and the Congo, is much better than in reproducing hydrographs. The lowest skill among all the basins is observed for another African river, the Zambezi, though it is still above the climatology. For the Amazon, where the skill in reproducing hydrographs is less than that of the climatology, we observe that the skill in reproducing anomalous flows is rather high compared to other basins. This shows that even in cases where the model simulations are biased and do not outperform the climatology in reproducing hydrographs, the skill in reproducing anomalous flows can be relatively high.

Skill in reproducing floods and droughts
The 2 × 2 contingency tables for flood and drought events for the selected basins can be seen in Table 6. The PSS calculated on the basis of these tables are presented in Table 7. The resulting PSS show that the skill obtained by binary forecasts of 5-yr floods and droughts is also higher than that of an unskilled forecasting system. The system has a markedly higher skill in forecasting floods compared to droughts.
Model structure and process descriptions explain the difference in skill in reproducing floods and droughts. Floods are largely controlled by the rapid response of basins and thus react almost directly to above-average rainfall in the forcing, depending on the antecedent conditions. In contrast, droughts or low flows represent the response of the hydrological system to prolonged periods of below-average rainfall. As such, they are more sensitive to the uncertainty in model parameterization affecting processes such as the build-up of soil moisture deficit, the depletion of the groundwater system by baseflow, and the regulation of discharge by reservoirs or changing withdrawals. With respect to baseflow, PCR-GLOBWB contains a conceptual model to describe the influence of lithology and drainage density. This model is parameterized using global datasets but not calibrated. As a consequence it can resolve the general trend but not all local variations. Moreover, the simulated discharge in this study is the natural one, and regulation and consumption are not considered. All in all, this makes droughts more sensitive to model uncertainty, all the more so as the rank order of these events can be less accurately assessed due to the relatively larger variability of this phenomenon.
There are no basins where the model has a negative skill score in reproducing either floods or droughts; but for seven basins, the PSS indicates no skill in reproducing droughts. This is because the PSS takes on the value of zero when the contingency table shows no hits. For some basins, the model demonstrates perfect skill in reproducing floods. This is a shortcoming of the skill score used. The score takes on the value of one in cases where there are either no misses or no false alarms. Yet, to be able to assign perfect skill, one would expect the number of both misses and false alarms to be zero.
The skill assessment in reproducing 5-yr events is not applicable to the Zambezi, for which the available discharge record covers only four years (see Table 1). For this basin, the PSS is undefined due to the absence of any observed event. The short length of the observed discharge records also affects the assessment of skill negatively for the Brahmaputra (five years and ten months) and the Ganges (nine years; Table 1).
For the two basins with the longest records, i.e. the Danube and the Mississippi, we have repeated the analysis for return periods of ten years. The results, which are presented in Appendix C, show that for both basins the PSS for floods decreases when the return period increases, as expected. For the Mississippi, the PSS in reproducing 10-yr droughts is, surprisingly, slightly higher than in 5-yr droughts. For the Danube, the PSS in 10-yr droughts is zero since there are no hits on the contingency tables.
Notwithstanding the problems related to limited observation lengths, skill in reproducing flood and drought events is demonstrated.

Added skill over a simple water balance estimate
The added value of running a complex hydrological model over a simple estimation of the water balance is demonstrated by comparing the skill scores MSESS (non-bias-corrected), GS and PSS for model-simulated discharges and for the P-E estimate. Skill scores for both the model results and the P-E estimate are presented in Appendix B.
The results show that the model skill far exceeds that of the P-E estimate in all cases. Comparing the model results with this estimate shows the added value of the routing and hydrology, while both suffer from the same poor climatological forcing. In contrast, the monthly climatology of observed discharge performs better than the P-E estimate, as it is more attuned to the actual climate (save for its anomalies) as well as to the regulation.
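As an illustration of the hydrograph verification, an MSE-based skill score is commonly defined relative to a reference prediction such as the observed monthly climatology or the P-E estimate. The following is a minimal sketch of that common form, not a reproduction of the study's exact computation; the function name and arrays are illustrative.

```python
import numpy as np

def msess(sim, obs, ref):
    """MSE skill score of simulation `sim` against observations `obs`,
    relative to a reference prediction `ref` (e.g. the monthly climatology
    or a simple P-E estimate). 1 = perfect, 0 = no better than the
    reference, negative = worse than the reference."""
    sim, obs, ref = (np.asarray(a, dtype=float) for a in (sim, obs, ref))
    mse_sim = np.mean((sim - obs) ** 2)  # error of the model simulation
    mse_ref = np.mean((ref - obs) ** 2)  # error of the reference prediction
    return 1.0 - mse_sim / mse_ref
```

A simulation identical to the observations scores 1, while one that merely reproduces the reference scores 0, which is why a model beating the climatology corresponds to a positive score against that reference.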

Conclusions and recommendations
As an initial step in assessing the prospect of global hydrological forecasting, we tested the ability of the global hydrological model PCR-GLOBWB to reproduce the occurrence of past extremes in the monthly discharge of 20 large rivers of the world. We assessed the model skill in three ways: first in simulating hydrographs, second in reproducing monthly anomalies and third in reproducing flood and drought events. The advantage of such a procedure is that it provides a more detailed assessment of forecasting skill and an insight into which types of forecasting are more promising.
Verification of non-bias-corrected hydrographs reflects model and forcing errors, thus providing the opportunity for improvement. In addition, it allows comparison with the results of other studies that use non-bias-corrected data. Eliminating the systematic bias due to model errors or forcing, on the other hand, provides an indication of the maximum skill that can be achieved in operational forecasting. Simulations with PCR-GLOBWB are biased for most basins, and the skill in reproducing hydrographs is lower than that of the observed climatology. The model skill improves significantly after a post-processing bias correction and surpasses that of the observed climatology in most basins.
The results of the analysis indicate that the skill obtained in reproducing monthly anomalies using non-bias-corrected data is higher than that of the climatology for all basins. The model also has skill in reproducing floods and droughts, with a markedly better performance in the case of floods. The model skill surpasses that of a simple water balance estimate in all cases.
Although simulated hydrographs may be biased and do not always outperform the observed climatology even after bias correction, higher skills can be attained in forecasting the occurrence of monthly anomalies as well as floods. The prospects for operational forecasting of monthly hydrological extremes are thus positive. PCR-GLOBWB is similar to other GHMs in model structure and parameterization, and the forcing data are similar to those used in simulations with other GHMs and LSMs. The performance of PCR-GLOBWB in reproducing runoff is comparable to that of other GHMs (Sperna Weiland et al., 2010; Wada et al., 2008) and of LSMs (Sperna Weiland et al., 2011). Given these similarities, we argue that our conclusion is valid for other comparable GHMs and LSMs as well.
This retrospective assessment is a preliminary one: it shows the potential skill given the current GHM, with a meteorological forcing based on observations. The true skill should be assessed in forecasting mode, using meteorological forecasts subject to the uncertainty of numerical weather prediction (NWP) models.

Appendix A Correlation between annual maxima of daily and monthly discharges at gauging station Lobith on the Rhine
Published by Copernicus Publications on behalf of the European Geosciences Union.
N. Candogan Yossef et al.: Reproducing the occurrence of monthly flow extremes

Figure A2:
Figure A1: Annual maxima of daily discharge vs. corresponding monthly mean flows

Table 3. Categorical contingency tables for the 75th and 25th percentiles. o: observed, s: simulated, L: low flow, N: normal flow, H: high flow.

Table 4. Categorical contingency tables for the 90th and 10th percentiles. o: observed, s: simulated, L: low flow, N: normal flow, H: high flow.

Table 5. Gerrity skill scores in reproducing anomalous flows for the 75th and 25th percentiles, and the 90th and 10th percentiles.

Table 6. Binary contingency tables for floods and droughts. o: observed, s: simulated.

Table 7. Peirce's skill scores for floods and droughts.