Detection of global runoff changes: results from observations and CMIP5 experiments

This paper assesses the detectability of changes in global streamflow.
First, a statistical detection method is applied to observed streamflow
(no missing data, representing 42% of global discharge) and to reconstructed
streamflow (gaps are filled in order to cover a larger area and about 60% of
global discharge). Observations show no change over the 1958–1992 period. Further,
an extension to 2004 over the same catchment areas using reconstructed data
does not provide evidence of a significant change. Conversely, a significant
change is found in reconstructed streamflow when a larger area is
considered. These results suggest that changes in global streamflow are
still unclear. Moreover, changes in streamflow as simulated by models from
Coupled Model Intercomparison Project 5 (CMIP5) using the historic and
future RCP 8.5 scenarios are investigated. Most CMIP5 models are found to
simulate the climatological streamflow reasonably well, except for over South
America and Africa. Change becomes significant between 2016 and 2040 for all
but three models.


Introduction
Human influence has now been documented in several parts of the water cycle: atmospheric water vapour (e.g. Willett et al., 2007; Santer et al., 2007), land precipitation (e.g. Zhang et al., 2007), and land evapotranspiration (e.g. Douville et al., 2013). The case of runoff or river discharge is less clear-cut. While some studies report robust trends over specific regions (e.g. Stahl et al., 2010, over Europe; Krakauer and Fung, 2008, over the US), studies focused on the global scale have led to somewhat contradictory results. Based on 221 rivers, corresponding to 40 % of global continental runoff, Labat et al. (2004) documented an increase in global runoff at the end of the 20th century compared to the beginning. In contrast, based on data from 925 rivers, corresponding to 80 % of global runoff, Dai et al. (2009) showed a slight decrease in global runoff over the second half of the 20th century. This discrepancy can be explained by differences in the number of gauging stations used, in the period of investigation, or in the method used to fill the gaps. Indeed, both studies used reconstructions (i.e. gap filling) in order to provide more comprehensive spatio-temporal coverage.
Over the last few years, several studies have attempted to explain the supposed observed global runoff trend, sometimes based on land surface models (LSMs). Labat et al. (2004) were the first to relate the supposed positive runoff trend to global warming. They pinpointed a positive feedback between warming, an increase in ocean evaporation and an increase in continental precipitation. This assumption was then contradicted by many other studies. For example, using the MOSES LSM, Gedney et al. (2006) explained this positive trend by a decrease in transpiration resulting from stomatal closure due to rising atmospheric CO₂. Using the ORCHIDEE LSM, Piao et al. (2007) concluded that land use and climate change are primarily responsible for the observed positive runoff trend. Similarly, but using the LPJmL model, Gerten et al. (2008) found that the impacts of stomatal closure and land use changes are very small and that the main factor explaining runoff change is precipitation change. The relative importance of the fertilization and stomatal closure effects and of land use is still very model-dependent (Alkama et al., 2010). In a recent study, Alkama et al. (2011) hypothesized that the observed surface warming and the associated decline of permafrost and glaciers, not yet included in most LSMs, could have contributed to increased runoff at high latitudes. They also emphasized that the runoff trend is a regional-scale issue, if not basin-dependent. Finally, the majority of recent studies conclude that there was no significant global runoff trend in the late twentieth century (Milliman et al., 2008; Dai et al., 2009; Alkama et al., 2011).
This paper first aims to provide a novel assessment of the significance of recent observed changes. This assessment is based on the temporal optimal detection (TOD) method (Ribes et al., 2010). While most previous studies consider global mean runoff, the TOD method is able to provide a single global diagnostic based on continental-scale mean runoffs. The TOD method is applied both to observed data only (i.e. with no missing data) and to reconstructed data (i.e. a substantial fraction of the streamflow time series is missing and was reconstructed by Dai et al., 2009).
With regard to future projections, an intensification of the hydrological cycle over the 21st century is widely assumed (e.g. Liu et al., 2012). However, regional patterns of human-induced changes in surface hydroclimate are complex and less certain than those in temperature. Indeed, both increases and decreases may be expected in future precipitation and runoff, depending on the region (Milly et al., 2005; Alkama et al., 2010). Our study investigates the large-scale runoff change over the late 20th and the 21st century (with atmospheric greenhouse gas and aerosol concentrations from the Representative Concentration Pathway RCP 8.5 scenario), as simulated by 14 CMIP5 (http://cmip-pcmdi.llnl.gov/cmip5/) Atmosphere Ocean General Circulation Models (AOGCMs). First, we assess the extent to which these simulated runoffs are consistent with observations. To this end, a comparison with observed streamflow is performed for the past few decades. Second, the same experiments are used to investigate how the anthropogenic perturbation (greenhouse gases) may lead to different responses, depending on the model. Third, the same detection technique is applied to climate change scenarios in order to determine the significance of the simulated changes. The date at which the changes become significant is of particular interest, and provides some information with respect to the consistency or inconsistency between observed and simulated changes.
This paper primarily addresses the following three major issues:
1. How does global observed and reconstructed streamflow change over time?
2. Are simulated streamflows reasonably consistent with observations?
3. How will streamflow change in the future?

Data
To the best of our knowledge, the most complete downstream discharge dataset in existence was collected by Dai et al. (2009). This dataset represents historical monthly streamflow at the farthest downstream stations for the world's 925 largest ocean-reaching rivers from 1900 to 2004. However, the length and reliability of the available time series vary greatly from one river basin to another, and gaps are usually found. Observed streamflows are subject to several sources of uncertainty, in particular measurement uncertainty (e.g. related to the estimation of rating curves), potential homogeneity breaks, and missing values. Measurement errors are very difficult to address and no homogenised datasets are currently available, so the results provided in this study are conditional on this dataset, following previous work (e.g. Dai et al., 2009) that also investigated the recent trends in global streamflows.
Gaps may be filled by using statistical techniques, numerical simulations using land surface models (LSMs), or a combination of both. In the present application, gaps were filled by applying a statistical linear correction to the river discharge simulated by an LSM with observed atmospheric forcings (see Dai et al., 2009). Such a reconstruction, however, is likely to introduce additional uncertainty. Results may depend on potential inaccuracy of the LSM used, homogeneity breaks in the atmospheric forcing, uncertainties coming from the observations (sometimes only a few years) used to calibrate the statistical correction, and others. As a consequence, this study carefully distinguishes between two different treatments. First, we analyse observed streamflows only, by considering time series with no missing data. In this way, the number of selected rivers is reduced to 161 over the 1958-1992 period. This period was chosen as a compromise between spatial and temporal coverage. Note that even under this restrictive treatment, the period investigated is similar to the one considered in Gedney et al. (2006) or Alkama et al. (2010). Second, in order to consider the larger spatio-temporal coverage available, we apply the same analysis to the dataset including reconstructed streamflows. As linear regression cannot be used if there is too much missing data, Dai et al. (2009) succeeded in reconstructing only 687 gauging stations for the whole 1958-2004 period. We consider these 687 catchment areas over this period. Finally, a third product is used in order to extend the "observations" up to 2004. We consider the 161 rivers observed over the 1958-1992 period and allow missing/reconstructed values over the 1993-2004 period. This extension does include reconstructions, but the amount of reconstructed values is much reduced compared to the previous case (i.e. 687 rivers).
River discharge, in addition to being potentially influenced by anthropogenic climate change, may be affected by direct human intervention, due to water resource management (dams), water withdrawal (e.g. irrigation, industrial or domestic uses), changes in land use that impact evapotranspiration, and others. In terms of climate change detection and attribution, these direct influences may be regarded as "confounding factors", as they may cause a substantial trend without any climate change. Detailed estimation of such direct perturbations is very challenging and no global discharge database of "naturalized streamflows" is currently available. However, several studies addressed the issue of quantifying these direct anthropogenic influences at the global scale, and suggested that they had a minor impact on multi-year trends. Wisser et al. (2010) quantified the impact of irrigation and reservoir operations over the 20th century. They concluded that "the land use, expansion of irrigation and the construction of reservoirs has considerably and gradually impacted hydrological components in individual river basins. Variations in the volume of water entering the oceans annually, however, are governed primarily by variations in the climate signal alone with human activities playing a minor role". The latter is shown to hold at the continental scale (i.e. for individual oceanic basins, which corresponds to scales similar to the ones considered here, but with a different clustering). Other studies (e.g. McClelland et al., 2004; Adam et al., 2007; Adam and Lettenmaier, 2008) confirmed that dams have altered the seasonality of discharge, especially over upstream rivers, but are not responsible for changing annual values. The impacts of land use changes on land surface hydrology, however, are still debated. On the one hand, when irrigation is neglected, land use can have an important influence on runoff via a decrease in surface evapotranspiration (Piao et al., 2007). This conclusion is also supported by Sterling et al. (2013) when irrigation is taken into account. On the other hand, Liu et al. (2008) and Sun et al. (2008) indicate that deforestation over China, associated with irrigation, led to increased evapotranspiration over the 20th century. Other studies, over individual river basins, suggested that the sign of land-use-induced change was unclear (e.g. Twine et al., 2004, over the Mississippi River basin; VanShaar et al., 2002, over the Columbia River basin). Finally, some direct human influences via other activities have also been investigated and shown to have limited impact. For instance, McClelland et al. (2004) demonstrated that increased forest fire frequency and severity may have contributed to changes in discharge, but cannot be considered as a major driver.
We also used runoff simulated by different models from the Coupled Model Intercomparison Project Phase 5 (CMIP5). These runs comprise three kinds of experiments: historical runs in which all external forcings come from observations, future runs which use greenhouse gas and aerosol emissions from the RCP 8.5 scenario, and piControl runs in which pre-industrial forcings are held constant. The piControl runs are used to evaluate the internal climate variability. There are four RCP scenarios, and RCP 8.5 involves the highest greenhouse gas concentrations at the end of the 21st century. It reaches a radiative forcing of 8.5 W m⁻² (about 4 times the current value) by 2100, which corresponds approximately to an atmospheric CO₂ concentration of 1370 ppmv. This scenario involves an intensive use of fossil fuels, with little mitigation.
For this entire study, a set of 8 zones was selected, in which river basins are merged by continent and climate area. The motivation for separating the northern cold climate from the tropics comes from Dai et al. (2009) and Alkama et al. (2011), who found a significant runoff increase at high latitudes that cannot be explained by the atmospheric forcing. The motivation for merging all of Africa's river basins into a single zone, despite large differences in climate, comes from a set of previous studies showing that discharge in nearly all of the largest African river basins decreased significantly over the second half of the 20th century (e.g. Alkama et al., 2011; Dai et al., 2009; Gedney et al., 2006; Labat et al., 2004). The 8 selected zones are North America, Central America, South America, North Europe (including Arctic basins), South Europe, North Asia (corresponding to Siberia), South Asia (including Oceania) and Africa (Fig. 1).
Three steps are applied before comparing the runoff modelled in the different CMIP5 experiments with observations over the 8 regions: (1) all CMIP5 modelled runoff fields (mm d⁻¹) are interpolated bilinearly onto the same 0.5° × 0.5° grid; (2) the river basins are aggregated on this 0.5° × 0.5° grid. To be consistent with the data, river basins are defined from the known latitude and longitude of the observed gauging stations, which differ from the river mouths; (3) simulated runoff is then averaged over the computed river basins and merged into the 8 zones defined in Fig. 1.
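Step (3) can be illustrated with a minimal sketch. It assumes that the basin and zone membership of each 0.5° cell is already available as an integer mask (hypothetical name `zone_mask`) and that cells are weighted by the cosine of latitude. The weighting is an assumption made for this illustration and is not specified in the text.

```python
import numpy as np

def zone_mean_runoff(runoff, lats, zone_mask, zone_id):
    """Area-weighted mean runoff (mm/d) over the cells of one zone.

    runoff    : 2-D array (nlat, nlon) on the common 0.5-degree grid
    lats      : 1-D array of cell-centre latitudes in degrees
    zone_mask : 2-D integer array, same shape as runoff (0 = not in any basin)
    zone_id   : integer label of the zone to average over
    """
    # Assumption: cell area is proportional to cos(latitude) on a regular grid.
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(runoff)
    selected = zone_mask == zone_id
    return np.sum(runoff[selected] * weights[selected]) / np.sum(weights[selected])

# Toy example on a global 0.5-degree grid with a uniform runoff field.
lats = np.arange(-89.75, 90.0, 0.5)
lons = np.arange(-179.75, 180.0, 0.5)
runoff = np.full((lats.size, lons.size), 1.0)
zone_mask = np.zeros(runoff.shape, dtype=int)
zone_mask[200:240, 100:200] = 2  # hypothetical block of cells labelled as zone 2
print(zone_mean_runoff(runoff, lats, zone_mask, 2))
```

The bilinear interpolation of step (1) and the basin delineation of step (2) are not shown; they would normally be done with a regridding tool and a river-routing network.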

The temporal optimal detection method
Detection is the process of demonstrating that an observed change is significantly different (in a statistical sense) from what can be explained by natural internal variability. The statistical method used for detection is the temporal optimal detection method (Ribes et al., 2010). We review the main concepts here but refer to Ribes et al. (2010) for full details about the method. The TOD method is based on a linear model:
$$Y(s, t) = a(s) + b(s)\,x(t) + \varepsilon(s, t), \tag{1}$$
where Y(s, t) denotes the observed streamflow at location s and time t, a(s) is the climatological mean, b(s) and x(t) are respectively the spatial and temporal patterns of change, and ε(s, t) denotes the internal variability. TOD basically assumes that the temporal pattern of change is known while the spatial pattern is not. This is a substantial difference from other methods, such as optimal fingerprinting, in which the full spatio-temporal pattern of change is assumed to be known (up to a scaling factor, e.g. Hasselmann, 1993). This assumption makes the TOD method particularly suitable here, because the spatial pattern of changes in global runoff is still under debate and somewhat model-dependent (see Sect. 3.2). Regarding internal variability, the TOD method assumes that ε has a red noise structure (or autoregressive process of order 1, AR1, see e.g. Brockwell and Davis, 1991). The red noise structure means that the random term ε(s, t) satisfies:
$$\varepsilon(s, t) = \alpha\,\varepsilon(s, t-1) + \eta(s, t), \tag{2}$$
where η(s, t) is a white noise in time, i.e. η(s, t) is independent from η(s, t − 1). This assumption also means that, for example, the autocorrelation function decreases exponentially, with no long-range memory effect. An AR1 process is then described by a single parameter, α (see Eq. 2), which is the one-year lag autocorrelation of ε. Note that in many statistical tests, residuals ε are assumed to be white noise (i.e. α = 0), which makes the detection easier. Given the inputs Y(s, t), x(t) and α, the TOD method provides an estimate of the trend b(s) at each location. Based on this estimate, TOD performs a statistical test of the null hypothesis "b = 0", and so returns a single P value describing how significantly observations have changed.
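To make the linear model with red noise concrete, the following sketch simulates one location and recovers the trend after removing the AR1 dependence by prewhitening. This is a simplified single-location illustration, not the multivariate TOD test of Ribes et al. (2010). The function names and the prewhitening step are assumptions introduced for this example.

```python
import numpy as np

def simulate_ar1(n_years, alpha, sigma, rng):
    """Red noise: eps[t] = alpha * eps[t-1] + eta[t], with eta white noise."""
    eps = np.zeros(n_years)
    # Draw the first value from the stationary distribution of the process.
    eps[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - alpha**2))
    for t in range(1, n_years):
        eps[t] = alpha * eps[t - 1] + rng.normal(0.0, sigma)
    return eps

def prewhitened_trend(y, x, alpha):
    """Estimate b in y = a + b*x + eps when eps is AR1 with known alpha.

    Subtracting alpha times the previous value turns the red noise into
    white noise, after which ordinary least squares is appropriate.
    """
    y_star = y[1:] - alpha * y[:-1]
    x_star = x[1:] - alpha * x[:-1]
    design = np.column_stack([np.ones_like(x_star), x_star])
    coef, *_ = np.linalg.lstsq(design, y_star, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
x = np.arange(35, dtype=float)            # linear temporal pattern, x(t) = t
y = 2.0 + 0.5 * x + simulate_ar1(35, 0.2, 1.0, rng)
print(prewhitened_trend(y, x, alpha=0.2))
```

Setting `alpha=0` in `prewhitened_trend` reduces it to an ordinary least-squares fit on the differenced sample, which corresponds to the white-noise assumption mentioned above.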

Application to global runoff data
Here we discuss the choice of the parameters x(t) and α, and the extent to which global discharges satisfy the assumptions behind the TOD method.
While assumed to be known, the temporal pattern of change x(t) is commonly evaluated from simulations. In order to base our study on a very simple temporal pattern that is not model-dependent, we used only linear trends (i.e. x(t) = t). Note that the use of a linear trend instead of a potentially more complex smooth temporal pattern may be suboptimal. However, over short periods like the ones investigated here for observations (35 or 45 yr), the non-linearity of the change is probably not the dominant feature. In addition to being very simple, this choice is consistent with several previous studies dealing with potential changes in global hydrology (e.g. Labat et al., 2004; Gedney et al., 2006; Dai et al., 2009; Alkama et al., 2010, 2011).
The choice of α and the discussion of the accuracy of the red noise assumption are based here on the analysis of pre-industrial control simulations. As observations are presumably influenced by external forcings, internal variability cannot be inferred directly from observations. Conversely, control simulations where external forcings are constant over time are expected to provide a physically based description of internal variability, and using such simulations is quite common in detection and attribution studies (e.g. Hegerl and Zwiers, 2011). Figure 2a shows the α value as estimated from the time series of global runoff, for each CMIP5 control simulation. It is computed as the correlation between y(t) and y(t − 1). Although some discrepancies appear between different models, all values are between 0.04 and 0.3 with a medium value close to 0.2. These control simulations may also be used to check the accuracy of the red noise assumption. Figure 2b, c and d illustrate the distribution of the P value if the TOD test is applied to different 50 yr segments from all control simulations using α = 0, 0.2 or 0.3, respectively. Such segments could be regarded as independent realizations under the null distribution of the test. If internal variability is properly accounted for (in particular, if the red noise assumption is accurate), the P values of the test applied to these segments should be distributed uniformly between 0 and 1. Figure 2b, c and d suggest that the most suitable choice is α = 0.2 (distribution close to uniform), while α = 0.3 is too conservative and α = 0 is too permissive, leading respectively to fewer and more values under the 5 % threshold than expected. In the following, we primarily discuss the results assuming α = 0.2 or α = 0.3 (which makes the detection more conservative and corresponds to the highest value found in individual models). The results obtained with α = 0 (i.e. white noise) are also shown in some cases in order to provide a lower bound where no memory effect is accounted for. Note that the red noise assumption with α = 0.2 also seems consistent with the autocorrelation function of streamflow, as simulated in control runs (not shown).
Finally, an important feature of the method with respect to our study is that it performs a multivariate diagnosis; i.e. it provides one single statistical diagnosis based on regional (continental-scale) streamflow. In particular, a change that generates increases or decreases in runoff depending on the region would be captured by this method. The TOD method, as implemented here, may then be regarded as a strategy for testing trend significance that allows the change to be spatially non-uniform and takes into account a non-white internal variability. In particular, it differs from testing the significance of each regional trend individually, as a single test is performed for all regions simultaneously here.
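The two diagnostics described above can be sketched as follows: α estimated as the correlation between y(t) and y(t − 1), and the share of control-run segments whose P value falls under the 5 % threshold. The function names are hypothetical, and the P values themselves are assumed to come from a separate test that is not implemented here.

```python
import numpy as np

def lag1_autocorrelation(y):
    """Correlation between y(t) and y(t-1), used as the estimate of alpha."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[:-1], y[1:])[0, 1]

def split_segments(series, length=50):
    """Cut a control-run series into non-overlapping segments of `length` years."""
    series = np.asarray(series, dtype=float)
    n_segments = series.size // length
    return series[: n_segments * length].reshape(n_segments, length)

def rejection_rate(pvalues, level=0.05):
    """Fraction of P values below `level`; close to `level` if the test is well calibrated."""
    return float(np.mean(np.asarray(pvalues) < level))

control = np.sin(np.arange(500) / 3.0)        # stand-in for a 500-yr control run
print(lag1_autocorrelation(control))
print(split_segments(control).shape)          # (10, 50)
print(rejection_rate([0.01, 0.20, 0.50, 0.04]))  # 0.5
```

A rejection rate well below the nominal level points to a conservative test, and a rate well above it to a permissive one, which is the reading applied to Fig. 2b, c and d.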

Statistical test on observed and reconstructed runoffs
The coverage of the 161 and 687 rivers worldwide is shown in Fig. 1. Some regions (e.g. the western part of Asia, desert regions and the southern part of South America) suffer from a lack of data. Indeed, only 31 % (43 %) of the global land area excluding Antarctica is covered by the 161 (687) river basins, which correspond to about 42 % (60 %) of global land discharge.
We first applied the TOD method to observed runoff over the 1958-1992 period, based on 8 zones in which observed streamflows at 161 downstream gauging stations are merged. Results are shown in Fig. 3a in terms of P value, for three values of the α coefficient. The P value of year 1980, for instance, is obtained by applying the test to the data up to 1980, i.e. the 1958-1980 period. This allows us to analyze the time evolution of the P value. The P value shown in 1992 provides the result of the statistical test applied to the full period of interest, 1958-1992. As might have been expected, larger year-to-year variations are observed at the beginning of the period compared to the end, as one single year has a stronger relative impact on the P value (the size of the sample being smaller). This paper also investigates the date at which detection occurs. A precise definition of this date is then required. In the following, we consider that detection occurs on year "t0" if the P value of the statistical test (TOD method) remains below the 5 % threshold after t0 (i.e. for all t > t0), while it was higher than 5 % the year before (i.e. t0 − 1). Note that using such a definition, the P value might have fallen below 5 % at some date t1 < t0 (but the significance of the change has then vanished at some point between t1 and t0).
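The definition of the detection date can be written as a short function. This is a sketch of the rule stated above, with a hypothetical function name. It takes the P values as given and returns `None` when the series does not end below the threshold.

```python
def detection_year(years, pvalues, threshold=0.05):
    """First year t0 from which the P value stays below `threshold` to the end.

    Earlier excursions below the threshold that are later reversed are ignored,
    as in the definition used in the text.
    """
    detected = None
    for year, p in zip(years, pvalues):
        if p < threshold:
            if detected is None:
                detected = year   # candidate start of the final significant run
        else:
            detected = None       # significance vanished: discard the candidate
    return detected

years = [2000, 2001, 2002, 2003, 2004, 2005]
pvalues = [0.20, 0.03, 0.08, 0.04, 0.02, 0.01]
print(detection_year(years, pvalues))  # 2003: the dip in 2001 does not persist
```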
Figure 3a (left) shows that the P value remains higher than the significance threshold, 0.05, over the 1958-1992 period. This reveals no significant change in observed streamflow until 1992. This result is very robust here, as it is obtained even under the white-noise assumption (i.e. α = 0, which is very unlikely).
However, one can wonder what these results would have been over a more recent period. TOD is therefore applied over the same 161 river basins for the whole 1958-2004 period. As mentioned above, this extension requires the use of some reconstructed data over the 1993-2004 period, and cannot be regarded as "observations only". After 1992, Fig. 3a reveals that changes are still not detected for α = 0.2 or α = 0.3, as the P value remains mainly higher than 0.1. We therefore conclude that there is no significant change in observed global runoff at the 161 observed gauging stations from 1958 to 2004. Note that the P value for α = 0 becomes lower than 0.05 but, as discussed before, this does not allow us to reasonably claim that a change is detected. The distribution of the relative anomaly in regional runoff (trend over the whole period compared to the mean runoff), shown in Fig. 3a (right), reinforces this conclusion. Trends are rather small compared to the mean streamflow, except over Africa, where the relative anomaly reaches −30 %. This result is confirmed by applying the TOD test over each individual region: a significant change in runoff is detected only over Africa since 1980 (Fig. 4), which is consistent with previously published results (e.g. Alkama et al., 2011) that point to a large decrease in precipitation and runoff over the second half of the 20th century. Figure 4 is discussed further in Sect. 3.3. In the same way, TOD is applied over the 687 river basins and over the same period. Figure 3b shows a significant change in reconstructed runoff since 2000 at the 95 % significance level for α = 0 and, to a lesser extent, for α = 0.2. For α = 0.2, the P value remains below but close to the significance level of 5 %. For α = 0.3, no changes are found. Here, we conclude that a change is detected, because detection does occur with a medium value of α. This result is not very robust, however, as detection no longer holds with a more conservative choice of α. Taking into account new rivers (687 rather than 161) and/or reconstructed rather than observed streamflow then seems to impact the results. However, the distribution and intensity of the relative discharge anomaly are not notably affected (Fig. 3b). Africa is still the only region that exhibits a significant (negative) runoff trend.
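The relative anomaly discussed above (trend over the whole period compared to the mean runoff) can be computed as in the following sketch. It assumes an ordinary least-squares linear fit and takes the total change as the slope times the span between the first and last year. Both choices are assumptions made for illustration.

```python
import numpy as np

def relative_trend_percent(years, flow):
    """Total linear change over the period, as a percentage of the mean flow."""
    years = np.asarray(years, dtype=float)
    flow = np.asarray(flow, dtype=float)
    slope = np.polyfit(years, flow, 1)[0]          # least-squares linear trend
    total_change = slope * (years[-1] - years[0])  # change over the whole period
    return 100.0 * total_change / flow.mean()

# Synthetic series losing 1 unit per year from an initial value of 100.
years = np.arange(1958, 1993)
flow = 100.0 - (years - 1958)
print(relative_trend_percent(years, flow))  # about -41 %
```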

Evolution of observed and simulated runoff
The evaluation of the CMIP5 simulated runoff was performed over the 8 zones and at the global scale corresponding to the 161 river basins. Figure 5 shows the temporal evolution of the yearly mean runoff (mm d⁻¹) from 1958 to 2100 for each of the 14 CMIP5 models. The temporal evolution of the observed runoff from 1958 to 1992 corresponding to the same zones is also shown. There are two outlying models, BCC and GISS, which show a large underestimation of both the global and the regional runoffs (Fig. 6) as well as a low variability (Fig. 7). At the global scale, all other CMIP5 models seem to simulate the mean state of runoff reasonably well (simulated runoff = observed runoff ±25 %). They generally underestimate global runoff slightly, except for the MIROC model, which overestimates global runoff by about 15 %. The runoff simulated over South America is underestimated by all models. The simulated runoff is also underestimated by the BCC, GISS, MRI and INM models over Africa. In contrast, an overestimation of more than 50 % is shown by the NorESM, MIROC, IPSL, CSIRO, GFDL, CCSM, CanESM and FGOALS models. Over this continent, the runoff simulated by both CNRM-CERFACS and MPI is closer to the observations. Over all other regions, the error made by the models (except of course BCC and GISS) does not exceed 50 %, except for FGOALS over South Europe and Central America, and MRI over Central America. The standard deviations are also well simulated (Fig. 7). Indeed, the error does not exceed 50 % except over Africa (North America), where only CNRM, MPI, FGOALS, CanESM and GFDL (GFDL and INM) are reasonable. We can also note the large error of CSIRO (MPI) over South America (Central America).
Fig. 5. Global and regional time series of the simulated (colours) and observed (black) annual runoff from 1958 to 2100. The median of the fourteen models is given by the thick red line.
Over the same period 1958-1992, observations show a large 0.05 mm yr⁻² positive (negative) runoff trend over South America (Africa), only reproduced by MRI and MIROC (FGOALS and MIROC; Fig. 8). The remaining models exhibit small or opposite trends. Over all regions, the models do not show any consensus with respect to the 1958-1992 trend. In particular, the sign of the trend depends on the model, and some models are close to the observed trend while others are not. Many conclusions drawn from the previous CMIP3 exercise (e.g. Milly et al., 2005; Nohara et al., 2006) are consistent with the present study in terms of the comparison between simulated and observed runoff means, inter-annual variability and trends. Despite the different biases existing in the different model simulations, no bias correction methods are applied in this study.
Over the 21st century, all models show a positive global runoff trend except the INM model, in which the simulated global runoff decreases. At the regional scale, all models are in agreement and show a positive trend over northern Asia, Scandinavia, North America and South Asia. In contrast, the models simulate a negative trend over South Europe, except the CNRM-CERFACS and IAP models, in which the trend is positive. For the other regions (South America and Africa), the models disagree. For example, over South America, a negative trend is simulated by CCCMA and CSIRO, while MIROC, NCC and IPSL show a positive trend. Over Africa, the simulated runoff increases in CNRM-CERFACS, MIROC, MPMIP, NCAR, CCCMA, IPSL and NCC, but decreases in CSIRO. Over Central America, models show no clear trend except GFDL (positive trend) and NCC and IAP (negative trend).

Statistical test on simulated runoff
In Fig. 9, the TOD test is applied to each CMIP5 model over the 161 global river basins, still merged into 8 zones, and over the whole 1958-2100 period. In order to highlight central behaviour, the median of these 14 P values is computed each year, and its time evolution is illustrated. The P values are also shown for the observed and the reconstructed data previously presented. Note that the P values of observed and simulated runoff are calculated over the 161 river basins whereas reconstructed data are computed over the 687 river basins. The P values obtained from the CMIP5 runoff calculated over the 687 rivers are very similar to those computed over the 161 rivers. The results are only shown for α = 0.2 (medium value). We define the date at which detection occurs as the first year for which the P value remains lower than the 0.05 threshold up to 2100. The two models that simulate low runoff variability, BCC and GISS, are the first to detect a significant runoff change (in 2002 and 2005, respectively). The INM model, which simulates no significant global trend, is the last to detect a significant change, in 2060. All other models detect a significant change between 2016 and 2040. This means that changes in runoff, as simulated by current climate models, are expected to become significant in the coming decades. As a consequence, the result previously obtained on observed runoff (161 river basins) appears to be very consistent with climate model projections.
The result obtained on reconstructed runoff (687 river basins) seems less consistent, as a change was found from 2000 onwards. In particular, Figure 9 shows that the P value computed from reconstructed runoff is on the border (if not outside) of the set of climate model projections. This feature, together with the substantial difference between the results obtained on observed and reconstructed data, may call into question the quality of reconstructions, and/or the accuracy of climate model projections.
The results from the same analysis applied over individual regions are shown in Fig. 4. Similarly to what was found at the global scale, the P values are very scattered over the 20th century, and tend to become closer at the end of the 21st century. This analysis suggests that detection occurs earlier in northern high-latitude regions, consistent with a higher signal-to-noise ratio. This is particularly pronounced over North Asia. Virtually all models simulate a change over these regions. Results are much more contrasted over other regions, as some models do not simulate any change in the regional discharge. This is particularly clear over Central America and South Asia, as the median P value does not fall below the 5 % threshold even in 2100. This suggests that even at the end of the 21st century, and under intensive emissions, no clear runoff change appears over some continental-scale regions.

Summary and conclusions
In this work, the TOD statistical test (Ribes et al., 2010) is used to evaluate the possible changes in recent and future (RCP 8.5 conditions) runoff, based on fourteen CMIP5 experiments and streamflow data from Dai et al. (2009). This evaluation is made over 8 zones, merging the world's 161 largest rivers. Our analysis suggests some answers to the three issues raised in the Introduction.

How does global observed and reconstructed streamflow change over time?
No significant runoff change is found in the observations over the entire set of 161 rivers from 1958 to 1992. Extension to 2004, using reconstructed streamflows over the same catchment areas, does not lead to a different conclusion. This confirms previous results by Dai et al. (2009) and Alkama et al. (2011). In contrast, reconstructed data over 687 rivers show a significant change at the 95 % confidence level over the 1958-2004 period, at least with a medium assumption regarding the persistence of internal variability. This change is not robust to a more conservative choice regarding internal variability. This result seems rather contradictory to the conclusions of Dai et al. (2009), who found no significant trend in global river discharge based on the same data. This discrepancy most likely comes from differences in the statistical method used. While Dai et al. (2009) were only looking at the global mean time series, our diagnosis is based on continental-scale discharge, and the detected change can be explained by opposite changes over different continents that tend to compensate each other and result in little change in the global mean. Taken as a whole, these results suggest that changes in global runoff are still unclear. Indeed, positive detection is only obtained when considering a dataset where a substantial amount of data comes from reconstruction. It is not robust to a narrowing of the spatial domain, nor to a small change in the description of the internal variability (i.e. using α = 0.3 instead of α = 0.2, for instance to account for an important slow process such as groundwater). With the use of reconstructions, additional questions arise with respect to the accuracy of the reconstruction, which depends on the quality of the atmospheric forcing used, the capabilities of the LSM, the relevance of the statistical correction applied, and others. We finally conclude that changes in global discharge cannot be robustly identified from observations over the recent decades.

Are simulated streamflows reasonably consistent with observations?
Except for BCC and GISS, which show large underestimations of global runoff, the other CMIP5 simulations perform reasonably well. However, regional biases are far from negligible, as the model bias can exceed 50 % of the mean observed runoff over some regions. These biases are comparable to those found in the previous CMIP3 exercise (Nohara et al., 2006; Milly et al., 2005).
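For clarity, the relative bias referred to above can be written as 100 × (simulated − observed) / observed. The following minimal sketch applies that definition to made-up numbers; the region names and runoff values are purely hypothetical and are not taken from this study.

```python
def relative_bias_percent(simulated: float, observed: float) -> float:
    """Model bias expressed as a percentage of the mean observed runoff."""
    if observed <= 0.0:
        raise ValueError("observed mean runoff must be positive")
    return 100.0 * (simulated - observed) / observed


# Hypothetical regional mean runoff (simulated, observed), arbitrary units.
regional_means = {
    "region A": (130.0, 100.0),
    "region B": (40.0, 100.0),
    "region C": (95.0, 100.0),
}

# Flag regions where the absolute bias exceeds 50 % of the observed mean.
flagged = [
    name
    for name, (simulated, observed) in regional_means.items()
    if abs(relative_bias_percent(simulated, observed)) > 50.0
]
print(flagged)
```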

How will streamflow change in the future?
The majority of CMIP5 models under RCP 8.5 conditions simulate an increase in runoff over South Asia, northern Europe, northern Asia and North America, and a decrease over southern Europe. However, no significant change appears over Central America, and no consensus can be found over South America and Africa. These features are similar to those already shown by Milly et al. (2005), Nohara et al. (2006) and IPCC (2007). At the global scale, all models show an intensification of the hydrological cycle over the 21st century: global continental precipitation, evaporation and runoff all tend to increase. Change in global runoff becomes significant between 2016 and 2040 for all but three models. This suggests that our finding of no clear change in the observations is rather consistent with current projections for the next century.
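One possible way to read "becomes significant" is as the first year from which the test P value stays below the 5 % threshold. The sketch below implements that reading only; the persistence criterion, the function name `first_persistent_detection` and the P value series are assumptions for illustration and may differ from the definition used in this study.

```python
from typing import Optional, Sequence


def first_persistent_detection(
    years: Sequence[int],
    p_values: Sequence[float],
    threshold: float = 0.05,
) -> Optional[int]:
    """Return the first year from which all later P values stay below threshold.

    Returns None if the series never settles below the threshold.
    """
    if len(years) != len(p_values):
        raise ValueError("years and p_values must have the same length")
    detection = None
    for year, p_value in zip(years, p_values):
        if p_value < threshold:
            if detection is None:
                detection = year  # candidate start of a persistent detection
        else:
            detection = None  # significance was lost, so reset the candidate
    return detection


# Hypothetical P value series for one model (not results from the paper).
years = [2010, 2015, 2020, 2025, 2030, 2035, 2040]
p_values = [0.40, 0.04, 0.12, 0.03, 0.02, 0.01, 0.01]

detection_year = first_persistent_detection(years, p_values)
print(detection_year)
```

With this criterion, a single early excursion below the threshold (here in 2015) is not counted as detection, because the P value rises above 5 % again afterwards.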

Fig. 1. Coverage of the 161 (top) and 687 (bottom) river basins over the 8 selected zones: 1 = South America, 2 = Africa, 3 = South Asia including Oceania, 4 = North Asia corresponding to Siberia, 5 = South Europe, 6 = North Europe including Arctic basins, 7 = Central America and 8 = North America. The circles represent the in situ gauging stations for each river accounted for in this study.

Fig. 2. (a) Estimated alpha (α) based on the global runoff time series from each CMIP5 model (piControl simulations); (b), (c) and (d) show the distribution of the P value when the TOD test is applied to different 50 yr segments of all CMIP5 control runs using α = 0, 0.2 and 0.3, respectively.

Fig. 3. (a) (left) Temporal evolution of the observed (1968–1992) and reconstructed (1992–2004) runoff P value over 161 river basins merged over 8 zones. The solid horizontal black line represents the 5 % threshold level. (right) Distribution of the runoff relative anomalies (ΔQ/Q), in percent, over the 161 river basins. (b) Same as (a), but using reconstructed data over 687 river basins rather than observations over the whole 1968–2004 period.

Fig. 4. Temporal evolution of the P value of observed (black) and simulated (light blue) runoff over the 8 regions merging 161 river basins, using α = 0.2. The median of the 14 simulations is shown in blue.