Historical and future changes in global flood magnitude – evidence from a model–observation investigation

To improve the understanding of trends in extreme flows related to flood events at the global scale, historical and future changes of annual maxima of 7 d streamflow are investigated, using a comprehensive streamflow archive and six global hydrological models. The models’ capacity to characterise trends in annual maxima of 7 d streamflow at the continental and global scale is evaluated across 3666 river gauge locations over the period from 1971 to 2005, focusing on four aspects of trends: (i) mean, (ii) standard deviation, (iii) percentage of locations showing significant trends and (iv) spatial pattern. Compared to observed trends, simulated trends driven by observed climate forcing generally have a higher mean, lower spread and a similar percentage of locations showing significant trends. Models show a low to moderate capacity to simulate spatial patterns of historical trends, with approximately only from 12 % to 25 % of the spatial variance of observed trends across all gauge stations accounted for by the simulations. Interestingly, there are statistically significant differences between trends simulated by global hydrological models (GHMs) forced with observational climate and by those forced by bias-corrected climate model output during the historical period, suggesting the important role of the stochastic natural (decadal, inter-annual) climate variability. Significant differences were found in simulated flood trends when averaged only at gauged locations compared to those averaged across all simulated grid cells, highlighting the potential for bias toward well-observed regions in our understanding of changes in floods. Future climate projections (simulated under the RCP2.6 and RCP6.0 greenhouse gas concentration scenarios) suggest a potentially high level of change in individual regions, with up to 35 % of cells showing a statistically significant trend (increase or dePublished by Copernicus Publications on behalf of the European Geosciences Union. 1544 H. X. Do et al.: Historical and future changes in global flood magnitude crease; at 10 % significance level) and greater changes indicated for the higher concentration pathway. Importantly, the observed streamflow database under-samples the percentage of locations consistently projected with increased flood hazards under the RCP6.0 greenhouse gas concentration scenario by more than an order of magnitude (0.9 % compared to 11.7 %). This finding indicates a highly uncertain future for both flood-prone communities and decision makers in the context of climate change.

To assess the model structural uncertainty across GHMs, trends in streamflow extremes simulated under observational 139 atmospheric forcing, available through the Global Soil Wetness Project Phase 3 (GSWP3) reanalysis (Kim, 2017), were 140 compared to observed trends. The influence of the high uncertainty in climate models (Kumar et al., 2013;Kiktev et al., 141 2003) on streamflow simulations was assessed by comparing observed trends and trends simulated when using atmospheric forcing from four General Circulation Models (GCMs) for the historical period ('hindcast' simulations; 143 hereafter referred to GCMHIND atmospheric forcing). These GCMs were bias corrected but their simulations have 144 different sub-monthly, inter-annual and decadal variability and thus the hindcast simulations reflect both GHM and GCM 145 uncertainty. To quantify the implication of model uncertainty for future projections of flood hazard, trends simulated 146 under projected climate change by the end of this century (using the same four GCMs) were also assessed for two 147 greenhouse gas concentration scenario RCP2.6 (hereafter referred to GCMRCP2.6 atmospheric forcing) and RCP6.0 148 (hereafter referred to GCMRCP6.0 atmospheric forcing). As a result, four simulation settings were used in this study, 149 denoted by the atmospheric forcing; an overview is given in Table 1. These settings comprise two historical runs (GSWP3 150 and GCMHIND runs), and two future runs (GCMRCP2.6 and GCMRCP6.0), collectively amounting to a total of 69 151 simulations (see Table S3 in supplementary with full list of simulations). 152 For GSWP3 simulations, a preliminary analysis (see section 4 of supplementary material) shows that both 'naturalised 153 runs' (i.e. human water management not taken into account) and 'human impact runs' (i.e. human water management 154 inputs were used) exhibit similar characteristic of trends in MAX7 index. Some potential reasons for negligible impacts of 155 human water management are the spatial distribution of stream gauges (may be biased toward regions with insignificant 156 changes in water management during the 1971-2005 period), or the inclusion of small catchments (more that 3,000 157 catchments with reported area less than 9,000 km 2 ), thus floods are more sensitive to changes in climate forcing relative to 158 the accumulated basin-wide influence of human impacts. Naturalised runs were therefore chosen, since this setting is 159 available for more GHMs (six) when compared to the human impact setting (four). Although significant efforts were 160 made by ISIMIP to keep the setting across simulations as consistent as possible, there were some differences in model 161 versions and input data (e.g., WaterGAP2.2 (ISIMIP2a) was used in ISIMIP2a while WaterGAP2.2c was used in 162 ISIMIP2b; ORCHIDEE (Guimberteau et al., 2014)  To enable an observation-model comparison, simulated discharge needs to be extracted from gridded model output. 175 Large-scale hydrological models, however, generally do not simulate discharge accurately over small-to-medium size 176 catchments due to the coarse resolution of river network datasets in their routing schemes (Hunger and Döll, 2008). To 177 address this limitation, previous GHMs evaluations usually selected large catchments (a threshold of 9,000 km 2 was 178 adopted, approximating the size of a one-degree longitude/latitude grid cell) and routed discharge (units: m 3 /s) at the 179 outlet of the catchment was used as simulated streamflow for a specific catchment (Zhao et  catchments (e.g. area less than 9,000 km 2 ), the un-routed runoff simulation (units: mm/day) was extracted while observed discharge was converted to runoff using catchment area prior to comparison (Gudmundsson et al., 2012b;Beck et al., 183 2017). To increase the sample size for the model-observation comparison (the first objective), the present study used both 184 daily (i) un-routed runoff for small catchments and (ii) routed discharge simulations for large ones, and thus two 185 extraction procedures were adopted. A summary of these extraction procedures is provided below while detailed technical 186 descriptions are provided in section 2 of supplementary material. 187  For catchments with area from 0 to 9,000 km 2 : un-routed runoff (mm/day) was extracted and then converted into 188 discharge (m 3 /s) by multiplying averaged runoff with catchment area reported in station metadata. Specifically, 189 catchment boundaries were superimposed on the GHM grid to obtain the weighted-area tables, which were then 190 used to derive averaged runoff from the un-routed runoff simulation. To avoid double counting runoff from the 191 same grid points, runoff for catchments that share similar weighted-area tables (i.e. similar simulated streamflow 192 would be extracted -see supplementary section 2 for detail description) was averaged (using catchment areas as 193 weights) and a single 'averaged time series' was used in place of the runoff from the component catchments. 194  For catchments with area greater than 9,000 km 2 : the 'discharge output' approach (Zhao et al., 2017) was 195 adopted to extract routed discharge (m 3 /s) from the GHM cell corresponding to the outlet of each catchment. 196 To ensure sufficient data is available for historical trend analysis, only GSIM stations with at least 30 years of data 197 available during the 1971-2005 period were considered (each year having at least 335 days of available records, implying 198 that annual maximum of a specific year is identified only when more than 90% of daily record is available). These 199 relatively strict selection criteria also enable a comparison between this study and preceding observation-based were used to extract simulated streamflow for small catchments, stations were further filtered using two criteria: (i) 202 availability of reported catchment area, and (ii) catchment boundary was accompanied with a "high" or "medium" quality 203 flag (i.e. the discrepancy between reported and estimated catchment area is less than 10%). 204 A total of 4,595 stations satisfied the quality selection criteria, of which large catchments (i.e. area greater than 9,000 205 km 2 ) where no suitable grid cell could be identified were further removed (11 catchments). For cases of two or more small 206 catchments (i.e. area less than or equal to 9,000km 2 ) having similar weighted-area tables, the 'averaged time series' (using 207 catchment areas as weights) was calculated. A total number of 1,542 time series fell in this category and were aggregated 208 into 624 'averaged time series'. Figure 2 shows the spatial distribution of the final dataset for model-observation

Detecting trends in annual maximum streamflow 218
For each streamflow dataset, daily discharge was smoothed to 7-day averages to reduce variability in simulated 219 streamflow, which can arise from the coarse routing parameters of GHMs (Dankers et al., 2014). The annual maximum 220 time series of 7-day averaged discharge (labelled as the MAX7 index in the GSIM archive) was then derived to represent 221 peak flow events. For gridded datasets, the 'centre averaged approach' (e.g. averaged streamflow of Jan 7th is the mean 222 value of Jan 4 -10th) was used (the common setting of the CDO software, freely available at 223 https://code.mpimet.mpg.de/projects/cdo), and the MAX7 timeseries was therefore derived for each GSIM station using 224 this same approach. As a result, the derived value of the MAX7 index is slightly different to the value available in the 225 online version of GSIM , which applied a 'backward-moving average' technique (e.g. averaged streamflow of Jan 7th is 226 the mean value of Jan 1 -7 th ). Our preliminary analysis (not shown), however, indicated that this difference did not lead 227 to substantial changes in the key findings (i.e., similar spatial composition between increasing and decreasing trends). 228 The magnitude of trends in the MAX7 index at a specific catchment or grid cell was quantified using the normalised significance of the local trend was assessed using a Mann-Kendall test at the 10% two-sided significance level (Wilks, 231 2011). The null hypothesis (no trend) is rejected if the two-sided p-value of the test statistic (Kendall's τ) is lower than 232 0.1, while the direction of the trend (i.e. increasing or decreasing) was determined using the sign of τ. 233

Statistical techniques 234
To explore GHMs' capacity to simulate observed trends and the implication of model uncertainty to projected trends, 235 trends in streamflow extremes obtained from GSIM (observed trends) and ISIMIP simulations (simulated trends) are 236 analysed. The observed trends were available for 3,666 observation locations. Simulated trends were available for all 237 59,033 GHM grid cells (estimated from routed discharge of each grid cell; Antarctica and Greenland were removed). To 238 enable a model-observation comparison, we also extract a subset of simulated trends over the 3,666 observation locations 239 (described in section 2.2). 240

A hypothesis-test approach for comparison of trend characteristics 241
A range of hypothesis tests (summarised in Table 2; GSWP3 simulations were used to assess GHM uncertainty while 242 GCMHIND simulations were used to assess the combined GCM-GHM uncertainty) was applied to address the first two 243 objectives, which require comparing trend characteristics exhibited from different streamflow datasets. Four gauge-or cell-based significance calculated using the Mann-Kendall test at the 10% significance level. To assess 253 whether the percentage of significant (increasing/decreasing) trends exhibited from a specific streamflow dataset 254 is produced by random chance, a field significance test (Do et al., 2017) was adopted (described in Table 2). 255 -Trend spatial pattern: The spatial distribution of trends in streamflow extremes over a spatial domain. Pearson's 256 correlation (r statistic) (Galton, 1886;Kiktev et al., 2003) between trends of MAX7 index obtained from two 257 datasets was used as a measure of similarity in the trend spatial structure. The hypothesis test (pattern similarity 258 test) was adopted to assess whether: (i) the correlation between simulated trends introduced by GHMs and observed trends is significantly higher than zero; and (ii) the correlation between trends simulated under hindcast 260 atmospheric forcing and observed trends is significantly lower than that between trends simulated under 261 observational atmospheric forcing and observed trends. derive a null-hypothesis distribution of the change that occurred due to random chance. The null hypothesis is rejected at 5% one-sided significance level when the true percentage falls on the right-hand side of the 95 th percentile of the resampled distributions.
Hypothesis 4: The correlation between trends obtained from two streamflow datasets was not significantly higher than '0' (i.e. zero pattern similarity).
'Zero pattern similarity' was compared to the probability distribution function (PDF) of pairwise correlation between simulated and observed trends, drawn from a bootstrap procedure similar to that proposed by Kiktev et al. (2003). The null hypothesis is rejected at 5% one-sided significance level when zero correlation falls on the left-hand side of the 5th percentile of the resampled distributions.
Hypothesis 5: The correlation between GCMHIND simulated trends and observed trends was not significantly lower than the correlation between GSWP3 simulated trends and observed trends The actual pairwise correlation between GCMHIND simulated trends and observed trends (denoted by ) was compared to the bootstrapped PDF of correlation exhibited from GSWP3 simulated trends (denoted by * ). If falls on the left-hand side of the 5 th percentile * , there is evidence to reject the null-hypothesis at the 5% one-sided significance level. Field significance test similar to that presented in Hypothesis 3 but trends obtained from all grid cells were the subject of the assessment.

2.4.2
Estimating uncertainty of trend characteristics across ensemble members 265 The third and final objective, which focused on the implications of GCM-GHM uncertainty on projected changes in 266 flood hazard, was addressed by quantifying the spread of trend characteristics (i.e. trend mean, trend standard 267 deviation, and percentage of significant trends) exhibited from routed discharge projections under two representative 268 concentration pathways. 269 The spatial uncertainty of projected trends (GCMRCP2.6 and GCMRCP6.0) was also quantified by calculating intra-270 /inter-model correlation of the trend patterns across all ensemble members available under the two projections. Intra-271 model correlation represents spatial uncertainty introduced by the GCM and was calculated from simulated trends 272 introduced by the same GHM (using different simulated atmospheric forcing). Inter-model correlation represents the 273 combined GCM-GHM spatial uncertainty, and was calculated for each pair of simulated trends that were: (i) 274 introduced by the different GHMs; and (ii) forced with different projected atmospheric forcing. 275 To assess the robustness of GHMs in projecting changes in flood hazard, each grid-cell available in the discharge 276 simulation grid was then categorised into one of the five 'flood-risk' (here "flood-risk" level is defined as the number 277 of ensemble members projecting significant increasing trends) groups based on the number of 278 GCMRCP2.6/GCMRCP6.0 simulation members projecting a significant increasing trend (Group 1: no members, 279 Group 2: from 1 to 5 members, Group 3: from 6 to 10 members, Group 4: from 11 to 15 members and Group 5: from 280 16 to 18 members). 281 Finally, to assess whether locations projected with an increasing trend by the majority simulations are adequately 282 monitored, each GSIM gauge was allocated into one of these five groups based on the gauge's geographical 283 coordinates. The allocation of gauges into these groups was then analysed to determine whether the most 284 comprehensive global database of daily streamflow records to-date was evenly distributed across the five 'flood risk 285 regions'. An inadequately coverage of stream-gauge networks over high risk regions indicate potentially high 286 vulnerability to future changes in flood hazards, as insufficient data is available to inform decision makers. Europe (including the UK). Note that the observation locations are not evenly distributed (86% in North America and 296 Europe), and thus the confidence of this assessment varies substantially across continents. 297 The multi-model average of GSWP3 simulated trends (trends simulated under observational atmospheric forcing; 298 middle panels of Figure 3) has generally good capacity to reproduce spatial patterns of observed trends. The multi-299 model average of GCMHIND simulated trends (trends simulated under hindcast atmospheric forcing; lower panels of 300 Figure 3), however, could not reproduce some spatial agglomerations of trends in streamflow maxima (e.g. the 301 decreasing trends in south-eastern Australia, increasing trends over north-eastern Europe). This feature indicates the 302 inconsistent climate variability between GCMs and the real world, suggesting GCM climate forcing cannot account 303 for observed trends at sub-continental scale. In addition, GCMs uncertainty can potentially contribute to this 304 inconsitency. Interestingly, the multi-model average of both GSWP3 and GCMHIND simulations generally exhibits a 305 lower magnitude of changes (i.e. closer to 'zero change') compared to the observed trends. This feature is more 306 prominent in GCMHIND (21 simulations available) compared to GSWP3 (six simulations available), and can be 307 explained by two possibilities. The first possible explanation is the nature of averaging, which tends to smooth out 308 variability in trend magnitude across ensemble members, leading to a relatively 'close to zero' change across the 309 globe (given that each GCMs has stochastic decadal climate variability, so that averaging results forced by GCMs 310 tends to cancel trends). An alternative explanation is that individual simulations also exhibit a lower magnitude of 311 change relative to observation. As Figure 3 is not sufficient to evaluate the latter possibility, a more detailed 312 comparative analysis between observed trends and individual simulated trends using both historical climate forcings 313 (via GSWP3) and GCM hindcasts was conducted. Specifically, four characteristics of trends in extreme flows (i.e.  Table 3 shows the results of the global model-observation comparison using GSWP3 simulated trends across the six 328 GHMs. Compared to observed trends, most simulated trends have a significantly higher global trend mean at the 329 observed locations and lower trend standard deviation. The percentage of locations showing significant trends varies 330 substantially across simulations, but the values were not statistically significant. All GHMs demonstrate low-to-331 moderate capacity in simulating the spatial pattern of trends (spatial correlation coefficients range from 0.35 to 0.50, 332 indicating that GSWP3 simulated trends account for between 12%-25% of the cross-location variability in the 333 observed trend signal). There is, however, a notable difference in terms of the overall sign of trends simulated by each 334 GHM. This feature indicates that using different GHMs can lead to different interpretations about the overall change 335 in flood hazard at the global scale, despite having a common boundary forcing. Therefore, the 'closer to zero' trends 336 of ensemble averages (illustrated in Figure 3) likely reflects the implication of averaging rather than a systematic bias 337 of GHMs toward a low magnitude of change. As an implication, ensemble averages even though useful, should not be 338 used as a sole ground to infer changes in floods, as it may undermine the actual magnitude of simulated trends. As a 339 result, the following analyses will report the full range (and mean) of each trend characteristic estimated across all 340 ensemble members to communicate the uncertainty underlying the results. 341   significantly lower simulated trend standard deviation can be partially attributable to the coarse resolution of GHMs' 356 atmospheric/land surface inputs, which may not sufficiently reflect the variation of hydrological processes across 357 small-to-medium size catchments. 358

364
Among 21 GCMHIND simulations, the 'zero similarity' hypothesis (hypothesis 5) was rejected over 13 simulations, 365 indicating that GCM-GHM ensemble members possess some capacity to simulate the spatial structure of observed 366 trends in streamflow extremes. The correlation between GCMHIND simulated trends and GSIM observed trends, 367 however, is significantly lower than that exhibited from GSWP3 simulated trends across all GHMs (reported at Table  368 3). The results of the similarity assessment are illustrated for a single GHM (H08; as the results were similar for other 369 GHMs) in Figure 4, where the correlation between observed trends and GSWP3 simulated trends is significantly 370 different from zero. In contrast, the correlation between observed trends and each of the simulated trends under 371 hindcast atmospheric forcing (GCMHIND simulations) is much lower, with two of the four not being statistically 372 higher than zero. These results confirm the substantial influence of atmospheric forcing on the simulated trend pattern 373 relative to GHMs structure. 374 To further quantify changes at the regional scale, a model-observation comparison (identical to that at the global 382 scale) was conducted over six continents and the results are summarised in Table 5 (multi-model averages are  383 shown). The trend mean exhibited from GSIM ranges from -10.7% (Oceania) to 2.4% change per decade (Europe) 384 while trend standard deviation ranges from 8.3% (Europe) to 15.8% change per decade (Oceania). The percentage of 385 significant increasing (decreasing) trends exhibited from GSIM ranges from 3.2% to 22.6% (from 6.3% to 29.1%) 386 and the composition of significant trends across the six continents is consistent to a previous investigation (Do et al., 387 2017). The observed percentage of significant trends is found to be above random chance for Europe (increasing 388 flood magnitude) and Australia (decreasing flood magnitude) and this feature is captured quite well by GSWP3 389 simulated trends, with at least half of the simulations confirming field significances detected from GSIM. Trend 390 characteristics simulated by GHMs at continental scale confirms some important findings from global scale assessments, suggesting substantial uncertainty of trends in streamflow extremes introduced by GHMs at the 392 continental scale: 393 -Both GSWP3 and GCMHIND simulations generally exhibit a higher trend mean and lower trend standard 394 deviation compared to the observed trend at the continental scale (see also Section 3.1 of the supplementary). 395 -GCMHIND simulations generally exhibit lower capacity to reproduce trend characteristics relative to 396 GSWP3 simulations due to the combined GHM-GCM uncertainty. 397 For GSWP3 simulations, the spatial correlation is weakest in Asia, as no simulation rejects the null-hypothesis of 398 'zero similarity', while the spatial correlation is strongest in Oceania (mainly southern Australia; correlation of 0.63). 399 Oceania, however, exhibits the highest model-observation discrepancy in trend mean and trend standard deviation, 400 indicating the capacity of a given GHM in terms of the trend spatial structure is not necessarily consistent with its 401 performance in terms of the mean and spread of trends. 402 GCMHIND trends also suggest the opposite composition between percentages of significant trends compared to 403 GSIM trends (e.g. simulated trends suggest more locations showing significant increasing trends while observed 404 trends suggest the opposite). Among six continents, GCMHIND trends exhibited the lowest correlation (-0.14) in 405 Oceania, whereas GSWP3 suggested the strongest correlation in this continent. This assessment further indicates the 406 substantial impact of atmospheric forcing relative to GHM model structure on the simulated trends in high flow 407 events. It is informative to note that this result is expected, as GCMs (although have been bias-corrected) generally 408 have low capacity in reproducing the timing of wet/dry periods or the spatial distribution of climate extremes (Kiktev 409 et al., 2007), and GHMs are likely to inherit these limitations when using GCMs' outputs as climate forcing data. 410 Table 5. Characteristics of trends exhibited from GSIM/GSWP3/GCMHIND streamflow dataset at the continental scale (each observation location of 3,666 sites was allocated into 412 one of the six continents). For simulated trends, only the multi-model average is shown for each region. Trend mean and trend standard deviation are expressed in % change per 413 decade. Values in the parentheses show the number of simulations rejecting the null-hypothesis described in Table 2

Determining the representativeness of observation locations in the GHM simulations 417
To assess the representativeness of observations locations in GHM grid cells, trend characteristics obtained from all 418 simulated grid cells were compared to those estimated from the observation locations (3,666 sites globally). For 419 GSWP3 simulations, the results suggest a significant difference between trend characteristics from all model grid 420 cells compared to those obtained from the observation locations (Table 6;  confirms the importance of uncertainty in atmospheric forcing in driving the spatial structure of the simulated trends, 442 which will be explored further in the next section.   Table 7 shows a relatively low spread of the global trend mean (ranging from -1.3% to 0.8% change per decade; 455 multi-model average of 0.0% change per decade for both GCMRCP2.6 and GCMRCP6.0) and trend standard 456 deviation (ranging from 1.8% to 4.1% change per decade) across ensemble members. LPJmL and ORCHIDEE 457 generally suggest a decreasing trend at the global scale, evident through the negative global mean and more grid cells 458 showing significant decreasing trends. The standard deviation of trends in future simulations is substantially lower 459 than the historical run (reported in Table 6). This feature is potentially due to the capacity of longer time series in 460 capturing the inter-decadal variability of the streamflow regimes, with both dry and wet periods being considered 461 (Hall et al., 2014). Projected trends under the RCP2.6 scenario generally have closer to zero mean and lower standard 462 deviation compared to those introduced by the RCP6.0 scenario, reflecting the nature of an ambitious 'low-end 463 warming' scenario, when anthropogenic climate change reaches its peak at the middle of the 21 st century followed by 464 a generally stable condition. 465 Interestingly, although most models suggest relatively moderate changes in the global trend mean, the composition 466 between percentages of grid cells showing significant trends varies substantially, ranging from 7.5% (7.1%) to 30.1% 467 (35.0%) for significant increasing (decreasing) trends at the 10% level, with RCP6.0 generally exhibits higher values. 468 This finding indicates that inferences of changes focusing on global averages may mask significant regional trends, as 469 there was a substantially high percentage of locations exhibiting significant increasing and decreasing trends 470 exhibited in individual models. 471 Uncertainty in the spatial structure of trends in streamflow extremes is further investigated using both intra-model (to 472 reflect GCM uncertainty) and inter-model correlations (to reflect the combined GCM-GHM uncertainty). A more 473 robust spatial pattern of projected trends under RCP6.0 was found, indicated through generally higher intra-/inter-474 model correlation compared to those exhibited from trends simulated under RCP2.6 across all GHMs. This feature 475 potentially reflects the less contrasted regional climate change of RCP2.6 relative to RCP6.0. The inter-model 476 correlation is consistently lower than intra-model correlation due to the combined uncertainty of both GHMs and 477 GCMs.

485
To quantity the robustness in terms of regions with significant trends in streamflow extremes, the number of 486 simulations showing significant increasing/decreasing trends was counted for each grid cell (value ranging from 0 to 487 18). As shown in Figure 5, the projections under RCP2.6 (top panels) do not suggest many regions with an increasing 488 trend for most ensemble members, but consistently suggest decreasing trends over the majority of Africa, Australia We first categorised each simulation grid-cell into one of the five 'flood-risk' groups. Note that in this analysis, "risk" 518 is defined as the number of simulations projecting a significant increasing trend, rather than the prominent definition 519 of risk as the combination of hazard, exposure and vulnerability (Kron, 2005). In this analysis, RCP6.0 scenario was 520 chosen as it yielded a higher global 'risk' of flood hazard relative to RCP2.6 scenario. 521 Figure 6 presents the percentage of all simulated grid cells (left panel) categorized in each of the five groups, and of 522 GSIM stations located in each group (right panel). As can be seen, 11.7% of grid cells fell into the "high risk" groups 523 (8.9% from Group 4 with 11-15 ensemble members, and 1.8% in Group 5 with 16-18 ensemble members), while 524 68.9% of grid cells fell into the "low risk" groups (22.0% for Group 1 with no ensemble members, and 46.9% for 525 Group 2 with 1-5 ensemble members). Of all GSIM stations, only 0.9% are located in "high risk" grid cells (no 526 station located in Group 5 grid cells) compared to 89.5% of stations located in "low risk" grid cells (35.4% for Group 527 1 and 54.1% for Group 2). The uneven distribution of stream gauges indicates potential difficulties in using 528 observational records to provide an assessment of global or regional changes in flood hazard, which in part arises 529 from data caveats associated with the spatiotemporal coverage and quality of observed gauge records across the 530 globe. This finding further suggests the urgent demand for ongoing efforts to make streamflow observation more  3. Climate variability and climate model uncertainty (i.e., the effect of using different GCMs to simulate the 555 historical climate) significantly reduced the extent to which the GHMs' captured the observed spatial 556 structure of trends. This was evident through significantly lower correlation between observed trends and 557 simulated trends, when GCMs were used for the climate forcing, than when climate observations were used. are sparsely sampled, covered by less than 1% of all available stream-gauges listed in the catalogue of 576 GSIM. Data coverage, as a result, remains the key limitation of this study, which could potentially lead to an 577 erroneous conclusion on the state-of-understanding of historical trends in flood hazard globally. Specifically, 578 substantial changes, although having occurred, might not be captured by available streamflow records. 579 Our findings also show that individual models may provide contradictory signal of changes in flood hazards for a 580 specific region, indicating high uncertainty in model-based inferences of changes in flood hazards. As a result, 581 alternatives for the conventional approach in estimating changes in streamflow extremes at the global and regional 582 scale (i.e. unweighted mean across all grid points) should be investigated. For instance, the spatial weighted averages 583 (e.g. using inverse distance relative to observed locations as weights) could be used to compute global means of 584 changes. Regional analysis using homogenised regions as the basis of reporting spatial domains (Zaherpour et al., The substantial discrepancy of trends simulated by different GHMs, despite having a common forcing boundary, 587 represents another challenges in using GHM ensemble, as there are a wide range of factors that could contribute to 588 these discrepancies. This study provides a (non-exhaustive) list of key differences across participated GHMs 589 (supplementary Section 1) that could individually or collectively lead to different model outputs. Diagnosing the 590 influence of these factors to models' capacity in simulating trends is still under-represented in the literature, and is an 591 important research agenda for future investigations. For instance, the impact of different methods to simulate snow 592 dynamic could be assessed by investigating model performances across catchments where snowmelt plays an 593 (in)significant role in flood generations. 594 Improved performance of GHMs in terms of simulating changes in flood hazard, considering the many factors 595 influencing model capacity, is achievable only through the combined efforts of many communities. The spread of 596 trends in streamflow extremes (trend standard deviation) could be simulated more accurately by finer spatiotemporal 597 resolution GHMs. Such an improvement in GHMs, however, is highly dependent on the quality of input datasets (e.g. be emphasized to ensure runoff is accurately converted into river discharge (Zhao et al., 2017). 612 This study presents a comprehensive investigation of historical and future changes in flood hazard using a hybrid 613 model-observation approach. The results highlighted a substantial difference between trend characteristics simulated 614 by GHMs and that obtained from GSIM archive. Our findings, therefore, suggest more attention should be paid to 615 investigating GHMs performance in the context of historical and future flood hazard, which is important for not only 616 the scientific community but also for stakeholders when using results of GHM simulations (Krysanova et al., 2018). 617 This is particularly important to determine the appropriateness of GHMs in specific investigations, as model 618 performance may vary substantially across different variables (e.g. moderate capacity in simulating spatial structure 619 of trends may be accompanied by a low performance in terms of simulating trend mean). 620 Large-sample evaluations, however, are highly dependent on data availability, which is one of the key barriers to a 621 holistic perspective of changes in floods. In this study, the unevenly distributed GSIM stations, partially due to the 622 constraint in data accessibility, do not provide representative samples at both global and continental scale. such as GRDC or WMO should also push forward the movement of making streamflow data accessible to the 630 research community. More initiatives based on citizen science (Paul et al., 2018) should be adopted, as this is a 631 potential option to crowdsource water data and offset the limitation of traditional observation system. Finally, attention should also be paid to stream gauges maintenance, data housekeeping and data sharing to ensure ongoing 633 flood monitoring is available to the present and future generations. 634

Acknowledgement 635
We thank two anonymous reviewers for their constructive comments that helped to improve the manuscript.