Linear Optimal Runoff Aggregate (LORA): A global gridded synthesis runoff product

10 No synthesized global gridded runoff product, derived from multiple sources, is available despite such a product being useful to meet the needs of many global water initiatives. We apply an optimal weighting approach to merge runoff estimates from hydrological models constrained with observational streamflow records. The weighting method is based on the ability of the models to match observed streamflow data while accounting for error covariance between the participating products. To address the lack of observed streamflow for many regions, a dissimilarity method was applied to transfer the weights of the 15 participating products to the ungauged basins from the closest gauged basins using dissimilarity between basins in physiographic and climatic characteristics as a proxy for distance. We perform out-of-sample tests to examine the success of the dissimilarity approach and we confirm that the weighted product performs better than its 11 constituents products in a range of metrics. Our resulting synthesized global gridded runoff product is available at monthly time scales, and includes time variant uncertainty, for the period 1980 – 2012 on a 0.5 grid. The synthesized global gridded runoff product broadly agrees 20 with published runoff estimates at many river basins, and represents well the seasonal runoff cycle for most of the globe. The new product, called Linear Optimal Runoff Aggregate (LORA), is a valuable synthesis of existing runoff products and will be freely available for download on geonetwork.nci.org.au.


Introduction
Runoff is the horizontal flow of water on land or through soil before it reaches a stream, river, lake, reservoir or other channels. 25 It has been widely used as a metric for droughts (Shukla and Wood, 2008;van Huijgevoort et al., 2013;Bai et al., 2014;Ling et al., 2016) and to understand the effects of climate change on the hydrological cycle (Ukkola et al., 2016;Zhai and Tao, 2017). Characterizing its dynamics and magnitudes is a major research aim of hydrology and hydrometeorology and a critical importance to improve our understanding of the current conditions of the large-scale water cycle and predict its future states.
More accurate estimates also provide additional constraint for climate model evaluation., yet direct measurement of runoff at large scales is simply not possible.
While runoff observations do not exist, direct streamflow or river discharge observations -basin integrated runoff -have been archived in many databases. The most comprehensive international streamflow database is the Global Runoff Data Base (GRDB; www.bafg.de), which consists of daily and monthly quality-controlled streamflow records from more than 9500 5 gauges across the globe. Geospatial Attributes of Gages for Evaluating Streamflow version II (GAGES-II;Falcone et al., 2010) represents another noteworthy streamflow database, consisting of daily quality-controlled streamflow data from over 9000 US gauges.
Hydrological and land surface models are capable of producing gridded runoff estimates for any region across the globe (Sood and Smakhtin, 2015;Bierkens, 2015;Kauffeldt et al., 2016). However, these runoff estimates suffer from uncertainties due to 10 shortcomings in the model structure and parameterization and the meteorological forcing data (Beven, 1989;Beck, 2017a).
There are various ways to use streamflow observations for improving the runoff outputs from these models. The conventional approach consists of model parameter calibration using locally observed streamflow data (see review by Pechlivanidis et al., 2011). Another widely used method is through regionalization; that is, the transfer of knowledge (e.g., calibrated parameters) from gauged basins to ungauged basins (see review by Beck et al., 2016). In contrast, several other studies attempted to correct 15 the runoff outputs directly rather than the model parameters, for example by bias-correcting model runoff outputs based on streamflow observations (Fekete et al., 2002;Ye et al., 2014), or by combining or weighting ensembles of model outputs to obtain improved runoff estimates (e.g., Aires, 2014). There are, however, relatively few continental-and global-scale efforts to improve model estimates using observed streamflow.
A broad array of gridded model-based runoff estimates are freely available, including but not limited to ECMWF's Interim 20 reanalysis (ERA-Interim ; Dee et al., 2011), NASA's Modern Era Retrospective-analysis for Research and Applications (MERRA) Land (Reichle et al., 2011), the Climate Forecast System Reanalysis (CFSR; Tomy and Sumam, 2016), the second global soil wetness project (GSWP2; Dirmeyer et al., 2006), the Water Model Intercomparison Project (WaterMIP; Haddeland et al., 2011), and the Global Land Data Assimilation System (GLDAS; Rodell et al., 2004). Recently, the eartH2Observe project has made available two ensembles (tier-1 and -2) of state-of-the-art global hydrological and land surface model outputs 25 (http://www.earth2observe.eu/; Beck et al., 2017a;and Schellekens et al., 2017). Although model simulations represent the only time varying gridded estimates of runoff at the global scale, they are subject to considerable uncertainties, resulting in large differences in runoff simulated by the models. Many studies have therefore evaluated and compared the gridded runoff models (see overview in Table 1 of Beck et al., 2017a).
Despite the demonstrated improved predictive capability of multi-model ensemble approaches Pan et al., 30 2012;Bishop and Abramowitz, 2013;Mueller et al., 2013;Munier et al., 2014;Aires, 2014;Rodell et al., 2015;Jiménez et al., 2017;Hobeichi et al., 2018;Zhang et al., 2018), very little has been done to utilise this range of model simulations toward improved runoff estimates. This paper implements the weighting and rescaling method introduced in Bishop and Abramowitz (2013) and Abramowitz and Bishop (2015) to derive a monthly 0.5° global synthesis runoff product. Briefly summarized, we use a bias correction and weighting approach to merge 11 state-of-the-art gridded runoff products from the eartH2Observe project, constrained by observed streamflow from a variety of sources. This approach also provides us with corresponding uncertainty estimates that are better constrained than the simple range of modelled values. For ungauged regions we employ a 5 dissimilarity method to transfer the product weights to the ungauged basins from the closest basins using dissimilarity between basins as a proxy for distance. Such a synthesis product is in line with the multi-source strategy of Global Energy and Water EXchanges (GEWEX; Morel, 2001) and NASA's Making Earth Science Data Records for Use in Research Environments (MEaSUREs; Earthdata, 2017) initiatives and is particularly useful for studies that aim to close the water budget at the grid scale. 10 Sections 2.1 describes the observed streamflow data. Section 2.2 presents the participating datasets used to derive the weighted runoff product. Section 2.3 details the weighting method implemented in the gauged basins, while Section 2.4 focuses on the ungauged basins. Section 2.5 examines the approach used to derive the global runoff product. We then present and discuss our results in Section 3 and 4 before concluding.

Observed streamflow data
We used observed streamflow from the following four sources: (i) the US Geological Survey (USGS) Geospatial Attributes of Gages for Evaluating Streamflow (GAGES)-II database (Falcone et al., 2010); (ii) the Global Runoff Data Base (GRDB; http://www.bafg.de/GRDC/); (iii) the Australian Peel et al. (2000) database; and (iv) the global Dai (2016) database. We discarded duplicates and from the remaining set of stations discarded those satisfying at least one of the following criteria: (i) 20 basin area <8000 km 2 (fewer than three 0.5 grid cells); (ii) record length <5 y in the period 1980-2012 (not necessarily consecutive); and (iii) low observed streamflow (i.e. around 0) that does not represent the total runoff across the basins due to significant anthropogenic activities. A river basin was identified with significant anthropogenic activities if it has > 20% irrigated area using the Global Map of Irrigation Areas (GMIA-Version 4.0.2; Siebert et al., 2007) or has > 20% classified as "Artificial surfaces and associated areas" according to the Global Land Cover Map (GlobCover-Version 2.3; Bontemps et al., 25 2011). In total 596 stations (of which 20 are nested in the basins of other stations) were found to be suitable for the analysis ( Fig. 1).

Simulated runoff data
To derive the global monthly 0.5° synthesis runoff product, we used 11 total runoff outputs (from eight different models) and 30 seven streamflow outputs (from six different models) produced as part of tiers 1 and 2 of the eartH2Observe project (available via ftp://wci.earth2observe.eu/). The models and their available variables are presented in Table 1. For tier 1 of eartH2Observe, the models were forced with the WATCH Forcing Data ERA-Interim (WFDEI) meteorological dataset (Weedon et al., 2014) corrected using the Climatic Research Unit Timeseries dataset (CRU-TS3.1; Harris et al., 2014). For tier 2, the models were forced using the Multi-Source Weighted-Ensemble Precipitation (MSWEP) dataset (Beck et al., 2017b). The runoff and streamflow values are provided in kg m -2 s -1 and m 3 s -1 , respectively. For consistency, the runoff outputs with resolution <0.5° 5 were resampled to 0.5° using bilinear interpolation. In some cases, the river network employed by the model did not correspond with the stream gauge location, in which case we manually selected the grid cell that provided the best match with the observed streamflow.
The runoff outputs were only used if no streamflow output was available and only in basins smaller than 100,000 km 2 . To make the runoff data consistent with the streamflow data, we integrated the runoff over the basin areas (termed Ragg, units m 3 10 s -1 ). Thus, for basins smaller than 100,000 km 2 the synthesis product was derived from 11 model outputs, whereas for basins larger than 100,000 km 2 the synthesis product was derived from seven outputs.
We detail in sections 2.3 and 2.4 our methods to derive the weighted runoff product for the global land. A flowchart summarizing the process is provided in Fig. 2. 15

Implementing the weighting approach at the gauged basins
At each gauged basin, we built a linear combination q of the participating modelled streamflow datasets (i.e. Ragg in small basins and modelled streamflow, q, in large basins) that minimized the mean square difference with the observed streamflow Q at that basin such that: q j = ∑ k ( k j − K k=1 k ) where ∈ [1, ] are the time steps and ∈ [1, ] represent the participating models, (i.e., integrated runoff Ragg over the basin areas in small basins and modelled streamflow at a gauge 20 location in large basins) is the value of the participating dataset in m3 s-1 at the j th time step of the kth participating model, the bias term k is the mean error of k in m 3 s -1 . The set of weights k provides an analytical solution to the minimization of subject to the constraint that ∑ w k = 1 K k=1 , where is the observed streamflow at the j th time step. This minimization problem can be solved using the method of Lagrange multipliers by finding a minima for The solution to the minimization of ( , ) can be expressed as w =  . A is symmetric and the term , is the covariance of the a th and b th bias corrected dataset after subtracting the observed dataset, while each diagonal term , is the error variance of dataset k. We note here that the solution presented here is based on the performance of the participating products (diagonal terms of A) and the dependence of their errors (accounted for by the non-diagonal terms of A). For derivation see Bishop and Abramowitz (2013).
We then derived the weighted runoff dataset by applying the computed weights on the bias corrected runoff estimates of the participating models. The weighted runoff dataset is expressed as: 5 Where k j is the value of runoff estimate in kg m −2 s −1 of the k th participating model at the j th time step and b′ k is its runoff bias in kg m −2 s −1 .
To calculate the runoff bias b′ k , we assumed that for each model k and at each time j the bias ratio of a model (defined as the ratio of the model error to the simulated magnitude) is the same for streamflow and runoff estimates Eq. (1). In small basins, 10 the bias ratio of modeled streamflow was calculated by using Ragg k j instead of the modeled streamflow k j Eq. (2).
We note that there no empirical evidence in the literature that the assumptions presented in Eq. 1 and Eq. 2 are valid. However, given that these assumptions constitute a part of our overall approach that we tested and proved its success later in this paper, 15 the validity of these assumptions is very likely to hold true.
To avoid over-fitting when applying the weighting approach, we limited the number of participating models so that the ratio of number of records (i.e. total number of available monthly observations within the period of study) to number of models does not fall below ten. As a result of this, when required, we discarded the models that had the highest bias (i.e. left terms in Eq (1, 2)) until the threshold was met. The weighting and the bias correction occasionally resulted in negative runoff values, 20 we replaced any negative values with zero. Table S1 shows examples of weights and bias ratios calculated for the participating models over a range of river basins. It shows that HBVS, JULES1, JULES2 and SURF2 didn't participate in the weighting over the large basins (i.e. Amur, Indigirka, Mississippi, Murray-Darling, Olenek, Parana, Pechora and Yenisei) since these models don't have estimates for streamflow which are needed to construct the weights over large basins. For the smaller Copper River basin, however, runoff estimates from all models participated in deriving weighted runoff estimates. Table S1 25 also shows that in many cases, models were assigned negative weights. While this might not be expected in typical performance-based weighting, it is possible when weighting is based on error covariance as well as their performance differences in this formulation. We show below how the weights can be modified to non-negative weights.
We implemented the ensemble dependence transformation process detailed in Bishop and Abramowitz (2013) to compute the gridded time-variant uncertainty associated with the derived runoff estimates. For any given gauged basin, we first calculated 30 the spatial aggregate of our weighted runoff estimate Ragg , then quantified 2 , the error variance of Ragg with respect to the observed streamflow Q over time as: Then, we wished to guarantee that the variance of the constituent modelled estimate 2 about Ragg at a given time step, averaged over all time steps where we have available streamflow data, is equal to 2 , such as 2 = 1 ∑ 2 =1 . 5 Since the variance of the existing constituent products do not, in general, satisfy this equation. We transformed them so that it does. This involved first modifying the set of weights to a new set ̃ such that accurately reflects our ability to reproduce the observed streamflow.
We refer the reader to Bishop and Abramowitz (2013) for proofs . 15 In order to estimate √ 2 , the uncertainty of the runoff attributes at each point in time and space, we first transformed the runoff fields to ̃ by applying the same transformation parameters and such that ̃= + ( ̅ + ( − ̅ ) − ). We then calculated the error variance Finally, we used √ 2 as the spatially and temporally varying estimate of runoff uncertainty standard deviation, which we will refer to below simply as 'uncertainty'. It provides a much more defensible uncertainty estimate than simply calculating the 20 standard deviation of the involved products.
We note that for a given basin, √ 2 represents the uncertainty of the modelled streamflow i.e. Ragg , while √ 2 represents the uncertainty of modelled runoff at each grid cell across the basin. This means that at every time step, there is one value for √ 2 per basin, and one value for √ 2 per grid across the basin.

Deriving runoff estimates at the ungauged river basins
Implementing the weighting approach requires observed streamflow to constrain the weighting, which we do not have at ungauged river basins (defined in section 2.1). To address this, we used the modelled and observed streamflow from the three most similar gauged river basins, based on pre-defined physical and climatic characteristics, to derive model weights at each ungauged basin. The selected gauged river basins served as donor basins to the ungauged receptor basins. We then 5 implemented the weighting technique on the ensemble of 11 (in small basins) or eight (in large basins) model outputs by matching Ragg calculated across the selected donor basins with the observed streamflow. This resulted in one set of weights and bias ratios obtained jointly from the three donor basins. Finally, we transferred the weights and bias ratios computed at the donor basins to the receptor basin and subsequently computed the associated uncertainty values.
Most of the gauged river basins were classified as donor basins. Some, however, were excluded from being donors where we 10 found (based on Ragg or modeled streamflow time series and metric values) that none of the models was able to simulate the streamflow dynamics. These basins are mainly located in areas of natural lakes, in mountainous areas covered with snow, or in wet regions with intense rainfall. We therefore (subjectively) decided that those excluded basins should be assigned to a "non-donor and non-receptor" category.
We applied the method presented in Beck et al. (2016) to calculate a similarity index S between a donor basin a and a receptor 15 basin b expressed as: Where p denotes the climatic and physiographic characteristics as in Table 4 of Beck et al. (2016). This includes aridity index, fractions of forest and snow cover, soil clay content, surface slope, and annual averages of precipitation and potential evaporation. p,a and p,b are the values of the characteristic p at donor and receptor basins, respectively. IQRp is the 20 interquartile range of characteristic p calculated over the land surface, excluding deserts (defined by an aridity index > 5, see Table 4 of Beck et al. (2016)) and areas covered with ice during most of the year (defined by climate zones Tundra, Subarctic and Ice cap using a simplified climate zones map (Fig. S1) created by the Esri Education Team for ArcGIS online (World Climate Zones -Simplified; Esri Education Team, 2014)). From Eq. 3 it follows that the most similar donor a to a receptor b is the one that has the lowest index value with basin b. We applied this approach to identify the three most similar 25 donors for every receptor basin. The dissimilarity technique has been previously applied to find ten donors for one receptor.
Given that all the selected donors must have very close similarity indices, we found by trial and error that increasing the number of donor basins might introduce donor basins that have a significantly different similarity index, and that setting the number of donor basins to three seemed most appropriate.
In very large basins, physiographic and climatic heterogeneity can result in misleading basin-mean averages. We therefore 30 excluded highly heterogeneous basins from the list of donors and classified them as 'non-donor and non-receptor' basins, and also broke up large heterogeneous receptor basins by climate groups into smaller basin zones and then treated them as separate basins to effectively receive sets of weights and bias ratios from the donor basins to the separate parts. Here we defined large heterogeneous basins as basins with areas greater than 1,000,000 km 2 and covering climate zones that belong to at least two groups of 1) Tropical Wet, 2) Humid continental, Humid subtropical, Mediterranean and Marine, 3) Tropical Dry, Semi-arid and Arid, 4) Tundra, Subarctic and Ice cap and 5) Highlands. Climate classification is based on the simplified climate zones map (World Climate Zones climate zones map; Esri Education Team, 2014) defined above. We used this particular climate 5 map because it comprises only 12 broad climate groups (compared to more than 30 in other climate maps e.g. Köppen-Geiger).
This reduced the divisions made to large heterogenous basins, while ensuring that the resultant basin zones of individual basins have very distinct climate characteristics. Figure 3 shows the spatial coverage of the donor basins, receptor basins and nondonor and non-receptor basins.

Out-of-sample testing 10
To test that this approach is producing a runoff estimate at receptor basins (using transferred weights from the most similar gauged basins) that is better than any of the individual models, we performed an out-of-sample test. In this test, we selected a gauged basin and treated it as a receptor basin, constructing model weights by using the three most similar donor basins. We could then compare: (a) observed streamflow; (b) the in-sample weighted product (WPin) derived by using observed streamflow for this basin to weight models; (c) an out-of-sample weighted product (WPout) derived by constructing the weighting at the 15 three most similar basins, and; (d) the individual model estimates at each basin. We calculated four metrics of performance for WPin, WPout and each of the 11 datasets: Mean Square Error MSE=mean(Raggobserved streamflow) 2 ; Mean Bias=mean| Raggobserved streamflow |; Correlation COR=corr(observed streamflow, Ragg) and Standard Deviation (SD) difference= σ Ragg − σ observed streamflow . We repeated the out-of-sample test for all the gauged basins (donor basins and non-donor and non-receptor basins). 20 We displayed the results of the out-sample-test by showing the percentage performance improvement of WPout compared to WPin and each individual model, yielding 12 different values of performance improvement. If the approach is succeeding, we expect that both WPout and WPin will perform better than any of the models used in this study, and also WPin should be in better agreement with the observed streamflow when compared to WPout.
We used box and whisker plots to show the results of performance improvement of WPout calculated relative to WPin and the 25 11 datasets across all the gauged basins. The lower and upper hinges of a boxplot represent the first (Q1) and third (Q3) quartiles respectively of the performance improvement results and the line inside the boxplot shows the median value. The extreme of the lower whisker represents the maximum of 1) min(dataset) and 2) (Q1 -IQR), while the extreme of the upper whisker is the minimum of 1) max(dataset) and 2) (Q3 + IQR)), where IQR represents the interquartile range (i.e. Q3 -Q1 ) of the performance improvement results. A median line located above the 0 axis is an indication that the out of sample weighting offers an 30 improvement in more than half of the basins.
The uncertainty estimates computed at the gauged basins represent well the deviation of the spatial aggregate of our weighted product ( ) from the observed streamflow, since the in-sample uncertainty estimates are calculated from the variance of the transformed ensemble, which by design equals MSE of against the observation (i.e. error variance of ). To test if the uncertainty estimates perform well out-of-sample (i.e. at the ungauged basins), we performed another out of sample test. In this test, we took a gauged basin, but instead of constraining the weighting using observed streamflow from this basin, we constructed model weights by using the three most similar donor basins. We could then calculate MSE of against observation from the three donor basins, and we denoted this by MSEin, which represents the uncertainty estimates calculated 5 in-sample, since the observational data used in this case is the same dataset that was used to train the weighting. We also calculated the MSE of the aggregated weighted product against the actual observation of the gauged basin and we denoted this by MSEout. MSEout represents the uncertainty estimates computed out-of-sample, since the comparison was performed against observational data that has not been used to train the weighting. We repeated the out-of-sample test for all the gauged basins.
We displayed the results of the out-sample-test by showing the ratios of MSEin to MSEout. If the approach is succeeding, we 10 expect that this ratio is around one, indicating that the values of MSEin and MSEout are close to each other. We used a box and whisker plots to show the results.

Results
The results for the out-of-sample test are displayed in the box and whisker plots presented in Fig. 4 (a -d).
The MSE and Mean bias plots in Fig. 4 (a and d) indicate that across almost all the gauged basins WPout performs better than 15 each of the individual models. Similarly, the COR plot in Fig. 4 (c) shows that the out-of-sample weighting has in fact improved the correlation with observational data across almost all the gauged basins. The SD difference plot ( Fig. 4 (b)) shows a significant improvement of WPout relative to the models, but the number of basins that benefit from this improvement decreased, perhaps because the variability of the individual members of the weighting ensemble is not necessarily temporally coincident at all the basins, resulting in decreased variability. The negative performance improvement of WPout relative to WPin 20 across all metrics (first boxplot, Fig. 4 (a-d)) indicates that the weighting performs better in-sample than out-of sample, which is to be expected. Critically though, the fact that the weighting delivers improvement over all models when the weights are transferred from similar basins indicate that the dissimilarity technique is succeeding and can be effectively used at the ungauged basins by feeding the weighting with data from the most similar basins with streamflow observations. Furthermore, the boxplot in Fig. 5 shows that, overall, when the uncertainty estimates are computed out-of-sample they are very similar to 25 what they would have been if they were computed in-sample. This demonstrates that the dissimilarity technique can be effectively used to derive not only the weighting product but also its associated uncertainties at the ungauged basin.
Based on the improvement that the weighting approach implemented in both gauged and ungauged basins offers over Ragg estimates computed for 11 individual model runoff estimates, in terms of MSE, SD difference, COR and Mean Bias against observed streamflow data, we now present details of the mosaic of the individual weighted runoff estimates derived across all 30 the basins that we name LORA. At the gauged basins, the weighting was trained with the Ragg of the modelled runoff at the individual basins and constrained with the observed streamflow. At ungauged basins, the dissimilarity approach was first implemented to find the three most similar basins, then the weighting was trained on the combined datasets from these three basins. Subsequently, weights were transferred to the ungauged basins and applied to combine the runoff estimates at the individual basins.
The eight modelled runoff datasets listed in Table 1 as part of the tier1 ensemble were recently included in a global evaluation 5 by Beck et al. (2017a). In their analysis, they computed a summary performance statistic that they termed OS by incorporating several long-term runoff behavioural signatures defined in Table 3 of Beck et al. (2017a) and found that the mean of runoff estimates from four models only (LISFLOOD, WaterGAP3, W3RA and HBV-SIMREG) performed the best in terms of OS ̅̅̅̅ (i.e. mean of OS over all the basins included in their study) relative to each individual modelled runoff estimates and the mean of all the modelled runoff estimates. In this study, we calculated the mean runoff from the four best products found by Beck 10 et al. (2017a), that is (LISFLOOD, WaterGAP3, W3RA and HBV-SIMREG. Hereafter, we refer this as "Best4", and we calculated four statistics (RMSE, SD difference, COR and Mean bias defined here as mean(dataset-obs)) for Ragg computed from LORA, Best4 and each of the 11 runoff datasets across all the gauged basins. The boxplots in Fig. 6 (a-d) display the results.
The RMSE plot in Fig. 6 (a) shows that LORA has the lowest RMSE values with the observed streamflow. All of the 15 component models exhibit a similar performance in RMSE. Similarly, LORA has overall the least SD difference with observations ( Fig. 6 (b)) across more than half of the basins. The Mean bias plot in Fig. 6 (d) shows a non-significant positive bias in LORA relative to the observation at the majority of the basins. Best4, HBV-SIMREG, PCR-GLOBWB and particularly LISFLOOD exhibit a positive mean bias across most of the basins but with much higher bias magnitude compared to that of LORA. HTESSEL and SURFEX estimates from both tiers (i.e tier1 and tier2) together with JULES (tier2) and WGAP3 show 20 negative and positive bias distributed evenly across the basins. LORA shows the highest temporal correlation with the observed streamflow at more than half of gauged basins (Fig. 6 (c)). The low RMSE and Mean bias values relative to the other estimates is partly due to the bias correction applied before the weighting. While all the performance metrics calculated here show that LORA outperforms Best4, these metrics do not allow us to assess how well LORA performs in terms of bias in the runoff timing, replicating the peaks or representing quick runoff, with the exception of the correlation metric. These aspects were 25 studied in more detail in Beck et al. (2017a) and showed that Best4 performs well in these performance metrics.
All the models involved in deriving LORA with the exception of HBV-SIMREG were found in the study of (Beck et al., 2017a) to show early spring snowmelt peak and an overall significant underestimation of runoff in the snow-dominated basins.
To see how well LORA performs at high latitudes, we examined the gauged basins located at higher latitudes (>60°) and we calculated two statistics -COR and mean biasas in Fig. 6 (c-d) but this time for the snow-dominated basins only. We display 30 the results in Fig. 7.
The temporal correlation plot in Fig. 7 (a) shows that LORA is in better agreement with observed streamflow at snowdominated basins compared to the ensemble of all the gauged basins on the globe (Fig. 6 (c)) with an overall average improvement of 7%. Similarly, HBV-SIMREG shows an improved correlation with the observed streamflow at snowdominated basins with an average improvement of 14%, this agrees with the results reported by Beck et al. (2017a) who attributed the improved performance of HBV-SIMREG in snow-dominated regions to a snowfall gauge undercatch correction.
The overall performance of Best4 and LISFLOOD do not change in terms of spatial correlation; on the contrary, all the remaining products show a degraded performance. Figure 7 (b) shows that LORA exhibits small biases across snow-dominated 5 basins relative to participating models. Conversely, with the exception of LISFLOOD, all the tier1 products including Best4 show a negative mean bias across more than half of the snow-dominated basin, in particular HTESSEL, JULES, SURFEX and W3RA show a large negative bias at most of these basins. This agrees with the negative bias found in the study of Beck et al. (2017a) in all tier1 products except LISFLOOD. These results indicate that LORA is likely to slightly overestimate runoff in high latitudes whereas all tier1 products with the exception of LISFLOOD tend to underestimate runoff in these regions, and 10 that this underestimation is larger for HTESSEL, JULES, SURFEX and W3RA. Tier2 products show both positive and negative bias across the basins. Their bias is of a lower magnitude than that found in tier1 products. That is probably because the forcing precipitation used to derive tier 2 outputs (i.e. MSWEP) has less biases than that used to derive tier1 estimates (i.e. WFDEI corrected using CRU-TS3.1). We also calculated the two metrics, SD difference and mean bias as in Fig. 6 (a and b), but we found no noticeable differences in the performance of any of the products relative to that found globally in Fig. 6 (a  15 and b). The results displayed in Fig. 6 and Fig. 7 are discussed further below.
We calculated the seasonal relative uncertainty expressed as the ratio of the seasonal average uncertainty to seasonal mean runoff (i.e.  To generate this plot, we calculated the average Ragg for each month over the period of availability of observed streamflow. 30 The shaded regions represent the range of uncertainty associated with the derived runoff. In the Amazon basin, LORA overestimates runoff in the wet season and underestimates it in the dry season, but the observed streamflow during the dry season still lies within the error bounds of LORA. LORA shows good agreement with the observed cycle in the Mississippi.
In the Niger and Murray-Darling basins, while LORA overestimates the observed streamflow, it shows a much better agreement compared to Best4 which strongly overestimates runoff. In the Parana basin, LORA underestimates the observed streamflow in all seasons except summer. In the subarctic basins, LORA shows different behavior within the individual basins.
In Pechora and Olenek, LORA represents well the seasonal cycle and the magnitude of runoff, whereas in the Amur, Lena and 5 Yenisei, LORA shows an early shift of the runoff peak and an overall overestimation of runoff. In the Indigirka, LORA overestimates the spring peak, but the observed seasonal cycle lies within the error bounds.
We compared our mean annual runoff (mm/year) with those estimated by a well-known land surface hydrological model the Variable Infiltration Capacity (VIC; Liang et al., 1994) model as well as adjusted VIC estimates after enforcing the physical constraints of the water budget in the study of Zhang et al. (2018) over comparable temporal and spatial scale for 16 large 10 basins chosen from different climate zones on the globe. The mean annual runoff was computed over the period 1984 -2010 instead of 1980 -2012 to maximize the temporal agreement with the study of Zhang et al. (2018). We also showed the average annual volume of water that discharges from these basins computed from LORA and the observational data. Table 2 shows that for some basins VIC and LORA agree well in estimating mean annual runoff (i.e. difference between LORA and at least one of VIC and VIC adjusted for budget closure <10%). This threshold is met in the Amazon, Columbia,

Discussion
The results of the out-of-sample test suggest that deriving runoff estimates in an ungauged basin by training the weighting with 25 streamflow data from similar basins -in terms of climatic and physiographic characteristics -is successful. While the runoff product derived by using weights from external basins outperforms the runoff estimates from the individual models, the weighted runoff derived in-sample offers overall even more capable runoff estimates.
It follows from Fig. 8 that the runoff values computed over dry climates tend to be less reliable than those in other regimes. This is perhaps due to biases in the WFDEI precipitation forcing that are propagated and intensified in the simulated runoff 30 (Beck et al., 2017a). Another possible reason is the reduced proficiency of models in representing runoff dynamics in arid climates where runoff tends to be highly non-linearly related to rainfall and often evaporates locally without reaching a river system (Ye et al., 1997). Also, due the lower density of gauged basins in the arid and semi-arid climates compared to other regimes, receptor basins are dominant over dry climates, which reduces the skill of the weighting to produce good runoff estimates. This is also in line with our conclusions from Fig. 4 that the weighting provides more reliable results in the gauged basins. 5 All the tier1 model outputs involved in this study with the exception of HBV-SIMREG were found by Beck et al. (2017a) to show early spring snowmelt in the snow-dominated basins. Both the Yenisei and the Lena are large basins (2.6 and 2.4 million km 2 , respectively), and henceas noted in Sect. 2.2only models that had estimates of both streamflow and runoff were used to derive LORA at these basins, and therefore HBV-SIMREGwhose inclusion would have improved the weighting -was excluded. Beck et al. (2017a) also found that LISFLOOD has the best square root-transformed mean annual runoff among the 10 tier1 datasets and perfoms well in terms of temporal correlation in all climates, this agrees with the high temporal correlation of LISFLOOD seen in Fig. 6 (c) and Fig. 7 (a), and also explains the highest weights attributed to LISFLOOD in the majority of snow-dominated basins (Table S1). Because of this, and because LISFLOOD tends to overestimate runoff across half of the snow-dominated basins (as shown in Fig.7 (b)) LORA exhibits a positive bias across half of the snow-dominated basins ( Fig.   7 (b)) and particularly in Lena, Amur and Yenisei basins (Fig. 9). 15 Further, we provide in Fig. S2 the spatial distribution of correlation results from Fig. 6 (c). The basins are colour-coded by their temporal correlation with the observed streamflow and the number of basins in each category is given. Basins in yellow are those where LORA is highly correlated with the observation while dark blue basins are those where LORA exhibits a negative correlation with the observation. It can be noted from Fig. 6(c) that occurrence of negative correlation is extremely unusual which explains why these were considered outliers and were not shown in the box and whisker plot. Likely, low 20 correlation basins are unusual and constitute less than 12% of the number of basins (excluding basins with negative correlation). Also, the median value is above 0.8, which is higher than any constituent estimates. We selected a basin from each correlation range and examined the timeseries of LORA and the observed streamflow more closely (Fig. S3-S7), in particular illustrating the uncertainty estimate of LORA. In Ganges, LORA captures well the observed time-series dynamic with a tendency to over-estimate streamflow peak in August (Fig. S3). Over Madeira basin, LORA is able to represent 25 reasonably well most of the climatic variability found in the observation (Fig. S4). In Congo, the catchment has an irregular time-series dynamic, LORA is in principle able to capture a large part of the climatic variability in the observation (Fig. S5).
In Lena, the observation shows a peak in June and a second less significant peak in September (Fig. S6). Both peaks are captured by LORA during most of the time series with a tendency to underestimate the late summer peak and overestimate the early summer peak. In the upper Indus, LORA does not capture the magnitudes of observed streamflow and shows a reversed 30 seasonal cycle which explains why it exhibits negative correlation with the observation (Fig.S7). Zhang et al. (2018) found disagreement between simulated runoff from three LSMs and observed streamflow over Indus basin which they expected to be due to errors in the observational data from GRDB dataset. Pan et al. (2012) and Sheffield et al. (2009) assumed that the errors in the measured streamflow are inversely proportional to the area of the basins and ranges between 5% and 10%. Whereas Di Baldassarre and Montanari (2009) analyzed the overall error affecting streamflow observations and found that these errors range between 6% and 42%. In earlier studies, the errors in streamflow measurement were estimated to range from 10% to 20% (Rantz, 1982;Dingman, 1994). In the study of Zhang et al. (2018), the error ratios of VIC were set to be 5%. In this study, we used the weighting approach to compute gridded 5 uncertainty values based on the discrepancy between the Ragg of the derived runoff and the associated observational dataset in each gauged basin or alternatively, based on the discrepancy between Ragg of the derived runoff and the associated observational dataset from three similar basins in the case of ungauged basins. The derived gridded uncertainty changes in time and space. Our uncertainty estimates show higher values than those set for VIC, and additionally the estimated values and their reliability change with climate and season (Fig. 8). It follows from Table 2 that in most of the basins the mean annual 10 runoff uncertainty exceeds 30% of the values of the associated runoff itself. In fact, when the values of runoff approach zero (i.e. in arid and semi-arid regions during the hot climate or in the snow dominated basins during winter) it is expected that the uncertainty values become very close to the associated runoff estimates and eventually the error ratio becomes high. It is not surprising that the estimated relative uncertainties exceed the error ratios of the observations. Also the change of the uncertainty values with time and space is consistent with the fact that the individual datasets that were used to derive LORA exhibit 15 performance differences in different climates and terrains (Beck et al., 2017a). As discussed in Hobeichi et al. (2018), the weighting approach has its own advantages and drawbacks. One limitation is that 25 a common imperfection in all the individual products is likely to propagate into the derived product. The early spring runoff peak found in both LORA and the datasets that were used to derive it is an example of this limitation. On the other hand, the seasonal runoff cycle of LORA in both Pechora and Olenek (i.e. two snow-dominated basins) indicate that LORA was able to capture the seasonal signal and the timing of the runoff peak very well as opposed to the constituent products and Best4, which also suggests that the weighting has the ability to overcome the weaknesses of the individual products. Additionally, it was 30 shown in Beck et al. (2017a) that tier1 products consistently overestimate runoff in arid and semi-arid regions due to a bias in the WFDEI precipitation forcing, this appears in the massive overestimation exhibited by Best4 in Niger and Murray-Darling ( Fig. 9), however the weighting was able to eliminate a large amount of this overestimation, which also emphasizes the ability of the weighting approach to mitigate limitations in individual models. Another limitation arises from the scarcity of observed streamflow particularly in the arid regions and from the quality of the observational data itself. As noted earlier, the errors in GRDB dataset were reported to range between 10% and 20% and were found by Di Baldassarre and Montanari (2009) to have an average value that exceed 25% across all the studied river basins. Also, given that there are no direct observations for runoff, uncertainties were computed from the discrepancy between the modelled runoff aggregates and observed streamflow. This ignored the lag time between LORA integrated runoff and observed streamflow at the mouth of the river and induced biases 5 that possibly led to overestimated uncertainty over large gauged basins.
The weighting technique allows the addition of new runoff estimates when they become available. This will be particularly beneficial if the future estimates represent reasonably the runoff peak in the snow-dominated regions.

Conclusion
In this study, we presented LORA, a new global monthly runoff product with associated uncertainty. LORA was derived for 10 1980-2012 with monthly temporal resolution at 0.5° spatial resolution by applying a weighting approach that accounts for both performance differences and error covariance between the constituent products.
To ensure full global coverage, we used a similarity index to transfer weights and bias ratios constructed from gauged basins with similar climatic and physiographic characteristics to ungauged basins. This allows the derivation of runoff in areas where we do not have observed streamflow . 15 We showed that this approach is succeeding, that LORA performs better than any of its constituent modelled products in a range of metrics, across basins globally and especially in the higher latitudes. However, LORA tends to overestimate runoff and shows an early snow-melt peak in some snow-dominated basins. LORA was not found to significantly overestimate runoff in arid and semi-arid regions as opposed to the constituent products.
The approach and product detailed here offers the opportunity for improvement as new streamflow and modelled runoff 20 datasets become available. It presents a new, relatively independent estimate of a key component of the terrestrial water budget, with a justifiable and well constrained uncertainty estimate.

Competing interests
The authors declare that they have no conflict of interest. was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. We are grateful to the Global Runoff Data Centre (GRDC) for providing observed streamflow data. We thank the participants of the eartH2Observe project for producing and making available the model simulations. We also acknowledge that the HydroBASINS product has been developed on behalf of World Wildlife Fund US (WWF), with support from, and in collaboration with: the EU BioFresh project, Berlin, Germany; the International Union for 10