Technical Note: Evaluation and bias correction of an observations- based global runoff dataset using historical streamflow observations from small tropical catchments in the Philippines

The predictability of freshwater availability is one of the most important issues facing the world’s population. Even in relatively wet tropical regions, seasonal fluctuations in the water cycle complicate the consistent and reliable supply of water to urban, industrial and agricultural demands. Importantly, historic streamflow monitoring datasets are crucial in 15 assessing our ability to model and subsequently plan for future hydrologic changes. In this technical note we evaluate a new global product of monthly runoff (GRUN_v1; Ghiggi et al., 2019) using small tropical catchments in the Philippines. This observations-based monthly runoff product is evaluated using archived monthly streamflow data from 55 catchments with at least 10 years of data, extending back to 1946 in some cases. These catchments are completely independent of the GRUN gridded product as no catchments in the Philippines were of sufficient size to fulfil the original filtering criteria and 20 databases of these data were either not digitized or difficult to compile. Using monthly runoff observations from catchments with more than 10 years of data between 1946 and 2014, we demonstrate across all observations significant but weak correlation (r2 = 0.372) and skilful prediction (Volumetric Efficiency = 0.363 and log(Nash-Sutcliff Efficiency) = 0.453) between the predicted values and the observations. At a regional scale we demonstrate that GRUN performs best among catchments located in Climate Types III (no pronounced maximum rainfall with short dry season) and IV (evenly distributed 25 rainfall, no dry season). We also find a weak negative correlation between volumetric efficiency and catchment area, and a positive correlation between volumetric efficiency and mean observed runoff. Further, analysis of individual rivers demonstrates systematic biases (over and under) in baseflow during the dry season, and under-prediction of peak flow during some wet months among most catchments. These results demonstrate the potential utility of GRUN and future data products of this nature with due consideration and correction of systematic biases at the individual basin level to: 1) assess trends in 30 https://doi.org/10.5194/hess-2020-26 Preprint. Discussion started: 24 February 2020 c © Author(s) 2020. CC BY 4.0 License.

regional scale runoff over the past century, 2) validate hydrologic models for un-monitored catchments in the Philippines, and 3) assess the impact of hydrometeorological phenomenon to seasonal water supply in this wet but drought prone archipelago. Finally, to correct for underprediction during wet months we perform a log-transform bias correction which greatly improves the nationwide Root Mean Square Error between GRUN and the observations by an order of magnitude (2.648 vs. 0.292 mm/day). This technical note demonstrates the importance of performing such corrections when accounting 35 for the proportional contribution of catchments from smaller catchments in tropical land such as the Philippines to global tabulations of discharge.

Introduction
The global water crisis is considered as one of the three biggest global issues that we need to contend with, affecting an estimated two-thirds of the world's population (Kummu et al., 2016;WEF, 2018). Among the sources of freshwater, the 40 most important compartment in terms of utility is surface water flow, which is the primary resource for irrigation, industrial use and for bulk domestic water supply for many large cities. Along with the purpose of flood mitigation during extreme weather events, monitoring streamflow is a vital activity that many nations conduct with various levels of coverage. Long term streamflow datasets prove useful in resource management and infrastructure planning (e.g., Evaristo and McDonnell, 2019). Such data is even more critical in areas that rely on run-of-the-river supply and do not utilize storage structures such 45 as dams and impoundments. Further, a robust, long term dataset is crucial in the face of increased variability in stream discharge due to land use change, increased occurrence of mesoscale disturbances and climate change (e.g., Abon et al., 2016;David, et al., 2017;Kumar et al., 2018).
There is a disparity in the availability of long-term gauged rivers datasets between continental areas and smaller island nations. This in the face of the latter having an invariably more dynamic hydrometeorologic system owing to the size 50 of their catchment and proximity to the ocean (e.g., Abon et al., 2011;Paronda et al., 2019). Furthermore, in the case of tropical island nations these are where the impact of climate change in the hydrologic cycle could be observed the most (Nurse et al., 2014). Thus, the Philippines offers a unique example where manual stream gauging programs have started in 1904 and, while spotty at times, have continued on to today. In this work we analyse data since 1946. This island nation on the western side of the Pacific Ocean shows a very dynamic hydrologic system as affected by tropical cyclones, seasonal 55 monsoon rains, sub-decadal cycles such as the El Nino Southern Oscillation (ENSO) and overlaid on top of all these are the hydrologic changes caused by climate change (Abon et al., 2016;David, et al., 2017;Kumar et al., 2018).
In the absence of long-term streamflow datasets, several researchers have compiled datasets worldwide which are used to extrapolate streamflow in non-gauged areas (Maybeck et al., 2013, Gudmundsson et al., 2018, Do et al., 2018. Several global hydrological models have also been created to project variations in streamflow and extend present-day 60 measurements to the future (ref). The latest contribution to modelled global streamflow product is the Global Runoff (~50km by 50km) spatial resolution. It uses global streamflow data from 718 large river basins that is used to train a machine learning algorithm that would be able to take precipitation and temperature data to predict monthly global runoff for the period 1902-2014. 65 This technical note evaluates the accuracy of the GRUN dataset (GRUN_v1) as applied to the hydrodynamicallyactive smaller river basins in the Philippines. Additionally, it explores the possible hydrologic parameters that may need to be considered and/or optimized such that global datasets may be able to model runoff in smaller, ungauged basins more accurately. Indices and Metadata Archive (GISM) (Ghiggi et al., 2019). In this contribution we analyze GRUN_v1 (https://doi.org/10.3929/ethz-b-000324386) which was trained on large (> 10,000 km 2 ) stations from GISM (Do et al., 2018;Gudmundsson et al., 2018) and validated using 718 large (>50,000 km 2 ) monthly river discharge datasets from the Global

Runoff
Data Centre (GRDC) Reference Dataset (https://www.bafg.de/GRDC/EN/04_spcldtbss/43_GRfN/refDataset_node.html). We refer the reader to Ghiggi et al. (2019) 80 for more information but note that because of the catchment size filtering criteria, none of the GISM and GRDC data from the Philippines was used (Personal Communications, G. Ghiggi, 2019). As such, we view our analysis as a completely independent test of the GRUN runoff reconstruction using small tropical catchments.

Historical streamflow observations 85
In this contribution we analyse monthly observations of discharge from 74 manually observed streamflow stations from three Philippine datasets. The observations span 1946 to 2016, although only data through 2014 are utilized due to the time period included in GRUN_v1.

Bureau of Research and Standards (BRS) Dataset 90
The historical discharge data was originally acquired from the Bureau of Research Standards (BRS) under Department of Public Works and Highways (DPWH). The records keeping was transferred to the Bureau of Design, also under DPWH, which continues to record gage data from some rivers up to this date. The degree of accuracy of records were categorized as "excellent", "good", "fair", or "poor" using the following convention: "Excellent" means about 95% of daily discharges are within ±5% difference of the actual gauge height vs height computed from the rating curve; "Good" is within 95 ±10%; and "Fair" is within ±15%; while "Poor" means daily discharges are below the 15% "Fair" accuracy. This is the only https://doi.org/10.5194/hess-2020-26 Preprint. Discussion started: 24 February 2020 c Author(s) 2020. CC BY 4.0 License. basis of accuracy for this set of data. A majority of the reprocessed BRS data used in this analysis come from Tolentino et al. (2016), however, some of the datasets were subsequently updated using the online archive for this study (https://apps.dpwh.gov.ph/streams_public/station_public.aspx).  Tolentino et al., 2016;Kintanar, 1984;Jose and Cruz, 1999). Note that no long-term stations are available for Palawan.

Global Runoff Data Centre (GSIM) Reference Dataset
Only two time series of the available five GSIM time series (Maybeck et al., 2013, Gudmundsson et al., 2016, Do et al., 2018) cover more than 10 years. Further, none of these data were included in the original training of GRUN due to catchment size constraints. 115

Criteria for Inclusion of Datasets
All catchment areas were verified using the digital elevation model from the 2013 Interferometric Synthetic Aperture Radar (IfSAR) data. All hydrologic datasets were normalized to runoff (mm/yr), sometimes also notated as 'specific discharge' in the literature. We only considered streamflow stations where the published and verified areas agreed 120 and coverage spanned 10 or more years. The location of all streamflow stations is shown on Figure 1 and listed in Table 1.
Catchment areas span 4 orders of magnitude (8.93 to 6487 km 2 ) and cover the majority of the Philippines excluding Palawan (see Figure 1). The location of catchments was paired to GRUN grid cells (0.5° by 0.5°) for the analysis. Given that all but one catchment is smaller than the area of the GRUN grid cells (~3,000 km 2 ) we view this pairing as sufficient for validation purposes. This assumption was tested by interpolating the GRUN grid to the gauging location as well as the watershed 125 centroids, and no significant difference in correlation to the observations that were observed.

Statistical Performance Metrics
To assess the performance of GRUN we use a suite of metrics commonly used to assess model performance in hydrologic studies. These metrics are calculated for each individual catchment (n=55) and for each climate type (n=4; see 130 below) shown in Figure 1.
Firstly, we use the commonly used square of the Pearson correlation coefficient (r 2 ). This metric for bivariate correlation measures the linear correlation between two variables. In this case the predicted monthly values from GRUN versus the observed monthly values from the streamflow datasets. It varies from 0 (no linear correlation) to 1 (perfect correlation). The use of r 2 does not account for systematic over-or under-prediction in runoff because it only accounts for 135 correlation among the observed and predicted values (see Krause et al. (2005) for further discussion of the use of r 2 in hydrological model assessment). https://doi.org/10.5194/hess-2020-26 Preprint. Discussion started: 24 February 2020 c Author(s) 2020. CC BY 4.0 License.
Secondly, the primary metric used here is Volumetric Efficiency (VE) (Criss and Winston, 2008), utilized previously by Tolentino et al. (2016) on a subset of the BRS catchments analysed here. VE is defined as: where P is the modelled/predicted values and O is the observed runoff values. A value of 1 indicates a perfect score. Because we are interested in the performance of GRUN over the period of each streamflow record, unlike Tolentino et al. (2016), we calculate VE using all paired monthly observed and simulated values rather than the monthly medians. This results in lower 145 VE scores than those previously reported by Tolentino et al. (2016) compared to hydrologic models.
Further, we use both the linear and logged Nash-Sutcliffe efficiency (NSE) parameter. Proposed by Nash and Sutcliffe (1970) it is defined as: The range of NSE can be between -∞ and 1 (perfect fit). NSE values are useful (compared to VE) in that values less than zero indicate that the mean value of the observed data is a better predictor than the hydrologic model. NSE is also calculated using logarithmic values prior to calculation to reduce the influence of peak flow and increase the influence of low flow values (see further discussion in Krause et al., 2005). 155 Finally, to evaluate a possible strategy for performing a bias correction of the GRUN simulated values at a countrywide scale we use the Root Mean Square Error (RMSE) in units of runoff (i.e., mm/day). The RMSE is applied to the raw GRUN simulated values and the observation based bias corrected GRUN values at the country, climate type (see below) and individual catchment level.

Climate Types
The Philippines has four Climate Types (see also Abon et al., 2016;Tolentino et al., 2017): Type I Climate on the western seaboard of the Philippines is characterized by distinct wet (May to October) and dry (November to April) seasons; Type II Climate on the eastern seaboard has no distinct dry period with maximum rains occurring from November to February; Type III inland climate experiences less annual rainfall with a short dry season (December to May) and a less 165 pronounced wet season (June to November); and, Type IV southeast inland climate experiencing depressed rainfall and characterized by an evenly distributed rainfall pattern throughout the year. Further description is provided in Figure 1.

Results and Discussion
Statistical comparisons described above between individual catchments and the GRUN dataset are shown in the supplemental figures and tabulated in Table 1. Shown also in the supplemental figures ( Figure A1) are time series comparisons between the GRUN runoff values and runoff (area normalized discharge). Statistics performance metrics across all data as well as by climate types I to IV are also listed in Table 1. Given the emphasis on a country scale evaluation of 175 GRUN we primarily focus below on results in aggregate grouped by climate type or among all catchments.
Across all observations GRUN has a r-squared of 0.372 and a VE of 0.363 (Table 2). Using log(runoff) values (following Criss and Winston, 2008) this improves to an r-squared of 0.546 and a volumetric efficiency (VE) of 0.733, suggesting reasonable utility in the GRUN product at a country scale for the Philippines, despite no training data from the Philippines being used in the creation of GRUN (Personal Communications, G. Ghiggi, 2019). The raw RMSE across the 180 dataset is 2.648 mm/day (Table 2). In the following we break down the comparison between the streamflow observations and GRUN by first comparing runoff distributions and extreme values at the individual basin level, then aggregating our results by Climate Type, and finally look at several correlations of VE to watershed characteristics.

Comparison of runoff distributions 185
Average runoff values among all catchments compared to GRUN show reasonably good predicted values. In Figure   2A  Comparison of runoff distributions ranked by catchment size in Figure 2B demonstrates that maximum runoff values appear to be most underpredicted by GRUN in the smallest catchments. However, we do find a significant correlation (at p<0.01, r 2 =0.391) between log values of maximum runoff difference (observed minus predicted) versus catchment area 195 (not shown), with a negative relationship. This suggests that particularly for small catchments, which may have steeper average slope, GRUN underpredicts monthly runoff values associated with the wet season.
In general, the median and interquartile ranges (IQR, 25% to 75%) shown in Figure 2B overlap between GRUN and the observations. For five catchments of large (n=2) and relatively small sizes (n=3) the IQR of the observations does not overlap with the GRUN runoff IQR. The three small catchments are Climate Type III (yellow) and the two large catchments 200 are Climate Type IV. In two catchments of moderate size the GRUN IQR is greater than the observed IQR runoff range.

Comparison by climate type
In all basins regardless of Climate Type, a general underestimation of the model is seen for the highest runoff 210 months, as noted above looking at the distributions by basin. This is especially evident in Climate Types I and II with pronounced wet seasons as also shown by their lower r-squared values ( Figure 3)

Correlation and trends with watershed characteristics
In this section we analyse two potential correlations between watershed characteristics and VE scores. In Figure 4 we show a weak positive correlation (r 2 = 0.041, p = 0.137) between VE and log(catchment size) (listed in Table 1) and a stronger negative correlation with mean runoff (r 2 = 0.182, p < 0.01). However, at low runoff there is significant spread in 225 VE score, driven primarily by Climate Type I catchments (red box and whisker plot in Figure 4C). These catchments experience distinct wet and dry seasons in the northwest Philippines. The positive correlation with catchment size is likely primarily due to the extreme wet months biasing results as described above. This is particularly evident from the Nash-Sutcliff Efficiency (NSE) and log(Nash-Sutcliff Efficiency) (NSE-log10) scores in Table 2 and Table A1. As discussed by Criss and Winston (2008) the NSE-log10 scores are in most cases significantly more skilful among our catchments because 230 extremely wet months are weighted less than using the raw runoff values. It is also notable that the VE scores using log10 values across the entire dataset is significantly improved (0.363 vs. 0.733; Table 2).
Previous studies have investigated the correlation between runoff and catchment size (Mayor et al., 2011), and the different hydrologic and geologic factors that cause non-linear relationships between these two variables (Rodriguez et al., 2014). Recently, Zhang et al. (2019) point out that runoff coefficients increase logarithmically as catchment size decreases. 235 Moreover, the same paper reports that smaller catchments are more sensitive to vegetation cover, slope, and land use compared to larger catchments. This implies that predictability of basin runoff for smaller catchments are increasingly more difficult due to variances in the compounding factors mentioned above but more importantly that runoff-catchment size relationship derived from large basins must be corrected when applied to smaller catchments. We hypothesize that these effects proposed by Zhang et al. (2019) are also influencing the Philippines streamflow dataset utilized in this study. As 240 such, we suggest that GRUN, while useful for studying trends, seasonality and average runoff from tropical catchments (e.g., Merz et al., 2011;Wanders and Wada, 2015) in the Philippines, is not suitable for extreme value analyses associated with major tropical storms during the wet seasons unless suitable bias corrections (see next section) can be effectively carried out.

Bias Correction and Outlook
Overall, the GRUN data underestimates the actual observed runoff from Philippine basins. The GRUN dataset shows a range of 0 to 10mm/day for most basins and up to 20mm/day for larger basins in the group. The observed maximum runoff values are on average higher by and exceed 50mm/day during extreme rain events (Figure 3). Furthermore, the GRUN dataset also appears to underestimate minimum flow in streams from highly seasonal catchments (e.g. Types I and 255 II). https://doi.org/10.5194/hess-2020-26 Preprint. Discussion started: 24 February 2020 c Author(s) 2020. CC BY 4.0 License.
These discrepancies may be a function of the direct interpolation of large basin characteristics to smaller basins as discussed in the previous section. The underestimation of runoff values during extreme rain events may also be a result of the fast saturation of the overlying soil and exceedance of rates of infiltration partly as a result of shallow aquifers filling up 260 and consequently the conversion of excess rainfall into direct runoff (e.g., Tarasova et al., 2018). On the other hand, the underestimation of flow during low flow events may be a result of not accurately accounting for stream baseflow which is fed by shallow aquifers.
Given the biases observed in our analysis and in particular the clear underprediction of GRUN during the wettest months we perform a bias correction of the GRUN dataset at a nationwide level using all the available filtered data used in 265 our analysis. We do so in a two-step process to both correct the mean offset and stretch the wettest months to higher values with all transformations occurring in log-transform space (i.e., as displayed in Figure 3). Thus, we first add the mean log10(runoff) difference between the observations and the predicted values (0.117). Following this, using the lm function in R, we fit a linear regression between the observations and the GRUN predicted values (log10(runoff, observed) = m × log10(runoff, predicted) + b) and correct the predicted values using the slope (m=0.774) and intercept (b=0.099) derived 270 from this regression. By carrying out these calculations in log-transform space the highest GRUN runoff values are the most significantly affected, which are the data points we have observed to be most underpredicted in our above analysis (Figures   2A and 3).
To assess this bias correction, we calculated RMSE values at a catchment, climate type and countrywide level  Table 2). Interestingly, the median RMSE value for Climate Type I and II catchments are not significantly improved, however, the RMSE range for both have been reduced (red and blue boxes in Figure 5, respectively). This analysis and the improvement of RMSE values, as well as some of performance metrics such as NSE (see 280 scores tabulated in Table 2), using a simple log-transform based bias correction demonstrates the importance of either: 1) including smaller catchments in future iterations of products such as GRUN, or 2) performing similar bias corrections on a country, region or even catchment scale as appropriate. This is particularly important given that taken at face value the proportional contribution of relatively small tropical land areas to global discharge accounting (e.g., Dai and Trenberth, 2002) would be underestimated without such corrections. 285

Conclusion
Using monthly runoff observations from catchments in the Philippines with more than 10 years of data between 1946 and 2014, we demonstrated across all observations significant but weak correlation (r 2 = 0.372) and skillful prediction (Volumetric Efficiency = 0.363 and log(Nash-Sutcliff Efficiency) = 0.453) between the GRUN-predicted values and actual 295 observations. Looking at different hydrometeorological regimes, we demonstrated that GRUN performs best among low rainfall catchments located in climate types III and IV and showed a weak negative positive correlation between volumetric efficiency and catchment area. Further, we found that particularly for smaller catchments, maximum wet season values are grossly underpredicted by GRUN. The application of a nationwide bias correction to stretch high runoff values using logtransform runoff values greatly improved the RMSE of the predicted values. We thus propose that the utilization of the 300 GRUN dataset can be extended to other ungauged tropical regions with smaller catchments upon applying a similar correction as described in this study. We acknowledge the help of Mart Geronia, Allen Marvin Yutuc, Iris Joy Gonzales, Jia Domingo and EJ Terciano Jr. for help with dataset compilation. DPWH-BRS, GISM and GRDC for datasets availability.

Data availability 315
Data was compiled from the DPWH-BRS, GISM and GRDC datasets (see links in text) and is made available as a supplemental file.

Author contributions
DEI and CPCD designed the study, DEI and PLMT carried out the dataset compilation and screening, PLMT 320 verified catchment areas, DEI and PLMT carried out the analysis, and DEI and CPCD prepared the manuscript with contributions from PLMT.

Competing interests
The authors declare that they have no conflict of interest. 0.546 0.733 n/a n/a n/a n/a 1.067 n/a n/a 36 Notes * For regressions forced through intercept of 0 ** Two-step bias correction procedure where first mean offset is added to the predicted GRUN values and then a log-transform stretch correction is applied (see text for details) Table 2: Results of statistical agreement between GRUN aggregated by Climate Type and for the entire dataset (see Table A1 for individual catchments) https://doi.org/10.5194/hess-2020-26 Preprint. Discussion started: 24 February 2020 c Author(s) 2020. CC BY 4.0 License.