It was nice to read this manuscript again. I still think that the data are unique and that the comparison with the GRUN dataset is useful, if only to show the errors that arise from blindly using these model outputs. However, I still have several main comments. These include 1) the need for some more comparisons of the observations and GRUN output (e.g., flow duration curves or flow percentiles) to strengthen the analysis, 2) the need to discuss the effects of errors in the rainfall data used in the GRUN ‘simulations’, 3) the mismatch between the interpretations regarding hydrological processes and the monthly time scale of the data, 4) the way the bootstrapping for the bias correction factor is implemented, and 5) the need for rewording some of the text to make it more accurate and improve readability.
L 100: I still do not understand the description of the data quality categories. What is meant by “the actual gauge height vs height computed”? I assume that you are looking at the rating curves here. Do you mean that xx% of the level observations are within the range of the measurements used to create the rating curve, or something else?
Figure 1: The font is very small – particularly when the figure is rescaled to the journal pages.
L189-197: This section is a mixture of explanations on what is discussed and shown in the next sections and some initial results. I would include the results in section 3.1 and significantly shorten the remainder of the section or remove it completely so that you can use the “word space” to show more comparisons or more thoroughly discuss the results.
L190: Is a VE of 0.50 really a reasonable prediction? Doesn’t it suggest an error of about 50%?
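To make the point explicit: assuming that VE is the commonly used volumetric efficiency (the manuscript may define it differently, in which case this does not apply), with Q_obs,t and Q_sim,t the observed and GRUN monthly flows, it would read:

```latex
\mathrm{VE} = 1 - \frac{\sum_t \lvert Q_{\mathrm{sim},t} - Q_{\mathrm{obs},t} \rvert}{\sum_t Q_{\mathrm{obs},t}}
```

Under that definition, VE = 0.50 means that the summed absolute errors equal half of the total observed flow volume.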
Section 3.1. Mention the range (and average) of the NSE and NSE(log) values. What are they and for how many of the catchments is it better than 0 or 0.5? This would actually tell me if GRUN has some skill in predicting the flow across a region.
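To illustrate the kind of summary I have in mind, here is a minimal sketch. The function names, the small constant eps added before taking logarithms, and the layout of the summary are my own assumptions and not taken from the manuscript.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors) / (sum of squared deviations of obs from its mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((s - o) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def nse_log(obs, sim, eps=0.01):
    """NSE on log-transformed flows; eps is an assumed small constant to avoid log(0)."""
    return nse([math.log(o + eps) for o in obs],
               [math.log(s + eps) for s in sim])

def summarise(scores):
    """Range, mean, and number of catchments above the 0 and 0.5 thresholds."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": sum(scores) / len(scores),
        "n_above_0": sum(x > 0 for x in scores),
        "n_above_0.5": sum(x > 0.5 for x in scores),
    }
```

Applying `summarise` to the per-catchment NSE and NSE(log) values would give the numbers requested above in one small table.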
Section 3.1 and 3.2: The structure of these sections is a bit confusing and leads to repetition. It is probably better/more logical to first discuss the pooled data as you do in 3.0, then the range of NSE and NSE(log) values for the individual catchments, the prediction of the average and median flows, the prediction of the interquartile range, and then finally the prediction of the peak flows and minimum flows.
Most of the comparison of the data (and section 3.1) focuses on the peak flows. This is interesting, but since this is the absolute peak it is also prone to errors in the data or just a mismatch between GRUN and the one month with the highest flows. What about also adding a comparison of some other metrics that describe the overall peak flow fits, such as the 95th or 99th percentile of the flow or the 5-year return period monthly flow? I would have liked a comparison of the flow duration curves as well. Overall, there could have been more analyses than just the mean, max and min flows that are currently included. I think that adding a few more comparisons would strengthen the manuscript.
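As a rough sketch of these additional metrics (the function names are hypothetical, and the linear interpolation and Weibull plotting positions are my own choices, not something prescribed by the manuscript):

```python
def flow_quantile(flows, p):
    """Flow that is not exceeded p percent of the time (linear interpolation between ranks)."""
    ordered = sorted(flows)
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def flow_duration_curve(flows):
    """List of (exceedance probability in %, flow) pairs, using Weibull plotting positions."""
    ordered = sorted(flows, reverse=True)
    n = len(ordered)
    return [(100.0 * rank / (n + 1), q) for rank, q in enumerate(ordered, start=1)]
```

Computing `flow_quantile(series, 95)` and `flow_duration_curve(series)` for both the observed and the GRUN monthly series of each catchment would allow a comparison that does not hinge on a single month.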
L226: A RMSE of 4.55 mm/d seems very large. Please put these values into perspective. How does this compare to the average flow?
L246-249: The explanation given here does not seem plausible. It would be fine if we looked at hourly or daily data, but here monthly data are used. It seems very unlikely that for the larger catchments (which are still not very big) the flood events last multiple months! Routing simply isn’t that slow. As far as I know, there are also not that many very large lakes in the Philippines that could buffer all this water for the larger catchments. Does it rather mean that small catchments are more dominated by fast flow pathways, such as subsurface stormflow, and larger catchments by slower pathways, such as groundwater flow? Although I think that streambed infiltration is important in some areas, I am not sure if, in such a wet country as the Philippines, there is really that much loss of water from the stream into the aquifers to delay the streamflow response by several months. I like the attempt to describe the differences in terms of hydrological processes (here and on L275-280) but think that the monthly time scale of the data isn’t fully considered in these interpretations. Yes, whether runoff is generated as overland flow or subsurface stormflow has a huge effect on the hourly or 5-min peak flows, but for the monthly runoff values this effect should be fairly small, as both flow pathways will transport the water to the stream within the monthly timescale.
The larger issue is likely the rainfall. For larger catchments, the average rainfall intensity and the variation in the rainfall are lower (due to the averaging over larger areas) and perhaps better predicted or represented by the GSWP precipitation data that are used in GRUN. Add some discussion on what is known about the bias in the GSWP precipitation data, and the bias in the variability of precipitation. There is currently no information on how any bias in precipitation for the Philippines in GSWP may have caused the huge bias in the GRUN streamflow. Considering the need for significant rescaling of the GRUN streamflow predictions, it seems that there must be a bias in the input (i.e., rainfall data) used for the streamflow predictions. Otherwise, the mass balance can’t work out. I think that more discussion on this is needed.
L287: I thank the authors for taking up the idea of bootstrapping but think that it is not done correctly here. Taking out individual months from a range of catchments is likely not so helpful because of the large amount of ‘redundant data’ in long time series. The question is how sensitive the bias correction factor is to the choice of the catchments or the number of catchments for which data are available. Thus instead of randomly taking out data points (from different times and different catchments), it would be better to exclude all the data from a certain number of catchments and to then determine how this affects the bias correction factor and the uncertainty in the bias correction factor. In fact, I would suggest that the authors do not only take out a fraction of the catchments for the bootstrapping but also test what the uncertainty of this factor would be if they had only data for one (or two or three or five) catchments per climate zone. This would be helpful for readers from other countries who may not have access to data from so many catchments to determine a bias correction factor.
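A minimal sketch of the catchment-level resampling I am suggesting is given below. Note the assumptions: I take the bias correction factor to be the ratio of total observed to total GRUN runoff, which may differ from the authors' definition, and the data layout (one dictionary with "obs" and "sim" lists per catchment) and all function names are hypothetical.

```python
import random

def bias_factor(catchments):
    """Assumed form of the factor: total observed runoff / total GRUN runoff over the given catchments."""
    total_obs = sum(sum(c["obs"]) for c in catchments)
    total_sim = sum(sum(c["sim"]) for c in catchments)
    return total_obs / total_sim

def subsample_factors(catchments, n_keep, n_rep=1000, seed=0):
    """Recompute the factor n_rep times, each time from n_keep whole catchments drawn at random."""
    rng = random.Random(seed)
    return [bias_factor(rng.sample(catchments, n_keep)) for _ in range(n_rep)]

def interval(values, lo=2.5, hi=97.5):
    """Approximate percentile interval taken from the sorted replicate values."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lo / 100 * (n - 1))], ordered[int(hi / 100 * (n - 1))]
```

Running `subsample_factors` per climate zone with `n_keep` set to 1, 2, 3 and 5 and reporting `interval` for each case would show how quickly the uncertainty in the factor shrinks as more catchments become available.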
L291-293: This requires some rewriting as the text and the logic are difficult to follow.
L324: This sentence is not clear. Are you really suggesting that even though the GRUN database was not intended to be used for predicting flow for individual catchments, it can be used that way after bias correction? I don’t think that you can conclude this based on your results!!