General comments:
The manuscript introduces a new framework for evaluating generated rainfall time series in terms of their ability to reproduce runoff time series characteristics. This is done by two tests, an integrated test and a unit test. The topic is of broad interest to the hydrological scientific community and suitable for publication in HESS.
This is my second review of this manuscript. The revised manuscript has been significantly improved in response to the review comments, and most of the issues raised in the first round of reviews have been addressed satisfactorily. The presentation and discussion of the method and the results are now clearer and more robust with the FDCs. However, some presentational aspects still need improvement, especially the method description. I have some follow-up questions on the responses and changes, plus some new suggestions, which need to be considered for the readability of the manuscript. I therefore recommend moderate revisions.
Major and minor comments:
P1l29 "…evaluating…the efficacy of SRM's…comparisons to observed rainfall or streamflow are limited." Maybe the authors want to replace the "or" with "and" to emphasize the streamflow comparison. Comparisons of generated rainfall time series with observed ones are state of the art and applied in the majority of rainfall generation manuscripts (as the authors point out later), so I would not consider this body of literature "limited".
P2l16 "poor predictive performance" In general, rainfall-runoff (r-r) models are calibrated beforehand with observed time series. This pre-calibration and a subsequent comparison of simulated and observed runoff characteristics allow assessing whether the hydrological model can be used to estimate the runoff characteristics. Hence, for the subsequent "observed-streamflow evaluation" it is, from my point of view, not challenging to ascertain the "poor performance" the authors point out later (p2l17). Maybe the authors can add a discussion of this issue to the manuscript. In general, several possibilities exist for calibrating the r-r models (using observed time series with a shorter length than analysed later in the comparison, with lower network density, with coarser temporal resolution, or from different data sources (satellite or radar data instead of station data), …) and these could be discussed in the manuscript.
P2l10 Data errors do not necessarily occur only for single catchments and not for others; they can also appear in one catchment but for a limited time period only. In combination with my previous comment, data errors should be identified before or during the calibration process. In general, I would exclude "data errors" and "r-r model structural errors" from the motivation for the introduced framework.
P4l15-17 Observed runoff is not required for the virtual hydrological evaluation, but it is required beforehand for the calibration of the model.
P5l1 The authors state that the framework "categorises performance at multiple spatial and temporal scales". The case study includes one catchment, simulated with a lumped r-r model at daily resolution. What are the multiple spatial and temporal scales?
P5l5 The authors state that one key objective of the manuscript is the "introduction of a formalised framework for the virtual hydrological evaluation of SRMs", but they state before (p4l28) that this framework was developed earlier by Bennett et al. (2018; the reference is missing in the reference list). What is the difference between the two frameworks? Is the "introduction of the framework" still a novelty, or is it the application of the existing framework introduced by Bennett et al. (2018)? Bennett et al. (2018) is referenced throughout the manuscript. It is necessary to explain the differences between the two studies and to enable a full understanding of the applied method/framework without reading Bennett et al. (2018).
P5l23-26 For the comparison of different rainfall data sets (observed vs. generated time series) the extraneous variables are kept. This introduces a new bias, since extraneous variables from a rainy day can be paired with a dry day from the generated rainfall time series (and vice versa), while for the observed time series a "perfect match" always occurs. Depending on the equation used to calculate potential evapotranspiration (not mentioned so far), for example, very sunny days with high radiation can coincide with all-day rainfall. This bias has to be quantified by a sensitivity study, because it is not related to the SRM generation itself. A possible solution would be to use a weather generator (to have a consistent weather data set as input for the r-r model) instead of only a rainfall generator (although that, of course, introduces biases from the other climate time series as well), but this may be beyond the scope of the study. A discussion of the newly introduced bias and its quantification would resolve this issue.
P5l23-26 Also, one aim of rainfall generation is to provide longer input time series than the existing observed ones. How can this be handled by the framework? For example, would a generated 600-year time series be split into "30 realisations" because only 20 years of observations exist? How are completely unobserved catchments (no climate data) validated with this method? Some information on these issues should be provided in the manuscript.
P6l7 Maybe the authors should consider replacing "compare" with "combine"? If a comparison is the intention behind Fig. 2, I cannot see how the comparison is carried out.
P6l14 The authors mention integrated tests here but introduce them a few sections later. This is a bit confusing for the reader. Either refer the reader to the subsection where the test is introduced or (what I would prefer) introduce the test before mentioning it.
P6l17 & Fig. 2 The CASE framework was mentioned only in the introduction (p4l28), and the provided reference is not included in the reference list. How (if at all) does the CASE framework differ from the framework applied in this study? If the CASE framework is important for the reader to follow the investigation, it should be (briefly) explained in the current manuscript (the reader should not be forced to read other manuscripts to be able to follow the current investigation).
P6l22 Is it also possible to take into account more than one primary streamflow characteristic?
P7l16 Again, the CASE framework should be explained before.
Fig. 3 The locations of the cases (i-iii) are hard to identify, since it remains unclear which element (the mean or the range) determines the position on the y-axis. I suggest putting both elements at the same level on the y-axis instead of having two different elements. Also, does the range result from different months, from the 10,000 realisations of the SRM, or from both? Would it be worthwhile to distinguish between the two ranges?
P7l20-22 How have the thresholds of 90 % and 99.7 % been determined? Especially the latter seems to have a certain origin and was not chosen arbitrarily (±3 standard deviations of a normal distribution?). The threshold should depend on the criterion used for the validation, right? So the threshold would differ for a derived flood with a 100-year return period compared to the mean discharge or some drought indices.
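If the 99.7 % threshold is indeed meant as the ±3σ coverage of a normal distribution, this is easy to verify (a minimal sketch; the Gaussian assumption is mine, not stated in the manuscript):

```python
import math

def normal_coverage(k):
    """Probability mass of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

print(round(normal_coverage(3.0), 4))  # 0.9973, i.e. the 99.7 % threshold
```

If this is the origin, it should be stated explicitly, together with the implied distributional assumption.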
P8l5-17 How is snowfall handled in this context? If snow falls and accumulates over the winter period, it depends on the temperature time series when it starts to melt and contribute to the total runoff. Again, in this context, how is the bias quantified? I can imagine that the differences are quite large, depending on the interplay between rainfall and temperature.
For example, if the SRM generates "precipitation" that, due to the temperature time series, falls as snow and accumulates over days/weeks until it contributes to runoff, this causes a large difference to the observed time series, which had no precipitation in that cold period. To my understanding this is not covered by the virtual hydrological framework and has to be quantified like the bias mentioned before.
P12l19-29 The description of the rainfall generator is not sufficient to understand it. For further details the reader is referred to a reference that is not provided in the reference list. This section has to be clear to the reader, since it is essential background for the evaluation of the introduced framework.
P12l30-31 Regarding the parameter estimation: parameters are estimated from 1914-1986, a period with almost no overlap with the calibration period of the hydrological model. Which period of the other climate variables is used for the simulations, the calibration period of the r-r model or the period that was also used for the estimation of the SRM parameters?
P13l2-4 Was the calibration carried out for the same period and with the same station density as in this study? If not, does this have an impact on the simulated runoff (e.g. if a station with a high influence (weight) on the areal rainfall was missing in the calibration period, but the rainfall generation for that station differs due to its altitude or snow/rainfall occurrences)?
Fig. 5: 7 out of 22 rain gauges have no influence on areal rainfall if Thiessen polygons are used: stations 21, 13, 19, 8, 2, 10, 8. Of these, 3 are explored in detail due to "the relatively poorer ability of simulated rainfall to reproduce annual streamflow totals at these sites". How can these stations have a negative impact on the simulated runoff if they do not contribute at all to the areal rainfall (if Thiessen polygons are applied)? Or did I miss a certain point in the description of the areal rainfall determination? Is the areal rainfall for the framework analysis estimated from only one station (or fewer than all 22)?
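To illustrate my point: with Thiessen weighting, the areal rainfall is a weighted sum of the gauge values, so gauges with zero weight cannot change it at all (a toy example; the weights and rainfall values are invented for illustration):

```python
import numpy as np

# Hypothetical Thiessen weights for five gauges; the last two have zero area share.
weights = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
rain_a = np.array([10.0, 5.0, 2.0, 8.0, 1.0])    # mm/day at each gauge
rain_b = np.array([10.0, 5.0, 2.0, 50.0, 30.0])  # only the zero-weight gauges differ

areal_a = float(weights @ rain_a)
areal_b = float(weights @ rain_b)
print(areal_a == areal_b)  # True: zero-weight gauges cannot change the areal rainfall
```

Hence the poor performance at those 3 stations cannot propagate to the simulated runoff unless a different areal aggregation is used.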
P13l5-6 To which streamflow data were the model parameters calibrated? Mean flow (on a seasonal basis), annual extreme flows or something different?
P13l11 "The same set" – Are all 100 parameter sets kept for all further analyses, or only the best one?
P13l Providing an NSC value is only useful if the authors state for which variable it was calculated. I assume for simulated daily discharge values?
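For context: assuming "NSC" refers to the Nash-Sutcliffe efficiency, it is defined per simulated/observed series, which is why the variable (and its time step) must be stated; a minimal sketch:

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]
print(nse(obs, obs))        # 1.0: perfect simulation
print(nse(obs, [2.5] * 4))  # 0.0: no better than predicting the observed mean
```

The same parameter set can yield very different values for daily flows than for, e.g., monthly aggregates.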
Table 3 The abbreviations used in the table header include event-based characteristics. How do the authors define an event (when does it start and end)? This information is important for the reader to understand the characteristics.
P15l6 I can't follow how 10 out of 22 rain gauges are categorised as "poor" if the areal mean is used as input for r-r modelling. Or are the r-r models driven by single-station input only?
P15l9-10 I would prefer to see the results for all 22 sites to test the framework in more detail.
P15l19-27 I struggle with the description of Fig. 7. The first column includes simulated rainfall statistics (the mean of daily rainfall amounts and the mean number of wet days), while the second shows the standard deviation of both. How is it decided, in the first column for one site, whether the quality is good, fair or poor when two criteria are taken into account? Do both criteria for one site have to be "good" to result in a final "good"? Or is the mean of both relative errors used for the final decision? A short explanation would be helpful.
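To make the ambiguity concrete, two plausible combination rules can give different final categories for the same site (the 5 %/10 % thresholds are my assumptions, not taken from the manuscript):

```python
def categorise(rel_error, good=0.05, fair=0.10):
    """Map an absolute relative error to a category (thresholds are assumed)."""
    e = abs(rel_error)
    return "good" if e <= good else "fair" if e <= fair else "poor"

def worst_case(err1, err2):
    """Rule A: the site takes the worse of the two categories."""
    order = {"good": 0, "fair": 1, "poor": 2}
    return max(categorise(err1), categorise(err2), key=order.get)

def mean_error(err1, err2):
    """Rule B: categorise the mean of the two absolute relative errors."""
    return categorise((abs(err1) + abs(err2)) / 2)

# Same site, same errors, different final category depending on the rule:
print(worst_case(0.03, 0.12))  # 'poor'
print(mean_error(0.03, 0.12))  # 'fair' (mean error 0.075)
```

Stating which rule is applied would remove the ambiguity.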
P16l1-3 The small deviation from the observed runoff in January results from the very low runoff generated in January (see Fig. 6), which the r-r model is trained to simulate; the rainfall has no effect in that month. It would be useful if the authors provided an example from the wettest periods (as done later).
Fig. 7 caption: Since the rainfall and runoff characteristics are mentioned before (p15l15-18), they can be removed from the caption.
Fig. 8 Why is there no range for the standard deviation (panels b and d)? There should be a range resulting from the number of months used for the simulation (15 years) and from the 100 parameter sets mentioned before.
Fig. 9 Again, shouldn’t there be a range for the virtual observations as well?
P22l20 Do the 10 % refer to the 10 "poor" stations out of 22 or to all 22 stations?
P26l11 I'm not familiar with the reference Ang & Tang (2007), and the information provided in the reference list is insufficient to find it (is it a book? "v1" sounds like a handbook or a model description). However, the r-r transformation is strongly non-linear, so an error propagation as mentioned in the text should not be carried out (not even as an example).
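The problem with first-order error propagation through a strongly non-linear transform can be shown with a toy example (y = x³ as a stand-in for the r-r relation; all numbers are illustrative only):

```python
import math
import random

random.seed(1)
f = lambda x: x ** 3  # toy stand-in for a strongly non-linear r-r transform
mu, sigma = 1.0, 0.5

# First-order (linear) error propagation: sigma_y ~ |f'(mu)| * sigma = 3*mu^2*sigma
sigma_lin = 3 * mu ** 2 * sigma  # 1.5

# Monte Carlo estimate of the actual spread of y
ys = [f(random.gauss(mu, sigma)) for _ in range(200_000)]
mean_y = sum(ys) / len(ys)
sigma_mc = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))

# Linearised propagation clearly underestimates the true spread (~2.2 vs 1.5)
print(sigma_lin, round(sigma_mc, 1))
```

For the r-r case the discrepancy is unquantified, which is exactly why the linearised example is misleading.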
P28l21-22 This sentence sounds as if it were possible to determine which rainfall characteristics are most important for r-r modelling. In fact, this was not done in the study. A few rainfall characteristics were analysed, but there was no detailed, quantitative investigation of which ones have the greater impact on the simulated runoff.
P29l1 It would be useful if the authors could provide references for the conventional approaches.
Technical corrections:
P2l9 The reference Bennett et al. (2018) is missing and could not be checked regarding its content and relevance for the current manuscript.
Fig. 9, 10, 13 “cummulative” -> “cumulative”
P13l6 Fig. 5 instead of Fig. 4? But I can't see the Houlgrave weir in Fig. 5 anyway.
P24l6 “sits” is an inappropriate verb in this context
P27l8 “identifying” -> “identifies”
P27l13 Remove the comma.
P27l25 “a hydrological model” -> “a single hydrological model” Maybe this is what the authors want to say?
Comments from the first review:
Reply to comment 2: “This is the first time the virtual-observed streamflow evaluation approach has been formalised using a Comprehensive and Systematic Evaluation (CASE) framework (pioneered by Bennett et al., 2018 and used by Evin et al. 2018, Khedhaouiria et al. 2018) to evaluate stochastic rainfall models in terms of the ability to produce key runoff statistics of interest“
-> From the manuscript it remains unclear what exactly the CASE framework is (is it the applied framework illustrated in Fig. 2?); the reference for explanations (Bennett et al., 2018) is missing.
„(iii) systematically categorise aggregate performance over multiple spatial and/or temporal scales“
-> Only one spatial and one temporal scale are analysed.