What controls the tail behaviour of flood series: rainfall or runoff generation?

Macdonald, Elena; Merz, Bruno; Guse, Björn; Nguyen, Viet Dung; Guan, Xiaoxiang; Vorogushyn, Sergiy

doi:https://doi.org/10.5194/hess-28-833-2024

Articles | Volume 28, issue 4

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 21 Feb 2024

Final revised paper (published on 21 Feb 2024)
Preprint (discussion started on 31 Jul 2023)

Interactive discussion

Status: closed

RC1:
'Comment on hess-2023-186', Anonymous Referee #1, 28 Aug 2023
The manuscript proposes a derived flood frequency analysis based on the combination of a stochastic weather generator and a relatively complex lumped rainfall-runoff model, namely the HBV model (Parajka, 2007), with 15 parameters. Various stochastic rainfall series are generated according to a GEV random distribution, using three different mean daily precipitation values and varied shape parameter values. For the RR model, some parameters are fixed and others are varied over reasonable ranges to cover a large spectrum of possible rainfall-runoff dynamics. These choices are based on previous publications and on a sensitivity analysis focused on the activation of the “very fast runoff” component of the HBV model, which controls the magnitude of the generated extreme events. The analysis of the relation between these dynamics, the statistical characteristics of the generated rainfall series and the shape of the resulting generated flood frequency distributions is the central focus of the proposed paper.
The manuscript is potentially interesting, but it contains several weaknesses and approximations that lead to misinterpretations and erroneous conclusions, which absolutely must be corrected before it can be published.
The proposed analysis is based on the determination of a return period beyond which the distributions of (a) the rainfall depth Pct over a duration equal to the time of concentration of the watershed and (b) the peak discharges Q can be considered to be parallel (i.e. the discharge distribution is entirely controlled by the rainfall distribution because the very fast component dominates the generated runoff in the RR model). The authors argue that “for impervious catchments, the curves of Pct and Q are assumed to run in parallel” (Page 8, L. 185). This interpretation key is false! And this is highly problematic for a manuscript aiming at drawing general conclusions on the shapes of flood peak distributions. In fact, for impervious catchments, the curves of Pct and Q are not only parallel, but superimposed (a simple consequence of the rational formula, see also Gaume 2006). This error is due to the fact that the authors surprisingly neglect the importance of the deep percolation parameter Cperc in the HBV model, although it is clearly visible in the sensitivity analysis (Figure 4). If the deep percolation is not set equal to zero, a significant deep percolation will remain in the model, even for large and rare rainfall events, and the RR model can therefore not be considered to represent the behavior of an impervious catchment, contrary to what the authors state (FC = Luz = 1 mm, line 185). I strongly suspect that the distance between the distributions of Pct and Q is controlled by the parameter Cperc. The influence of the parameter Cperc should therefore be considered and analyzed in the manuscript. At the very least, runs where this percolation is set to 0 must be considered in the analysis as reference runs for asymptotically impervious watersheds. It must be clearly stated that “for impervious catchments, the curves of Pct and Q are superimposed” (this is a straightforward statement…).
I leave it to the authors and the readers to consider whether it is realistic to think that a deep percolation flux, feeding a very slow runoff component, can remain significant for extreme events. In many cases, this would appear to be an unrealistic assumption.
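
For reference, the rational formula invoked in this comment can be written in its common textbook form. The symbols are introduced here for illustration only and are not taken from the manuscript: $Q_T$ is the peak discharge of return period $T$, $C$ the runoff coefficient, $A$ the catchment area, $t_c$ the time of concentration, and $P_{ct,T}$ the rainfall depth over $t_c$ with return period $T$.

```latex
Q_T = C \, \frac{P_{ct,T}}{t_c} \, A
\qquad\Longrightarrow\qquad
\frac{Q_T \, t_c}{A} = P_{ct,T} \quad \text{for } C = 1 .
```

With $C = 1$ (no losses), the discharge quantiles expressed as a depth over $t_c$ coincide with the rainfall quantiles, which is the "superimposed" case. A constant $C < 1$ would shift the curve by $\log C$ on a logarithmic axis without changing its slope, which corresponds to the "parallel" case.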

Second, the authors must use the term “tail behavior” rigorously. In the same manuscript, they conclude both 1) that the distributions of Pct and Q are asymptotically parallel, defining a threshold return period based on this property, and 2) that the distributions of Q have heavier tails than the distributions of Pct, based on the estimated shape parameters (figure 7 in particular). The two statements are clearly incompatible. The reason is that the generated distributions of Q are obviously not of the GEV type and show a more or less rapid transition phase (see figure 3). It is possible to calibrate the shape parameter of the best-suited GEV distribution based on the generated discharge series. However, since the Q distributions clearly differ from the GEV distribution, the estimated shape parameter value may encapsulate some information about the “transition phase” of the distribution of Q (i.e. its sharpness), but it by no means informs about the tail of this distribution. The authors should correct their misleading interpretations and conclusions on the tails of the distribution of generated peak discharges and should not consider the estimated shape parameters as characterizing the tail of the discharge distributions. This is really an essential point.

The scale and position parameters of the rainfall GEV random model are adjusted to fit three values of mean annual precipitation, i.e. MAP serves as a calibration parameter for the random generator. But the analysis and discussion of the manuscript suggest that some behaviors may be related to the MAP (see fig. 6 and the corresponding discussion, for instance). This is clearly an over-interpretation. Do the results depend on the MAP per se, or on the statistical characteristics of the generated extreme rainfall events, which are fixed according to the parameter MAP? The latter statement is true, but the fact that the MAP by itself may be an explanatory factor is never demonstrated: is a more frequent saturation of the soil and sub-soil during wet periods really observed, and does it have a decisive impact on the results? Other choices for the rainfall random generator could have produced similar distributions for the extreme daily rainfall amounts with varied MAP. Would the discharge peak distributions then depend on the MAP? I wonder. Unless they provide evidence that the annual total rainfall amount has a real impact on the results, I suggest that the authors avoid ambiguity and unsupported conclusions and replace references to the MAP with references to the average value of the generated rainfall amounts in the interpretation and conclusions.

The authors conduct their analysis on generated series of different lengths, i.e. from 60 to 6000 years. In doing so, they mix two completely different issues and introduce confusion. The first question, which I suspect is the central question of the manuscript, is the relation between the distributions of Pct and Q, depending on the RR processes and on the rainfall statistics. To tackle this question, it is important to get the most accurate definition of both distributions and hence to work on long series (i.e. 6000 years, for instance). The second question, which is an interesting but different one, is related to the estimation of the characteristics of both distributions, once defined, in real-life applications, when the lengths of the available measured series are limited. Estimation uncertainties related to sampling may introduce significant variability and blur the general picture. It is not uninteresting to mention this second question, to connect this theoretical work to real-life applications, but only once the answers to the first question are settled. The two questions must be clearly separated in the manuscript to avoid any confusion.

Except for figure 3, which corresponds to one specific case study, no illustration of the generated distributions is provided in the manuscript, and the results are analyzed through aggregated values (estimated shape parameters, threshold return periods). It is essential to provide such illustrations. I would, for instance, suggest representing the distribution of Pct along with the distributions of the various generated discharge series on the same graphic, to show the wide spectrum of generated peak discharge distributions and their convergence speed towards the distribution of Pct, or towards a distribution parallel to it if the deep percolation is not asymptotically equal to 0 (see comment 1). This graphic could easily be repeated for all the settings of the stochastic rainfall model (combinations of scale and shape parameter values). Of course, the figure must be established for the longest available series. It may be repeated for shorter series to illustrate the effect of sampling variability and how it blurs the general picture. Such a figure would present the obtained results clearly to the readers and illustrate one of the conclusions of Gaume (2006): concerning the shape of flood peak distributions, the range of possibilities is extremely large.

The authors should remain prudent when extrapolating their results to real life, especially the typical range of return periods of their defined threshold. The rainfall stochastic model and the RR model remain simple approximations of real-life situations. Moreover, the RR model is implemented well beyond the range of events against which it could be calibrated and evaluated. Who really knows how watersheds behave during extreme rainfall events, and what types of thresholds, non-linearities or discontinuities may appear when extreme floods occur? Existing RR models, based on continuous and asymptotically converging reservoir models, generally produce, in my experience, smoothed evolutions of the RR relation. They may therefore help explore only part of the range of possibilities for the shape of flood peak distributions when used in derived flood frequency approaches. This limitation should be acknowledged in the manuscript.

I attach an annotated manuscript complementing this review. The discussion part, which contains several questionable statements, should also be revised in depth.
Citation: https://doi.org/10.5194/hess-2023-186-RC1
- AC1:
  'Reply on RC1', Elena Macdonald, 06 Sep 2023
  We thank the reviewer for the thorough review and comments. We agree with some of the points, find them valuable and will make changes accordingly. However, in some cases, we believe that some of the aspects that the reviewer sees as shortcomings are only evaluated that way based on a different set of assumptions than are valid for our work. In the following we will address each of the points raised to clarify where and why we see things differently, and also to state which points we will adopt and what changes we will make to the manuscript.
  Point 1 (parallelism of Pct and Q, effect of Cperc, impervious catchments)
  We thank the reviewer for raising this point and bringing it to our attention that we set the criteria for what can be considered an impervious catchment too wide. Indeed, we should only consider the model runs to represent impervious conditions where – in addition to FC and Luz – also the percolation parameter Cperc has a value very close to 0. In our simulations the minimum value of Cperc is 0.00042 mm/h. We decided against setting any model parameter to 0 in any of the model runs as we wanted to ensure that the same processes are acting in all model runs, just with different intensity or relative importance. To reflect that neither FC nor Luz nor Cperc are exactly 0, we will change the wording from “impervious catchments” to “close to impervious catchments”. This will then refer to all model runs for which FC = Luz = 1 mm and Cperc = 0.00042 mm/h. In doing so, the number of model runs which are considered to have close to impervious conditions decreases from 35 to 7. When repeating the analysis for finding the duration of P which best represents the concentration time of the catchment with this stricter definition of close to impervious catchment conditions, the results stay qualitatively the same – the concentration time is still considered to be 6h. Figure 1B will be updated accordingly.
  
  We find that the distance between the distributions of Pct and Q does indeed change with the parameter Cperc, as suspected by the reviewer. However, this change is marginal. With regards to the influence of catchment characteristics on the return period beyond which the distributions of Pct and Q run in parallel, we analysed the effect of Cperc and found that it does not show a notable influence (l. 235, Fig. B2).
  
  Regarding the statement that “for impervious catchments, the curves of Pct and Q are not only parallel, but superimposed” we must disagree, at least for our model set-up. Even for impervious catchments, evapotranspiration remains active and therefore Q will always be lower than Pct (see also attached figure). In the HBV model, a part of the rainfall leaves the system as evapotranspiration and will never become runoff. This is true also during rainfall events, as the actual evapotranspiration in each time step is linked only to the potential evapotranspiration and the state of the soil moisture storage, but not to the incoming rainfall. In fact, when checking this for one of the model runs with close to impervious conditions, the difference between P and Q corresponds to the actual evapotranspiration. Pct and Q would be superimposed for impervious catchments when evapotranspiration is neglected or when only effective rainfall is considered. We therefore stick to the assumption that the curves of Pct and Q run in parallel (l. 185) instead of stating that they must be superimposed as requested by the reviewer. For the analysis where we consider close to impervious conditions, i.e. finding the duration of P which best represents the concentration time of the catchment, the distance between the curves of Pct and Q is not of relevance but only their slopes.
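
The remark that only the slopes matter, not the distance between the curves, can be illustrated with a minimal sketch. It assumes Weibull plotting positions for the empirical return periods and a least-squares slope in log-log space; neither choice is stated in the reply, and all function names and data values are hypothetical. The loss is assumed to be purely proportional for simplicity, whereas the reply attributes the gap to evapotranspiration, which need not be proportional.

```python
import math

def empirical_return_periods(annual_maxima):
    """Pair each value with its empirical return period (Weibull plotting position)."""
    n = len(annual_maxima)
    ordered = sorted(annual_maxima, reverse=True)  # rank 1 = largest value
    return [((n + 1) / rank, value) for rank, value in enumerate(ordered, start=1)]

def loglog_slope(points):
    """Least-squares slope of log(value) against log(return period)."""
    pairs = [(math.log(t), math.log(v)) for t, v in points]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

# Hypothetical annual maxima of 6 h rainfall depth (mm); illustrative values only.
p_ct = [22.0, 25.5, 28.0, 31.0, 35.5, 41.0, 48.0, 57.5, 70.0, 90.0]
# Assumed proportional loss of 20 %; a stand-in for illustration, not the HBV model.
q = [0.8 * p for p in p_ct]

slope_p = loglog_slope(empirical_return_periods(p_ct))
slope_q = loglog_slope(empirical_return_periods(q))
offset = math.log(max(p_ct)) - math.log(max(q))  # vertical distance in log space
```

Under this assumption `slope_p` and `slope_q` agree while `offset` stays positive, so the curves are parallel but not superimposed.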
  
  With regards to the last point, we agree with the reviewer that it seems unrealistic that a deep percolation flux can remain significant for extreme events. However, we do not see that we would make such an assumption. The highest value of Cperc adopted in any model run is 0.25 mm/h which is one to two orders of magnitude lower than the hourly rainfall during extreme events. We argue therefore that the percolation rate is close to negligible compared to the influx of rainfall during extreme events, rather than remaining significant. As already stated in l. 304, the percolation rate seems to act on a longer timescale than is relevant for the generation of extreme runoff.
  
  Point 2 (the term “tail behaviour”)
  We thank the reviewer for this comment. In our study, we are considering both asymptotic and pre-asymptotic properties and are aware that this can lead to seemingly contradictory conclusions. Please, refer to the discussion in Merz et al. (2022) on this issue. While we addressed this in the manuscript (e.g. l. 36, l. 315, l. 356), the reviewer’s comment shows that we did not make it sufficiently clear yet. When comparing the distribution curves of Pct and Q, we argue that the distributions run in parallel beyond a certain threshold return period. This can be considered an asymptotic property of the distributions. When fitting GEV distributions to Pct and Q, and using their shape parameters for characterizing the tail behaviour, this should be considered a pre-asymptotic property given that the time series for fitting the distribution is of limited length. As rightfully mentioned by the reviewer, the shape parameter of a GEV distribution fitted to a limited time series does not reflect the “true” tail behaviour of the underlying distribution. It can only be considered as an approximation of the true tail behaviour, and this approximation becomes increasingly uncertain the shorter the available time series is. We also agree with the reviewer that the GEV distribution is not the best fit for some Q series (see also the discussion in l. 356) – nevertheless, we fit GEV distributions as this is common practice in hydrology when analysing annual maxima of observed series. However, one needs to keep in mind that especially in these cases the GEV shape parameter does not reflect the true tail behaviour of the underlying distribution. Nevertheless, it can help with achieving more accurate estimations of the occurrence probabilities of extreme events in light of limited time series. 
To make the distinction between the true tail behaviour of the underlying distribution and the tail behaviour of the fitted GEV distribution clearer, we will adopt the following terminology: when fitting a GEV distribution, we will refer to the tail behaviour as characterized by the GEV shape parameter as “apparent tail behaviour” in contrast to the true tail behaviour. In the manuscript, we will add the following sentences in the methods section (around l. 167): “It should be noted that the shape parameter of a GEV distribution fitted to a time series of limited length does not necessarily reflect the true tail behaviour of the underlying distribution but is only an approximation thereof. When fitting GEV distributions to subsets of a time series of different lengths, the shape parameters may vary due to differences in the estimation uncertainties. To reflect this, we will use the terminology “apparent tail behaviour” when drawing conclusions based on the GEV shape parameter of a distribution fitted to a limited time series.”
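
The dependence of the estimated shape parameter on series length can be illustrated with a minimal sketch. It draws a synthetic series from a known GEV distribution and estimates the shape parameter from sample L-moments using Hosking's approximation. This estimator is swapped in here so that the example needs only the standard library; the reply does not state which fitting method the authors use. The seed, the true shape value of 0.2 and all function names are assumptions. Because the sample is exactly GEV-distributed, the sketch shows only the sampling uncertainty, not the effect of a non-GEV parent distribution.

```python
import math
import random

def sample_gev(n, shape, loc=0.0, scale=1.0, rng=random):
    """Draw n GEV variates by inverse transform (shape != 0; shape > 0 is heavy-tailed)."""
    values = []
    while len(values) < n:
        u = rng.random()
        if u <= 0.0:
            continue  # skip u = 0 to avoid log(0)
        e = -math.log(u)  # standard exponential variate, strictly positive
        values.append(loc + scale * (e ** (-shape) - 1.0) / shape)
    return values

def gev_shape_lmoments(sample):
    """Estimate the GEV shape from sample L-moments (Hosking's approximation)."""
    x = sorted(sample)
    n = len(x)
    # Unbiased probability-weighted moments b0, b1, b2 (index j is rank - 1).
    b0 = sum(x) / n
    b1 = sum((j / (n - 1)) * x[j] for j in range(n)) / n
    b2 = sum((j * (j - 1) / ((n - 1) * (n - 2))) * x[j] for j in range(n)) / n
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    t3 = l3 / l2  # L-skewness
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c  # shape in Hosking's sign convention
    return -k  # flip the sign so that positive values mean a heavy tail

rng = random.Random(42)
true_shape = 0.2  # assumed value for illustration
long_series = sample_gev(6000, true_shape, rng=rng)

shape_long = gev_shape_lmoments(long_series)
# Shape estimates from 100 consecutive 60-year subsets of the same series.
shapes_short = [gev_shape_lmoments(long_series[i:i + 60]) for i in range(0, 6000, 60)]
spread_short = max(shapes_short) - min(shapes_short)
```

In such an experiment the 6000-year estimate lies close to the true shape, while the 60-year estimates scatter widely around it. That scatter is the sampling side of the "apparent tail behaviour" described above.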
  
  When considering this distinction between asymptotic and pre-asymptotic properties, the two statements mentioned by the reviewer are compatible. Even when the distributions of Pct and Q are asymptotically parallel, their apparent tail behaviour as characterized by the GEV shape parameter can be quite different from one another.
  
  Point 3 (MAP)
  Thanks for the comment. First, we need to clarify two things: 1) the stochastic weather generator is not a “rainfall GEV random model” – it uses an extended Pareto distribution and not a GEV distribution for generating rainfall time series (l. 117), and 2) the MAP does not serve as a calibration parameter in the way that the extended Pareto distribution is fitted to three different MAP values. Instead, rainfall time series are generated based on one MAP value and afterwards shifted to three different levels of MAP by multiplying the hourly rainfall depths with a factor (l. 135). Through this approach, two rainfall time series with the same tail behaviour only differ in their mean rainfall level, but not in other aspects such as the frequency of wet days. Because of this, MAP and the mean event rainfall depth, suggested as an alternative measure to MAP by the reviewer, are highly correlated. In our study, we adopt MAP as a simple measure for quantifying the general wetness of a catchment. We agree with the reviewer that we could also adopt the mean event rainfall depth as a measure. As mentioned above, the two are highly correlated in our set-up and so the results are qualitatively the same independent of which of the two is chosen. We decided to use MAP as it seems to be a more straightforward and widely-used representation of overall catchment wetness. Instead, the mean volume of precipitation events is not a common indicator and might be ambiguous to define and use especially with increasing catchment scale. This is especially true for defining the three wetness levels to which the rainfall series are shifted, and we prefer to stick to the same measure for representing catchment wetness throughout the manuscript. To avoid over-interpretation as pointed out by the reviewer, we will add a statement clarifying that we use MAP as a measure of overall catchment wetness and that it controls the mean event rainfall depth.
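
The rescaling step described in this reply can be sketched as follows. This is a minimal illustration, assuming a single multiplicative factor applied to every hourly value; the function names and the toy series are hypothetical and not taken from the weather generator.

```python
def rescale_to_map(hourly_mm, n_years, target_map_mm):
    """Multiply all hourly depths by one factor so the mean annual total hits the target."""
    current_map = sum(hourly_mm) / n_years
    factor = target_map_mm / current_map
    return [depth * factor for depth in hourly_mm], factor

def wet_fraction(hourly_mm):
    """Share of time steps with rainfall greater than zero."""
    return sum(1 for depth in hourly_mm if depth > 0.0) / len(hourly_mm)

def mean_wet_depth(hourly_mm):
    """Mean depth over wet time steps, a crude stand-in for mean event rainfall depth."""
    wet = [depth for depth in hourly_mm if depth > 0.0]
    return sum(wet) / len(wet)

# Hypothetical toy series standing in for one "year" of hourly rainfall (mm).
base = [0.0, 0.0, 1.2, 3.4, 0.0, 0.5, 0.0, 12.0, 0.0, 0.9]
scaled, factor = rescale_to_map(base, n_years=1, target_map_mm=2 * sum(base))
```

Because every value is multiplied by the same factor, the wet fraction and the ratio of the maximum to the mean are unchanged, while the mean wet depth scales with MAP. This is why the two measures are highly correlated in such a set-up.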
  
  Point 4 (different time series lengths)
  We thank the reviewer for pointing out that we have not separated the two aspects clearly enough. While the analysis of a threshold return period beyond which Pct and Q run in parallel is conducted only for 6000-year long series, the comparison of estimated GEV shape parameters between P and Q and the influence of the runoff generation on the apparent tail behaviour are addressed for different time series lengths. We will restructure the respective parts of the results and discussion sections (especially from l. 252 and l. 313) so that the results based on long time series are thoroughly presented and discussed before moving to shorter time series and the meaning of our findings for real-life applications with limited data availability.
  
  Point 5 (illustration of the generated time series)
  Thanks for the great suggestion. Such a figure will indeed enrich the manuscript as it clearly presents the “raw” results before aggregation. A first version of such a figure is attached: distributions of 6000-year long Pct series with 7 different tail behaviours and 3 different MAP levels are shown along with the distributions of all respective simulated Q series. This figure, or a revision of it, will be added to the results section.
  
  Point 6 (extrapolation of results)
  We are aware of the limitations of extrapolating simulation results to the real world and state this in l. 380: “Furthermore, our findings are based on synthetic catchments and simulation runs. While such an approach has major advantages like the generation of long time series, results are not always directly transferable to the real world.” In our case, the results are also limited in the way that we consider homogeneous catchment conditions and do not include spatial variability of rainfall or catchment characteristics (see l. 373). However, we agree with the reviewer that these limitations should be made more explicit. Following the reviewer’s suggestion, we will expand this part as follows: “In the adopted rainfall-runoff model only one nonlinearity in the runoff generation was considered, namely the activation of an additional very fast runoff component. However, in a real catchment multiple nonlinearities and process shifts might be present such as the onset of overland flow, the onset of subsurface stormflow, the activation of macropores or the temporary expansion of the river network. The model does not include all these processes explicitly and is therefore, as all models, a simplified representation of reality. Hence, the simulated flood peak distributions are also only representative for this simplified reality. Nevertheless, they can help us explore results which can be valuable for real-world applications.”
  
  For the final response to the reviewer we will also thoroughly address all comments made in the annotated manuscript.
  
  Citation: https://doi.org/10.5194/hess-2023-186-AC1
  - RC3: 'Reply on AC1 : Parallelism of Pct and Q, effect of Cperc, impervious catchments', Anonymous Referee #1, 07 Sep 2023
    
    If I understand the authors' answer correctly, the parallelism is attributed to the fact that evapotranspiration remains active in the proposed model setting, even during heavy rainfall events. This, of course, must be clearly explained in the revised version of the manuscript. In fact, since the atmosphere is, by definition, saturated or close to saturation during rainfall events, especially during significant rainfall events, evapotranspiration is drastically reduced. Maintaining a high value of evapotranspiration during rainfall events is a shortcoming of the proposed model settings and is quite unrealistic, and this should also be acknowledged. I must insist on this point because it is key information: for impervious catchments and spatially uniform rainfall intensities, the distributions of Pct and Q should, at least asymptotically, not only be parallel but also superimposed if physically realistic conditions are considered for the RR relation!
    
    Citation: https://doi.org/10.5194/hess-2023-186-RC3
    
    AC2: 'Reply on RC3', Elena Macdonald, 26 Sep 2023
    
    Yes, this was understood correctly. In the HBV-like model that we are using, evapotranspiration can take place even during rainfall events. This is indeed a shortcoming of the model and will be acknowledged in the revised manuscript.
    
    Citation: https://doi.org/10.5194/hess-2023-186-AC2
RC2:
'Comment on hess-2023-186', Anonymous Referee #2, 06 Sep 2023

The authors examine the relationship between upper tail behaviour of rainfall and flood peak distributions through analyses that are based on a stochastic weather simulator and a rainfall runoff model. Results are used to conclude that “runoff generation can strongly modulate the behaviour of flood distributions”… “threshold processes in runoff generation lead to heavier tails”… and that “for return periods that are mostly of interest to flood risk management, runoff generation is often a more pronounced control of flood heavy tails than precipitation”. The modeling and analysis chain used in this study includes assumptions, approximations and subjective judgements that are not compatible with the strong conclusions that are drawn. The analyses are interesting and provide the foundation for a useful paper, with more modest scope and expanded treatment of uncertainties in the analyses.
Specific issues/questions with the modeling / analysis chain are enumerated below:
1) Is the stochastic weather generator suitable for representing the upper tail of rainfall? Do observations from Bamberg provide a suitable grounding for a general assessment of rainfall extremes? Why vary the GP shape parameters between 0.2 and 2.0? Do the 0.9/0.1 day/night PET scaling assumptions have an impact on results? The weather generator produces daily rainfall, but the authors note that shorter duration data are needed for modeling studies of a 50 km2 watershed. Is the “Method of Fragments” suitable for reproducing the climatology of sub-daily time scales? Each of these issues requires supporting arguments to justify the strong conclusions. The reliance on observations from a single site in Germany and the method used for producing hourly observations are of particular concern.
2) Is the rainfall-runoff model suitable for drawing strong conclusions about upper tails of flood peaks? The assumption of a homogeneous catchment over a 50 km2 scale removes the capability of assessing important processes that can contribute to flood peak response. Does spatial variability of rainfall contribute to tail behaviour at 50 km2 scale? How does this change for 10 km2 scale and 1000 km2 scale?
3) It is disconcerting that 300 of the 1000 precipitation series are discarded because of a large, estimated shape value. Having thresholds based on estimated shape parameters can contort inferred distributions in simulation studies like those used in this paper. How are results dependent on the 0.37 threshold for precipitation shape parameters? Was 0.37 chosen because it is a little larger than 0.33? The principal concern is that a subjective decision on the extreme nature of rainfall is an important component of the analysis chain that leads to the conclusion that flood peaks do not depend as strongly on rainfall as on runoff production.
4) Model simulations are grounded in subjective decisions concerning model parameters. Sensitivity analyses of model parameters are used to select a sub-set for numerical experiments. These parameters are varied across a “reasonable range”, with the other parameters set to fixed values based on previous studies in Austria. Strong arguments are needed to support the assumptions that the HBV model with the range of selected parameters captures the world of extreme flood response.
5) How well does the upper subsurface storage in the HBV model represent threshold-dependent flood response, especially given the assumption of spatial homogeneity? Is the approach suitable for a broad range of basin scales? Choosing a storage-dependent variable for runoff production is a strong assumption that is likely violated in some settings, especially for the most extreme events. Incorporation of runoff processes that are more sensitive to rainfall rate could lead to markedly different conclusions. Infiltration excess runoff mechanisms, and their variants, play an important role for extreme floods in many settings, especially for arid/semi-arid regions (which are prominent settings for inferred heavy tails of flood frequency distributions).
6) The procedure used to examine threshold return periods beyond which the flood peak distribution is governed by the rainfall distribution (evaluating slopes of rainfall and flood distributions on log-log-plots) is based on ad-hoc procedures with subjectively chosen parameters.
The authors note that “we fit one GEV distribution to the data even when we know that there is a process shift in the runoff generation, which actually violates the assumption of independent and identically distributed values for distribution fitting”. This is a serious issue, which is tied to the broader issue raised in the introduction concerning the dependence of estimated upper tail behaviour of observed time series given the sensitivity to the “largest few events”.
The authors have deployed a broad array of simulation and modeling tools to address an interesting and important problem. The assumptions and subjective decisions needed to implement this array of tools create the most serious obstacles to supporting the expansive conclusions that are reported.

Citation: https://doi.org/10.5194/hess-2023-186-RC2
- AC3:
  'Reply on RC2', Elena Macdonald, 26 Sep 2023
  We thank the reviewer for the very helpful review and comments. We have thoroughly checked our assumptions and conclusions, and it seems like some conclusions came across stronger than they were intended to. Thanks for bringing this to our attention. We have now addressed all the points raised by the reviewer and will weaken some conclusions where deemed appropriate, for example by adding that the findings mainly hold for small, homogeneous catchments and Central European conditions. Below are our detailed responses to the reviewer’s comments, along with the respective changes that we will make in the manuscript.
  1) Thanks for raising these questions. We will address them one by one:
  In the stochastic weather generator an extended Generalized Pareto (extGP) distribution is used to simulate precipitation. The Generalized Pareto distribution can be heavy-tailed and is suitable to capture extreme rainfall values, but might miss the lower bulk (Nguyen et al., 2021). To address this, a previous version of the weather generator used a mixed, i.e. two-component, distribution. However, such a distribution is defined by a higher number of parameters and fitting it to precipitation data is “not a trivial task” (Nguyen et al., 2021). Using a more parsimonious extGP distribution “helps move away from the mixed distribution concept but still allows a smooth transition between bulk of the distribution and the heavy tails” (Nguyen et al., 2021). When evaluating the weather generator based on the extGP distribution for 528 stations in Central Europe, Nguyen et al. (2021) found that both the daily mean and the extreme (99.9th percentile) precipitation intensities are well captured. We are therefore convinced that the weather generator is suitable for our study. We will add the following sentence in the methods section (l. 112): “The weather generator has been evaluated to capture both the daily mean and the extreme (99.9th percentile) precipitation intensities well for a large set of weather stations in Central Europe (Nguyen et al., 2021).”
  
  The observations from Bamberg were chosen for setting up the weather generator because long daily and hourly records are available at this station. First, we estimated the extGP distribution parameters from the station's daily data. Then, we set the upper tail shape parameter to a range of reasonable values to create time series with different frequencies of extremes. For our study, the main requirement is a large set of different rainfall extremes, and varying the extGP upper tail shape parameter achieves this. “Through the manipulation of the upper tail shape parameter to a range of values, time series with different degrees of extreme frequency are created, despite using observations from just one station as initial input.”
  
  The shape parameters of the extGP distribution were not actually varied between 0.2 and 2.0, but the shape parameter of the distribution fitted to the data from Bamberg was multiplied with factors in this range. This way, the extGP shape parameters covered the range of shape parameters from distributions fitted to observations from 528 weather stations in Central Europe. These are the stations used by Nguyen et al. (2021) for calibrating and evaluating the weather generator. Factor 2.0 leads to shape parameters that are higher than most of the “observed” shape parameters. However, we wanted to create a wide range of tail behaviours and rather narrow it down afterwards based on GEV shape parameters as our analysis is based on annual maxima and not on daily data (see also our response to point 3). We will add in the methods section (l. 119): “This way, the upper shape parameter covers the range of values that was found when fitting extGP distributions to observations from the large set of Central European weather stations analysed by Nguyen et al. (2021).”
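The scaling step described above can be sketched as follows. This is a minimal illustration only: it assumes a plain Generalized Pareto tail in place of the extGP distribution, and `xi_fitted`, `sigma` and the list of factors are hypothetical placeholders rather than values from the study.

```python
import math
import random

def sample_gp(xi, sigma, n, rng):
    """Draw n values from a Generalized Pareto distribution by inverse-CDF sampling."""
    out = []
    for _ in range(n):
        u = rng.random()  # uniform on [0, 1)
        if abs(xi) < 1e-12:
            # Exponential limit of the GP distribution for xi -> 0
            out.append(-sigma * math.log(1.0 - u))
        else:
            out.append(sigma / xi * ((1.0 - u) ** (-xi) - 1.0))
    return out

xi_fitted = 0.10  # hypothetical shape parameter fitted to one station
sigma = 5.0       # hypothetical scale parameter in mm
factors = [0.2, 0.5, 1.0, 1.5, 2.0]  # multipliers within the range named in the text

# The fitted shape is multiplied by each factor; it is not set to the factor itself.
scaled_shapes = [f * xi_fitted for f in factors]

rng = random.Random(42)
for xi in scaled_shapes:
    sample = sorted(sample_gp(xi, sigma, 20_000, rng))
    q999 = sample[int(0.999 * len(sample))]
    print(f"shape={xi:.2f}  empirical 99.9th percentile={q999:.1f} mm")
```

Larger factors give heavier tails, so the empirical upper percentiles tend to grow with the scaled shape parameter.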
  
  The fragments for the disaggregation of PET were assigned as 0.9 for day times and 0.1 for night times to represent that the largest share of evapotranspiration occurs during the day. To check the sensitivity, we set up a test case of 189 model runs over 60 years with different disaggregation of PET. We compared the original fragments of 0.9 for day times and 0.1 for night times to fragments of 0.5 each, meaning that PET was set to be constant for every 24 hours. This is considered to be an extreme and not very realistic disaggregation of PET from daily to hourly values and should only serve for the evaluation of the sensitivity. To analyse the sensitivity, the GEV shape parameters of the simulated discharge time series were estimated for both PET disaggregation schemes. The root mean squared error between the shape parameters was estimated to be 0.00921. Based on this we are confident that the disaggregation of PET has hardly any effect on the results – at least for small, homogeneous catchments – and that any realistic disaggregation scheme would lead to the same results as presented in the manuscript.
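The sensitivity test above can be sketched as follows. This is a minimal illustration under stated assumptions: the daytime window of 06:00 to 18:00 (`DAY_HOURS`) and the even spread within day and night are hypothetical choices, and the shape parameter lists are made-up numbers, not results from the study.

```python
import math

DAY_HOURS = range(6, 18)  # assumed daytime window; not specified in the source

def disaggregate_pet(daily_pet, day_fraction):
    """Split one daily PET total into 24 hourly values.

    day_fraction of the total is spread evenly over the daytime hours,
    the remainder evenly over the night hours.
    """
    n_day = len(DAY_HOURS)
    n_night = 24 - n_day
    hourly = []
    for hour in range(24):
        if hour in DAY_HOURS:
            hourly.append(daily_pet * day_fraction / n_day)
        else:
            hourly.append(daily_pet * (1.0 - day_fraction) / n_night)
    return hourly

def rmse(a, b):
    """Root mean squared error between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

original = disaggregate_pet(3.0, 0.9)  # fragments of 0.9 (day) and 0.1 (night)
constant = disaggregate_pet(3.0, 0.5)  # fragments of 0.5 each

# Hypothetical GEV shape parameters of discharge under the two schemes
shapes_original = [0.10, 0.15, 0.22]
shapes_constant = [0.11, 0.14, 0.23]
print(rmse(shapes_original, shapes_constant))
```

With a 12-hour daytime window, fragments of 0.5 each yield a constant hourly PET, which corresponds to the extreme comparison case described in the response.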
  
  The Method-of-Fragments (MOF) is a commonly used method when disaggregating rainfall time series (e.g. Carreau et al., 2019; Li et al., 2018; Lu et al., 2015; Sharma and Srikanthan, 2006; Westra et al., 2012). In a comparison of different rainfall disaggregation models, the nonparametric MOF was found to outperform point process-based and cascade models (Pui et al., 2012). It is able to better match the observed intensity-frequency relationship than the other models, and this was found to be particularly true for extreme rainfall characteristics (Pui et al., 2012). For more details on MOF and its performance, see also Guan et al. (2023). We will add in the methods section (l. 122): “The MOF is a commonly used method for the disaggregation of rainfall (e.g. Carreau et al., 2019; Li et al., 2018; Lu et al., 2015; Sharma and Srikanthan, 2006; Westra et al., 2012), and has been found to outperform other disaggregation models, especially for extreme rainfall characteristics (Pui et al., 2012).”
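The core idea of the MOF can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the donor day is chosen here as the single observed wet day with the closest daily total, whereas published variants commonly sample among several similar days and apply further conditioning. The function names and the toy observations are hypothetical.

```python
def fragments(hourly_values):
    """Return the fraction of the daily total that falls in each hour."""
    total = sum(hourly_values)
    if total <= 0:
        raise ValueError("fragments are undefined for a dry day")
    return [v / total for v in hourly_values]

def disaggregate_day(sim_daily_total, observed_days):
    """Disaggregate one simulated daily total to hourly values.

    The hourly pattern is borrowed from the observed wet day whose
    daily total is closest to the simulated one (simplifying assumption).
    """
    wet_days = [d for d in observed_days if sum(d) > 0]
    donor = min(wet_days, key=lambda d: abs(sum(d) - sim_daily_total))
    return [sim_daily_total * f for f in fragments(donor)]

# Two toy observed days with 24 hourly values each (mm)
burst_day = [0.0] * 24
burst_day[14], burst_day[15] = 6.0, 4.0    # 10 mm in a short afternoon burst
drizzle_day = [0.5] * 4 + [0.0] * 20       # 2 mm spread over the early hours

hourly = disaggregate_day(12.0, [burst_day, drizzle_day])
print(hourly[14], hourly[15], sum(hourly))
```

Because the fragments of a day sum to one, the disaggregated hourly values always reproduce the simulated daily total.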
  
  2) We thank the reviewer for this comment. We are aware that assuming homogeneous conditions throughout the catchment is a simplification that does not allow us to include all processes which might affect tail behaviour. However, as stated in l. 378, including more processes and adding spatial variability of rainfall or runoff characteristics makes it difficult to isolate their effects. We strongly agree that spatial variability and the influence of the catchment size would be interesting aspects for future studies and suggest respective analyses (l. 376), but this would be beyond the scope of our current study. In the current set-up it would not be advisable to analyse different catchment sizes for two reasons: (1) for larger catchments, the assumption of homogeneous conditions is questionable, and (2) the current model set-up does not include river routing, which would become increasingly important for large catchments. To acknowledge these limitations, we will weaken some conclusions by stating more clearly that they are only valid for small, homogeneous catchments.
  
  3) We thank the reviewer for this comment. We understand that limiting the extremeness in a study on extremes can seem inappropriate or misleading in some cases, but we are confident that limiting the GEV shape parameter of precipitation (P) time series does not affect our conclusions. It would be a completely different story had we excluded individual extreme events instead of entire time series with too high GEV shape parameters. As mentioned in our response to point 1, we generated P series with a rather wide range of extGP shape parameters, to be able to potentially exclude some time series based on their GEV distributions. We wanted to ensure that when going from daily to hourly values and then to annual maxima, we would still cover a wide range of tail behaviours. As it turned out, we covered a range that was wider than what has been estimated for observed precipitation time series and so we excluded the ones well outside the observed range.
  As seen in Fig. 7, we still cover a range of GEV shape parameters, even after excluding the P time series with the largest GEV shape parameters. It can also be seen that all P shape parameters result in a wide range of Q shape parameters, and that the minimum and maximum of the Q shape parameters seem to increase linearly with increasing P shape parameter for the time series of 6000 years. This relation would also extend to larger P shape parameters, so that adding the respective time series would not change the picture. However, we do not feel comfortable using P shape parameters well outside the observed range in this analysis. The cut-off of 0.37 was chosen because it is a little larger than the “observed” maximum of 0.33, but still seems to be a reasonable GEV shape parameter. As described above, using a different value would not qualitatively affect the results presented.
  With regards to the last sentence of this comment, we feel the need to clarify that we did not infer that flood peaks depend more on runoff generation than on rainfall – instead, we are showing that P becomes increasingly important the larger an event is, until eventually a threshold is reached beyond which the runoff generation has no effect.
  We will add in the methods section (l. 141): “Using a slightly different cut-off than 0.37 for excluding P time series with very high GEV shape parameters was not found to affect the findings.”
  
  4) We thank the reviewer for this comment. As we are running the model simulations on a synthetic catchment, we could not calibrate the model against observations and so had to make some decisions regarding the model parameters. However, we aimed to be as objective as possible by basing our assumptions about the parameters on previous studies. The goal was not to capture the entire world of extreme flood responses, and we are aware that there might be other parameter combinations not covered in our set-up which also result in extreme floods. Instead, the main focus was to capture events with and without the fastest runoff component being active. The sensitivity analysis of the model parameters was set up accordingly. The parameter ranges that we used are based on the study by Parajka et al. (2007). The ranges cover a large span of values and are commonly used in studies using the TUWmodel (e.g. Merz et al., 2011; Ceola et al., 2015). For example, Ceola et al. (2015) adopted the same ranges of model parameters in a study where they calibrated the TUWmodel for catchments in Italy, Switzerland, Austria, Germany and Sweden – i.e. for catchments in different topographic settings and covering a range of meteorological conditions. We therefore believe that the ranges are wide enough to capture many extreme flood responses, even when not necessarily capturing all possible extremes. We will add in the methods section (l. 156): “The same parameter ranges have been used by Ceola et al. (2015) for calibrating the TUWmodel for European catchments with different topographic and meteorological conditions, and are therefore deemed appropriate for capturing many different extreme flood responses.”
  
  5) We thank the reviewer for this important comment. The exceedance of the upper subsurface storage is of course not the only threshold process in the runoff generation that could act in a catchment. We selected it as one representation of threshold behaviour that could be reasonably implemented in the simulation model. In contrast, infiltration excess cannot be well represented in the model, as is the case for other threshold processes such as the onset of preferential flow through macropores or a temporary expansion of the river network. As also stated in response to point 2 raised by the reviewer, the approach and the findings should not be extrapolated to much larger basins, as the assumption of a spatially homogeneous catchment storage does not hold for most large basins. For large catchments, it is unlikely that storage exceedance occurs simultaneously in the entire catchment (see also the discussion in l. 373 on this). In addition, for larger catchments river processes might become more important, and processes like network expansion or overland flow are not represented in the model. We agree with the reviewer that assessing the effects of other nonlinearities in the runoff generation would be highly interesting, especially in combination with different basin scales. However, this would require a different model set-up and, as discussed in l. 378, “in such a set-up, tail heaviness could be affected by a combination of catchment size, sub-basin response, spatial organization and river routing characteristics, making it difficult to isolate the effects of precipitation and runoff generation.” In response to both this comment and a comment by reviewer #1, we will expand the discussion as follows: “In the adopted rainfall-runoff model only one nonlinearity in the runoff generation was considered, namely the activation of an additional very fast runoff component. 
However, in a real catchment multiple nonlinearities and process shifts might be present such as the onset of overland flow, the onset of subsurface stormflow, the activation of macropores or the temporary expansion of the river network. The model does not include all these processes explicitly and is therefore, as all models, a simplified representation of reality. Hence, the simulated flood peak distributions are also only representative for this simplified reality. Nevertheless, they can help us explore results which can be valuable for real-world applications.”
  With regards to the reviewer’s reference to arid/semi-arid regions we would like to emphasize that the model and its parametrization are set up for Central European conditions. Parameters for the rainfall-runoff model are based on Austrian catchments (Merz et al., 2011) and the weather generator was originally set up and evaluated for Central European stations (Nguyen et al., 2021). We will add in the discussion: “The simulation model chain and its parametrization has been set up for Central European conditions. The results are therefore not directly transferable to other regions of the world with very different conditions.”
  
  6) We do not agree that we have used an ad-hoc procedure to examine the threshold return periods. We have based the procedure on the following reasoning: For close to impervious catchments, the curves of Pct and Q are assumed to run in parallel. This would mean that the difference between their slopes is 0. However, due to some noise in the data, we observed that the slopes between the curves of Pct and Q are hardly ever exactly zero, even for the model runs on close to impervious catchments (l. 182). We therefore used the differences between the slopes of Pct and Q estimated for close to impervious catchments to evaluate what level of noise is to be expected, i.e. what amount of differences between the slopes we need to expect even for parallel curves. Based on this we defined a buffer around zero within which slope differences need to lie for the curves to be considered as parallel. We also tested the sensitivity of the threshold return period to the definition of the buffer to ensure that our results are reasonable (l. 187, l. 243). We found that both a narrower and a wider buffer result in some cases in unreasonable threshold return periods, while this was not the case for the buffer that we used (l. 245).
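The reasoning above can be sketched as follows. This is a minimal illustration under stated assumptions: the buffer is taken here as the largest absolute slope difference found in the near-impervious reference runs, and the threshold return period as the smallest return period from which all slope differences stay inside that buffer. Both definitions, the function names and all numbers are hypothetical stand-ins, since the response does not spell out the exact rules.

```python
def noise_buffer(impervious_slope_diffs):
    """Half-width of the buffer around zero.

    Assumption: the largest absolute slope difference seen in the
    near-impervious reference runs is treated as the expected noise level.
    """
    return max(abs(d) for d in impervious_slope_diffs)

def threshold_return_period(return_periods, slope_diffs, buffer):
    """Smallest return period from which all slope differences stay inside the buffer.

    return_periods must be sorted in ascending order. Returns None if the
    curves are not parallel at the largest return period.
    """
    threshold = None
    # Walk from the rarest events downwards until the curves stop being parallel.
    for period, diff in zip(reversed(return_periods), reversed(slope_diffs)):
        if abs(diff) <= buffer:
            threshold = period
        else:
            break
    return threshold

# Hypothetical slope differences between the Pct and Q curves
return_periods = [2, 5, 10, 20, 50, 100, 200, 500]
slope_diffs = [0.40, 0.31, 0.22, 0.12, 0.03, -0.02, 0.01, -0.01]
reference_diffs = [0.02, -0.03, 0.01, -0.04]  # near-impervious runs

buffer = noise_buffer(reference_diffs)
print(threshold_return_period(return_periods, slope_diffs, buffer))
```

A narrower or wider buffer shifts the detected threshold, which is why the sensitivity of the threshold return period to the buffer definition matters.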
  
  7) We fit one GEV distribution even when we know that there is a process shift in the runoff generation, as fitting GEV distributions to annual maximum series (AMS) is very common in hydrological practice. In practice, we often do not know whether there might be a process shift acting or not, and so we simply fit one distribution to the observed AMS. Doing the same here allows us to draw conclusions which can be of relevance for hydrological practice. However, it should be noted that when fitting one GEV distribution despite the presence of a process shift we cannot necessarily infer the tail behaviour of the true underlying distribution. Nevertheless, it can still be insightful with regards to the occurrence probability of extreme events. For more details on this aspect and the distinction between true and apparent tail behaviour, please also see our response to point 2 raised by reviewer #1. In the manuscript, we will add the following sentences in the methods section (around l. 167): “It should be noted that the shape parameter of a GEV distribution fitted to a time series of limited length does not necessarily reflect the true tail behaviour of the underlying distribution but is only an approximation thereof. When fitting GEV distributions to subsets of a time series of different lengths, the shape parameters may vary due to differences in the estimation uncertainties. To reflect this, we will use the terminology “apparent tail behaviour” when drawing conclusions based on the GEV shape parameter of a distribution fitted to a limited time series.”
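The dependence of the fitted shape parameter on the series length can be sketched as follows. This is an illustration only: the response does not state which estimator was used, so the L-moment estimator with Hosking's approximation is swapped in here, and the synthetic series is drawn from a single GEV distribution with hypothetical parameters, i.e. without a process shift.

```python
import math
import random

def sample_gev(xi, mu, sigma, n, rng):
    """Draw n values from a GEV distribution (xi != 0) by inverse-CDF sampling."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        out.append(mu + sigma * ((-math.log(u)) ** (-xi) - 1.0) / xi)
    return out

def gev_shape_lmoments(values):
    """Estimate the GEV shape parameter xi (positive = heavy tail) from L-moments."""
    x = sorted(values)
    n = len(x)
    # Unbiased probability-weighted moments b0, b1, b2
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    t3 = l3 / l2  # L-skewness
    # Hosking's approximation for k, where k = -xi
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c
    return -k

rng = random.Random(1)
series = sample_gev(xi=0.2, mu=10.0, sigma=3.0, n=6000, rng=rng)
for length in (60, 600, 6000):
    print(length, round(gev_shape_lmoments(series[:length]), 3))
```

Even for data that truly follow one GEV distribution, estimates from short subsets scatter around the underlying value, which is the estimation uncertainty referred to in the added sentences.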
  
  Citation: https://doi.org/10.5194/hess-2023-186-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (19 Oct 2023) by Efrat Morin

AR by Elena Macdonald on behalf of the Authors (09 Nov 2023): Author's response, Author's tracked changes, Manuscript

ED: Referee Nomination & Report Request started (15 Nov 2023) by Efrat Morin

RR by Anonymous Referee #1 (30 Nov 2023)

RR by Anonymous Referee #2 (15 Jan 2024)

Suggestions for revision or reasons for rejection

The authors have provided additional detail to the extensive methodological questions raised in reviewer comments, improving the paper significantly. The authors have provided expanded rationale for specific methods and assumptions in response to issues raised in my initial review. Many of these are difficult, open-ended problems for which the authors have provided helpful revisions. Two are noted below for additional consideration. The larger issues that need additional treatment concern not the individual modeling components, but the modeling and analysis chain used in the study. There are many links that are stitched together to yield hydrologic analyses. There are several formulations of model-chain questions that would provide useful ways of organizing additional discussion. The most direct question is “Why should one have confidence in the composite analysis system?” A simpler formulation is “What are the model components/assumptions that are most questionable? And why?” The added discussion of nonlinearity in runoff generation is a start at addressing this question. The revisions that are recommended are significant, but can be accomplished in a reasonably short time, hence the recommendation of minor revisions.

The two questions from the original review that would benefit most from additional discussion:
1) Is the rainfall-runoff model suitable for drawing strong conclusions about upper tails of flood peaks? The assumption of a homogeneous catchment over a 50 km² scale removes the capability of assessing important processes that can contribute to flood peak response. Does spatial variability of rainfall contribute to tail behaviour at 50 km² scale? How does this change for 10 km² scale and 1000 km² scale?

The authors’ responses focus on future studies addressing these issues, the suggestion that 50 km² catchments are homogeneous in Central Europe and the importance of open channel flow processes (tied to river routing) in larger catchments. The arguments and associated revisions are not persuasive.

2) The authors note that “we fit one GEV distribution to the data even when we know that there is a process shift in the runoff generation, which actually violates the assumption of independent and identically distributed values for distribution fitting”. This is a serious issue, which is tied to the broader issue raised in the introduction concerning the dependence of estimated upper tail behaviour of observed time series given the sensitivity to the “largest few events”.

The authors conclude their response by noting that “when fitting GEV distributions to subsets of a time series of different lengths, the shape parameters may vary due to differences in the estimation uncertainties. To reflect this, we will use the terminology “apparent tail behaviour” when drawing conclusions based on the GEV shape parameter of a distribution fitted to a limited time series.” Tail estimates are always subject to issues associated with bias and variance, even if the underlying assumptions hold. Resorting to “apparent tail behaviour” isn’t helpful in addressing the issue. Here the key points to address are the consequences that follow from the fact that the underlying GEV assumption does not hold.


ED: Publish subject to revisions (further review by editor and referees) (24 Jan 2024) by Efrat Morin

AR by Elena Macdonald on behalf of the Authors (02 Feb 2024): Author's response, Author's tracked changes, Manuscript

ED: Publish as is (08 Feb 2024) by Efrat Morin

AR by Elena Macdonald on behalf of the Authors (08 Feb 2024)

Short summary

In some rivers, the occurrence of extreme flood events is more likely than in other rivers – they have heavy-tailed distributions. We find that threshold processes in the runoff generation lead to such a relatively high occurrence probability of extremes. Further, we find that beyond a certain return period, i.e. for rare events, rainfall is often the dominant control compared to runoff generation. Our results can help to improve the estimation of the occurrence probability of extreme floods.