Comment on hess-2021-334, Anonymous Referee #2. Referee comment on "Flexible and Consistent Quantile Estimation for Intensity-Duration-Frequency Curves"

Line 29: This heading could be written as "Methods".

Lines 65-70: A few aspects of nonstationarity could be discussed here. Briefly discuss the added value of this study in addressing nonstationarity compared with the methods discussed in the literature (Cheng and AghaKouchak, 2014; Ganguli and Coulibaly, 2017).

Section 2.1, lines 90-95: It is not clear why 8 stations were merged into a single station, leaving 92 overall stations out of 100. Which physiographic or hydrologic similarity measures were adopted for regionalization?

It has been shown in the literature that the generalized maximum likelihood method, in general, does not provide a credible estimate of the shape parameter, yielding abrupt shape estimates (Martins and Stedinger, 2000). Do your values lie within the credible limits of the shape parameter range shown in the literature? Boxplots showing the range of shape parameters for different durations could help to identify this issue.

Skill score index, lines 190-200: What do M and R represent? If R denotes the empirical distribution, which empirical plotting-position formula was used to estimate it? Typically, Gringorten's plotting-position formula is used to characterize extremes.

Fig. 7: Could you please show the difference in return levels with vs. without flattening in an inset diagram? How large must the percentage difference between the two statistics be in order to qualify as significant?

Lines 360-363: Any discussion of copula-based IDF estimation, which claims to preserve the inherent non-linearity between intensity and duration?

Line 433: A few outlying events corresponding to higher quantiles (i.e., at the tail of the distribution) lie outside the confidence intervals.


Introduction
The number of heavy precipitation events has increased significantly in Europe (Kundzewicz et al., 2006; Tank and Können, 2003) and worldwide (Hartmann et al., 2013). Such events are related to flooding and other hazards which can cause severe damage to agriculture and infrastructure (Brémond et al., 2013). The impact of extreme precipitation depends on the temporal scale of the event. Short, intense convective precipitation has different characteristic consequences than long-lasting, mostly stratiform, precipitation. Examples of events on timescales of minutes to hours, days, and weeks are pluvial or flash floods (Braunsbach, Germany, May 2016), river flooding (Elbe, Germany, 2013), and groundwater flooding (Leicestershire, UK, March 2017), respectively.
The definition of precipitation extremes is based on the occurrence probability; extremes are quantified using quantiles (return levels) and associated occurrence probabilities, often expressed as return periods in a stationary interpretation. Quantitative estimation of quantiles and associated probabilities mostly follows one of two popular methods, namely (1) block maxima and their description with the generalized extreme value (GEV) distribution - a heavy-tailed and asymmetric distribution - or (2) threshold exceedances and a description with the generalized Pareto distribution (GPD; e.g., Coles, 2001). Typically, annual precipitation maxima of different timescales are used to describe extreme rainfall events. Both GEV and GPD can be used to model extreme precipitation on a certain timescale. A common problem for short timescales, i.e., the scarce availability of data with high measurement frequency, can be overcome by modeling several timescales at once. Different timescales can be represented as durations over which the precipitation rate is aggregated and averaged. Most frequently, daily precipitation sums are reported, but hourly or 5 min aggregations are also common.
A way to describe the characteristics of extremes for various durations (timescales) is intensity-duration-frequency (IDF) curves, which describe the relationship between extreme precipitation intensities, their duration (timescale), and their frequency (occurrence probability or average return period). These relations have been known since the mid-20th century (Chow, 1953) and have become popular among hydrologists and meteorologists. In estimating these curves, one tries to exploit the assumption of a smooth intensity-frequency relationship across different durations. This helps for interpolating between durations, and it can also improve the estimation for short durations, as shorter time series are often available for those.
Historically, a GEV distribution is fitted individually for each of a set of durations (e.g., 5 min, 1 h, 12 h, 24 h, and 48 h), leading to quantile (return level) estimates for specified probabilities (return periods) for all durations considered. In a second step, for a given probability, a smooth function is estimated by interpolating the associated quantiles across durations (García-Bartual and Schneider, 2001). However, by using a duration-dependent extreme value distribution (d-GEV) (see the first examples in Nguyen et al., 1998, and Menabde et al., 1999), IDF estimation can be carried out in one step within a single model. To achieve this, the GEV parameters are defined as functions of duration. This approach prevents the crossing of quantiles across durations and is thus considered consistent. Moreover, the data are used more efficiently, which simplifies the extension of the model, e.g., the inclusion of large-scale covariates at a later stage. This will enable first insights into physical effects beyond the purely statistical evaluation of precipitation.
It is widely agreed that precipitation intensities for given exceedance probabilities follow a power-law-like function (scaling) across durations (Gupta and Waymire, 1990; Veneziano and Furcolo, 2002; Burlando and Rosso, 1996), with higher intensities for short durations. For a range of very short durations (d ≤ 1 h), the scaling assumption does not hold because maxima from different durations often originate from the same event. This leads to a curvature of IDF curves, where intensity no longer follows a power law with decreasing duration. Bougadis and Adamowski (2006) approached this issue by using two different duration exponents, one for small durations and one for all other durations. A smoother transition can be achieved by including the curvature as a duration offset in the parameters of the GEV distribution without explicitly distinguishing between short and long durations. Koutsoyiannis et al. (1998) used a model with five parameters to describe the complete IDF relation for different probabilities (return periods) and across durations. The underlying idea is based on a reparameterization of the GEV and its three characteristic parameters: location µ, scale σ, and shape ξ. While the shape ξ is held constant for all durations d, it is further assumed that the ratio between location and scale, µ/σ, remains constant across duration d (for details, see Sect. 2.3). Ulrich et al. (2020) built on the approach of Koutsoyiannis et al. (1998) and extended it to a spatial setting with covariates for the d-GEV parameters. Although using both consistent modeling and spatial pooling improves model performance, they emphasize the need for more flexibility of the IDF curves at longer durations, which is addressed in the present study. We therefore look for new parameterizations of the IDF curves' duration dependence by combining the existing approaches of multiscaling and duration offset and extending them by a new parameter, the intensity offset.
The commonly used variant of the d-GEV with five parameters (Koutsoyiannis et al., 1998) might not be flexible enough for a wide range of durations from minutes to several days. A first approach for extending the d-GEV addresses the simple scaling relation. This model assumes a scaling that is independent of the exceedance probability (return period). Relaxing this assumption leads to so-called multiscaling, which allows for different scaling-like behavior for different exceedance probabilities (return periods). This is achieved by introducing another parameter, η₂, as in Eqs. (8) and (10) in Sect. 2.3. Then, the ratio between location and scale is no longer constant. Multiscaling is found to be effective for durations longer than 1 h (Veneziano and Furcolo, 2002; Burlando and Rosso, 1996). Van de Vyver (2018) employs the multiscaling approach in a Bayesian setting. On a global scale, different scale parameters have been investigated by Courty et al. (2019). None of the named studies combine multiscaling with curvature for short durations; each focuses on only one of these aspects, whereas our study aims at a combination and analysis of three different features.
In this study, we compare different ways to parameterize IDF curves, including the features of multiscaling and duration offset. In addition, we present a new d-GEV parameter, the intensity offset, which accounts for the deviation from the power law and the flattening of IDF curves for long durations. To our knowledge, this comprehensive analysis of different features has not been conducted before. Section 2 lists the data sources and introduces the different features and their modeling equations, as well as the verification methods used to analyze modeling performance. In Sect. 3, the cross-validated verification results for all features are shown with respect to modeling performance for different return periods and durations. For verification, we perform a case study using rain gauge data from a catchment in Germany. IDF curves that include all analyzed features are presented for selected stations.

Data and methods
We use precipitation measurements in an area in and around the catchment of the river Wupper in western Germany. In order to compare different models for the d-GEV, we use the quantile skill index (QSI) introduced in Ulrich et al. (2020) within a cross-validation setting. For the resulting IDF curves, we obtain confidence intervals using a bootstrapping method and test their coverage with artificial data with known dependence levels in a simulation study. The data and all necessary methods are explained in the following section.

Station-based precipitation data
Precipitation sums at minute, hourly, and daily resolution are provided by the Wupperverband and the German Meteorological Service. Rain gauges are located in and around the catchment of the river Wupper in North Rhine-Westphalia, Germany. In total, 115 stations are used. Data from two measuring devices at a distance below 250 m are combined into one station each in order to obtain longer time series, resulting in a total of 92 grouped stations. However, where measurement series are merged, it is common to have measurements from both instruments for a certain period of time. Thus, when merging, we use only the observations with the higher measuring frequency for the analysis. For example, when combining two time series with hourly and daily data, respectively, the time series of all aggregation levels (durations) are obtained from the hourly data for the overlapping period. This choice is made because 24 h values derived from hourly measurements might show higher intensities, since the 24 h window is shifted hour by hour to find the annual maximum, whereas 24 h values from daily measurements are recorded for a fixed day block, i.e., from 08:00 to 07:59 LT (local time). A robustness test (not shown) assessed how the estimated IDF curves are affected by which measurement frequency is retained during the overlap when merging time series: two sets of IDF curves were created, (1) choosing the maxima from the higher measurement frequency and (2) choosing the maxima from the lower measurement frequency. In most cases, there was no relevant difference between the two.
Years with more than 10 % missing values are disregarded. Some years contain measurement artifacts, where identical rainfall values are repeated over several time steps. After consulting the data maintainers, these years were removed before the analysis. The data exhibit heterogeneity in terms of temporal frequency and the length of the resulting time series. Figure 1 presents the availability of data over time (left) and space (right) for the three possible temporal resolutions. More specifically, minute data have been available since 1968, whereas daily records reach back to 1893.
Time series for different durations are obtained by accumulating the original measurements over a set of durations ranging from 1 min to 5 d, as done by Ulrich et al. (2020). In the following, numerical duration values are given in hours. We use only the annual maxima of these time series to model the distribution of extreme precipitation. The original precipitation time series are accumulated with the R package IDF (Ulrich et al., 2021b). The data set with annual maxima can be found online and supports the findings of this study. The set of durations d is chosen such that small durations are represented with smaller increments than larger durations. In a simulation study, we tested whether using a different set of durations with a stronger focus on short durations affects the results, but no such effect could be found (see Appendix C).
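As an illustration of this accumulation step, annual maxima for a set of durations can be computed from running sums. The sketch below is in Python with an illustrative duration set, whereas the analysis itself relies on the R package IDF; function and variable names are ours, not the package's.

```python
import numpy as np

def annual_maxima(rain, steps_per_hour, durations_h):
    """Annual maxima of running d-hour accumulations, expressed as an
    intensity in mm/h. `rain` holds one year of equidistant precipitation
    sums (mm per time step). The duration values below are illustrative,
    not the exact duration set used in the paper."""
    maxima = {}
    for d in durations_h:
        w = int(round(d * steps_per_hour))                  # window length in steps
        sums = np.convolve(rain, np.ones(w), mode="valid")  # running d-hour sums
        maxima[d] = sums.max() / d                          # mm per window -> mm/h
    return maxima

# one synthetic year of hourly sums with a single intense 3 h event
rng = np.random.default_rng(1)
rain = rng.exponential(0.1, size=24 * 365)
rain[1000:1003] = [12.0, 20.0, 8.0]
am = annual_maxima(rain, steps_per_hour=1, durations_h=[1, 2, 24])
# averaging over longer windows can only lower the maximum intensity
assert am[1] >= am[2] >= am[24]
```

Note that the running window is shifted step by step, which is exactly why maxima derived from higher-frequency data can exceed fixed-block values, as discussed above.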
Minute measurements might be less accurate when only a small number of rain drops is recorded, so that the measured intensity is affected by sampling uncertainty. However, for events that are identified as annual maxima, we expect the rain amount to be large enough that the higher sampling uncertainty, compared to larger accumulation sums, can be neglected.

Generalized extreme value distribution
One of the most prominent ideas of extreme value statistics is based on the Fisher-Tippett-Gnedenko theorem, which states that, under suitable assumptions, maxima drawn from sufficiently large blocks follow one of three distributions. These distributions differ in their tail behavior. The GEV distribution comprises all three cases in one parametric family and is widely used in extreme precipitation analysis (Coles, 2001):

G(z; µ, σ, ξ) = exp{ −[1 + ξ(z − µ)/σ]^(−1/ξ) }. (2)

Here, the non-exceedance probability G for precipitation intensity z depends on the location parameter µ, scale σ > 0, and shape ξ ≠ 0, with z restricted to 1 + ξ(z − µ)/σ > 0. The non-exceedance probability can be expressed as a return period, e.g., for an annual block size, T(z) = 1/(1 − G(z)) years. Consequently, return values for a given non-exceedance probability 0 < p < 1 can be calculated by solving Eq. (2) for z:

z(p) = µ + (σ/ξ)[(−ln p)^(−ξ) − 1]. (3)

Water management authorities and other institutions rely on return values for different durations. However, the GEV in Eq. (2) is limited to one selected duration at a time. One way to account for that need is to model each duration separately and then, in an independent second step, interpolate the resulting quantiles (return levels) across duration d, as done in the KOSTRA atlas (DWD, 2017) of the German Meteorological Service (DWD). One major disadvantage of this method is that quantile crossing can occur, meaning that quantiles (intensities) associated with smaller exceedance probabilities can have higher values than quantiles associated with larger exceedance probabilities in some duration regimes. To solve this problem, Nguyen et al. (1998), Koutsoyiannis et al. (1998), and Menabde et al. (1999) proposed a distribution with parameters depending on duration d; thus, only one single model is required to obtain consistent (i.e., non-crossing) duration-dependent quantiles (return values). Another advantage is the involvement of data from neighboring durations in the estimation of GEV parameters.
For the modeling of short-duration rainfall, often far less data are available than for longer durations d ≥ 24 (1 d). Thus, in this setting, information from long durations has the potential to increase modeling performance for short durations as well.
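The return level calculation, i.e., solving the GEV distribution function for z, can be sketched as follows; the parameter values are illustrative only, not fitted values from the paper.

```python
import math

def gev_return_level(p, mu, sigma, xi):
    """GEV quantile (return level) for non-exceedance probability p,
    obtained by inverting the GEV distribution function; xi != 0 assumed."""
    return mu + sigma / xi * ((-math.log(p)) ** (-xi) - 1.0)

# return period T = 1 / (1 - p); e.g., a 100-year event has p = 0.99
z100 = gev_return_level(0.99, mu=10.0, sigma=3.0, xi=0.1)
z10 = gev_return_level(0.90, mu=10.0, sigma=3.0, xi=0.1)
assert z100 > z10 > 10.0   # rarer events have higher return levels
```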

Duration dependence
There are multiple empirical formulations for the relationship between intensity z and duration d. Koutsoyiannis et al. (1998) proposed a general form with five parameters for IDF curves. This requires a reparameterization and extension of the GEV:

µ(d) = µ̃ · σ(d),  σ(d) = σ₀ (d + θ)^(−η),  ξ(d) = ξ.

Here, µ̃ is the rescaled location parameter, θ is the duration offset, and η the duration exponent. The scale σ follows a two-parameter power law (scaling relation) in duration d, with the scale offset σ₀ being constant for all durations. For d ≫ θ, it is justified to disable the duration offset feature by setting the duration offset θ = 0. The resulting IDF curves (Fig. 2a) have two main features. (1) The curves follow a power law for a wide range of durations d > 1 (1 h), and the power-law exponent (slope in a double-logarithmic plot) is described by the duration exponent η and is equal for all probabilities. (2) The deviation from the power law (or curvature) for d < 1 (1 h) is described by the duration offset θ.
When applying the GEV separately to every duration out of a set of durations and interpolating in a second independent modeling step, the number of parameters equals three GEV parameters times the number of selected durations plus at least three parameters for interpolating every quantile. For the set of durations chosen here, and for evaluating five quantiles, this implies estimating 15 × 3 + 5 × 3 = 60 parameters. For the consistent approach, estimation is reduced to only five d-GEV parameters, i.e.,μ, σ 0 , ξ , θ , and η. A smaller parameter set is less likely to overfit the data and enables us to better investigate underlying physical processes.
Models can be further improved by adding the multiscaling feature (Gupta and Waymire, 1990; Burlando and Rosso, 1996; Van de Vyver, 2018), which introduces different slopes depending on the quantile (or associated probability). Figure 2b presents how this feature affects the IDF curves. In order to highlight the effect of multiscaling, we set θ = 0, resulting in no deviation from a power law (curvature) for short durations. The multiscaling feature is added at the cost of one additional parameter η₂ (second duration exponent) in the model:

µ(d) = µ̃ σ₀ (d + θ)^(−η),  σ(d) = σ₀ (d + θ)^(−(η+η₂)).

Using only the curvature feature, Ulrich et al. (2020) reported decreasing performance of consistent modeling for longer durations d ≥ 24 (1 d) when compared to using separate GEV models for each duration. Our attempt aims at more flexibility of the IDF curve in the long-duration regime. Therefore, we combine both features, curvature (duration offset) and multiscaling (second duration exponent), and add a new parameter τ, the intensity offset, which allows for a slower decrease of intensity for very long durations d ≫ 24 (Fig. 2c). This effect will be called flattening in the following:

µ(d) = µ̃ [σ₀ (d + θ)^(−η) + τ],  σ(d) = σ₀ (d + θ)^(−(η+η₂)) + τ.

Summarizing this section, three different features were presented: (1) curvature, described by the duration offset parameter θ; (2) multiscaling, using a second duration exponent η₂; and (3) flattening, with the intensity offset τ, which is introduced in this study. In the following, we use the notation IDFc for a model including only the curvature feature, i.e., η₂ = 0 and τ = 0; IDFm for a model including only the multiscaling feature, i.e., θ = 0 and τ = 0; and IDFf for a model including only the flattening feature, i.e., θ = 0 and η₂ = 0. Feature combinations are denoted as, e.g., IDFcmf for all features. The plain duration-dependent model without curvature, multiscaling, and flattening is denoted as IDF.
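A minimal sketch of such a duration-dependent quantile function, combining curvature (θ), multiscaling (η₂), and flattening (τ), written in Python for illustration (the paper's implementation is the R package IDF); parameter values and the helper names are ours.

```python
import math

def dgev_params(d, mu_t, sigma0, xi, theta=0.0, eta=0.3, eta2=0.0, tau=0.0):
    """Duration-dependent GEV parameters with curvature (theta),
    multiscaling (eta2), and flattening (tau); a sketch of the
    parameterization described in the text, not the package code."""
    sigma = sigma0 * (d + theta) ** -(eta + eta2) + tau
    mu = mu_t * (sigma0 * (d + theta) ** -eta + tau)
    return mu, sigma, xi

def idf_quantile(p, d, **par):
    """Return level for non-exceedance probability p at duration d (hours)."""
    mu, sigma, xi = dgev_params(d, **par)
    return mu + sigma / xi * ((-math.log(p)) ** (-xi) - 1.0)

# with tau > 0 the curve flattens: intensity decays more slowly at long d,
# but quantiles still decrease monotonically with duration (consistency)
q = {d: idf_quantile(0.99, d, mu_t=3.0, sigma0=2.0, xi=0.1, tau=0.2)
     for d in (1, 24, 120)}
assert q[1] > q[24] > q[120] > 0
```

Setting eta2 = 0 and tau = 0 recovers the IDFc-type behavior, theta = 0 and tau = 0 the IDFm-type behavior, and so on, mirroring the model notation introduced above.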

Parameter estimation
Parameters of the d-GEV distribution are estimated by maximizing the likelihood (maximum likelihood estimation, MLE) under the assumption of independent annual maxima. Jurado et al. (2020) showed that independence is a reasonable assumption in many cases, especially for long durations. Their study was performed on an earlier version of the same data set that is used in this study. Moreover, Rust (2009) showed that convergence towards the GEV distribution is slower in strongly dependent time series. However, assuming an appropriate choice of model, dependence does not play a large role. To avoid products of small numbers, we use the negative log-likelihood

L = − Σ_{d∈D} Σ_{n=1}^{N} ln g(z_{n,d}; µ(d), σ(d), ξ),

with the number of data points N, the duration set D, the data points z_{n,d} for each duration, the d-GEV probability density g, and the characteristic parameters of the modified d-GEV µ̃, σ₀, ξ, θ, η, η₂, τ. The negative log-likelihood over all data points Z is minimized with R's optim() function (R Core Team, 2020). The R package IDF (Ulrich et al., 2021b), which was already used for the accumulation, also provides functions for fitting and plotting IDF curves. Its functionality was extended in the context of this study and now provides options for both multiscaling and flattening in IDF curves. Finding reasonable initial values for the d-GEV parameters was a major challenge during parameter estimation, because optimization stability strongly depends on the choice of initial values. Details about this procedure can be found in Appendix A.
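The estimation step can be sketched as follows. This Python/scipy example stands in for the R-based implementation: it uses synthetic maxima for two durations, a simple-scaling d-GEV without curvature, multiscaling, or flattening, and illustrates that the optimizer needs initial values. Note that scipy's shape parameter c corresponds to −ξ.

```python
import numpy as np
from scipy.stats import genextreme
from scipy.optimize import minimize

# synthetic annual maxima for two durations (hours); illustrative only,
# drawn from a simple-scaling d-GEV with mu_t = 10/3, sigma0 = 3, eta = 0.3
rng = np.random.default_rng(0)
data = {d: genextreme.rvs(c=-0.1, loc=10 * d ** -0.3, scale=3 * d ** -0.3,
                          size=60, random_state=rng)
        for d in (1, 24)}

def neg_log_lik(par):
    """Negative log-likelihood of a simple-scaling d-GEV
    (mu(d) = mu_t * sigma(d), sigma(d) = sigma0 * d**-eta), assuming
    independent maxima across years and durations."""
    mu_t, sigma0, xi, eta = par
    if sigma0 <= 0:
        return np.inf
    nll = 0.0
    for d, z in data.items():
        sigma = sigma0 * d ** -eta
        nll -= genextreme.logpdf(z, c=-xi, loc=mu_t * sigma,
                                 scale=sigma).sum()
    return nll

# initial values matter: optimization stability depends strongly on them
res = minimize(neg_log_lik, x0=[3.0, 3.0, 0.1, 0.3], method="Nelder-Mead")
mu_t, sigma0, xi, eta = res.x
```

With initial values far from these, the optimizer may land on the boundary of the support and fail, which is the stability issue discussed above.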

Quantile skill index
After estimating the d-GEV parameters, quantiles q (return levels) can be predicted for a chosen non-exceedance probability p (or return period T = 1/(1 − p)) using Eq. (3). To verify how well a modeled quantile q of a given probability p represents extremes in the data, we use the quantile score (QS) (Koenker and Machado, 1999; Bentzien and Friederichs, 2014)

QS = Σ_{n=1}^{N} ρ_p(z_n − q),

with a small score indicating a good model. Here, ρ_p is the tilted absolute value function, also known as the check function,

ρ_p(u) = p · u for u ≥ 0, and ρ_p(u) = (p − 1) · u for u < 0,

with u = z_n − q. For high non-exceedance probabilities p, this leads to a strong penalty for data points that still lie above the modeled quantile (z_n > q). Using this approach, the QS allows for detailed verification for each probability p and duration d separately, by predicting a quantile intensity for a given p and d and comparing it with the data points z_{n,d} of duration d.
To compare different IDF models in terms of the QS, we require another verification measure. The quantile skill score (QSS) compares the quantile score QS_M of a new IDF model M with the quantile score QS_R of a reference IDF model R:

QSS_{M|R} = 1 − QS_M / QS_R.

The QSS takes values −∞ < QSS ≤ 1, with QSS = 1 for a perfect model. Positive values QSS_{M|R} > 0 are associated with an improvement of M over R. In case the model M is outperformed by the reference R, the resulting QSS is negative, QSS_{M|R} < 0, and its value is not easily interpretable. This issue is addressed by the quantile skill index (QSI) suggested by Ulrich et al. (2020): in the case of QSS_{M|R} < 0, reference R and model M are exchanged, and −QSS_{R|M} is used for negative values of the QSI,

QSI = QSS_{M|R} if QS_M ≤ QS_R, and QSI = −QSS_{R|M} otherwise. (16)

The QSI has a symmetric range of −1 ≤ QSI ≤ 1 and indicates either (1) good skill over the reference when leaning clearly towards 1, (2) little or no skill when close to 0, or (3) worse performance than the reference when leaning clearly towards −1.
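The quantile score and quantile skill index can be sketched as follows; the data and candidate quantiles are toy numbers, and the division assumes nonzero scores.

```python
def quantile_score(z, q, p):
    """Mean tilted (check) loss of quantile forecast q for data z at
    non-exceedance probability p; smaller is better."""
    return sum(p * u if u >= 0 else (p - 1) * u
               for u in (zi - q for zi in z)) / len(z)

def quantile_skill_index(qs_model, qs_ref):
    """QSI in [-1, 1]: positive when the model beats the reference;
    model and reference are exchanged for negative skill, making the
    index symmetric."""
    if qs_model <= qs_ref:               # model at least as good as reference
        return 1.0 - qs_model / qs_ref
    return -(1.0 - qs_ref / qs_model)

z = [5.0, 7.0, 9.0, 30.0]                # toy annual maxima
good, bad = 28.0, 60.0                   # two candidate 0.75-quantiles
qsi = quantile_skill_index(quantile_score(z, good, 0.75),
                           quantile_score(z, bad, 0.75))
assert 0 < qsi <= 1                      # better model, positive skill
```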
In this study, the quantile score is calculated in a cross-validation setting. For each station, the available years with maxima are divided into n_cv non-overlapping blocks of 3 consecutive years. Then, for each cross-validation step i, one block is chosen as the testing set, and all other blocks are used as the training set. This procedure is repeated with another block chosen as the testing set in each step, until all blocks have been used as testing sets exactly once. The cross-validated QS is obtained by averaging the score over all cross-validation steps,

QS_cv = (1/n_cv) Σ_{i=1}^{n_cv} QS_i.

Then, the QSI is derived from the averaged cross-validated QS of the model, QS_cv_M, and the averaged cross-validated QS of the reference, QS_cv_R, according to Eq. (16). If a year is assigned to the training or testing data set, then all available accumulation durations of that year are used for training or testing, respectively, to avoid dependence between the training and testing sets.
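The block-wise splitting of years can be sketched as follows; how leftover years are handled when the record length is not a multiple of 3 is our assumption for illustration, not specified above.

```python
def cv_blocks(years, block_len=3):
    """Split the available years into non-overlapping blocks of
    `block_len` consecutive years; each block serves once as the testing
    set, with all remaining block years as the training set. Leftover
    years are ignored here for simplicity (an assumption)."""
    years = sorted(years)
    n_cv = len(years) // block_len
    blocks = [years[i * block_len:(i + 1) * block_len] for i in range(n_cv)]
    for test in blocks:
        train = [y for y in years if y not in test]
        yield train, test

splits = list(cv_blocks(range(2000, 2012)))
assert len(splits) == 4                  # 12 years -> 4 blocks of 3 years
assert all(len(te) == 3 for _, te in splits)
```

Because a year is assigned to exactly one side of each split, all duration maxima of that year stay together, matching the train/test separation described above.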
In order to compare individual model features, we use, as a reference in the following, the corresponding model without the specific feature under investigation.

Bootstrapping and coverage
To provide an estimate of the uncertainty of the intensity quantile estimates in IDF curves, we obtain 95 % confidence intervals using a bootstrapping method. To account for dependence between annual maxima of different durations, we apply the ordinary non-parametric bootstrap percentile method (Davison and Hinkley, 1997) as follows.
Please note that, in this paragraph, the empirical quantiles used for the confidence intervals should not be confused with the intensity quantiles, which describe the return levels and are also referred to as quantiles. For each station, we draw a sample of years (with replacement) from the set of years with available data. This way, for a chosen year, all maxima from this year are used, and we expect that sampling in this way maintains the dependence structure of the data. We then estimate the parameters of the d-GEV, which are used to calculate the intensity quantile associated with a certain non-exceedance probability (see Eq. 3). We obtain a distribution of the estimated intensity quantiles by repeating this process 500 times. From this distribution, we use the empirical 0.025 and 0.975 quantiles as the lower and upper bounds of the 95 % confidence interval of the intensity quantiles.
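A sketch of this year-wise percentile bootstrap: resampling whole years keeps all duration maxima of a year together. To keep the example short, the statistic is a simple placeholder (a mean) rather than a fitted d-GEV return level, and all names are illustrative.

```python
import numpy as np

def bootstrap_ci(maxima_by_year, statistic, n_boot=500, level=0.95, rng=None):
    """Percentile bootstrap CI that resamples whole years (with
    replacement): all duration maxima of a drawn year enter the sample
    together, preserving the dependence between durations. `statistic`
    maps a list of yearly records to the quantity of interest."""
    rng = rng or np.random.default_rng()
    years = list(maxima_by_year)
    stats = []
    for _ in range(n_boot):
        draw = rng.choice(years, size=len(years), replace=True)
        stats.append(statistic([maxima_by_year[y] for y in draw]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

# toy data: per year, annual maxima for durations 1 h and 24 h
rng = np.random.default_rng(2)
data = {y: {1: rng.gumbel(10, 3), 24: rng.gumbel(4, 1)} for y in range(50)}
mean_1h = lambda recs: np.mean([r[1] for r in recs])   # placeholder statistic
lo, hi = bootstrap_ci(data, mean_1h, rng=np.random.default_rng(3))
assert lo < hi
```

In the actual procedure, `statistic` would refit the d-GEV and evaluate the intensity quantile for the chosen probability and duration in each of the 500 repetitions.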
We conduct a simulation study to examine whether the derived confidence intervals provide reasonable coverage despite the dependence between the annual maxima of different durations. To this end, we simulate 500 samples of data, each with a size of n = 50 years, with a known dependence between durations. In a first case, samples with no dependence between durations are obtained by drawing random values from a d-GEV distribution. Further details on the simulated data can be found in Appendix B. In a second case, to obtain data with dependence between durations, we use the R package SpatialExtremes to simulate values from a Brown-Resnick simple max-stable process with known dependence parameters. We use the range and smoothness parameters (ρ, α) ∈ {(1, 0.2), (120, 1), (60, 1)} for (1) a weak dependence, (2) a strong dependence, and (3) the dependence found for the Wupper catchment (Jurado et al., 2020), respectively. We transform the simulated data from Fréchet margins to d-GEV margins with the chosen parameters, as done by Jurado et al. (2020), and adjust them to the hourly scale used in this study. Then, for the artificial data, confidence intervals are obtained by bootstrapping with 500 repetitions, as described above. In total, this results in 500 × 500 × 50 data points and 500 confidence intervals.
The ratio of the number of samples in which the confidence interval covers the true intensity z to the total number of samples (500) is called the coverage. It can be calculated for each duration and probability separately.
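Computing the coverage itself is straightforward. The toy check below uses normal-theory intervals (an assumption for illustration, not the d-GEV bootstrap setting) to show that a nominal 95 % interval covers the truth in roughly 95 % of samples.

```python
import numpy as np

def coverage(intervals, truth):
    """Fraction of confidence intervals that contain the true value;
    for nominal 95 % intervals this should be close to 0.95."""
    hits = sum(lo <= truth <= hi for lo, hi in intervals)
    return hits / len(intervals)

# toy check: 500 intervals for the mean of a known normal distribution
rng = np.random.default_rng(4)
truth = 0.0
ivs = []
for _ in range(500):
    x = rng.normal(truth, 1, size=50)
    half = 1.96 * x.std(ddof=1) / np.sqrt(50)   # normal-theory 95 % CI
    ivs.append((x.mean() - half, x.mean() + half))
cov = coverage(ivs, truth)
assert 0.90 <= cov <= 0.99
```

A coverage clearly below the nominal level, as found for the strongly dependent cases, indicates that the intervals underestimate the uncertainty.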

Results
Results are presented in the following order: (1) modeling performance is verified with the QSI for the three different IDF curve features, i.e., curvature, multiscaling, and flattening.
(2) IDF curves with all three features are shown for two rain gauges. Curves are presented with a 95 % confidence interval, as created by a bootstrapping method. (3) The trustworthiness of this bootstrapping method applied to the new model with all three features is investigated with a coverage analysis, based on simulated data.

Model validation
The QSI is used to compare the quantile score of a model with that of a reference. In order to specifically investigate the influence of a single model feature, we include this feature in a model and compare with a reference without it; e.g., QSI_{IDFc|IDF} gives the performance of a model including only curvature against the plain reference without curvature, and QSI_{IDFcmf|IDFmf} gives the performance of the full model including curvature, multiscaling, and flattening against a reference with multiscaling and flattening but without curvature (see Sect. 2.3). Figure 3 shows the QSI for an IDF model including each of the three features of curvature, multiscaling, and flattening (columns), combined with no other, one other, or both other features (rows), against a reference model which differs only in the one feature under investigation (labels on top of the columns). For each QSI_{IDF1|IDF2}, one panel is provided in Fig. 3. Models and references are listed in Table E1 in Appendix E. In this way, the potential performance of each feature, e.g., curvature, is analyzed and is denoted as, e.g., the curvature skill. QSI values between −0.05 and 0.05 are considered an indicator of no relevant difference between model performances.
The curvature (duration offset θ ≠ 0) for short durations can be explained by a stronger connection between the annual maxima of different durations, which tend to originate from the same event. Usually, the most intense phase of a heavy precipitation event lasts for several minutes, and aggregated maxima do not differ much on this scale. Based on this idea, curvature influences the IDF curve's shape only for very short durations below 1 h (Fig. 2a). The consistently positive QSI values at d = 1/60 (1 min) for the curvature skill support this interpretation: this duration regime is modeled much better with curvature than without it. However, the slope for medium durations, described by the duration exponent (see Fig. 2b), is steeper when using curvature than in models without it. Thus, for medium and long durations, models with curvature on average perform equally well as or worse than reference models without curvature in most cases (see the blue regimes in Fig. 3a). In the absence of multiscaling (rows 1 and 3), a further performance increase could be found for durations between 8 h and 5 d.
Multiscaling allows for different slopes of different p quantiles on a double-logarithmic scale. Figure 3b shows that this feature increases modeling performance mainly for long durations, but also for some sub-hourly durations, when estimating quantiles for small non-exceedance probabilities. Not using curvature enables a small multiscaling skill gain for sub-hourly durations (rows 1 and 3 in Fig. 3b). An explanation could be that multiscaling tends to let IDF curves associated with different return periods diverge for short durations and converge for long durations. This behavior might interfere with the duration offset's introduction of curvature at short durations. Furthermore, the presence of curvature leads to a slightly smaller skill increase for durations longer than 16 h (rows 2 and 4 in Fig. 3b). This effect agrees with the results of the curvature skill verification (Fig. 3a), where it was shown that curvature improves modeling performance only for very short durations and has no or a negative effect on medium and long durations.
The intensity offset τ is a new feature, first introduced in this study, which addresses the empirically observed slower decrease in intensity for very long durations, called flattening. In the case where curvature is enabled in both model and reference (rows 2 and 4 in Fig. 3c), the flattening feature improves modeling performance slightly for the shortest duration of 1 min and strongly for medium durations between 2 h and 1 d. Here, the flattening might compensate for the loss in skill that we observe for medium durations for models with curvature. In these cases, there is a slight loss in skill for very long durations. In cases where curvature is not used, flattening is not needed as it provides no clear skill. An explanation for the flattening of the IDF curves in long durations could be seasonal effects, with annual maxima of short or long durations occurring more often in the summer or winter months, respectively. These effects are currently under further investigation.
When modeling only the durations d ≥ 1 h of all available stations, the models are rather indifferent towards the parameterization (Fig. 4). Here, multiscaling and flattening show some skill improvements for long and medium durations, respectively, similar to those in Fig. 3, but to a much smaller extent when compared to a data set that uses the whole range of durations from 1 min to 5 d for both training and testing (Fig. 3).
We conclude that the choice of parameters depends on the study purpose. When focusing on a wide range of durations, we recommend using features like curvature, multiscaling, and flattening. If the focus lies on long durations, or the data do not provide a sub-hourly resolution, simple scaling models might be sufficient. These recommendations are further elaborated on in the discussion in Sect. 4. Figure 5 shows IDF curves for the stations Bever and Buchenhofen, where long precipitation series are available (51 years with minute resolution and 76 years with daily resolution in Bever and 19 and 77 years in Buchenhofen, respectively). The difference in available years for different durations has an impact on the width of the 95 % confidence intervals, with uncertainty being larger when little data are available. Noticeably, the confidence intervals for both stations for p = 0.99 and d = 1/60 span a range of more than 100 mm/h. Considering that the 100-year return level was not observed at either station, a wide confidence interval range was expected. For p ≤ 0.8 in Bever, the confidence intervals remain narrow, even on a minute scale.

Coverage
Confidence intervals in Fig. 5 are obtained from a bootstrapping procedure. In the formulation of the likelihood, we assume the maxima of different durations to be independent. This assumption might not be justified, especially for short durations (see Jurado et al., 2020), and thus this dependence must be taken into account when estimating uncertainties. Disregarding the dependence would result in an underestimation of the uncertainty. To account for this effect, all annual maxima of each year, for all considered durations, are always included jointly in a bootstrapping sample. We assume that this procedure preserves the dependence structure between durations. To investigate this assumption, we calculate the coverage on simulated data (see Appendix B) from (1) a d-GEV distribution without dependence and from a Brown-Resnick max-stable process with (2) a typical dependence for the Wupper station (Jurado et al., 2020), (3) a rather weak dependence, and (4) a strong dependence between durations (Fig. 6). In the first case, without any dependence, the coverage agrees fully with the nominal 95 % level, regardless of duration or non-exceedance probability. With weak or strong dependence, the coverage is smaller but still around 90 %. This can be interpreted as a small underestimation of uncertainty by the confidence intervals in the case of a high dependence. The true dependence between durations was not investigated in this study and could be lower. That said, these results suggest that bootstrapping is a suitable tool for estimating confidence intervals in the presented context.
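The year-wise resampling described above can be sketched as follows. The data are synthetic (and, unlike real annual maxima, independent across columns), and the statistic is a plain empirical quantile rather than a d-GEV refit; the sketch only illustrates how resampling whole years jointly preserves the dependence between durations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy annual maxima: one row per year, one column per duration.
n_years, n_durations = 50, 3
annual_maxima = rng.gumbel(loc=10, scale=3, size=(n_years, n_durations))

def bootstrap_ci(maxima, p=0.99, n_boot=1000):
    """95 % CI of the empirical p-quantile per duration, resampling
    whole years (rows) so the dependence between durations is kept."""
    n = maxima.shape[0]
    stats = np.empty((n_boot, maxima.shape[1]))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)  # same years for every duration
        stats[b] = np.quantile(maxima[rows], p, axis=0)
    return np.percentile(stats, [2.5, 97.5], axis=0)

lo, hi = bootstrap_ci(annual_maxima)
print(lo, hi)
```

Resampling rows rather than individual values is the key design choice: each bootstrap sample contains complete years, so whatever dependence exists between durations within a year carries over into the resampled statistics.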

Discussion
In this study, we show that model performance can be increased when the flattening of IDF curves in the long-duration regime is taken into account. We assume that this behavior arises from seasonal effects. That means that the annual maxima of different durations may not follow the same scaling process. However, this topic is currently under further investigation (Ulrich et al., 2021a).
The analyzed features, i.e., curvature, multiscaling, and flattening, were seen in the results to have a different impact on modeling performance, depending on the duration and return period. All features are able to improve the model for certain regimes, but depending on the problem that is approached, features should be chosen accordingly. If the focus is on small timescales of minutes, then the curvature feature is important for a good modeling result. When curvature is used and medium to long timescales are also of importance, then the flattening feature should be used. This helps to compensate for the deterioration due to curvature over longer durations. Multiscaling is a good choice if a loss in skill for short durations can be accepted in exchange for a simultaneous improvement at long durations, regardless of which other features are requested.
The skill of each feature depends on the presence of the other features. This dependence is strongest for the flattening, which can only improve the model when curvature is used. The modeling performance of the curvature depends less on the presence of other features. The same applies to the multiscaling feature.
These suggestions hold for models that are supposed to cover a wide range of timescales from minutes to days. For data with hourly or coarser temporal resolution, the skill gain from using the features is much smaller. Here, flattening can improve the model slightly on a daily timescale, and multiscaling improves the modeling of long durations a little but leads to a small reduction in skill on the hourly timescale.
Additional parameters give the model more flexibility. Including τ in the model allows one to reduce deviations between model and data points, particularly for long durations. This, in turn, opens the possibility to vary the remaining parameters such that deviations between model and data points can be reduced in other (e.g., short) duration regimes. Conversely, this also holds when including θ. In this way, a parameter that changes the curve in long durations can increase the modeling performance in other durations and even slightly decrease model performance in long durations in certain cases. In Fig. 7, IDF curves for two models are compared for a chosen station (Bever). The model that includes flattening (IDF_cmf) is able to follow the empirical quantiles in long durations as well as the model without flattening (IDF_cm). However, flattening gives the model the opportunity to better follow the empirical quantiles in medium durations between 4 h and 1 d, which is in accordance with the results in Fig. 3c.

Figure 6. Bootstrapping coverage. Using a Brown-Resnick max-stable process, the coverage was determined in order to investigate the reliability of 95 % confidence intervals from bootstrapping. A total of three different levels of dependence were used.
The parametric form of the IDF relation is based on three modifications of a simple power law, which are motivated by our understanding of the rainfall process, namely curvature (e.g., Koutsoyiannis et al., 1998) for small durations, addressing limits to rainfall intensity; multiscaling (e.g., Van de Vyver, 2018), taking care of a varying scaling behavior for events of different strength; and flattening (suggested here), resulting from a mixing of convective and stratiform precipitation extremes from different seasons in the climate regime under study. Our contribution is to combine these modifications into a flexible parametric form capable of describing various effects related to rainfall processes. The resulting model is based on empirical grounds, and it can be shown to improve the description over simple power law models. Other modifications are possible. To our knowledge, there is no theoretical justification for these forms that can be derived from first principles for rainfall processes. Since our results might apply only to the geographical region under investigation, further studies are necessary to find out whether the model performance of the different features reported here is generally applicable. The behavior of the shape parameter ξ with respect to duration is still unclear, since no linear or log-linear duration dependence could be found, as for the location and scale parameters µ and σ. A possible approach would be to let the shape parameter vary smoothly across durations in a non-parametric manner but to penalize its deviation from the median over all durations (Bücher et al., 2021). However, the scarce data availability hampers a more complex estimation of the shape parameter.

Summary and outlook
The aim of this study is to compare and suggest new parametric forms of consistent IDF curves that are applicable to a large range of durations from minutes to several days and, therefore, cover events from short-lived convective storms to long-lasting synoptic events. The dependence on duration is implemented in the location and scale parameters and allows for three features, i.e., curvature, multiscaling, and flattening. The analysis of these features enables us to understand more about the underlying physical effects beyond the subject of return periods and provides more flexible IDF curves that are suitable for a wide range of durations. The results of our simulation study show that we are able to provide reasonable estimates of uncertainty using bootstrapping, also with regard to dependence between durations.
Our findings agree with Veneziano and Furcolo (2002), who found that simple scaling was adequate for modeling short durations, and multiscaling was adequate for long durations. Moreover, our conclusion that curvature improves the modeling of short durations indirectly agrees with Bougadis and Adamowski (2006), who used different slopes for durations longer or shorter than 1 h, respectively, and concluded that linear scaling does not hold for small durations.
Consistent modeling using the d-GEV enables the use of fewer parameters. In this way, the model can be easily extended, e.g., using physically relevant atmospheric covariates. Thus, improving the parameterization of the d-GEV is crucial for paving the way for further steps. In future studies, we plan to include spatial covariates in the estimation of the newly proposed d-GEV parameters, including the intensity offset, in order to use data from different locations more efficiently. Also, the concept of non-stationary precipitation with respect to IDF curves is important to consider (see Cheng and AghaKouchak, 2014; Ganguli and Coulibaly, 2017; Yan et al., 2021), since extremes are expected to vary due to climate change. For example, Benestad et al. (2021) found a model that enables downscaling of 24 h measurement data to shorter durations without assuming stationarity. However, we think that implementing atmospheric large-scale covariates (as in Agilan and Umamahesh, 2017) into the flexible d-GEV model proposed here would allow for a better understanding of the underlying processes. We plan to use this approach to investigate the change in characteristics of extreme precipitation due to climate change in future studies. Different approaches have been taken to create IDF curves, and the choice of method depends on the study target; for example, Bezak et al. (2016) used copula-based IDF curves and reported that IDF curves might be sensitive to the choice of method. This is important to consider when deciding on an appropriate way to create IDF curves. Moreover, the origin of flattening in annual maxima for long durations is currently being investigated in more detail (Ulrich et al., 2021a).
The analysis of the performance shows that the new parametric form of the duration-dependent GEV suggested here, together with the bootstrap-based confidence intervals, offers a consistent, flexible, and powerful approach to describing the intensity-duration-frequency (IDF) relationships for various applications in hydrology, meteorology, and other fields.

Appendix A: Estimation of initial values

All initial-value techniques were based on the same first step. An individual GEV distribution was fitted to each duration d separately, with moment estimators as initial values (Coles, 2001), and the three GEV parameters of location µ, scale σ, and shape ξ were stored for each duration. In the next step, a function was fitted to each of the parameters with respect to the duration. Since we assumed no dependence of the shape parameter on the duration, we chose ξv1 = median(ξ) for all suggestions. According to Eqs. (4) and (5), we fitted log(σ(d)) and log(µ(d)) as a function of log(d) in a linear regression with a simple slope and y intercept,

log(σ(d)) = log(σ0) − (η + η2) log(d),   (A1)
log(µ(d)) = log(µ̃ σ0) − η log(d),   (A2)

with given σ(d), µ(d), and d. From this fit, we extracted σ0,v1 = exp(log(σ0)) (Eq. A1), µ̃v1 = µ̃ (combining Eqs. A1 and A2), ηv1 = η, and η2,v1 = η2. For the most simple suggestion of initial parameters, we chose θv1 = 0 and τv1 = 0.
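This first step, i.e., per-duration GEV fits followed by a log-log regression, might look as follows in Python. The synthetic data, the scipy stand-ins for the original tooling, and the mapping of regression coefficients to ηv1, η2,v1, σ0,v1, and µ̃v1 are assumptions for illustration:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
durations = np.array([1 / 60, 0.25, 1.0, 6.0, 24.0, 72.0])  # hours

# Synthetic annual maxima per duration (real data come from the stations)
samples = {d: genextreme.rvs(c=-0.1, loc=10 * d ** -0.6,
                             scale=3 * d ** -0.6, size=60,
                             random_state=rng)
           for d in durations}

# Step 1: fit an individual GEV to each duration
mu, sigma, xi = [], [], []
for d in durations:
    c, loc, scale = genextreme.fit(samples[d])
    mu.append(loc); sigma.append(scale); xi.append(-c)  # scipy's c = -xi

# Step 2: duration-independent shape; log-log regressions for mu and sigma
xi_v1 = np.median(xi)
slope_mu, icpt_mu = np.polyfit(np.log(durations), np.log(mu), 1)
slope_sg, icpt_sg = np.polyfit(np.log(durations), np.log(sigma), 1)
eta_v1 = -slope_mu                  # slope of log mu(d) gives -eta
eta2_v1 = -slope_sg - eta_v1        # extra scale exponent (multiscaling)
sigma0_v1 = np.exp(icpt_sg)         # intercept of the sigma regression
mu_tilde_v1 = np.exp(icpt_mu) / sigma0_v1   # combine both intercepts
print(eta_v1, eta2_v1, sigma0_v1, mu_tilde_v1, xi_v1)
```

Taking the median of the per-duration shape estimates mirrors the assumption that ξ does not depend on duration.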
In the next steps, we further elaborated the ways of finding good initial values. Version 2 of the duration exponents, ηv2 and η2,v2, was found using only d ≥ 1 h, because the slope is mainly characterized by this duration regime. This is the second set of suggestions for initial values, together with version 1 of the other parameters. Another initial duration offset θv2 could be estimated by fitting a nonlinear least squares regression (R function nls) of

σ(d) = σ0 (d + θ)^(−(η+η2)) and µ(d) = µ̃ σ0 (d + θ)^(−η),

with given σ(d), µ(d), and d. The mean of both estimates for θ was used for θv2. These functions were less stable and provided worse initial values for σ0 and µ̃ than Eqs. (A1) and (A2). That is why we used them only for estimating the initial θv2 in the nls function and not for estimating the initial µ̃, σ0, η, or η2. The new initial estimate θv2 was combined with ηv1 and η2,v1 in one set (suggestion 3) and with ηv2 and η2,v2 in another set (suggestion 4). The same nls function was used for an estimation of the initial τv2, taking only d ≥ 1 h and no duration offset (curvature),

σ(d) = σ0 d^(−(η+η2)) + τ and µ(d) = µ̃ (σ0 d^(−η) + τ),

with given σ(d), µ(d), and d. Again, the mean of both estimates of τ was used as τv2. To define suggestions 5-8, this second version of τ was combined with (θv1, ηv1, η2,v1), (θv1, ηv2, η2,v2), (θv2, ηv1, η2,v1), or (θv2, ηv2, η2,v2). The different combinations of initial value versions are listed again in Table A1.

… the additional data points do not contain substantial new information.
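The nls step can be sketched with scipy.optimize.curve_fit as a stand-in for R's nls; the model form for σ(d) and the toy data (generated with θ = 0.2) are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy per-duration scale estimates generated with curvature (theta = 0.2)
durations = np.array([1 / 60, 0.25, 1.0, 6.0, 24.0])
sigma_hat = 5.0 * (durations + 0.2) ** -0.7

def scale_model(d, sigma0, eta, theta):
    # Assumed nls model form: sigma(d) = sigma0 * (d + theta)^(-eta)
    return sigma0 * (d + theta) ** -eta

popt, _ = curve_fit(scale_model, durations, sigma_hat,
                    p0=[1.0, 0.5, 0.1],
                    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 5.0]))
sigma0_hat, eta_hat, theta_v2 = popt
print(sigma0_hat, eta_hat, theta_v2)
```

In the procedure described above, the analogous fit would also be run for µ(d) and the two θ estimates averaged; the lower bound on θ keeps d + θ positive during optimization, which stabilizes the fit in the same spirit as the stability remarks above.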
The model with the new artificial training data set (Eq. C1) is now verified against the same model with the previously used artificial data set (Eq. 1) with the following results: all QSIs are below 0.05 for all durations d and all quantiles q ∈ {0.5, 0.8, 0.9, 0.95, 0.98} (not shown). Thus, the results do not indicate that the choice of accumulation duration significantly influences how well the model performs for certain duration regimes.
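For reference, a quantile skill index of this kind is commonly built from the pinball (quantile) loss of a model relative to a reference; the exact definition used in this study may differ, so the sketch below is only a common variant on synthetic data:

```python
import numpy as np

def quantile_score(obs, q_pred, p):
    """Mean pinball (check) loss of a predicted p-quantile q_pred."""
    u = obs - q_pred
    return np.mean(np.where(u >= 0, p * u, (p - 1) * u))

def qsi(obs, q_model, q_ref, p):
    """Quantile skill index: positive when the model beats the reference."""
    return 1.0 - quantile_score(obs, q_model, p) / quantile_score(obs, q_ref, p)

rng = np.random.default_rng(1)
obs = rng.gumbel(loc=10, scale=3, size=2000)
p = 0.9
true_q = 10 - 3 * np.log(-np.log(p))  # Gumbel p-quantile
print(qsi(obs, true_q, true_q + 3.0, p))
```

Under this convention, a QSI near zero (e.g., below 0.05) means the two models predict quantiles of nearly identical quality, which is how the verification result above is read.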
Author contributions. The conceptualization was done by HWR, FSF, and JU, with FSF and JU also curating the data, developing the methodology, and validating the findings. FSF conducted the formal analysis, validated and visualized the project, and prepared the original draft. HWR acquired the funding and supervised the project. The software was developed by FSF, JU, and OEJ. The review and editing of the paper was done by JU, OEJ, and HWR. Moreover, all of the authors have read and agreed to the published version of the paper.