The need to fit time series characterized by the presence of a trend or change points has generated increased interest in the investigation of nonstationary probability distributions in recent years. Considering that the available hydrological time series can be recognized as the observable part of a stochastic process with a definite probability distribution, two main topics can be tackled in this context: the first is related to the definition of an objective criterion for choosing whether the stationary hypothesis can be adopted, whereas the second regards the effects of nonstationarity on the estimation of distribution parameters and quantiles for an assigned return period and flood risk evaluation. Although the time series trend or change points are usually detected using nonparametric tests available in the literature (e.g., Mann–Kendall or CUSUM test), the correct selection of the stationary or nonstationary probability distribution is still required for design purposes. In this light, the focus is shifted toward model selection criteria; this implies the use of parametric methods, including all of the issues related to parameter estimation. The aim of this study is to compare the performance of parametric and nonparametric methods for trend detection, analyzing their power and focusing on the use of traditional model selection tools (e.g., the Akaike information criterion and the likelihood ratio test) within this context. The power and efficiency of parameter estimation, including the trend coefficient, were investigated via Monte Carlo simulations using the generalized extreme value distribution as the parent with selected parameter sets.

The long- and medium-term prediction of extreme hydrological events under nonstationary conditions is one of the major challenges of our times. Streamflow, as well as temporal rainfall and many other hydrological phenomena, can be considered as stochastic processes (Chow, 1964), i.e., families of random variables with an assigned probability distribution, and time series are the observable part of this process. One of the main goals of extreme event frequency analysis is the estimation of distribution quantiles related to a certain non-exceedance probability. They are usually obtained after fitting a probabilistic model to observed data. As Koutsoyiannis and Montanari (2015) depicted in their historical review of the “concept of stationarity”, Kolmogorov, in 1931, “used the term stationary to describe a probability density function that is unchanged in time”, whereas Khintchine (1934) provided a formal definition of stationarity of a stochastic process.

In this context, detecting the existence of time-dependence in a stochastic process should be considered a necessary task in the statistical analysis of recorded time series. Thus, several considerations should be made with respect to updating some important hydrological concepts while assuming that the non-exceedance probability varies with time or other covariates. For example, the return period may be reformulated in two different ways, the “expected waiting time” (EWT; Olsen et al., 1998) or the “expected number of events” (ENE; Parey et al., 2007, 2010), which lead to a different evaluation of quantiles within a nonstationary approach. As proved by Cooley (2013), the EWT and ENE are affected differently by nonstationarity, possibly producing ambiguity in engineering design practice (Du et al., 2015; Read and Vogel, 2015). Salas and Obeysekera (2014) provided a detailed report regarding relationships between stationary and nonstationary EWT values within a parametric approach for the assessment of nonstationary conditions. In such a framework, strong relevance is given to statistical tools for detecting changes in non-normally distributed time series (Kundzewicz and Robson, 2004).
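Cooley's (2013) distinction between the two reformulations can be made concrete with a short numerical sketch (an illustration under assumed yearly exceedance probabilities for a fixed design level; the function names and the truncation horizon are our own choices):

```python
import numpy as np

def _extend(p, horizon):
    """Hold the last yearly exceedance probability constant up to `horizon`."""
    p = np.asarray(p, dtype=float)
    pad = max(horizon - len(p), 0)
    return np.concatenate([p, np.full(pad, p[-1])])

def ewt_return_period(p, horizon=10000):
    """EWT (Olsen et al., 1998): expected waiting time (in years) until the
    first exceedance, E[X] = 1 + sum_x prod_{t<=x} (1 - p_t)."""
    p_ext = _extend(p, horizon)
    surv = np.cumprod(1.0 - p_ext)      # P(no exceedance through year x)
    return 1.0 + surv.sum()

def ene_return_period(p, horizon=10000):
    """ENE (Parey et al., 2007): the smallest T with sum_{t=1}^{T} p_t >= 1,
    i.e., the horizon over which one exceedance is expected."""
    p_ext = _extend(p, horizon)
    return int(np.searchsorted(np.cumsum(p_ext), 1.0) + 1)

# Stationarity check: a constant p = 0.01 recovers T = 100 under both views.
p_const = np.full(200, 0.01)
print(ewt_return_period(p_const))   # ≈ 100.0
print(ene_return_period(p_const))   # 100
```

Under stationarity both definitions coincide with the classical T = 1/p; under a trend in the exceedance probability they generally diverge, which is precisely the source of ambiguity discussed above.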

To date, the vast majority of research regarding climate change and the detection of nonstationary conditions has been developed using nonparametric approaches. One of the most commonly used nonparametric measures of trend is Sen's slope (Gocic and Trajkovic, 2013); however, a wide array of nonparametric tests for detecting nonstationarity is available (e.g., Kundzewicz and Robson, 2004). Statistical tests include the Mann–Kendall (MK; Mann, 1945; Kendall, 1975) and Spearman (Lehmann, 1975) tests for detecting trends, and the Pettitt (Pettitt, 1979) and CUSUM (Smadi and Zghoul, 2006) tests for change point detection. All of these tests are based on a specific null hypothesis and have to be performed for an assigned significance level. Nonparametric tests are usually preferred over parametric tests as they are distribution-free and do not require knowledge of the parent distribution. They are traditionally considered more suitable than parametric tests for the frequency analysis of extreme events because they are less sensitive to the presence of outliers (Wang et al., 2005).

In contrast, the use of null hypothesis significance tests for trend detection has raised concerns and severe criticisms in a wide range of scientific fields for many years (e.g., Cohen, 1994), as outlined by Vogel et al. (2013). Serinaldi et al. (2018) provided an extensive critical review focusing on logical flaws and misinterpretations often related to their misuse.

In general, the use of statistical tests involves different errors, such as type I error (rejecting the null hypothesis when it is true) and type II error (accepting the null hypothesis when it is false). The latter is related to the test power, i.e., the probability of rejecting the null hypothesis when it is false; however, as recognized by a few authors (e.g., Milly et al., 2015; Beven, 2016), the importance of the power has been largely overlooked in Earth system science fields. Strong attention has always been paid to the level of significance (i.e., type I error), although, as pointed out by Vogel et al. (2013), “a type II error in the context of an infrastructure decision implies under-preparedness, which is often an error much more costly to society than the type I error (over-preparedness)”.

Moreover, as already proven by Yue et al. (2002a), the power of the Mann–Kendall test, despite its nonparametric structure, actually shows a strong dependence on the type and parametrization of the parent distribution.

Using a parametric approach, the estimation of quantiles of an extreme event distribution requires the search for the underlying distribution and for time-dependent hydrological variables. If variables are time-dependent, they are “i/nid” (independent/non-identically distributed) and the model is considered nonstationary; otherwise, the variables are “iid” (independent, identically distributed) and the model is a stationary one (Montanari and Koutsoyiannis, 2014; Serinaldi and Kilsby, 2015).

From this perspective, the detection of nonstationarity may exploit (besides traditional statistical tests) well-known properties of model selection tools. Even in this case, several measures and criteria are available for selecting a best-fit model, such as the Akaike information criterion (AIC; Akaike, 1974), the Bayesian information criterion (BIC; Schwarz, 1978), and the likelihood ratio test (LR; Coles, 2001); the latter is suitable when dealing with nested models.

The purpose of this paper is to provide further insights into the use of
parametric and nonparametric approaches in the framework of extreme event
frequency analysis under nonstationary conditions. The comparison between
those different approaches is not straightforward. Nonparametric tests do
not require knowledge of the parent distribution, and their properties
strongly rely on the choice of the null hypothesis. Parametric methods for
model selection, in comparison, require the selection of the parent
distribution and the estimation of its parameters, but are not necessarily
associated with a specific null hypothesis. Nevertheless, in both cases, the
evaluation of the rejection threshold is usually based on a statistical
measure of trend that, under the null hypothesis of stationarity, follows a
specific distribution (e.g., the Gaussianity of the Kendall statistic for the MK nonparametric test, and the

Considering the pros and cons of the different approaches, we believe that specific remarks should be made about the use of parametric and nonparametric methods for the analysis of extreme event series. For this purpose, we set up a numerical experiment to compare the performance of (1) the MK as a nonparametric test for trend detection, (2) the LR parametric test for model selection, and (3) the AIC

We aim to provide (i) a comparison of test power between the MK, LR, and
AIC

We conducted the analysis using Monte Carlo techniques; this entailed generating samples from parent populations assuming one of the most popular extreme
event distributions, the generalized extreme value (GEV; Jenkinson, 1955), with a
linear (and without any) trend in the position parameter. From the
samples generated, we numerically evaluated the power and significance level of tests
for trend detection, using the MK, LR, and AIC

Considering that parametric methods involve the estimation of the parent distribution parameters, we also analyzed the efficiency of the maximum likelihood (ML) estimator by comparing the sample variability of the ML estimate of the trend with that of the nonparametric Sen's slope. Furthermore, we examined the sample variability of the GEV parameters in the stationary and nonstationary cases.

This section is divided into five parts. Sect. 2.1, 2.2, and 2.3 report the
main characteristics of the MK, LR, and AIC

Hydrological time series are often composed of independent, non-normally
distributed realizations of phenomena, and this characteristic makes
nonparametric trend tests very attractive (Kundzewicz and Robson, 2004). The
Mann–Kendall test is a widely used rank-based tool for detecting monotonic,
and not necessarily linear, trends. Given a random variable z, and assigned
a sample of

The null hypothesis of this test is the absence of any statistically
significant trend in the sample, whereas the presence of a trend represents an alternative hypothesis. Yilmaz and Perera (2014) reported that serial dependence can
lead to a more frequent rejection of the null hypothesis. For

Yue et al. (2002b) observed that autocorrelation in time series can influence the ability of the MK test to detect trends. To avoid this problem, a correct approach with respect to trend analysis should contemplate a preliminary check for autocorrelation and, if necessary, the application of pre-whitening procedures.
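For reference, the rank-based MK statistic S, its standardized value, and the two-sided p-value can be sketched as follows (a minimal illustration assuming no ties, so the simplified variance formula applies):

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(z):
    """Two-sided Mann–Kendall trend test (Mann, 1945; Kendall, 1975).
    Returns S, the standardized statistic, and the p-value; the variance
    formula below assumes there are no ties in the sample."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    # S = sum over all pairs i < j of sign(z_j - z_i)
    s = sum(np.sign(z[j] - z[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z_mk = (s - 1) / np.sqrt(var_s)   # continuity correction
    elif s < 0:
        z_mk = (s + 1) / np.sqrt(var_s)
    else:
        z_mk = 0.0
    p_value = 2.0 * norm.sf(abs(z_mk))    # two-sided, Gaussian approximation
    return s, z_mk, p_value

# A strictly increasing series is rejected at the 5 % level.
s, z_mk, p = mann_kendall(np.arange(30))
print(s, p < 0.05)   # 435.0 True
```

The Gaussian approximation for the standardized statistic is exactly the distributional assumption (under the null hypothesis of stationarity) discussed later in the paper.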

A nonparametric tool for a reliable estimation of a trend in a time series with
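Sen's slope, the nonparametric trend estimator compared later with the ML estimate of the trend coefficient, is simply the median of all pairwise slopes; a minimal sketch:

```python
import numpy as np

def sens_slope(z, t=None):
    """Sen's slope: the median of all pairwise slopes
    (z[j] - z[i]) / (t[j] - t[i]), a robust nonparametric trend estimate."""
    z = np.asarray(z, dtype=float)
    t = np.arange(len(z), dtype=float) if t is None else np.asarray(t, dtype=float)
    slopes = [(z[j] - z[i]) / (t[j] - t[i])
              for i in range(len(z) - 1) for j in range(i + 1, len(z))]
    return float(np.median(slopes))

# A noiseless linear series z = 2t yields a slope of exactly 2.
print(sens_slope(2.0 * np.arange(20)))   # 2.0
```

Because the median is insensitive to the relatively few pairs that involve an anomalous value, the estimator is robust to isolated outliers in the series.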

The likelihood ratio statistical test allows for the comparison of two candidate models. As its name suggests, it is based on the evaluation of the likelihood function of different models.

The LR test has been used multiple times (Tramblay et al., 2013; Cheng et
al., 2014; Yilmaz et al., 2014) to select between stationary and nonstationary models in the context of nested models. Given a stationary
model characterized by a parameter set
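A minimal sketch of the LR decision rule for one extra parameter (a stationary GEV against a nonstationary alternative); the 3.5-unit log-likelihood gain assigned to the nonstationary model is an assumed, purely illustrative value:

```python
import numpy as np
from scipy.stats import chi2, genextreme

def lr_test(loglik_null, loglik_alt, df=1, alpha=0.05):
    """Likelihood ratio test for nested models (Coles, 2001). Under H0 the
    deviance D = 2*(l_alt - l_null) is asymptotically chi-squared with df
    equal to the number of extra parameters in the alternative model."""
    deviance = 2.0 * (loglik_alt - loglik_null)
    threshold = chi2.ppf(1.0 - alpha, df)
    return deviance, deviance > threshold

# Stationary GEV log-likelihood on a synthetic sample; the nonstationary
# model's maximized log-likelihood is here assumed to be 3.5 units larger.
sample = genextreme.rvs(0.1, loc=10.0, scale=2.0, size=50, random_state=1)
c_hat, loc_hat, scale_hat = genextreme.fit(sample)
l0 = np.sum(genextreme.logpdf(sample, c_hat, loc=loc_hat, scale=scale_hat))
d, reject = lr_test(l0, l0 + 3.5)
print(d, reject)   # ≈ 7.0, True (threshold chi2_{0.95,1} ≈ 3.84)
```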

Besides the analysis of power, we also checked (in Sect. 3.3) the approximation

Information criteria are useful tools for model selection; the Akaike
information criterion (AIC; Akaike, 1974) is arguably the best known among
them. Based on the Kullback–Leibler discrepancy measure, if

An empirical distribution of AIC

Sugiura (1978) observed that the AIC can lead to misleading results for small samples; thus, he proposed a new measure for the AIC:
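In formulas, AIC = 2k − 2ℓ for a model with k parameters and maximized log-likelihood ℓ, and Sugiura's correction adds a penalty term that grows as the sample size n approaches k; a minimal sketch:

```python
def aic(loglik, k):
    """Akaike information criterion: AIC = 2k - 2*loglik (Akaike, 1974)."""
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    """Small-sample corrected AIC (Sugiura, 1978):
    AICc = AIC + 2k(k + 1) / (n - k - 1)."""
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

# For n = 30 and a four-parameter nonstationary GEV the correction adds
# 2*4*5 / (30 - 4 - 1) = 1.6 to the AIC; it vanishes as n grows.
print(aicc(-50.0, 4, 30) - aic(-50.0, 4))   # 1.6
```

The correction penalizes the extra (trend) parameter more heavily at the short record lengths typical of annual maximum series, which is why the choice between AIC and its corrected form matters in this context.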

In order to select between stationary and nonstationary candidate models, we
use the ratio

Considering that the better fitting model has a lower AIC, if the time series
arises from a nonstationary process, the AIC

In order to provide a rigorous comparison between the use of the MK, LR, and
AIC

More in detail, we adopted the following procedure:

for each of these samples the AIC

exploiting the empirical distribution of AIC

The cumulative distribution function of the generalized extreme value (GEV) distribution (Jenkinson,
1955) can be expressed as follows:
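A direct transcription of the GEV CDF in the convention assumed here (shape ξ, location μ, scale σ, with the Gumbel case as the ξ → 0 limit) may help fix the sign conventions; note that scipy.stats.genextreme uses the opposite sign, c = −ξ:

```python
import math
from scipy.stats import genextreme

def gev_cdf(z, loc, scale, shape):
    """GEV CDF (Jenkinson, 1955):
    F(z) = exp{-[1 + shape*(z - loc)/scale]^(-1/shape)},
    with the Gumbel limit exp{-exp[-(z - loc)/scale]} as shape -> 0."""
    s = (z - loc) / scale
    if abs(shape) < 1e-12:                 # Gumbel limit
        return math.exp(-math.exp(-s))
    arg = 1.0 + shape * s
    if arg <= 0.0:                         # outside the support
        return 0.0 if shape > 0 else 1.0   # heavy-tailed vs. upper-bounded
    return math.exp(-arg ** (-1.0 / shape))

# At z = loc the CDF equals exp(-1) for any shape; scipy agrees with c = -shape.
print(gev_cdf(10.0, 10.0, 2.0, 0.1))                                # ≈ 0.3679
print(abs(gev_cdf(12.0, 10.0, 2.0, 0.1)
          - genextreme.cdf(12.0, -0.1, loc=10.0, scale=2.0)) < 1e-12)  # True
```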

Traditional extreme value distributions can be used in a nonstationary
framework, modeling their parameters as functions of time or other
covariates (Coles, 2001), producing

In this study, only a deterministic linear dependence on the time

It is important to note that Eq. (8) is a more general way of defining the GEV
and has the property of degenerating into Eq. (7) for

According to Muraleedharan et al. (2010), the first three moments of the GEV
distribution are as follows:

In this work, we used the maximum likelihood method (ML) to estimate the GEV
parameters from sample data. The ML allows one to treat
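A minimal sketch of such an ML fit, with a linear trend in the location parameter, μ(t) = μ₀ + b·t, and time-invariant scale and shape (the parameterization, optimizer, and starting values are illustrative choices, not the exact setup of this study):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme

def fit_nonstat_gev(z, t):
    """ML fit of a GEV with a linear trend in the location, mu(t) = mu0 + b*t,
    and time-invariant scale and shape. scipy's shape convention is c = -xi."""
    z, t = np.asarray(z, float), np.asarray(t, float)

    def nll(theta):
        mu0, b, log_sigma, xi = theta
        sigma = np.exp(log_sigma)            # keeps the scale positive
        return -np.sum(genextreme.logpdf(z, -xi, loc=mu0 + b * t, scale=sigma))

    theta0 = [np.mean(z), 0.0, np.log(np.std(z)), 0.1]   # illustrative start
    res = minimize(nll, theta0, method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
    mu0, b, log_sigma, xi = res.x
    return mu0, b, np.exp(log_sigma), xi, -res.fun   # estimates + max log-lik

# Synthetic check: a trend of b = 0.05 in the location should be recovered.
t = np.arange(200)
z = genextreme.rvs(-0.1, loc=10.0 + 0.05 * t, scale=2.0, size=200, random_state=42)
mu0, b, sigma, xi, loglik = fit_nonstat_gev(z, t)
print(f"b ≈ {b:.3f}")
```

The maximized log-likelihood returned here is also the ingredient needed for the LR and AIC-type comparisons between the stationary and nonstationary candidate models.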

The power of a test is related to the type II error and is the
probability of correctly rejecting the null hypothesis when it is false. In
particular, defining

The threshold AIC

From these synthetic series, the power of the test is estimated as
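The Monte Carlo evaluation of power can be sketched as follows; here scipy's kendalltau p-value stands in for the MK decision rule, and all parameter values are illustrative:

```python
import numpy as np
from scipy.stats import genextreme, kendalltau

def mc_power(test_fn, n=50, b=0.05, loc=10.0, scale=2.0, xi=0.0,
             n_rep=500, seed=0):
    """Fraction of synthetic GEV samples, with a linear trend b in the
    location, for which test_fn(sample) rejects stationarity. With b = 0
    this estimates the actual significance level instead of the power."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    rejections = 0
    for _ in range(n_rep):
        u = rng.uniform(1e-12, 1.0, size=n)           # avoid ppf(0) = -inf
        sample = genextreme.ppf(u, -xi, loc=loc + b * t, scale=scale)
        rejections += bool(test_fn(sample))
    return rejections / n_rep

def mk_reject(z, alpha=0.05):
    """Stand-in MK decision rule via scipy's Kendall tau p-value."""
    _, p = kendalltau(np.arange(len(z)), z)
    return p < alpha

print(mc_power(mk_reject, b=0.0, n_rep=200))   # ≈ 0.05 (significance level)
print(mc_power(mk_reject, b=0.05, n_rep=200))  # power under a weak trend
```

Repeating this for the MK, LR, and AIC-based decision rules over a grid of trend coefficients, sample sizes, and parent parameters yields the power curves compared in the following sections.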

We used a reduced number of generations (

A comparative evaluation of the tests' performance was carried out for different GEV parameter sets

For a clear exposition of the results, this section is divided into four
subsections. In Sect. 3.1, we focus on the opportunity to use the AIC or
AIC

Distributions of the differences between the power of AIC

Considering the nonstationary GEV four-parameter model, in order to satisfy
the relation

The effect of the parent distribution parameters and the sample size on the numerical evaluation of the power and significance level of the MK, LR, and AIC

Dependence of test power on the trend coefficient, sample size, scale, and shape of the parent parameters.

In particular, for

A higher difference is found for a heavy-tailed parent distribution
(

The practical consequences of such patterns are very important and are discussed in Sect. 4.

We evaluated the threshold values (corresponding to a significance level of
0.05) for accepting/rejecting the null hypothesis of stationarity according
to the methodologies recalled in Sect. 2.1 and 2.2 for the MK and LR tests and
introduced in Sect. 2.3 for AIC

Table 1 shows the numerical values of the actual level of significance,
obtained numerically, to be compared with the theoretical value of 0.05 for all of
the sets of parameters and sample sizes considered. Among the three measures
for trend detection, the LR shows the worst performance. The results in Table 1
show that the rejection rate of the (true) null hypothesis is systematically
higher than it should be, and it is also dependent on parent parameter values. This effect is amplified when the parent distribution has an upper bound
(

The actual level of significance of the tests for different sample sizes, scales, and shapes of the parent parameters.

Conversely, the performance of the MK test with respect to the nominal level
of significance is less biased and is independent of the parameter set.
Similar good performance is trivially obtained for the AIC

Enlargement of the power test curves in the case (

The plot in Fig. 4 is displayed in order to focus on the actual value of the
level of significance and, in particular, on the LR approximation

Actual level of significance of the AIC

AIC

Other considerations can be made regarding the use of AIC

Sample variability of ML-

In our opinion, the results shown above, with respect to the performance of
parametric and nonparametric tests, are quite surprising and important. They
show that the preference widely accorded to nonparametric tests, on the
grounds that their statistics are allegedly independent of the parent
distribution, is not well founded. Conversely, the use of parametric
procedures raises the
problem of correctly estimating the parent distribution and, for the purpose
of this paper, its parameters. Moreover, as the trend coefficient

We evaluated sample variability

In Fig. 7, we show the empirical distributions of the Sen's slope

Empirical distributions of

Figure 8 shows the sample variability of ML-

Sample variability of ML-

Empirical distributions of ML-

Empirical distributions of ML-

In order to better analyze such patterns, for the scale and shape parent
parameters we also report the distribution of their empirical ML estimates
for different parameter sets vs. the true

The results shown have important practical implications. The dependence of test power on the parent distribution parameters may significantly affect results of both parametric and nonparametric tests, including the widely used Mann–Kendall test.

Considering the feasibility of the numerical evaluation of power allowed by the parametric approach, we observe that, while awareness of the crucial role of type II error has been growing in recent years in the hydrological literature, the debate about which power values should be considered acceptable deserves further development. This issue is far more developed in other scientific fields, where experimental design traditionally requires estimating the sample size needed to adequately support results and conclusions. In psychological research, Cohen (1992) proposed 0.8 as a conventional value of power to be used with a significance level of 0.05, thus leading to a 4 : 1 ratio between the risk of type II and type I error. The conventional value proposed by Cohen (1992) has been taken as a reference by thousands of papers in the social and behavioral sciences. In pharmacological and medical research, depending on the real implications and the nature of the type II error, conventional values of power may be as high as 0.999. This was the value suggested by Lieber (1990) for testing a treatment for patients' blood pressure. The author stated, while “guarding against cookbook application of statistical methods”, that “it should also be noted that, at times, type II error may be more important to an investigator than type I error”.

We believe that, when selecting between stationary and nonstationary models
for extreme hydrological event prediction, a fair comparison between the
null and the alternative hypotheses of

For all of the generation sets and tests conducted, under the null hypothesis
of stationarity, the power takes values ranging between the chosen
significance level (0.05) and 1 over wide ranges of the trend
coefficient. The test power always collapses to very low values for weak
(but climatically important) trend values (e.g., in the case of annual maximum
daily rainfall,

These results also imply that in spatial fields where the alternative
hypothesis of nonstationarity is true but the parent's parameters
(including the trend coefficient) and the sample length are variable in
space, the rate of rejection of the false null hypothesis may be highly
variable from site to site and the power, if left without control, de facto assumes random values in space. In other words, the probability of recognizing the alternative hypothesis of
nonstationarity as true from a single observed sample may unknowingly
change (between 0.05 and 1) from place to place. For small samples (e.g.,

Therefore, considering the high spatial variability of the parent distribution parameters and the relatively short period of reliable and continuous historical observations usually available, a regional assessment of trend nonstationarity may suffer from spatially varying probabilities of rejecting the null hypothesis of stationarity (when it is false).

These problems affect both parametric and
nonparametric tests (to slightly different degrees). While these considerations are generally applicable to
all of the tests considered, differences also emerge between them. For heavy-tailed parent distributions and smaller samples, the MK test power decreases
more rapidly than for the other tests considered. Low values of power are
already observable for

Results also suggest that the theoretical distribution of the LR test-statistic based on the null hypothesis of stationarity may lead to a significant increase in the rejection rate compared with the chosen level of significance, i.e., an abnormal rate of rejection of the null hypothesis when it is true. In this case, the use of numerical techniques, based on the implementation of synthetic generations performed by exploiting a known parent distribution, should be preferred.

In light of these results, we conclude that the assessment of the parent distribution and the choice of
the null hypothesis should be considered as fundamental preliminary tasks in trend detection on annual
maximum series. Therefore, it is advisable to make use of parametric tests by
numerically evaluating both the rejection threshold for the assigned
significance level and the power corresponding to alternative hypotheses.
This also requires the development of robust techniques for selecting the parent
distribution and estimating its parameters. To this end, the use of
a parametric measure such as the AIC

The need for robust procedures to assess the parent distribution and its
parameters is also proven by the numerical simulations that we conducted. Sample
variability of parameters (including the trend coefficient) may increase
rapidly for series with

This analysis sheds light on important potential flaws in the at-site analysis of climate change provided by nonparametric approaches. Both test power and trend evaluation are affected by the parent distribution, as is also the case for parametric methods. It is not by chance, in our opinion, that many technical studies recently conducted around the world provide inhomogeneous maps of positive/negative trends and large areas of stationarity characterized by weak trends that are not considered statistically significant.

As already stated, an advantage of using parametric tests and numerical
evaluation of the test statistic distribution is given by the possibility of
assuming a null hypothesis based on a preliminary assessment of the parent
distribution, including trend detection via the evaluation of nonstationary
parameters. This could lead to a regionally homogeneous and controlled
assessment of both the significance level and the power in a fair mutual
relationship. With respect to the estimation of the parameters of the parent
distribution, results suggest that at-site analysis may provide highly
biased results. More robust procedures are necessary, such as hierarchical
estimation procedures (Fiorentino et al., 1987), and procedures that provide estimates of

As a final remark, concerning real data analysis, in our numerical experiment we showed that a weak linear trend in the mean suffices to reduce power to unacceptable values in some cases. However, we explored the simplest nonstationary working hypothesis by introducing a deterministic linear dependence of the location parameter of the parent distribution on time. Obviously, when making inference from real observed data, other sources of uncertainty may affect statistical inference (trend, heteroscedasticity, persistence, nonlinearity, and so on); moreover, if considering a nonstationary process with underlying deterministic dynamics, the process becomes non-ergodic, implying that statistical inference from sampled series is not representative of the process's ensemble properties (Koutsoyiannis and Montanari, 2015).

As a consequence, when considering a nonstationary stochastic process as being produced by a combination of a deterministic function and a stationary stochastic process, other sources of information and deductive arguments should be exploited in order to identify the physical mechanism underlying such relationships. Even in this case, observed time series play a crucial role in the calibration and validation of deterministic modeling; in other words, they are important for confirming or disproving the model hypotheses.

In the field of frequency analysis of extreme hydrological events, considering the high spatial variability of the sample length, the trend coefficient, the scale, and the shape parameters, among others, physically based probability distributions could be further developed and exploited for the selection and assessment of the parent distribution in the context of nonstationarity and change detection. The physically based probability distributions we refer to are (i) those arising from stochastic compound processes introduced by Todorovic and Zelenhasic (1970), which also include the GEV (see Madsen et al., 1997) and the TCEV (Rossi et al., 1984), and (ii) the theoretically derived distributions following Eagleson (1972), whose parameters have a clear physical meaning and are usually estimated with the support of exogenous information in regional methods (e.g., Gioia et al., 2008; Iacobellis et al., 2011; see Rosbjerg et al., 2013 for a more extensive overview).

Hence, we believe that “learning from data” (Sivapalan, 2003) will remain a key task for hydrologists in future years, as they face the challenge of consistently identifying both deterministic and stochastic components of change (Montanari et al., 2013). This involves crucial and interdisciplinary research to develop suitable methodological frameworks for enhancing physical knowledge and data exploitation, in order to reduce the overall uncertainty of prediction in a changing environment.

No data sets were used in this article.

All authors contributed in equal measure to all stages of the development and production of this paper.

The authors declare that they have no conflict of interest.

The authors thank the three anonymous reviewers and the editor Giuliano Di Baldassarre, who all helped to extend and improve the paper.

The present investigation was partially carried out with support from the Puglia Region (POR Puglia FESR-FSE 2014–2020) through the “T.E.S.A.” – Tecnologie innovative per l'affinamento Economico e Sostenibile delle Acque reflue depurate rivenienti dagli impianti di depurazione di Taranto Bellavista e Gennarini – project.

This paper was edited by Giuliano Di Baldassarre and reviewed by three anonymous referees.