Power of parametric and non-parametric tests for trend detection in annual maximum series

The need of fitting time series characterized by the presence of trend or change points has generated in latest years an increased interest in the investigation of non-stationary probability distributions. Considering that the available hydrological time series can be recognized as the observable part of a stochastic process with a definite probability distribution, two main topics can be tackled in this context: the first one is related to the definition of an objective criterion for choosing whether the 10 stationary hypothesis can be adopted, while the second one regards the effects of non-stationarity on the estimation of distribution parameters and quantiles for assigned return period and flood risk evaluation. Although the time series trend or change points can be recognized using classical tests available in literature (e.g. Mann-Kendal or CUSUM test), for design purpose it is still required the correct selection of the stationary or non-stationary probability distribution. By this light, the focus is shifted toward model selection criteria which implies the use of parametric methods with all related issues on 15 parameters estimation. The aim of this study is to compare the performance of parametric and non-parametric methods for trend detection analysing their power and focusing on the use of traditional model selection tools (e.g. Akaike Information Criterion and Likelihood Ratio test) within this context. Power and efficiency of parameter estimation, including the trend coefficient, were investigated through Monte Carlo simulations using Generalized Extreme Value distribution as parent with selected parameter sets. 20

the selection of the parent distribution and the estimation of its parameters but are not necessarily associated with a specific null hypothesis. Nevertheless, in both cases the evaluation of the rejection threshold is usually based on a statistic measure of trend that, under the null hypothesis of stationarity, follows a specific distribution (e.g. gaussianity of the Kendall statistic for the MK non-parametric test; 2 distribution of deviance statistic for the LR parametric test).
Considering pros and cons of different approaches, we believe that specific remarks should be made about the use of parametric 70 or non-parametric methods for the analysis of extreme event series. For this purpose, we set up a numerical experiment to compare performances of: 1 the MK as a non-parametric test for trend detection, 2 the LR parametric test for model selection, 3 the parametric test as defined in section 2.4. In particular, the is a measure for model selection, based on the AIC, whose distribution was numerically evaluated, under the null hypothesis of a stationary process, for comparison purposes with other tests.
We aim to provide (i) a comparison of test power between MK, LR and , (ii) a sensitivity analysis of test power to parameters of a known parent distribution used to generate sample data, (iii) an analysis of the influence of sample size on test power and significance level.
We conducted the analysis using Monte Carlo techniques, by generating samples from parent populations assuming one of the most popular extreme event distributions, the Generalized Extreme Value (Jenkinson, 1955), with linear (and without any) 80 trend in the position parameter. From generated samples we numerically evaluated the power and significance level of tests for trend detection, using MK, LR and . For the latter we also checked the option of using the modified version , suggested by Sugiura (1978) for smaller samples.
Considering that parametric methods involve the estimation of the parent distribution parameters, we also analysed the efficiency of the Maximum Likelihood (ML) estimator used in trend assessment by comparing the sample variability of the

The Mann-Kendall test
Hydrological time series are often composed by non-normally independent realizations of phenomena, and this characteristic makes the use of non-parametric trend tests very attractive (Kundzewicz and Robson, 2004). Mann-Kendall test is a widely 95 used rank-based tool for detecting monotonic, and not necessarily linear, trends. Given a random variable z, and assigned a sample of L independent data = ( 1 , … , ), the Kendall S statistic (Kendall, 1975) can be defined as: with sgn sign function.
The null hypothesis of this test is the absence of any statistically significant trend in the sample, while it is contemplated by 100 the alternative hypothesis.  reported that serial dependence can lead to a more frequent rejecting of null hypothesis. For ≥ 8, Mann (1945) reported how Eq. (1) is an approximatively normally distributed variable with zero mean and variance that, in the presence of m-length ties, can be expressed as: In practice, Mann-Kendall test is performed using the Z statistic which follows a standard normal distribution. With this approach, it is simple to evaluate p-value and compare it with an assigned level of significance or, equivalently, to evaluate the threshold value to be compared with Z, where is the (1 − ) quantile of a standard normal distribution. Yue et al. (2002b) observed that autocorrelation in time series can influence the ability of MK test in detecting trends. For 110 avoiding this problem, a correct approach in trend analysis should contemplate a preliminary check for autocorrelation and, if necessary, the application of pre-whitening procedures.
A non-parametric tool for a reliable estimation of a trend in a time series with N pairs of data is the Sen's slope estimator (Sen, 1968): 115 being > . Sorting in ascending order the 's, Sen's slope estimator can be defined as their median .

Likelihood Ratio Test
The Likelihood Ratio statistical test allows to compare two candidate models. Like its name suggests, it is based on the evaluation of the likelihood function of different models.
The LR test has been used different times (Tramblay et al., 2013;Cheng et al., 2014; for selecting between 120 stationary and non-stationary models in the context of nested models. Given a stationary model characterized by a parameter set and a non-stationary model, with parameter set , if ℓ(̂) and ℓ(̂) are their respective maximized loglikelihoods, the Likelihood Ratio test can be defined through the deviance statistic D is approximately, for large L, 2 distributed, with = ( ) − ( ) degrees of freedom. The null hypothesis of Coles, 2001).
Besides the analysis of power, we also checked (in subsection 3.3) the approximation ~2 as a function of the sample size L for the evaluation of the level of significance.

Akaike Information Criterion Ratio test
Information criteria are useful tools for model selection. It is reasonable to retain that Akaike Information Criterion (AIC; 130 Akaike, 1974) is the most famous among them. Based on the Kullbach-Leibler discrepance measure, if is the parameter set of a k-dimensional model ( = ( )), AIC is defined as: The model that best fits data has the lowest value of AIC between candidates. It is useful to observe that the term proportional to the number of model parameters allows to account for the increased estimator variance due to a larger parametrization and 135 embodies the principle of parsimony. Sugiura (1978) observed that AIC can lead to misleading results for small samples; he proposed a new measure for AIC: where a second-order bias correction is introduced. Burnham e Anderson (2004) suggested to use this version only when ⁄ < 40, being the maximum number of parameters between the compared models. However, for larger L, 140 converges to AIC. For a quantitative comparison between AIC and in the extreme value stationary model selection framework see also Laio et al. (2009).
In order to select between stationary and nonstationary candidate models, we use the ratio = .
(6) where the subscripts indicate the AIC value obtained for a stationary (st) and a non-stationary (ns) model, both fitted with maximum likelihood to the same data series.
Considering that the better fitting model has lower AIC, if the time series is non-stationary, the should be less than 1.
Vice versa if the time series is stationary.
In order to provide a rigorous comparison between the use of MK, LR and , we evaluated the , threshold value corresponding to significance level by numerical experiments.

150
More in detail, we adopted the following procedure: 1. N = 10000 samples are generated from a stationary GEV parent distribution, with known parameters; 2. for each of these samples the is evaluated, by fitting the stationary and non-stationary GEV models described in section 2.4, thus providing its empirical distribution (see pdf in fig. 1); 3. exploiting the empirical distribution of the threshold associated with a significance level of = 0.05 is 155 numerically evaluated: this value, , , represents the threshold for rejecting the null hypothesis of stationarity (which in these generations is true) in 5% of the synthetic samples.
This procedure was applied both for AIC and . The experiment was repeated for a few selected sets of the GEV parameters, including different trend values, and different sample lengths, as detailed in section 3.

The GEV parent distribution
The cumulative Generalized Extreme Value (GEV) distribution (Jenkinson, 1955) can be expressed as: where , , are respectively known as the position, scale and shape parameter, In this study, only a linear trend with time t in the position parameter has been introduced, leading Eq. (7) to be expressed and It is important to note that Eq. (8) is a more general way to define the GEV and has the property of degenerating into Eq. (7) for 1 = 0: in other words Eq. (7) represents a nested model of Eq. (8) which confirms the suitability of the Likelihood Ratio test for model selection.
The estimation of GEV parameters is often performed by means of the L-moments (Hosking, 1990), linear combinations of PWM (Hosking et al., 1985). Given a time series of values z, sorting it in ascending order, sample PWM can be expressed 180 using the following relationships: Between PWM and L-moments the following relationships hold: We observe that, imposing trend in the mean has reflections only in 1 , and does not affect 2 . The analytical proof based on sample relationships is provided in Appendix.
In this work we used the maximum likelihood method (ML) to estimate GEV parameters from sample data. ML allows to treat

Numerical evaluation of test power and significance level
The power of a test is related to the second type error and is the probability of correctly rejecting the null hypothesis when it is false. In particular, defining α (level of significance) the probability of a Type I error and β the probability of a Type II error,

195
we have power = 1 -β. The maximum value of power is 1, which correspond to β = 0, i.e. no probability of Type II error. A fair comparison between the null and the alternative hypotheses would see α = β = 0.05, which provides power = 0.95. In most applications conventional values are α = 0.05 and β = 0.2, meaning that a 1-to-4 trade-off between  and  is accepted. Thus, assuming a significance level 0.05, a power level less than 0.8 should be considered too low. For each of the tests described in subsections 2.1, 2.2 and 2.3, the power was numerically evaluated according to the following procedure: 200 1) N = 2000 Monte Carlo synthetic series are generated using the non-stationary GEV in Eqs. (8-9) as parent distribution with fixed parameter set = ( 0 , 1 , , ) and length L, being 1 ≠ 0.
2) The threshold , associated with a significance level = 0.05 is numerically evaluated, as described in section 2.3 using the corresponding parameter set = ( 0 , , ) of GEV parent distribution.
3) From these synthetic series the power of the test is estimated as:

205
= where is the number of series for whom the null hypothesis is rejected, as in Yue et al. (2002a).
The same procedure, with N = 10000, was used in order to check the actual significance level of test, which is the probability of first type error i.e. the probability of rejecting the null hypothesis when it is true. The task was performed by following the 210 above steps from 1 to 3 while replacing with at step 1), in such a case the rejection rate ⁄ represents the actual level of significance .

Sensitivity analysis, results and discussion
A comparative evaluation of the tests' performance was carried out for all the GEV parameter sets and in particular the validity of the 2 approximation for the D statistic is discussed. In the fourth subsection the numerical investigation on the sample variability of parameters is reported.

Evaluation of , with AIC or
Considering the non-stationary GEV four-parameters model, in order to satisfy the relation / < 40 suggested by Burnham e Anderson (2004), a time series with a record length not less than 160 should be available. Following this simple reasoning the AIC should be considered de facto not-applicable to any annual maximum series showing a changing point in the '70-80s (e.g. Kiely, 1999). In our numerical experiment, the second-order bias correction of Sugiura (1978) should be always used because for the maximum sample length, L = 70, we have / = 70/4 = 17.5 for the non-stationary GEV.

230
Nevertheless, we checked if using AIC or is important in such a use of the ratio . To this purpose we evaluated from synthetic series the percentage differences between the power of , evaluated by means of AIC and . In Fig. 2 the empirical probability density functions of such percentage differences, grouped according to sample length, are plotted for generations with ε = 0.4 and different values of σ. It is interesting to note that only for L = 70 the error distribution shows a regular and unbiased bell-shaped distribution. Then we observe for L = 50 a small negative bias (about -0.02%), while for L = 235 30 a bias of -0.08 with a multi-peak and negatively skewed pdf. The latter pdf also has a higher variance than the others.
Similar results were obtained for all values of ε, providing a general amount of differences always very low and allowing to conclude that the use of AIC or does not significantly affect the power of for the cases examined. This follows the combined effect of the sample size (whose minimum value considered here is 30) and the limited difference in the number of parameters in the selected models. In the following we will refer and show only the plots obtained for the in Eq. (6) with 240 AIC evaluated as in Eq. (4).

Dependence of power on parent distribution parameters and sample size
The effect of parent distribution parameters and sample size on the numerical evaluation of power and significance level of MK, LR and for different values of , and 1 is shown in Fig. 3. The curves represent both significance level which is shown for 1 = 0 (true parent is the stationary GEV) and power for all other values 1 ≠ 0 (true parent is the non-stationary Higher difference is found for heavy tailed parent distribution ( = +0.4). While LR keeps having the largest power, the difference with respect to remains small while the MK's power almost collapses to values always smaller than 0.5.
Practical consequences of such patterns are very important and are discussed in the conclusion section.   Table 1 show that the rejection rate of the (true) null hypothesis is systematically higher than it should be, and it is also dependent on parent parameter values. Such effect is exalted when the parent distribution 270 is upper bounded ( = −0.4) and for higher values of the scale parameter. In practice this implies that when using the LR test, as described in subsection 2.2, one actually has a probability of rejecting the true null hypothesis of stationarity quite higher than he knows.
On the other hand, the performances of MK with respect to the designed level of significance are less biased and independent from the parameter set. Similar good performances are trivially obtained for the , whose rejection threshold is numerically 275 evaluated.
The plot in Fig. 4 is displayed in order to focus on the actual value of the level of significance and in particular on the LR approximation ~2 as a function of the sample length n. The difference between theoretical and numerical values of the significance level is represented by the distance between the bottom value of the curve (obtained for 1 = 0, i.e. the stationary GEV model) and the chosen level of significance 0.05 which is represented by the horizontal dotted line. In particular in Fig.   280 overestimated because of the approximation ~2 . These results suggest, for the LR test, the use of a numerical procedure (as the one introduced for in subsection 2.3) for evaluating the D distribution and the rejection threshold.
Other considerations can be made on the use of . As explained in subsection 2.3 we empirically evaluated by numerical generations the , threshold value with significance level 0.05 for each of the parameter sets and sample sizes considered. Similar results were obtained using the which are not shown for brevity. We found a significative dependence of , on the sample size. Fig. 5 shows curves of , obtained for each of the parameter sets vs sample size. It is also worth noting 290 that all curves asymptotically trend to 1 as L increase. This property is due to the structure of AIC and peculiarity of the nested models used in this paper: while using a sample generated with weak non-stationarity (i.e. when 1 → 0 in Eq. (9)) the maximum likelihood of model (7), ℓ(̂), tends to ℓ(̂) of model (8) leaving only the bias correction in AIC to be discriminant for model selection. As a consequence, , will be always lower than 1, but, increasing sample size, also both the likelihood terms −2ℓ(̂) and −2ℓ (̂) in Eq. (4) will increase, pushing toward the limit 1. On the other hand, 295 Fig. 5 shows that the threshold value , is significatively smaller than 1 up to L values well beyond the length usually available in this kind of analysis. Hence the numerical evaluation of the threshold has to be considered a required task in order to provide an assigned significance level to model selection. On the other hand, the simple adoption of the selection criteria < 1 (i.e. , = 1), would correspond to an unknown significance level dependent on the parent distribution and sample size. In order to highlight this point, we evaluated the significance level corresponding to , = 1 following the 300 procedure described in subsection 2.5 by generating N = 10000 synthetic series for any parameter set and sample length.
Results, provided in Tab. 2, show that, in the explored GEV parameter domain, ranges between 0.16 and 0.26 mainly depending on the sample length and the shape parameter of the parent distribution.

Sample variability of parent distribution parameters
Results shown above, with regard to performances of parametric and non-parametric tests, are in our opinion quite surprising 305 and important. It is proven that the preference widely accorded to non-parametric tests being their statistics allegedly independent from the parent distribution is not well founded. On the other hand, the use of parametric procedures raises the problem of correctly estimating the parent distribution and, for the purpose of this paper, its parameters. Moreover, as being the trend coefficient 1 a parameter of the parent distribution in non-stationary condition, the proposed parametric approach provides a maximum likelihood-based estimation of the same trend coefficient which is hereafter called ML-1 . For a 310 comparison with non-parametric approaches we also evaluated the sample variability of the Sen's slope measure () of the imposed linear trend. In order to provide insights into these issues, from the same sets of generations exploited above, we also analysed the sample variability of the maximum likelihood estimates ML-ML-, for different parameter sets and sample length.
We  In order to better analyse such patterns, for the scale and shape parent parameters we report also the distribution of their empirical ML estimates for different parameter sets vs the true 1 value used in generation. The sample distribution of MLis shown in Fig. 10 for = 30 and Fig. 11 for = 70. The sample distribution of ML-is shown in Fig. 12 for = 30 and 330 Fig. 13 for = 70. Subplots show that the presence of a strong trend coefficient may produce significant loss in the estimator efficiency probably due to deviation from normal distribution of the sample estimates also for long samples. This suggests the need of more robust estimation procedures which provides higher efficiency for estimates of and in case of strong observed trend.

335
The results shown have important practical implications. The dependence of power on the parent distribution parameters may significantly affect results of both parametric and non-parametric tests including the widely used Mann-Kendall. For all the generation sets and tests conducted, under the null hypothesis of stationarity, the power has values ranging between the chosen significance level (0.05) and 1 for large (and larger) ranges of the trend coefficient. The test power always collapses to very low values for weak (but climatically important) trend values (in the case of annual maximum daily rainfall, 1 equal 340 to 0.2 or 0.3 mm per year, for example). In presence of trend, the power is also affected by the scale and shape parameters of the GEV parent distribution. This observation can be made with reference to samples of all the lengths considered in this paper (from 30 to 70 years of observation) but the use of smaller samples significantly reduces the test power and dramatically extends the range of 1 values for which the power is below the conventional value 0.8. The use of this sample size is not rare considering that significative trends due to anthropic effects are typically investigated in periods following a changing point 345 often observed in the '80s.
These results also imply that in spatial fields where the alternative hypothesis of non-stationarity is true, but the parent's parameters (including the trend coefficient) and the sample length are variable in space, the rate of rejection of the false null- hypothesis may be highly variable from site to site and practically out of control. In other words, in such a case, the probability of recognizing the alternative hypothesis of non-stationarity as true from a single observed sample may unknowingly change 350 (between 0.05 and 1) from place to place. For small samples (as = 30 in our analysis) and heavy tailed distributions, the power is always very low for all the investigated range of the trend coefficient.
Hence, considering the high spatial variability of the parent distribution parameters and the relatively short period of reliable and continuous historical observations usually available, a regional assessment of trend non-stationarity may suffer from the different probability of rejection of the null hypothesis of stationarity (when it is false).

355
These problems affect, in slightly different measures, both parametric and non-parametric tests. While these considerations are generally applicable to all the tests considered, differences also emerge between them. For heavy tailed parent distributions and smaller samples, the MK test power decreases more rapidly than for the other tests considered. Low values of power are already observable for = 50. The LR test slightly outperforms the for small sample size and higher absolute values of the shape parameter. Nevertheless, the higher value of the LR power seems to be overestimated as a consequence of the 2 360 approximation for the D statistic distribution (see section 3.3).
Results also suggest that theoretical distribution of the LR test-statistic based on the null hypothesis of stationarity may lead to a significative increase of the rejection rate compared to the chosen level of significance i.e. an abnormal rate of rejection of the null hypothesis when it is true. In this case the use of numerical techniques, based on the use of synthetic generations performed by exploiting a known parent distribution, should be preferred.

365
By the light of these results we conclude that in trend detection on annual maximum series the assessment of the parent distribution and the choice of the null hypothesis should be considered fundamental preliminary tasks. According to this remark, it is advisable to make use of parametric tests by numerically evaluating both the rejection threshold for the assigned significance level and the power corresponding to alternative hypotheses. This also requires developing robust techniques for individuation of the parent distribution and estimation of its parameters. To this perspective, the use of a parametric measure 370 such as the , may take account of different choices for the parent distribution and, even more important, allows to set a null hypothesis different from the stationary case, based on a priori information.
The need of robust procedures for assessing the parent distribution and its parameters is also proven by the numerical simulations we conducted. Sample variability of parameters (including the trend coefficient) may increase rapidly for series with L as low as 30 years of annual maxima. Moreover, we observed that, in case of highest trend, numerical instability and 375 non-convergence of algorithms may affect the estimation procedure for upper bounded and heavy tailed distributions.
Nevertheless, the sample variability of the ML trend estimator was found always smaller than the Sen's slope sample variability. Finally, it is worth noting that also the non-parametric Sen's slope method, applied to synthetic series, showed dependence on the parent distribution parameters with sample variability higher for heavy tailed distributions.

385
As already said, an advantage of using parametric tests and numerical evaluation of the test-statistic distribution is given by the possibility of assuming a null hypothesis based on a preliminary assessment of the parent distribution including trend detection by evaluation of non-stationary parameters. This could lead to a regionally homogeneous and controlled assessment of both significance level and power in a fair mutual relationship. With respect to the estimation of parameters of the parent distribution, results suggest that at site analysis may provide highly biased results. More robust procedures are necessary like 390 hierarchic estimation procedures (Fiorentino et al., 1987) providing estimates of and from detrended series (Strupczewski et al., 2016;Kochanek et al., 2013). Considering the high spatial variability of sample length, trend coefficient, scale and shape parameters we believe that the application of well-known and developed regional methods for selection and assessment of the parent distribution could be easily and profitably exploited in the context of non-stationarity and climate change detection in annual maximum series and will be tackled in future research.

Appendix
Let us consider two different but consequent years, t and t + 1, setting ( ) = 0 + , for 1 there is: Subtracting side by side: Which proves that a trend in the mean value produces a related trend in 1 .
By using the same approach for 2 , we observe that: Subtracting side by side: Which proves that a trend in the mean value does not affect 2 .