The Standardized Precipitation Index (SPI) is a widely accepted drought index. Its calculation algorithm normalizes the index via a distribution function. Which distribution function to use is still disputed within the literature. This study illuminates that long-standing dispute and proposes a solution that ensures the normality of the index for all common accumulation periods in observations and simulations.

We compare the normality of SPI time series derived with the gamma, Weibull, generalized gamma, and exponentiated Weibull distributions. Our normality comparison rests on two complementary evaluations. First, we compare actual with theoretical occurrence probabilities of SPI categories to evaluate the absolute performance of candidate distribution functions. Second, the Akaike information criterion evaluates candidate distribution functions relative to each other while analytically penalizing complexity. SPI time series, spanning 1983–2013, are calculated from the Global Precipitation Climatology Project's monthly precipitation dataset and from seasonal precipitation hindcasts of the Max Planck Institute Earth System Model. We evaluate these SPI time series over the global land area and for each continent individually during winter and summer. While focusing on regional performance disparities between observations and simulations that manifest in an accumulation period of 3 months, we additionally test the drawn conclusions for other common accumulation periods (1, 6, 9, and 12 months).

Our results suggest that calculating SPI with the commonly used gamma distribution leads to deficiencies in the evaluation of ensemble simulations. Replacing it with the exponentiated Weibull distribution reduces the area of those regions where the index does not have any skill for precipitation obtained from ensemble simulations by more than one order of magnitude. The exponentiated Weibull distribution also maximizes the normality of SPI obtained from observational data and a single ensemble simulation. We demonstrate that calculating SPI with the exponentiated Weibull distribution delivers better results for each continent and every investigated accumulation period, irrespective of the origin of the precipitation data. Therefore, we advocate the employment of the exponentiated Weibull distribution as the basis for SPI.

Drought intensity, onset, and duration are commonly assessed with the Standardized Precipitation Index (SPI). SPI was first introduced by

SPI quantifies the standardized deficit (or surplus) of precipitation over any period of interest – also called the accumulation period. This is achieved by fitting a probability density function (PDF) to the frequency distribution of precipitation totals of the accumulation period – which typically spans either 1, 3, 6, or 12 months. SPI is then generated by applying a

The choice of the PDF fitted to the frequency distribution of precipitation is essential because only a proper fit appropriately standardizes the index. While the standardization simplifies further analysis of SPI, the lack of a physical understanding of the distribution of precipitation leaves the fit on a questionable basis. Therefore, the choice of the PDF is to some extent arbitrary and constitutes the Achilles heel of the index.

Originally,

Several studies have investigated the adequacy of PDFs fitted onto observed precipitation while focusing on different candidate distribution functions

Nevertheless, some common conclusions can be drawn. Most investigations only analyzed two-parameter distribution functions

Two additional studies analyzed the adequacy of different candidate PDFs fitted onto simulated precipitation while focusing on drought occurrence probabilities in climate projections

Testing the performance of three-parameter distributions introduces the risk of overfitting

Most studies test candidate distribution functions with goodness-of-fit tests

The abovementioned goodness-of-fit tests evaluate each value of SPI's distribution equally. Such an evaluation focuses on the center of the distribution because the center of any distribution contains by definition more samples than the tails. In contrast, SPI usually analyzes (and thus depends on a proper depiction of) the distribution's tails. Therefore, these goodness-of-fit tests have a blurred focus. Moreover, the convention of interpreting the abovementioned goodness-of-fit tests in a binary manner (a fit is either rejected or not) aggravates this blurred focus. Because of this convention, these goodness-of-fit tests are unable to produce any relative ranking of the performance of distribution functions for a specific location (and accumulation period). This inability prevents any reasonable aggregation of limitations that surface despite the blurred focus. Thus, they are ill suited to discriminate the best-performing PDF out of a set of PDFs

In agreement with this insight, those studies that rigorously analyzed candidate distribution functions or investigated an appropriate test methodology for evaluating SPI candidate PDFs consequently advocate the use of relative assessments: mean absolute errors

SPI calculation procedures were developed for observed precipitation data. Since models do not exactly reproduce the observed precipitation distribution, these procedures need to be tested and, where necessary, adapted before being applied to modeled data. Here, we aspire to identify an SPI calculation algorithm that coherently describes modeled and observed precipitation (i.e., describes both modeled and observed precipitation distributions individually and concurrently). Although testing SPI's calculation algorithm on modeled precipitation data is usually neglected, such a test now deserves a role as prominent as the one for observations because of the increasing importance of drought predictions and their evaluation. Despite this importance, the adequacy of different candidate distribution functions has, to the best of our knowledge, never been tested in the output of a seasonal prediction system, although seasonal predictions constitute our most powerful tool to predict individual droughts. To close that gap, this study evaluates the performance of candidate distribution functions in the output of 10 ensemble members of initialized seasonal hindcast simulations.

In this study, we test the adequacy of the gamma, Weibull, generalized gamma, and exponentiated Weibull distribution in SPI's calculation algorithm. The evaluation of their performance depends on the normality of the resulting SPI time series. In this evaluation, we focus on an SPI accumulation period of 3 months (SPI

We employ a seasonal prediction system

We obtain observed precipitation from the Global Precipitation Climatology Project (GPCP), which combines observations and satellite precipitation data into a monthly precipitation dataset on a 2.5

Depending on the accumulation period (1, 3, 6, 9, or 12 months), we calculate the frequency distribution of modeled and observed precipitation totals over two different seasons (August and February – 1, JJA and DJF – 3, MAMJJA and SONDJF – 6, and so on). Because our results do not indicate major season-dependent differences in the performance of candidate PDFs for SPI

Our precipitation hindcasts are neither bias- nor drift-corrected and are also not recalibrated. Such corrections usually adjust the frequency distribution of modeled precipitation in each grid point to agree better with the observed frequency distribution. Here, we investigate the adequacy of different PDFs in describing the frequency distribution of modeled precipitation totals over each accumulation period without any correction. As a consequence, we require that SPI's calculation algorithm deals with such differing frequency distributions on its own. That requirement enables us to identify the worst possible mismatches.

We calculate SPI

Our parameter estimation method first identifies starting values for the

If neither the BFGS quasi-Newton method nor the Nelder–Mead method converges on suitable parameters for a candidate PDF, we omit the affected grid points. For the gamma, Weibull, and exponentiated Weibull distributions, non-convergence is a rare exception that occurs only in a negligible number of grid points. For the generalized gamma distribution, however, non-convergence is more common and occurs, in observations as well as in simulations, in roughly every fifth grid point of the global land area. This shortcoming of the generalized gamma distribution needs to be kept in mind when drawing conclusions about its potential adequacy in SPI's calculation algorithm.
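The two-stage optimization with a fallback can be sketched as follows. This is a minimal illustration and not the software used in this study: it assumes a maximum-likelihood fit, uses SciPy's `exponweib` as the candidate PDF, and the function name `fit_exponweib` and its starting values are hypothetical.

```python
import numpy as np
from scipy import optimize, stats


def fit_exponweib(sample):
    """Fit an exponentiated Weibull PDF by maximum likelihood (sketch).

    Tries BFGS first and Nelder-Mead second. Returns (a, c, scale),
    or None if neither optimizer reports convergence.
    """
    sample = np.asarray(sample, dtype=float)

    def neg_log_likelihood(log_params):
        # Optimizing log-parameters keeps both shapes and the scale positive.
        a, c, scale = np.exp(log_params)
        logpdf = stats.exponweib.logpdf(sample, a, c, loc=0.0, scale=scale)
        total = -np.sum(logpdf)
        return total if np.isfinite(total) else 1e300

    # Hypothetical starting values: exponential-like shape, sample mean as scale.
    start = np.log([1.0, 1.0, sample.mean()])

    for method in ("BFGS", "Nelder-Mead"):
        result = optimize.minimize(neg_log_likelihood, start, method=method)
        if result.success:
            return tuple(np.exp(result.x))
    return None  # the caller would omit this grid point


# Synthetic precipitation totals, for illustration only.
rng = np.random.default_rng(0)
sample = stats.weibull_min.rvs(1.5, scale=100.0, size=310, random_state=rng)
params = fit_exponweib(sample)
```

Returning `None` instead of raising an error mirrors the omission of non-converging grid points described above.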

Since PDFs that describe the frequency distribution of precipitation totals are required to be only defined for the positive real axis, the cumulative probability (

In very arid regions or those with a distinct dry season, SPI time series are characterized by a lower bound

To further optimize the fit of the PDF onto modeled precipitation, all hindcast ensemble members are fitted at once. We checked and confirmed the underlying assumption of this procedure: that all ensemble members show identical frequency distributions of precipitation in each grid point. Because pooling the members enlarges the sample available to the fit, it is reasonable to presume that a better fit is achievable for simulated than for observed precipitation.
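A minimal sketch of this pooled fit for a single grid point is given below. It rests on several assumptions: the gamma distribution serves only as an example candidate, the location parameter is fixed at zero, the treatment of zero-precipitation events is omitted, and the function name `spi_from_pooled_fit` and the synthetic input are hypothetical.

```python
import numpy as np
from scipy import stats


def spi_from_pooled_fit(members):
    """Fit one PDF to all ensemble members at once and return SPI per member.

    members: array of shape (n_members, n_years) with positive precipitation totals.
    """
    members = np.asarray(members, dtype=float)
    pooled = members.ravel()  # e.g. 10 members x 31 years = 310 samples

    # Gamma is used for illustration only; loc is fixed at 0 (two free parameters).
    shape, _, scale = stats.gamma.fit(pooled, floc=0.0)

    # Cumulative probability of each total, followed by the
    # equal-probability transform onto the standard normal distribution.
    cdf = stats.gamma.cdf(members, shape, loc=0.0, scale=scale)
    return stats.norm.ppf(cdf)


rng = np.random.default_rng(42)
synthetic = rng.gamma(shape=2.0, scale=50.0, size=(10, 31))  # synthetic data
spi = spi_from_pooled_fit(synthetic)
```

Every member is transformed with the same fitted parameters, which is only meaningful under the assumption stated above that all members share one frequency distribution.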

Cumulative precipitation sums are described by skewed distribution functions which are only defined for the positive real axis. We test four different distribution functions and evaluate their performance based on the normality of their resulting SPI frequency distributions. The four candidate PDFs either consist of a single shape (

Candidate distribution functions whose performance is investigated in this study: the two-parameter gamma distribution (GD2), the two-parameter Weibull distribution (WD2), the three-parameter generalized gamma distribution (GGD3), and the three-parameter exponentiated Weibull distribution (EWD3). Displayed are examples of those PDFs for

Abbreviations used for candidate distribution functions.

Instead of investigating the Pearson type III distribution, which is already widely used, we analyze the simple gamma distribution. They differ by an additional location parameter which does not change the here presented results

Gamma distribution:

Weibull distribution:

Generalized gamma distribution:

Exponentiated Weibull distribution:
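For readers who want to experiment with these four families, the sketch below builds them with SciPy and checks numerically that each three-parameter family reduces to a two-parameter family for a special value of its additional shape parameter. SciPy's parameterization (`gamma`, `weibull_min`, `gengamma`, `exponweib`) is an assumption made for illustration and may differ from the notation used in this study; all parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

x = np.linspace(1.0, 400.0, 50)  # hypothetical precipitation totals in mm
scale = 100.0                    # hypothetical scale parameter

# Two-parameter candidates: one shape parameter plus scale (location fixed at 0).
gd2 = stats.gamma(2.0, scale=scale)
wd2 = stats.weibull_min(1.5, scale=scale)

# Three-parameter candidates: two shape parameters plus scale.
ggd3 = stats.gengamma(2.0, 1.0, scale=scale)   # second shape c = 1 gives gamma
ewd3 = stats.exponweib(1.0, 1.5, scale=scale)  # first shape a = 1 gives Weibull

nested_gamma = np.allclose(ggd3.pdf(x), gd2.pdf(x))
nested_weibull = np.allclose(ewd3.pdf(x), wd2.pdf(x))
```

Both flags evaluate to true, which illustrates why a three-parameter candidate cannot, in theory, fit worse than the two-parameter candidate it contains.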

SPI time series are supposed to be standard normally distributed (

According to WMO's

Standardized Precipitation Index (SPI) classes with their corresponding SPI intervals and theoretical occurrence probabilities (according to WMO's

The three-parameter candidate distribution functions contain the two-parameter candidate distribution functions for special cases. Given those special cases, the three-parameter candidate distribution functions will in theory never be inferior to the two-parameter candidate distribution functions they contain when analyzing deviations from

Our aim is twofold. First, we want to maximize the normality of our SPI time series by choosing an appropriate distribution function. Second, we simultaneously aspire to minimize the parameter count of the distribution function to avoid unnecessary complexity. Avoiding unnecessary complexity decreases the risk of overfitting. The objective is to identify the necessary (minimal) complexity of the PDF which prevents the PDF from being too simple and losing explanatory power. Or in other words: we are interested in the so-called

In our case, AIC's first term evaluates the performance of candidate PDFs in describing the given frequency distributions of precipitation totals. The second term penalizes candidate PDFs based on their parameter count. The best-performing distribution function attains the smallest AIC value because the first term enters with a negative sign, so that a better fit lowers AIC, whereas the second term is positive, so that each additional parameter raises it.
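The interplay of the two terms can be sketched in a few lines. The sketch assumes the small-sample corrected form of AIC; this assumption is not confirmed by the surrounding text, but it reproduces the complexity penalties of 2.46 (31 samples) and 2.04 (310 samples) quoted in this paper for a three-parameter relative to a two-parameter PDF. The function names are hypothetical.

```python
def aicc(log_likelihood, n_params, n_samples):
    """AIC with small-sample correction (assumed form, see lead-in)."""
    goodness_of_fit = -2.0 * log_likelihood
    penalty = 2.0 * n_params + (
        2.0 * n_params * (n_params + 1) / (n_samples - n_params - 1)
    )
    return goodness_of_fit + penalty


def extra_penalty(n_samples, k_small=2, k_large=3):
    """Penalty a three-parameter PDF pays relative to a two-parameter PDF."""
    # With equal log-likelihoods, only the penalty terms differ.
    return aicc(0.0, k_large, n_samples) - aicc(0.0, k_small, n_samples)


penalty_obs = extra_penalty(31)    # 31 years of observations
penalty_sim = extra_penalty(310)   # 10 ensemble members x 31 years
```

A three-parameter PDF therefore has to improve the first term by more than this margin before it attains a lower value than a two-parameter PDF.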

Further, the absolute AIC value often carries little information, especially in contrast to relative differences between AIC values derived from different distribution functions. Thus, we use values of relative AIC difference (AIC-D) in our analysis. We calculate these AIC-D values for each PDF by computing the difference between its AIC value and the lowest AIC value of all four distribution functions. AIC-D values inform us about superiority in the optimal trade-off between bias and variance and are calculated as follows:

$$\mathrm{AIC\text{-}D}_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$$
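In code, the AIC-D computation for one grid point can be sketched as below. The AIC values are invented for illustration and are not results of this study; the function name `aic_differences` is hypothetical.

```python
def aic_differences(aic_values):
    """Return AIC-D per candidate: its AIC minus the lowest AIC of the set."""
    best = min(aic_values.values())
    return {name: value - best for name, value in aic_values.items()}


# Hypothetical AIC values for one grid point (not taken from the study).
aic_values = {"GD2": 352.1, "WD2": 360.4, "GGD3": 351.0, "EWD3": 349.8}
aic_d = aic_differences(aic_values)

# Candidates ordered from best (AIC-D = 0) to worst.
ranking = sorted(aic_d, key=aic_d.get)
```

Because the best candidate always receives an AIC-D of zero, the values are comparable across grid points even when the absolute AIC values differ widely.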

For our analysis, AIC-D values are well suited to compare and rank different candidate PDFs based on their trade-off between bias and variance. The best-performing distribution function is characterized by a minimum AIC value (AIC

Flow chart of methods to aggregate deviations from

The analysis of deviations from

For each candidate distribution function, accumulation period, and domain and during both seasons, we compute deviations from

Again for each candidate distribution function, accumulation period, and domain and during both seasons, we aggregate AIC-D over several grid points into a single graph separately for observations and simulations as depicted on the right-hand side of the flow chart in Fig.

We investigate the normality of SPI time series derived from each candidate PDF first for the entire global land area and analyze subsequently region-specific disparities. For this analysis we focus on the land area over six regions scattered over all six inhabited continents: Africa (0–30

Examining frequency distributions of precipitation totals over domains smaller than the entire globe reduces the risk of encountering opposite deviations from

Borders of regions examined in this study.

For a first overview, it is beneficial to cluster as many similar results as possible together to minimize the level of complexity of the regional dimension. The choice of sufficiently large or small domains is still rather subjective. Which size of regions is most appropriate? This subjective nature becomes apparent in studies that identify differing borders for regions that are supposed to exhibit rather uniform climatic conditions

In agreement with prior studies

Deviations from

In theory, since the three-parameter generalized gamma distribution (GGD3) encompasses GD2 as a special case, GGD3 should not be inferior to GD2. In reality, however, the applied optimization methods appear to be too coarse for GGD3 to always lead to an identical or better optimum than the one identified for GD2 with the given length of the time series. When optimizing three parameters, it is more likely to miss a specific constellation of parameters which would further optimize the fit, especially when limited computational resources impede the identification of the actual optimal fitting parameters. Additionally, a limited database (our database spans 31 years) obscures the frequency distribution of precipitation totals which poses another obstacle to the fitting methods. This results in missed optimization opportunities that impact GGD3 more strongly than GD2 because of GGD3's increased complexity, which leads to GGD3 requiring more data than GD2. Therefore, the weighted sum (weighted by the theoretical occurrence probability of the respective SPI class; Table

In agreement with

In simulations, the fit onto 3-month precipitation totals is performed on all 10 ensemble members at once. This increases 10-fold the sample size in simulations relative to observations. Presuming an imperfect fit for the 31 samples in observations, deviations from

We attempt to disentangle both effects (analyzing modeled, instead of observed, precipitation distributions and increasing the sample size) for our two-parameter candidate PDFs next. If the two-parameter PDFs are suited to be applied to modeled precipitation data, they should benefit at least to some extent from this multiplication of sample size. Despite expecting irregularities in the magnitude of these reductions, they ought to be notable for candidate distribution functions that are adequately suited to describe modeled 3-month precipitation totals – assuming an imperfect fit for the 31 events spanning our observational time series. Therefore, we weigh each class' deviation from

For the two-parameter PDFs, the weighted deviations from

In this section, we have analyzed global deviations from

GD2, GGD3, and EWD3 describe the overall frequency distribution of observed 3-month precipitation totals similarly well.

WD2 performs overall poorly and is in every regard inferior to any other candidate distribution function.

GGD3 and EWD3 describe the frequency distribution of modeled 3-month precipitation totals distinctly better than any two-parameter candidate distribution.

GD2 describes the frequency distribution of modeled 3-month precipitation totals sufficiently well on the global average.

Both two-parameter candidate distribution functions are unable to benefit from the increased length of the database in simulations relative to observations, while both three-parameter PDFs strongly benefit from that increase.
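The findings above rest on comparing actual with theoretical occurrence probabilities of SPI classes. A minimal sketch of such a comparison follows. It assumes the common class boundaries at SPI values of ±1, ±1.5, and ±2, and it assumes that deviations are taken as differences of occurrence probabilities; the weighting by theoretical occurrence probability follows the description in the text, while the function names are hypothetical.

```python
import numpy as np
from scipy import stats

# Assumed inner class boundaries; the outer classes are open-ended.
INNER_EDGES = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])

# Theoretical occurrence probability of each of the seven classes
# under the standard normal distribution.
_cdf = np.concatenate(([0.0], stats.norm.cdf(INNER_EDGES), [1.0]))
THEORETICAL = np.diff(_cdf)


def class_deviations(spi):
    """Actual minus theoretical occurrence probability for each SPI class."""
    spi = np.asarray(spi, dtype=float).ravel()
    class_index = np.digitize(spi, INNER_EDGES)  # values 0..6
    counts = np.bincount(class_index, minlength=THEORETICAL.size)
    return counts / spi.size - THEORETICAL


def weighted_deviation(spi):
    """Sum of absolute class deviations, weighted by theoretical probability."""
    return float(np.sum(THEORETICAL * np.abs(class_deviations(spi))))
```

For a perfectly standard normally distributed SPI time series, all class deviations vanish in the limit of a large sample, so any systematic departure points to an inadequate candidate PDF.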

AIC-D frequencies: percentages of global land grid points in which each distribution function yields AIC-D values that are smaller than or equal to a given AIC-D

In general, each candidate distribution function performs similarly well in winter and summer in their depiction of the frequency distribution of observed 3-month precipitation totals (compare Fig.

Percent of grid points that are classified into specific AIC-D categories (according to

In ensemble simulations, our results are again rather stable for all investigated distribution functions between summer and winter (compare Fig.

Analyzing AIC-D frequencies for both seasons (DJF and JJA) discloses no distinct season-dependent differences, similar to before in the investigation of deviations from

It seems worth elaborating on the insufficient (only average) confidence in EWD3 to perform ideally in observations (ensemble simulations) around the globe. The complexity penalty of AIC correctly punishes EWD3 more strongly than GD2 because AIC evaluates whether EWD3's increased complexity (relative to GD2) is necessary. However, the results justify the necessity for this increased complexity – GD2 performs erroneously in 26 % (6 %), insufficiently in 18 % (2 %), and without any skill in 12 % (1 %) of the global land area in ensemble simulations (observations). The risk of underfitting by using two-parameter PDFs seems larger than the risk of overfitting by using three-parameter PDFs. Once the need for three-parameter candidate PDFs is established, their remaining punishment relative to two-parameter PDFs biases the analysis, particularly for the ideal AIC-D category. EWD3's increased complexity penalty relative to two-parameter candidate PDFs depends on the sample size and amounts to 2.46 in observations and 2.04 in ensemble simulations (see black vertical lines in Fig.

Mean deviations from

The AIC-D frequencies of Table

Among our candidate PDFs, EWD3 is clearly the best-suited PDF for SPI. Yet, we still need to confirm whether EWD3's absolute performance is also adequate. While the global analysis indicated EWD3's adequacy, the ultimate validation of this claim rests on the regional analysis.

We investigated thus far deviations from

In observations (Fig.

In simulations (Fig.

These insights about candidate PDF performance in observations and simulations are even more obvious at first glance when displayed in an image plot (Fig.

Mean deviations from

For observations, the regional analysis confirms the insights from the global analysis in each region: EWD3 is, like GD2 and GGD3, an adequate PDF in SPI's calculation algorithm. For ensemble simulations, the regional analysis additionally corroborates the finding of the AIC-D analysis that EWD3 performs noticeably better than GD2, which strengthens the case for EWD3.

The analysis of AIC-D frequencies shows that EWD3 is SPI's best distribution function among our candidate PDFs. Additionally, the regional investigation confirms the global analysis: the absolute performance of EWD3 is at least adequate in observations and ensemble simulations.

In the following, we investigate deviations from

As in Fig.

The AIC

Percent of grid points that are classified into specific AIC-D categories (according to

In contrast to previous results in this and other studies

Unsurprisingly, the same deficit as identified before for both two-parameter candidate PDFs also emerges in the baseline's performance: the sum weighted by each class's likelihood of occurrence over the absolute values of deviations from

Moreover, identifying the maximum deviation from

So far, we used all ensemble members at once to fit our candidate PDFs onto simulated precipitation, which improves the quality of the fit. In this section, we first analyze a single ensemble member and subsequently investigate the sensitivity of our candidate PDFs' performance to the ensemble size. In doing so, we properly disentangle the difference between observations and simulations from the impact of the sample size.

As before, three-parameter candidate distribution functions also perform for a single ensemble simulation better than two-parameter PDFs (Table

In the next step, we isolate and investigate the improvement of the fit by an increasing sample or ensemble size. As a consequence of limited observed global precipitation data, we neglect observations and their differences to simulations in this remaining section. During this investigation, we reanalyze Table

Percent of grid points that are classified into specific AIC-D categories (according to

Percent of grid points that are classified into specific AIC-D categories (according to

Despite requiring more data, our three-parameter candidate PDFs already perform better for 31 samples, both in observations and in simulations. Further, since our three-parameter candidate PDFs require more data to estimate optimal parameters, they benefit in simulations more strongly from additional samples than our two-parameter candidate PDFs. That benefit becomes apparent in a distinctly improved relative performance after multiplying the sample size through the use of additional ensemble members.

A similar pattern as identified for SPI

In agreement with prior studies

Most interesting, EWD3 performs well almost everywhere around the entire globe for each accumulation period and in both realizations. EWD3 shows the highest percentages of all candidate PDFs for each analysis (each row of Table

Despite the inclusion of the complexity penalty, EWD3 still performs best in 32 out of all 40 analyses (all rows of Tables

Previous studies have emphasized the importance of using a single PDF to calculate SPI for each accumulation period and location

The outlined problem is additionally aggravated by the fact that it cannot be circumvented. Our results demonstrate that any inept description of precipitation by SPI's candidate distribution function manifests most severely in the tails of SPI's distribution. Since SPI is usually employed to analyze the left-hand tail of its distribution (droughts), biased descriptions of this tail are highly undesirable. To establish the robustness of this valuable tool and to fully capitalize on its advantages, SPI's problem of requiring a single, universally applicable candidate PDF needs to be solved. In this study, we show that the three-parameter exponentiated Weibull distribution (EWD3) is a very promising solution to this problem virtually everywhere around the globe in both realizations (observations and simulations) for all common accumulation periods (1, 3, 6, 9, and 12 months).

Other studies have dismissed the possibility of such a solution to this problem and proposed instead a multi-PDF approach

Yet, in agreement with those other studies

We also repeated our AIC-D analysis with the Bayesian information criterion

Overall, our three-parameter candidate PDFs perform better than the investigated two-parameter candidate PDFs. Although they require more data, a sample size of 31 years suffices for our three-parameter candidate PDFs to outperform our two-parameter candidate PDFs in simulations and observations. Further, our three-parameter candidate PDFs greatly benefit from an increase in the sample size in simulations, where such a sample size sensitivity analysis is feasible by using different ensemble sizes. It is likely that three-parameter PDFs would benefit similarly from an increased sample size in observations, but this ultimately remains speculative because trustworthy global observations of precipitation are temporally too constrained for such a sensitivity analysis.

In contrast to

Other consequences of this finding are apparent major season-dependent differences in the performance of the investigated baseline. This finding contradicts the results of

To aggregate our AIC-D analysis over the globe and visualize this aggregation in tables, we need to evaluate the aggregated performance of candidate PDFs for certain AIC-D categories

There is scope to further test the robustness of our derived conclusions in different models with different time horizons and foci on accumulation periods other than 3 months (e.g., 12 months). Of additional interest would be insights about the distribution of precipitation. Such insights would enable SPI's calculation algorithm to physically base its key decision. A recent study suggests that a four-parameter extended generalized Pareto distribution excels in describing the frequency distribution of precipitation

The results presented here further imply that the evaluated predictive skill of drought predictions assessed with SPI should be treated with caution because it is likely biased by SPI's current calculation algorithms. This common bias in SPI's calculation algorithms obscures the evaluation of predictive skill of ensemble simulations by inducing a blurred representation of their precipitation distributions. That blurred representation emerges in the simulated drought index, which impedes the evaluation process. Drought predictions often try to correctly predict the drought intensity. The evaluation process usually considers this to be successfully achieved if the same SPI category as the observed one is predicted. This evaluation is quite sensitive to the thresholds used when classifying SPI categories. The bias identified here blurs these categories in ensemble simulations more strongly than in observations against which the model's predictability is customarily evaluated. As a consequence of these sensitive thresholds, such a one-sided bias potentially undermines current evaluation processes.

Current SPI calculation algorithms are tailored to describe observed precipitation distributions. Consequently, they are poorly suited to describe precipitation distributions obtained from ensemble simulations. In observations, erroneous performances are also apparent and well known but less conspicuous than in ensemble simulations. We propose a solution that rectifies these issues and improves the description of modeled and observed precipitation distributions individually as well as concurrently. The performance of two-parameter candidate distribution functions is inadequate for this task. By increasing the parameter count of the candidate distribution function (and thereby also its complexity), a distinctly better description of precipitation distributions can be achieved. In simulations and observations, the best-performing candidate distribution function identified here, the exponentiated Weibull distribution (EWD3), performs proficiently for every common accumulation period (1, 3, 6, 9, and 12 months) virtually everywhere around the globe. Additionally, EWD3 excels when analyzing ensemble simulations: its increased complexity (relative to GD2) leads to an outstanding performance when an available ensemble multiplies the sample size.

We investigate different candidate distribution functions (gamma – GD2, Weibull – WD2, generalized gamma – GGD3, and exponentiated Weibull distribution – EWD3) in SPI's calculation algorithm and evaluate their adequacy in meeting SPI's normality requirement. We conduct this investigation for observations and simulations during summer (JJA) and winter (DJF). Our analysis evaluates globally and over each continent individually the resulting SPI

Irrespective of the accumulation period or the dataset, GD2 seems sufficiently suited to be employed in SPI's calculation algorithm in many grid points of the globe. Yet, GD2 also performs erroneously in a non-negligible fraction of grid points. These erroneous performances are apparent in observations and simulations for each accumulation period. Worse still, GD2's performance deteriorates further in ensemble simulations, where GD2 also performs insufficiently or even without any skill in a non-negligible fraction of grid points. In contrast, EWD3 performs without any defects for all accumulation periods, irrespective of the dataset. Although EWD3 requires more data than two-parameter PDFs, we identify its proficient performance for a sample size of 31 years in observations as well as in simulations. Further, ensemble simulations allow us to artificially increase the sample size for the fitting procedure by including additional ensemble members. Exploiting this possibility has a major impact on the performance of candidate PDFs: the margin by which EWD3 outperforms GD2 increases further with additional ensemble members. Furthermore, EWD3 demonstrates proficiency for every analyzed accumulation period around the globe. In simulations, the accumulation period of 12 months is the only exception; here, EWD3 and GD2 perform similarly well around the globe. Still, we find that three-parameter PDFs are generally better suited to SPI's calculation algorithm than two-parameter PDFs.

Given all the dimensions (locations, realizations, and accumulation periods) of the task, our results suggest that the risk of underfitting by using two-parameter PDFs is larger than the risk of overfitting by employing three-parameter PDFs. We strongly advocate adapting SPI's calculation algorithm by replacing the two-parameter distribution functions used therein with three-parameter PDFs. Such an adaptation is particularly important for the proper evaluation and interpretation of drought predictions derived from ensemble simulations. For this adaptation, we propose the employment of EWD3 as a new standard PDF for SPI's calculation algorithm, irrespective of the origin of the input data or the length of the scrutinized accumulation periods. Despite the issues discussed here, SPI remains a valuable tool for analyzing droughts. This study might contribute to the value of this tool by illuminating and resolving the discussed long-standing issue concerning the proper calculation of the index.

The model simulations are available at the World Data Center for Climate (WDCC) at

PP, AD, and JB designed the study. PP led the analysis and prepared the paper with support from all coauthors. All coauthors contributed to the discussion of the results.

The authors declare that they have no conflict of interest.

The model simulations were performed at the German Climate Computing Centre. The authors also thank Frank Sienz for providing the software to compute AIC and SPI with different candidate distribution functions. The authors would also like to thank Gabriel Blain and another anonymous referee for their effort in reviewing this paper and editor Marie-Claire ten Veldhuis for her engagement in overseeing and actively participating in the review process.

This work was funded by the BMBF-funded joint research projects RACE (Regional Atlantic Circulation and Global Change) and RACE-Synthesis. Patrick Pieper is supported by the Stiftung der deutschen Wirtschaft (SDW, German Economy Foundation). André Düsterhus and Johanna Baehr are supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2037 “Climate, Climatic Change, and Society” (project no. 390683824) through contributions to the Center for Earth System Research and Sustainability (CEN) of Universität Hamburg. André Düsterhus is also supported by A4 (Aigéin, Aeráid, agus athrú Atlantaigh), funded by the Marine Institute and the European Regional Development Fund (grant no. PBA/CC/18/01).

This paper was edited by Marie-Claire ten Veldhuis and reviewed by Gabriel Blain and one anonymous referee.