A framework for deriving drought indicators from GRACE

Identifying and quantifying drought in retrospect is necessary for better understanding drought conditions and the propagation of drought through the hydrological cycle, and eventually for developing forecast systems. Hydrological droughts refer to water deficits in surface and subsurface storage, and since these are difficult to monitor at larger scales, several studies have suggested exploiting total water storage data from the GRACE (Gravity Recovery and Climate Experiment) satellite gravity mission to analyze them. This has led to the development of GRACE-based drought indicators. However, it is unclear how the ubiquitous presence of climate-related or anthropogenic water storage trends found within GRACE analyses masks drought signals. Thus, this study aims to better understand how drought signals propagate through GRACE drought indicators in the presence of linear trends, constant accelerations, and GRACE-specific spatial noise. Synthetic data are constructed and existing indicators are modified to possibly improve drought detection. Our results indicate that while the choice of the indicator should be application-dependent, large differences in robustness can be observed. We found a modified, temporally accumulated version of the Zhao et al. (2017) indicator particularly robust under realistic simulations. We show that linear trends and constant accelerations seen in GRACE data tend to mask drought signals in indicators, and that different spatial averaging methods required to suppress the spatially correlated GRACE noise affect the outcome. Finally, we identify and analyze two droughts in South Africa using real GRACE data and the modified indicators.

Thus, perhaps not surprisingly, a number of GRACE-based drought indicators have been suggested (e.g. Houborg et al., 2012; Thomas et al., 2014; Zhao et al., 2017), typically based on either normalization or percentile rank methods. However, a comprehensive comparison and assessment of these indicators is still missing, particularly in the presence of (1) trend signals as picked up by GRACE in many regions that may reflect non-stationary 'normal' conditions, (2) correlated spatial noise that is related to the peculiar GRACE orbital pattern, and (3) the inevitable spatial averaging applied to GRACE, which results in smoothing out noise (Wahr et al., 1998). From a water balance perspective, GRACE TWSC variability mainly represents monthly total precipitation anomalies (e.g., Chen et al., 2010; Frappart et al., 2013). It is thus obvious that GRACE drought indicators will contain signatures that are visible in meteorological drought indicators, yet the difference should tell about the magnitude of other contributions (e.g. increased evapotranspiration due to radiation) to hydrological drought. Fig. 1 shows a time series of region-averaged, de-trended and de-seasoned GRACE water storage changes over Eastern Brazil (Ceará state) compared to the region-averaged 6-month Standardized Precipitation Index (SPI; McKee et al., 1993) to illustrate the potential of GRACE TWSC for drought monitoring. As can be expected, TWSC and the 6-month SPI appear moderately similar (correlation 0.43), characterized by positive peaks, for example at the beginning of 2004 and at the end of 2009, and negative peaks at the beginning of 2013. We also found correlations between TWSC and the 6-month SPI in regions with different hydro-climatic conditions: the Missouri river basin (0.31), Maharashtra in West India (0.46), and South Africa (0.45), among other regions. This motivates us to modify common GRACE indicators to account for accumulation periods of input data, e.g. as used with the 6-month SPI, but also for periods that are based on differences of input data. To our knowledge, this is the first study where (modified) indicators are tested in a synthetic framework based on a realistic signal that includes a hypothetical drought. We hypothesize that in this way we can (i) assess indicator robustness with respect to identifying a 'true' drought of given duration and magnitude, and (ii) understand how trend signals and spatial noise propagate into indicators and mask drought detection. In addition, we investigate to what extent the spatial averaging that is required for analyzing GRACE data affects indicators. For this, we compare spatially averaged gridded indicators to indicators derived from spatially averaged TWSC.
This contribution is organized as follows: in section 2 we will review three GRACE-based drought indicators and modify them to accommodate either multi-month accumulation or differencing, while in section 3 our framework for testing GRACE indicators in a realistic simulation environment will be explained. Then, section 4 will provide simulation results and finally the results from real GRACE data. A discussion and conclusion will complete the paper.

[...]tion, 2) threshold-based, 3) quantile scores, and 4) probability-based (e.g., Zargar et al., 2011; Keyantash and Dracup, 2002; Tsakiris, 2017).
Since total water storage deficit may be viewed as a more comprehensive information source on drought, the advent of GRACE total water storage changes (TWSC) data has led to new indicators being developed. For example, Frappart et al. (2013) developed a drought indicator based on yearly minima of water storage and a method for standardization, and Kusche et al. (2016) computed recurrence times of yearly minima through generalized extreme value theory. Other indicators explore the monthly resolution of GRACE, e.g. the Total Storage Deficit Index (TSDI, Agboma et al., 2009), the GRACE-based Hydrological Drought index (GHDI, Yi and Wen, 2016), the Drought Severity Index (DSI, Zhao et al., 2017), and the Drought Index (DI, Houborg et al., 2012). Further, Thomas et al. (2014) presented a water storage deficit approach to detect drought magnitude, duration, and severity based on GRACE-derived TWSC. To our knowledge, only the Zhao et al. (2017), Houborg et al. (2012), and Thomas et al. (2014) methods are able to detect drought events from monthly GRACE data without any additional information. Therefore, these three indicators will be discussed further.
In order to stress the link between GRACE-based and meteorological indicators, we first describe the relation of TWSC and precipitation. Assuming evapotranspiration (E) and runoff (Q) vary more regularly as compared to precipitation (i.e. ∆E = 0, ∆Q = 0), the monthly GRACE TWSC (∆s) corresponds to the precipitation anomalies (∆P) accumulated since GRACE storage monitoring began (Eq. 1), where ∆t is the time from t_0 to t_1. In contrast to Eq. 1, the difference between GRACE months corresponds to the precipitation anomaly accumulated between these months (Eq. 2). Accumulated monthly TWSC thus corresponds to an iterative summation over the precipitation anomalies (Eq. 3).

In the following, we will discuss and extend the definitions of the Zhao et al. (2017), Houborg et al. (2012), and Thomas et al. (2014) GRACE-based indicators, which are hereafter referred to as the Zhao-method, Houborg-method, and Thomas-method, respectively.
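As a reading aid, the three water balance relations described above can be written out as follows. This is only a sketch inferred from the prose under the stated assumption ∆E = ∆Q = 0; the continuous-time integral form and the integration variable t' are assumptions and not necessarily the exact form of Eqs. (1) to (3).

```latex
% (1) storage anomaly as the precipitation anomaly accumulated since t_0
\Delta s(t_1) \approx \int_{t_0}^{t_1} \Delta P(t')\,\mathrm{d}t', \qquad \Delta t = t_1 - t_0
% (2) difference between two GRACE months t_1 < t_2
\Delta s(t_2) - \Delta s(t_1) \approx \int_{t_1}^{t_2} \Delta P(t')\,\mathrm{d}t'
% (3) accumulating monthly TWSC sums the already accumulated anomalies
\sum_{t=t_1}^{t_2} \Delta s(t) \approx \sum_{t=t_1}^{t_2} \int_{t_0}^{t} \Delta P(t')\,\mathrm{d}t'
```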

2.1 Zhao-method
In the approach of Zhao et al. (2017), one considers GRACE-derived monthly gridded TWSC x_{i,j} for n years, with year index i = 1, ..., n and month index j = 1, ..., 12. Let us define the monthly climatology, i.e. the mean monthly TWSC, x̄_j with j = 1, ..., 12 (Eq. 6), and the standard deviation σ_j of the anomalies in month j with respect to the climatological value. Zhao et al. (2017) define their drought severity index 'GRACE-DSI' as the standardized anomaly of a given month t_{i,j}, i.e. the anomaly with respect to x̄_j divided by σ_j (Eq. 8), and provide a scale from -2.0 (exceptional drought) to +2.0 (exceptionally wet), as shown in Tab. 1. There is no particular probability distribution function (PDF) underlying the method; however, if we assume the anomalies for a given month follow a Gaussian PDF, it is straightforward to compute the likelihood of a given month falling in one of the Zhao et al. (2017) severity classes. For example, 2.1 % of months would be expected to turn out as exceptional drought and 2.1 % as exceptionally wet. The same computation can be applied to any other PDF.
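The standardization described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `twsc_dsi`, the (n_years, 12) array layout, the use of the sample standard deviation (ddof=1), and the random test data are all assumptions made for the example.

```python
import numpy as np

def twsc_dsi(x):
    """Standardized TWSC anomaly per calendar month.

    x: array of shape (n_years, 12) in mm; rows are years i, columns months j.
    """
    x = np.asarray(x, dtype=float)
    clim = x.mean(axis=0)           # monthly climatology, one value per month j
    sigma = x.std(axis=0, ddof=1)   # assumption: sample standard deviation
    return (x - clim) / sigma       # unitless standardized anomaly

# Hypothetical input: 14 years of a seasonal cycle plus random variability
rng = np.random.default_rng(0)
seasonal = 50.0 * np.sin(2 * np.pi * np.arange(12) / 12)
x = seasonal + rng.normal(0.0, 20.0, size=(14, 12))
dsi = twsc_dsi(x)
```

Because the climatology is removed month by month, the seasonal cycle in `x` does not appear in `dsi`; each calendar month has zero mean and unit spread by construction.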
Drought severity, however, should be related to the duration of a drought. For example, McKee et al. (1993) showed how typical time scales of 3, 6, 12, 24, and 48 months of precipitation deficits are related to their impact on usable water sources. To account for the relation between severity and duration in the Zhao et al. (2017) approach, we consider q-months accumulated TWSC (Eq. 9), which is approximately related to precipitation via Eq. (3), with t_{i,j+1−q} = t_{i−1,j+13−q} for j + 1 − q < 1, or equivalently the q-months averaged TWSC. For example, for q = 3 we would look at the 3-month running means Dec-Jan-Feb, Jan-Feb-Mar, and so on. In the next step, one computes the climatology and anomalies as with the original method.

On the other hand, we can relate hydrological to meteorological indicators using Eq. (2). To develop a TWSC indicator that can be compared to indicators based on accumulated precipitation, one should rather consider the q-months differenced TWSC (Eq. 11). Thus, as with TWSC-DSI_{i,j} in Eq. (8), we can define two new multi-month indicators, TWSC-DSIA and TWSC-DSID, by standardizing the accumulated (A) and differenced (D) TWSC (Eqs. 9 and 11), respectively.
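The accumulation and differencing steps, including the wrap-around into the previous year for j + 1 − q < 1, might look as follows. This is a sketch under assumptions: the helper names are hypothetical, the differenced series is taken as x(t) − x(t − q), and months without a full q-month history are set to NaN.

```python
import numpy as np

def accumulate(x, q):
    """Sum of the q months ending at each month; x has shape (n_years, 12)."""
    flat = np.asarray(x, dtype=float).ravel()        # chronological order
    out = np.full(flat.shape, np.nan)
    csum = np.concatenate(([0.0], np.cumsum(flat)))
    out[q - 1:] = csum[q:] - csum[:-q]               # running q-month sums
    return out.reshape(np.shape(x))

def difference(x, q):
    """Assumed form of the q-months difference: x(t) - x(t - q)."""
    flat = np.asarray(x, dtype=float).ravel()
    out = np.full(flat.shape, np.nan)
    out[q:] = flat[q:] - flat[:-q]
    return out.reshape(np.shape(x))

def standardize(y):
    """Standardize per calendar month, ignoring the undefined leading months."""
    clim = np.nanmean(y, axis=0)
    sigma = np.nanstd(y, axis=0, ddof=1)
    return (y - clim) / sigma

# Hypothetical usage: 3-month accumulated and differenced indicators
rng = np.random.default_rng(0)
x = rng.normal(0.0, 30.0, size=(14, 12))
dsia3 = standardize(accumulate(x, 3))
dsid3 = standardize(difference(x, 3))
```

Flattening the (year, month) array before accumulating is what implements the year wrap-around: the January value of year i then draws on November and December of year i − 1.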

Finally, it is obvious that sampling the full climatological range of dry and wet months is not yet possible with the limited GRACE data period. Therefore, Zhao et al. (2017) suggest applying a bias correction to avoid the under- or overestimation of drought events. This implies using TWSC from multi-decadal model runs, which is feasible but not the focus of this study.

Table 1. Drought severity level of the TWSC-DSI (Zhao et al., 2017). The values of TWSC-DSI are unitless.
2.2 Houborg-method

Houborg et al. (2012) define the drought indicator 'GRACE-DI' via the percentile of a given month, t_{i,j}, with respect to the cumulative distribution function (CDF). The GRACE-DI is applied to TWSC (Eq. 14) as follows: all years containing month j for which TWSC is equal to or lower than TWSC in month j and year i are counted, and the count is normalized by the number of years that contain month j. The indicator value is assigned to five severity classes as shown in Tab. 2. For example, exceptional droughts occur in up to 2 % of the entire time period at any location.
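The counting rule described above translates directly into an empirical percentile per calendar month. The following is a minimal sketch; the function name `twsc_di` and the (n_years, 12) layout are assumptions for illustration, and no bias correction against a long model run is included.

```python
import numpy as np

def twsc_di(x):
    """Empirical percentile (in %) of each month within its calendar month.

    x: array of shape (n_years, 12); rows are years i, columns months j.
    """
    x = np.asarray(x, dtype=float)
    n_years = x.shape[0]
    # counts[i, j] = number of years k with x[k, j] <= x[i, j]
    counts = (x[None, :, :] <= x[:, None, :]).sum(axis=1)
    return 100.0 * counts / n_years

# Hypothetical input: 14 years of monthly TWSC anomalies in mm
rng = np.random.default_rng(0)
x = rng.normal(0.0, 30.0, size=(14, 12))
di = twsc_di(x)
```

Since each month is counted against itself, the driest year of any calendar month receives 100/n %, so a short record cannot reach the lowest percentile classes.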
Again, to relate drought severity to duration, we proceed via multi-month accumulation (Eq. 9) and differences (Eq. 11), resulting in the definition of two new indicators, TWSC-DIA and TWSC-DID, based on TWSC-DI_{i,j} in Eq. (14). Assuming again that the CDF equals the cumulative Gaussian, 0.6 % of months would be detected as exceptionally dry and 9.5 % of months as abnormally dry. Houborg et al. (2012) also applied the percentile approach separately to surface soil moisture, root zone soil moisture, and groundwater storage, which were derived by assimilating GRACE-derived TWSC into a hydrological model, and the CDFs were adjusted to a long-term model run. Here, we focus on a simulated TWSC environment for the GRACE period only and, as explained in Sec. 2.1, we therefore disregard the bias correction.

Table 2. Drought severity level of the TWSC-DI (Houborg et al., 2012). The values of TWSC-DI are given in %.

Severity class    TWSC-DI (%)
Abnormal          20 to 30
Moderate          10 to 20

2.3 Thomas-method

Given TWSC observations x_{i,j} and a threshold c, we can compute anomalies as the deviation of the observations from the threshold (Eq. 17). While the threshold can be derived from different concepts, Thomas et al. (2014) use the monthly climatology x̄_j (Eq. 6). Here, we also consider using a fitted signal for defining the threshold. The signal is computed at time t (Eq. 18) with a constant a_0, a linear trend term a_1, a constant acceleration term a_2, annual signal terms b_1 and b_2, and similarly semi-annual signal terms c_1 and c_2. Trends and possible accelerations in GRACE TWSC can result from many different hydrological processes. For example, accelerations can result from trends in the fluxes precipitation, evapotranspiration, and runoff (e.g. Eicker et al., 2016). In the following, linear trends are denoted as trends and constant accelerations as accelerations.

The Thomas-method then identifies drought events through the computation of their magnitude, duration, and severity: the magnitude or water storage deficit is equal to ∆x_{i,j} (Eq. 17), and the duration d_{i,j} is given by the number of consecutive months in which TWSC is below the threshold. Thomas et al. (2014) propose a minimum of 3 consecutive months for the computation of drought duration. From the deficit ∆x_{i,j} and the duration d_{i,j}, the severity s_{i,j} of the drought event can finally be computed. Severity is therefore a measure of the combined impact of duration and magnitude of water storage deficit, see Thomas et al. (2014) and Humphrey et al. (2016).
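The event detection of the Thomas-method can be sketched as a run-length search over months below the threshold. The function name and the returned dictionary keys are hypothetical, and severity is assumed here to be the mean deficit of an event multiplied by its duration (in mm months); the exact severity formula should be taken from the original equations.

```python
import numpy as np

def thomas_events(x, threshold, min_duration=3):
    """Find drought events in a monthly TWSC series.

    x: 1-D array of TWSC (mm); threshold: scalar or 1-D array of the same
    length (e.g. climatology or fitted signal).
    """
    deficit = np.asarray(x, dtype=float) - np.asarray(threshold, dtype=float)
    below = deficit < 0
    events, start = [], None
    # A trailing False closes a run that lasts until the end of the series
    for t, flag in enumerate(np.append(below, False)):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            duration = t - start
            if duration >= min_duration:
                mean_deficit = deficit[start:t].mean()
                events.append({
                    "start": start,
                    "duration": duration,
                    "mean_deficit": mean_deficit,
                    # assumption: severity = mean deficit x duration
                    "severity": mean_deficit * duration,
                })
            start = None
    return events

# Hypothetical series: one 9-month deficit and one 2-month dip
x = np.zeros(24)
x[5:14] = -100.0
x[18:20] = -50.0
events = thomas_events(x, 0.0)
```

With the default minimum duration of 3 months, only the 9-month deficit is reported; the 2-month dip is discarded.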
3 Framework to derive synthetic TWSC for computing drought indicators

3.1 Methods
In order to analyze the performance of drought indicators, we first construct a synthetic time series of 'true' total water storage changes (TWSC) on a grid. We base our drought simulations on the GRACE data model (Eq. 20), which includes the signal x introduced in Sec. 2.3 (containing seasonality, a constant, a linear trend, and a time-varying trend, Eq. 18), an interannual signal η (which has been de-trended and de-seasoned and which will carry the simulated 'true' drought signature), and a GRACE-specific noise term. To simulate the 'true' signal as realistically as possible using Eq. (20), we first analyze real GRACE-TWSC following the steps summarized in Fig. 2. We derive 1) the signal components (constant, trend, acceleration, annual and semi-annual sine waves), 2) temporal correlations, 3) a representative drought signal quantified by strength and duration, and 4) spatially correlated noise from GRACE error covariance matrices. While the first three steps are generic and can be used for simulating other observables, step 4 is directly related to the measurement noise (in this case the GRACE noise).

As an input to the simulation, GRACE-TWSC are derived by mapping monthly ITSG-GRACE2016 gravity field solutions of degree and order 60, provided by TU Graz (Mayer-Gürr et al., 2016), to TWSC grids. As per standard practice, we add degree-one spherical harmonic coefficients from Swenson et al. (2008) and degree 2, order 0 coefficients from laser ranging solutions (Cheng et al., 2011). Then, we remove the temporal mean field, apply DDK3 filtering (Kusche et al., 2009) to suppress excessive noise, and map the coefficients to TWSC via spherical harmonic synthesis. We also remove the effect of [...]

Droughts are a multiscale phenomenon, and for a realistic simulation we must first define the largest spatial scale to which we will apply the model of Eq. (20). In other words, we first need to identify coherent regions in the input data, for which our approach is then applied at grid scale, prior to step 1. For this, we apply two consecutive steps. We first compute temporal signal correlations by fitting an autoregressive (AR) model (Appendix A; Akaike, 1969) to detrended and deseasoned GRACE data. These TWSC residuals contain interannual and subseasonal signals, including real drought information. Next, the temporal correlation coefficients are used as input for an Expectation Maximization (EM) clustering (Dempster et al., 1977; Redner and Walker, 1984), because regions with similar residual TWSC correlation within the interannual and subseasonal signal are hypothesized here to be more likely affected by the same hydrological processes. The EM-algorithm by Chen (2018) is modified to identify regional clusters. The EM-algorithm alternates an expectation and a maximization step to maximize the likelihood of the data (e.g. Dempster et al., 1977; Redner and Walker, 1984; Alpaydin, 2009). More details about EM-clustering are provided in App. B.
As a result of this procedure, we identified three clusters located in East Brazil (EB), South Africa (SA), and West India (WI), which were indeed affected by droughts in the past (e.g. Parthasarathy et al., 1987; Rouault and Richard, 2003; Coelho et al., 2016). Location and shape of the three chosen clusters are shown in Fig. 3, and a global map of all clusters is provided in Fig. B1. Cluster delineations from the above procedure should not be confused with political boundaries or watersheds. The following simulation steps are then applied to each of these three clusters.
In step 1 we estimate the signal coefficients according to Eq. (18) through a least squares fit for each grid cell within the cluster. The coefficients are then spatially averaged to create a signal representative of the mean conditions within the region, and are then used to create the constant, trends, and the seasonal parts of the synthetic time series. To simulate realistic temporal correlations at the regional scale (step 2), we use the AR-model identified beforehand (Fig. 2) and again average the AR-model coefficients within the cluster. Then, we apply an AR model with the estimated optimal order and the averaged correlation coefficient (Eq. A1) to the synthetic time series to add temporal correlations.
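Step 1 can be illustrated with an ordinary least squares fit per grid cell followed by averaging of the coefficients. This is a sketch under assumptions: the factor 1/2 on the acceleration term, the assignment of cosine and sine to the b and c coefficients, time in decimal years, and all function names are choices made for the example rather than taken from Eq. (18).

```python
import numpy as np

def design_matrix(t):
    """Columns: constant, trend, acceleration, annual and semi-annual terms.

    t: 1-D array of time in decimal years.
    """
    w = 2.0 * np.pi  # annual angular frequency for t in years
    return np.column_stack([
        np.ones_like(t), t, 0.5 * t**2,          # a0, a1, a2 (assumed 1/2 factor)
        np.cos(w * t), np.sin(w * t),            # b1, b2 (annual)
        np.cos(2 * w * t), np.sin(2 * w * t),    # c1, c2 (semi-annual)
    ])

def fit_signal(t, x):
    """Least squares coefficients; x is (n_times,) or (n_times, n_cells)."""
    coeffs, *_ = np.linalg.lstsq(design_matrix(t), x, rcond=None)
    return coeffs

# Hypothetical cluster of two grid cells over 14 years of monthly data
t = np.arange(168) / 12.0
true = np.array([10.0, 5.0, -8.0, 30.0, 20.0, 5.0, 3.0])
cells = np.column_stack([design_matrix(t) @ true,
                         design_matrix(t) @ (2.0 * true)])
coeffs = fit_signal(t, cells)            # shape (7, n_cells)
regional = coeffs.mean(axis=1)           # spatially averaged coefficients
synthetic = design_matrix(t) @ regional  # deterministic part of the series
```

Averaging the coefficients rather than the fitted series gives the same result here because the model is linear in its coefficients.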
Simulating realistic drought events in step 3 is challenging because, to our knowledge, no unique procedure to simulate realistic drought periods for TWSC exists. For this reason, we first perform a literature review to identify representative drought periods and magnitudes for selected regions. Among others, this includes the 2003 European drought and the drought in the Amazon basin in 2011 (e.g., Seitz et al., 2008; Espinoza et al., 2011, respectively). TWSC within the identified drought period are then eliminated from the time series. In the next step, the parameters describing the constant, trend, acceleration, and seasonal signal components before and after the drought are used to 'extrapolate' these signals during the drought period. By computing the difference of the original GRACE-TWSC time series and the continued signal in the drought period, we can separate non-seasonal variations from the data, which represent the drought magnitude. Our hypothesis is that the non-seasonal variations that we derive from the procedure possibly show a systematic behavior that can be parameterized. To extract this systematic behavior, all extracted droughts are transformed to a standard duration. To compare the different drought signals, a standard duration and a standard magnitude are arbitrarily set to 10 months and -100 mm, respectively. Finally, a synthetic drought signal η is generated by using the extracted knowledge of drought duration, drought magnitude, and systematic behavior, and it is added to the synthetically generated signal (Eq. 20).
In step 4 we add GRACE-specific spatially correlated and temporally varying noise (Eq. 20). First, for each month t we extract a full variance-covariance matrix Σ for the region grid cells from GRACE-TWSC. Then, whenever Σ is positive definite, we apply the Cholesky decomposition Σ = RᵀR, while if Σ is only positive semi-definite we apply an eigenvalue decomposition (Appendix C). Second, we generate a Gaussian noise series v of length n, where n represents the number of grid cells within the cluster. Finally, spatial noise in month t is simulated by applying the resulting factor of Σ to v, so that the simulated noise has covariance Σ.

The final synthetic signals for each grid cell within a cluster will thus exhibit the same constant, trend, acceleration, seasonal signal, temporal correlations, and drought signal, but spatially different and correlated noise. In the following, we will test the hypothesis that GRACE indicators depend on the presence of trend and random input signals using the generated synthetic time series.
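The noise generation of step 4 can be sketched with NumPy. Note that `np.linalg.cholesky` returns the lower-triangular factor L with Σ = L Lᵀ, so L plays the role of Rᵀ in the notation above. The function name and the clipping of slightly negative eigenvalues are assumptions for this example.

```python
import numpy as np

def correlated_noise(cov, rng):
    """Draw one noise vector with covariance `cov` (n x n, symmetric)."""
    cov = np.asarray(cov, dtype=float)
    v = rng.standard_normal(cov.shape[0])   # uncorrelated unit-variance noise
    try:
        # Positive definite case: cov = L @ L.T with L lower triangular
        L = np.linalg.cholesky(cov)
        return L @ v
    except np.linalg.LinAlgError:
        # Fallback for semi-definite matrices: cov = U diag(w) U.T
        w, U = np.linalg.eigh(cov)
        w = np.clip(w, 0.0, None)           # assumption: clip round-off negatives
        return U @ (np.sqrt(w) * v)

# Hypothetical 2-cell covariance matrix in mm^2
rng = np.random.default_rng(0)
cov = np.array([[4.0, 1.5], [1.5, 1.0]])
samples = np.array([correlated_noise(cov, rng) for _ in range(20000)])
sample_cov = np.cov(samples.T)
```

In both branches the returned vector is a linear map of v whose covariance equals Σ, which is what makes the simulated noise spatially correlated in the same way as the GRACE errors.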
We believe that our synthetic framework based on real GRACE data has multiple benefits: i) we are able to assess the ability of an indicator to detect drought by comparing the 'true' drought duration and magnitude (step 3) to the indicator results; ii) we are able to detect the influence of other typical GRACE signals on the drought detection; iii) the synthetic framework enables us to identify strengths and weaknesses of each analyzed indicator, and thereby enables us to choose the most suitable indicator for a specific application.

Here, we will briefly discuss the TWSC simulation following the methods described in the previous section.
When estimating AR models for detrended and deseasoned global GRACE data, we find that more than 70 % of the global land TWSC grid cells are best represented by an AR(1) process (App. Fig. A1). Therefore, we apply the AR(1) model for each grid cell. Fig. 3 shows the estimated AR-model coefficients, which represent the temporal correlations, ranging from low values of up to 0.3, e.g. over the Sahara or in South West Australia, to about 0.8, e.g. in Brazil or in the Southeastern U.S.
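For an AR(1) process, the model coefficient can be estimated from the lag-1 autocorrelation of the residuals. The sketch below shows this estimate together with a simple AR(1) simulator; the function names are hypothetical, and the order selection step (e.g. via an information criterion) is not shown.

```python
import numpy as np

def ar1_coefficient(residual):
    """Lag-1 autocorrelation estimate of the AR(1) coefficient."""
    r = np.asarray(residual, dtype=float)
    r = r - r.mean()
    return np.dot(r[1:], r[:-1]) / np.dot(r, r)

def simulate_ar1(phi, n, rng, sigma=1.0):
    """Simulate x[t] = phi * x[t-1] + e[t] with Gaussian innovations e."""
    e = rng.normal(0.0, sigma, n)
    out = np.empty(n)
    out[0] = e[0]
    for t in range(1, n):
        out[t] = phi * out[t - 1] + e[t]
    return out

# Hypothetical check: recover a coefficient of 0.8 from a long simulated series
rng = np.random.default_rng(0)
series = simulate_ar1(0.8, 20000, rng)
phi_hat = ar1_coefficient(series)
```

For a record as short as the GRACE period (about 14 years of monthly data), the same estimate carries a noticeably larger sampling uncertainty than in this long test series.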

EM-clustering is then based on these coefficients.
The three selected clusters (Fig. 3) show differences between the signal coefficients of the functional model (step 1, Eq. 18), which we discuss here for the linear trend. We find a mean linear trend for the East Brazil cluster of 1.0 mm TWSC per year, a higher trend of 5.0 mm per year in South Africa, and for West India a trend of 56.3 mm per year (Tab. 3). The trends for East Brazil and South Africa in GRACE TWSC have been identified before (e.g. Humphrey et al., 2016; Rodell et al., 2018). We did not find confirmation of the strong linear trend in West India; for example, Humphrey et al. (2016) identified about 7 mm per year within this region. We assume that in this study the linear trend for West India is estimated as strongly positive because we additionally identify a strong negative acceleration of -8.03 mm per year² in West India. However, our simulation will cover weak and strong trends. In fact, all coefficients show strong differences, which suggests that we cover different hydrological conditions when simulating TWSC for the three regions. In step 2 we identify correlations of 0.74 [...] in Texas (e.g. Long et al., 2013), and the 2003 drought in Europe (e.g. Seitz et al., 2008). To extract the drought duration, we compared the drought onset and end identified in these and other papers. We found that different studies do not exactly match, with inconsistencies likely due to the different methodologies used. Furthermore, some authors only specified the year of drought.
Droughts extracted from the literature had a duration of 3 to 10 months (Fig. 4a-d). Unless otherwise specified, we decided to base our simulations on a duration of 9 months to represent a clearly identifiable drought duration. Extracted drought magnitudes range from about -20 to -350 mm TWSC (Fig. 4a-d). Therefore, in order to simulate a drought magnitude that has a clear influence on the synthetic time series, we set the magnitude to -100 mm.
As described in Sec. 3.1, we transform these water storage droughts to a standard duration and magnitude to understand whether a typical signature can be seen. However, Fig. [...] When we remove those four time series (Fig. 4f), a systematic behavior can be identified and parameterized using a linear or quadratic temporal model. However, due to these difficulties, we decided to use the simplest TWSC drought model, i.e. a constant water storage deficit within a given time span.
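The constant-deficit drought model chosen above amounts to a box-shaped offset added to the synthetic series. A minimal sketch, with a hypothetical function name and the 9-month duration and -100 mm magnitude used as defaults:

```python
import numpy as np

def add_constant_drought(signal, start, duration=9, magnitude=-100.0):
    """Return a copy of `signal` with a constant deficit added.

    signal: 1-D monthly series (mm); start: index of the first drought month.
    """
    out = np.array(signal, dtype=float, copy=True)
    out[start:start + duration] += magnitude   # box-shaped storage deficit
    return out

# Hypothetical usage: 14 years of monthly data, drought starting in month 36
base = np.zeros(168)
with_drought = add_constant_drought(base, start=36)
```

The input series is left untouched, so the same deterministic signal can be reused for experiments with different drought timings, durations, or magnitudes.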
In step 4, we project the simulation on a 0.5° grid and add spatially correlated GRACE noise. A few representative time series of the gridded synthetic total water storage change are shown in Fig. 5 [...]

[...] identification of exceptional drought if no masking occurs (but in the presence of GRACE noise), so at this point we can determine that exceptional drought represents the 'true' drought severity class. As expected, a trend and/or an acceleration signal of the kind frequently observed in GRACE analyses can lead to misinterpretations in the indicators. However, the influence of the trend or acceleration also depends on the timing of the drought period within the analysis window. For example, if we simulated the time series with the same trend or acceleration but the drought were to occur in 2014, the drought detection would not be influenced as much. Therefore, we decided to set up an additional experiment and discuss the influence of different trend strengths on the drought detection (Sec. 4.3).
The analysis reveals that DSI and DSIA indicators are sensitive with respect to trends, while they are less sensitive to the annual and semi-annual signal. The seasonal signal is clearly dampened (e.g. compare Fig. 5 to the DSIA in Fig. 6). This is caused by removing the climatology within the Zhao-method (Eq. 8). Comparing DSIA3, DSIA6, DSIA12, and DSIA24, e.g. for East Brazil, suggests that with a longer accumulation period, indicator time series are increasingly smoothed and less severe droughts are identified (Fig. 6, left). Furthermore, the drought period appears shifted in time and its duration is prolonged. This can lead to missing a drought identification if a trend or an acceleration is contained in the analyzed time series, for example for the 24 months DSIA for East Brazil. We find that all DSIA are able to unambiguously detect a drought close to 2005, assuming that neither trend nor acceleration is apparent (Fig. 6, DSIA for South Africa). In particular, the 3 and 6 months DSIA identify the drought close to 2005 for South Africa, and their computation appears to dampen the temporal noise that is present in the DSI.
In contrast, we find that the 3, 6, 12, and 24 months TWSC-differencing DSID exhibit stronger temporal noise as compared to the DSIA and the DSI. This can be seen in the light of Eq. (2): these indicators are closer to meteorological indicators and thus do not inherit the integrating property of TWSC. The DSID does not propagate a trend and acceleration, annual signal, or semi-annual signal. All DSID time series, for example for East Brazil (Fig. 6, right), show a strong negative peak within the drought period, but this peak does not cover the entire drought period for the 3 and 6 months differenced DSID. The negative peak within the drought period is always followed by a strong positive peak; when we consider Eq. 2, this lends itself to the interpretation that a pronounced drought period is normally followed by a very wet event to return to 'normal' water storage conditions. Despite higher noise and the positive peak, and contrary to the DSIA, all DSID (DSID3, DSID6, DSID12, and DSID24) correctly identify the drought within 2005 to be exceptionally dry for East Brazil and South Africa. All DSID time series for WI identify at least a moderate drought.
Analysis of the Houborg-method shows broadly similar behavior as compared to the Zhao-method: the sensitivity of drought detection to an included trend or acceleration depends on the indicator type. Using the DIA, we can confirm the large influence of the trend or acceleration on the indicator value, which is not the case for DID (e.g. Fig. 7, DIA and DID for East Brazil). Annual and semi-annual water storage signals are all considerably weakened in the Houborg-method because they are effectively removed when computing the empirical distribution for each month of the year. Differences to the Zhao-method appear when comparing more general properties, e.g. we find that DI is more noisy and the range of output values is restricted to about 7 % to 100 % (Fig. 7). This restriction is caused by the length of the time series; e.g. assuming we strive to identify an event with exceptionally dry values (≤ 2 %), we would need at least 50 years of monthly observations. Yet, with GRACE we only have about 14 years of good monthly observations, so the simulation was also restricted to this period. If we then take the driest value that might occur only once, we can compute the minimum value of DI to be 7.14 %. Hence the detection of exceptional or extreme drought is not possible given the length of the GRACE TWSC time series. As mentioned in Sec. 2.2, Houborg et al. (2012) applied a bias correction to the empirical CDF to mitigate this restriction. We do not follow Houborg's approach here in order to focus on the synthetic environment instead of the availability of model outputs.
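The two numbers quoted above follow directly from the counting definition of the DI. As a worked example, with n denoting the number of years available for a given calendar month (n = 14 for the simulated GRACE period):

```latex
\mathrm{DI}_{\min} = \frac{1}{n}\cdot 100\,\% = \frac{1}{14}\cdot 100\,\% \approx 7.14\,\%,
\qquad
\frac{1}{n}\cdot 100\,\% \le 2\,\% \;\Longleftrightarrow\; n \ge 50
```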

The Thomas-method is applied to simulated TWSC data to derive magnitude, duration, and severity of drought, which we show in Fig. 8 for the EB region. We find that the linear trend and acceleration propagate into the magnitude (Fig. 8, top) when using TWSC deficits with climatology removed (blue, Eq. 6) compared to using TWSC deficits with removed trends, accelerations, and seasonality (red, Eq. 18). When using non-climatological TWSC (blue), we identify a strong deficit in 2015 and 2016 (Fig. 8, top), which suggests a duration of up to 38 months (Fig. 8, center) and a severity of about -4000 mm months (Fig. 8, bottom). Using the detrended and deseasoned TWSC (red), drought is mainly detected in the 'true' drought period (2005) and not at the end of the time series. Thus we conclude that a trend or acceleration indeed modifies the drought detection.
Results so far were derived by imposing a minimum duration of 3 months (blue and red). When moving to a minimum duration of 6 consecutive months (green, Fig. 8, middle and bottom) we find this would lead to a decrease in identified severity by half, and the beginning of the drought period shifts 3 months in time. This is in line with Thomas et al. (2014). The same findings are made for South Africa and West India.

4.2 Synthetic TWSC: effect of spatially correlated GRACE errors
Here, we investigate how robust the Zhao-, Houborg-, and Thomas-indicators are with respect to the spatially correlated and time-variable GRACE errors. However, any analysis must take into account that GRACE results cannot be evaluated directly at grid resolution.
In our first analysis, indicators based on (synthetic) TWSC grids are thus spatially averaged through two different methods (Sec. 3.1). We find that regional-scale DSI and DI indicators, as well as the outputs of the Thomas-method, for South Africa computed by averaging TWSC first (dark blue, Fig. 9) indeed differ from the averages of indicators computed at grid scale from TWSC (light blue, Fig. 9). These differences can be explained by the inherent non-linearity of the indicators. Since the synthetic data have been constructed from the same constants, trends, seasonal signal, temporal correlations, and drought signal, we isolate the effect of GRACE noise on regional-scale indicators here. Outside of the drought period, we conclude that the sequence in which we spatially average causes larger differences for DI than for DSI. For South Africa, the range of averaged DI is about 7 to 100 %, while the range of the DI of averaged TWSC is about 7 to 80 %. Within the drought period, the DI exhibits little difference between both averaging methods. The DSI from averaged TWSC does suggest a weaker severity in the drought period compared to averaged DSI. In this case, however, both indicator averages identify the same (exceptional) drought severity class. We thus find that for both DSI and DI the identification of drought severity is not sensitive to the choice of the averaging method for this cluster. For other cases, however, these differences can be more significant and may lead to misinterpretation (e.g. May and July 2005 for the DI in East Brazil, Fig. 9). For the Thomas-method, we cannot distinguish which result is more significant, since we have no comparable 'true' severity amount for that indicator.
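That the order of averaging matters for a non-linear indicator can be reproduced with a toy example. The sketch below compares the DSI of spatially averaged TWSC with the spatial average of grid-scale DSI. All names and numbers are hypothetical, and the noise is drawn independently per cell for simplicity, which differs from the spatially correlated GRACE noise used in the simulations.

```python
import numpy as np

def dsi(x):
    """Standardized anomaly per calendar month; x has shape (..., n_years, 12)."""
    clim = x.mean(axis=-2, keepdims=True)
    sigma = x.std(axis=-2, ddof=1, keepdims=True)
    return (x - clim) / sigma

rng = np.random.default_rng(1)
n_cells, n_years = 50, 14
common = rng.normal(0.0, 30.0, size=(n_years, 12))            # shared regional signal
cells = common + rng.normal(0.0, 30.0, size=(n_cells, n_years, 12))  # plus cell noise

dsi_of_mean = dsi(cells.mean(axis=0))    # average TWSC first, then standardize
mean_of_dsi = dsi(cells).mean(axis=0)    # standardize per cell, then average
max_gap = np.abs(dsi_of_mean - mean_of_dsi).max()
```

The two regional series share the same underlying signal and differ only through the noise and the order of operations, which mirrors the way the experiment above isolates the effect of noise on regional-scale indicators.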
To determine the influence of the GRACE-specific spatial noise on the detected drought severity, a second analysis is applied. This analysis computes, for each time step, the share of the area for which a given drought severity class is identified (Fig. 10). Since different grid cells at one time step only differ in their spatial noise, it is important to understand that identifying more than one severity class is directly related to the noise. If the grid cells had no noise or exactly the same noise, only one class of drought would be detected for one epoch. For example, we identify all classes of droughts (abnormal to exceptional) in December 2015 by using DSI for the East Brazil cluster (Fig. 10, top left). Thus, the spatial noise has a large influence on the drought detection. To establish which indicator is most affected, the indicators are compared with each other.
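The area-share analysis can be sketched as a classification of each grid cell followed by a count per class and time step. Only the -2.0 bound for exceptional drought is taken from the text; the intermediate class bounds below are placeholders assumed for illustration (the actual values belong to Tab. 1), and all cells are assumed to have equal area.

```python
import numpy as np

# Upper bounds of the drought classes, driest first.
# Assumption: only -2.0 is from the text, the others are placeholders.
BOUNDS = np.array([-2.0, -1.6, -1.3, -0.8, -0.5])
LABELS = ["exceptional", "extreme", "severe", "moderate", "abnormal", "none"]

def classify(dsi):
    """Class index per value: 0 = exceptional ... 5 = no drought."""
    return np.searchsorted(BOUNDS, dsi, side="left")

def area_share(dsi_cells):
    """Share of cells (in %) per class and time step.

    dsi_cells: array of shape (n_cells, n_times); returns (6, n_times).
    Assumption: equal-area cells (no latitude weighting).
    """
    classes = classify(np.asarray(dsi_cells, dtype=float))
    shares = [(classes == k).mean(axis=0) for k in range(len(LABELS))]
    return 100.0 * np.stack(shares)

# Hypothetical indicator values for 4 cells and 2 time steps
dsi_cells = np.array([[-2.5, 0.0], [-1.7, 0.0], [-2.0, -0.6], [0.3, -0.6]])
share = area_share(dsi_cells)
```

If all cells carried identical noise, each column of `share` would contain a single entry of 100 %; a spread over several classes at one time step is therefore a direct expression of the spatial noise.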
We note that large differences are found between the DSI, the 6 months accumulated DSIA, and the 6 months differenced DSID within the given drought period for the East Brazil region (Fig. 10, left). All three indicators manage to identify the hand, the DSIA does not detect exceptional drought in any grid cell. It is apparent that this indicator misses the exceptional dry event because of the included trend and acceleration.
When comparing the DSIA of East Brazil to the DSIA of South Africa (Fig. 10, center), we find that DSIA is able to detect the drought strength correctly when only a small trend or acceleration is present. Moreover, DSIA appears more robust against spatial noise, since it identifies severe drought or drier in more than 90 % of grid cells, while the DSI indicator identifies only about 60 %. As described in Sec. 4.1, longer accumulation periods lead to smoother and thus more robust indicators. We find that the DSID is more successful in detecting exceptional drought: more than 80 % of the DSID grid cells show exceptional drought, but the indicator appears noisier than the DSIA. Finally, with regard to the drought duration, we find that only DSI detects the 'true' period correctly. When identified via DSIA, the duration appears longer, and when identified via DSID, the period is found to be shorter than the 'true' drought period. Overall, we find that the different indicators DSI, DSIA, and DSID all come with advantages and disadvantages regarding the presence of spatial and temporal noise. The same findings were made for the indicators of the Houborg-method (results not shown). This analysis is not applied to the Thomas-method, because the method does not refer to severity classes (Sec. 2.3).
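Why accumulation lengthens and differencing shortens the apparent drought period can be seen from a noise-free toy series. The sketch below assumes that accumulation is a trailing sum over q months and that differencing is TWSC(t) minus TWSC(t - q); the definitions given earlier in the paper take precedence, and the function names are hypothetical.

```python
import numpy as np

def accumulate(x, q):
    """Trailing sum over q months; the first q - 1 entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    c = np.cumsum(np.insert(x, 0, 0.0))
    out[q - 1:] = c[q:] - c[:-q]
    return out

def difference(x, q):
    """Difference to the value q months earlier; the first q entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    out[q:] = x[q:] - x[:-q]
    return out

# Noise-free toy input: a 12-month deficit of -100 mm starting at month 24.
twsc = np.zeros(60)
twsc[24:36] = -100.0

acc = accumulate(twsc, 6)
dif = difference(twsc, 6)

print(np.flatnonzero(acc < 0))  # months with negative accumulated signal
print(np.flatnonzero(dif < 0))  # months with negative differenced signal
print(np.flatnonzero(dif > 0))  # months with positive differenced signal
```

In this toy case the accumulated series stays negative for several months after the deficit ends, while the differenced series is negative only during the first q months of the deficit and turns positive once the deficit ends.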

Synthetic TWSC: experiments with variable trend, drought duration and severity
Two experiments were additionally constructed to examine the influence of trends and drought parameters on the indicator capability. First, we consider how strong a linear trend in total water storage must be to mask drought in the indicators. For this, we test different trends from -10 mm per year to 10 mm per year for DSI, DSIA, DI, DIA, and the Thomas-method in the West India region (since these indicators were identified as being affected by trends, Sec. 4.1). No acceleration is included for these tests. We find that trends between -1 and 1 mm per year have no influence on any of the indicators, while differences start to appear when simulating a trend higher than 2 mm per year. This propagates into the DSI, DSIA, DI, and DIA indicators but does not affect the drought period.
A question we must ask is what would be the largest trend magnitude that does not affect the correct detection of drought duration and drought severity, and how can we verify this? An obvious influence within the drought period in 2005 is found when simulating a trend of -7 mm per year or lower. It is important at this point to understand that there is a relation between the timing of the drought and the sign of the trend, i.e. whether the trend is positive or negative. Assuming that a positive trend Other factors, e.g. the length of the time series, have an influence on the masking by the trend and, as a result, affect drought detection. The longer the input time series, the more sensitive the drought detection is to the trend. At the same time, the magnitude of the trend needs to be considered relative to the variability or range of the TWSC. For example, a -6 mm per year trend has a larger influence on the drought detection if the range of TWSC is -50 to 50 mm compared to -200 to 200 mm. As a reference, the synthetic time series for West India, without any trend or acceleration signal, ranges from about -323 to 87 mm. So, deriving a general quantity for these dependencies is difficult.
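The interplay between trend sign and drought timing can be explored with a small sketch. It assumes a standardized-anomaly indicator computed per calendar month and a toy series with a seasonal cycle, white noise, and a six-month deficit early in the record. All amplitudes, the seed, and the function names are hypothetical and do not correspond to the West India simulation.

```python
import numpy as np

def standardized_anomaly(x):
    """Standardize a monthly series separately for each calendar month."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for m in range(12):
        sel = x[m::12]
        out[m::12] = (sel - sel.mean()) / sel.std(ddof=1)
    return out

def mean_indicator_in_drought(trend_mm_per_year, seed=1):
    """Mean indicator value within the simulated drought for a given trend."""
    rng = np.random.default_rng(seed)
    t = np.arange(144)                          # 12 years of monthly data
    twsc = 50.0 * np.sin(2 * np.pi * t / 12)    # seasonal signal (mm)
    twsc += rng.normal(0.0, 10.0, t.size)       # white noise (mm)
    twsc[36:42] -= 80.0                         # 6-month drought in year 4
    twsc += trend_mm_per_year / 12.0 * t        # linear trend, not removed
    return standardized_anomaly(twsc)[36:42].mean()

no_trend = mean_indicator_in_drought(0.0)
neg_trend = mean_indicator_in_drought(-10.0)
pos_trend = mean_indicator_in_drought(10.0)
print(no_trend, neg_trend, pos_trend)
```

For a drought early in the record, a negative trend raises the early values relative to the long-term mean and therefore weakens the drought signal, while any trend inflates the standard deviation used for normalization. Moving the drought to the end of the series reverses the role of the trend sign.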
In a second experiment, we assess which input drought duration and magnitude would at least be visually recognized in the indicators. We choose 3, 6, 9, 12, and 24 months for the simulated duration and -40 mm, -60 mm, -80 mm, -100 mm, and -120 mm for the drought magnitude, and apply both the Zhao- and the Houborg-method. We compare the changes for one indicator time series for the East Brazil region. The drought always begins in January 2005 for the first tests. In general, we found that the identification of the severity class is less sensitive to changes in the drought duration, since drought durations of 3, 6, 9, 12, and 24 months mostly result in equal drought severity classes, for example for a drought magnitude of -120 mm. Thus, we concentrate our analysis on changes in drought magnitude.
Exceptional drought is only classified by the Zhao-method for East Brazil for a simulated drought magnitude of -120 mm; this is related to the trend and acceleration signal contained in the simulated TWSC and was already found in Sec. 4.1. For the Zhao-method, extreme drought is identified when simulating a drought magnitude of at least -100 mm, while only severe and moderate drought is identified when simulating a magnitude of -80 mm and -60 mm. The Houborg-method fails to identify extreme and exceptional drought, as described in Sec. 4.1. Thus, simulated magnitudes of -100 and -120 mm are identified as severe drought for all simulated drought periods (3 to 24 months), while simulating a lower magnitude (-80 mm and -60 mm) causes moderate or abnormal dry events to be identified. We find that both methods are not able to clearly detect a drought that has a magnitude of -40 mm or weaker, if the duration is between 3 and 24 months. This experiment supports our findings in Sec. 3.2.

Application to real GRACE data: South Africa droughts
For South Africa, droughts are a recurrent climatic phenomenon. The complex rainfall regime has led to multiple occurrences of drought events in the past, for example to a strong drought in 1983 (e.g. Rouault and Richard, 2003; Vogel et al., 2010; Malherbe et al., 2016). These past droughts appeared in varying climate regions, at different times of the year, and with different severity. Since 1960, many of them were linked to El Niño (e.g. Rouault and Richard, 2003; Malherbe et al., 2016).
Based on the simulation results, we chose the 6 months accumulated DSIA to identify droughts for (the administrative area of) South Africa (GADM, 2018) in the GRACE total water storage data. DSIA has proven to be more robust with respect to the peculiar, GRACE-typical spatial and temporal noise as compared to the other tested indicators (Sec. 4.2 and 4.1). of South Africa (Fig. 12b). For comparison, the EM-DAT database similarly identified 2015 as a drought event, but did not classify 2016 as such. We speculate that the differences are due to the drought criteria of the EM-DAT database (disasters are included when, for example, 10 or more people died or 100 or more people were affected). However, the EM-DAT database lists 2016 as a year of extreme temperature, which might be related to our detected drought. Furthermore, we can confirm the 2015/2016 drought by a lower maximum precipitation in these years than in other years (about 65 mm) and by meteorological indicators indicating severe to extreme drought (SPI, Standardized Precipitation Evapotranspiration Index (Vicente-Serrano et al., 2010), and Weighted Anomaly Standardized Index (Lyon and Barnston, 2015)).

Discussion
The framework developed in this study enables us to simulate GRACE TWSC data with realistic signal and noise properties, and thus to assess the ability of GRACE drought indicators to detect drought events in a controlled environment with known 'truth'. This will be extended to GRACE-FO in the near future. GRACE studies have often been based on simplified noise models (e.g. Zaitchik et al., 2008; Girotto et al., 2016) where the GRACE noise model is not derived from the used GRACE data but, for example, from literature and assumed to be spatially uniform and uncorrelated. However, it is important to account for realistic error and signal correlation (e.g. Eicker et al., 2014), in particular for drought studies where one will push the limits of GRACE spatial resolution. This signal correlation includes information about, for example, the geographic latitude, the density of the satellite orbits, the time-dependencies of mission periods or North-South-dependencies. When analyzing the Zhao-, Houborg- and Thomas-methods, we find that trends and accelerations in GRACE water storage maps tend to bias not only the DSI, DI and the Thomas-indicator (which use non-climatological TWSC), but also the DSIA and DIA (which use accumulated TWSC). The indicators DSID and DID, which utilize time-differenced TWSC, were not found biased by trends and accelerations; the same goes for the Thomas-method when based on detrended and deseasoned TWSC.
When we simulated smaller trends or accelerations, all indicators were able to detect drought, but they identified different timing, duration, and strength; for example for the SA cluster (trend of 4.98 mm/year, acceleration of -0.38 mm/year²). This suggests removing the trend in GRACE data first, but this must be done with care, since it can also influence the detection of, After these drought periods, strongly wet periods were detected. Regarding future applications, we suggest a direct comparison of the DSID and meteorological indicators, in particular for confirming or rejecting drought duration and the following wet periods.
On the other hand, computing accumulated indicators implies a temporal smoothing causing the drought period to appear lagged in time; however, for accumulation periods of 3 and 6 months the lag was found insignificant. DSIA and DIA are thus more robust against temporal and spatial GRACE noise as compared to DSID and DID, and again we would suggest utilizing 3 or 6 months accumulation periods. In general, we found the Zhao- and Thomas-indicators performed better in detecting the correct drought strength than the Houborg-method, at least for the limited duration of the GRACE time series that we have at the time of writing.
By simulating the effect of spatial noise on drought detection, we found that some indicators appear less robust. Analysis of the percentage of drought affected area showed that the GRACE spatial noise limits correct drought detection. Again, the DSIA was identified to be more robust as compared to DSI and DSID: it was the only indicator that identified exceptional drought in nearly all grid cells. A second experiment was conducted to examine if the influence of the spatial noise can be reduced by using spatial averages. We found that spatially averaging DSI and DI appears less robust against the spatial noise compared to computing the indicator of averaged TWSC. At this point we therefore suggest computing the indicator from spatially averaged TWSC. Since the DI showed stronger difference between both averaging methods than the DSI, we conclude that the DI is generally less robust against spatial noise than the DSI. In our real-data case study, due to these findings, the DSIA6 was and DID) were derived and tested; these are modifications of the above mentioned approaches based on time-accumulated and -differenced GRACE data. We found that indeed most indicators were mainly sensitive to water storage trends and to the GRACE-typical spatial noise. Among these various indicators, we identified the DSIA6 as particularly well-performing, i.e. it is less sensitive to GRACE noise and has a good capability of identifying the correct severity of drought, at least in absence of trends. However, the choice of the indicator should always be made in the context of the application.
We see ample possibilities to extend our framework. Future work should focus on better defining the onset and end of a drought and developing a signature for TWSC drought. One should also consider other observables in the simulation, such as groundwater, which can be derived from GRACE by removing other storage contributions from direct modelling or through data assimilation.
In the GRACE community, efforts are currently being made to 'bridge' the GRACE time series to the beginning of the GRACE-FO data period (e.g. Jäggi et al., 2016; Lück et al., 2018). These gap-filling data will inevitably have much higher noise and spatial correlations that may be very different from GRACE data, and drought detection capability should be investigated through simulation first. On the other hand, GRACE-FO is supposed to provide more precise measurements, and thus less influence of spatial noise on the drought detection may be expected. The combination of GRACE-FO data and a thorough understanding and 'tuning' of GRACE drought identification methods, possibly through this framework, might then enable us to identify water storage droughts more precisely.
Appendix A: AR model coefficients computations
To extract temporal correlations from the GRACE total water storage changes (TWSC), we apply an autoregressive (AR) model, which is described by
$$X(t) = \sum_{i=1}^{p} \varphi_i \, X(t-i) + \varepsilon(t),$$
where $X(t)$ represents the observed process at time $t$, $p$ is the model order, $\varphi_i$ are the correlation parameters, and $\varepsilon(t)$ is a white noise process (Akaike, 1969). Here, detrended and deseasoned TWSC are used as the observed process $X(t)$, because the remaining residuals contain interannual and subseasonal signal as the drought information, which we want to extract with this approach. The approach is then applied for different model orders. The optimal order of the AR-model is determined by means of information criteria, for example the Akaike information criterion (AIC) and the Bayes information criterion (BIC). Then, by using the optimal order, the AR-model coefficients $\varphi_i$, which represent the temporal correlations, can be computed using a least squares adjustment.
The results for the optimal order of interannual and subseasonal TWSC are shown in Fig. A1. Most of the global land grid cells of detrended and deseasoned TWSC (about 70 %) show an optimal order of 1.
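The order selection and least squares estimation described in this appendix can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the criteria are written in their common Gaussian form, all candidate orders are fitted to the same samples so that the criteria are comparable, and the function names and the simulated AR(1) test series are hypothetical.

```python
import numpy as np

def fit_ar(x, p, p_max=None):
    """Least-squares AR(p) fit; returns coefficients and residual variance."""
    x = np.asarray(x, dtype=float)
    p_max = p if p_max is None else p_max
    n = len(x)
    y = x[p_max:]
    # Column i holds the series lagged by i samples, aligned with y.
    A = np.column_stack([x[p_max - i: n - i] for i in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ phi
    return phi, resid @ resid / len(y)

def select_order(x, p_max=6):
    """Return the orders that minimize AIC and BIC (assumed Gaussian form)."""
    m = len(x) - p_max
    aic, bic = {}, {}
    for p in range(1, p_max + 1):
        _, s2 = fit_ar(x, p, p_max)
        aic[p] = m * np.log(s2) + 2 * p
        bic[p] = m * np.log(s2) + p * np.log(m)
    return min(aic, key=aic.get), min(bic, key=bic.get)

# Hypothetical test series: AR(1) process with coefficient 0.6.
rng = np.random.default_rng(42)
x = np.zeros(2000)
for t in range(1, x.size):
    x[t] = 0.6 * x[t - 1] + rng.normal()

best_aic, best_bic = select_order(x)
phi, sigma2 = fit_ar(x, best_bic)
print(best_aic, best_bic, phi)
```

In practice, `x` would be the detrended and deseasoned TWSC series of one grid cell, and the procedure would be repeated for every cell.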

Appendix B: EM-Clustering
Expectation maximization (EM) represents a popular iterative algorithm that is widely used for clustering data. EM partitions data into clusters of different sizes and aims at finding the maximum likelihood estimate of the parameters of a predefined probability distribution (Dempster et al., 1977). In case of a Gaussian distribution, the EM-algorithm optimizes the Gaussian mixture parameters, which are the Gaussian mean $\mu_k$, covariance $\Sigma_k$, and mixing coefficients $\pi_k$ (Szeliski, 2010). The algorithm iteratively applies two consecutive steps to optimize the parameters: the expectation step (E-step) and the maximization step (M-step). Within the E-step we estimate the likelihood that a data point $x_t$ is generated from the $k$-th Gaussian mixture by
E-step:
$$z_{tk} = \frac{1}{Z_t} \, \pi_k \, \mathcal{N}(x_t \mid \mu_k, \Sigma_k), \qquad \sum_k z_{tk} = 1.$$
The M-step then re-estimates the parameters for each Gaussian mixture:
M-step:
$$\mu_k = \frac{1}{N_k} \sum_t z_{tk} \, x_t, \qquad \Sigma_k = \frac{1}{N_k} \sum_t z_{tk} \, (x_t - \mu_k)(x_t - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N},$$
by using the number of points assigned to each cluster via
$$N_k = \sum_t z_{tk}.$$
Using the optimized parameters, EM assigns each data point to a cluster. The final globally distributed clusters of the AR-parameters (Fig. 3) are shown in Fig. B1. These clusters were derived by modifying and applying an EM-algorithm provided by Chen (2018).
Appendix C: Eigenvalue decomposition
The decomposition of the variance-covariance matrix $\Sigma$ by using Cholesky decomposition fails when $\Sigma$ is only positive semi-definite. To still be able to decompose the matrix, we can use eigenvalue decomposition, but this is accompanied by a loss of information due to the rank deficiency. The decomposition is then given by $\Sigma = U D U^T$, where $U$ is a matrix with the eigenvectors of $\Sigma$ in its columns and $D$ is a diagonal matrix of the eigenvalues. In this case, a decomposed matrix can be related to $R^T$ introduced in Sec. 3.1. $R^T$ can be computed by $U \sqrt{D}$. In Sec. 3.1, we multiply $R^T$ with a normal distributed
Figure B1. Clusters based on EM-clustering applied to the global AR-model coefficients.
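The eigenvalue-based factorization of Appendix C can be sketched in a few lines. The example assumes a small, artificially rank-deficient covariance matrix and uses the factor to turn uncorrelated standard normal samples into correlated noise; the matrix, the sample size, and the function name `factor_psd` are hypothetical.

```python
import numpy as np

def factor_psd(sigma):
    """Return R^T = U sqrt(D), so that R^T R equals sigma."""
    eigval, eigvec = np.linalg.eigh(sigma)   # sigma is assumed symmetric
    eigval = np.clip(eigval, 0.0, None)      # remove tiny negative round-off
    return eigvec * np.sqrt(eigval)          # scales column j by sqrt(d_j)

# Rank-deficient 3 x 3 covariance matrix built from a 3 x 2 factor.
b = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sigma = b @ b.T

rt = factor_psd(sigma)

# Correlated noise: multiply R^T with uncorrelated standard normal samples.
rng = np.random.default_rng(7)
white = rng.standard_normal((3, 20000))
noise = rt @ white

print(np.round(np.cov(noise), 2))
```

The sample covariance of `noise` approaches `sigma` as the number of samples grows, although one eigenvalue of `sigma` is zero and a Cholesky factor is therefore not guaranteed to exist.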
Author contributions. HG, OE, and JK designed all computations and HG carried them out. HG prepared the manuscript with contributions from OE and JK.
Competing interests. The authors declare that they have no conflict of interest.