Comment on hess-2021-64 Anonymous Referee # 2 Referee comment on " Spatiotemporal development of the 2018 – 2019 groundwater drought in the Netherlands : a data-based approach

The topic of this work is of high interest to the research community, as it discusses important points on data reliability, methods to overcome these hindrances, and presents a timely and interesting overview of recent drought developments in the Netherlands. The manuscript follows a clear structure, and I commend the authors to its good overall structure and readability. Some improvements to the method description are required in regards to choosing thresholds and cut-off values, and adjustment of figures to allow for better comparison between measurement based SGIs and model-based SGIs (see specific comments below), but overall, I recommend this manuscript for publication after revisions.


Introduction
In the summer of 2018, a severe drought hit large parts of northern and western Europe. Extremely low precipitation coincided with high temperatures, both breaking multiple-decade records in many places (Bakke et al., 2020;Philip et al., 2020;Toreti et al., 2019). Drying soils and declining water reserves caused damage to agricultural production and natural ecosystems, problems with drinking water and energy production, and widespread forest fires, among other impacts (Bakke et al., 2020;30 https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. Bastos et al., 2020;Buras et al., 2020;Philip et al., 2020). The kind of 'hot drought' that occurred in 2018 is expected to become more frequent in the future (Philip et al., 2020;Toreti et al., 2019).
The Netherlands was one of the countries most hit by the weather extremes (Bakke et al., 2020). The damage was felt mainly in the southern and eastern parts of the country ( Van de Velde et al., 2019;Van den Eertwegh et al., 2019;Witte et al., 2020b).
In a country traditionally more focused on getting rid of water surpluses, the drought of 2018-2019 was felt by many water 35 managers as a wake-up call, sparking a widespread search for solutions to prepare water systems for increasingly frequent drought extremes (De Lenne and Worm, 2020;IenW, 2019;Witte et al., 2020a;Van de Velde et al., 2019). In the southeastern Netherlands, as in many other parts of the world, groundwater is a crucial water source. Accordingly, much of the damage in 2018 was directly related to deep drawdowns in groundwater levels (LCW, 2020). Among other effects, the groundwater shortages caused severe damage in peatland and brook ecosystems (Witte et al., 2020b) and concerns over the sustainability 40 of increased irrigation and drinking water abstractions (Van de Velde et al., 2019;Van den Eertwegh et al., 2019). Groundwater is often the most persistent water store in the landscape, reacting latest as a meteorological drought propagates into the hydrological system (Van Loon, 2015). This makes proper management of the groundwater a crucial component of drought management strategies.
Previous studies have shown that the response of groundwater to meteorological drought can vary strongly in space. Variations 45 in groundwater response are caused by differences in geology, water management and other catchment characteristics (Bloomfield et al., 2015;Hellwig et al., 2020;Peters et al., 2006;Van Loon and Laaha, 2015). To be able to mitigate and prevent drought damage, it is therefore essential to understand how groundwater drought develops in both time and space. In recent years, water managers in the Netherlands have indeed expressed a need for more up-to-date, locally-specific drought information and predictions to be able to take appropriate measures (IenW, 2019;Pezij et al., 2019;Witte et al., 2020a). Not 50 only the period of meteorological drought itself, but also the recovery of the system after meteorological drought is important to understand.
Multiple recent research efforts have aimed at better understanding the variations in groundwater drought and its impacts at national and European scales (Bakke et al., 2020;Bloomfield et al., 2018;Hellwig et al., 2020;Margariti et al., 2019;Van Loon et al., 2017). The 2018 drought in Europe has so far been mainly studied from a meteorological perspective (Bakke et 55 al., 2020;Philip et al., 2020;Toreti et al., 2019). A large-scale analysis of the hydrological drought has recently been made for Scandinavia (Bakke et al., 2020); while for the Netherlands, some assessments of the drought in the groundwater have been done based on small numbers of measurement sites and physically-based modelling studies .
What is still lacking, is a more detailed image of how the 2018-2019 drought manifested itself in the groundwater and how this varied in space, based on measurement data. This could provide valuable insights into groundwater drought dynamics and 60 mitigation options in the Netherlands and elsewhere.
Groundwater heads are widely monitored in measurement wells. However, analysis of groundwater drought from these data over large areas is often challenged by data quality. Firstly, data usually have to be obtained from multiple organisations and contain errors and other perturbations. Secondly, the length of measured time series is often not sufficient for drought analysis, for which at least 30-year series are recommended (Link et al., 2020). As a result, many data-based groundwater drought 65 studies have focused on relatively few measurement wells with near-natural, long series (Bakke et al., 2020;Van Loon et al., 2017); this may give an incomplete image of the true variability in drought dynamics. In addition, available data usually lags behind the present, hindering the up-to-date drought assessments that water managers need.
To deal with these challenges, several studies have developed methods for automated validation and lengthening of groundwater head time series Peterson et al., 2018;von Asmuth et al., 2012). This is usually 70 done with statistical models, ranging from regression analysis to artificial intelligence methods (Peterson et al., 2018;Van Loon et al., 2017). One type of statistical modelling that has proven very useful for groundwater data is time series modelling with impulse-response functions von Asmuth et al., 2002). These models describe groundwater head variations at a specific location as a function of driving variables, usually weather data, and a fitted impulse-response function. This type of impulse-response time series modelling (TSM) allows accurate simulations to be made without a need 75 for information on site characteristics. The simulations can be used to identify errors and other atypical behaviour in the data, and to lengthen and harmonise time series, as shown by e.g. Zaadnoordijk et al. (2019), Bartholomeus et al. (2008) and Marchant and Bloomfield (2018). Marchant and Bloomfield were the first to develop a full time series model-based method to study groundwater drought over a large region in the UK. Although TSM-based analyses appear a valuable tool for groundwater drought assessment, their wider applicability for various cases has not yet been well explored. To be able to 80 widely use TSM data preparation for drought studies, three main questions need to be answered.
Firstly, it is not yet clear what methods are optimal for data validation. Time series modelling can be used to identify time series influenced by disturbances after analysis (e.g. , but also to remove irregularities from the data beforehand. In addition, TSM-based data cleaning can be combined with other more basic consistency checks, improving its effectiveness (Peterson et al., 2018). This validation method has not yet been evaluated specifically for the case of 85 groundwater drought analysis.
Secondly, the reliability of time series simulations for groundwater drought analysis has not been properly tested. To provide lengthened, gap-filled groundwater head series that are suitable for analysis, researchers have often used TSM simulations directly as replacement of the data. These simulations often have a very good fit to the data Zaadnoordijk et al., 2019); however, this approach inevitably also strips out part of the external influences that are not explicitly 90 included in the model drivers (Peterson et al., 2018;Zaadnoordijk et al., 2019). Drought occurrence and development can be strongly affected by human impacts, as well as by local-scale natural influences such as surface water influence (Margariti et al., 2019;Van den Eertwegh et al., 2019). As such, excluding such external drivers of groundwater levels may provide an incomplete image of drought dynamics (Van Loon et al., 2016). It is therefore important to understand how the use of TSM simulations rather than measurement series affects the assessment of drought behaviour. 95 Finally, the usefulness of TSM for nowcasting and predicting groundwater drought needs to be better understood. Recently, there has been a surging interest in large-scale drought forecasting, which has shown that physically-based groundwater models can predict with some skill up to two months ahead (Honingh et al., 2020;Mackay et al., 2015;Sutanto et al., 2020;Wanders https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. et al., 2019). Time series modelling could provide a very easy alternative to nowcast and predict groundwater drought requiring nothing more than weather data Marchant and Bloomfield, 2018). However, prediction skill may 100 depend on the location and the delay of the groundwater system (Mackay et al., 2015;Marchant and Bloomfield, 2018). Also, models may have difficulties to correctly represent groundwater behaviour during abnormal drought conditions (Hellwig et al., 2020). It is therefore important to better understand how accurate time series models actually are in nowcasting and forecasting groundwater drought.
Given these knowledge gaps, the current study aims to evaluate the usefulness of a time series modelling-based data preparation 105 method for regional analysis of groundwater drought. The method will be applied to the southeastern Netherlands to characterise the development in time and space of the 2018-2019 groundwater drought and its recovery. The method will be evaluated on the usefulness for series validation and simulation, the effect of simulations on drought assessment, and the potential for drought prediction.
The applied method for data validation and drought assessment, as well as the tests used for its evaluation, are set out in section 110 2 and 3. The performance of the method and the resulting drought analysis are presented in section 4. In section 5, the resulting insights into the use of time series modelling-based data methods for groundwater drought studies are discussed, followed by the spatiotemporal drought development of 2018-2019. Final conclusions are given in section 6.

Study area and data
The study area covers six provinces in the southeastern part of the Netherlands (Fig. 1). This area coincides with that part of 115 the Netherlands that lies above sea level and is dominated by Pleistocene-era deposits. Most of the study area has sandy sediments at the surface, but part is also covered by river clays and loess deposits (Fig. 1b). Elevation is mostly between 0 and 30 m AMSL (Fig. 1a). Higher elevations occur in the limestone-loess hill landscape of southern Limburg and on glacierpushed ridges in Utrecht and Gelderland (areas indicated in Fig. 1a). Land use is dominated by agriculture, while the glacial ridges are covered mainly with forest. The area has a temperate climate with a yearly precipitation surplus. In addition, the 120 groundwater system is affected by abstractions for drinking water and irrigation, as well as by drainage systems in the lowestlying parts of the study area.

Input data
Groundwater head data were supplied by several regional water managing bodies. For some areas, additional series were obtained from the Dutch national groundwater database DINO (TNO). The data consist of groundwater head time series with a twice-monthly to sub-daily frequency, mostly running until spring 2019. In addition, metadata of the monitoring wells were available, including location, filter depth and surface level. The different data sets were checked for overlapping wells by well 130 names and coordinates, and overlapping series were combined. Only data from the first filter of boreholes was used, generally representing the phreatic level. Data were cut off at 1 January 1990 and those series were selected that contained > 10 years of data and ended after 31 August 2018. This resulted in 2722 series for further analysis.
Daily precipitation and reference evapotranspiration (ETref) were obtained from the Royal Netherlands Meteorological Institute for January 1990 to May 2020 (KNMI, 2020). Data were used from 15 general weather stations (for ETref) and 114 precipitation 135 stations over the study area. Reference evapotranspiration is determined by KNMI following Makkink (1957). The ETref series did not always cover the full period of interest; gaps were filled with the nearest station that did have full data.

Methods
The method for data preparation and drought analysis consists of three components ( last two steps were also performed for a prediction of the groundwater heads. Finally, each step was evaluated by one or more tests.

Series validation 145
Data validation in general is a step to check initial data for errors and other deviating behaviour. What validation method is suitable depends on the type of deviations in the data and which of them need to be removed for the intended analysis.
Phreatic groundwater levels typically follow a yearly cycle, overlain by faster fluctuations in response to rainfall and evapotranspiration. An actual series of measured groundwater heads will often show deviations from this expected pattern.
Deviations may be of short duration, visible as loose outliers; or they may be structural, shown as level shifts, trends and other 150 abnormal patterns in the data series. Both the short-term and structural deviations can result from errors, such as instrument drift and typing errors. However, deviations can also reflect real groundwater behaviour caused by local natural or human influences, such as rainstorms, flooding or abstraction (Post and von Asmuth, 2013;Zaadnoordijk et al., 2019;Margariti et al., 2019). For one subregion of the study area (Overijssel province, see Fig. 1) log book notes from the data collectors were available; inspection of these showed that the data were influenced by short-term extractions, changes in water management, 155 sensor problems, relocation of wells, well maintenance and other issues. Also Zaadnoordijk et al. (2019) found that probably a majority of series in Dutch groundwater data sets contain such disturbances.
This kind of groundwater head data clearly require a validation step to become suitable for drought analysis. This step needs to remove the errors as much as possible; also real but short-term natural and human disturbances in the groundwater heads are less relevant for a large-scale drought analysis. However, the long-term deviations in the groundwater level, such as caused 160 by long-term abstraction and land use effects, are important for drought studies and should ideally be retained.
Whether atypical behaviour in a series is caused by errors or by real external influences is very difficult to distinguish by automated methods. Our approach is therefore to filter out all short-term deviations and then classify the series according to their long-term behaviour. The short-term deviations will be denoted as outliers: one or a few measurements lying far from the expected level. The use of categories for long-term behaviour allows for retaining some of the potentially influenced series in 165 https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. the analysis, while acknowledging their lower reliability. The validation itself was done by a combination of time series modelling and simple tests, as explained below.
Time series modelling was done with the PASTAS function package for Python. This package has been developed by Collenteur et al. (2019) to allow impulse-response time series models for groundwater heads to be easily applied and combined into larger workflows. Here the most basic settings of PASTAS were used (see Collenteur et al., 2019). Groundwater heads 170 were modelled as a function of measured precipitation and ETref series, and a gamma-shaped recharge response function.
Recharge itself was modelled as a linear combination of precipitation and ETref. PASTAS also enables nonlinear recharge modelling; we tested the nonlinear model for a subset of series in Brabant, but model fit was improved in only a small minority of series, and so linear recharge modelling was applied for the rest of the study.
Data cleaning was done on the original, raw series in several steps: 175 1. Removing measurements below the well filter or > 20 cm above the surface level; 2. Removing far outliers. The upper and lower 20 % of the range of a series were identified as potential outliers; if leaving out these measurements reduced the range by more than half, they were classified as outliers and removed.
3. Outlier removal with TSM. A model was fit for each series using precipitation and ETref from the nearest weather stations and the basic model settings of PASTAS. Fitting was done at a weekly time step to prevent bias towards 180 recent years with more frequent measurements. The simulation interval was calculated as the standard deviation of the residuals, and all measurements outside a range of 4 times this interval around the simulation were removed. This step was repeated a second time to remove smaller outliers.
For the long-term behaviour classification, a new time series model was fitted on the resulting cleaned series. The explained variance percentage (EVP, equal to • 100) and model parameters A, f and d were saved. Parameter A represents the long-185 term response of the groundwater level to a constant recharge input of one unit, in this case 1 mm; f is the influence of ETref relative to precipitation; and d is the groundwater base level. In addition, a linear trend was fitted through the residual series and the p-value and r 2 of the trend were saved. Finally, the series were checked for periods with missing data of > 4 years and, if these were present, only the time period after the last such data gap was used. This check had to be done after data cleaning to ensure all data gaps are recognised. Based on these indicators the series were ordered into four categories: 190 1. Discarded series: EVP < 60 %, less than 10 years of data, data gaps of more than 4 years, or no data over June-August 2018; 2. Deep-groundwater series: mean water table depth (WTD) > 5 m. These series typically showed a very slow, smoothed behaviour and often poor model fit. They were saved as a separate category to reflect their distinct behaviour and potentially less reliable results; 195 3. Atypical series: EVP >= 60 %, but containing a trend or atypical parameters. Locations were marked as atypical if they had a trend with p<0.05 and r 2 >0.15; an f parameter > -0.05 or < -1.95, very close to the parameter bounds; or an A parameter > 1.5, far from its normal range for the used dataset. These boundaries were chosen by trial and error https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License.
testing on a subset of the series, where they were found to properly separate those series with visually somewhat atypical, but not extremely disturbed behaviour. 200 4. Normal series: EVP >= 60 %, no other issues.
The cleaned measurement series of the normal, atypical and deep categories were aggregated to a minimum daily time step and saved for further use.
The described method for data cleaning and categorising was tested on a representative subset of the data. Around 120 series were randomly taken spread over the different provinces; from these 56 test series were selected: 44 with visual errors or 205 abnormal behaviour and 12 without. The test series were first visually checked for both short-term outliers and longer-term abnormal patterns. Then the automated validation results were compared to this reference. The method was scored for its ability to 1) correctly remove outliers; 2) correctly categorise abnormal longer-term behaviour; and 3) overall, result in a properly treated (or duly discarded) series. The first two aspects were scored as True or False Positive (deviations correctly recognised) and True or False Negative (absence of deviations correctly recognised). The overall performance was scored as 210 Good if outliers and patterns were properly cleaned and categorised; Reasonable if some errors remained or were overremoved, but without significant influence on the series as a whole; and Bad if remaining errors or over-filtering had a significant influence on the character of the series. Groundwater series with mean groundwater depth > 5 m ('Deep') were retained as a separate category as the tested validation method was found unsuitable for these series.

Series simulation 215
The cleaned series from the validation step were used to simulate groundwater heads for the drought analysis. It is also possible to interpolate the original measurement series with TSM . However, it was chosen to use simulated series to ensure regular series without sudden level shifts and prevent influence of remaining outliers. First, a PASTAS model was fit on the cleaned measurement series and the model parameters were saved. Next, for those models with an EVP > 60 %, the fitted model parameters and weather data were taken to simulate daily-step groundwater heads for the 220 period of interest, in this case 1 January 1990 to 31 May 2020. This setup makes it easy to extend the simulations when new weather data comes available, without re-doing the model fitting procedure. As most measurement series originally ran until spring 2019, roughly one year of 'nowcasting' was added to the data.
The performance of the simulations was assessed by the root mean squared error (RMSE) and the mean error (ME) of the simulated groundwater heads over the whole length of the measurement series. The mean error was calculated as 225 mean(simulated-observed) to quantify bias. It is possible that the model performance under dry conditions differs from the overall performance. Therefore the measurements of each series were subdivided into 'low-head periods' (measured head < 20th percentile), 'medium-head periods' (20th to 80th percentile) and 'high-head periods' (> 80th percentile), and the RMSE and ME were re-calculated for these periods. Finally, RMSE and ME were determined specifically over July-November 2018, the period in which the meteorological drought was most severe. All performance measures were expressed both as the absolute 230 https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License.
value in meters, and as fraction of the mean water table depth (WTD) at the location, as the practical consequences of given error in groundwater head are much larger where groundwater levels are normally shallow.

Drought index
To identify drought periods in time series of hydrological variables, and to compare drought severity between locations, standardised drought indices are used. This concept was first developed for precipitation data with the Standardised 235 Precipitation Index (SPI) (McKee et al., 1993). Later, equivalent indices were developed for other hydrological variables. We quantified the development of meteorological drought over 2018-2020 by the three-month-aggregated Standardised Precipitation Evaporation Index (SPEI) (Vicente-Serrano et al., 2010). The study area was divided into four zones (west, north, mid-east, southeast); the SPEI-3 for each zone was calculated from the average precipitation of all weather stations within the zone and the distance-weighted mean reference evapotranspiration of the three stations closest to the midpoint. 240 For groundwater heads the Standardised Groundwater Index (SGI) was developed (Bloomfield and Marchant, 2013). To calculate the index values, the measured groundwater heads at a specific location are transformed to a standard normal distribution. This gives a drought index series varying roughly between -3 and 3, indicating conditions from extremely dry to extremely wet compared to the normal situation at the specific location. When analysing multiple locations, a common reference period must be used, recommended to be at least 30 years. Here, the period January 1990 -December 2019 is used 245 throughout as the reference period.
For precipitation, the transformation step is usually done by fitting a gamma distribution function to the data. For groundwater heads, however, distribution shapes vary widely between locations (Bloomfield and Marchant, 2013;Dawley et al., 2019;Loáiciga, 2015). It is possible to fit individual parametric distribution functions to the groundwater series, but the 30-year monthly series used here are probably insufficient to do this reliably (Link et al., 2020), giving a high risk of biased drought 250 index values (Svensson et al., 2017). Groundwater levels were therefore transformed using a normal scores transform (see e.g. Bloomfield and Marchant, 2013). This is a nonparametric transformation method that has the advantage of being simple and transparent and circumvents the risk of bias due to erroneous distribution fits. For each location, the simulated series was first aggregated to a monthly time step. Transformation was then done separately for each calendar month. For each calendar month with n years of data, in this case 30, cumulative probability values are taken, uniformly spaced over the interval (1/ ) to (1 − 255 1/ ); the corresponding SGI values are found by applying an inverse cumulative distribution function to these values. The resulting SGI values are assigned to the groundwater head measurements of the given calendar month by their rank from low to high. This method of calculating SGIs results in a limited number of 'discrete' SGI values that have a relatively weak relation with occurrence probability, but correspond directly to the rank of the groundwater level compared to the rest of the reference period. Table 1 gives the SGI values and their corresponding rank and drought severity in this study. Note that the 260 SGI, unlike the SPI, is not aggregated over multiple months; the SGI values given here therefore represent 'SGI-1'.  The SGI values for the months outside the reference period (January 2020 -May 2020) were estimated by linearly interpolating 265 the groundwater head-SGI relation for the calendar month. If the heads fell outside the range reached in the reference period, they were assigned the most extreme SGI value.
To test how the use of TSM simulations rather than measurement series affects drought analysis, SGI values were also calculated directly for a selection of locations that had long measurement series. From the cleaned measurement series of the validation step, those were selected that started before 1 January 1993, resulting in 531 series. The SGI values for 2018 were 270 calculated in the same way as for the simulated series, with the SGI of a calendar month calculated only if at least 25 years of data were available. The simulation-based and measurement-based SGI values were compared by regression.

Drought prediction
Finally an estimate was made of the drought prediction performance of the time series models. If weather predictions are available, time series simulations can be easily extended into the future. However, as the predicted period is not included in 275 the calibration, the quality of the predictions is likely to be lower than for the normal 'fully calibrated' simulations. This is especially important for the prediction of extreme drought, which by definition involves system conditions not present in the calibration period. To test the performance of the time series models for predicting extreme drought conditions, models were re-calibrated until the start of the 2018 growing season (1 April 2018); and the groundwater behaviour for the rest of 2018 was 'predicted' with the available weather data. As such, an 'upper limit' of the prediction performance is estimated, representing 280 how well the 2018 drought would have been predicted had perfect weather predictions been available. The prediction error in groundwater levels was quantified by the RMSE and mean error ME over July-November 2018.
Previous studies have found that in slow-reacting groundwater systems and dry situations, initial conditions can control the quality of model predictions many months forward, irrespective of weather forecasts (Honingh et al., 2020;Mackay et al., 2015;Sutanto et al., 2020). The type of time series models used here rely only on transforming of a time series of recharge to 285 deviations in groundwater heads. The predicted heads are therefore insensitive to the initial conditions simulated at the start of the prediction period. For forecasting, it might be advantageous to include the true initial condition into the preditctions. The Where ( ) is the remaining effect of the residual at time after the start of predictions, is the residual at the start of the 290 prediction period, and is the decay parameter of the noise model (eq. 7 in Collenteur et al., 2019). Adding the series of residual effects to the predicted values thus draws these towards the real values, with the effect decreasing over time.
To estimate how the prediction error in groundwater heads would propagate to the drought quantification, SGI values were calculated both for the predicted series and for the same series with the prediction period replaced by the measurement data.
The performance of the SGI predictions was quantified by the mean and maximum absolute error of the prediction-versus the 295 measurement-based SGI values. As such, the difference between the prediction-based and measurement-based SGI values gives an indication of the performance in predicting the occurrence of drought.

Results
Here, we present the outcomes of the drought analysis and the evaluation of the time series modelling-based data preparation method. We first show the performance of the method for validation and simulation of groundwater head series; then we 300 present the analysis of the 2018-2019 groundwater drought. Finally, the performance of the TSM method for drought quantification and prediction is shown.

Performance of TSM data preparation for groundwater head series
The performance of the applied validation method, based on a combination of basic tests and TSM, was tested on a subset of the groundwater series. The method generally identified longer-term irregularities in behaviour well (Table 2, middle columns). 305 More frequently, it incompletely removed outliers or detected outliers that were visually not clearly erroneous (left columns).
Overall, the validation method was successful for a majority of the test series (68 %, right columns). For part of the series (20 %), validation was classified as reasonable, with some errors remaining or slight over-smoothing, but without a substantial effect on the overall series, thus being unlikely to affect drought analysis. For a small fraction of the test series (5 %), validation was unsuccessful and more important irregularities remained. Thus, a small number of potentially erroneous series is likely to 310 remain in the data used for the drought analysis. When applied to the full study dataset, the validation procedure discarded 31% of the groundwater head measurement series, 315 so that 1869 of the original 2722 series remained for analysis. A poor model fit was the most frequent cause for discarding. 10 % of the series were maintained as atypical series, while another 12 % of locations had a deep groundwater table, being less suitable for the used validation method.
In the simulation step, 1632 locations were modelled with sufficient quality (EVP > 60 %). Overall, these series were simulated with an average error of 14 cm, resulting in a 20 % error in the groundwater table depth ( Table 3). The bias is low with -1 mm. 320 Subdivision into dry, normal and wet conditions shows that the errors are larger for more extreme groundwater levels. More precisely, the models tend to underestimate the extremes: there is a positive average bias in periods of low groundwater levels, and a negative bias when groundwater levels are in their high ranges. There was no clear spatial pattern in this bias. Also during the main period of groundwater drought in July-November 2018, the simulation error is above average with 18 cm, but there is only a small negative bias of 1 cm. 325

Development and recovery of the 2018-2019 groundwater drought in the Netherlands
The groundwater drought of 2018-2019 was driven by exceptionally dry weather conditions. The meteorological drought started in spring 2018 and peaked in late summer (Fig. 3). After a relatively normal winter, summer 2019 again showed 330 moderate to severe drought. The winter of 2019-2020 was relatively wet, but exceptionally low rainfall in spring 2020 caused a return to extreme meteorological drought conditions. The meteorological drought was not spatially uniform. Especially in spring 2018 and summer 2019, the western part of the study area experienced less dry conditions than the east (Fig. 3).  2018 growing season started with uniformly normal to high groundwater levels over the study area (Fig. 4). Drought started developing in May and June, with drought onset varying between locations. By July and August, severe to extreme groundwater drought occurred over most of the area. In September, heavy rain in part of the study area slightly alleviated the 340 drought conditions. However, the drought situation worsened again in autumn, reaching its height in October and November when the simulations show almost uniform extreme drought over the study area. By December, a slow recovery is visible. The simulated data show distinct spatial patterns in the development of the groundwater drought (Fig. 5). Especially southern 345 Limburg and the ridges in Utrecht and Gelderland with their distinct geology and topography (see Fig. 1) reacted more slowly than the rest of the study area and did not experience drought conditions yet in 2018. Also the fast recovery of the low-lying western Utrecht area in autumn stands out.

350
The simulations over 2019-2020 show that also drought recovery was strongly variable in space (Fig. 6). In spring 2019, groundwater heads in the west of the study area were again approaching normal levels, but the eastern regions had recovered poorly. By this time extreme groundwater drought had also developed on the high ridges and in southern Limburg. The summer of 2019 again brought severe to extreme groundwater drought, this time clearly concentrated in the east of the study area, corresponding to the differences in meteorological drought. In March 2020, groundwater levels had returned to relatively high 355 levels. However, the exceptionally dry weather in April rapidly resulted in a new severe drought situation by May. To further explore the spatial variations in drought behaviour, the groundwater response time was mapped for all monitoring locations (Fig. 7). The response time, the time after which 50 % of the groundwater head response to a recharge event has 360 occurred, can be obtained directly from the parameters of the fitted time series model for a specific location. The found response times ranged from a few days to up to two years, but were generally relatively short: 94 % of locations had a response time of less than one year. The spatial distribution of the response time corresponds with the topography and the occurrence of glacial sand ridges in the landscape (Fig. 1) and with the propagation speed of meteorological drought to groundwater drought in 2018-2020 (Fig. 5, 6). 365

Usefulness of TSM method for drought quantification and prediction
For a subset of the locations (n = 531), a long groundwater measurement series was available and the SGI values resulting from the simulated series could be compared to those obtained directly from the cleaned measurement series (Fig. 8). The 370 comparison shows that the simulations follow the same general drought behaviour as the measurement series, as the two follow a 1:1 line (Spearman's ρ = 0.8). However, the simulations generally show a smaller spatial variation than the measurements. This is also visible in measurement-based SGI maps (Fig. 9), which show more scatter and local extremes than the simulationbased drought maps. In addition, the simulations for 2018 tend to slightly overestimate drought severity in the low ends of the drought (lower left in Fig. 8). This is contrary to the general tendency of the head simulations towards positive bias during 375 drought periods, shown in Table 3.

Figure 9: SGI for two months in 2018 based directly on cleaned long measurement series. 'No Data' symbols indicate insufficient data to enable drought index quantification.
In addition to the simulations, it was also tested how well the time series models, calibrated until the start of the growing season, would have predicted the 2018 groundwater drought from weather data only (Table 4). The average error in the 385 predicted groundwater heads was 20 cm, giving a relative error in the groundwater depth of 31 %. The (small) negative mean error indicates that the declines in groundwater level over summer, and thus the severity of drought, are slightly overestimated. This pattern is also visible in the simulations themselves. The prediction performance is poorer than that of the normal simulations with all data included in the calibration (Table 3). This means that, as expected, the prediction of groundwater levels under extreme drought conditions has a relatively high unreliability. However, the difference in average error is only 3 390 cm, indicating that the effect is relatively small compared to the error already present in the fully calibrated simulations.
The prediction errors in the groundwater head were estimated to result in an average absolute SGI error of 0.34. With the used methods for SGI calculation and n=30 years, an error of 0.48 is required to cause a shift between "extreme drought" and "severe drought", and an error of 0.26 will cause a shift between severe and moderate drought. This means that most locations that were in extreme drought in summer-autumn 2018 were also predicted as such, despite errors in the heads themselves. 395 However, there is a large variation between locations in the size and direction of the drought index errors. Predictions are good for many locations (error ≈ 0), while some are classified at the complete other side of the drought spectrum than the measurements (error of -2.9; see Table 1).  The initial conditions simulated at the start of the growing season had an average absolute error of 15 cm. Including the true initial groundwater heads on March 31 into the predictions led to small improvements in the prediction on a short term: this reduced the prediction errors after 1 and 7 days on average by 14 cm and 5 cm, respectively. However, after a month or more the effect of the initial residual becomes insignificant, and after 4 months, the inclusion of initial conditions leads to an average increase in error of 0.9 cm. Inclusion of the residual at the start of the growing season therefore did not improve drought 410 predictions in the height of summer.

Usefulness of TSM data preparation for drought analysis
The performed study aimed to evaluate the usefulness of time series modelling-based data processing methods for groundwater drought studies. The application of the TSM-based method to the 2018-2019 drought in the Netherlands provided several new 415 insights into how TSM methods can be used for data validation; how reliable they are for the quantification of extreme drought situations; and their usefulness for drought prediction.

Validation methods for groundwater data
A validation method was applied that combines basic error tests with time series modelling-based identification of irregularities, while separately treating short-term outliers and long-term atypical behaviour. It was found that the data available 420 for this study, and this may be true for many cases, contained large numbers of outliers with various sources. Removing these outliers often led to a good TSM fit that was not possible before this procedure. Therefore, with the kind of data used in this study, pre-analysis outlier cleaning is likely to improve results compared to performing a more limited validation, or identifying impacted series only after simulation. The validation method performed reasonably with regard to outlier removal, but not optimally ( Table 2). The outlier cleaning may be further improved by using more iterations (e.g. Peterson et al., 2018), although 425 care must be taken to avoid over-smoothing.
The TSM-based validation also proved suitable for identifying long-term atypical behaviour in the head measurement series.
However, we could not explicitly separate erroneous and real atypical patterns in this study. Many of the long-term nonweather influences on the groundwater, such as structural abstractions and land use, are likely to be included implicitly in the simulations, as the time series models will model any external influence that correlates with weather or modifies the recharge 430 response. In addition, the use of a classification for long-term behaviour allowed for retaining moderately 'atypical series' rather than filtering out all impacted locations. Still, the validation is likely to retain some series with erroneous patterns and discard some real behaviour.
The separation of errors could be improved by including additional driving factors in the time series modelling, such as surface water fluctuations Van Loon et al., 2016;Zaadnoordijk et al., 2019); however, this requires 435 substantially more data and modelling effort. Another potential approach is to make use of the spatial coherence of groundwater behaviour to separate errors. For example, Lehr and Lischeid (2020) and Marchant and Bloomfield (2018) used the spatial coherence of observed groundwater patterns as an indication of their reliability, and to relate clusters of similar series to some external (human) impact. It would be valuable to explore such extensions of TSM to improve validation methods for groundwater drought analysis. This will allow for a better understanding of the different natural ánd human drivers of drought 440 development.
Although the validation-simulation method generally performed well for the given data, it was less suitable for locations with very deep groundwater tables, dominated by multi-year head fluctuations. Here, model fit was often low, leading to poor outlier https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License.
identification and models being discarded for a large proportion of these locations. This issue was also found by Marchant and Bloomfield (2018) and Zaadnoordijk et al. (2019) for locations with thick unsaturated zones. To enable TSM simulation in 445 such cases, measurement series are needed that are substantially longer than the minimum of 10 years used here.

Reliability of TSM groundwater level simulations during extreme drought
It was found that impulse-response function-based time series models on a general level produced reliable groundwater head simulations. They described most of the groundwater head series very closely, with a low overall bias (Table 3). However, the time series models tended to underestimate extreme conditions somewhat, including dry summer conditions, although the 450 simulations for summer 2018 interestingly had a small negative bias in the lowest ranges. The overall underestimation of extreme conditions was also found by Mackay et al. (2015) with a process-based groundwater model. An important shortcoming of the models is that they assume that the response time to recharge, or the lack of it, is constant for a given location. In reality, the response may be different during (extreme) drought conditions. This may contribute to less reliable TSM estimations during extreme groundwater drought. 455 When considering the resulting drought index values, the time series models also appear to give a correct representation of overall drought dynamics. The comparison of simulation-and measurement-based SGI values (Fig. 8) showed a good correlation (Spearman's ρ = 0.8). The drought estimation is similar or slightly better than found by Van Loon et al. (2017), who simulated groundwater drought in the eastern Netherlands with a lagged SPEI-SGI correlation (Spearman's ρ ≈ 0.75).
Also, as discussed further below, the general patterns shown by the drought maps match well with the experience of water 460 managers.
However, the comparison of simulation-and measurement-based SGI values also showed that the simulation-based drought quantification overestimated drought severity in the lowest ends of the 2018 drought, while also reducing the spatial variability.
In summer and autumn 2018, the models simulated uniform extreme drought over the whole study area, while the measurement-based SGI values showed much more scatter towards less extreme drought ( Fig. 8-9). It appears that the weather 465 conditions were severe enough to trigger a 30-year extreme groundwater drought everywhere, but conditions not accounted for by the models, such as external water supply, caused the drought to be less extreme locally .
In addition, the models may have underestimated the severity of earlier droughts; this would inflate the extremeness of the 2018 drought. Similar bias issues can be seen in Hellwig et al. (2020), who analysed groundwater drought in Germany using a physically-based model. Their simulations overestimated drought severity during a drought in 1973, but not in 2003, 470 confirming that bias can differ between individual drought events. In their case, drought severities and bias therein were strongly influenced by the parameters related to the response time of the groundwater system. Similar issues may have played a role for the time series models used here.
Despite these shortcomings, the use of time series model simulations in drought analysis provides large advantages. In this study, it allowed for including a much larger number of series than if only long and disturbance-free series could have been 475 used. This enabled a detailed image of drought development that would not have been possible by using direct measurement https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. series only. In addition, the model parameters gave additional insight in groundwater behaviour, for example through the response time Collenteur et al., 2019). Therefore, impulse-response time series simulations are considered a valuable tool for groundwater drought analysis in the Netherlands and other situations, although care must be taken for bias and (local) effects of non-weather influences on the groundwater system. 480

Potential of TSM for groundwater drought prediction
The findings also provide support for the usefulness of time series modelling for groundwater drought prediction. We found that, had perfect weather forecasts been available at the start of the growing season in 2018, the type of time series models used in this study would have reasonably predicted the course of the groundwater drought in summer and autumn, with an error of around 20 cm and low bias ( Table 4). The potential for nowcasting is also shown by the lengthened series over 2019-485 2020, which matched well with general observations provided by other studies and reports (LCW, 2020; . It was expected that the prediction of groundwater heads in periods of extreme drought would perform relatively poorly, as system conditions are being simulated that are not included in the calibration period. However, the errors of our predictions were close to those of the normal simulations that did include 2018 in their calibration (20 vs 18 cm, Table 3-4).
The results also suggest that, over the time spans relevant for drought forecasting (months), updating forecasts with 490 measurements may not be of much added value. This supports the idea that time series models, when based on sufficient calibration data, could be a relatively robust means for drought prediction over longer time spans.
Still, the prediction RMSE and mean errors amounted locally to high levels ( Table 4). The drought index values in the main drought period of 2018 were predicted with an estimated error of 0.3 units, which can easily shift a location to a different drought category. The magnitude of these errors is similar to that reported by Marchant and Bloomfield (2018), who found 495 that temporal interpolation of head measurement series with TSM led to an uncertainty in SGI values of roughly 0.1 to 0.2 in periods with good data cover, but much larger in longer periods with little data. In this study, we accounted only for the prediction error in drought severity resulting from model inaccuracy in the prediction period. In reality, the prediction uncertainty would obviously be increased by the uncertainty in the long-term weather forecasts (Honingh et al., 2020;Mackay et al., 2015). 500 Taken together, time series models show potential for groundwater drought nowcasting and forecasting on a general regional scale. However, if water managers are to be provided structurally with up-to-date information on regional groundwater drought, the models have to be calibrated with reliable, recent measurement data. Therefore, improving the accessibility and consistency of recent groundwater data among different organisations and countries remains an important task. The regional-scale analysis of the 2018-2019 groundwater drought in the southeastern Netherlands provided a new, detailed image of the drought development in time and space. The analysis showed that extreme groundwater drought was widespread over the study region in 2018 and 2019, breaking 30-year records almost everywhere (Fig. 4). However, the timing and duration of the drought varied strongly in space. In the western parts of the study area drought was terminated before the end of 2018, 510 while the higher-lying areas reached drought conditions only by 2019 (Fig. 5-6). This image corresponds well with how the drought was generally experienced by water managers. The Dutch Commission on Water allocation (LCW), which regularly surveys the drought status in the Netherlands based on input from water managers, reported widespread exceptional drops in groundwater levels, especially in the higher-lying sandy areas of the country, combined with a relatively fast recovery in the west (LCW, 2020). Also the severe damage observed in groundwater-dependent ecosystems in the southeastern Netherlands 515 (Witte et al., 2020a) and widespread drying of groundwater-fed streams (Van den Eertwegh et al., 2019) match the assessment of an extreme groundwater situation.
The time series models made it possible to extend the available groundwater head measurement series beyond 2018 and obtain an estimate of drought recovery dynamics over 2019-2020. This showed that despite near-normal to high rainfall in the winters of 2018-19 and 2019-20 (Fig. 3) groundwater levels were restored only locally, and severe groundwater drought continued 520 into 2019 or even 2020 over large parts of the study area (Fig. 6). These findings are consistent with general reports of the situation by local water managers (LCW, 2020; Van den Eertwegh et al., 2019). The results confirm that the drought should be viewed as a multi-year event rather than a single-year summer drought. The multi-year character increases the risk of lasting negative effects, especially in natural ecosystems. In addition, it stresses the importance of winter season groundwater management and water retention as determining factors in drought development. Both these issues have already become 525 apparent in the Netherlands after 2018 (De Lenne and Worm, 2020; Witte et al., 2020a;Witte et al., 2020b).

Driving factors of spatial drought distribution
The current study did not aim to fully quantify the driving factors of the spatial variations in drought dynamics. However, the results suggest that the variation in drought severity, timing and duration was governed mainly by the spatial distribution in rainfall and the geological-topographic setting. The influence of spatial variations in weather was visible in late 2018 and 2019. 530 In this period, a gradient in meteorological drought towards the east caused groundwater drought to be concentrated in this area. The effect of the geological-topographic setting is visible for the higher elevated parts of the study area, especially the glacial sand ridges and the limestone-loess hills, which clearly had a late, long-lasting drought response compared to lowerlying areas (Fig. 5-6). This pattern is also visible in the groundwater response times (Fig. 7). The slow drought response in the higher elevated areas is likely explained by a thick unsaturated zone and low drainage density, while the fast recovery of the 535 low-lying western parts may have been aided by thinner unsaturated zones and possibly some surface water influence. These https://doi.org/10.5194/hess-2021-64 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. factors have been found by other studies to influence drought behaviour (Bloomfield et al., 2015;Hellwig et al., 2020;Peters et al., 2006;Van Loon and Laaha, 2015). In addition, the topographic-geological setting in the study area correlates with variations in land use, soil type and aquifer characteristics; these factors may have played an additional role.
The dominant role of landscape position in shaping drought development calls for locally adapted, but also regionally 540 coordinated mitigation strategies. The response times obtained from the time series analysis form a useful first indicator of the landscape characteristics that control the propagation of meteorological drought to groundwater drought. The response times found here are somewhat shorter than the drought response times reported by Van Loon et al. (2017) for part of the eastern Netherlands. However, they are very similar to those found by Zaadnoordijk et al. (2019), who performed time series analysis on groundwater series from the whole Netherlands. As they used different data and different model quality criteria, the 545 similarity is reassuring and points to the stability of the time series models.

Conclusions
The performed study aimed to evaluate the usefulness of time series modelling-based data processing methods for regionalscale groundwater drought monitoring and prediction. A TSM-based method was set up for data validation and drought quantification and applied to the regional 2018-2019 groundwater drought in the southeastern Netherlands to test its usefulness 550 and reliability. It was found that the kind of groundwater data often available for drought studies requires a validation-cleaning step before analysis to enable reliable results. The validation method based on a combination of basic tests and time series modelling proved a suitable method for the given situation. Still, improvements in the validation method are desired, especially in the separation of real and erroneous head disturbances. The simulated groundwater head series were generally found to be reliable; however, it was shown that the use of time series simulations may bias drought estimations and underestimate spatial 555 variability, producing large errors at a local scale. Still, the use of time series model simulations in drought analysis provides large advantages, as it enables a spatially detailed record of drought development that may be impossible with direct measurement series only. Time series models were also found to be a promising tool for regional-scale drought nowcasting and prediction, although the same caution for bias and local errors is needed.
The drought analysis for the southeastern Netherlands provided a complete, detailed image of the development of the 2018-560 2019 groundwater drought in time and space. The findings confirm that the meteorological drought in 2018 caused extreme groundwater drought throughout the southeastern Netherlands, starting in late spring and peaking in October-November of that year. The timing of drought onset and the duration of drought varied strongly in space. Drought development appeared to be governed dominantly by the spatial distribution of rainfall and the geological-topographic setting. In much of the area, the drought continued as a multi-year drought into 2019 and 2020, especially in the eastern and higher elevated regions. 565 Taken together, we conclude that time series modelling forms a useful tool to obtain a quick, detailed and up-to-date image of drought development; however, for a proper understanding of the different driving factors of drought the availability of recent and consistent monitoring data remains crucial.