Using Long Short-Term Memory networks to connect water table depth anomalies to precipitation anomalies over Europe

. Many European countries mainly rely on groundwater for domestic water use. Due to a scarcity of near real-time water table depth ( wtd ) observations, establishing a spatially consistent groundwater monitoring system at the continental scale is a challenge. Hence, it is necessary to develop alternative methods to estimate wtd anomalies ( wtd a ) using other hydrometeorological observations routinely available near real-time. In this work, we explore the potential of Long Short- 10 Term Memory (LSTM) networks to produce monthly wtd a , using monthly precipitation anomalies ( pr a ) as input. LSTM networks are a special category of artificial neural networks, useful in detecting a long-term dependency within sequences, in our case time series, which is expected in the relationship between pr a and wtd a . To set up the methodology, spatio-temporally continuous data were obtained from daily terrestrial simulations (hereafter termed the TSMP-G2A data set) with a spatial resolution of 0.11°, ranging from the year 1996 to 2016. They were separated into a training set (1996-2012), a 15 validation set (2013-2014), and a test set (2015-2016) to establish local networks at selected pixels across Europe. The modeled wtd a maps from LSTM networks agreed well with TSMP-G2A wtd a maps in 2003 and 2015 constituting drought years over Europe. Moreover, we categorized test performances of the networks based on yearly averaged wtd , evapotranspiration ( ET ), soil moisture ( θ ) , snow water equivalent ( S w ), and soil type ( S t ) and dominant plant functional type ( PFT ). Superior test performance was found at the pixels with wtd < 3 m, ET > 200 mm, θ > 0.15 m 3 m -3 and S w < 10 mm, 20 revealing a significant impact of the local factors on the ability of the networks to process information. Furthermore, results of cross-wavelet transform (XWT) showed a change in the temporal pattern between TSMP-G2A pr a and wtd a at some selected pixels, which can be a reason for undesired network behavior. Our results demonstrate that LSTM networks are useful to produce high-quality wtd a based on other hydrometeorological data measured and predicted at large scales, such as pr . This contribution may facilitate the establishment of an effective groundwater monitoring system over Europe relevant to 25 water management.

nonlinear relationships. ANNs are not easily affected by input noise and able to readjust their parameters when new information is included. More importantly, compared to physically-based models, they necessitate little background 65 knowledge, reducing the requirements for human involvement and expertise, and thus, enabling rapid hypothesis testing (Govindaraju, 2000;Shen, 2018;Sun and Scanlon, 2019).
Recurrent neural networks (RNNs) are mainly designed for sequential data analysis. Through loops in their hidden layers, the information generated in the past flows back to neurons as the input of new computing processes (Karim and Rivera, 1992). Due to the ability to store information traveling through, RNNs can more efficiently solve sequential data problems 70 such as groundwater level estimation than feedforward networks and their variants. The latter are commonly used ANNs for groundwater level modeling in previous studies, e.g., Yang et al. (1997), Nayak et al. (2006), Adamowski and Chan (2011), Yoon et al. (2011), Gong et al. (2015), Mohanty et al. (2015), Sun et al. (2016). With RNNs, it is not necessary to estimate the delay time d (d > 0) in the network response in advance and to assign one input variable to several input neurons (namely, the input data at the time steps t, t -1, …, t -d + 1) during modeling like it is with feedforward networks (J. Zhang 75 et al., 2018;Supreetha et al., 2020).
Long Short-Term Memory (LSTM) networks are a special type of RNNs and famous because of their superior performance in exploiting long-term dependencies between sequences, which is expected in the response of wtd to pr.
Although LSTM networks have been employed extensively in other science fields, particularly natural language processing (D. , their application in hydrology is still in its infancy and has only recently received increasing 80 attention (e.g., Kratzert et al., 2018;Shen, 2018;J. Zhang et al., 2018;Le et al., 2019;Sahoo et al., 2019). Thus, limited studies have been conducted to estimate groundwater fluctuations using LSTM networks.
The consistency of the temporal pattern between input and target variables is a prerequisite for the good performance of ANNs, including LSTM networks. Cross-wavelet transform (XWT) is a useful tool to visualize the pattern changes between input and target variables, aiming to extract similarities of two time series in time and frequency. The technique has been 85 applied for time-frequency analysis in many publications, e.g., Adamowski (2008), Prokoph and El Bilali (2008) and Banerjee and Mitra (2014).
In this study, we utilized spatio-temporally continuous pra and wtda from integrated hydrologic simulation results over Europe (hereafter termed TSMP-G2A data set, introduced in Sect. 2.4) in combination with LSTM networks to capture the time-varying and time-lagged relationship between pra and wtda in order to obtain reliable prediction models at the 90 individual pixel level. The impact of local factors on the network behavior was also investigated, and the local factors studied were yearly averaged wtd, ET, soil moisture (θ), snow water equivalent (Sw), and soil type (St) and dominant plant functional type (PFT). In addition, we implemented XWT on both TSMP-G2A pra and wtda series for time-frequency analysis to gain insight into the internal characteristics of the obtained networks.
This paper is organized as follows: in Sect. 2 (Methodology), we first present a conceptual model of groundwater balance 95 to theoretically derive the relationship between pra and wtda and then briefly introduce the architecture of the proposed LSTM networks, continuous and cross-wavelet transform. This is then followed by detailed information of our study area https://doi.org/10.5194/hess-2020-382 Preprint. Discussion started: 24 August 2020 c Author(s) 2020. CC BY 4.0 License. and data set as well as a generic workflow to construct local LSTM networks at selected pixels over Europe. Section 3 (Results and discussion) shows reproduced wtda maps for groundwater drought analysis, discusses the impact of local factors on the network behaviors and investigates the network performances at the local scale, before completing the paper with 100 Sect. 4 (Summary and conclusions).

Methodology
LSTM networks were applied to estimate monthly wtda over the European continent, using monthly pra, as input. We constructed the networks at the individual pixels and analyzed temporal patterns between TSMP-G2A pra and wtda using XWT. In this section, we briefly recall the conceptual model of groundwater balance, introduce the principle of LSTM 105 networks and the application of XWT, and describe the study area and data set and a universal workflow to establish the proposed LSTM networks locally at selected pixels.

Conceptual model of groundwater balance
The complete subsurface water balance can be described by a control volume that contains the vadose zone, and an unconfined aquifer closed at the bottom (Fig. 1). Areas with surface water are not taken into account. Fluxes in and out of 110 the control volume are pr and ET across the land surface and lateral fluxes in the subsurface. These fluxes are balanced by changes in the water stored in the vadose zone and the unconfined aquifer.

115
The complete groundwater balance equation for the conceptual model is given in Eq. (1): Rearranging Eq. (1), will result in Eq.
(2) as follows: The term on the left-hand side and the first term on the right-hand side in Eq.
(2) indicate an explicit relationship between the fluctuation of Sua and pr, providing the theoretical basis of this study. In case of large continental watersheds (i.e., ( ) = 0), the difference between pr and ET is equal to the total variations in and . Note, we explicitly separated 125 the water storage term of the vadose zone from the unconfined aquifer to highlight the transient impact of unsaturated storage on the relationship between ( )/ and (P-ET).

Long Short-Term Memory networks
In this study, we employed LSTM networks having the same architecture of hidden neurons as Gers et al. (2000). As a category of RNNs, LSTM networks have loops in their hidden layers, facilitating hidden neurons to weigh not only new 130 inputs but also earlier outputs internally for predictions. Hence, they are considered to have memory. Compared with standard RNNs, LSTM networks add a constant error carousel (CEC) and three gates that are the input, forget and output gates in their hidden neurons (see Fig. 2), in order to overcome the gradient exploring and vanishing issue. For a detailed description of the functions of these components, the reader is referred to Hochreiter and Schmidhuber (1997), and Gers et al. (2000). Benefiting from the interaction of these components, LSTM networks show great promise in studying long-term 135 relationships between time series. They have the ability to capture dependencies over 1000 time steps, outperforming standard RNNs whose upper boundary of reliable performances is only 10 time steps (Hochreiter and Schmidhuber, 1997;Kratzert et al., 2018).
The procedure for processing inputs in hidden neurons of LSTM networks are as follows (Olah, 2015;Ma et al., 2019): 1) filter the information used for prediction from new inputs based on the result of the input gate; 2) filter the information to 140 remember from the old CEC state according to the output of the forget gate; 3) update the CEC state using the results from the previous two steps; 4) compute outputs of hidden neurons from the new CEC state and the results given by the output gate. Figure 2 illustrates a one-hidden-layer LSTM network containing only one hidden neuron; the pseudocode is presented in Appendix A. Owing to limited data available at each pixel (i.e., a total of 252 time steps), we built small LSTM networks at 145 the local scale, having one input layer, one hidden layer, and one output layer. The network receives monthly pra from the input layer, processes it on the hidden layer, and finally generates monthly wtda from the output layer. Numbers of input and output neurons are determined by how many input and output variables are used in the derivation of the network. In the constructed LSTM networks, only one neuron is located on either the input or output layer, as the number of input or output variables is one. Thus, the complexity of the network only depends on the number of hidden neurons and, therefore, can vary 150 by changing the number of hidden neurons. The architecture of a network plays an important role in its behavior of https://doi.org/10.5194/hess-2020-382 Preprint. Discussion started: 24 August 2020 c Author(s) 2020. CC BY 4.0 License.
processing new data, and it can be a double-edged sword to apply a network with considerable hidden neurons. On the one hand, the larger we allow a network to grow, the better it can learn from a given data set. On the other hand, a complex network easily captures unwanted patterns when it learns too much from the given data set, eliminating its ability to deal with previously unobserved information (Dawson and Wilby, 2001;Müller and Guido, 2017). This phenomenon is termed 155 overfitting. Hence, it is crucial to identify the optimal number of hidden neurons and specify the appropriate structure of the network, which is the focus of hyperparameter tuning described in Sect. 2.5.

Continuous and cross-wavelet transform
Continuous wavelet transform (CWT) is a type of wavelet transform useful for feature extraction (Grinsted et al., 2004).
Given a mother wavelet 0 ( ), being a dimensionless time parameter, the CWT of a time series 0 is formulated as the 165 convolution of 0 and a scaled and translated form of 0 ( ) (Torrence and Compo, 1998): where, the (*) signifies the complex conjugate; is the time step of 0 ; N is the total number of in 0 ; is the wavelet scale; and n is the localized time index along which 0 ( ) is translated. Here, the wavelet power is defined as | ( , )| 2 .
The mother wavelet must be zero-mean and localized in the time and frequency domains (Torrence and Compo, 1998). In 170 this study, we applied the Morlet wavelet as the mother wavelet, defined as: where, 0 is the dimensionless frequency, set as 6 here to acquire a good balance between time and frequency localization (Grinsted et al., 2004).
XWT is a method to locate common high power in the wavelet transforms of two time series. The XWT of two time series 175 0 and 0 can be computed using (Grinsted et al., 2004): where, ( , ) and ( , ) are the CWT of time series 0 and 0 , respectively. The cross-wavelet power is calculated as | ( , )|. However, directly using the cross-wavelet power gives biased results of the XWT analysis, so here we applied | ( , )|⁄ proposed by Veleda et al. (2012) for correction. For detailed descriptions about CWT and XWT, the reader is 180 referred to Torrence and Compo (1998), Grinsted et al. (2004), Prokoph and El Bilali (2008), and Veleda et al. (2012).
In this study, the application of XWT aims to identify common, localized high-power frequency modes of 0 ( ) in input and expected output series and detect dynamics of the modes over time. Using the XWT analysis, we expect to clarify whether a changing pattern exists in the input-output relationship during the study period and if it affects the network behavior. Moreover, by linking the results of the XWT analysis with the network outputs, we explore the impact of the 185 amount and range of the frequency modes on the LSTM network performance in order to obtain insight into internal operations of LSTM networks.

Study area and data set
We constructed the LSTM networks at individual pixels over eight hydrometeorologically different regions within Europe ( Fig. 3), which are known as the PRUDENCE regions (Christensen and Christensen, 2007). Table 1 lists region names and  190 abbreviations, coordinates, and climatologic information. The climatology is represented by regional averages and standard deviations of yearly averaged data derived from the TSMP-G2A data set (Furusho-Percot et al., 2019) from the years 1996 to 2016, except for Sw of which data are only available from the years 2003 to 2010. The TSMP-G2A data set consists of daily averaged simulation results from the Terrestrial Systems Modeling Platform (TSMP) over Europe, using the grid definition from the COordinated Regional Downscaling Experiment (CORDEX) framework with a spatial resolution of 0.11° (12.5 195 km, EUR-11). TSMP is a fully coupled atmosphere-land-surface-subsurface modeling system, giving a physically consistent representation of the terrestrial water and energy cycle from the groundwater via the land surface to the top of the atmosphere (Keune et al., 2016;Furusho-Percot et al., 2019). TSMP has been successfully applied in many studies to simulate the terrestrial hydrological processes, e.g., Shrestha et al. (2014) As shown by the averages in Table 1, pr is heterogeneously distributed over the PRUDENCE regions, with the highest rainfall in AL (1480 mm) and the lowest in EA (778 mm). Most regional average wtd ranges from 2 m to 4 m, other than IB 210 and MD (having a larger average wtd > 6 m). Within this range, AL has a relatively high average wtd (3.95 m) due to its strong relief. Higher ET is naturally observed in more arid regions, e.g., the highest regional average ET (518 mm) is recorded in MD. No significant difference is observed in regional average θ over PRUDENCE regions, and the minimal regional average θ is observed in IB (0.28 m 3 m -3 ) and MD (0.29 m 3 m -3 ). For Sw, large values (> 60 mm) are simulated in SC and AL, while values below 10 mm are recorded in the other regions. 215 We utilized the TSMP-G2A data set to compute pra and wtda (Eqs. (6)- (7)), which are the input and output data of the 220 proposed LSTM networks. The associated average and standard deviation values are based on the training set (i.e., the data within the years 1996 to 2012, described in Section 2.5) to guarantee that no future information leaks into the networks in the training process.
where, is monthly sum pr calculated from the TSMP-G2A data set; is the climatological average of (i.e., 225 averages of in January, February, …, December); is the climatological standard deviation of .
where, is monthly average wtd derived from the TSMP-G2A data set; is the climatological average of ; is the climatological standard deviation of .
To identify the effect of local factors on the network behaviors, we categorized the network performances based on yearly 230 averaged wtd, ET, θ, Sw, and St and dominant PFT. The data of θ were calculated based on the information at a depth from 0 to 5 cm below the land surface. It is important to note that the data used in this study cover the years 1996 to 2016 (except for Sw data only available from 2003 to 2010), to ensure that spinup effects do not impact the analyses (Furusho-Percot et al., 2019).

Experiment design 235
LSTM networks are employed here to detect connections between pra and wtda from the pan-European simulation results and utilize pra as input to predict wtda. At each time step, one new input enters a network, together with information stored in the network's memory (i.e., useful messages from inputs in the past), to generate outputs. Therefore, LSTM networks have the ability to handle the lagged response of wtd to pr.
Monthly anomaly time series at individual pixels were divided into three parts for network training (01/1996-12/2012), 240 validation (01/2013-12/2014), and testing (01/2015-12/2016) containing about 80%, 10%, and 10% of the total data, respectively. In training, the network is fitted to a given training set by adjusting its weights and biases. The technique of adjusting network parameters is called an optimizer that minimizes a cost function at a certain learning rate (Govindaraju, 2000). This study utilized a supervised training algorithm with a supplementary teacher signal (i.e., TSMP-G2A monthly wtda) to guide the training process, which is widely adopted in Hydroscience in case of e.g., stream stage modeling (Sung et al., 2017), stream discharge modeling (Zhang et al., 2015) and groundwater level modeling (Adamowski and Chan, 2011).
One common challenge in the training process is overfitting. Validation is a process to address overfitting by comparing the network output with the teacher signal to obtain a validation loss (Govindaraju, 2000;Liong et al., 2000). Provided that the network has gained sufficient knowledge from the training set, training ceases when the number of epochs (i.e., an iteration when the whole training set travels through the network forward and backward once) is ≥ 50 and the validation loss starts 250 increasing. The strategy to stop training based on validation losses is termed early stopping.
Moreover, the validation losses were applied to tune hyperparameters of the LSTM networks whose architecture has been introduced in Sect. 2.2. To simplify the procedure of hyperparameter tuning, we only focused on the optimization of the number of hidden neurons in this study and set other hyperparameters constant ( Table 2). The networks with hidden neurons from 1 to 100 were trained at individual pixels, and the best three of them were selected for testing based on the validation 255 losses.

Hyperparameter
Value or method Finally, during testing, the optimally trained networks were provided with a previously unknown data set, originating from 260 the same source as the training set. The difference between generated and target values during testing is called the generalization error, representing the ability of a network to perform on previously unobserved data. The average of the three optimal network results was utilized for evaluation in order to moderately eliminate individual deficiencies of the selected networks, thereby improving the quality of the final results (Goodfellow et al., 2017;Brownlee, 2018). The metrics to assess network performance in this study are the root mean square error (RMSE), the coefficient of determination (R 2 ) and the bias 265 from the Pearson product-moment correlation coefficient R (α) as shown in Eqs. (8)-(10), respectively. α indicates systematic additive and multiplicative biases in the generated values, having a value between 0 and 1, where α = 1 means no bias (Duveiller et al., 2016).
where, , ̅̅̅̅̅, are the expected value, the average of the expected values, and the standard deviation of the expected values, respectively; , ̅̅̅̅̅̅̅, are the generated value, the average of the generated values, and the standard deviation of the generated values, respectively; is the number of time steps in the given time series.
Repeating the above network training, validation, and testing processes (right panel of Fig. 4), we constructed the 275 proposed LSTM networks locally at ≤ 200 pixels randomly selected in each group in order to save computing time. As described in Sect. 2.4, climatologic differences occur not only between different PRUDENCE regions but also at certain pixels in the same region, which potentially explains varying network performances at individual pixels. To analyze the network reaction to local factors, we categorized the pixels into various groups based on yearly averaged wtd, ET, θ, Sw, and St and dominant PFT (Table 3), and the analysis result will be presented in Sect. 3.2. Figure 4 gives a generic workflow of 280 this study to establish the LSTM networks at the local scale and analyze their output.      (Fig.  295   5a). Apparently, a groundwater drought (i.e., wtda ≥ 1.5) covered large parts of Europe, which is in good agreement with previous studies (Andersen et al., 2005;Van Loon et al., 2017). In the simulation, over central Germany, central Britain, southeastern France, the west Iberian Peninsula, and several parts in Eastern Europe, groundwater storage increased, illustrating the strong spatial heterogeneity of the anomalies, which is expected. In contrast, the year 2015 is part of the testing period, leading to a reduced agreement between the LSTM and TSMP-G2A wtda (Fig. 5b). Especially extremes in 300 wet and dry anomalies were underestimated suggesting that the training set contains too little information on extreme events and, thus, is too short. Yet overall, visual inspection of Fig. 5b shows that the LSTM anomalies agree well with the expected values spatially, lending confidence in the trained networks to predict wtda from pra information. Additional European wtda

Impact of local factors on the network performance
In each PRUDENCE region, we computed averages, and standard deviations of the test R 2 scores and RMSEs for the 310 different categories (Table 3) of yearly averaged wtd, ET, θ, Sw, and St and dominant PFT (Fig. 6), to study dependents of the network test performances on different local factors. Note that negative values were set to zero in this calculation.
https://doi.org/10.5194/hess-2020-382 Preprint. Discussion started: 24 August 2020 c Author(s) 2020. CC BY 4.0 License. There was no significant influence of St and dominant PFT on the scores. In general, the performance decreased with increasing yearly averaged wtd, which was manifested by decreasing average R 2 scores and growing average RMSEs (Fig.   6a). This type of network behavior can be attributed to a stronger connection of groundwater to pr in shallow aquifers, which is intuitive. In contrast to the impact of yearly averaged wtd on the test performance, the performance was positively 320 correlated to yearly averaged ET and θ. With increasing yearly averaged ET (Fig. 6b) or θ (Fig. 6c), there was an increase of https://doi.org/10.5194/hess-2020-382 Preprint. Discussion started: 24 August 2020 c Author(s) 2020. CC BY 4.0 License. average R 2 scores and a decrease of average RMSEs. We can explain this phenomenon by the overlap between low-wtd and high-ET (or high-θ) areas over Europe. We also discovered that yearly averaged Sw played an important role in the network test performance. In most PRUDENCE regions, the performance decreased in the case of Sw, leading to smaller average R 2 scores and larger average RMSEs presented in Fig. 6d. Snow accumulation resulted in complex feedback with groundwater 325 processes that cannot be captured well by the networks without including additional input information.
As mentioned in Sect. 2.4, we only used the training set to calculate the climatological average and standard deviation in order to prevent the networks from incorporating future information in the training process. However, some extreme values in the validation and test sets may exceed the range of the training set resulting in decreased validation and test performances, suggesting that a varying pattern may exist between pra and wtda over the study period (see Sect. 3.3). 330 Due to differences in the hydrometeorological characteristics of the PRUDENCE regions (see Table 1), the strength of the relationship between pra and wtda differed regionally (Fig. 6), leading to various regional test performances of the proposed LSTM networks shown in Fig. 7. Table 4 provides percentages of the selected pixels with test R 2 ≥ 50% in each region, where FR exhibits the overall best network performance. Reduced performance in other regions appeared to be mainly related to high Sw in SC and AL and generally large wtd in IB and MD. 335  We extended the scope of the analyses to the entire study period, and found that the performance of individual networks generally followed three combinations with respect to training and test scores that are: • C1: training R 2 score ≥ 50%, test R 2 score ≥50%; • C2: training R 2 score ≥ 50%, test R 2 score ≤ 0%; • C3: training R 2 score ≤ 0%, test R 2 score ≤ 0%. 345 The data distribution in the training and test sets was expected to be analogous, and if the networks did not encounter overfitting during training, their test performance increased by the improvement of the training performance, and vice versa (C1 and C3). C1 is the network behavior with satisfactory training and test scores. In contrast, in the case of C3, training and testing scores were unsatisfactory. An exception is C2, in which the networks that performed well on the training set failed to perform during testing. Significantly reduced test performance in C2 can be attributed to the hypothesis that the pattern 350 between pra and wtda varied over the study period.
The distribution of the performance combinations was influenced by local factors. Figure 8 illustrates percentages of the pixels where the network performance followed C1 to C3 categorized by region, yearly averaged wtd, ET, θ and Sw, respectively. Each combination was required to have ≥ 50 study pixels. C1 was mainly found in areas with shallow wtd, high ET, high θ and little Sw, and was also the most common network performance in FR. In contrast, C2 mostly appeared in areas 355 with large wtd, small ET and heavy Sw. This was a typical network performance at pixels with negative ET in SC, where processes such as freezing and sublimation were more pronounced than others due to low temperature. In addition, C3 appeared in a few pixels with wtd ≥ 10 m.

Cross-wavelet transform (XWT) analysis
In the previous section, we posed the hypothesis that the temporal pattern between pra and wtda during training, validation, and testing was different at a number of pixels over the European continent. XWT was employed here for hypothesis testing at individual, representative pixels (Pixels 1-3, Table 5). XWT extracted the time localized coherence of the variability in the 365 pra and wtda time series derived from the TSMP-G2A data set (i.e., TSMP-G2A pra and wtda) at these pixels. The α values (Eq. (10)) of Pixel 1 were generally suggesting that smaller biases existed in the results of the LSTM networks. In addition, we found very different α values for Pixel 2 with small biases in the training and large biases in the validation and testing.
The α values of Pixel 3 were generally small, indicating that the biases were large in the case of C3.   Figure 9 shows the results of the XWT analyses of the selected pixels in combination with the corresponding TSMP-G2A pr and wtd time series. Inspecting the results of the XWT analyses (bottom panel of Fig. 9), the concentration period of 380 power was inconsistent in the area without edge effects (i.e., the area within the black dashed line) at Pixel 2 from the time period 1996 to 2016, indicating a time-varying pattern between pra and wtda at the pixel, thus supporting our hypothesis. It also explores the high sensitivity of LSTM networks to outliers, which is a major defect of such data-driven models, so physically-based models cannot be completely replaced by data-driven models in this sense.
The power in the XWT results at Pixel 1 and Pixel 3 (Figs. 9a and 9c) was both nearly consistently located in a certain 385 period, indicating an analogous pattern between pra and wtda throughout the whole study period, which is the prerequisite of good network performances. However, the networks behaved differently at the two pixels. By linking the XWT results to the associated network performances, we found that the networks tended to perform well when most of the power in the XWT results was consistently concentrated in the period from 2 to 16 months during the study period (see Fig. 9a). Supplementary plots in Appendix C showed similar phenomena as above. Therefore, we speculate that LSTM networks might be frequency-390 aware and work well on high-frequency components.

Summary and conclusions
In this study, we proposed LSTM networks as an indirect method to model monthly wtda over the European continent, using monthly pra as input. Local LSTM networks were constructed at individual pixels randomly selected over Europe to capture the time-varying, and time-lagged relationship between pra and wtda from integrated hydrologic simulation (TSMP-G2A) 400 results covering 1996 to 2018 episode. The monthly anomaly series derived from the TSMP-G2A data set were divided into three sections at each pixel for network training, validation, and testing. Using the output of the LSTM networks, we successfully reproduced TSMP-G2A wtda maps over Europe for drought months in both the training and testing period (e.g., August 2003 andAugust 2015). The good agreement between the TSMP-G2A and LSTM wtda maps demonstrated the ability of the trained networks to model wtda from pra data. The results highlighted the impact of local factors on the network 405 test performance, manifested by R 2 scores and RMSEs. Most of the networks attained high test R 2 scores at the pixels with wtd < 3 m, ET > 200 mm, θ > 0.15 m 3 m -3 and Sw < 10 mm, where a stronger connection existed between pra and wtda. Also, the various hydrometeorological characteristics in each PRUDENCE region resulted in regional differences in the test https://doi.org/10.5194/hess-2020-382 Preprint. Discussion started: 24 August 2020 c Author(s) 2020. CC BY 4.0 License.
performance of the proposed networks, with FR showing the overall best network performance. In some regions, test performance deteriorated due to changing temporal patterns in the pra-wtda relationship, approved by XWT analyses. 410 According to the results of the XWT analyses, we gave a hypothesis that LSTM networks have frequency awareness and tend to perform well on high-frequency components.
We also recognized that the limited amount of data in the training introduces uncertainties in the network performances.
Any potential extension of training data may lead to a significant improvement in the quality of the derived networks. In addition, hyperparameters of the proposed LSTM networks may be further tuned at the individual pixel level to improve 415 network performance. The results suggest that LSTM networks are useful to estimate wtda time series based on pra, which are routinely measured and, therefore, are more easily available from e.g., atmospheric re-analyses and forecast data sets and observations than groundwater level measurements. The proposed methodology may be transferred into a real-time monitoring and forecasting workflow for wtda at the continental scale.

420
Code and data availability. The code for constructing the proposed LSTM networks and result analyses is available from the authors. Please contract Yueling Ma at y.ma@fz-juelich.de. The TSMP-G2A data set is available online at   Competing interests. The authors declare that they have no conflict of interest.