Quantifying the spatiotemporal dynamics in subsurface hydrological flows over a long time window usually relies on a network of monitoring wells. However, such observations are often spatially sparse with potential temporal gaps due to poor data quality or instrument failure. In this study, we explore the ability of recurrent neural networks to fill gaps in a spatially distributed time-series dataset. We use a well network that monitors the dynamic and heterogeneous hydrologic exchanges between the Columbia River and its adjacent groundwater aquifer at the U.S. Department of Energy's Hanford site. This 10-year-long dataset contains hourly temperature, specific conductance, and groundwater table elevation measurements from 42 wells with gaps of various lengths. We employ a long short-term memory (LSTM) model to capture the temporal variations in the observed system behaviors needed for gap filling. The performance of the LSTM-based gap-filling method was evaluated against a traditional autoregressive integrated moving average (ARIMA) method in terms of error statistics and accuracy in capturing the temporal patterns of river corridor wells with various dynamics signatures. Our study demonstrates that the ARIMA models yield better average error statistics, although they tend to have larger errors during time windows with abrupt changes or high-frequency (daily and subdaily) variations. The LSTM-based models excel in capturing both high-frequency and low-frequency (monthly and seasonal) dynamics. However, the inclusion of high-frequency fluctuations may also lead to overly dynamic predictions in time windows that lack such fluctuations. The LSTM can take advantage of the spatial information from neighboring wells to improve the gap-filling accuracy, especially for long gaps in system states that vary at subdaily scales. 
While LSTM models require substantial training data and have limited extrapolation power beyond the conditions represented in the training data, they afford great flexibility to account for the spatial correlations, temporal correlations, and nonlinearity in data without a priori assumptions. Thus, LSTMs provide effective alternatives to fill in data gaps in spatially distributed time-series observations characterized by multiple dominant frequencies of variability, which are essential for advancing our understanding of dynamic complex systems.

Long-term hydrological monitoring using distributed well networks is of critical importance for understanding how ecosystems respond to chronic or
extreme perturbations as well as for informing policies and decisions related to natural resources and environmental issues

Various statistical methods have been developed to fill gaps in spatiotemporal datasets, with the autoregressive integrated moving average (ARIMA)
method being the most commonly used

Deep neural networks (DNNs)

Groundwater monitoring well network in the 300 Area of the Hanford site and the monitoring data at select wells. Each dot represents a well instrumented to measure groundwater elevation, temperature, and specific conductance (SpC). The wells selected for this study are marked with red dots with well names. The three variables monitored are shown in time-series plots using blue (water elevation), black (SpC), and red (temperature) lines. (Base map © Google Maps.)

Our study aims to evaluate the potential of using LSTM models for filling gaps in spatiotemporal time-series data collected from a distributed
network. We tested the LSTM-based gap-filling method using datasets collected to understand the interactions between a regulated river and a
contaminated groundwater aquifer. We treat the gap filling as a forecasting problem, i.e., we use the historical data as input to predict the missing
values in the data gaps. The future information relative to the gap is implicitly used in training the LSTM models. Framing gap filling as a
predictive problem is a common practice when machine learning methods are used for filling gaps in time-series data (see examples in

A 10-year (2008–2018) spatiotemporal dataset was collected from a network of groundwater wells that monitor temperature (CS547A water conductivity and
temperature probe, Campbell Scientific), specific conductance (SpC) (CS547A water conductivity and temperature probe, Campbell Scientific),
and water table elevation (CS451 stainless-steel pressure transducer, Campbell Scientific) at the 300 Area of the U.S. Department of Energy
Hanford site, located in southeastern Washington State. The groundwater well network was originally built to monitor the attenuation of legacy
contaminants. The groundwater aquifer at the study site is composed of two distinct geologic formations: a highly permeable formation (Hanford
formation, consisting of coarse gravelly sand and sandy gravel) underlain by a much less permeable formation (the Ringold Formation, consisting of
silt and fine sand). The dominant hydrogeologic features of the aquifer are defined by the interface between the Hanford and Ringold formations as well as
the heterogeneity within the Hanford formation

The intrusion of river water into the adjacent groundwater aquifer causes two water bodies with distinct geochemistry to mix and stimulates biogeochemical reactions at the interface. The river water has lower SpC (0.1–0.12

Earlier studies have demonstrated that physical heterogeneity contributes to the different response behaviors of different locations while the river stage
dynamics lead to multifrequency dynamics in those responses. Natural climatic forcing drives seasonal and annual variations

Wavelet power spectrum (WPS) analysis of SpC at each well from 2008 to 2018. The first column is the spectrogram (in

To understand the multifrequency variations in the river water and groundwater mixing in each well at the study site, we perform spectral analysis on
multiyear SpC observations at each selected well using a continuous wavelet transform (CWT). The CWT is widely used for time–frequency analysis of
time series and relies on a “mother wavelet”, which is chosen to be the Morlet wavelet

As the system dynamics are driven by the river stage, we perform magnitude-squared wavelet coherence analysis via the Morlet wavelet to reveal the
dynamic correlations between the SpC and river stage time series

Histograms of gap lengths for each monitored variable, aggregated across all wells in the monitoring network from 2008 to 2018.
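Gap lengths of the kind summarized in such a histogram can be extracted by scanning each record for runs of consecutive missing values. The following is a minimal sketch, assuming an hourly series in which missing measurements are marked with `None`; the function name `gap_lengths` and the sample values are hypothetical and are not taken from the study's scripts.

```python
from collections import Counter
from typing import List, Optional, Sequence


def gap_lengths(series: Sequence[Optional[float]]) -> List[int]:
    """Return the length (in samples) of every run of consecutive missing values."""
    lengths: List[int] = []
    run = 0
    for value in series:
        if value is None:
            run += 1              # still inside a gap
        elif run:
            lengths.append(run)   # a gap just ended
            run = 0
    if run:                       # the record ends inside a gap
        lengths.append(run)
    return lengths


# Hypothetical hourly SpC record; None marks a missing hour.
spc = [0.40, None, None, 0.41, 0.42, None, 0.43, None, None, None]
lengths = gap_lengths(spc)
histogram = Counter(lengths)      # maps gap length -> number of gaps of that length
```

Aggregating the per-well `Counter` objects across a network would give one histogram per monitored variable.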

As can be seen in Fig.

In this section, we describe two methods that we implemented to fill gaps of various lengths in SpC measurements at selected wells: an LSTM model and the
traditional ARIMA model. We focused our analyses on filling gaps in SpC due to its importance in revealing river water and groundwater mixing. The
same set of analyses can be performed on water level and temperature. An input with

Illustration of LSTM models for gap filling. Panel

The input window may contain multiple variables relevant to the prediction from a single well or multiple wells. After experimenting with different
sets of input variables (SpC only; SpC and water level; and SpC, water level, and temperature), we found that including SpC and water level
measurements in the input window yielded the most robust performance. Therefore, we used historic water level (

We designed an LSTM architecture, shown in Fig.

Parameters used in training single-well LSTM models.

Training data for the LSTM models were created by finding data segments of
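One way to build such training samples is sketched below. It assumes that a usable segment is a gap-free stretch long enough to hold one input window followed by one gap, and it uses a single variable for brevity, whereas the models described here take SpC and water level as inputs. The names `gap_free_segments`, `make_samples`, `window`, and `gap` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[List[float], List[float]]


def gap_free_segments(series: Sequence[Optional[float]], min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs without missing
    values that are at least min_len samples long."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(series):
        if value is not None and start is None:
            start = i                                  # a gap-free run begins
        elif value is None and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None                               # the run ended at a gap
    if start is not None and len(series) - start >= min_len:
        segments.append((start, len(series)))
    return segments


def make_samples(series: Sequence[Optional[float]], window: int, gap: int) -> List[Sample]:
    """Slide over each gap-free segment and pair every input window with the
    `gap` values that immediately follow it (the values to be predicted)."""
    samples: List[Sample] = []
    for start, end in gap_free_segments(series, window + gap):
        for i in range(start, end - window - gap + 1):
            x = list(series[i:i + window])
            y = list(series[i + window:i + window + gap])
            samples.append((x, y))
    return samples


# Hypothetical record with one missing value (None).
series = [1.0, 2.0, 3.0, 4.0, 5.0, None, 6.0, 7.0, 8.0]
samples = make_samples(series, window=3, gap=1)
```

In this toy record only the first five values form a segment long enough for a 3-sample window plus a 1-sample gap, so the trailing three values contribute no samples.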

To evaluate the accuracy of the trained LSTM models in filling SpC data gaps during the validation and testing processes, we assumed that synthetic
gaps of various lengths (e.g., 1, 6, 12, 24, 48, and 72

In addition to MAPE, the models are evaluated using the Nash–Sutcliffe model efficiency coefficient (NSE)

The Kling–Gupta efficiency (KGE) is another goodness-of-fit metric used to evaluate hydrological models by combining the three components of the NSE model errors
(i.e., correlation, bias, and ratio of variances or coefficients of variation) in a more balanced way. It has the same range of values as the NSE, where 1
indicates a perfect model fit. The KGE is calculated on a model's SpC predictions using the following equations:
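For reference, a widely used formulation of the KGE is sketched below. The symbols $r$, $\alpha$, $\beta$, $\mu$, and $\sigma$ are the conventional ones and are assumptions here; they may differ from the notation used elsewhere in this paper.

```latex
\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\qquad
\alpha = \frac{\sigma_{\mathrm{sim}}}{\sigma_{\mathrm{obs}}},
\qquad
\beta = \frac{\mu_{\mathrm{sim}}}{\mu_{\mathrm{obs}}}
```

Here $r$ is the linear correlation between observed and predicted SpC, $\mu$ denotes the mean, and $\sigma$ denotes the standard deviation. In the variant based on coefficients of variation, $\alpha$ is replaced by $\gamma = (\sigma_{\mathrm{sim}}/\mu_{\mathrm{sim}}) / (\sigma_{\mathrm{obs}}/\mu_{\mathrm{obs}})$.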

In addition to the LSTM models trained for the single-well setup, we also trained multi-well LSTM models that used observations from wells 1-1, 1-10A,
and 1-16A to fill in data gaps for well 1-1. We explored the same set of configuration parameters in the multi-well LSTM models as shown in
Table

We used a grid-search approach to explore different LSTM model hyperparameter configurations to find the best model for a given gap length at each
well. This involved iterating over all combinations of the input time window size (
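The grid search can be sketched as an exhaustive loop over candidate values that keeps the configuration with the lowest validation error. This is a minimal illustration, not the study's script: the candidate values are hypothetical, and `toy_score` is a stand-in for training an LSTM and returning its validation error (for example MAPE).

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Config = Tuple[int, int, float]  # (input window size, LSTM units, learning rate)


def grid_search(
    windows: Iterable[int],
    units: Iterable[int],
    learning_rates: Iterable[float],
    score: Callable[[int, int, float], float],
) -> Tuple[Optional[Config], float]:
    """Evaluate every combination and return the one with the lowest score."""
    best_config: Optional[Config] = None
    best_score = float("inf")
    for w, u, lr in product(windows, units, learning_rates):
        s = score(w, u, lr)
        if s < best_score:
            best_config, best_score = (w, u, lr), s
    return best_config, best_score


def toy_score(window: int, units: int, lr: float) -> float:
    """Hypothetical placeholder for 'train a model, return validation error'."""
    return abs(window - 48) / 48 + abs(units - 64) / 64 + abs(lr - 1e-3) * 100


# Hypothetical candidate values; a separate search would run per well and gap length.
best, err = grid_search([24, 48, 72], [32, 64, 128], [1e-2, 1e-3, 1e-4], toy_score)
```

Because every combination requires a full training run, the cost grows with the product of the number of candidates for each hyperparameter.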

ARIMA is one of the most general model classes for extrapolating time series to produce forecasts. We used it as a baseline to compare and assess the
LSTM gap-filling method. ARIMA can be applied to nonstationary time-series data using a combination of differencing, autoregressive, and moving
average components. A nonseasonal ARIMA

The main task in ARIMA-based forecasting is to select appropriate model orders, i.e., the values of
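For reference, the standard textbook form of a nonseasonal ARIMA model with orders $p$, $d$, and $q$ can be written with the backshift operator $B$ (where $B y_t = y_{t-1}$). The symbols below follow common convention and are assumptions; they may differ from the notation used elsewhere in this paper.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

Here $p$ is the number of autoregressive terms with coefficients $\phi_i$, $d$ is the degree of differencing, $q$ is the number of moving-average terms with coefficients $\theta_j$, $c$ is a constant, and $\varepsilon_t$ is white noise.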

Similar to the LSTM-based gap filling, we explored various input window sizes, from 24 to 504

We selected the best combination of LSTM units (

Gap-filling performance for SpC evaluated against the validation datasets, grouped by gap lengths

As shown in Fig.

The performance of single-well LSTM models varied among the wells, as shown in Fig.

Optimal input windows for the LSTM and ARIMA models to fill gaps of various lengths at each well.

The single-well LSTM gap-filling approach was compared to the ARIMA approach using the relative errors calculated for each data point assumed to be
missing in the testing data by setting

Box plots of the relative errors for filling SpC gaps of various lengths (1, 6, 12, 24, 48, and 72

Figure

Comparison of single-well LSTM and ARIMA models for a 24

For each well, we performed a

In addition to the relative errors, we calculated the MAPE, root-mean-squared error (RMSE), NSE, and KGE for the best LSTM and ARIMA model for each
gap length. Table

The LSTM and ARIMA models yielded comparable average metrics at all wells for the 24

In addition to the error statistics, it is also important to examine how well a gap-filling method captures the desired dynamics patterns in the
gap-filled time series. Therefore, the SpC time series reproduced by the gap-filling methods during the testing period (2016 for wells 1-1, 1-10A,
2-2, 2-3; 2017 for well 1-15; 2008 for well 2-5) with 24

Column 1 shows time series of model predictions from the LSTM (in red) and ARIMA (in blue) methods, respectively, assuming a 24

As shown in the first column of Fig.

To further investigate the dependence of the relative performance of the two gap-filling methods on the inherent dynamics in each time series,
spectral analyses for the testing SpC datasets were performed using the same wavelet decomposition method as for the multiyear analyses (shown earlier
in Fig.

There is also a significant difference in the computational cost for the LSTM and ARIMA methods. ARIMA requires very few computational resources: the
auto.arima function in R requires approximately 40

We evaluated the predictive ability of multi-well models using both approaches for filling gaps of various lengths in the SpC data at well 1-1 by
comparing their performance against their single-well counterparts. Well 1-1 was chosen because data are available for nearby wells (wells 1-10A
and 1-16A). Like the single-well ARIMA and LSTM models for well 1-1, the multi-well models predict SpC at well 1-1 from water level and
SpC, but they draw these inputs from all three wells. We adopted the same LSTM architecture as in the single-well LSTM model and trained the same set of alternatives considering input
window sizes, LSTM units, and learning rates for various gap lengths as listed in Table

Comparing relative error performance between the best single-well LSTM models (well 1-1 – red), multi-well LSTM models (wells 1-1, 1-10A, and 1-16A – yellow), a single-well ARIMA model (blue), and a multi-well ARIMA model (green) for filling in various SpC gap lengths for well 1-1 during the test period (year 2016).

Comparison of single-well and multi-well LSTM and ARIMA models for all synthetic gap lengths in the SpC data. The models are the same as those used in Fig. 9. Calculations are performed on the test dataset for well 1-1 (year 2016). The calculated statistics are as follows: the mean absolute percentage error (MAPE), the root-mean-squared error (RMSE), the Nash–Sutcliffe model efficiency coefficient (NSE), and the Kling–Gupta efficiency (KGE).
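The four statistics named in this caption can be computed as in the following sketch. It assumes paired lists of observed and predicted SpC values with nonzero observations (required by MAPE), and it uses the KGE form based on the ratio of standard deviations. The function names are hypothetical and the sample values are illustrative only.

```python
import math
from typing import Sequence


def _mean(x: Sequence[float]) -> float:
    return sum(x) / len(x)


def mape(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * _mean([abs((o - s) / o) for o, s in zip(obs, sim)])


def rmse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Root-mean-squared error, in the units of the data."""
    return math.sqrt(_mean([(o - s) ** 2 for o, s in zip(obs, sim)]))


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    m = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - m) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    mo, ms = _mean(obs), _mean(sim)
    so = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    ss = math.sqrt(_mean([(s - ms) ** 2 for s in sim]))
    r = _mean([(o - mo) * (s - ms) for o, s in zip(obs, sim)]) / (so * ss)
    alpha = ss / so   # variability ratio
    beta = ms / mo    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Illustrative values only, not measurements from the study.
observed = [0.20, 0.30, 0.40, 0.35]
shifted = [o + 0.10 for o in observed]   # a prediction with a constant offset
```

A constant offset leaves the correlation and variability ratio at 1, so only the bias term lowers the KGE, whereas the NSE penalizes the offset through the squared errors.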

The box plots of relative errors yielded from the single-well and multi-well models using both approaches are provided in Fig.

These comparisons show that, although the information from a single well may be sufficient to fill in gaps shorter than a day, including spatial information from neighboring wells in the LSTM and ARIMA models could improve the chance of success in filling data gaps under more challenging circumstances, such as capturing more complex dynamic patterns with longer data gaps. While the aggregated metrics provide an overall assessment of model performance, examining the distribution of relative errors could provide complementary information on large error spikes when selecting optimal model configurations.

In this study, we implemented an LSTM-based gap-filling method to account for spatiotemporal correlations in monitoring data. We extensively evaluated the method on its ability to fill data gaps in the groundwater SpC measurements that are often used to indicate groundwater and river water interactions along river corridors. We took advantage of a 10-year, spatially distributed, multivariable time-series dataset collected by a groundwater monitoring well network and optimized an LSTM architecture for filling SpC data gaps. A primary advantage of using LSTM is its ability to incorporate spatiotemporal correlations and nonlinearity in model states without a priori assuming an explicit form of correlations or nonlinear functions in advancing system states. We compared the performance of a single-well LSTM-based gap-filling method with a traditional gap-filling method, ARIMA, to evaluate how well an LSTM model can capture multifrequency dynamics. We also trained LSTM and ARIMA models that take input from multiple wells to predict responses at one well. The multi-well models were compared with single-well models to identify and assess improvements in gap-filling performance from including additional spatial correlation from neighboring wells.

In general, both LSTM and ARIMA methods were highly accurate in filling smaller data gaps (i.e., 1 and 6

Wavelet analysis could provide useful insights into the dynamic signatures of the data and changes in the composition of their important frequencies over time, which can serve as a basis for selecting an appropriate gap-filling method. For example, the ARIMA method would work well if the dynamics are dominated by seasonal cycles, while more sophisticated approaches like LSTM-based methods could work better if there is evidence of weekly, daily, and subdaily fluctuations. Depending on the mixture of high- and low-frequency variability inherent in the time series, different LSTM architectures and configurations can be explored and evaluated through hyperparameter searches with respect to LSTM layers, dense layers, and activation functions to achieve better performance in capturing complex dynamics. The optimal LSTM model configuration and achievable performance would vary case by case.

We also demonstrated that incorporating spatial information from neighboring stations in LSTM models could contribute to performance improvements
under challenging scenarios with dynamic system behaviors and longer data gaps of up to 2

What is the dynamics signature of the data to be filled?

How is gap-filling performance impacted by the length of gaps?

How does the amount of training data impact the model performance?

How does the choice of the input time window impact gap-filling performance?

How much value can measurements at neighboring wells add to the performance improvement?

The well observations dataset is available from

The supplement related to this article is available online at:

HR and EC developed scripts and performed the analyses; the abovementioned authors contributed equally to the paper. BK contributed to the interpretation of the results. XC conceived and designed the study. All authors contributed to writing the paper.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was supported by the U.S. Department of Energy (DOE), Office of Biological and Environmental Research (BER), as part of the BER Environmental System Science (ESS) program. A portion of the methodology development was supported by the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory, a multi-program national laboratory operated by Battelle for the DOE under contract no. DE-AC05-76RL01830. This research was performed using PNNL Institutional Computing at Pacific Northwest National Laboratory. This research was also supported in part by the Indiana University Environmental Resilience Institute and the “Prepared for Environmental Change” grand challenge initiative. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

This research has been supported by the U.S. Department of Energy, Biological and Environmental Research (grant no. PNNL SBR SFA).

This paper was edited by Dimitri Solomatine and reviewed by three anonymous referees.