Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction

Ohmer, Marc; Liesch, Tanja

doi:10.5194/hess-30-2373-2026

Articles | Volume 30, issue 8

https://doi.org/10.5194/hess-30-2373-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.


Research article | 27 Apr 2026


Final revised paper (published on 27 Apr 2026)
Preprint (discussion started on 06 Oct 2025)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-4055', Anonymous Referee #1, 26 Dec 2025

Summary
The paper compares single-well models with global models for groundwater level forecasting, focusing on robustness and predictive performance. The comparison is well motivated by earlier work suggesting that global approaches often perform better in surface water modelling. The authors also examine how global model performance depends on training-set size, evaluate the influence of dynamic similarity across sites, and investigate how well global models generalize to unseen wells. Overall, the manuscript is clearly structured and the analysis is presented in a careful and transparent way.
Evaluation and Recommendations
Model choice may influence the conclusions, but it is currently unclear to what extent. For single-well models, performance can vary across sites depending on the selected model structure. Global model performance may also be sensitive to model choice, which could affect the resulting predictions and the strength of the conclusions. Expanding the set of tested models may be beyond the scope of this paper, but I recommend explicitly discussing how sensitive the main findings are to the chosen model(s), and under which conditions the conclusions might change.
As an additional diagnostic, a map showing the spatial distribution of performance differences (e.g., ΔNSE = NSE_global − NSE_local) would be informative to assess whether the largest deltas follow any geographic or hydrogeological patterns.
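The per-well quantity behind such a map can be sketched as follows. This is only an illustration of the ΔNSE definition quoted above, using the standard Nash-Sutcliffe efficiency; the function names and the toy series are hypothetical and are not taken from the manuscript.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors) / (spread of observations)."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and equally long")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - sse / spread


def delta_nse(observed, sim_global, sim_local) -> float:
    """Delta NSE = NSE_global - NSE_local; positive values favour the global model."""
    return nse(observed, sim_global) - nse(observed, sim_local)


# Hypothetical toy series for a single well (not data from the study).
obs = [1.0, 2.0, 3.0, 4.0]
sim_global = [1.0, 2.0, 3.0, 4.0]  # perfect fit, NSE = 1
sim_local = [2.5, 2.5, 2.5, 2.5]   # predicts the observed mean, NSE = 0

delta = delta_nse(obs, sim_global, sim_local)
```

Computing one such value per well and joining it to the well coordinates would give the data needed for the suggested map.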
The methodology of filtering out a subset of wells is clear and coherent, and the correlation-based selection is easy to follow. However, I wonder how the results might change if a spatio-dynamic clustering were used instead. In this context, it would help to justify why a correlation-based approach was preferred over other clustering methods. A useful discussion point is whether adding hydrogeological classifications (in addition to the dynamic similarity) could provide meaningful context before applying the global model, and whether longer time series (where available) would be expected to improve model performance.
Specific comments:
Line 16: missing reference.
Line 17: are often slower (not always, as in the case of Karst)
Lines 104-107: Please rephrase for clarity. In the context of this sentence, it is not clear what “unseen location” means.
Line 122: was HYRAS or ERA5-Land used in this case?
Lines 193-195: “Groundwater drought” is defined and interpreted in different ways across the literature. In this manuscript, it appears to be implicitly defined as periods when groundwater levels fall below the 10th percentile (“the 10th and 90th percentiles of the observed distribution in the test set.”), but this threshold is not stated clearly or justified. Please explicitly define the drought criterion, provide a reference (or brief background) for the use of the 10th-percentile threshold, and clarify your terminology. How do these lines relate to line 305: “For each well, low extremes were defined as values in the test period below the 1st percentile of its training distribution, and high extremes as values above the 99th percentile”?
Section 4.4 is duplicated in Section 4.5.
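
The difference between the two threshold definitions quoted in the comment on lines 193-195 can be made concrete with a small sketch. It assumes a simple linear-interpolation percentile and uses hypothetical toy data; the helper names are illustrative only, and the manuscript may compute percentiles differently.

```python
def percentile(values, p):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = p * (len(ordered) - 1) / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def extremes(test, reference, low_p, high_p):
    """Return test values below the low_p / above the high_p percentile of `reference`."""
    low_thr = percentile(reference, low_p)
    high_thr = percentile(reference, high_p)
    lows = [v for v in test if v < low_thr]
    highs = [v for v in test if v > high_thr]
    return lows, highs


# Hypothetical toy data for one well (not values from the study).
train = [float(i) for i in range(101)]  # training-period heads 0.0 .. 100.0
test = [0.5, 50.0, 99.5, 120.0]         # test-period heads

# Definition A: 10th/90th percentiles of the test set itself.
lows_a, highs_a = extremes(test, test, 10, 90)
# Definition B: 1st/99th percentiles of the training distribution.
lows_b, highs_b = extremes(test, train, 1, 99)
```

In this toy case the two definitions flag different sets of high values, which illustrates why the manuscript should state which one applies where.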

Citation: https://doi.org/10.5194/egusphere-2025-4055-RC1
- AC1: 'Reply on RC1', Marc Ohmer, 03 Feb 2026
  
  We thank RC1 for the thorough and constructive review.
  
  A detailed point-by-point response is provided in the attached PDF file.
  
  In summary, we have revised the manuscript accordingly and clarified the main points raised.
  
  Citation: https://doi.org/10.5194/egusphere-2025-4055-AC1
RC2:
'Comment on egusphere-2025-4055', Anonymous Referee #2, 06 Jan 2026

“Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction” by Ohmer and Liesch presents an interesting study on the design of DL models for groundwater timeseries modelling. Even though there already exists a substantial amount of DL applications on groundwater timeseries modelling, I believe that the study design and the obtained results add novelty to the existing work. I have several points that I wish to see addressed prior to publication.
In the introduction, the authors give quite a broad overview of DL applications for both surface water and groundwater timeseries modelling. The introduction would benefit from clearly stating which studies focus on groundwater and which on surface water. Since this is a groundwater study, I wonder how many surface water references are required – maybe some of them can be removed and replaced by groundwater references. I agree that there is more DL experience in the surface water domain, especially when it comes to spatial transferability, but this could also be an additional point to highlight in the introduction. The intercomparison study by Collenteur et al. (https://doi.org/10.5194/hess-28-5193-2024) would be a good addition to the introduction.
The authors carry out a spatial transferability study (objective 4), which I have not seen in the groundwater literature, and the presented references (l.50) are all surface water studies. If this is the first spatial transferability study in the groundwater domain, the authors should state this clearly; if other studies exist, they should be mentioned in the introduction.
What is the reasoning behind using a CNN for the single well models and a LSTM for the global models?
Section 2.2 Are any of the climate variables aggregated in time, for example running sum of net precipitation or SPI at different aggregation windows?
Section 2.3 Just to be clear, timeseries statistics such as mean head or standard deviation are not part of the static attributes?
What are the sensitivities of the choice of architecture and hyperparameter values presented in section 3.2?
Section 3.2 Were the head timeseries normalized in any way? If yes, how can the authors argue for testing spatial extrapolation if knowledge on mean and standard deviation is required for the back transformation?
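The concern in the preceding comment can be illustrated with a minimal sketch. It assumes per-well z-score normalization, which is an assumption made here for illustration only (the comment itself asks whether and how the heads were normalized), and all names and numbers are hypothetical.

```python
import math


def fit_scaler(heads):
    """Return the mean and population standard deviation of one well's head series."""
    mean = sum(heads) / len(heads)
    std = math.sqrt(sum((h - mean) ** 2 for h in heads) / len(heads))
    if std == 0:
        raise ValueError("cannot normalize a constant series")
    return mean, std


def normalize(heads, mean, std):
    """Map heads to z-scores using the supplied statistics."""
    return [(h - mean) / std for h in heads]


def denormalize(z_scores, mean, std):
    """Map z-scores back to absolute heads; requires the well's own mean and std."""
    return [z * std + mean for z in z_scores]


# Hypothetical heads (m a.s.l.) for a training well and an unseen well.
train_well = [10.0, 12.0, 14.0]
unseen_well = [200.0, 201.0, 202.0]

mean_t, std_t = fit_scaler(train_well)
z = normalize(train_well, mean_t, std_t)
restored = denormalize(z, mean_t, std_t)

# Back-transforming the same z-scores with another well's statistics gives
# absolute levels for that well, so those statistics must come from somewhere.
mean_u, std_u = fit_scaler(unseen_well)
as_unseen = denormalize(z, mean_u, std_u)
```

Under this assumption, predictions for an unseen well can only be expressed as absolute heads if its mean and standard deviation are known, which is the point raised about testing spatial extrapolation.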
What do P1, P2, …, P5 mean? P1 excludes 500 wells, P2 excludes 1000 wells, and so on? This only became clear to me when reading the results sections. It would be good to state the number of wells in each stage already in Section 3.1. The testing strategy is not stated. Are all wells for 2013-2022 used for testing, or only the ones left after stagewise removal? From Figure 2 I get the impression that the testing dataset varies by stage – can the performances be compared in a meaningful way across the stages? I would suggest an additional test using the P5 wells for all stages.
The fact that global models do not outperform single-well models for the P0 stage, and that an advantage of the global model only becomes tangible at P4 and P5, makes me wonder whether the chosen LSTM architecture can exploit the static features in a meaningful way. Along these lines, when removing wells based on their correlation, do the static features also become more homogeneous? In other words, is the similarity of the timeseries reflected by the static features?
Another very relevant question in my opinion is the length of the timeseries. The authors make use of an extensive German database, with full coverage for a period of 1991 to 2022, which is a coverage that is not available in many other countries. Therefore, an alternative modelling experiment with stages where e.g., 2 years at a time are removed from the training dataset would be very insightful. P1 starting in 1996, P2 in 1998, etc. would be extremely relevant for similar applications in countries with shorter groundwater records.
Sections 4.4 and 4.5 contain the same text.
I am puzzled why the performance for the correlation-wise stages increases in Figure 6 for the out-of-sample wells. For P5cor the model is trained on 451 wells and tested on 2500, and for P1cor the model is trained on 2451 wells and tested on 500; is this correct? Again, the varying testing datasets make it difficult to compare performance across the stages, in my opinion. Nevertheless, for the P5cor training you are using very homogeneous timeseries and testing across very heterogeneous timeseries. Why should this work better than P1cor, where you train on heterogeneous timeseries and also test on heterogeneous timeseries?

Citation: https://doi.org/10.5194/egusphere-2025-4055-RC2
- AC2: 'Reply on RC2', Marc Ohmer, 03 Feb 2026
  
  We sincerely thank RC2 for the constructive and insightful comments that helped to further improve our manuscript.
  
  A detailed point-by-point reply addressing all remarks is provided in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-4055-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Publish subject to minor revisions (further review by editor) (28 Feb 2026) by Daniel Klotz

AR by Marc Ohmer on behalf of the Authors (09 Mar 2026) Author's response Author's tracked changes Manuscript

ED: Publish subject to technical corrections (05 Apr 2026) by Daniel Klotz

AR by Marc Ohmer on behalf of the Authors (07 Apr 2026) Author's response Manuscript

Short summary

We compared global vs. local deep learning models for groundwater level prediction using ~3,000 wells across Germany. Unlike surface water, groundwater is complex and data-scarce. Results: global models show no systematic accuracy advantage over local ones. Data similarity matters more than quantity for better predictions. Successful groundwater modeling requires strategies tailored to these unique complexities, not just larger datasets.