Articles | Volume 30, issue 8
https://doi.org/10.5194/hess-30-2373-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction
Download
- Final revised paper (published on 27 Apr 2026)
- Preprint (discussion started on 06 Oct 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-4055', Anonymous Referee #1, 26 Dec 2025
- AC1: 'Reply on RC1', Marc Ohmer, 03 Feb 2026
- RC2: 'Comment on egusphere-2025-4055', Anonymous Referee #2, 06 Jan 2026
- AC2: 'Reply on RC2', Marc Ohmer, 03 Feb 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to minor revisions (further review by editor) (28 Feb 2026) by Daniel Klotz
AR by Marc Ohmer on behalf of the Authors (09 Mar 2026)
ED: Publish subject to technical corrections (05 Apr 2026) by Daniel Klotz
AR by Marc Ohmer on behalf of the Authors (07 Apr 2026)
Summary
The paper compares single-well models with global models for groundwater level forecasting, focusing on robustness and predictive performance. The comparison is well motivated by earlier work suggesting that global approaches often perform better in surface water modelling. The authors also examine how global model performance depends on training-set size, evaluate the influence of dynamic similarity across sites, and investigate how well global models generalize to unseen wells. Overall, the manuscript is clearly structured and the analysis is presented in a careful and transparent way.
Evaluation and Recommendations
Model choice may influence the conclusions, but it is currently unclear to what extent. For single-well models, performance can vary across sites depending on the selected model structure. Global model performance may also be sensitive to model choice, which could affect the resulting predictions and the strength of the conclusions. Expanding the set of tested models may be beyond the scope of this paper, but I recommend explicitly discussing how sensitive the main findings are to the chosen model(s), and under which conditions the conclusions might change.
As an additional diagnostic, a map showing the spatial distribution of performance differences (e.g., ΔNSE = NSE_global − NSE_local) would be informative to assess whether the largest deltas follow any geographic or hydrogeological patterns.
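To illustrate the suggested diagnostic, the per-well differences could be tabulated along the following lines before plotting. This is only a minimal sketch: the function names, the dictionary layout (`obs`, `sim_global`, `sim_local`, `lon`, `lat`) and the example values are hypothetical and do not come from the manuscript.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs from its mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def delta_nse_table(wells):
    """Return (well_id, lon, lat, delta_nse) rows, with delta_nse = NSE_global - NSE_local."""
    rows = []
    for well_id, w in wells.items():
        delta = nse(w["obs"], w["sim_global"]) - nse(w["obs"], w["sim_local"])
        rows.append((well_id, w["lon"], w["lat"], delta))
    return rows


# Hypothetical example data for a single well (not taken from the study).
example_wells = {
    "well_A": {
        "lon": 8.4, "lat": 49.0,
        "obs": [1.0, 2.0, 3.0, 4.0, 5.0],
        "sim_global": [1.1, 2.1, 2.9, 4.0, 5.1],
        "sim_local": [1.5, 2.5, 2.5, 4.5, 4.5],
    },
}

for row in delta_nse_table(example_wells):
    print(row)
```

The resulting rows can be passed to any mapping tool and coloured by the sign and magnitude of ΔNSE, so that positive values mark wells where the global model performs better.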
The methodology of filtering out a subset of wells is clear and coherent, and the correlation-based selection is easy to follow. However, I wonder how the results might change if a spatio-dynamic clustering were used instead. In this context, it would help to justify why a correlation-based approach was preferred over other clustering methods. A useful discussion point is whether adding hydrogeological classifications (in addition to the dynamic similarity) could provide meaningful context before applying the global model, and whether longer time series (where available) would be expected to improve model performance.
Specific comments:
Line 16: missing reference.
Line 17: consider "are often slower" (not always, as in the case of karst).
Lines 104–107: Please rephrase for clarity. In the context of this sentence, it is not clear what “unseen location” means.
Line 122: Was HYRAS or ERA5-Land used in this case?
Lines 193–195: “Groundwater drought” is defined and interpreted in different ways across the literature. In this manuscript, it appears to be implicitly defined as periods when groundwater levels fall below the 10th percentile (“the 10th and 90th percentiles of the observed distribution in the test set.”), but this threshold is not stated clearly or justified. Please explicitly define the drought criterion, provide a reference (or brief background) for the use of the 10th-percentile threshold, and clarify your terminology. How do these lines relate to line 305 (“For each well, low extremes were defined as values in the test period below the 1st percentile of its training distribution, and high extremes as values above the 99th percentile”)?
Section 4.4 appears to be duplicated as Section 4.5.
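To make the question on lines 193–195 versus line 305 concrete, the two quoted definitions of low extremes can be placed side by side. The sketch below is an assumption-laden illustration, not the authors' implementation: the helper names, the linear-interpolation percentile and the synthetic series are all hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def low_extremes_test_based(test, q=10):
    """Test values below the q-th percentile of the test set itself (cf. lines 193-195)."""
    threshold = percentile(test, q)
    return [x for x in test if x < threshold]


def low_extremes_train_based(train, test, q=1):
    """Test values below the q-th percentile of the training distribution (cf. line 305)."""
    threshold = percentile(train, q)
    return [x for x in test if x < threshold]


# Synthetic groundwater levels (hypothetical): the test period is drier than training.
train_levels = list(range(100, 200))
test_levels = list(range(90, 140))

n_test_based = len(low_extremes_test_based(test_levels))
n_train_based = len(low_extremes_train_based(train_levels, test_levels))
print(n_test_based, n_train_based)
```

By construction, the test-based threshold always flags roughly the same share of the test period, whereas the training-based threshold flags more or fewer values depending on how the test period differs from the training period. This is why the relation between the two definitions should be stated explicitly.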