Assessing groundwater level modelling using a 1-D convolutional neural network (CNN): linking model performances to geospatial and time series features

Gomez, Mariana; Nölscher, Maximilian; Hartmann, Andreas; Broda, Stefan

doi:https://doi.org/10.5194/hess-28-4407-2024

Articles | Volume 28, issue 19

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 07 Oct 2024

Final revised paper (published on 07 Oct 2024)
Preprint (discussion started on 13 Sep 2023)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2023-1836', Marvin Höge, 06 Oct 2023

Review of “Performance assessment of geospatial and time series features on groundwater level forecasting with deep learning.” by Mariana Gomez et al., 2023
Summary
The presented study addresses the important topic of groundwater level forecasting, demonstrated for several hundred wells in Lower Saxony, Germany. Specifically, it covers the performance of a convolutional neural network (CNN) in accomplishing this task, as an increasingly used alternative to the physics-based models typically employed. The study analyses how the CNN performance relates to geospatial features and time series features of the respective sites.
Evaluation and Recommendation
The study covers an interesting topic, especially since groundwater in Germany (and globally) constitutes a freshwater resource of already high and still growing importance. The topic of this manuscript is suitable for the journal.
The manuscript is overall well written and referenced. The codes that were used are freely available and data sources are referenced. The figures and maps are of high quality. With deep learning approaches being increasingly used for groundwater level forecasting, the investigation of such models’ capability is important and meets the community’s interest. Yet, the current manuscript requires some restructuring and additional investigations.
In the introduction, it is mentioned that some consider machine learning methods as “black boxes”. Explainable artificial intelligence (xAI) tools or similarly called methods are supposed to help here. Therefore, a brief coverage of advances in this field – ideally in the field of groundwater research if available - would be worth mentioning.
The main point of concern, however, is as follows: Overall, the performance of the employed CNN model, as presented, e.g., in Figure 6, is not fully convincing. Even the well-performing models - according to NSE and R2 - show mainly a sinusoidal pattern with only slight variations – yet, it is these variations that would be interesting to be modelled. Otherwise, a sine function-based model with a mean trend might often be sufficient and provide the same goodness-of-fit values. Therefore, one can assume that the used model architecture (together with only precipitation and temperature as inputs) is not complex enough to capture more of the dynamics. The subsequent analysis of performance with respect to geospatial and timeseries features therefore appears to be weaker than it could be. It appears to be difficult to deduce relations between features and model performance if the model does not perform convincingly in the first place. All correlations reported are rather weak with the strongest anti-correlation being -0.62.
Along these lines, the used geospatial features are all interesting but the timeseries features are too many and some are hardly tangible. A thorough explanation of their meaning, range of values, etc. would be beneficial. If time series features shall be part of the analysis, I recommend to focus on only a few of the ones provided. Yet, in this case, the question remains: What is exactly gained from relating these time series features to model performance? For instance, a time series can be rated as complex while an adequate model could still be able to predict it.
It could be helpful to have a leaner story, e.g. presenting another (maybe more complex) model that captures more of the dynamics and then to analyze whether strong relations between performance of that model and geospatial features can be elicited - and whether they are more indicative than the ones that correspond to the current model. If there were clear correlations to timeseries features, it might be an option to keep some in the main analysis. Overall, it would be better to place them in the appendix for the reasons discussed above.
I recommend major revisions and, at the same time, due to its interesting core topic and the different aspects of modelling, feature analysis, etc. I think this study bears quite some potential.

Specific comments
Please see the attached manuscript.
Tables and Figures
Please see the attached manuscript.
Language
Please see the attached manuscript.

Citation: https://doi.org/10.5194/egusphere-2023-1836-RC1
- AC1: 'Reply on RC1', Mariana Gomez, 31 Oct 2023
  
  Dear Marvin Höge,
  We appreciate your suggestions on our manuscript, all of which we consider valuable for enhancing its quality. In response to your comments, we aim to address your discussion points as follows.
  In the introduction, it is mentioned that some consider machine learning methods as “black boxes”. Explainable artificial intelligence (xAI) tools or similarly called methods are supposed to help here. Therefore, a brief coverage of advances in this field – ideally in the field of groundwater research if available - would be worth mentioning.
  We acknowledge the importance of addressing the implementation of Explainable AI techniques in our research, and we intend to integrate a discussion on this topic into our manuscript. This addition will contribute to a more thorough understanding of the interpretability and transparency of our modeling approach.
  The main point of concern, however, is as follows: Overall, the performance of the employed CNN model, as presented, e.g., in Figure 6, is not fully convincing. Even the well-performing models - according to NSE and R2 - show mainly a sinusoidal pattern with only slight variations – yet, it is these variations that would be interesting to be modelled. Otherwise, a sine function-based model with a mean trend might often be sufficient and provide the same goodness-of-fit values. Therefore, one can assume that the used model architecture (together with only precipitation and temperature as inputs) is not complex enough to capture more of the dynamics. The subsequent analysis of performance with respect to geospatial and time series features therefore appears to be weaker than it could be. It appears to be difficult to deduce relations between features and model performance if the model does not perform convincingly in the first place. All correlations reported are rather weak with the strongest anti-correlation being -0.62.
  Regarding your main concern about the model's performance, we hypothesize that the monthly resolution used in this study may contribute to the sinusoidal pattern seen in the well-performing model example. The seasonality evident in the monthly temperature and precipitation used as inputs can certainly affect the model behaviour. It is worth noting that the CNN model has previously demonstrated effectiveness in predicting weekly groundwater level time series with high overall accuracy, as shown by Wunsch et al. (2022) "Deep learning shows declining groundwater levels in Germany until 2100 due to climate change". Therefore, we are confident that our approach works well for higher temporal resolutions, and we propose a shift to a weekly temporal resolution for modeling the time series. This adjustment is intended to mitigate the potential seasonality introduced by the monthly resolution. By evaluating the model's performance at weekly resolution, we seek to address this issue and establish a more robust foundation for examining geospatial and time series correlations, possibly also resulting in higher correlations.
  It is important to emphasize that while adopting a different model may yield improvements in overall performance, we have consistently applied the same model to all monitoring stations. Consequently, local or more specific variations in performance metrics across stations are primarily attributed to external factors rather than model selection. Certain models may better adapt to specific locations while displaying lower performance when applied to the entire dataset. In this study, our primary objective is to analyze these relative performance differences in the presence of external influences. Therefore, even if an alternative model were to enhance overall performance, our analysis will retain its current form, focusing on the examination of these external influences.
  Along these lines, the used geospatial features are all interesting but the timeseries features are too many and some are hardly tangible. A thorough explanation of their meaning, range of values, etc. would be beneficial. If time series features shall be part of the analysis, I recommend to focus on only a few of the ones provided. Yet, in this case, the question remains: What is exactly gained from relating these time series features to model performance? For instance, a time series can be rated as complex while an adequate model could still be able to predict it.
  The selection of time series features represents the outcome of a pre-evaluation aimed at elucidating their physical significance in the context of groundwater level time series. We agree with your suggestion that a more extensive interpretation of these features should be incorporated into our manuscript. We will augment our discussion with a detailed explanation of each feature, emphasizing their relevance to the physical aspects of groundwater level dynamics.
  Regarding the detailed comments in the attached PDF, we will go through them and incorporate those we agree with.
  In conclusion, we are committed to addressing your comments as comprehensively as possible.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1836-AC1
RC2:
'Comment on egusphere-2023-1836', Jonathan Frame, 15 Oct 2023

This paper presents a CNN developed for groundwater levels across a region of Germany. The CNN was developed for each well time series individually, meaning this is not a model that should be applied to locations without well data. The paper’s main contribution is an attribution of model performance to geospatial characteristics and time series features. This analysis should be relatively interesting to modelers of groundwater systems, particularly those using data-driven methods. This paper would greatly benefit from additional description of the training procedure and evaluation when dealing with gap-filled or processed data. Another benefit would be further evaluation of the sensitivity of model performance to the gap filling and data processing measures.
Below are some specific line comments.
Line 44: These claims should be cited: “In terms of accuracy and calculation speed, the CNN models outperform the LSTM. NARX models performed, on average, better than CNN.”
Line 47: “Most studies have successfully applied these techniques for GWL forecasting using only meteorological variables as inputs.” You might be interested in this paper: Gholizadeh et al., “Long short-term memory models to quantify long-term evolution of streamflow discharge and groundwater depth in Alabama” Science of The Total Environment Volume 901, 25 November 2023, 165884, where they did in fact include site geospatial characteristics to make predictions of wells that were held out from the training set (ungauged).
Line 116: Can you please provide the total number (and percentage) of gap filled values referred to here: “To provide the CNN model with continuous time series, we performed a data imputation process through a Multiple Linear Regression (if enough dynamically similar wells based on the Euclidean Distance”. Can you also explain if these values were removed, or should be removed, from the loss during training, and also removed from the evaluation?
Line 120: Similarly can you provide the total number (and percent) of data points that were modified as outliers described here: “We removed these anomalies by finding the highest slope in the cumulative sum”? Is this a standard approach? I don’t think this description is satisfactory. I see from your code that you identify these based on “initial point where the values increase by 0.5 of the standard deviation“. This is an important point that should be explained in the paper, as well as the decision to use this processing method.

Line 135: Is it really necessary to give the equations for r-squared and NSE, as you don’t provide the equations for MSE or BIAS? There is also more unfamiliar calculations made in Tables 2 and 3 with no equations provided, and also the main CNN model is not described with equations. I guess I would suggest just removing equations 1 and 2, avoiding an asymmetry in descriptions, rather than adding equations for all the rest of the calculations.
Line 176: “Occasionally, in poorly performing models, the pattern of the GWL observations has been generally learned but with a strong Bias.” This is a little concerning, and I think it would be worth describing in more detail. Similar to your NSE/r-squared cutoffs above, can you provide a quantification of these problematic BIAS wells, something like in how many wells does the prediction not intersect the observation? What causes this BIAS, is it an unusually high section in the training period? I wonder if there is anything in the data preprocessing that plays into this issue.
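
For reference, the goodness-of-fit measures raised in the Line 135 and Line 176 comments can be written compactly. The following is a minimal sketch using the standard textbook definitions (Nash-Sutcliffe efficiency, squared Pearson correlation as R², and mean error as bias). The function names and the toy values are illustrative assumptions and are not taken from the manuscript or its code. The example shows how a simulation can reproduce the pattern of the observations perfectly (R² of 1) and still score a poor NSE because of a constant offset.

```python
def mean(values):
    return sum(values) / len(values)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - (squared error) / (variance of obs about its mean)."""
    obs_mean = mean(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - sse / sst

def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)

def bias(obs, sim):
    """Mean error, simulated minus observed."""
    return mean([s - o for o, s in zip(obs, sim)])

# Toy groundwater levels (hypothetical values, in metres).
obs = [10.0, 10.4, 10.9, 10.5, 10.1, 9.8]
# A simulation with the right dynamics but a constant 1 m offset.
sim_shifted = [o + 1.0 for o in obs]

print("R2  :", r_squared(obs, sim_shifted))
print("NSE :", nse(obs, sim_shifted))
print("Bias:", bias(obs, sim_shifted))
```

Because R² is insensitive to a constant offset while NSE is not, counting the wells where R² is high but NSE is low would be one simple way to quantify the biased cases asked about above.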

Citation: https://doi.org/10.5194/egusphere-2023-1836-RC2
- AC2: 'Reply on RC2', Mariana Gomez, 31 Oct 2023
  
  Dear Jonathan Frame,
  Thank you very much for your comments on the manuscript; we value your input and would like to incorporate the suggestions where suitable and in accordance with the current objective of the paper.
  “this paper would greatly benefit from additional description of the training procedure and evaluation when dealing with gap-filled or processed data. Another benefit would be further evaluation of the sensitivity of model performance to the gap filling and data processing measures.”
  We will certainly elaborate on the data pre-processing by including a more detailed description of the data exploration and gap-filling methods used before applying the CNN model. Regarding the training procedure and the sensitivity of model performance to the gap filling, we believe this would be valuable as future research. However, by using only time series with good data quality in terms of gap length and frequency, we seek to avoid a major influence of the data imputation approach. Therefore, we think that a sensitivity analysis is somewhat outside the scope of this study and could be the focus of a follow-up analysis.
  Line 44: These claims should be cited: “In terms of accuracy and calculation speed, the CNN models outperform the LSTM. NARX models performed, on average, better than CNN.”
  We agree that this needs to be cited.
  Line 47: “Most studies have successfully applied these techniques for GWL forecasting using only meteorological variables as inputs.” You might be interested in this paper: Gholizadeh et al., “Long short-term memory models to quantify long-term evolution of streamflow discharge and groundwater depth in Alabama” Science of The Total Environment Volume 901, 25 November 2023, 165884, where they did in fact include site geospatial characteristics to make predictions of wells that were held out from the training set (ungauged).
  Gholizadeh et al. 2023 applied an LSTM model including static inputs that refer to the aquifers' hydrogeology as an attempt to model ungauged locations. As the model does not include groundwater levels as inputs, the authors attribute the satisfactory model performance to input features such as hydraulic conductivity, soil depth, soil porosity, and maximum water content. These findings can contribute to the central discourse of the paper.
  Line 116: Can you please provide the total number (and percentage) of gap filled values referred to here: “To provide the CNN model with continuous time series, we performed a data imputation process through a Multiple Linear Regression (if enough dynamically similar wells based on the Euclidean Distance”. Can you also explain if these values were removed, or should be removed, from the loss during training, and also removed from the evaluation?
  Thank you for raising this point. We will include precise numbers regarding the data gaps and imputation. From the 505 groundwater level time series, 241 (48%) are complete, 254 (50%) have gaps of 2 consecutive values and 10 (2%) have gaps of 3 consecutive values. Overall, the time series have less than 5% gap-filled values. We did not remove them from the training phase since the number of filled values is not considerably high.
  Line 120: Similarly can you provide the total number (and percent) of data points that were modified as outliers described here: “We removed these anomalies by finding the highest slope in the cumulative sum”? Is this a standard approach? I don’t think this description is satisfactory. I see from your code that you identify these based on “initial point where the values increase by 0.5 of the standard deviation“. This is an important point that should be explained in the paper, as well as the decision to use this processing method.
  Only 28 wells were identified as having jumps/steps in the temporal record. The cumulative sum is commonly used to detect changes in the mean or variance along a time series; such changes are not regarded as outliers. Here, we intended to detect jumps/steps in the observed values that can hinder model training or confuse the model because of potential changes in the dynamics of the groundwater levels. The optimal fraction of the standard deviation was determined by trial and error, visually inspecting the detections and selecting the value that best fits most of the jumps. We will definitely include these explanations in the revised manuscript. Thank you very much for pointing this out.
  Line 135: Is it really necessary to give the equations for r-squared and NSE, as you don’t provide the equations for MSE or BIAS? There is also more unfamiliar calculations made in Tables 2 and 3 with no equations provided, and also the main CNN model is not described with equations. I guess I would suggest just removing equations 1 and 2, avoiding an asymmetry in descriptions, rather than adding equations for all the rest of the calculations.
  We agree to remove the equations to make the manuscript more consistent regarding the inclusion of equations.
  Line 176: “Occasionally, in poorly performing models, the pattern of the GWL observations has been generally learned but with a strong Bias.” This is a little concerning, and I think it would be worth describing in more detail. Similar to your NSE/r-squared cutoffs above, can you provide a quantification of these problematic BIAS wells, something like in how many wells does the prediction not intersect the observation? What causes this BIAS, is it an unusually high section in the training period? I wonder if there is anything in the data preprocessing that plays into this issue.
  This comment is closely related to the main concern raised in the first referee comment (RC1). We will address this issue in more detail by re-running the model at a weekly resolution, which we expect will improve model performance.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1836-AC2
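
The cumulative-sum idea mentioned in the reply to the Line 120 comment can be illustrated as follows. This is a minimal sketch and not the authors' code: it assumes a single step change, places it at the extreme of the cumulative sum of deviations from the overall mean, and reports the size of the shift in units of the series' standard deviation. The threshold of 0.5 echoes the fraction quoted by the referee; the function name `locate_step`, the decision rule, and the synthetic series are hypothetical.

```python
from statistics import mean, pstdev

def locate_step(values):
    """Return (start index of the second segment, shift in units of the std).

    Assumes the series is not constant, otherwise the std would be zero.
    """
    overall_mean = mean(values)
    # Cumulative sum of deviations from the overall mean.
    cusum, running = [], 0.0
    for v in values:
        running += v - overall_mean
        cusum.append(running)
    # The extreme of the cumulative sum marks the last point before the shift.
    # The final index is excluded so that both segments are non-empty.
    k = max(range(len(values) - 1), key=lambda i: abs(cusum[i]))
    shift = abs(mean(values[k + 1:]) - mean(values[:k + 1]))
    return k + 1, shift / pstdev(values)

# Synthetic record with a 2 m step after the sixth value (hypothetical data).
series = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 7.0, 7.1, 6.9, 7.2, 7.0, 6.8]
start, shift_in_std = locate_step(series)
is_jump = shift_in_std > 0.5  # assumed threshold, see lead-in

print("step starts at index", start, "with shift", round(shift_in_std, 2), "std")
```

A rule of this kind returns a candidate change point for any series, including purely noisy ones, so a visual check of the detections, as described in the reply, remains useful.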

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (14 Nov 2023) by Ralf Loritz

Dear Authors,

First of all, congratulations on composing a manuscript that has been well-received by both reviewers. Your detailed discussions with the reviewers and the proposed modifications will surely enhance the quality of your manuscript. I look forward to reading its revised version and to sending it again to the two reviewers.

Regarding the performance of your models, as discussed with Marvin Höge, you could think about adding a simple baseline model to your study. You could calculate the average groundwater levels from your training data at each site and for each week, and use these values as simple baseline models. These models would capture the seasonal dynamics of your data at each site, allowing for a comparison with your CNNs. This approach will help in understanding what additional insights the CNNs provide beyond just replicating the seasonal patterns.
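
The baseline described here can be sketched in a few lines. The example below assumes that each record is a (week of year, groundwater level) pair. The function names and the toy values are hypothetical, and the block is an illustration of the idea rather than the procedure used in the revised manuscript.

```python
from collections import defaultdict

def fit_weekly_climatology(train):
    """Average the training levels separately for each week of the year."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for week, level in train:
        sums[week] += level
        counts[week] += 1
    return {week: sums[week] / counts[week] for week in sums}

def predict(climatology, weeks):
    """Predict each test week with the mean training level of that week."""
    return [climatology[w] for w in weeks]

def nse(obs, sim):
    """Nash-Sutcliffe efficiency of the simulation against the observations."""
    obs_mean = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - sse / sst

# Two training years with three weeks each (toy data for one site).
train = [(1, 10.0), (2, 10.4), (3, 10.8), (1, 10.2), (2, 10.6), (3, 11.0)]
test_weeks = [1, 2, 3]
test_obs = [10.0, 10.6, 10.9]

climatology = fit_weekly_climatology(train)
baseline_sim = predict(climatology, test_weeks)
baseline_nse = nse(test_obs, baseline_sim)

print("baseline NSE:", round(baseline_nse, 3))
```

The NSE of a CNN at the same site could then be compared with `baseline_nse`; only the difference between the two reflects skill beyond the average seasonal cycle.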

Moreover, I am curious about the robustness of your networks. Training networks on small datasets often leads to significant performance variations due to random weight initialization. Was this a challenge in your study? Have you experimented with swapping the testing and training data (something in the direction of a k-fold cross-validation) to see how it influences your model's performance?

Finally, I reviewed your code and found it to be well-organized and user-friendly. Thank you very much! However, considering that most data from federal states are now commonly shared under an open license, particularly when freely downloadable, including this data would be beneficial. Uploading your data to your Zenodo repository would allow reviewers and potential readers to run your code much more efficiently, without the need to download data from some local authorities. This process can be time-consuming and introduces uncertainty regarding the specific selection of your data subsets. Additionally, it necessitates at least a basic understanding of the German language and I failed to find the download link at the website (but this might be my fault). Therefore, I recommend uploading the data from local authorities to Zenodo, if possible, to improve the reproducibility of your research. The senior co-authors of this manuscript are well connected and I think this could be resolved with an email or two. Please note that all my comments are just suggestions. Feel free to ignore them.

Best regards,

Ralf Loritz

AR by Mariana Gomez on behalf of the Authors (06 Feb 2024)

ED: Referee Nomination & Report Request started (10 Feb 2024) by Ralf Loritz

RR by Marvin Höge (20 Feb 2024)

Suggestions for revision or reasons for rejection

Overall, the manuscript shows improvements and most comments have been addressed (e.g. refined explanations and the figure added to the appendix) – yet, when it comes to the major points of concern, not always fully satisfactorily (see points below). I still think this paper can be a valuable contribution to the field of water research, but at the same time it should not oversell outcomes – in particular w.r.t. the analysis of correlation coefficients. That said, it is no flaw of a study to say that a particular method was applied but did not reveal clear relations. However, it has to be clearly stated. Then, the community learns more about the capability of the employed methods. I am open to recommending publication but not in its current form. I suggest another round of iteration to address the following points:

1) Most importantly, the results section needs to be partially rewritten:
l. 198-199:
“Although correlation coefficients are statistically significant.”
-> Were significance tests conducted to support this? If so please show the results (e.g. in an appendix section). Otherwise, please rephrase.
l. 200ff:
“The R2 increases as the distance to the coastline does, whereas the bias reduces.”
-> There is no clear trend visible
“The proportion of the most common land cover type in the study area (non-irrigated arable land) relates positively to model performance. Conversely, wells with a significant surrounding area of forest or high LAI display lower correlations. Sink and low relief…
-> The correlation coefficients to support these statements are in the area of +-0.1x. This is not sufficient to support these claims.
L 205ff:
“Stronger correlations, mainly negative, are found for the time series features. Overall, we found that autocorrelation reduces model performance.”
-> Also the results part of the time series features analysis has to be rephrased. First, the correlations might be stronger, but starting from essentially no correlation and reaching up to 0.52 maximally is no strong correlation. Therefore, formulations should be appropriate and reflect that relations are indicated but generally not clearly supported.

Overall, regarding the correlation coefficients it should be clearly stated that most features showed no clear correlation.

Reporting such results properly is not a bad thing because it shows other researchers what one can and cannot expect from the CNN + feature analysis procedure used in this paper. The reduction of considered features in Fig. 7 and the corresponding text is already an improvement. Yet, it has to be made clear that the conducted correlation analysis mostly showed no clear correlation. This does not undermine the content of the paper but is proper scientific conduct.
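
One way to back a statement about statistical significance is a permutation test on the correlation coefficient. The sketch below is an illustration only: the feature and performance values are made up, the function names are hypothetical, and the manuscript may rely on a different test.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical site feature and model performance (NSE) for ten wells.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
nse_scores = [0.82, 0.79, 0.80, 0.71, 0.74, 0.65, 0.66, 0.58, 0.60, 0.51]

r = pearson_r(feature, nse_scores)
p = permutation_p_value(feature, nse_scores)
print("r =", round(r, 3), "p =", round(p, 4))
```

Note that significance and strength are separate questions: with several hundred wells, a coefficient near 0.1 can be statistically significant and still explain very little of the variation in model performance.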

2) Fig. 6: Thanks for comparing the CNN results to a sinusoidal fit. Yet, first, I was referring to “a sine function-based model with a mean trend” as a (too) simple reference model. A simple sine function alone clearly cannot follow the observed pattern of GWL. The model I was referring to would look something like this: y(t) = m(t) + a*sin(bt+c) with time t and parameters a, b, c of the sinusoidal part. The “trend” part m(t) could be something like a moving average or regression model to describe a trend that depends on the precipitation of the last 9 months or so. With such an approach, I think this would really have led to a simple reference model that could also be reported in the appendix. The sole sinusoidal fit obviously cannot be a proper alternative.

Second, results for a reference model should have been reported for the same GW stations as shown in Fig. 6. It seems like only the (well-fitting) first panel of Fig. 6a was reported in the author’s reply while the other two panels in the reply do not match the original time series in Fig.6. I leave it to the authors to decide whether my original suggestion should be pursued.
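
The reference model y(t) = m(t) + a*sin(bt+c) can be fitted by ordinary least squares under two simplifying assumptions, neither of which comes from the referee report: the trend is taken as linear, m(t) = m0 + m1*t, and the angular frequency b is fixed to one cycle per year. Using a*sin(bt+c) = A*sin(bt) + B*cos(bt) with A = a*cos(c) and B = a*sin(c), the model becomes linear in (m0, m1, A, B). The function names and the synthetic data below are hypothetical.

```python
from math import sin, cos, pi, atan2, hypot

def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_trend_plus_sine(t, y, period):
    """Least-squares fit of y = m0 + m1*t + A*sin(bt) + B*cos(bt), b = 2*pi/period."""
    b = 2 * pi / period
    rows = [[1.0, ti, sin(b * ti), cos(b * ti)] for ti in t]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]
    m0, m1, A, B = solve_linear_system(xtx, xty)
    # Convert back to amplitude a and phase c of a*sin(bt + c).
    return m0, m1, hypot(A, B), atan2(B, A)

# Synthetic weekly series: linear decline plus an annual cycle of 52 weeks.
weeks = list(range(156))
levels = [20.0 - 0.01 * w + 0.5 * sin(2 * pi * w / 52 + 0.3) for w in weeks]

m0, m1, amplitude, phase = fit_trend_plus_sine(weeks, levels, period=52)
print(round(m0, 3), round(m1, 4), round(amplitude, 3), round(phase, 3))
```

Replacing the linear term with a moving average or a regression on antecedent precipitation, as suggested above, would change only the columns of the design matrix.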

3) L 224-227: “To better interpret the non-linear behaviour between groundwater and its influencing factors, some studies applied explainable AI techniques (Chakraborty et al., 2021; Zhang et al., 2023). However, this implies including all the analyzed features as inputs on the model, most of which correspond to static data (as regionally accessible information) that might not add value to the sequential model.”

-> These are newly added sentences. The second sentence is not clear and sounds like a justification of why xAI methods are not applied after drawing attention towards them in the first sentence. I think it is fine that this topic is beyond the scope of this paper but I suggest not to add it to the Discussion section then. This could go into the Conclusions/Outlook section, saying that such techniques might be able to shed more light on relations that could not be identified via the employed methods here.

RR by Anonymous Referee #3 (15 Mar 2024)

Suggestions for revision or reasons for rejection

Review of Gomez et al. Performance assessment of geospatial and time series features on groundwater level forecasting with deep learning

Generally, I enjoyed reading this manuscript and learning about the relevant topic of predicting groundwater levels using meteorological input variables and convolutional neural networks. Although I joined the revision only in the second round, which I see is not ideal as it brings up new views, I still decided to take an unbiased view of the manuscript and provide feedback mainly from that perspective. I think some major improvements could still be made to increase clarity for readers and to provide a bit more context for the model results. I recommend publication after incorporating the comments.
General comments
- The abstract does not really flow well, some terms and sentences are not clear, please revise to improve clarity. The temporal data resolution should be added as well. The clarity is also an issue in the main text, I indicated some examples, but please generally check the clarity as this is really important for the readers
- This is also the case for the title, which does not really match the manuscript. From my reading, you did not investigate performance of the features but linked performances to the features which is different in my view. Please revise
- I am not sure why the authors train one model per site instead of one model for all sites which could perform much more robustly and avoid overfitting? Also seeing in the context of the current comment by Kratzert et al. on improved machine learning models when training on multiple catchments, could you comment on that, please? The motivation for such an approach should be presented to the reader.
- The authors do not discuss their model performance in the context of overfitting, a typical issue in highly parameterized machine learning applications. They did not present performance on the training and validation data, nor reflect on potential mismatches with performance on the test data. Overfitting can also cause low model performance on the test data; however, the authors only discuss poor model performance in relation to physical and time series properties. I am also missing information on the time periods covered per station, in addition to the time series lengths, to better understand the variable coverage of training, validation and testing periods. The reader is left with incomplete information to fully interpret and use the presented results.
- I agree with a previous comment by Marvin Höge, who brought up the point of interpreting model performance and suggested comparing CNN performance with a simple model (e.g. sinusoidal) to see how much a complex CNN actually adds to the predictability. I did not see that the authors really addressed this issue, and the response partly missed it.
- In relation also to the above issue: How much does your result depend on the selection of “relevant” features? In the discussion you wrote that it is hard to separate the effects, but actually maybe you would also miss effects. Does it also depend on the data quality, length of training, validation, testing periods? What about trends?
- I assume Pearson correlation might not be appropriate; I would not expect linear relationships between strongly skewed and partly intercorrelated variables. It might also be good to provide an overview (and the data, see next comment) of the represented feature values.
- Please add more complete information on the data used in this study, e.g. provide links to the download pages, potential download criteria, download dates, versions etc. The code repository is great; however, it is unfortunately still not fully reproducible without the input files. Ideally, could you also include the data in the code repository? This would strongly increase the reproducibility of your results, support potential follow-up studies and be highly appreciated. At the least, it should be possible to include the processed geospatial and time series features data and the metadata of the included stations.
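
The concern about linearity can be illustrated by comparing Pearson's coefficient with Spearman's rank correlation, which only assumes a monotonic relationship. The data below are synthetic and the helper names are hypothetical; tied values receive the average of their ranks.

```python
from math import sqrt, exp

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the current one.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A strongly skewed, monotonic but non-linear relationship (synthetic).
feature = [float(i) for i in range(1, 11)]
performance = [exp(v) for v in feature]

r = pearson_r(feature, performance)
rho = spearman_rho(feature, performance)
print("Pearson r =", round(r, 3), " Spearman rho =", round(rho, 3))
```

Here the rank correlation is perfect while Pearson's coefficient is clearly lower, which is the kind of discrepancy a scatter plot of each feature against model performance would also reveal.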

Abstract
L5 This is not clear “Likely causalities of this discrepancy”
L6 how do you quantify the “effects of …” This term does not match with what was done
L11 why do you use Pearson correlation, are you assuming a linear relationship? I think this is very unlikely actually, have you checked the relationships visually also?
L14 “exhibit better metrics” this is confusing as metrics could be anything and you are probably referring to model performance, so I suggest to also write that.
L16 “external physical factors” not matching time series properties
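
The concern about Pearson correlation on skewed variables can be illustrated with a small sketch. The data are synthetic and the variable names are hypothetical; the only point is that a perfectly monotonic but non-linear relationship yields a rank (Spearman) correlation of 1 while the Pearson coefficient is noticeably lower.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical feature and a strongly skewed, monotonically related score.
feature = [float(v) for v in range(1, 11)]
score = [math.exp(v) for v in feature]

r_pearson = pearson(feature, score)
r_spearman = spearman(feature, score)
print(f"Pearson: {r_pearson:.2f}, Spearman: {r_spearman:.2f}")
```

A visual check of the scatterplots remains the more direct way to see whether either coefficient is a fair summary.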

Introduction
L20ff I think the authors should try to cite peer-reviewed publications instead of citing a UNESCO report (which is also not fully referenced in the literature) for the first 3 sentences. Moreover, I find it a bit irritating that the authors start their statement with water use, although they want to model groundwater levels. I think this paragraph could be much improved.
L25 This seems incorrect “approaches based on groundwater observation sites”
L49 Is it necessary to cite the preprint? I think it is the same publication as Wunsch et al 2021. Also for Wunsch et al. 2022 the link leads to the preprint not the actual published paper, please revise your references in that sense.
L65 what are “relevant“ features? Please, revise.

L84 “through”? Please provide the exact information to access the data, ideally an open data portal where the analyzed data is easily accessible
L85 Is the data directly measured monthly or are these already aggregated values over smaller time intervals?
L95 what is the “soil moisture index (SMI)”?
L102ff previously you mentioned that meteorological variables were provided by the State authority. Please, check
L111 It is unclear what “resampling” means exactly. Please be specific
L117 It would be good to provide the versions used, especially for tensorflow
L122 “exclude wells under strong anthropogenic influences such as pumping” How did you do that? This requires an explanation. This is also highly relevant for your later interpretations of anthropogenic impacts.
L125ff This is not clear; please add what you consider as similar and how MLR was done. It would also be good to provide references to the methods that you used, e.g. PCHIP. Why piecewise? What does it mean, and why is it not used for all gaps? There is also a typo: “Otherwise”
L133 “3 x 3 pixels“ does that mean you use meteorological input of 15 by 15 km given a resolution of 5km? Isn’t that quite large?
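
The scale question raised for L133 comes down to simple arithmetic, sketched below. The 5 km resolution and the 3 x 3 window are taken from the comment; how the window is actually used is not stated in this discussion, so the averaging step, the function names and the synthetic grid are assumptions for illustration only.

```python
def footprint_km(n_pixels, resolution_km):
    """Side length in km of a square window of n_pixels cells."""
    return n_pixels * resolution_km

def window_mean(grid, row, col, half_width=1):
    """Mean of the (2*half_width+1)^2 cells centred on (row, col), clipped at the grid edges."""
    r0, r1 = max(0, row - half_width), min(len(grid), row + half_width + 1)
    c0, c1 = max(0, col - half_width), min(len(grid[0]), col + half_width + 1)
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(cells) / len(cells)

# Synthetic 5 x 5 field of a meteorological variable on a 5 km grid (hypothetical values).
grid = [[float(r * 5 + c) for c in range(5)] for r in range(5)]

side = footprint_km(3, 5.0)            # 3 pixels of 5 km each
centre_mean = window_mean(grid, 2, 2)  # well located in the central cell
print(f"window covers {side:.0f} km x {side:.0f} km, mean value {centre_mean}")
```

Under these assumptions each well would see meteorological input aggregated over 15 km by 15 km, which is the scale the comment asks the authors to state explicitly.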

L135ff Please provide more information instead of mainly referring to Wunsch et al. 2022. As this is your main methodology, these steps should be clear. Also I have not seen any introduction of a 1D-CNN

L161 Why 1 km radius? What are relevant categories?
L164 which metrics? How do you evaluate the added value?
L151 what are sub-sequences?
L187 “where the density of wells is higher.” Does that matter if you are training one independent model per well?
L189 I cannot see or confirm that. I would suggest adding a scatterplot of observed versus simulated values across models to the supplement
L193 “local variations from the main seasonal behavior are ignored.“ not clear to me what this means
L209 “higher mean variance“? This was not how stability was defined, I think
L229 “are better observed at a weekly or even daily temporal resolution instead of the monthly time step” so what does that mean for your study? It is not clear what you want to say with this information. Actually, the paragraph started with uncertainties, but the arguments presented are not linked back to this issue.
L250 I am not sure what you mean here: “understanding the influence of geospatial and temporal features related to the GWL” This seems grammatically incomplete. Please revise. Also, out of curiosity, wouldn’t the best model also be best suited for “understanding the influences of … “, i.e. the network structure with optimally tuned hyperparameters?

L260 I do not understand “Every model that could not correctly learn from meteorological inputs might be treated independently.” What does “correctly” mean, and what do you mean by “treat independently”?
L276 I think such an interpretation is critical if most of the area is non-irrigated – how many wells are within irrigated land actually? See my major comment
L290 “variability in climate“ fits better in my view or do you refer to climate change? Then I would not agree though.
L293 “trends“ I would be careful with this term - again variability fits better. I don’t think you could extrapolate into trends as the model was not trained for this and it is also not the issue discussed here.
L300 Would a highly seasonal time series be considered as complex? Maybe the term complex is confusing, as I would envision a more erratic and highly irregular pattern, not a regular variability.
L376 “Fernando Nogueira“? Please check

Figures and Tables
Fig. 1 there is a typo in the caption of ”Münsteländer Kreidebecken”. Also I would suggest to add the English names to the caption in addition to the legend so that readers do not have to speculate on the translations.
Fig. 3 I miss a bit of information in the caption. What are the sources in a)-c)? Which year of CORINE? Especially given fractions calculated and described in Tab. 2 I am not sure what is shown here, what does “associated” mean here?
Tab. 1: As source, please also provide the reference/link to the data, not only the provider, and potentially also the access date, version etc. to increase the reproducibility
Tab. 2: “up to … km“ do you mean that all distances larger than that were set to the limit?
Tab. 3: Stability – what is the window size? Fourier power spectral density - Why should it be annual climate variability, while it is actually groundwater variability?
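
The questions about stability and its window size can be made concrete with a sketch. The manuscript's definition is not reproduced in this discussion, so the code assumes a common one from time series feature packages: the variance of the means of consecutive non-overlapping windows. The function name, the 12-month window and both series are hypothetical.

```python
import math

def stability(series, window):
    """Sample variance of the means of consecutive non-overlapping windows (assumed definition)."""
    n_windows = len(series) // window
    means = [sum(series[i * window:(i + 1) * window]) / window for i in range(n_windows)]
    grand_mean = sum(means) / n_windows
    return sum((m - grand_mean) ** 2 for m in means) / (n_windows - 1)

# Purely seasonal series: every 12-month window has the same mean.
seasonal = [math.sin(2.0 * math.pi * t / 12.0) for t in range(120)]

# Steadily rising series: the window means drift from year to year.
rising = [0.1 * t for t in range(120)]

print(f"seasonal: {stability(seasonal, 12):.3g}, rising: {stability(rising, 12):.3g}")
```

Under this assumed definition a larger value means the level shifts more between windows, so a "high stability" score describes a less stable series, and the result also changes with the chosen window length.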

References
Kratzert, F., Gauch, M., Klotz, D., and Nearing, G.: HESS Opinions: Never train an LSTM on a single basin, Hydrol. Earth Syst. Sci. Discuss. [preprint], https://doi.org/10.5194/hess-2023-275, in review, 2024.

ED: Reconsider after major revisions (further review by editor and referees) (26 Mar 2024) by Ralf Loritz

AR by Mariana Gomez on behalf of the Authors (26 Jun 2024)

ED: Referee Nomination & Report Request started (28 Jun 2024) by Ralf Loritz

RR by Anonymous Referee #3 (08 Aug 2024)

Suggestions for revision or reasons for rejection

Dear authors,
thank you for the revised manuscript, in which you carefully addressed the comments. I recommend publication of the study in HESS after some remaining smaller comments in response to the revision.
- Baseline model is presented in Figure 6 but is not reflected in the text. It needs a small part in the methods and the results/discussion to be fully integrated.
- L. 127ff The procedure for selecting wells without anthropogenic impact is unfortunately still not very clear to me. The given reference Wriedt et al. (2020) is not very clear on that point either. If possible, please provide more detailed information, as this is a really important issue that many studies face and others could learn from your procedure. Did you manually check each time series? What metadata is needed to judge this? E.g. Wriedt et al 2020 wrote that “most of the excluded wells were within deeper aquifers”.
- L. 135ff Thank you for the expansion on the gap filling procedure. One point remains unclear: Euclidean distance of time series … ? Please add between which variables the distance is calculated as this makes a big difference, e.g. “absolute GWL” or somehow transformed GWL? If standardized GWL, how were they standardized?
- L 142: Thanks for the clarification. I think it makes sense to add “(15km*15km)” or “pixels (5km*5km each)” to make the scale of the meteorological data clear, as this information is scattered across the text above.
- L230 and Table 3: Thank you for addressing this comment. I am still not convinced that “high stability” should be equal to “high variance”, as for me higher stability means low variability. It is thus counterintuitive to me. Could you please expand on this in Table 3 if appropriate? For example, in the case of the power spectrum the right column provides information on what high versus low values actually mean. That would be helpful here as well, esp. with this counterintuitive definition. Please also check whether this is reflected in the text, as e.g. the correlation coefficient might be interpreted in the reverse way.
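
Why the choice of variable matters for the Euclidean distance mentioned in the gap-filling comment above can be shown with a short sketch. The two wells, their levels and the z-score standardization are hypothetical; the manuscript's actual choice is exactly what the comment asks the authors to state.

```python
import math

def zscore(series):
    """Standardize a series to zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]

def euclidean(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two synthetic wells with identical seasonal timing but different absolute level and amplitude.
well_a = [50.0 + 0.5 * math.sin(2.0 * math.pi * t / 12.0) for t in range(48)]
well_b = [12.0 + 1.5 * math.sin(2.0 * math.pi * t / 12.0) for t in range(48)]

d_abs = euclidean(well_a, well_b)                  # dominated by the level offset
d_std = euclidean(zscore(well_a), zscore(well_b))  # compares the dynamics only
print(f"absolute GWL: {d_abs:.1f}, standardized GWL: {d_std:.2e}")
```

On absolute levels these two wells look very dissimilar, while on standardized levels they are practically identical, so the set of "similar" wells used for gap filling depends directly on this choice.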

ED: Publish as is (16 Aug 2024) by Ralf Loritz

AR by Mariana Gomez on behalf of the Authors (20 Aug 2024)

Short summary

To understand the impact of external factors on groundwater level modelling using a 1-D convolutional neural network (CNN) model, we train, validate, and tune individual CNN models for 505 wells distributed across Lower Saxony, Germany. We then evaluate the performance of these models against available geospatial and time series features. This study provides new insights into the relationship between these factors and the accuracy of groundwater modelling.