Review of Gomez et al.: “Performance assessment of geospatial and time series features on groundwater level forecasting with deep learning”
Generally, I enjoyed reading this manuscript and learning about the relevant topic of predicting groundwater levels using meteorological input variables and convolutional neural networks. I joined the review only in the second round, and I see that this is not ideal as it brings up new views; nevertheless, I decided to take an unbiased view of the manuscript and provide feedback mainly from that perspective. I think some major improvements could still be made to increase clarity for the readers and to provide a bit more context for the model results. I recommend publication after the comments have been incorporated.
General comments
- The abstract does not flow well, and some terms and sentences are unclear; please revise to improve clarity. The temporal data resolution should be added as well. Clarity is also an issue in the main text. I indicated some examples, but please check the clarity throughout, as this is really important for the readers.
- This is also the case for the title, which does not really match the manuscript. From my reading, you did not investigate the performance of the features but linked model performance to the features, which is different in my view. Please revise.
- I am not sure why the authors train one model per site instead of one model for all sites, which could perform much more robustly and avoid overfitting. Also, in light of the recent comment by Kratzert et al. (2024) on improved machine learning models when training on multiple catchments, could you comment on that, please? The motivation for such an approach should be presented to the reader.
- The authors do not discuss their model performance in the context of overfitting, a typical issue in highly parameterized machine learning applications. They did not present performance for the training and validation periods, nor reflect on potential mismatches with the performance on the test data. Overfitting can also cause low model performance on the test data; however, the authors only discuss poor model performance in relation to physical and time series properties. I am also missing information on the time periods covered per station, in addition to the time series lengths, to better understand the variable coverage of the training, validation and testing periods. The reader is left with incomplete information to fully interpret and use the presented results.
- I agree with a previous comment by Marvin Höge, who brought up the point of interpreting model performance and suggested comparing CNN performance with a simple model (e.g. sinusoidal) to see how much a complex CNN actually adds to the predictability. I did not see that the authors really addressed this issue, and the response partly missed the point.
- Also in relation to the above issue: How much does your result depend on the selection of “relevant” features? In the discussion you wrote that it is hard to separate the effects, but you might actually also miss effects. Does it also depend on the data quality and on the length of the training, validation and testing periods? What about trends?
- I assume Pearson correlation might not be appropriate; I would not expect a linear relationship between strongly skewed and partly intercorrelated variables. It might also be good to provide an overview (and the data, see next comment) of the represented feature values.
- Please add more complete information on the data used in this study, e.g. provide links to the download pages, potential download criteria, download dates, versions etc. The code repository is great; however, it is unfortunately still not fully reproducible without the input files. Ideally, could you also include the data in the code repository? This would strongly increase the reproducibility of your results, support potential follow-up studies and be highly appreciated. At the least, it should be possible to include the processed geospatial and time series feature data and the metadata of the included stations.
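To make the suggested comparison with a simple model concrete, the following is a minimal sketch of what I have in mind, not a prescription and not the authors' method. It fits an annual sinusoid by least squares on a training period and scores it on a held-out test period. The use of the Nash-Sutcliffe efficiency, the 80/20 chronological split, the function names and the synthetic series are all my own assumptions for illustration; the authors would substitute their own metrics, splits and observed groundwater levels.

```python
import numpy as np

OMEGA = 2.0 * np.pi / 12.0  # annual cycle, assuming monthly time steps


def design_matrix(t_months):
    """Columns: intercept, sine and cosine of the annual cycle."""
    t = np.asarray(t_months, dtype=float)
    return np.column_stack([np.ones_like(t), np.sin(OMEGA * t), np.cos(OMEGA * t)])


def fit_sinusoid(t_months, gwl):
    """Ordinary least-squares fit of the annual sinusoid."""
    y = np.asarray(gwl, dtype=float)
    coef, *_ = np.linalg.lstsq(design_matrix(t_months), y, rcond=None)
    return coef


def predict_sinusoid(coef, t_months):
    """Evaluate the fitted sinusoid at the given time steps."""
    return design_matrix(t_months) @ coef


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 equals predicting the mean of obs."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


# Synthetic monthly series standing in for one well (illustrative, not real data).
rng = np.random.default_rng(0)
t = np.arange(240)
gwl = 50.0 + 0.8 * np.sin(OMEGA * t + 0.5) + rng.normal(0.0, 0.2, t.size)

# Hypothetical 80/20 chronological split.
train, test = slice(0, 192), slice(192, 240)
coef = fit_sinusoid(t[train], gwl[train])
baseline_nse = nse(gwl[test], predict_sinusoid(coef, t[test]))
print(f"Sinusoidal baseline NSE on test period: {baseline_nse:.2f}")
```

Reporting such a baseline score next to the CNN score for each well would show directly how much skill the CNN adds beyond the seasonal cycle.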
Abstract
L5 This is not clear “Likely causalities of this discrepancy”
L6 How do you quantify the “effects of …”? This term does not match what was done.
L11 Why do you use Pearson correlation? Are you assuming a linear relationship? I think this is actually very unlikely. Have you also checked the relationships visually?
L14 “exhibit better metrics” this is confusing as metrics could be anything and you are probably referring to model performance, so I suggest to also write that.
L16 “external physical factors” not matching time series properties
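As an illustration of the concern about Pearson correlation, here is a small sketch on synthetic data, under my own assumptions: a strongly skewed (log-normal) feature that is related to a score monotonically but not linearly. The variable names are hypothetical and the numbers have nothing to do with the manuscript's data. The rank-based helper assumes there are no ties; for real data with ties, a library routine such as `scipy.stats.spearmanr` would be the safer choice.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient (measures linear association)."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman_rho(x, y):
    """Spearman rank correlation as Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson_r(rank_x, rank_y)


# Synthetic example: skewed feature, monotonic but non-linear relationship.
rng = np.random.default_rng(1)
feature = rng.lognormal(mean=0.0, sigma=1.5, size=300)
score = np.log(feature) + rng.normal(0.0, 0.1, size=300)

r = pearson_r(feature, score)
rho = spearman_rho(feature, score)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

In a case like this the rank-based coefficient captures the monotonic dependence, while the Pearson coefficient understates it; a scatterplot of each feature against model performance would show which situation applies.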
Introduction
L20ff I think the authors should try to cite peer-reviewed publications instead of citing a UNESCO report (which is also not fully referenced in the literature) for the first 3 sentences. Moreover, I find it a bit irritating that the authors start their statement with water use, although they want to model groundwater levels. I think this paragraph could be largely improved.
L25 This seems incorrect “approaches based on groundwater observation sites”
L49 Is it necessary to cite the preprint? I think it is the same publication as Wunsch et al. 2021. Also, for Wunsch et al. 2022 the link leads to the preprint, not the actual published paper. Please revise your references accordingly.
L65 what are “relevant“ features? Please, revise.
L84 “through”? Please provide the exact information to access the data, ideally an open data portal where the analyzed data are easily accessible.
L85 Is the data directly measured monthly or are these already aggregated values over smaller time intervals?
L95 what is the “soil moisture index (SMI)”?
L102ff Previously you mentioned that meteorological variables were provided by the State authority. Please check.
L111 It is unclear what “resampling” means exactly. Please be specific.
L117 It would be good to provide the versions used, especially for TensorFlow.
L122 “exclude wells under strong anthropogenic influences such as pumping” How did you do that? This requires an explanation. This is also highly relevant for your later interpretations of anthropogenic impacts.
L125ff This is not clear. Please add what you consider as similar and how MLR was done. It would also be good to provide references to the methods that you used, e.g. PCHIP. Why piecewise? What does it mean, and why is it not used for all gaps? There is also a typo “Otherwise”.
L133 “3 x 3 pixels“ does that mean you use meteorological input of 15 by 15 km given a resolution of 5km? Isn’t that quite large?
L135ff Please provide more information instead of mainly referring to Wunsch et al. 2022. As this is your main methodology, these steps should be clear. Also, I have not seen any introduction of a 1D-CNN.
L161 Why 1 km radius? What are relevant categories?
L164 which metrics? How do you evaluate the added value?
L151 What are sub-sequences?
L187 “where the density of wells is higher.” Does that matter if you are training one independent model per well?
L189 I cannot see or confirm that. I would suggest adding a scatterplot of observed versus simulated values across models to the supplement.
L193 “local variations from the main seasonal behavior are ignored.“ not clear to me what this means
L209 “higher mean variance“? This was not how stability was defined, I think
L229 “are better observed at a weekly or even daily temporal resolution instead of the monthly time step” so what does that mean for your study? It is not clear what you want to say with this information. Actually, the paragraph started with uncertainties, but the arguments presented are not linked back to this issue.
L250 I am not sure what you mean here: “understanding the influence of geospatial and temporal features related to the GWL”. This seems grammatically incomplete. Please revise. Also, out of curiosity, wouldn’t the best model also be best suited for “understanding the influences of … “, i.e. the network structure with optimally tuned hyperparameters?
L260 I do not understand “Every model that could not correctly learn from meteorological inputs might be treated independently.” What is “correctly”, and what do you mean by “treat independently”?
L276 I think such an interpretation is critical if most of the area is non-irrigated. How many wells are actually within irrigated land? See my major comment.
L290 “variability in climate“ fits better in my view or do you refer to climate change? Then I would not agree though.
L293 “trends“ I would be careful with this term; again, variability fits better. I don’t think you could extrapolate into trends, as the model was not trained for this, and it is also not the issue discussed here.
L300 Would a highly seasonal time series be considered as complex? Maybe the term complex is confusing, as I would envision a more erratic and highly irregular pattern, not a regular variability.
L376 “Fernando Nogueira“? Please check
Figures and Tables
Fig. 1 there is a typo in the caption of ”Münsteländer Kreidebecken”. Also I would suggest to add the English names to the caption in addition to the legend so that readers do not have to speculate on the translations.
Fig. 3 I miss a bit of information in the caption. What are the sources in a)-c)? Which year of CORINE? Especially given fractions calculated and described in Tab. 2 I am not sure what is shown here, what does “associated” mean here?
Tab. 1: As source, I would like the authors to also provide the reference/link to the data, not only the provider, and potentially also the access date, version etc. to increase the reproducibility.
Tab. 2: “up to … km“ do you mean that all distances larger than that were set to the limit?
Tab. 3: Stability – what is the window size? Fourier power spectral density - Why should it be annual climate variability, while it is actually groundwater variability?
References
Kratzert, F., Gauch, M., Klotz, D., and Nearing, G.: HESS Opinions: Never train an LSTM on a single basin, Hydrol. Earth Syst. Sci. Discuss. [preprint], https://doi.org/10.5194/hess-2023-275, in review, 2024.