Data-driven modelling of hydraulic-head time series: results and lessons learned from the 2022 Groundwater Time Series Modelling Challenge
Raoul A. Collenteur
Ezra Haaf
Mark Bakker
Tanja Liesch
Andreas Wunsch
Jenny Soonthornrangsan
Jeremy White
Nick Martin
Rui Hugman
Ed de Sousa
Didier Vanden Berghe
Xinyang Fan
Tim J. Peterson
Jānis Bikše
Antoine Di Ciacca
Xinyue Wang
Yang Zheng
Maximilian Nölscher
Julian Koch
Raphael Schneider
Nikolas Benavides Höglund
Sivarama Krishna Reddy Chidepudi
Abel Henriot
Nicolas Massei
Abderrahim Jardani
Max Gustav Rudolph
Amir Rouhani
J. Jaime Gómez-Hernández
Seifeddine Jomaa
Anna Pölz
Tim Franken
Morteza Behbooei
Jimmy Lin
Rojin Meysami
Interactive discussion
Status: closed
RC1: 'Comment on hess-2024-111', Anonymous Referee #1, 25 Jun 2024
The paper presents a modeling challenge performed by 15 teams from different institutions to reproduce the temporal evolution of hydraulic heads at four monitoring wells, based on provided meteorological data and a calibration time window with previously observed heads. The teams adopt different methods, with a large predominance of methods based on artificial intelligence (AI). I find this experiment of much interest for the hydrology community, and particularly timely considering the increasing and widespread use of AI. For this reason, I think that the paper fits the quality standard of HESS and I recommend it for publication. I only have a few minor comments for the authors.
Comments to authors
The method descriptions in Sections 3.1.1 to 3.1.3 could be slightly expanded to better highlight the differences between the different methods within the same category.
It is not sufficiently clear whether any information about geology and setting (e.g., well depth) was provided to the teams.
Minor issues:
L1 and L51: “2022 groundwater time series modeling challenge”. I suggest putting this in italics or between quotes.
L70: “not allowed to use the observed head data itself as an explanatory variable.” Could the authors elaborate on this point? What kind of modeling would violate this rule?
L86: At this point in the reading, the calibration and validation periods have not been defined yet (except in the abstract), which might complicate the understanding of the sentence “calibrate the model without head measurements in the validation period”. I suggest reformulating this sentence.
L96: Are the descriptions in lines 97 to 121 the same as those provided to the participants?
Figure 1: Authors should provide references for the head time series in the text or in the figure.
Table 3: The team names are not indicative of the geographical provenance or of the participants. A connection between participant names and team names should be provided. The acronyms ML and DL should be defined in the caption.
Fig. 2 and 3: Do the box plots use quartiles or 20%-80% quantiles?
Citation: https://doi.org/10.5194/hess-2024-111-RC1
AC1: 'Reply on RC1', Raoul Collenteur, 10 Jul 2024
The paper presents a modeling challenge performed by 15 teams from different institutions to reproduce the temporal evolution of hydraulic heads at four monitoring wells, based on provided meteorological data and a calibration time window with previously observed heads. The teams adopt different methods, with a large predominance of methods based on artificial intelligence (AI). I find this experiment of much interest for the hydrology community, and particularly timely considering the increasing and widespread use of AI. For this reason, I think that the paper fits the quality standard of HESS and I recommend it for publication. I only have a few minor comments for the authors.
REPLY: Thank you for your kind words.
Comments to authors
The method descriptions in Sections 3.1.1 to 3.1.3 could be slightly expanded to better highlight the differences between the different methods within the same category.
REPLY: We understand this comment. It is obviously not possible to describe all 15 models in detail, so we had to summarize. But we can certainly try to provide some additional information in the paper. Please note that all model input files are available on GitHub.
It is not sufficiently clear whether any information about geology and setting (e.g., well depth) was provided to the teams.
REPLY: The description as presented in Section 2.2 (the four bullets describing the sites) is pretty much all the teams received regarding geology and setting. We will add a sentence explaining this, including a reference to the GitHub repository where all the information was provided.
Minor issues:
L1 and L51: “2022 groundwater time series modeling challenge”. I suggest putting this in italics or between quotes.
REPLY: That is a good idea. We need to see what the journal style allows.
L70: “not allowed to use the observed head data itself as an explanatory variable.” Could the authors elaborate on this point? What kind of modeling would violate this rule?
REPLY: Here we are interested in investigating which stresses can explain the head variations, and if the head is used as a stress we don’t learn about that. In machine learning it is rather common to use the measured variable itself to predict what happens in the future. This makes a lot of sense, as the head tomorrow is probably similar to the head today. And if there was an upward trend in the past month, then there may still be an upward trend in the next month. In traditional groundwater modeling (e.g., MODFLOW, FEFLOW, etc.), this is not possible, of course, and the stresses (forcings) on the model, together with the physical principles that are in the model, are supposed to create the correct behavior. We will add a few words to clarify to the reader what is meant here and why.
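[Editor's note: a minimal sketch of the distinction, with entirely hypothetical data and column names, not any team's actual setup. It contrasts an allowed stress-only feature set with an autoregressive feature that would violate the challenge rule.]

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "precip": rng.gamma(2.0, 2.0, n),  # hypothetical daily precipitation (mm)
    "evap": rng.gamma(1.5, 1.0, n),    # hypothetical daily evaporation (mm)
})
# Synthetic head series driven by the two stresses, for illustration only.
df["head"] = (10 + 0.02 * df["precip"].rolling(90, min_periods=1).sum()
                 - 0.03 * df["evap"].rolling(90, min_periods=1).sum()
                 + rng.normal(0, 0.05, n))

# Allowed in the challenge: explain heads from external stresses only.
X_allowed = df[["precip", "evap"]]
model = LinearRegression().fit(X_allowed, df["head"])

# Not allowed: yesterday's observed head as an explanatory variable.
X_violation = X_allowed.assign(head_lag1=df["head"].shift(1)).iloc[1:]
```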
L86: At this point in the reading, the calibration and validation periods have not been defined yet (except in the abstract), which might complicate the understanding of the sentence “calibrate the model without head measurements in the validation period”. I suggest reformulating this sentence.
REPLY: Good point. We will modify the text.
L96: Are the descriptions in lines 97 to 121 the same as those provided to the participants?
REPLY: In essence, yes. But the wording was slightly different. We included a reference to the original wording (also in response to the earlier comment about geology and setting).
Figure 1: Authors should provide references for the head time series in the text or in the figure.
REPLY: We will add the origin of the head time series to the text.
Table 3: The team names are not indicative of the geographical provenance or of the participants. A connection between participant names and team names should be provided. The acronyms ML and DL should be defined in the caption.
REPLY: We will add a column with the number of the authors’ affiliation(s). The geographical location of the participants is already provided in the authors’ list, and it is mentioned in the text that two thirds of the teams come from continental Europe.
Fig. 2 and 3: Do the box plots use quartiles or 20%-80% quantiles?
REPLY: The box gives quartiles (25%-75%) and the whiskers are 1.5 times the interquartile range, as per default Matplotlib settings. This will be added to the text: “All box plots in this paper show the interquartile range and whiskers indicate 1.5 times the interquartile range”.
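[Editor's note: a minimal sketch of this convention on illustrative data. Matplotlib's default boxplot draws the box at the 25%–75% quartiles with whiskers at 1.5 times the interquartile range (whis=1.5 is the default), matching the description above.]

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
errors = rng.normal(0, 0.1, 200)  # hypothetical model errors (m)

fig, ax = plt.subplots()
ax.boxplot(errors, whis=1.5)      # explicit here, but 1.5 is the default
ax.set_ylabel("error (m)")
plt.show()
```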
Citation: https://doi.org/10.5194/hess-2024-111-AC1
RC2: 'Comment on hess-2024-111', Anonymous Referee #2, 25 Jun 2024
This manuscript is very interesting in presenting the results from the 2022 groundwater modeling challenge. The provided data and the evaluation of model results are well described. However, I suggest rejecting the manuscript for the following reasons:
(1) This manuscript is more like a summary (or technical report) of the groundwater modeling challenge. Scientific results and discussion are limited.
(2) The novelty of this work is limited both from the point of view of groundwater hydrology and from that of modeling. The machine learning models and deep learning models conducted in the manuscript are all classic algorithms. Furthermore, numerical models are not included.
(3) The details of the models (lumped-parameter, machine learning, deep learning) are not illustrated, which may be because they all used classic ones.
Citation: https://doi.org/10.5194/hess-2024-111-RC2
AC2: 'Reply on RC2', Raoul Collenteur, 10 Jul 2024
This manuscript is very interesting in presenting the results from the 2022 groundwater modeling challenge. The provided data and the evaluation of model results are well described.
REPLY: Thank you. Glad to hear you find the paper interesting and the results well described.
However, I suggest rejecting the manuscript for the following reasons:
(1) This manuscript is more like a summary (or technical report) of the groundwater modeling challenge. Scientific results and discussion are limited.
REPLY: Our paper is a scientific analysis and discussion of the challenge we organized. This is a very common format in the sciences; in hydrological research it is applied in, e.g., Jeannin et al. (2021), Holländer et al. (2009), and the “Battle” series that has been part of the WDSA/CCWI conferences since the 1980s.
References:
- Jeannin, P.-Y., Artigue, G., Butscher, C., Chang, Y., Charlier, J.-B., Duran, L., Gill, L., Hartmann, A., Johannet, A., Jourde, H., Kavousi, A., Liesch, T., Liu, Y., Lüthi, M., Malard, A., Mazzilli, N., Pardo-Igúzquiza, E., Thiéry, D., Reimann, T., Schuler, P., Wöhling, T., and Wunsch, A.: Karst modelling challenge 1: Results of hydrological modelling, Journal of Hydrology, 600, 126508, https://doi.org/10.1016/j.jhydrol.2021.126508, 2021.
- Holländer, H. M., Blume, T., Bormann, H., Buytaert, W., Chirico, G. B., Exbrayat, J.-F., Gustafsson, D., Hölzel, H., Kraft, P., Stamm, C., Stoll, S., Blöschl, G., and Flühler, H.: Comparative predictions of discharge from an artificial catchment (Chicken Creek) using sparse data, Hydrology and Earth System Sciences, 13, 2069–2094, https://doi.org/10.5194/hess-13-2069-2009, 2009.
- Battle of Water Networks, 3rd International Joint Conference on Water Distribution Systems Analysis & Computing and Control for the Water Industry (WDSA/CCWI), Ferrara, Italy, https://wdsa-ccwi2024.it/battle-of-water-networks/ (last accessed July 8, 2024).
(2) The novelty of this work is limited both from the point of view of groundwater hydrology and from that of modeling. The machine learning models and deep learning models conducted in the manuscript are all classic algorithms. Furthermore, numerical models are not included.
REPLY: The reviewer is correct that no new models are developed, but that is the whole purpose of a challenge: use your own favorite model to try to model the data series provided in the challenge. The performance of these models at different sites, and the way the teams applied these models, gives insight into both the capabilities of the different models and the teams that applied them. Traditional numerical models were not excluded, but were not submitted, as mentioned in the paper.
(3) The details of the models (lumped-parameter, machine learning, deep learning) are not illustrated, which may be because they all used classic ones.
REPLY: We obviously cannot describe the details of all 15 models, so we had to summarize, but we appreciate the desire for more details. All model input files are available on GitHub to reproduce the results. We will add additional descriptions and references, also in response to Reviewer #1.
Citation: https://doi.org/10.5194/hess-2024-111-AC2
RC3: 'Comment on hess-2024-111', Anonymous Referee #3, 26 Jun 2024
The paper presents a collaborative effort by 15 teams to compare the performance of different types of models to simulate groundwater heads at four boreholes. The paper is clearly written, and I think it is of interest for hydrogeological modelers. I recommend the publication of this paper after addressing the following points:
- In the introduction, the authors argue that modelling will increase our understanding of groundwater systems. Also, they mention that AI may result in new knowledge that may be used to improve empirical and process-based groundwater models. Unfortunately, the modelling outcome is not discussed within this context, and the hydro-geological characteristics of the aquifer systems hosting these boreholes are not inferred from these models and not discussed in the paper.
- Information regarding the structures of the models and how these reflect the hydro-geological settings should be included. I expect the lumped model structures and parameters to reflect the hydro-geological characteristics. If ML models are black boxes and nothing can be inferred from them, this should be explicitly mentioned in the paper and included in the discussion. It would be good to know the opinion of the Teams regarding the use of these models, as it poses a philosophical question regarding their use, especially in prediction mode.
- Please revise the text describing the data used for calibration and validation. Also, the terms validation and prediction are used interchangeably. It is stated that 10 years of data including groundwater levels are provided for calibration, and five years without GWLs are provided for validation. Is this meant to be for prediction? But later it is clear that GWLs for the five years are used for validation. Is the validation done by someone other than the Teams after the submission of model output? Please clarify.
- For the USA site, it is mentioned that the nearest surface water is approximately 6.8 km away. Later, it was found that the river has an important role in improving the performance of the models at this site. What is the magnitude of the river stage fluctuations? Do the porous medium's hydraulic characteristics justify the river controlling the GWLs at a borehole that is approximately 7 km away?
- If possible, can you please explain why additional engineered input data are needed for ML and DL models and why these models are not able to self-adjust to avoid the need for this additional input?
- Can you please rewrite or simplify the statement regarding the AI models and the lumped models in Section 5.1, Lines 302 and 306, as I find it confusing.
Citation: https://doi.org/10.5194/hess-2024-111-RC3
AC3: 'Reply on RC3', Raoul Collenteur, 10 Jul 2024
The paper presents a collaborative effort by 15 teams to compare the performance of different types of models to simulate groundwater heads at four boreholes. The paper is clearly written, and I think it is of interest for hydrogeological modelers.
REPLY: Thank you for your kind words.
I recommend the publication of this paper after addressing the following points:
In the introduction, the authors argue that modelling will increase our understanding of groundwater systems. Also, they mention that AI may result in new knowledge that may be used to improve empirical and process-based groundwater models. Unfortunately, the modelling outcome is not discussed within this context, and the hydro-geological characteristics of the aquifer systems hosting these boreholes are not inferred from these models and not discussed in the paper.
REPLY: In the Introduction we state that “Modeling makes such information explicit and increases our understanding of groundwater systems” with reference to Shapiro and Day-Lewis (2022), who argue for reframing groundwater hydrology as a data-driven science. And we cite several authors who argue that we can learn from, e.g., machine learning, to improve our groundwater models. Unfortunately, this does not mean that it is possible to infer, e.g., the aquifer characteristics of the studied boreholes from the models. The purpose of the challenge was to predict the head variation for the four years beyond the measured (provided) head data, and that is what the paper focuses on.
Information regarding the structures of the models and how these reflect the hydro-geological settings should be included. I expect the lumped model structures and parameters to reflect the hydro-geological characteristics. If ML models are black boxes and nothing can be inferred from them, this should be explicitly mentioned in the paper and included in the discussion. It would be good to know the opinion of the Teams regarding the use of these models, as it poses a philosophical question regarding their use, especially in prediction mode.
REPLY: We will expand the section on Model Types to better explain the relationship between the models and the hydro-geological setting. ML models are indeed often referred to as black-box models (although there are developments here in the context of explainable artificial intelligence (XAI) modeling; see, for example, Wunsch et al. (2024) and Jung et al. (2024)), but that does not mean, as we indicate, that we cannot learn anything from them. The results of the challenge clearly indicate that ML models can perform just as well as (or better than) the lumped-parameter models to simulate heads in the validation period. So there is an obvious use for ML models. In Section 5.1, we tried to summarize what we learn from the comparison of the different models. We will expand this section to discuss how the modeling choices may affect the results, which may lead to suboptimal models.
References:
- Wunsch, A., Liesch, T., and Goldscheider, N.: Towards understanding the influence of seasons on low-groundwater periods based on explainable machine learning, Hydrology and Earth System Sciences, 28, 2167–2178, https://doi.org/10.5194/hess-28-2167-2024, 2024.
- Jung, H., Saynisch-Wagner, J., and Schulz, S.: Can eXplainable AI Offer a New Perspective for Groundwater Recharge Estimation?—Global-Scale Modeling Using Neural Network, Water Resources Research, 60, e2023WR036360, https://doi.org/10.1029/2023WR036360, 2024.
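[Editor's note: a minimal sketch of the XAI direction mentioned in the reply above, on synthetic data and not the workflow of any challenge team. Permutation importance is one simple way to probe which inputs an otherwise black-box ML model relies on.]

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
# Hypothetical stress inputs: precipitation, evaporation, temperature.
X = rng.normal(size=(500, 3))
# Synthetic "head" that depends on the first two inputs only.
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["precip", "evap", "temp"], result.importances_mean):
    print(f"{name}: {imp:.3f}")  # "temp" should come out near zero
```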
Please revise the text describing the data used for calibration and validation. Also, the terms validation and prediction are used interchangeably. It is stated that 10 years of data including groundwater levels are provided for calibration, and five years without GWLs are provided for validation. Is this meant to be for prediction? But later it is clear that GWLs for the five years are used for validation. Is the validation done by someone other than the Teams after the submission of model output? Please clarify.
REPLY: We will clarify the text.
For the USA site, it is mentioned that the nearest surface water is approximately 6.8 km away. Later, it was found that the river has an important role in improving the performance of the models at this site. What is the magnitude of the river stage fluctuations? Do the porous medium's hydraulic characteristics justify the river controlling the GWLs at a borehole that is approximately 7 km away?
REPLY: We misstated that nearby surface water was 6.83 km away. That is the distance to the gauging station. The main river is only about 1.5 km away while the closest branches are around 500 m away. The paper will be modified accordingly.
If possible, can you please explain why additional engineered input data are needed for ML and DL models and why these models are not able to self-adjust to avoid the need for this additional input?
REPLY: We appreciate that a large part of the readership likely consists of ‘traditional’ groundwater modelers who are relatively unfamiliar with data-driven models, so we will spend a short paragraph explaining this in the expanded section on Model Types.
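[Editor's note: a minimal sketch of what such engineered inputs can look like, with hypothetical column names and window lengths. Because many ML/DL models see only the feature vector at a given time step, memory of past forcing and seasonality are often supplied explicitly, e.g. as rolling sums and cyclic encodings of the day of year.]

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=365 * 5, freq="D")
df = pd.DataFrame(
    {"precip": np.random.default_rng(3).gamma(2.0, 2.0, len(dates))},
    index=dates,
)

df["precip_30d"] = df["precip"].rolling(30).sum()    # monthly memory
df["precip_365d"] = df["precip"].rolling(365).sum()  # annual memory
doy = df.index.dayofyear
df["sin_doy"] = np.sin(2 * np.pi * doy / 365.25)     # seasonal encoding
df["cos_doy"] = np.cos(2 * np.pi * doy / 365.25)
```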
Can you please rewrite or simplify the statement regarding the AI models and the lumped models in Section 5.1, Lines 302 and 306, as I find it confusing.
REPLY: We will clarify the text.
Citation: https://doi.org/10.5194/hess-2024-111-AC3
CC1: 'Comment on hess-2024-111 especially on modelisations with Gardenia computer code', Dominique Thiéry, 02 Jul 2024
This is a very interesting paper about modeling groundwater hydraulic head time series.
While the results and conclusions are of significant interest, the modeling performed using BRGM’s Gardenia computer code presents a clear concern.
It appears that for the presented project, the users of this computer code have not employed the recommended standard method for modeling hydrological time series. The following issues have been identified. These issues are described in detail in the attached file.
It is our opinion that this very interesting and valuable paper should be modified to address the clear misuse of the model and the resulting discredit of the Gardenia model.
Dominique THIÉRY
CC2: 'Reply on CC1', Didier Vanden Berghe, 02 Jul 2024
Thank you Dominique for your comments. You are mentioning significant improvements thanks to better choices or taking into account additional functionalities (snow, river level, double reservoir). Would you mind sharing the files that led to those improvements? Regards, Didier
Citation: https://doi.org/10.5194/hess-2024-111-CC2
CC3: 'Reply on CC2', Dominique Thiéry, 04 Jul 2024
Dear Didier,
In the attached zip file, there are two CSV files for each well:
The calibration period: simulated and observed levels (“Calib” in the .csv name)
The validation period: simulated and observed levels (“Valid” in the .csv name)
A missing observed well level is coded as 9999.
CC4: 'Reply on CC3', Didier Vanden Berghe, 04 Jul 2024
Thank you Dominique. I was thinking more about the full Gardenia files, especially the .gar and .rga, as well as the specific input files regarding river level and snow melting/temperature effects. Regards, Didier
Citation: https://doi.org/10.5194/hess-2024-111-CC4
AC5: 'Reply on CC1', Raoul Collenteur, 10 Jul 2024
This is a very interesting paper about modeling groundwater hydraulic head time series. While the results and conclusions are of significant interest, the modeling performed using BRGM’s Gardenia computer code presents a clear concern.
It appears that for the presented project, the users of this computer code have not employed the recommended standard method for modeling hydrological time series. The following issues have been identified. These issues are described in detail in the attached file.
It is our opinion that this very interesting and valuable paper should be modified to address the clear misuse of the model and the resulting discredit of the Gardenia model.
REPLY: We thank Dominique Thiéry for his comment and we understand his concern. It is the risk of software developers that people apply their model suboptimally, which may result in an undeserved bad reputation. We want to avoid that, of course. We unfortunately cannot include your results in the challenge after the fact, as you undoubtedly understand; everybody was invited to submit results, the validation data was made public when we submitted the paper, and we analyzed all submitted results. What we can and will do is better emphasize in the paper that modeling results are the combined result of the model and the modeling team, and that it is entirely possible that other modeling teams are able to get better (or worse) results with the same method. Hence, poor performance of a method for a certain site does not necessarily reflect a deficiency of the method. We will also include a link to the discussion on the HESS website to indicate that, for example, the developers of the Gardenia model were able to obtain much better results (we discussed this with the Editor Thom Bogaard and he agreed that it is possible to put a reference to the Discussion in the paper).
Comments on the attached PDF:
The following issues have been identified:
- Manual calibration: The code's standard procedure involves automatic calibration. However, in this case, manual calibration was employed without any justification. It is not surprising that this deviation from the standard approach has resulted in inaccurate calibration.
- Omission of snowmelt module: Even for basins in snow-dominated climates like Sweden, the snowmelt module was not utilized. Consequently, the obtained results are of poor quality.
- Omission of double-reservoir schemes: Double-reservoir schemes are tailored for shallow water level time series, such as the “Netherlands” series. Their absence in this analysis has led to poor simulation of this time series.
- Disregard of river level integration: The standard feature in the Gardenia computer code for integrating river stage series was not utilized. Using this feature would have significantly improved the results for the "USA" series. The results presented, resulting from an inappropriate use, strongly discredit BRGM’s Gardenia calculation code, which is unacceptable.
REPLY: As explained above, modeling teams were free to use a method in whatever way they saw fit. We cannot (and do not want to) enforce application according to standards set by the developers. It is the risk any software developer runs when making software available to the general public. We will emphasize in the manuscript that (much) better results can be obtained by making changes to the model setup and calibration, and refer to your HESS comment.
We independently modeled the four hydraulic head time series using the data provided in the appendix and achieved satisfactory results: in the validation phase, the NSE coefficients obtained rank first or second for three out of four wells. The average validation NSE rank is 3.25, which is significantly better than the previously presented value of 10.25 (indicating poor performance). We understand that the paper presents the results from the “2022 groundwater modeling challenge”. However, it is our opinion, as the developers of the Gardenia computer code at BRGM, that this very interesting and valuable paper should be modified to address the clear misuse of the model and the resulting discredit of the Gardenia model.
REPLY: We don’t think “misuse” is an appropriate assessment. Looking at your results (after the challenge was over and the heads in the validation period were made available), a better assessment is “suboptimal”. The results of team Gardenia are somewhere in the middle of the pack, e.g., scoring 8th place for the Germany data in Figure 2. We refer to our previous and following replies on how we plan to handle your concern and highlight in the paper that the suboptimal performance is related to modeling choices and not the model.
Detailed comments:
Line 22: “for the well in the USA, where the lumped-parameter models did not use (or use to the full benefit) the provided river stage data”. The Gardenia lumped-parameter model can integrate the provided river stage as an “external influence”. Such an “external influence” is commonly used for the influence of nearby pumping, and also for the variation of river stage or river flow. Taking into account the river stage data for the USA well series significantly improved the NSE criterion during the calibration period: the NSE increased from 0.72 to 0.86. The sentence should be adapted: “most lumped-parameter models, except Gardenia, did not use…”
REPLY: Gardenia was not the only lumped-parameter model that didn’t include the river stage in the USA (also HydroSight), so we leave the sentence as is. We will note that including the river as a stress is an option for all lumped models and will probably improve the results.
Line 169: “Gardenia was manually calibrated by minimizing the NSE and visual interpretation.” This is not at all the correct way of using Gardenia. Gardenia, since its creation in 1977, has been implemented with an automatic calibration method, the Rosenbrock algorithm. Gardenia is distributed with a tutorial of more than 20 examples, each one with automatic calibration. Gardenia has been used to model aquifer levels (heads) or river flow at more than 1000 sites. It has never been calibrated manually. No wonder that calibrating the model manually leads to poor results. Our simulations obtained with automatic calibration (computation time between 5 and 10 seconds for the calibration of each well) and the corresponding NSE and MAE criteria will be provided in attached files.
REPLY: Teams are free to choose their method of calibration. They were not (and cannot be) forced to use a calibration procedure favored by the developers. For your information, and as an example, team “da_collective” also didn’t use the built-in parameter estimation procedure in the Pastas software. We will highlight that manual calibration is somewhat uncommon and that automatic calibration would probably improve the results for Gardenia, and refer to your HESS comment.
Figure 2: Nash-Sutcliffe Efficiency (NSE). The bar plots and ranking of Gardenia do not at all reflect the results obtained with a normal use of the model. Truly, this discredits this BRGM model (even if it is mentioned, line 211, that “none of the models consistently outperformed all other models”). Indeed, after a normal standard automatic calibration of the 4 wells on the calibration period, and then calculating the criteria on the validation period (where the observed heads were totally ignored during the calibration phase), we obtained very different results. Comparing our validation NSE to the NSE values (digitized) from Figure 2, our Gardenia validation-phase NSE:
Netherlands validation NSE = 0.873 => Rank = 1, instead of rank 10;
Germany validation NSE = 0.80 => Rank = 1 (or 2), instead of rank 8
Sweden validation NSE = 0.611 => Rank = 2, instead of rank 11
USA validation NSE = 0.862 => Rank = 9, instead of rank 12
Average Gardenia rank = 3.25, instead of rank 10.25 which would be fairly bad.
Gardenia rank = within the two best ranks for 3 wells out 4.
The true bar plot and rank numbers should be corrected in Figure 2 (and in Figure 4).
Figure 3: Mean Absolute Error (MAE)
Comparing our validation MAE to the MAE values (digitized) from Figure 3:
Our values of Validation MAE:
Netherlands = 0.057 => Approx rank = 3, instead of rank 9,
Germany = 0.10 => Approx rank = 4, instead of rank 10,
Sweden_2 = 0.383 => Approx rank = 2, instead of rank 11,
USA = 0.255 => Approx rank = 9, instead of rank 12
Average Gardenia rank = 4.5, instead of rank 10.5 which would be fairly bad.
The true bar plot and rank numbers must be corrected in this Figure 3.
Line 209: “Model performances generally decreased from the calibration…”
Just for information, from our Gardenia modeling, the average NSE for the 4 basins:
Calibration = 0.807, validation = 0.786 => very small decrease.
REPLY:
As previously stated, and we are sure you understand, we cannot include your results in the challenge after the challenge was over and the validation data and results were published (in HESS Discussions); in retrospect it is really unfortunate you didn’t submit your results during the challenge, as it would have been a worthwhile submission. We thank you for your clarification, which we will refer to in the paper. And we will make sure to emphasize that the results are a combination of the modeling team and the method, and that poor performance should not necessarily reflect on deficiencies of the method.
Line 220-224: “Performance of the lumped-parameter models substantially lower for the well in the USA”. In the sentence “The relatively low model performances for HydroSight and Gardenia here can probably be explained by the fact that river stage data was not used in these models, opposite to all other teams.”, the two words “and Gardenia” should be deleted, as using the river stage for the simulation of the USA well, which is standard in Gardenia, yields a very high NSE: 0.862 => Rank = 3 for validation, and a very high calibration NSE = 0.893.
REPLY: Teams were free to decide which stresses to use in their models (as clearly stated in the paper). We explain here the likely reason for the underperformance of both teams HydroSight and Gardenia. We will include that both HydroSight and Gardenia have the option to include river stages to emphasize that the lower performance was likely due to a modeling choice and not a weakness of the method.
Lines 223-226:
“Missing data and processes are likely also the reasons for the low model performance of the Gardenia model for the well in Sweden, i.e., it is the only model in the challenge that did not use temperature data. Temperature data for Sweden is important to account for the impact of snow processes on the heads.” This sentence must be deleted. As a matter of fact, since about 1977 Gardenia has been operational with a snow-melting module. It makes no sense to model a basin (or a well) subject every year to very long periods with negative temperatures without using the standard snow-melting module. (There are examples of this use in the tutorial provided with the code distribution.) (To our mind, in a lumped-parameter model equipped with a snow-melting module, disregarding temperature data in such a snow context is as inappropriate as disregarding potential evapotranspiration (PET) data or even precipitation data.) Using the standard snow-melt module, using temperature, for the Sweden_2 well yields satisfying NSEs: 0.611 => Rank = 2 for validation, and 0.777 for calibration.
REPLY: Again, teams were free to choose which stresses to include. We will explicitly state here that Gardenia has the option to include snow, but team Gardenia chose not to use this option.
Citation: https://doi.org/10.5194/hess-2024-111-AC5
RC4: 'Comment on hess-2024-111', Anonymous Referee #4, 04 Jul 2024
The paper presents the results of a benchmark study where different data-driven models are applied to selected groundwater head time series. The study presents a discussion of the various results collected and reviews the whole procedure of the proposed challenge.
I find the paper interesting and I particularly appreciated the effort of sharing difficulties and lessons learned in the organization and management of this comparison study. However, there are a few technical points that should be addressed, as listed below. Ultimately, in my view, the key weak point of this work is that it is hard to extract a take-home message related to the model performances that may be useful in the selection of an appropriate approach in another case not considered here. I think addressing the following points would help in resolving this weakness.
- The approaches compared likely differ in terms of parameterization, e.g. what is the number of calibration parameters to be optimized in the calibration phase? This information is not discussed properly, but is crucial for a fair comparison especially of calibration (training) performance. Model discrimination criteria could be used for these purposes (e.g., AIC, BIC, possibly KIC) to provide a fair comparison between models and the authors should be able to compute these metrics based on the provided materials.
- I understand the idea of assessing the models in the tails; however, I find the two MAE (0.2 and 0.8) criteria quite crude and based on completely arbitrary thresholds that could influence the results. The authors should further elaborate on the robustness and significance of these two criteria.
- Details on the hydrogeological context were not shared with the participating teams, and I understand this is motivated by the objective of comparing approaches requiring information that can be widely available even for scarcely characterized sites. However, I think that in this discussion it would be beneficial to share the hydrogeological characteristics, even though these were not part of the challenge input data. How and why were these particular wells selected? Are there any results that can be further discussed when we jointly consider hydrogeological data and model performance? This discussion would be beneficial to readers and could be useful for the application of approaches similar to those presented here in other cases, where some hydrogeological data is actually available.
- The conclusions of the study should be strengthened to include some technical discussion. Now they read a bit shallow and generic.
Minor comment: the plot in Figure 4 for Netherlands has a strange behavior, I think due to an “empty” period between validation and calibration. Probably it would be better to leave this blank rather than linearly connect the two points.
Citation: https://doi.org/10.5194/hess-2024-111-RC4
AC4: 'Reply on RC4', Raoul Collenteur, 10 Jul 2024
The paper presents the results of a benchmark study where different data-driven models are applied to selected groundwater head time series. The study presents a discussion of the various results collected and reviews the whole procedure of the proposed challenge.
I find the paper interesting and I particularly appreciated the effort of sharing difficulties and lessons learned in the organization and management of this comparison study.
REPLY: Thank you.
However, there are a few technical points that should be addressed, as listed below. Ultimately, in my view, the key weak point of this work is that it is hard to extract a take-home message related to the model performances that may be useful in the selection of an appropriate approach in another case not considered here.
REPLY: We will add two points to the discussion and conclusion. First, that each contribution to the challenge is the combined result of the method and the modeling team. It is, obviously, not possible to attribute a below-average performance to either a weakness of the method or a suboptimal application by the modeling team, and we will state this explicitly. Second, we will include a discussion of the justification to include a stress in the model or not. In data-driven modeling, the justification comes from making one model with the river stage and one without the river stage, and comparing results. If the model performance is better with the stress included, this does not necessarily mean that there is a causal relationship, as the stress can be a proxy for another stress that behaves in a similar manner. We do argue, however, that if a stress is needed to get a good time series model, it is highly likely that a traditional groundwater model also needs this stress.
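[Editor's note: a minimal sketch of this with/without test on synthetic data, using a simple linear model as a stand-in for any of the challenge methods; fit one model without and one with the river stage as input, and compare performance on a held-out validation period.]

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def nse(obs, sim):
    """Nash-Sutcliffe efficiency."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(4)
n = 2000
precip = rng.gamma(2.0, 2.0, n)
stage = np.cumsum(rng.normal(0, 0.1, n))          # hypothetical river stage
head = 0.05 * precip + 0.4 * stage + rng.normal(0, 0.2, n)

cal = slice(0, 1500)                              # calibration period
val = slice(1500, n)                              # validation period

# One model without and one with the river stage as explanatory variable.
for X in (precip[:, None], np.column_stack([precip, stage])):
    m = LinearRegression().fit(X[cal], head[cal])
    print("validation NSE:", round(nse(head[val], m.predict(X[val])), 3))
```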
I think addressing the following points would help in resolving this weakness.
- The approaches compared likely differ in terms of parameterization, e.g. what is the number of calibration parameters to be optimized in the calibration phase? This information is not discussed properly, but is crucial for a fair comparison especially of calibration (training) performance. Model discrimination criteria could be used for these purposes (e.g., AIC, BIC, possibly KIC) to provide a fair comparison between models and the authors should be able to compute these metrics based on the provided materials.
REPLY: It is difficult to compare the number of parameters of a lumped-parameter model and an ML model, because a lumped-parameter model uses on the order of 10 parameters, while an ML model may use more than 10,000 parameters. In general, using many unnecessary parameters may result in overfitting. ML algorithms, when correctly applied, apply all kinds of safeguards to avoid overfitting. We tested whether overfitting was a problem by evaluating the performance of the models in the validation period. To clarify this in the paper, we will add additional explanations about the large difference in the number of parameters and note that no significant overfitting was observed.
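[Editor's note: a minimal sketch of the information criteria the reviewer mentions, in the common least-squares form for Gaussian residuals (constant terms dropped); the parameter counts are only illustrative of the lumped-parameter vs. ML scales discussed above.]

```python
import numpy as np

def aic_bic(residuals, k):
    """AIC = n*ln(SSE/n) + 2k and BIC = n*ln(SSE/n) + k*ln(n),
    with k the number of calibrated parameters (constants dropped)."""
    n = len(residuals)
    sse = np.sum(np.asarray(residuals) ** 2)
    return n * np.log(sse / n) + 2 * k, n * np.log(sse / n) + k * np.log(n)

# The same residuals penalized for 10 vs. 10,000 parameters.
res = np.random.default_rng(5).normal(0, 0.1, 3650)
print(aic_bic(res, k=10))      # lumped-parameter scale
print(aic_bic(res, k=10_000))  # ML scale: heavily penalized
```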
- I understand the idea of assessing the models in the tails; however, I find the two MAE (0.2 and 0.8) criteria quite crude and based on completely arbitrary thresholds that could influence the results. The authors should further elaborate on the robustness and significance of these two criteria.
REPLY: Every performance metric in itself is an arbitrary choice, but each metric represents a comparison between the observed and modeled series. We outlined our selection of performance metrics in Section 2.3. In this case, the 20% and 80% quantile thresholds represent most periods with low and high water levels fairly well. Using other thresholds such as 5%/95% or 10%/90% would exclude too many values from the evaluation for the series considered here. The latter thresholds usually originate from the surface water domain, where runoff curves have quite different characteristics and runoff peaks are more distinct. Therefore, we claim that the 20%/80% thresholds are a good measure for this specific evaluation. To support this reasoning, please check the attached figure, which shows the different thresholds for the USA head data. Nevertheless, we will elaborate on this aspect in the text to clarify our choice and, furthermore, we will compute the results for other thresholds (0.1, 0.05, 0.9, 0.95) to assess the robustness of these thresholds and share the results in the Supplemental Materials.
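[Editor's note: a minimal sketch of the tail criteria described above, on a synthetic series: the MAE is computed only where the observed heads fall below the 20% quantile (low heads) or above the 80% quantile (high heads), and the thresholds are easy to vary for the robustness check the authors propose.]

```python
import numpy as np

def tail_mae(obs, sim, q_low=0.2, q_high=0.8):
    """MAE restricted to the low and high tails of the observed heads."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    lo, hi = np.quantile(obs, [q_low, q_high])
    mae_low = np.mean(np.abs(obs[obs <= lo] - sim[obs <= lo]))
    mae_high = np.mean(np.abs(obs[obs >= hi] - sim[obs >= hi]))
    return mae_low, mae_high

obs = np.random.default_rng(6).normal(10, 0.5, 1825)  # hypothetical heads (m)
sim = obs + np.random.default_rng(7).normal(0, 0.1, 1825)
print(tail_mae(obs, sim))              # default 20%/80% thresholds
print(tail_mae(obs, sim, 0.05, 0.95))  # alternative thresholds
```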
- Details on the hydrogeological context were not shared with the participating teams, and I understand this is motivated by the objective of comparing approaches requiring information that can be widely available even for scarcely characterized sites. However, I think that in this discussion it would be beneficial to share the hydrogeological characteristics, even though these were not part of the challenge input data. How and why were these particular wells selected? Are there any results that can be further discussed when we jointly consider hydrogeological data and model performance? This discussion would be beneficial to readers and could be useful for the application of approaches similar to those presented here in other cases, where some hydrogeological data is actually available.
REPLY: The wells were selected, on the one hand, to reflect different hydrogeological settings (porous, fractured, and karstic aquifers, confined/unconfined), different climates (e.g., influenced by snow or not), and other aspects (possible influence of surface water). On the other hand, we selected the wells by their available data (long and gapless time series of daily (weekly in the case of Sweden) heads). Though we tried to interpret the results of the models for each well and its specifics, four wells and 15 submissions are probably not enough to draw robust general conclusions regarding hydrogeological data and model performance, such as model type A generally performs better in hydrogeological setting X, while model type B is better in hydrogeological setting Y. We will add our reasoning for selecting these series to the paper.
- The conclusions of the study should be strengthened to include some technical discussion. Now they read a bit shallow and generic.
REPLY: We will modify the conclusions to strengthen the most important messages of the paper. We will elaborate on the fact that the results of the teams are the combined result of the method and the modeling team. A large part of the model performance is determined by how the model is set up and calibrated, as also shown by the community comment of Dominique Thiéry. This highlights the fact that although it is generally relatively easy to set up a data-driven model, getting good results is still an art that is highly dependent on the choices of the modeler.
Minor comment: the plot in Figure 4 for Netherlands has a strange behavior, I think due to an “empty” period between validation and calibration. Probably it would be better to leave this blank rather than linearly connect the two points.
REPLY: Good point. We had left it blank in Figure 1 and will also modify Figure 4.
Citation: https://doi.org/10.5194/hess-2024-111-AC4