Articles | Volume 26, issue 13
https://doi.org/10.5194/hess-26-3537-2022
© Author(s) 2022. This work is distributed under the Creative Commons Attribution 4.0 License.
The Great Lakes Runoff Intercomparison Project Phase 4: the Great Lakes (GRIP-GL)
Download
- Final revised paper (published on 08 Jul 2022)
- Supplement to the final revised paper
- Preprint (discussion started on 29 Mar 2022)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on hess-2022-113', Anonymous Referee #1, 29 Apr 2022
  - AC1: 'Reply on RC1', Juliane Mai, 03 Jun 2022
- CC1: 'Comment on hess-2022-113', John Ding, 18 May 2022
  - AC3: 'Reply on CC1', Juliane Mai, 03 Jun 2022
- RC2: 'Comment on hess-2022-113', Matteo Giuliani, 24 May 2022
  - AC2: 'Reply on RC2', Juliane Mai, 03 Jun 2022
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to minor revisions (further review by editor) (05 Jun 2022) by Alberto Guadagnini
AR by Juliane Mai on behalf of the Authors (05 Jun 2022)
Author's response
Author's tracked changes
Manuscript
EF by Merve Parla (06 Jun 2022)
Supplement
ED: Publish as is (10 Jun 2022) by Alberto Guadagnini
AR by Juliane Mai on behalf of the Authors (22 Jun 2022)
Manuscript
Mai et al. present a thorough model intercomparison for the Great Lakes region.
The manuscript is very extensive, as is the way the model intercomparison was managed. The intercomparison was conducted in a very structured manner and was clearly not opportunity-driven: teams had to create a new model setup consistent with the underlying data and perform a new calibration. The analysis is also very thorough and very honest, fairly comparing the performance of all the models on different aspects, which is very much appreciated. It also demonstrates how much information can be gained from such a carefully designed experiment: many conclusions on different aspects can be drawn. This does make the manuscript quite long, with the risk that some conclusions get lost among all the others, but the abstract provides a good summary.
I only have a few minor points:
From the methods section it is not clear whether the LSTM was also trained with geographic data. Later I read that it was; perhaps this could already be clarified earlier.
The donor-basin rule is indeed very basic, and as such I wonder about the value of the spatial validation. What does it mean when a model is good at simulating a catchment it has not "seen", with parameters based on another catchment? Does that make the model "better"? It could also just indicate how sensitive the model's output is to different forcing and to its own parameters, rather than being a value judgement of its performance. But this is just my thought.
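To make concrete how basic such a rule is, here is a minimal sketch of one common form of donor-basin parameter transfer: each "unseen" validation basin simply inherits the calibrated parameter set of the nearest gauged basin by centroid distance. All basin IDs, coordinates, and parameter values below are illustrative, not taken from the paper, and the actual GRIP-GL donor rule may differ in detail.

```python
# Hypothetical donor-basin rule: transfer calibrated parameters from the
# nearest gauged basin (by centroid distance) to an ungauged basin.
import math

def nearest_donor(target_xy, donor_centroids):
    """Return the ID of the donor basin whose centroid is closest to target_xy.

    donor_centroids: dict mapping basin ID -> (x, y) centroid coordinates.
    """
    return min(donor_centroids, key=lambda b: math.dist(target_xy, donor_centroids[b]))

# Calibrated parameter sets for three gauged basins (illustrative values).
params = {"A": {"k": 0.4}, "B": {"k": 0.7}, "C": {"k": 0.9}}
centroids = {"A": (0.0, 0.0), "B": (5.0, 1.0), "C": (9.0, 4.0)}

donor = nearest_donor((4.0, 2.0), centroids)   # -> "B"
transferred = params[donor]                     # parameters used for the "unseen" basin
```

With a rule this simple, good spatial-validation scores may say as much about a model's insensitivity to transplanted parameters as about its skill.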
It is appreciated that mistakes in the procedures are openly shared, such as the one concerning the PET-controlling constant in the LBRM-CC calibration. However, no consequences are drawn from this point. For instance, it is used as an argument to explain lower performance, but a missing constraint should actually result in better model performance, because the calibration had more freedom to fit this parameter (or in equal performance, if the calibration algorithm ended up at the correct spot after all). The implications of this error for comparability are not clear. (The same holds for the other calibration bug, with the SVS LSS.)
It is nice that the majority of the models applied the same calibration algorithm, but they all used slightly different settings. Were these determined based on expert judgement?
Some models were calibrated regionally, others locally. It is unclear why particular models were used one way or the other; I guess because this fits the general philosophy of each model and its common use. Maybe this can be clarified in Ch 2.
l. 487-490 (p19): it is unclear what is meant here.
l. 604 (p23): not very clearly explained. I guess it also depends on the shape of the Pareto front (if one exists at all). It would be nice to see it somehow in a 2D version (e.g. for two variables only) or a 3D version.
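As a sketch of the 2D view suggested above: given each model's scores on two objectives (say, a performance metric for streamflow and one for a second variable, both to be maximized), the Pareto front is simply the set of models not dominated on both objectives at once. Model names and scores here are made up for illustration, not results from the paper.

```python
# Hypothetical 2-D Pareto front over per-model objective pairs.
def pareto_front(scores):
    """Return (sorted) names of models whose (obj1, obj2) pair is not
    dominated by any other model; both objectives are maximized.

    scores: dict mapping model name -> (obj1, obj2).
    """
    front = []
    for m, (a, b) in scores.items():
        dominated = any(
            a2 >= a and b2 >= b and (a2 > a or b2 > b)
            for m2, (a2, b2) in scores.items() if m2 != m
        )
        if not dominated:
            front.append(m)
    return sorted(front)

# Illustrative scores: M2 is dominated by M4, M3 by everything.
scores = {"M1": (0.8, 0.5), "M2": (0.6, 0.7), "M3": (0.5, 0.4), "M4": (0.7, 0.7)}
```

Plotting these pairs as a scatter and highlighting the front would make the trade-off shape visible at a glance.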
The link to the website is mentioned quite late. It is a very nice feature; it would be good to mention it earlier in the text.
In the conclusions it is clearly stated that gridded evaluation might be preferred over basin evaluation (the two can give different results). This is not stated as such in the abstract, where only the difference between the two is mentioned.