Articles | Volume 29, issue 21
https://doi.org/10.5194/hess-29-5955-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Deep learning of flood forecasting by considering interpretability and physical constraints
Download
- Final revised paper (published on 04 Nov 2025)
- Supplement to the final revised paper
- Preprint (discussion started on 10 Mar 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on hess-2024-393', Anonymous Referee #1, 01 Apr 2025
- AC1: 'Reply on RC1', Ting Zhang, 21 Jun 2025
- RC2: 'Comment on hess-2024-393', Anonymous Referee #2, 17 May 2025
- AC2: 'Reply on RC2', Ting Zhang, 21 Jun 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (30 Jun 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (02 Jul 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (02 Jul 2025) by Roberto Greco
RR by Anonymous Referee #2 (26 Jul 2025)
RR by Marco Luppichini (04 Aug 2025)
ED: Publish subject to minor revisions (review by editor) (07 Aug 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (11 Aug 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (27 Aug 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (28 Aug 2025)
Manuscript
The article "Deep learning of flood forecasting by considering interpretability and physical constraints" addresses the use of deep learning models, specifically LSTM networks, for flood risk mitigation. The objectives of the study are interesting and aligned with the scientific mission of the HESS journal.
However, several aspects are addressed with insufficient scientific rigor, compromising the robustness and significance of the contribution. Below, I provide some major and minor comments on the manuscript.
Abstract
It is recommended to avoid using unexplained acronyms, as this may hinder comprehension for the reader. Including a brief environmental and geological context of the studied basin would help justify the choice of the forecast horizon t + 6. This parameter is highly dependent on the characteristics of the river considered and may be excessive or not meaningful in other fluvial contexts. Consequently, the results cannot be generalized without appropriate caution.
Study area and data
It is unclear why the experiment was conducted in this particular watershed. What are its characteristics? Why is it relevant? How does it differ from others? Will the results obtained be valid only for this site, or are they generalizable? This should be clarified.
Lines 297–306: The data sampling frequency is not specified, even though it is a fundamental parameter.
Figure 2: It is recommended to change the colors, as the triangles and the star are not clearly visible.
The split of the dataset into training and validation sets appears to be the main critical issue of the study. What rule was followed? The most widely accepted strategy is to divide the dataset into three parts (training, validation, and test), using the validation set during training to monitor generalization and reserving the test set for the final evaluation only. Why was this approach not followed? Is it due to the limited number of available events? An explanation is required.
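As a concrete illustration, an event-level three-way split could look like the following sketch. The 60/20/20 proportions, the random seed, and the assumption of 30 events indexed 0–29 are illustrative choices, not the authors' actual setup:

```python
import random

def split_events(event_ids, frac_train=0.6, frac_val=0.2, seed=42):
    """Randomly split flood-event identifiers into train/validation/test subsets.

    Splitting at the event level (rather than the time-step level) avoids
    leaking samples from the same hydrograph across subsets.
    """
    ids = list(event_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(len(ids) * frac_train)
    n_val = int(len(ids) * frac_val)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_events(range(30))
print(len(train), len(val), len(test))  # 18 6 6
```

With 30 events this yields 18/6/6 events per subset; whether 6 test events are statistically sufficient is precisely the question the authors should address.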
What happens if the events that compose the three subsets are changed? Does the predictive performance of the models vary? Techniques such as cross-validation or bootstrapping would allow the error distributions to be analyzed. How stable is a model trained multiple times on the same initial dataset? Answering these questions would strengthen the scientific approach of the paper, moving it beyond a simple application. As presented, the results appear fragile, since they may depend on the initial, arbitrary assignment of events to the training, validation, and test subsets.
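Event-wise k-fold cross-validation, for instance, could be sketched as follows. The mean-discharge "model" is a deliberately trivial stand-in for the authors' LSTM, and the toy hydrographs are fabricated purely for illustration; the point is the per-fold error distribution, not the predictor:

```python
import math
import random

def rmse(obs, sim):
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def kfold_event_cv(events, k=5, seed=0):
    """events: dict event_id -> list of observed discharges.

    Shuffles events, partitions them into k folds, and returns the
    per-fold RMSE of a placeholder mean-discharge predictor, giving a
    distribution of errors instead of a single arbitrary-split score.
    """
    ids = sorted(events)
    rng = random.Random(seed)
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_ids = folds[i]
        train_ids = [e for j, fold in enumerate(folds) if j != i for e in fold]
        train_obs = [q for e in train_ids for q in events[e]]
        mean_q = sum(train_obs) / len(train_obs)  # stand-in for model training
        test_obs = [q for e in test_ids for q in events[e]]
        scores.append(rmse(test_obs, [mean_q] * len(test_obs)))
    return scores

# toy data: 30 events with synthetic 5-step hydrographs
events = {e: [10 + e + t for t in range(5)] for e in range(30)}
scores = kfold_event_cv(events)
print(len(scores))  # 5
```

Reporting the mean and spread of such fold scores (or of bootstrap resamples) would make the robustness of the results assessable.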
Are 30 events sufficient to train deep learning models? The size of the original dataset and the derived datasets is not clear. I suggest conducting a distributional analysis of the events. If the analysis focuses on a limited number of cases, they should be hydrologically analyzed and shown to be statistically representative of the hydrology of the basin under study.
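One simple way to test representativeness is a two-sample comparison between the selected events and the full discharge record, e.g. via the Kolmogorov–Smirnov statistic. A minimal pure-Python sketch is given below (in practice `scipy.stats.ks_2samp` would also provide a p-value); the example values are illustrative only:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (no p-value):
    the maximum gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)

    def ecdf(sorted_x, v):
        # fraction of points <= v
        return bisect.bisect_right(sorted_x, v) / len(sorted_x)

    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

# e.g. compare peak flows of the 30 selected events against
# peaks extracted from the full historical record
print(ks_statistic([1, 2, 3, 4], [2, 3, 4, 5]))  # 0.25
```

A small statistic (identical samples give 0, disjoint samples give 1) would support the claim that the selected events span the hydrological regime of the basin.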
Line 323: Indicate the version of the TensorFlow library used.
Line 327: Provide a citation for the activation functions employed.
Line 330: It is unclear how early stopping is supposed to mitigate overfitting here. It must be demonstrated that the models are not affected by overfitting; furthermore, splitting the dataset into three sets is a fundamental first step to prevent both overfitting and underfitting, since early stopping requires a validation loss that is independent of the data used for the final evaluation.
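For reference, the logic of patience-based early stopping with best-weight restoration (the behavior of Keras's `EarlyStopping` with `restore_best_weights=True`) can be sketched as follows; this is a schematic of the general technique, not of the authors' exact configuration, and the loss sequence in the example is invented:

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (best_epoch, stop_epoch).

    best_epoch: the epoch whose weights should be restored (lowest
    validation loss seen); stop_epoch: the epoch at which training halts
    after `patience` epochs without an improvement larger than min_delta.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# validation loss improves until epoch 2, then stagnates
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.74, 0.73], patience=3))  # (2, 5)
```

Note that the scheme only makes sense if the monitored losses come from a held-out validation set; showing training- vs. validation-loss curves would be the natural evidence that overfitting is under control.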
Results
Tables 3 and 4: It is advisable to replace the tables (which can be included as supplementary material to ensure transparency of the raw data) with plots showing the metrics as a function of lead time for each model. This would help reveal potential trends and the presence or absence of overfitting. Additionally, the reported results may lack statistical validity and could be coincidental. It is necessary to repeat the training procedures, as mentioned above, to assess the robustness of the outcomes.
What if the error metrics were computed only for data exceeding a certain threshold (statistical or physical)? Focusing on peak flood events, would the metrics change? Would more patterns emerge?
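Such a threshold-conditional evaluation could be sketched as follows; the threshold and the toy discharge values are illustrative, and NSE/RMSE are used here simply as common hydrological metrics:

```python
import math

def conditional_metrics(obs, sim, threshold):
    """RMSE and NSE computed only at time steps where the observed
    discharge exceeds `threshold`, focusing the evaluation on peaks.
    Returns None if no observation exceeds the threshold."""
    pairs = [(o, s) for o, s in zip(obs, sim) if o > threshold]
    if not pairs:
        return None
    o_sel = [o for o, _ in pairs]
    s_sel = [s for _, s in pairs]
    mean_o = sum(o_sel) / len(o_sel)
    sq_err = sum((o - s) ** 2 for o, s in zip(o_sel, s_sel))
    rmse = math.sqrt(sq_err / len(o_sel))
    denom = sum((o - mean_o) ** 2 for o in o_sel)
    nse = 1 - sq_err / denom if denom > 0 else float("nan")
    return rmse, nse

# e.g. evaluate only above 300 m3/s, where the scatterplot anomaly appears
print(conditional_metrics([100, 200, 400, 500], [110, 190, 380, 520], 300))
```

Comparing these conditional metrics with the full-sample metrics, per lead time, would show whether the apparent skill is dominated by low flows.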
Figure 4: The axis labels are not legible.
This observation applies to all time horizons, but is particularly evident for t + 5 and t + 6: for observed discharges above approximately 300 m³/s, an anomalous behavior appears in the scatterplot points, forming a curve. In my experience, these points likely correspond to a specific event that the model fails to simulate correctly, tending to underestimate the flows. If this hypothesis is confirmed by the authors, it should be discussed, as it would reveal an interesting phenomenon: the model is unable to overestimate flow in advance and instead tends to underestimate it as lead time increases.
These models seem to suffer from a common limitation: the inability to anticipate runoff before the onset of precipitation. This limitation may be understandable given the lack of meteorological forecast input to the model. Nonetheless, this observation opens up interesting research avenues that the authors are encouraged to explore in the discussion and conclusions.
Finally, if the hypothesis that those outlier points belong to a single event holds true, the most significant errors in predicting large events should be analyzed in detail. All these aspects could serve as input for a revision of the discussion and conclusions, enhancing the scientific impact of the paper, which in its current form lacks significant novelty.