Articles | Volume 29, issue 21
https://doi.org/10.5194/hess-29-5955-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Deep learning of flood forecasting by considering interpretability and physical constraints
Download
- Final revised paper (published on 04 Nov 2025)
- Supplement to the final revised paper
- Preprint (discussion started on 10 Mar 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on hess-2024-393', Anonymous Referee #1, 01 Apr 2025
- AC1: 'Reply on RC1', Ting Zhang, 21 Jun 2025
- RC2: 'Comment on hess-2024-393', Anonymous Referee #2, 17 May 2025
- AC2: 'Reply on RC2', Ting Zhang, 21 Jun 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (30 Jun 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (02 Jul 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (02 Jul 2025) by Roberto Greco
RR by Anonymous Referee #2 (26 Jul 2025)
RR by Marco Luppichini (04 Aug 2025)
ED: Publish subject to minor revisions (review by editor) (07 Aug 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (11 Aug 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (27 Aug 2025) by Roberto Greco
AR by Ting Zhang on behalf of the Authors (28 Aug 2025)
Manuscript
The article "Deep learning of flood forecasting by considering interpretability and physical constraints" addresses the use of deep learning models, specifically LSTM networks, for flood risk mitigation. The objectives of the study are interesting and aligned with the scientific mission of the HESS journal.
However, several aspects are addressed with insufficient scientific rigor, compromising the robustness and significance of the contribution. Below, I provide some major and minor comments on the manuscript.
Abstract
It is recommended to avoid using unexplained acronyms, as this may hinder comprehension for the reader. Including a brief environmental and geological context of the studied basin would help justify the choice of the forecast horizon t + 6. This parameter is highly dependent on the characteristics of the river considered and may be excessive or not meaningful in other fluvial contexts. Consequently, the results cannot be generalized without appropriate caution.
Study area and data
It is unclear why the experiment was conducted in this particular watershed. What are its characteristics? Why is it relevant? How does it differ from others? Will the results obtained be valid only for this site, or are they generalizable? This should be clarified.
Lines 297–306: The data sampling frequency is not specified, even though it is a fundamental parameter.
Figure 2: It is recommended to change the colors, as the triangles and the star are not clearly visible.
The split of the dataset into training and validation sets appears to be the main critical issue of the study. What rule was followed? The most widely accepted strategy is to divide the dataset into three parts (training, validation, and test), using the validation set during training to monitor generalization and reserving the test set for the final evaluation only. Why was this approach not followed? Is it due to the limited number of available events? An explanation is required.
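As a concrete illustration, an event-level three-way split could look like the following sketch. The 60/20/20 proportions, the random seed, and the assumption of 30 events indexed 0–29 are illustrative choices, not the authors' actual setup:

```python
import random

def split_events(event_ids, frac_train=0.6, frac_val=0.2, seed=42):
    """Randomly split flood-event identifiers into train/validation/test subsets.

    Splitting at the event level (rather than the time-step level) avoids
    leaking samples from the same hydrograph across subsets.
    """
    ids = list(event_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(len(ids) * frac_train)
    n_val = int(len(ids) * frac_val)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_events(range(30))
print(len(train), len(val), len(test))  # 18 6 6
```

With 30 events this yields 18/6/6 events per subset; whether 6 test events are statistically sufficient is precisely the question the authors should address.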
What happens if the events that compose the three subsets are changed? Does the predictive performance of the models vary? Techniques such as cross-validation or bootstrapping would allow the error distributions to be analyzed. How stable is a model trained multiple times on the same initial dataset? Answering these questions would strengthen the scientific approach of the paper, moving it beyond a simple application. As presented, the results appear fragile, since they may depend on the initial, arbitrary assignment of events to the training, validation, and test subsets.
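Event-wise k-fold cross-validation, for instance, could be sketched as follows. The mean-discharge "model" is a deliberately trivial stand-in for the authors' LSTM, and the toy hydrographs are fabricated purely for illustration; the point is the per-fold error distribution, not the predictor:

```python
import math
import random

def rmse(obs, sim):
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def kfold_event_cv(events, k=5, seed=0):
    """events: dict event_id -> list of observed discharges.

    Shuffles events, partitions them into k folds, and returns the
    per-fold RMSE of a placeholder mean-discharge predictor, giving a
    distribution of errors instead of a single arbitrary-split score.
    """
    ids = sorted(events)
    rng = random.Random(seed)
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_ids = folds[i]
        train_ids = [e for j, fold in enumerate(folds) if j != i for e in fold]
        train_obs = [q for e in train_ids for q in events[e]]
        mean_q = sum(train_obs) / len(train_obs)  # stand-in for model training
        test_obs = [q for e in test_ids for q in events[e]]
        scores.append(rmse(test_obs, [mean_q] * len(test_obs)))
    return scores

# toy data: 30 events with synthetic 5-step hydrographs
events = {e: [10 + e + t for t in range(5)] for e in range(30)}
scores = kfold_event_cv(events)
print(len(scores))  # 5
```

Reporting the mean and spread of such fold scores (or of bootstrap resamples) would make the robustness of the results assessable.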
Are 30 events sufficient to train deep learning models? The size of the original dataset and the derived datasets is not clear. I suggest conducting a distributional analysis of the events. If the analysis focuses on a limited number of cases, they should be hydrologically analyzed and shown to be statistically representative of the hydrology of the basin under study.
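One simple way to test representativeness is a two-sample comparison between the selected events and the full discharge record, e.g. via the Kolmogorov–Smirnov statistic. A minimal pure-Python sketch is given below (in practice `scipy.stats.ks_2samp` would also provide a p-value); the example values are illustrative only:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (no p-value):
    the maximum gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)

    def ecdf(sorted_x, v):
        # fraction of points <= v
        return bisect.bisect_right(sorted_x, v) / len(sorted_x)

    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

# e.g. compare peak flows of the 30 selected events against
# peaks extracted from the full historical record
print(ks_statistic([1, 2, 3, 4], [2, 3, 4, 5]))  # 0.25
```

A small statistic (identical samples give 0, disjoint samples give 1) would support the claim that the selected events span the hydrological regime of the basin.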
Line 323: Indicate the version of the TensorFlow library used.
Line 327: Provide a citation for the activation functions employed.
Line 330: It is unclear how early stopping is supposed to mitigate overfitting here. It must be demonstrated that the models are not affected by overfitting; furthermore, splitting the dataset into three sets is a fundamental first step to prevent both overfitting and underfitting, since early stopping requires a validation loss that is independent of the data used for the final evaluation.
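For reference, the logic of patience-based early stopping with best-weight restoration (the behavior of Keras's `EarlyStopping` with `restore_best_weights=True`) can be sketched as follows; this is a schematic of the general technique, not of the authors' exact configuration, and the loss sequence in the example is invented:

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (best_epoch, stop_epoch).

    best_epoch: the epoch whose weights should be restored (lowest
    validation loss seen); stop_epoch: the epoch at which training halts
    after `patience` epochs without an improvement larger than min_delta.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# validation loss improves until epoch 2, then stagnates
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.74, 0.73], patience=3))  # (2, 5)
```

Note that the scheme only makes sense if the monitored losses come from a held-out validation set; showing training- vs. validation-loss curves would be the natural evidence that overfitting is under control.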
Results
Tables 3 and 4: It is advisable to replace the tables (which can be included as supplementary material to ensure transparency of the raw data) with plots showing the metrics as a function of lead time for each model. This would help reveal potential trends and the presence or absence of overfitting. Additionally, the reported results may lack statistical validity and could be coincidental. It is necessary to repeat the training procedures, as mentioned above, to assess the robustness of the outcomes.
What if the error metrics were computed only for data exceeding a certain threshold (statistical or physical)? Focusing on peak flood events, would the metrics change? Would more patterns emerge?
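Such a threshold-conditional evaluation could be sketched as follows; the threshold and the toy discharge values are illustrative, and NSE/RMSE are used here simply as common hydrological metrics:

```python
import math

def conditional_metrics(obs, sim, threshold):
    """RMSE and NSE computed only at time steps where the observed
    discharge exceeds `threshold`, focusing the evaluation on peaks.
    Returns None if no observation exceeds the threshold."""
    pairs = [(o, s) for o, s in zip(obs, sim) if o > threshold]
    if not pairs:
        return None
    o_sel = [o for o, _ in pairs]
    s_sel = [s for _, s in pairs]
    mean_o = sum(o_sel) / len(o_sel)
    sq_err = sum((o - s) ** 2 for o, s in zip(o_sel, s_sel))
    rmse = math.sqrt(sq_err / len(o_sel))
    denom = sum((o - mean_o) ** 2 for o in o_sel)
    nse = 1 - sq_err / denom if denom > 0 else float("nan")
    return rmse, nse

# e.g. evaluate only above 300 m3/s, where the scatterplot anomaly appears
print(conditional_metrics([100, 200, 400, 500], [110, 190, 380, 520], 300))
```

Comparing these conditional metrics with the full-sample metrics, per lead time, would show whether the apparent skill is dominated by low flows.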
Figure 4: The axis labels are not legible.
This observation applies to all time horizons, but is particularly evident for t + 5 and t + 6: for observed discharges above approximately 300 m³/s, an anomalous behavior appears in the scatterplot points, forming a curve. In my experience, these points likely correspond to a specific event that the model fails to simulate correctly, tending to underestimate the flows. If this hypothesis is confirmed by the authors, it should be discussed, as it would reveal an interesting phenomenon: the model is unable to overestimate flow in advance and instead tends to underestimate it as lead time increases.
These models seem to suffer from a common limitation: the inability to anticipate runoff before the onset of precipitation. This limitation may be understandable given the lack of meteorological forecast input to the model. Nonetheless, this observation opens up interesting research avenues that the authors are encouraged to explore in the discussion and conclusions.
Finally, if the hypothesis that those outlier points belong to a single event holds true, the most significant errors in predicting large events should be analyzed in detail. All these aspects could serve as input for a revision of the discussion and conclusions, enhancing the scientific impact of the paper, which in its current form lacks significant novelty.