This preprint is distributed under the Creative Commons Attribution 4.0 License.
Deep learning of flood forecasting by considering interpretability and physical constraints
Abstract. Deep learning models have proven effective in flood forecasting by leveraging the rich time-series information in the data. However, their limited interpretability and lack of physical mechanisms remain significant challenges. To address these limitations, this study introduces a novel model called PHY-FTMA-LSTM, which combines a feature-time-based multi-head attention mechanism with physical constraints. The PHY-FTMA-LSTM model takes four essential features (runoff, rainfall, evapotranspiration, and initial soil moisture) as inputs to forecast floods in the Luan River Basin with a lead time of 1–6 h. Through the feature-time attention module, it emphasizes the most relevant input features and historical time steps. Furthermore, the model enhances physical consistency by enforcing the monotonic relationship between the input variables and the output results. The results demonstrate that PHY-FTMA-LSTM outperforms, in most cases, the original LSTM, the feature-time-based attention LSTM (FTA-LSTM), and the feature-time-based multi-head attention LSTM (FTMA-LSTM). For a lead time of t+1, the model achieves an NSE of 0.988, with KGE and R2 of 0.984 and 0.988, respectively. For a lead time of t+6, NSE, KGE, and R2 still reach 0.908, 0.905, and 0.911. The proposed PHY-FTMA-LSTM model achieves excellent prediction accuracy, offering valuable insights for enhancing interpretability and physical consistency in deep learning approaches.
Status: open (until 16 May 2025)
RC1: 'Comment on hess-2024-393', Anonymous Referee #1, 01 Apr 2025
The article "Deep Learning of flood forecasting by considering interpretability and physical constraints" addresses the use of deep learning models, specifically LSTM networks, for flood risk mitigation. The objectives of the study are interesting and aligned with the scientific mission of the HESS journal.
However, several aspects are addressed with insufficient scientific rigor, compromising the robustness and significance of the contribution. Below, I provide some major and minor comments on the manuscript.
Abstract
It is recommended to avoid using unexplained acronyms, as this may hinder comprehension for the reader. Including a brief environmental and geological context of the studied basin would help justify the choice of the forecast horizon t + 6. This parameter is highly dependent on the characteristics of the river considered and may be excessive or not meaningful in other fluvial contexts. Consequently, the results cannot be generalized without appropriate caution.
Study area and data
It is unclear why the experiment was conducted in this particular watershed. What are its characteristics? Why is it relevant? How does it differ from others? Will the results obtained be valid only for this site, or are they generalizable? This should be clarified.
Lines 297–306: The data sampling frequency is not specified, even though it is a fundamental parameter.
Figure 2: It is recommended to change the colors, as the triangles and the star are not clearly visible.
The dataset split into training and validation sets appears to be the main critical issue of the study. What rule was followed? Currently, the most widely accepted strategy is to divide the dataset into three parts (training, validation, and test), using the validation set to monitor training across epochs. Why wasn't this approach followed? Is it due to the limited number of available cases? An explanation is required.
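The three-way, event-wise split the reviewer asks about could be sketched as follows. This is a minimal illustration with hypothetical fractions and function names, not the authors' actual procedure:

```python
import random

def split_events(event_ids, train_frac=0.6, val_frac=0.2, seed=42):
    """Randomly partition flood events into train/validation/test subsets.

    The fractions are illustrative. The validation set would be used to
    monitor training (e.g., for early stopping); the test set is held
    out and used only for the final metric report.
    """
    rng = random.Random(seed)
    ids = list(event_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# With the 30 events mentioned in the review, a 60/20/20 split gives:
train, val, test = split_events(range(30))
print(len(train), len(val), len(test))  # 18 6 6
```

Splitting at the event level (rather than at the time-step level) avoids leaking parts of the same flood hydrograph into both training and evaluation.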
What happens if the events that compose the three subsets are changed? Does the predictive performance of the models vary? Using techniques like cross-validation or bootstrapping would allow for the analysis of error distributions. How stable is a model trained multiple times on the same initial dataset? Answering these questions would strengthen the scientific approach of the paper, moving it beyond a simple application. The results presented seem fragile as they might depend on the initial, arbitrary assignment of events to the training, validation, and test phases.
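A simple bootstrap of an error metric, as suggested above, could look like the following sketch (the `bootstrap_metric` helper is hypothetical; RMSE is used as the example metric, but NSE, KGE, or R2 could be substituted):

```python
import random

def rmse(obs, sim):
    """Root-mean-square error between observed and simulated series."""
    return (sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)) ** 0.5

def bootstrap_metric(obs, sim, metric, n_boot=1000, seed=0):
    """Estimate a 95 % confidence interval for an error metric by
    resampling (obs, sim) pairs with replacement."""
    rng = random.Random(seed)
    n = len(obs)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(metric([obs[i] for i in idx],
                              [sim[i] for i in idx]))
    samples.sort()
    return samples[int(0.025 * n_boot)], samples[int(0.975 * n_boot)]

# Toy example with made-up discharge values:
lo, hi = bootstrap_metric([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], rmse)
```

Reporting such intervals, or the spread of metrics across repeated trainings with different random seeds, would show whether the reported scores are stable or an artifact of one particular split.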
Are 30 events sufficient to train deep learning models? The size of the original dataset and the derived datasets is not clear. I suggest conducting a distributional analysis of the events. If the analysis focuses on a limited number of cases, they should be hydrologically analyzed and shown to be statistically representative of the hydrology of the basin under study.
Line 323: Indicate the version of the TensorFlow library used.
Line 327: Provide a citation for the activation functions employed.
Line 330: It is unclear how overfitting is being mitigated by early stopping. It must be demonstrated that the models are not affected by overfitting. Furthermore, splitting the dataset into three sets is a fundamental first step to prevent both overfitting and underfitting.
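The early-stopping logic in question can be made explicit with a generic sketch (plain Python, independent of TensorFlow; `train_step` and `val_loss_fn` are hypothetical placeholders for the authors' actual training and validation code):

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=500,
                              patience=20):
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs.

    `train_step` performs one epoch of training; `val_loss_fn` returns
    the current loss on a held-out validation set. Without a separate
    validation set, this criterion cannot detect overfitting.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss_fn()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss stagnated: likely onset of overfitting
    return best_epoch, best_loss

# Usage with a simulated validation-loss curve (hypothetical numbers):
losses = iter([0.5, 0.4, 0.3, 0.35, 0.36, 0.37, 0.38])
best_epoch, best_loss = train_with_early_stopping(
    train_step=lambda: None,
    val_loss_fn=lambda: next(losses),
    max_epochs=7,
    patience=2,
)
print(best_epoch, best_loss)  # 2 0.3
```

Plotting the training and validation loss curves from such a loop is the standard way to demonstrate the absence of overfitting.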
Results
Tables 3 and 4: It is advisable to replace the tables (which can be included as supplementary material to ensure transparency of the raw data) with plots showing the metrics as a function of lead time for each model. This would help reveal potential trends and the presence or absence of overfitting. Additionally, the reported results may lack statistical validity and could be coincidental. It is necessary to repeat the training procedures, as mentioned above, to assess the robustness of the outcomes.
What if the error metrics were computed only for data exceeding a certain threshold (statistical or physical)? Focusing on peak flood events, would the metrics change? Would more patterns emerge?
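The threshold-conditioned evaluation proposed here could be implemented as a small variation on the standard NSE (a sketch; `nse_above_threshold` is a hypothetical helper, and the threshold value is illustrative):

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def nse_above_threshold(obs, sim, threshold):
    """NSE restricted to time steps where the observed discharge
    exceeds a threshold (e.g., 300 m^3/s), isolating peak-flow skill."""
    pairs = [(o, s) for o, s in zip(obs, sim) if o > threshold]
    if len(pairs) < 2:
        raise ValueError("too few points above the threshold")
    o_sel, s_sel = zip(*pairs)
    return nse(list(o_sel), list(s_sel))
```

Comparing the full-series and above-threshold scores would reveal whether the high overall metrics are driven by the many low-flow points rather than by skill on the flood peaks.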
Figure 4: The axis labels are not legible.
This observation applies to all time horizons, but is particularly evident for t + 5 and t + 6: for observed discharges above approximately 300 m³/s, an anomalous behavior appears in the scatterplot points, forming a curve. In my experience, these points likely correspond to a specific event that the model fails to simulate correctly, tending to underestimate the flows. If this hypothesis is confirmed by the authors, it should be discussed, as it would reveal an interesting phenomenon: the model is unable to overestimate flow in advance and instead tends to underestimate it as lead time increases.
These models seem to suffer from a common limitation: the inability to anticipate runoff before the onset of precipitation. This limitation may be understandable given the lack of meteorological forecast input to the model. Nonetheless, this observation opens up interesting research avenues that the authors are encouraged to explore in the discussion and conclusions.
Finally, if the hypothesis that those outlier points belong to a single event holds true, the most significant errors in predicting large events should be analyzed in detail. All these aspects could serve as input for a revision of the discussion and conclusions, enhancing the scientific impact of the paper, which in its current form lacks significant novelty.
Citation: https://doi.org/10.5194/hess-2024-393-RC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 154 | 22 | 4 | 180 | 6 | 5 |