A deep learning model for real-time forecasting of 2-D river flood inundation maps
Abstract. Floods are among the most hazardous natural disasters worldwide. Accurate and rapid flood predictions are critical for effective early warning systems and flood management strategies. The high computational cost of hydrodynamic models often limits their application in real-time flood simulations. Conversely, data-driven models are gaining attention due to their high computational efficiency. In this study, we aim to assess the effectiveness of transformer-based models for forecasting the spatiotemporal evolution of fluvial floods in real time. To this end, the transformer-based data-driven model FloodSformer (FS) has been adapted to predict river flood inundations with negligible computational time. The FS model leverages an autoencoder framework to analyze and reduce the dimensionality of spatial information in input water depth maps, while a transformer architecture captures spatiotemporal correlations between inundation maps and inflow discharges using a cross-attention mechanism. The trained model can predict long-lasting events using an autoregressive procedure. The model's performance was evaluated in two case studies: an urban flash flood scenario at the laboratory scale and a river flood scenario along a segment of the Po River (Italy). Datasets were numerically generated using a two-dimensional hydrodynamic model. Special attention was given to analyzing how the accuracy of predictions is influenced by the type and severity of flood events used to create the training dataset. The results show that prediction errors generally align with the uncertainty observed in physically based models, and that larger and more diverse training datasets help improve the model's accuracy. Additionally, the computational time of the real-time forecasting procedure is negligible compared to the physical time of the simulated event. The performance of the FS model was also benchmarked against a state-of-the-art convolutional neural network architecture and showed better accuracy. These findings highlight the potential of transformer-based models to enhance flood prediction accuracy and responsiveness, contributing to improved flood management and resilience.
Status: closed
RC1: 'Comment on hess-2024-176', Anonymous Referee #1, 17 Sep 2024
Summary
This paper presents a surrogate deep-learning approach for forecasting flood inundation maps.
The model is a modified version of a previously published model by the same authors which is mainly based on transformers.
Here, the architecture is modified to also take upstream hydrographs into account by introducing a cross-attention mechanism between inundation maps and those hydrographs.
The architecture is showcased in mainly two different scenarios, one representing an urban flash flood and the other representing a river flood.
Each scenario is evaluated with multiple sets of hyperparameters for a more detailed examination.
Qualitatively, the presented approach produces good results; however, I am not sure about the quantitative evaluation (see below).
I want to preface my review with the fact that I am a machine-learning scientist who received only a little training in hydrology.
General comments
The overall quality of the preprint is good.
The problem that is tackled here is of very high significance for human lives, the economy, and science to which the authors seem to have a good solution.
From a machine learning point of view, the presented model is adequate and in line with recent research.
Regarding its evaluation, however, there are a few points that need to be addressed (see below).
I think the structure of the paper could be improved as some sections feel a bit out of place.
With 44 pages and an additional 10 pages of supplementary material, the paper is quite long.
The authors should consider shortening the paper as this would increase the probability of other scientists reading it.
Restructuring and shortening might be achieved by moving more (sub)sections into the appendix, which currently is only half a page long.
Other than that, the paper is well written, very detailed, easy to follow and understand, and only contains minor grammatical errors.
Specific comments
1. l. 68 "relatively high computational time may restrict the application of GNN in real-time flood forecasting": Is this due to the model being a GNN or some other aspect of this work? It is advertised as "rapid" and has a speed-up of up to 100 compared to the numerical solver.
2. l. 155 "Notably, the AR procedure relies on the availability of the entire time series of upstream discharges, which is provided as input data to the surrogate model. This is in line with the approaches commonly employed in EWS based on physically-based hydrodynamic models, in which the inflow time series derives from meteorological/hydrological models.": One of the arguments for your architecture is its speed-up compared to conventional methods. If the upstream hydrograph is predicted by conventional methods, is this speed-up lost again? Maybe mention that also here, surrogate models should be developed.
3. l. 172: "In the FS model, the input matrix X_KV represents the sequence of latent features Z = [z_1, ..., z_I] ∈ R^(I×h×w×d_model) provided by the CNN encoder layer, while X_Q is obtained by expanding and reshaping the sequence of discharge values [q_2, ..., q_(I+1)] ∈ R^I in order to obtain the same dimensions as X_KV.": Intuitively, I would have expected that the inundation maps query the discharges and not the other way around. Is there a specific reason you did it like that? What would be the difference?
4. l. 176: "We emphasize that the temporal translation of one frame between maps and discharge values used as input data (e.g., the MHCA correlates the feature at time t = 1 with the discharge at instant t = 2) is essential to predict the map at the subsequent instant (e.g., t = 2).": Shouldn't the shift depend on the position of the upstream gauge, the flow velocity, and the chosen temporal resolution?
5. l. 185: "Furthermore, a 2D relative positional encoding (RPE) and a fixed absolute 1D positional encoding (PE) are used in the spatial and temporal MHSA, respectively." Doesn't lead a fixed absolute positional encoding for temporal MHSA to bad generalization in time? Why not use a relative encoding here, too?
6. l. 203: "The GDL is designed to minimize differences between the gradients of water depths in the original (x) and reconstructed (x̂) maps." Why did you introduce this loss? If absolute predictions are correct, gradients are, too. With this additional loss, absolute errors become less important for the model, which seems like a disadvantage. Is this somehow related to the thresholds which determine when a cell is considered dry and you want to know exactly where the inundation front lies?
7. l. 206: "During the second training phase (“VPTR training”), only the VPTR block is trained, and the parameters of the encoder and decoder blocks remain fixed": Not a comment that needs to be addressed, but I think it would be interesting to see whether optimizing the encoder and decoder with a smaller learning rate might improve results.
8. l. 276: "The value of λ_GDL was calibrated to obtain the same order of magnitude between L_MSE (Eq. 7) and L_GDL (Eq. 8) losses.": What is the motivation to have similar magnitudes?
9. l. 625: "This possibly leads to a reduced optimization of the Po3 model, which could have been relieved by adopting a larger training dataset." It seems like Po3 did not lead to any significant insights, except for the fact that a finer spatial resolution is not always better. This could be an opportunity to reduce the length of the paper.
10. It seems to me that the authors only trained a single instance of their model for each scenario. The performance of neural networks can heavily depend on their weight initialization. It is therefore crucial to report results from multiple seeds to evaluate the architecture itself. This includes reporting means and standard deviations, and potentially a significance analysis. The computational times of the proposed architecture seem to allow this at least in some of the scenarios.
11. l. 652: "Benchmark comparison": This section felt out of place when I first read the paper and afterward I saw that a benchmark comparison was requested by the editor. I would suggest incorporating the benchmark comparison into the previous paragraphs, but this might be a personal preference.
12. l. 662: "The values of the learning rate and batch size were determined through a trial-and-error process.": This is always problematic, as most authors most likely spend more time optimizing their own model compared to benchmark models. Sometimes, it is unavoidable to do that. If, however, it would be possible to compare to previously published results or weights directly, this would be preferred.
13. l. 678: "This benchmark analysis confirms the remarkable performance of the FS model compared to a state-of-the-art DL architecture.": While this might (see point directly above) be true, it would be interesting to see the number of parameters and computational times of the benchmark model as well.
14. l. 743: "Initial condition sensitivity analysis": This is an interesting analysis that led to the nice idea of storing a database for initial conditions; however, this subsection again feels a bit out of place. This might be a personal preference, though.
15. General: I think it's fascinating that we go from real-world to model urban district to numerical model to surrogate model. Do you think it would make sense to fine-tune the surrogate model on data produced by the physical model experiments before using it in real-world applications? I assume learning a surrogate model is easier than a model of the real world.
Technical corrections
1. l. 69 "real-word" -> "real-world"
2. l. 139 if q_t is a scalar, I would remove the superscript from R^1 as I was looking for an l
3. l. 329: "non-dimensionalized dividing them" -> "non-dimensionalized by dividing them"; also, just FYI, the term "standardized" is much more common than "non-dimensionalized" in ML literature.
Citation: https://doi.org/10.5194/hess-2024-176-RC1
AC1: 'Reply on RC1', Matteo Pianforini, 03 Oct 2024
The original comments by the Reviewer are provided in italics. Responses are provided in plain text.
"Summary" and "General comments" sections
We thank the Reviewer for his useful suggestions and for appreciating the work. Point-by-point replies to the specific comments are provided below. The manuscript will be modified based on the recommendations of the Reviewer.
1. l. 68 "relatively high computational time may restrict the application of GNN in real-time flood forecasting": Is this due to the model being a GNN or some other aspect of this work? It is advertised as "rapid" and has a speed-up of up to 100 compared to the numerical solver.
Thank you for this insightful observation. The actual speed-up potential of GNNs remains an open question, as GNNs have not yet been applied to large-scale domains with high-resolution discretization (e.g., in the order of millions of computational cells). To date, GNNs have only been employed for flood simulations on small domains with tens of thousands of cells. As a result, the scalability of GNNs for larger, more complex domains has yet to be thoroughly investigated.
We will revise the sentence indicated by the Reviewer and the following one (i.e., lines 67-69) to better clarify this point.
2. l. 155 "Notably, the AR procedure relies on the availability of the entire time series of upstream discharges, which is provided as input data to the surrogate model. This is in line with the approaches commonly employed in EWS based on physically-based hydrodynamic models, in which the inflow time series derives from meteorological/hydrological models.": One of the arguments for your architecture is its speed-up compared to conventional methods. If the upstream hydrograph is predicted by conventional methods, is this speed-up lost again? Maybe mention that also here, surrogate models should be developed.
We appreciate this insightful comment, as it allows us to clarify the key characteristics of a coupled hydrological-hydraulic model chain. Traditionally, this approach involves a series of physically based or conceptual models to generate inundation maps starting from meteorological forecasts. First, a rainfall-runoff model is used to predict runoff at the outlet of a drainage basin starting from weather model forecasts (mainly rainfall). The output of this model is the hydrograph that describes the temporal variation of discharge in a specified river section. Next, a 2D hydrodynamic model is adopted to propagate the flood downstream, producing the inundation maps.
The computational cost of a 2D hydrodynamic model that integrates the Shallow Water Equations (SWEs) is generally much higher than that of a rainfall-runoff model, because the governing equations need to be discretized over grids of millions of cells in large domains. Consequently, the hydrodynamic model often becomes the computational bottleneck in the hydrological-hydraulic model chain. For this reason, our work focuses on developing a surrogate model for the hydrodynamic component, aiming to significantly reduce computational time without compromising accuracy.
It is also worth noting that while rainfall-runoff models are computationally less demanding, many data-driven models in the literature have been developed as surrogates for these models (e.g., Kratzert et al., 2018; Xu et al., 2023; Yin et al., 2023). Therefore, an alternative approach could involve using hydrographs generated by such surrogate models as input for the FloodSformer model, which could further reduce the overall computational time of the model chain.
We will incorporate these considerations into the revised manuscript to better clarify this point.
3. l. 172: "In the FS model, the input matrix X_KV represents the sequence of latent features Z = [z_1, ..., z_I] ∈ R^(I×h×w×d_model) provided by the CNN encoder layer, while X_Q is obtained by expanding and reshaping the sequence of discharge values [q_2, ..., q_(I+1)] ∈ R^I in order to obtain the same dimensions as X_KV.": Intuitively, I would have expected that the inundation maps query the discharges and not the other way around. Is there a specific reason you did it like that? What would be the difference?
Thank you for highlighting this interesting point. When simulating flood events using a surrogate model, it is essential to incorporate information from both the inundation maps and the upstream boundary conditions. Specifically, the prediction of the next inundation map is strongly influenced by the previous conditions in the study area, represented by the prior inundation maps. At the same time, the inflow discharge is also crucial: the flood evolution will be different during the rising or falling limb of the hydrograph. Therefore, predictions of future inundation map need to account for both past inundation maps and the inflow discharge hydrograph.
To address this, we implemented a cross-attention (CA) mechanism. This allows the model to condition future inundation map predictions on inflow discharge data while leveraging prior inundation states. Our methodology was inspired by existing works in different fields that successfully applied CA mechanisms between 2D matrices (e.g., images) and 1D scalar values for other tasks, such as visual question answering (e.g., Nguyen & Okatani, 2018: https://openaccess.thecvf.com/content_cvpr_2018/html/Nguyen_Improved_Fusion_of_CVPR_2018_paper.html; Jaegle et al., 2022: https://arxiv.org/abs/2107.14795) and medical analysis (e.g., Chen et al., 2021: https://openaccess.thecvf.com/content/ICCV2021/html/Chen_Multimodal_Co-Attention_Transformer_for_Survival_Prediction_in_Gigapixel_Whole_Slide_ICCV_2021_paper.html?ref=https://githubhelp.com).
For example, Chen et al. (2021) applied a CA mechanism to predict patient survival by examining how large 2D whole-slide images attend to gene expression (1D scalar values), which inspired our application of this technique in the FS model.
In response to the Reviewer’s comment, we also explored an alternative configuration by inverting the input information flow to the MHCA block. In this modified version, the latent features derived from the inundation maps served as the query matrix, while the discharge values were treated as the value and key matrices. We trained the FS model with this new configuration using the Toce River case study under the Toce2 training settings. The following table summarizes the average RMSEs for the four flood events within the testing dataset. For clarity, the implementation described in the manuscript is referred to as “original”, while the modified version is referred to as “modified”.
Flood event   RMSE (mm), original   RMSE (mm), modified
Low           1.8                   2.2
Medium        2.6                   3.3
High          3.2                   3.2
Gradual       3.6                   4.3

This analysis indicates that the model can still reproduce flood propagation in the selected study area, even with the input configuration reversed. Notably, the “original” version of the model demonstrated slightly better accuracy than the “modified” version (see the table above). However, this analysis is only preliminary, and a more comprehensive investigation into this design choice may be a valuable direction for future research. Given the recommendation of the Editor and Reviewer 2 to shorten the manuscript, we will not include this analysis in the revised version.
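For illustration, the following is a minimal PyTorch sketch of the two input configurations discussed above. The dimensions, variable names, and the use of torch.nn.MultiheadAttention are our assumptions for exposition only; they do not reproduce the actual FS implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only: I input frames, h x w latent maps, d_model channels.
I, h, w, d_model = 4, 16, 16, 128

z = torch.randn(I, h, w, d_model)  # latent features from the CNN encoder (X_KV)
q = torch.randn(I)                 # discharge values q_2, ..., q_(I+1)

# Expand and reshape the scalar discharges to the same dimensions as the
# latent features, then flatten space-time for the attention call (batch of 1).
x_q = q.view(I, 1, 1, 1).expand(I, h, w, d_model).reshape(1, I * h * w, d_model)
x_kv = z.reshape(1, I * h * w, d_model)

mhca = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# "Original" configuration: discharges act as the query, latent features as key/value.
out_original, _ = mhca(query=x_q, key=x_kv, value=x_kv)

# "Modified" configuration tested above: the roles are swapped.
out_modified, _ = mhca(query=x_kv, key=x_q, value=x_q)
```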
4. l. 176: "We emphasize that the temporal translation of one frame between maps and discharge values used as input data (e.g., the MHCA correlates the feature at time t = 1 with the discharge at instant t = 2) is essential to predict the map at the subsequent instant (e.g., t = 2).": Shouldn't the shift depend on the position of the upstream gauge, the flow velocity, and the chosen temporal resolution?
We appreciate this insightful question, as it allows us to further clarify the role of the temporal shift between inundation maps and discharge values. First, it is important to highlight that the shift applied in our model is fixed at one time step, which corresponds to the temporal resolution chosen for the output maps from the hydrodynamic model (e.g., 3 hours in the Po River case study; see Table 1). This shift does not require calibration.
During the dataset creation, this shift associates an inundation map at a given time step with the discharge value at the subsequent time step. Specifically, for an inundation map at time t, the associated discharge value is at time t+1. This temporal shift is essential for forecasting, as predicting the inundation map at time t+1 depends both on the preceding inundation dynamics (i.e., the map at time t) and the inflow discharge at time t+1.
Thus, the temporal shift is not influenced by the location of the upstream gauge, the flow velocity, or the temporal resolution of the model. It remains a constant offset ensuring that future inundation maps are predicted based on both past flood dynamics and the inflow condition at the prediction time.
The parameter that requires closer attention is the temporal resolution of the output maps. This resolution should be appropriate for accurately capturing the flood evolution and is therefore case dependent. However, once the temporal resolution is selected, the shift remains fixed at one time step.
We will revise the Section 2.1.1 of the manuscript to better explain this aspect.
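To make the pairing concrete, here is a minimal sketch of the dataset construction under this one-step shift. The function and variable names are illustrative, and the sketch is simplified to a single past map, whereas the FS model consumes a sequence of I past maps.

```python
def build_training_pairs(maps, q):
    """Pair the inundation map at time t and the discharge at time t+1
    with the target map at time t+1 (illustrative sketch only)."""
    samples = []
    for t in range(len(maps) - 1):
        inputs = (maps[t], q[t + 1])  # past state + inflow at the prediction time
        target = maps[t + 1]          # map to be predicted
        samples.append((inputs, target))
    return samples
```

Here `maps` and `q` are assumed to share the same temporal resolution (e.g., 3 hours in the Po River case study).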
5. l. 185: "Furthermore, a 2D relative positional encoding (RPE) and a fixed absolute 1D positional encoding (PE) are used in the spatial and temporal MHSA, respectively." Doesn't lead a fixed absolute positional encoding for temporal MHSA to bad generalization in time? Why not use a relative encoding here, too?
We are grateful for this insightful suggestion. In the current implementation of the FloodSformer model, we opted to retain the original architectural choices of the VPTR model as presented by Ye and Bilodeau (2023). Given the complexity of the proposed model—comprising several layers and architectural components such as CNNs, attention mechanisms, and MLPs—we decided to minimize modifications to the base structure to maintain consistency and reduce potential sources of error.
While the results obtained with this configuration are satisfactory and meet the objectives of this study, we acknowledge that there is room for improvement, including the use of relative positional encoding in the temporal MHSA. Further tuning of the model, not only in terms of positional encoding but also with respect to hyperparameter optimization and other architectural components, could enhance both accuracy and computational efficiency.
We will include these considerations in the discussion (Section 4) of the revised version of the manuscript.
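For context, a minimal sketch of a fixed absolute 1D sinusoidal encoding of the kind used in the temporal MHSA follows; this is the standard formulation of Vaswani et al. (2017), and the VPTR implementation may differ in detail.

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """Fixed absolute 1D positional encoding, added to the temporal tokens."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even channels
    pe[:, 1::2] = torch.cos(pos * div)  # odd channels
    return pe
```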
6. l. 203: "The GDL is designed to minimize differences between the gradients of water depths in the original (x) and reconstructed (x̂) maps." Why did you introduce this loss? If absolute predictions are correct, gradients are, too. With this additional loss, absolute errors become less important for the model, which seems like a disadvantage. Is this somehow related to the thresholds which determine when a cell is considered dry and you want to know exactly where the inundation front lies?
We appreciate this valuable remark. We acknowledge the Reviewer’s point regarding the correlation between the two components of the loss function (L_MSE and L_GDL). The inclusion of the GDL is primarily intended to improve the accuracy near hydraulic discontinuities and around the wet/dry front of the flood. While it is true that correct absolute predictions should result in correct gradients, the GDL provides additional focus on capturing changes in water depth, particularly in areas where sharp transitions occur, such as flood fronts. This helps ensure that the model accurately predicts the evolution of the inundation front.
From preliminary analyses, we found that incorporating the GDL did not negatively affect the overall accuracy of the model. Furthermore, it introduces a completely negligible computational overhead compared to the total training time. Therefore, we decided to include it in the loss function. However, we agree that a more detailed analysis of GDL’s impact in this specific application could provide valuable insights, and we plan to explore this in a future work.
Additionally, to address the concern about GDL overshadowing the absolute prediction accuracy (MSE), it is crucial to carefully select the weighting factor λ_GDL. This ensures that both the MSE and GDL term remain balanced and within the same order of magnitude, preserving the model’s focus on absolute predictions, as discussed in point 8.
7. l. 206: "During the second training phase (“VPTR training”), only the VPTR block is trained, and the parameters of the encoder and decoder blocks remain fixed": Not a comment that needs to be addressed, but I think it would be interesting to see whether optimizing the encoder and decoder with a smaller learning rate might improve results.
Thank you for this insightful suggestion. In preliminary analyses, we experimented with different learning rates and schedulers to determine the most effective configuration. The current values were selected based on their ability to provide the best results during training. Nevertheless, we acknowledge that further, more detailed analyses could be conducted to explore the potential improvements from fine-tuning the learning rate and other hyperparameters.
It is also worth noting that, by the end of the training process, the errors in the encoder and decoder modules are already quite low. As such, any additional training on these components would likely result in only marginal improvements in validation performance.
8. l. 276: "The value of λ_GDL was calibrated to obtain the same order of magnitude between L_MSE (Eq. 7) and L_GDL (Eq. 8) losses.": What is the motivation to have similar magnitudes?
The motivation behind ensuring that both components of the loss function (L_MSE and L_GDL) have similar magnitudes is to ensure that both are appropriately weighted during the optimization process. If one of the components has values significantly lower (by one or more orders of magnitude) than the other, its contribution to the overall loss computation would become negligible. This balance is essential to ensure that both the absolute prediction accuracy (captured by the MSE) and the sharpness of the inundation maps (captured by the GDL) are taken into account during model training. Without this balance, the model might focus disproportionately on one aspect, potentially compromising the quality of the results. Preliminary tests confirmed that the same order of magnitude between L_MSE and L_GDL improves the accuracy of the model.
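As a concrete illustration of this balancing, a minimal PyTorch sketch of a combined loss of this kind follows. It uses a common GDL formulation; the exact form of the paper's Eqs. 7-8 may differ.

```python
import torch
import torch.nn.functional as F

def gdl(x_hat, x):
    """Gradient difference loss between predicted (x_hat) and target (x)
    water-depth maps of shape (..., H, W); a common formulation."""
    dx_hat = torch.abs(x_hat[..., :, 1:] - x_hat[..., :, :-1])
    dx = torch.abs(x[..., :, 1:] - x[..., :, :-1])
    dy_hat = torch.abs(x_hat[..., 1:, :] - x_hat[..., :-1, :])
    dy = torch.abs(x[..., 1:, :] - x[..., :-1, :])
    return torch.mean(torch.abs(dx_hat - dx)) + torch.mean(torch.abs(dy_hat - dy))

def total_loss(x_hat, x, lambda_gdl=1.0):
    # lambda_gdl is calibrated so that the two terms have the same
    # order of magnitude, as explained above.
    return F.mse_loss(x_hat, x) + lambda_gdl * gdl(x_hat, x)
```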
9. l. 625: "This possibly leads to a reduced optimization of the Po3 model, which could have been relieved by adopting a larger training dataset." It seems like Po3 did not lead to any significant insights, except for the fact that a finer spatial resolution is not always better. This could be an opportunity to reduce the length of the paper.
Thank you for your suggestion. We agree with the Reviewer that the Po3 configuration did not yield significant additional insights beyond demonstrating that (i) a finer spatial resolution does not necessarily result in better performance and (ii) the proposed model works also for different spatial resolutions. To address this, we have decided to move the discussion of the Po3 configuration to the appendix, thereby reducing the length of the main text body without omitting any relevant information.
10. It seems to me that the authors only trained a single instance of their model for each scenario. The performance of neural networks can heavily depend on their weight initialization. It is therefore crucial to report results from multiple seeds to evaluate the architecture itself. This includes reporting means and standard deviations, and potentially a significance analysis. The computational times of the proposed architecture seem to allow this at least in some of the scenarios.
Thank you for this insightful observation. We acknowledge the Reviewer’s point regarding the importance of weight initialization in the training of data-driven models. In our study, the parameters of the FloodSformer model were initialized using the weights obtained from Ye and Bilodeau (2023), who trained the VPTR framework specifically for predicting video frames (see lines 258-264). Nonetheless, we agree that conducting an analysis involving multiple seeds to assess performance, including the reporting of means, standard deviations, and potentially a significance analysis, would have been beneficial. We appreciate this suggestion for future work.
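For reference, a minimal sketch of the multi-seed protocol the Reviewer describes; `train_and_evaluate` is a hypothetical callable that trains one model instance and returns a scalar metric.

```python
import statistics
import torch

def evaluate_over_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Train the same architecture under several random seeds and report
    the mean and standard deviation of the resulting metric."""
    scores = []
    for seed in seeds:
        torch.manual_seed(seed)  # controls, among other things, weight initialization
        scores.append(train_and_evaluate())
    return statistics.mean(scores), statistics.stdev(scores)
```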
11. l. 652: "Benchmark comparison": This section felt out of place when I first read the paper and afterward I saw that a benchmark comparison was requested by the editor. I would suggest incorporating the benchmark comparison into the previous paragraphs, but this might be a personal preference.
Thank you for your thoughtful suggestion. In Section 3.4, we present the benchmark comparison results for both case studies (Toce and Po River). As a result, integrating this analysis into the preceding paragraphs would complicate the presentation, as we would have to separate it into two distinct parts, one for each case study. However, if the Editor deems it appropriate, we are open to relocating the benchmark comparison section to the appendix to enhance the overall flow of the manuscript.
12. l. 662: "The values of the learning rate and batch size were determined through a trial-and-error process.": This is always problematic, as most authors most likely spend more time optimizing their own model compared to benchmark models. Sometimes, it is unavoidable to do that. If, however, it would be possible to compare to previously published results or weights directly, this would be preferred.
Thank you for bringing this important aspect to our attention. In this manuscript, we presented two distinct case studies. Including a third case study, such as that utilized by Kabir et al. (2020), would significantly increase the manuscript’s length, a concern that has already been noted by both the Reviewer and the Editor. Nevertheless, we acknowledge that comparing our proposed model with other state-of-the-art deep learning models could be a valuable direction for future research, and we will certainly consider your suggestion in that context.
Regarding the present study, we want to assure the Reviewer that we have dedicated substantial effort to evaluating the optimal hyperparameters for the 1D CNN model. We conducted dozens of training runs for each case study, exploring various hyperparameter configurations. For each case, we selected the trained model that yielded the best results, both quantitatively (based on established metrics) and qualitatively (through the analysis of predicted inundation maps), to present in the benchmark analysis.
We attribute the superior accuracy of the FloodSformer model, compared to the 1D CNN model, to its ability to incorporate previous inundation conditions (i.e., water depth maps from prior time steps) into its predictions. The dynamics of a flood event in a river or floodplain at any given moment depend not only on the upstream boundary conditions but also on preceding flood dynamics. This explains why the 1D CNN model struggles to predict the inundation in the defended floodplains of the Po River (as illustrated in Figures 14 and 15) and displays a more oscillatory trend in water depths for the Toce River case (notably evident in the water depths extracted at control points, which are not shown in the current version of the manuscript).
13. l. 678: "This benchmark analysis confirms the remarkable performance of the FS model compared to a state-of-the-art DL architecture.": While this might (see point directly above) be true, it would be interesting to see the number of parameters and computational times of the benchmark model as well.
Thank you for your thoughtful suggestion. In response to the Reviewer’s comment, we will include the number of parameters and computational times for the 1D CNN model in the revised version of the manuscript. For clarity, we will summarize this information here.
Regarding the number of parameters, the 1D CNN model contains approximately 8.5 million parameters for the Toce River case study and about 670 million parameters for the Po River case study (specifically, the Po2 configuration with a spatial resolution of 20 m). These values are of the same order of magnitude as those of the FloodSformer model, as detailed in Table 1.
Focusing on the computational times, the training of the 1D CNN model for the Toce River case study takes about 2 minutes, while inference takes approximately 1.5 minutes. For the Po River case study, the convolutional model takes about 21 minutes for training and 3 minutes to predict the November 2014 flood event at a spatial resolution of 20 m.
14. l. 743: "Initial condition sensitivity analysis": This is an interesting analysis that led to the nice idea of storing a database for initial conditions; however, this subsection again feels a bit out of place. This might be a personal preference, though.
Thank you for your valuable feedback. We appreciate your acknowledgment of the interest in the initial condition sensitivity analysis. In response to your comment, we will move this subsection from the discussion chapter to the conclusion of the results chapter. This adjustment aims to enhance the coherence and flow of the manuscript.
15. General: I think it's fascinating that we go from real-world to model urban district to numerical model to surrogate model. Do you think it would make sense to fine-tune the surrogate model on data produced by the physical model experiments before using it in real-world applications? I assume learning a surrogate model is easier than a model of the real world.
Thank you for your insightful observation. The approach of fine-tuning the surrogate model using data generated from physical model experiments could indeed be beneficial, as physical models typically replicate real-world conditions while operating in a reduced-dimensional domain. However, it is important to note that the data derived from experimental analyses usually consist of recorded water levels at specific gauging stations within the study area. Consequently, these data do not provide comprehensive 2D inundation maps, which are essential for training the FS model.
Furthermore, physical models cannot realistically be constructed for large-scale case studies such as the Po River, given the inherent complexity and substantial dimensions of the area involved.
Citation: https://doi.org/10.5194/hess-2024-176-AC1
RC2: 'Comment on hess-2024-176', Anonymous Referee #2, 19 Sep 2024
The paper demonstrates that a transformer model they called FloodSformer (FS) can be a better surrogate model for a physically-based model PARFLOOD on the trained sites than a convolutional model (CNN) because it has better resolution of the effect of inundation states at the last time step.
A large number of exercises have pursued this route --- training NNs or other machine learning models as surrogate models for more computationally expensive physically-based models to save time. While this is a legitimate pursuit, most of these models have one issue that hinders their practical utility for flood inundation modeling --- they cannot adapt to different domains with different topography. This means they must be trained separately on different sites; they will still produce large errors if going out of the range of the inflow conditions met during training. Thus their practical value is to interpolate between different training simulations. Then, for this type of effort, one must justify the cost of producing training/validation datasets plus training and diagnosing issues vs. directly using the physically-based model. In the case of this paper, I suspect adding up the cost of all the preparation steps will be much greater than just running the original model for the cases investigated, let alone training the models. One can even just run the original models on an interval of inflow discharge conditions comparable to their training data density and interpolate the results in space and time.
Another major limitation is that, even only for the domains trained on, the authors did not confront the model with real data, but only simulations, this further reduces the meaningfulness of the model.
If the above suspicions are true, it means the value of this paper is to demonstrate a higher accuracy of FS compared to CNN for the purpose of being a surrogate model of PARFLOOD from a purely methodological perspective, as the FS is not purely a function of inflow but also inundation states of previous time steps. If this were the first paper to show a surrogate model's success in 2D, there should still be methodological value, but given a large number of other studies (citations below as well as several cited in the paper), this work would be considered more incremental. Furthermore, more comparisons to other models like the one from Bentivoglio et al., 2023 would be needed if the claim was on accuracy and computational savings. Overall I suggest, while similar papers have been published elsewhere, this contribution does not meet the rigorous standard of HESS.
https://www.mdpi.com/2073-4441/15/3/566
https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2022WR033214 (this is the closest paper)
https://link.springer.com/article/10.1007/s00477-024-02814-z
https://hess.copernicus.org/articles/27/4227/2023/ (cited by the authors; quite similar claims, but also could claim to generalize to unseen topography)
Citation: https://doi.org/10.5194/hess-2024-176-RC2
AC2: 'Reply on RC2', Matteo Pianforini, 03 Oct 2024
The original comments by the Reviewer are provided in italics. Responses are provided in plain text.
We thank the Reviewer for the comments. Point-by-point replies to the specific comments are provided below. The manuscript will be modified based on the recommendations of the Reviewer.
1. “The paper demonstrates that a transformer model they called FloodSformer (FS) can be a better surrogate model for a physically-based model PARFLOOD on the trained sites than a convolutional model (CNN) because it has better resolution of the effect of inundation states at the last time step.”
Thank you for your comment. The primary objective of this work is to develop a surrogate model capable of real-time simulation of real flood events on large domains. To achieve this, it is essential to adopt a DL model that is both accurate and efficient, capable of generating time-evolving flood maps in a small fraction of the physical time. These maps can be used by disaster management agencies to take timely countermeasures during emergencies, minimizing the impact on affected populations.
The speed of simulation is crucial, as it must provide sufficient time for the implementation of countermeasures. Equally important is the accuracy of the predictions, which is vital for identifying high-risk areas and deploying appropriate responses. For this reason, we believe that the FloodSformer model, which integrates both speed and accuracy as key attributes, can be a valuable tool for practical purposes.
The comparison of the proposed model with the 1D CNN of Kabir et al. (2020) provided an additional analysis to assess our model against a state-of-the-art architecture. We compared predicted flood maps from the FloodSformer model and the 1D CNN across the entire duration of the flood event. As shown in Table 5 and Figures 13-15, the FloodSformer model consistently outperforms the 1D CNN in terms of accuracy throughout the flood event, not just at the final time step.
Furthermore, our work includes another element of novelty, which is a detailed analysis of the model’s sensitivity to the training dataset, particularly in relation to the type and number of flood events used for generating the ground-truth maps. This analysis is crucial for ensuring that the surrogate model is properly trained to deliver reliable predictions across diverse flood events, a topic that has been underexplored in previous research.
This will be clarified in both the introduction and conclusion of the revised version of the manuscript.
2. “A large number of exercises have pursued this route --- training NNs or other machine learning models as surrogate models for more computationally expensive physically-based models to save time. While this is a legitimate pursuit, most of these models have one issue that hinders their practical utility for flood inundation modeling --- they cannot adapt to different domains with different topography. This means they must be trained separately on different sites; they will still produce large errors if going out of the range of the inflow conditions met during training. Thus their practical value is to interpolate between different training simulations.”
Thank you for highlighting this point. We agree with the Reviewer regarding the limitations of DL models in generalizing across different topographies. Indeed, as noted in the literature — apart from a few exceptions (e.g., Bentivoglio et al. (2023) for dike-breach floods and do Lago et al. (2023) for pluvial floods) — most data-driven models for river flood predictions face this challenge (Bentivoglio et al., 2022; Karim et al., 2023). While this challenge is significant and requires further investigation, it is important to clarify that addressing this limitation is not the main objective of the present work. Instead, our main goal is to develop a surrogate model that can deliver high accuracy and computational efficiency for real-time flood predictions within a specific river region.
Another critical aspect of data-driven flood simulation models is ensuring that the training dataset can guarantee proper training and, consequently, robust generalization during inference. As is well-documented, all data-driven models suffer from reduced accuracy when attempting to extrapolate beyond the range of data used for training (e.g., Fraehr et al., 2024). To mitigate this, it is essential to include extreme flood events for the study area in the training dataset. Such extreme events can be estimated by considering the maximum flow capacity of the river, which is generally constrained by the river’s physical and topographical characteristics.
In our work, we also investigated the impact of different inflow conditions used to construct the training dataset. Since recorded flood events, especially those with high return periods, are often limited, we introduced synthetic hydrographs to incorporate a broader range of severe events in the training dataset. This strategy is key to avoiding the need for extrapolation beyond the training data.
3. “Then, for this type of effort, one must justify the cost of producing training/validation datasets plus training and diagnosing issues vs. directly using the physically-based model. In the case of this paper, I suspect adding up the cost of all the preparation steps will be much greater than just running the original model for the cases investigated, let alone training the models. One can even just run the original models on an interval of inflow discharge conditions comparable to their training data density and interpolate the results in space and time.”
Thank you for your valuable insight. In addressing the challenge of real-time flood forecasting, computational efficiency becomes a critical factor. During emergency situations, such as flood events, reducing computation time is crucial because it increases the time available for emergency services to implement protective measures, thereby mitigating loss of life and economic damage.
Physically based models, while accurate, are often impractical for real-time simulations due to their high computational demands. Conversely, a fast and efficient surrogate model can produce predictions in a much shorter time frame. As a result, in the context of setting up models for early warning applications, the time and cost required to produce the training dataset and train the surrogate model are not primary concerns. These tasks can be performed “in time of peace”, well before any emergency occurs. It is only the real-time forecasting during critical phases that requires minimal inference time to support timely decision-making. Therefore, for practical applications, the key factor is minimizing inference time to ensure the model can deliver rapid and actionable predictions during emergencies.
Regarding the suggestion of interpolating results from previous simulations based on inflow discharge conditions, this approach is feasible only when the system behaves in a linear or quasi-linear manner. However, flood propagation over real-world bathymetry is a highly non-linear phenomenon due to the complex interactions between water flow, terrain topography, and other factors. The relationship between upstream discharge and water levels at any given location is not straightforward and changes over time. As a result, simple linear interpolation cannot adequately capture the dynamic interactions across time and space, leading to low-accuracy predictions. This is further supported by the fact that increasingly complex DL models for simulating the spatiotemporal dynamics of flood events are being developed in the literature by different research groups.
To clarify this, we will modify the introduction in the revised version of the manuscript.
4. “Another major limitation is that, even only for the domains trained on, the authors did not confront the model with real data, but only simulations, this further reduces the meaningfulness of the model.”
We thank the Reviewer for this insightful comment. Unfortunately, for real flood events, direct observational data are particularly scarce and typically limited to water level recordings at a few gauging stations along the river. Spatially distributed data on the dynamics of flood events are generally unavailable; even when available, such data provide only an approximate outline of the inundated area and include no information on water depths or their temporal evolution. This is the primary reason why we relied on synthetic data generated by a physically-based model to create the dataset for training the DL algorithm.
In our work, we compared the model’s results with the available observed data in both case studies. Specifically, for the Toce River case study, the water depths at several control points (locations shown in Figure 3a) were compared with the water depths recorded during the experimental analysis conducted in the laboratory by Testa et al. (2007). The results of this comparison are presented in Figures 6, S1-S3. Similarly, for the Po River case study, the water depths recorded at the Boretto gauging station (location shown in Figure 3b) are directly compared with the results obtained from both the numerical model and the surrogate model, as shown in Figure 9.
Moreover, we would like to emphasize that the recorded water depths at the control points for the Toce River, as well as at the Boretto gauge station for the Po River, were used to calibrate the hydrodynamic model in each case study. This calibration process is crucial for ensuring the accuracy of the ground-truth maps, which are then used for training the surrogate model.
In the revised version of the manuscript these considerations will be emphasised in sections 3 and 4.
5. “If the above suspicions are true, it means the value of this paper is to demonstrate a higher accuracy of FS compared to CNN for the purpose of being a surrogate model of PARFLOOD from a purely methodological perspective, as the FS is not purely a function of inflow but also inundation states of previous time steps.”
As previously mentioned, the primary objective of this work is to implement a framework capable of real-time flood simulations over large real-world domains. The use of a model that (i) leverages cutting-edge DL algorithms, and (ii) incorporates both inflow discharge and previous inundation maps as input data, leads to significantly improved performance compared to other state-of-the-art models. This approach is essential for achieving higher accuracy during the inference stage, thereby enhancing the model’s reliability and its applicability in real-time flood management scenarios.
6. “If this were the first paper to show a surrogate model's success in 2D, there should still be methodological value, but given a large number of other studies (citations below as well as several cited in the paper), this work would be considered more incremental. Furthermore, more comparisons to other models like the one from Bentivoglio et al., 2023 would be needed if the claim was on accuracy and computational savings.”
We appreciate the Reviewer’s comment and acknowledge the existence of numerous studies in the field of flood simulations using data-driven models. However, as emphasized earlier, the proposed FloodSformer model offers innovative features compared to other DL models. Furthermore, only a limited number of studies available in the literature are dedicated to the real-time forecasting of the spatiotemporal evolution of inundation maps for river flood scenarios. Indeed, most DL models focus primarily on pluvial flood events, where the input data and objectives differ significantly. These studies often predict only maximum water depths or inundation extents for specific rainfall scenarios.
The primary aim of our work is not to provide an extensive comparison between the FloodSformer model and other state-of-the-art models, but rather to develop a surrogate DL model capable of real-time simulation of actual flood events over large domains. We believe a broader comparison across various surrogate models could be a focus of future research, as such comparison requires significant effort.
The Reviewer kindly referenced several data-driven models, many of which are already cited and discussed in our manuscript. These models employ diverse neural network architectures and target different objectives. For instance, the second study referenced by the Reviewer (Zhou et al., 2022), combines a 1D CNN with a U-Net-based spatial reduction and reconstruction method to simulate inundation maps. Similar to that of Kabir et al. (2020), this approach primarily focuses on inflow conditions at the upstream boundary. Therefore, we expect Zhou et al.’s model to exhibit the same limitations discussed in Section 3.4 of our manuscript, since it relies solely on upstream boundary conditions for predictions.
The third study (Zheng et al., 2024), uses a Bayesian CNN to predict maximum water depth maps for specific rainfall conditions. This work does not consider spatiotemporal dynamics and instead focuses on urban flood scenarios, using rainfall information as the primary input. Given these distinctions, Zheng et al.’s work addresses a different problem from ours, which is why it was not included in our manuscript. Additionally, this study was published on September 17, 2024, and was therefore not available at the time our manuscript was written.
Lastly, in Bentivoglio et al. (2023) a GNN is used to model dike-breach floods on randomly generated topographies over small, square domains. While the model was tested on unseen bathymetries with similar geometric characteristics, it was applied only to small areas with approximately 16,000 cells. No information is provided on the scalability of this approach for large domains or real bathymetries, which typically involve hundreds of thousands to millions of cells. Consequently, the suitability of the GNN model for real-time flood forecasting on large-scale areas is not demonstrated in that study, which has a different focus. For example, in our Po River case study, the domain is discretized with approximately 1.3 million cells (Po2 configuration). Given these constraints, a direct comparison between the FloodSformer and GNN models for real-time forecasting on large-scale domains is not feasible.
The discussion of the work of Bentivoglio et al. (2023) will be modified, and the work of Zheng et al. (2024) will be referenced in the revised version of the manuscript.
7. “Overall I suggest, while similar papers have been published elsewhere, this contribution does not meet the rigorous standard of HESS.”
We appreciate the Reviewer’s feedback and their assessment of our manuscript. While we acknowledge that similar papers have been published in related fields, we believe that our contribution offers several key innovations that distinguish it from existing work. Specifically, our FloodSformer model incorporates both inflow discharge and previous inundation maps, a combination that is rarely addressed in real-time river flood forecasting models. This novel approach, obtained by adopting a transformer DL architecture, significantly improves model performance and applicability in practical flood management scenarios.
Furthermore, our work directly addresses real-time flood forecasting for large, real-world domains, which is a critical yet underexplored area in existing literature. Many of the referenced models either focus on pluvial flood scenarios or use different input conditions and objectives. We have made efforts to clarify these distinctions throughout the manuscript.
Finally, we checked the model’s sensitivity to the training dataset (type and number of flood events used for generating the ground-truth maps), a topic that has been underexplored in previous research and may be of interest also for other types of surrogate models.
We are committed to meeting the high standards of HESS and would be grateful for any specific suggestions the Reviewer may have regarding how we can further strengthen our manuscript. We hope that our response demonstrates the originality and practical relevance of our work.
Citation: https://doi.org/10.5194/hess-2024-176-AC2
RC3: 'Comment on hess-2024-176', Anonymous Referee #3, 13 Oct 2024
The paper presents a potentially interesting contribution by leveraging transformers in the context of real-time flood inundation modeling. However, given its similarity with Pianforini et al. (2024) and other existing works, including a recent contribution in EGUSphere [1] that significantly improves upon Bentivoglio et al. (2023), the novelty of this work does not reach the level expected for HESS. Another venue would be more appropriate for this manuscript. Furthermore, I find the paper to be very long—almost twice the length of similar contributions— and hard to read at times, despite the good quality of the language used. It would benefit from significant shortening to aid readability.
[1] Bentivoglio, Roberto, et al. "Multi-scale hydraulic graph neural networks for flood modelling." EGUsphere 2024 (2024): 1-28.
Citation: https://doi.org/10.5194/hess-2024-176-RC3
AC3: 'Reply on RC3', Matteo Pianforini, 15 Oct 2024
The original comments by the Reviewer are provided in italics, while the responses are provided in plain text.
We thank the Reviewer for the comments. The manuscript will be modified based on the recommendations of the Reviewer.
a. “The paper presents a potentially interesting contribution by leveraging transformers in the context of real-time flood inundation modeling. However, given its similarity with Pianforini et al. (2024) and other existing works, including a recent contribution in EGUSphere [1] that significantly improves upon Bentivoglio et al. (2023), the novelty of this work does not reach the level expected for HESS. Another venue would be more appropriate for this manuscript. […]
[1] Bentivoglio, Roberto, et al. "Multi-scale hydraulic graph neural networks for flood modelling." EGUsphere 2024 (2024): 1-28.”
Thank you for your comment. Regarding the similarity with the previous version of the FS model published in the Journal of Hydrology (Pianforini et al., 2024), we believe that the current version marks a significant advancement. Specifically, we have enhanced the attention mechanism within the transformer block by transitioning from a self-attention mechanism to a cross-attention one. This shift allows the model to integrate information from multiple input sequences (boundary conditions, previous flood stages, etc.), making it far more adaptable to a variety of real-world flood scenarios. While the initial version was limited to dam-break scenarios due to its inability to incorporate boundary conditions, the updated version can predict flood events where upstream boundary conditions are crucial (e.g., river floods), making the model applicable to any type of watercourse.
As outlined in the Introduction and Discussion sections of the manuscript, our model differs from existing DL models in several key aspects. Firstly, to the best of our knowledge, this is the first time a transformer-based model has been used to simulate the spatiotemporal evolution of river flood events for real-time forecasts. Additionally, we have demonstrated the model’s capability to handle domains with millions of cells, achieving satisfactory results in both accuracy and computational efficiency.
We would also like to address the reviewer’s concern regarding the similarity with the recent work of Bentivoglio et al. (2024). Firstly, it is worth noting that this work is still under review in EGUSphere and was only made available online on September 20, 2024. This timing is important because our manuscript was submitted prior to the availability of this work, meaning that we could not have taken it into consideration at the time of writing. It is also important to note that such situations are not uncommon in the rapidly evolving field of DL model research, where advancements and publications occur very frequently.
While Bentivoglio et al. (2024) presents an interesting contribution to the field, there are several fundamental differences between their approach and ours that we believe warrant clarification.
- Architectural differences: the primary distinction lies in the AI architectures employed. Bentivoglio et al. (2024) utilized Graph Neural Networks (GNNs) to model flood events, whereas our work employs a transformer-based architecture.
- Different objectives: the objectives of the two studies also differ significantly. Bentivoglio et al. (2024) focused on developing a surrogate model that can generalize across various topographies. In contrast, our work aims to create an ANN for real-time forecasting of inundation maps over large domains. Our approach is geared toward providing fast, accurate predictions for decision-making during flood events, rather than developing a model that generalizes across multiple domains. This focus on operational forecasting is crucial for practical applications in flood risk management.
- Flood event types: another key difference lies in the types of flood events modelled. Our work addresses river flood events, whereas Bentivoglio et al. (2024) concentrated on dike-breach scenarios within relatively flat domains (see the DTM of the “dike ring 15” case study), which generate relatively shallow water depths (less than 2 meters). In contrast, the Po River case study in our work involves much more complex topography (see Fig. 3), with water depths reaching up to 23-24 meters in the main channel (see Fig. 11). The complexity of these scenarios presents a much greater challenge for flood modelling, and our model’s ability to handle such cases confirms its robustness and versatility.
- Scale and resolution: the scale and resolution of the case studies also differ significantly. Bentivoglio et al. (2024) used a low-resolution mesh with 22,881 cells, while we applied the FS model to a domain with 5,226,496 cells, demonstrating its scalability to domains two orders of magnitude larger. Additionally, the FS model maintains high computational efficiency, achieving physical-to-computational time ratios as high as 10,000. The scalability of GNNs to larger and more complex domains has yet to be fully explored. Scalability is a critical factor in real-world applications, where high-resolution predictions are needed over large areas. The ability of the FS model to operate at this scale demonstrates its practicality for operational use in real-time flood forecasting.
In conclusion, while both our work and that of Bentivoglio et al. (2024) contribute valuable advancements to flood modelling, they are based on fundamentally different approaches, objectives, and application contexts. As such, the two works should not be directly compared, as they address distinct challenges in the development of surrogate models for flood modelling. Given these significant differences, we believe both works offer complementary, rather than competing, contributions to the field.
b. “Furthermore, I find the paper to be very long—almost twice the length of similar contributions—and hard to read at times, despite the good quality of the language used. It would benefit from significant shortening to aid readability.”
We appreciate the Reviewer’s positive feedback regarding the clarity of the language used in the paper.
We understand the concern regarding the length of the paper, and we acknowledge that it is longer than typical contributions. The extended length is primarily due to the comprehensive validation of the proposed model, which includes two detailed case studies. Additionally, we conducted a sensitivity analysis to explore the effects of different types and numbers of flood events in the training dataset, alongside a benchmark comparison with another deep learning model.
However, we recognize the importance of maintaining readability and conciseness. Therefore, we are considering several options to reduce the length of the manuscript. These may include removing or relocating specific sections and figures to the appendix or supplementary material, such as “Surrogate model implementation details” (Section 2.1.4), the “Po3 training case” (Section 3.2.3), and Fig. 10. Additionally, we are considering shortening certain sections. This approach would allow us to retain the necessary scientific rigor while improving the overall flow and readability of the main body of the paper.
We are confident that these revisions will address the Reviewer’s concerns while ensuring that the key contributions and findings of the study remain clear and accessible.

Citation: https://doi.org/10.5194/hess-2024-176-AC3
-
RC3: 'Comment on hess-2024-176', Anonymous Referee #3, 13 Oct 2024
The paper presents a potentially interesting contribution by leveraging transformers in the context of real-time flood inundation modeling. However, given its similarity with Pianforini et al. (2024) and other existing works, including a recent contribution in EGUSphere [1] that significantly improves upon Bentivoglio et al. (2023), the novelty of this work does not reach the level expected for HESS. Another venue would be more appropriate for this manuscript. Furthermore, I find the paper to be very long—almost twice the length of similar contributions— and hard to read at times, despite the good quality of the language used. It would benefit from significant shortening to aid readability.
[1] Bentivoglio, Roberto, et al. "Multi-scale hydraulic graph neural networks for flood modelling." EGUsphere 2024 (2024): 1-28.Citation: https://doi.org/10.5194/hess-2024-176-RC3 -
AC3: 'Reply on RC3', Matteo Pianforini, 15 Oct 2024
The original comments by the Reviewer are provided in italics, while the responses are provided in plain text.
We thank the Reviewer for the comments. The manuscript will be modified based on the recommendations of the Reviewer.a. “The paper presents a potentially interesting contribution by leveraging transformers in the context of real-time flood inundation modeling. However, given its similarity with Pianforini et al. (2024) and other existing works, including a recent contribution in EGUSphere [1] that significantly improves upon Bentivoglio et al. (2023), the novelty of this work does not reach the level expected for HESS. Another venue would be more appropriate for this manuscript. […]
[1] Bentivoglio, Roberto, et al. "Multi-scale hydraulic graph neural networks for flood modelling." EGUsphere 2024 (2024): 1-28.”Thank you for your comment. Regarding the similarity with the previous version of the FS model published in the Journal of Hydrology (Pianforini et al., 2024), we believe that the current version marks a significant advancement. Specifically, we have enhanced the attention mechanism within the transformer block by transitioning from a self-attention mechanism to a cross-attention one. This shift allows the model to integrate information from multiple input sequences (boundary conditions, previous flood stages, etc.), making it far more adaptable to a variety of real-world flood scenarios. While the initial version was limited to dam-break due to its inability to incorporate boundary conditions, the updated version can predict flood events where upstream boundary conditions are crucial (e.g., river floods), making the model applicable to any type of watercourse.
As outlined in the Introduction and Discussion sections of the manuscript, our model differs from existing DL models in several key aspects. Firstly, to the best of our knowledge, this is the first time a transformer-based model has been used to simulate the spatiotemporal evolution of river flood events for real-time forecasts. Additionally, we have demonstrated the model’s capability to handle domains with millions of cells, achieving satisfactory results in both accuracy and computational efficiency.
We would also like to address the Reviewer’s concern regarding the similarity with the recent work of Bentivoglio et al. (2024). Firstly, it is worth noting that this work is still under review on EGUsphere and was only made available online on September 20, 2024. This timing is important: our manuscript was submitted before this work became available, so we could not have taken it into consideration at the time of writing. Such situations are not uncommon in the rapidly evolving field of DL research, where new advancements are published very frequently.
While Bentivoglio et al. (2024) presents an interesting contribution to the field, there are several fundamental differences between their approach and ours that we believe warrant clarification.
- Architectural differences: the primary distinction lies in the AI architectures employed. Bentivoglio et al. (2024) utilized Graph Neural Networks (GNNs) to model flood events, whereas our work employs a transformer-based architecture.
- Different objectives: the objectives of the two studies also differ significantly. Bentivoglio et al. (2024) focused on developing a surrogate model that can generalize across various topographies. In contrast, our work aims to create an ANN for real-time forecasting of inundation maps over large domains. Our approach is geared toward providing fast, accurate predictions for decision-making during flood events, rather than developing a model that generalizes across multiple domains. This focus on operational forecasting is crucial for practical applications in flood risk management.
- Flood event types: another key difference lies in the types of flood events modelled. Our work addresses river flood events, whereas Bentivoglio et al. (2024) concentrated on dike-breach scenarios within relatively flat domains (see the DTM of the “dike ring 15” case study), which generate relatively shallow water depths (less than 2 meters). In contrast, the Po River case study in our work involves much more complex topography (see Fig. 3), with water depths reaching up to 23-24 meters in the main channel (see Fig. 11). The complexity of these scenarios presents a much greater challenge for flood modelling, and our model’s ability to handle such cases confirms its robustness and versatility.
- Scale and resolution: the scale and resolution of the case studies also differ significantly. Bentivoglio et al. (2024) used a low-resolution mesh with 22,881 cells, while we applied the FS model to a domain with 5,226,496 cells, demonstrating its scalability to domains two orders of magnitude larger. Additionally, the FS model maintains high computational efficiency, achieving physical-to-computational time ratios as high as 10,000 (a worked example of what this ratio implies is given below). The scalability of GNNs to larger and more complex domains has yet to be fully explored. Scalability is a critical factor in real-world applications, where high-resolution predictions are needed over large areas. The ability of the FS model to operate at this scale demonstrates its practicality for operational use in real-time flood forecasting.
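As a purely illustrative back-of-the-envelope calculation (our own arithmetic, not an additional result from the manuscript), a ratio of 10,000 implies that a 24-hour flood event is simulated in under ten seconds of wall-clock time:

```latex
t_{\mathrm{comp}} = \frac{t_{\mathrm{phys}}}{10^{4}}
                  = \frac{24 \times 3600\ \mathrm{s}}{10^{4}}
                  \approx 8.6\ \mathrm{s}
```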
In conclusion, while both our work and that of Bentivoglio et al. (2024) contribute valuable advancements to flood modelling, they are based on fundamentally different approaches, objectives, and application contexts. As such, the two works should not be directly compared, as they address distinct challenges in the development of surrogate models for flood modelling. Given these significant differences, we believe both works offer complementary, rather than competing, contributions to the field.
b. “Furthermore, I find the paper to be very long—almost twice the length of similar contributions— and hard to read at times, despite the good quality of the language used. It would benefit from significant shortening to aid readability.”
We appreciate the Reviewer’s positive feedback regarding the clarity of the language used in the paper.
We understand the concern regarding the length of the paper, and we acknowledge that it is longer than typical contributions. The extended length is primarily due to the comprehensive validation of the proposed model, which includes two detailed case studies. Additionally, we conducted a sensitivity analysis to explore the effects of different types and numbers of flood events in the training dataset, alongside a benchmark comparison with another deep learning model.
However, we recognize the importance of maintaining readability and conciseness. We are therefore considering several options to reduce the length of the manuscript. These may include moving specific sections and figures to the appendix or supplementary material, such as “Surrogate model implementation details” (Section 2.1.4), the “Po3 training case” (Section 3.2.3), and Fig. 10. We are also considering shortening certain sections. This approach would allow us to retain the necessary scientific rigor while improving the overall flow and readability of the main body of the paper.
We are confident that these revisions will address the Reviewer’s concerns while ensuring that the key contributions and findings of the study remain clear and accessible.
Citation: https://doi.org/10.5194/hess-2024-176-AC3
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 502 | 189 | 254 | 945 | 42 | 6 | 3 |