the Creative Commons Attribution 4.0 License.
Deep Learning Methods for Flood Mapping: A Review of Existing Applications and Future Research Directions
Abstract. Deep Learning techniques have been increasingly used in flood risk management to overcome the limitations of accurate, yet slow, numerical models, and to improve the results of traditional methods for flood mapping. In this paper, we review 45 recent publications to outline the state-of-the-art of the field, identify knowledge gaps, and propose future research directions. The review focuses on the type of deep learning models used for various flood mapping applications, the flood types considered, the spatial scale of the studied events, and the data used for model development. The results show that models based on convolutional layers are usually more accurate as they leverage inductive biases to better process the spatial characteristics of the flooding events. Traditional models based on fully-connected layers, instead, provide accurate results when coupled with other statistical models. Deep learning models showed increased accuracy when compared to traditional approaches and increased speed when compared to numerical methods. While there exist several applications in flood susceptibility, inundation, and hazard mapping, more work is needed to understand how deep learning can assist real-time flood warning during an emergency, and how it can be employed to estimate flood risk. A major challenge lies in developing deep learning models that can generalize to unseen case studies and sites. Furthermore, all reviewed models and their outputs are deterministic, with limited considerations for uncertainties in outcomes and probabilistic predictions. The authors argue that these identified gaps can be addressed by exploiting recent fundamental advancements in deep learning or by taking inspiration from developments in other applied areas.
Models based on graph neural networks and neural operators can work with arbitrarily structured data and thus should be capable of generalizing across different case studies and could account for complex interactions with the natural and built environment. Neural operators can also speed up numerical models while preserving the underlying physical equations and could thus be used for reliable real-time warning. Similarly, probabilistic models can be built by resorting to Deep Gaussian Processes.
Status: closed
RC1: 'Comment on hess-2021-614', Anonymous Referee #1, 28 Dec 2021
Review report
Thank you for submitting your manuscript on Deep Learning Methods for Flood Mapping: A Review of Existing Applications and Future Research Directions. Much of the text should be improved, including the introduction, validation, and discussion. I therefore give some suggestions and questions that I hope will be useful to the authors, and I recommend major revision.
1) It was difficult to see the justification for the need for this research. The literature review is poor. The paper needs to clearly state the problems with the existing works (these types of approaches) and which problem(s) this particular paper addresses. Without a clear problem statement, readers will have difficulty seeing the merit of this paper. The authors only list some references; I could not find the problems with the existing methods, which remain unclear. The authors should provide a deeper analysis of the gaps among existing methods.
2) A review of flood mapping or of deep learning? This is confusing to me.
3) A flowchart of the method should be inserted.
4) There was very little discussion of previous studies.
Other comment:
Figures: the resolution of the figures should be improved.
References:
The references should be updated with many recent articles.
Citation: https://doi.org/10.5194/hess-2021-614-RC1
AC1: 'Reply on RC1', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #1
We thank the Reviewer for reading our paper and providing comments for its improvement. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
1) It was difficult to see the justification for the need for this research. The literature review is poor. The paper needs to clearly state the problems with the existing works (these types of approaches) and which problem(s) this particular paper addresses. Without a clear problem statement, readers will have difficulty seeing the merit of this paper. The authors only list some references; I could not find the problems with the existing methods, which remain unclear. The authors should provide a deeper analysis of the gaps among existing methods.
We have restated the justification of this research in the introduction (lines 61-65): “The existing reviews mainly focused on the temporal variability of floods, especially concerning rainfall-runoff modeling, covering only a few instances of flood mapping applications. But the spatial evolution of flood events is extremely important to determine affected areas, plan mitigation measures and inform response strategies. Yet, there are no comprehensive overviews and analyses of DL in flood mapping to facilitate flood researchers and practitioners. The aim of this review is thus to advance the emerging field of DL-based flood mapping by surveying the state-of-the-art, identifying outstanding research gaps, and proposing fruitful research directions.”
We report some problems with the existing methods throughout Section 3, and we dedicated Section 4 “Knowledge gaps” to summarise those we believe are the major ones, common to all reviewed papers. In particular, we highlight:
- The lack of “general” DL models that can work across multiple case scenarios. Our review shows that models are usually deployed for single case studies, which greatly limits their applicability.
- Existing DL models are not suitable for modelling complex interactions with the natural and built environments; this hinders their operational use for all types of applications.
- The focus so far has been only on developing deterministic models, while flood management requires accounting for uncertainties in outcomes and probabilistic predictions.
- Further efforts should be directed towards developing DL models for flood risk or real-time flood warning applications, and towards tackling problems related to data availability.
We argue that the community should address these problems by transferring recent fundamental advancements in DL to flood mapping. These advancements mainly include mesh-based neural networks, such as Graph Neural Networks and Fourier Neural Operators, as well as Probabilistic Deep Learning. The future research directions in Section 5 of the paper substantiate these suggestions and provide insights on how to apply these methods to improve flood mapping.
While we believe that the paper’s justification and contributions are sufficiently clear, we tried our best to improve it during the revision process. As suggested by the Reviewer, we modified the manuscript to further clarify its merits and purpose.
2) A review of flood mapping or of deep learning? This is confusing to me.
As stated in the title of the paper, this review concerns deep learning methods for flood mapping. We evaluate the efforts of the community concerning the design and implementation of deep learning methods for flood mapping. Therefore, the manuscript explores the intersection between these two areas, as stated in lines 66-69 of the original manuscript: “45 papers are analysed considering two main parallel yet intertwined directions. On the one hand, we focused on the flood management application, spatial scale of study, and type of flood. On the other hand, we examined the deep learning model, type of training data, and performance with respect to alternative methods. This strategy provides insights from a flood management perspective and concurrently facilitates reflection on how to successfully apply DL models.”
3) A flowchart of the method should be inserted.
We thank the Reviewer for the suggestion. We have now added a flow chart (Fig. 3) to explain our methodology better.
4) There was very little discussion of previous studies.
Thank you for raising this point. While some specific discussions of individual studies are included throughout Section 3, we opted for a concise yet meaningful account of the reviewed papers, also to respect the length limitations. In line with successful review papers in the field (e.g., Maier, 2013), we preferred to report observations that are valid across multiple studies, especially when outlining the knowledge gaps and proposing future research directions. That said, as suggested by the Reviewer, we included further insights from individual studies throughout Section 3 of our paper to enrich the overall narrative.
Here follow some examples we included in the revised manuscript:
Lines 434-437 “Most CNN models show noticeable improvements with respect to traditional threshold methods, such as the Normalized Difference Water Index (NDWI) and automatic threshold model (ATM) (e.g., Wieland and Martinis, 2019; Isikdogan et al., 2017; Nemni et al., 2020), and with respect to machine learning models such as random forest (RF) and support vector machine (SVM).”
Lines 461-464 “However, Wang et al. (2020b) and Liu et al. (2021) show that 1D-CNNs, which perform convolution on the input features for each domain’s cell, are not suited for this problem, as they do not properly leverage any inductive bias. Some works showed that deep belief networks (DBN), an unsupervised variation of MLPs, could outperform standard MLPs in flood susceptibility mapping (e.g., Shirzadi et al., 2020; Pham et al., 2021).”
Lines 516-517 “Hu et al. (2019) and Jacquier et al. (2021) use a LSTM and a MLP, respectively, in combination with a reduced order modelling framework.”
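For readers less familiar with the threshold baselines mentioned in the first excerpt, a minimal NDWI-based water-masking sketch could look as follows (the band values and the zero threshold are illustrative assumptions, not taken from any reviewed paper):

```python
# Minimal NDWI water masking, the kind of baseline CNNs are compared to.
# NDWI = (Green - NIR) / (Green + NIR); pixels above a threshold
# (commonly 0) are classified as water. All reflectance values invented.

def ndwi_water_mask(green, nir, threshold=0.0):
    """Return a boolean water mask from green and NIR reflectance grids."""
    mask = []
    for g_row, n_row in zip(green, nir):
        row = []
        for g, n in zip(g_row, n_row):
            denom = g + n
            ndwi = (g - n) / denom if denom != 0 else 0.0
            row.append(ndwi > threshold)
        mask.append(row)
    return mask

green = [[0.30, 0.10], [0.25, 0.05]]
nir   = [[0.10, 0.40], [0.05, 0.30]]
print(ndwi_water_mask(green, nir))  # water wherever green > NIR
```

Unlike a trained model, such a rule uses no labelled data at all, which is why it serves as a common reference point across studies.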
Other comment:
Figures: the resolution of the figures should be improved.
Thanks for the suggestion. We improved the font sizes and resolutions of all figures.
References:
The references should be updated with many recent articles.
We thank the Reviewer for pointing out this very important detail. By refining the search procedure described in Section 3.1, we retrieved 13 more very recent papers which are now included in the review. The complete list of added papers is shown in the references below.
References:
Maier, H. R.: What constitutes a good literature review and why does its quality matter?, Environmental Modelling & Software, 43, 3-4, 2013.
Added references:
Ahmed, N., Hoque, M. A.-A., Arabameri, A., Pal, S. C., Chakrabortty, R., and Jui, J.: Flood susceptibility mapping in Brahmaputra floodplain of Bangladesh using deep boost, deep learning neural network, and artificial neural network, Geocarto International, pp. 1–22, 2021.
Chakrabortty, R., Chandra Pal, S., Rezaie, F., Arabameri, A., Lee, S., Roy, P., Saha, A., Chowdhuri, I., and Moayedi, H.: Flash-flood hazard susceptibility mapping in Kangsabati River Basin, India, Geocarto International, pp. 1–23, 2021a.
Chakrabortty, R., Pal, S. C., Janizadeh, S., Santosh, M., Roy, P., Chowdhuri, I., and Saha, A.: Impact of Climate Change on Future Flood Susceptibility: an Evaluation Based on Deep Learning Algorithms and GCM Model, Water Resources Management, 35, 4251–4274, 2021b.
Hosseiny, H.: A deep learning model for predicting river flood depth and extent, Environmental Modelling & Software, 145, 105186, 2021.
Isikdogan, F., Bovik, A. C., and Passalacqua, P.: Surface water mapping by deep learning, IEEE journal of selected topics in applied earth observations and remote sensing, 10, 4909–4918, 2017.
Jacquier, P., Abdedou, A., Delmas, V., and Soulaïmani, A.: Non-intrusive reduced-order modeling using uncertainty-aware Deep Neural Networks and Proper Orthogonal Decomposition: Application to flood modeling, Journal of Computational Physics, 424, 109854, 2021.
Lei, X., Chen, W., Panahi, M., Falah, F., Rahmati, O., Uuemaa, E., Kalantari, Z., Ferreira, C. S. S., Rezaie, F., Tiefenbacher, J. P., et al.: Urban flood modeling using deep-learning approaches in Seoul, South Korea, Journal of Hydrology, 601, 126684, 2021.
Liu, J., Wang, J., Xiong, J., Cheng, W., Sun, H., Yong, Z., and Wang, N.: Hybrid Models Incorporating Bivariate Statistics and Machine Learning Methods for Flash Flood Susceptibility Assessment Based on Remote Sensing Datasets, Remote Sensing, 13, 4945, 2021.
Saeed, M., Li, H., Ullah, S., Rahman, A.-u., Ali, A., Khan, R., Hassan, W., Munir, I., and Alam, S.: Flood Hazard Zonation Using an Artificial Neural Network Model: A Case Study of Kabul River Basin, Pakistan, Sustainability, 13, 13953, 2021.
Syifa, M., Park, S. J., Achmad, A. R., Lee, C.-W., and Eom, J.: Flood mapping using remote sensing imagery and artificial intelligence techniques: a case study in Brumadinho, Brazil, Journal of Coastal Research, 90, 197–204, 2019.
Wieland, M. and Martinis, S.: A modular processing chain for automated flood monitoring from multi-spectral satellite data, Remote Sensing, 11, 2330, 2019.
Yokoya, N., Yamanoi, K., He, W., Baier, G., Adriano, B., Miura, H., and Oishi, S.: Breaking limits of remote sensing by deep learning from simulated data for flood and debris-flow mapping, IEEE Transactions on Geoscience and Remote Sensing, 2020.
Xie, S., Wu, W., Mooser, S., Wang, Q., Nathan, R., and Huang, Y.: Artificial neural network based hybrid modeling approach for flood inundation modeling, Journal of Hydrology, 592, 125605, 2021.
Citation: https://doi.org/10.5194/hess-2021-614-AC1
RC2: 'Comment on hess-2021-614', Anonymous Referee #2, 28 Dec 2021
This manuscript focuses on the application of deep learning methods to flood mapping, including flood extent mapping, flood susceptibility mapping, and flood hazard mapping. There are some concerns with the manuscript. Below are my comments; I hope the authors find them useful.
Major concern:
1. I suggest the authors provide the time extent of the reviewed publications, because studies related to deep learning in flood mapping are constantly updated.
2. It is better to introduce the three flood maps as follows: first show flood extent or inundation maps, then illustrate flood susceptibility maps, and finally present flood hazard maps. Flood extent or inundation mapping can be viewed as preliminary work in mapping research; the results of flood extent maps can then be used as training data to predict flood susceptibility; and flood susceptibility indicates the potential location of future floods. Finally, a flood hazard map can be viewed as an extension of a flood susceptibility map that considers not only the location of the flood but also integrates the depth and water extent.
3. A deep belief network is also an important part of deep learning, and it has been used in the flood mapping field. Some studies are shown below:
[1] Shahabi, Himan, et al. "Flash flood susceptibility mapping using a novel deep learning model based on deep belief network, back propagation and genetic algorithm." Geoscience Frontiers 12.3 (2021): 101100.
[2] Shirzadi, Ataollah, et al. "A novel ensemble learning based on Bayesian Belief Network coupled with an extreme learning machine for flash flood susceptibility mapping." Engineering Applications of Artificial Intelligence 96 (2020): 103971.
[3] Pham, Binh Thai, et al. "Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling?." Journal of Hydrology 592 (2021): 125615
4. The application perspectives of the different mapping scenarios are different. Therefore, the authors should provide specific limitations and future research directions for the different mapping frameworks. For example, deep learning methods for susceptibility mapping focus on predicting the location of potential flood areas by considering historical locations and environmental variables; it is therefore important to design an appropriate network to integrate heterogeneous environmental information. Flood extent mapping, in contrast, aims to find the continuous inundated areas based on satellite or UAV images, and some deep learning methods such as semantic segmentation are more appropriate there.
Minor concern:
1. Figure 1: the legend overlaps the main figure.
2. It is better to entitle Section 2.2 as “Deep learning method”. Section 2.2.1, 2.2.2, 2.2.3 should be entitled “Multi-layer perceptron”, “Convolutional neural network”, and “Recurrent neural network”, respectively.
3. Section 2.2, line 155: the first sentence lacks a related reference.
4. Figure 1: please provide the location information in the caption.
5. Figure 5(a) should be improved.
6. Section 5.3 belongs to the future directions, but data scarcity is a kind of limitation. "Data enhancement" may be a more suitable title for this section.
Citation: https://doi.org/10.5194/hess-2021-614-RC2
AC2: 'Reply on RC2', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #2
We thank the Reviewer for reading our paper and providing detailed and insightful comments. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
Major concern:
1) I suggest the authors provide the time extent of the reviewed publications, because studies related to deep learning in flood mapping are constantly updated.
The time extent considered for the reviewed publications has now been explicitly mentioned in section 3.1: lines 255-256 “The 3,338 publications obtained were then filtered to include only journal papers from January 2010 until December 2021, in the areas of engineering, environmental science, and earth and planetary sciences”
2) It is better to introduce the three flood maps as follows: first show flood extent or inundation maps, then illustrate flood susceptibility maps, and finally present flood hazard maps. Flood extent or inundation mapping can be viewed as preliminary work in mapping research; the results of flood extent maps can then be used as training data to predict flood susceptibility; and flood susceptibility indicates the potential location of future floods. Finally, a flood hazard map can be viewed as an extension of a flood susceptibility map that considers not only the location of the flood but also integrates the depth and water extent.
Thank you for this insightful recommendation. We agree with it and we changed the order of the presented flood maps in Section 2.1.2 and Section 3. Moreover, we modified lines 107-112 as follows: “Flood inundation maps determine the extent of a flood, during or after it has occurred (see Fig. 1a). Flood inundation maps represent flooded and non-flooded areas. This application is used for post-flood evacuation and protection planning, and for damage assessment. These maps can then be used also as observed and calibration data for other applications. Flood images are obtained through remote-sensing techniques and processed by histogram-based models (e.g., Martinis et al., 2009; Manjusree et al., 2012), threshold models (e.g., Cian et al., 2018), and machine learning models (e.g., Hess et al., 1995; Ireland et al., 2015).”
3) A deep belief network is also an important part of deep learning, and it has been used in the flood mapping field. Some studies are shown below:
[1] Shahabi, Himan, et al. "Flash flood susceptibility mapping using a novel deep learning model based on deep belief network, back propagation and genetic algorithm." Geoscience Frontiers 12.3 (2021): 101100.
[2] Shirzadi, Ataollah, et al. "A novel ensemble learning based on Bayesian Belief Network coupled with an extreme learning machine for flash flood susceptibility mapping." Engineering Applications of Artificial Intelligence 96 (2020): 103971.
[3] Pham, Binh Thai, et al. "Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling?." Journal of Hydrology 592 (2021): 125615
We thank the Reviewer for pointing out the applications of Deep Belief Networks (DBN) for flood mapping. Since DBNs are unsupervised methods, we did not include them among the reviewed papers as our manuscript focuses on supervised methods (lines 164-165 in the original manuscript). Nonetheless, we included some of the suggested papers in lines 473-475: “Some works showed that deep belief networks (DBN), an unsupervised variation of MLPs, could outperform standard MLPs in flood susceptibility mapping (e.g., Shirzadi et al., 2020; Pham et al., 2021).”
4) The application perspectives of the different mapping scenarios are different. Therefore, the authors should provide specific limitations and future research directions for the different mapping frameworks. For example, deep learning methods for susceptibility mapping focus on predicting the location of potential flood areas by considering historical locations and environmental variables; it is therefore important to design an appropriate network to integrate heterogeneous environmental information. Flood extent mapping, in contrast, aims to find the continuous inundated areas based on satellite or UAV images, and some deep learning methods such as semantic segmentation are more appropriate there.
We thank the reviewer for the useful feedback. We carefully considered the competent suggestions made by the Reviewer concerning the presentation of specific limitations and research directions for the different types of flood mapping applications. We included limitations of each presented flood map in lines 558-563: “Nonetheless, each of the presented maps has its own limitations. Susceptibility maps provide only qualitative results and rely on recorded flood events. Therefore, limited recorded data may lead to incorrect predictions. Moreover, it is important to design an appropriate model to integrate heterogeneous environmental information. Inundation maps mostly consider real events, thus they suffer from the acquisition method’s problems. For example, satellites struggle to extract information below clouded areas (e.g., Meraner et al., 2020). Hazard maps, instead, are limited by the accuracy of the underlying numerical simulator.”
As regards flood extent or inundation mapping, most of the presented papers indeed consider semantic segmentation, referred to in the paper as “image segmentation”. We added a description of what image segmentation refers to in lines 220-222: “Instead, if the task is to perform image segmentation, i.e., classify specific parts of an image, the final layers are composed of de-convolutional layers which perform an operation opposite of convolutional layers, in an encoder-decoder structure”.
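As a toy illustration of this encoder-decoder structure, the sketch below uses a 2x2 max-pool as a stand-in for the learned (strided) convolutional encoder and nearest-neighbour upsampling as a stand-in for the de-convolutional decoder; all values are invented for illustration:

```python
# Toy encoder-decoder for segmentation: the encoder reduces spatial
# resolution (max-pool standing in for strided convolutions), the
# decoder restores it (nearest-neighbour upsampling standing in for
# de-convolution). Real networks learn these mappings from data.

def maxpool2(grid):
    return [[max(grid[i][j], grid[i][j + 1],
                 grid[i + 1][j], grid[i + 1][j + 1])
             for j in range(0, len(grid[0]), 2)]
            for i in range(0, len(grid), 2)]

def upsample2(grid):
    out = []
    for row in grid:
        wide = [v for v in row for _ in (0, 1)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                   # repeat each row
    return out

x = [[0, 1, 0, 0],
     [0, 0, 0, 2],
     [3, 0, 0, 0],
     [0, 0, 4, 0]]
code = maxpool2(x)       # encoder: 4x4 -> 2x2, here [[1, 2], [3, 4]]
print(upsample2(code))   # decoder: 2x2 -> back to a 4x4 per-pixel map
```

The point of the structure is that the output has the same spatial resolution as the input, so each pixel receives its own flooded/non-flooded label.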
Minor concern:
1) Figure 1: the legend overlaps the main figure.
We removed the legend overlap in Fig. 1.
2) It is better to entitle Section 2.2 as “Deep learning method”. Section 2.2.1, 2.2.2, 2.2.3 should be entitled “Multi-layer perceptron”, “Convolutional neural network”, and “Recurrent neural network”, respectively.
We renamed the sections as suggested by the Reviewer.
3) Section 2.2, line 155: the first sentence lacks a related reference.
We added a reference to line 155 of the original manuscript (LeCun et al., 2015).
4) Figure 1: please provide the location information in the caption.
We omitted the location since the purpose of Figure 1 is merely exemplificative; it should not be taken as a necessarily correct estimate of flood inundation, susceptibility, and hazard for the selected area.
5) Figure 5(a) should be improved.
Figure 5(a) was improved so that the legend does not overlap with the bar plot anymore.
6) Section 5.3 belongs to the future directions, but data scarcity is a kind of limitation. "Data enhancement" may be a more suitable title for this section.
Thanks for the suggestion. Based on the comment and on the common nomenclature used in machine learning, we changed section 5.3 to “Data augmentation”.
References:
LeCun, Y., Bengio, Y., and Hinton, G.: Deep learning, Nature, 521, 436–444, 2015.
Citation: https://doi.org/10.5194/hess-2021-614-AC2
RC3: 'Comment on hess-2021-614', Anonymous Referee #3, 02 Jan 2022
This paper performs a review of deep learning approaches applied for flood mapping. In a field that is evolving rapidly, I think this work can make a valuable contribution in ensuring a common understanding of techniques in the community and outlining future research directions.
My concerns are that
- the review is currently not always very precise in distinguishing the contexts in which different approaches are relevant
- it lacks an assessment of which techniques were used over time (some now popular techniques were not available 2 years ago)
- it could occasionally be better at explaining concepts with a focus on the hydrological target group
- generalization of deep learning predictions to locations / events outside the training data is a key aspect that deserves a more prominent place in the paper. Currently this topic is raised in several subsections. It might be useful to provide an overview of what the needs actually are, which can then be used to discuss whether different approaches are conceptually able to meet them (and whether this was/was not implemented in current research)
- comparisons of scores across papers need to be interpreted more carefully than is currently the case. Scores are not necessarily computed in the same manner. In particular, non-flooded areas are not handled consistently in the literature, which has a major impact on the results.
I think all of these issues can be addressed in a revision. I have provided detailed comments below.
Detailed comments
line 51: the automatic discovery of representations is "to some extent" possible. We are still dealing with an input-output model. It is quite a common misunderstanding that deep learning can find "any representations", while many relations in hydrology are highly nonlinear and require careful consideration of the data.
line 126-137: I don't think the detailed overview of modelling approaches is needed in this review.
line 190 to 205: this text is somehow misplaced in this section. It is more an assessment of the properties of different techniques and it would probably make more sense to place it after the different layer types were introduced.
Figure 2: I think it would help many readers if the figure illustrated that the convolutional kernels map many pixels to one. Also in the text (line 210), a simple explanation of the kernels (a spatially weighted average where the weights are learned during optimization) may be helpful.
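This many-pixels-to-one weighting can be sketched in a few lines (the kernel weights here are a fixed, illustrative choice; in a CNN they are learned during optimization):

```python
# A convolutional kernel as a spatially weighted sum: each output pixel
# combines a 3x3 neighbourhood of input pixels using a set of weights.
# Here the weights are all ones (a neighbourhood sum); dividing by 9
# would give the mean. No padding, stride 1, values invented.

def conv2d(image, kernel):
    k = len(kernel)                      # square kernel assumed
    h, w = len(image), len(image[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]

ones3 = [[1, 1, 1]] * 3                  # illustrative fixed weights
image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
print(conv2d(image, ones3))              # 4x4 input -> 2x2 output: [[36, 36], [36, 36]]
```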
Table 2: I believe the correct citation for the work of Guo et al. is 2021, not 2020
Section 3.2.4: The review is generally missing a section that discusses under which conditions a deep learning network can generalize, i.e. predict flooding in different locations
Section 3.2.5: A key issue when assessing flood predictions (inundation and hazard) is the large number of zeros (often >95% of the dataset) which implies that, for example, accuracy scores almost per definition are in the order of 80% and above. This issue needs to be explained here. In addition, binary scores such as CSI are very vulnerable to double penalty issues.
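A toy example of this zero-dominance (all numbers invented): on a domain with 5% flooded cells, a model that predicts "dry" everywhere already reaches 95% accuracy while its Critical Success Index is zero.

```python
# With 95 dry cells and 5 flooded cells, the trivial all-dry predictor
# scores 95% accuracy, but CSI = hits / (hits + misses + false alarms)
# is zero, because it never detects a flooded cell.

def accuracy(pred, obs):
    return sum(p == o for p, o in zip(pred, obs)) / len(obs)

def csi(pred, obs):
    hits = sum(p and o for p, o in zip(pred, obs))
    misses = sum((not p) and o for p, o in zip(pred, obs))
    false_alarms = sum(p and not o for p, o in zip(pred, obs))
    denom = hits + misses + false_alarms
    return hits / denom if denom else 0.0

obs = [1] * 5 + [0] * 95      # 5% flooded cells
all_dry = [0] * 100           # trivial "never flooded" predictor

print(accuracy(all_dry, obs))  # 0.95
print(csi(all_dry, obs))       # 0.0
```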
Section 3.4: In general, for flood inundation, it is not completely clear to me whether the authors focus on models that can predict flood inundation (in binary form) given some rainfall or on "gap filling" in remote sensing data. This needs to be checked in all related sections.
Line 472: Due to the 0 problem mentioned above, "slight" increases in accuracy may actually be linked to substantial changes of the quality of a model. The scores therefore need to be interpreted carefully and it is also not guaranteed that all papers computed scores in the same manner.
Line 496: Pham et al. assessed flood conditioning factors, Löwe et al. performed a forward selection to identify relevant topographic variables, Zahura et al. tested feature importance in their random forest model
Table 5: It is not clear to me why not all the papers performing hazard predictions were included in this table? In addition, the error scores may not be comparable across papers (the zero problem or similar), which should be mentioned. Also, speed-up is a difficult quantity to compare, because it depends on the assumed number of numerical simulations that should be performed (e.g. if we assume that we have to assess flood hazard for 1000 rain events, then the speed-up factor obtained by a neural network will be much higher than when only 10 events are considered). Most certainly, these assumptions are also not the same across papers and therefore not comparable.
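The dependence of the speed-up factor on the number of events can be made concrete with a back-of-the-envelope sketch (all timings below are invented for illustration):

```python
# Why reported speed-ups are not comparable across papers: the factor
# depends on how many events amortise the surrogate's one-off training
# cost. All timings invented (seconds).

def speedup(n_events, t_numerical, t_train, t_inference):
    """Ratio of total numerical runtime to total surrogate runtime."""
    return (n_events * t_numerical) / (t_train + n_events * t_inference)

t_num, t_train, t_inf = 3600.0, 72000.0, 1.0  # per event, total, per event

print(round(speedup(10, t_num, t_train, t_inf), 1))    # 0.5: no gain at all
print(round(speedup(1000, t_num, t_train, t_inf), 1))  # 49.3: large gain
```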
Section 4.2: The discussion on generalization abilities needs to be differentiated a bit more. Both Guo et al. 2021 and Löwe et al. 2021 consider terrain characteristics as input to their models, and in Löwe et al. 2021 generating predictions outside the training dataset was explicitly the focus of the work. As mentioned by the authors, these approaches are in their infancy and have been tested on limited datasets, but these approaches do consider effects of e.g. the built environment in the form of 2D grids.
Section 5.1: While investigating the possibility to consider mesh-based deep learning setups is an interesting direction, the authors present no argument why this should work better than convolutional approaches (which are also used for simulating fluid movements). Contrary to what is stated around line 610, they are simply a different data representation with advantages and disadvantages (mesh generation) and may or may not improve performance.
Line 648: From here on the text no longer focuses on meshes (which is the Section heading) but on physical conditioning.
Line 656: I think a formulation that will be easier to understand for many readers is that the PINN can only be trained for a specific boundary condition (such as a specific rain event) and it is subsequently only able to simulate this specific event.
Line 656: FNOs need to be mentioned as one approach amongst many. DeepONets are a widely known alternative and new approaches are constantly developed. The same is true for DGP in the following section.
Section 5.3: I don't see how GANs fix data scarcity issues (line 680). They are indeed an interesting approach for e.g. gap filling or the generation of rainfall scenarios, but they need to be trained and do not relieve us of the problem that e.g. flood observations are hardly available. The discussion in the first parts of this section goes in a very different direction than the transfer learning approaches (which focus on training models with few data), which creates confusion.
Conclusions
First bullet - this conclusion could be clearer about the methodological preferences being the current status, which is developing rapidly.
Line 724 - I would say DL for hazard mapping so far relies on numerical simulations, this may change.
Line 731-736 - Some of the existing architectures do enable generalization but this certainly requires more research and testing. Meshes are one way forward amongst others.
Line 737-741 - Physics-informed learning is not only relevant in a warning context but for virtually all kinds of flood simulations. FNOs and DGPs are potentially interesting approaches, but there are others. You are overstating the ability of geometric DL, which (to my knowledge) has not been tested in the flood context.
Line 742-745 - As mentioned before, there is some logic here that does not make sense, because the GANs need to be trained against observed data. Once we have a GAN, what would be the point of training another deep learning model that only learns to emulate the output of the GAN?
References:
Löwe, R., Böhm, J., Jensen, D. G., Leandro, J., & Rasmussen, S. H. (2021). U-FLOOD – topographic deep learning for predicting urban pluvial flood water depth. Journal of Hydrology, 603, 126898. https://doi.org/10.1016/j.jhydrol.2021.126898
Pham, B. T., Luu, C., Phong, T. Van, Trinh, P. T., Shirzadi, A., Renoud, S., Asadi, S., Le, H. Van, von Meding, J., & Clague, J. J. (2020). Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling? Journal of Hydrology, 592(July 2020), 125615. https://doi.org/10.1016/j.jhydrol.2020.125615
Zahura, F. T., Goodall, J. L., Sadler, J. M., Shen, Y., Morsy, M. M., & Behl, M. (2020). Training machine learning surrogate models from a high-fidelity physics-based model: Application for real-time street-scale flood prediction in an urban coastal community. Water Resources Research, 56(10), e2019WR027038. https://doi.org/10.1029/2019WR027038
Citation: https://doi.org/10.5194/hess-2021-614-RC3
AC3: 'Reply on RC3', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #3
We thank the Reviewer for carefully reading our paper and providing very useful comments for its improvement. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
My concerns are that
1)-the review is currently not always very precise in distinguishing the contexts in which different approaches are relevant
We thank the Reviewer for raising this point. We have added comments on each approach throughout the paper, e.g., line 109 (referred to inundation maps): “These maps can then be used also as calibration data for other applications such as flood susceptibility or flood hazard mapping.”
Another example is in lines 114-115 (referred to susceptibility maps): “However, it can provide reliable information when no quantitative data is available.”
We also illustrated the main limitations of each flood map in lines 558-563: “Nonetheless, each of the presented maps has its own limitations. Susceptibility maps provide only qualitative results and rely on recorded flood events. Therefore, limited recorded data may lead to incorrect predictions. Moreover, it is important to design an appropriate model to integrate heterogeneous environmental information. Inundation maps mostly consider real events, thus they suffer from the acquisition method’s problems. For example, satellites struggle to extract information below clouded areas (e.g., Meraner et al., 2020). Hazard maps, instead, are limited by the accuracy of the underlying numerical simulator.”
2)-it lacks an assessment of which techniques were used over time (some now popular techniques were not available 2 years ago)
We thank the reviewer for this observation. We included temporal context in section 3.2.1 to specify why certain methods were applied only in recent years: lines 275-277 “The late use of convolutional and recurrent models is motivated by their recent popularization and development, along with a rise in awareness of the ML advancements, contrary to fully-connected layers, that have a longer application history.”
3)-it could occasionally be better at explaining concepts with a focus on the hydrological target group
We thank the Reviewer for pointing out this aspect. In this light, we amended relevant text throughout the paper, adding more examples and comments reflecting a hydrologic/hydraulic perspective. For instance, we added in lines 189-193 “We explain the concept of invariance and equivariance with an example. Consider a picture with a flooded area in its top-left corner and one with the same area but shifted to be in the bottom-right corner. An invariant model would predict that there is a flooded area in both images, while an equivariant model would also reflect the change in positions of the flood, i.e., identify that the flood is in the top-left corner in one case and in the bottom-right corner in the other. In this case, invariance and equivariance are associated to a spatial translation, but the same principle applies to other transformations, such as temporal translation.”
Following this example, we also added a similar consideration in lines 668-672 (section 5.1.1): “Symmetries result in inductive biases, which address the curse of dimensionality by decreasing the required training data (e.g., Wang et al., 2020a) and enabling the processing of different data types, such as meshes. From a flooding perspective, symmetries can be understood and motivated by referring to the example in Section 2.2.1. For instance, analogously to translation, the rotation of a domain should result in an equivalent rotation of the predictions.”
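For illustration, the translation-equivariance property described in the quoted passages can also be verified numerically. The sketch below is not part of the manuscript; it uses a toy circular convolution (so the equality is exact, without boundary effects) to show that shifting a "flood map" and then convolving gives the same result as convolving and then shifting:

```python
import numpy as np

def circ_conv(x, k):
    # Circular 2D cross-correlation: each output pixel is a spatially
    # weighted average of its neighbourhood, with the kernel as weights.
    out = np.zeros_like(x, dtype=float)
    kh, kw = k.shape
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * np.roll(x, shift=(-i, -j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))        # toy "flood map" (hypothetical values)
k = rng.random((3, 3))        # kernel weights (random stand-in for learned ones)
shift = (3, 2)                # move the flooded area to another corner

a = circ_conv(np.roll(x, shift, axis=(0, 1)), k)  # shift, then convolve
b = np.roll(circ_conv(x, k), shift, axis=(0, 1))  # convolve, then shift
print(np.allclose(a, b))      # True: the convolution is shift-equivariant
```

An invariant quantity, such as the total flooded fraction of the map, would instead be unchanged by the shift altogether.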
Another hydrologic explanation is in lines 385-387: “For example, if 90% of an area represents non-flooded areas, a model which assumes that there are no floods will have 90% accuracy.”
4)-generalization of deep learning predictions to locations / events outside the training data is a key aspect that deserves a more prominent place in the paper. Currently this topic is raised in several subsections. It might be useful to provide an overview on what are actually the needs, which can then be used to discuss whether different approaches are conceptually able to meet this (and if this was/was not implemented in current research)
We agree with the Reviewer that the generalization of deep learning models deserves a more prominent place as it is a valuable and difficult gap. Thus, we created a corresponding subsection in the knowledge gaps section (section 4.2) in our updated manuscript as follows:
“Generalization refers to the capacity of a model to extrapolate from a training dataset into unseen testing data. This means that a DL model can correctly predict scenarios unused in its development. This property is particularly relevant because training requires data, model development, and time. In the context of flood modelling, there are two main generalization objectives: (i) boundary conditions, i.e., different rainfall events, and (ii) topographical changes, i.e., different case studies. However, the transference between different areas is challenging for DL models because of the difference in input and output data. In fact, except for flood inundation mapping, most reviewed papers focused on generalizing different boundary conditions (e.g., Guo et al., 2021; Berkhahn et al., 2019). Instead, only a few papers tested the model on areas not considered during training. Löwe et al. (2021) could generate flood hazard maps for unseen areas within the same study region as the training dataset, as there was little variability of inputs and outputs. Zhao et al. (2021b) instead pre-trained a model for flood susceptibility on an urban area and then used it for another similar area. They showed that pre-training improves predictions with respect to a model trained from scratch, both in cases of low and high data availability. These works show that such approaches are in their infancy and have been tested on limited datasets. A DL model which cannot generalize to new areas has to be trained every time for a new case study. Thus, it may have limited advantages over a hydraulic model, since it requires more effort, data, and time. Instead, a general DL model which can generalize to new areas could emphasize the advantages over numerical models. This concept was also explored for rainfall-runoff modeling, where DL models outperformed state-of-the-art alternatives in the prediction of ungauged basins in new study areas (Kratzert et al., 2019b).”
5)-comparisons of scores across papers need to be interpreted more carefully than what is currently the case. Scores are not necessarily computed in the same manner. In particular, non-flooded areas are not handled consistently in the literature, which has a major impact on the results.
We thank the Reviewer for the valuable comment. We revisited this section and added the following paragraph to identify the issue mentioned. Lines 397-400: “Moreover, since different works generally use different datasets, a comparison across them may not always be meaningful. Instead, our purpose here is to show that, for the same case study, DL tends to outperform traditional models.”
Additionally, we thought that the issue of incomparable metrics could also be reflected in the absence of a unified dataset for the different flood applications. Thus, we added a new paragraph in the data availability knowledge gap in lines 643-650: “Another issue, which emerges also from Section 3.2.5, is the lack of a unified framework to compare different approaches with each other. This can be achieved by creating flood-based benchmark datasets for each mapping application. For flood inundation, some datasets have been already used across different works (e.g., Bonafilia et al., 2020). However, works on both flood susceptibility and hazard mapping consider different datasets, focusing on different geographic areas or flood types. One possibility could then be to unify different case studies in a single dataset, for each application, allowing to assess the validity of a model more objectively. For flood susceptibility, case studies with the same input availability could be merged in a dataset with many flood types, scales, and geographical areas. A similar reasoning could be made for flood hazard mapping, selecting, for each case study, initial and boundary conditions for specific return periods.”
I think all of these issues can be addressed in a revision. I have provided detailed comments below.
Detailed comments
6)line 51: the automatic discovery of representations is "to some extent" possible. We are still dealing with an input-output model. It is quite a common misunderstanding that deep learning can find "any representations", while many relations in hydrology are highly nonlinear and require careful consideration of the data.
We thank the Reviewer for this clarification. With appropriate data, deep learning can learn any representation, as a consequence of the universal approximation theorem (Hornik et al., 1989). However, it is indeed important to select the data carefully. Thus, we changed the sentence in line 52 of the original manuscript to avoid confusion and we further clarified that data still require accurate pre-processing and selection. Lines 52-53: “Nonetheless, data must be carefully selected according to the task at hand.”
7)line 126-137: I don't think the detailed overview of modelling approaches is needed in this review.
We thank the Reviewer for the suggestion. We reduced the length of this paragraph and described numerical models briefly. The section has been modified as follows: “Flood hazard maps are produced by numerical models, which simulate flood events by discretizing the governing equations and the computational domain. We distinguish between one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) models with increasing complexity and, generally, accuracy (e.g., Horritt and Bates, 2002; Teng et al., 2017).”
8)line 190 to 205: this text is somehow misplaced in this section. It is more an assessment of the properties of different techniques and it would probably make more sense to place it after the different layer types were introduced.
We thank the Reviewer for the observation. We placed here this section to introduce the necessity of models like convolutional or recurrent networks to overcome the shortcomings of fully connected layers. While writing the paper, we tried placing this section elsewhere (e.g., before and after introducing the different layers), but ultimately we agreed that the current configuration was more suitable, even if it may hinder the reading flow. Thus, we would prefer to keep it as it is.
9)Figure 2: I think it would help many readers if the figure illustrates that the convolutional kernels map many pixels to one. Also in the text (line 210), a simple explanation of the kernels (spatially weighted average where the weights are learned during optimization) may be helpful.
We thank the Reviewer for the precious comment. We added an explanation of kernels in lines 199-200: “A kernel represents a spatially weighted average which is applied to the input and where the weights are learned during optimization.”
10)Table 2: I believe the correct citation for the work of Guo et al. is 2021, not 2020
Thank you for the observation. We modified the citation as correctly pointed out.
11)Section 3.2.4: The review is generally missing a section that discusses under which conditions a deep learning network can generalize, i.e. predict flooding in different locations
We thank the Reviewer for the comment. With the new section on generalization, we tried to discuss the current issues and limitations. However, the problem of generalization is complicated and the minimum conditions needed to guarantee it are still an open research question, even in the machine learning community.
12)Section 3.2.5: A key issue when assessing flood predictions (inundation and hazard) is the large number of zeros (often >95% of the dataset) which implies that, for example, accuracy scores almost per definition are in the order of 80% and above. This issue needs to be explained here. In addition, binary scores such as CSI are very vulnerable to double penalty issues.
Thanks for the valuable comment. We modified section 3.2.5 and added further clarification of the problems related to classification metrics, addressing the issues of imbalanced datasets and appropriate metrics for flood mapping. We did not address the issue of CSI since very few papers use that metric.
Section 3.2.5 has been modified as follows: “In supervised learning, we distinguish between regression and classification problems, depending on whether the target values to predict are continuous (e.g., water depth) or discrete (e.g., flooded vs non-flooded area), respectively. Depending on the task, we employ a different set of metrics to evaluate model performances.
Regression metrics are a function of the differences, or residuals, between target and predicted values. The most common metrics include the root mean squared error (RMSE), the coefficient of determination (R2), and the mean absolute error (MAE). RMSE and MAE improve as they approach zero, while R2 improves as it approaches one. In general, RMSE is preferred to MAE since it penalizes large errors more heavily, thus discouraging extreme outliers. However, since these metrics are averaged on a domain, their comparison across different works requires careful attention.
Classification tasks can be either binary (e.g., building a model to predict flooded and non-flooded locations) or multi-categorical (e.g., classifying between permanent water bodies, buildings, and vegetated areas), according to the output number of classes. In the following discussion, we focus on the former, with concepts naturally extending to the second case. When computing binary classification metrics, flooded areas are generally represented as the positive class, while non-flooded areas as the negative class. The most common metrics for flood modelling are accuracy, recall, and precision, followed by other indices such as the area under the receiver operator characteristic curve. Accuracy represents the number of correct predictions over the total. While popular and easy to implement, this metric is inappropriate for imbalanced datasets, where some categories are more represented than others. For example, if test samples feature an average 90% non-flooded area, a naïve model constantly predicting no flooding will reach 90% accuracy, despite having no predictive skill. Furthermore, since it may be better to overestimate a flooded area than to underestimate it, one could resort to metrics such as recall that account for false negatives and thus penalize models that cannot recognize a flooded area correctly. However, when used alone, recall can lead to issues similar to those described for accuracy, e.g., yielding a perfect score for a model always predicting the entire domain as flooded. Thus, for an exhaustive understanding of the model's performance, one should also consider metrics accounting for false positives, i.e., where the model misclassifies non-flooded areas as flooded. There are several possible metrics, such as the F1 score, the Kappa score, or the Matthews correlation coefficient, each with their drawbacks and benefits (e.g., Wardhani et al., 2019; Delgado and Tibau, 2019; Chicco and Jurman, 2020).
A reasonable choice is the F1 score, which is the harmonic mean of recall and precision, and it thus equally considers both false negatives and false positives. Another good example is the ROC (Receiver Operating Characteristic) curve, which describes how well a model can differentiate between positive and negative classes for different discrimination thresholds (Bradley, 1997). The Area under the ROC curve (AUC) is often used to synthesise the ROC as a single value. However, the AUC loses information on which parts of the dataset the model performs best. For this reason, one should always interpret these results carefully, especially when comparing different studies. Our purpose here is to show that, for the same case study, DL tends to outperform traditional models.”
Following this addition to the text, Tables 3 and 4 have been modified accordingly, prioritizing metrics such as F1 and AUC.
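The imbalanced-data pitfall discussed in the revised section can be made concrete with a toy computation. This sketch is illustrative only and not part of the manuscript; the 90% non-flooded split mirrors the hypothetical example used in the quoted text:

```python
import numpy as np

def accuracy(y, p):
    return np.mean(y == p)

def recall(y, p):
    return np.sum((y == 1) & (p == 1)) / np.sum(y == 1)

def precision(y, p):
    pred_pos = np.sum(p == 1)
    return np.sum((y == 1) & (p == 1)) / pred_pos if pred_pos else 0.0

def f1(y, p):
    r, pr = recall(y, p), precision(y, p)
    return 2 * r * pr / (r + pr) if (r + pr) else 0.0

# Imbalanced domain: 10% flooded cells (1), 90% non-flooded cells (0).
y_true = np.array([1] * 10 + [0] * 90)
naive_dry = np.zeros(100, dtype=int)  # always predicts "no flood"
naive_wet = np.ones(100, dtype=int)   # always predicts "flood"

print(accuracy(y_true, naive_dry))  # 0.9  -> misleadingly high
print(recall(y_true, naive_dry))    # 0.0  -> misses every flooded cell
print(recall(y_true, naive_wet))    # 1.0  -> perfect recall, useless model
print(f1(y_true, naive_wet))        # ~0.18 -> F1 exposes the false positives
```

The naïve dry model reaches 90% accuracy with zero recall, while the naïve wet model reaches perfect recall but a low F1, which is why the revised section recommends combining metrics that penalize both false negatives and false positives.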
13)Section 3.4: In general, for flood inundation, it is not completely clear to me whether the authors focus on models that can predict flood inundation (in binary form) given some rainfall or on "gap filling" in remote sensing data. This needs to be checked in all related sections.
For most works, flood inundation consists of determining flooded and non-flooded areas from remote-sensing data, i.e., given a picture, the model determines which areas are flooded. Only one paper does not consider remote sensing data (Dong et al., 2021). We clarified this in the review in line 427: “Only Dong et al. (2021) differ from the other papers by considering sensors in place of flood pictures.”
14)Line 472: Due to the 0 problem mentioned above, "slight" increases in accuracy may actually be linked to substantial changes of the quality of a model. The scores therefore need to be interpreted carefully and it is also not guaranteed that all papers computed scores in the same manner.
Thanks for the important observation. As discussed in comment 12), we specified that it is not guaranteed that all works compute scores in the same manner and that comparison requires careful attention.
15)Line 496: Pham et al. assessed flood conditioning factors, Löwe et al. performed a forward selection to identify relevant topographic variables, Zahura et al. tested feature importance in their random forest model
We thank the Reviewer for providing valuable suggestions. Since this section concerns flood hazard, we now included Löwe et al. (2021). However, we did not include Zahura et al., as they employ other machine learning models, while the review focuses only on DL.
16)Table 5: It is not clear to me why not all the papers performing hazard predictions were included in this table? In addition, the error scores may not be comparable across papers (0 problem or similar) which should be mentioned. Also speed up is a difficult quantity to compare, because it depends on the assumed number of numerical simulations that should be performed (e.g. if we assume that we have to assess flood hazard for 1000 rain events, then the speed up factor obtained by a neural network will be much higher than when only 10 events are considered). Most certainly, these assumptions are also not the same across papers and therefore not comparable.
We greatly appreciate this comment of the Reviewer and we address its different parts individually.
For flood hazard, as for the other applications, not every paper was included in the provided tables. Many papers do not provide information on computational times of both numerical and deep learning models and, thus, they do not report speed-up metrics. Nonetheless, to provide a general overview of every paper, we now included all papers in all the tables, even those not reporting speed-up metrics or comparison against other machine learning models.
As regards the error scores, we agree that they may not be comparable throughout the works as the scales and resolutions may differ. However, we believe that these errors, along with the case study area, provide a measure of the model’s reliability. We mentioned the issues when comparing scores across different studies in lines 543-545: “However, the comparison of speed-up across different papers is often unrealistic since it depends on the number of performed numerical simulations and on the type of numerical model. A similar consideration persists for the error scores, as they depend on the scale of the case study and on its resolution.”
Regarding the 0 problem, we agree that the scores may differ depending on the number of zeros in a domain, since most regression metrics are averaged. Thus, we added the following in lines 377-378: “However, since these metrics are averaged on a domain, their comparison across different works requires careful attention.”
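The dependence of speed-up on the assumed number of simulations can be illustrated with a toy calculation. All timings below are invented for illustration and correspond to no reviewed paper; the point is only that amortizing a one-off training cost over more events changes the reported factor dramatically:

```python
# Hypothetical timings (made-up numbers, for illustration only):
t_numerical = 3600.0    # seconds per numerical simulation
t_dl = 1.0              # seconds per DL surrogate prediction
t_train = 50 * 3600.0   # one-off cost: generating training data + training

def speed_up(n_events):
    # Ratio of total numerical cost to total surrogate cost (incl. training).
    return (n_events * t_numerical) / (t_train + n_events * t_dl)

print(round(speed_up(10), 2))    # ~0.2  -> surrogate slower overall
print(round(speed_up(1000), 1))  # ~19.9 -> same models, very different factor
```

Two papers with identical models but different assumed event counts would thus report incomparable speed-up factors, which is the point made in the quoted lines 543-545.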
17)Section 4.2: The discussion on generalization abilities needs to be differentiated a bit more. Both Guo et al. 2021 and Löwe et al. 2021 consider terrain characteristics as input to their models, and in Löwe et al. 2021 generating predictions outside the training dataset was explicitly the focus of the work. As mentioned by the authors, these approaches are in their infancy and have been tested on limited datasets, but these approaches do consider effects of e.g. the built environment in the form of 2D grids.
We thank the Reviewer for this observation. We expanded the discussion on generalization in section 4.2 by differentiating between generalization on boundary conditions (i.e., different rain events) and topographical changes (i.e., different case studies).
Lines 612-609: “Generalization refers to the capacity of a model to extrapolate from a training dataset into unseen testing data. This means that a DL model can correctly predict scenarios unused in its development. This property is particularly relevant because training requires data, model development, and time. In the context of flood modelling, there are two main generalization objectives: (i) boundary conditions, i.e., different rainfall events, and (ii) topographical changes, i.e., different case studies. However, the transference between different areas is challenging for DL models because of the difference in input and output data. In fact, except for flood inundation mapping, most reviewed papers focused on generalizing different boundary conditions (e.g., Guo et al., 2021; Berkhahn et al., 2019). Instead, only a few papers tested the model on areas not considered during training. Löwe et al. (2021) could generate flood hazard maps for unseen areas within the same study region as the training dataset, as there was little variability of inputs and outputs.”
18)Section 5.1: While investigating the possibility to consider mesh-based deep learning setups is an interesting direction, the authors present no argument why this should work better than convolutional approaches (which are also used for simulating fluid movements). Other than stated around line 610, they are simply a different data representation with advantages and disadvantages (mesh generation) and may or may not improve performance.
We thank the Reviewer for this comment. We provided arguments based on recent works suggesting that mesh-based models are better than convolutional neural networks for generalization, accuracy, and stability in fluid dynamics. This is expressed in lines 674-675: “There already exist promising works which simulate fluid dynamics with mesh-based GNNs, with increased generalization, accuracy, and stability, with respect to CNNs (e.g., Pfaff et al., 2020; Lino et al., 2021).”
We introduced the limitations of meshes in lines 657-658: “Unstructured meshes, nonetheless, inherit problems similar to those typical of numerical models, such as mesh generation and the need to explicitly define how each node is connected.”
19)Line 648: From here on the text no longer focuses on meshes (which is the Section heading) but on physical conditioning.
We thank the Reviewer for this observation. We added a section named physics-based deep learning that includes physics-informed neural networks and neural operators.
20)Line 656: I think a formulation that will be easier to understand for many readers is that the PINN can only be trained for a specific boundary condition (such as a specific rain event) and it is subsequently only able to simulate this specific event.
We greatly appreciate this comment. We modified lines 693-694 as follows: “However, PINNs can only be trained for a specific boundary condition (e.g., a specific rain event) and can subsequently only simulate that specific event (Kovachki et al., 2021).”
21)Line 656: FNOs need to be mentioned as one approach amongst many. DeepONets are a widely known alternative and new approaches are constantly developed. The same is true for DGP in the following section.
We thank the Reviewer for the comment. Indeed, there are many possibilities and alternatives as regards those approaches. We added DeepONets and clarified that there are several approaches which can be used in lines 698-699: “While many approaches have been proposed, such as DeepONets (Lu et al., 2019) or multipole graph neural operators (Li et al., 2020), Fourier neural operators (FNO) have currently achieved the best results (Li et al., 2021).”
Moreover, we included a section on Bayesian neural networks in the probabilistic deep learning section. Lines 720-725: “Along with those related to the model’s input, uncertainties are also present in the model’s prediction. To account for this kind of uncertainty we can use Bayesian neural networks (BNN). BNNs are models with stochastic components trained using Bayesian inference. They assign prior distributions to the model parameters to provide an estimate of the model’s confidence in the final prediction (Blundell et al., 2015). If the output remains unchanged for different parameter samples, the model has high confidence in the prediction, and vice versa if different parameters yield different results. Jacquier et al. (2021) used BNNs to determine the confidence intervals in flood hazard maps, providing a measure of the model’s reliability.”
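The parameter-sampling intuition in the quoted BNN passage can be sketched numerically. The following is a toy stand-in for a trained BNN (all weight means and standard deviations are invented, not taken from any reviewed paper): outputs whose spread under weight sampling is small correspond to confident predictions, and vice versa.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "Bayesian" layer: weights are distributions rather than point values.
w_mean = np.array([0.8, -0.3])
w_std = np.array([0.05, 0.40])   # second weight is much more uncertain

def sample_predictions(x, n_samples=2000):
    # Monte Carlo over parameters: draw one weight vector per sample,
    # then predict; the spread of the outputs measures model confidence.
    w = rng.normal(w_mean, w_std, size=(n_samples, 2))
    return w @ x

x_certain = np.array([1.0, 0.0])    # output barely uses the uncertain weight
x_uncertain = np.array([0.0, 1.0])  # output driven by the uncertain weight

print(sample_predictions(x_certain).std())    # small spread -> confident
print(sample_predictions(x_uncertain).std())  # large spread -> low confidence
```

In a flood hazard setting, the same sampling scheme applied per grid cell would yield the confidence intervals on water depth mentioned for Jacquier et al. (2021).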
22)Section 5.3: I don't see how GANs fix data scarcity issues (line 680). They are indeed an interesting approach for e.g. gap filling or the generation of rainfall scenarios, but they do need to be trained and do not relieve us of the problem that e.g. flood observations are hardly available. The discussion in the first parts of this section goes in a very different direction than the transfer learning approaches (which focus on training models with few data), which creates confusion.
We thank the Reviewer for the comment. We understand the doubts raised about GANs and VAEs. We have now mentioned that they do not solve the issue of a complete lack of data but can be useful in many situations where little data is available. For example, we could use the data of the GAN as augmentation of the few real data we have; these new GAN data can then provide more training samples for larger DL models, which would otherwise be more challenging to train with fewer data. GANs can also be used to generate floods in areas never experienced before, without necessarily being fed to a DL model afterward.
Following those considerations, we clarified the issue and modified section 5.3 as follows.
Moreover, we added a section on new data sources, which was previously introduced in the knowledge gaps (section 4.4). We also decided to remove the paragraph on transfer learning to avoid confusion.
“Even though remote sensing and measuring stations provide considerable amounts of data, several parts of the world still lack enough data to deploy deep learning models. New satellite missions and added sensor networks throughout the world increasingly provide new data sources (e.g., van de Giesen et al., 2014). The flexibility of DL partially overcomes data scarcity by facilitating the use of a wider variety of data sources. For instance, several papers already employ cameras to detect floods and measure the associated water depth (e.g., Vandaele et al., 2021; Jafari et al., 2021; Moy De Vitry et al., 2019). Structural monitoring with cameras can provide reliable data where it was previously hard to obtain, such as in urban environments. Social media information, such as tweets or posted pictures, can also be used to identify flood events and flooded areas (e.g., Rossi et al., 2018; Pereira et al., 2020). In this case, the quality of the retrieved information must be further validated before its use for real applications. Moreover, the heterogeneity of the sources of these data needs to be carefully taken into account when deploying a suitable DL model.
Another approach can be to generate artificial data to supplement scarce data. This can be done using generative adversarial networks (GAN), which create new data from a given dataset (Goodfellow et al., 2014). GANs are composed of two neural networks, named generator and discriminator, whose purpose is, respectively, to generate new data and to detect if a given data is real or fake. A trained GAN can produce new fake but plausible data, facilitating data augmentation, i.e., providing more training samples. Interesting applications of GANs could overcome some limitations of satellite data (Lütjens et al., 2020, 2021), predict flood maps (Hofmann and Schüttrumpf, 2021) or meteorological forecasts (Ravuri et al., 2021), and create realistic scenarios of flood disasters for projected climate change variations (Schmidt et al., 2019). GANs could also be used to generate a plausible urban drainage system or topography for cities that do not have any sewer construction plans or in areas where only low-resolution data is available (e.g., Fang et al., 2020b).
However, GANs are difficult to train (Goodfellow, 2016). Variational autoencoders (VAE) are another type of generative model, which can overcome this issue. Differently from standard autoencoders, VAEs model the latent space with probability distributions that aim to ensure good generative properties of the model (Kingma and Welling, 2013). Once the model is trained, new synthetic data can be generated by taking new samples from the latent distributions. Nonetheless, because of the model’s definition, the predictions are less precise than those of GANs. As such, VAEs and GANs offer a trade-off between the realism of the predictions and the ease of training.”
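To illustrate the generator/discriminator roles and the adversarial objectives described in the revised section, a minimal sketch with untrained toy networks follows. All weights and "observed" data are synthetic placeholders, not a working flood GAN; the point is only the structure of the two losses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny single-layer networks with random (untrained) weights:
Wg = rng.normal(size=(4, 2))   # generator: noise (4,) -> sample (2,)
Wd = rng.normal(size=(2,))     # discriminator: sample (2,) -> realness score

def generator(z):
    return np.tanh(z @ Wg)

def discriminator(x):
    return sigmoid(x @ Wd)

z = rng.normal(size=(8, 4))           # batch of noise vectors
fake = generator(z)                   # generator proposes synthetic samples
real = rng.normal(0.5, 0.1, (8, 2))   # stand-in for observed flood data

# Adversarial objectives in binary cross-entropy form:
# the discriminator learns to score real high and fake low, while the
# generator learns to make fakes that the discriminator scores as real.
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1 - discriminator(fake)))
g_loss = -np.mean(np.log(discriminator(fake)))
print(d_loss > 0 and g_loss > 0)      # True: both losses finite and positive
```

Training would alternate gradient steps on `d_loss` and `g_loss`; the quoted section's caveat stands, since the discriminator still needs real observations to learn from.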
Conclusions
23)First bullet - this conclusion could be more clear about the methodological preferences being the current status which is developing rapidly.
We appreciate the comment and have clarified for each application which are the methodological preferences, based on the methods proposed so far (and thus excluding future direction models).
Lines 743-751: “Flood inundation, susceptibility, and hazard mapping were investigated using deep learning models. Flood inundation considers flood images, mostly taken via satellite, as its main data. The main and most accurate deep learning models were CNNs. In flood susceptibility, deep learning models consider several inputs, the most important being slope, land use, aspect, terrain curvature, and distance from the rivers. The main deep learning models used were MLPs, often in combination with other statistical techniques, but CNNs generally provided more accurate results. So far, flood hazard maps estimate the water depth in a study area by using deep learning as a surrogate model for numerical simulations. For this application, there are no deep learning model preferences. However, RNNs are preferable for spatio-temporal simulations. Regardless of the application, results show that deep learning solutions outperform traditional approaches as well as other ML techniques.”
24) Line 724 - I would say DL for hazard mapping so far relies on numerical simulations, this may change.
Thank you for the suggestion. As mentioned, indeed, flood hazard models may also consider real flood events for training. We added “so far” in line 747, as shown in question 22, to indicate this.
Further comments are presented in section 3.5.2, e.g., in lines 514-515 “Even though observed data were not employed, they could be used in future research to corroborate the transferability of such methods.”
25) Line 731-736 - Some of the existing architectures do enable generalization but this certainly requires more research and testing. Meshes are one way forward amongst others.
Thank you for the observations. We changed “cannot” to “struggle to”.
26) Line 737-741 - Physics-informed learning is not only relevant in a warning context but for virtually all kinds of flood simulations. FNOs and DGPs are potentially interesting approaches, but there are others. You are overstating the ability of geometric DL which (to my knowledge) has not been tested in the flood context.
We thank the Reviewer for the comment. As discussed before, we modified the Future Research Directions section to include more possible models. Indeed, as regards physics-based learning, many models can benefit from it. We rewrote this section clarifying those issues as follows: “Physics-based deep learning provides a reliable framework for flood modelling since it considers the underlying physical equations. Probabilistic hazard mapping can take advantage of deep Gaussian processes or Bayesian neural networks to determine the uncertainties associated with the model and its inputs.”
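The sampling-based uncertainty estimation mentioned in the revised text can be illustrated with a minimal numpy sketch (a hypothetical toy surrogate, not a model from the manuscript, standing in for a Bayesian NN or MC-dropout forward pass):

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_surrogate(x, rng):
    """One stochastic forward pass of a probabilistic surrogate:
    the weights are drawn from a (toy) posterior distribution, so
    repeated calls give different predictions for the same input."""
    w = 2.0 + 0.1 * rng.standard_normal()   # weight sampled from its posterior
    b = 0.5 + 0.05 * rng.standard_normal()  # bias sampled from its posterior
    return w * x + b

x = 1.5  # e.g., a hypothetical rainfall-intensity feature

# Repeated stochastic passes yield a predictive distribution, from which
# a mean prediction and an uncertainty band can be extracted.
samples = np.array([stochastic_surrogate(x, rng) for _ in range(500)])
print(samples.mean(), samples.std())
```

This is the core idea behind using deep Gaussian processes or Bayesian neural networks for probabilistic hazard mapping: the spread of the samples quantifies model uncertainty rather than producing a single deterministic depth.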
While geometric DL has not been used yet in the flood context, based on recent findings (e.g., Pfaff et al. 2020, Lino et al. 2021, Wang et al. 2020), we believe that it may be a valuable tool for flood modelling.
27) Line 742-745 - As mentioned before, there is some logic here that does not make sense, because the GANs need to be trained against observed data. Once we have a GAN, what would be the point of training another deep learning model that only learns to emulate the output of the GAN?
We greatly appreciate this comment. Following the comments previously addressed (comment 22), we modified the paragraph in the conclusions as follows (lines 784-787): “DL necessitates large quantities of data which are difficult to collect in several areas of the world. New data sources such as camera pictures and videos, or social media information can potentially be used thanks to deep learning models. Moreover, generative models, such as GANs and VAEs, can be employed to produce synthetic data for such data-scarce regions, based on training data collected elsewhere.”
References:
Löwe, R., Böhm, J., Jensen, D. G., Leandro, J., & Rasmussen, S. H. (2021). U-FLOOD – topographic deep learning for predicting urban pluvial flood water depth. Journal of Hydrology, 603, 126898. https://doi.org/10.1016/j.jhydrol.2021.126898
Pham, B. T., Luu, C., Phong, T. Van, Trinh, P. T., Shirzadi, A., Renoud, S., Asadi, S., Le, H. Van, von Meding, J., & Clague, J. J. (2020). Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling? Journal of Hydrology, 592(July 2020), 125615. https://doi.org/10.1016/j.jhydrol.2020.125615
Zahura, F. T., Goodall, J. L., Sadler, J. M., Shen, Y., Morsy, M. M., & Behl, M. (2020). Training machine learning surrogate models from a high-fidelity physics-based model: Application for real-time street-scale flood prediction in an urban coastal community. Water Resources Research, 56(10), e2019WR027038. https://doi.org/10.1029/2019WR027038
Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural networks, 2(5), pp.359-366.
Dong, S., Yu, T., Farahmand, H. and Mostafavi, A., 2021. A hybrid deep learning model for predictive flood warning and situation awareness using channel network sensors data. Computer‐Aided Civil and Infrastructure Engineering, 36(4), pp.402-420.
Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A. and Battaglia, P.W., 2020. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409.
Lino, M., Cantwell, C., Bharath, A.A. and Fotiadis, S., 2021. Simulating Continuum Mechanics with Multi-Scale Graph Neural Networks. arXiv preprint arXiv:2106.04900.
Wang, R., Walters, R. and Yu, R., 2020. Incorporating symmetry into deep dynamics models for improved generalization. arXiv preprint arXiv:2002.03061.
Citation: https://doi.org/10.5194/hess-2021-614-AC3
AC3: 'Reply on RC3', Roberto Bentivoglio, 26 Jan 2022
Status: closed
RC1: 'Comment on hess-2021-614', Anonymous Referee #1, 28 Dec 2021
Review report
Thank you for submitting your manuscript on Deep Learning Methods for Flood Mapping: A Review of Existing Applications and Future Research Directions. A lot of the text should be improved, such as the introduction, validation, and discussion. Therefore, I give some suggestions and questions which I hope are useful to the authors, and I recommend a major revision.
1) It was difficult to see the justification for the need for this research. The literature review is poor. The paper needs to clearly state what the problems with the existing works (these types of approaches) are and what problem(s) this particular paper addresses. Without this clear problem statement, readers would have difficulty seeing the merit of this paper. The authors only list some references; I did not find the problems with the existing methods. The problems of the existing methods are not clear. The authors should show a deeper analysis of the gaps in existing methods.
2) A review of flood mapping or of deep learning? This is confusing to me.
3) The flowchart of the method should be inserted.
4) There were very few discussions of previous studies.
Other comment:
Figure: The resolution of the figures should be improved.
References:
The references should be updated to include the latest articles.
Citation: https://doi.org/10.5194/hess-2021-614-RC1
AC1: 'Reply on RC1', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #1
We thank the Reviewer for reading our paper and providing comments for its improvement. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
1) It was difficult to see the justification for the need for this research. The literature review is poor. The paper needs to clearly state what the problems with the existing works (these types of approaches) are and what problem(s) this particular paper addresses. Without this clear problem statement, readers would have difficulty seeing the merit of this paper. The authors only list some references; I did not find the problems with the existing methods. The problems of the existing methods are not clear. The authors should show a deeper analysis of the gaps in existing methods.
We have restated the justification of this research in the introduction (lines 61-65): “The existing reviews mainly focused on the temporal variability of floods, especially concerning rainfall-runoff modeling, covering only a few instances of flood mapping applications. But the spatial evolution of flood events is extremely important to determine affected areas, plan mitigation measures and inform response strategies. Yet, there are no comprehensive overviews and analyses of DL in flood mapping to facilitate flood researchers and practitioners. The aim of this review is thus to advance the emerging field of DL-based flood mapping by surveying the state-of-the-art, identifying outstanding research gaps, and proposing fruitful research directions.”
We report some problems with the existing methods throughout Section 3, and we dedicated Section 4 “Knowledge gaps” to summarise those we believe are the major ones, common to all reviewed papers. In particular, we highlight:
- The lack of “general” DL models that can work across multiple case scenarios. Our review shows that models are usually deployed for single case studies, which greatly limits their applicability.
- Existing DL models are not suitable for modelling complex interactions with the natural and built environments; this hinders their operational use for all types of applications.
- The focus so far has been only on developing deterministic models, while flood management requires accounting for uncertainties in outcomes and probabilistic predictions.
- Further efforts should be directed in developing DL models for flood risk or real-time flood warning applications or tackling problems related to data availability.
We argue that the community should address these problems by transferring recent fundamental advancements in DL to flood mapping. These advancements mainly include mesh-based neural networks, such as Graph Neural Networks and Fourier Neural Operators, as well as Probabilistic Deep Learning. The future research directions in Section 5 of the paper substantiate these suggestions and provide insights on how to apply these methods to improve flood mapping.
While we believe that the paper’s justification and contributions are sufficiently clear, we tried our best to improve them during the revision process. As suggested by the Reviewer, we modified the manuscript to further clarify its merits and purpose.
2) A review of flood mapping or of deep learning? This is confusing to me.
As stated in the title of the paper, this review concerns deep learning methods for flood mapping. We evaluate the efforts of the community concerning the design and implementation of deep learning methods for flood mapping. Therefore, the manuscript explores the intersection between these two areas, as stated in lines 66-69 of the original manuscript: “45 papers are analysed considering two main parallel yet intertwined directions. On the one hand, we focused on the flood management application, spatial scale of study, and type of flood. On the other hand, we examined the deep learning model, type of training data, and performance with respect to alternative methods. This strategy provides insights from a flood management perspective and concurrently facilitates reflection on how to successfully apply DL models.”
3) The flowchart of the method should be inserted.
We thank the Reviewer for the suggestion. We have now added a flow chart (Fig. 3) to explain our methodology better.
4) There were very few discussions of previous studies.
Thank you for raising this point. While some specific discussions on individual studies have been included throughout Section 3, we opted for a concise, yet meaningful, account of the reviewed papers, also to respect the length limitations. In line with successful review papers in the field (e.g., Maier, 2013), we preferred to report observations that are valid across multiple studies, especially when outlining the knowledge gaps and proposing future research directions. That said, as suggested by the reviewer, we included further insights from individual studies throughout Section 3 of our paper to enrich the overall narrative.
Here follow some examples we included in the revised manuscript:
Lines 434-437 “Most CNN models show noticeable improvements with respect to traditional threshold methods, such as the Normalized Difference Water Index (NDWI) and automatic threshold model (ATM) (e.g., Wieland and Martinis, 2019; Isikdogan et al., 2017; Nemni et al., 2020), and with respect to machine learning models such as random forest (RF) and support vector machine (SVM).”
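For readers unfamiliar with the threshold baselines mentioned in this quote, NDWI-based water masking can be sketched in a few lines (the reflectance values and the zero threshold below are purely illustrative):

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.0):
    """Normalized Difference Water Index: (G - NIR) / (G + NIR).

    Water strongly reflects green and absorbs near-infrared light,
    so pixels with NDWI above the threshold are classified as water."""
    green = green.astype(float)
    nir = nir.astype(float)
    ndwi = (green - nir) / (green + nir + 1e-9)  # guard against division by zero
    return ndwi > threshold

# Toy 2x2 reflectance grids: the left column behaves like water.
green = np.array([[0.30, 0.05],
                  [0.28, 0.06]])
nir   = np.array([[0.05, 0.40],
                  [0.04, 0.35]])
print(ndwi_water_mask(green, nir))
# [[ True False]
#  [ True False]]
```

Such fixed-threshold rules are the baselines that the reviewed CNN models improve upon, since a learned model can exploit spatial context instead of classifying each pixel independently.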
Lines 461-464 “However, Wang et al. (2020b) and Liu et al. (2021) show that 1D-CNNs, which perform convolution on the input features for each domain’s cell, are not suited for this problem, as they do not properly leverage any inductive bias. Some works showed that deep belief networks (DBN), an unsupervised variation of MLPs, could outperform standard MLPs in flood susceptibility mapping (e.g., Shirzadi et al., 2020; Pham et al., 2021).”
Lines 516-517 “Hu et al. (2019) and Jacquier et al. (2021) use a LSTM and a MLP, respectively, in combination with a reduced order modelling framework.”
Other comment:
Figure: The resolution of figure should be improved.
Thanks for the suggestion. We improved the font sizes and resolutions of all figures.
References:
The references should be updated to include the latest articles.
We thank the Reviewer for pointing out this very important detail. By refining the search procedure described in Section 3.1, we retrieved 13 more very recent papers which are now included in the review. The complete list of added papers is shown in the references below.
References:
Maier, Holger R. "What constitutes a good literature review and why does its quality matter?." Environ. Model. Softw. 43 (2013): 3-4.
Added references:
Ahmed, N., Hoque, M. A.-A., Arabameri, A., Pal, S. C., Chakrabortty, R., and Jui, J.: Flood susceptibility mapping in Brahmaputra floodplain of Bangladesh using deep boost, deep learning neural network, and artificial neural network, Geocarto International, pp. 1–22, 2021.
Chakrabortty, R., Chandra Pal, S., Rezaie, F., Arabameri, A., Lee, S., Roy, P., Saha, A., Chowdhuri, I., and Moayedi, H.: Flash-flood hazard susceptibility mapping in Kangsabati River Basin, India, Geocarto International, pp. 1–23, 2021a.
Chakrabortty, R., Pal, S. C., Janizadeh, S., Santosh, M., Roy, P., Chowdhuri, I., and Saha, A.: Impact of Climate Change on Future Flood Susceptibility: an Evaluation Based on Deep Learning Algorithms and GCM Model, Water Resources Management, 35, 4251–4274, 2021b.
Hosseiny, H.: A deep learning model for predicting river flood depth and extent, Environmental Modelling & Software, 145, 105 186, 2021
Isikdogan, F., Bovik, A. C., and Passalacqua, P.: Surface water mapping by deep learning, IEEE journal of selected topics in applied earth observations and remote sensing, 10, 4909–4918, 2017.
Jacquier, P., Abdedou, A., Delmas, V., and Soulaïmani, A.: Non-intrusive reduced-order modeling using uncertainty-aware Deep Neural Networks and Proper Orthogonal Decomposition: Application to flood modeling, Journal of Computational Physics, 424, 109 854, 2021
Lei, X., Chen, W., Panahi, M., Falah, F., Rahmati, O., Uuemaa, E., Kalantari, Z., Ferreira, C. S. S., Rezaie, F., Tiefenbacher, J. P., et al.: Urban flood modeling using deep-learning approaches in Seoul, South Korea, Journal of Hydrology, 601, 126 684, 2021
Liu, J., Wang, J., Xiong, J., Cheng, W., Sun, H., Yong, Z., and Wang, N.: Hybrid Models Incorporating Bivariate Statistics and Machine Learning Methods for Flash Flood Susceptibility Assessment Based on Remote Sensing Datasets, Remote Sensing, 13, 4945, 2021.
Saeed, M., Li, H., Ullah, S., Rahman, A.-u., Ali, A., Khan, R., Hassan, W., Munir, I., and Alam, S.: Flood Hazard Zonation Using an Artificial Neural Network Model: A Case Study of Kabul River Basin, Pakistan, Sustainability, 13, 13 953, 2021.
Syifa, M., Park, S. J., Achmad, A. R., Lee, C.-W., and Eom, J.: Flood mapping using remote sensing imagery and artificial intelligence techniques: a case study in Brumadinho, Brazil, Journal of Coastal Research, 90, 197–204, 2019.
Wieland, M. and Martinis, S.: A modular processing chain for automated flood monitoring from multi-spectral satellite data, Remote Sensing, 11, 2330, 2019.
Yokoya, N., Yamanoi, K., He, W., Baier, G., Adriano, B., Miura, H., and Oishi, S.: Breaking limits of remote sensing by deep learning from simulated data for flood and debris-flow mapping, IEEE Transactions on Geoscience and Remote Sensing, 2020.
Xie, S., Wu, W., Mooser, S., Wang, Q., Nathan, R., and Huang, Y.: Artificial neural network based hybrid modeling approach for flood inundation modeling, Journal of Hydrology, 592, 125 605, 2021.
Citation: https://doi.org/10.5194/hess-2021-614-AC1
RC2: 'Comment on hess-2021-614', Anonymous Referee #2, 28 Dec 2021
This manuscript focuses on the application of deep learning methods to flood mapping, including flood extent mapping, flood susceptibility mapping, and flood hazard mapping. There are some concerns with the manuscript. Below are my comments. I hope the authors find them useful.
Major concern:
1. I suggest the authors provide the time extent of these reviewed publications. Because studies related to deep learning in flood mapping are constantly updated.
2. It is better to introduce the three flood maps as follows: first show flood extent or inundation maps, then illustrate flood susceptibility maps, and finally present flood hazard maps. Flood extent or inundation mapping can be viewed as preliminary work in mapping research. Then, the results of the flood extent maps can be used as training data to predict flood susceptibility. Finally, the flood susceptibility indicates the potential location of future floods. A flood hazard map can be viewed as an extension of a flood susceptibility map that not only considers the location of the flood but also integrates the depth and water extent.
3. A deep belief network is also an important part of deep learning, which has been used in the flood mapping field. Some studies are shown below:
[1] Shahabi, Himan, et al. "Flash flood susceptibility mapping using a novel deep learning model based on deep belief network, back propagation and genetic algorithm." Geoscience Frontiers 12.3 (2021): 101100.
[2] Shirzadi, Ataollah, et al. "A novel ensemble learning based on Bayesian Belief Network coupled with an extreme learning machine for flash flood susceptibility mapping." Engineering Applications of Artificial Intelligence 96 (2020): 103971.
[3] Pham, Binh Thai, et al. "Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling?." Journal of Hydrology 592 (2021): 125615
4. The application perspectives on different mapping scenarios are different. Therefore, the authors should provide specific limitations and future research directions on different mapping frameworks. For example, the deep learning methods for mapping susceptibility focus on predicting the location of potential flood areas by considering the historical location and environmental variables. Therefore, it is important to design an appropriate network to integrate heterogeneous environmental information. For flood extent mapping, it aims to find the continuous inundated areas based on satellite images or UAV images. Some deep learning methods such as semantic segmentation are more appropriate in flood extent mapping.
Minor concern:
1. Figure 1: the legend is overlapped in the main figure.
2. It is better to entitle Section 2.2 as “Deep learning method”. Section 2.2.1, 2.2.2, 2.2.3 should be entitled “Multi-layer perceptron”, “Convolutional neural network”, and “Recurrent neural network”, respectively.
3. Section 2.2, line 155: a related reference is lacking in the first sentence.
4. Figure 1: please provide the location information in the caption.
5. Figure 5(a) should be improved.
6. Section 5.3 belongs to the future direction, but data scarcity is a kind of limitation. Data enhancement may be a suitable title for this section.
Citation: https://doi.org/10.5194/hess-2021-614-RC2
AC2: 'Reply on RC2', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #2
We thank the Reviewer for reading our paper and providing detailed and insightful comments. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
Major concern:
1) I suggest the authors provide the time extent of these reviewed publications. Because studies related to deep learning in flood mapping are constantly updated.
The time extent considered for the reviewed publications has now been explicitly mentioned in section 3.1: lines 255-256 “The 3,338 publications obtained were then filtered to include only journal papers from January 2010 until December 2021, in the areas of engineering, environmental science, and earth and planetary sciences”
2) It is better to introduce the three flood maps as follows: first show flood extent or inundation maps, then illustrate flood susceptibility maps, and finally present flood hazard maps. Flood extent or inundation mapping can be viewed as preliminary work in mapping research. Then, the results of the flood extent maps can be used as training data to predict flood susceptibility. Finally, the flood susceptibility indicates the potential location of future floods. A flood hazard map can be viewed as an extension of a flood susceptibility map that not only considers the location of the flood but also integrates the depth and water extent.
Thank you for this insightful recommendation. We agree with it and we changed the order of the presented flood maps in Section 2.1.2 and Section 3. Moreover, we modified lines 107-112 as follows: “Flood inundation maps determine the extent of a flood, during or after it has occurred (see Fig. 1a). Flood inundation maps represent flooded and non-flooded areas. This application is used for post-flood evacuation and protection planning, and for damage assessment. These maps can then be used also as observed and calibration data for other applications. Flood images are obtained through remote-sensing techniques and processed by histogram-based models (e.g., Martinis et al., 2009; Manjusree et al., 2012), threshold models (e.g., Cian et al., 2018), and machine learning models (e.g., Hess et al., 1995; Ireland et al., 2015).”
3) A deep belief network is also an important part of deep learning, which has been used in the flood mapping field. Some studies are shown below:
[1] Shahabi, Himan, et al. "Flash flood susceptibility mapping using a novel deep learning model based on deep belief network, back propagation and genetic algorithm." Geoscience Frontiers 12.3 (2021): 101100.
[2] Shirzadi, Ataollah, et al. "A novel ensemble learning based on Bayesian Belief Network coupled with an extreme learning machine for flash flood susceptibility mapping." Engineering Applications of Artificial Intelligence 96 (2020): 103971.
[3] Pham, Binh Thai, et al. "Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling?." Journal of Hydrology 592 (2021): 125615
We thank the Reviewer for pointing out the applications of Deep Belief Networks (DBN) for flood mapping. Since DBNs are unsupervised methods, we did not include them among the reviewed papers as our manuscript focuses on supervised methods (lines 164-165 in the original manuscript). Nonetheless, we included some of the suggested papers in lines 473-475: “Some works showed that deep belief networks (DBN), an unsupervised variation of MLPs, could outperform standard MLPs in flood susceptibility mapping (e.g., Shirzadi et al., 2020; Pham et al., 2021).”
4) The application perspectives on different mapping scenarios are different. Therefore, the authors should provide specific limitations and future research directions on different mapping frameworks. For example, the deep learning methods for mapping susceptibility focus on predicting the location of potential flood areas by considering the historical location and environmental variables. Therefore, it is important to design an appropriate network to integrate heterogeneous environmental information. For flood extent mapping, it aims to find the continuous inundated areas based on satellite images or UAV images. Some deep learning methods such as semantic segmentation are more appropriate in flood extent mapping.
We thank the reviewer for the useful feedback. We carefully considered the competent suggestions made by the Reviewer concerning the presentation of specific limitations and research directions for the different types of flood mapping applications. We included limitations of each presented flood map in lines 558-563: “Nonetheless, each of the presented maps has its own limitations. Susceptibility maps provide only qualitative results and rely on recorded flood events. Therefore, limited recorded data may lead to incorrect predictions. Moreover, it is important to design an appropriate model to integrate heterogeneous environmental information. Inundation maps mostly consider real events, thus they suffer from the acquisition method’s problems. For example, satellites struggle to extract information below clouded areas (e.g., Meraner et al., 2020). Hazard maps, instead, are limited by the accuracy of the underlying numerical simulator.”
As regards flood extent or inundation mapping, most of the presented papers indeed consider semantic segmentation, referred to in the paper as “image segmentation”. We added a description of what image segmentation refers to in lines 220-222: “Instead, if the task is to perform image segmentation, i.e., classify specific parts of an image, the final layers are composed of de-convolutional layers, which perform an operation opposite to that of convolutional layers, in an encoder-decoder structure”.
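The encoder-decoder structure described in the quoted text can be illustrated with a shape-only numpy sketch, where max pooling and nearest-neighbour upsampling stand in for the learned convolutional and de-convolutional layers (purely illustrative, not the architecture of any reviewed paper):

```python
import numpy as np

def max_pool2x2(x):
    """Encoder step: halve the spatial resolution (stand-in for
    a convolution + pooling block)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x2(x):
    """Decoder step: double the spatial resolution (stand-in for
    a de-convolutional layer)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# A toy 4x4 'image': after encoding and decoding, the output grid has
# the same spatial size as the input, so every pixel can receive its
# own class label (e.g., flooded / non-flooded).
image = np.arange(16, dtype=float).reshape(4, 4)
encoded = max_pool2x2(image)     # (2, 2) coarse features
decoded = upsample2x2(encoded)   # (4, 4) per-pixel output
print(encoded.shape, decoded.shape)  # (2, 2) (4, 4)
```

The key point for segmentation is exactly this shape symmetry: the decoder restores the input resolution so the network outputs a full per-pixel map rather than a single image-level label.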
Minor concern:
1) Figure 1: the legend is overlapped in the main figure.
We removed the legend overlap in Fig. 1.
2) It is better to entitle Section 2.2 as “Deep learning method”. Section 2.2.1, 2.2.2, 2.2.3 should be entitled “Multi-layer perceptron”, “Convolutional neural network”, and “Recurrent neural network”, respectively.
We renamed the sections as suggested by the Reviewer.
3) Section 2.2, part 155: lack of related reference in the first sentence.
We added a reference to line 155 of the original manuscript (LeCun et al., 2015).
4) Figure 1: please provide the location information in the caption.
We omitted the location since the purpose of Figure 1 is merely illustrative; it should not be taken as a necessarily correct estimate of flood inundation, susceptibility, and hazard for the selected area.
5) 5 (a) should be improved.
Figure 5(a) was improved so that the legend does not overlap with the bar plot anymore.
6) Section 5.3 belongs to the future direction, but data scarcity is a kind of limitation. Data enhancement may be a suitable title for this section.
Thanks for the suggestion. Based on the comment and on the common nomenclature used in machine learning, we changed section 5.3 to “Data augmentation”.
References:
LeCun, Y., Bengio, Y., and Hinton, G.: Deep learning, nature, 521, 436–444, 2015
Citation: https://doi.org/10.5194/hess-2021-614-AC2
RC3: 'Comment on hess-2021-614', Anonymous Referee #3, 02 Jan 2022
This paper performs a review of deep learning approaches applied for flood mapping. In a field that is evolving rapidly, I think this work can make a valuable contribution in ensuring a common understanding of techniques in the community and outlining future research directions.
My concerns are that
-the review is currently not always very precise in distinguishing the contexts in which different approaches are relevant
-it lacks an assessment of which techniques were used over time (some now popular techniques were not available 2 years ago)
-it could occasionally be better at explaining concepts with a focus on the hydrological target group
-generalization of deep learning predictions to locations / events outside the training data is a key aspect that deserves a more prominent place in the paper. Currently this topic is raised in several subsections. It might be useful to provide an overview on what are actually the needs, which can then be used to discuss whether different approaches are conceptually able to meet this (and if this was/was not implemented in current research)
-comparisons of scores across papers need to be interpreted more carefully than what is currently the case. Scores are not necessarily computed in the same manner. In particular, non-flooded areas are not handled consistently in the literature, which has a major impact on the results. I think all of these issues can be addressed in a revision. I have provided detailed comments below.
Detailed comments
line 51: the automatic discovery of representations is "to some extent" possible. We are still dealing with an input-output model. It is quite a common misunderstanding that deep learning can find "any representation", while many relations in hydrology are highly nonlinear and require careful consideration of the data.
line 126-137: I don't think the detailed overview of modelling approaches is needed in this review.
line 190 to 205: this text is somehow misplaced in this section. It is more an assessment of the properties of different techniques and it would probably make more sense to place it after the different layer types were introduced.
Figure 2: I think it would help many readers if the figure illustrated that the convolutional kernels map many pixels to one. Also in the text (line 210), a simple explanation of the kernels (a spatially weighted average where the weights are learned during optimization) may be helpful.
Table 2: I believe the correct citation for the work of Guo et al. is 2021, not 2020
Section 3.2.4: The review is generally missing a section that discusses under which conditions a deep learning network can generalize, i.e. predict flooding in different locations
Section 3.2.5: A key issue when assessing flood predictions (inundation and hazard) is the large number of zeros (often >95% of the dataset) which implies that, for example, accuracy scores almost per definition are in the order of 80% and above. This issue needs to be explained here. In addition, binary scores such as CSI are very vulnerable to double penalty issues.
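The reviewer's point can be illustrated with a small numeric sketch using hypothetical numbers: on a domain where only 5% of cells are flooded, a trivial model that predicts "non-flooded" everywhere already scores high accuracy, while a flood-focused score such as CSI exposes it as useless.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of cells classified correctly (inflated by the many zeros)."""
    return (y_true == y_pred).mean()

def csi(y_true, y_pred):
    """Critical Success Index = hits / (hits + misses + false alarms).

    Correctly predicted non-flooded cells do not enter the score,
    so it is not inflated by the dominant dry class."""
    hits = np.sum((y_true == 1) & (y_pred == 1))
    misses = np.sum((y_true == 1) & (y_pred == 0))
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))
    return hits / (hits + misses + false_alarms)

# 1000 cells, only 5% flooded: typical class imbalance in inundation maps.
y_true = np.zeros(1000, dtype=int)
y_true[:50] = 1

# A trivial model predicting 'non-flooded' everywhere:
y_pred = np.zeros(1000, dtype=int)
print(accuracy(y_true, y_pred))  # 0.95
print(csi(y_true, y_pred))       # 0.0
```

This is why accuracy values above 80% in the reviewed papers are hard to compare across studies: they depend heavily on how many non-flooded cells each study includes in the evaluation.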
Section 3.4: In general, for flood inundation, it is not completely clear to me whether the authors focus on models that can predict flood inundation (in binary form) given some rainfall or on "gap filling" in remote sensing data. This needs to be checked in all related sections.
Line 472: Due to the 0 problem mentioned above, "slight" increases in accuracy may actually be linked to substantial changes of the quality of a model. The scores therefore need to be interpreted carefully and it is also not guaranteed that all papers computed scores in the same manner.
Line 496: Pham et al. assessed flood conditioning factors, Löwe et al. performed a forward selection to identify relevant topographic variables, Zahura et al. tested feature importance in their random forest model
Table 5: It is not clear to me why not all the papers performing hazard predictions were included in this table? In addition, the error scores may not be comparable across papers (0 problem or similar) which should be mentioned. Also speed up is a difficult quantity to compare, because it depends on the assumed number of numerical simulations that should be performed (e.g. if we assume that we have to assess flood hazard for 1000 rain events, then the speed up factor obtained by a neural network will be much higher than when only 10 events are considered). Most certainly, these assumptions are also not the same across papers and therefore not comparable.
Section 4.2: The discussion on generalization abilities needs to be differentiated a bit more. Both Guo et al. 2021 and Löwe et al. 2021 consider terrain characteristics as input to their models, and in Löwe et al. 2021 generating predictions outside the training dataset was explicitly the focus of the work. As mentioned by the authors, these approaches are in their infancy and have been tested on limited datasets, but these approaches do consider effects of e.g. the built environment in the form of 2D grids.
Section 5.1: While investigating the possibility to consider mesh-based deep learning setups is an interesting direction, the authors present no argument why this should work better than convolutional approaches (which are also used for simulating fluid movements). Contrary to what is stated around line 610, they are simply a different data representation with advantages and disadvantages (mesh generation) and may or may not improve performance.
Line 648: From here on the text no longer focuses on meshes (which is the Section heading) but on physical conditioning.
Line 656: I think a formulation that will be easier to understand for many readers is that the PINN can only be trained for a specific boundary condition (such as a specific rain event) and it is subsequently only able to simulate this specific event.
Line 656: FNOs need to be mentioned as one approach amongst many. DeepONets are a widely known alternative and new approaches are constantly developed. The same is true for DGP in the following section.
Section 5.3: I don't see how GANs fix data scarcity issues (line 680). They are indeed an interesting approach for e.g. gap filling or the generation of rainfall scenarios, but they do need to be trained and do not relieve us of the problem that e.g. flood observations are hardly available. The discussion in the first parts of this section goes in a very different direction than the transfer learning approaches (which focus on training models with few data), which creates confusion.
Conclusions
First bullet - this conclusion could be more clear about the methodological preferences being the current status which is developing rapidly.
Line 724 - I would say DL for hazard mapping so far relies on numerical simulations, this may change.
Line 731-736 - Some of the existing architectures do enable generalization but this certainly requires more research and testing. Meshes are one way forward amongst others.
Line 737-741 - Physics-informed learning is not only relevant in a warning context but for virtually all kinds of flood simulations. FNOs and DGPs are potentially interesting approaches, but there are others. You are overstating the ability of geometric DL which (to my knowledge) has not been tested in the flood context.
Line 742-745 - As mentioned before, there is some logic here that does not make sense, because the GANs need to be trained against observed data. Once we have a GAN, what would be the point of training another deep learning model that only learns to emulate the output of the GAN?
References:
Löwe, R., Böhm, J., Jensen, D. G., Leandro, J., & Rasmussen, S. H. (2021). U-FLOOD – topographic deep learning for predicting urban pluvial flood water depth. Journal of Hydrology, 603, 126898. https://doi.org/10.1016/j.jhydrol.2021.126898
Pham, B. T., Luu, C., Phong, T. Van, Trinh, P. T., Shirzadi, A., Renoud, S., Asadi, S., Le, H. Van, von Meding, J., & Clague, J. J. (2020). Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling? Journal of Hydrology, 592(July 2020), 125615. https://doi.org/10.1016/j.jhydrol.2020.125615
Zahura, F. T., Goodall, J. L., Sadler, J. M., Shen, Y., Morsy, M. M., & Behl, M. (2020). Training machine learning surrogate models from a high-fidelity physics-based model: Application for real-time street-scale flood prediction in an urban coastal community. Water Resources Research, 56(10), e2019WR027038. https://doi.org/10.1029/2019WR027038
Citation: https://doi.org/10.5194/hess-2021-614-RC3
AC3: 'Reply on RC3', Roberto Bentivoglio, 26 Jan 2022
Reply to anonymous Reviewer #3
We thank the Reviewer for carefully reading our paper and providing very useful comments for its improvement. Here we provide answers to the issues raised along with details on the amendments to the original manuscript to be featured in the revision. Unless otherwise specified, reported line numbers refer to the updated version.
My concerns are that
1)-the review is currently not always very precise in distinguishing the contexts in which different approaches are relevant.
We thank the Reviewer for raising this point. We have added comments on each approach throughout the paper, e.g., line 109 (referring to inundation maps): “These maps can then be used also as calibration data for other applications such as flood susceptibility or flood hazard mapping.”
Another example is in lines 114-115 (referring to susceptibility maps): “However, it can provide reliable information when no quantitative data is available.”
We also illustrated the main limitations of each flood map in lines 558-563: “Nonetheless, each of the presented maps has its own limitations. Susceptibility maps provide only qualitative results and rely on recorded flood events. Therefore, limited recorded data may lead to incorrect predictions. Moreover, it is important to design an appropriate model to integrate heterogeneous environmental information. Inundation maps mostly consider real events, thus they suffer from the acquisition method’s problems. For example, satellites struggle to extract information below clouded areas (e.g., Meraner et al., 2020). Hazard maps, instead, are limited by the accuracy of the underlying numerical simulator.”
2)-it lacks an assessment of which techniques were used over time (some now popular techniques were not available 2 years ago).
We thank the Reviewer for this observation. We included temporal context in Section 3.2.1 to specify why certain methods were applied only in recent years: lines 275-277 “The late use of convolutional and recurrent models is motivated by their recent popularization and development, along with a rise in awareness of the ML advancements, contrary to fully-connected layers, that have a longer application history.”
3)-it could occasionally be better at explaining concepts with a focus on the hydrological target group.
We thank the Reviewer for pointing out this aspect. In this light, we amended relevant text throughout the paper, adding more examples and comments reflecting a hydrologic/hydraulic perspective. For instance, we added in lines 189-193 “We explain the concept of invariance and equivariance with an example. Consider a picture with a flooded area in its top-left corner and one with the same area but shifted to be in the bottom-right corner. An invariant model would predict that there is a flooded area in both images, while an equivariant model would also reflect the change in positions of the flood, i.e., identify that the flood is in the top-left corner in one case and in the bottom-right corner in the other. In this case, invariance and equivariance are associated to a spatial translation, but the same principle applies to other transformations, such as temporal translation.”
Following this example, we also added a similar consideration in lines 668-672 (section 5.1.1): “Symmetries result in inductive biases, which address the curse of dimensionality by decreasing the required training data (e.g., Wang et al., 2020a) and enabling the processing of different data types, such as meshes. From a flooding perspective, symmetries can be understood and motivated by referring to the example in Section 2.2.1. For instance, analogously to translation, the rotation of a domain should result in an equivalent rotation of the predictions.”
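The translation example above can also be demonstrated numerically. The following is a minimal NumPy/SciPy sketch (not taken from the manuscript), using a fixed mean filter as a stand-in for a learned convolutional kernel and circular shifts so the symmetry is exact:

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Hypothetical one-channel "flood map": a patch of water in the top-left corner.
img = np.zeros((8, 8))
img[1:3, 1:3] = 1.0

# The same patch shifted towards the bottom-right corner.
img_shifted = np.roll(np.roll(img, 4, axis=0), 4, axis=1)

# A convolution (here a 3x3 mean filter) is translation-EQUIVARIANT:
# shifting the input shifts the output by the same amount.
feat = uniform_filter(img, size=3, mode="wrap")
feat_shifted = uniform_filter(img_shifted, size=3, mode="wrap")
assert np.allclose(np.roll(np.roll(feat, 4, axis=0), 4, axis=1), feat_shifted)

# A global pooling readout makes the prediction translation-INVARIANT:
# "is there a flooded area?" gets the same answer for both images.
assert np.isclose(feat.max(), feat_shifted.max())
```

The same construction carries over to other transformations, e.g., rotations, as noted in the added text for Section 5.1.1.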
Another hydrologic explanation is in lines 385-387 “For example, if 90% of an area represents non-flooded areas, a model which assumes that there are no floods will have 90% accuracy.”
4)-generalization of deep learning predictions to locations / events outside the training data is a key aspect that deserves a more prominent place in the paper. Currently this topic is raised in several subsections. It might be useful to provide an overview on what are actually the needs, which can then be used to discuss whether different approaches are conceptually able to meet this (and if this was/was not implemented in current research).
We agree with the Reviewer that the generalization of deep learning models deserves a more prominent place, as it is a valuable yet difficult gap to address. Thus, we created a corresponding subsection in the knowledge gaps section (Section 4.2) in our updated manuscript as follows:
“Generalization refers to the capacity of a model to extrapolate from a training dataset into unseen testing data. This means that a DL model can correctly predict scenarios unused in its development. This property is particularly relevant because training requires data, model development, and time. In the context of flood modelling, there are two main generalization objectives: (i) boundary conditions, i.e., different rainfall events, and (ii) topographical changes, i.e., different case studies. However, the transference between different areas is challenging for DL models because of the difference in input and output data. In fact, except for flood inundation mapping, most reviewed papers focused on generalizing different boundary conditions (e.g., Guo et al., 2021; Berkhahn et al., 2019). Instead, only a few papers tested the model on areas not considered during training. Löwe et al. (2021) could generate flood hazard maps for unseen areas within the same study region as the training dataset, as there was little variability of inputs and outputs. Zhao et al. (2021b) instead pre-trained a model for flood susceptibility on an urban area and then used it for another similar area. They showed that pre-training improves predictions with respect to a model trained from scratch, both in cases of low and high data availability. These works show that such approaches are in their infancy and have been tested on limited datasets. A DL model which cannot generalize to new areas has to be retrained for each new case study. Thus, it may have limited advantages over a hydraulic model, since it requires more effort, data, and time. Instead, a general DL model which can generalize to new areas could emphasize the advantages over numerical models. This concept was also demonstrated for rainfall-runoff modelling, where DL models outperformed state-of-the-art alternatives in the prediction of ungauged basins in new study areas (Kratzert et al., 2019b).”
5)-comparisons of scores across papers need to be interpreted more carefully than what is currently the case. Scores are not necessarily computed in the same manner. In particular, non-flooded areas are not handled consistently in the literature, which has a major impact on the results.
We thank the Reviewer for the valuable comment. We revisited this section and added the following paragraph to address the issue mentioned. Lines 397-400: “Moreover, since different works generally use different datasets, a comparison across them may not always be meaningful. Instead, our purpose here is to show that, for the same case study, DL tends to outperform traditional models.”
Additionally, we thought that the issue of incomparable metrics could also be reflected in the absence of a unified dataset for the different flood applications. Thus, we added a new paragraph in the data availability knowledge gap in lines 643-650: “Another issue, which emerges also from Section 3.2.5, is the lack of a unified framework to compare different approaches with each other. This can be achieved by creating flood-based benchmark datasets for each mapping application. For flood inundation, some datasets have been already used across different works (e.g., Bonafilia et al., 2020). However, works on both flood susceptibility and hazard mapping consider different datasets, focusing on different geographic areas or flood types. One possibility could then be to unify different case studies in a single dataset, for each application, allowing to assess the validity of a model more objectively. For flood susceptibility, case studies with the same input availability could be merged in a dataset with many flood types, scales, and geographical areas. A similar reasoning could be made for flood hazard mapping, selecting, for each case study, initial and boundary conditions for specific return periods.”
I think all of these issues can be addressed in a revision. I have provided detailed comments below.
Detailed comments
6)line 51: the automatic discovery of representations is "to some extent" possible. We are still dealing with an input-output model. It is quite a common misunderstanding that deep learning can find "any representations", while many relations in hydrology are highly nonlinear and require careful consideration of the data.
We thank the Reviewer for this clarification. With appropriate data, deep learning can learn any representation, as a consequence of the universal approximation theorem (Hornik et al., 1989). However, it is indeed important to select the data carefully. Thus, we changed the sentence in line 52 of the original manuscript to avoid confusion and further clarified that data still require accurate pre-processing and selection. Lines 52-53: “Nonetheless, data must be carefully selected according to the task at hand.”
7)line 126-137: I don't think the detailed overview of modelling approaches is needed in this review.
We thank the Reviewer for the suggestion. We reduced the length of this paragraph and described numerical models briefly. The section has been modified as follows: “Flood hazard maps are produced by numerical models, which simulate flood events by discretizing the governing equations and the computational domain. We distinguish between one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) models with increasing complexity and, generally, accuracy (e.g., Horritt and Bates, 2002; Teng et al., 2017).”
8)line 190 to 205: this text is somehow misplaced in this section. It is more an assessment of the properties of different techniques and it would probably make more sense to place it after the different layer types were introduced.
We thank the Reviewer for the observation. We placed this section here to introduce the necessity of models like convolutional or recurrent networks to overcome the shortcomings of fully-connected layers. While writing the paper, we tried placing this section elsewhere (e.g., before and after introducing the different layers), but ultimately we agreed that the current configuration was more suitable, even if it may hinder the reading flow. Thus, we would prefer to keep it as it is.
9)Figure 2: I think it would help many readers if the figure illustrates that the convolutional kernels map many pixels to one. Also in the text (line 210), a simple explanation of the kernels (spatially weighted average where the weights are learned during optimization) may be helpful.
We thank the Reviewer for the precious comment. We added an explanation of what kernels are in lines 199-200: “A kernel represents a spatially weighted average which is applied to the input and where the weights are learned during optimization.”
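The added definition can be illustrated with a toy implementation; the sketch below is purely illustrative (not the paper's code), with a fixed 3x3 averaging kernel standing in for the weights a CNN would learn during optimization:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D cross-correlation: each output pixel is a spatially
    weighted average of a kernel-sized neighbourhood (many pixels -> one)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Illustrative 3x3 averaging kernel; in a CNN these 9 weights are learned.
kernel = np.full((3, 3), 1 / 9)
depth = np.arange(25, dtype=float).reshape(5, 5)  # toy water-depth grid
out = conv2d(depth, kernel)
print(out.shape)  # (3, 3): each output value summarizes nine input pixels
```

In practice, libraries implement this with optimized routines, but the many-pixels-to-one mapping shown in Figure 2 is exactly this operation.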
10)Table 2: I believe the correct citation for the work of Guo et al. is 2021, not 2020.
Thank you for the observation. We modified the citation as correctly pointed out.
11)Section 3.2.4: The review is generally missing a section that discusses under which conditions a deep learning network can generalize, i.e. predict flooding in different locations.
We thank the Reviewer for the comment. With the new section on generalization, we tried to discuss the current issues and limitations. However, the problem of generalization is complicated, and the minimum conditions needed to guarantee it are still an open research question, even in the machine learning community.
12)Section 3.2.5: A key issue when assessing flood predictions (inundation and hazard) is the large number of zeros (often >95% of the dataset) which implies that, for example, accuracy scores almost per definition are in the order of 80% and above. This issue needs to be explained here. In addition, binary scores such as CSI are very vulnerable to double penalty issues.
Thanks for the valuable comment. We modified Section 3.2.5 and added further clarification of the problems related to classification metrics, addressing the issues of imbalanced datasets and appropriate metrics for flood mapping. We did not address the issue of CSI since very few papers use that metric.
Section 3.2.5 has been modified as follows: “In supervised learning, we distinguish between regression and classification problems, depending on whether the target values to predict are continuous (e.g., water depth) or discrete (e.g., flooded vs non-flooded area), respectively. Depending on the task, we employ a different set of metrics to evaluate model performances.
Regression metrics are a function of the differences, or residuals, between target and predicted values. The most common metrics include the root mean squared error (RMSE), the coefficient of determination (R2), and the mean average error (MAE). RMSE and MAE improve as they approach zero, while R2 improves as it approaches one. In general, RMSE is preferred to MAE since it minimizes the standard deviation of the errors, thus decreasing the presence of extreme outliers. However, since these metrics are averaged on a domain, their comparison across different works requires careful attention.
Classification tasks can be either binary (e.g., building a model to predict flooded and non-flooded locations) or multi-categorical (e.g., classifying between permanent water bodies, buildings, and vegetated areas), according to the output number of classes. In the following discussion, we focus on the former, with concepts naturally extending to the second case. When computing binary classification metrics, flooded areas are generally represented as the positive class, and non-flooded areas as the negative class. The most common metrics for flood modelling are accuracy, recall, and precision, followed by other indices such as the area under the receiver operator characteristic curve. Accuracy represents the number of correct predictions over the total. While popular and easy to implement, this metric is inappropriate for imbalanced datasets, where some categories are more represented than others. For example, if test samples feature an average 90% non-flooded area, a naïve model constantly predicting no flooding will reach 90% accuracy, despite having wrong assumptions. Furthermore, since it may be better to overestimate a flooded area than to underestimate it, one could resort to metrics such as recall that account for false negatives and thus penalize models that cannot recognize a flooded area correctly. However, when used alone, recall can lead to similar issues to those described for accuracy, e.g., yielding a perfect score for a model always predicting the entire domain as flooded. Thus, for an exhaustive understanding of the model's performance, one should also consider metrics accounting for false positives, i.e., where the model misclassifies non-flooded areas as flooded. There are several possible metrics, such as the F1 score, the Kappa score, or the Matthews correlation coefficient, each with their drawbacks and benefits (e.g., Wardhani et al., 2019; Delgado and Tibau, 2019; Chicco and Jurman, 2020).
A reasonable choice is the F1 score, which is the harmonic mean of recall and precision, and it thus equally considers both false negatives and false positives. Another good example is the ROC (Receiver Operating Characteristic) curve that describes how much a model can differentiate between positive and negative classes for different discrimination thresholds (Bradley, 1997). The Area under the ROC curve (AUC) is often used to synthesise the ROC as a single value. However, the AUC loses information on which parts of the dataset the model performs best. For this reason, one should always interpret these results carefully, especially when comparing different studies. Our purpose here is to show that, for the same case study, DL tends to outperform traditional models.”
Following this addition to the text, Tables 3 and 4 have been modified accordingly, prioritizing metrics such as F1 and AUC.
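The imbalance pitfall discussed in the revised Section 3.2.5 is easy to reproduce numerically. The snippet below is purely illustrative (a hypothetical 90/10 class split, with the metrics implemented directly from their definitions):

```python
import numpy as np

# Hypothetical test set: 90% non-flooded (0), 10% flooded (1).
y_true = np.array([0] * 90 + [1] * 10)
naive = np.zeros(100, dtype=int)    # always predicts "no flood"
alarmist = np.ones(100, dtype=int)  # always predicts "flood"

def accuracy(t, p):
    return np.mean(t == p)

def recall(t, p):  # true positives over actual positives
    return np.sum((t == 1) & (p == 1)) / np.sum(t == 1)

def precision(t, p):  # true positives over predicted positives
    pred_pos = np.sum(p == 1)
    return np.sum((t == 1) & (p == 1)) / pred_pos if pred_pos else 0.0

def f1(t, p):  # harmonic mean of precision and recall
    r, pr = recall(t, p), precision(t, p)
    return 2 * r * pr / (r + pr) if (r + pr) else 0.0

print(accuracy(y_true, naive))   # 0.9 despite predicting no floods at all
print(recall(y_true, alarmist))  # 1.0 despite flagging everything as flooded
print(f1(y_true, naive), f1(y_true, alarmist))  # F1 penalizes both extremes
```

This is why the revised tables prioritize metrics such as F1 and AUC over plain accuracy.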
13)Section 3.4: In general, for flood inundation, it is not completely clear to me whether the authors focus on models that can predict flood inundation (in binary form) given some rainfall or on "gap filling" in remote sensing data. This needs to be checked in all related sections.
For most works, flood inundation consists of determining flooded and non-flooded areas from remote-sensing data, i.e., given a picture, the model determines which areas are flooded. Only one paper does not consider remote sensing data (Dong et al.). We clarified this in the review in line 427: “Only Dong et al. (2021) differ from the other papers by considering sensors in place of flood pictures.”
14)Line 472: Due to the 0 problem mentioned above, "slight" increases in accuracy may actually be linked to substantial changes in the quality of a model. The scores therefore need to be interpreted carefully and it is also not guaranteed that all papers computed scores in the same manner.
Thanks for the important observation. As discussed in comment 12), we specified that it is not guaranteed that all works compute scores in the same manner and that comparison requires careful attention.
15)Line 496: Pham et al. assessed flood conditioning factors, Löwe et al. performed a forward selection to identify relevant topographic variables, Zahura et al. tested feature importance in their random forest model.
We thank the Reviewer for providing valuable suggestions. Since this section concerns flood hazard, we have now included Löwe et al. However, we did not include Zahura et al., as they employ other machine learning models, while the review focuses only on DL.
16)Table 5: It is not clear to me why not all the papers performing hazard predictions were included in this table. In addition, the error scores may not be comparable across papers (0 problem or similar), which should be mentioned. Also, speed-up is a difficult quantity to compare, because it depends on the assumed number of numerical simulations that should be performed (e.g. if we assume that we have to assess flood hazard for 1000 rain events, then the speed-up factor obtained by a neural network will be much higher than when only 10 events are considered). Most certainly, these assumptions are also not the same across papers and therefore not comparable.
We greatly appreciate this comment of the Reviewer and address its different parts individually.
For flood hazard, as for the other applications, not every paper was included in the provided tables. Many papers do not provide information on the computational times of both numerical and deep learning models and, thus, do not report speed-up metrics. Nonetheless, to provide a general overview of every paper, we have now included all papers in all the tables, even those not reporting speed-up metrics or comparison against other machine learning models.
As regards the error scores, we agree that they may not be comparable across the works as the scales and resolutions may differ. However, we believe that these errors, along with the case study area, provide a measure of the model’s reliability. We mentioned the issues when comparing scores across different studies in lines 543-545: “However, the comparison of speed-up across different papers is often unrealistic since it depends on the number of performed numerical simulations and on the type of numerical model. A similar consideration persists for the error scores, as they depend on the scale of the case study and on its resolution.”
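The Reviewer's point about speed-up can be made concrete with a back-of-the-envelope calculation. All timings below are hypothetical; the sketch only shows how the effective speed-up changes once the training cost is amortized over the number of simulated events:

```python
# Illustrative (made-up) timings for a surrogate-model workflow.
t_numerical = 3600.0  # seconds per numerical simulation (assumed)
t_surrogate = 1.0     # seconds per DL prediction (assumed)
# Assumed one-off cost: generating training data plus fitting the model,
# here taken as equivalent to 100 numerical simulations.
t_training = 100 * t_numerical

def effective_speedup(n_events):
    """Ratio of total numerical cost to total surrogate cost for n events."""
    return (n_events * t_numerical) / (t_training + n_events * t_surrogate)

for n in (10, 1000, 100000):
    print(n, round(effective_speedup(n), 1))
```

Under these assumptions the surrogate is slower for 10 events but hundreds of times faster for 100,000 events, which is precisely why speed-up factors reported under different assumptions are not comparable.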
Regarding the 0 problem, we agree that the scores may differ depending on the number of zeros in a domain, since most regression metrics are averaged. Thus, we added the following in lines 377-378: “However, since these metrics are averaged on a domain, their comparison across different works requires careful attention.”
17)Section 4.2: The discussion on generalization abilities needs to be differentiated a bit more. Both Guo et al. 2021 and Löwe et al. 2021 consider terrain characteristics as input to their models, and in Löwe et al. 2021 generating predictions outside the training dataset was explicitly the focus of the work. As mentioned by the authors, these approaches are in their infancy and have been tested on limited datasets, but these approaches do consider effects of e.g. the built environment in the form of 2D grids.
We thank the Reviewer for this observation. We expanded the discussion on generalization in Section 4.2 by differentiating between generalization on boundary conditions (i.e., different rain events) and topographical changes (i.e., different case studies).
Lines 612-609: “Generalization refers to the capacity of a model to extrapolate from a training dataset into unseen testing data. This means that a DL model can correctly predict scenarios unused in its development. This property is particularly relevant because training requires data, model development, and time. In the context of flood modelling, there are two main generalization objectives: (i) boundary conditions, i.e., different rainfall events, and (ii) topographical changes, i.e., different case studies. However, the transference between different areas is challenging for DL models because of the difference in input and output data. In fact, except for flood inundation mapping, most reviewed papers focused on generalizing different boundary conditions (e.g., Guo et al., 2021; Berkhahn et al., 2019). Instead, only a few papers tested the model on areas not considered during training. Löwe et al. (2021) could generate flood hazard maps for unseen areas within the same study region as the training dataset, as there was little variability of inputs and outputs.”
18)Section 5.1: While investigating the possibility to consider mesh-based deep learning setups is an interesting direction, the authors present no argument why this should work better than convolutional approaches (which are also used for simulating fluid movements). Contrary to what is stated around line 610, they are simply a different data representation with advantages and disadvantages (mesh generation) and may or may not improve performance.
We thank the Reviewer for this comment. We provided arguments based on recent works suggesting that mesh-based models outperform convolutional neural networks for generalization, accuracy, and stability in fluid dynamics. This is expressed in lines 674-675: “There already exist promising works which simulate fluid dynamics with mesh-based GNNs, with increased generalization, accuracy, and stability, with respect to CNNs (e.g., Pfaff et al., 2020; Lino et al., 2021).”
We introduced the limitations of meshes in lines 657-658: “Unstructured meshes, nonetheless, inherit similar problems as those typical of numerical models, such as mesh generation and the need to explicitly define how each node is connected.”
19)Line 648: From here on the text no longer focuses on meshes (which is the Section heading) but on physical conditioning.
We thank the Reviewer for this observation. We added a section named physics-based deep learning that includes physics-informed neural networks and neural operators.
20)Line 656: I think a formulation that will be easier to understand for many readers is that the PINN can only be trained for a specific boundary condition (such as a specific rain event) and it is subsequently only able to simulate this specific event.
We greatly appreciate this comment. We modified lines 693-694 as follows: “However, PINNs can only be trained for a specific boundary condition (e.g., a specific rain event) and can subsequently only simulate that specific event (Kovachki et al., 2021).”
21)Line 656: FNOs need to be mentioned as one approach amongst many. DeepONets are a widely known alternative and new approaches are constantly developed. The same is true for DGP in the following section.
We thank the Reviewer for the comment. Indeed, there are many possible alternatives for these approaches. We added DeepONets and clarified that several approaches can be used in lines 698-699: “While many approaches have been proposed, such as DeepONets (Lu et al., 2019) or the multipole graph neural operator (Li et al., 2020), Fourier neural operators (FNO) have currently achieved the best results (Li et al., 2021).”
Moreover, we included a section on Bayesian neural networks in the probabilistic deep learning section. Lines 720-725: “Along with those related to the model’s input, uncertainties are also present in the model’s prediction. To account for this kind of uncertainty, we can use Bayesian neural networks (BNN). BNNs are models with stochastic components trained using Bayesian inference. They assign prior distributions to the model parameters to provide an estimate of the model’s confidence in the final prediction (Blundell et al., 2015). If the output is unvaried for different parameter samples, then the model has good confidence in the prediction, and vice versa if different parameters give different results. Jacquier et al. (2021) used BNNs to determine the confidence intervals in flood hazard maps, providing a measure of the model’s reliability.”
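As an illustration of the BNN idea (a schematic sketch, not Jacquier et al.'s implementation), repeatedly sampling a weight from an assumed posterior distribution yields an ensemble of predictions whose spread quantifies the model's confidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (approximate) posterior of a single weight in a toy linear model
# y = w * x; in a real BNN every weight carries such a distribution.
w_mean, w_std = 2.0, 0.3
x = np.array([0.5, 1.0, 2.0])  # toy inputs, e.g., rainfall intensities

# Monte Carlo sampling: each draw of w gives one member of the
# predictive ensemble; shape (1000, 3).
samples = np.array([rng.normal(w_mean, w_std) * x for _ in range(1000)])

pred_mean = samples.mean(axis=0)                    # point prediction
pred_ci = np.percentile(samples, [5, 95], axis=0)   # 90% confidence band
widths = pred_ci[1] - pred_ci[0]
print(pred_mean.round(2), widths.round(2))
# The band widens with x: predictions far from small inputs are less certain.
```

A tight band signals high confidence; a wide band flags where a flood hazard map should be trusted less.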
22)Section 5.3: I don't see how GANs fix data scarcity issues (line 680). They are indeed an interesting approach for e.g. gap filling or the generation of rainfall scenarios, but they do need to be trained and do not relieve us of the problem that e.g. flood observations are hardly available. The discussion in the first parts of this section goes in a very different direction than the transfer learning approaches (which focus on training models with few data), which creates confusion.
We thank the Reviewer for the comment. We understand the doubts raised about GANs and VAEs. We have now mentioned that they do not solve the issue of a complete lack of data but can be useful in many situations where little data is available. For example, we could use the data generated by the GAN to augment the few real data we have; these synthetic samples can then provide more training data for larger DL models, which would otherwise be more challenging to train. GANs can also be used to generate floods in areas never experienced before, without necessarily being fed to a DL model afterward.
Following those considerations, we clarified the issue and modified section 5.3 as follows.
Moreover, we added a section on new data sources, which was previously introduced in the knowledge gaps (section 4.4). We also decided to remove the paragraph on transfer learning to avoid confusion.
“Even though remote sensing and measuring stations provide noticeable amounts of data, several parts of the world still lack enough data to deploy deep learning models. New satellite missions and added sensor networks throughout the world increasingly provide new data sources (e.g., van de Giesen et al., 2014). The flexibility of DL partially overcomes data scarcity by facilitating the use of a wider variety of data sources. For instance, several papers already employ cameras to detect floods and measure the associated water depth (e.g., Vandaele et al., 2021; Jafari et al., 2021; Moy De Vitry et al., 2019). Structural monitoring with cameras can provide reliable data where it was previously hard to obtain, such as in urban environments. Social media information, such as tweets or posted pictures, can also be used to identify flood events and flooded areas (e.g., Rossi et al., 2018; Pereira et al., 2020). In this case, the quality of the retrieved information must be further validated before its use for real applications. Moreover, the heterogeneity of the sources of these data needs to be carefully taken into account when deploying a suitable DL model.
Another approach can be to generate artificial data to supplement scarce data. This can be done using generative adversarial networks (GAN), which create new data from a given dataset (Goodfellow et al., 2014). GANs are composed of two neural networks, named generator and discriminator, whose purpose is, respectively, to generate new data and to detect whether given data are real or fake. A trained GAN can produce new fake but plausible data, facilitating data augmentation, i.e., providing more training samples. Interesting applications of GANs could overcome some limitations of satellite data (Lütjens et al., 2020, 2021), predict flood maps (Hofmann and Schüttrumpf, 2021) or meteorological forecasts (Ravuri et al., 2021), and create realistic scenarios of flood disasters for projected climate change variations (Schmidt et al., 2019). GANs could also be used to generate a plausible urban drainage system or topography for cities that do not have any sewer construction plans or in areas where only low-resolution data is available (e.g., Fang et al., 2020b).
However, GANs are difficult to train (Goodfellow, 2016). Variational autoencoders (VAEs) are another type of generative model, which can overcome this issue. Unlike standard autoencoders, VAEs model the latent space with probability distributions that aim to ensure good generative properties of the model (Kingma and Welling, 2013). Once the model is trained, new synthetic data can be generated by drawing new samples from the latent distributions. Nonetheless, because of the model’s definition, its predictions are less precise than those of GANs. As such, VAEs and GANs offer a trade-off between the realism of the generated data and the ease of model training.”
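As an illustrative aside, the adversarial setup described in the quoted passage can be sketched in plain NumPy. The tiny one-hidden-layer networks, their sizes, and the toy 1-D data below are all hypothetical; the snippet only evaluates the standard generator and discriminator losses for one batch, without any training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    # One-hidden-layer MLP with tanh activation (hypothetical toy network).
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny generator G: 2-D noise -> 1-D synthetic sample
G = dict(W1=rng.normal(size=(2, 8)), b1=np.zeros(8),
         W2=rng.normal(size=(8, 1)), b2=np.zeros(1))
# Hypothetical tiny discriminator D: 1-D sample -> real/fake logit
D = dict(W1=rng.normal(size=(1, 8)), b1=np.zeros(8),
         W2=rng.normal(size=(8, 1)), b2=np.zeros(1))

real = rng.normal(loc=3.0, scale=0.5, size=(64, 1))  # toy "observed" data
noise = rng.normal(size=(64, 2))
fake = mlp_forward(noise, **G)                       # generated samples

p_real = sigmoid(mlp_forward(real, **D))             # D's belief: real is real
p_fake = sigmoid(mlp_forward(fake, **D))             # D's belief: fake is real

# Discriminator maximizes log p(real) + log(1 - p(fake));
# generator maximizes log p(fake) (the non-saturating objective).
d_loss = -np.mean(np.log(p_real + 1e-8) + np.log(1.0 - p_fake + 1e-8))
g_loss = -np.mean(np.log(p_fake + 1e-8))
print(f"D loss: {d_loss:.3f}, G loss: {g_loss:.3f}")
```

In an actual application, the two losses would be minimized alternately by gradient descent until the generator's samples become indistinguishable from the observed data.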
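Similarly, the VAE generative mechanism can be illustrated with a minimal sketch of its two key ingredients: the reparameterized latent sample and the KL-divergence term that regularizes the latent space toward a standard normal, from which new data are later drawn. The encoder outputs (`mu`, `logvar`) below are hypothetical placeholder values, not the result of a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# Hypothetical encoder outputs for one input: latent mean and log-variance
mu = np.array([[0.5, -0.2]])
logvar = np.array([[-1.0, 0.3]])

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
# which keeps the sampling step differentiable during training.
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

kl = kl_to_standard_normal(mu, logvar)
print(f"latent sample z = {z}, KL term = {kl}")

# After training, new synthetic data would be produced by decoding
# fresh samples drawn from the prior, z_new ~ N(0, I).
z_new = rng.normal(size=(5, 2))
```

The full training objective (the ELBO) adds a reconstruction loss from the decoder to this KL term; only the latent-space part is sketched here.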
Conclusions
23) First bullet - this conclusion could be clearer about the methodological preferences being the current status, which is developing rapidly.
We appreciate the comment and have clarified for each application which are the methodological preferences, based on the methods proposed so far (and thus excluding future direction models).
Lines 743-751: “Flood inundation, susceptibility, and hazard mapping were investigated using deep learning models. Flood inundation mapping mainly uses images of floods, mostly taken via satellite. The main and most accurate deep learning models were CNNs. In flood susceptibility, deep learning models consider several inputs, the most important being slope, land use, aspect, terrain curvature, and distance from rivers. The main deep learning models used were MLPs, often in combination with other statistical techniques, but CNNs generally provided more accurate results. So far, flood hazard maps estimate the water depth in a study area by using deep learning as a surrogate model for numerical simulations. For this application, there are no deep learning model preferences, although RNNs are preferable for spatio-temporal simulations. Regardless of the application, results show that deep learning solutions outperform traditional approaches as well as other ML techniques.”
24) Line 724 - I would say DL for hazard mapping so far relies on numerical simulations; this may change.
Thank you for the suggestion. Indeed, as mentioned, flood hazard models may also consider real flood events for training. We added “so far” in line 747, as shown in question 22, to indicate this.
Further comments are presented in section 3.5.2, e.g., in lines 514-515 “Even though observed data were not employed, they could be used in future research to corroborate the transferability of such methods.”
25) Line 731-736 - Some of the existing architectures do enable generalization but this certainly requires more research and testing. Meshes are one way forward amongst others.
Thank you for the observations. We replaced “cannot” with “struggle to”.
26) Line 737-741 - Physics-informed learning is not only relevant in a warning context but for virtually all kinds of flood simulations. FNOs and DGPs are potentially interesting approaches, but there are others. You are overstating the ability of geometric DL which (to my knowledge) has not been tested in the flood context.
We thank the Reviewer for the comment. As discussed before, we modified the Future Research Directions section to include more possible models. Indeed, as regards physics-based learning, many models can benefit from it. We rewrote this section clarifying those issues as follows: “Physics-based deep learning provides a reliable framework for flood modelling since it considers the underlying physical equations. Probabilistic hazard mapping can take advantage of deep Gaussian processes or Bayesian neural networks to determine the uncertainties associated with the model and its inputs.”
While geometric DL has not yet been used in the flood context, based on recent findings (e.g., Pfaff et al., 2020; Lino et al., 2021; Wang et al., 2021), we believe that it may be a valuable tool for flood modelling.
27) Line 742-745 - As mentioned before, there is some logic here that does not make sense, because the GANs need to be trained against observed data. Once we have a GAN, what would be the point of training another deep learning model that only learns to emulate the output of the GAN?
We greatly appreciate this comment. Following the comments previously addressed (comment 22), we modified the paragraph in the conclusions as follows (lines 784-787): “DL necessitates large quantities of data, which are difficult to collect in several areas of the world. New data sources, such as camera pictures and videos, or social media information, can potentially be used thanks to deep learning models. Moreover, generative models, such as GANs and VAEs, can be employed to produce synthetic data for such data-scarce regions, based on training data collected elsewhere.”
References:
Löwe, R., Böhm, J., Jensen, D. G., Leandro, J., & Rasmussen, S. H. (2021). U-FLOOD – topographic deep learning for predicting urban pluvial flood water depth. Journal of Hydrology, 603, 126898. https://doi.org/10.1016/j.jhydrol.2021.126898
Pham, B. T., Luu, C., Phong, T. Van, Trinh, P. T., Shirzadi, A., Renoud, S., Asadi, S., Le, H. Van, von Meding, J., & Clague, J. J. (2020). Can deep learning algorithms outperform benchmark machine learning algorithms in flood susceptibility modeling? Journal of Hydrology, 592(July 2020), 125615. https://doi.org/10.1016/j.jhydrol.2020.125615
Zahura, F. T., Goodall, J. L., Sadler, J. M., Shen, Y., Morsy, M. M., & Behl, M. (2020). Training machine learning surrogate models from a high-fidelity physics-based model: Application for real-time street-scale flood prediction in an urban coastal community. Water Resources Research, 56(10), e2019WR027038. https://doi.org/10.1029/2019WR027038
Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural networks, 2(5), pp.359-366.
Dong, S., Yu, T., Farahmand, H. and Mostafavi, A., 2021. A hybrid deep learning model for predictive flood warning and situation awareness using channel network sensors data. Computer‐Aided Civil and Infrastructure Engineering, 36(4), pp.402-420.
Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A. and Battaglia, P.W., 2020. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409.
Lino, M., Cantwell, C., Bharath, A.A. and Fotiadis, S., 2021. Simulating Continuum Mechanics with Multi-Scale Graph Neural Networks. arXiv preprint arXiv:2106.04900.
Wang, R., Walters, R. and Yu, R., 2020. Incorporating symmetry into deep dynamics models for improved generalization. arXiv preprint arXiv:2002.03061.
Citation: https://doi.org/10.5194/hess-2021-614-AC3
AC3: 'Reply on RC3', Roberto Bentivoglio, 26 Jan 2022