Deep Learning Methods for Flood Mapping: A Review of Existing Applications and Future Research Directions

Deep learning techniques have been increasingly used in flood management to overcome the limitations of accurate, yet slow, numerical models and to improve the results of traditional methods for flood mapping. In this paper, we review 58 recent publications to outline the state of the art of the field, identify knowledge gaps, and propose future research directions. The review focuses on the type of deep learning models used for various flood mapping applications, the flood types considered, the spatial scale of the studied events, and the data used for model development. The results show that models based on convolutional layers are usually more accurate, as they leverage inductive biases to better process the spatial characteristics of flooding events. Models based on fully-connected layers, instead, provide accurate results when coupled with other statistical models. Deep learning models showed increased accuracy when compared to traditional approaches and increased speed when compared to numerical methods. While there exist several applications in flood susceptibility, inundation, and hazard mapping, more work is needed to understand how deep learning can assist real-time flood warning during an emergency and how it can be employed to estimate flood risk. A major challenge lies in developing deep learning models that can generalize to unseen case studies. Furthermore, all reviewed models and their outputs are deterministic, with limited considerations for uncertainties in outcomes and probabilistic predictions. The authors argue that these identified gaps can be addressed by exploiting recent fundamental advancements in deep learning or by taking inspiration from developments in other applied areas. Models based on graph neural networks and neural operators can work with arbitrarily structured data; they should thus be capable of generalizing across different case studies and could account for complex interactions with the natural and built environment.
Physics-based deep learning can be used to preserve the underlying physical equations resulting in more reliable speed-up alternatives for numerical models. Similarly, probabilistic models can be built by resorting to Deep Gaussian Processes or Bayesian neural networks.

[Fig. 1: (a) flood inundation, (b) flood susceptibility, (c) flood hazard.] Flood inundation maps determine the extent of a flood, during or after it has occurred (see Fig. 1a), by representing flooded and non-flooded areas. This application is used for post-flood evacuation, protection planning, and damage assessment. These maps can then also be used as calibration data for other applications.

[Fig. 2: (a) a multilayer perceptron (MLP): each layer is connected to the following one by weights, represented by directed arrows; the values of the input, hidden, and output layers are represented, respectively, by the vectors x0, x1, and ŷ. (b) An MLP encoder-decoder: the input data x0 is encoded into a lower-dimensional layer x1, and then decoded into the output ŷ; this structure is also applicable to convolutional and recurrent layers. (c) A convolutional neural network (CNN) composed of a convolutional layer and a fully-connected layer: the green squares represent an input tensor, the orange squares represent hidden layers, and the red parallelogram on the right represents the output layer; the small box K1 represents the convolutional kernel described in Eq. 3; the final layer depends on the task. (d) Visual explanation of how convolutional kernels work: each element of the kernel is multiplied by its matching input value, then all values are summed to obtain the convolved output; this process is repeated across the whole input as the kernel shifts along it. (e) A recurrent neural network (RNN) in compact form (left) and in unfolded form (right): the iterative structure of the RNN can be unfolded in time to show how hidden states influence the solution at each time step. The colouring scheme indicates for each architecture the input (green), the state (orange), and the output (red).]

Convolutional layers replace matrix multiplication with a convolution, reducing the number of parameters subject to optimization (LeCun et al., 1995). The propagation rule of a convolutional layer is

x_l = K_l * x_{l-1},  (3)

where K_l is the kernel of the l-th layer and * is the convolution operator.
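The kernel operation of Eq. 3 can be sketched in a few lines of numpy. This is a minimal illustration of the sliding multiply-and-sum described above (the function, grid, and kernel values are ours, not taken from any reviewed paper):

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution (cross-correlation, as in most DL frameworks):
    the kernel slides over the input; at each position the overlapping
    values are multiplied element-wise and summed."""
    h, w = k.shape
    out = np.zeros((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

# A 3x3 averaging kernel smooths a toy 5x5 elevation grid.
x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0
y = conv2d(x, k)
print(y.shape)  # (3, 3)
```

Note that the same small kernel is reused at every position, which is precisely why convolutional layers need far fewer parameters than fully-connected ones.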
Convolutional layers are mostly applied to images, i.e., two-dimensional spatial grids; for such inputs, the kernel is a 2D matrix. Convolutional layers have an inductive bias of translational equivariance, which reflects the idea that spatially close grid elements influence each other.

Our search also considered cited and citing works, ultimately leading to 58 eligible documents (Fig. 3). We find that the described methodology selected a representative subset for producing a thorough review of recent advances and developments in this field.
The selected papers are listed in Table 2 which reports major details including the flood mapping application, the type of flood, the DL model, and the spatial scale. General findings related to these three criteria are first presented in Section 3.2.
Specific findings for each application are then presented in Sections 3.3 (flood inundation), 3.4 (flood susceptibility), and 3.5 (flood hazard). These specific sections provide a more in-depth discussion of the deep learning models employed, with a focus on the architecture, the input and output data, and the performance assessment.

https://doi.org/10.5194/hess-2022-83 Preprint. Discussion started: 2 March 2022. © Author(s) 2022. CC BY 4.0 License.

Flood types
Fig. 5 shows the types of flood analyzed with respect to each application. River floods are the most common, with many applications in inundation and hazard mapping. This is probably because, for historical reasons, most cities in the world are built close to rivers (Kummu et al., 2011). The scientific community has dedicated significant efforts to exploring the potential of DL for urban flooding, which is difficult to model because of the complex topography and the presence of a drainage system whose dynamics need to be coupled with the overland flood (Löwe et al., 2021). Almost all papers analyzing flash floods described flood susceptibility mapping applications. This is expected due to the short duration and the contingent nature of these phenomena, which limit the remote sensing imaging and numerical simulations used in flood inundation and flood hazard mapping, respectively. Despite the importance of coastal flooding (Neumann et al., 2015), only a few papers report the use of DL for this flood type. While other works are available in the literature (Lütjens et al., 2020, 2021; Bowes et al., 2021), they were not considered since the employed DL models were not trained via supervised learning; some of these works will be discussed in Section 5. Dam break floods are the least analyzed type, possibly because of their relatively rare occurrence and complexity.

Spatial scale
As shown in Fig. 6a, most applications consider local and regional scales. Local scale refers to towns (e.g., Darabi et al., 2021; Berkhahn et al., 2019), small catchments (e.g., Lin et al., 2020a; Kabir et al., 2020), or river reaches (e.g., Chu et al., 2020; Gebrehiwot et al., 2019); as such, these works mostly concern urban and river floods. Case sizes vary from very small ones, 165 m² (Hou et al., 2021), to small towns of up to 100 km² (Lin et al., 2020a). Regional scale models consider a catchment (e.g., Popa et al., 2019), a province (e.g., Wang et al., 2020b), or large cities (e.g., Löwe et al., 2021; Kalantar et al., 2021). Most of these works focus on river floods, while some study flash, urban, and coastal floods. National scale models refer to the assessment of entire countries, with only two papers concerning such scales, respectively for Iran and Greece (Khosravi et al., 2020; Kourgialas and Karatzas, 2017). Nemni et al. (2020) and Sarker et al. (2019) consider several study areas across Africa and Asia, and Australia, respectively, but since the size of each area was smaller than 100,000 km² and none encompassed a whole nation, they were marked as regional scale models.
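The scale classification applied above can be summarized as a simple rule. This is a sketch under assumed thresholds (up to ~100 km² for local, below 100,000 km² for regional, as suggested by the examples in this section); the function name and the supra-national fallback are ours:

```python
def spatial_scale(area_km2, spans_whole_nation=False):
    """Classify a study area by its size, following the review's
    (assumed) conventions: local up to ~100 km2, regional below
    100,000 km2, and national only when an entire country is covered."""
    if spans_whole_nation:
        return "national"
    if area_km2 <= 100:
        return "local"
    if area_km2 < 100_000:
        return "regional"
    return "supra-national"

print(spatial_scale(0.000165))  # 165 m2 -> 'local'
print(spatial_scale(95_000))    # 'regional'
```

This also makes explicit why the multi-area studies above count as regional: each individual area stays below the 100,000 km² threshold and none spans a whole nation.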
Supra-national scale models assessing the entire globe or a continent have not yet been studied with deep learning models. This seems unexpected, since ML techniques have already been employed at global scales, outperforming traditional techniques, for example in the estimation of design floods along river networks (e.g., Zhao et al., 2020a). Since DL models have been shown to outperform ML models, as later outlined in this review, more models should be used at those scales in future studies.

MLP networks are widely used due to their flexibility and ease of implementation. However, they are usually coupled with other techniques to reach satisfactory performances. Stochastic optimization techniques, such as the genetic algorithm, the firefly algorithm, and particle swarm optimization, were combined with MLPs to search for the optimal model parameters (e.g., Li et al., 2015; Ngo et al., 2018; Kalantar et al., 2021). Multi-criteria decision analysis models, such as frequency ratio and the analytical hierarchy process, were also coupled with MLPs to adjust the weights of each input in flood susceptibility (e.g., Kourgialas and Karatzas, 2017; Costache et al., 2020; Popa et al., 2019). Furthermore, k-means clustering was used to categorize the dataset into classes, to account for different topographical conditions; then, for each class, an MLP was trained (e.g., Chang et al., 2010; Huang et al., 2021a). Combining MLPs with such methods partly compensates for the lack of inductive biases; however, this lack prevents the model from exploiting existing structures in the data, ultimately limiting its usability. Since flooding phenomena have spatial and temporal structures, we expect MLPs to become progressively less used in this field.

CNNs are best suited for processing raster files and images, thanks to their spatial inductive bias.
Since most data for flood analysis (e.g., elevation data, rainfall distribution fields, remote sensing images) come in this format, CNNs have been increasingly employed by the research community in recent years. While most papers consider standard CNNs, a few employ 1D-CNNs (e.g., Dong et al., 2021; Guo et al., 2021; Liu et al., 2021) and 3D-CNNs (e.g., Wang et al., 2020b; Fang et al., 2020a). 1D-CNNs consider as input a hyetograph or a hydrograph of a certain event, while 3D-CNNs consider raster files stacked upon each other. Regarding the architectures, different papers for flood inundation consider an encoder-decoder structure for image segmentation and classification (e.g., Nemni et al., 2020; Hashemi-Beni and Gebrehiwot, 2021; Liu et al., 2019). For such papers, the input is a satellite image of a flood and the output is its classification into flooded and non-flooded areas. This architecture allows the models to increase their performance, since it can retain high-frequency details in the segmented images (Badrinarayanan et al., 2015). Some hybrid architectures further process a rainfall hyetograph in the latent space; in this way, they can consider both spatial and temporal data within the same framework.
RNNs have been mostly employed to model temporally varying floods, where they can best exploit their sequential inductive bias. However, they remain the least common choice of DL architecture for spatial flood analysis. Most papers apply RNNs to a time series, such as a hyetograph or a hydrograph (e.g., Kao et al., 2021; Zhou et al., 2021). Some papers, instead, consider spatial sequentiality by reshaping the original raster data into vectors (e.g., Fang et al., 2020a; Panahi et al., 2021; Lei et al., 2021). For example, Fang et al. (2020a) extract, for each pixel, its neighboring pixels in a 3 × 3 window and then convert them into a vector based on spatial contiguity. However, this operation introduces arbitrariness in the sequential order chosen for arranging the input pixels, since it is independent of the underlying topography. In fact, Panahi et al. (2021) and Lei et al. (2021) show that these models underperform when compared with CNNs. Among the different RNN layers, most works consider LSTM units (Kao et al., 2021; Zhou et al., 2021; Fang et al., 2020a), but simple recurrent units (Panahi et al., 2021; Huang et al., 2021a) and GRUs (Dong et al., 2021) have also been employed. Some papers analyzed the potential of RNNs in combination with other techniques. Kao et al. (2021) use an encoder-decoder architecture to forecast flood features based on rainfall patterns: the encoder and decoder steps are composed of fully-connected layers, while an LSTM in the latent space processes the rainfall data. Zhou et al. (2021) identify representative spatial locations in the study area; then, an LSTM is trained to simulate the water levels' evolution in time at each location, and a water surface is ultimately determined from these point predictions. Another model takes as input the channels' properties, such as their cross-sections, and rainfall and water level measures taken from sensors in the network.
This input is then given in parallel to a 1D-CNN and to a GRU, whose outputs are then combined to produce the prediction.
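The window-to-vector conversion described for Fang et al. (2020a) can be sketched as follows. This is our own illustration, not their code: it uses a plain row-major flattening, which is exactly one of the arbitrary orderings the text criticizes, since nothing in it reflects the underlying topography:

```python
import numpy as np

def pixel_windows_to_sequences(raster):
    """For each interior pixel, extract its 3x3 neighborhood and flatten
    it into a length-9 vector (row-major order, one of several arbitrary
    orderings), yielding the per-pixel sequences fed to an RNN."""
    rows, cols = raster.shape
    seqs = []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            seqs.append(raster[i - 1:i + 2, j - 1:j + 2].ravel())
    return np.stack(seqs)

r = np.arange(16.0).reshape(4, 4)
s = pixel_windows_to_sequences(r)
print(s.shape)  # (4, 9): 2x2 interior pixels, 9 neighbors each
```

Swapping `ravel()` for any other fixed ordering changes the sequence the RNN sees without changing the data, which is precisely the arbitrariness discussed above.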

Performance assessment
This section discusses different approaches for assessing the performance of DL models, i.e., how well they match the outcomes of traditional and machine learning models. Flood susceptibility and inundation models are compared with techniques such as frequency ratio (Popa et al., 2019), a type of MCDA model; the soil conservation service runoff model (Jahangir et al., 2019), a hydrologic model; and the automatic threshold model (Nemni et al., 2020), a histogram-based model. They are also compared with machine learning techniques, such as support vector machines (e.g., Sarker et al., 2019; Gebrehiwot et al., 2019; Zhao et al., 2020b), random forest (e.g., Darabi et al., 2021; Zhao et al., 2020b), adaptive neuro-fuzzy inference systems (Panahi et al., 2021), deep boost (e.g., Chakrabortty et al., 2021a; Ahmed et al., 2021), and radial basis functions (Nogueira et al., 2017).
DL models are shown to outperform both traditional and ML models in terms of the accuracy of the results. Flood hazard models, instead, are compared against numerical models, since they act as surrogate models; thus, their main purpose is to increase computational speed while maintaining low prediction errors.
There are also a few papers that compared different DL models. Huang et al. (2021b) compared MLPs with RNNs, while Fang et al. (2020a) showed that MLPs were outperformed by approaches with stronger inductive biases, such as RNNs, 1D-CNNs, and 3D-CNNs. Wieland and Martinis (2019) showed that CNNs widely outperform MLPs, as expected, because of their inductive biases. Besides accuracy, the number of parameters and the data requirements are important factors when comparing DL models. A higher number of parameters can result in better performances, but may also lead to overfitting, a condition where the model performs worse on the testing data; when deployed in similar settings, such a model would perform drastically worse. Moreover, data is not always available, leading to possibly unfair comparisons between models with different data budgets. As such, the same model may give different outcomes depending on the considered case.
In supervised learning, we distinguish between regression and classification problems, depending on whether the target values to predict are continuous (e.g., water depth) or discrete (e.g., flooded vs. non-flooded area), respectively. Depending on the task, we employ a different set of metrics to evaluate model performances.
Regression metrics are a function of the differences, or residuals, between target and predicted values. The most common metrics include the root mean squared error (RMSE), the coefficient of determination (R²), and the mean absolute error (MAE).
RMSE and MAE improve as they approach zero, while R² improves as it approaches one. In general, RMSE is preferred to MAE since it penalizes large errors more heavily, thus discouraging extreme outliers. However, since these metrics are averaged over a domain, their comparison across different works requires careful attention.
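The three regression metrics can be computed directly from the residuals. A minimal sketch with made-up water depths (the values are illustrative, not from any reviewed study); note how the single large residual inflates RMSE more than MAE:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE and MAE improve toward zero; R2 improves toward one."""
    res = y_true - y_pred
    rmse = np.sqrt(np.mean(res ** 2))
    mae = np.mean(np.abs(res))
    r2 = 1.0 - np.sum(res ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mae, r2

# Toy water depths (m): one large residual inflates RMSE more than MAE.
y_true = np.array([0.2, 0.5, 1.0, 2.0])
y_pred = np.array([0.2, 0.5, 1.0, 1.0])
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(round(rmse, 3), round(mae, 3))  # 0.5 0.25
```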
Classification tasks can be either binary (e.g., predicting flooded and non-flooded locations) or multi-categorical (e.g., classifying between permanent water bodies, buildings, and vegetated areas), according to the number of output classes. In the following discussion, we focus on the former, with the concepts extending to the latter case. When computing binary classification metrics, flooded areas are generally represented as the positive class and non-flooded areas as the negative class. The most common metrics for flood modelling are accuracy, recall, and precision, followed by other indices such as the area under the receiver operating characteristic curve. Accuracy represents the number of correct predictions over the total. While popular and easy to implement, this metric is inappropriate for imbalanced datasets, where some categories are more represented than others. For example, if test samples feature on average 90% non-flooded area, a naïve model constantly predicting no flooding would reach 90% accuracy, despite having no predictive skill. Furthermore, since it may be better to overestimate a flooded area than to underestimate it, one could resort to metrics such as recall that account for false negatives and thus penalize models that cannot recognize a flooded area correctly. However, when used alone, recall can lead to issues similar to those described for accuracy, e.g., yielding a perfect score for a model that always predicts the entire domain as flooded. Thus, for an exhaustive understanding of the model's performance, one should also consider metrics accounting for false positives, i.e., cases where the model misclassifies non-flooded areas as flooded. There are several possible metrics, such as the F1 score, the Kappa score, or the Matthews correlation coefficient, each with their drawbacks and benefits (e.g., Wardhani et al., 2019; Delgado and Tibau, 2019; Chicco and Jurman, 2020).
A reasonable choice is the F1 score, the harmonic mean of recall and precision, which thus equally considers both false negatives and false positives. Another good example is the ROC (receiver operating characteristic) curve, which describes how well a model can differentiate between positive and negative classes for different discrimination thresholds (Bradley, 1997). The area under the ROC curve (AUC) is often used to synthesise the ROC as a single value. However, the AUC loses information on which parts of the dataset the model performs best on. For this reason, one should always interpret these results carefully, especially when comparing different studies. Our purpose here is to show that, for the same case study, DL tends to outperform traditional models.
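The 90%-non-flooded example above can be reproduced in a few lines. This is a self-contained sketch (the toy flood map is ours), showing how a naïve all-negative model scores a high accuracy while recall, precision, and F1 all collapse to zero:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F1 for a binary flood map
    (1 = flooded, the positive class; 0 = non-flooded)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, rec, prec, f1

# 10% flooded pixels: a model that always predicts "non-flooded"
# reaches 90% accuracy but zero recall, precision, and F1.
y_true = [1] * 10 + [0] * 90
naive = [0] * 100
print(binary_metrics(y_true, naive))  # (0.9, 0.0, 0.0, 0.0)
```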
For surrogate models, the comparison is also performed in terms of their speed-up, which is determined as the ratio between the simulation time of the numerical model and the simulation time of the DL model. For a correct comparison, the training time of the DL model must also be considered in this analysis; however, this was done only by a few papers (e.g., Guo et al., 2021; Kabir et al., 2020; Jacquier et al., 2021).
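The role of training time can be made concrete by amortizing it over the number of surrogate runs. The timings below are hypothetical, chosen only to show that the headline speed-up vanishes when the one-off training cost dominates a small number of reuses:

```python
def speedup(t_numerical, t_dl, t_train=0.0, n_runs=1):
    """Speed-up of a DL surrogate over a numerical model, optionally
    amortizing the one-off training cost over the number of runs."""
    return (n_runs * t_numerical) / (t_train + n_runs * t_dl)

# Hypothetical timings (seconds): 2 h per numerical run, 1 s per DL run,
# 10 h of training.
print(round(speedup(7200, 1), 1))  # 7200.0 if training is ignored
print(round(speedup(7200, 1, t_train=36000, n_runs=5), 1))  # ~1.0
```

With only five reuses the surrogate barely breaks even; the advantage materializes only when the trained model is reused many times, which is why reporting training time matters.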

[Fig. 7 legend: flood inundation, flood susceptibility, flood hazard; classification, regression, both.]

Deep Learning for flood inundation
Flood inundation maps determine the extent of a flood, during or after it has occurred. The objective is to determine flooded and non-flooded areas; some works further distinguish additional classes (e.g., Ichim and Popescu, 2020), such as buildings (e.g., Hashemi-Beni and Gebrehiwot, 2021), and more (e.g., Muñoz et al., 2021). All flood types except flash floods were well represented in this application (Fig. 5). We attribute this to the limited frequency of observation of most remote sensing techniques.

Regarding the spatial scale, most papers focused on local and regional scales. The availability of remote sensing at wider scales is increasingly high (e.g., Observatory); however, this seems to be only partially exploited. A plausible reason is the limited frequency of observation of the satellites: remote sensing imagery with high temporal frequency has a low spatial resolution. A few papers tackle this issue by increasing the resolution of the predicted flood maps, via a neural network, with a technique known as super-resolution (e.g., Li et al., 2015, 2016b). Super-resolution enhances the quality of an input low-resolution image (Yang et al.).

Deep Learning for flood susceptibility
Flood susceptibility determines the tendency of a study area to flood, based on its physical characteristics and a set of known past flood events. This is done by assigning to each location a level of susceptibility, ranked from low to high (see Fig. 1b). There exist DL-related applications for all types of flood (see Fig. 5). Furthermore, Fig. 6a shows that most of the works are concerned with regional or wider scales (e.g., Tien et al., 2020; Panahi et al., 2021; Khosravi et al., 2020). This is expected, since susceptibility mapping gives a qualitative estimate of which locations are prone to flooding. Operating on small scales may thus be limiting, both in terms of data availability and applicability for prevention strategies; the data requirements for an accurate estimate would probably be too high for a small area.

Input and output data
Deep learning models take several inputs, which we distinguish into five typologies. Topographical data were the most frequent type of input. Most papers also performed a statistical analysis to determine which factors most influenced the final results: on average, the most important factors were slope, land use, aspect, terrain curvature, and distance from rivers (e.g., Khosravi et al., 2020; Fang et al., 2020a; Popa et al., 2019; Costache et al., 2020). A complete list of inputs is reported in the Appendix (Fig. B1).
As output data, most papers considered a flood inventory map, given by a set of flooded and non-flooded locations. The flooded locations were derived from measurements and records taken from remote sensing and stations, while non-flooded locations were sampled randomly from locations with no previous flood record.

Performance assessment
In flood susceptibility analysis, both classification and regression metrics are adopted (Fig. 7). Some works show that encoding spatial sequentiality with LSTMs works slightly better than 1D-CNNs and 3D-CNNs; however, they avoid the comparison with 2D-CNNs.

Deep Learning for flood hazard
Flood hazard mapping predicts the depth, velocity, and extent of floods. This application produces maps which evaluate, for a certain event, its maximum inundation (e.g., Guo et al., 2021; Berkhahn et al., 2019; Löwe et al., 2021) or how it evolves in time (e.g., Lin et al.; Zhou et al., 2021). While most studies consider the probability of different events, using return periods (e.g., Kabir et al., 2020; Guo et al., 2021), a few papers determine the water depth map for a single event (e.g., Hu et al., 2019; Chang et al., 2010). However, no papers were identified that predict the flow velocities. Since simulation results are taken as ground-truth data for training, deep learning models for flood hazard mapping are used as surrogate models in place of numerical models.

The most studied types of floods are river and urban floods. As regards the spatial scale, the models are carried out at local and regional scales. This is probably due to the computational burden of performing several simulations at larger scales to train the deep learning model. A key reason why numerical models are used is to simulate events that have never occurred or have never been observed, such as floods with high return periods. Even though observed data were not employed, they could be used in future research to corroborate the transferability of such methods. When training only on the numerical models' predictions, the deep learning models' accuracy is limited by that of the numerical models, i.e., if the numerical model does not represent reality, neither will the DL model. Thus, when the model is deployed on real data, there may also be generalization issues caused by the difference between the training and testing data. The inclusion of real measured data may thus also improve the accuracy with respect to numerical models.

Performance assessment
In flood hazard mapping, regression metrics are used to evaluate the water depth, while classification metrics are used to evaluate the flood extent, as done for flood inundation (Fig. 7). While for flood susceptibility and inundation DL models were used to improve performance, in flood hazard their main focus is to improve speed, while still maintaining reasonably low errors with respect to the numerical predictions. This is highlighted in Table 5 for all papers which provide such information.

The main identified gaps concern flood applications and usability, generalization, uncertainty, and data availability. Some other minor gaps were shown in the previous section. Based on these gaps, future research directions are proposed in Section 5.

Flood applications and usability
Deep learning has proven useful for assessing flood-prone areas from the location of past events, identifying flooded areas from remote sensing images, and working as a surrogate model for numerical simulations. Nonetheless, each of the presented maps has its own limitations. Susceptibility maps provide only qualitative results and rely on recorded flood events; therefore, limited recorded data may lead to incorrect predictions. Moreover, it is important to design an appropriate model to integrate heterogeneous environmental information. Inundation maps mostly consider real events, thus they suffer from the acquisition method's problems; for example, satellites struggle to extract information below clouded areas (e.g., Meraner et al., 2020).
Hazard maps, instead, are limited by the accuracy of the underlying numerical simulator. This leads us to explore other applications within this field that could benefit from deep learning models. In particular, we address two flood management applications, flood risk and real-time flood warning. We also define two desired types of maps, flood arrival time maps and probabilistic hazard maps. Then, we discuss dam and dike breach flood events.
Flood risk combines the probability that a certain event occurs with the associated consequences, such as economic impacts or loss of life. The expected annual loss is a common measure obtained from flood risk assessment and depends, among other factors, on the flood hazard. Some works show that ML and DL approaches can estimate flood risk at the regional scale, but do not compare their results against other methods, such as MCDA. One drawback of their approach is that the resulting maps are qualitative, while quantitative results would be preferable for risk assessment.
Real-time flood warning is another application that has not been widely addressed. It is needed by local authorities to determine when and where a flood may occur. While several papers mention real-time prediction, most can be used only after the event has occurred, since they require as input the complete hyetograph or hydrograph of the event. There are a few examples based on RNNs which could forecast floods in near real-time using sensors (Kao et al., 2021) and rainfall distributions (Dong et al., 2021). However, few situations are covered and, thus, more research should focus on filling this gap. An alternative method is to predict the rainfall in real-time and then retrieve the corresponding water depth map by using a similarity measure on a large dataset of previous simulations. However, such a solution may be challenging because of the large storage requirements. Using DL for surrogate modeling instead showed substantial speed improvements, thus allowing for real-time simulations and forecasts. Similar achievements have already been obtained for rainfall nowcasting, where deep learning models can accurately forecast near-future rainfall (e.g., Shi et al., 2015; Ravuri et al., 2021).
Arrival time maps estimate the time taken by a flood to reach a certain water depth threshold. They can encode both spatial and temporal information in the same map, so, for a practitioner, they gather in one place detailed information not only on where to intervene but also on when to execute mitigation measures. Despite these promises, they have been seldom used in flood management; consequently, they have also not been exploited with DL methods. Using DL for arrival map estimation may be a promising direction to identify critical infrastructure and set up corresponding evacuation plans in real-time. This is because DL has shown potential for surrogate modeling (see Table 5) and because arrival maps can be obtained from flood hazard maps taken over different time intervals of a flood event. This application may be particularly important for exceptional flood events, such as dike breaches and dam breaks, where little forecasting can be done until a failure initiates (Yakti et al., 2018).
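Deriving an arrival time map from a sequence of hazard maps amounts to finding, for each cell, the first time step at which the depth exceeds a threshold. A minimal sketch with a made-up stack of depth maps (the data, threshold, and function name are ours):

```python
import numpy as np

def arrival_time_map(depth_stack, times, threshold=0.1):
    """Given hazard maps over T time steps (depth_stack, shape T x H x W),
    return for each cell the first time its depth exceeds `threshold`
    (NaN where the threshold is never reached)."""
    exceeded = depth_stack > threshold
    first = exceeded.argmax(axis=0)   # index of first True per cell
    ever = exceeded.any(axis=0)       # cells that flood at all
    return np.where(ever, np.asarray(times)[first], np.nan)

t = [0.0, 1.0, 2.0]  # hours since the event start
stack = np.array([[[0.0, 0.0]], [[0.2, 0.0]], [[0.5, 0.05]]])
print(arrival_time_map(stack, t))  # [[ 1. nan]]
```

The first cell crosses the 0.1 m threshold at t = 1 h, while the second never floods, so it is marked NaN; this is how a time-resolved hazard output folds into a single actionable map.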
Probabilistic hazard mapping captures the model uncertainty related to its inputs and outputs. As pointed out by Di Baldassarre et al. (2010), uncertainties can result in deterministic maps that are only spuriously accurate, whereas probabilistic maps can account for the uncertainties by assigning a probability of flooding to each domain element. This analysis is generally carried out with probabilistic methods such as Monte Carlo simulations (e.g., Papaioannou et al., 2017). However, since these require a vast number of simulations, only simpler numerical models are used. DL models could be used as surrogates to speed up computation and improve the accuracy of the simpler models. Nonetheless, brute-force approaches, such as Monte Carlo, may require up to hundreds of thousands of simulations to obtain a satisfactory measure of the uncertainty (Liu, 2017). Thus, we need models that can intrinsically work with probabilistic input distributions of parameters.
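Combining a fast surrogate with Monte Carlo sampling can be sketched as follows. Everything here is hypothetical: the "surrogate" is a stand-in linear function, not a trained network, and the sampler draws a single uncertain rainfall intensity; the point is only the counting scheme that turns many deterministic runs into a per-cell flooding probability:

```python
import numpy as np

def probabilistic_flood_map(surrogate, param_sampler, n_sims=1000,
                            threshold=0.1):
    """Monte Carlo estimate of per-cell flooding probability: sample
    uncertain inputs, run the (fast) surrogate for each sample, and
    count how often each cell exceeds a depth threshold."""
    flooded = None
    for _ in range(n_sims):
        depth = surrogate(param_sampler())
        wet = depth > threshold
        flooded = wet.astype(float) if flooded is None else flooded + wet
    return flooded / n_sims

# Hypothetical toy surrogate: depth decreases away from cell 0 and
# scales with a sampled rainfall intensity.
rng = np.random.default_rng(0)
surrogate = lambda rain: rain * np.array([1.0, 0.5, 0.05])
sampler = lambda: rng.uniform(0.0, 1.0)
p = probabilistic_flood_map(surrogate, sampler, n_sims=2000)
print(p.shape)  # (3,): one flooding probability per cell
```

Replacing the stand-in with a trained DL surrogate makes each sample cheap enough that the hundreds of thousands of runs mentioned above become feasible.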
Dam break and dike breach floods are a relevant category of flood events that has been poorly approached with deep learning.

Generalization
Generalization refers to the capacity of a model to extrapolate from a training dataset to unseen testing data, meaning that a DL model can correctly predict scenarios unused in its development. This property is particularly relevant because training requires data, model set-up, and time. In the context of flood modelling, there are two main generalization objectives:

(i) boundary conditions, i.e., different rainfall events, and (ii) topographical changes, i.e., different case studies. However, transferring between different areas is challenging for DL models because of the differences in input and output data. New data sources can provide information previously hard to obtain, such as in urban environments. Social media information can also be used to identify flood events and flooded areas, via tweets or posted pictures (e.g., Rossi et al., 2018; Pereira et al., 2020). In this case, the information's validity and reliability must be considered before its use in real applications. Moreover, the heterogeneity of the sources of these data needs to be carefully taken into account when deploying a DL model.
Another approach is to generate artificial data to supplement scarce data. This can be done using generative adversarial networks (GANs), which create new data from a given dataset (Goodfellow et al., 2014). GANs are composed of two neural networks, named the generator and the discriminator, whose purposes are, respectively, to generate new data and to detect whether given data are real or fake. A trained GAN can produce new fake but plausible data, facilitating data augmentation, i.e., providing more training samples. Interesting applications of GANs could overcome some limitations of satellite data (Lütjens et al., 2020, 2021), predict flood maps (Hofmann and Schüttrumpf, 2021) or meteorological forecasts (Ravuri et al., 2021), and create realistic scenarios of flood disasters under projected climate change variations (Schmidt et al., 2019). GANs could also be used to generate a plausible urban drainage system or topography for cities that do not have any sewer construction plan or in areas where only low-resolution data is available (e.g., Fang et al., 2020b).
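The generator-discriminator structure can be shown with plain numpy forward passes. This is a purely structural sketch, not a trained model: both networks are single linear layers with random weights, and the adversarial training loop that pits them against each other is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W):
    """Maps latent noise z to a fake sample (a single linear layer for
    illustration; real GANs use deep networks)."""
    return z @ W

def discriminator(x, w):
    """Scores a sample with the probability of being real
    (sigmoid of a linear score)."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

latent_dim, data_dim = 4, 8
W = rng.normal(size=(latent_dim, data_dim))  # generator weights
w = rng.normal(size=data_dim)                # discriminator weights

z = rng.normal(size=(5, latent_dim))         # a batch of noise vectors
fake = generator(z, W)                       # 5 synthetic samples
scores = discriminator(fake, w)              # realness scores in (0, 1)
print(fake.shape, scores.shape)              # (5, 8) (5,)
```

During training, the generator's weights would be updated to push these scores toward 1 while the discriminator learns to push them toward 0, which is the adversarial game described above.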
However, GANs are difficult to train (Goodfellow, 2016). Variational autoencoders (VAEs) offer an alternative class of generative models that can similarly be used for data augmentation.

- Flood inundation, susceptibility, and hazard mapping were investigated using deep learning models. Flood inundation mapping mainly relies on images of floods, mostly taken via satellite. DL-based surrogate models can substantially speed up flood simulations (by up to three orders of magnitude) while maintaining sufficient accuracy.
- Most papers dealt with river and urban floods, while only a few works described applications for flash, coastal, and dam-break floods. Case studies were mainly addressed at local or regional scales, arguably due to the availability of high-resolution data. Conversely, the community should further investigate the suitability of DL models for flood applications at larger scales. Concerning the development data, we found that models producing susceptibility and inundation maps rely on the availability of real flood observations. Instead, DL-based surrogate models for hazard mapping require target data from numerical simulations.
This review outlined several knowledge gaps that can be addressed via deep learning to improve the state of the art of flood mapping. To address these gaps, we proposed possible solutions based on recent advances in fundamental machine learning research:

- Flood risk could be addressed similarly to flood susceptibility, by using physical and economic characteristics to obtain a risk map. Flood arrival time maps can provide both spatial and temporal information about a flood event and may be obtained in a similar way to flood hazard maps.
- Current deep learning models struggle to generalize across different case studies and regions, meaning that a new model has to be created each time. Moreover, they cannot account for the complex interactions with the natural and built environment. A solution to these problems is to use novel DL architectures that include meshes as learning frameworks. Mesh-based neural networks, such as graph neural networks and neural operators, can consider arbitrarily shaped domains and thus provide the required flexibility to generalize across case studies and model the effects of complex interactions.
- Physics-based deep learning provides a reliable framework for flood modelling since it considers the underlying physical equations. Probabilistic hazard mapping can take advantage of deep Gaussian processes or Bayesian neural networks to determine the uncertainties associated with the model and its inputs.
-DL necessitates large quantities of data which are difficult to collect in several areas of the world. New data sources such as camera pictures and videos, or social media information can potentially be used thanks to deep learning models.
Moreover, generative models, such as GANs and VAEs, can be employed to produce synthetic data for such data-scarce regions, based on training data collected elsewhere.
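To make the mesh-based idea concrete, the sketch below applies a single GCN-style message-passing step on a toy four-cell mesh. The edge list, node features, and weight shapes are hypothetical; the point is that the same layer applies to any mesh topology, which is what enables transfer across case studies.

```python
import numpy as np

# Toy mesh: 4 cells of an irregular flood-model grid, connected by shared edges.
# (Hypothetical example; a real model would build this from the computational mesh.)
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
n_nodes = 4

# Node features: [water depth (m), terrain elevation (m)] per cell (made up)
x = np.array([[0.8, 5.0],
              [0.3, 5.2],
              [0.0, 6.1],
              [0.0, 4.9]])

# Adjacency with self-loops, then symmetric normalization (GCN-style)
A = np.eye(n_nodes)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))    # D^{-1/2} (A + I) D^{-1/2}

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 8))      # learnable weights (untrained here)

# One message-passing layer: each cell aggregates its neighbours' features,
# so predictions respect the mesh topology of an arbitrary domain.
h = np.maximum(A_norm @ x @ W, 0.0)         # ReLU(Â X W), shape (4, 8)
```

Stacking such layers propagates information across progressively larger neighbourhoods, mimicking how water depth at one cell depends on upstream cells.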
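As one deliberately simplified route to the probabilistic predictions discussed above, the sketch below uses Monte Carlo dropout: dropout is kept active at inference time and repeated stochastic forward passes yield a predictive mean and spread. The weights and input features are random placeholders, not a trained flood model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder weights standing in for a trained depth-prediction network
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))

def predict_with_dropout(features, p=0.5):
    # Dropout stays active at inference time (Monte Carlo dropout)
    h = np.maximum(features @ W1, 0.0)      # ReLU hidden layer
    mask = rng.random(h.shape) > p          # randomly drop hidden units
    h = h * mask / (1.0 - p)                # inverted-dropout rescaling
    return (h @ W2).ravel()

# Hypothetical input: e.g. [rainfall intensity, slope, elevation]
features = np.array([[1.2, 0.4, 7.5]])
samples = np.array([predict_with_dropout(features) for _ in range(200)])

mean_depth = samples.mean()   # point prediction
std_depth = samples.std()     # crude predictive-uncertainty estimate
```

The spread of the sampled predictions approximates model uncertainty; deep Gaussian processes or full Bayesian neural networks would provide better-calibrated, but more expensive, alternatives.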
We expect deep learning to be a promising tool to improve and speed up flood mapping. Nonetheless, deep learning models are black-box models, meaning that the underlying operations are unknown. Thus, their deployment in real emergencies must be approached cautiously. As deep learning for flood mapping is still novel, we advise that its use in critical situations always be validated by traditional models and expert knowledge, until robust and corroborated models are available. The above concern highlights the main challenge DL models for flood management need to face. However, DL models are still in their infancy and carry a large potential to aid researchers in many applications, especially where traditional models cannot provide sufficient accuracy or speed. In particular, deep learning-based flood mapping approaches could provide an added value for regions with limited data or limited resources to invest in setting up time-consuming hydraulic models.

Summary of selected reviewed studies:

Amini (2010): High-resolution images are classified into five categories with an MLP. The study area is located in an Iranian city.

Chu et al. (2020): An MLP is used to estimate the water depth for a river reach in Australia. The dataset is generated using a 2D hydrodynamic model, TUFLOW, and considers ten flood events.

Lin et al. (2020b): An MLP is used to estimate the water depth and extent for a river reach in a German city. The dataset is generated using a 2D hydrodynamic model, HEC-RAS, and considers 180 flood events.

Lin et al. (2020a): An MLP is used to forecast water depth and flood extent for a river reach in a German city. The dataset is the same as in Lin et al. (2020b).

Löwe et al. (2021): An encoder-decoder CNN is used for flood mapping in an urban area in Denmark. The dataset is generated using a 2D hydrodynamic model, MIKE 21, and considers 53 flood events.

Hu et al. (2019): An LSTM model is developed to simulate a tsunami in Japan. The model is trained in a lower-dimensional space to reduce the problem's complexity. The dataset is obtained using a 3D hydrodynamic model, Fluidity, and consists of 100 snapshots of a modelled tsunami event.
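Several of the studies above follow the same surrogate-modelling recipe: generate target water depths with a hydrodynamic model, then fit an MLP to reproduce them far faster. The sketch below mimics that workflow with a toy analytic "simulator" standing in for TUFLOW/HEC-RAS output; the architecture, learning rate, and depth-discharge relation are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_depth(q):
    # Stand-in for a hydrodynamic model run: depth (m) vs. peak discharge (m^3/s).
    # A real surrogate would be trained on e.g. TUFLOW or HEC-RAS simulations.
    return 0.5 * np.sqrt(q)

q_train = rng.uniform(10.0, 500.0, size=(200, 1))
d_train = simulated_depth(q_train)

# Normalize inputs and targets for stable training
q_mu, q_sd = q_train.mean(), q_train.std()
d_mu, d_sd = d_train.mean(), d_train.std()
X, Y = (q_train - q_mu) / q_sd, (d_train - d_mu) / d_sd

# One-hidden-layer MLP trained by plain full-batch gradient descent on MSE
W1 = rng.normal(scale=0.5, size=(1, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)
    err = (H @ W2 + b2) - Y                     # prediction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(0)
    dH = (err @ W2.T) * (1 - H**2)              # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def surrogate(q):
    # Fast replacement for the numerical model, in original units
    x = (np.atleast_2d(q) - q_mu) / q_sd
    y = np.tanh(x @ W1 + b1) @ W2 + b2
    return (y * d_sd + d_mu).ravel()
```

Once trained, `surrogate` evaluates in microseconds, which is the source of the orders-of-magnitude speed-ups reported for DL-based hazard mapping.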
Author contributions. All authors contributed to conceptualizing the paper and its contents. RB and RT developed the structure of the paper.

RB wrote the paper, produced all figures and tables, and formatted the article. RT, EI, and SNJ reviewed, revised, and supervised the progress of the paper.

Competing interests. No competing interests are present.
Acknowledgements. This work is supported by the TU Delft AI Labs programme.