Scalable Flood Level Trend Monitoring with Surveillance Cameras using a Deep Convolutional Neural Network

In many countries, urban flooding due to local, intense rainfall is expected to become more frequent because of climate change and urbanization. Cities trying to adapt to this growing risk are challenged by a chronic lack of surface flooding data that is needed for flood risk assessment and planning. In this work, we propose a new approach that exploits existing surveillance camera systems to provide qualitative flood level trend information at scale. The approach uses a deep convolutional neural network (DCNN) to detect floodwater in surveillance footage and a novel qualitative flood index (SOFI) as a proxy for water level fluctuations visible from a surveillance camera’s viewpoint. To demonstrate the approach, we trained the DCNN on 1218 flooding images collected from the Internet and applied it to six surveillance videos representing different flooding and lighting conditions. The SOFI signal obtained from the videos had on average a 75% correlation to the actual water level fluctuation. By retraining the DCNN with a few frames from a given video, the correlation increased to 85% on average. The results confirm that the approach is versatile, with the potential to be applied to a variety of surveillance camera models and flooding situations without the need for on-site camera calibration. Thanks to this flexibility, this approach could be a cheap and highly scalable alternative to conventional sensing methods.


The need for urban pluvial flood monitoring data
Urban pluvial floods are floods caused by intense local rainfall in urban catchments, where drainage systems are usually not designed to cope with storm events of more than a 10-year return period. Although the full impact of such flood events is difficult to gauge because of reporting and knowledge gaps (Paprotny et al., 2018; van Riel, 2011), some studies estimate the societal cost of small but frequent urban pluvial floods to be comparable to the cost of large, infrequent fluvial flooding events (Jiang et al., 2018b; ten Veldhuis, 2011). Additionally, it is generally acknowledged that the frequency of urban pluvial floods will increase under the driving forces of climate change and urbanization (Skougaard Kaspersen et al., 2017; Zahnt et al., 2018).
To cope with urban pluvial flood risk, city planners must understand long-term flooding trends, design appropriate flood mitigation solutions in the medium term, and provide flood alerts in the short term. Numerical flood modelling is a widely used tool for all these tasks, but a certain amount of data is needed. Data pertaining to drainage infrastructure, land use, and elevation is required to construct a model, and rainfall data is required to test the model on past rain events. Additionally, flood monitoring data allows for model calibration, which is essential for improving the accuracy of urban drainage models (Tscheikner-Gratl et al., 2016). However, conventional sensors are ill suited to urban environments, where vehicles can disturb the flow and vandalism is a high risk. Similarly, remote sensing is not able to provide data with sufficient spatiotemporal resolution. The lack of monitoring methods and ensuing data scarcity are frequently decried in the urban pluvial flood modelling community (Gaitan et al., 2016; Hunter et al., 2008; El Kadi Abderrezzak et al., 2009; Leandro et al., 2009). In this context, researchers and practitioners have turned to alternative sources of data such as surveillance footage (Liu et al., 2015; Lv et al., 2018), ultrasonic-infrared sensor combinations (Mousa et al., 2016), field surveys (Kim et al., 2014) and first-hand reports (Kim et al., 2014; Yu et al., 2016). Although quantitative information (e.g. water level) is commonly sought, studies show that even qualitative information, e.g. binary information (Wani et al., 2017) or class information (van Meerveld et al., 2017), is useful for calibrating hydraulic and hydrological models.
Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2018-570. Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 15 February 2019. © Author(s) 2019. CC BY 4.0 License.

Surveillance cameras as a data source
Surveillance footage has several advantages when used for flood monitoring. First, many municipalities have already invested in a network of surveillance cameras. In the cities investigated by Goold et al. (2010), these networks usually have fifty to several hundred cameras. In certain cities, however, camera systems operated by institutions are also integrated in the municipal monitoring network. For example, the Command and Control Centre for the London Police is reported to have access to 60'000 cameras (Goold et al., 2010), and the police in Paris have access to 10'000 cameras operated by partners (Sperber et al., 2013). The second advantage of surveillance cameras is their high reliability, since their utility for traffic surveillance and crime reduction depends on permanent uptime.
However, the use of surveillance footage for flood monitoring has complications. First, camera placement is generally controlled by outside parties for security purposes, so critical flooding locations may be only partially visible in the footage or even completely missed. Second, the personal privacy of individuals visible in the footage must be protected. Finally, the interpretation of surveillance footage into a signal that can be assimilated into a flood model is not trivial.

Automatic water level monitoring with surveillance images
While manual reading of water level from surveillance images is possible (e.g. in the study of Liu et al. (2015)), it is both prohibitively labor-intensive at scale and potentially critical from a privacy perspective. Automatic image analysis helps overcome these hurdles and has already been explored in research. The following publications provide the current state of the art of automatic water level estimation from ground-level images.
In the work of Lo et al. (2015), video frames are segmented into a number of visually distinct areas using a graph-based approach. The area corresponding to water is identified thanks to an operator-provided "seed", and the water level is qualitatively assessed by comparing the water area to virtual markers placed in the image by the operator. With a more camera-specific solution, Sakaino (2016) estimates water levels with a supervised histogram-based approach which assumes a straight water line on a wall visible in the footage. Similarly, Kim et al. (2011) used a ruler in the camera's field of view as a reference for the water level measurement. Although these methods work well, they rely on prior knowledge about the images to be analyzed and may be challenging to apply to a large number of cameras.
A more modern approach for image-based flood level estimation has been proposed by Jiang et al. (2018a). The authors use a deep convolutional neural network to extract image features and then apply a regression to infer water level. Although the results are positive, the approach requires that the neural network and regression be retrained for each camera. Thus, the method is probably most valuable for providing redundancy to existing water level readings and not as a scalable flood monitoring solution.

Objective of the present work
In this work, we propose a novel and highly scalable approach to automatically extract local flood level fluctuations from surveillance footage. By proposing this approach, we aim to provide a tool that can exploit existing surveillance infrastructure to furnish much-needed flood information to urban flood modelers and decision-makers. By making scalability a priority, we hope to facilitate adoption of the tool by practitioners, especially in cities where extensive surveillance camera systems are already in place.

Materials and methods
Our approach consists of a two-step processing pipeline that combines automatic image analysis with data aggregation (Fig. 1). In a first step, floodwater is segmented in individual video frames with a deep convolutional neural network (DCNN). The segmented frames are then summarized with an index (SOFI) that qualifies the visible extent and thereby the depth of the water over time. We evaluate the performance of this approach using footage from surveillance cameras during various flood events. Additionally, we investigate how the data used to train the DCNN influences both segmentation performance and the information content of the SOFI.

Image segmentation with deep convolutional neural networks
Semantic segmentation is the task of annotating each pixel in an image according to a predefined taxonomy. The most recent advances in image segmentation have been made with DCNNs (He et al., 2017), so it is of value to apply this powerful tool to the problem of flood segmentation. DCNNs are a subset of artificial neural networks (ANN), machine learning models with a structure that mimics the structure of neurons in the brain. In the case of DCNNs, images are interpreted through consecutive convolutional (matrix-like) layers that extract and combine information at varying levels of abstraction. Although the concept of DCNNs originated in the 1980s (Fukushima, 1980), their success for non-trivial problems requires large training sets and computational resources that have only become available relatively recently. An important breakthrough in DCNN development was the fully convolutional network (FCN) introduced by Long et al. (2015), in which the fully connected layers responsible for generating class labels are also formulated as convolutional layers, thereby providing spatially explicit label maps. However, FCN suffered from a loss of resolution. To solve this issue, Noh et al. (2015) combined FCN with a "deconvolution network", a network that predates FCN (Zeiler et al., 2011) and consists of upsampling and unpooling layers.

Water segmentation with U-net
The DCNN architecture used for water segmentation in this work is that of U-net (Ronneberger et al., 2015). U-net builds on the FCN architecture, but differs in that the decoding layers have as many features as their respective encoding layers, which allows the network to propagate context and texture information to the final layers. Additionally, U-net implements "skip" connections to preserve details and object boundaries, by carrying information directly from the encoding to the decoding layers. To implement the DCNN, we built on an open source implementation of U-net (Pröve, 2017) that uses Keras (Chollet and others, 2015) to interface with the TensorFlow library (Abadi et al., 2016). After exploring a range of hyperparameter values (layer depth, feature size, etc.), we found the following network structure (Fig. 2) to have the best combination of performance and generalization potential for the flood segmentation problem. As input, the network takes color images with a resolution of 512x512 pixels. The network is composed of five encoding and five decoding blocks, each block consisting of two 3x3 convolutions. A residual connection around the two convolutions was added to improve the learning capacity of the network (He et al., 2016). A batch normalization layer between each convolution accelerates the training and makes the training performance less dependent on the initial weights (Ioffe and Szegedy, 2015).
On the encoding side, the blocks end with a 2x2 max pooling operation, while on the decoding side, blocks start with a 2D upsampling (or "up-convolution") operation. The skip connections between the encoding and decoding blocks are implemented by taking the final feature map of each encoding block and concatenating it to the first map of the corresponding decoding block. The number of features in the first layer is 16, and the number of features is doubled with increasing layer depth. Additionally, dropout regularization is added between the two deepest convolution layers of the network in order to avoid over-fitting.
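As a rough consistency check of the encoder described above, the following sketch lists the size of the feature map produced by the convolutions of each encoding block. It assumes "same"-padded convolutions, so that only the 2x2 max pooling changes the spatial size; the padding is not stated in the text, and the function name `encoder_shapes` is purely illustrative.

```python
def encoder_shapes(input_size=512, first_features=16, n_blocks=5):
    """Return (height, width, features) of the map produced by each encoding block.

    Assumption: 'same'-padded 3x3 convolutions, so the spatial size only
    changes through the 2x2 max pooling that follows a block.
    """
    shapes = []
    size, features = input_size, first_features
    for _ in range(n_blocks):
        shapes.append((size, size, features))  # output of the block's two convolutions
        size //= 2       # 2x2 max pooling halves height and width
        features *= 2    # the number of features doubles with depth
    return shapes


for block, shape in enumerate(encoder_shapes(), start=1):
    print(f"encoding block {block}: {shape}")
```

Under these assumptions, the deepest encoding block works on maps that are 16 times smaller in each spatial dimension than the input, which is why the skip connections are needed to recover fine object boundaries in the decoder.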

Deep convolutional neural network training strategies
The collection and labelling of training data is one of the most costly and time-consuming aspects of training DCNNs. For the specific application of flood detection in CCTV and webcam images, training images are particularly rare. Therefore, in this study we evaluated the effectiveness of two strategies for increasing segmentation performance with few training images.
Given the relative rarity of flooding images from surveillance cameras, we used a collection of 1214 images that were collected from the Internet and manually labeled (Chaudhary, 2018). Since almost all of the images in the dataset are subject to copyright, we provide a sample of images in the public domain that are representative of the dataset in Fig. 3. These images have two differences as compared to typical surveillance camera images. First, the image quality is generally better in terms of resolution and color reproduction. Second, the pictures almost exclusively depict extensive flooding where most of the ground is covered by water. To provide examples of dry ground, 300 images of street scenery without flooding from the Cityscapes dataset (Cordts et al., 2015) were added to the 1214 Internet images, forming a pool of reference images. These reference images were used to train a "Basic" version of the DCNN (80% for training, 10% for validation, and 10% for testing). Conventional augmentation was applied to the images as they were fed into the network: a random displacement of up to 20% and a random horizontal flip. In this work, we considered two strategies for improving the flood segmentation performance of the "Basic" training strategy (Table 1).
In the "Augmented" strategy, the same images as for the "Basic" training strategy were used but with additional augmentation steps that degraded image quality to the level of typical surveillance footage: Gaussian blur, color desaturation, contrast modification, brightness alteration, and resolution reduction.
In the "Fine-tuned" strategy, we performed transfer learning to adapt the DCNN to specific surveillance videos. This was done by retraining the "Augmented" network with seven manually labeled frames from each video.¹ The retraining was performed in two steps. First, only the weights of the last deconvolution block were released and retrained. Then, the rest of the weights were also released and retrained. This process resulted in a distinct and specialized network for each footage sequence. While it was comparatively the most laborious strategy, it allowed the networks to learn specific characteristics of a given surveillance camera that may not have been represented in the reference images.
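The quality-degrading steps of the "Augmented" strategy could look roughly like the sketch below, which covers desaturation, contrast and brightness changes, and resolution reduction (Gaussian blur is left out for brevity). It is not the code used in the study: images are plain nested lists of RGB tuples in [0, 1] to keep the example dependency-free, and all function names and parameter ranges are hypothetical.

```python
import random


def desaturate(image, amount):
    """Blend every pixel toward its gray value (amount=0: unchanged, 1: grayscale)."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            gray = 0.299 * r + 0.587 * g + 0.114 * b  # common luma weights
            new_row.append(tuple(c + amount * (gray - c) for c in (r, g, b)))
        out.append(new_row)
    return out


def adjust_contrast_brightness(image, contrast, brightness):
    """Scale values around mid-gray, shift them, and clip to [0, 1]."""
    def transform(c):
        return min(1.0, max(0.0, (c - 0.5) * contrast + 0.5 + brightness))
    return [[tuple(transform(c) for c in pixel) for pixel in row] for row in image]


def reduce_resolution(image, factor):
    """Keep one pixel per factor x factor cell and repeat it, preserving image size."""
    return [[image[(y // factor) * factor][(x // factor) * factor]
             for x in range(len(image[0]))]
            for y in range(len(image))]


def degrade(image, rng):
    """Apply the three degradations with randomly drawn (hypothetical) strengths."""
    image = desaturate(image, rng.uniform(0.0, 0.8))
    image = adjust_contrast_brightness(image, rng.uniform(0.6, 1.0),
                                       rng.uniform(-0.2, 0.2))
    return reduce_resolution(image, rng.choice([1, 2, 4]))


# Tiny 2x2 RGB example with values in [0, 1]
image = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
         [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
degraded = degrade(image, random.Random(0))
```

Because these transformations change pixel values but not pixel positions or image size, the manually drawn water labels remain valid for the degraded images.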
¹ Manual labelling takes around 2 minutes per image for someone with a little experience.

[...] (Kingma and Ba, 2015). The Dice coefficient served as the loss function, defined after Zou et al. (2004). The DCNN was trained on an Nvidia Titan X (Pascal) 12 GB GPU. The "Basic" and "Augmented" training strategies each took two to three hours, and the fine-tuning process required an additional 5 minutes per video.
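For readers unfamiliar with Dice-based losses, a minimal sketch follows. It uses the common "soft" formulation, 2·Σ(p·g) / (Σp + Σg), computed on predicted water probabilities p and binary labels g, and takes one minus this value as the loss. Both the subtraction from one and the `smooth` term are assumptions of this sketch and are not taken from the study's code.

```python
def dice_coefficient(pred, truth, smooth=1.0):
    """Soft Dice coefficient between flattened predictions and binary labels.

    pred  : predicted water probabilities in [0, 1]
    truth : ground truth labels (0 = background, 1 = water)
    smooth: small constant avoiding division by zero for empty masks (assumed)
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    return (2.0 * intersection + smooth) / (sum(pred) + sum(truth) + smooth)


def dice_loss(pred, truth, smooth=1.0):
    """Loss that decreases as the overlap between prediction and label grows."""
    return 1.0 - dice_coefficient(pred, truth, smooth)


# A perfect prediction gives a loss of zero
print(dice_loss([1.0, 0.0, 1.0, 1.0], [1, 0, 1, 1]))
```

Unlike a per-pixel accuracy, this overlap-based measure is not dominated by the background class when water covers only a small part of the image.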

Static Observer Flooding Index
The Static Observer Flooding Index (SOFI) is introduced in this work as a dimensionless proxy for water level fluctuations that can be extracted from segmented images of stationary surveillance cameras. The SOFI signal is computed as

$$\mathrm{SOFI} = \frac{\#\,\mathrm{Pixels}_{\mathrm{Flooded}}}{\#\,\mathrm{Pixels}_{\mathrm{Total}}} \qquad (1)$$

and corresponds to the visible area of the flooding as seen by a stationary observer. Its value can vary between 0% (no flooding visible) and 100% (only flooding visible). When this index is evaluated at multiple consecutive moments in time, the variation of its value provides information about fluctuations of the actual water level under the assumption that the camera remains static and that the view of the flooding is not overly obstructed by moving objects or people. In principle, objects and people will move in and out of the image at a higher frequency than the water level fluctuations, so the influence of such obstructions should be limited to an additional noise that can be filtered out. Nevertheless, situations may arise in which the assumption does not hold, for example if an obstruction is permanently removed from the scene during a flood event.
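The index itself is simple to compute. The sketch below evaluates SOFI for binary masks stored as rows of 0/1 values and applies a running median to a SOFI time series as one possible way of suppressing short-lived obstructions. The choice of a median filter and its window length are assumptions made for illustration; the text does not prescribe a particular filter.

```python
from statistics import median


def sofi(mask):
    """Fraction of pixels labelled as water (1) in a binary segmentation mask."""
    total = sum(len(row) for row in mask)
    flooded = sum(sum(row) for row in mask)
    return flooded / total


def rolling_median(signal, window=5):
    """Centered running median; the window shrinks at both ends of the signal."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


# Three hypothetical 2x4 masks from consecutive frames of a static camera
frames = [
    [[0, 0, 0, 0], [0, 0, 1, 1]],
    [[0, 0, 0, 0], [1, 1, 1, 1]],
    [[0, 0, 1, 1], [1, 1, 1, 1]],
]
signal = [sofi(frame) for frame in frames]
print(signal)
print(rolling_median(signal, window=3))
```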
In certain cases, it may make sense to restrict the computation of the SOFI to a region of interest (ROI) of the image. For example, if the image contains more than one hydraulic process, such as accumulation in one part of the image and flow in the other, a ROI can be defined so that the SOFI only reflects the evolution of the accumulation process. The ROI can also be defined so as to exclude areas in which water segmentation is problematic due to unfavorable lighting conditions or visual obstructions, for example. Finally, the ROI can also be chosen over a region of the image in which changes of water level are going to be most visible, e.g. over a vertical wall. In this study, the ROI was implemented by means of a rectangular selection made by the authors according to these criteria. To gauge the effectiveness of this measure, performance was assessed both with and without user-defined ROI.
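Restricting the index to a rectangular ROI only changes which pixels are counted. In the sketch below, the ROI is given as a hypothetical `(top, left, bottom, right)` tuple of pixel indices with exclusive lower and right bounds; this convention is an assumption of the example, not the format used in the study.

```python
def sofi_in_roi(mask, roi=None):
    """SOFI of a binary mask (rows of 0/1), optionally limited to a rectangle.

    roi: (top, left, bottom, right) in pixel indices, bottom/right exclusive.
         None means the full image is used.
    """
    if roi is None:
        rows = mask
    else:
        top, left, bottom, right = roi
        rows = [row[left:right] for row in mask[top:bottom]]
    total = sum(len(row) for row in rows)
    if total == 0:
        raise ValueError("ROI contains no pixels")
    return sum(sum(row) for row in rows) / total


mask = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
print(sofi_in_roi(mask))                    # full image
print(sofi_in_roi(mask, roi=(1, 2, 3, 4)))  # lower-right 2x2 rectangle
```

In this toy mask the ROI is already fully flooded, so its SOFI would saturate and stop responding to a further rise in water level, which illustrates why the placement of the rectangle matters.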

Surveillance footage
Six videos depicting flooding were used to assess the performance of the proposed flood monitoring approach. Table 2 provides the characteristics of these videos, which provide a diverse and realistic range of environmental conditions and image qualities.
[...] sufficient for the present study since this study only investigates the ability of SOFI to predict the water level trend, and not the actual water level.

Flood segmentation performance
Three images from each video, representing low, medium, and high flooding conditions, were used to assess segmentation performance. Image segmentation is a common classification task that is often evaluated with the mean intersection over union ratio (IoU), also known as the Jaccard index (Levandowsky and Winter, 1971). This metric is applied by running the DCNN on a series of testing images that were not seen during training, and comparing the segmentation result S to a manually annotated ground truth G. The mean IoU is then computed as

$$\overline{\mathrm{IoU}} = \frac{1}{N}\sum_{i=1}^{N}\frac{|S_i \cap G_i|}{|S_i \cup G_i|} \qquad (2)$$

where $N$ is the number of testing images and $S_i$ or $G_i$ is the water-covered area in a segmented image or corresponding ground truth image, respectively. The index varies between 0% for complete misclassification and 100% for perfect classification.
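The mean IoU can be computed directly from pairs of binary masks, as in the following sketch. Masks are rows of 0/1 values; returning 1.0 when both masks are empty is a convention chosen here for illustration and is not specified in the text.

```python
def iou(segmented, truth):
    """Intersection over union of the water class for two binary masks."""
    intersection = 0
    union = 0
    for s_row, g_row in zip(segmented, truth):
        for s, g in zip(s_row, g_row):
            intersection += 1 if (s and g) else 0
            union += 1 if (s or g) else 0
    # Assumed convention: two empty masks count as a perfect match
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (segmented, ground truth) mask pairs."""
    return sum(iou(s, g) for s, g in pairs) / len(pairs)


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # half of the union overlaps
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # perfect segmentation
]
print(mean_iou(pairs))
```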

Performance of SOFI as a proxy for the water level trend
To evaluate whether SOFI can be considered a proxy for real water level trends, one can assess to what extent the relationship between the two signals is monotonically increasing. This quality can be evaluated with the Spearman rank-order correlation coefficient (Spearman, 1904), which is used to measure the degree of association between two synchronous signals. Importantly, it does not assume any other (e.g. linear) relationship between the two signals. To compute the coefficient, the rank of each signal value must be computed relative to its respective signal. For signals in which the same value can appear multiple times (tied ranks), the Spearman rank-order correlation coefficient is given by

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \qquad (3)$$

where $x_i$ and $y_i$ are the ranks of the two signals for time step $i$, and where $\bar{x}$ and $\bar{y}$ are the average ranks of the two signals. In the current study, the reference signal for the water level trend was obtained either from an in-situ sensor or by visual inspection of the surveillance footage, as described in Section 2.3.1.

The "Basic" DCNN was able to detect water in certain cases, but also committed large segmentation errors in the cases of the FloodXCam1, Garage, and Park videos. In the Parking lot video, segmentation appears quite successful, which could be due to the scene being visually similar to the images with which the "Basic" network was trained.
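The Spearman rank-order correlation coefficient introduced in this section can be sketched in a few lines: tied values receive the average of the ranks they span, and the coefficient is then the Pearson correlation of the two rank series. Function names are illustrative, and the code assumes that neither signal is constant, since the denominator would otherwise be zero.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks in which tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient of two equally long, synchronous signals."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return covariance / sqrt(var_a * var_b)


# Hypothetical example: SOFI rises monotonically but non-linearly with water level
water_level = [0, 10, 20, 30, 40]
sofi_signal = [0.00, 0.30, 0.35, 0.37, 0.60]
print(spearman(water_level, sofi_signal))
```

Since only ranks enter the computation, any strictly increasing relationship between water level and SOFI yields a coefficient of one, no matter how non-linear it is.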

Automatic flood water segmentation
Compared to the "Basic" DCNN, the "Augmented" DCNN provides a visible improvement in most cases. The case of FloodXCam1 is an exception since the DCNN successfully segmented the shallow water flowing on the ground (which was not classified as flooding in the human labels), but did not detect the water ponding in the upper right of the image. This error is fixed in the "Fine-tuned" DCNN for this video thanks to the use of additional training images from the FloodXCam1 video.
Figure 5 shows the segmentation performance of the DCNNs measured by the IoU, both for the full image and within the defined ROI. The "Augmented" DCNN improves performance for all videos except for the two FloodX videos. In the case of the Park video, the improvement of IoU for the full image is around 30 percentage points. For the two FloodX videos, however, segmentation seems to suffer slightly under the "Augmented" network, possibly because the augmentation transformations made the training images less similar to these videos rather than more similar. For these two videos, improvement is only achieved thanks to fine-tuning, which proved beneficial for all videos by providing IoUs higher than 90% on average. Figure 5 also shows that within the "expert-defined" ROIs, segmentation performance is generally worse. We conclude that a human is generally not able to identify "difficult" regions of the image a priori, although excluding such regions was one of the original reasons for defining an ROI.

Flood level trend extraction
After the frames of a video are segmented, the SOFI is computed for each frame, providing information about the temporal evolution of the visible flood extent. In Fig. 6, the case of video FloodXCam5 is a clear example of how the SOFI reflects changes in the actual water level. Comparing the SOFI signals to the measured water level, it is evident that a correlation exists, although the relationship does not appear linear. For all DCNN training strategies, the trend of the SOFI signal from the ROI (in red) is easier to visually identify than the trend of the SOFI computed from the whole image (in black), which is advantageous if the signal is to be visually interpreted. However, the SOFI signal from the ROI is also noisier, and does not capture the first flood event occurring at 12:48.
In Fig. 6, we also see that a systematic segmentation error is committed in that water is always falsely detected on the ground between the two events. This example illustrates why our approach focuses on the trend of the SOFI signal and not its absolute values, which are more sensitive to systematic errors.
In Fig. 7, the relationship between the SOFI and the water level is further explored for both FloodX videos. This figure shows how the relationship between SOFI and the water level can be non-linear, a consequence of the topography in which the flooding occurred. For example, in FloodXCam5 a large area of the image is rapidly segmented as water starts covering the floor of the basement, causing an almost vertical segment on the left side of the scatter plot. For the SOFI computed from the full images, we also see systematic and time-variant errors, resulting in portions of the data having a larger internal correlation that are visible as strands of points in Fig. 7.
Generally, it appears that the use of an ROI (red) leads to a stronger association between water level and SOFI. However, the SOFI signals from the ROIs also contain more noise than the SOFI signal derived from the full image. Additionally, in FloodXCam5, it seems that the ROI was poorly selected, resulting in lower sensitivity of SOFI to the water level up to a depth of around 100 mm.
Figure 8 shows the relationship between SOFI and the visually estimated flooding intensity in the remaining videos, for which no in-situ water level measurement is available. In this figure, the value of the "Augmented" and "Fine-tuned" networks appears in the progressive reduction of noise in the SOFI signal.
Two exceptional cases in Fig. 8 need to be explained in more detail. First, the SOFI signal for the River footage suddenly appears to be arbitrary at the highest water level. This is due to intermittent submersion of the camera by the floodwater that leads to gross segmentation errors. The Garage video is also exceptional; flooding caused objects to float around in the garage, causing constant changes of the visible inundated area and thus noise in the SOFI signal.
Figure 9 shows the Spearman correlation coefficients between the SOFI and the flooding intensity for each video and each training strategy. This figure shows that using the "Augmented" training strategy, the Spearman correlation coefficient for the full image reaches 75% on average, while for the "Fine-tuned" training strategy the average correlation coefficient reaches around 85%.
We draw two general conclusions from the results shown in Fig. 9. First, defining an ROI does not consistently improve the ability of the SOFI signal to reproduce flood trends. This could be due to the poorer segmentation performance within the ROIs (Fig. 5), thereby introducing noise in the water level-SOFI relationship (Fig. 7 and Fig. 8). Second, the "Fine-tuned" networks generally improve the correlation of SOFI with the water level trend, with the exception of the River video. As made visible in Fig. 8, the submersion of the camera leads to frequent outliers that become more distinct from the rest of the signal after fine-tuning, leading to a lower correlation with the water level trend.

The vertical segments in the All videos category represent the standard deviation of values. The results show that both the "Augmented" and "Fine-tuned" networks improve the correlation of SOFI with the water level, although performance is still varied.

SOFI as a scalable and robust approach for qualitative water level sensing
Compared to alternative methods for water level monitoring, our approach is less ambitious in the type of information it aims to provide (only water level fluctuation and not absolute water level). This weakness is at least partially compensated for by an increased scope of applicability, as it is the only image-based monitoring approach designed to provide water level information without needing to be calibrated to each camera. This quality of scalability was demonstrated by applying the "Augmented" DCNN to six videos not seen during training, from which information was extracted despite complex environments, moving objects, and bad lighting conditions.
An additional advantage of SOFI over other methods is that it is intrinsically robust. The use of areal integration to quantify flood intensity makes the index less sensitive to small segmentation errors, which could be problematic if small virtual markers are used (Lo et al., 2015). Also, since we only attempt to obtain trend information, systematic segmentation errors (e.g. misclassification at the water-background boundary) should only slightly disturb the result of the analysis. This aspect of SOFI is well illustrated in this study by the Park video, for which the SOFI signal had a high correlation to the water level trend despite mediocre segmentation performance.
When questioning the practical utility of SOFI in light of its qualitative nature and noisiness, two aspects must be considered. First, SOFI aims to provide information in the context of urban pluvial flood events, for which monitoring data is admittedly difficult to obtain. The studies cited in the introduction have shown that in situations of data scarcity, even qualitative information can be useful in improving model accuracy. Second, thanks to the scalability of SOFI, one should be able to apply it to large surveillance networks or retroactively to archived footage at a reasonable cost. Nevertheless, future research should assess the actual value of the information provided by SOFI for the validation and calibration of urban flood models. Additionally, possible methods to de-noise the SOFI signal and quantify its reliability should be investigated.

Challenges of flood segmentation in surveillance camera images
Several factors make floodwater segmentation in surveillance footage a complex task, namely (i) water surface movement can cause a range of different wave structures, and water color itself is variable, (ii) water reflects light that falls on it, even in a mirror-like fashion if the water surface is still, and (iii) surveillance cameras tend to have low color fidelity, dynamic range, resolution, and sharpness.
Due to the complexity of the task, one can expect that more training images would be required than for a typical segmentation problem. In our results, the need for additional training data is suggested by the high variability between segmentation results for the different videos. In particular, the variability suggests that the training images were not fully representative of the testing images. Thus, it is probable that the segmentation performance could be substantially improved if a larger and more diverse training set were available. Despite the potential for improvement, this study demonstrates the feasibility of the proposed approach. Compared to other studies that use handcrafted spatial texture features to segment water, the use of a DCNN is possibly a disadvantage due to its need for data, although further research is needed to compare the two approaches at this level.

Degradation of training image quality improves segmentation of surveillance images
In this study, the training images available were of higher quality than typical images originating from surveillance cameras, a discrepancy that was expected to limit the performance of the DCNN on surveillance images. Therefore, in the "Augmented" training strategy, the augmentation step included transformations that lowered the quality of the training images, making them more similar to surveillance images. The results from the six videos used in this study confirm that the artificial degradation of training image quality not only improves segmentation performance in surveillance footage but also increases the correlation of SOFI with the actual water level trend.
While in most videos the improvement was clear, no improvement was observed for the two FloodX videos. These two videos stand out in terms of low image quality, location of water in the upper part of the image, and a different setting surrounded by concrete walls. We therefore see a need to investigate such failure cases in order to improve the training data collection and augmentation steps.

Fine-tuning of the DCNN to specific cameras
In situations where segmentation performance is critical, one can fine-tune a general DCNN to a specific surveillance camera. Even with very few additional training images, we find that the segmentation performance and correlation of SOFI with water level trends both improve thanks to fine-tuning. Despite this result, it should be kept in mind that the fine-tuned DCNNs also lose some generality and may, for example, have issues when lighting conditions are very different from those present in the images for fine-tuning.
Our recommendation for fine-tuning is that care should be taken in creating a set of training images that is roughly as diverse as the situations in which the fine-tuned DCNN will be used. Additionally, although we performed fine-tuning with very small sets of seven images per video, fine-tuning performance could be further increased by using more images.

Regions of interest (ROI) do not deliver expected value
The definition of ROIs for SOFI computation, motivated by the possibility of omitting difficult portions of the images and focusing on more information-rich portions, proved unsuccessful. Not only was it difficult for the human "expert" to foresee what area of the image met the above criteria, but a systematic increase of noise in the SOFI signal was observed within the ROI. We therefore do not recommend that ROIs be defined systematically for all cameras but only in cases where multiple hydraulic processes are visible in the image and need to be distinguished.

Conclusions
In this study, we explored the potential of using a deep convolutional neural network (DCNN) and a simple but novel index (SOFI) to obtain flood level trend information from generic surveillance cameras. The results of our study strongly suggest that qualitative flood level information can indeed be extracted automatically and universally from any static camera, although we see the need for many training images to cover the range of appearances that floodwater can take. To compensate for the limited number of training images available in this study, we found that degrading image quality during training improved segmentation performance by approximately 10% (IoU) on low-quality surveillance images, and fine-tuning the DCNN to a specific video further improved performance, even with as few as seven manually labeled images. In our results, the SOFI signal from the camera-independent DCNN correlated with water level trends with a Spearman rank correlation coefficient of 75% on average. Nevertheless, the signals obtained were often noisy, and the impact of this noise on flood model calibration should be assessed in future research. In the best case, the approach proposed in this work will allow modelers to obtain part of the urban flood monitoring information they need to improve their models' accuracy.

Code availability
The code used in this work for creating, training, and evaluating the DCNN, as well as extracting the SOFI and plotting results, can be found in the following archive: https://doi.org/10.25678/0000BB.

Data availability
The licenses to the images and video data used in this work are held by third parties and cannot be republished by the authors.
We encourage the interested reader to refer to the references provided for the individual data sources. The DCNN weights trained for each training strategy are available in the following archive: https://doi.org/10.25678/0000BB.