Articles | Volume 30, issue 7
https://doi.org/10.5194/hess-30-2183-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Community-scale urban flood monitoring through fusion of time-lapse imagery, terrestrial lidar, and remote sensing data
Download
- Final revised paper (published on 17 Apr 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 29 Sep 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3962', Seyed Mohammad Hassan Erfani, 10 Nov 2025
  - AC1: 'Reply on RC1', Jedidiah E. Dale, 08 Feb 2026
- RC2: 'Comment on egusphere-2025-3962', Anonymous Referee #2, 11 Dec 2025
  - AC3: 'Reply on RC2', Jedidiah E. Dale, 08 Feb 2026
- RC3: 'Comment on egusphere-2025-3962', Anonymous Referee #3, 28 Dec 2025
  - AC2: 'Reply on RC3', Jedidiah E. Dale, 08 Feb 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to revisions (further review by editor and referees) (06 Mar 2026) by Thomas Kjeldsen
AR by Jedidiah E. Dale on behalf of the Authors (10 Mar 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (29 Mar 2026) by Thomas Kjeldsen
AR by Jedidiah E. Dale on behalf of the Authors (31 Mar 2026)
Author's response
Manuscript
This is a well-written and methodologically solid paper addressing an important and timely topic—urban flooding. The study effectively builds upon previous efforts, particularly Erfani et al. and Eltner et al., and integrates their insights into a novel framework. The authors demonstrate a strong grasp of both the hydrologic and vision-based aspects of flood monitoring, making the work a valuable contribution to the field. Below, I offer a few comments and questions that may help strengthen the manuscript.
“While aerial lidar offers broad spatial coverage, it does not resolve fine-scale topographic features such as street curbs or shallow depressions common in urban environments” (Dale et al., 2025, p. 7)
Why did you use aerial lidar in the first place? If it was not used directly in your workflow, you might consider omitting it to avoid confusion.
“This approach relies on annotated point prompts that indicate the presence or absence of flooding at individual pixels within a reference image.” (Dale et al., 2025, p. 9)
“For a given flood event, the earliest image in which flooding was visible was annotated with three to five positive point prompts. These prompts were then used to segment the remaining image sequence.” (Dale et al., 2025, p. 9)
“The visual confirmation of flooding was used to iteratively refine the segmentation, with additional positive prompts added to correct for false negatives (i.e., flooded areas classified as non-flooded), and negative prompts added to address false positives (i.e., non-flooded areas misclassified as flooded)” (Dale et al., 2025, p. 9)
I understand that machine learning is not the main focus of this study—it primarily serves as a tool to extract information from 2D imagery. However, given that previous studies have already addressed similar challenges, it might have been advantageous to employ some of those established methods directly. Although the amount of manual annotation here is reduced, it still represents a bottleneck to achieving full automation.
“The extrinsic camera pose matrix, P, was estimated based on a set of matched reference features with known locations in both image coordinates (u, v), and world coordinates (X, Y, Z). This process, known as the Perspective-n-Point (PnP) problem, yields an estimated camera pose denoted as PPnP. Feature matching was performed manually, with image coordinates of reference features labeled in ImageJ (Schindelin et al., 2012) and their corresponding world coordinates annotated from the terrestrial lidar point cloud using CloudCompare (CloudCompare, 2023). In the absence of permanent ground control points, static scene elements such as rooftops, fence posts, and utility poles were used as reference features. Between 20 and 30 such features were labeled for each camera. Point precision was limited by image resolution, point cloud noise, and the spatial resolution of the lidar scan.” (Dale et al., 2025, p. 10)
In this section, the methodology appears somewhat behind the state of the art. As mentioned earlier, even though these technical components might seem peripheral, exploring ways to automate them is crucial for advancing toward operational applications of such frameworks.
Also, how many times did the authors perform this procedure? Assuming the camera locations are fixed, it seems unnecessary to repeat it multiple times—unless the cameras were moved between events.
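For context on the manual PnP step discussed above, the core computation can be sketched with a direct linear transform (DLT) in NumPy. This is an illustrative stand-in, not the authors' implementation (the paper does not name a solver; OpenCV's solvePnP would be a common choice), and the intrinsics, pose, and feature coordinates below are synthetic. Hartley-style coordinate normalization is omitted for brevity, though it would improve conditioning in practice:

```python
import numpy as np

def estimate_projection_dlt(world, pixels):
    """Direct linear transform: recover the 3x4 projection matrix from
    matched world (X, Y, Z) and image (u, v) points via the SVD null
    space (needs >= 6 non-coplanar correspondences)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world, pixels):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)  # defined up to scale, which cancels on reprojection

def reprojection_errors(P, world, pixels):
    """Per-feature pixel distance between labeled and reprojected points."""
    homo = np.hstack([world, np.ones((len(world), 1))]) @ P.T
    uv = homo[:, :2] / homo[:, 2:3]
    return np.linalg.norm(uv - pixels, axis=1)

# Synthetic ground-truth camera (hypothetical intrinsics and pose, ~HD frame).
K = np.array([[1200.0, 0.0, 960.0], [0.0, 1200.0, 540.0], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([np.eye(3), np.array([[0.5], [-1.0], [10.0]])])

rng = np.random.default_rng(0)
world = rng.uniform([-5.0, -5.0, 20.0], [5.0, 5.0, 40.0], size=(25, 3))
homo = np.hstack([world, np.ones((25, 1))]) @ P_true.T
pixels = homo[:, :2] / homo[:, 2:3] + rng.normal(0.0, 0.5, (25, 2))  # labeling noise

P_est = estimate_projection_dlt(world, pixels)
err = reprojection_errors(P_est, world, pixels)
print(f"median reprojection error: {np.median(err):.2f} px")
```

Automating the correspondence step (e.g., with learned feature matching against a rendered view of the point cloud) is exactly the kind of extension that would remove the manual bottleneck noted above.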
“A separate camera pose estimate was computed for each camera and flood event. For the moderate May 14 flood, Camera A’s pose was calculated using 18 reference features, yielding a median reprojection error of 6.83 pixels. The recovered camera location was offset 46 cm from the labeled camera center in the point cloud. For the July 4 event, pose estimation at Camera A used 24 features, resulting in a median reprojection error of 23.6 pixels and a reduced camera position offset to 6 cm.” (Dale et al., 2025, p. 11)
This part is a bit confusing. Could the authors clarify why the July event—with more reference features—has a higher reprojection error in image space but a smaller offset in 3D space? The 3D error seems quite large and could significantly affect flood mapping accuracy (e.g., introducing nearly a meter of uncertainty in flood extent). Did the authors examine how this uncertainty propagates into flood depth estimates?
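One simple way to examine the propagation question raised here is a Monte Carlo sweep: perturb the recovered water surface elevation by a pose-driven uncertainty and look at the resulting spread in depth and extent. A minimal sketch on a hypothetical 1-D sloping transect (all magnitudes assumed, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
ground = np.linspace(0.0, 1.0, 200)   # hypothetical elevations (m) along a 100 m transect
cell = 100.0 / 199                    # horizontal spacing between profile points (m)
wse_mean, wse_sigma = 0.60, 0.05      # assumed WSE and pose-driven 1-sigma (m)

# Monte Carlo: sample plausible water levels, map each to depth and wetted extent.
samples = rng.normal(wse_mean, wse_sigma, size=1000)
depths = np.clip(samples[:, None] - ground[None, :], 0.0, None)   # (1000, 200)
extent_m = (depths > 0).sum(axis=1) * cell                        # wetted length per sample

print(f"extent: {extent_m.mean():.1f} +/- {extent_m.std():.1f} m")
print(f"max depth: {depths.max(axis=1).mean():.2f} +/- {depths.max(axis=1).std():.2f} m")
```

On a gentle slope like this, a few centimetres of vertical uncertainty translate into metres of horizontal extent uncertainty, which is the sensitivity the comment is asking the authors to quantify.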
“Flood extent estimation is based on the intersection of lidar-derived topography and image-derived water classifications. Using the established projection pipeline in Equation 2, each point in the terrestrial lidar point cloud is mapped to a corresponding image pixel. If a pixel is identified as flooded in the SAM2-derived binary segmentation mask, the associated terrestrial lidar point is classified as inundated.” (Dale et al., 2025, p. 11)
How was this implemented? Since multiple 3D points may project onto a single image pixel, how did the authors handle indexing or correspondence between flooded pixels and their associated 3D points?
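On the implementation question: because the mapping is a lookup (point to pixel) rather than an assignment, many-to-one projection is unproblematic; each 3-D point simply inherits the label of the pixel it lands in. A minimal NumPy sketch of how this could work, with a hypothetical projection matrix and mask:

```python
import numpy as np

def classify_points(points_xyz, P, flood_mask):
    """Label each lidar point with the flood-mask value of the pixel it
    projects to. Many points may share one pixel; each just reads that
    pixel's label. Points outside the frame stay unlabeled (False)."""
    h, w = flood_mask.shape
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))]) @ P.T
    u = np.round(homo[:, 0] / homo[:, 2]).astype(int)
    v = np.round(homo[:, 1] / homo[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (homo[:, 2] > 0)
    labels = np.zeros(len(points_xyz), dtype=bool)
    labels[inside] = flood_mask[v[inside], u[inside]]
    return labels

# Tiny synthetic check: identity-like camera, 4x4 mask flooded on the left half.
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
pts = np.array([[1.0, 1.0, 1.0], [3.0, 1.0, 1.0], [1.2, 0.9, 1.0]])
labels = classify_points(pts, P, mask)
print(labels)  # first and third points land in the flooded half
```

The subtler issue is occlusion: a pixel can be flooded while a 3-D point projecting into it sits behind an obstruction, so a depth or visibility test may still be needed; it would help if the authors stated whether one was applied.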
“To estimate water surface elevation (WSE), the highest elevations along the boundary of the inundated zone are used as a proxy for the maximum water level and the water surface is assumed to be flat. Edge pixels are extracted using a Canny Edge Detection filter, and the 90th and 95th percentiles of the extracted edge elevation distribution are used to represent a range of possible water surface levels (WSE90 and WSE95) to account for potential topographic noise or obstruction of the water edge in the time lapse images.” (Dale et al., 2025, p. 11)
This appears to be the core contribution of the paper and would benefit from more detailed elaboration. The rest of the workflow closely follows prior studies. Based on Figure 1, I initially thought the authors were using a hypsometric curve approach (Dale et al., 2025, p. 6). It might be helpful to elaborate on how these curves are utilized and how they relate to the conceptual model applied later in the iterative flood-fill procedure at 0.5 m resolution (Wu et al., 2018; Samela et al., 2020).
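To make the request concrete, the quoted WSE step amounts to: find the wet/dry boundary of the inundated zone, read the elevations along it, and take upper percentiles. A minimal gridded sketch (a 4-neighbour boundary test stands in for the Canny filter here, and the bowl-shaped DEM is synthetic):

```python
import numpy as np

def wse_from_mask(dem, wet):
    """Wet cells with at least one dry 4-neighbour form the inundation
    edge; the 90th/95th percentiles of edge elevations bracket the WSE,
    buffering against noisy or obstructed stretches of shoreline."""
    dry = ~wet
    edge = wet & (
        np.roll(dry, 1, 0) | np.roll(dry, -1, 0) |
        np.roll(dry, 1, 1) | np.roll(dry, -1, 1)
    )
    z = dem[edge]
    return np.percentile(z, 90), np.percentile(z, 95)

# Bowl-shaped DEM filled to z = 0.5; the recovered WSE should sit just below that.
y, x = np.mgrid[-1:1:101j, -1:1:101j]
dem = x**2 + y**2
wet = dem < 0.5
wse90, wse95 = wse_from_mask(dem, wet)
print(f"WSE90={wse90:.3f}  WSE95={wse95:.3f}")
```

Spelling the step out at this level, and relating the edge-elevation distribution to the hypsometric curves in Figure 1, would make the contribution much easier to evaluate.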
“The area of interest for the flood-fill implementation focused on the direct area spanning the two camera locations, approximately 500 m by 250 m, to avoid propagation into unobservable areas.” (Dale et al., 2025, p. 11)
This aspect could also be an interesting avenue for future research—for example, using a location-allocation optimization approach to minimize the number of cameras while maximizing the coverage area.
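For reference, the constrained flood-fill described in the quote amounts to a connected-component search from seed cells, admitting only cells whose ground elevation lies below the estimated WSE; restricting the search window is what keeps water from propagating into unobservable areas. A small BFS sketch under those assumptions (the grid, seed, and WSE are hypothetical):

```python
import numpy as np
from collections import deque

def flood_fill(dem, seed, wse):
    """BFS from a seed cell: a cell floods only if its elevation is below
    the water surface elevation AND it connects to the seed through
    other flooded cells (4-connectivity)."""
    h, w = dem.shape
    flooded = np.zeros(dem.shape, dtype=bool)
    if dem[seed] >= wse:
        return flooded  # seed itself is dry; nothing floods
    flooded[seed] = True
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not flooded[nr, nc] and dem[nr, nc] < wse:
                flooded[nr, nc] = True
                q.append((nr, nc))
    return flooded

# Two depressions separated by a ridge: the fill seeded on the left
# must not jump the ridge even though the right basin is just as low.
dem = np.ones((5, 7))
dem[1:4, 1:3] = 0.0   # left depression (contains the seed)
dem[1:4, 4:6] = 0.0   # right depression (hydraulically disconnected)
out = flood_fill(dem, (2, 1), wse=0.5)
print(int(out.sum()))  # 6: only the left depression floods
```

The camera-placement question then becomes where to put seeds and observation windows so that every fillable basin in the neighborhood is covered, which is where a location-allocation formulation could enter.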
“Although image data informed general model development, no direct calibration against the imagery was performed.” (Dale et al., 2025, p. 12)
This raises an interesting question: if sparse information extracted from cameras were available, how could such data be assimilated into flood models to refine their outputs? Could this be implemented in real time?
“Our comparison focuses on quantifying the relative agreement in predicted flood extent between the two methods. The primary metric focuses on identifying regions where both the model and camera-based approaches indicate flooding areas of mutual agreement in predicted inundation. This shared extent is expressed as Foverlap, the ratio of the number of pixels classified as flooded by both methods to the total number of pixels classified as flooded by either. The model domain includes areas separated from our camera sites by major roads and drainage canals. To provide a meaningful comparison between model output and our image-based methods, we spatially restricted our comparison to a region with the approximate bounds of the topographic depression containing the study neighborhood. Where flood extents overlap, we also compared modeled and observed water surface elevations and flood depths.” (Dale et al., 2025, p. 12)
This section feels somewhat unconventional and could benefit from clarification. If I were the authors, I would consider treating the HEC-RAS output as the reference (or ground truth) and evaluating the vision-based estimates using standard metrics such as a confusion matrix. This would make the comparison more transparent and interpretable. It would also help highlight that the vision framework is not isolated—the overall performance reflects both the errors of the camera-based method (which provides boundary and initial conditions) and those of the conceptual flood model. A more detailed characterization of each component’s contribution would strengthen the paper considerably.
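To illustrate the suggested reframing: treating one extent as the reference yields the standard confusion-matrix counts, from which both Foverlap (which, as defined in the quote, is the intersection-over-union, identical to the critical success index) and per-class error rates follow. A small NumPy sketch on synthetic masks:

```python
import numpy as np

def extent_metrics(pred, ref):
    """Confusion-matrix counts and agreement scores for two binary
    flood-extent masks (pred = camera-based, ref = model-based)."""
    tp = int(np.sum(pred & ref))    # flooded in both
    fp = int(np.sum(pred & ~ref))   # camera-only flooding
    fn = int(np.sum(~pred & ref))   # model-only flooding
    tn = int(np.sum(~pred & ~ref))  # dry in both
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "f_overlap": tp / (tp + fp + fn),   # IoU == CSI == the paper's Foverlap
        "hit_rate": tp / (tp + fn),
        "false_alarm_ratio": fp / (tp + fp),
    }

# Synthetic 10x10 example: two equal-area square extents, offset by one cell.
ref = np.zeros((10, 10), dtype=bool); ref[2:8, 2:8] = True
pred = np.zeros((10, 10), dtype=bool); pred[3:9, 3:9] = True
m = extent_metrics(pred, ref)
print(m["f_overlap"])  # 25 / 47
```

Reporting the full set of counts rather than Foverlap alone would show whether disagreement comes from over- or under-prediction, and would help separate the camera-side errors from those of the conceptual flood model.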