Articles | Volume 30, issue 7
https://doi.org/10.5194/hess-30-2183-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Community-scale urban flood monitoring through fusion of time-lapse imagery, terrestrial lidar, and remote sensing data
Download
- Final revised paper (published on 17 Apr 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 29 Sep 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3962', Seyed Mohammad Hassan Erfani, 10 Nov 2025
  - AC1: 'Reply on RC1', Jedidiah E. Dale, 08 Feb 2026
- RC2: 'Comment on egusphere-2025-3962', Anonymous Referee #2, 11 Dec 2025
  - AC3: 'Reply on RC2', Jedidiah E. Dale, 08 Feb 2026
- RC3: 'Comment on egusphere-2025-3962', Anonymous Referee #3, 28 Dec 2025
  - AC2: 'Reply on RC3', Jedidiah E. Dale, 08 Feb 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to revisions (further review by editor and referees) (06 Mar 2026) by Thomas Kjeldsen
AR by Jedidiah E. Dale on behalf of the Authors (10 Mar 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (29 Mar 2026) by Thomas Kjeldsen
AR by Jedidiah E. Dale on behalf of the Authors (31 Mar 2026)
Author's response
Manuscript
This is a well-written and methodologically solid paper addressing an important and timely topic—urban flooding. The study effectively builds upon previous efforts, particularly Erfani et al. and Eltner et al., and integrates their insights into a novel framework. The authors demonstrate a strong grasp of both the hydrologic and vision-based aspects of flood monitoring, making the work a valuable contribution to the field. Below, I offer a few comments and questions that may help strengthen the manuscript.
“While aerial lidar offers broad spatial coverage, it does not resolve fine-scale topographic features such as street curbs or shallow depressions common in urban environments” (Dale et al., 2025, p. 7)
Why did you use aerial lidar in the first place? If it was not used directly in your workflow, you might consider omitting it to avoid confusion.
“This approach relies on annotated point prompts that indicate the presence or absence of flooding at individual pixels within a reference image.” (Dale et al., 2025, p. 9)
“For a given flood event, the earliest image in which flooding was visible was annotated with three to five positive point prompts. These prompts were then used to segment the remaining image sequence.” (Dale et al., 2025, p. 9)
“The visual confirmation of flooding was used to iteratively refine the segmentation, with additional positive prompts added to correct for false negatives (i.e., flooded areas classified as non-flooded), and negative prompts added to address false positives (i.e., non-flooded areas misclassified as flooded)” (Dale et al., 2025, p. 9)
I understand that machine learning is not the main focus of this study—it primarily serves as a tool to extract information from 2D imagery. However, given that previous studies have already addressed similar challenges, it might have been advantageous to employ some of those established methods directly. Although the amount of manual annotation here is reduced, it still represents a bottleneck to achieving full automation.
“The extrinsic camera pose matrix, P, was estimated based on a set of matched reference features with known locations in both image coordinates (u, v), and world coordinates (X, Y, Z). This process, known as the Perspective-n-Point (PnP) problem, yields an estimated camera pose denoted as PPnP. Feature matching was performed manually, with image coordinates of reference features labeled in ImageJ (Schindelin et al., 2012) and their corresponding world coordinates annotated from the terrestrial lidar point cloud using CloudCompare (CloudCompare, 2023). In the absence of permanent ground control points, static scene elements such as rooftops, fence posts, and utility poles were used as reference features. Between 20 and 30 such features were labeled for each camera. Point precision was limited by image resolution, point cloud noise, and the spatial resolution of the lidar scan.” (Dale et al., 2025, p. 10)
In this section, the methodology appears somewhat behind the state of the art. As mentioned earlier, even though these technical components might seem peripheral, exploring ways to automate them is crucial for advancing toward operational applications of such frameworks.
Also, how many times did the authors perform this procedure? Assuming the camera locations are fixed, it seems unnecessary to repeat it multiple times—unless the cameras were moved between events.
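For context on the manual PnP step discussed above, the core computation can be sketched with a direct linear transform (DLT) in NumPy. This is an illustrative stand-in, not the authors' implementation (the paper does not name a solver; OpenCV's solvePnP would be a common choice), and the intrinsics, pose, and feature coordinates below are synthetic. Hartley-style coordinate normalization is omitted for brevity, though it would improve conditioning in practice:

```python
import numpy as np

def estimate_projection_dlt(world, pixels):
    """Direct linear transform: recover the 3x4 projection matrix from
    matched world (X, Y, Z) and image (u, v) points via the SVD null
    space (needs >= 6 non-coplanar correspondences)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world, pixels):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)  # defined up to scale, which cancels on reprojection

def reprojection_errors(P, world, pixels):
    """Per-feature pixel distance between labeled and reprojected points."""
    homo = np.hstack([world, np.ones((len(world), 1))]) @ P.T
    uv = homo[:, :2] / homo[:, 2:3]
    return np.linalg.norm(uv - pixels, axis=1)

# Synthetic ground-truth camera (hypothetical intrinsics and pose, ~HD frame).
K = np.array([[1200.0, 0.0, 960.0], [0.0, 1200.0, 540.0], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([np.eye(3), np.array([[0.5], [-1.0], [10.0]])])

rng = np.random.default_rng(0)
world = rng.uniform([-5.0, -5.0, 20.0], [5.0, 5.0, 40.0], size=(25, 3))
homo = np.hstack([world, np.ones((25, 1))]) @ P_true.T
pixels = homo[:, :2] / homo[:, 2:3] + rng.normal(0.0, 0.5, (25, 2))  # labeling noise

P_est = estimate_projection_dlt(world, pixels)
err = reprojection_errors(P_est, world, pixels)
print(f"median reprojection error: {np.median(err):.2f} px")
```

Automating the correspondence step (e.g., with learned feature matching against a rendered view of the point cloud) is exactly the kind of extension that would remove the manual bottleneck noted above.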
“A separate camera pose estimate was computed for each camera and flood event. For the moderate May 14 flood, Camera A’s pose was calculated using 18 reference features, yielding a median reprojection error of 6.83 pixels. The recovered camera location was offset 46 cm from the labeled camera center in the point cloud. For the July 4 event, pose estimation at Camera A used 24 features, resulting in a median reprojection error of 23.6 pixels and a reduced camera position offset to 6 cm.” (Dale et al., 2025, p. 11)
This part is a bit confusing. Could the authors clarify why the July event—with more reference features—has a higher reprojection error in image space but a smaller offset in 3D space? The 3D error seems quite large and could significantly affect flood mapping accuracy (e.g., introducing nearly a meter of uncertainty in flood extent). Did the authors examine how this uncertainty propagates into flood depth estimates?
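One simple way to examine the propagation question raised here is a Monte Carlo sweep: perturb the recovered water surface elevation by a pose-driven uncertainty and look at the resulting spread in depth and extent. A minimal sketch on a hypothetical 1-D sloping transect (all magnitudes assumed, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
ground = np.linspace(0.0, 1.0, 200)   # hypothetical elevations (m) along a 100 m transect
cell = 100.0 / 199                    # horizontal spacing between profile points (m)
wse_mean, wse_sigma = 0.60, 0.05      # assumed WSE and pose-driven 1-sigma (m)

# Monte Carlo: sample plausible water levels, map each to depth and wetted extent.
samples = rng.normal(wse_mean, wse_sigma, size=1000)
depths = np.clip(samples[:, None] - ground[None, :], 0.0, None)   # (1000, 200)
extent_m = (depths > 0).sum(axis=1) * cell                        # wetted length per sample

print(f"extent: {extent_m.mean():.1f} +/- {extent_m.std():.1f} m")
print(f"max depth: {depths.max(axis=1).mean():.2f} +/- {depths.max(axis=1).std():.2f} m")
```

On a gentle slope like this, a few centimetres of vertical uncertainty translate into metres of horizontal extent uncertainty, which is the sensitivity the comment is asking the authors to quantify.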
“Flood extent estimation is based on the intersection of lidar-derived topography and image-derived water classifications. Using the established projection pipeline in Equation 2, each point in the terrestrial lidar point cloud is mapped to a corresponding image pixel. If a pixel is identified as flooded in the SAM2-derived binary segmentation mask, the associated terrestrial lidar point is classified as inundated.” (Dale et al., 2025, p. 11)
How was this implemented? Since multiple 3D points may project onto a single image pixel, how did the authors handle indexing or correspondence between flooded pixels and their associated 3D points?
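On the implementation question: because the mapping is a lookup (point to pixel) rather than an assignment, many-to-one projection is unproblematic; each 3-D point simply inherits the label of the pixel it lands in. A minimal NumPy sketch of how this could work, with a hypothetical projection matrix and mask:

```python
import numpy as np

def classify_points(points_xyz, P, flood_mask):
    """Label each lidar point with the flood-mask value of the pixel it
    projects to. Many points may share one pixel; each just reads that
    pixel's label. Points outside the frame stay unlabeled (False)."""
    h, w = flood_mask.shape
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))]) @ P.T
    u = np.round(homo[:, 0] / homo[:, 2]).astype(int)
    v = np.round(homo[:, 1] / homo[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (homo[:, 2] > 0)
    labels = np.zeros(len(points_xyz), dtype=bool)
    labels[inside] = flood_mask[v[inside], u[inside]]
    return labels

# Tiny synthetic check: identity-like camera, 4x4 mask flooded on the left half.
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
pts = np.array([[1.0, 1.0, 1.0], [3.0, 1.0, 1.0], [1.2, 0.9, 1.0]])
labels = classify_points(pts, P, mask)
print(labels)  # first and third points land in the flooded half
```

The subtler issue is occlusion: a pixel can be flooded while a 3-D point projecting into it sits behind an obstruction, so a depth or visibility test may still be needed; it would help if the authors stated whether one was applied.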
“To estimate water surface elevation (WSE), the highest elevations along the boundary of the inundated zone are used as a proxy for the maximum water level and the water surface is assumed to be flat. Edge pixels are extracted using a Canny Edge Detection filter, and the 90th and 95th percentiles of the extracted edge elevation distribution are used to represent a range of possible water surface levels (WSE90 and WSE95) to account for potential topographic noise or obstruction of the water edge in the time lapse images.” (Dale et al., 2025, p. 11)
This appears to be the core contribution of the paper and would benefit from more detailed elaboration. The rest of the workflow closely follows prior studies. Based on Figure 1, I initially thought the authors were using a hypsometric curve approach (Dale et al., 2025, p. 6). It might be helpful to elaborate on how these curves are utilized and how they relate to the conceptual model applied later in the iterative flood-fill procedure at 0.5 m resolution (Wu et al., 2018; Samela et al., 2020).
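To make the request concrete, the quoted WSE step amounts to: find the wet/dry boundary of the inundated zone, read the elevations along it, and take upper percentiles. A minimal gridded sketch (a 4-neighbour boundary test stands in for the Canny filter here, and the bowl-shaped DEM is synthetic):

```python
import numpy as np

def wse_from_mask(dem, wet):
    """Wet cells with at least one dry 4-neighbour form the inundation
    edge; the 90th/95th percentiles of edge elevations bracket the WSE,
    buffering against noisy or obstructed stretches of shoreline."""
    dry = ~wet
    edge = wet & (
        np.roll(dry, 1, 0) | np.roll(dry, -1, 0) |
        np.roll(dry, 1, 1) | np.roll(dry, -1, 1)
    )
    z = dem[edge]
    return np.percentile(z, 90), np.percentile(z, 95)

# Bowl-shaped DEM filled to z = 0.5; the recovered WSE should sit just below that.
y, x = np.mgrid[-1:1:101j, -1:1:101j]
dem = x**2 + y**2
wet = dem < 0.5
wse90, wse95 = wse_from_mask(dem, wet)
print(f"WSE90={wse90:.3f}  WSE95={wse95:.3f}")
```

Spelling the step out at this level, and relating the edge-elevation distribution to the hypsometric curves in Figure 1, would make the contribution much easier to evaluate.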
“The area of interest for the flood-fill implementation focused on the direct area spanning the two camera locations, approximately 500 m by 250 m, to avoid propagation into unobservable areas.” (Dale et al., 2025, p. 11)
This aspect could also be an interesting avenue for future research—for example, using a location-allocation optimization approach to minimize the number of cameras while maximizing the coverage area.
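For reference, the constrained flood-fill described in the quote amounts to a connected-component search from seed cells, admitting only cells whose ground elevation lies below the estimated WSE; restricting the search window is what keeps water from propagating into unobservable areas. A small BFS sketch under those assumptions (the grid, seed, and WSE are hypothetical):

```python
import numpy as np
from collections import deque

def flood_fill(dem, seed, wse):
    """BFS from a seed cell: a cell floods only if its elevation is below
    the water surface elevation AND it connects to the seed through
    other flooded cells (4-connectivity)."""
    h, w = dem.shape
    flooded = np.zeros(dem.shape, dtype=bool)
    if dem[seed] >= wse:
        return flooded  # seed itself is dry; nothing floods
    flooded[seed] = True
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not flooded[nr, nc] and dem[nr, nc] < wse:
                flooded[nr, nc] = True
                q.append((nr, nc))
    return flooded

# Two depressions separated by a ridge: the fill seeded on the left
# must not jump the ridge even though the right basin is just as low.
dem = np.ones((5, 7))
dem[1:4, 1:3] = 0.0   # left depression (contains the seed)
dem[1:4, 4:6] = 0.0   # right depression (hydraulically disconnected)
out = flood_fill(dem, (2, 1), wse=0.5)
print(int(out.sum()))  # 6: only the left depression floods
```

The camera-placement question then becomes where to put seeds and observation windows so that every fillable basin in the neighborhood is covered, which is where a location-allocation formulation could enter.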
“Although image data informed general model development, no direct calibration against the imagery was performed.” (Dale et al., 2025, p. 12)
This raises an interesting question: if sparse information extracted from cameras were available, how could such data be assimilated into flood models to refine their outputs? Could this be implemented in real time?
“Our comparison focuses on quantifying the relative agreement in predicted flood extent between the two methods. The primary metric focuses on identifying regions where both the model and camera-based approaches indicate flooding areas of mutual agreement in predicted inundation. This shared extent is expressed as Foverlap, the ratio of the number of pixels classified as flooded by both methods to the total number of pixels classified as flooded by either. The model domain includes areas separated from our camera sites by major roads and drainage canals. To provide a meaningful comparison between model output and our image-based methods, we spatially restricted our comparison to a region with the approximate bounds of the topographic depression containing the study neighborhood. Where flood extents overlap, we also compared modeled and observed water surface elevations and flood depths.” (Dale et al., 2025, p. 12)
This section feels somewhat unconventional and could benefit from clarification. If I were the authors, I would consider treating the HEC-RAS output as the reference (or ground truth) and evaluating the vision-based estimates using standard metrics such as a confusion matrix. This would make the comparison more transparent and interpretable. It would also help highlight that the vision framework is not isolated—the overall performance reflects both the errors of the camera-based method (which provides boundary and initial conditions) and those of the conceptual flood model. A more detailed characterization of each component’s contribution would strengthen the paper considerably.
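To illustrate the suggested reframing: treating one extent as the reference yields the standard confusion-matrix counts, from which both Foverlap (which, as defined in the quote, is the intersection-over-union, identical to the critical success index) and per-class error rates follow. A small NumPy sketch on synthetic masks:

```python
import numpy as np

def extent_metrics(pred, ref):
    """Confusion-matrix counts and agreement scores for two binary
    flood-extent masks (pred = camera-based, ref = model-based)."""
    tp = int(np.sum(pred & ref))    # flooded in both
    fp = int(np.sum(pred & ~ref))   # camera-only flooding
    fn = int(np.sum(~pred & ref))   # model-only flooding
    tn = int(np.sum(~pred & ~ref))  # dry in both
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "f_overlap": tp / (tp + fp + fn),   # IoU == CSI == the paper's Foverlap
        "hit_rate": tp / (tp + fn),
        "false_alarm_ratio": fp / (tp + fp),
    }

# Synthetic 10x10 example: two equal-area square extents, offset by one cell.
ref = np.zeros((10, 10), dtype=bool); ref[2:8, 2:8] = True
pred = np.zeros((10, 10), dtype=bool); pred[3:9, 3:9] = True
m = extent_metrics(pred, ref)
print(m["f_overlap"])  # 25 / 47
```

Reporting the full set of counts rather than Foverlap alone would show whether disagreement comes from over- or under-prediction, and would help separate the camera-side errors from those of the conceptual flood model.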