Creating a national urban flood dataset for China from news texts (2000–2022) at the county level

Fu, Shengnan; Schultz, David M.; Lyu, Heng; Zheng, Zhonghua; Zhang, Chi

doi:10.5194/hess-29-767-2025

Articles | Volume 29, issue 3

https://doi.org/10.5194/hess-29-767-2025

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.


Research article | Highlight paper | 13 Feb 2025


Final revised paper (published on 13 Feb 2025)
Preprint (discussion started on 27 May 2024)

Interactive discussion

Status: closed

RC1:
'Comment on hess-2024-146', Anonymous Referee #1, 24 Jun 2024
Manuscript Review of "Extracting Spatiotemporal Flood Information from News Texts Using Machine Learning for a National Dataset in China"
Overview
The authors developed a machine learning approach based on Natural Language Processing (NLP) applied to news data to identify urban-flood-related news items and extract the timing and location of thousands of flood events in China from 2000 to 2022. The research is original and aligns with the burgeoning trend of NLP-driven research and applications, including in disaster risk research. The proposed approach is efficient, demonstrating high-performance metrics for both event detection and information extraction.
The proposed approach and resulting dataset could serve as a valuable, significant, and complementary basis for future research and improving risk management and modeling practices. Historical catalogs of flood hazards, crucial for understanding flood risks, remain scarce and biased, whether constructed from textual documents or satellite imagery, both regionally and globally. Given the high performance of the event detection and information extraction approach and the high number of retrieved events, I conclude that the research provides significant results.
However, my primary expectation for the introduction and its expansion within the discussion section would have been to gain more insight into how this approach compares to existing NLP-based flood event extraction studies and how the resulting dataset fills gaps in existing catalogs. In the current manuscript version, these aspects are not sufficiently covered. While some NLP approaches are introduced, the manuscript lacks a sufficient overview of the state-of-the-art on this hot topic despite a high number of references, not all of which are directly relevant. Moreover, the performance of the proposed method is not compared with existing methods, leaving the reader without a clear understanding of how this method ranks within its application domain (considering language differences). Similarly, while a few catalogs of events are introduced, the resulting catalog is not sufficiently compared to these existing ones, thus failing to demonstrate a reduction of the knowledge gap the paper intends to lower.
On the other hand, eight figures attempt to depict the spatiotemporal and environmental context of the resulting dataset, which could be done more efficiently to reorient the paper towards a more balanced discussion. Additionally, I raise in the general comments some concerns about the considered flood types, the analysis of the GDP-economic drivers of flood reporting that do not consider population density, the chosen flood susceptibility indicators, and a few concerns about FAIR standards and the format of the shared dataset.
In conclusion, while I consider the paper a valuable piece of research and application, it shows weaknesses, in my opinion, by not being sufficiently contextualized and lacking some key focuses on the result rank and fit in its context.

General Comments
Note: I compare the authors' catalog initiative with the Global Landslide Catalog (GLC) to support some general comments (see Kirschbaum et al., 2010, 2015; Dandridge et al., 2023).
G1. Flood Query Keywords
The flood query was limited to "flood" and "flood disasters" (L142, L154), while many other terms could hint at flood events in news items, e.g., "typhoon," "cyclone," "mud," "heavy rainfall," "inundated areas,"… Query terms are an essential aspect of event detection and this could be seen as a restriction limiting the detection power of the proposed approach. It raises some questions: Should this be documented as a limitation? Is it a decision to limit the size of the corpus? Does the Q&A approach prevent that concern?
G2. Flood Types and Multi-Hazard Concerns
The paper focuses on urban floods, excluding other types of floods, yet flood types are interrelated and very often not mutually exclusive. Hence, referring, for instance, to the Hazard Information Profiles (HIPs, https://www.preventionweb.net/drr-glossary/hips), an urban flood could also be related to a flash flood (despite the exclusion of the query of "flash flood," L151), a riverine flood, a coastal flood, or a groundwater flood. Floods are also secondary hazards associated with other hazards, such as a flood that could result from a typhoon, heavy rainfall, a storm surge, an intense monsoon, etc. Floods are also associated with geo-hazards such as landslides (see GLC studies). I found the typhoon case study in the paper interesting. It also illustrates the multi-hazard nature of floods well. As in GLC studies, I would be interested in having the authors' view on multi, cascading, and co-occurring type issues, the possibilities of detecting multi-type floods, and the challenges, limitations, and perspectives concerning their proposed approach.
G3. A More Balanced Discussion: Trend Analyses vs. Gap Filling Potential
The manuscript extensively discusses spatiotemporal trend analysis, which calls for more caution and clarity about the factors influencing those trends. I understand the need to illustrate trends in the resulting dataset, but, in my opinion, this matter could be more efficiently summarized, and the paper could be more descriptive and less assertive in the interpretation. Some analyses are simplistic and do not go deep enough. Rather than make the paper even longer, I invite the authors to distinguish more between the essential and the accessory and, if anticipated, to cover in greater depth the spatiotemporal analysis of events and cross-referencing with third-party data in other papers (see GLC studies).
Some figures may be grouped, e.g., maps in different panels of one figure, which would allow the paper to focus not only on the trends of the output data but also on how the output data compares to other datasets, which is currently limited to Figure 4, despite the numerous datasets being listed in the introduction. The reader has little clue as to what gap is being filled. In particular, the Chinese bulletin appears as a more exhaustive dataset (although coarser). This point may be worth further discussion.
Note regarding temporal trends:

Trends in hazard occurrences are complex, influenced by variations in hazard intensity and alteration of environmental susceptibility, as well as demographic shifts that alter exposure or vulnerability. Moreover, climatic cycles (e.g., ENSO or other climate indices) can distort linear trend estimations over brief periods due to their cyclical nature.
The complexity is further compounded when analyzing trends from news data. Changes in reporting capacity, especially in remote areas, along with new communication technologies like satellite and social media, may introduce significant biases. The proliferation of the internet during the 1990s and 2000s has notably impacted flood event reporting (Gall et al., 2009; Kron et al., 2012; Delforge et al., 2023). Kron et al., 2012 illustrate well the challenges in building a hazard database with flood examples. These works underscore the necessity for standardized flood event definitions to mitigate discrepancies in reporting scales. In the case of news scraping, the framing by journalists can significantly alter the perceived frequency, spatial representation, and the type of events.
In conclusion, the total number of flood events is a highly relative figure. It is essential to acknowledge that while flood hazards are natural phenomena, flood disasters and their reporting are social phenomena with potentially distinct and diverging trend patterns. Given these complexities, attributing trends depicted in the news (i.e., social variables, not physical ones) to climate change or land use changes requires careful consideration.
G4. Analyses of GDP
The manuscript highlights the GDP as the primary driver of media attention. However, the boxes in Figure 5 do not seem to show any significant difference between the occurrence of floods for different GDP groups. So, to highlight a possible effect of GDP on media attention, it is vital to use GDP per capita (see GLC studies).
The population is a critical factor in media attention and hazard exposure. More densely populated cities should receive more media attention in the event of a flood. It is likely the primary factor explaining the spatial patterns in the dataset. It is likely to be correlated with GDP, as well as other factors such as elevation, distance to river or coast, or climate (see G5). Therefore, controlling that factor when investigating some effects is essential.
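To make the control suggested in G4 concrete, here is a minimal sketch. It is not the authors' analysis and not the procedure used in the GLC studies. It assumes three hypothetical per-city lists (`reports`, `gdp`, `population`) filled with made-up values, derives GDP per capita, and computes a first-order partial correlation, so that the association between GDP and the number of flood reports can be read with the linear effect of population removed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def gdp_per_capita(gdp, population):
    """Element-wise GDP divided by population."""
    return [g / p for g, p in zip(gdp, population)]

# Made-up illustrative values for five cities (NOT taken from the dataset).
population = [1.0, 2.0, 4.0, 8.0, 16.0]   # millions of inhabitants
gdp = [30.0, 70.0, 150.0, 260.0, 600.0]   # arbitrary currency units
reports = [2, 3, 9, 15, 33]               # number of flood news items

print("GDP per capita:", gdp_per_capita(gdp, population))
print("raw correlation (reports, GDP):", pearson(reports, gdp))
print("partial correlation given population:",
      partial_corr(reports, gdp, population))
```

A raw correlation that shrinks markedly once population is held fixed would suggest that the apparent GDP effect is largely a population effect, which is the concern raised in this comment.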
G5. Analyses of Flood Susceptibility
Figure 7 and the underlying analysis of flood susceptibility present some issues and do not bring much to the paper. The proposed pattern is not very neat (the points also overlap with no transparency), likely because the chosen indicators are quite remote proxies of flood susceptibility and should not be presented as acknowledged indicators in hydrology (the supporting references are weak).
Average daily precipitation depicts a hydrological equilibrium rather than an extreme event. Naturally, arid regions are less susceptible (also less populated, hence, exposed). However, the indicator becomes less relevant to other hydrological systems with higher precipitation averages (a mixture of blue and red dots). Likewise, elevated areas are also likely to be less populated and then less exposed, and the elevation effect tends to disappear at a lower elevation. Flow accumulation or topographical wetness indices could have been more reliable indicators of flood susceptibility.
I would recommend removing this analysis given its low informative value and also because these variables are related to climate variability, which is already pictured in Figure 12. See GLC studies for comparisons.
G6. Flood Events Dataset Resolution
While the final dataset is reported at the county-month level, the reader is left with little insight into the level of detail directly resulting from the information extraction process, which remains unclearly described. Based on Figures 4 and 6, it appears that information at the city-daily level was collected. It seems that a much more precise dataset could have been shared without much additional effort, raising questions about the motivation behind disaggregating the data to such a coarser level.
G7. Data Content, FAIR Principles, and Reusability
Also, given that a central outcome of the paper is a dataset, alignment with FAIR principles (https://www.go-fair.org/) should be particularly encouraged. Regarding the data shared, GitHub is not considered FAIR as it does not allow for persistent identifiers. Also, a few additional data could greatly increase the reusability of the dataset, e.g., precise column descriptions in the readme, the reference for the administrative unit shapefile to link the data with the post-code or administrative units as described in the paper (L275-278), using international time standards, and possibly translate region names to English to maximize reuse in the global context.
Regarding reproducibility, the data and code availability section could be improved. Input news data and their conditions of (re-)use are not described in this section. Tools and libraries being used to develop the approach are not referred to (except references to the Python "Re" module at L187). There is no comment about whether or not the developed models are accessible and under which conditions of use.
There are no links or references to the news articles that have been used to construct the dataset. Sharing the links could drastically increase the paper's outreach and support future research and NLP applications to extract additional information, such as flood impact variables or associated hazard types, without redeveloping an NLP flood event detection model. Annotated corpora are also valuable datasets in the context of NLP for future benchmarking. Consider commenting on that dataset as well.
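As an illustration of the reusability points in G7, the sketch below writes county-month records to plain CSV with an ISO 8601 `YYYY-MM` time field. All column names (`county_code`, `county_name_en`, `year_month`, `flood_reported`) and the example record are hypothetical placeholders, not the layout of the authors' shared file.

```python
import csv
import io

# Hypothetical column names; the real dataset may be organised differently.
FIELDS = ["county_code", "county_name_en", "year_month", "flood_reported"]

def iso_year_month(year, month):
    """Format a year and month as an ISO 8601 'YYYY-MM' string."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{year:04d}-{month:02d}"

def to_csv(records):
    """Serialise a list of record dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()

# Placeholder record for illustration only.
records = [
    {
        "county_code": "000000",
        "county_name_en": "Example County",
        "year_month": iso_year_month(2021, 7),
        "flood_reported": 1,
    }
]
print(to_csv(records))
```

A plain-text format with a documented header and a standard date string can be read by most tools without extra libraries, and it can be deposited in a repository that issues persistent identifiers.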
References
Dandridge, C., Stanley, T. A., Kirschbaum, D. B., and Lakshmi, V.: Spatial and Temporal Analysis of Global Landslide Reporting Using a Decade of the Global Landslide Catalog, Sustainability, 15, 3323, https://doi.org/10.3390/su15043323, 2023.

Delforge, D., Wathelet, V., Below, R., Lanfredi Sofia, C., Tonnelier, M., Loenhout, J. van, and Speybroeck, N.: EM-DAT: the Emergency Events Database, preprint, https://doi.org/10.21203/rs.3.rs-3807553/v1, 2023.

Gall, M., Borden, K. A., and Cutter, S. L.: When Do Losses Count?: Six Fallacies of Natural Hazards Loss Data, Bulletin of the American Meteorological Society, 90, 799–810, https://doi.org/10.1175/2008BAMS2721.1, 2009.

Kirschbaum, D., Stanley, T., and Zhou, Y.: Spatial and temporal analysis of a global landslide catalog, Geomorphology, 249, 4–15, https://doi.org/10.1016/j.geomorph.2015.03.016, 2015.

Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., and Lerner-Lam, A.: A global landslide catalog for hazard applications: method, results, and limitations, Nat Hazards, 52, 561–575, https://doi.org/10.1007/s11069-009-9401-4, 2010.

Kron, W., Steuer, M., Löw, P., and Wirtz, A.: How to deal properly with a natural catastrophe database – analysis of flood losses, Natural Hazards and Earth System Sciences, 12, 535–550, https://doi.org/10.5194/nhess-12-535-2012, 2012.

Specific Comments
S1. L8: "similar" could be more nuanced.
S2. L9:10: "the connection between…": the connection does not support accuracy and the analysis is oversimplistic (See G5).
S3. L43 (and after): "natural disaster" is a controversial terminology often avoided by Disaster Risk experts, acknowledging that a disaster is not natural (as opposed to natural hazards).
S4. L43-L52: Table 2 could distinguish between catalogs from remote and social sensing, e.g., that DFO is based on remote sensing, EM-DAT on the collection of text documents and manual extraction of the information. Some missing recent initiatives could be worth mentioning, e.g., a global remote sensing catalog is the global flood database and a global catalog obtained from social media:
Tellman, B., Sullivan, J.A., Kuhn, C. et al. Satellite imaging reveals increased proportion of population exposed to floods. Nature 596, 80–86 (2021). https://doi.org/10.1038/s41586-021-03695-w

J.A. de Bruijn, H. de Moel, B. Jongman, M.C. de Ruiter, J. Wagemaker, J.C.J.H. Aerts. A global database of historic and real-time flood events based on social media. Scientific Data, 6 (1) (2019), p. 311, 10.1038/s41597-019-0326-9

G.R. Brakenridge. Global Active Archive of Large Flood Events. Dartmouth Flood Observatory, University of Colorado, USA. http://floodobservatory.colorado.edu/Archives/ (Accessed xxx)

Delforge, D., Wathelet, V., Below, R., Lanfredi Sofia, C., Tonnelier, M., Loenhout, J. van, and Speybroeck, N.: EM-DAT: the Emergency Events Database, preprint, https://doi.org/10.21203/rs.3.rs-3807553/v1, 2023.

S5. L65: Beyond cloud cover for optical imagery, mapping urban flood is challenging per se.
S6. L75: "Yang et al. (2023)" Such a paper of high relevance should be rediscussed later in the discussion section, among others, to identify (see Overview).
S7. L77: The authors acknowledge the multi-hazard nature of floods here and after, but the issue is not discussed in light of their own work (see G2).
S8. L90: "Conditional Random Fields (CRF) layer" appears to be a central part of the methodology appearing multiple times in the paper; however, it lacks a clear explanation of what it is and why it is used.
S9. L110:116: since the paper follows a conventional structure, it is unnecessary to detail it in the introduction.
S10. Table 2: EM-DAT is continuously updated (see Delforge et al., 2023). I would also refer to the Global Flood Awareness System (https://global-flood.emergency.copernicus.eu/), the flood component of CEMS, instead of CEMS. See also S4.
S11. L134: check url link (404 error).
S12. Figure 1: I appreciate the availability of an example. However, consider selecting a more topic-appropriate example or asking for a where/when the question for more relevance.
S13. L142, L151, and L154: See G1.
S14. L145-148: The description of the data and its processing, including test/train split, may be confusing. It may be more appropriate to move to the method section.
S15. L157: "Validation" unless China Flood and Drought Bulletin is considered a gold standard, I think referring to comparative data and cross-comparison instead of validation is more appropriate.
S16. L168-L174: oversimplistic view of hydrology and weak references. See G5.
S17. L190-199: This section could indicate the total/train/test sample sizes more clearly.
S18. L235: words should be singular in "and does contain the words 'will'…". Also, I wonder if this approach successfully separated actual events from forecasts? Is there any language specificity in Chinese involved here?
S19. Figure 3: Is [SEP] a requirement given the specificity of the Chinese language?
S20. L243: In the first sentence, correct "flood information extraction" into "(i) flood event detection and (ii) flood information extraction" for clarity.
S21. L259: it is not clear to me how Exact Match behaves in the case of multiple locations: is it zero if there is any error? What exactly is meant by the location data? City? County? How is location handled before the flood location recognition is explained in section 3.2? Perhaps 3.2 should be explained before.
S22. L276: consider adding the reference of the used administrative unit shapefile. See also G7.
S23. L285, section 4.1. The performance seems good in an absolute manner, but the reader has no clue how this performs in relation to the context of social sensing of flood or in the context of Chinese NLP. This is quite important to document.
S24. Figure 4: Bulletin seems more exhaustive. This could be discussed more and the authors could better highlight complementarities between data collection approaches, e.g., how would the proposed approach improve the Chinese bulletin?
S25. L298-L308: The analysis of media attention due to GDP biases is not significant and does not control for the population bias (see G4).
S26. L313-314: The two case studies were selected as the author assumed a good coverage because of their important hazard magnitude and impact. This is a known bias and an issue worth mentioning, as small-impact disasters tend to be less well-covered and documented. See Kron et al., 2012, Gall et al. 2009, and Delforge et al. 2023 and references therein for more insights about hazard catalog biases.
S27. L328-339 + Figure 7. These selected indicators are bad proxies of flood susceptibility, and I do not see how this analysis validates something about the spatial distribution of floods (see G5). Consider removing.
S28. L340: how the information was structured prior to harmonizing the data into the urban flood dataset is unclear. See also G6.
S29. Figures 8 and 9, it would be great to have an additional column or a time series on the Y axis with the annual total. This could help identify pluriannual cycles as a result of climate indices. Consider adding the total number of occurrences and items in the figure caption.
S30. L354: "seasonality" instead of "climate's tendency" could be more appropriate.
S31. L390: "exposure" or "susceptibility" (the environmental side of vulnerability) is maybe more appropriate than vulnerability because the latter also encompasses social vulnerability.
S32. Maps Figures 10, 11, and 12 could be grouped into a multipanel figure for conciseness. Consider adding population density as well since it drives hazard exposure. DEM and river networks may also be considered as information to include (parsimoniously).
S33. L409: The comparison with other datasets is quite limited, and the Chinese bulletin seems more exhaustive if one can trace the original data. To what extent the proposed dataset fills gaps is thus not very well documented (see G1). Adding more than one catalog from Table 1 and 2 in Figure 4 for comparison can improve this discussion.
S34. L473: The data availability section does not include the input news data accessibility information. In line with HESS recommendations and FAIR standards, I also encourage the authors to share information about code and model availabilities.
S35. L414-L416: this sentence (and the section in general) looks like the authors do their best to fit in the context of climate change and urbanization, even excluding some peak values to retrieve a positive trend. Trends, in particular for disaster news, are much more complex than trends observed on physical variables and include important social drivers and biases. The discussion is oversimplified, and the authors should take more distance and inquire about the biases arising from social sensing of hazards. See G3 and references.
S36. L445: Perspectives are neither exhaustive nor detailed. Consider adding more relevant perspectives, differentiating those related to the method (NLP-detection, extraction) and those related to the valorization of the resulting dataset.
S37. L473: data and code availabilities: see G7.
S38. Table A2: Same as Figure 4. It may be removed, in my opinion.
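On the question raised in S21, the sketch below shows how Exact Match and an overlap-based F1 behave under the common SQuAD-style definitions when only one of two gold locations is predicted. Whether the manuscript uses exactly these definitions is not stated in this discussion. Treating each location name as one token, and the place names "City A" and "City B", are assumptions made for illustration.

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the predicted answer string equals the gold string, else 0."""
    return int(pred.strip() == gold.strip())

def token_f1(pred_tokens, gold_tokens):
    """Harmonic mean of precision and recall over shared tokens."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative case: two gold locations, only one predicted.
gold_locations = ["City A", "City B"]
pred_locations = ["City A"]

em = exact_match(", ".join(pred_locations), ", ".join(gold_locations))
f1 = token_f1(pred_locations, gold_locations)
print("Exact Match:", em)
print("F1:", f1)
```

Under these definitions a partially correct multi-location answer scores zero on Exact Match while still earning partial credit on F1, which is why reporting both metrics, and stating the unit of comparison, matters for interpreting the results.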
Citation: https://doi.org/10.5194/hess-2024-146-RC1
- AC1: 'Reply on RC1', Heng Lyu, 06 Sep 2024
  
  Dear Reviewer,
  
  Thank you very much for your time involved in reviewing the manuscript and providing valuable feedback. Those comments are constructive for revising and improving our manuscript. We have taken the time to think through all of your comments and will carefully revise the manuscript according to each comment. The point-by-point response is in the supplement.
  
  Citation: https://doi.org/10.5194/hess-2024-146-AC1
RC2:
'Comment on hess-2024-146', Anonymous Referee #2, 01 Aug 2024
The aim of the paper is quite interesting, making use of new techniques to create a dataset from old media archives. There are some points that could use improvement, especially to ensure its relevance for the readers of HESS and the wider scientific community, which I have listed below. Overall I think the paper is quite interesting and provides a novel dataset that can be very useful, but the quality of the dataset and the choices for training the model need to be explained further.
From the abstract, but also the rest of the paper, the level of detail that this flood information dataset has is unclear. A spatial scale is mentioned as ‘county-level’, but that can vary quite a lot depending on where the reader is from. Connecting this to a typical length scale (1, 10, 100, … kilometres?) will make it more clear to a potential end-user whether this dataset is useful. Similarly: what kind of information is present about the flooding? Is it just spatial extent? Or also indications of amounts of water, timing or duration, damages done, etc etc. This should be immediately clear from the first reading, in both the abstract, as well as the results section. Related to this, table 1 is an overview of current flood disaster reports, which also doesn’t contain any information on the kind of data that’s in there. Giving both your, and the existing datasets that level of detail can make it clear what the advantage of this new methodology is in comparison to the existing ones. Also the validation data described in section 2.3 suffers from this lack of information.

The approach used seems quite specific for the Chinese language, using several specifically trained models and training input. It is worth discussing whether your approach also works for a completely different language group, so that this methodology can be applied in other data-scarce regions (e.g. the Global South).

Also regarding the approach: the media used are all newspaper databases, and only 2 different ones. Why is social media not included, or other sources of information? This seems to limit the potential of the method, since using one type of media source might be fairly uniform in its wording and phrasing, and perhaps not always covering all instances of floods. Furthermore, the restrictive choices on the keywords to select these articles might make the whole model biased: was there any form of testing with broader search terms, synonyms or other idioms for instance (like in L 152)? The model is strongly influenced by the choices for the training data obviously, but it seems to me like some additional testing of the influence of that training data is necessary.

Reading through the methodology it seems like a lot of manual preprocessing is still required, including manually annotating news texts. How much of a bottleneck is that for operational purposes, if you really will have a constantly updating database? This requires some discussion since it directly impacts the applicability of this dataset.

L 235: the exclusion of any texts with the word ‘will’ seems like it can introduce giant margins of error. I get the reasoning to exclude forecasts, but if ‘will’ is used in a different context in a text that is actually related to flooding (e.g. ‘damaged roads will re-open in 4 days’), is the whole text then still excluded?
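The concern can be illustrated with a small sketch. It is an English-language toy example only: the manuscript works on Chinese news text, and the authors' actual rule is not reproduced in this discussion. The marker list `FORECAST_MARKERS`, both function names, and the sample report are hypothetical.

```python
import re

FORECAST_MARKERS = ("will",)  # hypothetical marker list

def excluded_at_text_level(text):
    """True if any marker appears anywhere in the text."""
    return any(
        re.search(rf"\b{re.escape(marker)}\b", text, re.IGNORECASE)
        for marker in FORECAST_MARKERS
    )

def kept_sentences(text):
    """Drop only the sentences that contain a marker; keep the rest."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if not excluded_at_text_level(s)]

# Made-up report mixing an observed flood with a forward-looking remark.
report = (
    "Heavy rain flooded the city centre on Monday. "
    "Damaged roads will re-open in 4 days."
)

print("excluded as a whole:", excluded_at_text_level(report))
print("kept at sentence level:", kept_sentences(report))
```

A text-level rule discards the whole report, including the sentence describing an actual flood, whereas a sentence-level rule keeps that sentence. Which granularity the authors applied is exactly what this comment asks them to clarify.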

The choice of GDP as a clustering method is odd to me. Why not use population density instead? That does correlate somewhat with GDP (so you still get reports on economic losses) but the loss of human life also hugely matters in disaster reporting, I’d think.

Figure 6: This figure doesn’t seem too relevant to the paper to warrant inclusion. A typhoon is certainly going to lead to flooding but the spatial scale is so wide that it’s not a great verification in my opinion.

Figures 8 and 9: occurrence is here shown without any distinction of severity of flooding, whereas the latter one might be more relevant for actual use of the dataset.

Some smaller remarks that came up during reading:
L230: I don’t understand what the authors mean with ‘3 epochs’ and a learning rate of 5 x 10^-5. Please elaborate.

L 275: ‘verify and revise’: is this part of the preprocessing? What does this mean, exactly?

Figure 4: Any idea what causes the large biases? This is hardly discussed.
Citation: https://doi.org/10.5194/hess-2024-146-RC2
- AC2: 'Reply on RC2', Heng Lyu, 06 Sep 2024
  
  Dear Reviewer,
  We have carefully considered each of the comments and will make the necessary revisions according to your comments. Please refer to the attached document for a point-by-point response.
  
  Citation: https://doi.org/10.5194/hess-2024-146-AC2
RC3:
'Comment on hess-2024-146', Anonymous Referee #3, 02 Aug 2024
Review comments to "Extracting Spatiotemporal Flood Information from News Texts Using Machine Learning for a National Dataset in China"

This paper presents an innovative approach to constructing a national flood event dataset by utilizing news data and machine learning techniques, extracting the time and location of thousands of flood events in China from 2000 to 2022. The topic is highly relevant and crucial for understanding the spatiotemporal distribution of urban floods in China. However, several major issues need to be addressed before the manuscript can be considered for publication.
Comments:
The authors provide a comprehensive introduction to existing natural disaster datasets that record flood events created by official sources, other governments, or organizations. However, the manuscript would benefit from a more detailed discussion on how this study specifically addresses the gaps in these existing datasets. It is essential to clearly state the novelty and significance of your work in the context of existing datasets. For instance, do the deficiencies in these existing datasets affect the analysis, modeling, and prediction of flood events to some extent? How does the new dataset you have developed alleviate these issues at both theoretical and practical application levels?

Line 143 “After a manual review to remove duplicates and irrelevant entries, including those referring to flash floods which occur suddenly in mountainous areas and are not the focus of this study, the final dataset consisted of 253 relevant news articles”. The data preparation section needs more details. Please explain the criteria used for manually reviewing and removing irrelevant news articles from the CNKI database. Additionally, discuss any potential biases or limitations introduced by this manual selection process.

Similarly, Line 145 “These relevant news articles were then segmented into paragraphs and reorganized into 633 distinct samples. Among them, 503 samples were used to fine-tune the BERT model, alongside data from the CMRC2018 dataset, enhancing the model's stability to accurately extract flood disaster information. The remaining 130 samples served as a test set to evaluate the model’s performance.” Please clarify how the 503 samples were selected from the 633 distinct samples, and explain why the remaining 130 samples were used to evaluate the model’s performance. This selection process is currently unclear and confusing.
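For reference, one common and easily documented way to obtain such a split is a seeded random shuffle, sketched below. This is not a description of what the authors did, which the quoted passage does not state. The function `split_samples`, the seed value, and the placeholder sample names are assumptions for illustration; only the counts 633, 503, and 130 come from the manuscript.

```python
import random

def split_samples(samples, n_test, seed=0):
    """Shuffle a copy of the samples with a fixed seed, then split."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder identifiers standing in for the 633 annotated samples.
samples = [f"sample_{i}" for i in range(633)]
train, test = split_samples(samples, n_test=130, seed=42)

print(len(train), len(test))
```

Reporting the seed and the splitting rule makes the evaluation reproducible and lets readers judge whether the test set is representative.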

For the identification of flood locations, I have a general question. From my understanding, news media reports about flooding occurrences typically mention the affected city or, at most, the district. However, actual urban flooding can occur at the street level or even smaller scales. Could you please provide a detailed explanation of how the BiLSTM-CRF model was trained and applied to recognize flood locations?

Regarding the performance of the BERT model (Table 4), it appears that the authors have only examined results based on a binary classification (flood vs. non-flood). If this is the case, the task seems too simple and lacks sufficient novelty. Could the authors also provide an evaluation of the model’s performance in identifying the time and location of flood events?

It seems that the number of identified flooded cities is significantly underestimated by the news media compared to the China Flood and Drought Bulletin (Figure 4). The authors suggest this discrepancy is related to the low attention given to low GDP areas. However, this raises a significant concern about the reliability of the developed dataset. As mentioned in section 4.3, the dataset records urban flood events reported in news articles from 2000 to 2022. If the news media is so inaccurate that it fails to record a large number of flood events, how can the authors ensure the reliability of the data generated from these news sources?

Figure 6 is not directly related to your results, I think you can put it into supplementary materials.
Citation: https://doi.org/10.5194/hess-2024-146-RC3
- AC3: 'Reply on RC3', Heng Lyu, 06 Sep 2024
  
  Dear Reviewer,
  We have carefully considered each of the comments and will make the necessary revisions according to your comments. The specific responses are in the supplement.
  
  Citation: https://doi.org/10.5194/hess-2024-146-AC3
RC4:
'Comment on hess-2024-146', Anonymous Referee #4, 03 Aug 2024

I joined the reviewer team late, so I won't repeat the many valid points raised by other colleagues. Simply put, I see the potential of this work beyond just publication—it may enable a new paradigm for the urban hydrology community to better understand the socioeconomic impacts of urban flooding using emerging language-oriented machine learning techniques. That said, there are several concerns I'd like the authors to address/comment on in a revised manuscript before this work could be published.
1. It's a bit surprising that this work is still based on BERT and doesn't mention anything about the emerging large language model (LLM) techniques (e.g., GPT-4). Please comment on this choice and discuss potential improvements if newer techniques could be used.

2. Given the focus of this dataset on cities, the analysis of the contributed dataset seems somewhat less pertinent. For instance, the large-scale climate zone analysis is rather off-topic. Instead, one would expect to see if such a dataset could be linked with urban-specific features (e.g., built-up area, urban volumetric density, GDP) to reveal more city-scale findings.
## Other Minor Comments

1. Line 375: "Lanzhou Province" - Lanzhou is **not** a province but the capital city of Gansu Province.

2. The dataset should be archived more appropriately following the FAIR principle as suggested by reviewer 1. In addition, the GitHub repo needs more necessary README info, such as a description of the dataset, citation, etc. Also, `xlsx` is not recommended for simple tabular formats—please consider publishing this dataset in `csv` for better accessibility to allow better open research.

Citation: https://doi.org/10.5194/hess-2024-146-RC4
- AC4: 'Reply on RC4', Heng Lyu, 06 Sep 2024
  
  Dear Reviewer,
  We have carefully considered each of the comments and will make the necessary revisions to address the concerns raised. Please see the attached document for a point-by-point response.
  
  Citation: https://doi.org/10.5194/hess-2024-146-AC4

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Publish subject to revisions (further review by editor and referees) (27 Sep 2024) by Marnik Vanclooster

AR by Heng Lyu on behalf of the Authors (07 Nov 2024) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (07 Nov 2024) by Marnik Vanclooster

RR by Anonymous Referee #1 (30 Nov 2024)

ED: Publish subject to technical corrections (03 Dec 2024) by Marnik Vanclooster

AR by Heng Lyu on behalf of the Authors (12 Dec 2024) Manuscript

Post-review adjustments

AA – Author's adjustment | EA – Editor approval

AA by Heng Lyu on behalf of the Authors (10 Feb 2025) Author's adjustment Manuscript

EA: Adjustments approved (10 Feb 2025) by Marnik Vanclooster

Editorial statement

This paper uses information from news sites with natural language processing tools to infer data on a hydrological process at the regional scale (flooding). The paper demonstrates the technique's applicability and opens new avenues to use advanced computing techniques and web resources to improve the understanding of hydrological processes.