the Creative Commons Attribution 4.0 License.
Extracting Spatiotemporal Flood Information from News Texts Using Machine Learning for a National Dataset in China
Abstract. Urban floods present a threat in China, demanding an understanding of their spatiotemporal distribution. Current flood datasets primarily offer provincial-scale insights and lack temporal continuity, which leads to a challenge in detailed analysis. To create a consistent national dataset of flood events, this study introduces a machine learning framework by applying online news media as a primary data source to construct a county-level dataset of urban flood events from 2000 to 2022. Using the Bidirectional Encoder Representations from Transformers (BERT) model, we achieved robust performance in information extraction, with an F1 score of 0.86 and an exact match score of 0.82. Further, a combined model of Bidirectional Long Short-term Memory (BiLSTM) networks with a Conditional Random Field (CRF) layer effectively identified flood locations. Our analysis reveals that the temporal trend of flooded cities in the news-based dataset is similar to the China Flood and Drought Bulletin. Furthermore, the consistency of flood events in the news with the typhoon trajectory in two cases, and the connection between flood occurrences and flood conditioning factors, confirm the accuracy of spatial distribution. The validated news-based dataset analyzes urban floods in China from both temporal and spatial perspectives. First, this dataset shows the seasonal characteristics of flood events, which are concentrated in the summer. From 2000 to 2022, the peak year for floods was 2010, and excluding the influence of peak year, the overall temporal trend of total flood occurrence shows an increase. Spatially, the distribution of floods decreases from southeast to northwest, with Guangxi Province having the highest number of floods. Additionally, the Yangtze and Pearl River basins are most frequently affected by urban floods. The subtropical climate zone is the most susceptible to flooding. 
This study provides an automated and effective method for constructing a national flood event dataset and reveals the spatiotemporal characteristics of urban flooding in China.
Status: open (until 01 Aug 2024)
RC1: 'Comment on hess-2024-146', Anonymous Referee #1, 24 Jun 2024
Manuscript Review of "Extracting Spatiotemporal Flood Information from News Texts Using Machine Learning for a National Dataset in China"
Overview
The authors developed a machine learning approach based on Natural Language Processing (NLP) applied to news data to identify urban-flood-related news items and extract the timing and location of thousands of flood events in China from 2000 to 2022. The research is original and aligns with the burgeoning trend of NLP-driven research and applications, including in disaster risk research. The proposed approach is efficient, demonstrating high-performance metrics for both event detection and information extraction.
The proposed approach and resulting dataset could serve as a valuable, significant, and complementary basis for future research and improving risk management and modeling practices. Historical catalogs of flood hazards, crucial for understanding flood risks, remain scarce and biased, whether constructed from textual documents or satellite imagery, both regionally and globally. Given the high performance of the event detection and information extraction approach and the high number of retrieved events, I conclude that the research provides significant results.
However, my primary expectation for the introduction, and for its expansion in the discussion section, was to gain more insight into how this approach compares to existing NLP-based flood event extraction studies and how the resulting dataset fills gaps in existing catalogs. In the current manuscript version, these aspects are not sufficiently covered. While some NLP approaches are introduced, the manuscript lacks a sufficient overview of the state of the art on this hot topic despite a high number of references, not all of which are directly relevant. Moreover, the performance of the proposed method is not compared with existing methods, leaving the reader without a clear understanding of how this method ranks within its application domain (considering language differences). Similarly, while a few catalogs of events are introduced, the resulting catalog is not sufficiently compared to these existing ones, so the paper does not demonstrate that it narrows the knowledge gap it targets.
On the other hand, eight figures attempt to depict the spatiotemporal and environmental context of the resulting dataset, which could be done more efficiently to reorient the paper towards a more balanced discussion. Additionally, I raise in the general comments some concerns about the considered flood types, the analysis of the GDP-economic drivers of flood reporting that do not consider population density, the chosen flood susceptibility indicators, and a few concerns about FAIR standards and the format of the shared dataset.
In conclusion, while I consider the paper a valuable piece of research and application, it is weakened, in my opinion, by insufficient contextualization and by the lack of a clear account of how its results rank and fit within that context.
 
General Comments
Note: I compare the authors' catalog initiative with the Global Landslide Catalog (GLC) to support some general comments (see Kirschbaum et al., 2010, 2015; Dandridge et al., 2023).
G1. Flood Query Keywords
The flood query was limited to "flood" and "flood disasters" (L142, L154), while many other terms could hint at flood events in news items, e.g., "typhoon," "cyclone," "mud," "heavy rainfall," "inundated areas,"… Query terms are an essential aspect of event detection and this could be seen as a restriction limiting the detection power of the proposed approach. It raises some questions: Should this be documented as a limitation? Is it a decision to limit the size of the corpus? Does the Q&A approach prevent that concern?
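To make the concern in G1 concrete, the sketch below contrasts a narrow query with a broader one on a handful of invented headlines. The headlines, the helper names `matches_any` and `select`, and the English keyword lists are illustrative assumptions. The manuscript's actual queries are in Chinese and its retrieval pipeline is not reproduced here.

```python
# Hypothetical headlines, invented for illustration only.
HEADLINES = [
    "Flood submerges streets in the city center",
    "Typhoon leaves inundated areas across the county",
    "Heavy rainfall closes metro stations",
    "Local team wins the football final",
]

# Narrow query as described in the review; broad query adds the reviewer's suggested terms.
NARROW_TERMS = ["flood", "flood disasters"]
BROAD_TERMS = NARROW_TERMS + [
    "typhoon", "cyclone", "mud", "heavy rainfall", "inundated areas",
]


def matches_any(text, terms):
    """Return True if any query term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def select(headlines, terms):
    """Keep only the headlines matched by at least one term."""
    return [h for h in headlines if matches_any(h, terms)]


narrow_hits = select(HEADLINES, NARROW_TERMS)
broad_hits = select(HEADLINES, BROAD_TERMS)
print(len(narrow_hits), len(broad_hits))
```

In this toy corpus the narrow query misses the typhoon and heavy-rainfall items, which is the kind of recall loss the comment asks the authors to document or rule out.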
G2. Flood Types and Multi-Hazard Concerns
The paper focuses on urban floods, excluding other types of floods, yet flood types are interrelated and very often not mutually exclusive. Hence, referring, for instance, to the Hazard Information Profiles (HIPs, https://www.preventionweb.net/drr-glossary/hips ), an urban flood could also be related to a flash flood (despite the exclusion of "flash flood" from the query, L151), a riverine flood, a coastal flood, or a groundwater flood. Floods are also secondary hazards associated with other hazards, such as a flood that results from a typhoon, heavy rainfall, a storm surge, or an intense monsoon. Floods are also associated with geo-hazards such as landslides (see GLC studies). I found the typhoon case study in the paper interesting. It also illustrates the multi-hazard nature of floods well. As in GLC studies, I would be interested in having the authors' view on multi, cascading, and co-occurring type issues, the possibilities of detecting multi-type floods, and the challenges, limitations, and perspectives concerning their proposed approach.
G3. A More Balanced Discussion: Trend Analyses vs. Gap Filling Potential
The manuscript extensively discusses spatiotemporal trend analysis, which calls for more caution and clarity about the factors influencing trends. I understand the need to illustrate trends in the resulting dataset, but, in my opinion, this matter could be more efficiently summarized, and the paper could be more descriptive and less assertive in the interpretation. Some analyses are simplistic and do not go deep enough. Rather than make the paper even longer, I invite the authors to distinguish more between the essential and the accessory and, if anticipated, to cover in greater depth the spatiotemporal analysis of events and cross-referencing with third-party data in other papers (see GLC studies).
Some figures may be grouped, e.g., maps in different panels of one figure, making it possible to focus not only on the trends of the output data but also on how the output data compares to other datasets, which is currently limited to Figure 4, despite the numerous datasets being listed in the introduction. The reader has little clue as to what gap is being filled. In particular, the Chinese bulletin appears as a more exhaustive dataset (although coarser). This point may be worth further discussion.
Note regarding temporal trends:
Trends in hazard occurrences are complex, influenced by variations in hazard intensity and alteration of environmental susceptibility, as well as demographic shifts that alter exposure or vulnerability. Moreover, climatic cycles (e.g., ENSO or other climate indices) can distort linear trend estimations over brief periods due to their cyclical nature. The complexity is further compounded when analyzing trends from news data. Changes in reporting capacity, especially in remote areas, along with new communication technologies like satellite and social media, may introduce significant biases. The proliferation of the internet during the 1990s and 2000s has notably impacted flood event reporting (Gall et al., 2009; Kron et al., 2012; Delforge et al., 2023). Kron et al. (2012) illustrate well the challenges in building a hazard database with flood examples. These works underscore the necessity for standardized flood event definitions to mitigate discrepancies in reporting scales. In the case of news scraping, the framing by journalists can significantly alter the perceived frequency, spatial representation, and the type of events.
In conclusion, the total number of flood events is a highly relative figure. It is essential to acknowledge that while flood hazards are natural phenomena, flood disasters and their reporting are social phenomena with potentially distinct and diverging trend patterns. Given these complexities, attributing trends depicted in the news (i.e., social variables, not physical ones) to climate change or land use changes requires careful consideration.
G4. Analyses of GDP
The manuscript highlights the GDP as the primary driver of media attention. However, the boxes in Figure 5 do not seem to show any significant difference between the occurrence of floods for different GDP groups. So, to highlight a possible effect of GDP on media attention, it is vital to use GDP per capita (see GLC studies).
The population is a critical factor in media attention and hazard exposure. More densely populated cities should receive more media attention in the event of a flood. It is likely the primary factor explaining the spatial patterns in the dataset. It is likely to be correlated with GDP, as well as other factors such as elevation, distance to river or coast, or climate (see G5). Therefore, controlling that factor when investigating some effects is essential.
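The normalization suggested in G4 can be sketched as follows. All city records, field names, and the helper `normalize` are hypothetical and chosen only to show the arithmetic. None of the numbers come from the manuscript.

```python
# Hypothetical city records, invented for illustration; not values from the manuscript.
CITIES = [
    {"name": "City A", "gdp_billion_cny": 1200.0, "population_million": 12.0, "flood_reports": 60},
    {"name": "City B", "gdp_billion_cny": 300.0, "population_million": 2.0, "flood_reports": 12},
    {"name": "City C", "gdp_billion_cny": 90.0, "population_million": 3.0, "flood_reports": 15},
]


def normalize(city):
    """Derive GDP per capita and flood reports per million inhabitants."""
    pop = city["population_million"]
    return {
        "name": city["name"],
        # billion CNY / million people = thousand CNY per person
        "gdp_per_capita_kcny": city["gdp_billion_cny"] / pop,
        "reports_per_million": city["flood_reports"] / pop,
    }


normalized = [normalize(c) for c in CITIES]
for row in normalized:
    print(row)
```

In this toy example the raw counts rank City A far above the others, yet reports per million inhabitants are nearly identical across the three cities, which is why an apparent GDP effect should be checked against population before it is interpreted.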
G5. Analyses of Flood Susceptibility
Figure 7 and the underlying analysis of flood susceptibility present some issues and do not bring much to the paper. The proposed pattern is not very neat (the points also overlap with no transparency), likely because the chosen indicators are quite remote proxies of flood susceptibility and should not be presented as acknowledged indicators in hydrology (the supporting references are weak).
Average daily precipitation depicts a hydrological equilibrium rather than an extreme event. Naturally, arid regions are less susceptible (also less populated, hence, exposed). However, the indicator becomes less relevant to other hydrological systems with higher precipitation averages (a mixture of blue and red dots). Likewise, elevated areas are also likely to be less populated and then less exposed, and the elevation effect tends to disappear at a lower elevation. Flow accumulation or topographical wetness indices could have been more reliable indicators of flood susceptibility.
I would recommend removing this analysis given its low informative value and also because these variables are related to climate variability, which is already pictured in Figure 12. See GLC studies for comparisons.
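For reference, the topographic wetness index mentioned in G5 is commonly written as below, where $a$ is the specific catchment area (upslope contributing area per unit contour width) and $\beta$ is the local slope angle. This is the standard textbook form, given here as background and not as a formula taken from the manuscript.

```latex
\mathrm{TWI} = \ln\!\left(\frac{a}{\tan\beta}\right)
```

High values correspond to flat locations with a large contributing area, where water tends to accumulate, which is why such an index is a more direct proxy of susceptibility than mean precipitation or elevation alone.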
G6. Flood Events Dataset Resolution
While the final dataset is reported at the county-month level, the reader is left with little insight into the level of detail directly resulting from the information extraction process, which remains unclearly described. Based on Figures 4 and 6, it appears that information at the city-daily level was collected. It seems that a much more precise dataset could have been shared without much additional effort, raising questions about the motivation behind aggregating the data to such a coarse level.
G7. Data Content, FAIR Principles, and Reusability
Also, given that a central outcome of the paper is a dataset, alignment with FAIR principles (https://www.go-fair.org/) should be particularly encouraged. Regarding the data shared, GitHub is not considered FAIR as it does not allow for persistent identifiers. Also, a few additions could greatly increase the reusability of the dataset, e.g., precise column descriptions in the readme, the reference for the administrative unit shapefile to link the data with the post-code or administrative units as described in the paper (L275-278), the use of international time standards, and possibly the translation of region names into English to maximize reuse in the global context.
Regarding reproducibility, the data and code availability section could be improved. Input news data and their conditions of (re-)use are not described in this section. Tools and libraries being used to develop the approach are not referred to (except references to the Python "Re" module at L187). There is no comment about whether or not the developed models are accessible and under which conditions of use.
There are no links or references to the news articles that have been used to construct the dataset. Sharing the links could drastically increase the paper's outreach and support future research and NLP applications to extract additional information, such as flood impact variables or associated hazard types, without redeveloping an NLP flood event detection model. Annotated corpora are also valuable datasets in the context of NLP for future benchmarking. Consider commenting on that dataset as well.
References
- Dandridge, C., Stanley, T. A., Kirschbaum, D. B., and Lakshmi, V.: Spatial and Temporal Analysis of Global Landslide Reporting Using a Decade of the Global Landslide Catalog, Sustainability, 15, 3323, https://doi.org/10.3390/su15043323, 2023.
- Delforge, D., Wathelet, V., Below, R., Lanfredi Sofia, C., Tonnelier, M., Loenhout, J. van, and Speybroeck, N.: EM-DAT: the Emergency Events Database, preprint, https://doi.org/10.21203/rs.3.rs-3807553/v1, 2023.
- Gall, M., Borden, K. A., and Cutter, S. L.: When Do Losses Count?: Six Fallacies of Natural Hazards Loss Data, Bulletin of the American Meteorological Society, 90, 799–810, https://doi.org/10.1175/2008BAMS2721.1, 2009.
- Kirschbaum, D., Stanley, T., and Zhou, Y.: Spatial and temporal analysis of a global landslide catalog, Geomorphology, 249, 4–15, https://doi.org/10.1016/j.geomorph.2015.03.016, 2015.
- Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., and Lerner-Lam, A.: A global landslide catalog for hazard applications: method, results, and limitations, Nat Hazards, 52, 561–575, https://doi.org/10.1007/s11069-009-9401-4, 2010.
- Kron, W., Steuer, M., Löw, P., and Wirtz, A.: How to deal properly with a natural catastrophe database – analysis of flood losses, Natural Hazards and Earth System Sciences, 12, 535–550, https://doi.org/10.5194/nhess-12-535-2012, 2012.
Specific Comments
S1. L8: "similar" could be more nuanced.
S2. L9-10: "the connection between…": the connection does not support accuracy and the analysis is oversimplistic (see G5).
S3. L43 (and after): "natural disaster" is a controversial terminology often avoided by Disaster Risk experts, acknowledging that a disaster is not natural (as opposed to natural hazards).
S4. L43-L52: Table 2 could distinguish between catalogs from remote and social sensing, e.g., that DFO is based on remote sensing, EM-DAT on the collection of text documents and manual extraction of the information. Some missing recent initiatives could be worth mentioning, e.g., the Global Flood Database, a global remote sensing catalog, and a global catalog obtained from social media:
- Tellman, B., Sullivan, J.A., Kuhn, C. et al. Satellite imaging reveals increased proportion of population exposed to floods. Nature 596, 80–86 (2021). https://doi.org/10.1038/s41586-021-03695-w
- J.A. de Bruijn, H. de Moel, B. Jongman, M.C. de Ruiter, J. Wagemaker, J.C.J.H. Aerts. A global database of historic and real-time flood events based on social media. Scientific Data, 6 (1) (2019), p. 311, 10.1038/s41597-019-0326-9
- G.R. Brakenridge. Global Active Archive of Large Flood Events. Dartmouth Flood Observatory, University of Colorado, USA. http://floodobservatory.colorado.edu/ Archives/ (Accessed xxx)
- Delforge, D., Wathelet, V., Below, R., Lanfredi Sofia, C., Tonnelier, M., Loenhout, J. van, and Speybroeck, N.: EM-DAT: the Emergency Events Database, preprint, https://doi.org/10.21203/rs.3.rs-3807553/v1, 2023.
S5. L65: Beyond cloud cover for optical imagery, mapping urban floods is challenging per se.
S6. L75: "Yang et al. (2023)": Such a paper of high relevance should be rediscussed later in the discussion section, among others, to identify (see Overview).
S7. L77: The authors acknowledge the multi-hazard nature of floods here and after, but the issue is not discussed in light of their own work (see G2).
S8. L90: "Conditional Random Fields (CRF) layer" appears to be a central part of the methodology appearing multiple times in the paper; however, it lacks a clear explanation of what it is and why it is used.
S9. L110-116: since the paper follows a conventional structure, it is unnecessary to detail it in the introduction.
S10. Table 2: EM-DAT is continuously updated (see Delforge et al., 2023). I would also refer to the Global Flood Awareness System (https://global-flood.emergency.copernicus.eu/), the flood component of CEMS, instead of CEMS. See also S4.
S11. L134: check the URL (404 error).
S12. Figure 1: I appreciate the availability of an example. However, consider selecting a more topic-appropriate example or asking a where/when question for more relevance.
S13. L142, L151, and L154: See G1.
S14. L145-148: The description of the data and its processing, including test/train split, may be confusing. It may be more appropriate to move it to the method section.
S15. L157: "Validation": unless the China Flood and Drought Bulletin is considered a gold standard, I think referring to comparative data and cross-comparison instead of validation is more appropriate.
S16. L168-L174: oversimplistic view of hydrology and weak references. See G5.
S17. L190-199: This section could indicate the total/train/test sample sizes more clearly.
S18. L235: words should be singular in "and does contain the words 'will'…". Also, I wonder whether this approach successfully separated actual events from forecasts. Is there any language specificity in Chinese involved here?
S19. Figure 3: Is [SEP] a requirement given the specificity of the Chinese language?
S20. L243: In the first sentence, correct "flood information extraction" to "(i) flood event detection and (ii) flood information extraction" for clarity.
S21. L259: it is not clear to me how Exact Match behaves in case of multiple locations: is it zero if there is any error? What exactly is meant by the location data? City? County? How is location handled before the flood location recognition is explained in section 3.2? Perhaps section 3.2 should come first.
S22. L276: consider adding the reference of the used administrative unit shapefile. See also G7.
S23. L285, section 4.1: The performance seems good in absolute terms, but the reader has no clue how this performs in relation to the context of social sensing of flood or in the context of Chinese NLP. This is quite important to document.
S24. Figure 4: The Bulletin seems more exhaustive. This could be discussed more and the authors could better highlight complementarities between data collection approaches, e.g., how would the proposed approach improve the Chinese bulletin?
S25. L298-L308: The analysis of media attention due to GDP biases is not significant and does not control for the population bias (see G4).
S26. L313-314: The two case studies were selected as the authors assumed a good coverage because of their important hazard magnitude and impact. This is a known bias and an issue worth mentioning, as small-impact disasters tend to be less well-covered and documented. See Kron et al., 2012, Gall et al., 2009, and Delforge et al., 2023 and references therein for more insights about hazard catalog biases.
S27. L328-339 + Figure 7: These selected indicators are bad proxies of flood susceptibility, and I do not see how this analysis validates something about the spatial distribution of floods (see G5). Consider removing.
S28. L340: how the information was structured prior to harmonizing the data into the urban flood dataset is unclear. See also G6.
S29. Figures 8 and 9: it would be great to have an additional column or a time series on the Y axis with the annual total. This could help identify pluriannual cycles as a result of climate indices. Consider adding the total number of occurrences and items in the figure caption.
S30. L354: "seasonality" instead of "climate's tendency" could be more appropriate.
S31. L390: "exposure" or "susceptibility" (the environmental side of vulnerability) is maybe more appropriate than vulnerability because the latter also encompasses social vulnerability.
S32. Maps in Figures 10, 11, and 12 could be grouped into a multipanel figure for conciseness. Consider adding population density as well since it drives hazard exposure. DEM and river networks may also be considered as information to include (parsimoniously).
S33. L409: The comparison with other datasets is quite limited, and the Chinese bulletin seems more exhaustive if one can trace the original data. To what extent the proposed dataset fills gaps is thus not very well documented (see G1). Adding more than one catalog from Tables 1 and 2 in Figure 4 for comparison can improve this discussion.
S34. L473: The data availability section does not include the input news data accessibility information. In line with HESS recommendations and FAIR standards, I also encourage the authors to share information about code and model availabilities.
S35. L414-L416: this sentence (and the section in general) looks like the authors do their best to fit in the context of climate change and urbanization, even excluding some peak values to retrieve a positive trend. Trends, in particular for disaster news, are much more complex than trends observed on physical variables and include important social drivers and biases. The discussion is oversimplified, and the authors should take more distance and inquire about the biases arising from social sensing of hazards. See G3 and references.
S36. L445: Perspectives are neither exhaustive nor detailed. Consider adding more relevant perspectives, differentiating those related to the method (NLP-detection, extraction) and those related to the valorization of the resulting dataset.
S37. L473: data and code availabilities: see G7.
S38. Table A2: Same as Figure 4. It may be removed, in my opinion.
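The question raised in S21 can be illustrated with a minimal sketch that assumes SQuAD-style definitions of Exact Match and token-level F1. The manuscript may compute these scores differently. The function names, the whitespace tokenization, and the example answer strings are assumptions made for illustration. For Chinese text, tokens would more likely be characters or segmented words.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical multi-location answer: the prediction misses one of three places.
reference = "Nanning Liuzhou Guilin"
prediction = "Nanning Liuzhou"
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(em, round(f1, 2))
```

Under these assumed definitions, a single missing location drives Exact Match to zero while F1 still gives partial credit, so stating which convention the manuscript follows would answer the question directly.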
Citation: https://doi.org/10.5194/hess-2024-146-RC1
Data sets
China-urban-flood-dataset Shengnan Fu, David M. Schultz, Heng Lyu, Zhonghua Zheng, and Chi Zhang https://github.com/shengnan0218/China-urban-flood-dataset
Viewed
- HTML: 233
- PDF: 46
- XML: 13
- Total: 292
- BibTeX: 7
- EndNote: 8