Water resources management in Latin America and the Caribbean is particularly threatened by climatic, economic, and political pressures. To assess the region's ability to manage water resources, we conducted an unprecedented literature review of over 20 000 multilingual research articles using machine learning and an understanding of the socio-hydrologic landscape. Results reveal that the region's vulnerability to water-related stresses, and drivers such as climate change, is compounded by research blind spots in niche topics (reservoirs and risk assessment) and subregions (Caribbean nations), as well as by its reliance on an individual country (Brazil). A regional bright spot, Brazil, produces well-rounded water-related research, but its regional dominance suggests that funding cuts there would impede scientifically informed water management in the entire region.
Despite being the world's most water-rich region, Latin America and the Caribbean (LAC) faces extreme weather events and a range of water-related stresses that are expected to worsen with climate change.
Freshwater resources face mounting pressures brought about by human population growth and urbanization.
LAC is among the most urbanized regions in the world, and population densities in water-rich areas can be many times higher than in arid areas, as in Argentina. These high-density areas are particularly vulnerable to problems of water quality and supply reliability.
While uncertainty surrounds the reliability of water supplies to address water-related risks and meet future needs in LAC, water resources management is a relatively young field of study.
Given these circumstances, it is critical to understand how the breadth of past water resources research across LAC contributes to the scientific knowledge necessary for decision-making processes.
To perform the literature review, we assembled a corpus of 20 000 water resources research articles in English, Spanish, and Portuguese by querying online databases and modeled the topics of each article with latent Dirichlet allocation.
To contextualize results from the literature review, we used publicly available data to statistically cluster countries into groups with similar social and hydrological systems, socio-economic metrics, and measures of water resources abundance and use. This clustering process allowed for more meaningful interpretation of subsequent results within and across countries. To further ground our results in the current research landscape, we invited 20 000 corresponding authors from our corpus to share their experiences through a survey focused on research discipline, accessibility, and connectivity. A total of 1969 respondents from 35 countries and a variety of disciplinary backgrounds completed this survey.
Bright spots and blind spots of water resources research in LAC were evaluated using three concepts: abundance, spread, and connectivity. Abundance was measured as research volume by country and by topic. Spread was estimated by topic normality across countries and articles, describing how close a topic's probability distribution is to the standard normal distribution. Connectivity was determined with a weighted citation network across countries and topics, describing the probability that a specific node (country or topic) is cited by other nodes.
The rest of the article is organized as follows. The next section details how the data underlying this study were acquired. Section 3 presents the methods we used to associate metadata (e.g. topic, location) with each document in the corpus. In that section, we also introduce the specific metrics and methods used to infer bright and blind spots. Section 4 presents the results associated with metadata generation. Section 5 presents and discusses bright spots and blind spots. The last section summarizes our findings.
In this section, we detail the processes for collecting the corpus and for retrieving socio-hydrologic and survey data.
The process of corpus collection consisted of four steps: (1) querying online databases; (2) retrieving documents; (3) iteratively assessing the quality of the corpus and correcting bias; and (4) cleaning the corpus.
First, we defined the query to obtain water resources research about Latin American and Caribbean (LAC) countries. We selected a peer-reviewed literature database based on the following criteria: (1) inclusion of journals from LAC (e.g. SciELO); (2) number of query results in English, Spanish, and Portuguese; and (3) expert assessment of the results with a focus on relevancy. Web of Science and Scopus databases were chosen.
Results from the query were assembled into
Second, in accordance with institutional licensing, EndNote was used to retrieve documents from the assembled reference files. This method of corpus collection showed an unequal return rate across the three languages (Table S1 in the Supplement), which was corrected for in the subsequent steps.
Third, we performed a quality assessment of the corpus to assess possible bias between query results and document retrieval within distinct sources or time periods (Year). We defined the within-corpus bias as the difference in relative frequencies
Lastly, we performed a cleaning process to prepare the article texts for the topic model. Texts were converted into a standardized version in which all text was lowercased, words not found in a dictionary were removed, and special patterns such as emails and URLs were assigned tags. Additional cleaning removed punctuation (except for apostrophes and hyphens inside words) and single-letter words. Finally, lemmatization (the reduction of a word to a common base form) was performed using TreeTagger software by changing all nouns into singular form and all verbs into the infinitive form.
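As an illustration only, the cleaning steps described above could be sketched in Python as follows. The tag names, the regular expressions, and the function name `clean_text` are assumptions made for this sketch and are not taken from the study; lemmatization with TreeTagger is an external step and is not reproduced here.

```python
import re

# Hypothetical tag names: the text does not state which tags were used.
EMAIL_TAG = "emailtag"
URL_TAG = "urltag"

# Simple, illustrative patterns for emails and URLs (assumptions).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# A token is a run of letters, optionally joined by apostrophes or
# hyphens that sit inside the word; all other punctuation is dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def clean_text(text, dictionary=None):
    """Return a list of cleaned tokens for one document."""
    text = text.lower()                            # lowercase everything
    text = URL_RE.sub(f" {URL_TAG} ", text)        # tag URLs
    text = EMAIL_RE.sub(f" {EMAIL_TAG} ", text)    # tag emails
    tokens = TOKEN_RE.findall(text)                # strip punctuation
    tokens = [t for t in tokens if len(t) > 1]     # drop single-letter words
    if dictionary is not None:
        # Keep only dictionary words, plus the tags introduced above.
        keep = set(dictionary) | {EMAIL_TAG, URL_TAG}
        tokens = [t for t in tokens if t in keep]
    return tokens


if __name__ == "__main__":
    print(clean_text("See https://example.org: the river's flow-rate is HIGH!"))
```

The dictionary filter is optional in this sketch because the dictionary used in the study is not specified; any word list for the relevant language could be passed in.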
To build a database of socio-hydrologic country descriptors, we collected 43 relevant indicators from five indexes to compare and contrast LAC countries (Table S2, Data S1). We selected the indicators from the following databases, chosen for their international recognition and global or LAC-wide breadth of data:
AQUASTAT, the Environmental Performance Index, the Global State of Democracy, the Social Progress Index, and LAC INFORM.
AQUASTAT is a global information system produced by the Food and Agriculture Organization of the United Nations and presents a perspective on agriculture and water resources availability, infrastructure to support large-scale regional planning, and analysis. The Environmental Performance Index considers two fundamental dimensions of sustainable development: environmental health, which rises with economic growth and prosperity, and ecosystem vitality, based on 24 indicators
The Global State of Democracy indices depict democratic trends at the country, regional, and global levels across a broad range of different attributes of democracy in the period 1975–2015. Democracy is conceptualized as popular control over public decision-making and decision makers and equality of respect and voice between citizens in the exercise of that control. The index translates these principles into five main democracy attributes: representative government, fundamental rights, checks on government, impartial administration, and participatory engagement
The Social Progress Index is a comprehensive measure of quality of life, independent of economic indicators, and is designed to complement economic measures such as gross domestic product. Social progress is defined as the capacity of a society to meet the basic human needs of its citizens, establish the building blocks that allow citizens and communities to enhance and sustain the quality of their lives, and create the conditions for all individuals to reach their full potential. The index aggregates three broad dimensions of social progress: basic human needs, foundations of wellbeing, and opportunity
The LAC INFORM Risk Index simplifies risk-based information for LAC countries. A risk score is calculated for each country by combining 82 indicators that measure three dimensions: hazard and exposure, which captures potential hazardous events and the number of people that could be exposed; vulnerability, which measures the fragility of socio-economic systems and the strength of communities, households, and individuals to confront a crisis situation; and lack of coping capacity, which takes into account a country's institutional and infrastructural strength to cope with and recover from crisis
Although these indicators do not capture the full spectrum and complexity of factors related to water, they allow for an analysis of the topic modeling results based on country clusters with similar characteristics.
Results from the literature review were ground-truthed with an electronic survey sent to corresponding authors of the articles from the corpus. The survey aimed to shed light on researchers' characteristics by including questions about their research experience, institutional affiliation, publication history, their perceptions on funding and interdisciplinarity in the field of water resources, and open-ended questions. The survey received exemption from the University of California, Davis, IRB Administration (ID 1335782-1).
The survey was designed to include 16 questions, in three languages (English, Spanish, and Portuguese), and take approximately 5 min to complete (Table S3). Respondents were asked questions about their position; institutional affiliation; years of experience; main research discipline; countries of birth, residence, and research focus; number of peer-reviewed publications; motivations for picking journals for publications; source of funding; and opinions regarding interdisciplinary research.
The survey was sent in November 2018 to corresponding authors from 22 324 papers, a subset of the final corpus, using Qualtrics survey distribution software. This included articles written in each language: 20 332 English, 1293 Spanish, and 699 Portuguese. The survey was resent to non-respondents weekly until February 2019, leading to 1969 responses. The survey response data was cleaned and prepared for analysis. The survey responses in Spanish and Portuguese were translated to English and compiled into one document.
In this section, we detail the following: (1) the clustering of countries from their socio-hydrologic data; (2) the generation and mining of metadata; and (3) the metrics to identify bright and blind spots of water research.
We clustered countries based on socio-hydrologic variables using two different methods. The average proportion of non-overlap (APN) measures the proportion of observations not placed in the same cluster under both cases and evaluates how robust the clusters are under cross-validation. The average distance between means (ADM) measures the variation of the cluster center and evaluates the stability of the localization of the cluster in the multidimensional clustering variable space. The average distance (AD) measures the distance between observations placed in the same cluster and evaluates within-cluster stability. The figure of merit (FOM) estimates the predictive power of the clustering algorithm by measuring the within-cluster variance of the removed variable.
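To make the first of these measures concrete, the sketch below computes the APN from cluster labels. It assumes the commonly used definition in which the clustering on the full data is compared with clusterings obtained after removing one variable at a time; the section does not spell out the formula, so this definition, the function name, and the toy labels are assumptions for illustration.

```python
def average_proportion_non_overlap(full_labels, reduced_labelings):
    """APN from cluster labels.

    full_labels: cluster label of each observation using all variables.
    reduced_labelings: one list of labels per removed variable, giving the
        cluster of each observation when that variable is left out.
    Returns a value in [0, 1]; 0 means perfectly stable clusters.
    """
    n = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for i in range(n):
            # Observations sharing a cluster with i in each clustering.
            full_cluster = {j for j in range(n) if full_labels[j] == full_labels[i]}
            reduced_cluster = {j for j in range(n) if reduced[j] == reduced[i]}
            overlap = len(full_cluster & reduced_cluster) / len(full_cluster)
            total += 1.0 - overlap
    return total / (n * len(reduced_labelings))


if __name__ == "__main__":
    # Hypothetical labels for four countries and two removed variables.
    full = [0, 0, 1, 1]
    reduced = [[0, 0, 1, 1], [0, 1, 1, 1]]
    print(average_proportion_non_overlap(full, reduced))
```

Taking labels rather than raw data as input keeps the sketch independent of the clustering algorithm, which is not named in this passage.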
Each document in the corpus was augmented by generating and mining metadata. The mined metadata correspond to author keywords as well as the citing and cited literature resulting in a citation network. The generated metadata correspond to modeled topics and study location.
The content of the corpus documents was modeled using latent Dirichlet allocation (LDA), a Bayesian, generative, probabilistic model conceptualizing each document in a corpus of documents as a random mixture of topics.
The LDA was programmed to identify 105 topics in English and 65 topics in the Spanish and Portuguese corpora based on commonly used metrics for LDA tuning. We then conducted a quality assessment of the topic models through cross-validation. For this we developed human-derived topics for the English corpus by reading a subset of 1428 papers from the corpus and manually identifying single-word tags based on keywords and main research topics. A similar percentage of documents were read for the Spanish and Portuguese corpora: 188 and 111, respectively.
As topics are statistical objects, they must be assigned a human label to make them tractable. An interdisciplinary review panel of eight water experts therefore assigned labels for each topic while simultaneously evaluating their significance. Topics were removed if multiple members of the review panel determined that the most frequently occurring words were irrelevant based on their expert knowledge of water resources science. The remaining relevant topics were tagged with five labels independently by multiple reviewers, and the collection of proposed topic labels was then harmonized to produce final topic labels. Topics were assigned labels for each of several levels: (i) specific topic name; (ii) theme, i.e., categories of scientific research as defined by the US National Science Foundation (NSF) that were either (iii) specific or (iv) broad; and (v) description, i.e., spatial scale, water budget, or methods. These labels were consolidated into four topic categories: general, specific, methods, and water budget.
Article metadata containing the citing literature and author-defined keywords for articles from the English corpus were extracted using the Elsevier API and each article's DOI. We used this citing literature to build a citation network. Keywords were used to supplement the country location labels assigned during human reading by searching them with regular expressions for the names of the target countries. Of the 31 countries in LAC, only 23 countries had a sufficient occurrence (i.e., at least 30 articles) within the human-read subset of the English corpus and were included in the citation network.
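A minimal sketch of such keyword matching is given below, assuming a plain, case-insensitive whole-word match on country names. The country list is an illustrative subset, and the function and variable names are hypothetical; the patterns actually used in the study are not given in this section.

```python
import re

# Illustrative subset of target countries (assumption for this sketch).
COUNTRIES = ["Brazil", "Mexico", "Argentina", "Chile", "Costa Rica"]

# One case-insensitive, whole-word pattern per country name.
COUNTRY_PATTERNS = {
    name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    for name in COUNTRIES
}


def countries_in_keywords(keywords):
    """Return the target countries named in a list of author keywords."""
    found = []
    for name, pattern in COUNTRY_PATTERNS.items():
        if any(pattern.search(keyword) for keyword in keywords):
            found.append(name)
    return found


if __name__ == "__main__":
    print(countries_in_keywords(["water quality", "São Paulo, BRAZIL"]))
    print(countries_in_keywords(["Gulf of Mexico", "hypoxia"]))
```

A literal name match cannot distinguish a country from a place that merely contains its name, so labels obtained this way still benefit from a separate relevance check.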
The citation network was extracted from article metadata, which stores identifiers for the citing article (i.e. the article to which the metadata are attached) and the cited articles. A total of 29 900 citations were found between 4603 unique articles of the English corpus. The resulting bibliographic network is defined by its
We used machine learning to predict the location of the country of study of each paper in the English corpus. The training labels were provided by human reading of randomly chosen articles from the corpus (1428 human-derived labels) and by text mining the article metadata (2663 text-mined labels). Interestingly, the human reading provided 563 observations of irrelevant country locations (i.e. outside LAC) or irrelevant subjects of study (i.e. not water resources related). This occurred in some cases when our queries returned articles containing accurate keywords but different meanings than intended; for example, the search word “Mexico” returned irrelevant locations including regions of the United States around the Gulf of Mexico and irrelevant topics such as signal processing analyses employing the “Mexican hat” wavelet.
The human-derived labels were first used for constructing a relevance filter based on simple binary classification between “Relevant” and “Irrelevant” documents (
The prediction of the location of study for each document was performed using both human-derived and text-mined labels (
This section details assessments of the following: (1) ground truth from analyzing survey data; (2) research volume; (3) research spread using topic normality; and (4) research connectivity from network analysis.
Closed-ended responses of the electronic survey were analyzed by tallying aggregate data by country and discipline of study. This allowed for a number of inferences such as the most commonly represented research disciplines and the countries of study and of origin of respondents. Research collaborations were analyzed based on the three main countries of study and the three main countries of research collaborations for every respondent.
Open-ended responses were coded for content and analyzed in ATLAS.ti, resulting in dozens of codes used to group responses of similar content (Table S7). Comments irrelevant to the study were removed. We also identified relationships between codes based on connections in the data. For example, if respondents mentioned political issues that hamper funding availability, we coded the two elements (“political issues” and “funding difficulties”) and then linked them with the qualifier “is the cause of”. In addition, using the “word cruncher” tool, we generated word frequencies from the survey comments, resulting in a word cloud visualization in which each word's size is proportional to its frequency of use.
A timeline was created with the number of new research articles published per year representing countries in the three socio-hydrologic clusters, to visualize growth in research output over time. The articles were sourced from the English corpus of water resources research, because these articles were labeled with the country of research and could therefore be associated with a socio-hydrologic cluster. To better understand trends observed in each socio-hydrologic cluster, a residual analysis was performed. The data were transformed with a logarithmic transformation to obtain a roughly linear relationship between time and research output, and then a linear regression was calculated. The residuals for each year were plotted and displayed starting from 2000 to 2017. Year 2000 was chosen as the starting point, because it marks the time by which research output had increased enough to reach at least 30 new articles in each socio-hydrologic cluster per year. The residuals were then plotted along with brackets of the standard deviation (for both positive and negative values) to provide a reference of significance.
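The residual analysis described above can be sketched as follows, assuming a natural-log transform and an ordinary least-squares fit of log output against year; the base of the logarithm and the fitting routine used in the study are not stated. The function names and the yearly counts in the example are hypothetical.

```python
import math
from statistics import mean, stdev


def log_linear_residuals(years, counts):
    """Fit log(count) = intercept + slope * year and return the residuals."""
    logs = [math.log(c) for c in counts]          # logarithmic transformation
    x_bar, y_bar = mean(years), mean(logs)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, logs))
    slope = sxy / sxx                             # least-squares slope
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(years, logs)]
    sd = stdev(residuals)                         # half-width of the reference band
    return slope, intercept, residuals, sd


def outside_band(years, residuals, sd):
    """Years whose residual lies beyond one standard deviation."""
    return [yr for yr, r in zip(years, residuals) if abs(r) > sd]


if __name__ == "__main__":
    # Hypothetical yearly article counts for one cluster.
    years = list(range(2000, 2006))
    counts = [30, 36, 40, 52, 60, 75]
    slope, intercept, residuals, sd = log_linear_residuals(years, counts)
    print(outside_band(years, residuals, sd))
```

Residuals are expressed on the log scale here, so a positive value means output above the fitted exponential growth trend for that year.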
The normality of research topics was estimated for general and specific topics, as well as for the method and water budget topics. Documents in each subset were sourced from the English corpus. Each subset was filtered for documents that were labeled with a country of research and then for countries with more than 30 documents. A statistical distance from the standard normal distribution was calculated to describe the normality of topic probabilities from two perspectives: across documents and across countries. As a statistical distance, we chose the Jensen–Shannon distance because of its link with entropy.
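For reference, a minimal sketch of the Jensen–Shannon distance between two discrete probability distributions is given below, using its standard definition as the square root of the Jensen–Shannon divergence. Base-2 logarithms are an assumption of this sketch and bound the distance between 0 and 1. How topic probabilities were discretized and compared with the standard normal distribution is not described in this passage, so the inputs are simply two probability vectors of equal length.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


if __name__ == "__main__":
    print(js_distance([0.5, 0.5], [0.5, 0.5]))  # identical distributions
    print(js_distance([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions
```

Unlike the Kullback–Leibler divergence, this distance is symmetric and remains finite when one distribution assigns zero probability to an outcome.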
The citation network was analyzed using Gephi 0.9.2
A force-directed graph algorithm, Fruchterman–Reingold (FR), was selected to produce the networks' visualizations. It simulates the graph as a system of mass particles. The nodes are the mass particles and the edges are springs between the particles. The algorithm tries to minimize the energy of this physical system. FR is most suitable for small networks and a better performance
For each network, the following geometric descriptions were calculated: number of nodes (countries or research topics), number of edges (citation between countries or research topics), and thickness of the edges (connectivity proportion). We also calculated network density and degree.
Network density is a measure of the connectedness of a graph, defined as the number of connections divided by the number of possible connections; a complete graph has all possible edges and a density of 1.
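A small sketch of these two descriptors follows, assuming a directed network without self-loops in which each edge points from the citing node to the cited node. These conventions, the function names, and the toy edges are assumptions for illustration and do not reproduce the Gephi computation.

```python
def network_density(nodes, edges):
    """Number of distinct directed edges divided by the number possible."""
    n = len(nodes)
    if n < 2:
        return 0.0
    distinct = {(u, v) for u, v in edges if u != v}  # ignore self-loops
    return len(distinct) / (n * (n - 1))


def degrees(nodes, edges):
    """In-degree (times cited) and out-degree (times citing) of each node."""
    in_deg = {node: 0 for node in nodes}
    out_deg = {node: 0 for node in nodes}
    for citing, cited in {(u, v) for u, v in edges if u != v}:
        out_deg[citing] += 1
        in_deg[cited] += 1
    return in_deg, out_deg


if __name__ == "__main__":
    # Hypothetical country-level citation edges (citing, cited).
    nodes = ["Brazil", "Mexico", "Chile"]
    edges = [("Mexico", "Brazil"), ("Chile", "Brazil"), ("Brazil", "Mexico")]
    print(network_density(nodes, edges))
    print(degrees(nodes, edges))
```

In this toy network, three of the six possible directed edges are present, and the node with the highest in-degree is the one cited by the most other nodes.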
LAC countries were clustered based on socio-hydrological characteristics using hierarchical and
We determined whether the LDA successfully identified a relevant topic based on its top 10 most frequent words, which showed an 86 % agreement between expert-identified topics and LDA-derived topics. We judged the performance of the LDA by comparing the topic model output from the English corpus to the output from the Spanish and Portuguese corpora. The specific topic label for each topic was used for comparison. The topics with each specific label were grouped by language and tallied (Fig.
Our findings are based on the output of a topic model of articles written in English and are predicated on the assumption that the English language corpus accurately represents the breadth of regional research published in English, Spanish, and Portuguese given that non-English corpora were small fractions of the English corpus (4 % and 2 % for Spanish and Portuguese, respectively). Comparing the topic model performance of the three corpora (Fig.
Topic model performance assessed by the total number of specific topics present in each language. Bolded topics represent the top 12.5 % of research (named topics in Fig.
For the relevance filter, the random forest, multinomial, and support vector machine models were the best-performing models and showed no statistical difference in the distribution of their performance measured by area under curve (AUC, Fig.
Machine learning performance for relevance prediction based on simple binary classification between “Relevant” and “Irrelevant” documents (
For predicting location of the country of study of each paper, random forest outperformed every other model with a mean multiclass AUC of 0.99 and a mean accuracy of 96 % (Fig.
Machine learning performance for prediction of the location of study for each document using both human-derived and text-mined labels (
The most common categories of answers in the comment section were summarized (Table S7). Difficulties related to funding stood out as the main challenge for respondents (Fig.
Word cloud based on word frequency from survey comments.
First, we look at research
Growth of water resources research.
A residual analysis identifies three distinct periods since 2000, the first year that each socio-hydrologic cluster was represented by over 30 research papers in each language. Annual output was lower than each cluster's general trend for the first few years, followed by a period of relatively higher output from 2007–2012, ending with a trend of decreasing growth from 2013–2017, although some residuals are below a single standard deviation within these periods. It is possible that uniform anomalies below or above general growth trends correspond to region-wide events, although a several-year lag could be possible between causal events and effects in research output. For example, a connection may exist between Brazil's economic crisis starting in 2012 and the subsequent drop in research output from 2013–2017.
Combining research topics with predicted study location describes the composition of water research in LAC with a chord diagram (Fig.
Composition of water research in LAC according to study location and top 25 % of studied research topics. On the left, countries are identified individually and by their associated socio-hydrologic cluster. On the right, research topics are grouped by their general category. The top 50 % of specific topics are listed within each general category.
While research about Brazil, Mexico, Argentina, and Chile are bright spots that dominate the research landscape, the absence of countries in the Caribbean and most of Central America indicates a shortage of research in these regions. A country's socio-hydrologic cluster correlates to its representation in overall research, with cluster 1 (Brazil and Mexico) receiving the most research, followed by cluster 2, then cluster 3, which includes most of Central America and the Caribbean. Although population size likely affects each country's representation in the overall research output, it does not precisely correlate with research volume. We therefore expect that other factors used to define the country clusters (e.g. a country's water and economic resources, geography, and history) influence the likelihood that researchers study that country.
Mexico and Argentina alternate for second highest representation, depending on the topic, after Brazil. These findings are confirmed by survey respondents, of which 35 % study Brazil, followed by Mexico with 15 %, and Argentina with 9 % (across all research topics). Only 12 % of survey respondents focused most of their research on countries in the Caribbean or Central America.
Water research is not distributed equally among disciplines and is primarily conducted in the physical and life sciences, representing together 80 % of topic probabilities. Survey responses also confirm these findings, as 80 % of respondents identified as life and physical scientists. The lack of water research in social sciences may reflect a combination of low publication rates (compared to physical sciences), disciplinary preference for publishing in books rather than peer-reviewed articles, a historical framing of water management as a purely technical discipline
Next, we look at the
Normality of research topics for
Normality of research topics for
The least normality is seen in two topics of great importance for water management: reservoirs and risk assessment. Both topics have normality values far below 1 across both countries and documents, suggesting poor representation of these topics on a broad scale (Fig.
We complete our description of LAC's water research portfolio by estimating degrees of
The high volume of research about Brazil (45 % of the labeled English corpus) motivated further investigation to see if this large scientific output is proportionally more influential than the research about other countries. Brazil is a central bright spot of the citation network (Fig.
Connectivity between LAC countries, measured by directional citations between articles' country of study. The direction of each edge is represented by drawing it clockwise from an earlier node to a later node.
Results from our survey complement findings from the citation network.
Over half of participating researchers collaborated with researchers outside of the country(ies) they study. Despite being the only Portuguese-speaking country, Brazil was the country most often listed in collaborations within LAC (17 %). Brazil's prominence may be partially explained by the legacy of relationships formed during graduate-level training when Brazilian researchers study abroad and when graduate students from other South American countries study in Brazil. Despite its greater connectivity, the review of 250 water science papers presented at the 2019 Brazilian Water Resources Symposium still found a lack of a common scientific agenda within the country and a need for more interdisciplinary research and collaboration with international communities, “especially with other Latin American countries with shared water issues”
A quarter of research collaborations involve non-LAC researchers, mostly in the United States (14 %). This may reflect differences in access to funds and highlights how more affluent countries can influence the scope of research conducted in LAC. Respondents indicated that insufficient and precarious funding arrangements are their main challenge. Of the respondents, 89 % said that the government is their main source of funding, which may explain why a country's political and economic context was mentioned as further aggravating funding availability. Funding difficulties were also associated with a lack of value given to water research and with the long timeframes of research, which are misaligned with decision-makers' timelines.
Furthermore, being physically close or part of the same socio-hydrologic clusters did not increase the likelihood of cross-country collaboration between countries. For instance, 22 % of researchers in the neighboring countries of Argentina or Chile either study both countries or collaborate with one another, while 24 % of those researchers report collaboration with Brazil. Researchers from Mexico and Brazil, who share a socio-hydrologic cluster, collaborate even less, with only 14 % reporting to work or collaborate in both countries, despite a high level of connectivity from Mexico to Brazil in the citation network. Conversely, more than 80 % of researchers in the Caribbean reported collaborating with researchers from other Caribbean nations and few collaborated with Brazil. This is in opposition to the findings from the citation network, showing few citations within the Caribbean and more frequent citations of publications on Brazil, but this could partially be due to the limited number of articles studying the Caribbean region included in our corpus.
We assessed the connectivity of water research throughout the region by aggregating research from all countries by topic (Fig.
Connectivity between topics of research, measured by directional citations between articles' research topic for
Interestingly, behind these few silos and the central node, a vast network of connectivity exists. While this level of connectivity is low (less than 10 %), it supports the characterization of water resources as a scientific discipline where research topics are already integrated
The wide scope of this study, intended to capture the breadth of the state of water resources research across LAC, required inevitable compromises in the depth of information and the subsequent ability to thoroughly interpret our results. Notably, much scientific literature in Spanish and Portuguese was not readily available or accessible online, and this resulted in the need to rely on English publications as a proxy of research across LAC. A targeted method to collect gray literature would increase the size of the Spanish and Portuguese corpora. Of the literature we found, very little focused on Caribbean countries, and this lack of information limited subsequent analysis. A targeted method of corpus augmentation and human-read validation towards less-represented countries and topics will likely increase the model's predictive capabilities and may improve the representation of Caribbean countries.
In addition, while the presented citation network included all LAC countries, the exclusion of countries outside of LAC prevented a more comprehensive analysis of LAC countries' reliance on non-regional research. Survey responses suggested that reliance on non-LAC research was high, as researchers stated they were more likely to collaborate with scientists outside of LAC than within LAC. Inclusion of non-LAC countries in the analysis of scientific interactions could present many opportunities for expanding our findings in future research. Finally, our study indicated where bright and blind spots appear across research in LAC, but did not aim to examine causal relations for these patterns, a common shortcoming in Science of Science
However, a more comprehensive answer would require exploring historical, political, economic, and social dynamics influencing the allocation of research resources. Overall, this work displays the value of our novel method to interpret results from machine learning, points to the need for a deeper and wider understanding of existing water resources research in water vulnerable regions, and warrants expanding our methods to include gray literature and coverage across the Global South.
This unprecedented multilingual literature review provides insights into bright and blind spots of water research throughout LAC. Our results reveal that the region's vulnerability to water-related stresses, and drivers such as climate change, is compounded by research blind spots in certain topics (e.g. reservoirs and risk assessment) and in entire subregions (e.g. Caribbean nations). Although certain topics and countries are under-studied in relation to the rest of the corpus, research on most components of the water budget (e.g. precipitation) represents a bright spot and suggests that most countries can make science-informed decisions regarding their water management. Research on water resources in Brazil dominates the research landscape, representing another bright spot. However, Brazil's dominance also highlights a regional vulnerability: while research on Brazil is vast, well-rounded, and highly influential across LAC, funding cuts and policy shifts that affect the country's scientific output can halt progress and impede scientifically informed water management throughout the region. Supporting societal and ecological needs while addressing challenges linked with future water-related risks will depend on countries' abilities to improve the accessibility of existing research (in English, Spanish, and Portuguese), expand research in under-studied topics (particularly in the social sciences), and harness existing opportunities for knowledge sharing.
The Kullback–Leibler divergence measures the expected information for discriminating between discrete probability distributions
Finally, the Jensen–Shannon distance,
Code and data are available as the
The supplement related to this article is available online at:
AJD and HG contributed to this paper's conceptualization, methodology, investigation, data management, analysis, visualization, project administration, funding acquisition, and writing, including the original draft, revisions, and editing. RDG, NPK, and FvdB contributed to this paper's conceptualization, methodology, investigation, data management, analysis, visualization, and writing, including the original draft, revisions, and editing. AK contributed to this paper's methodology, investigation, data management, and writing, including revisions and editing. JPOP contributed to this paper's conceptualization, investigation, analysis, visualization, and writing, including revisions and editing. LEGD contributed to this paper's conceptualization, investigation, and writing, including the original draft, revisions, and editing. JGR contributed to this paper's investigation. EG contributed to this paper's investigation and writing, including revisions and editing. SSS contributed to this paper's conceptualization, methodology, investigation, project administration, and writing, including revisions and editing.
The authors declare that they have no conflict of interest.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We are indebted to Shanti Sandosham, Miranda Romero, Lilly Mccaffrey Pecher (interns at the University of California, Davis (UCD)), and Carly M. Lawyer (intern at the University of South Carolina). Pamela Reynolds and Carl Stahmer at the DataLab (UCD) provided invaluable insights. This study was supported by funds from Jastro Research Fellowship (UCD).
This research has been supported by the Jastro Research Fellowship of UC Davis.
This paper was edited by Graham Jewitt and reviewed by José Luis Arumí and one anonymous referee.