the Creative Commons Attribution 4.0 License.
Preprocessing approaches in machine-learning-based groundwater potential mapping: an application to the Koulikoro and Bamako regions, Mali
Pedro Martínez-Santos
Miguel Martín-Loeches
Download
- Final revised paper (published on 18 Jan 2022)
- Preprint (discussion started on 30 Jun 2021)
Interactive discussion
Status: closed
- RC1: 'Comment on hess-2021-261', Anonymous Referee #1, 05 Aug 2021
Referee comment to:
HESS-2021-261 Revision report
“Preprocessing approaches in machine learning-based groundwater potential mapping: an application to the Koulikoro and Bamako regions, Mali” by Gómez-Escalonilla et al. (2021).
General Comments
Gómez-Escalonilla et al. (2021) present an interesting machine learning-based groundwater potential mapping study for the Koulikoro and Bamako regions of Mali. The authors apply machine learning models to groundwater potential mapping (GPM) in the two regions and evaluate their performance. The models are also used to explore the factors that may affect groundwater potential.
The paper is interesting and within the scope of HESS, where machine learning is well placed. The authors have done diligent work in summarizing many publications that apply machine learning, and the manuscript can be of interest to the scientific community working on machine learning in hydrology. The work is important, but in its present state I would not recommend it for publication: the comments below need to be addressed through major revisions.
The introduction is very general. It should explain why this machine learning study is necessary, given that machine learning is a “black-box” approach, and what its benefits are compared with other methods such as fuzzy logic, the frequency ratio, weight of evidence, or multi-criteria decision analysis (MCDA). In addition, the objectives are not clear and are scattered across the manuscript (see Page 13, Lines 255-260, with the sentence “A major goal of this study…”). The manuscript needs to be improved in this respect.
In the Introduction, only studies from other continents are presented. It would be interesting to see how studies in other regions of Africa approach groundwater potential mapping with machine learning or other methods. The authors should also link groundwater resource issues to the Sustainable Development Goals (SDGs) at the end of the introduction to motivate the reader.
You could add this reference in the introduction section: A new method to map groundwater potential at a village scale, based on a comprehensive borehole database. An application to Sikasso, Republic of Mali by Ana Carolina Gonçalves Delgado, 2018.
Add a new section (or put some sentences) about the limitations of machine learning techniques to study groundwater potential zones.
The overfitting problem is one of the drawbacks that affect the accuracy of models in machine learning. Take into account this issue in the introduction.
The validation of the groundwater potential map is not mentioned. In my opinion, this is one of the major limitations of the study, and the topic should be discussed in more detail.
Was the validation of your model limited to cross-validation, or do you intend to add an external validation?
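The distinction raised here, between internal cross-validation and validation on data the models never saw during training, can be illustrated with a minimal sketch. It assumes a hypothetical set of borehole indices, a fixed random seed, five folds, and an illustrative 20 % external hold-out; none of these choices come from the manuscript.

```python
import random

def split_with_holdout(n_samples, k=5, holdout_fraction=0.2, seed=0):
    """Return (folds, holdout): k cross-validation folds plus an external hold-out set.

    All parameter values are illustrative assumptions, not values from the manuscript.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_samples * holdout_fraction)
    holdout, pool = indices[:n_holdout], indices[n_holdout:]
    # Deal the remaining indices into k folds of near-equal size.
    folds = [pool[i::k] for i in range(k)]
    return folds, holdout

folds, holdout = split_with_holdout(n_samples=100)
for i, validation in enumerate(folds):
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # A model would be fitted on `training` and scored on `validation` here.
    assert not set(training) & set(validation)
# The hold-out indices would be used only once, after model selection is finished.
```

Keeping the hold-out set apart from the folds is what makes the final score an external check rather than a repetition of the cross-validation result.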
The methodology section needs a sub-section on “Multicollinearity analysis” before the results are presented in Section 3.1. Did you use the variance inflation factor (VIF) and tolerance (TOL) indices, as are customarily used to estimate the multicollinearity of predictive factors in machine learning modelling? If so, explain this in the manuscript.
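For reference, both indices follow from the correlation matrix of the explanatory variables: the VIF of variable j is the j-th diagonal element of the inverse correlation matrix, and the tolerance is its reciprocal. The sketch below shows this in plain Python under stated assumptions: the three toy columns (`elevation`, `slope`, `rainfall`) and all function names are hypothetical and are not taken from the manuscript.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        standardized.append([(v - mu) / sd for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    size = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear variables: VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def vif_and_tolerance(columns):
    """Return (VIF list, tolerance list), one value per explanatory variable."""
    inverse = invert(correlation_matrix(columns))
    vif = [inverse[j][j] for j in range(len(columns))]
    return vif, [1.0 / v for v in vif]

# Hypothetical toy data: slope is almost a copy of elevation, rainfall is unrelated.
elevation = [1, 2, 3, 4, 5, 6, 7, 8]
slope = [1.1, 1.9, 2.9, 4.1, 4.9, 6.1, 7.1, 7.9]
rainfall = [3, 6, 2, 7, 7, 2, 6, 3]
vif, tol = vif_and_tolerance([elevation, slope, rainfall])
print([round(v, 2) for v in vif], [round(t, 4) for t in tol])
```

In this toy case the two nearly identical columns receive very large VIF values, while the unrelated column stays close to 1. The cut-off used to discard a variable is a modelling choice that the manuscript would need to state.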
What is the effect of sample size on the different machine learning models for groundwater potential mapping in your research?
Did you carry out a sensitivity analysis of the effect of each factor (explanatory variable) on the groundwater potential map, i.e., of what happens when one or more factors are eliminated?
What is the resolution chosen to develop the thematic layers? Because the various GIS layers come with different spatial resolutions. You need to clarify this aspect.
The sub-sections in Section 2 on “Material and methods” need to be reorganised for better reading. For example, you could define the title "Materials and methods": Study area; Data used (Borehole database, Explanatory variables/Thematic layers, etc); and Methods (Random Forest, AdaBoost, Gradient Boosting, Decision Tree, and Extra Trees classifiers); Tools used to process the data, etc.
A table of data sources should be added to increase clarity and ease readers’ understanding.
Have you performed quality control of the datasets before modelling?
Did you use all explanatory variables to map groundwater potential in Figure 10? If so, could you specify the variables used to develop the final products? Because, on Page 20 of MS, you mentioned that the outcomes show that elevation, rainfall, geology, and drainage density, among others, are the most important factors conditioning the groundwater potential. Did you use these four (04) explanatory variables in reality? Specify exactly the variables used in the final models.
The conclusion is very general. It should be checked against the revised manuscript.
Limitations of the research should be addressed.
I suggest that the authors separate the results and discussion.
I suggest developing, in the new discussion section, “the model validation/performance and comparison”, “assessment of variable importance”, “limitations of the research”, etc. Furthermore, try to compare the outcomes of your research with other studies available in the literature on the mapping of groundwater potential, such as the GIS-based Dempster–Shafer model.
The discussion is incomplete. The authors must address the uncertainty in groundwater potential mapping (deficiencies in data quality, biased and absent data, sample sizes, missing covariates, errors in the structure and nature of the model, etc.). Add some references.
In the Conclusion section, specify the utility of this research and the potential users. For example, who could use the prepared maps?
Is it possible to improve the performance of the best machine learning models in your study? Which additional predicting variable (s) (even if such information is scarce) could be added to improve the results?
I think that Table 2 on Page 19 could be placed in a new section in Supporting Information.
Abstract section:
Page 1. The abstract is very long. It should be shortened and focused.
Keywords: I propose to delete “big data; climate change and water access”; and add “Groundwater potentiality, and GIS”.
The abstract should be thoroughly revised according to the revision of the manuscript.
Specific comments:
Page 2. Paragraph 1. Line 2: rewrite the sentence “Today, 2.5 billion people….” as “Today, 2.5 billion people around the world…”.
Page 2. Paragraph 3. Line 1. Introduce a sentence before this: “There are two main approaches to GPM: expert-based decision systems and machine learning methods”.
Page 3. For section 2.4, Material and methods, I suggest separating them into two sections.
2.4.1 Definition of target
2.4.2.2 Explanatory variables/Thematic layers
In this new section, I propose to describe the explanatory variables by order according to Figure 6.
Also, I propose to present the explanatory variables used in groundwater potential mapping in a table. The table could, for example, include columns such as type of data layer/explanatory variable, scale/resolution, time, available format, and source of data.
Page 6. Line 3. Specify whether the numbers 530 and 452 refer to numbers of villages.
Page 9: Put in order Figure 5a; before Figure 6.
Page 9: Rewrite the last sentence by also, it was used to.
Page 9. Figure 4 must be centered.
Page 12: Fig.12: order the figures following the description found on page 9, i.e.: curvature, slope, topographic wetness index (TWI).
Page 13: The section number 2.5 is repeated on page 14. Please check it. I have the impression that the authors did not take the time to proofread the document.
Page 13: 2nd paragraph. The objectives of the study mentioned at various locations in the manuscript should be summarized at the end of the introduction (see comments above). Why did you put the main goal of the study here? I think that the objective must be found in the introduction section.
Page 14. 3rd paragraph. Did you fix the number of iterations at 500 in this study? Is it the default value of the model? Justify how this number was established.
Page 17: you mentioned in the first line that “The AUC exceeds 0.90 in all cases”. I am not sure about this statement because, in Table 1, AdaBoost shows an AUC value of 0.898 under the MaxAbs scaling method. Could you rewrite the sentence to take this case into account? Alternatively, refer to the mean AUC, which does exceed 0.90 in all cases.
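As a reminder of what the disputed figure measures, the AUC can be read as the share of (successful, unsuccessful) borehole pairs that a classifier ranks in the right order. The sketch below is a minimal, assumption-laden illustration: the labels and scores are hypothetical, and only the 0.898 value and the 0.90 threshold come from the comment above.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = successful borehole) and predicted probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(roc_auc(labels, scores))

# The AdaBoost / MaxAbs value quoted from Table 1 does not pass a strict 0.90 test.
print(0.898 > 0.90)
```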
Page 20. First-line (Line 415). You mentioned Naghibi and Pourghasemi (2015), the citation is incorrect because you have three authors: Seyed Amir Naghibi & Hamid Reza Pourghasemi & Barnali Dixon.
At the end of this paragraph you again cite (Naghibi and Pourghasemi, 2015; Nguyen et al. 2020b). Because this citation error occurs in two places, I propose checking all references.
Page 21: Move Figure 8 on page 21 under the section of “3.3 Importance of explanatory variables” on
Page 23: I propose to add the well locations/boreholes on the two maps in Figure 8.
If possible, put on these two maps: well training and well validation with different colours of points.
Also, make clear your legend of Fig.8 with the classes well-defined.
For example:
(0- 0.2) Very low;
(0.2- 45) Low;
Etc.
Change the term “Intermediate” in the legend to “Moderate”, which is more appropriate.
Change “Groundwater potential” to “Groundwater potential classes”.
Page 21. I repeat my request for clarification mentioned above (see general comments). The feature importance shown in Figure 8 indicates that some explanatory variables are not important in the models. Could you explain how many variables you selected to produce the outcomes of Figure 10?
Page 23. Why did you choose to classify villages in three classes based on groundwater potential, and you show the outcomes of Groundwater potential in five classes?
Page 24. Could you explain further why there are three groundwater potential classes in Table 3, whereas Figure 10 shows five classes?
Page 15. Section 3 on Results and discussion. Please add a new sub-section on “Validation of machine learning models”.
Reference section
Page 27. Rewrite this reference: Direction Nationale de l’Hydraulique (Ed.): Données Hydrogeologiques et des Forages. Direction Nationale de l’Hydraulique du Mali, 2010.
to the precise country name.
Page 29. Specify the link and access date of this reference: Poggio, L. and de Sousa, L.: SoilGrids250m 2.0 - Clay content, 2020.
Page 30.
Add the access date of the reference Traore, A, Z., et al.
Add the link and access date of this reference: United Nations: Resolution A/RES/64/292. United Nations General Assembly, United Nations, 2010.
Technical corrections
Page 5. Line 115-120. Add the unit of mean water depth for consistency within the sentence, because you have given the unit of mean electric conductivity.
Page 6: in Figure 2 B, write correctly m3/h
Page 8. Line 160-165, add the comma in (BGS, 2021). Also, on Page 9 and the title of Figure 4, add a comma to the same reference.
Page 11, line 5. Replace “semiarid” with “semi-arid”.
Page 13: Equation 6; define x Ì
Page 25. Delete the “s” in the word “Conclusions”.
- AC1: 'Reply on RC1', Victor Gómez-Escalonilla, 03 Sep 2021
- RC2: 'Comment on hess-2021-261', Anonymous Referee #2, 09 Aug 2021
The manuscript “Preprocessing approaches in machine learning-based groundwater potential mapping: an application to the Koulikoro and Bamako regions, Mali” represents an important contribution aligned with the objectives of HESS and can be of interest to the scientific community working on machine learning applied to water management.
Concerning the scientific quality, the scientific approach and the methods applied are interesting, but the manuscript has an unbalanced structure, and some sections are inadequate and need in-depth analysis as well as improved English. For these reasons, I think this paper needs major modification and resubmission.
- General Comments
The introduction:
- The section dedicated to the literature review of groundwater potential mapping studies should be developed further, with a brief presentation of the results of the pertinent studies.
- The introduction lacks a presentation of the water resources problems in the study area and of the need to produce a groundwater potential map.
The discussion of the results must also be more in-depth, in particular by explaining the GPM results in connection with the hydrogeological context of the study area and the explanatory parameters used.
The methodology
The hydrogeological context of the study area is insufficiently presented, and the explanatory parameters used are not clearly presented. It is important to explain these data in depth to enrich the explanation of the GPM results.
Revision suggestions:
ABSTRACT:
Line 9: “Groundwater is crucial for domestic supplies in the Sahel”
It is necessary to specify the location. Which Sahel?
Line 11 & 12: “This paper presents a machine learning method to map groundwater potential and illustrates it through an application to two regions of Mali”.
This sentence is poorly structured.
Line 13: “A set of explanatory variables for the presence of groundwater is developed first”
I suggest replacing “the presence of groundwater” with “groundwater occurrence”.
Line 17: “This process identifies noisy, collinear and counterproductive variables and excludes them from the input dataset”:
This is a detail of the results; I suggest deleting this sentence.
Line 18, 19 & 20: “Tree-based algorithms, including the AdaBoost, Gradient Boosting, Random Forest, Decision Tree and Extra Trees classifiers were found to outperform other algorithms on a consistent basis (accuracy >0.85), whereas maximum absolute value and standardization proved the most efficient methods to scale explanatory variables”.
I suggest replacing by:
The results show that tree-based algorithms, including the AdaBoost, Gradient Boosting, Random Forest, Decision Tree and Extra Trees classifiers, were found to outperform other algorithms on a consistent basis (accuracy >0.85), whereas maximum absolute value and standardization proved the most efficient methods to scale explanatory variables.
Line 22 & 23: “From a methodological standpoint, the outcomes lead to three major conclusions”:
I suggest replacing by: The outcomes of this study lead to three major conclusions
Introduction
Line 38 & 39: “Groundwater potential mapping (GPM) is recognized as a valuable tool to underpin planning and development of groundwater resources (Elbeih, 2015)”.
I suggest replacing by
Groundwater potential mapping (GPM) is recognized as a valuable tool to underpin planning and exploration of groundwater resources (Elbeih, 2015).
Line 41 & 42: “In practice, however, it consists in computing spatially-distributed estimates for a target variable (groundwater potential) based a set of explanatory variables”
What are the explanatory variables? You should explain them. I suggest replacing this sentence by:
However, it consists of computing spatially distributed estimates for a target variable (groundwater potential) based a set of dependent variables such as soil, lineaments, slope, geology, landforms, lithology, and drainage density (Díaz-Alcaide and Martínez-Santos 2019a)
Line 42, 43 & 44: “GPM typically relies on existing cartography, digital elevation models obtained from satellite, aerial photographs, satellite imagery and geophysical information (Schetselaar et al., 2007)”.
GPM is based on assembling data from different sources. I suggest replacing by:
GPM typically relies on the compilation of data derived from existing maps, aerial photographs, satellite imagery, and airborne geophysical information (Schetselaar et al. 2008).
Line 46: “There are two main approaches to GPM: expert-based decision systems and machine learning methods”.
I suggest replacing by:
Recently, expert-based decision systems and machine learning methods have been implemented in many groundwater studies.
Line 46 &47: “Expert-based systems have existed for a long time (DEP, 1993)”
I suggest replacing by: Expert-based system methods have been used for a long time (DEP, 1993)
Line 52 & 53: Algorithms used in the GPM literature include Mixture Discriminant Analysis (Al-Fugara et al., 2020), Random Forest (Kalantar et al., 2019; Moghaddam et al., 2020),
I suggest replacing by: In the literature, the machine learning algorithms used in GPM studies include Mixture Discriminant Analysis (Al-Fugara et al., 2020), Random Forest (Kalantar et al., 2019; Moghaddam et al., 2020),
Line 58: “GPM works under the assumption that the presence of groundwater can be partially inferred from surface features”
I suggest replacing by:
GPM is based on the common assumption that groundwater occurrence can be partially inferred from surface features.
Line 60 & 61: Supervised classification algorithms are trained to find the associations between these variables and known groundwater data.
The data are trained using the algorithm; it is not the algorithms that are trained. I suggest replacing by:
These explanatory variables are trained using supervised classification algorithms to find the associations between them and known groundwater data.
Line 64 & 65: add reference.
Line 68: add reference
Line 71 & 72: The outcomes of machine learning GPM studies are almost invariably assessed by means of standard big data metrics such as precision, recall, and area under the receiver operating characteristic curve.
I suggest replacing by:
The outcomes of GPM studies using machine learning algorithms are almost invariably assessed by means of standard big data metrics such as…. Also add a reference for this observation.
Line 76 to Line 80: Within this context, this research presents two main additions to the literature. In the first place, it explores different scaling methods. The goal is to avoid the pitfalls associated with the reclassification of explanatory variables. Scaling is thus advocated as an essential part of algorithm training since each subsequent procedure depends on the choice of unit for each feature (Huang et al., 2015). Furthermore, scaling is expected to transform feature values based on a defined rule, so that all scaled features have the same degree of influence (Angelis and Stamelos, 2000).
I suggest replacing by:
Within this context, this research presents two main additions to the literature. In the first place, it explores different scaling methods to avoid the pitfalls associated with the reclassification of explanatory variables. Scaling is thus advocated as an essential part of algorithm training, since each subsequent procedure depends on the choice of unit for each feature (Huang et al., 2015). Furthermore, scaling is expected to transform feature values based on a defined rule, so that all scaled features have the same degree of influence (Angelis and Stamelos, 2000). (This is a detail of the methodology that I propose moving to the methodology section.)
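To make the point about units concrete, the two scaling methods singled out in the abstract (maximum absolute value and standardization) can be sketched in a few lines. This is a minimal illustration using the usual textbook definitions; the function names and the rainfall values are hypothetical, and the manuscript may rely on a library implementation instead.

```python
from statistics import mean, pstdev

def max_abs_scale(values):
    """Divide each value by the largest absolute value, giving a range within [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values] if sd else [0.0 for _ in values]

# Hypothetical annual rainfall in mm; the same data in metres scale to identical values.
rainfall_mm = [600.0, 800.0, 1000.0, 1200.0]
rainfall_m = [v / 1000.0 for v in rainfall_mm]
print(max_abs_scale(rainfall_mm))
print(standardize(rainfall_mm))
print(standardize(rainfall_m))
```

Because both transformations are invariant to a change of unit, a variable expressed in millimetres carries no more weight after scaling than the same variable expressed in metres.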
2 Material and methods
2.1 Study area
Line 93 to 111: I suggest adding a hydrogeological section or a geological map to highlight the aquifer units of the study area.
Line 89 to 101: “Water in these aquifers is preferentially located in the weathered mantle, and, within this, the lower part is generally more transmissive due to lower clay content. The upper part is less permeable to flow, but can still be important as a groundwater reservoir. Fractures can produce significant quantities of water, although their storage capacity is typically low (Martín-Loeches et al.,2018)”
I suggest replacing by:
In these aquifers, groundwater flows preferentially in the weathered mantle, and, within this, the lower part is generally more transmissive due to lower clay content. The upper part is less permeable to flow but can still be important as a groundwater reservoir where the fractures can raise the reservoir permeability although their storage capacity is typically low (Martín-Loeches et al.,2018).
Line 107: “Some boreholes however exceed 100 m3/hour”
I suggest replacing by:
However, the yield of some boreholes exceeds 100 m3/hour
Line 107 & 108: “The Paleozoic rocks located in the north are determined by fractures that allow water to flow through the sandstone and limestone layers”.
I suggest replacing by:
In the north, the fractured Paleozoic rocks allow water to flow through the sandstone and limestone layers.
2.2 Borehole database
Line 115: Borehole data were provided by Direction Nationale de l’Hydraulique (2010)
I suggest replacing by:
Borehole data were provided by the National Water Directorate (DNH, 2010)
Line 115 to 116: “The database contains information on 5,387 boreholes (3,772 successful and 1,615 unsuccessful), distributed across 1,605 human settlements”.
I suggest replacing by:
The database contains information of 5,387 boreholes (3,772 successful and 1,615 unsuccessful), distributed across 1,605 fields.
Line 121 to 123: “This can be assumed to be the thickness of the (Courtois et al., 2010). Water table depth
I suggest replacing by:
There is a considerable number of boreholes with a 100% success rate (530), many villages are supplied by a single borehole
Line 126 to 127: For algorithm training purposes, this raises the question as to whether villages with a small number of boreholes are statistically representative, particularly in cases where the mean yield is low
I suggest replacing by:
This raises the question in the application of algorithm in the choice of training datasets, especially to whether villages with a small number of boreholes are statistically representative, particularly in cases where the mean yield is low
Line 145: Figure 3: correct the word classification metrics
Line 156 to 157: Sixteen explanatory variables were selected based on an extensive review of the GPM literature (Díaz-Alcaide and Martínez-Santos 2019).
I think this extensive review should be explained in detail in the Introduction.
Line 161: you should add a description of the main factors that can influence groundwater recharge before describing each of the variables or factors used in the groundwater potential mapping.
Line 162: Geology constrains the presence of groundwater to an important extent
I suggest deleting this sentence.
Line 173: Soils are important in GPM because soil characteristics such as permeability…
I suggest replacing by:
Soil is an important factor in determining groundwater occurrence ……….
Line 174: Soil descriptions of the study area were obtained from the European Soil Data Centre
You should describe the main soil types of the study area and their characteristics.
Line 175 and 176: Integration of land use and land cover is often used in groundwater potential mapping studies because human activities alter hydrological dynamics (Díaz-Alcaide and Martínez-Santos, 2019).
I suggest replacing by:
Integration of land use and land cover is often used in groundwater potential mapping studies because Land use changes, which are mostly induced by human activities, affect hydrological dynamics (Díaz-Alcaide and Martínez-Santos, 2019).
Line 175 to 180: you should describe the land use of your study area and the data used to produce this map.
Line 182: You should add the reference for the rainfall data used.
Line 184: Figure 4: you should add the lineaments and faults to the geological map.
Line 191 & 192: DEMs are relevant because shallow groundwater flow and infiltration are partially conditioned by surface features and parameterized by properties that can be extracted from the surface data (Elbeih, 2015)
I suggest replacing by:
The topography is a relevant factor in groundwater distribution, storage, and flow, as well as surface runoff and infiltration are partially conditioned by surface features and parameterized by properties that can be extracted from the surface data (Elbeih, 2015)
Line 197: The topographic wetness index
I suggest replacing by:
The Topographic Wetness Index (TWI)
Line 243: Figure 6. Explanatory variables used to predict the GPM: a) water table depth (meters) b) slope (degree) c) curvature d) borehole yield (m3/h) e) normalized difference vegetation index (NDVI) f) normalized difference water index (NDWI) g) alteration band ratio (B6/B7) h) Drainage density i) Stream power index (SPI) j) topographic wetness index (TWI) k) Clay content (g/kg) l) rainfall (mm/year)
What is the difference between Figure 6g (alteration band ratio (B6/B7)) and Figure 6k (clay content)? According to the text (lines 230 to 233), they carry the same information: This layer provides information on clay content on the surface and the relationship with infiltration. Clay content on the surface is calculated as per Eq. 5, where B6 is the short-wave infrared 1 and B7 the short-wave infrared 2.
Alteration (clay minerals …) = B6 / B7 (5)
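The band ratio quoted above is a pixel-wise division of the two short-wave infrared bands, which is why it is a surface proxy rather than a measured clay content in g/kg. A minimal sketch follows, assuming the two bands are available as equally shaped grids; the reflectance values, the function name and the no-data handling are hypothetical.

```python
def band_ratio(b6, b7, nodata=float("nan")):
    """Pixel-wise B6 / B7 for two equally shaped 2-D grids given as lists of rows.

    Pixels where B7 is zero are set to `nodata` instead of raising an error.
    """
    if len(b6) != len(b7) or any(len(r6) != len(r7) for r6, r7 in zip(b6, b7)):
        raise ValueError("band grids must have the same shape")
    return [[p6 / p7 if p7 != 0 else nodata for p6, p7 in zip(r6, r7)]
            for r6, r7 in zip(b6, b7)]

# Hypothetical reflectance values for a 2 x 2 tile (B6 = SWIR 1, B7 = SWIR 2).
swir1 = [[0.30, 0.24], [0.18, 0.10]]
swir2 = [[0.20, 0.24], [0.12, 0.00]]
ratio = band_ratio(swir1, swir2)
print(ratio)
```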
Line 267: reference of equation 6
Line 273: reference of equation 7
Line 380 to 400: I think this paragraph should be moved to the introduction to explain how the algorithms have been used in the literature.
Line 437: Classifier outcomes were extrapolated to produce groundwater potential maps
It is not clear what you want to say.
Line 437 to 438: Figure 9 shows the groundwater potential predictions rendered by each of the five best-performing algorithms under the two most effective scaling methods
I suggest adding the abbreviations of the algorithms and scaling methods used in parentheses.
Line 447: The agreement map (Fig. 10) allows for an analysis of discrepancies among the best performing algorithms.
What do you want to say about the agreement map?
Line 455: Figure 9. Mapping outcomes of the top five supervised classification algorithms for the two best performing scaling methods. At the top the MaxAbs scaling method, below it the standardized scaling method. From left to right: AdaBoost classifier, Gradient Boosting classifier, Random Forest classifier, Decision Tree classifier and Extra Trees classifier.
I suggest adding a number or letter for each map, for example:
(a) AdaBoost classifier, (b) Gradient Boosting classifier, (c) Random Forest classifier, (d) Decision Tree classifier and (e) Extra Trees classifier.
Line 492 to 494: “On a final note, the literature features few examples of groundwater potential studies in the study area. Perhaps the only systematic precedent is the one carried out by Díaz-Alcaide et al. (2017). These authors performed a national-scale assessment of groundwater potential for the Republic of Mali based on the same borehole database that has been used in this research”.
This is a literature review of similar studies in the pilot area; I suggest moving it to the Introduction section.
Citation: https://doi.org/10.5194/hess-2021-261-RC2
- AC2: 'Reply on RC2', Victor Gómez-Escalonilla, 03 Sep 2021