Importance of spatial and depth-dependent drivers in groundwater level modeling through machine learning

The water and food security of South Asia is embedded in the groundwater resources of the transboundary aquifer system of Indus20 Ganges-Brahmaputra-Meghna (IGBM) rivers, which has been subjected to diverse natural and anthropogenic triggers. Thus, understanding the relative importance of such triggers in groundwater level change and developing a prediction framework is essential to sustain future stress. Although a number of studies on groundwater level prediction and simulation exist in the literature, characterization of predictive performances of groundwater level modeling using a large network of ground-based observations (n = 2303) is not yet reported. To identify the spatial and depth-wise predictors influence, here, we used linear regression based 25 dominance analysis and machine learning methods (Support Vector Machine and Artificial Neural network) on long term (19852015) GWLs and/or climatic variables in the parts of IGBM basin aquifers. The results from the dominance analysis show that groundwater level change is primarily influenced by abstraction and population in most of the IGBM, whereas in the Brahmaputra basin, precipitation exhibits greater influence. Our results show a large proportion of the observation wells (n >50% for ANN and n >65% for SVM) demonstrate good correlation (r> 0.6, p<0.05), Nash-Sutcliff efficiency (NSE >0.65), and normalized root mean 30 square error (RMSEn<0.6) between the observed and simulated values. However, the results in the highly abstracted parts of the basin are poor, due to insufficient knowledge of groundwater abstraction. Furthermore, a significant decrease in performance from shallow (intake depth < 35m) to deep observation wells (intake depth > 35m) could be linked to the change in groundwater abstraction pattern from shallow to deep groundwater in recent times. We also find that, in areas where natural factors dominate over anthropogenic factors, climatic variables may be used as suitable predictors for the groundwater level. 35

the performance decreases from shallow to deep observation wells. The conclusions can be useful for groundwater management in the IGBM areas. However, the quality of this manuscript is not good enough for publication in HESS. The detailed comments are shown as the following.
Reply: We thank the reviewer for his/her review and support for the general intent of the paper.
We appreciate that the appended comments are helpful and intended to improve the manuscript.
We have addressed the reviewer's comments and done a complete major revision of the manuscript. Doing so, we have added details on the method, results, and discussion, rewritten several sections, and added a few new figures, which we believe have greatly improved the manuscript.

Highlights of the revision:
In this revision, we have a) Included a detailed discussion of the prevailing methods, differences between our study and previous studies. b) Highlighted the originality of the present study c) Shown the representative groundwater level time series for different basins d) Discussed the interdependence of the variables and its relation to dominance analysis e) Included the rationale for using ANN and SVM in the study f) Discussed possible limitation of using machine learning methods in groundwater level modeling g) Explained the possible reasons for relatively large deviations in the Indus basin.
Rev 2. Comment 1: Machine learning methods are popular over the years. The authors gave an introduction to machine learning methods used in GWL. I expect that more prevailing methods should be mentioned in the introduction instead of ANN and SVR. I also expect a comparison of these prevailing methods.
Reply: We thank the reviewer for the comment. In recent times, machine learning-based methods have been widely used to simulate and predict groundwater levels across the world. Most of these studies used methods like autoregressive integrated moving average (ARIMA), Artificial Neural Network (ANN), hybrid-ANN, Adaptive neuro-fuzzy inference system (ANFIS), genetic programming (GP), Support Vector Machine (SVM) and nonlinear auto-regressive exogenous model-based (NARX), long-short term memory (LSTM) model few others using a wide range of frequency and temporal data on past GWLs, satellite observations derived groundwater storage (GWS), Normalized difference vegetation index (NDVI)), meteorological variables, river discharge, variables of groundwater use, few dummy variables to simulate and/or predict GWLs.
However, most studies (including studies on India and Bangladesh) are mainly small-scale studies, and due to the small number of observation wells, they are unable to characterize the spatial variability in model performances extensively. Moreover, the temporal extent of the studies on India and Bangladesh is often short. Hence the predictions are based on the short-term trends of dependent variables and do not consider the long-term variability. Furthermore, to our knowledge, none of the studies have considered the spatial and depth-wise performance variability of machine learning models in predicting GWL. The originality of this study lies in addressing some critical aspects which were not included in the previous studies. Firstly, to understand the spatial variability in machine learning-based model performances, we have considered a large network of monitoring wells (n = 2303) from 1985 to 2015 to simulate GWLs in the IGBM. Secondly, considering the variable depth-wise patterns of groundwater abstraction, we showed the significance of well depth (intake depth of the observation wells) information in GWL modeling using machine learning. Thirdly, we used meteorological variables exclusively to simulate in-situ GWLs. Fourthly, based on dominance analysis and outputs from the machine learning models, we investigated the most influential basin specific predictor(s) (both natural and human-induced) in GWL modeling. Following the reviewer's suggestion, we have modified the text, including the findings of the previous studies, and added new text to show the differences and originality of the present study.
Following the suggestion of the reviewer, we have added other prevailing methods in the introduction section. We further modified the text regarding the comparison of the prevailing methods and added new text to show the difference and originality of the study.
"Over the years, the simplistic approach and acceptable results of the machine learning (ML) methods are preferred when the underlying physical system is not well understood. Sometimes, decoding the physical system becomes much more complex due to interactions and feedbacks between multiple processes. GWL modeling based on ML has the unique ability to find the likely relationships between GWL and controlling hydro-climatic-anthropogenic variables without constructing knowledge-driven conceptual or physically-based models. Therefore, researchers have studied the performance of ML methods for GWL modeling in India and Bangladesh (Nayak et al., 2006;Nury et al., 2017;Malakar et al., 2018;Mukherjee and Ramachandran, 2018;Bhanja et al., 2019b;Sun et al., 2019;Yadav et al., 2020;and the references therein) and other parts of the world (Coulibaly et al., 2001;Feng et al., 2008;Sun, 2013;Nourani and Mousavi, 2016;Sun et al., 2016;Yoon et al., 2016;Barzegar et al., 2017;Ebrahimi and Rajaee, 2017;Wunsch et al., 2018;Zhang et al., 2018;Chen et al., 2019;Lee et al., 2019;Jeong et al., 2019  variables, river discharge, variables of groundwater use, few dummy variables to simulate and/or predict GWLs. In a study by Yoon et al. (2011), ANN and SVM models were used using tide level, precipitation, and past GWLs as inputs to predict GWL fluctuations at a South Korean coastal aquifer. They reported that precipitation and tide levels are the most important input variables, and SVM performs better than ANN. Furthermore, the ability of GP, ANFIS, ANN, SVM, and ARIMA methods was evaluated by Shiri et al. (2013) in predicting GWL in Korea. The results suggest that the performance of GP is better than others. Using hybrid ANN with preprocessing approach Sahoo et al. (2017) predicted GWL change in some of the agriculture alluvial aquifers of the USA. Another recent study (Jeong et al., 2019) reported that NARX and LSTM methods provide good accuracy in predicting the water level of two observation wells in the Korean peninsula using preprocessed climatic variables (temperature, precipitation, humidity, sunshine hours, and atmospheric pressure) that potentially affect GWL through changing the evapotranspiration and recharge. Zhang et al. (2018) identified stressed aquifer conditions by comparing the observed and estimated GWL with LSTM using precipitation, temperature, water diversion, and evaporation as input. A recent study by Mukherjee and Ramachandran (2018) simulated GWLs for a small number ( Yadav et al. (2020)

used ANN and SVM on preprocessed data on GWL, precipitation, Southern Oscillation Index, Northern Oscillation
Index, Niño3, and population as input to predict GWL in the urban areas of Bengaluru, India.
They also discussed the significant impact of population growth in GWL estimation and prediction in urban areas in India (Yadav et al., 2020)." We further added on the originality of our study, Feng, S., Kang, S., Huo, Z., Chen, S. and Mao, X.: Neural networks to simulate regional ground water levels affected by human activities, Ground Water, 46 (1)

Rev 2. Comment 2:
From the manuscript, it is difficult to see the originality of this study. For me, the only originality might be the use of a large network of monitoring wells to identify the spatial and depth-wise drivers.
Reply: We thank the reviewer for the comment. Over the years, a host of machine learning methods is applied in groundwater level prediction and simulation worldwide. However, to our knowledge, the previous studies have not considered the spatial and depth-wise performance variability of machine learning models in simulating GWL. The originality of this article lies in addressing these critical aspects. Following the reviewer's comment, we have highlighted the originality of our study. We added, Reply: We thank the reviewer for the comment. We would like to mention that this is the best possible dataset in terms of spatial resolution available in the region to date. We agree with the reviewer that the temporal resolution of the data is a little coarse; however, we could not find any other data with better temporal resolution covering the entire study area.
Based on the reviewer's suggestion, we have inserted the time series plots of groundwater levels from multiple locations. Since the time series of all the 2303 monitoring wells was not possible to show, following the reviewer's comment, we have shown the 100 randomly selected observation wells in each of the basins with most records in the supplementary information.    Reply: We would like to thank the reviewer for this concern. We agree with the reviewer that population and groundwater withdrawal are interlinked parameters in some aspects. For example, assuming per capita groundwater withdrawal for domestic purposes is nearly equal, a net rise in population is directly proportional to the rise in groundwater withdrawal for domestic purposes.
However, domestic withdrawal is limited to only ~4-8% of the total groundwater withdrawal in the basin, while irrigation-linked groundwater withdrawal contributes more than 90%. Irrigation strategies are shifting from flood irrigation to drip and sprinkler based irrigation systems, and this would continue in the near future. Thus, the water withdrawal for irrigation purposes (being the highest consumer of groundwater) is not directly linked to the population increase of the study area. This is the reason we have considered two separate parameters for designing this study.
"We considered both the population and groundwater withdrawal as input parameters in the dominance analysis. Although these two parameters seem to be interlinked, however, in reality, they are not directly related in the IGBM basin. For example, assuming per capita groundwater withdrawal for domestic purposes is not changing over the years, a net rise in population is directly proportional to the rise in groundwater withdrawal for domestic purposes. However, domestic withdrawal is limited to only ~4-8% of the total groundwater withdrawal in the basin (Sharma et al., 2008;CGWB, 2019). Irrigation-linked groundwater withdrawal contributes more than 90% throughout the basin (Sharma et al., 2008;CGWB, 2019). Irrigation strategies are shifting from flood irrigation to drip and sprinkler based irrigation systems, and this would continue in the near future. Thus, the water withdrawal for irrigation purposes (being the highest consumer of groundwater) is not directly linked to population increase; rather, it is dependent upon the irrigation strategies used (Bhanja et al., 2017a)." The temperature could be considered as a proxy of water uses for irrigation. We also agree with the reviewer that potential evapotranspiration is controlled by ambient temperature. Furthermore, in addition to temperature, PET (used in the study) is calculated from variables such as net radiation at crop surface, wind speed, soil heat flux, and vapour pressure using the Penman-Monteith formula. Thus, these variables influencing the potential evapotranspiration are significant components of the regional hydrological cycle. Hence, we have included PET in our analyses.
We included in the text, "Temperature could be considered as a proxy of water uses for irrigation (Sun, 2013).
Furthermore, potential evapotranspiration (PET) dependents on temperature, net radiation at crop surface, wind speed, soil heat flux, and vapour pressure (Ekström et al., 2007;Harris et al., 2020). PET has been included in the analysis since these variables are significant components of the regional hydrological cycle." Furthermore, the dominance analysis (DA) computes the coefficient of determination (R 2 ) in multiple regression of all possible predictor combinations. The relative importance of the predictor variables is determined by the DA through a pair-wise comparisons of all predictors in the multiple regression model as they relate to an outcome variable. The DA has been developed specifically to address the issue of multicollinearity between predictors and provide the relative importance not affected by the multicollinearity between the predictors.
Following the reviewer's comment, we have added a brief description.

Dominance analysis
"Yearly precipitation, temperature, groundwater withdrawals, population, and potential evapotranspiration for the IGBM and each of sub-basins were taken as the predictor variable to understand their relationship with the outcome variable GWLs. However, the predictor variables used in the study could be interrelated to some degree. Thus it is hard to assess the importance of each predictor if there exists a high degree of multicollinearity within the predictors. The DA has been developed specifically to address the issues and provide the relative not affected by the multicollinearity between the predictors. .
The dominance analysis (DA) computes the coefficient of determination (R 2 ) in multiple regression of all possible predictor combinations . The relative importance of the predictor variables is determined by the DA through a pair-wise comparisons of all predictors in the multiple regression model as they relate to an outcome variable . The DA method partition coefficient of determination (R 2 ) of the overall multiple regression into "shares". These shares are attributable to each of the predictors (Braun et al., 2019) and identify the predictors that are more or less important or dominant than others. Here, the conditional dominance of the variables for p-1 sub-models is performed (where p is the numeric value of total sub-models) . A comprehensive narrative on the dominance analysis can be found in  and ." Reply: We would like to thank the reviewer for the comment. We agree that ANN and SVM are used for quite a long time. These methods are seemed to be robust, widely accepted, and extensively used in the earth science discipline. We have mentioned the rationale for using ANN and SVM in our study. We added, "Despite many computationally intensive and suitable machine learning methods available in the literature, ANN and SVM are selected here since these are the two most popular and widely used methods in predicting GWL, and proven to provide very good results in predicting GWL worldwide and in the similar study areas. The previous studies, as well as the studies on Bangladesh and India, are mostly based on a small spatial and a short temporal extent. Furthermore, to our knowledge, none of the studies have considered the spatial and depth-wise performance variability of machine learning models in predicting GWL." Rev 2. Comment 6: Line 251: replace 'has' with 'have' Reply: We thank the reviewer for noticing the typo. We have corrected the typo in the text.
We corrected, "Moreover, in general, the ML methods may have some weaknesses in modeling complex relationships regarding the low generalizability of the methods, risk of overtraining ." Rev 2. Comment 7: If the ML methods used in the study have some weakness regarding the low generalizability of the methods, risk of overtraining, why did the authors choose other machine learning methods?
Reply: We thank the reviewer for the comment. The statement of ML methods having low generalizability, risk of overtraining may be true to most of the other ML methods when modeling complex relationships. We mentioned this in Section 2.6 (Limitations, assumptions, and uncertainty) to highlight the possible drawback of using ML methods in general. Other methods, such as LSTM, are also subjected to these issues. In one of the recent studies,  reported that a poor training procedure of the LSTM for these situations leading to a bad generalization ability. In fact, they further added the possibility to be trapped in local minima with bad generalization in such complex relationships as rainfall-runoff is possible, even when using the advanced Adam algorithm with measures taken to avoid the overtraining problem (holdout method and dropout layer). We have modified the sentence as "Moreover, in general, most of the ML methods may have some weaknesses in modeling complex relationships regarding the low generalizability of the methods, risk of overtraining ." Furthermore, we performed a sensitivity analysis using different model configurations and selected the model that provides better performances.