Integrating process-related information into an artiﬁcial neural network for root-zone soil moisture prediction

. Quantiﬁcation of root-zone soil moisture (RZSM) is crucial for agricultural applications and the soil sciences. RZSM impacts processes such as vegetation transpiration and water percolation. Surface soil moisture (SSM) can be assessed through active and passive microwave remote-sensing methods, but no current sensor enables direct RZSM retrieval. Spatial maps of RZSM can be retrieved via proxy observations (vegetation stress, water storage change and surface soil moisture) or via land surface model predictions. In this study, we investigated the combination of surface soil moisture information with process-related inferred features involving artiﬁcial neural networks (ANNs). We considered the inﬁltration process through the soil water index (SWI) computed with a recursive exponential ﬁlter and the evaporation process through the evaporation efﬁciency computed based on a Moderate Resolution Imaging Spectroradiometer (MODIS) remote-sensing dataset and a simpliﬁed analytical model, while vegetation growth was not modeled and was only inferred through normalized difference vegetation index (NDVI) time series. Several ANN models with different sets of features were developed. Training was conducted considering in situ stations distributed in several areas worldwide characterized by different soil and climate patterns of the International Soil Moisture Network (ISMN), and testing was applied to stations of the same data-hosting facility. The results indicate that the integration of process-related features into ANN models increased the overall performance over the reference model level in which only SSM features were considered. In arid and semiarid areas, for instance, performance enhancement was observed when the evaporation efﬁciency was integrated into the ANN models. To assess the robustness of the approach, the trained models were applied to observation sites in Tunisia, Italy and southern India that are not part of the ISMN. The results reveal that joint use of surface soil moisture, evaporation efﬁciency, NDVI and recursive exponential ﬁlter represented the best alternative for more accurate predictions in the case of Tunisia, where the mean correlation of the predicted RZSM based on SSM only sharply increased from 0.443 to 0.801 when process-related features were integrated into the ANN models in addition to SSM. However, process-related features have no to little added value in temperate to tropical conditions.

Abstract. Quantification of root-zone soil moisture (RZSM) is crucial for agricultural applications and the soil sciences. RZSM impacts processes such as vegetation transpiration and water percolation. Surface soil moisture (SSM) can be assessed through active and passive microwave remotesensing methods, but no current sensor enables direct RZSM retrieval. Spatial maps of RZSM can be retrieved via proxy observations (vegetation stress, water storage change and surface soil moisture) or via land surface model predictions. In this study, we investigated the combination of surface soil moisture information with process-related inferred features involving artificial neural networks (ANNs). We considered the infiltration process through the soil water index (SWI) computed with a recursive exponential filter and the evaporation process through the evaporation efficiency computed based on a Moderate Resolution Imaging Spectroradiometer (MODIS) remote-sensing dataset and a simplified analytical model, while vegetation growth was not modeled and was only inferred through normalized difference vegetation index (NDVI) time series. Several ANN models with different sets of features were developed. Training was conducted considering in situ stations distributed in several areas worldwide characterized by different soil and climate patterns of the International Soil Moisture Network (ISMN), and testing was applied to stations of the same data-hosting facility. The results indicate that the integration of process-related features into ANN models increased the overall performance over the reference model level in which only SSM features were con-sidered. In arid and semiarid areas, for instance, performance enhancement was observed when the evaporation efficiency was integrated into the ANN models. To assess the robustness of the approach, the trained models were applied to observation sites in Tunisia, Italy and southern India that are not part of the ISMN. The results reveal that joint use of surface soil moisture, evaporation efficiency, NDVI and recursive exponential filter represented the best alternative for more accurate predictions in the case of Tunisia, where the mean correlation of the predicted RZSM based on SSM only sharply increased from 0.443 to 0.801 when process-related features were integrated into the ANN models in addition to SSM. However, process-related features have no to little added value in temperate to tropical conditions.

Introduction
Soil moisture is a major land parameter integrated into several agricultural, hydrological and meteorological applications (Koster et al., 2004;Paris Anguela et al., 2008). This essential climate variable (ECV) consists of two components, namely, surface soil moisture (SSM) (0-5 cm) and root-zone soil moisture (RZSM). RZSM corresponds to the soil moisture in the region in which the main vegetation rooting network is developing. Its definition varies depending on vegetation type and pedoclimatic conditions. The importance of RZSM is mainly highlighted in agricultural applications through vegetation stress and water needs and in carbon and nitrogen cycles, as RZSM influences biogeochemical activities in soil (Martínez-Espinosa et al., 2021). RZSM is nonlinearly related to SSM through different hydrological processes, such as diffusion processes. RZSM may be extracted by evaporation at the surface through root extraction or by capillary rises (Calvet et Noilhan, 2000). SSM quantification is achieved through three main sources: in situ measurements, model estimates and remote-sensing-based products. Microwave remote-sensing technologies involving sensors such as Soil Moisture and Ocean Salinity (SMOS) (Kerr et al., 2010), Soil Moisture Active Passive (SMAP) (Entekhabi et al., 2010), Advanced Microwave Scanning Radiometer (AMSR) (Owe et al., 2008) and Advanced Scatterometer (ASCAT)  have been employed to retrieve SSM at coarse resolutions. Current satellite sensors can only provide surface soil moisture information because of the shallow penetration depth of spaceborne data (on the order of a few centimeters) (Wagner et al., 2007). Fine-spatial-resolution synthetic aperture radar (SAR) data can also be applied in synergy with optical data to retrieve soil moisture (Zribi et al., 2011;Hajj et al., 2014;Dorigo et al., 2011), but again for surface soil moisture. The International Soil Moisture Network (ISMN) is an exhaustive datahosting facility focused on soil moisture data and associated ancillary information. The ISMN provides in situ soil moisture measurements collected from operational soil moisture networks worldwide (Dorigo et al., 2011). Various models can be adopted to estimate RZSM, such as land surface models (Surfex) (Masson et al., 2013), Interaction Sol-Biosphère-Atmosphère (ISBA) (Noilhan and Mahfouf, 1996), the Community Land Model (CLM; Oleson et al., 2010) or the Joint UK Land Environment Simulator (JULES) (Best et al., 2011) or dedicated crop models such as Aquacrop (Raes et al., 2009) or Simple Algorithm For Yield Estimate (SAFYE) (Battude et al., 2017). While these models provide the advantage of physical process-based estimates, these estimates depend on the availability and accuracy of ancillary information. Model predictions are often enhanced by the implementation of data assimilation techniques, such as the land data assimilation system (LDAS) (Sabater et al., 2007;Entekhabi et al., 2020).
Data-driven methods such as artificial neural networks (ANNs) have also been commonly applied in hydrology as detailed, for instance, by the ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2020) and in Tanty et al. (2015). One of their advantages is that these models do not require an explicit model structure to accurately represent the involved hydrological processes but instead construct a relationship between the given inputs and the process of interest. Therefore, ANNs are regarded as dynamic input-output mapping models heavily relying on the provided training data relevant to target values (Pan et al., 2017). Moreover, ANNs only require a one-time calibration to provide soil moisture estimations once instrument data are loaded and thus generate relatively low computational costs (Kolassa et al., 2018). These advantages explain the approach to estimate RZSM based on surface information with ANNs in various methodologies (Pan et al., 2017;Grillakis et al., 2021;Souissi et al., 2020). In this paper, we do not address ANN applications as a model twin where the ANN model is trained on the target for mimicking purposes and subsequently generates predictions while requiring a short computation time or fewer input simplifications. Here, we are instead interested in the adoption of ANNs as independent models trained on in situ observations. Within this context, Pan et al. (2017) successfully applied an ANN as a model for shallow 20 cm root-zone soil moisture prediction with a global correlation coefficient of 0.7. Grillakis et al. (2021) proposed employing an ANN as a means of calibrating and regionalizing the time constant of a recursive exponential filter, which was thereafter applied at the regional scale. A combined implementation of a Bayesian probabilistic approach and an ANN to infer RZSM at different depths from optical unmanned aerial vehicle (UAV) acquisitions via local training was also applied (Hassan-Esfahani et al., 2017). Multitemporal averaged features to predict RZSM based on only SSM and to investigate the transferability of a trained ANN across different climatic conditions globally were proposed in Souissi et al. (2020). Temporal information can be considered in ANNs through recurrent neural networks (RNNs), long short-term memory (LSTM) architectures , 1D convolutional neural networks (CNNs), or multitemporal averaging. In Souissi et al. (2020), median, maximum and minimum correlation values of 0.77, 0.96 and 0.65 were, respectively, reported across training, validation and test datasets. The use of climatic variables such as precipitation and surface temperature and intrinsic surface properties such as soil texture and land cover has also been considered in ANNs . The choice of variables depends not only on the data availability, but also on the objectives. Finally, ANN-based approaches pertain to the more general term of machine learning approaches, and within this framework, the random forest approach has been applied to root-zone soil moisture prediction (Carranza et al., 2021). The aforementioned studies have investigated the application of multiple information sources to predict root-zone soil moisture. The input features are commonly curated for quality, and correlation analysis is conducted to determine the useful inputs, while physical processes are not considered. In this paper, we introduce process-related features based on simplified analytical models representing the major processes contributing to root-zone soil moisture dynamics. In this work, RZSM refers to a point observation of water content at a depth ranging between 30 and 55 cm. We investigate the impact of the application of different process-related variables on the precision of RZSM predictions as well as the robustness of our approach. (1) We start from a previously developed ANN model (Souissi et al., 2020), and we extend the feature list to include NDVI time series, surface soil tem-  Table 1). The scaling and descaling are applied to each dataset separately. perature and process-related variables, namely, the soil water index given by a recursive exponential filter and remotesensing-based evaporation efficiency. (2) The robustness of the approach is assessed through additional tests involving stations not included in the ISMN database in Tunisia, Italy and southern India. (3) Climatic analysis is conducted to infer the most indicative process-related features for each climate pattern.

Materials and methods
The proposed methodology entails the construction of several ANN models with both direct (SSM, surface temperature and NDVI) and intermediate sets of features (soil water index and evaporation efficiency) computed based on simplified analytical models. An overview of the processing configuration is shown in Fig. 1. Standard scaling is applied to each dataset separately so that the different inputs fall into the same range of values. Then the ANN outputs are descaled to make the comparison with actual values of RZSM possible.
This approach results in a combination of ANN models (Table 1). Each model has one or more process-related features in addition to three SSM features which correspond to backward rolling averages of in situ SSM computed over 10, 30 and 90 d. All the ANN model hyperparameters remain the same except the number of input features.
The model with the simplest starting point is ANN_SSM based on Souissi et al. (2020). The most complex model includes the full set of inputs. Intercomparison of the model performance provides information on the added value of each input. All input features are scaled, and training is performed on each of these features based on scaled in situ RZSM data retrieved from the ISMN. The RZSM model predictions are validated against an independent set of observations.

ISMN soil moisture data
The first training and test operations were conducted on eight ISMN networks previously considered in Souissi et al. (2020). Figure 2 shows the distribution of the considered soil moisture networks with different soil textures and climatic parameters (see Appendix B). For each station, the RZSM observation point is located between 30 and 55 cm ( Table 2). For each soil moisture hourly acquisition, the ISMN provides quality flags. Quality flags can be marked as "C" (exceeding the plausible geophysical range), "D" (questionable/dubious), "M" (missing), or "G" (good) (Dorigo et   al., 2011). Category "D" has subset flags, namely, "D01" for which in situ soil temperature < 0 • C, "D02" that flags points at which in situ air temperature < 0 • C, as well as "D03" that also flags areas where the Global Land Data Assimilation System (GLDAS) soil temperature < 0 • C. In our study, only soil moisture data whose quality flag is marked "G" were retained.

External soil moisture data
The external networks are considered to assess the transferability and robustness of the approach. The trained models are run for predictions only over these sites. They have been selected to cover semiarid, moderate and tropical semiarid climates.
-Tunisian site: the Merguellil site is located in central Tunisia (9 • 54 E, 35 • 35 N). This site is characterized by a semiarid climate with highly variable rainfall patterns (average equal to 300 mm yr −1 ), very dry summer seasons, and wet winters. The Merguellil site represents an agricultural region where croplands, namely, olive groves and cereal fields, prevail (Zribi et al., 2021). At this study site, a network of continuous ThetaProbe sta- -Italian site: the Landriano site is located in northern Italy (Pavia Province, Lombardy Region). This station is located in a maize field, which was monitored in 2006 and from 2010 to 2011 (Masseroni et al., 2014). The average rainfall in Pavia Province is 650-700 mm, the climate is classified as Cfa (see Appendix A) and the field is irrigated by the border method with an average irrigation amount of approximately 100 to 200 mm per application with one to two applications per season due to the presence of a shallow groundwater table. Soil moisture measurements were performed with time domain reflectometer (TDR) soil moisture sensors. Five TDR soil moisture sensors were installed along a profile at depths of 5, 20, 35, 50 and 70 cm.
-Indian site: the Berambadi watershed is located in Gundalpet Taluk, Chamarajanagara district, in the southern part of Karnataka state in India and covers an area of approximately 84 km 2 . The average rainfall is equal to 800 mm yr −1 , and the climate is classified as Aw (see Appendix A). Hydrological variables have been intensively monitored since 2009 in the Berambadi watershed by the Environmental Research Observatory ORE BVET and AMBHAS Observatory. The soil moisture levels at the surface (5 cm) and root zone (50 cm) are monitored with a HydraProbe sensor at different agricultural sites across the watershed, and in the current study, four stations were chosen.

Surface soil temperature
In addition to in situ soil moisture, the ISMN optionally includes meteorological and soil variables that are available over specific time periods. Values of the in situ surface soil temperature among these variables can be employed as a useful indicator of the soil moisture data quality. The soil temperature was provided in degrees Celsius, and the plausible values range from −60 to 60 • C. Regarding soil moisture data, surface soil temperature data were also provided with quality flags (Dorigo et al., 2011). However, the drawback is that this variable is not available in all networks, which is the case with the AMMA-CATCH network.

Normalized difference vegetation index
We considered the remote-sensing-based normalized difference vegetation index (NDVI) to infer vegetation dynamics. We extracted this index from the Moderate Resolution Imaging Spectroradiometer (MODIS) Vegetation Indices product (MOD13Q1 version 6). MODIS Vegetation Indices data are generated at 16 d intervals and a 250 m spatial resolution as a level-3 product. This product provides two primary vegetation layers. The first vegetation layer is the NDVI, which is referred to as the continuity index of the existing National Oceanic and Atmospheric Administration-Advanced Very High Resolution Radiometer (NOAA-AVHRR)-derived NDVI. The algorithm chooses the best available pixel value from all the acquisitions over the 16 d period. The criteria considered are low cloud coverage, low viewing angle, and the highest NDVI value (Huete et al., 1999). To obtain daily NDVI values, we conducted linear interpolation of the 16 d product.

Potential evapotranspiration
Similarly, we assessed the impact of considering a remotesensing-based evaporation efficiency, which is initially defined as the ratio of actual to potential soil evaporation, on RZSM prediction. The computation details of this variable will be given later (see Sect. 2.2.2). We employed the remotesensing-based potential evapotranspiration (PET) to compute the evaporation efficiency. We extracted the PET from the MOD16A2 Evapotranspiration/Latent Heat Flux version 6 product, which is an 8 d composite dataset produced at a 500 m pixel resolution. The algorithm used for this product collection is based on the logic of the Penman-Monteith equation, which employs inputs of daily meteorological reanalysis data along with MODIS remote-sensing data products such as vegetation property dynamics, albedo and land cover. The MOD16A2 product provides layers for the composite evapotranspiration (ET), latent heat flux (LE), potential ET (PET) and potential LE (PLE). The pixel values for the PET layer include the sum of all 8 d within the composite period (Running et al., 2017). To obtain daily PET values, we performed a linear interpolation over the 8 d product, and then we divided the interpolated value by 8.

Recursive exponential filter
Two ANN models presented in Table 1 contained extra knowledge on infiltration process information based on the outputs of the recursive exponential filter (Stroud, 1999) as a feature. The recursive exponential filter was first introduced by Wagner et al. (1999) to estimate the soil water index (SWI) from surface soil moisture. SWI is computed as follows: where SWI t n is the soil water index at time t n , ms(t n ) is the scaled surface soil moisture at time t n (scaled between maximum and minimum values), K n is the gain at time t n , which occurs in [0, 1] and is given by and T is a time constant and is the only required tuning parameter to compute the recursive exponential filter. For the initialization of the filter, gain K 1 = 1 and SWI (t1) = ms(t 1 ).

Evaporation efficiency
An ANN model with evaporation efficiency input was also developed. This variable, which is defined as the ratio of the actual to potential soil evaporation, was first introduced in Noilhan and Planton (1989), Jacquemin and Noilhan (1990) and Lee and Pielke (1992) and thereafter readapted in Merlin et al. (2010) to include the soil thickness. In our work, we use a modified evaporation efficiency formulation based on the third model developed in Merlin et al. (2010), which can be expressed as follows (see Appendix C): where β is evaporation efficiency and θ is the water content in the soil layer of thickness L. θ max is the maximum soil moisture at each station. P * is a parameter computed as follows: (4) P * , a proxy of parameter P (see Appendix C), represents an equilibrium state controlled by retention forces in the soil, which increase with the thickness L of considered soil and by evaporative demands at the soil surface. PET is the potential evapotranspiration extracted from the MODIS 500 m 8 d product (MOD16A2). The soil evaporation efficiency computed by model 3, developed in Merlin et al. (2010), decreases when PET increases. Retention force and evaporative demand make the term P increase (replaced by P * ), as if an increase in potential evaporation LE p (here replaced by PET) at the soil surface would make the retention force in the soil greater. Merlin et al. (2010) tested this approach at two sites in southwestern France using in situ measurements of actual evaporation, potential evaporation, and soil moisture at five different depths collected in summer. Model 3 was able to represent the soil evaporation process with a similar accuracy to the classical resistance-based approach for various soil thicknesses up to 100 cm. Merlin et al. (2010) affirm that the parameterization of P as a function of LE p (here PET) indicates that β cannot be considered a function of soil moisture alone since it also depends on potential evaporation. Moreover, the effect of potential evaporation on β appears to be equivalent to that of soil thickness on β. This equivalence is physically interpreted as an increase in retention forces in the soil in reaction to an increase in potential evaporation.

Artificial neural network implementation
The multilayer perceptron (MLP), which is a multilayer feedforward ANN, is one of the most widely applied ANNs, mainly in the field of water resources (Abrahart and See, 2007). The multilayer perceptron contains one or more hidden layers between its input and output layers. Neurons are organized in layers such that the neurons of the same layer are not interconnected and that any connections are directed from the lower to upper layers (Ramchoun et al., 2016). Each neuron returns an output based on the weighted sum of all inputs and according to a nonlinear function referred to as the transfer or activation function (Oyebode and Stretch, Table 3. Proportion of the stations whose performance enhances using the ANN models enriched with process-related features compared to model ANN_SSM ( a : % of stations at which the correlation improves over the model ANN_SSM level; b : % of stations at which RMSE improves over the model ANN_SSM level).  We started with the ANN model developed in Souissi et al. (2020), whose architecture consists of one hidden layer of 20 hidden neurons, a tangent sigmoid function as the activation function of the hidden layer, a quadratic cost function as the loss function and the stochastic gradient descent (SGD) technique as the optimization algorithm. This model was developed to estimate RZSM based on only in situ SSM information. SSM was not applied as a feature of hourly values but was employed in the form of three features, namely, SSM rolling averages over 10, 30 and 90 d. Additional ANN models were developed to study, through each model, the impact of the application of the NDVI, SWI, evaporation efficiency and surface soil temperature as features. A model combining surface soil moisture, NDVI, evaporation efficiency and the recursive exponential filter was further considered. These ANN models were trained and validated on the 122 ISMN stations considered to be of good quality after a data-filtering step as detailed in Souissi et al. (2020). Training of the above ANN models was conducted considering 70 % of these 122 stations. Thirty percent was reserved for validation, and testing was conducted at the rest of the stations. So, in summary, 122 stations were considered for the training/validation of the ANN models and 224 stations if all input data available were used for testing. In a second step, tests were conducted on data external to the ISMN database, namely, on sites of Tunisia, Italy and India. The trained models over the ISMN are used only in prediction mode over these sites. The data for SSM in addition to the other features are used as inputs, and RZSM is predicted in outputs.

Exponential filter characteristic time length
A large proportion of the stations attained an optimal time constant (T best ) value equal to 60 d, which suggests an abnormally long infiltration time. These stations belong to the SCAN network and exhibit an RZSM acquisition depth of 50 cm, in contrast to other networks such as SMOSMA-NIA, for instance, where RZSM is retrieved at 30 cm. The high values correspond to correlation with seasonal dynamics rather than infiltration processes. This depth could explain the anomalously long infiltration time. This is consistent with Paulik et al. (2014), in which the average T value with the highest correlation (T best ) increased with increasing depth of the in situ observations. For comparison purposes, Paulik et al. (2014) found that 23.98 % of the stations achieved T best = 5 d, while 21.58 % of the stations achieved T best ≥ 60 d (60 or 100 d). Albergel et al. (2008) considered an average T best value of 6 d for the SMOSMANIA network. This value represented the average T best value for all stations belonging to the SMOSMANIA network. In our case, the average T best value for all stations of the SMOSMANIA network reached 9 d. In this study, an average T best value could be established for each station or each network. However, this is not relevant to our work because we aim to evaluate maps of remote-sensing data in the next steps, and thus we did not compute T best at each location. We fixed the value of T to 5 d as a median infiltration time.

Intercomparison of the ANN models
The distribution histograms for training, validation and test stations (Fig. 3) show that the integration of the considered process-related features improved the prediction accuracy in certain cases compared to the reference. Time series of good and less good quality of fit were provided in Appendix E for training, validation and test stations using the reference model ANN_SSM and the most complex ANN model.
In terms of the NDVI, 65.82 %, 45.71 % and 55.22 % of the stations attained better correlation values with  Finally, we investigated the impact of the joint application of the NDVI, recursive exponential filter (T = 5 d) and evaporation efficiency (B = 60 W m −2 ) in the ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 model. The surface soil temperature was not included, as its effect is included in the evaporation process. At 84.06 %, 61.29 % and 62.07 % of the training, validation and test stations, the correlation value obtained with this model was higher than that obtained with the ANN_SSM model, respectively. In addition, RMSE was minimized for 62.32 %, 54.84 % and 54.02 % of the training, validation and test stations with ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 compared to model ANN_SSM, respectively.
Considering model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5, only one training station had a decrease in correlation by more than 0.1, namely, station "Lind#1" (network "SCAN") compared to the reference model ANN_SSM. All inputs were not available at the same dates, which implied a significant reduction in data points (see Appendix F). The decrease in correlation and increase in RMSE did not exceed 0.1 and 0.01 m 3 m −3 , respectively, for the rest of the stations of lower performance metrics with the most complex ANN (Table 4).
Similarly for validation stations, only one station had a decrease in correlation above 0.1, namely, station "PineNut" (network SCAN), with model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5. This decrease can be also explained because of data shortage (see Appendix F). The decrease in correlation and increase in RMSE did not exceed 0.1 and 0.01 m 3 m −3 , respectively, for the rest of the stations of lower performance metrics with the most complex ANN (Table 4).
Regarding the test stations, correlation decreased by more than 0.1 and RMSE increased by more than 0.01 m 3 m −3 with model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 compared to model ANN_SSM, detected for only two stations. Both stations, namely, stations "S-Coleambally" and "Widgiewa", which belong to network "OZNET", significantly lose in data volume when processrelated variables are integrated into ANN and more precisely because of NDVI data availability (see Appendix F). For the rest of the test stations, correlation decreased and RMSE increased simultaneously by less than 0.1 and 0.01 m 3 m −3 , respectively, with model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 (Table 4).
Always in terms of the general performance of model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5, about 75 % of the stations have an RMSE of less than 0.05 m 3 m −3 , and around half of the stations have an RMSE of less than 0.04 m 3 m −3 . This accuracy is consistent, for instance, with the target value in the SMAP (Entekhabi et al., 2010) and SMOS (Kerr et al., 2010) missions, which is equal to 0.04 m 3 m −3 , and also with the average sensor accuracy adopted by Dorigo et al. (2013), which is equal to 0.05 m 3 m −3 . Overall, the most complex model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 can successfully characterize the soil moisture dynamics in the root zone since half of the stations have a correlation value of greater than 0.7. Pan et al. (2017) developed different ANN models to estimate RZSM at depths of 20 and 50 cm over the continental USA using surface information. They found that half of the stations have an RMSE of less than 0.06 m 3 m −3 and that more than 70 % of stations have a correlation above 0.7 when predicting RZSM at 20 cm. However, the developed ANN was less effective in RZSM prediction at 50 cm, which is also in accordance with Kornelsen and Coulibaly (2014). In our study, the densest soil moisture network is SCAN, located in the USA. Soil moisture was predicted at a depth of 50 cm over this network. Around half of the stations have a correlation value of above 0.6 and an RMSE of less than 0.04 m 3 m −3 after the integration of process-related inputs. Pan et al. (2017) suggest that the use of only time-dependent variables may not be sufficient for the ANN models to accurately predict RZSM and suggest adding soil texture data. Table 4. Proportion of the stations whose correlation decreases using the ANN models enriched with process-related features compared to model ANN_SSM ( corr = corr ANN_SSM − corr ANN_SSM_X ; X denotes one or a combination of process-related variables) Model

Robustness of the approach
To further assess the robustness of our approach, which involves RZSM prediction using the different ANN models with different features, we predicted RZSM at sites not previously considered in previous parts of the study. The selected stations are located in the Kairouan Plain, a semiarid region in central Tunisia, the Landriano site located in the north of Italy, and the Berambadi watershed located in Gundalpet Taluk, southern India. In the case of Tunisia, model ANN_SSM yielded moderate to low-precision predictions, as highlighted by the performance metrics listed in Table 5. The time series (see Appendix G) show that the RZSM predictions followed the SSM seasonality, which was reflected by the false peaks generated in the RZSM predictions whenever a sharp increase or decrease occurred in the SSM values. This observation was also found in Souissi et al. (2020). Actually, the Kairouan Plain is characterized by a semiarid environment where rainfall events infrequently occur and the level of evaporation is high. The reference model ANN_SSM shows its limitations in accurately predicting RZSM in areas with no alternate wet and dry cycles.
However, the consideration of additional features, namely, the NDVI, evaporation efficiency and SWI in the ANN models, resulted in good agreement between the in situ and predicted RZSM values (Fig. 4). The correlation values were improved by 60.04 %, 169.5 %, 112.02 %, 80.23 % and 53.7 % at stations Barrouta-160, Hmi-date_163, Barrage_162, Bouhajla_164 and P12, respectively, with the ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5 model over ANN_SSM model values. Similarly, RMSE values were reduced (Table 5). As shown in Fig. 4, the most complex ANN model is able to capture the variations of RZSM. This finding highlights the added value of our hybrid approach based on an association of a machine learning method with process-related variables. Instead of injecting uncertain information into physical models, such as soil properties, we used a nonparametric method related to physical processes without using forcing data that may be subject to errors and potentially lead to inaccurate tracking of the long-term evolution of soil moisture.
A second comparison can be conducted between the quality of fit of these independent datasets and training datasets. Actually, the climate class of the Tunisian stations is Bsh (see Appendix A). At the training stage, no station falls into climate class Bsh (see Appendix A). However, some training stations fall under a similar climate class, which is Bsk (see Appendix B). Table 5 presents correlation and RMSE values for these training stations and Tunisian sites with both models ANN_SSM and ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5. For all training stations, performance metrics are slightly enhanced, with the most complex ANN model compared to the reference model ANN_SSM, except for stations GrouseCreek, Harmsway and Lind#1, whose performance decreases. Overall, the range of correla-  tion values is similar for training and external validation stations with model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5, and RMSE is greatly reduced for the Tunisian stations compared to the training stations. Given the results on unseen datasets, namely, on Tunisia, the performance of the most complex ANN model is good as it is able to generalize the patterns present in the training dataset. At the southern Indian stations, the ANN_SSM model yielded good agreement even without the integration of process-related features (Table 6). These features added little to nonsignificant improvement. The same observation was made at the Italian site. The application of multiple features performed the best under arid conditions, e.g., in Tunisia. In the tropical and temperate climate regions, this was not the case. The presence of clouds in the MODIS NDVI and potential evapotranspiration products could explain this observation at sites of southern India and northern Italy. In southern India, for instance, the maximum variability in soil moisture occurred during the monsoon season, which is characterized by a large amount of clouds. Moreover, the coarse resolution of the MODIS NDVI product makes it sometimes not adapted to the considered site. Chen et al. (2016) investigated the impact of sample impurity and landscape heterogeneity on crop classification using coarse-spatial-resolution MODIS imagery. They showed that the sample impurity such as mixed crop types in a specific sample, compositional landscape heterogeneity, which is the richness and evenness of land cover types in a landscape, and configurational heterogeneity, which is the complexity of the spatial structure of land cover types in a specific landscape, are sources of uncertainty affecting crop area mapping when using coarsespatial-resolution imagery. High-resolution NDVI from sensors like Sentinel-2 could have been used in this exercise to mitigate the spatial resolution issue; however, MODIS data were privileged in order to provide NDVI and PET from the same sensor.

Discussion
Climate analysis of the results yielded by the different models indicated that, among all the models, the climate class with the highest mean correlation change rate (Fig. 5) was class BWk (see Appendix A), which regroups desert areas where the link between SSM and RZSM is weak due to high evaporative rates. Class Dfa (see Appendix A), which includes areas experiencing harsh and cold winters, also yielded a high mean correlation change rate (> 100 %). Similarly, at stations of this climate type, the link between the surface and root zone is poor. In regard to class Cfa (see Appendix A), in which more than 80 % of the total stations belong to the SCAN network, the high mean correlation change rate could be explained by the surface-subsurface decoupling phenomena detected within this network, as previously reported in Souissi et al. (2020). The model with the largest number of stations with improved predictions over the ANN_SSM model predictions was ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5. Actually, the coupled use of process-related features in the ANN models exerted a greater impact on the prediction accuracy than that exerted by the one-at-a-time application of these features. In model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T5, the three process-based features jointly employed seemed to counterbalance the weight of the three SSM features. In this model, the process-related features were equally represented versus the SSM information depicted by three features. The redundancy of the considered SSM information could explain the limited impact of the one-at-a-time addition of process-related features.
In addition, Karthikeyan and Mishra (2021) demonstrated that, at root depths beyond 20 cm, the importance of SSM was notably lower than that at the 20 cm depth, signifying decorrelation between surface and deeper SM values, which is in accordance with the findings in Souissi et al. (2020), and it was further revealed that vegetation exhibits a higher importance than that of the meteorological predictors and precipitation. Kornelsen and Coulibaly (2014) indicated that evapotranspiration is the most important meteorological input for the prediction of soil moisture in the root zone with the MLP, which reflects the importance of the water vapor flux in soil moisture state determination. mean_corr_change_rate = mean corr ANN_SSM_X − corr ANN_SSM corr ANN_SSM · 100 (5) where X denotes a process-related variable (X ∈ ["NDVI", "EXP-FILT-T5", "EVAP-EFF-B60", "TEMP"]). The world map illustrated in Fig. 6 shows the bestperforming ANN models based on the mean correlation change rate (Eq. 5). We assumed that the results in a given area of a specific climate class could be extended to other areas of the same climate class even if we did not consider the data for these areas. The climate classes without at least one station were marked in black and labeled with "NO DATA".
In arid areas such as the eastern and western sides of the USA with high evaporation rates, ANN_SSM_EVAP-EFF-B60 was the best-performing model. Similarly, in bare areas of Africa, the Middle East and Australia where the Bwh climate class prevailed (arid desert hot climate; see Appendix A), the evaporation efficiency was the best informative variable.
In the internal part of continental Europe and near the Mediterranean basin, the NDVI was the most relevant indicator for RZSM estimation, where agricultural fields dominated. Similarly, the Great Plains region in the USA was deeply affected by the NDVI, as this region is a cultivated area. In Nordic areas characterized by the ET climate class (Polar Tundra climate, see Appendix A) and mainly covered by grassland and shrubland areas according to ESA CCI land cover maps.  In Nordic areas characterized by the ET climate class, the soil temperature was the most important root-zone soil moisture indicator, mainly because of the freeze-thaw events encountered in these regions. In tropical savannah wet areas (class Aw; see Appendix A), the ANN_SSM_TEMP model was the best-performing model. This classification definitely suffered from limitations mainly provoked by the generalization of the climatic analysis results to areas not considered in this study. For instance, Figure 6. World map of the best-performing ANN models per climate class based on the mean correlation change rate; colors correspond to climate classes (see Appendix A), and hatches correspond to the most contributive input to the predictions, namely, EVAP (evaporation efficiency), EXP (exponential filter SWI), NDVI, and TEMP (surface soil temperature).
in regions of climate class Dfc (cold dry without a dry season, cold summer climate; see Appendix A), we expect the temperature to serve as the most relevant indicator instead of the evaporation efficiency.

Conclusion
In this study, we developed several ANN models to estimate RZSM based either solely on in situ SSM information or on a group of process-related features in addition to SSM, namely, the soil water index computed with a recursive exponential filter, evaporation efficiency, NDVI and surface soil temperature. Different regions across the globe with distinct land cover and climate patterns were considered. The main conclusion of this study was that the consideration of more features in addition to SSM information could enhance the accuracy of RZSM predictions mainly in regions where the link between SSM and RZSM is weak.
In arid areas with high evaporation rates, the most informative feature was the evaporation efficiency. In regions with agricultural fields, the NDVI was, for example, the most relevant indicator to predict RZSM. Overall, the best-performing model included the surface soil moisture, NDVI, SWI and evaporation efficiency as features. The robustness of the approach was further assessed through additional tests considering external sites in central Tunisia, India and Italy. Similarly, the process-related features exerted a positive im-pact on the prediction accuracy when combined with surface soil moisture in the case of Tunisia. The mean correlation across the five Tunisian stations sharply increased from 0.44 when only SSM was considered to 0.8 when all process-related features were combined with SSM. In India and Italy, the correlations were already high with the reference model ANN_SSM. The change in correlation after the addition of process-related features, namely, NDVI, is about −0.04, which is nonsignificant and is potentially because of the cloudy conditions in India and the noisy MODIS products. Also, the crop heterogeneity and sample impurity make MODIS NDVI products not adapted to all sites.
As a research perspective, datasets can be separated into clusters corresponding to major climate classes and/or soil types. More analysis can be conducted in this direction to eventually make connections between the different inputs and climate/soil configurations.
Future work will examine the ability of the developed model to estimate RZSM across larger areas based on remote-sensing global soil moisture products. The use of remote-sensing-derived soil moisture products may yield lower correlations with the reference model ANN_SSM, which potentially implies further improvement when process-related features are added.
where θ L is the water content in the soil layer of thickness L. P is a parameter computed as follows: θ max is the soil moisture at saturation. LE p is the potential evaporation. L 1 is the thinnest represented soil layer, and A 3 (unitless) and B 3 (W m −2 ) are the two best-fit parameters a priori depending on the soil texture and structure, respectively.

3294
R. Souissi et al.: Integrating process-related information into an ANN for root-zone soil moisture prediction Figure G9. In situ SSM, in situ RZSM, and predicted RZSM series at station P12 (Tunisia) with model ANN_SSM. Figure G10. In situ SSM, in situ RZSM, and predicted RZSM series at station P12 (Tunisia) with model ANN_SSM_NDVI_EVAP-EFF-B60_EXP-FILT-T.
Code availability. The code is not publicly accessible because this is in the requirements of some projects.
Author contributions. The methodology was established by AAB, RS and MZ. The original draft was written by RS. The manuscript was reviewed and edited by all the authors. Soil moisture data over the Italian site and the site description were provided by CC and MM. Soil moisture data over the Indian site and the site description were provided by SM, SKM and DBU. This PhD thesis is under the supervision of MZ and AAB. All the authors have read and agreed to the published version of the manuscript.