Articles | Volume 27, issue 22
Research article
16 Nov 2023
Research article |  | 16 Nov 2023

Understanding hydrologic controls of sloping soil response to precipitation through machine learning analysis applied to synthetic data

Daniel Camilo Roman Quintero, Pasquale Marino, Giovanni Francesco Santonastaso, and Roberto Greco

Soil and underground conditions prior to the initiation of rainfall events control the hydrological processes that occur in slopes, affecting the water exchange through their boundaries. The present study aims at identifying suitable variables to be monitored to predict the response of sloping soil to precipitation. The case of a pyroclastic coarse-grained soil mantle overlaying a karstic bedrock in the southern Apennines (Italy) is described. Field monitoring of stream level recordings, meteorological variables, and soil water content and suction has been carried out for a few years. To enrich the field dataset, a synthetic series of 1000 years has been generated with a physically based model coupled to a stochastic rainfall model. Machine learning techniques have been used to unwrap the non-linear cause–effect relationships linking the variables. The k-means clustering technique has been used for the identification of seasonally recurrent slope conditions in terms of soil moisture and groundwater level, and the random forest technique has been used to assess how the conditions at the onset of rainfall controlled the attitude of the soil mantle to retain much of the infiltrating rainwater. The results show that the response in terms of the fraction of rainwater remaining stored in the soil mantle at the end of rainfall events is controlled by soil moisture and groundwater level prior to the rainfall initiation, giving evidence of the activation of effective drainage processes.

1 Introduction

Slope response to precipitation is highly non-linear in terms of runoff generation, rainwater infiltration, and subsurface drainage processes, which mostly depend on the initial soil moisture state at the onset of each rainfall event (Tromp-Van Meerveld and McDonnell, 2006b; Nieber and Sidle, 2010; Damiano et al., 2017). The initial (or antecedent) conditions are related to hydrological processes that occur in the slopes, which control how they exchange water with the surrounding systems (i.e. atmosphere, surface water, deep groundwater). These processes occur through the boundaries of the slope and often evolve over timescales of weeks or even months, much longer than the duration of rainfall events, typically ranging between some hours and few days.

While the importance of soil moisture conditions on slope runoff and drainage has been recognised long since (Ponce and Hawkins, 1996; Tromp-Van Meerveld and McDonnell, 2006a, b), the scientific community has only recently started providing new perspectives to better understand hydrologic conditions predisposing slopes to landslides (Bogaard and Greco, 2018; Greco et al., 2023) to explain why most of large rain events do not destabilise slopes and only some do (Bogaard and Greco, 2016), and physically based models capable of integrating hydrological knowledge for predicting landslide occurrence have been proposed (e.g. Bordoni et al., 2015; Greco et al., 2018; Marino et al., 2021).

The triggering of some rainfall-induced geohazards, such as shallow landslides and debris flows, is favoured by pore pressure increase, caused by rainwater infiltration and consequent soil moisture accumulation. The storage of rainwater within the soil requires drainage mechanisms that develop in the slopes in response to precipitation to be ineffective at draining much of the infiltrating water (Greco et al., 2021, 2023). Consequently, especially for nowcasting and early warning purposes, the identification of hydrological variables suitable for identifying slope predisposing conditions is extremely useful. Thus, to better understand how hydrological predisposing conditions may control the processes involving the sloping soil response in terms of water storage, field monitoring for the assessment of the slope water balance is highly recommended (Bogaard and Greco, 2018; Marino et al., 2020a).

The identification of suitable variables to be monitored in the field is indeed useful to achieve an insight into the behaviour of the interconnected hydrological systems (i.e. groundwater, surface water, soil water). Besides the study of rainfall-induced landslides, the evaluation of the hydrological scenarios in a region of interest could impact several other applications from flood hazard assessment (Reichenbach et al., 1998; Forestieri et al., 2016; Chitu et al., 2017), to the prediction of possible crop water stress conditions in relation to defoliation (Capretti and Battisti, 2007), pathogen expansions in chestnut groves (Gao and Shain, 1995), and plant mortality in a climate change context (McDowell et al., 2008).

This research focuses on a case study of a slope located in Campania (southern Italy), representative of a wide area frequently hit by destructive rainfall-triggered shallow landslides (e.g. Fiorillo et al., 2001; Revellino et al., 2013). In fact, such geohazards are recurrent along the carbonate slopes covered with unsaturated air-fall pyroclastic deposits, diffuse over an area of a few thousand square kilometres around the two major volcanic complexes of the region, the Somma–Vesuvius and the Phlegraean Fields (Di Crescenzo and Santo, 2005; Cascini et al., 2008). The underlying limestone bedrock, densely fractured, is characterised by the presence of deep karst aquifers (Allocca et al., 2014). The triggering mechanism of landslides in the area is the increase in water storage within the soil mantle after intense and persistent precipitation, leading to pore pressure build-up (Bogaard and Greco, 2016). Slope equilibrium is in fact guaranteed by the additional shear strength promoted by soil suction (Lu and Likos, 2006; Greco and Gargano, 2015), the reduction of which often leads to slope failure due to shear strength loss by soil wetting during rainwater infiltration (Olivares and Picarelli, 2003; Damiano and Olivares, 2010; Pagano et al., 2010; Pirone et al., 2015).

Recent studies show that the response of the soil mantle to precipitation in the study area is affected not only by rainfall characteristics and antecedent soil moisture but also by the wetness of the interface with the underlying bedrock, which controls the leakage of water into the underlying fractured limestone (Marino et al., 2020a, 2021). At the contact between soil and bedrock, intense weathering modifies the physical properties of the soil as well as of the fractured bedrock, which form a hydraulically interconnected system: the epikarst (e.g. Perrin et al., 2003; Hartmann et al., 2014; Dal Soglio et al., 2020). The changing hydraulic behaviour of the soil–bedrock interface can be related to the storage of water in the epikarst, where a perched aquifer forms during the rainy season (Greco et al., 2014, 2018).

The aim of this study is to identify the major hydrological processes controlling the response to precipitation of the pyroclastic soil mantles typical of the area and the seasonally recurrent conditions that affect their ability to retain much of the infiltrating rainwater, through suitable measurable variables. To this aim, a rich dataset of measured rainfall events and corresponding hydrological effects would be required, which was not available for the case study, where monitoring activities had been carried out for a few years. Therefore, a synthetic 1000-year hourly dataset was generated, by means of a stochastic rainfall model and a simplified physically based model of the slope, coupling the unsaturated pyroclastic soil mantle and the underlying perched aquifer (Greco et al., 2018). Both models had been previously calibrated and validated on field experimental data (Damiano et al., 2012; Greco et al., 2013; Comegna et al., 2016; Marino et al., 2021). The synthetic data of soil suction, water content, and aquifer water level, all measurable in the field and assumed as representative of real conditions, were analysed as if they were measured data. After sorting the rainfall events within the 1000-year time series, a dataset was built with the antecedent conditions 1 h before the beginning of each rainfall event. It included the previously listed variables plus the total-event rainfall depth and the change in the water stored in the soil mantle at the end of each rainfall event. To disentangle the non-linear processes controlling the hydraulic behaviour of the slope and their role in the soil response to precipitation, the dataset was analysed with machine learning (ML) techniques, i.e. clustering, and random forest. Indeed, ML allows managing a large number of data, such as those provided by assimilation of extensive monitoring networks, remote sensing, satellite products, and other sources, without introducing any mathematical model structure to highlight the cause–effect relationships linking the variables.

2 Materials and methods

The studied slope, described in Sect. 2.1, belongs to the Partenio Massif, and it has the typical characteristics of many pyroclastic slopes of Campania (southern Italy) (Greco et al., 2018). Indeed, three major zones characterised by unsaturated pyroclastic deposits can be identified in Campania (Cascini et al., 2008): the Campanian Apennine chain, composed of carbonate rock covered by a variable layer of pyroclastic soil (from 0.1 to 5 m); the Phlegraean district, formed by underlying densely fractured volcanic tuff bedrock, placed under several metres of pyroclastic soils; and the Sarno and Picentini Mountains, where a thin layer of pyroclastic material is found over a terrigenous bedrock. In these three areas, the thickness of the soil mantle is quite variable, according to the slope inclination and to the distance from the eruptive centre (De Vita et al., 2006; Tufano et al., 2021).

To identify the seasonally recurrent conditions that affect the attitude of the soil mantle to retain much of the infiltrating water, a large set of measurements of rainfall events and their effects on the slope would be required. Hence, to enrich the data available from the monitoring activities carried out for some years at the slope (Marino et al., 2020a), a synthetic dataset of the hydrologic response of the slope to precipitation has been generated with a stochastic Neyman–Scott Rectangular Pulse Model (NSRP) of rainfall (Rodriguez-Iturbe et al., 1987) and a simplified 1-D model of the interaction of the unsaturated pyroclastic soil mantle with the underlying perched aquifer forming in the epikarst. Both the models, described in the following sections, were previously developed based on experimental data (Greco et al., 2013, 2018; Marino et al., 2021). The obtained synthetic dataset has been compared to the limited dataset from field monitoring, showing a reasonable agreement. Therefore, it has been considered suitable to reproduce slope response to climate forcing, in terms of soil volumetric water content and perched aquifer water level, in the studied area (see Sect. 2.2).

The synthetic dataset has been analysed with machine learning techniques (Sect. 2.3), as they prove to be able to identify non-linear cause–effect relationships between variables, without introducing any model structure, as if the data were provided by field measurements. Figure 1 shows the flowchart of the entire methodology.

Figure 1Flowchart summarising the methodology followed in the analysis of sloping soil response to precipitation.


2.1 Case study

The study area refers to the north-east slope of Monte Cornito, part of the Partenio Massif (Campania, southern Italy), 2 km from the town of Cervinara, about 40 km north-east of the city of Naples. The slope was involved in a series of rapid shallow landslides after a rainfall event of 325 mm in 48 h during the night between 15–16 December 1999, causing casualties and heavy damage (Fiorillo et al., 2001). A field monitoring station was installed nearby the big landslide scarp in 2001. Further details of the investigated zone, with indications of the area affected by the largest of the landslides triggered in 1999, are shown in Fig. 2.

Figure 2Location of the study area and indication of the zone affected by a large landslide in 1999. Adapted from Marino et al. (2020a).

The Partenio Massif is part of the southern Apennines area. The bedrock mainly consists of Mesozoic–Cenozoic fractured limestones, mantled by loose pyroclastic deposits, resulting from the explosive volcanic activity of Somma–Vesuvius and the Phlegraean Fields, which occurred over the last 40 000 years (Rolandi et al., 2003).

The fractured limestone formations of the southern Apennines often host large karst aquifers, through which a basal groundwater circulation occurs, for which regional groundwater recharge between 100 and 500 mm yr−1 has been estimated, with 200 mm yr−1 regarding the area of Cervinara (Allocca et al., 2014). Moreover, recent studies showed that, in the upper part of the karst system, denoted as epikarst (Hartmann et al., 2014), which is more permeable and porous than the underlying rock, a perched aquifer often develops (Williams, 2008; Celico et al., 2010). It temporarily stores water and favours the recharge of the deep aquifer through the larger fracture system. The water, which is accumulated temporarily in the epikarst, also reappears at the surface in small ephemeral streams.

Specifically, the slope of Cervinara has an inclination between 35 and 50, at an elevation between 500 and 1200 m above sea level. The soil mantle, usually in unsaturated conditions, is the result of the air-fall deposition of the materials from several eruptions, so it is generally layered. It mainly consists of layers of volcanic ash (with particle size in the range of sands to loamy sands) alternating with pumices (sandy gravels), lying upon the densely fractured limestone bedrock. Near the soil–bedrock interface, a layer of weathered ash, characterised by finer texture (silty sand), with lower hydraulic conductivity, moderate plasticity, and low cohesion, is often observed (Damiano et al., 2012).

The soil mantle thickness varies spatially from a minimum of 1.0 m, in the steepest part of the slope, to larger values at its foot (up to 4–5 m). The thin soil mantle, compared to the slope width and length of hundreds of metres (Fig. 2) makes the flow processes nearly one-dimensional, except for the close proximity to geometric singularities.

The pyroclastic soils of the profile are characterised by high porosity (from about 50 % for the pumices to 75 % for the ash) and quite high values of saturated hydraulic conductivity (ranging up to the order of 10−5 m s−1). Thus, this kind of soil lets rainwater infiltrate even during the most intense rainfall events, with little runoff generation, and it can store a large amount of water without approaching saturation. The values of soil capillary potential, measured during the rainy season, rarely exceed 0.5 m, as observed also in other slopes of the area (Cascini et al., 2014; Comegna et al., 2016; Napolitano et al., 2016).

The climate is Mediterranean, which is characterised by dry and warm summer and rainy autumn and winter, with mean annual precipitation of about 1600 mm, mostly occurring between October and April. The total potential evapotranspiration ET0, estimated with the Thornthwaite formula (Shuttleworth, 1993), is between 700 and 800 mm in the altitude range from 750 to 400 m (Greco et al., 2018). The vegetation mainly consists of widespread deciduous chestnuts, with a dense understorey of brushes and ferns, growing during the flourishing period (between May and September). In fact, visual inspections of the soil profile showed a large amount of organic matter and roots. In most cases, roots are denser in the uppermost part of the soil mantle and become sparse between the depths of 1.50 and 2.00 m below the ground surface, reaching the basal limestones and penetrating the fractures.

Moreover, in the surrounding area, several ephemeral and perennial springs are present, mostly located at the foot of the slopes, which supply a network of small creeks and streams, allowing us to show the activity of the aquifer discharge to the surface water. An indication regarding the Castello stream (the main stream for this side of the basin), with springs, is shown in Fig. 3, where, during a field survey in 11 November 2021, the surface water flow appeared (springs) and disappeared (sinks) in some points along the stream course. Normally the stream exhibits its lowest water depth values up to the beginning of late autumn (Marino et al., 2020a), but it is interesting to note that the surface water in the stream emerging from the epikarstic springs is an indicator of the active slope drainage.

Figure 3Identification of surface water flow in the Castello stream at the beginning of the rainy season in November 2021 by visual recognition of springs and sinks in the watercourse.

2.1.1 Field monitoring data

Several hydrological monitoring activities have been carried out at the slope of Cervinara since 2001, initially consisting of measurements of precipitation and manual readings (every 2 weeks) of soil suction by “Jet-Fill” tensiometers, equipped with a Bourdon manometer (Damiano et al., 2012). Since November 2009, an automatic monitoring station has been set at an elevation of 585 m a.s.l., near a narrow track close to the landslide scarp of December 1999. The installed instrumentation consisted of tensiometers, time domain reflectometry (TDR) probes for water content measurements, and a rain gauge (Greco et al., 2013; Comegna et al., 2016).

Since 2017, the hydrometeorological monitoring was enriched (Marino et al., 2020a), aimed at understanding the seasonal behaviour of the slope and the interactions between the hydrological systems, i.e. the unsaturated soil mantle, the epikarst, and the underlying fractured bedrock.

Specifically, the data collected by tensiometers and TDR probes were supplemented with those from a meteorological station (composed of a thermo-hygrometer, a pyranometer, an anemometer, a thermocouple for soil temperature measurement, and a rain gauge) and with the water level in two streams at the slope foot, so as to gain useful information for the assessment of the water balance of the studied slope.

The data from field monitoring, carried out between 2017 and 2020 with an hourly resolution, consist of rainfall, evapotranspiration, soil moisture and suction at various depths, and the water depth of the Castello stream. The data have been useful to highlight seasonally recurrent soil moisture distributions. More details about the measured data and the observed recurrent seasonal behaviour of the area of Cervinara can be found in Marino et al. (2020a).

2.2 Synthetic dataset

In order to identify suitable variables to be monitored in the field for the identification of the conditions controlling different slope responses to precipitation, a rich dataset of rainfall and underground monitored variables, such as soil moisture and groundwater level, is needed. However, it is not always possible to analyse a complete field-monitored dataset, and, when it exists, it is commonly available for short periods, granting a relatively small number of measurements. Hence, a synthetic dataset, aimed at improving the information obtained from field monitoring, has been generated. This dataset has been obtained by means of the physically based mathematical model described hereinafter (Sect. 2.2.2). The model has been run with a 1000-year synthetic hourly rainfall series, obtained with a stochastic rainfall generator, for which further details are given in Sect. 2.2.1. The choice of such a long synthetic series has been made to obtain a large number of data, representative also of conditions rarely occurring at the slope and large enough to ensure the significance of the analyses carried out with ML techniques. In this respect, it is worth noting that the adopted clustering and random forest techniques allow easy handling of a large number of data without an unaffordable computational burden.

2.2.1 Definition of synthetic rainfall events

The Neyman–Scott Rectangular Pulse Model (NSRP) has been used to obtain a 1000-year-long synthetic hourly series of precipitation. The NSRP model reproduces the precipitation process as a set of rain clusters, composed of possibly overlapping rain cells embodied by rectangular pulses, each one with random origin. The storm duration is represented by the cell width, and its height represents the associated rainfall intensity, so that when multiple cells overlap, the total intensity is the sum of the intensities of the overlapping cells (Rodriguez-Iturbe et al., 1987; Cowpertwait et al., 1996).

NSRP model calibration requires the identification of five parameters, using the method of moments (Peres and Cancelliere, 2014), based on available rainfall data for the investigated site. Specifically, the data from the rain gauge station of Cervinara, situated near the Loffredo village, belonging to the Civil Protection Agency of Campania Region available from January 2001 to December 2017 with a time resolution of 10 min, were used.

The aim of this study is the identification of variables expressing the slope conditions responsible for of different responses to precipitation. In that sense, it is important to define the events within the rainfall time series to clearly distinguish antecedent conditions from the effects of the current rainfall event.

In other words, within the 1000-year-long time series, a criterion should be identified to separate rainfall events, so that a new event begins only when the effects of the previous one disappeared. For this study, the events were defined as periods with at least 2 mm of rainfall, preceded and followed by at least 24 h with less than 2 mm (i.e. smaller than the mean daily potential evapotranspiration estimated for the case study). Indeed, the separation period of 24 h is commonly used for the definition of the empirical thresholds for early warning systems against rainfall-induced landslides (e.g. Peres et al., 2018; Segoni et al., 2018; Marino et al., 2020b).

In fact, the mean volumetric water content (θ) at 10 cm depth drops below soil field capacity (θ≅0.35) 24 h after the end of each event (Fig. 4) in all the cases in which such a value was exceeded before the end of the event. This shows that a dry interval of 24 h after a rainfall event is long enough for drainage processes to remove from the topsoil most of the water infiltrated from the previous event. As topsoil moisture controls the infiltration capacity at the ground surface, after such an interval the infiltration of new rainfall is only little affected by the remnants of the previous rainfall event.

Figure 4Scatter plot of event rainfall depth and mean volumetric water content of the top 10 cm soil depth 1 h (grey dots) and 24 h (black dots) after the end of each rainfall event.


With the assumed separation criterion, a total of 53 061 rainfall events within 1000 years are obtained, with durations ranging between 1 and 570 h and a total rainfall depth between 2 and 710 mm.

2.2.2 Slope hydrological model

As already pointed out in Sect. 2.1, the regular geometry of the slope and the hydraulic characteristics of the soils make the flow processes in the soil mantle mostly one-dimensional. Indeed, a simplified 1-D model has previously been developed and successfully validated according to the data collected during the hydrological monitoring activities (Greco et al., 2013, 2018), and it was applied to investigate the hydrological response of the slope to synthetic hourly precipitation data. The unsaturated flow through the soil mantle is modelled with a 1-D head-based Richards' equation (Richards, 1931), assuming for simplicity a single homogeneous soil layer, and it is coupled with a model of the saturated water accumulated in the perched aquifer. The adoption of a 1-D model is allowed thanks to the geometry of the considered mantle and to the prevailing water potential gradients orthogonal to the ground surface when the soil is in unsaturated conditions.

The root water uptake has been accounted for in the source term of the model, according to the expressions by Feddes et al. (1976), based on estimated potential evapotranspiration, with a maximum root penetration depth equal to the soil mantle thickness and triangular root density shape.

Two boundary conditions are considered for the unsaturated soil mantle. At the ground surface (i.e. the upper boundary condition), if the rainfall intensity is greater than the current infiltration capacity, the excess rainfall forms overland runoff. Otherwise, all rainfall intensity is set as infiltration. The bottom boundary condition links the soil mantle to a perched aquifer developing in the fractures and hydraulically connected to the unsaturated cover through the weathered soil layer (less conductive and capable of retaining much water), located at the contact between the cover and the bedrock. This soil layer penetrates the vertical conduits and fractures (Greco et al., 2013). In this context, the perched aquifer is modelled as a linear reservoir model that receives water from the gravitational leakage of the overlying unsaturated soil mantle and releases it as deep groundwater recharge and spring discharge (Greco et al., 2018). This conceptualisation of the perched aquifer behaviour implies that the streamflow, supplied by the springs, is linearly related to the aquifer water level temporarily developing in the epikarst. Indeed, with this assumption, the model closely reproduces the trend of the stream water level observed in the field (Greco et al., 2018; Marino et al., 2020a). The pressure head at the soil–bedrock interface is assumed to follow the fluctuations in the water table of the underlying aquifer.

The hydraulic parameters of the homogeneous soil mantle have been obtained considering the information from previous laboratory tests (Damiano and Olivares, 2010) and field monitoring data analysis (Greco et al., 2013), considering the van Genuchten–Mualem model for the hydraulic characteristic curves (van Genuchten, 1980). Specifically, the parameters of the hydraulic characteristic curves were searched with a genetic algorithm, constrained within intervals ensuring that the obtained curves resemble available measurements of water retention and unsaturated hydraulic conductivity, obtained both in the field and in the laboratory (Greco et al., 2013). The parameters describing the hydraulic behaviour of the perched aquifer hosted in the upper part of the limestone bedrock were derived from previous studies, which showed that the model satisfactorily reproduced the fluctuations in water potential and moisture, observed at various depths in the unsaturated soil cover, both during rainy and dry seasons (Greco et al., 2013, 2018). Model parameters are summarised in Table 1. The groundwater level of the perched aquifer is given in reference to the base of the epikarst, which is assumed to be 14 m below the soil–bedrock interface.

Table 1Hydraulic parameters of the coupled model of the unsaturated soil mantle and of the aquifer hosted in the epikarst (Greco et al., 2021).

Download Print Version | Download XLSX

The equations have been numerically integrated with the finite difference technique, with a time step of 1 h over a spatial grid with a vertical spacing of 0.02 m.

The model assumes a homogeneous soil profile and a simplified slope geometry, and indeed it is not aimed at reproducing the details of flow processes through the unsaturated soil mantle. Consequently, the hydraulic properties of the homogeneous soil layer should be regarded as effective properties, useful to reproduce the major features of the infiltration and drainage phenomena. The model is used to assess how large-scale (in time and space) hydrological processes, such as long-term cumulated rainfall and evapotranspiration and perched aquifer recharge, control the conditions that affect the response of the soil mantle to precipitation events. In this sense, the obtained results can be considered representative of large areas that share the major geomorphological features of the slopes of the Partenio Massif.

2.2.3 Synthetic hydrometeorological data

As it has been stated from previous sections, the dataset comes from the simulation of the hydrologic response of a slope to 1000-year-long hourly rainfall time series, carried out with a physically based model, calibrated for the case study. The output contains the time series of soil water content and suction at all depths throughout the soil mantle, of the water exchanged between the soil and the atmosphere, of the leakage through the soil–bedrock interface, and of the predicted water level of the underlying aquifer.

One hour before the onset of each rainfall event, the following variables have been extracted, as they would be measurable in the field and are representative of antecedent conditions: the aquifer water level (ha), the mean volumetric water content in the uppermost 6 cm of the soil mantle (θ6), and the mean volumetric water content in the uppermost 100 cm of the soil mantle (θ100). To quantify the effects of rainfall on the slope response, the change in the water stored in the soil mantle at the end of each rainfall event (ΔS) has been computed and compared with the total rainfall depth of the event (H).

Specifically, the inclusion of soil water content information has been chosen, as it can be obtained from available satellite-derived remote-sensing products (Paulik et al., 2014; Pan et al., 2020) or from field sensor networks (Wicki et al., 2020). Regarding satellite products, in many cases not giving precise water content values, they satisfactorily reproduce temporal trends, which represent valuable information for hazard assessment.

Besides, as the model introduces a linear relationship to estimate the outflow from the groundwater system, the monitored stream water level has been regarded as interchangeable with the simulated groundwater level, as the two variables are assumed to be directly linked in the model.

2.3 Data analysis techniques

The resulting dataset has been analysed with machine learning techniques, aimed at capturing the complex interactions between the hydrological subsystems (i.e. soil mantle, fractured bedrock, surface water). Indeed, the analysis of the data is not only constrained to classical statistical analyses, such as data frequency distributions, but also to data classification based on their geometrical distribution and on quantifying the importance of the considered antecedent variables on the simulated response as well.

2.3.1 Variable importance assessment by random forest

The aim of this study is to find a set of measurable variables which, based only on field measurements, provide valuable information for predicting the response of the soil mantle to precipitation. In this respect, a suitable tool is represented by random forest (RF), a machine learning method that is based on the theory of regression/classification trees, bagging data and capturing even the complex or non-linear interactions between the data of a set with relatively low bias (Breiman, 2001). This method is often used to forecast a desired variable based on predictor variables in terms of regression or a classification set of randomly constructed trees. RF analysis of importance allows quantifying how informative the input variables are to make good predictions of the output, which should not be confused with the information provided by a variance-based sensitivity analysis (SA). In fact, this latter fact, always based on a mathematical model linking input variables to output, explains how the variability in the output is related to the variability in the inputs, regardless of how the output of a model resembles available observations. As in this case the analysed dataset is synthetic (i.e. it has been obtained through a mathematical model), the results of a variance-based SA will also be presented, allowing us to compare the different kinds of information provided by the two analyses.

In this case, a regression-based random forest technique is applied to predict the soil storage response (ΔS) at the end of each rainfall event of total depth H, using as predictors all possible triplets of variables described in Sect. 2.2.3 (H, ha, θ6, and θ100). Specifically, four random forest models have been developed: RF1 with input features H,θ6,ha, RF2 with input features H,θ100,ha, RF3 with as input features H,θ6,θ100, and RF4 with input features H,θ6,θ100. Overall, 80 % of the dataset was used to train the models and tune the major hyperparameters of the random forest algorithm: the number of trees, the maximum depth, the minimum sample leaf, and the maximum number of features (more details about the evaluation and optimisation of the hyperparameters are provided in Appendix B).

Then, the best predictor triplet of variables is selected according to the lowest value of the root mean square error (RMSE) calculated using the test dataset consisting of 20 % of the remaining data.

Furthermore, to understand how a single predictor variable affects the regression model, the importance of input variables (features) in the random forest regression model has been assessed through the mean decrease in impurity (Breiman, 2001), which is a measure of the ability of the tree to split the dataset into classes. Impurity is here computed as the mean decrease in RMSE, when a particular variable is used for splitting nodes across all the trees in the RF. Specifically, RMSE is employed to assess the quality of splits and to determine the importance of features in predicting output values.

2.3.2 Data classification by clustering analysis

The exploratory analysis of spatial large datasets is often performed by means of clustering techniques, aimed at identifying different classes in the data and accounting for on the distribution of the variables under study. There are two types of clustering algorithms used for class identification purposes: algorithms based on the density of points and algorithms based on the distance between points. The algorithm used here is named k means, and it is a distance-based procedure to cluster data, based on the number of desired clusters and their centroids. The algorithm assigns every element in the dataset to a cluster, iteratively minimising the variance of the Euclidean distance between the elements of each cluster and their centroids. Consequently, the data labelling is done based on their geometrical disposition in the dot cloud, depending on the target number of clusters to be identified (Lloyd, 1982; Arthur and Vassilvitskii, 2007). When variables with very different magnitudes are being related for clustering purposes, it is convenient to normalise the data, keeping the relative distances between observations. Therefore, the clustering here is applied to the standardised data to exploit the variance of each variable and keeping the geometrical disposition between observations stable.

As the k-means algorithm does not automatically estimate the optimal number of clusters to be identified within the dataset, the silhouette metric has been used here to evaluate the preferred number of clusters (Rousseeuw, 1987; de Amorim and Hennig, 2015). In fact, this metric quantifies the quality of cluster identification as the difference between the overall average intra-cluster distances and the average inter-cluster distances (i.e. the larger the difference, the better the cluster quality). In that way the metric would always be a value ranging from 1 and 1, where typically 1 means that clearly distinguished clusters have been identified, 0 means that the identified clusters are indifferent, and 1 means that data are mixed in the identified clusters.

3 Results and discussion

The analysis is carried out on both field-monitored and synthetic datasets to quantify the information provided by the defined antecedent variables useful to predict the seasonal changes in the slope response to precipitation. The analysis of the physical behaviour of the studied slopes is based on the results of model simulations, as if they satisfactorily resemble what could be measured in the field. Indeed, the uncertainty in model parameters may affect the identified cause–effect relationships. However, during the calibration of model, field measurements of the hydraulic behaviour of the involved soil were considered (Greco et al., 2013); thus the major features of the hydrological processes occurring in the slope are considered reliably reproduced in the synthetic dataset.

3.1 Role of measurable variables in the response of the soil mantle

To select the most informative triplets of variables for predicting the change in water storage (ΔS) in the soil mantle, associated with rainfall events of total depth H, four random forest models are trained to predict the ratio ΔS/H, based on the dataset consisting of all possible combinations of the synthetic variables: H,θ6,ha, H,θ100,ha, H,θ6,θ100, and θ6,θ100,ha. In fact, the change in storage ΔS is obviously strongly dependent on the event rainfall depth H (i.e. the more it rains the more soil storage increases), thus concealing important hydrological processes going on in the slope. Differently, the choice of the ratio ΔS/H, a measure of the amount of rain that remains stored in the soil mantle, allows detaching the water drainage processes from the water accumulation processes. For each random forest model, the values of the root mean square error (RMSE) are calculated, and the importance of each predictor variable is evaluated according to the procedure described in Sect. 2.3.1. The computational effort implied in doing the calculations by a conventional workstation with a Core(TM) i7-10870H processor and 16 GB of RAM memory is less than 2 min for each model run. The obtained results are reported in Table 2.

Table 2RMSE and variable importance for H, θ6, θ100, and ha in the prediction of soil response described as ΔS/H.

Download Print Version | Download XLSX

All the choices of triplets indicate that all the tested variables are informative to predict the normalised soil mantle response ΔS/H (Table 2), with the perched groundwater level, ha, resulting in the most influential variable. The importance of ha on the response of the soil mantle suggests that, in some conditions, the change in soil storage is affected by the effectiveness of water exchange between the soil mantle and the underlying aquifer, as it will be discussed in the following sections. Moreover, in Table 2 the triplet showing the lowest RMSE values is formed by the total rainfall depth, the aquifer water level, and the mean volumetric water content in the uppermost 100 cm. According to the random forest model, they are the most informative for predicting the soil mantle response. Therefore, the triplet H,θ100,ha is used for further analysis.

Considering the triplet of input variables H,θ100,ha, a variance-based sensitivity analysis has been also carried out, based on the methodology outlined by Sobol (2001), which is implemented in the Sensitivity Analysis Library in Python – SALib toolbox (Herman and Usher, 2017; Iwanaga et al., 2022). The sampling scheme proposed by Saltelli (2002) has been used to generate 65 536 triplets, so as to have a similar number of data as for the RF importance analysis. Table 3 reports the obtained sensitivity indices.

Table 3Sensitivity indices of the variance-based SA of the variability in ΔS/H resulting from variations in H, θ100, and ha.

Download Print Version | Download XLSX

Interestingly, the indices show how the aquifer water level, ha, which is the most informative variable for output predictions according to the RF analysis, is responsible only for a small part of the output variability, which instead is mostly related to the variations in the other two input variables. As will be discussed in Sect. 3.2 and 3.3, ha, not affecting the variability in ΔS/H, is an extremely informative variable anyway, as it allows us to separate the initial conditions into two families: low levels and high levels, corresponding to quite different responses of the soil mantle to precipitation. The situation also arises that output variability mostly depends on the variations in single inputs (i.e. the indices S1 explain most of the total sensitivity, and the indices S2, measuring the contribution to the total output variance deriving from mutual interactions between couples of inputs, are all small).

3.2 Soil and underground antecedent conditions

The field monitoring activities allow us to get a complete dataset that traces the rainfall values coupled with the soil mean volumetric water content in the uppermost metre of the soil profile (θ100) and the water depth of the Castello stream (hs), both measured hourly for 3 years. The field-monitored data, composed of 57 rainfall events, include the water level of the Castello stream rather than the direct measurement of the aquifer water level (ha). Nevertheless, a direct relationship links the water level in the aquifer and the water level in the stream, as assumed for mathematical modelling. This dataset has been enriched synthetically, as has been described in Sect. 2.2.

Therefore, to analyse the effects of the underground conditions on the slope response, Fig. 5 shows the simulated data (circular dots in the background) and the field-monitored data (triangular coloured dots). Logarithmic axes are used to distinguish the very low aquifer water level from the high values.

Figure 5Field-monitored mean volumetric water content in the upper metre of the soil profile (θ100) and water depth in the Castello stream (hs), compared with synthetic data of θ100 and aquifer water level (ha) (on the vertical axis, plotted on a logarithmic scale to help to visualise of low water levels and thus not allowing us to represent zeroes; the values of hs smaller than the sensitivity of the water level sensor have been plotted as 1 mm; also, the smallest simulated values of ha should be considered equivalent to zero, owing to the limits of any measurement device which could be used for operational field monitoring).


Four major seasonally recurrent conditions could be identified for the water in the subsurface system from field-monitored data: first, a condition usually occurring between December and May is characterised by the highest water content in the soil and the highest measured water level in the stream. Second, the period from June to July is characterised by intermediate water content values, with a still high level in the stream. Third, the period from August to September is characterised by the lowest values of water content in the soil but also the lowest water depth hs measured in the stream (a few centimetres, in some cases nearly zero). Finally, the period from October to November is characterised by a wide range of values in soil water content and a relatively low range of stream water depth.

The underground antecedent conditions are naturally linked to a seasonal behaviour dominated by the hydrological conditions which can be traced in time, as can be seen from the synthetic data (Fig. 6). The months from December to April follow a winter and spring behaviour, characterised by wet soil conditions and aquifer water levels ranging from low to high. From June to July, a late spring behaviour is visible, characterised by relatively dry soil (i.e. most of the data falling below soil field capacity), in combination with relatively high groundwater levels (indicating a still active slope drainage). In August and September, a summer-like behaviour is shown, with the driest soil water content and a generally low aquifer water level. Finally, in October and November, the end of the dry season is shown: a wide range of soil wetness coupled with a still low aquifer water level.

Figure 6Seasonal behaviour of the aquifer water level (ha) and the mean volumetric water content of the upper metre of the soil profile (θ100) for the synthetic dataset (the vertical axis is plotted on a logarithmic scale to help to visualise low water levels).


For both the field-monitored and synthetically obtained datasets, the observed conditions are the result of the time lag between the beginning of the rainy season and the slope response. The recurrent seasonal behaviour observed for the synthetic dataset, although delayed or anticipated owing to the year-by-year variability in rainfall, is close to that observed in the field.

The overall situation for the synthetic dataset of antecedent conditions (i.e. doublets θ100,ha) can be described by the distribution of each individual variable, which can be seen in the histograms shown in Fig. 7. It is interesting to note that, for both θ and ha, a bimodal behaviour is observed, corresponding to dry and wet field conditions.

Figure 7Histograms for data distributions of (a) θ100 and (b) ha for the synthetic dataset.


The k-means clustering technique has been used to investigate the geometrical distribution of the doublets θ100,ha, with the number of clusters ranging from 2 to 7. According to the silhouette metric, the optimal number of clusters is 3, with a metric value of 0.7, allocating 28 %, 30 %, and 42 % of the data in clusters 1, 2, and 3, respectively. Figure 8 shows the three clusters obtained within the synthetic dataset. Centroid positions are also displayed, showing the zones of the clouds where most of the dots are gathered. This representation of the data uses both vertical and horizontal axes on a linear scale to let us visualise distance magnitudes between the different clusters, but it corresponds to the same dataset shown in Fig. 6.

Figure 8Identified clusters for the doublets θ100,ha representing underground antecedent conditions of the synthetic dataset. For each cluster, the centroids are shown.


The distribution of the data after clustering is also analysed for each cluster, and the histograms are shown in Fig. 9. It seems clear that the clusters capture different couplings of dry and wet underground antecedent conditions.

Figure 9Histograms for data distributions of (a) θ100 and (b) ha, according to each identified cluster in the doublets θ100,ha.


In fact, cluster 1 captures dry conditions, with a volumetric water content below the field capacity θfc (it was estimated as 0.35, with the empirical relationship proposed by Twarakavi et al. (2009) according to the van Genuchten model parameters) and low values of ha. Differently, clusters 2 and 3 capture scenarios related to relatively wet soil mantle conditions (i.e. θ100>θfc), coupled to low ha in cluster 3 and gathering scenarios normally observed in late autumn, and to the highest ha conditions for cluster 2, comprising conditions normally occurring in late winter and spring.

The two chosen variables, θ100 and ha, allow us to identify three different antecedent slope conditions 1 h before the onset of any rainfall event. Hence, it is worth investigating how these different antecedent conditions may be related to different slope responses to precipitation.

3.3 Effects of soil and underground antecedent conditions on the slope response to rainfall

The analysis of the data has been focused on identifying clusters within the triplets θ100,ha,ΔS/H, aiming to evaluate the slope response as the amount of rainwater being stored/drained in the soil mantle. The results are plotted in the space composed of the variables that can be monitored in the field: θ100,ha,H.

As an increase in soil storage during rainfall events is not always expected, the identification of draining slope conditions is an important aspect.

Figure 10Clustering results of the synthetic data triplets θ100,ha,ΔS/H represented in the space θ100,ha,H.


Figure 11Clustering results of the triplets θ100,ha,ΔS/H in (a) the (θ100,ha) plane, (b) the (θ100,H) plane, and (c) the (H,ha) plane.


Figures 10 and 11 show the data clusters for the triplets θ100,ha,ΔS/H, for any identified rainfall event, represented in the θ100,ha,H space in a logarithmic axis representation. The silhouette metric in this case suggests 4 as an optimal number of clusters with a metric value of 0.61. It is remarkable that three of the clusters are close to those already identified from the antecedent (seasonally recurrent) underground conditions (Sect. 3.2).

Specifically, clusters 1, 2, and 3 correspond to different slope processes according to ΔS/H (Fig. 12). Even if cluster 1 and cluster 2 show similar responses, with slightly smaller ΔS/H for cluster 1, the controlling processes are indeed different; the conditions of cluster 1 typically occur in dry seasons with long dry periods between short rainfall events, leading to dry antecedent conditions, so that an accumulation of water in the soil mantle (increase in water storage) is expected at each event. The data in cluster 2 are typically related to wet seasons, especially in late winter and spring, where rainfall events are more frequent, leading to antecedent wet soil (θ100θfc) and antecedent high groundwater level. However, these conditions do not seem to correspond to effective slope drainage, so that the slope response in cluster 2 is comparable to that observed in cluster 1 in terms of ΔS/H. Instead, the conditions gathered in cluster 3 differ from those in cluster 2 for the lower aquifer water level ha, and the highest ΔS/H indicates the lowest slope drainage.

Figure 12Distribution of the slope response ΔS/H for the data in each cluster.


The additional cluster 4 identified here highlights a particular slope response, as it catches all the conditions where nearly zero and negative ΔS take place, meaning an effective slope drainage during rainfall events. It is interesting to note that, even for relatively high rainfall events (above 100 mm), this slope response occurs when soil moisture is above the field capacity and when this condition is coupled with very high groundwater level, probably due to the high permeability all along the soil mantle and due to the hydraulic connection with the underlying aquifer.

4 Conclusions

This study aims to identify and analyse the major hydrological controls of the slope response to precipitation and, in that way, defining suitable variables to be monitored in the field to predict such a response. The studied case refers to the hydrological processes in a slope system consisting of a pyroclastic soil mantle overlaying a fractured karstic bedrock, where a perched aquifer develops during the rainy season. A synthetic time series of slope response to precipitation has been built, thanks to a physically based model, previously calibrated with field monitoring data, coupled with a stochastic rainfall generator. Synthetic and experimental data show substantial agreement. In fact, the soil water content values measured in the field are close to those of the synthetic dataset. Furthermore, the simulated epikarst water level shows similar seasonal behaviour to the stream level records, which are indeed directly related to the discharge from the epikarst aquifer. The synthetic dataset has been explored with random forest and k-means clustering to evaluate the slope response characterised as the change in water stored in the soil mantle (ΔS) during precipitation events with rainfall depth H, starting from different underground antecedent conditions. These were quantified through the mean volumetric water content in the uppermost metre of the soil mantle (θ100) and the aquifer water level (ha) 1 h before the onset of rainfall.

The ratio ΔS/H, which allows us to identify soil mantle response regardless of the amount of event precipitation, is sensitive to both ha and θ100, with the groundwater level being the most influential antecedent variable. The underground antecedent conditions, characterised by θ100 and ha and linked to the seasonal meteorological forcing, allow us to identify different responses, related to the seasonally active hydrological processes.

High perched groundwater level, typical of winter and spring, indicates active drainage from the soil mantle, which compensates for rainwater infiltration, so that the soil storage remains stable, or even reduces, even after large rainfall events.

Differently, low perched groundwater level corresponds to impeded drainage. When it occurs with initially dry soil mantle (typically in summer and early autumn), it tends to retain all the infiltrated rainwater as increased soil storage. When the soil mantle is already wet (i.e. above the field capacity) at the onset of rainfall events, as usually happens in late autumn and early winter, the increase in soil storage is smaller, as the soil approaches saturation.

The presented results suggest that monitoring antecedent conditions, by measuring suitable variables to identify the major hydrological processes occurring in the slope in response to precipitation, can be useful to understand such processes and to develop effective predictive models of slope response. Therefore, the proposed methodology can be replicated also in other contexts and be useful for several hydrologic applications: from the water supply to natural streams due to infiltrated water to the hydric stress estimation in crops (e.g. the centenary chestnut forests of the case study) especially in very dry seasons but also for the design of effective monitoring networks exploiting geohydrological information for geohazard prevention (and early warning).

Appendix A: Calibration of the stochastic rainfall generator

Figure A1Monthly plot of hourly rainfall characteristics calculated based on the experimental data of the rain gauge of Cervinara.


The Neyman–Scott Rectangular Pulse (NSRP) model (Neyman and Scott, 1958; Rodriguez-Iturbe et al., 1987; Cowpertwait et al., 1996) is here used as a stochastic rainfall generator. The NSRP describes the process of point rainfall as a superposition of randomly arriving rain clusters, each containing several rain cells with constant intensity. The hyetograph within a cluster is obtained by summing the intensity of the various cells belonging to the cluster. It has been calibrated based on 17 years of experimental data (2000–2016) of rainfall depth at a 10 min time resolution, recorded by the rain gauge managed by the Civil Protection in Cervinara (southern Italy). The calibration has been carried out by minimising, for rainfall aggregated at various durations, the difference between the following quantities, estimated by the model and calculated from the experimental data: mean, variance, lag 1 autocorrelation, the probability of a dry interval, the probability of a transition from a dry-to-dry interval, and the probability of a transition from a wet-to-wet interval. The calibration procedure, based on the one proposed by Cowpertwait et al. (1996), is described in detail in Peres and Cancelliere (2014). To account for the seasonality of rainfall, these quantities have been calculated month by month in the experimental record (Fig. A1), suggesting that the calibration of the NSRP model should be carried out separately for seven homogeneous periods (September, October, November, December–March, April, May–June, July–August).

Table A1 gives the obtained parameters of the NSRP stochastic model, where λ represents the parameter of a Poisson process describing the arrival of clusters; ν is the mean number of cells in a cluster, also described by a Poisson process; β is the parameter of an exponential probability distribution describing the arrival times of each cell in a cluster, expressed as the number of time intervals of 10 min starting from the beginning of a cluster; η is the parameter of an exponential probability distribution describing the duration of rain cells; and ξ is the parameter of a Weibull probability distribution describing the rain intensity of cells, with cumulative probability function Fx,ξ,b=1-exp-ξxb, in which x is cell rain intensity and the parameter b=0.8 has been set a priori (Cowpertwait et al., 1996).

Table A1Parameters of the NSRP model.

Download Print Version | Download XLSX

The adherence of the rainfall generated with the stochastic model to the experimental rainfall data has been tested by evaluating rainfall characteristics different from those used for the calibration. For instance, Fig. A2 shows the comparison of the rainfall depth, cumulated over 1 year, for the experimental data (17 years) and for 1000 years of synthetic data generated with the calibrated NSRP model.

Figure A2Comparison of observed (black) and simulated (red) cumulated rainfall plots in a year.


In Fig. A3, the boxplot of the maximum hourly rainfall in 1 year, observed in the experimental dataset of 17 years, is compared with the same boxplot referring to 20 series of 17 years randomly extracted from the generated 1000-year synthetic rainfall series. Several of the synthetic 17-year intervals show a distribution of the maximum hourly rainfall close to the observed one.

Figure A3Comparison of observed and simulated distributions (boxplots) of the maximum hourly precipitation in a year, for series of the same length. Each panel shows the distribution for the 17 observed years (boxplot is always the same) and 17 randomly picked simulated years.


Regarding the required comparison between synthetic and observed wet and dry intervals, Fig. A4 shows the scatter plot of duration and total rain depth of the events, sorted by a separation “dry” interval of 24 h with less than 2 mm rainfall from the observed dataset (blue dots) and the synthetic dataset (grey dots). The plots show how the synthetic data contain the observed ones and that the shape of the dot clouds looks quite similar.

Figure A4Scatter plot of total rainfall event depth (H) vs. rainfall event duration (D). The events have been sorted within the rainfall datasets by considering a separation “dry” interval of 24 h with less than 2 mm rainfall. The blue dots represent events extracted from the 17-year experimental rainfall dataset, while the grey dots represent events extracted from the 1000-year synthetic rainfall dataset.


Figure A5 shows the frequency distributions of the durations of dry intervals belonging to the 17-year rainfall dataset and the same distribution for the dry intervals extracted from the 1000-year synthetic dataset: the two distributions look nearly identical.

Figure A5Frequency distributions of dry interval durations for events extracted from the 17-year experimental rainfall dataset (a) and events extracted from the 1000-year synthetic rainfall dataset (b). The events have been sorted within the rainfall datasets by considering a separation “dry” interval of 24 h with less than 2 mm rainfall.


Appendix B: Tuning random forest hyperparameters

The random forest (RF) algorithm (Breiman, 2001) has been very successful as a general-purpose classification and regression method. Starting from bagging or bootstrap aggregation (Efron and Tibshirani, 1993), RF builds several random de-correlated decision trees and then averages their predictions.

The regression RF algorithm can be summarised as follows. (1) By means of bootstrap, a sample is extracted from the training data. (2) Based on the bootstrapped data, a tree T of the random forest is grown by repeating the following operations until a leaf node (a node without split) is reached: (a) for each node, m variables are randomly selected from the p input variables or features (with 1mp); (b) among the m variables, the best variable and splitting point are selected according to a minimum criterion; (c) the node is split into two daughter nodes. To build the RF with B trees, steps 1 and 2 are repeated B times. Then, the prediction, Ypred, for a new observation, X, is the average of the final values, Tb(X), i.e. the values of the predicted variable corresponding to the leaves of each tree:

(B1) Y pred = 1 B b = 1 B T b X .

The main advantage of RF is the simplicity with which a forest can be trained and the parameters of the algorithms optimised. In this paper, the scikit-learn framework (Pedregosa et al., 2011) is used to run the RF algorithm.

The main hyperparameters of a RF are (1) n_estimators, the number of trees of the forest; (2) max_depth, the maximum depth of each decision tree in the forest; (3) min_samples_leaf, the minimum number of samples required to be at a leaf node; and (4) max_features, the number of features, or input variables, to consider when looking for the best split.

The procedure applied in this study to estimate and optimise the hyperparameters of the RF algorithm consists of the following steps:

  • Step 1. The dataset is divided into a training set and a test set, respectively, containing 80 % and 20 % of the data, randomly chosen.

  • Step 2. The K-fold cross-validation technique (Stone, 1974), with K= 10, is applied to empirically determine a set of values for the hyperparameters, using only the training dataset.

  • Step 3. For each fold, a RF is trained on the other K−1 folds of the data and tested on the first fold. This process is repeated K=10 times, so to use each of the K folds exactly once as the validation set. A performance metric is then calculated for each fold, to estimate how well the RF will perform on new data. In this work the root mean square error (RMSE) is used as the performance metric.

  • Step 4. The RF is trained by changing one hyperparameters at once and using the default values for the other three (default values of hyperparameters as reported in Pedregosa et al. (2011) are n_estimators = 100; max_depth =none, i.e. the tree is expanded until all leaves contain less samples than min_samples_split; min_samples_leaf = 1; max_features = 1).

  • Step 5. From the results of the previous step, the ranges of hyperparameters, given in Table B1, are defined. These values represent the grid in which the optimal hyperparameters are searched. In other words, using the K-fold technique (Step 2), the RF model is fitted K times, and then the optimal set of values is the one minimising the RMSE.

  • Step 6 (validation of the model). Once the optimal values of the hyperparameters are determined, the performance of the RF model is evaluated for the test dataset as defined in Step 1, using the RMSE.

In this study, the described methodology is used to evaluate the hyperparameters for the following RF models: RF1, trained using the input features H,θ6,ha; RF2, trained using H,θ100,ha; RF3, trained using H,θ6,θ100; RF4, trained using H,θ6,θ100. All models are trained to predict the normalised change in water storage in the soil mantle, ΔS/H. Figures B1, B2, B3, and B4 show the results of Step 4. Specifically, they depict the trends of the RMSE versus the hyperparameters for RF1, RF2, RF3, and RF4, respectively.

Figure B1Performance of random forest model RF1 on the test and cross- validation (CV) sets according to the test metric by changing the hyperparameters: (a) N_estimators, (b) Max_depth, (c) Max_ features, and (d) Min_samples_leaf.


Figure B2Performance of random forest model RF2 on the test and cross-validation (CV) sets according to the test metric by changing the hyperparameters: (a) N_estimators, (b) Max_depth, (c) Max_ features, and (d) Min_samples_leaf.


Figure B3Performance of random forest model RF3 on the test and cross-validation (CV) sets according to the test metric by changing the hyperparameters: (a) N_estimators, (b) Max_depth, (c) Max_ features, and (d) Min_samples_leaf.


Figure B4Performance of random forest model RF4 on the test and cross-validation (CV) sets according to the test metric by changing the hyperparameters: (a) N_estimators, (b) Max_depth, (c) Max_features, and (d) Min_samples_leaf.


The analysis of Figs. B1, B2, B3 and B4 provides the search grid of hyperparameters given in Table B1. After fitting each model K times (Step 5), the optimal sets of hyperparameters are reported in Table B2 for each RF model. Then, the performance of models RF1, RF2, RF3, and RF4 is evaluated on the test dataset using the RMSE metric. The obtained results are summarised in Table B3.

The above-described analysis has been used to identify the most informative triplet of variables, which has been chosen as the one corresponding to the best-performing of the optimal RF models, namely RF2.

Table B1Hyperparameters range of variation.

Download Print Version | Download XLSX

Table B2Optimal values of hyperparameters.

Download Print Version | Download XLSX

Table B3RMSE of studied models computed for the test dataset.

Download Print Version | Download XLSX

Code and data availability

The data used to derive to the conclusions of the present study and the codes for the data analysis are freely accessible in the public repository “Hydrological controls of slope response to precipitation – Code and Data” that will be accessible through (Roman Quintero et al., 2023).

Author contributions

RG and DCRQ formulated the research aim; PM provided the field measurements; PM and GFS supplied the model simulations; DCRQ and GFS curated and analysed the data; RG oversaw the research activities; DCRQ worked on the preparation and the data visualisation; DCRQ, PM, and GFS wrote the draft manuscript; RG wrote the final version of the manuscript.

Competing interests

At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.


Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.


This research is part of the PhD project of D.C. Roman Quintero entitled “Hydrological controls and geotechnical features affecting the triggering of shallow landslides in pyroclastic soil deposits” within the Doctoral Course “A.D.I” of Università degli Studi della Campania “Luigi Vanvitelli”. The authors would also like to thank David J. Peres for his help in the calibration of the NSRP model for the synthetic rainfall generation.

Financial support

This research has been supported by the Università degli Studi della Campania “Luigi Vanvitelli” (program VALERE: VAnvitelli pEr la RicErca; grant SEND: Smart Earl warning system for risk mitigation from Natural Disasters).

Review statement

This paper was edited by Marnik Vanclooster and reviewed by three anonymous referees.


Allocca, V., Manna, F., and De Vita, P.: Estimating annual groundwater recharge coefficient for karst aquifers of the southern Apennines (Italy), Hydrol. Earth Syst. Sci., 18, 803–817,, 2014. 

Arthur, D. and Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 7–9, 2007, in New Orleans, Louisiana, 1027–1035,, 2007. 

Bogaard, T. A. and Greco, R.: Landslide hydrology: from hydrology to pore pressure, WIRES Water, 3, 439–459,, 2016. 

Bogaard, T. and Greco, R.: Invited perspectives: Hydrological perspectives on precipitation intensity-duration thresholds for landslide initiation: proposing hydro-meteorological thresholds, Nat. Hazards Earth Syst. Sci., 18, 31–39,, 2018. 

Bordoni, M., Meisina, C., Valentino, R., Lu, N., Bittelli, M., and Chersich, S.: Hydrological factors affecting rainfall-induced shallow landslides: From the field monitoring to a simplified slope stability analysis, Eng. Geol., 19–37,, 2015. 

Breiman, L.: Random Forests, Mach. Learn., 45, 5–32,, 2001. 

Capretti, P. and Battisti, A.: Water stress and insect defoliation promote the colonization of Quercus cerris by the fungus Biscogniauxia mediterranea, Forest Pathol., 37, 129–135,, 2007. 

Cascini, L., Cuomo, S., and Guida, D.: Typical source areas of May 1998 flow-like mass movements in the Campania region, Southern Italy, Eng. Geol., 96, 107–125,, 2008. 

Cascini, L., Sorbino, G., Cuomo, S., and Ferlisi, S.: Seasonal effects of rainfall on the shallow pyroclastic deposits of the Campania region (southern Italy), Landslides, 11, 779–792,, 2014. 

Celico, F., Naclerio, G., Bucci, A., Nerone, V., Capuano, P., Carcione, M., Allocca, V., and Celico, P.: Influence of pyroclastic soil on epikarst formation: A test study in southern Italy, Terra Nova, 22, 110–115,, 2010. 

Chitu, Z., Bogaard, T. A., Busuioc, A., Burcea, S., Sandric, I., and Adler, M.-J.: Identifying hydrological pre-conditions and rainfall triggers of slope failures at catchment scale for 2014 storm events in the Ialomita Subcarpathians, Romania, Landslides, 14, 419–434,, 2017. 

Comegna, L., Damiano, E., Greco, R., Guida, A., Olivares, L., and Picarelli, L.: Field hydrological monitoring of a sloping shallow pyroclastic deposit, Can. Geotech. J., 53, 1125–1137,, 2016. 

Cowpertwait, P. S. P., O'Connell, P. E., Metcalfe, A. V., and Mawdsley, J. A.: Stochastic point process modelling of rainfall. I. Single-site fitting and validation, J. Hydrol., 17–46,, 1996. 

Dal Soglio, L., Danquigny, C., Mazzilli, N., Emblanch, C., and Massonnat, G.: Taking into Account both Explicit Conduits and the Unsaturated Zone in Karst Reservoir Hybrid Models: Impact on the Outlet Hydrograph, Water, 12, 3221,, 2020. 

Damiano, E. and Olivares, L.: The role of infiltration processes in steep slope stability of pyroclastic granular soils: laboratory and numerical investigation, Nat. Hazards, 52, 329–350,, 2010. 

Damiano, E., Olivares, L., and Picarelli, L.: Steep-slope monitoring in unsaturated pyroclastic soils, Eng. Geol., 137–138, 1–12,, 2012. 

Damiano, E., Greco, R., Guida, A., Olivares, L., and Picarelli, L.: Investigation on rainwater infiltration into layered shallow covers in pyroclastic soils and its effect on slope stability, Eng. Geol., 220, 208–218,, 2017. 

de Amorim, R. C. and Hennig, C.: Recovering the number of clusters in data sets with noise features using feature rescaling factors, Inf. Sci., 324, 126–145,, 2015. 

De Vita, P., Agrello, D., and Ambrosino, F.: Landslide susceptibility assessment in ash-fall pyroclastic deposits surrounding Mount Somma-Vesuvius: Application of geophysical surveys for soil thickness mapping, J. Appl. Geophys., 59, 126–139,, 2006. 

Di Crescenzo, G. and Santo, A.: Debris slides–rapid earth flows in the carbonate massifs of the Campania region (Southern Italy): morphological and morphometric data for evaluating triggering susceptibility, Geomorphology, 66, 255–276,, 2005. 

Efron, B. and Tibshirani, R. J.: An Introduction to the Bootstrap, Chapman and Hall, New York,, 1993. 

Feddes, R. A., Kowalik, P., Kolinska-Malinka, K., and Zaradny, H.: Simulation of field water uptake by plants using a soil water dependent root extraction function, J. Hydrol., 31, 13–26,, 1976. 

Fiorillo, F., Guadagno, F., Aquino, S., and De Blasio, A.: The December 1999 Cervinara landslides: Further debris flows in the pyroclastic deposits of Campania (Southern Italy), B. Eng. Geol. Environ., 171–184,, 2001. 

Forestieri, A., Caracciolo, D., Arnone, E., and Noto, L. V.: Derivation of Rainfall Thresholds for Flash Flood Warning in a Sicilian Basin Using a Hydrological Model, Procedia Engineer., 154, 818–825,, 2016. 

Gao, S. and Shain, L.: Effects of water stress on chestnut blight, Can. J. Forest Res., 25, 1030–1035, 1995. 

Greco, R. and Gargano, R.: A novel equation for determining the suction stress of unsaturated soils from the water retention curve based on wetted surface area in pores, Water Resour. Res., 51, 6143–6155,, 2015. 

Greco, R., Comegna, L., Damiano, E., Guida, A., Olivares, L., and Picarelli, L.: Hydrological modelling of a slope covered with shallow pyroclastic deposits from field monitoring data, Hydrol. Earth Syst. Sci., 17, 4001–4013,, 2013. 

Greco, R., Comegna, L., Damiano, E., Guida, A., Olivares, L., and Picarelli, L.: Conceptual Hydrological Modeling of the Soil-bedrock Interface at the Bottom of the Pyroclastic Cover of Cervinara (Italy), Proced. Earth Plan. Sc., 122–131,, 2014. 

Greco, R., Marino, P., Santonastaso, G. F., and Damiano, E.: Interaction between Perched Epikarst Aquifer and Unsaturated Soil Cover in the Initiation of Shallow Landslides in Pyroclastic Soils, Water, 10, 948,, 2018. 

Greco, R., Comegna, L., Damiano, E., Marino, P., Olivares, L., and Santonastaso, G. F.: Recurrent rainfall-induced landslides on the slopes with pyroclastic cover of Partenio Mountains (Campania, Italy): Comparison of 1999 and 2019 events, Eng. Geol., 288, 106160,, 2021. 

Greco, R., Marino, P., and Bogaard, T. A.: Recent Advancements of Landslide Hydrology, WIRES Water, 10, e1675,, 2023. 

Hartmann, A., Goldscheider, N., Wagener, T., Lange, J., and Weiler, M.: Karst water resources in a changing world: Review of hydrological modeling approaches, Rev. Geophys., 52, 218–242,, 2014. 

Herman, J., and Usher, W.: SALib: an open-source Python library for Sensitivity Analysis, The Journal of Open Source Software, 2, 97,, 2017. 

Iwanaga, T., Usher, W., and Herman, J.: Toward SALib 2.0: Advancing the accessibility and interpretability of global sensitivity analyses, Socio-Environmental Systems Modelling, 4, 18155–18155,, 2022. 

Lloyd, S. P.: Least Squares Quantization in PCM, IEEE T. Inform. Theory, 28, 129–137,, 1982. 

Lu, N. and Likos, W. J.: Suction Stress Characteristic Curve for Unsaturated Soil, J. Geotech. Geoenviron., 131–142,, 2006. 

Marino, P., Comegna, L., Damiano, E., Olivares, L., and Greco, R.: Monitoring the Hydrological Balance of a Landslide-Prone Slope Covered by Pyroclastic Deposits over Limestone Fractured Bedrock, Water, 12, 3309,, 2020a. 

Marino, P., Peres, D. J., Cancelliere, A., Greco, R., and Bogaard, T. A.: Soil moisture information can improve shallow landslide forecasting using the hydrometeorological threshold approach, Landslides, 17, 2041–2054,, 2020b. 

Marino, P., Santonastaso, G. F., Fan, X., and Greco, R.: Prediction of shallow landslides in pyroclastic-covered slopes by coupled modeling of unsaturated and saturated groundwater flow, Landslides, 31–41,, 2021. 

McDowell, N., Pockman, W. T., Allen, C. D., Breshears, D. D., Cobb, N., Kolb, T., Plaut, J., Sperry, J., West, A., Williams, D. G., and Yepez, E. A.: Mechanisms of plant survival and mortality during drought: Why do some plants survive while others succumb to drought?, New Phytologist, 178, 719–739,, 2008. 

Napolitano, E., Fusco, F., Baum, R .L., Godt, J. W., and de Vita, P.: Effect of antecedent-hydrological conditions on rainfall triggering of debris flows in ash-fall pyroclastic mantled slopes of Campania (southern Italy), Landslides, 13, 967–983,, 2016. 

Neyman, J. and Scott, E. L.: Statistical Approach to Problems of Cosmology, J. Roy. Stat. Soc. Ser. B, 20, 1–29,, 1958. 

Nieber, J. L. and Sidle, R. C.: How do disconnected macropores in sloping soils facilitate preferential flow?, Hydrol. Process., 24, 1582–1594,, 2010. 

Olivares, L. and Picarelli, L.: Shallow flowslides triggered by intense rainfalls on natural slopes covered by loose unsaturated pyroclastic soils, Geotechnique, 283–287,, 2003. 

Pagano, L., Picarelli, L., Rianna, G., and Urciuoli, G.: A simple numerical procedure for timely prediction of precipitation-induced landslides in unsaturated pyroclastic soils, Landslides, 7, 273–289,, 2010. 

Pan, S., Pan, N., Tian, H., Friedlingstein, P., Sitch, S., Shi, H., Arora, V. K., Haverd, V., Jain, A. K., Kato, E., Lienert, S., Lombardozzi, D., Nabel, J. E. M. S., Ottlé, C., Poulter, B., Zaehle, S., and Running, S. W.: Evaluation of global terrestrial evapotranspiration using state-of-the-art approaches in remote sensing, machine learning and land surface modeling, Hydrol. Earth Syst. Sci., 24, 1485–1509,, 2020. 

Paulik, C., Dorigo, W., Wagner, W., and Kidd, R.: Validation of the ASCAT Soil Water Index using in situ data from the International Soil Moisture Network, Int. J. Appl. Earth Obs., 30, 1–8,, 2014. 

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M. Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 12, 2825–2830, 2011. 

Peres, D. J. and Cancelliere, A.: Derivation and evaluation of landslide-triggering thresholds by a Monte Carlo approach, Hydrol. Earth Syst. Sci., 18, 4913–4931,, 2014. 

Peres, D. J., Cancelliere, A., Greco, R., and Bogaard, T. A.: Influence of uncertain identification of triggering rainfall on the assessment of landslide early warning thresholds, Nat. Hazards Earth Syst. Sci., 18, 633–646,, 2018. 

Perrin, J., Jeannin, P. Y., and Zwahlen, F.: Epikarst storage in a karst aquifer: A conceptual model based on isotopic data, Milandre test site, Switzerland. J. Hydrol., 279, 106–124,, 2003. 

Pirone, M., Papa, R., Nicotera, M. V., and Urciuoli, G.: Soil water balance in an unsaturated pyroclastic slope for evaluation of soil hydraulic behaviour and boundary conditions, J. Hydrol., 63–83,, 2015. 

Ponce, V. M. and Hawkins, R. H.: Runoff Curve Number: Has It Reached Maturity?, J. Hydrol. Eng., 1, 11–19,, 1996. 

Revellino, P., Guerriero, L., Gerardo, G., Hungr, O., Fiorillo, F., Esposito, L., and Guadagno, F. M.: Initiation and propagation of the 2005 debris avalanche at Nocera Inferiore (Southern Italy), Ital. J. Geosci., 366–379,, 2013. 

Reichenbach, P., Cardinali, M., De Vita, P., and Guzzetti, F.: Regional hydrological thresholds for landslides and floods in the Tiber River Basin (central Italy), Environ. Geol., 35, 146–159,, 1998. 

Richards, L. A.: Capillary conduction of liquids through porous mediums, J. Appl. Phys., 318–333,, 1931. 

Rodriguez-Iturbe, I., Febres De Power, B., and Valdes, J. B.: Rectangular pulses point process models for rainfall: analysis of empirical data, J. Geophys. Res., 92, 9645–9656,, 1987. 

Rolandi, G., Bellucci, F., Heizler, M. T., Belkin, H. E., and De Vivo, B.: Tectonic controls on the genesis of ignimbrites from the Campanian Volcanic Zone, southern Italy, Miner. Petrol., 79, 3–31,, 2003.  

Roman Quintero, D. C., Marino, P., Santonastaso, G. F., and Greco, R.: Hydrological controls of slope response to precipitation – Code and Data (Hydrology), Zenodo [code and data set],, 2023. 

Rousseeuw, P. J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math., 20, 53–65,, 1987. 

Saltelli, A.: Making best use of model evaluations to compute sensitivity indices, Comput. Phys. Commun., 145, 280–297,, 2002. 

Segoni, S., Piciullo, L., and Gariano, S. L.: A review of the recent literature on rainfall thresholds for landslide occurrence, Landslides, 15, 1483–1501,, 2018. 

Shuttleworth, W. J.: Evaporation, in: Handbook of Hydrology, edited by: Maidment, D. R., McGraw-Hill, New York, NY, USA, ISBN: 9780070397323, 1993. 

Sobol, I. M.: Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Math. Comput. Simulat., 55, 271–280,, 2001. 

Stone, M.: Cross-validatory choice and assessment of statistical predictions, J. Royal Stat. Soc. Ser. B, 36, 111–147, 1974. 

Tromp-Van Meerveld, H. J. and McDonnell, J. J.: Threshold relations in subsurface stormflow: 1. A 147-storm analysis of the Panola hillslope, Water Resour. Res., 42, 2410,, 2006a. 

Tromp-Van Meerveld, H. J. and McDonnell, J. J.: Threshold relations in subsurface stormflow: 2. The fill and spill hypothesis, Water Resour. Res., W02411,, 2006b. 

Tufano, R., Formetta, G., Calcaterra, D., and De Vita, P.: Hydrological control of soil thickness spatial variability on the initiation of rainfall-induced shallow landslides using a three-dimensional model, Landslides, 18, 3367–3380,, 2021. 

Twarakavi, N. K. C., Sakai, M., and Šimůnek, J.: An objective analysis of the dynamic nature of field capacity, Water Resour. Res., 45, W10410,, 2009. 

van Genuchten, M. Th.: A Closed-form Equation for Predicting the Hydraulic Conductivity of Unsaturated Soils1, Soil Sci. Soc. Am. J., 44, 892,, 1980. 

Wicki, A., Lehmann, P., Hauck, C., Seneviratne, S. I., Waldner, P., and Stähli, M.: Assessing the potential of soil moisture measurements for regional landslide early warning, Landslides, 17, 1881–1896,, 2020. 

Williams, P. W.: The role of the epikarst in karst and cave hydrogeology: a review, Int. J. Speleol., 37, 1–10,, 2008. 

Short summary
This study shows a methodological approach using machine learning techniques to disentangle the relationships among the variables in a synthetic dataset to identify suitable variables that control the hydrologic response of the slopes. It has been found that not only is the rainfall responsible for the water accumulation in the slope; the ground conditions (soil water content and aquifer water level) also indicate the activation of natural slope drainage mechanisms.