Rainfall-induced shallow landslides can lead to loss of life and significant damage to private and public properties, transportation systems, etc. Predicting locations that might be susceptible to shallow landslides is a complex task and involves many disciplines: hydrology, geotechnical science, geology, hydrogeology, geomorphology, and statistics. Two main approaches are commonly used: statistical or physically based models. Reliable model applications involve automatic parameter calibration, objective quantification of the quality of susceptibility maps, and model sensitivity analyses. This paper presents a methodology to systemically and objectively calibrate, verify, and compare different models and model performance indicators in order to identify and select the models whose behavior is the most reliable for particular case studies.

The procedure was implemented in a package of models for landslide susceptibility analysis and integrated in the NewAge-JGrass hydrological model. The package includes three simplified physically based models for landslide susceptibility analysis (M1, M2, and M3) and a component for model verification. It computes eight goodness-of-fit indices by comparing pixel-by-pixel model results and measurement data. The integration of the package in NewAge-JGrass uses other components, such as geographic information system tools, to manage input–output processes, and automatic calibration algorithms to estimate model parameters.

The system was applied for a case study in Calabria (Italy) along the Salerno–Reggio Calabria highway, between Cosenza and Altilia. The area is extensively subject to rainfall-induced shallow landslides mainly because of its complex geology and climatology. The analysis was carried out considering all the combinations of the eight optimized indices and the three models. Parameter calibration, verification, and model performance assessment were performed by a comparison with a detailed landslide inventory map for the area. The results showed that the index distance to perfect classification in the receiver operating characteristic plane (D2PC) coupled with the model M3 is the best modeling solution for our test case.

Landslides are one of the most dangerous geohazards worldwide and constitute a serious menace for public safety leading to human and economic losses (Park, 2011). Geoenvironmental factors such as geology, land use, vegetation, climate, and increasing population sizes may increase the occurrence of landslides (Sidle and Ochiai, 2006). Landslide susceptibility assessments, i.e., the likelihood of a landslide occurring in an area on the basis of local terrain conditions (Brabb, 1984), are not only crucial for an accurate landslide hazard quantification but also a fundamental tool for the environmental preservation and responsible urban planning (Cascini et al., 2005).

Many methods for landslide susceptibility mapping have been developed and can be grouped in two main branches: qualitative and quantitative methods (Glade and Crozier, 2005; Corominas et al., 2014 and references therein).

Qualitative methods, based on field campaigns and expert knowledge and experience, are subjective but necessary to validate quantitative method results. Quantitative methods include statistical and physically based methods. Statistical methods (e.g., Naranjo et al., 1994; Chung et al., 1995; Guzzetti et al., 1999; Catani et al., 2005) use different approaches such as bivariate statistics, multivariate analysis, discriminant analysis, and the random forests approach to link instability factors (such as geology, soil, slope, curvature, and aspect) with past and present landslides. Bivariate statistical methods ignore the interdependence of instability factors, whereas multivariate analysis is able to statistically consider their interactions. Other data-driven methods for landslide susceptibility analysis include the use of neural networks (Pradhan, 2011; Conforti et al., 2014), support vector machines (Pradhan, 2013 and citations therein), and Bayesian networks (Lee et al., 2002). Deterministic models (e.g., Montgomery and Dietrich, 1994; Lu and Godt, 2008, 2013; Borga et al., 2002; Simoni et al., 2008; Capparelli and Versace, 2011) synthesize the interaction between hydrology, geomorphology, and soil mechanics in order to physically understand and predict the location and timing that trigger landslides. These models generally include a hydrological and a slope stability component. The hydrological component simulates infiltration and groundwater flow processes with different degrees of simplification, from steady state (e.g., Montgomery and Dietrich, 1994) to transient analyses (Simoni et al., 2008). The soil stability component simulates the slope safety factor (FS) defined as the ratio of stabilizing to destabilizing forces. One of the main advantages of data-driven methods for landslide susceptibility is that they can be easily applied in wide areas while deterministic models are in general applied in local analyses. The latter are more computationally expensive and require detailed input data and parameters, which often involve high uncertainty. On the other hand, data-driven methods assume that landslides are caused by the same combination of instability factors over the entire study area, whereas deterministic models enable different triggering mechanisms to be understood and investigated.

The results of a landslide susceptibility analysis strongly depend on the model hypothesis, parameter values, and parameter estimation method. Questions regarding the performance evaluation of the landslide susceptibility model, the choice of the most accurate model, and the selection of the best-performing method for parameter estimation are still open. Thus, a procedure is needed that facilitates reproducible comparisons between different models and evaluation criteria aimed at the selection of the most accurate models.

Much effort has been devoted to the crucial problem of evaluating landslide susceptibility model performance (e.g., Dietrich et al., 2001; Frattini et al., 2010; Guzzetti et al., 2006). Accurate discussions about the most common quantitative measures of goodness of fit (GOF) between measured and modeled data are discussed in Bennet et al. (2013), Jolliffe and Stephenson (2012), Beguería (2006), and Brenning (2005 and references therein). We have summarized them in Appendix A. Usually one of these indices is selected and used as an objective function (OF) in combination with a calibration algorithm in order to obtain the optimal set of model parameters. However, in most cases, the selection of the OF is not justified or compared with other options.

The wrong classifications in landslide susceptibility analysis not only risk a loss of life but also have economic consequences. For example, locations classified as stable increase their economical value because no construction restrictions will be applied, while the reverse is true for locations classified as unstable.

In this work, we propose an objective methodology for environmental model analysis which selects the best-performing model based on a quantitative comparison and assessment of model prediction skills. In this paper, the methodology is applied to assess the performance of simplified landslide susceptibility models. As the procedure is model independent, it can be used to assess the ability of any type of environmental model to simulate natural phenomena.

Unlike previous applications, our methodology aims to objectively

select a set of the most appropriate OFs in order to determine the best model parameters;

compare the performance of a model using the parameter sets selected in the previous step in order to identify the OFs that provide particular and not redundant information; and

perform a model parameter sensitivity analysis in order to understand the relative importance of each parameter and its influence on the model performance.

identify the most appropriate OFs for estimating the model parameters and

compare different models in order to select the best one that estimates the landslide susceptibility of the study area.

Integration of the landslide susceptibility analysis system in NewAge-JGrass hydrological model.

The procedure is implemented in the open-source and GIS-based hydrological model, denoted as NewAge-JGrass (Formetta et al., 2014a) which uses the Object Modeling System (OMS, David et al., 2013) modeling framework. OMS is a Java-based modeling framework which promotes the idea of programming by components. It provides the model developers with many features such as multithreading, implicit parallelism, model interconnection, and a GIS-based system.

The NewAge-JGrass system, Fig. 1, contains models, automatic calibration
algorithms for model parameter estimation, and methods for estimating the
goodness of the model prediction. The open-source GIS uDig
(

The methodology presented in this paper for landslide susceptibility analysis (LSA) represents one model configuration within the more general NewAge-JGrass system. It includes two new models specifically developed for this paper: mathematical components for landslide susceptibility mapping and procedures for landslide susceptibility model verification and selection. The LSA configuration also uses two models that have already been implemented in NewAge-JGrass: the geomorphological model setup and the automatic calibration algorithms for model parameter estimation. All the models used in the LSA configuration are presented in Fig. 1, encircled with a dashed red line.

The methodology is presented in Sect. 2. It was set up considering three different landslide susceptibility models, eight GOF metrics, and one automatic calibration algorithm. The flexibility of the system enables more models and GOF metrics to be added, and different calibration algorithms can be used. Thus, different LSA configurations can be created depending on the landslide susceptibility model, the calibration algorithm, and the GOFs selected by the user. Finally, Sect. 3 presents a case study of landslide susceptibility mapping along the A3 Salerno–Reggio Calabria highway in Calabria, which illustrates the capability of the system.

The LSA is implemented in the context of
NewAge-JGrass (Formetta et al., 2014a), an open-source, large-scale
hydrological modeling system. It models the whole hydrological cycle: water
balance, energy balance, snow melting, etc. (Fig. 1). The system implements
hydrological models, automatic calibration algorithms for model parameter
optimization and evaluation, and a GIS for input–output visualization
(Formetta et al., 2011, 2014a). NewAge-JGrass is a component-based model: each
hydrological process is described by a model (energy balance,
evapotranspiration, runoff production in Fig. 1). Each model implements one
or more components (considering, for example, the model evapotranspiration in
Fig. 1, the user can select between three different components:
Penman–Monteith, Priestly–Taylor, and Fao). In addition, each component can be
linked to the others and executed at runtime, building a model
configuration. Figure 1 offers a complete picture of the system and the
integration of the new LSA configuration encircled with dashed red lines.
More precisely, the LSA in the current configuration includes two new models:
a landslide susceptibility model and a verification and selection model. The
first includes three components proposed in Montgomery and Dietrich (1994),
Park et al. (2013), and Rosso et al. (2006); the latter includes the three-step
verification procedure (3SVP), presented in Sect. 2. The LSA
configuration also includes another two models previously implemented in the
NewAge-JGrass system:

the Horton Machine for geomorphological model setup, which computes input maps such as slope and total contributing area and displays the model's results, and

the particle swarm for automatic calibration.

The landslide susceptibility models implemented in NewAge-JGrass and
presented in a preliminary application in Formetta et al. (2014b) consist of
the Montgomery and Dietrich (1994) model (M1), the Park et al. (2013) model
(M2), and the Rosso et al. (2006) model (M3). The three models are derived from
simplifications of the infinite slope equation (Grahm, 1984; Rosso et al.,
2006; Formetta et al., 2014) for the factor of safety:

Indices of goodness of fit for comparison between actual and predicted landslide.

The model M1 assumes a hydrological steady state, with flow occurring in the
direction parallel to the slope, and neglects cohesion, degree of soil
saturation, and void ratio. It computes

Unlike M1, the model M2 considers (i) the effect of the degree of soil
saturation (

The component M3 considers both the effects of rainfall intensity and
duration on the landslide triggering process. The term

These models are suitable for shallow translational landslides controlled by
groundwater flow convergence. Shallow landslides usually have a very low
ratio between the maximum depth (

Each component has a user interface which specifies the input and output. Model inputs are computed in the GIS uDig integrated in the NewAge-JGrass system by using the Horton Machine package for terrain analysis (Abera et al., 2014). Model output maps are directly imported in the GIS and are available for the user's visualization.

The models that we implemented present an increasing degree of complexity in terms of the theoretical assumptions for modeling landslide susceptibility. Moving from M1 to M2, the soil cohesion and soil properties were considered, and moving from M2 to M3, rainfall of finite duration was used.

In order to assess the models' performance, we developed a model that computes the most common indices for assessing the quality of a landslide susceptibility map.

These indices are based on a pixel-by-pixel comparison between the observed
landslide (OL) and predicted landslide (PL) maps. They are binary maps with
positive pixels corresponding to “unstable” ones, and negative pixels that
correspond to “stable” ones. Therefore, four types of outcomes are
possible for each cell. A pixel is a true positive (tp) if it is mapped as
“unstable” both in OLs and in PLs, which is a correct alarm with well-predicted
landslides. A pixel is a true negative (tn) if it is mapped as
“stable” both in OLs and in PLs, which corresponds to a well-predicted, stable
area. A pixel is a false positive (fp) if it is mapped as “unstable” in
PLs, but is “stable” in OLs; that is a false alarm. A pixel is a
false negative (fn) if it is mapped as “stable” in PLs, but is “unstable”
in OLs; that is a missed alarm. The concept of the receiver operator
characteristic (ROC; Goodenough et al., 1974) graph is based on the values
assumed by tp, fp, and tn. ROCs are used to assess the performance of models
which provides results assigned to one of two classes. The ROC graph is
widely used in many scientific fields, such as medicine (Goodenough et al.,
1974), biometrics (Pepe, 2003), and machine learning (Provost and Fawcett,
2001). The ROC graph is a Cartesian plane with the FPR on the

Eight GOF indices for the quantification of model performance were implemented in the system. Table 1 shows their definition, range, and optimal values. A more comprehensive description of the indices is provided in Appendix A.

Automatic calibration algorithms implemented in NewAge-JGrass as OMS components can be used in order to tune the model parameters to reproduce the actual landslides. This is possible because each model is an OMS component and can be linked to the calibration algorithms as it is, without rewriting or modifying its code. Three calibration algorithms are embedded in the system core: Luca (Hay et al., 2006), a step-wise algorithm based on shuffled complex evolution (Duan et al., 1992); particle swarm optimization (PSO), a genetic model presented in Kennedy and Eberhart (1995); and DREAM (Vrugt et al., 2008), an acronym for differential evolution adaptive metropolis. In the actual configuration, we used a PSO algorithm to estimate optimal values of the model parameters.

During the calibration procedure, the selected algorithm compares the model output in terms of a binary map (stable or unstable pixel) with the actual landslide, thus optimizing a selected objective function (OF). The model parameter set for which the OF assumes its best value is the optimization procedure output. The eight GOF indices presented in Table 1 were used in turn as OFs and, consequently, eight optimal parameters sets were provided as the calibration output (one for each optimized OF). This means that a GOF index selected in Table 1 becomes an OF when it is used as an objective function of the automatic calibration algorithm.

In order to quantitatively analyze the model performance, we implemented a three-step verification procedure (3SVP). Firstly, we evaluated the performance of each OF index for each model. We presented the results in the ROC plane in order to assess what the OF index(es) was (were), and whose optimization provided the best model performance. Secondly, we verified whether each OF metric had its own information content or whether it provided information analogous to other metrics (and thus not essential).

Lastly, for each model, the sensitivity of each optimal parameter set was tested by perturbing optimal parameters and by evaluating their effects on the GOF.

Test site.

The test site was located in Calabria, Italy, along the Salerno–Reggio
Calabria highway between Cosenza and Altilia municipalities, in the southern
part of the Crati basin (Fig. 2). The mean annual precipitation is about
1200 mm, distributed over approximately 100 rainy days, with a mean annual
temperature of 16

In the study area, the topographic elevation has an average value of around
450 m a.s.l., with a maximum value of 730 m a.s.l. Slopes, computed from
the 10 m resolution digital elevation model, range from 0 to
55

The Crati Basin is a Pleistocene–Holocene extensional basin filled with clastic marine and fluvial deposits (Vezzani, 1968; Colella et al., 1987; Fabbricatore et al., 2014). The stratigraphic succession of the Crati Basin can be simply divided into two sedimentary units as suggested by Lanzafame and Tortorici (1984). The first unit is a Lower Pliocene succession of conglomerates and sandstones passing upward into a silty clay (Lanzafame and Tortorici, 1986) second unit. This is a series of clayey deposits grading upward into sandstones and conglomerates which refer to Emilian and Sicilian, respectively (Lanzafame and Tortorici, 1986), as also suggested by data provided by Young and Colella (1988).

In the study area, the second unit outcrops. A topsoil of about 1.5–2.0 m lies on sandy–gravelly and sandy deposits, which are generally well-stratified. Soils range from Alfisols (i.e., highly mature soils) to Inceptisols and Entisols (i.e., poorly developed soils). Due to the combination of such climatic, geostructural, and geomorphological features, the test site is one of the most landslide-prone areas in Calabria (Conforti et al., 2014; Carrara and Merenda, 1976; Iovine et al., 2006).

Mass movements were analyzed from 2006 to 2013 by integrating aerial photography interpretation acquired in 2006, 1 : 5000 scale topographic map analysis, and an extensive field survey.

All the data were digitized and stored in a GIS database (Conforti et al., 2014) and the result was the map of occurred landslides, presented in Fig. 2d. Digital elevation model, slope, and total contributing area (TCA) maps are presented in Fig. 2a, b, and c, respectively. In order to perform model calibration and verification, the data set of occurred landslides was divided in two parts one used for calibration (located at bottom of Fig. 2d) and one for validation (located in the upper part of Fig. 2d). The landslide inventory map refers only to the initiation area of the landslides. This leads to a fair comparison with the landslide models that provide only the triggering point and does not include a run-out model for landslide propagation.

The LSA presented in the paper was applied to the Salerno–Reggio Calabria highway, between Cosenza and Altilia (southern Italy). Section 3.1 describes the model parameter calibration and the model verification procedure, Sect. 3.2 presents the model performance correlation assessment, Sect. 3.3 presents the robustness analysis of the GOF indices used, and lastly, Sect. 3.4 presents the computation of the susceptibility map.

The three models presented in Sect. 2 were used to predict the landslide susceptibility for the study area. Model parameters were optimized using each GOF index presented in Table 1 in order to fit landslides of the calibration group. Table 2 presents the list of parameters that will be optimized, specifying their initial range of variation, and the parameters kept constant during the simulation and their value.

Optimized models' parameter values.

Models' performance results in the ROC plane for M1, M2, and M3. Only GOF indices whose optimization provides FPR < 0.4 and TPR > 0.7 were reported.

Optimal parameter sets output of the optimization procedure of each GOF index in turn. Results are presented for each model (M1, M2, and M3).

Results in term of true-positive rate (TPR) and false-positive rate (FPR), for each model (M1, M2, and M3), for each optimized GOF index and for both calibration (CAL) and verification (VAL) data set.

The rows for which the condition FPR < 0.4 and TPR > 0.7 is verified are shown in bold.

Correlation plot between models' performance (MP) vectors computed by optimizing all GOF indices in turn. Results are reported for each model: M1, M2, and M3.

The component PSO provides eight best parameter sets, one for each optimized
GOF indices. Values for each model (M1, M2, and M3) are presented in Table 3.
Optimal parameter sets differ slightly among the models and among the
optimized GOF indices for a given model. In addition, a compensation effect
between the parameter values is evident. High values of friction angle are
related to low cohesion values; high values of critical rainfall are related
to high values of soil resistance parameters. For the model M1, the
transmissivity value (74 m

Executing the models using the set of eight optimal parameters, true positive rates and false positive rates are computed by comparing the model output and actual landslides for both the calibration and verification data sets. The results are presented in Table 4 for all three models (M1, M2, and M3). These points were reported in the ROC plane to visualize the effects of the optimized objective function on model performance in a unique graph. This procedure was repeated for the three models. ROC planes, considering all the GOF indices and all three models, are included in Appendix B both for the calibration and verification period. For models M2 and M3, it is clear that ACC, HSS, and CSI performed the worst. This is also true for model M1, although, unlike M2 and M3, there is no clear separation between the performance provided by ACC, HSS, and CSI and the remaining indices.

Among the results provided in Table 4, we focused on the GOF indices whose optimization satisfies the following condition: FPR < 0.4 and TPR > 0.7. This choice was made in order to focus comments on the results exclusively for the GOF indices which provide acceptable model results and in order to heighten the readability of graphs.

Figure 3 presents three ROC planes, one for each model, with the optimized
GOF indices that provide FPR < 0.4 and TPR > 0.7. The
results presented in Fig. 3 and Table 4 show that

the optimization of AI, D2PC, SI, and TSS achieves the best model performance in the ROC plane, which is verified for all three models;

performance increase as model
complexity increases: moving from M1 to M3, points in the ROC plane approach
the perfect point (TPR

by increasing the model complexity, good model results are achieved, not only in the calibration but also in the validation data set. In fact, moving from M1 to M2 soil cohesion and soil properties were considered, and moving from M2 to M3 rainfall of a finite duration was used.

Finally, it is important to consider the limitations of the models used for the current applications. Models M1 and M2 are not able to mimic the transient nature of precipitation and infiltration processes, and only M3 is able to account for the combined effect of storm duration and intensity in the triggering mechanism. In addition, in this study we neglected effects such as spatial rainfall variability, roads, and other engineering works.

The second step in the procedure is to verify the information content of each optimized OF, checking whether it is the same as other metrics or it is a particular feature of the optimized OF.

Executing a model using one of the eight parameter sets (assuming, for
example, the one obtained by optimizing CSI) enables all the remaining GOF
indices to be computed, which we indicate as CSI

All indices present a positive correlation with each other, irrespectively of
the model used. In addition, strong correlations between the

In this step, we focused on models M2 and M3 and performed a parameter
sensitivity analysis. Let us consider model M2 and the optimal parameter set
computed by optimizing the critical success index (CSI). Also, considering
the cohesion model parameter, the procedure evolves according to the
following steps:

The starting parameter values are the optimal values derived from the optimization of the CSI index.

All the parameters except the analyzed parameter (cohesion) were kept constant and equal to the optimal parameter set.

A total of 1000 random values of the analyzed parameter (cohesion) were selected from a uniform distribution with the lower and upper bound defined in Table 1. With this procedure, 1000 model parameter sets were defined and used to execute the model.

A total of 1000 values of the selected GOF index (CSI), computed by comparing model
outputs with the measured data, were used to compute a boxplot of the
parameter

Model M2 parameters sensitivity analysis.

Model M3 parameters sensitivity analysis.

Each column in the figures represents one optimized index and has a number of boxplots equal to the number of model parameters (five for M2 and six for M3). Each boxplot represents the range of variation of the optimized index due to a particular change in the model parameters. The narrower the boxplot for a given optimized index, the less sensitive the model is to that parameter. For both M2 and M3, the parameter set obtained by optimizing AI and SI shows the least sensitive behavior for almost all the parameters. In this case, a model parameter perturbation has little impact on the model's performance. However, the models with parameters obtained by optimizing ACC, TSS, and D2PC are the most sensitive to the parameter variations and this is reflected in much more evident changes in model performance. Finally, it is important to consider that the methodology used for evaluating the parameter sensitivity is based on changing the parameters one at a time. Although this procedure facilitates an intercomparison of the results (because the parameter sensitivity is computed with reference to the optimal parameter set), it is does not take into account simultaneous variations or interactions between parameters.

The selection of the most appropriate model for computing landslide susceptibility maps is based on what we learn from the previous steps. In the first step, we learn that (i) the optimization of AI, D2PC, SI, and TSS outperforms the remaining indices and (ii) models M2 and M3 provide more accurate results than M1. The second step suggests that overall the model results obtained by optimizing AI, D2PC, SI, and TSS are similar each other. Lastly, the third step shows that the model performance derived from the optimization of AI and SI is less sensitive to input variations than D2PC and TSS. This could be due to the formulation of AI and SI which gives much more weight to the true negative compared to D2PC and TSS.

Landslide susceptibility maps using model M3 and the parameter set obtained by optimizing D2PC.

For our application, the model M3 with parameters obtained by optimizing D2PC was the most sensitive to the parameter variation, avoiding an “insensitive” or flat response by changing the parameter values. A more sensitive coupled model optimal parameter set will in fact accommodate any parameters, input data, or measured data variations responding to these changes with a variation in model performance.

We thus used the combination of model M3 with parameters obtained by
optimizing D2PC in order to compute the final susceptibility maps in Fig. 7.
Categories of landslide susceptibility from Class 1 to 5 are assigned from
low to high according to FS values (e.g., Huang et al., 2007): Class 1
(FS

We have presented a procedure to quantitatively calibrate, evaluate, and
compare the performance of environmental models. The procedure was applied
for the analysis of three landslide susceptibility models. It is made up of
three steps:

model parameter calibration, optimizing different GOF indices, and model evaluation in the ROC plane;

computation of the degree of similarities between different model performance obtained by optimizing all the considered GOF indices;

evaluation of model sensitivity to parameter variations.

The procedure was conceived as a model configuration of the hydrological
system NewAge-JGrass. It integrates

three simplified, physically based landslide susceptibility models;

a package for model evaluations based on pixel-by-pixel comparison of modeled and actual landslide maps;

model parameter calibration algorithms, and

the integration with the uDig open-source geographic information system for model input–output map management.

The procedure was applied in a test case on the Salerno–Reggio Calabria
highway and led to the following conclusions:

the OFs AI, D2PC, SI, and TSS coupled with the models M2 and M3 provided the best performance among the eights metrics used in the calibration;

the four selected OFs
provided quite similar model performance in terms of

M3 showed the best performance by optimizing the D2PC index. In fact, M3 responded to parameter variations with changes in model performance.

In our application, effective precipitation was calibrated because we were performing a landslide susceptibility analysis and it was useful for demonstrating the method. However, we are aware that for operational landslide early warning systems, rainfall constitutes a fundamental input of the predictive process. In addition, the analysis would profit from data on the rainfall that triggered the landslides; however, such data are currently not available for the study area.

We believe that our system would be useful for decision makers who deal with risk management assessments. It could be improved by adding new landslide susceptibility models or different types of model selection procedures.

Data for this paper are available from the corresponding author upon request.

CSI, Eq. (A1), is the number of correct detected landslide pixels (tp),
divided by the sum of tp, fn, and fp. CSI is also named threat score. It
ranges between 0 and 1 and its best value is 1. It penalizes both fn and fp.

ESI, Eq. (A2), contrarily to CSI, is able to take into account the true
positives associated with random chance (

SI, Eq. (A4), equally weighs the true positive rate (TPR) (Eq. A5) and specificity,
defined as 1 minus the false positive rate (FPR), Eq. (A6). SI varies between 0
and 1 and its best value is 1. SI is also named modified success rate.

D2PC is defined in Eq. (A7). It measures the distance in the plane FPR–TPR
between an ideal perfect point of coordinates (0, 1) and the point of the
tested model (FPR, TPR). D2PC ranges 0–1 and its best value is 0.

AI, Eq. (A8), is the average value between four different indices:

TPR,

precision,

the ratio between successfully predicted stable pixels (tn) and the total number of actual stable pixels (fp plus tn), and

the ratio between successfully predicted stable pixels (tn) and the number of
simulated stable cells (fn

The fundamental idea of a generic skill score measure is to quantify the
model performance with respect to a set of controls or reference models. Given a
measure of model accuracy

SS assumes positive and negative values: if the tested model is perfect,

The marginal probability of a predicted unstable pixel is (tp

The probability of a correct “yes” forecast by chance is P1

In the HSS, Eq. (A10), the control model is a model that forecasts by chance:

TSS, Eq. (A12), is the difference between the hit rate and the false alarm
rate. It is also named Hanssen–Kuiper's skill score and Pierce's skill
score. It ranges between

TSS measures the ability of the model to distinguish between landslide and
non-landslide pixels. If the number of tn is large, the false alarm value is
relatively overwhelmed. If tn is large, as in landslide maps, FPR
tends to zero and TSS tends to TPR. A problem of TSS is that it treats the
hit rate and the false alarm rate equally, irrespective of their likely
differing consequences.

This research was funded by the PON project no. 01_01503 “Integrated Systems for Hydrogeological Risk Monitoring, Early Warning and Mitigation Along the Main Lifelines”, CUP B31H11000370005, within the framework of the National Operational Program for “Research and Competitiveness” 2007–2013. The authors would like to acknowledge the editor and the three reviewers (M. Mergili and three unknown reviewers) for providing insightful comments and improving the quality of the paper.Edited by: M. Mikos Reviewed by: M. Mergili and three anonymous referees