Soil texture and soil particle size fractions (PSFs) play an increasing role in physical, chemical, and hydrological processes. Many previous studies have used machine-learning and log-ratio transformation methods for soil texture classification and soil PSF interpolation to improve the prediction accuracy. However, few reports have systematically compared their performance with respect to both classification and interpolation. Here, five machine-learning models – K-nearest neighbour (KNN), multilayer perceptron neural network (MLP), random forest (RF), support vector machines (SVM), and extreme gradient boosting (XGB) – combined with the original data and three log-ratio transformation methods – additive log ratio (ALR), centred log ratio (CLR), and isometric log ratio (ILR) – were applied to evaluate soil texture and PSFs using both raw and log-ratio-transformed data from 640 soil samples in the Heihe River basin (HRB) in China. The results demonstrated that the log-ratio transformations decreased the skewness of soil PSF data. For soil texture classification, RF and XGB showed better performance with a higher overall accuracy and kappa coefficient. They were also recommended to evaluate the classification capacity of imbalanced data according to the area under the precision–recall curve (AUPRC). For soil PSF interpolation, RF delivered the best performance among five machine-learning models with the lowest root-mean-square error (RMSE; sand had an RMSE of 15.09 %, silt was 13.86 %, and clay was 6.31 %), mean absolute error (MAE; sand had an MAE of 10.65 %, silt was 9.99 %, and clay was 5.00 %), Aitchison distance (AD; 0.84), and standardized residual sum of squares (STRESS; 0.61), and the highest Spearman rank correlation coefficient (RCC; sand was 0.69, silt was 0.67, and clay was 0.69). STRESS was improved by using log-ratio methods, especially for CLR and ILR.
Prediction maps from both direct and indirect classification were similar in the middle and upper reaches of the HRB. However, indirect classification maps using log-ratio-transformed data provided more detailed information in the lower reaches of the HRB. There was a pronounced improvement of 21.3 % in the kappa coefficient when using indirect methods for soil texture classification compared with direct methods. RF was recommended as the best strategy among the five machine-learning models, based on the accuracy evaluation of the soil PSF interpolation and soil texture classification, and ILR was recommended for component-wise machine-learning models without multivariate treatment, considering the constrained nature of compositional data. In addition, XGB was preferred over other models when the trade-off between the accuracy and runtime was considered. Our findings provide a reference for future works with respect to the spatial prediction of soil PSFs and texture using machine-learning models with skewed distributions of soil PSF data over a large area.
Soil texture, classified by ranges of soil particle size fractions (PSFs), is one of the most important attributes affecting the soil properties and the physical, chemical, and hydrological processes covering soil porosity, soil fertility, water retention, infiltration, drainage, aeration, and so on. Soil texture distribution can be used for soil fertility management (Pahlavan-Rad and Akbarimoghaddam, 2018; Bationo et al., 2007), water management (Thompson et al., 2012), and ecosystem service provision (Adhikari and Hartemink, 2016). The soil PSFs – sand, silt, and clay – are vital in most hydrological, ecological, and environmental risk assessment models (Liess et al., 2012). The spatial distributions of soil texture and soil PSFs affect runoff generation, slope stability, soluble salt content, and the estimation of the evaporative fraction (McNamara et al., 2005; Follain et al., 2006; Yoo et al., 2006; Gochis et al., 2010; Crouvi et al., 2013; Xu et al., 2019).
The ancillary data should be considered in the prediction, especially over a large study area, to enhance the interpolation performance (Wang and Shi, 2017). Machine-learning models, such as boosting regression trees (Jafari et al., 2014; Yang et al., 2016), random forests (RF; Hengl et al., 2015; Zeraatpisheh et al., 2017), and artificial neural networks (Bagheri Bodaghabadi et al., 2015; Taalab et al., 2015), have been commonly employed in both interpolation and classification combined with environmental covariates for soil properties. Machine-learning models such as RF and gradient boosting have shown better performance than statistical linear models (e.g. multiple linear regression) in the prediction of soil properties, because they are robust to noise and have a low bias when dealing with large data sets (Hengl et al., 2015, 2017). Among machine-learning models, artificial neural networks and “tree learners” (e.g. decision trees) have been preferred due to their relatively high overall accuracy and kappa coefficients, the interpretability of the results, and the speed of the parameterization in the prediction of soil classes (Taghizadeh-Mehrjardi et al., 2015; Heung et al., 2016). Most previous studies have used machine-learning algorithms to simulate soil category or continuous properties for classification or regression problems. However, few studies have systematically analysed both soil texture classification and soil PSF interpolation using different machine-learning models.
The soil PSFs, which can be classified as soil texture, are not only continuous variables but are also compositional data – thus, the constant sum (1 or 100 %) should be guaranteed. Soil PSF data are typical compositional data with three components that are not independent of each other but are rather expressed as a percentage (Filzmoser et al., 2009). Because of the spurious correlations between components, different results occur on different measurement scales (Abdi et al., 2015; Reimann and Filzmoser, 2000). Indicators and statistical methods based on Euclidean distances can reveal misleading or biased results (Butler, 1979). Numerous different interpretations of compositional data have been suggested in soil science (Gobin et al., 2001; Salazar et al., 2015; Tolosana-Delgado et al., 2019; Hengl et al., 2018), and the most extensively used method has been a combination of log-ratio transformation methods, including the additive log ratio (ALR; Aitchison, 1982), the centred log ratio (CLR; Aitchison, 1982), and the isometric log ratio (ILR; Egozcue et al., 2003). Soil PSFs have been predicted using multiple linear regression (Huang et al., 2014) and kriging (Wang and Shi, 2018; Zhang et al., 2013) combined with log-ratio transformation methods. Moreover, multivariate treatment of soil PSFs can be realized using the probability density functions of soil particle size curves (PSCs), as non-negative values integrating to 1 (or 100 %) can be considered as compositional data with infinitesimal parts (so-called functional compositions) (Menafoglio et al., 2014). Functional compositions are beneficial for acquiring complete and continuous information rather than discrete information, and soil texture and soil PSFs can be extracted from the stochastic simulation of soil PSCs (Menafoglio et al., 2016a), which can be jointly applied to the fractions to fully exploit the richness of information. Menafoglio et al. (2016b) applied such functional–compositional data for the stochastic simulation of PSCs based on a geostatistical Monte Carlo and Bayes space approach combined with a CLR transformation method in heterogeneous aquifer systems in hydrogeology, demonstrating a remarkable improvement of the characterization of the spatial variability and uncertainty compared with traditional methods. However, most soil PSF data used in studies are discrete (i.e. sand, silt, and clay), and few studies have conducted a systematic comparison of the accuracy, strengths, and weaknesses of different machine-learning models using original data and different log-ratio-transformed data.
Soil texture classification can be predicted by machine-learning models directly, and it can also be derived indirectly from soil PSFs. For the direct soil texture classification, tree-based models such as RF and classification tree (CT) performed better than multinomial logistic regression, support vector machines (SVM), and artificial neural networks (ANNs; Camera et al., 2017; Wu et al., 2018). For the indirect classification of soil texture, Poggio and Gimona (2017) combined hybrid geostatistical generalized additive models with ALR and modelled PSFs at a 250 m resolution in Scotland. Considering the particularity of compositional data, the results of soil PSF classification and interpolation could be compared using the direct and indirect methods. Nevertheless, few studies have systematically compared the different machine-learning models for both direct and indirect soil texture classification.
In our study, five machine-learning models – K-nearest neighbour (KNN), multilayer perceptron neural network (MLP), RF, SVM, and extreme gradient boosting (XGB) – were applied for soil texture classification and soil PSF interpolation. Furthermore, the log-ratio-transformed data were also combined with these five machine-learning models for soil PSF interpolation. The objectives of this study are (i) to compare the performance of five machine-learning models for soil texture classification and soil PSF interpolation, (ii) to evaluate the performance of machine-learning models using original and different log-ratio-transformed data for soil PSF interpolation, and (iii) to estimate the performance of direct and indirect soil texture classification using these methods.
The Heihe River basin (HRB; 97
The vegetation in the upper reaches of the HRB (Fig. 1c) is influenced by hydrothermal conditions from the southeast to the northwest. The main vegetation types are alpine vegetation (4000–5000 m), the alpine meadow vegetation belt (3000–4000 m), alpine shrub meadow (3200–3800 m), the mountain forest meadow belt (2400–3200 m), the mountain grassland belt (1800–2400 m), and the desert base belt (less than 1800 m). The middle and lower reaches of the HRB have relatively few vegetation types, and shrub and steppe are mainly located in the area near the lower reaches of the Heihe River.
The main soil types (Fig. 1d) are frigid desert soils (higher than 4000 m), alpine meadow soil and alpine steppe soil (3600–4000 m), grey cinnamon soil and Chernozem (3200–3600 m), Sierozem and grey cinnamon soil (2600–3200 m), grey cinnamon soil (2300–2600 m), and Sierozem (1900–2300 m) in the upper reaches of the HRB. The main soil types in the middle reaches of the HRB are aeolian sandy soil, frigid frozen soil, and grey brown desert soil. The main soil types in the lower reaches of the HRB are aeolian sandy soil, grey–brown desert soil (northwest), and Lithosol (northeast).
The main geomorphology types in the upper reaches of the HRB are modern glaciers, alpine, hilly, and intermountain basin (Fig. 1e). Narrow plains are distributed in the middle reaches of the HRB. In the lower reaches, the main types of geomorphology are hilly (northwest), plain, sandy land, and platform (east), as well as a flood plain located in the area near the Heihe River.
The main land use types in the upper, middle, and lower reaches of the Heihe River were forest land and grassland, cultivated land, and unused land respectively (Fig. 1f). The water area and construction area were mainly distributed near the river in the middle reaches of the HRB.
A total of 640 soil sampling points was collected in the HRB from the
National Tibetan Plateau Data Center (NTPDC) in China (
The environmental covariates, such as topographic variables, remote sensing
variables, climate and position variables, soil physicochemical variables,
and categorical maps, are related to the distributions of the soil PSFs. System
for Automated Geoscientific Analysis (SAGA) GIS (Conrad et al., 2015) was
used to compute the topographic variables from the DEM, including the slope, aspect,
convergence index, general curvature, plane curvature, profile curvature, and
valley depth. Remote sensing variables, including the normalized difference
vegetation index (NDVI; Huete et al., 2002), the brightness index (BI; Metternicht and Zinck, 2003), and the soil adjusted vegetation index (SAVI; Huete, 1988), were derived from the Landsat 7 based on band operation. We
also collected climate variables such as the mean annual precipitation and
the mean annual temperature from the National Meteorological Information
Center (
K-nearest neighbour (KNN) is a simple and nonparametric classifier that is based on
using the known instance to label the unknown instance (Cover and Hart, 1967). For the
test set, K-nearest training set vectors (
Multilayer perceptron neural network (MLP), which is one of the most common multilayer feed-forward back-propagation networks (Zhang et al., 2018), was selected to train the artificial neural network (ANN) models due to the rapid operation, the small set of training requirements, and the ease of implementation (Subasi, 2007). MLP neurons can perform classification or regression depending on whether the response variable is categorical or continuous. The MLP has three sequential layers: the input layer, the hidden layer, and the output layer. The resilient back-propagation algorithm was chosen because the learning rate of this algorithm was adaptive, avoiding oscillations and accelerating the learning process (Behrens and Scholten, 2006). The range of the data set should be standardized because MLPs operate in terms of a scale from zero to one. MLP can be run using the “RSNNS” R package (Bergmeir and Benitez, 2012).
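The standardization step mentioned above can be sketched as a simple min-max rescaling onto the interval from zero to one. This is a minimal illustration only: the function name and the covariate values are hypothetical, and the exact scaling used with the "RSNNS" package in the study is not specified here.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant covariate carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical elevation values (m) standing in for one environmental covariate.
elevation_m = [1800.0, 2400.0, 3200.0, 5000.0]
scaled = min_max_scale(elevation_m)
```

In practice the minimum and maximum would be taken from the training set only and then reused for the validation set and the prediction grid, so that all inputs share one scale.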
Random forest (RF) was developed by Breiman (2001), combining the bagging method (Breiman, 1996) with random variable selection, and the principle was to merge a group of “weak learners” together to form a “strong learner”. Bootstrap sampling is used for each tree of RF, and the rules to binary split data are different for regression and classification problems. For classification, the Gini index is used to split the data; for regression, minimizing the sum of the squares of the mean deviations can be selected to train each tree model. The benefits of using RFs are that the ensembles of trees are used without pruning. In addition, RF is relatively robust to overfitting. Standardization or normalization is not necessary because it is insensitive to the range of input values. Two parameters should be adjusted for the RF model: the number of trees (ntree) and the number of features randomly sampled at each split (mtry). The RF model is available in the “randomForest” R package (Liaw and Wiener, 2002).
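Two ingredients named above, bootstrap sampling and the Gini index, can be sketched in a few lines. This is an illustrative sketch rather than the "randomForest" implementation; the function names are hypothetical, and the weighted Gini impurity shown is the textbook form of the split criterion.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]


# Hypothetical texture labels at one tree node.
node = ["SiLo", "SiLo", "SaLo", "SaLo"]
impurity_before = gini_impurity(node)
impurity_after = split_gini(["SiLo", "SiLo"], ["SaLo", "SaLo"])
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

A split that separates the two classes perfectly reduces the impurity from 0.5 to 0, which is the kind of candidate split a classification tree would prefer.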
Support vector machine (SVM), proposed by Cortes and Vapnik (1995), is a type of generalized linear classifier that is widely applied to classification and regression problems in soil science (Burges, 1998). The main principle of SVM is to classify different classes by constructing an optimal separating hyperplane in the feature space (so-called “structural risk minimization”). Regression problems can also be solved by minimization of the structural risk using loss functions (Vapnik, 1998) in SVM, which is known as support vector regression. The SVM is more effective in high dimensional spaces. A linear function was selected for SVM as the kernel function in our study. Additionally, cost and gamma are two other parameters that needed to be tuned, as these parameters control the trade-off between the classification accuracy and complexity and the ranges of the radial effect respectively. The SVM model is available in the “e1071” R package (Meyer et al., 2017).
Extreme gradient boosting, put forward by Chen and Guestrin (2016), is an efficient method of implementation for gradient boosting frames, tree learning algorithms, and efficient linear model solvers to solve both classification and regression problems (Chen et al., 2018). Like the boosted regression trees (Elith et al., 2008), it follows the principle of gradient enhancement; however, more regularized model formalization is applied to XGB to control over-fitting, making it perform better in terms of accuracy assessment. The residuals of the first tree can be fitted by the second tree to enhance the model accuracy, and the sum of the prediction of each tree generates the ultimate prediction. There are seven parameters in XGB – the learning rate (eta), the maximum depth of a tree (max_depth), the max number of boosting iterations (nrounds), the subsample ratio of columns (colsample_bytree), the subsample ratio of the training instance (subsample), the minimum loss reduction (gamma), and the minimum sum of instance weight (min_child_weight). The XGB model is available in the “xgboost” R package (Chen et al., 2018).
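The residual-fitting principle described above can be illustrated with a toy boosting loop. The sketch below is plain gradient boosting for squared error with one-split trees (stumps) on a single covariate; it omits the regularization terms, column and row subsampling, and second-order updates that distinguish XGB, and all names and data values are hypothetical. Only `eta` and `nrounds` mirror parameters named in the text.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that minimises squared error on residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # exclude the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= t else right_mean


def boost(x, y, nrounds=20, eta=0.3):
    """Each new stump is fitted to the residuals left by the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(nrounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + eta * tree(xi) for xi, pi in zip(x, pred)]
    # The final prediction is the base value plus the shrunken sum of all trees.
    return lambda xi: base + eta * sum(tree(xi) for tree in trees)


# Hypothetical covariate and sand content (%) values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0, 55.0, 57.0]
model = boost(x, y)
```

The learning rate shrinks each tree's contribution, which is why several rounds are needed before the ensemble approaches the training values.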
The equations describing the five machine-learning models can be found in the Supplement (Sect. S1). The “caret” R package (Kuhn, 2018) for MLP, SVM, and XGB; the “randomForest” R package for RF; and the “kknn” R package for KNN were used to adjust the above parameters. The set of parameters with the lowest RMSE for regression and the highest kappa coefficient for classification under cross-validation was selected as the best set. There are 11 dependent variables (i.e. “sand”, “silt”, “clay”, “ILR1”, “ILR2”, “ALR1”, “ALR2”, “CLR1”, “CLR2”, and “CLR3” for regression and “class” for classification) trained with environmental covariates (independent variables). All methods were applied to these 11 variables independently, and all of the adjusted parameters for the different models are listed in Table S1. More details about parameter optimization and independent modelling are given in Sect. S2.
For the composition of
We combined five machine-learning models with original data (ORI) and three log-ratio methods (ALR, CLR, and ILR). The five models were applied to direct soil texture classification (five models), and each model was combined with the original data and with each of the three log-ratio-transformed data sets for indirect soil texture classification (20 models) and soil PSF interpolation (20 models) (Table 1). The data were randomly divided into two sets: 448 soil samples (70 %) for training and 192 soil samples (30 %) for validation. This process was repeated 30 times.
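The repeated random split described above can be sketched as follows. The function name and the seed are hypothetical, and the sketch assumes simple random sampling without stratification by soil texture type, which the text does not specify.

```python
import random


def repeated_holdout(n_samples, train_fraction=0.7, repeats=30, seed=0):
    """Return `repeats` random (train_indices, validation_indices) pairs."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    splits = []
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# 640 samples give 448 training and 192 validation samples in every repeat.
splits = repeated_holdout(640)
```

Accuracy indicators would then be computed on the validation part of each split and averaged over the 30 repeats.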
The method system of soil texture classification and soil PSF interpolation.
We used the overall accuracy, kappa coefficients, area under the precision–recall curve (AUPRC), and abundance index to validate the performance of different models. The first two indicators were selected to evaluate the overall prediction performance of soil texture types, and the last two were applied to evaluate the performance of each soil texture type.
The overall accuracy represents all samples of all soil texture types correctly
classified by machine-learning models, divided by the total number of
samples of soil texture types used in the validation. The overall accuracy
is defined as follows (Brus et al., 2011):
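The verbal definition above translates directly into code. The sketch below also includes Cohen's kappa in its standard form, on the assumption that this is the kappa coefficient used in the study; the function names and the example labels are hypothetical.

```python
from collections import Counter


def overall_accuracy(observed, predicted):
    """Share of validation samples whose texture type is predicted correctly."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def cohen_kappa(observed, predicted):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(observed)
    p_o = overall_accuracy(observed, predicted)
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical observed and predicted texture types for six validation samples.
observed = ["SiLo", "SiLo", "SiLo", "SaLo", "Lo", "Lo"]
predicted = ["SiLo", "SiLo", "SaLo", "SaLo", "Lo", "SiLo"]
oa = overall_accuracy(observed, predicted)
kappa = cohen_kappa(observed, predicted)
```

Because kappa discounts agreement expected by chance, it is lower than the overall accuracy whenever one class dominates, which is relevant for imbalanced texture data.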
Similarly, the confusion index (COI) based on prediction probability was
calculated to evaluate the uncertainties of machine-learning models of
classification (Burrough et al., 1997). The equation was as follows:
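A minimal sketch of the confusion index is given below, assuming the commonly cited form in which the index is one minus the difference between the two largest class membership probabilities at a location. The function name and the probability vectors are hypothetical.

```python
def confusion_index(class_probabilities):
    """One minus the gap between the largest and second-largest class probability."""
    top, second = sorted(class_probabilities, reverse=True)[:2]
    return 1.0 - (top - second)


# A confident prediction yields a low index; a tie between two classes yields 1.
coi_confident = confusion_index([0.7, 0.2, 0.1])
coi_tied = confusion_index([0.5, 0.5, 0.0])
```

Values close to one indicate that the model can hardly distinguish its two most likely texture types at that location.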
The abundance index was applied to describe the proportion of all soil texture
types and well-classified soil texture types in prediction maps and was
defined as follows:
Five statistical indicators, including the Spearman rank correlation coefficient
(RCC), root-mean-square error (RMSE), mean absolute error (MAE), Aitchison
distance (AD; Aitchison, 1992), and standardized residual sum of squares
(STRESS; Martin-Fernandez et al., 2001), were used to validate the methods
of soil PSF interpolation. The equations for the validation indicators RCC,
RMSE, MAE, AD, and STRESS are as follows:
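Four of these indicators can be sketched as follows, using their standard definitions: the Aitchison distance is taken as the Euclidean distance between CLR-transformed compositions, and STRESS compares pairwise Aitchison distances among observed and among predicted compositions. How the study aggregates AD over samples is an assumption not made here, the function names and example values are hypothetical, and RCC is omitted for brevity.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error for one fraction."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error for one fraction."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def clr(composition):
    """Centred log ratio: log of each part minus the mean log (parts must be > 0)."""
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the CLR coordinates of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


def stress(observed, predicted):
    """Standardized residual sum of squares over all pairs of samples."""
    num = den = 0.0
    n = len(observed)
    for i in range(n):
        for j in range(i + 1, n):
            d_obs = aitchison_distance(observed[i], observed[j])
            d_pred = aitchison_distance(predicted[i], predicted[j])
            num += (d_obs - d_pred) ** 2
            den += d_obs ** 2
    return math.sqrt(num / den)


# Hypothetical (sand, silt, clay) percentages for three samples.
observed = [[30.0, 55.0, 15.0], [60.0, 30.0, 10.0], [10.0, 70.0, 20.0]]
predicted = [[28.0, 57.0, 15.0], [50.0, 38.0, 12.0], [15.0, 67.0, 18.0]]
ad_first = aitchison_distance(observed[0], predicted[0])
stress_value = stress(observed, predicted)
```

Unlike RMSE and MAE, which are computed fraction by fraction, AD and STRESS treat each sample as a whole composition and are unaffected by whether the parts are expressed as proportions or percentages.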
The standard deviation (SD), coefficient of variation (CV), mean value, minimum value
(Min), maximum value (Max), median absolute deviation (MAD), skewness (Skew),
kurtosis, and the Kolmogorov–Smirnov (k–s) test (
For the original data of sand content, the mean (30.64 %) was much higher
than that of the median centre (26.06 %). In contrast, silt and clay contents
were the opposite, with lower means (silt, 55.79 %; clay, 13.57 %)
than median centres (silt, 59.51 %; clay, 14.43 %). For the log-ratio-transformed data, different log-ratio methods delivered the same means for
sand, silt, and clay. Additionally, the means of sand (28.69 %) and silt
(60.54 %) were closer to the median centres of the original data, except
for clay (10.78 %). With respect to the SD and CV, soil PSF data in the log-ratio
geometry had more stability and less variability than the original
data. ILR and CLR had the lowest MAD for the first component (0.66) and the
second component (0.43) respectively (Fig. 2). Although the
Descriptive statistical analysis for the original and log-ratio-transformed data for
The overall accuracy of all models ranged from 0.613 to 0.636 (Fig. 3a). RF had the highest overall accuracy (0.636) among the five models, followed closely by KNN (0.630) and MLP (0.627). In addition, SVM (0.618) and XGB (0.613) had relatively lower accuracy than the other models. The highest kappa coefficient was generated from MLP (0.242), followed by RF (0.238), XGB (0.229), KNN (0.213), and SVM (0.213) (Fig. 3b). With respect to the confusion indices (COIs), XGB (0.278) delivered the best performance, and RF (0.501) showed the highest confusion among the models (Fig. 3c).
We combined the PRCs of the five machine-learning models to evaluate the performance of predicting each soil texture type using imbalanced data with different numbers of samples for each type (Fig. 4). The AUPRCs of the types with fewer positive examples were typically small, especially for SaClLo (only four samples), and delivered unsatisfactory results. This was because the lack of soil sampling points made models learn poorly during the training process. In contrast, the soil texture types (Lo, SaLo, SiLo, and SiClLo) with more positive examples delivered superior results to those with fewer positive examples. Moreover, these soil texture types had significant differences in AUPRCs. For example, SiLo, which had the largest number of samples, was the most effective among the nine types. For soil texture types with more samples, RF and XGB performed better. For soil texture classes with fewer samples, RF and SVM showed better performance according to the AUPRCs.
The AUPRCs for different machine-learning models in the
prediction of each soil texture type:
Prediction maps of soil texture types delivered quite different spatial distributions in the overall performance of different models (Fig. 5). The abundance indices pointed out that SVM could predict all nine types, KNN and XGB predicted eight of nine types, followed closely by RF (seven of nine types) and MLP (six of nine types). The maps predicted by RF, SVM, and XGB illustrated that the main soil texture types in the northwest of the lower reaches of the HRB were mostly LoSa, while other prediction models produced SaLo. In the upper reaches of the HRB, soil texture types generated from RF were more abundant and more in accordance with the real environment (Fig. 1).
Soil texture classification prediction maps of different
soil texture types for
Comparisons of the accuracy of different machine-learning models combined with original and transformed data. Bold values denote the best model performance for different indicators.
We compared the performance of each machine-learning model using the original and log-ratio-transformed data. The results indicated that the STRESS of the methods using log-ratio-transformed data was superior to that of the methods using original data (Table 2). The RMSE, MAE, RCC, and AD generated from KNN, MLP, RF, and XGB using original data outperformed the results using log-ratio-transformed data. By comparison, among different log-ratio-transformed data of the same machine-learning model, ILR and CLR outperformed ALR. KNN_CLR demonstrated the most remarkable performance with the highest RCC and the lowest RMSE and MAE for KNN using the three log-ratio transformation methods. Furthermore, RF and SVM generated relatively similar results using CLR- and ILR-transformed data. XGB_ILR showed the best performance with most of the indicators except for RMSE (6.75 %) and MAE (5.36 %) of clay, and STRESS (0.63). RF had the lowest RMSE and MAE, the highest RCC, and the lowest AD and STRESS for ALR, CLR, and ILR. For original data, RF also outperformed other models.
Prediction maps of the sand fraction. All of the ranges of the prediction maps of sand (approximately 9.0 %–90.0 %) were within the range of the original data (0.98 %–99.66 %). RF_ILR (7.9 %–94.7 %) and XGB_ORI (1.8 %–92.4 %) generated wider output distributions and were relatively closer to the range of the distribution of the original data than other prediction maps, such as KNN_ILR (7.3 %–88.6 %), KNN_ORI (7.8 %–80.8 %), MLP_ILR (8.8 %–90.8 %), MLP_ORI (9.0 %–90.3 %), RF_ORI (9.0 %–81.0 %), SVM_ILR (6.5 %–85.6 %), SVM_ORI (7.3 %–90.0 %), and XGB_ILR (5.0 %–88.5 %).
Interpolation prediction maps of soil PSFs using log-ratio-transformed data (ILR) and original data are represented in Figs. 6, S1, and S2. The maps generated from ILR-transformed data showed closer ranges to the original soil sampling data in terms of the ranges of sand (0.98 %–99.66 %), silt (0.17 %–95.87 %), and clay (0.03 %–39.77 %), and the texture features were more consistent with the distributions of the real environment (Figs. 6, S1, S2). With respect to different machine-learning models, RF and XGB delivered prediction maps that were closer to the range of the distribution of the original data than KNN, SVM, or MLP.
The overall accuracy and kappa coefficients of the indirect classification were improved by using log-ratio-transformed data, especially for RF and XGB (Fig. 7). ILR showed the highest overall accuracy among the three log-ratio transformations and also demonstrated the best performance in terms of the kappa coefficients, except for MLP. We compared direct classification with indirect classification and found that the differences in the overall accuracy of direct and indirect classification methods were negligible. However, the kappa coefficients were greatly improved using indirect classification compared with direct classification, except for MLP; in particular, RF_ILR increased the kappa coefficient to 0.291 (a 21.3 % improvement) while the accuracy remained stable.
Overall accuracy and kappa coefficients calculated from soil texture classification by soil PSF interpolation using five machine-learning models combined with original data and log-ratio-transformed data.
The distributions of soil texture types using original and ILR-transformed data are illustrated in Fig. 8 using the United States Department of Agriculture (USDA) soil texture triangle. The triangle of the original data of soil PSFs (Fig. 8a) demonstrates wider ranges of spatial dispersion than the interpolated data using machine-learning methods. These predictions reveal the properties of aggregating from the sides to the centre of triangles. With respect to the machine-learning models, RF shows the most dispersed feature in accordance with the original soil PSF data. The predictions from models combined with ILR-transformed data are more discrete and more associated with the original soil PSF data than those resulting from ORI methods. The prediction results represent significant differences in the error ratio (yellow symbols, Fig. 8) of the soil sampling points with respect to soil types between the left part (LoSa, SaLo, and Lo) and right part of the triangles (SiLo and Si) for most of the models, especially for KNN and MLP. The log-ratio methods over-calculate the mean value of silt in the process of transformation (Fig. 2), so these points are biased to the right of the USDA soil texture triangle based on overall contraction (regression smoothing effects), crossing the classification boundary and turning to other soil texture types. RF_ILR (Fig. 8f) delivers the highest right ratio (RR) among these models, and the classification accuracy is enhanced using the ILR method (83.9 %) compared with ORI (81.7 %). In the case of other models, the differences between ORI and ILR are negligible. We also compared the RRs of indirect classification models with those of direct classification, demonstrating all RRs of direct classification were higher (KNN, 67.97 %; MLP, 75.16 %; RF, 100 %; SVM, 66.09 %; XGB, 81.09 %), especially for RF and XGB. 
However, we did not retain the RR of direct classification as an evaluation indicator, because the same data sets were used for both training and prediction.
Soil texture types of 640 soil samples shown using the USDA texture
triangle. The results of soil PSFs were generated from
The prediction maps of soil texture classification by indirect methods using KNN, MLP, RF, SVM, and XGB with either ILR-transformed data (ILR) or original data (ORI).
The soil texture maps predicted using original data were different from the maps generated using log-ratio-transformed data, and classification maps of the machine-learning models combined with the log-ratio-transformed data had more detailed information (Figs. 9, S3). The results of machine-learning models using the three log-ratio-transformed data sets were similar in terms of the number of predicted types; however, there were significant differences between the results using original data and log-ratio-transformed data. All machine-learning models combined with original data predicted more Lo and SaLo soil texture types and fewer LoSa and Si types (Fig. 9). We also compared the prediction of soil texture types by direct classification (Fig. 5) with those generated from indirect classification using the same machine-learning models, which revealed that different distributions of LoSa existed among them in the lower reaches of the Heihe River basin. For the upper reaches, prediction maps of the ILR methods generated more Si and less Lo than the ORI method. Si soil texture types were mainly distributed in the middle and southeast of the upper reaches of the HRB in the predictions combined with ILR methods. For the middle reaches, ILR prediction maps were recommended and were more in line with the real environment than the ORI methods, because more SaLo and fewer Lo soil texture types were predicted in the middle reaches of the HRB. Furthermore, the predicted soil texture using indirect methods was more abundant than the directly predicted soil texture in the middle reaches (Fig. 5).
Average time spent running the KNN, MLP, RF, SVM, and XGB models 30 times for soil texture classification and soil PSF interpolation.
The run times of the models were computed and compared for different machine-learning models in soil texture classification and soil PSF interpolation (Fig. 10). Because the run times of the ORI and log-ratio methods were similar, the ILR was selected for soil PSF interpolation. With respect to the different models, RF required the longest time for both classification (453.73 s) and interpolation (188.87 s), which may cause it to lose its advantage over the other models when processing large data sets. KNN (classification, 4.2 s; interpolation, 23.6 s) and SVM (classification, 4.15 s; interpolation, 12.4 s) had shorter run times with respect to both classification and interpolation. XGB (classification, 21.6 s; interpolation, 17.13 s) was much more stable and required less time; the data processes were also simpler compared with MLP (classification, 47.28 s; interpolation, 152.31 s). Moreover, XGB delivered better performance than KNN and SVM in prediction maps, demonstrating that it is an effective way of dealing with large data sets.
The range of applicability of the study is limited to independent modelling, i.e. the component-wise approaches. However, joint fractions modelling could lead to different results. We found that tree-based machine-learning models – RF and XGB – delivered better performance than KNN, MLP, and SVM, which was also concluded by Heung et al. (2016). With respect to the total computing time, RF revealed the longest run time with respect to both the classification and interpolation mode. In addition, regarding trade-offs between the total computing time of the model and the accuracy, XGB was superior to the other four models, reducing the computing time significantly while maintaining acceptable accuracy. In fact, parallel calculations can be automatically executed during the training phase of the XGB model: this is a great advantage when working with large data sets, as the XGB can be more than 10 times faster than the existing gradient boosting models (Chen and Guestrin, 2016). Therefore, XGB is recommended due to its speed (although this is at the expense of suboptimal accuracy) when researchers are dealing with large data sets in study areas. Moreover, some joint fractions approaches – compositional kriging (Wang and Shi, 2017), high accuracy surface modelling (HASM; Yue et al., 2015, 2016) and the Dirichlet regression (Hijazi and Jernigan, 2009) – can consider the multivariate treatment for soil PSFs using a joint model, but machine-learning models are more convenient for combining environmental covariables. For the machine-learning models in our study, KNN, MLP, RF, and SVM also can be applied to multivariate vectors combined with log-ratio methods. For example, the multivariate random forest (MRF) method, which is the extended version of RF, calculates predictions of all output features using a single model (Segal and Xiao, 2011).
Log-ratio transformation methods can open the data and remove the “closure effect”, which induces spurious correlation. The opened data can be interpolated into the mapping area, and the results can then be back-transformed using inverse equations. However, in the process of parameter optimization, the optimal parameters of different machine-learning models are obtained using log-ratio-transformed data, which cannot guarantee the most accurate back-transformed results. This is because the values of assessment indicators (e.g. MAEs and RMSEs) will remain stable with limited differences due to the small value range of log-ratio-transformed data. Therefore, when the prediction values of log-ratio methods are back-transformed to the real space, these indicator values will be enlarged.
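The transformation and back-transformation steps described above can be sketched for a single three-part composition. This is a minimal illustration using the standard ALR and CLR definitions, not the code used in this study; the function names, the example composition, and the 0.1 offset are assumptions chosen for demonstration, and all parts must be strictly positive.

```python
import math

def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (here, per cent)."""
    s = sum(parts)
    return [total * p / s for p in parts]

def alr(parts):
    """Additive log ratio: log of each part over the last part (the denominator)."""
    return [math.log(p / parts[-1]) for p in parts[:-1]]

def alr_inv(coords, total=100.0):
    """Inverse ALR: exponentiate, append 1 for the denominator part, then close."""
    return closure([math.exp(c) for c in coords] + [1.0], total)

def clr(parts):
    """Centred log ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inv(coords, total=100.0):
    """Inverse CLR: exponentiate and close."""
    return closure([math.exp(c) for c in coords], total)

psf = [60.0, 30.0, 10.0]  # hypothetical sand, silt, clay percentages

alr_coords = alr(psf)
clr_coords = clr(psf)
back_alr = alr_inv(alr_coords)  # round trip recovers the original composition
back_clr = clr_inv(clr_coords)

# An error of 0.1 in one ALR coordinate becomes a larger error in percentage points.
shifted = alr_inv([alr_coords[0] + 0.1, alr_coords[1]])
sand_error = abs(shifted[0] - psf[0])
print("ALR coordinates:", alr_coords)
print("sand error after back-transformation (percentage points):", sand_error)
```

The last lines show why parameters tuned on log-ratio-scale errors need not minimize errors in the original units: a small offset in the transformed space translates into a noticeably larger deviation once the prediction is returned to percentages.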
Due to the contraction of the predicted values (Fig. 8), a small number of predictions fell outside the range of the original data values, including negative predictions when using ORI data. Although these few negative predictions could be eliminated by parameter adjustment in our study, this remains a drawback of using ORI data. Among the three log-ratio methods, ILR and CLR were superior to ALR, which can be explained by the fact that ILR and CLR are isometric transformations and preserve distances (Filzmoser and Hron, 2009). Moreover, ALR has been criticized because its results are affected by the subjective choice of the denominator. In addition, ILR showed slightly better performance than CLR. In CLR, the geometric mean of all soil PSF components is the denominator, so a one-to-one mapping between the transformed coordinates and the soil PSFs can be implemented. Nevertheless, the CLR coordinates sum to zero, so the problem of collinearity is still present. ILR transformed all of the information into D-1 orthogonal log contrasts (so-called balances) (Egozcue et al., 2003) and overcame the data collinearity and sub-compositional incoherence in CLR by using an appropriate choice of the basis (Egozcue and Pawlowsky-Glahn, 2005). Moreover, in the ILR method, multiple sets of ILR-transformed data can be generated by permutations of components (different sequential binary partitions, SBPs) in compositional data, and different choices of ILR balances influenced the model accuracy. The choice of a specific SBP for compositions is crucial for the intended interpretation of coordinates (Fiserova and Hron, 2011). The choice of SBPs can be applied blindly (Fiserova and Hron, 2011), can be based on a priori expert knowledge, or can be based on using a compositional biplot (Lloyd et al., 2012), and the best ILR balance can also be chosen using variograms and cross-variograms (Molayemat et al., 2018). All three SBPs are demonstrated in Sect. S6 (Table S3). The ILR balance chosen in our study was SBP1, because the ILR-transformed data using SBP1 were more symmetric than those using the other two SBPs. However, different SBPs will produce different results and prediction maps, which requires further research. Furthermore, each component of the log-ratio or original soil PSF data was independently modelled using component-wise approaches (machine-learning methods), which may be suboptimal compared with the joint fractions approach, which provides a multivariate treatment. For example, CLR-transformed data are still characterized by collinearity, but independent modelling cannot guarantee that the three predicted CLR components sum to zero. Although the final predictions were not influenced (they still sum to 100 %) due to the inverse equations for CLR, the collinear constraints reduced the prediction accuracy. By contrast, the ILR method is more meaningful and appropriate than the other log-ratio methods because it removes the data constraints. Therefore, ILR is recommended as a combination method with machine-learning models for component-wise modelling unless multivariate extensions of the methods (e.g. functional compositions) are considered.
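To make the roles of the SBP and of the CLR zero-sum constraint concrete, the following sketch builds ILR balances from a sign matrix and then perturbs CLR coordinates independently. It is a minimal example under stated assumptions: the two partitions `sbp_a` and `sbp_b` are hypothetical and are not claimed to match SBP1 to SBP3 in Table S3, and the composition and offsets are invented for illustration.

```python
import math

def contrast_matrix(sbp):
    """Orthonormal log-contrast rows from an SBP sign matrix (+1, -1, 0 per part)."""
    psi = []
    for row in sbp:
        r = sum(1 for v in row if v > 0)  # parts in the numerator group
        s = sum(1 for v in row if v < 0)  # parts in the denominator group
        pos = math.sqrt(s / (r * (r + s)))
        neg = -math.sqrt(r / (s * (r + s)))
        psi.append([pos if v > 0 else neg if v < 0 else 0.0 for v in row])
    return psi

def ilr(parts, psi):
    """ILR balances: each contrast row applied to the log parts."""
    logs = [math.log(p) for p in parts]
    return [sum(w * v for w, v in zip(row, logs)) for row in psi]

def ilr_inv(coords, psi, total=100.0):
    """Inverse ILR: map balances to CLR space, exponentiate, and close."""
    n = len(psi[0])
    clr_vals = [sum(coords[i] * psi[i][j] for i in range(len(psi))) for j in range(n)]
    expd = [math.exp(v) for v in clr_vals]
    return [total * e / sum(expd) for e in expd]

psf = [60.0, 30.0, 10.0]             # hypothetical sand, silt, clay percentages
sbp_a = [[+1, -1, -1], [0, +1, -1]]  # sand vs (silt, clay); then silt vs clay
sbp_b = [[-1, -1, +1], [+1, -1, 0]]  # clay vs (sand, silt); then sand vs silt

psi_a, psi_b = contrast_matrix(sbp_a), contrast_matrix(sbp_b)
ilr_a, ilr_b = ilr(psf, psi_a), ilr(psf, psi_b)
norm_a = math.hypot(*ilr_a)          # identical for both SBPs (isometry)
norm_b = math.hypot(*ilr_b)
back_a, back_b = ilr_inv(ilr_a, psi_a), ilr_inv(ilr_b, psi_b)

# CLR under independent modelling: each coordinate receives its own error,
# so the predicted coordinates no longer sum to zero.
logs = [math.log(p) for p in psf]
clr_true = [v - sum(logs) / 3 for v in logs]
clr_pred = [c + e for c, e in zip(clr_true, [0.2, -0.05, 0.1])]
expd = [math.exp(c) for c in clr_pred]
clr_back = [100.0 * e / sum(expd) for e in expd]  # still sums to 100 %
print("balances (SBP a):", ilr_a, "balances (SBP b):", ilr_b)
print("sum of predicted CLR:", sum(clr_pred), "sum after inverse:", sum(clr_back))
```

The two partitions give different balance values for the same sample, which is why a component-wise model fitted to each balance can behave differently across SBPs, even though both coordinate sets have the same norm and back-transform to the same composition.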
Compared with the real soil texture distribution and environment of the HRB, SiLo overlaid the upper reaches of the HRB, and SaLo and Lo were present in the south of the upper reaches of the HRB (showing a strip distribution). Moreover, an uncovered area was detected in the northwest of the lower reaches of the HRB, which cannot be predicted accurately due to a lack of input information in the model training process. The main soil texture types in the lower reaches of the HRB were SiLo, LoSa, and small areas of SaLo and Lo, which were distributed in the uncovered area. The main soil texture types predicted from direct classification using machine-learning models were SaLo and SiLo; RF and XGB predicted much more LoSa than the other direct classification models. However, all of these models predicted that the main soil type in the lower reaches of the HRB was SaLo, which did not fit the real environment (LoSa). In fact, LoSa and SaLo were the types most often confused with each other, although they are fairly similar (Fig. 8). In addition, due to the limitation of the training subsets, direct classification can only predict types that are contained in the training subsets. In contrast, indirect classification overcame this limitation, and new prediction types arose from the conversion of soil PSFs to soil texture types. Moreover, models that matched the real environment more closely should be considered, such as the log-ratio methods of the MLP and RF models, KNN_ALR, KNN_ILR, and XGB_CLR.
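The reason indirect classification can produce types that are absent from the training labels can be sketched by converting predicted PSFs to a texture class. The example below assumes the commonly cited USDA texture-triangle thresholds; the function name `usda_texture`, the training label set, and the predicted fractions are hypothetical and are not taken from this study.

```python
def usda_texture(sand, clay):
    """Return a USDA texture class name from sand and clay percentages.

    Thresholds follow the commonly cited USDA texture-triangle rules
    (an assumption of this sketch). Silt is the remainder to 100 %.
    """
    silt = 100.0 - sand - clay
    if min(sand, silt, clay) < 0:
        raise ValueError("sand and clay must be non-negative and sum to at most 100")
    if silt + 1.5 * clay < 15:
        return "sand"
    if silt + 2 * clay < 30:
        return "loamy sand"       # LoSa
    if (7 <= clay < 20 and sand > 52) or (clay < 7 and silt < 50):
        return "sandy loam"       # SaLo
    if 7 <= clay < 27 and 28 <= silt < 50 and sand <= 52:
        return "loam"             # Lo
    if (silt >= 50 and 12 <= clay < 27) or (50 <= silt < 80 and clay < 12):
        return "silt loam"        # SiLo
    if silt >= 80 and clay < 12:
        return "silt"
    if 20 <= clay < 35 and silt < 28 and sand > 45:
        return "sandy clay loam"
    if 27 <= clay < 40 and 20 < sand <= 45:
        return "clay loam"
    if 27 <= clay < 40 and sand <= 20:
        return "silty clay loam"
    if clay >= 35 and sand > 45:
        return "sandy clay"
    if clay >= 40 and silt >= 40:
        return "silty clay"
    if clay >= 40 and sand <= 45 and silt < 40:
        return "clay"
    raise ValueError("composition not covered by the rule set")

# Hypothetical situation: only two classes occur among the training labels,
# so a direct classifier could never output anything else.
training_classes = {"sandy loam", "silt loam"}

# Hypothetical interpolated fractions (sand, silt, clay) at an unsampled location.
pred_sand, pred_silt, pred_clay = 82.0, 12.0, 6.0
indirect_class = usda_texture(pred_sand, pred_clay)
print(indirect_class, indirect_class in training_classes)
```

Because the class is derived from continuous fractions, the set of possible outputs is the whole texture triangle rather than the set of labels seen during training.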
We systematically compared five machine-learning models using the original data and three log-ratio-transformed data sets in the HRB for direct and indirect soil texture classification and soil PSF interpolation. As a flexible and stable tree learner, RF delivered powerful performance in both classification and interpolation and was superior to the other machine-learning models mentioned above. Although XGB is new to soil science and its accuracy was suboptimal, it was more computationally efficient in processing large data sets. RF and XGB were recommended to evaluate the classification capacity of imbalanced data. In addition, the log-ratio methods, especially ILR, had the advantage of improving STRESS in soil PSF interpolation. Moreover, the indirect methods for soil texture classification outperformed the direct methods, especially when combined with log-ratio transformations, and generated preferable results with respect to both the accuracy indicators and the prediction maps. The keys to improving the interpolator accuracy are using more appropriate interpolation techniques with environmental covariates, transforming soil PSF data using more efficient transformation methods, utilizing compositional data analysis in multivariate studies, and using systematic parameter adjustment algorithms for compositional data.
The 640 soil sampling data for the HRB,
The supplement related to this article is available online at:
WS contributed to soil data sampling and oversaw the design of the entire project. MZ performed the analysis and wrote the paper. ZX collected and analysed data. All authors contributed to writing the paper and interpreting data.
The authors declare that they have no conflict of interest.
We acknowledge the comments from the editor, Alberto Guadagnini; the reviewers, Tom Hengl and Alfred Stein; and the anonymous referees that helped us improve the quality of the paper. Thanks are also due to the National Meteorological Information Center for providing the meteorological data and the National Tibetan Plateau Data Center for the soil particle size fractions data.
This study was supported by the National Key Research and Development Program of China (grant no. 2017YFA0604703), the National Natural Science Foundation of China (grant nos. 41771364 and 41771111), the Fund for Excellent Young Talents in Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (CAS; grant no. 2016RC201), the Youth Innovation Promotion Association, CAS (grant no. 2018071), the Investigation and Monitoring project of Ministry of Natural Resources (grant no. JCQQ191504-06) and a grant from the State Key Laboratory of Resources and Environmental Information System.
This paper was edited by Alberto Guadagnini and reviewed by two anonymous referees.