Supplement of Systematic comparison of five machine-learning models in classification and interpolation of soil particle size fractions using different transformed data

Abstract. Soil texture and soil particle size fractions (PSFs) play
an increasingly important role in physical, chemical, and hydrological processes. Many
previous studies have used machine-learning and log-ratio transformation
methods for soil texture classification and soil PSF interpolation to
improve the prediction accuracy. However, few reports have systematically
compared their performance with respect to both classification and interpolation. Here,
five machine-learning models – K-nearest neighbour (KNN), multilayer
perceptron neural network (MLP), random forest (RF), support vector machines
(SVM), and extreme gradient boosting (XGB) – combined with the original data and three log-ratio transformation methods – additive log ratio (ALR), centred log ratio (CLR), and
isometric log ratio (ILR) – were applied to evaluate soil texture and
PSFs using 640 soil samples from the Heihe River basin
(HRB) in China. The results demonstrated that the log-ratio transformations
decreased the skewness of soil PSF data. For soil texture
classification, RF and XGB showed better performance with a higher overall
accuracy and kappa coefficient. They were also recommended to evaluate the
classification capacity of imbalanced data according to the area under the
precision–recall curve (AUPRC). For soil PSF interpolation, RF
delivered the best performance among the five machine-learning models, with the
lowest root-mean-square error (RMSE; 15.09 % for sand, 13.86 % for silt, and
6.31 % for clay), mean absolute error (MAE; 10.65 % for sand, 9.99 % for silt, and 5.00 % for clay), Aitchison distance (AD; 0.84), and standardized
residual sum of squares (STRESS; 0.61), and the highest Spearman rank
correlation coefficient (RCC; 0.69 for sand, 0.67 for silt, and 0.69 for clay). STRESS
was improved by using log-ratio methods, especially for CLR and ILR. Prediction
maps from both direct and indirect classification were similar in the middle and
upper reaches of the HRB. However, indirect classification maps using log-ratio-transformed data provided more detailed information in the lower reaches of
the HRB. There was a pronounced improvement of 21.3 % in the kappa
coefficient when using indirect methods for soil texture classification compared
with direct methods. RF was recommended as the best strategy among the five
machine-learning models, based on the accuracy evaluation of the soil PSF
interpolation and soil texture classification, and ILR was recommended for
component-wise machine-learning models without multivariate treatment,
considering the constrained nature of compositional data. In addition, XGB
was preferred over the other models when the trade-off between accuracy and runtime was
considered. Our findings provide a reference for future work on the
spatial prediction of soil PSFs and texture using machine-learning models
when soil PSF data have skewed distributions over a large area.


This supplementary material contains the equations of the methods, tables, and prediction maps referred to in the paper, organized into the following six sections. Section S1 gives the equation descriptions of the machine-learning models.
Section S2 describes the parameter adjustment and modeling of the machine-learning methods. Section S3 presents the uncertainty assessment of the soil PSF interpolation. Section S4 shows the prediction maps of the silt and clay fractions. Section S5 shows the indirect classification maps using the ALR and CLR transformation methods. Section S6 describes the SBP balances of the ILR method and the choice of the construction of coordinates (so-called balances) in the SBPs.

Supplementary Material
Section S1 The equation descriptions of machine-learning models
For K-nearest neighbour (KNN), the prediction for an input $x$ is based on the distance function, which is as follows:
$d(x, x^{(1)}) = \min_i d(x, x_i)$,
and $\hat{y} = y^{(1)}$ refers to the class of the nearest neighbour $x^{(1)}$, which is the prediction for $x$. Values $x^{(k)}$ and $y^{(k)}$ are the $k$th nearest neighbour of $x$ in the training set and its class, respectively.
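As an illustration, the 1-nearest-neighbour rule above can be sketched in Python; the covariate vectors and texture classes below are hypothetical stand-ins, not the HRB data, and Euclidean distance is assumed:

```python
def nn_predict(x, train_x, train_y):
    """Predict the class of x as the class of its nearest neighbour,
    i.e. the training point minimising d(x, x_i) (Euclidean distance here)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    # Index of the training point closest to x
    i_min = min(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
    return train_y[i_min]

# Hypothetical covariate vectors and texture classes (illustrative only)
train_x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["clay", "clay", "sand", "sand"]
pred = nn_predict((0.5, 0.2), train_x, train_y)  # nearest point is (0, 0)
```

In the study the kknn package generalizes this to $k$ neighbours with kernel weights; the sketch keeps only the $k = 1$ case from the equation.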
For the multilayer perceptron neural network (MLP), each neuron sums its input environmental covariates (in our study) after multiplying them by the connection weights, and calculates its output (soil PSF components or texture types) as a function of the sum:
$o_j = f\left(\sum_i w_{ij} x_i\right)$,
where $f$ is the activation function, which can be a linear or logistic function. The sum of squared differences between the predicted and observed values of the output neurons is defined as follows:
$E = \sum_j \left(\hat{o}_j - o_j\right)^2$,
where $\hat{o}_j$ and $o_j$ are the predicted and observed values of output neuron $j$, respectively. Each weight $w_{ij}$ is adjusted to reduce $E$, and the adjustment of $w_{ij}$ depends on the training algorithm.
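A minimal sketch of one neuron's forward pass and the squared-error sum above (Python; a logistic activation is assumed, and the input and weight values are illustrative):

```python
import math

def neuron_output(inputs, weights, activation=None):
    """One neuron: weighted sum of inputs passed through an activation f."""
    if activation is None:
        activation = lambda s: 1.0 / (1.0 + math.exp(-s))  # logistic
    s = sum(w * x for w, x in zip(weights, inputs))
    return activation(s)

def sum_squared_error(predicted, observed):
    """E = sum_j (p_j - o_j)^2 over the output neurons."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

out = neuron_output([1.0, 2.0], [0.5, -0.25])  # logistic(0.5 - 0.5) = 0.5
err = sum_squared_error([out], [1.0])          # (0.5 - 1.0)^2 = 0.25
```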
For random forest (RF), the equations for the Gini index and for minimizing the sum of squared deviations from the mean ($M$) are as follows:
$\mathrm{Gini}(D) = 1 - \sum_k p_k^2$,
$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$,
$M = \sum_{x_i \in D_1} (y_i - c_1)^2 + \sum_{x_i \in D_2} (y_i - c_2)^2$,
where $p_k$ refers to the proportion of the $k$th class in the data set $D$ on the current node; for feature $A = a$, the data set $D$ is divided into two parts ($D_1$ and $D_2$), where $D_1$ is the subset that meets the condition $A = a$ and $D_2$ is its complement; $\mathrm{Gini}(D, A)$ represents the uncertainty of set $D$ after the binary split; $y_i$ is the predicted value of input value $x_i$; and $c_1$ and $c_2$ are the means of data sets $D_1$ and $D_2$, respectively.
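The Gini computations above can be sketched as follows (Python; the toy label sets are illustrative only):

```python
def gini(labels):
    """Gini index of a node: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_after_split(d1, d2):
    """Weighted Gini of a binary split of D into D1 and D2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

mixed = gini(["a", "a", "b", "b"])          # maximally impure 2-class node
split = gini_after_split(["a", "a"], ["b", "b"])  # a perfect split
```

A perfect binary split drives the weighted Gini to zero, which is what the tree-growing procedure seeks when it evaluates candidate features.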
In the support vector machine (SVM), for a data set $\{x_i, y_i\}$, $i = 1, \dots, l$, where $x_i \in \mathbb{R}^n$ is an $n$-dimensional vector and $y_i \in \{-1, +1\}$ is the class corresponding to $x_i$, the equation for calculating a hyperplane of the SVM is defined as follows:
$\min_{w, b, \xi} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i$, subject to $y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i$, $\xi_i \ge 0$,
where $\phi(x)$ refers to the mapping from the input space to the feature space, $C > 0$ is the penalty factor (cost), and $w$, $b$, and $\xi_i$ are the parameters that need to be optimized during model training, which can be determined by the Lagrange multipliers:
$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b^{*} \right)$,
where $x_i$ refers to a support vector, $K(x_i, x)$ refers to the kernel function, and $b^{*}$ is the bias.
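The kernel decision function above can be sketched as follows (Python; an RBF kernel with the gamma reported in Section S2 is assumed, and the support vectors, multipliers, and bias are made-up stand-ins for values a trained SVM would supply):

```python
import math

def rbf_kernel(u, v, gamma=0.01):
    """Gaussian (RBF) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vecs, alphas, labels, bias):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b*, the SVM decision rule."""
    s = sum(a * y * rbf_kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vecs)) + bias
    return 1 if s >= 0 else -1

# Made-up support vectors, multipliers, and bias (a trained SVM supplies these)
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
```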
For extreme gradient boosting (XGB), the general prediction function at step $t$ is defined as follows:
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$,
where $f_t(x_i)$ refers to the tree (learner) at step $t$, $\hat{y}_i^{(t)}$ and $\hat{y}_i^{(t-1)}$ refer to the predicted values at steps $t$ and $t-1$, and $x_i$ is the input value. The regularized objective is defined as follows:
$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{t} \Omega(f_t)$,
where $\mathcal{L}^{(t)}$ is the regularized objective, $\hat{y}_i^{(t)}$ and $y_i$ refer to the predicted and observed values, $l$ refers to the loss function, $n$ is the size of the data set, and $\Omega$ refers to the regularization term, whose equation is defined as follows:
$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$,
where $w$ refers to the leaf weight vector, $T$ denotes the total number of leaves, $\lambda$ is the regularization parameter, and $\gamma$ is the minimum loss reduction.
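The additive prediction and the regularization term above can be sketched as follows (Python; the two stump "trees" are hypothetical stand-ins for fitted learners $f_1$ and $f_2$):

```python
def boosted_predict(x, trees, base=0.0):
    """Additive prediction: y_hat(t) = y_hat(t-1) + f_t(x), summed over steps."""
    y_hat = base
    for f_t in trees:
        y_hat += f_t(x)
    return y_hat

def omega(weights, n_leaves, gamma_=0.0, lam=1.0):
    """Regularization term Omega(f) = gamma * T + 0.5 * lambda * ||w||^2."""
    return gamma_ * n_leaves + 0.5 * lam * sum(w * w for w in weights)

# Two toy stump "trees" standing in for fitted learners f_1, f_2
trees = [lambda x: 1.0 if x > 0 else -1.0, lambda x: 0.5]
```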

Section S2 Parameter adjustment and modeling of machine-learning methods
For the parameter adjustment in Table S1, all variables (i.e., "sand, silt, clay, ilr1, ilr2, alr1, alr2, clr1, clr2, clr3" for regression and "class" for classification) were trained independently to define the best-performance parameter combination of each machine-learning method using the R packages mentioned in Section 2.4.6 'Parameters optimization'. Accuracy indicators (e.g., RMSEs) were based on the Aitchison space for the original data and on the Euclidean space for the log-ratio-transformed data. For KNN, the kmax was 15, the distance was 1, and the kernel was rectangular. For MLP, the size ranged between 5 and 10. For RF, the ntree was 1000 and the mtry fluctuated from 9 to 11. For SVM, the gamma was 0.01 and the cost was 1. For XGB, the max_depth was 3–4, the eta was 0.05–0.1, the colsample_bytree was 0.6–0.8, the nrounds was 30, the subsample was 0.8–1, the gamma was 0–0.8, and the min_child_weight was 0.6–0.8.
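The exhaustive search for a best-performance parameter combination can be sketched as follows (Python; the grid echoes the XGB ranges reported here, but the evaluator is a toy stand-in for the cross-validated accuracy indicators computed by the R packages):

```python
from itertools import product

# Hypothetical grid echoing the XGB ranges reported in this section
grid = {"max_depth": [3, 4], "eta": [0.05, 0.1], "subsample": [0.8, 1.0]}

def grid_search(grid, evaluate):
    """Try every parameter combination; keep the one with the lowest score."""
    keys = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy evaluator standing in for a cross-validated RMSE
def toy_rmse(p):
    return abs(p["max_depth"] - 3) + abs(p["eta"] - 0.05) + abs(p["subsample"] - 1.0)

best, score = grid_search(grid, toy_rmse)
```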

Table S1
Adjusted parameters for different machine-learning methods. "rectan" is short for rectangular, "opt" is short for optimal, and "ep" is short for epanechnikov. For the independent modeling of the soil PSF interpolation, each component in Table S1 except 'class' was trained separately using the five machine-learning methods. For the original method, the three components 'sand', 'silt', and 'clay' were applied separately to the machine-learning methods with their own parameters. For the log ratio transformation methods, the seven components were also applied separately, and the results of the three log ratio transformation methods were then back-transformed (alr1 and alr2 for the ALR method; clr1, clr2, and clr3 for the CLR method; and ilr1 and ilr2 for the ILR method).
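The ILR forward and back-transformation behind the 'ilr1' and 'ilr2' components can be sketched for a three-part (sand, silt, clay) composition as follows (Python; the SBP/basis shown is one possible choice, not necessarily the one used in the paper, and the example composition is hypothetical):

```python
import math

def closure(parts):
    """Rescale a composition so its parts sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

# Orthonormal ILR basis for a 3-part composition (one possible SBP choice)
E = [(1 / math.sqrt(2), -1 / math.sqrt(2), 0.0),
     (1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6))]

def ilr(x):
    """Forward ILR: z_k = sum_i E[k][i] * ln(x_i)."""
    log_x = [math.log(v) for v in x]
    return [sum(e * l for e, l in zip(row, log_x)) for row in E]

def ilr_inv(z):
    """Back-transform: clr_i = sum_k z_k * E[k][i], then close exp(clr)."""
    clr = [sum(z[k] * E[k][i] for k in range(len(E))) for i in range(3)]
    return closure([math.exp(c) for c in clr])

# Hypothetical sand/silt/clay fractions, closed to sum to 1
x = closure([40.0, 35.0, 25.0])
z = ilr(x)          # the two coordinates modeled as 'ilr1' and 'ilr2'
x_back = ilr_inv(z)  # recovers the original composition
```

Because the basis is orthonormal and its rows sum to zero, the back-transform recovers the composition exactly, which is why the predicted ilr1/ilr2 values can be mapped back to sand, silt, and clay fractions that satisfy the constant-sum constraint.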

Section S3 Uncertainty assessment of soil PSF interpolation
For the uncertainty assessments of the models, Table S2 showed that ORI delivered lower SDs than the log ratio methods across the five machine-learning models for sand, silt, and clay. Moreover, the ranges of the 95 % confidence intervals (CIs) of the indicators were also computed, which were relatively low compared with the assessment indicators (Table S2).

Section S4 Prediction maps of silt and clay fractions
Figure S1. The prediction maps of the silt fraction using five machine-learning models with ORI and ILR data.
Figure S2. The prediction maps of the clay fraction using five machine-learning models with ORI and ILR data.

Section S5 Indirect classification maps using ALR and CLR transformation methods