Compositional balance should be considered in the mapping of soil 1 particle-size fractions using hybrid interpolators

Digital soil mapping of soil particle-size fractions (PSFs) using log-ratio methods is a widely used technique. As a 10 hybrid interpolator, regression kriging (RK) is an alternative way to improve prediction accuracy. However, there is still a lack 11 of comparisons and recommendations when RK is applied for compositional data, and it is not known if the performance based 12 on different balances of isometric log-ratio (ILR) transformation is robust. Here, we compared the generalized linear model 13 (GLM), random forest (RF), and their hybrid patterns (RK) using different transformed data based on three ILR balances, with 14 29 environmental covariables (ECs) for the prediction of soil PSFs in the upper reaches of the Heihe River Basin, China. The 15 results showed that RF performed best, with more accurate predictions, but GLM produced a more unbiased prediction. For 16 the hybrid interpolators, RK was recommended because it widened the data ranges of the prediction values, and modified the 17 bias and accuracy for most models, especially for RF. Moreover, prediction maps generated from RK revealed more details of 18 the soil sampling points. For three ILR balances, different data distributions were produced. Using the most abundant 19 component of the compositional data as the first component of the permutations was not considered the right choice because 20 it produced the worst performance. Compared to the relative abundance of components, we recommend that the focus should 21 be on data distribution. This study provides a reference for the mapping of soil PSFs combined with transformed data at the 22 regional scale. 23


Introduction 24 Groups
Step

GLM 136
The GLM is an extended version of the linear model, which contains response variables, with non-normal distributions (Nelder 137 and Wedderburn, 1972). The link function is embedded into the GLM to ensure the classical linear model assumptions. The 138 scaled dependent variables and the independent variables can be connected using a link function for the additive combination 139 of model effects, the choice of link function depends on the distribution of response variables (Venables and Dichmont, 2004). 140 A Gaussian distribution with an identity link function was applied in our study, which produced consequences equivalent to 141 that of MLR (Nickel et al., 2014). However, categorical variables can be directly trained in the GLM without setting dummy 142 variables. The Akaike's information criterion (AIC) was applied to choose the best predictors and remove model 143 multicollinearity using a backward stepwise algorithm, and the combinations of ECs for different ILR data were then obtained 144 (Table S2.1). 145

RF 146
The RF is a non-parametric technique, which combines the bagging method with a selection of random variables as an extended 147 version of a regression tree (RT) (Breiman, 1996(Breiman, , 2001. It can improve model prediction accuracy by producing and 148 aggregating multiple tree models. The principle of the RF is to merge a group of "weak trees" together to generate a "powerful 149 forest." The bootstrap sampling method was applied for each tree, and each predictor was selected randomly from all model 150 predictors. The "out of bag" (OOB) data were applied to produce reliable estimates in an internal validation using a random 151 subset independent of the training tree data. Three parameters needed to be tuned: number of trees ( ); minimum size of 152 terminal nodes ( ), and number of variables randomly sampled as predictors for each tree ( ) (Liaw and Wiener, 153 2001). The standard value of the parameter was one-third of the total number of predictors, while and 154 were 500 and 5, respectively. For regression, the mean square errors (MSEs) of predictions were estimated to train 155 the trees. The variable importance of the RF was produced from the OOB data using the "importance" function. One of the 156 benefits of the RF is that the ensembles of trees are used without pruning to ensure that the most significant amount of variance 157 can be expressed. Moreover, the RF can reduce model overfitting and normalization is unnecessary due to the effects on the 158 value range being insensitive. The GLM and RF algorithms and the parameter adjustment of the RF were conducted in the R 159 package "caret" (Max Kuhn, 2018). 160 161

RK 162
Regression kriging is a hybrid interpolation technique that combines regression models (e.g., GLM and RF) with the residuals 163 of OK (Odeh et al., 1995). Mathematically, the RK method corresponds to two interpolators, the regression part and the kriging 164 part, which are operated separately (Goovaerts, 1999). One limitation of using only the regression part is that it is usually only 165 useful within the range of values of the training sets (Hengl et al., 2015). The principle of the RK method is that the regression 166 model explains a deterministic component of spatial variability, and the interpolation of regression residuals generated from 167 OK is used to describe the spatial variability (Bishop and McBratney, 2001;Hengl et al., 2004). The residuals create a 168 variogram (e.g., Gaussian, spherical, or exponential) for models based on the MSE from the results of a cross-validation. First, 169 we used the regression part (GLM or RF) to predict soil PSFs, the residual from the fitted model was then calculated by 170 subtracting the regression part from the observations. Subsequently, OK was applied for the whole study area to interpolate 171 the residuals. Finally, the regression prediction and the predicted residuals at the same location were summed. The variograms 172 of the RK method were generated automatically using the "autofitVariogram" function in the R package "automap" (Hiemstra 173 et al., 2009). 174

Prediction method system and validation 175
The method system of spatial interpolation models for soil PSFs is presented in Table 2. We systematically compared 12 176 models: four interpolators, including GLM and RF with or without RK, and three SBPs of the ILR transformation method. For 177 the validation of model performance, the independent data set validation was used to evaluate the prediction bias and accuracy 178 of the models. The sub-training sets (70%) and the sub-testing sets (30%) were randomly selected from data independently, 179 and this process was repeated 30 times. 180 The mean error (ME), the root mean square error (RMSE), and Aitchison distance (AD) were used to evaluate and compare 183 the prediction performance. The ME and RMSE measure prediction bias and accuracy, respectively (Odeh et al., 1995). The 184 AD is an overall indicator of compositional analysis, which describes the distance between two compositions. Generally, in an 185 accurate, unbiased model all three values will be close to 0. The ME, RMSE, and AD were calculated as follows: 186 The interpretation of the ILR balances is based on a decomposition of the covariance (COV) structure (Fiserova and Hron, 195 2011). We calculated the variance (VAR), COV, and the corresponding correlation coefficient (CC) of ILR transformed data 196 based on different SBP. The equations for calculating VAR, COV, and CC were derived from Eq. (1) as follows: 197 For soil PSF data, Eqs. (10), (11), and (12) can be simplified to three dimensions. The relationship between the ratios of soil 203 PSF components and the dominant roles of ILR transformed data were indicated from the covariance structure. All the 204 statistical analyses, such as the descriptive statistics of soil PSF data, calculation and evaluation of indicators, and the spatial 205 prediction mapping, were performed using the R statistical program (R Development Core Team, 2019). From the descriptive statistics of the original (raw) and ILR transformed data, the silt fraction dominated the soil PSFs, 211 accounting for a more substantial amount than the sand and clay fractions. The distributions of the sand and clay fractions 212 were similar (Fig. 2a). The ILR transformed data based on the three SBPs revealed different distributions (Figs. 2b,2c,and 213 2d). For example, two ILR components (ILR1 and ILR2) for SBP1 had a symmetric distribution around zero at the x-axis (Fig.  214 2b). In comparison, the distribution of data generated from SBP2 or SBP3 had a mirrored symmetry, with a left-skewed ILR1 215 of SBP2 and right-skewed ILR2 of SBP3 (Figs. 2c and 2d). The comparison of means and medians demonstrated that the back-216 transformed means of three sets of ILR transformed data were the same, and the mean ILR of sand was closer to the median 217 compared with the original soil PSF data. In contrast, the opposite patterns were apparent for the silt and clay components ( The covariance analysis of the transformed data of soil PSFs based on the different SBPs showed that the variance VarILR_1 225 of SBP3 was the largest, followed by the VarILR_1 of SBP1 and SBP2 (Table 3). The variance of the second component of 226 ILR (VarILR_2) followed the opposite pattern to that of VarILR_1. The COV and corresponding CC followed the same pattern 227 of SBP1 > SBP3 > SBP2. The first ILR equation ( 1 in Table 1) contained all information of soil PSFs, while the second one ( 2 in Table 1) included only two components. The information of VarILR_1 was therefore more abundant. All of VarILR_1 229 and VarILR_2 values were not 0 (or not nearly 0), indicating that there was no constant (or almost constant) value in any two 230 ratios of soil PSF components. The COV of SBP3 was close to 0, indicating that the proportions of clay/sand and clay/silt 231 were approximately the same. The same results were generated from the corresponding CC. For the distribution of soil PSFs 232 in a ternary diagram (the United States Department of Agriculture texture triangle, USDA), the main texture class was silt 233 loam (Fig. 3a). The biplot of soil samples demonstrated that the rays of the three components, i.e., sand, silt, and clay, were 234 reasonably well clustered at about 120° in the three groups (Fig. 3b). 235 236  Figure 3. The distribution of soil PSFs in the USDA triangle (a) and biplot graph (b). The red curve was fitted by loess function. 242

Accuracy comparison of different models using ILR data 243
The first three rows of the boxplots in Figs. 4a, 4b, and 4c indicate the bias of the different models according to their ME 244 values. The ME of sand was closest to 0, followed by the MEs of clay and silt. GLM was more unbiased than RF, with lower 245 ME values. After combining with RK, there was an improvement in the ME for most GLM and RF models (Figs. 4a, 4b, and  246 4c). For the accuracy assessment, the RMSE of silt was higher than for the other two components. The GLMRK did not 247 perform as well as expected in terms of the RMSE, with only the sand component having an improved RMSE (Fig. 4d).
However, the RFRK performed better than the GLMRK and improved the accuracy of most parts compared with the RF, 249 except for the RFRK_SBP1 of sand. As an overall indicator, AD showed that the RF (or RFRK) performed better than the 250 GLM (or GLMRK) in terms of both average RMSE values and uncertainties (Fig. 4g). Moreover, the RFRK improved the AD 251 values for the SBP2 and SBP3 methods. For the uncertainty assessment, the RF generated lower uncertainties than the GLM, 252 and the models combined with RK further reduced the uncertainties for most GLM and RF models. 253 The model performances were different for the three SBPs. To better evaluate model performance using the different SBP 258 balances, we graded each box from 1 to 3, and the final results are shown in Fig. 5. The results demonstrated that SBP1 performed best, with the lowest ME value of all models. For the accuracy comparison there was no apparent pattern, but 260 accuracy could be considered hierarchically: (1) for the GLM, SBP1 performed better than the other two SBP methods, which 261 also performed well when RK was combined (GLMRK); (2) for RF, SBP1 produced the best result. However, the introduction 262 of RK resulted in the Score2 of SBP3 performing best among the three SBPs. However, RFRK of SBP1 performed worst 263 according to the values of Score2 and Score5. Finally, for the comprehensive assessment, SBP1 performed best among three 264 SBPs according to Score6. More details and calculation processes can be found in the Supplementary Material (Table S4.1). 265 266 Figure 5. Ranking score of model performance based on three SBPs. Score1 and Score2 are the sum scores of ME and RMSE 267 for each model, respectively; Score3 is the sum scores of ME, RMSE and AD for each model, Score4 and Score5 are the sum 268 scores of ME or RMSE for GLM all (GLM and GLMRK) and RF all (RF and RFRK), Score6 is the sum scores of all indicators. 269 The lower the value, the better the model performance. 270 271 3.3 Spatial prediction maps of soil PSFs generated from the different models narrow range of low values for these components, revealing a smoother distribution than that generated by the GLM and 276 GLMRK. Unlike the regression methods, the RF and RFRK methods produced hot and cold spots on the prediction maps and 277 more details of the soil sampling points were apparent (Fig. S5.1). 278 279 Figure 6. Spatial prediction maps of the sand component of the upper reaches of the Heihe River Basin. 280

Spatial distribution of soil texture classes in the USDA triangles 281
The predicted soil textures in the USDA texture triangles (Fig. 7) showed that most predictions fell within the range of observed 282 soil textures (Fig. 3a), and silt loam was the dominant soil texture in all cases. The GLM produced a more discrete distribution 283 than the RF, and the RK method expanded the dispersion. In the trends of the predicted samples, the silt components predicted 284 from all models were overestimated. The pattern fitting curves indicated that the prediction results were closer to the bottom 285 right of the USDA triangle than the soil PSF observations. The GLMRK and RFRK curves were longer than the GLM and RF 286 curves, with a more extensive range of values in triangles. Compared with the GLMRK, the RFRK produced a more upward 287 extension (Figs. 7j, k, l). It was clear that the clay fraction was overestimated and the sand fraction was underestimated. 288  We found RF reveal more accurate results, but with more bias than the GLM, and RK method improved the performance in 295 terms of bias for most models and the accuracy of the RF. Odeh et al. (1995) indicated that RK was superior to the linear 296 models, such as MLR, which was reflected in the prediction results for sand in our study. Scarpone et al. (2016) reported that 297 as a hybrid interpolator, the RFRK outperformed the RF when making soil thickness predictions. We proved that RFRK was 298 also suitable for compositional data and improved model performance when combining with the ILR transformation. In 299 summary, the GLM and RF had both advantages and disadvantages when considering the trade-off between bias and accuracy. 300 The results of GLM and GLMRK should not depend on the ILR basis being chosen, which has been proved by previous 301 studies on the use of linear models and kriging for compositional data (Pawlowsky-Glahn et al, 2015). However, the GLM 302 model used the "glmStepAIC" algorithm (i.e., a stepwise regression) to select the best combination of environmental 303 covariables for each ILR component (Table S2.1). Therefore, the variable inputs are different for these ILR data, and further 304 impact the accuracy assessment and prediction maps. In addition, the difficulty with the use of the GLM is the need for a back-305 transformation. There is a need to present results on the original untransformed scale after conducting the analysis on a 306 transformed level, which may produce spurious results (Lane, 2002). In our study, we compared the means of ILR transformed 307 data and the original data. We proved the feasibility of the ILR transformation method, especially for meeting the requirements 308 of compositional data. However, the accuracy of the GLM still needs to be improved, which may be because the transformed 309 data did not follow a normal distribution (Fig. 2). 310 Although the RF had the advantage of prediction accuracy, the limited interpretability of the consequences made it difficult 311 to modify the prediction bias -each tree from the model cannot be examined individually (Grimm et al., 2008). Moreover, the 312 ILR transformation before modeling increased the difficulty of interpretation for not only the predicted values on the ILR scale 313 but also the residuals. The back-transformation of the optimal estimate of log-ratio variables does not generate the optimal 314 estimation of compositional data (Lark and Bishop, 2007), which should also be considered. 315

Comparison of three SBPs of ILR transformation 316
For the comparison of the three SBPs, the ME and RMSE performed better when using SBP1 for ILR transformed data, 317 which may be interpreted as the distributions of the ILR1 and ILR2 of SBP1 being more symmetric (Fig. 2b). In contrast, the 318 performance of SBP2 was worse than that of SBP1 and SBP3 because the ILR_1 component, including all the soil PSF 319 information, was left-skewed (Fig. 2c). This result was especially apparent for the GLM and GLMRK, because the data in a 320 linear model needs to be normally distributed (Lane, 2002). 321 The negligible difference among the three SBP balances revealed a triangular shape with a cluster at about 120° (Fig. 3b). 322 This could be interpreted as the three soil PSFs having a mixed pattern, with each component dominated by the components 323 in one cluster (Tolosana-Delgado et al., 2005). Although the silt component dominated the soil PSFs (Fig. 2a), sand and clay also played important roles in soil compositions. Taking either the most abundant component of the compositional data as the 325 denominator (Martins et al., 2016) or the first component of the permutations did not provide convincing evidence. Because 326 using the most abundant component of the compositional data as the primary component of the alterations, i.e., SBP2, resulted 327 in a relatively poor performance compared to the other SBPs. Thus, we recommend that the focus should be on data distribution. 328 Furthermore, the choice of balance and combination of RK are also the key to improving model accuracy, as shown by the 329 result of the RFRK-SBP3 model (Fig. 4). 330

Limitations 331
Firstly, the scope of this study is limited to independent modeling. Each ILR component was modeled separately, which may 332 suboptimal because they cannot further consider the cross correlations among ILR coordinates. However, the study 333 demonstrated the relation of the raw data (sand, silt, and clay), and has confirmed that the currently used prediction models are 334 suitable. In our pervious study, we have used compositional kriging (CK) for the spatial prediction of soil PSFs (Wang and Shi, compared with only the regression model. However, the non-normal distribution of ILR data and its residuals, and more data 358 transformation and inverse transformation processes make models further difficult to interpreted and improve. 359 For the different SBPs, the three SBP-based data generated different distributions, but no pattern was apparent. This could 360 be explained by the angle of the biplot diagram, with three rays of soil PSF components clustered into three modes, and each 361 part dominating its cluster. Using the most abundant component of the compositional data as the first component of the 362 permutations was not considered the right choice because SBP2 produced the worst performance. Thus, we recommend that 363 the focus should be on data distribution. This study can provide a reference for the spatial simulation of soil PSFs combined 364 with ECs at the regional scale, and how to choose the balances of ILR transformed data. 365