Compositional balance should be considered in soil particle-size fractions 1 mapping using hybrid interpolators

Digital soil mapping of soil particle-size fractions (PSFs) using log-ratio methods has been widely used. As a hybrid 10 interpolator, regression kriging (RK) is an alternative way to improve prediction accuracy. However, there is still a lack of 11 systematic comparison and recommendation when RK was applied for compositional data. Whether performance based on 12 different balances of isometric log-ratio (ILR) transformation is robust. Here, we systematically compared the generalized 13 linear model (GLM), random forest (RF), and their hybrid pattern (RK) using different balances of ILR transformed data of 14 soil PSFs with 29 environmental covariables for prediction of soil PSFs on the upper reaches of the Heihe River Basin. The 15 results showed that RF had better performance with more accurate predictions, but GLM had a more unbiased prediction. For 16 the hybrid interpolators, RK was recommended because it widened data ranges of the prediction results, and modified bias and 17 accuracy for most models, especially for RF. The drawback, however, existed due to the data distributions and model 18 algorithms. Moreover, prediction maps generated from RK demonstrated more details of soil sampling points. Three ILR 19 transformed data based on sequential binary partitions (SBP) made different distributions, and it is not recommended to use 20 the most abundant component of compositions as the first component of permutations. This study can reference spatial 21 simulation of soil PSFs combined with environmental covariables and transformed data at a regional scale. 22


Introduction 25
Recently, spatial interpolation of soil particle-size fractions (PSFs) has become a focus in soil science. More accurate predicted 26 soil PSFs could contribute to a better understanding of hydrological, physical, and environmental processes (Delbari et al., 27 2011;Ließ et al., 2012;McBratney et al., 2002). 28 The characteristic of compositional data makes soil PSFs were more impressive than other soil properties. Soil PSFs usually 29 express three components of discrete data -sand, silt, and clay, and carry only relevant percentage information. Soil texture is 30 classified as soil PSFs, which can demonstrate on the ternary diagram. This closure system of the ternary diagram is not 31 https://doi.org/10.5194/hess-2020-384 Preprint. Discussion started: 31 August 2020 c Author(s) 2020. CC BY 4.0 License.
Euclidean space. Instead, it is Aitchison space (so-called the simplex) (Aitchison, 1986). Due to the "spurious correlations" 1 (Pawlowsky-Glahn, 1984), traditional statistical methods based on the Euclidean geometry may make mistakes when dealing 2 with soil PSFs data directly (Filzmoser et al., 2009). The constant sum, nonnegative, and unbiased are the key to its spatial 3 interpolation (Walvoort and de Gruijter, 2001). Data transformation is crucial importance for compositional data to transform 4 it from the simplex to the real space. Log ratio transformations play a significant role in compositional data analysis, including 5 additive log-ratio (ALR), centered log-ratio (CLR) (Aitchison, 1986), and isometric log-ratio (ILR) (Egozcue et al., 2003). 6 Currently, though these three log-ratio methods have been widely applied to transform soil PSFs data, different study area 7 scales and what model use should consider when modeling. For local-scale study areas, geostatistical models combined with 8 log-ratio transformed data can meet the requirements to map spatial patterns effectively in our previous study (Wang and Shi, 9 2017). As another perspective, functional compositions combined with the kriging method also applied for soil particle size 10 curves (PSC) (Menafoglio et al., 2014), which can fully develop the richness of information. It used complete and continuous 11 information rather than discrete information, and soil PSFs can be extracted from the predicted soil PSCs (Menafoglio et al., 12 2016a). Log-ratio transformations can also be combined with functional-compositional data for PSCs' stochastic simulation 13 (Menafoglio et al., 2016b, Talska et al., 2018. For middle-scale study areas, outliers may lead to the variogram's 14 overestimation and make prediction errors (Lark, 2000). Therefore, the spatial interpolation should take robust variogram 15 estimators into account to improve model performance (Lark, 2003). The previous study has already proved that applying 16 robust variogram estimators in log-ratio co-kriging had significant improvement in mapping performance (Wang and Shi, 17 2018). For the large-scale study area, geostatistical models limited by soil sampling points and increased spatial variability. 18 More and more studies have concentrated on mapping soil PSFs using different machine learning models, statistical models 19 and geostatistical models combined with ancillary data (so-called environmental covariates, EC) on a broad basin scale (Zhang 20 et al., 2020), national scale (Akpa et al., 2014) and global level (Hengl et al., 2017) using log-ratio transformed data. 21 Among these EC-combined models, linear, machine-learning, geostatistical models, and high accuracy surface modeling 22 (Yue et al., 2020) have been commonly used in middle-scale or large-scale studies. Linear models such as generalized linear 23 model (GLM) and multiple linear regression (MLR) have been used in soil PSFs prediction because of flexibility and 24 interpretability (Lane, 2002;Buchanan et al., 2012). Many of machine-learning models were applied for soil PSFs interpolation 25 and soil texture classification. For example, tree learners -such as random forest (RF), showed more advantages with abilities 26 to handle noisy datasets and generated more realistic maps (Zhang et al., 2020). Further, regression kriging (RK) can not only 27 combine environment covariables by its regression part but also improve model accuracy as a hybrid interpolator for some soil 28 properties, such as topsoil thickness and pH (Hengl et al., 2004). However, further performance comparison needs to be 29 conducted in mapping soil compositional data by linear models, machine-learning models, and these models combining with 30 RK (hybrid patterns). 31 The ILR method performed better in log-ratio methods than ALR and CLR both in theory and in practice (Filzmoser and 32 Hron, 2009;Wang and Shi, 2018;Zhang et al., 2020). ILR eliminates model collinearity and preserves advantageous properties 33 such as isometry, scale invariance, and sub-compositional coherence, which is based on orthonormal coordinate systems (so-spatial position ( ). The continuous variables included morphometry and hydrologic characteristics of topographic properties, 1 climatic and vegetation indices, and soil physical and chemical properties. The categorical variables include geomorphology 2 types, land use types, and vegetation types, transformed from vector to raster (1000 m). Due to the intricate patterns of 3 topography in the upper reaches of the HRB, variables of topographic properties dominated the environmental covariates. 4 SAGA GIS (Conrad et al., 2015) was applied for terrain analysis to derive topographic variables using 30 m DEM (ASTER 5 GDEM, http://www.gscloud.cn). The collinearity test was used to remove redundant variables, and these topographic 6 properties were then resampled to 1000 m. More details about environmental covariables can be found in the Data Availability 7 section. 8

Isometric log-ratio transformation and sequential binary partition 2
An orthonormal basis of ILR was chosen to project the compositions from (the simplex for the Aitchison geometry) to 3 where refers to the balance between two groups, 1 , 2 , . . . , is the parts of one group, and 1 , 2 , . . . , is the 8 parts of the other group. Therefore, the balances contain stepwise all the relevant information of compositions in two groups. 9 It also can be explained in a tabular form -for soil PSFs data (D = 3), all three choices of the balance of SBPs are shown in 10 for ILR were defined as Eq. (4), (5), (6). ILR transformation and its inverse are available in the R package "compositions" (K. 20

Groups
Step Sand Silt Clay r s Balance 2.4 Linear model, machine-learning model, and hybrid patterns 5

Generalized linear model 6
The generalized linear model (GLM) is an extended version of the linear model, which contains response variables with non-7 normal distributions (Nelder and Wedderburn, 1972). The link function is embedded into the GLM to ensure the classical linear 8 model assumptions. The scaled dependent variables and the independent variables can be connected using the link function 9 for the additive combination of model effects. The choice of link function depends on the distribution of response variables 10 (Venables and Dichmont, 2004). Gaussian distribution with an identity link function was applied in our study, which gives 11 consequences equivalent to multiple linear regression (Nickel et al., 2014). However, categorical variables can be directly 12 trained in the GLM without setting dummy variables. The Akaike's information criterion (AIC) was applied to choose the best 13 predictors and remove model multicollinearity using backward stepwise algorithm. 14

Random forest 15
Random forest (RF) is a non-parametric technique, which combines the bagging method with a selection of random variables 16 as an extended version of regression trees (RT) (Breiman, 1996(Breiman, , 2001. It can improve model prediction accuracy by producing 17 and aggregating multiple tree models. The principle of RF is to merge a group of "weak trees" together to generate a "powerful 18 forest." The bootstrap sampling method is applied for each tree, and each predictor was selected randomly from all model 19 predictors. The "out of bag" (OOB) data were applied to produce reliable estimates in an internal validation using a random 20 https://doi.org/10.5194/hess-2020-384 Preprint. Discussion started: 31 August 2020 c Author(s) 2020. CC BY 4.0 License. subset independent of the training tree data. Three parameters need to be tuned: the number of trees ( ) and minimum size 1 of terminal nodes ( ), and the number of variables randomly sampled as predictors for each tree ( ) (Liaw and  2 Wiener, 2001). The standard value of the parameter for is one-third of the total number of predictors, and 3 is 500 and 5, respectively. For regression, the mean square errors (MSEs) of predictions are estimated to train trees. 4 The variable importance of RF is produced from the OOB data using the "importance" function. The benefits of RFs are that 5 the ensembles of trees are used without pruning to ensure that the most significant variance can be expressed. Moreover, RF 6 can reduce model overfitting, and normalization is unnecessary due to the insensitive effects on the value range. The algorithms 7 of GLM and RF and the parameters adjustment of RF were available in the R package "caret" (Max Kuhn, 2018). 8

Regression kriging 9
Regression kriging (RK) is a hybrid interpolation technique that combines regression models (e.g., GLM and RF) with ordinary 10 kriging (OK) of residuals of regression models (Odeh et al., 1995). Mathematically, the RK method corresponds to two 11 interpolatorsthe regression part and the kriging part are operated separately (Goovaerts, 1999) . We used residuals to create a variogram (e.g., Gaussian, Spherical, or Exponential) for model based on the MSE 16 from cross-validation results. Firstly, the regression part in our study (GLM or RF) was used to predict soil PSFs; the residual 17 from the fitted model was then calculated by subtracting the regression part from the observations. Subsequently, the OK was 18 applied for the whole study area to interpolate the residuals. Finally, the regression prediction and the predicted residuals at 19 the same location were summed. The variograms of the RK method were generated automatically by using the 20 "autofitVariogram" function in the R package "automap" (Hiemstra et al., 2009). 21

Prediction method system and validation 22
The method system of spatial interpolation models for soil PSFs in our study was shown in Table 3. We systematically 23 compared 12 models -four interpolators, including GLM and RF combined with or without RK and three SBPs of ILR 24 transformation method. For the validation of model performance, the independent data set validation was used to evaluate 25 models' bias and accuracy. The sub-training sets (70 %) and the sub-testing sets (30 %) were randomly divided from data 26 independently, and this process was repeated 30 times. 27 The mean error (ME), the root mean square error (RMSE), and Aitchison distance (AD) were used to evaluate and compare 2 the prediction performance of models. ME and RMSE measure prediction bias and accuracy, respectively (Odeh et al., 1995). 3 AD is an overall indicator of compositional analysis, which describes the distance between two compositions. Generally, an 4 accurate, unbiased model will have all three symbols close to 0. The equations for ME, RMSE, and AD are defined as: 5 For soil PSFs data, Eq. (10), (11) and (12) can be simplified to three dimensions; the relationship between ratios and the 22 dominant roles of ILR transformed data are demonstrated from the covariance structure. All the statistical analyses, such as 23 the descriptive statistics of soil PSFs data, calculation and evaluation of indicators, and the spatial operation of prediction maps, 24 were performed on the R statistical program (R Development Core Team, 2019). The descriptive statistics of original (raw) and ILR transformed data, silt fraction dominant soil PSFs with more substantial 4 components than those of sand and clay. The distributions of sand and clay fractions were similar (Fig. 2a). ILR transformed 5 data based on three balances of SBP were revealed different distributions (Figs. 2b, 2c, and 2d). For example, two ILR (ILR1, 6 ILR2) components for SBP1 had symmetric distribution around zero value at the x-axis (Fig. 2b). In comparison, data 7 generated from SBP2 or SBP3 had to mirror-symmetric deliveries with left-skewed ILR1 of SBP2 and right-skewed ILR2 of 8 SBP3 (Figs. 2c and 2d). The comparison of means and medians demonstrated that back-transformed means of three ILR 9 transformed data were the same, and the mean of sand of ILR was closer to the median compared with soil PSF original data. 10 In contrast, the cases of component silt and clay were the opposite (Fig. 2e). 11 12 Figure. 2. Descriptive statistics of original soil PSF data and ILR transformed data using different balances of SBP. Not that 13 means of Sand_ILR, Silt_ILR, and Clay_ILR from different SBPs of ILR were back-transformed to the real space. 14

Covariance structure of ILR transformed data with different balances 15
The covariance analysis of the transformed data of soil PSFs data based on different SBPs showed that the variance VarILR_1 16 of SBP3 was maximum, followed by the values of VarILR_1 of SBP1 and SBP2 ( Table 4). The variance of the second 17 component of ILR (VarILR_2) was opposite to the rule of VarILR_1. The covariance (COV) and the corresponding correlation 18 coefficient (CC) followed the same pattern -SBP1 > SBP3 > SBP2. These values revealed the relationship between soil PSFs 19 components or ratios. The first equation of ILR ( 1 in Table 2) contained all the information on soil PSFs. The second one ( 2 20 in Table 2) included only two components. Therefore, the information on VarILR_1 was more abundant. Six values of VarILR_1 and VarILR_2 were not 0 (or not nearly 0), indicating that there was no constant (or almost the constant) in any two 1 ratios of soil PSF components. The value of COV of SBP3 was close to 0, showing the proportions of clay/sand and clay/silt 2 were approximately the same. The same results were generated from the corresponding correlation coefficient (CC). 3 The distribution of soil PSFs sampling data in the ternary diagram (the USDA texture triangle) showed that the main texture 8 class was silt loam (Fig. 3a). The biplot of soil samples demonstrated that the rays of three components, i.e., sand, silt, and 9 clay, were reasonably clustered at about 120 ° in three groups (Fig. 3b). 10 11 Figure. 3. The distribution in the USDA triangle (a) and biplot graph (b) of soil PSFs sampling. The red, smooth curve of these 12 soil samples in the USDA triangle was fitted by loess function. 13

Accuracy comparison of different models using ILR data 14
The first three rows of boxplots (Figs. 4 a, 4b, and 4c) demonstrated the bias of different models according to ME values. The 15 MEs of sand were closest to 0, followed by MEs of clay and silt. GLM was more unbiased than RF with lower ME values. 16 After combing with RK, there was an improvement for ME in most GLM and RF models (Figs. 4a, 4b, and 4c). For the 17 accuracy assessment, RMSEs of silt was higher than the other two components. GLMRK did not perform as well as expect for 18 showed that RF (or RFRK) performed better than GLM (or GLMRK) in both average RMSE values and uncertainties (Fig.  1   4g). Moreover, RFRK improved AD values for SBP2 and SBP3 methods. For the uncertainty assessment, RF generated lower 2 difficulties than GLM, and models combined with RK further reduced the uncertainties for most GLM and RF models. For 3 three balances of SBP methods, model performances were different. To better evaluate model performance using different SBP 4 balances, we graded each box from 1 to 3, and the final results were shown in the Supplementary Material table S1.1. It 5 demonstrated that SBP1 performed best with the lowest ME value among all models. For the accuracy comparison, the pattern 6 is not apparent, but it can be considered hierarchically. For GLM, SBP1 had better performance than the other two SBPs 7 methods, which also performed well when RK was added (GLMRK). For RF, SBP1 produced the best result. However, the 8 introduction of RK made SBP3 performed best among the three methods. Further, the RMSEs generated from RFRK using 9 SBP3 data had the best accuracy among all models in our study.

Spatial distribution of soil texture classes in the USDA triangles 3
The predicted soil textures plotted in Fig. 8 in the USDA texture triangles showed that most predicted soil textures fell within 4 the ranges of observed soil textures (Fig. 3a), and silt loam was dominant in the soil texture types for all the cases. GLM 5 produced more discrete distribution than RF, and the RK method expanded the effect of dispersion. For the trends of predicted 6 samples, silt components predicted from all models were over-estimated. The pattern fitting curves indicated that the prediction 7 results were closer to the values in the bottom right of the USDA soil texture trianglethan the soil PSFs observations. Curves 1 of GLMRK and RFRK were longer than GLM and RF, showing more extensive ranges of value in ternary. Compared with 2 GLMRK, RFRK produced more upward extension (Fig. 8j, k, l). It was clear that the clay fraction was over-estimated, and the 3 sand fraction was under-estimated.

Comparison of GLM, RF and their hybrid interpolators using ILR data 2
For the assessment of independent validation, RF revealed more accurate but more bias than GLM. RK improved the 3 performance of the bias for most models and the accuracy of RF. Odeh et al. (1995) have indicated that RK was superior to 4 the linear models like MLR in sand prediction. Scarpone et al. (2016) reported that RFRK outperformed RF when dealing with 5 soil thickness prediction as a hybrid interpolator. We proved that RK was also available for compositional data to improve 6 performance using ILR transformation in RF. In summary, GLM and RF had their advantages and disadvantages when 7 considering the trade-off between bias and accuracy. The difficulty of GLM is back-transformation; it needs to present results 8 on the original untransformed scale after analyzing on a transformed level, which may produce the unfortunate result between 9 them (Lane, 2002). In our study, we compared the means of ILR transformed data and the original data. We proved the 10 feasibility of the ILR transformation method, especially for meeting the requirements of compositional data. Still, the accuracy 11 of GLM needs to be improved; this may be because the transformed data did not follow a normal distribution. In addition, 12 although RF had an advantage on prediction accuracy, the limited interpretability of the consequences -a "black box" effect 13 -made it challenging to modify prediction bias because each tree from the model cannot be examined individually (Grimm et 14 al., 2008). ILR transformation before modeling increased the difficulty of interpretation for the predicted values on ILR-scale 15 and the residuals. Moreover, the back-transformation of the optimal estimate of log-ratio variables does not generate the 16 optimal estimation of compositions (Lark and Bishop, 2007), which also should be considered. 17

Comparison of three balances of ILR transformation method 18
The comparison of three balances of SBP showed that most indicators of ME and RMSE using SBP1 of ILR transformed data 19 performed better, which may be interpreted as the distributions of the ILR1 and ILR2 of SBP1 were more symmetric (Fig. 2b). 20 In contrast, the performance of SBP2 was worse than the other two because the ILR_1 component, including all the information 21 of soil PSFs, was left-skewed (Fig. 2c). This result was apparent, especially for GLM and GLMRK, because the normal 22 distribution of data is needed in the linear model (Lane, 2002). 23 The interpretation of the negligible difference among three balances of SBP was the biplot of soil PSFs sampling data (Fig.  24 3b), which revealed a triangular shape. In other words, three soil PSFs had a mixed pattern, and each component was dominated 25 demonstrated relatively poor performance among three SBPs data. Thus, we recommended using other parts that were not the 30 most abundant as the first component of permutations when the biplot diagram was uniform distribution with 120 ° (Fig. 3b).
fitted the biplots using a random sampling test (70 % soil sampling data randomly), and the distribution of these graphs (angle) 1 were almost the same (Fig. S3.1). Multiple data sets should be considered in further researches to verify if it was a general 2 feature of soil PSFs samples or produced from our data set. 3 Also, the weighting problem was not considered in this study, because the ILR method can be qualified as an unweighted 4 log-ratio transformation, giving all parts the same weight for both the definition of the total variance and the reduction of 5 dimension. It may enlarge the ratios generated from the rare parts and dominate the analysis (Greenacre and Lewi, 2009). The 6 pairwise log-ratio can set weights by their proportions when there is no additional knowledge about the component 7 measurement errors (Greenacre, 2019). Nevertheless, all three parts of soil PSF data dominate the biplot diagram without the 8 rare element and redundancy; thus, there are no shortcomings mentioned above. The accuracy assessments using pairwise log-9 ratio transformation need more researches in the future. 10

Conclusions 11
We evaluated and compared the performance of GLM, RF, and their hybrid pattern (i.e., GLMRK and RFRK) using different 12 ILR balances transformed data. The bias of GLM was lower than those of RF; however, the accuracy of GLM was relatively 13 lower. GLMs produced more discrete distributions and broader ranges of prediction value distributions in the USDA soil 14 texture triangles. In other words, GLM and RF generated different data sets -unbiased and inaccurate predictions for GLM 15 and biased and more accurate predictions for RF. 16 The hybrid patterns of GLM and RF, i.e., RK, were recommended, which produced relative higher prediction accuracy and 17 environmental correlation, showing more details about soil sampling points (hot spots and cold spots) than the regression part. 18 However, the non-normal distribution of ILR transformed data, and the "black box" effect of the RF algorithm were drawbacks 19 of GLMRK and RFRK. 20 Concerning different balances of SBP, three SBP-based data generated different distributions. Their prediction results had a 21 slight difference. The pattern was not obvious, which was because the angle of the biplot diagram -three rays of soil PSFs 22 components clustered into three modes, and each part dominated in its cluster. Using the most abundant component of 23 compositions as the first component of permutations was not the right choice because of the worst performance of SBP2. On 24 the contrary, we recommended using other parts that were not the most abundant as the first component of permutations when 25 the biplot diagram was uniform distribution with 120 °, like our study. For a general feature of soil PSFs compositional data, 26 multiple soil PSFs data sets should be considered and compared in the future. This study reference spatial simulation of soil 27 PSFs combined with environmental covariables at a regional scale and how to choose the ILR balances. 28 29 Data Availability. We did not use any new data and the data we used come from previously published sources. Soil particle-30 size fractions data is available through our previous studies Shi, 2017, 2018). Moreover, it also can be visited on 31 this website: http://data.tpdc.ac.cn/zh-hans/data/7f91d36d-8bbd-40d5-8eaf-7c035e742f40/ (Digital soil mapping dataset of 32 meteorological data can be accessed through http://data.cma.cn/ (last access: 4 July 2020). Environmental covariates data of 1 soil physical and chemical properties and categorical maps can be obtained through http://data.tpdc.ac.cn/zh-hans/ (last access: 2 4 July 2020), including saturated water content, field water holding capacity, wilt water content, saturated hydraulic 3 conductivity data (http://data.tpdc.ac.cn/zh-hans/data/e977f5e8-972b-42a5-bffe-cd0195f3b42b/, Digital soil mapping dataset 4 of hydrological parameters in the Heihe River Basin (2012); last access: 4 July 2020), and soil thickness data 5 (http://data.tpdc.ac.cn/zh-hans/data/fc84083e-8c66-4a42-b729-4f19334d0d67/, Digital soil mapping dataset of soil depth in 6 the Heihe River Basin (2012-2014); last access: 4 July 2020). DEM data set is provided by the Geospatial Data Cloud site, 7 Computer Network Information Center, Chinese Academy of Sciences. (http://www.gscloud.cn, last access: 4 July 2020). 8 9 Author contribution. Wenjiao Shi contributed to soil data sampling, oversaw the design of the entire project. Mo Zhang 10 performed the model analysis and wrote the manuscript. Both authors contributed to writing this paper and interpreting data.