A two-step merging strategy for incorporating multi-source precipitation products and gauge observations using machine learning classification and regression over China

Although many multi-source precipitation products (MSPs) with high spatio-temporal resolution have been extensively used in water cycle research, they are still subject to considerable uncertainties due to the spatial variability of terrain. Effective detection of precipitation occurrence is the key to enhancing precipitation accuracy. This study presents a two-step merging strategy to incorporate MSPs (GSMaP, IMERG, TMPA, PERSIANN-CDR, CMORPH, CHIRPS, and ERA-Interim) and rain gauges to improve both precipitation detection capability and precipitation intensity estimates during 2000-2017 over China. Multiple environment variables and the spatial autocorrelation between precipitation observations are selected as auxiliary variables in the merging process. Three machine learning (ML) classification and regression models, namely gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and random forest (RF), are adopted and compared. The strategy first employs classification models to identify wet and dry days in the warm and cold seasons, and then applies regression models to predict precipitation amounts on wet days. The results are also compared with those of traditional methods, including multiple linear regression (MLR), ML regression models, and gauge-based Kriging interpolation. A total of 1680 (70%) rain gauges are randomly chosen for model training and 692 (30%) for performance evaluation. The results show that: (1) The multi-source merged precipitation products (MSMPs) perform better than the original MSPs in detecting precipitation occurrence under different intensities, followed by Kriging. The average Heidke Skill Score (HSS) of MSPs, Kriging, and MSMPs is 0.30-0.69, 0.71, and 0.79-0.80, respectively. (2) The proposed method significantly alleviates the bias and deviation of the original MSPs in both time and space. The MSMPs correlate strongly with gauge observations, with a CC of 0.85. Moreover, the modified Kling-Gupta efficiency (KGE) improves by 17%-62% (MSMPs: 0.74-0.76) compared with MSPs (0.34-0.65). (3) The spatial autocorrelation factor (KP) is the most important variable in the models and contributes considerably to improving model accuracy. (4) The proposed method outperforms MLR and ML regression models, and the XGBoost algorithm is recommended for large-scale data merging owing to its high computational efficiency. This study provides a robust and reliable method to improve the performance of precipitation data with full consideration of multi-source information. The method could be applied globally to produce large-scale precipitation products wherever rain gauges are available.


It is therefore reasonable to combine them to maximize their advantages. Although several products already incorporate gauge observations to reduce bias, only a few gauges within China are used. Given the relatively high gauge density used in this study, this has little impact on the independence of gauges and the reliability of the results (Shen et al., 2013).

Environment variables
The environment variables used in this study include DEM, longitude, latitude, wind speed, relative humidity, soil moisture, cloud cover, and air temperature.

DEM is downloaded from the Shuttle Radar Topography Mission (SRTM) at a resolution of 90 m. Wind speed, relative humidity, soil moisture, and air temperature are obtained from the NASA Global Land Data Assimilation System Noah Land Surface Model (GLDAS_NOAH), with 3 h and 0.25° resolutions (Rodell et al., 2004). Cloud cover is collected from ERA-Interim, developed by the ECMWF, with 3-hourly and 0.25° resolutions. Although the Normalized Difference Vegetation Index (NDVI) is often used as a critical auxiliary variable to predict precipitation, it is susceptible to soil type and human activities.

NDVI is more suitable for application at the monthly or annual scale due to its temporal resolution (Ghorbanpour et al., 2021; Shen et al., 2021; Tan et al., 2021). In contrast, the response of air temperature and soil moisture to daily precipitation is better than that of NDVI, especially in deserts and bare land (Bhuiyan et al., 2018). In addition, the interactions between cloud properties and precipitation are equally important (Sharifi et al., 2019).

Data preprocessing
To maintain the temporal and spatial consistency of the data used in this study, all MSPs and environment variables at the sub-daily scale are aggregated to daily data. The spatial resolutions of DEM (90 m) and CHIRPS (0.05°) are upscaled to 0.1°, while TMPA, PERCDR, CMORPH, ERAI, cloud cover, and GLDAS_NOAH are downscaled to 0.1° using the bilinear interpolation method. The gauges are divided into two groups: 70% of the rain gauges are spatially and randomly selected as training and calibration samples, and the remaining 30% serve as validation samples. Due to the irregular distribution of rain gauges over China, random sampling is carried out for each river basin to ensure the spatial representativeness of the validation gauges, as sketched below.
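A minimal sketch of this basin-stratified split, assuming a gauge table with hypothetical `gauge_id` and `basin` columns (not names from the paper):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def split_gauges(gauges: pd.DataFrame, train_frac: float = 0.7):
    """Randomly split gauges into training/validation sets within each basin."""
    train_ids, valid_ids = [], []
    for _, basin in gauges.groupby("basin"):
        ids = basin["gauge_id"].to_numpy().copy()
        rng.shuffle(ids)
        n_train = int(round(train_frac * len(ids)))  # ~70% per basin
        train_ids.extend(ids[:n_train])
        valid_ids.extend(ids[n_train:])
    return train_ids, valid_ids
```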
Inspired by previous studies (Baez-Villanueva et al., 2020; Zhang et al., 2020), we consider a covariate describing the spatial autocorrelation between rain gauges. The semivariogram based on Ordinary Kriging is adopted to calculate the spatial autocorrelation factor, i.e., the Kriging-based prediction (KP). Compared with other prediction models, such as inverse distance weighting (IDW), the Kriging-based semivariogram considers not only the spatial relationship between the predicted and neighboring known points but also the statistical autocorrelation between the known points. Ordinary Kriging assumes the model

$\hat{z}(x_0) = \sum_{i=1}^{n} \lambda_i z(x_i)$,  (1)

where $\hat{z}(x_0)$ is the predicted value at the unknown point $x_0$, and $z(x_i)$ and $\lambda_i$ are the known value of the neighboring rain gauge $x_i$ and its weight, respectively. Unbiasedness and minimum estimation variance are the conditions for choosing the weights. The weights depend on the distances between the known points and the predicted position, and on the overall spatial arrangement of the known points.
Spatial autocorrelation must be quantified before the spatial arrangement can be used in the weights. The calculation of KP proceeds as follows: (1) Calculate the distances and semivariogram between the known points:

$\gamma(h) = \frac{1}{2N(h)} \sum \left[ z(x_i) - z(x_j) \right]^2$,  (2)

where $\gamma(h)$ is the semivariogram of the point pairs $(x_i, x_j)$ separated by the distance $h$, the sum runs over the $N(h)$ such pairs, and $z$ is the value at the known points.
(2) A theoretical model is used to fit the semivariogram against distance. The nugget, sill, and range can be obtained from the fitted semivariogram. Commonly used semivariogram models are the spherical, exponential, Gaussian, and linear models. The spherical model is selected in this study:

$\gamma(h) = \begin{cases} C_0 + C\left(\dfrac{3h}{2a} - \dfrac{h^3}{2a^3}\right), & 0 < h \le a \\ C_0 + C, & h > a \end{cases}$  (3)

where $\gamma(h)$ is the semivariogram, $h$ is the distance, and $C_0$, $C$, and $a$ are the nugget, sill, and range, respectively.
(3) Calculate the semivariogram between the unknown point and the known points, and form a matrix system to solve for the weights:

$\begin{cases} \sum_{j=1}^{n} \lambda_j \gamma(x_i, x_j) + \mu = \gamma(x_i, x_0), & i = 1, \dots, n \\ \sum_{j=1}^{n} \lambda_j = 1 \end{cases}$  (4)

where $\mu$ is the Lagrange parameter. (4) Predict the value at the unknown point using Eq. (1) with the weights obtained from Eq. (4).
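In practice, KP can be computed with an off-the-shelf kriging library. The sketch below uses `pykrige` with the spherical variogram of Eq. (3); the array names (`lon`, `lat`, `rain`, `lon0`, `lat0`) are placeholders for the daily training-gauge coordinates and values and the target locations, not variables from the paper:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

def kriging_predictor(lon, lat, rain, lon0, lat0):
    """Daily KP covariate: ordinary kriging of gauge rainfall at target points."""
    ok = OrdinaryKriging(
        lon, lat, rain,
        variogram_model="spherical",  # nugget, sill, and range fitted internally
    )
    # execute() solves the weight system of Eq. (4) and applies Eq. (1)
    kp, _variance = ok.execute("points", lon0, lat0)
    return np.asarray(kp)
```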

A two-step merging strategy
The two-step merging strategy for the MSPs (GSMaP, IMERG, TMPA, PERCDR, CMORPH, CHIRPS, and ERAI) and rain gauges is illustrated in Fig. 2. Although the RF method has been extensively employed in most previous studies, few studies have compared it with the GBDT and XGBoost models in precipitation merging. The environment variables, including soil moisture, cloud cover, relative humidity, air temperature, DEM, longitude, latitude, and spatial autocorrelation (KP), are selected as auxiliary variables (i.e., covariates) of the merging strategy. The values of the covariates and MSPs extracted at the gauge locations are taken as independent variables, while the gauge observations are taken as the dependent variable. Meanwhile, according to the annual distribution characteristics of precipitation, we group all input datasets into two seasons, a warm season (May to October) and a cold season (November to April), and the models are trained and evaluated in each season.
The two-step merging strategy explored in this study can be generally described in two stages (Fig. 2). First, the gauge observations are classified into wet and dry days to serve as the benchmark for the classification models. Considering that the measurement accuracy of China's standard rain gauges is 0.1 mm, days with daily precipitation less than 0.1 mm are set as dry days, and days with 0.1 mm or more as wet days. The RF, GBDT, and XGBoost classification models are adopted to improve the precipitation identification ability of the MSPs based on the classified gauge observations. Then, the regression models are employed to predict the precipitation amounts on wet days. The final multi-source merged precipitation products (MSMPs) combine the dry days identified by the classification models with the precipitation amounts predicted for the wet days.
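The following sketch condenses the two steps with XGBoost (the same logic applies to RF and GBDT); `X_train`, `y_train`, and `X_new` are hypothetical arrays holding the covariates (MSPs, KP, environment variables) and gauge precipitation:

```python
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

def two_step_fit_predict(X_train, y_train, X_new):
    """Step 1: classify wet/dry days; step 2: regress amounts on wet days."""
    wet = y_train >= 0.1                            # wet/dry benchmark (0.1 mm)
    clf = XGBClassifier().fit(X_train, wet.astype(int))       # occurrence model
    reg = XGBRegressor().fit(X_train[wet], y_train[wet])      # intensity model

    pred = np.zeros(len(X_new))                     # dry days stay at zero
    is_wet = clf.predict(X_new).astype(bool)
    if is_wet.any():
        pred[is_wet] = reg.predict(X_new[is_wet])
    return pred
```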

RF
The RF model was proposed by Breiman (2001) and is widely applied to regression, classification, and other tasks (Rodriguez-Galiano et al., 2012; Nguyen et al., 2021). The general structure of RF is shown in Fig. 3. RF is an ensemble learning algorithm composed of multiple decision trees and generally outperforms a single tree. For regression problems, the model returns predictions by averaging all individual decision trees. For classification problems, each tree in the forest classifies a sample separately, and the output of RF is the majority vote of the classification trees (Ho, 1998).
The Bootstrap Aggregation (i.e., Bagging) technique is applied by the RF training algorithm for the tree learners and is designed to improve the accuracy and stability of ML algorithms in classification and regression. The Bagging algorithm uses the out-of-bag (OOB) error to measure the prediction error of RF. It creates two independent datasets. One dataset, the Bootstrap sample (approximately two-thirds of all samples), is selected as "in-the-bag" data through sampling with replacement, while the remaining out-of-bag dataset (one-third) that is not selected during the sampling process is used to calculate the model's OOB error (Breiman, 2001). The advantages of RF can be summarized in four points: (1) processing high-dimensional data (a mass of features) without dimensionality reduction or feature selection; (2) measuring the importance of features and how they interact with each other; (3) avoiding overfitting and being easy to implement; (4) balancing errors for asymmetric datasets, which is critical in the cold season when wet and dry days are unevenly distributed. Several important parameters of RF are the number of decision trees (n_estimators), the maximum depth of each decision tree (max_depth), and the minimum number of samples required to split an internal node (min_samples_split). The optimal parameters vary with the characteristics of the training samples and therefore differ between the warm and cold seasons.
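An illustrative way to tune the RF parameters named above is a grid search scored by Accuracy for the classification step; the grids are examples, not the values used in the study, and `X_warm`/`y_warm` are hypothetical seasonal training arrays:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(
    RandomForestClassifier(oob_score=True, n_jobs=-1),
    param_grid={
        "n_estimators": [200, 500, 800],    # number of trees
        "max_depth": [10, 20, None],        # depth of each tree
        "min_samples_split": [2, 5, 10],    # min samples to split a node
    },
    scoring="accuracy",  # Accuracy is the training criterion for classifiers
    cv=5,
)
# rf_search.fit(X_warm, y_warm)  # tuned separately for warm and cold seasons
```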

GBDT
The GBDT is an iterative decision tree model created by Breiman (1997) and subsequently developed by Friedman (2002); it is also called the multiple additive regression tree (MART) (Fig. 4). The additive algorithm is used for classification or regression to continuously reduce the residuals generated during training. GBDT uses the forward stagewise algorithm and selects the classification and regression tree (CART) learner as the weak base learner. GBDT generates numerous weak learners through multiple iterations, and each learner is trained on the residual of the previous learner. It finally integrates the multiple weak learners into a single strong learner through a weighted summation of the trees.

The main difference between RF and GBDT is that RF can be trained in parallel to reduce variance, while GBDT reduces bias by fitting the residuals of the previous trees. Due to the strong dependence between weak learners, GBDT is difficult to parallelize. Generally speaking, GBDT has superior generalization ability and robustness: it is less affected by the training sample size and can deal with various data flexibly, including outliers and irrelevant features. Moreover, the prediction accuracy of GBDT is high even with relatively little parameter-tuning time. The main parameters of GBDT include the number of boosting stages to perform (n_estimators), the learning rate (learning_rate), which shrinks the contribution of each tree, and the maximum depth of the trees (max_depth). The n_estimators and learning rate are highly correlated with the performance of the model.
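A brief sketch of these GBDT parameters (values are illustrative only); n_estimators and learning_rate trade off against each other, so a smaller learning rate typically needs more boosting stages:

```python
from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    n_estimators=500,    # number of boosting stages
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=4,         # depth of the CART base learners
)
```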

XGBoost
The XGBoost model was proposed by Chen and Guestrin (2016) based on the structure of GBDT. XGBoost also combines multiple weak learners into a strong one, and the base learner in XGBoost can be either CART or a linear classifier. XGBoost possesses the strengths of GBDT and has several additional improvements. First, GBDT only uses first-order derivative information in optimization, while XGBoost performs a second-order Taylor expansion of the cost function to obtain the first-order and second-order derivatives, thus acquiring a more accurate loss function. Second, XGBoost introduces a regularization term into the cost function to effectively control the complexity of the model; from the perspective of the bias-variance tradeoff, this reduces the variance of the model, making the learned model simpler and preventing over-fitting. Third, XGBoost allows users to define custom optimization goals and evaluation criteria, increasing its flexibility. Moreover, XGBoost implements parallel processing when enumerating candidate split nodes, significantly improving its computational efficiency.
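An illustrative configuration exposing the XGBoost features noted above (parameter values are examples, not tuned values from the study):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    reg_lambda=1.0,   # L2 regularization term on leaf weights
    reg_alpha=0.0,    # L1 regularization term
    n_jobs=-1,        # parallel split finding
    # objective=...,  # custom objectives and metrics are also supported
)
```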

MLR
The MLR is the earliest type of regression algorithm used extensively in many fields, assuming a stable linear relationship between a dependent variable and multiple independent variables. Compared with nonlinear models, MLR is easier to fit, and the statistical properties of each explanatory variable are more intuitive. MLR is usually fitted using the ordinary least squares method, which minimizes the sum of squared residuals between the model predictions and the sample observations. The overall MLR model is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$,

where $n$ is the number of explanatory variables, $Y$ is the dependent variable predicted by $X_1, X_2, \dots, X_n$, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \dots, \beta_n$ are the regression coefficients.
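A toy ordinary-least-squares fit of the MLR model (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)                 # toy explanatory variables X1..X3
y = 1.5 + X @ np.array([2.0, -1.0, 0.5])   # toy linear response
mlr = LinearRegression().fit(X, y)
print(mlr.intercept_, mlr.coef_)           # recovers beta_0 and beta_1..beta_3
```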

Performance evaluation and comparison
In this study, the performance and accuracy of all products are evaluated using the randomly selected independent daily rain gauges from 2000 to 2017. The evaluation metrics comprise categorical and statistical metrics. The categorical metrics focus on the ability of the products to capture precipitation events and include the probability of detection (POD), false alarm ratio (FAR), critical success index (CSI), precision, frequency bias (FB), Heidke Skill Score (HSS), and classification accuracy (Accuracy). The POD, also called the hit rate, represents the probability that precipitation events are correctly detected. FAR and precision describe the ratios of falsely and correctly detected events among all detected precipitation events, respectively; the sum of FAR and precision is 1. The CSI incorporates POD and FAR and demonstrates the overall ability of precipitation detection. The FB is the ratio of the number of predicted precipitation events to the number of observed events and shows how balanced the products are in detecting precipitation events. FB < 1 indicates that precipitation events are underestimated, and FB > 1 that they are overestimated. FB equal to 1 means that the number of missed events equals the number of false alarms. HSS compares the predicted performance with random chance; a negative HSS indicates that random chance performs better than the model. The range of HSS is $-\infty$ to 1, and the perfect value is 1.
The Accuracy gives the proportion of total days that are correctly classified as wet or dry. It should be emphasized that this study takes Accuracy as the evaluation metric describing the accuracy of the ML classification models (RF, GBDT, and XGBoost) during training, thereby determining the optimal parameters of the models.
The categorical metrics are computed from the contingency counts as

$\mathrm{POD} = \frac{H}{H+M}$, $\mathrm{FAR} = \frac{F}{H+F}$, $\mathrm{precision} = \frac{H}{H+F}$, $\mathrm{CSI} = \frac{H}{H+M+F}$, $\mathrm{FB} = \frac{H+F}{H+M}$, $\mathrm{Accuracy} = \frac{H+N}{H+M+F+N}$, $\mathrm{HSS} = \frac{2(HN - MF)}{(H+M)(M+N) + (H+F)(F+N)}$,

where H is the total number of precipitation events simultaneously observed and predicted, M is the total number of precipitation events observed but not predicted, F is the total number of precipitation events predicted but not observed, and N is the total number of no-precipitation events. The optimal value of POD, precision, CSI, Accuracy, and FB is 1, while that of FAR is 0.
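A direct transcription of these scores from the contingency counts defined above:

```python
def categorical_metrics(H, M, F, N):
    """POD, FAR, precision, CSI, FB, Accuracy, HSS from contingency counts."""
    return {
        "POD": H / (H + M),
        "FAR": F / (H + F),
        "Precision": H / (H + F),          # = 1 - FAR
        "CSI": H / (H + M + F),
        "FB": (H + F) / (H + M),
        "Accuracy": (H + N) / (H + M + F + N),
        "HSS": 2 * (H * N - M * F)
               / ((H + M) * (M + N) + (H + F) * (F + N)),
    }
```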
The statistical metrics include the Pearson correlation coefficient (CC), root mean square error (RMSE), and the modified Kling-Gupta efficiency (KGE) with its components ($\beta$ and $\gamma$). The CC measures the strength of the correlation between the model-predicted and observed values. The RMSE assesses the error between predicted and observed values. The KGE, combining the CC, bias ratio ($\beta$), and variability ratio ($\gamma$), reflects the overall goodness of fit between model predictions and observations. $\beta > 1$ indicates that the precipitation amount is overestimated and vice versa. The formulas for these metrics are

$\mathrm{CC} = \frac{\sum_{i=1}^{N}(P_{o,i}-\mu_o)(P_{m,i}-\mu_m)}{\sqrt{\sum_{i=1}^{N}(P_{o,i}-\mu_o)^2}\,\sqrt{\sum_{i=1}^{N}(P_{m,i}-\mu_m)^2}}$,

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(P_{m,i}-P_{o,i})^2}$,

$\mathrm{KGE} = 1 - \sqrt{(\mathrm{CC}-1)^2 + (\beta-1)^2 + (\gamma-1)^2}$, with $\beta = \frac{\mu_m}{\mu_o}$ and $\gamma = \frac{SD_m/\mu_m}{SD_o/\mu_o}$,

where $P_o$ and $P_m$ are the gauge-observed and predicted precipitation values, respectively, and $N$ is the total number of samples. $\mu_o$ and $\mu_m$ are the mean values of the observed and predicted precipitation, and $SD_o$ and $SD_m$ are the corresponding standard deviations. The optimal value for CC, KGE, $\beta$, and $\gamma$ is 1, while for RMSE it is 0.

Performance assessment for classification results
The classification accuracies (Accuracy) of the different ML models for wet/dry days are comparable. The Accuracy of all three models is higher than 91% over the whole period: 91.9%, 91.8%, and 91.8% for RF, GBDT, and XGBoost, respectively. The Accuracy in the cold season is better than that in the warm season.
There is no significant difference among the three classification algorithms; the main reason is that the input variables used in this study are sufficient in variety and quantity, although the efficiency and ease of use of the algorithms differ. To better understand the performance of the different techniques, the multi-source merged precipitation products (MSMPs) are compared with the original MSPs and Kriging using the categorical metrics (Fig. 5). The CSI of the MSMPs reaches 0.76, much higher than that of the original MSPs (0.3-0.67) and the best-performing Kriging (0.69) (Fig. 5c). In terms of precision (Fig. 5d), the MSMPs achieve a significant improvement: the precision increases from 0.47-0.71 (MSPs) to 0.87-0.88 (MSMPs). For FB (Fig. 5e), the MSPs and Kriging deviate from 1, among which PERCDR has the worst value (1.83) and TMPA the best (0.92). Although ERAI achieves a high POD, its FB is 1.74, indicating that ERAI seriously overestimates wet days and misclassifies many precipitation events. Fortunately, the MSMPs strike a good balance between hit and false alarm rates; their FB is close to 1: 0.96 for GBDT, 1 for XGBoost, and 0.98 for RF. In terms of HSS (Fig. 5f), except for Kriging (0.71) and GSMaP (0.69), the HSS of the MSPs is lower than 0.5 (0.3-0.49). In contrast, the MSMPs (0.79-0.8) improve by 14% to 163%.

The general performance of all products (including MSPs, Kriging, and MSMPs) in the warm season is better than that in the cold season (Fig. 5). However, the performance difference of the MSMPs between the warm and cold seasons is smaller than that of the MSPs, demonstrating that the ability of the MSMPs is more balanced throughout the year. Moreover, the variation of the metrics of the original MSPs is considerable in the cold season, particularly for FAR and precision, as shown by the boxplots of FAR (Fig. 5b) and precision (Fig. 5d).

Performance assessment for regression results
The final MSMPs are predicted by the regression models (GBDT, XGBoost, and RF) based on the classification results. The regression results, GBDT-GBDT, XGBoost-XGBoost, and RF-RF, are named GBDT2, XGB2, and RF2 for simplicity. We also employ five statistical metrics (CC, RMSE, and KGE with its components β and γ) to compare the original MSPs and Kriging with GBDT2, XGB2, and RF2 based on daily observations. According to the comparison results (Fig. 7, Table 4), the MSMPs outperform the original MSPs to varying degrees, followed by Kriging. The MSMPs correlate strongly with the gauge observations in the warm season (CC: 0.83-0.84), the cold season (CC: 0.9), and the whole period (CC: 0.85) (Fig. 7a), which is significantly better than the MSPs (warm: 0.45-0.76; cold: 0.45-0.72; whole: 0.47-0.76). For RMSE (Fig. 7b), the values in the warm season are higher than those in the cold season. This is because precipitation is mainly concentrated in the warm season, and higher precipitation amounts often lead to larger RMSE. The RMSE of the MSMPs decreases by 25%-53% compared with the original MSPs. The spatial distributions of the metrics are also compared, as a spatial comparison among the products is more representative and concise. The RMSE gradually increases from north to south, which is consistent with the spatial pattern of precipitation (Fig. 8a).

Variable importance of ML models
The variable importance can quantitatively explain the contribution of each variable to improving model accuracy and identify the crucial input variables. The permutation feature importance is used to calculate the variable importance values of the models (see the sketch at the end of this subsection). The basic idea of this method is to randomly shuffle the values of a specific variable while keeping the other variables unchanged and to compute the accuracy difference from the original model (the evaluation metric is Accuracy for the classification models and the mean squared error for the regression models). As shown in Fig. 9, the importance of the variables and their ranks differ among GBDT, XGBoost, and RF, which is related to the inherent structure of each model. This phenomenon also exists between the classification and regression models. Nonetheless, KP is always the most important variable in every model, proving that the Kriging-based predictor, which considers the spatial autocorrelation between rain gauges, is very helpful for improving model efficiency. For all models, the top three variables in importance are KP, GSMaP, and IMERG. CMORPH, PERCDR, ERAI, and temperature are considered the next most significant. The importance of ERAI and temperature in the regression models is more pronounced than in the classification models. Additionally, TMPA, longitude, latitude, DEM, cloud cover, and relative humidity exhibit relatively low influence on the precipitation merging. The impacts of CHIRPS, soil moisture, and wind speed on the prediction results are negligible.
However, this does not mean that these predictors are unimportant for precipitation across the whole region. The slight importance of the latter variables may be affected by data quality and their degree of correlation with precipitation. For example, CHIRPS is the worst-performing product among the original MSPs. Overall, it is necessary to employ multiple covariates in the classification and regression models, since complex precipitation processes cannot be thoroughly described by a single variable.
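A minimal sketch of the permutation procedure with scikit-learn; `fitted_model`, `X_valid`, `y_valid`, and `feature_names` are hypothetical placeholders for a trained model and the held-out data:

```python
from sklearn.inspection import permutation_importance

def rank_features(fitted_model, X_valid, y_valid, feature_names,
                  scoring="neg_mean_squared_error"):
    """Shuffle each variable in turn and record the resulting skill drop."""
    result = permutation_importance(
        fitted_model, X_valid, y_valid,
        scoring=scoring,   # use "accuracy" for the classification models
        n_repeats=10,      # each variable shuffled 10 times
        random_state=0,
    )
    return sorted(zip(feature_names, result.importances_mean),
                  key=lambda pair: -pair[1])
```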

Comparison of prediction accuracy of various merging approaches
From the aspect of the merging process, different models and training samples can affect the accuracy of the integrated dataset. Therefore, three additional merging scenarios are considered for quantitative comparison with the proposed strategy, to highlight the impact of the seasonal division of samples and the algorithm selection on the fusion results. Fig. 10 gives a brief overview of the four scenarios and their corresponding merged precipitation products. Scenario 1 is the method adopted in this study; all merged products are evaluated against gauge observations. The performance of scenario 1 is clearly better than that of the other scenarios. For scenario 2, although the statistical metrics (CC and KGE) are only slightly worse than those of scenario 1, the categorical metrics (CSI, FB, and HSS) are considerably weakened. In the whole period (Fig. 11a), the HSS is between 0.66-0.69 for scenario 2, much lower than the 0.79-0.8 of scenario 1. Moreover, the FB of scenario 2 is larger than 1.38 (Fig. 11a), indicating that a number of precipitation events are overestimated. A similar phenomenon also occurs in the warm and cold seasons (Fig. 11b, c). Meanwhile, the MLR performs worse than the three ML models. The results of scenario 2 demonstrate that relying only on regression models to merge precipitation can describe precipitation intensity but cannot capture precipitation occurrence well. For scenario 3, the overall performance is superior to scenario 2 but inferior to scenario 1. Compared with scenario 1, the accuracy deterioration of scenario 3 is more obvious in the cold season; the CSI values (Fig. 11c) for scenario 1 and scenario 3 range from 0.73-0.74 and 0.68-0.70, respectively. Scenario 3 suggests that seasonal precipitation fusion based on precipitation characteristics can effectively balance the performance differences within a year. Scenario 4 shows the worst performance regardless of season, with poor CSI, FB, and HSS. Especially for the MLR-ER dataset, the accuracy is even worse than that of GSMaP and Kriging. This is because it is difficult for MLR to describe the complex relationships between precipitation and the other variables. In general, the four scenarios can be ranked by prediction accuracy from best to worst as: scenario 1 > scenario 3 > scenario 2 > scenario 4. The approach employed in this study (i.e., scenario 1) proves to be more reliable and robust than the other traditional strategies.