Evaluation of 18 satellite- and model-based soil moisture products using in situ measurements from 826 sensors

Information about the spatiotemporal variability of soil moisture is critical for many purposes, including monitoring of hydrologic extremes, irrigation scheduling, and prediction of agricultural yields. We evaluated the temporal dynamics of 18 state-of-the-art (quasi-)global near-surface soil moisture products, including six based on satellite retrievals, six based on models without satellite data assimilation (referred to hereafter as “open-loop” models), and six based on models that assimilate satellite soil moisture or brightness temperature data. Seven of the products are introduced for the first time in this study: one multi-sensor merged satellite product called MeMo and six estimates from the HBV model with three precipitation inputs (ERA5, IMERG, and MSWEP) and with and without assimilation of SMAPL3E satellite retrievals, respectively. As reference, we used in situ soil moisture measurements between 2015 and 2019 at 5-cm depth from 826 sensors, located primarily in the USA and Europe. The 3-hourly Pearson correlation (R) was chosen as the primary performance metric. Application of the Soil Wetness Index (SWI) smoothing filter resulted in improved performance for all satellite products. The best-to-worst performance ranking of the four single-sensor satellite products was SMAPL3E SWI, SMOS SWI, AMSR2 SWI, and ASCAT SWI, with the L-band-based SMAPL3E SWI (median R of 0.72) outperforming the others at 50 % of the sites. Among the two multi-sensor satellite products (MeMo and ESA-CCI SWI), MeMo performed better on average (median R of 0.72 versus 0.67), mainly due to the inclusion of SMAPL3E SWI. The best-to-worst performance ranking of the six open-loop models was HBV-MSWEP, HBV-ERA5, ERA5-Land, HBV-IMERG, VIC-PGF, and GLDAS-Noah. This ranking largely reflects the quality of the precipitation forcing. HBV-MSWEP (median R of 0.78) performed best not just among the open-loop models but among all products.
The calibration of HBV improved the median R by +0.12 on average compared to random parameters, highlighting the importance of model calibration. The best-to-worst performance ranking of the six models with satellite data assimilation

model products, the satellite products (with the exception of ASCAT) often do not provide retrievals when the soil is frozen or snow-covered (Supplement Fig. S1). To keep the evaluation consistent (Gruber et al., 2020), we used ERA5 (Hersbach et al., 2020) to discard the estimates of all 18 products when the near-surface soil temperature of layer 1 (0-7 cm) was < 4 °C and/or the snow depth was > 1 mm.
To deepen the vertical support of the superficial satellite observations and suppress noise, we also evaluated 3-hourly versions of the satellite products processed using the SWI exponential smoothing filter (Wagner et al., 1999; Albergel et al., 2008). MeMo was not processed as it was derived from SWI-filtered products. The SWI filter is defined according to:

$$\mathrm{SWI}(t) = \frac{\sum_i \mathrm{SM}_{\mathrm{sat}}(t_i)\, e^{-(t - t_i)/T}}{\sum_i e^{-(t - t_i)/T}} \quad \text{for } t_i \le t,$$

where $\mathrm{SM}_{\mathrm{sat}}$ (units depend on the product) is the soil moisture retrieval at time $t_i$, $T$ (days) represents the time lag constant, and $t$ represents the 3-hourly time step. T was set to 5 days for all products, as the performance did not change markedly using different values, as also reported in previous studies (Albergel et al., 2008; Beck et al., 2009; Ford et al., 2014; Pablos et al., 2018). Following Pellarin et al. (2006), the SWI at time t was only calculated if ≥ 1 retrievals were available in the interval

The vertical support is physically consistent with in situ soil moisture measurements at 5-cm depth for most models. The average depth of the soil layer (i.e., half the depth of the lower boundary) is 2.5 cm for SMAPL4, 3.5 cm for ERA5 and ERA5-Land, 5 cm for GLEAM, 8.5 cm for HBV-ERA5, 6.6 cm for HBV-IMERG, 7.3 cm for HBV-MSWEP, and 15 cm for VIC-PGF (Table 1; Supplement Table S1). The soil layers of HBV may seem too deep, especially since they represent conceptual "buckets" that can be fully filled with water, in contrast to the soil layers of the other models which additionally consist of mineral and organic matter. However, the soil layer depths of HBV were calibrated (see Section 2.3) and are thus empirically consistent with in situ measurements at 5-cm depth.
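The SWI exponential filter described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not the processing code used in the study: it computes, for each output time, the exponentially weighted mean of all earlier retrievals with time lag constant T. The function name `swi_filter`, the argument names, and the availability rule (at least `min_recent` retrievals within the last T days) are assumptions made for the example, since the exact interval is not recoverable from the text.

```python
import numpy as np

def swi_filter(t_obs, sm_obs, t_out, T=5.0, min_recent=1):
    """Exponentially weighted mean of past retrievals (SWI-style filter).

    t_obs, t_out : times in days; sm_obs : retrievals at t_obs (NaN = missing).
    T            : time lag constant in days (5 days in the text).
    min_recent   : assumed availability rule, see lead-in (hypothetical).
    """
    t_obs = np.asarray(t_obs, dtype=float)
    sm_obs = np.asarray(sm_obs, dtype=float)
    valid = ~np.isnan(sm_obs)
    t_obs, sm_obs = t_obs[valid], sm_obs[valid]

    swi = np.full(len(t_out), np.nan)
    for j, t in enumerate(np.asarray(t_out, dtype=float)):
        past = t_obs <= t
        # Assumed rule: require retrievals in the window (t - T, t].
        if np.count_nonzero(past & (t_obs > t - T)) < min_recent:
            continue
        w = np.exp(-(t - t_obs[past]) / T)   # older retrievals get less weight
        swi[j] = np.sum(w * sm_obs[past]) / np.sum(w)
    return swi

# Example: two retrievals, filtered output on a 3-hourly grid (0.125 days).
grid = np.arange(0.0, 10.0, 0.125)
smoothed = swi_filter([0.0, 5.0], [0.20, 0.30], grid)
```

Because the weights decay with the age of each retrieval, the output follows recent retrievals closely while still carrying memory of older ones, which is what lets the filter mimic the slower dynamics of a deeper layer.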

2.2 Merged soil Moisture (MeMo) product
Merged soil Moisture (MeMo) is a new 3-hourly soil moisture product derived by merging the soil moisture anomalies of three single-sensor passive-microwave satellite products with SWI filter (AMSR2 SWI, SMAPL3E SWI, and SMOS SWI; Table 1). MeMo was produced for 2015-2019 (the period with data for all three products) as follows:
1. Three-hourly soil moisture time series of AMSR2 SWI, SMAPL3E SWI, SMOS SWI, the active-microwave satellite product ASCAT SWI, and the open-loop model HBV-MSWEP were normalized by subtracting the long-term means and dividing by the long-term standard deviations of the respective products (calculated for the period of overlap).
2. Three-hourly anomalies were calculated for the five products by subtracting their respective seasonal averages. The seasonal climatology was calculated by taking the multi-year mean for each day of the year, after which we applied a 30-day central moving mean to eliminate noise. The moving mean was only calculated if > 21 days with values were present in the 30-day window. Due to the large number of missing values in winter (Supplement Fig. S1), we were not able to compute the seasonality and, in turn, the anomalies in winter for some satellite products.
3. Time-invariant merging weights for AMSR2 SWI, SMAPL3E SWI, and SMOS SWI were calculated using extended triple collocation (McColl et al., 2014), a technique to estimate Pearson correlation coefficients (R) for independent products with respect to an unknown truth. The R values for the respective products were determined using the triplet consisting of the product in question in combination with ASCAT and HBV-MSWEP, which are independent from each other and from the passive products. The R values were only calculated if > 200 coincident anomalies were available. The weights were calculated by squaring the R values.
4. For each 3-hourly time step, we calculated the weighted mean of the available anomalies of AMSR2 SWI, SMAPL3E SWI, and SMOS SWI. If only one anomaly was available, this value was used and no averaging was performed. The climatology of SMAPL3E (the best-performing product in our evaluation) was added to the result to yield the MeMo soil moisture estimates.
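Steps 3 and 4 above can be illustrated with a short Python sketch. It is a simplified example under stated assumptions, not the MeMo production code: `etc_correlation` uses the covariance-based extended triple collocation estimator of McColl et al. (2014) and returns the positive root only, and `merge_anomalies` forms the R²-weighted mean of whichever anomalies are available at each time step. Both function names and the array layout are hypothetical.

```python
import numpy as np

def etc_correlation(x, y, z, min_n=200):
    """Extended triple collocation estimate of R between x and the unknown truth.

    x, y, z : anomaly series of three independent products (NaN = missing).
    Returns NaN unless more than min_n coincident values are available.
    """
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    ok = ~(np.isnan(x) | np.isnan(y) | np.isnan(z))
    if ok.sum() <= min_n:
        return np.nan
    q = np.cov(np.vstack([x[ok], y[ok], z[ok]]))       # 3 x 3 covariance matrix
    r2 = (q[0, 1] * q[0, 2]) / (q[0, 0] * q[1, 2])
    return float(np.sqrt(max(r2, 0.0)))                # positive root assumed

def merge_anomalies(anoms, r_values):
    """Weighted mean of available anomalies; weights are the squared R values.

    anoms    : array of shape (n_products, n_times), NaN where missing.
    r_values : one R value per product.
    """
    anoms = np.asarray(anoms, dtype=float)
    present = ~np.isnan(anoms)
    w = (np.asarray(r_values, dtype=float)[:, None] ** 2) * present
    num = np.sum(w * np.where(present, anoms, 0.0), axis=0)
    wsum = w.sum(axis=0)
    # Time steps without any anomaly stay NaN.
    return np.where(wsum > 0, num / np.where(wsum > 0, wsum, 1.0), np.nan)
```

With a single available anomaly the weighted mean reduces to that value, which matches the rule stated in step 4. Adding the SMAPL3E climatology back to the merged anomalies would then complete the procedure.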

2.3 HBV hydrological model
Six new 3-hourly soil moisture products were produced using the Hydrologiska Byråns Vattenbalansavdelning (HBV) conceptual hydrological model (Bergström, 1976, 1992) forced with three different precipitation datasets and with and without assimilation of SMAPL3E soil moisture estimates, respectively (Table 1). HBV was selected because of its low complexity, high agility, computational efficiency, and successful application in numerous studies spanning a wide range of climate and physiographic conditions (e.g., Steele-Dunne et al., 2008; Driessen et al., 2010; Beck et al., 2013; Vetter et al., 2015; Jódar et al., 2018). The model has one soil moisture store, two groundwater stores, and 12 free parameters. Among the 12 free parameters, 7 are relevant for simulating soil moisture as they pertain to the snow or soil routines, while 5 are irrelevant for this study as they pertain to runoff generation or deep percolation. The soil moisture store has two inputs (precipitation and snowmelt) and two outputs (evaporation and recharge). The model was run twice for 2010-2019; the first time to initialize the soil moisture store, and the second time to obtain the final outputs.
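To make the store structure and the two-pass run concrete, the following Python sketch implements a single soil moisture bucket with two inputs and two outputs. It is an illustration only: the flux equations are the textbook HBV soil-routine forms (recharge proportional to (SM/FC)^BETA, evaporation reduced below LP·FC), which is an assumption because this section does not spell out the equations actually used, and the parameter values and the function name `run_soil_store` are hypothetical.

```python
def run_soil_store(precip, snowmelt, pet, fc=100.0, beta=2.0, lp=0.7, sm0=0.0):
    """Single-bucket soil moisture store (mm) with textbook HBV-style fluxes.

    Inputs per time step: precipitation and snowmelt (mm).
    Outputs per time step: recharge and evaporation (mm).
    fc, beta, lp are illustrative values, not calibrated parameters.
    """
    sm, series = sm0, []
    for p, melt, e_pot in zip(precip, snowmelt, pet):
        inflow = p + melt
        recharge = inflow * (sm / fc) ** beta      # wetter soil passes on more water
        sm += inflow - recharge
        if sm > fc:                                # bucket full: excess leaves as recharge
            sm = fc
        evap = min(e_pot * min(sm / (lp * fc), 1.0), sm)
        sm -= evap
        series.append(sm)
    return series

# Two-pass run: the first pass only provides the initial state for the second.
n_steps = 2000
precip, melt, pet = [2.0] * n_steps, [0.0] * n_steps, [1.0] * n_steps
spin_up = run_soil_store(precip, melt, pet)
final = run_soil_store(precip, melt, pet, sm0=spin_up[-1])
```

Starting the second pass from the end state of the first removes the dependence on the arbitrary initial value `sm0=0.0`, which is the purpose of running the model twice.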
HBV requires time series of precipitation, potential evaporation, and air temperature as input. For precipitation, we used three

and ameliorate potential biases, the ERA5 air temperature data were matched on a monthly climatological basis using an additive (as opposed to multiplicative) approach to the comprehensive station-based WorldClim climatology (V2; 1-km resolution; Fick and Hijmans, 2017).
We calibrated the 7 relevant parameters of HBV using in situ soil moisture measurements from 177 independent sensors from the International Soil Moisture Network (ISMN) archive (Section 2.5; Supplement Fig. S2). These sensors did not have enough measurements during the evaluation period (March 31, 2015, to September 16, 2019) and thus were available for an independent calibration exercise. The parameter space was explored by generating N = 500 candidate parameter sets using Latin hypercube sampling (McKay et al., 1979), which splits the parameter space up into N equal intervals and generates parameter sets by sampling each interval once in a random manner. The model was subsequently run for all candidate parameter sets, after which we selected the parameter set with the best overall performance across the 177 sites (Supplement Table S1). As objective function, we used the median Pearson correlation coefficient (R) calculated between 3-hourly in situ and simulated soil moisture time series. To avoid giving one of the precipitation datasets an unfair advantage, we recalibrated the model for each of the three precipitation datasets (ERA5, IMERG, and MSWEP

towards the satellite observations. Nudging techniques are computationally efficient and easy to implement, and have therefore been used in several studies (e.g., Brocca et al., 2010b; Dharssi et al., 2011; Capecchi and Brocca, 2014; Laiolo et al., 2016; Cenci et al., 2016; Martens et al., 2016). For each grid-cell, the soil moisture state of the model was updated when a satellite observation was available according to:

$$\mathrm{SM}^{+}_{\mathrm{mod}}(t) = \mathrm{SM}^{-}_{\mathrm{mod}}(t) + k\,G\,\left[\mathrm{SM}^{\mathrm{sc}}_{\mathrm{sat}}(t) - \mathrm{SM}^{-}_{\mathrm{mod}}(t)\right],$$

where $\mathrm{SM}^{+}_{\mathrm{mod}}$ and $\mathrm{SM}^{-}_{\mathrm{mod}}$ (mm) are the updated and a priori soil moisture states of the model, respectively, $\mathrm{SM}^{\mathrm{sc}}_{\mathrm{sat}}$ (mm) are the rescaled satellite observations, and $t$ is the 3-hourly time step. The satellite observations were rescaled to the open-loop model space using cumulative distribution function (CDF) matching (Reichle and Koster, 2004).
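The rescaling and update steps can be sketched as follows. This is a hedged illustration rather than the implementation used for the HBV products: `cdf_match` is a simple empirical quantile mapping, and `nudge` assumes that the update moves the a priori state towards the rescaled observation by a fraction equal to the product of the nudging factor and the gain. Both function names, the argument names, and the default `gain` value are hypothetical.

```python
import numpy as np

def cdf_match(sat, mod_ref):
    """Rescale satellite values to the model distribution by quantile mapping.

    sat     : satellite retrievals (NaN = missing), any units.
    mod_ref : open-loop model values (mm) defining the target distribution.
    """
    sat = np.asarray(sat, dtype=float)
    ok = ~np.isnan(sat)
    # Empirical non-exceedance probability of each satellite value, in (0, 1].
    prob = np.searchsorted(np.sort(sat[ok]), sat[ok], side="right") / ok.sum()
    out = np.full(sat.shape, np.nan)
    out[ok] = np.nanquantile(np.asarray(mod_ref, dtype=float), prob)
    return out

def nudge(sm_prior, sm_sat_rescaled, k=0.1, gain=0.5):
    """Assumed update: prior + k * gain * (observation - prior), in mm.

    Returns the prior unchanged when no observation is available.
    """
    if np.isnan(sm_sat_rescaled):
        return sm_prior
    return sm_prior + k * gain * (sm_sat_rescaled - sm_prior)

# Example: rescale three retrievals, then update a model state of 25 mm.
rescaled = cdf_match([0.1, 0.3, 0.2], [10.0, 20.0, 30.0, 40.0])
updated = nudge(25.0, rescaled[0])
```

Quantile mapping preserves the rank order of the retrievals while giving them the range and units of the model, so the difference inside the update compares like with like.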
The nudging factor k (−) was set to 0.1 as this gave satisfactory results. The gain parameter G (−) determines the magnitude of the updates and ranges from 0 to 1. G is generally calculated based on the relative quality of the satellite retrievals and the open-loop model. Most previous studies used a spatially and temporally uniform G (e.g., Brocca et al., 2010b; Dharssi et al., 2011; Capecchi and Brocca, 2014; Laiolo et al., 2016; Cenci et al., 2016). Conversely, Martens et al. (2016) used the triple collocation technique (Scipal et al., 2008) to obtain spatially variable G values. Here we calculated G in a similar fashion according to:

where R sat and R mod (−) are Pearson correlation coefficients with respect to an unknown truth for SMAPL3E and HBV, respectively, calculated using extended triple collocation (Section 2.2). R sat was determined using 3-hourly anomalies of the triplet SMAPL3E, ASCAT SWI, and HBV-MSWEP (

(17), were available for evaluation (Fig. 1). The median record length was 3.0 years.

Evaluation approach
We evaluated the 18 near-surface soil moisture products (Table 1) for the 4.5-year long period from March 31, 2015 (the date on which SMAP data became available), to September 16, 2019 (the date on which we started processing the products). As performance metric, we used the Pearson correlation coefficient (R) calculated between 3-hourly soil moisture time series from the in situ sensors and the products, similar to numerous previous studies (e.g., Karthikeyan et al., 2017a; Al-Yaari et al., 2017; Kim et al., 2018). R measures how well the in situ and product time series correspond in terms of temporal variability, and thus evaluates the most important aspect of soil moisture time series for the majority of applications (Entekhabi et al., 2010; Gruber et al., 2020). It is insensitive to systematic differences in mean and variance, which can be substantial due to: (i) the use of different soil property maps as input to the retrieval algorithms and hydrological models (Teuling et al., 2009; Koster et al., 2009); and (ii) the inherent scale discrepancy between in situ point measurements and satellite footprints or model grid-cells (Miralles et al., 2010; Crow et al., 2012; Gruber et al., 2020).
Additionally, to quantify the performance of the products at different time scales, we calculated Pearson correlation coefficients for the low-frequency fluctuations (i.e., the slow variability at monthly and longer time scales; R lo) and the high-frequency fluctuations (i.e., the fast variability at 3-hourly to monthly time scales; R hi). The low-frequency fluctuations were isolated using a 30-day central moving mean, similar to previous studies (e.g., Albergel et al., 2009; Al-Yaari et al., 2014; Su et al., 2016). To ensure a fair evaluation, we discarded the estimates of all products when the near-surface soil temperature was < 4 °C and/or the snow depth was > 1 mm (both determined using ERA5; Hersbach et al., 2020). For the satellite products without SWI filter, we matched the instantaneous soil moisture retrievals with coincident 3-hourly in situ measurements to compute the R values. Since the evaluation was performed at a 3-hourly resolution, we downscaled the two products with a daily temporal resolution (VIC-PGF and GLEAM;

To derive insights into the reasons for the differences in performance, median R values were calculated separately for different

To determine LAI, we used the 1-km Copernicus LAI dataset derived from SPOT-VGT and PROBA-V data (V2; Baret et al., 2016; mean over 1999. To determine the topographic slope, we used the 90-m MERIT DEM (Yamazaki et al., 2017). To reduce the scale mismatch between point locations and satellite sensor footprints or model grid-cells, we upscaled the Köppen-Geiger, LAI, and topographic slope maps to 0.25° using majority, average, and average resampling, respectively.
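The three correlation metrics described at the start of this section (R, R lo, and R hi) can be sketched with pandas as follows. This is an illustrative example under assumptions, not the evaluation code: it expects both series on the same regular 3-hourly time index, takes the high-frequency component to be the residual after subtracting the 30-day central moving mean, and requires half a window of valid values for the moving mean. The function name `split_correlations` and the `min_periods` choice are hypothetical.

```python
import numpy as np
import pandas as pd

def split_correlations(insitu, product, steps_per_day=8, window_days=30):
    """Pearson R for the full, low-frequency, and high-frequency signals.

    insitu, product : pandas Series on the same regular 3-hourly index.
    """
    df = pd.concat({"obs": insitu, "est": product}, axis=1)
    df.loc[df.isna().any(axis=1), :] = np.nan          # keep coincident pairs only
    win = steps_per_day * window_days                  # 30 days of 3-hourly steps
    low = df.rolling(win, center=True, min_periods=win // 2).mean()
    high = df - low                                    # assumed definition of fast part
    return {
        "R": df["obs"].corr(df["est"]),
        "R_lo": low["obs"].corr(low["est"]),
        "R_hi": high["obs"].corr(high["est"]),
    }

# Example with synthetic data: a slow cycle plus noise, and a noisier estimate.
index = pd.date_range("2016-01-01", periods=8 * 200, freq=pd.Timedelta(hours=3))
rng = np.random.default_rng(1)
slow = np.sin(np.arange(index.size) / 300.0)
obs = pd.Series(slow + 0.3 * rng.standard_normal(index.size), index=index)
est = pd.Series(slow + 0.3 * rng.standard_normal(index.size), index=index)
scores = split_correlations(obs, est)
```

Because Pearson correlation is unaffected by linear rescaling, any product that differs from the reference only in mean and variance scores 1 on all three metrics, which is the insensitivity to systematic differences noted above.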

Results and discussion
3.1 How do the ascending and descending retrievals perform?
Microwave soil moisture retrievals from ascending and descending overpasses may exhibit performance differences due to diurnal variations in land surface conditions (Lei et al., 2015) and radio-frequency interference (RFI; Aksoy and Johnson, 2013). Table 2 presents R values for the instantaneous ascending and descending retrievals of the four single-sensor products (AMSR2, ASCAT, SMAPL3E, and SMOS;

(2015). The ascending and descending retrievals performed similarly for the passive microwave-based SMAPL3E and SMOS (Table 2). For the remainder of this analysis, we will use only descending retrievals of AMSR2. We did not discard the ascending retrievals of ASCAT as they helped to improve the performance of ASCAT SWI.

3.2 What is the impact of the Soil Wetness Index (SWI) smoothing filter?
The application of the SWI filter resulted in higher median R, R hi, and R lo values for all satellite products (Figs. 2a and 3; Table 1). The median R improvement was +0.12 for AMSR2, +0.10 for ASCAT, +0.07 for SMAPL3E, +0.17 for SMOS, and +0.11 for ESA-CCI (Fig. 2a). The improvements are probably mainly because the SWI filter reduces the impact of random errors and potential differences between ascending and descending overpasses (Su et al., 2015; Bogoslovskiy et al., 2015).
Additionally, since the SWI filter simulates the slower variability of soil moisture at deeper layers (Wagner et al., 1999; Albergel et al., 2008; Brocca et al., 2010a), it improves the consistency between the in situ measurements at 5-cm depth and the microwave signals, which often have a penetration depth of just 1-2 cm depending on the observation frequency and the land surface conditions (Long and Ulaby, 2015; Shellito et al., 2016a; Rondinelli et al., 2015; Lv et al., 2018). Our results suggest that
3.3 What is the relative performance of the single-sensor satellite products?
Among the four single-sensor products with SWI filter (AMSR2 SWI, ASCAT SWI, SMAPL3E SWI, and SMOS SWI; Table 1)

is likely attributable to the deeper ground penetration of L-band signals (Lv et al., 2018), the sensor's higher radiometric accuracy (Entekhabi et al., 2010), and the application of an RFI mitigation algorithm (Piepmeier et al., 2014). SMOS SWI is also an L-band product, while the AMSR2 SWI product used here was derived from X-band observations, which have a shallower penetration depth (Long and Ulaby, 2015). Both AMSR2 SWI and SMOS SWI are more vulnerable to RFI, which may have reduced their overall performance (Njoku et al., 2005; Oliva et al., 2012). The active microwave-based ASCAT SWI performed significantly better in terms of high-frequency than low-frequency fluctuations (Fig. 3), likely due to the presence of seasonal vegetation-related biases (Wagner et al., 2013). ASCAT SWI showed a relatively small spread in R hi values (Fig. 3b), although it showed the largest spread in R and R lo values not just among the single-sensor products but among all products (Figs. 2a and 3a). However, since the models also tend to exhibit lower R values in cold regions (Fig. 2b), it could also be that the in situ measurements are of lower quality or less representative of satellite footprints or model grid-cells, or that our procedure to screen for frozen or snow-covered soils is imperfect. AMSR2 and particularly AMSR2 SWI performed noticeably better in terms of R in arid climates (Figs. 1 and 2b), as reported in previous studies (Wu et al., 2016; Cho et al., 2017), and likely due to the availability of coincident Ka-band brightness temperature observations which are used as input to the LPRM retrieval algorithm (Parinussa et al., 2011). AMSR2 and SMOS (with and without SWI filter) showed markedly lower R values for sites with mean leaf area index > 2 m² m⁻² (Fig. 2c), confirming that their retrievals are affected by dense vegetation cover (Al-Yaari et al., 2014; Wu et al., 2016; Cui et al., 2018). Most satellite products performed worse in terms of R in areas of steep terrain (Fig. 2d), consistent with previous evaluations (Paulik et al., 2014; Karthikeyan et al., 2017a; Ma et al., 2019), and attributed to the confounding effects of relief on the upwelling microwave brightness temperature observed by the radiometer (Mialon et al., 2008; Pulvirenti et al., 2011; Guo et al., 2011).

The multi-sensor merged product MeMo (based on AMSR2 SWI, SMAPL3E SWI, and SMOS SWI) performed better than the four single-sensor products for all three metrics (R, R lo, and R hi; Figs. 2a and 3; Table 1). These results highlight the value of multi-sensor merging techniques, in line with prior studies that merged satellite retrievals (Gruber et al., 2017; Kim et al., 2018), model outputs (Guo et al., 2007; Liu and Xie, 2013; Cammalleri et al., 2015), and satellite retrievals with model outputs (Yilmaz et al., 2012; Anderson et al., 2012; Tobin et al., 2019; Vergopolan et al., 2020). However, MeMo performed only marginally better in terms of median R than the best-performing single-sensor product SMAPL3E SWI (which was incorporated in MeMo; Fig. 2a). The most likely reason for this is that triple collocation-based merging techniques rely on several assumptions (linearity, stationarity, error orthogonality, and zero cross-correlation) which are generally difficult to fully satisfy in practice, affecting the optimality of the merging procedure (Yilmaz and Crow, 2014; Gruber et al., 2016).
Additionally, MeMo performed better than the multi-sensor merged product ESA-CCI SWI (based on AMSR2, ASCAT, and SMOS) for all three metrics (Figs. 2a and 3). MeMo performed better in terms of R at 68 % of the sites, and performed particularly well across the central Rocky Mountains, although ESA-CCI SWI performed better in eastern Europe (Fig. 4). The

(v) the gauge- and reanalysis-based PGF (Sheffield et al., 2006), and (vi) the gauge- and satellite-based GPCP V1.3 Daily Analysis (Huffman et al., 2001). This order matches the overall performance ranking of precipitation datasets in a comprehensive evaluation over the conterminous USA carried out by Beck et al. (2019a). Furthermore, the performance of HBV-ERA5 did not depend on the terrain slope, while HBV-IMERG performed worse in steep terrain (Fig. 2d), which is also consistent with the evaluation of Beck et al. (2019a). HBV-IMERG performed worse for low-frequency than for high-frequency fluctuations (Fig. 3), which likely reflects the presence of seasonal biases in IMERG (Beck et al., 2017c; Wang and Yong, 2020). Overall, these results confirm that precipitation is by far the most important determinant of soil moisture simulation performance (Gottschalck et al.,

Among the three soil moisture products derived from ERA5 precipitation (ERA5, ERA5-Land, and HBV-ERA5), and among the three products forced with daily gauge-corrected precipitation (GLEAM, HBV-MSWEP+SMAPL3E, and SMAPL4; Table 1), the ones based on HBV performed better overall in terms of all three metrics (R, R lo, and R hi; Figs. 2a and 3). This demonstrates that soil moisture estimates from complex, data-intensive models (H-TESSEL underlying ERA5 and ERA5-Land, GLEAM, and the Catchment model underlying SMAPL4) are not necessarily more accurate than those from relatively simple, calibrated models (HBV).
This is in line with several previous multi-model evaluations focusing on soil moisture (e.g., Guswa et al., 2002;Cammalleri et al., 2015;Orth et al., 2015), the surface energy balance (e.g., Best et al., 2015), evaporation (e.g., McCabe et al., , runoff (e.g., Beck et al., 2017a), and river discharge (e.g., Gharari et al., 2020).

3.6 How do the models with satellite data assimilation perform?
The performance ranking of the models with satellite data assimilation in terms of median R (from best to worst) was HBV-MSWEP+SMAPL3E, HBV-ERA5+SMAPL3E, GLEAM, SMAPL4, HBV-IMERG+SMAPL3E, and ERA5 (Fig. 2a; Table 1).
The assimilation of SMAPL3E retrievals resulted in a substantial improvement in median R of +0.06 for HBV-IMERG, a minor improvement of +0.01 for HBV-ERA5, and no change for HBV-MSWEP (Fig. 2a). Improvements in R were obtained for 90 %, 65 %, and 56 % of the sites for HBV-IMERG, HBV-ERA5, and HBV-MSWEP, respectively. For HBV-IMERG, the greatest

The ERA5 reanalysis, which assimilates ASCAT soil moisture (Hersbach et al., 2020), obtained a lower overall performance (median R = 0.68) than the open-loop models ERA5-Land (median R = 0.72) and HBV-ERA5 (median R = 0.74), which were both forced with ERA5 precipitation (Fig. 2a). This suggests that assimilating satellite soil moisture estimates (ERA5) was

and SMOS brightness temperatures into an experimental version of the Integrated Forecast System (IFS) model underlying ERA5 did not improve the soil moisture simulations. They attributed this to the adverse impact of simultaneously assimilated screen-level temperature and relative humidity observations on the soil moisture estimates.
In line with our results for HBV-MSWEP+SMAPL3E, Kumar et al. (2014) did not obtain improved soil moisture estimates after the assimilation of ESA-CCI and AMSR-E retrievals into Noah forced with highly accurate NLDAS2 meteorological data for the conterminous USA. Conversely, several other studies obtained substantial performance improvements after data assimilation despite the use of high-quality precipitation forcings (Liu et al., 2011; Koster et al., 2018; Tian et al., 2019). We suspect that this discrepancy might reflect the lower performance of their open-loop models compared to ours. Using different

Overall, it appears that the benefits of data assimilation are greater for models that exhibit structural or parameterization deficiencies.

3.7 What is the impact of model calibration?
Among the models evaluated in this study, only HBV and the Catchment model underlying SMAPL4 have been calibrated, although only a single parameter out of more than 100 was calibrated for the Catchment model (Reichle et al., 2019b). HBV-ERA5, HBV-IMERG, and HBV-MSWEP with calibrated parameters obtained median R values of 0.74, 0.65, and 0.78, respectively (Fig. 2a), whereas the same three models with randomly generated (uncalibrated) parameters obtained mean median R values of 0.59, 0.53, and 0.62, respectively (standard deviations 0.17, 0.16, and 0.16, respectively; data not shown).
The mean improvements in median R obtained for HBV-ERA5, HBV-IMERG, and HBV-MSWEP after calibration (+0.15, +0.12, and +0.16, respectively) were significantly greater than the improvements obtained for the same three models after satellite data assimilation (+0.01, +0.06, and −0.00, respectively; Fig. 2a; Section 3.6), which suggests that model calibration results in more benefit overall than data assimilation. Additionally, model calibration benefits regions with both sparse and dense rain gauge networks, whereas data assimilation mainly benefits regions with sparse rain gauge networks (Section 3.6).
Conversely, only data assimilation is capable of ameliorating potential deficiencies in the meteorological forcing data (e.g., undetected precipitation).
Our calibration approach was relatively simple and yielded only a single spatially uniform parameter set (Section 2.3). Previous studies focusing on runoff have demonstrated the value of more sophisticated calibration approaches yielding ensembles of parameters that vary according to climate and landscape characteristics (Samaniego et al., 2010; Beck et al., 2016, accepted).
Whether these approaches have value for soil moisture estimation as well warrants further investigation. It should be noted, however, that many current models have rigid structures, insufficient free parameters, and/or a high computational cost which makes them less amenable to calibration (Mendoza et al., 2015). Moreover, the validity of calibrated parameters may be compromised when the model is subjected to climate conditions it has never experienced before (Knutti, 2008). Care should also be taken that calibration of one aspect of the model does not degrade another aspect and that we get "the right answers for the right reasons" (Kirchner, 2006).
3.8 How do the major product categories compare?
The median R ± interquartile range across all sites and products in each category was 0.53 ± 0.32 for the satellite soil moisture products without SWI filter, 0.66 ± 0.30 for the satellite soil moisture products with SWI filter including MeMo, 0.69 ± 0.25 for the open-loop models, and 0.72 ± 0.22 for the models with satellite data assimilation (Fig. 2a; Table 1). The satellite products thus provided the least reliable soil moisture estimates and exhibited the largest regional performance differences on average, whereas the models with satellite data assimilation provided the most reliable soil moisture estimates and exhibited the smallest regional performance differences on average. Our performance ranking of the major product categories is consistent with previous studies for the conterminous USA (Liu et al., 2011; Kumar et al., 2014; Fang et al., 2016; Dong et al., 2020), Europe (Naz et al., 2019), and the globe (Albergel et al., 2012; Tian et al., 2019; Dong et al., 2019). It should be kept in mind, however, that these studies, including the present one, used in situ soil moisture measurements from regions with dense rain gauge networks, and hence likely overestimate model performance (Dong et al., 2019).

The large spread in performance across the satellite products reflects the large number of factors that affect soil moisture retrieval, including, among others, vegetation cover, surface roughness, soil composition, diurnal variations in land surface conditions, and RFI (Zhang and Zhou, 2016; Karthikeyan et al., 2017b). The spread in performance across the open-loop models is lower as it depends primarily on the precipitation data quality, which, in turn, depends mostly on a combination of gauge network density and prevailing precipitation type (convective versus stratiform; Gottschalck et al., 2005; Liu et al., 2011; Beck et al., 2017c; Dong et al., 2019).
The smaller spread in performance across the models with satellite data assimilation is due to the fact that individual errors in satellite retrievals and model estimates are cancelled out, to a certain degree, when they are combined, confirming the effectiveness of the data assimilation procedures (Moradkhani, 2008;Liu et al., 2012;Reichle et al., 2017).

3.9 To what extent are our results generalizable to other regions?
The large majority (98 %) of the in situ soil moisture measurements used as reference in the current study were from dense monitoring networks in the USA and Europe (Fig. 1) and therefore our results will be most applicable to these regions. We speculate that our results for the models (with and without data assimilation; Figs. 2, 3, and 5) apply to other regions with dense rain gauge networks and broadly similar climates (e.g., parts of China and Australia, and other parts of Europe; Kidd et al., 2017). The calibrated models (HBV and the Catchment model underlying SMAPL4) may, however, perform slightly worse in regions with climatic and physiographic conditions dissimilar to the in situ sensors used for calibration (but likely still better than the uncalibrated models). In sparsely gauged areas the four model products based on precipitation forcings that incorporate daily gauge observations (GLEAM, HBV-MSWEP, HBV-MSWEP+SMAPL3E, and SMAPL4; Table 1) will inevitably exhibit lower performance (but not necessarily lower than the other model products). In convection-dominated regions models driven by precipitation from satellite datasets such as IMERG may well outperform those driven by precipitation from reanalyses such as ERA5 (Massari et al., 2017; Beck et al., 2017c, 2019b). Conversely, in mountainous and snow-dominated regions models driven by precipitation from reanalyses are likely to outperform those driven by precipitation from satellites (Ebert et al., 2007; Beck et al., 2019b, a).

Our results for the satellite soil moisture products may be less generalizable, given the large spread in performance across different regions and products revealed in the current study (Figs. 2 and 3) and in previous quasi-global studies using triple collocation (Al-Yaari et al., 2014; Chen et al., 2018; Miyaoka et al., 2017).
Outside developed regions we expect the lower prevalence of RFI to lead to more reliable retrievals for those satellite products susceptible to it (Njoku et al., 2005; Oliva et al., 2012; Aksoy and Johnson, 2013; Ticconi et al., 2017). At low latitudes the lower satellite revisit frequency will inevitably increase the sampling uncertainty and reduce the overall value of satellite products relative to models. In tropical forest regions passive products often do not provide soil moisture retrievals, and when they do, the retrievals are typically less reliable than those from active products due to the dense vegetation cover (Al-Yaari et al., 2014; Chen et al., 2018; Miyaoka et al., 2017; Kim et al., 2018). Shedding more light on the strengths and weaknesses of soil moisture products in regions without dense measurement networks, for example using independent soil moisture products (Chen et al., 2018; Dong et al., 2019) or by expanding measurement networks (Kang et al., 2016; Singh et al., 2019), should be a key priority for future research (Ochsner et al., 2013; Myeni et al., 2019).

Conclusions
To shed light on the advantages and disadvantages of different soil moisture products and on the merit of various technological and methodological innovations, we evaluated 18 state-of-the-art (sub-)daily (quasi-)global near-surface soil moisture products using in situ measurements from 826 sensors located primarily in the USA and Europe. Our main findings related to the nine questions posed in the introduction can be summarized as follows:
1. Local night retrievals from descending overpasses were more reliable overall for AMSR2, whereas local morning retrievals from descending overpasses were more reliable overall for ASCAT. The ascending and descending retrievals of SMAPL3E and SMOS performed similarly.
2. Application of the SWI smoothing filter resulted in improved performance for all satellite products. Previous near-surface soil moisture product assessments generally did not apply smoothing filters and therefore may have underestimated the true skill of the products.
3. SMAPL3E SWI performed best overall among the four single-sensor satellite products with SWI filter. ASCAT SWI performed markedly better in terms of high-frequency than low-frequency fluctuations. All satellite products tended to perform worse in cold climates.
4. The multi-sensor merged satellite product MeMo performed best among the satellite products, highlighting the value of multi-sensor merging techniques. MeMo also outperformed the multi-sensor merged satellite product ESA-CCI SWI, likely due to the inclusion of SMAPL3E SWI.
5. The performance of the open-loop models depended primarily on the precipitation data quality. The superior performance of HBV-MSWEP is due to the calibration of HBV and the daily gauge corrections of MSWEP. Soil moisture simulation performance did not improve with model complexity.
6. In the absence of model structural or parameterization deficiencies, satellite data assimilation yields substantial performance improvements mainly when the precipitation forcing is of relatively low quality. This suggests that data assimilation provides significant benefits at the global scale.
7. The calibration of HBV against in situ soil moisture measurements resulted in substantial performance improvements. The improvement due to model calibration tends to exceed the improvement due to satellite data assimilation and is not limited to regions of low quality precipitation.
8. The satellite products provided the least reliable soil moisture estimates and exhibited the largest regional performance differences on average, whereas the models with satellite data assimilation provided the most reliable soil moisture estimates and exhibited the smallest regional performance differences on average.
9. We speculate that our results for the models (with and without data assimilation) apply to other regions with dense rain gauge networks and broadly similar climates. Our results for the satellite products may be less generalizable due to the large number of factors that affect retrievals.
b At a latency of hours, MSWEP does not include daily gauge corrections and is therefore of lower quality. The data evaluated here have an effective latency of several days.