The National Aeronautics and Space Administration (NASA) Soil Moisture Active-Passive (SMAP) mission characterizes global spatiotemporal patterns in surface soil moisture using dual L-band microwave retrievals of horizontal (
Accurate information on soil moisture is of great importance for understanding various biophysical processes in hydrology, agronomy, and ecosystem sciences (Bassiouni et al., 2020; Uber et al., 2018). The poor spatial representativeness of in situ soil moisture sensors, combined with their labor-intensive installation and maintenance, impedes the application of these sensors to understand large-scale ecosystem phenomena (Babaeian et al., 2019; Petropoulos et al., 2015). Spaceborne passive microwave remote sensing has been developed as a reliable method to estimate surface soil moisture at large scales (Wigneron et al., 2017). It leverages the large discrepancies in dielectric properties between liquid water and dry soil that result in a high dependency of soil dielectric constants on soil moisture (Njoku and Entekhabi, 1996). Various microwave frequencies have been available to date, amongst which the L-band microwave frequencies were found to be desirable for soil moisture estimations because they can sense soil moisture at a relatively deeper layer (
Passive L-band remote sensing soil moisture estimation uses a radiometer to measure surface emission intensity, which is proportional to the brightness temperature (Wang and Qu, 2009). The brightness temperature is linked to soil moisture and vegetation opacity through the “tau-omega” emission model and parameterized by soil and vegetation functions (Jackson et al., 1982; Mo et al., 1982). The “tau-omega” model rationale has been adopted by the National Aeronautics and Space Administration (NASA) Soil Moisture Active-Passive (SMAP) mission, which is one of the Earth observation missions dedicated to estimating soil moisture at L-band microwave frequency (Entekhabi et al., 2010). The SMAP mission implemented two primary algorithms: (1) the single-channel algorithm (SCA), which uses one polarized brightness temperature as the primary input to retrieve soil moisture, and (2) the dual-channel algorithm (DCA), which retrieves soil moisture and vegetation opacity simultaneously by taking the polarized brightness temperature information in both the horizontal and vertical directions (O'Neill et al., 2020a). There is strong interest in the DCA approach because of its independent estimation of vegetation opacity in lieu of the specified vegetation climatology employed by the SCA (O'Neill et al., 2020a). Other L-band-focused satellite missions, such as Soil Moisture and Ocean Salinity (SMOS), retrieve both soil moisture and vegetation optical depth by using numerous brightness temperature measurements at different incidence angles (Kerr et al., 2012). Additionally, it has been suggested that using a time-integrated vegetation opacity, as is employed in the multi-temporal dual-channel algorithm (MT-DCA) for instance (Konings et al., 2016), improves the estimates of soil and vegetation state.
These contrasting approaches, as well as other studies on SMAP's temporal polarized ratio algorithm (TPRA) (Gao et al., 2020) and regularized dual-channel algorithm (RDCA) (Chaubell et al., 2020), suggest that there is still uncertainty about how SMAP observations of horizontal and vertical brightness temperature can best be translated into estimates of surface properties. Although SMAP can provide spatially explicit soil moisture estimates that have been shown to be useful for understanding a set of ecohydrological problems (Dadap et al., 2019; Feldman et al., 2018), the soil moisture retrievals are still subject to a significant amount of uncertainty due to the imperfection of the model and the forcing datasets. It is also important to consider how the amount of duplicate information carried within a set of observations limits the number of independent parameters that can be inferred (Konings et al., 2015). Therefore, it is critical to diagnose and quantify the sources of uncertainty in the SMAP algorithm to improve the soil moisture and vegetation opacity retrieval quality.
SMAP soil moisture products have been extensively validated against well-calibrated in situ soil moisture using unbiased root mean square error (ubRMSE), bias, RMSE, Pearson correlation coefficients, and the triple collocation method at “core” and “sparse” validation sites (Chan et al., 2016; Chen et al., 2017; Colliander et al., 2017; Zhang et al., 2019). These validation investigations found that SMAP met the required accuracy target (ubRMSE, 0.04
The challenges faced by previous SMAP evaluation investigations can be resolved by leveraging two information quantities: (1) Shannon's entropy (Shannon, 1948), which is the amount of information required to fully describe a random variable, and (2) mutual information (Cover and Thomas, 2005), which represents the amount of information gained about one variable from knowledge of another variable or a set of random variables. Gong et al. (2013) first leveraged these information quantities to partition overall uncertainty in the hydrological modeling process into two categories: (1) random uncertainty that arises from incompleteness of the explanatory variables and/or inherent stochasticity of forcing datasets and (2) model uncertainty that is contributed by poor model parameterization or formulation. The random uncertainty is not resolvable for the given system, as it is only related to the probability distributions of the forcing data themselves, while the model uncertainty is reducible by a better model parameterization.
Given that both horizontally and vertically polarized brightness temperatures are measured by SMAP, it is unclear how each polarization contributes information to the overall performance of the DCA. Recent research on partial information decomposition has provided new opportunities for understanding the nuanced interactions among different variables and model structure. Initially proposed by Williams and Beer (2010) and further advanced by Goodwell and Kumar (2017), this approach has been used to understand environmental processes that link two source variables with a target variable by partitioning multivariate mutual information into unique, redundant, and synergistic components. The unique information represents the amount of information shared with the target variable from each individual source variable separately (Finn and Lizier, 2018). Synergistic information is the information provided to the target only when both source variables act jointly (Kunert-Graf et al., 2020). Redundant information is the overlapping information that both source variables provide to a target (Wibral et al., 2017). Information partitioning brings new insight by unambiguously characterizing the interdependencies between source variables and a target variable without any underlying assumption (Goodwell et al., 2018).
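The partitioning into unique, redundant, and synergistic components can be illustrated with a short sketch. It assumes the rescaled redundancy measure as we understand it from Goodwell and Kumar (2017); whether this exact measure matches the implementation used in this study is an assumption. The function name `decompose_information` and all argument names are hypothetical, and the entropy and mutual information values are taken as already estimated inputs.

```python
def decompose_information(i_1t, i_2t, i_12t, i_12, h_1, h_2):
    """Partition I(X1, X2; T) into unique, redundant, and synergistic parts.

    i_1t, i_2t : mutual information between each source and the target
    i_12t      : mutual information between both sources jointly and the target
    i_12       : mutual information between the two sources
    h_1, h_2   : entropies of the two sources
    All quantities must share the same units (e.g., bits).
    """
    # Interaction information; negative values imply some redundancy must exist.
    interaction = i_12t - i_1t - i_2t
    r_min = max(0.0, -interaction)      # lower bound on redundancy
    r_mmi = min(i_1t, i_2t)             # upper bound on redundancy
    # Source dependency, scaled to [0, 1] (assumed rescaling factor).
    i_s = i_12 / min(h_1, h_2)
    redundant = r_min + i_s * (r_mmi - r_min)
    unique_1 = i_1t - redundant
    unique_2 = i_2t - redundant
    synergistic = i_12t - unique_1 - unique_2 - redundant
    return {
        "unique_1": unique_1,
        "unique_2": unique_2,
        "redundant": redundant,
        "synergistic": synergistic,
    }


# Illustrative (made-up) values, not results from this study.
parts = decompose_information(
    i_1t=0.5, i_2t=0.4, i_12t=0.7, i_12=0.6, h_1=1.0, h_2=1.2
)
```

By construction the four components sum to the joint mutual information between the two sources and the target, so dividing each by that total gives the fractional contributions.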
The overall objective of this study is to demonstrate that by assessing how information flows through satellite algorithms from raw retrievals to end-user products, we can illuminate areas where improvements can be made and diagnose instances where algorithm estimates are expected to be uncertain. In this study, we focus on (1) quantifying the random uncertainty and model uncertainty in SMAP's DCA and understand how these uncertainties are related to DCA retrieval quality and (2) exploring how the partial information components between SMAP DCA soil moisture and horizontally polarized and vertically polarized brightness temperature can be used to indicate overall DCA soil moisture retrieval performance.
The US Climate Reference Network (USCRN) is a systematic and sustained network that is operated and maintained by the National Oceanic and Atmospheric Administration (NOAA) to support climate-impact research with continuous high-quality field-observed soil moisture, soil temperature, and wind speed at different temporal scales (Diamond et al., 2013). The USCRN provides soil moisture observations at five different standard depths (5, 10, 20, 50, and 100 cm) in 114 locations of the contiguous US (CONUS) (Bell et al., 2013). These in situ datasets have been used for a wide variety of research, such as drought evaluation and satellite soil moisture validation (Bell et al., 2015; Leeper et al., 2017). The hourly soil moisture (beta version product) datasets at a depth of 5 cm were collected from 58 (15 croplands, 32 grasslands, 5 shrublands, 2 savannas, 4 mixed) selected USCRN stations (Fig. 1 and Table S1 in the Supplement) based on the availability of in situ soil moisture datasets and the data quality of SMAP pixels in the study period of 31 March 2015 to 10 December 2020.
Spatial distribution of selected USCRN stations classified by land covers.
In this study, we acquired the water-body-corrected horizontally polarized brightness temperature (
The SMAP fraction of the land-cover data field provides the fraction of the top three dominant land covers that were classified by the International Geosphere–Biosphere Programme (IGBP) ecosystem surface classification scheme at each pixel (Chan, 2020). The IGBP classified land surface into water, evergreen needleleaf forest, evergreen broadleaf forest, deciduous needleleaf forest, deciduous broadleaf forest, mixed forest, closed shrublands, open shrublands, woody savannas, savannas, grasslands, permanent wetlands, croplands, urban and built-up, croplands/natural vegetation mosaics, snow and ice, and barren (Seitzinger et al., 2015). In this study, the land cover of a study site was classified as the most dominant land cover if the fraction of the most dominant land cover was greater than 50 %. Otherwise, the land cover of the study site was classified as “mixed”. Furthermore, study sites dominated by woody savannas were classified as savannas, those dominated by closed or open shrublands were classified as shrublands, and those dominated by croplands/natural vegetation mosaics were classified as croplands. Sites meeting specified data requirements and their associated land-cover classification are shown in Fig. 1. Additionally, the 500 m leaf area index (LAI) of each site was obtained from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) mission (Myneni et al., 2015; ORNL DAAC, 2018) and averaged in time. For each site, the mean and standard deviation of LAI over all MODIS pixels within the SMAP pixel were calculated as a measure of vegetation biomass and variability.
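The site classification rule described above can be sketched as follows. The function name `classify_site` and the dictionary layout are hypothetical, and the sketch assumes that the 50 % threshold is applied to the dominant IGBP class before the similar classes are merged.

```python
# IGBP classes that are merged into a broader category in this study.
MERGED_CLASSES = {
    "woody savannas": "savannas",
    "closed shrublands": "shrublands",
    "open shrublands": "shrublands",
    "croplands/natural vegetation mosaics": "croplands",
}


def classify_site(fractions, threshold=0.5):
    """Return the study land-cover label for one SMAP pixel.

    fractions : dict mapping an IGBP class name to its fractional cover
                (the top three dominant classes reported for the pixel)
    threshold : the dominant class must strictly exceed this fraction
    """
    dominant, fraction = max(fractions.items(), key=lambda item: item[1])
    if fraction <= threshold:
        return "mixed"
    return MERGED_CLASSES.get(dominant, dominant)


# Illustrative pixel, not taken from the study sites.
label = classify_site(
    {"open shrublands": 0.7, "grasslands": 0.2, "barren": 0.1}
)
```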
The fundamental quantity of information theory is Shannon's entropy (Shannon, 1948), which represents the amount of information required to fully describe a random variable (Cover and Thomas, 2005). Shannon's entropy is the basic building block of computing mutual information and the informational uncertainties. The entropy of a single random variable is defined as
A previous study has indicated that this method (Eq. 1) may underestimate the true entropy (Paninski, 2003). Therefore, we used the simple Miller–Madow-corrected entropy estimator (Zhang and Grabchak, 2013), and we also normalized the entropy to remove the bias that may be caused by the heterogeneity in the length of available datasets across stations. We acknowledge that several other entropy estimation methods exist; we selected the Miller–Madow correction for its simplicity and effectiveness. The corrected and normalized entropy is then expressed as
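A minimal sketch of a histogram-based entropy estimate with the Miller–Madow correction is given below for illustration. The fixed-width binning, the default of 10 bins, and the normalization by the logarithm of the number of bins are assumptions made for this sketch and may differ from the normalization used in this study; all function names are hypothetical.

```python
import numpy as np


def plugin_entropy(counts):
    """Plug-in (maximum-likelihood) Shannon entropy in bits from bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def miller_madow_entropy(counts):
    """Plug-in entropy plus the Miller-Madow bias correction, in bits."""
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.sum()
    n_occupied = np.count_nonzero(counts)
    # The correction (m - 1) / (2 N) is in nats; divide by ln(2) for bits.
    correction = (n_occupied - 1) / (2.0 * n_samples * np.log(2))
    return plugin_entropy(counts) + correction


def normalized_entropy(x, bins=10):
    """Corrected entropy divided by log2(bins) (assumed normalization)."""
    counts, _ = np.histogram(x, bins=bins)
    return miller_madow_entropy(counts) / np.log2(bins)
```

Because the correction term is positive, the normalized value can slightly exceed 1 for short records that occupy nearly all bins.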
The joint entropy (Cover and Thomas, 2005) is a critical intermediate information quantity to calculate these informational uncertainties. It represents the amount of information required to describe a set of random variables. The joint entropy for two random variables is defined as
Generally, modeling efforts are focused on capturing the information of a random variable of interest via other explanatory variables through some physically or empirically based models. However, most of the models, being constructed of natural processes, are not perfect, and the model outputs are often not capable of capturing the exact relationship between the available input variables and the variable of interest (Gong et al., 2013). There exists a maximum achievable performance of a model that describes the variable of interest the best for a particular system given the available datasets (Gong et al., 2013), yet the detailed structure of this model is often unknown. Mutual information (Cover and Thomas, 2005), for instance
The mutual information can be defined based on entropy and joint entropy (Cover and Thomas, 2005). The mutual information between
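The relationship between mutual information, entropy, and joint entropy can be sketched with a two-dimensional histogram, using the standard identity I(X; Y) = H(X) + H(Y) - H(X, Y). The binning scheme and function names are assumptions for illustration, and no bias correction is applied here.

```python
import numpy as np


def entropy_bits(counts):
    """Plug-in Shannon entropy in bits from an array of bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def mutual_information(x, y, bins=10):
    """Estimate I(X; Y) in bits from paired samples x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_x = entropy_bits(joint.sum(axis=1))   # marginal entropy of x
    h_y = entropy_bits(joint.sum(axis=0))   # marginal entropy of y
    h_xy = entropy_bits(joint.ravel())      # joint entropy of (x, y)
    return h_x + h_y - h_xy
```

For a variable paired with itself the estimate equals its entropy, and for independent variables it approaches zero as the record length grows.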
We adopted the information uncertainty analysis by Gong et al. (2013) and applied it to SMAP DCA. For a given system in which the input and output are linked via mathematical functions, the mutual information between model output and in situ observation can never exceed the entropy of the in situ observations. Conceptually, the entropies of model output and in situ observations can be considered two circles (of equal or unequal sizes), and the mutual information between model output and in situ observation can be viewed as the overlapping area of these two circles (Uda, 2020). Therefore, the maximum mutual information shared between model output and in situ is the minimum of the entropy of model output and in situ observations, i.e.,
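The bound and the resulting uncertainty partition can be sketched as below. The sketch follows our reading of the Gong et al. (2013) framework, in which the mutual information between the explanatory variables and the observations is taken as the best-achievable performance; the function names and the numerical values are hypothetical and purely illustrative.

```python
def max_mutual_information(h_model, h_obs):
    """Upper bound on the information shared by model output and observations."""
    return min(h_model, h_obs)


def partition_uncertainty(h_obs, i_inputs_obs, i_model_obs):
    """Split the total informational uncertainty into random and model parts.

    h_obs        : entropy of the in situ observations
    i_inputs_obs : mutual information between explanatory variables and
                   observations (assumed best-achievable performance)
    i_model_obs  : mutual information between model output and observations
    """
    if not (i_model_obs <= i_inputs_obs <= h_obs):
        raise ValueError("expected i_model_obs <= i_inputs_obs <= h_obs")
    total = h_obs - i_model_obs
    random_part = h_obs - i_inputs_obs
    model_part = i_inputs_obs - i_model_obs
    return {"total": total, "random": random_part, "model": model_part}


# Made-up values in normalized units, chosen only to show the arithmetic.
result = partition_uncertainty(h_obs=1.0, i_inputs_obs=0.48, i_model_obs=0.20)
```

With these illustrative numbers the total uncertainty is 0.80, of which 0.52 is random and 0.28 is model uncertainty; the two parts always sum to the total.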
In this study, the explanatory variables of DCA are
The distinct informational contributions of
The mutual information between
It is important to note that we used the point-based in situ soil moisture as the ground truth in this analysis. Due to coarse spatial resolution of SMAP products, we acknowledge that in situ soil moisture may not be able to represent the spatially averaged soil moisture well. Although the nominal sensing depth of L-band SMAP soil moisture is 5 cm, the penetration depth was found to be even shallower in wetter regions (Shellito et al., 2016). In fact, the L-band sensing depth was found to be as little as
The estimated entropies across all the study sites are shown in Fig. 2, while the mutual information quantities are shown in Fig. 3. The brightness temperature entropies,
Entropies of horizontally polarized brightness temperature (
Mutual information between horizontally polarized brightness temperature (
Entropy of in situ soil moisture against the entropies of DCA soil moisture, horizontally polarized brightness temperature (
As shown in Fig. 4a, the entropies of the retrieved brightness temperatures and DCA model output,
Informational uncertainties expressed as percentages. The values in the table are the averages for each land cover. The values in “Overall” are the averages of all the sites. The “Lumped” field is computed using all available datasets.
It is noticeable that there exists a large information gap between
SMAP informational total uncertainty
The relationship between different informational uncertainties and the Pearson correlation coefficients between in situ soil moisture and SMAP DCA soil moisture, a commonly adopted relative model evaluation metric in SMAP studies (Chan et al., 2016; Colliander et al., 2017), was evaluated. The
The partial information decompositions were assessed on a site basis and are shown in Fig. 6. The fractional contribution of each component to that site's mutual information between brightness temperatures and DCA estimates,
Partial information decomposition components between horizontally (
The partial information decomposition components. The values in the table are the average of each land cover. The values in “Overall” are the average of all the sites. The “Lumped” field is computed using all available datasets.
Partial information decomposition components between horizontally (
Through this analysis, it is shown in Fig. 7 that there are strong relationships between SMAP DCA retrieval quality and decomposed information components. In general, the correlation strength between DCA and in situ soil moisture is higher when
The first objective of this study is to leverage information theory to quantitatively decompose the informational total uncertainty into informational random uncertainty and informational model uncertainty in the DCA as an approach to understand where retrieval uncertainties arise. This information theory approach can provide new insight into SMAP modeling diagnosis. It offers an opportunity to partition the total informational uncertainty in the DCA into the uncertainty due to the input datasets and the uncertainty due to model structure and model parameterizations. This partition process cannot be achieved by leveraging the common DCA assessment metrics (Chan et al., 2016) (e.g., Pearson correlation, ubRMSE) that only involve the DCA soil moisture and in situ soil moisture.
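For comparison, the common assessment metrics mentioned above depend only on the paired retrieved and in situ soil moisture series. A minimal sketch using their standard definitions is given below; the function name and the sample values are hypothetical.

```python
import numpy as np


def validation_metrics(retrieved, in_situ):
    """Bias, RMSE, ubRMSE, and Pearson correlation for paired series."""
    retrieved = np.asarray(retrieved, dtype=float)
    in_situ = np.asarray(in_situ, dtype=float)
    diff = retrieved - in_situ
    bias = float(diff.mean())
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    # Unbiased RMSE removes the mean difference before squaring.
    ubrmse = float(np.sqrt(np.mean((diff - bias) ** 2)))
    pearson_r = float(np.corrcoef(retrieved, in_situ)[0, 1])
    return {"bias": bias, "rmse": rmse, "ubrmse": ubrmse, "r": pearson_r}


# Made-up series: the retrieval is offset from the in situ values by 0.05.
in_situ = np.array([0.10, 0.20, 0.30, 0.40])
metrics = validation_metrics(in_situ + 0.05, in_situ)
```

None of these metrics uses the brightness temperature inputs, which is why they cannot separate input-related uncertainty from model-related uncertainty.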
The DCA model structure is inherently a hypothesis that relates the input datasets to soil moisture based on prior physical knowledge. The DCA is thus a procedure for processing the input dataset into estimates of soil moisture. Consequently, models, even those that perform the best, can only reduce the available information in their inputs and are not capable of adding new information about the “true” soil moisture. Hence, there is no possibility of building a model that is better than the one with the best-achievable performance of the input data themselves (yet even achieving this theoretical limit is nearly impossible) (Gong et al., 2013). If, however, more freedom is given in the choice of datasets to incorporate, it is possible to build models that outperform the best-achievable model performance by adding new explanatory variables, which may lead to a family of models that have completely different model structures. Based on Table 1, we find that the DCA has more informational uncertainty in shrublands than in grasslands and croplands. This might be due to stronger variability in vegetation for shrublands, while grasslands and croplands tend to be more uniform and homogeneous. It is worth noting that these findings are based on averaging our studied sites within different land-cover categories, and results may be different when comparing two specific sites from different land covers. In addition, we find that the proportion of informational uncertainty increases when the data are lumped together relative to averaging these statistics calculated on a site-by-site basis (Table 1). Treating all the surfaces together as a whole does not reduce the informational total uncertainty because the lumped data contain both “high-quality” and “low-quality” (as assessed by the Pearson correlation between in situ and DCA soil moisture) datasets. The uncertainties in these datasets may accumulate when they are lumped together and result in an increase in total informational uncertainty.
The fraction that informational random uncertainty contributes to the informational total uncertainty is quite significant (65 % on average) in this study. The informational random uncertainty in the system may arise from the inherent error due to calibration of
Informational model uncertainty contributes a non-negligible portion to the informational total uncertainty (35 % on average). This model uncertainty may arise from poor model parameterizations, which may vary with site soil moisture dynamics (
To summarize, this is the first attempt at leveraging a mutual information approach to analyze the uncertainty components in microwave remote sensing models. The results of this study can be further used as guidance in assessing an SMAP algorithm and can quantitatively identify where information is lost in the process of SMAP soil moisture modeling. More broadly, this study, though focused on SMAP, can be transferred and extended to analyze other remote sensing algorithms. Over many decades, a lot of effort, resources, and time have been devoted to the launch of numerous satellite missions to retrieve the key environmental variables such as evapotranspiration and vegetation biomass (Dubayah et al., 2020; Hulley et al., 2017). Performing such an analysis on these retrieval algorithms is expected to be beneficial in understanding the informational flow in these algorithms and may provide insights to further improve the data retrieval accuracy as well as make maximum use of data collected at greater expense.
The second objective of this study was to demonstrate that the partitioned information components contain useful information about DCA model performance that does not depend on in situ soil moisture and other ancillary datasets. We find a strong linear relationship between redundant (
Compared with other ancillary and in situ independent metrics such as correlation strength between a Pearson correlation of
While we expect that this approach can be generalized to analyze other remote sensing models, it may be difficult to compute the joint probability density functions for models with high-dimensional inputs. Difficulty in determining the joint probability density functions hinders the estimation of high-dimensional joint entropy and mutual information components, and these are still open questions in the field of information theory. Although several data dimension reduction techniques exist, they mostly rely on assumptions about the data (Xu et al., 2019). In practice, most systems with high-dimensional inputs tend to be complex. Therefore, there is a strong risk of introducing additional uncertainty if one chooses an inappropriate technique.
It is important to understand that the SMAP DCA system retrieves soil moisture with the help of vegetation water content climatology derived from the MODIS NDVI data stream. This is specified as a set value for each location and day of year combination and is used to estimate the unknown vegetation optical depth (O'Neill et al., 2020a). The reader should keep in mind that this study considers such data as a dynamic time-varying parameter, and it is not treated as a data input in this study. Adding NDVI as a data input would result in
This study was conducted only at locations where in situ soil moisture is readily available. It could be an interesting topic to explore whether, and how, information-based uncertainty analysis can be applied in the locations without in situ soil moisture measurements. We would expect the informational uncertainty analysis to provide the estimates of random and model uncertainties. The best performance we can expect from this current uncertainty analysis is to use all the available datasets we have, yet we believe that uncertainty estimations of this approach should be stabilized given adequate representative locations and data records.
This study differentiates and quantifies the uncertainty sources in the SMAP DCA using information theory. We found that on average DCA soil moisture explains 20 % of the information in the in situ soil moisture, leaving 80 % unexplained. Among the unexplained information, 65 % is informational random uncertainty that is caused by the inherent stochasticity of the explanatory variables of SMAP DCA and a lack of additional explanatory variables in the system, while the rest of the informational uncertainty is caused by inappropriateness of the assumption of DCA model structure and parameterizations. We show that informational random uncertainty contributes a larger proportion of the informational total uncertainty across different land covers. However, the informational model uncertainty contributes more to total uncertainty when lumping all the datasets together. The performance of SMAP DCA is negatively correlated with all the information uncertainties, with the informational model uncertainty being more reflective of overall SMAP DCA retrieval quality than the informational random uncertainty.
The decomposition of the mutual information has shown that all decomposed components are significantly related to the Pearson correlation between in situ and DCA soil moisture, with the redundant and synergistic information being the strongest. Good DCA model performance (as measured by the Pearson correlation between in situ and DCA soil moisture) is more likely to be found in locations where the redundant information of brightness temperatures shared with DCA soil moisture is high and is more dominant relative to other components. The informational uncertainty decomposition analysis opens a new window for SMAP algorithm uncertainty diagnosis. SMAP DCA users may examine the
The code regarding the SMAP dataset time series, mutual information, and partial information decomposition calculation can be obtained from
SMAP L2 Radiometer Half-Orbit 36 km EASE-Grid Soil Moisture, version 7, is acquired from US National Snow and Ice Data Center (
The supplement related to this article is available online at:
BL and SPG designed this study; BL wrote the manuscript; SPG co-wrote and revised the manuscript.
The authors declare that they have no conflict of interest.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This project was supported by the National Aeronautics and Space Administration (NASA) under grant NNX16AN13G.
This paper was edited by Roger Moussa and reviewed by Nemesio Rodriguez-Fernandez and two anonymous referees.