Uncertainty assessment of satellite remote-sensing-based evapotranspiration estimates: a systematic review of methods and gaps

Satellite remote sensing (RS) data are increasingly being used to estimate total evaporation, often referred to as evapotranspiration (ET), over large regions. Since RS-based ET (RS-ET) estimation inherits uncertainties from several sources, many available studies have assessed these uncertainties using different methods. However, the suitability of methods and reference data subsequently affects the validity of these evaluations. This study summarizes the status of the various methods applied for uncertainty assessment of RS-ET estimates, discusses the advances and caveats of these methods, identifies assessment gaps, and provides recommendations for future studies. We systematically reviewed 676 research papers published from 2011 to 2021 that assessed the uncertainty or accuracy of RS-ET estimates. We categorized and classified them based on (i) the methods used to assess uncertainties, (ii) the context where uncertainties were evaluated, and (iii) the metrics used to report uncertainties. Our quantitative synthesis shows that the uncertainty assessments of RS-ET estimates are not consistent and comparable in terms of methodology, reference data, geographical distribution, and uncertainty presentation. Most studies used validation methods using eddy-covariance (EC)-based ET estimates as a reference. However, in many regions such as Africa and the Middle East, other references are often used due to the lack of EC stations. The accuracy and uncertainty of RS-ET estimates are most often described by root-mean-squared errors (RMSEs). When validating against EC-based estimates, the RMSE of daily RS-ET varies greatly among different locations and levels of temporal support, ranging from 0.01 to 6.65 mm d⁻¹, with a mean of 1.18 mm d⁻¹.
We conclude that future studies need to report the context of validation, the uncertainty of the reference datasets, the mismatch in the temporal and spatial scales of reference datasets to those of the RS-ET estimates, and multiple performance metrics with their variation in different conditions and their statistical significance to provide a comprehensive interpretation to assist potential users. We provide specific recommendations in this regard. Furthermore, extending the application of RS-ET to regions that lack validation will require obtaining additional ground-based data and combining different methods for uncertainty assessment.


Introduction
Evapotranspiration (ET) is the key variable linking the water, energy, and carbon cycles of the Earth (Fisher et al., 2017). In the terrestrial water cycle, it is the second-largest flux after precipitation (Korzoun et al., 1978), and it dominates the demand side of water resources. It is associated with latent heat flux in the surface energy balance. ET combines the evaporation of water from soil, free water surfaces, and plants and, thus, depends on many factors, such as the atmospheric and vegetation conditions, the availability of water in the soil, waterbodies, canopy, and surface roughness (Monteith, 1965; Shuttleworth and Wallace, 1985). The complexity of measuring ET directly makes it difficult and expensive to routinely measure and capture its spatial variation as this requires a dense network of in situ gauging stations. Therefore, satellite remote sensing (RS) observations have been increasingly used for estimating ET spatially.
Published by Copernicus Publications on behalf of the European Geosciences Union. B. N. Tran et al.: Uncertainty assessment of satellite remote-sensing-based evapotranspiration estimates
As ET cannot be directly measured by sensors from space, retrieval algorithms or models are needed to estimate ET from other variables observed by RS (Fisher et al., 2017). These models estimate ET from visible and/or thermal infrared RS data and include now-well-known models such as SEBAL (Bastiaanssen et al., 1998), TSEB (Kustas and Norman, 1999), SEBS (Su, 2002), METRIC (Allen et al., 2007), ALEXI (Anderson et al., 2011), PT-JPL (Fisher et al., 2008), and GLEAM (Miralles et al., 2011a). The diversity of models, input RS data sources, and processing techniques results in a wide range of RS-based ET estimates (Jiménez et al., 2011; Long et al., 2014; Chen et al., 2014).
While many studies have evaluated the performance of RS-based ET (RS-ET) models, none of them have concluded that a single model performs best in all situations (e.g., Ferguson et al., 2010; Vinukollu et al., 2011a). Furthermore, retrieving ET estimates requires access to the data, software or source code, and expertise in these models. The limited accessibility of RS-ET models leads to significant challenges to operational applications of RS-ET estimates (e.g., irrigation scheduling and drought monitoring). Driven by community needs, several projects have provided platforms to increase public access to various data products which are generated by these RS-ET models. These projects and outputs include MODIS16 (Mu et al., 2011), SSEBop (Senay et al., 2013), GLEAM (Miralles et al., 2011a), WaPOR (FAO, 2018), ECOSTRESS (Fisher et al., 2020), and OpenET (Melton et al., 2021).
Given that more RS-ET data products are becoming available, information about the uncertainties in RS-ET estimates is important for data users (i.e., water managers and policymakers) to apply them properly. Uncertainty assessment helps data users know what level of confidence they can have in ET estimates and the inferred information about water resources (e.g., crop water consumption, water depletion). Inferences based on RS-ET data products are limited by their spatio-temporal resolution, latency, and specifications.
Previous reviews have discussed RS-ET estimates and uncertainty, which are relevant to this review (Fig. 1). Table S2 in the Supplement summarizes the main topics of these reviews. Many of these reviews focused on outlining the methods to estimate ET using RS-based models (e.g., Kustas and Norman, 1996; Courault et al., 2005; Wang and Dickinson, 2012; Zhang et al., 2016) and sometimes discussed the uncertainties in the estimation (Kalma et al., 2008; Glenn et al., 2011; Karimi and Bastiaanssen, 2015). However, none of these explored how uncertainties in RS-ET estimates are currently being assessed, which is an important issue in remote sensing and the production of spatial data (Bielecka and Burek, 2019; Wu et al., 2019a; Mayr et al., 2019). In an overview of global RS-based essential climate variables, Bayat et al. (2021) concluded that RS-ET data products lack a good practice protocol for operational validation compared to other variables. Meanwhile, in situ measurements of ET also suffer from errors and uncertainty (Allen et al., 2011a) and, thus, require complete documentation that provides sufficient information to ascertain the expected accuracy and representativeness of the reported ET estimates (Allen et al., 2011b).
These reviews highlight the need to better advance the uncertainty assessment of RS-ET, leading to the following research questions:
- What are the common and emerging methods used to assess uncertainty in RS-ET estimates?
- In which contexts are the uncertainties of RS-ET assessed with these methods?
- What is the typical range of uncertainty in RS-ET estimates globally based on previous studies?
To answer these questions and build on existing literature, we surveyed previous studies that assessed the uncertainty or accuracy of RS-ET models or the output data products of these models. Given that many literature reviews on the uncertainty or accuracy of ET estimation have been published until 2011 (Fig. 1), we focus on the period from 2011 to evaluate whether the studies in this period adopted the valuable contributions and recommendations from these previous reviews. Given the growing volume of literature published in the field, we followed a systematic quantitative review approach to avoid subjectivity or bias towards particular products, authors, or approaches. We identified research articles with a set of predetermined criteria and categorized these articles based on (i) the methods used to assess uncertainties, (ii) the context where uncertainties were evaluated, and (iii) the metrics used to report uncertainties. We then quantified the number of articles per category to identify any trends or gaps in literature. Furthermore, we appraised the advances and caveats of the existing methods and provided recommendations for future studies. The rest of this paper is organized as follows: Sect. 2 provides the theoretical basis for the research and clarifies the key terms that we use to analyze literature using the methods described in Sect. 3. The results of the literature analysis, concerning assessment methods and the context when these methods are used, are discussed in Sects. 4 and 5. Based on the categorized literature, Sect. 6 discusses the use of uncertainty metrics and shows the typical range of uncertainty in RS-ET estimates. Finally, Sect. 7 summarizes the key points and recommendations for future research.

Theoretical frameworks

Uncertainty definition and representation
Uncertainty is generally defined as the state of being not completely confident or sure of something. The terms error, accuracy, bias, and precision are sometimes used to characterize the uncertainty. All these terms are quantifiable information about what is certain or uncertain. However, they are different from uncertainty by definition (Foody and Atkinson, 2003; Heuvelink, 1998; Loew et al., 2017). Error represents the difference between what is measured and its true value (JCGM, 2012). The true value is the exact value according to the theoretical definition of the variable being measured or estimated. If we perfectly know the true value, we have no measurement error, which eliminates uncertainty. Therefore, uncertainty stems from unknown true values and errors.
When a measurement can be repeated, its uncertainty can be described using probability distributions of the measured values or measurement errors compared to a reference (Montanari, 2007; Foody and Atkinson, 2003; Povey and Grainger, 2015). Figure 2 illustrates the relationship between uncertainty and other related terms when uncertainty is described by the probability distribution of the measured value or error. When accuracy is defined as the expectation (i.e., expected value) of overall error (e.g., Foody and Atkinson, 2003), bias (i.e., the difference between the expected value of an estimator and the true value of a parameter) is considered to be a measure of inaccuracy. Likewise, precision can be described using the standard deviation and variance of the probability distribution of measured values since they both denote error spread around the mean. Some uncertainties cannot be described using a probability distribution function in modeling or measurement. These are called the "known, unquantifiable unknowns" (i.e., what we know to exist but are not able to quantify) and the "unknown unknowns" (i.e., what we do not know to "exist" because we cannot observe) (Povey and Grainger, 2015). The suitability of probability theory for quantifying uncertainty is widely debated in hydrological science (e.g., Beven, 2016; Nearing et al., 2016). Nearing et al. (2016) [...] quantifiable. Uncertainty assessment of satellite remote sensing data typically reports quantifiable errors but not unknown and unquantifiable errors (Povey and Grainger, 2015).
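To make the relationship between these terms concrete, the following minimal sketch computes bias, precision, and RMSE from a set of errors against a reference. The function name and the sample values are hypothetical and serve only as an illustration; the sketch assumes that precision is expressed as the population standard deviation of the errors, in which case RMSE² equals bias² plus the error variance.

```python
import math

def error_statistics(estimates, reference):
    """Summarize errors of estimates against a reference (illustrative only)."""
    errors = [e - r for e, r in zip(estimates, reference)]
    n = len(errors)
    bias = sum(errors) / n                               # expected (mean) error
    variance = sum((x - bias) ** 2 for x in errors) / n  # population variance of errors
    precision = math.sqrt(variance)                      # spread of errors around the bias
    rmse = math.sqrt(sum(x ** 2 for x in errors) / n)    # combines bias and spread
    return {"bias": bias, "precision": precision, "rmse": rmse}

# Hypothetical daily ET values in mm/d (not taken from any reviewed study)
rs_et = [3.1, 4.0, 2.6, 5.2, 3.9]
ref_et = [2.8, 3.5, 2.9, 4.6, 3.7]
stats = error_statistics(rs_et, ref_et)
```

Reporting bias and precision alongside RMSE shows whether a given RMSE is dominated by a systematic offset or by random scatter.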

Sources of uncertainties in RS-ET data production
Raw satellite imagery undergoes a chain of processing and analysis (retrieval model) to generate useful data and information for applications (Fig. 3). ET is not directly measured by sensors but is derived from models; thus, it is considered to be high-level processing by data providers (ESA, 2021; NASA, 2021). The retrieval models of low-level data (e.g., radiance, vegetation indices) share common formulas and usually require only raw satellite images. High-level data like RS-ET rely on various models with different concepts, assumptions, and data sources. Therefore, the uncertainty of RS-ET data products is strongly linked to model uncertainty (from model conceptualization and parameters) and input data uncertainty. RS-ET data are typically acquired during satellite passes over specific areas of interest, resulting in essentially instantaneous estimates of remote-sensing-based evapotranspiration (RS-ET). Because many operational applications necessitate ET estimates over longer time intervals, such as daily, 10 d, or monthly totals, various methods have been developed to upscale these instantaneous RS-ET estimates to daily values (Jiang et al., 2021). Moreover, the designated resolution and the return interval of satellites might not be suitable for operational applications; thus, gap-filling and spatial-downscaling steps are common in many RS-ET retrieval models.
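As an illustration of how temporal upscaling can add uncertainty, the sketch below applies one widely used approach, the constant evaporative fraction assumption, to convert an instantaneous latent heat flux to daily ET. This is only one of several upscaling methods and is not necessarily one evaluated by Jiang et al. (2021); the function name, the input values, and the fixed latent heat of vaporization are assumptions made for this example.

```python
LATENT_HEAT_VAPORIZATION = 2.45e6  # J/kg, approximate value near 20 degC (assumed constant)
SECONDS_PER_DAY = 86400.0

def daily_et_from_instantaneous(le_inst, rn_inst, g_inst, rn_daily_mean, g_daily_mean=0.0):
    """Upscale an instantaneous latent heat flux (W/m2) to daily ET (mm/d).

    Assumes the evaporative fraction EF = LE / (Rn - G) observed at overpass
    time stays constant during the day (an assumption, not a general truth).
    """
    available_inst = rn_inst - g_inst
    if available_inst <= 0:
        raise ValueError("instantaneous available energy must be positive")
    ef = le_inst / available_inst
    le_daily_mean = ef * (rn_daily_mean - g_daily_mean)  # W/m2, daily mean
    # 1 kg of water per m2 corresponds to a depth of 1 mm
    return le_daily_mean * SECONDS_PER_DAY / LATENT_HEAT_VAPORIZATION

# Hypothetical overpass-time and daily-mean fluxes in W/m2
et_daily = daily_et_from_instantaneous(
    le_inst=300.0, rn_inst=550.0, g_inst=50.0, rn_daily_mean=150.0
)
```

Any error in the instantaneous fluxes or in the constancy assumption carries directly into the daily value, which is why the upscaling step counts as a separate source of uncertainty.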
Changes in the spatial and temporal scales¹ and gap filling during data pre- and post-processing steps also introduce more uncertainty. Modeled estimates are typically validated against a more accurate reference. The errors compared to the reference are referred to as compound uncertainty in this research since this aggregates all sources of uncertainty. Meanwhile, comparison with other equivalent estimates results in relative uncertainty. Although validation essentially yields relative uncertainty due to imperfections in the reference data, we trust it to represent compound uncertainty more than other estimates.

Uncertainty assessment
Uncertainty assessment refers to the estimation of quantifiable uncertainties. The uncertainty from input factors (e.g., parameters and input data) can be quantified with uncertainty analysis, also called uncertainty propagation or error propagation (Crosetto et al., 2001; Heuvelink, 1998; Wadoux et al., 2020). There are many techniques for uncertainty propagation, and their suitability depends on several factors, including the number of uncertain inputs, uncertainty distribution and correlation of input variables, and model linearity (Mohammadi and Cremaschi, 2022). For RS-based models, analytical techniques which are based on the propagation of moment formulas (Taylor, 1997) are often not suitable because these models have complex relationships, and input uncertainty is not always normally distributed. Numerical techniques are generally applicable to RS-based models (Heuvelink, 1998; Crosetto et al., 2001).
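A minimal sketch of a numerical technique is Monte Carlo propagation: inputs are sampled from their assumed uncertainty distributions, the model is run for each sample, and the spread of the outputs describes the propagated uncertainty. The toy model, the distributions, and all parameter values below are hypothetical stand-ins, not taken from any RS-ET model discussed in this review.

```python
import random
import statistics

def toy_et_model(net_radiation, evaporative_fraction):
    """Hypothetical stand-in for an RS-ET model: daily ET in mm/d."""
    return evaporative_fraction * net_radiation * 86400.0 / 2.45e6

def monte_carlo_propagation(n_samples=10000, seed=42):
    """Propagate assumed input uncertainties through the toy model by sampling."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        # Assumed input uncertainties (illustrative): Gaussian Rn, uniform EF
        rn = rng.gauss(150.0, 15.0)   # W/m2
        ef = rng.uniform(0.5, 0.7)    # dimensionless, deliberately non-Gaussian
        outputs.append(toy_et_model(rn, ef))
    outputs.sort()
    return {
        "mean": statistics.fmean(outputs),
        "std": statistics.stdev(outputs),
        "p05": outputs[int(0.05 * n_samples)],
        "p95": outputs[int(0.95 * n_samples)],
    }

result = monte_carlo_propagation()
```

Because the approach only requires repeated model runs, it does not depend on the model being linear or on the inputs being normally distributed, although correlated inputs would need to be sampled jointly.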
The contribution of each input factor to the total uncertainty in the model output is determined by sensitivity analysis (Crosetto et al., 2001; Saltelli et al., 2021). Such analysis is primarily used to identify the factors that contribute most to the model uncertainty (Saltelli et al., 2019). There are two main approaches to sensitivity analysis: local and global sensitivity analysis. Local sensitivity analysis defines the model's sensitivity to an input factor (e.g., parameter or variable) as the first-order partial derivative of the model with respect to this input factor (Saltelli et al., 2019). In contrast, global sensitivity analysis explores the whole variation range of the input factors (Razavi and Gupta, 2015).
¹ [...] to the resolution of the RS image, which is the (average) size of its constituent pixels.
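In symbols, with notation introduced here only for illustration (an RS-ET model f with input factors x_1, ..., x_k and a nominal point x^0), the local sensitivity to factor x_i is the partial derivative evaluated at that point. Global measures instead summarize the effect of x_i over its whole range; the variance-based first-order index shown on the right is one common choice and is given as an example rather than as the only definition.

```latex
S_i^{\mathrm{local}} = \left.\frac{\partial f(x_1,\dots,x_k)}{\partial x_i}\right|_{\mathbf{x}=\mathbf{x}^0},
\qquad
S_i^{\mathrm{global}} = \frac{\operatorname{Var}_{x_i}\!\left(\mathbb{E}\left[\mathrm{ET}\mid x_i\right]\right)}{\operatorname{Var}(\mathrm{ET})},
\quad \mathrm{ET} = f(x_1,\dots,x_k).
```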
Validation is often applied to confirm a data product's fitness for purpose instead of sensitivity analysis and uncertainty analysis of its retrieval model (Crosetto et al., 2001). The definition of validation in modeling is context dependent and has become more well defined over time (Bellocchi et al., 2011). Model validation does not prove that the model is true but rather proves that it is empirically adequate (Oreskes et al., 1994). A valid model is one that does not contain known or detectable flaws and is internally consistent rather than being an assertion of the actual reality. Meanwhile, validation of model results involves quantifying the accuracy compared to a reference (often in situ datasets), which proves the validity of the data for its intended application. In RS, validation often only refers to the data itself and not the model (Bayat et al., 2021; Loew et al., 2017; Wu et al., 2019a). Because RS-derived data products are model results, their validation depends on the quality and quantity of input parameters and the accuracy of auxiliary hypotheses that were used to derive them (Oreskes et al., 1994). Therefore, validating an RS-ET model does not imply that the model can be applied with any forcing data or settings to produce accurate outputs.

Systematic quantitative literature review method
In this literature review, we specifically focus on how the quantifiable uncertainty in the RS-ET estimate has been assessed in recent years (2011–2021). In this paper, we employed Pickering and Byrne's (2014) systematic quantitative literature review method, which includes systematic search, categorization, and quantification of literature. We chose this approach to objectively highlight trends and gaps in current RS-ET uncertainty assessment methods through quantitative results. The literature search is systematic but undeniably not exhaustive; thus, certain papers may be omitted if they do not meet the specified inclusion criteria.

Identification and database search
The academic electronic databases Web of Science and Scopus were searched (https://www.webofscience.com/, last access: 24 July 2023 and https://www.scopus.com/, last access: 24 July 2023) using the combination of the three search terms evapotranspiration, remote sensing, and uncertainty or their variants (Table 1). The terms transpiration and interception were not used since they only represent components of ET. Since different terms for satellite remote sensing, evaporation, and uncertainty can be used in the title and abstract, the variants of search terms were identified based on a set of 34 prior articles (Sect. S1 in the Supplement).
The search results were limited to publication dates from 2011 to 2021, and duplicates were removed. Only English articles (> 99 % of results) that reported original research and were published in scientific peer-reviewed journals were considered. Review papers, conference proceedings, and gray literature were not included because they have different formats and provide limited details of the methods used for uncertainty assessment.

Relevance and eligibility screening
From the search results, we identified papers that attempted to assess the accuracy or uncertainty of one or more satellite remote-sensing-based estimations of terrestrial ET, either from model simulations or analysis-ready data products. The models of interest were diagnostic RS-ET models,² such as the models that were reviewed by Courault et al. (2005), Zhang et al. (2016), and Chen and Liu (2020). To identify relevant papers, we screened the title and abstract using the ASReview software, a semi-automated screening system that incorporates an active-learning classifier to rank the order of papers based on their relevance to the articles that were included previously (van de Schoot et al., 2021; website: https://asreview.nl/, last access: 7 August 2023). ASReview can help find 95 % of the eligible studies after screening only 8 % to 33 % of the studies (van de Schoot et al., 2021). Based on the number of articles and the efficiency of ASReview, we established criteria to stop screening when 100 irrelevant records had been found consecutively (3 % of the total records) and when at least 10 % of the total records had been screened. After screening titles and abstracts, we assessed the eligibility of each paper by reading the full-text articles and finally included 676 articles in our review (Fig. 4). A brief bibliometric analysis of these articles is provided in Sect. S2.
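The two stopping criteria can be expressed as a simple check on the sequence of screening decisions. The sketch below is a hypothetical helper written for illustration; it is not part of the ASReview software or its API, and the function and parameter names are assumptions.

```python
def should_stop_screening(labels, total_records,
                          max_consecutive_irrelevant=100,
                          min_fraction_screened=0.10):
    """Return True once both stopping criteria described in the text are met.

    labels: screening decisions so far, in screening order (True = relevant).
    total_records: number of records returned by the database search.
    """
    consecutive_irrelevant = 0
    for is_relevant in reversed(labels):  # count the trailing run of irrelevant records
        if is_relevant:
            break
        consecutive_irrelevant += 1
    enough_screened = len(labels) >= min_fraction_screened * total_records
    return consecutive_irrelevant >= max_consecutive_irrelevant and enough_screened
```

Requiring both conditions guards against stopping too early when a long run of irrelevant records happens to occur near the start of screening.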

Article organization and analysis
Each included article was classified into categories based on methods, objectives of the study, and results (Table 2). The total number and percentage of research papers per category were then synthesized from the literature database, and the patterns and trends in assessing the uncertainty of RS-ET were discerned. In addition, the most common method for uncertainty assessment was identified, and articles that used this method were included in a meta-analysis to derive the typical range of uncertainty in RS-ET estimates.

Review of the methods for RS-ET uncertainty assessment
The selected articles assess uncertainty in RS-ET using mainly eight approaches: (1) validation, (2) intercomparison, (3) sensitivity analysis, (4) evaluation of input data, (5) uncertainty propagation, (6) three-cornered hat and triple collocation (TCH and TC), (7) physical consistency, and (8) ensemble of estimates. Figure 5 shows the UpSet plot (Lex et al., 2014) of all reviewed articles categorized by the approach of uncertainty assessment and the intersections of more than one approach. The majority of articles (532 out of 601) used a validation approach. There are a few other approaches that were less frequently used, often in combination with validation, as shown by the number of intersections with "validation" (Fig. 5).
Except for the validation and intercomparison approaches, the proportion of each approach in the selected literature either did not increase or decreased from 2011 to 2021 (Fig. 6). Approaches other than validation and intercomparison have only been used by a small group of researchers and have not been applied widely or increasingly. The following subsections will discuss the application of the most common uncertainty assessment approaches.

Validation
In validation, RS-ET model results are compared to a reference method that is considered by the researcher to be the best or most valid measure. The choice of the reference method introduces subjectivity into the model evaluation (Melsen et al., 2019). In the case of RS-ET, three types of reference are typically used: (1) in situ measurements (N = 572), (2) catchment water balance (N = 83), and (3) output from models run with ground-based input data (N = 9). Almost all articles that used the validation approach considered an in situ measurement as their reference (Fig. 7), while other types of reference data were much less considered.

Using in situ measurements as the validation reference
Several in situ methods have been developed to estimate ET on the ground, including eddy covariance (EC), lysimeters, the Bowen ratio energy balance system (BREBS), etc. (Table S3). These measurements are often considered to be the observation or reference to validate RS-ET. Among these, EC is the predominant method for validation and was considered in 424 out of 600 articles (Fig. 7). Four factors explain the popularity of the EC method: (1) its relatively large network of stations, (2) long-term temporal coverage of flux towers, (3) open-access nature of data (e.g., FLUXNET, EuroFlux, AmeriFlux, and OzFlux), and (4) direct measurement of water vapor concentration and vertical wind speed of the air parcels to calculate latent heat flux.
Using in situ methods for validation faces three main challenges: (1) the cost to set up and maintain measuring stations, (2) the mismatch between the source area of measurements and the spatial resolution of an RS-based estimate, and (3) errors in measurements and assumptions. For example, the cost of a complete EC system is about 10 times the cost of a weather station with basic meteorological instruments. Although the EC method can be used to monitor other fluxes (e.g., carbon dioxide and nitrogen oxide), the high cost of the EC system still limits the number of sampling points and regions (Oliphant, 2012; FLUXNET, 2017). The low sampling density can be compensated for with low-cost systems (Markwitz and Siebicke, 2019) but at the expense of lower accuracy. In order to obtain validation data at a global scale, EC networks need to be expanded in many regions (e.g., Africa, southern Asia, the Middle East, and South America).
Spatial support of in situ measurements often does not overlap with the pixel footprint of the RS images (i.e., the area the pixel value represents). The spatial support of in situ measurements varies among methods, from 1 m² (microlysimetry) to a few square kilometers depending on the wind speed and wind direction (eddy covariance and scintillometry). Certain methods for measuring components of ET have more limited spatial support (e.g., sap flow measurement for transpiration). For homogeneous pixels (with the same geophysical and ecological characteristics), in situ measurements can be representative of an entire pixel. However, when the pixel covers a large area, RS-ET validation frequently involves heterogeneous pixels. Therefore, multiple sites and upscaling methods are required to best aggregate site-specific to pixel-scale information (e.g., Liu et al., 2016; Li et al., 2018).
Every in situ measurement technique is subject to uncertainty and error. Even the most widely used technique, the EC flux tower, has limitations in terms of measurements (10 %-20 % error) and spatial support (Glenn et al., 2011; Wang et al., 2015). All methods have common sources of error and uncertainty, such as sensor response (detection limit), calibration error (sensor drift over time), noise (spurious random spikes in the signal from the sensor), and poor installation and maintenance (Allen et al., 2011a). Additionally, each method has specific sources of error and uncertainty due to its theoretical assumptions. For example, the EC method requires fully developed turbulent fluxes to ensure that the net vertical transfer of water vapor is caused by eddies, and the area must be horizontal and uniform. Moreover, the lack of energy balance closure in EC measurements needs particular attention since the gap can be up to 30 % of available energy (Wilson et al., 2002; Vendrame et al., 2020; Bambach et al., 2022; Allen et al., 2011b). The problem is due to the scale mismatch of energy balance components and unaccounted exchange fluxes on heterogeneous landscapes (Foken, 2008).
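The closure gap mentioned above can be quantified from the four measured energy balance components. The sketch below computes the closure ratio and, as one commonly applied option, a correction that forces closure while preserving the measured Bowen ratio. The function names and flux values are hypothetical, and the correction is shown as an assumption rather than as a recommended practice, since its suitability depends on the cause of the imbalance.

```python
def energy_balance_closure(rn, g, h, le):
    """Closure ratio and gap for one averaging period (all fluxes in W/m2)."""
    available = rn - g              # available energy
    turbulent = h + le              # sum of turbulent fluxes measured by EC
    ratio = turbulent / available
    gap_fraction = 1.0 - ratio      # share of available energy unaccounted for
    return ratio, gap_fraction

def bowen_ratio_corrected_le(rn, g, h, le):
    """Force closure while preserving the measured Bowen ratio H/LE.

    Assumes the imbalance is shared by H and LE in proportion to their size.
    """
    bowen = h / le
    return (rn - g) / (1.0 + bowen)

# Hypothetical half-hourly fluxes in W/m2
ratio, gap = energy_balance_closure(rn=500.0, g=50.0, h=120.0, le=240.0)
le_closed = bowen_ratio_corrected_le(rn=500.0, g=50.0, h=120.0, le=240.0)
```

In this example, 20 % of the available energy is unaccounted for, and the corrected latent heat flux is 25 % higher than the measured one, which illustrates how strongly the treatment of closure can change the reference used for validation.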
Dealing with scale mismatch and uncertainty of reference in situ measurements is challenging, and there is no consistent method in the reviewed literature. Some studies only mentioned these issues when discussing the validation result. The information about the spatial support and uncertainty of in situ measurements is not always available to researchers if they acquire reference data from other sources. However, without reporting the spatial support and uncertainty of measurements, we might easily draw biased conclusions: when the validation results are good, we conclude that the model is good without questioning the quality of our reference, but when the results are not so good, we conclude that it is because of the imperfect reference measurements and that the model still is good. Hence, it is important to accompany validation results with the best knowledge about the uncertainty and scale mismatch of reference datasets.

Using the residual of the water balance as the validation reference
ET of an area can be estimated as the residual of the water balance (WB) when the inflow (e.g., precipitation, irrigation supply), change in storage, and outflows of water (e.g., runoff, water conveyance) of that area are known. This approach is mainly used for assessment at a river basin scale.
It assumes that the residual from the basin WB should be the total ET of the basin: ET = P − Q − dS/dt, where P is precipitation, Q is river discharge, and dS/dt is the total change in basin water storage over the time period.This water balance approach assumes that there are no other water inflows or outflows across the catchment boundary.In some studies, dS/dt is assumed to be negligible over a long period of time (a year or longer), which results in a more simplified water balance ET = P − Q.
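The residual calculation can be sketched as follows for a time series of basin-average fluxes. The function name and the numbers are hypothetical; omitting the storage term reproduces the simplified balance ET = P − Q.

```python
def water_balance_et(precipitation, discharge, storage_change=None):
    """Basin ET as the water balance residual, per time step (e.g., mm/month).

    If storage_change is None, dS/dt is assumed negligible (ET = P - Q).
    """
    if storage_change is None:
        storage_change = [0.0] * len(precipitation)
    if not (len(precipitation) == len(discharge) == len(storage_change)):
        raise ValueError("all series must have the same length")
    return [p - q - ds for p, q, ds in zip(precipitation, discharge, storage_change)]

# Hypothetical basin-average values in mm/month
p = [120.0, 80.0, 30.0]
q = [40.0, 35.0, 20.0]
ds = [25.0, -10.0, -30.0]
et_full = water_balance_et(p, q, ds)   # full balance with storage change
et_simple = water_balance_et(p, q)     # simplified balance, dS/dt neglected
```

Comparing the two results for the same period shows how much of the reference ET depends on the storage assumption alone.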
For long-term periods (e.g., years, decades), total water storage change (TWSC) over time (dS/dt) is assumed to be zero, such that ET estimates are then validated with only P − Q (e.g., Vinukollu et al., 2011a). However, this assumption does not hold true in many regions of the world where groundwater is being overexploited at an accelerated rate. For short-term periods (i.e., months), TWSC is often estimated from GRACE RS-based total water storage anomaly (TWSA) products. However, the TWSA products only cover the period from 2002, with a gap of 11 months from 2017 to 2018 between the GRACE and GRACE-FO missions. Some techniques have been developed to reconstruct this gap in the GRACE time series (e.g., Yang et al., 2021). However, the uncertainties in gap-filled dS/dt estimates are still less known than uncertainties in the initial estimates from GRACE and GRACE-FO (Boergens et al., 2022).
The uncertainty in ET estimated by this approach depends on the choice and data quality of other variables (e.g., precipitation and river discharge) in the WB (Senay et al., 2011). Lehmann et al. (2022) have compared the residual calculated from 1694 combinations of P, Q, and ET datasets with dS/dt derived from GRACE and found that none of these combinations can close the WB in all tested basins. They also suggested that using some combinations of P, Q, and ET datasets cancels out their errors in the GRACE-based WB. Because of the errors in the P, Q, and dS/dt components, studies that use WB-derived ET as a reference to validate ET without accounting for uncertainties in the P, Q, and dS/dt components risk biased conclusions.
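As a first-order illustration of why these component errors matter, and under the strong assumption (made here, not in the cited studies) that the errors in P, Q, and dS/dt are independent and unbiased, standard error propagation for a sum of terms gives the standard uncertainty of the residual ET as:

```latex
\sigma_{\mathrm{ET}} = \sqrt{\sigma_P^2 + \sigma_Q^2 + \sigma_{\mathrm{d}S/\mathrm{d}t}^2}
```

Because the terms add in quadrature, the residual cannot be more certain than its least certain component. Correlated errors, such as those implied by the error-cancelling combinations reported by Lehmann et al. (2022), would add covariance terms to this expression.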
In order to account for errors in P, Q, and dS/dt, some researchers have tried to use multiple datasets (e.g., Weerasinghe et al., 2020)

Intercomparison
Intercomparison is the second most widely used method (212 out of 676 studies). In intercomparison, the RS-ET estimates from multiple models are compared without assuming a superior one. This approach is mainly used to evaluate the relative uncertainty of a model compared to others (170 out of 212 studies). Intercomparison has also been used to evaluate other sources of uncertainty. For example, uncertainty from a change in spatial support can be evaluated by comparing model outputs using different input-upscaling methods (e.g., Ershadi et al., 2013; Sharma et al., 2016). Intercomparison has also been used to evaluate uncertainty due to the choice of input datasets (e.g., Long et al., 2011; Wang et al., 2016; Badgley et al., 2015).
Since the RS-ET datasets have both temporal and spatial dimensions, comparing RS-ET models or products is usually done by aggregating over one or two dimensions (i.e., resampling to a lower resolution). The simplest method of intercomparison involves aggregating ET estimates both temporally and spatially into one value (e.g., global annually averaged ET) and then comparing this value from different models or products (e.g., Mueller et al., 2013; Pan et al., 2020). Other methods of intercomparison involve comparing time series of spatially aggregated ET (e.g., monthly basin-scale ET). Aggregating over one of the two spatial dimensions is sometimes applied (e.g., Pan et al., 2020; Chen et al., 2019). The time series can also be aggregated by land cover classes (e.g., Weerasinghe et al., 2020) or climate zones (e.g., Trambauer et al., 2014), describing how RS-ET uncertainty varies under different conditions. For spatial intercomparison, temporally aggregated RS-ET maps can be compared visually (e.g., Weerasinghe et al., 2020) or by using simple map algebra (e.g., M. Jung et al., 2019). Only a few studies have applied metrics to evaluate the spatial similarity between two datasets, such as the spatial efficiency metric (SPAEF) (Stisen et al., 2021; H. C. Jung et al., 2019) and the degree correlation measure of spherical harmonic coefficients (López et al., 2017). None of these methods can characterize uncertainty in RS-ET fully; thus, combining them would provide a more comprehensive intercomparison.
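The two most common aggregation routes, spatial averaging to obtain a time series and temporal averaging to obtain a map, can be sketched as follows. The example assumes two hypothetical products that share the same grid and time steps and contain no gaps; real products would first need resampling to a common grid and a common mask. All names and values are illustrative.

```python
def spatial_mean_series(cube):
    """Collapse a [time][row][col] ET cube to one spatial mean per time step."""
    series = []
    for grid in cube:
        values = [v for row in grid for v in row]
        series.append(sum(values) / len(values))
    return series

def temporal_mean_map(cube):
    """Collapse a [time][row][col] ET cube to one temporal mean per pixel."""
    n_times, n_rows, n_cols = len(cube), len(cube[0]), len(cube[0][0])
    return [[sum(cube[t][i][j] for t in range(n_times)) / n_times
             for j in range(n_cols)] for i in range(n_rows)]

# Two hypothetical products on the same 2 x 2 grid for two months (mm/month)
product_a = [[[60.0, 62.0], [58.0, 60.0]], [[70.0, 72.0], [68.0, 66.0]]]
product_b = [[[55.0, 65.0], [57.0, 59.0]], [[75.0, 70.0], [64.0, 67.0]]]

# Time series comparison of spatially aggregated ET
series_diff = [a - b for a, b in zip(spatial_mean_series(product_a),
                                     spatial_mean_series(product_b))]

# Map algebra on temporally aggregated ET
map_a, map_b = temporal_mean_map(product_a), temporal_mean_map(product_b)
map_diff = [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(map_a, map_b)]
```

In this example the products agree exactly in the second month when averaged spatially, while the difference map still shows pixel-level disagreement, which illustrates why a single aggregation route gives an incomplete picture.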

Sensitivity analysis
Sensitivity analysis is the third most used approach in the reviewed literature but is applied in only a small proportion of the reviewed studies (61 out of 676 articles). Of these, only seven studies applied global sensitivity analysis (GSA). Sobol's (2001) method was applied to the parameters of the MODIS16 algorithm (Zhang et al., 2019), the TSEB model (Burchard-Levine et al., 2020), and three RS-ET models (PT-DTsR, MODIS16 algorithm, and PML) (Cao et al., 2021). This method was also applied to input variables of RS-ET models alone (e.g., Gomis-Cebolla et al., 2019). Elhag (2016) applied a similar variance-based sensitivity measure for the SEBS model but did not refer to Sobol's method. The extended Fourier amplitude sensitivity test has also been applied for GSA (García et al., 2013). This limited number of studies shows that the application of GSA to RS-ET models has been under-researched during the last decade despite the importance of GSA in environmental modeling (Saltelli et al., 2021).
The majority of articles that applied sensitivity analysis (54 out of 61) did not mention or apply a GSA method and thus were considered to be local sensitivity analyses (LSA). In most of these studies, LSA was done by changing one parameter at a time (one-at-a-time method) and calculating the ratio of the change in ET over the change in parameter (e.g., Long et al., 2011). In the reviewed articles, the one-at-a-time method has been implemented differently in terms of three factors: (1) the selection of parameters for LSA according to their importance, as judged by the researchers; (2) the range of values over which parameters are allowed to vary; and (3) the calculation of sensitivity for specific land covers. This suggests that LSA is influenced by the subjectivity of the researchers.
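The one-at-a-time calculation can be illustrated with a short sketch. The model `toy_et` is a hypothetical Priestley-Taylor-style stand-in, not one of the reviewed RS-ET models, and the baseline values and the 10 % step are assumptions. The sketch reports the relative form of the ratio (relative change in ET per relative change in the parameter), which is one of several possible conventions.

```python
def toy_et(alpha=1.26, delta=0.145, gamma=0.066, rn=12.0, g=1.0, lam=2.45):
    """Hypothetical Priestley-Taylor-style daily ET (mm/d); rn and g in MJ m-2 d-1."""
    return alpha * delta / (delta + gamma) * (rn - g) / lam

def oat_sensitivity(model, base, rel_step=0.10):
    """Relative change in output per relative change in each parameter."""
    y0 = model(**base)
    out = {}
    for name, value in base.items():
        # Perturb one parameter while all others stay at their baseline values.
        perturbed = dict(base, **{name: value * (1 + rel_step)})
        out[name] = ((model(**perturbed) - y0) / y0) / rel_step
    return out

base = {"alpha": 1.26, "delta": 0.145, "gamma": 0.066, "rn": 12.0, "g": 1.0}
sens = oat_sensitivity(toy_et, base)
for name, value in sens.items():
    print(f"{name}: {value:+.3f}")
```

The result depends on which parameters are placed in `base` and on `rel_step`, which mirrors the subjectivity noted above.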

Evaluation of input data
The uncertainties of key input datasets are sometimes evaluated by researchers in studies that assess uncertainty in RS-ET without explicitly being propagated to model outputs.This approach ranked fourth in the number of articles, with 20 out of 676.The key input datasets considered by researchers include air temperature, incoming shortwave radiation, incoming longwave radiation, wind speed, and land surface temperature (e.g., Vinukollu et al., 2011b;Pardo et al., 2014;Peng et al., 2016;Li et al., 2017).Input datasets were evaluated through validation with their in situ counterpart.Although other input datasets like vegetation indices are also important in RS-ET models, the in situ measurements of these are often not available for evaluation (Vinukollu et al., 2011b).Some of the forcing datasets of RS-ET models are not remotely sensed data but are products from atmospheric data assimilation systems (e.g., Global Land Data Assimilation System (GLDAS) and ECMWF atmospheric reanalysis (ERA)), which are sometimes provided with uncertainty estimates from data providers.Evaluating the input data provides crucial a priori information for propagating uncertainty to ET estimates.Furthermore, even if uncertainty propagation is not conducted, these assessments can help to identify sources of uncertainty in RS-ET; as the saying goes, "garbage in, garbage out".

Uncertainty propagation
Only 8 out of 676 articles applied the uncertainty propagation approach, mainly the Monte Carlo methods (MCMs), to evaluate uncertainty in RS-ET.In MCMs, the model inputs are randomly sampled from their distributions and fed into the model to generate outputs repeatedly.The variance of the output distribution will then be considered to be the uncertainty in the model output (i.e., ET estimate) associated with the input variables.The limited application of uncertainty propagation can be attributed to its complexity and computational demand.Sensitivity analysis and uncertainty propagation are ideally carried out in tandem (Crosetto et al., 2001;Saltelli et al., 2019), but only five out of eight articles combined these approaches.The uncertainty propagation approach was also used for investigations beyond uncertainty quantification.For example, Talsma et al. (2018) used MCM to determine the uncertainties in ET partitioning (i.e., soil evaporation, interception, and transpiration) in three RS-ET models (MOD16, PT-JPL, and GLEAM) due to the relative uncertainty in the key variables.
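A minimal sketch of Monte Carlo propagation at a single location follows. The model `toy_et` is a hypothetical stand-in that uses the FAO-56 expressions for saturation vapour pressure and its slope; the Gaussian input distributions, their parameters, and the sample size are assumptions chosen for illustration.

```python
import numpy as np

def toy_et(rn, ta):
    """Hypothetical ET model (mm/d): Priestley-Taylor form, slope from air temperature."""
    es = 0.6108 * np.exp(17.27 * ta / (ta + 237.3))   # saturation vapour pressure, kPa
    delta = 4098.0 * es / (ta + 237.3) ** 2           # slope of the curve, kPa per degC
    return 1.26 * delta / (delta + 0.066) * rn / 2.45

rng = np.random.default_rng(42)
n = 10_000
rn = rng.normal(12.0, 1.5, n)   # assumed distribution of net radiation, MJ m-2 d-1
ta = rng.normal(25.0, 2.0, n)   # assumed distribution of air temperature, degC

et = toy_et(rn, ta)             # one model run per sampled input pair
et_mean, et_std = et.mean(), et.std(ddof=1)
p05, p95 = np.percentile(et, [5, 95])
```

Reporting percentiles alongside the standard deviation is useful when the output distribution is not Gaussian.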
In the reviewed studies, uncertainty propagation was done only at one or a few fixed locations by assuming the probability distribution of the input variables and then simulating a range of ET values at these locations. This approach is computationally inexpensive but does not fully characterize uncertainties in a spatial field of ET. To fully quantify uncertainty in a scene, Cawse-Nicholson et al. (2020) introduced a method based on MCM and spatial-statistical models (Cressie, 1993). With this method, the probability distribution of ET per pixel in a satellite scene can be quantified and presented as percentile maps. This distribution was almost always non-Gaussian for all pixels in ET scenes, which means simple linear error propagation is not possible (Cawse-Nicholson et al., 2020). Future studies of RS-ET would benefit from the development of new methods to quantify uncertainty spatially.

Triple collocation and three-cornered hat method
The three-cornered hat method (TCH) (Premoli and Tavella, 1993) and triple collocation (TC) (Stoffelen, 1998;McColl et al., 2014) are related to the intercomparison approach in the sense that these techniques assess the relative uncertainty of three datasets without assuming one is the best.Therefore, these techniques are useful when there is a lack of a high-quality reference dataset.Both TC and TCH methods require a set of three datasets with the assumption that their errors are independent (Sjoberg et al., 2021).The difference between TCH and TC is that TC can only be used to assess uncertainties of uncorrelated datasets, while TCH can be used when there are correlations with proper constraints (Xu et al., 2019;Sjoberg et al., 2021).However, to date, few studies have evaluated uncertainties in RS-ET using TC (Miralles et al., 2011b;Barraza Bernadas et al., 2018;Khan et al., 2018;and Kibria et al., 2021) and TCH (Long et al., 2014;Xu et al., 2019;and He et al., 2020).The proportion of studies that used these methods is less than 2 % of the total reviewed articles and is not increasing (Fig. 6).This low adoption might be attributed to the limitations of these methods: (1) the lack of information about biases and the estimation of only random errors (e.g., RMSE, standard deviation, or variances), (2) the required conditions to achieve reliable error estimates (large samples, similar scales and magnitudes of errors between datasets) (Sjoberg et al., 2021), and (3) the reliability of TCH as an alternative to direct validation (Wu et al., 2019b).
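The covariance-based form of triple collocation can be sketched on synthetic data as follows. The "true" ET series, the scaling factors, and the error levels are all made up for the example, and the errors are generated independently so that the method's assumption holds by construction.

```python
import numpy as np

def triple_collocation(x, y, z):
    """Covariance-based TC: random-error variances of three collocated series."""
    c = np.cov(np.vstack([x, y, z]))
    var_x = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    var_y = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    var_z = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return var_x, var_y, var_z

rng = np.random.default_rng(1)
truth = rng.gamma(4.0, 0.75, 50_000)                 # synthetic "true" ET, mm/d
x = truth + rng.normal(0, 0.3, truth.size)
y = 0.2 + 0.9 * truth + rng.normal(0, 0.5, truth.size)
z = -0.1 + 1.1 * truth + rng.normal(0, 0.8, truth.size)

err_std = np.sqrt(triple_collocation(x, y, z))       # close to 0.3, 0.5, 0.8
```

The additive and multiplicative biases built into `y` and `z` do not appear in the result, which illustrates limitation (1): only random errors are estimated.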

Physical consistency
Physical consistency can be understood as the plausibility that an ET estimate is consistent with the physical conditions or characteristics of the area it represents.A consistency check or physical validation was proposed by Zeng et al. (2015) as the final step in a general validation process for big remote sensing datasets.When there are limited reference data and ground-based measurements, physical validation is critical to assess the quality of data products (Blatchford et al., 2020).Although physical validation does not quantify uncertainty using metrics, it provides an evaluation of the data quality.This is useful for identifying the regions and conditions in which RS estimates are more uncertain and where more effort in direct validation approaches is required.
Only six studies in the selected literature have attempted to quantify this plausibility (Fig. 6), but they defined physical consistency differently.For example, Rwasoka et al. (2011) used FAO Penman-Monteith potential ET estimates as a threshold to decide whether ET estimates from the SEBS model were physically inconsistent.Blatchford et al. (2020) used the ET/P ratio and water availability (P − Q) to evaluate the physical consistency of the WaPOR ET product.López et al. (2017) developed a technique to assess the hydrological consistency of ET by transforming both ET and P data into spherical harmonics and then using spherical harmonic coefficients to calculate the degree correlation.These studies are not the same as validating RS-ET with P − Q or P − Q − dS/dt as discussed previously since these residuals were not considered to be the best reference of ET.
Another method to assess physical plausibility without an explicit water balance is through the Budyko curve. The Budyko curve describes the semi-empirical relationship between long-term ET and its limiting factors, i.e., precipitation and potential ET (PET), for river basins (Budyko, 1974). Koppa and Gebremichael (2017) validated the physical consistency of ET by calculating the RMSE of the Euclidean distance between the data points and the Budyko curve in ET/P and PET/P space. Weerasinghe et al. (2020) simply calculated the mean difference (bias) between RS-ET and Budyko-derived ET to evaluate which RS-ET product exceeds the energy and water limit defined by the Budyko curve. They also noticed that, if a data point does not align with the Budyko curve, it might also mean that the ET of the basin exceeds the water or energy limit, for example, due to human activities. Therefore, the interpretation of physical plausibility needs to consider the actual knowledge about water resources in the basin instead of focusing only on model-generated numbers.
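A Budyko-based plausibility check in the spirit of the bias comparison described above can be sketched as follows. The curve is written in its commonly cited form, ET/P = sqrt(φ tanh(1/φ) (1 − exp(−φ))) with φ = PET/P; the basin values are hypothetical, and the code does not reproduce the exact procedure of the cited studies.

```python
import numpy as np

def budyko_et(p, pet):
    """Long-term ET from the Budyko curve, in the same units as P."""
    phi = pet / p                                     # aridity index PET/P
    return p * np.sqrt(phi * np.tanh(1.0 / phi) * (1.0 - np.exp(-phi)))

# Hypothetical basin-scale long-term means, mm/yr
p = np.array([1200.0, 600.0, 300.0])
pet = np.array([900.0, 1200.0, 1800.0])
rs_et = np.array([700.0, 520.0, 330.0])               # made-up RS-ET values

et_budyko = budyko_et(p, pet)
bias = rs_et - et_budyko                              # RS-ET minus Budyko-derived ET
exceeds_water_limit = rs_et > p                       # ET larger than P
exceeds_energy_limit = rs_et > pet                    # ET larger than PET
```

A flagged basin, such as the third one here, is not necessarily an error: as noted above, irrigation or other human activities can push actual ET beyond precipitation.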

Using ensemble of RS-ET estimates
Intercomparison studies sometimes lead to ensemble-mean products of all available products on the basis of the assumption that no model performs best so an ensemble of them would be preferable (Bhattarai et al., 2019;Elnashar et al., 2021).This approach has been used the least in the reviewed articles (Fig. 6).Some researchers have evaluated the uncertainty in an ensemble (a set) of RS-ET estimates from different models by calculating the average and range of all members in the ensemble (Vinukollu et al., 2011b;Elnashar et al., 2021;Guo et al., 2020).This approach is the same as the multi-model ensembles in climate modeling.The model structural uncertainty can only be quantified if independent models are sampled from the entire possible model space and avoid the over-representation of one model structure (Abramowitz and Gupta, 2008).For example, Vinukollu et al. (2011b) selected three RS-ET models, namely SEBS (Su, 2002), PM-Mu or MODIS16 (Mu et al., 2007), and PT-Fi or PT-JPL (Fisher et al., 2008), which are based on distinct equations used to estimate ET.
Using an ensemble of RS-ET estimates provides uncertainties of the ensemble but not of each individual member of the ensemble. Thus, some studies went further by merging the datasets of the ensemble and calculating the difference between this merged dataset and each ensemble member (Baik et al., 2018; Elnashar et al., 2021). If all the ET products are simply averaged, the biases of different models can cancel out in regions where the models perform differently but accumulate in regions where they perform in the same manner. Hence, the ensemble products may arguably produce better estimation in some areas (Yao et al., 2017) but not a better understanding of the physical processes and drivers needed to improve RS-ET (Yao et al., 2014; Zhang et al., 2016). Therefore, it is considered to be more useful to use the range of the ensemble to identify the outlier data products or the uncertainty of all data products.
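The ensemble mean, the ensemble range, and the deviation of each member from the merged product can be computed as in the sketch below. The three products and four pixels are invented values, and simple averaging is used as the merging rule for illustration only.

```python
import numpy as np

# Hypothetical ensemble: 3 products x 4 pixels, mm/d
ensemble = np.array([
    [2.1, 3.0, 0.8, 4.2],
    [2.4, 2.7, 1.1, 4.0],
    [1.8, 3.3, 2.9, 4.1],
])

ens_mean = ensemble.mean(axis=0)                          # merged product (simple average)
ens_range = ensemble.max(axis=0) - ensemble.min(axis=0)   # spread per pixel
deviation = ensemble - ens_mean                           # each member vs. the merged product

widest_pixel = int(np.argmax(ens_range))                  # where products disagree most
outlier_member = int(np.argmax(np.abs(deviation[:, widest_pixel])))
```

Here the range singles out the third pixel, and the deviations show that the third product drives the disagreement there.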

Context of RS-ET uncertainty assessment
The context in which the uncertainty of RS-ET is assessed determines which method is selected and how it is applied.This context includes the objective of the RS-ET estimates, the spatial and temporal support at which ET is assessed, the geographic location, and the availability of reference datasets.This section describes the context in which 676 reviewed articles assessed uncertainties in RS-ET.

Objectives of the reviewed articles
The review shows that uncertainties in RS-ET estimates were assessed at all stages, from developing a new model to evaluating its data product.Uncertainty in RS-ET was assessed in the context of model implementation (34 % of reviewed articles), model development (13 % of all reviewed articles), model improvement (17 %), model evaluation (19 %), and product evaluation (16 %) (Fig. 8).Here, model implementation means that a pre-existing model was applied to new case studies or to achieve some specific research objective without considerable modification or further development of the model.The prominence of model implementation as the main objective in the reviewed articles could be due to a perceived need to assess the uncertainty of RS-ET estimates for each application despite previous validation.This is an important attitude in the research community since it helps to provide feedback on appropriate application and improvement of RS-ET models.Therefore, studies in the context of model implementation should not be overlooked.

Sources of uncertainty evaluated
The reviewed articles evaluated all sources of uncertainty as categorized in the theoretical framework (Fig. 3), with a strong focus on compound uncertainty. Figure 9 shows that the majority (406 out of 676) of reviewed articles assessed only compound uncertainty without disaggregating into other sources. The second largest set of articles assessed both compound uncertainty and the relative uncertainty of RS-ET estimates. Other sources of uncertainty are evaluated far less often in the selected literature. According to the number of articles in each set (Fig. 9), the level of interest in different sources of uncertainty can be ranked as follows: compound, relative, input data, model parameterization, change of spatial support, change of temporal support, and finally gap filling. This does not necessarily show the ranking of importance of the uncertainty sources but rather the availability of methods and the data needed to assess them.
The uncertainties due to temporal upscaling are affected by several factors related to location (Jiang et al., 2021). These factors include vegetation cover, soil moisture (Gentine et al., 2007; Hoedjes et al., 2008), cloud coverage (Van Niel et al., 2012), cloud frequency (Xu et al., 2015), air pollution effects (Zhang et al., 2013), the return interval of the satellite (Alfieri et al., 2017), the timing of the overpass (Jiang et al., 2021), and the number of instantaneous values used for upscaling (Liu, 2021). Consequently, applying a single temporal-upscaling method for the entire globe results in spatially varying uncertainties in RS-ET estimates.

Spatial and temporal support of uncertainty assessment
Uncertainties in RS-ET estimates are specific for different spatial and temporal supports. The reviewed studies evaluated RS-ET uncertainties at spatial supports ranging from less than 100 m up to global and temporal supports ranging from sub-daily to annual (Fig. 10). Most studies evaluated RS-ET uncertainties at spatial supports of 500 m to 5 km (268 out of 676) and less than 100 m (191 out of 676). This can be attributed to the availability of RS datasets that are widely used to estimate ET, such as MODIS (250 m to 1 km) and Landsat (30 to 100 m). In the case of validation, the spatial support of uncertainty assessment was determined by the spatial support of the ground truth reference. For temporal support, uncertainty was mostly evaluated by daily ET (365 out of 676), although RS datasets provide observations at the time of the satellite overpass with a temporal resolution of 5-16 d. This shows that the temporal support of uncertainty assessment is driven more by practical needs and less by the availability of datasets.

Geographical distribution
Assessment of RS-ET uncertainties is not evenly distributed over the globe. The number of articles per country where uncertainties in RS-ET were assessed is shown in Fig. 11. Each article was tagged by the country where the sites of study are located. The highest number of articles assessed ET in China. Because the most common approach is validation and the most common reference used is EC measurements, ET was mainly assessed where there are EC stations (i.e., AmeriFlux, AsiaFlux, ChinaFlux, OzFlux, EuroFlux, FLUXNET).
Even when the studies aimed to validate RS-ET globally, the estimated uncertainty is not universal since these networks do not cover many regions. These studies were also included in Fig. 11. Based on its popularity, EC can be considered to be the de facto standard ET estimation approach for validation of RS-ET. However, this popularity is mainly driven by the number of publications in countries where EC towers are more densely distributed (e.g., China and the United States of America). In countries where there are very few or no EC towers available, the most common reference used for validation of RS-ET is the water balance method (Fig. 12). In a few countries in northern Africa and the Middle East, the most common method is to use the FAO-56 method (Allen et

https://doi.org/10.5194/hess-27-4505-2023 Hydrol. Earth Syst. Sci., 27, 4505-4528, 2023

Results of RS-ET uncertainty assessment

Uses of uncertainty metrics
The reviewed articles that assess uncertainty in RS-ET mainly report accuracy (RMSE), bias (mean error), and the goodness of fit with a reference dataset (R²) (Fig. 13). Although quantifiable uncertainty in measurement is theoretically represented as a probability distribution, this has rarely been done in the literature. The reviewed studies used a wide range of metrics to report their uncertainty assessment (33 metrics). Most studies used 3 metrics, while some used up to 12. Larger numbers of metrics provide more descriptions of uncertainty, but some metrics might be challenging to interpret. Root-mean-square error (RMSE) is the most widely used metric in the reviewed articles (531 out of 676 articles). Metrics related to RMSE include normalized RMSE (normalized by standard deviation) and relative RMSE (as a percentage of mean ET). Very few studies (17 articles) used modified forms of RMSE to report more robust results or to separate random from systematic errors, such as robust RMSE (Bisquert et al., 2016), systematic and unsystematic RMSE (Yebra et al., 2013), and biased and unbiased RMSE (Martens et al., 2017). RMSE has the unit of the estimates, so it can be expressed in millimeters (mm) for ET or watts per square meter (W m⁻²) for latent heat flux. Therefore, to compare reported RMSEs between different studies, unit conversion is needed.
Inconsistent use of metrics such as R² might cause misinterpretation of results, especially when comparing studies. For example, the second most used evaluation metric was referred to using many names, including mean error, mean difference, bias error, or bias. Meanwhile, the coefficient of determination (R²) has the opposite issue in which the same term was used with different formulas. R² is a measure of the goodness of fit for regression models. There are at least eight formulas for R² in the literature (Kvålseth, 1985), but only one formula can be used for any type of model fitting (i.e., R₁² in Kvålseth, 1985). Since many studies did not report which formula they used, we did not distinguish between different R² formulas in Fig. 13. Nevertheless, we observed that at least four different formulas of R² were used in the reviewed articles, including the squared coefficient of correlation (Table S4).
No matter which metrics are used, the validation metrics that compare estimates with references only represent the actual error if the reference is the absolute truth. This is never the case because in situ measurements and upscaling methods are never perfect. Wu et al. (2019a) suggested that validation should be performed in conjunction with uncertainty associated with in situ measurements and the statistical significance of performance metrics.
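The RMSE-related metrics and the unit conversion mentioned above can be illustrated with a short sketch. The function names and sample values are hypothetical; the latent heat of vaporization is taken as approximately 2.45 MJ kg⁻¹, and the conversion assumes that the flux is constant over the day.

```python
import numpy as np

LAMBDA = 2.45  # latent heat of vaporization, MJ/kg (approximate value, assumed constant)

def le_to_mm_per_day(le_wm2):
    """Convert latent heat flux (W m-2) to ET (mm/d), assuming a constant daily rate."""
    return le_wm2 * 0.0864 / LAMBDA      # W m-2 -> MJ m-2 d-1 -> mm/d

def validation_metrics(est, ref):
    """RMSE, bias, and RMSE-derived metrics of estimates against a reference."""
    err = est - ref
    rmse = np.sqrt(np.mean(err ** 2))
    bias = np.mean(err)
    return {
        "rmse": rmse,
        "bias": bias,
        "unbiased_rmse": np.sqrt(rmse ** 2 - bias ** 2),   # random part of the error
        "relative_rmse_pct": 100 * rmse / np.mean(ref),    # percentage of mean reference ET
        "nrmse_sd": rmse / np.std(ref),                    # normalized by standard deviation
    }

ref = np.array([2.0, 3.0, 4.0, 5.0])     # made-up reference ET, mm/d
est = np.array([2.5, 3.5, 4.0, 6.0])     # made-up RS-ET estimates, mm/d
m = validation_metrics(est, ref)
```

Reporting the relative RMSE together with the mean reference ET allows values from sites with different ET magnitudes to be compared.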
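How much the choice of formula matters can be shown with two of the possible definitions. The sketch below compares the squared correlation coefficient with the form 1 − SS_res/SS_tot computed directly from estimates and reference; the data are invented so that the estimate is perfectly correlated with the reference but strongly biased.

```python
import numpy as np

def r2_squared_correlation(est, ref):
    """Squared Pearson correlation coefficient."""
    return np.corrcoef(est, ref)[0, 1] ** 2

def r2_one_minus_residual(est, ref):
    """1 - SS_res / SS_tot, with residuals taken directly as ref - est."""
    ss_res = np.sum((ref - est) ** 2)
    ss_tot = np.sum((ref - np.mean(ref)) ** 2)
    return 1.0 - ss_res / ss_tot

ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
est = 2.0 * ref + 1.0        # perfectly correlated but strongly biased estimate

r2_corr = r2_squared_correlation(est, ref)    # equals 1 for any perfect linear relation
r2_resid = r2_one_minus_residual(est, ref)    # negative here because of the bias
```

Two studies reporting "R²" for the same data could therefore publish very different numbers, which is why the formula used should be stated.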

Synthesizing reported RS-ET uncertainty from reviewed studies
Although there are a large number of papers assessing the uncertainty of RS-ET data, only a few attempted to synthesize their results. For example, a review by Karimi and Bastiaanssen (2015) used meta-analysis (i.e., using statistical methods to synthesize the results of independent studies) to estimate the probability density function of mean absolute percentage error (MAPE) in 46 studies that validate RS-ET estimates in seasonal cycles. Kalma et al. (2008) synthesized 30 studies. These syntheses are limited in terms of the number of studies (< 50) and the fact that the selection of studies was not systematic. Another limitation of synthesizing these results is that the selected studies used different validation data and field instruments, which do not have equivalent spatial support and accuracy. Synthesizing results of the reviewed articles in this study will provide a useful reference for future studies to evaluate the results of RS-ET uncertainty assessment. For a meta-analysis, selected studies should use the same validation data and report the same metric; thus, we selected the most used validation data and metric. Since the majority of studies used EC flux towers and RMSE to report uncertainty (372 out of 676), we selected these studies for meta-analysis of reported RS-ET uncertainty. From 372 articles, 317 articles that reported the RMSE of RS-ET from validation with EC flux towers were included. The remainder were excluded because the RMSE was not reported in figures with extractable values (Fig. 4). RMSE values in units other than millimeters per day (mm d⁻¹) were converted to millimeters per day assuming a constant rate of ET over the temporal support. For example, 365 mm yr⁻¹ was converted to 1 mm d⁻¹, and 0.1 mm h⁻¹ was converted to 2.4 mm d⁻¹.
The reported RMSE values for daily ET (N = 3167) range from 0.01 to 6.65 mm d⁻¹, with a mean value of 1.18 mm d⁻¹ (Table 3), which is comparable with RMSEs previously reported by Kalma et al. (2008). When converting RMSE values from the reported unit to a common unit of millimeters per day, the mean RMSE is the highest for the validation of instantaneous RS-ET (2.81 mm d⁻¹) and the lowest for monthly RS-ET (0.78 mm d⁻¹). In general, studies with larger temporal support of validation have lower mean RMSEs in millimeters per day. For the validation at temporal supports of 3 h, 10 d, and weekly, less can be concluded due to the small number of studies and records. Overall, the decrease in RMSE with increasing temporal support is due to the averaging and corrective effect of temporal upscaling. Therefore, improving temporal-upscaling and gap-filling methods is crucial for reducing uncertainty in RS-ET estimates.
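The constant-rate conversion can be written as a small helper. The function name, the unit labels, and the 30-day month are assumptions of this sketch; the two worked examples from the text serve as checks.

```python
HOURS_PER_DAY = 24
DAYS_PER = {"d": 1, "week": 7, "month": 30, "yr": 365}   # 30-day month is an assumption

def rmse_to_mm_per_day(value, unit):
    """Convert an RMSE given in mm per <unit> to mm/d, assuming a constant ET rate."""
    if unit == "h":
        return value * HOURS_PER_DAY
    return value / DAYS_PER[unit]

print(rmse_to_mm_per_day(365, "yr"))   # 1.0
print(rmse_to_mm_per_day(0.1, "h"))    # approximately 2.4
```

Because ET is not constant over a day, values converted from hourly or instantaneous estimates are likely to overstate the daily error, so they are best compared only in relative terms.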
Figure 14 shows that very high RMSE values were mainly from validation approaches that used a single EC site.Validation using data from a greater number of EC sites tends to yield lower RMSE values.This might be attributed to the fact that, when papers only report average RMSE values across multiple EC sites, the average RMSE is lower than the highest individual RMSE.Moreover, the random errors at each site are likely to be uncorrelated or partially cancel each other out when averaging them, which further reduces the overall RMSE.As RMSE is inherently dependent on the scale of ET, sites with lower ET values or the practice of averaging ET across multiple sites are more likely to exhibit lower average RMSE values.Unfortunately, only a limited number of studies provided information on the relative RMSE or the average ET corresponding to RMSE values, which hindered the derivation of scale-independent RMSE values across all studies.
The control factors of ET and the uncertainty in their estimates are not the same globally (Zhang et al., 2016).As the distribution of validation sites is concentrated in regions where EC flux towers are available (Fig. 12), the results of the validation are thus not necessarily transferable to other areas.Therefore, when interpreting the uncertainty of RS-ET based on validation, we should consider the validation metrics at each site individually and the variation of these metrics among all locations.
The large range of RMSE values obtained from the meta-analysis can be explained by the diversity of reviewed studies in terms of models, resampling methods, and validation context (e.g., temporal scales, land cover, climate, amount of data). For example, some studies validate RS-ET estimates from global products, while others validate RS-ET estimates from models that were calibrated to reduce RMSE. Moreover, many studies reported the RMSE of latent heat flux (in W m⁻² or MJ m⁻² d⁻¹) averaged from estimates at the time of satellite overpass. The accuracy of RS-ET varies at different times of the day due to weather conditions, so these overpass-time values are not representative of the entire day. We converted these values to millimeters per day (mm d⁻¹) (Table 3) only for comparison between different temporal supports. The range of RMSEs presented in Fig. 14 and Table 3 should only be considered as a baseline for typical errors in RS-ET. Using only RMSE to compare RS-ET model performance across different studies or validation sites is not recommended.

Summary
This paper identifies and appraises methods for uncertainty assessment of RS-ET estimates by applying a systematic quantitative literature review approach. The majority of reviewed articles assess uncertainty in RS-ET estimates by validation against EC measurements. In regions where in situ measurements are limited, most studies used the residual of the water balance as a reference for validation. Making use of existing EC networks is important for global validation of RS-ET estimates. However, there is still a gap in the availability of in situ data for global validation as most data are concentrated in North America, eastern Asia, and Europe. Moreover, the challenges in terms of energy balance closure and scale mismatch persist through the reviewed studies. The future of RS-ET is geared toward enhancing spatiotemporal resolutions (Fisher et al., 2017)

Since the uncertainty in RS-ET in the literature is most often reported in terms of the RMSE of RS-ET estimates compared to EC observations, we provide the typical range of uncertainty in RS-ET based on a meta-analysis of 317 articles that reported this metric. RMSE varies widely among studies due to different models, resampling methods, and site conditions. Moreover, validation with multiple sites reported a lower average and smaller variation in RMSE values than validation at a single site. While RMSE stands as the most commonly employed metric in the literature, it is unsuitable for comparing uncertainties in RS-ET across different studies due to its inherent scale dependency. Therefore, validation metrics only reflect the uncertainty of RS-ET at specific locations. The RMSE range reported in our study should be used only as a baseline for future studies that validate RS-ET estimates using EC.
Comparing the performance of RS-ET models and investigating the sources and geographical distribution of uncertainty in their ET estimates remains an important research endeavor for many applications.Global assessments provide a broad perspective on RS-ET uncertainties by considering factors that affect data quality on a large scale, such as satellite sensor characteristics, model characteristics, and geographical and climatic factors.Local assessments, on the other hand, focus on specific study areas, which may have unique conditions and sources of uncertainty that are overlooked in global assessments.Therefore, future research should combine local and global evaluation efforts.
For validation of RS-ET estimates with in situ methods, we provide specific recommendations:
- The uncertainty of the reference datasets, including corrections for surface energy balance closure, should be evaluated and reported.
- RS-ET estimates should be converted to values at the temporal and spatial scales of reference datasets.
- The four common metrics (RMSE, bias or mean error, correlation coefficient, coefficient of determination) and mean ET should be reported in validation studies.
- The statistical significance of validation metrics should be tested, and the number of data points used should be reported.
- In addition, uncertainties in RS-ET estimates should be characterized using multiple metrics that are scale independent to facilitate comparison of RS-ET uncertainty across regions with different ET ranges.
- Validation of RS-ET models and data products should be reported at different levels of spatial and temporal scales, covering multiple locations.
We recommend combining multiple approaches for uncertainty assessment of spatiotemporal RS-ET data when validation datasets are limited.These approaches include intercomparisons, sensitivity analyses, uncertainty propagations, physical consistency checks, evaluations of input, triple collocations, and the ensemble of estimates.Both sensitivity analysis and uncertainty propagation approaches were shown to be useful for the advancement of RS-ET techniques by identifying and quantifying the sources of uncertainty.However, our review shows that there are very few studies that applied sensitivity analysis and uncertainty propagation techniques for RS-ET estimates and that most studies employed less computationally demanding options.This impedes the ability to assess a detailed spatiotemporal distribution of RS-ET uncertainty.Therefore, future research on uncertainty in RS-ET estimates needs to develop or apply more advanced sensitivity analysis and uncertainty propagation methods.Since uncertainty in RS-ET is an attribute of any spatiotemporal dataset, the remaining challenge is to characterize uncertainty spatially and temporally.This means not only quantifying the overall expected errors of the dataset but also identifying where and when high uncertainty is most likely to occur.Several studies have aimed to offer spatially explicit uncertainty in thematic classifications, such as land cover and soil type.These studies, like the ones mentioned by Woodcock (2002), have primarily focused on qualitative mapping techniques.However, for quantitative remote sensing, which involves mapping continuous variables like ET, there is a need for methods that can effectively characterize spatially explicit uncertainty.Therefore, we strongly recommend the development and application of methods to evaluate spatiotemporal uncertainty in RS-ET datasets.
Author contributions.BT and JvdK conceptualized the review approach; BT, JvdK, SSe, MM, and GJ designed the methodology; BT collected and categorized the literature; BT and MM conducted the data collection for meta-analysis; BT analyzed the data; BT and SSe visualized the results; BT wrote the paper draft; BT, JvdK, SSe, MM, GJ, and RU reviewed and edited the paper; GJ, MM, and RU supervised the research activities; MM acquired funding and managed the project.
Competing interests.One of the authors is a member of the editorial board of Hydrology and Earth System Sciences.The peerreview process was guided by an independent editor, and the authors also have no other competing interests to declare.
Disclaimer.Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper.While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Figure 1. Previous literature reviews on RS-ET estimation, uncertainty, and validation of RS-derived data.

Figure 2. Uncertainty as described by the probability distribution of measured values.Adapted from Povey and Grainger (2015) and JCGM (2012).

Figure 3. The sources of uncertainty in ET estimates from the typical workflow in remote-sensing-based models. Compound uncertainty is the aggregation of all uncertainties from input data, the change in the temporal and spatial scale, gap filling, model parameterization, and model conceptualization.

Figure 4. Results of article selection from database search (identification), title and abstract screening (screening), and full-text assessment (eligibility).

Figure 5. Uncertainty assessment approaches used in the reviewed articles (N = 676). The horizontal bar chart displays the number of articles using specific approaches (categories), while the vertical bar chart represents article counts within the intersections of multiple categories. Each vertical bar corresponds to an intersection in the column beneath it. Black circles denote the categories in the respective rows present in the intersection, while gray circles signify categories absent from the intersection. Intersections with fewer than two articles were excluded from the graph for improved presentation. TCH and TC stand for three-cornered hat and triple collocation, respectively. "Others" refers to approaches that are used only once, which are recorded in the Data availability section.

Figure 6. The proportion of reviewed articles per year for each approach to assessing RS-ET uncertainties.

Figure 7. Different reference data and in situ methods used for RS-ET validation in reviewed articles (N = 600). * Other methods for in situ ET estimation include the volumetric soil water content difference (N = 1), canopy temperature and meteorology monitoring systems (N = 1), portable chambers (N = 1), atmometer (N = 1), open-top chambers (N = 1), and crop coefficient methods using reference ET equations other than FAO-56 (N = 1).

Figure 9. The source of uncertainty assessed in reviewed articles (N = 676). The horizontal bar chart displays the number of articles assessing specific sources of uncertainty (categories), while the vertical bar chart represents article counts within the intersections of multiple categories. Each vertical bar corresponds to an intersection in the column beneath it. Black circles denote the categories in the respective rows present in the intersection, while gray circles signify categories absent from the intersection. Intersections with fewer than two articles were excluded from the graph for improved presentation.

Figure 10. Number of articles per range of spatial (a) and temporal (b) support at which uncertainty in RS-ET was assessed (total number of articles N = 676).

Figure 11. Number of articles per country where uncertainties in RS-ET were assessed.

Figure 12. The most common reference used for validation of RS-ET per country.

Figure 13. Number of studies per choice of metric to report uncertainty and the number of metrics used.

Figure 14. RMSE (mm d⁻¹) of RS-ET based on validation with eddy covariance (EC) observations in reviewed articles (N = 348). The scattered dots represent RMSE values reported in articles. The dot color shows the number of EC sites used in validation. The green area under the curve represents the kernel density estimation of the underlying probability distribution. The box-and-whisker plot represents the 5th, 25th, 50th (median), 75th, and 95th percentiles of the distribution. The orange circle inside the box-and-whisker plot represents the mean value.
thanks to progress in thermal infrared missions (e.g., ECOSTRESS, LSTM, SBG, TRISHNA, and HydroSat), along with the development of small satellite constellations (e.g., Landsat Next and Copernicus Contributing Missions). Consequently, there is a need for methods to resample in situ measurements to the spatiotemporal resolution of these satellite systems to assess the uncertainties of RS-ET data derived from these sources.

B. N. Tran et al.: Uncertainty assessment of satellite remote-sensing-based evapotranspiration estimates, Hydrol. Earth Syst. Sci., 27, 4505-4528, https://doi.org/10.5194/hess-27-4505-2023, 2023.

Table 1. Search terms and variants. Search terms were combined using the AND operator, and variants were combined using the OR operator. The asterisk * was used to include similar terms.

Table 2. Categories and subcategories used to organize the included papers.
Recently, Schoups and Nasseri (2021) proposed treating uncertainties in datasets as unknown random variables. Instead of using the WB to determine these uncertainties, they estimated ET (and other water fluxes) by combining WB constraints and uncertainty estimation into a comprehensive probabilistic model. Although only applicable for river basins where the GRACE resolution is suitable, this could be a good direction for future research on these water fluxes.

Table 3. Descriptive statistics of reported RMSE values (in mm d⁻¹) in reviewed articles (N = 348) with validation of RS-ET with EC flux towers.