Robust historical evapotranspiration trends across climate regimes

Evapotranspiration (ET) links the hydrological, energy and carbon cycles on the land surface. Quantifying ET and its spatio-temporal changes is also key to understanding climate extremes such as droughts, heatwaves and flooding. Regional ET estimates require reliable observationbased gridded ET datasets, and while many have been developed using physically based, empirically based and hybrid techniques, their efficacy, and particularly the efficacy of their uncertainty estimates, is difficult to verify. In this work, we extend the methodology used in Hobeichi et al. (2018) to derive two new versions of the Derived Optimal Linear Combination Evapotranspiration (DOLCE) product, with observationally constrained spatio-temporally varying uncertainty estimates, higher spatial resolution, more constituent products and extended temporal coverage (1980–2018). After demonstrating the efficacy of these uncertainty estimates with out-of-sample testing, we derive novel ET climatology clusters for the land surface, based on the magnitude and variability of ET at each location on land. The new clusters include three wet and three dry regimes and provide an approximation of Köppen–Geiger climate classes. The verified uncertainty estimates and extended time period then allow us to examine the robustness of historical trends spatially and in each of these six ET climatology clusters. We find that despite robust decreasing ET trends in some regions these do not correlate with behavioural ET clusters. Each cluster, and the majority of the Earth’s surface, shows clear robust increases in ET over the recent historical period. The new datasets DOLCE V2.1 and DOLCE V3 can be used for benchmarking global ET estimates and for examining ET trends respectively.

based gridded ET datasets, and while many have been developed using physically based, empirically based and hybrid techniques, their efficacy, and particularly the efficacy of their uncertainty estimates, is difficult to verify. In this work, we extend the methodology used in Hobeichi et al. (2018) 10 to derive two new versions of the Derived Optimal Linear Combination Evapotranspiration (DOLCE) product, with observationally constrained spatio-temporally varying uncertainty estimates, higher spatial resolution, more constituent products and extended temporal coverage .  ter demonstrating the efficacy of these uncertainty estimates with out-of-sample testing, we derive novel ET climatology clusters for the land surface, based on the magnitude and variability of ET at each location on land. The new clusters include three wet and three dry regimes and provide an 20 approximation of Köppen-Geiger climate classes. The verified uncertainty estimates and extended time period then allow us to examine the robustness of historical trends spatially and in each of these six ET climatology clusters. We find that despite robust decreasing ET trends in some re-25 gions these do not correlate with behavioural ET clusters. Each cluster, and the majority of the Earth's surface, shows clear robust increases in ET over the recent historical period. The new datasets DOLCE V2.1 and DOLCE V3 can be used for benchmarking global ET estimates and for examining ET 30 trends respectively.

Introduction
Understanding the spatio-temporal variability of evapotranspiration (ET) is a critical part of understanding the processes that lead to high impact weather phenomena, such as 35 droughts (Han et al., 2018;Montano et al., 2015;Sheffield et al., 2012;Teuling et al., 2013), heatwaves (Teuling, 2018; and flooding (Dawdy et al., 1972;Sharma et al., 2018). Several global gridded ET datasets have been developed, using physical schemes with different 40 scopes (e.g. addressing key questions in ecology, hydrology, or other disciplines) and complexity (see Fisher and Koven, 2020) and empirical techniques including machine learning algorithms, typically incorporating a range of remote sensing inputs ( TS3 Alemohammad et al., 2017;Jung et al., 2010Jung et al., , 45 2019. Recently, ET datasets derived with a hybrid approach have been recognised for their potential to outperform singlesource datasets in reducing bias against tower-based eddy covariance ET measurements (Ershadi et al., 2014;Feng et al., 2016;Hobeichi et al., 2018;Jiménez et al., 2018;McCabe et 50 al., 2016).
While most observational products are global (or near global) in their spatial extent, and typically available with a monthly time step, different products are constrained by very different types of observations and vary significantly 55 in their treatment of uncertainty. As detailed below when describing the datasets we use here, "physically based" approaches use equations that represent different physical, chemical, and biological processes and incorporate satellitebased atmospheric forcing and parameterisation of land sur-60 face characteristics, while "empirical" approaches integrate ground-based measurements of ET together with satellite data and ground-based measurements of vegetation characteristics and land surface parameters. These differences result 2 S. Hobeichi et al.: Robust historical evapotranspiration trends across climate regimesTS6 in a diverse group of products and estimates, but it is their approach to deriving uncertainty estimates that is arguably more important.
Very few datasets provide uncertainty estimates associated with the ET flux; these include datasets described in 5 Bodesheim et al. (2018) and Jung et al. (2019). In Bodesheim et al. (2018), monthly uncertainty estimates are computed from the standard deviation of the half-hourly ET values that were used to derive monthly ET averages. Jung et al. (2019) provide an ensemble of global ET estimates; deviations from 10 the ensemble median are used to derive ET uncertainties. In both cases, uncertainties do not reflect the actual deviation from the measured ET at site locations. Without well-calibrated uncertainty estimates, we are unable to tell whether an identified property of any given dataset, such as a trend or a proportion of the surface energy or water budget, is robust rather than a result of bias or stochastic uncertainty.
ET trends computed from different approaches (i.e. physical and empirical) show general agreement at the global scale, and they indicate that ET has increased since the early 20 1980s (Miralles et al., 2014;Pan et al., 2020;Zhang et al., 2016). However, different ET products exhibit considerable disparities in regional and continental ET trends. For instance, Miralles et al. (2014) detected upward ET trends in GLEAM (Global Land Evaporation Amsterdam Model; Mi-25 ralles et al., 2011 TS4 ) in the northern latitudes caused by vegetation greening. In water-limited regions, they found that ET is characterised by a multidecadal variability that follows ENSO (El Niño-Southern Oscillation) dynamics, mainly in eastern and central Australia, southern Africa, and eastern 30 South America. In comparison, ET trends estimated from the observation-driven Penman-Monteith-Leuning (PML; Zhang et al., 2016) model show increasing ET since 1980 in the northern latitudes, arid regions in northern Africa, and northern and eastern Amazon. On the other hand, PML ex- 35 hibits negative trends in southern South America and western United States. More recently, Pan et al. (2020) found that ET trends exhibited during 1982-2011 by a range of empirical and physically based estimates disagree in the direction of trend in the Amazon basin and many arid and semi-arid re- 40 gions. Without incorporating uncertainties in ET estimates in the analysis of trends, it becomes difficult to assess the reliability of the established trends.
The gridded ET product derivation technique implemented by Hobeichi et al. (2018) offers the potential for robust out- 45 of-sample testing of its uncertainty estimates, as well as several other advantages over other techniques. Like other merging approaches, it offers the potential to minimise the eccentricities or biases of any one product, by averaging them (in this case using weights). However, unlike several other merg- 50 ing techniques (Mueller et al., 2013;Paca et al., 2019;Rodell et al., 2015;Stephens et al., 2012), it accounts for performance differences between parent estimates using in situ data as the observational constraint rather than assigning weights based on the ability to match another gridded dataset that is 55 deemed more reliable or the ensemble mean of a selection of datasets (Munier et al., 2014;Sahoo et al., 2011;Wan et al., 2015;Zhang et al., 2018). The efficacy of using in situ measurements for constraining much larger-scale gridded estimates has also been shown explicitly (Hobeichi et al., 2018, 60 2020b). Next, most available merging techniques do not account for dependence between parent estimates, where redundant information in different parent products is likely to bias the hybrid estimate Herger et al., 2018). Finally, and perhaps most important for this work, 65 the technique calculates global spatially and temporally varying uncertainty estimates that are based on observations, in that they are based on the discrepancy between the hybrid ET estimate and in situ data. Aside from being more defensible than simply taking the spread of the parent products around 70 their mean (e.g. Pan et al., 2012 TS5 ;Zhang et al., 2018), this approach also allows for out-of-sample testing, by leaving some sites out of the derivation of the hybrid product and its uncertainty and then using them to test its accuracy.
Despite these advantages, out-of-sample testing of uncer-75 tainty estimates was not explored by Hobeichi et al. (2018), and the short temporal availability of the DOLCE product (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009) limited its application, particularly in examining historical trends. While different subsets of parent products were used over different regions to expand the spa-80 tial coverage of DOLCE, the possibility of different product subsets in different time periods to extend its temporal reach was not explored. Additionally, since the development of DOLCE, four of its six parent datasets (Jung et al., 2010;Martens et al., 2016;Miralles et al., 2011;Mu et al., 85 2011;Zhang et al., 2016) have been improved, and several new global ET datasets have been developed (Balsamo et al., 2015;Bodesheim et al., 2018;Jung et al., 2019). Most of these are available at a higher spatial resolution than the original 0.5 • in DOLCE and cover different subsets of the period 90 1980-2018, with at least two available every year during this period (Table 1).
In this paper we amend these shortcomings and explore some of the insights that the new versions of DOLCE offer, in particular focusing on the temporal trends in ET in differ-95 ent regions and the assessment of robustness of trends that well-calibrated uncertainty estimates afford. Roughly in order, we detail below (1) how we update the DOLCE product with new parent datasets and extend its temporal coverage; (2) how the improved products compare to their previous 100 version and other existing ET estimates from the literature; (3) the efficacy of uncertainty estimates, in particular whether or not they are overconfident; (4) an exploration of historical trends in ET using the extended temporal coverage and how the uncertainty estimates allow us to examine the robustness 105 of these trends; and (5) behavioural ET clusters that describe ET-based climate regimes, as a means to understand the spatial distribution of trends we find.

Data and methods
To derive two new versions of DOLCE, one suitable for benchmarking the ET dataset and another for trends analysis, we combine the 11 and 4 available global gridded ET datasets respectively using the same merging technique as in 5 DOLCE V1. This technique derives a linear combination of the participating ET datasets based on their ability to match in situ observations while also accounting for their error dependency. While we acknowledge the obvious spatial mismatch between gridded and in situ data, we refer readers to 10 Hobeichi et al. (2018) where it was shown that in situ observations do contain useful information about grid-scale fluxes, using out-of-sample testing in a similar framework to the one we present here.
Our aim is to increase the time coverage and spatial res-15 olution of DOLCE V1, as well as examine strategies to improve the effectiveness of the weighting strategy. Below we detail newly available global datasets that allow us to derive DOLCE V2 and DOLCE V3 at 0.25 • spatial resolution and an improved collection of in situ constraining data. We 20 then briefly revisit the weighting and uncertainty estimation approach before describing our tiering approach to extending the temporal reach of DOLCE V2 and DOLCE V3. Finally, we examine alternative clustering and bias-correction approaches to improve the out-of-sample performance of the weighting technique.
Throughout the paper, we use the two terms evapotranspiration (ET) and latent heat (LE) interchangeably, and the unit watts per square metre (W m −2 ) for heat fluxes and millimetres per year (mm yr −1 ) for the water flux equivalent. For reference, 1 W m −2 = 12.86 mm yr −1 . As above, we refer to the product from Hobeichi et al. (2018) (Jung et al., 2010), GLEAM v2a, GLEAM v2b (Miralles et al., 2011), GLEAM v3a (Martens et al., 2016, 15 2017), MOD16 (Mu et al., 2011) and PML (Zhang et al., 2016). In DOLCE V2, we keep both MOD16 and PML datasets, substitute the GLEAM products with their improved versions GLEAM3.3A and GLEAM3.3B (Martens et al., , 2017, and replace MPIBGC with the newly 20 developed empirical ET datasets from the Max Planck Institute for Biogeochemistry: BACI (Bodesheim et al., 2018) and two ET estimates from the FLUXCOM project (Jung et al., 2019). Additionally, we incorporate a recently published dataset (ERA5-Land; Muñoz Sabater, 2019TS7) and three 25 newly available ET datasets: PLSH (Zhang et al., 2015), SEBS (Chen et al., 2019;Su, 2002) and SRB-GEWEX . In comparison, DOLCE V3 was derived from four global ET datasets. These are ERA5-Land, an ET dataset from the FLUXCOM project, and the two lat-30 est versions of the GLEAM products, GLEAM V3.5A and GLEAM V3.5B. We provide a brief description of these datasets below, with URLs and download dates shown in Ta analysis with a series of improvements (mainly higher temporal frequency and spatial resolution) that makes it more reliable for land applications. ERA5-Land is produced under a single simulation that uses adjusted atmospheric inputs from ERA5 atmospheric variables without being coupled to 50 the atmospheric module of ERA5.
FLUXCOM (Jung et al., 2019) is an empirical upscaling of observations from 224 flux tower sites using machine learning methods. The full FLUXCOM product includes 63 global ET datasets that have been produced using two differ-55 ent setups: a remote sensing (RS) setup and a remote sensing plus meteorological (MET) setup. The development of the global datasets incorporates nine machine learning techniques, four global meteorological datasets (used only with the MET setup), three correction methods for energy imbal-60 ance at the flux tower sites and MODIS remote sensing input. In DOLCE V2, we include one dataset from each setup, which we refer to as FLUXCOM-RS (from the RS setup) and FLUXCOM-MET (from the MET setup). To choose the two datasets, we analysed the pair-wise error correlations of all 65 the products against in situ flux tower data and selected the two that had the lowest pair-wise error correlation (and so were deemed least dependent). In DOLCE V3, we include a dataset from the MET setup only.
Process-based Land Surface Evapotranspiration/Heat 70 Fluxes algorithm (PLSH; Zhang et al., 2015) terrestrial ET is derived using an improved Normalized Difference Vegetation Index (NDVI)-based Penman-Monteith algorithm originally developed in Zhang et al. (2010). ET is regulated by a set of geophysical data from GIMMS and Vegetation In-75 dex and Phenology along with radiative data from World Climate Research Programme/Global Energy and Water-Cycle Experiment (WCRP/GEWEX) Surface Radiation Budget (SRB) and CERES along with other meteorological observation data from the NCEP/DOE AMIP-II Reanalysis (NCEP2; 80 Kanamitsu et al., 2002).
In the Surface Energy Balance System (SEBS; Chen et al., 2019;Su, 2002), the ET estimates are produced with the revised Surface Energy Balance System (SEBS) algorithm in Chen et al. (2013Chen et al. ( , 2019. It uses meteorological observa-85 tions, ground heat flux, net radiation and canopy measurements collected from flux tower sites, as well as NDVI and emissivity data from MODIS. In Surface Radiation Budget (SRB)-GEWEX , ET is estimated based on the Penman- 90 Monteith equation. Input datasets include remote sensing data from AVHRRCE2 and MODIS, meteorological data derived from the Variable Infiltration Capacity (VIC; Liang et al., 1994) land surface model forced by PGFCE3, and radiative data from the NASA Global Energy and Water Ex-95 changes (GEWEX CE4 ) Surface Radiation Budget Project (Stackhouse et al., 2011).
It is clear that different parent datasets share forcing, parameterisations, and physical and empirical assumptions. Therefore, they do not constitute entirely independent esti-100 mates. Furthermore, their error correlation (when compared with data from 254 sites -details on these below), which can be used as a measure of their dependence (Bishop and Abramowitz, 2013), is high (Fig. S2, correlation >0.5), reinforcing the potential for benefit using a weighting approach 105 that can account for this redundancy.
Part of the high correlation is of course due to spatial heterogeneity and the scale mismatch between in situ and gridded datasets since individual site locations within a grid cell are likely biased with respect to the (unknown) true gridcell-averaged flux. While it might appear that a weighting approach that accounts for error correlations between parent datasets might be in danger of overfitting to error correlation resulting from spatial heterogeneity, we have two mechanisms that ensure this is not a concern for our final product. First, weights for each product are constructed over very 10 large spatio-temporal domains, i.e. more than 13 000 spacetime records as described below, so the (assumed stochastic) biases of individual sites relative to grid cell values are unlikely to influence weights over a large sample. In fact, representativeness of point-scale measurement for the grid scale 15 does exist across all the flux tower sites as a whole; this has been verified by Hobeichi et al. (2018). Second, and more categorical, all results here are presented with out-of-sample testing, so any overfitting will degrade rather than improve the results we present. More detail on this is presented be-20 low.
Given that most of the parent datasets provide ET information at a 0.25 • or finer spatial resolution (Table 1), it is possible to enhance the resolution of DOLCE from 0.5 to 0.25 • . All the parent datasets are resampled from their original spa-25 tial resolution to a common 0.25 • grid using the nearestneighbour resampling method and aggregated to monthly temporal scale before implementing the weighting technique.

Flux tower data
We use flux tower observations from a range of networks 30 including Ameriflux (https://ameriflux.lbl.gov/TS9), the Atmospheric Radiation Measurement (ARM; https://www.arm. gov/TS10), AsiaFlux (https://www.asiaflux.net/TS11), European Fluxes Database (http://www.europe-fluxdata.eu/TS12), Fluxnet 2015 CE5 , LaThuile Free Fair Use (https://fluxnet. 35 fluxdata.orgTS13), Oak Ridge data repository (https://daac. ornl.gov/ TS14 ), OzFlux (http://www.ozflux.org.au/ TS15 ) and data acquired through communication with individual site principal investigators (PI). Particular efforts were made to establish connections with PIs in regions where ET obser- 40 vations are scarce, including all areas outside North America, Europe and Australia, particularly the MENA (Middle East and North Africa ) region, Siberia, central Africa and the Amazon basin. Our efforts and communications with many PIs unfortunately failed to incorporate flux data from some 45 of these regions (excepting those that are already available from the cited networks). Before the quality control process detailed below, we had obtained data from 366 flux tower sites.
The raw data consist of a composite of half-hourly, daily 50 and monthly records. We compute daily averages from halfhourly records for days when at least 80 % of half-hourly LE records are available. Subsequently, we compute monthly averages from daily records for months when at least 80 % of daily LE records are available. In DOLCE V1 we applied a 55 less strict quality control on the observational data in which up to 50 % of gap filling was allowed. The reason was that DOLCE V1 incorporated much fewer observational datasourced from Fluxnet 2015 and LaThuile Free Fair use only.
In order to retain enough observational data to constrain the 60 weighting, it was necessary to make a trade-off between the quality and the quantity of the data. We also apply energy balance corrections to the monthly LE at all sites where monthly averages of the other variables of the surface energy budget -net radiation (R n ), ground 65 heat flux (G) and sensible heat flux (H ) -are available with the same high quality (quality flag >80 %). Corrections are carried out independently for every monthly record. Where any of the other components of the energy budgets are absent, latent heat measurements are used without any correc-70 tions. The energy balance correction is applied as a Bowen ratio (BR)-based correction that distributes the energy budget residuals among H and LE in such a way that their ratio is conserved. This is done under pre-defined constraints that disallow large changes to be applied to LE. As a result of 75 this, we accept the BR correction and use the corrected LE (LE cor ) values if the original monthly LE and LE cor satisfy In DOLCE V1, we did not set a threshold for LE adjustments, which resulted in LE being changed drastically in a few sites 80 to offset errors in the other energy balance components. If the BR correction does not meet the above criterion, we reject the correction and try using a residual correction, which simply calculates LE as the residual term in the energy balance equation; i.e. LE cor = R n − H − G. Similarly, we reject the resid-85 ual correction if the relation between LE and LE cor above is not satisfied. In this case, we use the original monthly LE values without correction. A simplified flow chart of these steps is displayed in Fig. S3 in the Supplement. A study by Paca et al. (2019) examined the changes to flux tower LE by three 90 means of correction and found that these on average differ by around 20 W m −2 from one another. On this basis, we expect that, typically, the correction of flux tower LE should not exceed 20 W m −2 , unless errors in other components of the budgets are propagating in the corrected ET. The rule for 95 correcting small fluxes and the condition in which each rule is applied (i.e. LE = 30 W m −2 ) are in part subjective and in another part based on a case-by-case assessment of changes induced on ET by the correction techniques, and they achieve a reasonable trade-off between data quality and availability. 100 In a further pre-processing step, if a site is located in close proximity to other sites such that they all sit on the same 0.25 • grid cell, we use observational data from the site that is more representative of the underlying grid cell. Selecting the most representative site among these sites involves (1) iden-105 tifying the biome cover at each site and (2) computing the fraction of the grid area covered by each biome; the most representative site is the one whose biome is more abundant in the underlying grid cell (i.e. scores the highest fraction of the total area). If all sites are equally representative of the underlying grid cell, we consider them as one site and we combine monthly LE from the sites by taking the average. We use the high-resolution 300 m land cover maps from the European Space Agency (ESA; http://www.esa.int/TS16) downloaded from https://cds.climate.copernicus.eu/ TS17 to determine the 10 biome types of neighbouring sites and the corresponding grid cells. This step has ensured that we are not matching a grid cell with inappropriate observational data. All the excluded sites are in Europe and North America. This filtering along with the quality control measures described earlier reduced 15 the number of employed sites in this study from 366 sites to 260 sites (Fig. S1). Furthermore, we exclude six sites from the weighting, which were located on flooded land area, wetlands or intensively irrigated land. As a result of this, the constraining observational dataset used to derive DOLCE V2 in-20 cludes 254 sites with a total of 13 641 monthly records.

Weighting approach
The weighting technique is the same as that used in DOLCE V1 and was originally presented by Bishop and 25 Abramowitz (2013) and implemented for merging observational estimates by Hobeichi et al. (2018Hobeichi et al. ( , 2019Hobeichi et al. ( , 2020a. It consists of building a linear combination, µ, of the parent datasets that minimises J j =1 (µ j − y j ) 2 , where j ∈ [1, J ] represents the monthly time-site records, y j is the observed 30 ET at the j th time-site record. The linear combination µ j = K k=1 w k x j k is subject to the constraint that K k=1 w k = 1, where k ∈ [1, K] represents the parent datasets, and x j k is the value of the kth bias-corrected parent dataset (i.e. after subtracting its mean bias relative to the all-site obser- 35 vational dataset) corresponding to the j th time-site record. The analytical solution to this problem accounts for both the performance differences between the parent datasets and their error covariance (Fig. S2), which is a proxy for dependence. Further details on the merging technique can be 40 found in Abramowitz and Bishop (2015) and Bishop and Abramowitz (2013). The weighting approach is used to combine the global parent datasets separately on different spatiotemporal subsets of the entire period and globe, using a tiered approach detailed in Sect. 2.2.3.

Computing uncertainty in ET
The ensemble dependence transformation process developed by Bishop and Abramowitz (2013) is used to calculate the spatio-temporal uncertainty of DOLCE V2 and DOLCE V3. The process transforms the global parent datasets to a new 50 ensemble so that the variance of the transformed ensemble about the derived hybrid ET estimate, µ, is constrained to be equal to the error variance of µ with respect to the flux tower data, averaged over time and space (i.e. across all J records). We use the spread √ σ 2 of the transformed ensemble as the 55 spatially and temporally varying estimate of uncertainty standard deviation, which we will refer to as "uncertainty". We refer the reader to Bishop and Abramowitz (2013) for the derivation of this approach and Hobeichi et al. (2018) for its implementation in this context. The spread √ σ 2 of the trans-60 formed ensemble accurately reflects the uncertainty of µ in those grid cells where flux tower observations are available. This process ensures that the computed uncertainty provides a better uncertainty estimate of the hybrid ET than simply using the spread of the parent datasets.

65
One additional advantage of defining uncertainty in this way is that it should give an accurate upper bound estimate of the likely discrepancy between the product and unseen ET measurements at a range of spatial scales. That is, since it is based on the discrepancy of the final hybrid product and 70 point-based flux tower estimates, which are essentially at the extremes of spatial discrepancy, the discrepancy between DOLCE and actual ET at any spatial scale greater than that of a tower footprint and smaller than that of DOLCE should be less than this uncertainty estimate (noting, however, that 75 this is the estimated standard deviation of uncertainty rather than a hard upper limit). In Sect. 2.2.5 below, we detail the out-of-sample testing of this uncertainty estimate at the point scale. To derive DOLCE V1 over the global land, we applied spatial tiering (using different subsets of parent products in different regions to maximise spatial coverage). We now expand this approach to include temporal tiering to improve the tempo-85 ral reach of DOLCE. Collectively, the incorporated parent datasets have a temporal cover over 1980-2018 but only a short common overlap during 2003-2007in DOLCE V2 and during 2003 in DOLCE V3, and their spatial intersection does not cover the global land. Therefore, to achieve a 90 global land coverage from 1980 through 2018 without excluding any of their parent products, it was necessary to build DOLCE V2 and DOLCE V3 from different subsets of parent datasets in time periods and land regions depending on the availability of the parent datasets as shown in Table 1. 95 To this end, we consider 14 and 4 distinct temporal tiers in DOLCE V2 and DOLCE V3 respectively. For example, in DOLCE V2, tier 9 covers 2008-2012 and incorporates all datasets except SRB-GEWEX. Tier 1 incorporates the least parent datasets, for the year 1980 (i.e. FLUXCOM-MET and 100 GLEAM3.3A), while tier 8 uses all the parent datasets and covers [2003][2004][2005][2006][2007]. Furthermore, within each temporal tier, we consider three spatial sub-tiers, with each spatial sub-tier covering a part of the land. These consist of (a) all land except Antarctica, Greenland and northern Africa; (b) only Antarctica and Greenland; and (c) only northern Africa. A similar spatial tiering approach was also applied in DOLCE V1. Other spatial tiers, each consisting of a small number of grid cells, were also considered where necessary to ensure that no grid cell in DOLCE V2 or DOLCE V3 is missing ET data if a single parent is missing ET data for that grid cell. As a result of the tiering approach, weighting is computed separately using a different subset of parent datasets and site 10 data in each tier, resulting in distinct spatio-temporal subsets of the entire period. Collectively, the hybrid estimates developed throughout the temporal tiers and their spatial sub-tiers form DOLCE V2 and DOLCE V3 over the global land throughout 1980-2018. The reduced number of tempo-15 ral tiers in DOLCE V3 is to ensure that no temporal discontinuities occur throughout the covered period, which otherwise would have reduced the suitability of DOLCE V3 for trend analysis. In comparison, the incorporation of a larger ensemble of parent products in DOLCE V2 is to derive an optimal 20 ET product that minimises discrepancy with in situ observations.

Weighting groups
Previous studies have found that the performance of a global product can vary with different climatic circumstances, sug-25 gesting that separating the weighting into separate regions or other groupings might well improve the results of the weighting overall (Ershadi et al., 2014;Hobeichi et al., 2018;Michel et al., 2016). Grouped weighting simply involves dividing the time and/or space covered by a particular tier 30 into different subsets or groups (e.g. with different climatic conditions) and then applying the weighting technique separately for each group (within a single tier). We expect that grouped weighting has the potential to improve weighting by accounting for the variation in performance of the par-35 ent datasets over different climate or land conditions and can hopefully improve biases detected in DOLCE V1. Hobeichi et al. (2018) tried to group flux tower sites based on their land cover type and computed weights for each land cover type. However, this approach did not improve the results, whether 40 grouping by climate zone or aridity index, with the main reason being attributed to the small number of sites in many groups. Despite the availability of 100 additional sites to constrain the weighting here compared to Hobeichi et al. (2018), the ratio of the observational data to the number of parents 45 has not improved across several climate or land cover types for DOLCE V2. We therefore investigate new approaches to grouped weighting that allow for sufficiently low group numbers to keep a reasonable sample size in each of them, including the following: 50 -Grouping by latitudinal zone. This is a simplification of grouping by climate type in which climates are aggregated into three latitudinal zones: (i) high latitudes (±60 • poleward), (ii) mid-latitudes (±60 • towards the subtropics ±40 • ), and (iii) tropics and subtropics (be-55 tween −40 and 40 • ). In each zone we apply a separate weighting using the corresponding group of sites.
-Grouping by continents. Sites are naturally separated by continental boundaries, and we might suspect that a particular ET product performs differently across con-60 tinents. For instance, precipitation is involved in the derivation of many of the parent datasets and has been found to have different fidelity over different continents (Hobeichi et al., 2020b).
-Grouping by hemisphere. Pan et al. (2020) found that 65 ET estimates agree more in the Northern Hemisphere than in the Southern Hemisphere. Therefore, performing separate weighting in each hemisphere could be better than weighting across all global land.
-Grouping by seasons. Several studies have shown that 70 the skill of ET datasets varies by seasons (Jiménez et al., 2018;Long et al., 2014;Mueller et al., 2011). To capture these differences, we implement grouping by seasons and grouping by months (detailed below). We consider two combined seasons i.e. summer-autumn 75 and winter-spring. In the summer-autumn season, we constrain the weighting with (1) monthly observations from sites located in the Northern Hemisphere during the period June-November and (2) monthly observations from sites located in the Southern Hemisphere dur-80 ing the period December-May. The remaining observational data are used to constrain the weighting during the winter-spring combined season.
-Grouping by months. This is similar to grouping by seasons, the only difference is that the two groups are June-85 November and December-May, without accounting for the different seasonal phase between hemispheres.
-Grouping by ET regime and months. Land was classified into three distinct, broad ET regimes (Fig. S4) according to two aspects of ET: mean annual total ET and within- 90 year relative variability throughout 1980-2018, derived from GLEAM V3.5A, and using K-means unsupervised classification (MacQueen, 1967). We explain the classification method further in Sect. 3.5.2. Different sets of weights were computed at each ET regime dur-95 ing June-November and December-May. Implementing weighting this way ensured that we account for performance differences across different physical aspects of the land and seasons. Despite the fact that observational data were divided into six distinct groups, the ob-100 servational data available in each group were still appropriate to merge the four parent datasets of DOLCE V3. However, we found this grouped weighting strategy to be not appropriate for merging 11 parent datasets of DOLCE V2. As an alternative to the grouping strategies, we also investigate if deriving a spatially varying bias correction within each tier could further improve the weighting. We describe the examined bias-correction approaches and their effectiveness in the Supplement.

Out-of-sample testing approach
To test the effectiveness of different weighting groups or bias-correction approaches, as well as assess which strategy offers the best performance, we use out-of-sample tests. To do this, we first divide the flux tower sites between the 10 in-sample and out-of-sample groups by randomly selecting 25 % of the sites as out of sample. The remaining sites form the in-sample training set are used to compute bias-correction terms and weights for the parent datasets in each tier using the weighting technique without weighting groups (as 15 adopted in DOLCE V1) and with each of the groups and bias-correction strategies detailed in Sects. 2.2.5 and S4 in the Supplement. In each case, these bias-correction terms and weights are then applied to the parent datasets and compared to the out-of-sample sites to test efficacy of the clustering or 20 bias-correction approach employed. The process is repeated for each grouping or bias-correction strategy to derive several hybrid ET datasets for each sample group of sites.
For each strategy, the test was repeated 1000 times with a different random selection of sites being out of sample. 25 The performance of each hybrid ET estimate was evaluated across five statistical metrics. These were root mean squared error (RMSE), absolute standard deviation difference (|σ dataset − σ observation |), correlation, mean absolute deviation (i.e. mean(|dataset − observation|)) and median abso-30 lute deviation (i.e. median(|dataset − observation|)). DOLCE V1 has not been included in this test as its coarser spatial resolution (i.e. 0.5 • ) excludes many coastal sites and so significantly reduces the observational data we could use in this analysis. The out-of-sample test is carried out over the 35 common period of availability of all the parent datasets, i.e. 2003-2007 and 2003-2016, to enable comparison of the outof-sample performance of each approach with all of the 11 and 4 parent datasets of DOLCE V2 and DOLCE V3 respectively. 40 We perform another out-of-sample experiment to test if the uncertainty estimate derived by the successful grouping and/or bias-correction strategy performs well out of sample. In this test, we first select a site S, but instead of constraining the weighting using observed ET from this site, we compute the weights and bias-correction terms of the parent datasets by using all the sites except S (i.e. just one site is out of sam-5 ple). We then calculate the MSECE6 of the derived hybrid ET against observations from all the sites except S. We denote this value by uncertainty in−sample TS18 , since it represents the uncertainty estimate computed using the same observational dataset that we used to train the weighting. We also calculate 10 the MSE of the hybrid ET against the out-of-sample observations from S, and we denote this as uncertainty out−sample , since we perform the comparison against ET observations that have not been used to train the weighting. We repeat this test for all the sites, and each time we calculate the ratio 15 uncertainty in−sample uncertainty out−sample . In an ideal case, this ratio should equal to unity.

Results and discussion
3.1 Out-of-sample performance of DOLCE V2 and DOLCE V3

20
We derive DOLCE V2.1 (Hobeichi, 2020) from 11 parent datasets by applying a grouped weighting by months. As detailed in the Supplement (Sect. S5), this approach achieves slightly better out-of-sample performance than the other grouped weighting approaches in estimating ET (Fig. S6) 25 and in deriving more robust uncertainty estimates (Fig. S7). We recall that in grouped weighting by months, the observational and gridded ET data are split into two groups: one covering the period June-November and the other covering December-May. Weighting and bias correction is then im-30 plemented in each group separately for each tier to create the subsets from which the hybrid ET product is derived. We derive DOLCE V3 (Hobeichi, 2021TS19) from four parent datasets by applying a grouped weighting by ET regimes and months. Both DOLCE V3 and DOLCE V2.1 35 outperform their parent datasets in the out-of-sample tests across all performance metrics (Figs. S6 and S8). DOLCE V2.1 performs better than DOLCE V3 across all performance metrics except standard deviation difference as illustrated in Fig. S8. The overall better performance of DOLCE V2.1 is expected given that more ET estimates contribute to the weighting. On the other hand, DOLCE V2.1 has proven worse performance than DOLCE V3 in capturing variation in ET observations since variability in ET should have decreased when the variations in individual products are not temporally coincident.  respectively. The grey ribbon represents the uncertainty of DOLCE V2 and DOLCE V3 in Fig. 1a and b respectively, defined by the ± uncertainty standard deviation interval. The uncertainty standard deviation of the two DOLCE products mostly contain the latitudinal variations of their parent 20 datasets with the exception of FLUXCOM-RS, which exhibits larger ET over the tropics and subtropics of the Southern Hemisphere relative to DOLCE V2 (Fig. 1b). This containment should not be surprising since uncertainty estimates should be robust for point-scale estimates. Figure 1a shows 25 that DOLCE V1 exhibits a slightly lower ET than DOLCE V2 in the tropics and subtropics. DOLCE V2 appears in the lower end of the range of the other datasets from 60 • poleward. All the datasets exhibit considerable disparities over the mid-latitudes south of −50 • , where the contribution of 30 the terrestrial ET comes mostly from the lower Andes. The difference between DOLCE V2 and DOLCE V3 is smallest over the mid-latitudes of the Northern Hemisphere where most of the flux tower sites are located and is largest over the tropics where very few observations are available. Also, 35 both the number and the spread of parent datasets are larger in DOLCE V2, which explains its larger uncertainty compared to DOLCE V3. The parent datasets of DOLCE V3 are in general in the upper range of ET across all the different participating products, which also explains why DOLCE V3 40 exhibits larger ET than DOLCE V2 throughout the land and mostly over the tropics. Figure 2 shows the spatial distribution of differences in the ET mean between DOLCE V2 and each of its parent datasets. We apply different spatio-temporal masks for each 45 comparison based on parent dataset coverage (Table 1). We also compute the climatological difference of DOLCE V2 with its predecessor DOLCE V1 over 2000-2009. A similar plot showing the spatial distribution of differences in the ET mean between DOLCE V3 and each of its parent datasets is 50 provided in Fig. S9. Figure S9 shows that DOLCE V3 exhibits higher ET than DOLCE V2.1 and DOLCE V1 over most of the land, particularly over the tropics and the high latitudes. On the other hand, the climatological difference between DOLCE V3 and its parent datasets show different 55 spatial patterns, and the least climatological difference is between DOLCE V3 and GLEAM V3.5B.

Comparison of DOLCE V2 and DOLCE V3 with
Over the temperate regions of the Northern Hemisphere, DOLCE V2 exhibits lower mean ET than all its parents except SEBS. We have computed the mean bias of all these 60 datasets relative to the observational data available from sites located in these temperate latitudes. DOLCE V2 has a negligible bias of 0.2 W m −2 relative to the observational data. This bias results from a positive bias of 0.4 W m −2 during June-November and a negative bias of −0.2 W m −2 during 65 December-May. All the parent datasets except SEBS exhibit a positive bias that ranges between 2.7 and 11.4 W m −2 , and SEBS has a negative mean bias of −3.4 W m −2 that varies between −0.2 W m −2 during December-May and −6.3 W m −2 during June-November. We note that the bias 70 relative to the in situ observational datasets is only indicative of the performance of the gridded datasets at the sites and does not necessarily represent the actual mean bias over these regions. The discrepancy between DOLCE V2 and DOLCE V1 is relatively small across all land. 75 Large differences between DOLCE V2 and FLUXCOM-RS are seen over the Congo and the Amazon basins, southern Africa, and the Brazilian highlands. The mean climatological bias of FLUXCOM-RS relative to observational data from these regions is 30 W m −2 . This large bias likely results 80 from the lack of sufficient data available to train the machine learning algorithm over climatically distinct biomes, which made ET prediction less constrained. This bias did not appear in FLUXCOM-MET, possibly because ET prediction is based on a larger set of predictor variables. DOLCE V2 85 exhibits a relatively small bias ranging between 2.6 W m −2 during June-November and 6.4 W m −2 during December-May. In comparison, DOLCE V3 exhibits no significant bias during June-November and a bias of 12.2 W m −2 during December-May, which is similar to the bias in GLEAM 90 V3.5B over these latitudes and seasons and is less than the bias in the remaining parent datasets (i.e. GLEAM V3.5A, FLUXCOM-MET and ERA5. In general, there are apparent disparities in the patterns of climatological differences in the tropics across all the maps. 95 This results from the fact that global ET datasets exhibit large differences over the tropics, which has been highlighted previously (Paca et al., 2019;Pan et al., 2020), particularly over the Amazon basin. We now compare DOLCE V2 and DOLCE V3 with annual mean ET aggregates over a range of river basins documented in a recent study (Table 4 of Zhang et al., 2018). ET in that study -which we'll refer to as CDR-ET -is derived by merg-105 ing 10 available ET datasets into a hybrid ET which then re-  (Lorenz et al., 2015), most likely due to the absence of a proper representation of snow and per-15 mafrost dynamics (Candogan Yossef et al., 2012). Interestingly, over the North American Arctic basins Mackenzie and Yukon, DOLCE V2 and CDR-ET exhibit much smaller relative differences than at their Siberian counterparts. DOLCE V3 exhibits higher ET than DOLCE V2 and CDR-ET across 20 the majority of the river basins, particularly over the Arctic basins. DOLCE V3 is within the range of its recently developed parent datasets which exhibit higher ET than the oldergeneration products such as SRB-GEWEX and SEBS incorporated in DOLCE V2. 25 We also compare DOLCE V2 and DOLCE V3 with continental annual means of ET shown by L' Ecuyer et al. (2015). In their study, they derive a hybrid ET by merging three global datasets. Then, they adjust the hybrid ET and its associated uncertainty by enforcing the physical constraints of 30 the surface and atmospheric water and energy budgets using a data assimilation technique (DAT). Our results show that DOLCE V2 has smaller ET with larger associated uncertainties compared to those derived in L' Ecuyer et al. (2015) ( Table 3). The range of their ET estimate overlaps with the 35 range of DOLCE V2 and DOLCE V3 throughout all continents. In L' Ecuyer et al. (2015), the uncertainty estimates are originally taken from the literature and are deemed constant across time and space and then these are reduced by the DAT. The uncertainty estimate of DOLCE, however, is firmly 40 grounded in the discrepancy between the gridded DOLCE product and in situ tower data. The variance of this discrepancy is used to recalibrate the variance of the parent datasets, which are then used to estimate uncertainty, allowing for a spatio-temporally varying uncertainty estimate that is both 45 consistent with the discrepancy between DOLCE and surface observations while at the same time being spatially and temporally complete. This process is detailed by Hobeichi et al. (2018).

Comparison of basin and continental ET with
Finally, we compare DOLCE V2 with the ET component 50 of Conserving Land Atmosphere Synthesis Suite (CLASS; Hobeichi, 2019;Hobeichi et al., 2020a), which we denote as CLASS-ET. The CLASS dataset comprises coherent estimates of the surface water and energy budgets at the gridded monthly scale. CLASS-ET has been derived by adjust-55 Table 3. Annual continental averages of ET (W m −2 ) and its standard deviation uncertainty calculated for DOLCE V2, DOLCE V3 and that developed in L'Ecuyer et al. (2015) over a common period 2000-2009. In L'Ecuyer et al. (2015, ET is derived by merging three global datasets and then adjusted by enforcing the physical constraints of the surface and atmospheric water and energy budgets.

Performance of DOLCE V2 at flux sites
We now compare DOLCE V2 with ET measured at the 260 sites used in this study (Table S1). We display two per-formance metrics -correlation and standard deviation -on 15 a Taylor diagram (Fig. 3). All data have been normalised before computing the statistical metrics so that the observational data at each site have a mean of 0 and a standard deviation of 1. Each coloured point summarises the performance statistics of DOLCE V2 at a single site. The observational 20 data are represented by a single "reference" point, i.e. the hollow point at 1 on the horizontal axis. The plot in Fig. 3 shows that most of the coloured points lie close to the reference point, indicating that DOLCE V2 is highly correlated with most of the observational data. Overall, Fig. 3 shows 25 good agreement with the observational datasets. Poor performance is seen over a small number of sites. These are represented by points located outside the Taylor diagram area.
Most of these sites have less than 1 year of monthly records  with several gaps, perhaps raising questions about observational quality.
In a further analysis, we investigate whether the performance of DOLCE V2 is reduced over a particular land cover type. For this purpose, we repeat Fig. 3, but this time we 5 colour-code the statistics points by the land cover type of the sites they represent as shown in Fig. S10. The new plot does not reveal clear links between the performance of DOLCE V2 and the biome types of the sites. Similarly, we could not find performance links with the degree of represen-10 tativeness of the site to the underlying grid cell. This is shown in Fig. S11 where colours represent the degree of agreement between the land cover type at the footprint of the tower site and the dominant land cover of the grid cell containing the site. As shown in Fig. S11, we carry out this analysis on 15 the basis of three levels of agreement. These include blue points, representing sites whose land types match the dominant land types of the underlying grid cells; green points, representing sites whose land types cover more than 25 % of the underlying grid cells without being the dominant land 20 cover at these grid cells; and pink points, representing sites whose land types cover less than 25 % of the underlying grid cells.

Changes in ET since 1980
3.5.1 Annual ET trends over the global land 25 We use DOLCE V3 to produce a long-term  map of trends in annual ET totals (Fig. 4) as proposed by Mann-Kendall (Kendall, 1948;Mann, 1945) using Sen's slope method (Sen, 1968). We use the uncertainty estimates associated with the ET fields and the confidence interval of the slope as two confidence measures to filter out spurious trends. These confidence measures consider trend behaviours as reliable only if (i) the confidence interval of the slope does 5 not encompasses a mix of negative and positive values and (ii) trend slopes computed for multiple different random samples of ET within the interval ET ± uncertainty standard deviation agree in sign at least 90 % of the time.
Unreliable trends occur in regions where ET uncertainty 10 is relatively high, such as in northern Africa and the Sahel, and in the high latitudes where ET observations are sparse or do not exist. Inconsistent trend behaviour (confidence interval (CI) includes positive and negative values) is found in regions that experienced long phases of droughts and nondroughts during 1980-2018, mainly in Australia, or a succession of drought and wet events, mainly in southern United States and the Amazon basin (Marengo et al., 2018). As a result of this, a general long trend in ET is not identified in these regions. Miralles et al. (2014) report that these changes 20 in ET over these regions reflect an El-Niño-La-Niña cycle. Similarly, we have not detected clear long trends in southern South America and eastern and southern Africa. This partially agrees with the study of Pan et al. (2020) where their Fig. 8 shows no ET trend in eastern Africa, and no agreement 25 on the sign of trend between the participating datasets has been found in southern South America. Figure 4 indicates that ET has increased over most of the northern latitudes which has been highlighted in many studies (e.g. Miralles et al., 2014;Pan et al., 2020;Zhang et al., 2016), and declined 30 in western United States, central Africa and South America. Unfortunately, given the absence of adequate in situ observations that cover a long enough period to establish trends analysis, it is difficult to validate the identified trends directly.
In further analysis, we verify that the spatio-temporal tier-35 ing adopted in DOLCE V3 has not resulted in temporal discontinuities. Figure

ET regimes
To understand changes in ET across wet and dry regions, we 50 classify land into six distinct dry and wet ET regimes according to two aspects of ET: annual averages and within-year relative variability derived from DOLCE V3. We apply K-means clustering (MacQueen, 1967) -an unsupervised machine learning algorithm known for its outstanding efficiency 55 in clustering data -by implementing the K-Means function and the least squares quantisation method (Lloyd, 1982) using R software. K-Means identifies K centroids (i.e. imaginary values representing the centre of the clusters) and assigns each data point to the cluster of the nearest centroid 60 using -in this paper -the least squares quantisation method. For each grid cell, we compute (1) the average of the annual total ET across 39 years  and (2) within-year relative variability climatology by temporally averaging the relative standard deviation of monthly ET calculated over a 65 year and across all years. These have been used as input features for the unsupervised classification. After trial and error, we found that the global land can be adequately classified into six distinct regimes that include three dry and three wet regimes. According to centroids values (Table S4), we label 70 the six regimes from driest to wettest, and we list the proportion of the land covered by each regime: (i) very low ET with high variability (16 %), (ii) low ET with high variability (34 %), (iii) mild low ET with medium variability (22 %), (iv) mild high ET with medium variability (13 %), (v) high 75 ET with low variability (8 %) and (vi) very high ET with low variability (7 %). Figure 6 displays the spatial distribution of the six ET regimes. We compare the derived ET regimes map with the modified Köppen climate (KC) classification map by Chen and 80 Chen (2013). We find that each KC class overlaps with only one ET regime with only two exceptions (Table 4): (i) land characterised by a "Dry Steppe Hot arid"CE7 (coded BSh in KC) climate belongs to the "Mild low ET with medium variability regime", but in two regions, the Indian Deccan plateau 85 and Argentinean Gran Chaco low forests, where the climate is BSh, the ET regime is "Mild high ET with medium variability"; (ii) Regions with a "Mild temperate Fully humid Hot summer" climate (coded Cfa in KC) overlap with the "Mild high ET with medium variability" regime in coastal re-90 gions and to the "Very high ET with low variability" regime in inland regions. These two KC classes (i.e. BSh and Cfa) are shown in bold in Table 4. Overall, ET regimes defined in this paper provide an efficient way to aggregate the KC classes into less varied classes. This is not surprising know-95 ing that KC classes are developed based on the empirical relationship between climate and vegetation and that ET links the water, energy (climate) and carbon (vegetation) budgets.

Global annual trends across the ET regimes
We now explore annual trends in mean ET exhibited in each 100 ET regime during 1980-2018. First, we calculate the annual ET total climatology and ET relative variability climatology spatially averaged across each regime separately; then we compute the trends in yearly ET as above (i.e. using Mann-Kendall and Sen's slope methods).   Chen and Chen (2013). Text in bold font indicates that the Köppen climate is associated with more than one ET regime.

ET regimes
Köppen climate classes (Chen and Chen, 2013)    confidence interval contains mixed positive and negative values. This suggest that the tendency for increasing ET in the wettest ET regime is not robust. Our results indicate that decreasing ET trends observed in some regions oppose the consistent positive trends across the majority of ET clusters. 15 We repeat the same analysis for all the participating parent datasets that span at least 30 years. Sen's slope of the trends over the period 1982-2012 and their confidence interval (computed at the 95 % confidence level) are presented in Table 5. As noted earlier, trend behaviours are deemed 20 inconclusive when the CI encompasses negative and positive values. These are presented with regular (as opposed to bold) typeface and are exhibited by FLUXCOM-MET in all regimes except the driest. In contrast, PLSH shows reliable upward trends in all regimes. ERA5-Land shows 25 downward trends in the "M.H.ET, M.variability" and "H.ET, L.variability" regimes. Both GLEAM 3.5A and DOLCE V3 show reliable upward ET trends in the two middle regimes. Differences exist in the magnitude of trends across the majority of the products and the regimes. In DOLCE V3, the 30 strongest trend occurs in the "M.H.ET, M.variability" regime at a rate of 0.56 mm yr −1 . Finally, the slopes of DOLCE V3 trends are within the range of slopes of trends in available ET products.
There are of course some notable limitations to the ap- 35 proach we have taken here, some of which were previously discussed in Hobeichi et al. (2018). First, the weighting ap-proach adopted here relies heavily on flux tower observations, which can suffer from a range of technical issues (Burba and Anderson, 2010;Fratini et al., 2019), as well 40 as temporal gaps during particular weather conditions such as extremes (Van Der Horst et al., 2019), which can affect our results. Next, unresolved land surface processes in the parent datasets due, for example, to the absence of a proper representation of snow and permafrost dynamics or the het-45 erogeneity of the land surface are likely to lead to uncertain ET estimation in DOLCE V2 and DOLCE V3, since each of these is only a combination of its parent datasets. This applies particularly in regions where observations are scarce or do not exist.

Conclusions
This work presents two new hybrid ET datasets DOLCE V2.1 and DOLCE V3. The new datasets are the result of several key improvements over their predecessor, incorporating more parent products in DOLCE V2.1, more in situ data, 55 testing a range of alternative implementations of its weighting and bias-correction approach, increased spatial resolution, and covering a longer time period. The incorporation of a large ensemble of parent datasets in DOLCE V2.1 allowed us to derive a more optimal ET product that can be 60 used to benchmark global ET estimates. In comparison, the reduced number of parent datasets in DOLCE V3 minimised temporal tiering and ensured that no temporal discontinuities occur throughout the covered period. This allowed us to examine historical trends in ET and their robustness to obser-65 vational uncertainty. Despite the observationally constrained approach to defining uncertainty, we found robust ET trends across most areas of the land surface -enough to present a clear signal in most of the ET climate regimes we examined.
These trends indicate a global increase in land-derived ET 70 between 1980 and 2018. This contrasts with other gridded ET products that did not incorporate the same degree of ob-servational constraint in either their mean field or uncertainty estimates, and demonstrates the usefulness of this long-term hybrid ET dataset. Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements. The authors acknowledge the support of the Australian Research Council Centre of Excellence for Climate Ex-20 tremes (CE170100023). This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. We thank Franklin (Pete) Robertson (NASA Marshall Space Flight Center) for his valuable contribution to DOLCE V3.

30
FLUXNET eddy covariance data processing and harmonisation were carried out by the ICOS Ecosystem Thematic Center, Ameri-Flux Management Project and Fluxdata project of FLUXNET, with the support of CDIAC CE8 and the OzFlux, ChinaFlux and Asi-aFlux offices. Data were also obtained from the Atmospheric Ra- 35 diation Measurement (ARM) Program sponsored by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Climate and Environmental Sciences Division. This work used data sourced from the Terrestrial Ecosystem Research Network (TERN) infrastructure, which is an Australian 40 Government NCRIS-enabled CE9 project; the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC); and the Land Cover project of the ESA Climate Change Initiative. We would like to thank all the principal investigators that authorised us to download site data from the European Fluxes Database and all 45 the research institutes that made publicly available and/or hosted the gridded ET datasets used in this study.
Financial support. This research has been supported by Climate Extremes (grant no. CE170100023) TS23 .
Review statement. This paper was edited by Ryan Teuling and re-50 viewed by Jasper Denissen and one anonymous referee. Remarks from the language copy-editor CE1 Please confirm the changes to this section, as edits are not displayed in the track-changes PDF.

CE4
GEWEX was defined as "Global Energy and Water-Cycle Experiment" above. Please check these and make them consistent if necessary.

CE5
Fluxnet 2015 does not have a following website link. Can you please add a URL (including last access date) for consistency?

CE7
Please confirm that the capitalisation of terms is correct here and in Table 4 for the quoted classes. Otherwise, we can change these to lower-case terms.
Remarks from the typesetter TS1 If available, please provide more information for affiliations, such as department.

TS2
The composition of Figs. 1-2 and 4 has been adjusted to our standards.

TS3
Changed to Alemohammad et al., 2017 as in the Biogeosciences reference. TS4 a or b? Please distinguish between 2011a or b throughout the paper.

TS5
This reference is not mentioned in the reference list.

TS7
This reference is not mentioned in the reference list.

TS8
This reference is not mentioned in the reference list.

TS9
Please provide a date when you last visited the website (dd/mm/yyyy).

TS10
Please provide a date when you last visited the website (dd/mm/yyyy).

TS11
Please provide a date when you last visited the website (dd/mm/yyyy).

TS12
Please provide a date when you last visited the website (dd/mm/yyyy).

TS13
Please provide a date when you last visited the website (dd/mm/yyyy).

TS14
Please provide a date when you last visited the website (dd/mm/yyyy).

TS15
Please provide a date when you last visited the website (dd/mm/yyyy).

TS16
Please provide a date when you last visited the website (dd/mm/yyyy).

TS17
Please provide a date when you last visited the website (dd/mm/yyyy).

TS18
Should it be a hyphen between in and sample as well as out and sample?

TS19
This reference is not mentioned in the reference list.

TS22
Please note that the section "Author contributions" is mandatory. Please provide the text for this section in complete sentences. Please see https://publications.copernicus.org/for_authors/obligations_for_authors.html for more information

TS23
Please note that the funding information has been added to this paper. Please check if it is correct. Please also doublecheck your acknowledgements to see whether repeated information can be removed or changed accordingly. Thanks.

TS24
Please correct authors.

TS25
Please provide publisher location.

TS26
Please provide publisher location.

TS27
Please provide page range or article number.

TS28
This reference is not mentioned in the paper.

TS29
Please provide volume.

TS30
Please note added reference.

TS31
Please provide journal title.

TS32
Please provide publisher and publisher location in case of a book. In case of a journal, please provide journal title and page range or journal title, article number and DOI.

TS33
Please provide page range or article number.