Interactive comment on “ Local tower-based merging of two land evaporation products

General comments and questions: The study is well written and provides interesting insights in the performance of the two used ET models. These seem to perform very similar and the merge of them does not provide a significant added value. I’m wondering thus if the use of other WACMOSET models with more diverse performance at the tower sites could be a better test case for the proposed merging procedure (instead of having two already similarly wellperforming models with not much of room for improvement). Can the authors comment, why they did not include a more diverse palette of models?


Introduction
The land heat flux governs the interactions between the Earth and its atmosphere (Betts, 2009), is an essential component of the water, energy, and carbon cycles (Sorooshian et al., 2005), and thus plays a key role in the climate system (Wang and Dickinson, 2012).Terrestrial evaporation (ET) -the associated flux of water from land into the atmosphere -is also an important variable in the management of agricultural systems, forests, and hydrological resources.Hence, estimates of ET at different spatial scales, ranging from individual plants for managing irrigation, to basin scales to evaluate water resources, are required in many applications(e.g.Dunn and Mackay, 1995;Le Maitre and Versfeld, 1997;Gowda et al., 2008;Fisher et al., 2017).
In situ measurements of land heat fluxes are typically conducted during field experiments (Pauwels et al., 2008) and by more permanent flux tower networks (Baldocchi et al., 2001).Being point measurements and requiring special equipment, they cannot be applied for routine measurements covering large areas.Therefore, more readily available observations are combined with well known flux formulations (e.g., Monteith, 1965;Priestley and Taylor, 1972) to obtain local estimates.To derive global estimates, remote sensing from space can be used, but the challenge is that fluxes do not have a direct signature that can be remotely detected.Therefore, satellite remote sensing observations related to surface temperature, soil moisture, or vegetation are again combined with flux formulations to derive global estimates at different time and spatial scales (for overview see Wang and Dickinson, 2012;Zhang et al., 2016) The combination of using different flux formulations driven by different global satellite data sets typically results in relatively large ET discrepancies, which are put in evidence when the ET products are inter-compared or evaluated with the flux networks (Jimenez et al., 2011;Mueller et al., 2011;McCabe et al., 2016).These differences are motivating efforts to derive in principle more accurate ET products by combining individual ET estimates at different time and spatial scales.These efforts range from simply averaging a number of ET products (Mueller et al., 2013) to more complex approaches, such as fusion algorithms

ET models
The GLEAM and PT-JPL models, and the inputs required to run them at the global scale are described in Michel et al. (2016) and Miralles et al. (2016).Only the main differences with respect to the original WACMOS-ET runs are fully described here.

GLEAM
GLEAM is a simple land surface model fully dedicated to deriving evaporation.It distinguishes between direct soil evaporation, transpiration from short and tall vegetation, snow sublimation, open-water evaporation, and interception loss from tall vegetation.Interception loss is independently calculated based on the Gash (1979) analytical model forced by observations of precipitation.The remaining components of evaporation are based upon the formulation by Priestley and Taylor (1972) for potential evaporation, constrained by multiplicative stress factors.For transpiration and soil evaporation, the stress factor is calculated based on the content of water in vegetation (microwave vegetation optical depth) and the root zone (multilayer soil model driven by observations of precipitation and updated through assimilation of microwave surface soil moisture).For regions covered by ice and snow, sublimation is calculated using a Priestley and Taylor equation with specific parameters for ice and supercooled waters.For the fraction of open water at each grid cell, the model assumes potential evaporation.
The recent GLEAM v3 model of Martens et al. (2016) is adopted for this study and replaces the model of Miralles et al. (2011b) previously applied for the WACMOS-ET runs.
Major differences related to the previous model are a revised formulation of the evaporative stress, an optimized drainage algorithm, and a new soil moisture data assimilation system.Notice that the WACMOS-ET runs were done at different time resolutions, while only daily estimates are calculated for this study.
The PT-JPL model by Fisher et al. (2008) is a relatively simple algorithm to derive ET.It uses the Priestley and Taylor (1972) approach to estimate potential evaporation, and then applies a series of stress factors to reduce from potential to actual evaporation.The land evaporation is partitioned first into soil evaporation, transpiration, and interception loss by distributing the net radiation to the soil and vegetation components.The potential evaporation for soil, transpiration, and interception is then calculated separately, followed by a reduction to actual evaporation by applying a series of ecophysiological stress factors.Unlike GLEAM, the stress factors are based on atmospheric moisture (vapour pressure deficit and relative humidity) and vegetation indices (normalized difference vegetation index, and soil adjusted vegetation index) to constrain the atmospheric demand for water.The partitioning between transpiration and interception loss is done using a threshold based on relative humidity, and therefore conceptually quite different from the precipitation based calculation of GLEAM.There is no independent estimation of snow sublimation, and the same algorithms are applied for snow-covered areas.
For this study, optimized vegetation products are used as inputs to the model.In WACMOS-ET, the Leaf Area Index (LAI) and Fraction of Absorbed Photosynthetic Active Radiation (FAPAR) products, derived from the Joint Research Centre Two-Stream Inversion (JRC-TIP) package (Pinty et al., 2007(Pinty et al., , 2011a, b), b), were converted by a simple biome-dependent calibration to a LAI/FAPAR product consistent with the Moderate Resolution Imaging Spectroradiometer (MODIS) LAI/FAPAR before being used as inputs to the model (Michel et al., 2016).Under the assumptions that the JRC-TIP FAPAR is related to the radiation absorption by the green fraction of the canopy, while the MODIS FAPAR is more related to green and non-green leaf area, a new use of the WACMOS-ET vegetation products is proposed.First, the WACMOS-ET JRC-TIP FAPAR is assumed to be close to an Enhanced Vegetation Index (EVI), and it is scaled by the factor 1. pected by the model, which in turn is used by the model as a proxy for the fractional total vegetation cover.Using the original relationships in the model, the fractional total vegetation cover is related to a total (green and non-green) LAI, which is then used to partition the net radiation in to their soil and canopy components.

Model inputs
The GLEAM and PT-JPL required global inputs remain unchanged with respect to Miralles et al. (2016), apart from the precipitation product, and are applied at the same resolution of 25 km.Common inputs to GLEAM and PT-JPL are the surface net radiation, coming from the NASA and GEWEX Surface Radiation Budget (SRB, Release 3.1 Stackhouse et al., 2004), and the near-surface air temperature, sourced from the ERA-Interim atmospheric reanalysis (Dee et al., 2011).PT-JPL also requires near-surface air humidity, also from ERA-Interim, and the vegetation products discussed in Section 2.1.2.On the other hand, GLEAM requires precipitation, coming from the Multi-Source Weighted-Ensemble Precipitation (MSWEP) version 1 product (Beck et al., 2017), soil moisture and vegetation optical depth from the ESA Climate Change Initiative (CCI) Soil Moisture v2.3 product (Liu et al., 2011b, a), and information on snow water equivalents, from the ESA GlobSnow product for the Northern Hemisphere (Takala et al., 2011)(Luojus andPulliainen, 2010), and from the National Snow and Ice Data Center (NSIDC) in snow-covered regions of the Southern Hemisphere (Kelly et al., 2003).

Tower data
The FLUXNET 2015 synthesis data set (http://fluxnet.fluxdata.org/) is used to obtain in situ measurements of evaporation (referred to as tower ET), and it is processed as in Martens et al. (2016) to retain only high-quality data appropriate to evaluate the evaporation models.Starting from the original time resolution (generally 30 minutes or 1 hour), the processing involves: (1) masking measurements using the provided quality flags; (2) masking measure-7 Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | ments for rainy intervals, using the global precipitation product and the local measurements if available (eddy-covariance measurements are generally less reliable during precipitation events), and; (3) aggregating to daily values if more than 75 percent of remaining sub-hourly data exists for a given day.This processing selects 84 station for the 2002-2007 study period.Their geographical locations, and their location in an air temperature and precipitation space, are plotted in Fig. 1, with the station names, land covers and reference or Principal Investigator listed in Table 1.
Eddy-covariance measurements are subject to errors, both random and systematic.Systematic errors can arise from instrumental calibration and unmet assumptions about the meteorological conditions, while random errors are typically related to turbulence sampling errors, the assumptions of a constant footprint area, and instrumental limitations (Moncrieff et al., 1996).Estimating these errors is far from simple, and typically requires dedicated experiments (Nordbo et al., 2012;Post et al., 2015;Wang et al., 2015).Therefore reporting them is not a widespread practice and figures for the individual sites are not commonly available.
The propagation of systematic errors typically results in the lack of energy balance closure observed at many eddy-covariance sites.Techniques to correct this exist (Foken et al., 2006), and corrected fluxes are typically preferred over the original uncorrected observations.However, correcting for the energy balance closure normally implies the need for surface radiation measurements, which are not routinely measured at all stations; e.g., for the 84 stations selected here, 26 do not have corrected latent heat fluxes.For the 58 stations with corrected fluxes, the relative mean difference between raw and corrected fluxes averaged over all stations is 6.1%, with a maximum value of 16.5%.The uncorrected and corrected fluxes correlate very well in time, with a station-averaged square temporal correlation of 0.96.As these numbers do not suggest strong differences between the corrected and uncorrected measurements at our selected stations, we use the original uncorrected latent heat fluxes to avoid mixing uncorrected and corrected fluxes in the analyses, and to keep a sufficiently large database of in situ measurements.

Ancillary data
To help characterizing the spatial homogeneity of the grid cells where the stations are located, two data sets are considered: the MODIS Land Cover Type product MCD12Q1 at an original resolution of 500 meters, and the Terra MODIS Vegetation Continuous Fields product MOD44B at an original resolution of 250 meters.A homogeneity index (I h ) is constructed as: where F gt IGBP is the fraction of MCD12Q1 500 meter cells included in the 25 km model grid cell containing the tower and having the same IGBP land cover than the model cell, F t bare , F t herb and F t f orest are, respectively, the bare, herbaceous, and forest fractions of the MOD44B 250 meter cell containing the tower, and F g bare , F g herb and F g f orest are the same fractions but calculated for the entire 25 km model grid cell where the tower is situated.Ih takes values in the range [0,1], the larger the value the more likely the grid cell represents the landscape of the tower pixel, according to these two MODIS products.

Merging technique
A weighted combination of two products requires the definition of a set of weights, typically based on an estimation of the individual uncertainty in each of the products.The simplest strategy is to assume that both products are equally uncertain: the merged product is a Simple Average (SA-merge) of the individual products.A more elaborate strategy would be first estimating the product uncertainties, followed by a weighting of the products that takes into account this uncertainty.The inverse-variance weighting is the usual combination equation to take into account individual product uncertainties and obtain a merged estimate bounded by the initial estimates.In the context of our analysis it can be expressed as: 11 (2) (3) where and σ 2 GL (σ 2 P T ) is the variance of the GLEAM (PT-JPL) error distribution.For the SA-merge, w GL =w P T =0.5.For the WAmerge, the error is defined as the difference between the model and tower-based ET, the σ GL and σ P T are then estimated over the time series of available errors, and the weights derived following Equations 3 and 4. Note that w GL + w P T = 1.
If the errors follow a Gaussian distribution, are unbiased, and are independent from each other, Equation 2 is the optimal linear estimator (Rodgers, 2000).However, in practice, those conditions are difficult to meet.While GLEAM and PT-JPL can be considered independent from the tower observations, they do share common inputs and formulations, likely making the errors in their estimates dependent.Bias can also be present between the satellite surface meteorology used by the evaporation models and the real meteorology at the tower, due to for instance differences in footprint.The different stress formulations can also introduce systematic errors, and consequently biases.Therefore, for this exercise the inverse-variance weighting can be seen as a simple error-based method to weigh the products, but optimality in the sense of minimizing the error variance cannot be assured.At a given tower location, the error variance is estimated as the variance of the errors over a certain period.If estimated over the entire record, there will be one weight per station, with no seasonal variation.However, it is expected that the errors are non-stationary, hence, in order to have weights evolve in time, they will be estimated over time series of a certain number of days centered at each day of the year.The weights are then estimated daily, by running a time window centered at that day.The choice of window length is subjec- A number of 30 days before and after each calendar day was found to provide a good compromise between the smoothness of weights and the number of samples required, so a 61-day running window is used to derive the daily weights.
GLEAM and PT-JPL estimate separately transpiration, soil evaporation, and the evaporation from the rain intercepted by canopies.As the tower measurements are masked for rainy intervals, only the sum of soil evaporation and transpiration is compared with the tower data and weighted.If the total ET merged product is required, the choices are either to assume that GLEAM and PT-JPL are equally uncertain in their estimation of interception loss, adding their average to the weighted soil evaporation and transpiration, or adding the GLEAM or PT-JPL individual interception if there are reasons to believe that one is overall more certain than the other.

Metrics
The main analyses are done by computing the Pearson correlation coefficient (R), the Mean Square Difference (MSD), and the Root Mean Square Difference (RMSD) according to the the expressions: where P and O are the model-derived and observed (or a second model-derived) variate, and N is the number of cases.The MSD can be decomposed into a random (MSD r ) and systematic (MSD s ) component following (Willmott, 1982) by using the expressions: where Pi = a + bO i is the linear least squares regression of P onto O, being a and b the regression intercept and slope, respectively.Notice that MSD = MSD r + MSD s .When improving a product and comparing with a reference, the common targets are correlation unity and zero RMSD.But as the merge product here is bounded by the original GLEAM and PT-JPL ET estimates, and the tower ET is not always within that ET range, this target cannot always be attained.A possibility to set correlation and RMSD targets is to construct the product that minimizes the RMSD with the tower ET.In the context of our merging strategy, we call this the optimum product .As normalized weights are used, the weighted estimate can be one of the original estimates (if it is weighted 1), or an intermediate value between the products (for weights between 0 and 1).In our case, if the tower ET is between the GLEAM and PT-JPL estimates, the tower ET is then the optimum estimate.But if both the GLEAM and PT-JPL estimates are below or above the tower ET, the closest model ET is the optimum estimate, and the merging cannot reproduce the tower ET observation.Note that the optimum product does not necessarily result in the product combination having the largest correlation with the tower E, although in most cases it will improve these correlations.
Statistical significance of the correlations is tested by calculating 95% confidence intervals.For the correlation differences, a Fisher Z-transformation is applied to the correlations, 14 Statistics are calculated for the whole period, or separately for the boreal winter (DJF), spring (MAM), summer (JJA), and autumn (SON).Given the strong seasonality at most towers, correlations tend to be high without necessarily indicating that product and tower ET day-to-day anomalies are in close agreement.Calculating correlations after removing the mean seasonal cycle allows the study of short-term ET anomalies, but here the limited data record at most towers precludes the calculation of a robust seasonal cycle.

Inter-product comparison
The annual GLEAM and PT-JPL ET components, together with their differences, are shown in Fig. 2. Note that for GLEAM sublimation is included in the soil evaporation.Approximately 60% of the relative differences (differences normalized by the absolute value) are within ±25%.As both models are run with a common surface radiation product, and with a similar calculation of the potential ET, this suggests that there are large differences in their respective stress factors at some places.The disagreement in the partitioning of ET into its different components can also be noticed.PT-JPL estimates transpiration to be about 1/2 of the total ET, while GLEAM estimates that around 3/4 of the total ET is transpiration.
The remaining soil evaporation and interception are equally distributed for GLEAM (13% and 14%, respectively), while for PT-JPL the soil evaporation contribution is larger than the interception (28% and 20 %, respectively).Compared to Miralles et al. (2016), the GLEAM and PT-JPL contributions of each component to the total ET are comparable, with the small differences attributable to the above-described changes in the models and/or inputs (e.g., new precipitation product for GLEAM, or the revised use of the LAI/FAPAR for PT-JPL).Re-15     The GLEAM, PT-JPL, and tower ET are compared now at the available tower sites.The towers spatial distribution of Fig. 1 shows that approximately 75% of the towers are located in grid cells where the relative ET difference is within ± 5%, i.e., at places where the GLEAM and PT-JPL different ET stress formulations do not result in large annual ET differences.Most of those towers are located in temperate regions.The tropical rain forest and savannas, where the relative ET differences seem larger, are poorly represented in the selected tower data.Therefore, regions that would have been relevant to characterize GLEAM and PT-JPL differences are missing in the evaluation with tower data.
Seasonal distributions of ET for three vegetation classes are presented in Fig. 3.The first one groups the International Geosphere-Biosphere Programme (IGBP) forest cover stations, the second one includes the shrublands and savannas, and the third one the croplands and grasslands.They are referred to as "forest", "shrubs/savanna", and "crop/grass", respectively.Note that the stations are not evenly distributed within the three groups, with the forest group (50 stations) being more represented than the shrubs/savanna and crops/grass (10 and 24, respectively), indicating that group statistics could be more significant for forests.The surface Available Energy (Ae) is also plotted.For GLEAM and PT-JPL, Ae is the difference between the surface net radiation and the modeled ground flux.For the towers, as the surface net radiation and/or ground flux are not measured at all towers, Ae is given by the sum of the sensible and latent fluxes.Differences between GLEAM and PT-JPL distributions are visible, but overall GLEAM and PT-JPL agree better with each other than when compared with the tower ET.This is expected given the dependencies between both gridded data sets and their common footprint.
An example of good agreement is the forest group for the SON months, with the distributions of ET and Ae being quite similar for the in situ and modeled variables.The crops/grass group for the JJA months shows also reasonable agreement between GLEAM and PT-JPL ET distributions, but large differences with the tower ET.The tower ET has a clear bimodal distribution, which cannot be replicated by GLEAM and PT-JPL.This may be due to agricultural management practices being poorly captured by the models (e.g., irrigation), but may also reflect the large heterogeneity of croplands and their (a priori) low representa-18 For GLEAM, the mode of the Ae distribution is lower than for PT-JPL; as the surface net radiation is the same for GLEAM and PT-JPL, this implies that in this case GLEAM estimates larger ground fluxes than PT-JPL.The smaller GLEAM Ae can be partly responsible for some of the ET differences; however, the relatively large ET differences indicate differences in evaporative stress.The very large mismatch between the model grid cells and the footprint of the tower measurements contributes to the observed differences.In order to see whether there is a clear relation between spatial homogeneity and model and tower differences, the optimum product defined in Section 2.4 and the the homogeneity index (I h ) described in Section 2.2.3 are computed for each station.The RMSD of the optimum product and the towers ET, normalized by the mean annual tower ET, is displayed in Figure 4 for all the available stations, together with the station I h .The towers are sorted from maximum to minimum I h , i.e., starting by the towers better representing the grid cells where the tower is located.Small and large normalized RMSD can occur at stations with comparable I h , suggesting that spatial heterogeneity is only one of the contributing factors to the ET differences.If a linear square fit of the normalized RMSD of the sorted stations is calculated, the slope of the fit is ∼0.09 mm/day, but the resulting trend is not statistically significant.Significant trends were neither found for the RMSD of the tower E and the original GLEAM and PT-JPL E. This indicates that for the constructed I h and the stations and ET products sampled, the error related to the tower surrounding spatial homogeneity does not dominate the error budget.

Product weighting
A summary of daily weight statistics over all the sites belonging to a given land cover group is given in Fig. 5.The weights are given as the difference to 0.5, and will be referred to as positive weight deviations when the values are larger than 0.5.For instance, a 0.2 GLEAM weight deviation in the plot corresponds to a 0.7 and 0.3 absolute weights for GLEAM and PT-JPL, respectively.In this case GLEAM is positively weighted and PT-JPL negatively weighted.Likewise, a 0.2 PT-JPL weight deviation corresponds to 0.3 and 0.7 absolute weights for GLEAM and PT-JPL.
On average, the weights do not deviate much from 0.5, suggesting that, for a given land cover group, there are no systematic patterns indicating that one model agrees much better than the other when comparing with the tower data.The 25% and 75% percentiles are confined to the ±0.1 range on most days, so the variation of the weights can be considered small.Noticeable is some tendency for PT-JPL to be weighted more in winter, with GLEAM 20 Given the mostly high latitude location of the towers, the winter months correspond to situations with low or even negative surface radiation, the surface can be covered by snow, and ET values are in general low.Therefore, the weighting for these months will not change much the integrated annual values.On the contrary, the months where GLEAM seems more weighted typically correspond to larger ET values, and 5 the corrections in the merged product for those periods will have a more visible impact on the resulting ET.   and 0.24 for GLEAM and PT-JPL, respectively).For the first 3 years, GLEAM captures better the spring ET decrease seen by the tower, but both GLEAM and PT-JPL miss the large summer tower ET values.In this case, the larger weighting for GLEAM results in a weighted product that seems closer to the observed values, although the 0.37 correlation of the SA-merged product is not significantly higher than the 0.31 correlation of the WA-merged product.The US-Ne1 is a maize irrigated site, with expected more regular seasonal cycle and larger ET values than US-SRM.In this case PT-JPL agrees better with the observed ET than GLEAM (correlations of 0.93 and 0.84, respectively), capturing better the ET rise associated with start of the growing season, and the amplitude of the cycle for two out of the four years.PT-JPL is slightly more weighted 10 than GLEAM in the WA-merged product, but the 0.93 correlation of the WA-merged product is similar to the PT-JPL correlation.More statistics of the weights are given in Fig. 8.The left-hand side panel shows the empirical Cumulative Distribution Function (CDF) of the seasonal averaged weight deviations over all stations.The differences are small, but PT-JPL has a larger number of positive weight deviations than GLEAM in winter and the opposite during summer and autumn, similar to the pattern shown in Fig. 5.The right-hand side panel displays box plots of the seasonal averaged weight deviations for the three land cover groups.The median values per season and group are well within the 0.1 range, so on average the weight deviations from the SA-merge value of 0.5 are relatively small, highlighting again the similarities between 24 For the forest and shrubs/savanna groups the seasonal weight deviation medians can be positive for GLEAM or PT-JPL, depending on the season, while for the crops/grass group all seasonal weight deviation medians are positive for GLEAM.This may be an indication of GLEAM performing better than PT-JPL for croplands and grasslands.However, it is difficult to generalize any conclusions to the global scale given the small number of stations and their limited geographical distribution.

Merged products
The performance of the individual and merged products across the different stations is summarized by plotting seasonal group averaged correlations with the tower ET and RMSDs in Fig. 9. Given the typical small weight variations presented in Sec.3.2, the differences in performance between the individual products are expected to be small.In order to better show these differences, Fig. 9 shows the difference of the season and land-cover averaged correlations of the optimum product ET with the correlations of GLEAM, PT-JPL, SA and WA-merge ET, expressed as a percentage of the optimum correlation.The same differences are also plotted for the RMSD, but taking the differences between the individual product RMSD and the optimum product RMSD, so the differences are also positive as for the correlations.The smaller the differences, the closer to the optimum product correlation and RMSD.Note that correlations are not significant for some stations and periods, although all correlations are averaged in Fig. 9 to have a common number of stations for the inter-product and inter-season comparisons.No significant correlations are observed at a larger number of stations in winter.For the other seasons, only a few stations do not show significant correlations, and all of them correspond to low correlation values.For the merged products, they include the two tropical stations MY-Pso and BE-Bra, with reduced seasonality and short records after removal of the rainy events, US-Var and CN-Din, at the foothills of some mountainous ranges, and US-Wi4, a red pine site with some wetlands in the surroundings of the station.
In terms of correlations the worst season is winter, reflecting the low intra-seasonal variability in this period, while the largest correlations are observed in summer.The smallest 25 relative differences in correlation with the optimum product are for spring and autumn, with values of around 10%.This indicates that the margin of improvement with respect to the optimum product is not very large for these seasons.The relative differences are larger for winter and summer, with values around 20%.Regarding the three land cover groups, the average correlations for the optimum product are larger for the forest and crops/grass, and a bit lower for the shrubs/savanna.In most cases the smaller correlation and RMSD differences are for the WA-merge product, with the merged products equaling the performance of the best individual product.However, running Student-t tests shows that the correlation differences between any pair of products are not statistically significant at the selected 95% confidence level.So, although GLEAM and PT-JPL ET disagree, their ET populations do not seem too distinct when compared with the tower ET.
Close GLEAM and PT-JPL weighting means close variance of their ET differences with the tower.This can be seen in Fig. 10, where box plots of the variance of the tower ET, the variance of the GLEAM and PT-JPL ET differences, the variance of the GLEAM and tower ET difference, and the PT-JPL and tower ET are shown.GLEAM and PT-JPL are much closer to each other than when individually compared with the tower ET.Also, when summarized per season and land cover, GLEAM and PT-JPL differences with the tower are quite comparable.In principle, given the different way of modeling the ET stress by GLEAM and PT-JPL, it could have been expected the tower ET to be more informative respect to the GLEAM and PT-JPL differences.But the relative closeness of GLEAM and PT-JPL, when compared with the tower ET, seems to indicate that the fact that GLEAM and PT-JPL share some model inputs, notably the surface radiation, and some common modeling assumptions, such as close estimation of the potential evaporation, produces more dominant common error patterns at the tower locations and reduce the possibility to characterize other model errors.Note also the good correspondence between average correlations per season and land cover displayed in Fig. 9 and the tower ET variability shown here.The shrubs/savanna optimum product average correlations were the lowest for all seasons, reflecting the lowest ET variability observed in these regions.

Figure 9.
Difference of the season and land-cover averaged correlations of the optimum product ET with the correlation of GLEAM (red), PT-JPL (blue), SA-merge (green) and WA-merge (grey) ET, expressed as percentage of the averaged optimum product correlation (top panels).The same differences for the RMSD are also plotted, but taking the differences between the individual products and the optimum product in order to have positive differences (bottom panels).The average optimum product correlations and RMSD are given in the plots.
A factor potentially affecting GLEAM and PT-JPL merge is the statistical nature of their ET differences with the tower data.Assuming that systematic differences between models and observations come mostly from the model, a decrease of the systematic difference component when comparing with observations is typically a sign of improved model per-27 formance.In the context of merging products, if the difference with the observations were mostly of random nature, we should not expect the observations to provide much guidance to combine the products.Figure 11 shows box plots of the ratio of the MSD r over the sum of MSD r and MSD s (i.e, the total MSD) for GLEAM, PT-JPL, and the SA and WA-merge (see Equations 7 and 8).The ratios take values over the whole zero to one range.Analyses of the ratios for the individual towers did not show any obvious patterns in terms of biome or climate conditions.As the medians are larger than 0.5, this indicates that there is a larger number of stations where the random component is larger than the systematic for the three land cover groups.The merged products do not largely change the ratio distributions, so if we assume that a decrease of the systematic component can be indicative of a better fit to the observations, it cannot be claimed that the merge products are reducing biases with the observations.
The validity of Equation 2for the merge product can also be discussed.Ideally Equation 2 should be applied with unbiased and independent estimates.As discussed in Section 2.4, this is not the case for GLEAM and PT-JPL ET estimates.Concerning the bias, at most stations the data record to derive a meaningful climatology is too short, so bias-correcting the model ET estimates before deriving the weights is not feasible for a large number of stations.Regarding the model ET dependences, the correlation can be taken into account by modifying Equation 2 as: where ρ is the correlation coefficient between GLEAM and PT-JPL (Jones et al., 2008).However, negative weights are now possible.For instance, if σ GL > σ P T and ρ > σ P T /σ GL , GLEAM will be weighted with a negative value, and even if the sum of GLEAM and PT-JPL weights is still one, the merged ET estimates will be outside the ET range defined by GLEAM and PT-JPL original ET values.This is statistically correct, but allows the values to be extrapolated outside this range.Application of Equation 9 results indeed at negative 28  2, and does not improve the significance of the results, so it seems that ignoring the dependence of GLEAM and PT-JPL estimates is not the limiting factor of the merging exercise in this particular case.An inverse-variance weighting of two ET daily products at 25 km resolution from the WACMOS-ET project over a selection of tower sites is tested.The two ET models, PT-JPL and GLEAM, are forced with some common input data sets, and share some common features in their modeling framework such as the Priestley-Taylor formulation to estimate the potential evaporation rate.The latter is converted into actual volumes of evaporation using model-specific formulations of evaporative stress.The weights are based on the variance of the errordistribution of the individual products, with the error defined as the difference between tower ET and modeled ET for dry conditions.To produce dynamic weights changing seasonally, they are estimated by a running window of 61 days centered at each day of the year.The 10 assumption of both ET products being equally uncertain, i.e., assigning a constant weight of 0.5, is also tested.The weights over some stations showed seasonal patterns, but the differences from 0.5 were at most cases confined to ±0.2, meaning that GLEAM and PT-JPL were weighted closely to each other.Inspecting the variance of the GLEAM and PT-JPL ET differences showed smaller values than the variance of the difference between the tower ET and the individual GLEAM or PT-JPL ET.This implies that even if GLEAM and PT-JPL ET differs, they are still relatively close, compared with the tower ET of the selected sites.For a few stations where the GLEAM and PT-JPL were more distinct, their weighting resulted in a merged product better capturing some of the measured ET changes at the tower site.However, the correlations of the merged products with the tower ET were not significantly higher than the original GLEAM and PT-JPL correlations.
If a merged optimum ET product minimizing the RMSD is constructed to set a performance target, the correlation and RMSD performance improvement of the merged products with respect to this new product can be assessed.Compared with the original GLEAM and PT-JPL ET, at a large number of stations the RMSD and correlation of the merged products with the tower is closer to the optimum product, demonstrating some improvement in performance by the merged products.However, the margin of improvement with respect to the optimum product is not very large, and the simple average and weighted product statistics are very close with no statistically significant differences.According to this, for these specific ET models, spatial resolution, observation period, and available tower measurements, the tower and modeled ET differences are not informative enough to derive a merged product much better fitting the tower ET than the original GLEAM and PT-JPL products.
Limiting factors for this exercise are the restricted geographical coverage of the towers and the representativeness of the towers at the model spatial resolution.Most of the available towers for the study period are situated in temperate regions, where GLEAM and PT-JPL do not show the largest differences.In the less covered regions many towers only have data records for a limited number of years, so using a larger number of years would have given access to some of them, permitting to test the weighting at places where the tower ET may have provided more insight.The mismatch between modeled data resolution and tower footprint is always an issue.Although an apparent link between tower surrounding 31 In this study the GLEAM and PT-JPL products are derived with common data sets for their shared inputs.These are the surface net radiation and meteorology.This is key to study inter-product differences related to the modeling components, but it eliminates the more common situation of also having inter-product differences associated to applying the models with different input data sets.For this specific observation period and available towers, it seems that the unique input data sets introduced common error patterns with respect to the tower observations.The relatively close modeling framework could also be partially responsible for the common errors.The result is that, even if ET differences between GLEAM and PT-JPL are observed at the selected sites, they are not too distinct when compared with the tower observations to clearly weight GLEAM and PT-JPL very differently.This study suggests that merging tower observations and ET products at the time and spatial scales of this study (daily and 25 km) is not straightforward, and that care has to be taken regarding the dependence of the products to be merged, the tower coverage, errors, and spatial representativeness of their measurements at the products resolution, and the nature of the ET product errors.For these two specific ET modelling frameworks, GLEAM and PT-JPL, it is likely that more distinct error patterns would have been observed if the model inputs were more independent, which could have lead to a more useful weighting of the products.This could be the case for the presumably more different new versions of GLEAM (Martens et al., 2016) and PT-JPL (Fischer et al, manuscript in preparation), which are derived with more different input products, and which are available for a more extended period, facilitating the use of a larger pool of tower data.Work on combining these new products with a similar technique, together with some other daily remote sensing ET products based on more different evaporation frameworks, is envisaged, including research on extrapolation techniques to propagate the local weights outside the tower sites with the Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | 2 Methods and data 2 to become closer to the FAPAR expected by the model, as in the original PT-JPL equations.Second, the WACMOS-ET MODIS-like FAPAR is used as the Fraction of Intercepted Photosynthetic Active Radiation (FIPAR) ex-Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper |

Figure 1 .
Figure 1.Distribution of tower sites used in the study.Top: geographical location (green crosses) on a map of GLEAM and PT-JPL averaged multi-annual ET.Bottom: distribution of the GLEAM and PT-JPL averaged multi-annual ET (left), the relative GLEAM and PT-JPL ET differences normalized by the previous averaged ET (middle), and the number of global grid cells (right), as function of the annual air temperature and precipitation, together with the location of the tower sites in this space (black dots).10 Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | E W A = w GL E GL + w P T E P T

Discussion
Paper | Discussion Paper | Discussion Paper | Discussion Paper | tive: shorter time windows produce more dynamic weights, but their values are likely to be noisier given the smaller number of samples available to estimate the time series variability.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | and a Student t-test at a 5% significance level used to test the significance of the difference.The autocorrelation of the daily time series is taken into account by reducing the degrees of freedom using an effective sampling size (De Lannoy and Reichle, 2016; Lievens et al., 2017).
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | call that, as discussed in Section 2.4, only the sum of the soil evaporation and transpiration is validated against tower fluxes.

Figure 2 .
Figure 2. Summary of GLEAM and PT-JPL annual ET differences.The GLEAM (left) and PT-JPL (middle) total annual ET in mm/year, and their difference (right) is plotted in the first row.Same plots for the interception, transpiration, and soil evaporation are given in the next three rows.The last row shows the annual percentage of the ET components for GLEAM (left) and PT-JPL (middle), and the total GLEAM and PT relative annual ET differences, normalized with the GLEAM and PT-JPL averaged ET, and expressed as percentage (left).

Figure 3 .
Figure 3. Normalized histograms of ET and available energy (Ae) from GLEAM, PT-JPL, and the tower observations.The histograms are calculated with the ET values at the tower locations separated first by season and land cover. 17 Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | tiveness of the larger pixel scale.For the shrubs/savanna group and the JJA months, the three ET distributions are quite different, with GLEAM showing lower values than PT-JPL.

Figure 4 .
Figure 4. Homogeneity index (I h , blue for forest stations, green for shrubs/savanna, and red for crops/grass) and RMSD of the optimum product and the towers ET, normalized by the mean annual tower ET (grey dots, with a linear fit plotted by a grey line).The towers are sorted from maximum to minimum I h .

Figure 5 .
Figure 5. Daily statistics of weights over all forest (left), shrub/savanna (middle), and crop/grass (right) sites, plotted as the difference of the absolute weights and 0.5 (referred to as weight deviations, ∆weight).A GLEAM ∆weight of a means that GLEAM ET is weighted 0.5 + a, and PT-JPL ET 0.5 − a, while a PT-JPL ∆weight of b means that GLEAM ET is weighted 0.5 − b, and PT-JPL ET 0.5 + b.Displayed are the mean, the 25% and 75% percentile, and the minimum and maximum value.

Figure 6 .
Figure6.Example of ET ∆weights at six selected stations (∆weights defined as in Figure5).

Figure 7 .Figure 8 .
Figure 7. ET time series at the sites US-SRM (top) and US-Nei (bottom).The daily values are time smoothed with a 10-day moving averaged window to better display the more persistent temporal features.23 Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | both products.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | weights over a large number of stations, but does not produce very different merged estimates, compared with the results previously discussed.The weighted average statistics (results not displayed) are very close to the merged estimates using the original formulation of Equation

Figure 10 .
Figure 10.Box plots summarizing for the three land cover goups and four seasons the variance of the towers ET (black), the variance of difference between GLEAM and PT-JPL ET (green), and the variance of the difference between the tower ET and GLEAM (red) and PT-JPL (blue) ET.

Figure 11 .
Figure 11.Box plots showing for the three land cover groups the ratio of the random MSD between the tower ET and GLEAM (red), PT-JPL (blue), SA-merge (green) and WA-merge (grey) ET over the sum of the total MSD for the same pair of ET products.30 Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2017-573Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 6 November 2017 c Author(s) 2017.CC BY 4.0 License.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | spatial homogeneity and magnitude of the errors was not found, it is clear that the impact of spatial resolution mismatch is minimized if the modeled ET is available at finer spatial resolutions.Therefore, modeling ET at a finer spatial resolution should help for a posteriori tower-based merging of the estimates.