Potential evaporation at eddy-covariance sites across the globe

Potential evaporation (Ep) is a crucial variable for hydrological forecast and in drought monitoring systems. However, multiple interpretations of Ep exist, and these reflect a diverse range of methods to calculate Ep. As such, a comparison of the performance of these methods against field observations in different global ecosystems is badly needed. In this study, we used eddy-covariance measurements from 107 sites of the FLUXNET2015 database, covering 11 different biomes, to parameterize and compare the main Ep methods and 10 uncover their relative performance. For each site, we extracted the days for which ecosystems are unstressed based on both an energy balance approach and on a soil water content approach. The evaporation measurements during these days were used as reference to validate the different methods to estimate Ep. Our results indicate that a simple radiation-driven method calibrated per biome consistently performed best, with a mean correlation of 0.93, an unbiased RMSE of 0.56 mm day, and a bias of -0.02 mm day against in situ measurements of unstressed 15 evaporation. A Priestley and Taylor method, calibrated per biome, performed just slightly worse, yet substantially and consistently better than more complex Penman, Penman-Monteith-based or temperature-based approaches. We show that the poor performance of Penman-Monteith based approaches relates largely to the fact that the unstressed stomatal conductance was assumed constant. Further analysis showed that the biome-specific parameters required for the simple radiation-driven methods are relatively constant per biome. This makes this 20 simple radiation-driven method calibrated per biome a robust method that can be incorporated into models for improving our understanding of the impact of global warming on future global water use and demand, drought severity and ecosystem productivity.

the FLUXNET2015 database, covering eleven different biomes, to parameterize and inter-compare the most widely used Ep methods and to uncover their relative performance. For each site, we isolate days for which ecosystems can be considered as 'unstressed', based on both an energy balance and a soil water content approach. Evaporation measurements during these days are used as reference to calibrate and validate the different methods to estimate Ep. Our results indicate that a simple radiation-driven method, calibrated per biome, consistently performs best against in situ measurements 15 (mean correlation of 0.93, unbiased RMSE of 0.56 mm day -1 , and bias of -0.02 mm day -1 ). A Priestley and Taylor method, calibrated per biome, performed just slightly worse, yet substantially and consistently better than more complex Penman-, Penman-Monteith-based or temperature-driven approaches. We show that the poor performance of Penman-Monteithbased approaches largely relates to the fact that the unstressed stomatal conductance cannot be assumed to be constant in time at the ecosystem scale. On the contrary, the biome-specific parameters required by simpler radiation-driven methods 20 are relatively constant in time and per biome type. This makes these methods a robust way to estimate Ep and a suitable tool to investigate the impact of water use and demand, drought severity and biome productivity.

Introduction
Since its introduction 70 years ago by C. W. Thornthwaite (1948), the concept of potential evaporation (Ep), defined as the amount of water which would evaporate from a surface unconstrained by water availability, has been widely used in multiple fields. It has been incorporated in hydrological models dedicated to estimate runoff (e.g. Schellekens et al., 2017) or actual evaporation (Wang and Dickinson, 2012), as well as in drought severity indices (Sheffield et al., 2012;Vicente-Serrano et al., 2013). Long-term changes in Ep have been regarded as a driver of ecosystem distribution and aridity (Scheff and Frierson, 2013) and used to diagnose the influence of climate change on ecosystems based on climate model projections (e.g. Milly and Dunne, 2016). However, many different definitions of Ep exist, and consequently many different methods are available to calculate it. In recent years, there has been an increasing awareness of the impact of the underlying assumptions and caveats in traditional Ep formulations (Weiß and Menzel, 2008;Kingston et al., 2009;Sheffield et al., 2012;Seiller and Anctil, 2016;Bai et al., 2016;Milly and Dunne, 2016;Guo et al., 2017). As such, a global appraisal of the most appropriate method for assessing Ep is urgently needed. Yet, current formulations reflect a disagreement on the mere meaning of this variable, which requires the definition of some form of reference system (Lhomme, 1997). Ep has been typically defined as the evaporation which would occur in given meteorological conditions if water was not limited, either (i) over open water (Shuttleworth, 1993); (ii) over a reference crop, usually a wet (Penman, 1963) or irrigated (Allen et al., 1998) short green grass completely shading the ground; or (iii) over the actual ecosystem transpiring under unstressed conditions (Brutsaert, 1982;Granger, 1989).
A second source of disagreement on the definition of Ep relates to the spatial extent of the reference system and 10 the consideration (or not) of feedbacks from the reference system on the atmospheric conditions. Several authors found it convenient to define Ep taking an extensive area as reference system, because this reduces the influence of advection and entrainment flows (Penman, 1963;Priestley and Taylor, 1972;Brutsaert, 1982;Shuttleworth, 1993). Such an idealized extensive and well-watered ecosystem evaporating at 'maximal rate' (for the given atmospheric conditions) can be expected to raise air humidity until the vapour pressure deficit tends to zero. In this case, evaporation is only driven by 15 radiative and no longer by aerodynamic forcing. Meanwhile, others have defended the use of reference systems that are infinitesimally small (Morton, 1983;Pettijohn and Salvucci, 2009;Gentine et al., 2011b), in order to avoid the feedback of the reference system on aerodynamic forcing. The effect of the choice of reference system is best exemplified by the complementary relationship framework (Bouchet, 1964), which uses both approaches to link potential and actual evaporation, through (1 + ) p0 = pa + a , with b an empirical constant (Kahler and Brutsaert, 2006;Aminzadeh et 20 al., 2016), Ep0 the evaporation from an extensive well-watered surface (i.e. in which the feedback from the ecosystem on the VPD and aerodynamic forcing is considered and where evaporation is only driven by a radiative forcing), Epa the evaporation from a well-watered but infinitesimally small surface (i.e. where evaporation is driven by both radiative and aerodynamic forcing) and Ea the actual evaporation (Morton, 1983).
Upon all this controversy, the net radiation of the reference system remains another point of discussion: some 25 scientists argue that the (well-watered) reference system should have the same net radiation as the actual (water-limited) system (e.g. Granger, 1989;Rind et al., 1990;Crago and Crowley, 2005). Yet, this is inherently inconsistent as the surface temperature reflects the surface energy partitioning, thus a well-watered system transpiring at potential rate is expected to have a lower surface temperature (Maes and Steppe, 2012), and correspondingly a higher net radiation (e.g. Lhomme, 1997;Lhomme and Guilioni, 2006). Meanwhile, to some extent, the albedo also depends on soil moisture (Eltahir, 30 1998;Roerink et al., 2000;Teuling and Seneviratne, 2008) and it can be argued that it should be adjusted to reflect wellwatered conditions (Shuttleworth, 1993). Finally, extensive reference surfaces can be expected to exert a feedback, not only on the aerodynamic forcing, but also on the incoming radiation (via impacts on air temperature, humidity and cloud formation). Yet, these larger-scale feedbacks are not acknowledged when computing Ep, even when considering extensive reference systems.

35
As it can be concluded from the above discussion, a unique and universally accepted definition of Ep does not exist, and the most appropriate definition remains tied to the specific interest and application. Nonetheless, as different applications make use of different Ep formulations, a good knowledge of the implications of the choice for a specific method is required (Fisher et al., 2011). For terrestrial ecosystems, the use of an open water reference system is uninformative of the actual available energy and the aerodynamic properties of the actual ecosystem (Shuttleworth, global drought (Dai, 2011). When the actual ecosystem transpiring at unstressed rates is considered as reference system, both climate forcing conditions and ecosystem properties are taken into account. This has been the preferred approach when calculating Ep as an intermediate step to estimate actual evaporation, often by applying a multiplicative stress factor (S) varying between 0 and 1, such that Ea=S Ep (e.g. Barton 1979;Mu et al., 2007;Fisher et al., 2008;Miralles et al., 2011;Martens et al., 2017). This S factor can be considered analogous to the  factor used in some land surface models to 5 incorporate the effect of soil moisture in the estimation of gross primary production and surface turbulent fluxes (Powell et al., 2013).
Several studies have attempted to compare and evaluate different Ep methods. Some of these studies have compared the performance of different Ep formulations in hydrological (Xu and Singh, 2002;Oudin et al., 2005a;Kay and Davies, 2008;Seiller and Anctil, 2016) or climate models (Weiß and Menzel, 2008;Lofgren et al., 2011;Milly and Dunne, 10 2016). Others considered the Penman-Monteith method as the benchmark to test less input-demanding formulations (e.g. Chen et al., 2005;Sentelhas et al., 2010). All these studies have their own merits, yet an evaluation of Ep methods based on empirical data of actual evaporation measurements is to be preferred (Lhomme, 1997). To date, such approaches have been hampered by limited data availability (Weiß and Menzel, 2008). Lysimeters provide arguably the most precise evaporation measurements available (e.g. Abtew, 1996;Pereira and Pruitt, 2004;Yoder et al., 2005;Katerji and Rana, 15 2011), but are sparsely distributed and not always representative of larger ecosystems. Pan evaporation measurements are more easy to perform and are broadly available (Zhou et al., 2006;Donohue et al., 2010;McVicar et al., 2012) but provide a proxy of open-water evaporation, rather than actual ecosystem potential evaporation; they also exhibit biases related to the location, shape and composition of the instrument (Pettijohn and Salvucci, 2009). Eddy-covariance measurements are an attractive alternative, but, apart from an unpublished study by Palmer et al. (2012), have so far only been used in Ep 20 studies focusing on local to regional scales (Jacobs et al., 2004;Sumner and Jacobs, 2005;Douglas et al., 2009;Li et al., 2016).
The overall purpose of the present work is to identify the most suitable method to estimate Ep at the ecosystemscale across the globe. Because we are using an empirical dataset of actual evaporation at FLUXNET sites, the reference system considered in this study is the actual ecosystem, so Ep is defined as the evaporation of the actual ecosystem when 25 it is completely unstressed. As mentioned above, this definition is the most suitable for hydrological studies, studies of ecosystem drought, and derivations of actual evaporation through constraining Ep calculations. Following this definition, Ep is similar to Ep0 in the complementary relationship. We used the most recent and complete eddy-covariance database available, i.e. the FLUXNET2015 archive (http://fluxnet.fluxdata.org/). The most frequently-adopted Ep methods are applied based on standard parameterizations as well as calibrated parameters by biome, and inter-compared in order to 30 gain insights into the most adequate means to estimate Ep from ecosystem to global scales.

Selection of Ep methods
Methods to calculate Ep can be categorized based on the amount and type of input data required. In this overview, we will only discuss the ones evaluated in our study, which are arguably the most frequently used.

35
Methods based on radiation, temperature, wind speed and vapour pressure The well-known Penman-Monteith equation (Monteith, 1965) expresses latent heat flux Ea (W m -2 ) as: λ a = (R n −G)+ ρ a c p VPD r aH +γ+ γ r c r aH (1) with  being the latent heat of vaporisation (J kg -1 ), Ea the actual evaporation (kg m -2 s -1 ), s the slope of the Clausius-Clapeyron curve relating air temperature with the saturation vapour pressure (Pa K -1 ), Rn the net radiation (W m -2 ), G the ground heat flux (W m -2 ), a the air density (kg m -3 ),  the psychrometric constant (Pa K -1 ), cp the specific heat capacity of the air (J kg -1 K -1 ), VPD the vapour pressure deficit (Pa), raH the resistance of heat transfer to air (s m -1 ), rc the surface resistance of water transfer (s m -1 ). While , cp, s and  are air temperature-dependent, raH is a complex function of wind 5 speed, vegetation characteristics and the stability of the lower atmosphere (see Section 2.3). In most methods to estimate Ea or Ep, raH is estimated from a simple function of wind speed. Eq.
(2) is often referred to as the Penman (1948) formulation, and can be conveniently rearranged as λ p = ( − ) + + ( + ) to illustrate that Ep is driven by a radiative (left term) and an aerodynamic (right term) forcing (Brutsaert and Stricker, 1979).

Methods based on radiation and temperature
When the reference system is considered an idealized extensive area, or when radiative forcing is very dominant, the aerodynamic component of Eq.
(2) may become negligible, thus the whole equation collapses to λ p = (R n −G) +γ , which is commonly referred to as 'equilibrium evaporation' (Slatyer and McIlroy, 1961). Priestley and Taylor (1972) analysed time series of open water and water-saturated crops and grasslands and found that the evaporation over these surfaces 20 closely matched the equilibrium evaporation corrected by a multiplicative factor, commonly denoted as PT: This formulation is known as the Priestley and Taylor equation. Because usually a constant value of PT is adopted, it assumes that the aerodynamic term in the Penman equation (eq. 2) is a constant fraction of the radiative term. Typically, PT =1.26 is considered, as estimated by Priestley and Taylor (1972) in their original experiments. In this study, we also include a biome-specific value to extend its applicability to all biomes (see Sect 2.5). Since this method does not require 25 wind speed or VPD as input, it is widely applied in hydrological models (Norman et al., 1995;Castellvi et al., 2001;Agam et al., 2010), remote sensing evaporation models (Norman et al., 1995;Fisher et al., 2008;Agam et al., 2010;Miralles et al., 2011) and drought monitoring (Anderson et al., 1997;Vicente-Serrano et al., 2018).

Methods based on radiation
Other studies such as Lofgren et al. (2011), or the more recent Milly and Dunne (2016), further simplified Eq.
(3) to make it a linear function of the available energy by defining a constant multiplier here referred to as MD: In the case of Milly and Dunne (2016) this equation was applied to climate model outputs based on a constant and universal value of MD=0.8. On a daily scale, (R n − G) expresses the total amount of energy available for evaporation, 5 and the fraction of this energy that is actually used for evaporation is typically referred to 'evaporative fraction', or EF = λ a (H+λ a ) = λ a (R n −G) . From Eq. (4), it follows that the parameter MD can be interpreted as the EF of the unstressed ecosystem. In this study, we test both the general value of MD=0.8 and a biome-specific constant (see Sect. 2.5).

Methods based on temperature
Of the many empirical methods to estimate Ep, temperature-based methods have arguably been the most commonly used 10 because of the availability of reliable air temperature data. For an overview of these methods, we refer to Oudin et al. (2005a). In this study, three methods are included. First, Pereira and Pruitt (2004) formulated a daily version of the wellknown Thornthwaite (1948) equation: Teff<0 with Teff the effective temperature, based on maximum and minimal temperatures (see further, Section 2.5), Th an empirical parameter (see below), I the yearly sum of (Ta_month/5) 1.514 , with Ta_month the mean air temperature for each 15 month, N the number of daylight hours, b a parameter depending on I and c, d and e empirical constants (see Section 2.5).
The general value of Th=16 is often adopted; in this study, we will also calculate and apply a biome-specific value.
The second temperature-based formulation is that proposed by Oudin et al. (2005a), selected and developed after comparing 27 physically-based and empirical methods with runoff data from 308 catchments: with T a the air temperature (C) and Re top-of-atmosphere radiation (MJ m -2 day -1 ), depending on latitude and Julian day.
20 Oudin et al. (2005a) suggested to use Ou=100. This value will be used, in addition to a biome-specific value.
Finally, the third temperature-based method is the Hargreaves and Samani (1985) formulation, which includes minimum (Tmin) and maximum (Tmax) daily temperature, next to Ta and Re: λ p = α HS R e (T a + 17.8)√T max − T min with HS a constant, normally assumed to equal 0.0023. As for the other methods, we additionally apply a biomespecific value. A detailed description of the calibration of all Ep methods is given in Section 2.5.

FLUXNET2015 database
The Tier2 FLUXNET2015 database based on half-hourly or hourly measurements from eddy-covariance sensors is used to evaluate the different Ep formulations (http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/). Sites lacking at least one of the basic measurements required for our analysis (i.e. Rn, G, Ea, H, precipitation, wind speed (u), friction velocity (u*), Ta and relative humidity (RH) or VPD) were not further considered. For latent and sensible heat fluxes, we used the data corrected by the Bowen ratio method. In this approach, the Bowen ratio is assumed to be correct, and the measured Ea and H are multiplied by a correction factor derived from a moving window method; see http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/data-processing/ for a detailed description of this standard procedure.
Nonetheless, taking the uncorrected Ea instead did not impact the main findings (not shown). For the main heat fluxes (G, H, Ea), medium and poor gap-filled data were masked out according to the flags provided by FLUXNET. As no quality flag was available for Rn measurements, the flag of the shortwave incoming radiation was used instead. All negative values for H or Ea were masked out, as these relate to periods of interception loss and condensation when accurate measurements are not guaranteed (Mizutani et al., 1997). Similarly, all negative values of Rn were masked out.
Finally, sub-daily measurements were aggregated to daytime composites by applying a minimum threshold of 5 W m -2 of top-of-atmosphere incoming shortwave radiation, and after excluding the first and last (half-) hours from these 15 aggregates. Based on these daytime aggregates, the daytime means of s,  and a were calculated using the parameterisation procedure described by Allen et al. (1998). We used Ta to calculate s. Only days in which more than 70% of the data were measured directly were retained, and days with rainfall (between midnight and sunset) were removed from the analyses to avoid the effects of rainfall interception. Furthermore, only sites with at least 80 retained days were used for the further analysis. The global distribution of the final selection of sites is shown in Fig. 1 and detailed 20 information about these sites is provided in Table S1 of the Supporting Information. The IGBP classification was used to assign a biome to each site.

25
Estimates of raH are required for the Penman and Penman-Monteith equations. The resistance of heat transfer to air, raH, was calculated as: with qa being the specific humidity (kg kg -1 ) and g=9.81 m s -2 the gravitational acceleration.
The displacement height d and the roughness length for momentum flux z0m were estimated as a function of the canopy vegetation height (VH), as d=0.66 VH and z0m=0.1 VH (Brutsaert, 1982). VH was estimated from the flux tower measurements using the approach of Pennypacker and Baldocchi (2016): This equation was applied to the full (half-)hourly database, and only when conditions were near-neutral (| /L| < 0.01) 5 and friction velocities lower than one standard deviation below the mean u* at each site. The daily VH was then aggregated by averaging out the half-hourly estimates to daily values, excluding the 20% outliers of the data, and then calculating a 30-day window moving average on the dataset. When not enough (half-)hourly vegetation height observations (<160) were available, the site was excluded from the analysis. This gave robust results for all remaining sites; an example of VH temporal development for a specific location is given in Fig. 2a ) is equal to the radiative surface temperature Ts derived from the longwave fluxes. Then, through an iterative approach, an optimal value of z0h was obtained, using the following equations for T0 15 (Garratt, 1992) and Ts (Maes and Steppe, 2012): with  the Stefan-Boltzmann constant and  the emissivity (see further). The (half-)hourly data were used for this calculation. Following the approach of Li et al. (2017), only summertime data were used, and only when H was larger than 20 W m -2 and u* larger than 0.01 m s -1 . Summertime was defined as those months in which the maximal daily value of  is at least 85% of the 98 th percentile of H based on the entire tower time series. In addition, (half-)hourly observations 20 with counter-gradient heat fluxes were excluded from the analysis. For each observation, z0h was optimized by minimizing the difference between T0 and Ts. Then, kB -1 was calculated at each site based on its relation with the observed Reynolds number (Re) by fitting the following function, based on the work by Li et al. (2017): Note that Eq. (12) requires a value for , which is often assumed to be equal to 0.98 for all sites (e.g. Li et al., 2017;Rigden and Salvucci, 2015). Under the assumption that T0=Ts,  can also be calculated separately per site. If H=0, it follows that T0=Ta and from Eq. (12), Here,  was calculated for each site using (half-)hourly data, selecting those measurements where H was close to 0 (-2<H<2 W m -2 ) and excluding rainy days as well as measurements in which the albedo (calculated as =SWout/SWin with SWin the incoming and SWout the outgoing shortwave radiation) was above 0.4, to avoid influences of snow or ice.
Negative estimates of  were masked out, and the  of the site was calculated as the mean, after excluding the outlying 20% of the data. Equation 3 was applied both with a fixed  of 0.98 and with the observed , and the value yielding the 5 lowest RMSE in Eq. (12) was retained for each site. An example of such a function between kB -1 and Re is shown in Fig. 2b.
(Insert Figure 2) 10 Finally, the surface resistance rc (s m -1 ) was calculated by inverting the Penman-Monteith equation as: We converted the rc estimates to surface conductance gc (mm s -1 ) using gc=1000 rc -1 ; we will continue using gc (rather than rc) for the remainder of the study. Note that the approach of calculating kB -1 directly requires a separate measurement of LWin and LWout, which was only available in 95 of the 107 selected sites. For the remaining sites, an alternative approach was used to calculate kB -1 (see Supporting Information).

Selection of unstressed days
We include two different approaches to identify a subset of measurements per eddy-covariance site in which the ecosystem was unstressed and provide the results for both methods. A first approach is based on soil moisture levels. For those sites where soil moisture measurements were available, the maximal soil moisture level for each site was determined as the 98 th percentile of all soil moisture measurements. We then split up the dataset of each site in 5 equal-size classes 20 based on evaporation percentiles. For each class, days having soil moisture levels belonging to the highest 5% of soil moisture levels within each class were selected as unstressed days, but only if the soil moisture level of these selected days was above 75% of the maximum soil moisture. This guaranties the sampling of unstressed evaporation during all seasons.
As soil moisture data were not available for a large number of sites, and because using soil moisture data does not 25 exclude days in which evaporation may be constrained by other kinds of biotic or abiotic stress, a second approach for defining unstressed days was applied based on an energy balance criterion. We calculated the EF from the daytime λEa and H values, and considered it as a direct proxy for evaporative stress, i.e. we assumed that under unstressed conditions, a larger fraction of the available energy is used to evaporate water (Gentine et al., 2007;2011a;Maes et al., 2011). This approach is similar to those of other Ep studies using eddy-covariance or lysimeter data, in which the Bowen ratio (e.g. 30 Douglas et al., 2009) or the ratio of λEa/(SWin+LWin) (Pereira and Pruitt, 2004) are used to define unstressed days. The unstressed record was comprised of all days with EF exceeding the 95 th percentile EF threshold for each particular site; or, if fewer than 15 days fulfilled this criterion, the 15 days with the highest EF. Consequently, we assume that at each site during at least 5% of the days the conditions are such that evaporation is unstressed and Ea reflects Ep. The measured Ea from the identified unstressed days is further referred to as Eunstr (mm day -1 ) and used as reference data to evaluate the 35 different Ep methods.
To assess whether the atmospheric conditions of the unstressed datasets are representative for the FLUXNET sites as such, a random bootstrap sample having the same number of records as the unstressed dataset was taken from the entire dataset of daily records. The mean, standard deviation, 2 nd and 98 th percentile of SWin, Ta, VPD and u were calculated.
This procedure was repeated 1000 times per site. A t-test comparing the values of the unstressed subsample with those of the 1000 random samples was used to analyse whether the atmospheric conditions of unstressed subsample were representative for the overall site conditions. This analysis was carried out for both methods to select unstressed days: the soil moisture threshold and the energy balance criterion. 5

Calculation and calibration of the different Ep methods
An overview of the different methods to calculate Ep is given in Table 1 (5), (6) and (7)), only the standard and biome-specific versions were calculated. The standard version of Thornthwaite's formulation used Th=16 . In the biomespecific version, this parameter was again calculated per site as the mean value of the unstressed records (e.g. Xu and Singh, 2001;Bautista et al., 2009), and then averaged per biome type. The effective temperature Teff was calculated as et al., 1999). The parameter b was calculated as = (6.75 10 −7 3 )-(7.71 10 −7 2 ) + 0.0179 + 0.492 and the parameters c, d and e in Eq. (5c) were 415.85, 32.24 and 0.43, respectively (Pereira and Pruitt, 2004). For Oudin's temperature-based formulation,Ou=100 was taken for the standard version (Eq. (6)). In the biome-specific version, this value was recalculated by calculating Ou for the unstressed 5 records through Eq. (6), calculating the mean Ou per site and finally the biome-dependent Ou. Similarly, for the Hargreaves-Samani method, HS=0.0023 is used for the standard version (Eq. (7)), whereas in the biome-specific version, this value was calculated using the unstressed records. Altogether, this exercise yielded a total of seventeen different methods to estimate Ep whose specificities are documented in Table 1.

10
(Insert Table 1) The influence of climatic forcing data on Eunstr and on gc_ref,PT and MD was investigated. This was done by calculating for each individual site the correlations between the daily estimates of the atmospheric conditions and the daily values of the unstressed datasets. Analyses were then performed on these correlations of all sites. instance, only at one site, the unstressed subset of the 98 th percentile was significantly different from the random samplingbased simulations and in only 2 sites, the 2 nd percentile was significantly different from the simulations.

Key parameters by biome
We focus here on the parameter estimates of the unstressed record based on the energy balance criterion (Section 2.4). Of  (Table 2). In contrast, 27 sites, including 9 croplands, have a mean value of MD above 0.80, and 42 sites have mean gc_ref above 14.49 mm s -1 . Finally, wetlands (WET) show a large standard deviation of PT and MD (Table 2) due to their location in tropical, temperate as well as in arctic regions. The parameters of the three temperature-based differed significantly across biomes, but trends were different for each key parameter and did not always match those for gc_ref, PT and MD (Table 2).
Next, the effect of the atmospheric conditions on Eunstr and on the key parameters gc_ref, PT and MD of the unstressed dataset is investigated. Fig. 3 gives the distribution of the correlations between the climatic variables (Rn, Ta, VPD, u and [CO2]) and Eunstr, gc_ref, PT and MD of the unstressed records at each site. We did not include Th, HS or 5 Ou because the temperature-based methods did not perform well (see Section 3.3). Eunstr was strongly positively correlated with Rn, Ta and VPD in most sites, but less with u (Fig. 3a, Table 3). Considering all sites, the correlation between gc_ref and the climatic variables is not significantly different from zero for any climate variable, yet gc_ref is significantly negatively correlated with Tair and with VPD in 40% and 45% of the flux tower sites, respectively (Table 3, Fig. 3b). The two other parameters, PT and MD, appear less correlated to any of the climatic variables across all sites 10 (Table 3b). In the case of MD, in particular, the distributions of the correlations against all climate forcing variables peak around zero (Fig. 3d): MD is hardly influenced by Rn, and is overall not dependent on u, Ta, [CO2], or VPD in most sites (Table 3).

Evaluation of different Ep methods
We first list the results of the analysis using the energy balance criterion for selecting the unstressed records (Section 2.4).

20
The scatterplots of measured Eunstr versus estimated Ep based on the 17 different methods are shown in Fig. 4 (Table 4)this is to be expected, as the only difference between the standard and biome-specific version of these methods is the value of

35
their key parameters (PT, MD, Ou, HS) which are multiplicative (see Eqs. (3), (4), (6) and (7)). Differences are however reflected in the unbiased Root Mean Square Error (unRMSE) and mean biassee Tables 5 and 6. The biome-specific versions of the radiation-based method (MDb) and of the Priestley and Taylor method (PTb) have consistently the lowest unRMSE for all biomes. Though the difference between these two methods is small, MDb is performing slightly better.
The standard Penman method (Pes) has the highest unRMSE. All reference crop methods (PMr, Per, PTr, MDr) have mean 12 unRMSE above one mm day -1 , and the temperature-based methods (Ths, Ous, HSs, Thb, Oub and HSb) also have relatively high unRMSE. Finally, bias estimates are given in Table 6. Again, MDb is overall the best performing method (mean bias closest to zero mm day -1 ), closely followed by the PTb method. Both methods have consistently the bias closest to zero among all biomes, except for wetlands. Most reference crop methods (PMr, Per, PTr, MDr) as well as Pes overestimate Ep in all biomes. Table 4) (Insert Table 5) (Insert Table 6) 10

(Insert
The use of soil water content as criterion to select unstressed days (see Sect. 2.4) is explored. In total, 62 sites have soil water content data and meet the other selection criteria documented in Section 2.2. The results of this analysis are given in Tables S3-S5 of the supporting section. To allow for a fair comparison, the same statistics have also been computed for just the same 62 tower sites with the energy balance-criterion (Tables S6-S8). Using the soil moisture criterion, the correlations are overall lower and the results of the mean correlation, unRMSE and biases are less consistent.

15
However, the overall performance ranking of the different models remains similar: PTb is the best performing method with overall the highest mean correlation (R=0.84) and the lowest unRMSE (0.78 mm/day) and a bias close to zero (-0.04), closely followed by the MDb method, with R=0.81, unRMSE=0.89 and a mean bias of -0.12. More complex Penman-based models, and especially the empirical temperature-based formulations, show again a lower performance.
So far, all flux sites were used to calibrate the key parameters (Table 2) and those same sites were also used for 20 the evaluation of the different methods. This was done to maximise the sample size. However, to avoid possible overfitting, we also repeated the analysis after separating calibration and validation samples. For each biome, two-thirds of the sites were randomly selected as calibration sites, and one third as validation sites. The key parameters were then calculated from the calibration subset, and applied to estimate Ep of the biome-specific methods of the validation subset.
This procedure was repeated 100 times and the mean correlation, unbiased RMSE and bias per biome were calculated.

25
Results show no substantial differences in overall correlation, unRMSE and bias of each method, and are provided in Tables S9-S11 for completeness.

Comparison of criteria to define unstressed days
We prioritised the energy balance over the soil moisture criterion to select unstressed days (see Sect. 2.4), because it can 30 be applied to sites without soil moisture measurements and because it implicitly allows the exclusion of days in which the ecosystem is stressed for reasons other than soil moisture availability (e.g. insect plagues, phenological leaf-out, fires, heat and atmospheric dryness stress, nutrient limitations). In addition, soil moisture at specific depths can be a poor indicator of water stress, as rooting depth can vary and is not accurately measured (Powell et al., 2006;Douglas et al., 2009;Martínez-Vilalta et al., 2014). This is confirmed by our results: sampling unstressed days based on the energy 35 balance-based criterion resulted in higher correlations between Ep and Eunstr for all methods (Table S6 vs Table S3) and in lower unRMSE, with the exception of the temperature-based methods (Table S7 vs Table S4).
Nonetheless, it could be argued that, because the MD method assumes a constant evaporative fraction, the use of the evaporative fraction as a criterion for selecting unstressed days may favour the MD and even the closely related PT formulation. However, the soil moisture criterion adopted here provides an independent check of the results, and confirms 13 the robust and superior performance of the energy-driven PTb and MDb methods, independently of the framework chosen to select unstressed days. In the following discussion, the primary focus is on the results of selecting unstressed days based on the energy balance criterion.

Estimation of key ecosystem parameters
The resulting biome-specific values of the key parameters in Table 2 are within the range of values used in reference crop 5 and standard applications of the models (Table 1), with the exception of PT, which is typically lower than the frequently adopted value of 1.26. Other studies also found PT values far below 1.26 but within the range of our study, mainly for forests (e.g. Shuttleworth and Calder, 1979;Viswanadham et al., 1991;Eaton et al., 2001;Komatsu, 2005), but also for tundra (Eaton et al., 2001) and grassland sites ( Our results and these studies demonstrate that the standard level of PT=1.26 is close to the upper bound and will lead to 10 an overestimation of Ep at most flux sites (Table 5).

Lower performance of the Penman-Monteith method
The poor performance of the PMr, PMs and PMb methods was relatively unexpected.
. However, in studies dedicated to estimate Ea at eddy-covariance sites, in which gc is adjusted so it reflects the actual stress conditions in the ecosystem, the Penman-Monteith method has already been shown to perform worse than other, simpler methods, such as the Priestley and Taylor equation (e.g. Ershadi et al., 2014;Michel et al., 2016). It is well known that its performance depends on the reliability of a wide range of input data, and on the methods used to derive raH and gc (Singh and Xu, 1997;Dolman et al., 2014;Seiller and Anctil, 2016). In our case, the strict procedure followed to 20 select the samples of 107 FLUXNET sites (see Sect. 2.2) ensured that all relevant variables were available, and the meteorological measurements were quality-checked. Hence, in our analysis, input quality is unlikely to be the cause of low performance.
We believe that the underlying assumption of a constant gc, typically adopted by PM methods (PMr, PMs, PMb) when estimating Ep, is a more likely explanation of the poor performance. PM was the only method of which the biome-25 specific calibration did not improve the performance. This is partially because of the large variation in gc_ref among the different flux sites of the same biome type (Table 2). In addition, of all the key parameters, gc_ref showed the largest mean relative standard deviation of the unstressed datasets at individual sites (results not shown). Surface conductance of the unstressed dataset was significantly negatively correlated with VPD in 45% of the sites (Fig. 3b, Table 3). The relationship between gc and VPD for two of these sites is illustrated in Fig. 5. It is clear that gc of unstressed days (red dots) is always high for a given VPD, confirming the validity of the energy balance method. However, for these sites, it is also shown that gc of the unstressed days becomes very high when VPD becomes very low. As a consequence, the mean value of gc of the unstressed records, used to ultimately calculate gc_ref per biome type, is highly influenced by local VPD and is not necessarily a representative ecosystem property.
The dependence of gc on VPD, even when soil moisture is not limiting, has been well studied (e.g. Jones, 14 increases in gc, and vice versa, lowering the impact of VPD on Ea (Eq. 1). As such, assuming a constant gc_ref value overestimates the influence of VPD (and wind speed) on Ep.
Moreover, assuming a constant gc_ref-value in the Penman-Monteith method also ignores the influence of CO2 levels on gc. As a result, Milly and Dunne (2016) found that the Penman-Monteith methods with constant gc_ref overpredicted Ep in models estimating future water use. Calibrating the sensitivity of gc_ref to VPD and [CO2] in the Penman-Monteith equation is outside the scope of this study, but could certainly improve Ep calculationsyet, it would further increase the complexity of the model. Finally, we note that the above discussion also applies to Penman's method: taking a wet canopy as reference in the Penman method (gc=∞ or rc=0) may not only overestimate Ep (Table 6) but also the influence of VPD and u on Ep.

Considerations regarding simple energy-based methods
The simpler Priestley and Taylor and radiation-based methods came forward as the best methods for assessing Ep with both criteria to define unstressed days. These observations are in agreement with studies highlighting radiation as the 15 dominant driver of evaporation of saturated or unstressed ecosystems (e.g. Priestley and Taylor, 1972;Abtew, 1996;Wang et al., 2007;Song et al., 2017;Chan et al., 2018). It also agrees with Douglas et al. (2009)  With PT (or MD) methods, this trend cannot be captured. A similar criticism can be drawn with regards to the effect of

25
[CO2] on stomatal conductance, water use efficiency, and thus potential evaporation (Field et al., 1995). A separate question is whether more complex Ep methods that incorporate the effects of u, [CO2] or VPD explicitly do this correctly; the above-mentioned issues about the fixed parameterisation of the Penman-Monteith methods for estimating Ep indicate that this may typically not be the case.
Regarding the non-explicit consideration of u by simpler methods, our records show a limited effect of u on Ea and

30
Ep, even when considering larger temporal scales. Of the 16 flux towers with at least 10 years of evaporation data, we calculated the yearly average Ea as well as the annual mean climatic forcing variables. Yearly averages were calculated from monthly averages, which in turn were calculated if at least three daytime measurements were available. Despite a relatively large mean standard deviation in yearly average u of 7.0%, yearly average u was not significantly correlated with Ea in any of these sites. In contrast, yearly average Rn was positively correlated with yearly average Ea in seven of 35 the 16 sites, with comparable mean standard deviation in annual Rn (8.5%). Moreover, looking at all individual towers and using the daily estimates, neither MD nor PT were correlated with u (Fig. 3c, d, Table 3). In fact, since MD appears hardly affected by any climatic variable, and given the relatively small range in MD values within each biome (Table 2), it appears that MD is a robust biome property that can be used in the seamless application of these methods at global scale. The robustness of MD as biome property is furthermore confirmed by the analysis with independent calibration and validation sites, which hardly affected the unRMSE and bias (Tables S10-11).

Use of available energy under stress conditions
The best performing methods rely heavily on measurements of available energy (Rn-G) (Eqs. 2 and 3). In Section 3.2, all

5
Ep calculations used available energy obtained during unstressed conditions. The question is whether Ep can also be calculated correctly using the actual Rn and G when the ecosystems are not unstressed. As mentioned in Section 1, there is discussion whether SWout and LWout should be considered as forcing variables or as ecosystem responses (e.g. Lhomme, 1997;Lhomme and Guilioni, 2006;Shuttleworth, 1993). Among other considerations, it is clear that Ts will be lower if vegetation is healthy and soils are well watered (Maes and Steppe, 2012), which results in lower LWout and higher Rn.
Therefore, while using the observations of SWin and LWin as forcing variables in the computation of Ep is defendable (despite the potential atmospheric feedbacks that may derive from the consideration of unstressed conditions), we agree that SWout, LWout and G should ideally reflect unstressed rather than actual conditions to estimate Ep. Note that in that case, Ep deviates from Ep0 in the complementarity relationship, in which atmospheric feedbacks affecting incoming radiation or VPD are implicitly considered (Kahler and Brutsaert, 2006).

15
A method to derive the 'unstressed' estimates of SWin (through the albedo, , LWin (through Ts) and G under stressed conditions is presented in the Section S2 of the Supplement and is based the MDb method and on flux data of the unstressed datasets. We further refer to this method as the unstressed Ep . It requires a large amount of input data and is not practically applicable at global scale. Comparing the mean unstressed Ep with the 'actual' Ep (i.e., Ep calculated from the actual Rn and G) for all sites reveals that the actual Ep is 8.2±10.1% lower than the unstressed Ep. There are no 20 significant differences between biomes, but the distribution of the underestimation is left-skewed and the actual Ep is more than 10% lower than the unstressed Ep in 22% of the sites (Fig. 6). The main reason, explaining about 65% of the difference between the actual and the unstressed Ep, is the difference in Ts. Assuming that the unstressed Ts can be estimated as the mean of Ta and the actual Ts results in a straightforward alternative to approximate the unstressed Ep with only data of Ta and radiation: 25 p = α MD ((1 − α) SW in + ε LW in − 0.5 ε LW out − 0.5 ε σ T a 4 − G) This approach was also tested and resulted in a mean underestimation of Ep of 2.6±5.8% compared to the 'unstressed' Ep, with a mean median value at -0.1% (Fig. 6). Given the low error and the straightforward calculation, we recommend this method to calculate Ep at global scales.
(Insert Figure 6) 30 Fig. 7 gives an example of the seasonal evolution of Ea and Ep and the S factor (S=Ea Ep -1 ) in a grassland (Fig 7a) and a deciduous forest site (Fig. 7b). The short growing season in the grassland site, when Ea is close to Ep and values of S are close to 1, stands in clear contrast with the winter period, when grasses have died off and Ea and consequently also S are very low. In the relatively wet broadleaf forest, Ea and Ep follow a similar seasonal cycle. In winter, when total evaporation is limited to soil evaporation, S is very low; in spring, when leaves are still developing, Ea lags Ep. In summer, S remains high and close to one.

Conclusion
Based on a large sample of eddy-covariance sites from the FLUXNET2015 database, we demonstrated a higher potential of radiation-driven methods calibrated by biome type to estimate Ep than of more complex Penman-Monteith approaches or empirical temperature-based formulations. This was consistent across all 11 biomes represented in the database, and for two different criteria to identify unstressed days, one based on soil moisture and the other on evaporative fractions.

5
Our analyses also showed that the key parameters required to apply the higher-performance radiation-driven methods are relatively insensitive to climate forcing. This makes these methods robust for incorporation into global offline models, e.g. for hydrological applications. Finally, we conclude that, at the ecosystem scale, Penman-Monteith methods for estimating Ep should only be prioritised if the unstressed stomatal conductance is calculated dynamically and high accuracy observations from the wide palate of required forcing variables are available.

Data availability
The FLUXNET2015 dataset can be downloaded from http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/. The main script for calculating potential evaporation with the different method as well as the daily flux data of one site (AU-How), for which permission of distribution was granted, is available as supplement. For further questions, we ask readers to contact the corresponding author at wh.maes@ugent.be.

Supplement
The supplement contains a description of the calculation method of kB -1 for sites where radiance fluxes are not separately measured (S1), a description of the method to estimate 'unstressed' Ep (See Section 4.5) from flux data (S2) and several tables (S3), including an overview of the FLUXNET sites used in the study (Table S1), a table on the representativeness of the climate forcing conditions of the unstressed datasets for the full dataset (Table S2) as well as correlation, unbiased 20 RMSE and bias tables for different selection criteria (Tables S3-S11).  Brutsaert, W., and Stricker, H.: An advection-aridity approach to estimate actual regional evapotranspiration, Water

z0m
Roughness length for momentum m