A continental-scale evaluation of the calibration-free complementary relationship with physical, machine-learning, and land-surface models

The widespread negative correlation between the atmospheric vapor pressure deficit and soil moisture lends strong support to the complementary relationship (CR) of evapotranspiration. While it has showed outstanding performance in predicting actual evapotranspiration (ETa) over land surfaces, the calibration-free CR formulation has not been tested in the Australian continent dominantly under (semi-)arid climates. In this work, we comparatively evaluated its predictive performance with seven physical, machine-learning, and land surface models for the continent at a 0.5°×0.5° grid resolution. 15 Results showed that the calibration-free CR that forces a single parameter to everywhere produced considerable biases when comparing to water-balance ETa (ETwb). The CR method was unlikely to outperform the other physical, machine-learning, and land surface models, overrating ETa in (semi-)humid coastal areas for 2002-2012 while underestimating in arid inland locations. By calibrating the parameter against water-balance ETa independent of the simulation period, the CR method became able to outperform the other models in reproducing the spatial variation of the mean annual ETwb and the interannual variation 20 of the continental means of ETwb. However, interannual the grid-scale variability and trends were captured unacceptably even after the calibration. The calibrated parameters for the CR method were significantly correlated with the mean net radiation, temperature, and wind speed, implying that (multi-)decadal climatic variability could diversify the optimal parameters for the CR method. The other physical, machine-learning, and land surface models provided a consistent indication with the prior global-scale assessments. We also argued that at least some surface information is necessary for the CR method to describe 25 long-term hydrologic cycles at the grid scale.

deficiency would be far higher than the hypothetical ETw, because high atmospheric vapor pressure deficits (VPD) co-exist with low soil moisture across the globe (Zhou et al., 2019).
In the CR formulation by Szilagyi et al. (2017), the two dimensionless variables, x ≡ ETw/ETp and y ≡ ETa/ETp, are defined, and they are linked with four boundary conditions. If water is ample on the surface, ETa reaches ETw that should be 100 equal to ETp owing to no water deficiency; thus, the first boundary condition is (i) y = 1 for x = 1. When the surface is entirely desiccated, ETa must be nil (i.e., y = 0), and by energy balance, the surface radiation should be fully transformed to the sensible heat flux that fully amplifies VPD. In other words, under the given radiation and wind speed, ETp is maximized when ETa = 0, providing another zero-order boundary condition: (ii) y = 0 for x = xmin ≡ ETw/Epmax, where Epmax is the maximized ETp. When x = 1 (i.e., with ample water), ETa would change as much as changes in ETw, yielding a first-order boundary condition: 105 (iii) dy/dx = 1 for x = 1. On the other boundary (i.e., x = 0), ETa should be constant irrespective of any changes in ETw; thus, another zero-order boundary condition is (iv) dy/dx = 0 for x = 0. The simplest polynomial equation satisfying the four boundary conditions is: with X defined as: 110 Since ETp, ETw, and Epmax could be all obtainable from a set of net radiation, air temperature and humidity, and wind speed data, Eqs. (1a) and (1b) allow users to estimate ETa with no direct soil moisture information (e.g., remote-sensing soil moisture products). Szilagyi et al. (2017) used the Penman (1948)  where, Δavg is the slope of the saturation vapor pressure curve (kPa °C -1 ) at the mean air temperature Tavg (°C), γ is the psychometric constant (kPa °C -1 ), Rn is the surface net radiation less the soil heat flux (MJ m -2 d -1 ), λv is the latent heat of vaporization (MJ kg -1 ) (here we quantified it by λv=2.501-0.00236Tavg), fu = 2.6(1+0.54u2) is the Rome wind function, where u2 is the 2-m wind speed (m s -1 ), and VPD is es(Tavg) minus ea, where es(Tavg) is the saturation vapor pressure at Tavg and ea is the actual vapor pressure, respectively. 120 As it is parameterized under wet surface conditions, ETw could be quantified by the Priestly and Taylor (1972) equation: where, αe is a coefficient typically ranging between 1.10 and 1.32 (Szilagyi et al., 2017), and Δws is the slope of the vapor pressure curve (kPa °C -1 ) at the wet surface temperature Tws for which Szilagyi (2014) used the two methods based on the 125 Bowen ratio and the derivation of Monteith (1980). We used the latter: T ws = T wb + γR n VPD (∆ wb +γ)(c 1 R n +c 2 f u VPD) , where, Twb is the wet-bulb temperature that can be estimated by the Bowen ratio for an adiabatic change: 130 γ T wb −T avg e s (T wb )−e a = −1.
Epmax is calculated with the same Penman equation but with different temperature and humidity conditions. The air overpassing a desiccated surface is likely devoid of humidity; thus, ea would become negligible: where, Δdry is the slope of the vapor pressure curve (kPa °C -1 ) at the dry air temperature Tdry (°C). Tdry is the hypothetical air 135 temperature that would adiabatically reach when the latent heat flux is nil: The coefficient αe in Eq. (3) is the only parameter for the CR method. Szilagyi et al. (2017) proposed to use the mean value of αe values analytically obtained at humid locations for a region of interest, and we achieved α=1.09 by the same approach (the details are given in the Appendix). Although it is smaller than the typical Priestley-Taylor coefficient (1.26), the 140 obtained αe was still higher than the physical lower limit (i.e., unity). It should be noted that the αe incorporated in the CR method is a model parameter analogous to the Priestley-Taylor coefficient rather than having the same physical meaning (Ma et al., 2019;Brutsaert et al., 2017). In prior continental-scale studies, the optimal αe values for continental-scale applications were often lower than the typical Priestley-Taylor coefficient (1.26) (e.g., Ma et al, 2020;Kim et al., 2019b;Ma et al., 2019;Ma and Szilagyi, 2019). 145

Atmospheric forcing and evaluation datasets
The study area was the Australian continent lying within [10°-45° S, 113°-155° E], and the atmospheric forcing data for the CR method were collected from the ERA Interim reanalysis archive (Dee et al., 2011) of the European Centre for Medium-Range Weather Forecasts (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim; last access 150 on mmm-dd/2020). The monthly averages of surface net solar radiation and net thermal radiation, 2-m air temperature, 2-m dew-point temperature, and 10-m U and V wind speed components were downloaded at the 0.5°×0.5° grid resolution for 1979-2018. Rn was estimated simply by summing the two net radiation data (i.e., the soil heat flux was assumed to be negligible).
Tavg and VPD were directly quantified by the air temperature and the dew-point temperature datasets, while the 10-m U and V components were converted to u2 values using the logarithmic wind speed profile.
The CR ETa estimates from Eq. (1a) were evaluated with the water-balance ETa (ETwb) estimates at the same 0.5° grid resolution. To achieve the grid-scale ETwb, some syntheses (e.g., spatial and temporal gap filling and/or conceptual modelling) are inevitable due to the non-uniformity and unavailability of in-situ precipitation, streamflow, and terrestrial water storage (TWS) observations. We collected the global precipitation (P) product v.2018 of the Global Precipitation Climate Centre (GPCC) together with the grid runoff (Q) products by Ghiggi et al. (2019) and Hobeichi et al. (2019) and the TWS 160 anomalies reconstructed by Humphrey and Gudmundsson (2019). The GPCC monthly P data are readily available for 1891present from https://psl.noaa.gov/data/gridded/data.gpcc.html (last access on Jun-01/2020). The monthly GRUN was produced by Ghiggi et al. (2019) at the 0.5° resolution for 1902-2014 using in-situ streamflow observations and a machine learning algorithm, and cross-validated by independent discharge data in major river basins across the world (https://doi.org/10.6084/m9.figshare.9228176; lass access on May-20/2020). The Linear Optimal Runoff Aggregate (LORA) 165 of Hobeichi et al. (2016) merged the syntheses of eleven land surface models using an optimal weighting approach for 1980-2012 (https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f9617_9854_8096_5291m; last access on Dec-23/2020). Hobeichi et al. (2016) validated the LORA Q data with published discharge observations at many river basins.
Since we believed that Q values from multiple sources are more reliable than a single model synthesis, the Q value at each grid was determined by simple averages of the GRUN and the LORA data. The TWS data, namely the GRACE-REC, extend 170 the Gravity Recovery and Climate Experiments (GRACE) land water-equivalent-thickness data given for 2003-2015 to the year of 1901 using a statistical methods (https://doi.org/10.6084/m9.figshare.7670849; last access on Jan-4/2021). We calculated annual TWS changes (δS) at each grid by differencing the December TWS estimates of two consecutive years.
Then, the annual ETwb in each year was calculated by the water balance equation, i.e., ET wb ̅̅̅̅̅̅̅ = P ̅ -Q ̅ -δS, where ET wb ̅̅̅̅̅̅̅ , P ̅ , and Q ̅ are the annual averages of ETwb, P, and Q, respectively. 175 Figure 1 displays the distribution of the wetness index (the long-term-average ratio of P to ETp) for 2002-2012 calculated with the GPCC P and the ETp from the ERA-Interim forcing. Typically, the wetness index categorizes hyper-arid, arid, semi-arid, semi-humid, and humid climates with the thresholds of 0.05, 0.2, 0.5, and 0.65, respectively (Barrow, 1992).
The range of the wetness index was 0.07-5.65 in Australia, indicating that 81% of the land surfaces was under (semi-)arid climates though there are no hyper-arid areas. Humid climates are found in northern, southwestern, and southeastern coastal 180 regions, where major cities and agricultural areas have developed.

ETa products for comparative evaluation
The performance of the CR method was compared with products from two remote-sensing-based physical models, a 185 machine-learning model, and four land surface schemes. The physical models include the Global Land Evaporation applied for routed streamflow estimates. Lastly, the ERA-Interim uses the improved land surface scheme formulated by Viterbo and Beljaars (1995) to simulate the evolution of heat and water storages in soil and snow layers. It classifies a land surface using satellite data and ancillary information. The downward water fluxes in a land pixel are generated by the governing 220 equations, and the latent heat flux to the lowest atmospheric level is computed with the Obukhov length.
For comparation between the nine ETa products, the different spatial resolutions were bilinearly unified to the standard 0.5° grid of the ERA-Interim forcing data. We compared all the ETa products to the grid-scale ETwb, and evaluated the reproducibility in spatial variation of long-term mean ETa and the continental means together with the grid-scale interannual variability and linear trends. 225 The region with AI < 0.10 in Figure 1 corresponds approximately to where the mean ETa is very low in Zhang et al. (2010) and Figure 2a. In the arid Australian continent, the precipitation pattern mostly determines the spatial variation of mean ETa, because about 90% of precipitation returns to the atmosphere (Glenn et al., 2011). Figure 2a, too, depicts that the means of ETwb were small in the arid east-central part, increasing towards northern and eastern coasts where precipitation is abundant 235 due to monsoonal and easterly winds. The continental mean ETwb for 2002-2012 was 431 mm a -1 approximately close to the value (439 mm a -1 ) given by the global assessment of Zhang et al. (2016). The consistency to prior studies led us believe that the annual ETwb product from the reanalysis precipitation and the reconstructed runoff and TWS data could become an acceptable evaluation reference.
Though the pattern correlation between the mean ETwb and CR ETa was fairly high (Pearson r was 0.87), the CR method 245 overestimated ETa in coastal areas, while underrating it in the central-western part under arid climates.
The two physical models, on the other hand, were biased negatively. The GLEAM produced smaller ETa in the wet northern coastal, and the (semi-)arid central and southwestern parts than ETwb, while the PT-JPL seemed to generally underestimated it across the continent (Figure 3c and d). On the contrary, the FLUXCOM product was positively biased in (semi-)arid inland areas with suppressed spatial variation ( Figure 3e). The land surface scheme of AWRA-L generated the 250 unexpected dry hotspots in the mid-western part, which was not found from the water balance ( Figure 3f). The two LIS land surface models, the Noah3.3 and the CLSMF2.5, relatively well captured the spatial variation of mean ETwb, although there were some underestimations by the Noah3.3 in the northern part (Figure 3g and h). The ERA-Interim ETa was of largely biased in coastal areas (Figure 3i).
The continental means ETa for 2002-2012 from the eight models provide a consistent indication (Figure 3a). The CR 255 method generated a positive bias of +58.2 mm a -1 (+14%) relative to the water-balance ETwb. Among the eight models, the CLSMF2.5 produced the minimum bias (+0.5%), whereas the largest bias (+24%) was from the ERA-Interim ETa. In the Taylor (2001) diagram that measures the standard deviation, and the root mean square error (RMSE) and the pattern correlation to ETwb together, the CR method appeared to be in the 6 th rank. The CLSMF2.5 ETa was the closest to ETwb, and followed in order by the PT-JPL, the AWRA-L, the FluxCom, the Noah3.3, the CR method, the GLEAM, and the ERA-Interim. 260 The continental annual means of CR ETa were considerably higher than those from ETwb owing to the positive biases in coastal areas, and their interannual variability was smaller than that of ETwb (      While the vast majority of models in Pan et al. (2020) suggested the largest interannual ETa in regions with annual precipitation between 700 and 1,000mm a -1 , the calibration-free CR method was unable to pronounce such variations in (semi-)humid

locations. 295
In the global-scale evaluation of Pan et al. (2020), land surface models pronounced higher ETa variability relative to physical and machine-learning models. Figure 5 provide a consistent indication that the annual ETa averages from AWRA-L, Noah3.3, CLSMF2.5, and ERA-Interim had interannual variability much higher than those from the PT-JPL and the machine-   We here argue that the inferior performance of the CR method could be led by the parameter αe=1.09 fixed across the large continent where climatic gradients are steep between dry inland and wet coastal areas. In Brutsaert et al. (2020), the Priestley-Taylor coefficient within a CR formulation is tightly related with climatological aridity. In other words, applying the 330 constant αe in every location could nullify the influence of climatological variation on the lower bound of ETp. We hence checked the spatial variation of αe optimal for the CR method by calibrating it against the long-term mean ETwb for 1991-2001 at each grid. Figure 7 displays the optimal αe that minimizes the biases between the mean ETwb and CR ETa, indicating that it tends to increase from coastal areas to western inland locations. In the western part, the calibrated αe was often greater than the typical range of the Priestley-Taylor coefficient (1.1-1.32). The distribution of the optimal αe was likely to reduce the 335 positive biases in coastal locations and the underestimation in the western part produced by application of the constant αe=1.09.
When regenerating ETa for 2002-2012 with the distributed αe, the CR method became the best among the eight models in reproducing the temporal and spatial variations of mean ETwb and in the Taylor diagram ( Figure 8). Note that the optimal αe was found with the ETwb set separated from the new simulation.
The αe distribution in Figure 7 provides a counteractive indication to Brutsaert et al. (2020) in which the parameter 340 αe within the CR of Brutsaert (2015) was calibrated against the eddy-covariance observations across the globe. The relationship between climatological aridity and αe developed by Brutsaert et al. (2020) implicates the tendency of αe exponentially decreasing with aridity. In contrast, Figure 7 indicates that wet regions have αe values lower than in (semi-)arid inland locations.
The central-eastern part receiving little precipitation has larger αe values than the western part. The contradictory result from our work could be explained by the CR formulation from Brutsaert (2015). In the original CR derivation by Brutsaert (2015), 345 ETw was forced to become nil for ETa = 0, posing an ill-defined assumption that Rn is always zero even when a desiccated surface is warm (Crago and Qualls, 2018;Szilagyi et al., 2017, Crago et al., 2016, and making the calibrated αe oversensitively decline with aridity. Since Szilagyi et al. (2017) mended this problem by introducing the upper bound of ETp (i.e., Epmax), the calibrated αe in Figure 7a would not hold the same sensitivity to aridity changes.  Instead, one could analytically relate the parameter αe with climatic variables by equating the Penman and the Priestley-Taylor equations, because ETp and ETw must be equal under ample water conditions, yielding: where, VPDws is the VPD of the atmosphere overpassing the hypothetical wet surface. Although VPDws would be small owing 355 to the interacting wet surface (Brutsaert and Stricker, 1979), Eq. (8) implies that the climatic variables, Rn, Δws, and u2, could amplify the effects of VPDws on αe. We conducted simple correlation analyses between the calibrated αe values and the corresponding averages of Rn, Tavg (i.e., the control of Δws), and u2. The Pearson r values of the αe to the three variables were -0.56, 0.44, and 0.29, respectively (significant at 1% levels). The simple regression analyses, in addition, showed that the Rn, Tavg, and u2 explain 32%, 19%, and 9% of the variation of the calibrated αe values, respectively (significant at 1% levels). In 360 other words, variation of Rn plays a substantial role in determining the optimal αe for the CR method, and it was found that locations with high αe tend to have low Rn (Figure 7b).

Nevertheless, the calibrated CR provided little improvements in reproducing the interannual variability and trends of
ETwb. Despite the slightly increased interannual variability in central-northern areas, the new maps of interannual variability and trends (Figure 9) were still similar to the outcomes from the fixed αe=1.09. This might be attributable in part to vegetation 365 dynamics neglected in the CR method. The Penman and Priestley-Taylor equations assume stationary land-surface parameters, and are unable to capture the plants' behaviors to atmospheric conditions that play a considerable role in the precipitation partitioning (Jasechko, 2018). Modelling studies showed that the absence of surface responses to CO2 fertilization has led offline hydrologic models to runoff projections contradictory to simulations of earth system models (e.g., Milly and Dunne, 2016;Swann et al., 2016;Roderick et al., 2015). equation has worked well in Australia for the past decades (e.g., Crago and Quall, 2018), the prior studies suggest that surface responses to atmospheric changes could considerably affect temporal changes in ETp, ETw and thus ETa. This necessitates further refinements for the CR method to synthesize the surface behaviors explicitly under non-water-limiting conditions. It is worth noting that Australian carbon sink has enhanced during the 21 st century even at increasing wildfire risks owing to the 375 plants' water-use efficiency and productivity increased by CO2 fertilization (Kelly and Harrison, 2014).
In short, to capture the spatial variation of mean ETa in Australia, the CR method needs to consider the influence of climatic variables on the parameter αe. To regenerate the interannual variability and trends, the equations in the CR formulation are seemingly required incorporating dynamic surface parameters. In this case, the operational advantage of the CR method (i.e. no need of surface data) could disappear in return. 380

Intercomparison between ETa products
Since precipitation is the primary control of ETa in the dry continent, biases and errors in the GRUN and LORA runoff dataset are unlikely to induce large biases in the grid-scale water balance. As mentioned earlier, the ratio of ETa to precipitation is approximately 90% in Australia, suggesting that caveats in the GPCC precipitation are major error sources to 390 ETwb. Typically, errors in a grid precipitation product are introduced by: (i) the systematic measuring errors from evaporation out of rain gauges and aerodynamic effects, and (ii) the sampling errors from low gauging density. The GPCC precipitation takes an advanced correction and anomaly interpolation methods for reducing the systematic and the sampling errors via a very rigorous quality control framework (Schneider et al., 2014). The precipitation product has well closed the global water budget, becoming a reliable evaluation reference for other grid precipitation products (e.g., Sun et al., 2017). The quality of 395 the GPCC data, hence, was unlikely a major concern.
Compared to ETwb, however, the other ETa models are subject to diverse limitations. The remote-sensing physical models do not account for soil moisture dynamics playing a pivotal role in canopy conductance and bare soil evaporation in (semi-)arid regions (Pan et al., 2020). While the GLEAM takes soil moisture into account in the ETa synthesis, whereas the PT-JPL does not consider the soil moisture dynamics and underrates ETa in shrubs and deserts in the southern hemisphere 400 (Miralles et al., 2016). Given that bare soil evaporation introduces the largest error to ETa estimates (Talsma et al., 2018), the PT-JPL needs any corrections for operational ETa monitoring in Australia.
On the other hand, the machine-learning FluxCom, which showed the worst performance in this work, has important caveats. Even though it acceptably simulates long-term averages of the surface energy fluxes, the FluxCom carbon fluxes are ETa product needs to be further trained by any datasets describing the interannual behaviors. Pan et al. (2020) showed that interannual variability of ETa products by 14 land surface models was dominantly controlled by precipitation in most of regions in the Southern Hemisphere. However, they also highlighted the dynamic root 410 parameterization of the ORCHIDEE-MICT model (Guimberteau et al., 2018), which is distinguishable from the other models, suggesting that ETa changes in Australia could be less sensitive to precipitation changes than indicated by commonly-adopted land surface models. Hence, larger interannual variability than in ETwb could be an indication that land surface parameterization might be oversensitive to precipitation changes (e.g., the simulations in the eastern part by the AWRA-L and the CLSMF2.5).
Though it outperformed other models in streamflow generation (Frost et al., 2015), the AWRA-L needs corrections for the 415 unexpected dry hotspots in inland areas. The two LIS land surface models, in addition, provided a consistent indication with the prior application for the Upper Blue Nile River (Jung et al., 2017). The CLSMF2.5 tends to provide higher ETa and lower streamflow than the Noah3.3, and it better represented the water budget in the sub-humid river basin. The overestimation in the ERA-Interim product was also found by Miralles et al. (2016) and Mueller et al. (2013). Sun et al. (2017) pointed out that the ERA-Interim often prescribed annual precipitation exceeding the GPCC P data. 420

Conclusions
In this work, we evaluated applicability of the calibration-free CR formulation in Australian land surfaces mostly under (semi-)arid climates. The terrestrial evapotranspiration (ETa) produced by the CR method was compared with a bunch of ETa products from physical, machine-learning, and land surface models, and their spatial and temporal variations and decadal trends were evaluated against the estimates from water balance. While it could generate the ETa product strongly 425 correlated with the water-balance estimates, the CR method seemed to introduce considerable biases when comparing to the other models. In Australia mostly under (semi-)arid climates, the approach proposed by Szilagyi et al. (2017) was unlikely to outperform typically-adopted physical, machine-learning, and land surface models, and thus necessitates better parameterization for improvement. We draw the following conclusions worth emphasizing: (1) The optimal coefficient (αe) for the wet-environment evapotranspiration is unlikely constant. The αe=1.09 430 obtained from the calibration-free approach introduced positive biases in (semi-)humid coastal areas while underestimating in arid locations. When calibrating αe each grid with the independent set of water-balance estimates, αe seems to respond to climatic variations.
(2) Even with the calibrated αe, the CR method insufficiently captured the interannual variability and the decadal trends of the water-balance estimates at the grid scale. Since the latent heat flux is not only controlled by water 435 stress but atmospheric conditions (e.g. CO2 concentration), any formulation that captures land surface behaviors under non-water-liming conditions would be necessary in quantification of the wet-environment and potential evapotranspiration.
(3) The evaluations of the physical, the machine-learning, and the land surface models provided a consistent implication with the prior global-scale studies. A remote-sensing physical model can better represent the surface 440 energy balance by explicit consideration of soil moisture dynamics. The machine-learning depending largely on training datasets can suppress interannual variability and lead to overestimation in arid locations. ETa products from land surface models could be more sensitive to precipitation variability than physical and machine-learning models.

Author contributions 445
DK, MC, and JAC designed the study all together. DK simulated terrestrial evapotranspiration with the CR method and drafted the manuscript. JAC run the LIS land surface models and collected the other datasets for comparative evaluation. MC participated in discussion and review of the results and the manuscript.

Competing interests
The authors declare no competing interests. 450

Code availability
The R scripts for the CR method are available upon request from the leading author (daeha.kim@jbnu.ac.kr).
In this work, the wet cells within the Australian continent were identified as locations satisfying the two conditions of Tws > 680 Tavg+3°C and RH > 90%. While Szilagyi et al. (2017) used Tws > Tavg+2°C, we considered the fact that Tws estimates from Monteith (1981) is approximately 1°C higher than those estimated by the implicit Bowen ratio (Szilagyi, 2014). The results showed that very few cells (less than 1% of the Australian continent) satisfied the given criteria and their αe values from Eq.
(A2) were within a very narrow range of 1.09 ± 0.01 (mean ± standard deviation). This calibration-free approach assumes that the mean of the αe values is applicable for the entire region, thus assuming that a suitable αe is spatially and temporally constant.