Evapotranspiration (ET) is a major component of the land surface process involved in energy fluxes and energy balance, especially in the hydrological cycle of agricultural ecosystems. While many models have been developed as powerful tools to simulate ET, there is no agreement on which model best describes the loss of water to the atmosphere. This study focuses on two aspects: evaluating the performance of four widely used ET models and identifying the parameters and physical mechanisms that have significant impacts on model performance. The four tested models are the Shuttleworth–Wallace (SW) model, Penman–Monteith (PM) model, Priestley–Taylor and Flint–Childs (PT–FC) model, and advection–aridity (AA) model. By incorporating the mathematically rigorous thermodynamic integration algorithm, the Bayesian model evidence (BME) approach is adopted to select the optimal model with half-hourly ET observations obtained at a spring maize field in an arid region. Our results reveal that the SW model has the best performance and that the extinction coefficient not only partitions the total available energy between the canopy and the surface but also includes the energy imbalance correction. The extinction coefficient is well constrained in the SW model, poorly constrained in the PM model, and not considered in the PT–FC and AA models. This is one of the main reasons that the SW model outperforms the other models. Meanwhile, the good fit of the SW model to observations can counterbalance its higher complexity. In addition, the detailed analysis of the discrepancies between observations and model simulations during the crop growth season indicates that explicit treatment of energy imbalance and energy interaction will be the primary way of further improving ET model performance.

Surface energy fluxes are an important component of Earth's global energy budget and a primary determinant of surface climate. Evapotranspiration (ET), as a major energy flux process for energy balance, accounts for about 60 %–65 % of the average precipitation over the surface of the Earth (Brutsaert, 2005). In agricultural ecosystems, more than 90 % of the total water losses are due to ET (Morison et al., 2008). Therefore, robust ET estimation is crucial to a wide range of problems in hydrology, ecology, and global climate change (Xu and Singh, 1998). In practice, much of our understanding of how land surface processes and vegetation affect weather and climate is based on numerical modeling of surface energy fluxes and the atmospherically coupled hydrological cycle (Bonan, 2008). Several models are commonly used in agricultural systems to evaluate ET. The Penman–Monteith (PM) and Shuttleworth–Wallace (SW) models are physically sound and rigorous (Zhu et al., 2013) and thus widely used to simulate ET for seasonally varied vegetation. These models consider the relationships between net radiation, the various heat fluxes (such as latent heat, sensible heat, and heat from soil and canopy), and surface temperature. The Priestley–Taylor and Flint–Childs (PT–FC) model (based on radiation) and the advection–aridity (AA) model (based on meteorological variables) have also been widely used because they require only a small number of ground-based measurements to set up (Ershadi et al., 2014).

Comparing the performance of the competing ET models and evaluating and understanding the discrepancies between simulations of the models and corresponding observed surface–atmosphere water flux remain challenging problems (Legates and McCabe, 1999). Both non-Bayesian analysis (Szilagyi and Jozsa, 2008; Vinukollu et al., 2011; Li et al., 2013; Ershadi et al., 2015) and Bayesian analysis have been used to evaluate the performance of ET models (Zhu et al., 2014; Chen et al., 2015; Liu et al., 2016; Zhang et al., 2017; Elshall et al., 2018; Samani et al., 2018). Li et al. (2013) compared the ET simulations of the PM, SW and adjusted SW models under film-mulching conditions of maize growth in an arid region of China. They found that the half-hourly ET was overestimated by 17 % in the SW model. In contrast, the PM and adjusted SW models underestimated the daily ET by 6 % and 2 %, respectively. Therefore, the PM and adjusted SW models performed better than the SW model in their case study. Ershadi et al. (2014) evaluated the surface energy balance system (SEBS), PM, PT–JPL (the Priestley–Taylor Jet Propulsion Laboratory model, a modified Priestley–Taylor model), and AA models. Based on the average value of model efficiency (EF) and RMSE, the model ranking from worst to best was AA, PM, SEBS, and PT–JPL. Ershadi et al. (2015) also compared the response of the models to different formulations of aerodynamic and surface resistances with global FLUXNET data. Their results showed considerable variability in model performance among and within biome types. Currently, ET model selection and comparison are still conducted using traditional error metrics. It is known that error metrics are not adequate for providing a reasonable model ranking because they disregard model complexity (Marshall et al., 2005; Samani et al., 2018).
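
As an illustration of the error metrics discussed above, the following minimal Python sketch computes the mean bias error (MBE), RMSE, and model efficiency (EF) from paired observations and simulations. The function name `error_metrics`, the sign convention (simulated minus observed), and the use of the Nash–Sutcliffe form for EF are assumptions made for this example and are not taken from the studies cited. Note that none of these quantities depends on the number of model parameters, which is why they cannot penalize complexity.

```python
import numpy as np


def error_metrics(obs, sim):
    """Return MBE, RMSE and EF for paired observed and simulated values.

    Assumed conventions (not from the paper): residual = sim - obs, and
    EF is the Nash-Sutcliffe efficiency (1 is a perfect fit).
    """
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    resid = sim - obs
    mbe = resid.mean()                       # residual-based: sign is kept
    rmse = np.sqrt(np.mean(resid ** 2))      # squared-residual-based
    ef = 1.0 - np.sum(resid ** 2) / np.sum((obs - obs.mean()) ** 2)
    return {"MBE": mbe, "RMSE": rmse, "EF": ef}
```

Calling `error_metrics(observed_et, simulated_et)` on two equally long sequences returns a dictionary with the three metrics.
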
The focus of this study is to use a Bayesian approach to evaluate the performance of the PM, SW, PT–FC, and AA models, which is a novel contribution of this study. In ET models, the land surface energy system is governed by presumably infinitely dimensional physics. However, considering the ET models to be finitely dimensional can be more precise by covering all relevant relations. Therefore, employing consistent criteria for model selection might be justified when the aim is to better understand the processes involved (Höge et al., 2018). When using consistent model selection, Bayesian model evidence (BME), also known as marginal likelihood, measures the average fit of model simulations to their corresponding observations over a model's prior parameter space. This feature enables BME to consider model complexity (in terms of the number of model parameters) for model performance evaluation. When comparing several alternative conceptual models, the model with the largest marginal likelihood is selected as the best model (Lartillot and Philippe, 2006). BME can thus be used for evaluating the model fit (over the parameter space) and for comparing alternative models. In previous studies, the Bayesian information criterion (BIC; Schwarz, 1978) and the Kashyap information criterion (KIC; Kashyap, 1982) have been used to approximate BME by using maximum likelihood theories to reduce the computational cost of evaluating BME (Ye et al., 2004). However, these approximations have theoretical and computational limitations (Ye et al., 2008; Xie et al., 2011; Schöniger et al., 2014), and a numerical evaluation (not a likelihood approximation) of BME is necessary, especially for complex models (Lartillot and Philippe, 2006). Lartillot and Philippe (2006) advocated the use of thermodynamic integration (TI), also known as path sampling (Gelman and Meng, 1998; Neal, 2000), for estimating BME in order to avoid sampling solely in the prior or posterior parameter space.
TI uses samples that are systematically generated from the prior to the posterior parameter space by conducting path sampling with several discrete power coefficient values (Liu et al., 2016). It is more numerically accurate than the generally used harmonic mean method (Xie et al., 2011).
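
The idea behind TI can be sketched on a toy problem. TI rests on the identity that the log of BME equals the integral, over a power coefficient beta from 0 to 1, of the expected log-likelihood under the "power posterior" (prior times likelihood raised to beta). The Python sketch below is not the ET application: it assumes a Gaussian likelihood with known standard deviation `sigma` and a zero-mean Gaussian prior with standard deviation `tau`, for which each power posterior can be sampled directly and the exact evidence is available for comparison. The function names, the number of beta values, and the beta schedule (a fifth-power spacing that clusters values near the prior) are all assumptions made for illustration.

```python
import numpy as np


def log_likelihood(theta, y, sigma):
    """Gaussian log-likelihood of data y for each candidate mean in theta."""
    theta = np.atleast_1d(theta)
    resid = y[None, :] - theta[:, None]
    const = -0.5 * y.size * np.log(2.0 * np.pi * sigma ** 2)
    return const - 0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2


def ti_log_evidence(y, sigma, tau, n_beta=50, n_samples=20000, seed=0):
    """Estimate log BME by thermodynamic integration (toy conjugate model)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_beta + 1) ** 5   # dense near the prior
    expected = np.empty_like(betas)
    for k, beta in enumerate(betas):
        # Power posterior is Gaussian here, so it can be sampled exactly.
        precision = 1.0 / tau ** 2 + beta * y.size / sigma ** 2
        mean = beta * y.sum() / sigma ** 2 / precision
        theta = rng.normal(mean, np.sqrt(1.0 / precision), size=n_samples)
        expected[k] = log_likelihood(theta, y, sigma).mean()
    # Trapezoidal rule over the discrete power coefficients.
    return float(np.sum(0.5 * (expected[1:] + expected[:-1]) * np.diff(betas)))


def exact_log_evidence(y, sigma, tau):
    """Closed-form log marginal likelihood of the same toy model."""
    n, s = y.size, y.sum()
    quad = (np.sum(y ** 2) - tau ** 2 * s ** 2 / (sigma ** 2 + n * tau ** 2)) / sigma ** 2
    logdet = n * np.log(sigma ** 2) + np.log1p(n * tau ** 2 / sigma ** 2)
    return float(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad))
```

For a real ET model the power posteriors are not available in closed form, so each one would have to be sampled with an MCMC algorithm instead of the direct draw used here.
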

Most applications of Bayesian methods have focused on the calibration of individual models, while the comparison of alternative models continues to be performed using traditional error metrics. More generally, Bayesian approaches to model calibration, comparison, and analysis have been used far less in the evaluation of ET models than in other areas of environmental science. In this study, the Bayesian approach is used to calibrate and evaluate the four ET models (PM, SW, PT–FC, and AA) based on an experiment over a spring maize field in an arid area of northwestern China from 3 June to 27 September 2014. The objectives of the study are as follows: (1) to calibrate ET model parameters using the DiffeRential Evolution Adaptive Metropolis (DREAM) algorithm (Vrugt et al., 2008, 2009), (2) to identify which parameters had a greater impact on the model performance and to explain why the selected optimal model performed best, (3) to evaluate the performance of the models using traditional error metrics and BME, and (4) to analyze discrepancies between model simulations and observation data in order to better understand model performance and identify ways to improve these models. We expect that the study will not only boost the development of model parameterization and model selection but also contribute to the improvement of the ET models.

The experiment of maize growth was conducted at the Daman Superstation, located
in Zhangye, Gansu province, northwestern China. Daman oasis is located in
the middle Heihe River basin, which is the second largest inland river basin
in the arid region of northwestern China. The midstream area of the Heihe River
basin is characterized by oases with irrigated agriculture and is a region
that consumes a large amount of water for both domestic and agricultural
uses. The annual average precipitation and temperature are 125 mm and 7.2

Our data were collected from the field observation systems of the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) project as described in Li et al. (2013). The observation period was from DOY (day of the year) 154 to DOY 270 in 2014. An open-path eddy covariance (EC) system was installed in a maize field, with the sensors at a height of 4.5 m. Maize is the main crop in the study region and thus covers a planting area large enough for the EC measurements. The EC data were logged at a frequency of 10 Hz and then averaged over 30 min intervals. Sensible and latent heat fluxes were computed by the EC approach of Baldocchi (2003). The EC flux data were quality-controlled using standard methods, including three-dimensional rotation (Aubinet et al., 2000); Webb–Pearman–Leuning (WPL) density fluctuation correction (Webb et al., 1980); frequency response correction (Xu et al., 2014); and removal of spurious data caused by rainfall, water condensation, and system failure. An energy balance closure of about 85 % was observed in the EC data (Liu et al., 2011).
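
The energy balance closure quoted above can be illustrated with a short Python sketch. It assumes the common definition of the closure ratio, the sum of the turbulent fluxes (sensible heat H plus latent heat LE) divided by the sum of the available energy (net radiation Rn minus soil heat flux G), taken over all half-hourly records with complete data. The function name and argument names are hypothetical, and the paper's exact procedure may differ.

```python
import numpy as np


def energy_balance_closure(h, le, rn, g):
    """Closure ratio sum(H + LE) / sum(Rn - G), all fluxes in W m-2.

    Records with a missing (NaN) value in any of the four fluxes are skipped.
    """
    h, le, rn, g = (np.asarray(a, dtype=float) for a in (h, le, rn, g))
    valid = np.isfinite(h) & np.isfinite(le) & np.isfinite(rn) & np.isfinite(g)
    turbulent = np.sum(h[valid] + le[valid])
    available = np.sum(rn[valid] - g[valid])
    return turbulent / available
```

A ratio of 1 means the measured turbulent fluxes account for all of the available energy; a value below 1 indicates that the EC system underestimates the turbulent fluxes relative to the available energy.
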

Standard hydro-meteorological variables, including rainfall, air temperature, wind speed, and wind direction, were continuously measured at heights of 3, 5, 10, 15, 20, 30, and 40 m above the ground. Soil temperature and moisture were measured at depths of 2, 4, 10, 20, 40, 80, 120, and 160 cm. Photosynthetically active radiation was measured at a height of 12 m. Net radiation, including downward, upward, and longwave radiation, was measured by a four-component net radiometer. An infrared thermometer was installed at a height of 12 m. The leaf area index (LAI) was measured approximately every 10 d during the growing season.

In this section, we summarize the mathematical definitions forming the basis of each of the four models. Appendix A contains a summary of the names and physical meanings of the model parameters.

The PM model can be formulated in the following way (Monteith, 1965):
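
For orientation, the widely used textbook form of the Penman–Monteith combination equation can be sketched in Python as below. This is the generic form, not necessarily the exact formulation or parameterization used in this study; the function name, argument names, units, and default constants are assumptions made for the example.

```python
def penman_monteith(rn, g, vpd, ra, rs, delta, gamma=0.066, rho=1.2, cp=1013.0):
    """Latent heat flux (W m-2) from the textbook Penman-Monteith equation.

    rn, g  : net radiation and soil heat flux (W m-2)
    vpd    : vapor pressure deficit (kPa)
    ra, rs : aerodynamic and surface resistance (s m-1)
    delta  : slope of the saturation vapor pressure curve (kPa K-1)
    gamma  : psychrometric constant (kPa K-1), assumed default
    rho    : air density (kg m-3), assumed default
    cp     : specific heat of air (J kg-1 K-1), assumed default
    """
    numerator = delta * (rn - g) + rho * cp * vpd / ra
    denominator = delta + gamma * (1.0 + rs / ra)
    return numerator / denominator
```

In this form, a larger surface resistance `rs` lowers the simulated latent heat flux, which is why the parameterization of surface (canopy) resistance matters for model performance.
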

In the present study,

The SW model comprises a one-dimensional model of plant transpiration and a
one-dimensional model of soil evaporation. The two terms are calculated by
the following equations:

The Priestley–Taylor model (Priestley and Taylor, 1972) was introduced to
estimate evaporation from an extensive wet surface under conditions of
minimum advection (Stannard, 1993; Sumner and Jacobs, 2005). The ET is
expressed as
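
As a point of reference, the original Priestley–Taylor equation can be sketched in Python as follows. This is the generic form with a constant coefficient `alpha` (1.26 is the value commonly quoted for wet surfaces); the Flint–Childs modification used in the PT–FC model, which adjusts the coefficient, is not reproduced here. The function and argument names are assumptions made for the example.

```python
def priestley_taylor(rn, g, delta, gamma=0.066, alpha=1.26):
    """Latent heat flux (W m-2) from the original Priestley-Taylor equation.

    rn, g : net radiation and soil heat flux (W m-2)
    delta : slope of the saturation vapor pressure curve (kPa K-1)
    gamma : psychrometric constant (kPa K-1), assumed default
    alpha : Priestley-Taylor coefficient (dimensionless), assumed default
    """
    return alpha * delta / (delta + gamma) * (rn - g)
```

Because the equation needs only available energy and temperature-dependent terms, it requires far fewer inputs than resistance-based models.
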

The AA model was first proposed by Brutsaert and Stricker (1979) and further
improved by Parlange and Katul (1992). The model relies on the feedback
between actual ET (

The Bayesian model evidence (BME) of a model,

The random samples,

The traditional error metrics for evaluating model performance include

Trace plots of the G–R statistic using DREAM for the PM model

Prior distributions and parameter limits for the PM, SW, PT–FC and AA models. The values are derived from the literature.

The PM model has five parameters,

Maximum likelihood estimates (MLEs), mean estimates, 95 % high-probability intervals (lower limit, upper limit).

Histograms of the DREAM-derived marginal distributions of the parameters are
presented in Fig. 2 and summarized in Table 2 by maximum likelihood
estimates (MLEs), posterior medians, and 95 % probability intervals.
Figure 2a–e, f–l, m–n, and o–p show histograms of the PM, SW, PT–FC, and AA models, respectively. Parameter

The performance of each of the four ET models was evaluated over the course
of the whole season in 2014. The calibrated parameters of the four models
were used, and individual ET models were run to estimate the half-hourly

Regressions between measured and modeled half-hourly ET values produced by different models from DOY 154 to DOY 270:

Mean bias error (MBE) of predicted and observed ET values for

Slope and coefficient of determination (

In general, the four models produced slightly better fits to the measured

Figure 4 shows that large seasonal variations arise in the MBE for the four ET
models. From the variations in the MBE, the estimated

Since there is currently no theoretical method for selecting power posterior

Variation of the mean posterior expectation of the potential

With regard to the efficiency of the DREAM algorithm, the acceptance rates of the PM (15.3 %) and SW (18.9 %) models were much higher than those obtained by some MCMC algorithms that have been used in previous studies (Sadegh and Vrugt, 2014). The posterior parameter bounds exhibit a larger reduction using the DREAM algorithm compared with other studies using the Metropolis–Hastings algorithm. This demonstrates that DREAM could efficiently handle problems involving high dimensionality, multimodality, and nonlinearity.

The results showed that the assumed prior uncertainty ranges for most
parameters in the four models were significantly reduced. This indicates
that the observed ET data contained sufficient information for estimating these
parameters. Surface conductance

The ecophysiological parameter

Parameter

In general, parameters related to soil surface resistance in the SW model were well constrained, while parameters related to canopy surface resistance in the PM and SW models were poorly estimated. Therefore, using a reliable canopy surface resistance equation in the ET model is crucial for improving its performance. In addition, our study used the traditional approach to quantifying uncertainty, which assumes that the uncertainty arises mainly from parameter uncertainty. However, this method cannot explicitly consider errors in the input data and model structural inadequacies. This is unrealistic for real applications, and it is desirable to develop a more reliable inference method that treats all sources of uncertainty separately and appropriately (Vrugt et al., 2008). Moreover, simultaneous direct measurements of sap flow and of daily soil evaporation (by micro-lysimeter) would further help to constrain the model parameters.

In this study, the traditional statistical measures and BME were chosen to
evaluate and compare the performance of four ET models. From the respective
composition of these measures, the statistical measures can be divided into
residual-based metrics (such as regression slope and MBE) and
squared-residual-based measures (such as

Previous studies showed that BME evaluated by TI provided estimates similar to the true values and selected the true model if the true model was included within the candidate models (Marshall et al., 2005; Lartillot and Philippe, 2006). Meanwhile, some have argued that Bayesian analysis would choose the simplest model (Jefferys and Berger, 1992; Xie et al., 2011) because of the best trade-off between good fit with the data and model complexity (Schöniger et al., 2014). In this case, the most complex SW model had the highest BME and was chosen as the model with the best performance. This probably resulted from the fact that the complex SW model is indeed the most reliable model among the alternative ET models and can provide a good fit to justify its higher complexity. The SW model is a two-layer model and simulates soil evaporation and plant transpiration separately, whereas the PM model is a single-layer model in which the plant transpiration and soil evaporation cannot be separated (Monteith, 1965). The PT–FC model is a simplified version of the PM model and only requires meteorological and radiation information (Priestley and Taylor, 1972), whereas the AA model only relies on the feedback between actual ET and potential ET (Brutsaert and Stricker, 1979).

The results indicate that the squared-residual-based measures consistently
yielded the same rank order as the BME, which suggests that, in this case, the
squared-residual-based metrics identified a reasonable rank order.
However, this is not generally the case, since the error metrics and BME
belong to different types of model selection and because there are differences in
the behavior and optimality of the two types of model selection. BME is a
consistent model selection that tries to identify which of the models
produced the observed data. Conversely, nonconsistent model selection uses
the available data to estimate which of the models might be best in
predicting future data. In fact, the error metrics are essentially
nonparsimonious model selection, which is a special case of nonconsistent
model selection. The simple traditional statistical measures are known to
often provide a biased view of the efficacy of a model (Kessler and Neas,
1994; Legates and McCabe, 1999), where only the goodness of fit is used for
rating models without penalizing the model complexity, thus lacking
consistency for the selected model (Höge et al., 2018). In addition,
sensitivity to outliers is associated with these metrics and leads to
relatively high values due to the squaring of the residual terms (Willmott,
1981). Furthermore, these traditional statistical metrics ignore the priors,
which are in fact used in Bayesian analysis. The PT–FC and AA models provide identical
estimates of

Conceptual and structural inadequacies of the hydrological model together with measurement errors of the model input (forcing) and output (calibration) data introduce errors in the estimated parameters and model simulations (Laloy et al., 2015). Hydrological systems are indeed heavily input-driven, and errors in forcing data can dramatically impair the quality of calibration results and model output (Bardossy and Das, 2008; Giudice et al., 2016). Measurement errors occur for a variety of reasons, including unreasonable gap filling on rainy days; dew and fog; inadequate areal coverage of point-scale soil water measurements; mechanical limitations of the EC system; and inaccurate measurements of wind speed, soil water, radiation, and vapor pressure deficit. The ET process is described using equations that can capture only parts of the complex natural processes, and any ET model is inherently a simplification of the real system. These inadequacies can thus lead to biased parameters and implausible predictions.

In our study, the results indicated that the PM and SW models overestimated
the half-hourly ET compared to the measured ET. Several studies also
indicated that ET was overestimated by the PM model (Fisher et al., 2005;
Ortega-Farias et al., 2006; Li et al., 2015) and the SW model (Li et al.,
2013, 2015; Zhang et al., 2008). Possible reasons for the
inaccurate estimates included the following: (1) anisotropic turbulence with
weak vertical and strong horizontal fluctuation leads to energy imbalance.
The total turbulent heat flux was lower by

The estimates for ET produced by the PT–FC and AA models were generally
lower than the measured values during the entire season. In addition, the
four models also underestimated ET during periods of partial cover
(LAI

This study illustrated the application of the Bayesian approach to the
statistical analysis and model selection of four widely used ET models. The
results showed that the DREAM algorithm successfully reduced the assumed
prior uncertainties for most of the parameters in the four models. In the
model calibration, the key parameters which had a significant influence on
ET simulations were well constrained. The SW model outperformed the other
models mainly because of its physically rigorous structure and because the
extinction coefficient, a sensitive parameter with a significant impact on
model performance, was well constrained. BME is a consistent
model selection criterion that identifies the model best fitting the observed data. Although
the squared-residual-based metrics, including

The model–data discrepancies were analyzed to facilitate model improvement after Bayesian model calibration and comparison. The results indicate that the discrepancies arose mainly as a result of energy imbalance caused by anisotropic turbulence, additional energy induced by advection processes, the absence of a mechanistic representation of the physiological response to plant hydrodynamics and the energy interaction between the canopy and surface. Among these causes, energy imbalance and additional energy are related to forcing data errors rather than to an unreasonable model structure. Thus, understanding the process of the physiological response to plant hydrodynamics, and the interaction between the canopy and surface is essential for improving the performance of evapotranspiration models. Overall, the applications of Bayesian calibration, Bayesian model evaluation, and analysis of model–data discrepancies in our study provide a promising framework for reducing uncertainty and improving the performance of ET models. It would be desirable to confirm whether the SW model is the optimal model using data of other crops or other climate regions.

The eddy covariance flux, meteorological, and other data used in this study are from Heihe Watershed Allied Telemetry
Experimental Research (HiWATER) (

The posterior probability distribution of the parameter is calculated by
Bayes' theorem:
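
In generic notation (the symbols are assumptions chosen for this illustration, with $\theta$ the parameter vector, $D$ the observed data, and $M$ the model, and may differ from the notation of the original equation), Bayes' theorem reads as follows, where the denominator is the Bayesian model evidence:

```latex
p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{p(D \mid M)},
\qquad
p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, \mathrm{d}\theta .
```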

The likelihood function,

In this study, we used the DREAM algorithm (Vrugt et al., 2008, 2009) to explore the ET models' parameter space and to estimate BME. The DREAM sampling scheme is an adaptation of the shuffled complex evolution Metropolis (SCEM-UA) global optimization algorithm. The algorithm is described in more detail in Vrugt et al. (2008, 2009).
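
To give a feel for how this family of samplers works, the Python sketch below implements plain differential evolution Markov chain (DE-MC) sampling, the simpler scheme on which DREAM builds. It is not the DREAM algorithm itself, which adds features such as randomized subspace sampling and outlier chain handling. Each chain proposes a jump along the difference between two other randomly chosen chains and accepts it with the Metropolis rule. The function name `de_mc`, the jump scale `2.38 / sqrt(2 d)`, and the small jitter of `1e-4` are assumptions made for this example.

```python
import numpy as np


def de_mc(log_post, init, n_iter=3000, seed=0):
    """Differential evolution MCMC (simplified; not the full DREAM scheme).

    log_post : function returning the unnormalized log posterior of a vector
    init     : array of shape (n_chains, n_dim) with starting points
    Returns an array of shape (n_iter, n_chains, n_dim) with all states.
    """
    rng = np.random.default_rng(seed)
    chains = np.array(init, dtype=float)
    n_chains, n_dim = chains.shape
    gamma = 2.38 / np.sqrt(2.0 * n_dim)          # commonly used jump scale
    logp = np.array([log_post(c) for c in chains])
    history = np.empty((n_iter, n_chains, n_dim))
    for t in range(n_iter):
        for i in range(n_chains):
            # Pick two distinct other chains to define the jump direction.
            others = [j for j in range(n_chains) if j != i]
            a, b = rng.choice(others, size=2, replace=False)
            jitter = rng.normal(0.0, 1e-4, size=n_dim)
            proposal = chains[i] + gamma * (chains[a] - chains[b]) + jitter
            logp_new = log_post(proposal)
            # Metropolis acceptance on the log scale.
            if np.log(rng.uniform()) < logp_new - logp[i]:
                chains[i], logp[i] = proposal, logp_new
        history[t] = chains
    return history
```

In use, the first part of `history` would be discarded as burn-in and the remaining states pooled across chains to form the posterior sample; the number of chains should exceed the number of parameters so that the chain differences span the parameter space.
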

GW and XZ designed the experiments. NY and FK carried them out. MY developed the model selection scheme. GW performed the simulations. GW and XZ prepared the paper, with contributions from all co-authors.

The authors declare that they have no conflict of interest.

We thank Ying Guo, Huihui Dang, and Jun Dong for data collection and analysis. All observed data used in this study are from Heihe Watershed Allied Telemetry Experimental Research (HiWATER). We thank all the staff who participated in HiWATER field campaigns. Considerate and helpful comments by anonymous reviewers have considerably improved the paper.

This research has been supported by the National Natural Science Foundation of China (grant nos. 41471023 and 41702244), the Department of Energy (grant no. DE-SC0019438), and the National Science Foundation – Division of Earth Science grant no. 1552329.

This paper was edited by Bill X. Hu and reviewed by Dan Lu and two anonymous referees.