Intercomparison of freshwater fluxes over ocean and investigations into water budget closure

The development of algorithms for the retrieval of water cycle components from satellite data, such as total column water vapor content (TCWV), precipitation (P), latent heat flux, and evaporation (E) has seen much progress in the past three decades. In the present study, we compare six recent satellite-based retrieval algorithms and ERA5 (the European Centre for Medium-Range Weather Forecasts’ fifth reanalysis) freshwater flux (E-P) data regarding global and regional, seasonal and inter-annual variation to assess the degree of correspondence among them. The compared data sets are recent, freely available 5 and documented climate data records (CDRs), developed with a focus on stability and homogeneity of the time series, as opposed to instantaneous accuracy. One main finding of our study is the agreement of global ocean means of all E-P data sets within the uncertainty ranges of satellite-based data. Regionally, however, significant differences are found among the satellite data and with ERA5. Regression analyses of regional monthly means of E, P, and E-P against the statistical median of the satellite data ensemble (SEM) 10 show that, despite substantial differences in global E patterns, deviations among E-P data are dominated by differences in P throughout the globe. E-P differences among data sets are spatially inhomogeneous. We observe that for ERA5 long-term global E-P is very close to 0 mm/day and that there is good agreement between land and ocean mean E-P, vertically integrated moisture ::: flux : divergence (VIMD), and global TCWV tendency. The fact that E and P are balanced globally provides an opportunity to investigate the consistency between E and P data sets. Over ocean, 15 P (nearly) balances with E if the net transport of water vapor from ocean to land ( ::::::::::: approximated :: by : over-ocean VIMD, i.e., ∇Qocean::::::::::: ∇ · (vq)ocean) is taken into account. Correlation of Eocean−∇Qocean :: On:a:::::::: monthly :::: time::::: scale,::::: linear::::::::: regression :: of ::::::::::::::::::: Eocean−∇ · (vq)ocean with Pocean yields R = 0.86 for ERA5, but smaller R are found for satellite data sets. Climatological global yearly totals of water cycle components (E, P, E-P, and net transport from ocean to land and vice versa) calculated from the data sets used in this study are in agreement with previous studies, with ERA5 E and P are occupying the 20 upper part of the range. Over ocean, both the spread among satellite-based E and the difference between two satellite-based P data sets are greater than E-P and these remain the largest sources of uncertainty within the observed global water budget. We conclude that for a better understanding of the global water budget, the quality of E and P data sets themselves and their associated uncertainties need to be further investigated :::: needs :: to ::: be :::::::: improved ::: and ::: the ::::::::::: uncertainties :::: more ::::::::: rigorously ::::::::: quantified.


25
The water and energy cycles are key components of Earth's climate system. Energy exchange from water phase changes plays a direct role in atmospheric heating; therefore, precipitation (P) and evaporation (E) are two critical processes connecting the land/ocean surface and overlying atmosphere (Trenberth et al., 2009). The difference between E and P rates, E −P , is the freshwater flux from the surface to the atmosphere, which is positive where E dominates and negative where P dominates. Over the global oceans, total E−P is positive, as a considerable amount of water evaporates from the oceans and is transported to land by 30 advection, mainly in the form of water vapor, where it precipitates. Averaged over a year, changes in atmospheric storage vanish and net negative E − P over land is balanced by continental runoff of water into the ocean. Although numerous studies have addressed the question of how variations in the ocean state affect the water cycle and freshwater fluxes with a particular view on global warming (Wentz et al., 2007;Trenberth et al., 2007;Schlosser and Houser , 2007;Robertson et al., 2014) :::::::::::::::::::::: (Wentz et al., 2007; Trenb a clear and consistent picture has yet to emerge -one of the significant challenges in climate science (Bony et al., 2015;Hegerl et al., 2014) of six data sets and putting the results into perspective regarding uncertainty estimates. Moreover, we investigate to what extent water budget closure is achieved by satellite-based over-ocean estimates by comparing with ERA5 data , and previously 60 published estimates of water cycle components.
Here, we consider the atmospheric water vapor budget with a focus on the oceans, where satellite observations of E are available. The net change in atmospheric water vapor content can be written as: :::: (1) With W the total column water vapor and ∇Q the moisture :::::: ∇ · (vq) ::: the :::::::: moisture :::: flux divergence, i.e., the amount of moisture removed by advection ::::::::: dynamical ::::::: transport : from the considered volume. See Table 2 for all symbols and abbreviations. Compared to water vapor, the contributions of liquid water and ice are very small (e.g., Berrisford et al. , 2011) and can be safely ignored in the context of this study.
On the global scale ∇Q :::::: ∇ · (vq) : vanishes (as the Earth is a closed system) and Eq. 1 reduces to: Where, for brevity, we write the W tendency during large (monthly) time steps as ∆W .
Assuming that ∆W is small compared to E and P , Eq. 2 dictates that global total E must equal global total P . Hence, an observed imbalance in global totals of E and P indicates either an inconsistency in E and P data sets or a change in the global water cycle, e.g. an increase in the amount of atmospheric water vapor (possibly caused by global warming), invalidating the assumption that ∆W is negligible. Moreover, globally, E and P co-vary, meaning that their inter-annual, seasonal, and even 75 monthly variability are correlated.
At regional scales and for monthly averages, ∆W is small compared to E − P and ∇Q :::::: ∇ · (vq), so that Eq. 1 can be approximated by: :::: (3) This is also ::::::::: particularly : valid for the large ocean and land regions, and since globally, ∇Q = 0 :::::::::: ∇ · (vq) = 0, from Eq. 3 it with subscripts denoting summation over ocean or land. This separation into land and ocean contributions allows us to assess the consistency of different E and P data sets, as satellite E data are not available over land.
In addition to the spatio-temporal distributions of individual budget terms, e.g., E − P , information on the accuracy and 85 precision of that value is of importance. Uncertainty estimates indicate whether observed differences -between data sets (e.g., observations and models), over time (trends, variability), or in space -are statistically relevant. Moreover, they play a major role in data assimilation. Quantification of retrieval uncertainty, however, is a difficult task, particularly for non-linear retrieval algorithms such as those used to retrieve E and P from satellite observations. Of the E CDRs investigated here, only HOAPS-4.0and : , OAFlux3, :::: and :::::::::: SEAFLUX3 : provide monthly mean uncertainty ranges. In HOAPS, random and systematic 90 uncertainty components are provided separately (Kinzel et al., 2016), allowing error propagation along with the calculation of temporal and/or spatial averages, as random errors (no covariance) disappear for large numbers of data points, whereas systematic errors (100% covariance) do not. For lack of information of error covariances, OAFlux3 ::: and ::::::::::: SEAFLUX3 monthly mean uncertainty is ::::::: estimates ::: are : similarly treated as having 100% covariance. An estimate of uncertainty is provided with ERA5 data in the form of results from ten separate model runs : a :::::::::: ten-member ::::::::: ensemble (Hersbach et al., 2020). 95 In the following section, we provide some background on E and P retrievals and introduce the E, P, and other data sets used for our study. Section 3 details the methods applied to the various data sets to enable a fair comparison. Results of our analyses are presented and discussed in Sections 4-5, and we close our study with a set of conclusions and recommendations.

Data sets
In this inter-comparison study, we assess the degree of agreement between five satellite-based E retrievals, two observation-100 based P retrievals, and a reanalysis data set. In this section, the retrieval algorithms will be briefly introduced: for more details, please refer to the literature listed in Table 1.
The retrieval of E from satellite observations is challenging. It is determined from the bulk flux parameters near-surface wind speed and humidity gradient near the surface. Wind speed can be retrieved from satellite passive microwave brightness temperature (BT) measurements and BTs have also some sensitivity to near-surface specific humidity. Specific humidity at 105 the ocean surface is derived from sea surface temperature (SST). All satellite-based E algorithms use reanalysis data to some extent and, vice versa, ERA5 also assimilates satellite data. Hence, these products cannot be considered completely independent and the distinction between "satellite data" and "reanalysis" is somewhat artificial and not always appropriate. However, for historical reasons -and for lack of a suitable alternative -we will retain these terms throughout this paper.
The main characteristics of the evaporation retrieval from passive microwave data are common to all satellite algorithms, 110 but there is quite some variation regarding the input of Level-1 (calibrated observations) and Level-2 (retrieval results) data, as will be discussed below. First, we will give a brief description of the retrieval basics, followed by details of the various satellite algorithms.

Evaporation Data Records
The liquid-water equivalent evaporation rate, E, is calculated from the latent heat flux Q l as follows : Where L E is specific :::: latent : heat of evaporation of water. The latent heat flux, in turn, is parameterized according to the bulk flux algorithm (based on the Monin-Obhukhov similarity theory representation of fluxes in terms of mean quantities): with ρ the density of air, C E the coefficient of turbulent exchange, u the wind speed at 10 m height relative to the ocean surface 120 current speed, and q s and q a the specific humidity at the sea surface and at 10 m height, respectively. Whereas q a and u are derived from satellite observations of BT, ρ, q s and L E are derived from their dependences on SST and/or air temperature. The turbulent exchange coefficient C E is obtained from the Coupled Ocean-Atmosphere Response Experiment (COARE) version 3.0 algorithm (Fairall et al., 1996(Fairall et al., , 2003. The algorithm iteratively estimates stability-dependent scaling parameters and wind gustiness to account for sub-scale variability.

125
Most of the data sets used here do not explicitly contain E, therefore, we calculated those from monthly means of Q l and SST using Eq. 5 and L E (in J/kg) given by (Henderson-Sellers, 1984): where T s is SST in K. The slight difference with the definition of L E used in the COARE-3.0 algorithm causes negligible differences of 0.03-0.04% for T s between 278-298 K.

130
The BT observations common to satellite-based retrievals of ocean turbulent fluxes come from the Special Microwave Imager (SSM/I, Hollinger et al., 1990) and Special Microwave Imager/Sounder (SSMIS, Kunkee et al., 2008) instruments on the Defense Meteorological Satellite Program (DMSP) platforms F08-F18. These data were corrected and inter-calibrated using various approaches to create FCDRs, stable fundamental climate data records (see, e.g., Wentz et al., 2013;Sapiano et al., 2013;Berg et al., 2018;Fennig et al., 2020), which then serve as input to various satellite retrievals. Slight differences 135 in calibration approaches lead to differences in FCDRs that propagate into the retrieved data. Issues with sensor stability, especially with SSM/I and SSMIS sensors, usually express themselves as slow drifts or sudden jumps of the global mean.
2.1.1 HOAPS-4.0 HOAPS (Andersson et al., 2010) relies almost completely on satellite data, as it only uses an ERA-interim profile climatology as a priori starting point for the 1D-Var retrieval of u and the humidity profile (Graw et al., 2017). The only other auxiliary 140 data set is the daily Optimum Interpolated Sea Surface Temperature (OISST, Reynolds et al. (2007)), version 2, derived from AVHRR satellite data. OISST provides a bulk SST at ::: SST :: at :: a ::::: depth :: of : 0.5 m which is transformed to a skin SST using the approach by Donlon et al. (2002), which is then used for the determination of q s . The parameterization described in Bentamy et al. (2003) is used to determine q a . For calculation of the flux parameters Q l and E, HOAPS-4.0 uses COARE version 2.6a (Bradley et al., 2000), which is nearly identical to COARE-3.0 (Fairall et al., 2003). HOAPS-4.0 is a CDR derived from CM 145 SAF (Climate Monitoring Satellite Application Facility) BT FCDR (Fennig et al., 2017(Fennig et al., , 2020 and is available at 0.5 • and 6-hourly (except E-P) and monthly resolution from July 1987 -December 2014 (Andersson et al., 2017). HOAPS data can be obtained from https://wui.cmsaf.eu.

J-OFURO3
The latest update to J-OFURO involved improvements in the methods of flux retrieval and expansion of the data set in terms of 150 time range and parameters (Tomita et al., 2019). The algorithm is similar to that described above. In addition to BT from SSM/I and SSMIS (from Remote Sensing Systems (RSS), Wentz et al., 2013), J-OFURO3 uses BT data from AMSR-E and AMSR2 (JAXA Version 3 and 2.1, respectively), and TMI (1B11 Version 7 from NASA-GESDISC) for the retrieval of flux parameters.
To determine q a a parameterization based on BTs, total column water vapor, and water vapor scale height was developed using match-ups of in situ buoy-and ship-based q a and DMSP-F13 BTs from eight channels (Tomita et al., 2018). From the 155 instantaneous q a values, gridded daily averages are determined and inter-calibrated to DMSP-F13 q a to remove systematic differences caused by the use of different FCDRs. The T s required for the calculation of q s and other flux parameters is the median value of an ensemble of twelve in situ, satellite-based, and reanalysis data sets. Other auxiliary data sets include water vapor surface mixing ratios from ERA-interim (Dee et al., 2011), OSTIA sea ice concentration (Donlon et al., 2012), and air temperature from NCEP/DOE reanalysis (Kanamitsu et al., 2002). Near-surface wind speed is determined as the simple mean 160 of values derived from microwave radiometers and scatterometers (Tomita et al., 2019). J-OFURO3 is available at 0.25 • and daily resolution from 1988-2013. It was acquired from https://j-ofuro.scc.u-tokai.ac.jp/.

OAFlux3
Satellite data used for the production of OAFlux3 data include wind speed from active (scatterometer) and passive (radiometer) microwave instruments, SST from OISST (Reynolds et al., 2007), and q a from Goddard Satellite-Based Surface Turbulent

165
Fluxes Dataset -Version 2 : ; :: 2c : (GSSTF2.0, ::::::::::::::: Chou et al. (2003), : Shie et al. (2009)). These are merged with NCEP and ERA40 reanalysis data using weighting factors that put more emphasis on satellite data (for u), on reanalyses (q a ), or weights both equally (T s ), whenever satellite data are available (Yu et al., 2008) Similar to J-OFURO and OAFlux, IFREMER's ocean flux retrieval algorithm is based on a synergy of remote sensing and reanalysis data (Bentamy et al., 2013). The current version 4.1 contains, among others, latent heat flux (LHF) and SST at daily and monthly, 0.25 • resolution from 1992-2018. The BTs used for retrievals are inter-calibrated by Colorado State University (CSU, Sapiano et al., 2013), except for data beyond June 2017, where CSU data ends and a switch to BTs from RSS (Wentz et al., 2013) is made. Inter-calibrated scatterometer wind data (Bentamy et al., 2017a) are supplemented by wind speeds 175 determined by RSS from the SSM/I, SSMIS, and WindSat instruments. SST are from OISST (Reynolds et al., 2007). The model relating BTs to q a using satellite -in situ data match-ups was updated from Bentamy et al. (2003) and now includes two additional terms: T s and T a − T s (with T a the air temperature at 10 m height from interpolated ERA-interim data (Bentamy et al., 2013)). IFREMER4.1 data were obtained via https://wwz.ifremer.fr/oceanheatflux/Data .

ERA5
ERA5 is the current operational reanalysis running at ECMWF, the European Centre for Medium-range Weather Forecasts.

Precipitation Data Records
Microwave-based retrievals of precipitation are based on the interaction of liquid or solid hydrometeors with the upwelling radiation field. In HOAPS-4.0, P is determined by an NN retrieval trained on profiles from an ERA-interim climatology (Andersson et al., 2010). The training data set consists of one month (August 2004) of assimilated SSM/I BTs and the corresponding ERA-interim P (Bauer et al., 2006).

230
There is a multitude of global precipitation products in existence (see, e.g., Kidd and Huffman, 2011;Tapiador et al., 2017), but for this study we selected GPCP as the P data set with which to calculate E-P (except for the HOAPS product, which makes us ::: use of its own P data) because it is generally regarded as the data set that performs best globally. Moreover, J-OFURO also makes use of GPCP P to determine E-P (Tomita et al., 2019).
The Global Precipitation Climatology Project -1 Degree Daily (GPCP-1DD; denoted GPCP hereafter), contains P esti-235 mated from a combination of data from ground-based rain gauges and satellites -the latter including near-infrared, passive and active microwave observations (Huffman et al., 2001). Daily global precipitation rates are provided by GPCP-1DD at 1 • resolution for the time range 1996-2017. We calculate monthly mean P from version 1.3 GPCP-1DD (GPCP, 2018), because the spatial resolution of the monthly product is not sufficient for our purposes. These data were obtained from https://rda.ucar.edu/datasets/ds728.5/.

Errors, biases, uncertainty
Only three :::: Four out of seven data sets analyzed here contain explicit information on uncertainty. HOAPS contains estimates of random and systematic bias errors (Kinzel et al., 2016;Liman et al., 2018). The errors in E were obtained by separating biases of HOAPS Level-2 E with respect to collocated in situ ship-based data into equally populated E, u, T s , and W bins. The mean and standard deviation of the biases are assumed to represent the systematic and random components of the 2σ uncertainty 245 range, respectively, which is probably a conservative estimate. By taking the approach of determining uncertainty ranges as a function of turbulent flux parameters these can also be assigned to times and regions not covered by the ship-based reference data set (Liman et al., 2018). For the current study, we calculated the mean uncertainty by averaging the systematic uncertainty component. The random component is negligible when averaging long time series. The HOAPS P data set does not contain uncertainty information; instead, a constant relative 1σ uncertainty range of 13% was assumed, based on a comparison with 250 ship-based in situ :: in ::: situ data (Burdanowitz, 2017). The total E − P uncertainty was determined by error propagation.
In contrast to uncertainty ranges estimated by comparing with other (e.g., in situ) data sets, the uncertainty of ERA5 data is described by the standard deviation and the range of the ensemble, consisting of 10 seperate model :::::::: reanalysis : runs (Hersbach et al., 2020). We determined these statistics after averaging of the data: first, the mean (e.g., global monthly mean) of each 270 individual ensemble member was calculated, then standard deviation and range were determined. Note that ERA5 ensemble statistics should be interpreted in a relative sense (i.e., ensemble spread is larger where uncertainty is higher), as the numerical values are over-confident (ECMWF, 2020).
All comparisons presented here are performed with collocated data, i.e., only grid boxes (at x, y, and t) present in all data sets were used to create climatological or global averages. A more accurate collocation procedure would be performed at shorter, e.g., daily, time scale, because differences in filtering of high-precipitation scenes (where E retrieval is impaired) and selection of included satellite instruments lead to differences in sub-monthly sampling. This was, however, not feasible in this study, as 290 HOAPS and J-OFURO E-P data are only provided on monthly resolution.
The satellite reference data set used in regional comparisons is determined by the statistical median of the satellite-based data ensemble and therefore does not include ERA5. The median is chosen over the mean to exclude outliers. In the following, this reference data set is abbreviated SEM (satellite ensemble median).
Regions where mean P > E are dominated by atmospheric freshwater outflux (into the ocean), shown in blue, and are concentrated at the inter-tropical convergence zone (ITCZ) and the Pacific warm pool. In the subtropics, E generally outweighs P . At higher latitudes P and E are approximately equal, but with a tendency to E − P < 0. Comparison of panels C-F with 310 A and B :: c-f :::: with : a :::: and :: b shows that the E-P pattern is mainly determined by P , as there is less spatial variation in E :: in ::: the :::::: tropical ::: and ::::::::::: high-latitude ::::::: regions, ::: but ::::::::: determined ::: by : E :: in ::: the :::::::::: subtropical :::::: regions. The agreement between SEM and ERA5 E-P climatologies is good, yet, some systematic differences can be observed. Due to higher P at : in : the ITCZ, ERA5 shows more negative E − P there. Conversely, the overall higher E level in ERA5 causes E − P values larger than those found for SEM over most of the global oceans. Excessive E was also found to produce high E-P in ERA-interim (Brown and Kummerow,315 2014).
The negative deviations to the east and west of Australia are similarly ::: also : due to differences in P , whereas the deviations at latitudes < 40 • ::::: > 40 • S are due in equal parts to E and P . The differences between OAFlux-G and ERA5 are mainly due to P , apart from the regions in the subtropical Pacific and Atlantic Oceans, where OAFlux E is smaller than ERA5 E. ::::::::::: SEAFLUX-G

335
To investigate where the differences are significant, the right column of Fig. 2 presents the 1σ uncertainty range from HOAPS (upper panel)and , : OAFlux-G ( :::::: middle :::::: panel), ::: and :::::::::::: SEAFLUX-G : (lower panel). Moreover, regions where the difference between satellite E-P ::::: E − P : and ERA5 E-P :::::: E − P are greater than the 2σ uncertainty range are enclosed by white contour lines in the left panels. The ERA5 E-P uncertainty shows a pattern similar to that of OAFlux-G, but is a factor of 10 smaller than the uncertainties estimated for satellite data and therefore adds a negligible component to the total uncertainty estimate. The
Panel B : b : shows that the seasonal cycle of global ocean mean P is shallow and the two satellite-based data sets agree within the GPCP uncertainty for ten months in :: of : the year. The trough in March and April in the HOAPS climatology 365 is caused by the strong sensitivity to the ENSO (El Niño -Southern Oscillation) phase and is a known characteristic of HOAPS data (see, e.g., Andersson et al., 2011;Masunaga et al., 2019). It is most apparent in panel E, where the Niño 3.4 SST index (Trenberth and Stepaniak, 2001) is plotted in gray bars along with P anomalies: HOAPS P correlates with Niño 3.4 if a lag of 4 months is taken into account. Like for E, we find substantial biases ::::::::: differences among the three P data sets: compared to GPCP, HOAPS has a bias :::: there :: is :: a :::::::: deviation : of about −0.1 mm d −1 , ::::::: between ::::::: HOAPS :::: and :::::: GPCP, same amount, E − P is close to the satellite data(except SEAFLUX-G). SEAFLUX-G shows the smallest E − P throughout the year and the global means come close to 0 mm d −1 in March/April. In contrast, : . HOAPS yields the highest E − P due to its low mean P . All E-P data are contained within the HOAPS and OAFlux uncertainty ranges.

4.3
Inter-comparison of freshwater flux over ocean: time series on regional scales In this section, we investigate the temporal correlation of water cycle components on regional scales. This approach will help 400 to understand differences between the various data sets by uncovering in which regions the differences are particularly large (or small). As a reference for the E and E-P comparisons, we use SEM, a data set determined by the statistical median of all satellite data sets. Since we use only two satellite P data sets, GPCP is selected as a reference for the P comparison. We determine correlation coefficient, slope and intercept of the linear regression (y = ax + b) between 1 • x 1 • monthly means (not anomalies) of each data set, y, and the reference, x, to examine where estimates are most consistent.

405
The results are shown for all six E data sets in Fig. 4, where the left column displays the correlation coefficient. On the top row, HOAPS yields R 2 > 0.75 over most of the globe, with some notable exceptions at :: in the ITCZ and the Peruvian coast. The other satellite data yield higher correlation coefficientsin general, except for SEAFLUX at latitudes > 40 • , where R 2 decreases to 0.5 in the North and below 0.25 in the Southern Ocean. The correlation pattern of ERA5 with SEM is similar to that found for HOAPS, although the tropical areas with R 2 < 0.75 are not at the same locations. The highest overall correlation with SEM 410 is found for J-OFURO ::: and :::::::::: SEAFLUX, with R 2 exceeding 0.75 essentially everywhere.
The patterns in the middle panels are nearly all mirrored in the right panels, i.e., wherever large values are overestimated (a > 1), small values are underestimated (b < 0), and vice versa. All data sets thus appear to agree on intermediate values.
For E-P, the results of the regression analysis are shown in Fig. 6. The highest correlation coefficients (and slopes and 440 intercepts closest to 1 and 0 mm d −1 ) are found among the data sets calculated with GPCP P. This shows that most of the variability in E-P is due to differences in P. Since GPCP P is used in four out of five data sets included in the SEM, those data sets show high correlations, whereas HOAPS and ERA5 yield patterns very similar to those found for the P comparison in Fig. 5. Nevertheless, both for ERA5 and HOAPS the correlation in most of the tropics is higher for E-P than for P. In summary, the correlation patterns for HOAPS and ERA5 indicate agreement on the seasonal cycle in the tropics, a result found previously 445 by Brown and Kummerow (2014), although we find that its amplitude is reduced in the GPCP-based E-P data (Fig. 3). Less agreement is found in the Southern Oceans, where GPCP-based E−P are underestimated relative to SEM. In the mid-latitudes, the regression with SEM yields slopes near 1 and intercepts close to 0 mm d −1 for ERA5 and HOAPS, but the correlation is less than in the tropics, probably due to the smaller dynamic range of E-P.
from E and P. It can, however, be argued that VIMD from reanalysis is a more reliable quantity than reanalysis E-P, since VIMD is calculated from the state variables wind and water vapor, whereas E and P are model-physics derived (e.g., Trenberth et al., 2011)). We verified that in ERA5, the agreement between E − P and ∇Q ::::::: ∇ · (vq) is generally good, as shown in Appendix B.

Examination of the water budget in ERA5
One way of investigating the consistency of different water cycle components is determining if the global water budget (Eq. 1) is closed. However, satellite E-P data sets are available over ocean only, so we revert to a comparison with gap-free reanalysis data. There is no internal constraint for budget closure in ERA reanalyses (Berrisford et al. , 2011;Hersbach et al., 2020), and as the budget was not closed in ERA-interim, it is worthwhile to investigate ERA5's behavior in this regard. Monthly mean 460 total ERA5 E − P over the globe, the ocean, and land are shown in Fig. 7 in black, blue, and green, respectively. The mean values over the globe and land were scaled by their surface area relative to the ocean surface area (i.e., they were multiplied by 510/350 and 160/350, respectively) to obtain consistency with the over-ocean means shown in Fig. 3. The error bars on ERA5 data depict the standard deviation of the 10-member ensemble. Nearly all of the uncertainty in global mean E-P is due to the uncertainty over ocean; the error bars on the over-land E-P are smaller than the graph's line width. This is due in equal parts 465 to E and P, which have similar ensemble standard deviations (not shown). For the time range shown in Fig. 7 global E − P is seen to oscillate around 0 mm d −1 , meaning that the ERA5 water budget is closed on a yearly time scale (in agreement with the findings by Hersbach et al., 2020). The seasonal cycle is mainly driven by increased evapo-transpiration of vegetation on land and peaks in northern hemispheric summer due to the larger fraction of land in the Northern Hemisphere. Precipitation shows a similar seasonal cycle over land, but does not completely cancel out in E-P due to a slight phase shift with respect to 470 the E seasonal cycle (not shown). Figure 7 shows that monthly means of global E−P and ∆W (light blue line) display a high degree of coherence, as expected from Eq. 2. This is an indication that the (atmospheric) water cycle is well represented in ERA5.

Examination of the water budget in satellite data sets
Globally, E − P is equal to ∆W (Eq. 2 and Fig. 7), and because ∆W is two orders of magnitude smaller than E and P , global mean E should necessarily be almost equal to global mean P . This seemingly trivial finding provides us with a tool 485 to investigate the consistency of E and P data sets: by determining how well they correlate. For ERA5, global mean E and P yield correlation coefficients R 2 = 0.82 and R 2 = 0.84 for monthly and yearly means, respectively. This procedure cannot be applied to the satellite E-P data considered here, as they contain values over ocean only. Since there is a substantial seasonality in water vapor transport (Fig. 7), the correlation between ocean-only E and P is expected to be much lower. A regression of ERA5 E ocean and P ocean monthly means (where the subscript ocean indicates averaging over ocean only) indeed yields 490 only R 2 = 0.42, increasing to : or : R 2 = 0.57 for yearly means. To account for the net transport of water from ocean to land, we include ∇Q ocean ::::::::::: ∇ · (vq) ocean into the analysis, and, applying Eq. 4, correlate E ocean − ∇Q ocean ::::::::::::::::::: E ocean − ∇ · (vq) ocean with P ocean . For ERA5, the resulting R 2 are 0.86 for yearly and monthly means: very similar to the coefficients found for the correlation of global E with P .
We calculated correlation coefficients of the various E and P data sets used in this study, combined with ERA5 ∇Q ocean ::::::::::: ∇ · (vq) ocean , 495 and listed them in Table 3. The analyses were performed separately on (i) monthly means, primarily indicating agreement on the seasonal cycle; (ii) yearly means, a measure of consistency of inter-annual variability, including trends; and (iii) monthly anomalies, focused on short-scale variability.
The remaining satellite data sets are not significantly correlated on a yearly time scale (p-value > 0.05), and for SEAFLUX-G 510 E ocean − ∇Q ocean is smaller than GPCP P ocean . Clearly, time series longer than the 17 years investigated here would benefit the analysis of yearly mean data.
Overall, this analysis shows that satellite-based estimates of E are less consistent with satellite-based P data than ERA5 E and P. To a certain degree, this is expected, as the three variables used in the analysis come from different sources (e.g., E from IFREMER, P from GPCP, and ∇Q ocean ::::::::::: ∇ · (vq) ocean from ERA5), each with its own sampling and uncertainty characteristics.

Estimates of global total water cycle components
From the data shown in the previous sections we calculated climatological (1997-2013) global ocean total E and P , total E − P (separated into land and ocean contributions), runoff and net transport. The latter is equal to the total over-ocean (or over-land) ∇Q :::::: ∇ · (vq). Our results are given in the three upper rows of Table 4. For brevity, we again denote total E, P , and 520 E − P over ocean as E ocean , P ocean , and (E − P ) ocean respectively, and similarly, over-land E − P as (E − P ) land . For better comparison with earlier estimates, values are given in units of 10 3 km 3 yr −1 .
The same reasoning applies to the other satellite data sets with similar effects on E ocean .
The spread in P ocean is of the same magnitude as that found for E ocean : HOAPS yields 335 ± 44 · 10 3 km 3 yr −1 , GPCP in agreement with ERA5 R and ∇Q land :::::::::: ∇ · (vq) land . The consistency between runoff and net transport seen in the last four rows of Table 4 is mainly by construction, as both are usually required (or defined) to be equal.

Discussion
We present an inter-comparison of five recent satellite-based E-P data sets. All five E data sets are the latest official versions of CDRs generated from (different) BT FCDRs and are combined with GPCP (or HOAPS) P CDR to form E-P data.
Of the E data sets used in this study, HOAPS depends on the least amount of model data, using these on a climatological (as 600 opposed to collocated, instantaneous) basis. ERA5, being a model reanalysis, represents the other extreme, and the remaining retrieval algorithms are somewhere in between. All algorithms, including ERA5 physics, rely on the same parameterization of bulk fluxes (Eq. 6) and on the COARE algorithm for the determination of the turbulent exchange coefficient (see Sect. 2).
The origin of E differences between various data sets must therefore lie with the bulk flux parameters u, q a , and q s , and with differences in sampling characteristics. A recent study by Roberts et al. (2019) showed that HOAPS, SEAFLUX, and 605 J-OFURO retrieve global mean q a that are systematically too small compared to in situ ship-based NOCSv2 data (Berry and Kent, 2011); IFREMER slightly overestimates q a . This difference could be largely improved by applying a correction based on the subtraction of a regime-dependent bias (in which regimes are defined by their water vapor vertical stratification, cloud liquid water content, and SST; and the bias determined with respect to NOCSv2). The HOAPS algorithm determines Q l (and E) systematic error estimates in a similar fashion: biases with respect to ship-based data were binned by Q l , u, T s , and W , 610 then collected into a 4-dimensional look-up-table (Kinzel et al., 2016;Liman et al., 2018). Subtracting the systematic error from HOAPS Q l (or E) would raise the global mean and improve the agreement with ship-borne data sets such as NOCSv2.
Initial tests show that this is, indeed, the case for Q l . This , :::: and is the topic of a forthcoming study. We stress, however, that reducing biases with respect to reference (e.g., in situ) data by improving the retrieval algorithm through better understanding of physical processes should be the preferred way forward.

635
The inter-comparison of global P data sets is the topic of a range of papers, but since the validation of P is difficult due to its inherent variability and the lack of sufficient in situ data -particularly over ocean -judging which algorithm performs best under which circumstances is a complicated task (Kidd and Huffman, 2011;Gehne et al., 2016;Tapiador et al., 2017).
Data availability. All presented data sets are freely available from the cited websites.
Appendix A: Difference climatologies 690 All difference maps of satellite-based E-P, E, and P climatologies with collocated ERA5 data are shown in Fig. A1. The maps in the left column are very similar (apart from HOAPS) because the E-P deviations are dominated by P and the same GPCP data were used to generate all E-P data sets (except HOAPS).