Daily evaluation of 26 precipitation datasets using Stage-IV gauge-radar data for the CONUS

New precipitation (P) datasets are released regularly, following innovations in weather forecasting models, satellite retrieval methods, and multi-source merging techniques. Using the conterminous US as a case study, we evaluated the performance of 26 gridded (sub-)daily P datasets to obtain insight into the merit of these innovations. The evaluation was performed at a daily timescale for the period 2008–2017 using the Kling-Gupta Efficiency (KGE), a performance metric combining correlation, bias, and variability. As a reference, we used the high-resolution (4 km) Stage-IV gauge-radar P dataset. Among the three KGE components, the P datasets performed worst overall in terms of correlation (related to event identification). In terms of improving KGE scores for these datasets, improved P totals (affecting the bias score) and improved distribution of P intensity (affecting the variability score) are of secondary importance. Among the 11 gauge-corrected P datasets, the best overall performance was obtained by MSWEP V2.2, underscoring the importance of applying daily gauge corrections and accounting for gauge reporting times. Several uncorrected P datasets outperformed gauge-corrected ones. Among the 15 uncorrected P datasets, the best performance was obtained by the fourth-generation reanalysis ERA5-HRES, reflecting the significant advances in earth system modeling during the last decade. The (re)analyses generally performed better in winter than in summer, while the opposite was the case for the satellite-based datasets. IMERGHH V05 performed substantially better than TMPA-3B42RT V7, attributable to the many improvements implemented in the IMERG satellite P retrieval algorithm. IMERGHH V05 outperformed ERA5-HRES in regions dominated by convective storms, while the opposite was observed in regions of complex terrain. The ERA5-EDA ensemble average exhibited higher correlations than the ERA5-HRES deterministic run, highlighting the value of ensemble modeling. The regional convection-permitting climate model WRF showed considerably more accurate P totals over the mountainous west and performed best among the uncorrected datasets in terms of variability, suggesting there is merit in using high-resolution models to obtain climatological P statistics. Our findings provide some guidance to choose the most suitable P dataset for a particular application.


1 Introduction
Knowledge about the spatio-temporal distribution of precipitation (P) is important for a multitude of scientific and operational applications, including flood forecasting, agricultural monitoring, and disease tracking (Tapiador et al., 2012; Kucera et al., 2013; Kirschbaum et al., 2017). However, P is highly variable in space and time and therefore extremely challenging to estimate, especially in topographically complex, convection-dominated, and snowfall-dominated regions (Stephens et al., 2010; Tian and Peters-Lidard, 2010; Herold et al., 2016; Prein and Gobiet, 2017). In the past decades, numerous gridded P datasets have been developed, differing in terms of design objective, spatio-temporal resolution and coverage, data sources, algorithm, and latency (see Tables 1 and 2 for an overview of quasi and fully global datasets).

H. E. Beck et al.: Daily evaluation of 26 precipitation datasets for the CONUS. Published by Copernicus Publications on behalf of the European Geosciences Union.
A large number of regional-scale studies have evaluated gridded P datasets to obtain insight into the merit of different methods and innovations (see reviews by Gebremichael, 2010, Maggioni et al., 2016, and Sun et al., 2018). However, many of these studies (i) used only a subset of the available P datasets, and omitted (re)analyses, which have higher skill in cold periods and regions (Huffman et al., 1995; Ebert et al., 2007; Beck et al., 2017c); (ii) focused on a small (subcontinental) region, limiting the generalizability of the findings; (iii) considered a small number (< 50) of rain gauges or streamflow gauging stations for the evaluation, limiting the validity of the findings; (iv) used gauge observations already incorporated into the datasets as a reference without explicitly mentioning this, potentially leading to a biased evaluation; and (v) failed to account for gauge reporting times, possibly resulting in spurious temporal mismatches between the datasets and the gauge observations.
In an effort to obtain more generally valid conclusions, we recently evaluated 22 (sub-)daily gridded P datasets using gauge observations (∼75 000 stations) and hydrological modeling (∼9000 catchments) globally (Beck et al., 2017c). Other noteworthy large-scale assessments include Tian and Peters-Lidard (2010), who quantified the uncertainty in P estimates by comparing six satellite-based datasets, Massari et al. (2017), who evaluated five P datasets using triple collocation at the daily timescale without the use of ground observations, and Sun et al. (2018), who compared 19 P datasets at daily to annual timescales. These comprehensive studies highlighted (among other things) (i) substantial differences among P datasets and thus the importance of dataset choice; (ii) the complementary strengths of satellite and (re)analysis P datasets; (iii) the value of merging P estimates from disparate sources; (iv) the effectiveness of daily (as opposed to monthly) gauge corrections; and (v) the widespread underestimation of P in mountainous regions.
Here, we evaluate an even larger selection of (sub-)daily (quasi-)global P datasets for the conterminous US (CONUS), including some promising recently released datasets: ERA5 (the successor to ERA-Interim; Hersbach et al., 2018), IMERG (the successor to TMPA; Huffman et al., 2014, 2018), and MERRA-2 (one of the few reanalysis P datasets incorporating daily gauge observations; Gelaro et al., 2017; Reichle et al., 2017). In addition, we evaluate the performance of a regional convection-permitting climate model (WRF; Liu et al., 2017). As a reference, we use the high-resolution, radar-based, gauge-adjusted Stage-IV P dataset (Lin and Mitchell, 2005) produced by the National Centers for Environmental Prediction (NCEP). As a performance metric, we adopt the widely used Kling-Gupta efficiency (KGE; Gupta et al., 2009; Kling et al., 2012). We shed light on the strengths and weaknesses of different P datasets and on the merit of different technological and methodological innovations by addressing 10 pertinent questions.
1. What is the most important factor determining a high KGE score?
2. How do the uncorrected P datasets perform?
3. How do the gauge-based P datasets perform?
4. How do the P datasets perform in summer versus winter?
5. What is the impact of gauge corrections?
6. What is the improvement of IMERG over TMPA?
7. What is the improvement of ERA5 over ERA-Interim?
8. How does the ERA5-EDA ensemble average compare to the ERA5-HRES deterministic run?
9. How do IMERG and ERA5 compare?
10. How well does a regional convection-permitting climate model perform?
2 Data and methods

2.1 P datasets
Table 1. Overview of the 15 uncorrected (quasi-)global (sub-)daily gridded P datasets evaluated in this study. The 11 gauge-corrected datasets are listed in Table 2. Abbreviations in the data source(s) column defined as S, satellite; R, reanalysis; A, analysis; and M, regional climate model. The abbreviation NRT in the temporal coverage column stands for near real time. In the spatial coverage column, "Global" means fully global coverage including oceans, while "Land" means that the coverage is limited to the terrestrial land surface.

We evaluated the performance of 26 gridded (sub-)daily P datasets (Tables 1 and 2). All datasets are either fully or near global, with the exception of WRF, which is limited to the CONUS. The datasets are classified as either uncorrected, which implies that temporal variations depend entirely on satellite and/or (re)analysis data, or corrected, which implies that temporal variations depend to some degree on gauge observations. We included seven datasets exclusively based on satellite data (CMORPH V1.0, GSMaP-Std V6, IMERGHHE V05, PERSIANN, PERSIANN-CCS, SM2RAIN-CCI V2, and TMPA-3B42RT V7), six fully based on (re)analyses (ERA-Interim, ERA5-HRES, ERA5-EDA, GDAS-Anl, JRA-55, and NCEP-CFSR, although ERA5 assimilates radar and gauge data over the CONUS), one incorporating both satellite and (re)analysis data (CHIRP V2.0), and one based on a regional convection-permitting climate model (WRF). Among the gauge-based P datasets, six combined gauge and satellite data (CMORPH-CRT V1.0, GPCP-1DD V1.2, GSMaP-Std Gauge V7, IMERGDF V05, PERSIANN-CDR V1R1, and TMPA-3B42 V7), one combined gauge and reanalysis data (WFDEI-GPCC), three combined gauge, satellite, and (re)analysis data (CHIRPS V2.0, MERRA-2, and MSWEP V2.2), while one was fully based on gauge observations (CPC Unified V1.0/RT). For transparency and reproducibility, we report dataset version numbers throughout the study for the datasets for which this information was provided. For the P datasets with a sub-daily temporal resolution, we calculated daily accumulations for 00:00-23:59 UTC. P datasets with spatial resolutions < 0.1° were resampled to 0.1° using bilinear averaging, whereas those with spatial resolutions > 0.1° were resampled to 0.1° using bilinear interpolation.
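The daily accumulation step for sub-daily datasets can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the function name `daily_accumulation` and the example values are hypothetical, and the input is assumed to hold precipitation amounts per time step (not rates) with the first step starting at 00:00 UTC.

```python
def daily_accumulation(amounts, steps_per_day):
    """Sum sub-daily precipitation amounts into 00:00-23:59 UTC daily totals.

    amounts: precipitation amount per time step (not a rate); the first
        step is assumed to start at 00:00 UTC.
    steps_per_day: number of time steps per day (e.g., 24 for hourly data).
    Trailing steps that do not fill a complete day are dropped.
    """
    n_days = len(amounts) // steps_per_day
    return [
        sum(amounts[d * steps_per_day:(d + 1) * steps_per_day])
        for d in range(n_days)
    ]


# Two days of hypothetical 3-hourly amounts (mm) for a single grid cell.
three_hourly = [0.0, 0.0, 1.5, 2.0, 0.0, 0.0, 0.0, 0.5,
                0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
daily = daily_accumulation(three_hourly, steps_per_day=8)
print(daily)  # [4.0, 4.0]
```

For datasets that store rates (e.g., mm/h), each value would first need to be multiplied by the step length before summing.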

2.2 Stage-IV gauge-radar data
As a reference, we used the NCEP Stage-IV dataset, which has a 4 km spatial and hourly temporal resolution and covers the period 2002 until the present, and merges data from 140 radars and ∼5500 gauges over the CONUS (Lin and Mitchell, 2005). Stage-IV provides highly accurate P estimates and has therefore been widely used as a reference for the evaluation of P datasets (e.g., Hong et al., 2006; Habib et al., 2009; AghaKouchak et al., 2011, 2012; Nelson et al., 2016; Zhang et al., 2018b). Daily Stage-IV data are available, but they represent an accumulation period that is incompatible with the datasets we are evaluating (12:00-11:59 UTC instead of 00:00-23:59 UTC). We therefore calculated daily accumulations for 00:00-23:59 UTC from 6-hourly Stage-IV accumulations. The Stage-IV dataset was reprojected from its native 4 km polar stereographic projection to a regular geographic 0.1° grid using bilinear averaging. The Stage-IV dataset is a mosaic of regional analyses produced by 12 CONUS River Forecast Centers (RFCs) and is thus subject to the gauge correction and quality control performed at each individual RFC (Westrick et al., 1999; Smalley et al., 2014; Eldardiry et al., 2017). To reduce systematic biases, the Stage-IV dataset was rescaled such that its long-term mean matches that of the PRISM dataset (Daly et al., 2008) for the evaluation period (2008–2017). To this end, the PRISM dataset was upscaled from ∼800 m to 0.1° using bilinear averaging. The PRISM dataset has been derived from gauge observations using a sophisticated interpolation approach that accounts for topography. It is generally considered the most accurate monthly P dataset available for the US and has been used as a reference in numerous studies (e.g., Mizukami and Smith, 2012; Prat and Nelson, 2015; Liu et al., 2017). However, the dataset has not been corrected for wind-induced gauge undercatch and thus may underestimate P to some degree (Groisman and Legates, 1994; Rasmussen et al., 2012).
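The rescaling of Stage-IV to the PRISM long-term mean can be illustrated with a short sketch. It assumes a multiplicative correction applied per grid cell, which is one plausible reading of the text and not a confirmed description of the authors' procedure; the function name `rescale_to_reference_mean` and the numbers are hypothetical.

```python
def rescale_to_reference_mean(daily_series, reference_mean):
    """Scale a daily series so that its long-term mean equals reference_mean.

    Assumes a single multiplicative factor per grid cell (an assumption made
    for illustration). All-dry cells are returned unchanged.
    """
    series_mean = sum(daily_series) / len(daily_series)
    if series_mean == 0.0:
        return list(daily_series)
    factor = reference_mean / series_mean
    return [value * factor for value in daily_series]


# Hypothetical grid cell: Stage-IV mean of 2.0 mm/day, PRISM mean of 2.5 mm/day.
stage_iv = [0.0, 2.0, 0.0, 6.0]
rescaled = rescale_to_reference_mean(stage_iv, reference_mean=2.5)
print(rescaled)  # [0.0, 2.5, 0.0, 7.5]
```

Under this assumption the rescaling changes only the mean of the reference. Day-to-day timing and the coefficient of variation are unaffected, so it influences the bias component of the KGE but not the correlation or variability components.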

2.3 Evaluation approach
The evaluation was performed at a daily temporal and 0.1° spatial resolution by calculating, for each grid cell, KGE scores from daily time series for the 10-year period from 2008 to 2017. KGE is an objective performance metric combining correlation, bias, and variability. It was introduced in Gupta et al. (2009) and modified in Kling et al. (2012) and is defined as follows:

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\beta - 1)^2 + (\gamma - 1)^2}, \tag{1}$$

where the correlation component r is represented by (Pearson's) correlation coefficient, the bias component β by the ratio of estimated and observed means, and the variability component γ by the ratio of the estimated and observed coefficients of variation:

$$\beta = \frac{\mu_s}{\mu_o}, \tag{2}$$

$$\gamma = \frac{\sigma_s / \mu_s}{\sigma_o / \mu_o}, \tag{3}$$

where µ and σ are the distribution mean and standard deviation, respectively.

3.1 What is the most important factor determining a high KGE score?
[…] The datasets thus performed considerably worse in terms of correlation, which makes sense given that long-term climatological P statistics are easier to estimate than day-to-day P dynamics. Due to the squaring of the three components in the KGE equation (see Eq. 1), the correlation values exert the dominant influence on the final KGE scores. Indeed, the performance ranking in terms of KGE corresponds well to the performance ranking in terms of correlation (Fig. 2). These results suggest that in order to get an improved KGE score, the most important component score to improve is the correlation. This in turn suggests that, for existing daily P datasets, improvements to the timing of P events at the daily scale (dominating the correlation scores) are more valuable than improvements to P totals (dominating bias scores) or the intensity distribution (dominating variability scores).
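As a minimal sketch of the metric, assuming the usual form of the modified KGE of Kling et al. (2012), KGE = 1 − sqrt((r − 1)² + (β − 1)² + (γ − 1)²), the components and the score can be computed as below. The function names are illustrative, and the three component values are the ERA5-HRES medians quoted in Sect. 3.8, used here only to show how the squared deviations compare.

```python
import math


def kge_components(est, ref):
    """Return (r, beta, gamma) for paired estimated and reference series."""
    n = len(ref)
    mu_s, mu_o = sum(est) / n, sum(ref) / n
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in est) / n)
    sd_o = math.sqrt(sum((o - mu_o) ** 2 for o in ref) / n)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(est, ref)) / n
    r = cov / (sd_s * sd_o)                 # Pearson correlation
    beta = mu_s / mu_o                      # ratio of means (bias)
    gamma = (sd_s / mu_s) / (sd_o / mu_o)   # ratio of coefficients of variation
    return r, beta, gamma


def kge(r, beta, gamma):
    """Kling-Gupta efficiency from its three components (optimum at 1)."""
    return 1.0 - math.sqrt(
        (r - 1.0) ** 2 + (beta - 1.0) ** 2 + (gamma - 1.0) ** 2
    )


# Share of each component in the summed squared deviation from unity.
components = {"correlation": 0.69, "bias": 0.93, "variability": 0.90}
squared = {name: (value - 1.0) ** 2 for name, value in components.items()}
total = sum(squared.values())
shares = {name: sq / total for name, sq in squared.items()}
print(shares)
```

With these values the correlation term accounts for most of the summed squared deviation, which is the sense in which the component furthest from unity dominates the score. Note that a KGE computed from median components is not the same as the median of the grid-cell KGE values reported in the figures.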
3.2 How do the uncorrected P datasets perform?
The (uncorrected) satellite soil moisture-based SM2RAIN-CCI V2 dataset performed comparatively poorly (median KGE of 0.28; Figs. 1 and 2). The dataset strongly underestimated the variability (Fig. S3), due to the noisiness of satellite soil moisture retrievals and the inability of satellite soil moisture-based algorithms to detect rainfall exceeding the soil water storage capacity (Zhan et al., 2015; Wanders et al., 2015; Tarpanelli et al., 2017; Ciabatta et al., 2018). At high latitudes and elevations, the presence of snow and frozen soils may have hampered performance (Brocca et al., 2014), while in arid regions, irrigation may have been misinterpreted as rainfall (Brocca et al., 2018). In addition, approximately 25 % (in the eastern CONUS) to 50 % (over the mountainous west) of the daily rainfall values were based on temporal interpolation, to fill gaps in the satellite soil moisture data (Dorigo et al., 2017). Despite these limitations, the SM2RAIN datasets may provide new possibilities for evaluation (Massari et al., 2017) and correction (Massari et al., 2018) of other P datasets, since they constitute a fully independent, alternative source of rainfall data.
All uncorrected P datasets exhibited lower overall performance in the western CONUS (Figs. 1, 2, and S1-S3), in line with previous studies (e.g., Gottschalck et al., 2005; Ebert et al., 2007; Tian et al., 2007; AghaKouchak et al., 2012; Chen et al., 2013; Beck et al., 2017c; Gebregiorgis et al., 2018). This is attributable to the more complex topography and greater spatio-temporal heterogeneity of P in the west (Daly et al., 2008), which affects the quality of both the evaluated datasets and the reference (Westrick et al., 1999; Smalley et al., 2014; Eldardiry et al., 2017). With the exception of CHIRP V2.0 (which has been corrected for systematic biases using gauge observations; Funk et al., 2015b) and WRF (the high-resolution climate simulation; Liu et al., 2017), the (uncorrected) datasets exhibited large P biases over the mountainous west (Fig. S2), which is in agreement with earlier studies using other reference datasets (Adam et al., 2006; Kauffeldt et al., 2013; Beck et al., 2017a; Beck et al., 2017c) and reflects the difficulty of retrieving and simulating orographic P (Roe, 2005). We initially expected bias values to be higher than unity since PRISM, the dataset used to correct systematic biases in Stage-IV (see Sect. 2.2), lacks explicit gauge undercatch corrections (Daly et al., 2008), but this did not appear to be the case (Figs. 2 and S2).

3.3 How do the gauge-based P datasets perform?
Among the gauge-based P datasets, the best overall performance was obtained by MSWEP V2.2 (median KGE of 0.81), followed at some distance by IMERGDF V05 (median KGE of 0.67) and MERRA-2 (median KGE of 0.66; Figs. 1 and 2). IMERGDF V05 exhibited a small negative bias, while MERRA-2 slightly underestimated the variability. The good performance obtained by MSWEP V2.2 underscores the importance of incorporating daily gauge data and accounting for reporting times (Beck et al., 2019). While CMORPH-CRT V1.0, CPC Unified V1.0/RT, GSMaP-Std Gauge V7, and MERRA-2 also incorporate daily gauge data, they did not account for reporting times, resulting in temporal mismatches and hence lower KGE scores (Fig. 2). Reporting times in the CONUS range from midnight −12 to +9 h UTC for the stations in the comprehensive GHCN-D gauge database (Menne et al., 2012; Fig. 2c in Beck et al., 2019), suggesting that up to half of the daily P accumulations may be assigned to the wrong day. In addition, CMORPH-CRT V1.0, GSMaP-Std Gauge V7, and MERRA-2 applied daily gauge corrections using CPC Unified (Xie et al., 2007; Chen et al., 2008), which has a relatively coarse 0.5° resolution, whereas MSWEP V2.2 applied corrections at 0.1° resolution based on the five nearest gauges for each grid cell (Beck et al., 2019). The good performance of IMERGDF V05 is somewhat surprising, given the use of monthly rather than daily gauge data, and attests to the quality of the IMERG P retrieval algorithm (Huffman et al., 2014, 2018).
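The effect of gauge reporting times can be illustrated with a small synthetic example. It assumes, purely for illustration, that a gauge total labelled with day d covers the 24 h ending at the reporting hour on day d; actual labelling conventions vary between networks. The function names and rainfall amounts are hypothetical.

```python
def utc_day_totals(hourly, n_days):
    """Daily totals for 00:00-23:59 UTC windows."""
    return [sum(hourly[d * 24:(d + 1) * 24]) for d in range(n_days)]


def gauge_day_totals(hourly, report_hour, n_days):
    """Daily totals for 24 h windows ending at report_hour UTC on each day.

    The window for day d is assumed to be labelled with day d (illustrative
    convention). The first window is truncated at the start of the record.
    """
    totals = []
    for d in range(n_days):
        end = d * 24 + report_hour
        start = max(end - 24, 0)
        totals.append(sum(hourly[start:end]))
    return totals


# Three days of hourly amounts (mm): one event at 18:00 UTC on day 0
# and one at 06:00 UTC on day 2.
hourly = [0.0] * 72
hourly[18] = 10.0
hourly[54] = 5.0

print(utc_day_totals(hourly, 3))                       # [10.0, 0.0, 5.0]
print(gauge_day_totals(hourly, report_hour=12, n_days=3))  # [0.0, 10.0, 5.0]
```

The afternoon event is assigned to the following day in the gauge-day series, whereas the morning event is not. When such a series is compared against 00:00-23:59 UTC accumulations without accounting for the reporting time, the resulting mismatch lowers the daily correlation.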
Similar to the uncorrected datasets, the corrected estimates consistently performed worse in the west (Figs. 1, 2, and S1-S3), due not only to the greater spatio-temporal heterogeneity in P (Daly et al., 2008), but also the lower gauge network density (Kidd et al., 2017). It should be kept in mind that the performance ranking may differ across the globe depending on the amount of gauge data ingested and the quality control applied for each dataset. Thus, the results found here for the CONUS do not necessarily directly generalize to other regions.

3.4 How do the P datasets perform in summer versus winter?
Figure 3 presents KGE values for summer and winter for the 26 P datasets. The following observations can be made.
- The spread in median KGE values among the datasets is much greater in winter than in summer. In addition, almost all datasets exhibit a greater spatial variability in KGE values in winter, as indicated by the wider boxes and whiskers. This is probably at least partly attributable to the lower quality of the Stage-IV dataset in winter (Westrick et al., 1999; Smalley et al., 2014; Eldardiry et al., 2017).
- All (re)analyses (with the exception of NCEP-CFSR) including the WRF regional climate model consistently performed better in winter than in summer. This is because predictable large-scale stratiform systems dominate in winter (Adler et al., 2001; Ebert et al., 2007; Coiffier, 2011), whereas unpredictable small-scale convective cells dominate in summer (Arakawa, 2004; Prein et al., 2015).
- All satellite P datasets (with the exception of PERSIANN) consistently performed better in summer than in winter. Satellites are ideally suited to detect the intense, localized convective storms which dominate in summer (Wardah et al., 2008; AghaKouchak et al., 2011). Conversely, there are major challenges associated with the retrieval of snowfall (Kongoli et al., 2003; Liu and Seo, 2013; Skofronick-Jackson et al., 2015; You et al., 2017) and light rainfall (Habib et al., 2009; Kubota et al., 2009; Tian et al., 2009; Lu and Yong, 2018), affecting the performance in winter.
- The datasets incorporating both satellite and reanalysis estimates (CHIRP V2.0, CHIRPS V2.0, and MSWEP V2.2) performed similarly in both seasons, taking advantage of the accuracy of satellite retrievals in summer and reanalysis outputs in winter (Ebert et al., 2007; Beck et al., 2017b). The fully gauge-based CPC Unified V1.0/RT also performed similarly in both seasons.
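A seasonal evaluation of this kind requires splitting the paired daily series by season before computing the metric. The sketch below uses the season definitions given in the Fig. 3 caption (June-August and December-February); the function name `split_by_season` and the placeholder series are hypothetical.

```python
from datetime import date, timedelta

# Months per season, following the Fig. 3 caption.
SEASONS = {"summer": (6, 7, 8), "winter": (12, 1, 2)}


def split_by_season(dates, est, ref):
    """Group paired daily values into summer (JJA) and winter (DJF) subsets.

    Returns a dict mapping season name to (estimates, references).
    Days outside both seasons are ignored.
    """
    subsets = {name: ([], []) for name in SEASONS}
    for day, s, o in zip(dates, est, ref):
        for name, months in SEASONS.items():
            if day.month in months:
                subsets[name][0].append(s)
                subsets[name][1].append(o)
    return subsets


# One leap year (2008) of placeholder values.
dates = [date(2008, 1, 1) + timedelta(days=i) for i in range(366)]
est = [1.0] * len(dates)
ref = [1.0] * len(dates)
subsets = split_by_season(dates, est, ref)
print({name: len(pair[0]) for name, pair in subsets.items()})
# {'summer': 92, 'winter': 91}
```

Each seasonal subset can then be passed to the same KGE calculation that is applied to the full series.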
3.5 What is the impact of gauge corrections?
Differences in median KGE values between uncorrected and gauge-corrected versions of P datasets ranged from −0.07 (GSMaP-Std Gauge V7) to +0.20 (CMORPH-CRT V1.0; Table 3). GSMaP-Std Gauge V7 shows a large positive bias in the west (Fig. S2), suggesting that its gauge-correction methodology requires re-evaluation. The substantial improvements in median KGE for CHIRPS V2.0 (+0.13) and CMORPH-CRT V1.0 (+0.20) reflect the use of sub-monthly gauge data (5-day and daily, respectively). Conversely, the datasets incorporating monthly gauge data (IMERGDF V05 and WFDEI-GPCC) exhibited little to no improvement in median KGE (+0.05 and −0.01, respectively), suggesting that monthly corrections provide little to no benefit at the daily timescale of the present evaluation (Tan and Santo, 2018). These results, combined with the fact that several uncorrected P datasets outperformed gauge-corrected ones (Fig. 2), suggest that a P dataset labeled as "gauge-corrected" is not necessarily always the better choice.
3.6 What is the improvement of IMERG over TMPA?
3.7 What is the improvement of ERA5 over ERA-Interim?
[…] As a result of these changes, ERA5-HRES performed markedly better than ERA-Interim in terms of P across most of the CONUS, especially in the west (Figs. 1 and 4b). ERA5-HRES obtained a median KGE of 0.63, whereas ERA-Interim obtained a median KGE of 0.55 (Fig. 2). Improvements were evident for all three KGE components (correlation, bias, and variability). It is difficult to say how much of the performance improvement of ERA5 is due to the assimilation of gauge and radar P data. We suspect that the performance improvement is largely attributable to other factors, given that (i) the impact of the P data assimilation is limited overall due to the large amount of other observations already assimilated (Lopez, 2013); (ii) radar data were discarded west of 105° W for quality reasons (Lopez, 2011); and (iii) performance improvements were also found in regions without assimilated gauge observations (e.g., Nevada; Fig. 4b; Lopez, 2013, their Fig. 3). Nevertheless, we expect the performance difference between ERA5 and ERA-Interim to be less in regions with fewer or no assimilated gauge observations (i.e., outside the US, Canada, Argentina, Europe, Iran, and China; Lopez, 2013, their Fig. 3).
So far, only three other studies have compared the performance of ERA5 and ERA-Interim. The first study compared the two reanalyses for the CONUS by using them to drive a land surface model (Albergel et al., 2018). The simulations using ERA5 provided substantially better evaporation, soil moisture, river discharge, and snow depth estimates. The authors attributed this to the improved P estimates, which is supported by our results. The second and third studies evaluated incoming shortwave radiation and precipitable water vapor estimates from the two reanalyses, respectively, with both studies reporting that ERA5 provides superior performance (Urraca et al., 2018; Zhang et al., 2018a).
3.8 How does the ERA5-EDA ensemble average compare to the ERA5-HRES deterministic run?
Ensemble modeling involves using outputs from multiple models or from different realizations of the same model; it is widely used in climate, atmospheric, hydrological, and ecological sciences to improve accuracy and quantify uncertainty (Gneiting and Raftery, 2005; Nikulin et al., 2012; Strauch et al., 2012; Cheng et al., 2012; Beck et al., 2013, 2017a). Here, we compare the P estimation performance of a high-resolution (∼0.28°) deterministic reanalysis (ERA5-HRES) to that of a reduced-resolution (∼0.56°) ensemble average (ERA5-EDA; Table 1). The ensemble consists of 10 members generated by perturbing the assimilated observations (Zuo et al., 2017) as well as the model physics (Ollinaho et al., 2016; Leutbecher et al., 2017). The ensemble average was derived by equal weighting of the members.
Compared to ERA5-HRES, we found ERA5-EDA to perform similarly in terms of median KGE (0.62 versus 0.63), better in terms of median correlation (0.72 versus 0.69) and bias (0.96 versus 0.93), but worse in terms of median variability (0.80 versus 0.90; Figs. 1, 2, and 4c). The deterioration of the variability is probably at least partly due to the averaging, which shifts the distribution toward medium-sized events. The improvement in correlation is evident over the entire CONUS (Fig. 4c), and corresponds to a 9 % overall increase in the explained temporal variance, demonstrating the value of ensemble modeling. We expect the improvement to increase with increasing diversity among ensemble members (Brown et al., 2005; DelSole et al., 2014).
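Both effects of averaging, a higher correlation and a reduced variability, can be reproduced with a toy ensemble. The sketch below is a synthetic illustration and not a model of ERA5-EDA: each series is a shared signal plus independent Gaussian noise (so the values are not precipitation-like), and all names and parameters are assumptions. The last lines show one reading of the reported 9 % figure, namely the relative increase in squared correlation.

```python
import math
import random


def mean(x):
    return sum(x) / len(x)


def std(x):
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (std(x) * std(y))


def coeff_var(x):
    return std(x) / mean(x)


random.seed(0)
n_days, n_members = 5000, 10
shared = [random.gauss(0.0, 1.0) for _ in range(n_days)]


def realization():
    """Shared signal plus independent noise, offset to keep the mean positive."""
    return [5.0 + s + random.gauss(0.0, 1.0) for s in shared]


reference = realization()
members = [realization() for _ in range(n_members)]
ens_mean = [sum(values) / n_members for values in zip(*members)]

r_member = pearson_r(members[0], reference)
r_mean = pearson_r(ens_mean, reference)
gamma_member = coeff_var(members[0]) / coeff_var(reference)
gamma_mean = coeff_var(ens_mean) / coeff_var(reference)
print(r_member, r_mean, gamma_member, gamma_mean)

# Relative increase in explained variance (r squared) for 0.72 versus 0.69.
gain = 0.72 ** 2 / 0.69 ** 2 - 1.0
print(round(100 * gain))  # 9
```

Averaging cancels part of the member-specific noise, which raises the correlation with the reference, but it also removes part of the variance, so the variability ratio of the ensemble mean falls below that of a single member.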
3.9 How do IMERG and ERA5 compare?
IMERGHHE V05 (Huffman et al., 2014, 2018) and ERA5-HRES (Hersbach et al., 2018) represent the state-of-the-art in terms of satellite P retrieval and reanalysis, respectively (Table 1). Although the datasets exhibited similar performance overall (median KGE of 0.62 and 0.63, respectively; Figs. 1 and 2), regionally there were considerable differences (Fig. 4d). Compared to ERA5-HRES, IMERGHHE V05 performed substantially worse over regions of complex terrain (including the Rockies and the Appalachians), in line with previous evaluations focusing on India (Prakash et al., 2018) and western Washington state (Cao et al., 2018). In contrast, ERA5-HRES performed worse across the southern-central US, where P predominantly originates from small-scale, short-lived convective storms which tend to be poorly simulated by reanalyses (Adler et al., 2001; Arakawa, 2004; Ebert et al., 2007). The patterns in relative performance between IMERGHHE V05 and ERA5-HRES (Fig. 4d) correspond well to those found between TMPA 3B42RT and ERA-Interim (Beck et al., 2017b, their Fig. 4) and between CMORPH and ERA-Interim (Beck et al., 2019, their Fig. 3d), suggesting that our conclusions can be generalized to other satellite- and reanalysis-based P datasets. Our findings suggest that topography and climate should be taken into account when choosing between satellite and reanalysis datasets. Furthermore, our results demonstrate the potential to improve continental- and global-scale P datasets by merging satellite- and reanalysis-based P estimates (Huffman et al., 1995; Xie and Arkin, 1996; Sapiano et al., 2008; Beck et al., 2017b, 2019; Zhang et al., 2018b).
3.10 How well does a regional convection-permitting climate model perform?
In addition to the (quasi-)global P datasets, we evaluated the performance of a state-of-the-art climate simulation for the CONUS (WRF; Liu et al., 2017; Table 1). The WRF simulation has the potential to produce highly accurate P estimates since it has a high 4 km resolution, which allows it to account for the influence of mesoscale orography (Doyle, 1997), and is "convection-permitting", which means it does not rely on highly uncertain convection parameterizations (Kendon et al., 2012; Prein et al., 2015). In terms of variability, WRF performed third best, being outperformed only (and very modestly) by the gauge-based CPC Unified V1.0/RT and MSWEP V2.2 datasets (Figs. 1 and 2). In terms of bias, the simulation produced mixed results. WRF is the only uncorrected dataset that does not exhibit large biases over the mountainous west (Fig. S2). However, large positive biases were obtained over the Great Plains region, as also found by Liu et al. (2017) using the same reference data.
In terms of correlation, WRF performed worse than third-generation reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR; Figs. 2 and S1). This is probably because WRF is forced entirely by lateral and initial boundary conditions from ERA-Interim (Liu et al., 2017), whereas the reanalyses assimilate vast amounts of in situ and satellite observations (Saha et al., 2010; Dee et al., 2011; Kobayashi et al., 2015). Overall, there appears to be some merit in using high-resolution, convection-permitting models to obtain climatological P statistics.

4 Conclusions
To shed some light on the strengths and weaknesses of different precipitation (P) datasets and on the merit of different technological and methodological innovations, we comprehensively evaluated the performance of 26 gridded (sub-)daily P datasets for the CONUS using Stage-IV gauge-radar data as a reference. The evaluation was carried out at a daily temporal and 0.1° spatial scale for the period 2008–2017 using the KGE, an objective performance metric combining correlation, bias, and variability. Our findings can be summarized as follows.
1. Across the range of KGE scores for the datasets examined, the most important component is correlation (reflecting the identification of P events). Of secondary importance are the P totals (determining the bias score) and the distribution of P intensity (affecting the variability score).
2. Among the uncorrected P datasets, the (re)analyses performed better on average than the satellite-based datasets. The best performance was obtained by ECMWF's fourth-generation reanalysis ERA5-HRES, with NASA's most recent satellite-derived IMERGHHE V05 and the ensemble average ERA5-EDA coming a close equal second.
3. Among the gauge-based P datasets, the best overall performance was obtained by MSWEP V2.2, followed by IMERGDF V05 and MERRA-2. The good performance of MSWEP V2.2 highlights the importance of incorporating daily gauge observations and accounting for gauge reporting times.
4. The spread in performance among the P datasets was greater in winter than in summer. The spatial variability in performance was also greater in winter for most datasets. The (re)analyses generally performed better in winter than in summer, while the opposite was the case for the satellite-based datasets.
5. The performance improvement gained after applying gauge corrections differed strongly among P datasets. The largest improvements were obtained by the datasets incorporating sub-monthly gauge data (CHIRPS V2.0 and CMORPH-CRT V1.0). Several uncorrected P datasets outperformed gauge-corrected ones.
6. IMERGHH V05 performed better than TMPA-3B42RT V7 for all metrics, consistent with previous studies and attributable to the many improvements implemented in the new IMERG algorithm.
7. ERA5-HRES outperformed ERA-Interim for all metrics across most of the CONUS, demonstrating the significant advances in climate and earth system modeling and data assimilation during the last decade.
8. The reduced-resolution ERA5-EDA ensemble average showed higher correlations than the high-resolution ERA5-HRES deterministic run, supporting the value of ensemble modeling. However, a side effect of the averaging is that the P distribution shifted toward medium-sized events.
9. IMERGHHE V05 and ERA5-HRES showed complementary performance patterns. The former performed substantially better in regions dominated by convective storms, while the latter performed substantially better in regions of complex terrain.
10. The regional convection-permitting climate model WRF performed best among the uncorrected P datasets in terms of variability. This suggests there is some merit in employing high-resolution, convection-permitting models to obtain climatological P statistics.
Our findings provide some guidance to decide which P dataset should be used for a particular application. We found evidence that the relative performance of different datasets is to some degree a function of topographic complexity, climate regime, season, and rain gauge network density. Therefore, care should be taken when extrapolating our results to other regions. Additionally, results may differ when using another performance metric or when evaluating other timescales or aspects of the datasets. Similar evaluations should be carried out with other performance metrics and in other regions with ground radar networks (e.g., Australia and Europe) to verify and supplement the present findings. Of particular importance in the context of climate change is the further evaluation of P extremes.

Figure 1. KGE scores for the 26 gridded P datasets using the Stage-IV gauge-radar dataset as a reference. White indicates missing data. Higher KGE values correspond to better performance. Uncorrected datasets are listed in blue, whereas gauge-corrected datasets are listed in red. Details on the datasets are provided in Tables 1 and 2. Maps for the correlation, bias, and variability components of the KGE are presented in the Supplement.

Figure 2. Box-and-whisker plots of KGE scores for the 26 gridded P datasets using the Stage-IV gauge-radar dataset as a reference. The circles represent the median value, the left and right edges of the box represent the 25th and 75th percentile values, respectively, while the "whiskers" represent the extreme values. The statistics were calculated for each dataset from the distribution of grid-cell KGE values (no area weighting was performed). The datasets are sorted in ascending order of the median KGE. Uncorrected datasets are indicated in blue, whereas gauge-corrected datasets are indicated in red. Details on the datasets are provided in Tables 1 and 2.

Figure 3. Box-and-whisker plots of KGE scores for summer (June-August) and winter (December-February) using the Stage-IV gauge-radar dataset as a reference. The circles represent the median value, the left and right edges of the box represent the 25th and 75th percentile values, respectively, while the "whiskers" represent the extreme values. The statistics were calculated for each dataset from the distribution of grid-cell KGE values (no area weighting was performed). The datasets are sorted in ascending order of the overall median KGE (see Fig. 2). Uncorrected datasets are indicated in blue, whereas gauge-corrected datasets are indicated in red. Details on the datasets are provided in Tables 1 and 2.

Figure 4. (a) KGE scores obtained by IMERGHHE V05 minus those obtained by TMPA-3B42RT V7. (b) KGE scores obtained by ERA5-HRES minus those obtained by ERA-Interim. (c) Correlations (r) obtained by ERA5-EDA minus those obtained by ERA5-HRES. (d) KGE scores obtained by IMERGHHE V05 minus those obtained by ERA5-HRES. Note the different color scales. The Stage-IV gauge-radar dataset was used as a reference. The KGE and correlation values were calculated from daily time series.

Table 2. Overview of the 11 gauge-corrected (quasi-)global (sub-)daily gridded P datasets evaluated in this study. The 15 uncorrected datasets are listed in Table 1. Abbreviations in the data source(s) column defined as G, gauge; S, satellite; R, reanalysis; and A, analysis. The abbreviation NRT in the temporal coverage column stands for near real time. In the spatial coverage column, "global" indicates fully global coverage including ocean areas, while "land" indicates that the coverage is limited to the terrestrial surface.
1 Available until the present with a delay of several hours. 2 Available until the present with a delay of several days. 3 Available until the present with a delay of several months. 4 2000-NRT for the next version.
In the KGE definition, the subscripts s and o indicate estimate and reference, respectively. KGE, r, β, and γ values all have their optimum at unity.