Using GRACE to derive corrections to precipitation data sets and improve modelled snow mass at high latitudes

. The amount of lying snow calculated by a land surface model depends in part on the amount of snowfall in the meteorological data that are used to drive the model. We show that commonly-used data sets differ in the amount of snowfall, and more generally precipitation, over four large Arctic basins. An independent estimate of the cold season precipitation is obtained by combining water balance information from the Gravity Recovery and Climate Experiment (GRACE) with estimates of evaporation and river discharge, and is generally higher than that estimated by four commonly-used meteorological data sets. 5 We use the Joint UK Land Environment Simulator (JULES) land surface model to calculate the snow water equivalent (SWE) over the four basins. The modelled seasonal maximum SWE is 38% less than observation-based estimates on average and the modelled basin discharge is signiﬁcantly underestimated, consistent with the lack of snowfall. We use the GRACE-derived estimate of precipitation to deﬁne per-basin scale factors that are applied to the driving data and increase the amount of cold season precipitation by 28% on average. In turn this increases the modelled seasonal maximum SWE by 30%, although this 10 is still underestimated compared to observations by 19% on average. A correction for undercatch of precipitation by gauges is compared with the the GRACE-derived correction. Undercatch correction increases the amount of cold season precipitation by 23% on average, which indicates that some, but not all, of the underestimation can be removed by implementing existing undercatch correction algorithms. However, even undercatch-corrected data sets contain less precipitation than the GRACE-derived estimate in some regions, and it is likely that there are other biases that that are not currently accounted for in gridded 15 meteorological data sets. This study shows that revised estimates of precipitation can lead to improved modelling of SWE, but much more modest improvements are found in modelled river discharge. By providing methods to better deﬁne the precipitation inputs to the system, the current study paves the way for subsequent work on key hydrological processes in high-latitude corrections precipitation reanalysis interpolated, correction for differences in grid box elevation. used in for undercatch Shefﬁeld, description of original data set in Shefﬁeld et al. (2006). in grid box separately in the reanalysis, with the fraction adjusted appropriately where differences in grid box elevation resulted in inappropriate precipitation phase. two of the data sets — WFDEI-CRU and WFDEI-GPCC — already implement undercatch correction. The two data sets that do not apply a correction required the largest scaling factors to match the GRACE-derived estimates. To investigate the importance of the undercatch correction in comparison to the GRACE parts of the hydrological system, as TWS, estimates of SWE based on remote sensing, and river discharge as in this study. Although each product comes with its own uncertainty, and a range of alternative and potentially conﬂicting data sets is often available, the combination of estimates across different parts of the hydrological system can provide extra insights, particularly if a model shows consistent biases across several components. This study shows that at a basin scale the cold season precipitation in four data sets that are commonly used to drive land surface models is low compared with estimates derived from GRACE TWS. This leads to consistent and large errors in the SWE and basin discharge calculated by JULES, which are also low compared to observations. By increasing the precipitation in JULES to match the GRACE estimates the modelled SWE is substantially improved, although river discharge is still low likely because of a combination of poor modelling of runoff processes during the spring melt and possible underestimation of summertime precipitation. By providing methods to better deﬁne the precipitation inputs to the system, the current study paves the way for subsequent work on key hydrological processes.

of the surface of the Earth, which can be interpreted as changes in water storage, revealing the dynamics of TWS at the surface on a monthly time scale. While the monthly changes in TWS can be observed, GRACE does not indicate what fluxes or components are the cause of the changes. However, by using the GRACE TWS anomalies, combined with observed or modelled estimates of river flow and gridded evaporation fluxes, it is possible to estimate the areal precipitation (Swenson, 2010). 5 Previous studies have used GRACE TWS to estimate precipitation inputs in boreal regions (Swenson, 2010;Seo et al., 2010;Behrangi et al., 2016) and cold mountainous basins (Behrangi et al., 2017). These have identified deficiencies or uncertainties in traditional estimates of precipitation, but with magnitude that varies between products, locations and times. Swenson (2010) compared GRACE-derived precipitation with two existing precipitation data sets and found a varied picture, with more variability in North America, while Behrangi et al. (2017) many precipitation products are significantly lower than the GRACE- 10 derived estimate in cold mountainous endorrheic basins. Behrangi et al. (2016) compared several raw reanalysis outputs, as well as some satellite and gauge based products, and only found deficiencies in gauge-based data sets in some regions. These studies have all used LSM or reanalysis output to provide at least some of the ancillary data required to calculate precipitation from GRACE TWS. Since, in this study, the intention is to investigate how changes to input precipitation can result in changes to the LSM output, independent estimates of evapotranspiration and river flow have been used in preference to LSM output. 15 The remainder of this study investigates the hypothesis that many precipitation data sets that are used to drive land surface models contain insufficient cold season precipitation in mid-to high-latitude land areas, and that this leads to underestimation of SWE and river flow. Cold-season precipitation data are assessed against an estimate based on GRACE TWS data combined with independent estimates of evaporation and basin discharge. The results confirm that the precipitation data sets used to drive LSMs underestimate cold season precipitation in boreal regions. The GRACE-derived precipitation estimates are used to 20 rescale the cold season precipitation in the data sets used to drive the JULES model, and this improves the representation of SWE, with a limited improvement in modelled river discharge.
In Sect. 2 the data sets used to carry out this study are described and in Sect. 3 the set up of the JULES model and the experiments carried out are defined. The method by which precipitation is calculated from GRACE TWS is described in Sect. 4. The JULES model runs are described and evaluated in Sect. 5 and the implications are discussed in Sect. 6. 25 2 Data sets All analyses in this paper are carried out using data on a regular latitude-longitude grid at 1 • resolution. Data with other resolutions are first regridded to this common grid with a modified version of SCRIP (Jones, 1998), using the conservative remapping method normalised by destination area fraction (Jones, 1999). This study focuses on the four major Arctic basins -the Ob, Yenisei, Lena and Mackenzie -which together account for approximately 68% of the total discharge into the 30 Arctic Ocean (Grabs et al., 2000) and are marked by O, Y, L and M respectively in Fig. 1. The basins are defined using the 1 • TRIP river network data of Oki and Sud (1998). For consistency with the observations of river flow (see Sect. 2.3), only the part of each basin that drains to the flow gauging station was considered when calculating spatial averages of observed or 4 Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2019-109 Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 27 March 2019 c Author(s) 2019. CC BY 4.0 License. modelled grids. The data sets used all vary in their temporal coverage; for consistency in analysis a "common overlap period" of 2002 -2010 inclusive was used, during which time all of the driving data sets and the data sets used for the water balance precipitation estimates were available. In order to capture full annual cycles of snow accumulation and melt, annual averages were calculated from September 1st of one year to August 31st of the next, so the common overlap period contains eight full years for analysis. All comparisons were carried out using data from this common overlap period, and averaged over the 5 defined basins, unless explicitly noted otherwise.

Meteorological data
A key requirement of land surface models is a time series of meteorological data. To this end, several "driving" data sets have been created that represent realistic global climate for the recent past. These are generally based on global meteorological reanalyses which have been bias-corrected to observational data sets. They provide a full set of the meteorological variables 10 required to run an LSM, including air temperature, pressure and humidity, wind speed, incoming long-and shortwave radiation, as well as the precipitation with which this study is concerned.
Four meteorological data sets were used to drive the JULES model in this study. These are summarised in Table 1 and described below, with a focus on the precipitation data. The data were available at 0.5 • resolution unless stated otherwise.
Basin mean monthly climatologies of all of the driving variables can be seen in Figs. S1 and S2. 15 The CRUNCEP v4 data set (CRUNCEP; Viovy, 2014) uses the CRU TS3.21 monthly data (Harris et al., 2013), including gauge-based precipitation, to bias correct the NCEP/NCAR Reanalysis 1 (Kalnay et al., 1996). The reanalysis is spatially interpolated from its original 2.5 • resolution to the required 0.5 • resolution with no adjustment for grid box elevation. No undercatch correction is applied to the gauge data and the total precipitation is provided (rather than separate estimates of rainfall and snowfall). 20 The Princeton Global Forcing data v2 (PGF; Sheffield et al., 2006) is based on the NCEP/NCAR Reanalysis 1 with precipitation totals scaled to CRU TS monthly values and further statistical corrections applied to correct the number of rain days and to to scale to the final 1 • and 3-hourly resolutions. Again total precipitation is provided. The reanalysis is spatially interpolated, including a correction for differences in grid box elevation. The version used in this study (v2) was not corrected for undercatch (Justin Sheffield, personal communication, 2018) contrary to the description of the original data set in Sheffield et al. (2006).

25
Two variants of the WATCH Forcing Data methodology applied to ERA-Interim Data (WFDEI; Weedon et al., 2014) were used, differing only in the precipitation data used for bias correction: WFDEI-CRU uses CRU TS 3.1.01 until 2009 and CRU TS 3.21 from 2010 onwards (Harris et al., 2013;Trenberth et al., 2013), while WFDEI-GPCC uses GPCC v5 until 2009 and v6 from 2010 onwards (Schneider et al., 2014) and includes many more stations than CRU. Both variants apply undercatch corrections (Adam and Lettenmaier, 2003) and use the ERA-Interim reanalyses (Dee et al., 2011). Variables other than pre-30 cipitation are bias-corrected to CRU TS 3.1 until 2009 and CRU TS 3.21 from 2010 onwards in both versions of WFDEI. The reanalysis was spatially interpolated from its native N128 resolution (around 0.7 • at the equator) to 0.5 • resolution, including adjustment for differences in grid box elevation. Rainfall and snowfall were diagnosed separately in the reanalysis, with the fraction adjusted appropriately where differences in grid box elevation resulted in inappropriate precipitation phase. All of these data sets have been widely used to drive land surface models in previous studies. All of them scale precipitation using data from CRU or GPCC, while an important difference is that only two (WFDEI-CRU and WFDEI-GPCC) include an undercatch correction. The undercatch correction implemented in Adam and Lettenmaier (2003) resulted in an increase in global precipitation of 12%, with an increase of 95% in some boreal regions. Thus this can be an important source of underestimation of precipitation. Although the driving data sets are primarily used by land surface modellers, they include 5 information from more widely-used precipitation data sets, and their basin averaged annual precipitation values are similar to those from a range of other data sets (Fig. S3).

GRACE
The twin GRACE satellites made detailed measurements of changes in Earth's gravity field from April 2002 -January 2017.
GRACE-Tellus used these to provide estimates of the TWS anomaly relative to the baseline mean over the years 200410 (Swenson and Wahr, 2006Swenson, 2012;Landerer and Swenson, 2012). GRACE may be used to study the dynamics of the water cycle on a monthly time step. Although it may not be used to estimate absolute stores, it gives an estimate of the integrated water fluxes into and out of a region.
Three methods have been developed to solve the gravity fields using spherical harmonics (CSR, GFZ, JPL). These are then used to calculate the gridded TWS, so that three products are available. As is recommended by Sakumura et al. (2014), this 15 study uses the ensemble average of the three solutions, which effectively reduces the uncertainty in the GRACE product.
The gravimetry measured by the GRACE satellites is converted to a TWS anomaly, making use of model output. Corrections for changes in atmospheric pressure are based on ECMWF analysis data, and corrections for post-glacial rebound are also modelled (Wahr et al., 1998). Destriping, Gaussian and degree 60 filters are applied to the gridded data. In order to correct for signal attenuation introduced by the filters applied when calculating the grids from the spherical harmonics, the three TWS 20 land grids must be multiplied by a gridded gain factor, which is derived by applying the same filters to the output from the Community Land Model 4 (CLM4; Landerer and Swenson, 2012).
Although the filters remove a significant component of spatially correlated and random errors, the gridded GRACE data are not independent of their neighbours. Uncertainties in the GRACE TWS data are generally highest at low latitudes, and lower towards the poles. In the boreal regions in this study, estimated measurement error is around 15 mm (Landerer and Swenson,25 2012). The application of the gridded gain factor means that the gridded GRACE land data cannot be used to study long-term trends (Landerer and Swenson, 2012).
The data used in this study were the GRACE RL05 gridded land TWS (Swenson, 2012;NASA, 2017  discharge. In this study, basin discharge is quoted in units of mm, for consistency with the other variables, calculated by dividing the total discharge by the area of the basin draining to the station. There are missing observations at each station, and some stations ceased contributing data before the end of the study period. Months with missing data were filled using a mean-monthly climatology, calculated using all available data at the station. Data availability varies between basins, with observations ending between 2003 and 2012. A summary of the available and filled data is shown in Table S1. surface radiation from reanalysis and a gauge-based gridded precipitation product, along with satellite products for SWE, vegetation properties and lightning and soil moisture. There is a version, v3.1b, which uses satellite products for air temperature and surface radiation, but this is only available for latitudes between 50 • S and 50 • N and was not suitable for this study. For this study, the Northern Hemisphere data were regridded onto the analysis grid. Since the original data are masked where mountains are present, in the regridded data pixels were masked where at least 50% of the contributing pixels were mountain.

25
This mostly affects western North America and some regions in Central Asia, masking 16% of the area of the Yenisei, 13% of the Ob, 9% of the Lena and 14% of the Mackenzie. In the Eurasian basins the masked regions are generally in the south where the SWE is low, although in the Mackenzie the mountains extend to the north of the basin. The GlobSnow mountain mask is used for all basin averages of SWE in this study, both modelled and observed, so that comparisons are made over consistent regions.

CMC Daily Snow Depth Analysis
The Canadian Meteorological Centre Daily Snow Depth Analysis Data (CMC; Brown and Brasnett, 2010) are an assimilation of a simple snow accumulation and melt model, using temperatures and precipitation from the CMC Global Environmental Multiscale (GEM) forecast model, and daily snow depth data from the World Meteorological Organization (WMO) information system. From snow depth the SWE is estimated using snow density which is dependent on month and climate class. The data 5 are provided as a Northern Hemisphere polar stereographic grid, with a nominal resolution of 24 km and are available for the years 1998-2012. accumulation throughout the winter. However, the seasonal maxima are consistent between the two data sets, although the GlobSnow maximum is slightly earlier in the Lena.

Comparison of SWE products
This comparison suggests that the GlobSnow and CMC basin average estimates are remarkably similar for the areas and time periods considered, particularly for estimates of seasonal maxima. A more comprehensive comparison of hemispherewide estimates by Mudryk et al. (2015) showed that GlobSnow tended to give larger and earlier seasonal maxima than CMC 5 over the whole Northern Hemisphere, but this does include regions that are not being considered in the current study. We provide brief details here and refer the reader to the above references for a fuller description. 10 JULES calculates the exchanges of radiation, heat, water and carbon between the land surface and the atmosphere. As employed here each grid box contains a mixture of nine surface types (five vegetation types and four non-vegetated surfaces), for each of which the surface fluxes are calculated independently. Subsurface fluxes of moisture and heat are calculated using a layered soil model, with a single soil column in each grid box. Supersaturation of soil moisture is avoided by moving the extra water to lower soil layers. The snow pack is modelled using a multi-layer snow scheme that employs a variable number 15 of layers depending on the depth of snow. Each snow layer has a prognostic temperature, density, grain size, and solid and liquid water contents. Snowfall in vegetated areas with needle-leaf tree cover can be partitioned between the vegetation canopy and the underlying ground Essery and Clark, 2003). The intercepted snow leaves the canopy though sublimation and wind-speed-dependent unloading of melting snow. The surface albedo is affected by the evolving grain size of the snow surface and the extent to which the vegetation is buried by snow. In this study we employ a TOPMODEL-based 20 parameterisation of surface runoff (Beven and Kirkby, 1979). River routing is carried out as a post-processing step, using Total Runoff Integrating Pathways on a 1 • river routing grid (TRIP; Oki and Sud, 1998).  2015) and Ménard et al. (2015). In particular in SnowMIP2 JULES was found to be one of the best models for simulations of open sites, albeit it was thus not as 25 skilful at forest sites (Rutter et al., 2009).
For this study, JULES is run at the native resolution of each driving data set (0.5 • , except when using the PGF data which is 1.0 • resolution). Meteorological data are provided every 6 hours (CRUNCEP) or 3 hours (all other data) and JULES interpolates these onto a 30 minute time step. The precipitation rate is kept constant over the data time step (3 or 6 hours). The model output is regridded to 1.0 • resolution for analysis.

30
The WFDEI-CRU and WFDEI-GPCC data sets contain separate rainfall and snowfall fields that can be used directly by JULES, but the CRUNCEP and PGF data sets only provide total precipitation. For these, JULES assumes that the precipitation is snowfall when the near-surface air temperature is less than or equal to 274 K, while at higher temperatures the precipitation is assumed to be entirely rainfall. Hancock et al. (2014) suggest that, as long as the start and end dates of accumulation are correct, the modelled SWE is relatively insensitive to the choice of threshold.
The distribution of soil properties and land types was based on data from the IGBP Global Soil Task Force (Global Soil Data Task, 2000), gridded to the resolution of each driving data set. Vertical fluxes of soil water follow Darcy's law, using hydraulic characteristics calculated after Brooks and Corey (1964). Runoff was generated using the TOPMODEL formulation, 5 with topographic index regridded from the high-resolution HydroSHEDS data (Marthews et al., 2015). The model was spun up using the first five years of each data set four times, for a total of 20 years, then run for several decades starting in 1970 (CRUNCEP and PGF) or 1979 (both WFDEI variants) and running to the end of the available data.

Experiments
Three experiments using JULES were performed: 10 CTL Control runs, driven with the original meteorological data sets.
GRC Runs in which the total cold season precipitation is scaled to match that derived from GRACE, as described in Sect. 4.1 below. The scale factors, which vary by basin and by run, can be seen in Table 2. These factors are applied constantly to both rain and snow for the months October to February inclusive. In the rest of the year the precipitation is unchanged.
UCC Runs for which wind-based undercatch correction (Adam and Lettenmaier, 2003) is applied to the precipitation (CRUN- 15 CEP and PGF only). This uses the same catch ratios as were used in the creation of WFDEI (Weedon et al., 2011), one for each month for rainfall and snowfall separately. The correction is applied throughout the whole year. The ratios are available at 0.5 • resolution, so were re-gridded to 1 • resolution for PGF.
The full set up for each run is available as a version-controlled Rose suite, through the Met Office repository: https://code. metoffice.gov.uk/trac/roses-u. Full details of the suite number and revision for each run is given in Table S2. 20

Mean bias error
To evaluate the runs, we calculate the mean bias error (MBE) of the annual maximum SWE in each basin (M b ) to be where S b,i is the modelled SWE in basin b at timestep i, s b,i is the corresponding observed SWE and n y is the number of years of overlap between model and observations. To average over basins, it is weighted by the basin area, The variance of the bias error in each basin is 4 Precipitation corrections

GRACE-derived precipitation estimates
A water balance approach (Swenson, 2010) was used to estimate the precipitation (P ) at a basin scale: where dS tot /dt is the change in TWS (S tot ), E is the total evaporation flux, including sublimation, and Q net is net runoff. At the annual or monthly scale, this can be calculated to be t2 t1 where t 1 and t 2 are the start and end of each accumulation period respectively. TWS encompasses all water storage, including 10 soil moisture, ground water, water in wetlands, lakes and rivers, and water stored as snow in the snow pack. Evaporation includes transpiration, evaporation from soil surfaces, evaporation from intercepted water and other open water (rivers and lakes), as well as sublimation from frozen surfaces. As the water balance is calculated over whole basins there is no incoming runoff and the net runoff is equal to the basin discharge.
The change in TWS is calculated by differencing the GRACE anomalies between months, and averaging over the basin. 15 Monthly evaporation is provided by GLEAM and averaged over the basin, while the basin discharge is obtained from the GRDC measurement station closest to the basin outflow.
The water balance is calculated for each month in the cold season (defined as October to February inclusive for all basins).
During this season snowfall is the dominant precipitation type, the accumulating snow-pack is the dominant change to TWS, and the runoff and evaporation fluxes are relatively small (Fig. S4). This minimises the effect of uncertainty in evaporation and 20 basin discharge products, which could be more significant outside of the cold season (Swenson, 2010;Seo et al., 2010). Figure 2 shows the long-term mean monthly accumulated total precipitation during the cold season (October-February) for each of the four driving data sets and for the GRACE-derived precipitation estimates (all averaged over the common overlap period). The differences between data sets vary between the catchments, but in general the driving data sets accumulate less 25 precipitation than the GRACE estimate. The CRUNCEP and PGF data sets have lower precipitation than the WFDEI data through the whole cold season in all basins, but are very similar to each other as they were both bias corrected to CRU data without undercatch correction. The WFDEI data sets generally have higher precipitation, and are much closer to the GRACEderived estimate, even indicating a larger accumulation by December in the Lena and the Ob. The final accumulation is similar to the GRACE-derived estimate in three basins for WFDEI-GPCC and two basins for WFDEI-CRU . Overall the implication is that the driving data sets often do not contain enough precipitation to close the water budget in these basins in the cold season, particularly for the CRUNCEP and PGF data.

Calculation of precipitation scale factors
In order to appropriately increase the amount of precipitation in the driving data sets, a scale factor was calculated for each 5 data set in each study basin. The scale factor between GRACE and each driving data set was found by calculating the ratio of cold season accumulated precipitation derived from GRACE to that in the driving data. As the driving data cover a longer time period than GRACE, to ensure that the difference is not due to inter-annual variability or long-term climate trends each ratio was calculated using only data from the overlapping period of the relevant data set with GRACE. This period starts in 2002 for all of the data sets, and runs to the end of each data set. The scale factors for each basin can be seen in Table 2. 10 The relative scaling between basins is consistent for each data set, with the Yenisei requiring the largest correction (between 24% and 55% increase), and the Ob requiring the least (between a 2% decrease and 37% increase). There is a striking difference between the WFDEI data sets, which require increases up to 27% (and one decrease), and the PGF and CRUNCEP data sets which require much larger increases of between 35% and 55%.
The GRC runs of JULES were carried out by scaling both snowfall and rainfall by these factors during the cold season only 15 for all years of the run (not just the GRACE period). Snow and rain that falls outside of this season were unchanged. This design ensured that each run received the same amount of cold season precipitation as indicated by GRACE, when averaged over the basin and over the period of overlap between the driving data set and GRACE; the temporal and spatial variability of precipitation still varied between runs.

Control runs (CTL)
The simulated monthly maximum SWE for each basin can be seen in Fig. 3. In many cases JULES accumulates snow more slowly than the observational estimates, particularly later in the accumulation season, and the SWE is less sharply peaked than the observations. 5 Figure 4 shows the MBE for seasonal maximum SWE averaged over all four basins with respect to both GlobSnow and CMC. In general the magnitude of the MBE is largest for PGF, and smallest for WFDEI-GPCC (Fig. S5 shows the MBE for each basin). The largest deficit is for the PGF data in the Lena basin, for which it is -57% for GlobSnow and -56% for CMC. The smallest deficit is found in the Mackenzie for WFDEI-GPCC driving data, which has an MBE of -17% against both observations. Maps of the modelled SWE show broadly similar distributions of snow to the observations (Fig. S6). The 10 difference between the modelled and observed SWE (eg. GlobSnow in Fig. S7) show that, although there are a few regions of higher SWE in JULES outside of the studied basins, there is a deficit of modelled lying snow across most of the Northern Hemisphere. The general tendency for JULES to underestimate SWE is consistent with the indication of insufficient cold season precipitation (Fig. 2), and both analyses also agree that the WFDEI runs are generally closer to the observations. However it is clear that, even in the cases where the precipitation is close to the GRACE-derived estimate, the modelled maximum SWE is 15 still lower than observed.
The annual discharge in JULES is also severely underestimated compared with GRDC observations, as seen in Fig. 5 which shows the basin discharge averaged over all four basins. All of the runs demonstrate a substantial dry bias in comparison with GRDC values. CRUNCEP is particularly poor, with an 80-90% deficit in annual mean basin discharge, while the other runs underestimate basin discharge by between 52-57% on average. This difference between CRUNCEP and the other data sets   is consistent across all the basins. The deficit is particularly pronounced during the spring peak (Fig. S9) which in nature is strongly driven by snow melt. In JULES, the peak in the modelled river flow is broader and occurs later than the observations, which may indicate deficiencies in the modelled hydrology as well as a lack of melting snow. However, at the annual scale the bias may also be affected by warm-season rainfall.

5
Using the GRACE-scaled precipitation to drive JULES gives improved SWE representation overall, with significant increases in monthly and seasonal maximum SWE (Fig. 6). The modelled seasonal cycle is much more similar to observations, particularly in the Yenisei and Mackenzie, and early in the season for the Lena and Ob. In the Ob, the CRUNCEP GRC run is much more similar to the observations than any of the other models, while in the Mackenzie it now overestimates SWE. In the Lena, the seasonal peak is still not attained with any data set. Overall the largest increases in SWE are found for the CRUNCEP and 10 PGF runs, consistent with the larger scale factors used (

Discussion
This study has shown that gauge-or reanalysis-based estimates of cold season precipitation in boreal basins can be significantly lower than is suggested by a water-balance approach using GRACE TWS data. By scaling the driving meteorological data sets, precipitation data sets in endorrheic cold mountainous basins, and found that some precipitation data sets only captured between 10 and 60 % of the GRACE-based precipitation estimates. In this study we used time-invariant corrections, but this could be extended to allow time-varying corrections, say on an annual basis.
The GRACE-based estimates provide a means to account for measurement errors in the gauge data that are used to bias-10 correct reanalyses. This study suggests that undercatch is an important source of error, but in many cases the application of an undercatch correction does not remove the bias entirely. Further sources of error in the gauge-based estimates include spatial variability that is missed by the gauge network. While the GRACE TWS method used here circumvents many of these limitations it introduces uncertainties from other terms in the water budget (e.g. evaporation) and is inherently large scale -it may not result in improved SWE when looking at the local scale. 15 Undercatch correction, on the other hand, can be calculated at a local or grid box scale and can, in theory at least, more easily take account of changing meteorological conditions down to the time scale of individual storms. Furthermore it is easy to implement and is attractive as it addresses known deficiencies in the observations. It can also be used to correct historical data sets that pre-date GRACE. However the correction it is uncertain, including dependency on the type of gauge used in each area, and again requires further inputs, each with uncertainties. A further approach, not studied here, is to use estimates of 20 SWE, either from ground measurements or remote sensing, to estimate snowfall.
This study of cold season processes shows that the undercatch correction is equivalent to a substantial fraction of the GRACE-derived correction, suggesting that for gauge-based data sets the undercatch correction can be considered a minimum requirement that should be applied whenever possible. However, this varies by region, implying that there are different reasons for the deficit in precipitation in different regions. The CRUNCEP and PGF data sets do not include an undercatch 25 correction but have been widely used with land surface models; our results suggest that any aspects of those studies that are potentially sensitive to snowfall, such as high-latitude hydrological analyses, should be regarded as particularly uncertain.
Undercatch errors potentially also affect summertime precipitation, but have not been studied here. Figure 8 shows that the increase in seasonal maximum SWE is proportional to the increase in seasonal snowfall between experiments, with a very strong correlation (r 2 = 0.98) between the two. The gradient of 0.76 implies that not all of the increase 30 in snowfall manifests as an increase in SWE, which is mainly due to increased sublimation and a small increase in snow melt, but it does confirm that correcting snowfall is a direct and reliable approach with which to target errors in simulated SWE. This correlation also implies that the approach of Finney et al. (2012) to increase driving snowfall based on the required increase in SWE is reasonable. Figure 8. Increase in annual maximum SWE (mm) from CTL to GRC and UCC runs compared to increase in annual snowfall (mm y −1 ), averaged over all basins (using the GlobSnow mountain mask for snowfall as well as SWE) for each driving data set. This is averaged over the whole of each JULES run (see Table 1).
The use of corrected precipitation improved results in most catchments and with most driving data sets, often substantially.
In contrast the larger dry bias in modelled basin discharge was found in all simulations, with only modest improvements from corrected precipitation. This difference is a result of the much closer links between point snowfall and SWE, whereas modelled discharge can be viewed as the net effect of many hydrological processes acting across the catchment. A striking illustration of this is the large difference between the CRUNCEP and PGF river discharge, despite very similar precipitation inputs (both 5 having been bias-corrected to CRU precipitation). The larger river flow in PGF is balanced by a smaller evapotranspiration flux ( Fig. S10), while evapotranspiration in CRUNCEP tends to be closer to the WFDEI-based estimates. In turn this can be related to differences in other aspects of the meteorological forcing (Figs. S1 and S2); in particular, PGF has much higher specific humidity, particularly in the summer, than the other data sets. In spring and summer PGF humidity is 18% -87% higher than in CRUNCEP, which suppresses evapotranspiration, which is 10% -17% lower in PGF for the CTL runs. (This can be 10 contrasted with the air temperature, which is nearly identical in PGF and CRUNCEP, implying that the data sets employed different methods to reconstruct humidity.) This is consistent with the fact that evapotranspiration in the PGF runs is generally less than estimated by GLEAM, while the other runs tend to be closer to GLEAM, although there is considerable uncertainty in the GLEAM product. Although the focus of this study is on correction of precipitation products, which is shown to improve modelled SWE, it is clear that other meteorological variables are also important, particularly for other aspects of modelled 15 hydrology.
The modelled runoff ratio (Q net /P ) is much lower on average for CRUNCEP than for runs with the other data sets. When combined with the already relatively low precipitation in the CTL run, this means the CRUNCEP data set generates particularly small values of discharge in all basins -overall the CRUNCEP basin discharge is 16% of observed in the CTL run and 25% of observed in the GRC and UCC runs. The extra precipitation input in the GRC runs largely appears as increased evaporation in 20 Hydrol is possibly related to the longer time step of the data set (6 hours rather than 3 hours), within which the precipitation rate is constant in the JULES runs; the lower intensity of precipitation will tend to result in larger interception loss.
The seasonal maximum SWE tends to be biased low even when precipitation inputs are corrected. Assuming that both the precipitation and SWE observations are correct, a possible implication is that the modelled sublimation rate is too high. Sublimation in the CTL runs, averaged over all basins, ranged from 37% of snowfall (PGF) to 44% (CRUNCEP). It is very difficult 5 to measure sublimation, particularly over large areas, and most estimates tend to be based on water balance methods or models (Liston and Sturm, 2004). Estimates at a range of scales indicate that 10 -50% of snowfall can be lost through sublimation, with substantial variation depending on land cover and meteorological conditions (e.g. Liston and Sturm, 2004;Brun et al., 2013;Casson et al., 2018). For the Mackenzie basin sublimation was estimated as 29 mm (7% of annual precipitation; Déry and Yau, 2002), while numerical modelling by Yang et al. (2010) suggested that approximately 24% of annual snowfall in 10 the area 50 -70 • N was lost as sublimation. Thus the evidence suggests that sublimation might be rather high in JULES, albeit the evidence base is itself rather uncertain. Further investigation, including point-scale runs in data-rich environments to examine the simulated water budget across a range of land covers such as in SnowMIP2 (Rutter et al., 2009), will increase our confidence in the ability of JULES and other models to correctly simulate sublimation and other key processes.
However, errors in modelled sublimation and indeed all cold season processes are insufficient to fully account for the dry 15 bias in annual discharge. It is also possible that rainfall outside of the cold season is underestimated in the driving data. In all of these basins there is more precipitation in summer (when it is more likely to be rain) than in winter, suggesting that correction of warm season rainfall could potentially have a larger impact on the annual water balance. However, there are clear signs that meteorological inputs are not the only source of error, and that there are fundamental deficiencies in the model's representation of runoff generation processes: even a good estimate of peak SWE does not result in a good representation of 20 the spring discharge peak (Fig. S9). It is likely that the parameterisation of infiltration into partly-frozen ground and related runoff generation processes are not well represented in JULES. Previous work has shown that alternative descriptions of frozen soil can improve the modelled runoff peak (Finney et al., 2012).

Conclusions
There is a substantial body of literature on the intercomparison of global precipitation data sets, with a lesser focus on the par-25 ticular issues found at high latitudes where much of the precipitation falls as snow. There is an ongoing need to compare these precipitation products and to ensure that the best meteorological data are made available as inputs to land surface modelling.
This study has focussed on precipitation but the model results clearly indicate that other variables, such as humidity, are also important.