Hybrid climate datasets from a climate data evaluation system and their impacts on hydrologic simulations for the Athabasca River basin in Canada

A reliable climate dataset is a backbone for modeling the essential processes of the water cycle and predicting future conditions. Although a number of gridded climate datasets are available for the North American content, which provide reasonable estimates of climatic conditions in the region, there are 15 inherent inconsistencies in these available climate datasets (e.g., spatial and temporal varying data accuracies, meteorological parameters, length of records, spatial coverage, temporal resolution, etc). These inconsistencies raise a valid question as to which datasets are the most suitable for the study area and how to systematically combine these datasets to produce a reliable climate dataset for climate studies and hydrological modeling. This study suggested a framework, called reference reliability evaluation system 20 (REFRES), that systematically determines a ranking of multiple climate datasets to generate a hybrid climate dataset for a region. To demonstrate the usefulness of the proposed framework, REFRES was applied to produce a historical hybrid climate dataset for the Athabasca River basin in Alberta, Canada. A proxy validation was also conducted to prove the applicability of the generated hybrid climate datasets to hydrologic simulations. This study evaluated five climate datasets, including station-based gridded climate 25 datasets (ANUSPLIN, Alberta Township, and PNWNAmet), a multi-source gridded dataset (Canadian Precipiation Analysis CaPA), and a reanalysis-based dataset (NARR). The results showed that the gridded climate interpolated from station data performed better than multi-source and reanalysis based climate datasets. For the Athabasca River basin, Township and ANUSPLIN were mostly ranked first for precipitation and temperature, respectively. The proxy validation also confirmed the superior performance 30 of hybrid climate datasets compared with the other five individual climate datasets investigated in this study. These results indicate that the hybrid climate dataset provides a better representation of historical climatic conditions and thus, enhancing the reliability of hydrologic simulations. Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2019-189 Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 23 May 2019 c © Author(s) 2019. CC BY 4.0 License.


Introduction
1 A reliable historical climate dataset is essential in understanding the climatic and hydrological 2 characteristics of a watershed, as it is a crucial forcing input data for simulating key processes of the water 3 and energy cycles in impact models (Deacu et al., 2012;Essou et al., 2016;Wong et al., 2017). Although 4 climate monitoring networks have advanced over the last decades, poor network density still exists, 5 especially in western mountainous and northern parts of Canada. Moreover, climate observations are often 6 spatially interpolated to cover ungauged regions, which may cause unexpected erroneous model predictions 7 as a consequence of the sparse measurements network, especially for mountainous areas affected by 8 orographic effects (Rinke et al., 2004;Wang and Lin, 2015). 9 As advances in numerical hydrologic and hydrodynamic modeling have increased the capability and 10 reliability in simulating complex natural processes to detect anthropogenic and natural climate changes, a 11 need for temporally-and spatially-reliable climate data has also been grown to accommodate the 12 requirements of input data for numerical models (Shen et al., 2010;Shrestha et al., 2012;Islam and Dery, 13 2017). For instance, process-based distributed hydrologic models have a grid-based structure that requires 14 input data for each grid cell. However, a simple spatial interpolation of observational station data to all 15 model grid cells may not produce a reliable input forcing dataset for hydrologic models, particularly in a 16 region with a sparse gauging network. A reliable historical climate dataset is also crucial in climate change 17 studies when used for statistical downscaling techniques that employ the relationships between observations 18 and outputs of global (or regional) climate models to produce climate forcing at regional or local scales. 19 Since the resolution of products from a statistical downscaling technique usually corresponds to that of the 20 historical climate dataset (Werner and Cannon, 2016;Eum and Cannon, 2017), the availability of 21 temporally-and spatially-reliable historical climate data is essential for climate-related impact studies 22 (Christensen and Lettenmaier, 2007;Kay et al., 2009;Gutmann et al., 2014;Eum et al., 2016). 23 A number of high-resolution gridded climate datasets have been developed for various applications 24 such as inter-comparison studies (Eum et al., 2014a;Wong et al., 2017) and hydrologic modeling (Choi et 25 al., 2009;Eum et al., 2016). There are various types of gridded climate datasets available for the North 1 American region; 1) station-based interpolated, 2) station-based multiple-source, and 3) reanalysis-based 2 multiple-source (Wong et al., 2017). By interpolation of observational station data, long-term gridded 3 climate datasets have been produced over various domains defined by stations incorporated such as  wide Australia National University's spline (ANUSPLIN, Hutchison et al., 2009), the Alberta Township 5 data (Shen et al., 2001), and the PCIC NorthWest North America meteorological (PNWNAmet) dataset 6 (Werner et al., 2019). The Canadian Precipitation Analysis (CaPA) system, a multiple source-based climate 7 dataset, has been developed to produce near real-time precipitation analyses (6-hr accumulated precipitation) 8 over North America at 15 km resolution which has been further improved to 10km resolution (Lespinas et 9 al., 2015). North American Regional Reanalysis (NARR), one of the reanalysis-based datasets derived from 10 a regional climate model (~32km), has been tested as an alternative climate dataset (Choi et   In general, the available historical gridded climate dataset can be divided into three categories; 1) 5 station-based, 2) multiple source-based, and 3) reanalysis-based. In this study, five high-resolution gridded 6 climate datasets available for Alberta were selected (Table 1)  Analysis (CaPA) in 2003 to produce superior gridded precipitation data over North America at 10 km 4 resolution (Lespinas et al., 2015), especially for regions with poor observational networks (Mahfouf et al., 5 2007). CaPA employs an optimum interpolation technique that requires properties of error statistics among 6 observations and a first guess, i.e., background field (Garand and Grassotti, 1995). A short-term forecast of 7 6-hr accumulated precipitation from the Canadian Meteorological Centre (CMC) regional Global 8 Environmental Multiscale (GEM) model (Côté et al., 1998a;1998b) is used in CaPA as the background 9 field. The assimilated precipitation from the Canadian weather radar network and 33 US radars near the

Reanalysis-based dataset 21
Reanalysis products are another common type of gridded dataset used in climate and hydrologic 22 studies. The North American Regional Reanalysis (NARR) was developed to create a long-term set of 23 dynamically consistent 3-hourly climate data from 1979 to 2003 at a regional scale (0.3°= ~ 32km) for the 24 North America domain (Mesinger et al., 2006). By utilizing advanced land-surface modeling and data 25 assimilation through the Eta Data Assimilation System (EDAS), NARR improved the National Centers for 1 Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) global reanalysis data. Using the 3-hr NARR climate data, daily precipitation and minimum and maximum temperature were 7 calculated by adjusting the time zone to MT from the original NARR dataset (UTC zone). 8 9

Reference Reliability Evaluation System (REFRES) 11
This study suggests a REFference Reliability Evaluation System (REFRES) that consists of three 12 main modules (refer to Figure 2): 1) a performance measure module (PMM) to evaluate various 13 performance measures for each climate dataset, 2) a ranking module (RM) to identify the most reliable 14 climate data for a target grid cell using a multi-criteria decision-making technique based on the performance 15 measures provided by PMM, and 3) a data generation module (DGM) to produce a hybrid climate dataset 16 by selecting the most reliable climate dataset based on the ranking provided by the RM (ranking model). 17 These three modules are seamlessly integrated and exchange the required data and information to generate 18 a hybrid climate dataset. The next section provides further details on each module. 19 AHCCD is a point (station) dataset while the other climate datasets used in this study (refer to Table  24 1) are regularly spaced gridded datasets with varying time period, spatial resolution, and coverage (i.e., 25 domain). Therefore, the inverse distance squared weighting method was applied to obtain the values at the 1 AHCCD stations from all the gridded climate datasets. Then, performance measures were calculated by 2 comparing the interpolated values with the data collected at AHCCD stations. The choice of the 3 performance measures is vital in REFRES, as the ranking of climate datasets entirely depends on included 4 performance measures. In this study, performance measures were selected based on three criteria: 1) 5 distribution, 2) sequencing, and 3) spatial pattern. Distribution-related performance is assessed by the 6 Kolmogorov-Smirnov D statistic (DKS) and standard deviation ratio ( ratio ). Sequence-related performance 7 is assessed by the percentage of bias (Pbias), root mean square error (RMSE), and temporal correlation 8 coefficient (TCC). Spatial pattern-related performance is evaluated by the pattern correlation coefficient 9 (PCC) as shown in Eq. (1) to Eq. (5). The equations of TCC and PCC are identical but TCC is calculated 10 with the daily time series of climate variables and PCC is obtained by the mean annual precipitation and 11 temperature of the AHCCD stations over a target domain. Therefore, PCC varies with the user specified 12 target domain. 13 where and are the standard deviation of gridded and observed climate datasets, Gi and Oi represent 19 gridded and observed climate datasets at ith time step, respectively; F is the empirical distribution function 20 of a climate dataset; is standard deviation; ̅ and ̅ represent the mean of gridded and observed 21 climate datasets, respectively and N is a total number of data points. These six performance measures were 22 calculated for all the selected climate datasets and variables at each AHCCD station. Figure 2 (blue box in 1 PMM) shows an example of 6 PMs calculated for the precipitation variable using the ANUSPLIN gridded 2 data. Thus, 15 tables (5 climate datasets × 3 variables) were generated by PMM and transferred to the RM. 3 4

Ranking Module (RM) 5
The function of the ranking module is to select the appropriate AHCCD stations for a given target grid 6 cell and to rank all the gridded data sets based on the six performance measures calculated in the previous 7 module. For a given target cell, AHCCD stations are selected based on two criteria: distance and elevation. (climate dataset in this study) and j th criterion (i.e., a performance measure). A weighted normalized decision 23 matrix, (tij)m×n is given by 24 where, m and n are the total number of alternatives and criteria, respectively, nij is normalized matrix by Eq. 3 (7), and wj represents weighting on the j th criterion. Under the assumption that all performance measures 4 are important, this study used an equal weighting. Then, Euclidean distances (dib and diw) of climate datasets 5 from the best (Ab) and worst (Aw) conditions were calculated respectively by Eq. (8) to Eq. (11) 6 Where, tbj and twj are the best and worst decision matrices determined by Eq. (8) and (9), respectively, and 11 J+ and Jrepresent criteria that have a positive and a negative impact on performance. For example, TCC 12 and PCC are in J+ while DKS , ratio , Pbias, and RMSE are in J-. Using the Euclidean distances, the order of 13 preference for all climate datasets was determined by the similarity (Siw) to the worst condition in Eq. (15). 14 15 siw = 1 when the alternative is equal to the best condition (Ab) and siw = 0 if the alternative is equal to the 16 worst condition (Aw). In other words, a higher siw represents higher preference among alternatives. As we 17 evaluate the performance measures (criteria) for individual climate variables, TOPSIS can be applied to 18 decide the preference of climate datasets considering the performance measures for either individual or 19 multiple variables. In this study, TOPSIS provides two types of ranking information by using performance 20 measures from i) individual climate variable and ii) all climate variables. That is, one is the ranking for 21 precipitation and temperature separately (Rind) and the other is the ranking for multiple variables (Rmul). For 1 example, in this study, Rind was determined by a 5×6 decision matrix (5 climate datasets and 6 performance 2 measures) for precipitation and temperature individually, while Rmul was determined by a 4×18 decision 3 matrix (4 climate datasets excluding CaPA that provides only precipitation by 18 performance measures 4 from three variables). To alleviate the erroneous output that minimum temperature is higher than maximum 5 temperature on a certain day when producing the hybrid climate dataset by the ranking of temperature 6 values individually, the performance measures of both minimum and maximum temperature are employed 7 together to rank the climate datasets for temperature. 8 9

Data Generation Module (DGM) 10
DGM extracts the most reliable climate data for a user-specified target region based on the ranking 11 information obtained from the RM. The tool is flexible enough to provide output in various common 12 formats, i.e., NetCDF, ASCII (text) or in the specific format of a numerical model. As all of the historical 13 gridded climate datasets have been tested and employed in numerous climatic and hydrologic studies, an 14 assumption was made in generating the hybrid climate dataset that all of the climate datasets are equally 15 qualified for inclusion but the final selection can be determined by the proven superiority evaluated through 16 the performance measures. Under this assumption, the available datasets can be combined systematically 17 based on the rank (performance) of each dataset at target grid cells. As each climate dataset has different 18 data periods shown in Table 1, the first ranked dataset cannot fully cover a whole target period to be 19 extracted from a set of climate data candidates. DGM provides a systematic procedure to identify the most 20 reliable dataset for a target region and extracts the data from the inventory of climate datasets considering 21 the ranking and availability of each dataset for a desired period. For instance, if CaPA and ANUSPLIN grid cells (study domain), the hybrid climate dataset is produced in a user-defined fomat. This study 3 generated the hybrid climate datasets in the form of the VIC forcing input format to be directly employed 4 into the hydrologic model. 5 6

Proxy validation 7
Although the AHCCD dataset has been adjusted to provide better estimates of actual precipitation and 8 temperature, it contains statistical artifacts that include inevitable errors from sequential data processes that 9 can be propagated in the derived hybrid climate dataset. Given that the AHCCD stations, the reference 10 dataset for the performance measures, are not regularly distributed and have especially poor density in the 11 northern parts of the study area (refer to Figure 1 Table 2 were selected for the proxy 19 validation based on three criteria: i) hydrometric record length, ii) location defined by upper, middle and 20 lower reaches (Northern River Basin Study, 2002), and iii) the number of gridded climate datasets used to 21 generate a hybrid climate dataset for the catchment area of the selected hydrometric station. In other words, 22 a higher number of gridded climate datasets contributing to the hybrid climate dataset within a catchment 23 was selected to evaluate the utility of the hybrid climate data relative to the existing gridded climate datasets. 24 Hinton is located near the headwaters of ARB, which are characterized by mountainous topography and 25 snow-and glacier-ice melt dominated hydrologic regimes. Pembina is one of the major rivers in the middle 1 reach. The other three stations (Christina, Clearwater above Christina and Firebag) are located in the lower 2 reach, which is a water-limited (dry) region due to a higher amount of evapotranspiration (Eum et  simulations, all of the calibrated parameter sets can be considered as mostly plausible parameter sets for 12 the selected sub-basins. However, as mentioned above, intrinsic biases exist temporally and spatially in all 13 of the gridded climate datasets, e.g., discrepancies in the amount and spatial distribution of precipitation 14 between the gridded climate datasets and observations. Therefore, the similarity of the gridded climate 15 datasets in terms of magnitude, sequence, and spatial distribution of climate events relative to observations 16 is crucial to reproduce historically observed streamflows. In addition to climate forcings, streamflows are 17 mainly affected by geographic characteristics and physical land surface processes (e.g., infiltration and 18 evapotranspiration), which are represented by model parametrization related to infiltration and soil 19 properties (Demaria et al., 2007). In a hydrologic simulation, the biases in climate datasets can be 20 compromised by model parameters that adjust hydrologic processes to observations (Harpold et al., 2017;21 Kirchner, 2006). That is, a calibrated parameter set may imply biases in a climate dataset. Under the 22 assumption that the calibrated parameter sets are suitable for hydrologic simulations in each sub-basin, this 23 study applied a multiset-parameter hydrologic simulation approach that employs all parameter sets 24 calibrated by the seven climate datasets and the same climate dataset as a forcing input data to assess the 25 sensitivity of the climate dataset to all feasible parameter sets. From the multiset-parameter hydrologic 1 simulations, the bias in a climate dataset can be estimated indirectly by quantifying the variability in 2 hydrologic simulations derived from the feasible calibrated parameter sets under a climate forcing dataset. 3 In other words, lower variability in the hydrologic simulations indicates higher reliability in the climate 4 forcing dataset. The suitability of the hybrid climate dataset for improving historical hydrologic simulations 5 was also tested by directly comparing the performances of calibration and validation for each climate 6 dataset. Proxy validations were carried out by conducting 49 hydrologic simulations (7 climate forcing × 7 7 parameter sets) for the Pembina and Christina catchment areas, whereas only 36 simulation runs were 8 possible for Hinton, Firebag, and Clearwater sub-basins, as one of the gridded data sets (i.e., Township) did 9 not cover the entire catchment areas of these three hydrometric stations. 10 11 4. Results 12

Precipitation performance measures in Alberta 13
Although the performance measures were calculated for 190 AHCCD stations in western Canada, the 14 target area of this study is in Alberta, where only 45 stations are located. Therefore, the results for the 45 15 AHCCD stations are given in this study. Table 3  as they employed the raw station data instead of the adjusted precipitation data which is higher than the raw 23 station data by 5%-20%. In contrast, other climate datasets (especially multiple sources and reanalysis data) CaPA and NARR overestimate extreme precipitation events by overly reflecting the orographic effects on 1 precipitation in western Alberta. 2 Figure 4 shows the temporal correlation coefficient (TCC) data averaged over the AHCCD stations in 3 Alberta to investigate the similarity between historical precipitation datasets employed in this study. As 4 expected, station-based climate datasets (i.e., ANUSPLIN, PNWNAmet, and Township) showed better 5 TCCs than CaPA and NARR. The TCC between ANUSPLIN and Township was the highest among climate 6 datasets except for the observations (i.e., OBS), even though they incorporated different interpolation 7 techniques. PNWNAmet showed the highest TCC with ANUSPLIN because they both are based on thin 8 plate spline interpolation. TCCs between CaPA and other climate datasets are similar, as CaPA is produced 9 from multiple sources such as GEM's outputs and weather radar networks of Canada and US. NARR, the 10 reanalysis-based climate dataset, showed higher TCC with CaPA than with other datasets, as it is assimilated 11 with multiple sources of observations. 12 Maps of each performance measure are shown in Figure 5. It is evident from the spatial variability that 13 the ANUSPLIN and Township datasets outperformed the other datasets in DKS throughout Alberta. In the 14 mountainous region of southwest Alberta, most of the climate datasets performed poorly in Pbias, σratio, 15 RMSE, and PCC, resulting mainly from the sparse observation network and inconsistent observations near 16 the Canada-US border. PNWNAmet highly overestimates the mean annual precipitation in the mountainous 17 area (e.g., 300 mm/year higher than that observed at station ID 3050519), which may considerably affect 18 simulated streamflows originating in mountainous headwaters and further downstream. 19 20

Air temperature performance measures in Alberta 21
The performance measures for air temperature averaged over 37 AHCCD stations in Alberta are 22 presented in Table 4. As CaPA provides only precipitation, it was excluded in the assessment for temperature. 23 All of performance measures for temperature are better than those for precipitation except Pbias. NARR is 24 highly biased as it underestimates minimum and maximum temperatures, which might be an attribute of 25 discontinuation of observation assimilation since 2003 (Eum et al., 2014a). ANUSPLIN and Township 1 showed an almost perfect linear relationship (TCC) with the observations (i.e., > 0.97 for all of the climate 2 datasets). The performance measures for maximum temperature are better than those for minimum 3 temperature as maximum temperature is dominated by mainly large-scale heat waves while minimum 4 temperature is affected by local physical processes, e.g., topography and surface conditions (Eum et al., 5 2012). NARR showed less skill in capturing these local effects due to the coarse spatial resolution (~32km) 6 compared to other station-based climate datasets. As with precipitation, the maps of performance measures 7 for minimum and maximum temperature presented in Figure 6 and Figure 7 showed that data from the 8 mountainous areas performed poorly in most of the performance measures. NARR showed positive and 9 negative Pbias for minimum and maximum temperature, respectively, in the mountainous region, indicating 10 that NARR has a warm bias in extreme cold temperatures and a cold bias in extreme warm temperatures. 11 12

Ranking of climate datasets in the ARB 13
The geospatial information (i.e., latitude, longitude, and elevation) of 22,372 grid cells within the ARB 14 was extracted from the Canadian digital elevation data provided by Natural Resources Canada (refer to 15 https://open.canada.ca/data/dataset/7f245e4d-76c2-4caa-951a-45d1d2051333). Using this information, the 16 RM in REFRES ranked the five climate datasets by TOPSIS for each grid cell. Table 5  For precipitation, the Alberta township dataset was ranked first in most of the grid cells within the 21 basin (78%) for the whole ARB, followed by ANUSPLIN (13%), PNWNAmet (3%), CaPA (3%), and 22 NARR (2%). However, the Township data domain covers only 83% of the ARB within Alberta; the 23 remaining 17% of the watershed area that lies on the outside the province is not covered (Figure 8). The 24 Township dataset was ranked first for almost 95% of grid cells within its domain, indicating that the 25 Township dataset overwhelmingly outperformed other climate datasets for precipitation. Township was 1 dominantly ranked first for the subbasins (Pembina and Christina) within the Township domain. 2 For temperature, ANUSPLIN was ranked first (in 62% grid cells) for the whole ARB, followed by 3 Township (31%) and PNWNAmet (7%). In the upper and middle reaches, i.e., Hinton and Pembina, 4 PNWNAmet and Township were mostly ranked first, respectively, while ANUPLIN outperformed other 5 climate datasets for the subbains in the lower reach. When considering the performance measures for 6 multiple variables simultaneously, the Township dataset was ranked first, followed by ANUSPLIN for 64% 7 and 36% of the grid cells for the whole ARB. Figure 9 shows maps of the first-ranked climate datasets for 8 each case in Table 5, i.e., individual variable (Case A and B) and multi-variables (Case C). Due to the 9 limited spatial coverage of the Township dataset, other climate datasets were ranked first in the headwaters 10 of the ARB and the area of the river basin in Saskatchewan. For instance, ANUSPLIN and PNWNAmet 11 were ranked first in the headwaters, while no specific climate dataset dominated in Saskatchewan for 12 precipitation (refer to Figure 9A). For temperature, ANUSPLIN outperformed in the northern part (middle 13 and lower reaches of the ARB) due to outstanding performance of the Pbias performance measure for 14 minimum temperature as shown in Table 4 and Figure 6(b). For multi-variables, Township was mostly 15 ranked first within its domain and ANUSPLIN was ranked first outside the Township dataset domain and 16 also for a small part of lower reach area in the ARB. 17 Figure 10 shows the percentage of each climate dataset at each rank for the three cases (e.g. A, B, and 18 C in Table 5 As two different hybrid climate datasets were generated using the ranking information from single-1 and multi-variable approaches, i.e., Hybrid (Rind) and Hybrid (Rmul), further investigation is required to 2 identify which hybrid climate dataset may provide better performance and consequently will be 3 recommended for future climate-related studies. A proxy validation approach was applied using both 4 generated hybrid climate datasets to validate the utility of one dataset over the other. 5 6

Proxy validation of generated hybrid climate datasets 7
In addition to the five gridded climate datasets, the two hybrid climate datasets were implemented for 8 proxy validation using the VIC model. In contrast to the station-based climate datasets, both CaPA and 9 NARR were produced from climate models and multiple sources of observations, consequently showing 10 higher correlation with each other as shown in Figure 4. Since CaPA also provides only precipitation, this 11 study combined precipitation of CaPA with the NARR temperature to prepare the CaPA climate forcing 12 dataset for the proxy validation. Firebag, located in the middle and lower reaches. Figure 11 presents  CaPA produced the highest NSE (accuracy) among the climate datasets used in this study. Therefore, the 8 results of CaPA need to be considered carefully otherwise they might be misleading. In this context, the 9 CaPA dataset was excluded from further assessment of the precision and accuracy even though all of the 10 results of CaPA were included in Figure 11 for reference only. Hybrid(Rmul) and ANUSPLIN showed the 11 highest accuracy as forcing data, followed by Hybrid(Rind), PNWNAmet, and NARR. In the Pembina and 12 Christina catchments, the Hybrid(Rind), Hybrid(Rmul), and Township datasets had the highest precision and 13 accuracy. NARR produced negative NSEs at Pembina, indicating it is not reliable or suitable as a forcing 14 dataset. For Clearwater, Hybrid(Rind) is the top performer, followed by Hybrid(Rmul), ANUSPLIN, 15 PNWNAmet, and NARR. Clearwater had the highest number of climate datasets combined in the hybrid 16 climate dataset within the basin for precipitation as shown in Figure 9. Interestingly, the precision of NARR 17 is similar to that of CaPA because they shared the temperature data from NARR. For Firebag, Hybrid(Rind) 18 also showed top performance in both precision and accuracy, followed by Hybrid(Rmul), ANUSPLIN, 19 PNWNAmet, and NARR. Overall, Hybrid(Rind) showed the best accuracy and precision at all hydrometric 20 stations, indicating that it has the potential not only to improve historical hydrologic simulations but also 21 to be used as reference data for statistical downscaling of climate change projections in the province. included in the interpolation, many stations might be left out in the data generation processes. While 4 ANUSPLIN used the Canada-wide archive (raw) station data collected by only ECCC, the Alberta 5 Township data has been produced on the basis of the archive (raw) station data collected by ECCC, AEP, 6 and AF over Alberta. Therefore, one of the possible reason for outperformance of Township dataset might 7 be the difference in the numbers of stations (i.e. station density) employed to produce the gridded climate 8 datasets. In addition, PNWNAmet showed a positive Pbias for precipitation, especially in the mountainous 9 areas, while ANUSPLIN, which employs similar thin plate spline interpolation, generated negative Pbias. 10 PNWNAmet overestimated precipitation over the mountainous area, which considerably affects simulated 11 low flows at Hinton in the ARB. Figure 12 shows the observed and simulated hydrographs from gridded 12 climate datasets at (a) Hinton and (b) Pembina. It clearly shows that PNWNAmet highly overestimated the 13 low and high, which is caused by overestimated precipitation in the drainage area of the sub-basins. As with 14 PNWNAmet, NARR also overestimated the low and high flows, which is induced by the combined effects 15 of overestimating precipitation and warm biases in cold temperature. The temperature bias of NARR is thus 16 further confirmed and is consistent with the earlier finding of Eum et al., (2014) and Islam and Dery (2016). 17 In Figure 12, the hybrid climate datasets underestimated the peak flows (in 2009, 2010, 2014, and 18 2015) at Hinton, and hydrograph is similar to the hydrograph produced by ANUSPLIN data set that 19 dominantly ranked first in this watershed. On contrary, the hydrograph of the hybrid climate datasets at 20 Pembina is similar to that of Township that is dominantly ranked first in Pembina (refer to Table 5). These 21 results indicate that the hybrid climate dataset has the intrinsic limitation that the performance of the hybrid 22 dataset for a basin may closely resemble that of the climate dataset that is dominantly ranked first for the 23 basin. However, the utility of the hybrid climate dataset can be clearly found at a whole-basin scale for a 24 large watershed, as the added values of the hybrid climate dataset in sub-basins can be cumulated to the 1 main stem at the downstream in the watershed. 2 Among the station-based gridded climate datasets, ANUSPLIN and Township employed a different 3 number of stations depending on their periods of record. Therefore, there is an inconsistency in these climate 4 datasets over time. For example, the Township dataset employed only 300~400 stations in the 1960s, but 5 has increased to 400~500 since 1970. A change-point analysis of these datasets may provide some useful 6 information to end-users with respect to when and where changes occurred, which will help in establishing 7 spatial and temporal accuracies of these datasets (Eum et al. 2014a). Further, PNWNAmet employed the 8 same number of stations over time to avoid the above mentioned inconsistency, but this study found that it 9 induced overestimation of precipitation in data-poor regions such as mountainous regions in Alberta. As 10 the hybrid climate datasets are generated from the multiple historical gridded datasets, they may also have 11 the same inconsistencies identified in other datasets. The proxy validation, however, demonstrated that the 12 generated hybrid climate datasets can improve the performance of hydrologic simulations. 13 This study identified the preference order of all gridded climate datasets based on the performance 14 measures evaluated at the AHCCD stations, therefore the ranking somewhat relies on the spatial distribution 15 of the AHCCD stations. As shown in Figure 1, the density of AHCCD stations varies across western Canada, 16 and is low in the cold climates of mountainous and northern areas. Therefore, the ranking could further be 17 improved with a more uniform density of AHCCD stations over western Canada. 18 Literature has demonstrated that NARR, a reanalysis-based climate dataset, can be an alternative as a 19 climate forcing dataset for hydrologic simulations in data sparse regions (Choi et al., 2009;Praskievicz and 20 Bartlein, 2014;Islam and Dery, 2016). In this study, the NARR dataset performed quite well in high-21 elevation regions (Hinton in this study) while it did not perform so well in the middle and lower reaches, 22 i.e., lower-elevation watersheds. NARR performed especially poorly in the Pembina sub-basin, a region 23 where hydrologic simulations are highly sensitive to model parameters (Eum et al., 2014b). In Figure 11  24 (b), however, the NARR parameter set produced fair NSE values in hydrologic simulations forced by the 25 other climate datasets except for CaPA and PNWNAmet. Such result indicates that 1) all of parameter sets 1 used in this study were calibrated reasonably and 2) climate forcing input data plays a more crucial role in 2 hydrolog simulations as any parameter sets did not produce a fair NSE value from NARR in Pembina. 3 CaPA was more suitable than NARR for the selected sub-basins in this study, which indicates that CaPA 4 might be a better alternative in low station-density regions such as the ARB. However, since the validation 5 period in this study is only 7 years from 2010 to 2016, a longer data period is necessary to validate the 6 suitability of CaPA as indicated in Eum et al. (2014a) and Wong et al. (2017). 7 In the proxy validation, Hybrid(Rind) performed well in the Clearwater sub-basin where the highest 8 number of climate datasets were combined in the generated hybrid climate datasets. The Township dataset, 9 which mostly ranked first within its spatial domain, partially covers the drainage area of Clearwater, so that 10 the generated hybrid climate dataset, Hybrid(Rind), is composed of many climate datasets in this sub-basin. 11 In a traditional approach to hydrological modelling for Clearwater, either the Township dataset might be 12 completely excluded (as it does not cover the entire Clearwater watershed), or potentially combined with 13 other gridded climate datasets to cover the entire watershed. However, combining different climate datasets 14 to construct the climate forcing for a larger region requires an evaluation of the datasets to identify the order 15 of preference for such aggregation when multiple choices are available. Therefore, this study suggested the 16 the REFRES methodology to systematically compare all-available climate datasets for a region to produce 17 a hybrid climate dataset that covers a desired period of record and spatial domain by considering the order 18 of preference for combining various climate datasets at each grid cell. The proxy validation approach also 19 confirmed the utility of a generated hybrid climate dataset over other data sets, especially in hydrologic 20 simulations. 21 22

Summary and concluding remarks 23
This study suggested a framework called reference reliability evaluation system (REFRES) to 24 systematically generate a performance-based hybrid climate dataset from multiple climate datasets for a 25 region. The hybrid dataset was found to more reliable for hydrological modelling. The REFRES is 1 composed of three modules; 1) performance measures, 2) ranking, and 3) data generation. The suggested 2 framework was applied to the ARB as a test-bed and generated two hybrid climate datasets from single-3 (Rind) and multi-variable (Rmul) approaches by evaluating the performance of five available gridded climate 4 datasets: station-based gridded climate datasets (i.e. ANUSPLIN, Alberta Township, and PNWNAmet), a 5 multi-source dataset (CaPA), and a reanalysis-based dataset (NARR). A hydrologic modelling-based proxy 6 validation approach was applied to demonstrate the applicability of the hybrid climate dataset generated for 7 the five sub-basins in the ARB. The results showed that 8 -Among the five climate datasets, the station-based climate datasets performed better than multi-9 source-and reanalysis-based datasets. The Township dataset, in particular, outperformed other 10 climate datasets in the selected performance measures over northern Alberta. 11 -Most of the climate datasets performed poorly in the mountainous areas of southwest Alberta, due 12 to a sparse observation network, orographic effects, topographic complexity, and inconsistencies in 13 observation between Canada and the US. 14 -As a result of REFRES' application for the ARB, the Township and ANUSPLIN datasets are mostly 15 ranked the highest among the five climate datasets for precipitation and temperature, respectively. 16 -In the proxy validation, two hybrid climate datasets, Hybrid(Rind) and Hybrid(Rmul), performed 17 better in terms of precision and accuracy as forcing data for hydrologic simulations. This study provided the preference order of climate datasets available in Alberta, which may be useful 23 for modelers and decision-makers as to which climate dataset is the most suitable for their studies and 24 projects. Furthermore, this study demonstrated that the hybrid climate dataset produced by REFRES is more 25 representative of historical climatic conditions. Therefore, the hybrid climate dataset is recommended to be 1 used as a reference dataset for statistical downscaling and hydrologic model forcing, resulting in more 2 reliable high-resolution climatic and hydrologic projections.    Table 5. First ranked number of grid cells in the five sub-basins and the whole Athabasca River Basin