On the importance of observational data properties when assessing regional climate model performance of extreme precipitation

Abstract. In recent years, there has been an increase in the number of climate studies addressing changes in extreme precipitation. A common step in these studies involves the assessment of the climate model performance. This is often measured by comparing climate model output with observational data. In the majority of such studies the characteristics and uncertainties of the observational data are neglected.
This study addresses the influence of using different observational data sets to assess the climate model performance. Four different data sets covering Denmark using different gauge systems and comprising both networks of point measurements and gridded data sets are considered. Additionally, the influence of using different performance indices and metrics is addressed. A set of indices ranging from mean to extreme precipitation properties is calculated for all the data sets. For each of the observational data sets, the regional climate models (RCMs) are ranked according to their performance using two different metrics. These are based on the error in representing the indices and the spatial pattern.
In comparison to the mean, extreme precipitation indices are highly dependent on the spatial resolution of the observations. The spatial pattern also shows differences between the observational data sets. These differences have a clear impact on the ranking of the climate models, which is highly dependent on the observational data set, the index and the metric used. The results highlight the need to be aware of the properties of observational data chosen in order to avoid overconfident and misleading conclusions with respect to climate model performance.

Introduction
In recent years, a large number of studies have focused on estimating the changes in extreme precipitation under climate change conditions. However, information on changes in precipitation and, especially, in extreme precipitation is subject to large uncertainties. The main sources of uncertainty arise from the choice of emission scenario, climate model, and downscaling method. Several studies have concluded that the uncertainty in climate model projections is in most cases larger than the natural variability and the emission scenario uncertainty (Wilby and Harris, 2006;Déqué et al., 2007;Dessai and Hulme, 2007;Hawkins and Sutton, 2011). In an effort to account for this source of uncertainty, multimodel ensembles are widely used in climate change impact studies.
Several uncertainty quantification techniques based on multi-model ensembles have been suggested in the literature. These range from simple methods considering the ensemble average (Dai et al., 2001;Christensen and Lettenmaier, 2007;Pierce et al., 2009) to more complex probabilistic methods like the Bayesian approaches suggested by Tebaldi et al. (2005) and Leith and Chandler (2010). In general, there are two main approaches for combining projections in multi-model ensembles: (i) assign the same weight to all the models (e.g. Goodess et al., 2007), and (ii) assign differential weights to the climate models based on their performance (e.g. van der Linden and Mitchell, 2009;Lenderink, 2010;Taye et al., 2011;Boberg and Christensen, 2012). The latter approach is often preferred as it is believed that not all climate models perform equally well.

Published by Copernicus Publications on behalf of the European Geosciences Union.
However, there are many challenges in the assessment of climate model performance (Knutti et al., 2010;Maraun et al., 2010;Gómez-Navarro et al., 2012). Due to the lack of information about the future, climate model performance is often assessed by comparing climate model output for present conditions to observations. The choice of indices used to characterise the properties of the data, and of metrics used to compare model output with observations, poses an important challenge (Gómez-Navarro et al., 2012). There is a lack of agreement on what constitutes a good model, as different indices and metrics may lead to different results (Lenderink, 2010). As suggested by Tebaldi and Knutti (2007), the best approach is probably to use multiple indices and metrics.
In addition to these challenges, most climate change impact studies consider observational data sets as the true value and the associated uncertainty is not addressed. However, there are large uncertainties in precipitation measurements. Gómez-Navarro et al. (2012) concluded that, for mean precipitation, these uncertainties are notable and important when observations are used for ranking of climate models. In the following, the main aspects regarding precipitation observations, indices and metrics used in the evaluation of the climate models' skill to reproduce extreme precipitation are reviewed.

Precipitation observations
Most often precipitation is measured as point observations using rain gauges. These point measurements provide us with useful data for hydrological modelling. Depending on the purpose, point measurements can be good data sets for calculating precipitation indices for a given area. Mean properties such as the mean annual precipitation can be estimated fairly accurately from long time series of point measurements, since this property of precipitation is expected to change slowly in space unless topographical obstacles like mountains interfere. Other indices are less well estimated from point measurements. Extreme precipitation properties from a single time series are less representative of a given area than the mean annual precipitation. These properties are often calculated from a small number of measurements, normally one or a few per year, which means that they are affected by significant sampling error. Additionally, the frequency, true mean intensity and spatial distribution of the extreme events that are recorded are not accurately known. Nonetheless, information on extreme events for a given area is needed in hydrological modelling. Techniques such as the areal reduction factor (ARF) (Wilson, 1990;Sivapalan and Blöschl, 1998) have been introduced to extrapolate point precipitation properties to catchment scale. The ARF can be calculated as a simple linear function of the area covered (Wilson, 1990), or by using more advanced models based on extensive analysis of observations (Sivapalan and Blöschl, 1998). In both cases the areal average precipitation index will decrease, the larger the area considered. The concept of ARF is especially useful in situations where point measurements and gridded values are compared.
In climate change impact studies, the most commonly used observational precipitation data are point measurement data (Goodess et al., 2007;Beldring et al., 2008;Wetterhall et al., 2009;Burton et al., 2010;Taye et al., 2011;Fatichi et al., 2011) and gridded data (Frei et al., 2003, 2006;Lenderink, 2010;Bárdossy and Pegram, 2011). In most studies in hydrology, precipitation is not of interest at a single point but over the model area. A common practice to overcome this is to use the point measurement as the mean intensity over an area and combine the areal representation of the available point measurements over the catchments using Thiessen polygons. While this might provide a good representation of precipitation over small areas (Verhoest et al., 2010;Willems et al., 2012), it is not a good representation over large areas (Wilby et al., 1998;Wilks and Wilby, 1999;Frei et al., 2003;Cooley and Sain, 2010). Therefore, a key issue to consider in any given study is whether the spatial resolution of the data is suitable for the temporal scale of the precipitation properties studied. If long temporal scales are analysed (e.g. mean annual precipitation), a suitable distance for the spatial resolution is probably in the order of several hundred kilometres (Maraun et al., 2010). If sub-daily indices are studied, this distance is considerably shorter (Larsen et al., 2009;Kang and Ramirez, 2010;Gregersen et al., 2013). Regional climate models (RCMs) represent precipitation on grids of rather coarse scale; the spatial resolution of these models is usually around 10-50 km. Even the models with the finest resolution have a grid size that is coarse with respect to precipitation measurements (Maraun et al., 2010). Hence, there is a pronounced scale problem when comparing climate model outputs to precipitation measurements at point scale. This issue was addressed by Chen and Knutson (2008).
Even so, approaches comparing climate model outputs with point observations have been followed in a number of climate change impact studies, e.g. Taye et al. (2011), Gómez-Navarro et al. (2012) and Gregersen et al. (2013).
Point measurements are known to be uncertain, and this uncertainty tends to be higher for extreme events (Fankhauser, 1998). Furthermore, when interpolating data into gridded data other sources of uncertainty arise, e.g. the interpolation method used, the homogeneity of the station network, etc. Several studies have dealt with this aspect (e.g. Hewitson and Crane, 2005;Haylock et al., 2008;Hofstra et al., 2009). Chen and Knutson (2008) focused on the differences arising from interpreting climate model precipitation as either point or mean areal values. From this, it is clear that the interpretation has great influence on the conclusions to be drawn from a given study. It is beyond this study to thoroughly assess these uncertainties and the reader is referred to Hewitson and Crane (2005), Haylock et al. (2008), Hofstra et al. (2009) and Chen and Knutson (2008) for further studies.

Indices
Even though climate models are primarily constructed to model climate at large scales (Maraun et al., 2010), extreme precipitation at local scales is of great interest in climate change impact studies. A large number of studies have focused on modelling precipitation extremes in relation to climate model output (e.g. Benestad, 2010;Burton et al., 2010;Cooley and Sain, 2010;Nguyen et al., 2010;Schliep et al., 2010;De Michele et al., 2011;Olsson et al., 2012;Gregersen et al., 2013). These studies used different indices to characterize the tail of the distribution of precipitation data. The choice of indices is highly dependent on the application, e.g. urban hydrology or agricultural hydrology. Several attempts have been made to compile a list of indices suitable to characterize extreme events. For example, in the STARDEX project (Haylock and Goodess, 2004) a set of six core precipitation-related indices was defined, and the "Expert Team on Climate Change Detection Indices" (ETCCDI) (Peterson, 2005) defined a set of eleven precipitation indices, including those from STARDEX. In the literature, some of the more commonly used indices are: percentiles, often the 95th or 99th (Beldring et al., 2008;Hundecha and Bárdossy, 2008;Benestad, 2010;Cooley and Sain, 2010;Iizumi et al., 2011); the maximum precipitation in one day or a specific number of consecutive days (Segond et al., 2006;Beniston et al., 2007;Sang and Gelfand, 2009a, b;Burton et al., 2010;Schliep et al., 2010); precipitation amounts for T-year return periods (Frei et al., 2006;Fowler and Ekström, 2009;Kysley and Beranova, 2009); and the Intensity-Duration(-Area)-Frequency (ID(A)F) relationship (De Michele et al., 2001;Nguyen et al., 2010;Olsson et al., 2012).

Metrics
As in the case of extreme precipitation indices, a range of different metrics have been used for quantifying climate model performance. These can be categorized in two main groups: (i) metrics focusing on the performance of climate models at model grid level or averaged over a region (Giorgi and Mearns, 2002;Boberg et al., 2010;Hanel and Buishand, 2010;Lenderink, 2010), and (ii) metrics focusing on the ability of models to represent the spatial distribution of the variable of interest (Fowler and Ekström, 2009;Lenderink, 2010;Bárdossy and Pegram, 2011). In the first group, the biases in one or more indices are often analysed. Additionally, properties of empirical distributions and confidence intervals of return levels (Frei et al., 2006) have also been used. In the second group, semivariograms and principal components analysis have been applied. Some studies have compared and combined different metrics. For example, Fowler and Ekström (2009) defined a metric that accounts for both the spatial characteristics and the bias in the extreme events intensity. Lenderink (2010) compared two different metrics for extreme precipitation; one is a simple measure of bias between RCM output and observations, and the other metric measures the differences between the spatial patterns simulated by the RCMs and the observations. The influence of scaling a given data set into coarser scale is well described in the literature (e.g. Chen and Knutson, 2008;Tozer et al., 2012) but a more systematic assessment of the influence of the quality of the underlying data is lacking. Studies in this area have been performed for mean precipitation indices (Gómez-Navarro et al., 2012) but not for extreme ones. This study attempts to add new knowledge within this area. Several indices are considered from mean precipitation to high percentiles in order to assess whether the choice of the observational data used affects all the precipitation characteristics, or if it is only relevant for extremes.
Additionally, two different metrics are considered that can be used to weight the climate models in the ensemble. The influence of the choice of observational data, indices and metrics on the assessment of climate model performance is investigated. The purpose is not to weight the climate models or to find the best or worst models, although a ranking of model performance is part of the study.
The next section describes the four observational data sets used as well as the climate models considered. The methodology applied to these data is then described in Sect. 3 followed by the results and discussions in Sect. 4. Section 5 summarizes the main conclusions drawn from this study.

Data
Two kinds of data are used for this study: observational data, and climate model output data. First the different observational data sets are presented and afterwards the different climate models.

Observational data
Four different observational data sets have been considered. These comprise two national data sets (SVK and Climate Grid Denmark (CGD)) and two freely available international data sets (European Climate Assessment and Dataset (ECA&D) and E-OBS). The SVK and ECA&D data are point measurements, while the CGD and E-OBS are gridded data sets. For this study all the data sets consist of daily precipitation covering Denmark, and they are used as provided. Figure 1 shows the locations of the grid points and gauge locations of the different data sets. This figure highlights the differences in the spatial distribution of the data available.
The SVK data set is owned by the Danish utility companies. It consists of one-minute temporal resolution precipitation records for approximately 100 stations in Denmark. This station network was designed to provide information on extreme precipitation for design of urban infrastructure (Madsen et al., 2002). The length of the individual records ranges from 5 to 33 yr in the period 1979 to 2012, and the spatial coverage is centred on the most urbanized areas of Denmark. Due to its purpose, the SVK data set is operated with a rather high threshold for dry weather, i.e. hours with less than approximately 0.2-0.4 mm of rain are considered dry. For this study the daily precipitation values are calculated from the base data set. The SVK gauge locations are shown in Fig. 1a.
CGD is a gridded precipitation product created by the Danish Meteorological Institute (DMI). It presents daily precipitation based on approximately 300 stations covering Denmark in an irregular but relatively homogeneous, dense network (Scharling, 1999). The station data have been interpolated onto grids of 10 × 10 km using an inverse distance weighting method (Scharling, 1999, 2012). The data set has only recently been released for research purposes, and the quality of both the station data and the resulting gridded data has been extensively studied by DMI and found to be very good (Scharling, 2000;Scharling and Kern-Hansen, 2002). The data set is available for 1989 to 2010. The CGD grid locations are shown in Fig. 1b.
ECA&D is a large pan-European station data set that contains more than 2000 stations measuring daily precipitation (Klein Tank et al., 2002;Klok and Klein Tank, 2009). In Denmark, there are a total of 26 stations of which 17 are available for downloading from the project website (http://www.ecad.eu). The period covered by the time series varies depending on the station. The stations available in Denmark cover a period of more than 30 yr and all of them are currently operational.
The ECA&D data are used as a basis to obtain the gridded data set E-OBS (Haylock et al., 2008). This data set was created as part of the ENSEMBLES project (van der Linden and Mitchell, 2009) and covers the time period 1951-2012. The gridded data are obtained from the point measurements using a kriging method presented by Haylock et al. (2008). The E-OBS data set is available at a resolution of 0.22° and 0.44° (approximately 25 and 50 km, respectively) both in a regular latitude-longitude grid and a rotated pole grid. In this study we use version 5.0 of the rotated pole grid data set at a resolution of 0.22°. At this resolution there are 66 land grids over Denmark. Both ECA&D and E-OBS have been widely used in climate change impact studies (Boberg et al., 2009;Christensen et al., 2010;Lenderink, 2010). E-OBS is regularly updated and the number of stations included is increasing. However, the number of stations in some regions is currently low compared to the number of grid points. The low density of stations in some regions leads to an over-smoothing of precipitation intensities, and especially of extreme events (Hofstra et al., 2009, 2010). The ECA&D gauge locations are shown in Fig. 1c and the E-OBS grid locations in Fig. 1d.

Climate model data
The four observational data sets are compared with a multimodel ensemble of RCMs from the European ENSEMBLES project. The project aimed at developing an ensemble prediction system to assess the uncertainty in climate projections from seasonal to decadal and longer timescales (van der Linden and Mitchell, 2009). A large data set of RCMs based on several GCMs was set up as part of the ENSEMBLES project. In this study we consider 15 RCMs driven by 6 different GCMs. Table 1 shows the RCMs considered, where the number assigned to each of the RCMs will be used in the results sections.
The models have a spatial resolution of 0.22° (approximately 25 km) and thirteen of them use the same rotated pole grid as E-OBS. Two models, RM5.1 and RegCM, use a Lambert conformal grid system. The indices of these two models have been re-interpolated to the E-OBS grid by using the natural neighbour interpolation method suggested by Sibson (1980, 1981). Daily precipitation time series are available for the time period 1951-2100 for all the models. The RCM outputs used in this study cover the time period from 1989 to 2010. This is the time period common to all the observational data sets.

Methodology
This study is divided into two main parts. The first part consists of an inter-comparison of indices from the different observational data sets. The comparison is based on the absolute value of the indices and their spatial pattern. The second part compares the climate model performance estimated using each of the different observational data sets. The climate model performance is assessed using two different metrics, which are applied to all the indices. This section describes the indices considered in the study and the metrics used to assess the climate model performance.

Point and grid point
A set of indices is used to compare the different observational data sets and RCM outputs. The indices are chosen to represent information often evaluated in climate studies. They represent a range of temporal scales as well as mean and extreme precipitation properties. The indices evaluated are:
- The mean annual precipitation (Mean).
- The proportion of dry days (PDD).
- The simple daily intensity index (SDII), which is the same as the mean precipitation amount per wet day.
Both the SDII and Prec90p are in the list of core indices defined by ETCCDI. Wet days are defined as days with precipitation higher or equal to 1 mm (Peterson, 2005;Seneviratne et al., 2012). These indices are estimated separately for each of the stations in the observational point measurement data sets and for each grid point in the observational gridded data sets and the RCMs.
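The indices above can be illustrated with a short Python sketch for a single daily series. This is not the code used in the study. The function names, the conversion of the daily mean to an annual total, the treatment of all non-wet days as dry, and the choice to take the percentile over wet days only are assumptions made for illustration.

```python
from statistics import mean

WET_DAY_THRESHOLD_MM = 1.0  # wet day: precipitation >= 1 mm, as stated in the text


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def precipitation_indices(daily_mm, q=90, days_per_year=365.25):
    """Return Mean (mm/yr), PDD (fraction), SDII (mm/wet day) and one percentile."""
    wet = [p for p in daily_mm if p >= WET_DAY_THRESHOLD_MM]
    return {
        # Assumption: annual mean approximated as daily mean times days per year.
        "Mean": mean(daily_mm) * days_per_year,
        # Assumption: every day below the wet-day threshold counts as dry.
        "PDD": 1.0 - len(wet) / len(daily_mm),
        "SDII": mean(wet) if wet else 0.0,
        # Assumption: the percentile is taken over wet days only.
        f"Prec{q}p": percentile(wet, q) if wet else 0.0,
    }
```

Applying the function to every station or grid point in turn would give the per-point index values that the later comparisons rely on.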

Spatial pattern
The set of indices defined above are also used to investigate the differences in the spatial pattern of the different data sets. Empirical semivariograms are used for this purpose. Empirical semivariograms use the value of the index at each point to estimate the semivariance, i.e. how the similarity between points changes with distance. This allows us to investigate the spatial pattern of each of the indices described above. Semivariograms show the value of the semivariance depending on the distance (lag) between points. The semivariance, γ(d), is a measure of dissimilarity between two points separated in space by distance d. The semivariance increases with distance until it levels off. The distance at which the semivariogram levels off is known as the range. Two points are considered to be uncorrelated if they are at a distance equal to or higher than the range, also known as the decorrelation length. Following Wackernagel (2003), the semivariance at a distance d is estimated as:
$$\gamma(d) = \frac{1}{2}\,\mathrm{E}\left[\left(Z(x+d) - Z(x)\right)^{2}\right] \qquad (1)$$
where Z(x) is the value of the index at the point x, Z(x + d) is the value at a point located a distance d from x, and E[·] denotes the expected value. The semivariance is estimated by grouping all pairs of points into a fixed number of bins. For each bin the average distance and average semivariance of all pairs in the bin are calculated.
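The binned estimate described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study. The function name, the planar coordinates in km, the equal-width bins between zero and the largest pair distance, and the optional `max_dist` and `normalise` arguments are assumptions.

```python
import math
from itertools import combinations


def empirical_semivariogram(coords, values, n_bins=15, max_dist=None, normalise=True):
    """Binned empirical semivariogram.

    coords: list of (x, y) positions (assumed planar, in km).
    values: index value at each position.
    Returns (mean lag, mean semivariance, pair count) for each non-empty bin.
    """
    if normalise:
        # Divide by the average over all points so data sets are comparable.
        avg = sum(values) / len(values)
        values = [v / avg for v in values]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        d = math.dist(a, b)
        if max_dist is None or d <= max_dist:
            # Semivariance contribution of one pair: half the squared difference.
            pairs.append((d, 0.5 * (values[i] - values[j]) ** 2))
    if not pairs:
        return []
    upper = max(d for d, _ in pairs)
    width = upper / n_bins or 1.0  # assumption: equal-width bins
    bins = [[] for _ in range(n_bins)]
    for d, g in pairs:
        bins[min(int(d / width), n_bins - 1)].append((d, g))
    return [
        (sum(d for d, _ in b) / len(b), sum(g for _, g in b) / len(b), len(b))
        for b in bins
        if b
    ]
```

Returning the pair count per bin makes it easy to see where an estimate rests on too few pairs to be reliable.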
In this study the number of bins selected is 15, i.e. all the points are grouped in 15 different bins. In order to be able to compare the different semivariograms the value of the index in each point is normalized by the average of all the points. These values are then used to estimate the semivariance as shown in Eq. (1). Empirical semivariograms are constructed for each of the observational data sets and for the RCMs. Empirical semivariograms have been previously used in climate studies to rank RCMs according to their performance in reproducing spatial patterns (e.g. Fowler and Ekström, 2009). For this reason and due to their ability to represent the spatial pattern of a specific index, they have been selected in this study to assess the performance of the RCMs. Nonetheless, a more concise summary of the similarity between spatial patterns can be graphically shown using Taylor diagrams (see Taylor (2001) for details). The Taylor diagrams show three metrics. They show the centred root mean square difference (RMSD) and the spatial correlation between model data and observations. Additionally, they show the spatial standard deviation of the model data and observations. Taylor diagrams were specifically developed to summarize statistical information of how well patterns match. Hence, Taylor diagrams have been used here to further compare the spatial pattern of the RCMs with the observational data sets.
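The three Taylor diagram statistics mentioned above can be computed as in the following Python sketch. The function name and the use of population (1/n) statistics are assumptions. The relation checked in the usage note is the standard identity underlying the diagram (Taylor, 2001).

```python
import math


def taylor_statistics(model, obs):
    """Return (std_model, std_obs, correlation, centred RMSD) for paired fields."""
    n = len(model)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    dev_m = [m - mean_m for m in model]  # model anomalies
    dev_o = [o - mean_o for o in obs]    # observation anomalies
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    corr = sum(a * b for a, b in zip(dev_m, dev_o)) / (n * std_m * std_o)
    # Centred RMSD: root mean square difference after removing both means.
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(dev_m, dev_o)) / n)
    return std_m, std_o, corr, crmsd
```

For any pair of fields, the outputs satisfy crmsd² = std_m² + std_o² − 2 · std_m · std_o · corr, which is why the three quantities can be shown together in a single two-dimensional diagram.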

Metric
In the second part of the analysis, the performance of the RCMs is assessed by comparing the indices estimated for the observational data sets to the indices estimated from the RCM outputs.

Point and grid point
The first metric used is based on the bias in reproducing the precipitation indices. The bias is calculated individually for each grid point in the RCMs for which observational data are available. It is estimated by subtracting the observations from the RCM output, i.e. a positive bias indicates that the RCM output yields higher indices than the observations. The absolute value of the median of the bias is then used to rank the RCMs, i.e. the climate model with the smallest absolute median bias is ranked in first position.
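A minimal Python sketch of this ranking is given below. The function name and the dictionary layout are hypothetical, and the observations are assumed to be already matched to the RCM grid points.

```python
from statistics import median


def rank_by_median_bias(rcm_index, obs_index):
    """Rank models by the absolute value of their median bias.

    rcm_index: {model name: [index value per grid point]}
    obs_index: [observed index value at the same grid points]
    Returns (names ordered from best to worst, {name: |median bias|}).
    """
    scores = {}
    for name, values in rcm_index.items():
        # Bias = RCM minus observation, so positive means the RCM is too high.
        bias = [m - o for m, o in zip(values, obs_index)]
        scores[name] = abs(median(bias))
    return sorted(scores, key=scores.get), scores
```

Because the median is taken before the absolute value, positive and negative biases at different grid points can offset each other, so a model with large local errors of opposite sign may still rank well.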

Spatial pattern
The second metric used to assess the performance of the climate models is based on the representation of the spatial pattern. The empirical semivariograms are used for this purpose. The performance of the RCMs is assessed using the root mean square error (RMSE). For each climate model, the error at a specific lag is calculated as the difference between the semivariance estimated from the climate model and the observations. The RMSE for the model m is then calculated as
$$\mathrm{RMSE}_{m} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\gamma_{i}^{m} - \gamma_{i}^{\mathrm{Obs}}\right)^{2}} \qquad (2)$$
where $\gamma_{i}^{\mathrm{Obs}}$ and $\gamma_{i}^{m}$ are the semivariance for the observations and climate model m at lag i, respectively. N is the number of bins in the semivariogram. The model with the smallest RMSE is ranked in first position. It must be highlighted that the comparison of the climate models is carried out using the empirical semivariance. We do not attempt to parameterise the semivariogram, as often done in interpolation methods. This would include additional uncertainties arising from both the model selection and the parameter estimation.
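The spatial-pattern metric can be sketched in Python as the root mean square of the lag-wise differences between two empirical semivariograms. The function names are hypothetical, and the sketch assumes that the model and observational semivariances have been evaluated on the same lag bins.

```python
import math


def semivariogram_rmse(gamma_model, gamma_obs):
    """RMSE between model and observed semivariance over N shared lag bins."""
    n = len(gamma_obs)
    return math.sqrt(
        sum((gm - go) ** 2 for gm, go in zip(gamma_model, gamma_obs)) / n
    )


def rank_by_rmse(gamma_by_model, gamma_obs):
    """Return (names ordered from smallest to largest RMSE, {name: RMSE})."""
    scores = {
        name: semivariogram_rmse(gamma, gamma_obs)
        for name, gamma in gamma_by_model.items()
    }
    return sorted(scores, key=scores.get), scores
```

Working directly with the binned values keeps the comparison free of any fitted variogram model, in line with the choice described above.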
In addition to using the spatial pattern for assessing the performance of the RCMs, it is also used to assess the similarities of the RCMs in the ensemble. This is of relevance when using the ensemble of RCMs to quantify the uncertainty in climate change projections. Most uncertainty quantification techniques assume that the models are independent. However, this assumption may not be valid, as some models may share parts of their code or parameterizations, and/or are driven by the same GCMs. The validity of this assumption is addressed in detail by Tebaldi and Knutti (2007), Knutti et al. (2010), and Pennell and Reichler (2011). In a recent study by Sunyer et al. (2013) the interdependency of the ENSEMBLES RCMs over Denmark is investigated using E-OBS as the observational data set. The impact of the observational data set chosen is investigated in this study.
The methodology followed here is the same as in Sunyer et al. (2013). The first step is the estimation of the metric to investigate the interdependency of RCMs. The metric used is a measure of the model error. It is estimated by removing the ensemble average error from the individual model error. The ensemble average error represents the common biases. It is calculated separately for each grid point as the average of the model error of all the RCMs. For each index, the metric is estimated separately for all the grid points for each RCM in the ensemble.
The similarity of the RCMs can then be assessed using a hierarchical cluster analysis (Wilks, 2006). This analysis groups the RCMs into clusters depending on their similarity. The similarity of the RCMs is expressed by means of the correlation matrix, R, the elements of which are the correlations between the metric estimated for all the RCMs. Dendrograms are used to illustrate the results of the hierarchical cluster analysis. The dendrograms show the dissimilarity of the RCMs, estimated as the Pearson's distance, i.e. 1 − R.
Hydrol. Earth Syst. Sci., 17, 4323-4337, 2013 www.hydrol-earth-syst-sci.net/17/4323/2013/
Figure 2 shows maps of mean precipitation for the two gridded data sets. The overall pattern seems to be the same but it is clear that there are some distinct differences. The finer scale CGD data set shows a greater variation with higher precipitation in the western part of Denmark and with a clearer marking of the coastal grid points. E-OBS is consistently drier than CGD in the eastern part and for most of the western part, except for the southernmost grid points. E-OBS is also drier than CGD in the middle-eastern grid points of Jutland and the northernmost grid points.
The box plots in Fig. 3 summarize the indices estimated for each point for all the data sets. The boxes represent the 25th, 50th and 75th percentiles, the whiskers represent the 5th and 95th percentile, and the circles show the outliers. The box plot for the mean, Fig. 3a, shows a good agreement among the data sets. The median ranges between 600 and 700 mm yr⁻¹ (approximately 1.6 and 2 mm day⁻¹ as presented by the mean), and the total span is a few hundred mm yr⁻¹ (approximately 1 to 1.2 mm day⁻¹). This is the expected range of mean precipitation for Denmark as determined by historical investigations (Frich et al., 1997;Madsen et al., 2009). The SVK data set has the lowest median of all the data sets. This is expected to be an artefact mainly caused by the relatively high threshold used in the processing. The box plot of the other long temporal scale index, PDD, shows larger differences between the data sets (see Fig. 3b). In this case the SVK data set also stands out with a considerably larger PDD than the other data sets. The differences are likely to be due to the same phenomenon as for the mean. The high threshold for the SVK data should remove essentially all drizzle from the record and thereby increase the PDD.

Point and grid point
As in the case of the mean and PDD, only the SVK data set stands out notably for the SDII and Prec75p. It has a considerably higher median value but comparable variation. Again, this is most likely linked to the high threshold, which leads to fewer wet days. In the case of the higher percentiles, the differences between the data sets tend to be larger. Point measurement data sets show higher values than the gridded data sets. Additionally, the gridded data set with a higher spatial resolution (CGD) shows higher values than the gridded data set with a lower spatial resolution (E-OBS). This is in agreement with the general understanding that the gridding of point measurements tends to smooth out extreme precipitation (Chen and Knutson, 2008;Hofstra et al., 2010). The difference between the ECA&D data set and the CGD data set seems to be in the expected range of a 15-20 % reduction in intensity from point scale to a 100 km² grid that could be explained by the simple ARF (Wilson, 1990). The E-OBS data set, on the other hand, is lower than expected by the ARF method (approximately 33 % reduction in intensity from point scale to 625 km² grid size) and the difference increases for higher percentiles. The difference between CGD and E-OBS increases as a function of the percentile. This difference is believed to be partly due to the different spatial resolution and partly due to the number of stations used in the gridding. CGD is created from roughly one observational station per grid cell, whereas E-OBS only has approximately one station available per three grid cells. The same difference is observed between the two point measurement observational data sets (SVK and ECA&D). Again, it is believed to be a product of the difference in the number and location of stations in the different data sets. The differences can hence be explained mainly by the quality of the underlying observational data, implying that having more gauges increases the chance of monitoring extremes.
This initial analysis of the observational data sets shows, as expected, differences between point observations and gridded data. Additionally, it also shows that the quantity of data used to create a data set seems to have an important influence on the extreme properties and thereby the quality of the data set in representing the region of interest.
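The smoothing effect described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the wet-day threshold of 1 mm, the calculation of percentiles over wet days only, the linear-interpolation percentile and the two gauge series are all assumptions made for illustration, and the "grid cell" is simply the arithmetic mean of two hypothetical gauges.

```python
import math

def percentile(values, q):
    """Percentile q (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def precipitation_indices(daily_mm, wet_threshold=1.0):
    """Mean, proportion of dry days, SDII and wet-day 95th percentile.

    wet_threshold (mm/day) is an assumed value, not taken from the paper.
    """
    wet = [p for p in daily_mm if p >= wet_threshold]
    n = len(daily_mm)
    return {
        "mean": sum(daily_mm) / n,
        "pdd": (n - len(wet)) / n,
        "sdii": sum(wet) / len(wet) if wet else 0.0,
        "prec95p": percentile(wet, 95) if wet else 0.0,
    }

# Hypothetical daily series (mm/day) for two gauges in the same grid cell.
gauge_a = [0.0, 12.0, 0.0, 3.0, 0.0, 30.0, 1.0, 0.0, 0.0, 6.0]
gauge_b = [0.0, 2.0, 8.0, 0.0, 0.0, 4.0, 20.0, 0.0, 5.0, 0.0]
# Crude stand-in for gridding: the cell value is the mean of the gauges.
cell_mean = [(a + b) / 2.0 for a, b in zip(gauge_a, gauge_b)]

idx_a = precipitation_indices(gauge_a)
idx_b = precipitation_indices(gauge_b)
idx_cell = precipitation_indices(cell_mean)
for name, idx in (("gauge_a", idx_a), ("gauge_b", idx_b), ("cell", idx_cell)):
    print(name, {k: round(v, 2) for k, v in idx.items()})
```

In this toy case the mean of the averaged series equals the average of the two gauge means, whereas its upper percentile falls below that of either gauge, because the heavy events at the two gauges do not occur on the same days.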

Spatial pattern
The spatial pattern of precipitation, which is of high importance in hydrological applications, is assessed by calculating empirical semivariograms for all the data sets. Figure 4 shows the semivariograms for the mean, SDII and the 95th and 99th percentiles. The maximum distance considered in the semivariograms is 250 km, because the number of grid points available at larger distances is too small to obtain a reliable estimate of the semivariance. The semivariograms show that the SVK data have basically no spatial structure for any of the considered indices. Further, Fig. 4 shows that E-OBS has a marked increase in the semivariance with distance and no apparent range when compared with CGD. The difference in the spatial pattern of E-OBS and CGD could be explained by the difference in the number of stations used in these data sets. In E-OBS, precipitation measured at stations in the neighbouring countries is probably assimilated into the grids for Denmark. Consequently, a higher semivariance would be obtained for E-OBS at large distances. The semivariograms of E-OBS and CGD do not level off at the same distance. This phenomenon is not investigated further in the present study. It must also be noted that the two gridded data sets use different interpolation methods; if the data basis is sufficient, this should only have a minor influence on the result. Furthermore, due to Denmark's flat topography, daily precipitation values are expected to vary slowly in space, and the effect of the interpolation method is expected to be small compared with the effect of the number of stations. The large number of stations can also explain the smoother semivariogram obtained for CGD. The high variation in ECA&D is probably due to the limited number of stations in this data set. For ECA&D and the other point data set, SVK, a nugget effect due to the pooling of data probably also influences the semivariograms.
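For readers less familiar with the method, an empirical semivariogram can be sketched as follows. This is a generic textbook estimator, half the mean squared difference of all pairs of values whose separation falls in a distance bin, and not necessarily the exact implementation used here; the bin width, coordinates and index values are hypothetical.

```python
import math

def empirical_semivariogram(points, values, bin_width, max_dist):
    """Return (bin_centre, semivariance, n_pairs) for every non-empty bin.

    points: list of (x, y) coordinates in km; values: index value per point.
    Pairs further apart than max_dist are ignored.
    """
    n_bins = int(math.ceil(max_dist / bin_width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d > max_dist:
                continue
            # A pair exactly at max_dist is placed in the last bin.
            k = min(int(d / bin_width), n_bins - 1)
            sums[k] += (values[i] - values[j]) ** 2
            counts[k] += 1
    return [((k + 0.5) * bin_width, sums[k] / (2.0 * counts[k]), counts[k])
            for k in range(n_bins) if counts[k] > 0]

# Hypothetical transect of four grid points, 25 km apart, with a linear trend.
points = [(0.0, 0.0), (25.0, 0.0), (50.0, 0.0), (75.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]
semivariogram = empirical_semivariogram(points, values, 25.0, 75.0)
for centre, gamma, n_pairs in semivariogram:
    print(f"lag {centre:5.1f} km  semivariance {gamma:.3f}  pairs {n_pairs}")
```

The number of pairs returned per bin makes the reliability issue visible: bins with very few pairs, as at large distances over a small domain, give noisy semivariance estimates.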

Climate model performance and ranking
The previous section has focused on comparing the absolute value and the spatial pattern of the indices of different observational data sets. These data sets could all potentially be used for defining the baseline climate in climate change impact studies in Denmark, and in fact SVK, ECA&D and E-OBS have been used for this purpose (e.g. Boberg et al., 2009, 2010; Lenderink, 2010; Sunyer et al., 2012; Gregersen et al., 2013). This section assesses the performance of the climate models using the four different observational data sets analysed in the previous section. The bias in the point indices and the RMSE in the empirical semivariograms are the metrics used to rank the climate models. The indices estimated using CGD have been re-interpolated into the same grid system as E-OBS and the RCMs. This is done to be able to compare the results obtained using CGD and E-OBS without the effect of the spatial resolution. The re-interpolation method used is the same as the one used for the RM5.1 and RegCM3 models. The re-interpolated CGD data is referred to as CGD-25. Figure 5 shows the median bias of each of the 15 RCMs in the ensemble, calculated using each of the observational data sets. For all the indices, the estimated bias is highly dependent on the observational data used. In the case of the mean precipitation, the bias estimated using CGD-25 is lower than the bias estimated using the other observational data sets. On the other hand, the highest biases are obtained when using the SVK data as the observational data set. This is in agreement with the lower values of the mean precipitation found for this observational data set in Fig. 3. The biases estimated using ECA&D and E-OBS are rather similar to the bias estimated using CGD-25; the difference is smaller than 0.5 mm day⁻¹. However, E-OBS leads to a slightly higher bias for most RCMs.
Nonetheless, for most of the climate models the observational data sets agree on the positive sign of the bias, i.e. the RCMs overestimate the mean precipitation. For the other indices (SDII, Prec95p and Prec99p) the observational data sets disagree on both the sign and the magnitude of the bias. The largest difference is found between the negative bias shown by SVK and the positive bias estimated using E-OBS. The biases estimated using both CGD-25 and ECA&D lie in between the other two observational data sets.

Point and grid point
Hydrol. Earth Syst. Sci., 17, 4323-4337, 2013 www.hydrol-earth-syst-sci.net/17/4323/2013/
In general, SVK, CGD-25 and ECA&D point to an underestimation of SDII, Prec95p and Prec99p by the RCMs, while E-OBS points to an overestimation. For these three indices, the bias estimated using the gridded observational data sets is, in most cases, higher than the bias estimated using the point observational data sets. This is due to the lower value of these indices found for the gridded observational data sets (see Fig. 3). As expected, and in agreement with the results from the previous section, the difference between the biases is larger for higher percentiles. Table 2 shows the ranking of the 15 RCMs according to the metric based on the bias for four of the indices (mean, SDII, Prec95p and Prec99p). In this table the number assigned to each RCM corresponds to the enumeration used in Table 1. The differences observed in Fig. 5 stand out in the ranking of the models. In the case of the mean, the same models are ranked in the highest positions for all the observational data sets. The five models with the highest ranking for the SVK data set (models highlighted in roman in Table 2) are among the seven best models for CGD-25, ECA&D and E-OBS. A similar pattern is observed for the models with the lowest ranking (models highlighted in bold). However, for the other three indices (SDII, Prec95p and Prec99p) the rankings are more dissimilar. For example, for Prec95p, model 2 has rank 1 for the SVK data but ranks 5, 7 and 15 for CGD-25, ECA&D and E-OBS, respectively. In general, the SVK, CGD-25 and ECA&D data sets lead to more similar model rankings, whereas E-OBS tends to give a reversed ranking. This can be explained by the difference in the sign of the bias when using E-OBS and when using SVK, ECA&D and CGD-25. In general, the values of SDII, Prec95p and Prec99p of the RCMs lie between the values estimated using E-OBS and those estimated using SVK, ECA&D and CGD.
This implies that when the absolute value of the bias of an RCM is small according to E-OBS, it is large according to SVK, ECA&D and CGD.
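The reversal of the ranking can be reproduced with a small numerical sketch. The values below are hypothetical and are not taken from Table 2 or Fig. 5; the bias is assumed here to be the plain difference between the RCM index and the observed index, and models are ranked by the absolute value of that difference.

```python
def rank_by_abs_bias(model_values, obs_value):
    """Return {model: rank}, where rank 1 has the smallest absolute bias."""
    biases = {name: value - obs_value for name, value in model_values.items()}
    ordered = sorted(biases, key=lambda name: abs(biases[name]))
    return {name: pos + 1 for pos, name in enumerate(ordered)}

# Hypothetical Prec95p values (mm/day) for three RCMs.
models = {"rcm_a": 21.0, "rcm_b": 23.0, "rcm_c": 26.0}
low_obs = 20.0   # stands in for a strongly smoothed gridded product
high_obs = 28.0  # stands in for a point measurement data set

rank_low = rank_by_abs_bias(models, low_obs)
rank_high = rank_by_abs_bias(models, high_obs)
print("against low reference: ", rank_low)
print("against high reference:", rank_high)
```

Because all three model values lie between the two reference values, every bias is positive against the low reference and negative against the high one, and the order of the absolute biases is exactly inverted.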

Spatial pattern
The previous results compare the RCMs with the observational data sets based on the value of the indices at point measurements and grid points. This section focuses on the ability of the RCMs to reproduce the spatial pattern in the observational data sets. Figure 6 shows the Taylor diagrams for the mean, SDII, Prec95p and Prec99p indices. For all the indices and in most cases, the standard deviation of the RCMs is lower than the standard deviation of the observational data sets. The largest differences in spatial variability between the RCMs in the ensemble are found when using ECA&D as the observational data set. Similarly, the largest differences between the RCMs and the observational data set are found for the SVK data. This observational data set leads to a higher spatial standard deviation than the other observational data sets. As previously mentioned, this is probably due to the heterogeneity of this data set.
The correlation of the RCMs with the different observational data sets is slightly higher when comparing the RCMs with CGD-25. The RMSD estimated using CGD-25 is also slightly lower than the RMSD estimated using the other observational data sets. Nonetheless, both the correlation and RMSD of the RCMs are similar when using E-OBS and CGD-25. The RCMs show the smallest correlation and largest RMSD when compared with the SVK data set.
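The three statistics summarised in a Taylor diagram can be computed as in the following sketch. The field values are hypothetical, and the population form of the standard deviation is an assumption, since the estimator is not specified in this section.

```python
import math

def taylor_statistics(model, obs):
    """Spatial std of model and obs, their correlation, and centred RMSD."""
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    dev_m = [m - mean_m for m in model]
    dev_o = [o - mean_o for o in obs]
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    corr = sum(a * b for a, b in zip(dev_m, dev_o)) / (n * std_m * std_o)
    # Centred RMSD: field means are removed before differencing.
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(dev_m, dev_o)) / n)
    return std_m, std_o, corr, crmsd

# Hypothetical index values (mm/day) at five collocated grid points.
obs_field = [10.0, 12.0, 15.0, 11.0, 14.0]
rcm_field = [11.0, 12.0, 13.0, 11.5, 13.0]  # smoother than the observations

std_m, std_o, corr, crmsd = taylor_statistics(rcm_field, obs_field)
print(f"std model {std_m:.2f}, std obs {std_o:.2f}, "
      f"correlation {corr:.3f}, centred RMSD {crmsd:.2f}")
```

The three quantities are linked by RMSD² = σ_m² + σ_o² − 2 σ_m σ_o r, which is what allows them to be displayed in a single diagram. Changing the reference data set changes σ_o and r, and therefore the position of every model.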
The Taylor diagrams show that the performance of a specific model depends on the observational data set used. For example, the model represented with the filled square symbol has one of the highest correlations and lowest RMSD for all the indices for CGD-25 but not for the other observational data sets. Figure 7 shows the semivariograms comparing CGD-25, E-OBS and the RCMs for the mean, SDII, Prec95p and Prec99p indices. The semivariograms of the point measurement data sets are not included in this comparison, because the number of gauges at a specific distance differs considerably from the number of grid points in the RCMs and in the gridded observational data sets. Figure 7 also shows the RCM with the smallest RMSE for each of the observational data sets. Lags up to 250 km have been considered to estimate the semivariograms and calculate the RMSE. In the case of the mean, the model with the smallest RMSE is the same for the two observational data sets (model RACMO2 driven by ECHAM5, model 5 in Table 1). However, for SDII, Prec95p and Prec99p the model with the smallest RMSE depends on the observational data set used. In agreement with the spatial standard deviation shown in the Taylor diagrams, the RCMs in general show a smaller semivariance than the observational data sets for all the indices.
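A minimal sketch of the semivariogram-based metric is given below. It assumes that the RMSE is taken over semivariance values at a common set of lags; the semivariance values, the lags and the model names are hypothetical.

```python
import math

def semivariogram_rmse(model_gamma, obs_gamma):
    """RMSE between two semivariograms evaluated at the same lags."""
    if len(model_gamma) != len(obs_gamma):
        raise ValueError("semivariograms must share the same lags")
    squared = [(m - o) ** 2 for m, o in zip(model_gamma, obs_gamma)]
    return math.sqrt(sum(squared) / len(squared))

def best_model(model_gammas, obs_gamma):
    """Name of the model whose semivariogram is closest to the reference."""
    return min(model_gammas,
               key=lambda name: semivariogram_rmse(model_gammas[name], obs_gamma))

# Hypothetical semivariances at four common lags.
obs_levelling_off = [0.2, 0.5, 0.8, 1.0]  # reference that reaches a sill
obs_no_range = [0.3, 0.9, 1.6, 2.4]       # reference that keeps increasing
model_gammas = {
    "rcm_x": [0.1, 0.4, 0.7, 0.9],
    "rcm_y": [0.3, 0.8, 1.4, 2.0],
}

best_for_first = best_model(model_gammas, obs_levelling_off)
best_for_second = best_model(model_gammas, obs_no_range)
print(best_for_first, best_for_second)
```

With these invented numbers the best model differs between the two references, which mirrors the dependence on the observational data set reported for SDII, Prec95p and Prec99p.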
The difference in the spatial pattern of the gridded observational data sets also has an effect on the interpretation of the information available in the ensemble of RCMs. Figure 8 shows the ensemble average error and the dendrograms for Prec95p estimated using E-OBS and CGD-25. The y axis in the dendrograms is Pearson's distance, which is a measure of the dissimilarity of the RCMs. All the RCMs included in the ensemble are shown on the x axis. The ensemble average error represents the common biases in the ensemble, while the dendrograms show the clustering of the RCMs.
In agreement with the results shown in Fig. 5, the comparison of the ensemble average error shows a higher error of the RCMs when the observational data set used is E-OBS. Additionally, the spatial pattern of the error also shows some differences. The error estimated using E-OBS is largest in the north and south-west of Jutland (Danish peninsula), while the error estimated using CGD-25 is largest in the western part of Jutland. The differences in the ensemble average error lead to differences in the spatial pattern of the metric used to estimate the similarities of the RCMs, which in turn lead to differences in the correlation matrix, R. This is reflected in the dendrograms. For example, in the dendrogram using E-OBS, the RACMO2 model (model 5) forms a cluster with the three HIRHAM models (models 1, 2 and 3), while in the dendrogram using CGD-25 this model forms a cluster with the models from the Hadley Centre (models 10, 11 and 12). Nonetheless, there are also some common results in the dendrograms. The most relevant one is that simulations of the same RCM driven by different GCMs (i.e. the HIRHAM, RCA and HadRM models) are more similar than different RCMs driven by the same GCM. Table 3 shows the ranking of the climate models according to the RMSE of the semivariograms. As seen in Table 3, the model with the highest ranking in mean precipitation is the same for both observational data sets (model 5). For this index the RCMs have virtually the same ranking for the two observational data sets. The difference between the rankings increases from SDII to Prec99p. It must be noted that the rankings of the RCMs based on the semivariograms obtained for CGD-25 and E-OBS are more similar than the rankings obtained using the bias at the grid points for these two observational data sets.
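The link between the ensemble average error and the similarity of the RCMs can be sketched as follows. Several choices here are assumptions rather than the documented method: the similarity metric is taken to be each model's error field minus the ensemble average error, Pearson's distance is taken as 1 − r, and only the pairwise distances on which a dendrogram would be built are computed, since the linkage rule is not specified in this section. Error values and model names are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distances(error_fields):
    """Pairwise distance 1 - r after removing the ensemble average error."""
    names = sorted(error_fields)
    n_cells = len(error_fields[names[0]])
    ens_mean = [sum(error_fields[m][c] for m in names) / len(names)
                for c in range(n_cells)]
    anomalies = {m: [error_fields[m][c] - ens_mean[c] for c in range(n_cells)]
                 for m in names}
    return {(a, b): 1.0 - pearson_r(anomalies[a], anomalies[b])
            for a, b in combinations(names, 2)}

# Hypothetical Prec95p errors (model minus observation) at four grid cells.
error_fields = {
    "rcm1_gcmA": [2.0, 1.0, 0.0, -1.0],
    "rcm1_gcmB": [2.5, 1.5, 0.2, -0.8],
    "rcm2_gcmA": [-1.0, 0.0, 1.0, 2.0],
    "rcm2_gcmB": [-0.5, 0.3, 1.2, 2.4],
}
distances = pearson_distances(error_fields)
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

Because the observational field enters every error field, replacing it changes the ensemble average error and the anomalies, and hence the distances and the resulting clusters. In this invented example the two runs of the same RCM are much closer to each other than runs of different RCMs sharing a driving GCM.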
It must also be noted that the ranking of the models using the same observational data set varies depending on the index. This is observed in both Table 2 and Table 3. Similarly, for the same observational data set the ranking of the models also varies depending on the metric used. For example, in the case of E-OBS, the best model at representing the spatial pattern for Prec95p (model 5) is ranked in eleventh position regarding the bias. These results show that the performance of the models depends on the index and metric of interest. Therefore, it is not possible to generally classify the models as good or bad models. This is in agreement with the results from previous studies, e.g. Lenderink (2010), Kjellström et al. (2010). The dependency of the ranking on the index and the metric highlights the importance of using an ensemble of RCMs to obtain robust climate projections for future climate conditions.
Figure 8. Ensemble average error (a, b) and dendrograms (c, d) for Prec95p estimated using CGD-25 (a, c) and E-OBS (b, d). The y axis in the dendrograms shows the dissimilarity of the climate models. Model numbers are shown in Table 1.

Conclusions
This study investigates the influence of the choice of observational data set in the assessment of climate model performance. Four different observational data sets have been analysed. These represent the common types of observations used in climate change impact studies (point measurements and gridded data). A set of indices (ranging from the mean to high percentiles) and two different metrics (based on the bias and on the root mean square error of spatial patterns) are used to analyse and compare daily precipitation data from observational data sets and from an ensemble of RCMs.
Indices calculated for each of the four observational data sets show similar results for the mean precipitation but differ substantially when considering more extreme properties such as high percentiles of precipitation. As expected, the two data sets of point measurements (SVK and ECA&D) show higher values for extreme precipitation. The difference between the point measurement data sets is related to a different number of stations, the spatial distribution of the stations and the precipitation threshold used in the data products. The gridded data set with a higher spatial resolution, CGD, also shows higher extremes than the other gridded data set, E-OBS. The difference between the two gridded data sets is larger than can be explained by the change in spatial resolution according to the areal reduction factor approach. The results from this study confirm the findings from previous studies regarding the E-OBS data set, i.e. that it over-smooths precipitation intensities at the grid cell level, especially extreme precipitation, resulting in less intense extremes in comparison to the other observational data sets.
Table 3. Ranking of the RCMs depending on the RMSE of the semivariograms for the observational data sets. Model numbers are shown in Table 1. The models highlighted in roman, italic and bold correspond to the models with rank 1 to 5, 6 to 10 and 11 to 15 for SVK in Table 2, respectively.
Columns: Ranking; Mean (CGD-25, E-OBS); SDII (CGD-25, E-OBS); Prec95p (CGD-25, E-OBS); Prec99p (CGD-25, E-OBS).
The different data sets also show different spatial patterns. Even though it over-smooths the precipitation intensities, E-OBS shows a lower correlation of the grid points at large distances than the CGD data. The differences identified between the observational data sets are important when assessing climate model performance. This is clearly shown in the analysis of the bias, where the sign of the bias for high percentiles is different when comparing the RCMs to E-OBS or to SVK, CGD-25 and ECA&D. Furthermore, the ranking of the climate models is almost opposite when considering E-OBS vs. the SVK, CGD-25 and ECA&D data sets. In the case of the mean precipitation, the ranking is less dependent on the observational data set considered, probably because it is an index that is robust to spatial and temporal averaging.
Similar conclusions can be drawn from the analysis of the spatial pattern. The ranking of the climate models depends both on the observational data set used and on the index. Higher differences between the rankings are observed for extreme precipitation. The differences in the spatial pattern of the gridded observational data sets also affect the conclusions regarding the similarity of the RCM biases. Additionally, as other studies have also stressed, when considering only one of the observational data sets, the ranking of the climate models depends on the index and metric used to rank the models.
The results of this study highlight the need to be aware of the different characteristics of observational data sets, as these strongly influence the performance estimated for each of the RCMs. RCMs should be compared to quality-checked observational data that represent the same precipitation characteristics. In this study the data set that best fits these requirements is the CGD data re-interpolated to the same grid resolution as the RCMs, i.e. CGD-25. Further work should focus on addressing the possible errors and uncertainty (e.g. measurement and interpolation uncertainty) in the observations, especially if the interest of the study is mainly in extreme precipitation.