Assessment of precipitation and temperature data from CMIP3 global climate models for hydrologic simulation

The objective of this paper is to identify better performing Coupled Model Intercomparison Project phase 3 (CMIP3) global climate models (GCMs) that reproduce gridscale climatological statistics of observed precipitation and temperature for input to hydrologic simulation over global land regions. Current assessments are aimed mainly at examining the performance of GCMs from a climatology perspective and not from a hydrology standpoint. The performance of each GCM in reproducing the precipitation and temperature statistics was ranked and better performing GCMs identified for later analyses. Observed global land surface precipitation and temperature data were drawn from the Climatic Research Unit (CRU) 3.10 gridded data set and re-sampled to the resolution of each GCM for comparison. Observed and GCM-based estimates of mean and standard deviation of annual precipitation, mean annual temperature, mean monthly precipitation and temperature and Köppen–Geiger climate type were compared. The main metrics for assessing GCM performance were the Nash–Sutcliffe efficiency (NSE) index and root mean square error (RMSE) between modelled and observed long-term statistics. This information combined with a literature review of the performance of the CMIP3 models identified the following better performing GCMs from a hydrologic perspective: HadCM3 (Hadley Centre for Climate Prediction and Research), MIROCm (Model for Interdisciplinary Research on Climate) (Center for Climate System Research (The University of Tokyo), National Institute for Environmental Studies, and Frontier Research Center for Global Change), MIUB (Meteorological Institute of the University of Bonn, Meteorological Research Institute of KMA, and Model and Data group), MPI (Max Planck Institute for Meteorology) and MRI (Japan Meteorological Research Institute). 
The future response of these GCMs was found to be representative of the 44 GCM ensemble members which confirms that the selected GCMs are reasonably representative of the range of future GCM projections.


Introduction
Our primary objective in this paper is to identify better performing GCMs from a hydrologic perspective. To do this we assess how well 22 global climate models (GCMs) from the World Climate Research Programme's (WCRP) Coupled Model Intercomparison Project phase 3 (CMIP3) multimodel data set (Meehl et al., 2007) are able to reproduce GCM grid-scale climatological statistics of observed precipitation and temperature over global land regions. We recognise that GCMs model different variables with a range of success and that no single model is best for all variables and/or for all regions (Lambert and Boer, 2001; Gleckler et al., 2008). The approach adopted here is not inconsistent with Dessai et al. (2005), who regarded assessing how well observed climatology is simulated as the first step in evaluating GCM projection skill. We also recognise that assessments have been published in peer-reviewed journals, but all appear to have been made from a climate science perspective. This review concentrates on GCM variables and statistical techniques that are relevant to engineering hydrologic practice.
GCM runs for the observed period do not seek to replicate the observed monthly record at any point in time and space. Rather, a better performing GCM is expected to produce long-term mean annual statistics that are broadly similar to observed conditions across a wide range of locations.

T. A. McMahon et al.: CMIP3 global climate models for hydrologic simulation
Here, the assessment of CMIP3 GCMs is made by comparing their long-term mean annual precipitation (MAP), standard deviation of annual precipitation (SDP), mean annual temperature (MAT), mean monthly patterns of precipitation and temperature and Köppen-Geiger climate type (Peel et al., 2007) with concurrent observed data for 616 to 11 886 terrestrial grid cells worldwide (the number of grid cells depends on the resolution of the GCM under consideration). These variables were chosen to assess GCM performance because they provide insight into the mean annual, interannual variability and seasonality of precipitation and temperature, which are sufficient to estimate the mean and variability of annual runoff from a traditional monthly rainfall-runoff model (Chiew and McMahon, 2002) or from a top-down annual rainfall-runoff model (McMahon et al., 2011) for hydrologic simulation purposes.
The GCMs included in this assessment are detailed in Table 1 (model acronyms adopted are listed in the table). Although no quantitative assessment of the BCCR (Bjerknes Centre for Climate Research) model is made, this model is included in Table 1 because details of its performance are available in the literature, which is discussed in Sect. 2. Other details in the table include the originating group for model development, country of origin, model name given in the CMIP3 documentation (Meehl et al., 2007), the number of 20C3M runs available for analysis, the model resolution and the number of terrestrial grid cells used in the precipitation and temperature comparisons.
Readers should note that when this project began as a component of a larger study in 2010, runs from the CMIP5 were not available. We are of the view that the approach adopted here is equally applicable to evaluating CMIP5 runs for hydrologic simulations. Conclusions about better performing models drawn from this analysis may prove similar to a comparable analysis of CMIP5 runs since most models in CMIP5 are, according to Knutti et al. (2013), "strongly tied to their predecessors". Analysis of the CMIP5 models indicates that the CMIP3 simulations are of comparable quality to the CMIP5 simulations for temperature and precipitation at regional scales (Flato et al., 2013).
This study is part of a larger research project that seeks to enhance our understanding of the uncertainty of future annual river flows worldwide through catchment-scale hydrologic simulation, leading to more informed decision-making for the sustainable management of scarce water resources, nationally and internationally. To achieve this, it is necessary to determine, as a minimum, how the mean and variability of annual streamflows will be affected by climate change. Other factors of less importance are changes in the autocorrelation of annual streamflow, changes in net evaporation from reservoir water surfaces and changes in monthly flow patterns, with the latter being more important for relatively small reservoirs. In this paper we deal with the key drivers of streamflow production, namely the mean and the standard deviation of annual precipitation and mean annual temperature (the latter adopted here as a surrogate for potential evapotranspiration, PET), along with the secondary factors of mean monthly patterns of precipitation and temperature. Adopting temperature as a surrogate for PET is contentious. We provide a detailed discussion of this issue in the Supplementary Material associated with this paper. Suffice it to say that a more complex PET formulation requires additional GCM variables other than temperature, and these are less reliable. This simplicity comes at the expense of potentially inadequate representation of future changes in PET, which may have important negative consequences when modelling streamflow in energy-limited catchments. Nevertheless, in the following discussion we concentrate on mean annual temperature as the GCM variable representing PET.
Computer models of most water resource systems that rely on surface reservoirs to offset streamflow variability adopt a monthly time step to ensure that seasonal patterns in demand and reservoir inflows are adequately accounted for. However, in a climate change scenario it is more likely that an absolute change in streamflow will have a greater impact on system yield than shifts in the monthly inflow or demand patterns. This will certainly be the case for reservoirs that operate as carryover systems rather than as within-year systems (for an explanation see McMahon and Adeloye, 2005). Therefore, in this paper we assess the GCMs in terms of annual precipitation and annual temperature, and patterns of mean monthly precipitation and temperature.
Following this introduction, we describe and summarise in the next section several previous assessments of CMIP3 GCM performance. We also include some general comments on GCM assessment procedures. In Sect. 3, the data (observed and GCM based) used in the analysis are described. Details and results of the subsequent analyses comparing GCM estimates of present climate mean and standard deviation of annual precipitation, mean annual temperature, mean monthly precipitation and temperature patterns and Köppen-Geiger climate type against observed data are set out in Sect. 4. In Sect. 5, we review the results and compare the literature information with our assessments of the GCMs. The final section of the paper presents several conclusions.

Literature
As noted above, to assess the impact of climate change on surface water resources of a region through hydrologic simulation, it is necessary to assess, as a minimum, the performance of the mean and the standard deviation of annual precipitation and mean annual temperature, and the mean monthly patterns of precipitation and temperature. Noting this background, we describe in the next section procedures that have been adopted in the literature to assess GCM performance.

Procedures to assess GCM performance
Ever since the first GCM was developed by Phillips (1956) (see Xu, 1999), attempts have been made to assess the adequacy of GCM modelling. Initially, these evaluations were simple side-by-side comparisons of individual monthly or seasonal means or multi-year averages (Chervin, 1981). To assess model performance, Chervin (1981) extended the evaluation procedure by examining statistically the agreement or otherwise of the ensemble average and standard deviation between the GCM modelled climate and the observed data, using the vertical transient heat flux in an example application. Legates and Willmott (1992) compared observed with simulated average precipitation rates by 10° latitude bands. Taylor (2001) developed a two-dimensional diagram in which each point consists of the spatial correlation coefficient and the spatial root mean square (RMS) along with the ratio of the variances of the modelled and the observed variables. Recently, some authors have used the Taylor diagram (Covey et al., 2003; Bonsal and Prowse, 2006) or a similar approach (Lambert and Boer, 2001; Boer and Lambert, 2001). Murphy et al. (2004) developed a climate prediction index. Whetton et al. (2005) introduced a demerit point system in which GCMs were rejected when a specified threshold was exceeded. Min and Hense (2006) introduced a Bayesian approach to evaluate GCMs and argued that a skill-weighted average with Bayes factors is more informative than moments estimated by conventional statistics. Shukla et al. (2006) suggested that differences in observed and GCM simulated variables should be examined in terms of their probability distributions rather than individual moments. They proposed that the differences could be examined using relative entropy. Perkins et al. (2007) also claimed that assessing the performance of a GCM through a probability density function (PDF) rather than using the first or second moment would provide more confidence in model assessment. To compare the reliability of variables (in time and space) rather than individual models, Johnson and Sharma (2009a, b) developed the variable convergence score, which is used to rank a variable based on the ensemble coefficient of variation. They observed that the variables with the highest scores were pressure, temperature and humidity. Reichler and Kim (2008) introduced a model performance index by first estimating a normalised error variance based on the square of the grid-point differences between simulated (interpolated to the observational grid) and observed annual climate, weighted and standardised with respect to the variance of the annual observations. The error variance was scaled by the average error found in the reference models and, finally, averaged over all climates.
It is clear from this brief review that no one procedure has been universally accepted to assess GCM performance, which is consistent with the observations of Räisänen (2007). We also note the comments of Smith and Chandler (2010, p. 379) who said "It is fair to say that any measure of performance can be subjective, simply because it will tend to reflect the priorities of the person conducting the assessment. When different studies yield different measures of performance, this can be a problem when deciding on how to interpret a range of results in a different context. On the other hand, there is evidence that some models consistently perform poorly, irrespective of the type of assessment. This would tend to indicate that these model results suffer from fundamental errors which render them inappropriate." In 1992, Legates and Willmott (1992) assessed the adequacy of GCMs based mainly on January and July precipitation fields. Although a number of GCM assessments were carried out during the following one and a half decades, it was not until 2008 that mean precipitation, either absolute or bias, was included in GCM published assessments. In that year, Reichler and Kim (2008, p. 303) argued that the mean bias is an important component of model error.
In Table 2a and b we summarise the application of the numerical metrics and the ranking metrics of precipitation and temperature, respectively, applied to CMIP3 data sets at the global or country scales. These references cover the period from 2006 to 2014. Across these 15 papers, we observe that for precipitation and temperature the spatial root mean square error, either using raw data (root mean square error, RMSE) or normalised data as a percentage of the mean value (RRMSE), is adopted in 7 of the 15 studies. (The data are normalised by the corresponding standard deviation of the reference or observed data.) This spatial root mean square metric, as well as the bias in the mean of the data, is relevant to hydrologists as it provides an indication of the uncertainty in the climate variables of interest to them. Of more relevance to hydrologists is the uncertainty in the temporal mean and variance of climatic variables, which for precipitation are reported in only 4 of the 15 studies. Although spatial correlation is not used directly in general hydrologic investigations, in GCM assessments it is often combined with the variance and spatial RMSE through the Taylor diagram (Taylor, 2001), which is an excellent summary of the performance of a GCM projected variable. As noted in Table 2, three papers utilise this approach. Lambert and Boer (2001, p. 89) extended the Taylor diagram to display the relative mean square differences, the pattern correlations and the ratio of variances for modelled and observed data. This approach to displaying the second-order statistics appears not to have been widely adopted. It is noted in Table 2a that only four papers include the mean or bias of the raw precipitation data in the GCM assessments, which is important from a hydrologic perspective. The second set of metrics, listed in Table 2b, is used essentially for ranking GCMs by performance. Several other assessment tools not included in Table 2b are the climate prediction index (Murphy et al., 2004) and Bayesian approaches (Min and Hense, 2006).
The preservation of specific climate features such as the ENSO (El Niño-Southern Oscillation) signal (van Oldenborgh et al., 2005) can also be considered a non-numerical measure of GCM performance, and in some regions it is no less important to hydrologists than the numerical measures. Most of these ranking metrics have been developed for specific purposes with respect to GCMs and several have little utility for the practising hydrologist, who is primarily interested in bias, variance and uncertainty in projected estimates of precipitation and temperature (plus net radiation, wind speed and humidity to derive potential ET) as input to drive stand-alone global and catchment hydrologic models.

Results of CMIP3 GCMs assessments
Table 2a indicates that only two papers (Räisänen, 2007; Gleckler et al., 2008) detail numerical measures for both mean annual precipitation and temperature for 21 and 22 CMIP3 GCMs, respectively, at a global scale. Reifen and Toumi (2009) [...] Räisänen's (2007) results illustrate the wide range of model performances that exist: for precipitation, RMSE = 1.35 mm day⁻¹ with a range of 0.97-1.86 and for temperature, RMSE = 2.32 °C with a range of 1.58-4.56. Reichler and Kim (2008) considered 14 variables covering mainly the period 1979-1999 to assess the performance of CMIP3 models using their model performance index. They concluded that there was a continuous improvement in model performance from the CMIP1 models compared to those available in CMIP3 but there are still large differences in the CMIP3 models' ability to match observed climates. Gleckler et al. (2008) normalised the data in Taylor diagrams for a range of climate variables and concluded that some models performed substantially better than others. However, they also concluded that it is not yet possible to answer the question: what is the best model?
Reifen and Toumi (2009) (Table 2b), using temperature anomalies, observed that "... there is no evidence that any subset of models delivers significant improvement in prediction accuracy compared to the total ensemble". On the other hand, Macadam et al. (2010) (Table 2a) assessed the performance of 17 CMIP3 GCMs comparing the observed and modelled temperatures over five 20-year periods and concluded that GCM rankings based on anomalies can be inconsistent over time, whereas rankings based on actual temperatures can be consistent over time.
In summary, Gleckler et al. (2008) stated that the best GCM will depend on the intended application. In the overarching project of which this study is a component, we are interested in the uncertainty in annual streamflow estimated through hydrologic simulation using GCM precipitation and temperature and how that uncertainty will affect estimates of future yield from surface water reservoir systems. Consequently, we are interested in which GCMs reproduce precipitation and temperature satisfactorily. Based on the references of Reichler and Kim (2008), Gleckler et al. (2008) and Macadam et al. (2010), the performance of 23 CMIP3 GCMs assessed at a global scale is ranked in Table 3. In Table 3 eight models that meet the Reichler and Kim (2008) criterion are also ranked in the upper 50 % based on the Macadam et al. (2010) and Gleckler et al. (2008)

Data
Two data sets are used in the GCM assessment that follows in Sect. 4. One is based on observed data and the other on GCM simulations of present climate (20C3M). It should be noted that of the 22 GCMs examined herein, multiple runs or projections were available for nine models. The resulting 46 runs are identified in the tables summarising the results.
The first data set is based on monthly observed precipitation and temperature gridded at 0.5° × 0.5° resolution over the global land surface from Climatic Research Unit (CRU) 3.10 (New et al., 2002) for the period January 1950 to December 1999. For grid cells where monthly observations are not available, the CRU 3.10 data set is based on interpolation of observed values within a correlation decay distance of 450 km for precipitation and 1200 km for temperature. The CRU 3.10 data set provides information about the number of observations within the correlation decay distance of each grid cell for each month. In this analysis we defined a grid cell as observed if ≥ 90 % of months at that grid cell have at least one observation within the correlation decay distance for the period January 1950 to December 1999. Only observed grid cells are used to compute summary statistics in the following analysis.
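The "observed grid cell" rule above can be sketched in a few lines. This is a minimal illustration only: the function name `is_observed_cell` and the input layout (one station count per month, 600 months for 1950-1999) are assumptions made for the example, not the CRU 3.10 file format or the authors' code.

```python
def is_observed_cell(station_counts, threshold=0.90):
    """Return True if the share of months with >= 1 observation meets the threshold.

    station_counts: hypothetical list with one entry per month, giving the number
    of observations within the correlation decay distance of the grid cell.
    """
    months_with_obs = sum(1 for count in station_counts if count >= 1)
    return months_with_obs / len(station_counts) >= threshold


# 600 months (January 1950 to December 1999); 540 of them have at least one observation.
counts = [1] * 540 + [0] * 60
print(is_observed_cell(counts))
```

Cells that fail the test would simply be excluded before any summary statistic is computed.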
The second data set is monthly precipitation and temperature data for the present climate (20C3M) from 22 of the 23 GCMs listed in Table 1 and consists of 46 GCM runs. The 20C3M monthly data for precipitation and temperature were extracted from the CMIP3 data set. As shown in Table 1 the GCMs have a wide range of spatial resolutions, all of which are coarser than the observed CRU data. In order to make comparisons between observed and GCM data, the CRU and/or GCM data must be re-sampled to the same resolution. To avoid re-sampling coarse resolution data to a finer resolution we only re-sampled the CRU data here. Thus, in the following analysis the performance of each GCM is assessed at the resolution of the GCM and the CRU data are re-sampled to match the GCM resolution. Therefore, the number of grid cells in each comparison varies with the GCM resolution and ranged from 616 to 11 886 for the temperature comparisons and 425 to 8291 for the precipitation comparisons. The difference in number of grid cells between temperature and precipitation is due to more terrestrial grid cells having observed temperature data than precipitation data over the period 1950-1999.
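The text does not state which re-sampling scheme was used to bring the CRU data to each GCM grid. As one possible sketch, the code below aggregates a fine regular grid to a coarser one by averaging blocks of cells, weighting each row by the cosine of its latitude and skipping missing cells. The function name `block_average`, the integer aggregation `factor` and the use of `None` for missing cells are all assumptions for illustration; real GCM grids are generally not integer multiples of 0.5°.

```python
import math


def block_average(fine, row_lats, factor):
    """Aggregate a fine grid to a coarse grid by area-weighted block averaging.

    fine:     list of rows (north-south), each a list of values or None (missing)
    row_lats: centre latitude in degrees of each row of the fine grid
    factor:   number of fine cells per coarse cell along each axis (assumed integer)
    """
    n_rows, n_cols = len(fine), len(fine[0])
    coarse = []
    for r0 in range(0, n_rows, factor):
        coarse_row = []
        for c0 in range(0, n_cols, factor):
            weight_sum, value_sum = 0.0, 0.0
            for r in range(r0, min(r0 + factor, n_rows)):
                # Cell area on a regular lat-lon grid is proportional to cos(latitude).
                weight = math.cos(math.radians(row_lats[r]))
                for c in range(c0, min(c0 + factor, n_cols)):
                    value = fine[r][c]
                    if value is not None:
                        weight_sum += weight
                        value_sum += weight * value
            coarse_row.append(value_sum / weight_sum if weight_sum > 0 else None)
        coarse.append(coarse_row)
    return coarse


# Four 0.5 degree cells collapsed into one 1 degree cell; one fine cell is missing.
fine_grid = [[800.0, 900.0], [1000.0, None]]
print(block_average(fine_grid, row_lats=[0.75, 0.25], factor=2))
```

Aggregating the finer observed grid, rather than interpolating the coarser GCM grid, mirrors the choice described above of only re-sampling the CRU data.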
In the following analysis comparisons are made between observed and GCM values of mean and standard deviation of annual precipitation and mean annual temperature. The GCM values are based on concurrent raw (that is, not downscaled nor bias corrected) data from the 20C3M simulation. For example, if a grid cell has observed calendar-year data from 1953 to 1994, then the comparison will be made with GCM values from the 20C3M run for the concurrent calendar years 1953-1994. Although the aim of a 20C3M run from a given GCM is not to strictly replicate the observed monthly record, we expect better performing GCMs to reproduce mean annual statistics that are broadly similar to observed conditions. Average monthly precipitation and temperature patterns are also compared to assess how well GCM runs reproduce observed seasonality. Finally, we assess how well the Köppen-Geiger climate classification (Peel et al., 2007) estimated from the CMIP3 data compares with present-day gridded observed climate classification.
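The concurrent-period statistics described above can be sketched as follows. The helper names, the flat monthly list starting in January of `first_year`, the use of the sample standard deviation and the unweighted average of monthly temperatures are assumptions made for this example; the text does not specify these details.

```python
import statistics


def annual_values(monthly, first_year, start_year, end_year, reducer=sum):
    """Reduce each calendar year of a flat monthly series to one value."""
    values = []
    for year in range(start_year, end_year + 1):
        i = (year - first_year) * 12
        values.append(reducer(monthly[i:i + 12]))
    return values


def precipitation_stats(monthly_precip, first_year, start_year, end_year):
    """Return (MAP, SDP): mean and standard deviation of annual totals."""
    totals = annual_values(monthly_precip, first_year, start_year, end_year)
    return statistics.mean(totals), statistics.stdev(totals)


def mean_annual_temperature(monthly_temp, first_year, start_year, end_year):
    """Return MAT as the mean of the annual mean temperatures."""
    annual_means = annual_values(monthly_temp, first_year, start_year, end_year,
                                 reducer=statistics.mean)
    return statistics.mean(annual_means)


# Three years of made-up monthly precipitation starting in January 1950.
precip = [10.0] * 12 + [20.0] * 12 + [30.0] * 12
print(precipitation_stats(precip, first_year=1950, start_year=1951, end_year=1952))
```

Applying the same `start_year` and `end_year` to the observed series and to the GCM series for a grid cell gives the concurrent comparison, for example 1953-1994 in the case mentioned above.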

Comparison of present climate GCM data with observed data
In the analyses that follow, GCM estimates of mean annual precipitation and temperature and the standard deviation of annual precipitation are compared against observed estimates for terrestrial grid cells with ≥ 90 % observed data during the period 1950-1999.
Eight standard statistics were computed as the basis of comparison: Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970), product moment coefficient of determination (R²) (MacLean, 2005), standard error of regression (Maidment, 1992), bias (MacLean, 2005), percentage bias (Maidment, 1992), absolute percentage bias (MacLean, 2005), root mean square error (RMSE) (MacLean, 2005) and mean absolute error (MacLean, 2005). We report only the NSE, R² and RMSE in the following discussion. For our analysis, the NSE is the most useful statistic as it shows the proportion of explained variance relative to the 1:1 line in a comparison of two estimates of the same variable. R² is included because many analysts are familiar with its interpretation. Both NSE and R² were computed in arithmetic (untransformed) and natural log space. We have also included RMSE values (computed from the untransformed values) as many GCM analyses include this measure.
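The three reported statistics can be written out as a short sketch using their standard definitions. The function names and the sample values are illustrative assumptions, not the authors' code or data. The example also shows why the NSE is stricter than R²: a modelled series that is perfectly correlated with the observations but lies off the 1:1 line keeps R² = 1 while the NSE falls below zero.

```python
import math


def nse(observed, modelled):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((m - o) ** 2 for o, m in zip(observed, modelled))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / ss_obs


def rmse(observed, modelled):
    """Root mean square error between modelled and observed values."""
    n = len(observed)
    return math.sqrt(sum((m - o) ** 2 for o, m in zip(observed, modelled)) / n)


def r_squared(observed, modelled):
    """Square of the product moment correlation coefficient."""
    n = len(observed)
    mean_o, mean_m = sum(observed) / n, sum(modelled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modelled))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modelled)
    return cov ** 2 / (var_o * var_m)


def to_log_space(values):
    """Natural-log transform (values must be positive, e.g. precipitation totals)."""
    return [math.log(v) for v in values]


obs = [200.0, 500.0, 800.0, 1200.0, 2000.0]   # made-up MAP values, mm per year
mod = [2.0 * v + 100.0 for v in obs]          # perfectly correlated but strongly biased
print(r_squared(obs, mod), nse(obs, mod), rmse(obs, mod))
print(nse(to_log_space(obs), to_log_space(mod)))
```

An NSE of 1 requires the modelled values to fall on the 1:1 line; an NSE of 0 means the model does no better than the mean of the observations.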
In the following sub-sections comparisons between the concurrent raw GCM data and observed values for MAP, SDP, MAT, long-term average monthly precipitation and temperature patterns and Köppen-Geiger climate classification at the grid cell scale are presented and discussed. Although we rank the models by each selection criterion and combine the ranks by addition, we note the warning of Stainforth et al. (2007) who argue that model response should not be weighted but ruled in or out. We follow this approach in this paper by identifying better performing GCMs to be used for hydrologic simulations reported in a companion paper (Peel et al., 2015). This approach is consistent with the concept recognised by Randall et al. (2007, p. 608) that "... for models to predict future climatic conditions reliably, they must simulate the current climatic state with some as yet unknown degree of fidelity. Poor model skill in simulating present climate could indicate that certain physical or dynamical processes have been misrepresented". It is noted that our comparisons are conducted over the global terrestrial land surface rather than focussing on a single catchment, region or continent. This allows us to assess whether a GCM performs consistently well across a large area and reduces the chance of a GCM being selected due to a random high performance over a small area.
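Ranking by each criterion and combining the ranks by addition can be sketched as below. The model names, the scores and the tie handling (ties broken by input order) are hypothetical choices for the example; the criteria actually combined are those reported in the paper's tables.

```python
def ranks(scores, higher_is_better=True):
    """Map each model name to its rank (1 = best) for one criterion."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def combined_rank_sum(criteria):
    """Add the ranks from several criteria; a lower total means better overall."""
    totals = {}
    for scores, higher_is_better in criteria:
        for name, rank in ranks(scores, higher_is_better).items():
            totals[name] = totals.get(name, 0) + rank
    return totals


# Hypothetical scores for three hypothetical models.
nse_map = {"model_a": 0.60, "model_b": 0.40, "model_c": -0.10}    # higher is better
rmse_map = {"model_a": 350.0, "model_b": 300.0, "model_c": 700.0}  # lower is better
print(combined_rank_sum([(nse_map, True), (rmse_map, False)]))
```

The rank sums are used here only to separate better from poorer performers, in keeping with the rule-in or rule-out view noted above, not to weight model output.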

Mean annual precipitation
Comparisons of mean annual precipitation and the standard deviation of annual precipitation between GCM estimates and observed data for the grid cells across the 46 runs are presented in Table 4. For MAP, the NSE varied from a maximum of 0.68 (R² = 0.69) with an RMSE value of 335 mm year⁻¹ for model MIUB(3) (Meteorological Institute of the University of Bonn) to −0.54 for GISS-EH(3) (NASA Goddard Institute for Space Studies). (The GCM run number is enclosed in parentheses; for example, MIUB(3) is run 3 for the GCM MIUB.) For GISS-EH(3), based on untransformed data, the NSE is −0.54 (R² = 0.37) with an RMSE value of 697 mm year⁻¹. The range of NSE values for the MAP comparisons across the 46 GCM runs is plotted in Fig. 3. The results may be classified into four groups: 5 runs exhibiting NSE > 0.6, 27 runs with 0.4 < NSE ≤ 0.6, 6 runs with 0 < NSE ≤ 0.4 and 8 runs with NSE ≤ 0, where the predictive power of the GCM is less than that of the average observed MAP across all grid cells (Gupta et al., 2009).
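The four NSE groups can be expressed as a simple classification. The function name, group labels and the sample NSE values are illustrative assumptions; only the thresholds come from the text.

```python
from collections import Counter


def nse_group(value):
    """Assign an NSE value to one of the four groups used for the MAP comparison."""
    if value > 0.6:
        return "NSE > 0.6"
    if value > 0.4:
        return "0.4 < NSE <= 0.6"
    if value > 0.0:
        return "0 < NSE <= 0.4"
    return "NSE <= 0"


# Made-up NSE values; 0.68 and -0.54 are the extremes quoted for the 46 runs.
sample = [0.68, 0.55, 0.41, 0.40, 0.12, 0.0, -0.54]
print(Counter(nse_group(v) for v in sample))
```

Runs in the last group do no better than using the average observed MAP across all grid cells.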

Standard deviation of annual precipitation
For the standard deviation of annual precipitation, HadCM3 was the best performing model with an NSE of 0.57, R² of 0.62 and an RMSE of 51 mm year⁻¹. MIROCh also yielded an NSE of 0.57 and an R² of 0.58 but with an RMSE of 63 mm year⁻¹. These results along with other standard deviation values are listed in Table 4. Figure 4 is a plot for MIUB(3), which is representative (rank 4, that is the fourth best performance of the 46 runs) of the relationship between GCM and observed SDP, and shows the model underestimates the standard deviation of annual precipitation for high values and overestimates at low values of standard deviation compared with observed values.

Mean annual temperature
The comparisons of the GCM mean annual temperatures with concurrent observed data for the grid cells are listed for each model run in Table 4. In contrast to the precipitation modelling, the mean annual temperatures are simulated satisfactorily by most of the GCMs. Except for the IAP (Institute of Atmospheric Physics, Chinese Academy of Sciences) and the GFDL2.0 models (NSE ≈ 0.90 and 0.93, respectively), all model runs exhibit NSE values ≥ 0.94 with 17 of the 46 GCM runs having an NSE value ≥ 0.97. A comparison between MIUB(3) estimates of mean annual temperature (NSE = 0.96, rank 33) and observed values from the CRU data set is presented in Fig. 5. Also shown in Fig. 5 is a linear fit between GCM and observed MAT. The average fit for the 46 GCM runs (not shown) exhibited a small negative bias of −1.03 °C and a slope of 1.01.

Average monthly precipitation and temperature patterns
Because a monthly rainfall-runoff model is applied in the next phase of our analysis (reported in a companion paper), it is considered appropriate to assess how well the GCMs simulate the observed mean monthly patterns of precipitation and temperature (see also the argument of Charles et al., 2007). The NSE was used for the assessment by comparing the 12 long-term average monthly values. For each GCM run the average precipitation and temperature values for each month were calculated for each grid cell. NSEs were computed between the equivalent 12 GCM-based and 12 CRU-based monthly averages. The median NSE values across terrestrial grid cells where observed CRU 3.10 data were available for the period January 1950 to December 1999 for each GCM run are summarised in Table 4. As shown in Table 4, average monthly patterns of precipitation are poorly modelled. In fact, 57 % of the 46 model runs have a median NSE value < 0. For these GCMs their predictive power for the monthly precipitation pattern is less than that of the average of the 12 monthly values at each of the terrestrial grid cells. Only two GCMs have NSE values > 0.25. In contrast, the median NSEs of all monthly temperature patterns are > 0.75, with 41 % > 0.90. The NSE metric reflects how well the GCM replicates both the monthly pattern and the overall average monthly value (bias). Thus, the monthly pattern of temperature is generally well reproduced by the GCMs, whereas the monthly pattern of precipitation is not, which is mainly due to the bias in the GCM average monthly precipitation.
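The monthly-pattern assessment can be sketched as an NSE over the 12 long-term monthly means at each grid cell, followed by a median across cells. The function names and the monthly values are made up for illustration, and the sketch assumes the observed monthly means at a cell are not all identical (otherwise the NSE denominator is zero). The example shows how a GCM with the correct seasonal shape but a constant wet bias can still score below zero.

```python
import statistics


def nse(observed, modelled):
    """Nash-Sutcliffe efficiency of modelled values against observed values."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((m - o) ** 2 for o, m in zip(observed, modelled))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / ss_obs


def median_monthly_nse(obs_by_cell, gcm_by_cell):
    """Median over grid cells of the NSE between 12 observed and 12 GCM monthly means."""
    scores = [nse(obs, gcm) for obs, gcm in zip(obs_by_cell, gcm_by_cell)]
    return statistics.median(scores)


# Hypothetical long-term mean monthly precipitation (mm) at one grid cell.
observed = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 60.0, 50.0, 40.0, 30.0, 20.0, 10.0]
wet_biased = [value + 20.0 for value in observed]   # right shape, constant bias
print(nse(observed, wet_biased))
print(median_monthly_nse([observed, observed, observed],
                         [observed, wet_biased, wet_biased]))
```

Because the NSE penalises departures from the 1:1 line, bias alone is enough to push the score negative even when the seasonal timing is reproduced.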

Köppen-Geiger classification
The Köppen-Geiger climate classification (Peel et al., 2007) (see Table 5) provides an alternate way to assess how well a GCM represents climate because the classification is based on a combination of annual and monthly precipitation and temperature data. Two comparisons between the MPI(3) model and CRU observed data are presented in Table 6. The MPI(3) was chosen as an example here as over the three levels of climate classes it estimated the observed climate correctly more often than the other model runs. In Table 6a a comparison at the first letter level of the Köppen-Geiger climate classification is shown. This comparison reveals how well the GCM reproduces the distribution of broad climate types (tropical, arid, temperate, cold and polar) over the terrestrial surface. In Table 6b the comparison shown is for the second letter level of the Köppen-Geiger climate classification, which assesses how well the GCM reproduces finer detail within the broad climate types; for example, the seasonal distribution of precipitation or whether a region is semi-arid or arid. The bold diagonal values shown in Table 6a and b represent the number of grid cells correctly classified by the GCM, whereas the off-diagonal values are the number of grid cells incorrectly classified by the GCM for the one- and two-letter level. At the first letter level MPI(3) reproduces the correct climate type at 81 % of the terrestrial grid cells. Within this good performance the MPI(3) produces more polar climate and fewer tropical and cold grid cells than observed. At the second letter level, MPI(3) reproduces the correct climate type at 67 % of the terrestrial grid cells. The model produces fewer grid cells of tropical rainforest, cold with a dry winter and cold without a dry season than expected and more cold with a dry summer and polar tundra than expected. Table 7 summarises the overall proportion of GCM grid cells that were classified correctly for each GCM run across the three levels of classification. As we wish to have a ranking of the comparisons we adopted this simple measure as it is regarded as "... one of the most basic and widely used measures of accuracy ..." for comparing thematic maps (Foody, 2004, p. 632). From Table 7 we observe that GCM accuracy in reproducing the climate classification decreases as one moves from coarse to fine detail climate classification. The average accuracy (and range) for the three classes are 0.48 (0.36-0.60) for the three-letter classification, 0.57 (0.47-0.68) for the two-letter classification and 0.77 (0.66-0.82) for the one-letter classification. In other words, at the three-letter scale nearly 50 % of GCM Köppen-Geiger estimates are correct, increasing to nearly 60 % at the two-letter level and, finally, at the one-letter aggregation more than 75 % are correct across the 46 GCM runs. Using these average values across the three classes, the following seven models performed satisfactorily in identifying Köppen-Geiger climate class correctly: CNRM (Météo-France/Centre National de Recherches Météorologiques), CSIRO (Commonwealth Scientific and Industrial Research Organisation), HadCM3, HadGEM, MIUB, MPI and MRI. Of these models the least successful run was for CSIRO with the percentage correct for each class as follows: three-letter 51 %, two-letter 60 % and one-letter 78 %.
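The overall proportion correctly classified can be sketched by comparing class codes truncated to one, two or three letters. The function name and the five sample cells are hypothetical; the codes follow the usual Köppen-Geiger lettering, and the sketch assumes that the one- and two-letter levels are given by the leading letters of the full code.

```python
def proportion_correct(observed, modelled, level):
    """Share of grid cells whose class codes agree in their first `level` letters."""
    matches = sum(1 for obs, mod in zip(observed, modelled)
                  if obs[:level] == mod[:level])
    return matches / len(observed)


# Hypothetical classes at five grid cells (observed versus GCM-derived).
observed_classes = ["Af", "BWh", "Csa", "Dfc", "ET"]
modelled_classes = ["Am", "BWk", "Csa", "Dfb", "ET"]
for level in (1, 2, 3):
    print(level, proportion_correct(observed_classes, modelled_classes, level))
```

Agreement can only stay the same or fall as more letters are compared, which is consistent with the decline in accuracy from the one-letter to the three-letter level reported in Table 7.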

Relating GCM resolution to performance
In the analysis presented in the previous section each GCM's performance in reproducing observed climatological statistics was assessed at the resolution of the individual GCM.
The question of whether GCMs with a finer resolution outperform GCMs with a coarser resolution is addressed in Fig. 6, where GCM performance in reproducing observed terrestrial MAP and MAT, based on the NSE, is related to GCM resolution, defined as the number of grid cells used in the comparison. The plot suggests there is no significant relationship between GCM resolution and GCM performance beyond 1500 grid cells for either MAP or MAT. Interestingly, some lower resolution GCMs (< 1500 grid cells) perform as well as higher resolution GCMs for MAP and MAT, whereas others perform poorly. While it is sometimes assumed that higher resolution should normally lead to improved performance, many other factors affect performance. These include the sophistication of the parameterisation schemes for different sub-grid-scale processes, and the time spent in developing and testing the individual schemes and their interactions.
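For reference, the NSE that underlies this comparison can be sketched in Python using the standard Nash-Sutcliffe definition: one minus the ratio of the squared model error to the variance of the observations about their mean. The sketch assumes each list holds one long-term statistic per grid cell (for example MAP) and applies no areal weighting. That is a simplification and is not necessarily the procedure used for Fig. 6.

```python
def nse(observed, modelled):
    """Nash-Sutcliffe efficiency between paired observed and modelled values.

    1.0 is a perfect match; 0.0 means the model is no better than using
    the mean of the observations everywhere.
    """
    if len(observed) != len(modelled) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error_ss = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    obs_ss = sum((o - mean_obs) ** 2 for o in observed)
    if obs_ss == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - error_ss / obs_ss
```

Under this definition a GCM that returned the observed spatial mean at every grid cell would score 0, which puts the threshold values used later for model selection in context.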

Joint comparison of precipitation and temperature
In using GCM climate scenarios in a water resources study, it is appropriate to ensure consistency between precipitation and temperature by adopting projections of these variables from the same GCM run. Grid cell based NSEs for mean annual temperature and mean annual precipitation from each GCM are compared in Fig. 7, which illustrates the performance of each GCM for both variables. Models that have relatively high NSEs for precipitation do not necessarily have relatively high values for temperature. It is interesting to note that the rank of the models based on the NSE of MAP is unrelated to the ranking of the models based on MAT. Fortunately, however, most of the NSEs for MAT are relatively high, and the acceptance or rejection of a GCM as a better performing model is largely dependent on its precipitation characteristics.

Identifying better performing GCMs
To identify the better performing GCMs across the different variables assessed, we drew on the results in Tables 4 and 8. GCMs that achieved MAP NSE > 0.50, SDP NSE > 0.45, MAT NSE > 0.95 and mean monthly pattern of precipitation NSE > 0.0 (Table 4) were identified as better performing. (Because nearly all the GCM runs modelled mean monthly patterns of temperature satisfactorily, this measure was not considered in the selection of models listed in column 1, Table 9.) The following GCMs were selected (Table 9): HadCM3, INGV (National Institute of Geophysics and Vulcanology, Italy), MIROCm, MIUB, MPI and MRI. INGV was included although it failed the monthly precipitation pattern criterion. The above criteria were selected to identify a small number of GCMs that would require less bias correction to produce annual precipitation and temperature consistent with observations. In Table 9, we summarise our observations from the literature review in Sect. 2 and the results from our analyses in Tables 4 and 8, where we identified six GCMs that satisfied our selection criteria (Table 9, column 1). From the literature review (Table 3), eight GCMs were identified as being satisfactory. We have added MIUB because it ranked first overall in the literature review, although no guidance was available from Reichler and Kim (2008). We also added MIROCh to this list as it performed better according to Gleckler et al. (2008) than several models in the above list and met the performance index of Reichler and Kim (2008). Columns 1 and 2 of Table 9 suggest there is some consistency between our analyses from a hydrologic perspective and those reported in the literature from a climatological perspective. From the table, we identify that, in terms of our objective to assess how well the CMIP3 GCMs are able to reproduce observed annual precipitation and temperature statistics and the mean monthly patterns of precipitation and temperature, the following models are deemed acceptable for the next phase of our project: HadCM3, MIROCm, MIUB, MPI and MRI. Although the Köppen-Geiger climate assessment was not used in the selection criteria, we observe that our selected GCMs performed well in it. We note here that INGV also performed satisfactorily, but it was not included in our adopted GCMs as it was not reviewed in the papers of Gleckler et al. (2008), Reichler and Kim (2008) and Macadam et al. (2010).
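The threshold screening stated in the paper (MAP NSE > 0.50, SDP NSE > 0.45, MAT NSE > 0.95 and mean monthly precipitation pattern NSE > 0.0) can be sketched as a simple filter. The thresholds are taken from the text; the run names, score values and dictionary keys are hypothetical placeholders and are not results from Table 4. The sketch applies the criteria strictly, so it does not reproduce the judgement-based inclusion of INGV.

```python
# NSE thresholds as stated in the text; the key names are illustrative.
THRESHOLDS = {
    "map": 0.50,             # mean annual precipitation
    "sdp": 0.45,             # standard deviation of annual precipitation
    "mat": 0.95,             # mean annual temperature
    "monthly_precip": 0.0,   # mean monthly pattern of precipitation
}


def passes_criteria(scores, thresholds=THRESHOLDS):
    """True if every NSE score strictly exceeds its threshold."""
    return all(scores[key] > limit for key, limit in thresholds.items())


def select_runs(runs, thresholds=THRESHOLDS):
    """Names of runs that satisfy all criteria, in alphabetical order."""
    return sorted(name for name, scores in runs.items()
                  if passes_criteria(scores, thresholds))


# Hypothetical NSE scores for three runs (not values from the paper).
runs = {
    "run_A": {"map": 0.60, "sdp": 0.50, "mat": 0.97, "monthly_precip": 0.30},
    "run_B": {"map": 0.45, "sdp": 0.50, "mat": 0.98, "monthly_precip": 0.20},
    "run_C": {"map": 0.55, "sdp": 0.46, "mat": 0.96, "monthly_precip": -0.10},
}
selected = select_runs(runs)
```

Here only `run_A` passes: `run_B` fails the MAP threshold and `run_C` fails the monthly precipitation pattern criterion.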

Comparing future responses of selected GCMs
In order to confirm that the selected GCM runs are representative of the range of future responses to climate change in the CMIP3 ensemble, we plot in Fig. 8 the ratio of mean annual precipitation for the period 2015-2034 (from the A1B scenario) to 1965-1994 against the mean annual temperature difference between 2015-2034 and 1965-1994 for the global land surface. The five selected GCM runs are well distributed amongst the 44 GCM ensemble members, which indicates that the selected GCMs are reasonably representative of the range of future GCM projections if all the runs were considered. We observe that most GCM runs are clustered around the median response, except for the seven CCSM runs in the top right quadrant with a precipitation ratio > ∼ 1.04.
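The two quantities compared in Fig. 8 can be sketched in Python as follows. The sketch assumes that annual series for the global land surface have already been aggregated for each period; any areal weighting needed for that step is omitted. The function names and the numbers are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def future_response(precip_base, precip_future, temp_base, temp_future):
    """Return (precipitation ratio, temperature difference) between a
    future period and a baseline period, each given as annual values."""
    ratio = mean(precip_future) / mean(precip_base)
    delta_t = mean(temp_future) - mean(temp_base)
    return ratio, delta_t


# Hypothetical annual values (mm per year and degrees C), illustration only.
ratio, delta_t = future_response(
    precip_base=[800.0, 820.0, 780.0],
    precip_future=[830.0, 850.0, 816.0],
    temp_base=[14.0, 14.2, 13.8],
    temp_future=[15.0, 15.1, 14.9],
)
```

Each GCM run contributes one such pair, so the selected runs can be checked against the spread of the full ensemble.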

Conclusions
Our primary objective in this paper is to identify better performing GCMs from a hydrologic perspective over global land regions. The better performing GCMs were identified by their ability to reproduce observed climatological statistics (mean and standard deviation of annual precipitation, mean annual temperature, and the mean monthly patterns of precipitation and temperature) for hydrologic simulation. The GCM selection process was informed by our results presented here and by a literature review of CMIP3 GCM performance. In terms of the NSE there was a large spread in values for mean annual precipitation and the standard deviation of annual precipitation over concurrent periods. The highest NSE was 0.68 for mean annual precipitation and 0.57 for the standard deviation of annual precipitation. On the other hand, for mean annual temperature, the NSEs between modelled and observed data were very high, with a median NSE of 0.97. Overall, all GCMs reproduced the Köppen-Geiger climate satisfactorily at the broad first letter level. From the literature, the following GCMs were identified as being suitable to simulate annual precipitation and temperature statistics: CCCMA-T47, CCSM, GFDL2.0, GFDL2.1, HadCM3, MIROCh, MIROCm, MIUB, MPI and MRI. After combining our results with the literature, the following GCMs were considered the better performing models from a hydrologic perspective: HadCM3, MIROCm, MIUB, MPI and MRI. The future response of the better performing GCMs was found to be representative of the 44 GCM ensemble members, which confirms that the selected GCMs are reasonably representative of the range of future GCM projections. Our approach for evaluating GCM performance for hydrologic simulation could be applied to CMIP5 runs.
The Supplement related to this article is available online at doi:10.5194/hess-12-361-2015-supplement.
(17 GCMs) and Knutti et al. (2010) (23 GCMs) address, inter alia, only mean annual temperature. Hagemann et al. (2011) used three GCMs to estimate precipitation and temperature characteristics, but the paper includes only precipitation results.

Figure 8. Ratio of 2015-2034 to 1965-1994 mean annual precipitation compared with the change in mean annual temperature (2015-2034 to 1965-1994) for the selected five CMIP3 GCM runs compared with the 23 CMIP3 GCMs including all ensemble members for the global land surface.

Table 1. Details of 23 GCMs considered in this paper. Based on mean annual temperature comparison between GCM and CRU. c Based on mean annual precipitation comparison between GCM and CRU. b

Table 2a. Numerical measures of performance assessment of CMIP3 GCMs.

Table 2b. Ranking measures of performance assessment of CMIP3 GCMs.

Table 3. Summary of performance of 23 CMIP3 GCMs in simulating present climate based on literature review.
Smith and Chandler (2010). (The performance index is based on the error variance between modelled and observed climate for 14 climate and ocean variables. "Yes" indicates the variance error is less than the median across the GCMs.) a na: not available or not applicable. b Rank 1 is best rank. # More than one GCM with this rank.

Table 4. Performance statistics comparing CMIP3 GCM mean and standard deviation of annual precipitation, mean annual temperature, and mean monthly patterns of precipitation and temperature with concurrent observed data. (Analysis based on untransformed data.)

Table 6. Köppen-Geiger climate estimated by MPI(3) compared with the observed Köppen-Geiger climate for (a) the one-letter and (b) the two-letter climate classification. Bold values are correctly classified grid cells.

Figure 6 axis labels: values for MAP and MAT; number of terrestrial GCM grid cells for precipitation and temperature.
Our purpose here is to report this observation rather than speculate what it might mean for GCM
Hydrol. Earth Syst. Sci., 19, 361-377, 2015 www.hydrol-earth-syst-sci.net/19/361/2015/

Table 8. CMIP3 GCM run rank (rank 1 = best) based on Nash-Sutcliffe efficiency (NSE) values from comparison of 20C3M and concurrent observed grid cell data.

Table 9. Better performing CMIP3 GCMs identified from the literature and our analyses.

Table 3
* Added to list - see Section 5.3 for explanation.

Table 4
GCMs that achieved MAP NSE > 0.50, SDP NSE > 0.45, MAT NSE > 0.95 and mean monthly pattern of precipitation NSE > 0.0 (Table 4) were identified as better performing. (Because nearly all the GCM runs modelled mean monthly patterns of temperature satisfactorily,