Selection of multi-model ensemble of GCMs for the simulation of precipitation based on spatial assessment metrics

The climate modelling community has trialled a large number of metrics to evaluate the temporal performance of Global Circulation Models (GCMs) for the selection of GCMs, while very little attention has been given to the spatial performance of GCMs, which is equally important. This study evaluated the performance of 20 Coupled Model Intercomparison Project Phase 5 (CMIP5) GCMs pertaining to their skills in simulating mean annual, monsoon and winter precipitation over Pakistan using state-of-the-art spatial metrics: SPAtial EFficiency, Goodman-Kruskal's lambda, Fractions Skill Score, Cramer's V, Mapcurves, and Kling-Gupta efficiency, for the period 1961-2005. The multi-model ensemble (MME) precipitation was generated through intelligent merging of simulated precipitation of selected GCMs employing Random Forest (RF) regression and Simple Mean (SM). The results indicated some differences in the ranks of GCMs for different metrics. The overall ranks indicated NorESM1-M, CESM1-CAM5, GFDL-CM3 and GFDL-ESM2G as the best GCMs in simulating the spatial patterns of mean annual, monsoon and winter precipitation over Pakistan. MME precipitation generated based on the best performing GCMs showed more similarities with observed precipitation compared to precipitation simulated by individual GCMs. The MME developed using RF displayed better performance than the MME based on SM. Multiple spatial metrics have been used for the first time for selecting GCMs based on their capability to mimic the spatial patterns of annual and seasonal precipitation. The approach suggested in the present study can be extended to any number of GCMs and climate variables, and is applicable to any region for the suitable selection of an ensemble of GCMs to reduce uncertainties in climate projections.


Introduction
Climate change is a complex, multidimensional phenomenon that has been critically studied over the last few decades (Byg and Salick, 2009; Cameron, 2011). The changes in climate are mostly observed by studying the variations in precipitation and temperature regimes (Sheffield and Wood, 2008). Several studies reported an increase in the severity and frequency of droughts, floods, heatwaves and cold snaps in recent years, which is indicative of abrupt variations in the precipitation and temperature regimes (Duffy et al., 2015). According to the Intergovernmental Panel on Climate Change (IPCC) 5th Assessment Report (AR5), the average global land and ocean temperature has risen by around 0.72 °C (0.49-0.89 °C) during 1951-2012. It is projected that it will further increase by 1.8 °C to 4 °C by the end of the 21st century (IPCC, 2014). The climate modelling community has widely agreed that the sharp temperature rise in the post-industrial revolution era is significantly affecting the global hydrologic cycle (Hegerl et al., 2018). The spatiotemporal variations in the global hydrologic cycle affect humans and the environment. Therefore, it is important to study the variations in spatiotemporal patterns of climate variables such as precipitation and temperature (Akhter et al., 2016).
Global Circulation Models (GCMs) are principally utilised to simulate and project climate on a global scale (Wright et al., 2015; Sachindra et al., 2014). Over the years, a large number of GCMs have been developed and used for the simulation and projection of global climate. The Coupled Model Intercomparison Project Phase 5 (CMIP5) is a set of GCMs available from the IPCC AR5. The CMIP5 GCMs showed significant improvements in climate simulations compared to the previous generation of CMIP3 models (Wang et al., 2016). Currently, over 40 GCMs are available in the CMIP5 suite with different spatial resolutions (Demirel and Moradkhani, 2015). Human and computational resources pose a restriction on the size of the sub-set of GCMs used in a climate change impact assessment (Ekström et al., 2016). Salman et al. (2018b) and Pour et al. (2018b) reported that a multi-model ensemble (a sub-set) of GCMs selected considering their skills in reproducing past observed characteristics of climate can reduce the GCM associated uncertainties in climate change impact assessment. Multi-model ensembles (MMEs) also enhance the reliability of prediction by using information from several sources or GCMs (Pavan and Doblas-Reyes, 2000; Knutti et al., 2010).
The methods used for the generation of MMEs are broadly divided into two groups: (1) the simple composite method (SCM) and (2) the weighted ensemble method (WEM) (Wang et al., 2017b). In the SCM all ensemble members are equally weighted, while in the WEM ensemble members are weighted according to their performance in simulating the past climate (Wang et al., 2017a; Oh and Suh, 2017; Giorgi and Mearns, 2002). The SCM is relatively simple to use and has been found to perform better than individual GCMs (Weigel et al., 2010; Fu et al., 2018; Dong et al., 2018). However, the WEM is preferred as it has the capability to remove systematic biases and improve the prediction capability, since better GCMs are assigned higher weights (Krishnamurti et al., 2000; Krishnamurti et al., 1999). Salman et al. (2018a) reported that the prediction capability of an MME improves if it is based on the WEM. Thober and Samaniego (2014) also showed that sub-ensembles generated using the WEM have a better capability to capture the historical characteristics of precipitation and temperature extremes. The performance of an MME depends on the performance of its ensemble members in simulating historical climate (Pour et al., 2018a). Therefore, selection of a sub-ensemble is a major challenge in climate change modelling.
Numerous endeavours have been made to examine the adequacy of climate models in simulating various climate variables (e.g. precipitation) (McMahon et al., 2015; Gu et al., 2015). Smith et al. (1998) ... (3) Validity, where the performance of GCMs is considered, and (4) Representativeness, where an ensemble of GCMs covering a wide range of projections of a climate variable (e.g. precipitation) is considered. Among the above criteria, assessment and selection of GCMs based on their validity is the most widely adopted criterion, where GCMs are ranked and selected according to their skill in simulating the observed past climate (Mendlik and Gobiet, 2016).
A wide variety of methods has been used to assess climate models based on their ability to simulate the observed historical climate (past performance), such as the reliability ensemble averaging approach (Giorgi and Mearns, 2002), relative entropy (Shukla et al., 2006), the Bayesian approach (Min and Hense, 2006; Tebaldi et al., 2005; Chandler, 2013), probability density functions (Perkins et al., 2007), hierarchical ANOVA models (Sansom et al., 2013), clustering (Knutti et al., 2013), correlation (Xuan et al., 2017; Jiang et al., 2015), and symmetrical uncertainty (Salman et al., 2018b). Johnson and Sharma (2009) assessed the performance of GCMs in replicating inter-annual variability. Thober and Samaniego (2014) evaluated the performance of GCMs in reproducing extreme indices of precipitation and temperature. Apart from that, some studies combined several performance measures such as root mean square error, mean absolute error, correlation coefficient, and skill scores into one performance index to assess the accuracy of GCMs in reproducing past climate (Gu et al., 2015; Barfus and Bernhofer, 2015; Gleckler et al., 2008b; Wu et al., 2016; Ahmadalipour et al., 2015; Raju et al., 2016). Moreover, the past performance assessment of GCMs is performed at different temporal scales: daily (Perkins et al., 2007), monthly (Raju et al., 2016), seasonal (Ahmadalipour et al., 2015) and annual (Murphy et al., 2004). Besides temporal scales, a number of studies ranked GCMs based on the spatial areal average (Ahmadalipour et al., 2015; Abbasian et al., 2018), while some studies considered GCM performance at all the grid points covering the study area (Raju et al., 2016; Salman et al., 2018b).
It is also observed in the literature that there is no consensus on the choice of the GCM selection approach and the temporal scale at which the performance assessment is done. Räisänen (2007), Smith and Chandler (2010) and McMahon et al. (2015) also argued that there is no universally accepted criterion for the assessment of GCMs. However, McMahon et al. (2015) reported that GCM simulations at the annual time scale can better reproduce long-term annual mean statistics compared to those at the daily time scale. Gleckler et al. (2008a) stated that assessment of GCMs with respect to a climate variable like precipitation over multiple time scales or seasons may provide vital information to water resources managers, especially in regions where climate variability is high. Moreover, Raju et al. (2016) and Salman et al. (2018b) demonstrated that GCM assessment provides more useful information when the evaluation is conducted at individual grid points covering the study area of interest. However, selection of GCMs based on their performance at individual grid points over a region does not guarantee their capability to simulate the spatial pattern of regional climate. It is expected that GCMs should be able to capture the spatial pattern of major features of the climate of a region, such as the monsoon and western disturbances. Koch et al. (2018) and Demirel et al. (2018) argued that the climate modelling community is mostly focused on the temporal performance of GCMs and ignores explicit assessment of their spatial performance, which is equally important. They also emphasized the importance of using multiple spatial metrics for GCM performance assessment. Furthermore, the metrics should be insensitive to the units of the variables compared. Overall, the review of literature revealed that several studies assessed the performance of GCMs considering several grid points over the whole study area; however, they ignored the capability of GCMs to replicate the spatial patterns. Spatial patterns of GCMs provide a better understanding of the occurrence of hydro-climatic phenomena such as precipitation distributions, floods and droughts. Therefore, it is imperative to assess the skills of GCMs in replicating the historical spatial patterns of climate variables. Within this framework, the current study hypothesized that the sub-ensemble members identified based on their ability to mimic the spatial pattern of observed precipitation of a region can be used for the generation of a reliable MME for precipitation for that region. This study, for the first time, employed six state-of-the-art spatial performance metrics, namely the SPAtial EFficiency metric (SPAEF) (Demirel et al., 2018), Goodman-Kruskal's lambda (Goodman and Kruskal, 1954), Fractions Skill Score (FSS) (Roberts and Lean, 2008), Cramer's V (Cramér, 1999), Mapcurves (Hargrove et al., 2006) and Kling-Gupta efficiency (KGE) (Gupta et al., 2009), for the assessment of the performance of 20 CMIP5 GCMs in simulating observed annual, monsoon and winter precipitation over Pakistan. Then, based on the above spatial performance metrics, the most skilful GCMs were identified, and multi-model ensemble (MME) means of precipitation were generated using Simple Mean (SM) and Random Forest (RF).

Study area
As shown in Figure 1, Pakistan, located in south Asia, shares its borders with India in the east, China in the north, Afghanistan and Iran in the west and the Arabian Sea in the south. Pakistan has a rugged topography ranging from 0 m in the south to 8572 m in the north. Pakistan is dominated by arid and semi-arid climate, and displays significant climatic variations. Pakistan receives summer monsoon precipitation during the period June-September and winter precipitation during the period December-March. Besides that, there are two intermediate rainy seasons called the pre-monsoon and the post-monsoon during the periods April-May and October-November, respectively (Sheikh, 2001).
The bulk of the summer precipitation is caused by the monsoon winds that arise from the Bay of Bengal, while westerly disturbances from the Mediterranean Sea are responsible for the winter precipitation. The average precipitation in Pakistan varies widely from the southwest to the northern parts, in the range of < 100 to > 1000 mm/year. Since the country is mostly characterized by arid and semi-arid climate, the bulk of the country receives less than 500 mm/year of precipitation, while only a very limited area in the north receives more than 1000 mm/year of precipitation (Ahmed et al., 2017).

Gridded Precipitation Data
The lack of long records of climate observations with an extensive spatial coverage is a major issue in hydro-climatological investigations in many regions. As a solution to this problem, gridded data sets based on observations and various interpolation and data assimilation techniques have been created (Kishore et al., 2015). In this investigation, gridded monthly precipitation data of the Global Precipitation Climatology Center (GPCC) (Schneider et al., 2013) were used as the surrogates of observed precipitation for the period 1961-2005. GPCC precipitation data are available at a spatial resolution of 0.5°.
As stated in the existing literature, GPCC data are of high quality (Shiru et al., 2018; Salman et al., 2018c) and have an excellent seamless spatial and temporal coverage (Spinoni et al., 2014). Most importantly, GPCC precipitation data have shown high correlation with observed precipitation over Pakistan (Kazmi et al., 2016).

GCM precipitation data
Monthly precipitation data simulated by the 20 CMIP5 GCMs were extracted from the IPCC data distribution center for the period 1961-2005. The monthly precipitation projections for all Representative Concentration Pathways (RCPs) are only available for 20 of the CMIP5 GCMs. Hence, only those 20 GCMs were used for the current investigation. The modelling centres, names of GCMs and spatial resolution of each of the selected GCMs are provided in Table 1. For the sake of fair comparison, monthly precipitation simulations of all selected CMIP5 GCMs were remapped to a common grid with a resolution of 2° × 2° (approximately the average resolution of the GCMs considered in the present study), using the bilinear interpolation technique.
Table 1. CMIP5 GCMs considered in this study.
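To illustrate the remapping step, the following is a minimal sketch of bilinear interpolation at a single target point inside one source grid cell. The function name `bilinear` and its argument layout are hypothetical and chosen only for illustration; the study does not state which tool was used, and operational regridding software additionally handles spherical geometry, grid edges and missing values.

```python
def bilinear(x, y, x0, x1, y0, y1, f00, f10, f01, f11):
    """Bilinear interpolation inside the cell [x0, x1] x [y0, y1].

    f00 is the value at (x0, y0), f10 at (x1, y0),
    f01 at (x0, y1) and f11 at (x1, y1).
    """
    tx = (x - x0) / (x1 - x0)  # fractional position along x (longitude)
    ty = (y - y0) / (y1 - y0)  # fractional position along y (latitude)
    bottom = f00 * (1 - tx) + f10 * tx  # interpolate along x at y0
    top = f01 * (1 - tx) + f11 * tx     # interpolate along x at y1
    return bottom * (1 - ty) + top * ty  # interpolate along y


# Illustrative values only (not taken from any GCM).
centre = bilinear(1.0, 1.0, 0.0, 2.0, 0.0, 2.0, 10.0, 20.0, 30.0, 40.0)
print(centre)
```

Applying this at every node of the common 2° × 2° grid, using the four surrounding source-grid values each time, gives the remapped field.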

Methodology
In this study, GCMs for the annual, monsoon and winter seasons were first ranked separately (individual ranking) using six spatial performance measures: SPAEF, Lambda, FSS, Cramer-V, Mapcurves, and KGE. Then a comprehensive rating metric (RM) (Jiang et al., 2015) was used to rank the GCMs considering the individual ranks determined corresponding to all of the above spatial performance measures. The RM values of GCMs obtained for annual, monsoon and winter precipitation were then averaged for the overall ranking of GCMs. Finally, a sub-set of GCMs (MME) based on the overall ranks was selected and a precipitation data set for the MME was derived. The procedure used for the ranking, identification of the ensemble of GCMs and derivation of precipitation data from the multi-model ensemble of GCMs is outlined as follows.

SPAtial EFficiency metric
SPAtial EFficiency metric (SPAEF), proposed by Koch et al. (2018), is a robust spatial performance metric which considers three statistical measures, (1) Pearson correlation, (2) coefficient of variation and (3) histogram overlap, in the assessment of the goodness of fit (GOF) of a model. The major advantage of SPAEF is that it combines the information derived from these three independent statistical measures into one metric. The SPAEF between past observed precipitation (i.e. GPCC) and GCM simulated precipitation was calculated using Eq. (1).

$$SPAEF = 1 - \sqrt{(\alpha - 1)^2 + (\beta - 1)^2 + (\gamma - 1)^2} \qquad (1)$$

In Eq. (1), $\alpha$ is the Pearson correlation coefficient between observed and GCM simulated precipitation, $\beta$ is the spatial variability and $\gamma$ is the overlap between the histograms of observed precipitation and GCM simulated precipitation. Equations (2) and (3) show the procedure for the $\beta$ and $\gamma$ calculations respectively (for the Pearson correlation ($\alpha$), refer to Pearson, 1948).

$$\beta = \frac{\sigma_G / \mu_G}{\sigma_O / \mu_O} \qquad (2)$$

$$\gamma = \frac{\sum_{j=1}^{n} \min(K_j, L_j)}{\sum_{j=1}^{n} K_j} \qquad (3)$$

In Eq. (2), $\sigma_G$ and $\sigma_O$ refer to the standard deviation of GCM simulated and observed precipitation respectively, and $\mu_G$ and $\mu_O$ refer to the mean of GCM simulated and observed precipitation respectively. In Eq. (3), K, L and n refer to the histogram values of observed precipitation, the histogram values of GCM simulated precipitation and the number of bins in a histogram respectively. The SPAEF can have a value between −∞ and 1, where a value closer to 1 indicates higher spatial similarity between the observations and model simulations (Koch et al., 2018).
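The SPAEF calculation described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `spaef`, the default of 10 histogram bins, and the z-score standardisation applied before the histograms are built are assumptions not given in the text, and the precipitation values are invented for the example.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def _pearson(x, y):
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def _histogram(values, lo, hi, bins):
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return counts


def spaef(obs, sim, bins=10):
    """SPAEF = 1 - sqrt((alpha-1)^2 + (beta-1)^2 + (gamma-1)^2)."""
    alpha = _pearson(obs, sim)                                  # correlation
    beta = (_std(sim) / _mean(sim)) / (_std(obs) / _mean(obs))  # CV ratio
    # Assumption: histograms are compared on z-scores so that a bias in the
    # mean does not affect the overlap term.
    zo = [(v - _mean(obs)) / _std(obs) for v in obs]
    zs = [(v - _mean(sim)) / _std(sim) for v in sim]
    lo, hi = min(zo + zs), max(zo + zs)
    k = _histogram(zo, lo, hi, bins)
    l = _histogram(zs, lo, hi, bins)
    gamma = sum(min(a, b) for a, b in zip(k, l)) / sum(k)       # overlap
    return 1 - math.sqrt((alpha - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


# Invented grid-point values (mm/year), for illustration only.
observed = [100.0, 250.0, 400.0, 800.0, 1200.0]
simulated = [150.0, 200.0, 500.0, 700.0, 900.0]
print(spaef(observed, simulated))
```

A field compared with itself returns a value of 1 (up to rounding), which is the upper bound of the metric.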

Goodman-Kruskal's lambda
Goodman-Kruskal's lambda, also known as the Lambda coefficient ($\lambda$), is used to measure the nominal/categorical association between categorical maps (Goodman and Kruskal, 1954). The Lambda coefficient ($\lambda$) varies between 0 and 1, where a value closer to 1 refers to a higher similarity between the map of model simulations and that of observations of precipitation. The Lambda ($\lambda$) coefficient was calculated using Eq. (4), where $c_{ij}$ is a contingency matrix (it describes the relationships between the data classes), i and j are the classes or categories in the observed and simulated maps, and m and n represent the number of classes in the observed and simulated maps respectively.
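As an illustration of how such a coefficient is obtained from a contingency matrix, the sketch below computes one common (asymmetric) form of Goodman-Kruskal's lambda, in which the simulated class is predicted from the observed class. This choice is an assumption: the equation is not reproduced in the text here, and the study may have used the symmetric form instead. The function name and the example tables are hypothetical.

```python
def goodman_kruskal_lambda(table):
    """Asymmetric lambda for a contingency table.

    Rows are observed classes, columns are simulated classes, and each
    entry counts the grid cells falling in that pair of classes.
    """
    n = sum(sum(row) for row in table)
    column_totals = [sum(col) for col in zip(*table)]
    # Correct guesses when always predicting the most frequent simulated class.
    best_without_rows = max(column_totals)
    # Correct guesses when predicting the modal simulated class per observed class.
    best_with_rows = sum(max(row) for row in table)
    return (best_with_rows - best_without_rows) / (n - best_without_rows)


# Two-class toy tables (counts are invented for illustration).
perfect_match = [[10, 0], [0, 10]]
no_association = [[5, 5], [5, 5]]
print(goodman_kruskal_lambda(perfect_match), goodman_kruskal_lambda(no_association))
```

In practice the continuous precipitation maps first have to be classified into categories; the class boundaries are a further choice that is not specified here.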

Fractions Skill Score
The Fractions Skill Score (FSS), proposed by Roberts and Lean (2008), is another measure used for the assessment of spatial agreement between model simulations and observations. In this study, FSS between observed and GCM simulated precipitation was computed using Eq. (5). FSS varies between 0 and 1, where a value closer to 1 refers to higher agreement between observed and simulated precipitation.

$$FSS = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\left(P_{s,i} - P_{o,i}\right)^2}{\frac{1}{N}\left[\sum_{i=1}^{N} P_{s,i}^2 + \sum_{i=1}^{N} P_{o,i}^2\right]} \qquad (5)$$

In Eq. (5), $P_s$ and $P_o$ are simulated and observed precipitation respectively, whereas N refers to the total number of grid points.
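A minimal sketch of the FSS computation, assuming the mean-squared-difference form of Roberts and Lean (2008) applied directly to the simulated and observed values at the N grid points. The function name `fss` and the example values are hypothetical; in the original formulation the inputs are neighbourhood fractions of threshold exceedance, and how the fields were prepared in this study is not stated here.

```python
def fss(sim, obs):
    """Fractions Skill Score: 1 - MSE / reference MSE."""
    n = len(obs)
    mse = sum((s - o) ** 2 for s, o in zip(sim, obs)) / n
    # Largest MSE obtainable from these two fields (no overlap at all).
    reference = (sum(s ** 2 for s in sim) + sum(o ** 2 for o in obs)) / n
    return 1 - mse / reference


# Invented fraction fields for illustration.
observed = [0.0, 0.2, 0.6, 1.0]
simulated = [0.1, 0.3, 0.5, 0.8]
print(fss(simulated, observed))
```

Identical fields give a score of 1, while two fields with no overlap give 0.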

Cramer's V
Cramer's V (Cramér, 1999) statistic is a Chi-square-test-based measure which is used in assessing spatial agreement between observations and model simulations (Zawadzka et al., 2015). Its value ranges between 0 and 1 and can be calculated using Eq. (6).

$$V = \sqrt{\frac{\chi^2}{N \, \min(m - 1,\, n - 1)}} \qquad (6)$$

where $\chi^2$ is the Chi-square statistic, N is the grand total of observations, m is the number of rows and n is the number of columns. In this exercise m = 42 (number of rows of data) and n = 2 (observed and modelled precipitation).
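The following sketch computes the Chi-square statistic from a contingency table and converts it to Cramer's V. The function name `cramers_v` and the small example tables are hypothetical; the sketch assumes every row and column total is non-zero, and it does not reproduce how the 42 × 2 table of this study was assembled.

```python
import math


def cramers_v(table):
    """Cramer's V for an m x n table of non-negative counts."""
    m, n = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(m):
        for j in range(n):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (grand_total * min(m - 1, n - 1)))


# Invented counts for illustration.
print(cramers_v([[10, 0], [0, 10]]))  # complete association
print(cramers_v([[5, 5], [5, 5]]))    # no association
```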

Mapcurves
Mapcurves is another statistical measure, developed by Hargrove et al. (2006), for the measurement of similarity between categorical maps. Mapcurves provides the degree of spatial association between two maps. The value of Mapcurves can vary from 0 to 1 (perfect agreement). In the present study, the degree of spatial association between the historical observed precipitation map (i.e. GPCC precipitation) and each of the GCM simulated precipitation maps was determined using Eq. (7), where Y refers to the Mapcurves value, C is the degree of intersection between the two maps, and A and B are the total areas of the historical and GCM simulated maps respectively.

Kling-Gupta efficiency
Kling-Gupta efficiency (KGE) is a GOF test developed by Gupta et al. (2009) for model performance assessment. KGE considers three statistical measures, (1) Pearson correlation, (2) variability ratio and (3) bias ratio, in the assessment of model performance. In the present study, KGE was calculated between historical observed precipitation and GCM simulated precipitation using Eq. (8).

$$KGE = 1 - \sqrt{(r - 1)^2 + (\beta - 1)^2 + (\gamma - 1)^2} \qquad (8)$$

In Eq. (8), $r$ is the Pearson correlation (Pearson, 1948) between observed and GCM simulated precipitation, $\beta$ is the bias ratio, and $\gamma$ is the variability ratio. Equations (9) and (10) show the calculation of $\beta$ and $\gamma$ respectively.

$$\beta = \frac{\mu_G}{\mu_O} \qquad (9)$$

$$\gamma = \frac{CV_G}{CV_O} \qquad (10)$$

In Eq. (9), $\mu_G$ and $\mu_O$ refer to the mean of GCM simulated and observed precipitation respectively, whereas in Eq. (10), $CV_G$ and $CV_O$ refer to the coefficient of variation of GCM simulated and observed precipitation respectively.
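The KGE calculation described above, with the bias ratio taken as the ratio of means and the variability ratio as the ratio of coefficients of variation, can be sketched as follows. The function name `kge` and the sample values are hypothetical and serve only as an illustration.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def _pearson(x, y):
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, bias ratio and CV ratio."""
    r = _pearson(obs, sim)
    bias_ratio = _mean(sim) / _mean(obs)
    variability_ratio = (_std(sim) / _mean(sim)) / (_std(obs) / _mean(obs))
    return 1 - math.sqrt(
        (r - 1) ** 2 + (bias_ratio - 1) ** 2 + (variability_ratio - 1) ** 2
    )


# Invented values (mm/year), for illustration only.
observed = [100.0, 250.0, 400.0, 800.0, 1200.0]
doubled = [2 * v for v in observed]
print(kge(observed, observed), kge(observed, doubled))
```

In the second call the simulated field has the right pattern and relative variability but twice the mean, so only the bias term contributes and the score drops from 1 to about 0.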

Comprehensive Rating Metrics
The ranking of GCMs with respect to a given climate variable using one single GOF measure is a relatively simple task. However, the ranking of GCMs becomes more challenging when multiple GOF measures are used with multiple climate variables, as different GCMs may display different degrees of accuracy for different GOF measures and climate variables. In such a case, an information aggregation approach that combines information from several GOF measures can be used. In this study, a comprehensive rating metric (Chen et al., 2011) was used to obtain the overall ranks of GCMs. The overall ranks of GCMs based on different GOFs were obtained for each season separately using Eq. (11).

$$RM = 1 - \frac{1}{n\,m}\sum_{i=1}^{n} rank_i \qquad (11)$$

In Eq. (11), m refers to the number of GCMs, n refers to the number of metrics or seasons and $rank_i$ refers to the rank of a GCM based on the i-th GOF. A value of RM near 1 refers to a better GCM in terms of its ability to mimic the spatial or temporal characteristics of observations.
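As a worked illustration of the rating metric, the sketch below assumes the form RM = 1 − (sum of ranks) / (n · m), which is consistent with the definitions of m and n given above. The function name `rating_metric` is hypothetical. The six ranks are those quoted in the results for BCC-CSM1.1 (m) for annual precipitation; the resulting value is computed here for illustration and is not taken from Table 4.

```python
def rating_metric(ranks, n_models):
    """RM = 1 - sum(ranks) / (number of metrics * number of models)."""
    n_metrics = len(ranks)
    return 1 - sum(ranks) / (n_metrics * n_models)


# Ranks for SPAEF, Lambda, FSS, Cramer-V, Mapcurves and KGE.
bcc_ranks = [12, 11, 12, 13, 14, 19]
print(rating_metric(bcc_ranks, n_models=20))

# A GCM ranked first by all six metrics gets the highest attainable value
# for 20 models (about 0.95), since the best possible rank is 1 rather than 0.
print(rating_metric([1] * 6, n_models=20))
```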

Identification of Ensemble Members
The uncertainties in climate projections that arise from GCM structure, assumptions and approximations, initial conditions, and parameterization can be reduced by identifying an ensemble of better performing GCMs (Kim et al., 2015). Lutz et al. (2016) reported that one or a small ensemble of GCMs is suitable for climate change impact assessment. A number of studies (Weigel et al., 2010; Miao et al., 2012) have suggested that one GCM is not enough to assess the uncertainties associated with the future climate. Therefore, identification of an ensemble of GCMs is a necessity in climate change impact assessment.
In the present study, the most appropriate ensemble of GCMs was identified by considering the four top ranked GCMs. The

Development of Multi-model Ensemble (MME) Mean
The uncertainties in projections of a climate variable can be reduced by using its mean time series calculated from an MME of better performing GCMs (You et al., 2018). Numerous approaches are documented in the literature for the calculation of a mean time series from an ensemble of better performing GCMs, ranging from the simple arithmetic mean to machine learning algorithms (Kim et al., 2015). In the present study, two approaches, (1) Simple Mean (SM) and (2) Random Forest (RF) (Breiman, 2001), were used in the calculation of the mean time series of precipitation corresponding to an ensemble of the four top ranked GCMs.
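The two merging approaches can be sketched as below. This is an illustration under assumptions, not the configuration used in the study: the function names, the number of trees and the random seed are hypothetical, and the text does not describe how the RF model was trained or validated. The RF helper relies on scikit-learn's `RandomForestRegressor`, imported lazily so that the simple-mean helper runs without that dependency.

```python
def simple_mean_mme(members):
    """Equal-weight ensemble mean.

    members: list of equal-length sequences, one per GCM, holding
    precipitation at the same grid points or time steps.
    """
    return [sum(values) / len(values) for values in zip(*members)]


def random_forest_mme(members, observed, n_trees=100, seed=0):
    """Regress observed precipitation on the member GCMs with a random forest.

    Returns the fitted model and its in-sample predictions. Fitting and
    predicting on the same data is done here only to keep the sketch short;
    a real application would hold out data for validation.
    """
    # Assumption: scikit-learn is installed.
    from sklearn.ensemble import RandomForestRegressor

    features = [list(values) for values in zip(*members)]  # rows: samples, cols: GCMs
    model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    model.fit(features, observed)
    return model, list(model.predict(features))


# Invented values for two ensemble members, for illustration only.
gcm_a = [100.0, 300.0, 500.0]
gcm_b = [200.0, 500.0, 700.0]
print(simple_mean_mme([gcm_a, gcm_b]))
```

The simple mean weights all members equally, whereas the regression lets the contribution of each member vary with the data, which is one way of viewing the difference between the two MMEs compared in this study.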

Accuracy Assessment of Gridded Precipitation Data
As a preliminary analysis, the GPCC precipitation data were validated against the observed precipitation. The validation was carried out for the period 1961-2005. In the present study, two statistical metrics, namely the Normalized Root Mean Square Error (NRMSE) and the modified index of agreement (md), were used to assess the accuracy of GPCC precipitation in replicating the mean and the variability of observed precipitation.
NRMSE is a non-dimensional form of Root Mean Square Error (RMSE) which is derived by normalizing RMSE by variance. NRMSE is more reliable than RMSE in comparing model performance when the model outputs are in different units or in the same unit but with different orders of magnitude (Willmott, 1982). NRMSE can have any positive value; however, values near zero are preferred (Chen and Liu, 2012). The 'md' is widely used to estimate the errors between observed and simulated values and it varies between 0 (no agreement) and 1 (perfect agreement) (Willmott, 1981).
The NRMSE and md values between observed precipitation and GPCC precipitation (pertaining to the grid point closest to the observation station) obtained for 17 locations in Pakistan are given in Table 2. Overall, all the stations showed low NRMSE and high md values, indicating that the accuracy of the GPCC precipitation in replicating observed precipitation over Pakistan is high.
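For illustration, the modified index of agreement can be sketched as follows, assuming the absolute-difference form md = 1 − Σ|O − S| / Σ(|S − Ō| + |O − Ō|), where Ō is the observed mean. The formula is not printed in the text, so this form, the function name and the example values are assumptions.

```python
def modified_index_of_agreement(obs, sim):
    """Modified index of agreement based on absolute differences."""
    obs_mean = sum(obs) / len(obs)
    numerator = sum(abs(o - s) for o, s in zip(obs, sim))
    denominator = sum(
        abs(s - obs_mean) + abs(o - obs_mean) for o, s in zip(obs, sim)
    )
    return 1 - numerator / denominator


# Invented monthly totals (mm), for illustration only.
station = [10.0, 35.0, 80.0, 20.0]
gridded = [12.0, 30.0, 70.0, 25.0]
print(modified_index_of_agreement(station, gridded))
```

Identical series give md = 1, and the value decreases towards 0 as the absolute errors grow relative to the spread around the observed mean.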

Evaluation and Ranking of GCMs
The SPAEF, Lambda, FSS, Cramer-V, Mapcurves, and KGE between GPCC and GCM simulated mean annual, monsoon and winter precipitation of Pakistan were estimated for the period 1961 to 2005. As an example, Table 3 shows the GOF values that define the performance of each GCM in simulating GPCC mean annual precipitation (winter and monsoon not shown). GOF values near 1 refer to better performance of the GCM of interest. For example, GFDL-ESM2G has a GOF value of 0.724 for SPAEF and is hence regarded as the best GCM in terms of SPAEF, whereas CSIRO-Mk3.6.0, which has a GOF value of -0.412, can be regarded as the poorest in terms of SPAEF. The GOF values for the other metrics (i.e. Lambda, FSS, Cramer-V, Mapcurves, and KGE) can be interpreted in the same manner.
Table 3. GOF values of GCMs obtained using different spatial metrics for mean annual precipitation.
The GCMs were then ranked based on the GOF value of each metric shown in Table 3, and the ranks are presented in Figure 2. Figure 2 shows the ranks attained by GCMs corresponding to different metrics. For example, BCC-CSM1.1 (m) attained ranks 12, 11, 12, 13, 14 and 19 for SPAEF, Lambda, FSS, Cramer-V, Mapcurves and KGE respectively. It was observed that none of the GCMs was able to secure the same rank for all metrics. However, NorESM1-M, CESM1-CAM5 and GFDL-ESM2G secured ranks 1, 2 and 3 respectively for four metrics (i.e. Lambda, FSS, Cramer-V and Mapcurves). Some of the GCMs attained the same rank for three metrics (e.g. CSIRO-Mk3.6.0, CESM1-CAM5, GFDL-CM3 and GFDL-ESM2G). Cramer-V and Mapcurves showed more or less similar ranks for GCMs. Similar results were also seen for monsoon and winter precipitation (not presented in the manuscript).

Overall Ranks of GCMs based on Annual and Seasonal Precipitation
The application of various evaluation metrics has shown different ranks for the same GCM (Ahmadalipour et al., 2015; Raju et al., 2016). Thus, the procedure detailed in Section 3.2 was employed to combine the ranks of each GCM produced by each spatial performance metric into an overall rank. The ranks attained by GCMs (as an example see Figure 2 for annual precipitation) corresponding to different metrics were used to calculate the RM values for each GCM. The overall ranks of GCMs for mean annual, monsoon and winter precipitation are presented in Table 4 along with the RM values. As
Table 4. Overall ranks of GCMs for mean annual, monsoon and winter precipitation based on rating metric values.
The better performance of NorESM1-M, CESM1-CAM5 and HadGEM2-AO in simulating precipitation over the Indo-Pak subcontinent has also been reported in several past studies. Babar et al. (2014) assessed 13 CMIP5 GCMs for simulating Indian summer monsoon precipitation and found that NorESM1-M can capture the seasonal cycle of precipitation. Anand et al. (2018) and Jena et al. (2015) concluded that CESM1-CAM5 is one of the GCMs capable of simulating Indian summer monsoon precipitation. Latif et al. (2018) reported the relatively better performance of HadGEM2-AO out of 36 CMIP5 GCMs in simulating precipitation over the Indo-Pakistan sub-continent based on spatial correlation.
Table 4 also shows that the ranks of many GCMs in simulating annual and monsoon precipitation are more or less the same; however, the ranks of GCMs corresponding to the winter season are somewhat different. The difference in ranks of GCMs in winter compared to those of the annual and monsoon seasons was probably due to differences in synoptic climatology. The winter precipitation occurs during the period December to March due to the westerly winds that blow from the Mediterranean Sea and enter Pakistan from the western side (Sheikh, 2001). The monsoon precipitation occurs during June to September and is caused by the monsoon winds that blow from the Bay of Bengal and enter Pakistan from the north eastern side (Sheikh, 2001; Sheikh et al., 2009). It can be inferred that the selection of an appropriate ensemble of GCMs also depends on the season and the mechanism which causes precipitation. The findings of the present study also support the results of Ahmadalipour et al. (2015), who reported that the performance of GCMs differs from season to season.
The spatial patterns of precipitation simulated by the GCMs ranked 1 and ranked 20 were compared with the spatial patterns of GPCC precipitation, and are presented in Figure 3. In Figure 3 it was seen that the GCMs that attained rank 1 showed spatial patterns more or less similar to that of GPCC precipitation. On the other hand, the GCMs ranked 20 showed large differences compared to the spatial patterns of GPCC precipitation. Figure 3 clearly shows that the GCMs which attained rank 20 under-estimated the annual, monsoon and winter precipitation over a large region in the study area.

Identification of Ensemble Members
Based on the criteria mentioned in Section 3.3, average RM values for each GCM were estimated and then the GCMs were ranked based on the average RM values. Table 5 shows the average RM values of the 20 GCMs considered in this study. The four top ranked GCMs, NorESM1-M, CESM1-CAM5, GFDL-CM3 and GFDL-ESM2G, indicated in bold in Table 5, were designated as the members of the ensemble for projecting precipitation over Pakistan.
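The averaging and selection step can be sketched as follows. The function name `select_top`, the placeholder model names and all RM values in the example are hypothetical and are not taken from Table 5; the study selects four members, while the example uses two only to keep it short.

```python
def select_top(seasonal_rm, k=4):
    """Average seasonal RM values per GCM and return the k best GCMs.

    seasonal_rm maps a GCM name to its RM values for the annual,
    monsoon and winter seasons.
    """
    averages = {gcm: sum(v) / len(v) for gcm, v in seasonal_rm.items()}
    # A higher average RM means a better overall rank.
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked[:k], averages


# Placeholder names and invented RM values, for illustration only.
example_rm = {
    "GCM-A": [0.90, 0.85, 0.70],
    "GCM-B": [0.60, 0.65, 0.80],
    "GCM-C": [0.30, 0.35, 0.40],
    "GCM-D": [0.75, 0.70, 0.65],
    "GCM-E": [0.20, 0.25, 0.30],
}
members, average_rm = select_top(example_rm, k=2)
print(members)
```

Averaging across seasons before ranking means that a GCM performing well in only one season can be outranked by one that is consistently good in all three.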
Table 5. Averaged RM values of GCMs for the identification of ensemble members.
The performances of the four top ranked GCMs (i.e. GCMs ranked 1, 2, 3 and 4) and the four lowest ranked GCMs (i.e. GCMs ranked 17, 18, 19, and 20) were visually evaluated using the scatter plots shown in Figures 4 and 5, pertaining to annual, monsoon and winter precipitation. In order to produce the scatter plots, the precipitation simulated by each GCM and the GPCC precipitation pertaining to all grid points were averaged (spatially averaged precipitation). As expected, GCMs that attained ranks 1 to 4 showed closer agreement with the observed precipitation compared to the GCMs which attained ranks 17, 18, 19, and 20. The scatter plots in Figure 5 indicated that the least skilful GCMs heavily underestimated annual precipitation. Over- and underestimation of precipitation can also be seen in the scatter plots of the GCMs ranked 1, 2, 3 and 4. However, their scatter was found to be much better aligned with the 45 degree line compared to that of the GCMs ranked 17, 18, 19, and 20. Therefore, it is argued that the GCMs ranked 1, 2, 3 and 4 can be used as an ensemble for the simulation/projection of precipitation.

Conclusion
This study quantitatively and qualitatively assessed the accuracy of 20 CMIP5 GCMs in simulating mean annual, monsoon and winter precipitation over Pakistan for the period 1961-2005. The quantitative evaluation was done based on six state-of-the-art spatial metrics, SPAtial EFficiency, Goodman-Kruskal's lambda, Fractions Skill Score, Cramer's V, Mapcurves, and Kling-Gupta efficiency, and the qualitative evaluation was done using scatter plots. A comprehensive rating metric was used to derive the overall ranks of GCMs based on their ranks pertaining to the annual, monsoon, and winter seasons.
The following conclusions were drawn from this study: 1) The low Normalized Root Mean Square Error (NRMSE) and high modified index of agreement (md) confirmed the close agreement of monthly GPCC precipitation with the observed precipitation extracted from 17 stations located in different climate zones of Pakistan. The low NRMSE and high md values of GPCC precipitation can be associated with extensive data quality control measures and the use of a large number of stations for the development of the GPCC precipitation dataset (Schneider et al., 2014).

Figure 1. The location of Pakistan in central-south Asia and the GCM grid points over the country.
Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2018-585. Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 27 February 2019. © Author(s) 2019. CC BY 4.0 License.
1. All GCM simulated past precipitation for the period 1961-2005 was remapped to a common grid with a 2° × 2° resolution.
2. SPAEF, Lambda, FSS, Cramer-V, Mapcurves, and KGE were individually applied to annual, monsoon and winter precipitation for the period 1961-2005.
3. The goodness of fit (GOF) estimated by SPAEF, Lambda, FSS, Cramer-V, Mapcurves, and KGE for annual, monsoon and winter precipitation was used to rank the GCMs separately.
4. A comprehensive rating metric (RM) was used to combine the ranks of GCMs determined by the above spatial performance measures separately for annual, monsoon and winter precipitation.
5. The RM values calculated in step 4 for annual, monsoon and winter precipitation were averaged to obtain the overall ranks of the GCMs in simulating precipitation over Pakistan.
6. The four top ranked GCMs based on their overall performance in replicating annual, monsoon and winter precipitation were identified.
7. Simple Mean (SM) and Random Forest (RF) were used to generate the MME precipitation mean with the precipitation simulated by the four top ranked GCMs identified in step 6.
8. Finally, the spatial patterns of MME precipitation generated from SM and RF were validated by visually comparing them with the spatial patterns of observed precipitation.
Details of the methods and the determination of the best performing ensemble of GCMs are provided in the following sections.
ensemble of GCMs was identified in two steps: (1) RM values of GCMs for annual, monsoon and winter precipitation were averaged to derive an overall rank for each GCM, and (2) the four top-ranked GCMs based on the RM values for all seasons were considered for the ensemble. Selecting GCMs according to their skill in different seasons yields an ensemble that can better simulate the observations in each season.
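
The two-step selection can be sketched as follows. The rating metric is assumed here to take the commonly used form RM = 1 - (sum of ranks) / (n × m) for n GCMs and m metrics; this section does not give the formula, so treat it as an illustration only. The GCM names, ranks, and the helper `simple_mean` are hypothetical, and the Random Forest merging is not reproduced.

```python
def rating_metric(ranks, n_models):
    """Combine one GCM's ranks over m metrics; values nearer 1 are better.

    Assumed form: RM = 1 - sum(ranks) / (n_models * m).
    """
    return 1.0 - sum(ranks) / (n_models * len(ranks))

def select_ensemble(seasonal_ranks, n_models, top=4):
    """Average RM over seasons, then keep the `top` highest-rated GCMs."""
    overall = {}
    for gcm, seasons in seasonal_ranks.items():
        rms = [rating_metric(ranks, n_models) for ranks in seasons.values()]
        overall[gcm] = sum(rms) / len(rms)
    best = sorted(overall, key=overall.get, reverse=True)[:top]
    return overall, best

def simple_mean(fields):
    """Cell-wise arithmetic mean of equally sized precipitation fields."""
    return [sum(cell) / len(cell) for cell in zip(*fields)]

# Hypothetical ranks of five GCMs under two metrics in each season.
toy_ranks = {
    "GCM-A": {"annual": [1, 1], "monsoon": [2, 1], "winter": [1, 2]},
    "GCM-B": {"annual": [2, 2], "monsoon": [1, 2], "winter": [2, 1]},
    "GCM-C": {"annual": [3, 3], "monsoon": [3, 3], "winter": [3, 3]},
    "GCM-D": {"annual": [4, 5], "monsoon": [4, 4], "winter": [5, 4]},
    "GCM-E": {"annual": [5, 4], "monsoon": [5, 5], "winter": [4, 5]},
}
overall_rm, selected = select_ensemble(toy_ranks, n_models=5, top=4)
print(selected)
```

Averaging RM across seasons before selecting means that a GCM that excels in only one season can be outranked by one that performs consistently in all three.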

Figure 2. Ranks of GCMs according to their performance in replicating spatial patterns of mean annual precipitation.
As seen in Table 4, NorESM1-M, CESM1-CAM5 and HadGEM2-AO were the most skilful GCMs in reproducing the spatial characteristics of mean annual, monsoon and winter precipitation, respectively. On the other hand, MRI-CGCM3 displayed the least skill in reproducing the spatial characteristics of annual precipitation, and CSIRO-Mk3.6.0 showed the least skill in reproducing the spatial characteristics of monsoon and winter precipitation.

Figure 3. Spatial patterns of GPCC (a-c), GCM at rank 20 (d-f) and GCM at rank 1 (g-i) for mean annual, monsoon, and winter precipitation respectively.

Figure 4. Scatter of precipitation of four top ranked GCMs against GPCC annual, monsoon and winter precipitation.

Figure 5. Scatter of precipitation of four lowest ranked GCMs against GPCC annual, monsoon and winter precipitation.

Figure 6. Spatial patterns of observed (a-c), MME computed using Simple Mean (SM) (d-f) and MME computed using Random Forest (RF) (g-i) for mean annual, monsoon, and winter precipitation respectively during 1961 to 2005.

Figure 7. Scatter plots of GPCC and MME mean precipitation for annual, monsoon and winter seasons obtained using Simple Mean (SM) and Random Forest (RF) for the period 1961-2005.

Figure 3. Spatial patterns of GPCC (a-c), GCM at rank 1 (d-f) and GCM at rank 20 (g-i) for mean annual, monsoon, and winter precipitation respectively.

Table 2. Validation of accuracy of GPCC precipitation using NRMSE and md.