Interactive comment on “Comparative assessment of predictions in ungauged basins – Part 2: Flood and low flow studies” by J. L. Salinas

P422, L9-11: Indeed, there is a mistake in the first sentence of the text. It should read: “The results for the flood regionalisation (Fig. 2, right panel) show that the predictions in humid regions exhibit the smallest errors and arid regions have the largest errors.” Figure 2 is correct: humid catchments present the smallest errors. The errors are plotted on a reversed axis so that, consistently throughout the paper, increasing performance (decreasing error) is represented as the upward direction.


Introduction
Estimating flood and low flow discharges in ungauged basins is among the most fundamental challenges in catchment hydrology. There is a long track record in statistical hydrology of developing methods to estimate, in an optimal way, these discharges from runoff observations in neighbouring catchments and from catchment characteristics. Common to these statistical methods is the idea of catchment grouping, i.e. the notion that extreme events that have not been observed in a particular location could already have been observed somewhere else. Therefore, runoff data (on floods or low flows) from many sites are pooled in order to obtain a representative sample of what could happen in a particular location. One of the key aspects of the methods is exactly how this pooling is performed.
There are a number of options. The classical approach consists of subdividing the study domain into a number of fixed, contiguous regions which are used to regionalise floods or low flows for all catchments in the area (e.g. as used in the index flood method, Dalrymple, 1960). The assumption of this method is that areas close to each other are characterised by similar climate, topography, geology, soils and land use, which gives rise to similar catchment hydrological response and therefore to similar floods or low flows. The grouping is usually found by geographical boundaries, by combining maps of the catchment characteristics in some way (Beable and McKerchar, 1982) or by a diverse set of statistical methods. These include cluster analysis using catchment characteristics (Nathan and McMahon, 1990), residuals from a regression model (Wandle, 1977; Hayes, 1992), regression trees (Laaha and Blöschl, 2006a), and pattern identification on the basis of the seasonality of runoff as an indicator of flood and low flow processes in the catchment (Laaha and Blöschl, 2006b; Piock-Ellena et al., 1999). An alternative is the region of influence (ROI) approach (Burn, 1990), which assigns a different pooling group to each catchment of interest. Similarity between catchments is usually measured by the root mean square difference of all the catchment and climate characteristics in a pair of catchments. A typical application of the ROI approach is given in the UK Flood Estimation Handbook (IH, 1999). The catchment characteristics for the grouping usually include mean annual rainfall, catchment area and soil characteristics.
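The ROI similarity measure described above can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: it assumes the characteristics have already been standardised to comparable scales, and the function names `rms_difference` and `region_of_influence`, the pooling group size `k` and all numbers are hypothetical.

```python
import math


def rms_difference(a, b):
    """Root mean square difference between two equal-length vectors of
    (standardised) catchment and climate characteristics."""
    if len(a) != len(b):
        raise ValueError("characteristic vectors must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def region_of_influence(target, gauged, k):
    """Return the names of the k gauged catchments most similar to the target.

    gauged maps a catchment name to its vector of characteristics.
    """
    ranked = sorted(gauged, key=lambda name: rms_difference(target, gauged[name]))
    return ranked[:k]


# Made-up standardised characteristics (e.g. rainfall, area, soil index).
gauged = {
    "A": [0.1, -0.3, 1.2],
    "B": [0.0, 0.2, 0.9],
    "C": [2.0, 1.5, -1.0],
}
pool = region_of_influence([0.0, 0.0, 1.0], gauged, k=2)
```

Each target catchment gets its own pooling group, which is the point of difference from fixed, contiguous regions.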
Once the pooling group has been identified, there are again a number of options for estimating the flood or low flow discharges. A classical one is the index flood method (Dalrymple, 1960), where the flood distribution function scaled by the index flood (e.g. the mean annual flood) is assumed to be homogeneous within the region. The procedure consists of first estimating the index flood in the ungauged catchment (e.g. by a regression against catchment characteristics) and then multiplying that index flood by the regional scaled flood distribution function (IH, 1999) or, for low flows, multiplying the index low flow by the regional scaled low flow distribution function (Clausen and Pearson, 1995; Madsen and Rosbjerg, 1998). With the advent of geographic information systems, alternative methods of using the flood quantiles or low flow quantiles directly in regressions against catchment characteristics have become popular (see, e.g. Cunnane, 1988, and Griffis and Stedinger, 2007, for the case of floods, and Gustard et al., 1992, and Engeland and Hisdal, 2009, for the case of low flows). More recently, geostatistical methods that exploit the spatial correlation of floods (or low flows) either in space (Merz and Blöschl, 2005) or along the stream network (see Skøien et al. (2006) for the case of floods and Laaha et al. (2012) for the case of low flows) have become popular. One of the strengths of the geostatistical approach is that it directly exploits the spatial correlations of the discharges and there is no need for defining pooling groups explicitly, but a relatively dense stream gauge network is needed. There are also methods that estimate flood statistics in ungauged catchments from rainfall (e.g. Moretti and Montanari, 2008).
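The two steps of the index flood procedure can be illustrated with a short sketch. It assumes a power-law regression of the index flood against catchment area and a tabulated regional growth curve; the coefficients, growth factors and function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def index_flood_from_area(area_km2, coeff, exponent):
    """Step 1: estimate the index flood (m3/s) in the ungauged catchment.

    Assumed form: a power-law regression against catchment area, with
    hypothetical coefficients fitted elsewhere on the gauged catchments.
    """
    return coeff * area_km2 ** exponent


def flood_quantile(index_flood, growth_factor):
    """Step 2: scale the dimensionless regional growth factor by the index flood."""
    return index_flood * growth_factor


# Made-up regional growth curve: return period (yr) -> dimensionless factor.
growth_curve = {10: 1.6, 100: 2.4}

index_flood = index_flood_from_area(100.0, coeff=2.0, exponent=0.5)
q100 = flood_quantile(index_flood, growth_curve[100])
```

The same structure applies to low flows, with an index low flow and a regional scaled low flow distribution in place of the flood quantities.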
When reviewing the rich literature on estimating extreme discharges in ungauged basins, it is interesting that many of the statistical methods for floods and low flows are similar if not identical. Given this similarity, it is quite surprising that there are very few studies that directly compare the estimation methods for floods and low flows. Another interesting finding is that the predictive performance for ungauged basins strongly depends on the hydrological or climatological setting of the region (Meigh et al., 1997; Farquharson et al., 1992). There is no consensus in the literature on whether one method always outperforms another. This is because there have been few attempts at generalising the findings on the predictive performance of estimation methods beyond individual case studies. Yet, it would be very interesting to understand whether there are general patterns of performance, i.e. whether particular methods generally perform better than others in a given environment. These are the issues this paper is concerned with. Specifically, in this paper we perform a meta-analysis of the literature on predictive performance of flood and low flow estimation methods in ungauged basins. In a second step we analyse a number of more detailed datasets, again focusing on the performance of the methods. The aim is to learn from the similarities and differences between catchments in different places, and to interpret the differences in predictive performance in terms of the underlying climate-landscape controls. The following research questions are addressed:
i. How good are the predictions of hydrological extremes in different climates?
ii. Which regionalisation method performs best?
iii. How does data availability impact performance?
iv. To what extent does runoff prediction performance depend on climate and catchment characteristics?
This paper is part of a set of three papers that are all concerned with assessing the performance of estimating runoff characteristics in ungauged basins. The two companion papers (Parajka et al., 2013; Viglione et al., 2013) deal with estimating runoff hydrographs in ungauged basins and estimating a set of different runoff characteristics in Austria, respectively.

Method of comparative assessment
For the comparative assessment of both flood and low flow predictions in ungauged basins, the same two step process as in Parajka et al. (2013) has been adopted in this paper and is presented below.
Level 1 assessment: in a first step, a literature survey was performed. Publications in the international refereed literature were scrutinised for results of the predictive performance of both floods and low flows. The Level 1 assessment is a meta-analysis of prior studies performed by the hydrological community. The advantage of this type of meta-analysis is that a wide range of environments, climates and hydrological processes can be covered that go beyond what can be reasonably achieved by a single study. It is a comparative assessment that synthesises the results from the available international literature. However, the level of detail of the information provided is often limited. The results in the literature were almost always reported in an aggregated way, i.e. as average or median performance over the study region or part of the study region.
Level 2 assessment: to complement the Level 1 assessment, a second assessment step was performed, termed Level 2 assessment. In this step, some of the authors of the publications from Level 1 were approached to provide data on their flood and low flow predictions for individual basins. The data they provided included information on the catchment and climate characteristics, on the method used, the data availability, and predictive performance. The overall number of catchments involved was smaller than in the Level 1 assessment, so the spectrum of hydrological processes covered in the assessment could potentially be narrower. However, the amount and detail of information available in particular catchments was much higher. As in Level 1, the cross-validation performance for ungauged basins was analysed; however, information on individual catchments was now available. The cross-validation performance was estimated by a leave-one-out strategy, where each gauged catchment was in turn considered as ungauged and the estimated low flow or flood index was compared with the observed one.
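The leave-one-out strategy can be sketched in a few lines. This is only an illustration of the bookkeeping: the placeholder estimator `regional_mean` stands in for whatever regionalisation method a study actually used, and all names and values are hypothetical.

```python
def leave_one_out(observed, predict):
    """Treat each gauged catchment in turn as ungauged.

    observed: dict mapping catchment name -> observed flood or low flow index.
    predict:  callable(training, target_name) -> estimate for the target,
              where training excludes the target catchment.
    Returns a dict of cross-validation estimates.
    """
    predictions = {}
    for name in observed:
        training = {k: v for k, v in observed.items() if k != name}
        predictions[name] = predict(training, name)
    return predictions


def regional_mean(training, target_name):
    """Placeholder regionalisation: mean of the remaining gauged catchments."""
    return sum(training.values()) / len(training)


observed = {"A": 2.0, "B": 4.0, "C": 6.0}
predicted = leave_one_out(observed, regional_mean)
```

Comparing `predicted` with `observed` catchment by catchment gives the kind of per-catchment error information used in the Level 2 assessment.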
The comparative assessment conducted in this paper stratifies the analyses into three main groups:
1. Analysis of process controls on the predictive performance. A number of climate and catchment characteristics have been identified. A large number of catchments and modelling studies around the world have then been organised according to these climate and catchment characteristics, with the objective of learning from their differences and similarities in performance in a general way.
2. Analysis of predictive performance for different types of methods. The methods for estimating flood and low flow indexes in ungauged basins have been grouped into the classes discussed in Sect. 3. Rather than evaluating specific methods, the focus has been on types of method, so as to be able to generalise beyond individual studies.
3. Analysis of data availability. The quality of predictions of extremes in ungauged basins not only depends on the hydrological setting and the regionalisation method but also, importantly, on the data that are available for the information transfer. The comparison therefore also examines the number of stream gauges available in a particular study as an index to characterise data availability.
3 Studies and datasets used

Low flow studies
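Table 1 lists the 14 low flow prediction studies used in this paper. It includes summary information about the study region, regionalisation method applied and the predictive performance in terms of the coefficient of determination (R²), defined as follows:

$$R^2 = 1 - \frac{\sum_i \left(Q_{i,\mathrm{pred}} - Q_{i,\mathrm{obs}}\right)^2}{\sum_i \left(Q_{i,\mathrm{obs}} - \overline{Q}_{\mathrm{obs}}\right)^2}$$

where $Q_{i,\mathrm{pred}}$ is the predicted specific discharge in cross-validation at gauge i, $Q_{i,\mathrm{obs}}$ is the observed specific discharge at gauge i, and $\overline{Q}_{\mathrm{obs}}$ is the spatial mean of the observed specific discharge.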
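As a worked illustration, the coefficient of determination can be computed as below. The sketch assumes the common form of one minus the ratio of the squared prediction errors to the squared deviations of the observations from their spatial mean; the function name and the sample values are hypothetical.

```python
def coefficient_of_determination(q_pred, q_obs):
    """R2 in cross-validation, assuming the form 1 - SSE / SST.

    q_pred: predicted specific discharges, one per gauge.
    q_obs:  observed specific discharges at the same gauges.
    """
    if len(q_pred) != len(q_obs):
        raise ValueError("q_pred and q_obs must have the same length")
    mean_obs = sum(q_obs) / len(q_obs)
    sse = sum((p - o) ** 2 for p, o in zip(q_pred, q_obs))
    sst = sum((o - mean_obs) ** 2 for o in q_obs)
    return 1.0 - sse / sst


q_obs = [1.0, 2.0, 3.0]
r2_perfect = coefficient_of_determination([1.0, 2.0, 3.0], q_obs)
r2_mean_only = coefficient_of_determination([2.0, 2.0, 2.0], q_obs)
```

Under this form, a method that only ever predicts the regional mean scores zero, and a biased method can score below zero, which is why the measure reflects both bias and dispersion.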
In the great majority of the papers considered, the performance is given in terms of the described coefficient of determination in cross-validation, which reports the amount of variance explained by the model, and is also affected by both bias and dispersion of the estimators. The target low flow index, on which this performance is reported, is mainly the q95 specific discharge quantile, i.e. the discharge value exceeded 95 % of the time divided by the catchment area, but there were studies presenting performances on other low flow indicators including q7,10 (7 days 10 yr specific runoff), qmon,5 (monthly 5 day minimum), q96, q97 (96-97 % specific runoff quantiles), q95/qA (q95 specific runoff quantile normalised by the mean annual specific runoff qA) and baseflow index (BFI). Both the performance measure and the low flow index used in the analysis represent a trade-off between the number of studies that could potentially be included in the analysis and their need to be comparable; the same applies to the flood studies. Several studies compare different regionalisation approaches and/or subsets of data, which results in a total of 28 assessments of predictive performance. These results are the base for the Level 1 assessment, which represents a total of 3112 catchments (Table 2). Geographically, most of the cross-validation assessments were performed in Europe and North America and only a few studies cover Australia and Asia (Fig. 1, top and Table 1). Six study authors out of the Level 1 assessment provided detailed information about climate and catchment characteristics in a consistent way and reported the regionalisation performance for each catchment (Level 2 assessment). In this sense, the potential of learning from the catchment-by-catchment errors, in contrast to the aggregated, regional measures of Level 1, represents a motivation for the Level 2 assessment. Predictive performance on a catchment basis was given as the absolute normalised error (ANE), defined as

$$\mathrm{ANE}_i = \frac{\left|Q_{i,\mathrm{pred}} - Q_{i,\mathrm{obs}}\right|}{Q_{i,\mathrm{obs}}}$$

The dataset for Level 2 assessment combines data from 1895 catchments. Three catchment characteristics are analysed: aridity index, mean elevation and catchment area.
Table 1. Summary assessment of studies for low flow estimation in ungauged catchments used in Level 1 assessment. Performance indicates the leave-one-out assessment of model efficiency in terms of the coefficient of determination R². Low flow regionalisation methods include: process based (PB), global regression (GR), regional regression (RR), geostatistics (G) and short records (SR). Predicted variable indicates the low flow index estimated in the study and includes: 7 days 10 yr specific runoff (q7,10), monthly 5 day minimum specific runoff (qmon,5), 95-97 % specific runoff quantiles (q95, q96, q97), normalised q95 specific runoff quantile (q95/qA) and baseflow index (BFI). Ranges or various values for R² represent variations of the methods or the same method applied on different subsamples from the same region.

These characteristics represent a trade-off between the data availability of the studies, and the literature reports on the main controls of flood and low flow regimes. Aridity (the ratio of potential evaporation EPA and precipitation PA on a long-term basis, averaged across the catchment) is an indicator of the competition between energy and water affecting the water balance. Elevation (average topographic elevation within the catchment) is a composite indicator including a range of processes, such as long-term precipitation and hence soil moisture availability, and air temperature. In some environments there is a relationship between elevation and aridity and between elevation and snow processes. Catchment area is an indicator of the degree of aggregation of catchment processes related to scale effects (Skøien et al., 2003) and an indicator of storage within the catchment. Catchment size also acts as an indicator of the quality of rainfall data that is available for runoff estimation in ungauged basins, as for a constant raingauge density, the mean areal rainfall estimation variance decreases with increasing catchment area. This areal rainfall estimate might also be biased by an increasing number of stations located in lower parts of the catchment (Lebel et al., 1987). The low flow regionalisation methods have been classified into the following groups.
- Process-based methods (PB): we encountered only a single cross-validation study of this type in the literature (Engeland and Hisdal, 2009). The procedure consisted of regionalising the parameters of a conceptual rainfall-runoff model from gauged to ungauged catchments in the region. The low flow characteristics were then derived from the simulated daily hydrographs at the ungauged location of interest.
- Global regression (GR): in the global regression approach a single relationship between the low runoff statistic of interest, such as q95, and catchment/climate characteristics is established. Both additive and multiplicative regression models were used. A critical issue in the regression is the choice of the catchment/climate characteristics, which include mean annual precipitation and geologic characteristics in the literature. It has been noted that it is important to interpret the catchment/climate characteristics that are found to be significant during a regression analysis from a hydrological perspective, i.e. to link the statistical analysis to the hydrological processes operating at the catchment scale.
- Regional regression (RR): here the procedure is similar; however, the entire domain is subdivided into regions and a regression model is applied to each region separately. The main rationale of regional regression is that different processes may operate in the regions, so the catchment/climate characteristics will control low flows in different ways. A number of methods exist for identifying the regions or pooling groups, including cluster analysis of catchment/climate characteristics, residuals from a regression model and pattern identification on the basis of the seasonality of runoff.
- Geostatistical methods (G): geostatistical methods exploit the spatial correlations of low flows based on the rationale that catchments that are geographically close to each other may exhibit similar processes. While some approaches use Euclidean distance as a similarity measure, other approaches use the correlations along the river network. To account for spatially heterogeneous regions, the geostatistical method has been extended to combine it with multiple regressions by using the residuals of the regression for the spatial geostatistical estimation.
- Short records (SR): in some instances there may be short runoff records available for a catchment that is otherwise ungauged. These runoff records may not be representative of the longer time period that is normally used for the estimation of low flows. Methods are therefore used that relate the low flow estimates from the short runoff records to the longer hydrological history of the basin on the basis of regional information, usually involving some element of correlation analysis (Laaha and Blöschl, 2005).
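The difference between global and regional regression in the list above can be sketched with a single predictor. This is a simplified illustration, assuming an additive model of q95 against mean annual precipitation and a region label already assigned to each catchment; the function names, region labels and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_regional(records):
    """Fit one regression per region.

    records: list of (region, predictor, low_flow_index) tuples.
    Returns a dict mapping region -> (a, b).
    """
    by_region = {}
    for region, x, y in records:
        xs, ys = by_region.setdefault(region, ([], []))
        xs.append(x)
        ys.append(y)
    return {region: fit_line(xs, ys) for region, (xs, ys) in by_region.items()}


# Made-up data: (region, mean annual precipitation in mm, q95 in l/s/km2).
records = [
    ("alpine", 1000.0, 2.0), ("alpine", 1500.0, 3.0), ("alpine", 2000.0, 4.0),
    ("lowland", 600.0, 3.0), ("lowland", 800.0, 5.0), ("lowland", 1000.0, 7.0),
]

# Global regression: one relationship for the whole domain.
global_model = fit_line([x for _, x, _ in records], [y for _, _, y in records])
# Regional regression: a separate relationship per region.
regional_models = fit_regional(records)
```

In this toy dataset the two regions have different slopes, so a single global line cannot fit both, which mirrors the rationale given for regional regression.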

Flood studies
Table 3 lists the 20 flood prediction studies used in this paper. It includes summary information about the study region, regionalisation method applied and the predictive performance in terms of the root mean square normalised error (RMSNE), defined as follows:

$$\mathrm{RMSNE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\frac{Q_{i,\mathrm{pred}} - Q_{i,\mathrm{obs}}}{Q_{i,\mathrm{obs}}}\right)^2}$$

where N is the number of gauges. The cross-validation performance is given, in the great majority of the papers considered, as the defined root mean square normalised error, a very common error measure for estimators, combining both the bias and the dispersion component of the error. The target flood index, on which this performance was mainly reported, was the 100 yr specific flood quantile q100, i.e. the peak discharge value that occurs on average every 100 yr divided by the catchment area. There are three exceptions, namely Srinivas et al. (2008), Cunderlik and Burn (2002), and Jingyi and Hall (2004), where the predictive performance is calculated on volumes and not on specific discharges (Table 3). These studies are plotted as crosses in Figs. 2-4. It is worth mentioning that the quantities defined as observed discharges $Q_{i,\mathrm{obs}}$ are actually the flood quantiles estimated from local data and are subject to a certain degree of uncertainty, and the same applies to the observed 95 % low flow quantiles. Several studies compare different regionalisation approaches and/or subsets of data, which results in a total of 57 assessments of predictive performance. These results are the base for the Level 1 assessment, which represents a total of 3023 catchments (Table 2). Figure 1 (bottom) and Table 3 show that the studies are rather evenly spread around the world. Five study authors out of the Level 1 assessment provided detailed information about climate and catchment characteristics in a consistent way and reported the regionalisation performance for each catchment in terms of the absolute normalised error ANE (Level 2 assessment). This dataset combines data from 1422 catchments. As in the case of low flows, three catchment characteristics are analysed: aridity index, mean elevation and catchment area (see Sect. 3.1).

Each symbol refers to a result from the studies in Tables 1 and 3. Circles represent performances calculated on specific discharges (m³ s⁻¹ km⁻²), crosses represent performances calculated on discharges (m³ s⁻¹). Boxes show 25-75 % quantiles.

The flood regionalisation methods have been classified into the following groups:
- Regression methods: the regression methods for flood discharges are similar to those of low flows, where the flood runoff is related to catchment/climate characteristics such as catchment area and mean annual precipitation. As in the case of low flows, it is important to interpret the regression coefficients obtained from a hydrological perspective (Merz and Blöschl, 2008a, b).
- Index methods: the index methods consist of a group of approaches where the flood distribution function is scaled by the index flood (e.g. the mean annual flood or the median annual flood) and assumed to be homogeneous within the region. One first estimates the index flood in the ungauged catchment (e.g. by a regression against catchment characteristics) and then multiplies that index flood by the regional-scaled flood distribution function. The methods usually differ in terms of how the homogeneous groups are obtained.
- Geostatistical methods: geostatistical methods are analogous to those in use for regionalising low flows (see Sect. 3.1).
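The two error measures used for the flood studies can be sketched as follows. The sketch assumes the usual definitions: the ANE as the absolute prediction error divided by the observed value for one catchment, and the RMSNE as the root of the mean squared normalised error over all catchments of a study. Function names and sample values are hypothetical.

```python
import math


def absolute_normalised_error(q_pred, q_obs):
    """ANE for a single catchment: |predicted - observed| / observed."""
    return abs(q_pred - q_obs) / q_obs


def rmsne(q_pred, q_obs):
    """Root mean square normalised error over the catchments of one study."""
    if len(q_pred) != len(q_obs):
        raise ValueError("q_pred and q_obs must have the same length")
    squared = [((p - o) / o) ** 2 for p, o in zip(q_pred, q_obs)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up cross-validation results for two catchments (m3 s-1 km-2).
q_pred = [1.5, 3.0]
q_obs = [1.0, 4.0]
ane_values = [absolute_normalised_error(p, o) for p, o in zip(q_pred, q_obs)]
study_rmsne = rmsne(q_pred, q_obs)
```

The ANE keeps one value per catchment, as needed for the Level 2 assessment, whereas the RMSNE aggregates a whole study into one number, as reported in Level 1.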

How good are the predictions of hydrological extremes in different climates?
Figure 2 (left) shows the Level 1 results of estimating low flows in ungauged basins. The distribution of the studies by climatic region is as follows: 2 are considered as arid, 12 as cold and 14 as humid. The highest performance is obtained for humid catchments, but there are also studies in humid climates that report a significantly lower performance. In arid climates, the performance is never very high, but more studies are needed to clearly show this behaviour. The most likely reason for this finding is that arid regions tend to be very heterogeneous with a high variability of low flow producing processes, and low flows generally tend to be lower and more variable, and therefore harder to predict. Cold environments exhibit the largest performance range. This could be because this class contains sub-polar and mountainous environments which may be hydrologically very complex, with many different storage types that complicate low flow behaviours (ice/groundwater).
The results for the flood regionalisation (Fig. 2, right) present 10 studies from arid regions, 12 from tropical, 26 from cold and 9 from humid regions. They show that the predictions in humid regions exhibit the smallest errors and arid regions have the largest errors. This means that the predictive performance clearly decreases with increasing aridity. There are a number of factors that may contribute to this dependence. The interannual variability (e.g. in terms of coefficient of variation of the annual peak runoff time series) of floods in arid regions is usually bigger than in other climates, due to the associated stronger non-linearities and threshold effects in drier regions and the larger interannual variability and skewness of rainfall intensities more typical for arid climates. This means that floods are more difficult to estimate from short records. The stronger non-linearity also implies that the spatial hydrological variability in the flood producing processes will impact the flood frequency curve more strongly, so catchments that are close to each other may exhibit quite different flood frequency curves, which reflects poorly on the regionalised predictions. A possible explanation for this non-linearity in arid catchments is given in Goodrich et al. (1997), where the increasingly non-linear response is attributed to the increasing importance of ephemeral channel losses and partial storm area coverage. In contrast, humid catchments tend to be more linear, so the predictability is larger. The biggest range of performances is found in cold climates. This may be partly related to the larger number of studies available for these regions. Also, in cold regions a wide variety of flood producing processes may exist, including snow and rain-on-snow, which may lead to different performance, depending on the prevailing processes. For example, snow melt floods tend to be more predictable than rain-on-snow floods (e.g. Sui and Koehler, 2001).

Each symbol refers to a result from the studies in Tables 1 and 3. Circles represent performances calculated on specific discharges (m³ s⁻¹ km⁻²), crosses represent performances calculated on discharges (m³ s⁻¹). Lines indicate studies that compared different methods for the same set of catchments. Boxes show 25-75 % quantiles.

Which regionalisation method performs best?
The low flow regionalisation methods represented in the assessment included 1 result from the process-based methods group (continuous runoff models); 4 results from the geostatistical group of methods, where runoff at the target site was estimated as a weighted mean of runoff at the surrounding gauges; 11 global regression and 7 regional regression results from the regression methods group; and 5 results from the short records group that used various methods. The assessments in each group are not based on exactly the same regionalisation approach, but the methodology is similar. There are also differences in the low flow indices used. They include q95 (95 % exceedance probability specific runoff), q7,10 (7 days 10 yr specific runoff), and qmon,5 (monthly 5 day minimum), all standardised by catchment area or mean flow, and the dimensionless BFI. In particular, q95 low flows are usually closely correlated to q7,10, so that a comparison across the various indices should provide consistent results at the level of detail used for the comparisons. Figure 3 (left) shows a large performance range across the regionalisation methods. Overall, it is clear that low flow predictions from short records (R² = 0.62 to 0.99) perform best. The method performs significantly better than all other methods, provided continuous runoff measurements from at least 3-5 yr of observations at the site of interest are used. A lower performance (0.62) is obtained when using a single flow measurement during the low flow period. The performance of global regression ranges from 0.43 to 0.86. Studies from high-mountain environments have a lower performance (Austria: 0.57, Switzerland: 0.51, Nepal: 0.53, India: 0.45), perhaps because the heterogeneity of the low flow process in the landscape (including snow) poses difficulties for applying one single regionalisation model for the entire domain, so division into subregions may be necessary. Global regression is better suited to smaller regions (e.g. the German region of Baden-Württemberg) and studies in climates less controlled by snow seasonality (e.g. New South Wales and Victoria in Australia). The four results from geostatistical models give on average the highest performances between 0.61 and 0.89. A continuous runoff model, tested in only one study used in the meta-analysis, gave lower performance than the statistical methods. The studies examined differ in terms of the hydrological characteristics and data availability, so a comparison of methods for different regions will involve some uncertainty. It is therefore useful to apply the different methods to the same catchments. A number of studies are available in the literature that have performed such a comparison and the results are indicated as grey lines in Fig. 3 (left). Most of the studies compare global and regional regressions. The comparisons clearly show that the regional regressions always perform better than the global regressions. The studies that conduct this comparison show that the average performance of global regressions is around 0.5 and increases to 0.7 for regional regression. It should be noted that the performance reported is cross-validation performance for ungauged basins, so better performance is related to better predictions rather than to improved goodness of fit of the regressions. There are also a few studies that compared geostatistical methods with regional regression methods. In one study from France (Plasse and Sauquet, 2010) the geostatistical method was based on distance between the catchment centres of gravity.
The performance was larger than for global regression and lower than that of regional regression. If the stream network structure is taken into account, the performance of geostatistical methods can in fact be higher than that of regional regression, as illustrated in the Austrian case studies (Laaha et al., 2007, 2012). Finally, one study (Engeland and Hisdal, 2009) compared process-based methods with regional regressions and found that the regressions gave better results. Clearly, application of process-based methods does not per se improve the performance of low flow estimation; their value depends on the amount of information available for careful parameterisation of the model. However, process-based methods have more potential to explore the impact of environmental change than statistical methods.
The flood regionalisation methods represented in the assessment included (i) regression methods, 18 results from different regression models where the flood quantiles or the distribution parameters had been transferred to ungauged basins; (ii) index methods, 34 results where a regional growth curve had been defined for homogeneous regions; (iii) geostatistical methods, 5 results where runoff at the target site was estimated as a weighted mean of runoff at the surrounding gauges. While the assessments made by each group are not based on exactly the same regionalisation approach, the methodology is similar. Figure 3 (right) shows that the geostatistical methods perform best (RMSNE of 0.30-0.52) across the studies analysed, although the number of studies is small compared to the other groups. For example, Merz and Blöschl (2005) in Austria and Walther et al. (2011) in Saxony (Germany) provide the combination of the necessary stream network density and non-arid climate that causes their lower RMSNE values (0.30 and 0.46 respectively). The regression methods have the lowest performance, i.e. the largest predictive errors (median RMSNE of 0.62), and the index methods fall in between. As an illustrative example, we find low performances (average RMSNE of 0.57) of the index flood method in the arid and semi-arid regions of Meigh et al. (1997) and even lower performances (RMSNE between 0.81 and 0.88) for the regression approaches in a cold climate in Guse et al. (2010). The result of the overall ranking of methods is confirmed by studies that compared different approaches in the same region (grey lines in Fig. 3, right). For regression methods, it may be difficult to find catchment characteristics that are representative of the flood generating processes. For example, subsurface characteristics are an important control for flood generation and these are difficult to capture unless detailed field surveys are available. Index methods and geostatistical methods are less dependent on the catchment characteristics as they usually take advantage of both spatial proximity (either through spatial correlations or homogeneous regions) and correlations to catchment characteristics. It is also the case that the geostatistical studies in Table 3 have been performed in data-rich environments, which may partly explain their better performance. It is interesting to note that the number of studies applying regression and index methods is much larger than the number applying geostatistical methods, which is because they have a longer tradition in hydrology. The first two columns in Table 4 present a summary of the methods with the highest and lowest predictive performances in the Level 1 assessments of low flow and flood studies.

How does data availability impact performance?
While the information on the data used was never very detailed in the studies examined, some inferences on data availability can be drawn from the number of catchments used in the studies. These are usually the catchments used both for the cross-validation and for regionalising runoff to neighbouring catchments. Figure 4 (left) shows the predictive performance (R²) for the case of low flows as a function of the number of catchments analysed in each study. The studies with fewer than 100 catchments have, on average, the lowest performance, and performance increases with the number of catchments used in the analysis. Possibly, this is due to the lower stream gauge density in studies with a smaller number of stream gauges, but more detailed analyses of the precise geographic extent of the studies would be needed to ascertain the data controls on performance. The performance decreases for very large datasets (> 250 catchments). This decrease is related to the higher heterogeneity of larger study areas and to the fact that a number of the studies used global regression methods that did not perform very well in these regions.
Figure 4 (right) shows the RMSNE for the case of floods as a function of the number of catchments analysed in each study. The errors clearly decrease, i.e. the performance increases, with the number of catchments included in the analysis. This is possibly because of the higher stream gauge density in the larger studies, which makes the transfer of floods across the landscape more accurate, in particular if there is a stream gauge upstream or downstream of the target site. Also, the regionalisation methods may be more robust if the total number of stations is larger.
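The stratification by study size described above can be sketched as a simple grouping of study results into classes, followed by a median per class. The class boundaries (100 and 250 catchments) are assumptions taken from the thresholds mentioned in the text, and the study results are invented for illustration.

```python
from statistics import median

# Hypothetical (number of catchments, RMSNE) pairs; illustrative only.
studies = [(35, 0.80), (60, 0.70), (120, 0.55), (180, 0.60), (300, 0.40), (450, 0.45)]


def size_class(n_catchments):
    """Assign a study to a size class (assumed boundaries: 100 and 250)."""
    if n_catchments < 100:
        return "<100"
    if n_catchments <= 250:
        return "100-250"
    return ">250"


def median_by_class(results):
    """Group scores by study size class and return the median of each class."""
    groups = {}
    for n_catchments, score in results:
        groups.setdefault(size_class(n_catchments), []).append(score)
    return {label: median(scores) for label, scores in groups.items()}


summary = median_by_class(studies)
for label, value in summary.items():
    print(label, round(value, 3))
```

With these invented values the median error falls from the smallest to the largest class, which mirrors the direction of the trend reported for floods but says nothing about its magnitude.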

To what extent does runoff prediction performance depend on climate and catchment characteristics?
The assessment of the predictive performance of the low flow regionalisation methods with respect to three climate and catchment characteristics (Level 2 assessment) is presented in Fig. 5. The lines indicate the median runoff prediction performance of catchments belonging to the same study.
Overall, the absolute normalised error (ANE, see Sect. 3.1) clearly increases with increasing aridity. This means that the performance is consistently lower in drier, more arid environments. These are regions that tend to be particularly heterogeneous, with a high presence of intermittent rivers (Jacobson and Jacobson, 2013), and where low flows may be small, which makes them particularly hard to predict. Figure 5 also indicates that there is a tendency for performance to increase with catchment elevation. The average of all methods shows that errors decrease from 0.37 for lowland catchments (mean elevation < 200 m a.s.l.) to 0.16 for high mountain catchments. This may be partially due to the higher specific discharges of mountainous catchments compared to lowland catchments, which may increase predictability. Also, in the high mountains, low flows may be of a winter low flow type, so low flows may depend on frost strength, which is closely related to catchment elevation. The bottom panels in the figure show the performance as a function of catchment scale. For all methods the performance increases with catchment scale. This may be related to both data availability and space-time aggregation of runoff processes in the catchments, which will increase the predictability. The exceptions are methods that use short runoff records at the site of interest. In these cases, the dependence of performance on catchment size is less pronounced than for the other methods. These types of methods may be more dependent on how well the short runoff record represents the temporal variability of low flows, so the dependence on the spatial variability, and therefore on catchment size, may be lower.

The left panels in Fig. 6 summarise the performance for different regionalisation approaches, stratified by the aridity index. The left-top, left-middle and left-bottom panels show the performance for all catchments, catchments with an aridity index below 1 and catchments with an aridity index above 1, respectively. Overall, for all catchments the performance of the global regression is much lower than that of any other method. This is consistent with the Level 1 assessment. In the arid catchments the performance of the global regression is particularly low and the absolute normalised errors are on average around 1.1. In the humid regions the short records perform better than any other method. This is, again, consistent with the Level 1 assessment. However, this is no longer the case for the arid catchments, where the performance of the short records is in fact lower than that of the geostatistical methods and regional regression. It appears that, in arid regions, the variability of the low flows between years may be larger than in other climates, which makes the method more dependent on an appropriate donor site. The appropriateness of a donor depends on gauging density, which is often lower in the more arid countries. Methods may be needed in arid regions that specifically account for the runoff generation processes in the region, and preferably are based on proxy data that account for these processes.
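The stratification by aridity used in the Level 2 assessment can be illustrated with a short sketch. It assumes that the ANE is the absolute difference between predicted and observed runoff divided by the observed value (Sect. 3.1 holds the authoritative definition) and that the aridity index is the ratio of potential evaporation to precipitation, with catchments below 1 classed as humid and all others as arid. All function names and catchment values are hypothetical.

```python
from statistics import median


def ane(predicted, observed):
    """Absolute normalised error (assumed definition: |pred - obs| / obs)."""
    return abs(predicted - observed) / observed


def aridity_index(potential_evaporation, precipitation):
    """Aridity index as potential evaporation over precipitation (same units)."""
    return potential_evaporation / precipitation


# Hypothetical catchments: (E_PA in mm/yr, P_A in mm/yr, observed q95, predicted q95)
catchments = [
    (500.0, 1000.0, 0.010, 0.012),
    (600.0, 900.0, 0.008, 0.006),
    (1200.0, 600.0, 0.002, 0.004),
    (1500.0, 500.0, 0.001, 0.0025),
]


def median_ane_by_climate(records):
    """Split catchments at an aridity index of 1 and return the median ANE per class."""
    groups = {"humid": [], "arid": []}
    for epa, pa, obs, pred in records:
        label = "humid" if aridity_index(epa, pa) < 1 else "arid"
        groups[label].append(ane(pred, obs))
    return {label: median(values) for label, values in groups.items() if values}


summary = median_ane_by_climate(catchments)
print({label: round(value, 3) for label, value in summary.items()})
```

The invented arid catchments have very small observed low flows, so the same absolute error produces a much larger normalised error, which is one reason why small low flows are hard to predict in relative terms.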
The Level 2 assessment for flood prediction studies, i.e. the assessment of the ANE measure with respect to the three climate and catchment characteristics, is presented in Fig. 7. The lines indicate again the median runoff prediction performance of catchments belonging to the same study. The top panel shows that the errors clearly increase with increasing aridity, i.e. there is a decrease in performance with aridity for all three methods. This is also supported by the lines representing comparative studies. This clear trend is in line with the Level 1 assessment for floods, but also with both assessment levels for low flows. Arid regions tend to be more heterogeneous than humid regions and runoff processes are more non-linear, which makes the predictions for both floods and low flows more difficult. There is a slight increase in performance with elevation but, in contrast to aridity, the errors do not change much with elevation. In the studies examined here, the highest elevation catchments are influenced by snowmelt, so there is a tendency for the flood predictions to improve if snowmelt is involved in the flood generation processes.

Fig. 6. Absolute normalised error of predicting q95 low flows (m³ s⁻¹ km⁻²), left panels, and q100 floods (m³ s⁻¹ km⁻²), right panels, in ungauged basins for different regionalisation methods, stratified by aridity (Level 2 assessment). Top: all catchments. Centre: humid catchments (aridity index < 1). Bottom: arid catchments (aridity index ≥ 1). Lines connect median efficiencies for the same study. Boxes are 40-60 % quantiles, whiskers are 20-80 % quantiles.
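The Level 2 figures summarise the errors with boxes spanning the 40-60 % quantiles and whiskers spanning the 20-80 % quantiles, which is narrower than the conventional quartile box. The sketch below shows one way such a summary could be computed with the Python standard library. The interpolation rule (`method="inclusive"`) is an assumption, since the text does not state how the quantiles were estimated, and the error values are invented.

```python
from statistics import quantiles


def box_summary(errors):
    """Return the 20, 40, 50, 60 and 80 % quantiles of a list of errors."""
    # n=10 yields nine cut points at 10 %, 20 %, ..., 90 %.
    deciles = quantiles(errors, n=10, method="inclusive")
    return {
        "whisker_low": deciles[1],   # 20 %
        "box_low": deciles[3],       # 40 %
        "median": deciles[4],        # 50 %
        "box_high": deciles[5],      # 60 %
        "whisker_high": deciles[7],  # 80 %
    }


# Hypothetical absolute normalised errors for one method, illustrative only.
errors = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
summary = box_summary(errors)
print(summary)
```

With this convention the box contains only the central 20 % of catchments, so it mainly shows how the typical error shifts between classes, not the full spread.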
The results stratified by catchment area (Fig. 7, bottom panels) indicate a clear increase in performance (decrease of ANE) with increasing catchment area for all methods. The increasing performance with catchment size is likely related to two factors. The first is data availability. As the catchment size increases, the likelihood that gauged subcatchments are available as donor stations increases. This will lead to a more reliable transfer of the flood characteristics. The second is that, for larger catchments, there are aggregation effects on the flood generating processes, so floods tend to be less flashy and therefore easier to predict.
The right panels in Fig. 6 summarise the runoff prediction performance of different regionalisation approaches, stratified by the aridity index. Again, the right-top, right-middle and right-bottom panels show the performance for all catchments, catchments with an aridity index below 1 and catchments with an aridity index above 1, respectively. Analysis of the overall performance of the three methods shows that performance is similar for geostatistical and index methods, which perform slightly better than the regression methods. For humid catchments, again, the performance of geostatistical methods is slightly better than that of index methods, and the performance of the regression methods is slightly lower. For dry catchments, however, the index methods perform significantly worse than the other two methods.

Fig. 7. Absolute normalised error of predicting q100 floods (m³ s⁻¹ km⁻²) in ungauged basins as a function of aridity (E_PA/P_A), mean elevation and catchment area for different regionalisation methods (panel columns: regression, index method, geostatistics; Level 2 assessment). Lines connect median errors for the same study. Boxes are 40-60 % quantiles, whiskers are 20-80 % quantiles.
The low performance of the index flood method in arid regions may be related to the underlying assumption of using the same non-dimensional flood frequency curve (i.e. growth curve) in the entire region. Arid regions may be spatially more heterogeneous, leading to lower performance. More importantly, most arid catchments show larger errors for the index methods because these methods overestimate the 100 yr floods (Fig. 7, top centre). The median absolute normalised error is 1.0, and the errors were in the vast majority positive (presented in Blöschl et al., 2013), indicating that the methods typically predict around twice the floods actually observed. If a homogeneous region contains both arid catchments with relatively lower floods and wetter catchments with higher floods, the homogeneity assumption will tend to lead to an overestimation in those catchments with the lower floods. The last two columns in Table 4 present a summary of the methods with the highest and lowest predictive performances in the Level 2 assessments of low flows and floods.
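The statement that a positive normalised error of 1.0 corresponds to predicting about twice the observed flood follows from a one-line rearrangement. The notation is assumed for illustration: NE is the signed normalised error, \(\hat{Q}_{100}\) the predicted and \(Q_{100}\) the observed 100 yr flood.

```latex
\[
\mathrm{NE} = \frac{\hat{Q}_{100} - Q_{100}}{Q_{100}} = 1.0
\quad\Longrightarrow\quad
\hat{Q}_{100} = (1 + \mathrm{NE})\, Q_{100} = 2\, Q_{100}.
\]
```

The same absolute value with a negative sign would instead imply a predicted flood of zero, which is why the predominantly positive sign of the errors matters for this interpretation.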

Conclusions
This paper has compared the performance of predicting low flow and flood discharges in ungauged basins using different regionalisation methods. Two kinds of assessment were performed: a Level 1 assessment, which constitutes a meta-analysis of the literature, and a Level 2 assessment, which analyses individual catchments in more detail. The results indicate that the Level 1 and Level 2 assessments are consistent while shedding light on different aspects of the prediction problem. The assessment of flood and low flow estimation methods in this paper represents the largest existing meta-analysis of regionalisation studies of hydrological extremes. However, it is clear that the analysis cannot cover all facets of hydrological variability worldwide. Arid and tropical climates are missing in the case of low flows. Arid climates are especially prone to droughts, so it would be worthwhile to pursue more detailed research on assessing predictions of extreme low flows in these areas. Also, some of the methods, e.g. process-based methods, are under-represented in the literature and a more detailed analysis of these would be of interest. For the flood regionalisation studies, the coverage of climates is more uniform, but there is a clear dominance of the regression-type and index-flood methods over geostatistical approaches. The increasing trend in the application of the latter group of methods is likely to lead to a sizeable sample of studies in the literature, which will allow more comprehensive tests of their performance in the near future.
The Level 1 analysis suggests that in humid regions the performance of predicting both low flows and floods in ungauged basins tends to be better than in other climates. For the case of floods the performance tends to be lowest in arid regions. For the case of low flows, geostatistical methods can perform better than regional regressions in regions with medium to high stream gauge density if the stream network structure is taken into account. Regional regressions that divide a domain into subregions and apply regression models separately always perform much better than global regressions. For the case of floods, geostatistical methods tend to perform better than the other methods, regressions tend to have the lowest performance, and index methods lie between geostatistical and regression methods. This suggests that it may be difficult to find catchment characteristics that are suitable for regression methods, both for low flows and floods. Again, for both low flows and floods the performance tends to increase with the number of stations in a region, highlighting the value of stream gauge data in the region of interest, even for the case of ungauged basins.
The results of the more detailed analysis (Level 2) are mostly consistent with those of the meta-analysis of the literature (Level 1). For the case of low flows the predictive performance tends to decrease with increasing aridity (both Level 1 and Level 2 assessments). The performance improves with increasing catchment area (Level 2 assessment), apparently because of the presence of longer water flow pathways that accompany increasing catchment size. Short records are particularly useful for improving the performance of low flow predictions (both Levels 1 and 2), especially in humid regions, and are perhaps not as useful in arid regions because of the strong interannual variability together with the usually low stream gauge density in arid regions (Level 2). Of the various methods, regional regressions have been shown to be better than global regressions (from Level 1 and Level 2 assessments). For the case of floods, the predictive performance also tends to decrease with increasing aridity (both Level 1 and Level 2 assessments). As expected, predictive performance increases with increasing catchment area (Level 2 assessment). Both Level 1 and Level 2 assessments indicated that the geostatistical methods have the best performance (especially when data availability is high), index methods work next best, and regression methods have the lowest performance. In arid conditions the index methods are significantly biased and overestimate the 100 yr floods in the catchments analysed. The Level 2 assessment also indicated that index methods do not work well in arid regions. Arid regions would therefore need more gauges to capture the temporal and spatial variability, but achieving this is unrealistic in many arid parts of the world where (for economic reasons) data density is typically lower than in humid regions. Methods that are able to exploit the specifics of the region would be needed here. Use of readily available landscape information, such as erosional patterns, based on the idea of reading the landscape, may assist in improving the predictions of runoff extremes. More research on arid hydrology is urgently needed. Scale, uncertainty, and choice of proxy data are likely important considerations in this body of research (e.g. Blöschl, 2006; Koutsoyiannis et al., 2009).
The meta-analysis of the literature highlighted that the results on predictive performance of low flows and floods are presented in widely diverse ways, using different performance measures, different ways of aggregating the information of the regions of interest, and different levels of detail on the hydrological characteristics of the regions. It appears that, to make the results more useful to the hydrological community, it would be essential to adjust the reporting of results and make them more comparable. This would assist in generalising the findings from individual case studies. We need techniques to exploit information from individual catchment studies, as well as the compilation of all studies from around the world. As a community, we collectively need to go beyond that and find systematic ways to generate knowledge, in terms of the patterns that connect across the multitude of studies, and thereby provide a higher level of predictability as to what will happen next and an understanding that will enable extrapolation to new situations. This points to the importance of hydrological synthesis as a vehicle for creating these connections.

Fig. 1. Map indicating the countries included in the meta-analysis of low flow studies (top) and flood studies (bottom) reported in the literature (Level 1 assessment).

Fig. 2. Coefficient of determination of predicting low flows in ungauged basins (left) and root mean squared normalised error of predicting floods in ungauged basins (right), stratified by climate (Level 1 assessment). Each symbol refers to a result from the studies in Tables 1 and 3. Circles represent performances calculated on specific discharges (m³ s⁻¹ km⁻²), crosses represent performances calculated on discharges (m³ s⁻¹). Boxes show 25-75 % quantiles.

Fig. 3. Coefficient of determination of predicting low flows in ungauged basins (left) and root mean squared normalised error of predicting floods in ungauged basins (right), stratified by regionalisation method (Level 1 assessment). Each symbol refers to a result from the studies in Tables 1 and 3. Circles represent performances calculated on specific discharges (m³ s⁻¹ km⁻²), crosses represent performances calculated on discharges (m³ s⁻¹). Lines indicate studies that compared different methods for the same set of catchments. Boxes show 25-75 % quantiles.

Fig. 4. Coefficient of determination of predicting low flows in ungauged basins (left) and root mean squared normalised error of predicting floods in ungauged basins (right), stratified by the number of catchments within each study (Level 1 assessment). Each symbol refers to a result from the studies in Tables 1 and 3. Circles represent performances calculated on specific discharges (m³ s⁻¹ km⁻²), crosses represent performances calculated on discharges (m³ s⁻¹). Boxes show 25-75 % quantiles.

Fig. 5. Absolute normalised error of predicting q95 low flows (m³ s⁻¹ km⁻²) in ungauged basins as a function of aridity (E_PA/P_A), mean elevation and catchment area for different regionalisation methods (Level 2 assessment). Lines connect median errors for the same study. Boxes are 40-60 % quantiles, whiskers are 20-80 % quantiles.

Table 2. Number of studies (in brackets number of results) and number of catchments used. Level 1 refers to an assessment of the average performance of studies, Level 2 to an assessment of the performance for individual catchments.

Table 3. Summary assessment of studies for flood estimation in ungauged catchments used in Level 1 assessment. Error measure indicates the leave-one-out assessment of model efficiency in terms of the root mean square normalised error RMSNE. Flood regionalisation methods include: regression methods (R), index methods (IM) and geostatistics (G). Predicted variable indicates the flood discharge estimated in the study and includes: 100 yr specific flood runoff (q100), 100 yr flood runoff (Q100) and 100 yr flood runoff standardised by the mean annual flood (Q100/Qm). Ranges or various values for RMSNE represent variations of the methods or the same method applied on different subsamples from the same region.

Table 4. Methods with the highest and lowest cross-validation performance of runoff predictions in ungauged basins. Arid relates to catchments with an aridity index > 1. Level 1 refers to an assessment of the average performance of studies, Level 2 to an assessment of the performance for individual catchments. For the number of studies and catchments see Table 2.