Analysis of high streamflow extremes in climate change studies: How do we calibrate hydrological models?

. Climate change impact studies on hydrological extremes often rely on the use of hydrological models with parameters inferred by using observational data of daily streamflow. In this work we show that this is an error prone procedure when the interest is to develop reliable Empirical Cumulative Distribution Function curves of annual streamflow maximum. As an alternative approach we introduce a methodology, coined Hydrological Calibration of eXtremes (HyCoX), 10 in which the calibration of the hydrological model is carried out by directly targeting the probability distribution of high flow extremes. In particular, hydrological simulations conducted during a reference period, as driven by climate models’ outputs, are constrained to maximize the probability that the modeled and observed high flow extremes belong to the same population. The application to the Adige river catchment (southeastern Alps, Italy) by means of HYPERstreamHS, a distributed hydrological model, showed that this procedure preserves statistical coherence and produce reliable quantiles of 15 the annual maximum streamflow to be used in assessment studies.

literature. A number of studies investigated the likely impact of climate change on hydrology by combining ensemble of projections from multiple climate models under different greenhouse gas emissions scenarios and hydrological modeling [e.g., Kundzewicz et al., 2007;Todd et al., 2010 andHarris, 2006 for a comprehensive review]. A wealth of studies focus on long-term annual and/or seasonal changes in hydrological variables such as runoff, streamflow, snow melt 30 and soil moisture [e.g., Chiew et al. 2009;Majone et al., 2012;Buytaert and De Bièvre, 2012]. Much less address projected changes in hydrological extremes, i.e. floods and droughts, though they are expected to exert profound and dramatic impacts on agriculture, economy, human health, energy and many other water-related sectors [e.g., Arnell 2011;Taye et al. 2011;Bouwer, 2013;Thornton et al., 2014].
The peculiarity of hydrological calibration in climate change impact studies has been highly debated in the hydrological 35 modelling community [e.g. Peel and Blöschl, 2011;Muñoz et al., 2013;Montanari et al., 2013;Thirel et al., 2014]. The standard approach is to calibrate the selected hydrological model using a chronological time series of streamflow observations and then feed the model with future climate projections to evaluate changes of hydrological indicators [see e.g. Hattermann et al. 2018]. However, several studies showed that model parameters are dependent on the climatic characteristics of the input forcing used during the calibration of the hydrological model [e.g., Vaze et al., 2010;Laiti et al., 40 2018]. Although recognized, this additional source of uncertainty is mostly ignored in climate change impact studies.
Indeed, a number of studies suggest that observed streamflow extremes provide valuable information about the hydrological behavior of investigated catchments [Grubbs, 1969;Laio et al., 2010]. Similarly, Perrin et al. [2007] and Seibert and Beven [2009] concluded that a limited number of streamflow extremes encapsulate a significant amount of information that may be useful in the context of hydrological model calibration. Beven and Westerberg [2011] suggested also that, when dealing with 45 extremes, including the entire time series might not be informative. This occurs, for instance, when streamflow extremes belong to a different population than ordinary flows [e.g., Calenda et al, 2009], such that the latter do not provide useful information for inferring the former. Hence, quantifying the influence of such extreme events on model calibration is still a challenge in hydrological studies [Brigode et al., 2015], such as quantifying the uncertainty associated to these estimates [Honti et al., 2014]. 50 Despite the above evidences, the typical approach adopted in impact assessments of hydrological extremes is to estimate the projected changes of selected indicators (e.g. flow quantiles) using a hydrological model calibrated against chronological time series of streamflow observations during a reference period [e.g., Wilby and Harris, 2006;Hattermann et al. 2018]. The drawbacks of such approach are however threefold: i) a model correctly reproducing the time series of observed streamflow does not guarantee the correct reproduction of the desired statistics for extremes, as discussed above; ii) due to epistemic 55 uncertainty, a model calibrated with a given set of observed ground data may respond in a different way when fed with projections obtained from climate change scenarios; and iii) due to the impossibility of obtaining totally unbiased climate simulations there is no a-priori guarantee that simulations fed by climate models produce samples (e.g. time series of simulated annual streamflow maximum) that are statistically coherent with observations. To overcome the aforementioned limitations, we propose an innovative methodology in which the calibration of a 60 physically-based hydrological model is conducted by directly targeting the probability distribution of high streamflow extremes (i.e., the Empirical Cumulative Distribution Function, ECDF, of annual maxima). While the approach is exemplified in this work for high streamflows (also because of the broad interest in the topic), it can be applied to low flows as well (e.g., for droughts assessment). The methodology, coined here as Hydrological Calibration of eXtremes (HyCoX), targets specifically climate change impact assessment studies. In particular, hydrological simulations conducted during a 65 reference period, as driven by climate models, are constrained to maximize the chances that the simulated and observed high flow extremes belong to the same population. Statistical coherence is obtained by using the two-sample Kolmogorov-Smirnov statistic [Smirnov, 1939] as the efficiency metric during the calibration procedure. The parameterization of the hydrological model obtained following this approach is then used in future climate change scenario runs to project changes in the distribution of high flows. The strengths of the methodology are highlighted by performing a comparison with 70 experiments in which model parameterizations are obtained by calibrating the hydrological model, fed by observed ground data, with a suite of efficiency metrics customarily used in hydrological applications, i.e., Nash-Sutcliffe efficiency, [Nash and Sutcliffe,1970] and flow duration curve-related metric .
We emphasize that the suggested approach is by definition "goal-oriented", according to the definition discussed in , Savoy et al. [2017], Guthke [2017], Laiti et al., [2018] and Li et al. [2018]. In other words, the selection of the 75 hydrological model, its level of complexity and the metric to be employed for the evaluation of the statistical coherence of simulated and observed extremes depend on modeler's choice and the particular goal at hand. In the present work the main focus is on high streamflow extremes, and along the goal-oriented approach the proposed calibration procedure is tailored around it.
The paper is organized as follows: Section 2 presents the hydrological calibration framework, the adopted statistical 80 coherence test and the calibration metrics; a description of the study area, the climate change projections available and the observational hydro-meteorological datasets used are summarized in Section 3. The main findings are presented and discussed in Section 4, whereas conclusions are finally drawn in Section 5.

Hydrological modelling 85
Hydrological simulations were performed at the daily time scale with the HYPERstreamHS model [Avesani et al., 2021;Laiti et al., 2018;Larsen et al., 2021] which couples the HYPERstream routing scheme, recently proposed by Piccolroaz et al., [2016], with a continuous SCS-CN module for surface and subsurface flow generation [Michel et al., 2005].
HYPERstream routing scheme is specifically designed for being easily coupled with climate models and, in general, with gridded climate datasets. HYPERstream can share the same computational grid as that of any overlaying product providing 90 the meteorological forcing, still preserving geomorphological dispersion of the river network [Rinaldo et al., 1991]  irrespective of the grid resolution. This "perfect upscaling" [cf. Piccolroaz et al., 2016] can be achieved by application of suitable transfer functions derived from a high-resolution Digital Elevation Model of the study area. The continuous soilmoisture accounting model for surface and subsurface flow generation, based on the SCS-CN methodology, is coupled with a non-linear bucket model for soil moisture dynamics [Majone et al., 2010], a degree-day model for snow melting and 95 accumulation [Rango and Martinec, 1995] and the Hargreaves and Samani [1982] model for evapotranspiration. Furthermore, deep infiltration enters a linear bucket used to represent return flow. Notice that the surface and subsurface flow generation module was already successfully applied in two previous studies conducted in Alpine catchments [Piccolroaz et al., 2015;Bellin et al., 2016]. The model requires a total of 12 parameters, which are assumed as spatially uniform but uncertain and all subject to calibration. A detailed description of the hydrological modeling can be found in Laiti 100 et al. [2018] and Avesani et al. [2021].

Evaluation of statistical coherence
Statistical coherence between the observed and simulated populations of extremes was evaluated by means of the twosample Kolmogorov-Smirnov test [Smirnov, 1939], applied under the null hypothesis that the two samples are drawn from the same underlying distribution. In the two-tail application of interest here the test's statistic is defined as the maximum 105 absolute distance, ! , between the simulated ( " ) and observed ( # ) ECDFs of annual daily streamflow maximum (Q M ): where i is the position of ",(&) ( and #,(&) ( in the ranked samples of the simulated (s) and observed (o) annual streamflow maxima, respectively, and n is the number of years considered in the simulation (29 values in this work, one for each year of the investigated period excluding the first two, see Sect. 2.4). As customary in statistics ",(&) ( , = 1, … , indicates the 110 ranked time series of the annual maxima ",& ( of simulated streamflow. A similar definition has been introduced for observed streamflow. The closer ! is to 0 the more likely it is that the two samples are drawn from the same population. In addition, the two-sample Kolmogorov-Smirnov test returns a p-value (p) corresponding to the computed ! statistic. The p-value is the probability of rejecting the null hypothesis when it is true. It can also be defined as the smallest significance level " at which the null hypothesis would be rejected [Conover, 1999]. In simpler terms, the larger the p-value the stronger is the 115 evidence in favor of the null hypothesis, i.e., in our work that the samples are drawn from the same distribution.
In our framework the evaluation of statistical coherence of observed and simulated populations of extremes is performed aposteriori for each simulation experiment described in Section 2.4. In particular, the p-value will be used as a measure of the statistical coherence between simulated and observed high streamflow extremes.

ECDF computation 120
The daily average annual streamflow maxima are extracted from the chronological daily time series of observed (o) and simulated (s) streamflow and their empirical probability, F, is computed separately according to the classic Weibull formulation [Weibull, 1939]: Empirical probabilities provided by Eq.

Hydrological model calibration
The HYPERstreamHS hydrological model was calibrated against streamflow observations using as meteorological forcing both an observational dataset (i.e. ADIGE, see Sect. 3.2) and three climate models each under two emission scenarios. A short description of these datasets is provided in Sect. 3.3. Calibrations were performed during the period 1980-2010, assumed as reference, with the first two years used as spin-up and therefore excluded from the computation of model 130 performances. Parameters were inferred by optimizing three efficiency metrics by means of the Particle Swarming Optimization algorithm [Kennedy and Eberhart, 1995].
The first efficiency metric is the classic Nash-Sutcliffe index [Nash and Sutcliffe, 1970], which is widely used in where 4 ",(&) 67 ( ) and 4 #,(&) 67 are the simulated and observed streamflow values at the 67 evaluation points (EPs) in which the 145 flow duration curves are partitioned (ranked from the larger to the smaller value) and ¯# is the mean of the observed time series. According to this metric, RFDC = 1 when the two flow duration curves coincide (i.e., they are the same at all the EPs).
Given that the flow duration curve is insensitive to chronologic sequence, RFDC has been used as objective function for streamflow maxima obtained with both climate models and the observational dataset ADIGE. Furthermore, following Westerberg et al. [2011], we employed the so-called volume method in which EPs are identified as the upper boundary of the 150 elements obtained by partitioning the area below the curve in 67 elements such that each of them is characterized by the same water volume. Given the same number of EPs, we remark that the procedure is performed independently for observed and simulated FDCs and it is indeed possible that the total volume under the curves and the water volume of each interval differ between observations and simulations. The water volume pertaining to each interval as well the total water volume of the flow duration curve are computed by using the right Riemann sum procedure [Protter and Morrey, 1977]. In the 155 computations we used 67 = 50, which has been shown sufficient to obtain convergence of the statistic (4) irrespective of the integration scheme [Vogel and Fennessey, 1994].
The third efficiency metric is the minimum of the two-sample Kolmogorov-Smirnov ( ! ) statistic defined in Eq. (1): where " and # are the simulated and observed ECDF as introduced in Eq. year, respectively. This metric, which is at the core of the proposed approach, aims to maximize the chances that the modeled and observed samples of extreme high flows belong to the same population. In other words, among all possible sets of model parameters we consider the one leading to the sample of simulated annual maxima with the smallest Dn (i.e., KS metric). Since KS is not sensitive to the temporal sequence of observed and simulated streamflows, similar to the RFDC case, 165 it has been applied to climate projections in addition to the simulations with the observational dataset ADIGE.

Probability distribution computation and confidence intervals
The probability distributions of simulated and observed annual streamflow maxima were obtained by fitting the Extreme Value Type I (Gumbel) [Gumbel, 1941] [Hosking, 1985] to the respective samples. The Pearson's chi-squared test [Pearson, 1990] with a confidence 170 level " = 0.05 was applied to validate the parameters and provided by the MLE. The adaptation of the Gumbel distribution to the annual maxima of the observed and simulated daily streamflow was performed a-posteriori for comparison purposes in order to extrapolate high flow quantiles (i.e., high return periods) for all the simulation experiments presented in Sect. 4.
Confidence intervals of observed streamflow ECDF were computed by means of parametric bootstrap [Efron, 1982] under 175 the assumption that the quantity of interest was distributed according to the above parametric Gumbel probability distribution. In particular, 90% confidence band was estimated by using 10000 uniform random samplings from the underlying inferred distribution.

Study area 180
To exemplify the application of the methodology we selected as a case study the upper portion of the Adige river basin (Italy), located in the south-eastern Alpine region (see Figure 1), at the gauging station of Trento (11° 06' 54.8" E, 46° 04' 13" N, drainage area of about 9850 km 2 ). The Adige river originates at the Resia Pass (close to the Alpine divide) and ends its course after 410 km in the northern Adriatic Sea. It is a typical Alpine watershed, with terrain elevations ranging from 185 m a.s.l. at Trento to 3500 m a.s.l. at the Italian-Austrian border. The morphology is characterized by deep valleys and 185 high mountain crests.
The climate of the watershed is characterized by relatively dry and cold winters followed by humid summers and autumns.
Streamflow is minimum in winter, when precipitation falls as snow over most of the river basin, and shows two maxima: one occurring early in summer, due to snowmelt, and the other in autumn, triggered by intense cyclonic storms. The annual average precipitation ranges from 500 mm in the North-West to 1600 mm in the southern part of the basin [Lutz et al., 2016;190 Diamantini et al., 2018;Laiti et al., 2018]. Projected decrease of snowfall in winter and anticipation of earlier snow-melting, essentially due to rising temperatures associated with global warming [Gobiet et al., 2014;Gampe et al., 2016], will likely affect Adige streamflow regime by the second half of 21 st century [Bard et al., 2015;Majone et al., 2016]. This may have relevant consequences on water resources and hydropower production, which is particularly relevant in this region of the Alps [Zolezzi et al., 2009;Bellin et al., 2016;Majone et al., 2016]. See also Chiogna et al. [2016] for a comprehensive 195 review of the hydrological stressors acting in the Adige basin, as well as of its ecological status.

Observational datasets
The regional dataset ADIGE developed by Mallucci et al., [2019] by using the meteorological stations within the catchment and in the nearby Austrian territory bounding the catchment from the north, was used as an observational precipitation and temperature dataset. ADIGE was selected since it is the most accurate gridded meteorological dataset of the investigated 205 river basin (as shown in the recent paper by Laiti et al., 2018). Meteorological data at the selected stations were provided by the Austrian Zentralanstalt für Meteorologie und Geodynamik (www.zamg.ac.at) and the meteorological offices of the Autonomous Provinces of Trento (www.meteotrentino.it) and Bolzano (www.provincia.bz.it/meteo). The time series were interpolated over a 1-km grid at a daily time step by means of the kriging with external drift algorithm [Goovaerts, 1997;Journel and Rossi, 1989], with an exponential semivariogram and by using the 16 closest neighbouring stations to generate 210 the estimates. The spatial distribution model was selected by Mallucci et al. [2019] according to the leave-one-out crossvalidation procedure, applied to ordinary kriging and kriging with external drift algorithms, in association with multiple semi-variogram models (i.e., Gaussian, spherical and exponential models) and different numbers of neighbouring stations (namely 8, 16 and 32 stations). An average absolute error of the daily estimates of about 1.32 mm for precipitation and 0.02°C for temperature is reported in Mallucci et al. [2019], comparable with the error estimates provided by other ground 215 data-based datasets available in the Alpine region such as APGD [Isotta et al., 2014]. To comply with the computational grid adopted by HYPERstreamHS model, ADIGE was aggregated by areal averaging to the 5 km grid depicted in Figure 1.
Daily streamflow data collected at the Trento Ponte san Lorenzo gauging station (see Figure 1) during the period 1980-2010 were provided by the Hydrological Office of the Autonomous Province of Trento (www.floods.it).

Climate change projections 220
Climate projections applied in this study were derived from the combination of General Circulation Models (GCMs) and Regional Climate Models (RCMs) available from EURO-CORDEX initiative under 4.5 and 8.5 Representative Concentration Pathways (RCP4.5 and RCP8.5), at a spatial resolution of about 12 km [EUR-11, http://www.eurocordex.net/, Jacob et al., 2014]. To reduce the computational burden in the hydrological modeling experiments, we adopted the model sub-selection proposed by Vrzel et al. [2019] whom applied a hierarchical clustering approach [Wilcke and Bärring, 2016] in 225 selected European river basins (including the Adige) in order to reduce the number of available Climate Model (CM) simulations (i.e., GCM-RCM combinations) while preserving the variability of the overall ensemble of climate change signals. In particular, model reduction involved 5 steps: 1) identification of meteorological variables; 2) transformation of variables into orthogonal and therewith uncorrelated variables by means of singular vector decomposition; 3) identification of the optimum number of clusters; 4) hierarchical clustering to group the simulations; and finally 5) selection of the 230 simulation closest to the group's mean as representative. This procedure led to the selection of the three GCM-RCM combinations (out of the 12 available), here referred to as CLMcom, KNMI and SMHI (see Table 1).
These three Climate Models (CMs) provide an assessment of likely future climate changes for the mid-term horizon 2040-2070, with the time window 1980-2010 selected as a period of reference. The projected climate change meteorological signals in the Adige are discussed in Gampe et al. [2016]. Both RCP4.5 and RCP8.5 emission scenarios are available for all 235 the three adopted CMs, thereby leading to a total of six CMs which are investigated in the present study (see Table 1). Since GCMs/RCMs combinations are prone to model biases especially in complex terrain [Kotlarski et al., 2014], bias-correction is needed to accurately reproduce historical meteorological forcing during the reference period. As customarily done in most of the climate change impacts studies, we rely on bias-corrected products retrieved from EURO-CORDEX initiative. For the reference period 1989-2010 EURO-CORDEX products are available bias corrected by the distribution-based scaling 240 approach [DBS, Yang et al., 2010] using as a reference the MESAN gridded reanalysis datasets of daily precipitation and temperature [Landelius et al., 2016]. CMs forcing in the reference period 1980-2010 slightly differ between the two RCPs as a consequence of: i) the bias correction method adopted, which matches observed and simulated frequency distributions rather than the observed values; and ii) the correction used in the reference period 1989-2010 is extended to the period 1980-by means of the nearest neighbour method to the computational 5 km spacing grid used in the discretization of the Adige river basin (see the grid shown in Figure 1). 4 Results and discussion

Direct calibrations forced by CMs 250
Here we consider the results of hydrological simulations performed by using precipitations and temperature from the three GCM-RCM combinations selected as described in Section 3.3 each one with both RCP4.5 and RCP8.5 pathways, for a total of six combinations, as shown in Table 1. The parameters of the hydrological model are obtained by optimizing KS and RFDC in the period 1982-2010. The ECDFs of the daily average annual streamflow maximum are extracted a-posteriori from the six optimized models and then compared with that of the historical observations in the same period. The comparison is 255 performed by applying the Kolmogorov-Smirnov two-sample test between simulated and observed ECDFs (see Sect. 2.2), with the former obtained by optimizing the two metrics (Eq. (5)) and 9:; (Eq. (4)), respectively.
Results presented in Table 2 show that for both RCP4.5 and RCP8.5 emission scenarios simulations performed by optimizing KS (Eq., (5)) provided samples that with high probability (i.e. p = 1.000) belong to the same population of the observed values for all the 3 CMs used. On the other hand, the optimization of RFDC leads to samples belonging to the same 260 population with probability larger than p=0.05 (i.e. the level of significance customarily adopted in the statistical literature to reject the null hypothesis), though significantly lower than for KS (fourth column in Table 2). The lowest p value is obtained when the calibration is performed by using the climate model CLMcom with RCP4.5 (p = 0.222, see Figure 2a and Table 2).
Consistently, the absolute maximum distance between the ECDF of observed and simulated sample distributions (i.e., Dn) obtained by using RFDC as calibration metric are always larger than those obtained by using KS (see third column in Table 2). 265 The consequence in terms of correspondence between observed and simulated ECDFs is indeed appreciable. The ECDFs obtained from simulations performed by adopting KS as optimization criteria are in a better agreement with the observed ECDF than the ECDFs obtained from simulations that use RFDC as optimization criterion (see all subplots in Figure 2). These results highlight that adopting the KS metric is preferable than using RFDC when dealing with high flow extremes, thus strengthening the idea of targeting directly the desired statistics for extremes in calibration instead of calibrating the 270 hydrological model on the entire streamflow record. The above findings can be cast and discussed in the context of the hydrological literature of hydrological extremes, and in the following we further expand the discussion presented in the Introduction in light of the results obtained so far. Climate change effects on extremes are typically assessed by using a hydrological-climatological modeling chain [Wilby and Harris, 280 2006] in which the adopted hydrological model is calibrated against chronological time series of observed streamflows during a reference period, and then used with GCM-RCM future projections as meteorological input to assess projected changes of the extremes [see e.g. Ngongondo et al., 2013;Aich et al., 2016;Pechlivanidis et al., 2017;Vetter et al., 2017;Hattermann et al. 2018]. Calibration of the hydrological model by targeting directly the statistics of extremes (e.g. high flow quantiles or KS metric as in our case) are indeed much less common in hydrological applications [see e.g. Honti et al., 2014]. 285 This distinction may appear unnecessary since it seems reasonable to assume that a model correctly reproducing the chronological time series of streamflow will also reproduce the statistics of interest (or distribution). However, unavoidable epistemic and parametric errors, impairs this implicit assumption. A possible alternative, we explore here, is to calibrate on the distribution of the considered extremes, i.e. the annual streamflow maxima in our case.
Examples of hydrological models calibrated by using tailored information instead of the entire observed streamflow series 290 are present in the hydrological literature. Montanari and Toth [2007] used the spectral properties of the streamflow time series as a goodness of fit metric. Blazkova and Beven [2009] used certain flow quantiles as acceptability criteria within a GLUE framework. Westerberg et al. [2011] used an informal triangular likelihood function for calibrating a hydrological model on the basis of observed flow duration curves. Furthermore, Lindenschmidt (2017) used water stage frequency curves as objective functions for reproducing ice jam formation in northern rivers. However, these approaches are typically adopted 295 for reproducing watershed response to meteorological forcing under observed conditions, and have not been applied (to our best knowledge) using directly the GCM-RCMs simulations as input forcing in the calibration procedure. The only example somewhat similar to our approach we found in literature is that of Honti et al. [2014], which however used a stochastic weather generator trained by observed weather time series coupled with observed discharge data to sample the posterior distribution of model parameters. The adoption of a time-independent calibration (i.e., timing errors do not influence the 300 model performance) has the intrinsic advantage of allowing the use of GCM-RCM runs conducted without the assimilation of observational data, as in our case. In fact, these runs provide time-slice experiments representing a stationary climate for both reference and future periods [see e.g., Majone et al., 2012] and by definition cannot be used in the context of a classical day-by-day hydrological comparison experiment with observed historical data [see e.g., Eden et al., 2014].
Studies adopting the two-sample Kolmogorov-Smirnov test to evaluate if simulated hydrological variables are distributed 305 according to a given probability distribution [e.g., Kleinen and Petschel-Held, 2007], to detect changes in hydrological variables [e.g., Wang et al., 2008], or to understand if calibrated parameters of hydrological models belong to a given probability distribution [e.g., Wu et al., 2017;Wang and Solomatine, 2019] are relatively common in the literature. This notwithstanding, we are not aware of any study adopting directly the KS metric in the context of hydrological model calibration on extremes.
The proposed method, coined here as Hydrological Calibration on Extremes (HyCoX), hence consists in the calibration of a physically-based hydrological model on the probability distribution of extremes (in our case the ECDF of the annual maximum of daily streamflow), i.e. the possibility to constrain simulated (driven by GCM-RCM runs during the reference period) and observed data samples to belong to the same population. In this respect, HyCoX also provides a framework for excluding CMs forcing datasets not coherent with hydrological observations of high flow extremes (i.e., cases with < 0.05 315 in the Kolmogorov-Smirnov two-sample test). Finally, the parameterization obtained for the hydrological model guaranties statistical coherence between scenarios and observations during the reference period, and can then be used in future climate change scenario runs to project changes in extremes.

Forward simulations using parameterizations derived from calibrations with observed ground data 325
The typical way to assess the impact of climate change on hydrology is to run a model, calibrated with observed meteorological and hydrological data, with the meteorological forcing provided by climate models. This approach is pursued here by using HYPERstreamHS calibrated against daily streamflow in the period 1982-2010, with precipitation and temperature extracted from the ADIGE dataset [Mallucci et al., 2019], and meteorological input provided by the six aforementioned CM simulations. The main objective is to assess if a model well calibrated with observational data responds 330 coherently when applied with precipitation and temperature obtained from climate models. Again, the comparison is performed by applying the Kolmogorov-Smirnov two-sample test between simulated and observed ECDFs.
HYPERstreamHS was calibrated at the Trento gauging station by using NSE, KS and RFDC metrics. In order to ease the ensuing discussions, these three parameterizations are hereafter termed as NSE-ADIGE, KS-ADIGE and RFDC-ADIGE, respectively (see also Table 2). 335 Figure 3a shows the simulated ECDFs and the associated p-values of the Kolmogorov-Smirnov test for calibrations conducted with the ADIGE dataset as input and the three aforementioned metrics (NSE, KS and RFDC). In a strict statistical sense all the three metrics provide simulated samples of annual streamflow maxima belonging to the same population of the observed ones, given that in all cases p>0.05, but with an evidence in favor of this conclusion that is maximum for KSE (p = 1.000) and minimum for RFDC (p = 0.372). At the same time, calibration conducted by using KS as efficiency metric leads to 340 NSE and RFDC values (0.4 and 0.564, respectively, see Table 2) which are lower than those obtained optimizing the two metrics in the calibration (NSE = 0.822 and RFDC = 0.975, see Table 2). This is in accordance with several studies showing that the adoption of a given metric in calibration may lead to suboptimal results for other metrics since each one of them is more sensitive to specific aspects of the time series with its own limitations and trade-offs [see e.g., Schaefli and Gupta, 2007;Gupta et al., 2009;Mcmillan et al., 2017;Fenicia et al., 2018]. This latter limitation is, in our opinion, outweighed by 345 the improvements in representing the ECDF of observed high flow extremes during the calibration of the hydrological model. Accordingly, in our analyses the use of different efficiency metrics leads to different simulated ECDFs and hence to different p-values in the application of the statistical coherence test (see Table 2).
When the 3 aforementioned parameterizations (i.e., NSE-ADIGE, RFDC-ADIGE, KS-ADIGE) are used in the forward simulations of the reference period 1982-2010 using as input the meteorological variables produced by the climate models, 350 the conclusions are significantly different (see Figures 3b, 3c, 3d). In particular, the adopted parameterizations lead to simulations that show low p-values (always lower than p = 0.372) for all the considered CMs and pathways. In particular, NSE-ADIGE and RFDC-ADIGE show on average the lowest p-values, with KS-ADIGE showing a slightly better performance: p = 0.372 for KNMI and SMHI under the RCP8.5 scenario (see Figures 3c and 3d).
The above results highlight how the classical approaches based on feeding the hydrological model calibrated to observed 355 streamflow data (i.e., by using the NSE metric) or to flow duration curve (i.e., by using the RFDC metric), with meteorological forcing provided by Climate Models produce results characterized by low statistical coherence with the observational data.
Furthermore, our results indicate that the same drawback arises when employing parameterizations obtained with a calibration approach optimizing the desired statistic of extremes, but still using observational data as input, i.e., KS-ADIGE in Figures 3b, 3c and 3d. These results are in agreement with previous studies evidencing that the use of parameterizations of 360 hydrological models (obtained with observational input forcing) providing a good reproduction during a baseline period is questionable for simulating streamflow under future climate conditions [Brigode et al., 2013;Lespinas et al., 2014]. Indeed, it is well recognized that the use of different datasets can lead to different optimized parameter sets that will partially account for their specific climate characteristics [Yapo et al. 1996;Vaze et al., 2010;Laiti et al., 2018]. Furthermore, it is acknowledged that climate change impact simulations are certainly affected by uncertainties in climate modeling, but also 365 the calibration strategy adopted during the reference period plays a, often dominant, role [Lespinas et al., 2014;Mizukami et al., 2019]. In this respect, we believe that the preservation of statistical coherence between climate scenarios and observations (i.e., high flow extremes in our case) should be taken into account directly during hydrological calibration, at least in the reference period.  Figures 3b), 3c) and 3d) evidence that for high quantiles the simulated ECDFs are often outside the 90% confidence interval of the observed ECDF for all the considered forward simulations. This is further highlighted in Figure 4, which shows the quantiles of annual maximum daily streamflow as a function of return period at the Trento gauging station. Following the procedure described in Section 2.5 extrapolations are performed under the assumption that the simulated ECDFs are 380 distributed according to the parametric Gumbel probability distribution. All the adaptations to the simulated ECDFs passed the Pearson's chi-squared test. Confidence intervals of observed streamflow are derived from the parametric bootstrap procedure outlined in Sect. 2.5. Visual inspection of Figure 4 shows that for all investigated return periods parameterizations obtained by fitting the model to observed streamflow with meteorological data provided by the ADIGE dataset (i.e., NSE-ADIGE, RFDC-ADIGE, KS-ADIGE) significantly underestimate the ECDF of annual maximum and fall outside the 385 confidence interval of observations (i.e., outside the grey area). The only exceptions are curves derived from simulations conducted with KNMI (KS-ADIGE, dotted line in Figure 4c) and CLMcom (all the 3 metrics, Figure 4a) climate models under RCP4.5 pathway. We note however how these curves are obtained with forward simulations providing low p-values of the Kolmogorov-Smirnov test with respect to the other cases (always lower than p = 0.222).
Quantiles obtained by calibrating the hydrological model with the meteorological input provided by the Climate Models and 390 KS as metric are in a very good agreement with the experimental data, while those obtained by using RFDC are outside or at the lower bound of the interval of confidence, though they generally are in a better agreement with the quantiles of the experimental data than those obtained with the aforementioned NSE-ADIGE, RFDC-ADIGE, and KS-ADIGE parametrizations. Exceptions are represented by CLMcom and KNMI under RCP4.5 and RFDC as metric that present the largest deviations from observations (see Figures 4a and 4c, respectively). We attribute this occurrence to the additional 395 source of uncertainty arising from the extrapolation procedure (i.e., the selection of the probability distribution and of the statistical inference method for the parameters, MLE in our case). The interval of confidence (gray area) widens as the return period increases and this is line with the recent findings of Meresa and Romanowicz [2017], which showed that errors in fitting theoretical distribution models to annual extreme streamflow series might contribute significantly to the overall uncertainty associated to projections of future hydrological extremes. 400

405
Forward simulations are labelled as NSE-ADIGE, RFDC-ADIGE and KS-ADIGE depending on the metric adopted during calibrations conducted using the observational dataset ADIGE as input forcing. Extrapolation from observed streamflow data is also shown (continuous black line) together with the associated 90% confidence interval (grey shaded area).

Model parameters
The results presented in the previous Sections highlight how the largest level of statistical coherence between observations 410 and simulations (performed with CMs simulations as input) can be achieved only by optimizing the desired statistics of extremes (i.e., see the curves labeled KS in Figures 3 and 4) in the calibration of the hydrological model. Starting from this evidence, we investigated what is the effect on model parameters of performing the calibration by using different input data (observational data as well CMs simulations). The list of the 12 parameters of the model with their units together with a short description and range of variation is presented in Table 3. Notice that all the 12 parameters of HYPERStreamHS are 415 calibrated. To simplify the comparison, we considered for the 3 CMs and for both emission scenarios only the calibrations performed by using the KS metric. In particular, for each calibration run we considered the 200 simulations (out of 40000 performed in total) presenting the highest efficiency. In addition, we considered the 100% confidence bands resulting from the retained solutions, and used the distance between the upper and lower limits of the confidence interval, ̅ , as a metric of uncertainty for the calibrated parameter [see Piccolroaz et al., 2015]. We remark that the procedure adopted here aims at 420 quantifying only the differences in the calibrated parameters and not to perform a full uncertainty analysis of predictions.  Figure 5 shows the range of variability of the parameters associated with the retained solutions together with the 425 corresponding optimal parameter sets. The values of the parameters are normalized with respect to their range (see Table 3).
The boundary of the parameters space has been set by means of preliminary simulations such as to minimize the probability of excluding from the search domain combinations of parameters leading to behavioral solutions [Beven and Binley, 1992].
Furthermore, we verified a-posteriori that the optimal parameters are inside the range of variation. In all cases, confidence bands are generally well distributed between 0 and 1, indicating a proper choice of their range of variation, although for a 430 few parameters optimality was located close to the boundary of the search domain. As shown in Figure 5 the majority of the parameters have a confidence band that is similar (or slightly larger) to that obtained in the case of KS-ADIGE, thus supporting the conclusion that calibration using CMs simulations does not lead, for both RCP, to anomalous identification of model parameters.  Figures 5a and 5b). Notwithstanding, this analysis evidences the possibility to identify to what degree each parameter is sensitive to the adoption of a different input in the calibration procedure.
The differences observed in the optimal value of model parameters are due to structural errors in the GCMs and RCMs (i.e.,440 their different capabilities to simulate the present climate), which are a substantial source of uncertainty in the impact assessment modeling chain [Honti et al., 2014;Tian et al, 2016]. Along the concepts brought forward here, this source of uncertainty can be addressed effectively by calibration of the hydrological model to the quantities of interest (i.e. the observed streamflow statistics of extremes) using as input the forcing provided by a specific GCM-RCM combination. This approach can be seen as a "hydrologic-based bias-correction" and is rooted in the adoption of a "goal-oriented" calibration 445 framework [see e.g., Laiti et al., 2018] Table 3 for the description of parameters and their ranges.

Projected changes of streamflow quantiles
Here we evaluate projected changes of high flow extremes in the future period 2040-2070. Notice that, similar to the 455 reference period, the first two years of simulations (i.e., 2040 and 2041) have been used as a spin-up period. Figure 6 presents the annual maximum streamflow as a function of return period at Trento gauging station for the 3 considered CMs under both RCP4.5 and RCP8.5 emission scenarios. As customary, extrapolations are performed under the assumption that the simulated ECDFs are distributed according to a parametric Gumbel probability distribution and the comparison is presented only for the optimal solution obtained during calibration. For each CM we considered the following parameterizations referred to the reference period: direct calibration with KS and RFDC metrics, and NSE-ADIGE as representative of a standard calibration procedure.
Visual inspection of Figure 6 confirms that using the standard calibration (i.e., NSE-ADIGE) of the hydrological model leads to a significant underestimation of all quantiles with respect to using KS and RFDC for all the 3 considered CMs under both RCPs. This is in agreement with the results obtained for the reference period (see Figure 4), where simulations using NSE-465 ADIGE parameterization provided streamflow quantiles systematically lower than with the CMs. In addition, KS-based calibrations always provide larger streamflow quantiles with respect to the cases in which the RFDC metric is considered (considering the same RCP emission scenario). We remark how the adoption of the KS metric is preferable since it provided an almost perfect match with observed streamflow quantiles in the calibration period (see Figure 4).

Conclusions
We investigated different strategies for the calibration of hydrological models to be used in assessing projections of 490 hydrological high flows in light of climate change. In particular, we proposed a methodological framework in which the calibration of the hydrological model is carried out by optimizing the reproduction of the probability distribution of high flows extremes. The methodology, coined here as Hydrological Calibration on eXtremes (HyCoX), is such that the hydrological simulations conducted during a reference period, as driven by climate model outputs, are constrained to maximize the probability that the modeled and observed extreme streamflows belong to the same population. The proposed 495 framework is "goal-oriented" and aims at reducing the uncertainty associated to the estimation of extremes by directly calibrating the selected hydrological model to the quantities of interest (i.e. flow statistics instead of time series) using as input directly the meteorological data (precipitations and temperature in the case at hand) provided by the Climate Model (CM). This approach ensures statistical coherence between scenarios and observations during the reference period, and, likely, preserves it in the future climate change scenario runs performed with the aim of projecting changes in streamflow 500 extremes.
The proposed procedure is exemplified through application of a few climate models and observational data to the analysis of annual maximum streamflow of the Adige river at the Ponte San Lorenzo gauging station in Trento (Italy). Though the nature of the present work is inherently methodological, it is worth mentioning that the application of the framework over a statistically-sized number of watersheds is currently ongoing in order to demonstrate method's generality. The hydrological 505 model employed is HYPERstreamHS, a continuous simulation distributed model. Three performance metrics were adopted, including the proposed one, for which the Kolmogorov-Smirnov two-sample statistical test was employed (KS for brevity).
We remark that the HyCoX methodology is not metric dependent, and any type of metric assessing the statistical coherence between observed and simulated streamflow extremes could have been employed without any loss of generality.
The results highlight that adopting KS is preferable to other popular metrics (e.g. Nash-Sutcliffe or fit to Flow Duration 510 Curve) when dealing with high flow extremes. This result validates our hypothesis that targeting directly the statistics of extreme values under consideration during the calibration exercise leads to coherent and consistent hydrological models for addressing the impact of climate change. We remark that such approach may lead to a suboptimal performance if the target is different from the one employed in this study, limitation that is outweighed by the improvements in representing high flow extremes in line with the goal-oriented framework pursued in this work. 515 In addition, our analysis revealed that using CMs simulations as input of the hydrological model, and adopting the parameterizations derived from calibration against historical time series, is an error prone procedure. Nowadays, it is generally acknowledged that the uncertainties arising at the different steps of the hydrological-climatological modeling chain can cause a significant divergence in the statistics of extremes. However, it appears crucial that the simulated effects on projected extremes in a climate change impact assessment can be safely attributed to the change in climate alone, and not to 520 uncertainties arising from the selection of the efficiency metric in the calibration process. Finally, investigation of optimal values highlighted that direct calibration using CMs outputs lead to unbiased identification of model parameters.
In climate change impact assessments on streamflow extremes the way the hydrological model is calibrated against observations assumes paramount importance. In the present work we showed that calibrating on daily streamflow observations by using observed meteorological data is an error prone procedure when the objective is to project the 525 probability distribution of streamflow extremes by using climate models. Streamflow quantiles are dramatically underestimated and fall outside the confidence interval of the quantiles of observed annual maxima when applied to the observation period. Extrapolations performed by using the proposed calibration procedure, with input provided by CMs, are instead more consistent and provide a good match with observed quantiles.
The goal-oriented approach envisaged in this work can be applied to a variety of hydrological scenario and modeling 530 approaches. While the approach is exemplified here for high flows, it can be applied to low flows as well (e.g. for drought assessment). In any case, it is advisable to calculate the uncertainty band for both the simulation model and the ECDF from observations.

Competing interests 540
The authors declare that they have no conflict of interest. model output within the EURO-CORDEX initiative (https://www.euro-cordex.net/index.php.en). Streamflow data were kindly provided by the Service for Hydraulic Works of the Autonomous Province of Trento (www.floods.it). 550