Why does a conceptual hydrological model fail to correctly predict discharge changes in response to climate change?

Several studies have shown that hydrological models do not perform well when applied to periods with climate conditions that differ from those during model calibration. This has important implications for the application of these models in 10 climate change impact studies. The causes of the low transferability to changed climate conditions have, however, only been investigated in a few studies. Here we revisit a study in Austria that demonstrated the inability of a conceptual semi-distributed HBV-type model to simulate the observed discharge response to increases in precipitation and air temperature. The aim of the paper is to shed light on the reasons of these model problems. We set up hypotheses for the possible causes of the mismatch between the observed and simulated changes in discharge and evaluate these using simulations with modifications of the model. 15 In the baseline model, trends of simulated and observed discharge over 1978−2013 differ, on average over all 156 catchments, by 9295 ± 50 mm yr per 35 yrs. Accounting for variations in vegetation dynamics, as derived from a satellite-based vegetation index, in the calculation of reference evaporation explains 3536 ± 9 mm yr per 35 yrs of the differences between the trends in simulated and observed discharge. Inhomogeneities in the precipitation data, caused by a variable number of stations, explain 3739 ± 26 mm yr per 35 yrs of this difference. Extending the calibration period from 5 to 25 yrs, including annually aggregated 20 discharge data or snow cover data in the objective function, or estimating evaporation with the Penman-Monteith instead of the Blaney-Criddle approach, has little influence on the simulated discharge trends (less than 5 mm yr per 35 yrs or less). The precipitation data problem highlights the importance of using precipitation data based on a stationary input station network when studying hydrologic changes. The model structure problem with respect to vegetation dynamics is likely relevant for a wide spectrum of regions in a transient climate and has important implications for climate change impact studies. 25

Abstract. Several studies have shown that hydrological models do not perform well when applied to periods with climate conditions that differ from those during model calibration. This has important implications for the application of these models in climate change impact studies. The causes of the low transferability to changed climate conditions have, however, only been investigated in a few studies. Here we revisit a study in Austria that demonstrated the inability of a conceptual semi-distributed HBV-type model to simulate the observed discharge response to increases in precipitation and air temperature. The aim of the paper is to shed light on the reasons for these model problems. We set up hypotheses for the possible causes of the mismatch between the observed and simulated changes in discharge and evaluate these using simulations with modifications of the model. In the baseline model, trends of simulated and observed discharge over 1978-2013 differ, on average over all 156 catchments, by 95 ± 50 mm yr −1 per 35 years. Accounting for variations in vegetation dynamics, as derived from a satellite-based vegetation index, in the calculation of reference evaporation explains 36 ± 9 mm yr −1 per 35 years of the differences between the trends in simulated and observed discharge. Inhomogeneities in the precipitation data, caused by a variable number of stations, explain 39 ± 26 mm yr −1 per 35 years of this difference. Extending the calibration period from 5 to 25 years, including annually aggregated discharge data or snow cover data in the objective function, or estimating evaporation with the Penman-Monteith instead of the Blaney-Criddle approach has little influence on the simulated discharge trends (5 mm yr −1 per 35 years or less). The precipitation data problem highlights the importance of using precip-itation data based on a stationary input station network when studying hydrologic changes. The model structure problem with respect to vegetation dynamics is likely relevant for a wide spectrum of regions in a transient climate and has important implications for climate change impact studies.

Introduction
A vast number of studies employ hydrological models to estimate climate change impacts on hydrology. In these studies, hydrological models are typically calibrated in the present climate and then run with climate input derived from climate models. However, hydrological predictions under changed climatic conditions are challenging as it is not clear whether the current generation of hydrologic models performs well under change (Blöschl and Montanari, 2010). By definition, testing models under future climate conditions is not possible as future observations are not available. However, climatic changes have already been observed in the last few decades. Hindcast simulations during periods with climatic variations in the past allow the suitability of hydrological models under changing climatic conditions to be tested. In the differential split sample test (DSST), suggested by Klemeš (1986), a hydrological model is evaluated in a period with climate conditions that differ from those during calibration. Though climatic contrasts between current and future conditions are likely larger than those in the observed record, and future conditions will involve higher air temperatures and higher atmospheric CO 2 concentrations, further increasing uncertainties (Stephens et al., 2020), passing the DSST can be seen as D. Duethmann et al.: Why does a conceptual hydrological model fail to correctly predict discharge changes? a minimum requirement for models applied in climate impact assessments.
Studies that investigated the performance of hydrological models in this way, by evaluating them in periods with climatic conditions that differed from those of the model calibration, largely found a decrease in model performance (Seibert, 2003;Vaze et al., 2010;Merz et al., 2011;Coron et al., 2012;Seiller et al., 2012). In a study on four catchments in Sweden, large flood peaks in the evaluation period were strongly underestimated by the HBV model if the calibration period only contained small flood peaks (Seibert, 2003). Vaze et al. (2010) analysed the model performance of four lumped hydrological models in 61 catchments in southeastern Australia when the model was calibrated to selected wet or dry periods of variable length. The reductions in model performance were greater with increasing differences in rainfall between calibration and evaluation periods. While most studies report reduced model performance in contrasting climates, Vormoor et al. (2018) did not find reduced model performance under contrasting conditions in terms of flood seasonality and flood-generating processes when applying a conceptual hydrological model in five catchments with changes in flood seasonality and flood-generating processes in Norway.
Low model performance in contrasting climates is often characterised by biased discharge values (Coron et al., 2014;Kling et al., 2015). This is a serious concern since changes in discharge volume are of high interest in climate change impact studies. Merz et al. (2011) calibrated and evaluated the HBV model in 5-year periods in 273 catchments in Austria. They found that median flows were overestimated by 15 % and high flows by 35 % when parameters calibrated during 1976-1981 were applied to [2001][2002][2003][2004][2005][2006]. Several studies found increased differences in the discharge bias between the calibration and evaluation period with increasing differences in precipitation (Coron et al., 2012;Sleziak et al., 2018).
The problem of poor model performance in contrasting climates has been observed for various model structures. While most studies that investigate the transferability of hydrological models focus on lumped conceptual models, low transferability in contrasting climate has also been observed for semi-distributed conceptual models Coron et al., 2014) and process-based models (Magand et al., 2015). The application of a DSST to three different lumped conceptual models in five catchments in Tunisia showed similar problems of model transferability under contrasting climate conditions for the three models (Dakhlaoui et al., 2017). Seiller et al. (2012) tested the transposability of 20 lumped conceptual hydrological models between periods with contrasting precipitation and air temperature for two catchments in Canada and Germany, and they were not able to identify a specific model structure that performed well in contrasting climate for all their test conditions.
Understanding the causes of poor performance in a transient climate is a key question since this determines the way forward for hydrological modelling in a transient climate. Possible causes include data problems, poor parameterisation of the model or structural inadequacy (Coron et al., 2014;Westra et al., 2014;Fowler et al., 2018). In the case of data problems, the model should be calibrated with corrected data; however, apart from this, simulations with projections of future climate should not be affected by this problem. In the case of parameterisation problems, efforts should be invested in choosing calibration methods that result in reliable parameterisations in a transient climate. If the problem is related to the model structure, it will be important to understand which parts of the model structure result in reduced performance in order to avoid these structural components in climate change impact analyses. An example of a data problem that may cause poor model performance under contrasting climate conditions is inhomogeneities in the precipitation data, which lead to biased estimates of the precipitation changes. Such inhomogeneities may be caused by inhomogeneities in the station data, a variable number of stations included in a gridded data set (Fawcett et al., 2010), or climate variations that lead to changes in the undercatch error (Forland and Hanssen-Bauer, 2000). A poor parameterisation may be caused by a too short a calibration period. However, in several studies that observed poor performance in contrasting climate the problem could not be solved by using a longer calibration period (Luo et al., 2012;Brigode et al., 2013;Coron et al., 2014). Too low a sensitivity of the objective function to the long-term dynamics of discharge may be another cause for a poor parameterisation that results in poor performance in a transient climate. Hartmann and Bárdossy (2005) observed increased transferability of a distributed conceptual hydrological model under contrasting climate conditions when including annually aggregated discharge data in the objective function in addition to daily discharge data. A thorough approach to test whether the problem may be solved by improving the parameterisation is by applying a multi-objective calibration to the different periods with contrasting climate (Fowler et al., 2018). Model structural inadequacy in the context of a transient climate includes changes in catchment characteristics or dominant hydrological processes that are not reflected by the model. For example, changes in the glacier volume or a longer vegetation period may alter the hydrologic response of the catchment and result in deviations between simulated and observed discharge if not accounted for in the model. Despite their relevance for hydrological modelling in a transient climate, the causes of poor performance under contrasting climate conditions have only been investigated in a few studies (Westra et al., 2014;Fowler et al., 2016Fowler et al., , 2018. This study aims at contributing to closing this gap by analysing the causes of the poor performance of a hydrological model in a transient climate for a case study on a large number of catchments in Austria. Due to a strong climate signal over the last few decades (Schöner et al., 2011), Austria is well suited to studying climate-induced hydrologic changes. We applied a semi-distributed hydrological model based on the HBV concept, which is widely used for operational and scientific purposes, including climate impact assessments. However, in the study by Merz et al. (2011;hereafter Merz2011), the model was not able to correctly estimate changes in the mean discharge in response to the observed increases in precipitation and air temperature. Applying the model calibrated during 1976-1981 with climate data from 2001 to 2006 resulted in an increase in simulated discharge of, on average, 15 %, whereas observations show relatively stable annual discharge volumes. Here, we revisit the study by Merz2011 and investigate what causes the differences between simulated and observed changes in discharge. For that purpose, we set up hypotheses that are tested using modifications of the model. In particular, we analyse the effect of varying the input data for precipitation and air temperature, increasing the length of the calibration period, including annually aggregated discharge data or snow cover data in the objective function, and varying the calculation of reference evaporation (E ref ) to consider changes in global radiation and vapour pressure as well as changes in vegetation dynamics.
2 Data and methods

Study area
This study was carried out using data from 156 catchments in Austria (Fig. 1). The catchments were selected based on the availability of daily discharge data for 1977-2014 (hydrological years from November to October; a maximum of two years missing). We generally excluded catchments with substantial anthropogenic influences from dams or water withdrawals (Viglione et al., 2013), glaciers, and catchments where discharge exceeded the precipitation estimate. The more rigorous selection resulted in a smaller set of catchments compared to Merz2011, who used a set of 273 catchments. The median (interquartile range) of the catchment sizes is 192 (95/366) km 2 . The data set includes lowland and mountain catchments and the median elevation range is 520 (372/665)-1593 (984/2126) m (the numbers in brackets refer to the interquartile range). The most frequent land cover is forest, which covers on average 52(40/67) % of the catchment area (based on Corine 2000 data; European Environment Agency, 2016), followed by grassland, which covers 23(14/33) % of the catchment area. In most catchments the fraction of arable land and heterogeneous agricultural areas is small, with a median of 5(0/29) % of the catchment area. The study region shows strong climatic changes over recent decades. On average over the study catchments, annual precipitation increased by 32 ± 23 mm yr −1 or 2.4 ± 1.7 % per decade, air temperature increased by 0.45 ± 0.09 • C per decade, and global radiation increased by 5.1 ± 0.9 W m −2 per decade over the period 1977-2014. In contrast, discharge did not show strong trends, and the average trend over the study period was 0.2 ± 3.1 % per decade (Duethmann and Blöschl, 2018).

Hydrometeorological data
Discharge data were provided by the Central Hydrographical Bureau (HZB) in Vienna. Climate data required by the hydrological model are air temperature, precipitation and, depending on the model variant, relative humidity, global radiation and wind speeds. Furthermore, interpolated snow depth data were used for model calibration in one model variant. The baseline precipitation data set (P 0) was derived by spatially interpolating the daily precipitation values of the available stations from the HZB and the Austrian Central Institute for Meteorology and Geodynamics (ZAMG), using external drift kriging (EDK) with elevation as the auxiliary variable to a 1 km 2 grid as in Merz2011. Due to variations in the station network, the number of stations included in the interpolation varies over time. In addition, two alternative precipitation data sets were used. For the first alternative (P 1), we used the gridded SPARTACUS data set (Hiebl and Frei, 2018). It has a temporal and spatial resolution of 24 h and 1 km and is based on a two-step interpolation scheme. In the first step, a monthly background climatology for 1977-2006 was obtained based on 1249 stations (including 119 totaliser precipitation gauges); in the second step, a constant number of 523 stations was used for interpolating ratios between the daily precipitation and the background climatology. For the second alternative precipitation data set (P 2), we added a correction for the systematic underestimation from the gauge undercatch to the SPARTACUS data set using the following equation (Richter, 1995): where P corr is undercatch-corrected precipitation, P orig is uncorrected precipitation, and b and e are coefficients that depend on season, precipitation type, and wind exposure. We estimated the precipitation type to be snow for mean air temperatures below −1 • C, mixed precipitation between −1 and 3 • C, and rain for mean air temperatures above 3 • C (ATV-DVWK, 2002). The coefficients of Richter (1995) for very sheltered locations were applied to all grid points. On average over all catchments, the undercatch correction increased precipitation by 7.2 % compared to the original data without undercatch correction. The baseline data set for mean daily air temperature (T 0) was derived by spatially interpolating the mean daily air temperatures of the available stations from the ZAMG, using local ordinary least-squares regression with elevation as in Merz2011. In addition, we used the gridded SPARTACUS data set (Hiebl and Frei, 2016), which is based on a constant station network of 150 stations, as alternative input (T 1). Air temperature and precipitation were aggregated to averages by elevation zone for each catchment, as used by the hydrological model. For model variants that applied the Penman-Monteith approach for estimating E ref , relative humidity, global radiation and wind speeds were needed as further input data. Measured global radiation was used rather than global radiation derived from sunshine duration since, for this study, our interest is in the changes over time, and due to, e.g., changes in the atmospheric aerosol concentrations over time (Norris and Wild, 2007), trends in sunshine duration may differ from those in global radiation. Measurements of relative humidity at 07:00 and 14:00 LT and global radiation were obtained from the ZAMG. Stations with more than 5 % (15 % for global radiation) of missing data during 1976-2014 were excluded, which resulted in 125 and 6 stations for relative humidity and global radiation, respectively. Data gaps were filled using linear regression to the station with the highest correlation. The data were interpolated onto a 1 km 2 grid using local ordinary least-squares regression with elevation. The local neighbourhood was set to a default radius of 100 km for relative humidity and 200 km for global radiation, and adjusted to include at least 10 (global radiation 4) and at most 40 stations. Due to a strong influence of inhomogeneities, long-term changes in wind speed from measured wind speed data are highly uncertain (Böhm, 2008). This is also reflected in the fact that annual anomalies of wind speed data from 85 stations in Austria are hardly related to each other (Duethmann and Blöschl, 2018; see Sect. S1 in the Supplement). Uniform monthly wind speeds averaged over all years from all stations in Austria were therefore applied in this study.
For an additional calibration to snow data, snow depth data from the HZB were interpolated by external drift kriging with elevation and aggregated to averages by elevation zone for each catchment (Parajka et al., 2007).

Model description
In this study, we applied the same hydrological model as Merz2011, which is a semi-distributed conceptual model that follows the structure of the Hydrologiska Byråns Vattenbalansavdelning (HBV; Bergström, 1995). The model equations can be found in Parajka et al. (2007). The model parameters are listed in Table 1. The model operates on a daily time step, and the spatial discretisation is based on 200 m elevation bands. Precipitation is partitioned into snow, rain or mixed precipitation based on air temperature, using a lower and an upper threshold temperature T s and T r . A snow-correction factor (SCF) corrects the stronger undercatch of the precipitation gauges during snowfall. Snowmelt is calculated using a temperature-index approach based on the degree-day factor (DDF) and the melt temperature (T m ). Actual evaporation (E sim ) is estimated as a function of E ref and soil moisture. It equals E ref if soil moisture is above a calibrated threshold (LP). Below this threshold, it linearly decreases to zero at a soil moisture level of zero. The fraction of the sum of rain and snowmelt that results in discharge is calculated as a nonlinear function of soil moisture. This involves the parameters (FC), the maximum soil moisture storage and the nonlinearity parameter B, where a larger B is associated with a smaller fraction of direct run-off and vice versa. The run-off module consists of a hillslope component and a river-routing component. The hillslope component is represented by two linear stores that are connected through a constant percolation rate C p . Fast run-off is generated if the state of the upper zone store is above a threshold (LSUZ), using a fast storage coefficient K 0 . Medium and slow run-off components are calculated as outflow from the upper and lower zone store, using the storage coefficients K 1 and K 2 . In the river-routing component, run-off routing in streams is simulated using a triangular transfer function involving the parameters C R and B max .

Estimation of reference evaporation
Despite being technically external to the applied HBV model, the estimation of E ref is considered part of the hydrological model rather than part of the input data since it is calculated and not available as measured data. E ref is computed on a 1 km 2 grid and aggregated to elevation zones for each catchment, as used in the hydrological model. For the baseline model, E ref was derived based on a modified Blaney-Criddle method (DVWK, 1996), following Merz2011, and denoted as E0 as follows: where T is the mean daily air temperature at 2 m height ( • C), S 0 is the potential daily sunshine duration (h), and S year is the mean yearly sum of the potential sunshine duration (h). In order to consider interannual variations in global radiation and vapour pressure deficit in addition to air temperature, we calculated E ref using the Penman-Monteith equation for well-watered short grass vegetation (Allen et al., 1998), and denoted as E1 as follows: where R n is the net radiation at the crop surface (MJ m −2 d −1 ), G is the soil heat flux density (MJ m −2 d −1 ), r a is the aerodynamic resistance (s m −1 ), r s is the surface resistance (s m −1 ), e s is the saturation vapour pressure (kPa), e a is the actual vapour pressure (kPa), is the slope of the vapour pressure curve (kPa • C −1 ), and γ is the psychrometric constant (kPa • C −1 ). According to the reference conditions of a vegetated surface with a height of 0.12 m, r s = 70 s m −1 and r a = 208/u 2 , where u 2 is the wind speed at 2 m height (m s −1 ), which was derived from the wind speed at 10 m height based on a logarithmic wind speed profile (Allen et al., 1998). The ground heat flux was neglected. The vapour pressure deficit e s − e a was calculated as the average of the vapour pressure deficit at the minimum air temperature (using relative humidity at 07:00 LT) and at the maximum air temperature (using relative humidity at 14:00 LT). R n was estimated from global radiation (R s ; MJ m −2 d −1 ), albedo (α; set to 0.23), and net long-wave radiation (R nl ; MJ m −2 d −1 ) as follows: where R nl was estimated according to Allen et al. (1998) based on minimum and maximum air temperature, clear-sky solar radiation, measured R s , and the mean daily vapour pressure.
In order to consider, additionally, changes in the vegetation dynamics, we calculated E ref using a variable surface resistance based on changes in a satellite-based vegetation index (E2). We used observed 15 d maximum value composite data of the Normalized Difference Vegetation Index (NDVI) at a resolution of 8 km from the Advanced Very High Resolution Radiometer (AVHRR) from Tucker et al. (2005). For each point in time of this biweekly series, we aggregated the NDVI data to 200 m elevation zones based on the NDVI data for a rectangle around Austria. As the NDVI data are only available starting from July 1981, we applied the data of July 1981-June 1982for 1976-1981 where the NDVI data were not available. We used the parameterisation from Sellers et al. (1996) to estimate a variable r s from the NDVI data. This involved estimating the fraction of photosynthetically active radiation (FPAR) from transformed NDVI data (Eq. 5; Sellers et al., 1996), estimating the leaf area index (LAI) from the FPAR data (Eq. 6;Sellers et al., 1996), and estimating r s from the LAI data (Eq. 7; Allen et al., 1998).
where S is a transformed NDVI value (1 + NDVI)/(1 − NDVI) and S min and S max are the 5 % and 98 % quantiles of S for a given land cover class.
where r l is the leaf surface resistance. We applied a value of r l = 100 s m −1 for well-watered grass (Allen et al., 1998).
Since the satellite-based LAI values derived in this way are often lower than the value of 2.88, which is assumed in the Penman-Monteith equation for well-watered short grass by Allen et al. (1998), E2 generally resulted in lower annual E ref than E0 or E1. In order to avoid water balance problems in the hydrological model, E2 was multiplied with the annual average ratio of E2 to E0 that was averaged over all catchments with a value of 1.2. Such an adjustment of E ref may be justified based on the fact that our study catchments are dominated by forests, and the maximum possible evaporation under well-watered conditions (E max ) of forests is typically higher than E ref that assumes short grass. For example, analyses from non-weighable lysimeters suggest E max to be 20 %-30 % higher for sites with pine forests at typical stand ages of 80-100 years compared to sites with grass (ATV-DVWK, 2002). Table 1. A priori distribution of parameter values, where p l and p u are the lower and upper bounds, α and β are the parameters of the a priori distribution, and p max is the parameter value at which the a priori distribution is at its maximum. Note that the parameters T R , T S , C r and B max were set as constant and are therefore not listed here.

Model calibration
The objective function applied for the model calibration consisted of three parts. An average of the Nash-Sutcliffe efficiency of linear and logarithmic discharge values (f Q ) was applied in order to achieve a balanced model performance for high and low flows. In order to keep the volume bias low, the absolute value of the relative volume bias (f bias ) was added as a penalty. Furthermore, a penalty for model parameters that deviate from an a priori distribution (f beta ) was added.
The penalty function f beta is based on a beta distribution for each parameter, as described in Merz2011. The a priori distributions for the model parameters were applied since, on the basis of the literature and previous applications of the model, we believe to have more information on the likely parameter values than just the parameter range. Including this criterion in the objective function has very little influence on the difference between simulated and observed discharge trends (Sect. S1). The objectives were combined in the following way: by setting the weights to w 1 = 0.8, w 2 = 1, and w 3 = 0.2. In order to test whether including annually aggregated discharge data in the objective function improves the model performance under transient climate conditions, we additionally applied a modified objective function as follows: where f annual is the Nash-Sutcliffe efficiency calculated for discharge data aggregated to hydrological years. The weights were set to w 1 = 0.4, w 2 = 1, w 3 = 0.1, and w 4 = 0.5. In a further model variant, we tested whether snow data improve the model performance under transient climate conditions. The snow-related part of the objective function aimed at minimising the number of days with poor snow cover simulations and was defined following Parajka et al. (2007). Observed snow cover was derived from maps of interpolated snow depth. An elevation zone was considered as snow covered if the average interpolated snow depth was greater than 0.5 mm and snow free otherwise. In the model, an elevation zone was considered snow covered if the simulated snow water equivalent was greater than 0.1 mm and snow free otherwise. If the difference between simulated and observed snow cover on a particular day was greater than 50 % of the catchment area, it was considered to be a day with poor snow cover simulations. The snow-related part of the objective function f snow was defined as the ratio of the number of days with poor snow cover simulation and the number of days with observed snow cover. The overall objective function was then defined as follows: The weights were set to w 1 = 0.7, w 2 = 1, w 3 = 0.1, and w 4 = 0.2, following Parajka et al. (2007).
The objective function was minimised automatically with the shuffled complex evolution algorithm (SCE-UA; Duan et al., 1992), a global optimisation method based on the simplex downhill search scheme (Nelder and Mead, 1965). The calibration included 11 parameters. The upper and lower bounds and two further parameters of the beta distribution for each parameter were selected following Merz2011 (Table 1). Four parameters that showed little sensitivity were preset to the following values: T R = 2 • C, T S = 0 • C, C r = 25 d 2 mm −1 , and B max = 10. As the focus of this study was on calibrating the model many times for different calibration periods, catchments and model variants, characterising parameter uncertainties was beyond the scope of this study. For the baseline model, we used seven consecutive 5-year calibration periods without temporal overlap (based on hydrological years), during 1978-2012. Each simulation was started with an additional 22-month warm-up period. As a modification, we also tested using a 25-year period as calibration period  Model performance was evaluated using the relative bias in discharge volume and the Nash-Sutcliffe efficiency (NSE). The relative bias in discharge volume was calculated as follows: where Q sim,t and Q obs,t are, respectively, the simulated and observed discharge on day t, and n is the number of time steps.
In order to focus on the change in discharge under transient climate conditions, we used the difference between simulated and observed discharge trends as an additional criterion. Good performance in the calibration period but inability to estimate the changes in the observed discharge resulting from the climatic changes indicates problems under transient climate conditions. Trends were evaluated over the entire study period . Trend significance was assessed by the nonparametric Mann-Kendall test (Mann, 1945;Kendall, 1975), and lag-1 serial correlation was removed by applying the trend-free prewhitening technique (Yue et al., 2002). Trend slopes were estimated by the Sen's slope estimator (Sen, 1968). Uncertainties of the trend slope were estimated using a bootstrapping approach. For this purpose, 1000 samples of size N were drawn, with replacement, from the record of length N years, and the Sen's slope was calculated for each of the 1000 samples. Then, the standard deviation was determined. The trends and the standard deviations were first derived for each catchment and then averaged over the catchments to determine average trends and their uncertainties over a number of catchments.

Hypotheses for the causes of the expected mismatch between observed and simulated discharge changes
We compiled possible explanations for the expected divergence between the observed and simulated changes in discharge based on the frameworks suggested by Westra et al. (2014) and Fowler et al. (2018) and the discussion in Coron et al. (2014). The working hypotheses are grouped into (1) data problems, (2) problems related to the model calibration, and (3) problems of the model structure (see Table 2). In a first analysis, the hypotheses were evaluated based on process understanding and literature. During this process, a number of the working hypotheses were rejected or assessed as being unlikely to cause the differences between the observed and simulated discharge changes. Other hypotheses were evaluated using simulations with modifications of the model (Table 3).

Data problems
Discharge data can be misleading if they are influenced by abstractions or streamflow diversions. For example, a general increase in water abstractions would reduce a positive streamflow trend. However, our study includes only catchments that were classified as devoid of substantial anthropogenic influences (Viglione et al., 2013) and any existing streamflow diversions were introduced before the beginning of our study period (BMLFUW, 2015). Changes in water abstractions due to irrigation are not believed to be a major cause of the deviations between simulated and observed discharges as only about 3 % of the arable land in Austria is irrigated (FAO, 2016), the fraction of arable land is small in most of the study catchments (median 5 %, see Sect. 2.1), and the study catchments have only a little overlap with those regions where irrigation is most relevant. These are small areas east, southeast and northwest of Vienna, where estimated average irrigation amounts of the agricultural areas exceed 10 mm yr −1 (BMLFUW, 2011). Erroneous trends in the discharge data could be caused by systematic trending errors of the rating curve. However, it seems unlikely that the discharge data of a large number of catchments are afflicted by systematic trends in the same direction. Problems in the discharge data were thus assumed unlikely to be a relevant cause of the differences between simulated and observed discharge trends.
Inhomogeneities of the precipitation data would result in biased estimates of the precipitation trends. A problem that would affect a large number of catchments is a varying number of precipitation stations included for generating the gridded precipitation data set. The precipitation data set used by Merz2011 was based on all available stations and included ∼ 800 stations at the end of the 1970s and ∼ 1050 stations around the year 2000 ( Fig. S2 in the Supplement). The effect of the changes in the number of stations on the trends in the water balance components was analysed by simulations with a precipitation data set based on all available stations (P 0) and simulations with a precipitation data set based on a constant number of stations (P 1). Changes in the gauge undercatch error due to changes in climate would also affect a large number of catchments. An increase in precipitation intensity and a decrease in the snowto-rain ratio are expected to result in a higher catch ratio, meaning that the precipitation increase is lower than perceived by the observed data. The effect of neglecting the systematic precipitation error was estimated by simulations with a precipitation data set that is based on a constant number of stations that was corrected for the (1.2) Problems in the precipitation data Inhomogeneities in the precipitation data due to instrument changes Introduction of heated precipitation gauges would result in larger precipitation increases and thus increase the gap between changes in E wb and changes in E sim . Since, at most locations with a heated gauge, there is a manually operated gauge in addition and values of the latter are used to report daily precipitation sums, this effect is likely not relevant.
Inhomogeneities in the gridded precipitation data due to changes in the number of stations Simulations with a precipitation data set that uses a constant number of stations (model variant V 1).
Biased estimates of the precipitation trend due to changes in the catch ratio caused by changes in the snow-to-rain ratio and changes in precipitation intensities (in addition to inhomogeneities due to a variable number of stations) Simulations with a precipitation data set with a constant number of stations and correction for the systematic precipitation undercatch (considering the precipitation type and precipitation intensity; based on daily precipitation amount; model variant V 2).
(1.3) Problems in the air temperature data Inhomogeneities in the gridded air temperature data due to changes in the number of stations Simulations with a data set that uses a constant number of stations (model variant V 3).
(2) Problems related to the model calibration Section 3.3 Too short a calibration period Simulations with a 25-year calibration period (model variant V 4).
Objective function insensitive to long-term discharge variations Simulations with a modified objective function that includes annually aggregated discharge data (model variant V 5).
Internal inconsistencies due to calibration only to discharge Simulations with a modified objective function that includes a comparison against snow data (model variant V 6).
(3) Problems of the model structure Section 3.4 Effects of changes in radiation and saturation deficit not reflected by the model systematic gauge undercatch considering the influence of the precipitation type and daily precipitation intensity on the catch ratio (precipitation data set P 2).
Similar to the precipitation data set, the air temperature data set in the baseline model was based on a variable station network, though the number of air temperature stations varies much less than the number of precipita-tion stations (Fig. S2). We investigated the effect of the changes in the number of air temperature stations by simulations with air temperature data sets based on all available stations (T 0) or a constant number of stations (T 1).

Problems related to the model calibration
Vary air temperature data P 0 T 1 5 years f 1 E0 V 4 Increase length of calibration period P 0 T 0 25 years f 1 E0 V 5 Include annually aggregated Q into obj. function P 0 T 0 5 years f 2 E0 V 6 Include snow into obj. function Combine V 2 and V 8 P 2 T 0 5 years f 1 E2 Problems in the model calibration relate to the problem that, in principle, parameter sets exist that enable good performance in the calibration and evaluation period, but these parameter sets are not the ones identified during model calibration. Possible causes are, for example, a too short a calibration period that results in overfitting or processes that are relevant in the evaluation period but not activated in the calibration period. We therefore tested whether increasing the model calibration period from 5 years to 25 years reduces the bias between simulated and observed discharge trends. We furthermore investigated whether including annually aggregated discharge data into the objective function improves the model performance under contrasting climate conditions, as found in a study by Hartmann and Bárdossy (2005). Since snow-related processes are important in the mountainous part of the study area, we investigated further whether including data on interpolated snow depth into the objective function has an effect on the model performance under transient climate conditions. A recent study has shown that including snow data into the objective function can improve the temporal stability of snow-related parameters (Sleziak et al., 2020).

Problems of the model structure
In case the problem cannot be solved by rectifying problems in the data and model calibration, problems in the model structure are likely. These include inadequate process representations and changes in the catchment that are not represented by the model. Further changes may result from changes in the vegetation dynamics and the land cover, such as a lengthening of the growing season or increases in forest at the expense of cropland and extensive grassland, as observed in many parts of Austria (Krausmann et al., 2003;Gingrich et al., 2015). To test the possible effect of changes in vegetation dynamics on changes in the simulated trends of streamflow and evaporation, we performed additional simulations where we calculated a modified E ref considering changes in surface resistance based on a satellite-based vegetation index (E2). Land cover changes from agricultural land to forest may also contribute to changes in the satellite-based vegetation index. It is therefore assumed that the simulations with E ref considering changes in vegetation dynamics include also, to some extent, the effect of changes in land cover.

Deviations between simulated and observed changes in discharge and evaporation of the baseline model
There is a clear gap between simulated and observed trends in discharge when the model calibrated in the first subperiod is applied to the entire period. On average over all catchments, the difference is 95 ± 50 mm yr −1 per 35 years over 1978-2013 or 12.8 ± 6.7 % in relation to observed flow ( Table 4). This is illustrated in Fig. 2a  and significant (p ≤ 0.05) increases and decreases in 10 % and 7 % of the catchments. In contrast, simulated discharge increased on average by 118 ± 82 mm yr −1 per 35 years, with significant increases and decreases in 38 % and 1 % of the catchments. Discharge trends were overestimated by the model in many catchments all over Austria (Fig. 2c). Large differences between simulated and observed trends particularly occur in central Austria, southern Carinthia and western Tyrol. The deviations in simulated and observed changes in discharge correspond to deviations in simulated and observed changes in evaporation. The dark blue line in Fig. 2b shows the difference between precipitation and run-off, which may be interpreted as water-balance-based evaporation plus storage changes (E wb ). The fact that E wb includes storage changes, and E sim does not, is relevant for short timescales but less so for long-term trends as the fluctuations tend to average out over time. For example, the large interannual variations of E wb compared to E sim may be explained by storage changes. Large interannual variations are also observed for the difference between precipitation and simulated runoff, which is conceptually equivalent to E wb (Fig. S3). When comparing the long-term variations in E wb and E sim , both E wb and E sim show increases, but E sim increased at a much lower rate than E wb . Furthermore, the trend of E wb is reversed for the last two subperiods, whereas E sim increased over the entire simulation period. While the average trend of E wb over 1978-2013 is 139 ± 59 mm yr −1 per 35 years, with significant increases in 76 % of the catchments, the average trend of E sim is 52 ± 13 mm yr −1 per 35 years, with significant increases in 94 % of the catchments.
In order to investigate whether the overestimation of the simulated discharge trend is related to a decrease in simulated storage that is not represented by observed storage we examined simulated changes in storage. For this, we analysed the sum of all simulated storages, i.e. soil moisture storage, upper and lower zone subsurface storage and snow water equivalent, and calculated trends of annual averages (based on hydrological years). Trends in simulated storage were, on average over all catchments, 9 ± 20 mm over 1978-2013. This shows that the overestimation of the discharge trend is not generated by an opposite trend in simulate storage. Small changes in simulated storage are in agreement with no consistent large-scale groundwater changes in the observations (Blaschke et al., 2011;Neunteufel et al., 2017).
While discharge volume biases during calibration were small, with average values over all catchments of 0.005-0.03 for the different subperiods, discharge biases during evaluation were much higher, with average values of −0.13-0.18 over the study catchments (Fig. 3a). The curves of the average bias during evaluation over the different subperiods for models calibrated in different subperiods show an interesting pattern. Average bias values during evaluation increase from subperiod S1 to S6 by 0.15-0.18 and decrease again for the last period. The curves run almost parallel and differ by a vertical offset that ensures low bias during the calibration period. The changes in the average bias were not caused by few catchments with very large changes, as shown by changes in the distribution of bias across all catchments (Fig. S4). NSE values during model calibration varied in the range of 0.70-0.75 on average over the catchments, showing that the model performed well in each subperiod when calibrated to it. As expected, model performance during evaluation was lower, with average values over the study catchments of 0.56-0.71 (Fig. 3b). In many cases, model performance decreases with increasing distance between the calibration and the evaluation period, particularly for model evaluations in subperiods S1 and S2.
The performance of the baseline model agrees well with the study by Merz2011, who found average NSE during model calibration of 0.74-0.77 and average NSE during model evaluation of 0.64-0.69, when evaluating over all subperiods except the one used for calibration (compared to 0.70-0.75 during calibration and 0.63-0.66 during evaluation in our study). Discharge biases during calibration were slightly smaller in the present study, due to including a penalty for discharge bias in the objective function. The longer study period used in our study revealed that the trend of an increasing difference between simulated and observed discharge, when applying the model calibrated in subperiod S1 to the entire study period, was not continued during the last subperiod.

Precipitation
Driving the hydrological model with a precipitation data set based on a variable number of precipitation stations may influence the estimated trend of precipitation and thus the trend of simulated discharge. In order to quantify this effect, we performed model simulations with a precipitation data set based on a constant number of stations (P 1) in comparison with the baseline precipitation data set P 0 that uses a variable number of stations. This reduced the gap between simulated and observed discharge from 95±50 to 55±47 mm yr −1 per 35 years (Table 4); i.e., a reduction of 39 ± 26 mm yr −1 per 35 years ( Table 5). The reduced gap between simulated and observed discharge is consistent with the difference in the trends in the precipitation data sets. The baseline precipitation data set P 0 suggests a precipitation increase of, on average, 159 ± 89 mm yr −1 per 35 years, whereas the precipitation data set P 1 results in an increase of 121±89 mm yr −1 per 35 years (Fig. 4a). Better model performance with respect to changes in streamflow volume is also reflected by smaller increases in bias during evaluation in the different subperiods (Fig. 5a).
Changes in the snow-to-rain ratio and in the precipitation intensity may affect the undercatch error and thus the precipitation trend. Figure 4c-e shows that, over the study period,  (1978)(1979)(1980)(1981)(1982). Uncertainties relate to standard deviations of the trend slope averaged over all catchments. For trends in Q sim -Q obs , we first derived series of the differences Q sim -Q obs for each catchment and then estimated trends.   the snow-to-rain ratio decreased and the daily precipitation intensity increased, whereas the number of precipitation days remained relatively stable. In the precipitation data sets P 0 and P 1, the precipitation undercatch error is neglected. In order to estimate the magnitude of the effect of changes in air temperature and precipitation intensity on changes of the undercatch error, we performed simulations with a precipitation data set that was corrected for undercatch accounting for daily precipitation intensity and precipitation type, which was estimated based on air temperature (precipitation data set P 2). Precipitation data set P 2 generally exhibits higher precipitation and, with an average trend of 120±93 mm yr −1 per 35 years, a similar absolute and a lower relative precipitation increase over time when compared to the precipitation data set P 1 (Fig. 4a). Simulations with precipitation data set P 2 resulted in a gap between simulated and observed discharge trends of 48 ± 47 mm yr −1 per 35 years (Table 4); i.e., a reduction of 47 ± 28 mm yr −1 per 35 years compared to the baseline model V 0 that uses precipitation data set P 0 (Table 5). Comparing model variants V 2 to V 0, strong reductions of the differences between simulated and observed dis-   Figure 3a shows this for the baseline model V 0. Each line refers to models calibrated in one subperiod, where the filled circle marks the calibration period, and shows bias during the calibration period and during evaluation in the other six subperiods. For a description of the model variants, see Table 3 and Sect. 2.4.2. Table 5. Working hypotheses for potential causes of the divergence between observed and simulated discharge changes that were further analysed and the estimated magnitude of the effect on the gap between trends in Q obs and Q sim (mm yr −1 per 35 years) over 1978-2013 as compared to the baseline model. This was calculated by deriving a series of the differences in annual discharge of the respective model variant compared to the baseline model (e.g. Q sim,V 1 -Q sim,V 0 ) for each catchment and then estimating trends. Uncertainties relate to standard deviations of the trend slope averaged over all catchments.

Working hypothesis Model Result
Magnitude of the effect variant (mm yr −1 per 35 years) (1) Data problems Section 3.2 (1.2) Problems in the precipitation data Inhomogeneities in the gridded precipitation data due to changes in the number of stations V 1 Reduces the gap between changes in Q obs and Q sim ↓ −39 ± 26 Biased estimates of the precipitation trend due to changes in the catch ratio caused by changes in the snow-to-rain ratio and changes in precipitation intensities (in addition to inhomogeneities due to a variable number of stations) V 2 Reduces the gap between changes in Q obs and Q sim ↓ −47 ± 28 (1.3) Problems in the air temperature data Inhomogeneities in the gridded air temperature data due to changes in the number of stations V 3 Little effect on simulated discharge trends −1 ± 5 (2) Problems related to the model calibration Section 3.3 Too short a calibration period V 4 Little effect on simulated discharge trends −5 ± 9 Objective function insensitive to long-term discharge variations V 5 Little effect on simulated discharge trends −3 ± 13 Internal inconsistencies due to calibration only to discharge V 6 Little effect on simulated discharge trends 0 ± 4 (3) Problems of the model structure Section 3.4 Effects of changes in radiation and saturation deficit not reflected by the model V 7 Little effect on simulated discharge trends −2 ± 7 Effects of changes in the vegetation dynamics and land cover not reflected by the model V 8 Reduces the gap between changes in Q obs and Q sim.
↓ −36 ± 9 charge trends particularly occurred in catchments where the differences between simulated and observed discharge trends were large (Figs. S6d and 2c). The tendency to further reduce the gap compared to simulations with the precipitation data set P 1 of 8 ± 9 mm yr −1 per 35 years was not significant.

Air temperature
In order to investigate the possible effect of changes in the station network for air temperature data, we performed simulations with gridded air temperature data based on stations with a complete record over the study period (T 1), compared to simulations with a gridded data set based on all available air temperature series (T 0). This showed virtually no differences in discharge trends between the two variants ( Table 4). The small effect of varying the air temperature data set can be explained by the fact that changes in the station network were only small (Fig. S2), and the two data sets result in very similar changes over time (Fig. 4b).
3.3 Problems related to the model calibration

Varying the length of the calibration period
In order to evaluate whether the calibration period was too short, we increased the calibration period from 5 years (1978)(1979)(1980)(1981)(1982) to 25 years (1978-2002; model variant V 4). This resulted in an average discharge trend of 113 ± 82 mm yr −1 per 35 years over 1978-2013 (Table 4), and thus there was virtually no effect compared to the baseline model.

Varying the objective function
Changing the objective function by including annually aggregated discharge data (model variant V 5) led to an aver-  (Table 4), and thus there was no improvement in the simulation of the long-term discharge trends either. Including a snow-related criterion into the objective function (model variant V 6) improved the model performance with respect to snow without deteriorating the model performance for discharge (Table S1). The performance of the model compared to observed snow cover derived from interpolated snow depth was comparable to Parajka et al. (2007), when considering the same set of catchments. Model performance with respect to long-term trends was not improved, with an average gap between simulated and observed discharge trends of 95 ± 50 mm yr −1 per 35 years over 1978-2013 ( Fig. 6).

Calculation of E ref considering changes in vegetation dynamics
In order to consider changes in the vegetation dynamics, we estimated changes in surface resistance based on changes in a satellite-based vegetation index for the calculation of E ref . Accounting for vegetation dynamics in the calculation of E ref increased trends in E sim to 88 ± 16 mm yr −1 per 35 years (model variant V 8), compared to 52 ± 13 mm yr −1 per 35 years in the baseline model V 0 (Table 4). This reduced the gap between simulated and observed discharge trends from 95 ± 50 to 58 ± 49 mm yr −1 per 35 years (Table 4); i.e., a reduction of 36 ± 9 mm yr −1 . Increased trends in E sim are consistent with E ref trends that increased from 70 ± 13 mm yr −1 per 35 years in the baseline model V 0 to 110 ± 17 mm yr −1 per 35 years in model variant V 8 (Fig. 6). Accounting for vegetation dynamics had a rather consistent effect on the discharge trends throughout the catchments ( Fig. S6b and e). In order to evaluate the effect of combining the model modifications that had a considerable effect on the gap between trends in observed and simulated discharge, we combined the use of the precipitation data set P 2 (model variant V 2) and the consideration of vegetation dynamics in the calculation of E ref (model variant V 8) as model variant V 9. Compared to the baseline model, the differences in the trends between simulated and observed discharge were reduced by 90 ± 31 mm yr −1 per 35 years in this model variant so that the differences largely disappeared (Table 4). Bias values in the evaluation period for variant V 9 show only little variation between subperiods S2 and S6, but some variation remains when transferring models from subperiods S1 or S7 to subperiods S2 or S6 or vice versa (Fig. 5h). Bias values in the evaluation period were reduced from −0.13-0.18 in the baseline model to −0.03-0.10 in model variant V 9. When comparing model variant V 9 and the baseline V 0, the differences in the trends of simulated and observed discharge were reduced in most catchments, with stronger reductions in catchments that showed higher differences in the trends of simulated and observed discharge in the baseline model (Fig. S6f).

Discussion
Our analyses suggest that problems in the precipitation data and neglecting changes in vegetation activity were the most important causes of the poor performance of the HBV model in Austrian catchments in a transient climate. Inhomogeneities in the precipitation data set due to a variable number of stations explained 39±26 mm yr −1 per 35 years of the difference between simulated and observed discharge trends (or 47 ± 28 mm yr −1 per 35 years when using a precipitation data set that was additionally undercatch corrected). While the original model neglected changes in the vegetation activity and length of the growing season, considering these changes by calculating E ref accounting for changes in surface resistance based on changes in a satellite-based vegetation index reduced the gap between simulated and observed discharge trends by 36±9 mm yr −1 per 35 years. Combining both modifications, using a precipitation data set based on a constant number of stations and considering vegetation dynamics for the calculation of E ref , reduced the gap between simulated and observed discharge trends by 95 %. The model structure deficiencies with respect to vegetation dynamics are likely relevant for a large number of studies in a transient climate, including simulations in the context of climate change impact assessments. In a changing climate, changes in vegetation dynamics (such as increased growing season length) can have substantial effects on changes in the water balance. The effect of considering changes in vegetation dynamics observed in this study is in agreement with other studies that demonstrate the impacts of climateinduced changes in growing season length and vegetation growth on the water balance (Caldwell et al., 2016;Hwang et al., 2018;Kim et al., 2018;Gaertner et al., 2019). For example, long-term hydrologic changes in two forested catchments in the southern Appalachians could only be simulated if full vegetation dynamics were incorporated in the ecohydrologic model . Lengthening of the growing season intensified climatically driven increases in evaporation and reductions in streamflow in a mixed forest catchment in New England . Decreased catchment streamflow over the last 15 years was linked to increased growing season length in six northern headwater catchments (Wang et al., 2019). Increases in evaporation in the central Appalachian Mountains region were attributed to longer growing seasons, with an increase in growing season length by 1 d resulting in a moderate increase in evaporation by 0.5 mm yr −1 (Gaertner et al., 2019). Here, we considered changes in vegetation dynamics by using a variable surface resistance based on changes in a satellite-based vegetation index. Based on a rather simple approach, this should be seen as a first estimate to demonstrate the significance of changes in vegetation dynamics on the water balance. While in this study we assume that the simulations accounting for vegetation dynamics also partly reflect the effects of changes in land cover, an approach that enables the disentangling of these effects would be preferable in future work. The changes in vegetation dynamics were derived from satellite-based data, which are often not available in the context of climate change impact assessments. Future work should therefore aim at approaches that simulate the changes in vegetation dynamics in response to climatic changes that may be implemented into conceptual hydrologic models. The effect of increased atmospheric CO 2 concentrations on surface resistance was neglected in the present study. At the global scale, it is estimated that this effect may have reduced evaporation by the order of 1.6 to 2.0 mm yr −1 per decade since the 1960s (Gedney et al., 2006;Piao et al., 2007).
In this study, we found problems in the model structure with respect to the calculation of evaporation contributing to poor model performance in a transient climate. Model structural problems, albeit in different model components, were also found to cause poor performance in a transient climate in other studies. For a case study in South Australia, model performance was improved by allowing the parameter for the maximum capacity of the soil store to vary in time as a function of a linear trend, which was interpreted as increased catchment storage through an increase in farm dams in the catchment (Westra et al., 2014). For a case study in southwestern Australia, introducing a non-linearity parameter and a threshold value for the rainfall-run-off relationship enabled the simulation of dry and non-dry years with the same parameter set, which was not possible with the original model (Fowler et al., 2018). Changes in glacier volume may cause deviations between simulated and observed discharge trends if not accounted for by the model. Therefore, glaciercovered catchments were excluded in our study. Model structural deficits with respect to glacier dynamics may be responsible for further deviations between simulated and observed discharge trends in the study by Merz2011, which did not exclude glacier-covered catchments, although the total glacier cover of Austria is small (0.5 %; Fischer et al., 2015).
The mismatch between simulated and observed discharge trends was partly caused by inhomogeneities in the precipitation data. Thus, the problem of the limited suitability of the hydrological model under transient conditions is less severe than previously assumed. The comparison of the precipitation data sets based on a constant and variable station network (Fig. 4a) shows very well that trend analyses of gridded data based on a variable number of stations can be misleading. Particularly large effects of changes in the gauge network on estimated trends may occur if the gauged precipitation values are interpolated directly (as for the baseline precipitation data P 0), in contrast to interpolation methods that make use of a two-step procedure by interpolating against a climatology (Fawcett et al., 2010). While the SPARTACUS data are currently seen as the best suited gridded data set for trend analyses in Austria, they may, however, contain further inhomogeneities. Network inhomogeneities were avoided by using a constant station network and interpolating against a monthly climatology. However, inhomogeneities may be present in the series of individual stations. Homogenised series were available only for 4 % of the station data used for the SPARTACUS data set, and it is estimated that 25 % of the stations used may still be affected by inhomogeneities (Hiebl and Frei, 2018). However, while we expect changes in the precipitation trends for individual (smaller) catchments, it seems unlikely that inhomogeneities in the station data cause changes in the precipitation trends in the same direction for a large number of catchments.
Taking account of the precipitation undercatch error, including effects of climate variability on the undercatch error, had a small and not significant effect when compared to the simulation using the same precipitation data without undercatch correction. Since high-quality wind speed data were not available, wind speeds were not considered in the calculation of the undercatch error. Analyses of the available data in Austria over 1977-2014 show a slight decrease in wind speeds (on average −3.0 ± 2.5 % per decade; see Sect. S2 in Duethmann and Blöschl, 2018). Decreasing wind speeds would result in increasing catch ratios and mean that our estimate of the effect of changes in the catch ratio due to climatic variability on the difference between simulated and observed discharge trends is at the lower end.
Increasing the length of the calibration period did not reduce the gap between trends in simulated and observed discharge (Table 4). This is in agreement with several other studies that found little improvement of the observed poor performance in contrasting climate by using a longer calibration period (Luo et al., 2012;Brigode et al., 2013;Coron et al., 2014). Similarly, changes to the objective function to improve the internal consistency of the model did not lead to a better performance in a changing climate. In this study, we included snow data because of the influence of snow on the hydrology in the study region. Seibert (2003) tested whether including groundwater-level observations in the calibration reduced the problem of low model performance for large floods when there were no large floods in the calibration period, but this did not lead to improvements. The results are more variable with respect to changes in the objective function that put a stronger focus on interannual variability. While including annually aggregated discharge data into the objective function did not reduce the gap between trends in simulated and observed discharge in this study, Hartmann and Bárdossy (2005) found that including annually aggregated discharge data in the objective function in addition to daily discharge data improved the transferability of a distributed conceptual hydrological model under contrasting climate conditions in their study. A way to find out whether parameter problems might be the cause when a model shows poor performance in contrasting climates is to apply multiobjective calibration to the contrasting periods, as suggested by Fowler et al. (2018). If this is the case, efforts of finding a parameterisation method that identifies parameter sets suitable for contrasting climates only from the calibration period may then be undertaken in a second step. Multi-objective calibration to the contrasting periods was applied in a study that used five different model structures and 86 catchments in Australia (Fowler et al., 2016). The results showed that, depending on the acceptance threshold for good model performance, parameterisation problems caused a decline in model performance in contrasting climate periods in 35 % or 55 % of the cases of DSST failure.
The present study included a large number of catchments, so we assume that our results are robust. However, it is limited to a particular hydrologic model and a particular region. It should therefore be complemented by further studies on the causes of poor (and good) performance of hydrological models in transient climate conditions. The aim is a more complete picture of in which cases which model structure components and which parameterisation methods result in poor model performance in a transient climate so that these model structure components and parameterisation methods can be avoided for applications where good model performance in a transient climate is relevant as, for example, in climate change impact assessments. Ultimately, this will increase the robustness of hydrologic simulations in a changing climate.

Conclusion
In this study, we investigated why the HBV model failed to predict changes in discharge in response to observed increases in precipitation and air temperature for 156 catchments in Austria. The baseline model overestimated the observed discharge trends over 1978-2013 and on average over all catchments by 95 ± 50 mm yr −1 per 35 years or 12.8±6.7 % per 35 years relative to observed discharge. Simulations with variants of the model indicate that the poor performance of the HBV model in Austrian catchments in a transient climate could largely be ascribed to two problems, namely a model structure that neglects changes in the vegetation dynamics and inhomogeneities in the precipitation input. Considering changes in the vegetation dynamics by calculating E ref accounting for changes in surface resistance based on changes in a satellite-based vegetation index reduced the gap between simulated and observed discharge trends by 36 ± 9 mm yr −1 per 35 years. Inhomogeneities in the precipitation data set due to a variable number of stations on average explained 39 ± 26 mm yr −1 per 35 years of the difference between simulated and observed discharge trends. Extending the calibration period from 5 to 25 years, including annually aggregated discharge data or snow cover in the objective function, or estimating evaporation with the Penman-Monteith instead of the Blaney-Criddle approach had little influence on the simulated discharge trends. The model structure deficiencies with respect to vegetation dynamics are likely relevant for a large number of studies in a transient climate, including climate change impact studies. The precipitation data problem highlights the importance of using precipitation data based on a constant number of stations for studies on long-term dynamics. Our study emphasises the importance of considering interrelations between changes in climate, vegetation and hydrology for hydrological modelling in a transient climate.
Data availability. The discharge data and precipitation data from the HZB can be accessed through https://ehyd.gv.at/ (BMLRT, 2020). The meteorological data from the ZAMG are currently not freely available; requests should be directed to klima@zamg.ac.at. The Corine land cover map can be downloaded from https:// www.eea.europa.eu/data-and-maps/data/clc-2000-vector-6 (Euro-pean Environment Agency, 2016). The NDVI data can be downloaded from https://ecocast.arc.nasa.gov/data/pub/gimms/ (Tucker et al., 2005). The hydrological model simulations are available upon request from the first author.