Benchmarking an operational hydrological model for providing seasonal forecasts in Sweden

Probabilistic seasonal forecasts are important for many water-intensive activities requiring long-term planning. Among the different techniques used for seasonal forecasting, the Ensemble Streamflow Prediction (ESP) approach has long been employed because it depends solely on past meteorological records. The Swedish Meteorological and Hydrological Institute is currently extending the use of long-range forecasts within its operational warning service, which requires a thorough analysis of the suitability and applicability of different methods with the national S-HYPE hydrological model. To this end, we aim to evaluate the skill of ESP forecasts over 39,493 catchments in Sweden, understand their spatiotemporal patterns, and explore the main hydrological processes driving forecast skill. We found that ESP forecasts are generally skilful for most of the country up to 3 months into the future but that large spatiotemporal variations exist. Forecasts are most skilful during the winter months in northern Sweden, except for the highly-regulated hydropower-producing rivers. The relationships between forecast skill and 15 different hydrological signatures show that forecasts are most skilful for slowly-reacting, baseflow-dominated catchments and least skilful for flashy catchments. Finally, we show that forecast skill patterns can be spatially clustered in 7 unique regions with similar hydrological behaviour. Overall, these results help identify in which areas, in which seasons, and how far into the future ESP hydrological forecasts provide an added value, not only for the national forecasting and warning service but, most importantly, to guide decision-making in critical services such as hydropower management and risk reduction.


Introduction
Regardless of the geographical setting, human society depends on water resources to satisfy basic needs and allow for social growth and development. At the same time, however, the variability of hydrological systems, leading to extreme events such as floods or droughts, puts pressure on the viability and sustainability of many water-intensive activities. In this setting, being able to predict the future evolution of the hydrological system may improve societal resilience by anticipating potentially hazardous events and enabling the adoption of protective and/or adaptive measures (Girons Lopez et al., 2017; Pappenberger et al., 2015b). Even if most day-to-day decisions on water-related issues are based on short- and medium-range forecasts, some activities, such as water reservoir operation and optimisation or strategic planning, benefit from long-term forecasts. Despite their inherent uncertainties, long-term forecasts such as seasonal forecasts are a valuable tool for such applications, as they provide insights into the general trends of the hydrological system up to several months into the future, leading also to economic benefits (Bruno Soares et al., 2018; Giuliani et al., 2020).
https://doi.org/10.5194/hess-2020-542 Preprint. Discussion started: 28 October 2020. © Author(s) 2020. CC BY 4.0 License.
Different techniques are available for generating seasonal forecasts, each with different strengths and weaknesses. These techniques may be based on dynamic or statistical methods, or on a weighted combination of both. Among these, the Ensemble Streamflow Prediction (ESP) methodology has long been widely adopted for seasonal forecasting (Wang et al., 2011; Wood and Lettenmaier, 2006). Following this methodology, ensemble streamflow forecasts are generated using historical meteorological data as forcing to a hydrological model. An advantage of this method over methods based directly on historical streamflow is that ESP forecasts are initialised based on hydrological conditions updated for the forecast date. Forecasts thus benefit from the most recent hydrological knowledge when they are initialised, which is of particular interest for unprecedented hydrological conditions. This advantage can, however, also lead to forecast overconfidence, as the method does not consider the impact of potential uncertainties in the initial hydrological conditions, as noted by Wood and Schaake (2008). Additionally, its reliance on historical meteorological forcing makes it impossible for it to capture hydrological responses to unprecedented meteorological events.
ESP forecasts have been used by the scientific community to assess forecast skill sensitivity and uncertainties and to benchmark seasonal forecast improvements (Harrigan et al., 2018), as well as for operational flood forecasting in many different settings and scales (Candogan Yossef et al., 2017). Over the years, different techniques have been developed to improve the performance of forecasting systems, such as data assimilation for improving the initial conditions of forecasts (DeChant and Moradkhani, 2011), multi-model approaches (Muhammad et al., 2018), or pre- and post-processing techniques such as using artificial neural networks for reducing the effects of model errors (Jeong and Kim, 2005; Macian-Sorribes et al., 2020), historical scenario selection and weighting (Crochemore et al., 2017; Trambauer et al., 2015), and calibration techniques (Wood and Schaake, 2008).
Evaluation efforts are typically carried out based on forecasts issued retrospectively (re-forecasts) over time periods long enough to ensure that the evaluation is statistically robust. For many operational applications it is important to understand the spatiotemporal patterns of seasonal streamflow predictability as well as the driving processes behind these patterns (Sutanto et al., 2020). Indeed, previous studies have identified different sources of forecast skill depending on hydrological characteristics; for instance, Greuell et al. (2019), Shukla et al. (2013), and Wanders et al. (2019) identified initial conditions of soil moisture and snow (during spring) as the most important sources of skill over Europe, while Singla et al. (2012) found similar results for France. In a study over the United Kingdom, Harrigan et al. (2018) found that streamflow predictability was higher for slow-responding catchments, as described by the baseflow index (BFI).
Some studies have even gone one step further by investigating spatiotemporal patterns in streamflow predictability in an attempt to regionalise the forecast skill. For example, it has been shown that streamflow predictability is strongly dependent on the overall hydrological regime, with limited predictability in flashy basins (low river memory), and hence that it can be regionalised based on a priori knowledge of local hydro-climatic conditions.
The Swedish Meteorological and Hydrological Institute (SMHI) has long been operationally providing streamflow forecasts and hydrological warnings to relevant actors in hydrological risk management (municipalities, county boards, Swedish Civil Contingencies Agency), as well as to the general public. Forecasts were initially produced with the HBV model (Bergström, 1976), but in recent years operational forecasting has shifted to the Swedish implementation of the HYPE model (S-HYPE, Lindström et al., 2010), which allows for an integrated, high-resolution description of the hydrological system across the country. Where available, in-situ observations of streamflow are assimilated, which has a beneficial impact on the hydrological predictions downstream. ESP seasonal forecasts are produced but not generally distributed to other actors due to uncertainties in their skill and in their interpretation by external parties. Nevertheless, SMHI is now looking to extend the usage of long-term forecasts within its warning service, which requires a deeper understanding of forecast performance, its patterns, and controlling factors. In terms of regionalisation, four main regions based on hydro-climatic patterns (Lindström and Alexandersson, 2004; Pechlivanidis et al., 2018) have typically been used for water management in Sweden. However, these regions were not put forward with consideration to seasonal streamflow predictability over Sweden and might therefore be of limited use for this purpose.
The aim of this study is to benchmark and attribute the ESP forecast skill over Sweden with the operational S-HYPE model. To this end, we: (a) evaluate the skill of ESP seasonal forecasts generated with the operational S-HYPE model over Sweden and understand the spatiotemporal pattern of skill, (b) detect potential links between streamflow forecast skill and hydrological characteristics, and (c) attribute streamflow predictability patterns across the country to hydrological behaviour of the river systems. The paper is structured as follows: section 2 presents the data used, hydrological model setup, and methodology for the forecast evaluation; section 3 presents the results, followed by the discussion in section 4; finally, section 5 states the conclusions.

Data
Daily precipitation and temperature data from the PTHBV database (Johansson, 2002) were used as forcing data to the S-HYPE model. This database contains gridded data based on a weighted interpolation of measured values from all available stations for any given day with a resolution of 4 × 4 km, and it is available from 1961 onwards. The interpolation method used for generating PTHBV considers factors such as elevation and wind frequency and direction to make interpolated values for precipitation and temperature more reliable. Additionally, daily stream discharge and water level data from 539 stations of SMHI's gauge network were used to correct the model outputs for improved forecast initialisation (see Figure 1).

Hydrological modelling
The ESP re-forecasts were produced using the S-HYPE model (Strömqvist et al., 2012), which is the operational implementation of the HYPE model for Sweden (Lindström et al., 2010). This allowed an analysis of model outputs for 39,493 catchments (with an average spatial resolution of 10 km²) in the model domain. The HYPE model is a process-based hydrological model for water quantity and quality which operates on a daily time step and includes both hydrological (snowpack, groundwater, surface runoff, streamflow) and anthropogenic (reservoir operation, irrigation) factors. The S-HYPE model has a median Kling-Gupta efficiency (Gupta et al., 2009) [...]
A large percentage of water courses in Sweden are regulated, mainly for energy production purposes; see the degree of regulation (%) in Figure 1c; see also the definition in Pechlivanidis et al. (2018). This makes the simulation and prediction of water variables in the main water courses more challenging, as regulation patterns, which can largely deviate from the natural flow, need to be considered. In the operational S-HYPE model, general regulation regimes in the form of constant flow or seasonally varying sine-wave-shaped flow (or a combination of both) between predefined levels and, in some cases, specific dates are provided for a number of reservoirs. Nevertheless, since dam operation is continuously adapted (within certain bounds) to the present and most probable future meteorological and hydrological conditions, these general regulation regimes are expected to be of little benefit for seasonal forecasting purposes.
We produced a series of hydrological re-forecasts at the daily time step for all 39,493 locations across Sweden and transboundary basins using meteorological forcing data from 25 random years for the period 1961-2016 so as to mimic SMHI's operational forecast setup. When selecting the forcing data, a window of 3 years was left out around the analysis year (one year before and two years after) to limit the impact of interannual streamflow memory and thus avoid conditioning the forecasts. We initialised the re-forecasts on the 1st, 8th, 15th, and 22nd of each month (approximately once a week).
We used stream discharge and water level data, where available, to correct the model outputs prior to producing the re-forecasts, and thus get the best possible initial conditions. For this purpose, we used an autoregressive (AR) correction method (Lindström and Carlsson, 2000; Pechlivanidis et al., 2014). Following this method, model outputs are replaced by the available observations and the model errors with respect to these observations are saved for every time step. If observations are no longer available, the output corrections converge exponentially towards the simulated values. This correction only affects catchments with or downstream from streamflow observations and is especially relevant for regulated water courses with low model performance where simulated streamflow can significantly deviate from actual values (Figure 1b).
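The selection of forcing years described above can be illustrated with a minimal sketch. The function name, its parameters and the seed are hypothetical, and the sketch assumes that the analysis year itself is excluded together with the year before and the two years after; the operational setup may define the window differently.

```python
import random


def select_forcing_years(analysis_year, first_year=1961, last_year=2016,
                         n_members=25, years_before=1, years_after=2, seed=None):
    """Draw ESP forcing years at random, skipping a window around the analysis year.

    Assumption: the excluded window covers the analysis year, `years_before`
    years before it and `years_after` years after it.
    """
    excluded = set(range(analysis_year - years_before,
                         analysis_year + years_after + 1))
    candidates = [year for year in range(first_year, last_year + 1)
                  if year not in excluded]
    rng = random.Random(seed)
    # Sampling without replacement gives n_members distinct forcing years.
    return sorted(rng.sample(candidates, n_members))


# Example: a 25-member ensemble for a re-forecast issued in 1995.
forcing_years = select_forcing_years(1995, seed=42)
```

Each selected year would then supply the precipitation and temperature series used to drive the hydrological model from the initialisation date onwards, yielding one ensemble member per year.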

Forecast evaluation
We evaluated the skill of the ESP re-forecasts produced with the S-HYPE model over the period 1981-2016 using the Continuous Ranked Probability Skill Score (CRPSS, Appendix A) and a cross-validation strategy. Although studies involving large-scale models often use model simulations as reference, as this minimises the impact of model performance on forecast skill, here the reference was a combination of observations (for catchments with or downstream from observation points) and model simulations (elsewhere), as a station-corrected simulation approach was used to achieve the best possible initial conditions. We assessed the skill of the ESP re-forecasts so as to highlight the added value of the ESP forecasts with respect to an ensemble forecast based on historical streamflow, which users would have access to in the absence of SMHI's forecast service (Pappenberger et al., 2015a). For this purpose, we used an ensemble whose 25 members were resampled from the historical (station-corrected) model simulations from the period 1981-2010 (excluding the forecast year) as a benchmark against which to derive the skill of the ESP re-forecasts.
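As an illustration of the skill score, the sketch below computes the CRPS with the standard empirical ensemble estimator and derives the CRPSS of a forecast ensemble against a benchmark ensemble. This is a generic formulation under stated assumptions, not necessarily the exact one given in Appendix A, and the function names are hypothetical.

```python
def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast against a single reference value.

    Uses CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j| over all member pairs.
    """
    m = len(members)
    term_obs = sum(abs(x - obs) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


def crpss(forecasts, benchmarks, references):
    """CRPSS = 1 - mean CRPS(forecast) / mean CRPS(benchmark).

    1 is a perfect forecast, 0 means no added value over the benchmark,
    and negative values mean the benchmark performs better.
    Undefined (division by zero) if the benchmark is itself perfect.
    """
    n = len(references)
    crps_fc = sum(crps_ensemble(f, y) for f, y in zip(forecasts, references)) / n
    crps_bm = sum(crps_ensemble(b, y) for b, y in zip(benchmarks, references)) / n
    return 1.0 - crps_fc / crps_bm


# Example: two forecast dates, three-member ensembles (illustrative values only).
esp = [[9.0, 10.0, 11.0], [19.0, 20.0, 22.0]]
historical = [[5.0, 10.0, 15.0], [10.0, 20.0, 30.0]]
reference = [10.0, 20.0]
skill = crpss(esp, historical, reference)  # positive: ESP is sharper here
```

In the setting described above, `esp` would hold the ESP re-forecast ensembles, `historical` the 25-member ensembles resampled from the station-corrected simulations, and `reference` the station-corrected streamflow.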
Even if hydrological models are typically run at a daily time scale, forecast results from hydroclimate prediction systems are usually post-processed and aggregated over longer periods to provide information tailored to the user needs (Bohn et al., 2010). More specifically, a temporal aggregation of one month is typically used in seasonal forecasting services (Apel et al., 2018; Bennett et al., 2017). Nevertheless, different time periods may be of interest depending on the sectoral use (e.g. water resources management, civil protection mechanisms, warning services). Therefore, in addition to using a basic temporal aggregation of one week (i.e. daily streamflow forecasts were aggregated to weekly averages) to estimate the predictive skill of the national operational service, we were also interested in understanding how aggregating streamflow forecasts over different time periods (i.e. 2 weeks, 4 weeks, 8 weeks, 12 weeks, and 24 weeks) impacts forecast skill.
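The temporal aggregation can be sketched as a simple block average over consecutive windows. The helper below is hypothetical and assumes non-overlapping windows, with any incomplete trailing window dropped; the source does not specify how window boundaries were handled.

```python
def aggregate_mean(daily, window_days):
    """Average a daily series over consecutive, non-overlapping windows.

    Assumption: an incomplete window at the end of the series is discarded.
    """
    n_windows = len(daily) // window_days
    return [sum(daily[i * window_days:(i + 1) * window_days]) / window_days
            for i in range(n_windows)]


# Example: 17 days of one ensemble member's daily streamflow (illustrative values).
daily_flow = [1.0] * 7 + [3.0] * 7 + [5.0] * 3
weekly = aggregate_mean(daily_flow, 7)        # two complete weeks
fortnightly = aggregate_mean(daily_flow, 14)  # one complete 2-week window
```

For an ensemble forecast, the same aggregation would be applied to each member separately before the skill score is computed on the aggregated values.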

Forecast skill attribution
Thereafter, we investigated which hydrological characteristics are associated with skilful forecasts. More specifically, we selected a set of 15 hydrologic signatures (statistics describing the hydrological behaviour; see Table 1) to provide diagnostics of the hydrological regime (Kuentz et al., 2017; Pechlivanidis and Arheimer, 2015). We used the non-parametric Spearman rank test to assess the correlation between forecast skill and each of the hydrologic signatures.
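A minimal sketch of the Spearman rank correlation coefficient is given below, computed as the Pearson correlation of average ranks so that ties are handled. The function names and the sample values are hypothetical, and the sketch returns only the coefficient (ρ), not the significance test used in the study.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Example: skill decreasing with flashiness across four catchments (made-up values).
skill = [0.8, 0.6, 0.4, 0.2]
flashiness = [0.1, 0.3, 0.2, 0.9]
rho = spearman_rho(skill, flashiness)
```

Applied per signature across all catchments, this would give one coefficient for each of the 15 signatures.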

Table 1. Columns: Signature, Abbreviation.

Then, we applied a k-means clustering approach within the 15-dimensional space (hydrological signatures) to group the catchments into clusters based on similarities of basin functioning and further identify the dominant streamflow generating processes for specific regions. Finally, we analysed the hydrologic predictability for each of the clusters.
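The clustering step can be illustrated with a plain implementation of Lloyd's k-means algorithm. This is a sketch under stated assumptions: the standardisation of the signatures, the random initialisation and all function names are illustrative choices that the source does not specify.

```python
import random


def standardise(rows):
    """Scale each signature (column) to zero mean and unit variance.

    Assumption: signatures are standardised before clustering; constant
    columns are left at zero.
    """
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Example: six catchments described by two signatures (made-up values).
signatures = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(standardise(signatures), k=2)
```

In the study each row would hold the 15 signatures of one catchment. Because k-means can converge to a local optimum, several random initialisations are commonly compared in practice.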

Temporal and spatial distribution of forecast skill
The skill of the ESP re-forecasts varies with the lead time and the forecast initialisation date (Figure 2). As expected, the skill of the ESP with respect to historical streamflow is overall very high for medium-range horizons (i.e. 1-2 weeks ahead), with a median skill over Sweden starting at 0.7 (Figure 2a) and thereafter decreasing with time. After approximately three months, the ESP provides, on average, no added value with respect to historical streamflow. Similar trends have been observed in other evaluations of forecasting systems over Sweden (Foster et al., 2018; Olsson et al., 2016). In particular, we note a rapid decrease in skill in the first forecast month (Harrigan et al., 2018). Consequently, under the common monthly initialisation frequency of climate prediction systems (Batté and Déqué, 2016; Johnson et al., 2019), streamflow predictability is expected to remain low for periods beyond a 2-week forecast horizon. By increasing the frequency of forecast initialisation (e.g. from once a month to once a week), and hence frequently updating the initial hydrological states, it is possible to maintain a high streamflow forecast skill for extended forecast horizons (Figure 2b).

Even if the forecast skill follows a similar decreasing pattern for all initialisation dates, both the maximum skill value and the deterioration rate differ. The highest skill (greater than 0.8) is observed for forecasts initialised in winter (December-February). In the other seasons, the forecast skill starts at around 0.7, with the lowest skill value observed for initialisations in April (just under 0.6). Even if the forecast skill deteriorates quickly and reaches a predictability value close to that of historical streamflow (CRPSS close to 0) at long forecast horizons, forecasts initialised in March and April (and to some extent also in February) show a small secondary peak in skill in May. This is probably related to the predictable spring flood season in the northern parts of the country.
The spatial distribution of forecast skill differs significantly across initialisation dates and forecast horizons (Figure 3). For instance, forecasts initialised in winter (e.g. December 1st) maintain skill for inland forested areas of northern Sweden up to 3 months in the future. Forecasts initialised in spring (e.g. March 1st) show skill up to the same forecast horizon, but most notably in the southern and eastern parts of the country. Finally, forecasts issued in summer and autumn are skilful up to 2 months except for some areas in the central-western parts of the country.
For the first forecast month, forecasts tend to have a comparatively poorer skill in the mountainous areas of north-western Sweden than in other parts of the country, except when they are initialised in the spring. Agricultural areas located around some of Sweden's largest lakes, such as Lake Mälaren and Lake Vänern, also have comparatively poor forecast skill. Interestingly, high predictability would have been expected in such lakes with slow hydrological response (long memory). However, these great lakes are heavily regulated (see Figure 1), and the model correction seems to have impacted forecast skill. Streamflow forecasts in the large, highly-regulated rivers of northern Sweden, such as River Umeälven and River Luleälven, also lack skill (Figure 1c). Again, the regulation patterns, which differ significantly from the natural regime of the watercourses, are not adequately captured by the ESP. In these cases, the broader ensemble of historical streamflow is a better estimator of the future trends in streamflow. Conversely, streamflow forecasts show high skill in non-regulated rivers located in the same area and of similar size and hydrological regime, i.e. River Kalixälven and River Torneälven.

Forecast skill as a function of temporal aggregation
We next investigate the impact of the forecast aggregation period on the forecast skill. Here, the focus is not on the spatial pattern and therefore forecast skill is averaged over the entire domain. Results show that, even if the average skill for the first forecast period decreases when aggregating over longer time periods, the forecasts remain skilful (CRPSS greater than 0) for aggregation periods up to 12 weeks (Figure 4). When aggregating over 24 weeks, the ESP method generally provides no added value with respect to historical streamflow; the predictability from ESP is very similar to that from historical streamflow. Even if, as expected, forecast skill decreases when forecasts are aggregated over long periods, a comparatively higher skill is maintained over longer time horizons than when forecasts are aggregated over short periods. In addition, forecasts initialised in February and March are skilful up to 16 weeks ahead when aggregating over long periods (e.g. 8 weeks), and the forecasts initialised in April and May show high skill values, even when aggregating over a 12-week period. This seems to be due to the high predictability of the spring flood season in May, also shown by the secondary peak in skill observed for these initialisations in Figure 2b. Finally, for forecasts initialised in July and October, long aggregation periods (e.g. 12 weeks) tend to dilute the high forecast skill observed over the first weeks.

Relating streamflow signatures and forecast skill
We next investigate potential correlations between forecast skill and the 15 streamflow signatures using the non-parametric Spearman rank test. In all cases the null hypothesis (i.e. no correlation exists between forecast skill and the streamflow signature) is rejected with a level of significance of 0.01. Nevertheless, different patterns emerge when comparing forecast skill (first forecast week) for each catchment with each of the 15 streamflow signatures (Figure 5). More specifically, forecast skill is strongly inversely correlated (defined here as the Spearman's rank correlation coefficient (ρ) being less than -0.50) with high pulse count (HPC), flashiness (Flash), rising limb density (RLD), declining limb density (DLD), and coefficient of variation (CV). Additionally, a strong direct correlation (ρ > 0.50) is found between skill and baseflow index (BFI), normalised low streamflow (q95), and normalised relatively low streamflow (q70), indicating that slowly reacting catchments with a significant baseflow component generally experience high predictability (Harrigan et al., 2018). [...] however, since spatial patterns in forecast skill weaken and blend in with the forecast horizon, the identified correlations are not strong. Overall, the identified correlations highlight the existence of a generally high forecast skill in slowly reacting, baseflow-dominated catchments, while low forecast skill is predominant in flashy catchments. Although this analysis indicates the existence of dependencies between streamflow signatures and forecast skill, it can still be considered limited given that a hydrological system is generally characterised by a wider set of streamflow signatures than that considered here (Pechlivanidis and Arheimer, 2015; Sawicz et al., 2011).

Attributing streamflow forecast skill to hydrologic behaviour
Here, we investigate the potential attribution of streamflow predictability in the Swedish river systems to hydrological behaviour, given that such dependency has been highlighted in the previous analysis. Using the k-means clustering method, an optimal number of seven distinct clusters (based on a silhouette analysis using different numbers of clusters; de Amorim and Hennig, 2015) has been obtained, representing different hydrological regimes (Figure 6). Table 2 provides additional information on the topographic, climatological and hydrological characteristics of each cluster, while the spatial variability of each of the 15 streamflow signatures, as well as of the catchment elevation, is presented in Appendix B.
Catchments clustered in regions 1 and 5 are characterised by a high baseflow contribution, a slow response to precipitation and, therefore, a generally small intra-annual variability. In terms of topography, these regions consist mainly of forested areas located in southern Sweden. Catchments in Cluster 2 are found in highland areas and boreal forest environments in northern Sweden, and are characterised by high seasonality due to the alternation between snow melting and accumulation. These catchments are also characterised by high runoff volumes given that they are subject to high precipitation amounts and low evapotranspiration rates. Agricultural and coastal areas located mainly in southern and central parts of the country are found in Cluster 3. These catchments are characterised by a highly variable streamflow regime and a quick response to precipitation, yet exhibit a long hydrograph recession. Similarly, catchments grouped in Cluster 6, which are located in lowland coastal and lake areas, experience flashy responses, as well as high streamflows and seasonal variations. Boreal forest catchments in the northern part of the country are grouped in Cluster 4, and are characterised by a generally high runoff coefficient and a low response to precipitation events. Finally, catchments in Cluster 7 are found along several large and highly-regulated rivers in northern Sweden. These catchments are characterised by a small variability but high streamflow volumes and runoff coefficients explained by anthropogenic regulations.
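The silhouette analysis used to choose the number of clusters can be sketched as follows. The function computes the mean silhouette width of a given labelling with Euclidean distances; the function names are hypothetical, and the sketch assumes at least two clusters and assigns a score of zero to singleton clusters, which is a common convention rather than something stated in the source.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_silhouette(points, labels):
    """Mean silhouette width of a clustering (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is included in the sum only).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Example: a well-separated labelling scores higher than a mixed one.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
mixed = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
```

To select the number of clusters, the clustering would be repeated for a range of candidate values and the one with the highest mean silhouette width retained.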
The last step is to analyse the streamflow forecast skill in each hydrological cluster (Figure 7). Note that here we have aggregated the skill for all initialisations and hence we have not assessed the seasonal distribution of the forecast skill; instead, we have focused on the detection of dependencies between skill and hydrologic regimes. Nevertheless, we note that the clusters with high (or poor) forecast skill in relation to the others are the same independently of the target month/week. This has been attributed to the intra-annual variability of the streamflow response, which consistently varies between the catchments from the different clusters. Clusters 1, 4, and 5, which are all considered to have high river memory due to baseflow domination, small intra-annual variability, and generally low response to precipitation (see Table 2), have a higher median skill than the country average. Among them, cluster 5 has the highest overall median skill for all time horizons but also, interestingly, the highest spread in forecast skill as a function of lead time. All of these clusters correspond mainly to forested catchments across the country. Cluster 3 and, most notably, cluster 6 have a lower median skill than the country average. These catchments are characterised by short river memory with flashy responses and are strongly driven by precipitation and strong seasonal variations. Similar results are observed for cluster 2. In this case, however, the median skill is closer to the country-average skill than for clusters 3 and 6. The response from catchments in cluster 2 is highly seasonal due to snow accumulation and melting processes, and hence not as rainfall-driven as for clusters 3 and 6. Finally, cluster 7, which contains the catchments along the large regulated rivers in northern Sweden, is the only set of catchments in which the median forecast skill reaches negative values, with also a large spread in the skill values (5th and 95th percentiles). In these catchments, the ESP was expected to be outperformed by historical streamflow since, as previously mentioned, historical streamflows benefit from the AR model correction throughout the forecast period and thus can better reproduce regulation patterns with low intra-annual variability.

Challenges and opportunities in an operational forecasting service
The results obtained in this study indicate that ESP seasonal forecasts produced with the operational S-HYPE hydrological model are skilful with respect to historical streamflow on average up to 3 months ahead, despite the large temporal and spatial variabilities. This positive skill would make operational seasonal forecasts, in general, suitable to guide decision-making for applications requiring long-term planning (e.g. water resources management, agriculture). Nevertheless, issues related to the modelling setup, the forecast methodology, and the hydro-climatic characteristics of the Swedish river systems (e.g. high degree of regulation in many water courses), among others, can impact the reliability of such a forecasting service.
The ESP forecasting approach is limited by its use of historical meteorological forcing data to generate the streamflow forecasts, making it unable to capture unprecedented meteorological events. Consequently, extreme events that lie outside the observed range will inevitably be misrepresented, limiting the service's predictability of extreme conditions, which can be important to some decision makers. This issue may be addressed by using numerical weather prediction (NWP) models to predict the future climate (Monhart et al., 2019). However, although NWPs are not constrained by the observational period, they are limited by the chaotic nature of the weather system (aleatory uncertainty), which makes small errors in the initial conditions grow significantly with time. In addition, NWP-based forecasts require post-processing (i.e. downscaling and bias-adjustment) to be suitable for use in impact studies. Finally, their added value for streamflow forecasting in comparison to ESP is shown to be limited in Sweden, with the possible exception of southern Sweden.
As expected, ESP forecast skill decreases rapidly with time, particularly in fast-responding river systems. Results have shown that monthly initialisation, which is the most common initialisation frequency of climate prediction models, is critical to set high skill values; however, such an initialisation frequency cannot account for skill deterioration within the month. In this setting, increasing the initialisation frequency to, for instance, once a week would allow a high skill to be maintained up to monthly time horizons. Nevertheless, considering that climate prediction models are not developed to represent the exact daily dynamics of the natural systems and that forecasts are therefore aggregated into long time periods, more frequent (e.g. daily) forecast initialisations are not expected to provide an added value to the forecast service. Moreover, forecast information from such frequent initialisations can easily be misinterpreted by decision-makers (Schepen et al., 2016).
Regarding the aggregation of forecast outputs, most studies have focused on a 1-month aggregation period as it is reported to provide an "appropriate forecast at the seasonal scale and a proxy of the underlying distribution" (Emerton et al., 2018;Meißner et al., 2017;Yossef et al., 2013). Nevertheless, there may be value in considering aggregation periods different from 310 the standard monthly aggregation (or even adaptive aggregation periods) for providing guidance on the usability of the forecasts for decision-making. This choice should be driven by user needs and our finding is that aggregations over periods longer than the month do not necessarily mean a loss in skill. Here, we have observed that, in Sweden, long aggregated forecasts covering the spring flood season tend to gain in skill. Overall, however, from time horizons of, on average, 4 months into the future, forecasts have very low or no skill regardless of the aggregation period of choice. 315 Another important factor driving hydrological predictability at the seasonal scale is the adequate knowledge of the initial hydrological conditions (Shukla et al., 2013). In many cases, ESP forecasts are initialized based on the latest available model state (modelled reality), which may significantly deviate from the actual hydrological state (observed reality). Incorporating the latest available observations into forecast initialisation can thus be especially important to bridge the gap between modelled and observed reality. Here, an AR-updating method is used to correct the model outputs whenever observations are 320 available, with the objective to generate forecasts which are as close as possible to observed reality (see Section 2.2). This method is straightforward and easy to implement, and takes advantage of streamflow memory to not only correct the initial forecast state but also the following forecast horizons when observations are no longer available. 
More advanced data assimilation methods, such as Kalman filters (Sun et al., 2016), could be considered in further developments of the presented operational forecast system, allowing not only for a correction of model outputs but also for an adjustment of model states and thus of process representation (Musuuza et al., 2020).
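As an illustration of how an autoregressive update can carry the last known model error into the forecast horizon, the following minimal Python sketch applies a first-order (AR(1)) correction. It is not the S-HYPE implementation described in Section 2.2; the function name, the parameter `rho` and its default value, and the clipping of negative flows at zero are assumptions made for this example only.

```python
import numpy as np

def ar1_updated_forecast(simulated, last_obs, last_sim, rho=0.9):
    """Correct simulated streamflow with a decaying share of the last model error.

    simulated : simulated flows for lead times 1..n (hypothetical input)
    last_obs  : last observed flow before the forecast start
    last_sim  : simulated flow at the time of that observation
    rho       : assumed lag-1 autocorrelation of the model error (0 <= rho < 1)
    """
    simulated = np.asarray(simulated, dtype=float)
    last_error = last_obs - last_sim              # error at the last observed step
    leads = np.arange(1, simulated.size + 1)      # lead times 1, 2, ..., n
    corrected = simulated + last_error * rho ** leads
    # Assumption for this sketch: streamflow cannot be negative.
    return np.maximum(corrected, 0.0)
```

With `rho` close to 1 the correction persists over many lead times, which mimics a river system with long memory; with `rho` close to 0 the forecast reverts almost immediately to the uncorrected simulation.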

Impact of regulation on forecasting skill
One of the main applications of long-term forecasting in Sweden is for planning reservoir operation during the spring flood season (May-July) (Foster et al., 2018). However, forecast skill is low for the main hydropower-producing, heavily regulated rivers in the northern parts of the country, where the highest spring flood peak volumes occur. In these locations, ESP forecasts may be used to adequately predict the water inflows to the reservoirs from the headwaters, but for predicting reservoir outflows they would add no value over the ensemble of historical streamflow. In order to further understand the impact of streamflow regulation on the results, we evaluated the ESP forecasts using model simulations (without AR correction) as reference. In this case, forecast skill was very high for the highly regulated rivers where low forecast skill was obtained in the main analysis. This exercise shows that the regulation routines in some river stations in the S-HYPE model still need improvement in order to correctly represent the management rules dominating regulated streamflow patterns. This issue is not as obvious in less heavily regulated rivers elsewhere in the country, where ESP forecasts are generally skilful.
With the exception of River Luleälven and other comparatively smaller rivers in the Swedish mountains, the S-HYPE model performance is generally high for most locations, including the large rivers in the northern parts of the country. Similarly, ESP seasonal forecasts are skilful for non-regulated rivers in that area that also benefit from long-term planning. More specifically, River Torneälven and, to a lesser extent, River Kalixälven are susceptible to severe ice break-up events in connection with the spring melt season and subsequent spring flooding (Zachrisson, 1989). An important factor in predicting the timing of the ice break-up is the onset of the spring flood due to snowmelt. Skilful ESP seasonal forecasts for these rivers should allow for early planning and allocation of resources that could greatly contribute to mitigating potentially severe ice break-ups.

Regionalisation of skill in other domains
Besides streamflow regulation patterns, certain characteristics of the hydrological regime have a high impact on hydrological predictability. Here we have shown that forecast skill is high in baseflow-dominated catchments, where past hydrologic conditions drive the catchment response, while it is low in flashy catchments, where rainfall drives the streamflow dynamics and hence accurate rainfall forecasts are crucial. This corresponds well with findings from similar studies over different geographical domains (Harrigan et al., 2018;. However, contrary to the findings of Harrigan et al. (2018), who identified a specific streamflow signature (i.e. the baseflow index) as the main driver of hydrological predictability, we have found that, for Sweden, predictability is instead the result of the overall hydrological behaviour, even if some specific streamflow signatures may have a greater impact than others. Additionally, the seven clusters differ not only in terms of hydrological response, but also in terms of climatological patterns and physiographic characteristics.
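To make the contrast between flashy and slowly reacting catchments concrete, the following minimal Python sketch computes two of the signatures discussed here from a daily streamflow series. The formulas are commonly used forms (a Richards-Baker-type flashiness index and the coefficient of variation) and are assumptions for illustration; the exact definitions used in this study are those given in Table 1, and the function names are hypothetical.

```python
import numpy as np

def flashiness_index(q):
    """Sum of absolute day-to-day flow changes divided by the total flow.

    Assumed form for this sketch: both sums run over the same days
    (the second day of the series onwards).
    """
    q = np.asarray(q, dtype=float)
    return np.sum(np.abs(np.diff(q))) / np.sum(q[1:])

def coefficient_of_variation(q):
    """Standard deviation of the flow divided by its mean."""
    q = np.asarray(q, dtype=float)
    return np.std(q) / np.mean(q)

# Two synthetic daily series with the same mean flow (m3 s-1)
steady = [2.0, 2.0, 2.0, 2.0]   # damped, baseflow-like response
flashy = [1.0, 3.0, 1.0, 3.0]   # rapid response to rainfall
```

Both signatures are zero for the steady series and clearly positive for the flashy one, which is the kind of contrast that separates catchments with long memory from those with short memory.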
The results obtained here may help to identify in which areas, in which seasons, and how far into the future ESP hydrological forecasts provide an added value, not only for SMHI's forecasting and warning service but, most importantly, for guiding decision-making in critical services such as hydropower management and risk reduction. Here, we note that, even if the hydro-climatic gradient of Sweden does not fully represent the equivalent gradients over the continent or the globe, our results are nevertheless transferable to other locations with similar climatological and hydrological conditions, as it has also been highlighted in .

Conclusions
Herein, we analysed the skill of ESP re-forecasts using the operational S-HYPE hydrological model over Sweden in an effort to evaluate the suitability of this methodology for producing reliable forecasts at the seasonal scale within SMHI's hydrological forecasting and warning service as well as for other activities requiring long-term planning. In addition, we aimed at understanding the underlying patterns and drivers behind skilful forecasts and attributed the seasonal predictability to hydrological characteristics. About 39,400 catchments, which lie along Sweden's strong hydroclimatic gradient, were investigated. The main conclusions of this study are:
• The skill of the ESP forecasts varies both geographically and seasonally, and depends on the initialisation month and aggregation period. Moreover, the skill decreases rapidly with time, particularly in fast-responding river systems; however, the ESP forecasts are generally skilful up to 3 months into the future. Forecasts are most skilful during the winter months for the northern parts of the country, except for the highly regulated hydropower-producing rivers.
• Initialisation frequency is a key driver affecting streamflow forecasting skill. Monthly initialisations are critical to preserve high forecast skill values without, however, addressing the skill deterioration over the first forecast month. Increasing the initialisation frequency to once a week allows high skill to be maintained up to monthly time horizons.
• The river systems in Sweden can be categorised into 7 clusters based on similarities in streamflow signatures. This results in an improved understanding of the dominating hydrological processes, which are shown to vary spatially and seasonally. In particular, the dominant streamflow generation processes over the mountainous regions, including baseflow and snow accumulation/melting, dampening from lakes, and reservoir alterations, could explain the hydrological clustering across the country.
• A link between forecast skill and streamflow signatures has been detected. Of the 15 streamflow signatures investigated here, baseflow index, flashiness, rising limb density, coefficient of variation and high pulse count show strong correlations with forecast skill. Streamflow forecasts are most skilful for catchments that react slowly due to snow-related processes and/or dampening from lakes, and for baseflow-dominated catchments (river systems with long memory). Conversely, forecasts are least skilful for catchments with a flashy response to rainfall (river systems with short memory).

Appendix A: Continuous Ranked Probability Skill Score
The Continuous Ranked Probability Score (CRPS; Hersbach, 2000) is a common measure of ensemble forecast performance. It is formulated as the integral of the squared difference between the cumulative distribution function of the forecast ensemble and the step function of the observation. The CRPS is then averaged over all forecasts of the evaluation period. Its dimension is that of the forecast variable being assessed, here m³ s⁻¹, and it reduces to the mean absolute error when applied to deterministic forecasts.
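The Continuous Ranked Probability Skill Score (CRPSS) is then assessed by comparing the CRPS value of the investigated forecast system (here, ESP) to that of a selected benchmark (here, an ensemble of historical streamflow selected from the period 1981-2010). Given $\mathrm{CRPS}_{sys}$, the CRPS of the forecasting system, $\mathrm{CRPS}_{bench}$, the CRPS of the benchmark, and $\mathrm{CRPS}_{pft}$, the optimal CRPS value (0), the CRPSS is formulated according to Eq. A1.
$$\mathrm{CRPSS} = \frac{\mathrm{CRPS}_{sys} - \mathrm{CRPS}_{bench}}{\mathrm{CRPS}_{pft} - \mathrm{CRPS}_{bench}} = 1 - \frac{\mathrm{CRPS}_{sys}}{\mathrm{CRPS}_{bench}} \qquad \text{(A1)}$$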
This metric is non-dimensional and takes values from 1 (optimum) down to arbitrarily large negative values. Positive (negative) skill scores indicate that the forecast system performs better (worse) than the benchmark in terms of CRPS. Skill scores close to 0 indicate that the evaluated forecast system has equivalent performance to that of the benchmark.
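As a worked illustration of these two scores, the following minimal Python sketch computes the CRPS of an ensemble from its empirical distribution and the resulting skill score against a benchmark ensemble. It is a sketch under stated assumptions rather than the evaluation code used in this study: the function names are hypothetical, and the CRPS is computed with the identity CRPS = E|X - y| - 0.5 E|X - X'|, which gives the integral of the squared difference between the ensemble's empirical cumulative distribution and the observation step function.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against one observation."""
    x = np.asarray(members, dtype=float)
    # Mean absolute difference between the members and the observation
    term_obs = np.mean(np.abs(x - obs))
    # Mean absolute difference between all pairs of members (ensemble spread)
    term_spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return term_obs - 0.5 * term_spread

def crpss(forecasts, benchmarks, observations):
    """Skill score from CRPS values averaged over all forecast dates.

    forecasts, benchmarks : sequences of ensembles, one per forecast date
    observations          : sequence of observed values, one per forecast date
    """
    crps_sys = np.mean([crps_ensemble(f, o) for f, o in zip(forecasts, observations)])
    crps_bench = np.mean([crps_ensemble(b, o) for b, o in zip(benchmarks, observations)])
    # The optimal CRPS is 0, so the skill score reduces to 1 - CRPS_sys / CRPS_bench
    return 1.0 - crps_sys / crps_bench
```

For a single-member ensemble the spread term vanishes and `crps_ensemble` returns the absolute error, which is consistent with the deterministic limit noted above. A forecast identical to the benchmark gives a skill score of 0.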
Figure B1. Spatial variability of the 15 modelled hydrological signatures and of the catchment mean elevation. The colour intervals are based on the quantiles (15% intervals) of each signature (and elevation) distribution. A clarification of the abbreviations used here can be found in Table 1 in the main text.

Data availability
The HYPE model code is available from the HYPEweb portal (https://hypeweb.smhi.se/model-water/). The meteorological data used for driving the ESP re-forecasts (PTHBV) is available from the luftweb portal (https://luftweb.smhi.se), and the hydrological data used for model correction is available from the vattenweb portal (https://vattenwebb.smhi.se/).

Author contribution
M.G.L. contributed to the study design, model runs, result analysis and figures, interpretation of the results, and writing of the manuscript; L.C. contributed to the study design, code development for post-processing of results, interpretation of the results, and writing of the manuscript; I.G.P. was responsible for the project management and funding acquisition, and contributed the basic idea as well as to the study design, clustering analysis and figures, interpretation of the results, and writing of the manuscript.

Competing interests
The authors declare that they have no conflict of interest.

Financial support
This work was funded by the project "Long-term forecasts of wind and hydropower supply in a fluctuating climate - Importance for production planning and investments in energy storage and power transmission" granted by the Swedish Energy Agency under grant agreement No. 46412-1. Funding was also received from the EU Horizon 2020 project S2S4E (Subseasonal to seasonal forecasting for the energy sector) under Grant Agreement 776787. This study was also partially funded by the EU Horizon 2020 project PrimeWater (Delivering advanced predictive tools from medium to seasonal range for water dependent industries exploiting the cross-cutting potential of EO and hydroecological modeling) under Grant Agreement 870497.