Conditioning ensemble streamflow prediction with the North Atlantic Oscillation improves skill at longer lead times

Abstract. Skilful hydrological forecasts can benefit decision-making in water resources management and other water-related sectors that require long-term planning. In Ireland, no such service exists to deliver forecasts at the catchment scale. In order to understand the potential for hydrological forecasting in Ireland, we benchmark the skill of ensemble streamflow prediction (ESP) for a diverse sample of 46 catchments using the GR4J (Génie Rural à 4 paramètres Journalier) hydrological model. Skill is evaluated within a 52-year hindcast study design over lead times of 1 d to 12 months for each of the 12 initialisation months, January to December. Our results show that ESP is skilful against a probabilistic climatology benchmark in the majority of catchments up to several months ahead. However, the level of skill was strongly dependent on lead time, initialisation month, and individual catchment location and storage properties. Mean ESP skill was found to decay rapidly as a function of lead time, with a continuous ranked probability skill score (CRPSS) of 0.8 (1 d), 0.32 (2-week), 0.18 (1-month), 0.05 (3-month), and 0.01 (12-month). Forecasts were generally more skilful when initialised in summer than other seasons. A strong correlation (ρ = 0.94) was observed between forecast skill and catchment storage capacity (baseflow index), with the most skilful regions, the Midlands and the East, being those where slowly responding, high-storage catchments are located. Forecast reliability and discrimination were also assessed with respect to low- and high-flow events. In addition to our benchmarking experiment, we conditioned ESP with the winter North Atlantic Oscillation (NAO) using adjusted hindcasts from the Met Office's Global Seasonal Forecasting System version 5.
We found gains in winter forecast skill (CRPSS) of 7 %-18 % were possible over lead times of 1 to 3 months and that improved reliability and discrimination make NAO-conditioned ESP particularly effective at forecasting dry winters, a critical season for water resources management. We conclude that ESP is skilful in a number of different contexts and thus should be operationalised in Ireland given its potential benefits for water managers and other stakeholders.

… (Pappenberger et al., 2015a; Zhao and Zhao, 2014). For example, hydrological forecasts have been used to modify reservoir operations for hydropower production (Fan et al., 2016), storage and supply (Turner et al., 2017), and the management of flood and drought conditions (Amnatsan et al., 2018; Ficchì et al., 2016; Watts et al., 2012). They have also been shown to benefit sectors such as agriculture (Mushtaq et al., 2012), tourism (Fundel et al., 2013), and navigation (Meißner et al., 2017). Such applications can yield significant economic returns. For instance, Hamlet et al. (2002) reported a potential rise in annual revenue of USD 153 million when forecast information was incorporated into the operation of major hydropower dams in the Columbia River basin. Similarly, Pappenberger et al. (2015a) claim that the European Flood Awareness System (EFAS; Thielen et al., 2009) saves around EUR 400 for every EUR 1 invested.
The value of hydrological forecasting has led several countries to establish operational seasonal hydrological forecasting (SHF) systems. These include the U.S. National Weather Service's (NWS) Hydrologic Ensemble Forecast Service (HEFS; Demargne et al., 2014), the Hydrological Outlook UK (HOUK; Prudhomme et al., 2017), and the Australian Bureau of Meteorology's statistical and dynamical forecasts (Schepen and Wang, 2015). Although Ireland benefits from regional hydrological outlooks provided by EFAS, no service currently exists for delivering forecasts at the catchment scale; yet water managers and other stakeholders require confident, locally tailored forecast information. A national operational SHF system could bridge this gap. However, despite interest from water managers, it is difficult to justify the implementation of such a system as little preparatory work has been done to evaluate the potential for hydrological forecasting in an Irish context.
Recent international assessments of progress in SHF (Tang et al., 2016; Yuan et al., 2015) indicate that (i) advances in empirical and dynamical SHF are feasible in climate contexts that resemble Ireland; and (ii) SHF spans a wide range of methods with varying complexity and data requirements, but no universally accepted "best" approach has emerged. As the performance of different methods will likely depend on time of year, lead time, and, critically, local hydrological context (Girons Lopez et al., 2021; Harrigan et al., 2018; Meißner et al., 2017; Pechlivanidis et al., 2020), understanding how best to apply the range of available tools to develop skilful forecasts for Ireland requires rigorous testing at the catchment scale. To the authors' knowledge, only Foran Quinn et al. (2021) have previously evaluated seasonal streamflow forecasts for Ireland. They found that whilst skill was mainly restricted to summer months, statistical persistence forecasts could have practical value in the management of water resources and hydrological extremes. We build on this work and further assess the scientific basis for SHF in Ireland by evaluating and benchmarking the skill of ensemble streamflow prediction (ESP).
ESP is a well-established forecasting technique in which historical sequences of climate data at the time of forecast are used to drive a hydrological model, producing an ensemble of equiprobable future streamflow traces (Day, 1985; Twedt et al., 1977). It is comparable to persistence in that it requires no information about future meteorological conditions; outlooks are instead based on knowledge of hydrological state variables (i.e. antecedent soil moisture, groundwater, snowpack, and streamflow itself) which can provide predictability up to 5 months ahead (Wood and Lettenmaier, 2008). In this regard, ESP can be used to efficiently specify not only the catchments where knowledge of initial conditions or meteorological forcing may be the greatest source of skill, but also the time of year and lead times over which different skill sources may be dominant (Wood and Lettenmaier, 2006).
The ESP method was originally developed in the snow-dominated catchments of the western United States (e.g. Franz et al., 2003) but has shown skill in other regions, including the UK (Harrigan et al., 2018), European Alps (Förster et al., 2018), Sweden (Girons Lopez et al., 2021), New Zealand (Singh, 2016), Australia (Pagano et al., 2010; Wang et al., 2011), and China. Simplicity and efficiency make ESP a popular choice for operational forecasting. It is one of three methods used in the HOUK (Prudhomme et al., 2017) and forms the basis of the NWS HEFS (Demargne et al., 2014). Moreover, ESP is recognised as a low-cost, "tough-to-beat" forecast (Pappenberger et al., 2015b) against which value added by more sophisticated hydrometeorological ensemble systems can be assessed (e.g. Arnal et al., 2018; Bazile et al., 2017; Wanders et al., 2019). Hence, the potential application of ESP in Ireland merits exploration.
However, lack of sensitivity to concurrent meteorological conditions limits the application of ESP in areas that are less dependent on the initial hydrological state. Given that local meteorological conditions are known to be teleconnected to regional variations in atmospheric-oceanic modes, ESP techniques may be improved by conditioning on these circulation patterns. Several studies have already demonstrated the added value of incorporating climate information into ESP forecasts in this way. For example, Hamlet and Lettenmaier (1999) found that conditioning ESP traces according to El Niño-Southern Oscillation (ENSO) and Pacific Decadal Oscillation (PDO) indicators significantly improved forecast specificity and extended lead time by about 6 months in the Columbia River basin. Similarly, both Werner et al. (2004) and Bradley et al. (2015) reported improvements of 28 % and 27 % in forecast skill, respectively, when conditioning ESP with ENSO. More modest improvements of 5 %-10 % were observed by Beckers et al. (2016) for two test stations when applying an ENSO-conditioned ESP. More recently, Yuan and Zhu (2018) showed that decadal predictions of terrestrial water storage made using ESP could be improved by conditioning with PDO and Atlantic Multidecadal Oscillation indices.
In Europe, the dominant mode of climate variability is the North Atlantic Oscillation (NAO). The NAO affects streamflow predictability, particularly during winter (Bierkens and van Beek, 2009; Steirou et al., 2017; Wedgbrow et al., 2002; Wilby, 2001), and it is highly correlated with winter streamflow over Ireland (Murphy et al., 2013). As winter is the most important season for groundwater recharge in Europe, the ability to accurately forecast winter streamflow would be extremely beneficial for water managers. Advances in predicting the NAO (Smith et al., 2020) enable long-range forecasts of UK winter hydrology (Svensson et al., 2015) as well as improved seasonal meteorological forecasts for driving hydrological models (Stringer et al., 2020). Hence, it may be possible to leverage this predictability to improve ESP performance by sub-sampling ensemble members for Ireland using the winter NAO.
In this paper, we benchmark ESP skill against streamflow climatology within a 52-year hindcast study design. Skill is evaluated for a combination of different lead times and initialisation months and for diverse hydroclimate regions and catchment types. The relationship between catchment characteristics and ESP skill is explored. Reliability and discrimination are assessed with respect to low- and high-flow events. We also examine the effect of conditionally sampling ensemble members on ESP skill during winter. The following research questions are addressed: …

Section 2 describes our data and methods. Our results are presented in Sect. 3. We offer discussion and suggestions for future research in Sect. 4. Conclusions are presented in Sect. 5.

Catchment selection and observed data
A total of 46 catchments were selected for our analysis following the same criteria used to establish the Irish Reference Network (Murphy et al., 2013). Catchments were selected provided they met the following conditions: (i) they had quality-assured, long-term observational data, with a minimum record length of 25 years; (ii) they had a flow regime which had not been significantly altered by human activity; (iii) they had little evidence of land-use change; and (iv) together they build a representative sample of Ireland's diverse hydrological and climatological conditions, with good spatial coverage. This selection process ensured sufficient data for hydrological model calibration whilst limiting the potential for confounding factors that could adversely affect the interpretation of results. Catchments were grouped according to the European Union's NUTS (Nomenclature of Territorial Units for Statistics) III regions (Fig. 1) to explore spatial variations in skill. As the Dublin region contained only one catchment in our sample, this was merged with the Mid-East into a single region: the East. The distribution of catchments within the seven regions ranges from four in the West to 10 in the Mid-West. Although the NUTS III regions do not inherently lend themselves to hydrological analysis, grouping the catchments in this way did yield regions that were diverse in terms of their hydrology and climate. They are therefore suitable to examine how skill may differ between areas with contrasting hydroclimate properties.
Observed daily mean streamflow data (m³ s⁻¹) were obtained from gauging stations administered by the Office of Public Works (OPW) and the Environmental Protection Agency. Despite the strict selection criteria, some catchments still contain multiple or extended periods of missing data. Hence, streamflow records were retrieved only for calendar years 1992-2017, the longest usable period common to all 46 catchments. Catchment average daily precipitation (mm d⁻¹) and temperature (°C) spanning 1961-2017 were derived from gridded (1 km × 1 km) datasets developed by Met Éireann (Walsh, 2012). Potential evaporation (mm d⁻¹) was calculated from temperature and radiation according to Oudin et al. (2005).

… (Mills et al., 2014). These PCDs describe facets of catchment hydrology, morphology, soil, and climate and are used here to examine relationships between catchment characteristics and ESP skill. The primary PCDs of interest are the baseflow index (BFI), the Richards-Baker flashiness index (RBI; Baker et al., 2004), and the runoff ratio (RR), as these describe aspects of catchment storage and response and have been linked to ESP skill (e.g. Girons Lopez et al., 2021; Harrigan et al., 2018; Pechlivanidis et al., 2020). The BFI is calculated according to the Institute of Hydrology method (Gustard et al., 1992) and quantifies the contribution of stored sources to runoff. Hence, the BFI can be considered an integrated measure of catchment storage capacity. The RBI measures the frequency and rapidity of short-term changes in streamflow, and the RR gives the amount of runoff relative to the amount of precipitation received. Across the sample of catchments, the median (5th and 95th percentile) BFI is 0.59 (0.34, 0.75), the median RBI is 0.19 (0.07, 0.5), and the median RR is 0.62 (0.5, 0.82). Higher values of RBI and RR are observed for catchments with lower storage capacity (BFI) and smaller area, indicative of more responsive hydrological regimes.
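To make the two response descriptors concrete, the following is a minimal sketch of how the RBI and RR could be computed from daily series. It assumes the usual Baker et al. (2004) form of the flashiness index (sum of absolute day-to-day changes divided by the sum of flows) and that runoff and precipitation are both expressed as depths in mm; the function names and example values are illustrative and are not taken from the study.

```python
def richards_baker_index(flows):
    """Flashiness: summed absolute day-to-day flow changes over summed flow.

    The denominator starts at the second value so that numerator and
    denominator cover the same days (an assumption about indexing).
    """
    changes = sum(abs(b - a) for a, b in zip(flows, flows[1:]))
    return changes / sum(flows[1:])


def runoff_ratio(runoff_mm, precip_mm):
    """Total runoff depth divided by total precipitation depth."""
    return sum(runoff_mm) / sum(precip_mm)


# Illustrative values only: a steady regime versus an oscillating one.
steady = richards_baker_index([3.0, 3.0, 3.0, 3.0])
flashy = richards_baker_index([2.0, 4.0, 2.0, 4.0])
ratio = runoff_ratio([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A steady hydrograph gives an index of zero, while larger and more frequent day-to-day changes push it upward, which is the behaviour the text associates with low-storage catchments.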
In addition to the BFI, we also represent catchment storage using the calibrated GR4J (Génie Rural à 4 paramètres Journalier) x1 and x3 parameters, the sum of which gives an overall indicator of storage capacity. Catchment area ranges from 5.46 to 2460 km². Although snow has been shown to be a major source of hydrological predictability (e.g. Greuell et al., 2019; Shukla et al., 2013; Wood and Lettenmaier, 2008), it is not known to make a substantial contribution to precipitation in Ireland. No catchments have a significant amount of snowfall, defined following Berghuijs et al. (2014) as a long-term mean fraction of precipitation falling as snow (Fs) > 0.15. Hence, we do not consider the role of snow in our analysis. A complete list of PCDs referred to in this study is given in Table 1. Catchment characteristics are summarised for Ireland and each of the NUTS III regions in Table 2 and for individual catchments in Table S1 in the Supplement.
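As an illustration of the snow fraction, the sketch below applies the 1 °C rule given in the note to Table 2: precipitation on days with a mean temperature below the threshold is counted as snowfall, and everything else as rainfall. The function name and the sample values are hypothetical.

```python
def snow_fraction(precip_mm, temp_c, threshold_c=1.0):
    """Fraction of total precipitation falling on days colder than threshold_c."""
    total = sum(precip_mm)
    if total == 0:
        return 0.0  # no precipitation at all, so no snow either
    snow = sum(p for p, t in zip(precip_mm, temp_c) if t < threshold_c)
    return snow / total


# Illustrative three-day record: only the first day is below 1 °C.
fs_example = snow_fraction([10.0, 5.0, 5.0], [0.5, 1.0, 3.0])
```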

Hydrological modelling
The GR4J (Génie Rural à 4 paramètres Journalier; Perrin et al., 2003) daily lumped conceptual rainfall-runoff model was applied. This model has a parsimonious structure consisting of four free parameters (x1-x4) that are calibrated against observed streamflow data, using precipitation and potential evaporation as inputs. The model structure can be described in terms of its water balance and routing operators (Santos et al., 2018). Water is partitioned between a production (soil moisture accounting) store and a routing store. The production store (capacity x1 mm) gains water from rainfall and loses water from evaporation and percolation. A total of 90 % of the water reaching the routing component (i.e. the sum of the percolation leak and the water bypassing the production store) is routed by a single unit hydrograph (time base x4 d) and a non-linear routing store (capacity x3 mm). The remaining 10 % is routed by a single unit hydrograph (time base 2·x4 d). A groundwater exchange function (rate x2 mm d⁻¹) operates on both routing channels and can be positive, negative, or zero.
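The production store and the 90 %/10 % split described above can be sketched for a single daily time step as follows. This is a simplified illustration based on the equations usually given for GR4J (Perrin et al., 2003), not the airGR implementation used in the study; the unit hydrographs, routing store, and groundwater exchange are omitted, and the function name and input values are hypothetical.

```python
import math


def production_store_step(store, precip, pet, x1):
    """One daily update of the GR4J production store (all depths in mm).

    Returns the new store level, the water reaching the routing
    component, and its 90 % and 10 % branches.
    """
    if precip >= pet:
        net_precip, net_evap = precip - pet, 0.0
        ws = math.tanh(net_precip / x1)
        # Part of net rainfall that fills the store
        to_store = x1 * (1 - (store / x1) ** 2) * ws / (1 + (store / x1) * ws)
        evap = 0.0
    else:
        net_precip, net_evap = 0.0, pet - precip
        ws = math.tanh(net_evap / x1)
        # Actual evaporation drawn from the store
        evap = store * (2 - store / x1) * ws / (1 + (1 - store / x1) * ws)
        to_store = 0.0
    store = store - evap + to_store
    # Percolation leak from the store
    perc = store * (1 - (1 + (4.0 / 9.0 * store / x1) ** 4) ** -0.25)
    store -= perc
    # Percolation plus rainfall bypassing the store goes to routing
    to_routing = perc + (net_precip - to_store)
    return store, to_routing, 0.9 * to_routing, 0.1 * to_routing


# Illustrative wet day: 10 mm of rain, 2 mm of potential evaporation.
s_new, pr, q9, q1 = production_store_step(100.0, 10.0, 2.0, x1=300.0)
```

On a wet day the store gain plus the water sent to routing equals the net rainfall, which is a quick check that the step conserves mass.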
Table 2. Summary statistics of eight catchment characteristics for Ireland and each NUTS III region. The median across n catchments is given with the 5th and 95th percentile ranges in parentheses. Mean annual runoff (Q), precipitation (P), and potential evaporation (PE) were calculated over calendar years 1992-2017. Fs* is the long-term (calendar years 1992-2017) mean fraction of precipitation falling as snow. * … (2014), where precipitation on days with an average temperature greater than or equal to 1 °C was considered entirely rainfall, and precipitation on days with an average temperature below 1 °C was considered entirely snowfall.

We chose GR4J on the basis of its reliability. The model has undergone extensive testing in several countries and has been shown to accurately simulate the hydrology of diverse catchment types, with comparatively good results (e.g. Coron et al., 2012; Perrin et al., 2003; Vaze et al., 2011). It has also been successfully applied to Irish conditions (Broderick et al., 2016, 2019), where it was found to perform well for a similar set of catchments to those used here, with respect to both temporal transition between contrasting climate periods and the reproduction of various hydrological signatures. Moreover, GR4J has been used previously for ESP (Harrigan et al., 2018; Pagano et al., 2010). We find the model uniquely suited to this application, as large ensembles of runs are required in long hindcast experiments. These simulations can be computationally intensive and time-consuming with more complex model structures, which do not necessarily lead to large improvements in skill. GR4J is implemented in R via the open-source airGR package (v1.4.3.65; Coron et al., 2017, 2020). Model parameters were estimated using memetic algorithms with local search chains (MA-LS-Chains; Bergmeir et al., 2016; Molina et al., 2010).
As ESP forecasts are made throughout the year under varying conditions, the nonparametric Kling-Gupta efficiency (KGE_NP; Appendix A) was chosen as the objective function to optimise, as it has been shown to capture multiple parts of the hydrograph well (Pool et al., 2018). Parameter estimation was carried out in R using the Rmalschains package (v0.2-6; Bergmeir et al., 2016, 2019) with the covariance matrix adaptation evolution strategy (Hansen and Ostermeier, 2001) as the local search method.
Model calibration was performed following the procedures recommended by Arsenault et al. (2018). A split-sample test (Klemeš, 1986) was first used to assess model robustness. The available record was divided into two periods of equal length, denoted here as period 1 (P1; 1 January 1993-2 July 2005) and period 2 (P2; 2 July 2005-31 December 2017). Separate parameter sets were created using data from P1 and P2 in turn for calibration and validation (i.e. parameters were calibrated on P1 and validated on P2 and vice versa). A third round of calibration was then performed using data from the complete period (CP; 1 January 1993-31 December 2017). This parameter set was carried forward for all subsequent modelling tasks. An approach of this nature is beneficial as it allows for evaluation of the model's ability to accurately simulate catchment processes over two independent periods whilst maximising the information content of the parameter set that is used to generate the ESP hindcast time series. In all cases, 1992 was used as a warm-up period to initialise model states, and the full series (1993-2017) was simulated before calibration and testing to preserve the internal dynamics and temporal stability of catchment stores. Model performance was evaluated using KGE_NP, the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970), and the percent bias (PBIAS; Gupta et al., 1999).
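Two of the evaluation metrics named above can be sketched in a few lines. The NSE follows its standard definition; for PBIAS, sign conventions differ between sources, and this sketch assumes that positive values indicate overestimation by the model, which may not match the convention used in the study. Function names and values are illustrative.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    error = sum((s - o) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - error / variance


def pbias(observed, simulated):
    """Percent bias; positive means overestimation (assumed convention)."""
    return 100.0 * sum(s - o for o, s in zip(observed, simulated)) / sum(observed)


# Illustrative series only.
obs = [1.0, 2.0, 3.0, 4.0]
perfect = nse(obs, obs)
mean_only = nse(obs, [2.5, 2.5, 2.5, 2.5])
ten_percent_high = pbias(obs, [1.1, 2.2, 3.3, 4.4])
```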

Historical ESP
Forecasts were initialised on the first day of each month following a 4-year model warm-up period to estimate initial hydrological conditions. The first usable forecast date after model warm-up is, therefore, 1 January 1965. For each forecast initialisation date, a 55-member ensemble (m) of streamflow hindcasts was generated by forcing GR4J with corresponding historic climate sequences (pairs of precipitation and potential evaporation) extracted from 1961-2016 out to a 12-month lead time. Following Harrigan et al. (2018), streamflow at a given lead time is expressed as the mean daily streamflow from the forecast initialisation date to n days or months ahead in time. For example, a January forecast with a lead time of 1 month is the mean daily streamflow from 1 to 31 January, and a January forecast with a lead time of 2 months is the mean daily streamflow from 1 January to 28 February. Average flow values are used, particularly at monthly timescales, because these are preferred by decision makers in many water sectors. Hindcast time series were therefore temporally aggregated to provide predictions of mean streamflow over lead times of 1 d to 12 months, resulting in 365 lead times per forecast (excluding leap days). In order to mimic operational conditions and prevent artificial skill inflation (see Robertson et al., 2016), we also employed leave-one-out cross-validation (L1OCV), whereby data from the forecast year were not used as input to the model, as these would not be available in a real-time forecasting setting. For example, a forecast initialised on 1 January 1965 will use historic climate sequences of 365 d in length (1 January to 31 December) extracted from 1961-2016 but not 1965. ESP skill is evaluated over 52 initialisation years (N; 1965-2016) and 12 initialisation months (i; January to December).
In total, 624 hindcasts were generated (N × i) with 34 320 individual ensemble members (N × i × m), each at 365 lead times across 46 catchments, resulting in a hindcast archive of more than 5.7 × 10⁸ streamflow values.
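The bookkeeping behind these numbers can be checked with a short sketch. It builds the leave-one-out list of climate years for a forecast year, aggregates a daily trace into mean flows from the initialisation date to each lead time, and reproduces the archive size. The function and variable names are hypothetical, and the hydrological model itself is not run here.

```python
FIRST_CLIMATE_YEAR, LAST_CLIMATE_YEAR = 1961, 2016


def esp_member_years(forecast_year):
    """Climate years used to force the model, leaving out the forecast year."""
    return [y for y in range(FIRST_CLIMATE_YEAR, LAST_CLIMATE_YEAR + 1)
            if y != forecast_year]


def mean_flow_by_lead_time(daily_flow):
    """Mean flow from day 1 up to each lead time n = 1 .. len(daily_flow)."""
    means, running_total = [], 0.0
    for n, flow in enumerate(daily_flow, start=1):
        running_total += flow
        means.append(running_total / n)
    return means


n_init_years = len(range(1965, 2017))        # N = 52
n_init_months = 12                           # i = 12
n_members = len(esp_member_years(1965))      # m = 55
n_hindcasts = n_init_years * n_init_months   # N x i
n_traces = n_hindcasts * n_members           # N x i x m
n_values = n_traces * 365 * 46               # lead times x catchments
```

Applied to a three-day trace of 2, 4, and 6 m³ s⁻¹, `mean_flow_by_lead_time` returns 2, 3, and 4, i.e. the 1-, 2-, and 3-day lead-time values.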

Conditioned ESP
To investigate the potential for improving winter streamflow predictability, we conditioned the ESP method using adjusted NAO hindcasts from the Met Office's Global Seasonal Forecasting System version 5 (GloSea5; MacLachlan et al., 2015). GloSea5 is built around the high-resolution Hadley Centre Global Environmental Model version 3 (HadGEM3), which integrates atmosphere, ocean, land, and sea-ice components. HadGEM3 has an atmospheric resolution of 0.83° longitude by 0.55° latitude, with 85 vertical levels and an ocean resolution of 0.25° in both latitude and longitude with 75 vertical levels. Although GloSea5 has been shown to skilfully predict the NAO, several studies have documented a signal-to-noise problem that limits the usefulness of forecasts to drive hydrological models, as ensemble mean signals in NAO forecasts are anomalously weak (Scaife et al., 2014; Scaife and Smith, 2018). Focusing on the dynamical signals can correct this by amplifying the ensemble mean, so adjusted hindcasts are used here following the method of Stringer et al. (2020). For each DJF period over 1993-2016, we combined GloSea5 hindcasts initialised on 1, 9, and 17 November, each with 17 ensemble members, to create a 51-member lagged ensemble of raw NAO predictions. After adjustment to remove the signal-to-noise discrepancy in the raw ensemble, predicted monthly NAO values were used to select 10 non-sequential DJF analogues (e.g. December 2007, January 1980, February 2011), where the mean observed seasonal NAO approximated the mean adjusted seasonal NAO hindcast. This resulted in a 510-member ensemble of analogue date sequences, which were then used to extract corresponding precipitation and potential evaporation for input to the ESP method.
The decision to construct analogue seasons with months from different years was made (a) to ensure that the range of possible values suggested by GloSea5 could be reproduced and (b) to avoid underestimating extreme seasonal NAO values, which would sample exclusively from DJF 2009-2010 if below −10 hPa (Stringer et al., 2020). Per hindcast member, 10 analogues were sampled to minimise non-NAO-related variability whilst keeping a consistent NAO signal across the sample. Conditioned ESP forecasts were only initialised on 1 December. A more detailed description of the adjustment procedure and the selection of the analogue date sequences is available in Stringer et al. (2020).
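The analogue idea can be illustrated with a deliberately simple sketch. It assumes, purely for illustration, that analogues are chosen by ranking every combination of a December, a January, and a February drawn from three different years by how closely their mean observed NAO matches a target value; the actual selection procedure is the one described in Stringer et al. (2020) and may differ. The data structure, function name, and NAO values are all hypothetical.

```python
from heapq import nsmallest
from itertools import product


def select_analogues(obs_nao, target_nao, n_analogues=10):
    """Pick (Dec year, Jan year, Feb year) triples closest to a target mean NAO.

    obs_nao maps month number (12, 1, 2) to a dict of {year: observed NAO}.
    Months must come from three different years (illustrative assumption).
    """
    candidates = []
    for dec, jan, feb in product(obs_nao[12], obs_nao[1], obs_nao[2]):
        if len({dec, jan, feb}) < 3:
            continue  # skip triples that reuse a year
        mean_nao = (obs_nao[12][dec] + obs_nao[1][jan] + obs_nao[2][feb]) / 3
        candidates.append((abs(mean_nao - target_nao), (dec, jan, feb)))
    return [years for _, years in nsmallest(n_analogues, candidates)]


# Hypothetical observed NAO values for three years.
toy_nao = {month: {2000: 1.0, 2001: -1.0, 2002: 0.0} for month in (12, 1, 2)}
toy_analogues = select_analogues(toy_nao, target_nao=0.0, n_analogues=2)

# Ten analogues for each of the 51 hindcast members gives the ensemble size.
ensemble_size = 51 * 10
```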

Hindcast overall performance
We quantify the overall skill of the ESP method using the continuous ranked probability score (CRPS; Hersbach, 2000) and corresponding skill score (CRPSS; Appendix B). The CRPS is a recommended and widely used evaluation metric for ensemble hydrological forecasting (Pappenberger et al., 2015b) that penalises biased and unsharp forecasts (Wilks, 2019). To minimise the impact of hydrological model uncertainty on hindcast quality, we use modelled observations derived from GR4J in place of direct streamflow data when evaluating skill. This is common practice (e.g. Arnal et al., 2018; Harrigan et al., 2018; Wood and Lettenmaier, 2008; Wood et al., 2016) as it isolates loss of skill to errors in initial conditions. Our reference forecast is constructed as the full-sample climatological distribution of modelled observations over 1965-2016 for the forecast period. This forecast was also created using L1OCV to account for streamflow persistence. In the case of the conditioned ESP, skill is calculated relative to both the probabilistic climatology benchmark and the full historical ESP ensemble. In all cases, the Ferro et al. (2008) ensemble size correction for CRPS is applied after cross-validation to account for differences in the number of ensemble members.
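For readers unfamiliar with the score, the following sketch computes an ensemble CRPS for a single forecast and the corresponding skill score. It uses the common kernel form of the CRPS and, as an assumption, a "fair" variant that divides the spread term by m(m − 1) rather than m² as one way of reducing sensitivity to ensemble size; the study's exact formulation is given in its Appendix B and in the easyVerification package, and may differ. Names and values are illustrative.

```python
def crps_ensemble(members, observation, fair=False):
    """Kernel-form CRPS for one ensemble forecast (lower is better)."""
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members)
    denominator = 2 * m * (m - 1) if fair else 2 * m * m
    return accuracy - spread / denominator


def crpss(crps_forecast, crps_reference):
    """Skill score: 1 is perfect, 0 equals the reference, below 0 is worse."""
    return 1.0 - crps_forecast / crps_reference


# Illustrative three-member forecast verified against an observation of 2.
standard = crps_ensemble([1.0, 2.0, 3.0], 2.0)
fair = crps_ensemble([1.0, 2.0, 3.0], 2.0, fair=True)
skill = crpss(0.5, 1.0)
```

In practice the forecast and reference CRPS values would each be averaged over all hindcast years before the skill score is formed.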

Hindcast reliability
Hindcast reliability was also assessed for low and high flows. Reliability refers to the overall agreement between the forecast probabilities and the observed frequencies. For each catchment, initialisation month, and lead time, the probability integral transform (PIT; Gneiting et al., 2007; Laio and Tamea, 2007) score was calculated for subsets of forecast-observation pairs falling within the lower and upper terciles of the corresponding modelled observations. The PIT score was derived from the PIT diagram following Renard et al. (2010). A forecast with a PIT score of 1 has perfect reliability, whereas a forecast with a PIT score of 0 has the worst reliability.
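A rough sketch of how such a score might be obtained is given below. It assumes that the PIT value of each forecast is the fraction of ensemble members at or below the observation, and that the score is one minus twice the mean absolute distance between the sorted PIT values and the corresponding uniform quantiles, which yields 1 for a perfectly uniform PIT diagram and 0 in the worst case. This is an assumed reading of the Renard et al. (2010) approach, not a confirmed reproduction of the study's calculation.

```python
def pit_values(ensembles, observations):
    """Fraction of members at or below the observation, one value per forecast."""
    return [sum(x <= obs for x in members) / len(members)
            for members, obs in zip(ensembles, observations)]


def pit_score(pits):
    """1 minus twice the mean distance of sorted PIT values from uniform quantiles."""
    n = len(pits)
    deviation = sum(abs(p - rank / (n + 1))
                    for rank, p in enumerate(sorted(pits), start=1)) / n
    return 1.0 - 2.0 * deviation


# Illustrative cases: perfectly uniform PIT values and a degenerate set.
uniform_case = pit_score([0.2, 0.4, 0.6, 0.8])
degenerate_case = pit_score([0.0, 0.0, 0.0, 0.0])
single_pit = pit_values([[1.0, 2.0, 3.0, 4.0]], [2.5])
```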

Hindcast discrimination
Hindcasts were further assessed in terms of their ability to discriminate between events and non-events using the receiver operator characteristic (ROC; Mason and Graham, 1999) score. The ROC score is defined as the area under the ROC curve, which plots the probability of detection against the probability of false detection for a given event and a range of probability levels (Demargne et al., 2010). A ROC score of 1 indicates that all ensemble members correctly predicted the event in all years, whereas a ROC score of 0.5 indicates a forecast with no discrimination. For each catchment, initialisation month, and lead time, the ROC score was calculated using the lower and upper terciles of the corresponding modelled observations as thresholds. Hence, the ROC score should be interpreted as a measure of how well ESP can forecast the occurrence of low- and high-flow events and can thus be regarded as an indicator of potential usefulness. We use a slightly stricter skill threshold of 0.6, so that forecasts are only considered skilful if they are better than guesswork. Both the CRPSS and ROC score were calculated in R using the easyVerification package (v0.4.4; MeteoSwiss, 2017).
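The ROC score for a low-flow event can be sketched as follows. Each ensemble is converted into an event probability (the fraction of members below the threshold), and the area under the ROC curve is then computed through its rank-based equivalent: the probability that a randomly chosen event year received a higher forecast probability than a randomly chosen non-event year, with ties counted as one half. This is a generic illustration rather than the easyVerification routine used in the study, and the names and values are hypothetical.

```python
def low_flow_probabilities(ensembles, threshold):
    """Forecast probability of flow below the threshold, one per forecast."""
    return [sum(x < threshold for x in members) / len(members)
            for members in ensembles]


def roc_area(probabilities, events):
    """Area under the ROC curve via pairwise comparison of events and non-events."""
    hits = [p for p, e in zip(probabilities, events) if e]
    misses = [p for p, e in zip(probabilities, events) if not e]
    if not hits or not misses:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if h > m else 0.5 if h == m else 0.0
               for h in hits for m in misses)
    return wins / (len(hits) * len(misses))


# Illustrative cases: perfect separation and no discrimination.
perfect_roc = roc_area([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
no_skill_roc = roc_area([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
example_probs = low_flow_probabilities([[1.0, 2.0, 3.0, 4.0]], threshold=2.5)
```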

Lead time
Mean ESP skill declines rapidly as a function of lead time, across all catchments and initialisation months (Fig. 3). Mean CRPSS falls from 0.8 at a short (1 d) lead time to 0.32 at an extended (2-week) lead time, and then to 0.18 and 0.09 at monthly (1- and 2-month), 0.05 at seasonal (3-month), and 0.01 at annual (12-month) lead times. However, the rate at which skill decays across catchments varies, with considerable differences around the mean shown by the 5th and 95th percentile bands. For example, for a 2-week lead time, CRPSS values within this band range between 0.1 and 0.58 and for a 1-month lead time between 0.03 and 0.4.

Initialisation month
ESP skill varies with forecast initialisation month and time of year, with the highest and lowest skill scores dependent on lead time (Fig. 4). For short to monthly lead times, skill scores are highest when forecasts are initialised in summer (JJA), with July the most skilful initialisation month on average, whereas skill tends to be lower during winter (DJF), with January and December exhibiting the lowest skill. At seasonal lead times, skill during autumn (SON) is comparable to that of summer, whilst the least skilful forecasts are produced in the spring months (MAM). As in Fig. 3, skill tends toward zero as lead time increases, regardless of initialisation month. Although this decline in performance is less severe for summer than for other seasons, by a 12-month lead time, nearly all forecasts are less skilful than climatology. Despite this, several catchments have above (below) average skill scores, with some performing notably better (worse) across different lead times and initialisation months. For example, ESP forecasts initialised in July with a 1-month lead have moderate skill on average (CRPSS = 0.34), but seven catchments have high skill (CRPSS ≥ 0.5), with a maximum CRPSS of 0.68 for the Erkina (ID 15005). Conversely, 14 catchments have low skill (CRPSS ≤ 0.25), with a minimum of −0.03 for the Newport (ID 32012).

NUTS III regions
Mean ESP skill across all initialisation months is shown in Fig. 5 for Ireland and each of the seven NUTS III regions. The Midlands, Mid-West, and East are the most skilful regions, followed by the South-East, West, and Border regions. The South-West is the least skilful region on average, with the lowest CRPSS values for all sampled lead times. Regional variations in skill are less pronounced at shorter lead times but become more apparent as lead time increases. For example, at a 1-month lead time, the Midlands (CRPSS = 0.26) is twice as skilful as the Border (CRPSS = 0.13) and South-West (CRPSS = 0.12). All regions are, on average, skilful out to a 1-month lead time, but the Midlands is the only region that is moderately skilful (CRPSS ≥ 0.25). The Midlands remains the most skilful region beyond 1 month, though the level of skill is generally quite low for all regions by this point. The regional variations observed in Fig. 5 are partly explained by the relationship between catchment characteristics and ESP skill (Sect. 3.4) as the pattern is broadly consistent with differences in catchment storage capacity and wetness. For instance, the Midlands has a high median BFI of 0.71, a low median RBI of 0.13, and a low median SAAR of 939 mm, whereas the South-West has a low median BFI of 0.44, a high median RBI of 0.4, and a high median SAAR of 1407 mm. Differences in regional hydroclimate properties therefore contribute to differences in regional skill as forecasts perform better in the baseflow-dominated catchments of the Midlands than the flashy, wetter catchments of the South-West.

Catchment scale
Notable subregional heterogeneity emerges when examining skill scores for individual forecasts at the catchment scale (Fig. 6). This heterogeneity is more noticeable at monthly to seasonal lead times, where skilful forecasts are possible for several catchments at different times of the year, even if average skill for the region as a whole tends to be low. For example, whilst the South-West is the least skilful region at a 1-month lead time, with an average CRPSS of 0.12, forecasts with above-average skill are possible in several catchments in the region in June, such as the Blackwater (ID 18003; CRPSS = 0.25) and the Laune (ID 22035; CRPSS = 0.23).

… nonparametric Spearman rank correlation coefficient (ρ). ESP skill is closely linked with catchment storage properties and responsiveness. There are strong positive correlations between modelled storage capacity (x1 + x3) and BFI (ρ = 0.79) and between ESP skill and BFI (ρ = 0.94). There is also a strong positive correlation between ESP skill and modelled storage capacity (ρ = 0.75). Conversely, there is a strong negative correlation between ESP skill and the RBI (ρ = −0.82) and a moderate negative correlation between ESP skill and the RR (ρ = −0.63). All of these correlations are statistically significant (p ≤ 0.05). In general, ESP skill tends to be higher for slower responding catchments with greater storage capacity and lower for faster responding, flashy catchments with poor infiltration. ESP skill is also positively correlated with catchment area (ρ = 0.5) and main-stream length (ρ = 0.46), indicating a tendency for the method to perform better in larger catchments with longer streams.
Negative correlations exist between ESP skill and PCDs related to catchment wetness (SAAR, FLATWET, and PEAT), though these PCDs also exhibit negative correlations with BFI and positive correlations with RBI and RR, highlighting that wetter catchments are more likely to be those with lower storage and flashier regimes in which ESP has already been shown to perform poorly. Poor skill in these catchments is likely a combination of high precipitation and low permeability, which leads to more variable hydrological conditions as rainfall events propagate to streamflow quickly. Finally, there are moderate negative correlations between ESP skill and S1085 (ρ = −0.67) and TAYSLO (ρ = −0.59), indicating that forecasts are less skilful in catchments with steeper gradients. Although these results are based on the 1-month CRPSS averaged across all initialisation months, similar results are observed for a variety of different months and lead times (not shown).
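The rank correlations reported above can be illustrated with a minimal sketch of the Spearman coefficient, computed as the Pearson correlation of average ranks. The function names and the input values are hypothetical placeholders chosen for illustration; they are not the study's data or code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder values, NOT data from the study.
bfi = [0.30, 0.45, 0.55, 0.70, 0.80]
crpss = [0.08, 0.15, 0.12, 0.24, 0.30]
print(round(spearman_rho(bfi, crpss), 2))
```

Because only ranks enter the calculation, the coefficient is insensitive to monotonic transformations of either variable, which suits descriptors measured on very different scales.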

Reliability of low- and high-flow forecasts
ESP is capable of producing reliable forecasts of both low (lower tercile) and high (upper tercile) flows (Fig. 8). However, the level of reliability is dependent on both lead time and initialisation month. Reliability decreases as lead time increases, though the rate at which this occurs is not uniform across all initialisation months. Furthermore, there is considerable inter-catchment variability for both low- and high-flow forecasts. This latter point is perhaps most pronounced at short to extended lead times but is also evident at longer leads (e.g. 1- and 2-month forecasts initialised in June and July), where some catchments return much higher than average PIT scores. Reliability tends to be highest when forecasts are initialised in summer and lowest when initialised in winter, with the smallest and largest reductions in PIT scores also evident for these seasons as lead time increases. Across all lead times and initialisation months, reliability is, on average, higher for low-flow forecasts than high-flow forecasts. Although the PIT score decays with lead time, unlike the CRPSS it does not tend toward zero and instead has a lower bound of around 0.3. Hence, somewhat reliable forecasts of both low and high flows are still possible at annual lead times even when overall skill (CRPSS) is poor.

Discrimination between events and non-events
In general, ESP is skilful at forecasting the occurrence of both low-flow (lower tercile) and high-flow (upper tercile) events up to 1 month ahead in the majority of catchments and for all initialisation months (Fig. 9). Discrimination for both event types is also possible at lead times of 2 and 3 months, though to a lesser extent. These results highlight that ESP still has utility at longer lead times, even when overall performance as measured by the CRPSS is poor. However, this utility seldom extends beyond 3 months, except for specific catchments and initialisation dates, with little or no skill at lead times of 6 and 12 months across the majority of the catchment sample. Some seasonality in ROC skill is apparent, particularly at monthly lead times, where ESP can more skilfully discriminate between events and non-events in summer than other seasons. Discrimination is more skilful for low-flow events than high-flow events.
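As a rough illustration of how discrimination of tercile events can be scored, the sketch below assumes the ROC score is the area under the ROC curve, computed through its rank-based (Mann-Whitney) form, and that the forecast probability of a low-flow event is the fraction of ensemble members below the lower-tercile threshold. Both assumptions, and all function names, are illustrative rather than a description of the exact procedure used in the study.

```python
def event_probability(members, threshold):
    """Fraction of ensemble members below the threshold (low-flow event)."""
    return sum(x < threshold for x in members) / len(members)


def roc_area(probabilities, occurred):
    """Area under the ROC curve via pairwise comparison of forecasts.

    Counts how often an event received a higher forecast probability than
    a non-event; ties count as half. A value of 0.5 means no discrimination.
    """
    events = [p for p, o in zip(probabilities, occurred) if o]
    non_events = [p for p, o in zip(probabilities, occurred) if not o]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return wins / (len(events) * len(non_events))


# Synthetic placeholder forecasts, NOT data from the study.
probs = [0.9, 0.8, 0.2, 0.1]
flags = [True, True, False, False]
print(roc_area(probs, flags))
```

An area above a chosen threshold (0.6 is the stricter threshold marked in Fig. 9) would then be read as skilful discrimination.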

Improvements in winter skill
The overall skill (CRPSS) of NAO-conditioned ESP is compared with that of historical ESP in Fig. 10. Whilst historical ESP is skilful in the majority of catchments at a 1-month lead time, there is a dramatic reduction in both the magnitude of skill and the number of catchments for which skilful forecasts can be made at 2- and 3-month lead times. NAO-conditioned ESP outperforms historical ESP relative to the climatology benchmark in all but one catchment at a 1-month lead time, though these improvements are generally modest, with a median (5th and 95th percentile) difference in CRPSS of 0.04 (0.01, 0.07). At a lead time of 2 months, NAO-conditioned ESP remains skilful against climatology in 98 % of catchments, compared to historical ESP which is only skilful in 37 % of catchments. The value of the NAO-conditioned ESP is more evident at a 3-month lead time, where skilful forecasts are still possible for several catchments in the Border and western regions, when historical ESP exhibits little or no skill across the majority of the sample. Over the three lead times examined here, the greatest improvements are found for wet, fast-responding catchments with low baseflow contribution. For example, two of the best-performing catchments for NAO-conditioned ESP are the Owenea (ID 38001) and the Fern (ID 39009). The Owenea has a BFI of 0.27, the lowest in the sample, with high SAAR (1753 mm), RR (0.82), and RBI (0.58) values. The Fern has a below-average BFI of 0.47, with similarly high SAAR and RR values of 1570 mm and 0.79, respectively, although it is not as flashy (RBI = 0.18). NAO-conditioned forecasts generally perform the worst in slowly responding catchments with high storage capacity. At a lead time of 3 months, negative skill is observed in several catchments in the East and South-East, though these values can still be defined within the bounds of what Bennett et al. (2017) refer to as "neutral skill" (±0.05 CRPSS) and hence do not represent a significant departure from the performance of historical ESP.
These differences in performance can be explained by the relative contribution of initial conditions and meteorological forcing to ESP skill. In the flashy catchments where NAO-conditioned ESP performs well, meteorological conditions are the dominant control on skill as rainfall events propagate to streamflow at a faster rate, and memory of initial conditions is lost quickly. It is also worth noting that in these catchments skill generally increases with lead time. This is likely due to the fact that the underlying NAO signal is not as strong over shorter averaging periods due to the noise of the individual weather systems. Moreover, only the seasonal mean NAO is rescaled to account for the signal-to-noise problem when adjusting hindcasts, so skill is only present at the longer 3-month lead time. For example, at a 3-month lead time, NAO-conditioned ESP improves forecast skill by ∼ 18 % over historical ESP in both the Owenea and Fern, whereas gains of 7 % and 12 % are observed for 1- and 2-month lead times, respectively. Conversely, catchments where negative skill is observed have high baseflow contribution and long recession times. Hence, hydrological response is controlled predominately by the slow release of water from reservoirs, and initial conditions act as the primary source of skill. The combination of initial conditions and subsampled climate information grants modest improvements in skill in these catchments up to a 1-month lead time. However, at longer lead times, improved atmospheric representation alone cannot compensate for divergences from the initial state. Skill deteriorates as a result, eventually becoming negative.
In addition to the CRPSS, both the PIT score and the ROC score were calculated for NAO-conditioned ESP. Figure 11 shows the difference between PIT scores calculated for historical ESP and NAO-conditioned ESP at lead times of 1, 2, and 3 months. Conditioning ESP with the NAO increases the reliability of low-flow forecasts in all catchments at a 1-month lead time. Some catchments experience a reduction in low-flow reliability at a 2-month lead time, whereas at a 3-month lead time, low-flow reliability is observed to increase in almost all catchments. High-flow reliability increases in some catchments at a 1-month lead time but then decreases in almost all catchments at lead times of 2 and 3 months. At these longer lead times, increases in high-flow reliability tend to be restricted to flashy catchments (e.g. Owenea), where NAO-conditioned ESP has already been shown to perform well in terms of CRPSS.
ROC scores for individual catchments and the full range of lead times are presented in Fig. 12. On average, NAO-conditioned ESP extends the lead time over which discrimination between events and non-events is possible by 141 % for low flows (37 to 89 d) and 170 % for high flows (33 to 89 d). These are considerable improvements over historical ESP, which failed to meet the skill threshold in most catchments at longer lead times. For example, skilful discrimination of low-flow events is possible in 78 % of catchments at a 3-month lead time when using NAO-conditioned ESP compared to only 11 % of catchments when using historical ESP. This makes NAO-conditioned ESP particularly effective at forecasting dry winters, which can be critical for water resources management. It is worth noting that in many catchments NAO-conditioned ESP can "lose" skill before later regaining it, with the ROC score falling only marginally below the skill threshold. Although this is also observed for historical ESP, it is less frequent.
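The quoted percentage extensions are consistent with relative increases in the mean lead time over which discrimination is possible. Assuming that reading, the arithmetic can be checked as

$$\frac{89 - 37}{37} \approx 1.41 \;(141\,\%), \qquad \frac{89 - 33}{33} \approx 1.70 \;(170\,\%).$$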
Changes in reliability are generally consistent with improvements in skill (CRPSS) and discrimination (ROC). Improved low-flow reliability allows NAO-conditioned ESP to better distinguish between low-flow events and non-events. The reductions in low-flow reliability in some catchments at a 2-month lead time are also consistent with NAO-conditioned ESP "losing" ROC skill before later regaining it (Fig. 12). Increases in high-flow reliability at a 3-month lead time in flashy catchments correspond with the greatest increases in CRPSS from NAO-conditioned ESP. In these catchments, where streamflow variability is greater and the NAO is most influential, improved reliability and sharpness lead to better overall skill at longer lead times.

When is ESP skilful?
For short lead times (1-3 d), ESP forecasts are on average highly skilful (CRPSS ≥ 0.5) and for extended lead times (1-2 weeks) moderately skilful (CRPSS ≥ 0.25). Mean ESP skill decays rapidly with lead time. Hence, forecast skill for monthly, seasonal, and annual lead times is on average much lower. This is because ESP relies on the long-term "memory" of the hydrological system. The cumulative effect of distinct meteorological forcing causes a divergence from the initial state that grows with time. Thus, ESP suffers at longer lead times as there is little or no persistence of initial hydrological conditions. Over longer periods, we find that ESP is most skilful out to a month ahead (CRPSS = 0.18) but that some predictability (CRPSS > 0.05) is possible up to 3 months in advance. This rapid decline in forecast skill is consistent with findings from several other benchmarking experiments, including Harrigan et al. (2018) and Girons Lopez et al. (2021), who noted a similar deterioration in ESP skill in the UK and Sweden, respectively. Pechlivanidis et al. (2020) also reported a decline in seasonal streamflow forecasting skill with increasing lead time across Europe. Persistence forecasts, which also rely on hydrological memory as their main source of skill, have shown comparable results. For example, both Svensson (2016) and Foran Quinn et al. (2021) noted a reduction in the number of usable persistence forecasts in the UK and Ireland, respectively, when moving from a 1-month forecast horizon to a 3-month forecast horizon.
Figure 9. As in Fig. 8 but for the ROC score. The red line denotes the stricter skill threshold of 0.6.
ESP skill is also highly dependent on initialisation month. On average, at short to extended lead times (1 d to 2 weeks), ESP is most skilful when initialised in summer and least skilful when initialised in winter. This is again consistent with previous research, with higher predictability during dry seasons for forecasting methods that rely on hydrological memory reported for the UK (Harrigan et al., 2018), Switzerland (Staudinger and Seibert, 2014), China (Yang et al., 2014), and parts of the Amazon Basin (Paiva et al., 2012). This likely stems from a reduction in the direct contribution of precipitation to streamflow (Li et al., 2009; Mo and Lettenmaier, 2014; Wood and Lettenmaier, 2008), which reduces variability and allows initial conditions to persist for longer. In winter, lower evaporation rates lead to more effective rainfall, which "disrupts" the initial state and limits the skill of ESP forecasts. This is particularly noticeable in flashy catchments with a low baseflow contribution, where the hydrological response is driven predominately by rainfall. Under such conditions, rainfall events propagate to streamflow at a much faster rate, and memory of initial conditions is lost quickly. At longer lead times, ESP is least skilful when initialised in spring. Both Harrigan et al. (2018) and Svensson (2016) also found lower longer range skill for forecasts initialised in spring in the UK. The former attributed this to the transition from wet conditions with small soil moisture deficits to dry conditions with large soil moisture deficits. Given that Ireland shares a similar precipitation regime to the UK and that ESP skill is negatively impacted by high rainfall variability across the forecast period (Harrigan et al., 2018), this is also a plausible explanation for the results observed here.
Figure 10. CRPSS values for historical ESP (a, d, g), NAO-conditioned ESP (b, e, h), and the improvement made by NAO-conditioned ESP over historical ESP (c, f, i), at lead times of 1, 2, and 3 months (rows). Catchments with negative skill (CRPSS < 0) are greyed out.

Where is ESP skilful?
ESP is most skilful in the Midlands and least skilful in the Border and South-West. The Midlands is a lowland karst region, which is underlain by permeable Carboniferous limestone, characterised by several locally and regionally important aquifers. Given that soils in this region are also well drained, catchments located here have higher storage capacity and hence greater skill due to their long memory. Both the Border and the West are poorly drained regions, with the former characterised by unproductive bedrock aquifers. This partly explains the low storage capacity of catchments in these regions, which have quick hydrological response times and poor persistence of initial conditions, resulting in lower ESP skill. Similar patterns were noted for persistence forecasts (Foran Quinn et al., 2021).

Why is ESP skilful?
ESP skill displays a strong relationship with modelled catchment storage capacity and catchment BFI values, with higher skill scores returned for catchments with greater storage. We conclude that storage capacity is primarily responsible for modulating ESP skill. High BFI catchments have flow regimes dominated by slowly released groundwater (Chiverton et al., 2015) and are characterised by longer response times and lower streamflow variability (Sear et al., 1999; Broderick et al., 2016). This is conducive to greater persistence of initial conditions, with water storage in the soil creating a memory effect whereby anomalous conditions can take weeks or months to wane (Ghannam et al., 2014; Harrigan et al., 2018; Li et al., 2009). The role played by storage capacity is perhaps best illustrated by the fact that ESP skill decays at a much slower rate in catchments with high BFI, especially during summer when streamflow is derived primarily from stored sources. For example, ESP is moderately skilful (CRPSS ≥ 0.25) out to a 2-month lead time for the Inny (ID 26021; BFI = 0.82) when initialised in July but shows adequate (non-neutral) performance relative to climatology (CRPSS > 0.05) up to 4 months ahead. Moreover, whilst ESP tends to perform worse outside of summer months, catchments with relatively high SAAR but also high BFI yield above-average skill scores in winter, spring, and autumn. In the Slaney (ID 12001; BFI = 0.67; SAAR = 1167 mm), skilful forecasts are possible up to almost a year ahead in January and February and up to 3-6 months ahead in spring and autumn. This likely stems from the delayed release of precipitation from groundwater stores (van Dijk et al., 2013), which can lead to temporal streamflow dependence for up to a season ahead (Chiverton et al., 2015).

Potential for operationalising ESP in Ireland
Our benchmarking results establish that ESP, in its traditional formulation, is skilful in a number of different scenarios, sometimes up to several months in advance. We recommend that ESP be used operationally in Ireland, similar to the HOUK (Prudhomme et al., 2017). Skilful streamflow forecasts at short to extended lead times could prove beneficial for water resources management, particularly in areas such as Dublin where water supply systems have been operating close to capacity and face challenges of supply during dry periods. Given that the predictability of summer rainfall is notoriously difficult over northern Europe (Weisheimer and Palmer, 2014), the true utility of ESP may lie in its ability to leverage initial hydrological conditions, particularly in high-storage catchments, to skilfully predict streamflow up to a season ahead during dry months. Operationally, skill could be extended further by initialising forecasts more than once a month (e.g. Girons Lopez et al., 2021). As ESP has also been shown to accurately forecast the occurrence of low- and high-flow events in many catchments up to at least a month in advance, it may also have practical relevance for decision makers where it can act as an aid in the management of hydrologic extremes.
In the absence of skilful atmospheric forecasts or improved hydrological process representation, historical ESP provides a lower limit of streamflow forecasting skill (Harrigan et al., 2018). However, we show that it is possible to improve ESP skill during winter by conditioning the method on the NAO. Improvements in forecast skill (CRPSS) of 7 %-18 % over lead times of 1 to 3 months are possible in catchments where meteorological conditions are the dominant control on skill. Notwithstanding differences in study design, these improvements are comparable to those of Beckers et al. (2016) using an ENSO-conditioned ESP. We do acknowledge, however, that these improvements are limited to specific catchments and are on top of a low initial skill base. In addition to improvements in overall forecast performance, NAO-conditioned ESP increases low-flow reliability and extends the lead time over which skilful discrimination of both low- and high-flow events is possible. As winter is the most important season for groundwater recharge, during which reservoirs fill up to be used over the summer, the ability to more accurately forecast dry winters in this way is extremely valuable for water managers, allowing them to anticipate the water situation beyond what is provided by the forecast alone. Hence, the greatest benefit of NAO-conditioned ESP may be found in its improved low-flow reliability and discrimination, rather than its overall performance.

Potential for future work
ESP skill is to a large extent dependent on the ability of hydrological models to accurately simulate catchment processes. It follows that further advances in ESP will likely require better representation of initial hydrological conditions and their evolution over time. Model structural and parameter uncertainty are therefore important considerations. Multi-parameter ensembles, data assimilation (e.g. Franz et al., 2014), state updating (e.g. Gibbs et al., 2018), and the use of satellite data and remote sensing are potential ways through which estimates of initial conditions could be improved. It may also be possible to improve predictability by choosing model structures that are more capable of representing key flow pathways (i.e. groundwater, quick flow, etc.) and hence generate more accurate initial states. In this paper, GR4J is used as a parsimonious conceptual model to determine when and where skill is possible. Ongoing work will explore whether additional model complexity adds forecast skill at different initialisation and lead times through the use of models with different structures and parameter dimensionality. In an operational setting, this could be extended to include more spatially discrete physically based hydrological models that may better account for initial conditions. The additional benefit derived from using ensembles of models for maximising skill persistence could also be assessed for different lead times and initialisation months. This is a promising avenue, as model diversity has been shown to enhance forecast skill in ensemble experiments (Sharma et al., 2019).
We conducted a basic analysis of the relationship between forecast skill and catchment characteristics, using a small selection of descriptors. A more comprehensive investigation of this relationship could be carried out, employing clustering techniques (e.g. Girons Lopez et al., 2021; Pechlivanidis et al., 2020) and a wider range of hydrological signatures. As PCDs are available for a larger sample of 215 catchments, skill could be inferred in areas where modelling is not feasible (e.g. due to sparse or poor-quality observational data) based on a priori knowledge of local hydrological conditions. This could also be achieved by regionalising model parameters.
Finally, our use of NAO-conditioned ESP as described in this paper is only one way in which seasonal climate information can be incorporated into ESP forecasts. Whilst we use precipitation analogues derived from GloSea5 hindcasts to generate a new ensemble, an alternative approach is to postprocess the historical ESP ensemble, similar to Beckers et al. (2016) or Yuan and Zhu (2018). This would involve subselecting ensemble members by comparing the NAO index at the time of forecast with the NAO index on the same day of a year in the historical record (e.g. using correlation analysis or a k-nearest-neighbours approach). A different approach could be to condition model parameter sets rather than model inputs. It may also be possible to improve skill outside of winter, as the winter NAO has shown lagged correlations with summer rainfall over Ireland (Murphy et al., 2013) and river flows in the UK (Wilby, 2001). Seasonal forecasts of precipitation and temperature could also be incorporated directly into the process, in so-called climate-model based SHF (Yuan et al., 2015).
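The alternative post-processing approach mentioned above, sub-selecting historical ensemble members whose NAO index is closest to the forecast NAO, could be sketched with a simple k-nearest-neighbours rule. This is an illustration of that suggestion only, not the method used in this paper; the function name, the dictionary layout, and the NAO values are hypothetical.

```python
def select_analogue_years(nao_forecast, historical_nao, k):
    """Return the k historical years whose NAO index is nearest the forecast.

    historical_nao maps year -> NAO index (hypothetical layout). Ties in
    distance are broken by year so the result is deterministic.
    """
    ranked = sorted(
        historical_nao,
        key=lambda year: (abs(historical_nao[year] - nao_forecast), year),
    )
    return sorted(ranked[:k])


# Synthetic placeholder indices, NOT values from the study.
historical_nao = {1990: 1.2, 1991: -0.5, 1992: 0.3, 1993: 2.0, 1994: -1.4}
analogues = select_analogue_years(1.0, historical_nao, k=2)
print(analogues)  # → [1990, 1992]
```

Only the ESP traces driven by meteorology from the selected years would then be retained, so the choice of k trades ensemble size against how closely the retained members match the forecast NAO state.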

Conclusions
Ensemble streamflow prediction is a popular approach to seasonal hydrological forecasting that is still used some 40 years after its initial development. Here, we benchmarked ESP skill for a diverse sample of Irish catchments and conclude that it is skilful against streamflow climatology but that the level of skill is strongly dependent on lead time, initialisation month, and individual catchment location and storage properties. In summary, we find the following:
- ESP skill (CRPSS) decays rapidly as a function of lead time, but the rate of decay is much slower in catchments with high storage capacity, where initial conditions alone can provide skill up to several months in advance.
- For short (1-3 d), extended (1-2 weeks), and monthly lead times, ESP is most skilful when initialised during summer and least skilful when initialised during winter. At seasonal and annual lead times, ESP is least skilful when initialised during spring and about as skilful in autumn as it is in summer.
- ESP is most skilful in the Midlands, Mid-West, and East regions of Ireland, where slower responding catchments and the underlying lithology favour high storage capacity and longer hydrological memory.
- ESP is capable of accurately discriminating between events and non-events for both low and high flows up to a month ahead in the majority of catchments. At lead times longer than 1 month, the number of catchments for which discrimination is possible depends on initialisation month.
- NAO-conditioned ESP improves winter skill (CRPSS) in fast-responding, low-storage catchments in the Border and West regions, where the influence of meteorological forcing outweighs that from initial conditions. These improvements are more substantial over longer lead times of 2 and 3 months when the underlying NAO signal is less obscured by noise.
- NAO-conditioned ESP improves reliability of low-flow forecasts in nearly all catchments and reduces reliability of high-flow forecasts, except for specific runoff-dominated catchments.
- NAO-conditioned ESP extends the lead times over which skilful discrimination of low- and high-flow events is possible. This is particularly beneficial for forecasting dry winters, which can provide forewarning to water managers about potentially problematic conditions.
We have demonstrated the skill of historical ESP for Ireland and highlighted its utility during the dry season, when demand for outlooks may be greatest. We have also shown how to improve ESP during winter, the season most critical for water managers. In light of the potential benefits for decision makers, we recommend that ESP and conditioned ESP are operationalised, as they are serious contenders for producing skilful seasonal streamflow forecasts in Ireland.

Appendix A: Non-parametric Kling-Gupta efficiency
The non-parametric Kling-Gupta efficiency ($\mathrm{KGE}_{\mathrm{NP}}$; Pool et al., 2018) is a modification of the traditional KGE (Gupta et al., 2009) that uses the non-parametric Spearman rank correlation coefficient and normalised flow-duration curves to represent discharge dynamics and discharge variability, respectively. It is defined as
$$\mathrm{KGE}_{\mathrm{NP}} = 1 - \sqrt{(\beta - 1)^2 + (\alpha_{\mathrm{NP}} - 1)^2 + (\rho - 1)^2},$$
$$\beta = \frac{\mu_s}{\mu_o},$$
$$\alpha_{\mathrm{NP}} = 1 - \frac{1}{2} \sum_{k=1}^{n} \left| \frac{Q_s(I(k))}{n \mu_s} - \frac{Q_o(J(k))}{n \mu_o} \right|,$$
where $\rho$ is the non-parametric Spearman rank correlation coefficient between the simulated and observed time series, $\mu_s$ and $\mu_o$ are the mean of the simulated and observed time series, respectively, and $I(k)$ and $J(k)$ are the time steps when the $k$th largest flow occurs within the simulated and observed time series, respectively. $Q_s$ and $Q_o$ denote simulated and observed discharge, and $n$ is the number of time steps. $\beta$ represents discharge volume. $\alpha_{\mathrm{NP}}$ is calculated from the absolute difference between the normalised flow-duration curves.
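A minimal sketch of the non-parametric Kling-Gupta efficiency of Pool et al. (2018) is given below, assuming equal-length series with no missing values. The function names are illustrative and the implementation is not taken from the study's code; it combines the mean ratio, the difference between flow-duration curves normalised by total volume, and the Spearman rank correlation.

```python
import math


def _average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def _pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kge_np(sim, obs):
    """Non-parametric KGE from simulated and observed discharge series."""
    n = len(sim)
    mu_s, mu_o = sum(sim) / n, sum(obs) / n
    beta = mu_s / mu_o  # discharge volume (bias) term
    # Flow-duration curves normalised by total volume; sorting pairs the
    # k-th largest simulated flow with the k-th largest observed flow.
    fdc_s = sorted(q / (n * mu_s) for q in sim)
    fdc_o = sorted(q / (n * mu_o) for q in obs)
    alpha_np = 1 - 0.5 * sum(abs(s - o) for s, o in zip(fdc_s, fdc_o))
    rho = _pearson(_average_ranks(sim), _average_ranks(obs))  # Spearman
    return 1 - math.sqrt((beta - 1) ** 2 + (alpha_np - 1) ** 2 + (rho - 1) ** 2)


# Synthetic placeholder discharge values, NOT data from the study.
observed = [1.0, 3.0, 2.0, 5.0, 4.0]
print(kge_np(observed, observed))
```

A simulation that reproduces the observations exactly scores 1, while one that doubles every flow keeps the rank and variability terms perfect and is penalised only through the volume term.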

Appendix B: Continuous ranked probability skill score
The continuous ranked probability score (CRPS; Hersbach, 2000) measures the integrated squared difference between the forecast cumulative distribution function (CDF) and the empirical CDF of the observation. For a continuous random variable $X$ (e.g. streamflow) with probability density function $f_X$, the CRPS between the forecast CDF, denoted $F_X$, and the empirical CDF of the observation $y$, denoted $F_y$, is defined as
$$\mathrm{CRPS}(F_X, y) = \int_{-\infty}^{\infty} \left[ F_X(x) - F_y(x) \right]^2 \, \mathrm{d}x, \qquad F_y(x) = H(x - y),$$
where $H$ is the Heaviside step function: $H(x) = 1$ for $x \geq 0$ and $H(x) = 0$ for $x < 0$. The continuous ranked probability skill score (CRPSS) is then given by
$$\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}_{\mathrm{Sys}}}{\mathrm{CRPS}_{\mathrm{Ref}}},$$
where $\mathrm{CRPS}_{\mathrm{Sys}}$ is the average CRPS of the forecasting system for a set of forecast-observation pairs, and $\mathrm{CRPS}_{\mathrm{Ref}}$ is the equivalent for the reference forecast. The CRPSS ranges from $-\infty$ to 1, with positive (negative) values indicating better (worse) performance than the reference forecast.
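For an ensemble forecast, the CRPS integral can be evaluated exactly when the forecast CDF is taken to be the empirical CDF of the members, using the identity CRPS = E|X − y| − 0.5 E|X − X′|. The sketch below assumes that empirical-CDF treatment; the function names are illustrative and the code is not taken from the study.

```python
def crps_ensemble(members, observation):
    """CRPS of an ensemble whose forecast CDF is its empirical CDF."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Half the mean absolute difference between all pairs of members.
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def crpss(system_ensembles, reference_ensembles, observations):
    """Skill score from average CRPS of a system and a reference forecast."""
    n = len(observations)
    crps_sys = sum(
        crps_ensemble(ens, y) for ens, y in zip(system_ensembles, observations)
    ) / n
    crps_ref = sum(
        crps_ensemble(ens, y) for ens, y in zip(reference_ensembles, observations)
    ) / n
    return 1 - crps_sys / crps_ref


# Two members at 1 and 3 with an observation of 2: the squared CDF
# difference is 0.25 over [1, 3), so the integral equals 0.5.
print(crps_ensemble([1.0, 3.0], 2.0))  # → 0.5
```

With a probabilistic climatology as the reference ensemble, a positive return value from `crpss` corresponds to the system outperforming climatology.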
Data availability. Streamflow data are available from the Office of Public Works (https://waterlevel.ie/hydro-data/, Office of Public Works, 2021) and the Environmental Protection Agency (https://epawebapp.epa.ie/hydronet/, Environmental Protection Agency, 2021). Climate data and the ESP hindcast archive are available upon request from the authors. Table S1 includes metadata for all 46 catchments as well as model parameter values and data used to generate Table 2 and Figs. 2 and 7.
Author contributions. SD designed the study with input from SH and CM. JK, AAS, and NS contributed the GloSea5 data used to condition the ESP method. CB, DFQ, and SG helped collate catchment data. SD carried out the modelling, analysed the results, and produced the figures. SD interpreted the results with input from CM, SH, RLW, CP, and TM. SD prepared the manuscript with contributions and reviews from all co-authors.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.