Streamflow forecast sensitivity to air temperature forecast calibration for 139 Norwegian catchments

The Norwegian flood forecasting system is based on a flood forecasting model running on catchments located all across Norway. The system relies on deterministic meteorological forecasts and uses an auto-regressive post-processing algorithm to achieve probabilistic streamflow forecasts and thus a measure of uncertainty. An alternative approach is to use meteorological and hydrological ensemble forecasts to quantify the uncertainty in forecasted streamflow. In catchments with seasonal snow cover, snowmelt is an important flood generating process. Hence, high-quality air temperature data are important for accurate forecasting of streamflow. In this study, the sensitivity of hydrological ensemble forecasts to the calibration of temperature ensemble forecasts was investigated. Ensemble forecasts of temperature from ECMWF covering a period of nearly three years, from 01.03.2013 to 31.12.2015, were used. To improve skill and reduce bias of the temperature ensembles, the Norwegian Meteorological Institute provided parameters for ensemble calibration. The calibration parameters are derived using a standard quantile mapping method. Estimated observed daily temperature and precipitation were obtained from the SeNorge dataset, which is station data interpolated to a 1×1 km grid covering all of Norway. The operational flood forecasting model, a lumped HBV model distributed on 10 elevation zones, was used to calculate streamflow. The results show that temperature ensemble calibration influenced both temperature and streamflow forecast skill, but differently depending on season and region. We found a close to 1:1 relationship between temperature and streamflow skill change for the spring season, whereas for autumn and winter large temperature skill improvements were not reflected in the streamflow forecasts to the same degree.
This can be explained by streamflow being less influenced by sub-zero temperature improvements, which accounted for the biggest temperature biases and corrections during autumn and winter. The skill differs between regions, which could partly be related to elevation differences and catchment area. It is evident, however, that temperature forecasts are important for streamflow forecasts in climates with seasonal snow cover. This indicates that further studies are needed, specifically addressing catchment-specific calibration methods, for improved air temperature forecasts.

Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2018-373. Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 31 July 2018. © Author(s) 2018. CC BY 4.0 License.

I acknowledge that your meaning is consistent with how many meteorologists would interpret it. I would recommend addressing this issue either by using a different word (I believe HESSD readers may be more familiar with 'post-processing') or by addressing this in the text somewhere. AR: We agree that hydrologists might interpret the term "calibration" as "hydrological model calibration", and we will clarify our use of the terminology as illustrated in Figure 2. Post-processing is, in our paper, a general term for any modifications applied to a raw meteorological forecast. We distinguish between calibration and downscaling, both of which are post-processing methods. This is consistent with the terminology used by the Norwegian Meteorological Institute (MetNorway) (https://github/metno/gridpp). AC: P2L7-11 We rewrote to clarify the terminology used: "Post-processing refers to all techniques used to change the output from a meteorological model, and includes calibration (described above) and downscaling. Downscaling implies resampling from the original forecast grid size to a grid of higher resolution, and both statistical (e.g. interpolation) and dynamical (e.g. a regional weather forecast model) techniques can be used (Schaake et al., 2010). A recent review of post-processing methods is given in  and in the textbook edited by Vannitsem et al. (2018)."
• Citations aren't always properly formatted. I think I've seen ((double parentheses)), for example. In S3.1.2, l12, a correct way to refer to the evidence would be (Seierstad, 2017) with the 'personal communication' listed in the bibliography. I think. I've also seen citations in which both first and family names are listed. May be good to verify against Copernicus citation rules. AR: Thank you. AC: The citations and references have been formatted according to the HESS standard.

Abstract
• l9-11 These sentences distract from the point you're going to make. While the facts you state may have a place in the introduction, I would omit these from the abstract. AR: You are right. We will consider rewriting the abstract. AC: P1L9-14 We changed the first sentences as follows: "In this study, we used meteorological ensemble forecasts with the hydrological models to quantify the uncertainty in forecasted streamflow, with a particular focus on the impact of ensemble temperature forecasts. In catchments with seasonal snow cover, snowmelt is an important flood generating process."
• l20 'the HBV model is used to calculate streamflow'. The verb to calculate presumes certainty. Pls consider using estimate instead. AR: Thank you, we will change as suggested, i.e. using 'estimate' both in the abstract and in the text. AC: We changed as suggested P1L23, P2L20, P7L8+11+16, P10L25, P14L21.
• l21 'influenced'. My understanding is that 'influences' (and the associated verb) are a thing of the mind ("Who are your main influences?" "Joan Baez"). For physical processes, I think 'affected' is more suitable. AR: Thank you. We will change 'influence' used as a verb to 'affect', and to 'effect' where 'influence' is used as a noun. AC: Changed to affects or in some cases effect: P1L24+27 (affected), P2L8 (affected), P10L20 (effect), P11L19 (affects), P16L8 (affects) + L25 (effect), P17L7 (effect)
• l26 'however'. I don't think this sentence contradicts anything that was stated before. Hence, the word 'however' may be omitted. AR: Thank you, we will omit "however".
Author Response to RC#3
AC: P1L31 Rephrased to "Overall, it is evident that temperature forecasts are important for streamflow forecasts in climates with seasonal snow cover."

Section 3.1.2
• I am not entirely sure who provides the calibration parameters. L5 suggests MetN, but the sentence "To establish the calibration parameters. . . " (l8) may be interpreted as an explanation of how the authors have done this. AR: MetNorway did the quantile mapping and established the calibration parameters. The calibration parameters were originally used to bias-correct the temperature forecasts as provided on yr.no (the Norwegian weather forecasting service). We applied the Met parameters to the raw ENS temperature forecasts of our selected period. AC: P8L8 We rephrased the sentence: "To establish the calibration parameters MET Norway used both ENS re-forecast (Owens, 2018) and Hirlam data from July 2006 to December 2011 interpolated to a 5×5 km² grid."
• In the Met Norway procedure, why aren't temperature observations used? Are the HIRLAM reanalyses deemed to be sufficiently certain? This may deserve a few informed comments. AR: You are right to point out these differences in data sets used for calibration of forecasts and the hydrological model. First, as you mention, SeNorge and Hirlam are not the same data.
Hirlam is a short-range regional forecast model (4 km resolution) used in the operational weather forecast for the first 2 days, whereas SeNorge is a dataset where observations are interpolated to a 1 km grid.
In this study, we wanted to use the available operational method from MET Norway, and they use quantile mapping with Hirlam as a reference to calibrate the ECMWF ensemble forecast. Both Hirlam (for the first 2-3 days) and ECMWF (for the following 7-8 days) forecasts are used in the operational weather forecast (yr.no). Using Hirlam data to calibrate ECMWF will improve the transition between the forecasts. Hirlam is available as a sub-daily grid, which makes it possible for MET Norway to provide different calibration parameters for day and night, whereas SeNorge is only available as a daily grid and would not offer this possibility.
Hirlam has smaller errors than ECMWF in the temperature forecasts for Norway, and we see from e.g. Figs. 6 and 7 that the calibration improves especially the cold biases in the ECMWF forecasts. When we evaluated the hydrological model, the temperature calibration improved, in most cases, the hydrological forecasts, providing an indirect confirmation that the HIRLAM temperature is less biased than the ECMWF temperature. Nevertheless, the results suggest that there might be improvements from using the SeNorge data instead of Hirlam, but this needs to be tested (beyond the scope of this study). AC: P8L5-8 Rewritten: "MET Norway uses the Hirlam (Bengtsson et al., 2017) temperature forecast (on a 4×4 km² grid) to provide a reference for the parameter estimation (calibration). Hirlam is suitable as a reference since it provides a continuous field covering all of Norway at a sub-daily time step. In addition, Hirlam gives a higher skill and is less biased than the ENS."
• If I am correct in understanding that both the raw and the calibrated ensembles have been provided by Met Norway then maybe this should be stated more clearly. Or is it the case that Met Norway computed the calibration parameters on a data set from 2006-2011 and that you applied these yourself to a data set ranging from March 2013 through Dec 2015? If so, maybe state this more bluntly? AR: Your second suggestion is correct. The raw ensembles from ECMWF (March 2013-Dec 2015) and the calibration parameters (based on data from 2006-2011) were supplied by MET Norway, whereas we did the calibration using the provided calibration parameters and available computer scripts (github/metno/gridpp). AC: We separate what MET Norway did from what we did. The first paragraph of section 3.1.2 contains the description of calibration parameters from MET Norway (P8L2-15), whereas the second paragraph (P8L16-21) describes what we did: (1) P8L8-9 added to the first paragraph: "To establish the calibration parameters MET Norway used both ENS re-forecast (Owens, 2018) and Hirlam data from July 2006 to December 2011, both interpolated to a 5×5 km² grid…" (2) P8L16-17 added to the second paragraph: "In this study, we applied the calibration coefficients provided by MET Norway to the temperature forecasts for the period 2013-2015. Accordingly, the ENS was interpolated to the 5×5 km² …."
• I am assuming that you used a HIRLAM reanalysis. Is that correct? If not, what lead times are you using and do the HIRLAM forecasts you used have the same max lead time as the ECMWF ensembles? I am only familiar with a few instances of HIRLAM and these all go out to just over 2 days max. AR: MetNorway used the operational Hirlam forecasts for the calibration period. It is correct that Hirlam does not cover the same lead times as ENS. Met Norway established the calibration parameters using the first 24 hours of the forecasts as the reference.
AC: P8L13-15 We added a sentence to clarify this: "The same coefficients, based on the first 24 h mapped, are applied to all lead times and ensemble members individually. For forecasts outside the observation range, a 1:1 extrapolation is used. That is, if a forecast is 2°C higher than the highest mapped percentile, then the calibrated forecast is 2°C higher than the same percentile for the reference."
• By off-setting Tens against Tcal, you create the impression that Tcal is not an ensemble forecast. Consider using Traw and Tcal instead. AR: We chose to use "ens" instead of "raw", since an elevation correction was applied to the forecasts, and hence they are not actually "raw". AC: We added to the existing text to underline that Tcal (and Qcal) is an ensemble. P11L4-5+22-23, P12L29-30, P13L9-10, P14L8
• l29-30. The 'assessment' was done by you, not by the ensemble range.
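For concreteness, the quantile mapping with 1:1 tail extrapolation described in the AC above can be sketched as follows. This is a minimal illustration, not MET Norway's gridpp implementation; the function name, variable names and the toy percentile values are our own.

```python
import numpy as np

def quantile_map(forecast, fcst_quantiles, ref_quantiles):
    """Map forecast values onto reference quantiles.

    fcst_quantiles / ref_quantiles: matching arrays of mapped
    percentiles for the forecast model and the reference (e.g. Hirlam).
    Outside the mapped range, a 1:1 extrapolation is used: a forecast
    2 degC above the highest mapped forecast percentile becomes
    2 degC above the corresponding reference percentile.
    """
    forecast = np.asarray(forecast, dtype=float)
    # np.interp clamps outside the breakpoints; we override that below
    out = np.interp(forecast, fcst_quantiles, ref_quantiles)
    lo, hi = fcst_quantiles[0], fcst_quantiles[-1]
    out = np.where(forecast < lo, ref_quantiles[0] + (forecast - lo), out)
    out = np.where(forecast > hi, ref_quantiles[-1] + (forecast - hi), out)
    return out

# Toy example: the forecast model runs ~1 degC too cold.
fq = np.array([-10.0, 0.0, 10.0])    # mapped forecast percentiles
rq = np.array([-9.0, 1.0, 11.0])     # matching reference percentiles
print(quantile_map([12.0], fq, rq))  # 2 degC above the top percentile -> [13.]
```

As stated in the AC, the same coefficients would be applied to every lead time and to each ensemble member individually, which is why the member ordering in space and time is preserved.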
• On assessing sharpness: how confident are you that a visual assessment does the job? Pls consider plotting the empirical distribution of sharpness of all your forecasts and comparing those. AR: We will plot the empirical distribution of sharpness for all temperature ensembles, and rephrase the sentence concerning sharpness accordingly. AC: P9L5-8 Rewritten: "In this study, the temperature sharpness was assessed by first estimating the range between the 5th and the 95th percentile of the ordered ensemble forecasts for all issue dates, lead times and catchments. For streamflow, we estimated a relative sharpness by dividing the 5th to 95th percentile range by the ensemble mean. Thereafter, sharpness was determined for each catchment and lead time as the average range over all issue dates."
• If you're calibrating the temp ensembles on a lead time by lead time basis and on a grid cell by grid cell basis, chances are that you'll change the temporal pattern (forecasted temperature as a function of time) as well as the spatial pattern. Does this in any way affect use in streamflow forecasting? I believe there are some techniques that may be helpful in trying to restore spatial-temporal relations (the Schaake shuffle springs to mind). Would these have a use in the present study?
AR: We think that the calibration will not affect the spatial and temporal pattern significantly. The calibration function was applied to each ensemble member individually. We therefore kept the order of the ensemble members, both in space and time, and it was not necessary to use the Schaake shuffle. AC: We think this will be clearer after adding the following description of quantile mapping (page 7, lines 12-13; see response above): "… are applied to all lead times and ensemble members individually…"

Section 3.2
• Would it be fair to say that temperature forecasts are only relevant if they can discriminate between freezing and non-freezing situations? If so, would it be justified to focus more on this discrimination? Perhaps by defining an event (T<0, for example) for which one can compute a range of verification scores (false alarms, hits, ROC, Brier's probability score, etc). I acknowledge that this would be feasible for temperature and less obvious for streamflow. AR: This is a good suggestion. Nonetheless, we think this is beyond the scope of this study. This could be an interesting topic for a future study. AC: No change
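The sharpness measure described in the Section 3.1.2 response above (average 5th-95th percentile range per catchment and lead time, with a relative variant for streamflow) could be computed along these lines; a sketch under the assumption of a 2-D array of ensemble forecasts, with names of our own choosing:

```python
import numpy as np

def sharpness(ens, relative=False):
    """Average 5th-95th percentile range over issue dates.

    ens: array of shape (n_issue_dates, n_members) holding the
    ensemble forecasts for one catchment and one lead time.
    If relative is True, each range is divided by the ensemble mean
    of that issue date (the relative sharpness used for streamflow).
    """
    p5, p95 = np.percentile(ens, [5, 95], axis=1)
    rng = p95 - p5
    if relative:
        rng = rng / ens.mean(axis=1)
    return rng.mean()

# Toy example: three issue dates with an identical 51-member ensemble.
ens = np.tile(np.linspace(0.0, 10.0, 51), (3, 1))
print(sharpness(ens))  # 5th-95th percentile range of 0..10 -> 9.0
```

Plotting the per-issue-date ranges before averaging gives exactly the empirical sharpness distribution the reviewer asks for.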

Section 4
• "To reduce the amount of presented results, the remaining part of this paper focuses on CRPSS for a lead time of 5 days." This is fine, but a temperature forecast at 5-day lead time may not affect streamflow forecasts until a (much) longer lead time. Or conversely, streamflow forecasts at day 5 would have been affected by a day-2 temperature forecast (this is an example). As in some cases you're comparing Q-forecasts with T-forecasts, how have you accounted for this? AR: This is an interesting question. The streamflow forecast at day 5 will be affected by the temperature forecast for the previous 4 days as well as day 5. However, for most catchments in this study, the concentration time is less than one day, and the streamflow will respond the same day as a major water input from rain or snowmelt. For specific events, it is not evident which of the T-forecasts at days 1-5 is the most important for the Q-forecast at day 5. The sensitivity depends on the sequence of temperature and precipitation. Nevertheless, we think that using temperature CRPSS for day 5 is a good choice since the streamflow at day 5 is, on average, the most sensitive to the temperature at day 5 (which applies to all lead times). In addition, we see that the improvement in CRPSS across lead times is highly correlated and our results and conclusions would not change if we used temperature CRPSS for days 2, 3, or 4 instead. AC: P14L26-28 Added: "The same lead time was used to relate improvement in streamflow to temperature; we consider this robust since most catchments in this study have a concentration time of less than a day."

Section 4.2
• "Scatter plots of the difference between CRPSS for calibrated and uncalibrated forecasts". CRPSS in itself is a fairly abstract measure. The difference between two CRPSS scores is, I find, even more abstract. What's the meaning of those values?
As CRPSS is a skill of a forecast versus a baseline, why not simply calculate the CRPSS of the calibrated forecasts using the CRPS of the uncalibrated forecasts as a baseline? AR: We wanted to evaluate the skill of the uncalibrated forecasts as well. If we were to use the uncalibrated as a benchmark, we would not assess the quality of the original forecast, only the change between the uncalibrated and calibrated forecast. AC: No changes introduced.
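For reference, the empirical CRPS of an ensemble and the CRPSS against an arbitrary baseline (a climatology, or, as the reviewer suggests, the uncalibrated forecast) can be sketched as follows. The Gneiting-Raftery form of the ensemble CRPS is assumed here, and the function names are ours:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast against one observation,
    in the form CRPS = E|X - y| - 0.5 E|X - X'| (Gneiting & Raftery)."""
    m = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(m - obs))
    term2 = 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))
    return term1 - term2

def crpss(crps_forecast, crps_reference):
    """Skill score relative to a baseline: 1 = perfect,
    0 = no better than the reference, negative = worse."""
    return 1.0 - crps_forecast / crps_reference

# Toy example: a two-member ensemble bracketing the observation.
print(crps_ensemble([0.0, 2.0], 1.0))  # -> 0.5
```

With this in hand, the reviewer's suggestion amounts to `crpss(crps_cal, crps_uncal)`, whereas the paper's choice (both `crpss(crps_cal, crps_ref)` and `crpss(crps_uncal, crps_ref)` against the same climatological reference) additionally scores the uncalibrated forecast itself.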

Section 5
• L7: 'dispersion' is not an expression of quality but a characteristic of an ensemble. Saying 'dispersion improved' makes little sense then? AR: Thank you. What we mean is that dispersion, as measured by rank histogram convexity, improved. AC: P13L1-3 We changed to "Even though both bias and dispersion (i.e. reliability), as measured by rank histogram slope and convexity, improved with longer lead time, the reduced sharpness and increased uncertainty resulted in a reduced skill (CRPSS)."

Section 5.1
• L11 "skill. . . depends". Consider replacing by "skill. . . varies with". AR: Thank you. AC: P13L9 We applied as suggested: "The skill for both raw (uncalibrated) Tens and calibrated Tcal temperature ensembles varies with season."
• "Quantile mapping is sensitive to forecasts outside the range of calibration values and period". I think it would be good to point out that this is true for any statistical post-processing procedure. AR: Good point. AC: P13L16 "Quantile mapping, like most statistical techniques, is sensitive to forecasts outside the range of calibration values and period (Lafon et al., 2013); this may explain the too-high correction in the highest Tens quantile."
• Immediately following: "and can be a" -> "and this can be a" AR: Noted AC: P13L17 Changed as suggested
• On the causes of temperature forecast bias. You go into some detail to explain a situation in which land is colder than sea. Would this be a typical situation for summer/winter? If so, can you more directly link this to some of the results you're showing? AR: We will clarify that this is a typical situation in winter. This is to some extent already exemplified in the text, and we will underline in the text that the situations are typical for winter.
(5.3 will be included in 5.1 and 5.2, and we will ensure to get this information in the revised manuscript.) AC: P13L20-28 Implemented and rewritten for the revised sec 5.1: "The most pronounced spatial pattern is the low autumn CRPSS for uncalibrated ensembles Tens in the coastal areas. This is seen from the boxplots for the regions West, Mid and North (Fig. 8) and in the plots of the western catchments Viksvatn and Foennerdalsvatn during winter months (Fig. 4). This cold bias is documented for the Norwegian coastal areas in the cold seasons by Seierstad et al. (2016), and is mainly caused by the radiation calculations in the ECMWF model (Hogan et al., 2017). The coarse radiation grid results in warmer sea points being used to compute longwave fluxes applied over colder land points, causing too much cooling. This effect is seen in the temperature forecasts for winter 2014 and 2015 for the coastal catchments in Fig. 4(b) and (c), in contrast to the inland catchment (a), which is less biased. The radiation resolution is improved in later model cycles (Hogan et al., 2017; Seierstad et al., 2016). In addition, the challenging steep coastal topography is not well represented by the spatial resolution in the ECMWF model (Seierstad et al., 2016). For inland catchments, and the regions "

Section 5.2
• L10 Grammatically, this sentence is awkward if not wrong. AR: Thank you; we will rephrase this sentence. AC: P14L29-P15L1 We rephrased: "In summary, it can be concluded that to further improve streamflow forecasts during the snowmelt season, improved temperature forecasts are essential. Streamflow forecasts during spring have the highest potential for improvements since the temperature forecasts were not, for a majority of the catchments, improved by the applied calibration."
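The rank (Talagrand) histogram behind the bias and dispersion diagnostics discussed in Section 5 can be sketched as follows; a minimal version that ignores ties between the observation and ensemble members, with names of our own choosing:

```python
import numpy as np

def rank_histogram(ens, obs):
    """Talagrand rank histogram: for each issue date, the rank of the
    observation within the ordered ensemble.

    A flat histogram indicates reliable spread; a U-shape indicates
    under-dispersion, a dome over-dispersion, and a slope a bias.

    ens: array (n_dates, n_members); obs: array (n_dates,).
    Returns counts over the n_members + 1 possible ranks.
    """
    e = np.asarray(ens, dtype=float)
    o = np.asarray(obs, dtype=float)
    ranks = (e < o[:, None]).sum(axis=1)  # members below the observation
    return np.bincount(ranks, minlength=e.shape[1] + 1)

# Toy example: one issue date, the observation falls between members 2 and 3.
ens = np.array([[1.0, 2.0, 3.0]])
print(rank_histogram(ens, np.array([2.5])))  # -> [0 0 1 0]
```

In practice ties (observation exactly equal to a member, common for rounded temperatures) are usually broken at random before ranking.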

Figures Overall
Many figures use a lot of white space between various plots/panels. Consider reducing this or, even better, removing it altogether. AR: We will reduce some white space in figures 1 and 3. AC: New figures provided P

Figure 1
• Do the grey polygons add up to 139 in total? If so, many must be really small?
• Caption: consider using 'boundaries' instead of 'limits' AR: Yes. Especially on the western coast, the catchments are small. This will be clarified in the caption. AC: P24L5-10 New caption text for Fig. 1: "Figure 1: The maps of Norway show the 139 catchments used in this study. The left map shows the catchment boundaries including the location of four selected catchments. Please note that many catchments are relatively small and difficult to detect. The locations of the catchments' gauging stations are shown in the right map. Norway is grouped into five regions (N=north, M=mid, W=west, S=south, and E=east), and all regions are marked with different colors and regional boundaries."

Figure 4
• Why plot the ensemble mean and not all five ensemble members, possibly as horizontal lines? AR: It is not evident to us which modification the reviewer suggests. In this plot, the mean is for the 51 ensemble members, not five. If we were to plot all the members, it would be difficult to retain any information. By plotting the mean we show the bias in the forecast, and by using the scatter plot we also show that some biases are dependent on forecasted temperature (a conditional bias). AC: No changes introduced in the plots.
• The axes of the plots in the right-hand column vary. Please consider unifying this. Also: please consider ensuring that horizontal and vertical axes are identical. Maybe they are, but the labeling isn't. AR: We will unify the axes.
• What lead time are these plots for? AR: Thank you; we will add the lead time in the caption. AC: P32L5 Caption updated: "All plots are presented for a lead time of 5 days."
• Is the lead time for T identical to that for Q? What is the 'response time' of the catchment to snowmelt? If not zero, then shouldn't this be taken into account somehow? AR: We use the same lead time for temperature as for streamflow. See comment to section 4. AC: No changes applied.
Please consider. . .
• . . . removing data for seasons for which temperature has little or no effect on streamflow levels. AR: We would like to keep the plots for all seasons here. By showing the difference between the seasons, we think it is easier to understand the large variations we see. AC: No changes applied
• . . . unifying horizontal and vertical axes. It took me a little while longer than I cared to realise that the light grey slanted line is the 1:1 diagonal. AR: We will consider changing the plots. However, unified axes mean that we lose information about the regional distribution. AC: P32 We unified the axes in Fig. 5, and omitted summer and winter.

Figure 6
• What do you want the reader to compare? CRPSS(T) and CRPSS(Q)? Or CRPSS(spring) v CRPSS(autumn)? Pls ensure panels are ordered accordingly. AR: We wanted the reader, first of all, to compare CRPSS(T) and CRPSS(Q). Therefore, we placed CRPSS(T) and CRPSS(Q) from the spring season on the first line and for the autumn season on the last line. Then the reader can evaluate how the improvements in temperature will affect improvement in streamflow, for both seasons. Secondly, we wanted to show the difference between seasons. Subplots for each season are therefore arranged vertically, for both temperature (left) and streamflow (right). AC: No changes introduced.
• pls ensure that within a row, panels have identical vertical axes so this comparison can indeed be done (i.e. the reader can then easily compare the top left with the top right plot) AR: We prefer to use different scales on the vertical axes within a row to increase the readability of each sub-plot. In particular, the plots of the CRPSS(Q) would be more difficult to read if we used the same scale as in the plots of CRPSS(T) in the left panel. AC: No changes • What is the purpose of showing both the 'real' observations and the 'model streamflow with SeNorge observations'? Is this distinction made in the paper, and addressed?
• Consider reversing the order of the graphs. The 9d lead time graph was available before the 2d lead time graph? • The horizontal axis labeling is not in English.
• As all horizontal axes are identical, pls consider removing white space between plots altogether and only label the axis of the bottom plot. AR: We will change the plots as suggested. We understand that the introduction of real observations in this figure is confusing, and we will therefore remove the real observations from the figure and from the text. AC: P38 New Fig. 10 and updated caption. Updated the text P12L21-23: "The horizontal grey dotted lines represent the mean annual flood, the 5-year and the 50-year floods (i.e. the operational flood warning levels) in this catchment."
• The warning levels aren't relevant, are they? On reflection: you're scoring the forecast ensembles using CRPSS and rank histograms. This shows absence of preference for doing well for 'extremes', even though the work appears to be inspired by forecasting for floods. How is this consistent? Maybe omit references to 'floods' altogether? AR: In Norway, we use the mean annual, the 5-year and the 50-year floods as exceedance thresholds to issue flood warnings. This figure connects the theoretical aspects to the operational implementation, and points to the importance of calibrated temperature for a flood warning system. AC: We kept the reference to flood levels, but removed the warning colors altogether.
Thank you for the positive and thorough evaluation of our article. We appreciate the comments, which are valuable for us in order to improve the manuscript.
We would like to apologize for the missing references. The error emerged when we specified the HESS format and unintentionally deleted many references from the reference list. The main author should nonetheless have detected this flaw prior to posting.
Replies and corrections are done as follows: the author response (AR) is marked with red text, while the author's suggested corrections (AC) are marked with blue text; we use page and line numbers to specify the appropriate location where needed. All referee comments are kept in black; we use page and line numbers when needed to specify the appropriate location. All page and line references from the authors are to the provided track-changes version of the revised manuscript. Hence, P6L3-5 indicates changes on page 6, lines 3 to 5.

Review of 'Streamflow forecast sensitivity to air temperature forecast calibration for 139 Norwegian catchments' by Trine Hegdahl et al. Anonymous referee #4
Supplement: Especially in hydrometeorological predictions, where methods from both the meteorological and the hydrological forecasting communities are used, it is of major importance to carefully define the terminology and to use it coherently throughout the manuscript.
The current form of the manuscript shows a lack of precise formulations (e.g. calibration, preprocessing, skill), which should be revised to better communicate the content of the study. Some of the graphics should be enhanced to facilitate readability, and the captions are sometimes incomplete. In addition, more than 15 references mentioned in the text are missing from the reference list and should be added.
AR: We thank you for the feedback. We would like to apologize for the missing references. There seems to have been an error when we reformatted EndNote, which evidently led to many references being deleted from the reference list. The main author should nonetheless have detected this flaw prior to submitting the manuscript. We will carefully revise the text to avoid inaccuracies in formulations.
Furthermore, some additional references could be of interest within the discussion to put the findings of the study into a broader picture. Many of the references, especially those concerning the meteorological forecasts, are user guides, technical reports or personal communications, which is fine, but I would appreciate it if some more peer-reviewed literature were cited, as there is a large body of existing literature concerning the verification of ECMWF temperature predictions.
AR: We agree that it is better to use peer-reviewed literature. We chose to use technical reports and personal communication only when necessary and when we found no other alternatives. In particular, there are not many peer-reviewed papers available on the verification of the ECMWF temperature forecasts for Norway. Hence, we chose to use the available technical documentation.
In general, the language could be clearer and more concise. To me it is not clear what the authors understand under the term pre-processing, at least in the beginning of the manuscript. E.g. in the literature there is a distinction between dynamical and statistical downscaling (see e.g. Yuan et al. (2015)), and statistical downscaling does include a bias correction. In the present manuscript, the term downscaling only refers to applying a lapse rate correction and interpolation, which is not how downscaling is referred to in the literature.
However, I think it would be important for the reader to have a short general overview of what pre-processing is in the introduction. In particular, the term calibration, in the present manuscript used as a synonym for bias correction, should be introduced more carefully, because the term calibration is used by statisticians, but in the meteorological, climatological and hydrological communities the term bias correction is more common.
AR: We acknowledge that the literature is not consistent in terminology; in particular, the terminology differs between the forecasting and the climate projection communities.
In our paper, we chose a terminology that is consistent with a large part of the literature and that facilitates explaining the approaches we used. We use pre- (and post-) processing as a general term, which includes all techniques applied to the raw temperature forecasts in order to improve the temperature output from the atmospheric model (i.e. downscaling and calibration are pre-processing techniques). We pre-processed the temperature in two ways: (i) only downscaling, (ii) both downscaling and calibration, with the purpose of revealing the effect of temperature calibration.
We used the term downscaling for the resampling from the low resolution of the ECMWF forecasts to the 1×1 km grid used for the SeNorge data, combined with a temperature correction using a temperature lapse rate. This terminology is used by e.g. the UK Met Office (Sheridan et al., 2010, with references therein). Especially in areas with complex terrain, where the resolution of the NWP poorly resolves the terrain, the correction for the discrepancy between model elevation and terrain is useful. In some literature, the term downscaling includes both bias correction and resampling (Yuan et al., 2015), but we did not use this terminology here.
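The lapse-rate part of the downscaling described here amounts to a one-line elevation correction; in this sketch the constant -6.5 °C per km is the standard environmental lapse rate, an assumption of ours rather than a value stated for the operational system:

```python
def lapse_rate_correction(t_model, z_model, z_target, lapse=-0.0065):
    """Adjust a temperature forecast from the model grid elevation to
    the target (e.g. 1x1 km SeNorge) terrain elevation.

    t_model: temperature (degC) at the model grid cell
    z_model, z_target: elevations (m) of the model cell and target cell
    lapse: temperature change per metre of elevation gain; the default
    -0.0065 degC/m (standard environmental lapse rate) is our assumption.
    """
    return t_model + lapse * (z_target - z_model)

# Toy example: target terrain 1000 m above the smoothed model orography.
print(lapse_rate_correction(0.0, 500.0, 1500.0))  # -> -6.5
```

In complex Norwegian terrain the model orography can differ from the true terrain by many hundreds of metres, which is why this correction matters for snowmelt modelling.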
We used the term calibration for the statistical adjustment of bias and dispersion of the ensembles. The aim of calibration is to make the forecasts reliable in a statistical sense, i.e. 90% of the observations are within a 90% uncertainty interval. In particular, in the meteorological forecasting literature, calibration has this specific meaning (e.g. Gneiting, 2006). We think that separating the pre-processing into downscaling and calibration is useful, but agree that the term downscaling might have a different signification in parts of the literature. Our terminology is also, to a large degree, in accordance with the descriptions in . Lie et al. (2017) describe the main purposes of post-processing to be the following: (1) correct bias and dispersion in the forecasts, (2) preserve the predictive skill of the forecasts, (3) downscale the forecasts to the scale used in the applications, and (4) generate ensemble members (…). Further, in the conclusion  writes that their purpose is "… to calibrate the bias …". In the referred article, we hence see the term calibrate used consistently to describe the statistical properties of both the meteorological and the hydrological ensembles. We further think that using calibration as part of the pre- and post-processing is a well-established term for the hydrological community using ensemble forecasts. Calibrated ensembles and the calibration methods are more specific than only using the term pre- or post-processing. Calibration strives for the ensemble to describe the mean and spread of the climatology it should represent.
We have not included any description of dynamical downscaling, as this usually involves a regional climate model with a different approach, and is not within the scope of this study.
AC: We added a description to clarify the use of pre-processing, calibration and downscaling. We further omitted the reference to post-processing (P2L32-P3L2) since in this study we focus on the calibration and downscaling of the meteorological forecasts, which from a hydrological perspective is preprocessing.
P3L7-11 "Pre-processing (from a hydrological perspective) refers to all techniques used to change the output from a meteorological model, and includes calibration (described above) and downscaling. Downscaling implies resampling from the original forecast grid size to a grid of higher resolution, and both statistical (e.g. interpolation) and dynamical (e.g. a regional weather forecast model) techniques can be used (Schaake et al., 2010). A recent review of post-processing methods is given in  and the textbook edited by Vannitsem et al. (2018)." As you mention, the forecasting period used for the study is only two and a half years long, which might influence the results. You state this in the discussion but do not explain why it could be critical. I suggest that you discuss this explicitly. Namely, within such a short period, the interannual variability might not be sufficiently covered. In addition, using forecasts from different model cycles (38r1 to 41r1) might have an influence on the skill as well, because the adaptations within a new cycle might enhance or decrease the forecast performance, making the comparison between seasons difficult, as differences might not only originate from the particular season but might be influenced by model versions. I suggest including such limitations in the discussion.
AR: We agree that the inter-annual variability might affect the calibration coefficients, and of course, there are aspects of the different model versions that might affect the result. However, the changes applied to the different model cycles did not remove the biases apparent in the temperature forecasts (Fig. 4).
AC: P13L17-19 "The use of forecasts from different model cycles might affect the consistency in the forecasts. Moreover, the calibration parameters are sensitive to the representativeness of the calibration period." To apply quantile mapping you need the distribution of the forecast and the distribution of the observations. In section 3.1.2 you state that "MET Norway uses Hirlam temperature forecasts to provide the observational climatology used for parameter estimation". I think more information is needed here to enable the reader to understand how the calibration is done. Are daily values used for the parameter estimation? Is empirical or parametric QM used, and how are values outside the range treated (e.g. constant extrapolation)?
Is it a member-by-member approach or are the same parameters used for all members?
AR: MET Norway uses parametric quantile mapping based on the first 24 h. When a forecast is outside the observation range, a 1:1 extrapolation is used. Therefore, if a forecast is 2 °C higher than the highest percentile, then the calibrated forecast is 2 °C higher than the same percentile for the reference. The same parameters are applied to all members and lead times.
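The mapping with 1:1 extrapolation described above can be illustrated with a minimal sketch. Note the assumptions: MET Norway's actual implementation is parametric, whereas this sketch uses an empirical quantile mapping on small example climatologies, with the same constant-offset (1:1) behaviour outside the mapped range.

```python
import numpy as np

def quantile_map(x, fcst_clim, ref_clim):
    """Empirical quantile mapping of value x from the forecast climatology
    to the reference climatology, with a 1:1 (constant offset) extrapolation
    outside the mapped range, as described in the author reply."""
    fq = np.sort(np.asarray(fcst_clim, dtype=float))  # forecast quantiles
    rq = np.sort(np.asarray(ref_clim, dtype=float))   # reference quantiles
    if x > fq[-1]:                    # above the highest mapped forecast:
        return float(rq[-1] + (x - fq[-1]))  # carry the excess 1:1
    if x < fq[0]:                     # below the lowest mapped forecast
        return float(rq[0] + (x - fq[0]))
    return float(np.interp(x, fq, rq))  # inside the range: interpolate

# Example (illustrative numbers): the highest mapped forecast is 10 degC and
# the highest mapped reference is 12 degC; a forecast of 12 degC (2 degC above
# the range) is calibrated to 12 + 2 = 14 degC.
calibrated = quantile_map(12.0, [-10.0, 0.0, 10.0], [-8.0, 1.0, 12.0])
```

In an operational setting the same parameters would be reused for all members and lead times, as stated in the reply.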
AC: P8L13-17 We added "The same coefficients, based on mapping the first 24 hours, were applied to all lead times and members. For forecasts outside the observation range, a 1:1 extrapolation was used, i.e. if a forecast is 2 °C higher than the highest mapped forecasted temperature, then the calibrated forecast is 2 °C higher than the highest mapped reference temperature." One critical point is that the calibration parameters are inferred from Hirlam but the hydrological model is run with SeNorge observations. Why are these observations not used? The correction will account for the bias between ECMWF and Hirlam, but I would expect that the biases with respect to SeNorge will at least slightly differ. Why don't you use the observations from SeNorge to derive your calibrations?
In the summary it is stated that "The most obvious improvement in the forecasting chain is to use the same temperature information, the SeNorge temperature, for calibrating the temperature forecast that is used for calibrating the hydrological model, generating …" (P14/L25-27).
But if I understand correctly from the manuscript, SeNorge and Hirlam are not the same. I have trouble with this procedure, as it is known that different forecast models have different biases. To bias-correct or calibrate ensembles, the observations should be taken into account and not another forecast. In this case the bias between two forecasts will be corrected and not the bias of the forecast with regard to the observations. AR: You are right to point out these differences in the data sets used for calibration of the forecasts and the hydrological model. First, as you mention, SeNorge and Hirlam are not the same data. Hirlam is a short-range regional forecast model (4 km horizontal resolution) used in the operational weather forecast for the first 2 days, whereas SeNorge is a dataset where observations are interpolated to a 1 km grid.
In this study, we wanted to use the available operational method from MET Norway, and they used quantile mapping with Hirlam as a reference to calibrate the ECMWF ensemble forecasts. Both Hirlam (for the first 2-3 days) and ECMWF (for the following 7-8 days) forecasts are used in the operational weather forecast (yr.no). Using Hirlam data to calibrate ECMWF will improve the transition between the forecasts. Hirlam is available as a sub-daily grid, which makes it possible for MET Norway to provide different calibration parameters for day and night, whereas SeNorge is only available as a daily grid and would not offer this possibility.
Hirlam has smaller errors than ECMWF in the temperature forecasts for Norway, and as we see from e.g. Figs. 6 and 7 in this manuscript, the calibration reduces the cold biases in the ECMWF forecasts. When we evaluated the hydrological model, the temperature calibration improved the hydrological forecasts in most cases, providing an indirect confirmation that the Hirlam temperature is less biased than the ECMWF temperature.
Furthermore, the many interpolations used introduce a large uncertainty, which lowers the trust in the results: interpolation of ECMWF and Hirlam to derive correction parameters, and another interpolation to meet the hydrological model requirements.
AR: We agree that there are uncertainties due to interpolation and downscaling. A temperature calibration that is tailored to the needs of the hydrological modelling would solve this challenge.
AC: P13L5-7 We added "The calibration procedure applied in this study involves many interpolation and downscaling steps that increase the uncertainty in the temperature forecasts. We believe that a catchment-specific temperature calibration, tailored to the needs of hydrological forecasting, would solve this challenge." Another point that should be discussed is whether seasonal correction parameters are really sufficient or whether they introduce artificial jumps between periods. In a climate context, seasonal windows for parameter estimation might be sufficient, but in an operational forecasting context a shorter window should be considered if possible.
AR: MET Norway provided unique parameters for each month. The parameters are based on a window of three months, which smooths the seasonal patterns. A three-month window was chosen to ensure enough data for robust calibration parameters.
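The monthly parameters estimated from a centred three-month window, as described above, can be sketched as follows. The function names and quantile levels are illustrative assumptions, not MET Norway's implementation; the point is only how the window smooths the seasonal transition between adjacent months.

```python
import numpy as np

def monthly_window(month):
    """Months (1-12) in the centred three-month window around `month`,
    wrapping around the year boundary (e.g. January -> [12, 1, 2])."""
    return [(month - 2) % 12 + 1, month, month % 12 + 1]

def window_quantiles(values, months, target_month, probs=(0.1, 0.5, 0.9)):
    """Estimate calibration quantiles for `target_month` from all values
    whose month falls in the surrounding three-month window.

    values, months : paired sequences of data values and their month (1-12)
    probs          : quantile levels to estimate (illustrative choice)
    """
    window = monthly_window(target_month)
    pooled = [v for v, m in zip(values, months) if m in window]
    return np.quantile(pooled, probs)
```

Because adjacent target months share two of their three window months, the estimated parameters change gradually from month to month, which is the smoothing effect the reply refers to.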
In Section 3.2, where the CRPS is introduced, you mention different notations (CRPS, Scrp), and the same for the CRPSS. I think this is confusing, as later in the text only CRPS is used. I suggest introducing only one of the notations and sticking to it.
AR: We agree that this notation might introduce confusion. The reason is the formatting standard of HESS, where symbols in equations should only contain one capital letter with sub- or superscript. However, we find it appropriate to use CRPSS in the text since this is the abbreviation used in the community, and in the equations we used an alternative notation according to the HESS standard (Scrp and Scrps are only used in the equations). This approach is used in many HESS papers.
AC: P10L5-7 We provided a sentence to clarify sec 3.2: "For readability, the abbreviations Scrp and Scrps used in the equations will be substituted with CRPS and CRPSS in the text hereafter." P9L11+14+20-21+27: We added explanations similar to "CRPS, denoted SCRP in Eq. 1". Specific comments: P1 L7-14: You say the flood forecasting system uses deterministic forecasts for temperature and precipitation. But the ECMWF model you reference provides an ensemble of 51 members. Please state how this is used.
AR: The operational system today uses one deterministic forecast, not the ensemble forecasts. In our setup, the hydrological system is set up to run the 51 ensemble members. We make sure that the same initial states are used for all members. This is explained in detail in the main text, and in the abstract we keep the description simple. We think the suggested changes in the following point also cover this point. AC: P7L15-16 We added "In the forecasting mode, each temperature ensemble member was used as input and run as a separate deterministic forecast." L11-12: "An alternative approach is to use meteorological and hydrological ensemble forecasts" is somewhat misleading. Either you used ensemble meteorological forecasts in combination with hydrological models to generate ensemble streamflow forecasts, or one uses a different methodology to produce hydrological ensemble forecasts. I suggest rewriting the sentence: "An alternative approach is to combine meteorological ensemble forecasts with hydrological models to quantify the uncertainty in the forecasted streamflow".
AR: You are right. We apply the suggested rewriting. AC: P1L9-14 Rewritten "In this study, we used meteorological ensemble forecasts as input to hydrological models to quantify the uncertainty in forecasted streamflow, with a particular focus on the effect of temperature forecast calibration on the streamflow ensemble forecast skill." L14: "for an accurate forecasting of", or "to accurately forecast streamflow". L15: "Ensemble forecasts of temperature from the ECMWF". L16: "to improve the skill and reduce biases". AR: Thank you. We include the suggestions for L14, L15, and L16. AC: P1L15+17+18 Changed accordingly. L18: Why do you mention precipitation here? If it is not used for the calibration I would avoid it here.
AR: We mention precipitation since the "observed" precipitation and temperature were used to calculate the initial states of the hydrological model until the forecast issue day. We will consider omitting the sentence about SeNorge in the abstract. Ref RC#3 and the discussion of the abstract. AC: P1L20-22 We omitted "Estimated observed daily temperature and precipitation were obtained from the SeNorge-dataset, which is station data interpolated to a 1×1 km2 grid covering all of Norway." L20: "was used to calculate the streamflow". AR: Thank you. We include the suggestion. AC: P1L23 Included. Maybe cite some standard books for statistical bias correction and downscaling (Wilks, 2011) and for forecast verification (Jolliffe & Stephenson, 2011).
AR: We will cite some standard books and papers that provide reviews of forecast calibration methods. AC: P3L10-11 We added the following sentence at the end of the paragraph: "A recent review of calibration methods is given in ". AR: We mean that an improvement in the temperature forecast will not necessarily translate directly into an improvement of the streamflow forecast. If temperatures are well below zero, an improvement in temperature forecasts has no effect on the streamflow forecasts, whereas for temperatures around zero degrees, the streamflow is very sensitive to temperature, in particular when it might turn on or off rain and/or snowmelt.
AC: P3L13-18 Rewritten "The sensitivity of daily streamflow to temperature is non-linear since streamflow depends on temperature thresholds for rain/snow partitioning and for snow melt/freeze processes. The latter depends on the state of the system, i.e. snow is needed to generate snowmelt. For temperatures well below 0 °C, the streamflow is not sensitive to temperature, whereas for temperatures around 0 °C relatively small changes in temperature might control whether the precipitation falls as rain or snow, and consequently, whether streamflow is generated or not." L5: Gragne, 2015: missing reference. AR: We will not use this reference in the modified manuscript. AC: The reference will not be used.
L7-8: Forecasting, downscaling and interpolation are three completely different things, and the challenge is connected to much more than the lapse rate. For interpolation and downscaling, a large part can be attributed to temperature-height correction, which depends to a large degree on lapse rates. But forecasting of temperature is far more complex and related to chaos theory.
AR: You are quite right. We should not have included forecasting in this sentence. We are addressing the downscaling and interpolation of forecasts. AR: We used marginal to separate the effect of temperature from that of precipitation. We will change the sentence to 'the isolated effect of…'. AC: P3L31 Changed to "isolated". L26: Do you mean from both the hydrological and the meteorological perspective?
AR: Yes, we do. This will be clarified in the manuscript. AC: P4L6 Changed "Are there spatial patterns in the temperature and streamflow ensemble forecast skill and, if so, can these be related to catchment characteristics?" L27: from the ECMWF; in addition I would mention the lead time here, but maybe not the MET Norway pre-processing setup, as you use the QM to pre-process the forecasts, which is, if I understood correctly, not yet part of the pre-processing setup at MET Norway.
AR: The information in line 27 is correct. The QM was (new techniques have been implemented recently) a part of the operational pre-processing chain at MET Norway and used on the forecasts published at yr.no. We chose not to mention lead time here since the choice to focus on a lead time of 5 days was based on preliminary results. AC: P8L4-5 In section 3.1.2 we add one sentence to clarify: "This grid calibration was used in the operational post-processing chain for meteorological forecasts, including the forecasts published on yr.no." L28: Are the retrospective forecasts operational forecasts for the period within 2013-2015? This could be misleading for readers or misinterpreted as reforecasts (or hindcasts), which are forecasts for the same day as the operational forecast but for the past 20 years, using re-analyses for the initialization. Maybe rephrase to avoid any misinterpretation.
AR: We chose retrospective to underline that we used the operational forecasts in retrospect. Nevertheless, we understand that this can be misinterpreted. We will rephrase the sentence. AC: P4L9-10 Rephrased "Three years of operational ECMWF forecasts from 2013-2015 were used to regenerate streamflow forecasts, and the skill of the temperature and streamflow forecasts was systematically evaluated for these catchments." L30: Again, I think marginal is the wrong word; if the effect is assumed to be marginal, why should you analyse it in such detail?
AR: OK AC: P4L11 Changed to "isolated" L31: Not clear to me. Do you mean that the observed precipitation is used to drive the hydrological model? Specify that to make it clearer.
AR: Yes, the observed precipitation is used to drive the hydrological model. We will rephrase to make this clearer. AC: P4L11-13 Rewritten "To investigate the isolated effect of the temperature ensembles on the streamflow forecasts, the observed SeNorge precipitation (Tveito et al., 2005) was used instead of the precipitation ensemble forecasts when we re-generated the streamflow forecasts with the hydrological model." L33-P4L2: Maybe combine this with the preceding paragraph. This would make it less generic.
AR: We will join the two paragraphs as suggested. AR: This is not a typo. There are several small catchments in our dataset, but only one of this size. AC: There will be no changes in the manuscript. L21: What are the selection criteria for "data of sufficient quality"? AR: This was an inaccurate description, since the catchments were excluded from the study for different reasons, both data retrieval and technical problems. For three catchments, we had problems running the model with the reference data; for one catchment, there was an issue with the elevation correction; and for two catchments, there were technical problems during the regional analysis. We have a large dataset, so the exclusion of the six catchments will not change our conclusions. AC: P5L5-6 Rewritten "Of the 145 flood forecasting catchments, 139 were chosen as the basis for the study (Fig. 1)." L27: "og" seems to be Norwegian. Mention here that you use the precipitation data from this data set as a substitute for the precipitation forecasts (if this is the case).
AR: Thank you. That is a good suggestion. AC: P5L28-29 We added a sentence at the end of the paragraph: "The SeNorge precipitation substitutes the precipitation forecasts in the ensemble forecasting chain, and hence the isolated effect of temperature calibration on the streamflow forecasts was obtained." L15: constitutes as the basis. AR: We prefer to keep the sentence as it is. AC: No changes will be introduced in the manuscript. L20: Explain what PEST is.
AR: We will modify the sentence and explain what PEST is. AC: P6L9-10 Modified "... which has been calibrated using the PEST software for parameter estimation (Doherty, 2015), …" L21: Abbreviation NS (for Nash-Sutcliffe) not introduced before.
AR: Thank you. This will be corrected in the manuscript. AC: P6L11-12 Changed to "Nash-Sutcliffe". Section 2.2.2: Is the calibration done for each catchment separately? Do the given values for the NS coefficient represent the mean over all catchments? Is this good? Please state how these values translate into performance compared with other hydrological models.
AR: The calibration is done for each catchment separately. The mean is presented to give an impression of the performance, and of course, there is a great difference in the NS score between the catchments. We think that an NS between 0.73 and 0.77 is acceptable. Within the range of NS scores there are of course catchments where the model performs less optimally. Other models applied to the same catchments have a very similar performance, indicating that the quality of the data (precipitation, temperature and streamflow) is an important contribution to model uncertainty. Since we in this paper use the modelled streamflow instead of the observed streamflow for evaluation of the forecasts, we think it is not necessary to provide more details on the calibration of the hydrological model.
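For reference, the Nash-Sutcliffe (NS) efficiency discussed above compares the model's squared errors against those of a trivial forecast equal to the observed mean; a minimal sketch:

```python
import numpy as np

def nash_sutcliffe(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE(sim) / SSE(mean of obs).
    NS = 1 is a perfect fit; NS = 0 means the model is no better than
    always predicting the observed mean; NS < 0 is worse than the mean."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
```

Under this definition, mean NS values of 0.73-0.77 mean the calibrated models remove roughly three quarters of the error variance relative to the observed-mean benchmark, which is consistent with the reply's assessment that the performance is acceptable.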

2.2.3
To make this more coherent I suggest renaming this section to "Reference observations" (or similar) and referring to reference observations in the latter part of the study as well. Otherwise it is difficult to distinguish between the modelled streamflow and the forecasted streamflow. E.g. on P6 L13 you write reference model run; I assume this is the same as model streamflow? This is somewhat confusing if you state it twice in two different paragraphs.
AR: Thank you. We will change "model streamflow" to "reference streamflow" in the section title and in the text. AC: P6L13+14+17+19 We changed to "reference streamflow" throughout the text. We added one sentence to clarify this: "In this study, we used the forecasts issued at 00:00 and aggregated daily values for the meteorological 24-hour period defined as 06:00-06:00 to provide forecasts for lead times up to nine days." L7: The reference "ECMWF (2018a)" only provides the documentation and support page of the ECMWF. The specific documentation can be downloaded. The scientific basis of the ENS system has been discussed in multiple publications, and it might be worth referencing some of them and pointing to this documentation for specific points only.
AR: We would like to keep the sentence and reference as they are, since this provides a detailed overview of the model cycles. We provide an additional sentence, including references, in the description of ECMWF. AC: P6L24-27 We moved and rephrased "In short, the 50 ensemble members of ENS are generated by adding small perturbations to the forecast initial conditions and model physics schemes, subsequently running the model with the different perturbed conditions. The ensemble represents the temperature forecast uncertainty. A more detailed description of the ECMWF ENS system is provided in e.g. Buizza et al. (1999) and Persson (2015)." L8: "the ensemble members of ENS are…" AR: We will change 'model streamflow' to 'reference streamflow', but we prefer to keep section 2.2.3, since in section 2 we describe the data and models, whereas in section 3 we describe how we used the data. AC: We changed to "reference streamflow" throughout the text.
Are the ENS forecasts temporally aggregated as well?
AR: The ENS are also temporally aggregated; see p7 l1-2 (3.1.1), l15-16 (3.1.2), and Fig. 2. AC: P7L13-15 We added "All temperature forecasts were aggregated to daily time steps since the operational HBV model runs on a daily time step and the SeNorge data used as a reference provide only daily values." L25: Replace "include" with "referred to as". AR: Thank you. AC: P7L22 Changed to "refers to". L27: Use the same units for both grids, ° or km. Best would be to use both units for both grids, one of them in brackets.
AR: We think it is more accurate to use degrees for the ECMWF grid, but we will add a parenthesis with the grid resolution in km. Hence, we use degrees and km for EC, and only km for SeNorge. AC: P7L25 We changed "... resolution of 0.25° (~30 km)".

3.1.1
What is the rationale behind the choice of using a nearest neighbour technique?
AR: We also tested other techniques, e.g. bilinear interpolation, which has a higher computational demand and creates larger output files than the nearest-neighbour interpolation. Since the quality of the forecasted temperature was almost the same, the reduced computing time and smaller storage requirements made the nearest-neighbour method more practical. AC: We introduced no changes in the manuscript.
AR: Thank you. AC: We updated the Reference list.
L8: Can you give a reference for the sentence "gives a higher skill and are less biased"? AR: The reference is Engdahl et al. (2015). AC: P8L8 We included Engdahl et al. (2015) in the text and in the Reference list.
L20: Ensemble forecast verification does not only focus on reliability and sharpness. Therefore, different measures need to be taken into account (biases are important as well).
AR: In this sentence we refer to a specific paper  where reliability and sharpness are used for the evaluation of forecasts. We also think the bias is part of the evaluation of reliability: if the forecast is biased, it will not be reliable. In the rank-histogram decomposition, the slope will identify bias in the forecasts. AC: We introduced no changes in the manuscript.
L30: "lowest and highest forecasted value" — does this mean the minimum and maximum? Why not the 10th and 90th percentiles and the interquartile range? I think this gives a better estimate of the sharpness of the forecast, as it does not only account for the most extreme members.
AR: We agree that a specific interquantile range might be a more robust measure for sharpness, and we used the range between the 5th and the 95th percentiles to evaluate the spread. AC: P9L5-8 We changed to "In this study, the temperature sharpness was assessed by first estimating the range between the 5th and the 95th percentiles of the ordered ensemble forecasts for all issue dates, lead times and catchments. For streamflow, we estimated a relative sharpness by dividing the 5th to 95th percentile range by the ensemble mean. Thereafter, sharpness was determined for each catchment and lead time as the average range over all issue dates." P8 L12: I would rephrase the sentence to "which a skilful forecast should outperform" and write it as a single sentence.
AR: We think the sentence is fine as it is. AC: We introduced no changes in the manuscript. The description of the slope and convexity is hard to follow. Could you give an example of what the values really tell, e.g. how does a rank histogram look with a convexity of 2000? I think rank histograms are very useful for visual interpretation, and the convexity and slope somehow reduce the usefulness of the rank histogram, at least to people not familiar with these parameters.
AR: We used the convexity and slope since this makes it much easier to provide aggregated information on forecast performance. In our results, we do not focus on the values themselves; the change of the values is the important information. We find that Jolliffe and Primo (2007) provide detailed information.
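To illustrate how a single slope number summarizes a rank histogram, a hypothetical sketch follows. This is not the Jolliffe and Primo (2007) decomposition itself: the function names are assumptions, ties between observation and members are ignored, and a plain least-squares slope stands in for the formal test statistic.

```python
import numpy as np

def rank_histogram(obs, ens):
    """Count the rank of each observation within its ensemble.

    obs : array of shape (n,), one observation per forecast case
    ens : array of shape (n, n_members), the matching ensemble forecasts
    Returns counts for ranks 0..n_members (flat counts = reliable ensemble).
    """
    obs = np.asarray(obs, dtype=float)
    ens = np.asarray(ens, dtype=float)
    ranks = np.sum(ens < obs[:, None], axis=1)  # members below the observation
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

def histogram_slope(counts):
    """Least-squares slope of the histogram bars. For temperature, a negative
    slope flags too-warm forecasts (observations pile up in the low ranks),
    a positive slope too-cold forecasts."""
    x = np.arange(len(counts))
    return float(np.polyfit(x, counts, 1)[0])

# Example: a warm-biased ensemble always above the observation puts every
# observation in rank 0, giving a strongly negative slope.
counts = rank_histogram(np.zeros(10), np.ones((10, 5)))
```

The aggregation benefit mentioned in the reply is that one slope (and one convexity) value per catchment and lead time can be tabulated and compared, whereas full histograms can only be inspected one at a time.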
AC: P8L28-31 We rephrased and elaborated more on the rank-histogram evaluation: "A bias in the ensemble forecast is recognized as a slope in the rank histogram, where a negative slope indicates too warm temperature forecasts and a positive slope too cold forecasts. A U-shape indicates that the ensemble forecast is under-dispersed, whereas a convex shape indicates over-dispersion (Hamill, 2001)." AR: There are two reasons for the small changes during summer: (i) the skill of the uncalibrated temperature forecasts is higher in summer, and (ii) there is less or no snow in summer, which reduces the streamflow sensitivity to temperature. Ref comments RC#3 and editor: we omitted the results for summer and winter. AC: P10L16-19 We added "Summer (July to September) was excluded due to the relatively small changes in CRPSS, explained by (i) the skill of the uncalibrated temperature forecasts being higher and the potential for improvement lower, and (ii) there being less or no snow in summer, resulting in a reduced streamflow sensitivity to temperature. Winter (January to March) was excluded since it performs similarly to the autumn." AR: For the slope of the regression lines being different from zero, we used a significance level of p-values < 0.05. This information is available in the caption text for Fig. 9. We will include this in the text. AC: P12L12 We included "By indicating the significance and sign of the relationships, significant relationships were found for 12 out of 40 regression equations (5% significance level)."

Discussion
Here I would again use words instead of Tens and Tcal only: "Both raw (Tens) and calibrated (Tcal) temperature forecasts were more skilful with …". I think it makes the text more interesting to read. This could be adapted in different parts of the manuscript; at the beginning of each section this should be repeated.
AR: We will introduce the abbreviations at the beginning of the sections. AC: P12L29-30 We changed according to the suggestion.
L5-9: "Overall, the grid calibration of temperature had a positive effect on both …", but the lines before state "…, resulted in reduced skill". This is somewhat contradictory; could you make this clearer?
AR: The last sentence refers to the difference between raw and calibrated ensembles for all lead times, and we see that the grid calibration improves the performance for most scores and lead times. The previous statements are related to the development of performance with increasing lead time. In short, the CRPSS is reduced with increasing lead time, but it is better for calibrated than for raw ensembles. AC: P13L4-5 We changed the sentence: "Overall, the grid calibration of temperature had a positive effect on both temperature and streamflow for most validation scores and lead times." AR: Thank you. AC: P14L21 Corrected to "Hence, estimated streamflow has a high…". L7: "indicate", delete the additional s. AR: Thank you. AC: P14L24 Corrected.
L18: "the bias in Tens is explained by" — I think this statement is too strong. It can be an explanation, but I think it cannot be reduced to this single causality, as you state in the next sentence.
AR: Thank you.
AC: This sentence is removed from sec 5.3, and the content rewritten in sec 5.1. L21: "The Tens CRPSS is skilful" — forecasts have a positive CRPSS and are skilful. The current formulation is not logical; a CRPSS is not skilful.
AR: We will rephrase to clarify that skilful refers to the forecast. AC: P13L29 We changed to "…, the CRPSS shows that the uncalibrated Tens is skilful for both…". L28: Please state these characteristics very shortly again here.
AR: We will modify as suggested. AC: P15L17-18 We changed the text as follows: "Only a few significant relationships between the catchment characteristics, e.g. catchment area and elevation gradient, and skill were found." P13 L1: I don't understand what you mean with "the averaging effect on temperature skill dominates".
If I understand correctly, you could discuss here what the difference would be if you used a spatially distributed hydrological model (e.g. a gridded version of the model with high resolution). The effect of temperature downscaling might be larger in this case, because you do not average temperature again after the downscaling, and the spatial distribution within a catchment would have a much larger effect, especially in catchments with high spatial variability of soil properties, altitude and vegetation cover.
AR: What we discuss in this paragraph is the effect of catchment size on the performance of the forecasts. We think that forecasts for small catchments are more sensitive than those for large catchments to the spatial pattern of forecasted temperature. The reasons are that (i) the smallest catchments are smaller than the grid size of the ECMWF model and (ii) it is more challenging to forecast weather on small spatial scales than on large spatial scales. AC: P15L21-24 We rephrased "This result is not conclusive, but indicates that (i) the smallest catchments are smaller than the grid size of the ECMWF model and therefore very sensitive to the pre-processing, and (ii) it is more challenging to forecast weather on small spatial scales than on large spatial scales."
L13: "the calibrated temperature reduced the skill of the forecasted streamflow." Please state what skill measure you mean here; did you calculate the CRPSS or bias for that specific event? In the results you only describe the range of the calibrated/uncalibrated ensembles but not a measure of skill.
AR: You are right; in this sentence, the use of skill is misleading. We did not calculate a specific measure of skill, but merely point to the fact that, compared to the reference streamflow, the calibrated temperature forecast induces too high streamflow, and the error becomes larger. A better word might be performance. AC: P16L12 Changed to "performance". L15-17: I think you would like to point out that other errors (in the meteorological dataset and the hydrological model) do influence the results. If so, the sentence should be rephrased. Now the reader might think that forecasts always get worse if they are calibrated, and this would be an argument against your conclusive statement in the summary on Page 14/L19-18.
AR: We agree, and will add a sentence to clarify this. We will also remove the streamflow observations from the figure and consequently from the discussion. AC: P16L13-15 Rewritten "The deterioration in the forecast performance using calibrated temperature is particular to this event. Other results provided in this study show clearly that the calibrated temperature ensembles improve the streamflow ensemble forecasts on average." Figures: Figure 1: Write "grouped" instead of "divided". Something is wrong in the first sentence "this study shown using". Please rephrase.
AR: We will rephrase the caption. AC: P24 Rephrased caption "The maps of Norway indicate the 139 catchments used in this study. The left map shows the catchment boundaries, including the location of four selected catchments. Please note that many catchments are small and difficult to detect. The locations of the catchments' gauging stations are shown in the right map. Norway was grouped into five regions (N=north, M=mid, W=west, S=south, and E=east); all regions are marked with colours and regional boundaries." AR: We will have a look at the box plots; the artefacts in the figures will probably disappear in the finishing stage, as all figures will be provided separately. We used partly overlapping boxes for each lead time to increase the readability of the figure, since it is then easy to see which boxes belong to the same lead time. We tried without, but found it more difficult to read the plot. AC: No changes introduced. AR: Thank you. This will be corrected as suggested. AC: P30 We changed the figures and corrected "Tobs" to "To" in both plots. AR: The artefacts in the figures will probably disappear in the finishing stage; all figures will be provided separately. AC: We will check that the line artefacts are not present in the final manuscript. AR: OK. We will make some changes to this figure, ref A#3. We prefer, however, not to use box plots; we think that the use of lines and shaded areas increases the readability of the figures. AC: P38 We have changed the figure. The background colours and the streamflow observation are removed.
Captions: Forecast issue date is the date when the forecast was issued, hence the x-axis could be different for each panel in this figure. I recommend adapting the caption to make this clearer, e.g. target day instead of issue date.
AR: Thank you. We will follow the suggestion. AC: P38 Changed to "target day" in the caption of Fig. 10.

Abstract. The Norwegian flood forecasting system is based on a flood forecasting model running on catchments located all across Norway. The system relies on deterministic meteorological forecasts and uses an auto-regressive post-processing algorithm to achieve probabilistic streamflow forecasts and thus a measure of uncertainty. In this study, we used meteorological ensemble forecasts as input to hydrological models to quantify the uncertainty in forecasted streamflow, with a particular focus on the effect of temperature forecast calibration on the streamflow ensemble forecast skill. In catchments with seasonal snow cover, snowmelt is an important flood generating process. Hence, high quality air temperature data are important to accurately forecast streamflow. The sensitivity of streamflow ensemble forecasts to the calibration of temperature ensemble forecasts was investigated using
ensemble forecasts of temperature from ECMWF covering a period of nearly three years, from 01.03.2013 to 31.12.2015. To improve the skill and reduce biases of the temperature ensembles, the Norwegian Meteorological Institute provided parameters for ensemble calibration, derived using a standard quantile mapping method where Hirlam, a high resolution regional weather prediction model, was used as reference. Estimated observed daily temperature and precipitation were obtained from the SeNorge dataset, which is station data interpolated to a 1×1 km² grid covering all of Norway. A lumped HBV model distributed on 10 elevation zones was used to estimate the streamflow.
The results show that temperature ensemble calibration affected both temperature and streamflow forecast skill, but differently depending on season and region. We found a close to 1:1 relationship between temperature and streamflow skill change for the spring season, whereas for autumn and winter large temperature skill improvements were not reflected in the streamflow forecasts to the same degree. This can be explained by streamflow being less affected by sub-zero temperature improvements, which accounted for the biggest temperature biases and corrections during autumn and winter. The skill differs between regions. In particular, there is a cold bias in the forecasted temperature during autumn and winter along the coast, enabling a large improvement by calibration. The forecast skill was partly related to elevation differences and catchment area.

The HBV model (Bergstrom, 1976; Saelthun, 1996; Beldring, 2008) is used as the hydrological forecasting model, which, combined with statistical uncertainty models (Langsrud et al., 1998a; Langsrud et al., 1998b), provides probabilistic streamflow forecasts. The uncertainty model accounts for the strong autocorrelation in forecast errors and estimates an uncertainty band around the deterministic temperature, precipitation and streamflow forecasts (Langsrud et al., 1998a; Langsrud et al., 1998b).
An alternative approach to estimate probabilistic streamflow forecasts is to use meteorological ensemble forecasts from numerical weather prediction models as a means to account for uncertainty in the forcing. The meteorological ensemble forecasts are created by perturbing both the initial states and the physics tendencies of the original deterministic forecast. The spread of the ensemble members can be interpreted as the uncertainty of the forecast, where a large spread indicates large uncertainty (Buizza et al., 1999; Persson, 2015). Subsequently, the meteorological ensemble is used as forcing for a hydrological model to produce an ensemble of forecasted streamflow, referred to as a hydrological ensemble prediction system (HEPS). HEPS are increasingly being used in flood forecasting (Cloke and Pappenberger, 2009; Wetterhall et al., 2013). A HEPS adds value to a flood forecast by assessing the forecast uncertainty caused by uncertainties in one or several parts of the modelling chain.
Raw (unprocessed) ensembles are rarely reliable in a statistical sense (Buizza, 1997;Wilson et al., 2007). Reliability means that the observation behaves as if it belongs to the forecast ensemble probability distribution (Leutbecher and Palmer, 2008).

To improve reliability, the ensemble forecasts can be calibrated by applying statistical techniques correcting bias and under/over-dispersion (Hamill and Colucci, 1997; Buizza et al., 2005; Persson, 2015). From a hydrological perspective, pre-processing refers to techniques (i.e. downscaling and calibration) applied to the meteorological ensembles, whereas post-processing refers to techniques applied to the hydrological ensembles. Examples of methods used to calibrate meteorological ensembles are ensemble model output statistics (EMOS) (Wilks and Hamill, 2007), Bayesian model averaging (BMA) (Wilson et al., 2007), ensemble Kalman filters (Verkade et al., 2013), non-homogeneous Gaussian regression (Gneiting et al., 2005; Wilks and Hamill, 2007), quantile mapping (Bremnes, 2007), and kernel dressing (Wang and Bishop, 2005). These methods differ in their sensitivity to the length of the training data and the ensemble size, and in how spread and bias are corrected.
Downscaling implies resampling from the original forecast grid to a grid of higher resolution, and both statistical (e.g. interpolation) and dynamical (e.g. a regional weather forecast model) techniques can be used (Schaake et al., 2010). A recent review of pre-processing methods is given in the textbook edited by Vannitsem et al. (2018).
In climates with seasonal snow cover, snowmelt during the spring season is an important flood-generating process. In these climates, temperature is a key variable to classify the precipitation phase and to estimate the snowmelt rate. The sensitivity of daily streamflow to temperature is non-linear, since streamflow depends on temperature thresholds for rain/snow partitioning and for snow melt/freeze processes. The latter depends on the state of the system, i.e. snow is needed to generate snowmelt.

For temperatures well below zero degrees, streamflow is not sensitive to temperature, whereas for temperatures around zero degrees relatively small changes in temperature might control whether precipitation falls as rain or snow, and consequently whether streamflow is generated or not, which may have long-memory effects due to the snow storage (Gragne, 2015). Most Norwegian catchments experience a seasonal snow cover, but are otherwise diverse in terms of the length of the snow season and topographic complexity (Rizzie et al., 2017).

Forecasting, downscaling, and interpolating air temperature in complex topography are challenging, mostly because temperature lapse rates depend on several factors, i.e. altitude, time and place, as well as specific humidity and air temperature (Aguado and Burt, 2010; Pagès and Miró, 2010; Sheridan et al., 2010). Errors in forecasted temperature might result in a misclassification of the precipitation phase and/or cause the hydrological forecasting system either to miss a flood event or to provide a false alarm due to too high or too low snowmelt rates. It is therefore important to assess the relationship between temperature and streamflow forecasts. The importance of reliable temperature forecasts for streamflow forecasts is demonstrated for two alpine catchments during a heavy precipitation event in Ceppi et al. (2013). An interesting finding in this paper is that the catchment elevation distribution, and thereby the area above the snowline, was important for how streamflow forecasts were affected by temperature uncertainty. Verkade et al. (2013), on the other hand, found only modest effects of temperature calibration on streamflow forecast skill as an average over several years for Rhine catchments.

As far as the authors know, the isolated effect of uncertainties in temperature forecasts has not yet been systematically investigated for a larger number of catchments in a cold climate. The large spatial and seasonal variations in snow accumulation and snowmelt processes found in cold regions with complex terrain require that both spatial and seasonal patterns in the performance of temperature and streamflow forecasts are evaluated.

The main objective of this study is to investigate the effect of temperature forecast calibration on the streamflow ensemble forecast skill in catchments with seasonal snow cover, and to identify potential improvements in the forecasting chain. In particular, we address the following research questions: • Are there seasonal effects of temperature calibration on the temperature ensemble forecast skill?
• Are there seasonal effects of temperature calibration on the streamflow ensemble forecast skill?

• Are there spatial patterns in the temperature and streamflow ensemble forecast skill and if so, can these be related to catchment characteristics?
To answer these questions, we applied temperature ensemble forecasts from ECMWF, combined with the pre-processing setup from MET Norway, to 139 catchments in Norway. Three years of retrospective operational ECMWF forecasts from 2013-2015 were used to re-generate streamflow forecasts, and the skill of the temperature and streamflow forecasts was systematically evaluated for these catchments. To investigate the isolated effect of the temperature ensembles on the streamflow forecasts, the observed SeNorge precipitation (Tveito et al., 2005) was used instead of the precipitation ensemble forecasts to run the hydrological model. Finally, a case study is presented, demonstrating the effect of temperature calibration on a single snowmelt induced flood event. We start by presenting the study area, data and hydrological model (HBV) used (Sect. 2). In Sect. 3, the methods used to establish the hydro-meteorological forecasting chain, the skill metrics and the evaluation strategy are presented. Section 4 contains the results, followed by a discussion in Sect. 5. Finally, in Sect. 6, the findings are summarized, conclusions are drawn, and further research questions are discussed.

Study area
In Norway there are large spatial variations in climate and topography, and a recent overview of past, current and future climate is given in Hanssen-Bauer et al. (2017). The western coast has steep mountains, high annual precipitation (4000-5000 mm/year) and a temperate oceanic climate. Inland areas have less precipitation, larger differences between winter and summer temperatures, and climatic zones from humid continental to subarctic and mild tundra (according to the Köppen-Geiger system, see Peel et al. (2007)). The mean annual runoff follows to a large degree the spatial patterns of precipitation. The two basic flood generating processes are snowmelt and rainfall (Vormoor et al., 2015). Most catchments in Norway have prolonged periods of sub-zero temperatures during winter, resulting in a seasonal snow storage, winter low flow, and increased streamflow during spring due to snowmelt. The relative importance of rainfall and snowmelt processes is determined by the duration of the snow accumulation season and the share of annual precipitation stored as snow. Across Norway, two basic runoff regimes can be identified: (i) coastal regions with high flows during autumn and winter due to heavy rainfall, and (ii) inland regions with high runoff during spring due to snowmelt (Vormoor et al., 2015). However, there are many possible transitions between these two basic patterns (Gottschalk et al., 1979).

The national flood-forecasting system builds on hydrological models providing streamflow forecasts in 145 catchments, covering most parts of Norway, varying in size (~3 to 15447 km²) and elevation difference (103 to 2284 m). The latter is calculated as the difference between the lowest and the highest point on the hypsographic curve, ΔH = H100 − H0. The flood forecasting catchments are mostly pristine, although some do have minor (hydropower) regulations. Fourteen catchments have a glacier coverage of 5 % or more. Of the 145 flood forecasting catchments, 139 have data of sufficient quality and were chosen as the basis for the study (Fig. 1). The catchments were grouped into five regions based on their location; North (N), South (S), West (W), Mid (M), and East (E), following Hanssen-Bauer et al. (2017) and Vormoor et al. (2016) (Fig. 1, right).
These regions are defined by the boundaries of the major watersheds, and reflect major hydro-climatological zones. Rainfall floods dominate in South, West, and Mid, whereas snowmelt floods dominate in East and North. There is still a large variability in hydrological regimes within individual regions. Figure 1 includes the location of four catchments for which more detailed results will be presented. Gjuvaa (E), Foennerdalsvatn (W) and Viksvatn (W) were used to visualize the challenges in temperature forecasts, and both uncalibrated and calibrated ensemble values will be presented for these three catchments. Viksvatn (W) and Foennerdalsvatn (W) are located in western Norway and both catchments are partly covered by glaciers (~3 % and 47 %, respectively). Gjuvaa (E) is non-glaciered and located inland in southeastern Norway (Fig. 1, left). The Bulken (W) catchment was chosen to demonstrate the effect of temperature calibration on the streamflow forecast for a snowmelt driven flood event.

Interpolated precipitation and temperature observations - SeNorge data
In Norway, a network of about 400 precipitation stations and 240 temperature stations provides daily temperature and precipitation values. These in situ observations are interpolated to create a gridded (1×1 km²) product, referred to as SeNorge (available at SeNorge.no, Tveito et al., 2005). In this study, we used version 1.1. For this version, the gridded temperature is calculated by kriging, where both the elevation and the location of the temperature stations are accounted for. The observed daily precipitation is corrected for under-catch at the gauges, and triangulation is used for spatial interpolation to a 1×1 km² grid. A constant gradient of 10 % per 100 m below 1000 meters above sea level (masl) and 5 % per 100 m above 1000 masl is applied to account for elevation gradients in precipitation (details can be found in Tveito (2002), Tveito et al. (2005), and Mohr (2008)).

Hydrological model -HBV
The HBV model (Bergstrom, 1976), as presented in Saelthun (1996) and Beldring (2008), constitutes the basis for this study.
The vertical structure of the HBV model consists of a snow routine, a soil moisture routine, and a response function that includes a nonlinear reservoir for quick runoff and a linear reservoir for slow runoff. The model uses catchment average temperature and precipitation as input. Each catchment is divided into 10 elevation zones, each covering 10 % of the total catchment area. The catchment average precipitation and temperature are elevation adjusted to each elevation zone using catchment specific lapse rates. In this study, we used the operational model set-up, which has been calibrated using the PEST software for parameter estimation (Doherty, 2015), with Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970) and volume bias as calibration metrics. The calibration period, 1996-2012, gives a mean Nash-Sutcliffe efficiency of 0.77 with zero volume bias. The validation period, 1980-1995, shows a mean Nash-Sutcliffe efficiency of 0.73 with a mean volume bias of 5 % (personal communication, Gusong, 2016).
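The elevation-zone adjustment described above can be sketched as follows. This is an illustrative sketch only, not the operational HBV code; the function name, the example lapse rate of -0.6 °C per 100 m and the precipitation gradient of 5 % per 100 m are assumptions chosen for demonstration, since the actual lapse rates are catchment specific.

```python
import numpy as np

# Illustrative sketch: distribute catchment-average forcing onto 10
# equal-area elevation zones using assumed lapse rates (not the
# operational, catchment-specific values).

def zone_forcing(t_avg, p_avg, zone_elev, ref_elev,
                 t_lapse=-0.6, p_gradient=0.05):
    """Adjust catchment-average temperature (deg C) and precipitation (mm)
    to each elevation zone.

    t_lapse    : temperature change per 100 m (deg C), assumed value
    p_gradient : fractional precipitation increase per 100 m, assumed value
    """
    dz = (np.asarray(zone_elev) - ref_elev) / 100.0   # elevation diff in 100 m units
    t_zone = t_avg + t_lapse * dz
    p_zone = p_avg * (1.0 + p_gradient * dz)
    return t_zone, np.clip(p_zone, 0.0, None)         # precipitation >= 0

# 10 zones, each covering 10 % of the catchment area
zones = np.linspace(400.0, 1300.0, 10)               # zone mean elevations (m)
t_z, p_z = zone_forcing(t_avg=2.0, p_avg=10.0, zone_elev=zones, ref_elev=850.0)
```

With these assumed gradients, the lowest zone is warmer and drier than the catchment average and the highest zone colder and wetter, which is the behaviour the model set-up relies on for rain/snow partitioning per zone.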

Model reference streamflow
Model reference streamflow, Qo(c,t), where c is an index for catchment and t for time, was derived using SeNorge precipitation and temperature, aggregated to the catchment scale, as forcing to the HBV model (Fig. 2, see "Reference mode" in the green frame). In order to isolate the effect of temperature calibration on forecasted streamflow and to avoid effects of hydrological model deficiencies, the model reference streamflow was used as the benchmark when the streamflow forecasts were evaluated. Similarly, the operational flood warning levels (here demonstrated for the case study basin, Bulken) are based on return periods derived from model reference streamflow.

We used the ECMWF temperature forecast ensemble (ENS) for the period 01.03.2013 to 31.12.2015 on an original grid resolution of 0.25° (i.e. model cycles/versions 38r1/2, 40r1, and 41r1 (ECMWF, 2018b)). This period covers the model cycles/versions for which temperature grid calibration parameters are trained (40r1 and 41r1, see Sect. 3.1.2) plus spring 2013 (cycle 38r1/2), in order to include one more snowmelt season. In short, the 50 ensemble members of ENS are generated by adding small perturbations to the forecast initial conditions and model physics schemes and subsequently running the model with the different perturbed conditions. The ensemble represents the temperature forecast uncertainty. A more detailed description of the ECMWF ENS system is provided in e.g. Buizza et al. (1999), Buizza et al. (2005) and Persson (2015). For each issue date d, 51 ensemble members Tens[lat, lon, m, l*] are provided for lead times up to 246 hours, where m is the ensemble member and l* the lead time in 6 hour intervals. In this study, we used the forecasts issued at 00:00 and aggregated daily values for the meteorological 24-hour period defined as 06:00-06:00 to provide forecasts for lead times up to nine days. The observational time t for a forecast is d + l*. For a full description of the ECMWF ENS product, see ECMWF (2018a).

Ensemble forecasting chain
Figure 2 shows the forecasting modelling chain designed for this study.

Temperature forecast downscaling
In this paper, the term downscaling refers to the interpolation of temperature from a low resolution grid to a high resolution grid where vertical temperature gradients are accounted for. The ECMWF grid temperature, which represents the average temperature for the grid cell, was interpolated from a horizontal resolution of 0.25° (~30 km) to the 1×1 km² SeNorge grid using the nearest neighbour method, in order to match the spatial resolution of the SeNorge data. Due to the elevation difference between the ECMWF and SeNorge grids, we corrected the ensemble temperature at the 1×1 km² scale by applying a standard atmospheric lapse rate of -0.65 °C/100 m. Finally, the downscaled temperature ensemble was aggregated to daily values and averaged over the catchment areas to provide Tens[c,m,l] for a given lead time and ensemble member.
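The two steps above (nearest-neighbour refinement followed by a lapse-rate correction for the elevation difference between the two grids) can be sketched as below. This is a minimal illustration, not the operational code: the grid sizes, elevations and the `downscale_temperature` helper are invented for the example; only the -0.65 °C/100 m lapse rate is taken from the text.

```python
import numpy as np

# Sketch of the downscaling step: nearest-neighbour interpolation of a
# coarse temperature field to a fine grid, then a standard-lapse-rate
# correction (-0.65 degC / 100 m) for the elevation difference between
# the coarse-grid and fine-grid elevation models. Grids are illustrative.

LAPSE = -0.65 / 100.0  # degC per metre (standard atmospheric lapse rate)

def downscale_temperature(t_coarse, elev_coarse, elev_fine, factor):
    """Refine a coarse field by `factor` with nearest neighbour, then
    correct for the coarse-vs-fine elevation difference."""
    t_nn = np.kron(t_coarse, np.ones((factor, factor)))      # nearest neighbour
    elev_nn = np.kron(elev_coarse, np.ones((factor, factor)))
    return t_nn + LAPSE * (elev_fine - elev_nn)

t_coarse = np.array([[1.0, 2.0], [3.0, 4.0]])        # coarse-grid temperature (degC)
elev_coarse = np.full((2, 2), 500.0)                 # coarse-grid elevation (m)
elev_fine = np.full((4, 4), 700.0)                   # fine-grid elevation (m)
t_fine = downscale_temperature(t_coarse, elev_coarse, elev_fine, factor=2)
```

Here every fine cell lies 200 m above its coarse parent, so each temperature is lowered by 1.3 °C; in the real chain the fine-grid elevations come from the 1×1 km² SeNorge terrain model.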

Temperature grid calibration
The grid temperature is calibrated using quantile mapping (Bremnes, 2004, 2007; Seierstad, 2016) to remove biases by moving the ENS forecast climatology closer to the observed climatology. MET Norway provided the temperature grid calibration parameters used in this study. This grid calibration is used in the operational post-processing chain for meteorological forecasts, including the forecasts published on yr.no. MET Norway uses the Hirlam (Bengtsson et al., 2017) temperature forecast (on a 4×4 km² grid) as a reference for parameter estimation (calibration). Hirlam is suitable as a reference since it provides a continuous field covering all of Norway at a sub-daily time step. In addition, Hirlam has a higher skill and is less biased than the ENS. To establish the calibration parameters, MET Norway used both the ENS re-forecast (Owens, 2018) and Hirlam data from July 2006 to December 2011, interpolated to a 5×5 km² grid. The ENS re-forecast is a 5-member ensemble generated from the same model cycles (40r1 and 41r1) as the ENS.

For each grid cell, unique monthly quantile transformation coefficients are determined using data from a three-month window centred on the target month (e.g. the May analysis consists of April, May and June; personal communication, Seierstad, 2017). The same coefficients, based on mapping the first 24 hours, were applied to all lead times and members. For forecasts outside the observation range, a 1:1 extrapolation was used, i.e. if a forecast is 2 °C higher than the highest mapped forecasted temperature, then the calibrated forecast is 2 °C higher than the highest mapped reference temperature.
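The quantile mapping with 1:1 extrapolation described above can be sketched as follows. This is a generic illustration of the technique, not MET Norway's implementation: the training samples are synthetic, the 101 mapped quantiles are an arbitrary choice, and in the operational chain the reference climatology comes from Hirlam rather than observations.

```python
import numpy as np

# Minimal quantile-mapping sketch with 1:1 extrapolation: inside the
# trained range, forecast quantiles are mapped onto reference quantiles;
# outside it, the offset beyond the range endpoint is preserved.

def quantile_map(forecast, fc_train, ref_train):
    q = np.linspace(0.0, 1.0, 101)
    fc_q = np.quantile(fc_train, q)    # forecast climatology quantiles
    ref_q = np.quantile(ref_train, q)  # reference climatology quantiles
    forecast = np.asarray(forecast, dtype=float)
    out = np.interp(forecast, fc_q, ref_q)
    # 1:1 extrapolation outside the mapped range
    hi = forecast > fc_q[-1]
    lo = forecast < fc_q[0]
    out[hi] = ref_q[-1] + (forecast[hi] - fc_q[-1])
    out[lo] = ref_q[0] + (forecast[lo] - fc_q[0])
    return out

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 3.0, 5000)                 # synthetic reference temperatures
fc = ref - 2.0 + rng.normal(0.0, 0.5, 5000)      # forecasts with a ~2 degC cold bias
calibrated = quantile_map(fc, fc_train=fc, ref_train=ref)
```

After mapping, the cold bias of the synthetic forecasts is essentially removed, and a forecast 2 °C above the highest trained value maps to 2 °C above the highest reference value, mirroring the 1:1 rule in the text.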

For this study, we applied the calibration coefficients provided by MET Norway to the temperature forecasts for the period 2013-2015. Accordingly, the ENS was interpolated to the 5×5 km² grid, on which the quantile mapping coefficients were used to obtain the calibrated temperature ensembles (Tcal). Subsequently, the calibrated ensembles on the 5×5 km² grid were downscaled to the 1×1 km² grid following the same procedure as for the uncalibrated temperature ensemble (Tens, Sect. 3.1.1). Finally, the calibrated temperature ensemble was aggregated to daily values and averaged over the catchment areas to provide Tcal[c,m,l].

Validation scores and evaluation strategy
The evaluation focused on the performance of the temperature forecast ensembles, and on the effect of both uncalibrated and calibrated temperature forecasts on the performance of the streamflow ensembles. A well performing ensemble forecast should be reliable and sharp, where reliability has the first priority. A forecast is considered reliable if it is statistically consistent with the observed uncertainty, i.e. 90 % of the observations should verify within the 90 % forecast interval. Rank histograms are often used for visual evaluation of reliability, and show the frequencies of observations amongst the ranked ensemble members. For reliable ensemble forecasts, the rank histogram will be uniform (horizontal). A bias in the ensemble forecast is recognized as a slope in the rank histogram, where a negative slope indicates too warm temperature forecasts and a positive slope too cold forecasts. A U-shape indicates that the ensemble forecast is under-dispersed, whereas a convex shape indicates over-dispersion (Hamill, 2001). In order to quantify the reliability, a decomposition of the chi-square test statistic for the rank histogram was used to describe the rank histogram's slope (bias) and convexity (dispersion) (Jolliffe and Primo, 2008). Both the rank histogram slope and the convexity are negatively oriented, i.e. lower values are better, with an optimal value of zero for unbiased and uniformly distributed data. The sharpness of a reliable forecast is described by the spread between the ensemble members, where a sharp forecast has a small spread and is the most useful (Hamill, 2007). In this study, the temperature sharpness was assessed by first estimating the range between the 5th and the 95th percentile of the ordered ensemble forecasts for all issue dates, lead times and catchments. For streamflow, we estimated a relative sharpness by dividing the 5th to 95th percentile range by the ensemble mean. Thereafter, the sharpness was determined for each catchment and lead time as the average range over all issue dates.
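The rank histogram and a Jolliffe-and-Primo-style decomposition can be sketched as below. This is an illustrative reading of the method, not the authors' code: the chi-square deviations are projected onto orthonormal linear (slope/bias) and quadratic (convexity/dispersion) contrasts, and the data are synthetic.

```python
import numpy as np

# Sketch: rank histogram of observations within an ensemble, and a
# decomposition of the chi-square statistic into slope (bias) and
# convexity (dispersion) components via orthonormal contrasts.
# Zero is optimal for both components.

def rank_histogram(obs, ens):
    """Counts of observation ranks within the ensemble (ens: n_fc x n_mem)."""
    ranks = (ens < obs[:, None]).sum(axis=1)            # rank 0..n_mem
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

def slope_convexity(counts):
    k = counts.size
    expected = counts.sum() / k                          # uniform expectation
    dev = (counts - expected) / np.sqrt(expected)        # chi-square deviations
    i = np.arange(k)
    lin = i - i.mean()
    lin = lin / np.linalg.norm(lin)                      # orthonormal linear contrast
    quad = (i - i.mean()) ** 2
    quad = quad - (quad @ lin) * lin - quad.mean()       # orthogonalize vs constant, linear
    quad = quad / np.linalg.norm(quad)                   # orthonormal quadratic contrast
    return (lin @ dev) ** 2, (quad @ dev) ** 2           # slope, convexity components

rng = np.random.default_rng(1)
obs = rng.normal(size=4000)
ens = rng.normal(size=(4000, 20))                        # statistically consistent ensemble
slope, convexity = slope_convexity(rank_histogram(obs, ens))
```

For this statistically consistent synthetic ensemble both components should be small; a warm-biased ensemble would inflate the slope component and an under-dispersed one the convexity component.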
The continuous ranked probability score (CRPS) is a summary of reliability, sharpness and uncertainty (Hersbach, 2000). CRPS measures the distance between the observation xa and the ensemble forecast, where the latter is expressed by the cumulative distribution function P(x):

CRPS = ∫ (P(x) − H(x − xa))² dx,    (1)

where H is the Heaviside function that is zero when the argument is less than zero, and one otherwise (Hersbach, 2000). The mean CRPS was calculated as the average CRPS over the study period (01.03.2013 to 31.12.2015), and is similar to the mean absolute error for deterministic forecasts. The temperature CRPS was computed using the SeNorge temperature, To, as observations, whereas the streamflow CRPS used Qo[c,t] as observations. This evaluation approach allowed us to evaluate the isolated effect of the uncertainties in the temperature forecasts since we can then, to a large degree, ignore uncertainties in the HBV model itself.
Skill scores are convenient for comparisons between forecast variables (e.g. temperature versus streamflow) and catchments since these scores are dimensionless. To calculate the continuous ranked probability skill score (CRPSS) in Eq. (2), a benchmark score (CRPS_B), which a skilful forecast score (CRPS_F) should outperform, is needed. For both temperature and streamflow, ensembles representing the daily climatology were used as benchmarks. Daily SeNorge temperature (To[c,t]) from 1958 to 2012 (i.e. 55 years) was used to create a climatological temperature ensemble of 55 members for each day of the year. Similarly, a daily streamflow climatology was established from the model reference streamflow (Qo[c,t]) calculated by the HBV model, forced with the 55 years of temperature and precipitation (To[c,t] and Po[c,t]) from the SeNorge data. CRPSS was calculated for each catchment according to Eq. (2) (Hersbach, 2000):

CRPSS = 1 − CRPS_F / CRPS_B.    (2)
CRPSS varies from −∞ to 1, where one is a perfect score. Negative values mean that the forecast performs worse than climatology, and a CRPSS equal to zero implies that it performs similarly to the benchmark (climatology in this case). The seasonal skill score was calculated by averaging the daily CRPS over only the months belonging to the target season.
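The two scores can be sketched for a single forecast as follows. This is a generic illustration, not the verification code used in the study: the ensemble CRPS is computed with the standard member-based identity (mean absolute error of the members minus half the mean absolute pairwise spread), and the skill score follows CRPSS = 1 − CRPS_F/CRPS_B against a climatological benchmark; the ensembles themselves are synthetic.

```python
import numpy as np

# Sketch of the scores used above: the ensemble CRPS via the member-based
# identity  E|X - y| - 0.5 E|X - X'|  (equivalent to the CDF integral for a
# finite ensemble), and CRPSS against a climatological benchmark ensemble.

def crps(ens, obs):
    """CRPS of one ensemble forecast `ens` against a scalar observation."""
    ens = np.sort(np.asarray(ens, dtype=float))
    term1 = np.abs(ens - obs).mean()                         # distance to obs
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean() # ensemble spread
    return term1 - term2

def crpss(crps_forecast, crps_benchmark):
    """Skill relative to a benchmark: 1 is perfect, 0 matches the benchmark."""
    return 1.0 - crps_forecast / crps_benchmark

obs = 3.0
sharp_ens = np.array([2.5, 2.8, 3.1, 3.3])    # skilful, sharp forecast ensemble
climo_ens = np.linspace(-10.0, 10.0, 55)      # 55-member climatology benchmark
skill = crpss(crps(sharp_ens, obs), crps(climo_ens, obs))
```

For a one-member "ensemble" the spread term vanishes and the CRPS reduces to the absolute error, which is the sense in which the mean CRPS resembles the mean absolute error of a deterministic forecast.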
The effect of the grid calibration on the temperature and streamflow forecast skill was evaluated by comparing the validation scores obtained using the uncalibrated (Tens) and the calibrated (Tcal) ensembles to generate the streamflow ensembles.
Spatial patterns in the forecast performance for all 139 catchments, i.e. CRPSS and differences in CRPSS between calibrated and uncalibrated temperature, were mapped for Norway. Further, box plots for the five regions (see Fig. 1) were drawn to reveal potential regional patterns. Finally, we used linear regression to identify relationships between catchment characteristics (elevation difference and catchment area) and the skill score (Tcal and Qcal CRPSS). The linear regression analysis was done for combinations of seasons and regions.
Seasonal variations in skill score were assessed by calculating CRPSS for the four seasons winter (January to March), spring (April to June), summer (July to September) and autumn (October to December). This definition of the seasons is used to better capture the snowmelt season, which for most Norwegian catchments falls in the period April to June. For this paper, we chose to focus on the results for autumn and spring. Summer (July to September) was excluded due to the relatively small changes in CRPSS, explained by (i) the skill of the uncalibrated temperature forecasts being higher and the potential for improvement lower, and (ii) there being less or no snow in summer, resulting in a reduced streamflow sensitivity to temperature. Winter (January to March) was excluded since it performs similarly to autumn.
Finally, the effect of temperature calibration on the flood warning level is illustrated for a snowmelt induced flood event in the Bulken catchment. In the operational flood warning system at NVE, the predefined flood warning thresholds are catchment specific, and the return periods are based on model reference streamflow, which is also the approach used herein.

Results
Temperature and streamflow forecasts were estimated for 139 catchments, 1036 issue dates and 9 lead times. Figure 3 presents a summary of the validation scores, CRPSS and the rank histogram decomposition, in addition to sharpness, for all lead times. Each box plot shows the variation in the validation scores between the catchments. The rank histogram slope and convexity describe the bias and dispersion in the forecasts, respectively; both can be considered measures of reliability. As shown in Fig. 3, the temperature slope and convexity improve with increasing lead time, whereas CRPSS and sharpness get poorer. For streamflow, the slope and sharpness get poorer and the convexity improves, whereas CRPSS shows small changes with lead time. To reduce the amount of presented results, the remaining part of this paper focuses on CRPSS for a lead time of 5 days.
CRPSS was chosen as the validation score since it contains information on reliability, uncertainty and sharpness, and enables a comparison between catchments. A lead time of 5 days was chosen since the reliability (convexity and slope) has then improved while some sharpness is maintained, i.e. a too large ensemble spread will increase the reliability but reduce the forecast value.

Temperature forecasts
Time series of SeNorge daily temperature To and the range of the raw (uncalibrated) temperature ensembles Tens (left panels), and scatter plots of the ensemble mean for both raw Tens and calibrated Tcal versus To (right panels), are shown for three selected catchments in Fig. 4. Gjuvaa (E), a high altitude catchment in southeastern Norway (Fig. 1), has a similar cold bias in Tens to Viksvatn (W), but for Foennerdalsvatn the bias is notable for all seasons and even increases for Tcal (Fig. 4). The Foennerdalsvatn catchment is only 7.1 km², has a high elevation and steep topography, is 47 % covered by glaciers, and is located close to the coast. The combination of all these catchment characteristics can make forecasting difficult.
Foennerdalsvatn is hence an example of how local conditions can be challenging and not well represented, neither by the numerical weather prediction model nor by the calibration methods. The two panels, one for each season, in Fig. 5 show how the change in temperature CRPSS affects the change in streamflow CRPSS for spring and autumn. For spring, the relationship is close to the 1:1 line, whereas for autumn the streamflow is less sensitive to the temperature calibration. Based on these plots, we chose to present results for autumn and spring in the remaining part of the paper. The summer season was excluded due to the relatively small changes in CRPSS for both temperature and streamflow, whereas the winter season was excluded since it performs similarly to the autumn season.
Catchment CRPSS for spring and autumn, sorted according to increasing CRPSS for Tens and Qens, are shown in Fig. 6. The figure reveals that Tens is more skilful in spring than in autumn, when Tens has no skill (i.e. CRPSS < 0) for about half of the catchments (i.e. they perform worse than the climatology). In spring, 97 % of the catchments have skilful temperature forecasts. Temperature calibration improved the temperature skill for most catchments in autumn, whereas for many catchments in spring the skill worsened. For streamflow, Qens, there are only small differences in CRPSS between spring and autumn (Fig. 6, right panels).
Calibration of temperature improved the streamflow skill, Qcal, in autumn, whereas for spring the streamflow forecast skill followed the temperature skill change and was both reduced and improved.
CRPSS for uncalibrated temperature and streamflow forecasts, and the change in CRPSS, calculated as the difference in CRPSS between calibrated and uncalibrated forecasts, were mapped for all catchments (spring in Fig. 7; autumn in Fig. 8).
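The skill change described above can be sketched as follows. This is a minimal, illustrative implementation of a sample-based CRPS estimator and the CRPSS; the function names and example values are our own, not taken from the study's actual evaluation code.

```python
# Illustrative sketch: CRPS for one ensemble forecast, CRPSS relative
# to a reference (here: climatology), and the change in CRPSS used for
# the maps. Names and data are hypothetical.

def crps_ensemble(members, obs):
    """Sample-based CRPS estimate for one ensemble and one observation:
    mean |x_i - y| - 0.5 * mean |x_i - x_j|. Lower is better."""
    n = len(members)
    error_term = sum(abs(m - obs) for m in members) / n
    spread_term = sum(abs(a - b) for a in members for b in members) / (2.0 * n * n)
    return error_term - spread_term

def crpss(crps_forecast, crps_reference):
    """Skill score: CRPSS > 0 means the forecast beats the reference."""
    return 1.0 - crps_forecast / crps_reference

# Change in skill from calibration, as mapped per catchment:
# delta_crpss = crpss(crps_cal, crps_clim) - crpss(crps_raw, crps_clim)
```

In practice the CRPS would be averaged over all forecast days of a season before forming the skill score, so that a single CRPSS value per catchment, season and lead time results.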
For autumn, however, Tens has the lowest skill for the coastal catchments (Fig. 8, left panel). A coastal low CRPSS in autumn is also seen for Qens, although less distinct than for Tens. Both temperature and streamflow CRPSS were improved by calibration for the coastal regions (Fig. 8, right panel).

Discussion
Box plots of validation scores for all catchments and lead times (Fig. 3) show that, on average, both raw Tens and calibrated Tcal temperature ensembles were more skillful, with a higher CRPSS, for shorter than for longer lead times, and that Tcal was more skillful than Tens. Even though both bias and dispersion (i.e. reliability), as measured by rank histogram slope and convexity, improved with longer lead time, the reduced sharpness and increased uncertainty resulted in a reduced skill (CRPSS). For streamflow, the bias increased with longer lead time, while dispersion improved. Further, Qcal was slightly more skillful than Qens. Overall, the grid calibration of temperature had a positive effect on both temperature and streamflow for most validation scores and lead times. The calibration procedure applied in this study involves many interpolation and downscaling steps that increase the uncertainty in the temperature forecasts. We believe that a catchment-specific temperature calibration, tailored to the needs of hydrological forecasting, would address this challenge.
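The rank histogram slope and convexity mentioned above diagnose bias and dispersion, respectively. The following is a minimal sketch of how a rank histogram could be assembled; the function name and sample data are illustrative, not the study's code.

```python
# Illustrative rank histogram (Talagrand diagram): count, over many
# forecast cases, the rank of the observation within the sorted ensemble.
# A flat histogram suggests reliability, a sloped one systematic bias,
# and a U- or dome-shape under- or over-dispersion.

def rank_histogram(ensembles, observations):
    """ensembles: list of member lists (equal size); observations: list
    of verifying values. Returns counts for ranks 0..n_members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for m in members if m < obs)  # members below obs
        counts[rank] += 1
    return counts
```

A cold-biased ensemble, for example, would pile observations into the highest rank, producing the sloped histograms implied for the raw autumn forecasts.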

Seasonal effect of temperature calibration on the temperature forecast skill
The skill of both raw (uncalibrated) Tens and calibrated Tcal temperature ensembles varies with season (Figs. 5-8). The relatively small temperature skill improvements in spring and summer, and the large skill improvements in autumn and winter, can be explained by the skill of the raw ensembles Tens. The low skill for Tens in autumn and winter is caused by a cold bias, and lays the ground for the large improvements seen for Tcal. The seasonal differences in skill and response to calibration show the importance of using seasonal calibration parameters. It is also apparent that the applied methods do not perform optimally for all seasons. For spring, the results show that several catchments have a reduction in forecast skill after calibration. By inspecting the forecasts in detail, we found an overly strong correction of temperature for some days and catchments. Quantile mapping, like most statistical techniques, is sensitive to forecasts outside the range of calibration values and period (Lafon et al., 2013), and this can explain the too strong correction in the highest Tens quantile. The use of forecasts from different model cycles might affect the consistency of the forecasts. Moreover, the calibration parameters are sensitive to the representativeness of the calibration period.
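The out-of-range sensitivity of quantile mapping noted above can be made concrete with a small sketch of empirical quantile mapping. The arrays and clamping behaviour here are our own illustration of the general method family, not the Meteorological Institute's implementation.

```python
import bisect

# Illustrative empirical quantile mapping: map a forecast value through
# the forecast CDF into the observed distribution. The edge clamping
# shows why values outside the calibration range are corrected
# unreliably (cf. Lafon et al., 2013). Names and data are hypothetical.

def quantile_map(x, fc_sorted, obs_sorted):
    """fc_sorted, obs_sorted: sorted calibration-period quantiles of
    equal length. Returns the calibrated value for forecast x."""
    if x <= fc_sorted[0]:
        return obs_sorted[0]      # below calibration range: clamped
    if x >= fc_sorted[-1]:
        return obs_sorted[-1]     # above calibration range: clamped
    i = bisect.bisect_right(fc_sorted, x) - 1
    # linear interpolation between neighbouring calibration quantiles
    frac = (x - fc_sorted[i]) / (fc_sorted[i + 1] - fc_sorted[i])
    return obs_sorted[i] + frac * (obs_sorted[i + 1] - obs_sorted[i])
```

A spring forecast warmer than anything in the calibration sample would simply receive the correction of the highest calibration quantile, which is one plausible mechanism for the over-correction described above.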
The low skill for Tens along the coast in autumn is seen in the boxplots for the regions West, Mid and North (Fig. 8) and in the time series for the western catchments Viksvatn and Foennerdalsvatn during the winter months (Fig. 4). This cold bias is documented for the Norwegian coastal areas in the cold seasons by Seierstad et al. (2016), and is mainly caused by the radiation calculations in the ECMWF model (Hogan et al., 2017). The coarse radiation grid results in warmer sea points being used to compute longwave fluxes applied over colder land points, causing too much cooling. This effect is seen in the temperature forecasts for the winters of 2014 and 2015 for the coastal catchments in Fig. 4b and c, in contrast to the inland catchment (Fig. 4a), which is less biased. The radiation resolution is improved in later model cycles (Hogan et al., 2017; Seierstad et al., 2016).

Seasonal effect of temperature calibration on the streamflow forecast skill
The skill of the temperature-calibrated streamflow ensemble forecasts, Qcal, improved for most catchments in autumn, while both improved and reduced skill was seen in spring (Figs. 5-8). Autumn streamflow skill was improved by temperature calibration for all regions; the largest improvement was seen for the coast and the regions West and Mid. Two possible explanations for this spatial pattern are (i) the improvement in temperature forecast skill during autumn in these regions, and (ii) that many coastal catchments are more sensitive to calibration of temperatures, since the temperatures are more frequently around zero degrees compared to the colder and drier inland catchments. In spring, no clear spatial patterns are seen for either Qens or the change in skill.
It is also evident that, independent of the sign of the temperature skill change (Fig. 5), a change in temperature skill has a larger impact on streamflow in spring than in autumn. An explanation may be that during autumn, for temperatures well below zero degrees, the forecasted streamflow is not affected by improved forecasted temperatures. During spring, temperatures are often close to the two threshold temperatures that control the phase of precipitation and the onset of snowmelt. Such periods are challenging to simulate correctly (Engeland et al., 2010).
Additionally important for spring, as opposed to autumn, are the snow storage at the end of winter and the snowmelt contribution to streamflow. Calculated streamflow hence has a high sensitivity to changes in temperature during spring, a sensitivity also described for Alpine snow-covered catchments by Ceppi et al. (2013). Verkade et al. (2013), on the other hand, found only marginal effects of pre-processing temperature and precipitation on the streamflow skill in the Rhine catchments. The results presented herein and in the cited papers indicate that the effect of pre-processing depends on the hydrological regime (i.e. sensitivity to temperature), the initial skill of the forcing variables, and the temporal periods (i.e. specific events, seasons, or the whole year) for which the sensitivity is evaluated. The same lead time was used to relate improvements in streamflow to improvements in temperature; we consider this robust since most catchments in this study have a concentration time of less than a day.
In summary, to further improve the skill of streamflow forecasts, improved temperature forecasts are essential during the snowmelt season. Streamflow forecasts during spring have the highest potential for improvement, since for a majority of the catchments the temperature forecasts were not improved by the applied calibration. For autumn, the substantial improvement in temperature forecast skill by grid calibration improves streamflow forecast skill, but the sensitivity is less than for spring.

Spatial patterns
The most pronounced spatial pattern was the low autumn CRPSS for Tens in the coastal areas, also evident from the boxplots for the regions West, Mid and North (Fig. 8). This seasonal cold bias is also clearly seen in the western catchments Viksvatn and Foennerdalsvatn (Fig. 4). The cold bias in Tens along the coast is explained by the radiative heating and cooling in the coarse-resolution forecasts (see Sect. 5.1). In addition, the challenging steep coastal topography is not well represented by the spatial resolution of the ECMWF model (Seierstad et al., 2016). For inland catchments, and the regions South and East, uncalibrated Tens is skillful in both autumn and spring; hence, the calibration has a smaller effect in these catchments.

Autumn streamflow skill was improved by temperature calibration for all regions; the largest improvement was seen for the coast and the regions West and Mid. From Viksvatn (Fig. 4, right panel) we found that the largest temperature improvements occur in the temperature range around and below 0 °C. For many coastal catchments, the climate in autumn and winter is relatively mild, and temperatures around 0 °C will influence streamflow. In spring, no clear patterns are seen for either Qens or the change in skill.

Catchment characteristics and skill
Only a few significant relationships between the catchment characteristics, e.g. catchment area and elevation gradient, and skill were found (Table 1). We expected to find the highest temperature skill in large catchments, due to averaging, and in catchments with small elevation differences, due to less elevation correction inaccuracy. No significant relationship between temperature skill and elevation difference was found for any combination of region and season. A positive relationship between temperature skill and catchment area was found for five out of ten regression equations. This result is not conclusive, but indicates that the averaging effect on temperature skill dominates over the facts that (i) the smallest catchments are smaller than the grid size of the ECMWF model and therefore sensitive to the pre-processing, and (ii) it is more challenging to forecast weather at small spatial scales than at large scales.
It was expected that streamflow skill would increase with catchment area due to averaging effects. Significant linear regression coefficients were found for East and South, but with different signs, with the same tendencies in both spring and autumn. The interpretation of this result is therefore ambiguous. For elevation difference, a significant negative correlation was found for three out of ten datasets. This suggests that the downscaling approach has the potential to improve the streamflow forecasts. These results are not conclusive, and more detailed studies are needed to determine any significant relationships to catchment characteristics.
Forecasting in small catchments with particular characteristics may be challenging, since such catchments may not be well represented by the numerical weather prediction model or by the calibration methods. In our dataset, Foennerdalsvatn (Fig. 4c) is such an example: the catchment area is only 7.1 km², the elevation is high, the topography is steep, glaciers cover 47% of the catchment area, and it is located close to the coast. The combination of all these catchment characteristics can make forecasting difficult.

Snowmelt flood 2013
The snowmelt flood event (Fig. 10) clearly illustrates how temperature calibration affects the forecasted ensemble streamflow. The increase in forecasted temperature by grid calibration results in additional snowmelt and thus increased streamflow. The increased streamflow led to a change in the warning level from green/yellow to yellow/red, i.e. from below to above the 5-year flood. For this event, however, the use of calibrated temperature reduced the performance of the forecasted streamflow, Qcal. The simulated streamflow, Qo, used as the reference, is better captured by the streamflow forecasts based on uncalibrated temperature forecasts, Qens. The deterioration in forecast performance using calibrated temperature is particular to this event; other results provided in this study clearly show that the calibrated temperature ensembles improve the streamflow forecasts on average. This discrepancy reveals other sources of errors, such as the uncertainty of the observed SeNorge precipitation and temperature, and the ability of the hydrological model to capture the highest flood peaks. These points are outside the scope of this study and will not be followed up further here, but are of course important for the performance of a flood forecasting system.

Summary and conclusion
The main objective of this study was to investigate the effect of temperature forecast calibration on the streamflow ensemble forecast skill, and to identify potential improvements in the forecasting chain. We applied a gridded temperature calibration method and evaluated its effect on both temperature and streamflow forecasting skill. The seasonality in skill was evaluated, and correlations to catchment characteristics and spatial patterns were investigated. Supported by the results presented in this paper, our answers to the research questions listed in the introduction are summarized as follows:
Are there seasonal effects of temperature calibration on the temperature ensemble forecast skill?
 The largest temperature skill improvements by calibration were found for low-performing coastal catchments in autumn and winter.
 The effect of calibration on temperature skill was less clear in spring and summer. In spring, the calibrated temperature resulted in reduced skill for many catchments.
 Smaller biases in spring and summer explain the higher Tens skill and hence the smaller room for improvement by calibration.
Are there seasonal effects of temperature calibration on the streamflow ensemble forecast skill?
 In autumn and winter, streamflow skill improved for most catchments. For spring, the calibration resulted in both better and worse skill.
 In spring, changes in temperature skill had a larger effect on streamflow skill than in autumn and winter. Summer showed small changes for both temperature and streamflow.
Are there spatial patterns in the ensemble forecast skill and if so, can these be related to catchment characteristics?
 The skill of the temperature forecasts was lowest for the coastal catchments in West, Mid and North in autumn, caused by a cold bias in the forecasts (this was also the case in winter, although those results are not shown).
 The largest improvement in skill for both temperature and streamflow was found for catchments with a cold bias in the temperature forecasts.
 A regional division seemed useful to identify spatial patterns in temperature forecasts, whereas for streamflow the spatial patterns were less obvious.

 No conclusive relationship between catchment characteristics and skill could be established.
Snowmelt flood
 Streamflow increased with temperature calibration, changing the flood warning level, clearly showing the importance of correct temperature calibration for catchments with snow during the snowmelt season.
This study showed that the applied gridded temperature calibration method improved the temperature skill for most catchments in autumn and winter. Temperature forecasts have an impact on streamflow and are important for seasons where temperature determines snowmelt and discriminates between rain and snowfall. The improvement in temperature skill propagated to streamflow skill for some, but not all, catchments. This depended to a large degree on the region and the skill of the uncalibrated ensemble.
The most obvious improvement in the forecasting chain is to use the same temperature information, the SeNorge temperature, for calibrating the temperature forecasts as is used for calibrating the hydrological model, generating the initial conditions for the hydrological system, and evaluating the performance. In particular, the calibrated temperature forecasts could be improved for spring, when the streamflow forecasts are the most sensitive to temperature. The pre-processing of temperature includes both an elevation correction depending on the lapse rate and the calibration method. The lapse rate in this study is defined as a constant, but in reality depends on weather conditions, location and elevation. In addition, the calibration method, here quantile mapping, is sensitive to forecasted values outside the observation range, and other methods should be considered.
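The constant-lapse-rate elevation correction mentioned above amounts to a single linear shift per elevation zone, which can be sketched as follows. The default value of -0.0065 °C per metre is the standard environmental lapse rate, used here purely as an illustrative assumption; the study's actual value is not stated in this section.

```python
# Illustrative constant-lapse-rate correction of a grid-cell temperature
# to the elevation of a catchment zone. The default lapse rate is the
# standard environmental value and is an assumption, not the study's.

def lapse_correct(t_grid, z_grid, z_target, lapse=-0.0065):
    """Shift temperature t_grid (degC) from grid elevation z_grid (m)
    to target elevation z_target (m), cooling with ascent."""
    return t_grid + lapse * (z_target - z_grid)
```

Because the true lapse rate varies with weather situation (e.g. winter inversions can even reverse its sign), a constant value is exactly the kind of simplification the text suggests revisiting.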
The conclusions in this study are based on a testing period of almost three years. Even if this is a relatively short testing period, we believe that the large number of catchments to a large degree compensates for the short period, and that the results and conclusions are therefore relatively robust.
The conclusions herein are based on a large and relatively representative dataset from Norway, but we suggest that some of the main conclusions can be valid for regions with a similar climate. The most important general conclusion is that streamflow forecasts are sensitive to the skill of temperature forecasts, especially in the snowmelt season. In addition, this study shows that reducing the cold temperature bias in coastal areas results in improved streamflow forecasts, and that the pre-processing needs to account for seasonal differences in temperature forecasts (biases).

Data
Processed data are available by contacting the corresponding author. Raw meteorological data must be requested directly from ECMWF.

Author contributions
T. J. Hegdahl prepared the data, set up the forecasting chain (including writing new code for functionality that was not available), performed the simulations and analysis, and wrote the manuscript. K. Engeland contributed to the writing. K. Engeland, I. Steinsland and L. M. Tallaksen contributed to the design of the study, gave advice during the work, and contributed to the revision of the manuscript.

Table 1: Summary of significant correlations between CRPSS for calibrated temperature (Tcal) and streamflow (Qcal) ensembles and catchment characteristics, i.e., area and elevation difference (ΔH), for the five regions. Blue color indicates a significant positive relationship, red a significant negative relationship, and grey a non-significant relationship. Results are for a lead time of 5 days.