Major comments
This is a resubmitted version of a study describing the sources of skill in a Europe-wide streamflow forecasting system. It is my second review of the paper. As I noted last time, dynamical continental-scale ensemble prediction systems are at the cutting edge of seasonal streamflow forecasting, the paper is reasonably well presented, and it is well within the scope of HESS. In addition, the authors have addressed several of my concerns, including improving the description of their analyses of forecast skill and trends, and performing analyses of reliability. They have also added some interesting analyses of sources of skill in Figure 12. I commend them on their efforts. There are still some outstanding issues, however. Some repeat my comments from last time; others have arisen from the revision. They are as follows:
1) Reliability: the authors have expended considerable effort on investigating reliability with attributes diagrams. Unfortunately, they have placed all of this material in an Appendix, and have not referred to that appendix at all in the body of the paper. As I stated previously: reliability is a crucial attribute of ensemble forecasting systems, and merits discussion. This is particularly true when the main analyses used in this paper reduce the ensemble to a deterministic forecast (i.e., correlations rather than probabilistic scores). I recommend at least some of this analysis be moved into the body of the paper. The reliability results are not particularly strong - the ensembles appear strongly overconfident at longer lead times - but I think this is an avenue for future improvement. (NB - see also my suggestions for shortening the paper, below.)
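For concreteness, the points of an attributes (reliability) diagram can be computed along these lines - a minimal Python sketch, assuming hypothetical hindcast and observation arrays (the names are mine, not the authors'):

    import numpy as np

    def reliability_curve(ens, obs, threshold, n_bins=5):
        # ens: (n_years, n_members) ensemble hindcasts for one start date/lead
        # obs: (n_years,) verifying (pseudo-)observations
        # Forecast probability = fraction of members exceeding the threshold.
        p_fcst = (ens > threshold).mean(axis=1)
        occurred = obs > threshold
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        curve = []
        for k in range(n_bins):
            in_bin = (p_fcst >= edges[k]) & (p_fcst < edges[k + 1])
            if k == n_bins - 1:
                in_bin |= p_fcst == 1.0  # include p = 1 in the last bin
            if in_bin.any():
                # One diagram point: mean forecast probability vs.
                # observed relative frequency within the bin.
                curve.append((p_fcst[in_bin].mean(), occurred[in_bin].mean()))
        return curve

Points falling on the 1:1 line indicate reliable probabilities; a curve shallower than that line is the signature of the overconfidence visible in the appendix figures at longer lead times.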
2) Paper Length: the addition of analyses and discussion has resulted in a paper that is, in my view, too long. I offer three possible ways to shorten the paper:
i) I reiterate my recommendation from my previous review that the analyses of evapotranspiration forecasts be removed from this paper, and given its own paper.
ii) Figure 13, and its accompanying discussion, is superficial and could easily be removed. One of the major benefits of ESP forcings is that they offer a reliable estimate of uncertainty. This is distinct from the authors' InitSH experiment, which samples from Sys4 (resulting in an ensemble that is likely to be overconfident). This of course does not show up in correlations calculated on the median ensemble member. The authors could simply state something like "The use of InitSH produced very similar correlations to ESP (not shown for brevity)." If the authors feel strongly that Fig 13 should be included, they could add it as another panel in Figure 5 (and reduce the discussion of it to a sentence or two), but I think this is unnecessary.
iii) The analysis of reliability of actual streamflow forecasts (Figure B2) is probably unnecessary, as the remainder of the paper verifies against pseudo observations. It could be removed.
Specific (minor) comments
L101 "specific hindcasts" - would 'experiments' perhaps be a better term?
L125-132 Arnal et al. (2018) have done this comprehensively over Europe recently, and their study is worth mentioning both here and in the discussion of your results.
L175 "To spin up discharge, each 7-month hindcast was preceded by a one month simulation" The companion paper implies that the hydrological states at the start of the 1-month spin-up are taken from a long-run simulation (if I've interpreted this correctly). This is important information (!) and should be included. At present, it reads as though only a single month is used to spin-up hydrological model states, which is nowhere near enough.
L196 "from the observations themselves" Should this be 'pseudo-observations'?
L252-274 This information is perhaps better presented in a table, for easy reference.
L256-257 "More specifically, we selected member 1 from the 1981 hindcasts, member 2 from the 1983 hindcasts, etc. By using identical meteorological forcing for all of the years of the hindcasts, skill due to skill in the forcing is eliminated." Does this effectively mean that an ensemble member is drawn randomly from each year?
L270-271 "skill due to initial conditions is eliminated" Should that be "skill due to initial hydrological conditions is eliminated"?
L274 "15 uneven years 1981-2009" I take it this means the years {1981, 1983, ..., 2007, 2009}(?) Does this mean the exact (perfect) rainfall forecast is included in the ensemble when assessing odd years? I.e., if you are evaluating forecasts for the year 2009, the observed rainfall for that year is included in the forecast ensemble?
L288-289 "Thus, like the FullSH, all specific hindcasts for a single starting date consist of 15 members, which is important since ensemble size affects skill metrics". Of far greater concern is strict cross-validation of forecasts. Including a 'perfect' rainfall forecast in an ensemble of 15 is likely to have a much greater impact on skill scores than ensemble size. I don't think this is defensible. (Though as I recommend removing the ESP figures/experiment, above, it does not need to be addressed.)
L401-420 I think the additions to this explanation of what you're trying to show with your analysis of temperature trends have improved it considerably - it's much clearer to me (and hopefully others) now. There are a few instances later in the paper where phrases such as "
L500 "The InitSH forcing is the same for each year, so its interannual variation does not contain a signal nor noise." The 'signal' depends on observations, which vary from year to year. So you can't say there is no 'noise' - the accuracy of InitSH changes each year, because the observations vary. Rather than centering this discussion on signal and noise, it would be easier to simply talk about forecast accuracy. I think the main conclusion to draw from Fig 4 is that when the WHUSP forecasts are reduced to the median, at longer lead times the meteorological forecasts from FullSH are less accurate than the randomly drawn met forecasts in InitSH. This is possible, of course, and a reason why a number of studies choose calibration methods rather than simple bias-corrections to process meteorological forecasts (see Zhao et al. 2017) (and also one of the reasons why ESP forecasting systems are difficult to beat).
L645-646 "However, there is compensation for this direct effect by an indirect effect through soil moisture." Is it possible that this is an artefact of breaking correlations between states? i.e., SnInitSH has averaged soil moisture states at lead 0 - could running the model induce more correctly correlated soil moisture states at lead 1? In other words, could this be an artefact of your choice to average states, rather than using an ensemble of model states as standard revESP experiments do? If so, I think this should be acknowledged somewhere.
L542-L543 "Skill in the precipitation forcing of the first lead month leads to skill in the states of soil moisture and snow at the end of that month." I would guess skillful forecasts occur at long lead times in at least some catchments where there is no skill in precipitation in the first month. Correctly initialised hydrological models can produce skillful streamflow forecasts for a number of months, even with completely uninformative forcings.
L817-819 "This result is counter-intuitive but, as we discussed, a logical consequence of forcing with interannual variation
that has no or insufficient skill, such as the S4 forcing." This doesn't sound logical to me at all. InitSH has, by design, zero meteorological forecast skill - so you cannot explain the poor S4 performance by saying it S4 has no meteorological forcing skill. (If this were so, it should perform similarly to InitSH, not worse than it.) I think the explanations possible are: 1)there are actual flaws in the forecast S4 forcings; this might be because of bias (though this is unlikely, as I'm sure the quantile mapping takes care of this) or that the S4 forecasts are negatively skillful (i.e., less accurate than and ESP forecast - this looks the most likely candidae) at longer lead times. 2) the way you've assessed the forecast insn't sufficiently sensitive to determine
L821-822 "have identical meteorological forcing for each year". I think I commented on this in the past revision: ESP forecasts are frequently not identical for each year, because they are often cross-validated. So they are similar, but not identical. They are similar to InitSH in that they are (or should be) uninformative, but give a reliable uncertainty spread.
L841 "To our knowledge, two other studies analysed" as noted previously, Arnal et al. 2018 have done very similar work to that presented here, and with the same forcing.
L859 "In any case, that contribution depends and will depend on the climate model used (e.g. S4 or GloSea5)." It can also very much depend on the method used to process climate forecasts. Calibration removes bias, but also ensures consistently inaccurate forecasts (i.e., negatively skillful forecasts, in the sense that they perform worse than climatological reference forecasts) return to something like a climatology forecast. This allows skill at short lead times in meteorological forecasts to propagagate through to long lead times in streamflow predictions. Quantile mapping only removes bias.
L879 "taking" should be "taken"
L883 Figure 12 - headings say 'as lead' when they should say 'at lead'
l1031 "this is not the case for practical applications" - well, it probably says that if you are choosing a benchmark/reference forecast, climatology is probably not good enough; it would be more stringent to use a benchmark of climatology+trend.
L1174 Figure B1 - This is a nice figure, but wouldn't it have been better to look at forecasts exceeding the median, to make it more consistent with the calculation of correlations (which are calculated on the median ensemble member)?
Typos/Grammar
L145 "So, the objective" Delete "So, "
L205 "and to be available" should be "and are available"
L675 Delete the first 'because'.
References
Arnal L, Cloke HL, Stephens E, Wetterhall F, Prudhomme C, Neumann J, Krzeminski B, Pappenberger F. 2018. Skilful seasonal forecasts of streamflow over Europe? Hydrol. Earth Syst. Sci. 22: 2057-2072. DOI: 10.5194/hess-22-2057-2018.
Zhao T, Bennett JC, Wang QJ, Schepen A, Wood AW, Robertson DE, Ramos M-H. 2017. How suitable is quantile mapping for post-processing GCM precipitation forecasts? Journal of Climate 30: 3185-3196. DOI: 10.1175/jcli-d-16-0652.1. |