Flood spatial coherence, triggers, and performance in hydrological simulations: large-sample evaluation of four streamflow-calibrated models

Brunner, Manuela I.; Melsen, Lieke A.; Wood, Andrew W.; Rakovec, Oldrich; Mizukami, Naoki; Knoben, Wouter J. M.; Clark, Martyn P.

doi:https://doi.org/10.5194/hess-25-105-2021

Articles | Volume 25, issue 1

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 06 Jan 2021

Final revised paper (published on 06 Jan 2021)
Supplement to the final revised paper
Preprint (discussion started on 11 May 2020)

Interactive discussion

Status: closed

AC: Author comment | RC: Referee comment | SC: Short comment | EC: Editor comment

RC1: 'Referee Comment', Anonymous Referee #1, 26 May 2020
- AC1: 'Response to reviewer 1', Manuela Irene Brunner, 23 Jul 2020
RC2: 'Calibration strategies matter for flood studies (as has been shown before), so why do the authors not test calibration strategies?', Anonymous Referee #2, 13 Jun 2020
- AC2: 'Response to reviewer 2', Manuela Irene Brunner, 23 Jul 2020
RC3: 'Flood hazard and change impact assessments may profit from rethinking model calibration strategies by M. Brunner et al.', Anonymous Referee #3, 02 Jul 2020
- AC3: 'Response to reviewer 3', Manuela Irene Brunner, 23 Jul 2020

Peer-review completion

AR: Author's response | RR: Referee report | ED: Editor decision

ED: Reconsider after major revisions (further review by editor and referees) (02 Aug 2020) by Nadav Peleg

AR by Manuela Irene Brunner on behalf of the Authors (10 Aug 2020) Author's response Manuscript

ED: Referee Nomination & Report Request started (13 Aug 2020) by Nadav Peleg

RR by Anonymous Referee #4 (18 Sep 2020)

Suggestions for revision or reasons for rejection

General comments

This study compares the efficiency of three lumped models and one distributed model in simulating flood magnitude, timing, and spatial coherence. The objective function used for calibration is the Kling-Gupta efficiency (KGE). The results show that the models tend to underestimate flood magnitude and do not always simulate flood timing well. The authors conclude that using KGE for calibration has limited reliability for flood hazard assessment.
In general, the topic fits the scope of the journal and will be of interest to readers. However, the manuscript in its current form (after the revision) would still benefit from a more thorough revision. The main critical points (in my opinion) are:

1) The formulation and justification of the novel scientific contribution are still not clear. The review of previous studies in the Introduction indicates that “…to achieve further improvements in flood peak simulations, a broader range of application-specific evaluation metrics is typically required.” (l.29-30, l.23-24). I agree with this formulation of current research gaps, but it is not in line with the objective function tested in the manuscript. If one is interested in flood magnitude, timing, and spatial connectivity, why should one use KGE for calibration? How does it account for such specific evaluation metrics, i.e. flood seasonality or spatial coherence?

2) The title is misleading. The main message of the paper, in its current form, is about the value of KGE for the calibration of hydrologic models (if flood impact assessment is the main purpose). There is no assessment of how the models describe and simulate different flood generation processes and which factors control their performance. So, based on the presented results, it is difficult to interpret to what extent and how the selected models are suitable for flood impact assessment. The results are more about the accuracy of the selected approach (i.e. using lumped models, KGE for calibration, etc.).

3) The significance of the results is not clear. I’m not sure whether a lumped model would be used, or should be recommended, for practical applications. Perhaps a consistent assessment/evaluation of the difference between lumped and distributed model types would be interesting (e.g. for HBV and mHM).

4) The design of the experiment reads more like a collection of available analyses than like results derived from an initially clearly defined research question/hypothesis. I agree with previous reviews that using different time periods for calibration and different model input datasets can have some impact on the results, and that the interpretation of results (including for individual catchments) would be more consistent if the same data and time periods were used. The authors claim that both datasets describe the observed climate, but are they also identical for individual extreme events?

5) The methodology is not rigorously described. It will be very difficult (if even possible) to reproduce/repeat the presented analysis based on the given information. Much information is missing, e.g. how the initial values were set, and what the ranges of the calibrated model parameters and the parameters of the automatic calibration algorithm were. It would be interesting to present, e.g. in an appendix, the final model parameters and efficiencies for individual catchments. This would allow readers to assess the interpretations made.

6) I think that comparing lumped with distributed models can bring more interesting results than are presented in the current form. What is the impact of lumping on the results? Are the differences in model efficiency related to the size of the basin? I would expect that lumped models in larger catchments cannot describe floods from convective rainfall well.

7) As a reader, I would likely be more interested in seeing where (in which catchments and why?) the models work well, rather than in concluding that in general they underestimate magnitude or do not represent the timing or spatial patterns well. So presenting a deeper analysis of the factors controlling the performance would be helpful and interesting.
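
For readers unfamiliar with the objective function discussed in these comments, the following is a minimal sketch of the Kling-Gupta efficiency. It assumes the original 2009 formulation, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), with r the linear correlation, alpha the ratio of standard deviations, and beta the ratio of means; the manuscript may use a variant. The function name `kge_components` is hypothetical and is not taken from the study.

```python
import math


def kge_components(obs, sim):
    """Return (kge, r, alpha, beta) for paired observed and simulated flows.

    Assumes the 2009 KGE formulation; names here are illustrative only.
    """
    n = len(obs)
    if n != len(sim) or n < 2:
        raise ValueError("obs and sim must have the same length (at least 2)")

    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    # Population standard deviations and covariance
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / n

    if sd_o == 0 or sd_s == 0 or mean_o == 0:
        raise ValueError("KGE is undefined for constant series or zero-mean observations")

    r = cov / (sd_o * sd_s)   # correlation term (timing and shape)
    alpha = sd_s / sd_o       # variability ratio
    beta = mean_s / mean_o    # bias ratio
    kge = 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
    return kge, r, alpha, beta
```

Returning the three components alongside the aggregate score makes it possible to see which term drives a low value: a beta below 1 indicates an underestimated mean flow, and an alpha below 1 indicates damped variability. None of the three terms targets flood peaks, flood seasonality, or spatial coherence specifically.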

Specific comments

1) Abstract, l.13: “…models calibrated on integrated metrics such as …have limited reliability…”. This is a general conclusion that is not well supported by the presented results. I would suggest removing “such as”. I think if one combines flood magnitude, seasonality, and spatial coherence into an integrated metric (objective function) for calibration, the results can be different.

2) Data. I like the assessment based on a large dataset and the subsequent split/grouping of results into relevant groups of catchments. It is, however, not clear how flood generation processes (e.g. flood types) are linked with the selected groups of regimes. If the objective is about the suitability of models to represent floods (magnitude, seasonality, …), it would be interesting to see results for different flood generation mechanisms, i.e. how or if the models differ in simulating snowmelt floods, floods from convective rain, etc.

3) Forcing. Which version of Daymet is used? Why not use only one dataset for all the models?

4) The term “event”: By using the term flood event, do you mean the day of the flood peak? The same for precipitation. Does the event precipitation represent the mean daily precipitation for the day of the peak? Some flood events (e.g. from snowmelt) can last several days. How sensitive/representative are the characteristics extracted only for the day of the peak?

5) Beta model parameter (l.170). It would be interesting to present model parameters for individual catchments, because otherwise the interpretation made reads more like speculation (it is not justified by the presented results).

6) L.218-219. In my opinion, the HBV model can describe surface runoff. Conceptually, it is represented by the outflow from the upper reservoir (described by the k0 model parameter).

RR by Anonymous Referee #2 (21 Sep 2020)

Suggestions for revision or reasons for rejection

The authors responded to most of my concerns, but some issues are left.

[1] The title is still problematic. As far as I can tell, there is no "flood impact assessment" performed in this study. Why is it in the title? The study assesses flood flows, so why is this not the title of the paper? Flood impact assessment would require a direct connection to the actual implications of flooding, such as flood inundation, damage to houses, etc. These aspects are not part of the study, so why is the title focusing on this issue?

And, if the focus is on assessing the value of KGE as a calibration metric, then why is this not in the title? The title "Evaluating the suitability of hydrological models for flood impact assessments" is still much broader than what this very focused study actually does.

[2] (line 140) The use of split sample schemes should include a reference back to Klemes (1986, HSJ, https://www.tandfonline.com/doi/abs/10.1080/02626668609491024 ) who introduced the idea.

[4] Section 3.1: I asked previously why HBV results are so poor and I am still confused by it. It would be useful for the discussion section of this paper to more closely compare the results obtained here to previous studies across the USA. For example, Kollat et al. (2012, WRR, https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2011WR011534) calibrated the HBV model across all MOPEX catchments and found Nash-Sutcliffe efficiency values much higher than what would be expected based on the results of the current study (see their Figure 9A). Why the discrepancy? Kollat et al. (2012) performed extensive MO-calibration whereas the current study used an LHS sampling strategy. So, is part of the result of the current study due to the chosen calibration approach?

[5] Other studies have disaggregated KGE to understand what controls the bias in the KGE terms. E.g. Gudmundsson et al. (2012, WRR, https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2011WR010911) found that one significant control on water balance error seemed to be precipitation data error – model predictions in catchments with significant elevation differences (where, at least across Europe, precipitation measurements are expected to be less good) performed more poorly. Did the authors find similar patterns? Would elevation difference be a good way to see whether rainfall is indeed a likely problem for the study catchments in this paper?

[3] I am still confused by the authors' conclusion that "Our model comparison shows that all flood characteristics are not equally well represented by models calibrated with the widely used Kling–Gupta efficiency metric." – OK, but very likely this is true for any metric given the extensive experience with multi-objective model calibration in hydrology, where a regular finding is that any single metric produces a focused result. So, what multi-objective strategy do we need to improve this problem? And what relevant trade-offs exist (e.g. Kollat et al., 2012)? The authors suggest that multi-objective calibration is the way forward, in which case a better review of this very rich multi-objective literature in hydrology would be nice (given that this topic has been explored for over 20 years). Currently there are only a couple of recent references, which do not do the topic justice – even if narrowed down to those studies focusing on calibration for flood prediction.

The next sentence suggests a much wider conclusion: “The number of floods, flood magnitude, and timing are not always well captured by hydrological models in many catchments.” It would be good if the authors were to formulate their conclusions more carefully. Given that the authors have a very narrow focus in this study (which is fine) – to show that calibrating to KGE does not lead to a good reproduction of all flood characteristics – it would be good to formulate their conclusions with a similar focus to avoid that others misuse their conclusions.
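
The LHS (Latin hypercube sampling) strategy that comment [4] contrasts with multi-objective calibration can be illustrated with a minimal sketch. It is a generic illustration, not the sampling code used in the study: the function name `latin_hypercube` is hypothetical, and the parameter ranges in the usage example are placeholders rather than the calibration ranges of any of the four models.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples parameter sets with one value per equal-width stratum.

    bounds maps a parameter name to a (low, high) tuple. Hypothetical helper.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        # One uniformly random point inside each of the n_samples strata
        points = [low + width * (i + rng.random()) for i in range(n_samples)]
        # Shuffle so strata are paired randomly across parameters
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


# Usage with placeholder ranges (assumptions, not values from the manuscript)
example_bounds = {"k0": (0.0, 1.0), "beta": (1.0, 6.0)}
example_sets = latin_hypercube(5, example_bounds, seed=1)
```

In a sampling-based calibration, each parameter set would be run through the model and the set with the best objective value (here KGE) retained. Unlike an iterative search, the sample does not adapt to the scores, so the best parameter set found depends on the sample size.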

ED: Reconsider after major revisions (further review by editor and referees) (27 Sep 2020) by Nadav Peleg

Dear Authors,

I have now received the reports of two referees; one (who did not review the original text) suggested major revisions, while the other suggested minor revisions (although her/his comments read to me as moderate revisions). Reading the revised manuscript and the comments made by the reviewers, I conclude that additional changes to the text are needed before it can be considered for publication in HESS.

The main issue that I see here is that the motivation, objectives, and hypotheses of the study do not compose a clear storyline. This was already pointed out by some of the reviewers in the first round of revision. In this study, a single objective function (KGE) is used for the model calibration, and you demonstrate that four different models fail to represent the observed floods using this single-criterion objective function. In the conclusions, you suggest using multi-criteria objective functions for the calibration of the models if the focus is on representing flood events. This conclusion is not new – many studies in the past used multi-criteria objective functions to calibrate hydrological models to simulate floods, likely with a better match than can be obtained with KGE. Why have you chosen to calibrate the models using an objective function that is known (or can be expected) to fail to simulate flood events to begin with? What multi-criteria objective function/strategy could be used to calibrate hydrological models to better represent flood events (what strategies were used in the past and how can they be improved)? Will a multi-criteria objective function improve the match to flood events, or will some of the models presented here still fail to reproduce flood events due to their internal structure? I am missing answers to, and discussion of, these types of questions.

In my view, the introduction, discussion and conclusions sections will require considerable text edits to make the story clearer and more appealing to the readers of HESS, maybe also with minor changes to the structure of the text. I will be happy to reconsider the revised paper after major revisions.

Sincerely,
Nadav Peleg

AR by Manuela Irene Brunner on behalf of the Authors (04 Nov 2020) Author's response Manuscript

ED: Referee Nomination & Report Request started (05 Nov 2020) by Nadav Peleg

RR by Anonymous Referee #4 (12 Nov 2020)

Suggestions for revision or reasons for rejection

Dear authors,

I want to thank you for considering most of my comments and for the revisions made. The story has been significantly improved, including the formulation of the aim and scientific contribution of the study. Only one general comment (point 5 in the previous review) has not been addressed adequately, but I think it is essential for supporting the interpretations made. The results (and the response) indicate no clear pattern in the difference between distributed and lumped models. In particular, the comparison and difference between HBV and mHM do not show any relation to the size of the basin. But the different results of these two models represent not only the difference between the lumped and distributed formats, but also the difference in the datasets used for driving the models. The manuscript indicates (and the authors believe) that both datasets adequately describe the observed climate. Still, there is no support/evidence that these datasets also provide identical/similar patterns of event precipitation characteristics (e.g. magnitude, antecedent sum, duration, etc.). Differences in precipitation can mask the differences between lumped and distributed models. A similar comment also applies to the difference in the calibration periods, but this likely has a smaller effect.

Specific comments

Abstract: please remove the repetition of “the widely-used Kling–Gupta efficiency” (l.6-7, l.9-10)

Abstract: repetition l.13, l.14; “not necessarily…”

Discussion, l.235 “The results presented in this study demonstrate that simulating floods using hydrological models …”. Please be more specific. I think the results show this only for the case where the models are calibrated to KGE. If you use seasonality or flood magnitude in the objective function, the results can be different.

ED: Publish subject to minor revisions (review by editor) (16 Nov 2020) by Nadav Peleg

AR by Manuela Irene Brunner on behalf of the Authors (17 Nov 2020) Author's response Manuscript

ED: Publish as is (20 Nov 2020) by Nadav Peleg

AR by Manuela Irene Brunner on behalf of the Authors (20 Nov 2020)

Short summary

Assessments of current, local, and regional flood hazards and their future changes often involve the use of hydrologic models. A reliable model ideally reproduces both local flood characteristics and regional aspects of flooding. In this paper we investigate how such characteristics are represented by hydrologic models. Our results show that the modeling of both local and regional flood characteristics is challenging, especially under changing climate conditions.