This study compares the ability of three lumped models and one distributed model to simulate flood magnitude, timing, and spatial coherence. The objective function used for calibration is the Kling-Gupta efficiency (KGE). The results show that the models tend to underestimate flood magnitude and do not always simulate flood timing well. The authors conclude that calibration on KGE has limited reliability for flood hazard assessment.
In general, the topic fits the scope of the journal and will be of interest to readers. However, the manuscript in its current form (after the revision) would still benefit from a more thorough revision. The main critical points, in my opinion, are:
1) The formulation and justification of the novel scientific contribution are still not clear. The review of previous studies in the Introduction indicates that “…to achieve further improvements in flood peak simulations, a broader range of application-specific evaluation metrics is typically required.” (l.29-30, l.23-24). I agree with this formulation of the current research gaps, but it is not in line with the objective function tested in the manuscript. If one is interested in flood magnitude, timing, and spatial coherence, why should KGE be used for calibration? How does it account for such application-specific evaluation metrics, e.g., flood seasonality or spatial coherence?
2) The title is misleading. The main message of the paper, in its current form, concerns the value of KGE for calibrating hydrologic models when flood impact assessment is the main purpose. There is no assessment of how the models describe and simulate different flood generation processes, or of which factors control their performance. Based on the presented results it is therefore difficult to judge to what extent, and in what way, the selected models are suitable for flood impact assessment. The results speak more to the accuracy of the selected approach (i.e., lumped models, KGE-based calibration, etc.).
3) The significance of the results is not clear. I am not sure that a lumped model would be used, or should be recommended, for practical applications. A consistent assessment of the differences between lumped and distributed model types (e.g., HBV versus mHM) would perhaps be more interesting.
4) The design of the experiment reads more like a collection of available analyses than like results derived from a clearly defined research question or hypothesis. I agree with the previous reviews that using different calibration periods and different model input datasets can affect the results, and the interpretation (including for individual catchments) would be more consistent if the same data and time periods were used. The authors claim that both datasets describe the observed climate, but are they also consistent for individual extreme events?
5) The methodology is not rigorously described. It would be very difficult, if possible at all, to reproduce the presented analysis from the information given. Essential information is missing, e.g., how the initial values were set, the ranges of the calibrated model parameters, and the settings of the automatic calibration algorithm. It would be useful to present, e.g., in an appendix, the final model parameters and efficiencies for individual catchments; this would allow readers to assess the interpretations made.
6) I think that comparing lumped with distributed models could yield more interesting results than are presented in the current form. What is the impact of lumping on the results? Are differences in model efficiency related to basin size? I would expect that lumped models cannot describe floods from convective rainfall well in larger catchments.
7) As a reader, I would likely be more interested in seeing where (in which catchments, and why) the models work well, rather than in the general conclusion that they underestimate magnitude or do not represent timing or spatial patterns well. A deeper analysis of the factors controlling model performance would therefore be helpful and interesting.
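To make the concern in point 1 concrete: KGE aggregates only three components (linear correlation, a variability ratio, and a mean-bias ratio), so the metric contains no explicit term for flood timing, seasonality, or spatial coherence. A minimal sketch of the standard formulation (Gupta et al., 2009; the function name is mine):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    with r the linear correlation, alpha the ratio of standard deviations,
    and beta the ratio of means of simulated to observed flows."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]   # timing enters only via correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # mean-bias ratio
    return 1.0 - np.sqrt((r - 1.0)**2 + (alpha - 1.0)**2 + (beta - 1.0)**2)
```

None of the three components penalizes errors in flood seasonality or in spatial coherence across catchments, which is precisely why a KGE-calibrated model need not perform well on those application-specific aspects.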
Specific comments:
1) Abstract, l.13: “…models calibrated on integrated metrics such as …have limited reliability…”. This general conclusion is not well supported by the presented results. I suggest removing “such as”. If flood magnitude, seasonality, and spatial coherence were combined into an integrated metric (objective function) for calibration, the results could be different.
2) Data. I like the assessment based on a large dataset and the subsequent splitting of results into relevant groups of catchments. It is, however, not clear how flood generation processes (e.g., flood types) are linked to the selected regime groups. If the objective concerns the suitability of the models to represent floods (magnitude, seasonality, …), it would be interesting to see results for different flood generation mechanisms, i.e., whether and how the models differ in simulating snowmelt floods, floods from convective rainfall, etc.
3) Forcing. Which version of Daymet is used? Why not use a single dataset for all the models?
4) The term “event”: By the term flood event, do you mean the day of the flood peak? The same applies to precipitation: does the event precipitation represent the mean daily precipitation on the day of the peak? Some flood events (e.g., snowmelt floods) can last several days. How sensitive and representative are characteristics extracted only for the day of the peak?
5) Beta model parameter (l.170). It would be useful to present the model parameters for individual catchments; otherwise, the interpretation reads more as speculation, as it is not justified by the presented results.
6) L.218-219. In my opinion, the HBV model can describe surface runoff. Conceptually, it is represented by the outflow from the upper reservoir (controlled by the k0 model parameter).
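For context, the mechanism referred to in the previous comment can be sketched as follows, assuming an HBV-light-style structure (variable names here are illustrative, not from the manuscript): the upper-zone reservoir produces a fast, near-surface outflow component that is activated only above a storage threshold, in addition to interflow.

```python
def upper_zone_outflow(suz, k0, k1, uzl):
    """Upper-zone outflows in an HBV-light-style structure (sketch).
    suz : upper-zone storage [mm]
    k0  : recession coefficient [1/day] of the fast ('surface') component
    k1  : recession coefficient [1/day] of interflow
    uzl : storage threshold [mm] above which the fast component is active
    """
    q0 = k0 * max(suz - uzl, 0.0)  # fast near-surface runoff, threshold-activated
    q1 = k1 * suz                  # interflow
    return q0, q1
```

The threshold-activated q0 term is the conceptual counterpart of surface runoff, which supports the point that HBV does represent this process.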