This work is distributed under the Creative Commons Attribution 4.0 License.
When best is the enemy of good – critical evaluation of performance criteria in hydrological models
Guillaume Cinkus
Naomi Mazzilli
Hervé Jourde
Andreas Wunsch
Tanja Liesch
Nataša Ravbar
Zhao Chen
Nico Goldscheider
Abstract. Performance criteria play a key role in the calibration and evaluation of hydrological models and have been extensively developed and studied, but some of the most widely used criteria still have unknown pitfalls. This study set out to examine counterbalancing errors, which are inherent to the Kling-Gupta Efficiency (KGE) and its variants. A total of nine performance criteria – including the KGE and its variants, as well as the Nash-Sutcliffe Efficiency (NSE) and the refined version of the Willmott's index of agreement (dr) – were analysed using synthetic time series and a real case study. Results showed that, when assessing a simulation, the score of the KGE and some of its variants can be increased by concurrent over- and underestimation of discharge. These counterbalancing errors may favour the bias and variability parameters, thereby preserving an overall high score of the performance criteria. As the bias and variability parameters generally account for two-thirds of the weight in the equation of performance criteria such as the KGE, this can lead to an overall higher criterion score without being associated with an increase in model relevance. We recommend using (i) performance criteria that are not or less prone to counterbalancing errors (NSE, dr, modified KGE, non-parametric KGE, Diagnostic Efficiency) in a multi-criteria framework, and/or (ii) scaling factors in the equation to reduce the influence of the relative parameters.
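To make the counterbalancing effect concrete, the following minimal sketch (not from the paper; it assumes only the standard KGE and NSE formulas and purely synthetic data) compares a simulation with a consistent 10 % underestimation against one whose over- and underestimation compensate each other:

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)   # variability ratio
    beta = np.mean(sim) / np.mean(obs)  # bias ratio
    return 1 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency (Nash and Sutcliffe, 1970)."""
    return 1 - np.sum((obs - sim)**2) / np.sum((obs - np.mean(obs))**2)

rng = np.random.default_rng(0)
obs = 5 + 3 * np.sin(np.linspace(0, 8 * np.pi, 400)) + rng.normal(0, 0.3, 400)

sim_consistent = 0.9 * obs           # systematic 10 % underestimation
sim_compensating = obs.copy()
sim_compensating[:200] *= 1.1        # overestimates the first half ...
sim_compensating[200:] *= 0.9        # ... underestimates the second half

for name, sim in [("consistent", sim_consistent), ("compensating", sim_compensating)]:
    print(f"{name:>13}: KGE = {kge(obs, sim):.3f}, NSE = {nse(obs, sim):.3f}")
```

The squared errors of the two simulations are identical, so the NSE scores them equally, while the compensating simulation obtains a distinctly higher KGE purely because its mean and variance happen to match those of the observations.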
Status: final response (author comments only)
CC1: 'Comment on hess-2022-380', Charles Onyutha, 12 Dec 2022
The study investigated counterbalancing errors which are inherent in the Kling-Gupta Efficiency and its variants. The reliability of performance criteria is important to boost the confidence with which a particular model can be chosen. There are some important points which the authors could address to strengthen their paper.
In the last sentence of the abstract, the authors mention the use of a multi-criteria framework in their recommendation. On the need to consider a particular “goodness-of-fit” metric within the multi-criteria framework, the authors could clarify the other specific requirements apart from the general condition that the performance criteria should be less or not prone to counterbalancing.
Furthermore, the use of several criteria for a particular calibration can complicate the automated application of well-known search strategies or algorithms (Onyutha, 2022). It is on this basis that performance criteria which are not mathematically and statistically related tend to be combined into a single metric. For instance, the Kling-Gupta Efficiency combines three components: measures of bias, variability, and linear correlation between the observed (X) and modelled (Y) series. Thus, the authors should provide a more considered justification for their recommendation of the use of a multi-criteria framework for the calibration of hydrological models.
Most (if not all) of the metrics used in this study rely on the assumption that X and Y are linearly related. Note that X and Y can be highly dependent, yet it may be nearly impossible to detect the dependence using a classical dependence metric (Székely et al., 2007). In other words, the authors should clarify whether the model performance results of this study may have been affected by the said assumption.
Most of the performance criteria (especially the Nash-Sutcliffe Efficiency, NSE (Nash and Sutcliffe, 1970), and its variants) comprise some form of the well-known coefficient of determination (R-squared) (see Onyutha, 2022). R-squared is known to have various shortcomings. To address these shortcomings, new metrics including the revised R-squared (RRS) and the hydrological model skill score E (Onyutha, 2022) were developed. Thus, instead of focussing on the NSE and its variants, the authors should compare results of many other performance criteria such as RRS and E. Accordingly, Figure 7 and Table 1 in this manuscript can be updated. The MATLAB codes to compute RRS and E can be downloaded via https://doi.org/10.5281/zenodo.6570905 and can also be found as supplementary material to the paper by Onyutha (2022).
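As a side note on the Székely et al. (2007) reference, here is a minimal sketch of their distance correlation (illustrative only; not part of the comment or the manuscript), which detects non-linear dependence that Pearson's r misses:

```python
import numpy as np

def distance_correlation(x, y):
    """Distance correlation of Székely et al. (2007): zero iff X and Y are independent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre each matrix (subtract row and column means, add grand mean)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()          # squared sample distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

# Y = X^2 on a symmetric interval: Pearson's r is ~0, yet Y is a function of X
x = np.linspace(-1, 1, 201)
y = x**2
print(np.corrcoef(x, y)[0, 1])      # ~0: the linear metric misses the dependence
print(distance_correlation(x, y))   # clearly nonzero: dependence detected
```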
ON EQUATION 20
According to Legates & McCabe (2013), the refinement of the Index of Agreement (IOA) (Willmott, 1981) made by Willmott et al. (2012), especially regarding the extension of the IOA bound from 1 to 0, was unnecessary. Check Legates & McCabe (2013) for other limitations of the refined IOA. Therefore, could the authors make use of the original form of the IOA for their model performance evaluation and analyses?
LIST OF REFERENCES
Legates, D. R. & McCabe, G. J. 2013 A refined index of model performance: a rejoinder. International Journal of Climatology 33, 1053–1056.
Nash, J. E. & Sutcliffe, J. V. 1970 River flow forecasting through conceptual models part I – a discussion of principles. Journal of Hydrology 10, 282–290.
Onyutha, C. 2022 A hydrological model skill score and revised R-squared. Hydrology Research 53 (1), 51–64. https://doi.org/10.2166/nh.2021.071
Székely G. J., Rizzo M. L. & Bakirov N. K. 2007 Measuring and testing independence by correlation of distances. The Annals of Statistics 35 (6), 2769–2794.
Willmott, C. J. 1981 On the validation of models. Physical Geography 2, 184–194.
Willmott, C. J., Robeson, S. M. & Matsuura, K. 2012 A refined index of model performance. International Journal of Climatology 32, 2088–2094.
Citation: https://doi.org/10.5194/hess-2022-380-CC1
- AC1: 'Reply on CC1', Guillaume Cinkus, 29 Jan 2023
CC2: 'Comment on hess-2022-380', John Ding, 22 Dec 2022
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-380/hess-2022-380-CC2-supplement.pdf
- AC2: 'Reply on CC2', Guillaume Cinkus, 29 Jan 2023
RC1: 'Comment on hess-2022-380', Anonymous Referee #1, 07 Mar 2023
Having carried out hydrological modelling for the past 30 years, it is interesting to see how the use of different performance criteria has developed. The Nash-Sutcliffe Efficiency (NSE) has been the main criterion for flooding issues (and very often a general criterion for the overall performance of a model) for a long time. It has a number of well-documented drawbacks but has the advantage that its values are widely understood. The Kling-Gupta Efficiency (KGE) and its variants have become more popular recently, but my feeling is that the KGE is less well understood and that some of the issues associated with its use have not been fully explored.
This paper is a useful addition to the subject of different performance criteria, as it clearly shows that in the KGE there can be counterbalancing errors (i.e. sometimes an overestimation and sometimes an underestimation of discharge) which produce a higher value without there being an improvement in the model, whereas these counterbalancing errors do not occur for the NSE. The authors summarize the issue and their contribution very well when they say: “The aim of this paper is primarily to raise awareness among modellers. Performance criteria generally comprise several aspects of the characteristics of a model into a single value, which can lead to an inaccurate assessment of said aspects. Ultimately, all criteria have their flaws and should be carefully selected with regards to the aim of the model”.
The paper is well written and presented. There is a good summary of the current state of the use of different performance criteria in hydrological models. The use of both a synthetic time series and a real case study gives more confidence in the issue of these counterbalancing errors. Overall, it is a good piece of work with a clear conclusion reached. I am happy to accept the paper with minor revisions.
Specific comments:
- I wonder if it would be useful to mention some of the benchmarking studies in hydrological modelling (e.g. Seibert et al., 2018), which I feel are a useful addition with regard to the performance criteria of the models.
- L98 (Equation 11). This is the Gupta et al. (2009) equation. It is surely wrong as written: the last term should be minus, not plus.
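For reference (added here for convenience; this is the standard form from Gupta et al., 2009, not part of the referee's comment):

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}, \qquad \alpha = \frac{\sigma_s}{\sigma_o}, \quad \beta = \frac{\mu_s}{\mu_o},$$

where $r$ is the linear correlation between observed and simulated discharge, and $\mu$ and $\sigma$ denote the mean and standard deviation of the simulated ($s$) and observed ($o$) series.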
- L133-135. I do not understand this part. I can see there are 361 transformations between -0.36 and 0.36, but I do not understand where the logarithmic scale comes in and how you get from there to the w values.
- L195 (Figure 4). Would it be useful to also show the “Bad-Bad” model on this figure?
- Change “consisting in” to “consisting of”
- Maybe change “both ways” to “both sides” or “both directions”
- Change “succeed to reproduce” to “succeed in reproducing” or “successfully reproduce”
- L273-L274. “In general, the ANN model can be described as better because it is closer to the observed values in the high and low flow periods”. As a hydrological modeller I agree the ANN model is better. But surely the whole point of performance criteria is to objectively decide which model is better. So how do you decide it is better when the performance criteria do not agree? There is no easy answer but I feel it is an important question that should be considered in more detail.
- There is no reference to Figure 7; should there be one here?
- L300-L303. I do not think this bit adds anything. I would remove these lines.
- In Equation 22 the order of the parameters is alpha, beta, r. On line 344 and subsequent lines it is r, alpha, beta. This is confusing: when you look at (1-2-2) and then at Equation 22, everything needs to be swapped around, as the 1 corresponds to r, which is the last term in the equation.
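For readers following along, the scaled form of the KGE (written here in the scaling-factor notation of Gupta et al., 2009; the manuscript's Equation 22 may use different symbols or ordering, which is exactly the referee's point) is

$$\mathrm{KGE}_s = 1 - \sqrt{\big[s_r (r - 1)\big]^2 + \big[s_\alpha (\alpha - 1)\big]^2 + \big[s_\beta (\beta - 1)\big]^2},$$

so a triplet such as (1-2-2) is unambiguous only once it is tied to a fixed order of $(s_r, s_\alpha, s_\beta)$.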
- Change “associated to” to “associated with”
- Change “include to” to “include the”
Seibert, J., Vis, M. J., Lewis, E., & Meerveld, H. V. (2018). Upper and lower benchmarks in hydrological modelling. Hydrological Processes.
Gupta, H. V., Kling, H., Yilmaz, K. K., & Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of hydrology, 377(1-2), 80-91.
Citation: https://doi.org/10.5194/hess-2022-380-RC1
RC2: 'Comment on hess-2022-380', Anonymous Referee #2, 21 Mar 2023
Paper summary and overall evaluations
The paper examines several goodness-of-fit skill scores used for hydrologic model evaluation, based on synthetic flow data and real simulation data. In the paper, the skill scores are classified into two types: 1) multi-component skill scores such as the KGE (computed from multiple metrics like bias, variability error, and correlation using a distance measure) and its variants, and 2) the NSE. The paper focuses on the impacts on the skill scores (and on their components, bias and variability error, where applicable) originating from situations where both under- and overestimation of peak flows exist in one hydrograph. The paper also discusses compensation between the components of the skill scores, namely bias and variability. The paper concludes that KGE-type scores can be inflated for hydrographs that include both under- and overestimation of events (because of lower bias and variability error over the time series), which does not necessarily represent “accurate” simulations. The paper suggests that weighting the KGE components mitigates these misleading score values.
I think the hydrologic modeling community intuitively realizes this counterbalancing issue in KGE-type scores. The paper illustrates the issues explicitly and clearly and would be a useful reference for hydrologic modelers. I think the paper is also in fairly good shape in terms of presentation and writing; I don't have any major comments, only several minor ones.
Another thought: the overall results are mostly due to the fact that the skill scores use bias instead of the error magnitude (e.g., root-mean-square error, absolute error). I wonder if it is worth trying to modify the KGE components into two components: absolute error and correlation. I am not requesting that the authors do so (I don't even know whether this is a good idea), but reading the paper made me think about it.
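One possible reading of this suggestion (purely illustrative; the function name and the normalisation by the observed standard deviation are assumptions, not anything proposed in the manuscript or by the referee):

```python
import numpy as np

def two_component_score(obs, sim):
    """Hypothetical KGE-like score built from correlation and a normalised RMSE.
    Because the RMSE cannot be reduced by compensating over- and underestimation,
    such a score would not be prone to counterbalancing errors."""
    r = np.corrcoef(obs, sim)[0, 1]
    nrmse = np.sqrt(np.mean((sim - obs)**2)) / np.std(obs)  # error magnitude term
    return 1 - np.sqrt((r - 1)**2 + nrmse**2)
```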
Minor comments
I do understand why the NSE is included among the “recommended skill scores” given the context of this paper, but I am still not sure it is a good idea to state so, because the NSE has a separate issue of its own (it underestimates variability, so peak flows are underestimated and low flows are overestimated). I would suggest stating that the NSE is less impacted by counterbalancing errors in the hydrograph but has its own issues in practical applications.
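The variability underestimation the referee mentions follows directly from the NSE decomposition of Gupta et al. (2009) (added here for context):

$$\mathrm{NSE} = 2\alpha r - \alpha^2 - \beta_n^2, \qquad \beta_n = \frac{\mu_s - \mu_o}{\sigma_o}.$$

For a fixed correlation $r$, the NSE is maximised at $\alpha = r \le 1$, so any imperfect correlation pushes the optimal simulated variability below the observed one.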
Section 4. In the real case study, the paper uses “reservoir model” for what is actually a bucket-type conceptual hydrologic model. I suggest avoiding the term “reservoir model” because some readers (including me) may confuse it with a reservoir “operation” model (i.e., a lake model).
L264. The paper says that the “third flood event (May 2017) is better simulated by the ANN”. I don't see this: both the ANN and the reservoir model similarly underestimate the flows. Also, the statement after “because” is unclear to me.
L269-273. I suggest using the dates to indicate which events are referred to.
Citation: https://doi.org/10.5194/hess-2022-380-RC2
Viewed

HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
631 | 214 | 16 | 861 | 4 | 6