the Creative Commons Attribution 4.0 License.
Lack of robustness of hydrological models: A large-sample diagnosis and an attempt to identify the hydrological and climatic drivers
Abstract. The transferability of hydrological models over contrasted climate conditions, also identified as model robustness, has been the subject of much research in recent decades. The occasional lack of robustness identified in such models is not only an operational challenge – since it affects the confidence that can be placed in projections of climate change impact – but it also hints at possible deficiencies in the structure of these models. This paper presents a large-scale application of the robustness assessment test (RAT) for three hydrological models with different levels of complexity: GR6J, HYPE and MIKE SHE. The dataset comprises 352 catchments located in Denmark, France and Sweden. Our aim is to evaluate how robustness varies over the dataset and between models and whether the lack of robustness can be linked to some hydrological and/or climatic characteristics of the catchments (thus providing a clue as to where to focus model improvement efforts). We show that although the tested models are very different, they encounter similar robustness issues over the dataset. However, models do not necessarily lack robustness on the same catchments and are not sensitive to the same hydrological characteristics. This work highlights the applicability of the RAT regardless of model type and its ability to provide a detailed diagnostic evaluation of model robustness issues.
Status: closed
RC1: 'Comment on hess-2024-80', Anonymous Referee #1, 12 Jul 2024
General comments
This paper presents an application of the robustness assessment test to a large-sample of catchments across France, Denmark and Sweden. The analysis uses results from three hydrological models and the paper analyses how the robustness varies across the dataset in relation to a selection of hydrological and climatic characteristics. Overall the paper is easy to follow and results are presented clearly, but the manuscript could do with a little more synthesis to bring it all together at the end.
Specific comments
Section 1.3: There could be a little more detail in this section. For example, instead of mentioning that you use a ‘large set of catchments spanning various climate conditions in three European countries’ perhaps you could mention how many catchments are simulated, across which conditions and in which countries.
L91 Did you consider using any metrics other than the model bias to assess the differences between observed and simulated flows? Why did you choose this one and do you think your results are sensitive to this choice?
L131 could you quantify how many rivers are affected by hydropower production? It would be interesting to know how many catchments in this sample are affected by this.
L129 I understand that perhaps it does not dictate the hydrology as heavily, but since the geology is discussed for France and Denmark is it worth describing the Swedish geology as well?
Table 1 and 2: Although these tables are useful for listing all the signatures used in this study, I do not think the quantiles are particularly easy to digest, is there a more visual way that this information could be displayed? I like the maps in the supplementary material but understand that there are probably too many to include in the main paper.
L167: Could the runoff ratios exceeding 100% be related to the hydropower? Often water is imported to support these schemes.
L231: on L432 you mention that there is a regulation module in HYPE, could this be briefly described here?
Figure 3: # of stations isn’t a particularly intuitive label, perhaps you could instead write ‘Reactive catchments: ‘
L282: instead of saying ‘especially numerous’ could you instead quantify how many catchments are reactive in France?
L385 which is section 0?
L410: Have you thought about using any signatures which describe the degree of flow regulation by reservoirs/ hydropower? This might help to identify whether the flaws in the GR4J model are linked to this and could be included as a signature in Figure 10.
L453: could you elaborate on what you mean when you say the calibration of S-HYPE could be responsible for the seemingly random reactivity? Perhaps this could be done on L533.
L472: what do you mean by differs from the rest of the dataset? If the same can be said for the Swedish catchments then do you just mean that the results differ from those associated with the French data? This seems to be contradicted by L504.
Table 4: Could you perhaps shade the last three columns so that we can see the patterns visually?
L528: Again, perhaps you could consider using a signature to quantify the degree of dam regulation in each catchment to confirm or reject this hypothesis.
L564 what did you do with the catchments where the KGE was less than 0.7?
Section 5.1: This section feels like a lot of repetition of results/ discussion and doesn’t really feel like it achieves much synthesis. It would be good to make the implications of your work clearer here. The start of the paper makes it clear that this work is useful for understanding the implications of using models such as HYPE, GR6J and MIKE SHE for climate change applications, but I don’t feel like you ever quite bring together your findings here and discuss what your results mean for using these models for climate change applications. ‘Our analysis pointed out flaws in the models in terms of robustness to changing climate.’. Although I can see that the idea is that you use the results from catchments with different climatic conditions as proxies for how the models will perform under climate change, it would be good to make this link clearer.
It would also be good to have some discussion of how transferable your results are to other hydrological models. Are your findings only relevant for the models used in this study? Or is it likely that your findings will be relevant for other models used in other countries too?
Citation: https://doi.org/10.5194/hess-2024-80-RC1
AC1: 'Reply on RC1', Léonard Santos, 02 Sep 2024
Thank you very much for your detailed review and for your encouraging comments. We will make all the necessary editorial changes in the revised manuscript, clarifying the points you mentioned. For now, we would like to focus on the main points of your review:
- Considering metrics other than the model bias to assess the differences between observed and simulated flows
You are right on this point: bias is only one of the metrics that could be considered, and success at the RAT should only be considered a « necessary but not sufficient » condition for using a model in a climate evolution context. The same methodology could be applied to bias in different flow ranges (low or high flows) or to statistical indicators describing low-flow characteristics or maximum annual streamflow. Characteristics other than bias could also be tested, e.g. ratios pertaining to the variability of flows. We mentioned this point in the technical note in which we introduced the RAT methodology (Nicolle et al., 2021). Nonetheless, we believe that bias is the first metric to be considered when looking at robustness in a climate change context.
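For illustration only, the idea of applying the same test to bias in a chosen flow range could be sketched as follows. This is a minimal sketch, not the implementation of Nicolle et al. (2021): it assumes that the test amounts to looking for a rank correlation between a yearly bias and a yearly climate variable, it omits the significance test that a real application would need, and all function names and the toy data are hypothetical.

```python
from statistics import mean


def annual_bias(obs, sim):
    """Bias for one year: mean simulated flow divided by mean observed flow."""
    return mean(sim) / mean(obs)


def bias_in_range(obs, sim, low, high):
    """Same bias, restricted to time steps whose observed flow lies in [low, high]."""
    pairs = [(o, s) for o, s in zip(obs, sim) if low <= o <= high]
    return annual_bias([o for o, _ in pairs], [s for _, s in pairs])


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy data (assumed, not from the paper): four years in which the
# simulated flow overestimates more and more as the year gets warmer.
temperature = [9.0, 9.5, 10.2, 11.0]   # hypothetical annual means
factors = [0.95, 1.00, 1.05, 1.10]     # hypothetical over/underestimation
obs = [1.0, 2.0, 3.0]                  # hypothetical observed flows
biases = [annual_bias(obs, [f * o for o in obs]) for f in factors]

rho = spearman(temperature, biases)    # monotonic toy data, so rho is numerically 1
print(f"rank correlation between annual bias and temperature: {rho:.2f}")
```

Replacing `annual_bias` with `bias_in_range` in the loop would give the low-flow or high-flow variant mentioned above; a strong correlation would then point to a bias that depends on climate in that flow range.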
- Section 5.1
We modified this section to try to achieve a real synthesis, as this is what will be useful to the reader. The section is now much shorter and reads as follows:
This paper presented a large-sample analysis of the robustness of three models to a changing climate. The RAT allowed us to evaluate the robustness of the three different models without controlling their calibration process, and the analysis of the hydrological signatures of the catchments that react to the RAT suggested some potential issues specific to each model. Our objective was not to compare models, as we have shown that they all suffer from a lack of robustness that prevents them from being safely applied in a changing climate context, but to identify the hydrological features that could be the cause of this lack of robustness. Overall, the models reacted to the RAT on a significant number of catchments (between 33% and 42% depending on the model and the datasets), and this indicates that much work is needed to make models more robust in the context of climate change.
- Transferability of our results
Thank you for suggesting this. We plan to add in the revised paper the following short section to the conclusion:
5.2 How generic are our results?
The issue of genericity is central in science. With an application of the RAT over 3 models, in 3 countries and on a total of 352 catchments, the work presented in this paper represents a significant improvement over what had initially been done in the note describing the method (Nicolle et al., 2021). Because models are more than ever used to predict the impact of a changing climate, we believe more than ever in the need to test them more thoroughly and to challenge their extrapolation capacity. Because the RAT is so simple to apply, and because it can be applied both to models requiring calibration that run in seconds and to models that do not require calibration and need hours to produce a single run, we consider that it is a useful investment for a modeler as well as for a model user, one that is likely to « increase their confidence » in their results, as recommended by de Marsily et al. (1992).
Of course, we keep in mind the advice that the late Vit Klemeš (personal communication) sent to one of us. Asked how he looked back on the impact of his famous paper discussing the different options of split-sample test (Klemeš, 1986), he answered that he had in fact always been skeptical about the capacity of hydrologists to rigorously validate their models: he said he knew in advance that the tests he had suggested would be « avoided under whatever excuses available because modelers, especially those who want to ‘market’ their products, know only too well that they would not pass it », adding that he had « no illusions in this regard » when he wrote his paper. We do not have any illusions either, and we do not wish to tilt at windmills. We modestly think it is part of our scientific duty to keep expressing our concerns on this topic.
References
Klemeš, V. 1986. Operational testing of hydrological simulation models. Hydrol. Sci. J., 31, 13-24.
de Marsily, G., Combes, P., and Goblet, P. 1992. Comment on ‘Ground-water models cannot be validated’, by L.F. Konikow and J.D. Bredehoeft. Adv. Water Resour., 15, 367-369.
Nicolle, P., V. Andréassian, P. Royer-Gaspard, C. Perrin, G. Thirel, L. Coron, & L. Santos. 2021. Technical Note – RAT: a Robustness Assessment Test for calibrated and uncalibrated hydrological models. Hydrol. Earth Syst. Sci., 25, 5013–5027. https://doi.org/10.5194/hess-25-5013-2021
Citation: https://doi.org/10.5194/hess-2024-80-AC1
RC2: 'Comment on hess-2024-80', Anonymous Referee #2, 15 Jul 2024
The study by Santos et al. explores the application of the robustness assessment test (RAT) for three hydrological models of varying complexity. They tested the RAT in 352 catchments across Denmark, France, and Sweden. The topic is very interesting and indeed worth studying. The methodology is well-explained, and the writing is clear. However, my main concern is the comparability of the results, since these three models were calibrated separately, each at different times and by different research institutes. Additionally, some of the explanations for the results appear somewhat strained and lack adequate data support; for example, linking robustness issues to dam regulation. If the authors adequately address these issues, I believe this paper is suitable for publication in the HESS journal. My detailed comments can be found below.
Detailed comments:
Line 88: This line says that at least 30 years of data are needed, yet Figure 1 is labelled ‘> 20 years’. So what is the minimum requirement for data?
Line 116: Could you use Köppen-Geiger classes as a background map in Figure 2? This would give readers a more intuitive understanding of the climate zones to which each watershed belongs.
Line 222-223: What do you mean by ‘free parameters’? Please clarify.
Line 234: Does this ‘ca.’ represent the catchment area?
Table 3: Please add an explanation of what you mean by ‘OF’.
Table 3: You mentioned in the discussion that these 3 models were calibrated on different time periods. Could you add to this table the specific time periods over which each of these models was calibrated?
Line 385: Can you clarify what you mean by ‘Sect. 0’?
Line 430: I don’t think the large river catchments will necessarily be higher than the average level. This sentence is not rigorous. Please correct.
Line 523-525: Can you provide the details of these tests on the choice of the evaporation formula in the supplement file?
Line 527: Can you add the location of these dams in one of your figures? It would be nice and more convincing to see the spatial distribution of both the dam locations and the GR6J robustness issues in order to draw this conclusion. Moreover, the presence of dams in a catchment does not necessarily mean that its streamflow is impacted by them. So how do you know that the streamflow of these catchments is actually affected by the dams?
Line 534-539: I’m a bit concerned that different calibration strategies were used and that the models were calibrated over different time periods. Adopting different calibration methods may introduce uncertainty. I’m not sure whether the calibration results of models using different calibration methods are comparable. More justification is needed here.
Citation: https://doi.org/10.5194/hess-2024-80-RC2
AC2: 'Reply on RC2', Léonard Santos, 02 Sep 2024
Thank you very much for your detailed review and for all your encouraging comments. We will correct the identified typos and make the necessary editorial changes in the revised manuscript, clarifying all the points you mentioned.
For now, we would like to focus our answer on your concern about the different calibration strategies used. We understand your point: in order to keep “all other things equal”, and to ease the interpretation, you argue that it would have been better to have the same calibration strategy for all models. The problem is that the recommended calibration strategies vary widely among models, and very often, we face another criticism: “you did not respect the calibration that the modelers recommend, therefore we do not trust your results” or even “you don’t have the necessary expertise to perform the calibration of this model, therefore we do not trust your results”. We believe that the only way to avoid this criticism is to ask each “specialist” to perform their own calibration/parameterization, to do their best, and to judge the robustness of the model when placed in the “best” conditions.
Citation: https://doi.org/10.5194/hess-2024-80-AC2