Improving the Pareto Frontier in multi-dataset calibration of hydrological models using metaheuristics
Abstract. Hydrological models are crucial tools in water and environmental resource management, but they require careful calibration based on observed data. Model calibration remains a challenging task, especially if a multi-objective or multi-dataset calibration is necessary to generate realistic simulations of the multiple flow components under consideration. In this study, we explore the value of three metaheuristics, i.e. (i) Monte Carlo (MC), (ii) Simulated Annealing (SA), and (iii) Genetic Algorithm (GA), for a multi-dataset calibration to simultaneously simulate streamflow, snow cover and glacier mass balances using the conceptual HBV model. Based on the results from a small glaciated catchment of the Rhone River in Switzerland, we show that all three metaheuristics can generate parameter sets that result in realistic simulations of all three variables. A detailed comparison of model simulations with these three metaheuristics reveals, however, that GA provides the most accurate simulations (with the lowest confidence intervals) for all three variables when using both the 100 and the 10 best parameter sets for each method. However, when considering the 100 best parameter sets per method, GA also yields some of the worst solutions from the pool of all methods’ solutions. The findings are supported by a reduction of the parameter equifinality and an improvement of the Pareto frontier for GA in comparison to the other two metaheuristic methods. Based on our results, we conclude that GA-based multi-dataset calibration leads to the most reproducible and consistent hydrological simulations when multiple variables are considered.
Status: closed
RC1: 'Review Comment on hess-2021-325', Anonymous Referee #1, 06 Sep 2021
This well-written manuscript compares three different non-likelihood based model calibration methods for a single case study with the aim of showing the relative value of each of the methods, given the same amount of model runs.
The paper is based on a case study presented in the work of Finger et al. 2011. It takes from this earlier study the case study, the (multi-objective) data, the hydrological model, the metrics and part of the model calibration methods. What is added is the use of the simulated annealing method and the Genetic Algorithm.
While the idea of comparing the relative value of different model calibration methods given a fixed amount of simulations is somewhat interesting, I cannot recommend the publication of this paper because all results are conditional on the single case study and, more importantly, on the chosen algorithmic parameters of the compared search algorithms (simulated annealing, Genetic algorithm). These algorithmic parameters are not discussed, they are simply fixed. Furthermore, GA is treated as a single method whereas there is a multitude of implementations with different performances.
Thus: The conclusion that GA outperforms the other two within a fixed amount of simulations is not really interesting: it certainly outperforms random search (MC); whether or not it outperforms simulated annealing depends on the problem at hand and on the algorithmic parameters (how the search algorithm is tuned).
All conclusions on the value of multi-data calibration for this case study reiterate and reinforce earlier conclusions.
The paper does not present new methods on how to compare the algorithm outputs nor on how to analyze the optimization outputs (metrics taken from Finger et al., 2011). Accordingly, the paper does not present new methods nor new transferable insights into existing methods (except into the exact algorithms used in this paper) nor new insights into hydrological processes.
As far as I see, this paper does thus not fit HESS.
Additional comment:
I do not understand how the paper can mix Pareto-optimality and multi-objective optimization via objective function weighing: if we optimize a weighted sum of objective functions, you cannot get the Pareto-frontier (or only if you explore different weighings). Either an algorithm looks specifically for solutions on the frontier or it does not. Judging a posteriori how many we found by chance seems a rather unfair criterion to compare different algorithms. But perhaps I misunderstood something here.
Citation: https://doi.org/10.5194/hess-2021-325-RC1
AC1: 'Reply on RC1', David C. Finger, 10 Sep 2021
Dear editor, Prof. Harrie-Jan Hendricks-Franssen,
Dear reviewers,
Thank you for your critical review comments on our manuscript. Below we respond to your comments and outline how we will address your comments in a revised version of our manuscript (our responses are provided in italics):
We propose to change the manuscript title; a tentative title could be:
“The value of metaheuristic calibration techniques in multi-dataset calibrations to improve hydrological modeling results and constrain parameter uncertainty”,
which better describes our study.
Anonymous Referee #1
This well-written manuscript compares three different non-likelihood based model calibration methods for a single case study with the aim of showing the relative value of each of the methods, given the same amount of model runs.
Authors’ reply (AR): Thank you for this positive comment.
The paper is based on a case study presented in the work of Finger et al. 2011. It takes from this earlier study the case study, the (multi-objective) data, the hydrological model, the metrics and part of the model calibration methods. What is added is the use of the simulated annealing method and the Genetic Algorithm.
While the idea of comparing the relative value of different model calibration methods given a fixed amount of simulations is somewhat interesting, I cannot recommend the publication of this paper because all results are conditional on the single case study and, more importantly, on the chosen algorithmic parameters of the compared search algorithms (simulated annealing, Genetic algorithm). These algorithmic parameters are not discussed, they are simply fixed. Furthermore, GA is treated as a single method whereas there is a multitude of implementations with different performances.
AR: Regarding the first comment, we agree with the reviewer that a single case study can only reveal case-specific results. However, the methodology applied and the framework developed are valid for any case study. Thus, the framework built for comparing different metaheuristic methods remains valid for other case studies and can be directly transferred to other catchments.
The choice of the case study for illustrating the work is a general issue in any hydrological study, as every study has to be based on a selected catchment or a set of catchments. In our case, we chose the same case study as in Finger et al. (2011), i.e., the Rhonegletscher catchment, which has been well investigated in numerous previous studies. Our reasons for choosing this catchment are as follows: 1) the data availability is excellent, 2) we can relate the current results to previous work done on this case study, and 3) its high glaciation fits the purpose of our study perfectly. We do envisage an expansion to other case studies in the future, but we believe this would be the subject of another study.
Regarding the algorithm parameters and settings, we decided to use the version of the GA algorithm that is implemented as part of the HBV model and suggested by Seibert (2000). There is a multitude of GA algorithms and parameter settings, but we feel that the selection process and the sensitivity analysis required to find the best version of GA and the parameter settings that optimize the performance are beyond the scope of this paper. We therefore selected three algorithms and their parameter settings from the literature, i.e. MC from Finger et al. (2011), SA from Stefnisdóttir et al. (2020), and GA from Seibert (2000) and the HBV model, as representatives of the three different algorithm paradigms. We will clarify this issue in the revised manuscript.
Thus: The conclusion that GA outperforms the other two within a fixed amount of simulations is not really interesting: it certainly outperforms random search (MC); whether or not it outperforms simulated annealing depends on the problem at hand and on the algorithmic parameters (how the search algorithm is tuned).
AR: We agree that GA will certainly find an optimal solution faster than MC. However, the novelty of our approach relies on the complementary processing of the results from MC, SA, and GA to perform a non-weighted multi-dataset calibration, which allows us to assess the impact of the three algorithms on three independent evaluation criteria and on the parameter uncertainty. Furthermore, our results reveal that GA is able to find better solutions faster than SA, based on the example study. As the number of iterations (model runs) is one of the critical settings that has to be specified for any optimization algorithm, this finding should, in our opinion, be of interest to the hydrological community. We agree, however, that how much GA is able to outperform MC or SA depends on the problem at hand. We will emphasize this issue much more strongly in our revised manuscript.
Moreover, we were able to demonstrate that the spread of the Pareto Frontier is much better in MC than in the other two methods, as shown in Figure 9. One could argue that GA or SA will find a local optimum without exploring the full spread of the Pareto Frontier. However, our results demonstrate the opposite, i.e., the Pareto Frontier of our multi-dataset calibration was improved compared to the other methods, and the parameter uncertainty could be significantly reduced. We believe that this is a fundamental result that is of value for the entire hydrological modeling community.
All conclusions on the value of multi-data calibration for this case study reiterate and reinforce earlier conclusions.
AR: We thank the reviewer for this supportive statement. We would like to point out that our results additionally provide proof that the Pareto Frontier can be identified with our method. We also highlight the importance of analyzing the Pareto Frontier in the post-processing analysis to support the evaluation of the methods. We will put more focus on this issue in our revised manuscript.
The paper does not present new methods on how to compare the algorithm outputs nor on how to analyze the optimization outputs (metrics taken from Finger et al., 2011). Accordingly, the paper does not present new methods nor new transferable insights into existing methods (except into the exact algorithms used in this paper) nor new insights into hydrological processes.
AR: It is true that MC, SA and GA have been developed in previous works, and we build on these works in our study. However, the comparison of non-weighted multi-dataset calibrations and their effect on the Pareto Frontier has never been quantified in hydrology in the context provided in this manuscript. To the best of our knowledge, there is no other study that has looked at multi-dataset calibration using snow cover, discharge, and glacier mass balance data while comparing these three optimization algorithms. In addition, we judge the three methods under strict computational requirements, represented here by the number of model runs, since the question is also how well the algorithms use the available computational resources.
As far as I see, this paper does thus not fit HESS.
AR: Given our argumentation above, we must disagree with the reviewer on this point. We believe that HESS is a great journal for our manuscript as it deals with the problem of multi-dataset calibration using different optimization algorithms. This should be of interest to HESS readers, as also noted by the 2nd reviewer. Based on the reviewer's comments, however, we believe that we need to better highlight the novelty of our work and clarify the methodology.
Additional comment:
I do not understand how the paper can mix Pareto-optimality and multi-objective optimization via objective function weighing: if we optimize a weighted sum of objective functions, you cannot get the Pareto-frontier (or only if you explore different weighings). Either an algorithm looks specifically for solutions on the frontier or it does not. Judging a posteriori how many we found by chance seems a rather unfair criterion to compare different algorithms. But perhaps I misunderstood something here.
AR: We agree that a weighing function cannot simply define a Pareto frontier. However, we believe that there is a misunderstanding, which we firmly intend to clarify in a revised version of the manuscript. As stated in the reply to reviewer 2 (see below), we do not use function weighing. We complemented the standard calibration algorithms (MC, SA, and GA) with a ranking algorithm to quantify the trade-off between the three calibration criteria and identify the Pareto front (details are described in Finger et al., 2011). Each calibration criterion has the same weight. In short: we rank all runs according to each calibration criterion, average the ranks, and select the 10 or 100 best runs according to the average rank. This ensemble of “good runs” does not have “a best” run; it is simply the ensemble that describes the trade-off between the three criteria and accordingly defines the Pareto front. This method allows us to avoid function weighing while still identifying a Pareto front (illustrated in Figure 9).
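To make the ranking procedure described above concrete, the following minimal sketch (Python/NumPy) shows one possible implementation of the equal-weight rank averaging; the function name `rank_based_selection` and the exact tie handling are illustrative assumptions, not the actual code used with the HBV model.

```python
import numpy as np

def rank_based_selection(nse_q, rmse_mb, sc_ratio, n_best=100):
    """Illustrative equal-weight rank averaging over three calibration criteria.

    nse_q    : Nash-Sutcliffe efficiency for discharge (higher is better)
    rmse_mb  : RMSE of the glacier mass balance (lower is better)
    sc_ratio : ratio of correctly predicted snow cover (higher is better)
    Returns the indices of the n_best runs with the best (lowest) average rank.
    """
    nse_q, rmse_mb, sc_ratio = map(np.asarray, (nse_q, rmse_mb, sc_ratio))

    # Rank the runs per criterion (rank 1 = best run for that criterion).
    rank_q = (-nse_q).argsort().argsort() + 1      # maximise NSE
    rank_mb = rmse_mb.argsort().argsort() + 1      # minimise RMSE
    rank_sc = (-sc_ratio).argsort().argsort() + 1  # maximise hit ratio

    # Equal weights: a plain average of the three ranks per run.
    mean_rank = (rank_q + rank_mb + rank_sc) / 3.0

    # The n_best runs form the "good run" ensemble describing the trade-off.
    return np.argsort(mean_rank)[:n_best]
```

Selecting, e.g., `n_best=10` or `n_best=100` then yields the ensembles discussed in the manuscript, without ever collapsing the three criteria into a single weighted objective.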
Thus, we believe that this comment is based on a misunderstanding, which we will clarify in a revised version of the manuscript.
Citation: https://doi.org/10.5194/hess-2021-325-AC1
RC2: 'Comment on hess-2021-325', Anonymous Referee #2, 06 Sep 2021
The study analyses the performance of three frequently used automatic optimization algorithms including Monte Carlo simulations, simulated annealing (SA) and a genetic algorithm (GA) for a multivariable calibration in a small glaciated catchment considering streamflow, snow cover area and glacier mass balance. The results are evaluated based on the objective function values achieved by the best 100 and best 10 parameter sets as well as by uncertainty widths regarding prediction and parameter uncertainties. The authors conclude that the genetic algorithm outperformed the two other methods as it achieved better solutions and narrower confidence intervals than the other two methods.
The paper is generally well structured. The problem of model calibration is within the scope of HESS and the question of which search technique to select is a practical and relevant question that modelers have to decide upon.
However, I have several major concerns with this paper.
- The authors see the novelty of their work in “confronting for the first time these three metaheuristics most frequently applied in hydrology within a multi-output calibration framework to derive practical recommendations for further applications”. This seems overstated. What about other comparisons of optimization techniques? (See e.g. studies cited in Efstratiadis and Koutsoyiannis (2010)). Are Monte Carlo, GA and SA really the three most frequently applied optimization methods in hydrologic model calibration? What has been found by other studies that compare different optimization techniques and what are the research gaps that are addressed in this study?
- A fundamental problem of the current study is that the authors emphasize the multi-objective nature of the problem and aim at analyzing which of the three optimization methods provides the most balanced pareto front. However, as far as I understand, the authors did not apply optimization techniques that are designed for this task (e.g. multi-objective variants of GA). Instead it seems that the multi-objective problem was summarized to a single-objective problem (using a weighted sum approach with fixed weightings) and SA and GA were applied in their single-objective forms (which is fine in principle but not if the aim is to study the pareto front).
- For analyzing which method performs best in representing the pareto front, the study focuses on objective function values and the number of non-dominated solutions, concluding that GA performs best. In a multi-objective setting, one should additionally consider the diversity of the solutions, i.e. how well they are spread along the pareto front.
- Some of the conclusions cannot be drawn from the results of this study. The study concludes that the results demonstrated the value of multi-dataset calibration for realistically simulating different runoff components. However, while this might have been a finding from a previous study I cannot see how this can be concluded based on results from the current study. The study also states that “it appears to be essential to give equal weights to all modelled runoff components”. However, only one weight configuration has been tested so that this statement cannot be derived from the presented results.
Citation: https://doi.org/10.5194/hess-2021-325-RC2
AC2: 'Reply on RC2', David C. Finger, 10 Sep 2021
Anonymous Referee #2
The study analyses the performance of three frequently used automatic optimization algorithms including Monte Carlo simulations, simulated annealing (SA) and a genetic algorithm (GA) for a multivariable calibration in a small glaciated catchment considering streamflow, snow cover area and glacier mass balance. The results are evaluated based on the objective function values achieved by the best 100 and best 10 parameter sets as well as by uncertainty widths regarding prediction and parameter uncertainties. The authors conclude that the genetic algorithm outperformed the two other methods as it achieved better solutions and narrower confidence intervals than the other two methods.
The paper is generally well structured. The problem of model calibration is within the scope of HESS and the question of which search technique to select is a practical and relevant question that modelers have to decide upon.
AR: We thank the reviewer for this positive feedback.
However, I have several major concerns with this paper.
The authors see the novelty of their work in “confronting for the first time these three metaheuristics most frequently applied in hydrology within a multi-output calibration framework to derive practical recommendations for further applications”. This seems overstated. What about other comparisons of optimization techniques? (See e.g. studies cited in Efstratiadis and Koutsoyiannis (2010)). Are Monte Carlo, GA and SA really the three most frequently applied optimization methods in hydrologic model calibration? What has been found by other studies that compare different optimization techniques and what are the research gaps that are addressed in this study?
AR: We thank the reviewer for this meaningful comment. We agree with the reviewer that metaheuristics have been investigated numerous times in several studies. We also apologize for not having included all previous works. In a revised version we will include a thorough literature review and cite all relevant works not cited so far, especially the works cited in Efstratiadis and Koutsoyiannis (2010). The novelty of our comparison study lies, however, in the fact that we use a non-weighted, multi-dataset calibration, where all datasets are equally weighted, and apply three algorithms of similar computational effort, i.e., Monte Carlo (MC), Genetic Algorithm (GA) and Simulated Annealing (SA). We agree that these three metaheuristics may not be the most commonly applied, but they are computationally comparable and simple to implement, which is a great benefit of using them. We will highlight this point more in the revised manuscript. To our knowledge, there is no other study in hydrology that has used multi-dataset calibration with these three metaheuristics in order to investigate the Pareto front. We will emphasize the novelty and the distinction of our study from previously published works in a revised version of this manuscript.
A fundamental problem of the current study is that the authors emphasize the multi-objective nature of the problem and aim at analyzing which of the three optimization methods provides the most balanced pareto front. However, as far as I understand, the authors did not apply optimization techniques that are designed for this task (e.g. multi-objective variants of GA). Instead it seems that the multi-objective problem was summarized to a single-objective problem (using a weighted sum approach with fixed weightings) and SA and GA were applied in their single-objective forms (which is fine in principle but not if the aim is to study the pareto front).
AR: We agree with the argumentation of the reviewer but believe that there is a misunderstanding. We complemented the standard multi-objective calibrations (GA, SA, and MC) with a ranking of all runs according to each criterion individually and a subsequent averaging of the ranks to obtain the Pareto front (see below for more details). We do not use a single-objective form, but a fully independent multi-dataset calibration with three equally weighted objective criteria. These three criteria are used during the calibration but, as correctly noticed by the reviewer, we do not assign different weights to them; all receive the same weight. The reason for not varying the weights is that we simply want all three variables (i.e. discharge, snowmelt, and glacier mass balance) to be simulated similarly well. Introducing different weights would possibly have improved the simulation of one of the variables, but at the cost of lower performance for the other two variables. Further details of how the best runs are chosen are described in Finger et al. (2011); in principle, we rank all model runs according to the performance for each dataset, then average the ranks and finally select the 10 or 100 best runs that received the highest average rank. We use this ensemble of “good” runs to illustrate our Pareto frontier. This also allows us to obtain a Pareto frontier for each method (MC, SA, and GA). We will make an extra effort to clarify the method in a revised version of the manuscript.
For analyzing which method performs best in representing the Pareto front, the study focuses on objective function values and the number of non-dominated solutions, concluding that GA performs best. In a multi-objective setting, one should additionally consider the diversity of the solutions, i.e. how well they are spread along the pareto front.
AR: The reviewer is correct! We will use the Pareto frontier as a complementary tool to support the comparison of our three calibration algorithms. We will add additional text on the Pareto spread as a criterion for method comparison in the revised manuscript.
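As an illustration of what such a diversity assessment could look like, the sketch below (Python/NumPy) marks the non-dominated runs and computes a simple spacing-type spread measure; the function names `pareto_mask` and `spacing` are hypothetical, and the spread metric shown is only one of several possible diversity measures, not the analysis actually performed in the manuscript.

```python
import numpy as np

def pareto_mask(objectives):
    """Boolean mask of non-dominated runs.

    objectives : array of shape (n_runs, n_criteria) with all criteria
                 oriented so that larger is better (e.g. NSE, -RMSE,
                 snow-cover hit ratio).
    """
    obj = np.asarray(objectives, dtype=float)
    mask = np.ones(obj.shape[0], dtype=bool)
    for i in range(obj.shape[0]):
        # Run i is dominated if some run is at least as good in every
        # criterion and strictly better in at least one.
        at_least_as_good = np.all(obj >= obj[i], axis=1)
        strictly_better = np.any(obj > obj[i], axis=1)
        if np.any(at_least_as_good & strictly_better):
            mask[i] = False
    return mask

def spacing(points):
    """Mean nearest-neighbour distance within the non-dominated set:
    a simple proxy for how evenly the solutions cover the front."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    return float(dist.min(axis=1).mean())
```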
Some of the conclusions cannot be drawn from the results of this study. The study concludes that the results demonstrated the value of multi-dataset calibration for realistically simulating different runoff components. However, while this might have been a finding from a previous study I cannot see how this can be concluded based on results from the current study. The study also states that “it appears to be essential to give equal weights to all modelled runoff components”. However, only one weight configuration has been tested so that this statement cannot be derived from the presented results.
AR: Thank you for this comment, which showed us that we must clarify our statements and conclusions. We believe that this comment is based on a misunderstanding and would like to clarify our point. It is true that similar results regarding the value of multi-objective calibration were found in the study of Finger et al. (2011); however, that study used only MC calibration. Our results indeed reconfirm previous findings from Finger et al. (2011, 2012, 2015, 2018), Etter et al. (2018) and de Niet et al. (2020), and complement them by demonstrating the value of SA and GA and their impacts on the Pareto front. This is the major novelty of our study, which we will describe better in the revised manuscript.
It is also true that we use the same weights for all three calibration criteria, i.e., the Nash-Sutcliffe coefficient for discharge (Q), the RMSE between measured and simulated glacier mass balances (MB), and the ratio of correctly predicted snow cover area for snow cover (SC). We did not test different setups of the weights as we aim at having all variables simulated equally well. We will modify these conclusions in our revised manuscript.
Citation: https://doi.org/10.5194/hess-2021-325-AC2
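For concreteness, a minimal sketch of the three equally weighted criteria named in the reply above is given here (Python/NumPy); the exact formulations used with the HBV model (e.g. how the snow-cover comparison handles cloud-covered satellite scenes) may differ, so these definitions are stated assumptions for illustration only.

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency for discharge Q (1 = perfect fit)."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def rmse(mb_obs, mb_sim):
    """Root mean square error between measured and simulated mass balance MB
    (0 = perfect fit)."""
    mb_obs, mb_sim = np.asarray(mb_obs, float), np.asarray(mb_sim, float)
    return float(np.sqrt(np.mean((mb_obs - mb_sim) ** 2)))

def snow_cover_ratio(sc_obs, sc_sim):
    """Fraction of pixels/days where the simulated snow cover SC agrees with
    the observed snow cover (1 = perfect agreement); boolean inputs."""
    sc_obs, sc_sim = np.asarray(sc_obs, bool), np.asarray(sc_sim, bool)
    return float(np.mean(sc_obs == sc_sim))
```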
AC3: 'Comment on hess-2021-325', David C. Finger, 07 Oct 2021
Dear Editor,
We have replied to all reviewers' comments and addressed all concerns raised by the reviewers in our response. The novelty of our work is the comparison of three metaheuristic calibration methods and their effect on the Pareto front. We believe that the reviewers misunderstood our approach, and accordingly, we intend to improve the clarity in a revised version of the manuscript. In our opinion, our manuscript addresses a fundamental research question in hydrological modelling. Accordingly, we believe that HESS is an appropriate journal for our manuscript. We are open to engaging with any further comments regarding our work.
Sincerely yours,
The authors
Citation: https://doi.org/10.5194/hess-2021-325-AC3