Multi-objective calibration by combination of stochastic and gradient-like parameter generation rules – the caRamel algorithm

. Environmental modelling is complex, and models often require the calibration of several parameters that are not able to be directly evaluated from a physical quantity or ﬁeld measurement. Multi-objective calibration has many advantages such as adding constraints in a poorly constrained problem or ﬁnding a compromise between different objectives by deﬁning a set of optimal parameters. The caRamel optimizer has been developed to meet the requirement for an automatic calibration procedure that delivers not just one but a family of parameter sets that are optimal with regard to a multi-objective target. The idea behind caRamel is to rely on stochastic rules while also allowing more “local” mechanisms, such as the extrapolation along vectors in the parameter space. The caRamel algorithm is a hybrid of the multi-objective evolutionary annealing simplex (MEAS) method and the non-dominated sorting genetic algorithm II ( ε -NSGA-II). It was initially developed for calibrating hydrological models but can be used for any environmental model. The caRamel algorithm is well adapted to complex modelling. The comparison with other optimizers in hydrological case studies (i.


Introduction
Environmental modelling is complex, and models often require the calibration of many parameters that cannot be directly estimated from a physical quantity or a field measurement. Moreover, as models' outputs exhibit errors whose sta-tistical structure may be difficult to characterize precisely, it is frequently necessary to use various objectives to evaluate the modelling performance. In other words, it is often difficult to find a rigorous likelihood function or sufficient statistics to be maximized/minimized (Fisher, 1922); for example, it is well known that errors in a simulated discharge time series are not normally distributed, and do not have constant variance or autocorrelation (Sorooshian and Dracup, 1980). In addition, Efstratiadis and Koutsoyiannis (2010) list other advantages of multi-objective calibration such as ensuring parsimony between the number of objectives and the parameters to optimize, fitting distributed responses of models on multiple measurements, recognizing the uncertainties and structural errors related to the model configuration and the parameter estimation procedure, and handling objectives that have contradictory performance.
Multi-objective calibration allows for a compromise between these different objectives to be found by defining a set of optimal parameters. Practical experience shows that single-objective calibrations are efficient for highlighting a certain property of a system, but this might lead to increasing errors in some other characteristics (Mostafaie et al., 2018). Evolutionary algorithms have been widely used to explore the Pareto-optimal front in multi-objective optimization problems that are too complex to be solved by descent methods with classical aggregation approaches. Evolutionary algorithms are advantageous not only because there are few alternatives for searching substantially large spaces for multiple Pareto-optimal solutions but also due to their inherent parallelism and capability to exploit similarities of solutions by recombination that enables them to approximate the Published by Copernicus Publications on behalf of the European Geosciences Union. 3190 C. Monteil et al.: Multi-objective calibration -the caRamel algorithm Pareto-optimal front in a single optimization run (Zitzler et al., 2000).
Many studies have used the multi-objective approach in environmental modelling (Oraei Zare et al., 2012;Ercan and Goodall, 2016) or in land use models (Gong et al., 2015;Newland et al., 2018). In hydrology, Madsen (2003) implemented automatic multi-objective calibration of the MIKE SHE model (Refsgaard and Storm , 1995) on the Danish Karup catchment (440 km 2 ) using the shuffled complex evolution (SCE) algorithm (Duan et al., 1992). Yang et al. (2014) ran a multi-objective optimization of the MO-BIDIC (Campo et al., 2006) distributed hydrologic model on the Davidson catchment (North Carolina, 105 km 2 ) using the non-dominated sorting genetic algorithm II (NSGA-II, Deb et al., 2002). More recently Smith et al. (2019) led a multi-objective ensemble approach to hydrological modelling in the UK over 303 catchments for historic drought reconstruction with the GR4J conceptual model (Coron et al., 2017) using Latin hypercube sampling (McKay et al., 1979) and a Pareto-optimizing ranking approach accounting for unacceptable trade-offs (Efstratiadis and Koutsoyiannis, 2010). Mostafaie et al. (2018) compared five different calibration techniques on the GR4J lumped hydrological model using in situ runoff and daily data from the Gravity Recovery And Climate Experiment (GRACE, Tapley et al., 2004). They came to the following conclusions: according to the diversity-based metrics, the NSGA-II method is the best approach; according to the accuracy metric, multi-objective particle swarm optimization (MPSO, Reddy and Nagesh Kumar , 2007) is ranked first; and, considering the cardinality measure, the performance of all algorithms is found to be the same.
The caRamel optimizer has been developed to meet the need for an automatic calibration procedure that delivers not only one but a family of parameter sets that are optimal with regard to a multi-objective target (Le Moine, 2009). Madsen (2003) indicated that global population-evolution-based algorithms are more effective than multi-start local search procedures, which, in turn, perform better than purely local search methods. However, most of the multi-objective algorithms rely mainly on stochastic generation rules, with few deterministic aspects, as is the case in the widely used NSGA-II for instance. The idea behind caRamel is not just to keep these stochastic "global" mechanisms (such as recombination or multivariate sampling using the covariance) but also to allow more "local" mechanisms, such as extrapolation along vectors in the parameter space, that are associated with an improvement in all objective functions (a "gradient-like" qualitative approach extended to the set of objective functions).
The caRamel algorithm was initially developed and used for the calibration of hydrological models by studies such as Rothfuss et al. (2012), Magand et al. (2014), Le Moine et al. (2015), Monteil et al. (2015), which were all previous to the R package release, or Rouhier et al. (2017), which utilized an R version for the calibration of a hydrologic model over the Loire Basin (35 707 km 2 ). The interesting performance of the caRamel algorithm in such studies prompted us to describe the algorithm in detail in the present paper. Considering the increasing use of R in hydrology (Slater et al., 2019), we decided to build an R package, caRamel, for use in any model in the R environment. The user simply has to define a vectorvalued function (at least two objectives) for the model to calibrate as well as upper and lower bounds for the calibrated parameters.
This paper aims to describe the principles of the caRamel algorithm via an analysis of its results when used for the parametrization of hydrological models. Pieces of code are provided in the Appendix. For an analytical example and for three river case studies, a comparison with the two calibration algorithms that inspired caRamel, the non-dominated sorting genetic algorithm II (NSGA-II; Reed and Devireddy, 2004) and the multi-objective evolutionary annealing simplex method (MEAS; Efstratiadis and Koutsoyiannis, 2008), is also presented.

Context and notations
The intent of multi-objective calibration is to find sets of parameters that provide a compromise between several potentially conflicting objectives; for instance, how to achieve a good simulation of both flood and low-flow conditions in a hydrological model. Multi-objective calibration is also a means of adding some constraints to an under-constrained problem when many parameters have to be quantified. This can help to reduce the equifinality of parameter sets. Her and Seong (2018) showed that the introduction of an adequate number of objective functions could improve the quality of a calibration without requiring additional observations. The amount of equifinality and the overall output uncertainty decreased while the model performance was maintained as the number of objective functions increased sequentially until four objective functions.
To introduce our notation, Fig. 1 shows a simplified calibration problem in which there is a model with n θ = 2 parameters to calibrate (θ 1 and θ 2 ). Thus, the model structure is unequivocally represented by the vector θ = (θ 1 , θ 2 ) in a n θ = 2-dimensional space, called "parameter space E θ ".
a vector y of n y observed values that should be simulated by the model. For example, for daily times series of 1 year at two gauging stations, n y = 2 × 365 = 730. The simulation is represented by a vectorŷ(θ ) in a n y -dimensional space (that cannot be illustrated graphically), called "observable space E y ".
a vector of n f objective values f (θ , y). For the example in Fig. 1, f = (f 1 , f 2 ) in a space with n f dimensions, called "objective space E f ". We will use the following notations: vectors or matrices are presented using bold italic and bold roman font respectively (θ , y, f , . . .), vector elements and scalars are presented using roman font (θ 1 , θ 2 , λ, . . .), and spaces or ensembles are presented using italic font (E θ , F, A, . . .). Figure 1 also illustrates the relevance of multi-objective calibration with regard to two kinds of equifinality: 1. equifinality of a structure -the two points θ and θ that are quite distant in the parameter space E θ become quite near in the observable space E y .
2. equifinality related to the objective -the vectors θ and θ are equifinal regarding f 1 , and the additional objective f 2 helps to discriminate them. The use of additional objectives may then help to better constrain the calibration.
The purpose of a multi-objective algorithm is to approach the Pareto front, F, of non-dominated solution in the objective space using an ensemble of points called the approximated Pareto frontF. We call "archiveÂ" the ensemble of parameter sets from E θ for which simulation outputs are in F.

The caRamel algorithm description
The caRamel algorithm belongs to the genetic algorithm family. The idea is to start from an ensemble of parameter sets (called a "population") and to make this population evolve following certain generation rules (Fig. 2). At each generation, new sets are evaluated regarding the objectives, and only the more "suitable" sets are kept to build the new population. The caRamel algorithm is largely inspired by 1. the multi-objective evolutionary annealing simplex method (MEAS; Efstratiadis and Koutsoyiannis, 2005;Efstratiadis and Koutsoyiannis, 2008), with respect to the directional search method, based on the simplexes of the objective space, and 2. the non-dominated sorting genetic algorithm II (ε-NSGA-II; Reed and Devireddy, 2004), for the classifi- This section describes the functioning of the caRamel algorithm; this algorithm has been implemented in an R package, caRamel, that is described in Appendix A.

Generation rules
The caRamel algorithm has five rules for producing new solutions at each generation: (1) interpolation, (2) extrapolation, (3) independent sampling with a priori parameter variance, (4) sampling with respect to a correlation structure and (5) recombination.
The first two rules (interpolation and extrapolation) are based on a n θ -dimensional Delaunay triangulation in the objective space E f . They assume that two neighbouring points in the objective space E f have two adjacent points in the parameter space E θ as antecedents; therefore, one can try to "guess" the directions of improvement in the parameter space from the improvement directions (in a Pareto sense) in the objective space, at least near the optimal zone.
The following two rules create new parameter sets by exploring the parameter space in a nondirectional and less local way -either by independent variations in each parameter or by multivariate sampling using the covariance structure of all parameter sets located near the estimated Pareto front at the current iteration.
Finally, the recombination rule consists of creating new parameter sets using two partial subsets derived from a pair of previously evaluated parameter sets (inspired by Baluja and Caruana, 1995).

Rule 1: interpolation
For rules 1 and 2, we use the notion of simplex which is a generalization of the notion of a triangle to higher dimensions: a 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle and a 3-simplex is a tetrahedron. A vertex is a point where two or more edges meet. The explanation of the first rule is based on Fig. 3a. First, a triangulation of the points in the objective space E f is established: simplexes built with these points f (θ i ) are a partition of the explored zone in this space (Efstratiadis and Koutsoyiannis, 2005).
Let us consider a simplex with at least one vertex on the approximated Pareto front. This simplex is the result of the function f from an ensemble of (n f + 1) points from the n θ -dimensional parameter space E θ . Under the hypothesis of continuity of f , a linear combination of the formθ = w 1 θ 1 + . . . + w (n f +1) θ (n f +1) , with the barycentric coordinates w i ≥ 0 and i w i = 1, might give a new Pareto-optimal solution f (θ ) inside this zone.
First the triangulation is established, then simplex volumes are computed. The probability of generating one new point with a simplex is proportional to its volume when it has at least one point on the Pareto front (otherwise it is zero). If the simplex is selected, then a set of barycentric coordinates are computed by randomly generating (n f + 1) values ε i in a uniform distribution on [0,1]:

Rule 2: extrapolation
Extrapolation is based on the same hypothesis of continuity as interpolation. In this case, it is tested to find if an improvement may be obtained by extrapolating from certain directions. These directions are computed from the triangulation by selecting the edges that have only one vertex on to the approximated Pareto front (the second vertex is dominated by the first). These oriented edges computed from the objective space represent directions of improvement in the parameter space (Fig. 3b).
The length L = f (θ 1 ) − f (θ 2 ) of each selected edge and the mean length L are computed. The probability of using an edge is proportional to its length L. In this case, the research vector in the parameter space is defined in Eq. (2), and a new parameter set is generated byθ = θ 1 + λU , where λ is a scalar from an exponential distribution with average of 1.

Rule 3: independent sampling with a priori parameter variance
The drawback of the first two rules is that the generation of new vectors is only based on a small number of existing vectors. To compensate for this search by gradient and to avoid convergence toward a local optimum, the third generation rule has two goals: to make the parameters vary within a larger range than with local rules, and to make the parameters vary independently of one another.
When considering a vector θ from the archiveÂ, the third rule is to generate n θ new vectors (θ k with k from 1 to n θ ) by making each element of θ (Eq. (3) vary individually, where σ 2 i is the a priori variance of the ith parameter and ε i is a value from a normal distribution (with an average of 0 and a variance of 1). The a priori variance is computed for each parameter from the bounds of variation indicated as the input of the optimizer.
The algorithm selects the n θ vectors that maximize each element of the objective vector individually as well as an additional vector that represents a "central" point of the Pareto front. To select this vector, the minimum of each vector θ ∈Â is computed, and the vector that maximizes this value is chosen.
One generation of this rule then produces (n f +1)×n θ new vectors. For this reason, this rule is applied every K generation, with K to be defined by the user. By default, K is computed so that each rule generates the same number of vectors on average.

Rule 4: sampling with respect to a correlation structure
The variance-covariance matrix is computed using Eq. (4), where E[X] is the expectancy of a random variable X, θ is a vector from the archive A, µ = E θ∈A [θ] is the barycentre of A and M T is the transpose of the matrix M.
This matrix reflects the correlation structure between the parameter sets. For instance, in the case of a hydrological model, parameters are frequently not independent of each other. This rule intends to obtain an estimateˆ of and µ of µ in order to generate new parameter vectors that respect this correlation structure and, therefore, limit the risk of generating "non-functional" parameter sets.
There are many possibilities in selecting the vector for evaluating the covariance matrix: 1. Vectors may be selected from a library of "historical" vectors for the calibrated model. The drawback is that this library has to be previously established, and it does not take the progression of the running calibration into account.
2. Vectors may be selected from the archiveÂ that provides points on the approximated Pareto front at the running generation. The new vectors frequently improve the front, but, as the variance is low, they do not avoid convergence toward a local optimum.
3. All vectors of the running population may be selected. This helps to maintain diversity, but it has a high com- putational cost as few new vectors will make the front progress.
Finally, the algorithm uses a mix between items 2 and 3: all simplexes from the first rule triangulation that have at least one vertex in the approximated Pareto front are selected. Reference vectors for the computation of the variancecovariance matrix are defined by the ensemble G from the objective space whose images by f are all the vertices of these simplexes. The estimatesˆ andμ are computed in Eqs. (5)-(6): This operation increases the number of selected points for the averages computation significantly. However, there is still the risk is of having a variance that is too low. To reduce this risk, the variance of all of the parameters is increased by the same factor (empirically doubled):ˆ = 2ˆ .
The new vectors are obtained from a classical procedure for multivariate generation: 1. computation of the upper triangular matrix T with T T T =ˆ , by Cholesky decomposition; 2. generation of vectorsθ =μ+T T ·ε, where ε is a vector with n θ independent and normally distributed components with an average of 0 and a variance of 1.
This fourth rule enables us to randomly explore some area of space E θ while implicitly reducing its dimension via the correlations between parameters. This reduces the number of evaluations of the objective function that are needed .

Rule 5: recombination
With respect to rule 4, recombination considers that the parameters from a model are not independent. In a hydrological model, they can frequently be grouped in functional blocks (for instance, rapid runoff, base flow, snow dynamics, transfer and so on). A new parameter vector is simply generated by combining blocks of parameters from vectors of the archiveÂ. The parameter blocks are specific to the calibrated model and are defined by the user.

Population downsizing
At the end of each generation, the population is kept under a maximum size (N max sets). This limitation is set for memory reasons (no need to keep poor parameter sets) and to reduce computational time, as the triangulation computation is carried out at each generation.
The population downsizing is adapted from ε-NSGA-II (Reed and Devireddy, 2004) and is performed in three steps (Fig. 4): 1. Pareto ranking -the parameter vectors are sorted according to the ranking order of the Pareto level to which they belong. Points from level 1 are non-dominated, points from level 2 are dominated only by points from level 1 and so on. with the precision δ i for each of the n f objective values.
All of the points in the same hypercube are considered to be equifinal with regard to accuracy, and only one point is kept. The selected point is the one that belongs to the lowest Pareto level. When many points are on the lowest level, the selected point is taken at random from among them.
3. Keeping the population size under N max -if the number of sets is still above N max , only the N max sets of the lower level are kept.

Optimization evaluation framework
The aim is to assess the performance of the caRamel algorithm against two other optimizers using various case studies. Two optimizers have been selected for the comparison: NSGA-II (Deb et al., 2002) and MEAS (Efstratiadis and Koutsoyiannis, 2008). The comparison focuses on different aspects: optimization evolution evaluated by specific metrics and optimization results in the objective space, parameters space and observable space. This section presents the optimizer configuration, the evaluation metrics and the four case studies.

Optimizer configurations
The caRamel algorithm is used in its general form, with a generation of five new parameters sets for each rule by iteration, involving an average of 25 parameter sets by generation. NSGA-II (Deb et al., 2002) is called by using the nsga2 function from the mco "Multiple Criteria Optimization Algorithms and Related Functions" (Mersmann et al., 2014) R package. The arguments are the function to minimize, the input and output dimensions, the parameter bounds, the number of generations, the size of the population, and the values for the crossover, mutation probability and distribution index. Some previous calibration experiments have been conducted to determine the best parameter configurations. NSGA-II has been used with a crossover probability set to 0.5 and mutation probability set to 0.3.
The MEAS algorithm (Efstratiadis and Koutsoyiannis, 2005) combines a performance evaluation procedure based on a Pareto approach and the concept of feasibility, an evolving pattern based on the downhill simplex method, and a simulated annealing strategy, to control randomness during evolution. The algorithm evolution is sensitive to the value of the mutation probability which has been adapted to each case study according to its complexity (5 % for Kursawe and 50 % for the other case studies).
For each optimizer, the end of one optimization is set to a maximum number of model evaluations depending on the case studies. As the algorithms use random functions, 40 optimizations of each test case have been run for each optimizer to obtain representative results. In order to focus on the evolution of the optimization, the initial population is the same for each optimizer (40 initial populations for each case study).
We chose to run an important number of model evaluations and optimizations to get representative results and assess the reproducibility of the optimization. Other benchmark methodology would be conceivable, such as that presented by Tsoukalas et al. (2016) where several test functions and two water resources applications are implemented to compare the surrogate-enhanced evolutionary annealing simplex (SEEAS) algorithm to four other mono-objective optimization algorithms. In this study, two alternative computational budgets (indicated by the maximal number of model evaluations) are considered that impact the parameters of the optimizers.

Optimization metrics
To evaluate the optimizer performance, we chose metrics from the literature. Evaluating optimization techniques experimentally always involves the notion of performance. In the case of multi-objective optimization, the definition of quality is substantially more complex than for singleobjective optimization problems, because the optimization goal itself consists of multiple objectives (Zitzler et al., 2000). Riquelme et al. (2015) categorize the metrics to evaluate three main aspects: accuracy, which is the closeness of the solutions to the theoretical Pareto front (if known) or relative closeness; diversity, which can be described by the spread of the set (range of values covered by the solutions) and the distribution (relative distance among solutions in the set); cardinality, which qualifies the number of Paretooptimal solutions in the set.
To quantify these aspects, we selected three different metrics that are evaluated in the objective space: 1. hypervolume (HV), which is a volume-based index that takes accuracy, diversity and cardinality into account (Zitzler and Thiele, 1999) and computes the volume between the vectors of the estimated Pareto frontF and a reference point; 2. generational distance (GD), which is a distance-based accuracy performance index (Van Veldhuizen, 1999, Eq. 7) that is expressed as where n is the number of vectors in the approximated Pareto frontF, and d i is the Euclidean distance between each vector and the nearest member of the reference front; 3. generalized spread (GS), which evaluates the diversity of the set (Zhou et al., 2006;Jiang et al., 2014).
The evaluation of the GS and GD metrics requires us to establish a reference front. For each case study, this reference front is built by evaluating the Pareto front on all of the final optimization results of all optimizers.

Case studies
Four case studies have been designed to have an increasing complexity: case study 1 is an analytical example with a Kursawe test function (Kursawe, 1991); case study (2) is on a pluvial catchment with a GR4J open-source hydrological model (Coron et al., 2017; case study (3) is on a pluvial catchment with a MORDOR-TS semi-distributed model (Rouhier et al., 2017); and case study (4) is on a snowy catchment, also with a MORDOR-TS model.

The Kursawe test function
The objective of a test function is to evaluate some characteristics of optimization algorithms. The final Pareto front has a specific shape (non-convex, asymmetric and discontinuous) with an isolated point that the optimizer has to accurately reproduce. The Kursawe function is a benchmark test for many researchers (Lim et al., 2015). It has three parameters (x 1 , x 2 and x 3 ) and two objectives (Obj 1 and Obj 2 ) to minimize (Kursawe, 1991, Eq. 8).
The optimizations are run on 50 000 model evaluations. The R script to run the Kursawe function optimization with caRamel is available in Appendix B or as a vignette in the caRamel package.

Calibration of the GR4J model on a pluvial catchment
The GR4J hydrological model is a widely used global rainfall-runoff model (Perrin et al., 2003) that has been implemented in an open-source R package airGR (Coron et al., 2017(Coron et al., , 2019. This package contains a data sample from a catchment called "Blue River at Nourlangie Rock" (360 km 2 , code L0123001), which has a pluvial regime (Fig. 5a). The advantage of using this case study is in having an opensource script with open data. GR4J has four parameters to calibrate: the production store capacity X1, the inter-catchment exchange coefficient X2, the routing store capacity X3 and the unit hydrograph time constant X4.
The calibration is done on the daily time series for the period from 1990 to 1999. The Kling-Gupta efficiency (KGE, Gupta et al., 2009) is frequently used in hydrology. The KGE can be split into three components that reflect the correlation between the simulated and observed values (KGE r ), the bias in standard deviation (KGE α ) and the bias in volume (KGE β ). The calibration is carried out on these three components (Eq. 9).
where r is the linear correlation coefficient between simulated and observed time series, σ s and σ o represent their standard deviations, and µ s and µ o represent their mean values.
For each component, the optimal value is 1 and the optimization consists of a maximization. At the end of the optimization only the sets with KGE β > 0 are considered, as a KGE β with a negative value indicates poor quality for hydrological results. This leads us to exclude a few sets for calibration with NSGA-II and caRamel but not for calibration with MEAS.
The R script to run an optimization of the GR4J model with caRamel is available in Appendix C.

Calibration of the MORDOR-TS model on two contrasting catchments
The spatially distributed MORDOR-TS rainfall-runoff model (Rouhier et al., 2017) is a spatialized version of the conceptual MORDOR-SD model  that has been widely used for operational applications at Électricité de France (EDF; the French electric utility company). The catchment is divided into elementary subcatchments connected according to the hydrographic network which constitutes a hydrological mesh. This model was implemented at a daily time step for two French catchments with contrasting climates. The Tarn catchment at Millau (Fig. 6a) covers an area of 2335 km 2 and has a moderate altitude ranging from 350 to 1600 m. The  regime is pluvial, with almost no influence from snow. The Durance at La Clapière catchment (2170 km 2 , Fig. 6b) is located in the French Alps and has elevations ranging from 800 to about 4000 m. Its hydrological regime is strongly influenced by snow, with a maximum during the melting season in June (Fig. 5c).
The hydrological meshes have been built with an average cell area of 100 km 2 , meaning that 28 cells are needed for the Tarn catchment and 22 cells for the Durance catchment.
MORDOR-TS has 22 free parameters in its comprehensive formulation. For the Tarn case study, a simplified formu-lation is adopted with 12 free parameters to calibrate in order to describe the functioning of conceptual reservoirs, evapotranspiration correction and wave celerity (Table 1). For the Durance catchment, parametrization of the snow module of MORDOR-TS is more complex, and 16 parameters are to be calibrated for the hydrological model. The parameter distribution is uniform for the two case studies, which means that the same set of parameters applies to all cells. Calibration is conducted over 10 years (1 January 1991-31 December 2000) based on three objectives that have to be maximized. For the Tarn catchment, the calibration is based on the Nash-Sutcliffe efficiencies (NSE; Nash and Sutcliffe, 1970) at three gauging stations: the catchment outlet (Tarn at Millau) and two interior points (Tarn at Montbrun and Dourbie at Gardiès). For the Durance catchment, the Kling-Gupta efficiency (KGE; Gupta et al., 2009) is computed at three gauging stations: the catchment outlet (Durance at La Clapière) and two interior points (Durance at Val-des-Prés and Guil at Mont-Dauphin). The theoretical optimum is the point (1, 1, 1) in the objective space.

Results of calibration evaluations
Four aspects are considered with respect to the results of the case studies: the shape of the final Pareto fronts, the dynamics of the optimizations, the distribution of the calibrated parameters and the consequences of the latter on simulated discharges for the hydrological case studies. To illustrate the results on the simulated discharges, a "best compromise" set has been selected regarding the distance to point (1,1,1) in the objective space for each hydrological case study.

Final Pareto front
First of all, it is important to accurately reproduce the disconnected Pareto front for the Kursawe test function, and this is the case for all of the optimizers (Fig. 7) with no noticeable differences between the solutions. This confirms the effectiveness of three different algorithms on a low-dimension research benchmark for the multi-objective optimization.
Concerning the three hydrological case studies, the solutions of the Pareto fronts look quite similar for caRamel and NSGA-II and more narrow with MEAS (Fig. 8). The num- ber of sets for the Pareto front changes depending on the case, and there is no rank for the optimizers. For the Blue River study, there are 1172 sets with caRamel, 878 sets with NSGA-II and 268 points with MEAS; there are 1457, 789 and 1882 sets for the Tarn study, and 708, 408 and 525 sets for the Durance study with caRamel, NSGA-II and MEAS respectively. The differences between Pareto fronts are not a priori in favour of a single MEAS-based algorithm. They are given for a limited number of cases which are not necessarily representative of a general behaviour. Figure 9 summarizes the dynamics of the optimizations for the four case studies.

Dynamics of the optimizations
The caRamel algorithm converges more quickly for accuracy (the HV and GD metrics in most cases). The caRamel's dynamics is closer to NSGA-II's dynamics than to MEAS's dynamics, as they have almost the same final values for the Figure 8. Pareto fronts over 40 optimizations with the caRamel, NSGA-II and MEAS optimizers for each hydrological case study: Blue River with GR4J (a-c), Tarn with MORDOR-TS (d-f) and Durance with MORDOR-TS (g-i). The red point represents a "best compromise" set that is used to illustrate model results. three metrics. This confirms the distinctive behaviour between the two classes of algorithms.
With respect to the diversity criteria, GS dynamics is different for the Kursawe test case than for the hydrological case studies. For the Kursawe test case, the optimal final front has a spread, so all optimizers give the same results. For the hydrological cases, the optimal solution is a point (1,1,1); thus, the Pareto front may get smaller with the optimization. NSGA-II and caRamel look alike, as they generate more diversity than MEAS (GS final values). On average, caRamel gives better values than NSGA-II for the three real cases.
Finally, the envelopes over 40 optimizations are comparable for the three optimizers, which means that reproducibility is always obtained but with different regularities depending on the case or the optimizer without any notable feature. In some cases, a smoother statistical GS convergence would have implied more optimizations. Figure 10 displays the distribution of parameters from the three case studies.

Parameter distribution
In the parameter space, the optimizers provide very similar results that explore the equifinality of the model, meaning that different parameter sets show similar performance (Fig. 10). Some parameters (such as kr or lkn) may have optimized values on the whole range defined by the bounds, whereas other parameters are better constrained (X4 and cel). These constitute a family of sets that are optimal with regard to the chosen objectives.
The difference in the diversity of the final sets is also visible in the parameter distributions. Distributions are quite similar for caRamel and NSGA-II but are much narrower for MEAS. This once again confirms the different behaviour of MEAS, with weaker general performance for the cases studied here.

Impact on model results
The consequences with respect to the simulated discharges are displayed in Fig. 11. The envelopes with NSGA-II and caRamel are quite similar, whereas the envelope with MEAS is narrower, as expected. This confirms that caRamel and NSGA-II generate more diversity on their Pareto front. The red line represents the simulated discharges with the best compromise set and fits quite well with the observed discharges. Multi-objective calibration allows for a range of variation of the calibrated discharges around the best compromise simulation. Figure 11c represents a flood event on the Tarn River at Millau. The observed discharge points are in the envelope of Figure 10. Calibrated parameter distributions for the sets on the Pareto front (y limits are the calibration bounds, except for X1 to X4) with caRamel, NSGA-II and MEAS for the three hydrological case studies: Blue River (first block of four parameters), Durance River (second block of 16 parameters) and Tarn River (third block of 10 parameters). Parameter values from the "best compromise" set are displayed in red.
simulation. The best compromise simulation does not accurately reproduce the flood peak. The figure also displays the simulated discharges obtained by optimizing parameters on the three gauging stations separately, and the simulation with the set that optimizes the NSE at Millau fits better with the observed points. Figure 11. Observed and simulated discharges for the three case studies. "Observations" refer to the observed discharges, "Best compromise" refers to the best compromise simulated discharges and "Envelope" refers to the simulated discharges envelope using all parameter sets on the Pareto front (over 40 optimizations) with caRamel, NSGA-II and MEAS. (a) Daily runoff regime of Blue River at Nourlangie Rock (1990Rock ( -1999; (b) daily runoff regime of Durance at La Clapière (1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000); (c) flood event for the Tarn River at Millau (14 April 1993-3 June 1993).

Conclusions
The caRamel function is an optimization algorithm for multiobjective calibration, and it results in a family of parameter sets that are Pareto-optimal with regard to the different objectives. The algorithm is a hybrid of the MEAS algorithm (Efstratiadis and Koutsoyiannis, 2005), using the directional search method based on the simplexes of the objective space, and the ε-NSGA-II algorithm, using the archiving management of the parameter vectors classified by ε dominance (Reed and Devireddy, 2004). The combination of stochastic and gradient-like parameter generation rules helps the con-vergence of optimization while preserving the diversity of the population in both the objective and parameter spaces. Four case studies with increasing complexity have been used to compare caRamel with NSGA-II and MEAS. The results are quite similar between optimizers and show that optimization converges more quickly with caRamel.
An optimization algorithm might be delicate to use because of the choice of input arguments, which are specific to the algorithm and might require some "expert knowledge". The sensitivity to caRamel internal parameters is not presented in this paper, but we have carried out some sensitivity analyses using the Morris method (Morris, 1991) in order to recommend some default values to the user. First, it is recommended that users assign the same weight to each generation rule by indicating the same number of parameter sets to generate. It is advantageous to produce a small number of sets by generation in order to reduce the number of model evaluations and obtain more rapid convergence. By default, five sets are generated for each rule. The size of the initial population should be large enough to have enough variability (at least 50 sets for a complex model). Moreover, as convergence can be sensitive to the randomly chosen initial population, it is recommended that the user run two or three optimizations to assess reproducibility.
Multi-objective optimization may require thousands of evaluations, which can be a limitation for the calibration of time-consuming models. To cope with this issue, parallel computation is implemented in the caRamel R package.
Better consideration of equality or inequality constraints, such as the relationship between two parameters, could be an improvement. Another perspective would be the ability of caRamel to deal with discrete parameters. Character, length = 1 The function (defined by the user) applied on each node of the cluster for initialization for parallel computation (for example, load of packages or copy of data). Arguments must be cl and numcores.
graph Logical, length = 1 Plot graphical output at each generation (TRUE by default)