Calibration is an essential step for improving
the accuracy of simulations generated using hydrologic models. A key modeling
decision is selecting the performance metric to be optimized. It has been
common to use squared-error performance metrics, or normalized variants such
as the Nash–Sutcliffe efficiency (NSE), based on the idea that their
squared-error nature will emphasize estimates of high flows. However, we
conclude that NSE-based model calibrations actually result in

Computer-based hydrologic, land-surface, and water balance models are used
extensively to generate continuous long-term hydrologic simulations in
support of water resource management, planning, and decision making. Such
models contain many empirical parameters that cannot be estimated directly
from available observations, hence the need for parameter inference by means
of the indirect procedure known as calibration

A key decision in model calibration is the choice of performance metric (also
known as the “objective function”) that measures the goodness of fit
between the model simulation and system observations. The performance metric
can substantially affect the quality of the calibrated model simulations. The
most widely used performance metrics are based on comparisons of simulated
and observed response time series, including the mean squared error (MSE),
Nash–Sutcliffe efficiency (NSE; a normalized version of MSE), and root mean
squared error (RMSE; a transformation of MSE). Many previous studies have
examined different variants of these metrics (e.g., see
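
These time-series metrics have simple closed forms. As a reference sketch using the standard definitions (NumPy-based; function names are ours):

```python
import numpy as np

def mse(obs, sim):
    """Mean squared error between observed and simulated flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.mean((sim - obs) ** 2)

def rmse(obs, sim):
    """Root mean squared error: a monotonic transformation of MSE."""
    return np.sqrt(mse(obs, sim))

def nse(obs, sim):
    """Nash–Sutcliffe efficiency: MSE normalized by the variance of the
    observations; 1 is a perfect fit, 0 matches the observed mean flow."""
    obs = np.asarray(obs, float)
    return 1.0 - mse(obs, sim) / np.mean((obs - obs.mean()) ** 2)
```

Because all three are transformations of the same squared-error quantity, they share the property of weighting large (high-flow) departures most heavily.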

As an alternative to metrics that measure the distance between response time
series, the class of

The use of hydrologic signatures to form metrics for model calibration
requires selection of a full set of appropriate signature properties that are
relevant to all of the aspects of system behavior that are of interest in a
given situation. As discussed by

In general, water resource planners focus on achieving maximum accuracy in
terms of specific hydrologic properties and will therefore select metrics
that target the requirements of their specific application while accepting
(if necessary) reduced model skill in other aspects. For example, in climate
change impact assessment studies, reproduction of monthly or seasonal
streamflow is typically more important than behaviors at finer temporal
resolutions, and so hydrologists typically use monthly rather than daily
error metrics
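
Switching from daily to monthly error metrics amounts to aggregating both series before scoring. A minimal sketch of the aggregation step (function name and label format are illustrative, not tied to any particular dataset):

```python
import numpy as np

def monthly_means(month_labels, flows):
    """Aggregate a daily flow series to monthly means.

    month_labels holds one (year, month) pair per daily value."""
    labels = np.asarray(month_labels)
    flows = np.asarray(flows, float)
    uniq = np.unique(labels, axis=0)       # one row per distinct month
    return np.array([flows[(labels == m).all(axis=1)].mean() for m in uniq])
```

A monthly error metric is then simply the daily metric applied to the two aggregated series, which discards sub-monthly timing errors by construction.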

In this study, we examine how the formulation of the performance metric used
for model calibration affects the overall functioning of system response
behaviors generated by hydrologic models, with a particular focus on
high-flow characteristics. The specific research questions addressed in this
paper are the following.

How do commonly used time-series-based performance metrics perform compared to the use of an application-specific metric?

To what degree does use of an application-specific metric result in reduced model skill in terms of other metrics not directly used for model calibration?

We address these questions by studying the high-flow characteristics and
flood frequency estimates for a diverse range of 492 catchments across the
contiguous United States (CONUS) generated by two models: the mesoscale
Hydrologic Model (mHM;

The remainder of this paper is organized as follows. Section 2 shows how the
use of NSE for model calibration is counter-intuitively problematic when
focusing on high-flow estimation. This part of the study is motivated by our
experience with CONUS-wide annual peak flow estimates and serves to motivate
the need for our large-sample study

One of the
earliest developments of a metric used for model development was by

Spatial distribution of Hydro-Climate Data Network (HCDN) basins;

Our recent large-sample calibration study

In general, it is impossible to improve the simulation of flow variability
(to improve high-flow estimates) without simultaneously affecting the mean
and correlation properties of the simulation. To provide a way to achieve
balanced improvement of simulated mean flow, flow variability, and daily
correlation,
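
The balanced alternative referenced here is commonly written as the Kling–Gupta efficiency (KGE), which decomposes model error into bias, variability, and correlation terms. A minimal sketch of the standard form:

```python
import numpy as np

def kge(obs, sim):
    """Kling–Gupta efficiency in its standard form:
    KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    where r is the linear correlation, alpha the ratio of simulated to
    observed standard deviation, and beta the ratio of the means."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]          # correlation component
    alpha = sim.std() / obs.std()            # variability component
    beta = sim.mean() / obs.mean()           # bias (mean flow) component
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A perfect simulation gives r = alpha = beta = 1 and hence KGE = 1; each component can be inspected separately to see which aspect of the simulation is degrading the score.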

The results of the

We use two hydrologic models: VIC (version 4.1.2h) and
mHM (version 5.8). The VIC model, which includes explicit soil–vegetation–snow processes,
has been used for a wide range of hydrologic applications, and has recently
been evaluated in a large-sample predictability benchmark study

The model parameters calibrated for each model are the same as previously
discussed: VIC

Optimization is performed using the dynamically dimensioned search (DDS,
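
DDS is a greedy, single-solution search that perturbs a randomly chosen subset of parameters, with the subset's expected size shrinking as iterations proceed. A minimal sketch, assuming the commonly cited defaults (perturbation fraction r = 0.2 of each parameter's range, reflection at the bounds); this is illustrative, not the implementation used in this study:

```python
import math
import random

def dds(objective, lb, ub, max_iter=200, r=0.2, seed=0):
    """Dynamically dimensioned search (minimization sketch).

    Greedily perturbs a randomly selected subset of parameters; the
    expected subset size shrinks toward one as iteration i approaches
    max_iter (which must be at least 2). lb/ub are parameter bounds."""
    rng = random.Random(seed)
    n = len(lb)
    best = [rng.uniform(lb[d], ub[d]) for d in range(n)]
    f_best = objective(best)
    for i in range(1, max_iter + 1):
        # inclusion probability decays from ~1 to 0 over the budget
        p = 1.0 - math.log(i) / math.log(max_iter)
        dims = [d for d in range(n) if rng.random() < p]
        if not dims:                        # always perturb at least one
            dims = [rng.randrange(n)]
        cand = best[:]
        for d in dims:
            x = cand[d] + rng.gauss(0.0, 1.0) * r * (ub[d] - lb[d])
            # reflect at the bounds, then clamp any remaining overshoot
            if x < lb[d]:
                x = lb[d] + (lb[d] - x)
            if x > ub[d]:
                x = ub[d] - (x - ub[d])
            cand[d] = min(max(x, lb[d]), ub[d])
        f_cand = objective(cand)
        if f_cand <= f_best:                # greedy acceptance
            best, f_best = cand, f_cand
    return best, f_best
```

The appeal of DDS for model calibration is that it requires no tuning beyond the iteration budget: early iterations explore broadly (many parameters perturbed), late iterations refine locally (few parameters perturbed).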

Cumulative distributions of

The most common choice for the KGE scaling factors in hydrologic model
calibration has been to set all of them to unity. We applied the KGE in different
variants (i.e., with non-unity scaling factors), which to the best of our
knowledge have not been studied so far. Note that this scaling is only used
to define the performance metric used in model calibration; all performance
evaluation results shown in this paper use KGE computed with
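
A non-unity-scaling variant can be sketched by weighting each component of the KGE decomposition before combining them; the factor names below (s_r, s_alpha, s_beta) are ours, and setting all factors to 1 recovers the standard metric:

```python
import numpy as np

def kge_scaled(obs, sim, s_r=1.0, s_alpha=1.0, s_beta=1.0):
    """KGE with scaling factors on the correlation (s_r),
    variability (s_alpha), and bias (s_beta) components:
    1 - sqrt((s_r(r-1))^2 + (s_a(alpha-1))^2 + (s_b(beta-1))^2)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((s_r * (r - 1)) ** 2
                         + (s_alpha * (alpha - 1)) ** 2
                         + (s_beta * (beta - 1)) ** 2)
```

For example, setting s_alpha > 1 makes the calibration penalize variability errors more heavily, which is one way to push the optimizer toward better reproduction of high flows.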

First, we focus on the general overall performance for the daily streamflow
simulations as measured by the performance metrics used. Figures

The same as Fig. 2 except for the mHM.

The use of APFB as a calibration metric yields poorer performance for both
models on all of the individual KGE components (wider distributions for

Next, we focus on the specific performance of the models in terms of
simulation of high flows. As expected, use of the application-specific APFB
metric (Eq.
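
Since the exact APFB equation is referenced but not reproduced here, the sketch below assumes an illustrative formulation: extract the annual maximum flow for each year and measure the relative bias of the simulated peaks against the observed ones.

```python
import numpy as np

def annual_peaks(year_labels, flows):
    """Annual maximum (peak) flow for each year in a daily series.
    year_labels is a sequence of year labels aligned with flows."""
    years = np.asarray(year_labels)
    flows = np.asarray(flows, float)
    return np.array([flows[years == y].max() for y in np.unique(years)])

def apfb(obs_peaks, sim_peaks):
    """Annual peak flow bias, assuming the illustrative formulation
    |mean(sim_peak / obs_peak) - 1|; 0 indicates unbiased peaks."""
    obs_peaks = np.asarray(obs_peaks, float)
    sim_peaks = np.asarray(sim_peaks, float)
    return abs(np.mean(sim_peaks / obs_peaks) - 1.0)
```

Unlike NSE or KGE, a metric of this form scores only the annual maxima, so a calibration driven by it is free to trade away skill on low and medium flows.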

Boxplots of percent bias of APFB (

Figure

Boxplots of percent bias of flood estimates corresponding to three return periods (5-, 10-, and 20-year) over the 492 HCDN basins for the two models. Boxplot representation is the same as in Fig. 4.

Annual peak flow estimates are generally used directly in the flood frequency
analysis. Figure

Scatterplots between

Figure

While both models show fairly similar trends in skill for each performance
metric, it is clear from our large-sample study of 492 basins that the
absolute performance of VIC is inferior to that of mHM, irrespective of
choice of evaluation metric. A full investigation of why VIC does not perform
at the same level as mHM is clearly of interest but is left for future work.
To improve the performance of VIC it may be necessary to perform rigorous
sensitivity tests similar to comprehensive sensitivity studies that have
included investigation of hard-coded parameters in other more complex models
(e.g.,

Although the annual peak flow estimates improve by switching calibration
metrics from NSE to KGE and KGE to APFB, the flood magnitudes are
underestimated at all of the return periods examined no matter which
performance metric is used for calibration. While APFB calibration improves,
on average, the error in annual peak flows over the 20-year period, flood
magnitude estimates at a given percentile or exceedance probability level
depend on the full distribution of the estimated peak flow series, not just
its mean. Therefore, improving only the bias does not guarantee accuracy of
the flood magnitudes at a given return period. Following

Residual distributions conditioned on the non-exceedance probability of the daily flows over the 492 study basins. Analyses are presented for the three calibration performance metrics. Daily residuals are computed based on the observed and simulated flows during the evaluation period.

The calibrated models do improve the flow metrics, including both time-series
metrics (mean, variability, etc.) and application-specific metrics, depending
on the performance metric used for calibration. However,
residuals always remain after the model calibration because the model never
reproduces the observations perfectly. Recently,

To gain more insight into this topic, we examine how stochastically generated
residuals, once re-introduced to the simulated flows, can affect the
performance metrics. We consider three performance metrics for this analysis:
NSE, KGE, and APFB. Figure

Distribution of the two error metrics (

The same as Fig. 8 except for APFB

The quality of the original deterministic flow simulated by the hydrologic models has little effect on the performance metrics based on the ensemble of residual-augmented flows. Since the stochastically generated ensembles do not account for temporal correlation, every ensemble member has reduced correlation and deteriorated NSE and KGE metrics. However, the error metric related to the flow duration curve (APFB) is not affected by the lack of correlation because metrics based on the flow duration curve (FDC) do not preserve information regarding the temporal sequence. Although residual-augmented flow time series enhance some of the flow metrics, the (temporal) dynamical pattern is not reproduced. These observations point toward the need for careful interpretation of improvements in model skill, especially when different error metrics are considered.
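
A toy demonstration of this point with fully synthetic data: randomly shuffling the residuals before re-introducing them exactly preserves permutation-invariant quantities such as the mean (the bias component) but destroys temporal correlation, so correlation-sensitive metrics such as NSE deteriorate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "observed" flow and an imperfect deterministic simulation.
t = np.arange(3650)
obs = 10 + 5 * np.sin(2 * np.pi * t / 365) + rng.gamma(2.0, 1.0, t.size)
sim = 0.8 * obs + 1.0                     # damped, biased simulation
resid = obs - sim                         # deterministic-model residuals

def nse(o, s):
    return 1 - np.sum((s - o) ** 2) / np.sum((o - o.mean()) ** 2)

# Adding the residuals back in order reconstructs the observations...
nse_in_order = nse(obs, sim + resid)      # essentially 1 by construction

# ...but re-introducing them WITHOUT temporal correlation (a shuffle)
# preserves the mean exactly while degrading correlation-based skill.
augmented = sim + rng.permutation(resid)
nse_shuffled = nse(obs, augmented)
```

Here nse_shuffled falls below nse_in_order even though the mean of the augmented series matches the observed mean to machine precision, mirroring the contrast described above between NSE/KGE and distribution-based metrics.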

A key issue is the extent to which high flows are represented in the deterministic and stochastic components. While it is possible to generate ensembles through stochastic simulation of the model residuals (as is done here), and these stochastic simulations improve high-flow error metrics, we will naturally have more confidence in the model simulations if the high flows are well represented in the deterministic model simulations. The use of squared error metrics simply means that a larger part of the high-flow signal must be reconstructed via stochastic simulation.

The use of large-sample catchment calibrations of two different hydrologic models with several performance metrics enables us to make robust inferences regarding the effects of the calibration metric on the ability to infer high-flow events. Here, we have focused on improving the representation of annual peak flow estimates, as they are important for flood frequency magnitude estimation. We draw the following conclusions from the analysis presented in this paper.

The choice of error metric for model calibration impacts high-flow estimates very similarly for both models, although mHM provides overall better performance than VIC in terms of all metrics evaluated.

Calibration with KGE improves performance as assessed by high-flow metrics by improving time-dependent metrics (e.g., variability error score). Adjustment of the scaling factors related to the different KGE components (bias, variability, and correlation terms) can further assist the model simulations in matching certain aspects of flow characteristics. The degree of improvement is, however, model dependent.

Application-specific metrics can improve estimation of specifically targeted aspects of the system response (here annual peak flows) if used to direct model calibration. However, the use of an application-specific metric does not guarantee acceptable performance with regard to other metrics, even those closely related to the application-specific metric.

Given that

Model calibration was performed using MPR-flex
available at

Authors from NCAR (NM, MPC, AJN, and AWW) and authors from UFZ (OR and RK) initiated model experiment designs separately, and both groups agreed to merge the results. NM, OR, and RK performed the model simulations and designed the figures and the structure of the paper. HVG provided insights into the model calibration results. All the authors discussed the results and wrote and reviewed the manuscript.

The authors declare that they have no conflict of interest.

We thank two anonymous referees for their constructive comments and John Ding for his short comment on NSE. The comments helped improve the manuscript, in particular discussion regarding the consideration of deterministic model residuals for error metric estimates. We also thank Ethan Gutmann and Manabendra Saharia (NCAR) for the earlier discussions on the topic.

This paper was edited by Dimitri Solomatine and reviewed by two anonymous referees.