We demonstrate both analytically and with a modelling example that cross-validation of free-running bias-corrected climate change simulations against observations is misleading. The underlying reasoning is as follows: a cross-validation can in principle have two outcomes. A negative (in the sense of not rejecting a null hypothesis), if the residual bias in the validation period after bias correction vanishes; and a positive, if the residual bias in the validation period after bias correction is large. It can be shown analytically that the residual bias depends solely on the difference between the simulated and observed change between calibration and validation periods. This change, however, depends mainly on the realizations of internal variability in the observations and climate model. As a consequence, the outcome of a cross-validation is also dominated by internal variability and does not allow for any conclusion about the sensibility of a bias correction. In particular, a sensible bias correction may be rejected (false positive) and a non-sensible bias correction may be accepted (false negative). We therefore propose to avoid cross-validation when evaluating bias correction of free-running climate change simulations against observations. Instead, one should evaluate non-calibrated temporal, spatial and process-based aspects.

Bias correction is a widely used approach to postprocess climate model
simulations before they are applied in impact studies

The performance of a bias correction is typically evaluated against
independent observational data, which have not entered the calibration of the
correction function. For instance,

Cross-validation is a well-known and widely used statistical concept to
assess the skill of predictive statistical models

In climate change applications, however, the setting is typically different
from a weather forecasting or perfect predictor setting: here, the model is
running free, i.e. only external forcings are common to observation and
simulation. Internal climate variability on all scales is independent and not
synchronized. In this setting,
the aim is not to assess predictive power, e.g. on a day-by-day or
season-by-season basis, as in weather forecasting – in fact, by construction
it cannot be assessed. Importantly, observed and simulated long-term trends
may also differ substantially, just because of different random realizations
of long-term modes of variability. Prominent examples of such modes are the
Pacific Decadal Oscillation

These differences have crucial implications for the application of
cross-validation or any evaluation on
independent data. Our results build upon a recent study by

We will discuss the specific context of climate change simulations in
Sect.

Cross-validation was developed to quantify the predictive skill of
statistical models in the 1930s, and has become widely used with the advent
of modern computers

The first major aim of cross-validation is to eliminate artificial skill: if a statistical model is evaluated on the same data used for its calibration, its performance in predicting new data will almost certainly be lower than the estimated skill. Hence, the model is calibrated on one subset of the data and evaluated on another, ideally independent, subset. This so-called holdout method, however, uses each data point for either calibration or validation, but not both, and thus suffers from relatively high sampling errors.
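Artificial skill and its removal by the holdout method can be illustrated with a small synthetic sketch (our own toy example, not from this study; all data and parameters are arbitrary): a deliberately overfitted model looks far more skilful in-sample than on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus noise (purely illustrative).
n = 200
x = np.linspace(0.0, 1.0, n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Randomly split into calibration and validation halves (holdout method).
idx = rng.permutation(n)
cal, val = idx[: n // 2], idx[n // 2:]

# Deliberately overfit a high-order polynomial on the calibration subset.
coeffs = np.polyfit(x[cal], y[cal], deg=9)

rmse_cal = rmse(np.polyval(coeffs, x[cal]), y[cal])  # apparent (artificial) skill
rmse_val = rmse(np.polyval(coeffs, x[val]), y[val])  # honest out-of-sample skill

# The in-sample error underestimates the true error on independent data.
print(rmse_cal, rmse_val)
```

The in-sample error is systematically optimistic; only the validation error estimates the skill on new data.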

The second major aim of cross-validation is therefore to use the data
optimally. To this end, the holdout method, i.e. training and validation, is
repeated on different subsets of the data. The simplest approach is the
so-called split sample method, where the data are just split once into two subsets. More advanced

In weather and climate predictions, the aim is to predict the weather, i.e. internal variability, with a given lead time (say, 3 days or a season) at a desired timescale (say, 6 h or a season). A typical evaluation assesses how well certain meteorological aspects are predicted: in weather forecasting, one may for instance be interested in the overall prediction accuracy, measured by the root-mean-square error between predicted and observed daily time series. In a seasonal prediction, one may be interested in the bias of the predicted mean, or in the bias of the predicted wet-day frequency over a season. In this context, a cross-validation makes perfect sense if the validation blocks are long compared to the prediction lead time (and process memory).

Downscaling and bias correction methods are typically tested in perfect
predictor or perfect boundary condition experiments

In free-running climate simulations, however, the situation is fundamentally
different: here, any predictive power results only from external (e.g.
anthropogenic) forcing at very long timescales, but internal variability is
not synchronized at any timescale. Yet long-term modes of internal climate
variability, such as the PDO

As any cross-validation consists of repeated holdout evaluations, in the following we will only consider the holdout method. In
Sect.

Consider a simulated time series

The remaining residual bias is then

This residual bias can be expressed in terms of the observed and
simulated climate change signals. The change signal from calibration
to validation period is defined as
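For the mean, the derivation can be sketched as follows (notation is ours, assuming a simple additive mean bias correction calibrated on period $C$ and validated on period $V$):

```latex
% Additive correction calibrated on C: x^{corr} = x^{sim} - (mu^{sim}_C - mu^{obs}_C),
% so the corrected mean in the validation period is
\mu^{\mathrm{corr}}_V = \mu^{\mathrm{sim}}_V - \left(\mu^{\mathrm{sim}}_C - \mu^{\mathrm{obs}}_C\right).

% The residual bias in the validation period is then
B_V = \mu^{\mathrm{corr}}_V - \mu^{\mathrm{obs}}_V
    = \left(\mu^{\mathrm{sim}}_V - \mu^{\mathrm{sim}}_C\right)
    - \left(\mu^{\mathrm{obs}}_V - \mu^{\mathrm{obs}}_C\right)
    = \Delta^{\mathrm{sim}} - \Delta^{\mathrm{obs}},

% with the change signals \Delta = \mu_V - \mu_C of model and observations.
```

That is, the residual bias equals the difference between the simulated and observed change signals.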

For variables such as precipitation, one often considers relative
changes. Here a corresponding derivation holds. The relative error is
defined as

The residual relative error results in

The relative change signal from calibration to validation period is defined as

The residual bias or relative error could further be tested for significance,
i.e. whether the bias-corrected statistic

Assume now that a given bias correction may or may not be sensible. Note in
this context that it is completely irrelevant to explicitly define what
constitutes a sensible bias correction (but for a brief discussion see
Sect.

True negative: the bias correction is sensible, and the (bias-corrected) climate model simulates a trend closely resembling the observed trend.

False positive: the bias correction is sensible, but due to internal climate variability, the (bias-corrected) climate model simulates a trend different from the observed trend.

False negative: the bias correction is not sensible, but the (bias-corrected)
climate model for some reason simulates a trend similar to the
observed trend. This case corresponds to the example given in

True positive: the bias correction is not sensible, and the (bias-corrected) climate model simulates a trend different from the observed trend.

The crucial point is that for typical record lengths, much of the difference
between simulated and observed changes

Yet the discussion above implies an even stronger conclusion: because the false positive case may randomly occur, a sensible bias correction may be rejected by a cross-validation. Thus, even more importantly, cross-validation in the given context is not just useless, but even misleading.
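This randomness can be made explicit with a small Monte Carlo sketch (our own illustration; the red-noise model and all parameters are arbitrary assumptions): observations and simulation share an identical forced trend, so an additive correction of the model climatology is sensible by construction, yet the residual bias between calibration and validation halves fluctuates strongly from realization to realization.

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, phi, sigma, rng):
    """Red noise (AR(1)) as a crude stand-in for internal climate variability."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

n_years = 50                        # 25-year calibration + 25-year validation
half = n_years // 2
trend = 0.01 * np.arange(n_years)   # identical forced trend in model and obs

def residual_bias(obs, sim):
    """Residual mean bias after an additive correction calibrated on the
    first half: the simulated minus the observed change signal."""
    d_sim = sim[half:].mean() - sim[:half].mean()
    d_obs = obs[half:].mean() - obs[:half].mean()
    return d_sim - d_obs

biases = []
for _ in range(2000):
    obs = trend + ar1(n_years, phi=0.7, sigma=1.0, rng=rng)
    sim = trend + ar1(n_years, phi=0.7, sigma=1.0, rng=rng)
    biases.append(residual_bias(obs, sim))
biases = np.abs(biases)

# The correction is sensible by construction (identical forced trend), yet the
# residual bias is often large, driven purely by unsynchronized variability.
print(np.median(biases), biases.max())
```

Whether a given realization falls above or below any significance threshold is essentially a matter of chance, which is exactly the false positive mechanism described above.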

Maps of relative changes in boreal summer (JJA) mean
precipitation, 1981–2005 relative to 1956–1980.

To further illustrate the analytic findings, we will give examples of the
four cases in an exaggerated modelling example. We consider mean summer (JJA)
precipitation at four locations. As observational reference we select the
E-OBS data set

We need to select two examples where the given bias correction is sensible,
and two where it is not. Finding a convincing example of a sensible bias
correction has to rely on process understanding

In the following we show two examples where a bias correction is not
sensible. A discussion of when a bias correction makes no sense would go
far beyond the scope of this piece. Therefore, we follow
the logic of

Time series of boreal summer (JJA)
precipitation.

Figure

We have demonstrated both analytically and with a modelling example that cross-validation of free-running bias-corrected climate change simulations against observations is misleading. The underlying reasoning is as follows: the result of a cross-validation – a significant or non-significant residual bias in the validation period – depends on the difference between observed and simulated changes between calibration and validation periods. For typical lengths of calibration and validation periods, these differences depend mostly on the realizations of internal variability in the observations and climate model. These differences therefore do not allow for conclusions about the sensibility of a bias correction. As in any setting of significance testing, four cases are possible: true negative, false positive, false negative and true positive. The actual outcome in a given application is mostly random.

The relevance of internal variability in the discussed cross-validation
context depends on the relative strength of internal variability and forced
trends, and the length of the calibration and validation period compared to
the periodicity of the dominant modes of internal climate variability. In
tropical climates, interannual variability such as that of El
Niño–Southern Oscillation dominates climate variability. Thus, relatively
short periods of a few decades may suffice to obtain stable estimates of
forced changes between calibration and validation period, and therefore to
assess whether a bias correction performs well given the observed changes. In
mid-latitude climates, however, the dominant modes of internal variability
have periodicities of several decades

We have derived these conclusions for the mean and the holdout method, where the bias correction is calibrated against one part of the data and validated against its complement. Yet the results can in principle be transferred to other statistics such as variances or individual quantiles, and to a full cross-validation. The residual mean bias, however, is always zero in a full cross-validation, as long as the individual folds have the same length. The reason is that swapping the calibration and validation periods changes the sign of the residual bias, so that averaging across the different folds cancels it out. For the variance or similar statistics, the outcome depends on the way the cross-validation is carried out: if the residual bias is calculated for each fold separately and then averaged (as suggested in the classical literature), the behaviour is the same as for the mean. If the residual bias is calculated over a concatenated cross-validated time series (as is typically done in the atmospheric sciences), the bias correction in the false positive and true positive cases will yield extremely high residual biases (because the shift in the mean is not removed in the variance calculation).
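The exact sign cancellation in a twofold cross-validation can be verified numerically (a minimal sketch with synthetic data; variable names and the artificial bias of 2.0 are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(size=60)          # synthetic observations
sim = rng.normal(size=60) + 2.0    # synthetic simulation with an arbitrary bias

half = len(obs) // 2
fold1 = np.arange(half)
fold2 = np.arange(half, 2 * half)

def residual_bias(cal, val):
    """Residual mean bias on the validation fold after an additive mean
    correction estimated on the calibration fold."""
    correction = obs[cal].mean() - sim[cal].mean()
    return (sim[val] + correction).mean() - obs[val].mean()

b1 = residual_bias(fold1, fold2)   # calibrate on fold 1, validate on fold 2
b2 = residual_bias(fold2, fold1)   # and vice versa

# Swapping calibration and validation flips the sign of the residual bias,
# so averaging over the two equally long folds cancels it exactly.
print(b1, b2, b1 + b2)
```

The average residual mean bias over the two folds vanishes regardless of how large the individual fold biases are, which is why a full cross-validation of the mean is uninformative here.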

The consequence of these findings is that cross-validation should not
be used when evaluating bias correction of free-running climate
simulations against observations. In fact, a framework for evaluating
bias correction of climate simulations is still missing and not
trivial. As discussed in

E-OBS data are available from the ECA&D website at

DM and MW developed the idea for the study. DM conducted the analytical derivations, carried out the analysis, and wrote the manuscript. MW commented on the manuscript. DM and MW discussed the results.

The authors declare that they have no conflict of interest.

This study has been inspired by discussions in EU COST Action ES1102 VALUE.

Edited by: Carlo De Michele
Reviewed by: Uwe Ehret and Seth McGinnis