Statistical downscaling models (SDMs) are often used to produce local weather
scenarios from large-scale atmospheric
information. SDMs include transfer functions which are based on
a statistical link identified from observations between local
weather and a set of large-scale predictors. As physical processes
driving surface weather vary in time, the most relevant predictors
and the regression link are likely to vary in time too. This is well
known for precipitation, for instance, and the link is therefore often
estimated after some seasonal stratification of the data. In this
study, we present a two-stage analog/regression model where the
regression link is estimated from atmospheric analogs of the current
prediction day. Atmospheric analogs are identified from fields of
geopotential heights at 1000 and 500

Statistical downscaling models (SDMs) have been widely used to
generate local weather scenarios for past or future climates from
outputs of climate models

Among the different SDM approaches presented over the last decades

Transfer functions mainly consist of regression models where the
expected value of the predictand for time

The downscaling relationship used in transfer functions is usually
established empirically between a selection of large-scale
predictors and the predictand (e.g., precipitation occurrence) from
a set of observations available for recent decades. As physical
processes driving surface weather vary in time, the most relevant
predictors and the downscaling link are however expected to vary in
time too. When inferred from all observations available for a given
period, the downscaling relationship – which is thus likely
inferred from a heterogeneous ensemble of weather configurations –
is consequently likely to be sub-optimal. To reduce this potential
limitation, the parameterization of the relationship is often
estimated after some data stratification. In the usual calendar
stratification, one parameter set is for instance optimized for
each calendar month or season

A smoother weather-type-like approach consists in defining the
weather type from all atmospheric situations that are similar to
the situation of the prediction day. The ensemble of days from
which the downscaling link is identified is thus expected to be
rather homogeneous and to be informative about the large- to small-scale
link sought for the considered prediction day. This is in turn
expected to make the link stronger and to improve the
prediction

In the present study, we present a two-stage analog/regression
downscaling model for the probabilistic prediction of small-scale
daily precipitation: for each prediction day, the statistical
downscaling link between some large-scale atmospheric predictors and
small-scale precipitation is estimated from large-scale and local-scale
observations available from an ensemble of days which
are atmospheric analogs to the prediction day. The analog model
(AM) used for the analog stage is based on developments from
different studies initially focusing on the probabilistic
quantitative precipitation forecasts in southern France

As mentioned above, SDMs are used for the simulation of local
weather scenarios in different contexts, e.g., local weather forecasts,
reconstructions, or climate impact studies. No
specific context is considered here, and the two-stage model could
be further considered for forecasting, reconstruction, or
future projections. Depending on its intended use, some specific
issues would obviously apply, calling for specific focused analyses
and developments. For instance, the large-scale atmospheric
parameters to be considered as predictors would depend on the
dataset considered (e.g., atmospheric reanalyses, climate models, or
numerical weather prediction models) as a result of their intrinsic
quality

The paper is structured as follows: Sect.

The predictand is the daily small-scale precipitation estimated for
the 1982–2001 period over 8981 grid cells of

Large-scale potential variables considered in the work. Stars: predictors
obtained from the best GLMs identified for the 12 test SAFRAN grid cells
(Sect.

Atmospheric predictors are taken from the European Centre for
Medium-Range Weather Forecasts (ECMWF) Re-Analysis

For the analog stage, predictors are the 1000 and 500

For the regression stage, 22 other predictors were also
considered. The selection gathers most predictors considered in
previous studies over Europe

To avoid multicollinearity among the predictors of the
regression, we identified a subset of uncorrelated predictors. The
cross-correlations between all predictor pairs were first estimated
on an annual basis from all available data. The correlation
structure can however differ from one atmospheric configuration to
the other. The set of uncorrelated predictors could thus differ
from one prediction day to the other. We thus repeated the
correlation analysis for each prediction day, using for this
estimation the predictor values observed for the 100 nearest
atmospheric analogs identified for this day. The main features of
the inter-variable correlations were found to be roughly
independent of the day (not shown). The final subset of
uncorrelated predictors is highlighted in
Table

A large number of predictor sets can be built from these predictors. In the present work, for the sake of robustness, at most four predictors can be integrated into a given regression model. Predictors are expected to be both day- and location-specific. For the sake of simplicity and readability, however, we select them from a unique set of four potential predictors. This reduces the degrees of freedom in the model and better highlights its skill and adaptive behavior.
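As a minimal sketch of what this restriction implies, the candidate regression structures can be enumerated as all non-empty subsets of the four potential predictors, which gives the 15 structures tested later in the text. The predictor names below are placeholders, and the order produced here is not meant to reproduce the numbering (Str. no. 1, Str. no. 2, ...) used in the paper.

```python
from itertools import combinations


def regression_structures(predictors):
    """Return every non-empty subset of the potential predictors."""
    return [
        subset
        for size in range(1, len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]


# Placeholder names (assumption): the actual predictors differ
# between the occurrence model and the amount model.
potential_predictors = ["p1", "p2", "p3", "p4"]
structures = regression_structures(potential_predictors)
print(len(structures))  # 4 + 6 + 4 + 1 subsets
```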

For each predictand, the set of four potential predictors was selected as follows. For 12 SAFRAN grid cells uniformly distributed over the French territory, we first identified, with a standard iterative forward/backward algorithm, the four-predictor set which leads to the best prediction skill for the all-days configuration. From the 12 sets obtained (one per grid cell), we finally retained the set which leads on average to the best prediction skill over the 8981 SAFRAN grid cells.
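The selection step for one grid cell can be illustrated with a simplified greedy search. This is only a sketch under stated assumptions: it implements the forward pass alone (the backward removal step of the forward/backward algorithm is omitted), and `skill` is a hypothetical user-supplied function returning the prediction skill of a predictor set, standing in for the score used in the study.

```python
from typing import Callable, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    skill: Callable[[Tuple[str, ...]], float],
    max_size: int = 4,
) -> Tuple[str, ...]:
    """Greedily add the predictor that most improves the skill."""
    selected: Tuple[str, ...] = ()
    best = float("-inf")
    while len(selected) < max_size:
        trials = [
            (skill(selected + (name,)), name)
            for name in candidates
            if name not in selected
        ]
        if not trials:
            break
        trial_skill, choice = max(trials)
        if trial_skill <= best:
            break  # no remaining candidate improves the skill
        selected += (choice,)
        best = trial_skill
    return selected


# Toy skill function (assumption): additive contributions per predictor.
toy_weights = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.05, "e": -0.1}


def toy_skill(predictors):
    return sum(toy_weights[p] for p in predictors)


chosen = forward_select(list(toy_weights), toy_skill, max_size=4)
```

In the study, such a search would be repeated for each of the 12 test grid cells before comparing the resulting sets over all grid cells.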

For precipitation occurrence, this best four-predictor set is
constituted from the relative humidity

Predictors considered for the analog and regression stages inform about different features of the atmospheric state at different scales. Geopotential fields, by their spatial extent, characterize the large-scale atmospheric circulation configuration (the spatial domain of several thousands of kilometers includes a part of the northeastern Atlantic and covers France and a part of the neighboring countries), whereas scalar predictors used in the regression stage describe a more local (and mostly thermodynamic) state of the atmosphere (the spatial domain of several hundreds of kilometers is roughly centered above the target location).

Cumulative distribution function (cdf) of the precipitation amount
for a given prediction day (in gray) at a given grid cell. For illustration, the
prediction here corresponds to the empirical cdf achieved with the analog
model (AM) mentioned in Sect.

As illustrated in Fig.

In the present work, the cdf of precipitation is modeled for each
grid cell and each prediction day with GLMs

In the following, we first describe the AM used to
identify atmospheric analog days (Sect.

As discussed later, one can face prediction days where the
regression stage fails, i.e., where the regression parameters are
not significantly different from zero at the chosen significance
level (

The way these different models are combined to finally give, for
the current prediction day, a probabilistic prediction of
precipitation, is presented in Sect.

The atmospheric analog days retained for the regression stage are
identified with an analog model defined from the developments of
several past studies in France

For any given prediction day (e.g., 31 May 2018), the analog days
retained for the regression are the

Following

The cdf of precipitation is then modeled for each prediction day
with GLMs estimated for this specific day from the atmospheric
analogs of the day. GLMs let the cdf depend on covariates, here
atmospheric predictors. For each prediction
day, the probability of precipitation occurrence

For the non-zero precipitation amount, we used a GLM with the gamma
distribution and the log link function. The expected amount

For any given prediction day, the estimation of both GLM models practically proceeds as follows.

Possible regressive structures (i.e., a combination of predictors) for the modeling of precipitation occurrence and amount.

The precipitation state (wet or dry), the precipitation amount,
and the values of the different potential predictors are extracted
for the

For occurrence probability, different sets of predictors are
considered in turn. For each set, the parameters of the occurrence
GLM are estimated from the predictors/occurrence values available
for the

For precipitation amount, different sets of predictors are again
considered in turn. For each set, the parameters of the GLM are
estimated from the predictors/amount values available from the
analog days which are wet (

The prediction of the occurrence probability (or the expected
precipitation amount) for the prediction day is finally obtained
from the best occurrence (or amount) GLM, using the values of
the predictors observed for that prediction day. The final
distribution of precipitation

The

If the significance conditions cannot be satisfied for the
precipitation occurrence GLM, the occurrence probability

Similarly, if the significance conditions cannot be satisfied for
the precipitation amount GLM, the distribution

As illustrated in
Fig.

Note that the regression stage achieved with GLMs can also be seen
as a way to refine the estimation of the cdf that could have been
obtained directly with the backup (and benchmark)

Illustrations of the four cases met for the issue of

As described previously, the two-stage analog/regression prediction process is repeated for each prediction day in turn. As the analog days vary from one prediction day to another, the predictors selected in the regression stage and the values of the corresponding regression coefficients are expected to vary as well. The two-stage SCAMP model thus allows for a day-to-day adaptive and tailored downscaling.
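The analog stage that is repeated for each prediction day can be sketched as a nearest-neighbor search over an archive of large-scale fields. This is an illustration only: the Euclidean distance is a placeholder for the analogy criterion of the analog model, the number of analogs `k` is a free parameter here, and the function and variable names are hypothetical.

```python
import math


def nearest_analogs(target_field, archive, k):
    """Return the k archive days whose fields are closest to the target.

    archive maps a day label to a flattened field (list of floats).
    The prediction day itself is assumed to be absent from the archive.
    """

    def distance(field_a, field_b):
        # Placeholder criterion (assumption): plain Euclidean distance.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(field_a, field_b)))

    ranked = sorted(archive, key=lambda day: distance(archive[day], target_field))
    return ranked[:k]


# Toy archive of three days with two-value "fields".
toy_archive = {"d1": [0.0, 0.0], "d2": [1.0, 1.0], "d3": [5.0, 5.0]}
analogs = nearest_analogs([0.9, 0.9], toy_archive, k=2)
```

The regression stage would then be fitted on the observations of the returned days only, which is what makes the predictors and coefficients specific to each prediction day.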

The prediction skill of the downscaling model is assessed with
probabilistic scores usually used to evaluate ensemble prediction systems
(EPSs). Let us consider a given EPS, denoted as

The Brier score

The ability of EPS

For this evaluation, the probabilistic prediction of the predictand

In the following, we discuss the prediction skill for precipitation
occurrence and amount with the Brier skill score (BSS) and the continuous
ranked probability skill score (CRPSS), respectively. Both scores normalize
the prediction skill of EPS

In the following, to assess the added value of the two-stage SCAMP model when
compared to the benchmark

The two-stage model is used for the probabilistic prediction of
small-scale precipitation over the continental French territory
for each day of the 1982–2001 period. We here present the
prediction skill obtained for occurrence and amount with the two
predictor sets presented in Sect.

Figure

The BSS gain obtained with SCAMP over

Figure

The CRPSS gain obtained over

Despite the large dependency on regional features such as
topography or proximity to the sea, adding local and thermodynamic
information in SCAMP greatly improves the prediction skill over
that of

As described in Sect.

The frequency with which each activation case (cases 1 to 4) is obtained over
the simulation period is given in Fig.

Figure

For a given prediction day, the precipitation state of its analog days is expected to be roughly similar to that of the day itself, which explains the behavior of SCAMP described above. In cases 1 and 2, the analog days of the prediction day are likely very dry. The number of wet analog days is thus likely small to very small, and too small to allow for a robust estimation of the precipitation amount GLM. Conversely, analog days are likely wet in case 4, or even very wet in case 3, so the number of wet days is likely large enough to allow for a robust estimation of the precipitation amount GLM. In case 3, the very large number of wet analog days can in turn prevent a robust estimation of the occurrence GLM (e.g., the occurrence GLM cannot be estimated in configurations where all days are wet).

This can also explain the specific results obtained in the southeast. Case 2
is indeed activated much more often in this
region (increase of 30

Percentage of days where (1) no updates are applied, (2) only the precipitation occurrence is updated, (3) only the precipitation amount is updated, and (4) both the precipitation occurrence and amount are updated. Gray cells are grid cells where the corresponding case has been met fewer than 35 times over the 20-year evaluation period.
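The four activation cases listed in this caption can be encoded with two flags per prediction day, one for each GLM. The following sketch assumes that a flag is true when the corresponding GLM passed the significance conditions and was therefore used to update the prediction; the function names are illustrative and not taken from the SCAMP code.

```python
from collections import Counter


def activation_case(occurrence_updated, amount_updated):
    """Map the two update flags of a prediction day to cases 1 to 4."""
    if occurrence_updated and amount_updated:
        return 4  # both occurrence and amount are updated
    if amount_updated:
        return 3  # only the amount is updated
    if occurrence_updated:
        return 2  # only the occurrence is updated
    return 1  # no update: the backup prediction is kept


def case_percentages(flags_by_day):
    """Percentage of days falling in each case for one grid cell."""
    counts = Counter(activation_case(occ, amt) for occ, amt in flags_by_day)
    n_days = len(flags_by_day)
    return {case: 100.0 * counts.get(case, 0) / n_days for case in (1, 2, 3, 4)}


# Toy series of four days, one per case.
toy_flags = [(False, False), (True, False), (False, True), (True, True)]
percentages = case_percentages(toy_flags)
```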

The CRPSS gain achieved with SCAMP results from the updated prediction of
both precipitation occurrence and amount. To assess the relative effects of
these updates on the gain, we further compared the following four prediction
experiments.

Ratio between the mean amount obtained for all days belonging to a given case
and the overall mean precipitation amount. The four cases and gray grids:
same as in Fig.

Gain in CRPSS for different prediction experiments (see
Sect.

The CRPSS gains obtained between Exps. 1 and 2, between Exps. 1 and 3, and
between Exps. 1 and 4 are presented in
Fig.

For a large majority of grid cells, the CRPSS gain obtained with an
updated prediction of the occurrence probability (from 0 to 0.05
CRPSS points) is significantly lower than that obtained with an
updated prediction of amount (from 0.03 to 0.1 CRPSS points). The
CRPSS gain obtained in the latter case is additionally close to
that obtained with the full two-stage model. The CRPSS gain
obtained by SCAMP in Fig.

The sets of potential predictors used in SCAMP for the prediction
of precipitation occurrence and amount have been listed in
Sect.

Prediction of occurrence probability: selection frequencies (%) of the 15
regression structures and of the backup model

Same as Fig.

For each season and weather type, difference (%) in selection frequency with
the all-days case for different regression structures. Results for the
prediction of

For a given prediction day, the regressive structures selected by SCAMP for precipitation occurrence or for precipitation amount are expected to carry the most relevant information for the prediction. In the following, we assess how often each structure has been selected. This gives some insight into the atmospheric information actually used in the regression stage and into how this information varies in time.
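Selection frequencies of this kind can be obtained by counting, for one grid cell, which structure was retained on each prediction day. The sketch below assumes the daily selections are stored as a list of labels, with a separate label for days where the backup model is used; the labels are illustrative.

```python
from collections import Counter


def selection_frequencies(selected_by_day):
    """Return the selection frequency (%) of each structure label."""
    counts = Counter(selected_by_day)
    n_days = len(selected_by_day)
    return {label: 100.0 * n / n_days for label, n in counts.items()}


# Toy record of four prediction days (assumed labels).
toy_selection = ["Str. no. 3", "Str. no. 1", "Str. no. 3", "backup"]
frequencies = selection_frequencies(toy_selection)
```

Restricting the input list to the days of a given season or weather pattern gives the conditional frequencies that are compared with the all-days case.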

Figures

For occurrence (Fig.

For precipitation amount, the most frequently selected structures are
Str. no. 3 and Str. no. 1, both based on one single
predictor,

Note that for the selection of the best regression structure for
a given prediction day, all 15 of these regressive structures have been
tested in turn. The results above suggest that this systematic
test is not necessary and that it could be reasonable to consider
only the few structures which are frequently retained or which are
retained a “reasonable” fraction of the days. However, the
selection frequency of a given structure actually varies with the
seasons and/or the encountered synoptic situation, and some
secondary regressive structures can be retained frequently for
specific situations. This is illustrated in
Fig.

For precipitation occurrence (Fig.

The preferential (or conversely reduced) selection of some
regression structures for given WTs or seasons was estimated for
all grid cells of France. In most cases, the preferential (or
reduced) selection was found to present a noticeable spatial
coherency. Different configurations are observed as illustrated in
Fig.

The preferential selection of some regression structures can first
be observed over large to very large regions. As an example, the
preferential selection of Str. no. 3 for the prediction of
precipitation amount for days in WP7 (more than

Names of the weather patterns (WP) defined in

For a given weather pattern, the preferential selection of
a regressive structure can also vary from one region to the
other. For WP2 for instance, the structures based on

The preferential selection of a regressive structure can also be
obtained for rather small and specific regions. In
Fig.

Whatever the configuration, the preferential selection of regression structures presents some spatial coherency, at small or large regional scales. This also suggests that the informative predictors to be retained for given large-scale weather configurations are spatially robust.

The relevance of a two-stage analog/regression model has been explored in this study for the probabilistic prediction of precipitation over France. Atmospheric analogs of the prediction day are identified to estimate the parameters of a two-part regression model, which is then applied for the prediction. The regression model consists of a logistic GLM for the prediction of precipitation occurrence and a gamma GLM with a log link for the prediction of precipitation amount. The prediction obtained with this two-stage approach updates the predictive distribution that would have been obtained directly from a one-stage analog model based on atmospheric circulation analogs. The two-stage approach makes the downscaling model adaptive: as the analog days are identified for each prediction day, the predictors and regression coefficients of the regression models can vary from one day to the other.

The regression stage yields a non-negligible gain in prediction skill
compared to the reference analog model (gain of up to 0.1 skill score
points for both the BSS and the CRPSS). The CRPSS gain mainly
results from the regression model estimated for the
precipitation amount. The introduction of local-scale predictors
such as relative humidity is crucial there. The adaptive
nature of the model, and thus the possibility of tailoring the
downscaling relationship (both predictors and regression
coefficients) to the current prediction day, seems to be decisive as
well. The CRPSS gain obtained with the two-stage approach is
actually 2 times larger than the one obtained by

The prediction skill and adaptability of this two-stage approach
were illustrated for the prediction of both the precipitation
occurrence and amount in a simplified configuration where four
predictors, selected in a preliminary analysis from a large
ensemble of potential predictors, are used in the regression
stage. The predictors used for precipitation occurrence are the
relative humidity and vertical velocity at 700

For the sake of simplicity and to limit the degrees of freedom in our analysis, we considered a unique set of four potential predictors for all SAFRAN grid cells. This leads to a sub-optimal prediction configuration, since the main meteorological processes driving precipitation in France differ from one region to the other. The most informative predictors are thus expected to be region-dependent, and the set of predictors to be considered in the regression stage could be refined on a regional basis. This is expected to improve the skill of the prediction. The same would apply for an application of SCAMP to other regions worldwide.

A number of atmospheric variables have been considered as potential predictors in similar downscaling studies. The predictors found to be of interest are most often few. They are roughly the same as those considered in the preliminary analysis of the present work. However, as in the present work, the analyses usually carried out to identify these informative predictors are potentially misleading. The selection of a variable is indeed often based on its predictive power, estimated with some prediction skill score in an all-days evaluation framework. As highlighted in the present work, however, some predictors are likely to be informative for very few meteorological situations. An all-days evaluation is expected to reveal robust predictors, but it very likely misses important situation-specific predictors. The two-stage approach estimates the statistical downscaling link from a set of days that is homogeneous with respect to the large-scale atmospheric circulation configuration. Those days are moreover atmospheric analogs to the prediction day. This two-stage approach thus has the potential to reveal the predictive power of very specific predictors, suited for very specific meteorological configurations. It very likely leaves room for significant improvements of the prediction skill for such unusual configurations. It likely also gives the opportunity to better understand the atmospheric factors at play in a number of infrequent and atypical meteorological situations. Notwithstanding the technical limitations that may hamper such analyses, a broader exploration of a much larger diversity of predictors, possibly non-conventional ones, would thus definitely be worthwhile in this context.

Both the predictors and the regression coefficients were shown in our work to depend on the analog days identified in the analog stage. This is the reason for the adaptability of the downscaling discussed above. Besides adaptability, we ideally expect the predictor selection and the associated regression coefficients to be robust for a given prediction day. Further analyses should explore this issue. It would be interesting, for instance, to check that the predictors and their related coefficients do not change significantly when the set of analog days considered for the estimation is modified as a result of a different setup of the analog model (e.g., when one changes the archive period or the archive length).

Results of our work depend on a number of choices and assumptions. For instance, they likely depend on the database used for the large-scale atmospheric predictors. The day-to-day behavior of such an analog/regression approach (and the skill of the prediction) likely depends on the database and especially on the quality of the predictors. An atmospheric reanalysis with a higher spatial resolution would, for instance, likely allow for a better description of the shapes of geopotential fields and for a more relevant simulation of regional/local thermodynamic processes. It would in turn likely lead to higher-quality variables for some atmospheric parameters such as air instability. This may allow for a better identification of the daily specificity in the downscaling relationship and of the most informative predictors to be used each day. The reverse may occur when using lower-quality predictors, for instance lower-quality data from reanalyses available for the 20th century or from climate or numerical weather forecasting models. The quality of the predictors is thus an important issue to be further considered. It may lead to different informative predictors, depending on the intended use of the model (forecast, simulation, or climate impact studies).

SCAMP was used here for the prediction of small-scale precipitation
at individual grid cells. The prediction of precipitation fields,
obviously required for a number of impact studies, is also
a challenging issue

The data used for this work are described in the “data” section of Quintana-Segui et al. (2008) and Vidal et al. (2010). Atmospheric predictors are taken from the European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40; Uppala et al., 2005).

This study is part of JC's PhD thesis. BH and ACF supervised the PhD. All authors contributed to the design of the experiments and to the writing of the document. JC developed the model code and performed the simulations.

The authors declare that they have no conflict of interest.

The authors especially thank Charles Obled and Isabella Zin for fruitful
discussions on the analog method. The authors also thank the Grenoble
University High Performance Computing centre, CIMENT
(