The Short-Term Ensemble Prediction System (STEPS) is implemented in real-time
at the Royal Meteorological Institute (RMI) of Belgium. The main idea behind
STEPS is to quantify the forecast uncertainty by adding stochastic
perturbations to the deterministic Lagrangian extrapolation of radar images.
The stochastic perturbations are designed to account for the unpredictable
precipitation growth and decay processes and to reproduce the dynamic scaling
of precipitation fields, i.e., the observation that large-scale rainfall
structures are more persistent and predictable than small-scale convective
cells. This paper presents the development, adaptation and verification of
the STEPS system for Belgium (STEPS-BE). STEPS-BE provides in real-time
20-member ensemble precipitation nowcasts at 1 km and 5 min resolutions up
to 2 h lead time using a 4 C-band radar composite as input. In the context
of the PLURISK project, STEPS forecasts were generated to be used as input in
sewer system hydraulic models for nowcasting urban inundations in the cities
of Ghent and Leuven. Comprehensive forecast verification was performed in
order to detect systematic biases over the given urban areas and to analyze
the reliability of probabilistic forecasts for a set of case studies in 2013
and 2014. The forecast biases over the cities of Leuven and Ghent were found
to be small, which is encouraging for future integration of STEPS nowcasts
into the hydraulic models. Probabilistic forecasts of exceeding
0.5 mm h

The use of radar measurements for urban hydrological applications has substantially increased during the last years (e.g., Berne et al., 2004; Einfalt et al., 2004; Bruni et al., 2015). Given the fast response time of urban catchments and sewer systems, radar-based very short-term precipitation forecasting (nowcasting) has the potential to extend the lead time of hydrological and hydraulic flow predictions.

Nowcasting concerns the accurate description of the current weather situation together with very short-term forecasts obtained by extrapolating the real-time observations. Quantitative precipitation nowcasting (QPN) is traditionally done by estimating the apparent movement of radar precipitation fields using optical flow or variational echo tracking techniques and extrapolating the last observed precipitation field into the future (e.g., Germann and Zawadzki, 2002; Bowler et al., 2004a). During recent years there has been significant progress in NWP modeling with radar data assimilation techniques (see a review in Sun et al., 2014), which reduces the useful lead time of extrapolation-based nowcasts compared with NWP forecasts. The development of seamless forecasting systems that optimally blend the extrapolation nowcast with the output of NWP models makes the definition of the nowcasting time range even fuzzier (see, e.g., Pierce et al., 2010).

Due to the lack of predictability of rainfall growth and decay processes at
small spatial scales (Radhakrishna et al., 2012), it is very important to
provide together with a forecast an estimation of its uncertainty. The
established method to represent the forecast uncertainty in Numerical
Weather Prediction (NWP) is to generate an ensemble of forecasts by
perturbing the initial conditions of the model in the directions exhibiting
the largest error growth, which amplify more the spread of the obtained
ensemble. However, in the nowcasting range the computation of large NWP
ensembles (50–100 members) that resolve features at the scales of 1 km and
are updated every 5 min is still impossible to achieve. Consequently, the
efforts in nowcasting research have recently focused on developing heuristic
techniques for probabilistic precipitation nowcasting, which was the topic
of the

Probabilistic QPN methods can be categorized into three main classes: analog, local Lagrangian and stochastic approaches. The analog-based approach derives the forecast probability density function (pdf) by retrieving a set of similar situations from an archive of precipitation events (Panziera et al., 2011; Foresti et al., 2015), the local Lagrangian approach derives the pdf by collecting the precipitation values in a neighborhood of a given grid point in Lagrangian frame of reference (Hohti et al., 2000; Germann and Zawadzki, 2004) and the stochastic approach exploits a random number generator to compute an ensemble of equally likely precipitation fields, for example by adding stochastic perturbations to a deterministic extrapolation nowcast (Pegram and Clothier, 2001a, b; Bowler et al., 2006; Metta et al., 2009; Berenguer et al., 2011; Seed et al., 2013; Atencia and Zawadzki, 2014; Dai et al., 2015). The stochastic approach is also extensively used to produce ensembles of precipitation fields that characterize the radar measurement uncertainty (e.g., Jordan et al., 2003; Germann et al., 2009) and for design storm studies (e.g., Willems, 2001a; Paschalis et al., 2013).

Uncertainty quantification is nowadays an integral part of both weather and hydrological forecasting (Pappenberger and Beven, 2006). Not surprisingly, an important part of hydro-meteorological research aims at understanding how to propagate the uncertainty of precipitation observations and forecasts into the hydrological models (e.g., Willems, 2001b; Cloke and Pappenberger, 2009; Collier, 2009; Zappa et al., 2010).

Several studies already analyzed the value of deterministic nowcasting systems for catchment hydrology (e.g., Berenguer et al., 2005) and for better control of urban drainage systems (e.g., Achleitner et al., 2009; Verworn et al., 2009; Thorndahl and Rasmussen, 2013). Since an important fraction of the uncertainty of hydrological predictions is due to the uncertainty of the input rainfall observations and forecasts, radar-based ensemble nowcasting systems are increasingly used as inputs for flood and sewer system modeling (e.g., Ehret et al., 2008; Silvestro and Rebora, 2012; Silvestro et al., 2013; Xuan et al., 2009, 2014). At longer forecast ranges, the NWP ensembles are also exploited for uncertainty propagation into hydrological models (see Roulin and Vannitsem, 2005; Thielen et al., 2009; Schellekens et al., 2011).

The Short-Term Ensemble Prediction System (STEPS) is a probabilistic nowcasting system developed at the Australian Bureau of Meteorology and the UK MetOffice (see the series of papers: Seed, 2003; Bowler et al., 2006; Seed et al., 2013). STEPS is operationally used at both weather services and provides short-term ensemble precipitation forecasts using both the extrapolation of radar images and the downscaled precipitation output of NWP models. The main idea behind STEPS is to represent the uncertainty due to the unpredictable precipitation growth and decay processes by adding stochastic perturbations to the deterministic extrapolation of radar images. The stochastic perturbations are designed to represent the scale-dependence of the predictability of precipitation and to reproduce the correct spatio-temporal correlation and growth of the forecast errors.

One of the first applications of STEPS in hydrology is presented in Pierce et al. (2005), who used the STEPS ensemble nowcasts to quantify the accuracy of flow predictions in a medium-sized catchment in the UK. The value of STEPS nowcasts for urban hydrology was extensively analyzed by Liguori and Rico-Ramirez (2012), Liguori et al. (2012), Liguori and Rico-Ramirez (2013) and Xuan et al. (2014). Liguori and Rico-Ramirez (2012) concluded that the performance of the radar-based extrapolation nowcast can be improved after 1 h lead time if blended with the output of a NWP model. They also found that, according to the Receiver Operating Characteristic (ROC) curve, the probabilistic nowcasts have more discrimination power than the deterministic ones. Liguori et al. (2012) integrated STEPS nowcasts as inputs into sewer system hydraulic models in an urban catchment in Yorkshire (UK). They concluded that the blending of radar and NWP forecasts has the potential to increase the lead time of flow predictions, but is strongly limited by the low accuracy of the NWP model in forecasting small-scale features. Liguori and Rico-Ramirez (2013) performed a detailed verification of the accuracy of flow predictions and concluded that the STEPS ensembles provide a similar performance as using a deterministic STEPS control forecast, but the ensembles lead to a slight underestimation of the flow predictions. Xuan et al. (2014) used ensemble STEPS nowcasts as inputs in a lumped hydrological model for a medium-sized catchment in the southwest of the UK. The hydrological model calibrated with rain gauges had lower RMSE than the one using radar data, but the ability of STEPS to account for the forecast uncertainty was useful in capturing some of the high flow peaks and extending the forecast lead time. However, the conclusions of the previous studies are strongly affected by the limited number of flood events analyzed. An extensive review of the usage of precipitation forecast systems for operational hydrological predictions in the UK from very short to long ranges (including STEPS) is provided in Lewis et al. (2015).

The goal of this paper is to present the development and verification of the
STEPS system at the Royal Meteorological Institute of Belgium (RMI), referred
to as STEPS-BE. STEPS-BE provides in real-time 20-member ensemble
precipitation nowcasts at 1 km and 5 min resolutions up to 2 h lead time
on a

Ensemble STEPS nowcasts are computed for a set of sewer overflow cases that affected the cities of Leuven and Ghent in 2013 and 2014. The accuracy of the ensemble mean forecast is verified using both continuous verification scores (multiplicative bias, RMSE) and categorical scores derived from the contingency table (probability of detection, false alarm ratio and Gilbert skill score). However, the most interesting part of this paper is the probabilistic and ensemble verification of STEPS nowcasts using both stratiform and convective rainfall events. Probabilistic nowcasts are verified using reliability diagrams and ROC curves. On the other hand, the dispersion of the nowcast ensembles is verified using rank histograms and by comparing the ensemble spread to the error of the ensemble mean.

The paper is structured as follows. Section 2 presents the radar data processing and case studies that are used to generate and verify the STEPS forecasts. Section 3 describes the STEPS nowcasting system and its extension and local implementation for Belgium (STEPS-BE). Section 4 illustrates the forecast verification results. Section 5 concludes the paper and discusses future perspectives.

STEPS-BE integrates as input a composite image produced from the C-band radars of Wideumont (RMI, single-pol), Zaventem (Belgocontrol, single-pol), Jabbeke (RMI, dual-pol) and Avesnois (Meteo-France, dual-pol). The composite is produced on a 500 m resolution grid by combining single-radar pseudo Constant Altitude Plan Position Indicators (CAPPI) at a height of 1500 m a.s.l. The compositing algorithm takes the maximum reflectivity value from each radar at each grid point.

The radars have different hardware and scanning strategies, and are operated by different agencies (RMI, Belgocontrol and Meteo-France), which inevitably leads to differences in the data processing. The Wideumont and Zaventem radars eliminate the non-meteorological echoes using standard Doppler filtering. The Jabbeke radar includes an additional clutter filtering that uses a fuzzy logic algorithm based on the dual-polarization moments (essentially the co-polar correlation coefficient, the texture of the differential reflectivity and the texture of the specific differential phase shift). A static ground clutter map and a statistical filter are used by Meteo-France to remove the non-meteorological echoes of the Avensois radar. The French radar data processing chain is described in Tabary (2007) and in Figueras i Ventura and Tabary (2013).

Since the Zaventem radar is mainly used for aviation applications, its
scanning strategy is optimized for the measurement of winds. Except for the
lowest elevation scan, a dual PRF mode (

All radars use the standard Marshall–Palmer relationship

STEPS forecasts were generated and verified for a set of sewer system overflow cases that affected the cities of Ghent and Leuven (see Table 1). The Ghent cases have a more stratiform character and occurred in late autumn and winter. On the other hand, the Leuven cases are more convective and occurred in summer months. A detailed climatology of convective storms in Belgium can be found in Goudenhoofdt and Delobbe (2009).

List of precipitation events that caused sewer system floods in Ghent and Leuven.

The Short-Term Ensemble Prediction System (STEPS) was jointly developed by the Australian Bureau of Meteorology (BOM) and the UK MetOffice (Bowler et al., 2006). STEPS forecasts are produced operationally at both weather services and are distributed to weather forecasters and a number of external users, in particular the hydrological services.

The key idea behind STEPS is to account for the unpredictable rainfall
growth and decay processes by adding stochastic perturbations to the
deterministic extrapolation of radar images (Seed, 2003). In order to be
effective, the stochastic perturbations need to reproduce important
statistical properties of both the precipitation fields and the forecast
errors:

spatial scaling of precipitation fields,

dynamic scaling of precipitation fields,

spatial correlation of the forecast errors,

temporal correlation of the forecast errors.

The

The stochastic perturbations should be able to reflect the properties of the
forecast errors. Generating spatially and temporally correlated forecast
errors is mandatory for hydrological applications, in particular when the
correlation length of the errors is comparable or superior to the size and
response time of the catchment.

The practical implementation of STEPS to reproduce these important properties
consists of the following steps (see Bowler et al., 2006; Foresti and Seed,
2014).

Estimation of the velocity field using optical flow on the last two radar rainfall images (Bowler et al., 2004a)

Decomposition of both rainfall fields into a multiplicative cascade using an FFT to isolate a set of eight spatial frequencies

Estimation of the rate of temporal evolution of rainfall features at each level of the cascade (Lagrangian auto-correlation)

Generation of a cascade of spatially correlated stochastic noise using as a filter the 1-D or 2-D power spectra of the last observed radar rainfall field. A Gaussian filter is used to isolate a given spatial frequency (see Foresti and Seed, 2014).

Stochastic perturbation of the rainfall cascade using the noise cascade (level by level)

Extrapolation of the cascade levels using a semi-Lagrangian advection scheme

Application of the AR(1) or AR(2) model for the temporal update of the cascade levels at each forecast lead time using the Lagrangian auto-correlations estimated in step (3)

Recomposition of the cascade into a rainfall field

Probability matching of the forecast rainfall field with the original observed field (Ebert, 2001)

Computation of the forecast rainfall accumulations from the instant forecast rainfall rates. This procedure is known as advection correction and consists of advecting the instant rainfall rate forward over the 5 min period by discretizing the advection into smaller time steps.

Web platform of STEPS-BE showing the ensemble mean forecast.

Bowler et al. (2006) introduced a general framework for blending a radar-based extrapolation nowcast with one or more outputs of downscaled NWP models (see also Pierce et al., 2010; Seed et al., 2013). Because of being designed for urban applications, the maximum lead time of STEPS-BE is restricted to 2 h. The operational NWP model of RMI (ALARO) runs only four times daily using a grid spacing of 4 km. Considering the model spin-up time and the absence of radar data assimilation, it is very unlikely that ALARO provides useful skill for blending its output with a radar-based extrapolation nowcast within the considered nowcasting range. It must also be remembered that the effective resolution of NWP models is much larger than the grid spacing. For instance, Grasso (2000) estimates the effective resolution to be at least 4 times the grid spacing, while Skamarock (2004) estimates it to be up to 7 times the grid spacing. ALARO would then only be able to resolve features that are greater than 20 km. For all these reasons, STEPS-BE only involves an extrapolation nowcast without NWP blending.

The STEPS-BE forecast domain is smaller than the extent of the 4 C-band
radars composite (see Fig. 1). The radar field was upscaled from the original
resolution of 500 m to 1 km and a sub-region of

STEPS-BE includes a couple of improvements compared with the original
implementation of the BOM:

kernel interpolation of optical flow vectors;

generation of stochastic noise only within the advected radar mask.

The optical flow algorithm of Bowler et al. (2004a) estimates the velocity
field by dividing the radar domain into a series of blocks within which the
optical flow equation is solved. The minimization of the field divergence is
only performed at the level of the block, which leaves sharp discontinuities
in the velocity field between the blocks. In order to overcome this issue, a
Gaussian kernel smoothing was applied to interpolate the velocity vectors
located at the center of the blocks onto the fine radar grid. The bandwidth
of the Gaussian kernel was chosen to be

In STEPS-BE the 1-D power spectrum of the last observed rainfall field is used as a filter to generate the spatially correlated stochastic perturbations. The PS is parameterized using two spectral slopes to account for a scaling break that is often observed at the wavelength of 40 km (see Seed et al., 2013; Foresti and Seed, 2014). To simplify the computations, an auto-regressive model of order 1 (AR(1)) was employed for imposing the temporal correlations and to model the growth of forecast errors.

The original STEPS implementation (Bowler et al., 2006) was designed to blend the radar extrapolation nowcasts with the output of NWP models. However, the domain covered by the radars is smaller than the rectangular domain of the NWP model and small amounts of stochastic noise are generated by default also outside the radar composite. This setting was not adapted for radar-based nowcasts without NWP blending and needed some adaptation. In fact, when advecting the radar mask over several time steps, large areas with small amounts of stochastic rain appear outside the validity domain of the forecast and perturb the probability matching. In STEPS-BE the stochastic perturbations are only generated within the advected radar domain and set to zero elsewhere.

STEPS-BE can also account for the uncertainty in the estimation of the
velocity field. The STEPS version that is implemented in the UK (Bowler et
al., 2006) includes a detailed procedure to generate velocity perturbations
that reproduce various statistical properties of the differences between the
forecast velocity and the actual future diagnosed velocity (see details in
Bowler et al., 2004b). In the BOM and RMI implementations a simpler procedure
is applied. The diagnosed velocity field is multiplied by a single factor

The BOM and RMI versions of STEPS also include a stochastic model for the radar measurement error, a broken-line model to account for the unknown future evolution of the mean areal rainfall and the possibility to use time-lagged ensembles. However, a nowcasting model with too many components is harder to calibrate and complicates the interpretation of the forecast fields. Because of these reasons, STEPS-BE only exploits the basic stochastic model for the velocity field and for the evolution of rainfall fields (due to growth and decay processes).

The core of STEPS and the STEPS-BE extensions are implemented in C/C

The python matplotlib library is used to read the netCDF files, export the PNG figures and the time series of observed and forecast rainfall at the location of major cities and weather stations. A single STEPS nowcast generates more than 600 figures, which takes a significant fraction of the total computational time. In order to optimize the timing, a bash script monitors continuously the directory with incoming radar composites and triggers STEPS-BE once a field with a new time stamp is found. All these implementation details ensure that the user/forecaster can have access to an ensemble STEPS nowcast in less than 5 min after receiving the radar composite image.

The visualization system of STEPS-BE is very similar to the one of INCA-BE, the local Belgian implementation of the Integrated Nowcasting through Comprehensive Analysis system (INCA, Haiden et al., 2011) developed at the Austrian weather office (ZAMG). Figure 1 illustrates the web interface with an example of an ensemble mean nowcast. The user can highlight the major cities, weather stations and click to visualize the time series of observed and forecast precipitation/probability, which appears at the bottom of the web page. The navigation through the observations and forecast lead times is facilitated by the scroll wheel of the mouse. On the other hand, by clicking on the image it is possible to easily scroll through the various ensemble members or probability levels for a given lead time. Scrolling the ensemble members at different lead times is very instructive and can make the user aware of the forecast uncertainty. In fact, at a lead time of 5 min the ensemble members agree very well on the intensity and location of precipitation. This means that the ensemble spread is small and the probabilistic forecast is sharp; i.e., most of the forecast probabilities are close to 1 or 0 (see an explanation in Appendix A). On the other hand, at 1 or 2 h lead time the ensemble members disagree on the location and intensity of rainfall, which enhances the ensemble spread and decreases the sharpness of the probabilistic forecast. The web page includes extensive documentation to guide the user and a set of case studies to help understanding the strengths and limitations of STEPS. The visualization system was implemented with great attention to take full advantage of the multi-dimensional information content of probabilistic and ensemble forecasts.

Average forecast and observed rainfall accumulations for the Ghent
cases.

Average observed and forecast rainfall accumulations for the
Leuven cases.

Average 0–30 min multiplicative forecast biases for the Ghent
cases on

Average 0–30 min forecast RMSE for

This section presents the verification of STEPS-BE forecasts using a set of case studies (see Sect. 2). The accumulated radar observations were employed as reference for the verification. The rainfall rates are accumulated over the last 5 min by reversing the field vectors based on the observations and then performing the advection correction. The 30 min ensemble mean forecast was verified against the observed 30 min radar accumulations using both continuous and categorical verification scores. The deterministic verification procedure follows the one presented in Foresti and Seed (2015), which was designed to analyze the spatial distribution of the forecast errors. More details about the forecast verification setup and scores are given in Appendix A.

The continuous scores include the multiplicative bias and the root mean
squared error (RMSE), while the categorical scores include the probability of
detection (POD), false alarm ratio (FAR) and Gilbert skill score (GSS)
derived from the contingency table for rainfall thresholds of 0.5 and
5.0 mm h

Figures 2 and 3 show the average forecast and observed rainfall rates corresponding to the 0–30 min ensemble mean accumulation nowcast for the Ghent and Leuven cases, respectively. In other words, they represent the average forecast and observed rainfall rates over the duration of the precipitation event (for the 0–30 min lead time). The average was computed using the weak conditional principle explained above.

The average forecast and observed accumulations generally agree very well for the 0–30 min lead time forecast. The Ghent case on 10 November 2013 (Fig. 2a and b) is the only one with northwesterly flows and is characterized by the lowest average rainfall rates. The Avesnois radar demonstrates very well the range dependence of the average rainfall rates, which gradually decrease with increasing distance from the radar. On the contrary, the smaller ring of high rainfall rates around the Zaventem radar is mostly due to the bright band (Fig. 2b).

The bright band effect influences the radar observations and hence the nowcasts based on their extrapolation. At longer lead times the larger rainfall estimates due to the bright band are extrapolated far from the location of the radar. The stochastic perturbations of STEPS can help to gradually dissolve the circular patterns introduced by the bright band effect. However, the bright band affects more the observations used for the verification, in particular when the rainfall is advected from upstream over the radar region. In such a case, the local larger rainfall estimates lead to a verification bias and the forecasts are wrongly supposed of rainfall underestimation. In spite of these issues, bright band effects might not be so important for urban hydrological applications. In fact, except for one stratiform case presented in this paper, pluvial floods mainly happen in summer with convective precipitation events, during which the bright band is absent or negligible.

The Ghent case on 3 January 2014 has higher rainfall rates and the elongated structures of precipitation areas demonstrate well the southwesterly flow regime (Fig. 2c and d). For this case the measurements of the Zaventem radar are also affected by second trip echoes, which appear as a set of radially oriented rainfall structures northwest of the radar. These alternating patterns are explained by the dual PRF mode of Zaventem (see Sect. 2).

The Leuven cases on 9 June and 19–20 July 2014 have an important convective activity (Fig. 3a–d). The maximum average rainfall rates are located over the Ardennes mountain range and the city of Leuven, respectively. Since urban flash floods can be triggered by a single convective cell, the average rainfall rate over the duration of the event may not be as high in the considered city (e.g., Fig. 3b).

Figure 4 illustrates the multiplicative bias of the 0–30 min nowcast
averaged over each of the four events. A detailed interpretation of such
forecast biases using Australian radar data and their connection to
orographic features is given in Foresti and Seed (2015), which point out that
an important fraction of the forecast errors is caused by the biases of the
verifying radar observations rather than systematic rainfall growth and decay
processes due to orography. In Fig. 4a it is easy to notice the effect of
bright band, which causes a series of systematic forecast biases around the
Zaventem radar and perpendicularly oriented with respect to the prevailing
flow direction (NW). Systematic rainfall underestimation occurs along the
Belgian coast of the North Sea. One factor which contributes to this
underestimation is the absence of visibility of the radar at longer ranges.
The incoming precipitation is suddenly detected by the radar and therefore
strongly underestimated by STEPS. The only situation where the range
dependence of the rainfall estimation does not affect the forecast
verification occurs when the velocity field is perfectly rotational and
centered on the radar (assuming no beam blockage). All the other cases have
to deal with the fact that the rainfall nowcast also extrapolates the biases
of the radar observations! Contrary to expectation, on the upwind side of the
Ardennes there is overestimation, which may depict a region of systematic
rainfall decay. The bias over the city of Ghent is fortunately small and is
included in the range from 0 to

In Fig. 4b the systematic underestimation is also located upstream with
respect to the prevailing winds (SW). The strong overestimations in Germany
and the Netherlands are mostly due to the underestimation of rainfall by the
verifying radar observations rather than caused by systematic rainfall decay.
This is particularly visible after a range of 125 km from the Zaventem
radar, which demonstrates again that discontinuities and biases in the radar
observations lead to biases in the extrapolation forecast. Also in this case
the bias over the city of Ghent is small but in the range from

Reliability diagrams for the Ghent case on 3 January 2014 relative to
the probabilistic forecast of exceeding

Figure 5 shows the spatial distribution of the RMSE for the stratiform event
on 3 January 2014 in Ghent and the convective event on 19–20 July 2014 in
Leuven. If compared with Figs. 2d and 3d it is clear that the RMSE is
strongly correlated with the regions having the highest mean rainfall
accumulations (proportional effect). Thus, it is not surprising that the
RMSE of the convective case (Fig. 5b) displays values exceeding 10 mm h

Figure 6 illustrates an example of categorical verification of the
30–60 min ensemble mean forecast for the Leuven case on 19–20 July 2014
relative to the rainfall threshold of 0.5 mm h

ROC curves relative to the probabilistic forecast of exceeding 0.1 mm h

Figure 7 shows the reliability diagrams relative to the probabilistic
forecast of exceeding the 0.5 and 5.0 mm h

Figure 8 illustrates the ROC curves relative to the probabilistic forecast of
exceeding 0.1 mm h

Comparison of ensemble spread and RMSE of the ensemble mean
forecast at 5 min resolution for

Rank histograms for the Leuven case on 19–20 July 2014 for a lead
time of

Figure 9 compares the error of the ensemble mean (RMSE) and the ensemble spread for the Ghent case on 3 January 2014 and the Leuven case on 19–20 July 2014 (see interpretation of ensemble spread in Appendix A). In both cases the RMSE increases up to a lead time of 50–60 min and then starts a slow decrease, which can be counter-intuitive. However, it must be remembered that the ensemble mean forecast becomes smoother for increasing lead times, which reduces the double penalty error due to forecasting a thunderstorm at the wrong location. The ensemble spread also increases up to 50–60 min lead time and then slowly stabilizes. For both the Ghent and Leuven cases the ensemble spread is lower than the error of the ensemble mean at all lead times, which suggests that the ensemble forecasts are under-dispersive. The degree of under-dispersion is highest at a lead time of 5 min, with the spread values being equal to 60 % of the forecast error for the winter event in Ghent (Fig. 9a) and 70 % for the summer event in Leuven (Fig. 9b). Except for the 5 min lead time, the ensemble spread represents 75–90 % of the forecast error for the Ghent case (Fig. 9a) and 75–80 % for the Leuven case (Fig. 9b), which is a good result. It is not yet clear why the RMSE at a lead time of 5 min is higher than the one at 10 min for the winter case in Ghent (Fig. 9a).

The underestimation of the ensemble dispersion at the first lead time can be
attributed to both the underestimation of the ensemble spread and the
overestimation of the ensemble mean RMSE, but with different degrees
according to the different causes. High RMSEs at the start of the nowcast can
be due to using a very smooth velocity field for the advection (see
Sect. 3.2), which does not exploit sufficiently the very short-term
predictability of small-scale precipitation features, but is optimized for
predictions at longer lead times. Another explanation for this
underestimation of ensemble dispersion could be due to the space–time
variability of the Z–R relationship. Spatial and temporal changes in the
drop size distribution (DSD) can lead to changes in the estimated rainfall
rate that is used for the verification. Therefore, there could be a mismatch
between the “fixed” DSD of the forecasts and the variable DSD underlying
the verifying observations. Another possible source of mismatch could be due
to the advection correction with optical flow when computing the rainfall
accumulations. The forecast accumulations are computed by advecting the
previous rainfall field forward. On the other hand, the observed
accumulations are computed by reversing the optical flow vectors and
advecting the rainfall field backwards (see Sect. 4.1). This choice increases
the differences when comparing the

Summary of the forecast verification scores of the next four 30 min accumulation forecasts for the precipitation events in Ghent and Leuven. The lead time shown is the end of the 30 min accumulation period (e.g., 60 min is relative to the 30–60 min accumulation). The bias values correspond to the standard deviation of the multiplicative bias, which is more interesting than its mean (often close to 0).

Figure 10 illustrates the rank histograms for the Leuven case on
19–20 July 2014 for lead times of 5 and 60 min. The U-shape of the rank
histograms demonstrates again a certain degree of ensemble under-dispersion.
In particular, all the ensemble members for the 5 min lead time are inferior
to the observations in

Table 2 provides a comparison of the verification scores for each event. The average standard deviation of the multiplicative biases of the 30 min lead time forecast is in the range 0.3–0.8. Except for the event on 19–20 July 2014 the biases remain well below 1 dB for all lead times, which is a positive result. Of course, these are average values, and locally they can even exceed 3 dB (see Fig. 4).

On the other hand, the RMSE values mark more the distinction between the two
winter cases in Ghent and the two summer cases in Leuven. For the winter
cases the RMSE values increase from 0.38–0.95 at a lead time of 30 min to
0.78–1.48 at 120 min, while for the summer cases from 1.84–2.45 to 2.52–3.38 mm h

The probability of detection relative to the 0.5 mm h

The areas under the ROC curve values characterizing the potential
discrimination power of the probabilistic forecast of exceeding
0.5 mm h

From all these results we can conclude that there is not much predictability beyond 2 h lead time by extrapolating the 4 C-band composite radar images in Belgium. Therefore, a maximum lead time of 2 h in STEPS-BE is a good choice. Extending this lead time requires blending the radar-extrapolation nowcast with the output of NWP models to increase the predictability of precipitation.

The Short-Term Ensemble Prediction System (STEPS) is a probabilistic nowcasting system based on the extrapolation of radar images developed at the Australian Bureau of Meteorology in collaboration with the UK MetOffice. The principle behind STEPS is to produce an ensemble forecast by perturbing a deterministic extrapolation nowcast with stochastic noise. The perturbations are designed to reproduce the spatial and temporal correlations of the forecast errors and the scale dependence of the predictability of precipitation.

This paper presented the local implementation, adaptation and verification of STEPS at the Royal Meteorological Institute of Belgium, referred to as STEPS-BE. STEPS-BE produces in real-time 20-member ensemble nowcasts at 1 km and 5 min resolutions up to 2 h lead time using the four C-band radar composite of Belgium. Compared with the original implementation, STEPS-BE includes a kernel-based interpolation of optical flow vectors to obtain smoother velocity fields and an improvement to generate stochastic noise only within the advected radar composite to respect the validity domain of the nowcasts.

The performance of STEPS-BE was verified using the radar observations as
reference on four case studies that caused sewer system floods in the cities
of Ghent and Leuven during the years 2013 and 2014. The ensemble mean
forecast of the next four 30 min accumulations was verified using the
multiplicative bias, the RMSE as well as some categorical scores derived from
the contingency table: the probability of detection, false alarm ratio and
Gilbert skill score (equitable skill score). The spatial distribution of
multiplicative biases revealed regions of systematic over- and
under-estimation by STEPS. The underestimations are often associated with the
locations of convective initiation and thunderstorm growth, which cannot be
predicted by STEPS. On the other hand, the regions of overestimation are
mostly due to the underestimation of rainfall by the verifying observations
rather than systematic rainfall decay (see Foresti and Seed, 2015, for a more
detailed discussion). In order to disentangle the forecast and observation
biases, detailed knowledge about the spatial distribution of the radar
measurement errors for a given weather situation is needed. The
multiplicative biases over the cities of Leuven and Ghent are very low (from

The current contribution focused on the verification of STEPS-BE nowcasts using only four precipitation cases of different character. The deterministic and categorical verifications require many more cases to analyze the climatological distribution of the forecast errors, e.g., as done in Foresti and Seed (2015). On the other hand, the probabilistic and ensemble verification pools the data in both space and time and converges much faster to stable statistics. From a research perspective, STEPS-BE could also be extended by including a stochastic model to account for the residual radar measurement errors, in particular to obtain more accurate estimations of the forecast uncertainty at short range. The STEPS framework also allows blending of the extrapolation nowcast with the output of NWP models, which is a necessary step to increase the predictability of precipitation for lead times beyond 2 h.

Correspondence between the decibel scale and the power ratio.

Forecast verification is an important aspect of a forecasting system. A
forecast without an estimation of its accuracy is not very informative. For
an in-depth description of forecast verification science and corresponding
scores we refer to Jolliffe and Stephenson (2011) and the verification
website maintained at the Bureau of Meteorology (

The STEPS

Multiplicative bias:

Root mean square error:

Contingency table of a dichotomous (yes/no) forecast, see Table A2, where the “hits” is the number of times that both the observation and the forecast exceed a given rainfall threshold (at a given grid point), the “false alarms” is the number of times that the forecast exceeds the threshold but the observation does not, the “misses” is the number of times that the forecast does not exceed the threshold but the observation does and the “correct negatives” is the number of times that both the observation and the forecast do not exceed the threshold.

Contingency table of a categorical forecast.

Different scores can be derived from the contingency table to characterize a
particular feature or skill of the forecasting system:

Probability of detection (hit rate):

The “POD” characterizes the fraction of observed events that were correctly forecast and is also known as the hit rate (HR).

False alarm ratio:

False alarm rate:

The false alarm rate

Gilbert skill score (equitable threat score):

The accuracy of probabilistic forecasts can be verified in various ways. In this paper we employ the reliability diagram and the Receiver Operating Characteristic (ROC) curve. The reliability diagram compares the forecast probability with the observed frequency. Reliability characterizes the agreement between the forecast probability and observed frequency. For a reliable forecasting system the two values should be the same, which happens for example when we observe rain 80 % of the time when it is forecast with 80 % probability (in average, diagonal line of Fig. 8). Unreliable forecasts exhibit departures from this optimum (bias). Resolution characterizes the ability of the forecasts to categorize the observed frequencies into distinct classes. The complete lack of resolution occurs when the forecast probabilities are completely unable to distinguish the observed frequencies, which generally corresponds to the climatological frequency of exceeding a given precipitation threshold (horizontal dashed line in Fig. 8). The Brier skill score (BSS) characterizes the relative accuracy of the probabilistic forecast compared to a reference system (see Jolliffe and Stephenson, 2011). Although the climatology or sample climatology of the event is often used as a reference, the BSS can also be computed against other reference forecasts, e.g., another probabilistic forecasting method or even a deterministic forecasting method treated as a probabilistic binary forecast. However, in such cases it is not possible to draw a unique horizontal line representing complete lack of skill in Fig. 8. The region where the probabilistic forecast has a positive BSS, i.e., it is better than the climatological frequency, is grayed out. In fact, the points located below the no skill line are closer to the climatological frequency and produce a negative BSS. Reliability diagrams usually contain the histogram of the forecast probabilities to analyze the sharpness of the forecasts (small inset in Fig. 8). Sharpness characterizes the ability to forecast probabilities that are different from the reference forecast. Sharp forecasting systems are “confident” about their predictions and give many probabilities around one and zero.

The ROC curve is used to analyze the discrimination power of a probabilistic forecast of exceeding a given rainfall threshold. It is constructed by plotting the hit rates and false alarm rates evaluated at increasing probability thresholds to make the binary decision whether it will rain or not. The ROC curve of a random probabilistic forecast system lies on the diagonal where the hit rate equals the false alarm rate (no skill): the forecast probabilities do not have discrimination power. When the false alarm rate is higher than the hit rate the forecast is worse than that obtained by random chance (below the diagonal). A skilled forecasting system is observed when the hit rates are higher than the false alarm rates, which draws a characteristic curve. The area under the ROC curve (AUC) measures the discrimination power of the probabilistic forecasts, with a maximum value of 1 (100 % of hits and 0 % of false alarms) and a minimum value of 0.5 for a random forecasting system. Values below 0.5 denote a forecasting system that performs worse than random chance. The AUC is computed by integrating over all the trapezoids that can be drawn below the ROC curve. The AUC is not sensitive to the forecast bias and the reliability of the forecast could still be improved through calibration. For this reason the AUC is only a measure of potential skill.

The ensemble forecasts are verified to detect whether there is over- or
under-dispersion. It is common practice to compare the “skill” (error) of
the ensemble mean with the ensemble spread (Whitaker and Loughe, 1998;
Foresti et al., 2015):

Another way to analyze the spread of ensemble forecasts is based on rank
histograms (also known as a Talagrand diagram). First, the precipitation
values of the ensemble members are ranked in increasing order. Then, the rank
of the observation is evaluated by checking in which of the

This research was funded by Belgian Science Policy Office (BelSPO) project PLURISK: Forecasting and management of rainfall-induced risks in the urban environment (SD/RI/01A). We thank Clive Pierce for the detailed discussions about the STEPS implementation and the guidance for various code improvements. We also acknowledge Meteo-France for providing the Avesnois data. Edited by: R. Uijlenhoet