This study investigates the utilization of hydrological information in regional flood frequency analysis (RFFA) to enforce desired properties for a group of gauged stations. Neighbourhoods are particular types of regions that are centred on target locations. A challenge for using neighbourhoods in RFFA is that hydrological information is not available at target locations and cannot be completely replaced by the available physiographical information. Instead of using the available physiographic characteristics to define the centre of a target location, this study proposes to introduce estimates of reference hydrological variables to ensure a better homogeneity. These reference variables represent nonlinear relations with the site characteristics obtained by projection pursuit regression, a nonparametric regression method. The resulting neighbourhoods are investigated in combination with commonly used regional models: the index-flood model and regression-based models. The complete approach is illustrated in a real-world case study with gauged sites from the southern part of the province of Québec, Canada, and is compared with the traditional approaches such as region of influence and canonical correlation analysis. The evaluation focuses on the neighbourhood properties as well as prediction performances, with special attention devoted to problematic stations. Results show clear improvements in neighbourhood definitions and quantile estimates.

Accurate estimates of the risk of occurrence of extreme hydrological events are necessary for the minimization of the impacts of these events and for the optimal design and management of water resource systems. However, the necessary information is not always available at the sites of interest. It is hence necessary to develop procedures to transfer, or to regionalize, the information available at existing gauged sites to the ungauged ones. Regional flood frequency analysis (RFFA) represents a large class of techniques commonly used in water sciences to evaluate the risk of occurrence of extreme hydrological phenomena of rare magnitudes at ungauged locations (Haddad and Rahman, 2012; Hosking and Wallis, 1997; Laio et al., 2011; Pandey, 1998; Reis et al., 2005).

RFFA methods are usually composed of two main steps. The first step is the formation of homogenous regions. This step aims at pooling together sites that are approximately similar according to homogeneity criteria. Inside these homogenous regions, it is assumed that hydrological information can be reasonably transferred from gauged to ungauged locations (Cunnane, 1988). The second step, the estimation of flood quantiles, consists in the calibration of a regional model that characterizes the interrelation between hydrological variables of interest and explanatory physio-meteorological variables corresponding to known site characteristics. Consequently, RFFA is used to study unobserved hydrological behaviour from available hydrological and physio-meteorological information.

Neighbourhoods are specific forms of regions that are not composed of a fixed set of stations, but are rather composed of the gauged sites that are the most similar to a given target site. Hence, two distinct target locations will have their own distinct neighbourhoods, which may overlap. Comparative studies have shown that neighbourhoods lead to better regional estimates than fixed regions (Burn, 1990; Ouarda et al., 2008; Tasker et al., 1996). To identify the most similar gauged sites in terms of hydrological properties, a notion of distance is needed. It makes it possible to evaluate the proximity, or relevance, of each gauged site to the target location and to identify the most hydrologically similar gauged sites. However, when the target location is ungauged, this distance cannot be directly calculated due to the missing hydrological information. Physio-meteorological information is hence used for similarity evaluation. The traditional approach, based on the distance between site characteristics, is commonly referred to as the region of influence (ROI) model (Burn, 1990), which has received particular attention in the hydrological literature. The focus was mainly on the estimation of the model parameters where, for instance, generalized least squares were used to account for unequal variability in the at-site estimations (e.g. Griffis and Stedinger, 2007; Stedinger and Tasker, 1985) and to deal with the presence of spatial correlation (e.g. Kjeldsen and Jones, 2009).

Alternatively, Ouarda et al. (2001) used canonical correlation analysis (CCA) to build neighbourhoods from a canonical distance that accounts for the interrelation between flood quantiles and site characteristics. For this method, neighbourhoods are formed by the gauged sites that are the most similar to the target location, according to the distance between vectors of flood quantiles corresponding to different return periods. The CCA method in RFFA estimates the unavailable hydrological variables as linear combinations of site characteristics. Consequently, the available site characteristics are transformed into more meaningful hydrological quantities for the purpose of delineating neighbourhoods. However, the CCA method suffers from some limitations, such as linearity and normality assumptions (He et al., 2011). Subsequent studies have sought to improve the CCA method by refining the CCA technique itself (Chebana and Ouarda, 2008; Ouali et al., 2015). Nevertheless, little attention has been paid to the importance of properly choosing the hydrological quantities in the delineation step, whereas much effort has been devoted to the modelling step. Indeed, Chebana and Ouarda (2008) employed an iterative linear procedure to estimate neighbourhood centres and showed that the quality of these centres' estimates is a crucial element in improving the final model performance.

The present study aims to provide a general framework with more flexibility regarding the linearity and normality assumptions. This is achieved by replacing CCA in the prior analysis of hydrological variables by projection pursuit regression (PPR), a nonparametric regression method recently considered as an estimation model in RFFA (Durocher et al., 2015). The present study is also interested in assessing the advantages of employing hydrological variables other than the at-site flood quantiles in prior modelling as well as considering a combination of these hydrological variables with site characteristics.

L-moments have already been used in RFFA to test the homogeneity of fixed regions when the target site is gauged (Chebana and Ouarda, 2007; Hosking and Wallis, 1997). In the present study, the prediction of the L-moments at ungauged sites is also considered to improve the delineation of the neighbourhoods by reducing uncertainties. Moreover, a conceptual advantage of using L-moments rather than at-site flood quantiles is that the L-moments do not depend on the subjective selection of at-site distributions.
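As a point of reference, the sample L-moments and L-moment ratios of a flood series can be computed directly from the ordered observations, without choosing a distribution. The sketch below uses the standard unbiased probability-weighted moment estimators; the function name and the example values are illustrative assumptions, not taken from the study.

```python
def sample_l_moments(values):
    """Return (L1, LCV, LSK, LKT) of a sample.

    Uses the unbiased probability-weighted moments b0..b3 of the
    ordered sample, then the usual linear combinations.
    """
    x = sorted(values)
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are needed")
    # i is the 0-based rank, so x[i] is the (i + 1)-th order statistic.
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    b3 = sum(i * (i - 1) * (i - 2) * x[i] for i in range(n)) / (
        n * (n - 1) * (n - 2) * (n - 3)
    )
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    # Ratios: L coefficient of variation, skewness and kurtosis.
    return l1, l2 / l1, l3 / l2, l4 / l2


# Hypothetical, perfectly symmetric sample: LSK and LKT vanish here.
l1, lcv, lsk, lkt = sample_l_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because only the ranked observations enter the computation, the resulting values do not depend on any fitted at-site distribution.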

The present paper is organized as follows. Section 2 presents the background material for the techniques used in the present research. Section 3 elaborates on the prior analysis of hydrological variables and their integration with the techniques presented in Sect. 2 to form a complete procedure. Section 3 also suggests criteria for the evaluation of the predictive performances and the neighbourhood properties. Section 4 illustrates the application of the method in a case study. Traditional ROI and CCA methods serve as references in order to evaluate the relative performance of the investigated method. Finally, concluding remarks are provided in Sect. 6.

In RFFA, neighbourhoods are used to identify gauged sites from which
information is transferred to the target location. A neighbourhood is
characterized by a centre and a radius that delimits an area (not necessarily
in the geographical sense). Gauged sites inside the area delineate a region
that includes relevant sites to the target location. At each site

Alternatively, CCA is a multivariate technique used to unveil the
interrelation between two groups of variables. Let

To delineate neighbourhoods, the CCA approach considers the canonical scores

In RFFA, two types of regional models are often considered to predict flood quantiles corresponding to given return periods: the index-flood model and the regression-based model (Ouarda et al., 2008). The index-flood model predicts a target distribution by assuming that all distributions inside the region are proportional to a regional distribution, up to a scale factor called index flood. The flood quantile of interest at a target location is then calculated from the regional distribution based on the predicted index flood (e.g. Chebana and Ouarda, 2009; Dalrymple, 1960; Stedinger and Lu, 1995). Conversely, the regression-based model considers directly the at-site estimates of the desired flood quantiles for prediction. Flood quantiles are then predicted at their target locations by the regression equations estimated within the neighbourhoods (Pandey and Nguyen, 1999).
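The proportionality assumption behind the index-flood model can be sketched in a few lines. The example below assumes, purely for illustration, a GEV regional growth curve in Hosking's parameterization; the parameter values and function names are hypothetical, and the family actually retained in a study would be chosen from regional information.

```python
import math


def gev_growth_quantile(return_period, xi, alpha, k):
    """Quantile of a dimensionless GEV growth curve (Hosking's form)."""
    prob = 1.0 - 1.0 / return_period  # non-exceedance probability
    y = -math.log(prob)
    if abs(k) < 1e-9:
        # Gumbel limit of the GEV when the shape parameter tends to 0.
        return xi - alpha * math.log(y)
    return xi + alpha * (1.0 - y ** k) / k


def index_flood_quantile(index_flood, return_period, xi, alpha, k):
    """Target quantile = index flood (scale factor) x regional growth curve."""
    return index_flood * gev_growth_quantile(return_period, xi, alpha, k)


# Hypothetical regional parameters and predicted index flood.
q100 = index_flood_quantile(250.0, 100, xi=0.85, alpha=0.3, k=-0.1)
q10 = index_flood_quantile(250.0, 10, xi=0.85, alpha=0.3, k=-0.1)
```

In this form, all of the site-specific information enters through the single scale factor `index_flood`, which is what makes the transfer to an ungauged location possible once that factor is predicted.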

Even though they proceed differently, both the index-flood model and the
regression-based model may use the same multiple regression techniques to
transfer information to an ungauged location. For the sake of simplicity, the
term hydrological variables is used to designate the corresponding output
variables

Multiple regression models assume linear interrelation between the
hydrological variable

In line with previous notations, let

Some methods predict hydrological variables without the formation of regions, such as physiographical kriging (Castiglioni et al., 2009; Chokmani and Ouarda, 2004), generalized additive models (Chebana et al., 2014) and artificial neural networks (Dawson et al., 2006; Ouarda and Shu, 2009). More recently, PPR was introduced to provide a flexible nonparametric regression approach to describe the nonlinearity that is present in the relationship between hydrological variables and site characteristics. PPR was used in the RFFA context by Durocher et al. (2015) to directly predict flood quantiles without delineation.
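For orientation, a PPR model in its standard textbook form expresses the response as a sum of smooth univariate functions of linear projections of the predictors. The notation below is an assumption made for this illustration and may differ from the symbols used elsewhere in the paper:

```latex
y = \mu + \sum_{j=1}^{m} g_j\!\left(\boldsymbol{\alpha}_j^{\top}\mathbf{x}\right) + \varepsilon,
\qquad \lVert \boldsymbol{\alpha}_j \rVert = 1,
```

where $\mathbf{x}$ is the vector of site characteristics, $\boldsymbol{\alpha}_j$ are direction vectors, $g_j$ are smooth functions estimated from the data, $m$ is the number of terms and $\varepsilon$ is an error term. With $m = 1$ and a linear $g_1$, the model reduces to ordinary multiple regression.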

The basic elements of a PPR model are

The components

The present study deals with neighbourhood delineation and focuses more precisely on the identification of reliable estimates of the hydrological centres of these neighbourhoods. For the sake of simplicity, the variables forming these centres will be referred to as reference variables, because they represent the reference used to evaluate the similarity between a target location and the gauged sites. Reference variables can take different forms, such as site characteristics, hydrological variables or a combination of both. Their nature is important, because it determines the properties that are deemed to be important between close sites. The particularity of the present method is that PPR can be used to predict these neighbourhood centres (prior to the RFFA modelling step) when some of the reference variables are unknown hydrological variables. Accordingly, the proposed method will be referred to as RVNs for reference variable neighbourhoods.

The general procedure can be described by the steps below:

Select the reference variables.

If necessary, predict the reference variables that are not available at the target site.

Compute the distance between sites.

Form the neighbourhood based on the previous distance.

Fit a regional model on the neighbourhood.

Predict the target site and evaluate a performance criterion.
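Steps 3 to 6 can be sketched as follows for a single target site. This is a simplified illustration under stated assumptions: the distance is Euclidean on already standardized reference variables, the neighbourhood has a fixed size, and the regional model is reduced to a log-linear regression of the flood quantile on drainage area. All names and data are hypothetical.

```python
import math


def nearest_sites(target_ref, gauged_refs, size):
    """Steps 3-4: indices of the `size` gauged sites closest to the target."""
    dist = [math.dist(target_ref, ref) for ref in gauged_refs]
    return sorted(range(len(gauged_refs)), key=dist.__getitem__)[:size]


def fit_log_linear(areas, quantiles):
    """Step 5: least-squares fit of log(Q) = a + b * log(area)."""
    x = [math.log(v) for v in areas]
    y = [math.log(v) for v in quantiles]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def predict_target(target_ref, target_area, gauged_refs, areas, quantiles, size):
    """Step 6: predict the quantile at the target from its neighbourhood."""
    idx = nearest_sites(target_ref, gauged_refs, size)
    a, b = fit_log_linear([areas[i] for i in idx], [quantiles[i] for i in idx])
    return math.exp(a + b * math.log(target_area))


# Hypothetical standardized reference variables and site data.
refs = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (3.0, 3.0), (0.3, -0.1)]
areas = [100.0, 250.0, 400.0, 900.0, 150.0]
quantiles = [2.0 * a ** 0.8 for a in areas]  # synthetic power law
q_hat = predict_target((0.0, 0.0), 200.0, refs, areas, quantiles, size=3)
```

When some reference variables are hydrological and unknown at the target, `target_ref` would hold their predicted values from step 2 rather than observed ones.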

Diagram of the RVN method using backward step-wise selection.

In step 1, the selection of the set of reference variables can be subjective and depends on the problem at hand. In the present study, a backward step-wise selection procedure is considered to remove, from an initial set of reference variables, those that do not contribute to the predictive power of the model. This selection procedure is more objective and depends on a performance criterion. In the present study, the relative root mean square error (RRMSE) criterion is chosen for this purpose and will be described in Sect. 3.2. The backward step-wise selection is illustrated in Fig. 1 and consists in temporarily removing each reference variable in turn from the model and performing the remaining steps (2–6) in order to compute the RRMSE. The reference variable whose removal leads to the best RRMSE is then permanently removed. The process is repeated until no reference variable can be removed without deteriorating the RRMSE.
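The backward step-wise selection can be sketched generically as below. The `score` callable stands for the whole chain of steps 2 to 6 followed by the computation of the RRMSE, and is a hypothetical placeholder; the stopping rule (stop as soon as every removal worsens the score) is one reading of the procedure, stated here as an assumption.

```python
def backward_selection(variables, score):
    """Greedy backward elimination.

    `score(subset)` must return the criterion to minimize (e.g. RRMSE)
    for a list of reference variables.
    """
    selected = list(variables)
    best = score(selected)
    while len(selected) > 1:
        # Temporarily drop each variable in turn and score what remains.
        trials = [
            (score([v for v in selected if v != drop]), drop)
            for drop in selected
        ]
        trial_score, drop = min(trials)
        if trial_score > best:
            break  # every removal deteriorates the criterion
        selected.remove(drop)
        best = trial_score
    return selected, best


def toy_score(subset):
    """Made-up criterion: penalizes a noisy variable and missing useful ones."""
    useful = {"L1", "LCV", "LSK"}
    missing = len(useful - set(subset))
    return 40.0 + 2.0 * ("LKT" in subset) + 5.0 * missing


kept, value = backward_selection(["L1", "LCV", "LSK", "LKT"], toy_score)
```

The toy criterion is invented solely to exercise the loop; in practice each call to `score` involves a full cross-validation, so the cost grows with the square of the number of candidate variables.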

Step 2 is required only if some reference variables are unknown at the target
sites. Otherwise, if we designate the target location by

If certain hydrological information is unavailable at the target location,
the estimation of the hydrological reference variables is necessary to
produce an estimate

If the hydrological variables

The errors related to the prediction of the hydrological reference variables suggest that the RVN method may include an additional source of uncertainty. However, the same source of uncertainty is also present among the sites of a neighbourhood delineated on the basis of the site characteristics (i.e. the average of the hydrological variables in the neighbourhood is not a perfect predictor). This could be seen as an advantage of the RVN method, since it directly assesses this source of uncertainty and tries to reduce it.

Steps 1–3 are the particularity of the RVN method, while the other steps are
common in RFFA and are explained in Sect. 2. In the remainder of this study,
step 4 uses a specific type of neighbourhood that is composed of a fixed
number of the nearest sites (Eng et al., 2005; Tasker et al., 1996), but
could also be constrained to the degree of the homogeneity of the
neighbourhoods (Ouarda et al., 2001). Consequently, the selected gauged sites
can be obtained by sorting

Illustration of the neighbourhoods obtained by the RVN method.

Notice that the RVN method may be seen as a generalization of the ROI and the
CCA methods in RFFA. Indeed, the ROI method corresponds to the RVN method for
which all the reference variables are site characteristics. In that case,

For the RVN method presented above, the neighbourhood sizes must be calibrated
according to an objective criterion. In this regard, the leave-one-out
cross-validation approach is a general strategy to assess the performance of
the predicted hydrological variables

In their procedure, Hosking and Wallis (1997) used this heterogeneity measure
to test for regional homogeneity, which implies that the regional LCV can be
considered constant. Hence, the result of this test allows us to decide if a
region must be divided into smaller and more homogenous sub-regions. In the
present study, the size of the neighbourhoods is the same. Hence, if a
homogeneity test is performed with a given neighbourhood size, some of the
neighbourhoods will be considered homogenous, while the others will be
considered heterogeneous (Das and Cunnane, 2011). However, the heterogeneity
measure in Eq. (13) remains a useful indicator of dispersion for the regional
LCV

To facilitate the interpretation of the results and to ensure the
comparability between neighbourhoods, the heterogeneity measure

Another desired property for a neighbourhood is that it leads to estimation
models with less uncertainty. For the index-flood model, this implies in
particular less uncertainty in the prediction of the index flood, while for
regression-based models, it implies less uncertainty in the prediction of
flood quantiles. For a multiple regression model, the uncertainty can be
quantified by the residual variance:

During the cross-validation process, the sample variance of the regression
models can be calculated for every site, which leads to the average relative
efficiency (ARE) criterion defined by

To validate the RVN method, RFFA is carried out in a real-world case study using both the index-flood model and the regression-based model. The hydrological variables of interest are the flood quantiles corresponding to a return period of 100 years, denoted as Q100. The analysis is performed on 151 sites located in the southern part of the Province of Québec, Canada. Figure 3 illustrates the location of these sites. Each site has at least 15 years of data, and the average record length is 31 years. The usual hypotheses of stationarity, homogeneity and independence are verified for all 151 data series. Only a brief description of the data and the at-site frequency analysis is provided since the elements were already presented in detail in previous studies (e.g. Chokmani and Ouarda, 2004).

The at-site distributions are selected among several families including
generalized extreme value (GEV), Pearson type III (P3), generalized
logistic (GLO) and log-normal with three parameters (LN3). In general, the
estimation of the at-site distribution was achieved by maximum likelihood and
the final choices of distributions are based on the Akaike information
criterion. Recent studies on the same data set have identified four relevant
site characteristics (Chebana et al., 2014; Durocher et al., 2015), which
are used in the present analysis: the drainage area or BV (

Steps 1–2 of the RVN method represent the selection of the reference
variables and, if necessary, the estimation of the hydrological reference
variables at the target locations. Two initial groups of reference variables
are considered and updated by backward step-wise selection. The first group is
based on L-moments only and the second is based on the combination of
L-moments and site characteristics. The acronyms LM (for L-moments) and HYB
(for hybrid) are used to identify the two groups. More precisely, the L-moments
considered for both groups are the sample average (L1), the LCV, the
L coefficient of skewness (LSK) and the L coefficient of kurtosis (LKT).
These reference variables are transformed and standardized to obtain zero
mean and unit variance. The transformation used for L1 and LCV is the
logarithm, while for LSK and LKT, the transformation is

Location of the 151 hydrometric stations in southern Québec, Canada.

A specific implementation of PPR is assumed, which considers the smooth
functions

Figure 4 shows the fitting of the four reference variables by the PPR models.
Cross validation has selected PPR models with a unique direction

Residuals of the reference variables by PPR methods.

Figure 4a shows a strong linear relationship between L1 and the predictor

Due to its poor fit, LSK may not be a proper reference variable for the delineation step. To validate this assumption, the neighbourhoods are formed with and without using LSK and the rest of the analysis is carried out for both scenarios. Based on the RRMSE criterion, LSK must be maintained, as it is associated with better predictive performances. This strategy is part of the backward step-wise selection procedure as described in Sect. 3.1. Overall, it leads to discarding LKT and to maintaining L1, LCV and LSK. The second group of reference variables contains both the L-moments and the site characteristics. As with the first group, backward step-wise selection is performed and the final reference variables are BV, PLAC, LCV and LSK. In order to distinguish the two groups of reference variables, RVN-LM will designate the first group with the L-moments only and RVN-HYB will designate the second group with both the L-moments and the site characteristics.

At this point, steps 1–4 of the RVN methodology are performed and the
neighbourhoods are identified. Notice that for the RVN-LM method, the
reference variables include the first three L-moments, which could be used as
a moment estimator to deduce the target distribution. This approach is,
however, not generally applicable to the present methodology as the reference
variables are selected by a step-wise procedure. Moreover, it is necessary to
identify a proper family of distributions from regional information, which is
achieved here by analysing the distribution of the gauged sites inside the
neighbourhoods. The index-flood model and the L-moments algorithm were proven
to lead to a reliable procedure to identify a regional distribution and to
estimate its parameters (Hosking and Wallis, 1997). In this model, the
regional quantile corresponding to a return period

The index-flood model is fitted inside the neighbourhoods obtained by each one
of the four methods: ROI, CCA, RVN-LM and RVN-HYB. For CCA, two canonical
pairs are calculated using flood quantiles corresponding to the 10- and
100-year return periods as hydrological variables, as described in Sect. 2.1.
The choice of the regional distribution is made between the four common
families of distributions that were mentioned earlier: GEV, GLO, LN3 and P3.
The parameters of the regional quantile function

Figure 5b, c and d present the L-moment ratio diagrams of the at-site LCV and LSK for three given target locations as an illustration of the gauged sites found in the respective neighbourhoods. In these diagrams, the nearest gauged sites selected for RVN-LM, CCA and ROI are highlighted. Figure 5b shows that RVN-LM has a denser cluster of gauged sites in terms of LCV and is approximately centred on the true target. Conversely, Fig. 5c and d show situations where the true targets do not correspond to the predicted targets. Although all the reference variables are known at the target location for the ROI method, Fig. 5b and c show that the selected sites are not located around the true target. This finding is consistent with the results of GREHYS (1996a, b), which indicate that delineation according to physiographical similarity can lead to substantially different regions than delineation according to hydrological similarity.

L-moments ratio diagram for index-flood model.

Results of the cross-validation are presented in Fig. 6. The evaluation criteria are calculated for every neighbourhood size greater than 15 in order to calibrate the model. The trend illustrated in this figure helps to visualize how these criteria evolve with the neighbourhood size. The comparison of Fig. 6a and b indicates that the optimal neighbourhood sizes for RRMSE and NHS are not always in agreement. In particular, the best RRMSE for the RVN-HYB method is achieved with 24 sites, while the best NHS is achieved with nearly 80 sites. Nevertheless, the optimal values for the three other methods are obtained with approximately 30 sites for both criteria. Figure 6b indicates that all methods have a relatively stable NHS between 86 and 87 %, but the best NHS is obtained by RVN-LM. Conversely, Fig. 6a shows clearer improvements of the calibration in terms of the RRMSE criterion. Hence, the calibrated models are set according to the RRMSE criterion and are represented by circles in Fig. 6. The results are summarized in Table 1. RVN-HYB, with an RRMSE of 40.1 %, outperforms the other methods. In particular, differences of 6.1 and 5.3 % are observed, respectively, with the traditional ROI and CCA methods.
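For readers who wish to reproduce this kind of comparison, the two predictive criteria can be computed from the leave-one-out predictions as sketched below. The formulas are the commonly used definitions of the relative root mean square error and of the Nash-Sutcliffe efficiency, assumed here for illustration; the function names and the numbers are hypothetical and are not results of the study.

```python
import math


def rrmse(observed, predicted):
    """Relative root mean square error, in percent (lower is better)."""
    n = len(observed)
    total = sum(((p - o) / o) ** 2 for o, p in zip(observed, predicted))
    return 100.0 * math.sqrt(total / n)


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency (1 is a perfect fit)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical at-site quantiles and their cross-validated predictions.
at_site = [100.0, 200.0, 400.0]
cross_validated = [110.0, 180.0, 400.0]
error = rrmse(at_site, cross_validated)
efficiency = nash_sutcliffe(at_site, cross_validated)
```

Because the RRMSE weights each site by its relative error while the Nash-Sutcliffe efficiency is dominated by the largest quantiles, the two criteria need not select the same neighbourhood size.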

Figure 6c and d present, respectively, the AHM and the ARE criteria obtained from the considered methods. The AHM criterion indicates that the ROI and the CCA methods have lower heterogeneity than the whole data set in general, but are largely outperformed by the RVN-LM and RVN-HYB methods especially for smaller neighbourhoods. This is not surprising as the RVN-LM and RVN-HYB pool together sites with similar L-moments, but this quantifies the intuitive assumption that the regional LCV is calculated with less uncertainty when the L-moments are directly considered instead of other reference variables. In particular, the AHM of the ROI method is 72.8 % with the optimal neighbourhood size of 30. In comparison, the AHM of the RVN-LM method is 14.5 % with the optimal neighbourhood size of 29 sites, which is considerably lower. Figure 6c shows that the AHM criterion of the RVN-LM method does not reach a similar level to the ROI method until as many as 120 sites are used. These results indicate that even for relatively small neighbourhoods, the ROI method identifies regions that are only slightly less hydrologically heterogeneous than all sites pooled together. This suggests that, in the present case study, the ROI method has difficulties identifying sites that are similar to the target site in terms of LCV.

Evaluation criteria for the RVN method for optimal neighbourhood sizes.

The best value of each criterion is written in bold.

Evaluation criteria for the index-flood model. Calibrated models are represented by circles.

Comparison of the cross-validation residuals for Q100 for different methods. The black line is the unitary slope and the red line is a smooth fitting of the residuals.

As mentioned in Sect. 4.2, previous studies have identified a few problematic
stations in the considered data set. Figure 7 presents the residuals between
different methods. As it may be difficult to see small improvements by
uniquely observing points around the

The present case study is an example of a region where some sites are problematic for any method. In practice, the residuals are not known; consequently, we do not know if the target sites of interest will be problematic or not. Globally, what Fig. 7a indicates is that the RVN-HYB model is somewhat more robust, because for the sites that are well predicted by simpler models, such as ROI, RVN-HYB will perform similarly on average. However, if the target site is predicted less accurately, the RVN-HYB model will (on average) be better in terms of RRMSE. Consequently, the overall gain may seem of moderate magnitude. However, for some problematic stations, the gain could be more substantial. In particular, the red lines in the left part of Fig. 7a appear mostly influenced by two points, but the two improvements are 77.2 and 68.5 %, which is considerable.

Evaluation criteria for the regression-based model. Calibrated models are represented by circles.

Prediction of Q100 at the target location is also carried out by the regression-based model using the same delineation methods as with the index-flood model, but with potentially different calibration values for the neighbourhood sizes. Consequently, the descriptions of steps 1–4 (in Sect. 3.1) are identical to those of the index-flood approach and are not repeated here.

Cross-validation criteria for the regression-based model are presented in
Fig. 8 and summarized in Table 1. As with the index-flood model, Table 1
reveals that the RVN-HYB method leads to the best performance in terms of the
RRMSE. Although all methods differ by less than 2 % in terms of NHS,
results indicate that NHS values corresponding to CCA and RVN-HYB are
inferior to those corresponding to the regression model applied
to all gauged sites, which corresponds to

The fit of the regression-based model is graphically assessed in Fig. 9 by
quantile–quantile plots. It is shown that, for all the delineation
approaches, the regression-based models correctly predict the flood quantile
Q100 at target locations, as it correctly follows the

Quantile–quantile plot of Q100 for the RVN method with the regression-based model.

A general methodology was investigated to improve the homogeneity properties of neighbourhoods in RFFA. A procedure to calculate relevant reference variables at a target location prior to the RFFA was proposed to improve neighbourhood properties and to reduce uncertainties. The predicted values of the reference variables represent the unknown centres of neighbourhoods, which are delineated according to the distance of the gauged sites from these centres. The proposed method represents a generalization of both ROI and CCA methods in RFFA. The proposed RVN method has the advantage of accepting various groups of reference variables, considering nonlinear interrelations and being more objective since L-moments are used instead of estimated flood quantiles from at-site analysis.

In this study, the reference variables correspond to transformed L-moments. The resulting RVN-LM and RVN-HYB methods were applied to sites located in the southern part of the province of Québec, Canada, to predict flood quantiles corresponding to the 100-year return period by both index-flood and regression-based models. The prediction of the reference variables at target locations showed that, after proper transformations, L1 can be linearly related to the site characteristics, but no proper transformations are found for the other L-moments. This justifies the consideration of the PPR method to account for the nonlinearity in the prediction of the reference variables. In general, other models, such as generalized additive models or artificial neural networks, could be considered instead of PPR to account for the nonlinearity. Nevertheless, the PPR approach unveils direction vectors that provide explicit, parsimonious and meaningful regression equations.

Although none of the methods performed best for all criteria, cross-validation showed that the proposed RVN method performs well in comparison to the traditional ROI and CCA methods. In both the index-flood and the regression-based models, the best RRMSE is obtained by RVN-HYB and the best NHS is obtained by RVN-LM. In particular, the favourable RRMSE values obtained by RVN-HYB are due to a more robust estimation of problematic sites. However, RVN-LM has the best balance, because it achieves the best or the second-best values for all criteria. Most importantly, the utilization of hydrological reference variables with the CCA and RVN methods has reduced the uncertainty on the regional LCV, the index flood and the predicted flood quantiles, in comparison to ROI. Consequently, prior modelling of hydrological reference variables was shown to be advantageous for the delineation of neighbourhoods in RFFA.

The present study has made specific assumptions in order to investigate the RVN method in well-defined conditions. Nevertheless, the approach that consists in predicting hydrological reference variables in an a priori analysis remains valid when other choices of regression models, neighbourhood forms and metrics are considered. More comparative studies should be carried out to evaluate alternatives to fixed-size neighbourhoods and Euclidean distances in the specific context of the RVN framework.

The L coefficient of skewness is commonly used in RFFA to describe the shape of a distribution. Consequently, to improve the result of the RVN method, further research efforts could focus on improving the prediction of this crucial reference variable. One way to improve the prior analysis of the hydrological reference variables is the consideration of the unequal sampling error. This aspect is often considered in the estimation of flood quantiles in RFFA, but may also play an important role in the prior analysis of the RVN method.

The raw hydrological data can
be obtained from the Environment Ministry of the Province of Quebec
(

Financial support for this study was graciously provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada. The authors are grateful to the editor, Elena Toth, and the two reviewers whose comments and suggestions contributed to the improvement of the manuscript.

Edited by: E. Toth
Reviewed by: T. Gado and one anonymous referee