Reliable estimates of missing streamflow values are important for water resource planning and management. This study proposes a multiple-dependence conditional model based on vine copulas for estimating streamflow at partially gaged sites. For a complex streamflow gage network, the proposed model represents the high-dimensional joint distribution by building a hierarchy of conditional bivariate copulas. The usefulness of the proposed model is first demonstrated using a synthetic streamflow scenario. In this analysis, a bivariate copula model and a variant of the vine copula are also employed to assess the value of the multiple-dependence structure adopted in the proposed model. The evaluation is then extended to a case study of 54 gages located within the Yadkin–Pee Dee River basin in the eastern USA. Both sets of results indicate that the proposed model is better suited for infilling missing values. Specifically, in the historical case study, the proposed multiple-dependence model shows an average improvement of 9.2 % over the bivariate model. The performance of the vine copula is further compared with six other infilling approaches to confirm its applicability. The results demonstrate that the proposed model produces more reliable streamflow estimates than the other approaches. In particular, when applied to partially gaged sites with sufficient available data, the proposed model clearly outperforms the other models. Although the model is illustrated for a specific case, it can be extended to other regions and to other hydro-climatological variables for the purpose of infilling.

Hydrological observation records covering long-term periods are instrumental in water resources planning and management, including the design of flood defense systems and irrigation water management (Aissia et al., 2017; Beguería et al., 2019). However, available streamflow data are often limited by circumstances such as equipment failures, budgetary cuts, and natural hazards (Kalteh and Hjorth, 2009). Missing data are particularly common in remote catchments, where equipment failures are repaired only after significant delays following extreme events; such gaps can be critical for hydrological frequency analysis. Hence, hydrologists often rely on simulated sequences to infill missing data in partially gaged catchments (Booker and Snelder, 2012), using two primary modeling approaches: (1) process-based models (i.e., estimating streamflow based on a conceptual understanding of hydrological processes) and (2) transfer-based statistical models (i.e., transferring information from gaged to ungaged catchments; Farmer and Vogel, 2016). This paper focuses on the latter, estimating historical daily streamflow at inadequately and partially gaged sites by means of a statistical relationship.

Over the past few decades, a variety of statistical models have been developed, including simple drainage area scaling (Croley and Hartmann, 1986), spatial interpolation techniques (Pugliese et al., 2014), regression models (Beauchamp et al., 1989), and flow duration curves (FDCs; Hughes and Smakhtin, 1996). In particular, the FDC method has been regarded as one of the most trustworthy regionalization approaches (Archfield and Vogel, 2010; Boscarello et al., 2016; Castellarin et al., 2004; Li et al., 2010; Mendicino and Senatore, 2013). If the target watershed is completely ungaged, FDCs can be established using regression models to regionalize the parameter sets of defined distributions (e.g., Ahn and Palmer, 2016a; Blum et al., 2017) or to regionalize a set of primary quantiles (Cunderlik and Ouarda, 2006; Schnier and Cai, 2014; Zaman et al., 2012). On the other hand, if the target watershed is poorly or partially gaged, FDC models are built using the following four steps: (1) estimating the non-exceedance probability for recorded streamflow at the target watershed, (2) selecting one or multiple donor watersheds for the target watershed, (3) transferring the time series of the non-exceedance probability from the donor watersheds for the missing streamflow values, and (4) converting the transferred non-exceedance probabilities back to streamflow values. When FDCs are utilized for partially gaged watersheds, how the donor watersheds are selected (step 2) and how the probabilities are transferred from the donor watersheds (step 3) are crucial.
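The four FDC steps for a partially gaged site can be sketched as follows. This is only a minimal illustration under stated assumptions: a single donor has already been chosen (step 2), probabilities are computed with Weibull plotting positions, and the empirical FDC is inverted by linear interpolation. The function names (`nonexceedance`, `quantile_from_fdc`, `infill_with_fdc`) are hypothetical and do not come from the study.

```python
def nonexceedance(values):
    """Weibull plotting positions p = rank / (n + 1) for each value."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    probs = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        probs[idx] = rank / (n + 1)
    return probs


def quantile_from_fdc(sorted_flows, p):
    """Invert an empirical FDC by linear interpolation between order statistics."""
    n = len(sorted_flows)
    pos = p * (n + 1) - 1  # fractional zero-based index into sorted_flows
    if pos <= 0:
        return sorted_flows[0]
    if pos >= n - 1:
        return sorted_flows[-1]
    lo = int(pos)
    frac = pos - lo
    return sorted_flows[lo] + frac * (sorted_flows[lo + 1] - sorted_flows[lo])


def infill_with_fdc(target, donor):
    """Fill None entries in `target` using a complete `donor` series.

    Step 1: build the target FDC from its recorded flows.
    Step 2: donor selection is assumed to be done already.
    Step 3: take the donor's non-exceedance probability on each day.
    Step 4: map that probability back to flow through the target FDC.
    """
    observed = sorted(q for q in target if q is not None)
    donor_p = nonexceedance(donor)
    return [
        q if q is not None else quantile_from_fdc(observed, donor_p[t])
        for t, q in enumerate(target)
    ]
```

For instance, `infill_with_fdc([10.0, None, 30.0, 40.0, 50.0], [1.0, 2.0, 3.0, 4.0, 5.0])` keeps the four recorded values and fills the gap with a value between 10 and 30, because the donor flow on that day is its second smallest.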

Many studies have developed diverse approaches for steps 2 and 3 in FDC modeling. While in the basic formulation the non-exceedance probabilities at the target site are transferred from those at a single donor site, Smakhtin (1999) instead suggested a weighted average of the non-exceedance probabilities from the selected donor sites. In addition, Farmer (2015) applied a kriging model to regionalized daily standard (i.e.,

Copulas have received increasing attention in the field of hydrology, with applications in flood frequency analysis, drought risk analysis, and multi-site streamflow generation (Ahn and Palmer, 2016b; Ariff et al., 2012; Chen et al., 2015; Daneshkhah et al., 2016; Fu and Butler, 2014). Copulas are mathematical functions that combine the univariate marginal distribution functions of random variables into their joint cumulative distribution function and, depending on the copula family, can represent diverse dependence structures between these random variables (Sklar, 1959). For example, Fu and Butler (2014) showed that the Gumbel copula performs well in representing multiple flooding characteristics compared to the other copulas from the Archimedean family, namely the Clayton and Frank copulas. To estimate streamflow (i.e., infill missing data) at poorly and partially gaged sites, Worland et al. (2019) developed bivariate copulas with an Archimedean copula but limited their application to a single donor. Despite this limitation, their bivariate copulas may be acceptable, since higher-dimensional copulas are not rich enough to model all possible mutual dependencies among multi-site donors (see Karmakar and Simonovic, 2009, for details). Hao and Singh (2013) also note that multivariate copulas are incapable of modeling multi-site data exhibiting complex patterns of dependence.

However, if the theoretical limitation of a multivariate copula is mitigated, dependency information from multiple donor sites may allow more reliable predictions of regionalized streamflow. Vine copulas, also known as pair copulas, offer a far more efficient way to construct a higher-dimensional dependence (Bedford and Cooke, 2002; Joe, 2014). They have hierarchical structures that sequentially apply bivariate copulas as the local building blocks for constructing a higher-dimensional copula. The high flexibility of vine copulas enables the modeling of a wide range of complex data dependencies. In particular, Aas et al. (2009) have popularized two classes of vine copulas, namely canonical vines (C-vines) and drawable vines (Dvines), by allowing diverse pair copula families, such as the bivariate Student

Based on the usefulness of vine copulas, Kraus and Czado (2017) have developed a promising algorithm that sequentially fits such a Dvine copula
model (

This study makes two novel contributions to infilling missing data in the field of hydrology: (1) a Dvine copula-based model is introduced to estimate streamflow for poorly and partially gaged watersheds, and (2) the existing model (

A copula

Based on Sklar's (1959) theorem, a multivariate distribution function is a composition of a set of marginal distributions; thus, Eq. (1) can be expressed in terms of densities, as follows:
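For orientation, Sklar's theorem and its density form are commonly written as below. The symbols are assumptions introduced here for illustration ($F$ and $f$ for the joint distribution and density, $F_k$ and $f_k$ for the margins, $C$ and $c$ for the copula and its density) and may differ from the article's own notation.

```latex
F(x_1,\dots,x_d) = C\bigl(F_1(x_1),\dots,F_d(x_d)\bigr),
\qquad
f(x_1,\dots,x_d) = c\bigl(F_1(x_1),\dots,F_d(x_d)\bigr)\,\prod_{k=1}^{d} f_k(x_k)
```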

Following Bedford and Cooke (2001), any copula density

A Dvine is characterized by the ordering of its variables (see Fig. 1). In the first tree, the dependence of the first and second variables, of the second and third, and of the third and fourth, and so on, is modeled using
pair copulas. In the second tree, the conditional dependence of the first and
third, given the second variable (i.e.,

Example of Dvine structures with five variables, four trees, and 10 edges.
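The tree and edge counts in this caption follow directly from the variable ordering. As a minimal sketch, the hypothetical helper `dvine_edges` below enumerates the pair copulas of a D-vine on variables 1, ..., d, using the standard rule that tree j links variables i and i + j conditional on the variables between them.

```python
def dvine_edges(d):
    """List the pair copulas of a D-vine with variable order 1, 2, ..., d.

    Returns one list per tree. Each edge is (conditioned, conditioning),
    e.g. ((1, 3), (2,)) stands for the pair copula of variables 1 and 3
    given variable 2.
    """
    trees = []
    for j in range(1, d):  # tree level 1 .. d-1
        edges = []
        for i in range(1, d - j + 1):
            conditioned = (i, i + j)
            conditioning = tuple(range(i + 1, i + j))
            edges.append((conditioned, conditioning))
        trees.append(edges)
    return trees
```

Calling `dvine_edges(5)` yields four trees with 4, 3, 2, and 1 edges, i.e., 10 pair copulas in total, which matches the structure described in the caption.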

For the 5-dimensional Dvine copula, as an example in Fig. 1, the corresponding vine distribution has the following joint density:
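In the standard pair-copula notation of Aas et al. (2009), the density of a 5-dimensional D-vine with variable order 1, ..., 5 can be written as below. The notation is an assumption made here for illustration and may differ from the article's own equation; for $j = 1$ the conditioning set is empty, so the pair copulas are unconditional.

```latex
f(x_1,\dots,x_5) = \prod_{k=1}^{5} f_k(x_k)\,
\prod_{j=1}^{4}\prod_{i=1}^{5-j}
c_{i,\,i+j;\,i+1,\dots,i+j-1}\Bigl(
F\bigl(x_i \mid x_{i+1},\dots,x_{i+j-1}\bigr),\,
F\bigl(x_{i+j} \mid x_{i+1},\dots,x_{i+j-1}\bigr)\Bigr)
```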

As presented in Eq. (4), the conditional distribution functions and conditional bivariate copulas are required in vine copula modeling. The
conditional distribution functions

More details about Dvines can be found in Bedford and Cooke (2002) and Czado (2010, 2019).
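The conditional distribution functions needed in a vine are obtained from partial derivatives of the pair copulas, often called h-functions. As a minimal sketch, the code below gives the h-function and its inverse for a Gaussian pair copula, for which both have closed forms. The Gaussian family and the function names are assumptions for illustration; the study allows other pair-copula families.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def gaussian_h(u, v, rho):
    """Conditional distribution h(u | v) = dC(u, v)/dv for a Gaussian copula."""
    z_u = _STD_NORMAL.inv_cdf(u)
    z_v = _STD_NORMAL.inv_cdf(v)
    return _STD_NORMAL.cdf((z_u - rho * z_v) / sqrt(1.0 - rho ** 2))


def gaussian_h_inverse(w, v, rho):
    """Inverse of gaussian_h in its first argument: the u with h(u | v) = w."""
    z_w = _STD_NORMAL.inv_cdf(w)
    z_v = _STD_NORMAL.inv_cdf(v)
    return _STD_NORMAL.cdf(z_w * sqrt(1.0 - rho ** 2) + rho * z_v)
```

The inverse is what turns a chosen probability level into a conditional quantile: `gaussian_h_inverse(0.5, v, rho)` returns the conditional median of the first variable, on the probability scale, given that the second equals `v`.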

Following Kraus and Czado (2017), a two-step estimation procedure is utilized for the prediction of the streamflow value at the target watershed. The algorithm (

Let

Next, to easily estimate conditional streamflow values at the target site,
the Dvine copula is fitted with the fixed order

The final number (

This study first explores the performance of

The seven infilling approaches discussed in the study.

Structure of the 6-dimensional vine model and marginal probability function for the synthetic simulation.

Synthetic streamflow data are generated using a controlled Monte Carlo experiment to explore how well the three copula-based models (

The performance of each model is evaluated in a calibration–validation framework. First, synthetic streamflow data are generated for a 6-dimensional gage network. Then,

The Yadkin–Pee Dee River basin (Fig. 3), covering around 18 700 km

Map of the Yadkin–Pee Dee basin with 54 stream gage stations.

Daily streamflow data at 54 gages are gathered throughout the study region from the web interface of the US Geological Survey (USGS) National Water Information System (NWIS; US Geological Survey, 2018). The 54 gages are selected based on the following criteria: (1) each gage has a continuous 15-year record of daily streamflow over the period from January 2004 to December 2018, and (2) each gage has only non-zero daily values over that period. Gages with zero-flow days require a more flexible modeling structure, and it is common to model zero flows separately in regionalization studies. Based on the second criterion, this study discards 10 gage stations (not shown).
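The two selection criteria can be expressed as a simple screening check. The sketch below assumes each gage record is a list of daily values with `None` marking missing days; the function name and this data layout are hypothetical and are not taken from the study.

```python
from datetime import date

# Number of days from 1 January 2004 through 31 December 2018.
EXPECTED_DAYS = (date(2019, 1, 1) - date(2004, 1, 1)).days


def passes_screening(flows, expected_days=EXPECTED_DAYS):
    """Return True if a record is complete (criterion 1) and strictly positive (criterion 2)."""
    complete = len(flows) == expected_days and all(q is not None for q in flows)
    # The positivity check runs only for complete records, so None is never compared.
    return complete and all(q > 0 for q in flows)
```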

A set of seven infilling approaches is used in the final assessment (see
Table 1), i.e., (1)

As presented in Sect. 3.1 and 3.3, the RMSE (Eq. 6) and NSE are employed to evaluate prediction skills as follows:
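Both metrics have standard definitions, sketched below for paired lists of observed and simulated flows. The function names are illustrative, and the formulas are the usual textbook forms, assumed here to match the article's equations.

```python
from math import sqrt


def rmse(obs, sim):
    """Root mean squared error between observed and simulated values."""
    return sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variability."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((s - o) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst
```

An NSE of 1 indicates a perfect match, while an NSE of 0 means the prediction is no better than using the observed mean.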

Following derivations suggested in Gupta et al. (2009), the mean squared error (MSE) underlying the RMSE can be further decomposed into three components (bias, variance, and timing), as follows:
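In the decomposition of Gupta et al. (2009), the mean squared error splits into a bias term, a variance term, and a correlation (timing) term: MSE = (μ_s − μ_o)² + (σ_s − σ_o)² + 2 σ_s σ_o (1 − r). The sketch below computes the three terms with population moments, so they sum exactly to the MSE. The function name and dictionary keys are illustrative choices, not the article's.

```python
from math import sqrt
from statistics import fmean, pstdev


def decompose_mse(obs, sim):
    """Split the MSE into bias, variance, and timing (correlation) terms."""
    mu_o, mu_s = fmean(obs), fmean(sim)
    sd_o, sd_s = pstdev(obs), pstdev(sim)  # population standard deviations
    cov = fmean([(o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)])
    return {
        "bias": (mu_s - mu_o) ** 2,
        "variance": (sd_s - sd_o) ** 2,
        # Equals 2 * sd_s * sd_o * (1 - r), written without dividing by sd_s * sd_o.
        "timing": 2.0 * (sd_s * sd_o - cov),
    }
```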

RMSE and NSE results over the validation periods under the synthetic experiment for comparing copula-based model formulations. The best metric values for each quantile are shown in bold.

In addition, the accuracy of the uncertainty quantification skill is also
evaluated for the copula-based models (

Prediction results from the out-of-sample RMSE and NSE metrics are
presented for the three copula-based models (

Results of average coverage error (ACE) over the validation periods under the synthetic experiment for comparing copula-based model formulations. The best metric values for each quantile are shown in bold.

In addition, the ACE results present how the three models characterize prediction uncertainty.
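The article's exact ACE formula is not reproduced in this text. A common definition, assumed here, is the empirical coverage of a prediction interval minus its nominal coverage, so that values near zero indicate well-calibrated uncertainty. The function name and arguments below are hypothetical.

```python
def average_coverage_error(obs, lower, upper, nominal):
    """Empirical coverage of [lower, upper] minus the nominal coverage level.

    obs, lower, upper: equal-length sequences of observations and interval bounds.
    nominal: nominal coverage, e.g. 0.9 for a 90 % prediction interval.
    """
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return inside / len(obs) - nominal
```

A positive value means the intervals cover more observations than intended (too wide), and a negative value means they cover fewer (too narrow).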

Pairwise upper- and lower-tail dependence for watersheds in the Yadkin–Pee Dee River basin. The upper triangular matrix shows values for the upper-tail dependence, and the lower triangular matrix presents values for the lower-tail dependence. The metrics can range from 0 to 1, with higher values suggesting greater interdependence of the two streamflows in the respective tail.
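Tail dependence of this kind can be approximated empirically from rank-transformed data. The sketch below uses a simple threshold estimator, the share of days on which the second series is in the tail given that the first is. Both the estimator and the threshold `q = 0.9` are assumptions for illustration, and the article may use a different estimator.

```python
def pseudo_obs(values):
    """Rank-transform values to (0, 1) using rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u


def tail_dependence(x, y, q=0.9):
    """Return (upper, lower) empirical tail dependence at threshold q."""
    u, v = pseudo_obs(x), pseudo_obs(y)
    pairs = list(zip(u, v))
    n_up = sum(1 for a, _ in pairs if a > q)
    n_lo = sum(1 for a, _ in pairs if a <= 1.0 - q)
    joint_up = sum(1 for a, b in pairs if a > q and b > q)
    joint_lo = sum(1 for a, b in pairs if a <= 1.0 - q and b <= 1.0 - q)
    upper = joint_up / n_up if n_up else 0.0
    lower = joint_lo / n_lo if n_lo else 0.0
    return upper, lower
```

Two perfectly co-moving series give values of 1 in both tails, while perfectly opposed series give 0.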

Using the insights developed from the synthetic experiment above, the three
copula-based models are applied to the streamflow data for the Yadkin–Pee Dee River. At first, upper- and lower-tail dependences (

Figure 5 shows the RMSE and NSE results for the three copula-based models under a leave-one-out cross-validation framework. This process is repeated 20 times with randomly defined test periods to build an ensemble prediction. For this analysis, 5 years of data are treated as the observed period at the target gage, and another 4 years are randomly selected from the remaining data as the test period. Similar to the results from the synthetic experiment,

Model performance for the Yadkin–Pee Dee river under a cross-validation framework, based on RMSE (dark squares) and NSE (light squares). Here, the RMSE (NSE) can range from 0 to

Figure 6a presents the ACE scores described for principal quantiles,

Based on the results in Figs. 5 and 6,

Structure of the Dvine copula applied for a particular target site (USGS site ID 214645022), with the defined bivariate copulas and their parameters.

Intermodel comparison using cross-validation experiments based on
RMSE

To assess the predictive skill of the proposed vine copula model, it is compared with six other statistical models (see Table 1). Figure 8 shows the RMSE and NSE for the seven models, where the streamflow values are estimated from the available data in two cases, labeled “deficit record” and “sufficient record” (see Sect. 3.3). In both cases, the vine copula approach outperforms the other infilling approaches. For example, for the sufficient record case, the median NSE for

The three contributions from the decomposed mean squared error (MSE)
for the cross-validation experiment with

Average coverage error of the Dvine model for two scenarios under

The RMSE is decomposed into its components (bias, variance, and timing) for both the deficit record and sufficient record predictions (Fig. 9). In both cases, the timing component accounts for the majority of the prediction error for all seven models. In particular, models that directly estimate streamflow values (IDW-streamflow, DAR-streamflow, and Kriging-streamflow) produce a noticeable bias component, which grows when a shorter record is used in the model. For instance, the timing component for

Finally, the following two predictions are further produced using two additional experiments: (1) the observed marginal cumulative probabilities (i.e., using all 15 years) and conditional streamflow values constructed from the partial record (i.e., based on

Figure 10 shows the ACE scores from the out-of-sample predictions using the proposed Dvine model under the two scenarios. When considering all the quantiles together, the ACE scores for the two scenarios are 0.003 (scenario no. 1) and 0.006 (scenario no. 2) on average under the deficit record prediction. Also, the scores under the sufficient record prediction are all nearly 0.003. These scores are sufficiently close to zero, implying that both predictions are reliable. Yet, compared with the predictions based on cumulative probabilities estimated from the partial record and conditional models constructed from the full record (i.e., scenario no. 2), the ACE scores are better when the cumulative probabilities are determined from the full record, except for some of the low and high quantiles. A similar pattern is seen in the NSE performance of the two scenarios (see the insets in Fig. 10). This suggests that careful attention should be paid to the first procedure (i.e., how to determine the cumulative probabilities for the target site and its donors) when

This study introduces a multiple-dependence conditional model (i.e., vine copulas) to produce streamflow estimates at partially gaged sites. The model includes a flexible high-dimensional joint-dependence structure and conditional bivariate copula simulations. To confirm the usefulness of a multiple-dependence structure and of the procedure for selecting an appropriate number of donor sites in the final vine copula model, the bivariate copula model and two types of vine copulas, each with its own procedure for determining the optimal number of donor sites, are first investigated using the generated data. These analyses are then extended to a case study of the Yadkin–Pee Dee River basin, in the eastern USA, by estimating streamflow at partially gaged locations. In this analysis, six statistical infilling approaches are also employed to assess the applicability of the proposed model.

Results of the synthetic experiment and the application to the Yadkin–Pee Dee River basin demonstrate that the proposed model has several benefits. First, the multiple-dependence structure adopted in the proposed model is beneficial. Based on extensive evaluation experiments, this study shows that a multiple-dependence structure clearly outperforms a single-dependence structure, although there is a risk of overfitting when too many dependence structures are employed. For example, the proposed model shows an average improvement of 9.2 % over the bivariate model in the evaluation experiment for the historical case study. Moreover, this study confirms that the proposed multiple-dependence structure model, with its optimum number of donor sites, produces more reliable streamflow estimates than other common infilling models. Specifically, for the sufficient record case, the proposed model shows an average improvement of 13.9 % over the FDC-highestrho model. Second, the proposed model allows the development of confidence intervals to account for prediction uncertainty, which is an attractive feature compared to other models. For example, Bárdossy and Pegram (2013) argue that confidence intervals obtained using an ordinary kriging model do not reflect the prediction uncertainty well, particularly on a daily scale. Overall, this study shows that a vine copula is potentially an effective tool to support water resource managers and planners in tasks such as gap-filling or extending streamflow records.

While the results of the proposed model are favorable, there are possible limitations worthy of further discussion. First, the proposed method is computationally expensive, even after adopting multicore processing to reduce the computational burden. This becomes more problematic when the method is applied to a larger, more complex streamflow gaging network. Nevertheless, once the model is calibrated for a specific site, local water managers do not need to rebuild it whenever they face missing values, so this computational burden may be a minor issue. Second, the assessment illustrated in this study focuses on model performance under cross-validation at partially gaged basins, and additional work is needed to extend the proposed model to ungaged basins; one possible way is to build a regression-based model with spatial proximity and physical basin characteristics to define associations between the target and donor sites (e.g., Ahn and Steinschneider, 2019). Lastly, this study does not consider the potential nonstationarity in FDCs and correlations caused by anthropogenic activity and changes in land use. Nonstationarity may not be problematic in this analysis, since the assessment is limited to 15 years across the gaging network. However, if longer records were used, it would be beneficial to consider the potential nonstationarity. This exploration is left for future work.

There are several opportunities to improve the model structure. For instance, a vine copula is able to incorporate additional conditioning variables. One feasible approach is to add a time series of climate data (e.g., precipitation) or to decompose a time series of streamflow from the donor sites into a number of periodic components at different frequency levels through the wavelet decomposition approach (Kisi and Cimen, 2011). Moreover, although the proposed model provides a more flexible way to model multivariate dependences, it can be further improved by relaxing the standard assumption (i.e., the simplifying assumption) that the conditional pair copulas depend on the conditioning variables only through the conditional margins (Acar et al., 2012). One possible alternative is the use of the semi-parametric estimation of a conditional copula (Acar et al., 2012; Vatter and Chavez-Demoulin, 2015). This semi-parametric approach yields estimates of the dependence parameters that do not rely on the simplifying assumption, potentially leading to more reliable infilling estimates. I believe that this provides an interesting avenue for future research.

Lastly, the results presented here are specific to the basin used in the case study. The proposed model is not restricted to this basin and can be applied to other watersheds around the world; further applications are required to draw more general conclusions. In addition, the model could be used for infilling missing values of other hydro-meteorological variables besides streamflow (e.g., precipitation and soil moisture). For this application, the implementation of a vine copula with combined discrete and continuous margins (i.e., to account for no-rainfall days) should be explored (e.g., Stoeber et al., 2013).

The code is available upon request from the corresponding author.

Daily streamflow data of 54 gages are available from the web interface of the US Geological Survey (USGS) National Water Information System (

The supplement related to this article is available online at:

The author declares that there is no conflict of interest.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT; grant no. 2019R1C1C1002438). The author would like to acknowledge Scott Steinschneider for his helpful comments during the development of this paper.

This research has been supported by the National Research Foundation of Korea (NRF; grant no. 2019R1C1C1002438).

This paper was edited by Carlo De Michele and reviewed by two anonymous referees.