Various methods are available for estimating annual groundwater recharge from in situ observations (i.e., observations obtained at the site/location of interest), but many watersheds around the world remain ungauged, i.e., without in situ observations of hydrologic responses. One approach for making estimates at ungauged watersheds is regionalization, namely, transferring information obtained at gauged watersheds to ungauged ones. The reliability of regionalization depends on (1) the underlying system of hydrologic similarity, i.e., the similarity in how watersheds respond to precipitation input, and (2) the approach by which information is transferred.

In this paper, we present a nested tree-based modeling approach for conditioning estimates of hydrologic responses at ungauged watersheds on ex situ data (i.e., data obtained at sites/locations other than the site/location of interest) while accounting for the uncertainties of the model parameters as well as the model structure. The approach is then integrated with a hypothesis of two-leveled hierarchical hydrologic similarity, where the higher level determines the relative importance of various watershed characteristics under different conditions and the lower level performs the regionalization and estimation of the hydrologic response of interest.

We apply the nested tree-based modeling approach to investigate the complicated relationship between mean annual groundwater recharge and watershed characteristics in a case study, and apply the hypothesis of hierarchical hydrologic similarity to explain the behavior of a dynamic hydrologic similarity system. Our findings reveal the decisive roles of soil available water content and aridity in hydrologic similarity at the regional and annual scales, as well as certain conditions under which it is risky to resort to climate variables for determining hydrologic similarity. These findings contribute to the understanding of the physical principles governing robust information transfer.

Groundwater resources supply approximately 50 % of the drinking water and
roughly 40 % of the irrigation water worldwide

This fact leads us to a critical question: how can one estimate hydrologic
responses without in situ data? Studying ungauged watersheds has been a
popular research topic for more than a decade, especially since the
Prediction in Ungauged Basins (PUB) initiative by the International
Association of Hydrological Sciences (IAHS)

However, the application of regionalization is not without challenges. One of
the key factors of predictive uncertainty identified by the PUB initiative is
the unsuitability of information transfer techniques, due to a lack of
comparative studies across watersheds and a lack of understanding of the
physical principles governing robust regionalization

One can resort to other physical characteristics of watersheds to
determine hydrologic similarity. However, identifying which characteristics
matter can be a complicated question.

In this study, we would like to integrate the perspective in

Given a set of watershed characteristics, the next important question is how
the regionalization is carried out.

To that end, the objectives of this study are twofold. First, to address the aforementioned challenges in regionalization technique, we propose a nested tree-based modeling approach, which features (1) nonlinear regression to model the predictor–response relationship, (2) full Bayesian quantification of parameter uncertainty, and (3) proposal–comparison-based consideration of model structure uncertainty. Second, we integrate the nested tree-based modeling approach with a hypothesis of hierarchical hydrologic similarity. We apply the approach to estimate a groundwater recharge signature at ungauged watersheds in a case study, and we invoke the hypothesis of hierarchical similarity to reveal the key controlling factors of a dynamic hydrologic similarity system, which could ultimately contribute to robust information transfer in future applications.

The data-driven, Bayesian, and nonlinear regression approach proposed in this
study is powered by Bayesian Additive Regression Tree (BART) at its core.
The details of BART, including the establishment of prior distribution (which
we term prior), the calculation of likelihoods, and the posterior inference
statistics, are well documented in

Consider a fundamental problem of making inference about an unknown function
that estimates a response variable of interest using a set of predictor
variables. The general form of this problem can be expressed as follows:
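In the notation commonly used in the BART literature (the symbols below are assumed here for illustration and may differ from those used elsewhere in this paper), the problem reads

```latex
y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}),
```

where $y$ denotes the response, $\mathbf{x}$ the vector of predictors, $f$ the unknown function, and $\varepsilon$ a zero-mean Gaussian residual with variance $\sigma^{2}$.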

Schematic diagrams of

To understand BART, first one needs to understand the build-up of the
additive ensemble tree model from individual classification and regression
tree (CART) models

To improve on the predictive performance of an individual CART, an
additive ensemble tree model can be built as the sum of

As mentioned above, instead of searching for the best

A schematic diagram of the MCMC simulation iteration procedure is shown in
Fig.

Given the aforementioned conditioned BART model, we now turn our attention to
estimating a new response that was not included in the data on which the BART
model was conditioned. This is done by inputting the vector of the new
predictors, denoted by

The key advantage of BART is that it combines the nonlinear regression for the predictor–response relationship with Bayesian inference, allowing for the determination of a full Bayesian posterior of predictive distribution, rather than one or a few estimates/predictions.
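As a minimal illustration of what a full predictive distribution offers over a point estimate, the following Python sketch forms a posterior predictive sample for one new watershed. It assumes the usual BART setting in which each MCMC iteration yields one value of the sum-of-trees output at the new predictors and one residual standard deviation. The draws below are synthetic, and the names `posterior_predictive_samples`, `f_draws`, and `sigma_draws` are hypothetical; this is not the implementation used in the study.

```python
import random
import statistics

def posterior_predictive_samples(f_draws, sigma_draws, seed=0):
    """Draw one predictive value per MCMC iteration: f_k(x*) plus N(0, sigma_k^2) noise."""
    rng = random.Random(seed)
    return [rng.gauss(f, s) for f, s in zip(f_draws, sigma_draws)]

# Synthetic stand-ins for MCMC output at one new watershed (assumed values):
# 2000 draws of the sum-of-trees output and of the residual standard deviation.
rng = random.Random(42)
f_draws = [rng.gauss(-1.0, 0.3) for _ in range(2000)]
sigma_draws = [abs(rng.gauss(0.5, 0.05)) for _ in range(2000)]

y_star = posterior_predictive_samples(f_draws, sigma_draws)

# Any summary can be read off the sample, not just a single estimate.
point_estimate = statistics.fmean(y_star)
cuts = statistics.quantiles(y_star, n=20)  # cut points at 5 %, 10 %, ..., 95 %
interval_90 = (cuts[0], cuts[-1])
```

The same sample can feed a density estimate, exceedance probabilities, or a downstream stochastic simulation, which is the practical benefit of retaining the whole distribution.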

The estimation and regionalization processes are data-driven. Prior knowledge
of the underlying physics is only minimally accounted for in terms of the
composition of the predictor sets and the user-defined prior of the splitting
rules (which are embedded in the tree structure variable,

However, in compensation, we avoid one disadvantage of the application of
physically based models in the case of ungauged watersheds. The available
data at the ungauged watershed are limited, and it is unrealistic to expect
that certain watershed characteristics should be known. Data availability
could hinder the implementation of powerful hydrologic models

Note that in this study there is no intention to show the superiority of
either the data-driven or physically based approaches. As

As shown above, BART offers an elegant way to account for the parameter uncertainty of an additive ensemble tree model. However, uncertainty exists not only for the model parameters but also for the models themselves, i.e., model structure uncertainty. A significant source of model structure uncertainty for BART could be the composition of the vector of predictors. Model structure uncertainty can be accounted for by proposing a prior probability mass function over plausible BART models, which can then be evaluated and compared with each other. In the present study, we accomplish this with a proposal–comparison procedure, which we term the nested tree-based modeling approach. The details are as follows.

We start by proposing

Schematic diagrams of an example of nesting two BART models under a simple two-leveled CART model, using only one predictor. The partitioning rule is expressed in the diamond box, and the leaves are represented in blue boxes.
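The nesting in this example can be sketched in a few lines of Python. The sketch assumes a single split on one predictor and uses two plain functions as stand-ins for fitted BART models; `nested_predict`, `model_a`, `model_b`, and the split value are hypothetical names and numbers chosen for illustration only.

```python
def nested_predict(x, split_value, model_low, model_high):
    """Route a watershed to one of two nested models using a single split rule."""
    # Partitioning rule (the diamond box): compare one predictor to a threshold.
    chosen = model_low if x < split_value else model_high
    # Leaf: the model nested under that leaf supplies the estimate.
    return chosen(x)

# Hypothetical stand-ins for two fitted BART models.
def model_a(x):
    return 0.5 * x

def model_b(x):
    return 2.0 - 0.1 * x

estimate_low = nested_predict(0.8, split_value=1.0, model_low=model_a, model_high=model_b)
estimate_high = nested_predict(1.5, split_value=1.0, model_low=model_a, model_high=model_b)
```

In practice each leaf would return a predictive distribution rather than a single number, but the routing logic is the same.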

Up to this point, we have introduced the nested tree-based modeling approach,
which is general and data-driven. For estimation purposes, one would be
interested in accounting for model structure uncertainty by averaging the
estimates over

To facilitate the interpretation of the variation of

The lower level is termed the

The higher level is the

The study area includes

Having explained the hypothesis of hierarchical similarity, now suppose that
we have gone through the process described in Sect.

In this case study, we are going to apply the methodology described in
Sects.

The conterminous US can be divided into eight major river basins (MRBs), each
of which consists of thousands of watersheds

In 2002, annual groundwater recharge at each watershed was estimated via
baseflow analyses by the US Geological Survey (USGS)

Histograms of

The more arid US Midwest may have more pronounced localized recharge

At each watershed included in the study, the following data are retrieved
from publicly available databases: the long-term average annual
precipitation (

The annual recharge data (in volume of water per unit watershed area) can be
normalized by

We also consider various non-climate watershed characteristics in this study, including topography, land cover, soil properties, and geology. The land cover is based on data published in 2001, which we consider close enough to 2002 to be representative. The other characteristics are based on raw data obtained in various years before 2002; they are assumed to remain steady at sub-century timescales. We provide the details of these watershed characteristics in the following subsections.

Watershed topography predictors.

Land cover classification by NLCD2001.

The topographic predictors are taken from publicly available databases

The soil property predictors include watershed-scale statistics (e.g.,
average, upper bound, and lower bound) of soil properties

Soil property predictors.

The geology predictors used in this study were retrieved from publicly
available databases

This section explains the setup of the holdout method specific to the case study, as well as the partitioning of the predictors into various subsets in order to evaluate the effects of different predictors.

Because we cannot evaluate the predictive accuracy at real ungauged
watersheds (due to the lack of in situ data to compare against), we adopt the
holdout method to partition the watersheds described in
Sect.

In this study, we define the watersheds in MRB 1 as the testing watersheds
and the watersheds in MRB 2 as the training watersheds. The ex situ data
(i.e., data in MRB 2) are used to fit multiple BART models, which are then
used to obtain predictive distributions of LNR at all the testing watersheds.
There are two reasons for this MRB-based data partitioning.

For reasons touched on in Sect.

Considering the distributions of LNR (Fig.

As mentioned in Sect.

Note that the determination of the six predictor sets is guided by a
conceptual division of predictors and the idea of testing the relative
importance of different categories of predictors under different conditions,
instead of aiming for high accuracy and precision. Therefore, by no means is
Table

Table of the six different predictor sets.

In addition to the six BART models, we also build a simple model by using the
estimated distribution of LNR at the training watersheds via kernel density
estimation
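A minimal sketch of such a kernel-density model is given below. It assumes a Gaussian kernel and a normal-reference rule-of-thumb bandwidth, neither of which is stated in this paper; the function name `gaussian_kde_logpdf` and the training values are hypothetical.

```python
import math
import statistics

def gaussian_kde_logpdf(train_values, y, bandwidth=None):
    """Log density at y of a Gaussian kernel density estimate built from train_values."""
    n = len(train_values)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed bandwidth choice).
        bandwidth = 1.06 * statistics.stdev(train_values) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((y - v) / bandwidth) ** 2) for v in train_values)
    # Note: total underflows to zero for y far outside the training range.
    return math.log(total / norm)

# Hypothetical LNR values at training watersheds.
train_lnr = [-2.1, -1.7, -1.5, -1.4, -1.2, -1.0, -0.9, -0.6]

# The same density is used at every testing watershed, regardless of its predictors.
log_density_typical = gaussian_kde_logpdf(train_lnr, -1.3)
log_density_unusual = gaussian_kde_logpdf(train_lnr, 1.0)
```

Because this model ignores all watershed characteristics, it gives a floor against which the value added by conditioning on predictors can be judged.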

As mentioned in Sect.

In this study, two different accuracy metrics are adopted. The first is the
root mean squared error (RMSE), defined as

The second metric is the median log predictive probability density (LPD) at
the value of LNR observation, defined as

In addition to accuracy, we also quantify the predictive uncertainty. This is
done by first recognizing the two components of uncertainty for the

the posterior variance of

As discussed above, we built six BART models (Table

The following subsections present the effects of different predictor sets on predictive accuracy and uncertainty.

The effect of regionalization with the different predictor sets on predictive
uncertainty is shown in Fig.

The box plots of the estimate variances at the testing
watersheds

The predictive variance (Fig.

The total predictive variance (Fig.

The effect of regionalization with the different predictor sets on RMSE is
shown in Fig.

Regardless of

The box plot of the RMSE of the benchmark model at the testing
watersheds

The effect of regionalization with different predictor sets on LPD is shown
in Fig.

The box plot of the LPD of the benchmark model at the testing
watersheds

Looking at Figs.

Over-conditioning can occur when model fitting or model calibration leads to
constrained parameters that are, in fact, subject to different forms of model
uncertainty

An example of over-conditioning: the probability density at the true value (indicated by the red vertical line) under the over-conditioned distribution is no higher than under the non-informative or the weakly informative distribution. This is not because the conditioning fails, but because the variance of the distribution is reduced disproportionately.
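The effect can be reproduced numerically with three normal distributions. In the Python sketch below, all means and standard deviations are made-up values chosen for illustration, and the distributions are assumed normal: the over-conditioned distribution has the mean closest to the true value, yet the lowest log density there, because its spread is too small.

```python
import math

def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution N(mean, sd^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2

y_true = 0.0

# (mean, standard deviation) of three hypothetical predictive distributions.
candidates = {
    "non-informative": (1.5, 3.0),
    "weakly informative": (0.8, 1.0),
    "over-conditioned": (0.5, 0.1),
}

log_density = {
    name: normal_logpdf(y_true, mean, sd) for name, (mean, sd) in candidates.items()
}
ranking = sorted(log_density, key=log_density.get, reverse=True)
```

Here the ranking by log density places the weakly informative distribution first and the over-conditioned one last, even though a ranking by distance between mean and truth would reverse the latter.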

The box plots in Figs.

To investigate this further, we give each testing watershed two labels: the
model with the lowest RMSE and the model with the highest LPD; we refer to
these labels as the RMSE labels and the LPD labels, respectively. The
possible values of each label include

CART model classifying the RMSE labels of the testing watersheds.
Splitting rules are shown in white nodes, while leaf nodes are colored based
on the classification results. For each leaf node, the brightness of the
coded color indicates the node impurity (the brighter the more impure), where
impurity is defined as the probability that two randomly chosen watersheds
within the node have different labels. On top of every node, in brackets, is
the node number, provided for convenient referencing. The predictors in the
splitting rules are expressed in code names for convenience; a reference
table is provided in the upper right. For each leaf node, the model of the
highest multinomial probability of having the best performance is shown
first, which also determines the classification result, followed by the model
of the second highest probability, also to indicate the impurity. Underneath
each leaf node box is the number of watersheds belonging to the leaf. Note
that the legend does not include benchmark because the benchmark model is
never the best-performing model at any testing watershed.
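The impurity measure described in this caption coincides with the Gini impurity if the two watersheds are drawn with replacement (an assumption made here for illustration). A minimal Python sketch, with a hypothetical function name and made-up labels:

```python
from collections import Counter

def node_impurity(labels):
    """Probability that two watersheds drawn with replacement carry different labels."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical leaf: best-performing model label at each of six watersheds.
leaf_labels = ["climate", "climate", "climate", "climate", "soil", "geology"]
impurity = node_impurity(leaf_labels)

# The most frequent label gives the classification result of the leaf.
leaf_class = Counter(leaf_labels).most_common(1)[0][0]
```

A pure leaf, in which every watershed shares the same best-performing model, has an impurity of zero.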

Figure

Further down the classification tree, watersheds with lower AWC are
classified roughly as arid or humid watersheds by the long-term aridity
index. For the more humid watersheds (Fig.

Node 14 is a small but unique cluster, featuring watersheds that have low
AWC, are humid, and have relatively homogeneous paragneiss and/or schist
bedrock. Both of these bedrock types belong to the category of crystalline
rock and often feature layering in a particular orientation. The groundwater
movement in such a rock formation often depends on foliation, i.e., the
tendency of rock to break along approximately parallel surfaces, which affects the direction of
the regional groundwater flow

Node 13 features watersheds that have low AWC, are humid, are not dominated
by homogeneous paragneiss and/or schist, have a relatively steep average
slope, and have a large amount of annual precipitation. The low aridity is
primarily driven by precipitation rather than evapotranspiration. In fact,
these watersheds are mostly outliers featuring an extremely low aridity index
(below 0.65) due to ample precipitation. Under such conditions,
evapotranspiration is expected to operate to its full potential; i.e., it is
shifting from a water-limited state to an energy-limited and
canopy-controlled state. In addition, as evapotranspiration is near its full
potential, the drainage of the excess precipitation would be controlled by
the topography of the watershed (e.g., the slope and the sinuosity of the
stream). Fast drainage leaves less water available for infiltration and
recharge, and vice versa. Consequently, land cover type and topography
start to play a dominant role in hydrologic similarity. Node 20 is also
noteworthy here. It features watersheds that are relatively humid
among the arid watersheds (

On the other side of the tree (Fig.

The classification of the LPD labels is shown in Fig.

Same as Fig.

RMSE and LPD represent views of predictive accuracy in an estimation problem
and a simulation problem, respectively. Intuitively, if one only considers
unimodal predictive distributions with limited skewness, a high predictive
density at a value directly implies that the central tendency of the
distribution is close to that value. However, the reverse is not necessarily true: either
overestimation or underestimation of the variance can lead to low
predictive density, even if the mean is close to the target value (e.g.,
Fig.

Fortunately, regardless of the metric of predictive accuracy, in both
Figs.

In this section, we revisit the two research objectives pointed out in
Sect.

The nested tree-based modeling approach proposed in this study is essentially
a coupling of BART and CART. As demonstrated in Sect.

We begin by explaining two significant advantages of the
nested tree-based modeling approach. First, the greatest advantage of
BART (as mentioned in Sect.

How do these two advantages of the nested tree-based modeling
approach justify its use at ungauged watersheds? First, the
performance of any model depends on the quality and quantity of the training
data. In this sense all modeling approaches are alike, and applying BART
does not disproportionately enhance predictive accuracy when the data are
limited. What sets BART apart is its Bayesian treatment, which represents
model parameter uncertainty as a full conditional
distribution rather than as a few point estimates
or a few posterior statistics. Second, uncertainty exists not only for the
model parameters, but also for the models themselves. The nested tree-based
modeling approach can help us obtain an informed empirical probability mass
function,

One may then ask how a modeler would make an informed proposal of plausible
BART models in the first place. This is where physical knowledge comes into
play, and the proposal is indeed case specific. This is why we proposed the
hypothesis of hierarchical similarity, which can be integrated with the
nested tree-based modeling approach to study the behavior of a dynamic
hydrologic similarity system, as demonstrated in the case study.
Unlike the generality and the merits of the nested tree-based modeling
approach, our findings regarding the variation of

With BART's ability to simultaneously model nonlinear and/or interaction effects and present uncertainty in a fully Bayesian fashion, we are able to show how the controlling factors of hydrologic similarity vary among different watersheds, among different conditions, and among different accuracy metrics. These are all manifested in the case study under the context of the hierarchical similarity hypothesis.

Climate variables have been identified as the dominant factors in previous
studies (see Sect.

The details of the hierarchical similarity are inferred from the data in the
fashion of supervised machine learning, using six BART models and one
benchmark model nested under one classification tree. It is of great
importance to have two levels in such a system, as it allows for
identification of the shifts of dominant factors under different conditions.
These shifts indicate shifts in dominant physical processes, as exemplified
by nodes 13 and 20 in Fig.

Here, we provide discussions about the limitations of the case study from the aspects of the data set, the target response, and the partitioning of data.

A major limitation of the case study is that the target hydrologic response
is the logit normalized watershed-averaged annual groundwater recharge. This
is a large-scale, spatiotemporally homogenized response, and in this study
the data were based on baseflow analyses. We therefore made a working assumption
about the reliability of the baseflow analysis without rigorous
proof (see Sect.

Although we tried to justify the MRB-based partitioning by the reasons listed
in Sect.

Another case of lack of data coverage can be found in our climate predictor
data. Since the aridity index is the ratio of potential evapotranspiration to
precipitation (

Although this might have been avoidable by using a more sophisticated design
of cross-validation, we kept the MRB-based holdout method on purpose. In
addition to the reasons that were explained in Sect.

Distributions of

Another limitation is the lack of temporal coverage. Given the limited data
coverage along the time axis, the case study considered the LNR only in
2002, with two types of climate predictors: those
from the same year and those from the long-term average. However, because the
recharge process is highly nonlinear, it is possible that
predictors representing antecedent conditions, such as precipitation from
years prior to 2002, could affect the LNR in 2002.
Not having multiple years of climate data prevents us from testing the
effects of antecedent conditions or effects that take place at various
multi-year scales, which is clearly a limitation of the case study.
Because of this limitation, we made a steady-state working assumption
(mentioned in Sect.

The proposal of plausible BART models was guided by a conceptual
understanding and grouping of the available predictors. As mentioned in
Sect.

In this work, we proposed a nested tree-based modeling approach with three key features: (1) full Bayesian quantification of parameter uncertainty, (2) nonlinear regression in order to model the predictor–response relationship, and (3) proposal–comparison-based consideration of model structure uncertainty. We applied the nested tree-based modeling approach to obtain logit normalized recharge estimates conditioned on ex situ data at ungauged watersheds in a case study in the eastern US. We hypothesized a hierarchical similarity to explain the variation of the probability mass function of plausible models, and thus to investigate the behavior of a dynamic hydrologic similarity system.

The findings of this study contribute to the understanding of the physical principles governing robust regionalization among watersheds. First, consistent with previous studies, we found that climate variables are on average the most important controlling factors of hydrologic similarity at regional and annual scales, which means that a climate-based regionalization technique is on average more likely to result in better estimates. However, with our hierarchical similarity hypothesis we revealed certain conditions under which non-climate variables become more dominant than climate variables. In particular, we demonstrated how soil available water content stood out as the pivotal indicator of the variable importance of aridity in hydrologic similarity. Moreover, we showed that with hierarchical similarity one could identify shifts in dominant physical processes that are reflected in shifts in the controlling factors of hydrologic similarity under different conditions, such as water-limited evapotranspiration versus energy-limited evapotranspiration, or homogeneous and foliated bedrock versus heterogeneous bedrock. As the controlling factors change from one condition to another, the suitable regionalization technique also changes. We demonstrated how the hierarchical similarity hypothesis could indicate mechanisms by which available water content, aridity, and other watershed characteristics dynamically affect hydrologic similarity. The nested tree-based modeling approach can be applied to identify plausible sets of watershed characteristics to be considered in the regionalization process.

The contributions of this study may be viewed differently depending on
individual cases. In a situation where groundwater recharge is the ultimate
target variable at ungauged watersheds, the nested tree-based modeling
approach offers a systematic way to obtain informative predictive
distributions that are conditioned on ex situ data. In a different case,
where recharge estimation at ungauged watersheds is but one component of a
greater project, the aforementioned informative predictive distributions can
be treated as informative ex situ priors, which could be further updated
and/or integrated into simulation-based stochastic analyses where recharge is
an input/component of other models/functions. At ungauged watersheds that
will become gauged in the foreseeable future, the informative predictive
distributions again serve as informative ex situ priors that could guide the
design of the sampling campaign, as different recharge flux magnitudes
require different quantifying techniques

The data used in this study are from publicly
available databases. The potential evapotranspiration data are from the
ENVIREM database, which can be accessed at

CFC designed the study, performed the analyses, and prepared the manuscript under the supervision of YR.

The authors declare that they have no conflict of interest.

For this study, Ching-Fu Chang was financially supported by the Jane Lewis Fellowship from the University of California, Berkeley. The authors thank Sally Thompson and Chris Paciorek for the inspiration of this study. The authors also appreciate the helpful comments from two anonymous reviewers.

This paper was edited by Mauro Giudici and reviewed by two anonymous referees.