Technical Note: Improved sampling of behavioral subsurface flow model parameters using active subspaces

In global sensitivity analysis and ensemble-based model calibration, it is essential to create a large enough sample of model simulations with different parameters that all yield plausible model results. This can be difficult if a priori plausible parameter combinations frequently yield non-behavioral model results. In a previous study (Erdal and Cirpka, 2019), we developed and tested a parametersampling scheme based on active-subspace decomposition. While in principle this scheme worked well, it still implied testing a substantial fraction of parameter combinations that ultimately had to be discarded because of implausible model results. This technical note presents an improved sampling scheme and illustrates its simplicity and efficiency by a small test case. The new sampling scheme can be tuned to either outperform the original implementation by improving the sampling efficiency while maintaining the accuracy of the result or by improving the accuracy of the result while maintaining the sampling efficiency.

A key issue when conducting a global sensitivity analysis is the requirement of a large enough sample of model simulations with parameters ranging over the full parameter space. Simulations showing unrealistic behavior (e.g., wells or rivers running dry in the model, while they in reality always have water) should be removed from the sample. Already in moderately complex models this may result in many model trials that must be discarded on the level of a plausibility check. This leads to the contradictory requirements of sampling the entire space of parameters defined by preset wide margins to capture the entire distribution while exploring only the part of the parameter space yielding plausible results. One way of easing the computational burden is to make use of a simpler model (i.e., surrogate/proxy/emulator model), discussed, e.g., in the comprehensive reviews of Ratto et al. (2012), Razavi et al. (2012), Asher et al. (2015), and Rajabi (2019). A common sampling approach is to use a two-stage acceptance sampling scheme, in which a candidate parameter set is first tested with the surrogate model, and only if the surrogate model predicts the parameter set to be behavioral, it is applied in the full model. This idea has been applied to groundwater modeling by Cui et al. (2011), Laloy et al. (2013, and the authors of the current study (Erdal and Cirpka, 2019). In our last study, we used a response surface fitted to the first two active subspaces as the surrogate model in a sampling scheme for a subsurface catchment-scale flow model. The scope of the current technical note is to present an improvement of this scheme and compare it to the original one.

Methods
In the following subsections we briefly describe the activesubspace method and the base flow model. More details are given by Erdal and Cirpka (2019).

Active subspaces
In this section we briefly repeat the basic derivation of active subspaces for a generic function f (x), in whichx is the vector of scaled parameters x with a scaling to the range between 0 and 1. An active subspace is defined by the eigenvectors of the following matrix C, computed from the partial derivatives of f with respect tox i , evaluated over the entire parameter space (Constantine et al., 2014), here shown with its eigendecomposition and Monte Carlo approximation Constantine and Diaz, 2017): in which ⊗ denotes the matrix product, ρ is a probability density function, the integration is performed over the entire parameter space, W is the matrix of eigenvectors, is the diagonal matrix of the corresponding eigenvalues, and M is the number of samples used. The n-dimensional active subspace is spanned by the eigenvectors with the n highest eigenvalues. In our application, we use n = 2 as we could detect very little improvement with higher numbers. In a global sensitivity analysis using active subspaces, the activity score a i of parameter i is defined by in which λ j is the j th eigenvalue and w i,j the element relating to parameter i in the j th eigenvector. In the following, we consider the square root of the activity score to obtain a quantity that has the same unit as the target variable f . It should be noted that there are different global sensitivity methods with different metrics that may give different results (e.g., Razavi and Gupta, 2015;Dell'Oca et al., 2017). In principle, nothing speaks against computing another global-sensitivity metric for the sample selected by our active-subspace-based sampling scheme, as long as computing the metric is based on a random sample. For practical reasons and for a direct comparison with our previous work, we use the activity score in the present study. For the interested reader, a longer discussion about the current metric in relation to the specific appli- cation is given by Erdal and Cirpka (2019), and more general discussions have been presented by Saltelli et al. (2008), Song et al. (2015), and Pianosi et al. (2016), among others.

Model application
In our application we consider a model of the small Käsbach catchment in southwest Germany. The model has 32 unknown parameters, including material properties, boundarycondition values, and geometrical parameters of subsurface zones. Originally, Erdal and Cirpka (2019) simulated subsurface flow in the domain using the model software Hy-droGeoSphere (Aquanty Inc., 2015), which solves the 3-D Richards equation, here using the Mualem-van Genuchten (Van Genuchten, 1980) parameterization for unsaturated flow. Figure 1 illustrates the model domain. Details, including the governing equations, are given by in Erdal and Cirpka (2019). In a related study, we constructed a surrogate model using Gaussian process emulation (GPE) from roughly 4000 parameter sets. In the GPE model, the model response f (x i ) at the scaled parameter location x i is constructed by interpolation from the existing set of parameter realizations using kriging in parameter space with optimized statistical parameters. The GPE model is constructed with the Small Toolbox for Kriging (Bect et al., 2017). In the present work, we use the GPE model instead of the full HydroGeoSphere flow model as our virtually true model response. The prime reason for this is that we can perform pure Monte Carlo sampling of behavioral parameter sets with the GPE model, requiring about 600 000 model evaluations to create a set of 3000 behavioral parameter sets, which would be unfeasible with the original HydroGeoSphere model. That is, we use a surrogate model (the GPE model) to judge the performance of other surrogate models (based on active-subspace decomposition) in creating ensembles of plausible parameter sets. In order to avoid confusion we would like to point out that, in this paper, the term full-flow model means the GPE model, while the term surrogate model is, outside of this paragraph, ex-clusively used for the surrogate model used to improve the sampling schemes.
Like in our prior work (Erdal and Cirpka, 2019), the model considers five observations that define acceptable behavioral performance (for locations see Fig. 1).
-Limited flooding: maximum of 2×10 −3 m 3 s −1 of water leaving the domain on the top but outside of the streams.
-Division of water: between 25 % and 60 % of incoming recharge leaves the domain via the streams.
With the aim of keeping this technical note rather concise, we will not discuss individual parameters or their meaning in the model. To this end, we address all parameters by a parameter index (1-32) instead of a name, and the resulting histograms refer to the scaled parameters, ranging from 0 to 1.

Sampling schemes using active-subspace decomposition
The basic idea of using a surrogate-assisted sampling scheme is to use the (very fast) surrogate model to first evaluate a candidate parameter set. If the surrogate model predicts the parameter set to be behavioral, it is stage-1 accepted and will be run with the full model. If accepted also after running the full model, a parameter set is stage-2 accepted. Only the stage-2 accepted parameter sets are used in the global sensitivity analysis, whereas the stage-1 accepted ones are used to improve the surrogate model. Hence, one of the beauties of the surrogate-assisted sampling is its ability to quickly discharge large quantities of non-behavioral parameter set without running the full-flow model for each one (i.e., stage-1 rejected samples). Also, as the surrogate model is only used as a preselection filter, all results and the training of the surrogate model are based exclusively on full-flow model simulations.
For each observation considered, we need to perform an active-subspace decomposition. In our previous work (Erdal and Cirpka, 2019), a decision on whether to accept or reject a parameter set is made in the following way: 1. A third-order polynomial surface is fitted in the active subspace spanned by the two major active variables.
2. These polynomial surfaces are used to predict the observations of a candidate parameter set.
3a. If all predicted observations are acceptable, the candidate is stage-1 accepted.
3b. If any predicted observation is between the acceptance point and a user-defined outer point, we assign a probability of being stage-1 accepted by linear interpolation between 0 (at the outer point) and 1 (at the acceptance point), draw a random number from a uniform distribution, and accept the parameter set as stage-1 if the assigned probability is larger than the random number.
3c. If any predicted observation is outside of the outer point, we reject the sample, draw a new candidate, and return to (2).
4. After adding 100 stage-1 accepted parameter sets, we recalculate the active subspace using all stage-1 accepted parameter sets collected to this point. Hence, the surrogate model is based on all currently available fullflow model simulations.
Two critical points can be seen with this scheme. First, the polynomial surface is fitted through all stage-1 accepted points across the entire parameter space. However, locally, where we wish to make a prediction, it could still be strongly biased. Second, the user needs to prescribe the outer points, which should not only cover our uncertainty about the acceptance point but also implicitly addresses the error by using the active-subspace decomposition. As we project 32 dimensions to two, the potential for an imperfect decomposition is rather high (that is, two close points in active subspace may have different behavioral status). As we have no rigorous and yet simple method to address this uncertainty, the choice of the outer point becomes fairly subjective.
To overcome these issues, we here suggest a modified sampling scheme, with fewer tuning parameters and less sensitivity to local biases. As with the original scheme, we require one active-subspace decomposition per observation and use the first two active variables to create the 2-D active subspace. As in the original sampling scheme, we start with a set of 50 candidate parameter sets, sampled using a Latin hypercube setup, which are per definition directly stage-1 accepted. Hence we run the full-flow model 50 times to initialize the sampling scheme. The actual number is not critical and should be chosen with consideration to the number of unknown parameters. The new sampling scheme then proceeds as follows: 1. The candidate parameter set is projected into the active subspace.
2. The closest neighbors in the active subspace are sought. In this work we use the five closest neighbors plus all neighbors that fall within an ellipse around the candidate point that has a radius of 1 % of the total range of each active subspace, in each of the two dimensions. The number of neighbors selected and the radius of the ellipse are tuning parameters, here chosen based on a few prior tests. However, we believe they are applica- ble also for other applications, at the very least as good starting points.
3. For each observation, a candidate parameter set is preaccepted if a certain ratio (P ) of its neighbors are behavioral (i.e., stage-2 accepted).

The candidate parameter set is stage-1 accepted if it was
preaccepted for all observations, otherwise it is rejected.
5. If rejected, draw a new candidate parameter set and return to (1).
Like before, we recalculate the active subspace after adding 100 stage-1 accepted parameter sets. The two approaches are illustrated in Fig. 2, although just for a 1-D illustrative example. As can be seen in the figure, the original sampling scheme suggests that the candidate is behavioral (red dot is above the red line). With the new sampling scheme, on the other hand, it becomes a matter of the P value chosen. At P = 0.15 and P = 0.55, the candidate would have been stage-1 accepted (60 % of the green dots are behavioral), while at P = 0.75 the candidate would have been rejected. In this work, we consider the ratios P = 0.15, P = 0.55, and P = 0.75 and compare the performance of the sampling scheme with that used in the previous study (Erdal and Cirpka, 2019). Figure 3 shows the acceptance ratios (number of stage-2 accepted samples divided by the number of stage-1 accepted samples) for the original sampling scheme and the new sampling scheme with three different P values, together with a pure Monte Carlo sampler without preselection, applied to the Käsbach GPE model with 32 parameters. As can be seen, the new scheme with P = 0.75 is the fastest, while the original scheme and the new scheme with P = 0.15 show rather comparable behavior with lower acceptance rates. For comparison, the pure Monte Carlo sampling has an acceptance ratio of ≈ 0.005. It should be noted here that the acceptance ratio as a statistic only shows the ratio between the runs that are behavioral after running the full-flow model (stage-2 accepted) versus the number of full-flow model runs (stage-1 accepted). This, however, does not reflect the number of stage-1 rejected parameter sets, which is not reported in this work but is by far the largest for the higher P values. Hence, the acceptance ratio is a measure of computational efficiency rather than a measure of search efficiency (which here is simple Monte Carlo and, hence, comparably inefficient).

Results and discussion
While high acceptance rates are favorable in light of computational efficiency, we also want to avoid introducing a bias by the preselection scheme. We evaluate such bias by considering the marginal parameter distributions of the stage-2 accepted samples, which should agree with the distribution obtained by the (inefficient) pure Monte Carlo sampler. Fig-Figure 4. Histograms of the three parameters with the most complicated posterior marginal distributions. Each row shows a parameter and each column a sampling scheme. Blue bars: histograms from pure Monte Carlo sampling (i.e., true distribution); brown bars: sampling schemes with preselection; numbers: Cramér-von Mises metric ω 2 for the distance between the two distributions, here shown multiplied by 1000 for increased readability. ure 4 shows the resulting histograms for the three parameters with the most complex marginal distributions. We quantified the agreement of the marginal distributions of the sampling schemes with preselection and the pure Monte Carlo sampling by the Cramér-von Mises metric ω 2 : in whichP ss (x i ) is the marginal cumulative probability of the scaled parameterx i for a tested sampling scheme, and P MC (x i ) is the same quantity for pure Monte Carlo sampling.
The corresponding values of ω 2 are reported in the subplots of Fig. 4. From the histograms in Fig. 4 and the values of the Cramér-von Mises metric ω 2 , it becomes obvious that the fast new sampling with P = 0.75 results in marginal distributions that significantly differ from those of the unbiased pure Monte Carlo scheme. The new scheme with P = 0.55 results in marginal distributions that are comparable to those of the original scheme but that have been achieved by a sampling scheme with twice the acceptance rate and thus half the computational effort. By contrast, the new scheme with P = 0.15, which caused a computational effort similar to the original scheme, resulted in a marginal posterior distribution that is very similar to that obtained by pure Monte Carlo sampling. Hence, we can conclude that the proposed sampling scheme is superior to the old one: either it has much better sampling accuracy for the same efficiency (P = 0.15), or it has a much better efficiency with a very comparable accuracy (P = 0.55). While it may seem counterintuitive that the highest P value gets the highest acceptance ratio and the poorest match of the marginal distributions, it is worth noting that a higher P value means that the requirement for stage-1 acceptance is higher. Hence, at high P values we only sample the interior of the behavioral parameter space and avoid the boundaries where the behavioral status of a candidate parameter set is more uncertain. This results in the bias clearly seen in Fig. 4. Figure 5 shows the square root of the activity score for a selected target variable, computed by the active-subspacebased global sensitivity analysis and using the different sam-D. Erdal and O. A. Cirpka: Improved sampling using active subspaces pling schemes, which confirms the impression of the histograms shown in Fig. 4. The pure-MC scheme and the new scheme with P = 0.15 show almost identical activity scores, while the score patterns increasingly differ with increasing P values. Similarly, the original sampling scheme differed in the activity scores compared to the pure-MC scheme. Nonetheless, all sampling schemes correctly identified the two most important parameters and the correct set of the 10 most important parameters. That the order of the parameters within the set of the most important parameters is not captured by the faster sampling schemes may be an acceptable tradeoff between speed and accuracy, depending on the individual application. Based on the experience gained within this project, a recommended starting P value for our presented sampling scheme is P = 0.55.
In the current study, we have used Gaussian process emulation (GPE) as a proxy of the full HydroGeoSphere model, putting the question forward whether a GPE model could not also be used as surrogate model for preselection in an advanced sampling scheme. This is indeed possible, and we are currently developing such schemes, achieving acceptance ratios between 70 % and 90 %. Hence, GPE-based sampling schemes can be notably more efficient than the new scheme presented in this work. Nonetheless, we see a clear value in using the less efficient active-subspace-based sampling schemes. The key word is simplicity. The full active subspace-sampling scheme is implemented in-house, and the most complicated step is likely the eigenvalue decomposition, which is a standard tool in any programming environment. Hence, we have full control over the entire selection procedure. Further, the active-subspace-based sampling scheme presented here has a single tuning coefficient P with an easily comprehensible meaning, and the resulting active subspace can easily be visualized for an intuitive understanding of the method. This is quite different with GPE-based methods, which require choosing a covariance function in parameter space with coefficients that need to be estimated from the current set of training data. In our application, we have 32 original parameters, requiring one variance and 32 integral scales as covariance coefficients to be estimated every time the GPE model is retrained. Estimating 33 covariance parameters from O(1000) parameter sets is time consuming, and the integral scales in nonsensitive parameter directions are not well constrained by the data at all. Finally, to train a GPE model we need to rely on third-party codes which remain black boxes to a large extent and usually involve a rather decent amount of work until they do what they are supposed to do. Hence, we clearly see a benefit of using the simpler active-subspace-based sampling schemes even if they are computationally less efficient.

Conclusions
In this work we have presented an improved sampling scheme to obtain ensembles of parameter sets that lead to plausible model results. Like in the preceding study of Erdal and Cirpka (2019), the sampling scheme makes use of an active-subspace-based preselection scheme that reduces the number of full model runs that need to be discarded. In contrast to the preceding method, we do not perform a polynomial fit over the entire parameter space anymore nor have to set fuzzy boundaries of the target variables to define the behavioral status. Instead, the preselection of a parameter set is simply based on the behavior of surrounding trial solutions. The new scheme outperforms the preceding one either by achieving a higher accuracy in the resulting posterior parameter distributions for the same sampling efficiency or by having a much higher sampling efficiency for a comparable accuracy. We hence conclude that the new scheme presented here should be used instead of the original one.
Author contributions. Simulations and code development were performed by DE. Both authors contributed to developing and writing the paper. OAC was responsible for acquisition of the funding.
Competing interests. The authors declare that they have no conflict of interest. This open-access publication was funded by the University of Tübingen.
Review statement. This paper was edited by Monica Riva and reviewed by two anonymous referees.