the Creative Commons Attribution 4.0 License.
the Creative Commons Attribution 4.0 License.
Datadriven estimates for the geostatistical characterization of subsurface hydraulic properties
Falk Heße
Sebastian Müller
Sabine Attinger
Abstract. The geostatistical characterization of the subsurface is confronted with the double challenge of large uncertainties and high exploration costs. Making use of all available data sources is consequently very important. Bayesian inference is able to mitigate uncertainties in such a data scarce context by drawing on available background information in form of a prior distribution. To make such a prior distribution transparent and objective, it should be calibrated against a data set containing estimates of the target variable from available sites. In this study, we provide a collection of covariance/variogram functions of the subsurface hydraulic parameters from a large number of sites. We analyze this data set by fitting a number of widely used variogram model functions and show how they can be used to derive prior distributions of the parameters of said functions. In addition, we discuss a number of conclusions that can be drawn for our analysis and possible uses for the data set.
 Preprint
(1185 KB)  Metadata XML
 BibTeX
 EndNote
Falk Heße et al.
Status: closed

RC1: 'Comment on hess202315', Anonymous Referee #1, 22 Feb 2023
The paper "Datadriven estimates for geostatistical characterization of subsurface hydraulic properties" by Hesse et al. is a nice try to build a comprehensive data set of variogram parameters that could be used as reference values or as prior distributions for cases in which there are not enough data to construct an experimental variogram from the available observations. While the proposal seems interesting, I have many reservations about the paper's contents and how the results are obtained and analyzed.
First of all, pretending to find the "universal" variogram parameters distribution is a little bit pretentious. The analysis is made from a purely mathematical/computer science perspective without considering that variograms relate to the underlying geology. In this sense, one would not expect variograms computed in fractured media to be the same as those computed in sedimentary formations. Also, and very importantly, a parameter such as hydraulic conductivity is not expected to vary in space smoothly under any circumstances; for this reason, the use of Gaussian variograms or, in general, variograms with low roughness should never be advocated (unless they are accompanied by a nugget effect that would break the smoothness imposed by those variograms).
Next, the terminology used departs sufficiently from the one used in standard geostatistical literature (see, Journel and Huijbrets, 1978; Isaaks and Srivastava, 1989; Deutsch and Journel, 1991; or Goovaerts, 1997) to make it very confusing, and, at some points incorrect.
Let's dive into more detail about my reservations. The variogram is a vectorial function that depends on the separation distance vector h; only in the case of isotropic variograms in all directions does such a function becomes a scalar one. The definition in section 2.2 and the analysis thereof is made assuming that the variogram is a scalar function depending only on the modulus of h. This is a significant conceptual flaw, which is not cleaned with the anisotropy analysis made in section 3.3. The expression of the variogram function in line 135 is incorrect. You cannot separate the variance and the nugget effect. The variance of the random function model is (n+σ^{2}) (using the notation in that section), and the covariance is the variance minus de variogram ((n+σ^{2})γ). The authors should recall that, generally, the variogram is modeled as a sum of variogram functions, each one contributing their share to the total variance (or variogram sill). They have chosen to model it as the sum of two variogram models, a nugget effect (with a contribution of n) and another parametric function (with a contribution of σ^{2}), which is perfectly OK; however, the authors should acknowledge that there could be fitting with more than two components. What is not OK is much of the later analysis in which this fact is disregarded: many of the findings are due to this fact, the fitting of a nugegt plus another variogram, and some of the conclusions are said to be due to, say, the type of parametric model used, disregarding that a nugget had been used to fit the variogram.
Very important, σ^{2} is not the variance of the process! It is only the contribution of one of the two structures fitted to the total variance.
The variogram computed as a function of the modulus of the separation vector is referred to as the omnidirectional variogram. It is generally used during the modeling process to assess the nugget effect and the sill of the variogram and to have an idea of the average range from the different directions, but it is not a variogram to be used in practice for anisotropic media. Variograms are vectorial functions, and as such, their value depends on the vector orientation reflecting the anisotropy in the correlation patterns of the variable under analysis. Such anisotropy is characterized by an ellipsoid (in 3D) or an ellipse (in 2D), which, in principle, may be arbitrarily oriented, although, in practice, it is assumed that one of the ellipsoid axes is oriented vertically and the other two in the XY plane, not necessarily parallel to the Cartesian axes. The lengths of the semiaxes of this ellipsoid are known as the anisotropy ranges and are crucial to characterizing the spatial continuity of the attribute. The authors disregard the vectorial nature of the variogram and introduce the "characteristic length" as one of the parameters of the parametric variogram components. The characteristic length, 'l' in the equations in the lines between 135 and 150, is not the range and may induce mistakes. In standard geostatistics, the range is defined as the distance at which the variogram value reaches the sill, or for variograms that are asymptotic to the sill, the distance at which the variogram reaches at least 95 % of the sill. This means the range will equal l for the spherical variogram but 3l for the Gaussian or exponential ones. Comparing the value of l for different variogram functions is like comparing two different quantities.
I am very concerned with the many published theoretical papers in which the variogram is assumed to be Gaussian. This assumption is always a matter of convenience, especially in those papers more focused on analytical approaches, disregarding the fact that Gaussian variograms make only sense for variables that vary smoothly in space, such as the thickness of a layer, but never should be used to model a parameter characterized by a very large spread with significant shortscale variability like the hydraulic conductivity.
It is not clear if the variograms analyzed are for K or logK. Line 157 mentions K, and nowhere is logK mentioned. The variogram of K may not of interest in geostatistical studies since it is common to use a lognormal random function to model the spatial variability of K. In addition, if K had been used, my criticism of using Gaussian variograms (or variograms with low roughness) is exacerbated.
Considering the significant uncertainties associated with experimental variograms, trying to distinguish whether a stable or a Matern variogram fits better than an exponential or a spherical one is meaningless. The analysis would have benefited from comparing just the spherical and exponential variograms (plus the nugget effect). Neither the Matern nor the stable models make sense when modeling hydraulic conductivity from a geological point of view. A much more meaningful fitting would have been a power variogram, which would relate to the fractal behavior claimed by Neuman.
More specific comments
What are observable base rates? (line 22)
What is a variogram cloud? (line 39, line 96, and many more times in the following pages)
The paragraph from lines 52 to 55 is meaningless.
After introducing the roughness coefficient in line 166, the authors analyze the roughness of the different parametric functions used in their variogram definitions, but they fail to recognize that as soon as a nugget effect is added, the roughness of the fitted variogram is zero. This explains why, later, some of the analyses seem to favor fitting a "Gaussian variogram" when in truth, they are fitting a variogram composed of a nugget effect plus a Gaussian component. The nugget removes the implicit roughness of the Gaussian variogram, justifying its application in practice.
When introducing section 3, the authors discuss the length scale and vertical and horizontal anisotropy. For the first time, the authors recognize that anisotropy may be an issue, but it is poorly defined. The length scale could be the larger range, and the horizontal and vertical anisotropies could be the ratios of the ranges in the other two principal directions with respect to the largest range. But this is not what the authors analyze; they focus on the range of the omnidirectional variogram, plus the ranges in some arbitrary directions in the XY plane (not necessarily in the directions of the principal ranges) and the range in the vertical direction.
In section 2.3, it would be nice to discuss the fitting procedure used to get the variogram parameters.
In the fitting algorithm, the authors should use an algorithm capable of fitting a fully vectorial function and attempt to identify not just a single omnidirectional range but actually the whole anisotropy ellipsoid (or ellipse), that is, the three (or two) principal ranges and their orientations.
The statement in line 215 does not seem to be much justified. Why should all models fit data? There may be physical reasons why one kind of variogram is more appropriate than other. In some cases, as already mentioned, some variogram models are inconsistent with the sample data (such as the Gaussian variogram without nugget effect to fit the spatial variability of hydraulic conductivity).
Line 223. The authors realize that the nugget effect is responsible for the similar results yielded but the different parametric models.
The matrices of scatterplots must be explained. What does each point in each scatterplot represent? The diagonal scatterplots make no sense unless they represent something different from the rest of the scatterplots. If they show the same parameters, all points in the diagonal scatterplots should fall in the diagonal at 45 degrees.
The end of the sentence in line 229 is very important. It is always so! The particularities of the site under analysis should drive the analysis. Trying to find the de "universal" parameter distribution is a mistake.
The second sentence in the paragraph starting at line 230 is hard to understand.
Unclear what is meant in the sentence that starts in line 251 to the end of the paragraph.
End of page 12: A power variogram could explain this behavior.
In the paragraph in line 260, the relationship between the lengths should be at the 45degree line when comparing spherical and Gaussian and at the y=3x for the comparison with the spherical variogram.
Section 3.3 makes absolutely no sense. Anisotropy is a fact and should be appropriately analyzed and accounted for. Analyzing λ_{x} and λy when they do not represent the actual principal directions of the anisotropy ellipse is useless.
The analysis of λz would make sense if compared with the largest range in the plane but not when compared with an arbitrary range calculated in the x direction.
In section 3.4, I hope that the normalization of the nugget was done with respect to the total variance (n+σ^{2}), not with regard to σ^{2}. There is no reason why n could not be larger than σ^{2}, therefore escaping from the [0,1] limits.
The nugget not only accounts for measurement errors but also for shortscale variability. The discussion in this section is poor and, at times, flawed.
Line 354 is explained because the Gaussian model needs the nugget to fit most experimental variograms, whereas the exponential does not. Consequently, the results are not an artifact of the fitting procedure.
In the opening of section 3.5, the authors forget that the length parameter is not enough; there is an anisotropy ellipse.
Why does Figure 15 not show a histogram as before?
The discussion about survivor bias is very interesting.
The quality of the figures must be improved.
I think a reanalysis of the data accounting for the mistakes made in this evaluation, probably reducing the number of variogram functions and avoiding the construction of the prior probability functions for the different models, would be worth publishing.
Citation: https://doi.org/10.5194/hess202315RC1 
AC3: 'Reply on RC1', Falk Heße, 31 May 2023
The reviewer presents long and thorough discussion of our manuscript, which raises a number of points considered critical. At least some of these more critical are, in our opinion, caused by a misunderstanding of our aims and the methodology employed here. To avoid such a misunderstanding in the future, we will revise our manuscript and explain the points more accurately. Others are important and we will address them below as well as in the revised version of the manuscript.
Due to the length of the comments and our reply, we will present a PDF where each of the reviewer's comments are reproduced, followed by our individual responses. To improve readability, the reviewer's comments are printed in black, wereas our replies are printed in blue.

AC3: 'Reply on RC1', Falk Heße, 31 May 2023

RC2: 'Comment on hess202315', Anonymous Referee #2, 12 Mar 2023
The paper presents present a datadriven approach to a classical problem in subsurface hydrology, the estimation of parameters characterizing the variogram of subsurface properties. The proposed method advocates the use of Bayesian inference to set up a prior distribution for models that describe spatial correlations (covariance or variogram). A remarkable data set is examined, and available data are classifiable into primary (point measurements), secondary (empirical variogram functions), or tertiary (statistical estimates of subsurface properties) data. Data were processed to avoid overlaps and overrepresentation.
The first available result from the manuscript is the comparison between different variogram model functions, that could be improved in my view (see comment #2). The scale dependency of the hydraulic conductivity is examined next, confirming earlier literature results. An example application of the Bayesian approach to the estimate of correlation length given the maximum length scale is then presented. Other variogram parameters examined are anisotropy, nugget effect, and shape parameter.
The discussion addresses important issues such as the unbiasedness of the data set employed, among others.
The paper looks as a mature contribution; given the topic and the type of paper, I see however some room for further improvement. Results are of interest to the readership of Hydrology and Earth System Sciences. The methods are adequate, the paper subdivision into sections sound, and the figures illustrative. I recommend minor revisions for the reason explained below.
 The manuscript examines only stationary variograms, I suggest to mention that nonstationary variograms (see, e.g.,Di Federico and Neuman, 1997) were excluded from the analysis.
 The comparison among variograms having a different numbers of degrees of freedom (section 3.1) could be rendered more qualitative by model identification criteria (AOC, AIC_c, KIC, …), incorporating the number of parameters involved and the principle of parsimony. The same holds probably for other comparisons performed.
 How crucial is the assumption of two independent Gaussian distributions in Section 3.2 ? Could they develop a more general theory without it, maybe subject to other limitations ?
Citation: https://doi.org/10.5194/hess202315RC2 
AC2: 'Reply on RC2', Falk Heße, 30 May 2023
The paper presents present a datadriven approach to a classical problem in subsurface hydrology, the estimation of parameters characterizing the variogram of subsurface properties. The proposed method advocates the use of Bayesian inference to set up a prior distribution for models that describe spatial correlations (covariance or variogram). A remarkable data set is examined, and available data are classifiable into primary (point measurements), secondary (empirical variogram functions), or tertiary (statistical estimates of subsurface properties) data. Data were processed to avoid overlaps and overrepresentation.The first available result from the manuscript is the comparison between different variogram model functions, that could be improved in my view (see comment \#2). The scale dependency of the hydraulic conductivity is examined next, confirming earlier literature results. An example application of the Bayesian approach to the estimate of correlation length given the maximum length scale is then presented. Other variogram parameters examined are anisotropy, nugget effect, and shape parameter.
The discussion addresses important issues such as the unbiasedness of the data set employed, among others.
The paper looks as a mature contribution; given the topic and the type of paper, I see however some room for further improvement. Results are of interest to the readership of Hydrology and Earth System Sciences. The methods are adequate, the paper subdivision into sections sound, and the figures illustrative. I recommend minor revisions for the reason explained below.
Repl: We appreciate the reviewers comments and overall supportive feedback on our study.
The manuscript examines only stationary variograms, I suggest to mention that nonstationary variograms (see, e.g.,Di Federico and Neuman, 1997) were excluded from the analysis.
\textcolor{blue}{Due to some comments of another reviewer, we revised our analysis and included a truncated power law variogram.}The comparison among variograms having a different numbers of degrees of freedom (section 3.1) could be rendered more qualitative by model identification criteria (AOC, AIC\_c, KIC, …), incorporating the number of parameters involved and the principle of parsimony. The same holds probably for other comparisons performed.
Reply: We agree with the reviewer that the use of a modelselection criterion can formalize the comparison we have done in the manuscript to far. Models with more degrees of freedom have more flexibility and should therefore be better able to capture an observed behavior. In case of variogram model, the Matern and the Stable model should therefore outperform the other models in terms of goodnessoffit. In the current version, we only performed a qualitative analysis by observing the overall similar accuracy and noting how this would not justify the use of a more sophisticated model like the aforementioned Matern and Stable model. Using a criterion like AOC, AIC\_c, KIC, etc. would make this analysis more quantitative.
How crucial is the assumption of two independent Gaussian distributions in Section 3.2 ? Could they develop a more general theory without it, maybe subject to other limitations?
Reply: The assumption of two different Gaussian distribution is not crucial to the approach presented there at all. In fact, the parametric model for the residuals around the regression line was chosen on the spot after visually inspecting them. As we explain in the discussion section, many different parametric models may be possible depending on the situation. In fact, if enough data points are available a completely nonparametric approach is possible as well. To summarize our approach here again, we would describe it as follows. First, the residuals around the regression line are representing the uncertainty one has with respect to the regression model. From a Bayesian perspective, the can be used to estimate a prior probability. We do this by first visually inspecting the results of a kerneldensity estimation (KDE), ie., a nonparametric estimation procedure. KDE is a powerful estimator, but it always produces very smooth densities which may bamboozle practitioners into over interpreting its results. To avoid overconfidence, we therefore only use KDE to find a good parametric model that could describe the empirically observed distribution of the residuals. In our case, a mixture mode using two Gaussian distributions seemed like a good choice. When we fitted such a model to the residuals and compared it to the KDE, we saw a excellent overlap. Of course this agreement has to be interpret with care since the KDE is not the ultimate benchmark of truth, for the reasons outline above. But having two different estimation procedure give very similar results certainly adds confidence that the both express some underlying truth. This single example is explicitly presented as a proof of concept for how to use the data provided in our study for the derivation of prior distributions in a Bayesian context. As we state in the manuscript, using other data and/or other variogram functions may lead to a somewhat different regression analysis with different residuals. In our opinion, there is probably no general theory on what parametric model, if any, to use for the description of the prior distribution. Every case may be different and practitioners are advised to use their judgement in adapting this approach to their situations. In the revised manuscript, we now explain this reasoning in more detail to better convey this important idea.

RC3: 'Comment on hess202315', Shlomo P. Neuman, 14 Mar 2023
The authors have conducted an interesting statistical analysis of raw variograms (I assume that this is what the authors mean by variogram clouds) of published data from a variety of geographic locations and geologic settings. Though the data are said to include saturated hydraulic conductivities from aquifers and shallow soils, as well as permeabilities and transmissivities, I assume (but the authors need to confirm or otherwise clarify) that the data consist of logarithms of these variables. A strength AND a weakness of the analysis is the lack of clear distinction between varied hydraulic properties (conductivity versus transmissivity) from varied settings (sediments, rocks, porous versus fractured media) measured by varied means (I presume single and multihole pressure tests, perhaps other) on varied scales of measurement. The strength of the approach is that it reveals commonalities among the data (scaling and amenability to analysis by a variety of variogram models); its weakness is lack of clarity about the way hydrogeologic settings, methods and scales of measurements, affect variogram behavior.
To me, one of the more interesting results of the analysis is confirmation of a scale effect that reveals itself, precisely, when one throws data from varied sites and settings into a single basket, as do the authors. It would therefore behove the authors to take the additional step of attempting to fit a truncated power variogram to their data, thereby potentially showing that such a variogram captures a greater range of behaviors with fewer parameters than do the stationary variograms they consider.
Not only might a truncated power variogram prove to be more general and parsimonious than do standard stationary variogram, but they would be much more suitable than the latter when one needs to predict, using a transport model, how a plume of contaminant might migrate and behave outside the domain originally used to define the variogram.
I think that with additional work along the lines I have just proposed, this paper could stand out as a much more significant contribution to the literature than it does now.
Citation: https://doi.org/10.5194/hess202315RC3 
AC1: 'Reply on RC3', Falk Heße, 30 May 2023
The authors have conducted an interesting statistical analysis of raw variograms (I assume that this is what the authors mean by variogram clouds) of published data from a variety of geographic locations and geologic settings. Though the data are said to include saturated hydraulic conductivities from aquifers and shallow soils, as well as permeabilities and transmissivities, I assume (but the authors need to confirm or otherwise clarify) that the data consist of logarithms of these variables. A strength AND a weakness of the analysis is the lack of clear distinction between varied hydraulic properties (conductivity versus transmissivity) from varied settings (sediments, rocks, porous versus fractured media) measured by varied means (I presume single and multihole pressure tests, perhaps other) on varied scales of measurement. The strength of the approach is that it reveals commonalities among the data (scaling and amenability to analysis by a variety of variogram models); its weakness is lack of clarity about the way hydrogeologic settings, methods and scales of measurements, affect variogram behavior.
Reply: We appreciate the reviewers comments and overall supportive feedback on our study. The reviewer is right that we used the logarithm of the conductivity, transmissivity and permeability values. To make this more clear, we explicitly mention this now in the revised version of the manuscript. As regards the lack of clear distinction between varied hydraulic properties, we agree with the reviewer that, as far as the analyses are concerned, this is both a strength and weakness. Any statistical analysis has to find a compromise between sample size and accuracy. Using the whole sample means that no within sample effects can be discovered. Split the sample according to different criteria, and the resulting subsamples may be too small to perform a viable statistical analysis. For our paper, we only distinguished between soil and aquifer site. Splitting for instance the aquifer data further into conductivities, transmissivites and permeabilities, would facilitate the investigation of possible differences between them. Alas, since most of data was drawn from measurements of hydraulic conductivity, the samples for transmissivites and permeabilities would have been small, thus, making comparisons more elusive. Still, the data set is openly available and practitioners interested in this topic are free to do such an analysis by themselves. The same problem of reduced sample size also applies to comparisons between, say, different settings and/or measurement techniques. This was particularly unfortunate, since we were very interested in this topic ourselves. Alas, many papers did not report these additional information, which exacerbated the aforementioned problem even further. To better highlight this problem we revised the conclusions portion of the manuscript accordingly.
To me, one of the more interesting results of the analysis is confirmation of a scale effect that reveals itself, precisely, when one throws data from varied sites and settings into a single basket, as do the authors. It would therefore behove the authors to take the additional step of attempting to fit a truncated power variogram to their data, thereby potentially showing that such a variogram captures a greater range of behaviors with fewer parameters than do the stationary variograms they consider.
Not only might a truncated power variogram prove to be more general and parsimonious than do standard stationary variogram, but they would be much more suitable than the latter when one needs to predict, using a transport model, how a plume of contaminant might migrate and behave outside the domain originally used to define the variogram.I think that with additional work along the lines I have just proposed, this paper could stand out as a much more significant contribution to the literature than it does now.
Reply: We agree that using the available data set with a truncated power variogram can help to investigate the scaling behavior of subsurface media indicated by a number of different studies. We therefore redid the analysis using this variogram function. Our results show that a truncated power law variogram achieves a similar goodness of fit compared to the other variogram functions but does not show any scale dependencies of its parameters. Depending on the application, such a lack of scale dependency can be considered an asset. We will include these results and some discussion on this topic into the manuscript.

AC1: 'Reply on RC3', Falk Heße, 30 May 2023
Status: closed

RC1: 'Comment on hess202315', Anonymous Referee #1, 22 Feb 2023
The paper "Datadriven estimates for geostatistical characterization of subsurface hydraulic properties" by Hesse et al. is a nice try to build a comprehensive data set of variogram parameters that could be used as reference values or as prior distributions for cases in which there are not enough data to construct an experimental variogram from the available observations. While the proposal seems interesting, I have many reservations about the paper's contents and how the results are obtained and analyzed.
First of all, pretending to find the "universal" variogram parameters distribution is a little bit pretentious. The analysis is made from a purely mathematical/computer science perspective without considering that variograms relate to the underlying geology. In this sense, one would not expect variograms computed in fractured media to be the same as those computed in sedimentary formations. Also, and very importantly, a parameter such as hydraulic conductivity is not expected to vary in space smoothly under any circumstances; for this reason, the use of Gaussian variograms or, in general, variograms with low roughness should never be advocated (unless they are accompanied by a nugget effect that would break the smoothness imposed by those variograms).
Next, the terminology used departs sufficiently from the one used in standard geostatistical literature (see, Journel and Huijbrets, 1978; Isaaks and Srivastava, 1989; Deutsch and Journel, 1991; or Goovaerts, 1997) to make it very confusing, and, at some points incorrect.
Let's dive into more detail about my reservations. The variogram is a vectorial function that depends on the separation distance vector h; only in the case of isotropic variograms in all directions does such a function becomes a scalar one. The definition in section 2.2 and the analysis thereof is made assuming that the variogram is a scalar function depending only on the modulus of h. This is a significant conceptual flaw, which is not cleaned with the anisotropy analysis made in section 3.3. The expression of the variogram function in line 135 is incorrect. You cannot separate the variance and the nugget effect. The variance of the random function model is (n+σ^{2}) (using the notation in that section), and the covariance is the variance minus de variogram ((n+σ^{2})γ). The authors should recall that, generally, the variogram is modeled as a sum of variogram functions, each one contributing their share to the total variance (or variogram sill). They have chosen to model it as the sum of two variogram models, a nugget effect (with a contribution of n) and another parametric function (with a contribution of σ^{2}), which is perfectly OK; however, the authors should acknowledge that there could be fitting with more than two components. What is not OK is much of the later analysis in which this fact is disregarded: many of the findings are due to this fact, the fitting of a nugegt plus another variogram, and some of the conclusions are said to be due to, say, the type of parametric model used, disregarding that a nugget had been used to fit the variogram.
Very important, σ^{2} is not the variance of the process! It is only the contribution of one of the two structures fitted to the total variance.
The variogram computed as a function of the modulus of the separation vector is referred to as the omnidirectional variogram. It is generally used during the modeling process to assess the nugget effect and the sill of the variogram and to have an idea of the average range from the different directions, but it is not a variogram to be used in practice for anisotropic media. Variograms are vectorial functions, and as such, their value depends on the vector orientation reflecting the anisotropy in the correlation patterns of the variable under analysis. Such anisotropy is characterized by an ellipsoid (in 3D) or an ellipse (in 2D), which, in principle, may be arbitrarily oriented, although, in practice, it is assumed that one of the ellipsoid axes is oriented vertically and the other two in the XY plane, not necessarily parallel to the Cartesian axes. The lengths of the semiaxes of this ellipsoid are known as the anisotropy ranges and are crucial to characterizing the spatial continuity of the attribute. The authors disregard the vectorial nature of the variogram and introduce the "characteristic length" as one of the parameters of the parametric variogram components. The characteristic length, 'l' in the equations in the lines between 135 and 150, is not the range and may induce mistakes. In standard geostatistics, the range is defined as the distance at which the variogram value reaches the sill, or for variograms that are asymptotic to the sill, the distance at which the variogram reaches at least 95 % of the sill. This means the range will equal l for the spherical variogram but 3l for the Gaussian or exponential ones. Comparing the value of l for different variogram functions is like comparing two different quantities.
I am very concerned with the many published theoretical papers in which the variogram is assumed to be Gaussian. This assumption is always a matter of convenience, especially in those papers more focused on analytical approaches, disregarding the fact that Gaussian variograms make only sense for variables that vary smoothly in space, such as the thickness of a layer, but never should be used to model a parameter characterized by a very large spread with significant shortscale variability like the hydraulic conductivity.
It is not clear if the variograms analyzed are for K or logK. Line 157 mentions K, and nowhere is logK mentioned. The variogram of K may not of interest in geostatistical studies since it is common to use a lognormal random function to model the spatial variability of K. In addition, if K had been used, my criticism of using Gaussian variograms (or variograms with low roughness) is exacerbated.
Considering the significant uncertainties associated with experimental variograms, trying to distinguish whether a stable or a Matern variogram fits better than an exponential or a spherical one is meaningless. The analysis would have benefited from comparing just the spherical and exponential variograms (plus the nugget effect). Neither the Matern nor the stable models make sense when modeling hydraulic conductivity from a geological point of view. A much more meaningful fitting would have been a power variogram, which would relate to the fractal behavior claimed by Neuman.
More specific comments
What are observable base rates? (line 22)
What is a variogram cloud? (line 39, line 96, and many more times in the following pages)
The paragraph from lines 52 to 55 is meaningless.
After introducing the roughness coefficient in line 166, the authors analyze the roughness of the different parametric functions used in their variogram definitions, but they fail to recognize that as soon as a nugget effect is added, the roughness of the fitted variogram is zero. This explains why, later, some of the analyses seem to favor fitting a "Gaussian variogram" when in truth, they are fitting a variogram composed of a nugget effect plus a Gaussian component. The nugget removes the implicit roughness of the Gaussian variogram, justifying its application in practice.
When introducing section 3, the authors discuss the length scale and vertical and horizontal anisotropy. For the first time, the authors recognize that anisotropy may be an issue, but it is poorly defined. The length scale could be the larger range, and the horizontal and vertical anisotropies could be the ratios of the ranges in the other two principal directions with respect to the largest range. But this is not what the authors analyze; they focus on the range of the omnidirectional variogram, plus the ranges in some arbitrary directions in the XY plane (not necessarily in the directions of the principal ranges) and the range in the vertical direction.
In section 2.3, it would be nice to discuss the fitting procedure used to get the variogram parameters.
In the fitting algorithm, the authors should use an algorithm capable of fitting a fully vectorial function and attempt to identify not just a single omnidirectional range but actually the whole anisotropy ellipsoid (or ellipse), that is, the three (or two) principal ranges and their orientations.
The statement in line 215 does not seem to be much justified. Why should all models fit data? There may be physical reasons why one kind of variogram is more appropriate than other. In some cases, as already mentioned, some variogram models are inconsistent with the sample data (such as the Gaussian variogram without nugget effect to fit the spatial variability of hydraulic conductivity).
Line 223. The authors realize that the nugget effect is responsible for the similar results yielded but the different parametric models.
The matrices of scatterplots must be explained. What does each point in each scatterplot represent? The diagonal scatterplots make no sense unless they represent something different from the rest of the scatterplots. If they show the same parameters, all points in the diagonal scatterplots should fall in the diagonal at 45 degrees.
The end of the sentence in line 229 is very important. It is always so! The particularities of the site under analysis should drive the analysis. Trying to find the de "universal" parameter distribution is a mistake.
The second sentence in the paragraph starting at line 230 is hard to understand.
Unclear what is meant in the sentence that starts in line 251 to the end of the paragraph.
End of page 12: A power variogram could explain this behavior.
In the paragraph in line 260, the relationship between the lengths should be at the 45degree line when comparing spherical and Gaussian and at the y=3x for the comparison with the spherical variogram.
Section 3.3 makes absolutely no sense. Anisotropy is a fact and should be appropriately analyzed and accounted for. Analyzing λ_{x} and λy when they do not represent the actual principal directions of the anisotropy ellipse is useless.
The analysis of λz would make sense if compared with the largest range in the plane but not when compared with an arbitrary range calculated in the x direction.
In section 3.4, I hope that the normalization of the nugget was done with respect to the total variance (n+σ^{2}), not with regard to σ^{2}. There is no reason why n could not be larger than σ^{2}, therefore escaping from the [0,1] limits.
The nugget not only accounts for measurement errors but also for shortscale variability. The discussion in this section is poor and, at times, flawed.
Line 354 is explained because the Gaussian model needs the nugget to fit most experimental variograms, whereas the exponential does not. Consequently, the results are not an artifact of the fitting procedure.
In the opening of section 3.5, the authors forget that the length parameter is not enough; there is an anisotropy ellipse.
Why does Figure 15 not show a histogram as before?
The discussion about survivor bias is very interesting.
The quality of the figures must be improved.
I think a reanalysis of the data accounting for the mistakes made in this evaluation, probably reducing the number of variogram functions and avoiding the construction of the prior probability functions for the different models, would be worth publishing.
Citation: https://doi.org/10.5194/hess202315RC1 
AC3: 'Reply on RC1', Falk Heße, 31 May 2023
The reviewer presents long and thorough discussion of our manuscript, which raises a number of points considered critical. At least some of these more critical are, in our opinion, caused by a misunderstanding of our aims and the methodology employed here. To avoid such a misunderstanding in the future, we will revise our manuscript and explain the points more accurately. Others are important and we will address them below as well as in the revised version of the manuscript.
Due to the length of the comments and our reply, we will present a PDF where each of the reviewer's comments are reproduced, followed by our individual responses. To improve readability, the reviewer's comments are printed in black, wereas our replies are printed in blue.

AC3: 'Reply on RC1', Falk Heße, 31 May 2023

RC2: 'Comment on hess202315', Anonymous Referee #2, 12 Mar 2023
The paper presents present a datadriven approach to a classical problem in subsurface hydrology, the estimation of parameters characterizing the variogram of subsurface properties. The proposed method advocates the use of Bayesian inference to set up a prior distribution for models that describe spatial correlations (covariance or variogram). A remarkable data set is examined, and available data are classifiable into primary (point measurements), secondary (empirical variogram functions), or tertiary (statistical estimates of subsurface properties) data. Data were processed to avoid overlaps and overrepresentation.
The first available result from the manuscript is the comparison between different variogram model functions, that could be improved in my view (see comment #2). The scale dependency of the hydraulic conductivity is examined next, confirming earlier literature results. An example application of the Bayesian approach to the estimate of correlation length given the maximum length scale is then presented. Other variogram parameters examined are anisotropy, nugget effect, and shape parameter.
The discussion addresses important issues such as the unbiasedness of the data set employed, among others.
The paper looks as a mature contribution; given the topic and the type of paper, I see however some room for further improvement. Results are of interest to the readership of Hydrology and Earth System Sciences. The methods are adequate, the paper subdivision into sections sound, and the figures illustrative. I recommend minor revisions for the reason explained below.
 The manuscript examines only stationary variograms, I suggest to mention that nonstationary variograms (see, e.g.,Di Federico and Neuman, 1997) were excluded from the analysis.
 The comparison among variograms having a different numbers of degrees of freedom (section 3.1) could be rendered more qualitative by model identification criteria (AOC, AIC_c, KIC, …), incorporating the number of parameters involved and the principle of parsimony. The same holds probably for other comparisons performed.
 How crucial is the assumption of two independent Gaussian distributions in Section 3.2 ? Could they develop a more general theory without it, maybe subject to other limitations ?
Citation: https://doi.org/10.5194/hess202315RC2 
AC2: 'Reply on RC2', Falk Heße, 30 May 2023
The paper presents present a datadriven approach to a classical problem in subsurface hydrology, the estimation of parameters characterizing the variogram of subsurface properties. The proposed method advocates the use of Bayesian inference to set up a prior distribution for models that describe spatial correlations (covariance or variogram). A remarkable data set is examined, and available data are classifiable into primary (point measurements), secondary (empirical variogram functions), or tertiary (statistical estimates of subsurface properties) data. Data were processed to avoid overlaps and overrepresentation.The first available result from the manuscript is the comparison between different variogram model functions, that could be improved in my view (see comment \#2). The scale dependency of the hydraulic conductivity is examined next, confirming earlier literature results. An example application of the Bayesian approach to the estimate of correlation length given the maximum length scale is then presented. Other variogram parameters examined are anisotropy, nugget effect, and shape parameter.
The discussion addresses important issues such as the unbiasedness of the data set employed, among others.
The paper looks as a mature contribution; given the topic and the type of paper, I see however some room for further improvement. Results are of interest to the readership of Hydrology and Earth System Sciences. The methods are adequate, the paper subdivision into sections sound, and the figures illustrative. I recommend minor revisions for the reason explained below.
Repl: We appreciate the reviewers comments and overall supportive feedback on our study.
The manuscript examines only stationary variograms, I suggest to mention that nonstationary variograms (see, e.g.,Di Federico and Neuman, 1997) were excluded from the analysis.
\textcolor{blue}{Due to some comments of another reviewer, we revised our analysis and included a truncated power law variogram.}The comparison among variograms having a different numbers of degrees of freedom (section 3.1) could be rendered more qualitative by model identification criteria (AOC, AIC\_c, KIC, …), incorporating the number of parameters involved and the principle of parsimony. The same holds probably for other comparisons performed.
Reply: We agree with the reviewer that the use of a modelselection criterion can formalize the comparison we have done in the manuscript to far. Models with more degrees of freedom have more flexibility and should therefore be better able to capture an observed behavior. In case of variogram model, the Matern and the Stable model should therefore outperform the other models in terms of goodnessoffit. In the current version, we only performed a qualitative analysis by observing the overall similar accuracy and noting how this would not justify the use of a more sophisticated model like the aforementioned Matern and Stable model. Using a criterion like AOC, AIC\_c, KIC, etc. would make this analysis more quantitative.
How crucial is the assumption of two independent Gaussian distributions in Section 3.2 ? Could they develop a more general theory without it, maybe subject to other limitations?
Reply: The assumption of two different Gaussian distribution is not crucial to the approach presented there at all. In fact, the parametric model for the residuals around the regression line was chosen on the spot after visually inspecting them. As we explain in the discussion section, many different parametric models may be possible depending on the situation. In fact, if enough data points are available a completely nonparametric approach is possible as well. To summarize our approach here again, we would describe it as follows. First, the residuals around the regression line are representing the uncertainty one has with respect to the regression model. From a Bayesian perspective, the can be used to estimate a prior probability. We do this by first visually inspecting the results of a kerneldensity estimation (KDE), ie., a nonparametric estimation procedure. KDE is a powerful estimator, but it always produces very smooth densities which may bamboozle practitioners into over interpreting its results. To avoid overconfidence, we therefore only use KDE to find a good parametric model that could describe the empirically observed distribution of the residuals. In our case, a mixture mode using two Gaussian distributions seemed like a good choice. When we fitted such a model to the residuals and compared it to the KDE, we saw a excellent overlap. Of course this agreement has to be interpret with care since the KDE is not the ultimate benchmark of truth, for the reasons outline above. But having two different estimation procedure give very similar results certainly adds confidence that the both express some underlying truth. This single example is explicitly presented as a proof of concept for how to use the data provided in our study for the derivation of prior distributions in a Bayesian context. As we state in the manuscript, using other data and/or other variogram functions may lead to a somewhat different regression analysis with different residuals. In our opinion, there is probably no general theory on what parametric model, if any, to use for the description of the prior distribution. Every case may be different and practitioners are advised to use their judgement in adapting this approach to their situations. In the revised manuscript, we now explain this reasoning in more detail to better convey this important idea.

RC3: 'Comment on hess202315', Shlomo P. Neuman, 14 Mar 2023
The authors have conducted an interesting statistical analysis of raw variograms (I assume that this is what the authors mean by variogram clouds) of published data from a variety of geographic locations and geologic settings. Though the data are said to include saturated hydraulic conductivities from aquifers and shallow soils, as well as permeabilities and transmissivities, I assume (but the authors need to confirm or otherwise clarify) that the data consist of logarithms of these variables. A strength AND a weakness of the analysis is the lack of clear distinction between varied hydraulic properties (conductivity versus transmissivity) from varied settings (sediments, rocks, porous versus fractured media) measured by varied means (I presume single and multihole pressure tests, perhaps other) on varied scales of measurement. The strength of the approach is that it reveals commonalities among the data (scaling and amenability to analysis by a variety of variogram models); its weakness is lack of clarity about the way hydrogeologic settings, methods and scales of measurements, affect variogram behavior.
To me, one of the more interesting results of the analysis is confirmation of a scale effect that reveals itself, precisely, when one throws data from varied sites and settings into a single basket, as do the authors. It would therefore behove the authors to take the additional step of attempting to fit a truncated power variogram to their data, thereby potentially showing that such a variogram captures a greater range of behaviors with fewer parameters than do the stationary variograms they consider.
Not only might a truncated power variogram prove to be more general and parsimonious than do standard stationary variogram, but they would be much more suitable than the latter when one needs to predict, using a transport model, how a plume of contaminant might migrate and behave outside the domain originally used to define the variogram.
I think that with additional work along the lines I have just proposed, this paper could stand out as a much more significant contribution to the literature than it does now.
Citation: https://doi.org/10.5194/hess202315RC3 
AC1: 'Reply on RC3', Falk Heße, 30 May 2023
The authors have conducted an interesting statistical analysis of raw variograms (I assume that this is what the authors mean by variogram clouds) of published data from a variety of geographic locations and geologic settings. Though the data are said to include saturated hydraulic conductivities from aquifers and shallow soils, as well as permeabilities and transmissivities, I assume (but the authors need to confirm or otherwise clarify) that the data consist of logarithms of these variables. A strength AND a weakness of the analysis is the lack of clear distinction between varied hydraulic properties (conductivity versus transmissivity) from varied settings (sediments, rocks, porous versus fractured media) measured by varied means (I presume single and multihole pressure tests, perhaps other) on varied scales of measurement. The strength of the approach is that it reveals commonalities among the data (scaling and amenability to analysis by a variety of variogram models); its weakness is lack of clarity about the way hydrogeologic settings, methods and scales of measurements, affect variogram behavior.
Reply: We appreciate the reviewers comments and overall supportive feedback on our study. The reviewer is right that we used the logarithm of the conductivity, transmissivity and permeability values. To make this more clear, we explicitly mention this now in the revised version of the manuscript. As regards the lack of clear distinction between varied hydraulic properties, we agree with the reviewer that, as far as the analyses are concerned, this is both a strength and weakness. Any statistical analysis has to find a compromise between sample size and accuracy. Using the whole sample means that no within sample effects can be discovered. Split the sample according to different criteria, and the resulting subsamples may be too small to perform a viable statistical analysis. For our paper, we only distinguished between soil and aquifer site. Splitting for instance the aquifer data further into conductivities, transmissivites and permeabilities, would facilitate the investigation of possible differences between them. Alas, since most of data was drawn from measurements of hydraulic conductivity, the samples for transmissivites and permeabilities would have been small, thus, making comparisons more elusive. Still, the data set is openly available and practitioners interested in this topic are free to do such an analysis by themselves. The same problem of reduced sample size also applies to comparisons between, say, different settings and/or measurement techniques. This was particularly unfortunate, since we were very interested in this topic ourselves. Alas, many papers did not report these additional information, which exacerbated the aforementioned problem even further. To better highlight this problem we revised the conclusions portion of the manuscript accordingly.
To me, one of the more interesting results of the analysis is confirmation of a scale effect that reveals itself, precisely, when one throws data from varied sites and settings into a single basket, as do the authors. It would therefore behove the authors to take the additional step of attempting to fit a truncated power variogram to their data, thereby potentially showing that such a variogram captures a greater range of behaviors with fewer parameters than do the stationary variograms they consider.
Not only might a truncated power variogram prove to be more general and parsimonious than do standard stationary variogram, but they would be much more suitable than the latter when one needs to predict, using a transport model, how a plume of contaminant might migrate and behave outside the domain originally used to define the variogram.I think that with additional work along the lines I have just proposed, this paper could stand out as a much more significant contribution to the literature than it does now.
Reply: We agree that using the available data set with a truncated power variogram can help to investigate the scaling behavior of subsurface media indicated by a number of different studies. We therefore redid the analysis using this variogram function. Our results show that a truncated power law variogram achieves a similar goodness of fit compared to the other variogram functions but does not show any scale dependencies of its parameters. Depending on the application, such a lack of scale dependency can be considered an asset. We will include these results and some discussion on this topic into the manuscript.

AC1: 'Reply on RC3', Falk Heße, 30 May 2023
Falk Heße et al.
Falk Heße et al.
Viewed
HTML  XML  Total  BibTeX  EndNote  

426  148  23  597  10  10 
 HTML: 426
 PDF: 148
 XML: 23
 Total: 597
 BibTeX: 10
 EndNote: 10
Viewed (geographical distribution)
Country  #  Views  % 

Total:  0 
HTML:  0 
PDF:  0 
XML:  0 
 1