The revised version of this manuscript is an improvement. It resolves several of the issues that were raised in the first round of reviews.
Unfortunately, there are also important areas where the manuscript has hardly changed in response to major issues identified by the reviewers as well as the editor. To the extent that aspects of the paper haven't changed, my assessments of those aspects also remain unchanged.
The paper's central claim is that "well-chosen compound lumped parameter models should be used as they will reduce potential aggregation errors due to the application of simple lumped parameter models". This claim, from the abstract, is stated much more strongly in the text, e.g., "... the binary models have very much less potential for aggregation bias than the simple models".
These claims must be demonstrated with evidence, and not simply asserted.
To the extent that evidence is offered, it is mostly true by definition, since: a) a "well-chosen" model is operationally defined in the paper as one that is consistent with the perceptual geological model, which is assumed to be correct; b) the authors define the "true" MTT as the MTT of the compound model for the purposes of the results in Section 4; and c) the authors measure the aggregation error as the deviation of the simple model MTT from the compound model MTT (which is considered to be the "true" MTT).
Under this system of definitions, of course the compound models are better than the simple models... because the compound models are defined from the start as being correct and therefore having no aggregation error!
Of course IF the heterogeneity in the catchment actually coincides with the chosen compound model (that is, if the catchment actually consists of two compartments that each behave according to their chosen age distributions), and IF the parameter estimation procedures actually retrieve parameter values that match the true values for the two compartments, then that model will, indeed, subsume the heterogeneity in the real-world system and eliminate the aggregation bias. But this only says that IF the model is actually the correct model, then it will be correct.
The problem is, how can we know whether the model (including the underlying perceptual model) and its parameters are correct?
The authors' position seems to be that as long as the model fits the perceptual model and the data, it is correct. But long experience in hydrological modeling has shown that models with many parameters can often fit the available data, whether or not those models are structurally correct. In many cases, the models will also fit well for multiple, widely differing parameter sets – in other words, the parameter values will not be identifiable, even though the models fit the data nicely.
To justify the statements made in the manuscript, one would need to present evidence that if the model fits the perceptual model and the data, then the model structure and the parameter values are guaranteed to be correct – that is, that it is not possible for the real-world structure, or its parameter values, to be significantly in error. No evidence is presented to meet this burden of proof.
There is another way forward, and that is to restrict the claims that are made to ones that are actually demonstrated. For example, the authors have analyzed several cases where there is geological evidence for a multicomponent catchment system, and in those cases, compound LPM's that mirror these perceptual models fit the tritium data better than simple LPM's do. So that is a claim that could be made in the paper.
But if the manuscript is to make broader claims (for example, about whether compound LPM's will reduce or eliminate aggregation errors), then those claims need to be substantiated with evidence. That evidence also needs to be nontrivial; it is not enough to posit a system of two exponential distributions, and then assert that a compound LPM with two exponential distributions would fit this system, and would have no aggregation bias. That is simply irrelevant to the problem we face in the real world, which is that we do not already know the "right answer", including the structure of the system we are trying to analyze.
The manuscript demonstrates an important point, which is that where we already have geological information about aquifer partitioning, then that information can be very useful in constraining transit time models. Where we have that information, we don't need to rely on tracers alone. Another way to say it is, if the transit time models agree with the geological information and the tracers, then they strengthen our confidence in both of them. That is a useful point to make.
But the generalization that "well-chosen" models are always preferred is either true by definition (of course "well-chosen" is better than "poorly chosen"), or else it leaves open the critical issue: how do we know when our models are "well chosen"? Is two compartments "well chosen", or just one? Or six? How should we decide?
More complex models, with more adjustable parameters, will almost always fit data sets better, but the parameters themselves may be highly uncertain. The critical question that the manuscript must come to grips with is: how well can the parameters be constrained? What are their uncertainties? Holding all the other parameter values constant and just varying one of them is not a valid way to estimate parameter uncertainty, unless those other values are actually known to have their assigned values. But as far as I can tell (the manuscript doesn't say, and that is itself a problem), either this, or nothing at all, is what has been done.
My previous review said that there was no excuse for not properly analyzing parameter uncertainty, and the response was that this had not been done in tritium papers until very recently. The fact that it has rarely been done before is no excuse for not doing it now. The only alternative is to remove all claims about whether parameters (including the MTT's that are derived from them) are well constrained.
Some particularly problematic passages are quoted below (this is an incomplete list):
"We believe that the use of compound LPMs could strongly reduce aggregation errors in hydrological systems with “significant and distinct” heterogeneity. For example, we consider a simple case of a catchment split into two parts by two very different rock types that produce waters with very different MTTs; i.e. the most extreme “significant and distinct” heterogeneity one can imagine. A binary LPM describes this type of system, and when optimised with suitable data would very effectively separate the young MTT from the old MTT waters in the catchment outflow, and therefore minimise aggregation errors in MTT. "
This assumes the existence of "suitable data", but gives no criteria by which to determine whether data are "suitable". No analysis is presented to support any of the statements in the passage. They may be true, or they may not, but unless they can be supported by evidence they should be removed.
"If we now consider a catchment split into four parts with two areas of each rock type, the binary LPM when optimised is still very effective for separating the two types of water, while the potential for aggregation errors is smaller. In systems which are split into eight, sixteen, etc. parts the binary LPM retains its effectiveness, but the potential for aggregation errors becomes very much smaller because the system starts to look homogeneous at larger scales."
No evidence is presented to substantiate these statements. In Kirchner's analysis, splitting the model system into more components does not reduce the aggregation bias, which suggests that the statements made here are not correct.
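The point can be illustrated numerically. The following is a minimal sketch (my own, not from the manuscript) assuming piston-flow parcels and a hypothetical steady input of 100 TU: subdividing a two-component mixture into more identical sub-parts leaves the mixed tritium concentration, and therefore the apparent age and the aggregation bias, exactly unchanged.

```python
import math

HALF_LIFE = 12.32                    # tritium half-life, years
LAM = math.log(2) / HALF_LIFE        # decay constant, 1/yr

def mixed_conc(ages, weights, c_in=100.0):
    """Tritium concentration of a mixture of piston-flow parcels
    (hypothetical steady input c_in, in TU)."""
    return sum(w * c_in * math.exp(-LAM * a) for a, w in zip(ages, weights))

def apparent_age(c_obs, c_in=100.0):
    """Apparent (piston-flow) age implied by an observed concentration."""
    return math.log(c_in / c_obs) / LAM

# Two components, ages 5 and 100 yr, in equal parts:
c2 = mixed_conc([5, 100], [0.5, 0.5])
# The same system subdivided into four identical sub-parts:
c4 = mixed_conc([5, 5, 100, 100], [0.25] * 4)

true_mtt = 0.5 * 5 + 0.5 * 100       # 52.5 yr
bias2 = true_mtt - apparent_age(c2)  # large positive bias (tens of years)
bias4 = true_mtt - apparent_age(c4)  # identical, since c4 == c2
```

Subdividing into eight, sixteen, etc. identical parts changes nothing in this arithmetic; the bias depends only on the mixture's transit time distribution, not on how many sub-parcels compose it.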
"There is, of course, a wide range of different types of hydrological systems, but the binary LPM is likely to remain effective in cases of “significant and distinct” heterogeneity, which are the ones of concern for aggregation error."
What do "remain effective" and "significant and distinct" mean?
"It is clear that many (if not most) studies using seasonal tracer cycles interpreted with simple LPMs will have been affected by aggregation bias. But we contend that tritium studies will have been affected less, despite aggregation bias also applying to tritium-derived MTTs, because many of the tritium studies in the literature applied compound models calibrated by fitting to time series of tritium measurements rather than or as well as using simple LPMs. Provided the compound LPMs were well-chosen based on the characteristics of the catchments, they will produce more accurate TTDs than the simple LPMs and therefore will reduce aggregation bias on MTTs."
These statements may or may not be true, but they are not supported by any clear evidence that I can find in the manuscript.
"A good example is the study of Blavoux et al. (2013) describing the interpretation of an exceptionally long and very detailed record of tritium concentrations from the Evian-Cachat Spring in France. The tritium record was much too complicated to be fitted by a simple LPM. Instead, the detailed records of input and output allowed accurate specification of a combined model comprising of exponential (τm = 8 yr) and dispersion (τm = 60 yr) models in series, with a small bypass flow in parallel with them, followed by a piston flow model (τm = 2.5 yr) in series giving an overall τm of 70 yr. The combined model was closely related to the hydrogeology of the area and produced an accurate TTD for the average stationary state of the system, so there is little possibility of aggregation bias."
The Blavoux et al. study relied on samples taken over decades, including ones that nicely traced out the bomb pulse. It is a nice study, but is completely misleading as an indication of what will be possible with tritium data, except in the very few places where records go back that far.
A central problem with the paper is that it gives the reader the impression that fitting more complex LPM's will solve all kinds of problems. But now and into the future, with the bomb pulse gone, there will be very little tritium variation available to fit even simple models to, let alone more complex ones.
"Compound LPMs are to be preferred, but often there can be considerable difficulty in uniquely quantifying the parameters especially if the output data is limited."
The statement that compound LPM's "are to be preferred" is unsubstantiated, except perhaps as a statement of the authors' preferences. And the second part of the sentence effectively cancels the first, since (although this is not stated) the "difficulty in uniquely quantifying the parameters" grows geometrically as the number of parameters increases. The manuscript needs to show that the "considerable difficulty" is not a problem in the cases presented here, and also in the types of situations where tritium is likely to be used in the future, when there will be no bomb pulse to work with.
"We find that MTT aggregation errors are small when the component waters have similar MTTs to each other. On the other hand, aggregation errors can be large when very young water components are mixed with older components."
Unlike in the case of seasonal tracer cycles, with tritium the absolute ages (such as "very young") are not relevant because tritium decays exponentially. Aggregation errors can arise in tritium whenever the component ages span a large range, regardless of where that range is centered. The exponential decay curve implies that the aggregation error depends on the ratios of the ages, not their absolute values. Even if we have no "very young" water at all, for example, adding some much older (tritium-depleted) water can potentially lead to a large aggregation error, depending on how much older that water is. If, for example, we added 50% zero-tritium water to our mixture, we would decrease the average 3H concentration by half and therefore increase the model age by 12.3 years... but we could increase the ACTUAL mean age by hundreds or thousands of years, depending on how old the zero-tritium water actually is.
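The arithmetic in the example above can be checked with a minimal sketch (again my own, assuming a piston-flow interpretation and a hypothetical steady input of 100 TU): diluting a 5-year-old water with 50% tritium-free water shifts the model age by exactly one half-life, while the true mean age can shift by orders of magnitude more.

```python
import math

HALF_LIFE = 12.32                    # tritium half-life, years
LAM = math.log(2) / HALF_LIFE        # decay constant, 1/yr
c_in = 100.0                         # hypothetical steady input, TU

def piston_age(c_obs):
    """Apparent (piston-flow) age implied by an observed concentration."""
    return math.log(c_in / c_obs) / LAM

young = c_in * math.exp(-LAM * 5.0)  # 5-yr-old component
mix = 0.5 * young + 0.5 * 0.0        # add 50% tritium-free water

model_age_young = piston_age(young)  # 5 yr
model_age_mix = piston_age(mix)      # 5 + 12.32 = 17.32 yr: one half-life older
# True mean age of the mixture is 0.5*5 + 0.5*age_old, unbounded as
# age_old grows; e.g. if the tritium-free water is 2000 yr old:
true_age_if_old_is_2000 = 0.5 * 5.0 + 0.5 * 2000.0   # 1002.5 yr
```

The model age rises by 12.32 years regardless of whether the tritium-free water is 200, 2000, or 20,000 years old, which is exactly the decoupling described in the next paragraph.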
The authors point out in section 4.4 that additions of tritium-free water will often have a very small effect on model fits to tritium time series, but (although they don't emphasize this), this could have a potentially huge effect on what the true mean transit time is (because the tritium-free water could be 200, or 2000, or 20,000 years old). Thus the true mean transit time can become decoupled from the mean transit time estimated from tritium. This is a clear aggregation error, and it should be identified as such. The same potential aggregation error exists (but is not mentioned) in the other case examples as well. Every TTD model in the paper, both simple and compound, is vulnerable to it.
"In general, well-chosen compound lumped parameter models should be used as they will reduce potential aggregation errors due to the application of simple lumped parameter models. An opportunity to determine a realistic compound lumped parameter model is given by matching simulations to time series of tritium measurements (underlining the value of long series of past tritium measurements), but such results should be validated by reference to the characteristics of the hydrological system to ensure that the parameters found by modelling correspond to reality."
Again, the value of comparing model assumptions to "the characteristics of the hydrological system" (presumably determined independently?) is clear, but claims that any model is "realistic" – and, more generally, that the approach outlined here is a reliable guide to model "realism" – need to be supported with evidence.
The claim that tritium measurements are usable over a 200-year time range should be substantiated. 200 years is more than 16 half-lives, during which tritium concentrations will decay to roughly 10^-5 of their original values. Even assuming we start with 5,000 TU (the peak of the bomb pulse), this will decay to less than 0.1 TU after 200 years. Even if such a measurement is analytically feasible, it is hard to see how it is useful in practice, because contamination with even a tiny bit of younger water would obscure the tiny traces of the 200-year-old water. (And remember, this is starting from the strongest possible tritium signal, the peak of the bomb pulse.)
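The decay arithmetic above is a back-of-envelope calculation (mine, using the 12.32-yr tritium half-life):

```python
# Back-of-envelope check of the 200-year decay claim (tritium half-life 12.32 yr).
HALF_LIFE = 12.32                # years
t = 200.0                        # years
halflives = t / HALF_LIFE        # about 16.2 half-lives
frac = 0.5 ** halflives          # fraction remaining, roughly 1.3e-5
remaining = 5000.0 * frac        # TU left from a 5,000 TU bomb-pulse peak (~0.06 TU)
```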
The statement that there is no aggregation bias when the MTT's of the two components are the same (see the abstract, and Section 5.1, for example) is rather trivial. If there is no heterogeneity, then of course no bias can result from it.
On page 5, line 19, the dimensions of beta are wrong.
The conclusions are mostly a word-for-word repetition of the abstract. If the authors don't have anything further to say, then there is no need for the conclusions. There is no need to print the same sentences in two different places.