The revised version of this manuscript is an improvement. It resolves several of the issues that were raised in the first round of reviews.
Unfortunately, there are also important areas where the manuscript has hardly changed in response to major issues identified by the reviewers as well as the editor. To the extent that aspects of the paper haven't changed, my assessments of those aspects also remain unchanged.
The paper's central claim is that "well-chosen compound lumped parameter models should be used as they will reduce potential aggregation errors due to the application of simple lumped parameter models". This claim, from the abstract, is stated much more strongly in the text, e.g., "... the binary models have very much less potential for aggregation bias than the simple models".
These claims must be demonstrated with evidence, and not simply asserted.
To the extent that evidence is offered, it is mostly true by definition, since: a) a "well-chosen" model is operationally defined in the paper as one that is consistent with the perceptual geological model, which is assumed to be correct; b) the authors define the "true" MTT as the MTT of the compound model for the purposes of the results in Section 4; and c) the authors measure the aggregation error as the deviation of the simple model MTT from the compound model MTT (which is considered to be the "true" MTT).
Under this system of definitions, of course the compound models are better than the simple models... because the compound models are defined from the start as being correct and therefore having no aggregation error!
Of course IF the heterogeneity in the catchment actually coincides with the chosen compound model (that is, if the catchment actually consists of two compartments that each behave according to their chosen age distributions), and IF the parameter estimation procedures actually retrieve parameter values that match the true values for the two compartments, then that model will, indeed, subsume the heterogeneity in the real-world system and eliminate the aggregation bias. But this only says that IF the model is actually the correct model, then it will be correct.
The problem is, how can we know whether the model (including the underlying perceptual model) and its parameters are correct?
The authors' position seems to be that as long as the model fits the perceptual model and the data, it is correct. But long experience in hydrological modeling has shown that models with many parameters can often fit the available data, whether or not those models are structurally correct. In many cases, the models will also fit well for multiple, widely differing parameter sets – in other words, the parameter values will not be identifiable, even though the models fit the data nicely.
To justify the statements made in the manuscript, one would need to present evidence that if the model fits the perceptual model and the data, then the model structure and the parameter values are guaranteed to be correct – that is, that it is not possible for the real-world structure, or its parameter values, to be significantly in error. No evidence is presented to meet this burden of proof.
There is another way forward, and that is to restrict the claims that are made to ones that are actually demonstrated. For example, the authors have analyzed several cases where there is geological evidence for a multicomponent catchment system, and in those cases, compound LPM's that mirror these perceptual models fit the tritium data better than simple LPM's do. So that is a claim that could be made in the paper.
But if the manuscript is to make broader claims (for example, about whether compound LPM's will reduce or eliminate aggregation errors), then those claims need to be substantiated with evidence. That evidence also needs to be nontrivial; it is not enough to posit a system of two exponential distributions, and then assert that a compound LPM with two exponential distributions would fit this system, and would have no aggregation bias. That is simply irrelevant to the problem we face in the real world, which is that we do not already know the "right answer", including the structure of the system we are trying to analyze.
The manuscript demonstrates an important point, which is that where we already have geological information about aquifer partitioning, then that information can be very useful in constraining transit time models. Where we have that information, we don't need to rely on tracers alone. Another way to say it is, if the transit time models agree with the geological information and the tracers, then they strengthen our confidence in both of them. That is a useful point to make.
But the generalization that "well-chosen" models are always preferred is either true by definition (of course "well-chosen" is better than "poorly chosen"), or else it leaves open the critical issue: how do we know when our models are "well chosen"? Is two compartments "well chosen", or just one? Or six? How should we decide?
More complex models, with more adjustable parameters, will almost always fit data sets better, but the parameters themselves may be highly uncertain. The critical question that the manuscript must come to grips with is: how well can the parameters be constrained? What are their uncertainties? Holding all the other parameter values constant and just varying one of them is not a valid way to estimate parameter uncertainty, unless those other values are actually known to have their assigned values. But as far as I can tell (the manuscript doesn't say, and that is itself a problem), either this, or nothing at all, is what has been done.
My previous review said that there was no excuse for not properly analyzing parameter uncertainty, and the response was that this had not been done in tritium papers until very recently. The fact that it has rarely been done before is no excuse for not doing it now. The only alternative is to remove all claims about whether parameters (including the MTT's that are derived from them) are well constrained.
Some particularly problematic passages are quoted below (this is an incomplete list):
"We believe that the use of compound LPMs could strongly reduce aggregation errors in hydrological systems with “significant and distinct” heterogeneity. For example, we consider a simple case of a catchment split into two parts by two very different rock types that produce waters with very different MTTs; i.e. the most extreme “significant and distinct” heterogeneity one can imagine. A binary LPM describes this type of system, and when optimised with suitable data would very effectively separate the young MTT from the old MTT waters in the catchment outflow, and therefore minimise aggregation errors in MTT. "
This assumes the existence of "suitable data", but gives no criteria by which to determine whether data are "suitable". No analysis is presented to support any of the statements in the passage. They may be true, or they may not, but unless they can be supported by evidence they should be removed.
"If we now consider a catchment split into four parts with two areas of each rock type, the binary LPM when optimised is still very effective for separating the two types of water, while the potential for aggregation errors is smaller. In systems which are split into eight, sixteen, etc. parts the binary LPM retains its effectiveness, but the potential for aggregation errors becomes very much smaller because the system starts to look homogeneous at larger scales."
No evidence is presented to substantiate these statements. In Kirchner's analysis, splitting the model system into more components does not reduce the aggregation bias, which suggests that the statements made here are not correct.
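The point can be illustrated numerically. The following is a minimal sketch (my own, not from the manuscript) assuming piston-flow parcels and a hypothetical steady input of 100 TU: subdividing a two-component mixture into more identical sub-parts leaves the mixed tritium concentration, and therefore the apparent age and the aggregation bias, exactly unchanged.

```python
import math

HALF_LIFE = 12.32                    # tritium half-life, years
LAM = math.log(2) / HALF_LIFE        # decay constant, 1/yr

def mixed_conc(ages, weights, c_in=100.0):
    """Tritium concentration of a mixture of piston-flow parcels
    (hypothetical steady input c_in, in TU)."""
    return sum(w * c_in * math.exp(-LAM * a) for a, w in zip(ages, weights))

def apparent_age(c_obs, c_in=100.0):
    """Apparent (piston-flow) age implied by an observed concentration."""
    return math.log(c_in / c_obs) / LAM

# Two components, ages 5 and 100 yr, in equal parts:
c2 = mixed_conc([5, 100], [0.5, 0.5])
# The same system subdivided into four identical sub-parts:
c4 = mixed_conc([5, 5, 100, 100], [0.25] * 4)

true_mtt = 0.5 * 5 + 0.5 * 100       # 52.5 yr
bias2 = true_mtt - apparent_age(c2)  # large positive bias (tens of years)
bias4 = true_mtt - apparent_age(c4)  # identical, since c4 == c2
```

Subdividing into eight, sixteen, etc. identical parts changes nothing in this arithmetic; the bias depends only on the mixture's transit time distribution, not on how many sub-parcels compose it.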
"There is, of course, a wide range of different types of hydrological systems, but the binary LPM is likely to remain effective in cases of “significant and distinct” heterogeneity, which are the ones of concern for aggregation error."
What do "remain effective" and "significant and distinct" mean?
"It is clear that many (if not most) studies using seasonal tracer cycles interpreted with simple LPMs will have been affected by aggregation bias. But we contend that tritium studies will have been affected less, despite aggregation bias also applying to tritium-derived MTTs, because many of the tritium studies in the literature applied compound models calibrated by fitting to time series of tritium measurements rather than or as well as using simple LPMs. Provided the compound LPMs were well-chosen based on the characteristics of the catchments, they will produce more accurate TTDs than the simple LPMs and therefore will reduce aggregation bias on MTTs."
These statements may or may not be true, but they are not supported by any clear evidence that I can find in the manuscript.
"A good example is the study of Blavoux et al. (2013) describing the interpretation of an exceptionally long and very detailed record of tritium concentrations from the Evian-Cachat Spring in France. The tritium record was much too complicated to be fitted by a simple LPM. Instead, the detailed records of input and output allowed accurate specification of a combined model comprising of exponential (τm = 8 yr) and dispersion (τm = 60 yr) models in series, with a small bypass flow in parallel with them, followed by a piston flow model (τm = 2.5 yr) in series giving an overall τm of 70 yr. The combined model was closely related to the hydrogeology of the area and produced an accurate TTD for the average stationary state of the system, so there is little possibility of aggregation bias."
The Blavoux et al. study relied on samples taken over decades, including ones that nicely traced out the bomb pulse. It is a nice study, but is completely misleading as an indication of what will be possible with tritium data, except in the very few places where records go back that far.
A central problem with the paper is that it gives the reader the impression that fitting more complex LPM's will solve all kinds of problems. But now and into the future, with the bomb pulse gone, there will be very little tritium variation available to fit even simple models to, let alone more complex ones.
"Compound LPMs are to be preferred, but often there can be considerable difficulty in uniquely quantifying the parameters especially if the output data is limited."
The statement that compound LPM's "are to be preferred" is unsubstantiated, except perhaps as a statement of the authors' preferences. And the second part of the sentence effectively cancels the first, since (although this is not stated) the "difficulty in uniquely quantifying the parameters" grows geometrically as the number of parameters increases. The manuscript needs to show that the "considerable difficulty" is not a problem in the cases presented here, and also in the types of situations where tritium is likely to be used in the future, when there will be no bomb pulse to work with.
"We find that MTT aggregation errors are small when the component waters have similar MTTs to each other. On the other hand, aggregation errors can be large when very young water components are mixed with older components."
Unlike in the case of seasonal tracer cycles, with tritium the absolute ages (such as "very young") are not relevant because tritium decays exponentially. Aggregation errors can arise in tritium whenever the component ages span a large range, regardless of where that range is centered. The exponential decay curve implies that the aggregation error depends on the ratios of the ages, not their absolute values. Even if we have no "very young" water at all, for example, adding some much older (tritium-depleted) water can potentially lead to a large aggregation error, depending on how much older that water is. If, for example, we added 50% zero-tritium water to our mixture, we would decrease the average 3H concentration by half and therefore increase the model age by 12.3 years... but we could increase the ACTUAL mean age by hundreds or thousands of years, depending on how old the zero-tritium water actually is.
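The arithmetic in the example above can be checked with a minimal sketch (again my own, assuming a piston-flow interpretation and a hypothetical steady input of 100 TU): diluting a 5-year-old water with 50% tritium-free water shifts the model age by exactly one half-life, while the true mean age can shift by orders of magnitude more.

```python
import math

HALF_LIFE = 12.32                    # tritium half-life, years
LAM = math.log(2) / HALF_LIFE        # decay constant, 1/yr
c_in = 100.0                         # hypothetical steady input, TU

def piston_age(c_obs):
    """Apparent (piston-flow) age implied by an observed concentration."""
    return math.log(c_in / c_obs) / LAM

young = c_in * math.exp(-LAM * 5.0)  # 5-yr-old component
mix = 0.5 * young + 0.5 * 0.0        # add 50% tritium-free water

model_age_young = piston_age(young)  # 5 yr
model_age_mix = piston_age(mix)      # 5 + 12.32 = 17.32 yr: one half-life older
# True mean age of the mixture is 0.5*5 + 0.5*age_old, unbounded as
# age_old grows; e.g. if the tritium-free water is 2000 yr old:
true_age_if_old_is_2000 = 0.5 * 5.0 + 0.5 * 2000.0   # 1002.5 yr
```

The model age rises by 12.32 years regardless of whether the tritium-free water is 200, 2000, or 20,000 years old, which is exactly the decoupling described in the next paragraph.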
The authors point out in section 4.4 that additions of tritium-free water will often have a very small effect on model fits to tritium time series, but (although they don't emphasize this), this could have a potentially huge effect on what the true mean transit time is (because the tritium-free water could be 200, or 2000, or 20,000 years old). Thus the true mean transit time can become decoupled from the mean transit time estimated from tritium. This is a clear aggregation error, and it should be identified as such. The same potential aggregation error exists (but is not mentioned) in the other case examples as well. Every TTD model in the paper, both simple and compound, is vulnerable to it.
"In general, well-chosen compound lumped parameter models should be used as they will reduce potential aggregation errors due to the application of simple lumped parameter models. An opportunity to determine a realistic compound lumped parameter model is given by matching simulations to time series of tritium measurements (underlining the value of long series of past tritium measurements), but such results should be validated by reference to the characteristics of the hydrological system to ensure that the parameters found by modelling correspond to reality."
Again, the value of comparing model assumptions to "the characteristics of the hydrological system" (presumably determined independently?) is clear, but claims that any model is "realistic" – and, more generally, that the approach outlined here is a reliable guide to model "realism" – need to be supported with evidence.
The claim that tritium measurements are usable over a 200-year time range should be substantiated. 200 years is more than 16 half-lives, during which tritium concentrations will decay to roughly 10^-5 of their original values. Even assuming we start with 5,000 TU (the peak of the bomb pulse), this will decay to less than 0.1 TU after 200 years. Even if such a measurement is analytically feasible, it is hard to see how it is useful in practice, because contamination with even a tiny bit of younger water would obscure the tiny traces of the 200-year-old water. (And remember, this is starting from the strongest possible tritium signal, the peak of the bomb pulse.)
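The decay arithmetic above is a back-of-envelope calculation (mine, using the 12.32-yr tritium half-life):

```python
# Back-of-envelope check of the 200-year decay claim (tritium half-life 12.32 yr).
HALF_LIFE = 12.32                # years
t = 200.0                        # years
halflives = t / HALF_LIFE        # about 16.2 half-lives
frac = 0.5 ** halflives          # fraction remaining, roughly 1.3e-5
remaining = 5000.0 * frac        # TU left from a 5,000 TU bomb-pulse peak (~0.06 TU)
```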
The statement that there is no aggregation bias when the MTT's of the two components are the same (see the abstract, and Section 5.1, for example) is rather trivial. If there is no heterogeneity, then of course no bias can result from it.
On page 5, line 19, the dimensions of beta are wrong.
The conclusions are mostly a word-for-word repetition of the abstract. If the authors don't have anything further to say, then there is no need for the conclusions. There is no need to print the same sentences in two different places.