the Creative Commons Attribution 4.0 License.
Can causal discovery lead to a more robust prediction model for runoff signatures?
Abstract. Runoff signatures characterize a catchment's response and provide insight into hydrological processes. These signatures are governed by the co-evolution of catchment properties and climate processes, making them useful for understanding and explaining hydrological responses. However, catchment behaviours can vary significantly across different spatial scales, which complicates the identification of key drivers of hydrologic response. This study represents catchments as networks of variables linked by cause-and-effect relationships. We examine whether the direct causes of runoff signatures can explain these signatures across different environments, with the goal of developing more robust, parsimonious, and physically interpretable predictive models. We compare predictive models that incorporate causal information derived from the relationships between catchment, climate, and runoff characteristics. We use the Peter and Clark (PC) causal discovery algorithm, along with three prediction models: Bayesian Network (BN), Generalized Additive Model (GAM), and Random Forest (RF). The results indicate that, among the three models, BN exhibits the smallest decline in accuracy between training and test simulations. While RF achieves the highest overall performance, it also demonstrates the largest drop in accuracy between the training and test phases. When the training sample is small, the accuracy of the causal RF model, which uses causal parents as predictors, is comparable to that of the non-causal RF model, which uses all selected variables as predictors. This study demonstrates the potential of causal inference techniques in representing the interconnected processes in hydrological systems in a more interpretable and effective manner.
Status: final response (author comments only)
RC1: 'Comment on hess-2024-297', Anonymous Referee #1, 05 Dec 2024
The manuscript proposes the use of causal discovery algorithms for understanding the role of a variety of attributes in runoff signatures.
The topic is certainly interesting and is evolving rapidly. The manuscript is well organized and includes a particularly rich data set that allows the authors to draw conclusions.
While I am inclined to recommend the manuscript for publication, I would like to share with the authors some possible improvements that would make it easier to understand.
- The title, the abstract, and the conclusions are not fully aligned. In the title the authors refer to “prediction”, in the abstract to “interpretation”, and in the conclusions many points mix the two topics.
- Is “runoff signature” synonymous with “hydrological response” or “watershed response”? Perhaps these other common terms could be mentioned in the introduction to better orient the reader.
- In lines 103-113, the innovative contribution or advancement relative to the previous literature, which the authors list accurately, should be clarified.
- Section 3. Data are crucial for understanding the model application. Section 3 contains the attribute list but not the data characterization. A first question the reader might have is: “Did the authors select a single value for each attribute and each catchment, or a time series?”
- Figure 1 is not fully clear. Is the cluster analysis necessary? Is it an alternative way to analyze the entire data set? If so, it should appear at a different level, for example as a starting option in the flow chart.
- Section 2.2.1 seems incomplete and refers to the Supplementary materials; however, this step seems important in the whole procedure. More details are needed on how the most important features are ranked: the “out-of-bag method” is vague, and the sentence “variables are selected based on a combination of correlation analysis, variable importance assessment and consideration of the underlying physics of the runoff signatures.” is too general.
Citation: https://doi.org/10.5194/hess-2024-297-RC1
AC1: 'Reply on RC1', Hossein Abbasizadeh, 23 Dec 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2024-297/hess-2024-297-AC1-supplement.pdf
RC2: 'Comment on hess-2024-297', Anonymous Referee #2, 03 Jan 2025
The paper's starting point is the following insight that has received a lot of attention in the machine learning and statistics literature: if the data generating process (including a response Y and covariates X) follows a causal model, such as a structural causal model, and if we observe the data generating process under different environments (which, e.g., correspond to different values of a variable in the system that is a non-descendant of Y), then the distribution of Y given PA(Y) is the same for different environments; here, PA(Y) is the set of causal parents of Y (in general, conditioning on subsets of PA(Y) does not suffice). Thus, identifying PA(Y) and using only PA(Y) as covariates may yield models that are more robust with respect to changes in the environment. I consider this line of thought interesting and appreciate the authors' effort to apply the principle to a non-trivial real-world application. Unfortunately, however, the paper falls short of proving that one benefits from applying this principle in the context of predicting runoff signatures from climate and catchment variables and, in my view, does not meet the high standards of publication in HESS. I provide a few more details below.
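The invariance principle described above can be illustrated with a small simulation. This is a minimal sketch under assumptions that appear neither in the manuscript nor in the referee report: a toy linear chain X1 -> Y -> X2 with Gaussian noise, in which the environment is represented by the standard deviation of X1 (a non-descendant of Y). The function names `ols_slope` and `simulate` are hypothetical helpers. The regression of Y on its parent X1 should give roughly the same slope in both environments, whereas the regression of Y on its child X2 should not.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (with intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate(x1_scale, n, rng):
    """Draw n samples from the assumed toy chain X1 -> Y -> X2.

    x1_scale is the standard deviation of X1 and plays the role of
    the environment; the mechanisms for Y and X2 never change.
    """
    x1 = [rng.gauss(0.0, x1_scale) for _ in range(n)]
    y = [2.0 * a + rng.gauss(0.0, 1.0) for a in x1]   # Y depends on its parent X1
    x2 = [b + rng.gauss(0.0, 1.0) for b in y]         # X2 is a child of Y
    return x1, y, x2

rng = random.Random(0)
slopes = {}
for env, scale in (("A", 1.0), ("B", 3.0)):
    x1, y, x2 = simulate(scale, 20000, rng)
    slopes[env] = {
        "parent": ols_slope(x1, y),  # regression of Y on its causal parent
        "child": ols_slope(x2, y),   # regression of Y on its child
    }
    print(env, slopes[env])
```

In this toy setting the parent-based slope stays close to the structural coefficient 2 in both environments, while the child-based slope shifts with the variance of X1. The example says nothing about whether the same holds for the runoff data.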
* The authors write that they assume that the runoff signatures are sink nodes ("We also assumed that runoff signatures do not cause climate and catchment attributes") and that there are no hidden variables ("It is also assumed that there are no unobserved variables" -- see also the assumption that the distribution is Markov and faithful wrt a DAG over the observed variables). But then, identifying the set of causal parents of a runoff signature becomes 'classical' variable selection, a non-causal problem. This implies that a causal analysis is not necessary.
* This also has some implications for one of the paper's main points regarding the invariance property of the causal parents of Y: If the environments can be modeled as the values of a random variable (if I understand correctly, the environments are created by clustering certain covariates, so we can indeed model them as a child of such covariate(s)), then not only the set of causal parents but also the full set of covariates is invariant -- invalidating some of the main points of the paper.
* (As a side, empirical differences between 'causal' and 'non-causal' methods are then, in my view, 'only' due to different ways of performing variable selection -- and worse performance of non-causal methods on test data simply means that the variable selection or regularization can be improved.)
* Even if the above three points were not an issue, the authors do not provide sufficient arguments on why we should trust that the result obtained by PC reflects the causal ground truth. A few points why in my view this is not obvious:
a) The authors use "expert knowledge (...) to determine the causal direction between two variables with an undirected edge, correct the causally wrong direction between variables and block the spurious edges between variables". But if we know that some of the edges are incorrect, why should we trust the others? (Also, I did not find the description of the process of correcting edges sufficiently clear.)
b) The PC algorithm is known to produce results that are not reliable. E.g., relabeling the variables, i.e., simply permuting the columns in the data matrix, or subsampling the data set sometimes change the outcome. Simulation experiments show that even under no model misspecification huge sample sizes can be needed to reliably obtain the ground truth graph.
c) The paper does not provide any theoretical guarantees. This would probably be too much to ask for an applied paper but I argue below that this question is not purely theoretical: the assumptions that are known to be sufficient to obtain theoretical guarantees are most likely violated in this application.
* The paper does not provide sufficient arguments on whether the differences between methods are statistically significant.
* In my view, the paper is not sufficiently clear about the experimental setup using the different clusters. E.g., how exactly are the training and test sets chosen? The authors mention robustness across environments but then training and test data should be from different clusters?
* It was unclear to me how the paper accounts for time-dependence of the data points.
* (As a side, in general, when considering robustness against a change of environment, using the causal parents as covariates may not be optimal. Instead, one could use what is referred to as the stable blanket.)
* The paper contains several imprecise/incorrect statements. Here are two examples (there are more): "They are the assumptions under which the causal relationship from the observational data can be learned." What precisely does "can be learned" mean? This may sound like a minor point but in my view it is not. One way of making this precise is to write down conditions for uniform consistency. There are few conditions known under which uniform consistency holds. However, such conditions are very restrictive. (E.g., some of such conditions include the assumption that the random variables are jointly Gaussian. If I understand correctly, the authors transform marginals but even this does not suffice.) It is known that, in general, all nonparametric conditional independence tests that are level are trivial (and do not have any non-trivial power), so it may even be impossible to relax such conditions to something reasonable. This is important in that these thoughts may be a reason for why the PC algorithm is usually unreliable in practice (see above). To give another example, "Covariate shift states that if variable Y is to be predicted from X, and X is the cause of Y, the conditional probability P(Y|X) remains the same across all environments if the distribution of X changes" is in my view at least imprecise: covariate shift is usually meant as a non-causal assumption and invariance generally holds only if X is the set of all causal parents of Y.
* The paper contains several typos, such as "casual" or "Clarck" or "causal models are assum result".
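The argument above that the full covariate set is as invariant as the parent set when Y is a sink node, there are no hidden variables, and the environment is a child of a covariate can be sketched in a toy simulation. Everything here is an assumption made for illustration and is not taken from the manuscript: a linear Gaussian model with a parent X1, a non-parent covariate X3 that depends on X1, and an environment defined by thresholding X1. The helper `ols_two_predictors` is hypothetical.

```python
import random

def ols_two_predictors(a, b, y):
    """Least-squares coefficients of y on predictors a and b (with intercept)."""
    n = len(y)
    ma, mb, my = sum(a) / n, sum(b) / n, sum(y) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    say = sum((u - ma) * (w - my) for u, w in zip(a, y))
    sby = sum((v - mb) * (w - my) for v, w in zip(b, y))
    det = saa * sbb - sab ** 2
    # Solve the 2x2 normal equations on the centred variables.
    coef_a = (say * sbb - sby * sab) / det
    coef_b = (sby * saa - say * sab) / det
    return coef_a, coef_b

rng = random.Random(1)
samples = []
for _ in range(40000):
    x1 = rng.gauss(0.0, 1.0)              # causal parent of Y
    x3 = x1 + rng.gauss(0.0, 1.0)         # covariate that is not a parent of Y
    y = 2.0 * x1 + rng.gauss(0.0, 1.0)    # Y is a sink node: nothing depends on it
    env = "low" if x1 < 0.0 else "high"   # environment is a child of the covariate X1
    samples.append((env, x1, x3, y))

coefs = {}
for env in ("low", "high"):
    rows = [s for s in samples if s[0] == env]
    coefs[env] = ols_two_predictors(
        [r[1] for r in rows], [r[2] for r in rows], [r[3] for r in rows]
    )
    print(env, coefs[env])
```

Under these assumptions the regression on the full covariate set (X1, X3) recovers approximately the same coefficients in both environments, with the coefficient of the non-parent X3 close to zero. In such a setting the difference between a "causal" and a "non-causal" model reduces to a difference in variable selection.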
Citation: https://doi.org/10.5194/hess-2024-297-RC2
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
268 | 65 | 79 | 412 | 75 | 3 | 4 |