Detecting hydrological connectivity using causal inference from time series: synthetic and real karstic case studies

Delforge, Damien; de Viron, Olivier; Vanclooster, Marnik; Van Camp, Michel; Watlet, Arnaud

doi:https://doi.org/10.5194/hess-26-2181-2022

Articles | Volume 26, issue 8

https://doi.org/10.5194/hess-26-2181-2022

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/hess-26-2181-2022

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 26, issue 8

Research article

|

29 Apr 2022

Research article |

| 29 Apr 2022

Detecting hydrological connectivity using causal inference from time series: synthetic and real karstic case studies

Damien Delforge, Olivier de Viron, Marnik Vanclooster, Michel Van Camp, and Arnaud Watlet

Download

Final revised paper (published on 29 Apr 2022)
Supplement to the final revised paper
Preprint (discussion started on 06 Sep 2021)
Supplement to the preprint

Interactive discussion

Status: closed

RC1:
'Comment on hess-2021-445', Anonymous Referee #1, 05 Oct 2021
This paper by Delforge et al. titled “Detecting hydrological connectivity using causal inference from time-series: synthetic and real karstic study cases” uses four causal inference methods to obtain insights on hydrologic connectivity in a karstic site. The authors also use a synthetic case study to obtain a preliminary understanding on the performance of the methods prior to applying them to the real-world case study.

Overall, the paper objective and the experimental design are well articulated and set clearly. I also think that causal inference is an important topic that should receive more attention from the hydrologic community, and it is of an interest to the audience of HESS. However, I noted some technical issues on the implementation of causal inference methods and the interpretation of results. Also, the discussion section is lacking as the authors didn’t compare their findings to recent papers that utilized causal inference in hydrology. Below I summarize these issues and other major comments followed by minor ones. I urge the authors to pay attention to these comments while revising their manuscript. Generally, I recommend that this paper undergoes major revisions before it can be considered for publication.

In the application of multivariate causal inference methods (Partial Correlation and Conditional Mutual Information), the authors didn’t illustrate whether the history of variables is used in the conditioning set of variables or not. For instance, if one wants to test for the hypothesis that causes using multivariate methods, then ideally the following conditional test should be implemented: . This conditional test means that one is testing for the statistical relationship between and at time while conditioning in the history of both variables up to a lag time of as well as the set of variables which includes all other confounders. This is a crucial point in the implementation of multivariate methods because it removes the effect of autocorrelation. There are many ways of conditioning in the history of variables; for example, the classical Granger causality (Granger, 1969) conditions in the history of but not . Similarly, Transfer Entropy (Schreiber, 2000) does the same, whereas methods such as momentary information transfer (Pompe & Runge, 2011) conditions on both variables. I think that the authors need to mention explicitly what is the conditioning test for both multivariate methods. Please consider adding this information with clear mathematical expressions. Also, the parameter is often one of the most important hyperparameters in causal inference methods, so a discussion on the value of this parameter needs to be included. Please note that in this context is slightly different than which the authors use to set the maximum lag time for testing interactions.

Related to the previous point, the reason why conditioning in the history of variables is important is because hydrologic timeseries of variables are often highly autocorrelated which leads to spurious causal links, and this testing removes autocorrelation. This context is important in interpreting the results obtained from causal inference methods either in the synthetic or real-world case studies. The authors used first-order difference of the original time series to remove the effect of seasonality and autocorrelation; however, this is not needed if the implementation of multivariate methods already conditions on the history of variables. Please revise the interpretation of results both in sections 3.2 and 4.2 to highlight this issue of autocorrelation.

I found the discussion section to be lacking and it does not report any insightful comparisons to previous work of causal inference in hydrology. For instance, Ombadi et al. 2020 used four causal inference methods on a synthetic and real-world case studies with formal investigation on the impact of sample length, observational and process noise. Some of the methods used in this paper (e.g., CCM and CMI) were also used in that study. It would be important to compare the findings of both studies on the performance of different causal inference methods as this will allow us to build a consensus on the suitability of causal inference methods for hydrologic applications. Although Ombadi et al. 2020 is perhaps the most relevant to this study, there are other studies that used specifically information-theoretic approaches for hydrologic systems characterization such as (Jiang & Kumar, 2019). Please enrich the discussion section by linking the findings of this study to previous work.

The record length of the timeseries used in this study is relatively short. This is one of the main challenges that face the application of causal inference methods in hydrology. In my experience, I found that methods based on information theory (e.g., CMI) often needs a long record (~ 2000 – 3000 data points) to provide reasonable performance. The results shown in this study either for the synthetic or real-world case studies are perhaps significantly impacted by the record length yet no discussion was included on the effect of record length. Please enrich the discussion section by highlighting the potential impacts of sample length.

Overall, the paper needs to be subjected to proofreading to eliminate grammatical deficiencies and typos as well as to improve the readability. I highlight some of these grammatical errors, unclear sentences and typos in the minor comments below, but these are only few examples and similar corrections should be implemented throughout the paper.

Some of the basic information on causal inference methods that was mentioned in section 2 is not accurate and incorrect. For instance, the description of partial correlation in lines 136-138 is not accurate. The authors mention that “partial correlation is like Granger causality”. This is quite vague. What the word “like” means specifically here? It is true that partial correlation shares similarities with Granger causality in the sense that both use linear regression to assess interactions while conditioning on potential confounders. However, there are crucial difference too. For instance, Granger causality is technically implemented in a different way than partial correlation with setting both restrictive and unrestrictive regression models, and testing for statistically significant differences using t-test and F-test. These are very crucial differences. Also, on a higher level, the concept of Granger causality is based on what is known as predictive causality and it take into account time precedence. The authors should be careful in introducing the different methods and use precise information. I suggest that the authors revise section 2.1 by writing clear mathematical expressions for each causal inference method, and also be more precise in their description.

Lines 315-320 and elsewhere: the instability of CMI here is attributed to missing data. This might be one of the reasons, but I suspect that the main reason is the short record length (see my major comment #4). Also, a possible but unlikely reason is that the instability is the result of changes in the dynamic connectivity. This might be true if the timeseries used in Figure 6 (a, b, c and d) correspond to different hydrologic conditions (wet vs dry). If this latter case is possible, then it is worth of highlight and discussion.

Figures 4 & 5: there are several causal links with arrows pointing toward RF (Rainfall)?? Apparently, this is physically incorrect, but I was not able to find any discussion on this in the paper. Are these arrows drawn in the wrong way? Or these are the real results obtained from causal inference methods? If it is the latter case, then this needs to be discussed. In general, this raises a red flag on the accuracy of causal links obtained from the different methods.

Lines 228-229: the standard deviation of the noise added to the precipitation signal is unrealistically large!! Even the smallest value used here which is 0.05 of the standard deviation of precipitation is still large. A proportion of 0.001 of the standard deviation of precip is often sufficient to satisfy the condition of causal sufficiency. If process noise is very large, this will impact the results. See Ombadi et al. 2020 for the impact of process noise on the performance of different causal inference methods.

Lines 230-231: why only the last year was used? Is it a spin-off period to eliminate the impact of initial conditions or for computational reasons?

Lines 252-253: this is perhaps related to the conditioning on the history of variables (see my major comments #1 and #2)

Lines 260-262: This is a well-known issue with causal inference methods. You can refer to some studies that pointed to the same issue in evaluating causal inference methods either in hydrology or other fields.

Table 4: please replace the abbreviations with the full name (e.g., TP: True positives) or alternatively add this info to the caption of the table so that it can be a standalone component.

Figures 4, 5 and 6: I suppose that the numbers in the arrows denote the lag time of interaction in days; however, this was never introduced or mentioned in the captions. Please revise.

Lines 27-29: some applications of causal inference in hydrology are missing here. For instance, soil moisture-rainfall feedback (Wang et al., 2018) or differential impact of environmental drivers of evapotranspiration (Ombadi et al., 2020). There are others too if you look in the literature.

Lines 35-44: I liked the distinction between structural, functional and effective connectivity. However, from the text, it was not clear what is meant by the effective connectivity and how it differs from the functional one.

Line 71: remove “obtained from”. Typo.

Line 144: the correct name is transfer entropy not “entropy transfer”

Line 150: “computationally expensive and quickly require …”. The sentence is not logically correct. Please revise.

Line 152: grammatical error in “at section 3”. It should be “in section 3”

Figure 1 caption: when referring to (a) and (b), please remove the parentheses because it is confusing. Only use the parentheses the first time you introduce them.

Line 291: replace “As for CCM” with “Similar to CCM”. The sentence does not read well currently.

Line 15: this sentence does not read well at all. I understand what you want to convey, but it needs to be rephrased. Something like “…interactions between variables from timeseries only…etc.”

Lines 98-99: the description of the parameter alpha_pc is not very clear and intuitive to me. Could you elaborate?

Pompe, B., & Runge, J. (2011). Momentary information transfer as a coupling measure of time series. Physical Review E, 83(5), 051122.

Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. , 424-438.

Ombadi, M., Nguyen, P., Sorooshian, S., & Hsu, K. L. (2020). Evaluation of methods for causal discovery in hydrometeorological systems. , (7), e2020WR027251.

Schreiber, T. (2000). Measuring information transfer. Physical review letters, 85(2), 461. https://doi.org/10.1103/PhysRevLett.85.461

Wang, Y., Yang, J., Chen, Y., De Maeyer, P., Li, Z., & Duan, W. (2018). Detecting the causal effect of soil moisture on precipitation using convergent cross mapping. Scientific reports, 8(1), 12171. https://doi.org/10.1038/s41598-018-30669-2

Jiang, P., & Kumar, P. (2019). Using Information Flow for Whole System understanding from component dynamics. Water Resources Research, 55(11), 8305-8329.
Citation: https://doi.org/10.5194/hess-2021-445-RC1
- RC2:
  'Reply on RC1', Anonymous Referee #1, 05 Oct 2021
  
  The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2021-445/hess-2021-445-RC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/hess-2021-445-RC2
  - AC5: 'Reply on RC2', Damien Delforge, 28 Nov 2021
    
    Answered in AC1.
    
    Citation: https://doi.org/10.5194/hess-2021-445-AC5
- AC1:
  'Reply on RC1', Damien Delforge, 26 Nov 2021
  1. General response
  Dear reviewer, on behalf of the co-authors and myself, I would like to thank you for your attention to the preprint manuscript and the helpful comments you made. Below are detailed responses to each of your major and minor comments. A pdf file is also attached, which also displays your original comments for convenience.
  Many points you raised seem related to the lack of precision in the methodology section and, for the discussion section, a lack of clear insights and missing references to existing literature. Regarding the methods, although several of your questions are answered in the supplementary materials (SM), we agree that the manuscript should be self-contained. Accordingly, the revised paper is improved by briefly revising the methodological section while keeping the section succinct. Regarding the discussion, based on your remarks and those of the other reviewers, we propose a rewriting of the section with improved references and comparisons to the literature on the following topics:
  a summary and appreciation of our results (with improved references to the literature);
  
  a particular focus on the estimation of Conditional Mutual Information (CMI) concerning missing values, record length, dimensionality, the nature of the dependencies, or noise;
  
  and practical recommendations for the uses of causal inference methods and future research perspectives.
  
  Concerning point 2 and the virtual experiment, the other reviewers strongly encourage us to extend the virtual experiment to study the effect of the sample size and/or the number of variables. This was not explicitly your request, although you expressed concern about the instability of the PCMCI-CMI method and its relation to the record length (Major comment 4). Nevertheless, we have decided not to comply with this request for multiple reasons that are reflected in the revised discussion, as any reader may have the same concerns.
  These motivations are numerous and quite related to your major comment 4 about the sample size. First, our study remains a comparative study. Such a focus on the PCMCI-CMI method would rather deserve a separate issue. Also, the conclusion of such an extended virtual experiment is reasonably known a priori: the results become more robust with increasing sample length or decreasing number of variables (including delayed ones up to d_max). We see no reason why a non-trivial conclusion, such as recommendations of sample length as a function of the number of variables, would be transposable to a problem with different characteristics, such as different noise levels, model coupling patterns, signal behavior, or representative scales. In addition to the sample size and dimensionality, the CMI also depends on the nature of the CMI dependencies, smooth or not smooth as it could be expected in systems with highly dynamic connectivity, as well as the magnitude and the characteristic of noise. This CMI dependency and noise vary across spatial and time scales. The results also depend on the methods, for instance, kernel-based or nearest neighbors estimators and their hyperparameters.
  Our point, with the synthetic studies, was to show the divergence of the methods on the same – simplistic - case study, not as an answer to the question “what should we do”, but rather as an exploration of the behavior of the tested methodology in a case where we can give meaningful interpretations of the results. For each problem on which those methods are used, we consider that a good strategy would be to test the issues met and the insights gained by using fit-for-purpose models mimicking the property of signals they want to study.
  Thank you again for your contribution to this discussion,
  Damien Delforge
  2. Major comments
  2.1 Major comment 1
  
  In the application of multivariate causal inference methods (Partial Correlation and Conditional Mutual Information), the authors didn’t illustrate whether the history of variables is used in the conditioning set of variables or not. For instance, if one wants to test for the hypothesis that Y causes X using multivariate methods, then ideally the following conditional test should be implemented: I(Y_t,X_t ┤| Y_(t-1):Y_(t-τ),X_(t-1):X_(t-τ),Z). This conditional test means that one is testing for the statistical relationship between Y and X at time t while conditioning in the history of both variables up to a lag time of 𝜏 as well as the set of variables Z which includes all other confounders. This is a crucial point in the implementation of multivariate methods because it removes the effect of autocorrelation. There are many ways of conditioning in the history of variables; for example, the classical Granger causality (Granger, 1969) conditions in the history of but not . Similarly, Transfer Entropy (Schreiber, 2000) does the same, whereas methods such as momentary information transfer (Pompe & Runge, 2011) conditions on both variables. I think that the authors need to mention explicitly what is the conditioning test for both multivariate methods. Please consider adding this information with clear mathematical expressions. Also, the parameter 𝜏 is often one of the most important hyperparameters in causal inference methods, so a discussion on the value of this parameter needs to be included. Please note that 𝜏 in this context is slightly different than 𝑑𝑚𝑎𝑥 which the authors use to set the maximum lag time for testing interactions.
  
  The PCMCI algorithm tests for Momentary Conditional Independence (MCI) between variables, which implies conditioning on the Parents of both variables, which are subsets of the historical variables selected using the PC algorithm. This is further explained in the Supplementary materials following the description given in Runge et al. (2019).
  Based on your remark, we agreed that the main manuscript could give more details about the algorithm and the tests. As suggested, the revised methodological section now includes the mathematical description for the independence test with a short description in section 2.1. A longer description will still be available in the supplementary materials. Also, we agree that the method section lacks clarity and that the references to supplementary material are not visible enough. Therefore, we propose a few edits in section 2.1.3 to remind that the PCMCI algorithm and the independence tests are further described in the SM. To facilitate a cross-sectional reading of the paper, we also refer to the test equation mentioned above of the MCI in section 2.1, in sections 2.1.3 and 2..1.4 related to ParCorr and MCI. Finally, as you said, the maximum delay is an important parameter that affects the outcome. Some remarks about this parameter are scattered in the preprint article. In particular, in line L190-195, we mention that it should be large enough in virtue of the causal sufficiency but cannot exceed five days due to missing values. In practice, this delay is often set based on the analysis of bivariate dependencies. We propose to stress the importance of the delay and mention these points earlier in the more appropriate section 2.1. Also, we can refer to figure SM.3 and 4 showing the CCF and CCM bivariate dependencies to motivate a maximum delay of 5 days. Both figures show that most of the dependencies do not sustain beyond four days.
  2.2 Major Comment 2
  
  Related to the previous point, the reason why conditioning in the history of variables is important is because hydrologic timeseries of variables are often highly autocorrelated which leads to spurious causal links, and this testing removes autocorrelation. This context is important in interpreting the results obtained from causal inference methods either in the synthetic or real-world case studies. The authors used first-order difference of the original time series to remove the effect of seasonality and autocorrelation; however, this is not needed if the implementation of multivariate methods already conditions on the history of variables. Please revise the interpretation of results both in sections 3.2 and 4.2 to highlight this issue of autocorrelation.
  
  We agree with your point on auto-correlation. Therefore, we will revise sections 3.2 and 4.2 as suggested. Even if the first-order difference is not needed, we consider it worth illustrating it as we believe that potential causal inference method users are unaware of this issue and its implication. This change will also meet the concerns of reviewer #4, who also requested some clarification and explanations about it.
  
  2.3 Major comment 3
  
  I found the discussion section to be lacking and it does not report any insightful comparisons to previous work of causal inference in hydrology. For instance, Ombadi et al. 2020 used four causal inference methods on a synthetic and real-world case studies with formal investigation on the impact of sample length, observational and process noise. Some of the methods used in this paper (e.g., CCM and CMI) were also used in that study. It would be important to compare the findings of both studies on the performance of different causal inference methods as this will allow us to build a consensus on the suitability of causal inference methods for hydrologic applications. Although Ombadi et al. 2020 is perhaps the most relevant to this study, there are other studies that used specifically information-theoretic approaches for hydrologic systems characterization such as (Jiang & Kumar, 2019). Please enrich the discussion section by linking the findings of this study to previous work.
  
  We thank the reviewer for sharing these additional references, particularly the recent work of Ombadi et al., 2020. In addition, we propose to rewrite this section as proposed in our general response to be, as the commentary suggests, more in line with the existing literature and focused on specific issues as the impact of missing data or sample size, or the number of variables (including lagged ones up to d_max).
  
  At first glance, the study of Ombadi et al. does not challenge our conclusions but complements them. We were particularly interested regarding the results of the bivariate Transfer Entropy (TE) method as we didn’t explore it. The asymptotic behavior for True Positive Rate was moderately comparable to the PC method, and the reduction of the False Positive Rate is quite impressive. TE, we believe, could be less demanding in terms of record length as no conditioning is performed for the other variables and could be a viable potential alternative when the sample size is problematic. The behavior of CCM with respect to noise also allowed us to respond to some concerns of reviewer 3. We will make sure to report and discuss these interesting findings properly. The work Jiang and Kumar (2019) is also very relevant as it uses the PCMCI framework implemented in Tigramite to characterize hydrological systems.
  
  2.4 Major Comment 4
  
  The record length of the timeseries used in this study is relatively short. This is one of the main challenges that face the application of causal inference methods in hydrology. In my experience, I found that methods based on information theory (e.g., CMI) often needs a long record (~ 2000 – 3000 data points) to provide reasonable performance. The results shown in this study either for the synthetic or real-world case studies are perhaps significantly impacted by the record length yet no discussion was included on the effect of record length. Please enrich the discussion section by highlighting the potential impacts of sample length.
  
  Our beginning discussion mentions the problem of missing values (L316-320). We consider missing values to be problematic in the same sense as the record length problem. Missing values reduce the overlapping time domain over which conditioning can be performed. This is equivalent to a reduction in available samples, triggering issues similar to those of a short record length. Concerning Conditional Mutual Information (CMI), the adequate record length would depend on the dimensionality of the problem (the number of necessary variables to characterize the system) and the complexity of conditional dependencies (e.g., highly non-smooth dependencies, which could be the very case of intermittent or highly dynamic connectivity), which is not scale-invariant in time and space, and may vary geographically from study site to study site.
  
  To the extent that hydrology deals with problems that are empirically complex to characterize beyond an overly simplistic lumped view, we agree that record length should be relatively large, but to what extent, we believe, is highly variable and case-dependent. It also depends on the method chosen to characterize the CMI. In particular, this is the reason why we chose the nearest-neighbor estimator and the shuffle test of Runge (2018) that is better suited than kernel-based approaches for short record length (< 1000), based on numerical experiments covering sample sizes from 50 to 2,000 and dimensions up to 10. Yet, despite the use of method recommended for small records, the real study case of the manuscript is concerned by the pitfall of estimating CMI with short record length resulting in non-robust test results, as you also interpret (minor comment #1). We showed that the robustness could be increased by performing an ensemble of tests. Of course, the robustness of a numerical result is one issue; its reliability in terms of connectivity is another.
  
  Overall, the problem of estimating the CMI in case of missing values, short record length, or high dimensionality was the concern of all reviewers. Therefore, as mentioned in our general response, we propose revising the discussion section to discuss this topic properly.
  
  2.5 Major Comment 5
  
  Finally, there are several places with strange phrasing, or where a term is introduced before it is defined, so there is momentary confusion on whether a reference is missing or the sentence is relevant. I am highlighting some of these that I noticed in the minor line-by-line comments below.
  
  We apologize for these writing problems and thank the reviewer for pointing out some of them. We will make sure to correct them and recheck the whole manuscript carefully.
  
  2.6 Major Comment 6
  
  Some of the basic information on causal inference methods that was mentioned in section 2 is not accurate and incorrect. For instance, the description of partial correlation in lines 136-138 is not accurate. The authors mention that “partial correlation is like Granger causality”. This is quite vague. What the word “like” means specifically here? It is true that partial correlation shares similarities with Granger causality in the sense that both use linear regression to assess interactions while conditioning on potential confounders. However, there are crucial difference too. For instance, Granger causality is technically implemented in a different way than partial correlation with setting both restrictive and unrestrictive regression models, and testing for statistically significant differences using t-test and Ftest. These are very crucial differences. Also, on a higher level, the concept of Granger causality is based on what is known as predictive causality and it take into account time precedence. The authors should be careful in introducing the different methods and use precise information. I suggest that the authors revise section 2.1 by writing clear mathematical expressions for each causal inference method, and also be more precise in their description.
  
  As mentioned in response to Major Comment 1 and the general response, the revised methodology section aims at being more specific, but not too much longer, while recalling the additional description in the SM. In this case, you will notice that the questions about the differences between the PCMCI-ParCorr and Granger causality are clearly mentioned in the SM1.3. How crucial the differences between the PCMCI-ParCorr and Granger approaches are is a matter for a debate that, in our opinion, does not belong to the paper’s scope. We agree that changing the PC routine, or the test to an F-test or a Wald test are alternatives that will probably change the outcomes of causal analysis without altering the general philosophy. Of course, we may understand that if a choice alters the outcome, it is somehow crucial when it comes to causality. By being more specific in the main manuscript, we hope to provide information that may be of interest to the reader in relation to this question they may have or for a simple question of reproducibility. However, we do not have the arguments for preferring one implementation over another and, therefore, we do not wish to elaborate on the different variants of Granger causality.
  
  3. Minor Comments
  
  3.1 Minor Comment 1
  
  Lines 315-320 and elsewhere: the instability of CMI here is attributed to missing data. This might be one of the reasons, but I suspect that the main reason is the short record length (see my major comment #4). Also, a possible but unlikely reason is that the instability is the result of changes in the dynamic connectivity. This might be true if the timeseries used in Figure 6 (a, b, c and d) correspond to different hydrologic conditions (wet vs dry). If this latter case is possible, then it is worth of highlight and discussion.
  
  We agree that the record length is the main reason for the instability and is related to the missing value problem (see the general response and major comment 4). The instability to which we first refer is related to the variable results of the test obtained on the same dataset and on the same time domain given by the time distribution of missing values, the record length, and the maximum delay. Since the time domain on which the conditioning is performed does not vary, the same holds for the hydrological conditions. It is, however, confirmed that for the graphs a, b, c, d of Figure 6, the temporal domain for which the variables are conditioned varies since the datasets vary. Hence, the analysis covers more or less hydrological states according to the size of this domain.
  
  Although not stated clearly in the current preprint, we hypothesized that hydrological connections are perennial, i.e., not intermittent, even if sensitive to the hydrological conditions, i.e., nonlinear. We stress this assumption in the revised document. With this in mind, we might expect the hydrologic connections revealed by the MCI to be more robust as the sample size increases. However, this assumption may not be correct if the connectivity is intermittent or highly dynamic. In this case, we believe that the revealed connections are representative of averaged connectivity and may differ if the time domain of the analysis varies for small sample sizes, as you have suggested. It is worth being notified and discussed together with our hypothesis of constant connectivity over time as opposed to intermittent connectivity. We encourage further studies about intermittency in the perspectives at the end of the discussion.
  
  3.2 Minor Comment 2
  
  Figures 4 & 5: there are several causal links with arrows pointing toward RF (Rainfall)?? Apparently, this is physically incorrect, but I was not able to find any discussion on this in the paper. Are these arrows drawn in the wrong way? Or these are the real results obtained from causal inference methods? If it is the latter case, then this needs to be discussed. In general, this raises a red flag on the accuracy of causal links obtained from the different methods.
  
  Indeed, some links that are obtained from the causal inference methods are physically incorrect. We report some of them, e.g., in L296-297 in the result section: “We also denote two upward links to R4 and ET. These links seem physically unrealistic and potentially problematic since the effect of P2 is be removed from these variables, which may alter the whole causal graph.”. We further remind the how problematic it is in the discussion L353-356: “For the multivariate methods, we have chosen to let the causal graphs be formed from the data. We have not prescribed any constraint on the conditioning of variables. This means that variables can be conditioned on potentially aberrant links, negatively impacting the whole causal graph”.
  
  For more clarity, we propose to discuss this in the first part of the discussion about the summary and general appreciation of the results (see general response).
  
  3.3 Minor Comment 3
  
  Lines 228-229: the standard deviation of the noise added to the precipitation signal is unrealistically large!! Even the smallest value used here which is 0.05 of the standard deviation of precipitation is still large. A proportion of 0.001 of the standard deviation of precip is often sufficient to satisfy the condition of causal sufficiency. If process noise is very large, this will impact the results. See Ombadi et al. 2020 for the impact of process noise on the performance of different causal inference methods.
  
  We understand that our process noise may be unrealistic because we did not attempt to mimic a realistic environmental rainfall noise. The noisy rainfall series, P_{eff, A} and P_{eff, B} are intermediate variables that are used in the causal analysis. Ultimately, what matters is the resulting noise in the reservoir discharge series Q_A and Q_B. In practice, we adjusted the parameters to our liking by graphically interpreting the discharge differences between consecutive runs of the toy model. The noise parameter may seem large because the unit hydrograph and reservoir act as low-pass filters, removing a significant portion of the injected noise.
  
  Following your remark, we wanted to have a better representation of the impact of the noise level parameter eps_lvl. The table below shows the 100-runs average correlation rho_mean between two discharge series of 365 days generated with the same unit hydrograph H=[0.7, 0.2, 0.1] and linear storage discharge equation Q = 0.1*S per noise level parameter. In the manuscript, this is equivalent to the average correlation between two Q_A obtained with model configuration 1. These values, although roughly evaluated, seem to us to be a reasonable noise panel to explore, especially given the wide range of scales of possible hydrological applications and where we could imagine any kind of dissimilarities.
  
  eps_lvl 0.05 0.1 0.15 0.2 0.25 ... 0.7
  rho_mean 0.96 0.87 0.76 0.66 0.57 ... 0.24
  3.4 Minor Comment
  
  Lines 230-231: why only the last year was used? Is it a spin-off period to eliminate the impact of initial conditions or for computational reasons?
  
  Yes, the early period was considered as a warming-up period. It will be mentioned. Of course, it also impacts computational time.
  
  3.5 Minor Comment 5
  
  Lines 252-253: this is perhaps related to the conditioning on the history of variables (see my major comments #1 and #2)
  
  L253-253: “This finding supports our theoretical assertion: the multivariate nonlinear method is the best suited to address effective hydrological connectivity. Furthermore, the method appears to perform better if seasonality is left present in the time-series”.
  
  As we did with the major comments, we agree here as well. We revised to make sure that this is explicit for the reader.
  
  3.6 Minor Comment 6
  
  Lines 260-262: This is a well-known issue with causal inference methods. You can refer to some studies that pointed to the same issue in evaluating causal inference methods either in hydrology or other fields.
  
  L260-262: "This is particularly contrasting with other methods and provides a valuable piece of information. However, the high precision comes at the cost of a low recall: CMI misses about half of the actual causal links. On the contrary, ParCorr misses none but has a bad precision, i.e., many false positives."
  
  It is indeed in phase with Runge’s reports on the methods. This also could be related to the results of Ombadi et al, 2020. We will look for other references in hydrology.
  
  3.7 Minor Comment 7
  
  Table 4: please replace the abbreviations with the full name (e.g., TP: True positives) or alternatively add this info to the caption of the table so that it can be a standalone component.
  
  We will choose one of the two options to improve the readability of the paper.
  
  3.8 Minor Comment 8
  
  Figures 4, 5 and 6: I suppose that the numbers in the arrows denote the lag time of interaction in days; however, this was never introduced or mentioned in the captions. Please revise.
  
  Please, note that all captions mention it: “An undirected line represents contemporaneous dependencies. Delayed dependencies are shown using directed curved arrows. All corresponding delays d are displayed in the middle of its corresponding arrow”
  
  We will, however, notify the reader that the delay is expressed in days.
  
  3.9 Minor Comment 9
  
  Lines 27-29: some applications of causal inference in hydrology are missing here. For instance, soil moisture-rainfall feedback (Wang et al., 2018) or differential impact of environmental drivers of evapotranspiration (Ombadi et al., 2020). There are others too if you look in the literature.
  
  We thank the reviewer for sharing these additional references, as well as those from Jiang and Kumar 2019. We will update our literature review and ensure that we further compare our results with existing studies.
  
  3.10 Minor Comment 10
  
  Lines 35-44: I liked the distinction between structural, functional and effective connectivity. However, from the text, it was not clear what is meant by the effective connectivity and how it differs from the functional one.
  
  L35-44: We refer to the terminology of 35 Rinderer et al. (2018), which is inspired by and borrowed from the field of neurological and brain connectivity (Friston, 2011). There are three types of connectivity: (i) structural, (ii) functional, and (iii) effective connectivity. The structural connectivity is derived from the medium and highlights the potential, static, and time-invariant water flow paths from the geological environment’s topography, spatial adjacency, or contiguity. The functional one is dynamic and is retrieved from statistical time-dependencies between local hydrological variables. A statistical association may result from confounding factors, e.g., rainfall acting on two disconnected reservoirs or a shared seasonal pattern. Therefore, dependencies do not necessarily imply factual causation, such as process-based water flows. Then, the functional connectivity is a matter of cross-predictability and still reflects potential rather than actual flow paths for water. CIMs with a multivariate framework address confounding factors. They offer the promises of discriminating functional connectivity from the effective one, which reveals actual flow paths and processes within the system. From the structural to the effective connectivity through the functional one, the search for hydrological connections can be seen as a progressive constraint from the potential paths to the actual paths taken by water.
  
  For more clarity, we propose a reordering of the paragraph and some edits: “[…] The functional one is dynamic and is retrieved from statistical time-dependencies between local hydrological variables. Functional connectivity is a matter of cross-predictability and reflects dynamic links between the variables. These dynamic links are potential connections subject to confounding factors, i.e., they may or may not be related to a flow process between variables. Effective connectivity precisely refers to actual connections linked through hydrological processes and flows. Since CIMs with a multivariate framework address confounding factors, they offer the promise of discriminating functional connectivity from the effective one. From the structural to the effective connectivity through the functional one, the search for hydrological connections can be seen as a progressive limitation of the possibilities, from the potential paths to the actual paths taken by water".
  
  3.11 Minor Comment 11
  
  11- Line 71: remove “obtained from”. Typo.
  
  Corrected. Thank you for pointing the issue.
  
  3.12 Minor Comment 12
  
  12- Line 144: the correct name is transfer entropy not “entropy transfer”
  
  Corrected. Thank you for pointing the issue.
  
  3.13 Minor Comment 13
  
  13- Line 150: “computationally expensive and quickly require …”. The sentence is not logically correct. Please revise.
  
  Corrected. Thank you for pointing the issue.
  
  3.14 Minor Comment 14
  
  14- Line 152: grammatical error in “at section 3”. It should be “in section 3”
  
  Corrected. Thank you for pointing the issue.
  
  3.15 Minor Comment 15
  
  15- Figure 1 caption: when referring to (a) and (b), please remove the parentheses because it is confusing. Only use the parentheses the first time you introduce them.
  
  Corrected. We avoid parenthesis or parenthesis with a single letter when referring to the subplots. Thank you for pointing the issue.
  
  3.16 Minor Comment 16
  
  16- Line 291: replace “As for CCM” with “Similar to CCM”. The sentence does not read well currently.
  
  Corrected. Thank you for pointing the issue.
  
  3.17 Minor Comment 17
  
  17- Line 15: this sentence does not read well at all. I understand what you want to convey, but it needs to be rephrased. Something like “…interactions between variables from timeseries only…etc.”
  
  Corrected. We now use “from timeseries only” as suggested. Thank you for pointing the issue.
  
  3.18 Minor Comment 18
  
  18- Lines 98-99: the description of the parameter alpha_pc is not very clear and intuitive to me. Could you elaborate?
  
  L98-99: The PC procedure has a tuning hyperparameter named αP C , which controls the number of potential causes. αP C varies from 0 to 1, where 1 is the less restrictive case which implies not pre-selection.
  
  The alpha_PC is a significance level used to select the parents in the PC stage of the PCMCI algorithm. We agree that “tuning hyperparameter” is too vague when used alone and will use a succinct description stating explicitly that it is a significant level, closer to the one of Runge et al. (2019): “The main free parameter of PCMCI is the significance level αPC in PC1, which should be regarded as a hyperparameter ...”
  
  More details are available in the SM.
  
  4. Cited references
  
  Runge, J.: Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information, in: International Conference on Artificial Intelligence and Statistics, International Conference on Artificial Intelligence and Statistics, 938–947, 2018.
  
  Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., and Sejdinovic, D.: Detecting and quantifying causal associations in large nonlinear time series datasets, 5, eaau4996, https://doi.org/10.1126/sciadv.aau4996, 2019.
  
  Jiang, P. and Kumar, P.: Using Information Flow for Whole System Understanding From Component Dynamics, 55, 8305–8329, https://doi.org/10.1029/2019WR025820, 2019.
  
  Ombadi, M., Nguyen, P., Sorooshian, S., and Hsu, K.: Evaluation of Methods for Causal Discovery in Hydrometeorological Systems, 56, e2020WR027251, https://doi.org/10.1029/2020WR027251, 2020.
  
  Citation: https://doi.org/10.5194/hess-2021-445-AC1
RC3:
'Comment on hess-2021-445', Anonymous Referee #2, 06 Oct 2021

This paper focuses on causal inference in a hydrologic system, and compares four different causal inference frameworks that range between bivariate versus multivariate and linear versus nonlinear. The authors introduce several aspects and assumptions related to causal inference, such as confounding factors, nonlinear and threshold interactions, effective versus functional connectivity, and the potential for missing drivers. They then apply the four methods to a synthetic hydrology problem in which two reservoirs are connected by meteorologic forcing, such that they appear causally connected if the forcing is not accounted for. Next, they apply the methods to an actual dataset in a karst system. They conclude that while the nonlinear methods detect nonlinear interactions, they also miss some interactions, and should be used carefully or in combination with multiple methods to more robustly detect causal interactions.

This was an interesting paper that should be of interest to the hydrologic community and anyone interested in applying a causal inference method to their data. However I have several comments on the methods, interpretation of the results, and the writing in general and would suggest “moderate” revisions prior to publication.

Major comments:

The lack of statistically significant links in the CMI method made more sense to me when I noticed the sparse dataset (e.g. fewer than 100 data points) that is available for the karst system analysis. In general, I think this method would not be expected to produce robust results given this amount of data, due to the high dimensional pdfs involved for information theory measures. With this, I think the interpretation should not be that CMI performed “worse” in some way, but that it has higher need for data length as a much higher dimensional approach, especially relative to the bivariate methods. Additionally, it seems like the bivariate methods actually do utilize the full amount of data available for any two variables, which means that the comparison is even less “fair” – it might be better to only consider the time window in which there is data available for all (or some, like the P1, P2, and P3 cases in Figures 5 and 6) variables. I think this aspect should be made more clear at the least, and possibly deserves a change in the time windows used for the comparison.

I also have a question about the statistical significance. For example, in some information theory based studies, we use a shuffled surrogates method for statistical significance of a given link, which would differ from a p-value in a correlation analysis. If the method for identifying a statistically significant link varies at all between methods, this also needs to be apparent.

For the synthetic case, I see you used a lot longer dataset than you have available for the real karst case study. This could lead to the better performance for the higher dimensional or multivariate methods – it might be useful to test the synthetic case for a much smaller dataset and observe or confirm whether these methods start to lose their detection of links. This could better show that the CMI/ParCorr types of methods do have better performance, but only when given a lot of data. I think the differences between the methods might make them inherently difficult to compare, but these are some things that could improve the attempt.

Finally, there are several places with strange phrasing, or where a term is introduced before it is defined, so there is momentary confusion on whether a reference is missing or the sentence is relevant. I am highlighting some of these that I noticed in the minor line-by-line comments below.

Minor comments:

Line 7: “appears unstable” relates to my comment on data length…I think the instability is at least partially due to a very small dataset. Either way, it is not very clear what this term means within the abstract.

Line 15: “between variables from variables” did not make sense to me

Line 28: cross-scale

Line 35-45: This paragraph seemed scattered, and I did not come out of it with a clear understanding of “effective connectivity” in particular. Suggest to revise

Line 41: “process-based water flows” – I’m not sure if there are non-process-based flows of water?

Line 45: “progressive constraint” was not clear to me

Line 50: “heterogeneity” instead of “hiddenness”?

Line 60: I’m not sure about the sentence “nonlinearity is imputed to nonlinear hydrological processes”, seems redundant

Line 65: would be good to re-define CCF here

Line 74: What do you mean by “to appreciate the results” – to compare with the results, or validate them?

Line 85: “Being multivariate” – I’m not sure that a multivariate approach inherently deals with confounding effects. For example, a multiple linear regression is multivariate, but does not do any type of conditioning on confounding variables…

Line 93: PC and MCI are brought up, and then defined later – would be better to re-arrange such that we are not wondering what they are.

Line 99: “not preselection” to “no preselection”?

Line 103: reference for causal sufficiency? In general, this is a good point for any analysis, you particularly reference it for CMI, could state that this hypothesis really underlies all your methods…

Section 2.2.1: I felt like you did some discussion previously that was particular to each method, but then you have these sections for each method separately. I would move some of the above material at the beginning of 2.1 into these sections directly, and save the “causal sufficiency” aspect at the beginning as it applies to any method.

Line 121: “overall good performance of this value” is vague, do you mean for the synthetic study, or the real study, or in general?

Line 136: I don’t think you have introduced Granger Causality

Line 153: Is two weeks of computation for a single processor? I figure it would take different amounts of time depending on whether you used a laptop or a server, etc, so could make this more clear.

Line 176: It seems like for 2014-2017 time-series, there would be more than 465 time steps, implying the presence of gaps. This also comes into play in terms of your total data length for the multivariate methods. Basically, the counts in Table 2 make it seem like there is more data available than there actually is, when you start comparing multiple datasets (with 48 data points being the total overlap).

Line 204: “problematic case of the common cause” – after this, you define what this means, but as it is, the phrase seems a little mysterious, like a Sherlock Holmes story.

Line 210: haven’t defined Qb’ yet

Line 230: This is a “synthetic study” but the years make it seem like you are using actual data from your real study?

Line 234: What is a differenced dataset? This comes up a few times and I’m not completely sure what it is…whether it is the increment or something done in the modeling process.

Line 270: “If causality is hard to infer…” is not a great sentence, excusing a complicated figure and telling us it actually makes sense. You could just remove this.

Line 297: “is be removed”

Line 338: “evanescent singularity” is unclear

Citation: https://doi.org/10.5194/hess-2021-445-RC3
- AC2:
  'Reply on RC3', Damien Delforge, 28 Nov 2021
  1. General response
  Dear reviewer, we would like to thank you for your attention to the preprint manuscript and the helpful comments you made. The major points you raised seem related to the lack of precision regarding the description of the statistical tests and concerns relative to the stability of the PCMCI-CMI method and, concurrently, its fair comparison to the other methods. Regarding the description of the tests, we briefly revised the methodological section to make their description more apparent. Regarding the stability issue of PCMCI-CMI and its comparison to other CIMs, we have chosen to clarify our point of view and use it as a basis for revising the discussion section of the manuscript.
  Regarding the discussion, based on your remarks and those of the other reviewers, we will rewrite the section with improved references and comparisons to the literature on the following topics:
  a summary and appreciation of our results;
  
  a particular focus on the robust estimation of Conditional Mutual Information (CMI) concerning missing values, record length, dimensionality, the nature of the dependencies, or noise;
  
  and practical recommendations for the uses of causal inference methods and future research perspectives.
  
  Concerning point 2 and the virtual experiment, you, as other reviewers, encourage us to extend the virtual experiment to study the effect of the sample size and/or the number of variables. Nevertheless, we have decided not to comply with this request for multiple reasons presented in the revised discussion, and in the responses below.
  Thank you again for your contribution to this discussion,
  On behalf of co-authors,
  Damien Delforge
  2. Major Comments
  2.1 Major Comment 1
  
  The lack of statistically significant links in the CMI method made more sense to me when I noticed the sparse dataset (e.g. fewer than 100 data points) that is available for the karst system analysis. In general, I think this method would not be expected to produce robust results given this amount of data, due to the high dimensional pdfs involved for information theory measures. With this, I think the interpretation should not be that CMI performed “worse” in some way, but that it has higher need for data length as a much higher dimensional approach, especially relative to the bivariate methods. Additionally, it seems like the bivariate methods actually do utilize the full amount of data available for any two variables, which means that the comparison is even less “fair” – it might be better to only consider the time window in which there is data available for all (or some, like the P1, P2, and P3 cases in Figures 5 and 6) variables. I think this aspect should be made more clear at the least, and possibly deserves a change in the time windows used for the comparison.
  
  We thank the reviewer for his/her comment that helped clarify the paper. Since stability is a recurring point among other reviewers, we decided to write a section in the discussion. Our intention was not to convey that the PCMCI-CMI method is worse or bad in a general sense but in our real study case. Still, we fully agree that the message to be conveyed is not relative to a ranking of the method as better or worse but that the method has more important prerequisites in terms of sample size depending on the number of variables. Therefore, we will make sure that it is apparent in the revised discussion.
  
  As you said, the bivariate methods use the full amount of data. We see the fair treatment of the time domain as an interesting discussion, which is also related to (1) the question that arises primarily in the presence of non-identically distributed missing data, "should we condition on all variables?", and (2) the problem of intermittent connectivity.
  
  Our first point of view is to use the PCMCI without formulating any constraint on which variable is allowed to influence another or not. Hence, this attitude is very empirical as we expect the PCMCI and the data to speak for themselves. The second initial point of view is that we make the assumption of persistent connections in time between the variables, even if nonlinear, as opposed to intermittent connectivity. This assumption is supported by the fact that percolation is continuous, but of course not verified. Our research question is, therefore, straightforward and boils down to "Is there a connection between two variables", without any complement such as "in winter or in summer", or "when it is very wet".
  
  With both viewpoints in mind, we consider two reasons why our approach may be unfair. First, with no regard to the hydrological problem, the methods are evaluated on different sample sizes, which is unfair if we aim at ranking them. Here, we may counterargue that the fact that bivariate methods allow for a more extensive cross-domain coverage is one of their advantages, which is not fair to dismiss. More generally, in the empirical generalization process of answering our question, "is there a connection between two variables?", it is unfair not to use all the available information to support the conclusion. Instead of harmonizing the temporal domain downwards, we prefer to discuss how to increase the time-domain to avoid the stability and dimensionality problem. With missing values, the small size of the time domain is due to the fact that we test and condition each variable with respect to each variable. This is why focusing on P1, P2, P3 only has the effect of increasing the size of the cross-domain while removing the possibility that P1, P2, P3 influence each other. This brings us to the first aforementioned question (1) "should we condition on all variables?". Conceivably, when it comes to testing the connectivity between two points A or B within a system, conditioning on their past and the common driver, e.g.., evaporation and rainfall or effective precipitation alone, could be a dimensionally adequate and contained representation of the problem.
  
  The other aspect that could be unfair when comparing causal graphs over different time domains was pointed out by reviewer 1. This is to interpret dissimilarities in the represented connections as a lack of robustness when they could be different connectivity regimes associated with different environmental conditions. This is why the hypothesis of persistent connections over time, as opposed to intermittent connectivity (2) is essential to introduce and discuss. If the hypothesis proves to be true, we can expect similar results for a given method regardless of time domain or hydrological conditions, as long as we reach a sufficient sample size. On the other hand, if intermittent connectivity is expected, the suggestion would be to consider piecemeal applications of the causal inference methods for some temporal subdomains (e.g., summer vs winter) or hydrological conditions (wet vs dry). In our case, this would reinforce the problem (1) mentioned above.
  
  In conclusion, we consider that changing the window as you suggest as a potential revision would be unfair in other respects mentioned above. Assuming we do so, we expect similar results in the case of CCF. For CCM, we may encounter robustness or convergence issues related to sample length similarly to PCMCI-CMI, without substantially enriching the discussion while requiring us to explain ourselves using different terminology related to nonlinear dynamical systems theory. In the end, the reader may also legitimately ask why not use longer records for CCM if we could.
  
  2.2 Major Comment 2
  
  I also have a question about the statistical significance. For example, in some information theory based studies, we use a shuffled surrogates method for statistical significance of a given link, which would differ from a p-value in a correlation analysis. If the method for identifying a statistically significant link varies at all between methods, this also needs to be apparent.
  
  Currently, we rely on a Student’s t-test for CCF, CCM, and PCMCI-ParCorr to test the obtained dependencies or the mean correlation value in the case of CCM. PCMCI-ParCorr takes into account the larger degrees of freedom. PCMCI-CMI relies on a non-parametric test involving random local permutations (Runge, 2018). We agree on the importance of specifying the test as it controls the outcome of the causal inference methods. In the revised manuscript, we clarify the description of the methods to better evidence the applied statistical tests.
  
  As far as we would expect, the random shuffling surrogate test is comparable to a Student t-test for a bivariate normal distribution. Random shuffling is consistent with the principle that the data consists of independent draws from a fixed probability distribution. If the latter is normal, that is white noise, then, it is in phase with the null hypothesis behind a Student’s t-test. Of course, we most likely do not compare two white noise signals, unless the causal inference method is multivariate and completely captures the deterministic signals. In the literature, it is common to rely on more constrained tests or randomization (e.g., Schreiber, 2000). We do not wish to expand on the appropriate test for causal inference as we believe it is a complex and open issue. We want to convey the main point that a test or p-value threshold is a way for the practitioner to control the number of links being output and limit the focus on the fewer, more substantial results.
  
  2.3 Major Comment 3
  
  For the synthetic case, I see you used a lot longer dataset than you have available for the real karst case study. This could lead to the better performance for the higher dimensional or multivariate methods – it might be useful to test the synthetic case for a much smaller dataset and observe or confirm whether these methods start to lose their detection of links. This could better show that the CMI/ParCorr types of methods do have better performance, but only when given a lot of data. I think the differences between the methods might make them inherently difficult to compare, but these are some things that could improve the attempt.
  
  Indeed, the virtual experiment is a three-variable case that spans 365 days, while the real study case is an 11 (All) or 9 (P1, P2, P3) variable case over a restricted time domain of 48 (All), and 184 (P1), 62 (P2), and 218 (P3) timestamps. As also stated in response to your first comment, we fully agree that the sample size (together with the dimensionality) is an element impacting the robustness and performance of the method. In the revised discussion, the second point (see general response) aims at covering this point. Nevertheless, as stated in our general response, we decided not to pursue the suggestion to include sample size as a variable in our virtual experiment.
  
  Our motivations are the following. First, our study remains a comparative study, and we think that such a focus on the PCMCI-CMI method and this complex problem would rather deserve a separate issue. Also, the conclusion of such an extended virtual experiment is reasonably known a priori, and as you have suggested: the results become more robust with increasing sample length (or decreasing number of variables). We see no reason why a non-trivial conclusion, such as recommendations of sample length as a function of the number of variables, would be transposable to a problem with different characteristics, such as different noise levels, model coupling patterns, signal behavior, or representative scales. In addition to the sample size and dimensionality, an adequate estimation of CMI also depends on the nature of the CMI dependencies, smooth or not smooth as it could be expected in systems with highly dynamic connectivity, as well as the magnitude and the dynamic traits of noise. This CMI dependency and noise vary across spatial and time scales. The results also depend on the methods, for instance, kernel-based or nearest neighbors’ estimators and their hyperparameters.
  
  Our point, with the synthetic studies, was to show the divergence of the methods on the same – simplistic - case study, not as an answer to the question “what should we do”, but rather as an exploration of the behavior of the tested methodology in a case where we can give meaningful interpretations of the results. For each problem on which those methods are used, we consider that a good strategy to be highlighted in the perspectives and recommendation section would be to test the issues met and the insights gained by using fit-for-purpose models mimicking the property of signals they want to study.
  
  2.4 Major Comment 4
  
  Finally, there are several places with strange phrasing, or where a term is introduced before it is defined, so there is momentary confusion on whether a reference is missing or the sentence is relevant. I am highlighting some of these that I noticed in the minor line-by-line comments below.
  
  We apologize for these writing problems and thank the reviewer for pointing out some of them. We will make sure to correct them and recheck the whole manuscript carefully.
  
  3. Minor Comments
  
  3.1 Minor Comment 1
  
  Line 7: “appears unstable” relates to my comment on data length…I think the instability is at least partially due to a very small dataset. Either way, it is not very clear what this term means within the abstract.
  
  We agree with your interpretation. We hope to have answered the questions related to instability in our responses to comments 1 and 3. The revised abstract mentions how unstable the results are, i.e., that they provide run-dependent variable results.
  
  3.2 Minor Comment 2
  
  Line 15: “between variables from variables” did not make sense to me
  
  We modified “…interactions between variables from time-series only”.
  
  3.4 Minor Comment 4
  
  Line 35-45: This paragraph seemed scattered, and I did not come out of it with a clear understanding of “effective connectivity” in particular. Suggest to revise
  
  For more clarity, we propose a reordering of the paragraph and some edits: “[…] The functional one is dynamic and is retrieved from statistical time-dependencies between local hydrological variables. Functional connectivity is a matter of cross-predictability and reflects dynamic links between the variables. These dynamic links are potential connections subject to confounding factors, i.e., they may or may not be related to a flow process between variables. Effective connectivity precisely refers to actual connections linked through hydrological processes and flows. Since CIMs with a multivariate framework address confounding factors, they offer the promise of discriminating functional connectivity from the effective one. From the structural to the effective connectivity through the functional one, the search for hydrological connections can be seen as a progressive limitation of the possibilities, from the potential paths to the actual paths taken by water".
  
  3.5 Minor Comment 5
  
  Line 41: “process-based water flows” – I’m not sure if there are non-process-based flows of water?
  
  Corrected. See the above response to minor comment 4.
  
  3.6 Minor Comment 6
  
  Line 45: “progressive constraint” was not clear to me
  
  Also edited in response to Minor comment 4.
  
  3.7 Minor Comment 7
  
  Line 50: “heterogeneity” instead of “hiddenness”?
  
  We meant both. Edited to “hidden heterogeneity”.
  
  3.8 Minor Comment 8
  
  Line 60: I’m not sure about the sentence “nonlinearity is imputed to nonlinear hydrological processes”, seems redundant
  
  L60: "Nonlinearity is imputed to inherently nonlinear hydrological processes such as power laws or threshold effects triggering flows."
  
  Indeed. To clarify what we meant, we propose: “Nonlinearity could be the result of heterogeneities or could be imputed hydrological processes themselves, often mathematically described as nonlinear with power laws or threshold effects triggering flows”.
  
  3.9 Minor Comment 9
  
  Line 65: would be good to re-define CCF here
  
  Corrected
  
  3.10 Minor Comment 10
  
  Line 74: What do you mean by “to appreciate the results” – to compare with the results, or validate them?
  
  L74:76: “To appreciate the results, previous dye tracing tests have revealed fast connected preferential flow between the surface and a particular spot in the cave (Poulain et al., 2018). This prior knowledge can be seen as a reality check on the blind CIMs.”
  
  We meant to say "compare and validate", but talking about validation is probably an exaggeration. It is more like an agreement with our (one of our) expectations. We propose to be more explicit, we propose: “In terms of expected results, previous dye tracing […] can be seen as a partial reality check”.
  
  3.11 Minor Comment 11
  
  Line 85: “Being multivariate” – I’m not sure that a multivariate approach inherently deals with confounding effects. For example, a multiple linear regression is multivariate, but does not do any type of conditioning on confounding variables…
  
  L85: Being multivariate, those methods can cope with confounding variables.
  
  We are not sure to understand your point. What we mean here is that, as several causes can be entered in the test, it can distinguish - if the problem is well posed and fulfills the condition of causal sufficiency - between direct and indirect causation. A multilinear regression, such as Granger causality, can do the same. In that sense, we consider that it can cope with confounding variables.
  
  3.12 Minor Comment 12
  
  Line 93: PC and MCI are brought up, and then defined later – would be better to re-arrange such that we are not wondering what they are.
  
  Accepted. PC and MCI can be introduced earlier at L85.
  
  3.13 Minor Comment 13
  
  Line 99: “not preselection” to “no preselection”?
  
  Corrected.
  
  3.14 Minor Comment 14
  
  Line 103: reference for causal sufficiency? In general, this is a good point for any analysis, you particularly reference it for CMI, could state that this hypothesis really underlies all your methods…
  
  L100-104: “While the CMI method is the most promising in that it does not assume linearity and accounts for confounding effects, as for the other CIMs, the reliability of the reported causal relationships nevertheless depends on underlying hypotheses (discussed in Runge, 2018a). Perhaps, the most important but the most challenging to verify and conceptualize in practice is the hypothesis of causal sufficiency. Causal sufficiency implies that the analysis should include all potential common causes.”
  
  The reference is the same as in the previous sentence. We repeat it and now also refer to Runge et al. (2019). We agree with your point and revise accordingly: “Perhaps, the most important is the hypothesis of causal sufficiency because it underlies all CIMs (Runge 2018a, Runge et al. 2019). Causal sufficiency implies that the analysis should include all potential common causes. It is, however, challenging to verify and conceptualize in practice."
  
  3.16 Minor Comment 16
  
  Line 121: “overall good performance of this value” is vague, do you mean for the synthetic study, or the real study, or in general?
  
  We now mention that the overall good performance of this value is relative to our dataset. Still, from experience and based on the literature, the value of m equal to 2 is common. With m=2, CCM relies on actual trajectory segments rather than unique points (m=1). On average, the gain of performance can be thought of as the algorithm knowing in which direction the dynamic is going (e.g., lower discharge or higher discharge), hence a better state estimates (e.g., recession or rainy period), and mapping performance with the other variable. Usually, there is no significant gain of performance while passing from m=2 to m=3 (if gain there is), and one would hold with m=2 for parsimony.
  
  3.17 Minor Comment 17
  
  Line 136: I don’t think you have introduced Granger Causality
  
  The revision of the method section now correctly introduces Granger's pioneering work and these differences with the PCMCI-ParCorr method. For the preprint version, Granger was further explained in the supplementary materials.
  
  3.18 Minor Comment 18
  
  Line 153: Is two weeks of computation for a single processor? I figure it would take different amounts of time depending on whether you used a laptop or a server, etc, so could make this more clear.
  
  Corrected. Two weeks parallelized on 6-cores intel i7 9th gen laptop.
  
  3.19 Minor Comment 19
  
  Line 176: It seems like for 2014-2017 time-series, there would be more than 465 time steps, implying the presence of gaps. This also comes into play in terms of your total data length for the multivariate methods. Basically, the counts in Table 2 make it seem like there is more data available than there actually is, when you start comparing multiple datasets (with 48 data points being the total overlap).
  
  That is the point we are making in the last paragraph of the data section and the reason why we highlighted the restricted time-domain of the overlap in Figure 1.
  
  3.20 Minor Comment 20
  
  Line 204: “problematic case of the common cause” – after this, you define what this means, but as it is, the phrase seems a little mysterious, like a Sherlock Holmes story.
  
  Rephrased into: “the common cause problem”.
  
  3.21 Minor Comment 21
  
  Line 210: haven’t defined Qb’ yet
  
  Qb is mentioned as the discharge of B earlier in L208. Yet, we modified to introduce Qb in L206 “In this case, two reservoirs, A and B, and their discharge Q_A, Q_B are subject to the same forcing …”
  
  3.22 Minor Comment 22
  
  Line 230: This is a “synthetic study” but the years make it seem like you are using actual data from your real study?
  
  Indeed, see L214. We propose to remind it in the caption of Figure 2 to make it more apparent.
  
  3.23 Minor Comment 23
  
  Line 234: What is a differenced dataset? This comes up a few times and I’m not completely sure what it is…whether it is the increment or something done in the modeling process.
  
  The first-order difference is the time-series obtained by subtracting the values at the previous time step, i.e., Y_{t}-Y_{t-1}. This is now mentioned explicitly with a better motivation of the underlying reasons, as requested by other reviewers as well.
  
  3.24 Minor Comment 24
  
  Line 270: “If causality is hard to infer…” is not a great sentence, excusing a complicated figure and telling us it actually makes sense. You could just remove this.
  
  We do as suggested.
  
  3.25 Minor Comment 25
  
  Line 297: “is be removed”
  
  Corrected.
  
  3.26 Minor Comment 26
  
  Line 338: “evanescent singularity” is unclear
  
  This sentence is removed from our revised discussion.
  
  4. Cited references
  
  Runge, J.: Causal network reconstruction from time series: From theoretical assumptions to practical estimation, Chaos, 28, 075310, https://doi.org/10.1063/1.5025050, 2018.
  
  Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., Nes, E. H. van, Peters, J., Quax, R., Reichstein, M., Scheffer, M., Schölkopf, B., Spirtes, P., Sugihara, G., Sun, J., Zhang, K., and Zscheischler, J.: Inferring causation from time series in Earth system sciences, 10, 2553, https://doi.org/10.1038/s41467-019-10105-3, 2019.
  
  Schreiber, T. and Schmitz, A.: Surrogate time series, Physica D: Nonlinear Phenomena, 142, 346–382, https://doi.org/10.1016/S0167-2789(00)00043-9, 2000.
  
  Citation: https://doi.org/10.5194/hess-2021-445-AC2
RC4:
'Comment on hess-2021-445', Anonymous Referee #3, 07 Oct 2021
The authors compared four causal inference methods in a hydrological issue, which is probably the first method-comparison study on revealing causality in hydrology. The following four methods are used: cross-correlation, convergent cross mapping, partial correlation-based PCMCI, and conditional mutual information-based PCMCI. However, there are two main issues that concern me (as described below). Therefore, I recommend a major revision.

Comparison between CCM and CMI-based PCMCI. The current comparison based on noisy data is unfair to CCM, because CCM is more suitable for deterministic dynamics and does not work well in a stochastic system. Also, the authors used different levels of noises in the synthetic study. Still, only the averaged results are reported in Figure 3, and the noise impact on the performances of the four methods remains unknown. Therefore, I suggest, at least in the synthetic case study, performing the comparison based on a noise-free/deterministic system and a thorough evaluation of the noise impact.

Limited data points for computing CMI. In the real case study, the inferred causality from CMI-based PCMCI is much less trustable, given only 465 datapoints of 7 variables (what is the maximum allowed number of conditioned variables set in PCMCI by the way?). In fact, it is somehow expected that the CMI-based PCMCI does not work well using this limited dataset (even for a three-dimensional CMI estimation, several hundred data points might not be sufficient). Although the authors acknowledged this limitation, I strongly recommend a corresponding synthetic study to evaluate the impact of dataset size and the number of variables in CMI-based PCMCI, which is very critical to guide the current and future causality analysis in earth science inferred by the PCMCI algorithm.

Other minor revisions/comments:

Lines 67 and 68: Please spell out ParCorr and CMI.

Line 144: “entropy transfer” –> “transfer entropy”

Lines 147 and 148: “The nearest-neighbor estimator is recommended for time-series below 1000 samples” ... under what dimensionality?

Line 238: Q_b -> Q_B

Line 356: “constraint causal inference” –> “constrain causal inference”
Citation: https://doi.org/10.5194/hess-2021-445-RC4
- AC3:
  'Reply on RC4', Damien Delforge, 28 Nov 2021
  1. General Response
  Dear reviewer, we would like to thank you for your attention to the preprint manuscript and the helpful comments you made. We understood that your major concerns are related to the virtual experiment about (1) the fair comparison of CCM when it comes to noisy signals and (2) the robustness and performance of the PCMCI-CMI method under limited data points. For both points, as explained here below, we do not think that extending our synthetic case study can reply to the very valid points you raised. We rather choose to better explain the results we get and the limits of the conclusions that can be drawn out of them. To this end, we clarify our point of view in a revised discussion section, which we hope will meet your concerns and that of the readers.
  We also improve references to the existing literature on these concerns within a revised discussion structured as follows:
  a summary and appreciation of our results (improved for CCM as well);
  
  a particular focus on the robust estimation of Conditional Mutual Information (CMI) concerning missing values, record length, dimensionality, the nature of the dependencies, or noise;
  
  and practical recommendations for the uses of causal inference methods and future research perspectives.
  
  We hope you will find below satisfactory answers to your observations.
  Thank you again for your contribution to this discussion,
  On behalf of co-authors,
  Damien Delforge
  2. Major Comments
  2.1 Major Comment 1
  
  Comparison between CCM and CMI-based PCMCI. The current comparison based on noisy data is unfair to CCM, because CCM is more suitable for deterministic dynamics and does not work well in a stochastic system. Also, the authors used different levels of noises in the synthetic study. Still, only the averaged results are reported in Figure 3, and the noise impact on the performances of the four methods remains unknown. Therefore, I suggest, at least in the synthetic case study, performing the comparison based on a noise-free/deterministic system and a thorough evaluation of the noise impact.
  
  Because CCM is based on the theory of nonlinear deterministic systems (or chaos theory), it is generally considered a tool for deterministic signals. Therefore, CCM is also considered as a method that would not work well for a stochastic system, which is your legitimate point and potentially the one shared by many readers familiar and unfamiliar with the CCM method. Despite this widely held view, we consider that noise or stochasticity should not be a significant issue in our case and, thus, that the problem should not be emphasized in the manuscript. Noise is not the main reason why CCM could be deceiving. In our experiment, we face many False Positive and believe that the False Positive Rate should be the major concern of hydrologists, with confounding being the primary reason for False Positive misclassifications.
  
  We consider that CCM's inability to deal with confounding is the critical message to illustrate and convey, as this limit is not apparent in the current literature. Because of this, researchers may have high and misguided expectations about the capabilities of CCM (e.g., see Sugihara, 2017), just as us when we first turned to this method. These expectations are all the more pronounced since the CCM method refers directly to the concept of causality. This is not the case for the correlation or the cross-correlation function. Researchers use them implicitly to make causal inferences while being aware of the limit, as the maxim "correlation is not causality" says. Many papers on CCM, including the original (Sugihara et al. 2012) or some in hydrology (e.g., Ombadi et al., 2020), refer to this maxim to motivate the use of more advanced methods such as CCM. Yet, when one types the maxim into an internet search engine, most of the illustrations refer precisely to the problem of confounding. Therefore, it seems crucial to dispel the myth that CCM could do better than correlation or CCF by escaping the confounding problems. Our interpretation of our results is that CCM differs from CCF and correlation in its ability to reveal nonlinear associations, as well as linear associations, not in any capability of solving confounding problems because it is a bivariate method. Since most hydrological signals already exhibit linear correlations, we believe that the usefulness of CCM for causal inference is quite limited in hydrology. However, the method is interesting for studying the dynamic and nonlinear properties of signals (e.g., see Delforge et al., 2020).
  
  In conclusion, we understand that you have not made any remark relative to the main message we are trying to convey, and we consider your remark about the noise relevant. Yet, we have chosen not to modify the manuscript to introduce the noise issues because 1) we consider its implications to be secondary in our case, (2) exposing the issue could outshine the main message concerning CCM, and (3) we also need to discuss three other methods in our comparative study.
  
  In response to your comment, you will still find below specific elements answering your concerns on the effect of noise, together with other elements motivating our opinion regarding CCM.
  
  ---
  
  About the effect of noise in our virtual experiment
  
  Regarding the virtual experiment, Figure 3 (in the preprint) displays the average dependencies, as you said, but also the interquartile range. Hence, Fig. 3 shows some dispersion possibly imputed to noise or model configurations. To illustrate further the effect of noise, the Figure below reports the CCM results for the different noise level parameters ε_lvl for the case where Q_A and Q_B are disconnected. Error bars are the remaining interquartile ranges. In this case, the figure shows that injecting noise decreases the mapping skills. Beware that we do not consider that noise decreasing the mapping skills would be systematic, for instance, if we inject auto-correlated noise and have relatively short time-series. However, in our case, we used a Theiler window (Theiler, 1986) tw of 10 days, which should limit the potentiality that CCM performance is related to the autocorrelation (See supplementary materials for details).
  
  If we define a case without noise, as you suggest, the result would still be deceiving by reporting a significant link between Q_A and Q_B, most likely above the case with the minimal noise in the Figure (ε_lvl=0.05). When noise is injected, CCM improves its causality detection skills for the wrong reason. CCM mapping skills are simply reduced because of noise. Similar conclusions are reported in the literature. Reviewer 1 pointed to the study of Ombadi et al. 2020, which also shows that CCM has a high False Positive Rate based on another virtual experiment and that the False Positive Rate, alongside the True Positive Rate, decreases when noise is injected. Accordingly, without adding this additional figure to the manuscript, we propose a short modification to mention that we observed that noise reduced the mapping skills, as in Ombadi et al. (2020).
  
  Still, even if the mapping skills are noise-sensitive, CCM will always attempt to report such a link because we concluded that CCM could not deal with confounding given its bivariate nature. Hence, as long as the virtual experiment tends to suggest a dependency between Q_A and Q_B at lag 1, we consider that our application is fair and not compromised by noise. More generally, we consider that CCM could be applied to noisy systems with awareness of its limits and the necessary cautions (e.g., a Theiler window). After all, CCM relies on algorithms developed to be more robust to short and noisy time series (Sugihara and May, 1990; Sugihara, 1994). If convergence there is, CCM succeeds in revealing an association, either linear or not. If not, either there is no association, or it is drowned in the noise.
  
  Notes about CCM, synchrony, and confounding
  
  Initially, Sugihara et al. (2012) do not refer to confounding but to the concept of synchrony, such as the famous synchronization discovered by Huygens for its pendulum clocks. Even if there may be substantial differences between the concepts, we consider that synchrony refers to confounding as synchrony could occur in dynamical systems subject to strong forcing (See the SM of Sugihara 2012 for a discussion on synchrony, and Sugihara et al. 2017). Since we fall in this case, we do as suggested in Ye et al. (2015) and varied the predictive horizon to infer causality from the principle of priority when synchrony occurs. Yet, in this case, we show that it is still deceiving, reinforcing our point that CCM cannot deal with confounding and the common cause problem. To make sense of our results, we had to conclude that if both Q_A and Q_B are deterministically mapping to P_eff, it would seem logical that a deterministic map exists between the two, just as when one rearranges deterministic equations, and that they present some isomorphism in their reconstructed attractor (what is tested with CCM). Referring to Sugihara et al. (2012), "time-series variables are causally linked if they are from the same dynamic system that is, they share a common attractor manifold M. This means that each variable can identify the state of the other". As a matter of hydrological connectivity, given reasonably low noise, it would always be possible to identify the state of a hydrological reservoir from another parallel, yet disconnected reservoir, as two buckets sitting next to each other.
  
  2.2 Major Comment 2
  
  Limited data points for computing CMI. In the real case study, the inferred causality from CMI-based PCMCI is much less trustable, given only 465 datapoints of 7 variables (what is the maximum allowed number of conditioned variables set in PCMCI by the way?). In fact, it is somehow expected that the CMI-based PCMCI does not work well using this limited dataset (even for a three-dimensional CMI estimation, several hundred data points might not be sufficient). Although the authors acknowledged this limitation, I strongly recommend a corresponding synthetic study to evaluate the impact of dataset size and the number of variables in CMI-based PCMCI, which is very critical to guide the current and future causality analysis in earth science inferred by the PCMCI algorithm.
  
  Your present concern meets those of the other reviewers as well, in particular, reviewer 2. For clarification, the virtual experiment is a three-variable case that spans 365 days, while the real study case is an 11 (All) or 9 (P1, P2, P3) variable case over a restricted time domain of 48 (All), and 184 (P1), 62 (P2), and 218 (P3) timestamps. This time-domain is very short because the PCMCI algorithm dismisses all timestamps where the time-series and their lags up to 2d_max exhibits at least one missing value. Therefore, missing values are a problem, mainly when they are unevenly distributed over time, because the analysis falls down to a very small sample size compared to the length of the individual time series. There is no maximum number of conditioned variables. This number would depend on the parameter α_PC and the preselection step with the PC algorithm (see supplementary materials for details). Our philosophy (better exposed in the revised manuscript) is to use the PCMCI without adding constraints on which variable is supposed to influence another or not. Hence, this attitude is very empirical as what we are testing is the ability of causality methods, such as the PCCMI, to screen and detect the existing links between the datasets. We assume that many PCMCI users will also have the same initial mindset of letting the method work by itself. Our revised discussion points out the resulting problems, as you do in your comment, and evaluate that mindset.
  
  We share your view that the small sample size is problematic and relative to the number of variables. Our strategy was to use the CMI estimators and test that is the most forgiving (Runge, 2018). The nearest-neighbor estimator and the shuffle test of Runge (2018) are better suited than kernel-based approaches for short record length (< 1000), based on numerical experiments covering sample sizes from 50 to 2,000 and dimensions up to 10. Yet, despite the use of method recommended for small records, the real study case of the manuscript is concerned by the pitfall of estimating CMI with short record length and high dimensions resulting in non-robust test results, as you also interpret. We showed that the robustness could be increased by performing an ensemble of tests. Of course, the robustness of a numerical result is one issue; its reliability in terms of connectivity is another.
  
  So, we agree with you that this is a critical point to discuss. However, extending toy model-based sensitivity analysis will not provide sufficiently robust results to guide the scientific community properly for other applications. The reasons go beyond that. First, our study remains a comparative study. Such a focus on the PCMCI-CMI method would rather deserve a separate issue. Also, the conclusion of such an extended virtual experiment is reasonably known a priori: the results become more robust with increasing sample length or decreasing number of variables. We see no reason why a non-trivial conclusion, such as recommendations of sample length as a function of the number of variables, would be transposable to a problem with different characteristics, such as different noise levels, the number of variables, the maximum delay d_max, different model coupling patterns, signal behavior, or representative scales. In addition to the sample size and dimensionality, the CMI also depends on the nature of the CMI dependencies, smooth or not smooth as it could be expected in systems with highly dynamic connectivity, and the magnitude and the characteristic of noise. This CMI dependency and noise vary across spatial and time scales. The results also depend on the methods, for instance, kernel-based or nearest neighbors estimators and their hyperparameters. Altogether, we believe that proper recommendations and fit to one's specific case would require an extended and computationally intensive sensitivity analysis with another model, which is not the scope of our study.
  
  Our point, with the synthetic studies, was to show the divergence of the methods on the same – simplistic - case study, not as an answer to the question "what should we do", but rather as an exploration of the behavior of the tested methodology in a case where we can give meaningful interpretations of the results. As a recommendation for each problem on which those methods are used, we consider that a good strategy would be to test the issues met and the insights gained by using fit-for-purpose models mimicking the property of signals they want to study. This is suggested in the revised perspective.
  
  Another revision is to discuss the question "should we condition on all variables?". As mentioned above, with missing values, the small size of the time domain is because we test and condition each variable with respect to each variable. This is why focusing on P1, P2, P3 only increases the size of the cross-domain while removing the possibility that P1, P2, P3 influence each other. Perhaps, when it comes to testing the connectivity between two points A or B within a system, conditioning on their past and the common driver, e.g.., evaporation and rainfall or effective precipitation alone, is a dimensionally adequate and contained representation of the problem, on which CMI can be more reliably estimated.
  
  3. Minor Comments
  
  3.1 Minor Comment 1
  
  Lines 67 and 68: Please spell out ParCorr and CMI.
  
  Corrected
  
  3.2 Minor Comment 2
  
  Line 144: "entropy transfer" –> "transfer entropy"
  
  Corrected
  
  3.3 Minor Comment 3
  
  Lines 147 and 148: "The nearest-neighbor estimator is recommended for time-series below 1000 samples"... under what dimensionality?
  
  As mentioned in our response to your major comment 2, this heuristic value is based on Runge, 2018 for a varying dimensionality up to 10. This is clarified in the revised manuscript.
  
  3.4 Minor Comment 4
  
  Line 238: Qb -> QB
  
  Corrected. B has been capitalized.
  
  3.5 Minor Comment 5
  
  Line 356: "constraint causal inference" –> "constrain causal inference"
  
  Corrected.
  
  4. Cited References
  
  Theiler, J.: Spurious dimension from correlation algorithms applied to limited time-series data, Phys Rev A Gen Phys, 34, 2427–2432, 1986.
  
  Sugihara, G. and May, R. M.: Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series, 344, 734–741, https://doi.org/10.1038/344734a0, 1990.
  
  Sugihara, G.: Nonlinear forecasting for the classification of natural time series, Phil. Trans. R. Soc. Lond. A, 348, 477–495, https://doi.org/10.1098/rsta.1994.0106, 1994.
  
  Ye, H., Deyle, E. R., Gilarranz, L. J., and Sugihara, G.: Distinguishing time-delayed causal interactions using convergent cross mapping, 5, 14750, https://doi.org/10.1038/srep14750, 2015.
  
  Sugihara, G., Deyle, E. R., and Ye, H.: Reply to Baskerville and Cobey: Misconceptions about causation with synchrony and seasonal drivers, PNAS, 114, E2272–E2274, https://doi.org/10.1073/pnas.1700998114, 2017.
  
  Runge, J.: Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information, in: International Conference on Artificial Intelligence and Statistics, International Conference on Artificial Intelligence and Statistics, 938–947, 2018.
  
  Delforge, D., Muñoz‐Carpena, R., Van Camp, M., and Vanclooster, M.: A Parsimonious Empirical Approach to Streamflow Recession Analysis and Forecasting, 56, e2019WR025771, https://doi.org/10.1029/2019WR025771, 2020.
  
  Ombadi, M., Nguyen, P., Sorooshian, S., and Hsu, K.: Evaluation of Methods for Causal Discovery in Hydrometeorological Systems, 56, e2020WR027251, https://doi.org/10.1029/2020WR027251, 2020.
  
  Citation: https://doi.org/10.5194/hess-2021-445-AC3
RC5:
'Comment on hess-2021-445', Anonymous Referee #4, 27 Oct 2021

I reviewed with great interest the manuscript entitled “Detecting hydrological connectivity using causal inference from time-series: synthetic and real karstic study cases”. In their work, the authors explore four methods for detecting dependencies in two sets of time series, covering linear/nonlinear and bivariate/multivariate relationships among variables. Using a synthetic set of flows from two reservoirs, they measure statistics of dependency over a lag, finding critical differences among the methods. They capture further differences among the methods when applied to real data, finding that the CMI method is the most accurate for the synthetic case study but was unstable for the real data. The authors emphasized that the methods detect dependencies within the bounds of their parameterizations and highlight some exciting future avenues of research. Overall, this work was well presented and the analysis thorough such that I think it would benefit other researchers exploring these methods for application to their data. I have minor comments and suggestions to help improve the work, which I detail below. I recommend publication of this work after minor revision.

General comments:

I was confused about the difference between the original dataset and the differenced dataset when comparing the outcomes of the methods. I am not sure what was differenced to produce such different results within the method. Further clarification on this would be great.

When you find contemporaneous links, could this be due to shorter term processes that could be resolved with a shorter time step? So an instant link means that the data are already synchronized and presumably there exists a measurable scale that could capture the actual lag of that information flow. Is that possible for the karstic data?

Discussion section: The discussion section could be improved by highlighting the results in terms of the connectivities described in the introduction, especially for the real case study. Also, greater connections between these results and previous studies on CIMs would add better context to the contributions of this study.

Specific comments:

L66: You state the abbreviation of the method before stating the actual name. Please correct.

L70: The phrase “obtained from” is repeated twice.

L99: Change “not” to “no”

Figure 1 caption: It is unclear what the red areas are showing in the figure based on the description. Is it the overlapping timespans for all data? Or just the portion that can be analyzed using a 5-day lag? Please clarify.

L210: What is QB’?

L218: Move the phrase “with R either A or B” earlier, when you first introduce HR.

L225, that paragraph: What is the length of the dataset? How did you set your length to ensure sufficient data for applying the CIMs?

L233: Formatting issue for variable HAB.

L238, that paragraph: It would help to reference specific parts of Figure 3 in the paragraph

L271: Are the patterns shown in the figure or just stated here? It is difficult to know which relationships you are showing. I am also confused by what you mean by the time dependencies flipping as you can’t have a negative delay? Please clarify.

L297: Phrase “P2 is be removed” is awkward. Please revise.

L317: You state the instability of the CMI may be due to the interdependence of the ERT data. Could this be considered a strength? Since this means it can detect that these data were already inter-related and therefore do not function well as independent nodes in the network?

L329: Change “compare” to “compared”

Supporting information: There are some minor spelling mistakes in the document.

Figure SM2 caption: Are the descriptions for the symbols switched?

Citation: https://doi.org/10.5194/hess-2021-445-RC5
- AC4:
  'Reply on RC5', Damien Delforge, 28 Nov 2021
  1. General Response
  Dear reviewer, we would like to thank you for your attention to the preprint manuscript and the helpful comments you made. Below are detailed responses to each of your major and minor comments. We hope they will answer your questions and expectations.
  To improve the preprint based on the comments made by all the reviewers, our revised manuscript contains minor revisions throughout the document, a revision of the methodology section where some details of supplementary materials about the methods and their tests are included to make it more clear and self-explanatory, and a rewriting of the discussion section. Regarding the discussion, based on your remarks and those of the other reviewers, we will rewrite the section with improved references and comparisons to the literature on the following topics:
  a summary and appreciation of our results (in terms of type of connectivity as well);
  
  a particular focus on the estimation of Conditional Mutual Information (CMI) concerning missing values, record length, dimensionality, the nature of the dependencies, or noise;
  
  and practical recommendations for the uses of causal inference methods and future research perspectives.
  
  Concerning point 2 and the virtual experiment, the other reviewers encourage us to extend the virtual experiment to study the effect of the sample size and/or the number of variables. This was not your request. For your information, we have decided not to comply with this request for multiple reasons that will be reflected in the revised discussion (point 2), as any reader may have the same concerns. You will find the motivations in our answers to the other reviewers.
  Thank you again for your contribution to this discussion,
  Damien Delforge
  2. Major Comments
  2.1 Major Comment 1
  
  I was confused about the difference between the original dataset and the differenced dataset when comparing the outcomes of the methods. I am not sure what was differenced to produce such different results within the method. Further clarification on this would be great.
  
  Other reviewers have also asked for clarification. The first-order difference is the time-series obtained by subtracting the values at the previous time step, i.e., Y_t-Y_(t-1). The differenced dataset is the first-order difference of the time series of the original dataset. This is now mentioned explicitly with a better motivation of the underlying reasons, as requested by other reviewers as well.
  
  2.2 Major Comment 2
  
  When you find contemporaneous links, could this be due to shorter term processes that could be resolved with a shorter time step? So an instant link means that the data are already synchronized and presumably there exists a measurable scale that could capture the actual lag of that information flow. Is that possible for the karstic data?
  
  Indeed, contemporaneous relationships can be interpreted as a lack of temporal resolution. In our case, the ERT dataset is daily and we cannot get a finer time-resolution. Pursuing a finer scale is an option, and we briefly mention it in the perspective of the revised manuscript. However, one can expect cases of synchronization at all scales. Moreover, assuming that one wants to keep the same time horizon with the maximum delay d_max, finer resolution increases the dimension of the causal analysis. This is for these two reasons that we mention perspectives about causal inference method working on the frequency-domain to cover broader time-horizon (perspective 1, L365) and those aiming at solving contemporaneous dependencies (perspective 3, L375).
  
  2.3 Major Comment 3
  
  Discussion section: The discussion section could be improved by highlighting the results in terms of the connectivities described in the introduction, especially for the real case study. Also, greater connections between these results and previous studies on CIMs would add better context to the contributions of this study.
  
  The revised manuscript includes a discussion better addressing the elements highlighted by the reviewers (see general response). In the introduction, the definition of effective connectivity is improved, and the first point of the discussion on the appreciation of the result now focuses on the question: "can we interpret our results in terms of effective connections?" For bivariate methods, the answer is no, because by definition they measure direct statistical dependencies and are to be associated with the definition of functional connectivity. We insist on this point for CCM because, contrary to cross-correlation which is implicitly used to make causal inference, CCM is explicitly put forward as a causal inference method. Bivariate methods remain interesting because multivariate methods provide results whose accuracy is difficult to judge. Insofar as we have identified a series of problems with multivariate methods, even if by definition they can reveal effective connections, we think it is more prudent to consider them as potential connections and thus as attached to functional connectivity. Regarding their use in general, these methods are one of many tools for exploration and analysis in science and are best combined with other tools given the uncertainties.
  
  3. Minor Comments
  
  3.1 Minor Comment 1
  
  L66: You state the abbreviation of the method before stating the actual name. Please correct.
  
  Corrected.
  
  3.2 Minor Comment 2
  
  L70: The phrase “obtained from” is repeated twice.
  
  Corrected.
  
  3.3 Minor Comment 3
  
  L99: Change “not” to “no”
  
  Corrected
  
  3.4 Minor Comment 4
  
  Figure 1 caption: It is unclear what the red areas are showing in the figure based on the description. Is it the overlapping timespans for all data? Or just the portion that can be analyzed using a 5-day lag? Please clarify.
  
  Clarified in the manuscript and caption. When P1, P2, P3 are included individually, the red areas show the overlapping timespans for all data and their lags up to two d_max. Concurrently, this is the portion analyzed using a d_max lag of 5 days.
  
  3.5 Minor Comment 5
  
  L210: What is QB’?
  
  L210: “For comparison, we consider a case where QA and QB’ are effectively connected as if they were contributing to the same drainage network, with QA upstream of QB’ ”
  
  Modified: […] we consider another case where QA is effectively connected to an adapted variable QB’ […]
  
  3.6 Minor Comment 6
  
  L218: Move the phrase “with R either A or B” earlier, when you first introduce HR.
  
  Corrected
  
  3.7 Minor Comment 7
  
  L225, that paragraph: What is the length of the dataset? How did you set your length to ensure sufficient data for applying the CIMs?
  
  The length is 365 days (L230). We did not ensure this but did not notice any particular unstable behavior, the performance was satisfactory and fully detailed in Table 2. Possibly, we would have obtained better results with a longer dataset. In the revised perspective and recommendation, we recommend using a virtual case mimicking the signal properties to answer this type of case-specific question.
  
  3.8 Minor Comment 8
  
  L233: Formatting issue for variable HAB.
  
  Corrected.
  
  3.9 Minor Comment 9
  
  L238, that paragraph: It would help to reference specific parts of Figure 3 in the paragraph
  
  The specific parts are now referenced in the main text as well.
  
  3.10 Minor Comment 10
  
  L271: Are the patterns shown in the figure or just stated here? It is difficult to know which relationships you are showing. I am also confused by what you mean by the time dependencies flipping as you can’t have a negative delay? Please clarify.
  
  L271: "A typical pattern is that the sign of time-dependencies tends to flip after a few delays due to RF’s forcing and the fact that dry periods come after the rain."
  
  We now illustrate the pattern we mention with the R5-P2 relationship. At low lag, the relationship is negative as expected between resistivity and drainage, i.e., low resistivity implies high water content and drainage. After a few lag, the relationship becomes positive.
  
  3.11 Minor Comment 11
  
  L297: Phrase “P2 is be removed” is awkward. Please revise.
  
  "be" has been removed.
  
  3.12 Minor Comment 12
  
  L317: You state the instability of the CMI may be due to the interdependence of the ERT data. Could this be considered a strength? Since this means it can detect that these data were already inter-related and therefore do not function well as independent nodes in the network?
  
  We are not sure to understand how this can be a strength. We prefer to keep this as an explanatory hypothesis that could deserve further consideration. Perhaps, beyond the ERT inversion model, the instability also depends on the used clustering method and the number of clusters selected.
  
  3.13 Minor Comment 13
  
  L329: Change “compare” to “compared”
  
  Corrected
  
  3.14 Minor Comment 14
  
  Thank you for pointing the issue. We will make sure to check again the whole supplementary materials carefully.
  
  3.15 Minor Comment 15
  
  The source of confusion is that CCM forecast direction is opposite to the causal direction. We will clarify and make sure that the description matches the legend.
  
  Citation: https://doi.org/10.5194/hess-2021-445-AC4

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (06 Dec 2021) by Nunzio Romano

AR by Damien Delforge on behalf of the Authors (21 Jan 2022) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (21 Jan 2022) by Nunzio Romano

RR by Anonymous Referee #3 (13 Feb 2022)

RR by Anonymous Referee #1 (14 Feb 2022)

ED: Publish subject to minor revisions (review by editor) (24 Feb 2022) by Nunzio Romano

AR by Damien Delforge on behalf of the Authors (05 Mar 2022) Author's response Author's tracked changes Manuscript

ED: Publish as is (19 Mar 2022) by Nunzio Romano

AR by Damien Delforge on behalf of the Authors (27 Mar 2022)

Short summary

Causal inference methods (CIMs) aim at identifying causal links from temporal dependencies found in time-series data. Using both synthetic data and real-time series from a karst system, we study and discuss the potential of four CIMs to reveal hydrological connections between variables in hydrological systems. Despite the ever-present risk of spurious hydrological connections, our results highlight that the nonlinear and multivariate CIM has a substantially lower false-positive rate.

eps_lvl	0.05	0.1	0.15	0.2	0.25	...	0.7
rho_mean	0.96	0.87	0.76	0.66	0.57	...	0.24