the Creative Commons Attribution 4.0 License.
Technical Note: An illustrative introduction to the domain dependence of spatial Principal Component patterns
Abstract. Principal Component Analysis (PCA) of synchronous time series of one variable, e.g. water level or discharge, measured at multiple locations, has been applied in a wide spectrum of hydrological analyses. Principal Components (PCs) were used in regionalisation and to identify dominant modes, signals, processes or other hydrological properties of the analysed system. The possibility that the PCs of such an analysis can exhibit domain dependence (DD) has so far received little recognition in the hydrological PCA literature. DD describes the situation in which the spatial PC patterns are mainly determined by the size and shape of the analysed spatial domain. Domain size means the spatial extent of the analysed data set, domain shape the spatial arrangement of the data set's locations. Thus, instead of the hydrological functioning of the analysed system, the spatial PC patterns rather reflect the functioning of the PCA within the context of the data set's spatial domain. The effect is caused by homogeneous spatial autocorrelation in the analysed series, a common feature in hydrological data sets. DD patterns are distinct, with strong gradients and contrasts, and can come together with substantial accumulation of variance in the leading PCs. In addition, DD can cause effectively degenerate multiplets, i.e. PCs which are not well separable. All these features are highly suggestive and easily lead to wrong hydrological interpretations. Consequently, DD should be considered for any application in which the PCs are used to draw conclusions about spatially distinct properties of the analysed system. DD patterns calculated for the analysed spatial domain can be used as reference to test whether spatial PC patterns differ significantly from pure DD patterns. We present two methods, one stochastic, one analytic, to calculate DD reference patterns for defined spatial correlation properties and arbitrary spatial domains.
With a series of synthetic examples, we explore the DD effect with respect to a) domain shape, b) domain size and spatial correlation length and c) effectively degenerate multiplets. Particular focus is given to the effect of DD on the explained variance of the PCs and the contrasts of their spatial patterns. Finally, considering DD is discussed. Accompanying this technical note, R-scripts to (i) demonstrate and explore the DD effect, and (ii) perform the presented DD reference methods are provided.
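The DD effect described in the abstract can be illustrated with a short sketch. It is not the authors' R implementation. It assumes an exponential correlation function with a hypothetical correlation length `corr_length` and a regular 10 x 10 grid of cells. The function name `dd_reference_patterns` and all parameters are chosen for illustration only. The correlation matrix depends only on the distances between locations, so any structure in the resulting eigenvectors stems from the domain geometry alone.

```python
import numpy as np

def dd_reference_patterns(coords, corr_length, n_pcs=4):
    """Eigendecompose a distance-only correlation matrix for the given locations.

    Assumption: correlation decays as exp(-distance / corr_length).
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    corr = np.exp(-dist / corr_length)

    # eigh returns ascending eigenvalues; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()
    return eigvecs[:, :n_pcs], explained[:n_pcs]

# Hypothetical square domain: 10 x 10 cells with unit spacing.
side = 10
xx, yy = np.meshgrid(np.arange(side), np.arange(side))
coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

patterns, explained = dd_reference_patterns(coords, corr_length=5.0)

# Reshape a pattern to the grid for plotting, e.g. the leading one.
leading_map = patterns[:, 0].reshape(side, side)
```

Because no hydrological signal enters this calculation, the leading pattern has a single sign across the whole domain, and on a square domain one would expect the second and third patterns to share nearly the same eigenvalue. Changing `coords` to another arrangement of locations changes the patterns accordingly.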
Status: final response (author comments only)
RC1: 'Comment on hess-2024-172', Anonymous Referee #1, 29 Sep 2024
Review of Technical Note: An illustrative introduction to the domain dependence of spatial Principal Component patterns by Lehr and Hohenbrink.
This manuscript attempts to extend the study of how analyzing data on variously shaped spatial domains affects the principal component loading patterns. The extension is both in content, as new material is added to the existing literature, and in audience, as the authors hope to reach hydrologists who, by and large, have not been exposed to such a concept. The importance of the work lies in several areas (expanded on below) but the key one is that if the PC loading patterns match those that are expected to arise from the shape of the domain, rather than the covariance fields, the recommendation should be a full stop on continuing. Therefore, understanding domain dependence is a necessary, but not sufficient, condition for physical interpretation of PC loadings.
Let me add that I like this paper and believe it can be a useful addition to the literature, helping analysts to interpret their eigenanalyses. Therefore, I hope the authors view my extensive comments with that in mind. If I come across as opinionated it is because of my lengthy work in this area and if it seems direct, that is my nature. Regardless, I like this manuscript and hope it gets published after further revisions.
Now for the general comments. The paper builds upon the pioneering work of C. Eugene Buell. Those papers are cited. Buell (1979) left the reader with this final thought in the last line of his conclusions, stating that domain dependence must be accounted for when interpreting EOFs: "Otherwise, such interpretations may well be on a scientific level with the observations of children who see castles in the clouds". That is a pretty direct and strong statement. Digging deeper into why that can occur, the manner in which individual EOFs were being analyzed from the 1970s through the 2020s is by inferring physics by visual inspection of the magnitudes and gradients of the EOFs when plotted on maps. There was no external or internal validation of the patterns, only conjecture. With over 50 years of this practice, little attention was paid to whether this was a wise idea and thousands of such EOF studies emerged, with claims of the importance of the magnitudes and shapes of the patterns, many of which looked suspiciously like those patterns Buell generated. However, we should be wiser today and the authors are telling the investigator that if the covariance fields vary across a given domain shape but the same basic Buell patterns emerge, perhaps it is castles in the clouds rather than physics. However, there may be something more than a chimera, a mixture of signal and domain dependence. We come to learn later in the manuscript that a third confounding factor, namely the degeneracy of PC loading patterns with closely spaced eigenvalues, plays a role. It is good to see these factors considered.
Next, let's discuss PCA as a technique. According to those who understand the method, there is general agreement that PCA is useful for data reduction. In other words, in the type of analysis in the manuscript, the time series at n gridpoints or locations can have their covariances explained in k PCs where k<
1. Given the above prologue, the authors on lines 408-409 discuss "heavy constraints" of PCA that inhibit physical interpretation. To that good list, I'll add that it has been shown that the leading PC, by virtue of the constraint of maximal variance, can pull multiple unrelated sources of variation onto itself, confounding physical interpretation. This should be added. The Karl and Koscielny citation (in your reference list already) shows this in their Appendix. Further details are given in the annotated manuscript (attached).
2. There is a general lack of agreement on terminology for eigenmodels, which leads to massive confusion among users of these techniques. At first when reading this manuscript, I thought the authors were applying EOFs, only to change my opinion later in the manuscript that they were applying the PCA model. The original paper where EOFs were named EOFs is generally attributed to Lorenz (1956). However, in that report, Lorenz refers to the displays as EOFs of space and EOFs of time, to define what have now mutated somewhat into what are called "EOFs" and "Principal Components", respectively. Assuming a spatial analysis, those EOFs of space are unit length (sum of the squares of each EOF's coefficients = 1), whereas the EOFs of time are orthogonal vectors, each with a mean of zero and variance equal to the associated eigenvalue. In contrast, the PCA model, generally attributed to both Pearson (1901) and more fully to Hotelling (1933), weights (postmultiplies) the unit length eigenvectors (EOFs) by the square root of the corresponding eigenvalue to give "PC loadings". That seemingly minor change in the spatial patterns (keeping with the definition of space and time given for EOFs) results in the time series calculation and properties being different. Those time series in the PCA model are called "PC scores" and have mean 0 and variance 1. They are also orthogonal. Flip the space and time definitions of these displays if the analysis is temporal. Because the two models result in different space and time patterns, they cannot be compared directly and the precise equations used are necessary to attempt to reproduce the findings of others. I urge the authors to state clearly what model they are using immediately after the introduction and show the equation.
The situation becomes more complicated as users of these techniques tend to grab EOF/PCA code from various statistical packages or Python code libraries that often mislabel the results, never checking the specifics, thereby perpetuating the confusion. For the current paper, one must know if the analyses are applied to EOFs (unit length eigenvectors) or PC loadings (unit length eigenvectors postmultiplied by the square root of the corresponding eigenvalues). Further, it would be helpful to know if any of the results for domain dependence change as a function of the specific model invoked. There is considerable confusion about this topic when reading this paper. It is important that the model being used herein is stated unambiguously at the outset of this paper and the equation added in the methods section to avoid such confusion. Further, adopt the correct terminology for that model and don't list any alternative terminology that might confuse the reader.
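The distinction between the two models in comment 2 can be made concrete with a minimal sketch. It is an illustration on synthetic data, not code from the manuscript or its R-scripts. All variable names and the synthetic series are assumptions. With standardized data $Z$ and correlation matrix $R = E \Lambda E^{T}$, the EOF model uses the unit-length eigenvectors $E$ with time series $Z E$, while the PCA model uses loadings $A = E \Lambda^{1/2}$ with scores $Z E \Lambda^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_loc = 500, 6

# Synthetic series: one shared signal plus independent noise per location.
common = rng.standard_normal((n_time, 1))
data = common + 0.8 * rng.standard_normal((n_time, n_loc))

# Standardize each location's series and form the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
corr = (z.T @ z) / (n_time - 1)

# Eigendecomposition, sorted by descending eigenvalue.
eigvals, eofs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eofs = eigvals[order], eofs[:, order]

# EOF model: unit-length spatial patterns, time series variance = eigenvalue.
eof_amplitudes = z @ eofs

# PCA model: loadings scaled by sqrt(eigenvalue), scores with unit variance.
loadings = eofs * np.sqrt(eigvals)
scores = z @ eofs / np.sqrt(eigvals)
```

In this sketch the squared loadings of each PC sum to its eigenvalue and `loadings @ loadings.T` reproduces the correlation matrix, whereas the squared EOF coefficients sum to 1. The two sets of spatial patterns differ only by a per-PC scale factor, which is why the model in use has to be stated before patterns from different studies are compared.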
3. The treatment of eigenvalue degeneracy is generally well addressed with one exception that potentially plagues nearly every applied eigenanalysis: eigenvalue degeneracy at the truncation point (k). If those PCs associated with closely spaced eigenvalues between k and k+1 have information that is intermixed, problems arise and data is intermixed with noise on the kth retained PC loading vector. Your paper presents 10 PCs, therefore, the spacing between the 10th and 11th eigenvalues should exceed the North et al. criterion. Does it? Let the reader know.
Further, this needs to be mentioned because it can cause the loss of a domain dependence pattern simply because the way eigenvalues are ordered in descending order makes them more likely to be closely spaced as the smallest eigenvalues head toward the tail (presumably noise) where the analyst would normally truncate the analysis to discard the k+1,...,nth eigenvalues, perhaps using some other criterion (e.g., based on percent variance extracted, eigenvalue magnitude).
Related to this, I wonder why eigenvalue degeneracy is not addressed earlier in the paper as it seems to affect domain dependence. If that is the case, then consider moving it earlier in the paper, as those PC loadings arising from degenerate multiplets should not be expected to exhibit the domain dependent patterns, but the multiplet may be dominated by the domain dependent patterns and those are intermixed into new patterns that don't seem to be domain dependent patterns.
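The eigenvalue spacing check raised in comment 3 can be sketched as follows. This is a simplified reading of the North et al. rule of thumb, in which the sampling error of an eigenvalue $\lambda_k$ is approximated as $\lambda_k \sqrt{2/N}$ and neighbouring eigenvalues count as separated when their gap exceeds that error. The function name, the example eigenvalues, and the choice to compare each gap only against the error of the larger eigenvalue are assumptions for illustration. `n_samples` should be the effective number of independent samples, which is smaller than the series length when the data are autocorrelated in time.

```python
import numpy as np

def north_separation(eigvals, n_samples):
    """Flag neighbouring eigenvalue pairs whose gap exceeds the sampling error.

    Returns the approximate sampling errors and, for each pair (k, k+1),
    True if eigenvalue k is separated from eigenvalue k+1.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    errors = eigvals * np.sqrt(2.0 / n_samples)
    gaps = eigvals[:-1] - eigvals[1:]
    separated = gaps > errors[:-1]
    return errors, separated

# Hypothetical descending eigenvalues and sample size.
errors, separated = north_separation([5.0, 2.0, 1.9, 0.5], n_samples=100)
```

In this hypothetical example the second and third eigenvalues are flagged as not separated, so their patterns would form an effectively degenerate multiplet. The same check applied to the pair at the truncation point (k, k+1) addresses the question posed above.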
4. Comparison of PC loading patterns is accomplished with correlations. S-mode PC loading (and EOF) interpretation depends on the magnitude of the PC loadings plotted on a map (and in general, the magnitude of the PC loadings/EOFs is important in any mode). However, correlations subtract each PC loading/EOF vector mean (pattern mean), so two patterns with different means can have large correlations, yet their magnitude patterns will be much different and the grid boxes (I think what you refer to as cells) with the maximum PC loadings will be in different geographical (or topological) locations in your domains. If that is the case, then the correlation is suboptimal for such comparisons. Find a better metric that includes magnitude in terms of comparison. I suggest the congruence coefficient, though others exist that preserve the vector magnitudes.
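The difference between the two metrics in comment 4 can be shown with a small sketch. The congruence coefficient is taken here in its usual form, $\sum a_i b_i / \sqrt{\sum a_i^2 \sum b_i^2}$, i.e. the cosine between the two vectors without removing their means. The two example patterns are hypothetical and chosen so that they have the same shape but different means.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: pattern means are removed before comparison."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def congruence(a, b):
    """Congruence coefficient: no mean removal, so magnitudes matter."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Hypothetical loading patterns: identical shape, different mean level.
pattern_a = np.array([0.9, 0.8, 0.7, 0.6])
pattern_b = pattern_a - 0.75  # same gradient, but centred on zero

r = pearson_r(pattern_a, pattern_b)
phi = congruence(pattern_a, pattern_b)
```

Here the correlation is 1 although one pattern is positive everywhere and the other changes sign, while the congruence coefficient is small. Since the sign of an eigenvector is arbitrary, comparing the absolute value of the coefficient may be appropriate when matching PCs against reference patterns.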
5. It seems odd that after the paper establishes the details and importance of domain dependence, it has no results on how rotating those PCs affects such dependence. There is only a scant mention of the possibility of this near the end of the paper, mostly in the context of rotating degenerate multiplets. However, rotation can be applied to PC loadings associated with non-degenerate eigenvalues and it will affect domain dependence patterns. Please consider adding a section on rotation and show those patterns to comment about how domain dependence is addressed by post processing the PC loadings with a rotation.
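As an illustration of what post-processing by rotation involves, the following sketch applies a varimax rotation to a loading matrix. Varimax is only one common orthogonal rotation and is assumed here for illustration, since the comment does not name a specific criterion. The implementation is the widely used SVD-based iteration, and the random loading matrix is a stand-in for PC loadings from a real analysis.

```python
import numpy as np

def varimax(loadings, max_iter=200, tol=1e-10):
    """Orthogonal varimax rotation of a (locations x PCs) loading matrix.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)
        # Gradient-like target of the varimax criterion.
        target = rotated ** 3 - rotated * col_ss / n_rows
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        current = s.sum()
        # Stop once the criterion no longer increases noticeably.
        if previous > 0 and current <= previous * (1 + tol):
            break
        previous = current
    return loadings @ rotation, rotation

# Hypothetical loadings: 20 locations, 3 retained PCs.
rng = np.random.default_rng(1)
loadings = 0.5 * rng.standard_normal((20, 3))
rotated, rotation = varimax(loadings)
```

An orthogonal rotation leaves the total variance explained by the retained PCs and each location's communality unchanged. It redistributes variance among the rotated PCs, so the rotated patterns depend on how many PCs are retained.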
6. The manuscript discusses accounting for domain dependence prior to attempting physical interpretation. Both the abstract and the introduction discuss how ignorance about domain dependence can easily lead to the wrong interpretations of PCA results (e.g., "Ignorance about DD can easily lead to the wrong interpretations of PCA results. DD patterns are distinct, with strong gradients and contrasts, and therefore highly suggestive to indicate physically meaningful drivers or properties of the analyses system"). I agree with this statement and, assuming it is valid, the reader will want to know about the right interpretations of PCA results. The manuscript further states (correctly) that the analyses proceed from data that are formed into a correlation (or covariance) matrix, either explicitly or implicitly, and that matrix (or the standardized data in the case of SVD) is decomposed into eigenvectors that should be capable of summarizing the correlations/covariances of the data (after ensuring they do not represent domain dependence patterns). Therefore, some additional discussion of how to interpret those eigenvectors (in the case of the present manuscript, PC loadings and PC scores), after passing a domain dependence assessment, must be added. It seems the majority of patterns shown in the paper suffer from domain dependence or from the effects of eigenvalue degeneracy combined with domain dependence. Would that be the null hypothesis for other investigators?
The main recommendation to assess such a hypothesis of domain dependent patterns (according to the manuscript) seems to be to visually assess the similarity, but it leaves the reader asking, "then what do I do?". Presently, there is a suggestion to visually assess the analyzed patterns and compare them to the domain dependent patterns for a similarly shaped domain. Two issues with visual assessment are (a) reliability: different analysts viewing the same pattern may disagree, with one believing there is a strong resemblance, so that the pattern should not be further interpreted, while a second thinks the resemblance is not strong enough to reject the pattern as domain dependent. Further, (b) the nature of a qualitative visual assessment means any one analyst can see some resemblance to domain dependent patterns in their visual assessment and then discount it based on personal bias. A more quantitative approach to avoid (a) and (b) would be a direct numerical comparison using a matching coefficient (e.g., congruence coefficient). In that case, a recommendation could be made, such as, if the congruence coefficient exceeds some value (e.g., > 0.8), the analysis is dominated by domain dependence and the unrotated PC loadings/EOFs should not be analyzed physically. The assessment of the physical interpretation gets even trickier at this point. If the PC loading pattern, based on either visual assessment or congruence coefficient value, is thought not to be sufficiently contaminated by domain dependence, it does not mean it is physically interpretable as a meaningful mode without further investigation. Recall what the PCA does. It summarizes the correlation/covariance structure into a set of k PC loadings and k PC scores. Do we know if any of those structures relate well to the correlation/covariance matrix from which they were drawn?
Without such a step, physical interpretation would seem unwise (we're back to the castles in the clouds but now from the "heavy constraints"). Because the manuscript is motivated by finding physically important modes, a revised manuscript should address or provide some suggestions on how to confirm if a mode is physically realistic or related to the correlations/covariances (or not). There is some literature on this topic, ranging from never physically analyze any PC structures (in that case domain dependence is moot because the domain does not affect the ability of PCA to extract most of the variance from a dense correlation/covariance matrix) to, in many cases, the PC structures can be analyzed after confirming similarity to the correlations/covariances. I suggest examining the Compagnucci and Richman (2008) and Huth and Beranova (2021) papers for starters. The latter asks the specific question about what is a "true mode" whereas the former addresses the question of whether certain analysis modes can retrieve the modal patterns. Of course, there are other alternatives, such as using a technique not rooted in eigenvectors. However, if the paper offers a path to identifying domain dependence that undercuts physical interpretation, some remedy should be offered.
Specific comments
Numerous specific comments are listed in the annotated manuscript (attached).
RC2: 'Comment on hess-2024-172', Anonymous Referee #2, 02 Jan 2025
This paper highlights a largely overlooked issue called domain dependence (DD), where the PCA results are influenced more by the size and shape of the spatial domain being analyzed than by the actual hydrological processes. This effect, caused by spatial autocorrelation in hydrological data, can lead to misleading patterns, accumulation of variance in leading PCs, and closely related (degenerate) PCs that are difficult to distinguish. The paper emphasizes the need to account for DD when interpreting PCA results and introduces two methods, one stochastic and one analytic, for generating DD reference patterns. These methods are demonstrated using synthetic examples, and R-scripts are provided to help users explore and address DD in their analyses. The results presented are solid. The paper covers all the aspects that are important for a user. However, there is redundancy and a lack of clarity in some of the sections. I suggest a major revision that’s focused on organizing and presenting the materials. Please see my detailed comments below.
Major comments:
It is good to have all the relevant terms explained in Section 2. However, as a hydrologist, I personally found Section 2 quite challenging to follow. Since the objective of this technical note is to raise attention to the DD effects among PCA users in the hydrology community, it is better to use terminology and display items that are accessible to hydrologists, especially in the method section.
I suggest adding 1) equations when necessary and 2) conceptual diagrams like hypothetical spatial and temporal PC graphs to explain PCA and S-mode PCA (they can be put in the appendix). The authors can also add workflow diagrams in both the method and discussion sections when they illustrate to practitioners how to consider DD, how to diminish DD, etc. Also, consider adding a real hydrological case at the end of the paper to illustrate the DD effects and how to deal with DD. That way, the value of the paper to hydrologists and other PCA users can be greatly improved.
Minor comments:
Combine data set to be one word “dataset”.
Avoid using the word “system” which is too broad a term and could mean different things to different people. Be more specific. If you are talking about a catchment, use catchment. If you are talking about a soil column, use soil column.
Abstract: The abstract needs reworking. Currently, the authors spend three quarters of the abstract on describing what DD is and why it’s important to consider DD. Only 3-4 sentences are focused on what the paper does. The abstract needs to be re-organized such that the first quarter gives the introduction and background information about DD. The middle two quarters focus on the methodology and results. The last few sentences focus on the implications of the findings.
Line 45-50: Could expand the list by adding references of PCA/EOF to hydro-climate research like:
Li et al. (2023): https://link.springer.com/article/10.1007/s00382-021-06017-y
Bieri et al. (2021): https://journals.ametsoc.org/view/journals/hydr/22/3/JHM-D-20-0116.1.xml
Line 105: You’ve defined domain dependence to be DD. Use DD here.
Line 118: “Considering DD is discussed”. I don’t quite understand. Do the authors mean “in practice, when and how to consider DD is discussed”? Be a bit more specific here.
Move section 3 to data and code availability statement.
Figures 5-6: Show the colorbars for the color shadings.
Figure 7 is just a repeat of the square experiments in Figures 5 and 6. I suggest showing one figure of square experiments, one figure of rectangle experiments, and one figure of triangle experiments. On all the PCs, show the colorbar, the information you showed in the title of Figure 7a.
The section titles can be more informative. Like “4.1 First examples, 4.2 Domain shape, 5 Considering DD”… The authors should use short phrases instead of words for the subheaders. This is a good opportunity to provide more information to summarize the subsections.
Table 1: When the PC of the subsampled variant does not correlate best with the all-cell PC of the same rank, i.e., the values with “\”, the correlation is significantly lower. For example, 0.52 for PC4 in the Square pattern, 0.45 for PC5 in Square, 0.52 for PC6 in Rectangle. They are significantly lower than other values in the table. Is there an explanation for that?
It is unclear to me how exactly you calculated stability. I suggest showing the equation when it is first mentioned.
Citation: https://doi.org/10.5194/hess-2024-172-RC2
Data sets
R-scripts to (i) explore the domain dependence (DD) of spatial Principal Components and (ii) calculate DD reference patterns Christian Lehr https://doi.org/10.5281/zenodo.11213430