Technical Note: A data-driven method for estimating the composition of end-members from streamwater chemistry observations

End-Member Mixing Analysis (EMMA) is a method of interpreting streamwater chemistry variations and is widely used for chemical hydrograph separation. It is based on the assumption that the streamwater is a mixture of varying contributions from relatively time-invariant source solutions (end-members). These end-members are typically identified by collecting additional measurements of candidate end-members from within the watershed. This technical note introduces a complementary, data-driven method: Convex-Hull End-Member Mixing Analysis (CHEMMA), to infer the end-member compositions 5 and their associated uncertainties from the streamwater observations alone. The method involves two steps. The first step uses Convex-Hull Non-negative Matrix Factorization (CH-NMF) to infer possible end-member compositions by searching for a simplex that optimally encloses the streamwater observations. The second step uses Constrained K-means Clustering (COP-KMEANS) to classify the results from repeated applications of CH-NMF to analyze the uncertainty associated with the algorithm. In an example application using the 1986 to 1988 Panola Mountain Research Watershed dataset, CHEMMA is able 10 to robustly reproduce the three field-measured end-members found in previous research using only the streawater chemical observations. It also suggests the existence of a fourth end-member. Further work is needed to explore the constraints and capabilities of this approach.

EMMA assumes that the chemical composition of streamwater should be explained by the conservative mixing of a finite set of temporally invariant end-members (Hooper et al., 1990). These end-members, therefore, are the most extreme points that 25 define a range within which all streamwater observations are included. End-members are identified by collecting samples of candidate source-water within the watershed. Inasmuch as the end-members are identified by candidate sampling, they depend upon hypotheses that 1) identified end-members supply streamwater; and 2) identified end-member set may be incomplete.
Streamwater concentration are naturally correlated. EMMA utilizes the Principal Component Analysis (PCA) method to 30 convert the naturally correlated streamwater concentrations into a set of linearly uncorrelated variables. Each new variable, which is called Principal Component (PC), is a linear combination of the observed streamwater attributes. For a set of n variables, PCA first requires standardized observations (X obs ) by subtracting the mean and dividing the standard deviation. Then it calculates a projection matrix P obs , which transforms the from the observation space to the PC space, by decomposing the covariance matrix of X obs . The transformed columns of Y obs (representing the n observations in the PC space) are uncorrelated, 35 and each of which accounts for a portion of total variance: Standardized end-member candidates X em can be projected into the PC space by the same projection matrix P obs , and be converted in the transformed space as Y em :

40
To find the parsimonious subset of appropriate end-members, EMMA then takes the information provided by PCA to determine the approximate dimensionality of the streamwater mixture and to screen end-members. In the PC space, appropriate end-member candidates (Y em ) are selected by choosing ones that tightly bound the transformed observations (Y obs ) (Christophersen and Hooper et al., 1990;Hooper, 2003). However, the number of retained PCs is usually determined using a heuristic, such as using the number of PCs that explain at least 1 n proportion of the total variance, because of the dardized data are projected into the 2D subspace spanned by two of the PCs. Qualified points forming a convex hull around the projected data are marked. This is repeated for every pair of PCs. Finally, we interpolate between convex-hull vertices in each subspace to find the vertices of a simplex in a k-dimensional subspace. This simplex forms a "convex-hull" such that all the data points can be optimally approximated as convex linear combinations of them. The algorithm is summarized as follows:

90
Algorithm 1: CH-NMF algorithm (Thurau et al., 2011) adapted to the end-member identification problem given m streamwater observations of n solutes Result: i th end-member composition x n×1 emi , and its contribution h m×1 i , i = 1, 2, ..., k 1. Subtract the mean (µ 1:n ) and dividing by standard deviation (σ 1:n ) for each solute to obtain standardized observation matrix X m×n obs 2. Compute d eigenvectors (PCs) e 1 , ..., e d , d = rank(X obs X T obs ) ≤ n 3. Project X obs onto each of the d 2 2D-subspaces spanned by pairs of PCs 4. Mark the k convex hull vertices for each projection plane and stored in matrix S n×p , p is the maximum number of points needed to make a convex hull in one projection plane.

Define end-member matrix
With given standardized m streamwater samples with n measured attributes X m×n obs and desired k end-members (Step 1, Figure 1 a), CH-NMF decomposes the covariance matrix of the observations to obtain d PCs, which is the same linear orthogonal projection as PCA method ( Step 2). Instead of immediate dimension reduction as EMMA, CH-NMF examines the distribution of X obs in all of the subspaces spanned by PC pairs (Step 3, Figure 1 b, light blue points) and marks the most 95 extreme points (Figure 1 b, red crosses) that construct the convex hull (Figure 1 b, red lines) to store in S (Step 4). Then, a subset of S, SI = X em , is found as a convex combination of S (Step 5, Figure 1 c, square vertices of the simplex) that minimizes the Frobenius norm · 2 F (the entry-wise Euclidean norm of the matrix). Finally, the contribution H is found by finding the convex combination of end-members that reproduces the data with minimal error (again using the Frobenius norm) (Step 6).

100
Step 5 is the essential step of the CH-NMF theory, and it is a modification of Convex Nonnegative Matrix Factorization (C-NMF) by adding a convexity constraint on J (Ding et al., 2008;Thurau et al., 2011). In the original setting of C-NMF, the I and J are naturally sparse if the vertex search is in PC subspaces (Ding et al., 2008). Adding the convexity constraint on J makes J an interpolation between each columns of SI (i.e. each end-member composition x em ), however, the sparse nature of I remains (Thurau et al., 2011). We could interpret the objective function of Step 5 (minimize S − SI p×k J k×p 2 F ) in three steps. First , the sparsity of I results in the end-member composition X em close to a subset of the extreme observations (S) projected in the PC subspace.
Second, J makes other extreme observations in S to be expressed as a convex combination (interpolation) of X em . Third, minimizing the Frobenius distance between S and X em J guarantees end-member compositions X em will be convex hull ver-110 tices because all other extreme points can be written as convex combinations of vertices, but not vice versa. As a consequence, a well-supported set of convex hull vertices tightly bound the observations and are as unique as possible, which satisfies the original EMMA assumption of finite set of distinct end-members. The sparse nature of I helps prevent overfitting because noise will tend to be concentrated on superfluous vertices without degrading identification of the others. The noisy end-members can be identified in the classification step given in the next section.

115
The constraint requiring that the end-members be a convex combination of the extreme observations implies that CH-NMF may not accurately identify end-members that are not a large fraction of any observation in the dataset. As the synthetic example shown in Figure 1 illustrates, the simplex formed by joining the CH-NMF end-members lies inside the shell formed by connecting the extreme points (red crosses in Figure 1c). If no samples are anywhere close to being 'pure' representatives 120 of an end-member, the apparent end-member identified by CH-NMF may lie closer to the data centroid than the true endmember. Methods to relax the constraint on Step 5 and better identify end-members distant from the data in mixing space will be investigated in future work.

Quantify the intrinsic uncertainty using COP-KMEANS
Each run of CH-NMF may yield different end-member estimates. This is because the complex structure of the high-dimensional 125 streamwater data result in a rough objective function surface (Step 5). CH-NMF runs with different initial search locations may fall into different local minima.
Depending on the structure of the data cloud, each run's end-members may be nearly identical (if the end-member is wellcharacterized) or one or more may vary widely. Poor identification of extreme points may result from a lack of sufficient 130 well-defined "vertices" in the data cloud. This may occur if more end-members are sought than the data can support. It may also occur if an end-member is variable in time. Instead of a vertex, the time-varying end-member forms an edge in the data space. Alternatively, the observations may not sample the true mixing space sufficiently to identify an end-member in the space as a convex-hull vertex, perhaps because it never represents more than a small fraction of variance.

135
Even in the absence of these issues, the variability and uncertainty of the stream concentration observations will contribute to uncertainty in end-member identification. The variation in the CH-NMF-identified end-members can be assessed by running the CH-NMF analysis a large number of times, and then using a clustering algorithm to extract the centroid and spread of areas consistently identified as an end-member. We use the COP-KMEANS variant of the K-means clustering algorithm, which allows us to require that end-members predicted from the same CH-NMF run must not be placed in the same cluster 140 As the number of end-members increases, the centroid and variance within the cluster may increase or decrease, which provides another way to decide the number of needed end-members for a given observation set. In this technical note, we consider a new cluster to be well-identified as a proper end-member if two conditions are both satisfied: 1) the spread of previously identified clusters remains similar or decreases, and 2) the cluster itself has a reasonable variance. 3 An application of CHEMMA 155 We applied CHEMMA to a test dataset of 905 samples of six solutes (alkalinity, sulfate, sodium, magnesium, calcium, and dissolved silica) collected from the stream in the Panola Mountain research catchment, Georgia, U.S. and described in Hooper et al. (1990). The six solutes were specifically selected to meet EMMA's assumption that their concentrations vary significantly across the watershed (Hooper et al., 1990). Hooper et al. (1990) found that the stream chemistry could be interpreted as a mixture of hillslope, groundwater, and organic soil end-members, which are identified by sampling within the watershed. Here we 160 ask 1) does CHEMMA recover the same three end-members as Hooper et al. (1990) identified in field-sampling? and 2) does the data support the existence of additional end-members?

Example Python implementation
We ran CHEMMA for three, four, and five end-member cases (k = 3, 4, 5) because two and three PCs account for 94% and 97% of the total variance, respectively . In order to capture the intrinsic uncertainty associated with the identified clusters, we 165 calculated the mean and standard deviation (st.dev) for each case based on 100 CH-NMF runs (Table 1). CHEMMA was able to recover the three field-measured end-members reported by Hooper et al. (1990) (Figure 2, three blue stars). The mean of the three CHEMMA identified clusters (Figure 3 and Table 1) are very similar to the median concentration of the field-measured end-members ( Table 2). The median concentration of the hillslope field sample (Table 2) has much lower alkalinity concentration compared with the mean concentration of the CHEMMA identified Green cluster ( Figure 3 and Table 1), however, it is 170 still within the cluster spread given in Table 1. 6 https://doi.org/10.5194/hess-2020-250 Preprint. Discussion started: 15 June 2020 c Author(s) 2020. CC BY 4.0 License.
A fourth end-member could be robustly identified (Figure 2, four red stars) that explained more of the data variability. This end-member apperared to be a mixture of hillslope and groundwater in some ways but had relatively high alkalinity and silica concentration compared to those end-members ( Figure 2 brown and navy axes). The fourth end-member captures variations 175 along the third PC axis (Figure 3 d), which are not apparent in the 2D view (Figure 3 b).
The spread of all end-member clusters (generated by 100 runs of CH-NMF) was small when four were sought, but a fifth could not be clearly identified. As the number of end-members was increased from three (Figure 3 a) Table 1) suggesting they could now be identified with less uncertainty. However, the inclusion of the fifth end-members (Figure 3 c) not only did not further tighten the previously identified clusters, but the fifth cluster was poorly defined (black Cluster 5). Except the cyan cluster has generally decreased within cluster variation, the standard deviations of other clusters increase for both three and four end-member cases (Table 1).

4 Discussion and conclusions
As the application results show, CHEMMA is able to reproduce the three end-members that were verified in the previous study as well as a fourth end-member that explains more variation in the data (Hooper et al., 1990). The dispersed cluster distributions in Figure 3 c suggests that the fifth end-member may be finding the noisy edges of the sample space, and so cannot be supported by the data. However, we have not here identified an objective criteria for determining whether an end-member is 190 supported.
Because CHEMMA extracts end-members from the observations, the accuracy of the end-member's composition is influenced by the noise from sample chemical analysis error, how well the collected samples represent the full range of sources in the catchment, and how valid the assumption of conservatively-mixing time-invariant end-members is. The captured variations 195 in PC 3 shown in Figure 3 d may result from temporal variations of the end-member composition. The less concentrated Cluster 3 in Figure 3 b may result from relatively rare contributions from that end-member. Fortunately, CHEMMA itself provides a tool for exploring some of these sources of uncertainty. By partitioning the dataset into time periods (or hydrologic state, etc), the temporal variability of end-members could be explored.

200
Future work refining and applying this method may focus on 1) applying quantitative methods to optimize the model complexity, such as the Akaike Information Criterion (AIC), or Bayesian Information Criterion (BIC, or Schwarz criterion); 2) relaxing the constraints on the CH-NMF algorithm (Algorithm 1, Step 5) so that extreme points in S also lie inside the simplex, allowing the method to better characterize end-members that are never a large fraction of any observations; and 3) exploring