Hydrology and Earth System Sciences Modelling Groundwater-dependent Vegetation Patterns Using Ensemble Learning

Vegetation patterns arise from the interplay between intraspecific and interspecific biotic interactions and from different abiotic constraints and interacting driving forces and distributions. In this study, we constructed an ensemble learning model that, based on spatially distributed environmental variables, could model vegetation patterns at the local scale. The study site was an alluvial floodplain with marked hydrologic gradients on which different vegetation types developed. The model was evaluated on accuracy, and could be concluded to perform well. However, model accuracy was remarkably lower for boundary areas between two distinct vegetation types. Subsequent application of the model on a spatially independent data set showed a poor performance that could be linked with the niche concept to conclude that an empirical distribution model, which has been constructed on local observations, is incapable to be applied beyond these boundaries.


Introduction
Ecosystems are complex, evolving structures whose characteristics and dynamic properties depend on many interrelated links between direct gradients (nutrients, moisture, temperature), their environmental determinants (climate, geology, topography) and potential natural vegetation, and the processes that mediate between the potential and actual vegetation cover (Baird and Wilby, 1999).Riparian wetlands in Correspondence to: J. Peters (jan.peters@ugent.be)particular exhibit a complex interplay between meteorological, hydrological and biological processes and interactions with the surrounding terrestrial and aquatic systems resulting in a high spatial and short-term variability (Dall'O' et al., 2001).The conceptual representation shown in Fig. 1 illustrates the relationships between hydrology, the physicochemical environment and vegetation at the local scale.The direct effect of site hydrology on physicochemical site properties, such as soil moisture content, oxygen and nutrient availability determines the productivity and species composition of the site (Venterink et al., 2001;Wassen et al., 2003).Vegetation, however, is not passive to the abiotical setting, but affects site hydrology and physicochemical properties through feedback processes of which transpiration (Engel et al., 2005), soil aeration (Mainiero and Kazda, 2005) and alterations in nutrient loadings (Hill, 1996;Fisher and Acreman, 2004) are just some examples.These localized direct and feedback processes result in spatial and temporal distributions of the abiotical constraints at a higher scale level (Schröder, 2006).Together with intraspecific, interspecific and anthropogenic interactions these distributed abiotical constraints result in vegetation patterns.
Exploring vegetation patterns is a central goal in ecology.Numerous studies examined environmental gradients in relation to vegetation type distributions in various ecosystems (Schulze et al., 1996;Famiglietti et al., 1998;Molina et al., 2004;Rudner, 2005), and different techniques have been developed to quantify vegetation-environment relationships.Canonical ordination (Jongman et al., 1995) for example, is widely applied in ecological studies to detect patterns of variation in vegetation data and quantify the main relations between vegetation and environmental variables.Generalized Published by Copernicus Publications on behalf of the European Geosciences Union.linear models (e.g.multiple logistic regression (Hosmer and Lemeshow, 2000)) are frequently applied to construct distribution models (Austin, 2002;Bio et al., 2002, among others).Distribution models tend to predict spatial distributions of species based on environmental variables (Guisan and Zimmerman, 2000;Guisan and Thuiller, 2005).In this study, an ensemble learning technique named random forests (Breiman, 2001;Prasad et al., 2006), is applied to a spatially distributed data set containing information on environmental conditions and vegetation type distributions.The random forest distribution model was assessed in terms of: (i) its clas- sification accuracy, (ii) its applicability on a similar alluvial floodplain, and (iii) its potential to model vegetation distributions based on a reduced number of important environmental variables in groundwater-dependent ecosystems.

Description of the study site
A lowland river valley in Belgium called "Doode Bemde" was the research area of this study (Fig. 2).The site is an alluvial floodplain mire in the middle course of the river Dijle, situated approximately 30 m above sea level.The area is bordered by the river Dijle in the west, the Molenbeek, a tributary of the Dijle, in the north and the valley slope with a number of permanent springs in the east (De Becker et al., 1999).The climatic conditions at the site are typically temperate, with an average yearly rainfall of ≈800 mm distributed evenly over the year (Verhoest et al., 1997;De Jongh et al., 2006), an average annual pan evaporation of 450 mm, and an average yearly air temperature of 9.8 • C (Van Herpe and Troch, 2000).Local conditions at the Doode Bemde have been extensively described by De Becker et al. (1999) and Joris and Feyen (2003).

Ecohydrological monitoring scheme
During the summer of 1993 and the spring of 1994, plant species occurrences were mapped in the study area.Therefore, the total area of 21.08 ha was subdivided in 519 regular and adjacent 20 m by 20 m grid cells.Mapping was restricted to a selection of 56 plant species of which 45 were typically groundwater dependent (phreatophytes, sensu Londo (1988)) and 11 were differential species for several vegetation types at the Doode Bemde.Based on these species cover data, De Becker et al. (1999) applied TWINSPAN (Hill, 1979) 1), and their spatial distribution can be seen in Fig. 2. All vegetation types are herbaceous, except for Carici elongetae -Alnetum glutinosae where a tree layer of Common Alder is present.The similarity in species composition between grid cells was compared using the Jaccard index of similarity JS=c/(a+b+c) where c is the number of species shared by both cells, and a and b are the numbers of species unique to each of the cells (Jaccard, 1912).The Jaccard similarity of two grid cells expresses their ecological resemblance concerning species composition, and ranges between 0 (when both cells have unique species) and 1 (when both cells have equal species composition).Averaged JS values are given in A groundwater monitoring network consisting of 25 piezometers was installed in 1989.Groundwater depths were measured every fortnight during the period 1 January 1991-31 December 1993.Time series of linear interpolated groundwater depths measured at several piezometers (A-E, locations can be seen in Fig. 2) along a topographical transect are plotted in Fig. 3a.A yearly pattern of high summer depths and low winter depths was observed at all piezometers.Based on these time series, hydrological duration lines expressing the probability (%) that a groundwater depth is exceeded are calculated (Fig. 3b).Groundwater depths corresponding to a probability of exceedance of 50% are yearly average groundwater depths.They differed considerably along the transect (Fig. 3b).At the levee near the river an average value of 1.27 m was measured (piezometer A), which decreased gradually moving further down toward the depression (piezometer B→C→D), with a minimal yearly average groundwater depth of 0.05 m measured at piezometer D in the center of the depression.Fig. 3b also shows different periods of superficial groundwater depths (<0.3 m) in all piezometers, ranging from 75% of the year in piezometer C to 35% of the year in piezometers B and D. Groundwater depths measured in piezometer A are never <0.3 m.Additional to the monitoring of groundwater dynamics, all 25 piezometers were sampled on several groundwater quality variables during a sampling campaign in September 1993 with respect to pH, Cl − , Ca 2+ , Fe tot , K + , Mg 2+ , NO − 3 -N, NH + 4 -N, H 2 PO − 4 and SO 2− 4 .All values are in [mg L −1 ] except for pH [-].A soil type map was made based on 60 drillings to a depth of 1 m, evenly distributed over the study area.Management regime was assessed for each grid cell separately.Four different regimes could be distinguished: -Yearly mowing in early summer, followed by grazing or mowing of the aftermath; -Cyclic mowing (once every 5 to 10 years) or not mown at all since at least 5, and up to 10 years; -No management for at least 10 years; -Transition from yearly to cyclic mowing.

Data set
Groundwater depth measurements were used to calculate a dynamic groundwater variable, the mean groundwater depth (MGD) below surface [m].Values of this variable, together with the groundwater quality variables, were assigned to each grid cell by spatial interpolation of measurement data over the entire area using block kriging (for details, see Bio et al. (2002)).The spatially explicit variables were structured into a data set.The data set contains N=519 measurement vectors x i =(x i1 , x i2 , . . ., x ip ) consisting of the values of p=13 variables describing the abiotic environment: -Groundwater dynamics: mean groundwater depth (continuous variable); -Groundwater quality: pH, Cl − , Ca 2+ , Fe tot , K + , Mg 2+ , NO − 3 -N, NH + 4 -N, H 2 PO − 4 and SO 2− 4 .All these variables are continuous; -Soil: soil type (silt/peat, categorical); -Management: yearly mowing, cyclic mowing, no management, transition (categorical).
Seven different vegetation types c 1 , . . ., c 7 are considered.To each measurement vector x i a unique vegetation type l i ∈ {c 1 , . . ., c 7 } is assigned.The data set will be denoted as: (1)

Independent evaluation data set
A spatially independent ecohydrological data set L ev was constructed for a similar valley ecosystem, "Snoekengracht".The Snoekengracht is an alluvial floodplain of the river Velp, situated approximately 15 km from the Doode Bemde.The climatic setting of both nature reserves is very much alike, and local environmental conditions and floral composition are very similar (Bio et al., 2002).The monitoring scheme was largely the same as in the Doode Bemde (Huybrechts and De Becker, 1999) where l i is the vegetation type assigned to measurement vector y i .Most vegetation types coincide with those found at Doode Bemde, except for Magnocaricion which was not found at Snoekengracht (see Table 1).

Distribution model
The distribution model used in this study applies the random forest technique (Breiman, 2001).Random forest is an ensemble learning technique which generates many classification trees (Breiman et al., 1984) that are aggregated to compute a classification.Each classification tree is grown using another bootstrap subset L i of the original data set L and the nodes are split using the best split variable among a subset of m randomly selected variables (Liaw and Wiener, 2002).
The pseudo-code for growing a random forest is given in Appendix A1.The number of trees (k) and the number of variables used to split the nodes (m) are two user-defined parameters required to grow a random forest.An unbiased estimate of the generalization error (the so called out-of-bag error, oob error) is obtained during the construction of a random forest (Appendix A2).Breiman (2001) proved that random forests produce a limiting value for the oob error.As the number of trees increases, the generalization error always converges.
The number of trees (k) needs to be set sufficiently high to allow for this convergence.The oob error can be used to optimize the other user-defined parameter m, in order to get a minimal random forest error (Peters et al., 2007).The model outcome is an ensemble of k classification trees which are aggregated based on majority votes to compute the final classification.Since every classification tree votes for a certain vegetation type c j based on the measurement vector x i of grid cell i, the probability of occurrence of vegetation type c j is given by P (c j ) = N c j /k, where N c j is the number of trees voting for vegetation type c j , and k the total number of trees.The highest probability of occurrence (P (c j ) max ) determines the predicted vegetation type c j .Additionally, the random forest algorithm can estimate variable importances (Appendix A3), i.e. variables can be ranked according to their importance in determining vegetation distributions at the study site.

Model construction and results
At first instance the data set L was randomly split into 3 data subsets for 3-fold cross-validation.The model was constructed using the random forest program provided by Breiman and Cutler (2005).User-defined parameters m, the  3).A κ (Cohen, 1960) value of 0.716 was calculated, indicating a substantial agreement between observations and predictions.A threshold-independent evaluation using receiver operating characteristic (ROC) graphs was performed (Hosmer and Lemeshow, 2000).ROC graphs are useful for visualizing classifier performances (Fawcett, 2006).ROC graphs are two-dimensional graphs in which the true positive rate, tp, is plotted on the y-axis, and the false positive rate, fp, on the x-axis, where tp = positives correctly classified total positives (3) The area under the ROC curve, abbreviated AUC, is a scalar value between 0 and 1 representing the classifier performance (Fawcett, 2006).Since random guessing produces a diagonal line between (0,0) and (1,1) in ROC space, with an AUC value of 0.5, a classifier with a higher AUC value than 0.5 does better than random guessing.For multi-class ROC graphs, which should be applied here since 7 vegetation types are considered, a methodology described in Fawcett (2006) is used.For each class a different ROC curve is produced, with ROC curve j plotting the classification performance using vegetation class c j as positive and all other classes as negative.For each ROC curve, the AUC can be calculated and averaged over the different classes using class weights based on class prevalences in the test data (Provost and Domingos, 2001):

LEGEND (a)
where AUC(c j ) is the area under the class reference ROC curve for c j , and w(c j ) a weighing factor.Weighing factors are obtained from Table 1. Figure 5 visualizes the ROC curves for each vegetation type.The AUC total value equals 0.96 and the random forest distribution model is concluded to perform well.

Spatially explicit evaluation
For each grid cell, the ensemble of k=1000 classification results is aggregated by calculating probabilities of occurrence P (c j ) for all j vegetation types of which the vegetation type with the highest P (c j ) value (P (c j ) max ) is the predicted one.As seen in Fig. 6 this decision rule leads to an increasing number of correct classifications with increasing P (c j ) max values.Indeed, 252 elements are correctly classified with a probability higher than 0.7, whereas only 2 elements are correctly classified with a probability lower than 0.3.50% of the correctly classified elements are based on probabilities >0.78.The incorrect classifications show a maximum in the [0.4,0.5]interval, with 1 element incorrectly classified with a probability lower than 0.3, and 28 elements incorrectly classified with probabilities higher than 0.7.50% of the incorrectly classified elements are based on probabilities > 0.55.
Figure 4b shows the spatial distribution of P (c j ) max values at the study site in graduated colours.Correctly classified grid cells with high P (c j ) max values are situated within the central areas of homogeneous vegetation clusters, and P (c j ) max values tend to decrease toward the boundaries of these areas (see also Fig. 4a).Incorrectly classified grid cell are mainly found where two adjacent vegetation types meet, and are based on low P (c j ) max values at the central depression and the north-eastern side of the study site.The vegetation types found in these areas are Carici elongetae-Alnetum glutinosae, Phragmitetalia, Magnocaricion with Phragmites and Magnocaricion.A Jaccard similarity matrix was constructed for the boundary grid cells only (Table 4).The JS Hydrol.Earth Syst.Sci., 12, 603-613, 2008 www.hydrol-earth-syst-sci.net/12/603/2008/ [0,0.1 [ [0.1,0.2[ [0.2,0.3[ [0.3,0.4[ [0.4,0.5[ [0.5,0.6[ [0.6,0.7[ [0.7,0.8[ [0.8,0.9[ [0.9,1]  values in Table 4 express averaged resemblances in species composition of each boundary grid cell with its 8 neighboring grid cells.Boundary grid cells of Phragmitetalia, Magnocaricion with Phragmites and Magnocaricion can be concluded to share a large proportion of their species with JS values higher than 0.5.This is reflected in the modelling results, P (c j ) max values for these grid cells are generally low because comparable numbers of the k=1000 classifiers classify these grid cells as Phragmitetalia, Magnocaricion with Phragmites and Magnocaricion.Another conclusion should be drawn for isolated grid cells and small isolated vegetation clusters surrounded by another vegetation type (e.g. as occurs along the western border of the study area, see Fig. 4a).These grid cells are frequently incorrectly classified with high P (c j ) max values, and are the weak point of the random forest distribution model.The worse performance of the model on boundary grid cells can also be seen in Fig. 5, where ROC curves of classification results computed for boundary grid cells only are lower than those computed for the entire data set.The corresponding AUC total value for model performances in boundary areas equaled 0.92, while being 0.96 for the entire study area.

Performance on independent test data
The use of independent test data allows us to assess the model generalization abilities.Edwards et al. (2006) pointed out that cross-validated model accuracies are frequently different from accuracies assessed with truly independent data.It is easy to conclude that the random forest vegetation distribution model, which was trained on the data set L did not classify data set L ev satisfactory.From the 501 ele-  (Hutchinson, 1957).
ments included in L ev , only 99 elements were classified correctly (19.8%).This can be explained by the niche concept (Hutchinson, 1957).The fundamental niche of a plant species, and by extension a vegetation type, is defined as an n-dimensional hypervolume (Hutchinson, 1957) in which every point corresponds to a state of the environment which would permit the species to exist and reproduce.Due to interspecific competition species generally occupy only an elementary part of this volume, the realized niche.The niches realized by each of the vegetation types found at the Doode Bemde differ from those realised by the same vegetation types at Snoekengracht.Although similar results were observed for all vegetation types, the example of Calthion palustris is given in Fig. 7. Since 13 environmental variables are used in this study, a principle component analysis was performed to reduce dimensions and make results visible.Fig. 7 graphs the component scores of grid cells where Calthion palustris was observed on the 2 principle component axes (cumulatively explaining 70% of variance).
Although partly intersecting, two different realized niches can be distinguished.Obviously, a random forest distibution model that is trained on the vegetation distributions at the Doode Bemde and which uses explicit environmental thresholds to compute a classification, cannot perform well on such an independent test data set of an apparantely similar ecosystem.

Reduction of model complexity
The random forest algorithm includes a procedure to estimate the importance of the independent variables (Appendix A3).
Applying this procedure on data set L results in a ranking of all 13 variables according to importance (Fig. 8a).The most important variable is mean groundwater depth.This means that, according to this classification technique, the spatial differences in mean groundwater depths at the Doode Bemde are determinative for the vegetation distributions at the study site.Based on this variable ranking, 13 random forest distribution models were constructed, each on a data set with reduced complexity, i.e. each based on a different number of variables by eliminating the variables in order of importance.Results are summarized in terms of the oob error, and plotted in Fig. 8b.A stable oob error value was found for the models with complexities between 4 and 13 variables.
Based on this result, a simplification of the ecohydrological monitoring scheme for distribution modelling is preliminarily assessed.Since the random forest performances were similar when all 13 or just a part (>3) of these variables were included, there seems to be no need to describe the environmental conditions of the study area by that many variables.Therefore, a simplification of monitoring efforts can be made based on various criteria such as relevance and measurement costs.For similar alluvial ecosystems with groundwater dependent vegetations, the inclusion of groundwater depth together with some -easily measurable -groundwater quality variables such as pH, NO − 3 -N, NH + 4 -N, and management as environmental variables on which the vegetation distribution modelling is based, is proposed.The independent test data set L ev was redesigned only to include 5 variables: mean groundwater depth, pH, NO − 3 -N, NH + 4 -N, and management.A random forest distribution model was trained on this data set, and 3-fold cross-validation resulted in an overall accuracy of 72.5% (363 grid cells correctly classified, 138 incorrectly classified), and a κ value of 0.657 and an AUC total value of 0.94 were computed.The reduced random forest distribution model did perform satisfactorily, even when compared to the 3-fold cross-validated results of the random forest model constructed on the entire data set L ev (accuracy=76.6%,κ=0.709,AUC total =0.96).

Conclusions
Vegetation patterns arise from the interplay between intraspecific and interspecific biotic interactions and from different abiotic constraints and interacting driving forces and distributions (Schröder, 2006).In this study, we constructed a vegetation distribution model based on spatially distributed environmental variables which were linked with the occurrence of a certain vegetation type.Biotic interactions were only included indirectly, i.e. their effect was included through the observed vegetation distribution pattern, not directly as independent variables underlaying the vegetation distribution.As far as classification accuracy of the random forest is concerned, results were satisfactory (AUC total =0.96).Model errors were located in boundary areas (AUC boundary area = 0.92) between adjacent vegetation types.A proportion of these errors could be attributed to high similarities between neighboring grid cells.These incorrect predictions were generally based on low probabilities of occurrence of several similar vegetation types.Furthermore, the random forest distribution model cannot be applied beyond the local conditions upon which it was constructed, because realized niches of species/vegetation types do seldom coincide, even between apparently similar sites.This restricts the model's applicability.In order to make it operational on a larger scale many data would be needed, ranging over the entire ecological amplitude of the modelled attributes.Finally, gradual reductions in model complexity were analysed.Based on these results, a significant reduction of the ecohydrological monitoring scheme could be proposed for a similar groundwaterdependent ecosystem.The random forest distribution model made a reasonably accurate classification (AUC total =0.94) when constructed on spatially distributed measurement of five easily measured environmental variables only.

A3 Variable importance
The random forest algorithm can estimate the importance of each variable by using the variable importance measure.Defining variable importances is done by looking at how much the oob error increases when oob data are permuted for one variable while left unchanged for all others.The calculation procedure goes as follows: (i) For i = 1 to k do (grow a random forest consisting of k classification trees): (1) apply tree i to the n oob elements and count the number of correct classifications over the n oob elements (C i,untouched ); (2) for j = 1 to p (with p the total number of variables) do: (a) take the n untouched oob elements; (b) randomly permute the values of variable j in the n oob elements; (c) apply tree i to all the j permuted oob elements; (d) count the number of correct classifications (C i,j −permuted ); (e) subtract the number of correct classifications of the variable-j -permuted oob elements from the number of correct classifications of the untouched oob elements and divide by the number of oob elements ( C i,j = (C i,untouched − C i,j −permuted )/n); The results from these iterations are p (number of variables, j =1 to p) groups of k (number of trees, i=1 to k) C i,j values.Since trees are independent, correlations among the C i,j values within the p groups are generally low.Finally: (ii) For each of the j = 1 to p groups, the mean C i,j over all i=1 to k trees is calculated ( C j = k i=1 C i,j /k).The value C j × 100 is referred to as the "mean importance score" of variable j .The value is positive when C i,untouched >C i,j −permuted and negative when C i,untouched <C i,j −permuted .Mean importance scores have high values when the classification error increases by permuting the values of variable p.
(iii) Since correlations of the C i,j scores are generally low within the j =1 to p groups, standard errors can be calculated for each of the j groups of i=1 to k C i,j scores.Divide C j by the standard error to obtain a z-score for variable j , and assign a significance level assuming normality.

Fig. 1 .
Fig. 1.Conceptual model illustrating the relationships between hydrology, the physicochemical environment and vegetation at the local scale.Legend: full arrows indicate direct effects, broken arrows indicate vegetation feedbacks, and rounded squares and bent arrows indicate exogenous disturbances.Figure adapted from Franklin (1995); Baird and Wilby (1999); Mitsch and Gosselink (2000).

Fig. 2 .
Fig. 2. The Doode Bemde is situated in the valley of the river Dijle.A detailed overview of the topography and the vegetation distribution at the site are shown.The positions of 5 (A-E) piezometers located along a topographical transect are symbolized by •.
Fig. 3. (a)Time series of the groundwater depth, as monitored by piezometers A-E along a topographic transect (see Fig.2).(b) Hydrological duration lines expressing the probability that measured groundwater depths are exceeded.The line colours correspond to the vegetation types wherein these piezometers were installed (see Fig.2).
Fig. 4. (a) Observed vegetation types overlaid by the classification made by the random forest distribution model.(b) Modelled probabilities (P (c j ) max ) on which the classification is based.

Fig. 5 .
Fig. 5. Receiver operating characteristic (ROC) curves visualizing the classification performances of the 3-fold cross-validated random forest distribution model for the 7 vegetation types (full curves).The AUC total equals 0.96.Model performances for boundary cells only are summarized by the dashed ROC curves, yielding an AUC total value of 0.92.

Fig. 8 .
Fig. 8. (a) All variables ranked according to their importance as calculated with the variable importance measure (Appendix A3).M stands for management regime, S represents the variable soil type, and MGD the mean groundwater depth.(b) Oob error of random forest distribution models constructed on data sets with reduced complexity.The model containing only the most important variable (MGD) has an oob error of 65.5%.The oob error decreases gradually when more variables are included.(c) Summarizing table of model performances: accuracy, Cohen's κ and AUC values associated with a decreasing number of variables included.

Table 1 .
Summary of the vegetation types: abbreviation, name, short description and area.

Table 2
for the seven different vegetation types.The values of the diagonal elements in Table2are a measure of similarity between grid cells of the same vegetation type.Based on these values, patches of Phragmitetalia, Magnocaricion with Phragmites and Magnocaricion can be concluded to be more homogeneous in species composition compared to the other vegetation types which have lower values.Between the different vegetation types, marked differences in similarity can be observed.Magnocaricion with Phragmites has high similarities with Phragmitetalia and Magnocaricion.Between the other vegetation types, similarities are generally lower, but nevertheless differences can be observed.Arrhenatherion for example, has twice as much species in common with Filipendulion than with Magnocaricion.

Table 2 .
Jaccard index of similarity between the vegetation types in the Doode Bemde.

Table 3 .
Confusion matrix of the classification made by the random forest distribution model.Predicted vegetation types are compared with the observations at the Doode Bemde.

Table 4 .
Jaccard index of similarity for boundary grid cells between two vegetation types at the Doode Bemde.Non-adjacent vegetation types are indicated by -.Conceptual representation of realised niches of Calthion palustris at the Doode Beemde and Snoekengracht.The fundamental niche of Calthion palustris ranges over all environmental states which would permit to Calthion palustris to exist indefinitely