Classification and flow prediction in a data-scarce watershed of the equatorial Nile region

Continued development and investigation of flow prediction methods are of interest in watershed hydrology, especially where watercourses are poorly gauged and data are scarce, as in most parts of Africa. This paper reports on two approaches to generate local monthly runoff for the data-scarce Semliki watershed. The Semliki River is part of the upper drainage of the Albert Nile. With an average annual local runoff of 4.622 km³, the Semliki watershed contributes up to 20 % of the flows of the White Nile. The watershed was subdivided into 21 sub-catchments (S3 to S23). Monthly runoff was estimated using eight physiographic and meteorological variables generated from remotely sensed datasets and limited catchment data. One ordination technique, the Principal Component Analysis (PCA), and a tree cluster analysis of the landform attributes were performed to study the data structure and identify physiographic similarities between sub-catchments. The PCA revealed the existence of two major groups of sub-catchments: flat (Group I) and hilly (Group II). Linear and nonlinear regression models were used to predict the long-term monthly mean discharges for the two groups of sub-catchments, and their performance was evaluated by the Nash-Sutcliffe efficiency (NSE), the percent bias (PBIAS) and the ratio of the root mean square error to the standard deviation of the observations (RSR). These dimensionless indices indicate that the nonlinear model predicts the flows better than the linear one.


Introduction
Numerous approaches exist for streamflow prediction in natural river reaches. Streamflow forecasting is of significant interest from both a research and an operational point of view. The choice of method depends on data availability and the type of application. While continuous research efforts strive to enhance our predictive capability for streamflow, we are often faced with the challenge of making such predictions in basins that are poorly gauged or not gauged at all (Sivapalan et al., 2003). Reliable and accurate estimates of hydrologic components are not only important for water resources planning and management but are also increasingly relevant to environmental studies (Schröder, 2006). Several studies have reported on the use of catchment descriptors and regionalization of parameters for flow prediction in ungauged basins. Among the most recent studies are those of Sefton and Howard (1998), Mwakalila (2003), Xu (2003), Merz and Blöschl (2004), McIntyre et al. (2005), Sanborn and Bledsoe (2006), Yadav et al. (2007), Sharda et al. (2008), Kwon et al. (2009) and Shao et al. (2009). In their comparison of linear regression with artificial neural networks, Heuvelmans et al. (2006) indicated the need for a well-informed choice of physical catchment descriptors as a first condition for successful parameter regionalization. Cheng et al. (2006) reported on the importance and usefulness of parsimonious models for runoff prediction in data-poor environments, as these models are characterized by a small number of parameters. Many authors have also identified the reduction of uncertainty associated with predictions in ungauged basins as being very important (Uhlenbrook and Siebert, 2005; Koutsoyiannis, 2005a, b; Zhang et al., 2008). Lately, Koutsoyiannis et al. (2008) indicated the use of analogue modeling techniques for flow prediction, which give impressive performance due to advances in nonlinear dynamical systems (chaotic systems). The major drawbacks of these approaches are that they are data intensive and work as black boxes, and thus provide no insight into the hydrological processes. This paper reports on linear and nonlinear regression modeling approaches for flow prediction in a medium-size watershed of the equatorial Nile region (the Semliki catchment), where very little hydro-meteorological data are available. These approaches attempt to provide monthly flow estimates that relate similar catchment physiographic attributes to generated flows.

Study area
These analyses are conducted within the Semliki watershed of the equatorial Nile region (Fig. 1). The catchment studied covers an estimated area of 7699 km². The Semliki drains the basins of Lakes Edward and George, and a contributing area downstream that includes the western slopes of the Ruwenzori range. The watershed receives an average rainfall of 1245 mm per annum, with peaks occurring in May (95 mm/month) and October (205 mm/month). An average annual local runoff of 4.622 km³ has been estimated from records at Bweramule (Sutcliffe and Parks, 1999). The terrain ranges from flat areas to ice-capped mountains that rise to 4862 m above sea level. The flora and fauna of the watershed constitute one of the unique and distinct ecosystems of the Albertine Rift region. The vegetation predominantly comprises medium-altitude moist evergreen to semi-deciduous forest. Five distinct vegetation zones have been documented on Mount Ruwenzori, and they change with altitude. Detailed information on landscape physiographic attributes is reported in Kileshye Onema and Taigbenu (2009).

Data and methods
The landscape of any catchment is made up of several combinations of physiographic attributes. These combinations are grouped into two categories using the principal component analysis (PCA) and cluster analysis. On the assumption that the physiographic variables are stationary, a quasi-temporal prediction of the flow was achieved with two additional variables (NDVI and rainfall) that vary in both space and time. These two latter variables were generated for the catchment from remotely sensed data. The physiographic characteristics of the sub-catchments were described by six variables, namely the stream length ($x_1$), the drainage density ($x_2$), the mean stream slope ($x_3$), the maximum elevation ($x_4$), the minimum elevation ($x_5$), and the weighted average elevation ($x_6$). The statistical properties of these physiographic variables are presented in Table 1 and are used subsequently in the grouping of the sub-catchments.

In predicting monthly flows in these sub-catchments, it is assumed that the physiographic variables are stationary, that is, constant over time but not over space. Two additional variables generated for these sub-catchments are the monthly rainfall ($x_7$) and the NDVI ($x_8$), which vary in both space and time. Further details on these two variables can be found in Kileshye Onema and Taigbenu (2009). At the outlet of the Semliki watershed is a gauging station located at Bweramule (Fig. 2), which provided historical monthly flows from 1950 to 1978 that were used for the calibration of both the linear and the nonlinear flow prediction models described subsequently in this paper. The periods of available data on the climate-related variables, namely rainfall and NDVI, that can be considered causal to the observed streamflow are presented in Fig. 3. The figure reveals the challenge posed in modelling the flow of this catchment: apart from the lack of flow data for the tributaries of the Semliki River and of climatic variables for the 21 sub-catchments, the only available point rainfall record, from Beni, is not representative of the entire catchment and overlaps the record of monthly flow measurements at Bweramule for only six years, from 1973 to 1978. Our problem statement therefore precludes the use of regionalization techniques based on flow duration curves (FDCs), which can be constructed by statistical, parametric or graphical approaches (Castellarin et al., 2004; Quipo et al., 1983).

For each sub-catchment, a historical 28-yr monthly mean volume was computed in proportion to the sub-catchment area and was labeled as "control". This historical approach is similar to the one undertaken by Asadullah et al. (2008). [...] to arid and semi-arid regions, "stream flows increase in the downstream direction, and the spatial distribution of average monthly or seasonal rainfall is more or less the same from one part of the river basin to another, hence the runoff per unit land area is assumed constant over space. In these situations, estimated flows are usually based on the watershed areas, as contributing flow to those sites, and the corresponding streamflows and watershed areas above the nearest or most representative gauge sites" (Loucks et al., 1981; Loucks and Van Beek, 2005). Furthermore, rainfall in the equatorial region, one of the main driving forces behind runoff generation, has not changed in a statistically significant way since 1950 despite reported seasonal and interannual variations (Nicholson and Entekhabi, 1987; Bigot et al., 1998; Lienou et al., 2008).
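The area-proportional computation of the "control" values can be illustrated with a minimal sketch. It assumes that the flow volume at the outlet is simply split among sub-catchments in proportion to their areas (constant runoff per unit area). The function name `control_flows`, the sub-catchment areas and the outlet volume are hypothetical and are not values from this study.

```python
def control_flows(outlet_flow, areas):
    """Split an outlet flow volume among sub-catchments in proportion to area.

    Assumes constant runoff per unit land area over the whole watershed.
    """
    total_area = sum(areas.values())
    return {name: outlet_flow * area / total_area for name, area in areas.items()}


# Hypothetical areas (km^2) for three sub-catchments; not data from the paper.
areas = {"S3": 200.0, "S4": 300.0, "S5": 500.0}

# Hypothetical long-term mean volume for one month at the outlet (10^6 m^3).
outlet_monthly_mean = 400.0

control = control_flows(outlet_monthly_mean, areas)
for name, volume in control.items():
    print(f"{name}: {volume:.1f}")
```

By construction the sub-catchment volumes add up to the outlet volume, which is the property the proportional allocation relies on.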
Table 2 lists the landscape attributes, their sources and the software used, as a guide for any further work that follows the approach taken in this paper.
The data generated from the physiographic variables of the 21 sub-catchments were examined to explore their groupings and similarities in their data structure. To this end, the principal component analysis (PCA) and cluster analysis were used as exploratory techniques. The PCA performed in this study used the scale-invariant correlation matrix of the variables as opposed to the covariance matrix. The scaled variables are obtained by the expression

$$X_i = \frac{x_i - \mu_i}{\sigma_i} \qquad (1)$$

where $X_i$ is the scaled variable of the original variable $x_i$, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the original variable.
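The scaling of a variable to zero mean and unit standard deviation can be sketched as follows. The elevation values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def standardize(values):
    """Return (x - mean) / std for each value of one variable."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation: an assumption
    return [(v - mu) / sigma for v in values]


# Hypothetical maximum elevations (m) of five sub-catchments; not data from the paper.
max_elevation = [950.0, 1400.0, 2300.0, 3100.0, 4800.0]

scaled = standardize(max_elevation)
print([round(v, 3) for v in scaled])
print(round(statistics.mean(scaled), 6), round(statistics.stdev(scaled), 6))
```

After scaling, every variable has the same spread, so a variable measured in metres no longer dominates one measured as a dimensionless ratio.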
The Gleason-Staelin redundancy statistic and Bartlett's sphericity test were performed as prerequisites to the PCA. The Gleason-Staelin redundancy statistic (φ) measures the level of interrelation within a group of variables (Magingxaa et al., 2009), with a value of zero indicating uncorrelated variables and a value of unity indicating perfect correlation between the variables (Jolliffe, 2002). Bartlett's sphericity test is used to test the null hypothesis that the variables of the correlation matrix are not correlated (Sousa et al., 2007). When the p-value of Bartlett's test is greater than 0.05, it is not prudent to proceed with the PCA (Peres-Neto et al., 2005). The Kaiser criterion was used to identify the number of principal components that account for significant variance in the data (Jolliffe, 2002). The general rule associated with this criterion is to retain principal components with eigenvalues greater than 1 (Chen and Chen, 2003), and to consider those with eigenvalues less than 1 as trivial.
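A minimal sketch of the redundancy statistic is given below. It assumes the commonly quoted form φ = sqrt((ΣΣ r_ij² − p) / (p(p − 1))), where r_ij are the entries of the p × p correlation matrix. This form is not stated in the text and is used here only because it reproduces the two limiting cases described above. The matrices are illustrative, not those of this study.

```python
import math


def gleason_staelin(corr):
    """Redundancy statistic of a p x p correlation matrix (assumed formula)."""
    p = len(corr)
    sum_sq = sum(r * r for row in corr for r in row)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, (sum_sq - p) / (p * (p - 1))))


# Illustrative 3 x 3 correlation matrices.
uncorrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
perfect = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
moderate = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]

print(gleason_staelin(uncorrelated))
print(gleason_staelin(perfect))
print(gleason_staelin(moderate))
```

For a matrix whose off-diagonal entries all equal r, this form reduces to |r|, which gives a feel for the magnitude of a reported φ.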
The monthly local runoff of the catchment was predicted by linear and nonlinear regression models that make use of the physiographic and meteorological variables. For the linear model,

$$Q = a_0 + A^T x = a_0 + \sum_{i=1}^{8} \alpha_i x_i \qquad (2)$$

where $Q$ represents the monthly flows, the superscript $T$ denotes the transpose, $a_0$ and $A^T = \{\alpha_1, \alpha_2, \ldots, \alpha_8\}$ are the regression parameters, and $x^T = \{x_1, x_2, \ldots, x_8\}$. The expression for the flow by the nonlinear model is given by Eq. (3) [...].

To evaluate the performance of these two models, the Nash-Sutcliffe efficiency (NSE), the percent bias (PBIAS) and the RMSE-observation standard deviation ratio (RSR) are used. NSE is a normalized index that defines the relative magnitude of the residual variance compared to the variance of the measured data, and is expressed as (Nash and Sutcliffe, 1970)

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N} \left(Q_i^{obs} - Q_i^{sim}\right)^2}{\sum_{i=1}^{N} \left(Q_i^{obs} - Q^{mean}\right)^2} \qquad (5)$$

where $Q_i^{obs}$ is the $i$th observed flow, $Q^{mean}$ is the mean of the observed flows, $Q_i^{sim}$ is the $i$th simulated flow, and $N$ is the number of observations. NSE ranges between $-\infty$ and 1, with an optimal value of 1. Negative values of NSE indicate poor performance of the model, suggesting that the observed mean is a better predictor of the flows than the model (Moriasi et al., 2007). The PBIAS is expressed as

$$\mathrm{PBIAS} = \frac{\sum_{i=1}^{N} \left(Q_i^{obs} - Q_i^{sim}\right) \times 100}{\sum_{i=1}^{N} Q_i^{obs}} \qquad (6)$$

The optimal value of PBIAS is 0, with low-magnitude values indicating good predictive ability of the model. Positive values indicate underestimation of the observed flows by the model, and negative values indicate overestimation (Gupta et al., 1999). The RMSE-observation standard deviation ratio (RSR) is an error statistic that normalizes the RMSE by the standard deviation of the observed flows. It is expressed as

$$\mathrm{RSR} = \frac{\mathrm{RMSE}}{\mathrm{STDEV}_{obs}} = \frac{\sqrt{\sum_{i=1}^{N} \left(Q_i^{obs} - Q_i^{sim}\right)^2}}{\sqrt{\sum_{i=1}^{N} \left(Q_i^{obs} - Q^{mean}\right)^2}} \qquad (7)$$

The RSR thus retains the benefits of the RMSE, one of the most frequently used error index statistics, in a normalized form. Excellent model performance produces a value of RSR or RMSE of zero (Moriasi et al., 2007).
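The three evaluation statistics can be computed with a few lines of code. The sketch below follows the standard definitions of NSE, PBIAS and RSR as given by Moriasi et al. (2007). The observed and simulated flows are made-up numbers used only to exercise the functions.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - residual variance / observed variance."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread


def pbias(obs, sim):
    """Percent bias; positive values mean the model underestimates the flows."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def rsr(obs, sim):
    """RMSE divided by the standard deviation of the observations."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(residual) / math.sqrt(spread)


# Made-up observed and simulated monthly flows (arbitrary units).
observed = [10.0, 20.0, 30.0, 40.0]
simulated = [12.0, 18.0, 33.0, 37.0]

print(round(nse(observed, simulated), 3))
print(round(pbias(observed, simulated), 3))
print(round(rsr(observed, simulated), 3))
```

With these definitions RSR² = 1 − NSE, so the two indices rank models identically, while PBIAS adds information on the direction of the error.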

Principal Components Analysis (PCA)
The scaled physiographic and meteorological variables, expressed by Eq. (1), are incorporated into the PCA. In this way the correlation matrix of the variables is scale invariant. The correlations between the variables are summarized in Table 3. Italicized correlation coefficients are significant at the 95 % confidence level. There are some high correlations (greater than 0.5), implying that there is a correlation structure that can potentially be modeled or further explored using the PCA. The Gleason-Staelin redundancy test yields a value of φ = 0.395, suggesting that there is considerable complexity in the data of these variables, which warrants further examination using the PCA. The p-value from Bartlett's sphericity test is reported as 0.00000, which indicates that the null hypothesis can be rejected and that the relationship among the variables is strong enough to warrant carrying out a PCA. The eigenvalues of the principal components are presented in Table 4. Application of the Kaiser criterion leads to the retention of the first three principal components, which account for 76 % of the variation in the data. The factor loadings of these three principal components reflect the contributions and roles of the variables in the correlation of the data. The eigenvectors of the three loading factors are presented in Table 5. The factor loadings are the correlations between the variables and the factors. Factor 1 showed the best correlation with the maximum elevation and the average elevation, whereas factor 3 showed the best correlation with the average slope and the minimum elevation. Figure 4 illustrates the projection of the variables on a factor plane using an alternative criterion for the PCA of the variables. Each quadrant represents a similar group of variables.

Further use of the PCA is made to identify groupings of sub-catchments by assessing the projection of cases onto the factor plane (Fig. 5). The figure establishes four groups of sub-catchments that are further reduced to two main groups or categories (Table 6) so as to simplify the prediction equations of the runoff, which is the main goal of this paper. The physiographic attributes that provide this major categorization from the PCA, namely mean stream slope, minimum elevation, maximum elevation and weighted average elevation, are located in the first quadrant of the factor plane onto which the variables are projected. The results of the tree cluster analysis show a similar trend to that of the PCA, except for a discrepancy in the grouping of S21, S22 and S23 (Fig. 6).
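The PCA workflow described in this section can be sketched with NumPy: eigen-decomposition of the correlation matrix, retention of components by the Kaiser criterion, computation of loadings, and projection of cases. The data are synthetic and randomly generated, and all array names are illustrative, so the output does not reproduce Tables 4 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 21 cases (sub-catchments) by 6 variables,
# built from one shared latent factor plus noise so that variables correlate.
n_cases, n_vars = 21, 6
latent = rng.normal(size=(n_cases, 1))
data = latent @ rng.normal(size=(1, n_vars)) + 0.5 * rng.normal(size=(n_cases, n_vars))

# PCA on the correlation matrix (equivalent to PCA on the scaled variables).
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; sort them in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Kaiser criterion: keep components with eigenvalue greater than 1.
keep = eigvals > 1.0
explained = eigvals / eigvals.sum()

# Loadings: correlations between the variables and the retained components.
loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Projection of the cases onto the retained components.
scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scores = scaled @ eigvecs[:, keep]

print("eigenvalues:", np.round(eigvals, 3))
print("components retained:", int(keep.sum()))
print("variance explained by retained components:", round(float(explained[keep].sum()), 3))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 marks a component that carries more variance than a single original variable, which is the rationale for the Kaiser rule.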

Linear and non-linear model simulation results
As indicated earlier, the six physiographic variables are assumed stationary while the two meteorological variables vary in space and in time (at a monthly time step). The regression parameters of the models are evaluated using Eqs. (2) and (3), from which estimates of the monthly local runoff are obtained. The predicted flows from the two models are of an order of magnitude comparable to the estimates provided by Senay et al. (2009) in their attempt to document the overall basin dynamics of the Nile River. Typical mean monthly flow hydrographs for the month of April (considered a wet month) from 2001 to 2007 are presented in Fig. 7 for group I sub-catchments and in Fig. 8 for group II sub-catchments. "Control" values for flows were computed as reported under Sect. 3 of this paper. The values of the performance error statistics (NSE, PBIAS and RSR) presented in Table 7 for the predicted monthly flows in the sub-catchments indicate very satisfactory performance by both models, though the nonlinear model performs better than the linear one, especially in group I of sub-catchments (Figs. 9 and 10).
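As an illustration of how the parameters of a linear model with an intercept and eight predictors might be estimated, the sketch below uses ordinary least squares on synthetic data. The fitting procedure is an assumption, since the text does not state how the regression parameters were obtained, and the predictors, coefficients and noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors: 60 monthly records by 8 scaled variables (invented).
n_obs, n_vars = 60, 8
X = rng.normal(size=(n_obs, n_vars))

# Invented "true" parameters used only to generate the synthetic flows.
true_a0 = 2.0
true_alpha = np.arange(1, n_vars + 1) / 10.0
Q = true_a0 + X @ true_alpha + 0.01 * rng.normal(size=n_obs)

# Ordinary least squares: prepend a column of ones for the intercept a0.
design = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(design, Q, rcond=None)
a0_hat, alpha_hat = coef[0], coef[1:]

# Flows predicted by the fitted linear model.
Q_pred = design @ coef

print("intercept:", round(float(a0_hat), 3))
print("slopes:", np.round(alpha_hat, 3))
```

With only a few years of overlapping records, as in this catchment, the number of observations per fitted parameter is small, which is one reason the evaluation statistics should be read together rather than in isolation.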

Conclusions
This study reports on the use of two modelling approaches for the prediction of monthly flows in the data-scarce Semliki watershed of the equatorial Nile. The principal component analysis was carried out to identify the variables that explained most of the variance in the dataset, which comprised six physiographic and two climate-related variables. Similar sub-catchments were grouped into two categories on the basis of their physiographic attributes, and monthly runoff was then generated by the linear and nonlinear regression models. The dimensionless statistics (NSE, PBIAS and RSR) for the two models indicate that the nonlinear model outperforms the linear one, especially in the first group of sub-catchments, which is characterised by flatter terrain. Whereas these two mesoscale, deterministic models do not directly address the hydrological processes in the catchment, they provide a statistical relationship between landscape attributes and monthly flows. The order of magnitude of these estimates, which relate similar physiographic attributes to monthly flows, can be useful for a preliminary assessment of the water resources of a very poorly gauged catchment like the Semliki. The results of the current work could subsequently be complemented by a physically based hydrological approach that explicitly accounts for the interaction between landscape attributes and water resources in the Semliki watershed, but that would still require good hydro-meteorological data.

Figure 3: Periods of available causal climate-related data (rainfall and NDVI) and flow.
The remotely sensed NDVI data covered the period from 1982 to 2008 and were extracted from decadal maximum composite imagery provided by the National Oceanic and Atmospheric Administration Advanced Very High Resolution Radiometer (NOAA-AVHRR) and processed with the image display and analysis software WinDisp 5.1. The satellite-derived rainfall data were provided by the National Oceanic and Atmospheric Administration (NOAA) through the Famine Early Warning System Network (FEWS-Net) for the period 2001-2007, and were refined using the only point daily rainfall record in the catchment, obtained from a station located at Beni, which covered the period from 1973 to 2008. Additional details on the processing of the remotely sensed datasets are given in Kileshye Onema and Taigbenu (2009).

Figure 5: Projection of cases from the PCA of variables.

Table 1. Statistics of physiographic variables.

Table 3. Coefficients of correlations between variables.

Table 4. Eigenvalues of components.

Table 5. Eigenvectors of principal components.

Table 6. Grouping of Semliki sub-catchments from the PCA.