Structural break or long memory: an empirical survey on daily rainfall data sets across Malaysia

A short memory process that encounters occasional structural breaks in mean can show a slower rate of decay in the autocorrelation function and other properties of fractionally integrated I(d) processes. In this paper we employed the semiparametric procedure for estimating the fractional differencing parameter proposed by Geweke and Porter-Hudak (1983) to analyse nine daily rainfall data sets across Malaysia. The results indicate that all the data sets exhibit long memory. Furthermore, an empirical fluctuation process using the ordinary least squares (OLS)-based cumulative sum (CUSUM) test for the break date was applied. Break dates were detected in all data sets. The data sets were partitioned according to their respective break dates, and a further test for long memory was applied to all subseries. Results show that all subseries follow the same pattern as the original series. The estimates of the fractional parameters d1 and d2 on the subseries obtained by splitting the original series at the break date confirm that there is long memory in the data generating process (DGP). This evidence therefore points to true long memory that is not due to structural breaks.


Introduction
Recent analyses have claimed that the possible presence of non-stationarity is produced by either a trend or long-term cyclic fluctuations. However, it is well known that a reliable assessment of the presence of non-stationarity in hydrological records is not an easy task because of the limited length of the available data sets. This makes it difficult to distinguish between non-stationarity, sample variability and long-term climatic fluctuations (Brath et al., 1999).
Studies by Cheung (1993) and Diebold and Inoue (2001) have shown that there is a bias in favour of finding long memory processes when structural breaks are not accounted for in a time series. Observed long memory behaviour can be due to neglected structural breaks. The presence of breaks in a time series and/or of long memory behaviour in the break-free series could indicate whether the series shows real breaks or long memory. Further insight can be obtained by estimating the differencing parameter on the subseries identified by splitting the original series according to the estimated break dates: in the case of erroneous long memory identification, the subseries are expected to show short memory.
If a time series has one or more structural breaks, then the series has one or more discontinuities in the data-generating process (DGP). In this case a structural break method will report a number of breaks which will divide the series into regimes, which are different subpopulations. The statistical properties of these subpopulations within the regimes will need to be estimated. The estimated differences will be the result of actual differences between the samples. The task of estimating the break dates can be accomplished within the framework of least squares regression (Rea et al., 2009).
In the last decade, a lot of attention has been paid to the issue of confusing long memory and occasional structural breaks in mean (see among others Diebold and Inoue, 2001; Granger and Hyung, 2004; and Smith, 2005). Indeed, there is evidence that a stationary short memory process that encounters occasional structural breaks in mean can show a slower rate of decay in the autocorrelation function and other properties of fractionally integrated (I(d)) processes (Cappelli and Angela, 2006). Therefore, a time series with structural breaks can generate a strong persistence in the autocorrelation function, which is an observed behaviour of a long memory process. Conversely, long memory processes may cause breaks to be detected spuriously. In this paper, we use the term true long memory to refer to fractionally integrated series.
The literature on tests to distinguish between true long memory and various spurious long memory models has been growing steadily. For example, Berkes et al. (2006) and Shao (2011) proposed a testing procedure to discriminate a stationary long memory time series from a short-range dependent time series with change points in the mean. Their null hypothesis corresponds to changes in the mean, and their alternative is that the series is stationary with long memory. The test statistic is a modification of the cumulative sum (CUSUM-type) test, which is quite popular in the literature on change point detection (e.g. Galeano, 2007; Shao and Zhang, 2010).
Since the work of Hurst (1951), which detected the presence of long-term persistence in the minimum annual flow series of the Nile River, many studies have tested for and modelled long memory in hydrological processes. For example, Montanari et al. (1997) studied the monthly and daily inflows of Lake Maggiore, Italy, using a fractional autoregressive integrated moving average (ARIMA) model. Rao and Bhattacharya (1999) studied the memory of four hydrologic time series in the Midwestern United States by testing the null hypothesis that there is only short-term memory in the series using a modified version of the rescaled range method. The main conclusion from that study is that there is little evidence of long-term memory in monthly hydrologic series. However, for annual series, the evidence for lack of long-term memory is inconclusive, mainly because the number of observations is small and the power of the test based on the modified rescaled range is low with small samples. Koutsoyiannis (2002) proposed a simple explanation of the Hurst phenomenon based on the fluctuation of a hydrological process at different temporal scales. The stochastic process that was devised to represent the Hurst phenomenon, i.e. the fractional Gaussian noise, was also studied, and based on its properties three simple and fast methods to generate fractional Gaussian noise were proposed. Wang et al. (2005) analysed two daily stream flow series of the Yellow River and found that both daily stream flow processes exhibit strong long memory. In a subsequent study, Wang et al. (2007) applied four methods, namely Lo's modified rescaled adjusted range test (R/S test), the GPH test (Geweke and Porter-Hudak, 1983) and two approximate maximum likelihood estimation (MLE) methods, i.e. Whittle's estimator (W-MLE) and another one implemented in S-Plus (S-MLE), to the daily average discharge series recorded at 31 gauge stations with different drainage areas in eight river basins in Europe, Canada and the USA to detect the existence of long memory. The results show that 29 out of the 31 daily series exhibit long memory as confirmed by at least three methods, whereas the other two series are indicated to have long memory by two methods. Gil-Alana (2012) analysed UK monthly rainfall data from a long-term persistence viewpoint using different modelling approaches, taking into account the strong dependence and the seasonality in the data. The results indicate that the most appropriate model is the one that presents cyclical long-run dependence, with the order of integration being positive though small and the cycles having a periodicity of about a year. Other applications of long memory models to hydrological time series can be found in Hosking (1984), Koutsoyiannis (2003, 2005a, b, 2006), Koscielny-Bunde et al. (2006), Rybski et al. (2006), Langousis and Koutsoyiannis (2006), and Mudelsee (2007).
Several authors have done a reasonably good job of testing for long memory in hydro-meteorological time series; however, is it really a long memory process or a short memory process with structural breaks? This is the main concern of this paper. In order to gain insight into "which is which", we propose fitting long memory and structural break models separately, following Cappelli and Angela (2006). In case both provide a plausible explanation of the DGP of the data at hand, the long memory and structural break analyses are repeated on the break-free series and on the filtered series, respectively. However, in Malaysia and other tropical regions, little has been done in this area.
Heavy rainfall can bring disasters such as floods and landslides. A shortage of rainfall can also affect the water management system in ways that cause problems for economic activities. Therefore, there is a need to investigate the characteristics of a country's rainfall intensively and comprehensively. Modelling of daily rainfall using various mathematical models has been done throughout the world to give a better understanding of the rainfall pattern and its characteristics (e.g. Suhaila and Abdul Aziz, 2008).
The purpose of this paper is to apply a simple strategy to study whether the daily rainfall data sets of nine weather stations across Malaysia exhibit true long memory behaviour, since the presence of long memory in hydrological time series is a well-known phenomenon and, on the other hand, the presence of structural breaks in these series represents a relevant environmental issue.
The long memory or long-term dependence property describes the high-order correlation structure of a series. Thus, autocorrelation provides initial information relevant to the internal organisation of each set of time series data (Mahdi and Petra, 2002). The prevalence of autocorrelation in a data series is also an indication of persistence in the series of observations. The autocorrelation coefficients provide an essential hint as to whether forecasting models can be developed based on the given data (Janssen and Laatz, 1997). If a series exhibits long memory, there is persistent temporal dependence even between distant observations; such series are characterized by distinct but non-periodic cyclical patterns. Fractionally integrated processes can give rise to long memory (Alptekin, 2006). Fractional integration is part of the larger classification of time series commonly referred to as long memory models. Long memory models address the degree of persistence in the data. In empirical modelling of long memory processes, the autoregressive fractionally integrated moving average (ARFIMA) model proposed by Granger and Joyeux (1980) and Hosking (1981) is used.

Statistical model
A time series process $\{X_t,\ t = 0, \pm 1, \ldots\}$ is said to be (covariance) stationary if the mean and the variance do not depend on time, and the covariance between any two observations depends on the temporal distance between them but not on their specific location in time. This is a minimal requirement in time series analysis to make statistical inference.
Given a zero-mean covariance-stationary process $\{X_t,\ t = 0, \pm 1, \ldots\}$ with autocovariance function $\gamma_\mu = E(X_t X_{t+\mu})$, we say that $X_t$ is integrated of order zero (denoted by $I(0)$) if

$$\sum_{\mu=-\infty}^{\infty} |\gamma_\mu| < \infty. \tag{1}$$

If a time series is non-stationary, one possibility for transforming the series into a stationary one is to take first differences of the series, such that

$$(1 - B) X_t = \mu_t, \tag{2}$$

where $B$ is the lag operator ($B X_t = X_{t-1}$) and $\mu_t$ is $I(0)$. In such a case, $X_t$ is said to be integrated of order 1, denoted $I(1)$. Likewise, if two differences are required, the series is integrated of order 2, denoted $I(2)$. If the number of differences required to get $I(0)$ stationarity is not an integer value but a fractional one, the process is said to be fractionally integrated, or $I(d)$, with

$$(1 - B)^d X_t = \mu_t, \tag{3}$$

and $\mu_t$ equal to $I(0)$. The expression on the left-hand side of Eq. (3) can be presented in terms of a binomial expansion, such that, for all real $d$,

$$(1 - B)^d = \sum_{j=0}^{\infty} \binom{d}{j} (-1)^j B^j = 1 - dB + \frac{d(d-1)}{2} B^2 - \cdots .$$

Therefore, Eq. (3) can be written in the following form:

$$X_t = d X_{t-1} - \frac{d(d-1)}{2} X_{t-2} + \cdots + \mu_t .$$

If $d$ is a positive integer value, $X_t$ will be a function of a finite number of past observations, while if $d$ is not an integer, $X_t$ depends strongly upon values of the time series far in the past (e.g. Granger and Ding, 1996; Dueker and Asea, 1995). Moreover, the higher the value of $d$, the higher will be the level of association between the observations (Gil-Alana, 2009).
The parameter $d$ plays an important role from a statistical viewpoint. Thus, if $-0.5 < d < 0.5$, $\mu_t$ is a stationary and ergodic process with a bounded and positively valued spectrum at all frequencies. One important class of processes occurs when $\mu_t$ is $I(0)$ and is covariance stationary. For $0 < d < 0.5$, the process exhibits long memory in the sense of Eq. (1); its autocorrelations are all positive and decay at a hyperbolic rate. For $-0.5 < d < 0$, the sum of absolute values of the process autocorrelations tends to a constant, so that it has short memory according to Eq. (1). In this situation the ARFIMA(0, d, 0) process is said to be anti-persistent or to have intermediate memory, and all its autocorrelations excluding lag zero are negative and decay hyperbolically to zero. As $d$ increases beyond 0.5 and through 1 (the unit root case), $X_t$ can be viewed as becoming "more non-stationary" in the sense, for example, that the variance of the partial sums increases in magnitude. This is also true for $d > 1$ (Gil-Alana, 2009).
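The binomial expansion of $(1 - B)^d$ can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the function names `frac_diff_weights` and `frac_diff` are hypothetical, and the filter is truncated at the start of the sample (pre-sample values are assumed to be zero).

```python
import numpy as np

def frac_diff_weights(d, n_terms):
    """Coefficients pi_j of (1 - B)^d = sum_j pi_j B^j.

    Uses the recursion pi_0 = 1, pi_j = pi_{j-1} * (j - 1 - d) / j,
    which reproduces 1, -d, d(d-1)/2, ... from the binomial expansion.
    """
    w = np.empty(n_terms)
    w[0] = 1.0
    for j in range(1, n_terms):
        w[j] = w[j - 1] * (j - 1 - d) / j
    return w

def frac_diff(x, d):
    """Apply the truncated filter (1 - B)^d to a finite series x.

    Assumption: observations before the start of the sample are zero.
    """
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    # y_t = sum_{j=0}^{t} pi_j * x_{t-j}
    return np.array([np.dot(w[: t + 1], x[t::-1]) for t in range(len(x))])
```

For integer d the weights vanish after lag d, so only a finite number of past observations enter, whereas for fractional d they decay slowly and never become exactly zero, which is the dependence on the distant past described above.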

Test and estimation of order of integration
There exist several procedures for estimating the fractional differencing parameter in semiparametric contexts. Of these, the log-periodogram regression estimate proposed by Geweke and Porter-Hudak (1983) has been the most widely used (Shimotsu, 2002).
Given a fractionally integrated process $\{Y_t\}$, its spectral density is given by

$$f(\omega) = \left[ 2 \sin(\omega / 2) \right]^{-2d} f_u(\omega), \tag{4}$$

where $\omega$ is the Fourier frequency, $f_u(\omega)$ the spectral density corresponding to $u_t$, and $u_t$ a stationary short memory noise with zero mean. Consider the set of harmonic frequencies $\omega_j = 2\pi j / n$, $j = 0, 1, \ldots, n/2$, where $n$ is the sample size. Taking the logarithm of Eq. (4) we have

$$\ln f(\omega_j) = -2d \ln\left[ 2 \sin(\omega_j / 2) \right] + \ln f_u(\omega_j). \tag{5}$$

Equation (5) can be rewritten in an alternative form, following Wang et al. (2007), as

$$\ln f(\omega_j) = \ln f_u(0) - d \ln\left[ 4 \sin^2(\omega_j / 2) \right] + \ln\left[ \frac{f_u(\omega_j)}{f_u(0)} \right]. \tag{6}$$

The fractional differencing parameter $d$ can be estimated by the regression equation constructed from Eq. (6). Using the periodogram estimate of $f(\omega_j)$, if the number of frequencies $m$ used in Eq. (6) is a function $g(n)$ (a positive integer) of the sample size $n$, where $m = g(n) = n^{\alpha}$ with $0 < \alpha < 1$, it can be demonstrated that the least squares estimate $\hat{d}$ from the above regression is asymptotically normally distributed in large samples (Geweke and Porter-Hudak, 1983).
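Under the null hypothesis of no long memory ($d = 0$), the t statistic

$$t_{d=0} = \hat{d} \left( \frac{\pi^2}{6 \sum_{j=1}^{m} (U_j - \bar{U})^2} \right)^{-1/2}$$

has a limiting standard normal distribution, where $U_j = \ln\left[ 4 \sin^2(\omega_j / 2) \right]$ and $\bar{U}$ is the sample mean of $U_j$, $j = 1, \ldots, m$. The value of the power factor $\alpha$ is the main determinant of the ordinates included in the regression. Traditionally the number of periodogram ordinates $m$ is chosen from the interval $[T^{0.45}, T^{0.55}]$, where $T$ denotes the sample size ($n$ above). However, Hurvich and Deo (1998) showed that the optimal $m$ is of order $O(T^{0.8})$.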
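The log-periodogram regression can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the code used in the study: the function name `gph_estimate` is hypothetical, the periodogram is computed as $|\mathrm{DFT}|^2 / (2\pi n)$, the regressor is $\ln[4 \sin^2(\omega_j/2)]$, and the standard error uses the usual GPH asymptotic variance $\pi^2 / (6 \sum_j (U_j - \bar{U})^2)$.

```python
import numpy as np

def gph_estimate(x, alpha=0.5):
    """Log-periodogram (GPH) estimate of d with m = floor(n**alpha) frequencies.

    Returns (d_hat, asymptotic standard error, t statistic for d = 0).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(np.floor(n ** alpha))
    omega = 2.0 * np.pi * np.arange(1, m + 1) / n
    # Periodogram at the first m non-zero harmonic frequencies
    dft = np.fft.fft(x - x.mean())
    periodogram = np.abs(dft[1 : m + 1]) ** 2 / (2.0 * np.pi * n)
    # Regress ln I(omega_j) on U_j = ln[4 sin^2(omega_j / 2)]; the slope is -d
    u = np.log(4.0 * np.sin(omega / 2.0) ** 2)
    u_c = u - u.mean()
    slope = np.dot(u_c, np.log(periodogram)) / np.dot(u_c, u_c)
    d_hat = -slope
    se = np.pi / np.sqrt(6.0 * np.dot(u_c, u_c))
    return d_hat, se, d_hat / se

# Illustration on simulated data (assumed, not the rainfall records)
rng = np.random.default_rng(0)
white_noise = rng.standard_normal(4096)
d_wn, se_wn, t_wn = gph_estimate(white_noise)              # d should be near 0
d_rw, se_rw, t_rw = gph_estimate(np.cumsum(white_noise))   # d should be near 1
```

The choice `alpha=0.5` sits in the traditional bandwidth range; a larger exponent uses more ordinates, which lowers the variance but lets short memory dynamics bias the slope.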

OLS-based CUSUM test
The CUSUM tests are concerned with testing against the alternative that an unknown coefficient vector varies over time (Zeileis, 2000). Ploberger and Kramer (1992) proposed a test based on ordinary least squares (OLS) residuals. The OLS-based CUSUM test uses the OLS residuals $\hat{u}_t = y_t - x_t^{T} \hat{\beta}$. The OLS-based CUSUM-type empirical fluctuation process is defined as

$$W_n^0(t) = \frac{1}{\hat{\sigma} \sqrt{n}} \sum_{i=1}^{\lfloor nt \rfloor} \hat{u}_i, \quad 0 \le t \le 1,$$

where $\hat{\sigma} = \sqrt{\frac{1}{n-k} \sum_{t=1}^{n} \hat{u}_t^2}$ and $k$ is the number of regressors. This path will always start at 0 for $t = 0$ and it also returns to 0 for $t = 1$, but if there is a structural change at $t_0$ it should have a peak close to the break point $t_0$.
The null hypothesis $H_0$ is rejected if the path crosses either of the boundaries $(-\lambda, \lambda)$, which is equivalent to rejecting when the test statistic is larger than $\lambda$, which determines the significance level of the test. As $n \to \infty$,

$$W_n^0(t) \Rightarrow B^0(t),$$

where $B^0(t)$ is the standard Brownian bridge. Details on this test can be found in Ploberger and Kramer (1992).
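As an illustration, the OLS-based CUSUM path can be computed for the simplest case of a constant-mean regression ($y_t = \beta + u_t$, so $k = 1$). This is a sketch under that assumption; the function name `ols_cusum_mean` and the simulated series are hypothetical. The value 1.358 used in the comment is the familiar 5 % critical value for the supremum of the absolute value of a Brownian bridge.

```python
import numpy as np

def ols_cusum_mean(y):
    """OLS-based CUSUM process for the model y_t = beta + u_t.

    Returns the path W(0), W(1/n), ..., W(1), the statistic sup|W|,
    and the number of observations before the estimated break.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    resid = y - y.mean()                              # OLS residuals, k = 1 regressor
    sigma = np.sqrt(np.sum(resid ** 2) / (n - 1))
    path = np.concatenate(([0.0], np.cumsum(resid))) / (sigma * np.sqrt(n))
    stat = np.max(np.abs(path))
    break_index = int(np.argmax(np.abs(path)))        # peak of the path
    return path, stat, break_index

# Simulated example: white noise with a mean shift after observation 300
rng = np.random.default_rng(7)
y = rng.standard_normal(500)
y[300:] += 3.0
path, stat, break_index = ols_cusum_mean(y)
# Reject "no structural change" at the 5 % level when stat > 1.358
```

The path starts and ends at zero by construction, since OLS residuals from a regression with an intercept sum to zero, and its largest excursion marks the estimated break date.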

Results and discussion
Daily rainfall records for nine stations across Malaysia for the period 1 January 1968 to 31 December 2003, obtained from the Malaysian Meteorological Department, were analysed in this study. Table 1 gives the general geographic information of the selected weather stations. We start the analysis by discussing the descriptive statistics of the considered daily rainfall series. Figure 1 depicts time series plots for the Kota Bharu and Kuantan stations. The time series plots show very persistent behaviour. The autocorrelation function (Fig. 2) provides a measure of temporal correlation between rainfall data points at different time lags. For a purely random event, all autocorrelation coefficients are zero, apart from r(0), which is equal to 1. The autocorrelation function plots in Fig. 2 decay at hyperbolic rates and indicate that the time series are strongly correlated, i.e. the autocorrelations persist up to long lags. These are the main characteristics of long memory.
The series may become stationary after taking the first difference, displayed in Fig. 3, though the correlogram (Fig. 4) still shows significant values even at some lags relatively far from zero and clearly shows that the autocorrelation at lag one exceeds −0.5 in magnitude, which suggests overdifferencing (Van Beusekom, 2003). Therefore, taking the first difference is not appropriate for the series under consideration. This may be an indication that fractional differencing of an order smaller than or greater than 1 is more appropriate than first differencing. In addition, the periodograms (Fig. 5) show values close to 0 at the zero frequency, which might suggest that these series are overdifferenced (Caporale and Gil-Alana, 2004).
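The lag-one diagnostic for overdifferencing can be reproduced on simulated data. The sketch below is illustrative only: `sample_acf` is a hypothetical helper, and the input is white noise rather than the rainfall records. Differencing a series that is already stationary white noise gives a theoretical lag-one autocorrelation of exactly −0.5, which is the benchmark behind the rule of thumb.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r(0), ..., r(max_lag) of a series."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    n = len(xc)
    return np.array([np.dot(xc[: n - k], xc[k:]) / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
noise = rng.standard_normal(5000)

acf_level = sample_acf(noise, 5)            # close to 0 for all lags >= 1
acf_diff = sample_acf(np.diff(noise), 5)    # lag 1 close to -0.5: overdifferenced
```

A lag-one autocorrelation of the differenced series near or below −0.5 therefore signals that a full first difference removes more persistence than the data contain.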
It is a well-known fact that, when analysing climatic time series, distinguishing between long-term fluctuations and non-stationarity is not a simple task. Nevertheless, such a distinction would be extremely interesting. In fact, the presence of long-term climatic fluctuations, rather than non-stationarity, would imply that the patterns found in the data set could likely be attributed to cyclical behaviour rather than to irreversible tendencies. A useful tool for the above-mentioned distinction is the detection of the possible presence of long memory in the data. As mentioned before, one of the effects of long memory is the tendency of the time series to be subject to long-term cycles.
A time series which is generated by a true long memory process has a uniform DGP throughout the entire series. Thus if a structural break location method is mistakenly applied to the series, it may report a number of breaks where no breaks exist. These spurious breaks will yield a number of partitions of differing lengths, but these partitions will only be subsamples of a single population. Thus the subsamples will have the same statistical properties as the full series, because the subsamples have been drawn from a single population. Any estimated differences will be the result of randomness and long-range serial correlation but not of differences between the samples. If a time series has one or more structural breaks, then the series has one or more discontinuities in the DGP. In this case a structural break method will report a number of breaks which will divide the series into regimes, which are different subpopulations. The statistical properties of these subpopulations within the regimes will need to be estimated. The estimated differences will be the result of actual differences between the samples. The results for the estimated values of the fractional differencing parameters along with the ARFIMA(p, d, q) model for all data sets are displayed in Table 3.
From Table 3, we can observe that all the rainfall series can be described in terms of fractional integration, which is part of the larger classification of time series commonly referred to as long memory models. In order to show how the above-described strategy can help in discriminating between long memory and structural breaks, we have analysed the series on which we have identified an ARFIMA(0, d, 0) model. The estimate of the fractional parameter d was obtained by means of the GPH log-periodogram regression. The presence of long memory in all data sets is displayed in Table 3. At the same time, we have detected the regime break using the OLS-based CUSUM test for the break date. The break dates found in these data sets differ from station to station. The series were partitioned according to their break dates, and all subseries were tested for long memory. Table 4 gives the break date along with the estimates of the differencing parameters d1 and d2 for the series before and after the break, respectively.
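The split-sample strategy can be sketched end to end on simulated data. Everything here is an assumption made for illustration: the helper names `gph_d`, `cusum_break` and `split_sample_d`, the bandwidth exponent 0.5, and the test series (white noise with a single mean shift, i.e. a short memory process with a structural break). It is meant to show what the check looks like in the spurious case, not to reproduce Table 4.

```python
import numpy as np

def gph_d(x, alpha=0.5):
    """GPH log-periodogram estimate of d using m = int(n**alpha) frequencies."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(n ** alpha)
    omega = 2.0 * np.pi * np.arange(1, m + 1) / n
    periodogram = np.abs(np.fft.fft(x - x.mean())[1 : m + 1]) ** 2 / (2.0 * np.pi * n)
    u = np.log(4.0 * np.sin(omega / 2.0) ** 2)
    u_c = u - u.mean()
    return -np.dot(u_c, np.log(periodogram)) / np.dot(u_c, u_c)

def cusum_break(x):
    """Break date from the peak of the OLS-based CUSUM path (mean-only model).

    Returns the number of observations before the estimated break.
    """
    x = np.asarray(x, dtype=float)
    path = np.cumsum(x - x.mean())
    return int(np.argmax(np.abs(path))) + 1

def split_sample_d(x, alpha=0.5):
    """Estimate d on the full series and on the subseries before/after the break."""
    x = np.asarray(x, dtype=float)
    k = cusum_break(x)
    return k, gph_d(x, alpha), gph_d(x[:k], alpha), gph_d(x[k:], alpha)

# Short memory series with one mean shift (assumed example)
rng = np.random.default_rng(42)
series = rng.standard_normal(4000)
series[2437:] += 3.0
k, d_full, d1, d2 = split_sample_d(series)
```

In this simulated break case the full-sample estimate is inflated while d1 and d2 fall back towards zero. Under true long memory, as reported for the rainfall series, the subseries estimates would instead stay close to the full-sample value.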

Conclusions
It is now well established that long memory and structural change are easily confused. However, most researchers choose to ignore the problem of structural breaks when testing for long memory. It is a known fact that a short memory process with structural breaks may exhibit the properties of long memory.
The main contribution of this paper was to determine whether the daily rainfall series of some locations across Malaysia are generated by a true long memory process. An approach based on the fractionally integrated FI(d) process that can characterize a series was applied. The findings indicate that all the data sets exhibit long memory, but is it true long memory? To answer this we employed a method that allows a structural break to be detected in each series. The series were partitioned according to the break date identified, and a similar test was applied to the subseries. We found that all subseries displayed the same properties as the original series. Therefore, all the rainfall data sets considered were generated by a true long memory process.

F. Yusof et al.: Structural break or long memory

Fig. 1. Time series plots of daily rainfall series of Kota Bharu and Kuantan.

Fig. 2. Autocorrelation function of the data sets of Kota Bharu and Kuantan.

Fig. 5. Periodogram of the first difference series of Kota Bharu and Kuantan.

Table 1. Selected weather stations in Malaysia and their general geographic information.

Tables 2a-c depict a summary of the statistical properties for the full rainfall series, the subseries before the break and the subseries after the break. It can be observed from the tables that the series follow the same statistical patterns, with standard deviation greater than the mean and high values for kurtosis.

Hydrol. Earth Syst. Sci., 17, 1311-1318, 2013, www.hydrol-earth-syst-sci.net/17/1311/2013/

Fig. 3. Time plot of the first difference series of Kota Bharu and Kuantan.

Table 2a. Summary statistics for the rainfall data sets.

Table 2b. Statistics for the daily rainfall series before break.

Table 2c. Statistics for the daily rainfall series after break.

Table 4. CUSUM test with corresponding break date and estimates of the differencing parameters d1 (before break) and d2 (after break).