Technical note: A procedure to clean, decompose and aggregate time series
- Finres, 59 Boulevard Exelmans, 75016 Paris, France
Abstract. Errors, gaps and outliers complicate and sometimes invalidate the analysis of time series. While most fields have developed their own strategy to clean the raw data, no generic procedure has been promoted to standardize the pre-processing. This lack of harmonization makes the inter-comparison of studies difficult, and leads to screening methods that are usually ambiguous or case-specific. This study provides a generic pre-processing procedure (called past, implemented in R) dedicated to any univariate time series. Past is based on data binning and decomposes the time series into a long-term trend and a cyclic component (quantified by a new metric, the Stacked Cycles Index) to finally aggregate the data. Outliers are flagged with an enhanced Boxplot rule called Logbox. Three different Earth Science datasets (contaminated with gaps and outliers) are successfully cleaned and aggregated with past. This illustrates the robustness of this procedure that can be valuable to any discipline.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Journal article(s) based on this preprint
François Ritter
Interactive discussion
Status: closed
- RC1: 'Comment on hess-2021-609', Anonymous Referee #1, 29 Dec 2021
Overall, I find this submission not suitable for HESS due to the following major issues.
1) The title says it is a technical note, but the structure does not really fit that. According to the instructions, technical notes should be only a few pages long, while this submission is substantially longer. Also, the abstract states that the submission aims to propose a standardized way to pre-process time series, which is not really something that can be done in a technical note. However, to be a full research article, several important parts are missing in the submission (see comments further down).
2) The article lacks references and an introduction in the field of HESS. The introduction is quite general on time series and R packages, but does not discuss what is commonly used in hydrology or earth sciences and to what extent there is a need for additional improvement of outlier detection and gap imputation within these areas. The only part that connects to the journal is the case studies, which are relevant, but as the results are not compared with other approaches or with articles that have used these series before in a different way, conclusions are difficult to draw.
3) Generally, in environmental and earth science, time series can have a large variety of different structures and the questions to be investigated vary widely. Depending on which statistical analysis is to be made, the filling of gaps or outlier detection can be more or less important. For many approaches, outliers or gaps are not a crucial problem, but can be handled intrinsically. It is, thus, not obvious that a standardized way to preprocess is desirable. Obviously, when several series are within the same academic study they should be handled similarly, but no example of this being a real problem at present is given. Also, in earth sciences there are few situations where only single time series need to be handled. Either there are several variables observed at the same time point, which can be used to identify whether there is something wrong with the sample altogether or with one variable specifically, or there are nearby stations available that can be used to identify outliers or fill gaps.
Potentially the submission could be resubmitted as a purely technical note describing the R package and discussing the possible inputs to the function (k, …), with some examples of the effect of different choices on the output. Such a submission should describe more intuitively what effect a change in k has, rather than rely on a simulation study that is difficult to relate to in practice. For example, it could describe how the method might work for a normal and a log-normal distribution, both often met in earth science.
Specific comments:
There are several important parts missing or unclear:
- A new boxplot rule is suggested, motivated by the fact that this rule leads to far fewer false positives, i.e. the type I error is improved. No mention is made of the type II error, which typically increases when the type I error decreases. Clearly, this is not easy to study as, in a univariate time series, only outliers above a certain threshold can be detected. In this study this threshold is chosen to be very high, probably leading to situations where few (real) outliers are detected. This is also one of the reasons outlier detection methods flag rather many observations as outliers, giving the user the possibility to double-check the correctness of those and keep the ones that seem reasonable. In the recommendations it is stated that the value of k=0.6 will minimize the type I and type II errors, but it is very unclear how this was determined, and generally it is not possible to minimize type I and type II errors at the same time.
- It is not discussed which definition of outlier is used in this context, and especially it would be important to define outliers in highly skewed distributions and how it would be possible to distinguish them from observations that belong to the distribution.
- In the case studies, outliers are introduced and can be identified with the proposed method, but the outliers are completely unrealistic and could be identified by visual inspection alone; no advanced methods are needed.
- The breakdown points of the outlier detection methods are not given.
- As a technical note on an R package the code should be made available, e.g. on GitHub or similar, as this is a major part of the submission.
- It is rather unclear how well the suggested values of 3.8 and 9.4 work in practice, as they are the medians of values obtained in the simulation. This probably means that these values work considerably worse for some specific distributions. No discussion is made about this.
- It is rather unclear how the value of k is determined. Are the simulations in Figure 1a-1c made for several sample sizes, with their medians shown in panel d?
- In many cases in environmental and earth science it is common to work with e.g. log-transformed values (or models that use a log-link) to account for single high values in skewed distributions. This would make it unnecessary to develop outlier detection methods for skewed distributions. Instead, conventional methods can be used on the symmetric transformed values. Has the typical handling of skewed distributions in earth science been studied? Is there a need to find outliers in skewed distributions?
- For the comparison between outlier detection methods, 600 distributions were selected to give the same weight to different types of distributions. For determining the values of alpha and k, all 9702 distributions are used. It is unclear why.
- It is also not clear how the 9702 distributions are defined and how they are chosen. At one place, a reference to the supplementary is given, but there is no info on distributions in the supplementary.
- At least one of the case studies has a seasonal pattern, which would allow a comparison to STL or STLplus.
- It is argued that STLplus has severe disadvantages compared to the proposed method. For example, it is said that the trend modelled with loess needs to be parametrized. No reference is given and it is unclear what is meant by this, as loess is a non-parametric regression method and does not need a parametrization.
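As a rough illustration of the comment above on log-transformed values, the following minimal sketch applies the conventional Tukey boxplot rule (whisker coefficient 1.5) to a simulated log-normal sample, once on the raw values and once after a log transform. The sample size, the distribution parameters and the helper names `tukey_flags` and `flagged_fraction` are assumptions chosen for illustration; this is not the Logbox rule and not code from the manuscript.

```python
import math
import random
import statistics


def tukey_flags(values, coef=1.5):
    """Return True for each value lying outside the Tukey fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - coef * iqr, q3 + coef * iqr
    return [v < lower or v > upper for v in values]


def flagged_fraction(values, coef=1.5):
    """Fraction of the sample flagged by the conventional boxplot rule."""
    flags = tukey_flags(values, coef)
    return sum(flags) / len(flags)


# Hypothetical test data: a clean log-normal sample with no injected outliers,
# so every flagged point is a false positive (type I error).
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

raw_rate = flagged_fraction(sample)
log_rate = flagged_fraction([math.log(v) for v in sample])

print(f"flagged on raw values: {raw_rate:.3%}")
print(f"flagged on log values: {log_rate:.3%}")
```

Under these assumptions the rule flags far more points on the raw skewed values than on the symmetric log-transformed ones, which is the behaviour the comment relies on. The sketch says nothing about type II errors, since no true outliers are present.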
Comments on structure
- Referencing within the article is not clear. Figures are often referenced already in the methods description, which makes reading difficult. A better separation of methods and results would be helpful.
- Also sections called context and method are not clearly divided.
- References to the use of outlier detection methods in earth sciences are missing.
- AC1: 'Reply on RC1', Francois Ritter, 03 Jan 2022
- RC2: 'Submit to statistical journal', Thomas Wutzler, 08 Jan 2022
The technical note presents two studies:
(1) the LogBox outlier detection
(2) the past data aggregation scheme
While (1) is not within the target scope of HESS and should be submitted to a statistical journal, like all the methods it is compared against (L130), (2) presents a method that is of interest to many readers. Hence, I will not comment here on (1) but only on (2). Although the authors claim generality for their ‘past’ scheme, I see several reasons why it is difficult to apply to the data I am working with. I have several major concerns:
(a) clarify and discuss the assumptions on the data series.
(b) Discussion on generality vs expert knowledge
Considering that the method is not as generally applicable as claimed, I have doubts that HESS is the proper place to publish the “past” method. If a technical note is prepared for (2) for HESS, it should be submitted as a new manuscript rather than a revision.
Thanks for making the source code and data available. I could reproduce the results and the plots.
(a) The basic assumption is that the dataset is an additive signal of a long-term trend, a periodic anomaly (termed “cycle” in the manuscript) that does not change with time, observational noise and outliers. As demonstrated, the method is already useful for such cases. However, to be even more useful, the authors should think about extending the method to infer or take into account changes of the anomaly with time. At the least, they need to give the user the possibility to supply a mask slicing the time series into chunks where the anomaly can be assumed constant, e.g., stacking winter/spring/summer into different stacks.
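The additive assumption described in (a) can be made concrete with a minimal sketch on synthetic data. It bins the series into windows of one period, takes one robust value per bin as the long-term trend, and stacks the detrended bins to estimate a cycle that is assumed constant in time. The constants `PERIOD` and `N_BINS`, the trend slope and the noise level are hypothetical, and the steps are a simplified illustration of binning and stacking, not the past implementation.

```python
import math
import random
import statistics

PERIOD = 24   # hypothetical: samples per cycle (e.g. hourly data, daily cycle)
N_BINS = 60   # hypothetical: number of complete cycles in the series

# Synthetic series: linear trend + time-invariant cycle + Gaussian noise.
rng = random.Random(1)
true_cycle = [2.0 * math.sin(2 * math.pi * k / PERIOD) for k in range(PERIOD)]
series = [
    0.01 * t + true_cycle[t % PERIOD] + rng.gauss(0.0, 0.3)
    for t in range(PERIOD * N_BINS)
]

# Step 1: cut the series into non-overlapping bins of one period each.
bins = [series[i * PERIOD:(i + 1) * PERIOD] for i in range(N_BINS)]

# Step 2: long-term trend = one robust central value (median) per bin.
trend = [statistics.median(b) for b in bins]

# Step 3: subtract the bin value, stack all bins, and take the median
# at each position within the period to estimate the cycle.
detrended = [[v - trend[i] for v in b] for i, b in enumerate(bins)]
cycle = [statistics.median(d[k] for d in detrended) for k in range(PERIOD)]

max_err = max(abs(c - t) for c, t in zip(cycle, true_cycle))
print(f"largest deviation from the true cycle: {max_err:.3f}")
```

If the cycle amplitude changed between bins (clear-sky versus cloudy weeks, or phenology), the single stacked `cycle` would average over those regimes. A user-supplied mask would amount to running step 3 separately on each group of bins.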
In the first of the three examples, the daily cycle of temperature (luckily) does not vary. But what about synoptic cycles? During clear-sky weeks the daily temperature cycle will be larger than on cloudy weeks. For signals influenced by vegetation, the cycles will differ with phenology, etc.
I tried applying the method to several soil respiration time series of the publicly available COSORE dataset. I got it to work technically, but was not able to properly detect outliers and aggregate to annual values. For some series, there were probably too few data within periods (Vern series: 4-hourly measurements within a daily period); for others, the properties of the signal changed too strongly with season (Migliavacca series).
The authors need to better clarify the assumptions and limitations of the method. The method is not as general as claimed in the first version of the manuscript.
(b) The application of case-specific outlier-detection and aggregation is discussed as being a thing one wants to avoid. However, usually researchers know their data quite well and know their distributions, stability over time, problematic periods, changes in measurement equipment etc. It needs a more balanced discussion on the value of consistency for meta-analysis and usage of expert-knowledge.
Outlook: many observational time series come in replicates. Can you think of ways to extend the method to use information across the replicates?
- AC2: 'Reply on RC2', Francois Ritter, 12 Jan 2022
- AC3: 'Comment on hess-2021-609', Francois Ritter, 21 Dec 2022
Dear Anke Hildebrandt and dear Jens Schumacher,
Thank you for all the time you put into this manuscript. I have addressed the changes requested by the reviewer and removed the very small sample case (n < 9) from the Logbox procedure. Another change came from an independent reviewer (Dr. Rob Hyndman), who mentioned that an important reference was missing: Barbato et al. (2011).
I was not aware of the study by Barbato et al. (2011), but it appears that they found a law similar to Logbox (alpha = A*log(n)+B), with a different reasoning (heuristic approach) and calibrated on the Gaussian distribution only (A and B are constant). I had to include this approach in Part I and compare its performance to Logbox (Fig. 2 updated). The discussion has been slightly modified to include this new model, but the conclusion remains the same.
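To show what a law of the form alpha = A*log(n)+B does in practice, here is a minimal sketch of a boxplot rule whose whisker coefficient grows with the sample size. The constants `A` and `B` below are hypothetical placeholders, not the values calibrated by Barbato et al. (2011) or used in Logbox, and the natural logarithm is assumed because the base is not stated here.

```python
import math
import random
import statistics

A, B = 0.4, 0.5   # hypothetical placeholders, NOT calibrated constants


def whisker_coef(n, a=A, b=B):
    """Sample-size-dependent whisker coefficient alpha = a*log(n) + b."""
    return a * math.log(n) + b   # natural log: an assumption


def count_flagged(values, coef):
    """Number of values outside [Q1 - coef*IQR, Q3 + coef*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return sum(v < q1 - coef * iqr or v > q3 + coef * iqr for v in values)


# Clean Gaussian samples of increasing size: every flag is a false positive.
rng = random.Random(42)
results = {}
for n in (100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    fixed = count_flagged(sample, 1.5)             # conventional rule
    scaled = count_flagged(sample, whisker_coef(n))
    results[n] = (fixed, scaled)
    print(f"n={n}: fixed 1.5 flags {fixed}, log(n) rule flags {scaled}")
```

With a fixed coefficient the number of flagged points in clean data grows roughly in proportion to n, while a coefficient that increases with log(n) widens the fences as the sample grows. How fast it should widen is exactly what the calibration of A and B decides.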
Please find below a list of changes:
Part I:
- the very small sample case (n < 9) has been removed and the method section has therefore been simplified.
- m.star is not exactly a predictor of the kurtosis excess (which takes into account the two tails of a distribution), but it is more a predictor of the weight of the heavier tail. The description of m.star has been updated accordingly.
- Barbato et al. (2011) has been included in the introduction and the discussion.
Part II: unaffected
References:
- updated to the HESS format.
Supplementary material:
- The very small sample case (n < 9) has been removed, and Fig. S1 updated.
Acknowledgement:
- I will personally fund the publication of this article, and I therefore removed the "fonds de dotation O" from the acknowledgement.
Code:
- the code has been updated on https://github.com/fritte2/ctbi_article to account for the changes in Part I.
Ref:
Barbato, G., Barini, E. M., Genta, G., and Levi, R.: Features and Performance of Some Outlier Detection Methods, Journal of Applied Statistics, https://doi.org/10.1080/02664763.2010.545119, 2011.
Best regards,
François Ritter