Skip to content
This repository has been archived by the owner on Nov 13, 2021. It is now read-only.

AnomalyDetectionTs drops timezone from POSIXct objects and converts POSIXlt to POSIXct #2

Open
zachmayer opened this issue Jan 6, 2015 · 29 comments

Comments

@zachmayer
Copy link

AnomalyDetectionTs should keep the timestamp column of the output dataset as-is, rather than converting to POSIXct and dropping the timezone attribute:

library(AnomalyDetection)
data(data_raw)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") 
attr(data$timestamp, "tzone") <- "America/New_York"

res = AnomalyDetectionTs(data, max_anoms=0.002, direction='both', plot=FALSE)
attr(res$anoms, 'tzone')

Dropping the timezone is problematic if you wish to merge the anomalies back to the main dataset:

> merge(data, res$anoms, by='timestamp')
             timestamp   count    anoms
1  1980-09-28 22:40:00 114.308 193.1036
2  1980-09-30 12:26:00 130.222 180.8990
3  1980-09-30 12:30:00 126.721 178.8220
4  1980-09-30 12:31:00 152.956 198.3260
5  1980-09-30 12:32:00 136.004 203.9010
6  1980-09-30 12:33:00 134.589 200.3090
7  1980-09-30 12:34:00 122.490 178.4910
8  1980-09-30 12:36:00 126.806 183.0180
9  1980-09-30 12:38:00 117.334 186.8230
10 1980-09-30 12:39:00 121.061 183.6600
11 1980-09-30 12:40:00 116.924 179.2760
12 1980-09-30 12:41:00 129.097 197.2830
13 1980-09-30 12:42:00 119.566 191.0970
14 1980-09-30 12:43:00 137.694 194.6700
15 1980-09-30 12:46:00 136.876 200.8160
16 1980-09-30 12:47:00 125.126 186.2350
17 1980-09-30 12:48:00 122.008 185.4210
18 1980-09-30 12:49:00 127.935 178.9580
19 1980-09-30 12:51:00 138.159 203.2310
20 1980-09-30 12:52:00 130.939 181.3540
21 1980-09-30 12:53:00 122.351 186.7780
22 1980-09-30 12:55:00 121.120 176.1250
23 1980-09-30 12:56:00 122.707 181.5140
24 1980-09-30 12:57:00 118.378 175.2610
25 1980-10-05 05:18:00 101.332  40.0000
26 1980-10-05 05:28:00 103.798 250.0000
27 1980-10-05 05:38:00 100.839  40.0000
@zachmayer zachmayer changed the title AnomalyDetectionTs drops timezone from POSIXct objects AnomalyDetectionTs drops timezone from POSIXct objects and converts POSIXlt to POSIXct Jan 6, 2015
@jhochenbaum
Copy link

Thanks, Owen and I will look into it Zach!

@zachmayer
Copy link
Author

Thanks!

@erikriverson
Copy link

This looks like a useful package, thanks!

Perhaps the 'timestamp' column should be required to be of class POSIXct? I don't believe base::merge works when the 'by' column is of class POSIXlt. For example:

x <- Sys.time()
test1 <- data.frame(timestamp = as.POSIXct(x:(x + 9), origin = "1970-01-01"),
                    y = rnorm(1:10))
test2 <- data.frame(timestamp = test1$timestamp, z = rnorm(1:10))

## ok when timestamp is POSIXct
merge(test1, test2, by = "timestamp")

## not ok when timestamp is POSIXlt
test1$timestamp <- test2$timestamp <- as.POSIXlt(test1$timestamp)
merge(test1, test2, by = "timestamp")

With the timezone issue, if the 'timestamp' column is POSIXct, the timezone on the object returned can be made to match that of the input data. If the timestamp column is not POSIXct, it could be converted using as.POSIXct instead of strptime in format_timestamp, and given a default time zone of UTC possibly.

@zachmayer
Copy link
Author

@erikeldridge There's really 2 issues here:

  1. POSIXlt is silently converted to POSIXct. Perhaps forcing the use of POSIXct is the way to go, but we should at least get a warning before conversion. Personally, I'm in favor of the output using the exact same data type as the input and letting the user worry about how to merge a POSIXlt object.
  2. Even if the user provided a POSIXct column, the timezone is silently dropped, which means the return data has the wrong timezone and merges incorrectly with the original data (see my example above: the data returned by AnomalyDetectionTs is off by 4 hours).

@zachmayer
Copy link
Author

@erikeldridge I didn't see the last paragraph of your comment. I agree, if the timestamp is of class POSIXct, the timezone should be kept, and if the timestamp must be converted the timezone should be preserved if possible.

@jhochenbaum
Copy link

Thanks for the comments guys -- Owen and I will look into this later today, but on glance you're right, we should preserve the timestamp/timezone. Stay tuned...

@zachmayer
Copy link
Author

@jhochenbaum Awesome, thank you!

@jhochenbaum
Copy link

@zachmayer, would you like to submit a patch and @owenvallis and I can review?

@zachmayer
Copy link
Author

@jhochenbaum I haven't had time to take a crack on this yet, but if I do I'll submit a PR.

@jhochenbaum
Copy link

Thanks @zachmayer

@caijun
Copy link

caijun commented Dec 20, 2017

@zachmayer After my PR #92, the timestamp column is now stored as POSIXct rather than POSIXlt and the timezone attribute is kept. Now your code demo produces following output

> library(AnomalyDetection)
> data <- raw_data
> str(data)
'data.frame':	14398 obs. of  2 variables:
 $ timestamp: POSIXct, format: "1980-09-25 14:01:00" "1980-09-25 14:02:00" "1980-09-25 14:03:00" ...
 $ count    : num  182 176 184 178 165 ...
> attr(data$timestamp, "tzone")
[1] "UTC"
> 
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"

You should check timezone attribute for res$anoms$timestamp instead of res$anoms. Therefore, the timezone attribute was not dropped.

However, your code snippet exposed another problem of AnomalyDetection package, which is the input timestamp with non UTC timezone. The problem is explained by following code output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320

As you see, the 1st row of res$anoms is corresponding to the 125th row of data. The timestamp columns are of the same value and the only difference is the timezone attribute.

I will try to fix the problem.

@caijun
Copy link

caijun commented Dec 20, 2017

I have fixed it by PR #92

Now the AnomalyDetectionTs keeps the original timezone attr as the input timestamp, see following output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
> attr(res$anoms$timestamp, 'tzone')
[1] "America/New_York"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320

@Maryoda2
Copy link

Maryoda2 commented Jul 5, 2018

Hi team,

Sorry for this up, I mean really.

But with this script :

library(AnomalyDetection)
data(raw_data)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") <- "America/New_York"
attr(data$timestamp, "tzone")
res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
attr(res$anoms$timestamp, 'tzone')
res$plot

Here what I have :

library(AnomalyDetection)

data(raw_data)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") <- "America/New_York"
attr(data$timestamp, "tzone")
[1] "America/New_York"
res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
res$plot
Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt

I cannot understand why, I've checked all what you've said, but no result... May I have some help ?

@caijun
Copy link

caijun commented Jul 6, 2018

@Maryoda2 The Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt has been fixed by my PR #92, which has not been merged into the master branch (It seems that the twitter/AnomalyDetection repository has not been maintained since Sep, 2015). You could install AnomalyDetection package that I modified.

devtools::install_github("caijun/AnomalyDetection")

Then rerunning your script will produce the expected results without errors (I tested your script).

@Maryoda2
Copy link

Maryoda2 commented Jul 6, 2018

Oh thanks a lot !

@Maryoda2
Copy link

Maryoda2 commented Jul 6, 2018

Well... in fact sometimes it writes :

Error in aggregate.data.frame(x[2], format(x[1], "%Y-%m-%d %H:%M:00"), :
'by' must be a list

@caijun
Copy link

caijun commented Jul 6, 2018

@Maryoda2 could you give an example reproducing the error?

@Maryoda2
Copy link

Maryoda2 commented Jul 6, 2018

library(AnomalyDetection)
data <-dems
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") <- "America/New_York"
attr(data$timestamp, "tzone")
res <- AnomalyDetectionTs(data, max_anoms = 0.02, direction = 'both', plot = TRUE)
attr(res$anoms$timestamp, 'tzone')
res$plot

With another dataset, I've tried to translate your script.
I created a dataset called dems and attempted to format it exactly like raw_data, but it doesn't work

@caijun
Copy link

caijun commented Jul 7, 2018

Please run str(data) to see the structure of data and post the result. Otherwise without your data, I cannot reproduce the error to provide help.

@Maryoda2
Copy link

Maryoda2 commented Jul 7, 2018

str(raw_data)
'data.frame': 14398 obs. of 2 variables:
$ timestamp: POSIXct, format: "1980-09-25 10:01:00" "1980-09-25 10:02:00" ...
$ count : num 182 176 184 178 165 ...
str(dems)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1127 obs. of 2 variables:
$ timestamp: POSIXct, format: "2018-07-04 09:51:41" "2018-07-04 09:51:51" ...
$ count : num 2.6 2.6 2.6 2.6 2.5 2.4 2.4 2.4 2.5 2.7 ...

@caijun
Copy link

caijun commented Jul 8, 2018

This is because your timestamp is not in the format of "%Y-%m-%d %H:%M:00", which means AnomalyDetectionTs() detect anomalies every minute, while your timestamp appears to be every 10 seconds. If you really need to detect anomalies at a time resolution of less than a minute, then the code of twitter/AnomalyDetection has to be modified to support such a feature.

@Maryoda2
Copy link

Maryoda2 commented Jul 8, 2018

Oh I see, How can I edit the code of twitter/AnomalyDetection, in fact the caijun/Twitter (yours) ?

@caijun
Copy link

caijun commented Jul 10, 2018

@Maryoda2 Could you share a sample of your data, which I can use for reproducing the error and testing?

I think the problem you encountered has been reported in issue #77, which has been resolved by PRs #98 and #79

Although these PRs have not been merged into the twitter/AnomalyDetection master branch, but has been merged into isalgueiro/AnomalyDetection. You can try this fork of AnomalyDetection by

devtools::install_github("isalgueiro/AnomalyDetection")

@Maryoda2
Copy link

Maryoda2 commented Jul 10, 2018

Well thanks for your answer
With isalgueiro it doesn't work, same error.
Here is a sample

@caijun
Copy link

caijun commented Jul 10, 2018

@Maryoda2 the data you provided is not enough to reproduce the error. Please provide data with at least 20000 rows.

@Maryoda2
Copy link

Maryoda2 commented Jul 10, 2018

20.000 is no more a sample, but we can't detect anomalies on dataset less than 20.000 ?

@caijun
Copy link

caijun commented Jul 10, 2018

@Maryoda2 Using the data you provided, I encountered the error "Anom detection needs at least 2 periods worth of data". Hence, I need more data to make sure this error was not because of limited data.

@Maryoda2
Copy link

In fact for the moment this is all what I have... So game over then ?

@caijun
Copy link

caijun commented Jul 10, 2018

@Maryoda2 this error has been discussed in issues #15 and #42, perhaps they can provide some useful information for you to fix this error by yourself, e.g. fork this repository and modify the source code.

caijun pushed a commit to caijun/AnomalyDetection that referenced this issue Mar 1, 2019
replace calls to 'subset()' with '['
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants