spark-acid incorrectly reads/writes pre-Gregorian timestamps #95

Open
bersprockets opened this issue Sep 17, 2020 · 3 comments
@bersprockets
In beeline:

0: jdbc:hive2://localhost:10000> create table ts_acid (ts TIMESTAMP) stored as orc TBLPROPERTIES ('transactional' = 'true');
No rows affected (0.132 seconds)
0: jdbc:hive2://localhost:10000> insert into ts_acid values ('1200-01-01 00:00:00.0');
No rows affected (2.339 seconds)
0: jdbc:hive2://localhost:10000> select * from ts_acid;
20/09/16 17:51:51 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200916175151_d8bc218d-da7d-491a-8f97-e38fb2416748
+------------------------+
|       ts_acid.ts       |
+------------------------+
| 1200-01-01 00:00:00.0  |
+------------------------+
1 row selected (0.207 seconds)
0: jdbc:hive2://localhost:10000> 

In Spark:

scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.ts_acid")).load
20/09/16 17:53:22 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.show(truncate=false)
20/09/16 17:53:29 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 17:53:31 ERROR AcidUtils: Failed to get files with ID; using regular API: Only supported for DFS; got class org.apache.hadoop.fs.LocalFileSystem
+-------------------+
|ts                 |
+-------------------+
|1199-12-24 16:00:00|
+-------------------+

scala> 

Conversely, in Spark:

scala> val df = Seq(java.sql.Timestamp.valueOf("1400-01-01 00:00:00.0")).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.write.format("HiveAcid").option("table", "default.ts_acid").mode("append").save()
20/09/16 18:05:30 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 18:05:31 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 18:05:32 WARN HiveConf: HiveConf of name hive.metastore.local does not exist

scala> 

Then, in beeline:

0: jdbc:hive2://localhost:10000> select * from ts_acid;
20/09/16 18:06:25 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200916180625_03a152d1-2be1-41c7-b205-33d38e68ce2f
+------------------------+
|       ts_acid.ts       |
+------------------------+
| 1200-01-01 00:00:00.0  |
| 1400-01-09 08:00:00.0  |    <--- This should be 1400-01-01 00:00:00.0
+------------------------+
2 rows selected (0.283 seconds)
0: jdbc:hive2://localhost:10000> 
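The magnitudes here look like calendar disagreement: the 7-day shift on read (year 1200) and the 8-day shift on write (year 1400) match the difference between the hybrid Julian/Gregorian calendar (the default of `java.util.GregorianCalendar`, historically used on the Hive side) and the proleptic Gregorian calendar (`java.time`, used internally by Spark 3.0+). A minimal plain-JVM sketch of that disagreement, no Spark required:

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDiff {
    // Days since 1970-01-01 for Jan 1 of `year`, under the hybrid
    // Julian/Gregorian calendar (java.util.GregorianCalendar's default).
    static long hybridEpochDay(int year) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, Calendar.JANUARY, 1);
        return cal.getTimeInMillis() / 86_400_000L;
    }

    // The same date under the proleptic Gregorian calendar
    // (java.time, which Spark 3.0+ uses internally).
    static long prolepticEpochDay(int year) {
        return LocalDate.of(year, 1, 1).toEpochDay();
    }

    public static void main(String[] args) {
        // The two calendars disagree by 7 days in 1200 and 8 days in 1400 --
        // the same magnitudes as the shifts in the repro above.
        System.out.println(hybridEpochDay(1200) - prolepticEpochDay(1200)); // 7
        System.out.println(hybridEpochDay(1400) - prolepticEpochDay(1400)); // 8
    }
}
```

The remaining 8-hour offset in both repros is consistent with a session time zone of UTC-8, which is an assumption about the test environment, not something shown in the logs.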
@amoghmargoor amoghmargoor self-assigned this Sep 17, 2020
@amoghmargoor
Collaborator

Thanks @bersprockets, we will take a look at it.

@bersprockets
Author

An update:

On write, spark-acid does not write post-start-of-Gregorian calendar values correctly either:

In Spark:

scala> val df = Seq(java.sql.Timestamp.valueOf("2017-03-14 23:22:01.786")).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.write.format("HiveAcid").option("table", "default.ts_acid2").mode("append").save()
20/09/17 10:55:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/17 10:55:38 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/17 10:55:40 WARN HiveConf: HiveConf of name hive.metastore.local does not exist

scala> 

Then, in beeline:

0: jdbc:hive2://localhost:10000> select * from ts_acid2;
20/09/17 10:55:58 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200917105558_5eca4bb0-2f28-4722-aca1-5749e4bf8e58
+------------------------+
|      ts_acid2.ts       |
+------------------------+
| 2017-03-15 06:22:01.0  |
+------------------------+
1 row selected (0.411 seconds)
0: jdbc:hive2://localhost:10000> 

If you run everything in UTC, you might not see this issue with post-start-of-Gregorian calendar values (you'll still see issues with pre-Gregorian values).
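The 7-hour shift on the 2017 value is what you would get if the zone offset were applied one extra time somewhere on the write path; on 2017-03-14 America/Los_Angeles is at UTC-7 (DST). A sketch of that arithmetic, assuming the session ran in Pacific time (an assumption; the zone is not shown in the logs):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TzShift {
    // The UTC wall clock of a local (session-zone) wall-clock value.
    // If this UTC wall clock is later re-labelled as local time, the
    // timestamp appears shifted by the zone offset -- once too many.
    static LocalDateTime utcWallClock(LocalDateTime local) {
        return local
                .atZone(ZoneId.of("America/Los_Angeles")) // assumed session zone
                .withZoneSameInstant(ZoneOffset.UTC)
                .toLocalDateTime();
    }

    public static void main(String[] args) {
        // Value inserted from Spark in the repro above.
        LocalDateTime written = LocalDateTime.of(2017, 3, 14, 23, 22, 1);
        // Prints 2017-03-15T06:22:01 -- the value Hive displayed.
        System.out.println(utcWallClock(written));
    }
}
```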

@amoghmargoor
Collaborator

amoghmargoor commented Sep 28, 2020

@bersprockets I ran a few queries to check the behavior for non-ACID tables, and it is consistent with ACID tables for pre-Gregorian timestamps, at least on 2.4.3; i.e., Spark in general does not return Julian timestamps, at least in 2.4.3. The non-ACID Hive table is created below:

0: jdbc:hive2://0.0.0.0:10001/default> create table ts_normal (ts TIMESTAMP) stored as orc;

0: jdbc:hive2://0.0.0.0:10001/default> insert into ts_normal values ('1200-01-01 00:00:00.0');

0: jdbc:hive2://0.0.0.0:10001/default> select ts from ts_normal;
INFO : Compiling command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d): select ts from ts_normal
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:ts, type:timestamp, comment:null)], properties:null)
INFO : Completed compiling command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d); Time taken: 0.193 seconds
INFO : Executing command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d): select ts from ts_normal
INFO : Completed executing command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d); Time taken: 0.001 seconds

INFO : OK

+------------------------+
|           ts           |
+------------------------+
| 1200-01-01 00:00:00.0  |
+------------------------+

In Spark:

scala> spark.sql("select ts from ts_normal").show(false)
+-------------------+
|ts                 |
+-------------------+
|1199-12-25 08:00:00|
+-------------------+

scala>

So I don't think this is specific to the DataSource. Even if you take a look at the unix timestamp, it is stored the same:

Spark:

scala> spark.sql("select unix_timestamp(ts) from ts_normal").show(false)
20/09/28 10:57:42 main WARN HiveAuthHelper: Unable to execute Ranger flag check, reverting to default.
20/09/28 10:57:43 main WARN HiveConf: HiveConf of name hive.allow.move.on.s3 does not exist
+---------------------------------------+
|unix_timestamp(ts, yyyy-MM-dd HH:mm:ss)|
+---------------------------------------+
|-24298848000                           |
+---------------------------------------+

Hive:

0: jdbc:hive2://0.0.0.0:10001/default> select unix_timestamp(ts) from ts_normal;
INFO : Compiling command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c): select unix_timestamp(ts) from ts_normal
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c); Time taken: 0.224 seconds
INFO : Executing command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c): select unix_timestamp(ts) from ts_normal
INFO : Completed executing command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c); Time taken: 0.001 seconds
INFO : OK
+---------------+
|      _c0      |
+---------------+
| -24298876800  |
+---------------+
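Worth noting: the two unix timestamps above differ by exactly 28,800 seconds, i.e. 8 hours, which looks like a session-zone offset (an assumption, since the zone of the test environment is not shown) rather than a difference in the stored value:

```java
public class UnixDiff {
    public static void main(String[] args) {
        long sparkTs = -24_298_848_000L; // unix_timestamp(ts) reported by Spark above
        long hiveTs  = -24_298_876_800L; // unix_timestamp(ts) reported by Hive above
        long diffSeconds = sparkTs - hiveTs;
        System.out.println(diffSeconds);         // 28800
        System.out.println(diffSeconds / 3600L); // 8 (hours)
    }
}
```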

I have not checked this against Spark 3.0.0, as we have not yet upgraded the DataSource for it, so I am not sure whether this issue would have been fixed by https://issues.apache.org/jira/browse/SPARK-31557. Are you seeing different behavior for non-ACID tables?
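For reference, the fix in Spark 3.0 (that JIRA and its siblings) rebases values by holding the wall-clock fields fixed while switching calendars. A rough sketch of the idea in plain Java; this is an illustration under simplifying assumptions (UTC, whole-second values), not Spark's actual implementation, which lives in `org.apache.spark.sql.catalyst.util.RebaseDateTime` and handles zones and sub-second precision far more carefully:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Rebase {
    // Rebase microseconds-since-epoch from the hybrid Julian/Gregorian calendar
    // to the proleptic Gregorian calendar by keeping the wall-clock fields fixed.
    // Simplified illustration: assumes UTC and whole-second input values.
    static long rebaseJulianToGregorianMicros(long micros) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.setTimeInMillis(micros / 1_000L);
        // Read the wall-clock fields as the hybrid calendar labels this instant...
        LocalDateTime wallClock = LocalDateTime.of(
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH),
                cal.get(Calendar.HOUR_OF_DAY),
                cal.get(Calendar.MINUTE),
                cal.get(Calendar.SECOND));
        // ...then reinterpret the same labels under the proleptic Gregorian calendar.
        return wallClock.toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
    }
}
```

Rebasing the hybrid-calendar micros for 1200-01-01 this way lands on the proleptic-calendar micros for 1200-01-01, i.e. it absorbs the 7-day gap seen in the repro.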
