spark-acid incorrectly reads/writes pre-Gregorian timestamps #95

Open
bersprockets opened this issue Sep 17, 2020 · 3 comments
@bersprockets
In beeline:

0: jdbc:hive2://localhost:10000> create table ts_acid (ts TIMESTAMP) stored as orc TBLPROPERTIES ('transactional' = 'true');
No rows affected (0.132 seconds)
0: jdbc:hive2://localhost:10000> insert into ts_acid values ('1200-01-01 00:00:00.0');
No rows affected (2.339 seconds)
0: jdbc:hive2://localhost:10000> select * from ts_acid;
20/09/16 17:51:51 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200916175151_d8bc218d-da7d-491a-8f97-e38fb2416748
+------------------------+
|       ts_acid.ts       |
+------------------------+
| 1200-01-01 00:00:00.0  |
+------------------------+
1 row selected (0.207 seconds)
0: jdbc:hive2://localhost:10000> 

In Spark:

scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.ts_acid")).load
20/09/16 17:53:22 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.show(truncate=false)
20/09/16 17:53:29 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 17:53:31 ERROR AcidUtils: Failed to get files with ID; using regular API: Only supported for DFS; got class org.apache.hadoop.fs.LocalFileSystem
+-------------------+
|ts                 |
+-------------------+
|1199-12-24 16:00:00|
+-------------------+

scala> 

Conversely, in Spark:

scala> val df = Seq(java.sql.Timestamp.valueOf("1400-01-01 00:00:00.0")).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.write.format("HiveAcid").option("table", "default.ts_acid").mode("append").save()
20/09/16 18:05:30 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 18:05:31 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/16 18:05:32 WARN HiveConf: HiveConf of name hive.metastore.local does not exist

scala> 

Then, in beeline:

0: jdbc:hive2://localhost:10000> select * from ts_acid;
20/09/16 18:06:25 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200916180625_03a152d1-2be1-41c7-b205-33d38e68ce2f
+------------------------+
|       ts_acid.ts       |
+------------------------+
| 1200-01-01 00:00:00.0  |
| 1400-01-09 08:00:00.0  |    <--- This should be 1400-01-01 00:00:00.0
+------------------------+
2 rows selected (0.283 seconds)
0: jdbc:hive2://localhost:10000> 
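The magnitudes here look like calendar disagreement: the 7-day shift on read (year 1200) and the 8-day shift on write (year 1400) match the difference between the hybrid Julian/Gregorian calendar (the default of `java.util.GregorianCalendar`, historically used on the Hive side) and the proleptic Gregorian calendar (`java.time`, used internally by Spark 3.0+). A minimal plain-JVM sketch of that disagreement, no Spark required:

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDiff {
    // Days since 1970-01-01 for Jan 1 of `year`, under the hybrid
    // Julian/Gregorian calendar (java.util.GregorianCalendar's default).
    static long hybridEpochDay(int year) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, Calendar.JANUARY, 1);
        return cal.getTimeInMillis() / 86_400_000L;
    }

    // The same date under the proleptic Gregorian calendar
    // (java.time, which Spark 3.0+ uses internally).
    static long prolepticEpochDay(int year) {
        return LocalDate.of(year, 1, 1).toEpochDay();
    }

    public static void main(String[] args) {
        // The two calendars disagree by 7 days in 1200 and 8 days in 1400 --
        // the same magnitudes as the shifts in the repro above.
        System.out.println(hybridEpochDay(1200) - prolepticEpochDay(1200)); // 7
        System.out.println(hybridEpochDay(1400) - prolepticEpochDay(1400)); // 8
    }
}
```

The remaining 8-hour offset in both repros is consistent with a session time zone of UTC-8, which is an assumption about the test environment, not something shown in the logs.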
@amoghmargoor amoghmargoor self-assigned this Sep 17, 2020
@amoghmargoor
Collaborator

Thanks @bersprockets, we will take a look at it.

@bersprockets
Author

An update:

On write, spark-acid does not write post-start-of-Gregorian calendar values correctly either:

In Spark:

scala> val df = Seq(java.sql.Timestamp.valueOf("2017-03-14 23:22:01.786")).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.write.format("HiveAcid").option("table", "default.ts_acid2").mode("append").save()
20/09/17 10:55:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/17 10:55:38 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/17 10:55:40 WARN HiveConf: HiveConf of name hive.metastore.local does not exist

scala> 

Then, in beeline:

0: jdbc:hive2://localhost:10000> select * from ts_acid2;
20/09/17 10:55:58 INFO lockmgr.DbTxnManager: Stopped heartbeat for query: bruce_20200917105558_5eca4bb0-2f28-4722-aca1-5749e4bf8e58
+------------------------+
|      ts_acid2.ts       |
+------------------------+
| 2017-03-15 06:22:01.0  |
+------------------------+
1 row selected (0.411 seconds)
0: jdbc:hive2://localhost:10000> 

If you run everything in UTC, you might not see this issue with post-start-of-Gregorian calendar values (you'll still see issues with pre-Gregorian values).
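The 7-hour shift on the 2017 value is what you would get if the zone offset were applied one extra time somewhere on the write path; on 2017-03-14 America/Los_Angeles is at UTC-7 (DST). A sketch of that arithmetic, assuming the session ran in Pacific time (an assumption; the zone is not shown in the logs):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TzShift {
    // The UTC wall clock of a local (session-zone) wall-clock value.
    // If this UTC wall clock is later re-labelled as local time, the
    // timestamp appears shifted by the zone offset -- once too many.
    static LocalDateTime utcWallClock(LocalDateTime local) {
        return local
                .atZone(ZoneId.of("America/Los_Angeles")) // assumed session zone
                .withZoneSameInstant(ZoneOffset.UTC)
                .toLocalDateTime();
    }

    public static void main(String[] args) {
        // Value inserted from Spark in the repro above.
        LocalDateTime written = LocalDateTime.of(2017, 3, 14, 23, 22, 1);
        // Prints 2017-03-15T06:22:01 -- the value Hive displayed.
        System.out.println(utcWallClock(written));
    }
}
```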

@amoghmargoor
Collaborator

amoghmargoor commented Sep 28, 2020

@bersprockets I ran a few queries to check the behavior for non-ACID tables, and it is consistent with ACID tables for pre-Gregorian timestamps, at least on 2.4.3; i.e., Spark in general does not return Julian timestamps, at least in 2.4.3. The non-ACID Hive table is created below:

0: jdbc:hive2://0.0.0.0:10001/default> create table ts_normal (ts TIMESTAMP) stored as orc;

0: jdbc:hive2://0.0.0.0:10001/default> insert into ts_normal values ('1200-01-01 00:00:00.0');

0: jdbc:hive2://0.0.0.0:10001/default> select ts from ts_normal;
INFO : Compiling command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d): select ts from ts_normal
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:ts, type:timestamp, comment:null)], properties:null)
INFO : Completed compiling command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d); Time taken: 0.193 seconds
INFO : Executing command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d): select ts from ts_normal
INFO : Completed executing command(queryId=root_20200928180409_8b652df5-d3ea-4226-a651-14a5110af18d); Time taken: 0.001 seconds

INFO : OK

+------------------------+
|           ts           |
+------------------------+
| 1200-01-01 00:00:00.0  |
+------------------------+

In Spark:

scala> spark.sql("select ts from ts_normal").show(false)
+-------------------+
|ts                 |
+-------------------+
|1199-12-25 08:00:00|
+-------------------+

scala>

So I don't think this is specific to the DataSource. Even if you take a look at the unix timestamp, it is stored the same:

Spark:

scala> spark.sql("select unix_timestamp(ts) from ts_normal").show(false)
20/09/28 10:57:42 main WARN HiveAuthHelper: Unable to execute Ranger flag check, reverting to default.
20/09/28 10:57:43 main WARN HiveConf: HiveConf of name hive.allow.move.on.s3 does not exist
+---------------------------------------+
|unix_timestamp(ts, yyyy-MM-dd HH:mm:ss)|
+---------------------------------------+
|-24298848000                           |
+---------------------------------------+

Hive:

0: jdbc:hive2://0.0.0.0:10001/default> select unix_timestamp(ts) from ts_normal;
INFO : Compiling command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c): select unix_timestamp(ts) from ts_normal
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c); Time taken: 0.224 seconds
INFO : Executing command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c): select unix_timestamp(ts) from ts_normal
INFO : Completed executing command(queryId=root_20200928180453_850f7666-8856-45d2-a485-93e40b343f1c); Time taken: 0.001 seconds
INFO : OK
+---------------+
|      _c0      |
+---------------+
| -24298876800  |
+---------------+
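Worth noting: the two unix timestamps above differ by exactly 28,800 seconds, i.e. 8 hours, which looks like a session-zone offset (an assumption, since the zone of the test environment is not shown) rather than a difference in the stored value:

```java
public class UnixDiff {
    public static void main(String[] args) {
        long sparkTs = -24_298_848_000L; // unix_timestamp(ts) reported by Spark above
        long hiveTs  = -24_298_876_800L; // unix_timestamp(ts) reported by Hive above
        long diffSeconds = sparkTs - hiveTs;
        System.out.println(diffSeconds);         // 28800
        System.out.println(diffSeconds / 3600L); // 8 (hours)
    }
}
```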

I have not checked this against Spark 3.0.0, as we have not yet upgraded the DataSource for it, so I am not sure whether this issue would have been fixed by https://issues.apache.org/jira/browse/SPARK-31557. Are you seeing different behavior for non-ACID tables?
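For reference, the fix in Spark 3.0 (that JIRA and its siblings) rebases values by holding the wall-clock fields fixed while switching calendars. A rough sketch of the idea in plain Java; this is an illustration under simplifying assumptions (UTC, whole-second values), not Spark's actual implementation, which lives in `org.apache.spark.sql.catalyst.util.RebaseDateTime` and handles zones and sub-second precision far more carefully:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Rebase {
    // Rebase microseconds-since-epoch from the hybrid Julian/Gregorian calendar
    // to the proleptic Gregorian calendar by keeping the wall-clock fields fixed.
    // Simplified illustration: assumes UTC and whole-second input values.
    static long rebaseJulianToGregorianMicros(long micros) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.setTimeInMillis(micros / 1_000L);
        // Read the wall-clock fields as the hybrid calendar labels this instant...
        LocalDateTime wallClock = LocalDateTime.of(
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH),
                cal.get(Calendar.HOUR_OF_DAY),
                cal.get(Calendar.MINUTE),
                cal.get(Calendar.SECOND));
        // ...then reinterpret the same labels under the proleptic Gregorian calendar.
        return wallClock.toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
    }
}
```

Rebasing the hybrid-calendar micros for 1200-01-01 this way lands on the proleptic-calendar micros for 1200-01-01, i.e. it absorbs the 7-day gap seen in the repro.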
