
Cast issue when reading the Hive-BQ External table via Spark - Cannot up cast #122

Open
sandeep11sagar opened this issue Aug 30, 2024 · 7 comments
@sandeep11sagar

In the Hive-BQ connector, the DATETIME datatype is mapped to TimestampType.

In the Spark-BQ connector, the DATETIME datatype is mapped to StringType.

Because of this difference between the two connectors, a table created with "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler" is not accessible from Spark and fails with a cast error:

Cannot up cast column from "String" to "Timestamp".

This issue is not limited to DATETIME; other datatypes are affected as well, but those can be handled with a workaround.
For example: in Hive the column is created as INT, but in BQ all the integer types (TINYINT, INT, BIGINT) are aliased to INT64 (LongType).

Reading that table from Spark produces the error: Cannot up cast column from "BIGINT" to "INT".
This can be handled by changing the Hive external table column data type to BIGINT.
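The mismatch described above can be summarized in a small sketch. This is a hypothetical illustration of the mappings as reported in this issue, not output generated from the connectors themselves; the helper name `mismatched_types` is my own:

```python
# Hypothetical summary of the connector type mappings described in this
# issue (Spark < 3.5); not derived from the connectors' source code.
HIVE_BQ_MAPPING = {"DATETIME": "timestamp", "INT64": "bigint"}
SPARK_BQ_MAPPING = {"DATETIME": "string", "INT64": "long"}

# Hive's "bigint" and Spark's "long" are the same 64-bit integer type,
# so that pair is reconcilable by widening the Hive column to BIGINT.
EQUIVALENT = {("bigint", "long")}

def mismatched_types():
    """Return the BQ types whose Hive and Spark mappings truly disagree."""
    return sorted(
        t for t in HIVE_BQ_MAPPING
        if HIVE_BQ_MAPPING[t] != SPARK_BQ_MAPPING[t]
        and (HIVE_BQ_MAPPING[t], SPARK_BQ_MAPPING[t]) not in EQUIVALENT
    )

print(mismatched_types())  # only DATETIME has no widening workaround
```

The integer case has a same-width type on both sides, which is why changing the Hive DDL to BIGINT works there but nothing similar exists for DATETIME.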

@davidrabinowitz
Member

Prior to version 3.5, Spark did not support an equivalent for BigQuery's DATETIME. In version 3.5, TimestampNTZ was added, and the connector maps it to the DATETIME type.

Spark 3.5 is available in Dataproc image 2.2 and Dataproc Serverless 2.2.

@sandeep11sagar
Author

@davidrabinowitz

In our organisation we use DP 2.1, which runs Spark 3.3.

Is there any workaround for this version, or is it a blocker until we migrate to DP 2.2?

@davidrabinowitz
Member

Unfortunately, Spark 3.3 does not support TimestampNTZ. You can use the load-from-query feature (spark.read.format("bigquery").load("SELECT ...")) to load the data and convert it on the fly to data types that Spark 3.3 supports, such as Timestamp (with timezone) or Date. You can also reach out via your account manager and open a ticket with Google.
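The load-from-query workaround can be sketched as follows. The helper `datetime_safe_query` is hypothetical, and the table/column names are taken from the reproduction later in this thread; the `spark.read` call is shown commented out since it needs a cluster with the Spark-BQ connector configured:

```python
# Sketch of the load-from-query workaround: cast DATETIME columns to
# TIMESTAMP inside BigQuery itself, so Spark 3.3 never sees DATETIME.
# The helper name and argument shapes are illustrative assumptions.
def datetime_safe_query(table, datetime_cols, other_cols):
    """Build a BQ query that casts DATETIME columns to TIMESTAMP."""
    casts = [f"CAST({c} AS TIMESTAMP) AS {c}" for c in datetime_cols]
    return f"SELECT {', '.join(casts + other_cols)} FROM {table}"

query = datetime_safe_query("project.dataset.bq_base_test",
                            datetime_cols=["col_dt"],
                            other_cols=["col_int"])
print(query)

# On a cluster with the Spark-BQ connector (not runnable here):
# spark.conf.set("viewsEnabled", "true")
# spark.conf.set("materializationDataset", "dataset")
# df = spark.read.format("bigquery").load(query)
```

Note this path reads through the Spark-BQ connector directly rather than through the Hive external table, so it sidesteps the catalog schema entirely.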

@davidrabinowitz davidrabinowitz self-assigned this Aug 30, 2024
@sandeep11sagar
Author

sandeep11sagar commented Aug 30, 2024

The use case here is to read a BQ external table that was created in GCS-Hive using the Hive-BQ connector.

In that case the read statement will not use the bigquery format:

spark.sql("select * from dev.hive_bq_ext")

Through the Spark catalog, the datatype is read as String. When Spark internally tries to cast it to Timestamp, as declared in the table column datatype, it throws the error.

@davidrabinowitz
Member

Can you please share the schema of the external table as it appears in BQ?

@sandeep11sagar
Author

sandeep11sagar commented Aug 31, 2024

@davidrabinowitz

BQ Base Table Creation:

create table project.dataset.bq_base_test
as
select cast('2024-08-30' as DATETIME) as col_dt , cast(10 as TINYINT) as col_int;
create or replace view project.dataset.bq_base_test_view as select * from project.dataset.bq_base_test;

Hive External Table Creation:

CREATE EXTERNAL TABLE <db_name>.hive_ext_test
(
col_dt timestamp,
col_int tinyint
) 
STORED BY 
  'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler' 
TBLPROPERTIES (
  'bq.table'='project.dataset.bq_base_test_view',
  'viewsEnabled'='true');

create or replace view <db_name>.hive_ext_test_view as select * from <db_name>.hive_ext_test;

Accessing from Spark 3.3 (DP 2.1):

spark.conf.set("viewsEnabled","true")
spark.sql("select * from <db_name>.hive_ext_test_view").show(false)

org.apache.spark.sql.AnalysisException: Cannot up cast spark_catalog.<db_name>.hive_ext_test.col_dt from "STRING" to "TIMESTAMP".
The type path of the target object is:

You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:171)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3623)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$60$$anonfun$applyOrElse$181.applyOrElse(Analyzer.scala:3654)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$60$$anonfun$applyOrElse$181.applyOrElse(Analyzer.scala:3631)
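The failure mode in this trace can be modeled with a small sketch: the Hive DDL declares one schema, the connector surfaces another, and Spark rejects the implicit up cast. The `SAFE_UPCASTS` set and helper name below are illustrative assumptions, not Spark's actual cast matrix:

```python
# Sketch of why the AnalysisException fires: the Hive DDL declares
# col_dt as TIMESTAMP, but the Spark-BQ path surfaces DATETIME as
# STRING, and Spark refuses the implicit STRING -> TIMESTAMP up cast.
# The allowed-widening set here is illustrative, not Spark's real rules.
SAFE_UPCASTS = {("int", "bigint"), ("float", "double")}

def upcast_errors(declared, surfaced):
    """Return (column, from_type, to_type) tuples Spark would reject."""
    return [
        (col, surfaced[col], declared[col])
        for col in declared
        if surfaced[col] != declared[col]
        and (surfaced[col], declared[col]) not in SAFE_UPCASTS
    ]

declared = {"col_dt": "timestamp", "col_int": "tinyint"}  # Hive DDL above
surfaced = {"col_dt": "string", "col_int": "bigint"}      # what Spark sees
print(upcast_errors(declared, surfaced))
```

Both columns in this reproduction fail the check: STRING cannot be up cast to TIMESTAMP, and BIGINT cannot be narrowed to TINYINT.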

@sandeep11sagar
Author

@davidrabinowitz - Further analysis on this would be helpful. Please let me know if any other details are required.
