
Cast issue when reading the Hive-BQ External table via Spark - Cannot up cast #122

Open
sandeep11sagar opened this issue Aug 30, 2024 · 7 comments
@sandeep11sagar

In the Hive-BQ connector, the DATETIME datatype is mapped to TimestampType.

In the Spark-BQ connector, the DATETIME datatype is mapped to StringType.

Because of this difference between the two connectors, a table created with "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler" is not accessible from Spark and fails with a cast error:

Cannot up cast column from "String" to "Timestamp".

This issue is not limited to DATETIME; other datatypes are affected as well, but those can be handled with a workaround.
For example: in Hive the column is created as INT, but in BQ all the integer types (TINYINT, INT, BIGINT) are aliased to INT64 (LongType).

Reading that table from Spark produces the error: Cannot up cast column from "BIGINT" to "INT".
This can be handled by changing the Hive external table column data type to BIGINT.
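The mismatch described above can be summarized in a small sketch. This is a hypothetical illustration of the mappings as reported in this issue, not output generated from the connectors themselves; the helper name `mismatched_types` is my own:

```python
# Hypothetical summary of the connector type mappings described in this
# issue (Spark < 3.5); not derived from the connectors' source code.
HIVE_BQ_MAPPING = {"DATETIME": "timestamp", "INT64": "bigint"}
SPARK_BQ_MAPPING = {"DATETIME": "string", "INT64": "long"}

# Hive's "bigint" and Spark's "long" are the same 64-bit integer type,
# so that pair is reconcilable by widening the Hive column to BIGINT.
EQUIVALENT = {("bigint", "long")}

def mismatched_types():
    """Return the BQ types whose Hive and Spark mappings truly disagree."""
    return sorted(
        t for t in HIVE_BQ_MAPPING
        if HIVE_BQ_MAPPING[t] != SPARK_BQ_MAPPING[t]
        and (HIVE_BQ_MAPPING[t], SPARK_BQ_MAPPING[t]) not in EQUIVALENT
    )

print(mismatched_types())  # only DATETIME has no widening workaround
```

The integer case has a same-width type on both sides, which is why changing the Hive DDL to BIGINT works there but nothing similar exists for DATETIME.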

@davidrabinowitz
Member

Prior to version 3.5, Spark did not support an equivalent for BigQuery's DATETIME. In version 3.5, TimestampNTZ was added, and the connector maps it to the DATETIME type.

Spark 3.5 is available in Dataproc image 2.2 and Dataproc Serverless 2.2.

@sandeep11sagar
Author

@davidrabinowitz

In our organisation we use DP 2.1, which runs Spark 3.3.

Is there any workaround for this version, or is it a blocker until we migrate to DP 2.2?

@davidrabinowitz
Member

Unfortunately, Spark 3.3 does not support TimestampNTZ. You can use the load-from-query feature (spark.read.format("bigquery").load("SELECT ...")) to load the data and convert it on the fly to data types that Spark 3.3 supports, such as Timestamp (with timezone) or Date. You can also reach out via your account manager and open a ticket with Google.
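The load-from-query workaround can be sketched as follows. The helper `datetime_safe_query` is hypothetical, and the table/column names are taken from the reproduction later in this thread; the `spark.read` call is shown commented out since it needs a cluster with the Spark-BQ connector configured:

```python
# Sketch of the load-from-query workaround: cast DATETIME columns to
# TIMESTAMP inside BigQuery itself, so Spark 3.3 never sees DATETIME.
# The helper name and argument shapes are illustrative assumptions.
def datetime_safe_query(table, datetime_cols, other_cols):
    """Build a BQ query that casts DATETIME columns to TIMESTAMP."""
    casts = [f"CAST({c} AS TIMESTAMP) AS {c}" for c in datetime_cols]
    return f"SELECT {', '.join(casts + other_cols)} FROM {table}"

query = datetime_safe_query("project.dataset.bq_base_test",
                            datetime_cols=["col_dt"],
                            other_cols=["col_int"])
print(query)

# On a cluster with the Spark-BQ connector (not runnable here):
# spark.conf.set("viewsEnabled", "true")
# spark.conf.set("materializationDataset", "dataset")
# df = spark.read.format("bigquery").load(query)
```

Note this path reads through the Spark-BQ connector directly rather than through the Hive external table, so it sidesteps the catalog schema entirely.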

@davidrabinowitz davidrabinowitz self-assigned this Aug 30, 2024
@sandeep11sagar
Author

sandeep11sagar commented Aug 30, 2024

The use case here is to read a BQ external table that was created in GCS-Hive using the Hive-BQ connector.

In that case the read statement will not use the bigquery format:

spark.sql("select * from dev.hive_bq_ext")

Through the Spark catalog, the datatype is read as String. When Spark internally tries to cast it to Timestamp, as declared in the table column datatype, it throws the error.

@davidrabinowitz
Member

Can you please share the schema of the external table as it appears in BQ?

@sandeep11sagar
Author

sandeep11sagar commented Aug 31, 2024

@davidrabinowitz

BQ Base Table Creation:

create table project.dataset.bq_base_test
as
select cast('2024-08-30' as DATETIME) as col_dt , cast(10 as TINYINT) as col_int;
create or replace view project.dataset.bq_base_test_view as select * from project.dataset.bq_base_test;

Hive External Table Creation:

CREATE EXTERNAL TABLE <db_name>.hive_ext_test
(
col_dt timestamp,
col_int tinyint
) 
STORED BY 
  'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler' 
TBLPROPERTIES (
  'bq.table'='project.dataset.bq_base_test_view',
  'viewsEnabled'='true');

create or replace view <db_name>.hive_ext_test_view as select * from <db_name>.hive_ext_test;

Accessing from Spark 3.3 (DP 2.1):

spark.conf.set("viewsEnabled","true")
spark.sql("select * from <db_name>.hive_ext_test_view").show(false)

org.apache.spark.sql.AnalysisException: Cannot up cast spark_catalog.<db_name>.hive_ext_test.col_dt from "STRING" to "TIMESTAMP".
The type path of the target object is:

You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:171)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3623)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$60$$anonfun$applyOrElse$181.applyOrElse(Analyzer.scala:3654)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$60$$anonfun$applyOrElse$181.applyOrElse(Analyzer.scala:3631)
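The failure mode in this trace can be modeled with a small sketch: the Hive DDL declares one schema, the connector surfaces another, and Spark rejects the implicit up cast. The `SAFE_UPCASTS` set and helper name below are illustrative assumptions, not Spark's actual cast matrix:

```python
# Sketch of why the AnalysisException fires: the Hive DDL declares
# col_dt as TIMESTAMP, but the Spark-BQ path surfaces DATETIME as
# STRING, and Spark refuses the implicit STRING -> TIMESTAMP up cast.
# The allowed-widening set here is illustrative, not Spark's real rules.
SAFE_UPCASTS = {("int", "bigint"), ("float", "double")}

def upcast_errors(declared, surfaced):
    """Return (column, from_type, to_type) tuples Spark would reject."""
    return [
        (col, surfaced[col], declared[col])
        for col in declared
        if surfaced[col] != declared[col]
        and (surfaced[col], declared[col]) not in SAFE_UPCASTS
    ]

declared = {"col_dt": "timestamp", "col_int": "tinyint"}  # Hive DDL above
surfaced = {"col_dt": "string", "col_int": "bigint"}      # what Spark sees
print(upcast_errors(declared, surfaced))
```

Both columns in this reproduction fail the check: STRING cannot be up cast to TIMESTAMP, and BIGINT cannot be narrowed to TINYINT.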

@sandeep11sagar
Author

@davidrabinowitz - Further analysis on this would be helpful. Please let me know if any other details are required.
