Skip to content

fix: get_struct field is incorrect when struct in array #1687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 16, 2025

Conversation

comphead
Copy link
Contributor

Which issue does this PR close?

Closes #1681 .

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@codecov-commenter
Copy link

codecov-commenter commented May 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.64%. Comparing base (f09f8af) to head (0b75dfd).
Report is 194 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1687      +/-   ##
============================================
+ Coverage     56.12%   58.64%   +2.51%     
- Complexity      976     1140     +164     
============================================
  Files           119      130      +11     
  Lines         11743    12676     +933     
  Branches       2251     2368     +117     
============================================
+ Hits           6591     7434     +843     
- Misses         4012     4055      +43     
- Partials       1140     1187      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@comphead comphead marked this pull request as ready for review May 16, 2025 21:12
@comphead
Copy link
Contributor Author

@andygrove @parthchandra @mbutrovich if you have time to look into it

@comphead comphead requested a review from kazuyukitanimura May 16, 2025 21:13
) -> DataFusionResult<ArrayRef> {
match (from_type, to_type) {
(DataType::List(_), DataType::List(to_inner_type)) => {
let cast_field = cast_array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will recurse back into cast_array where we have the special handling for casting struct-to-struct 👍

parquet_options: &SparkParquetOptions,
) -> DataFusionResult<ArrayRef> {
match (from_type, to_type) {
(DataType::List(_), DataType::List(to_inner_type)) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could just pass to_inner_type directly into this method to avoid matching on List again here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from_type doesn't need to be passed in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified

@@ -171,7 +178,36 @@ fn cast_array(
.with_timezone(Arc::clone(tz)),
))
}
_ => Ok(cast_with_options(&array, to_type, &PARQUET_OPTIONS)?),
// If Arrow cast supports the cast, delegate the cast to Arrow
_ if can_cast_types(from_type, to_type) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

_ if can_cast_types(from_type, to_type) => {
Ok(cast_with_options(&array, to_type, &PARQUET_OPTIONS)?)
}
_ => Ok(array),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an error rather than ignoring the cast request?

Copy link
Contributor Author

@comphead comphead May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it depends, for now we just ignore the casting as we cannot find a proper handler leaving the data as it comes from Spark and propagating the responsibility back to the caller.

Returning Error also makes sense to identify such unsupported pairs of type sooner than later by getting the error.

I dont have a strong opinion here, WDYT which one would serve better?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some nits / questions. This is a huge step forward 🚀.

@comphead comphead merged commit 48af872 into apache:main May 16, 2025
78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Experimental scans] schema adapter does not apply required schema for structs within lists
3 participants