Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: upgrade to datafusion 43 #2886

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ion-elgreco
Copy link
Collaborator

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Sep 14, 2024
@rtyler
Copy link
Member

rtyler commented Sep 16, 2024

We need Datafusion to upgrade as well before we can do this. I have some patches to make the latest datafusion work but we're blocked on them to adopt the newest arrow and kernel

@rtyler rtyler self-assigned this Sep 16, 2024
@rtyler rtyler marked this pull request as draft September 16, 2024 16:24
@Xuanwo
Copy link
Contributor

Xuanwo commented Sep 18, 2024

Hi, datafusion has been released. Let's rock!

@alamb
Copy link

alamb commented Sep 20, 2024

Thank you @ion-elgreco -- let me know if there is anything I can to do help with this PR (e.g. I could work on updating the code to use non deprecated APIs in the core for parquet statistics...)

@ion-elgreco
Copy link
Collaborator Author

Hey @alamb, I am not working on this atm, but @rtyler has picked it up from here

@rtyler rtyler changed the title chore: bump kernel chore: upgrade to datafusion 42 Sep 20, 2024
@rtyler rtyler force-pushed the chore/bump_kernel branch 2 times, most recently from 8db59c2 to b325e27 Compare September 20, 2024 12:35
@ion-elgreco
Copy link
Collaborator Author

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

I had a quick glance at the Changelog but can't really pinpoint where this change may have occurred, do you have any ideas?

DFSchema { inner: 
Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
new schema: 
DFSchema { inner: Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },

@alamb
Copy link

alamb commented Sep 20, 2024

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

It sounds somewhat similar to a discussion happening here about nullability (but I think field name is similar in terms of how it is treated): apache/datafusion#11989 (comment)

Update; filed apache/datafusion#12560 to track the regression more

@rtyler
Copy link
Member

rtyler commented Sep 20, 2024

@alamb thanks for following up! I feel a little sheepish 🐑 insofar that my internal CI showed these errors weeks ago in my integration testing between datafusion main with delta-rs main, I just didn't have the time to triage them like @ion-elgreco has 😄

I've subscribed to the issue you created and can help testing if necessary! Depending on my availability this weekend I can try fixing it in datafusion too 🤔

@matthewmturner
Copy link
Contributor

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

@ion-elgreco
Copy link
Collaborator Author

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

It's not yet released though

@rtyler
Copy link
Member

rtyler commented Oct 7, 2024

I can confirm that the latest datafusion main passes tests in CI 🥳

@rtyler rtyler changed the title chore: upgrade to datafusion 42 chore: upgrade to datafusion 43 Oct 8, 2024
@alamb
Copy link

alamb commented Oct 8, 2024

fix

I filed apache/datafusion#12813 to consider releasing a patch version of DataFusion

@ion-elgreco
Copy link
Collaborator Author

@alamb is there any change still required? Because I still see failed tests

@alamb
Copy link

alamb commented Nov 5, 2024

@alamb is there any change still required? Because I still see failed tests

I don't know to be honest.

@rtyler
Copy link
Member

rtyler commented Nov 5, 2024

@ion-elgreco don't worry about this pull request, I will be taking care of it. The 43 release is imminent 🎊

@matthewmturner
Copy link
Contributor

Hi there - just doing a pulse check on this PR. It looks like its been updated to DataFusion v43 but I still see several failing CI checks. Are there new regressions in v43?

@hntd187
Copy link
Collaborator

hntd187 commented Nov 15, 2024

@matthewmturner I just did some more fixes on our side, I noticed we still have what appears to be a regression https://github.com/delta-io/delta-rs/actions/runs/11860379826/job/33055504796?pr=2886#step:4:2436 could someone take a look at this?

@timsaucer
Copy link

timsaucer commented Nov 15, 2024

I think the problem is that the projected schema for array types is changing the field name for the elements of the array from "element" to "item". I have seen this before, but I'm not sure exactly where that is happening. I will try to keep digging in. Which is to say, I think the problem may be upstream in datafusion.

@rtyler
Copy link
Member

rtyler commented Nov 16, 2024

This last test failure I believe is because of the use of the make_array UDF which expects items in the array to be item rather than element.

I'm not convinced this is a Datafusion issue per se, I'm still digging through some code to see if they've standardized on using "element" vs "item". Either way I don't want to wait for datafusion 44 to merge this change, so I'll be figuring out a workaround shortly 🤗

@timsaucer
Copy link

Please let me know what you discover. I do expect we may release a DF 43.1.0 due to a number of changes recently to account for the new default StringView instead of String. If the root problem is on the DF side, we could potentially include it as well.

@rtyler
Copy link
Member

rtyler commented Nov 16, 2024

@timsaucer My preference would certainly be for make_array to use element for the field name of the elements in the array 😆 We have adopted that because it's consistent with how Apache Parquet does things. In this case, the test itself is using make_array so I'm going to update the test to create the ListArray the Good Ol Fashioned Way ™️ and add a warning for folks sending in ListArrays with item element names.

@hntd187
Copy link
Collaborator

hntd187 commented Nov 16, 2024

Is it perhaps possible to include UDFs that just let you specify an optional field name for the inner field like this? Because I know Map also gets the same treatment of "what are the inner fields called, well we call it this, they call it that." So it might just let more advanced users choose their own adventure instead of having to go back and forth.

@rtyler
Copy link
Member

rtyler commented Nov 16, 2024

I have rebased on main and consolidated a couple commits. This should be ready to merge if the actions go green @hntd187

rtyler and others added 13 commits November 16, 2024 19:08
The release of pyo3 0.22.3 compells this since we cannot otherwise
compile. The choice is between pinning 0.22.2 and upgrading our ABI, and
I think it's better to upgrade the ABI

Signed-off-by: R. Tyler Croy <[email protected]>
Today the make_array function from Datafusion uses "item" as the list
element's field name. With recent changes in delta-kernel-rs we have
switched to calling it "element" which is more conventional related to
how Apache Parquet handles things

Signed-off-by: R. Tyler Croy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema metadata descriptions cause write failures in deltalake >= 0.19.0
8 participants