Support Apache ORC File Format, and Use Sparse Index #4707

Open
hrh007 opened this issue Dec 22, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@hrh007

hrh007 commented Dec 22, 2022

  1. Support a new file format: Apache ORC.
  2. Use ORC's sparse indexes (page level, row group level) to speed up queries.
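
To illustrate the idea behind item 2, here is a minimal sketch of statistics-based skipping. It is not the ORC reader API or DataFusion's pruning code; `RowGroupStats` and `row_groups_to_read` are hypothetical names, and the sketch assumes each row group records a min and max value for a single integer column.

```rust
/// Hypothetical min/max statistics for one row group of an integer column.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// Returns the indices of the row groups that may contain a value in
/// the inclusive range `lo..=hi`. Every other row group can be skipped
/// without being read or decoded.
pub fn row_groups_to_read(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        // Keep a group only if its [min, max] range overlaps [lo, hi].
        .filter(|(_, s)| s.max >= lo && s.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

With three row groups covering 0 to 99, 100 to 199, and 200 to 299, a filter such as `value BETWEEN 150 AND 160` would only need to read the second group.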
hrh007 added the enhancement label Dec 22, 2022
@hrh007
Author

hrh007 commented Dec 22, 2022

Is this feature planned? @alamb @andygrove

@alamb
Contributor

alamb commented Dec 22, 2022

Hi @hrh007 -- I do not know of any plans to support ORC at this time.

I think we would welcome a contribution if you would like to do so.

@hrh007
Author

hrh007 commented Jan 3, 2023

> Hi @hrh007 -- I do not know of any plans to support ORC at this time.
>
> I think we would welcome a contribution if you would like to do so.

I will contribute to this issue, but it may take a lot of time because there is no official ORC implementation in Rust.

@alamb
Contributor

alamb commented Jan 3, 2023

> I will contribute to this issue, but it may take a lot of time because there is no official ORC implementation in Rust.

That does sound like an important dependency to implement first 🤔

@andygrove
Member

There is https://github.com/DataEngineeringLabs/orc-format, but it builds on arrow2 rather than arrow-rs.

@Jefffrey
Contributor

Hi @hrh007, just wondering if there's been any progress on this? If not, I'm interested in picking this up.

@hrh007
Author

hrh007 commented Mar 15, 2023

> Hi @hrh007, just wondering if there's been any progress on this? If not, I'm interested in picking this up.

I have not made any progress yet. Glad you can participate, and thanks for your contribution!

@Jefffrey
Contributor

See apache/arrow-rs#4980

@hrh007
Author

hrh007 commented Oct 24, 2023

> See apache/arrow-rs#4980

Great job, thanks for your efforts!

@alamb
Contributor

alamb commented Oct 31, 2023

BTW I think we need some more help to get ORC implemented: apache/arrow-rs#4980 (comment)

@Jefffrey
Contributor

Jefffrey commented Nov 3, 2023

Following the discussion in apache/arrow-rs#4980:

We will first focus on implementing ORC file format support for Arrow in https://github.com/datafusion-contrib/datafusion-orc.

Once it is ready, DataFusion could use it to query ORC files.

Eventually, we hope https://github.com/datafusion-contrib/datafusion-orc can be merged into arrow-rs, which DataFusion could then use directly.

@waynexia
Member

waynexia commented Nov 3, 2023

Drafted a short-term roadmap for datafusion-orc


5 participants