Short-term roadmap for this implementation #34
Comments
Thanks for writing this up here. Just to preface, I'm no expert in ORC, nor do I technically have a use case for it, so take my thoughts with a grain of salt. With that said:
I'll create more issues based on this roadmap. I also assume all our focus will be on a read implementation first, with write coming much later.
Another question I have is whether we'll focus solely on arrow interop (that is, reading from ORC directly into arrow arrays). The parquet crate in arrow-rs seems to support a more generic ColumnReader API for users who don't need arrow. If we focus only on arrow, we can optimize the read behaviour accordingly, whereas a more generic API might require a separate read implementation.
BTW some potentially relevant documents in case anyone is interested:
Thanks for these, will definitely give them a read!
What is missing from this roadmap that is required for this library to be added to DataFusion (and arrow-rs, polars)?
Hey @klangner, thanks for the interest!
For DataFusion there is an issue for it: datafusion-contrib/datafusion-orc#63. Right now it lacks support for projection, not to mention that the code is sequestered in an example instead of being part of the library.
For arrow-rs it's basically just... all the features needed to support read use cases (sorry if this is too vague 😅).
I'm not familiar with polars, so I can't say on that front.
For now I'm imagining enhancing the API for RecordBatch reading (akin to what parquet provides in arrow-rs) and also creating the necessary impls to allow DataFusion to read from ORC files using this library.
Previous discussion: apache/datafusion#4707
Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion related projects, there is still some (growing, it seems to me) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is being merged into arrow-rs. This draft roadmap is raised to help us discuss, organize, and direct our efforts toward that milestone.
Although the ORC format is less complex than Parquet, there is still a lot of work to do in various areas. Here is a list of functionality that needs to be implemented if we treat making ORC files queryable from DataFusion as the primary use case at this stage. Please feel free to add, remove, or prioritize items. We likely can't finish all of them in the short term, so marking which ones are going to be done is also important.
The items below are also related but have lower priority:
Long term items:
Then there are some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.