Add support for arrow stream #265

Open
djouallah opened this issue Sep 8, 2024 · 3 comments
@djouallah

First, congratulations on the progress you have made; chDB is substantially better than it was just 6 months ago. I am trying to read a folder of CSV files and export it to Delta. Currently I am using df = sess.sql(sql,"ArrowTable") to transfer the data to deltalake Python. The problem is that I am getting OOM errors. It would be nice if you could add support for Arrow RecordBatch so that the transfer is done in smaller batches.

thanks
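
Until a streaming output format exists, one way to bound memory is to query the folder a few files at a time, so each transfer stays small. The following is only a minimal sketch of that workaround, not a chDB feature. `chunked`, `export_in_batches`, `run_query`, and `write_batch` are hypothetical names introduced for illustration; `run_query` and `write_batch` stand in for whatever wraps `sess.sql(sql,"ArrowTable")` and `write_deltalake(...)`.

```python
from pathlib import Path
from typing import Any, Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def export_in_batches(
    folder: str,
    files_per_batch: int,
    run_query: Callable[[List[str]], Any],  # hypothetical: files -> result
    write_batch: Callable[[Any], None],     # hypothetical: result -> sink
) -> int:
    """Query the CSV files in `folder` group by group and write each result.

    Only one group's result is held at a time, so peak memory depends on
    `files_per_batch` rather than on the size of the whole folder.
    Returns the number of batches written.
    """
    paths = sorted(str(p) for p in Path(folder).glob("*.csv"))
    batches = 0
    for group in chunked(paths, files_per_batch):
        result = run_query(group)  # assumed to restrict the query to `group`
        write_batch(result)        # assumed to append to the target table
        batches += 1
    return batches
```

In the setup described above, `run_query` would build a query restricted to the given files and call `sess.sql(sql,"ArrowTable")`, while `write_batch` would call `write_deltalake` with `mode="append"`. This only helps when the query can be evaluated per file; an aggregation across all files would need a different approach.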

@djouallah
Author

@auxten how do you get a schema when using this?

df = sess.sql(sql,"ArrowStream")
write_deltalake(f"/lakehouse/default/Tables/T{total_files}/chdb",df, mode="append", partition_by=['year'], storage_options= storage_options)

@auxten
Member

auxten commented Sep 10, 2024

I understand that what you’re trying to do is retrieve the output schema and then stream the data into Delta Lake.

  1. Regarding the issue of retrieving the schema, I believe it can be obtained by setting the output format to JSON, ArrowTable, DataFrame, etc. However, in cases of large data volumes, a LIMIT should be applied.
  2. Currently, the implementation of chDB requires loading the entire dataset into memory before proceeding with further processing, which can lead to an OOM (out of memory) issue when dealing with large data volumes. This is a point that needs improvement, and I will schedule it for future development.
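
The schema suggestion in point 1 can be sketched as a small helper that wraps the original query in a subquery with a LIMIT, so only a handful of rows are materialized when probing for column names and types. This is an illustrative sketch: `schema_probe_sql` and the `schema_probe` alias are hypothetical names, and it assumes `sql` is a single SELECT statement without a trailing FORMAT clause or comment.

```python
def schema_probe_sql(sql: str, limit: int = 1) -> str:
    """Wrap `sql` in a subquery that returns at most `limit` rows.

    The result has the same columns as the original query, so it can be
    used to inspect the output schema without loading the full dataset.
    """
    if limit < 0:
        raise ValueError("limit must be >= 0")
    # Drop surrounding whitespace and a trailing semicolon so the
    # statement can be embedded inside parentheses.
    inner = sql.strip().rstrip(";").strip()
    if not inner:
        raise ValueError("sql must not be empty")
    return f"SELECT * FROM ({inner}) AS schema_probe LIMIT {limit}"
```

Running the wrapped query with the "ArrowTable" output format, as in `sess.sql(schema_probe_sql(sql),"ArrowTable")`, should give a small table whose `schema` attribute describes the columns, assuming the result is a PyArrow table as in the original report.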

@djouallah
Author

I added chDB to my ETL benchmarks. Feel free to have a look and tell me if I am doing something terribly wrong:
https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb

auxten closed this as completed by moving to Done in chDB 2024 Q4 on Sep 29, 2024
auxten reopened this on Sep 29, 2024
auxten self-assigned this on Sep 29, 2024
auxten changed the title from "add support for arrow stream" to "Add support for arrow stream" on Sep 29, 2024