Read in process Python objects like Dataframe, Numpy or dict #211

Merged
26 commits merged on Jun 17, 2024

auxten (Member) commented Apr 12, 2024

This PR is at a very early stage. The implementation could change a lot before the final patch.

I'm keeping this PR open so that other projects can track the progress of "chDB on Pandas/NumPy..."

Related issues:

@auxten auxten added the Arrow Apache Arrow support label Apr 12, 2024
@auxten auxten self-assigned this Apr 12, 2024
@auxten auxten marked this pull request as draft April 12, 2024 09:41
auxten (Member, Author) commented Apr 29, 2024

Still working on it. The good news is that the prototype works. The Python API could look like the example below. Any suggestions?

#!python3

import chdb


class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
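        # `count` is ignored in this demo: the first call returns every row,
        # and later calls return [] once all rows have been consumed.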
        if self.cursor >= len(self.data["a"]):
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block


reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python('reader') GROUP BY b", "debug").show()

Output:

"tom",5
"auxten",9
"jerry",7
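
The demo reader above ignores `count` and hands back all rows at once. As a rough sketch of how a reader might honor `count` instead, the class below slices each column into blocks of at most `count` rows. It assumes the same `read(col_names, count)` contract as the prototype (one list per requested column, `[]` when the data is exhausted). `ChunkedReader` and `sum_by_key` are hypothetical names, and the class deliberately does not subclass `chdb.PyReader` so it runs without chDB installed. The final API may differ.

```python
from collections import defaultdict


class ChunkedReader:
    """Hypothetical stand-in for a chdb.PyReader subclass (no chdb import)."""

    def __init__(self, data):
        self.data = data
        self.cursor = 0
        # Assumption: every column has the same number of rows.
        self.total = len(next(iter(data.values()))) if data else 0

    def read(self, col_names, count):
        # Return at most `count` rows per call, one list per requested column.
        if count <= 0:
            raise ValueError("count must be positive")
        if self.cursor >= self.total:
            return []  # an empty list marks the end of the data, as in the demo
        end = min(self.cursor + count, self.total)
        block = [self.data[col][self.cursor:end] for col in col_names]
        self.cursor = end
        return block


def sum_by_key(reader, key_col, val_col, count=4):
    # Drain the reader block by block and aggregate, mimicking
    # SELECT key_col, sum(val_col) ... GROUP BY key_col in plain Python.
    totals = defaultdict(int)
    while True:
        block = reader.read([key_col, val_col], count)
        if not block:
            break
        keys, vals = block
        for k, v in zip(keys, vals):
            totals[k] += v
    return dict(totals)


reader = ChunkedReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)
result = sum_by_key(reader, "b", "a")
print(result)
```

With the sample data, the per-name sums match the query output above (tom 5, jerry 7, auxten 9), even though the rows arrive in two blocks rather than one.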

@auxten auxten marked this pull request as ready for review June 17, 2024 05:39
@auxten auxten merged commit eeb6b68 into main-23.10-20240617 Jun 17, 2024
1 of 2 checks passed
Labels
Arrow Apache Arrow support
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet