pandas-ai fails with polars.scan_parquet input #1344
Comments
To work with large parquet files using `polars.scan_parquet`, here is the updated code:

```python
import polars as pl
from typing import Union


class PolarsConnector(BaseConnector):
    # ... other methods ...

    def _load_df(self, df: Union[PolarsDataFrame, PolarsSeries, str, dict]):
        """
        Load the dataframe from a file or polars dataframe.

        Args:
            df (Union[pl.DataFrame, pl.Series, str, dict]): The dataframe to load.
        """
        polars_df = None
        if isinstance(df, pl.Series):
            polars_df = df.to_frame()
        elif isinstance(df, pl.DataFrame):
            polars_df = df
        elif isinstance(df, str):
            if df.endswith(".parquet"):
                # Scan lazily instead of reading the whole file eagerly.
                polars_df = pl.scan_parquet(df)
            else:
                polars_df = FileImporter.import_from_file(df)
        elif isinstance(df, dict):
            try:
                polars_df = pl.DataFrame(df)
            except Exception as e:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                ) from e
        else:
            raise ValueError("Invalid input data. We cannot convert it to a dataframe.")

        # pl.scan_parquet returns a LazyFrame; only LazyFrames have .collect().
        if isinstance(polars_df, pl.LazyFrame):
            polars_df = polars_df.collect()
        self.pandas_df = polars_df.to_pandas()
```

This modification ensures that when a parquet file is provided, it is lazily loaded using `pl.scan_parquet`.
The Dosubot is just annoying here. Brute-force coercing a lazy polars connection to a parquet file into a pandas dataframe defeats the purpose of using `scan_parquet` in the first place.
Hi, @cboettig. I'm Dosu, and I'm helping the pandas-ai team manage their backlog. I'm marking this issue as stale. Issue Summary:
Next Steps:
Thank you for your understanding and contribution!
Thanks @dosu, but I think pandas-ai should support larger-than-RAM datasets. The underlying libraries like duckdb or polars already support this. The AI only needs the table schema, not the full dataset, to construct the queries. Small adjustments to existing mechanisms in duckdb or polars would allow pandas-ai to work on large parquet files that cannot be loaded into RAM.
System Info
python 3.11, pandas-ai 2.2.14, ubuntu 22.04
🐛 Describe the bug
I'm looking for a way to work with large parquet files without loading them entirely into RAM. This is already well supported by packages like duckdb and polars that this project uses, but I can't see how to make it work through the interface. For instance, the simple approach using polars.scan_parquet():
Gives the error: