
pandas-ai fails with polars.scan_parquet input #1344

Closed
cboettig opened this issue Aug 30, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@cboettig
System Info

Python 3.11, pandas-ai 2.2.14, Ubuntu 22.04

🐛 Describe the bug

I'm looking for a way to work with large parquet files without loading them entirely into RAM. This is already well supported by packages like duckdb and polars that this project uses, but I can't see how to make it work through the pandas-ai interface. For instance, the simple approach using polars.scan_parquet():

# All necessary imports at the beginning
import pandas as pd
import polars as pl
from pandasai import Agent
from pandasai.llm import OpenAI

# Sample DataFrame, written to disk so it can be scanned lazily
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
df.to_parquet("test.parquet")

# Lazily scan the parquet file instead of reading it into RAM
lazy_df = pl.scan_parquet("test.parquet")

# Instantiate an LLM
llm = OpenAI(api_token="YOUR_API_TOKEN")

agent = Agent([lazy_df], config={"llm": llm})
agent.chat('Which are the 5 happiest countries?')

Gives the error:

    196         connectors.append(PolarsConnector({"original_df": df}))
    198     else:
--> 199         raise ValueError(
    200             "Invalid input data. We cannot convert it to a dataframe."
    201         )
    202 except ImportError as e:
    203     raise ValueError(
    204         "Invalid input data. We cannot convert it to a dataframe."
    205     ) from e


dosubot bot commented Aug 30, 2024

To work with large parquet files using polars.scan_parquet without loading them entirely into RAM, and to avoid the error "Invalid input data. We cannot convert it to a dataframe.", you can modify the _load_df method in the PolarsConnector class to handle lazy loading with polars.scan_parquet. Here is how you can do it:

  1. Import polars at the beginning of the file.
  2. Modify the _load_df method to use polars.scan_parquet for parquet files.

Here is the updated code:

from typing import Union

import polars as pl

class PolarsConnector(BaseConnector):
    # ... other methods ...

    def _load_df(self, df: Union[pl.DataFrame, pl.Series, str, dict]):
        """
        Load the dataframe from a file or polars dataframe.

        Args:
            df (Union[pl.DataFrame, pl.Series, str, dict]): The dataframe to load.
        """
        polars_df = None
        if isinstance(df, pl.Series):
            polars_df = df.to_frame()
        elif isinstance(df, pl.DataFrame):
            polars_df = df
        elif isinstance(df, str):
            if df.endswith('.parquet'):
                # Scan the parquet file lazily instead of reading it eagerly
                polars_df = pl.scan_parquet(df)
            else:
                # FileImporter is the existing pandasai file-loading helper
                polars_df = FileImporter.import_from_file(df)
        elif isinstance(df, dict):
            try:
                polars_df = pl.DataFrame(df)
            except Exception as e:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                ) from e
        else:
            raise ValueError("Invalid input data. We cannot convert it to a dataframe.")

        # Only a LazyFrame has collect(); eager frames convert directly.
        if isinstance(polars_df, pl.LazyFrame):
            polars_df = polars_df.collect()
        self.pandas_df = polars_df.to_pandas()

This modification ensures that when a parquet file path is provided, it is lazily scanned with polars.scan_parquet, which avoids eagerly loading the entire file into RAM. The collect() method then materializes the lazy frame into a pandas DataFrame when it is needed.

To continue talking to Dosu, mention @dosu.

dosubot bot added the bug label Aug 30, 2024
@cboettig (Author)

The Dosubot is just annoying here. Brute-force coercing a lazy polars connection to a parquet file into a pandas dataframe defeats the purpose of using scan_parquet in the first place.
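To make the objection concrete, here is a minimal sketch (plain polars against the test.parquet from the reproduction above, no pandas-ai API): the lazy pipeline lets polars push the sort and limit down toward the scan, so only a small result ever needs to live in memory, while the suggested fix materializes the whole file first.

import polars as pl

# Lazy: the sort/limit are part of the query plan, so polars only has
# to keep a small result set in memory.
top5 = (
    pl.scan_parquet("test.parquet")
    .sort("happiness_index", descending=True)
    .head(5)
    .collect()
)

# Eager coercion (what the suggested _load_df fix does): the whole
# file is read into RAM and then copied again into a pandas DataFrame.
everything = pl.scan_parquet("test.parquet").collect().to_pandas()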


dosubot bot commented Nov 29, 2024

Hi, @cboettig. I'm Dosu, and I'm helping the pandas-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves a bug when using polars.scan_parquet() for large parquet files in the pandas-ai library.
  • A proposed solution was to modify the _load_df method in the PolarsConnector class for lazy loading.
  • You noted that converting a lazy polars connection to a pandas dataframe negates the benefits of scan_parquet.

Next Steps:

  • Please let me know if this issue is still relevant with the latest version of the pandas-ai repository by commenting here.
  • If there is no further activity, I will automatically close this issue in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label Nov 29, 2024

cboettig commented Nov 29, 2024

Thanks @dosu, but I think pandas-ai should support larger-than-RAM datasets. The underlying libraries like duckdb and polars already support this. The AI only needs the table schema, not the full dataset, to construct its queries. Small adjustments using existing mechanisms in duckdb or polars would let pandas-ai work on large parquet files that cannot be loaded into RAM.
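As an illustration of the schema-only point, a minimal sketch, assuming a recent polars where LazyFrame exposes collect_schema() (older versions expose a .schema property instead); this is not pandas-ai API, it just shows that the metadata an LLM needs is available without reading the data:

import polars as pl

lf = pl.scan_parquet("test.parquet")

# Resolving the schema touches only parquet metadata, not the rows:
# this is all the context an LLM needs to write a query.
print(lf.collect_schema())
# e.g. Schema({'country': String, 'gdp': Int64, 'happiness_index': Float64})

# The generated query can then run lazily, materializing only the answer.
answer = (
    lf.sort("happiness_index", descending=True)
    .select("country")
    .head(5)
    .collect()
)
print(answer)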

dosubot bot removed the stale label Nov 29, 2024

dosubot bot commented Feb 28, 2025

Hi, @cboettig. I'm Dosu, and I'm helping the pandas-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug with polars.scan_parquet in pandas-ai for handling large parquet files.
  • The initial suggestion to modify _load_df for lazy loading was criticized for negating scan_parquet benefits.
  • You emphasized the need for pandas-ai to support larger-than-RAM datasets, as supported by libraries like duckdb and polars.
  • The issue remains open, with a call for better support for large datasets.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of pandas-ai by commenting here.
  • If there is no further activity, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label Feb 28, 2025
dosubot bot closed this as not planned Mar 7, 2025
dosubot bot removed the stale label Mar 7, 2025