
pandas-ai fails with polars.scan_parquet input #1344

Closed
cboettig opened this issue Aug 30, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@cboettig
System Info

Python 3.11, pandas-ai 2.2.14, Ubuntu 22.04

🐛 Describe the bug

I'm looking for a way to work with large parquet files without loading them entirely into RAM. This is already well supported by packages like duckdb and polars that this project uses, but I can't see how to make it work through the pandas-ai interface. For instance, the simple approach using polars.scan_parquet():

# All necessary imports at the beginning
import pandas as pd
import polars as pl
from pandasai import Agent
from pandasai.llm import OpenAI

# Sample DataFrame, written to disk so it can be scanned lazily
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
df.to_parquet("test.parquet")

# Lazily scan the parquet file instead of reading it into RAM
lazy_df = pl.scan_parquet("test.parquet")

# Instantiate an LLM
llm = OpenAI(api_token="YOUR_API_TOKEN")

agent = Agent([lazy_df], config={"llm": llm})
agent.chat('Which are the 5 happiest countries?')

Gives the error:

    196         connectors.append(PolarsConnector({"original_df": df}))
    198     else:
--> 199         raise ValueError(
    200             "Invalid input data. We cannot convert it to a dataframe."
    201         )
    202 except ImportError as e:
    203     raise ValueError(
    204         "Invalid input data. We cannot convert it to a dataframe."
    205     ) from e


dosubot bot commented Aug 30, 2024

To work with large parquet files using polars.scan_parquet without loading them entirely into RAM, and to avoid the error "Invalid input data. We cannot convert it to a dataframe.", you can modify the _load_df method in the PolarsConnector class to handle lazy loading with polars.scan_parquet. Here is how you can do it:

  1. Import polars at the beginning of the file.
  2. Modify the _load_df method to use polars.scan_parquet for parquet files.

Here is the updated code:

from typing import Union

import polars as pl

class PolarsConnector(BaseConnector):
    # ... other methods ...

    def _load_df(self, df: Union[pl.DataFrame, pl.Series, str, dict]):
        """
        Load the dataframe from a file or polars dataframe.

        Args:
            df (Union[pl.DataFrame, pl.Series, str, dict]): The dataframe to load.
        """
        polars_df = None
        if isinstance(df, pl.Series):
            polars_df = df.to_frame()
        elif isinstance(df, pl.DataFrame):
            polars_df = df
        elif isinstance(df, str):
            if df.endswith('.parquet'):
                # Scan the parquet file lazily instead of reading it eagerly
                polars_df = pl.scan_parquet(df)
            else:
                # FileImporter is the existing pandasai file-loading helper
                polars_df = FileImporter.import_from_file(df)
        elif isinstance(df, dict):
            try:
                polars_df = pl.DataFrame(df)
            except Exception as e:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                ) from e
        else:
            raise ValueError("Invalid input data. We cannot convert it to a dataframe.")

        # Only a LazyFrame has collect(); eager frames convert directly.
        if isinstance(polars_df, pl.LazyFrame):
            polars_df = polars_df.collect()
        self.pandas_df = polars_df.to_pandas()

This modification ensures that when a parquet file path is provided, it is lazily scanned with polars.scan_parquet, which avoids eagerly loading the entire file into RAM. The collect() method then materializes the lazy frame into a pandas DataFrame when it is needed.

To continue talking to Dosu, mention @dosu.

dosubot bot added the bug label Aug 30, 2024
@cboettig (Author)

The Dosubot is just annoying here. Brute-force coercing a lazy polars connection to a parquet file into a pandas dataframe defeats the purpose of using scan_parquet in the first place.
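To make the objection concrete, here is a minimal sketch (plain polars against the test.parquet from the reproduction above, no pandas-ai API): the lazy pipeline lets polars push the sort and limit down toward the scan, so only a small result ever needs to live in memory, while the suggested fix materializes the whole file first.

import polars as pl

# Lazy: the sort/limit are part of the query plan, so polars only has
# to keep a small result set in memory.
top5 = (
    pl.scan_parquet("test.parquet")
    .sort("happiness_index", descending=True)
    .head(5)
    .collect()
)

# Eager coercion (what the suggested _load_df fix does): the whole
# file is read into RAM and then copied again into a pandas DataFrame.
everything = pl.scan_parquet("test.parquet").collect().to_pandas()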


dosubot bot commented Nov 29, 2024

Hi, @cboettig. I'm Dosu, and I'm helping the pandas-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves a bug when using polars.scan_parquet() for large parquet files in the pandas-ai library.
  • A proposed solution was to modify the _load_df method in the PolarsConnector class for lazy loading.
  • You noted that converting a lazy polars connection to a pandas dataframe negates the benefits of scan_parquet.

Next Steps:

  • Please let me know if this issue is still relevant with the latest version of the pandas-ai repository by commenting here.
  • If there is no further activity, I will automatically close this issue in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label Nov 29, 2024

cboettig commented Nov 29, 2024

Thanks @dosu, but I think pandas-ai should support larger-than-RAM datasets. The underlying libraries like duckdb and polars already support this. The AI only needs the table schema, not the full dataset, to construct its queries. Small adjustments using existing mechanisms in duckdb or polars would let pandas-ai work on large parquet files that cannot be loaded into RAM.
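As an illustration of the schema-only point, a minimal sketch, assuming a recent polars where LazyFrame exposes collect_schema() (older versions expose a .schema property instead); this is not pandas-ai API, it just shows that the metadata an LLM needs is available without reading the data:

import polars as pl

lf = pl.scan_parquet("test.parquet")

# Resolving the schema touches only parquet metadata, not the rows:
# this is all the context an LLM needs to write a query.
print(lf.collect_schema())
# e.g. Schema({'country': String, 'gdp': Int64, 'happiness_index': Float64})

# The generated query can then run lazily, materializing only the answer.
answer = (
    lf.sort("happiness_index", descending=True)
    .select("country")
    .head(5)
    .collect()
)
print(answer)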

dosubot bot removed the stale label Nov 29, 2024

dosubot bot commented Feb 28, 2025

Hi, @cboettig. I'm Dosu, and I'm helping the pandas-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug with polars.scan_parquet in pandas-ai for handling large parquet files.
  • The initial suggestion to modify _load_df for lazy loading was criticized for negating scan_parquet benefits.
  • You emphasized the need for pandas-ai to support larger-than-RAM datasets, as supported by libraries like duckdb and polars.
  • The issue remains open, with a call for better support for large datasets.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of pandas-ai by commenting here.
  • If there is no further activity, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label Feb 28, 2025
dosubot bot closed this as not planned Mar 7, 2025
dosubot bot removed the stale label Mar 7, 2025