feat(python): Support loading data from multiple Excel/ODS workbooks #20404

Merged

@alexander-beedie (Collaborator) commented Dec 22, 2024

Closes #20354.

Allows read_excel and read_ods to take a list of paths or a glob pattern in the "source" parameter. This makes it possible to load a given sheet from multiple workbooks into a single frame; for example, from directories of workbooks that hold the same sheet data for different dates.

Also: tidied up some "source" docstrings (rogue linebreaks), and renamed the "ScanSource" type to "FileSource" (as it isn't just used for scan funcs).

Example

Load the "data" sheet from all "trades" workbooks found in subdirs of the "2024" directory into a single DataFrame.

df = pl.read_excel("~/2024/**/trades*.xlsx", sheet_name="data")
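To check which workbooks a pattern like the one above would pick up before loading them, something along these lines may help. It is only a sketch using Python's standard glob module: it assumes (this PR does not confirm it) that read_excel expands "~" and treats "**" the way glob.glob does with recursive=True. The helper name expand_workbook_pattern is hypothetical.

```python
import glob
import os


def expand_workbook_pattern(pattern: str) -> list[str]:
    """Return the sorted paths matching a glob pattern (hypothetical helper).

    Assumption: "~" is expanded to the home directory and "**" matches any
    number of nested directories, as with glob.glob(..., recursive=True).
    """
    return sorted(glob.glob(os.path.expanduser(pattern), recursive=True))


# Preview the workbooks matched by the example pattern.
for path in expand_workbook_pattern("~/2024/**/trades*.xlsx"):
    print(path)
```

Since "source" also accepts a list, the paths returned here could be filtered and then passed to read_excel directly instead of the pattern.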

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Dec 22, 2024
@alexander-beedie alexander-beedie added the A-io-spreadsheet Area: reading/writing Excel/ODS files label Dec 22, 2024

codecov bot commented Dec 22, 2024

Codecov Report

Attention: Patch coverage is 85.29412% with 5 lines in your changes missing coverage. Please review.

Project coverage is 78.96%. Comparing base (676f10d) to head (fedb71f).
Report is 31 commits behind head on main.

Files with missing lines | Patch % | Lines
py-polars/polars/io/spreadsheet/functions.py | 84.84% | 3 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20404      +/-   ##
==========================================
- Coverage   79.13%   78.96%   -0.17%     
==========================================
  Files        1572     1562      -10     
  Lines      219839   220103     +264     
  Branches     2462     2486      +24     
==========================================
- Hits       173961   173811     -150     
- Misses      45310    45719     +409     
- Partials      568      573       +5     


@ritchie46 ritchie46 merged commit 62ebbe5 into pola-rs:main Dec 22, 2024
22 checks passed
@ritchie46 (Member) commented
Nice!

@alexander-beedie alexander-beedie deleted the read-excel-multiple-workbooks branch December 22, 2024 17:48
@ldacey commented Dec 24, 2024

Is it possible to add the filename when reading from multiple files similar to the scan_ methods? I have been using include_file_paths when reading csv/parquet/ndjson.

For Excel, I have been looping through files and concatenating:

            df = df.with_columns(pl.lit(path).alias("raw_file_path"))
            dfs.append(df)

I am not sure how this would work if I just passed a list or glob to the source argument, though. Normally there are fields in the blob metadata (GCS/Azure) that I want to add to the dataframe, so I make a dataframe of all of the metadata and join it back to the actual data using that file_paths column.

@alexander-beedie (Collaborator, Author) commented Dec 24, 2024

> Is it possible to add the filename when reading from multiple files similar to the scan_ methods? I have been using include_file_paths when reading csv/parquet/ndjson.

@ldacey: Sure, it's do-able; can you make this a proper feature request so it's easier to track?
If the request only exists in a closed PR it's likely to get lost ;)

Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature python Related to Python Polars
Successfully merging this pull request may close these issues.

Support glob paths in read_excel
4 participants