Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Optimise read_excel when using "calamine" engine with the latest fastexcel #17735

Merged
merged 1 commit into from
Jul 24, 2024

Conversation

lukapeschke
Copy link
Contributor

Since worksheets are directly converted to polars, fastexcel's eager loading functions can be used. This makes fastexcel use borrowed types under the hood, which allows it to run faster and have a slightly lower memory footprint. See ToucanToco/fastexcel#147 (comment)

@lukapeschke lukapeschke changed the title feat(fastexcel): use load_sheet_eager on fastexcel >= 0.11 feat(python): use load_sheet_eager on fastexcel >= 0.11 Jul 19, 2024
@lukapeschke lukapeschke changed the title feat(python): use load_sheet_eager on fastexcel >= 0.11 feat(python): Use load_sheet_eager on fastexcel >= 0.11 Jul 19, 2024
@lukapeschke
Copy link
Contributor Author

I'm not really sure how to tackle the coverage issue here... Is there an easy way to run the CI with different versions of a dependency ?

@lukapeschke
Copy link
Contributor Author

depends on ToucanToco/fastexcel#260

@alexander-beedie alexander-beedie changed the title feat(python): Use load_sheet_eager on fastexcel >= 0.11 feat(python): Optimise read_excel when underlying fastexcel >= 0.11 Jul 19, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars and removed title needs formatting labels Jul 19, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 19, 2024

I'm not really sure how to tackle the coverage issue here... Is there an easy way to run the CI with different versions of a dependency ?

Unfortunately not... looks like this change can't be merged here until ToucanToco/fastexcel#260 is merged (and released) on your side though - unless you want to update this PR's fastexcel version check to additionally detect that the currently-failing parameter is being used, and fall-back if so? Otherwise we can park this in draft until it's good to go.

@alexander-beedie alexander-beedie added the A-io-spreadsheet Area: reading/writing Excel/ODS files label Jul 19, 2024
@lukapeschke lukapeschke marked this pull request as draft July 22, 2024 08:09
@lukapeschke lukapeschke force-pushed the eager-loading-fastexcel branch from 05c6856 to c7786c4 Compare July 22, 2024 12:22
@lukapeschke lukapeschke changed the title feat(python): Optimise read_excel when underlying fastexcel >= 0.11 feat(python): Optimise read_excel when underlying fastexcel is >= 0.11 Jul 22, 2024
@lukapeschke lukapeschke changed the title feat(python): Optimise read_excel when underlying fastexcel is >= 0.11 feat(python): Optimise read_excel when underlying fastexcel is >= 0.11.2 Jul 22, 2024
@lukapeschke lukapeschke marked this pull request as ready for review July 22, 2024 12:24
@lukapeschke lukapeschke force-pushed the eager-loading-fastexcel branch from c7786c4 to 8f76948 Compare July 22, 2024 14:12
@lukapeschke
Copy link
Contributor Author

@alexander-beedie I've updated the PR to check for fastexcel >= 0.11.2. However, I'm not sure on how to proceed to have coverage on both branches. Do you have a suggestion/preference ?

@lukapeschke lukapeschke marked this pull request as draft July 23, 2024 07:41
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 23, 2024

@alexander-beedie I've updated the PR to check for fastexcel >= 0.11.2. However, I'm not sure on how to proceed to have coverage on both branches. Do you have a suggestion/preference ?

Only real option is that you locally test the earlier version 😅

Also, looks like we have to wait for the 0.11.4 release for tests to be able to complete:

Run uv pip install --compile-bytecode -r requirements-dev.txt -r requirements-ci.txt
  × No solution found when resolving dependencies:
  ╰─▶ Because only the following versions of fastexcel are available:
          fastexcel<=0.11.2
          fastexcel==0.11.3
      and fastexcel==0.11.2 has no wheels are available with a matching
      platform, we can conclude that fastexcel>=0.11.2,<0.11.3 cannot be used.
      And because fastexcel==0.11.3 was yanked (reason: doesn't work on
      macos 14) and you require fastexcel>=0.11.2, we can conclude that the
      requirements are unsatisfiable.

@lukapeschke
Copy link
Contributor Author

Yup, I've marked the PR as draft until ToucanToco/fastexcel#270 is merged. These macos runner updates are a real pain 😕

All right, I'll run the tests locally with both versions then. I'll let you know when we're all good, thanks!

@lukapeschke lukapeschke force-pushed the eager-loading-fastexcel branch from 8f76948 to 5503c17 Compare July 23, 2024 15:43
Copy link

codecov bot commented Jul 23, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.49%. Comparing base (9eec688) to head (5503c17).
Report is 1 commits behind head on main.

Files Patch % Lines
py-polars/polars/io/spreadsheet/functions.py 50.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #17735   +/-   ##
=======================================
  Coverage   80.49%   80.49%           
=======================================
  Files        1503     1503           
  Lines      197027   197031    +4     
  Branches     2795     2796    +1     
=======================================
+ Hits       158589   158598    +9     
+ Misses      37918    37912    -6     
- Partials      520      521    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lukapeschke lukapeschke marked this pull request as ready for review July 23, 2024 16:25
@lukapeschke
Copy link
Contributor Author

@alexander-beedie macos seem to be finally working! I've ran the tests in tests/unit/io/test_spreadsheet.py with fastexcel 0.11.5 and fastexcel 0.10.4, and they passed with both versions 🙂 I couldn't find any other mention of calamine in the tests, but there are other test files I should run, please let me know!

Regarding the perf and memory gain, I've ran the following script on this branch with both fastexcel versions:

#!/usr/bin/env python3

import polars as pl
from sys import argv

pl.read_excel(argv[1], engine="calamine")

I got the following results with fastexcel 0.10.4:
fastexcel_0_10_4

And the following ones with 0.11.5:
fastexcel_0_11_5

@alexander-beedie alexander-beedie changed the title feat(python): Optimise read_excel when underlying fastexcel is >= 0.11.2 feat(python): Optimise read_excel when using "calamine" engine with the latest fastexcel Jul 24, 2024
@alexander-beedie alexander-beedie merged commit 4c19115 into pola-rs:main Jul 24, 2024
15 checks passed
@alexander-beedie
Copy link
Collaborator

Good stuff! 👍

@lukapeschke lukapeschke deleted the eager-loading-fastexcel branch July 24, 2024 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants