Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow to select a subset of columns #189

Merged
merged 14 commits into from
Feb 27, 2024
Merged

Conversation

lukapeschke
Copy link
Collaborator

closes #172 . Depends on #186

@lukapeschke lukapeschke force-pushed the select-subset-of-columns branch from c11fc28 to c5e6f68 Compare February 25, 2024 18:56
@lukapeschke lukapeschke marked this pull request as ready for review February 25, 2024 18:57
@lukapeschke lukapeschke added ✋ need review ✋ 🐍 python 🐍 Pull requests that edit Python code 🦀 rust 🦀 Pull requests that edit Rust code and removed 🔨 WIP 🔧 labels Feb 25, 2024
@lukapeschke lukapeschke changed the title Select subset of columns feat: allow to select a subset of columns Feb 25, 2024
Copy link
Member

@PrettyWood PrettyWood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always great tests and comments.

Makefile Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved

for sheet_name_or_idx in [0, "January"]:
for col in ["Month", "Year"]:
sheet = excel_reader_single_sheet.load_sheet(0, use_columns=[col])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_sheet(sheet_name_or_idx)?
Same below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops good catch 8ff4192

def test_single_sheet_invalid_column_indices_column_does_not_exist_int(
excel_reader_single_sheet_with_unnamed_columns: fastexcel.ExcelReader,
) -> None:
expected_message = """column at index 42 not found
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

045a114 . I wonder wether the property should be called use_columns to match the parameter though 🤔

Comment on lines +146 to +149
for (idx, name) in column_names.iter().enumerate() {
// If we have an index for the given column, extract it and add it to the schema. Otherwise,
// just ignore it
if let Some(col_idx) = match selected_columns {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon the code is not optimal performance-wise here but we probably don't care to build the schema

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was wondering if I should remove the TryFrom<&ExcelSheet> implem for RecordBatch to avoid re-instantiating column_names as well, but this isn't really a hot path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it troubles me a bit but we can go on with this for now and rethink this in the future.

src/types/excelsheet.rs Show resolved Hide resolved
src/types/excelsheet.rs Outdated Show resolved Hide resolved
@lukapeschke
Copy link
Collaborator Author

@PrettyWood I'm looking at the implem for column ranges right now, and I'm thinking about wether column names should be loaded and validated on sheet instatiation. However, this would have an impactt:

  • ExcelSheet instantiation would become slightly slower and less efficient memory-wise, as data loading & validation would occur on instantiation.
  • Errors would occur earlier: the column selection would be validated on sheet instantiation rather than on conversion
  • column_names could become a sheet attribute
  • It might be possible that this would allow to make functions obtaining data by ref lazy (not sure about this one, depends if I can elide lifetimes, as they don't play well with PyO3: https://pyo3.rs/main/class.html?highlight=lifetime#no-lifetime-parameters . Maybe by making the reader owned ?)

Any thoughts about this ?

@PrettyWood
Copy link
Member

PrettyWood commented Feb 26, 2024

Yeah this is what I wanted originally when I said "Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width"
I'd rather have a slight drop in performance-wise but better API and validation.
In my mind once you load the excelsheet you could have the range of column indices and names.
And if you select some they are directly validated at instantiation with better error messages (even with fuzzysearch to say "did you mean this column?" which is clearly not needed but why not haha)
Could we try to go with this approach and see the perf impact?

For your 4th point I have no idea to be honest. Clearly lifetimes are a mess with pyo3 (cf the discussion we had on bytes)

@lukapeschke
Copy link
Collaborator Author

@PrettyWood

script

import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()


if __name__ == "__main__":
    main()

before

invalid columns

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou
Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 24, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 20, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 65, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid
    1: could not create RecordBatch from sheet "INVOICES"
    2: could not convert RecordBatch to pyarrow for sheet "INVOICES"
python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.24s user 0.69s system 110% cpu 5.374 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.75s system 109% cpu 6.089 total

after

invalid

Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 23, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 19, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 132, in load_sheet_by_name
    self._reader.load_sheet_by_name(
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid, available columns are: [<column list, elided>]

python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.25s user 0.74s system 110% cpu 5.398 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx                
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.78s system 108% cpu 6.123 total

So the perf impact seems neglectible to me.

If the code LGTY, I'll expose the columns though an available_columns property. Souds good to you ?

@lukapeschke lukapeschke force-pushed the select-subset-of-columns branch from ab49e55 to 1edc77a Compare February 26, 2024 13:47
@PrettyWood
Copy link
Member

Perfect 🎉

@lukapeschke
Copy link
Collaborator Author

Great, I'll do it later this afternoon then. Anything else I need to change ?

@PrettyWood
Copy link
Member

Nope I'll test the final PR tonight and will probably merge. Great work

@PrettyWood PrettyWood merged commit 5ac369e into main Feb 27, 2024
22 checks passed
@PrettyWood PrettyWood deleted the select-subset-of-columns branch February 27, 2024 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🦀 rust 🦀 Pull requests that edit Rust code ✋ need review ✋ 🐍 python 🐍 Pull requests that edit Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Be able to select a subset of columns
2 participants