feat: allow to select a subset of columns #189

lukapeschke · 2024-02-25T16:36:09Z

closes #172 . Depends on #186

Signed-off-by: Luka Peschke <[email protected]>

closes #172 Signed-off-by: Luka Peschke <[email protected]>

python/fastexcel/__init__.py

PrettyWood

As always great tests and comments.

Makefile

Cargo.toml

PrettyWood · 2024-02-25T23:06:37Z

python/tests/test_column_selection.py

+
+    for sheet_name_or_idx in [0, "January"]:
+        for col in ["Month", "Year"]:
+            sheet = excel_reader_single_sheet.load_sheet(0, use_columns=[col])


load_sheet(sheet_name_or_idx)?
Same below?

woops good catch 8ff4192

PrettyWood · 2024-02-25T23:10:46Z

python/tests/test_column_selection.py

+def test_single_sheet_invalid_column_indices_column_does_not_exist_int(
+    excel_reader_single_sheet_with_unnamed_columns: fastexcel.ExcelReader,
+) -> None:
+    expected_message = """column at index 42 not found


Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width

Will add 👍

045a114 . I wonder wether the property should be called use_columns to match the parameter though 🤔

PrettyWood · 2024-02-25T23:17:19Z

src/utils/arrow.rs

+    for (idx, name) in column_names.iter().enumerate() {
+        // If we have an index for the given column, extract it and add it to the schema. Otherwise,
+        // just ignore it
+        if let Some(col_idx) = match selected_columns {


I reckon the code is not optimal performance-wise here but we probably don't care to build the schema

Yeah I was wondering if I should remove the TryFrom<&ExcelSheet> implem for RecordBatch to avoid re-instantiating column_names as well, but this isn't really a hot path

Yeah it troubles me a bit but we can go on with this for now and rethink this in the future.

src/types/excelsheet.rs

Signed-off-by: Luka Peschke <[email protected]>

…op in PR previews Signed-off-by: Luka Peschke <[email protected]>

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-02-26T12:39:56Z

@PrettyWood I'm looking at the implem for column ranges right now, and I'm thinking about wether column names should be loaded and validated on sheet instatiation. However, this would have an impactt:

ExcelSheet instantiation would become slightly slower and less efficient memory-wise, as data loading & validation would occur on instantiation.
Errors would occur earlier: the column selection would be validated on sheet instantiation rather than on conversion
column_names could become a sheet attribute
It might be possible that this would allow to make functions obtaining data by ref lazy (not sure about this one, depends if I can elide lifetimes, as they don't play well with PyO3: https://pyo3.rs/main/class.html?highlight=lifetime#no-lifetime-parameters . Maybe by making the reader owned ?)

Any thoughts about this ?

PrettyWood · 2024-02-26T12:57:19Z

Yeah this is what I wanted originally when I said "Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width"
I'd rather have a slight drop in performance-wise but better API and validation.
In my mind once you load the excelsheet you could have the range of column indices and names.
And if you select some they are directly validated at instantiation with better error messages (even with fuzzysearch to say "did you mean this column?" which is clearly not needed but why not haha)
Could we try to go with this approach and see the perf impact?

For your 4th point I have no idea to be honest. Clearly lifetimes are a mess with pyo3 (cf the discussion we had on bytes)

lukapeschke · 2024-02-26T13:44:49Z

@PrettyWood

script

import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()


if __name__ == "__main__":
    main()

before

invalid columns

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou
Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 24, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 20, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 65, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid
    1: could not create RecordBatch from sheet "INVOICES"
    2: could not convert RecordBatch to pyarrow for sheet "INVOICES"
python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.24s user 0.69s system 110% cpu 5.374 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.75s system 109% cpu 6.089 total

after

invalid

Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 23, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 19, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 132, in load_sheet_by_name
    self._reader.load_sheet_by_name(
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid, available columns are: [<column list, elided>]

python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.25s user 0.74s system 110% cpu 5.398 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx                
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.78s system 108% cpu 6.123 total

So the perf impact seems neglectible to me.

If the code LGTY, I'll expose the columns though an available_columns property. Souds good to you ?

Signed-off-by: Luka Peschke <[email protected]>

PrettyWood · 2024-02-26T13:48:25Z

Perfect 🎉

lukapeschke · 2024-02-26T13:51:25Z

Great, I'll do it later this afternoon then. Anything else I need to change ?

PrettyWood · 2024-02-26T14:03:03Z

Nope I'll test the final PR tonight and will probably merge. Great work

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added the 🔨 WIP 🔧 label Feb 25, 2024

lukapeschke self-assigned this Feb 25, 2024

lukapeschke added 5 commits February 25, 2024 19:54

feat(Makefile): split test and lint targets between rust and python

3949132

Signed-off-by: Luka Peschke <[email protected]>

fix(build): allow tests in pyo3 modules to be run

048405d

Signed-off-by: Luka Peschke <[email protected]>

refactor(build): Allow rust tests to spin up python interpreters

5e0bd76

Signed-off-by: Luka Peschke <[email protected]>

feat: introduced a ColumnNotFoundError exception

f044a51

Signed-off-by: Luka Peschke <[email protected]>

feat: allow to select a subset of columns

c5e6f68

closes #172 Signed-off-by: Luka Peschke <[email protected]>

lukapeschke force-pushed the select-subset-of-columns branch from c11fc28 to c5e6f68 Compare February 25, 2024 18:56

PrettyWood reviewed Feb 25, 2024

View reviewed changes

python/fastexcel/__init__.py Show resolved Hide resolved

lukapeschke marked this pull request as ready for review February 25, 2024 18:57

lukapeschke added ✋ need review ✋ 🐍 python 🐍 Pull requests that edit Python code 🦀 rust 🦀 Pull requests that edit Rust code and removed 🔨 WIP 🔧 labels Feb 25, 2024

lukapeschke changed the title ~~Select subset of columns~~ feat: allow to select a subset of columns Feb 25, 2024

PrettyWood reviewed Feb 25, 2024

View reviewed changes

lukapeschke added 7 commits February 26, 2024 11:49

fix: mypy should be in lint-python make target

309f8b4

Signed-off-by: Luka Peschke <[email protected]>

typo

3acd8c8

Signed-off-by: Luka Peschke <[email protected]>

fix(tests): use sheet_name_or_idx

8ff4192

Signed-off-by: Luka Peschke <[email protected]>

refactor: only check for py_list lenght once

bbb3542

Signed-off-by: Luka Peschke <[email protected]>

docs: documented use_columns param

36c054d

Signed-off-by: Luka Peschke <[email protected]>

ci: move check-docs jobs higher in the CI file to get the result on t…

f5a8d4d

…op in PR previews Signed-off-by: Luka Peschke <[email protected]>

feat: added a selected_columns property to ExcelSheet

045a114

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke requested a review from PrettyWood February 26, 2024 11:32

refactor: validate columns on sheet instantiation

1edc77a

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke force-pushed the select-subset-of-columns branch from ab49e55 to 1edc77a Compare February 26, 2024 13:47

feat: added an available_columns property

61cfc39

Signed-off-by: Luka Peschke <[email protected]>

PrettyWood approved these changes Feb 27, 2024

View reviewed changes

PrettyWood merged commit 5ac369e into main Feb 27, 2024
22 checks passed

PrettyWood deleted the select-subset-of-columns branch February 27, 2024 07:52

lukapeschke mentioned this pull request Feb 27, 2024

feat: support column ranges in string format in use_columns #190

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow to select a subset of columns #189

feat: allow to select a subset of columns #189

lukapeschke commented Feb 25, 2024

PrettyWood left a comment

PrettyWood Feb 25, 2024

lukapeschke Feb 26, 2024

PrettyWood Feb 25, 2024

lukapeschke Feb 26, 2024

lukapeschke Feb 26, 2024

PrettyWood Feb 25, 2024

lukapeschke Feb 26, 2024

PrettyWood Feb 26, 2024

lukapeschke commented Feb 26, 2024

PrettyWood commented Feb 26, 2024 •

edited

Loading

lukapeschke commented Feb 26, 2024

PrettyWood commented Feb 26, 2024

lukapeschke commented Feb 26, 2024

PrettyWood commented Feb 26, 2024

feat: allow to select a subset of columns #189

feat: allow to select a subset of columns #189

Conversation

lukapeschke commented Feb 25, 2024

PrettyWood left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukapeschke commented Feb 26, 2024

PrettyWood commented Feb 26, 2024 • edited Loading

lukapeschke commented Feb 26, 2024

script

before

invalid columns

valid

after

invalid

valid

PrettyWood commented Feb 26, 2024

lukapeschke commented Feb 26, 2024

PrettyWood commented Feb 26, 2024

PrettyWood commented Feb 26, 2024 •

edited

Loading