feat(excelsheet): add support for multi-dtype columns #164

lukapeschke · 2024-01-31T11:10:37Z

closes #160

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-01-31T11:37:03Z

Performance impact

As expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open to a refactoring that makes this optional.

The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike).

Some ideas that could help to mitigate this:

Allowing the dtypes to be specified (as suggested in #158, "parameter for dtype override option AND/OR better inference"), which would allow us to skip the dtype guessing step entirely
Allowing a maximum number of rows to sample for dtype guessing to be specified (for example, on my 280k-row spreadsheet, reading the first 1000 would be sufficient)
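
To make the second idea concrete, here is a minimal Python sketch of dtype guessing over a limited sample of rows. It is only an illustration and not the Rust code from this PR: the function name `infer_column_dtype`, the `sample_rows` argument, and the string dtype labels are all hypothetical, and the coercion rules (int + float becomes float, any other mix falls back to string, nulls are ignored) are assumptions chosen to match the test cases discussed in the review.

```python
from typing import Any, Iterable, Optional


def infer_column_dtype(values: Iterable[Any], sample_rows: Optional[int] = None) -> str:
    """Guess a single dtype for a column that may mix several cell types.

    Hypothetical helper: names and rules are assumptions, not fastexcel's API.
    """
    seen = set()
    for index, value in enumerate(values):
        if sample_rows is not None and index >= sample_rows:
            break  # stop sampling: later rows do not influence the guess
        if value is None:
            continue  # nulls are treated as compatible with every dtype
        if isinstance(value, bool):  # checked first because bool subclasses int
            seen.add("bool")
        elif isinstance(value, int):
            seen.add("int")
        elif isinstance(value, float):
            seen.add("float")
        else:
            seen.add("string")

    if not seen:
        return "null"  # only nulls were sampled
    if len(seen) == 1:
        return next(iter(seen))  # homogeneous column
    if seen == {"int", "float"}:
        return "float"  # ints can be widened to floats
    return "string"  # assumed fallback for every other mix


print(infer_column_dtype([1, 2.5, None]))
print(infer_column_dtype([1, 2, "CASE-NO-0A60"], sample_rows=2))
```

The second call shows the trade-off of sampling: with only two rows inspected, the trailing text value is never seen and the column is guessed as an integer column, so a sample size that is too small can produce a schema that the rest of the data does not fit.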

Benchmarks

Using the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows):

import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    # Load every sheet and convert it to Arrow so that the whole file is parsed
    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()


if __name__ == "__main__":
    main()

Before

After

After, without the tweak in `create_string_array`

lukapeschke · 2024-01-31T11:43:42Z

cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :)

ldacey · 2024-01-31T14:22:43Z

Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file.

import polars as pl
import tempfile

case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]

df = pl.DataFrame({'test': test, 'case_number': case_numbers})

df[622, "case_number"] = "CASE-NO-0A60"

with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
    filename = temp_file.name
    df.write_csv(filename)
    pl.read_csv(filename, n_rows=1, infer_schema_length=623)

PrettyWood · 2024-01-31T18:33:14Z

I'll review when I'm back from holidays @lukapeschke
Super glad fastexcel has been added to polars ecosystem as we first planned ;)
Great job @lukapeschke @alexander-beedie

lukapeschke · 2024-02-01T09:30:58Z

@PrettyWood sure no worries 🙂 I've released 0.8.0 in the meantime so that people can try out the Windows wheels

PrettyWood · 2024-02-03T14:30:52Z

src/utils/arrow.rs

+    // pure string
+    #[case(4, 5, ArrowDataType::Utf8)]
+    // pure int + float
+    #[case(2, 4, ArrowDataType::Float64)]


Note to myself: add int + null

lukapeschke · 2024-02-09T17:07:14Z

@PrettyWood last changes add:

Merge with main (calamine 0.24)
Added support for bools for multi-dtype columns
Adapted to recent calamine changes (stuff related to ExcelDateTime)
Added schema_sample_rows param

PrettyWood

LGTM apart from one question

PrettyWood · 2024-02-09T23:30:58Z

src/utils/arrow.rs

+fn string_types() -> &'static HashSet<ArrowDataType> {
+    STRING_TYPES_CELL.get_or_init(|| {
+        HashSet::from([
+            ArrowDataType::Int64,


Is the boolean + string case impossible?

It is probably possible, but string to bool conversion is not supported by calamine

lukapeschke added 2 commits January 31, 2024 11:40

feat(deps-dev): as rstest as a dev dependency

552bb19

Signed-off-by: Luka Peschke <[email protected]>

feat(excelsheet): add support for multi-dtype columns

0f409cf

closes #160 Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added the bug, enhancement, ✋ need review ✋, and 🦀 rust 🦀 labels Jan 31, 2024

lukapeschke self-assigned this Jan 31, 2024

Merge branch 'main' into multi-dtype-columns

f4791a6

fix: use as_f64 rather than get_float

368c4ee

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke requested a review from PrettyWood January 31, 2024 14:07

Merge branch 'main' into multi-dtype-columns

f8a21f3

This was referenced Feb 2, 2024

read_excel plus "calamine" engine issues when loading Excel data with some empty values pola-rs/polars#14174

Closed

read_excel(..., engine="calamine") returns column as all-null if first item is null pola-rs/polars#14224

Closed

PrettyWood reviewed Feb 3, 2024

lukapeschke added 2 commits February 5, 2024 10:50

Merge branch 'main' into multi-dtype-columns

eb2fc49

test: add null + int and null + int + float test case

025bdc1

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke mentioned this pull request Feb 6, 2024

Columns containing missing values will render entire column as NaN #169

Closed

Merge branch 'main' into multi-dtype-columns

1cab5d0

lukapeschke force-pushed the multi-dtype-columns branch from 520da62 to 1cab5d0 Compare February 9, 2024 15:48

lukapeschke added 3 commits February 9, 2024 16:51

feat: add support for bools when determining the dtype fo a column

27983dc

Signed-off-by: Luka Peschke <[email protected]>

feat: add support for int columns

e2f91a8

Signed-off-by: Luka Peschke <[email protected]>

feat: added a schema_sample_rows param

e4a69bc

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke requested a review from PrettyWood February 9, 2024 17:04

chore: doc

7d1bbea

PrettyWood reviewed Feb 9, 2024

PrettyWood approved these changes Feb 13, 2024

lukapeschke merged commit e243719 into main Feb 13, 2024
41 checks passed

lukapeschke deleted the multi-dtype-columns branch February 13, 2024 09:53

PrettyWood removed the ✋ need review ✋ label Feb 25, 2024
