Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(excelsheet): add support for multi-dtype columns #164

Merged
merged 12 commits into from
Feb 13, 2024

Conversation

lukapeschke
Copy link
Collaborator

closes #160

@lukapeschke lukapeschke added bug Something isn't working enhancement New feature or request ✋ need review ✋ 🦀 rust 🦀 Pull requests that edit Rust code labels Jan 31, 2024
@lukapeschke lukapeschke self-assigned this Jan 31, 2024
@lukapeschke
Copy link
Collaborator Author

Performance impact

As expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open for a refactoring to make this optional.

The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike).

Some ideas that could help to mitigate this:

  • Allowing the dtypes to be specified (as suggested in parameter for dtype override option AND/OR better inference  #158), which would allow us to skip the dtype guessing step entirely
  • Allowing to specify a maximal number of rows to sample for dtype guessing (for example, on my 280k rows spreadsheet, reading the first 1000 would be sufficient)

Benchmarks

Using the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows):

import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()


if __name__ == "__main__":
    main()

Before

before

After

after

After, without the tweak in create_string_array

without_match

@lukapeschke
Copy link
Collaborator Author

cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :)

@ldacey
Copy link

ldacey commented Jan 31, 2024

Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file.

import polars as pl
import tempfile

case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]

df = pl.DataFrame({'test': test, 'case_number': case_numbers})

df[622, "case_number"] = "CASE-NO-0A60"

with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
    filename = temp_file.name
    df.write_csv(filename)
    pl.read_csv(filename, n_rows=1, infer_schema_length=623)

@PrettyWood
Copy link
Member

PrettyWood commented Jan 31, 2024

I'll review when I'm back from holidays @lukapeschke
Super glad fastexcel has been added to polars ecosystem as we first planned ;)
Great job @lukapeschke @alexander-beedie

@lukapeschke
Copy link
Collaborator Author

@PrettyWood sure no worries 🙂 I've released 0.8.0 in the meantime so that people can try out the windows wheels

// pure string
#[case(4, 5, ArrowDataType::Utf8)]
// pure int + float
#[case(2, 4, ArrowDataType::Float64)]
Copy link
Member

@PrettyWood PrettyWood Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: add int + null

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukapeschke
Copy link
Collaborator Author

@PrettyWood last changes add:

  • Merge with main (calamine 0.24)
  • Added support for bools for multi-dtype columns
  • Adapted to recent calamine changes (stuff related to ExcelDateTime)
  • Added schema_sample_rows param

Copy link
Member

@PrettyWood PrettyWood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart one question

fn string_types() -> &'static HashSet<ArrowDataType> {
STRING_TYPES_CELL.get_or_init(|| {
HashSet::from([
ArrowDataType::Int64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the case boolean + string is impossible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably possible, but string to bool conversion is not supported by calamine

@lukapeschke lukapeschke merged commit e243719 into main Feb 13, 2024
41 checks passed
@lukapeschke lukapeschke deleted the multi-dtype-columns branch February 13, 2024 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dtype resolution is wrong for multi-dtype columns
3 participants