-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(excelsheet): add support for multi-dtype columns #164
Conversation
Signed-off-by: Luka Peschke <[email protected]>
closes #160 Signed-off-by: Luka Peschke <[email protected]>
Performance impactAs expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open for a refactoring to make this optional. The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike). Some ideas that could help to mitigate this:
BenchmarksUsing the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows): import argparse
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
for sheet_name in excel_file.sheet_names:
excel_file.load_sheet_by_name(sheet_name).to_arrow()
if __name__ == "__main__":
main() BeforeAfterAfter, without the tweak in
|
Signed-off-by: Luka Peschke <[email protected]>
cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :) |
Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file. import polars as pl
import tempfile
case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]
df = pl.DataFrame({'test': test, 'case_number': case_numbers})
df[622, "case_number"] = "CASE-NO-0A60"
with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
filename = temp_file.name
df.write_csv(filename)
pl.read_csv(filename, n_rows=1, infer_schema_length=623) |
I'll review when I'm back from holidays @lukapeschke |
@PrettyWood sure no worries 🙂 I've released 0.8.0 in the meantime so that people can try out the windows wheels |
// pure string | ||
#[case(4, 5, ArrowDataType::Utf8)] | ||
// pure int + float | ||
#[case(2, 4, ArrowDataType::Float64)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself: add int + null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Luka Peschke <[email protected]>
520da62
to
1cab5d0
Compare
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
@PrettyWood last changes add:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart one question
fn string_types() -> &'static HashSet<ArrowDataType> { | ||
STRING_TYPES_CELL.get_or_init(|| { | ||
HashSet::from([ | ||
ArrowDataType::Int64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the case boolean + string is impossible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably possible, but string to bool conversion is not supported by calamine
closes #160