Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added support for a dtypes parameter #195

Merged
merged 7 commits into from
Mar 4, 2024
Merged

Conversation

lukapeschke
Copy link
Collaborator

closes #173

@lukapeschke lukapeschke added enhancement New feature or request ✋ need review ✋ 🐍 python 🐍 Pull requests that edit Python code 🦀 rust 🦀 Pull requests that edit Rust code labels Mar 2, 2024
@lukapeschke lukapeschke requested a review from PrettyWood March 2, 2024 15:28
@lukapeschke lukapeschke self-assigned this Mar 2, 2024
@lukapeschke
Copy link
Collaborator Author

Sorry for the delay on this one, I was AFK for the last couple of days

@PrettyWood
Copy link
Member

PrettyWood commented Mar 3, 2024

I tried to play with it. Overall looks good!
the dtypes override with datetime with "duration" seems weird

ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "duration"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered    ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---          ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ duration[ms] ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …            ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ null         ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ null         ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ null         ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ null         ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ null         ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘

compared to

ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "string"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---       ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ str       ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪═══════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …         ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ 5101397   ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ 5187825   ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ 5273201   ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ 5352078   ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ 5427706   ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "datetime"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered             ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---                   ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ datetime[ms]          ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪═══════════════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …                     ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ +15867-02-23 00:00:00 ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ +16103-10-12 00:00:00 ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ +16337-07-13 00:00:00 ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ +16553-06-27 00:00:00 ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ +16760-07-20 00:00:00 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────────────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "date"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered    ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---          ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ date         ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …            ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ +15867-02-23 ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ +16103-10-12 ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ +16337-07-13 ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ +16553-06-27 ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ +16760-07-20 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘

But maybe we can merge the PR like that and dig a bit more into "duration", "datetime" support with format, mix of strings + dates...
WDTY @lukapeschke ?

@@ -27,6 +29,9 @@
)
from ._fastexcel import read_excel as _read_excel

DType = Literal["null", "int", "float", "string", "boolean", "datetime", "date", "duration"]
DTypeMap: TypeAlias = "dict[str, DType] | dict[int, DType]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to work with my tests

ipdb> excel_file.load_sheet_by_idx(1, dtypes={"__UNNAMED__1": "boolean"}).to_polars()
shape: (24, 12)
┌────────────────────────┬──────────────┬──────────────────────────────┬──────────────┬───┬──────────────┬──────────────┬───────────────┬───────┐
│ Phone Economics Model  ┆ __UNNAMED__1 ┆ Asset Type:                  ┆ __UNNAMED__3 ┆ … ┆ __UNNAMED__8 ┆ __UNNAMED__9 ┆ __UNNAMED__10 ┆ ALL   │
│ ---                    ┆ ---          ┆ ---                          ┆ ---          ┆   ┆ ---          ┆ ---          ┆ ---           ┆ ---   │
│ str                    ┆ bool         ┆ str                          ┆ str          ┆   ┆ null         ┆ null         ┆ null          ┆ str   │
╞════════════════════════╪══════════════╪══════════════════════════════╪══════════════╪═══╪══════════════╪══════════════╪═══════════════╪═══════╡
│ null                   ┆ null         ┆ Forecast Error (less than):  ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ COO   │
│ null                   ┆ null         ┆ Production Time (Between):   ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ DELCO │
│ null                   ┆ null         ┆ Store Call Volume (Between): ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ RBD   │
│ null                   ┆ null         ┆ Dine-In Volume (Less than):  ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ RR    │
│ null                   ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ …                      ┆ …            ┆ …                            ┆ …            ┆ … ┆ …            ┆ …            ┆ …             ┆ …     │
│ All calls Instore      ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Busy Call Forward      ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Call Forward No Answer ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Promise Time           ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ 100% Call Center       ┆ null         ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
└────────────────────────┴──────────────┴──────────────────────────────┴──────────────┴───┴──────────────┴──────────────┴───────────────┴───────┘
ipdb> excel_file.load_sheet_by_idx(1, dtypes={1: "boolean"}).to_polars()
shape: (24, 12)
┌────────────────────────┬───────────────────────────────────┬──────────────────────────────┬──────────────┬───┬──────────────┬──────────────┬───────────────┬───────┐
│ Phone Economics Model  ┆ __UNNAMED__1                      ┆ Asset Type:                  ┆ __UNNAMED__3 ┆ … ┆ __UNNAMED__8 ┆ __UNNAMED__9 ┆ __UNNAMED__10 ┆ ALL   │
│ ---                    ┆ ---                               ┆ ---                          ┆ ---          ┆   ┆ ---          ┆ ---          ┆ ---           ┆ ---   │
│ str                    ┆ str                               ┆ str                          ┆ str          ┆   ┆ null         ┆ null         ┆ null          ┆ str   │
╞════════════════════════╪═══════════════════════════════════╪══════════════════════════════╪══════════════╪═══╪══════════════╪══════════════╪═══════════════╪═══════╡
│ null                   ┆ null                              ┆ Forecast Error (less than):  ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ COO   │
│ null                   ┆ null                              ┆ Production Time (Between):   ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ DELCO │
│ null                   ┆ null                              ┆ Store Call Volume (Between): ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ RBD   │
│ null                   ┆ null                              ┆ Dine-In Volume (Less than):  ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ RR    │
│ null                   ┆ null                              ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ …                      ┆ …                                 ┆ …                            ┆ …            ┆ … ┆ …            ┆ …            ┆ …             ┆ …     │
│ All calls Instore      ┆ Store answers all calls (no call… ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Busy Call Forward      ┆ Calls are forwarded to the call … ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Call Forward No Answer ┆ Calls are forwarded to the call … ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ Promise Time           ┆ All calls are forwarded after th… ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
│ 100% Call Center       ┆ Call center answers all calls (M… ┆ null                         ┆ null         ┆ … ┆ null         ┆ null         ┆ null          ┆ null  │
└────────────────────────┴───────────────────────────────────┴──────────────────────────────┴──────────────┴───┴──────────────┴──────────────┴───────────────┴───────┘
  • I reckon we could support dict[str | int, Dtype] for unnamed columns (instead of recalling to write __UNNAMED__1. ..)

@@ -64,6 +69,11 @@ def available_columns(self) -> list[str]:
"""The columns available for the given sheet"""
return self._sheet.available_columns

@property
def dtypes(self) -> DTypeMap | None:
"""The dtypes specified for the sheet"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have all the dtypes? (not only the overridden ones)?

@lukapeschke
Copy link
Collaborator Author

Hi @PrettyWood thanks for the testing!

About your remarks:

  • the dtypes override with datetime with "duration" seems weird

    I guess it's because we're still using as_duration, which does not seem to work with DateTimeIso, but we'd need some debug prints for this. Or it might be an overflow. But I don't know if converting a datetime to a duration really makes sense, does it ?

  • could we have all the dtypes? (not only the overridden ones)?

    Yes, seems reasonable, I also considered it. My proposal would be to rename the property to specified_dtypes, and add a def get_dtypes(self) -> DTypeMap method, since dtype computation might be expensive, WDYT ?

  • It does not seem to work with my tests

    Indeed, the behaviour is not clear. The thing is that here, I went with "if columns are specified by name or not specified, go with name, otherwise got with index". For now, the solution would probably be a third closure in the case of SelectedColumns::All, where we would try by name, then by index, and finally try to guess.

  • I reckon we could support dict[str | int, Dtype] for unnamed columns (instead of recalling to write __UNNAMED__1. ..)

    In another PR, I'd like to adapt SelectedColumns to allow for mixed indices and names, and implement that for dtype selection as well. The sheet's columns would probably be stored as a struct containing their name and index (and maybe have a distinction between the provided name of guessed name ? or have an enum for that, something like "provided" | "looked_up" | "generated_unnamed"). This would allow for less computations in the end, and more information being exposed to the user. What's your opinion on this ?
    In any case, I think this would rather be something for v0.11.0, since there already have been a lot of changes for v0.10.0.

@PrettyWood
Copy link
Member

PrettyWood commented Mar 4, 2024

But I don't know if converting a datetime to a duration really makes sense, does it ?

Can probably be tackled in another PR (if needed)

Yes, seems reasonable, I also considered it. My proposal would be to rename the property to specified_dtypes, and add a def get_dtypes(self) -> DTypeMap method, since dtype computation might be expensive, WDYT ?

Yes, perfect

Indeed, the behaviour is not clear

Not sure if I was explicit enough. If I write dtypes = {1: "boolean"} I expect the second column to be casted. It doesn't work. It works with __UNNAMED__1 though

In another PR

Completely fine with me. Was just a remark. Your proposal seems ok. We can probably write that in an issue, discuss the implementation and then open the PR

@lukapeschke
Copy link
Collaborator Author

@PrettyWood 94e59d4 allows for dtypes to be specified by name or index in case all columns are selected and ef55f27 renames the dtypes property to specified_dtypes.

In the end, I haven't added the dtypes method, as #198 will provide that information

@lukapeschke lukapeschke merged commit 5b70648 into main Mar 4, 2024
22 checks passed
@lukapeschke lukapeschke deleted the support-dtype-param branch March 4, 2024 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🦀 rust 🦀 Pull requests that edit Rust code enhancement New feature or request 🐍 python 🐍 Pull requests that edit Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add dtype-like param to enforce a dtype for a given column
2 participants