Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Various Schema improvements (equality/init dtype checks) #19379

Merged
merged 6 commits into from
Oct 23, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Oct 22, 2024

A few improvements (arguably fixes) for equality/inequality checks, and dtype init:

  • We require that a Schema is initialised with instantiated dtypes (and make a point of it in the docstring), but don't actually check that we are given such; this PR validates that nested dtypes are given in instantiated form (eg: pl.List(Float64), not just pl.List), as without the inner dtype we don't actually have a valid Schema. I haven't done a full recursive check (eg: walking into Struct fields), but can look at it in a second pass.

  • Uninstantiated non-nested types (such as Int64 that do not take modifying parameters) are allowed, and are automatically instantiated on the way in to guarantee that a Schema instance is always composed of instantiated dtypes. Those that take units/specifiers (such as Datetime) require specific user-side instantiation, as above.

  • Fixed equality check between Schema and dict objects; similar to the above, we could compare a Schema with a {name: dtype, …} dict and get potentially invalid results if the dictionary had uninstantiated dtypes.

  • Added a convenience base_types method, returning a {name: dtype_class, …} dict (as returned by the individual dtype equivalent base_type method). (update: removed the new method)

  • Optimised internal Schema init where the dtypes are already known to be valid (eg: we are initialising from internal state, so we know that we don't have to check them).

  • Drive-by; The "schema_overrides" param in read_excel should handle python dtypes (as this param does elsewhere), eg: schema_overrides={"col": float}.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Oct 22, 2024
@alexander-beedie alexander-beedie force-pushed the misc-schema-improvements branch 2 times, most recently from 8a5605b to da188e1 Compare October 22, 2024 13:19
Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 86.66667% with 6 lines in your changes missing coverage. Please review.

Project coverage is 80.19%. Comparing base (27289b2) to head (db59be1).
Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
py-polars/polars/schema.py 86.84% 3 Missing and 2 partials ⚠️
py-polars/polars/series/struct.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19379      +/-   ##
==========================================
+ Coverage   80.18%   80.19%   +0.01%     
==========================================
  Files        1523     1527       +4     
  Lines      209897   210268     +371     
  Branches     2434     2440       +6     
==========================================
+ Hits       168314   168633     +319     
- Misses      41028    41079      +51     
- Partials      555      556       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie alexander-beedie force-pushed the misc-schema-improvements branch 4 times, most recently from 4843133 to 873d39b Compare October 22, 2024 14:55
@alexander-beedie alexander-beedie force-pushed the misc-schema-improvements branch from 873d39b to ea52dce Compare October 22, 2024 15:31
@ritchie46
Copy link
Member

I do agree this is valuable, but I am afraid we will add features we cannot support when we move to Rust.

I want our schema to be a python wrapper around PolarsSchema, just like we do with DataFrame and Series. I think we should do that refactor first and then add functionality there.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 23, 2024

I do agree this is valuable, but I am afraid we will add features we cannot support when we move to Rust.

Yup, can remove the base_types method, no problem - was merely a minor convenience for lazy people (like myself) 😆

However, I think we still need to do the dtype-checks here, as we will always want to pass correct (and instantiated) dtypes into the PolarsSchema object, when exposed (which, you are quite right!, is where any significant new Schema functionality should live) 👍

@alexander-beedie alexander-beedie force-pushed the misc-schema-improvements branch from b4895ca to db59be1 Compare October 23, 2024 07:04
@alexander-beedie alexander-beedie changed the title feat(python): Various Schema improvements (new base_types method, improved equality/init dtype checks) feat(python): Various Schema improvements (improved equality/init dtype checks) Oct 23, 2024
@alexander-beedie alexander-beedie changed the title feat(python): Various Schema improvements (improved equality/init dtype checks) feat(python): Various Schema improvements (equality/init dtype checks) Oct 23, 2024
@ritchie46 ritchie46 merged commit c58fbcb into pola-rs:main Oct 23, 2024
21 checks passed
@alexander-beedie alexander-beedie deleted the misc-schema-improvements branch October 23, 2024 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants