Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Use OrderedDict for schemas #11742

Merged
merged 1 commit into from
Oct 15, 2023
Merged

feat(python): Use OrderedDict for schemas #11742

merged 1 commit into from
Oct 15, 2023

Conversation

stinodego
Copy link
Contributor

I was writing some tests and found out that our schemas aren't ordered, e.g.:

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0]})
df_rev = df.select("bar", "foo")
assert df.schema == df_rev.schema   # passes ?!

The order of columns might not always be relevant, but it definitely is in some cases. And the schema shouldn't be considered equal if the column order is different.

The simple fix is to use OrderedDict for our schemas instead of a regular dict. Amazingly, switching doesn't cause any of our tests/lints to fail - only some doctests as the repr is different.

@alexander-beedie Anything I'm missing here, or is this just a quick win?

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Oct 14, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 14, 2023

@alexander-beedie Anything I'm missing here, or is this just a quick win?

Nope, looks like a quick win to me; it's a true dict subclass, so everything that currently expects a dict or mapping should remain happy. Regular Python dicts are guaranteed to be stable with respect to item insert order, but that doesn't imply anything else about the order (such as comparison equality, as shown above).

OrderedDict used to be dramatically slower (10x or more) than regular dict, but these days (I think since py3.5, but don't quote me on that!) it has a proper C implementation with worst-case performance difference being something more like ~2x, and for initial insert (which is all that would affect us) it's probably no worse than 25% slower, which is absolutely fine.

It does make me think I should finish-up a proper Python-side Schema dict/class for us though! I started one a while back, but let it sit around and get a bit stale. Can polish it up tomorrow and you can take a look? Don't let it stop you committing this though ;)

@stinodego
Copy link
Contributor Author

stinodego commented Oct 14, 2023

It does make me think I should finish-up a proper Python-side Schema dict/class for us though! I started one a while back, but let it sit around and get a bit stale.

There is a Schema struct on the Rust side. We should expose this through PyO3 and use that. It doesn't have much functionality yet, but we have some ideas to make it more useful (e.g. #10247).

That'll be a bit more involved than this PR though.

@alexander-beedie
Copy link
Collaborator

There is a Schema struct on the Rust side. We should expose this through PyO3 and use that. It doesn't have much functionality yet, but we have some ideas to make it more useful (e.g. #10247).

That'll be a bit more involved than this PR though.

Yup, for sure; no need to wait on anything more sophisticated here - merge away!
(I should go eyeball this Rust-side Schema struct :)

@stinodego stinodego merged commit 8ffbcaf into main Oct 15, 2023
@stinodego stinodego deleted the ordered-schema branch October 15, 2023 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants