feat: Dask Support Implementation #484
Awesome effort, thanks @benrutter!
a few comments:
- I think a subclass or a decorator should be fine. I think there aren't that many if/then statements you had to add? Maybe a decorator is easier then. For shape, we can unify the implementations under len(self._native_series)
- merge conflicts: there are some merge conflicts, but it's mostly down to some things having been renamed:
  - PandasDataFrame -> PandasLikeDataFrame
  - ._series -> ._native_series
  - implementation == "dask" => implementation is Implementation.DASK
The major refactors are over (for pandas-like), I promise not to do this again. Fancy fixing up the merge conflicts?
@MarcoGorelli absolutely! I'll do some "tidy up" changes and then work through the merge conflicts.
Ok, I thought I'd sorted out the merge conflicts, and at least I'll have a bit more of a look some time soon.
awesome stuff! love the decorator a lot of
Ah thanks! That probably explains it then- I might just need to implement a few small bits over there then.
for more information, see https://pre-commit.ci
Ok, all tests passing bar one which I'll mention in a sec (and sorted out the latest conflicts). Looks like one of the tests which checks the API docs for stable.v1 is failing, but I'm not 100% sure of the intention behind it so don't want to mess too much without checking. Is the idea that I should go in and update the documentation so that v1 matches elsewhere?
yup, if you update the main namespace docs, you should also update the v1 docs
Awesome - thanks @MarcoGorelli for all the guidance! Looks like all the tests are passing now (I spotted they're failing on Python 3.8, but that just looks like the old type hinting API not being supported rather than anything new here)
thanks @benrutter !
left some comments, there's some pre-commit checks to fixup too, but awesome work here
dataframe_is_empty = (
    self._df._native_dataframe.empty
    if self._df._implementation != Implementation.DASK
    else len(self._df._native_dataframe) == 0
)
can we just use self._df.is_empty()?
narwhals/_pandas_like/utils.py
Outdated
if other._native_series.index is not index:
if (
    other._native_series.index is not index
    and other._implementation != Implementation.DASK
):
what if other._native_series.index is not index and other._implementation is Implementation.DASK? I think we need to raise an error message in that case
narwhals/_pandas_like/utils.py
Outdated
@@ -218,6 +298,9 @@ def set_axis(
kwargs["copy"] = False
else: # pragma: no cover
pass
if implementation == "dask":
same
narwhals/_pandas_like/utils.py
Outdated
@@ -449,6 +532,8 @@ def to_datetime(implementation: Implementation) -> Any:
return get_modin().to_datetime
if implementation is Implementation.CUDF:
return get_cudf().to_datetime
if implementation == "dask":
same
@@ -51,6 +57,17 @@ def maybe_get_modin_df(df_pandas: pd.DataFrame) -> Any:
return mpd.DataFrame(df_pandas.to_dict(orient="list"))

def maybe_get_dask_df(df_pandas: pd.DataFrame) -> Any:
where is this used?
Thanks! Yeah, just spotted pre-commit and it turned up a bunch of issues, I'll work through them plus your comments
Ah, quick update on this: I actually spotted that "get_dask()" (and get_modin()) only imports dask if it's already imported. Which makes total sense, but also meant that dask wasn't being included in the new tests at all, since I hadn't added an initial import statement anywhere. I've put one plus an ImportError suppression in the conftest, which has worked nicely (yaaay!) but that's also turned up a whole bunch of errors in the tests. I suspect it's a mix of:
I'll work through, and the failing tests actually give me better confidence that the final integration should be reliable and nicely tested. Might take a little time though, let me know if it'd be easier for me to close off the PR while I work on it and reopen another later or not.
thanks @benrutter for the update! up to you if you want to keep this PR or open a new one, whichever's easiest for you
i'll work on updating this, would quite like to see it, but I think it should back LazyFrame, not DataFrame? EDIT: ok, turns out this is going to be really complicated. putting it off for now, gonna see if I can catch up with a dask maintainer first
Oh yikes! Thanks for taking a look, I think it might look worse than it is because some of the testing stuff like .to_dict isn't supported. Maybe I'm being naively optimistic. I haven't taken a look at this for a little bit but I'm hoping to spend some time hacking on it again soon.
Closing off for now so that dask functionality can be implemented in line with #566 later. I'll keep playing around with the fork though.
Hello,
This is still a draft PR (although everything is in a working state, there's a very unperformant step I've put in that I do need to take out) but I'd love to get some feedback around making integration as sleek as possible!
I've implemented a bunch of logic with something along the lines of:
Which I think for the most part is fine because that's generally how other implementations like modin etc. are done? Not sure if there's any guidance around another way of doing this? (or maybe I should just put in some friendly comments where that happens to explain why dask is implemented differently). I actually found it pretty clear to follow along, but the alternative I could think of would be subclassing or something.
The other thing I've done (mentioned over on the Narwhals discord) is implement .collect() to .compute() a dask dataframe, which seems like good behaviour. Dask dataframes are a little like polars lazyframes, and computing them forces all data into a single-node dataframe.
There are also a few things where something just isn't possible with dask, for instance cross-joins, which dask doesn't support natively. Theoretically you could hack something in by adding a column that is always true to each dataframe, joining on that, then dropping the column, but I thought it was a better move to leave those marked as "not implemented" for dask (particularly as my understanding is dask intentionally doesn't implement cross joins for performance reasons).
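To show what that cross-join hack amounts to, here is a backend-agnostic sketch in plain Python, with rows as dicts rather than dask or pandas objects. The function name and the "__dummy_key__" column are made up for illustration, and it assumes the two inputs share no column names.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def cross_join_via_dummy_key(left: List[Row], right: List[Row]) -> List[Row]:
    """Emulate a cross join with an equi-join on a constant helper column."""
    key = "__dummy_key__"  # hypothetical helper column name
    # Step 1: add a column that is always True to both sides.
    left_keyed = [{**row, key: True} for row in left]
    right_keyed = [{**row, key: True} for row in right]
    # Step 2: equi-join on the helper column; every pair of rows matches.
    joined = [
        {**l_row, **r_row}
        for l_row in left_keyed
        for r_row in right_keyed
        if l_row[key] == r_row[key]
    ]
    # Step 3: drop the helper column again.
    return [{k: v for k, v in row.items() if k != key} for row in joined]


pairs = cross_join_via_dummy_key(
    [{"a": 1}, {"a": 2}],
    [{"b": "x"}, {"b": "y"}],
)
```

The result has len(left) * len(right) rows, and on a partitioned dataframe every row ends up under the same join key, which is part of why leaving it unimplemented seems reasonable.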
I was thinking it might be nice to put in a decorator along the lines of:
To nicely raise a NotImplementedError if that method gets called with one of those backends? Any thoughts?
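A minimal sketch of what such a decorator could look like. Everything here is hypothetical: the name not_implemented_for, the cut-down Implementation enum, and the ToyFrame class are stand-ins for illustration, not the code in this PR.

```python
from enum import Enum, auto
from functools import wraps
from typing import Any, Callable


class Implementation(Enum):
    # Stand-in enum with only the members needed for this sketch.
    PANDAS = auto()
    MODIN = auto()
    DASK = auto()


def not_implemented_for(*backends: Implementation) -> Callable[..., Any]:
    """Hypothetical decorator: raise NotImplementedError for the listed backends."""

    def decorator(method: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(method)
        def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
            # The wrapped object is assumed to expose `_implementation`.
            if self._implementation in backends:
                msg = (
                    f"`{method.__name__}` is not implemented for "
                    f"{self._implementation.name}"
                )
                raise NotImplementedError(msg)
            return method(self, *args, **kwargs)

        return wrapper

    return decorator


class ToyFrame:
    """Toy class standing in for a pandas-like wrapper."""

    def __init__(self, implementation: Implementation) -> None:
        self._implementation = implementation

    @not_implemented_for(Implementation.DASK)
    def cross_join(self) -> str:
        return "joined"
```

With this, ToyFrame(Implementation.PANDAS).cross_join() runs normally, while the same call on a ToyFrame(Implementation.DASK) raises NotImplementedError with the method and backend named in the message.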
It's my first time working with the Narwhals codebase - I've had a blast running through it! I'm probably not familiar with all the conventions etc., so please do let me know if I've missed something major.