feat: Add minimal PySpark support #908
Conversation
This PR diff is getting big because of all the …
For dask we started with its own test file, so that we didn't have to modify every other file. Would that be a good strategy again?
Seriously awesome and exciting work here, I'm impressed, well done!
Just left some comments / questions
Also, just noting something I noticed:
(Pdb) p df.with_columns(d=nw.col('a').mean()).collect().to_native()
*** pyspark.errors.exceptions.captured.AnalysisException: [MISSING_GROUP_BY] The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses.;
Aggregate [a#1L, b#2L, c#3L, avg(a#1L) AS d#62]
+- Project [a#1L, b#2L, c#3L]
+- Sort [index#0L ASC NULLS FIRST], true
+- Repartition 2, true
+- LogicalRDD [index#0L, a#1L, b#2L, c#3L], false
but we can deal with it later - I presume it shouldn't be too bad?
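A minimal sketch of the failure and one possible workaround (toy data; using an unpartitioned window to broadcast the aggregate is just an assumption on my side, not necessarily the fix we want):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 4), (2, 5), (3, 6)], ["a", "b"])

# Mixing an aggregate with non-aggregated columns raises
# AnalysisException [MISSING_GROUP_BY]:
# df.select("a", "b", F.mean("a").alias("d"))

# An unpartitioned window broadcasts the aggregate to every row instead:
df.withColumn("d", F.mean("a").over(Window.partitionBy())).show()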
Can we also add a check in test_pandas or test_polars that 'pyspark' isn't in sys.modules at the end of the test? For reference, narwhals/tests/no_imports_test.py, lines 31 to 45 in 80aad57:
def test_pandas(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.delitem(sys.modules, "polars")
    monkeypatch.delitem(sys.modules, "pyarrow")
    monkeypatch.delitem(sys.modules, "dask", raising=False)
    monkeypatch.delitem(sys.modules, "ibis", raising=False)
    df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
    nw.from_native(df, eager_only=True).group_by("a").agg(nw.col("b").mean()).filter(
        nw.col("a") > 1
    )
    assert "polars" not in sys.modules
    assert "pandas" in sys.modules
    assert "numpy" in sys.modules
    assert "pyarrow" not in sys.modules
    assert "dask" not in sys.modules
    assert "ibis" not in sys.modules
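Something like this would do it (a sketch; the two pyspark lines are the only addition, everything else is the quoted test):

def test_pandas(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.delitem(sys.modules, "polars")
    monkeypatch.delitem(sys.modules, "pyarrow")
    monkeypatch.delitem(sys.modules, "dask", raising=False)
    monkeypatch.delitem(sys.modules, "ibis", raising=False)
    monkeypatch.delitem(sys.modules, "pyspark", raising=False)  # new
    df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
    nw.from_native(df, eager_only=True).group_by("a").agg(nw.col("b").mean()).filter(
        nw.col("a") > 1
    )
    assert "polars" not in sys.modules
    assert "pandas" in sys.modules
    assert "numpy" in sys.modules
    assert "pyarrow" not in sys.modules
    assert "dask" not in sys.modules
    assert "ibis" not in sys.modules
    assert "pyspark" not in sys.modules  # new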
def get_column_name(df: SparkLazyFrame, column: Column) -> str:
    return str(df._native_frame.select(column).columns[0])
out of interest, is the str here just for typing purposes?
yes :)
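To illustrate (a hypothetical toy example): Spark assigns a name to the selected expression, and DataFrame.columns already returns plain Python strings, so the str call reassures the type checker rather than converting anything.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["a"])

print(df.select(F.col("a") + 1).columns[0])                # '(a + 1)'
print(df.select((F.col("a") + 1).alias("b")).columns[0])   # 'b'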
datetime_types = [
    pyspark_types.TimestampType,
    pyspark_types.TimestampNTZType,
]
if any(isinstance(dtype, t) for t in datetime_types):
    return dtypes.Datetime()
is there any time_unit / time_zone we should pass to dtypes.Datetime?
great point!
I think we cannot get it from the type itself but I need some more time to do some research.
I found some things pyspark does when converting to pandas that could be useful to us:
- https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L716
- https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L607
I still have to spend some time to understand it, but at a brief look it seems the time_unit is set to ns and the time_zone is set to the local timezone.
For arrow it seems to be different though https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L66
I don't have an opinion at the moment. What do you think?
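If we do thread it through, here is a rough sketch of one possible mapping, assuming the session timezone is the right thing for TimestampType and that microseconds (Spark's native timestamp precision) is a reasonable time_unit; the helper name is made up:

from pyspark.sql import types as pyspark_types

from narwhals import dtypes


def spark_timestamp_to_narwhals(dtype, session):  # hypothetical helper
    if isinstance(dtype, pyspark_types.TimestampType):
        # TimestampType is timezone-aware: values are stored as UTC and
        # rendered in the session timezone, which we could surface here.
        tz = session.conf.get("spark.sql.session.timeZone")
        return dtypes.Datetime(time_unit="us", time_zone=tz)
    if isinstance(dtype, pyspark_types.TimestampNTZType):
        # TimestampNTZType has no timezone by definition.
        return dtypes.Datetime(time_unit="us")
    msg = f"Unexpected dtype: {dtype}"
    raise TypeError(msg)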
sure, happy to defer this, not a blocker
amazing, thanks for updating! happy to merge once it's green as 🥦
Thank you all for the feedback, and apologies that this is taking longer; I've been a bit busy. The CI is almost green.
I need to add some if-else because of the issue with …
yup definitely!
Ok CI is 🌲🌳 Regarding that issue, I am still working on a fix 🥲
thanks @EdAbati for this massive effort
cool, let's ship it before merge conflicts creep in, excited about building this up!
Exciting, thanks!
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
As mentioned in the latest call, I've started working on support for PySpark.
The goal of this PR is to have a minimal initial implementation as a starting point. As we did for Dask, we can implement individual methods in follow-up PRs!