
feat: Add minimal PySpark support #908

Merged: 98 commits into narwhals-dev:main from pyspark, Dec 5, 2024
Conversation

@EdAbati (Collaborator) commented on Sep 3, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

As mentioned in the latest call, I've started working on support for PySpark.

The goal of this PR is to have a minimal initial implementation as a starting point. As we did for Dask, we can implement individual methods in follow-up PRs!

⚠️ This is not ready for review: a lot of tests are failing and the code is ugly. :) Just opening the PR for visibility and to have a place to comment/ask questions on specific points.

github-actions bot added the enhancement (New feature or request) label on Sep 3, 2024
@EdAbati changed the title from "feat: Add Pyspark support" to "feat: Add minimal Pyspark support" on Sep 3, 2024
@EdAbati (Collaborator, Author) commented on Sep 12, 2024

This PR diff is getting big because of all the xfails in tests. 😕
@MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable"? Or do you think it is fine?

@FBruzzesi (Member) replied:

> This PR diff is getting big because of all the xfails in tests. 😕 @MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable"? Or do you think it is fine?

For Dask we started with its own test file, so that we didn't have to modify every other file.
Once we had a few methods implemented, we shifted the constructor into the conftest list of constructors and added the xfails.

Would that be a good strategy again?
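A rough sketch of that pattern, assuming a conftest-style list of constructors (the helper and list names here are illustrative, not the PR's actual code):

```python
# Sketch only: how a PySpark constructor might be appended to the shared
# list of constructors in conftest.py.
import pandas as pd
from pyspark.sql import SparkSession


def pyspark_lazy_constructor(obj: dict):  # hypothetical helper name
    session = SparkSession.builder.getOrCreate()
    return session.createDataFrame(pd.DataFrame(obj))


constructors = [...]  # stand-in for the suite's existing constructors
constructors.append(pyspark_lazy_constructor)
```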

@EdAbati changed the title from "feat: Add minimal Pyspark support" to "feat: Add minimal PySpark support" on Nov 21, 2024
@MarcoGorelli (Member) left a review comment:

Seriously awesome and exciting work here, I'm impressed, well done!

Just left some comments / questions

Also, just noting something I noticed:

```
(Pdb) p df.with_columns(d=nw.col('a').mean()).collect().to_native()
*** pyspark.errors.exceptions.captured.AnalysisException: [MISSING_GROUP_BY] The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses.;
Aggregate [a#1L, b#2L, c#3L, avg(a#1L) AS d#62]
+- Project [a#1L, b#2L, c#3L]
   +- Sort [index#0L ASC NULLS FIRST], true
      +- Repartition 2, true
         +- LogicalRDD [index#0L, a#1L, b#2L, c#3L], false
```

but we can deal with it later - I presume it shouldn't be too bad?
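(One plausible direction for the later fix, purely a sketch and an assumption about the intended semantics: a window spanning the whole frame broadcasts the aggregate to every row, matching what `with_columns` does in Polars.)

```python
# Sketch only: a bare aggregate like F.mean("a") can't be selected alongside
# other columns in Spark, but wrapping it in a global window computes the
# mean once and attaches it to every row.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["a"])

# Window.partitionBy() with no columns spans the entire frame.
df = df.withColumn("d", F.mean("a").over(Window.partitionBy()))
df.show()
```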

Can we also add a check, in `test_pandas` or `test_polars`, that `pyspark` isn't in `sys.modules` at the end of the test? For reference, the current `test_pandas`:

```python
def test_pandas(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.delitem(sys.modules, "polars")
    monkeypatch.delitem(sys.modules, "pyarrow")
    monkeypatch.delitem(sys.modules, "dask", raising=False)
    monkeypatch.delitem(sys.modules, "ibis", raising=False)
    df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
    nw.from_native(df, eager_only=True).group_by("a").agg(nw.col("b").mean()).filter(
        nw.col("a") > 1
    )
    assert "polars" not in sys.modules
    assert "pandas" in sys.modules
    assert "numpy" in sys.modules
    assert "pyarrow" not in sys.modules
    assert "dask" not in sys.modules
    assert "ibis" not in sys.modules
```
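(Concretely, the suggested addition might look like this; an illustrative sketch mirroring the existing lines, not code from the PR:)

```python
# Hypothetical additions to test_pandas / test_polars:
monkeypatch.delitem(sys.modules, "pyspark", raising=False)  # with the other delitem calls
assert "pyspark" not in sys.modules  # with the other assertions
```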

(Three review threads on narwhals/_spark_like/utils.py were marked outdated and resolved.)


Review comment on:

```python
def get_column_name(df: SparkLazyFrame, column: Column) -> str:
    return str(df._native_frame.select(column).columns[0])
```
@MarcoGorelli (Member) commented:

out of interest, is the str here just for typing purposes?

@EdAbati (Collaborator, Author) replied:
yes :)
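(For context, an illustration of what the helper reads back; this example is not from the thread: PySpark derives a string name for any selected Column expression.)

```python
# Illustrative only: the column name PySpark assigns to an expression.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
print(df.select(F.col("a") + 1).columns[0])  # prints "(a + 1)"
```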

Comment on lines +47 to +52:

```python
datetime_types = [
    pyspark_types.TimestampType,
    pyspark_types.TimestampNTZType,
]
if any(isinstance(dtype, t) for t in datetime_types):
    return dtypes.Datetime()
```
@MarcoGorelli (Member) commented:

is there any time_unit / time_zone we should pass to dtypes.Datetime?

@EdAbati (Collaborator, Author) replied on Dec 4, 2024:

Great point!

I think we cannot get it from the type itself, but I need some more time to do some research.

I found some things PySpark does when converting to pandas that could be useful to us: I still have to spend time to understand it properly, but at a brief look it seems the time_unit is set to ns and the time_zone is set to the local timezone.

For Arrow it seems to be different though: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L66

I don't have an opinion at the moment. What do you think?
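(A speculative sketch of what such a mapping could look like; this is an assumption, not something settled in the thread. Spark timestamps have microsecond precision, TimestampType values are interpreted in the session time zone, and TimestampNTZType is naive.)

```python
# Sketch only: one possible way to carry time_unit / time_zone into
# dtypes.Datetime. `session_tz` would come from the Spark session config
# ("spark.sql.session.timeZone"); everything here is an assumption.
from pyspark.sql import types as pyspark_types


def to_datetime_dtype(dtype, dtypes, session_tz=None):
    if isinstance(dtype, pyspark_types.TimestampNTZType):
        # naive timestamp: no time zone attached
        return dtypes.Datetime(time_unit="us")
    if isinstance(dtype, pyspark_types.TimestampType):
        # tz-aware: Spark interprets these in the session time zone
        return dtypes.Datetime(time_unit="us", time_zone=session_tz)
    msg = f"Unexpected dtype: {dtype}"
    raise TypeError(msg)
```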

@MarcoGorelli (Member) replied on Dec 4, 2024:

sure, happy to defer this, not a blocker

(A review thread on narwhals/_spark_like/utils.py was resolved; another on narwhals/translate.py was marked outdated and resolved.)
@MarcoGorelli (Member) commented:

amazing, thanks for updating! happy to merge once it's green as πŸ₯¦

@EdAbati (Collaborator, Author) commented on Dec 4, 2024

Thank you all for the feedback, and apologies this is taking longer than expected: I am a bit busy.

The CI is almost green 😅 I need to add some if/else logic because pyspark is not available on Python > 3.11. Hopefully it will be fixed tonight.
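(A minimal sketch of the kind of guard meant here, reusing the hypothetical `constructors` list and `pyspark_lazy_constructor` helper from the sketch earlier in the thread; an assumption, not the PR's code:)

```python
import sys

# Sketch only: skip the PySpark constructor on Python versions for which
# pyspark wheels were not yet available (3.12+ at the time).
if sys.version_info < (3, 12):
    constructors.append(pyspark_lazy_constructor)
```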

Regarding the issue with `df.with_columns(d=nw.col('a').mean()).collect().to_native()`: I may have an idea on how to do it. I'll try to squeeze this in too (and add a test). If this takes longer, could it be a follow-up?

@MarcoGorelli (Member) replied:

> If this takes longer, could it be a follow-up?

yup definitely!

@EdAbati (Collaborator, Author) commented on Dec 5, 2024

Ok CI is 🌲🌳

Regarding that issue, I am still working on a fix πŸ₯²

@MarcoGorelli (Member) left a review comment:

thanks @EdAbati for this massive effort

cool, let's ship it before merge conflicts creep in; excited about building this up!

@MarcoGorelli merged commit ea278ae into narwhals-dev:main on Dec 5, 2024 (24 checks passed)
@EdAbati deleted the pyspark branch on Dec 5, 2024, 08:08
@EdAbati (Collaborator, Author) commented on Dec 5, 2024

Exciting πŸŽ‰πŸŽ‰πŸŽ‰ thanks!

Labels: enhancement (New feature or request)
Linked issue: [Enh]: Add Support For PySpark