feat: support casting to and from spark-like structs #1991

Open
wants to merge 9 commits into main
Conversation

@FBruzzesi (Member) commented Feb 11, 2025

Reason

There are multiple reasons for this PR to happen 😁

  • Eventually I would like to support Schema.to_pyspark
  • For some integrations it might be useful to have:
    • nw.struct emulating pl.struct
    • .struct.unnest() and/or Frame.unnest (see the Polars sketch below)
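
For reference, a minimal Polars sketch of the behaviour those APIs would emulate (pl.struct and DataFrame.unnest are existing Polars APIs; this is only an illustration, not the proposed narwhals interface):

import polars as pl

df = pl.DataFrame({"movie": ["Cars", "Toy Story"], "rating": [4.5, 4.9]})

# pl.struct packs several columns into a single Struct column...
packed = df.select(pl.struct("movie", "rating").alias("a"))
# packed.schema: "a" is Struct with fields movie (String) and rating (Float64)

# ...and DataFrame.unnest expands a Struct column back into its fields.
assert packed.unnest("a").columns == ["movie", "rating"]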

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I am having a hard time testing this 🤔

@FBruzzesi FBruzzesi added the enhancement New feature or request label Feb 11, 2025
@FBruzzesi FBruzzesi changed the title WIP, feat: support casting to and from spark-like structs feat: support casting to and from spark-like structs Feb 11, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review February 11, 2025 13:08
@MarcoGorelli (Member)

thanks!

I am having a hard time testing this 🤔

😄 Sorry, could you elaborate please?

@FBruzzesi (Member, Author) commented Feb 11, 2025

😄 Sorry, could you elaborate please?

Sure, sorry 😄

Ideally we would want to:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
-       backend in str(constructor) for backend in ("dask", "modin", "cudf", "pyspark")
+       backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):

However, PySpark converts the following input into a column of type MAP<STRING, STRING>:

data = {
    "a": [
        {"movie ": "Cars", "rating": 4.5},
        {"movie ": "Toy Story", "rating": 4.9},
    ]
}

and conversion via cast is not supported.
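
A minimal repro of that inference behaviour (a sketch; assumes an active SparkSession named spark and the data dict above):

# Passing Python dicts makes Spark infer MapType rather than StructType,
# and the heterogeneous values get coerced to a common (string) type.
rows = [(row,) for row in data["a"]]
native_df = spark.createDataFrame(rows, ["a"])
native_df.printSchema()
# roughly:
# root
#  |-- a: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: string (valueContainsNull = true)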

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

(Here is the link to the above test)

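For reference, a sketch of what that dedicated pyspark test could look like (assumes an active SparkSession named spark; the inner fields start as String on purpose and get cast through narwhals, mirroring the cast pattern above):

import narwhals as nw
import pyspark.sql.types as T

# Build a column that is already a proper Struct, bypassing dict inference.
schema = T.StructType([
    T.StructField("a", T.StructType([
        T.StructField("movie ", T.StringType()),
        T.StructField("rating", T.StringType()),  # deliberately String...
    ])),
])
native_df = spark.createDataFrame(
    [(("Cars", "4.5"),), (("Toy Story", "4.9"),)], schema=schema
)

# ...so casting to Float64 through narwhals actually changes a field's type.
dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])
result = nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
assert result.schema == {"a": dtype}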

@MarcoGorelli (Member)

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

sure thanks!

@FBruzzesi (Member, Author)

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

sure thanks!

I had already forgotten 🙈 pushed now!

@osoucy (Contributor) commented Feb 15, 2025

Great work! I had done something very similar on my side!

For testing, however, I had a slightly different strategy. Instead of creating a new test, I used the existing test_cast_struct as follows:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
        backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):
        request.applymarker(pytest.mark.xfail)

    if "pandas" in str(constructor) and PANDAS_VERSION < (2, 2):
        request.applymarker(pytest.mark.xfail)

    data = {
        "a": [
            {"movie ": "Cars", "rating": 4.5},
            {"movie ": "Toy Story", "rating": 4.9},
        ]
    }

    dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])

    native_df = constructor(data)
    if "spark" in str(constructor):
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        native_df = native_df.withColumn("a", F.struct(
            F.col("a.movie ").alias("movie ").cast(T.StringType()),
            F.col("a.rating").alias("rating").cast(T.DoubleType()),
        ))

    result = (
        nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
    )
    assert result.schema == {"a": dtype}

As you can see, when the constructor is PySpark, we need to re-define the column "a" to force a StructType instead of a MapType, which is something you faced yourself.

However, I still had an issue when calling the last .collect(), as it uses df._collect_to_pyarrow(), which does not seem to support StructType. A plain df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

@FBruzzesi (Member, Author)

Great work! I had done something very similar on my side!

Thanks @osoucy, and I am sorry we ended up duplicating work 🥲

However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

Not really; locally your code runs without issues as well. If you fancy sharing your GitHub commit email, I can add you as a co-author.

@osoucy (Contributor) commented Feb 15, 2025

Here is my email: [email protected]

In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

@FBruzzesi (Member, Author)

Here is my email: [email protected]

The one used for commits should be something like: [email protected] (see how to find it)

In that case, it must be an issue with my specific environment python vs pyspark vs pyarrow version. I'm glad it's only me!

We did some refactoring and added new features; let us know if you keep having problems with the environment in the future 🤔

Successfully merging this pull request may close these issues:

[Enh]: cast expr in SparkLike