feat: support casting to and from spark-like structs #1991

Open
wants to merge 9 commits into main
Conversation

@FBruzzesi (Member) commented Feb 11, 2025

Reason

There are multiple reasons for this PR to happen 😁

  • Eventually I would like to support Schema.to_pyspark
  • For some integrations it might be useful to have:
    • nw.struct emulating pl.struct
    • .struct.unnest() and/or Frame.unnest (see the Polars sketch below)
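
For reference, a minimal Polars sketch of the behaviour those APIs would emulate (pl.struct and DataFrame.unnest are existing Polars APIs; this is only an illustration, not the proposed narwhals interface):

import polars as pl

df = pl.DataFrame({"movie": ["Cars", "Toy Story"], "rating": [4.5, 4.9]})

# pl.struct packs several columns into a single Struct column...
packed = df.select(pl.struct("movie", "rating").alias("a"))
# packed.schema: "a" is Struct with fields movie (String) and rating (Float64)

# ...and DataFrame.unnest expands a Struct column back into its fields.
assert packed.unnest("a").columns == ["movie", "rating"]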

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I am having a hard time testing this 🤔

@FBruzzesi FBruzzesi added the enhancement New feature or request label Feb 11, 2025
@FBruzzesi FBruzzesi changed the title WIP, feat: support casting to and from spark-like structs feat: support casting to and from spark-like structs Feb 11, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review February 11, 2025 13:08
@MarcoGorelli (Member)

thanks!

I am having a hard time testing this 🤔

😄 Sorry, could you elaborate please?

@FBruzzesi (Member, Author) commented Feb 11, 2025

😄 Sorry, could you elaborate please?

Sure, sorry 😄

Ideally we would want to:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
-       backend in str(constructor) for backend in ("dask", "modin", "cudf", "pyspark")
+       backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):

However, PySpark converts the following input into a column of type MAP<STRING, STRING>:

data = {
    "a": [
        {"movie ": "Cars", "rating": 4.5},
        {"movie ": "Toy Story", "rating": 4.9},
    ]
}

and conversion via cast is not supported.
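
A minimal repro of that inference behaviour (a sketch; assumes an active SparkSession named spark and the data dict above):

# Passing Python dicts makes Spark infer MapType rather than StructType,
# and the heterogeneous values get coerced to a common (string) type.
rows = [(row,) for row in data["a"]]
native_df = spark.createDataFrame(rows, ["a"])
native_df.printSchema()
# roughly:
# root
#  |-- a: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: string (valueContainsNull = true)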

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

(Here is the link to the above test)

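For reference, a sketch of what that dedicated pyspark test could look like (assumes an active SparkSession named spark; the inner fields start as String on purpose and get cast through narwhals, mirroring the cast pattern above):

import narwhals as nw
import pyspark.sql.types as T

# Build a column that is already a proper Struct, bypassing dict inference.
schema = T.StructType([
    T.StructField("a", T.StructType([
        T.StructField("movie ", T.StringType()),
        T.StructField("rating", T.StringType()),  # deliberately String...
    ])),
])
native_df = spark.createDataFrame(
    [(("Cars", "4.5"),), (("Toy Story", "4.9"),)], schema=schema
)

# ...so casting to Float64 through narwhals actually changes a field's type.
dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])
result = nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
assert result.schema == {"a": dtype}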

@MarcoGorelli (Member)

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

sure thanks!

@FBruzzesi (Member, Author)

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the fields' types. Do you think that would be enough as a test?

sure thanks!

I had already forgotten 🙈 pushed now!

@osoucy (Contributor) commented Feb 15, 2025

Great work! I had done something very similar on my side!

For testing, however, I had a slightly different strategy. Instead of creating a new test, I used the existing test_cast_struct as follows:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
        backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):
        request.applymarker(pytest.mark.xfail)

    if "pandas" in str(constructor) and PANDAS_VERSION < (2, 2):
        request.applymarker(pytest.mark.xfail)

    data = {
        "a": [
            {"movie ": "Cars", "rating": 4.5},
            {"movie ": "Toy Story", "rating": 4.9},
        ]
    }

    dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])

    native_df = constructor(data)
    if "spark" in str(constructor):
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        native_df = native_df.withColumn("a", F.struct(
            F.col("a.movie ").alias("movie ").cast(T.StringType()),
            F.col("a.rating").alias("rating").cast(T.DoubleType()),
        ))

    result = (
        nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
    )
    assert result.schema == {"a": dtype}

As you can see, when the constructor is PySpark, we need to re-define the column "a" to force a StructType instead of a MapType, which is something you faced yourself.

However, I still had an issue when calling the last .collect(), as it uses df._collect_to_pyarrow(), which does not seem to support StructType. A plain df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

@FBruzzesi (Member, Author)

Great work! I had done something very similar on my side!

Thanks @osoucy, and I am sorry we ended up duplicating work 🥲

However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

Not really; locally your code runs without issues as well. If you fancy sharing your GitHub commit email, I can add you as a co-author.

@osoucy (Contributor) commented Feb 15, 2025

Here is my email: [email protected]

In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

@FBruzzesi (Member, Author)

Here is my email: [email protected]

The one used for commits should be something like: [email protected] (see how to find it)

In that case, it must be an issue with my specific environment python vs pyspark vs pyarrow version. I'm glad it's only me!

We did some refactoring and added new features; let us know if you keep having problems with the environment in the future 🤔

Successfully merging this pull request may close these issues:

[Enh]: cast expr in SparkLike