
[BUG] Qualification Tool does not mark array_join as unsupported #1315

Open
amahussein opened this issue Aug 26, 2024 · 1 comment
Assignees: amahussein
Labels: bug (Something isn't working), core_tools (Scope the core module (scala))

Comments

@amahussein (Collaborator) commented:

Describe the bug

from pyspark.sql.functions import *
df = spark.createDataFrame(
    [(["a", "b", "a"], ["b", "c"]), (["a", "a"], ["b", "c"]), (["aa"], ["b", "c"])],
    ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(array_join(df.x, ".").alias("join")).collect()

The Qualification tool did not detect array_join as an unsupported operator.

CC: @viadea

@amahussein added the "? - Needs Triage", "bug", and "core_tools" labels on Aug 26, 2024
@amahussein self-assigned this on Aug 26, 2024
@amahussein (Collaborator, Author) commented on Aug 27, 2024

I created a unit test, and array_join is correctly marked as unsupported there.
Investigating the incident on a client's event log, it looks like the Spark UI shows array_join in the physical plan's "details" section, but the description is too long, so the SparkPlanGraphNode does not hold the entire description (it gets truncated).
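
For illustration (a made-up description, not copied from the client's event log), a Project whose expression list exceeds the field limit shows up in the graph node description with the tail elided:

  Project [c1#0, c2#1, c3#2, ... 47 more fields]

If array_join sits in the elided tail, the parser never sees it.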

The source code is org.apache.spark.sql.execution.ExecutionPage.scala
The physicalPlanDescription is built using org.apache.spark.sql.execution.QueryExecution.explainString

Note that there are two relevant Spark configurations (a sketch of raising both limits follows this list):

  • SQLConf.MAX_TO_STRING_FIELDS.key, which limits the number of fields rendered in a node's string representation.
  • SQLConf.MAX_PLAN_STRING_LENGTH.key, the maximum number of characters to output for a plan string.
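
As a minimal sketch of raising both limits, assuming you control how the SparkSession of the monitored application is created (the app name is made up for illustration):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("full-plan-descriptions")
    // SQLConf.MAX_TO_STRING_FIELDS ("spark.sql.debug.maxToStringFields"):
    // fields beyond this count are elided as "... N more fields".
    .config("spark.sql.debug.maxToStringFields", 1000L)
    // SQLConf.MAX_PLAN_STRING_LENGTH ("spark.sql.maxPlanStringLength"):
    // upper bound on the characters emitted for the whole plan string.
    .config("spark.sql.maxPlanStringLength", 10L * 1024 * 1024)
    .getOrCreate()

Raising these limits on the workload that produces the event log lets fuller descriptions reach the UI and the event log, but it is a workaround on the producer side, not a fix in the tool.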

The fix is not going to be quick or easy.
The idea is to revisit the SqlPlanParser. Each time a node is constructed, we need to:

  • call explainString() on the physical plan description;
  • build the graph node from SparkPlanInfo;
  • use the node ID to locate each node's description in the explainString output (see the sketch after this list);
  • modify the SqlPlanParser code to parse the output of the previous step instead of the graph node's description.
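
As a rough sketch of the node-ID lookup (PlanDescriptionIndex and nodeDescriptions are made-up names, not the actual SqlPlanParser change): in formatted explain output, each operator's detail block starts with a header such as "(4) Project [codegen id : 1]", so the full blocks can be indexed by the explain-side ID. Aligning those IDs with the graph node IDs is part of the remaining work.

  import scala.collection.mutable

  object PlanDescriptionIndex {
    // Detail blocks in formatted explain output begin with "(<id>) <name>".
    private val NodeHeader = raw"\((\d+)\)\s+(.+)".r

    /** Index each operator's full (untruncated) detail block by its ID. */
    def nodeDescriptions(formattedPlan: String): Map[Long, String] = {
      val blocks = mutable.LinkedHashMap.empty[Long, StringBuilder]
      var current: Option[Long] = None
      formattedPlan.split("\n").foreach {
        case NodeHeader(id, _) =>
          current = Some(id.toLong)
          blocks(id.toLong) = new StringBuilder
        case line =>
          // Attribute lines (Output, Input, Condition, ...) accumulate
          // under the most recently seen node header.
          current.foreach(id => blocks(id).append(line).append('\n'))
      }
      blocks.map { case (id, sb) => id -> sb.toString }.toMap
    }
  }

The parser could then scan each node's full block for unsupported expressions such as array_join instead of relying on the truncated graph node description. For reference, the unit test mentioned above: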
  test("array_join is not supported") {
    TrampolineUtil.withTempDir { outputLoc =>
      TrampolineUtil.withTempDir { eventLogDir =>
        val (eventLog, _) = ToolTestUtils.generateEventLog(eventLogDir, "arrayjoin") { spark =>
          import spark.implicits._
          val df = Seq(
            (List("a", "b", "c"), List("b", "c")),
            (List("a", "a"), List("b", "c")),
            (List("aa"), List("b", "c"))
          ).toDF("x", "y")
          df.write.parquet(s"$outputLoc/test_arrayjoin")
          val df2 = spark.read.parquet(s"$outputLoc/test_arrayjoin")
          val df3 = df2.withColumn("arr_join", array_join(col("x"), "."))
          df3.explain(true)
          df3
        }
        val app = createAppFromEventlog(eventLog)
        assert(app.sqlPlans.size == 2)
        val pluginTypeChecker = new PluginTypeChecker()
        val parsedPlans = app.sqlPlans.map { case (sqlID, plan) =>
          SQLPlanParser.parseSQLPlan(app.appId, plan, sqlID, "", pluginTypeChecker, app)
        }
        val allExecInfo = getAllExecsFromPlan(parsedPlans.toSeq)
        // array_join is evaluated inside a Project, so the single Project
        // exec should be reported as unsupported.
        val projectExecs = allExecInfo.filter(_.exec == "Project")
        assertSizeAndNotSupported(1, projectExecs)
      }
    }
  }
