
Fix inconsistent shuffle write time sum results in Profiler output #1450

Merged
merged 8 commits into NVIDIA:dev from spark-rapids-tools-1408 on Jan 2, 2025

Conversation

cindyyuanjiang
Collaborator

@cindyyuanjiang cindyyuanjiang commented Dec 5, 2024

Fixes #1408

Changes

  • In the TaskModel class, keep shuffle write time in nanoseconds
  • Convert to milliseconds only when generating the output

This improves the shuffle write time metrics output by avoiding the precision loss that came from converting nanoseconds to milliseconds per task and then summing the converted values. It also separates TaskModel from output reporting, so all metrics keep their original units until output generation.
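For illustration, a minimal Scala sketch of the precision difference (hypothetical values; not the actual TaskModel API):

```scala
import java.util.concurrent.TimeUnit

object ShuffleWriteTimePrecision {
  def main(args: Array[String]): Unit = {
    // Three tasks with sub-millisecond shuffle write times: 0.7, 0.8, 0.9 ms.
    val taskWriteTimesNs = Seq(700000L, 800000L, 900000L)

    // Old behavior: converting per task truncates each fractional millisecond.
    val convertThenSum = taskWriteTimesNs.map(TimeUnit.NANOSECONDS.toMillis).sum

    // New behavior: aggregate in nanoseconds, convert once at output time.
    val sumThenConvert = TimeUnit.NANOSECONDS.toMillis(taskWriteTimesNs.sum)

    println(s"convert-then-sum = $convertThenSum ms") // 0 ms
    println(s"sum-then-convert = $sumThenConvert ms") // 2 ms
  }
}
```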

Testing

  • Existing unit tests
  • Manually confirmed that the shuffle write time value is consistent in all places in the Profiler output

Before/After Values (shuffle write time sum)

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_jobmetricsaggmulti_expectation.csv
944 --> 1001
849 --> 901

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_sqlmetricsaggmulti_expectation.csv
944 --> 1001
849 --> 901

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_stagemetricsaggmulti_expectation.csv
397 --> 400
505 --> 508
42 --> 93
373 --> 376
473 --> 475
3 --> 50

Follow-up issue: #1481

@cindyyuanjiang cindyyuanjiang self-assigned this Dec 5, 2024
@cindyyuanjiang cindyyuanjiang added bug Something isn't working core_tools Scope the core module (scala) labels Dec 5, 2024
nartal1
nartal1 previously approved these changes Dec 9, 2024
Collaborator

@nartal1 nartal1 left a comment


LGTM. Thanks @cindyyuanjiang for the fix!
Nit: It would be nice to include the before and after values in the description. I understand that we can confirm the fix from the expected_files.

parthosa
parthosa previously approved these changes Dec 9, 2024
Collaborator

@parthosa parthosa left a comment


Thanks @cindyyuanjiang for this change.

@amahussein Unrelated, but should we take a similar approach for executorCpuTime and executorDeserializeCpuTime?

@cindyyuanjiang
Collaborator Author

Thanks @nartal1! Updated the before/after values in the PR description.

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Dec 9, 2024

Thanks @parthosa! Agreed, we should discuss the requirements for executorCpuTime and executorDeserializeCpuTime.

@amahussein
Collaborator

> Thanks @cindyyuanjiang for this change.
>
> @amahussein Unrelated, but should we take a similar approach for executorCpuTime and executorDeserializeCpuTime?

Thanks @parthosa. Yes, it would have been better to fix the inconsistency for the other metrics within this very PR, since the change is small compared to the overhead of filing another bug and then dealing with a new PR.

@amahussein
Collaborator

@cindyyuanjiang
Is this ready to merge? Or is there something you are still going to address?

Collaborator

@amahussein amahussein left a comment


The implementation is still not accurate, because we need to convert the units after all the tasks are aggregated at each level.

@@ -438,7 +438,7 @@ class AppSparkMetricsAnalyzer(app: AppBase) extends AppAnalysisBase(app) {
       val peakMemoryValues = tasksInStage.map(_.peakExecutionMemory)
       val shuffleWriteTime = tasksInStage.map(_.sw_writeTime)
       (AppSparkMetricsAnalyzer.maxWithEmptyHandling(peakMemoryValues),
-        shuffleWriteTime.sum)
+        TimeUnit.NANOSECONDS.toMillis(shuffleWriteTime.sum))
Collaborator


This still does not fix the problem because the conversion is done at the stage level.
The correct way is to convert after the metrics are aggregated at each level,
for example perStage/perSql/perJob.

Collaborator Author


The per-SQL and per-job results are computed from the cached per-stage results. Please correct me if I am wrong.

Collaborator


Correct!
But when we are aggregating perSql, this PR actually sums the stages per SQL after the time has been converted to milliseconds.
If we want to be more accurate, the cached per-stage results should stay in nanoseconds; the per-SQL value is then the sum in nanoseconds; and only at the end does it get converted to milliseconds.
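For illustration, a small sketch of that difference (hypothetical numbers; not the real cache types):

```scala
import java.util.concurrent.TimeUnit

object PerLevelAggregation {
  def main(args: Array[String]): Unit = {
    // Three stages belonging to one SQL, each with 1.5 ms of shuffle write time.
    val stageWriteTimesNs = Seq(1500000L, 1500000L, 1500000L)

    // Stage cache in ms: each stage truncates 0.5 ms before the per-SQL sum.
    val sqlSumFromMsCache = stageWriteTimesNs.map(TimeUnit.NANOSECONDS.toMillis).sum

    // Stage cache in ns: a single truncation at output time.
    val sqlSumFromNsCache = TimeUnit.NANOSECONDS.toMillis(stageWriteTimesNs.sum)

    println(s"per-SQL sum from ms cache = $sqlSumFromMsCache ms") // 3 ms
    println(s"per-SQL sum from ns cache = $sqlSumFromNsCache ms") // 4 ms
  }
}
```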

Collaborator Author


Understood, thanks @amahussein! I will address this now.

Collaborator Author


Discussed offline. We will keep the current implementation to avoid potential overflow if we aggregate nanosecond values at the SQL/job level; a rough sense of the headroom is sketched below.
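For scale, a back-of-the-envelope check of that overflow concern (illustrative numbers only):

```scala
import java.util.concurrent.TimeUnit

object NanosecondHeadroom {
  def main(args: Array[String]): Unit = {
    // A Long can hold about 9.22e18 ns, i.e. roughly 292 years of time.
    val maxSeconds = TimeUnit.NANOSECONDS.toSeconds(Long.MaxValue)
    println(s"Long capacity: $maxSeconds s, ~${maxSeconds / (365L * 24 * 3600)} years")

    // But a large aggregation can get there: e.g. 10 million tasks averaging
    // 1 hour of CPU time each would exceed Long.MaxValue nanoseconds.
    val totalNs = BigInt(10000000L) * TimeUnit.HOURS.toNanos(1)
    println(s"would overflow a Long: ${totalNs > BigInt(Long.MaxValue)}") // true
  }
}
```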

@cindyyuanjiang cindyyuanjiang dismissed stale reviews from parthosa and nartal1 via 8e03d94 December 17, 2024 02:33
@cindyyuanjiang
Collaborator Author

Applied the same approach for executorCpuTime and executorDeserializeCpuTime.

Signed-off-by: cindyyuanjiang <[email protected]>
}

override def convertToCSVSeq: Seq[String] = {
Seq(appIndex.toString, StringUtils.reformatCSVString(appId), rootsqlID.getOrElse("").toString,
sqlID.toString, durStr, containsDataset.toString, appDurStr,
StringUtils.reformatCSVString(potentialStr), execCpuTimePercent)
Collaborator Author


Updated format only for better readability.

@@ -950,14 +992,27 @@ case class SQLDurationExecutorTimeProfileResult(appIndex: Int, appId: String,
}

override def convertToSeq: Seq[String] = {
Seq(appIndex.toString, rootsqlID.getOrElse("").toString, appId, sqlID.toString, durStr,
containsDataset.toString, appDurStr, potentialStr, execCpuTimePercent)
Collaborator Author


Updated format only for better readability.

"resultSerializationTime_sum", "resultSize_max", "sr_fetchWaitTime_sum",
"sr_localBlocksFetched_sum", "sr_localBytesRead_sum", "sr_remoteBlocksFetched_sum",
"sr_remoteBytesRead_sum", "sr_remoteBytesReadToDisk_sum", "sr_totalBytesRead_sum",
"sw_bytesWritten_sum", "sw_recordsWritten_sum", "sw_writeTime_sum")
Collaborator Author


Updated format only for better readability.

@@ -924,12 +951,27 @@ case class IOAnalysisProfileResult(
}
}

case class SQLDurationExecutorTimeProfileResult(appIndex: Int, appId: String,
rootsqlID: Option[Long], sqlID: Long, duration: Option[Long], containsDataset: Boolean,
appDuration: Option[Long], potentialProbs: String,
Collaborator Author


Updated format only for better readability.

executorCpuRatio: Double) extends ProfileResult {
override val outputHeaders = Seq("appIndex", "App ID", "RootSqlID", "sqlID", "SQL Duration",
"Contains Dataset or RDD Op", "App Duration", "Potential Problems", "Executor CPU Time Percent")
Collaborator Author


Updated format only for better readability.

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Dec 17, 2024

@amahussein @parthosa @nartal1
Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?

@amahussein
Collaborator

> @amahussein @parthosa @nartal1 Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?

It seems more like a bug.

amahussein
amahussein previously approved these changes Dec 17, 2024
Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang
minor styling issue.

Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Dec 17, 2024

> @amahussein @parthosa @nartal1 Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?
>
> It seems more like a bug.

Thanks @amahussein! Filed issue: #1469

parthosa
parthosa previously approved these changes Dec 17, 2024
Collaborator

@parthosa parthosa left a comment


Thanks @cindyyuanjiang for this change. The discussion above about overflow concerns makes sense.

nartal1
nartal1 previously approved these changes Dec 17, 2024
Collaborator

@amahussein amahussein left a comment


> @amahussein @parthosa @nartal1 Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?
>
> It seems more like a bug.
>
> Thanks @amahussein! Filed issue: #1469

I am not sure we should fix the percentage in a follow-up issue. That would mean we fix the inconsistent view across two files while introducing another bug.

@cindyyuanjiang
Collaborator Author

> @amahussein @parthosa @nartal1 Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?
>
> It seems more like a bug.
>
> Thanks @amahussein! Filed issue: #1469
>
> I am not sure we should fix the percentage in a follow-up issue. That would mean we fix the inconsistent view across two files while introducing another bug.

I investigated this. It looks more like a rounding issue than a bug to me:

  1. executorRunTime is in milliseconds in its raw form while executorCpuTime is in nanoseconds, so executorRunTime may already have lost precision before we sum over all tasks.
  2. The runtime is very low where execCPURatio = 103.45: execCpuTime = 30 ms and execRunTime = 29 ms (worked through below).

WDYT? @amahussein @nartal1 @parthosa
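Working through the numbers from item 2 (a sketch; the variable names are hypothetical):

```scala
object CpuRatioExample {
  def main(args: Array[String]): Unit = {
    val execCpuTimeMs = 30L // from a nanosecond-precision sum, converted once
    val execRunTimeMs = 29L // already truncated to ms per task before summing

    // 30 / 29 * 100 = 103.448..., rounded to two decimals: 103.45 (> 100)
    val ratio = math.round(execCpuTimeMs.toDouble / execRunTimeMs * 100 * 100) / 100.0
    println(s"executor CPU time percent = $ratio")
  }
}
```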

@parthosa
Collaborator

parthosa commented Dec 20, 2024

> @amahussein @parthosa @nartal1 Question: after these changes, I see an Executor CPU Time Percent of 103.45 (> 100) in core/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv. Do we want to cap this ratio at 100.0, or is it okay to have percentages above 100?
>
> It seems more like a bug.
>
> Thanks @amahussein! Filed issue: #1469
>
> I am not sure we should fix the percentage in a follow-up issue. That would mean we fix the inconsistent view across two files while introducing another bug.
>
> I investigated this. It looks more like a rounding issue than a bug to me:
>
>   1. executorRunTime is in milliseconds in its raw form while executorCpuTime is in nanoseconds, so executorRunTime may already have lost precision before we sum over all tasks.
>   2. The runtime is very low where execCPURatio = 103.45: execCpuTime = 30 ms and execRunTime = 29 ms.
>
> WDYT? @amahussein @nartal1 @parthosa

Thanks @cindyyuanjiang for looking into this. I think this was always a bug, but we are now able to catch it due to the changes in this PR. If the raw values are measured in different units, I do not think we can fix this problem.

I could not find a reason why Spark reports run time in ms and CPU time in ns.

Ref:
https://github.com/apache/spark/blob/a2e3188b4997001f4dbc1eb364d61ca55d438208/core/src/main/scala/org/apache/spark/executor/Executor.scala#L715-L720

@cindyyuanjiang cindyyuanjiang dismissed stale reviews from nartal1 and parthosa via ded7601 December 24, 2024 01:27
Signed-off-by: cindyyuanjiang <[email protected]>
@amahussein
Collaborator

> I could not find a reason why Spark reports run time in ms and CPU time in ns.
>
> Ref: https://github.com/apache/spark/blob/a2e3188b4997001f4dbc1eb364d61ca55d438208/core/src/main/scala/org/apache/spark/executor/Executor.scala#L715-L720

@parthosa, in systems it is almost the standard to use nanoseconds when measuring CPU time. CPUs run at high frequencies, so nanoseconds give more precision when analyzing efficiency and performance (CPU utilization, etc.).
On the other hand, executionTime is fine in ms since it measures the lifetime of an executor, which should not be sensitive to fractions as small as nanoseconds.

Thanks @cindyyuanjiang for investigating the inconsistent ratio.
Let me take a look at the code to see how the ratio is calculated and how to handle that unit difference.

@amahussein
Collaborator

I added some changes to the fix.
I changed the implementation of calculateDurationPercent to optionally cap the result at 100%; a hedged sketch of that behavior follows.
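A minimal sketch of the capping behavior (an assumed signature; the repo's actual calculateDurationPercent may differ):

```scala
object DurationPercent {
  // Percentage of `numerator` over `denominator`, rounded to two decimals,
  // optionally capped at 100% to mask rounding artifacts from mixed units.
  def calculateDurationPercent(numerator: Long, denominator: Long,
      capAt100: Boolean = true): Double = {
    if (denominator <= 0) {
      0.0 // avoid division by zero for empty or zero durations
    } else {
      val rounded = math.round(numerator.toDouble / denominator * 100 * 100) / 100.0
      if (capAt100) math.min(rounded, 100.0) else rounded
    }
  }

  def main(args: Array[String]): Unit = {
    println(calculateDurationPercent(30, 29))                   // 100.0 (capped)
    println(calculateDurationPercent(30, 29, capAt100 = false)) // 103.45
  }
}
```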

I found that the qualification output also has the same problem because time units are converted to milliseconds at the task level. Below is a list of some occurrences where the time units are converted at the task level (but not limited to these).

@cindyyuanjiang PTAL at the patch file, and please file a follow-up issue to fix the same problem in the qualification output.

pr-1450.patch

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Dec 31, 2024

@amahussein thank you for putting together the patch! LGTM. I will apply the changes.

I see that in this new patch we convert nanoseconds to milliseconds when aggregating at the stage/job/SQL level. Are we okay with the potential overflow from doing it this way?

Follow-up issue: #1481 (also included in the PR description)

Signed-off-by: cindyyuanjiang <[email protected]>
@amahussein
Collaborator

> I see that in this new patch we convert nanoseconds to milliseconds when aggregating at the stage/job/SQL level. Are we okay with the potential overflow from doing it this way?

mmm, let us reiterate the steps, and please correct me if I am wrong:

  • The new change converts nano-to-milli when aggregating at the stage level. The result is stored in stageLevelSparkMetrics (aka the stageLevelCache).

  • Then, both Job/SQL aggregations are supposed to use the cached stage values. This implies that they read the CPU time in milliseconds.

Comments on this minor refactor:

  • Better abstraction: the accumulator is responsible for calculating and managing time-unit conversions.
  • There is a better design that adds getters for those fields that need conversion (sketched below). However, that would be more changes than necessary for the scope of this PR.
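For reference, a sketch of that getter-based alternative (hypothetical class and field names; not the merged design):

```scala
import java.util.concurrent.TimeUnit

// Raw values stay in nanoseconds; the unit conversion lives behind an
// accessor, so callers can never accidentally sum already-converted values.
class StageAggMetrics(val swWriteTimeNs: Long) {
  def swWriteTimeMs: Long = TimeUnit.NANOSECONDS.toMillis(swWriteTimeNs)
}
```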

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Dec 31, 2024

> > I see that in this new patch we convert nanoseconds to milliseconds when aggregating at the stage/job/SQL level. Are we okay with the potential overflow from doing it this way?
>
> mmm, let us reiterate the steps, and please correct me if I am wrong:
>
>   • The new change converts nano-to-milli when aggregating at the stage level. The result is stored in stageLevelSparkMetrics (aka the stageLevelCache).
>   • Then, both Job/SQL aggregations are supposed to use the cached stage values. This implies that they read the CPU time in milliseconds.
>
> Comments on this minor refactor:
>
>   • Better abstraction: the accumulator is responsible for calculating and managing time-unit conversions.
>   • There is a better design that adds getters for those fields that need conversion. However, that would be more changes than necessary for the scope of this PR.

Thanks @amahussein! Yes, you are correct.

Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang!
LGTM!

@amahussein amahussein merged commit 5755cfc into NVIDIA:dev Jan 2, 2025
15 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1408 branch January 10, 2025 11:09
Successfully merging this pull request may close these issues.

[BUG] Profiler output shows inconsistent shuffleWriteTime results