Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix command graph generation bugs around reductions #223

Merged
merged 3 commits into from
Nov 8, 2023

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Oct 25, 2023

Implementing IDAG reductions uncovered two bugs around reductions in distributed command graph generation:

  • For an all-to-all reduction, we emit push commands for the partial results from our node followed by a reduction command. The reduction command logically overwrites the buffer contents, so it must anti-depend on these pushes. This bug does not appear to break reductions in the current runtime, most likely because the final reduction result is only committed to device memory once it's being read in the next consumer task.
  • We elide reduction commands if there only is a single producer chunk. If the result is subsequently read by multiple nodes, we generate push commands on the producer, but failed to generate the corresponding await-pushes on the consumer node.

I've added unit tests for both cases.

@fknorr fknorr added this to the 0.5.0 milestone Oct 25, 2023
@fknorr fknorr requested review from psalz and PeterTh October 25, 2023 11:55
@fknorr fknorr self-assigned this Oct 25, 2023
@github-actions
Copy link

Check-perf-impact results: (b003273516680ef3e6ca0110b3678f5e)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff!

src/distributed_graph_generator.cc Outdated Show resolved Hide resolved
src/distributed_graph_generator.cc Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

test/graph_gen_reduction_tests.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

test/graph_gen_reduction_tests.cc Outdated Show resolved Hide resolved
@fknorr fknorr force-pushed the fix-cdag-reductions branch from 0b1bf48 to 4c61835 Compare November 2, 2023 13:23
fknorr added a commit that referenced this pull request Nov 2, 2023
@fknorr fknorr force-pushed the fix-cdag-reductions branch from 4c61835 to 2cfd1c2 Compare November 2, 2023 13:54
Copy link

github-actions bot commented Nov 2, 2023

Check-perf-impact results: (dee217934841bf19e612d83adf4e7dfb)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: 4 individual benchmarks affected
🚀 Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / jacobi topology

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.00x
  • graph-nodes : 0.99x
  • grid : 1.02x
  • scheduler : 1.02x
  • system : 1.15x ⚠️
  • task-graph : 1.01x

@fknorr fknorr force-pushed the fix-cdag-reductions branch from 2cfd1c2 to 2f6687a Compare November 2, 2023 17:19
Copy link

github-actions bot commented Nov 2, 2023

Check-perf-impact results: (d21ecac39af892ab1c227e6d0ae10ebf)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, benchmark independent task pattern with N tasks - 100 / task generation
🚀 Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 50 / iterations

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.00x
  • graph-nodes : 0.96x
  • grid : 1.01x
  • scheduler : 1.03x
  • system : 1.06x
  • task-graph : 0.99x

@fknorr
Copy link
Contributor Author

fknorr commented Nov 2, 2023

I re-ran the benchmarks because there seemed to be significant jitter in the system benchmarks, but it appears that "benchmark independent task pattern with 100 tasks" is indeed slowing down, even though the change should not affect code without reductions.

@fknorr
Copy link
Contributor Author

fknorr commented Nov 8, 2023

@PeterTh discovered that results of our multi-threaded benchmarks, especially system benchmarks, are not as stable and reliable as we thought, and our benchmarking setup needs some work.

Aside from extremely obscure reason in instruction cache, OS scheduling or similar, I'm going to trust the command-graph benchmarks which measure this change in isolation and do not show a change in performance.

@fknorr fknorr merged commit b2ee29d into master Nov 8, 2023
28 checks passed
@psalz psalz deleted the fix-cdag-reductions branch November 8, 2023 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants