Transformer benchmark forward #3684

nsarka · 2025-01-08T22:50:40Z

In this PR I added Meghan's transformer test as a benchmark. It works for one process, but with > 1 processes it seems there's a hang in cuptiActivityDisable in the profiler. I'm opening the PR now as a draft to ask for comments on the code itself and for ideas on why CUPTI might be causing the hang. Here's the backtrace:

(gdb) bt
#0  0x00007f0f3a84fc25 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007f0f3b1d633e in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007f0f3a6c57bc in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#3  0x00007f0f3a774712 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#4  0x00007f0f3b197315 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#5  0x00007f10318fc867 in ?? () from /usr/local/cuda/lib64/libcupti.so.12
#6  0x00007f0f3a91ca90 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x00007f10318fcf72 in ?? () from /usr/local/cuda/lib64/libcupti.so.12
#8  0x00007f103190017a in ?? () from /usr/local/cuda/lib64/libcupti.so.12
#9  0x00007f10319002c5 in ?? () from /usr/local/cuda/lib64/libcupti.so.12
#10 0x00007f10319004c5 in cuptiActivityDisable () from /usr/local/cuda/lib64/libcupti.so.12
#11 0x0000558d644f2e42 in nvfuser::FusionProfiler::stop () at /opt/pytorch/Fuser/csrc/fusion_profiler.cpp:728
#12 0x0000558d64218788 in nvfuser::FusionExecutorCache::runFusionWithInputs (this=0x558d69c95ed0, inputs=..., forced_index_type=std::optional [no contained value], selected_device=std::optional [no contained value]) at /opt/pytorch/Fuser/csrc/runtime/fusion_executor_cache.cpp:102
#13 0x0000558d648142f7 in runBenchmarkIterations (benchmark_state=..., executor_cache=0x558d69c95ed0, aten_inputs=std::vector of length 13, capacity 13 = {...}) at /opt/pytorch/Fuser/benchmarks/cpp/utils.cpp:212
#14 0x0000558d647d02bd in NvFuserScheduler_TransformerFwd (benchmark_state=..., executor_cache=0x558d69c95ed0, dtype=...) at /opt/pytorch/Fuser/benchmarks/cpp/transformer.cpp:174
#15 0x0000558d647d0f58 in TransformerForward___GRAPH_TransformerForward_Benchmark::BenchmarkCase (this=0x558d69c04f50, benchmark_state=...) at /opt/pytorch/Fuser/benchmarks/cpp/transformer.cpp:182
#16 0x0000558d646221ac in benchmark::Fixture::Run (this=0x558d69c04f50, st=...) at /opt/pytorch/Fuser/third_party/benchmark/include/benchmark/benchmark.h:1217
#17 0x0000558d648814ea in benchmark::internal::BenchmarkInstance::Run (this=0x558d69c16890, iters=1000, thread_id=0, timer=0x7ffd47f99490, manager=0x558d689adf40, perf_counters_measurement=0x0) at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark_api_internal.cc:92
#18 0x0000558d6485f88b in benchmark::internal::(anonymous namespace)::RunInThread (b=0x558d69c16890, iters=1000, thread_id=0, manager=0x558d689adf40, perf_counters_measurement=0x0) at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark_runner.cc:126
#19 0x0000558d648602c4 in benchmark::internal::BenchmarkRunner::DoNIterations (this=0x558d65be0f50) at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark_runner.cc:191
#20 0x0000558d64860a67 in benchmark::internal::BenchmarkRunner::DoOneRepetition (this=0x558d65be0f50) at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark_runner.cc:283
#21 0x0000558d648414b3 in benchmark::internal::(anonymous namespace)::RunBenchmarks (benchmarks=std::vector of length 1, capacity 1 = {...}, display_reporter=0x558d69c189b0, file_reporter=0x0) at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark.cc:350
#22 0x0000558d64842307 in benchmark::RunSpecifiedBenchmarks (display_reporter=0x558d69c189b0, file_reporter=0x0, spec="TransformerForward") at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark.cc:507
#23 0x0000558d64841b24 in benchmark::RunSpecifiedBenchmarks () at /opt/pytorch/Fuser/third_party/benchmark/src/benchmark.cc:432
#24 0x0000558d6471e381 in main (argc=1, argv=0x7ffd47f9c798) at /opt/pytorch/Fuser/benchmarks/cpp/main.cpp:92

~~The output with 2 ranks looks like this:~~

did=0 in fwd
did=1 in fwd
did=0 in fwd
did=1 in fwd
did=0 in fwd
did=1 in fwd
did=0 in fwd
did=1 in fwd
---------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/8/16/128/128/manual_time       2076 us         6300 us          345 bytes_per_second=4.94844G/s
calling comm cleanup, size=2, did=0
entered cleanup on rank 0
calling barrier on rank 0
done calling barrier on rank 0
pg shutdown on rank 0

benchmarks/cpp/main.cpp

benchmarks/cpp/transformer.cpp

tests/cpp/multidevice_transformer.h

wujingyue · 2025-01-09T03:46:24Z

> with > 1 processes it seems there's a hang in cuptiActivityDisable in the profiler.

I haven't used cupti enough to tell. @kevinstephano do you have a clue? @nsarka posted the stack trace in the PR description.

To confirm, this patch or the command you used to run the benchmark didn't have anything in particular to trigger cupti. Correct?

nsarka · 2025-01-09T13:37:13Z

> > with > 1 processes it seems there's a hang in cuptiActivityDisable in the profiler.
>
> I haven't used cupti enough to tell. @kevinstephano do you have a clue? @nsarka posted the stack trace in the PR description.
>
> To confirm, this patch or the command you used to run the benchmark didn't have anything in particular to trigger cupti. Correct?

~~My command is just mpirun -np 2 ./build/nvfuser_bench --benchmark_filter=TransformerForward, so nothing in particular to trigger cupti.~~

The benchmark suite triggers cupti for measuring kernel time.

nsarka · 2025-01-09T17:07:39Z

After manually disabling CUPTI with:

ProfilerOptionsGuard::getCurOptions().set(ProfilerOption::EnableNocupti);

the benchmark passes.

nsarka · 2025-01-09T17:46:23Z

I removed the line disabling CUPTI and manually set the number of iterations to 10. With the lowered number of iterations it seems to pass.

Here is the output with 4 ranks:

----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time       6075 us        49000 us           10 bytes_per_second=1.14804G/s
----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time      43423 us        63985 us           10 bytes_per_second=164.48M/s
----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time       4269 us        58275 us           10 bytes_per_second=1.63401G/s
----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time        835 us        45158 us           10 bytes_per_second=8.35546G/s

Here is the output with 2 ranks:

----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time      29398 us        42675 us           10 bytes_per_second=357.841M/s
----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
TransformerForward___GRAPH/TransformerForward/iterations:10/manual_time        123 us        38640 us           10 bytes_per_second=83.6172G/s

@cowanmeg Is the skew in latency between ranks (29398 vs 123 us) expected?
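For reference, one way to put a number on that skew is the ratio of the slowest rank's time to the fastest rank's time. This is only a minimal sketch: `rankSkew` is a hypothetical helper that is not part of this PR, and it assumes the per-rank manual_time values are copied by hand from the tables above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of this PR): ratio of the slowest rank's
// per-iteration time to the fastest rank's, with times in microseconds.
double rankSkew(const std::vector<double>& times_us) {
  assert(!times_us.empty());
  // minmax_element returns a pair of iterators: (smallest, largest).
  const auto extremes = std::minmax_element(times_us.begin(), times_us.end());
  return *extremes.second / *extremes.first;
}
```

With the numbers above, the 2-rank run gives 29398 / 123, roughly 239x, and the 4-rank run gives 43423 / 835, roughly 52x.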

nsarka · 2025-01-09T18:34:42Z

!build

wujingyue

🚢

wujingyue · 2025-01-09T21:46:34Z

!test

wujingyue · 2025-01-09T21:48:17Z

!build

Since a change last year, !build no longer triggers execution of tests. See https://github.com/NVIDIA/Fuser/wiki/Bot-Commands#test-command

Co-authored-by: Jingyue Wu <[email protected]>

nsarka requested review from wujingyue, cowanmeg and samnordmann January 8, 2025 22:50

wujingyue reviewed Jan 8, 2025

nsarka force-pushed the nsarka/transformer-benchmark branch from ef8ecb4 to f87c137 Compare January 9, 2025 18:34

wujingyue approved these changes Jan 9, 2025

nsarka changed the title ~~Draft: Transformer benchmark~~ Transformer benchmark forward Jan 9, 2025

wujingyue approved these changes Jan 9, 2025

Nicholas Sarkauskas and others added 15 commits January 10, 2025 10:15

Add transformer benchmark skeleton based off of a bert test

f158a54

savE

35550be

fusion definition

e7df952

add at_inputs, compiling

528bf86

forward working

eaa5ddb

remove unused code

162683c

add debug prints and comm cleanup

0e54996

Update benchmarks/cpp/transformer.cpp

937319e

Co-authored-by: Jingyue Wu <[email protected]>

working with multiple ranks, disabled cupti profiling and set iters to 1

c7c234b

remove

a7fe957

remove debug prints

18ec7a0

Update benchmarks/cpp/transformer.cpp

e07a108

Co-authored-by: Jingyue Wu <[email protected]>

Update benchmarks/cpp/transformer.cpp

a725471

Co-authored-by: Jingyue Wu <[email protected]>

review feedback

d5f566c

update

9d9b693

Nicholas Sarkauskas and others added 4 commits January 10, 2025 10:15

linter

fe000e9

review feedback

ef82581

lint

de3b86f

remove unused variable

6e03077

nsarka force-pushed the nsarka/transformer-benchmark branch from 132a3bd to 6e03077 Compare January 10, 2025 15:15
