Suggested perf improvements #36

CoffeeBeforeArch · 2020-03-31T21:36:49Z

GPGPU-Sim contains many asserts (and conditional checks) that are primarily used for debugging. Since these checks never fail for most (if not all) simulations run today, I propose adding a third compilation mode to GPGPU-Sim, one that focuses on simulation speed when benchmarks already pass.

Let's consider a brief hotspot analysis from the Linux perf tools, collected while running some matrix multiplication code:

Samples: 350K of event 'cycles:ppp', Event count (approx.): 267406121728
Overhead  Command          Shared Object        Symbol
  13.27%  mmul             libcudart.so         [.] cache_stats::operator()
   7.36%  mmul             libcudart.so         [.] tag_array::probe
   5.90%  mmul             libcudart.so         [.] cache_stats::operator+=
   5.20%  mmul             libcudart.so         [.] sector_cache_block::is_reserved_line
   3.92%  mmul             libcudart.so         [.] ptx_thread_info::get_reg
   3.50%  mmul             libcudart.so         [.] Scoreboard::checkCollision
   2.86%  mmul             libcudart.so         [.] pipelined_simd_unit::cycle
   2.49%  mmul             libcudart.so         [.] simt_stack::get_pdom_stack_top_info
...

13.27% of the time is spent updating cache stats (this breakdown is consistent for tests running from 1 second to 10 minutes, and likely for fully scaled sims as well). This is partially the result of the branch-heavy code generated by:

  if (fail_outcome) {
    if (!check_fail_valid(access_type, access_outcome))
      assert(0 && "Unknown cache access type or fail outcome");

    return m_fail_stats[access_type][access_outcome];
  } else {
    if (!check_valid(access_type, access_outcome))
      assert(0 && "Unknown cache access type or access outcome");

    return m_stats[access_type][access_outcome];
  }

Where check_fail_valid and check_valid both contain additional branches.

Ignoring the fact that accesses to m_stats and m_fail_stats miss ~100% of the time, just removing the unnecessary branches resulted in a 5-15% speedup.

Perhaps it makes sense to have a performance build option that uses the preprocessor to select between versions of these functions (such as the modified version I have below). This PR just contains the modified function and would be edited to include the new build mode if people agree with this approach.
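As a rough illustration of the preprocessor-selected variant proposed above, here is a minimal sketch. It is not the code in this PR: the class `stats_table`, the table sizes, and the macro name `GPGPUSIM_PERF_BUILD` are all hypothetical stand-ins chosen for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes; the real enums in GPGPU-Sim are not reproduced here.
constexpr std::size_t kNumAccessTypes = 4;
constexpr std::size_t kNumOutcomes = 3;

// Hypothetical, simplified stand-in for cache_stats.
class stats_table {
 public:
  unsigned long long &operator()(std::size_t access_type,
                                 std::size_t access_outcome,
                                 bool fail_outcome) {
#ifndef GPGPUSIM_PERF_BUILD  // assumed macro name for the proposed build mode
    // Default build: keep the range check so bad indices are caught early.
    assert(check_valid(access_type, access_outcome) &&
           "Unknown cache access type or access outcome");
#endif
    // Both builds: a single table select, with no nested validity branches.
    auto &table = fail_outcome ? m_fail_stats : m_stats;
    return table[access_type][access_outcome];
  }

 private:
  static bool check_valid(std::size_t type, std::size_t outcome) {
    return type < kNumAccessTypes && outcome < kNumOutcomes;
  }

  using table_t =
      std::array<std::array<unsigned long long, kNumOutcomes>, kNumAccessTypes>;
  table_t m_stats{};
  table_t m_fail_stats{};
};
```

Compiling with `-DGPGPUSIM_PERF_BUILD` (again, an assumed flag name) would drop the check entirely, while the default build keeps it. For simplicity the sketch uses one validity check for both tables, whereas the original has separate check_valid and check_fail_valid functions.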

Food for thought.

src/gpgpu-sim/gpu-cache.cc


CoffeeBeforeArch · 2020-04-11T19:20:35Z

Interestingly enough, -DNDEBUG removes the asserts, but the compiler still isn't smart enough to remove the calls to the check_valid and check_fail_valid functions.
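One way to see why the calls survive: with NDEBUG defined, `assert(expr)` expands to a no-op, but an `if (!check_valid(...))` wrapped around it is ordinary code that the preprocessor leaves in place. Whether the optimizer then drops the call depends on what it can prove about the function. The sketch below contrasts the two placements. The counter and the bounds inside `check_valid` are hypothetical and exist only to make the evaluation observable.

```cpp
#include <cassert>

// Hypothetical counter, only here to make calls to the check observable.
int g_check_calls = 0;

// Hypothetical stand-in for check_valid(); the bounds are made up.
bool check_valid(int access_type, int access_outcome) {
  ++g_check_calls;
  return access_type >= 0 && access_type < 4 &&
         access_outcome >= 0 && access_outcome < 3;
}

// Pattern from the original code: the call sits outside the assert, so it
// is still present in the source after preprocessing when NDEBUG is defined.
void validate_branchy(int access_type, int access_outcome) {
  if (!check_valid(access_type, access_outcome))
    assert(0 && "Unknown cache access type or access outcome");
}

// Alternative: the call is the assert's argument, so with NDEBUG defined the
// whole expression is discarded and check_valid() is never evaluated.
void validate_folded(int access_type, int access_outcome) {
  assert(check_valid(access_type, access_outcome) &&
         "Unknown cache access type or access outcome");
  (void)access_type;     // silence unused-parameter warnings under NDEBUG
  (void)access_outcome;
}
```

In a default build both functions evaluate the check. With `-DNDEBUG`, only `validate_branchy` still contains the call in its source, which matches the behavior described in the comment above.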


CoffeeBeforeArch · 2020-04-13T15:56:57Z

Some perf results for longer running apps:

LavaMD
- 10:35:45 -> 8:31:19
Srad_v1
- 8:46:32 -> 8:02:11
Kmeans
- 7:20:41 -> 6:15:42


test remove asserts and branchy code on hotspot

a97561a

rgreen requested review from brad-mengchi, mkhairy and tgrogers March 31, 2020 21:40

rgreen added the enhancement New feature or request label Mar 31, 2020

tgrogers requested changes Apr 5, 2020

src/gpgpu-sim/gpu-cache.cc

CoffeeBeforeArch added 5 commits April 7, 2020 11:52

Merge remote-tracking branch 'upstream/dev' into suggeted_perf_improv…

1885d77

…ement

Add debug guards

6a4f5d6

Guard off other asserts

ec681be

Remove layer of indirection

22559ed

Disabled asserts in release mode with -DNDEBUG

54033d6

CoffeeBeforeArch requested a review from tgrogers April 11, 2020 19:21

Merge remote-tracking branch 'upstream/dev' into suggeted_perf_improv…

338593e

…ement

Merge remote-tracking branch 'upstream/dev' into suggeted_perf_improv…

4ad84b5

…ement
