Suggested perf improvements #36

Open: wants to merge 8 commits into base: dev

@CoffeeBeforeArch CoffeeBeforeArch commented Mar 31, 2020

GPGPU-Sim contains many asserts (and conditional checks) that are primarily used for debugging. Since these checks never fail in most (if not all) simulations run today, I propose adding a third compilation mode to GPGPU-Sim, one that prioritizes simulation speed for benchmarks that already pass.

Consider a brief hotspot analysis, collected with Linux perf tools while running some matrix multiplication code:

Samples: 350K of event 'cycles:ppp', Event count (approx.): 267406121728
Overhead  Command          Shared Object        Symbol
  13.27%  mmul             libcudart.so         [.] cache_stats::operator()
   7.36%  mmul             libcudart.so         [.] tag_array::probe
   5.90%  mmul             libcudart.so         [.] cache_stats::operator+=
   5.20%  mmul             libcudart.so         [.] sector_cache_block::is_reserved_line
   3.92%  mmul             libcudart.so         [.] ptx_thread_info::get_reg
   3.50%  mmul             libcudart.so         [.] Scoreboard::checkCollision
   2.86%  mmul             libcudart.so         [.] pipelined_simd_unit::cycle
   2.49%  mmul             libcudart.so         [.] simt_stack::get_pdom_stack_top_info
...

13.27% of the time is spent in cache_stats::operator(), i.e. updating cache stats (this breakdown is consistent for tests running from 1 second to 10 minutes, and likely for fully scaled sims as well). This is partially the result of the branch-heavy code generated by:

  if (fail_outcome) {
    if (!check_fail_valid(access_type, access_outcome))
      assert(0 && "Unknown cache access type or fail outcome");

    return m_fail_stats[access_type][access_outcome];
  } else {
    if (!check_valid(access_type, access_outcome))
      assert(0 && "Unknown cache access type or access outcome");

    return m_stats[access_type][access_outcome];
  }

Both check_fail_valid and check_valid contain additional branches of their own.

Ignoring the fact that accesses to m_stats and m_fail_stats miss ~100% of the time, just removing the unnecessary branches resulted in a 5-15% speedup.

Perhaps it makes sense to have a performance build option that uses the preprocessor to select between versions of these functions (such as the modified version I have below). This PR just contains the modified function and would be edited to include the new build mode if people agree with this approach.
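As a rough illustration of the preprocessor-selected approach described above, here is a minimal sketch. It is not the code in this PR: the macro name GPGPUSIM_PERF_BUILD, the class name cache_stats_sketch, and the array sizes are all hypothetical placeholders, and the validity checks are reduced to simple bounds checks.

```cpp
#include <cassert>
#include <cstddef>

// Placeholder dimensions; GPGPU-Sim's real enums are not reproduced here.
constexpr std::size_t kNumAccessTypes = 4;
constexpr std::size_t kNumOutcomes = 4;

// Hypothetical stand-in for cache_stats. Compiling with
// -DGPGPUSIM_PERF_BUILD (an assumed flag name) drops the checks.
class cache_stats_sketch {
 public:
  unsigned long long &operator()(std::size_t access_type,
                                 std::size_t access_outcome,
                                 bool fail_outcome) {
#ifndef GPGPUSIM_PERF_BUILD
    // Debug build: validate the indices before touching the tables.
    assert(access_type < kNumAccessTypes && "Unknown cache access type");
    assert(access_outcome < kNumOutcomes && "Unknown access or fail outcome");
#endif
    // Both builds: a single select between the two tables.
    return fail_outcome ? m_fail_stats[access_type][access_outcome]
                        : m_stats[access_type][access_outcome];
  }

 private:
  unsigned long long m_stats[kNumAccessTypes][kNumOutcomes] = {};
  unsigned long long m_fail_stats[kNumAccessTypes][kNumOutcomes] = {};
};
```

In the performance build the accessor reduces to one table select and an indexed load, at the cost of losing the out-of-range diagnostics.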

Food for thought.

@rgreen added the enhancement label Mar 31, 2020
Review thread on src/gpgpu-sim/gpu-cache.cc (resolved)
@CoffeeBeforeArch (Author) commented

Interestingly enough, -DNDEBUG removes the asserts, but the compiler still isn't smart enough to remove the calls to the check_valid and check_fail_valid functions.
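One way to see why the calls can survive -DNDEBUG: the standard assert macro only discards its own argument, so a check invoked outside assert() remains in the source and the optimizer has to prove it is side-effect free before removing it. The sketch below contrasts the two placements. check_valid_sketch and the call counter are hypothetical stand-ins added purely to make the effect observable; they are not GPGPU-Sim code.

```cpp
#include <cassert>

// Counts how many times the (hypothetical) validity check actually runs.
int g_check_calls = 0;

// Stand-in for check_valid; the bounds used here are placeholders.
bool check_valid_sketch(int access_type, int access_outcome) {
  ++g_check_calls;
  return access_type >= 0 && access_type < 4 &&
         access_outcome >= 0 && access_outcome < 4;
}

// Pattern from the snippet above: the call sits outside assert(),
// so it is still present in the source when NDEBUG is defined.
void checked_outside(int access_type, int access_outcome) {
  if (!check_valid_sketch(access_type, access_outcome))
    assert(0 && "Unknown cache access type or access outcome");
}

// Alternative: the call is the assert argument, so with NDEBUG the
// preprocessor discards the whole expression, call included.
void checked_inside(int access_type, int access_outcome) {
  assert(check_valid_sketch(access_type, access_outcome) &&
         "Unknown cache access type or access outcome");
  (void)access_type;     // silence unused-parameter warnings under NDEBUG
  (void)access_outcome;
}
```

Without NDEBUG both functions run the check once per call. With NDEBUG only checked_outside still does, because the counter gives the check a visible side effect the compiler must preserve.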

@CoffeeBeforeArch (Author) commented

Some perf results for longer running apps:

  • LavaMD
    • 10:35:45 -> 8:31:19
  • Srad_v1
    • 8:46:32 -> 8:02:11
  • Kmeans
    • 7:20:41 -> 6:15:42
