MVP for VSmem abstraction #553

Closed
Tracked by #612
jrhemstad opened this issue Sep 20, 2023 · 1 comment · Fixed by #619
jrhemstad commented Sep 20, 2023

Infrastructure plus a simple placeholder kernel that automatically switches between using actual shared memory and a global memory allocation as scratchpad, based on the size of the input type (a rough sketch of the idea follows the list below).

  • Emits only a single kernel instantiation
  • Emits ld/st.shared when using actual shared memory
  • Added as a unit test to CUB's Catch2 tests
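
A rough sketch of what such a helper and kernel could look like (an assumption for illustration, not the eventual CUB implementation; vsmem_helper, placeholder_kernel, and MAX_SMEM_BYTES are made-up names):

#include <cstddef>
#include <type_traits>

constexpr std::size_t MAX_SMEM_BYTES = 48 * 1024;  // conventional per-block shared memory budget

struct empty_storage {};  // declared in shared memory when the real storage doesn't fit

template <typename TempStorageT>
struct vsmem_helper
{
    static constexpr bool use_smem = sizeof(TempStorageT) <= MAX_SMEM_BYTES;
    // The type the kernel actually places in shared memory: the real storage if it
    // fits, an empty placeholder otherwise.
    using static_storage_t = typename std::conditional<use_smem, TempStorageT, empty_storage>::type;
};

template <typename TempStorageT>
__global__ void placeholder_kernel(TempStorageT* gmem_scratch /* one slot per CTA */)
{
    using helper = vsmem_helper<TempStorageT>;

    __shared__ typename helper::static_storage_t smem_storage;

    // A single kernel instantiation: the selection is a compile-time constant, so
    // ld/st.shared instructions are only emitted on the shared-memory path.
    TempStorageT* storage = helper::use_smem
                              ? reinterpret_cast<TempStorageT*>(&smem_storage)
                              : gmem_scratch + blockIdx.x;
    // ... placeholder work using *storage as scratch space ...
    (void)storage;
}

When use_smem is false, the dispatch layer would presumably size gmem_scratch to grid_size * sizeof(TempStorageT) before the launch.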

jrhemstad commented Oct 4, 2023

@elstehle wants to investigate the potential performance impact of using static shared memory vs. dynamic shared memory:

__shared__ int static_shmem[2048];

vs

extern __shared__ int dynamic_shmem[];

We want to know if there is any performance impact of using dynamic_shmem. For example, the compiler has less information about resource usage in the dynamic_shmem case. Consider V100, which has a maximum of 96 KB of shared memory. If a CTA statically uses 48 KB of shmem, the compiler knows that only two CTAs will fit and can increase the number of registers per thread. In contrast, with dynamic shmem the compiler doesn't know how much shared memory will be used and can't adjust registers per thread based on that information.
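
One way to observe the difference is to compare register count and achievable CTAs per SM for the two variants via the CUDA runtime attribute/occupancy APIs. This is a sketch for illustration; the kernels, block size, and 48 KB scratchpad are assumptions, not values from this issue:

#include <cstdio>
#include <cuda_runtime.h>

constexpr int SCRATCH_INTS = 48 * 1024 / sizeof(int);  // 48 KB scratchpad

__global__ void kernel_static_shmem()
{
    __shared__ int scratch[SCRATCH_INTS];
    scratch[threadIdx.x] = threadIdx.x;
}

__global__ void kernel_dynamic_shmem()
{
    extern __shared__ int scratch[];  // size supplied at launch time
    scratch[threadIdx.x] = threadIdx.x;
}

int main()
{
    // Registers per thread as chosen by the compiler for each variant.
    cudaFuncAttributes attr_static{}, attr_dynamic{};
    cudaFuncGetAttributes(&attr_static, kernel_static_shmem);
    cudaFuncGetAttributes(&attr_dynamic, kernel_dynamic_shmem);
    std::printf("regs/thread  static: %d  dynamic: %d\n", attr_static.numRegs, attr_dynamic.numRegs);

    // Resident CTAs per SM for a 256-thread block, with 48 KB requested dynamically.
    int ctas_static = 0, ctas_dynamic = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctas_static, kernel_static_shmem, 256, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctas_dynamic, kernel_dynamic_shmem, 256, 48 * 1024);
    std::printf("CTAs/SM      static: %d  dynamic: %d\n", ctas_static, ctas_dynamic);
    return 0;
}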

One idea would be to try using __launch_bounds__(block_size, num_ctas) to account for this, but it's not clear that this would recoup all of the potential performance.
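
A minimal sketch of that idea (BLOCK_THREADS and MIN_CTAS_PER_SM are illustrative values, not numbers from this issue):

constexpr int BLOCK_THREADS   = 256;
constexpr int MIN_CTAS_PER_SM = 2;  // the occupancy a 48 KB static allocation would imply

// Promising the compiler that only MIN_CTAS_PER_SM CTAs need to be resident lets
// ptxas budget registers per thread as if the shared-memory footprint were known.
__global__ void __launch_bounds__(BLOCK_THREADS, MIN_CTAS_PER_SM)
kernel_dynamic_shmem_bounded()
{
    extern __shared__ int scratch[];  // sized at launch: <<<grid, BLOCK_THREADS, smem_bytes>>>
    scratch[threadIdx.x] = threadIdx.x;
}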

There are three main options for the vsmem fallback:

  1. Static shmem -> gmem (going with this one)
    • Pros: No perf loss from using dynamic shmem
    • Cons: Prematurely falls back to gmem by skipping dynamic shmem
  2. Static shmem -> dynamic shmem -> gmem
    • Pros: Preserves the perf of static shmem when possible and avoids prematurely going to gmem
    • Cons: Difficult to implement; it might require two kernel instantiations
  3. Dynamic shmem -> gmem (Thrust & CUB Merge Sort's current solution)
    • Pros: Easiest to implement
    • Cons: May lose perf by not using static shmem

In all of these approaches, we'd also want a further fallback mechanism that first reduces items per thread before falling back to gmem.
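
A rough sketch of that fallback order (compile-time only; MAX_SMEM_BYTES and select_items_per_thread are made-up names for illustration):

#include <cstddef>

constexpr std::size_t MAX_SMEM_BYTES = 48 * 1024;

// Halve items per thread until the tile fits in shared memory (or until we hit 1
// item per thread, at which point the dispatch would take the gmem path instead).
constexpr int select_items_per_thread(std::size_t item_size, int block_threads, int items_per_thread)
{
    return (item_size * block_threads * items_per_thread <= MAX_SMEM_BYTES || items_per_thread == 1)
             ? items_per_thread
             : select_items_per_thread(item_size, block_threads, items_per_thread / 2);
}

// e.g. a 64-byte value type with 128 threads per block, starting at 16 items per
// thread: 128 KB -> 64 KB -> 32 KB, so 4 items per thread fit in shared memory.
static_assert(select_items_per_thread(64, 128, 16) == 4, "reduced tile should fit in shared memory");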

@jrhemstad jrhemstad transferred this issue from NVIDIA/cub Oct 12, 2023