MVP for VSmem abstraction #553

Closed
Tracked by #612
jrhemstad opened this issue Sep 20, 2023 · 1 comment · Fixed by #619
jrhemstad commented Sep 20, 2023

Infrastructure plus a simple placeholder kernel that automatically switches between using actual shared memory and a global memory allocation as scratchpad, based on the size of the input type (a rough sketch of the idea follows the list below).

  • Emits only a single kernel instantiation
  • Emits ld/st.shared when using actual shared memory
  • Added as a unit test to CUB's Catch2 tests
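
A rough sketch of what such a helper and kernel could look like (an assumption for illustration, not the eventual CUB implementation; vsmem_helper, placeholder_kernel, and MAX_SMEM_BYTES are made-up names):

#include <cstddef>
#include <type_traits>

constexpr std::size_t MAX_SMEM_BYTES = 48 * 1024;  // conventional per-block shared memory budget

struct empty_storage {};  // declared in shared memory when the real storage doesn't fit

template <typename TempStorageT>
struct vsmem_helper
{
    static constexpr bool use_smem = sizeof(TempStorageT) <= MAX_SMEM_BYTES;
    // The type the kernel actually places in shared memory: the real storage if it
    // fits, an empty placeholder otherwise.
    using static_storage_t = typename std::conditional<use_smem, TempStorageT, empty_storage>::type;
};

template <typename TempStorageT>
__global__ void placeholder_kernel(TempStorageT* gmem_scratch /* one slot per CTA */)
{
    using helper = vsmem_helper<TempStorageT>;

    __shared__ typename helper::static_storage_t smem_storage;

    // A single kernel instantiation: the selection is a compile-time constant, so
    // ld/st.shared instructions are only emitted on the shared-memory path.
    TempStorageT* storage = helper::use_smem
                              ? reinterpret_cast<TempStorageT*>(&smem_storage)
                              : gmem_scratch + blockIdx.x;
    // ... placeholder work using *storage as scratch space ...
    (void)storage;
}

When use_smem is false, the dispatch layer would presumably size gmem_scratch to grid_size * sizeof(TempStorageT) before the launch.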

jrhemstad commented Oct 4, 2023

@elstehle wants to investigate the potential performance impact of using static shared memory vs. dynamic shared memory:

__shared__ int static_shmem[2048];

vs

extern __shared__ int dynamic_shmem[];

We want to know if there is any performance impact of using dynamic_shmem. For example, the compiler has less information about resource usage in the dynamic_shmem case. Consider V100, which has a maximum of 96 KB of shared memory. If a CTA statically uses 48 KB of shmem, the compiler knows that only two CTAs will fit and can increase the number of registers per thread. In contrast, with dynamic shmem the compiler doesn't know how much shared memory will be used and can't adjust registers per thread based on that information.
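
One way to observe the difference is to compare register count and achievable CTAs per SM for the two variants via the CUDA runtime attribute/occupancy APIs. This is a sketch for illustration; the kernels, block size, and 48 KB scratchpad are assumptions, not values from this issue:

#include <cstdio>
#include <cuda_runtime.h>

constexpr int SCRATCH_INTS = 48 * 1024 / sizeof(int);  // 48 KB scratchpad

__global__ void kernel_static_shmem()
{
    __shared__ int scratch[SCRATCH_INTS];
    scratch[threadIdx.x] = threadIdx.x;
}

__global__ void kernel_dynamic_shmem()
{
    extern __shared__ int scratch[];  // size supplied at launch time
    scratch[threadIdx.x] = threadIdx.x;
}

int main()
{
    // Registers per thread as chosen by the compiler for each variant.
    cudaFuncAttributes attr_static{}, attr_dynamic{};
    cudaFuncGetAttributes(&attr_static, kernel_static_shmem);
    cudaFuncGetAttributes(&attr_dynamic, kernel_dynamic_shmem);
    std::printf("regs/thread  static: %d  dynamic: %d\n", attr_static.numRegs, attr_dynamic.numRegs);

    // Resident CTAs per SM for a 256-thread block, with 48 KB requested dynamically.
    int ctas_static = 0, ctas_dynamic = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctas_static, kernel_static_shmem, 256, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctas_dynamic, kernel_dynamic_shmem, 256, 48 * 1024);
    std::printf("CTAs/SM      static: %d  dynamic: %d\n", ctas_static, ctas_dynamic);
    return 0;
}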

One idea would be to try using __launch_bounds__(block_size, num_ctas) to account for this, but it's not clear that this would recoup all of the potential performance.
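
A minimal sketch of that idea (BLOCK_THREADS and MIN_CTAS_PER_SM are illustrative values, not numbers from this issue):

constexpr int BLOCK_THREADS   = 256;
constexpr int MIN_CTAS_PER_SM = 2;  // the occupancy a 48 KB static allocation would imply

// Promising the compiler that only MIN_CTAS_PER_SM CTAs need to be resident lets
// ptxas budget registers per thread as if the shared-memory footprint were known.
__global__ void __launch_bounds__(BLOCK_THREADS, MIN_CTAS_PER_SM)
kernel_dynamic_shmem_bounded()
{
    extern __shared__ int scratch[];  // sized at launch: <<<grid, BLOCK_THREADS, smem_bytes>>>
    scratch[threadIdx.x] = threadIdx.x;
}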

There are three main options for the vsmem fallback:

  1. Static shmem -> gmem (going with this one)
    • Pros: No perf loss from using dynamic shmem
    • Cons: Prematurely falls back to gmem by skipping dynamic shmem
  2. Static shmem -> dynamic shmem -> gmem
    • Pros: Preserves the perf of static shmem when possible and avoids prematurely going to gmem
    • Cons: Difficult to implement; it might require two kernel instantiations
  3. Dynamic shmem -> gmem (Thrust & CUB Merge Sort's current solution)
    • Pros: Easiest to implement
    • Cons: May lose perf by not using static shmem

In all of these approaches, we'd also want a further fallback mechanism that first reduces items per thread before falling back to gmem.
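
A rough sketch of that fallback order (compile-time only; MAX_SMEM_BYTES and select_items_per_thread are made-up names for illustration):

#include <cstddef>

constexpr std::size_t MAX_SMEM_BYTES = 48 * 1024;

// Halve items per thread until the tile fits in shared memory (or until we hit 1
// item per thread, at which point the dispatch would take the gmem path instead).
constexpr int select_items_per_thread(std::size_t item_size, int block_threads, int items_per_thread)
{
    return (item_size * block_threads * items_per_thread <= MAX_SMEM_BYTES || items_per_thread == 1)
             ? items_per_thread
             : select_items_per_thread(item_size, block_threads, items_per_thread / 2);
}

// e.g. a 64-byte value type with 128 threads per block, starting at 16 items per
// thread: 128 KB -> 64 KB -> 32 KB, so 4 items per thread fit in shared memory.
static_assert(select_items_per_thread(64, 128, 16) == 4, "reduced tile should fit in shared memory");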

@jrhemstad jrhemstad transferred this issue from NVIDIA/cub Oct 12, 2023