Skip to content

Commit

Permalink
PTX: Add helper functions for dsmem (#1336)
Browse files Browse the repository at this point in the history
  • Loading branch information
ahendriksen authored Feb 6, 2024
1 parent 67a09e0 commit ac83b5f
Show file tree
Hide file tree
Showing 8 changed files with 387 additions and 414 deletions.
57 changes: 34 additions & 23 deletions libcudacxx/docs/ptx.md
Original file line number Diff line number Diff line change
Expand Up @@ -260,29 +260,29 @@ notes](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release

### [9.7.8. Data Movement and Conversion Instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions)

| Instruction | Available in libcu++ |
|------------------------------------------|----------------------|
| [`mov`] | No |
| [`mov`] | No |
| [`shfl (deprecated)`] | No |
| [`shfl.sync`] | No |
| [`prmt`] | No |
| [`ld`] | No |
| [`ld.global.nc`] | No |
| [`ldu`] | No |
| [`st`] | No |
| [`st.async`] | No |
| [`multimem.ld_reduce, multimem.st, multimem.red`] | No |
| [`prefetch, prefetchu`] | No |
| [`applypriority`] | No |
| [`discard`] | No |
| [`createpolicy`] | No |
| [`isspacep`] | No |
| [`cvta`] | No |
| [`cvt`] | No |
| [`cvt.pack`] | No |
| [`mapa`] | No |
| [`getctarank`] | No |
| Instruction | Available in libcu++ |
|---------------------------------------------------|-------------------------|
| [`mov`] | No |
| [`mov`] | No |
| [`shfl (deprecated)`] | No |
| [`shfl.sync`] | No |
| [`prmt`] | No |
| [`ld`] | No |
| [`ld.global.nc`] | No |
| [`ldu`] | No |
| [`st`] | No |
| [`st.async`] | CTK-FUTURE, CCCL v2.3.0 |
| [`multimem.ld_reduce, multimem.st, multimem.red`] | No |
| [`prefetch, prefetchu`] | No |
| [`applypriority`] | No |
| [`discard`] | No |
| [`createpolicy`] | No |
| [`isspacep`] | No |
| [`cvta`] | No |
| [`cvt`] | No |
| [`cvt.pack`] | No |
| [`mapa`] | No |
| [`getctarank`] | No |

[`mov`]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-mov-2
[`shfl (deprecated)`]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-deprecated
Expand Down Expand Up @@ -338,6 +338,7 @@ __device__ static inline void st_async(
uint64_t* remote_bar);
```


**Usage**:
```cuda
#include <cstdio>
Expand Down Expand Up @@ -677,6 +678,7 @@ __device__ static inline void red_async(
int64_t* remote_bar);
```


### [9.7.12.15. Parallel Synchronization and Communication Instructions: mbarrier](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier)

| Instruction | Available in libcu++ |
Expand Down Expand Up @@ -709,6 +711,7 @@ __device__ static inline void red_async(

- PTX ISA: [`mbarrier.arrive`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive)

**mbarrier_arrive**:
```cuda
// mbarrier.arrive.shared.b64 state, [addr]; // 1. PTX ISA 70, SM_80
template <typename=void>
Expand Down Expand Up @@ -768,6 +771,8 @@ __device__ static inline void mbarrier_arrive(
const uint32_t& count);
```


**mbarrier_arrive_no_complete**:
```cuda
// mbarrier.arrive.noComplete.shared.b64 state, [addr], count; // 5. PTX ISA 70, SM_80
template <typename=void>
Expand All @@ -776,6 +781,8 @@ __device__ static inline uint64_t mbarrier_arrive_no_complete(
const uint32_t& count);
```


**mbarrier_arrive_expect_tx**:
```cuda
// mbarrier.arrive.expect_tx{.sem}{.scope}{.space}.b64 state, [addr], tx_count; // 8. PTX ISA 80, SM_90
// .sem = { .release }
Expand All @@ -802,6 +809,7 @@ __device__ static inline void mbarrier_arrive_expect_tx(
const uint32_t& tx_count);
```


Usage:
```cuda
#include <cuda/ptx>
Expand Down Expand Up @@ -865,6 +873,7 @@ __device__ static inline bool mbarrier_test_wait(
const uint64_t& state);
```


**mbarrier_test_wait_parity**:
```cuda
// mbarrier.test_wait.parity.shared.b64 waitComplete, [addr], phaseParity; // 3. PTX ISA 71, SM_80
Expand All @@ -884,6 +893,7 @@ __device__ static inline bool mbarrier_test_wait_parity(
const uint32_t& phaseParity);
```


**mbarrier_try_wait**:
```cuda
// mbarrier.try_wait.shared::cta.b64 waitComplete, [addr], state; // 5a. PTX ISA 78, SM_90
Expand Down Expand Up @@ -921,6 +931,7 @@ __device__ static inline bool mbarrier_try_wait(
const uint32_t& suspendTimeHint);
```


**mbarrier_try_wait_parity**:
```cuda
// mbarrier.try_wait.parity.shared::cta.b64 waitComplete, [addr], phaseParity; // 7a. PTX ISA 78, SM_90
Expand Down
Loading

0 comments on commit ac83b5f

Please sign in to comment.