Add cuda wrapper for cluster ptx operations (#3672)
This PR adds a basic set of operations for using a cluster of CTAs.

## Why?

We can apply TMA multicast to copy data from global memory to the shared memory of multiple CTAs. This extends thread-block swizzling for L2 cache optimization. From the PTX ISA:

> The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mbarrier signal is also multicast to the same CTA-relative offset as mbar in the shared memory of the destination CTA.

## Operations

1. `cluster_arrive_relaxed`
2. `cluster_arrive`
3. `cluster_wait`
4. `cluster_sync`
5. `cluster_grid_dims`
6. `cluster_id_in_grid`
7. `block_id_in_cluster`
8. `cluster_shape`
9. `block_rank_in_cluster`
10. `map_shared_rank`

Reference: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cluster-group
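
Since the wrapper definitions themselves are not quoted in this description, the following is a minimal sketch of how a few of the listed operations can be expressed as inline-PTX wrappers on sm_90+, following the PTX ISA (`barrier.cluster.*`, `%cluster_ctarank`, `mapa.shared::cluster`). The bodies below are illustrative assumptions, not the PR's actual implementation.

```cuda
// Sketch only: hedged stand-ins for the PR's wrappers, targeting sm_90+.

__device__ inline void cluster_arrive_relaxed() {
  asm volatile("barrier.cluster.arrive.relaxed.aligned;" ::: "memory");
}

__device__ inline void cluster_arrive() {
  asm volatile("barrier.cluster.arrive.aligned;" ::: "memory");
}

__device__ inline void cluster_wait() {
  asm volatile("barrier.cluster.wait.aligned;" ::: "memory");
}

// cluster_sync is arrive followed by wait, matching the semantics of
// cooperative_groups::cluster_group::sync() in the CUDA programming guide.
__device__ inline void cluster_sync() {
  cluster_arrive();
  cluster_wait();
}

// block_rank_in_cluster: linear rank of this CTA within its cluster,
// read from the %cluster_ctarank special register.
__device__ inline unsigned int block_rank_in_cluster() {
  unsigned int rank;
  asm volatile("mov.u32 %0, %%cluster_ctarank;" : "=r"(rank));
  return rank;
}

// map_shared_rank: translate a shared-memory address into the address of
// the same CTA-relative offset in the shared memory of the CTA with the
// given rank (distributed shared memory), via mapa.shared::cluster.
__device__ inline unsigned int map_shared_rank(unsigned int smem_addr,
                                               unsigned int cta_rank) {
  unsigned int result;
  asm volatile("mapa.shared::cluster.u32 %0, %1, %2;"
               : "=r"(result)
               : "r"(smem_addr), "r"(cta_rank));
  return result;
}
```

A typical usage pattern is: each CTA writes its shared memory, calls `cluster_sync()` so every CTA in the cluster has finished writing, reads a peer CTA's buffer through `map_shared_rank`, then syncs again so no CTA exits while a peer may still be reading its shared memory.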