
question about merge-path SpMM implementation #10

Open
shadow150519 opened this issue Oct 8, 2023 · 2 comments

@shadow150519

Hello, I am reading your paper and code, and I have some questions about the merge-path SpMM.
(1) In my understanding, limits[i] means partition i should start from row limits[i] of A and end at row limits[i+1]. But how do I find which element it should start from? I can't understand what end and row_ind mean in your algorithm.
(2) What is the granularity of merge-path SpMM? In row-split, a warp is responsible for one row of A. In your paper you say a thread is responsible for T work. Does T work mean T nnz of A?
(3) How are the partial sums finally reduced, since each thread is responsible for T work that might come from different rows of A?

@ctcyang (Contributor) commented Oct 15, 2023

Hi shadow150519,

> (1) In my understanding, limits[i] means partition i should start from row limits[i] of A and end at row limits[i+1]. But how do I find which element it should start from? I can't understand what end and row_ind mean in your algorithm.
Suppose you have a CSR array A.row_ptr = { 0, 957, 1008, 2000, 2312 }, and suppose you are trying to get each block to do 1000 calculations, i.e. block boundaries { 0, 1000, 2000 }. Then you should expect to get limits = { 0, 1, 3 } and offsets = { 0, 43, 0 } by running PartitionSpmm, which finds those values by binary searching (i.e. MergePath when run in parallel) with the boundary array over A.row_ptr [1]. This means Block 0 is responsible for Row 0, Block 1 for Rows 1 and 2, and Block 2 for Row 3. Block 1 now correctly identifies A.row_ptr[limits[1]] = 957 as its starting index, so each thread of Block 1 knows which values it has to look up.
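For concreteness, here is a rough host-side sketch of that partition step. This is not the repo's actual PartitionSpmm; the names and signature are illustrative, and the parallel version would replace the loop with a MergePath-style search per block:

```cpp
#include <vector>
#include <algorithm>

// Illustrative sketch (not the repo's PartitionSpmm): for each block
// boundary b = i * nnz_per_block, binary-search row_ptr for the row
// containing nnz index b; offsets[i] is how far into that row b lands.
void partition_sketch(const std::vector<int>& row_ptr,  // CSR row pointers
                      int nnz_per_block,                // e.g. 1000
                      int num_blocks,
                      std::vector<int>& limits,         // starting row per block
                      std::vector<int>& offsets) {      // nnz offset into that row
  limits.resize(num_blocks);
  offsets.resize(num_blocks);
  for (int i = 0; i < num_blocks; ++i) {
    int boundary = i * nnz_per_block;
    // Find the last row r with row_ptr[r] <= boundary.
    auto it  = std::upper_bound(row_ptr.begin(), row_ptr.end(), boundary);
    int  row = static_cast<int>(it - row_ptr.begin()) - 1;
    limits[i]  = row;
    offsets[i] = boundary - row_ptr[row];
  }
}
// With row_ptr = { 0, 957, 1008, 2000, 2312 } and boundaries { 0, 1000, 2000 },
// this yields limits = { 0, 1, 3 } and offsets = { 0, 43, 0 }, as above.
```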

This is not mentioned in the paper, but the look-up is generally done as follows:
Block 1, Thread 0 loads A.col_ind[1000] (since limits[1] + offsets[1] + threadIdx.x = 957 + 43 + 0 = 1000) and saves it to shared memory
Block 1, Thread 1 loads A.col_ind[1001] (since limits[1] + offsets[1] + threadIdx.x = 957 + 43 + 1 = 1001) and saves it to shared memory
...
Block 1, Thread 31 loads A.col_ind[1031] and saves it to shared memory
Block 1, Thread 0 loads A.col_ind[1032] and saves it to shared memory
Block 1, Thread 1 loads A.col_ind[1033] and saves it to shared memory
...
Block 1, Thread 7 loads A.col_ind[1999] (the last entry in Block 1's half-open range [1000, 2000)) and saves it to shared memory
Block 1, Thread 8 saves 0 to shared memory
Block 1, Thread 9 saves 0 to shared memory
...
Block 1, Thread 31 saves 0 to shared memory

This is how that part works. As for row_ind, you can think of it as threadIdx.x. end only matters for the last Block, because that is the only Block limited by the total number of nonzeros: in this example, the last Block must check that it never reads at index 2312 (the total nnz) or beyond.
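In CUDA, that cooperative load would look roughly like the sketch below; the names are illustrative and not the repo's actual kernel:

```cpp
// Illustrative sketch of the cooperative load described above (not the
// repo's kernel): each block strides its threads across its half-open nnz
// range [start, end), zero-padding the tail of shared memory.
__global__ void load_sketch(const int* col_ind, const int* row_ptr,
                            const int* limits, const int* offsets,
                            int block_nnz, int total_nnz) {
  extern __shared__ int s_col[];  // block_nnz entries, rounded up to blockDim.x
  int start = row_ptr[limits[blockIdx.x]] + offsets[blockIdx.x];
  int end   = min(start + block_nnz, total_nnz);  // "end" matters for the last block
  int rounded = (block_nnz + blockDim.x - 1) / blockDim.x * blockDim.x;
  for (int i = threadIdx.x; i < rounded; i += blockDim.x) {
    int idx  = start + i;
    s_col[i] = (idx < end) ? col_ind[idx] : 0;  // pad out-of-range slots with 0
  }
  __syncthreads();
  // ... each thread now looks up its column indices from s_col ...
}
```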

> (2) What is the granularity of merge-path SpMM? In row-split, a warp is responsible for one row of A. In your paper you say a thread is responsible for T work. Does T work mean T nnz of A?

On page 12, I mention that due to register limitations, in practice I had to keep T very low, i.e. T = 1, to keep the register count reasonable and avoid overflowing the register file.

> (3) How are the partial sums finally reduced, since each thread is responsible for T work that might come from different rows of A?

Each thread being responsible for T work is okay, because that gets resolved within each Block by doing a segmented reduce at the Block level using shared memory. However, your point does become an issue across Blocks. That's why, on line 22, ReduceToGlobalSpmm has to write both the partial results (C) and the carry-outs (i.e. incomplete rows) to global memory. Then, for long rows that were processed by more than one Block, FixCarryout needs to be run to reconcile them using the partial results (C) and the carry-outs.
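Conceptually, the fix-up pass does something like the following host-side sketch (illustrative names and layout, not the repo's FixCarryout):

```cpp
#include <cstddef>

// Illustrative sketch of the carry-out fix-up (not the repo's FixCarryout):
// each block whose nnz range ends mid-row emits a carry-out -- the row it
// left unfinished plus its partial sums, one per column of B. A second pass
// adds each carry-out into that row of C.
void fix_carryout_sketch(float* C, std::size_t ldc,  // row-major C, leading dimension
                         const int* carry_row,       // per block: unfinished row, -1 if none
                         const float* carry_val,     // per block: n_cols partial sums
                         int num_blocks, int n_cols) {
  for (int b = 0; b < num_blocks; ++b) {
    int r = carry_row[b];
    if (r < 0) continue;  // block ended exactly on a row boundary
    for (int j = 0; j < n_cols; ++j)
      C[static_cast<std::size_t>(r) * ldc + j] += carry_val[b * n_cols + j];
  }
}
```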

[1] MergePath paper: https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf

Hope that answers your questions!

@shadow150519 (Author)

Thanks for your answer! I still have some questions.

> Suppose you are trying to get each block to do 1000 calculations, i.e. block boundaries { 0, 1000, 2000 }. Then you should expect to get limits = { 0, 1, 3 } and offsets = { 0, 43, 0 } by running PartitionSpmm, which finds those values by binary searching (i.e. MergePath when run in parallel) with the boundary array over A.row_ptr [1].

I can't find the offsets array in the function PartitionCsrSegReducePrealloc, so how is the offsets array obtained?

Also, what does tb mean here?

("tb", po::value<int>()->default_value(32), "B slab width")
