questions about slice_col_par #19

Lenan22 · 2024-04-07T13:35:10Z

```cpp
int slice_col_par = (iters * blockIdx.x) / k_tiles;
int slice_col = slice_col_par; //
int slice_iters; // number of threadblock tiles in the current slice
int slice_count = 0; // total number of active threadblocks in the current slice
int slice_idx; // index of threadblock in current slice; numbered bottom to top

if (slice_col_par >= n_tiles) {
```
I have some questions about the code above. Suppose the GPU has 108 SMs, the computed iters is 19, and blockIdx.x ranges from 0 to 127. Is slice_col_par always computed directly from iters = 19? For some blocks, for instance blockIdx.x = 5, the thread block might not actually iterate 19 times.

efrantar · 2024-04-07T13:59:14Z

If the batch size is larger than 64, we essentially process multiple batch-size-64 matmuls in a single kernel invocation (to allow better partitioning). This is done by virtually replicating the matrix. Consider this example: parallel = 2, a matrix that partitions into 4 tiles, and 3 SMs:

SM -> tile assignment:

00 12
01 12

01 01 // slice_col
01 23 // slice_col_par

slice_col points to the actual column in the matrix and slice_col_par to the column in the virtually replicated version.
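
The index arithmetic behind the example can be checked on the host. This is only a rough sketch, not the kernel code. It assumes the 4 tiles are arranged as k_tiles = 2 by n_tiles = 2, and that iters is the total tile count divided by the number of blocks, rounded up. The modulo steps for the row and for slice_col are inferred from the diagram. The names BlockStart, iters_for and start_of_block are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper type for illustration only; it is not part of the kernel.
struct BlockStart {
    int slice_row;      // tile row where the block starts (assumed: first tile % k_tiles)
    int slice_col_par;  // column in the virtually replicated matrix
    int slice_col;      // column in the actual matrix
};

// Assumed definition of iters: total tiles divided by the number of blocks, rounded up.
int iters_for(int k_tiles, int n_tiles, int parallel, int blocks) {
    assert(blocks > 0);
    int total = k_tiles * n_tiles * parallel;
    return (total + blocks - 1) / blocks;
}

// Where a block starts, given that every block begins iters tiles after the previous one.
BlockStart start_of_block(int block, int iters, int k_tiles, int n_tiles) {
    assert(k_tiles > 0 && n_tiles > 0);
    int first = iters * block;  // linear index of the block's first tile, counted column by column
    BlockStart s;
    s.slice_row = first % k_tiles;
    s.slice_col_par = first / k_tiles;        // same expression as in the quoted snippet
    s.slice_col = s.slice_col_par % n_tiles;  // wrap back into the real matrix (inferred)
    return s;
}
```

With the example's numbers, iters_for(2, 2, 2, 3) gives 3, and start_of_block(2, 3, 2, 2) gives slice_col_par = 3 and slice_col = 1, which agrees with SM 2 in the diagram. The start position depends only on iters and the block index, not on how many tiles the block ends up processing.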

Yes, it can happen that a few SMs (here SM 2) process fewer tiles than others; however, the distribution should usually be quite even since our partitioning is designed so that one SM can partially process multiple columns (see SM 0 or SM 1 above).
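
To see how even the split is, one can walk the tiles of the replicated matrix column by column and give each block a run of up to iters consecutive tiles. A minimal host-side sketch under the same assumptions as the example (2 x 2 tiles, parallel = 2, 3 SMs, iters taken as the rounded-up quotient). The helpers tiles_of_block and owner_grid are hypothetical and do not come from the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; they are not part of the kernel.

// Number of tiles that block `block` processes when each block takes up to
// `iters` consecutive tiles out of `total`.
int tiles_of_block(int block, int iters, int total) {
    assert(iters > 0);
    int first = iters * block;
    return std::max(0, std::min(iters, total - first));
}

// One string per tile row; each character is the block that owns that tile.
// Single digits only, so this sketch is limited to at most 10 blocks.
std::vector<std::string> owner_grid(int k_tiles, int n_tiles, int parallel, int blocks) {
    assert(blocks > 0 && blocks <= 10);
    int total = k_tiles * n_tiles * parallel;
    int iters = (total + blocks - 1) / blocks;  // assumed: rounded-up quotient
    std::vector<std::string> rows(k_tiles);
    for (int row = 0; row < k_tiles; ++row) {
        for (int col = 0; col < n_tiles * parallel; ++col) {
            int tile = col * k_tiles + row;  // tiles are counted down each column first
            rows[row] += static_cast<char>('0' + tile / iters);
        }
    }
    return rows;
}
```

For the example, owner_grid(2, 2, 2, 3) yields the rows 0012 and 0112, and tiles_of_block(b, 3, 8) returns 3, 3 and 2 for SMs 0, 1 and 2. Only the last block can come up short, and by less than iters tiles.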

Lenan22 · 2024-04-10T11:55:34Z

Thanks a lot
