Commit d444d47

XeGPU minor fix for chunk_size attribute (#935)

1 parent: faa5390
File changed: docs/rfcs/XeGPU.md (10 additions, 10 deletions)
````diff
@@ -354,7 +354,7 @@ In the example above, wi_data_size is 1, sg_map_size is 16, tensor_size is 128.
 	distribute_unit_size = sg_map_size[0] x sg_map_size[1] = subgroup_size x wi_data_size
 	tensor_size = tensor_desc[0] x tensor_desc[1]
 ```
-wi_data_size can be larger than 1, meaning that each work item operates on multiple elements, which are eventually lowered to a "SIMT-flavor" vector, such as a SPIR-V or LLVM vector. The multiple elements indicated by wi_data can only come from one dimension and must be contiguous in memory along either dimension.
+wi_data_size can be larger than 1, meaning that each work item operates on multiple elements, which are eventually lowered to a "SIMT-flavor" vector, such as a SPIR-V or LLVM vector, or packed into a storage data type for matrix operations. The multiple elements indicated by wi_data can only come from one dimension and must be contiguous in memory along either dimension.
 
 To distribute a tensor, tensor_size must be divisible by distribute_unit_size. More specifically, tensor_desc[0] must be divisible by wi_layout[0] x wi_data[0], and tensor_desc[1] by wi_layout[1] x wi_data[1]. The 2D subtensor is evenly distributed to work items, so each work item gets a 2D data fragment, which may contain multiple distributions of wi_data elements.
````
````diff
@@ -400,12 +400,12 @@ For load_nd with `transpose` attribute, wi_layout is transposed to match with th
 	tensor_desc<16xfp32, #tdesc_attr, #sg_map_t>, vector<1xi1> -> vector<1xfp32>
 ```
 
-The example below shows each WI loading 4 fp32 data elements with chunk_size_per_lane. This load with chunk_size_per_lane effectively loads a 2D tensor and transposes it. The data fragment <1x4xf32> is loaded and transposed as <4x1xf32>.
+The example below shows each WI loading 4 fp32 data elements with chunk_size. This load with chunk_size effectively loads a 2D tensor and transposes it. The data fragment <1x4xf32> is loaded and transposed as <4x1xf32>.
 ```mlir
 #sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
 #scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
 %scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
-		{chunk_size_per_lane=4} :
+		{chunk_size=4} :
 	uint64, vector<16xindex> into tensor_desc<16x4xfp32, #scatter_attr, #sg_map_t>
 
 %result = xegpu.load_gather %scatter_tdesc_chunk, %mask {L1 = cached, L2 = uncached, transpose=[1,0]} :
````
````diff
@@ -473,15 +473,15 @@ user must use for the WI data distribution of 1d block load and regular load wit
 # assert (wi_layout[0] x wi_layout[1] == subgroup_size) // PVC subgroup_size = 16
 #sg_map = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
 
-For regular load with chunk_size_per_lane // PVC subgroup_size = 16
-#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
+For regular load with chunk_size // PVC subgroup_size = 16
+#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, chunk_size]>
 
 For 1d block load
 # assert (wi_layout[0] x wi_layout[1] == subgroup_size) // ARC subgroup_size = 8
 #sg_map = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]>
 
-For regular load with chunk_size_per_lane // ARC subgroup_size = 8
-#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 1]>
+For regular load with chunk_size // ARC subgroup_size = 8
+#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, chunk_size]>
 ```
 
 ## Rules of sg_map setting for DPAS on PVC and ARC
````
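The pattern the hunk above fixes, wi_data going from `[1, 1]` to `[1, chunk_size]`, is the same on both architectures; only the subgroup size differs. A tiny Python sketch (helper name is hypothetical, not an XeGPU API) makes the rule explicit:

```python
# Sketch of the sg_map a user picks for a regular (scattered) load with
# chunk_size, per the rule above; the function name is illustrative only.

def sg_map_for_chunked_load(subgroup_size, chunk_size):
    wi_layout = [subgroup_size, 1]  # one lane per row of the tensor_desc
    wi_data = [1, chunk_size]       # each lane owns chunk_size contiguous elements
    return wi_layout, wi_data

print(sg_map_for_chunked_load(16, 4))  # PVC: ([16, 1], [1, 4])
print(sg_map_for_chunked_load(8, 4))   # ARC: ([8, 1], [1, 4])
```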
````diff
@@ -549,14 +549,14 @@ An example on how to load a 2d block, perform dpas, and store back to memory.
 ```
 
 ## sg_map use case - regular load:
-An example of how to perform a transpose using load_gather with chunk_size_per_lane in SIMT flavor.
+An example of how to perform a transpose using load_gather with chunk_size in SIMT flavor.
 
 ```mlir
-#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
+#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
 #scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
 %scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
-		{chunk_size_per_lane=4} :
+		{chunk_size=4} :
 	uint64, vector<16xindex> into tensor_desc<16x4xfp32, #scatter_attr, #sg_map_t>
 
 %result = xegpu.load_gather %scatter_tdesc_chunk, %mask {L1 = cached, L2 = uncached, transpose=[1,0]} :
````
