Skip to content

Commit

Permalink
offset, len, and stride in words not bytes (or 32b) (Xilinx#1111)
Browse files Browse the repository at this point in the history
  • Loading branch information
makslevental authored Mar 19, 2024
1 parent e0514b1 commit 1fb02e8
Show file tree
Hide file tree
Showing 33 changed files with 546 additions and 236 deletions.
9 changes: 2 additions & 7 deletions include/aie/Dialect/AIE/IR/AIEInterfaces.td
Original file line number Diff line number Diff line change
Expand Up @@ -133,12 +133,7 @@ def AIETarget : OpInterface<"AIETarget"> {
];
}

// def OffloadingTranslationAttrTrait :
// NativeTrait<"OffloadingTranslationAttrTrait", ""> {
// let cppNamespace = "::mlir::gpu";
// }

def MyOffsetSizeAndStrideOpInterface: OpInterfaceTrait<"::xilinx::AIE::MyOffsetSizeAndStrideOpInterface"> {
}
// Don't delete - see AIEDialect::myVerifyOffsetSizeAndStrideOp
def MyOffsetSizeAndStrideOpInterface: OpInterfaceTrait<"::xilinx::AIE::MyOffsetSizeAndStrideOpInterface"> {}

#endif // AIE_INTERFACES
78 changes: 57 additions & 21 deletions include/aie/Dialect/AIE/IR/AIEOps.td
Original file line number Diff line number Diff line change
Expand Up @@ -745,46 +745,67 @@ def AIE_DMABDPACKETOp: AIE_Op<"dma_bd_packet", []> {
}];
}

def AIE_DMABDOp: AIE_Op<"dma_bd", []> {
let summary = "Declare a dma block descriptor op";
def AIE_DMABDOp: AIE_Op<"dma_bd", [
ParentOneOf<["MemOp", "MemTileDMAOp", "ShimDMAOp", "DMAOp"]>,
]> {
let summary = "Declare a dma buffer descriptor op";
let description = [{
This operation describes a block descriptor for DMA operations. In particular, it specifies
what buffer addresss to use, the transfer length, and the buffer type (A or B).
This operation describes a buffer descriptor for DMA operations. In particular, it specifies
what buffer to use, and optionally:

This operation must be used in an MLIR block that lives inside a MemOp's region.
The block descriptor specifies what lock to use and the buffer configuration.
1. the offset into the buffer;
2. the transfer length;
3. the sizes and strides for n-d tensor addressing (described below);
4. the "bd_id" with which to associate the buffer descriptor (most often left empty).

`offset`, `len`, `size`s and `stride`s are all denominated in element width; e.g., transferring the whole of
`memref<512xi32>` means `len == 512`, and also while transferring the whole of `memref<512xi16>`, `len == 512`.

The only caveat to this "everything-is-in-terms-of-element-width" rule is regarding the inner-most dimension's stride
(see [Important gotcha regarding strides](#important-gotcha-regarding-strides) below).

`dma_bd` ops must appear in their own BBs (basic blocks) and such BBs can (optionally) include `use_lock`
operations (specifying an "acquire" and a "release" lock; see the `use_lock` operation) and subsequent BDs in
a "chain" of BDs (using `next_bd` as a "jump" to the next BB which contains a `dma_bd`).

Example:
```
// this defines a BD that uses lock %lck0 and buffer %buf0
^bd5:
aie.use_lock(%lck, "Acquire", 0)
aie.dma_bd(<$buf0 : memref<512xi32>, 0, 512>, 1)
// transfer the first 32 elements of the memref
aie.dma_bd(<$buf0 : memref<128xi32>, 0, 32)
aie.use_lock(%lck, "Release", 1)
br ^bd6 // point to the next Block, which is also a different Block Descriptor
aie.next_bd ^bd6 // point to the next bb, which describes the next buffer descriptor
^bd6:
aie.use_lock(%lck, "Acquire", 1)
// transfer the last 32 elements of the memref
aie.dma_bd(<$buf1 : memref<128xi32>, 96, 32)
aie.use_lock(%lck, "Release", 0)
aie.next_bd ^end

...

// this defines a BD that does not use any lock
^bd8:
aie.dma_bd(<$buf1 : memref<64xi32>, 0, 64>, 0)
aie.dma_bd(<$buf2 : memref<64xi32>, 0, 64)
```
A DMA channel in a Memory Module can process one block descriptor after another by chaining them.
There are 16 block descriptors per Memory Module. They are shared by four DMA channels.

## Background/context:

A DMA channel in a Memory Module can process one buffer descriptor after another by chaining them.
There are 16 buffer descriptors per Core memory module and 48 buffer descriptors per Memtile memory module.
They are shared by four DMA channels (or 12).

## DMA Data Layout Transformations on AIE-ML Devices

AIE-ML devices can apply data layout transformations at the buffer
descriptor level. These transformation are described by strides and sizes in up to three dimensions (four
dimensions on memtiles). Strides and sizes can be supplied to the `dma_bd`
through an optional argument, an array of tuples `<size, stride>`.

The first element of this array gives the _highest-dimension_ stride and
size, the last element of the array gives the lowest-dimension.

Strides are always expressed in units of `i32`s; this is an architectural
requirement, as data is moved by the DMA at this fundamental size.
through an optional argument, an array of "tuple-like" attributes `bd_dim_layout<size, stride>`.

The first element of this array gives the outer-most dimension's stride and
size, the last element of the array gives the inner-most dimension's stride and size.
We can model the access pattern strides and sizes generate by a series of
nested loops. In general, a set of strides and sizes like this...

Expand Down Expand Up @@ -820,11 +841,18 @@ def AIE_DMABDOp: AIE_Op<"dma_bd", []> {
for(int k = 0; k < 8 /*size_0*/; k++)
// access/store element at/to index (i * 16 /*stride_2*/ + j * 1 /*stride_1*/ + k * 2 /*stride_0*/)
```

## Important gotcha regarding strides

All strides are expressed in multiples of the element width (just like `len` and `offset`)
**with the caveat that the inner-most dimension's stride must be 1**.
}];

let arguments = (
ins AnyMemRef:$buffer,
OptionalAttr<AIEI32Attr>:$offset,
// in multiples of element width (not bytes)
DefaultValuedOptionalAttr<AIEI32Attr, "0">:$offset,
// in multiples of element width (not bytes)
OptionalAttr<AIEI32Attr>:$len,
OptionalAttr<BDDimLayoutArrayAttr>:$dimensions,
OptionalAttr<AIEI32Attr>:$bd_id,
Expand All @@ -840,8 +868,16 @@ def AIE_DMABDOp: AIE_Op<"dma_bd", []> {

let extraClassDeclaration = [{
BufferOp getBufferOp();
int getOffsetValue() { return getOffset().value_or(0); }
int getLenValue() { return getLen().value_or(getBuffer().getType().getNumElements()); }
int32_t getBufferElementTypeWidthInBytes() {
return getBuffer().getType().getElementTypeBitWidth() / 8;
}
int32_t getLenInBytes() {
if (std::optional<int32_t> len = getLen(); len.has_value())
return len.value() * getBufferElementTypeWidthInBytes();
else
return getBuffer().getType().getNumElements() * getBufferElementTypeWidthInBytes();
}
int32_t getOffsetInBytes() { return getOffset() * getBufferElementTypeWidthInBytes(); }
}];

let hasVerifier = 1;
Expand Down
83 changes: 51 additions & 32 deletions lib/Dialect/AIE/IR/AIEDialect.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1545,60 +1545,79 @@ LogicalResult DMABDOp::verify() {
if (!isa<BufferOp, ExternalBufferOp>(getBuffer().getDefiningOp()))
return emitOpError(
"BDs only support BufferOp or ExternalBufferOp operands.");
if (auto memOp = getOperation()->getParentOfType<MemOp>()) {
if (auto bufferOp = getBufferOp();
bufferOp.getTileOp().colIndex() != memOp.colIndex() ||
bufferOp.getTileOp().rowIndex() != memOp.rowIndex())
return emitOpError("can only access a buffer in the same tile.");
}

// The following checks only apply if non-default strides/wraps are defined.
if (getDimensions()) {
MemRefType buffer = getBuffer().getType();
// We are not restrictive about the type of the memref used as the input
// to the DMABD when used with multi-dimensional strides/wraps. Since the
// BD will use the memref as a base address and copy from it in 32 bit
// chunks, while assuming the layout of the memref is contiguous. We
// assume the user/compiler understands and accounts for this.
uint64_t memrefSize = 1; // in bytes
uint64_t maxIdx = 0;
for (int64_t memrefDim : buffer.getShape())
memrefSize *= 4 * memrefDim;

ArrayRef<BDDimLayoutAttr> dims = *getDimensions();
if (getLenInBytes() % 4)
return emitOpError("transfer length must be multiple of 4 (i.e., represent "
"4 byte aligned address)");

TileID parentTileId = getParentTileElement(getOperation()).getTileID();

if (getOperation()->getParentOfType<MemOp>() &&
(getBufferOp().getTileOp().colIndex() != parentTileId.col ||
getBufferOp().getTileOp().rowIndex() != parentTileId.row))
return emitOpError(
"Core tile DMAs can only access a buffer in the same tile.");

const AIETargetModel &targetModel = getTargetModel(getOperation());

uint32_t maxBds = targetModel.getNumBDs(parentTileId.col, parentTileId.row);
if (std::optional<int32_t> bdId = getBdId();
bdId.has_value() && static_cast<uint32_t>(*bdId) >= maxBds)
return emitOpError("bdId attribute exceeds max: ") << maxBds - 1;
if (std::optional<int32_t> nextBdId = getNextBdId();
nextBdId.has_value() && static_cast<uint32_t>(*nextBdId) >= maxBds)
return emitOpError("nextBdId attribute exceeds max: ") << maxBds - 1;
if (auto dims = getDimensions(); dims.has_value()) {
size_t maxNDims = 3;
if (isa_and_nonnull<MemTileDMAOp>(getOperation()->getParentOp()))
maxNDims = 4;

if (dims.size() > maxNDims)
if (dims->size() > maxNDims)
return emitOpError() << "Cannot give more than "
<< std::to_string(maxNDims)
<< " dimensions for step sizes and wraps in this "
" tile (got "
<< std::to_string(dims.size()) << " dimensions).";
<< std::to_string(dims->size()) << " dimensions).";

for (BDDimLayoutAttr dim : dims) {
MemRefType buffer = getBuffer().getType();
int64_t maxIdx = 0;
for (BDDimLayoutAttr dim : *dims) {
maxIdx += dim.getStride() * (dim.getSize() - 1);
if (0 == dim.getStride())
return emitOpError()
<< "Invalid step size; must be a positive integer.";
if (dim.getStride() > memrefSize)
if (dim.getStride() > buffer.getNumElements())
return emitOpError()
<< "Step size " << std::to_string(dim.getStride() * 4) << " "
<< "bytes exceeds memref size " << std::to_string(memrefSize);
<< "Step size " << std::to_string(dim.getStride()) << " "
<< "exceeds memref size "
<< std::to_string(buffer.getNumElements());
if (dim.getSize() >= (1UL << 9) + 1)
return emitOpError() << "Size may not exceed 1023.";
if (dim.getStride() >= (1UL << 19))
return emitOpError() << "Stride may not exceed " << (1 << 20);
}

if (memrefSize <= 4 * maxIdx)
if (buffer.getNumElements() <= maxIdx)
return emitOpError() << "Specified stride(s) and size(s) result in out "
"of bounds access in buffer, for index "
<< std::to_string(maxIdx) << ", accessing at "
<< std::to_string(4 * maxIdx)
<< " byte offset in memref of length "
<< std::to_string(memrefSize) << ".";
<< std::to_string(maxIdx) << " in memref of length "
<< std::to_string(buffer.getNumElements()) << ".";

// Since streams read 32b words, there's no way to read eg 16b with stride
// of 2 (ie lower halfs of each 32b). So force it to be 1 (and then in
// CDODirect/XAIEV2 scale the size by 4/getBufferElementTypeWidthInBytes).
if (getBufferElementTypeWidthInBytes() < 4 && dims->back().getStride() != 1)
return emitOpError(
"For <32b width datatypes, inner-most dim stride must be 1");
}
if (targetModel.isMemTile(parentTileId.col, parentTileId.row) ||
targetModel.isCoreTile(parentTileId.col, parentTileId.row)) {
if (auto baseAddr = getBufferOp().getAddress(); baseAddr.has_value()) {
int offsetInBytes = *baseAddr + getOffsetInBytes();
if (offsetInBytes % 4)
return emitOpError(
"bd address must be 4 byte (32b) aligned; got base+offset: ")
<< offsetInBytes << " (bytes)";
}
}

if (!getLen() && !getBuffer().getType().hasStaticShape())
Expand Down
39 changes: 14 additions & 25 deletions lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -204,14 +204,6 @@ struct AIEObjectFifoStatefulTransformPass
return !hasSharedMemory || atLeastOneConsumerWantsTransform;
}

/// Function to multiply all dimensions of a memref.
int64_t getMemrefTypeSize(MemRefType memref) {
int64_t size = 1;
for (auto dim : memref.getShape())
size *= dim;
return size;
}

/// Function to retrieve ObjectFifoLinkOp of ObjectFifoCreateOp,
/// if it belongs to one.
std::optional<ObjectFifoLinkOp> getOptionalLinkOp(ObjectFifoCreateOp op) {
Expand Down Expand Up @@ -325,14 +317,14 @@ struct AIEObjectFifoStatefulTransformPass
.getElemType()
.cast<AIEObjectFifoType>();
auto elemInType = fifoInType.getElementType().cast<MemRefType>();
int inSize = getMemrefTypeSize(elemInType);
int inSize = elemInType.getNumElements();

auto fifoOutType = linkOp->getOutputObjectFifos()[0]
.getElemType()
.cast<AIEObjectFifoType>();
auto elemOutType = fifoOutType.getElementType().cast<MemRefType>();

if (int outSize = getMemrefTypeSize(elemOutType); inSize >= outSize) {
if (int outSize = elemOutType.getNumElements(); inSize >= outSize) {
if (op.name() != fifoIn.name())
return;
} else {
Expand Down Expand Up @@ -474,11 +466,10 @@ struct AIEObjectFifoStatefulTransformPass

int acqNum = 1;
int relNum = 1;
int offset = 0;

auto fifo = op.getElemType().cast<AIEObjectFifoType>();
auto elemType = fifo.getElementType().cast<MemRefType>();
int len = getMemrefTypeSize(elemType);
int len = elemType.getNumElements();

// search for the buffers/locks (based on if this objFifo has a link)
ObjectFifoCreateOp target = op;
Expand Down Expand Up @@ -539,8 +530,8 @@ struct AIEObjectFifoStatefulTransformPass

builder.setInsertionPointToStart(curr);
createBdBlock<BufferOp>(builder, target, lockMode, acqNum, relNum,
buffersPerFifo[target][blockIndex], offset, len,
channelDir, blockIndex, succ, dims);
buffersPerFifo[target][blockIndex], /*offset*/ 0,
len, channelDir, blockIndex, succ, dims);
curr = succ;
blockIndex++;
}
Expand All @@ -558,7 +549,6 @@ struct AIEObjectFifoStatefulTransformPass

int acqNum = 1;
int relNum = 1;
int offset = 0;

// search for ShimDMAOp
Operation *producerDMA = nullptr;
Expand Down Expand Up @@ -612,12 +602,12 @@ struct AIEObjectFifoStatefulTransformPass
succ = builder.createBlock(endBlock);

MemRefType buffer = externalBuffersPerFifo[op][blockIndex].getType();
int len = getMemrefTypeSize(buffer);
int len = buffer.getNumElements();
builder.setInsertionPointToStart(curr);
createBdBlock<ExternalBufferOp>(builder, op, lockMode, acqNum, relNum,
externalBuffersPerFifo[op][blockIndex],
offset, len, channelDir, blockIndex, succ,
dims);
/*offset*/ 0, len, channelDir, blockIndex,
succ, dims);
curr = succ;
blockIndex++;
}
Expand All @@ -633,11 +623,9 @@ struct AIEObjectFifoStatefulTransformPass
if (numBlocks == 0)
return;

int offset = 0;
auto fifo = op.getElemType().cast<AIEObjectFifoType>();
auto elemType = fifo.getElementType().cast<MemRefType>();
int lenOut = getMemrefTypeSize(elemType);
int bytes = elemType.getElementTypeBitWidth() / 8;
int lenOut = elemType.getNumElements();
int acqNum = 1;
int relNum = 1;

Expand All @@ -663,7 +651,7 @@ struct AIEObjectFifoStatefulTransformPass
auto elemType = fifoType.getElementType().cast<MemRefType>();
if (fifoIn.name() == op.name())
break;
extraOffset += getMemrefTypeSize(elemType);
extraOffset += elemType.getNumElements();
}
}
} else if (linkOp->isDistribute()) {
Expand All @@ -678,15 +666,15 @@ struct AIEObjectFifoStatefulTransformPass
auto elemType = fifoType.getElementType().cast<MemRefType>();
if (fifoOut.name() == op.name())
break;
extraOffset += getMemrefTypeSize(elemType);
extraOffset += elemType.getNumElements();
}
}
} else {
if (target != op) {
auto targetFifo = target.getElemType().cast<AIEObjectFifoType>();
auto targetElemType =
targetFifo.getElementType().cast<MemRefType>();
lenOut = getMemrefTypeSize(targetElemType);
lenOut = targetElemType.getNumElements();
}
}

Expand Down Expand Up @@ -748,8 +736,9 @@ struct AIEObjectFifoStatefulTransformPass
succ = builder.createBlock(endBlock);

builder.setInsertionPointToStart(curr);
int offset = 0;
if (isDistribute || isJoin)
offset = extraOffset * bytes;
offset = extraOffset;
createBdBlock<BufferOp>(builder, target, lockMode, acqNum, relNum,
buffersPerFifo[target][blockIndex], offset,
lenOut, channelDir, blockIndex, succ, dims);
Expand Down
Loading

0 comments on commit 1fb02e8

Please sign in to comment.