[WIP] IRON API extension and dev tools #1732

Draft
wants to merge 82 commits into
base: main
Changes from 79 commits
49bd37f
brainstorming proof of concept
hunhoffe Aug 16, 2024
048d156
Update test.cpp
hunhoffe Oct 14, 2024
095ce35
Clean up code in preparation for additional development
hunhoffe Oct 14, 2024
847be7c
Saving progress
hunhoffe Oct 14, 2024
b52c6ec
Fix formatting
hunhoffe Oct 15, 2024
181368a
Save progress
hunhoffe Oct 15, 2024
696633a
Update python/pyrightconfig.json
hunhoffe Oct 15, 2024
94f1fa4
working on building up runtime sequence
hunhoffe Oct 15, 2024
1a0567e
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 15, 2024
d450878
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 16, 2024
e1fd409
Fix paths from merge
hunhoffe Oct 16, 2024
2e3b923
rewrite passthrough kernel to use dma task operations
hunhoffe Oct 16, 2024
22a8418
First example minimally working again
hunhoffe Oct 16, 2024
a54659a
Rename api as iron2 (for now)
hunhoffe Oct 16, 2024
f2ce079
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 16, 2024
5e6a328
Working with tiler some more
hunhoffe Oct 17, 2024
92ff3d6
Started on DMA transpose example
hunhoffe Oct 17, 2024
ac0ec99
DMA transpose example minimally working
hunhoffe Oct 17, 2024
cb52c07
Add matrix_scalar_add experimental
hunhoffe Oct 17, 2024
57a0049
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 17, 2024
989d755
another example
hunhoffe Oct 17, 2024
745002c
Another example
hunhoffe Oct 17, 2024
4526758
object fifo forward
hunhoffe Oct 17, 2024
0fe4b8b
Working through a few more examples
hunhoffe Oct 17, 2024
ebb2716
Add experimental vector_exp example
hunhoffe Oct 17, 2024
106b702
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 17, 2024
dea7d28
stub of tensor tiler
hunhoffe Oct 18, 2024
00713ea
py fmt and imports
hunhoffe Oct 18, 2024
2650c41
Explore tile helper class
hunhoffe Oct 19, 2024
747ca3a
First version of tensor tiler
hunhoffe Oct 21, 2024
e5518fa
Merge branch 'main' into tiler-helper
hunhoffe Oct 21, 2024
1f86f05
Add some tests for the tiler
hunhoffe Oct 21, 2024
39e0a5d
Some improvements
hunhoffe Oct 21, 2024
03a0741
Merge branch 'main' into tiler-helper
hunhoffe Oct 22, 2024
8d67307
Some small improvements to tensortiler
hunhoffe Oct 22, 2024
637e314
Stub out example
hunhoffe Oct 22, 2024
d18a2ef
Added simple tiling examples
hunhoffe Oct 22, 2024
3c8ffb3
Merge branch 'main' into tiler-helper
hunhoffe Oct 22, 2024
d293c96
Update programming_examples/basic/tiling_exploration/per_tile/aie2.py
hunhoffe Oct 22, 2024
9c2ce5f
Fix makefile typos
hunhoffe Oct 22, 2024
2a3a484
Add tensor tiler tests
hunhoffe Oct 22, 2024
a47df3a
a couple more tests
hunhoffe Oct 22, 2024
babf9e7
Add a few more tests, remove template
hunhoffe Oct 22, 2024
46a487c
Add one more test
hunhoffe Oct 22, 2024
1071ee0
make tensortile test formatting a bit more sane
hunhoffe Oct 22, 2024
192194d
More python formatting
hunhoffe Oct 22, 2024
4f9656a
A few more tests
hunhoffe Oct 22, 2024
34ea2d8
Merge branch 'main' into tiler-helper
hunhoffe Oct 22, 2024
e437776
add visualization example
hunhoffe Oct 22, 2024
c744299
caption more correctly
hunhoffe Oct 22, 2024
d51e5c8
A bit of progress towards matrix_vector
hunhoffe Oct 23, 2024
fe56391
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 23, 2024
87df9a7
Merge branch 'main' into tiler-helper
hunhoffe Oct 23, 2024
880ee2f
update tiler code
hunhoffe Oct 21, 2024
0d63d47
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 23, 2024
5d998cc
Merge branch 'tiler-helper' into erika-iron-brainstorming
hunhoffe Oct 23, 2024
1140074
fix mistake from merge
hunhoffe Oct 23, 2024
3d6c8ea
DMA Transpose working with new TensorTiler
hunhoffe Oct 23, 2024
50d9ce3
Fix small typos in dma transpose designs
hunhoffe Oct 23, 2024
0e17621
Matrix scalar add working with new tiler
hunhoffe Oct 23, 2024
173cea1
Passthrough DMA working with TensorTile, but not TensorTiler2D
hunhoffe Oct 23, 2024
df855ff
passthrough kernel experimental working because of hack in dmatask
hunhoffe Oct 23, 2024
b71677b
add missing transfer length
hunhoffe Oct 23, 2024
b582e85
experimental working with row_wise_bias_add
hunhoffe Oct 23, 2024
6cf794c
experimental vector exp now working
hunhoffe Oct 23, 2024
aa5de3e
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 23, 2024
379b712
Remove development notes and restore unneeded changes
hunhoffe Oct 23, 2024
f4a0335
Use peano and do not pollute source dir in experimental tests
hunhoffe Oct 23, 2024
e9e57ab
Fix typo
hunhoffe Oct 23, 2024
3429d78
Stub out plans for placement
hunhoffe Oct 23, 2024
32e18c5
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 23, 2024
ee8a483
dma transpose with placer working
hunhoffe Oct 23, 2024
44f5c95
Port rest of examples to use SequentialPlacer
hunhoffe Oct 23, 2024
6cb2b6a
Stub out (untestsed) matrix vector
hunhoffe Oct 24, 2024
01731dc
Some notes for demo
hunhoffe Oct 24, 2024
8efd95b
more notes
hunhoffe Oct 24, 2024
d42987d
Merge branch 'main' into erika-iron-brainstorming
hunhoffe Oct 25, 2024
d842023
Small stylistic updates
hunhoffe Oct 25, 2024
1785464
more demo prep
hunhoffe Oct 25, 2024
5b2dd41
Add some composition notes
hunhoffe Oct 25, 2024
a46c8b4
Add some (untested) access count visualizations in addition to the ac…
hunhoffe Oct 25, 2024
6d83cab
Some cleanups for next demo
hunhoffe Oct 31, 2024
77 changes: 77 additions & 0 deletions erika_demo_notes/library_brainstorming.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@


import numpy as np

from aie.api.ml_lib import EltwiseAdd, EltwiseMul # Does not exist yet

from aie.api.io.iocoordinator import IOCoordinator
from aie.api.dataflow.objectfifo import ObjectFifo
from aie.api.placers import SequentialPlacer
from aie.api.program import Program
from aie.api.phys.device import NPU1Col4
from aie.helpers.tensortiler.tensortiler2D import TensorTiler2D

M = ...
N = ...
tensor_ty = np.ndarray[(M,N), np.dtype[np.uint8]]

of_a = ObjectFifo(2, tensor_ty, "inA")
of_b = ObjectFifo(2, tensor_ty, "inB")

# Does not exist yet start
add_workers, of_add_out = EltwiseAdd(of_a.second, of_b.second, n_cores=4, ...)
mul_workers, of_c = EltwiseMul(of_add_out.second, of_add_out.second, n_cores=4, ...)
# Does not exist yet end

io = IOCoordinator()
with io.build_sequence(tensor_ty, tensor_ty, tensor_ty) as (a_in, b_in, c_out):
    tiler = TensorTiler2D(M, N)
    for t in io.tile_loop(tiler.tile_iter()):
        io.fill(of_a.first, t, a_in)
        io.fill(of_b.first, t, b_in)
        io.drain(of_c.second, t, c_out)

my_program = Program(NPU1Col4(), io, workers=add_workers + mul_workers)
my_program.resolve_program(SequentialPlacer())

# Notes:
# - May need some notion of worker shape
# - Do you just rely on documentation for out data format/location
# or do you have something like a spensor object?


##################################################################################
# Alternate version with more placement control
import numpy as np

from aie.api.ml_lib import EltwiseAdd, EltwiseMul

from aie.api.io.iocoordinator import IOCoordinator
from aie.api.dataflow.objectfifo import ObjectFifo
from aie.api.placers import SequentialPlacer
from aie.api.program import Program
from aie.api.phys.device import NPU1Col4
from aie.helpers.tensortiler.tensortiler2D import TensorTiler2D

dev = NPU1Col4()

M = ...
N = ...
tensor_ty = np.ndarray[(M,N), np.dtype[np.uint8]]

of_a = ObjectFifo(2, tensor_ty, "inA")
of_b = ObjectFifo(2, tensor_ty, "inB")

add_workers, of_add_out = EltwiseAdd(of_a.second, of_b.second, n_cores=4, placement_resources=dev.tiles[0])
mul_workers, of_c = EltwiseMul(of_add_out.second, of_add_out.second, n_cores=4, placement_resources=dev.tiles[1])

io = IOCoordinator()
with io.build_sequence(tensor_ty, tensor_ty, tensor_ty) as (a_in, b_in, c_out):
    tiler = TensorTiler2D(M, N)
    for t in io.tile_loop(tiler.tile_iter()):
        io.fill(of_a.first, t, a_in, placement=dev.tiles[0].shim)
        io.fill(of_b.first, t, b_in, placement=dev.tiles[0].shim)
        io.drain(of_c.second, t, c_out, placement=dev.tiles[1].shim)

my_program = Program(dev, io, workers=add_workers + mul_workers)
my_program.resolve_program(SequentialPlacer())
56 changes: 56 additions & 0 deletions erika_demo_notes/minimal.py
@@ -0,0 +1,56 @@
# matrix_scalar_add/aie2.py -*- Python -*-
#
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
import numpy as np
import sys
from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.extras.context import mlir_mod_ctx
from aie.helpers.dialects.ext.scf import _for as range_

IMAGE_HEIGHT, IMAGE_WIDTH = 16, 128
IMAGE_SIZE = IMAGE_WIDTH * IMAGE_HEIGHT
TILE_HEIGHT, TILE_WIDTH = 8, 16
TILE_SIZE = TILE_WIDTH * TILE_HEIGHT

with mlir_mod_ctx() as ctx:
    def my_matrix_add_one():
        @device(AIEDevice.npu1_1col)
        def device_body():
            tile_ty = np.ndarray[(TILE_SIZE,), np.dtype[np.int32]]

            ShimTile = tile(0, 0)
            ComputeTile2 = tile(0, 2)

            of_in1 = object_fifo("in0", ShimTile, ComputeTile2, 2, tile_ty)
            of_out1 = object_fifo("out0", ComputeTile2, ShimTile, 2, tile_ty)

            @core(ComputeTile2)
            def core_body():
                for _ in range_(sys.maxsize):
                    elem_in = of_in1.acquire(ObjectFifoPort.Consume, 1)
                    elem_out = of_out1.acquire(ObjectFifoPort.Produce, 1)
                    for i in range_(TILE_SIZE):
                        elem_out[i] = elem_in[i] + 1
                    of_in1.release(ObjectFifoPort.Consume, 1)
                    of_out1.release(ObjectFifoPort.Produce, 1)

            @runtime_sequence(tile_ty, tile_ty, tile_ty)
            def sequence(inTensor, _, outTensor):
                npu_dma_memcpy_nd(metadata=of_in1, bd_id=1,
                    mem=inTensor,
                    sizes=[1, 1, TILE_HEIGHT, TILE_WIDTH],
                    strides=[1, 1, IMAGE_WIDTH, 1],
                    issue_token=True,
                )

                npu_dma_memcpy_nd(metadata=of_out1, bd_id=0,
                    mem=outTensor,
                    sizes=[1, 1, TILE_HEIGHT, TILE_WIDTH],
                    strides=[1, 1, IMAGE_WIDTH, 1],
                )
                dma_wait(of_in1, of_out1)
53 changes: 53 additions & 0 deletions erika_demo_notes/minimal_experimental.py
@@ -0,0 +1,53 @@
# passthrough_kernel/aie2.py -*- Python -*-
#
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
import itertools
import numpy as np

from aie.api.io.iocoordinator import IOCoordinator
from aie.api.dataflow.objectfifo import ObjectFifo
from aie.api.program import Program
from aie.api.placers import SequentialPlacer
from aie.api.worker import Worker
from aie.api.phys.device import NPU1Col1
from aie.helpers.tensortiler.tensortiler2D import TensorTiler2D
from aie.helpers.dialects.ext.scf import _for as range_

IMAGE_HEIGHT, IMAGE_WIDTH = 16, 128
IMAGE_SIZE = IMAGE_WIDTH * IMAGE_HEIGHT
TILE_HEIGHT, TILE_WIDTH = 8, 16
TILE_SIZE = TILE_WIDTH * TILE_HEIGHT

def my_matrix_add_one():
    tile_ty = np.ndarray[(TILE_SIZE,), np.dtype[np.int32]]

    of_in = ObjectFifo(4, tile_ty, "in0")
    of_out = ObjectFifo(4, tile_ty, "out0")

    def core_fn(of_in1, of_out1):
        elem_in = of_in1.acquire(1)
        elem_out = of_out1.acquire(1)
        for i in range_(TILE_SIZE):
            elem_out[i] = elem_in[i] + 1
        of_in1.release(1)
        of_out1.release(1)

    my_worker = Worker(core_fn,
                       fn_args=[of_in.second, of_out.first],
                       while_true=True)

    io = IOCoordinator()
    with io.runtime_sequence(tile_ty, tile_ty, tile_ty) as (in_tensor, _, out_tensor):
        tiler = TensorTiler2D(IMAGE_HEIGHT, IMAGE_WIDTH, TILE_HEIGHT, TILE_WIDTH)
        for t in io.tile_loop(itertools.islice(tiler.tile_iter(), 0, 1)):
            io.fill(of_in.first, t, in_tensor)
            io.drain(of_out.second, t, out_tensor, wait=True)

    return Program(NPU1Col1(), io, workers=[my_worker])


my_matrix_add_one().resolve_program(SequentialPlacer())
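
Note on the tiler arguments: the following is a minimal sketch, in plain Python, of how per-tile access patterns could be derived from the image and tile dimensions used above. `tile_patterns` is a hypothetical stand-in written for this note and makes no claim about how `TensorTiler2D` is implemented; the row-by-row tile order and the explicit `offset` field are assumptions. It is only meant to show that the first tile reproduces the `sizes` and `strides` that are hard-coded in `minimal.py`.

```python
from itertools import product

IMAGE_HEIGHT, IMAGE_WIDTH = 16, 128
TILE_HEIGHT, TILE_WIDTH = 8, 16


def tile_patterns(height, width, tile_h, tile_w):
    """Yield (offset, sizes, strides) for each tile of a row-major image.

    Hypothetical helper: tiles are visited left to right, then top to bottom.
    sizes/strides list the outermost dimension first and the innermost last.
    """
    for tile_row, tile_col in product(range(height // tile_h), range(width // tile_w)):
        # Element offset of the tile's top-left corner in the flat image.
        offset = tile_row * tile_h * width + tile_col * tile_w
        # Within a tile: tile_h rows spaced one image row apart,
        # each row holding tile_w contiguous elements.
        yield offset, [1, 1, tile_h, tile_w], [1, 1, width, 1]


patterns = list(tile_patterns(IMAGE_HEIGHT, IMAGE_WIDTH, TILE_HEIGHT, TILE_WIDTH))
first_offset, first_sizes, first_strides = patterns[0]
print(len(patterns), first_offset, first_sizes, first_strides)
```

Under these assumptions, taking only the first tile (as the `itertools.islice(..., 0, 1)` call above does) corresponds to the single 8x16 transfer in the hand-written version.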
21 changes: 21 additions & 0 deletions programming_examples/basic/dma_transpose/Makefile
@@ -24,12 +24,22 @@ build/aie.mlir: ${srcdir}/aie2.py
	mkdir -p ${@D}
	python3 $< ${M} ${K} > $@

build/experimental_aie.mlir: ${srcdir}/experimental_aie2.py
	mkdir -p ${@D}
	python3 $< ${M} ${K} > $@

build/final.xclbin: build/aie.mlir
	mkdir -p ${@D}
	cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
		--no-xchesscc --no-xbridge \
		--aie-generate-npu --npu-insts-name=insts.txt $(<:%=../%)

build/experimental_final.xclbin: build/experimental_aie.mlir
	mkdir -p ${@D}
	cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
		--no-xchesscc --no-xbridge \
		--aie-generate-npu --npu-insts-name=experimental_insts.txt $(<:%=../%)

${targetname}.exe: ${srcdir}/test.cpp
	rm -rf _build
	mkdir -p _build
@@ -44,5 +54,16 @@ endif
run: ${targetname}.exe build/final.xclbin
	${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --M ${M} --K ${K}

run_experimental: ${targetname}.exe build/experimental_final.xclbin build/experimental_insts.txt
	${powershell} ./$< -x build/experimental_final.xclbin -i build/experimental_insts.txt -k MLIR_AIE --M ${M} --K ${K}

generate_access_map: ${srcdir}/aie2.py
	mkdir -p ${@D}
	python3 $< --generate-access-map ${M} ${K}

generate_experimental_access_map: ${srcdir}/experimental_aie2.py
	mkdir -p ${@D}
	python3 $< --generate-access-map ${M} ${K}

clean:
	rm -rf build _build inst ${targetname}.exe
15 changes: 14 additions & 1 deletion programming_examples/basic/dma_transpose/README.md
@@ -15,11 +15,24 @@ This reference design can be run on a Ryzen™ AI NPU.
In the [design](./aie2.py), a 2-D array in a row-major layout is read from external memory to `ComputeTile2` with a transposed layout,
by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0).

This data movement transformation can be visualized as a map which shows the order in which the data is streamed (e.g., in transposed layout):
<p align="center">
<img
src="transpose_data.png">
<h3 align="center"> Visualization of the Transpose Data Transformation for M=32, K=16.
</h3>
</p>
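
As a rough illustration of what such a map encodes, the sketch below enumerates the element offsets visited by a four-entry access pattern such as `sizes=[1, K, M, 1]`, `strides=[1, 1, K, 1]` over a row-major `M`x`K` array. It assumes the first entry is the outermost loop and the last entry the innermost; `access_order` is a hypothetical helper written for this README, not part of the `aie` package, and the small `M` and `K` values are stand-ins chosen to keep the output readable.

```python
from itertools import product


def access_order(sizes, strides):
    """Yield flat element offsets in the order a nested sizes/strides pattern visits them.

    Assumption: sizes[0]/strides[0] describe the outermost loop,
    sizes[-1]/strides[-1] the innermost one.
    """
    for index in product(*(range(size) for size in sizes)):
        yield sum(i * stride for i, stride in zip(index, strides))


M, K = 4, 3  # stand-in dimensions: M rows, K columns, row-major storage
order = list(access_order([1, K, M, 1], [1, 1, K, 1]))
print(order)
```

Walking the row-major array in this order reads it column by column, which is the transposed layout the visualization depicts.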

The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/README.md/#object-fifo-link) of the programming guide.


To compile and run the design for NPU:
```bash
make
make run
```

To generate a data visualization of the transpose (like that above), run:
```bash
make generate_access_map
```
53 changes: 39 additions & 14 deletions programming_examples/basic/dma_transpose/aie2.py
@@ -5,27 +5,26 @@
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
import argparse
import numpy as np
import sys

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.extras.context import mlir_mod_ctx
from aie.helpers.dialects.ext.scf import _for as range_
from aie.helpers.tensortiler.tensortiler2D import TensorTiler2D

N = 4096
M = 64
K = 64

if len(sys.argv) == 3:
M = int(sys.argv[1])
K = int(sys.argv[2])
N = M * K
def my_passthrough(M, K, N, generate_acccess_map=False):
tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]
tile_in = next(TensorTiler2D(M, K, tensor_col_major=True).tile_iter())
tile_out = next(TensorTiler2D(K, M).tile_iter())

tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]
if generate_acccess_map:
tile_in.visualize(file_path="transpose_data.png")
return


def my_passthrough():
with mlir_mod_ctx() as ctx:

@device(AIEDevice.npu1_1col)
@@ -56,14 +55,40 @@ def sequence(A, B, C):
metadata=of_in,
bd_id=1,
mem=A,
sizes=[1, K, M, 1],
strides=[1, 1, K, 1],
sizes=tile_in.sizes,
strides=tile_in.strides,
issue_token=True,
)
npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N])
npu_dma_memcpy_nd(
metadata=of_out,
bd_id=0,
mem=C,
sizes=tile_out.sizes,
strides=tile_out.strides,
)
dma_wait(of_in, of_out)

print(ctx.module)


my_passthrough()
if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("dims", help="M K", type=int, nargs="*", default=[64, 64])
    p.add_argument(
        "--generate-access-map",
        action="store_true",
        help="Produce a file showing data access order",
    )
    args = p.parse_args()

    if len(args.dims) != 2:
        print(
            "ERROR: Must provide either no dimensions or both M and K", file=sys.stderr
        )
        exit(-1)
    my_passthrough(
        M=args.dims[0],
        K=args.dims[1],
        N=args.dims[0] * args.dims[1],
        generate_acccess_map=args.generate_access_map,
    )