Cocktailer Artifact #518

Merged · 199 commits · Apr 24, 2023
Changes from 1 commit

Commits (199)
a488ac2
some util functions for onnx frontend
xysmlx Jul 15, 2021
ae261c7
more datatype support in constant op
xysmlx Jul 15, 2021
73d3e6f
draft: onnx frontend support for if and loop op
xysmlx Jul 15, 2021
6cff7e7
draft: if and loop op_define
xysmlx Jul 15, 2021
ed0221a
refactor GraphConvert of the ONNX frontend, support if and loop convert
xysmlx Jul 15, 2021
1a30651
draft: cuda kernel emitter for if and loop op (placeholder)
xysmlx Jul 15, 2021
436d7d3
update onnx frontend convert and shape inference for loop op
xysmlx Jul 15, 2021
a2863d7
fix output size bug in ONNX Loop op convert
xysmlx Jul 20, 2021
b7b66b2
Generic_op_define and ONNX opset_11 frontend converter for ScatterND op
xysmlx Jul 20, 2021
ffae3f7
Comment m_expression construction of generic_op to bypass translate f…
xysmlx Jul 20, 2021
f771004
Merge branch 'master' into control_flow
xysmlx Aug 10, 2021
811a126
Merge master branch into control_flow branch
xysmlx Nov 17, 2021
a77ee7b
disable ORT optimizations
xysmlx Nov 18, 2021
d42be30
fix bug for disabling ORT optimizations
xysmlx Nov 18, 2021
0a29cc1
fix bug for disabling ORT optimizations
xysmlx Nov 18, 2021
2f29e0d
temp
nox-410 Nov 18, 2021
b0d4c56
Support -ftuning_list in kernel tuning pass
xysmlx Nov 28, 2021
faedc9b
Merge remote-tracking branch 'origin/control_flow' into control_flow_1
nox-410 Nov 28, 2021
6812438
Implement If and Loop code gen
nox-410 Dec 2, 2021
324f2cf
add mod operator
xysmlx Dec 2, 2021
e45ce1e
Add recursion
nox-410 Dec 2, 2021
5a2955a
Merge remote-tracking branch 'origin/control_flow' into control_flow_1
nox-410 Dec 2, 2021
df4ea2e
add -fcodegen_pybind
xiayuqing0622 Dec 3, 2021
2d2a9c5
fix bug
xiayuqing0622 Dec 3, 2021
35212e6
Fix code for cudaEmitter
nox-410 Dec 4, 2021
ef037e1
Fix recursion kernel name
nox-410 Dec 4, 2021
ee06cb9
Remove unused param in control flow
nox-410 Dec 4, 2021
a72c89f
Create base class for controlflow emitter
nox-410 Dec 6, 2021
d67fa83
Recursion Op workspace allocation
nox-410 Dec 6, 2021
14536eb
python patch
heheda12345 Dec 7, 2021
51f5de2
add kernel_entry
heheda12345 Dec 7, 2021
3ff6e40
remove half
heheda12345 Dec 7, 2021
5eb9ef4
allocate tensor in c
heheda12345 Dec 7, 2021
3b62fc8
torch::tensor for one output
heheda12345 Dec 7, 2021
912c126
tmp fix
xiayuqing0622 Dec 7, 2021
beafc0b
fix
nox-410 Dec 7, 2021
dbde8dd
list of tensor
heheda12345 Dec 7, 2021
8016924
Merge branch 'master' of github.com:heheda12345/nnfusion
Dec 7, 2021
7b1bdb0
fix bias dim
heheda12345 Dec 7, 2021
7648008
apply shared memory
nox-410 Dec 8, 2021
0023532
stderr
heheda12345 Dec 8, 2021
b6e052a
pybind int64
heheda12345 Dec 8, 2021
15bdba5
Fix recursion
nox-410 Dec 9, 2021
07a7f59
some parameter changes
nox-410 Dec 9, 2021
80726df
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 9, 2021
1c6f23a
bypass reshape and broadcast
nox-410 Dec 9, 2021
e3cfae0
fix duplicate node
nox-410 Dec 9, 2021
6176f8a
bugfix
nox-410 Dec 10, 2021
3c22fce
Bypass GatherV2 & merge code
nox-410 Dec 10, 2021
5dd292e
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 10, 2021
e9e9bd2
Adjust parameter
nox-410 Dec 10, 2021
7a58506
Merge remote-tracking branch 'zc/master' into control_flow_finetune
nox-410 Dec 11, 2021
b694f34
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 11, 2021
a23d8aa
fix a bug related with gatherV2
nox-410 Dec 11, 2021
38cfaaa
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 11, 2021
bec08a9
fix reshape & broadcast bypass
nox-410 Dec 12, 2021
5071d8d
fix reshape & broadcast bypass
nox-410 Dec 12, 2021
8c57871
add threadfence
nox-410 Dec 13, 2021
bb4e4a0
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 13, 2021
0598667
Support multiple outputs
nox-410 Dec 23, 2021
4888a31
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 23, 2021
8d992b2
Add loop initialization
nox-410 Dec 23, 2021
bb54c8a
Fix extern result memory for Loop
nox-410 Dec 23, 2021
6b57878
Support broadcast Matmul
nox-410 Dec 25, 2021
287b0be
Adjust parameters
nox-410 Dec 25, 2021
6242182
Allow scalar float in torch codegen
nox-410 Dec 25, 2021
6c3a48d
Skip inplace analysis for subgraphs
nox-410 Dec 25, 2021
2016e0d
Fix Reshape error in Control flow
nox-410 Dec 27, 2021
2e43315
Use injected SumOp
nox-410 Dec 27, 2021
3233154
update Dot kernel cache
nox-410 Dec 28, 2021
74059ad
Fix controlflow inplace
nox-410 Dec 29, 2021
ebf0679
Set memory reuse to false
nox-410 Dec 29, 2021
28f3b07
Merge branch 'control_flow_finetune' of https://github.com/nox-410/nn…
heheda12345 Dec 29, 2021
11f9caf
grid.sync() & elementwise
heheda12345 Jan 1, 2022
5fd70b2
add scatternd op
heheda12345 Jan 1, 2022
cad0688
-fcheck_result
heheda12345 Jan 7, 2022
a87cfb5
add control edge to loop graph & fix kernel fusion
heheda12345 Jan 8, 2022
d68b2e7
concat with fewer resource
heheda12345 Jan 8, 2022
94e78d4
concat: no inplace
heheda12345 Jan 9, 2022
4429553
elementwise: general blockdim
heheda12345 Jan 9, 2022
2d8229b
hack: add more dependency
heheda12345 Jan 9, 2022
98ea31b
forward control edge
heheda12345 Jan 9, 2022
8cf153c
remove useless barrier
heheda12345 Jan 9, 2022
7616e7c
add return
heheda12345 Jun 16, 2022
2d435fa
fix bug in conv-bn
heheda12345 Jul 22, 2022
903c774
add roller
heheda12345 Jul 25, 2022
fd0dc0d
debug tensor
heheda12345 Jul 25, 2022
847be21
add roller as submodule
heheda12345 Jul 25, 2022
db67a1a
update gitignore
heheda12345 Jul 25, 2022
55706ff
change weight of inner graph to Constant op
heheda12345 Jul 25, 2022
0b71dba
gnode cout
heheda12345 Jul 25, 2022
626c66e
identity op
heheda12345 Jul 25, 2022
e2b7995
blockCudaEmitter: emit parameters from function sig
heheda12345 Jul 25, 2022
ba24cbb
draft version of separate kernel launch, conflict with postprocessing…
heheda12345 Jul 26, 2022
ad3802d
fix typo
heheda12345 Jul 26, 2022
b3d0f8c
two branch call finish
heheda12345 Jul 27, 2022
0e9e76a
fix concat op
heheda12345 Aug 11, 2022
7ed651b
update op frontend
heheda12345 Aug 11, 2022
9f72880
conv to CNHW initial support
heheda12345 Aug 11, 2022
e75617e
add concat and reshape op
heheda12345 Aug 11, 2022
6a1e0c5
add if op to conv_layout_pass and fix related bugs
heheda12345 Aug 15, 2022
a0bc620
fix bug in share memory allocation of if op
heheda12345 Aug 15, 2022
01ac748
fuse small kernels (not finish yet)
heheda12345 Aug 18, 2022
d1f988b
fuse small kernels
heheda12345 Aug 18, 2022
2f189d0
reorder the kernels
heheda12345 Aug 23, 2022
ac90748
impl d2h
heheda12345 Aug 25, 2022
ac6e50f
fix bug in elementwise kernel
heheda12345 Aug 29, 2022
7a7d75e
main_test 100+100, print ref
heheda12345 Aug 29, 2022
e6bd194
fuse then else
heheda12345 Aug 30, 2022
6675125
move subtract out of if
heheda12345 Aug 31, 2022
bb1210e
loop in c
heheda12345 Sep 5, 2022
b5c0a4e
fix small bugs for skipnet
heheda12345 Sep 7, 2022
f658b61
CPU-GPU hybrid: assign stage
heheda12345 Sep 14, 2022
da1aaeb
CPU-GPU hybrid: add d2h and h2d gnode
heheda12345 Sep 14, 2022
41609fc
CPU-GPU hybrid: duplicate memory pool
heheda12345 Sep 14, 2022
92445f4
CPU-GPU hybrid: call by tensor with _cpu
heheda12345 Sep 14, 2022
645e78f
CPU-GPU hybrid: forward stage info in element-wise fusion pass
heheda12345 Sep 15, 2022
ad057ed
remove debug code
heheda12345 Sep 15, 2022
78614ca
CPU-GPU hybrid: copy cpu emitter from cuda emitter
heheda12345 Sep 15, 2022
84dbfa2
CPU-GPU hybrid: for (int tid=0; tid <...)
heheda12345 Sep 15, 2022
56bfcbf
CPU-GPU hybrid: put result op on CPU
heheda12345 Sep 15, 2022
35adaba
CPU-GPU hybrid: bmm & conv codegen, can run
heheda12345 Sep 16, 2022
9d2b6f1
CPU-GPU hybrid: avoid run to_cpu_pass in inner graph
heheda12345 Sep 16, 2022
e914e1c
add cpu op
heheda12345 Sep 27, 2022
ccd9270
fix bugs in recursion
heheda12345 Sep 27, 2022
ef8d23b
inline recursion call
heheda12345 Sep 27, 2022
0d9e94c
recursive with stack
heheda12345 Sep 27, 2022
9d270ec
add be_state_buffer and state_base
heheda12345 Sep 28, 2022
fe943c7
add be_state_buffer and state_base to more place
heheda12345 Sep 28, 2022
9cac765
fast barrier codegen
heheda12345 Sep 28, 2022
294fa47
check control edge in operator << (gnode)
heheda12345 Oct 4, 2022
e358bc0
optimize elementwise perf
heheda12345 Oct 5, 2022
f9e1e80
add pipeline fail to compile commands
heheda12345 Oct 5, 2022
b7e558d
eliminate copy back of cond
heheda12345 Oct 6, 2022
88ee7bc
add bool to dtypes
heheda12345 Oct 7, 2022
4c0f3d2
add translate_v2 for identity op
heheda12345 Oct 9, 2022
68e3643
avoid inplace opt when gnode i/o contains result
heheda12345 Oct 9, 2022
b4b4e1a
fix bug in conv layout pass
heheda12345 Oct 9, 2022
dc2320b
cast_pytorch_tensor: use data_ptr instead of storage.data_ptr
heheda12345 Oct 9, 2022
99d2529
add fused_max_grid to loop
heheda12345 Oct 10, 2022
452d049
add more cpu op
heheda12345 Oct 11, 2022
c50e66c
add naive impls for breakdown exp
heheda12345 Oct 11, 2022
47f392e
skip scalar op: reshape
heheda12345 Oct 11, 2022
e798c76
is_outmost_graph for blockfusion
heheda12345 Oct 18, 2022
deccf1c
hacked parallel recursion: assume all calls can be executed in parallel
heheda12345 Oct 19, 2022
95ee0d4
tune recursion
heheda12345 Oct 20, 2022
3691741
add reduce-memcpy blockop
heheda12345 Nov 11, 2022
001c376
argmax kernel (not tested)
heheda12345 Nov 11, 2022
5e213d0
support while op, can compile but loop cannot stop
heheda12345 Nov 11, 2022
e343fe7
alloc cond tensor, fix bug in parameter mapping, can run bs=1
heheda12345 Nov 11, 2022
8668be8
hardcode num_local_thread_sync in reduce.hpp because emit_function_bo…
heheda12345 Nov 12, 2022
62fcf06
while in c
heheda12345 Nov 12, 2022
4b7373d
fast barrier for single block
heheda12345 Nov 12, 2022
18e60a3
fix bug in fast barrier of single block
heheda12345 Nov 12, 2022
3dfae17
fix bug in argmax and element_fused, pass while_op
heheda12345 Nov 14, 2022
823fc2a
enable & disable result d2d inplace
heheda12345 Nov 17, 2022
9215c13
support different schedule of if inside while op
heheda12345 Nov 22, 2022
865c613
extend elementwise to support simple broadcast
heheda12345 Nov 28, 2022
08f8504
extend scatternd to support index array
heheda12345 Nov 28, 2022
80cc929
reshape memcpy block kernel
heheda12345 Nov 28, 2022
652677f
softmax block kernel
heheda12345 Nov 28, 2022
117cb7f
batchmatmul with broadcast
heheda12345 Nov 28, 2022
e4acd37
small fix
heheda12345 Nov 28, 2022
27bdc7d
manually set max_block_dim for bcast and elementwise
heheda12345 Nov 28, 2022
428edb6
sync for rocm
heheda12345 Dec 2, 2022
f235896
merge rocm code
heheda12345 Mar 27, 2023
ecf47cc
add IfSingle operator
heheda12345 Mar 28, 2023
547136d
reorganize parameters
heheda12345 Mar 30, 2023
c8cce31
remove cudadevicereset
heheda12345 Apr 3, 2023
60a287a
fix depunit bug in loop
heheda12345 Apr 3, 2023
f5e6739
dump kerneldb requests
heheda12345 Apr 3, 2023
65c6f4f
search unroll width
heheda12345 Apr 3, 2023
f0c358d
fix blockfusion sync problem
heheda12345 Apr 10, 2023
f138341
wrap python part with ifdef
heheda12345 Apr 14, 2023
845c52c
add __syncthreads() to cf kernels
heheda12345 Apr 17, 2023
ac6bd1f
copy file from ControlFlow repo
heheda12345 Apr 18, 2023
77fbf5b
change path
heheda12345 Apr 18, 2023
8266a71
fix bug in scripts
heheda12345 Apr 18, 2023
f2d7df3
add more guides
heheda12345 Apr 19, 2023
246385a
add more guides
heheda12345 Apr 19, 2023
e4e02fe
remove cudnn
heheda12345 Apr 19, 2023
16713b3
install_grinder script
heheda12345 Apr 19, 2023
9e3c80f
remove cudnn in manual
heheda12345 Apr 19, 2023
ec00f1e
autotvm kernel
heheda12345 Apr 20, 2023
2ba5cf4
remove training code
heheda12345 Apr 20, 2023
e568ef6
change permission
heheda12345 Apr 20, 2023
a6514e7
update kernels in manual impls
heheda12345 Apr 20, 2023
afdae02
add rocm kerneldb script
heheda12345 Apr 21, 2023
06a0ff2
copy roller rocm code
heheda12345 Apr 21, 2023
186fe5b
first try of rocm kerneldb
heheda12345 Apr 22, 2023
48d0651
rocm reproduced
heheda12345 Apr 22, 2023
767c6f8
remove grinder from filename
heheda12345 Apr 22, 2023
2632ad8
kerneldb scripts
heheda12345 Apr 23, 2023
dd1796a
finish rocm?
heheda12345 Apr 23, 2023
ec0594f
remove name 'grinder' from scripts
heheda12345 Apr 23, 2023
063996b
update gitignore
heheda12345 Apr 23, 2023
b1a889b
small fix
heheda12345 Apr 23, 2023
071ded7
rename project and remove some script
heheda12345 Apr 24, 2023
7d0c87a
update links
heheda12345 Apr 24, 2023
rename project and remove some script
heheda12345 committed Apr 24, 2023

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit 071ded7b35cabcce68b160fcbae0adebcabcb015
2 changes: 1 addition & 1 deletion artifacts/Figure19/README.md
@@ -8,7 +8,7 @@ logout
ssh root@impreza0 -p 31703
cd Figure19 && ./run_jax.sh # about 10 min
logout
# in grinder-ae docker
# in cocktailer-ae docker
ssh root@impreza0 -p 31705
cd Figure19 && ./run_in_sys_docker.sh # about 1 hour
```
2 changes: 1 addition & 1 deletion artifacts/INSTALL.md
@@ -59,7 +59,7 @@ cd .. # to $YOUR_DIR_FOR_NNFUSION/nnfusion
mkdir build && cd build && cmake .. && make -j
```

## Pytorch & Grinder
## Pytorch & Cocktailer
```bash
conda create python=3.7 --name controlflow -y
conda activate controlflow
34 changes: 13 additions & 21 deletions artifacts/README.md
@@ -1,14 +1,14 @@
# OSDI'23 Grinder Artifacts Evaluation
# OSDI'23 Cocktailer Artifacts Evaluation

## 0. Overview
This code branch is used for OSDI'23 Artifact Evaluation of paper #628, titled "Grinder: Analysis and Optimization for Dynamic Control Flow in Deep Learning".
This code branch is used for OSDI'23 Artifact Evaluation of paper #628, titled "Cocktailer: Analysis and Optimization for Dynamic Control Flow in Deep Learning".

### Evaluation Setup
* Artifacts Available:
* All Grinder-related code is available under the NNFusion open-source project located at: [https://github.com/microsoft/nnfusion/tree/TODO](https://github.com/microsoft/nnfusion/tree/TODO)
* All Cocktailer-related code is available under the NNFusion open-source project located at: [https://github.com/microsoft/nnfusion/tree/TODO](https://github.com/microsoft/nnfusion/tree/TODO)
* Artifacts Functional:
* *Documentation*: the following documents include detailed guidelines on how to build, install, and test Grinder, and on the experiments comparing it with other baselines.
* *Completeness*: the [C++ part](..) of Grinder has been merged into NNFusion in this branch, and the [Python part](ast_analyzer) is available in this artifact.
* *Documentation*: the following documents include detailed guidelines on how to build, install, and test Cocktailer, and on the experiments comparing it with other baselines.
* *Completeness*: the [C++ part](..) of Cocktailer has been merged into NNFusion in this branch, and the [Python part](ast_analyzer) is available in this artifact.
* *Exercisability*: under the *artifacts* folder, we prepare all the scripts and data needed to reproduce the experiments, in individual folders named after the corresponding figures in the paper.
* Results Reproduced:
* To reproduce the main results presented in our paper, we provide Docker images containing all the environments and baseline software, as well as machines with the same configurations as those used in the paper's evaluation. We also provide a detailed guideline to help reproduce the results step by step.
@@ -19,28 +19,20 @@ This code branch is used for OSDI'23 Artifact Evaluation of paper #628, titled "
Please follow the instructions in "Comments for AEC" on HotCRP and skip this section if you want to use the provided environment. The following steps need docker permission which is not provided due to security concerns.

## NVIDIA GPU
Please follow the instructions in [INSTALL.md](INSTALL.md) or use the following docker-based script to build and install Grinder.
Please follow the instructions in [INSTALL.md](INSTALL.md) or use the following docker-based script to build and install Cocktailer.
```bash
cd $YOUR_DIR_FOR_NNFUSION
git clone https://github.com/microsoft/nnfusion.git --branch TODO --single-branch
cd nnfusion/artifacts
docker build -t grinder -f env/Dockerfile.nv .
docker build -t cocktailer -f env/Dockerfile.nv .
chmod 777 $YOUR_DIR_FOR_NNFUSION/nnfusion
docker run -it --gpus all --name grinder-ae -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion --shm-size="32g" -w /root/nnfusion/artifacts grinder:latest /bin/bash
docker run -it --gpus all --name cocktailer-ae -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion --shm-size="32g" -w /root/nnfusion/artifacts cocktailer:latest /bin/bash
# run inside docker
bash ./env/install_in_docker.sh
```

adapted (TODO: remove)
```bash
docker build --network=host -t grinder -f env/Dockerfile.nv .
docker run -it --gpus all --name heheda-grinder-ae -v /home/heheda/control_flow/nnfusion-docker:/root/nnfusion -v /home/heheda/control_flow/kernel_db.docker:/root/.cache/nnfusion -w /root/nnfusion/artifacts --privileged=true --shm-size="32g" --network=host grinder:latest /bin/bash
srun -p AE -w nico1 --pty --exclusive docker exec -it heheda-grinder-ae bash ./run_nv_gpu.sh
permission: chmod 777 the two folders, config not to /dev/shm
```

## AMD GPU
Please prepare four dockers for running JAX, TensorFlow, TVM, PyTorch \& Grinder respectively.
Please prepare four dockers for running JAX, TensorFlow, TVM, PyTorch \& Cocktailer respectively.
* download code
```bash
cd $YOUR_DIR_FOR_NNFUSION
@@ -69,11 +61,11 @@ Please prepare four dockers for running JAX, TensorFlow, TVM, PyTorch \& Grinder
docker build -t tvm_rocm_cuda:latest -f env/Dockerfile.tvm.rocm --network=host .
docker run -it --device=/dev/kfd --device=/dev/dri --name tvm-ae -v $YOUR_DIR_FOR_NNFUSION/kernel_db:/root/.cache/nnfusion -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion -w /root/nnfusion/artifacts -e ARTIFACT_ROOT=/root/nnfusion/artifacts tvm_rocm_cuda /bin/bash
```
* Build and run grinder docker
* Build and run cocktailer docker
```bash
cd $YOUR_DIR_FOR_NNFUSION/nnfusion/artifacts
docker build -t grinder:latest -f env/Dockerfile.rocm --network=host .
docker run -it --device=/dev/kfd --device=/dev/dri --name grinder-ae -v $YOUR_DIR_FOR_NNFUSION/kernel_db:/root/.cache/nnfusion -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion -w /root/nnfusion/artifacts -e ARTIFACT_ROOT=/root/nnfusion/artifacts grinder /bin/bash
docker build -t cocktailer:latest -f env/Dockerfile.rocm --network=host .
docker run -it --device=/dev/kfd --device=/dev/dri --name cocktailer-ae -v $YOUR_DIR_FOR_NNFUSION/kernel_db:/root/.cache/nnfusion -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion -w /root/nnfusion/artifacts -e ARTIFACT_ROOT=/root/nnfusion/artifacts cocktailer /bin/bash
# run inside docker
bash ./env/install_in_rocm_docker.sh
```
@@ -99,7 +91,7 @@ Please prepare four dockers for running JAX, TensorFlow, TVM, PyTorch \& Grinder
│ │ └── tatoeba-eng-fra
```

* Generate all kernels for Grinder. More details can be found in [README_KERNEL_DB.md](kernel_db/README_KERNEL_DB.md).
* Generate all kernels for Cocktailer. More details can be found in [README_KERNEL_DB.md](kernel_db/README_KERNEL_DB.md).
**NOTE**: this process will take about 20 minutes for each architecture if using the tuning results included in the artifact, or longer if you want to re-tune the kernels.
* NVIDIA GPU
```bash
22 changes: 11 additions & 11 deletions artifacts/get_started_tutorial/README_GET_STARTED.md
@@ -1,7 +1,7 @@
# Get Started Tutorial: Compile a NASRNN model with Grinder
We assume you have already built and installed Grinder following the *Environment Preparation* section in [README.md](../README.md).
# Get Started Tutorial: Compile a NASRNN model with Cocktailer
We assume you have already built and installed Cocktailer following the *Environment Preparation* section in [README.md](../README.md).

The goal of this tutorial is to demonstrate how to compile and optimize a typical DNN model with control flow, and to showcase the performance improvement of the Grinder compiler.
The goal of this tutorial is to demonstrate how to compile and optimize a typical DNN model with control flow, and to showcase the performance improvement of the Cocktailer compiler.

## Run PyTorch, TensorFlow, and JAX baselines

@@ -38,11 +38,11 @@ Summary: [min, max, mean] = [297.036409, 335.924387, 323.820636] ms
Summary: [min, max, mean] = [43.358564, 43.553829, 43.469448] ms
```

## Run GrinderBase and Grinder
## Run CocktailerBase and Cocktailer

## Prepare kernel database

Grinder needs the source code of dataflow operators to generate the optimized code. The source code of operators in NASRNN includes BatchMatMul generated by Roller, and built-in element-wise operators in [NNFusion](https://github.com/microsoft/nnfusion/tree/main/src/nnfusion/core/kernels/cuda_gpu/kernels). (TODO: check branch) Below is the script to generate and save the BatchMatmul kernel.
Cocktailer needs the source code of dataflow operators to generate the optimized code. The source code of operators in NASRNN includes BatchMatMul generated by Roller, and built-in element-wise operators in [NNFusion](https://github.com/microsoft/nnfusion/tree/main/src/nnfusion/core/kernels/cuda_gpu/kernels). (TODO: check branch) Below is the script to generate and save the BatchMatmul kernel.

```bash
export ARTIFACT_ROOT=TODO
@@ -57,7 +57,7 @@ cd $ARTIFACT_ROOT/get_started_tutorial

After that, you can get a kernel database file in `~/.cache/nnfusion/kernel_cache.db`. NNFusion will automatically detect this path and import these kernels.
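
As a quick sanity check after this step, the sketch below can confirm the database was produced. It assumes the cache file is a SQLite database (which is how NNFusion stores its kernel cache); no particular table layout is assumed, only the path mentioned above.

```python
import os
import sqlite3

# Default NNFusion kernel cache location mentioned above.
db_path = os.path.expanduser("~/.cache/nnfusion/kernel_cache.db")
assert os.path.exists(db_path), f"kernel database not found at {db_path}"

# Assumption: the cache is a SQLite file; list its tables and row counts
# without relying on any specific schema.
with sqlite3.connect(db_path) as conn:
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(f"{table}: {count} rows")
```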

## Run GrinderBase
## Run CocktailerBase
```bash
export ARTIFACT_ROOT=TODO
cd $ARTIFACT_ROOT/get_started_tutorial
@@ -85,9 +85,9 @@ tensor equals!
100 iters, min = 65.4466 ms, max = 66.7112 ms, avg = 65.8735 ms
```

The `forward` function in the output is the Python code executed during time measurement; it accelerates the basic blocks of the model (located at `/dev/shm/$USER/controlflow/base_nasrnn_bs64_0/forward` and `/dev/shm/$USER/controlflow/base_nasrnn_bs64_2/forward`) and relies on PyTorch for executing the control flow. The `tensor equals!` message indicates that the output of GrinderBase matches that of PyTorch.
The `forward` function in the output is the Python code executed during time measurement; it accelerates the basic blocks of the model (located at `/dev/shm/$USER/controlflow/base_nasrnn_bs64_0/forward` and `/dev/shm/$USER/controlflow/base_nasrnn_bs64_2/forward`) and relies on PyTorch for executing the control flow. The `tensor equals!` message indicates that the output of CocktailerBase matches that of PyTorch.
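
For intuition only, the following schematic Python sketch illustrates this split; `compiled_cell` is a hypothetical stand-in for an accelerated basic block, not a function from the artifact, and the loop stands for the control flow that stays in PyTorch.

```python
import torch

def forward(xs, state, compiled_cell):
    # The loop is ordinary PyTorch control flow (as in the CocktailerBase setup);
    # only the loop body (the basic block) is an accelerated kernel.
    for x in xs:
        state = compiled_cell(x, state)
    return state

# Usage sketch with a plain PyTorch function standing in for the compiled kernel.
xs = [torch.randn(64, 256) for _ in range(10)]
state = torch.zeros(64, 256)
out = forward(xs, state, lambda x, s: torch.tanh(x + s))
```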

## Run Grinder
## Run Cocktailer
```bash
export ARTIFACT_ROOT=TODO
cd $ARTIFACT_ROOT/get_started_tutorial
@@ -110,11 +110,11 @@ tensor equals!
100 iters, min = 25.3108 ms, max = 25.7773 ms, avg = 25.3788 ms
```

The generated code of Grinder is located at `/dev/shm/$USER/controlflow/nasrnn_bs64_0/forward`. The `Best flag` line indicates the scheduling result of Grinder. The `tensor equals!` message indicates that the output of Grinder matches that of PyTorch.
The generated code of Cocktailer is located at `/dev/shm/$USER/controlflow/nasrnn_bs64_0/forward`. The `Best flag` line indicates the scheduling result of Cocktailer. The `tensor equals!` message indicates that the output of Cocktailer matches that of PyTorch.

## Summary
The following table summarizes the above experiments. Grinder achieves a $1.71\times$ speedup over the fastest baseline (JAX).
The following table summarizes the above experiments. Cocktailer achieves a $1.71\times$ speedup over the fastest baseline (JAX).

| | TorchScript | TensorFlow | JAX | GrinderBase | Grinder |
| | TorchScript | TensorFlow | JAX | CocktailerBase | Cocktailer |
|:-----------:|:------:|:--:|:----:|:--:|:---:|
| **Time (ms)** | 108.87 | 323.82 | 43.47 | 65.87 | 25.38 |
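
As a quick arithmetic check, the speedups implied by the table can be recomputed from the mean times above:

```python
# Mean latencies in ms, copied from the table above.
times = {
    "TorchScript": 108.87,
    "TensorFlow": 323.82,
    "JAX": 43.47,
    "CocktailerBase": 65.87,
    "Cocktailer": 25.38,
}

for name, t in times.items():
    if name != "Cocktailer":
        print(f"speedup over {name}: {t / times['Cocktailer']:.2f}x")
# JAX is the fastest baseline: 43.47 / 25.38 ≈ 1.71x, matching the text.
```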
2 changes: 1 addition & 1 deletion artifacts/kernel_db/README_KERNEL_DB.md
@@ -1,4 +1,4 @@
# Kernel DB for GrinderBase and Grinder
# Kernel DB for CocktailerBase and Cocktailer

The `reproduce_kernel_db.sh` scripts will leverage AutoTVM, Ansor, Roller, and manual implementations to generate kernels. The resulting kernels will be injected into a kernel database, located at *~/.cache/nnfusion/kernel_cache.db*, which is finally loaded by NNFusion.

2 changes: 1 addition & 1 deletion artifacts/plot/common.py
@@ -1,7 +1,7 @@
import numpy as np
import re

sys_name = "Grinder"
sys_name = "Cocktailer"

line_markers = [
'x',
2 changes: 1 addition & 1 deletion artifacts/plot/figure14.py
@@ -7,7 +7,7 @@

figure_id = 14

sys = ['TorchScript', 'TensorFlow', 'JAX+JIT', 'GrinderBase', sys_name]
sys = ['TorchScript', 'TensorFlow', 'JAX+JIT', 'CocktailerBase', sys_name]

hatch_def = [
'..',
2 changes: 1 addition & 1 deletion artifacts/plot/figure19.py
@@ -7,7 +7,7 @@

figure_id = 19

sys = ['TorchScript', 'TensorFlow', 'JAX+JIT', 'GrinderBase', sys_name]
sys = ['TorchScript', 'TensorFlow', 'JAX+JIT', 'CocktailerBase', sys_name]

hatch_def = [
'..',
4 changes: 2 additions & 2 deletions artifacts/plot/figure20.py
@@ -22,7 +22,7 @@
'',
]

sys_general = ['GrinderBase', 'schedule', 'optimize & schedule']
sys_general = ['CocktailerBase', 'schedule', 'optimize & schedule']

def get_log_from(filename: str):
result_dir = f'../reproduce_results/Figure{figure_id}'
@@ -50,7 +50,7 @@ def get_log_from(filename: str):
blockdrop = get_log_from('blockdrop.b1.log')
skipnet = get_log_from('skipnet.b1.log')

sys_recursive = ['GrinderBase', 'serial schedule', 'stack in global memory', 'stack in shared memory', 'parallel schedule']
sys_recursive = ['CocktailerBase', 'serial schedule', 'stack in global memory', 'stack in shared memory', 'parallel schedule']
rae = [
parse_time(f'../reproduce_results/Figure{figure_id}/base/rae.b1.log'),
parse_time(f'../reproduce_results/Figure{figure_id}/schedule/rae.opt1.b1.log'),