Cocktailer Artifact #518

Merged · 199 commits · Apr 24, 2023

Changes from 1 commit

Commits (199)
a488ac2
some util functions for onnx frontend
xysmlx Jul 15, 2021
ae261c7
more datatype support in constant op
xysmlx Jul 15, 2021
73d3e6f
draft: onnx frontend support for if and loop op
xysmlx Jul 15, 2021
6cff7e7
draft: if and loop op_define
xysmlx Jul 15, 2021
ed0221a
refactor GraphConvert of the ONNX frontend, support if and loop convert
xysmlx Jul 15, 2021
1a30651
draft: cuda kernel emitter for if and loop op (placeholder)
xysmlx Jul 15, 2021
436d7d3
update onnx frontend convert and shape inference for loop op
xysmlx Jul 15, 2021
a2863d7
fix output size bug in ONNX Loop op convert
xysmlx Jul 20, 2021
b7b66b2
Generic_op_define and ONNX opset_11 frontend converter for ScatterND op
xysmlx Jul 20, 2021
ffae3f7
Comment m_expression construction of generic_op to bypass translate f…
xysmlx Jul 20, 2021
f771004
Merge branch 'master' into control_flow
xysmlx Aug 10, 2021
811a126
Merge master branch into control_flow branch
xysmlx Nov 17, 2021
a77ee7b
disable ORT optimizations
xysmlx Nov 18, 2021
d42be30
fix bug for disabling ORT optimizations
xysmlx Nov 18, 2021
0a29cc1
fix bug for disabling ORT optimizations
xysmlx Nov 18, 2021
2f29e0d
temp
nox-410 Nov 18, 2021
b0d4c56
Support -ftuning_list in kernel tuning pass
xysmlx Nov 28, 2021
faedc9b
Merge remote-tracking branch 'origin/control_flow' into control_flow_1
nox-410 Nov 28, 2021
6812438
Implement If and Loop code gen
nox-410 Dec 2, 2021
324f2cf
add mod operator
xysmlx Dec 2, 2021
e45ce1e
Add recursion
nox-410 Dec 2, 2021
5a2955a
Merge remote-tracking branch 'origin/control_flow' into control_flow_1
nox-410 Dec 2, 2021
df4ea2e
add -fcodegen_pybind
xiayuqing0622 Dec 3, 2021
2d2a9c5
fix bug
xiayuqing0622 Dec 3, 2021
35212e6
Fix code for cudaEmitter
nox-410 Dec 4, 2021
ef037e1
Fix recursion kernel name
nox-410 Dec 4, 2021
ee06cb9
Remove unused param in control flow
nox-410 Dec 4, 2021
a72c89f
Create base class for controlflow emitter
nox-410 Dec 6, 2021
d67fa83
Recursion Op workspace allocation
nox-410 Dec 6, 2021
14536eb
python patch
heheda12345 Dec 7, 2021
51f5de2
add kernel_entry
heheda12345 Dec 7, 2021
3ff6e40
remove half
heheda12345 Dec 7, 2021
5eb9ef4
allocate tensor in c
heheda12345 Dec 7, 2021
3b62fc8
torch::tensor for one output
heheda12345 Dec 7, 2021
912c126
tmp fix
xiayuqing0622 Dec 7, 2021
beafc0b
fix
nox-410 Dec 7, 2021
dbde8dd
list of tensor
heheda12345 Dec 7, 2021
8016924
Merge branch 'master' of github.com:heheda12345/nnfusion
Dec 7, 2021
7b1bdb0
fix bias dim
heheda12345 Dec 7, 2021
7648008
apply shared memory
nox-410 Dec 8, 2021
0023532
stderr
heheda12345 Dec 8, 2021
b6e052a
pybind int64
heheda12345 Dec 8, 2021
15bdba5
Fix recursion
nox-410 Dec 9, 2021
07a7f59
some parameter changes
nox-410 Dec 9, 2021
80726df
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 9, 2021
1c6f23a
bypass reshape and broadcast
nox-410 Dec 9, 2021
e3cfae0
fix duplicate node
nox-410 Dec 9, 2021
6176f8a
bugfix
nox-410 Dec 10, 2021
3c22fce
Bypass GatherV2 & merge code
nox-410 Dec 10, 2021
5dd292e
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 10, 2021
e9e9bd2
Adjust parameter
nox-410 Dec 10, 2021
7a58506
Merge remote-tracking branch 'zc/master' into control_flow_finetune
nox-410 Dec 11, 2021
b694f34
Merge branch 'control_flow_2' of https://github.com/nox-410/nnfusion
heheda12345 Dec 11, 2021
a23d8aa
fix a bug related with gatherV2
nox-410 Dec 11, 2021
38cfaaa
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 11, 2021
bec08a9
fix reshape & broadcast bypass
nox-410 Dec 12, 2021
5071d8d
fix reshape & broadcast bypass
nox-410 Dec 12, 2021
8c57871
add threadfence
nox-410 Dec 13, 2021
bb4e4a0
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 13, 2021
0598667
Support multiple outputs
nox-410 Dec 23, 2021
4888a31
Merge branch 'control_flow_2' into control_flow_finetune
nox-410 Dec 23, 2021
8d992b2
Add loop initialization
nox-410 Dec 23, 2021
bb54c8a
Fix extern result memory for Loop
nox-410 Dec 23, 2021
6b57878
Support broadcast Matmul
nox-410 Dec 25, 2021
287b0be
Adjust parameters
nox-410 Dec 25, 2021
6242182
Allow scalar float in torch codegen
nox-410 Dec 25, 2021
6c3a48d
Skip inplace analysis for subgraphs
nox-410 Dec 25, 2021
2016e0d
Fix Reshape error in Control flow
nox-410 Dec 27, 2021
2e43315
Use injected SumOp
nox-410 Dec 27, 2021
3233154
update Dot kernel cache
nox-410 Dec 28, 2021
74059ad
Fix controlflow inplace
nox-410 Dec 29, 2021
ebf0679
Set memory reuse to false
nox-410 Dec 29, 2021
28f3b07
Merge branch 'control_flow_finetune' of https://github.com/nox-410/nn…
heheda12345 Dec 29, 2021
11f9caf
grid.sync() & elementwise
heheda12345 Jan 1, 2022
5fd70b2
add scatternd op
heheda12345 Jan 1, 2022
cad0688
-fcheck_result
heheda12345 Jan 7, 2022
a87cfb5
add control edge to loop graph & fix kernel fusion
heheda12345 Jan 8, 2022
d68b2e7
concat with fewer resource
heheda12345 Jan 8, 2022
94e78d4
concat: no implace
heheda12345 Jan 9, 2022
4429553
elementwise: general blockdim
heheda12345 Jan 9, 2022
2d8229b
hack: add more dependency
heheda12345 Jan 9, 2022
98ea31b
forward control edge
heheda12345 Jan 9, 2022
8cf153c
remove useless barrier
heheda12345 Jan 9, 2022
7616e7c
add return
heheda12345 Jun 16, 2022
2d435fa
fix bug in conv-bn
heheda12345 Jul 22, 2022
903c774
add roller
heheda12345 Jul 25, 2022
fd0dc0d
debug tensor
heheda12345 Jul 25, 2022
847be21
add roller as submodule
heheda12345 Jul 25, 2022
db67a1a
update gitignore
heheda12345 Jul 25, 2022
55706ff
change weight of inner graph to Constant op
heheda12345 Jul 25, 2022
0b71dba
gnode cout
heheda12345 Jul 25, 2022
626c66e
identity op
heheda12345 Jul 25, 2022
e2b7995
blockCudaEmitter: emit parameters from function sig
heheda12345 Jul 25, 2022
ba24cbb
draft version of separate kernel launch, conflict with postprocessing…
heheda12345 Jul 26, 2022
ad3802d
fix typo
heheda12345 Jul 26, 2022
b3d0f8c
two branch call finish
heheda12345 Jul 27, 2022
0e9e76a
fix concat op
heheda12345 Aug 11, 2022
7ed651b
update op frontend
heheda12345 Aug 11, 2022
9f72880
conv to CNHW initial support
heheda12345 Aug 11, 2022
e75617e
add concat and reshape op
heheda12345 Aug 11, 2022
6a1e0c5
add if op to conv_layout_pass and fix related bugs
heheda12345 Aug 15, 2022
a0bc620
fix bug in share memory allocation of if op
heheda12345 Aug 15, 2022
01ac748
fuse small kernels (not finish yet)
heheda12345 Aug 18, 2022
d1f988b
fuse small kernels
heheda12345 Aug 18, 2022
2f189d0
reorder the kernels
heheda12345 Aug 23, 2022
ac90748
impl d2h
heheda12345 Aug 25, 2022
ac6e50f
fix bug in elementwise kernel
heheda12345 Aug 29, 2022
7a7d75e
main_test 100+100, print ref
heheda12345 Aug 29, 2022
e6bd194
fuse then else
heheda12345 Aug 30, 2022
6675125
move subtract out of if
heheda12345 Aug 31, 2022
bb1210e
loop in c
heheda12345 Sep 5, 2022
b5c0a4e
fix small bugs for skipnet
heheda12345 Sep 7, 2022
f658b61
CPU-GPU hybrid: assign stage
heheda12345 Sep 14, 2022
da1aaeb
CPU-GPU hybrid: add d2h and h2d gnode
heheda12345 Sep 14, 2022
41609fc
CPU-GPU hybrid: duplicate memory pool
heheda12345 Sep 14, 2022
92445f4
CPU-GPU hybrid: call by tensor with _cpu
heheda12345 Sep 14, 2022
645e78f
CPU-GPU hybrid: forward stage info in element-wise fusion pass
heheda12345 Sep 15, 2022
ad057ed
remove debug code
heheda12345 Sep 15, 2022
78614ca
CPU-GPU hybrid: copy cpu emitter from cuda emitter
heheda12345 Sep 15, 2022
84dbfa2
CPU-GPU hybrid: for (int tid=0; tid <...)
heheda12345 Sep 15, 2022
56bfcbf
CPU-GPU hybrid: put result op on CPU
heheda12345 Sep 15, 2022
35adaba
CPU-GPU hybrid: bmm & conv codegen, can run
heheda12345 Sep 16, 2022
9d2b6f1
CPU-GPU hybrid: avoid run to_cpu_pass in inner graph
heheda12345 Sep 16, 2022
e914e1c
add cpu op
heheda12345 Sep 27, 2022
ccd9270
fix bugs in recursion
heheda12345 Sep 27, 2022
ef8d23b
inline recursion call
heheda12345 Sep 27, 2022
0d9e94c
recursive with stack
heheda12345 Sep 27, 2022
9d270ec
add be_state_buffer and state_base
heheda12345 Sep 28, 2022
fe943c7
add be_state_buffer and state_base to more place
heheda12345 Sep 28, 2022
9cac765
fast barrier codegen
heheda12345 Sep 28, 2022
294fa47
check control edge in operator << (gnode)
heheda12345 Oct 4, 2022
e358bc0
optimize elementwise perf
heheda12345 Oct 5, 2022
f9e1e80
add pipeline fail to compile commands
heheda12345 Oct 5, 2022
b7e558d
eliminate copy back of cond
heheda12345 Oct 6, 2022
88ee7bc
add bool to dtypes
heheda12345 Oct 7, 2022
4c0f3d2
add translate_v2 for identity op
heheda12345 Oct 9, 2022
68e3643
avoid inplace opt when gnode i/o contains result
heheda12345 Oct 9, 2022
b4b4e1a
fix bug in conv layout pass
heheda12345 Oct 9, 2022
dc2320b
cast_pytorch_tensor: use data_ptr instead of storage.data_ptr
heheda12345 Oct 9, 2022
99d2529
add fused_max_grid to loop
heheda12345 Oct 10, 2022
452d049
add more cpu op
heheda12345 Oct 11, 2022
c50e66c
add naive impls for breakdown exp
heheda12345 Oct 11, 2022
47f392e
skip scalar op: reshape
heheda12345 Oct 11, 2022
e798c76
is_outmost_graph for blockfusion
heheda12345 Oct 18, 2022
deccf1c
hacked parallel recursion: assume all calls can be executed in parallel
heheda12345 Oct 19, 2022
95ee0d4
tune recursion
heheda12345 Oct 20, 2022
3691741
add reduce-memcpy blockop
heheda12345 Nov 11, 2022
001c376
argmax kernel (not tested)
heheda12345 Nov 11, 2022
5e213d0
support while op, can compile but loop cannot stop
heheda12345 Nov 11, 2022
e343fe7
alloc cond tensor, fix bug in parameter mapping, can run bs=1
heheda12345 Nov 11, 2022
8668be8
hardcode num_local_thread_sync in reduce.hpp because emit_function_bo…
heheda12345 Nov 12, 2022
62fcf06
while in c
heheda12345 Nov 12, 2022
4b7373d
fast barrier for single block
heheda12345 Nov 12, 2022
18e60a3
fix bug in fast barrier of single block
heheda12345 Nov 12, 2022
3dfae17
fix bug in argmax and element_fused, pass while_op
heheda12345 Nov 14, 2022
823fc2a
enable & disable result d2d inplace
heheda12345 Nov 17, 2022
9215c13
support different schedule of if inside while op
heheda12345 Nov 22, 2022
865c613
extend elementwise to support simple broadcast
heheda12345 Nov 28, 2022
08f8504
extend scatternd to support index array
heheda12345 Nov 28, 2022
80cc929
reshape memcpy block kernel
heheda12345 Nov 28, 2022
652677f
softmax block kernel
heheda12345 Nov 28, 2022
117cb7f
batchmatmul with broadcast
heheda12345 Nov 28, 2022
e4acd37
small fix
heheda12345 Nov 28, 2022
27bdc7d
manually set max_block_dim for bcast and elementwise
heheda12345 Nov 28, 2022
428edb6
sync for rocm
heheda12345 Dec 2, 2022
f235896
merge rocm code
heheda12345 Mar 27, 2023
ecf47cc
add IfSingle operator
heheda12345 Mar 28, 2023
547136d
reorganize parameters
heheda12345 Mar 30, 2023
c8cce31
remove cudadevicereset
heheda12345 Apr 3, 2023
60a287a
fix depunit bug in loop
heheda12345 Apr 3, 2023
f5e6739
dump kerneldb requests
heheda12345 Apr 3, 2023
65c6f4f
search unroll width
heheda12345 Apr 3, 2023
f0c358d
fix blockfusion sync problem
heheda12345 Apr 10, 2023
f138341
wrap python part with ifdef
heheda12345 Apr 14, 2023
845c52c
add __syncthreads() to cf kernels
heheda12345 Apr 17, 2023
ac6bd1f
copy file from ControlFlow repo
heheda12345 Apr 18, 2023
77fbf5b
change path
heheda12345 Apr 18, 2023
8266a71
fix bug in scripts
heheda12345 Apr 18, 2023
f2d7df3
add more guides
heheda12345 Apr 19, 2023
246385a
add more guides
heheda12345 Apr 19, 2023
e4e02fe
remove cudnn
heheda12345 Apr 19, 2023
16713b3
install_grinder script
heheda12345 Apr 19, 2023
9e3c80f
remove cudnn in manual
heheda12345 Apr 19, 2023
ec00f1e
autotvm kernel
heheda12345 Apr 20, 2023
2ba5cf4
remove training code
heheda12345 Apr 20, 2023
e568ef6
change permission
heheda12345 Apr 20, 2023
a6514e7
update kernels in manual impls
heheda12345 Apr 20, 2023
afdae02
add rocm kerneldb script
heheda12345 Apr 21, 2023
06a0ff2
copy roller rocm code
heheda12345 Apr 21, 2023
186fe5b
first try of rocm kerneldb
heheda12345 Apr 22, 2023
48d0651
rocm reproduced
heheda12345 Apr 22, 2023
767c6f8
remove grinder from filename
heheda12345 Apr 22, 2023
2632ad8
kerneldb scripts
heheda12345 Apr 23, 2023
dd1796a
finish rocm?
heheda12345 Apr 23, 2023
ec0594f
remove name 'grinder' from scripts
heheda12345 Apr 23, 2023
063996b
update gitignore
heheda12345 Apr 23, 2023
b1a889b
small fix
heheda12345 Apr 23, 2023
071ded7
rename project and remove some script
heheda12345 Apr 24, 2023
7d0c87a
update links
heheda12345 Apr 24, 2023
add more guides
heheda12345 committed Apr 19, 2023

commit f2d7df38f5c778319b975f369b74dfcdceeffc1d
4 changes: 3 additions & 1 deletion .gitignore
@@ -54,4 +54,6 @@ nnfusion_rt/
models/frozenmodels/

artifacts/data
artifacts/reproduce_results
artifacts/reproduce_results
*.onnx
*.tfgraph
93 changes: 93 additions & 0 deletions artifacts/INSTALL.md
@@ -0,0 +1,93 @@
# Installation Tutorial
This document describes how to install the software used in the artifact on a node with an NVIDIA GPU. All scripts are assumed to be run from the `nnfusion/artifacts` directory.

## Prerequisites
We assume that you have a node with an NVIDIA GPU and CUDA installed, and that conda and nvcc are available. If you have not installed conda, you can install it by following the instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) (Miniconda is sufficient, and this artifact assumes that Miniconda is installed at the default path `~/miniconda3`). If you have not installed nvcc, you can install it by following the instructions [here](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).

## TensorFlow
The onnx-tf package for TF 1.15 needs to be built from source because the pre-compiled version depends on TF 2. We also fixed some bugs in that commit to properly support control flow operations. The following commands prepare the conda env for TF 1.15.

```bash
conda create python=3.8 --name baseline_tf1 -y
conda activate baseline_tf1
pip install nvidia-pyindex
pip install -r env/requirements_tf.txt
mkdir -p third-party && cd third-party
git clone https://github.com/onnx/onnx-tensorflow.git
cd onnx-tensorflow
git checkout 0e4f4836 # v1.7.0-tf-1.15m
git apply ../../env/onnx_tf.patch
pip install -e .
conda deactivate
```
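To verify the environment, the imports below should succeed (a suggested sanity check):

```bash
conda activate baseline_tf1
python -c "import tensorflow as tf; print(tf.__version__)"  # expect 1.15.x
python -c "import onnx_tf"                                  # should import without error
conda deactivate
```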
## JAX
The following commands will prepare the conda env for JAX.
```bash
conda create python=3.8 --name baseline_jax -y
conda activate baseline_jax
pip install nvidia-pyindex
pip install -r env/requirements_jax.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html -f https://download.pytorch.org/whl/torch_stable.html
conda deactivate
```
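A suggested check that JAX was installed with GPU support:

```bash
conda activate baseline_jax
python -c "import jax; print(jax.devices())"  # should list GPU device(s)
conda deactivate
```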

## TVM
The following commands will prepare the conda env for TVM.
```bash
conda create python==3.8 --name kerneldb -y
conda activate kerneldb
pip install ply==3.11
mkdir -p third-party && cd third-party
git clone https://github.com/apache/tvm.git
cd tvm
git checkout 22ba6523c
git submodule init && git submodule update
git apply ../../env/tvm.patch
mkdir build
cd build
cp ../../../env/tvm.config.cmake config.cmake
cmake ..
make -j
cd ../python
pip install -e .
```
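A suggested check that the TVM Python package is importable from the `kerneldb` env:

```bash
conda activate kerneldb
python -c "import tvm; print(tvm.__version__)"
conda deactivate
```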

## NNFusion
The following commands will build nnfusion. Please use the [script](../maint/script/install_dependency.sh) (needs sudo) to prepare the environment for nnfusion before running the following commands.

```bash
cd .. # to $YOUR_DIR_FOR_NNFUSION/nnfusion
mkdir build && cd build && cmake .. && make -j
```
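If the build succeeds, the `nnfusion` command-line tool should be produced under the build tree. A smoke test (the path below assumes the usual CMake layout and may differ):

```bash
# run from the build directory; adjust the path to the binary if your layout differs
./src/tools/nnfusion/nnfusion --help
```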

## PyTorch & Grinder
```bash
conda create python=3.7 --name grinder -y
conda activate grinder
pip install nvidia-pyindex
pip install -r env/requirements_pytorch.txt -f https://download.pytorch.org/whl/torch_stable.html
pip install -e .
conda deactivate
```
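A suggested check that the `grinder` env has a CUDA-enabled PyTorch:

```bash
conda activate grinder
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
conda deactivate
```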

**Draft notes (TODO)**:
* TODO: get data
* TODO: prepare kerneldb
* Docker (ROCm): pass `--shm-size="32g"`; build with `docker build -t grinder:latest -f env/Dockerfile.rocm --network=host .`
* Run the experiments on the cluster: `srun --pty -w nico3 -p Long --exclusive ./run_nv_gpu.sh`
* Plot the results: `cd plot && ./plot_nv.sh && cd -`
136 changes: 59 additions & 77 deletions artifacts/README.md
@@ -1,88 +1,41 @@
# Installation of Evaluated Systems
assume running at artifacts directory
# OSDI'23 Grinder Artifacts Evaluation

## 0. Overview
This code branch is used for OSDI'23 Artifact Evaluation of paper #628, titled "Grinder: Analysis and Optimization for Dynamic Control Flow in Deep Learning".

## Pre-requisites
conda, nvcc ......
### Evaluation Setup
* Artifacts Available:
  * All Grinder-related code is available under the NNFusion open-source project, located at: [https://github.com/microsoft/nnfusion/tree/TODO](https://github.com/microsoft/nnfusion/tree/TODO)
* Artifacts Functional:
  * *Documentation*: the following documents include detailed guidelines on how to build, install, and test Grinder, and how to run the experiments that compare it with the other baselines.
  * *Completeness*: the [C++ part](..) of Grinder has been merged into NNFusion in this branch, and the [Python part](ast_analyzer) is available in this artifact.
  * *Exercisability*: under the *artifacts* folder, we prepare all the scripts and data needed to reproduce the experiments, in individual folders named after the corresponding figures in the paper.
* Results Reproduced:
  * To reproduce the main results presented in our paper, we provide Docker images containing all the environments and baseline software, and machines with the same configurations as those used in the paper's evaluation. We also provide detailed guidelines to help reproduce the results step by step.

## TensorFlow
install from env/requirements_tf.txt
Install onnx-tf from source (the pre-compiled version depends on TF2)
## 1. Environment Preparation

```bash
conda create python=3.8 --name baseline_tf1 -y
conda activate baseline_tf1
pip install nvidia-pyindex
pip install -r env/requirements_tf.txt
mkdir third-party && cd third-party
git clone https://github.com/onnx/onnx-tensorflow.git
cd onnx-tensorflow
git checkout 0e4f4836 # v1.7.0-tf-1.15m
git apply ../../env/onnx_tf.patch
pip install -e .
conda deactivate
```
## JAX
```bash
conda create python=3.8 --name baseline_jax -y
conda activate baseline_jax
pip install nvidia-pyindex
pip install -r env/requirements_jax.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html -f https://download.pytorch.org/whl/torch_stable.html
conda deactivate
```
**For AE Reviewers**:
1. The nico cluster we provide for artifact evaluation is managed by Slurm. To run GPU-related commands, please prefix the original command with `srun --pty --exclusive`, which submits the job to a compute node (nico[3-4]). For your convenience, we have included this prefix in our artifact but will remove it in the final version. If you are running the artifact on your own machine, please remember to remove the prefix.
2. Due to security concerns, we cannot grant Docker permissions to reviewers. Instead, for the NVIDIA GPU we provide an account with all the dependencies installed, and for the AMD GPU we provide SSH access into the Docker containers. You can skip this environment preparation section.

# TVM
```bash
conda create python==3.8 --name kerneldb -y
pip install ply==3.11
mkdir third-party && cd third-party
git clone https://github.com/apache/tvm.git --recursive
cd tvm
git checkout 22ba6523c
git apply ../../env/tvm.patch
mkdir build
cd build
cp ../../../env/tvm.config.cmake config.cmake
make -j
cd ../python
pip install -e .
## NVIDIA GPU
```

## NNFusion

## Pytorch & Grinder
```bash
conda create python=3.7 --name grinder -y
conda activate grinder
pip install nvidia-pyindex
pip install -r env/requirements_pytorch.txt -f https://download.pytorch.org/whl/torch_stable.html
conda deactivate
cd $YOUR_DIR_FOR_NNFUSION
git clone https://github.com/microsoft/nnfusion.git --branch TODO --single-branch
cd nnfusion/artifacts
docker build -t grinder -f env/Dockerfile.nv .
docker run -it --name grinder-ae --shm-size="32g" -v $YOUR_DIR_FOR_NNFUSION/nnfusion:/root/nnfusion grinder:latest /bin/bash
```

## Grinder (with code)
```bash
export ARTIFACT_ROOT=***/ControlFlow/artifacts TODO
cd $ARTIFACT_ROOT/..
pip install -e .
```
TODO install nnfusion
TODO prepare kerneldb
adapted (TODO: remove)
docker build --network=host -t grinder -f env/Dockerfile.nv .
docker run -it --name heheda-grinder-ae -v /home/heheda/control_flow/nnfusion-docker:/root/nnfusion --shm-size="32g" --network=host grinder:latest /bin/bash

TODO get data

docker: --shm-size="32g"
docker build -t grinder:latest -f env/Dockerfile.rocm --network=host .

cmake ..
```
cd $ARTIFACT_ROOT/../nnfusion
mkdir build && cd build
cmake .. && make -j
cd $ARTIFACT_ROOT/..
pip install -e .
TODO: config.py
```

# build jax docker
## AMD GPU
* build jax docker
```bash
mkdir third-party && cd third-party
git clone https://github.com/google/jax.git
@@ -91,8 +44,37 @@ git checkout 0282b4bfad
git apply ../../env/jax.rocm.patch
./build/rocm/ci_build.sh --keep_image bash -c "./build/rocm/build_rocm.sh"
```
TODO get data

## 2. Getting Started with a Simple Example

* Go to the *get_started_tutorial/* folder and follow [README_GET_STARTED.md](get_started_tutorial/README_GET_STARTED.md).


## 3. Kernel Generation
This step generates all kernels for Grinder. More details can be found in [README_KERNEL_DB.md](kernel_db/README_KERNEL_DB.md).
**NOTE**: this process will take about TODO hours.
```bash
# assume running at nnfusion/artifacts directory
cd kernel_db
srun --pty --exclusive ./reproduce_kernel_db.sh
```

## 4. Reproducing Individual Experiment Results
**NOTE**: we provide a script named `run_nv_gpu.sh` to run all experiments except Figure 19. You can use `./run_nv_gpu.sh` to run them. TODO: explain how to run Figure 19.

**For AE Reviewers**: Please use `srun --pty -w nico3 --exclusive ./run_nv_gpu.sh` to submit the jobs to the compute node of the provided cluster. TODO: is `-p Long` needed?

srun --pty -w nico3 -p Long --exclusive ./run_nv_gpu.sh
| Experiments | Figure # in Paper | Script Location |
| ----------- | ----------- | ----------- |
| #1. Control flow overhead in JAX | Figure 2 | N/A (use the results in Figure 15, 16, and 18) |
| #2. End-to-end DNN inference on NVIDIA V100 GPU | Figure 14 | [run.sh](Figure14/run.sh) |
| #3. Control flow overhead of models with loops | Figure 15 | [run.sh](Figure15/run.sh) |
| #4. Control flow overhead of models with branches | Figure 16 | [run.sh](Figure16/run.sh) |
| #5. Different ratio of executed layers | Figure 17 | [run.sh](Figure17/run.sh) |
| #6. Control flow overhead of RAE with recursion | Figure 18 | [run.sh](Figure18/run.sh) |
| #7. End-to-end DNN inference on ROCm MI100 GPU with BS=1 | Figure 19 | [run.sh](Figure19/run.sh) TODO |
| #8. Breakdown of models with BS=1 | Figure 20 | [run.sh](Figure20/run.sh) |

cd plot && ./plot_nv.sh && cd -
## 5. Reproduce the Figures in the paper
TODO (how to draw figure 19?)
2 changes: 1 addition & 1 deletion artifacts/ast_analyzer/utils/config.py
@@ -5,7 +5,7 @@
# config start
KERNELDB_REQUEST_FNAME="kerneldb_request.log"
TMP_DIR = f"/dev/shm/{getpass.getuser()}/grinder"
NNFUSION_ROOT = os.path.normpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '../..'))
NNFUSION_ROOT = os.path.normpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '../../..'))
KERNELDB_PATH = os.path.expanduser(f"/tmp/{getpass.getuser()}/kernel_cache.db")
NUM_GPU = 8
# config end
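For reference, the extra `..` accounts for `config.py` now living three levels below the repository root (`artifacts/ast_analyzer/utils/config.py`). A minimal illustration with a hypothetical checkout path:

```bash
# hypothetical checkout at /home/user/nnfusion
python -c "import os; print(os.path.normpath(os.path.join('/home/user/nnfusion/artifacts/ast_analyzer/utils', '../../..')))"
# prints /home/user/nnfusion (the repo root); the old '../..' stopped at .../nnfusion/artifacts
```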
2 changes: 1 addition & 1 deletion artifacts/env/Dockerfile.nv
@@ -12,4 +12,4 @@ RUN source /root/miniconda3/etc/profile.d/conda.sh && conda create python=3.8 --
RUN source /root/miniconda3/etc/profile.d/conda.sh && conda create python=3.8 --name baseline_jax -y && conda activate baseline_jax && pip install nvidia-pyindex && pip install -r env/requirements_jax.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html -f https://download.pytorch.org/whl/torch_stable.html && conda deactivate
RUN source /root/miniconda3/etc/profile.d/conda.sh && conda create python=3.7 --name grinder -y && conda activate grinder && pip install nvidia-pyindex && pip install -r env/requirements_pytorch.txt -f https://download.pytorch.org/whl/torch_stable.html && conda deactivate
RUN source /root/miniconda3/etc/profile.d/conda.sh && conda create python==3.8 --name kerneldb -y && pip install ply==3.11 && mkdir -p third-party && cd third-party && git clone https://github.com/apache/tvm.git && cd tvm && git checkout 22ba6523c && git submodule init && git submodule update && git apply ../../env/tvm.patch && mkdir build && cd build && cp ../cmake/config.cmake config.cmake && sed -i "s/USE_CUDA OFF/USE_CUDA ON/g" config.cmake && sed -i "s/USE_LLVM OFF/USE_LLVM ON/g" config.cmake && cmake .. && make -j && cd ../python && pip install -e .
RUN apt-get install -y libgflags-dev libsqlite3-dev libcurl4-openssl-dev curl libcurl4-openssl-dev
RUN cd env && bash install_nnfusion_dependency.sh && cd ..
8 changes: 8 additions & 0 deletions artifacts/env/install_grinder.sh
@@ -0,0 +1,8 @@
#!/bin/bash
cd nnfusion
mkdir build && cd build && cmake .. && make -j && cd -

cd artifacts
conda activate grinder
pip install -e .
conda deactivate
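A typical invocation, assuming the script is launched from the directory that contains the `nnfusion` checkout (illustrative; note that `conda activate` inside a script needs conda's shell hook, so one way is to source both):

```bash
cd $YOUR_DIR_FOR_NNFUSION                     # directory containing the nnfusion checkout
source ~/miniconda3/etc/profile.d/conda.sh    # make `conda activate` available
source nnfusion/artifacts/env/install_grinder.sh
```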
76 changes: 76 additions & 0 deletions artifacts/env/install_nnfusion_dependency.sh
@@ -0,0 +1,76 @@
#!/bin/bash -e

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

echo "Running NNFusion install_dependency.sh"
DEB_PACKAGES="build-essential cmake git curl zlib1g zlib1g-dev libtinfo-dev unzip \
autoconf automake libtool ca-certificates gdb sqlite3 libsqlite3-dev libcurl4-openssl-dev \
libprotobuf-dev protobuf-compiler libgflags-dev libgtest-dev"

ubuntu_codename=$(. /etc/os-release;echo $UBUNTU_CODENAME)

if [[ $ubuntu_codename != "focal" ]]; then
DEB_PACKAGES="${DEB_PACKAGES} clang-3.9 clang-format-3.9"
fi

if [[ "$(whoami)" != "root" ]]; then
SUDO=sudo
fi

if ! dpkg -L $DEB_PACKAGES >/dev/null 2>&1; then
#Thirdparty deb for ubuntu 18.04(bionic)
$SUDO sh -c "apt update && apt install -y --no-install-recommends software-properties-common apt-transport-https ca-certificates gnupg wget"
$SUDO sh -c "wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null"
$SUDO sh -c "apt-add-repository 'deb https://apt.kitware.com/ubuntu/ $ubuntu_codename main'"
$SUDO sh -c "apt update && apt install -y --no-install-recommends $DEB_PACKAGES"

if [[ $ubuntu_codename != "focal" ]]; then
# Install protobuf 3.6.1 from source
$SUDO sh -c "wget https://github.com/protocolbuffers/protobuf/releases/download/v3.6.1/protobuf-cpp-3.6.1.tar.gz -P /tmp"
$SUDO sh -c "cd /tmp && tar -xf /tmp/protobuf-cpp-3.6.1.tar.gz && rm /tmp/protobuf-cpp-3.6.1.tar.gz"
$SUDO sh -c "cd /tmp/protobuf-3.6.1/ && ./configure && make && make check && make install && ldconfig && rm -rf /tmp/protobuf-3.6.1/"
fi
fi

# if [[ $ubuntu_codename == "focal" ]]; then
# # Install clang-format-3.9
# $SUDO sh -c "cd /tmp && wget https://releases.llvm.org/3.9.0/clang+llvm-3.9.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz && tar -xf clang+llvm-3.9.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz"
# $SUDO sh -c "cp /tmp/clang+llvm-3.9.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang-format /usr/bin/clang-format-3.9 && ln -s /usr/bin/clang-format-3.9 /usr/bin/clang-format"
# $SUDO sh -c "rm -rf /tmp/clang+llvm-3.9.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang-format /tmp/clang+llvm-3.9.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz"
# fi

echo "- Dependencies are installed in system."

if [ ! -f "/usr/lib/libgtest.a" ]; then

# if Ubuntu 16.04, we have some dev node using ubuntu 16.04
if [[ $ubuntu_codename == "xenial" ]]; then
$SUDO sh -c "mkdir /usr/src/googletest && ln -s /usr/src/gtest /usr/src/googletest/googletest"
fi

# Compile gtest
$SUDO sh -c "cd /usr/src/googletest/googletest/ && mkdir -p build && cd build && cmake .. -DCMAKE_CXX_FLAGS=\"-std=c++11\" && make -j"

if [[ $ubuntu_codename == "focal" ]]; then
$SUDO sh -c "cp /usr/src/googletest/googletest/build/lib/libgtest*.a /usr/lib/"
else
$SUDO sh -c "cp /usr/src/googletest/googletest/build/libgtest*.a /usr/lib/"
fi

$SUDO sh -c "rm -rf /usr/src/googletest/googletest/build"
$SUDO sh -c "mkdir /usr/local/lib/googletest"
$SUDO sh -c "ln -s /usr/lib/libgtest.a /usr/local/lib/googletest/libgtest.a"
$SUDO sh -c "ln -s /usr/lib/libgtest_main.a /usr/local/lib/googletest/libgtest_main.a"
fi
echo "- libgtest is installed in system."

# Install numpy
$SUDO sh -c "apt install -y python3 python3-pip"
if [[ $ubuntu_codename == "xenial" ]]; then
$SUDO sh -c "pip3 install numpy==1.18.5"
else
$SUDO sh -c "pip3 install numpy"
fi

echo "- Done."
3 changes: 0 additions & 3 deletions artifacts/get_started_tutorial/README_GET_STARTED.md
@@ -3,9 +3,6 @@ We assume you already build and install Grinder following the *Environment Prepa

The goal of this tutorial is to demonstrate how to compile and optimize a typical DNN model with control flow, and to showcase the performance improvement delivered by the Grinder compiler.

**For AE Reviewers**: The nico cluster we provide for artifact evaluation is managed by slurm. To run GPU-related commands, please use `srun --pty --exclusive` before the original command, which will submit the job to the compute node (nico[3-4]). For your convenience, we have included this prefix in our artifact but will remove it in the final version. If you are running the artifact on your own machine, please remember to remove the prefix.


## Run PyTorch, TensorFlow, and JAX baselines

```bash
9 changes: 9 additions & 0 deletions artifacts/kernel_db/README_KERNEL_DB.md
@@ -0,0 +1,9 @@
# Kernel DB for GrinderBase and Grinder

The `reproduce_kernel_db.sh` script leverages AutoTVM, Ansor, Roller, and manual implementations to generate kernels. The resulting kernels will be injected into a kernel database, located at *~/.cache/nnfusion/kernel_cache.db*, which is finally loaded by NNFusion.

This folder contains the following contents:
* `*_kernels` folders: the tuning results from each source
* `db`: scripts for injecting kernels into the kernel database, adapted from [https://github.com/microsoft/nnfusion/tree/osdi20_artifact/artifacts/kernel_db/kernel_db_scripts](https://github.com/microsoft/nnfusion/tree/osdi20_artifact/artifacts/kernel_db/kernel_db_scripts)
* `roller`: the source code of Roller, adapted from [https://github.com/microsoft/nnfusion/tree/osdi22_artifact/artifacts/roller](https://github.com/microsoft/nnfusion/tree/osdi22_artifact/artifacts/roller)
* `test_config`: the TVM implementation of each operator, adapted from [https://github.com/microsoft/nnfusion/tree/osdi22_artifact/artifacts/roller/test_config](https://github.com/microsoft/nnfusion/tree/osdi22_artifact/artifacts/roller/test_config)
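Since the kernel database is a plain SQLite file, its contents can be inspected directly (illustrative; the table layout depends on NNFusion's kernel-cache schema):

```bash
sqlite3 ~/.cache/nnfusion/kernel_cache.db ".tables"   # list the tables in the cache
sqlite3 ~/.cache/nnfusion/kernel_cache.db ".schema"   # dump the schema of the injected entries
```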