CV Team 16 💡 비전길잡이 (Vision Guide) 💡
NAVER Connect Foundation boostcamp AI Tech 4th
๋ฏผ๊ธฐ | ๋ฐ๋ฏผ์ง | ์ ์์ค | ์ฅ์งํ | ์ต๋ํ |
---|---|---|---|---|
- SegFormer : making a Transformer-based Semantic Segmentation model lightweight for embedded and mobile devices
- Model driven approach : advanced training techniques such as hyperparameter tuning are excluded · performance is improved through modeling alone
- Compression methods such as pruning and quantization are excluded : the model is made lightweight through structural changes such as redesigning blocks and layers
Dataset | Tiny-ImageNet | ADE20K |
---|---|---|
Purpose | Pre-training | Fine-tuning |
Num_classes | 200 | 150 |
Training set | 100,000 images | 20,210 images |
Validation set | 10,000 images | 2,000 images |
|-- ADEChallengeData2016
|   |-- image
|   |   |-- train
|   |   `-- val
|   `-- mask
|       |-- train
|       `-- val
`-- tiny-imagenet-200
    |-- train
    |-- val
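Before launching training it can help to confirm that this layout is in place. The following is a minimal sketch, assuming both datasets sit under one root directory as shown in the tree; `EXPECTED_DIRS` and `missing_dirs` are illustrative names and are not part of this repository.

```python
from pathlib import Path

# Directories taken from the tree above (assumed to share a single root).
EXPECTED_DIRS = [
    "ADEChallengeData2016/image/train",
    "ADEChallengeData2016/image/val",
    "ADEChallengeData2016/mask/train",
    "ADEChallengeData2016/mask/val",
    "tiny-imagenet-200/train",
    "tiny-imagenet-200/val",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs("/workspace/dataset")` (the path is only an example) returns an empty list when every directory is present.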
Encoder | Decoder |
---|---|
Overlap Patch Embedding | MLP Layer (upsampling) |
SegFormer Block | Concat |
Efficient Self-Attention | Linear-Fuse |
Mix-FFN | Classifier |
Encoder | Decoder |
---|---|
Pooling Patch Embedding | MLP Layer (upsampling) |
PoolFormer Block | Weighted Sum |
SegFormerV2 Block | Classifier |
Custom Efficient Self-Attention | - |
Mix-CFN | - |
- Compared the performance of SegFormer-B2 and the custom model, and measured Params and FLOPs (util/get_flops_params.py)
- Token Mixer : features are extracted with Pooling instead of MHSA
$\hat {F_0}=\mathrm {LayerScale}(\mathrm {Pooling}(F_{in}))+F_{in}$ $\hat {F_1}=\mathrm {LayerScale}(\mathrm {MixCFN}(\hat {F_0}))+\hat {F_0}$
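The first equation can be illustrated with a small NumPy sketch. It assumes a 3x3 average pool with stride 1 and edge-replicating padding, and a LayerScale that is one learnable scalar per channel; none of these details are stated in this README, and the function names are hypothetical.

```python
import numpy as np


def avg_pool_3x3(x):
    """3x3 average pooling, stride 1, edge padding (assumed). x: (C, H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros(x.shape, dtype=float)
    # Sum the nine shifted views of the padded map, then average.
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0


def pooling_token_mixer(f_in, layer_scale):
    """F0 = LayerScale(Pooling(F_in)) + F_in, with a per-channel scale."""
    return layer_scale[:, None, None] * avg_pool_3x3(f_in) + f_in
```

Unlike MHSA, this mixer has no attention matrix, so its cost grows linearly with the number of tokens.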
- Removed the existing Self Output module
$\hat {F_0}=\mathrm {CSA}(F_{in})+F_{in}$ $\hat {F_1}=\mathrm {MixCFN}(\hat {F_0})+\hat {F_0}$
- Reduced the K, V dimensions with Pooling
$K, V=\mathrm {Pooling}(F_C)$
- Removed the 1x1 Convolution
$\mathrm {Attention}(Q,K,V)=\mathrm {Softmax}\left({{QK^T}\over {\sqrt {d_{head}}}}\right)V$
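The effect of pooling K and V can be sketched in NumPy as follows. This is a simplified single-head illustration that omits the learned Q/K/V projections and assumes non-overlapping average pooling with a hypothetical `ratio` parameter; it is not the repository's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_tokens(f, h, w, ratio):
    """Average-pool an (h*w, d) token map by `ratio` along each spatial axis."""
    d = f.shape[1]
    grid = f.reshape(h // ratio, ratio, w // ratio, ratio, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)


def pooled_attention(f, h, w, ratio):
    """Softmax(Q K^T / sqrt(d)) V with K, V = Pooling(F); projections omitted."""
    d = f.shape[1]
    q = f
    k = v = pool_tokens(f, h, w, ratio)
    scores = q @ k.T / np.sqrt(d)  # shape (h*w, h*w / ratio**2)
    return softmax(scores) @ v
```

With `ratio = 2` the score matrix has a quarter of the columns of full self-attention, which is where the saving in computation comes from.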
- Replaced the existing Linear (dense) embedding operation with a 1x1 Conv
$\hat {F_C}=\mathrm {Conv}_{1 \times 1}(F_C)$
- Split the 3x3 DWConv channel-wise into a 3x3 DWConv and a 5x5 DWConv, then Concat the results (Mix-CFN)
- Added Batch-Normalization
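The channel-wise split of the depthwise convolution can be sketched as below. The sketch assumes the channels are split in half, with zero "same" padding and stride 1; the split ratio is not given in this README, and Batch-Normalization and the activation are left out. All names are hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernels):
    """Depthwise conv, stride 1, zero 'same' padding. x: (C,H,W), kernels: (C,k,k)."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c, h, w))
    # Each channel is filtered only by its own k x k kernel.
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def mixed_dwconv(x, k3, k5):
    """3x3 DWConv on the first half of the channels, 5x5 on the rest, then concat."""
    half = x.shape[0] // 2
    a = depthwise_conv(x[:half], k3)
    b = depthwise_conv(x[half:], k5)
    return np.concatenate([a, b], axis=0)
```

The output keeps the input shape, so a block of this kind can stand in for a single 3x3 DWConv while mixing two receptive-field sizes.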
model | Params | FLOPs | Val Acc (%) | Val mIoU (%) |
---|---|---|---|---|
SegFormer-B2 | 27.462M | 58.576G | 66.48 | 29.84 |
BoostFormer (Ours) | 17.575M (-36.00%) | 15.826G (-72.98%) | 72.28 (+8.72%) | 34.29 (+14.91%) |
- Compared with the baseline model (SegFormer-B2): Params reduced by 36.00%, FLOPs reduced by 72.98%, and mIoU improved by 14.91% (relative)
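The percentages in the table are relative changes against SegFormer-B2, not differences in percentage points. A short check using the table values (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline, ours):
    """Relative change of `ours` against `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (SegFormer-B2, BoostFormer) pairs copied from the results table.
rows = {
    "Params (M)": (27.462, 17.575),
    "FLOPs (G)": (58.576, 15.826),
    "Val Acc (%)": (66.48, 72.28),
    "Val mIoU (%)": (29.84, 34.29),
}

for name, (baseline, ours) in rows.items():
    print(f"{name}: {relative_change(baseline, ours):+.2f}%")
```

For example, mIoU rises by 4.45 points in absolute terms (29.84 to 34.29), which is the +14.91% relative gain reported above.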
git clone https://github.com/boostcampaitech4lv23cv3/final-project-level3-cv-16.git
# --data-path: the directory name must contain "tiny"
# --batch-size: batch size per GPU (default=128)
bash dist_train.sh {number of GPUs to use} \
    --data-path {tiny_imagenet path} \
    --output_dir {save dir path} \
    --batch-size {batch size per gpu}
# example
bash dist_train.sh 4 \
--data-path /workspace/dataset/tiny_imagenet \
--output_dir result/mod_segformer/ \
--batch-size 64
# Current directory: /final-project-level3-cv-16
# --device: adjust to your environment
# --pretrain: a .pth file (output of pre-training) or a directory (in the format provided by the Hugging Face model hub)
# --batch_size: default=16
python train.py \
    --data_dir {path to ADE20K} \
    --device 0,1,2,3 \
    --save_path {path to the directory to save to} \
    --pretrain {path to the pretrained model dir or .pth} \
    --batch_size {batch size}
# Select the val or test set via phase
# Edit the model-definition code inside eval.py before running
python eval.py \
    --data_dir {path to ADE20K} \
    --pretrain {path to the pretrained model dir}
# Edit the model-definition code inside get_flops_params.py before running
python util/get_flops_params.py \
    --data_dir {path to ADE20K}
|-- appendix/ : presentation materials and WrapUpReport
|-- segformer/ : HuggingFace-based SegFormer model code
|-- boostformer/ : lightweight SegFormer model code
|-- imagenet_pretrain/ : code used to train the encoder on Tiny-ImageNet
|-- util/ : collection of tool scripts
|-- Dockerfile
|-- train.py : ADE20K fine-tuning code
|-- eval.py : code that outputs model inference results
|-- requirements.txt
`-- README.md