Making EfficientNet More Efficient

This repository details how to reproduce the experiments from the paper "Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training" by Dominic Masters, Antoine Labatie, Zach Eaton-Rosen and Carlo Luschi.

https://arxiv.org/abs/2106.03640

1. Paper Summary

This work investigates three techniques to improve the practical training and inference throughput of the popular model EfficientNet [1].

Adding Group Convolutions We expand the depthwise convolutions to group convolutions with group size 16 and reduce the MBConv expansion ratio to keep FLOPs approximately equal. [G1 (baseline) vs G16 (ours)]

Using Proxy Normalization To retain the performance of Batch Norm without the need for batch-dependent statistics, we use the recently developed Proxy Norm [2] in conjunction with Layer Norm. We compare this to a baseline implementation using Group Norm. [GN (baseline) vs LN+PN (ours)]

Half Resolution Training To reduce training FLOPs we train on images with approximately half the total pixel count of the native resolution used in [1], as shown in the table below.

Model Size | Native Resolution | Half Resolution
B0         | 224               | 160
B1         | 240               | 176
B2         | 260               | 192
B3         | 300               | 204
B4         | 380               | 252
B5         | 456               | 328
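As a check on the arithmetic: for B0 the native resolution of 224 gives 224 × 224 = 50,176 pixels; half of that is 25,088 ≈ 158², which rounds to the training resolution of 160 used above.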

For all results (apart from the baseline) we finetune the final model on the validation image sizes for 2 epochs before testing.

2. Hardware and Environment Setup

This repo describes how to reproduce all the experiments on a Graphcore POD16 server.

This repo has been designed to run with Poplar SDK v2.1 (it also works with v2.2) and uses TensorFlow 1. Note that the results in the paper were generated with SDK v2.0, so throughput measurements may differ slightly.

The poplar SDK can be downloaded here (a support account is required).

A full overview of the SDK and instructions for installation can be found here.

Further information can also be found in our documentation and on our Developer Portal.

3. Code

All the code needed to replicate the experiments in the paper has been upstreamed to the CNN training part of our main examples repository: https://github.com/graphcore/examples/tree/master/applications/tensorflow/cnns/training. Information on the code, its structure and the examples that can be run can be found in the associated README.

4. Training

To specify the options for training (with train.py) or validation (with validation.py) you can use command-line arguments, configurations defined in a .yml file, or a mix of both.

Environment Variables

To run the training commands a number of environment variables need to be set:

export REPO=%PATH_TO_GRAPHCORE_EXAMPLES%/applications/tensorflow/cnns/training 
export DATA_DIR=%PATH_TO_IMAGENET_DATASET%
export CONFIG_PATH=%PATH_TO_THIS_REPO%/config.yml
Configs

config.yml includes configurations for all the main experiments in the paper. They are organised in the format B${MODEL_SIZE}-G${GROUP_CONV_SIZE}-${NORM_METHOD}-${TRAINING_RESOLUTION}_16IPU_${NUM_REPLICAS}x${NUM_PIPELINE_STAGES}:
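For example, B2-G16-LN_PN-Half_16IPU_4x4 trains an EfficientNet-B2 with group convolutions of size 16 and Layer Norm + Proxy Norm at half resolution (192, from the table above), using 16 IPUs arranged as 4 replicas of a 4-stage pipeline.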

B0-G1-GN-Native_16IPU_2x8
B1-G1-GN-Native_16IPU_4x4
B2-G1-GN-Native_16IPU_4x4
B3-G1-GN-Native_16IPU_4x4
B4-G1-GN-Native_16IPU_4x4
B5-G1-GN-Native_16IPU_4x4

B0-G16-GN-Native_16IPU_2x8
B1-G16-GN-Native_16IPU_2x8
B2-G16-GN-Native_16IPU_4x4
B3-G16-GN-Native_16IPU_4x4
B4-G16-GN-Native_16IPU_4x4
B5-G16-GN-Native_16IPU_4x4

B0-G16-GN-Half_16IPU_2x8
B1-G16-GN-Half_16IPU_4x4
B2-G16-GN-Half_16IPU_4x4
B3-G16-GN-Half_16IPU_4x4
B4-G16-GN-Half_16IPU_4x4
B5-G16-GN-Half_16IPU_4x4

B0-G16-LN_PN-Native_16IPU_2x8
B1-G16-LN_PN-Native_16IPU_2x8
B2-G16-LN_PN-Native_16IPU_4x4
B3-G16-LN_PN-Native_16IPU_4x4
B4-G16-LN_PN-Native_16IPU_4x4
B5-G16-LN_PN-Native_16IPU_4x4

B0-G16-LN_PN-Half_16IPU_2x8
B1-G16-LN_PN-Half_16IPU_4x4
B2-G16-LN_PN-Half_16IPU_4x4
B3-G16-LN_PN-Half_16IPU_4x4
B4-G16-LN_PN-Half_16IPU_4x4
B5-G16-LN_PN-Half_16IPU_4x4
Running Without PopRun

To run any of these cases without poprun, you can simply run the following command, where $CASE is one of the config names above:

python $REPO/train.py --config-path=$CONFIG_PATH --config $CASE
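For example, to train the B0 baseline configuration, simply substitute one of the config names above:

python $REPO/train.py --config-path=$CONFIG_PATH --config B0-G1-GN-Native_16IPU_2x8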

In many cases, however, these runs can be bound by input-pipeline overheads associated with TensorFlow and Python.

Running With PopRun

To overcome these issues and fully utilise the available CPU resources, poprun should be used for maximum throughput. It launches multiple Python instances, each of which can be pinned to a separate NUMA node, greatly improving overall CPU utilisation.

For this command the ${NUM_REPLICAS} and ${NUM_PIPELINE_STAGES} values are given in the $CASE name.

poprun --numa-aware=yes --num-replicas=${NUM_REPLICAS} --ipus-per-replica=${NUM_PIPELINE_STAGES} --num-instances=${NUM_REPLICAS} --only-output-from-instance 0 python $REPO/train.py --config-path=$CONFIG_PATH --config $CASE --no-validation
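For example, for B4-G16-LN_PN-Half_16IPU_4x4 (4 replicas, 4 pipeline stages) this becomes:

poprun --numa-aware=yes --num-replicas=4 --ipus-per-replica=4 --num-instances=4 --only-output-from-instance 0 python $REPO/train.py --config-path=$CONFIG_PATH --config B4-G16-LN_PN-Half_16IPU_4x4 --no-validation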

However, when running with poprun, validation must be run separately afterwards, as it uses a different replica configuration. Here $RESTORE_PATH should be the directory containing the generated checkpoint files, typically $WORKING_DIR/logs/EfficientNet.../.

python $REPO/validation.py --config-path=$CONFIG_PATH --config $CASE --restore-path=$RESTORE_PATH 

poprun can also be used here by prepending the command above with

poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --only-output-from-instance 0
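giving, in full:

poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --only-output-from-instance 0 python $REPO/validation.py --config-path=$CONFIG_PATH --config $CASE --restore-path=$RESTORE_PATH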

5. Finetuning

To finetune the final model to a new image size, you need to train the last few layers for a further 2 epochs with the same preprocessing as validation. As very little backpropagation is needed, this is much cheaper in terms of both memory and compute, so it can typically fit on a single IPU and be replicated 16 times to use all available IPUs.

  • $RESTORE_PATH should take the same value as above
  • $IMAGE_SIZE should be chosen such that training_image_size % 32 == finetuning_image_size % 32 for the best results (see Section 3.3 of the paper for why)
  • $BATCH_SIZE should be chosen such that it fits in memory
  • $GRAD_ACC should be chosen such that BATCH_SIZE * REPLICAS * GRAD_ACC = GLOBAL_BATCH_SIZE = 512 (see the worked example at the end of this section)
python $REPO/train.py --config-path=$CONFIG_PATH --config $CASE --batch-size=$BATCH_SIZE --image-size=$IMAGE_SIZE --gradient-accumulation-count=$GRAD_ACC --init-path=${RESTORE_PATH}/ckpt/weight_avg_exp_0.97 --replicas=16 --shards=1 --lr-schedule=cosine --epochs=2 --train-with-valid-preprocessing --base-learning-rate=-11 --variable-filter Top

To finetune more layers of the model add block7 and/or block6 to the --variable-filter list.
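As a worked example of the settings above (the per-replica batch size of 8 is illustrative only; use whatever fits in memory for your model size): a model trained with B0-G16-LN_PN-Half_16IPU_2x8 at resolution 160 can be finetuned at the native 224, since both satisfy size % 32 == 0, and with 16 replicas the gradient accumulation count is 512 / (8 × 16) = 4:

# batch size 8 is a hypothetical value; adjust it (and GRAD_ACC) so that BATCH_SIZE * REPLICAS * GRAD_ACC = 512
python $REPO/train.py --config-path=$CONFIG_PATH --config B0-G16-LN_PN-Half_16IPU_2x8 --batch-size=8 --image-size=224 --gradient-accumulation-count=4 --init-path=${RESTORE_PATH}/ckpt/weight_avg_exp_0.97 --replicas=16 --shards=1 --lr-schedule=cosine --epochs=2 --train-with-valid-preprocessing --base-learning-rate=-11 --variable-filter Top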
