Benchmarking-Transformers

Final Project for COMSE6998 Practical Deep Learning System Performance Fall 2021

Contributors:

Supriya Arun (sa3982)

Arvind Kanesan Rathna (ak4728)

Summary

Comparison, benchmarking, and analysis of performance metrics for three transformer models: BERT, DistilBERT, and SqueezeBERT. We evaluate trade-offs in accuracy, computational requirements, and dollar cost on question answering (the SQuAD benchmark).

Running the Code

Step 1 : Hugging Face Setup

Install the huggingface_hub library with pip in your environment:


python -m pip install huggingface_hub

Once you have successfully installed the huggingface_hub library, log in to your Hugging Face account:

huggingface-cli login

Log in with the access token available from your Hugging Face account settings.
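
Alternatively, with a recent version of huggingface_hub you can log in programmatically from Python; this is a minimal sketch using the same access token:

from huggingface_hub import login

# Paste the access token generated under Settings -> Access Tokens on huggingface.co
login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder token, use your own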

Step 2 : Install Dependencies

Install the required dependencies listed in the requirements.txt file.
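
For example:

python -m pip install -r requirements.txt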

Step 3 : Run Training Code

python training.py

The training code will run with the following default parameters:

model_checkpoint = "distilbert-base-uncased"
batch_size = 16
epochs = 3

Change these as required.

Once training is finished, your model will be uploaded to the Hugging Face Model Hub.
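
Once the checkpoint is on the Hub, a quick way to sanity-check it is to load it back into a question-answering pipeline. The sketch below uses one of the checkpoints uploaded for this project; substitute your own Hub ID if you trained under a different account:

from transformers import pipeline

# Load the finetuned checkpoint directly from the Hugging Face Model Hub
qa = pipeline("question-answering", model="SupriyaArun/distilbert-base-uncased-finetuned-squad")

result = qa(question="What course is this project for?",
            context="This is the final project for COMSE6998 Practical Deep Learning System Performance.")
print(result["answer"], result["score"])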

Step 4 : Run Benchmarking Code

Memory consumption benchmarking

  • Here, we benchmark the memory consumption of the 3 finetuned models for varying sequence lengths (32, 128, 512, 1024) while maintaining a constant batch size of 32.
    • Memory consumption is measured in the same way that nvidia-smi measures GPU memory usage (a cross-check sketch follows the plotting command below).
    • Note that we pull the models from the Model Hub to which they were just uploaded (SupriyaArun/distilbert-base-uncased-finetuned-squad, SupriyaArun/bert-base-uncased-finetuned-squad, SupriyaArun/squeezebert-uncased-finetuned-squad).
    • The memory consumption results are saved to benchmark_results/required_memory.csv.
    • The environment used for benchmarking is saved to benchmark_results/env.csv.
    • The plot is saved to plots_pt/required_memory_plot.png.
$ mkdir benchmark_results plots_pt
$ python run_benchmark.py --no_speed --save_to_csv \
                                --models SupriyaArun/distilbert-base-uncased-finetuned-squad \
                                SupriyaArun/bert-base-uncased-finetuned-squad \
                                SupriyaArun/squeezebert-uncased-finetuned-squad \
                                --sequence_lengths 32 128 512 1024 \
                                --batch_sizes 32 \
                                --inference_memory_csv_file benchmark_results/required_memory.csv \
                                --env_info_csv_file benchmark_results/env.csv
$ python plot_csv_file.py --csv_file benchmark_results/required_memory.csv \
                          --figure_png_file=plots_pt/required_memory_plot.png \
                          --no_log_scale \
                          --short_model_names distilbert bert squeeze-bert
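
To cross-check the reported numbers, you can poll the same per-device counter that nvidia-smi reads, via pynvml (pip install nvidia-ml-py3). This is a standalone sanity check, not part of run_benchmark.py:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)   # the same counters nvidia-smi reports
print(f"used {info.used / 1024**2:.0f} MiB of {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
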
  • Next, we benchmark the memory consumption of the 3 finetuned models for varying batch sizes (64, 128, 256, 512) while maintaining a constant sequence length (512).
    • In our experiments we observed that both the K80 and the P100 error out for batch sizes >= 512.
$ python run_benchmark.py --no_speed --save_to_csv \
                                --inference_memory_csv_file benchmark_results/required_memory_2.csv \
                                --env_info_csv_file benchmark_results/env.csv \
                                --models SupriyaArun/distilbert-base-uncased-finetuned-squad \
                                SupriyaArun/bert-base-uncased-finetuned-squad \
                                SupriyaArun/squeezebert-uncased-finetuned-squad \
                                --sequence_lengths 512 \
                                --batch_sizes 64 128 256 512

$ python plot_csv_file.py --csv_file benchmark_results/required_memory_2.csv \
                          --figure_png_file=plots_pt/required_memory_plot_2.png \
                          --no_log_scale \
                          --short_model_names distilbert bert squeeze-bert \
                          --plot_along_batch

Inference Time Benchmarking

  • We measure the inference time for varying sequence lengths (8, 32, 128, 512) while maintaining a constant batch size of 256.
$ python run_benchmark.py  --no_memory --save_to_csv \
                          --inference_time_csv_file benchmark_results/time.csv \
                          --env_info_csv_file benchmark_results/env.csv \
                          --models SupriyaArun/distilbert-base-uncased-finetuned-squad \
                          SupriyaArun/bert-base-uncased-finetuned-squad \
                          SupriyaArun/squeezebert-uncased-finetuned-squad \
                          --sequence_lengths 8 32 128 512 \
                          --batch_sizes 256
$ python plot_csv_file.py --csv_file benchmark_results/time.csv \
                    --figure_png_file=plots_pt/time_plot.png --no_log_scale \
                    --short_model_names distilbert bert squeeze-bert --is_time

Approach and Solution Diagram

(diagram: approach and solution overview)

Results & Observations:

DistilBERT and SqueezeBERT both compromise F1 and EM scores to a small extent in exchange for faster inference.

EM (Exact Match): a binary measure of whether the system output matches the ground-truth answer exactly.
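
As a concrete illustration of how the two metrics are computed, the snippet below follows the standard SQuAD evaluation convention (normalize the strings, then compare exactly for EM and by token overlap for F1); it is a simplified sketch, not the official evaluation script:

import collections
import re
import string

def normalize(text):
    # SQuAD convention: lowercase, strip punctuation, articles, and extra whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))              # 1 -- articles are stripped
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 2))  # 0.67 -- partial credit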

(figure: F1 and EM results for the three models)

Architecture changes that we think explain this behavior:

DistilBERT: The architecture is similar to BERT's but has half the number of layers. Through knowledge distillation, the model retains about 97% of BERT's language understanding capabilities. [Our F1-score ratio of 96.8% is consistent with the paper's claim.]
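
For intuition, the core of knowledge distillation is a soft-target loss that pushes the student's output distribution toward the teacher's. A minimal sketch of that term (temperature-scaled KL divergence, which the DistilBERT paper combines with masked-LM and cosine embedding losses):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 predictions over a vocabulary of 10
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))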

SqueezeBERT: This model is built for edge devices; BERT's position-wise fully connected layers are replaced with "grouped convolutions". The paper reports this makes the model about 4x faster on a Pixel 4, at some cost in accuracy.
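
To see where the savings come from: a position-wise fully connected layer is equivalent to a 1x1 convolution over the sequence, and replacing it with a grouped convolution splits the channels into independent groups, dividing the weight count by the number of groups. A minimal illustration (the dimensions and group count here are illustrative, not SqueezeBERT's exact configuration):

import torch
import torch.nn as nn

hidden, seq_len = 768, 128
x = torch.randn(1, hidden, seq_len)  # (batch, channels, sequence)

dense = nn.Conv1d(hidden, hidden, kernel_size=1)              # equivalent to a position-wise Linear
grouped = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)  # SqueezeBERT-style grouped convolution

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(grouped))       # the grouped layer has roughly 1/4 the weights
print(dense(x).shape, grouped(x).shape)   # both preserve the (1, 768, 128) shape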

Cost for training on P100-GCP

DistilBERT is the most economical to fine-tune. Anomaly: SqueezeBERT is the most expensive despite being a smaller model than BERT; this requires further investigation.
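
The dollar figures are simply wall-clock fine-tuning time multiplied by the hourly price of the GCP instance. A sketch with placeholder numbers (substitute the measured runtime and your region's current GCP pricing):

# Placeholder values for illustration only -- not the measured results
p100_hourly_rate_usd = 1.46   # example on-demand P100 rate; check current GCP pricing
training_hours = 2.5          # measured wall-clock fine-tuning time goes here
print(f"estimated cost: ${p100_hourly_rate_usd * training_hours:.2f}")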

(figure: fine-tuning cost comparison on P100)

Inference Time

(plots: inference time vs. sequence length on P100 and K80 GPUs)

  • For all GPUs and sequence lengths
    • Inference time: DistilBERT < SqueezeBERT < BERT
  • Between GPUs: the P100 is roughly 4x faster than the K80 for all models
  • Inference time scales almost linearly with increasing input sequence length
  • DistilBERT is roughly 65% faster than BERT
    • Consistent with the paper's claim of 60%
  • DistilBERT is roughly 50% faster than SqueezeBERT
  • For the lowest inference time, choose DistilBERT
  • SqueezeBERT's claim of being 4x faster than BERT on a Pixel 4 device is not observed on GPUs

Memory Consumption

(plots: memory consumption on P100 and K80 GPUs, at a constant batch size of 32 and at a constant input token size of 512)

  • With increasing batch size and increasing token size, GPU memory requirements increase almost linearly
  • Memory footprint: SqueezeBERT < DistilBERT < BERT
  • SqueezeBERT's memory consumption is only marginally lower than DistilBERT's
  • Neither GPU has enough memory to support a batch size of 512 (maximum supported: 256)

References:

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  3. SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
  4. Hugging Face Documentation
