BERT for PyTorch

This folder contains scripts to pre-train and finetune the BERT model, and to run inference on a finetuned BERT model, on the Intel® Gaudi® AI accelerator with state-of-the-art accuracy. To obtain model performance data, refer to the Intel Gaudi Model Performance Data page. Before you get started, make sure to review the Supported Configurations.

For more information about training deep learning models using Gaudi, visit developer.habana.ai.

Note: BERT is enabled on both Gaudi and Gaudi 2.

Model References
Model Overview
Setup
Training and Examples
Inference and Examples
Supported Configurations
Changelog
Known Issues

Model Overview

Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing (NLP) pre-training developed by Google. The original English-language BERT model comes in two pre-trained general types: (1) the BERT-Base model, a 12-layer, 768-hidden, 12-head, 110M-parameter neural network architecture, and (2) the BERT-Large model, a 24-layer, 1024-hidden, 16-head, 340M-parameter neural network architecture. Both were trained on the BooksCorpus (800M words) and a version of the English Wikipedia (2,500M words). The base training and modeling scripts for pre-training are based on a clone of https://github.com/NVIDIA/DeepLearningExamples.git, and fine-tuning is based on https://github.com/huggingface/transformers.git.
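The parameter counts quoted above can be roughly reproduced from the layer count and hidden size alone. The following is an illustrative sketch, not code from this repository. It assumes the usual bert-*-uncased settings (a 30,522-token vocabulary, 512 positions, 2 segment types, and a feed-forward width of 4x the hidden size), none of which are stated in this README, and it counts only the embeddings, encoder layers and pooler, not the pre-training heads.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (assumed standard layout)."""
    intermediate = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden            # scale and bias
    # Word, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the formula: splitting the hidden dimension across heads changes how the projections are used, not how many weights they hold.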

The scripts included in this release are as follows:

BERT Large pre-training for BF16 mixed precision for Wikipedia BookCorpus and Wiki dataset in Torch Compile mode.
BERT Large finetuning for BF16 mixed precision for Wikipedia BookCorpus and SQUAD dataset in Lazy mode.
Multi-card (1 server = 8 cards) support for BERT Large pre-training in Torch Compile and finetuning with BF16 mixed precision in Lazy mode.
Multi-server (4 servers = 32 cards) support for BERT Large pre-training with BF16 mixed precision in Torch Compile mode.
BERT pre-training 1.2B parameters using ZeroRedundancyOptimizer with BF16 mixed precision in Torch Compile mode.

Additional environment variables are used in training scripts in order to achieve optimal results for each workload.

Pre-Training

Located in: Model-References/PyTorch/nlp/bert/
Suited for datasets:
- wiki, bookswiki (combination of BooksCorpus and Wiki datasets)
Uses optimizer: LAMB ("Layer-wise Adaptive Moments optimizer for Batch training").
Consists of two tasks:
- Task 1 - Masked Language Model - where, given a sentence, the model guesses randomly chosen words that have been masked out.
- Task 2 - Next Sentence Prediction - where the model guesses whether sentence B comes after sentence A.
The resulting (trained) model weights are language-specific (here: English) and have to be further "fitted" to a specific task (with fine-tuning).
Heavy-weight: the training takes several hours or days.

The BERT training script supports pre-training BERT Large on a dataset with either FP32 or BF16 mixed precision, using torch.compile mode.
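To make the Masked Language Model task more concrete, here is a minimal sketch of the masking step. It is not the logic of create_pretraining_data.py. It assumes the 80/10/10 replacement rule from the original BERT paper, a masking probability of 0.15 and at most 20 predictions per sequence (values that appear in the dataset paths and flags later in this README), and a toy vocabulary. The function name and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, masked_lm_prob=0.15, max_predictions=20, seed=12345):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    rng.shuffle(candidates)
    num_to_mask = min(max_predictions, max(1, round(len(tokens) * masked_lm_prob)))

    output = list(tokens)
    labels = {}
    for i in sorted(candidates[:num_to_mask]):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            output[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            output[i] = tokens[i]           # 10%: keep the original token
        else:
            output[i] = rng.choice(vocab)   # 10%: replace with a random token
    return output, labels


sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(sentence, vocab=["dog", "tree", "ran", "blue"])
print(masked)
print(labels)
```

The model is trained to recover the tokens stored in `labels` from the masked sequence; all other positions are left untouched and do not contribute to the MLM loss.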

Finetuning

Located in: Model-References/PyTorch/nlp/bert/
Suited for dataset:
- SQuAD (Stanford Question Answering Dataset)
Uses optimizer: Fused ADAM.
Light-weight: the finetuning takes several minutes.

The BERT finetuning script supports fine-tuning BERT Large on the SQuAD dataset with either FP32 or BF16 mixed precision, using Lazy mode.

Setup

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment including the $PYTHON environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.

Clone Intel Gaudi Model-References

In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.

git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References

Install Model Requirements

In the docker container, go to the BERT directory

cd Model-References/PyTorch/nlp/bert

Install the required packages using pip:

$PYTHON -m pip install -r requirements.txt

Vocab File

Download the Vocab file located here.

Download Dataset

Pre-Training:

Model-References/PyTorch/nlp/bert/data provides scripts to download, extract and pre-process Wikipedia and BookCorpus datasets.

Go to the data folder and run the data preparation script.

cd Model-References/PyTorch/nlp/bert/data

It is highly recommended to download the Wiki dataset alone using the following command.

bash create_datasets_from_start.sh

Wiki and BookCorpus datasets can be downloaded by running the script as follows.

bash create_datasets_from_start.sh wiki_books

Note that the pre-training dataset is huge and takes several hours to download. BookCorpus may have access and download constraints. The final accuracy may vary depending on the dataset and its size. The script creates formatted datasets for Phase 1 and Phase 2 of the pre-training.

Finetuning:

This section provides steps to extract and pre-process the SQuAD dataset (v1.1).

Go to the squad folder.

cd Model-References/PyTorch/nlp/bert/data/squad

Download the SQuAD dataset.

bash squad_download.sh

Packing the Data

Intel Gaudi supports a data packing technique called Non-Negative Least Squares Histogram. Instead of padding each sample with zeros, several short sequences are packed into one multi-sequence of size max_seq_len. This removes most of the padding, which can lead to a speedup of up to 2× in time-to-train (TTT). The packing technique can also be applied to other datasets with high variability in sample length.
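The idea of packing can be illustrated with a much simpler strategy than the one used by pack_pretraining_data_pytorch.py. The sketch below uses plain first-fit-decreasing bin packing, which is a stand-in chosen for brevity and not the Non-Negative Least Squares Histogram method. The function name and the sample lengths are hypothetical.

```python
def pack_first_fit_decreasing(lengths, max_seq_len):
    """Group sequence lengths into packs whose total length fits max_seq_len."""
    packs = []  # each pack is a list of sequence lengths
    free = []   # remaining room in the pack at the same index
    for length in sorted(lengths, reverse=True):
        if length > max_seq_len:
            raise ValueError(f"sequence of length {length} exceeds {max_seq_len}")
        for i, room in enumerate(free):
            if length <= room:
                # Put the sequence into the first pack that still has room.
                packs[i].append(length)
                free[i] -= length
                break
        else:
            # No existing pack fits: open a new one.
            packs.append([length])
            free.append(max_seq_len - length)
    return packs


lengths = [128, 100, 64, 60, 30, 28, 20, 10]
packs = pack_first_fit_decreasing(lengths, max_seq_len=128)
padding_before = len(lengths) * 128 - sum(lengths)
padding_after = len(packs) * 128 - sum(lengths)
print(packs)
print(f"padding tokens: {padding_before} unpacked vs {padding_after} packed")
```

With these sample lengths, eight padded samples collapse into four packs, so the same tokens are processed in half as many rows.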

Please note that for each NLP dataset with sequential data samples, the speedup with data packing is determined by the ratio of max_seq_len to average_seq_len in that particular dataset. The larger the ratio, the higher the speedup.
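As a rough way to estimate this ratio for a dataset, the sketch below divides max_seq_len by the average sequence length. The result is best read as an optimistic upper bound, since it assumes packing removes all padding and adds no overhead. The function name and sample lengths are hypothetical.

```python
def ideal_packing_speedup(seq_lengths, max_seq_len):
    """Ratio of max_seq_len to the average sequence length (idealized estimate)."""
    if not seq_lengths:
        raise ValueError("seq_lengths must not be empty")
    average_seq_len = sum(seq_lengths) / len(seq_lengths)
    return max_seq_len / average_seq_len


sample_lengths = [128, 100, 64, 60, 30, 28, 20, 10]
ratio = ideal_packing_speedup(sample_lengths, max_seq_len=128)
print(f"max_seq_len / average_seq_len = {ratio:.2f}")
```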

To pack the dataset, in docker run:

cd /root/Model-References/PyTorch/nlp/bert

$PYTHON pack_pretraining_data_pytorch.py --input_dir <dataset_path_phase1> --output_dir <packed_dataset_path_phase1> --max_sequence_length 128 --max_predictions_per_sequence 20

$PYTHON pack_pretraining_data_pytorch.py --input_dir <dataset_path_phase2> --output_dir <packed_dataset_path_phase2> --max_sequence_length 512 --max_predictions_per_sequence 80

Note: This generates a JSON file at <output_dir>/../<tail_dir>_metadata.json with metadata such as "avg_seq_per_sample". run_pretraining.py takes this JSON file as input and extracts "avg_seq_per_sample" from it when running in packed dataset mode.
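To check the generated metadata by hand, something like the following sketch could be used. It assumes that <tail_dir> is the last path component of the output directory and that "avg_seq_per_sample" is a top-level key of the JSON object; both are assumptions about the file layout, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def metadata_path(output_dir):
    """Build <output_dir>/../<tail_dir>_metadata.json from the packing output dir."""
    out = Path(output_dir)
    return out.parent / f"{out.name}_metadata.json"


def read_avg_seq_per_sample(output_dir):
    """Read "avg_seq_per_sample" from the metadata file (assumed top-level key)."""
    with open(metadata_path(output_dir)) as handle:
        return json.load(handle)["avg_seq_per_sample"]


print(metadata_path("/data/packed/phase1"))
```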

Training and Examples

Please create a log directory to store dllogger.json and pass its location with the --json-summary argument.
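A small helper along these lines can create the directory before launching a run. It is only a convenience sketch; the helper name is hypothetical, and the directory /tmp/log_directory simply mirrors the path used in the example commands.

```python
from pathlib import Path


def prepare_log_dir(log_dir):
    """Create the log directory if needed and return the dllogger.json path."""
    path = Path(log_dir)
    path.mkdir(parents=True, exist_ok=True)
    return str(path / "dllogger.json")


if __name__ == "__main__":
    summary = prepare_log_dir("/tmp/log_directory")
    # Value to pass on the command line of the training scripts.
    print(f"--json-summary={summary}")
```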

Single Card and Multi-Card Pre-Training Examples

Run training on 1 HPU:

Using packed data: torch.compile mode, 1 HPU, BF16 mixed precision, batch size 64 for Phase 1 and batch size 8 for Phase 2:

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --use_torch_compile

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=512 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --use_torch_compile
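The "batch size 64" and "batch size 8" in the description above do not appear directly as flags. They follow from dividing --train_batch_size by --gradient_accumulation_steps, assuming the script derives the per-step batch this way, as the NVIDIA scripts it is based on do. The sketch below is a worked check of that arithmetic; the function name is hypothetical.

```python
def micro_batch_size(train_batch_size, gradient_accumulation_steps):
    """Samples processed per forward/backward pass before gradients are accumulated."""
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by gradient_accumulation_steps")
    return train_batch_size // gradient_accumulation_steps


# Phase 1: --train_batch_size=8192 --gradient_accumulation_steps=128
print("Phase 1:", micro_batch_size(8192, 128))
# Phase 2: --train_batch_size=4096 --gradient_accumulation_steps=512
print("Phase 2:", micro_batch_size(4096, 512))
# Phase 2 on Gaudi 2 examples: --gradient_accumulation_steps=256
print("Phase 2 (Gaudi 2):", micro_batch_size(4096, 256))
```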

Using packed data: Eager mode with torch.compile enabled, 1 HPU, BF16 mixed precision, batch size 64 for Phase 1 on Gaudi 2:

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --use_torch_compile \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128

Using packed data: torch.compile mode, 1 HPU, BF16 mixed precision, batch size 64 for Phase 1 and batch size 16 for Phase 2 on Gaudi 2:

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --use_torch_compile

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=256 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --use_torch_compile
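The --warmup_proportion values in these commands are easier to read as step counts. Assuming the learning-rate scheduler treats the proportion as a fraction of --max_steps (an assumption about schedulers.py, not something stated in this README), the warmup lengths work out as sketched below. The function name is hypothetical.

```python
def warmup_steps(max_steps, warmup_proportion):
    """Number of warmup steps, assuming the proportion applies to max_steps."""
    return int(max_steps * warmup_proportion)


# Phase 1: --max_steps=7038 --warmup_proportion=0.2843
print("Phase 1 warmup steps:", warmup_steps(7038, 0.2843))
# Phase 2: --max_steps=1563 --warmup_proportion=0.128
print("Phase 2 warmup steps:", warmup_steps(1563, 0.128))
```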

torch.compile mode, 1 HPU, unpacked data, BF16 mixed precision, batch size 64 for Phase 1 and batch size 8 for Phase 2:

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --enable_packed_data_mode False --use_torch_compile

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=512 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False --use_torch_compile

torch.compile mode, 1 HPU, unpacked data, FP32 precision, batch size 32 for Phase 1 and batch size 4 for Phase 2:

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=512 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=32 \
      --enable_packed_data_mode False --use_torch_compile

export PT_HPU_LAZY_MODE=0
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512/books_wiki_en_corpus \
      --train_batch_size=128 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=64 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False --use_torch_compile

Run training on 8 HPUs:

To run the multi-card demo, make sure the host machine has 512 GB of RAM installed. Modify the docker run command to pass 8 Gaudi cards to the docker container. This ensures the docker has access to all 8 cards required for the multi-card demo.

NOTE: mpirun map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
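The linked mpirun Configuration instructions remain the reference for this value. As a rough illustration only, one common rule of thumb divides the number of physical CPU cores by the number of HPUs. This rule is an assumption, not something taken from this README, and the CPU figures in the example are hypothetical values as they might be reported by lscpu.

```python
def recommended_pe(logical_cpus, threads_per_core, num_hpus):
    """Physical cores available per HPU process (assumed rule of thumb)."""
    physical_cores = logical_cpus // threads_per_core
    return physical_cores // num_hpus


# Hypothetical host: 96 logical CPUs, 2 threads per core, 8 HPUs.
pe = recommended_pe(logical_cpus=96, threads_per_core=2, num_hpus=8)
print(f"--map-by socket:PE={pe}")
```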

Using packed data: torch.compile mode, 8 HPUs, BF16 mixed precision, per chip batch size of 64 for Phase 1 and 8 for Phase 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json --use_habana \
      --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --json-summary=/tmp/log_directory/dllogger.json \
      --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --use_torch_compile

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json --use_habana \
      --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --json-summary=/tmp/log_directory/dllogger.json \
      --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=512 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --use_torch_compile
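With 8 HPUs, the effective batch per optimizer step grows with the number of cards. The sketch below assumes that --train_batch_size is a per-device value, as in the NVIDIA scripts this code is based on; that assumption is not confirmed by this README. The function name is hypothetical.

```python
def global_batch_size(per_device_batch_size, world_size):
    """Samples contributing to one optimizer step across all devices."""
    return per_device_batch_size * world_size


# Phase 1: --train_batch_size=8192 on each of 8 HPUs
print("Phase 1 global batch:", global_batch_size(8192, 8))
# Phase 2: --train_batch_size=4096 on each of 8 HPUs
print("Phase 2 global batch:", global_batch_size(4096, 8))
```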

Using packed data: torch.compile mode, 8 HPUs, BF16 mixed precision, per chip batch size of 64 for Phase 1 and 16 for Phase 2 on Gaudi 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json --use_habana \
      --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --json-summary=/tmp/log_directory/dllogger.json \
      --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --use_torch_compile

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json --use_habana \
      --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --json-summary=/tmp/log_directory/dllogger.json \
      --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=256 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --use_torch_compile

Eager mode with torch.compile enabled, 8 HPUs, packed data, BF16 mixed precision, per chip batch size of 64 for Phase 1 on Gaudi 2:

export PT_HPU_LAZY_MODE=0
export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --use_torch_compile \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --warmup_proportion=0.2843 \
      --max_steps=7038 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128

torch.compile mode, 8 HPUs, unpacked data, BF16 mixed precision, per chip batch size of 64 for Phase 1 and 8 for Phase 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --use_torch_compile \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --warmup_proportion=0.2843 \
      --max_steps=7038 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --enable_packed_data_mode False

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --use_torch_compile \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512/books_wiki_en_corpus \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --warmup_proportion=0.128 \
      --max_steps=5 --num_steps_per_checkpoint=200 --learning_rate=0.004 --gradient_accumulation_steps=512 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False

torch.compile mode, 8 HPUs, unpacked data, FP32 precision, per chip batch size of 32 for Phase 1 and 4 for Phase 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation  --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=3 --warmup_proportion=0.2843 \
      --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=256 \
      --enable_packed_data_mode False --use_torch_compile

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --json-summary=/tmp/log_directory/dllogger.json \
      --output_dir=/tmp/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512/books_wiki_en_corpus \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 --warmup_proportion=0.128 \
      --num_steps_per_checkpoint=200 --learning_rate=0.004 --gradient_accumulation_steps=512 \
      --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False --use_torch_compile

torch.compile mode, 8 HPUs, unpacked data, BF16 mixed precision, per chip batch size of 64 for Phase 1 and 8 for Phase 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root -x PT_HPU_LAZY_MODE=0 \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --use_torch_compile \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --warmup_proportion=0.2843 \
      --max_steps=7038 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --enable_packed_data_mode False

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --use_torch_compile \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512/books_wiki_en_corpus \
      --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=80 --warmup_proportion=0.128 \
      --max_steps=5 --num_steps_per_checkpoint=200 --learning_rate=0.004 --gradient_accumulation_steps=512 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False

Single Card and Multi-Card Finetuning Examples

Run training on 1 HPU:

Lazy mode, 1 HPU, BF16 mixed precision, batch size 24 for train and batch size 8 for test:

$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased \
      --config_file=./bert_config.json \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --train_batch_size=24 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20 \
      --autocast

Lazy mode, 1 HPU, FP32 precision, batch size 12 for train and batch size 8 for test:

$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --train_batch_size=12 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20

Eager mode with torch.compile enabled, 1 HPU, FP32 precision, batch size 12 for train and batch size 8 for test:

export PT_HPU_LAZY_MODE=0
$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json --use_torch_compile \
      --train_batch_size=12 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20
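The --max_seq_length=384 and --doc_stride=128 flags used in these commands control how a context longer than the model input is split into overlapping windows. The sketch below follows the sliding-window scheme of the original BERT SQuAD preprocessing; it is assumed, not verified here, that run_squad.py behaves the same way. The function name and the token counts in the example are hypothetical.

```python
def split_into_doc_spans(num_doc_tokens, num_query_tokens,
                         max_seq_length=384, doc_stride=128):
    """Return (start, length) windows over the document tokens."""
    # Reserve room for the query plus [CLS], [SEP] and the final [SEP].
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("query is too long for the chosen max_seq_length")

    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return spans


# Hypothetical example: a 1000-token context and a 10-token question.
for start, length in split_into_doc_spans(1000, 10):
    print(f"tokens {start}..{start + length - 1}")
```

Each window becomes a separate input to the model; because consecutive windows overlap, an answer near a window boundary is still fully contained in at least one of them.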

Run training on 8 HPUs:

NOTE: mpirun map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.

Lazy mode, 8 HPUs, BF16 mixed precision, per chip batch size of 24 for train and 8 for test:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased \
      --config_file=./bert_config.json \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --train_batch_size=24 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20 \
      --autocast

Lazy mode, 8 HPUs, FP32 precision, per chip batch size of 12 for train and 8 for test:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased --config_file=./bert_config.json \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --train_batch_size=12 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20

Eager mode with torch.compile enabled, 8 HPUs, BF16 mixed precision, per chip batch size of 24 for train and 8 for test:

export PT_HPU_LAZY_MODE=0
export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_squad.py --do_train --bert_model=bert-large-uncased \
      --config_file=./bert_config.json --use_torch_compile \
      --use_habana --use_fused_adam --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --train_batch_size=24 --predict_batch_size=8 --seed=1 --max_seq_length=384 \
      --doc_stride=128 --max_steps=-1   --learning_rate=3e-5 --num_train_epochs=2 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --train_file=data/squad/v1.1/train-v1.1.json \
      --skip_cache --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py --log_freq 20 \
      --autocast

Intel Gaudi provides pre-training checkpoints for most of the models. To finetune from one of them, pass the path to the BERT checkpoint as <path-to-checkpoint> in --init_checkpoint when running the commands above.

Multi-Server Training Examples

To run the multi-server demo, make sure the host machine has 512 GB of RAM installed. Also ensure you followed the Gaudi Installation Guide to install and set up docker, so that the docker has access to all 8 cards required for the multi-node demo. Multi-server configurations for BERT PT training of up to 4 servers, each with 8 Gaudi cards, have been verified.

Before execution of the multi-server scripts, make sure all network interfaces are up. You can change the state of each network interface managed by the habanalabs driver using the following command:

sudo ip link set <interface_name> up

To check whether a specific network interface is managed by the habanalabs driver, run:

sudo ethtool -i <interface_name>
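
To bring up every habanalabs-managed interface in one pass, the two commands above can be combined in a loop. The following is a minimal sketch and not part of the repository: both helper names are hypothetical, and it assumes that `ethtool -i` reports a driver name beginning with `habanalabs` for Gaudi interfaces, which should be confirmed on your system first.

```shell
# Hypothetical helper: reads `ethtool -i` output on stdin and succeeds only
# if the "driver:" field starts with "habanalabs" (assumed driver string).
is_habanalabs_driver() {
    awk -F': *' '$1 == "driver" { found = ($2 ~ /^habanalabs/) } END { exit !found }'
}

# Hypothetical helper: brings up every interface under /sys/class/net whose
# driver matches. It is only defined here; call it explicitly to run it.
bring_up_habanalabs_links() {
    for dev in /sys/class/net/*; do
        name=${dev##*/}
        if sudo ethtool -i "$name" 2>/dev/null | is_habanalabs_driver; then
            sudo ip link set "$name" up
        fi
    done
}
```

Keeping the driver check separate from the loop makes it possible to test the match against saved `ethtool -i` output before touching any interface.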

Docker ssh Port Setup for Multi-Server Training

By default, the Intel Gaudi docker uses port 22 for ssh. The default port configured in the script is port 3022. Run the following commands to configure the selected port number (port 3022 in the example below):

sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
service ssh restart
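
The sed substitutions above only take effect when the commented-out defaults (`#Port 22` and `#PermitRootLogin prohibit-password`) are still present in the file. If the file was edited earlier, sed changes nothing and reports no error. A small check can confirm that an active `Port` line exists; the helper name below is hypothetical and the snippet is only a sketch.

```shell
# Hypothetical helper: succeeds if the config file ($1) contains an active
# (uncommented) "Port <n>" line for the port number given in $2.
sshd_port_is() {
    grep -Eq "^[[:space:]]*Port[[:space:]]+$2[[:space:]]*\$" "$1"
}

# Example use (not run here):
#   sshd_port_is /etc/ssh/sshd_config 3022 || echo "Port 3022 is not set" >&2
```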

Set up password-less ssh

To set up password-less ssh between all connected servers used in scale-out training, follow the steps below:

1. Run the following in all the nodes' docker sessions:
```
mkdir -p ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -b 4096
```
a. Copy the id_rsa.pub contents from every node's docker to every other node's docker's ~/.ssh/authorized_keys (all public keys need to be in all hosts' authorized_keys):
```
cat id_rsa.pub > authorized_keys
vi authorized_keys
```
b. Copy the contents of the file to the other systems.

c. Paste all hosts' public keys into all hosts' authorized_keys file.

2. On each system, add all hosts (including itself) to known_hosts. The IP addresses used below are just for illustration:

ssh-keyscan -p 3022 -H 10.10.100.101 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.102 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.103 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.104 >> ~/.ssh/known_hosts
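
With more than a few servers, the ssh-keyscan lines above can be produced by a loop instead of being typed one by one. This is a minimal sketch: the function name and the `KEYSCAN` and `KNOWN_HOSTS` override variables are hypothetical and exist only so the loop can be dry-run without touching the real known_hosts file.

```shell
# Hypothetical helper: add_known_hosts <port> <host>...
# KEYSCAN and KNOWN_HOSTS are illustrative overrides for dry runs; by default
# the function calls ssh-keyscan and appends to ~/.ssh/known_hosts.
add_known_hosts() {
    port=$1
    shift
    for host in "$@"; do
        "${KEYSCAN:-ssh-keyscan}" -p "$port" -H "$host" \
            >> "${KNOWN_HOSTS:-$HOME/.ssh/known_hosts}"
    done
}

# Example use with the illustrative addresses from above (not run here):
#   add_known_hosts 3022 10.10.100.101 10.10.100.102 10.10.100.103 10.10.100.104
```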

Install the Python packages required for the BERT pre-training model:

pip install -r Model-References/PyTorch/nlp/bert/requirements.txt

Run training on 32 HPUs:

NOTE:

mpirun map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
The $MPI_ROOT environment variable is set automatically during Setup. See Gaudi Installation Guide for details.

Using packed data: torch.compile mode, 32 HPUs, BF16 mixed precision, per chip batch size 64 for Phase 1 and batch size 8 for Phase 2:

export MASTER_ADDR="10.10.100.101"
export MASTER_PORT="12345"
mpirun --allow-run-as-root --mca plm_rsh_args "-p 3022" --bind-to core -n 32 --map-by ppr:4:socket:PE=6 \
--rank-by core --report-bindings --prefix $MPI_ROOT --mca btl_tcp_if_include 10.10.100.101/16 \
      -H 10.10.100.101:16,10.10.100.102:16,10.10.100.103:16,10.10.100.104:16 -x LD_LIBRARY_PATH \
      -x HABANA_LOGS -x PYTHONPATH -x MASTER_ADDR \
      -x MASTER_PORT -x PT_HPU_LAZY_MODE=0 \
      $PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=2048 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 \
      --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=0.006 \
      --gradient_accumulation_steps=32 --use_torch_compile

export MASTER_ADDR="10.10.100.101"
export MASTER_PORT="12345"
mpirun --allow-run-as-root --mca plm_rsh_args "-p 3022" --bind-to core -n 32 --map-by ppr:4:socket:PE=6 \
--rank-by core --report-bindings --prefix $MPI_ROOT --mca btl_tcp_if_include 10.10.100.101/16 \
      -H 10.10.100.101:16,10.10.100.102:16,10.10.100.103:16,10.10.100.104:16 -x LD_LIBRARY_PATH \
      -x HABANA_LOGS -x PYTHONPATH -x MASTER_ADDR \
      -x MASTER_PORT -x PT_HPU_LAZY_MODE=0 \
      $PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=1024 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 --warmup_proportion=0.128 \
      --num_steps_per_checkpoint=200 --learning_rate=0.004 --gradient_accumulation_steps=128 \
      --resume_from_checkpoint --phase1_end_step=7038 --phase2 --use_torch_compile
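
The per-chip batch sizes quoted for these runs (64 for Phase 1 and 8 for Phase 2) are consistent with dividing `--train_batch_size` by `--gradient_accumulation_steps`. The sketch below reproduces that arithmetic for the two commands above. It assumes that `--train_batch_size` is the per-HPU batch for one optimizer step, as in the NVIDIA reference script this code derives from; the helper name is hypothetical.

```shell
# Hypothetical helper: micro_batch <train_batch_size> <gradient_accumulation_steps>
micro_batch() {
    echo $(( $1 / $2 ))
}

micro_batch 2048 32     # Phase 1: prints 64
micro_batch 1024 128    # Phase 2: prints 8

# Assumed global batch per optimizer step across 32 HPUs.
echo $(( 2048 * 32 ))   # Phase 1: prints 65536
echo $(( 1024 * 32 ))   # Phase 2: prints 32768
```

When changing the number of HPUs or the accumulation steps, checking this division first helps keep the per-chip batch at a value that has been verified.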

torch.compile mode, 32 HPUs, unpacked data, BF16 mixed precision, batch size 64 for Phase 1 and batch size 8 for Phase 2:

export MASTER_ADDR="10.10.100.101"
export MASTER_PORT="12345"
mpirun --allow-run-as-root --mca plm_rsh_args -p3022 --bind-to core -n 32 --map-by ppr:4:socket:PE=6 \
--rank-by core --report-bindings --prefix $MPI_ROOT --mca btl_tcp_if_include 10.10.100.101/16 \
-H 10.10.100.101:16,10.10.100.102:16,10.10.100.103:16,10.10.100.104:16 \
      -x LD_LIBRARY_PATH -x HABANA_LOGS -x PYTHONPATH -x MASTER_ADDR -x MASTER_PORT -x https_proxy -x http_proxy \
      -x PT_HPU_LAZY_MODE=0 \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased \
      --autocast --config_file=./bert_config.json \
      --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128/books_wiki_en_corpus \
      --train_batch_size=2048 --max_seq_length=128 --max_predictions_per_seq=20 \
      --max_steps=7038 --warmup_proportion=0.2843 \
      --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=32 \
      --enable_packed_data_mode False --use_torch_compile

export MASTER_ADDR="10.10.100.101"
export MASTER_PORT="12345"
mpirun --allow-run-as-root --mca plm_rsh_args -p3022 --bind-to core -n 32 --map-by ppr:4:socket:PE=6 \
--rank-by core --report-bindings --prefix $MPI_ROOT --mca btl_tcp_if_include 10.10.100.101/16 \
      -H 10.10.100.101:16,10.10.100.102:16,10.10.100.103:16,10.10.100.104:16 -x LD_LIBRARY_PATH \
      -x HABANA_LOGS -x PYTHONPATH -x MASTER_ADDR -x MASTER_PORT -x PT_HPU_LAZY_MODE=0 \
      $PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast \
      --config_file=./bert_config.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/results/checkpoints \
      --use_fused_lamb --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_512/books_wiki_en_corpus \
      --train_batch_size=1024 --max_seq_length=512 --max_predictions_per_seq=80 --max_steps=1563 \
      --warmup_proportion=0.128 --num_steps_per_checkpoint=200 --learning_rate=0.004 \
      --gradient_accumulation_steps=128 --resume_from_checkpoint --phase1_end_step=7038 --phase2 \
      --enable_packed_data_mode False --use_torch_compile
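
The `--warmup_proportion` values in these commands look less arbitrary once multiplied by `--max_steps`: they correspond to roughly 2000 warmup steps in Phase 1 and roughly 200 in Phase 2. The sketch below shows the arithmetic. It assumes the scheduler treats the proportion as a fraction of `--max_steps`; the helper name is hypothetical.

```shell
# Hypothetical helper: warmup_steps <max_steps> <warmup_proportion>
warmup_steps() {
    awk -v steps="$1" -v prop="$2" 'BEGIN { printf "%.1f\n", steps * prop }'
}

warmup_steps 7038 0.2843   # Phase 1: prints 2000.9
warmup_steps 1563 0.128    # Phase 2: prints 200.1
```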

BERT Pre-Training with ZeroRedundancyOptimizer

The BERT training script supports pre-training of the BERT 1.2B parameter model using ZeroRedundancyOptimizer with BF16 mixed precision in torch.compile mode.

torch.compile mode, 8 HPUs, BF16 mixed precision, per chip batch size 8 for Phase 1 and batch size 2 for Phase 2:

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --use_torch_compile \
      --config_file=./bert_config_1.2B.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase1/train_packed_new \
      --train_batch_size=1024 --max_seq_length=128 --max_predictions_per_seq=20 --warmup_proportion=0.2843 \
      --max_steps=7038 --num_steps_per_checkpoint=200 --learning_rate=0.006 --gradient_accumulation_steps=128 \
      --use_zero_optimizer True

export MASTER_ADDR="localhost"
export MASTER_PORT="12345"
export PT_HPU_LAZY_MODE=0
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
$PYTHON run_pretraining.py --do_train --bert_model=bert-large-uncased --autocast --use_torch_compile \
      --config_file=./bert_config_1.2B.json --use_habana --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
      --json-summary=/tmp/log_directory/dllogger.json --output_dir=/tmp/BERT_PRETRAINING/results/checkpoints --use_fused_lamb \
      --input_dir=/data/pytorch/bert_pretraining/packed_data/phase2/train_packed_new \
      --train_batch_size=1024 --max_seq_length=512 --max_predictions_per_seq=80 --warmup_proportion=0.128 \
      --max_steps=1563 --num_steps_per_checkpoint=200 --learning_rate=0.004 --gradient_accumulation_steps=512 \
      --resume_from_checkpoint --phase1_end_step=7038 --phase2 --use_zero_optimizer True
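
PyTorch's ZeroRedundancyOptimizer partitions optimizer state across ranks, so each HPU stores only its own shard instead of a full copy. The arithmetic below is purely illustrative of why that matters at this model size. It assumes exactly 1.2 billion parameters and two FP32 optimizer moments per parameter (8 bytes), neither of which is confirmed by this README; the helper name is hypothetical.

```shell
# Hypothetical helper: optimizer_state_mb <num_params> <bytes_per_param> <num_ranks>
# Prints the optimizer-state size held by one rank, in MB (10^6 bytes).
optimizer_state_mb() {
    echo $(( $1 * $2 / $3 / 1000000 ))
}

optimizer_state_mb 1200000000 8 1   # unsharded: prints 9600
optimizer_state_mb 1200000000 8 8   # sharded over 8 HPUs: prints 1200
```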

Inference and Examples

NOTE: To run the inference examples, use a fine-tuned model.

Run inference on 1 HPU:

Lazy mode, 1 HPU, BF16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

HPU Graphs, 1 HPU, BF16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast --use_hpu_graphs \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict  \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

Lazy mode, 1 HPU, FP16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict --fp16 \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

HPU Graphs, 1 HPU, FP16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast --use_hpu_graphs \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict --fp16 \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

Run inference on 1 HPU with torch.compile:

1 HPU, BF16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict --use_torch_compile \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

1 HPU, FP16 mixed precision, batch size 24:

$PYTHON run_squad.py --bert_model=bert-large-uncased --autocast \
      --config_file=./bert_config.json \
      --use_habana --do_lower_case --output_dir=/tmp/results/checkpoints \
      --json-summary=/tmp/log_directory/dllogger.json \
      --predict_batch_size=24 \
      --init_checkpoint=<path-to-checkpoint> \
      --vocab_file=<path-to-vocab> \
      --do_predict --use_torch_compile --fp16 \
      --predict_file=data/squad/v1.1/dev-v1.1.json \
      --do_eval --eval_script=data/squad/v1.1/evaluate-v1.1.py

When not using torch.compile, it is recommended to use the HPU Graphs model type to minimize the host time spent in the forward() call.

Supported Configurations

BERT Pretraining

Validated on	Intel Gaudi Software Version	PyTorch Version	Mode
Gaudi	1.20.0	2.6.0	Training
Gaudi 2	1.20.0	2.6.0	Training

BERT Finetuning

Validated on	Intel Gaudi Software Version	PyTorch Version	Mode
Gaudi	1.18.0	2.4.0	Training
Gaudi	1.20.0	2.6.0	Inference
Gaudi 2	1.18.0	2.4.0	Training
Gaudi 2	1.20.0	2.6.0	Inference
Gaudi 3	1.20.0	2.6.0	Inference*

*Disclaimer: only BF16 is supported.

Changelog

1.20.0

Link pointing to Pre-trained checkpoint was removed.

1.19.0

Added support for torch.compile mode as the default mode in BERT Large Pretraining Phase 1 and Phase 2.

1.17.0

Forced static compilation for BERT Finetuning in torch.compile mode.

1.15.0

Changed model configurations mentioned in this README:

Lazy mode, 1 HPU, BF16 mixed precision, batch size 64 for Phase 1 and batch size 16 for Phase 2 on Gaudi 2.
Lazy mode, 8 HPUs, BF16 mixed precision, per chip batch size of 64 for Phase 1 and 16 for Phase 2 on Gaudi 2.

1.14.0

Added support for dynamic shapes in BERT Pretraining.

1.13.0

Added tensorboard logging.
Added support for torch.compile inference.
Added support for FP16 through autocast.
Aligned profiler invocation between training and inference loops.
Added support for dynamic shapes in BERT Finetuning.
Added torch.compile support - performance improvement feature for PyTorch Eager mode for BERT Pretraining. Supported only for phase1.
Added torch.compile support - performance improvement feature for PyTorch Eager mode for BERT Finetuning.

1.12.0

Removed HMP; switched to autocast.
Eager mode support is deprecated.

1.11.0

Dynamic Shapes will be enabled by default in future releases. It is currently enabled in BERT Pretraining Model training script as a temporary solution.

1.10.0

Support added for cached dataset for finetuning.

1.9.0

Enabled usage of PyTorch autocast.
Enabled BERT finetuning (run_squad.py) with the SQuAD dataset (training and inference).

1.6.0

Added ZeroRedundancyOptimizer support, tested with the BERT 1.2B parameter config.

1.5.0

Packed dataset mode is set as default execution mode.
Deprecated the flags enable_packed_data_mode and avg_seq_per_pack and added support for automatic detection of those parameters based on dataset metadata file.
Changes related to Saving and Loading checkpoint were removed.
Removed changes related to padding index and flatten.
Fixed throughput calculation for packed dataset.
Demo scripts were removed and references to custom demo script were replaced by community entry points in README.
Reduced the number of distributed barrier calls to once per gradient accumulation steps.
Simplified the distributed Initialization.
Added support for training on Gaudi 2 supporting up to 8 cards.

1.4.0

Lazy mode is set as the default execution mode. For Eager mode, set use-lazy-mode to False.
Pretraining with packed dataset is supported.

1.3.0

Single worker thread changes are removed.
Loss computation was brought back to the training script.
Removed setting the embedding padding index as 0 explicitly.
Removed the select op implementation using index select and squeeze and retained the default code.
Permute and view were replaced with flatten.
Changed python or python3 to $PYTHON to execute the correct version based on environment setup.

1.2.0

Enabled HCCL flow for distributed training.
Removed changes related to data type conversions for input_ids, segment ids, position_ids and input_mask.
Removed changes related to position ids from training script.
Removed changes related to no pinned memory and skip last batch.

Training Script Modifications

The following changes have been added to training (run_pretraining.py and run_squad.py) and modeling (modeling.py) scripts.

Added support for Gaudi devices:

a. Load Intel Gaudi specific library.

b. Support required for CPU to work.

c. Required environment variables are defined for Gaudi.

d. Added BF16 mixed precision support.

e. Added a Python version of the LAMB optimizer (from lamb.py), which is used as the default.

f. Support for distributed training on Gaudi.

g. Added changes to support Lazy mode with required mark_step().

h. Added changes to calculate the performance per step and report through dllogger.

i. Using conventional torch layernorm, linear and activation functions.

j. Changes for dynamic loading of HCCL library.

k. Added support for FusedAdamW and FusedClipNorm in run_squad.py.

l. Changed weight_decay in the optimizer_grouped_parameters config from 0.01 to 0.0.

To improve performance:

a. Added support for Fused LAMB optimizer in run_pretraining.py.

b. Bucket size set to 230MB for better performance in distributed training.

c. Added support to use distributed all_reduce instead of default Distributed Data Parallel in pre-training.

d. Added support for lowering print frequency of loss and associated this with log_freq.

e. Added support for Fused ADAMW optimizer and FusedClipNorm in run_squad.py.

Known Issues

Placing mark_step() arbitrarily may lead to undefined behavior. It is recommended to keep mark_step() as shown in the provided scripts.
The BERT 1.2B parameter model is intended only to showcase the PyTorch ZeroRedundancyOptimizer feature, not model convergence.
Only scripts and configurations mentioned in this README are supported and verified.
