# PyTorch BERT Large training

## Description
This document has instructions for running BERT-Large training using Intel Extension for PyTorch.

## Pull Command

```bash
docker pull intel/language-modeling:pytorch-cpu-bert-large-training
```

> [!NOTE]
> The `avx-fp32` precision runs the same scripts as `fp32`, except that the `DNNL_MAX_CPU_ISA` environment variable is unset. The environment variable is otherwise set to `DNNL_MAX_CPU_ISA=AVX512_CORE_AMX`.

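The precision-to-ISA mapping described in the note can be sketched as a small helper. This is illustrative only (the function name is not part of the container scripts); the `fp16` case follows the `AVX512_CORE_AMX_FP16` value used later in this document.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of the container): choose DNNL_MAX_CPU_ISA
# based on the requested precision, mirroring the note above.
set_isa_for_precision() {
  case "$1" in
    avx-fp32) unset DNNL_MAX_CPU_ISA ;;                  # leave ISA selection to runtime detection
    fp16)     export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 ;;
    *)        export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX ;; # fp32, bf16, etc.
  esac
}

set_isa_for_precision fp16
echo "$DNNL_MAX_CPU_ISA"
```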
## Datasets
Follow the instructions to [download and preprocess](./README.md#download-the-preprocessed-text-dataset) the text dataset, and set `DATASET_DIR` to point to the preprocessed dataset.

## BERT Config File
BERT-Large training happens in two phases. Download the BERT config file from [here](https://drive.google.com/drive/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) and export the `BERT_MODEL_CONFIG` variable to point to this file path.

## Checkpoint Directory
The checkpoint directory is created as a result of Phase 1 training. Set `PRETRAINED_MODEL` to point to the pretrained model path and volume mount it for Phase 2 training.

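Because Phase 2 cannot proceed without a Phase 1 checkpoint, a small pre-flight check can catch a bad `PRETRAINED_MODEL` early. The sketch below is illustrative, not part of the container; the temporary directory and file stand in for a real checkpoint directory.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight sketch (not part of the container): before Phase 2,
# confirm the Phase 1 checkpoint directory exists and is non-empty.
PRETRAINED_MODEL=$(mktemp -d)             # placeholder; point at your real checkpoint dir
touch "$PRETRAINED_MODEL/checkpoint.pt"   # stand-in for a Phase 1 checkpoint file
TRAINING_PHASE=2

if [ "$TRAINING_PHASE" = "2" ]; then
  if [ -d "$PRETRAINED_MODEL" ] && [ -n "$(ls -A "$PRETRAINED_MODEL")" ]; then
    echo "checkpoint directory OK"
  else
    echo "PRETRAINED_MODEL is missing or empty" >&2
    exit 1
  fi
fi
```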
## Docker Run
(Optional) Export the proxy settings into the Docker environment.
```bash
export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \
  -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \
  -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \
  -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \
  -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \
  -e SOCKS_PROXY=${SOCKS_PROXY}"
```

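To confirm the proxy flags expand as intended before launching the container, you can print the variable; the proxy address below is a placeholder, not a real endpoint.

```shell
# Illustrative check: print the -e flags that will be passed to `docker run`.
# proxy.example.com is a hypothetical placeholder value.
export https_proxy=http://proxy.example.com:3128
DOCKER_RUN_ENVS="-e https_proxy=${https_proxy} -e no_proxy=${no_proxy:-localhost}"
echo "$DOCKER_RUN_ENVS"
```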
To run the BERT-Large training scripts, set environment variables to specify the dataset directory, the precision, and an output directory.

```bash
export DATASET_DIR=<path to the dataset>
export OUTPUT_DIR=<directory where log files will be written>
export PRECISION=<specify the precision to run>
export BERT_MODEL_CONFIG=<path to the bert configuration file>
export PRETRAINED_MODEL=<path to the checkpoint directory>
export TRAINING_PHASE=<set either 1 or 2>
export DNNL_MAX_CPU_ISA=<provide AVX512_CORE_AMX_FP16 for fp16 precision>
export TRAIN_SCRIPT=/workspace/pytorch-bert-large-training/run_pretrain_mlperf.py
export DDP=false
export TORCH_INDUCTOR=0

DOCKER_ARGS="--rm -it"
IMAGE_NAME=intel/language-modeling:pytorch-cpu-bert-large-training

docker run \
  --cap-add SYS_NICE \
  --shm-size 16G \
  --env PRECISION=${PRECISION} \
  --env OUTPUT_DIR=${OUTPUT_DIR} \
  --env TRAIN_SCRIPT=${TRAIN_SCRIPT} \
  --env DATASET_DIR=${DATASET_DIR} \
  --env TRAINING_PHASE=${TRAINING_PHASE} \
  --env DDP=${DDP} \
  --env TORCH_INDUCTOR=${TORCH_INDUCTOR} \
  --env BERT_MODEL_CONFIG=${BERT_MODEL_CONFIG} \
  --env PRETRAINED_MODEL=${PRETRAINED_MODEL} \
  --env DNNL_MAX_CPU_ISA=${DNNL_MAX_CPU_ISA} \
  --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
  --volume ${DATASET_DIR}:${DATASET_DIR} \
  --volume ${BERT_MODEL_CONFIG}:${BERT_MODEL_CONFIG} \
  --volume ${PRETRAINED_MODEL}:${PRETRAINED_MODEL} \
  ${DOCKER_RUN_ENVS} \
  ${DOCKER_ARGS} \
  $IMAGE_NAME \
  /bin/bash run_model.sh
```

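Before invoking `docker run`, a quick check that every variable the container reads is actually set can save a failed launch. This sketch is not part of the image; the exported values are placeholders for demonstration.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check (not part of the image): fail fast if any
# variable the container expects is unset. Values below are placeholders.
export DATASET_DIR=/tmp/dataset OUTPUT_DIR=/tmp/output PRECISION=fp32
export TRAINING_PHASE=1
export TRAIN_SCRIPT=/workspace/pytorch-bert-large-training/run_pretrain_mlperf.py

missing=0
for v in DATASET_DIR OUTPUT_DIR PRECISION TRAINING_PHASE TRAIN_SCRIPT; do
  if [ -z "${!v:-}" ]; then      # bash indirect expansion: value of the variable named $v
    echo "missing: $v" >&2
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all required variables are set"
```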
> [!NOTE]
> The workload container was validated on a single node (`DDP=false`) with `TORCH_INDUCTOR=0`.

## Documentation and Sources
#### Get Started
[Docker* Repository](https://hub.docker.com/r/intel/language-modeling)

[Main GitHub*](https://github.com/IntelAI/models)

[Release Notes](https://github.com/IntelAI/models/releases)

[Get Started Guide](https://github.com/IntelAI/models/blob/master/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md)

#### Code Sources
[Dockerfile](https://github.com/IntelAI/models/tree/master/docker/pytorch)

[Report Issue](https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks)

## License Agreement
LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the [license](https://github.com/IntelAI/models/tree/master/third_party) file for additional details.

[View All Containers and Solutions 🡢](https://www.intel.com/content/www/us/en/developer/tools/software-catalog/containers.html?s=Newest)