
Commit 262ba97

add training workloads back (#2567)
* add training workloads back
* Revert "clean underperforming models (#2563)". This reverts commit aee3d8f.
1 parent e3a751c commit 262ba97


41 files changed: +4329 −37 lines

README.md (+2)

@@ -70,12 +70,14 @@ For best performance on Intel® Data Center GPU Flex and Max Series, please chec
  | [BERT large](https://arxiv.org/pdf/1810.04805.pdf) [Sapphire Rapids](https://www.intel.com/content/www/us/en/newsroom/opinion/updates-next-gen-data-center-platform-sapphire-rapids.html#gs.blowcx) | Tensorflow | Training | [FP32 BFloat16 BFloat32](/quickstart/language_modeling/tensorflow/bert_large/training/cpu/README.md) | [SQuAD](https://github.com/IntelAI/models/tree/master/datasets/bert_data/README.md#inference) |
  | [BERT large (Hugging Face)](https://arxiv.org/pdf/1810.04805.pdf) | TensorFlow | Inference | [FP32 FP16 BFloat16 BFloat32](/benchmarks/language_modeling/tensorflow/bert_large_hf/inference/README.md) | [SQuAD](https://github.com/IntelAI/models/tree/master/datasets/bert_data/README.md#inference) |
  | [BERT large](https://arxiv.org/pdf/1810.04805.pdf) | PyTorch | Inference | [FP32 Int8 BFloat16 BFloat32](/models_v2/pytorch/bert_large/inference/cpu/README.md) | BERT Large SQuAD1.1 |
+ | [BERT large](https://arxiv.org/pdf/1810.04805.pdf) | PyTorch | Training | [FP32 BFloat16 BFloat32](/models_v2/pytorch/bert_large/training/cpu/README.md) | [preprocessed text dataset](https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v) |
  | [DistilBERT base](https://arxiv.org/abs/1910.01108) | PyTorch | Inference | [FP32 BF32 BF16Int8-FP32 Int8-BFloat16 BFloat32](/models_v2/pytorch/distilbert/inference/cpu/README.md) | [ DistilBERT Base SQuAD1.1](https://huggingface.co/distilbert-base-uncased-distilled-squad) |
  | [RNN-T](https://arxiv.org/abs/2007.15188) | PyTorch | Inference | [FP32 BFloat16 BFloat32](/models_v2/pytorch/rnnt/inference/cpu/README.md) | [RNN-T dataset](/models_v2/pytorch/rnnt/inference/cpu/download_dataset.sh) |
  | [RNN-T](https://arxiv.org/abs/2007.15188) | PyTorch | Training | [FP32 BFloat16 BFloat32](/models_v2/pytorch/rnnt/training/cpu/README.md) | [RNN-T dataset](/models_v2/pytorch/rnnt/training/cpu/download_dataset.sh) |
  | [GPTJ 6B](https://huggingface.co/EleutherAI/gpt-j-6b) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/gptj/inference/cpu/README.md) | |
  | [GPTJ 6B MLPerf](https://github.com/mlcommons/inference/tree/master/language/gpt-j#datasets--models) | PyTorch | Inference | [INT4](/models_v2/pytorch/gpt-j_mlperf/inference/cpu/README.md) | [CNN-Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail)|
  | [LLAMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/llama/inference/cpu/README.md) | |
+ | [LLAMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) | PyTorch | Training | [FP32 FP16 BFloat16 BF32](/models_v2/pytorch/llama/training/cpu/README.md) | |
  | [LLAMA2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/llama/inference/cpu/README.md) | |
  | [ChatGLMv3 6B](https://huggingface.co/THUDM/chatglm3-6b) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/chatglm/inference/cpu/README.md) | |

docker/pytorch/docker-compose.yml (+18 −18)

@@ -32,15 +32,15 @@ services:
       dockerfile: docker/pytorch/bert_large/inference/cpu/pytorch-bert-large-inference.Dockerfile-${BASE_IMAGE_NAME:-ubuntu}
     command: >
       bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
-  # bert_large-training-cpu:
-  #   image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-language-modeling-bert-large-training
-  #   pull_policy: always
-  #   build:
-  #     context: ../../
-  #     dockerfile: docker/pytorch/bert_large/training/cpu/pytorch-bert-large-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu}
-  #   extends: bert_large-inference-cpu
-  #   command: >
-  #     bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
+  bert_large-training-cpu:
+    image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-language-modeling-bert-large-training
+    pull_policy: always
+    build:
+      context: ../../
+      dockerfile: docker/pytorch/bert_large/training/cpu/pytorch-bert-large-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu}
+    extends: bert_large-inference-cpu
+    command: >
+      bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
   maskrcnn-inference-cpu:
     image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-object-detection-maskrcnn-inference
     pull_policy: always

@@ -185,15 +185,15 @@ services:
     extends: bert_large-inference-cpu
     command: >
       bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
-  # llama-training-cpu:
-  #   image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-generative-ai-llama-training
-  #   pull_policy: always
-  #   build:
-  #     context: ../../
-  #     dockerfile: docker/pytorch/llama/training/cpu/pytorch-llama-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu}
-  #   extends: bert_large-inference-cpu
-  #   command: >
-  #     bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
+  llama-training-cpu:
+    image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-generative-ai-llama-training
+    pull_policy: always
+    build:
+      context: ../../
+      dockerfile: docker/pytorch/llama/training/cpu/pytorch-llama-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu}
+    extends: bert_large-inference-cpu
+    command: >
+      bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'"
   vit-inference-cpu:
     image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-image-recognition-vit-inference
     pull_policy: always
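With these services uncommented again, they can be exercised like any other entry in the compose file. A minimal sketch, assuming you run from the repository root and that `REGISTRY` is set (the value below is a placeholder, not part of the commit):

```bash
# Placeholder registry; CI supplies its own. GITHUB_RUN_NUMBER, BASE_IMAGE_NAME
# and BASE_IMAGE_TAG fall back to the defaults declared in the compose file.
export REGISTRY=localhost:5000

# Build the restored BERT Large training image and run its import smoke test.
# Note that pull_policy: always makes `run` try to pull the tag first.
docker compose -f docker/pytorch/docker-compose.yml build bert_large-training-cpu
docker compose -f docker/pytorch/docker-compose.yml run --rm bert_large-training-cpu
```

The same two commands work for the restored `llama-training-cpu` service.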

docs/general/CPU_DEVCATALOG.md (+2)

@@ -13,6 +13,7 @@ The tables below link to documentation on how to run each use case using docker
  | --------| ------------------------------------------------------ | ---------- | ------| --------------------- |
  | PyTorch | [GPT-J](../../models_v2/pytorch/gptj/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA |
  | PyTorch | [Llama 2](../../models_v2/pytorch/llama/inference/cpu/CONTAINER.md) 7B,13B | FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA |
+ | PyTorch | [Llama 2](../../models_v2/pytorch/llama/training/cpu/CONTAINER.md) 7B | FP32,BF32,BF16,FP16 | Training | LAMBADA |
  | PyTorch | [ChatGLM](../../models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA |
  | PyTorch | [LCM](../../models_v2/pytorch/LCM/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32,INT8-BF16 | Inference | COCO 2017 |
  | PyTorch | [Stable Diffusion](../../models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32,INT8-BF16 | Inference | COCO 2017 |

@@ -39,6 +40,7 @@ The tables below link to documentation on how to run each use case using docker
  | Framework | Model | Precisions | Mode | Dataset |
  | --------| ------------------------------------------------------ | ---------- | ------| --------------------- |
+ | PyTorch | [BERT large](../../models_v2/pytorch/bert_large/training/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16 | Training | Preprocessed Text dataset |
  | PyTorch |[BERT large](../../models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,INT8 | Inference | SQuAD1.0 |
  | PyTorch | [RNN-T](../../models_v2/pytorch/rnnt/training/cpu/CONTAINER.md) | FP32,BF32,BF16,INT8 | Inference | LibriSpeech |
  | PyTorch |[RNN-T](../../models_v2/pytorch/rnnt/inference/cpu/CONTAINER.md) | FP32,BF32,FP16 | Training | LibriSpeech |

models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md (+1 −1)

@@ -45,7 +45,7 @@ To run the BERT Large inference scripts, set environment variables to specify th
  ```bash
  export EVAL_DATA_FILE=<path to the eval data>
  export OUTPUT_DIR=<directory where log files will be written>
- export PRECISION=<provide bf16, fp32, fp16, int8, avx-int8, avx-fp32 for throughput and bf16, bf32, fp32, fp16, int8, avx-fp32, avx-int8, fp8 for accuracy and realtime>
+ export PRECISION=<specify the precision>
  export FINETUNED_MODELL=<path to pre-trained model>
  export TEST_MODE=<provide either REALTIME,THROUGHPUT OR ACCURACY mode>
  export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 (for FP16 precision)
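With the precision list moved out of the snippet, a concrete FP16 setup would combine the now-generic `PRECISION` variable with the ISA override on the snippet's last line. A small sketch using only values shown in this hunk:

```bash
export PRECISION=fp16
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16  # only needed for FP16 precision
```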

models_v2/pytorch/bert_large/inference/cpu/README.md (+1 −1)

@@ -95,7 +95,7 @@ export FINETUNED_MODEL=$(pwd)/bert_squad_model
  | **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT (THROUGHPUT, ACCURACY, REALTIME)` |
  | **EVAL_DATA_FILE** | `export EVAL_DATA_FILE=<path to dev-v1.1.json file>` |
  | **OUTPUT_DIR** | `export OUTPUT_DIR=<path to an output directory>` |
- | **PRECISION** | `export PRECISION=bf16` (bf16, fp32, fp16, int8, avx-int8, avx-fp32 for throughput and bf16, bf32, fp32, fp16, int8, avx-fp32, avx-int8, fp8 for accuracy and realtime) |
+ | **PRECISION** | `export PRECISION=bf16` (bf16, bf32, fp32, fp16, int8, avx-int8, avx-fp32 for throughput and bf16, bf32, fp32, fp16, int8, avx-fp32, avx-int8, fp8 for accuracy) |
  | **FINETUNED_MODEL** | `export FINETUNED_MODEL=<path to the fine tuned model>` |
  | **MODEL_DIR** | `export MODEL_DIR=$(pwd)` |
  | **BATCH_SIZE** (optional) | `export BATCH_SIZE=<set a value for batch size, else it will run with default batch size>` |
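The corrected row makes the per-mode precision split explicit. For example, using only values from the table:

```bash
# bf32 is now listed for THROUGHPUT as well:
export TEST_MODE=THROUGHPUT
export PRECISION=bf32

# fp8 remains valid only for ACCURACY:
export TEST_MODE=ACCURACY
export PRECISION=fp8
```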
models_v2/pytorch/bert_large/training/cpu/CONTAINER.md (new file, +96)

# PyTorch BERT Large training

## Description
This document has instructions for running BERT Large training using Intel® Extension for PyTorch.

## Pull Command

```bash
docker pull intel/language-modeling:pytorch-cpu-bert-large-training
```

> [!NOTE]
> The `avx-fp32` precision runs the same scripts as `fp32`, except that the `DNNL_MAX_CPU_ISA` environment variable is unset. The environment variable is otherwise set to `DNNL_MAX_CPU_ISA=AVX512_CORE_AMX`.
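As an illustration of that note, the assumed toggle between the two precisions looks like this (a sketch, not taken verbatim from the container scripts):

```bash
# fp32 (default): cap oneDNN at the AMX instruction set
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX

# avx-fp32: same scripts, with the ISA cap removed
unset DNNL_MAX_CPU_ISA
```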
## Datasets
Follow the instructions to [download and preprocess](./README.md#download-the-preprocessed-text-dataset) the text dataset, and set `DATASET_DIR` to point to the preprocessed dataset.

## BERT Config File
BERT training happens in two phases. Download the BERT config file from [here](https://drive.google.com/drive/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) and export the `BERT_MODEL_CONFIG` variable to point to this file path.

## Checkpoint Directory
The checkpoint directory is created as a result of Phase 1 training. Set `PRETRAINED_MODEL` to point to the pre-trained model path and volume-mount it for Phase 2 training.
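Putting the two phases together, a rough sketch of the flow (the paths are placeholders; where Phase 1 writes its checkpoints is determined by the training script, not by this document):

```bash
# Phase 1: pretrain from the downloaded BERT config file
export TRAINING_PHASE=1
export BERT_MODEL_CONFIG=/data/bert_config.json       # placeholder path
# ...run the container as shown under "Docker Run" below...

# Phase 2: continue from the checkpoint directory Phase 1 produced
export TRAINING_PHASE=2
export PRETRAINED_MODEL=/data/bert_phase1_checkpoint  # placeholder path
# ...run the container again, volume-mounting ${PRETRAINED_MODEL}...
```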
## Docker Run
(Optional) Export the relevant proxy settings into the Docker environment:

```bash
export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \
  -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \
  -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \
  -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \
  -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \
  -e SOCKS_PROXY=${SOCKS_PROXY}"
```

To run the BERT Large training scripts, set environment variables to specify the dataset directory, precision, and an output directory.

```bash
export DATASET_DIR=<path to the dataset>
export OUTPUT_DIR=<directory where log files will be written>
export PRECISION=<specify the precision to run>
export BERT_MODEL_CONFIG=<path to bert configuration file>
export PRETRAINED_MODEL=<path to checkpoint directory>
export TRAINING_PHASE=<set either 1 or 2>
export DNNL_MAX_CPU_ISA=<provide AVX512_CORE_AMX_FP16 for fp16 precision>
export TRAIN_SCRIPT=/workspace/pytorch-bert-large-training/run_pretrain_mlperf.py
export DDP=false
export TORCH_INDUCTOR=0

DOCKER_ARGS="--rm -it"
IMAGE_NAME=intel/language-modeling:pytorch-cpu-bert-large-training

docker run \
  --cap-add SYS_NICE \
  --shm-size 16G \
  --env PRECISION=${PRECISION} \
  --env OUTPUT_DIR=${OUTPUT_DIR} \
  --env TRAIN_SCRIPT=${TRAIN_SCRIPT} \
  --env DATASET_DIR=${DATASET_DIR} \
  --env TRAINING_PHASE=${TRAINING_PHASE} \
  --env DDP=${DDP} \
  --env TORCH_INDUCTOR=${TORCH_INDUCTOR} \
  --env BERT_MODEL_CONFIG=${BERT_MODEL_CONFIG} \
  --env PRETRAINED_MODEL=${PRETRAINED_MODEL} \
  --env DNNL_MAX_CPU_ISA=${DNNL_MAX_CPU_ISA} \
  --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
  --volume ${DATASET_DIR}:${DATASET_DIR} \
  --volume ${BERT_MODEL_CONFIG}:${BERT_MODEL_CONFIG} \
  --volume ${PRETRAINED_MODEL}:${PRETRAINED_MODEL} \
  ${DOCKER_RUN_ENVS} \
  ${DOCKER_ARGS} \
  $IMAGE_NAME \
  /bin/bash run_model.sh
```
> [!NOTE]
> The workload container was validated on a single node (`DDP=false`) with `TORCH_INDUCTOR=0`.
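After the container exits, the mounted output directory should contain the run's logs. A quick sanity check (exact log file names depend on `run_model.sh` and are not documented here):

```bash
ls -lt ${OUTPUT_DIR}
```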
## Documentation and Sources

#### Get Started
[Docker* Repository](https://hub.docker.com/r/intel/language-modeling)

[Main GitHub*](https://github.com/IntelAI/models)

[Release Notes](https://github.com/IntelAI/models/releases)

[Get Started Guide](https://github.com/IntelAI/models/blob/master/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md)

#### Code Sources
[Dockerfile](https://github.com/IntelAI/models/tree/master/docker/pytorch)

[Report Issue](https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks)

## License Agreement
LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the [license](https://github.com/IntelAI/models/tree/master/third_party) file for additional details.

[View All Containers and Solutions 🡢](https://www.intel.com/content/www/us/en/developer/tools/software-catalog/containers.html?s=Newest)
