
Commit 2798b43

jingxu10, tye1, and zhuhong61 authored
[Doc] highlight some features as experimental (#2152)
* generic python
* Update feature list in release note
* fine tune, add experimental to horovod, simple trace and profiler_legacy
* Update CPU part in release note
* add cpu to OS matrix
* DDP doc: Add torch-ccl source build command for cpu (#2159)

---------

Co-authored-by: Ye Ting <[email protected]>
Co-authored-by: zhuhong61 <[email protected]>
1 parent: c8c71d2 · commit: 2798b43

15 files changed (+70 / -45 lines)

.gitignore (+1)

@@ -126,6 +126,7 @@ build_ios
 .build_release/*
 distribute/*
 dist/
+docs/_build/
 *.testbin
 *.bin
 cmake_build

docs/tutorials/features.rst (+6 / -6)

@@ -66,9 +66,9 @@ On Intel® GPUs, quantization usages follow PyTorch default quantization APIs. C
 Distributed Training
 --------------------
 
-To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support.
+To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support (Experimental).
 
-For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod <features/horovod.md>`_.
+For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod (Experimental) <features/horovod.md>`_.
 
 .. toctree::
    :hidden:

@@ -122,8 +122,8 @@ For more detailed information, check `Advanced Configuration <features/advanced_
    features/advanced_configuration
 
 
-Legacy Profiler Tool
---------------------
+Legacy Profiler Tool (Experimental)
+-----------------------------------
 
 The legacy profiler tool is an extension of PyTorch* legacy profiler for profiling operators' overhead on XPU devices. With this tool, users can get the information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch* with profiler support as default and enable this tool by a `with` statement before the code segment.
 

@@ -135,8 +135,8 @@ For more detailed information, check `Legacy Profiler Tool <features/profiler_le
 
    features/profiler_legacy
 
-Simple Trace Tool
------------------
+Simple Trace Tool (Experimental)
+--------------------------------
 
 Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. Once enabled, it can automatically print out verbose messages of called operators in a stack format with indenting to distinguish the context.

docs/tutorials/features/DDP.md (+29 / -11)

@@ -1,14 +1,15 @@
-# DistributedDataParallel (DDP)
+DistributedDataParallel (DDP)
+=============================
 
 ## Introduction
 
 `DistributedDataParallel (DDP)` is a PyTorch\* module that implements multi-process data parallelism across multiple GPUs and machines. With DDP, the model is replicated on every process, and each model replica is fed a different set of input data samples. DDP enables overlapping between gradient communication and gradient computations to speed up training. Please refer to [DDP Tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for an introduction to DDP.
 
-The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on XPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on GPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
 
 ## Installation of Intel® oneCCL Bindings for Pytorch\*
 
-To use PyTorch DDP on XPU, install Intel® oneCCL Bindings for Pytorch\* as described below.
+To use PyTorch DDP on GPU, install Intel® oneCCL Bindings for Pytorch\* as described below.
 
 ### Install PyTorch and Intel® Extension for PyTorch\*
 
@@ -19,6 +20,18 @@ For more detailed information, check [installation guide](../installation.md).
 
 #### Install from source:
 
+Installation for CPU:
+
+```bash
+git clone https://github.com/intel/torch-ccl.git -b v1.13.0
+cd torch-ccl
+git submodule sync
+git submodule update --init --recursive
+python setup.py install
+```
+
+Installation for GPU:
+
 ```bash
 git clone https://github.com/intel/torch-ccl.git -b v1.13.100+gpu
 cd torch-ccl
@@ -29,19 +42,24 @@ BUILD_NO_ONECCL_PACKAGE=ON COMPUTE_BACKEND=dpcpp python setup.py install
 
 #### Install from prebuilt wheel:
 
-Installation for CPU:
+Prebuilt wheel files for CPU, GPU with generic Python\* and GPU with Intel® Distribution for Python\* are released in separate repositories.
 
-```bash
-python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu
+```
+# Generic Python* for CPU
+REPO_URL: https://developer.intel.com/ipex-whl-stable-cpu
+# Generic Python* for GPU
+REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu
+# Intel® Distribution for Python*
+REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu-idp
 ```
 
-Installation for GPU:
+Installation from either repository shares the command below. Replace the place holder `<REPO_URL>` with a real URL mentioned above.
 
 ```bash
-python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-xpu
+python -m pip install oneccl_bind_pt -f <REPO_URL>
 ```
 
-**Note:** Make sure you have installed basekit from https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit
+**Note:** Make sure you have installed [basekit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit) when using Intel® oneCCL Bindings for Pytorch\* on Intel® GPUs.
 
 ```bash
 source $basekit_root/ccl/latest/env/vars.sh
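
For example, with the CPU repository selected, the placeholder command above resolves to:

```bash
python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu
```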
@@ -165,7 +183,7 @@ For using one GPU card with multiple tiles, each tile could be regarded as a dev
 
 ### Usage of DDP scaling API
 
-Note: This API supports XPU devices on one card.
+Note: This API supports GPU devices on one card.
 
 ```python
 Args:
@@ -221,5 +239,5 @@ print("DDP Use XPU: {} for training".format(xpu))
 train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
                                            num_workers=args.workers, pin_memory=True, sampler=train_sampler)
 ```
-Then you can start your model training on multiple XPU devices of one card.
+Then you can start your model training on multiple GPU devices of one card.
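
Pulling the patched doc's pieces together, a minimal runnable sketch of DDP over the `ccl` backend looks like the following; the rendezvous values and the toy model are illustrative stand-ins, not part of the doc:

```python
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device
import oneccl_bindings_for_pytorch  # noqa: F401  registers the 'ccl' backend

# Single-node rendezvous defaults; real launchers export these for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

dist.init_process_group("ccl", rank=rank, world_size=world_size)

device = f"xpu:{rank}"  # one device (tile) per process, as in the scaling section
model = torch.nn.Linear(8, 8).to(device)
model = torch.nn.parallel.DistributedDataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(2, 8, device=device)).sum()
loss.backward()    # gradients are allreduced across ranks via oneCCL
optimizer.step()
```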

docs/tutorials/features/DLPack.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# DLPack Solution
+DLPack Solution
+===============
 
 ## Introduction
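
The retitled doc covers zero-copy tensor exchange through DLPack. As a hedged illustration using the standard PyTorch entry points (not quoted from the patch):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device
from torch.utils.dlpack import from_dlpack, to_dlpack

t = torch.arange(4, device="xpu")
capsule = to_dlpack(t)      # export the tensor as a DLPack capsule, without copying
t2 = from_dlpack(capsule)   # re-import; t2 shares storage with t
```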

docs/tutorials/features/DPC++_Extension.md (+2 / -1)

@@ -1,4 +1,5 @@
-# DPC++ Extension
+DPC++ Extension
+===============
 
 ## Introduction
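
For orientation, building a DPC++ extension follows the familiar `setup.py` pattern; the import path below is an assumption recalled from the extension's tutorial, so verify it against features/DPC++_Extension.md:

```python
from setuptools import setup
# Assumed module path -- confirm in the DPC++ Extension doc itself.
from intel_extension_for_pytorch.xpu.cpp_extension import DPCPPExtension, DpcppBuildExtension

setup(
    name="my_dpcpp_ext",  # hypothetical package name
    ext_modules=[
        DPCPPExtension("my_dpcpp_ext", ["my_ext.cpp"])  # my_ext.cpp is a placeholder source file
    ],
    cmdclass={"build_ext": DpcppBuildExtension},
)
```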

docs/tutorials/features/advanced_configuration.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Advanced Configuration
+Advanced Configuration
+======================
 
 The default settings for Intel® Extension for PyTorch\* are sufficient for most use cases. However, if users want to customize Intel® Extension for PyTorch\*, advanced configuration is available at build time and runtime.

docs/tutorials/features/amp_cpu.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Auto Mixed Precision (AMP) on CPU
+Auto Mixed Precision (AMP) on CPU
+=================================
 
 ## Introduction
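
A minimal sketch of the AMP-on-CPU flow this doc describes, with a toy model standing in for a real one:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4).eval()               # toy stand-in model
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(torch.rand(1, 4))                    # bf16-eligible ops run in bfloat16
```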

docs/tutorials/features/amp_gpu.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Auto Mixed Precision (AMP) on GPU
+Auto Mixed Precision (AMP) on GPU
+=================================
 
 ## Introduction
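
And the GPU counterpart, sketched under the assumption that `torch.xpu.amp.autocast` is the extension's XPU autocast entry point:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4).eval().to("xpu")
model = ipex.optimize(model, dtype=torch.float16)

with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    y = model(torch.rand(1, 4, device="xpu"))
```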

docs/tutorials/features/horovod.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Horovod with PyTorch
+Horovod with PyTorch (Experimental)
+===================================
 
 Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, you need to install Horovod with Pytorch first, and make specific change for Horovod in your training script.
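
The "specific change" boils down to the standard Horovod recipe; a hedged sketch follows, where pinning via `torch.xpu.set_device` mirrors the CUDA idiom and is an assumption here:

```python
import torch
import horovod.torch as hvd
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device

hvd.init()
torch.xpu.set_device(hvd.local_rank())   # pin each rank to one device (assumption)

model = torch.nn.Linear(8, 8).to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Route gradient averaging through Horovod's allreduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # identical initial weights on all ranks

loss = model(torch.randn(2, 8, device="xpu")).sum()
loss.backward()
optimizer.step()
```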

docs/tutorials/features/profiler_legacy.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# Legacy Profiler Tool
+Legacy Profiler Tool (Experimental)
+===================================
 
 ## Introduction
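
The `with` statement mentioned on the features page looks roughly like this; `use_xpu` and the `self_xpu_time_total` sort key are the extension's additions, so treat the exact spellings as assumptions to check against the doc:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

model = torch.nn.Linear(8, 8).to("xpu")
x = torch.randn(2, 8, device="xpu")

with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True) as prof:
    y = model(x)

print(prof.key_averages().table(sort_by="self_xpu_time_total"))
```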

docs/tutorials/features/simple_trace.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# Simple Trace Tool [EXPERIMENTAL]
+Simple Trace Tool (Experimental)
+================================
 
 ## Introduction
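
Purely for orientation, enabling the tool should reduce to a toggle around the code of interest; the function names below are hypothetical placeholders, and the patched doc carries the real entry points:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

# Hypothetical toggle names, for illustration only -- see features/simple_trace.md for the real API.
torch.xpu.enable_simple_trace()
y = torch.randn(4, device="xpu").relu()   # called operators print as an indented call stack
torch.xpu.disable_simple_trace()
```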

docs/tutorials/getting_started.md (+2 / -2)

@@ -38,10 +38,10 @@ model = Model()
 data = ...
 dtype=torch.float32 # torch.bfloat16, torch.float16 (float16 only works on GPU)
 
-##### Run on GPU #####
+##### Run on GPU ######
 model = model.to('xpu')
 data = data.to('xpu')
-######################
+#######################
 
 model = ipex.optimize(model, dtype=dtype)
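
Filled in with toy stand-ins for `Model()` and `data` (assumptions, not part of the patch), the snippet around this hunk runs end to end as:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4)   # stand-in for Model()
data = torch.rand(1, 4)         # stand-in input
dtype = torch.float32           # torch.bfloat16, torch.float16 (float16 only works on GPU)

##### Run on GPU ######
model = model.to('xpu')
data = data.to('xpu')
#######################

model = ipex.optimize(model, dtype=dtype)
with torch.no_grad():
    out = model(data)
```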

docs/tutorials/installation.md (+6 / -3)

@@ -16,10 +16,11 @@ Verified Hardware Platforms:
 
 |Hardware|OS|Driver|
 |-|-|-|
-|Intel® Data Center GPU Flex Series|Ubuntu 22.04, Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
-|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Flex Series|Ubuntu 22.04 (Validated), Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4 (Validated)|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
 |Intel® Arc™ A-Series Graphics|Ubuntu 22.04|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
 |Intel® Arc™ A-Series Graphics|Windows 11 or Windows 10 21H2 (via WSL2)|[for Windows 11 or Windows 10 21H2](https://www.intel.com/content/www/us/en/download/726609/intel-arc-graphics-windows-dch-driver.html)|
+|CPU (3<sup>rd</sup> and 4<sup>th</sup> Gen of Intel® Xeon® Scalable Processors)|Linux\* distributions with glibc>=2.17. Validated on Ubuntu 18.04.|N/A|
 
 - Intel® oneAPI Base Toolkit 2023.0
 - Python 3.7-3.10

@@ -74,8 +75,10 @@ Prebuilt wheel files availability matrix for Python versions:
 
 ### Repositories for prebuilt wheel files
 
+Prebuilt wheel files for generic Python\* and Intel® Distribution for Python\* are released in separate repositories. Replace the place holder `<REPO_URL>` in installation commands with a real URL below.
+
 ```
-# Stock PyTorch
+# Generic Python
 REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu
 
 # Intel® Distribution for Python*
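
As a usage note, an install against the generic-Python GPU repository takes the shape below; the exact package pins live earlier in installation.md, so the unpinned form here is illustrative:

```bash
python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-stable-xpu
```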

docs/tutorials/performance_tuning/known_issues.md (+1 / -1)

@@ -45,7 +45,7 @@ Known Issues
 
   then clean build PyTorch\*.
 
-- OSError: `libmkl_intel_lp64.so.1`: cannot open shared object file: No such file or directory
+- OSError: `libmkl_intel_lp64.so.2`: cannot open shared object file: No such file or directory
 
   Wrong MKL library is used when multiple MKL libraries exist in system. Preload oneMKL by:
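
The preload command itself sits just below this hunk in the doc; a typical shape, assuming oneMKL from the oneAPI Base Toolkit at `$MKLROOT` (adjust paths to your installation):

```bash
# Assumed oneMKL layout -- verify library names and MKLROOT before use.
export LD_PRELOAD=${MKLROOT}/lib/intel64/libmkl_intel_lp64.so.2:${MKLROOT}/lib/intel64/libmkl_core.so.2:${MKLROOT}/lib/intel64/libmkl_sequential.so.2
```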

docs/tutorials/releases.md (+9 / -11)

@@ -3,27 +3,25 @@ Releases
 
 ## 1.13.10+xpu
 
-Intel® Extension for PyTorch\* v1.13.10+xpu extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* `xpu` device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.
+Intel® Extension for PyTorch\* v1.13.10+xpu is the first Intel® Extension for PyTorch\* release supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch\* 1.13. It extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* `xpu` device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.
 
 ### Highlights
 
-This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
+This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
 
 This release provides the following features:
-- Usability and Performance Features listed in [Intel® Extension for PyTorch\* v1.13.0+cpu release](https://intel.github.io/intel-extension-for-pytorch/cpu/1.13.0+cpu/tutorials/releases.html#id1)
-- Distributed Training
+- Distributed Training on GPU:
   - support of distributed training with DistributedDataParallel (DDP) on Intel GPU hardware
-  - support of distributed training with Horovod (experimental) on Intel GPU hardware
-- DLPack Solution
-  - mechanism to share tensor data without copy when interoparate with other libraries on Intel GPU hardware
-- Legacy Profiler Tool
-  - an extension of PyTorch* legacy profiler for profiling operators' overhead on Intel GPU hardware
-- Simple Trace Tool
-  - built-in debugging tool to print out the call stack for a piece of code
+  - support of distributed training with Horovod (experimental feature) on Intel GPU hardware
+- Automatic channels last format conversion on GPU:
+  - Automatic channels last format conversion is enabled. Models using `torch.xpu.optimize` API running on Intel® Data Center GPU Max Series will be converted to channels last memory format, while models running on Intel® Data Center GPU Flex Series will choose oneDNN block format.
+- CPU support is merged in this release:
+  - CPU features and optimizations are equivalent to what has been released in Intel® Extension for PyTorch* v1.13.0+cpu release that was made publicly available in Nov 2022. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend them to use Intel® Extension for PyTorch* v1.13.0+cpu release for smaller footprint, less dependencies and broader OS support.
 
 This release adds the following fusion patterns in PyTorch\* JIT mode for Intel GPU:
 - `Conv2D` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `GeLU`, `Log_Sigmoid`, `Hardswish`, `Mish`, `HardSigmoid`, `Tanh`, `Pow`, `ELU`, `hardtanh`)
 - `Linear` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `Log_Sigmoid`, `Hardswish`, `HardSigmoid`, `Pow`, `ELU`, `SiLU`, `hardtanh`, `Leaky_relu`)
+
 ### Known Issues
 
 Please refer to [Known Issues webpage](./performance_tuning/known_issues.md).
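
As a hedged illustration of the JIT graph mode these fusion patterns target (module and shapes invented for the example):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

class ConvGelu(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x):
        # Conv2D + GeLU matches one of the listed fusion patterns
        return torch.nn.functional.gelu(self.conv(x))

model = ConvGelu().eval().to("xpu")
x = torch.rand(1, 3, 32, 32, device="xpu")

with torch.no_grad():
    traced = torch.jit.trace(model, x)   # graph mode lets the fusion passes apply
    traced = torch.jit.freeze(traced)
    y = traced(x)
```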
