[Doc] highlight some features as experimental (#2152)
* generic python
* Update feature list in release note
* fine tune, add experimental to horovod, simple trace and profiler_legacy
* Update CPU part in release note
* add cpu to OS matrix
* DDP doc: Add torch-ccl source build command for cpu (#2159)
---------
Co-authored-by: Ye Ting <[email protected]>
Co-authored-by: zhuhong61 <[email protected]>
docs/tutorials/features.rst (+6 -6)

@@ -66,9 +66,9 @@ On Intel® GPUs, quantization usages follow PyTorch default quantization APIs. C
 Distributed Training
 --------------------

-To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support.
+To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support (Experimental).

-For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod <features/horovod.md>`_.
+For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod (Experimental) <features/horovod.md>`_.

 .. toctree::
    :hidden:
@@ -122,8 +122,8 @@ For more detailed information, check `Advanced Configuration <features/advanced_
 features/advanced_configuration

-Legacy Profiler Tool
---------------------
+Legacy Profiler Tool (Experimental)
+-----------------------------------

 The legacy profiler tool is an extension of PyTorch* legacy profiler for profiling operators' overhead on XPU devices. With this tool, users can get information on many fields of the running models or code scripts. Users should build Intel® Extension for PyTorch* with profiler support (the default) and enable this tool by a `with` statement before the code segment.
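For context on the `with` statement usage that the paragraph above describes, here is a minimal sketch (not part of the diff). It assumes the `use_xpu` flag that the extension's profiler documentation describes for PyTorch's legacy profiler, and uses a placeholder model; see `features/profiler_legacy` for the authoritative usage.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the xpu device

# Placeholder workload; substitute your own model and inputs.
model = torch.nn.Linear(64, 64).to("xpu")
data = torch.randn(8, 64, device="xpu")

# Wrap the code segment to profile; `use_xpu=True` is the flag the
# extension's docs describe for profiling operators on XPU devices.
with torch.autograd.profiler_legacy.profile(use_xpu=True) as prof:
    model(data)
    torch.xpu.synchronize()

# Aggregate and print per-operator overhead.
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
```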
@@ -135,8 +135,8 @@ For more detailed information, check `Legacy Profiler Tool <features/profiler_le
 features/profiler_legacy

-Simple Trace Tool
------------------
+Simple Trace Tool (Experimental)
+--------------------------------

 Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. Once enabled, it can automatically print out verbose messages of called operators in a stack format with indenting to distinguish the context.
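As a sketch only: this diff does not show the tool's entry points, so the `torch.xpu.enable_simple_trace`/`torch.xpu.disable_simple_trace` calls below are assumed names for illustration; check the simple trace feature doc for the actual API.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

x = torch.randn(4, 4, device="xpu")

# Assumed entry points, illustrative only; the feature doc has the
# authoritative names for toggling simple trace.
torch.xpu.enable_simple_trace()
y = (x + x).relu()  # called operators print in an indented stack format
torch.xpu.disable_simple_trace()
```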
docs/tutorials/features/DDP.md (+29 -11)

@@ -1,14 +1,15 @@
-# DistributedDataParallel (DDP)
+DistributedDataParallel (DDP)
+=============================

 ## Introduction

 `DistributedDataParallel (DDP)` is a PyTorch\* module that implements multi-process data parallelism across multiple GPUs and machines. With DDP, the model is replicated on every process, and each model replica is fed a different set of input data samples. DDP enables overlapping between gradient communication and gradient computations to speed up training. Please refer to [DDP Tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for an introduction to DDP.

-The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on XPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on GPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.

 ## Installation of Intel® oneCCL Bindings for Pytorch\*

-To use PyTorch DDP on XPU, install Intel® oneCCL Bindings for Pytorch\* as described below.
+To use PyTorch DDP on GPU, install Intel® oneCCL Bindings for Pytorch\* as described below.

 ### Install PyTorch and Intel® Extension for PyTorch\*
@@ -19,6 +20,18 @@ For more detailed information, check [installation guide](../installation.md).
-**Note:** Make sure you have installed basekit from https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit
+**Note:** Make sure you have installed [basekit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit) when using Intel® oneCCL Bindings for Pytorch\* on Intel® GPUs.

 ```bash
 source $basekit_root/ccl/latest/env/vars.sh
@@ -165,7 +183,7 @@ For using one GPU card with multiple tiles, each tile could be regarded as a dev

 ### Usage of DDP scaling API

-Note: This API supports XPU devices on one card.
+Note: This API supports GPU devices on one card.

 ```python
 Args:

@@ -221,5 +239,5 @@ print("DDP Use XPU: {} for training".format(xpu))
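To make the documented flow concrete, here is a minimal sketch (not part of the diff) of initializing DDP with the `ccl` backend on an `xpu` device. The environment-variable wiring assumes an `mpirun`/`torchrun`-style launcher, and the rank-to-device mapping is a placeholder; the full example in this file is authoritative.

```python
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  # enables the xpu device
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# Rendezvous settings; a real launcher (mpirun, torchrun) would set these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

device = "xpu:{}".format(rank)  # placeholder rank-to-device mapping
model = torch.nn.Linear(128, 128).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])

# One training step as a smoke test.
model(torch.randn(4, 128, device=device)).sum().backward()
```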
docs/tutorials/features/advanced_configuration.md (+2 -1)

@@ -1,4 +1,5 @@
-# Advanced Configuration
+Advanced Configuration
+======================

 The default settings for Intel® Extension for PyTorch\* are sufficient for most use cases. However, if users want to customize Intel® Extension for PyTorch\*, advanced configuration is available at build time and runtime.
docs/tutorials/features/horovod.md (+2 -1)

@@ -1,4 +1,5 @@
-# Horovod with PyTorch
+Horovod with PyTorch (Experimental)
+===================================

 Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, you need to install Horovod with PyTorch first, and make specific changes for Horovod in your training script.
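The "specific changes" mentioned above follow the standard Horovod-for-PyTorch wiring (`hvd.init`, state broadcast, `DistributedOptimizer`); placing tensors on `xpu` per local rank is the assumption this sketch adds on top of that.

```python
import torch
import horovod.torch as hvd
import intel_extension_for_pytorch  # noqa: F401

hvd.init()
# Pin each process to one device; using xpu here is this sketch's assumption.
device = "xpu:{}".format(hvd.local_rank())

model = torch.nn.Linear(32, 32).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Standard Horovod wiring: sync initial state, then wrap the optimizer so
# gradients are averaged across workers with allreduce.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

loss = model(torch.randn(8, 32, device=device)).sum()
loss.backward()
optimizer.step()
```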
docs/tutorials/installation.md (+6 -3)

@@ -16,10 +16,11 @@ Verified Hardware Platforms:

 |Hardware|OS|Driver|
 |-|-|-|
-|Intel® Data Center GPU Flex Series|Ubuntu 22.04, Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
-|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Flex Series|Ubuntu 22.04 (Validated), Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4 (Validated)|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
 |Intel® Arc™ A-Series Graphics|Windows 11 or Windows 10 21H2 (via WSL2)|[for Windows 11 or Windows 10 21H2](https://www.intel.com/content/www/us/en/download/726609/intel-arc-graphics-windows-dch-driver.html)|
+|CPU (3<sup>rd</sup> and 4<sup>th</sup> Gen of Intel® Xeon® Scalable Processors)|Linux\* distributions with glibc>=2.17. Validated on Ubuntu 18.04.|N/A|

 Prebuilt wheel files for generic Python\* and Intel® Distribution for Python\* are released in separate repositories. Replace the placeholder `<REPO_URL>` in installation commands with a real URL below.
docs/tutorials/releases.md (+9 -11)

@@ -3,27 +3,25 @@ Releases

 ## 1.13.10+xpu

-Intel® Extension for PyTorch\* v1.13.10+xpu extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through the PyTorch\* `xpu` device, Intel® Extension for PyTorch\* provides easy GPU acceleration for Intel discrete GPUs with PyTorch\*.
+Intel® Extension for PyTorch\* v1.13.10+xpu is the first Intel® Extension for PyTorch\* release that supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch\* 1.13. It extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through the PyTorch\* `xpu` device, Intel® Extension for PyTorch\* provides easy GPU acceleration for Intel discrete GPUs with PyTorch\*.

 ### Highlights

-This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through the PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
+This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through the PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.

 This release provides the following features:
-- Usability and Performance Features listed in [Intel® Extension for PyTorch\* v1.13.0+cpu release](https://intel.github.io/intel-extension-for-pytorch/cpu/1.13.0+cpu/tutorials/releases.html#id1)
-- Distributed Training
+- Distributed Training on GPU:
   - support of distributed training with DistributedDataParallel (DDP) on Intel GPU hardware
-  - support of distributed training with Horovod (experimental) on Intel GPU hardware
-- DLPack Solution
-  - mechanism to share tensor data without copy when interoparate with other libraries on Intel GPU hardware
-- Legacy Profiler Tool
-  - an extension of PyTorch* legacy profiler for profiling operators' overhead on Intel GPU hardware
-- Simple Trace Tool
-  - built-in debugging tool to print out the call stack for a piece of code
+  - support of distributed training with Horovod (experimental feature) on Intel GPU hardware
+- Automatic channels last format conversion on GPU:
+  - Automatic channels last format conversion is enabled. Models using the `torch.xpu.optimize` API running on Intel® Data Center GPU Max Series will be converted to channels last memory format, while models running on Intel® Data Center GPU Flex Series will choose oneDNN block format.
+- CPU support is merged in this release:
+  - CPU features and optimizations are equivalent to what has been released in the Intel® Extension for PyTorch\* v1.13.0+cpu release that was made publicly available in Nov 2022. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend the Intel® Extension for PyTorch\* v1.13.0+cpu release for smaller footprint, fewer dependencies and broader OS support.

 This release adds the following fusion patterns in PyTorch\* JIT mode for Intel GPU:
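Since the highlights reference the `torch.xpu.optimize` API, a minimal sketch of the inference-side call is shown below (not part of the diff); the `dtype` argument and eval-mode usage are illustrative assumptions, not a prescribed recipe from this release note.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # provides torch.xpu.optimize

model = torch.nn.Conv2d(3, 16, kernel_size=3).to("xpu").eval()

# On Max Series this is where the automatic channels last conversion
# described above would apply; Flex Series chooses oneDNN block format.
model = torch.xpu.optimize(model, dtype=torch.float32)

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224, device="xpu"))
```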