
Commit 2798b43

jingxu10, tye1, and zhuhong61 authored
[Doc] highlight some features as experimental (#2152)
* generic python
* Update feature list in release note
* fine tune, add experimental to horovod, simple trace and profiler_legacy
* Update CPU part in release note
* add cpu to OS matrix
* DDP doc: Add torch-ccl source build command for cpu (#2159)

---------

Co-authored-by: Ye Ting <[email protected]>
Co-authored-by: zhuhong61 <[email protected]>
1 parent: c8c71d2 · commit: 2798b43

15 files changed (+70 / -45 lines)

.gitignore (+1)

@@ -126,6 +126,7 @@ build_ios
 .build_release/*
 distribute/*
 dist/
+docs/_build/
 *.testbin
 *.bin
 cmake_build

docs/tutorials/features.rst (+6 / -6)

@@ -66,9 +66,9 @@ On Intel® GPUs, quantization usages follow PyTorch default quantization APIs. C
 Distributed Training
 --------------------
 
-To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support.
+To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs and CPUs are supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`_, with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) <https://github.com/intel/torch-ccl>`_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) <https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html>`_ support (Experimental).
 
-For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod <features/horovod.md>`_.
+For more detailed information, check `DDP <features/DDP.md>`_ and `Horovod (Experimental) <features/horovod.md>`_.
 
 .. toctree::
    :hidden:

@@ -122,8 +122,8 @@ For more detailed information, check `Advanced Configuration <features/advanced_
    features/advanced_configuration
 
 
-Legacy Profiler Tool
---------------------
+Legacy Profiler Tool (Experimental)
+-----------------------------------
 
 The legacy profiler tool is an extension of PyTorch* legacy profiler for profiling operators' overhead on XPU devices. With this tool, users can get the information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch* with profiler support as default and enable this tool by a `with` statement before the code segment.
 

@@ -135,8 +135,8 @@ For more detailed information, check `Legacy Profiler Tool <features/profiler_le
 
    features/profiler_legacy
 
-Simple Trace Tool
------------------
+Simple Trace Tool (Experimental)
+--------------------------------
 
 Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. Once enabled, it can automatically print out verbose messages of called operators in a stack format with indenting to distinguish the context.

docs/tutorials/features/DDP.md (+29 / -11)

@@ -1,14 +1,15 @@
-# DistributedDataParallel (DDP)
+DistributedDataParallel (DDP)
+=============================
 
 ## Introduction
 
 `DistributedDataParallel (DDP)` is a PyTorch\* module that implements multi-process data parallelism across multiple GPUs and machines. With DDP, the model is replicated on every process, and each model replica is fed a different set of input data samples. DDP enables overlapping between gradient communication and gradient computations to speed up training. Please refer to [DDP Tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for an introduction to DDP.
 
-The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on XPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on GPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
 
 ## Installation of Intel® oneCCL Bindings for Pytorch\*
 
-To use PyTorch DDP on XPU, install Intel® oneCCL Bindings for Pytorch\* as described below.
+To use PyTorch DDP on GPU, install Intel® oneCCL Bindings for Pytorch\* as described below.
 
 ### Install PyTorch and Intel® Extension for PyTorch\*
 
@@ -19,6 +20,18 @@ For more detailed information, check [installation guide](../installation.md).
 
 #### Install from source:
 
+Installation for CPU:
+
+```bash
+git clone https://github.com/intel/torch-ccl.git -b v1.13.0
+cd torch-ccl
+git submodule sync
+git submodule update --init --recursive
+python setup.py install
+```
+
+Installation for GPU:
+
 ```bash
 git clone https://github.com/intel/torch-ccl.git -b v1.13.100+gpu
 cd torch-ccl
@@ -29,19 +42,24 @@ BUILD_NO_ONECCL_PACKAGE=ON COMPUTE_BACKEND=dpcpp python setup.py install
 
 #### Install from prebuilt wheel:
 
-Installation for CPU:
+Prebuilt wheel files for CPU, GPU with generic Python\* and GPU with Intel® Distribution for Python\* are released in separate repositories.
 
-```bash
-python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu
+```
+# Generic Python* for CPU
+REPO_URL: https://developer.intel.com/ipex-whl-stable-cpu
+# Generic Python* for GPU
+REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu
+# Intel® Distribution for Python*
+REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu-idp
 ```
 
-Installation for GPU:
+Installation from either repository shares the command below. Replace the place holder `<REPO_URL>` with a real URL mentioned above.
 
 ```bash
-python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-xpu
+python -m pip install oneccl_bind_pt -f <REPO_URL>
 ```
 
-**Note:** Make sure you have installed basekit from https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit
+**Note:** Make sure you have installed [basekit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit) when using Intel® oneCCL Bindings for Pytorch\* on Intel® GPUs.
 
 ```bash
 source $basekit_root/ccl/latest/env/vars.sh
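
For example, with the CPU repository selected, the placeholder command above resolves to:

```bash
python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu
```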
@@ -165,7 +183,7 @@ For using one GPU card with multiple tiles, each tile could be regarded as a dev
 
 ### Usage of DDP scaling API
 
-Note: This API supports XPU devices on one card.
+Note: This API supports GPU devices on one card.
 
 ```python
 Args:
@@ -221,5 +239,5 @@ print("DDP Use XPU: {} for training".format(xpu))
 train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
                                            num_workers=args.workers, pin_memory=True, sampler=train_sampler)
 ```
-Then you can start your model training on multiple XPU devices of one card.
+Then you can start your model training on multiple GPU devices of one card.
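
Pulling the patched doc's pieces together, a minimal runnable sketch of DDP over the `ccl` backend looks like the following; the rendezvous values and the toy model are illustrative stand-ins, not part of the doc:

```python
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device
import oneccl_bindings_for_pytorch  # noqa: F401  registers the 'ccl' backend

# Single-node rendezvous defaults; real launchers export these for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

dist.init_process_group("ccl", rank=rank, world_size=world_size)

device = f"xpu:{rank}"  # one device (tile) per process, as in the scaling section
model = torch.nn.Linear(8, 8).to(device)
model = torch.nn.parallel.DistributedDataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(2, 8, device=device)).sum()
loss.backward()    # gradients are allreduced across ranks via oneCCL
optimizer.step()
```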

docs/tutorials/features/DLPack.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# DLPack Solution
+DLPack Solution
+===============
 
 ## Introduction
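
The retitled doc covers zero-copy tensor exchange through DLPack. As a hedged illustration using the standard PyTorch entry points (not quoted from the patch):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device
from torch.utils.dlpack import from_dlpack, to_dlpack

t = torch.arange(4, device="xpu")
capsule = to_dlpack(t)      # export the tensor as a DLPack capsule, without copying
t2 = from_dlpack(capsule)   # re-import; t2 shares storage with t
```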

docs/tutorials/features/DPC++_Extension.md (+2 / -1)

@@ -1,4 +1,5 @@
-# DPC++ Extension
+DPC++ Extension
+===============
 
 ## Introduction
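
For orientation, building a DPC++ extension follows the familiar `setup.py` pattern; the import path below is an assumption recalled from the extension's tutorial, so verify it against features/DPC++_Extension.md:

```python
from setuptools import setup
# Assumed module path -- confirm in the DPC++ Extension doc itself.
from intel_extension_for_pytorch.xpu.cpp_extension import DPCPPExtension, DpcppBuildExtension

setup(
    name="my_dpcpp_ext",  # hypothetical package name
    ext_modules=[
        DPCPPExtension("my_dpcpp_ext", ["my_ext.cpp"])  # my_ext.cpp is a placeholder source file
    ],
    cmdclass={"build_ext": DpcppBuildExtension},
)
```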

docs/tutorials/features/advanced_configuration.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Advanced Configuration
+Advanced Configuration
+======================
 
 The default settings for Intel® Extension for PyTorch\* are sufficient for most use cases. However, if users want to customize Intel® Extension for PyTorch\*, advanced configuration is available at build time and runtime.

docs/tutorials/features/amp_cpu.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Auto Mixed Precision (AMP) on CPU
+Auto Mixed Precision (AMP) on CPU
+=================================
 
 ## Introduction
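
A minimal sketch of the AMP-on-CPU flow this doc describes, with a toy model standing in for a real one:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4).eval()               # toy stand-in model
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(torch.rand(1, 4))                    # bf16-eligible ops run in bfloat16
```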

docs/tutorials/features/amp_gpu.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Auto Mixed Precision (AMP) on GPU
+Auto Mixed Precision (AMP) on GPU
+=================================
 
 ## Introduction
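
And the GPU counterpart, sketched under the assumption that `torch.xpu.amp.autocast` is the extension's XPU autocast entry point:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4).eval().to("xpu")
model = ipex.optimize(model, dtype=torch.float16)

with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    y = model(torch.rand(1, 4, device="xpu"))
```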

docs/tutorials/features/horovod.md (+2 / -1)

@@ -1,4 +1,5 @@
-# Horovod with PyTorch
+Horovod with PyTorch (Experimental)
+===================================
 
 Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, you need to install Horovod with Pytorch first, and make specific change for Horovod in your training script.
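
The "specific change" boils down to the standard Horovod recipe; a hedged sketch follows, where pinning via `torch.xpu.set_device` mirrors the CUDA idiom and is an assumption here:

```python
import torch
import horovod.torch as hvd
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device

hvd.init()
torch.xpu.set_device(hvd.local_rank())   # pin each rank to one device (assumption)

model = torch.nn.Linear(8, 8).to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Route gradient averaging through Horovod's allreduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # identical initial weights on all ranks

loss = model(torch.randn(2, 8, device="xpu")).sum()
loss.backward()
optimizer.step()
```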

docs/tutorials/features/profiler_legacy.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# Legacy Profiler Tool
+Legacy Profiler Tool (Experimental)
+===================================
 
 ## Introduction
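
The `with` statement mentioned on the features page looks roughly like this; `use_xpu` and the `self_xpu_time_total` sort key are the extension's additions, so treat the exact spellings as assumptions to check against the doc:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

model = torch.nn.Linear(8, 8).to("xpu")
x = torch.randn(2, 8, device="xpu")

with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True) as prof:
    y = model(x)

print(prof.key_averages().table(sort_by="self_xpu_time_total"))
```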

docs/tutorials/features/simple_trace.md (+2 / -2)

@@ -1,5 +1,5 @@
-
-# Simple Trace Tool [EXPERIMENTAL]
+Simple Trace Tool (Experimental)
+================================
 
 ## Introduction
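
Purely for orientation, enabling the tool should reduce to a toggle around the code of interest; the function names below are hypothetical placeholders, and the patched doc carries the real entry points:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

# Hypothetical toggle names, for illustration only -- see features/simple_trace.md for the real API.
torch.xpu.enable_simple_trace()
y = torch.randn(4, device="xpu").relu()   # called operators print as an indented call stack
torch.xpu.disable_simple_trace()
```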

docs/tutorials/getting_started.md (+2 / -2)

@@ -38,10 +38,10 @@ model = Model()
 data = ...
 dtype=torch.float32 # torch.bfloat16, torch.float16 (float16 only works on GPU)
 
-##### Run on GPU #####
+##### Run on GPU ######
 model = model.to('xpu')
 data = data.to('xpu')
-######################
+#######################
 
 model = ipex.optimize(model, dtype=dtype)
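
Filled in with toy stand-ins for `Model()` and `data` (assumptions, not part of the patch), the snippet around this hunk runs end to end as:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(4, 4)   # stand-in for Model()
data = torch.rand(1, 4)         # stand-in input
dtype = torch.float32           # torch.bfloat16, torch.float16 (float16 only works on GPU)

##### Run on GPU ######
model = model.to('xpu')
data = data.to('xpu')
#######################

model = ipex.optimize(model, dtype=dtype)
with torch.no_grad():
    out = model(data)
```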

docs/tutorials/installation.md (+6 / -3)

@@ -16,10 +16,11 @@ Verified Hardware Platforms:
 
 |Hardware|OS|Driver|
 |-|-|-|
-|Intel® Data Center GPU Flex Series|Ubuntu 22.04, Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
-|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Flex Series|Ubuntu 22.04 (Validated), Red Hat 8.6|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
+|Intel® Data Center GPU Max Series|Red Hat 8.6, Sles 15sp3/sp4 (Validated)|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
 |Intel® Arc™ A-Series Graphics|Ubuntu 22.04|[Stable 540](https://dgpu-docs.intel.com/releases/stable_540_20221205.html)|
 |Intel® Arc™ A-Series Graphics|Windows 11 or Windows 10 21H2 (via WSL2)|[for Windows 11 or Windows 10 21H2](https://www.intel.com/content/www/us/en/download/726609/intel-arc-graphics-windows-dch-driver.html)|
+|CPU (3<sup>rd</sup> and 4<sup>th</sup> Gen of Intel® Xeon® Scalable Processors)|Linux\* distributions with glibc>=2.17. Validated on Ubuntu 18.04.|N/A|
 
 - Intel® oneAPI Base Toolkit 2023.0
 - Python 3.7-3.10

@@ -74,8 +75,10 @@ Prebuilt wheel files availability matrix for Python versions:
 
 ### Repositories for prebuilt wheel files
 
+Prebuilt wheel files for generic Python\* and Intel® Distribution for Python\* are released in separate repositories. Replace the place holder `<REPO_URL>` in installation commands with a real URL below.
+
 ```
-# Stock PyTorch
+# Generic Python
 REPO_URL: https://developer.intel.com/ipex-whl-stable-xpu
 
 # Intel® Distribution for Python*
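
As a usage note, an install against the generic-Python GPU repository takes the shape below; the exact package pins live earlier in installation.md, so the unpinned form here is illustrative:

```bash
python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-stable-xpu
```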

docs/tutorials/performance_tuning/known_issues.md (+1 / -1)

@@ -45,7 +45,7 @@ Known Issues
 
   then clean build PyTorch\*.
 
-- OSError: `libmkl_intel_lp64.so.1`: cannot open shared object file: No such file or directory
+- OSError: `libmkl_intel_lp64.so.2`: cannot open shared object file: No such file or directory
 
   Wrong MKL library is used when multiple MKL libraries exist in system. Preload oneMKL by:
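
The preload command itself sits just below this hunk in the doc; a typical shape, assuming oneMKL from the oneAPI Base Toolkit at `$MKLROOT` (adjust paths to your installation):

```bash
# Assumed oneMKL layout -- verify library names and MKLROOT before use.
export LD_PRELOAD=${MKLROOT}/lib/intel64/libmkl_intel_lp64.so.2:${MKLROOT}/lib/intel64/libmkl_core.so.2:${MKLROOT}/lib/intel64/libmkl_sequential.so.2
```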

docs/tutorials/releases.md (+9 / -11)

@@ -3,27 +3,25 @@ Releases
 
 ## 1.13.10+xpu
 
-Intel® Extension for PyTorch\* v1.13.10+xpu extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* `xpu` device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.
+Intel® Extension for PyTorch\* v1.13.10+xpu is the first Intel® Extension for PyTorch\* release supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch\* 1.13. It extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* `xpu` device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.
 
 ### Highlights
 
-This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
+This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
 
 This release provides the following features:
-- Usability and Performance Features listed in [Intel® Extension for PyTorch\* v1.13.0+cpu release](https://intel.github.io/intel-extension-for-pytorch/cpu/1.13.0+cpu/tutorials/releases.html#id1)
-- Distributed Training
+- Distributed Training on GPU:
   - support of distributed training with DistributedDataParallel (DDP) on Intel GPU hardware
-  - support of distributed training with Horovod (experimental) on Intel GPU hardware
-- DLPack Solution
-  - mechanism to share tensor data without copy when interoparate with other libraries on Intel GPU hardware
-- Legacy Profiler Tool
-  - an extension of PyTorch* legacy profiler for profiling operators' overhead on Intel GPU hardware
-- Simple Trace Tool
-  - built-in debugging tool to print out the call stack for a piece of code
+  - support of distributed training with Horovod (experimental feature) on Intel GPU hardware
+- Automatic channels last format conversion on GPU:
+  - Automatic channels last format conversion is enabled. Models using `torch.xpu.optimize` API running on Intel® Data Center GPU Max Series will be converted to channels last memory format, while models running on Intel® Data Center GPU Flex Series will choose oneDNN block format.
+- CPU support is merged in this release:
+  - CPU features and optimizations are equivalent to what has been released in Intel® Extension for PyTorch* v1.13.0+cpu release that was made publicly available in Nov 2022. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend them to use Intel® Extension for PyTorch* v1.13.0+cpu release for smaller footprint, less dependencies and broader OS support.
 
 This release adds the following fusion patterns in PyTorch\* JIT mode for Intel GPU:
 - `Conv2D` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `GeLU`, `Log_Sigmoid`, `Hardswish`, `Mish`, `HardSigmoid`, `Tanh`, `Pow`, `ELU`, `hardtanh`)
 - `Linear` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `Log_Sigmoid`, `Hardswish`, `HardSigmoid`, `Pow`, `ELU`, `SiLU`, `hardtanh`, `Leaky_relu`)
+
 ### Known Issues
 
 Please refer to [Known Issues webpage](./performance_tuning/known_issues.md).
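
As a hedged illustration of the JIT graph mode these fusion patterns target (module and shapes invented for the example):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

class ConvGelu(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x):
        # Conv2D + GeLU matches one of the listed fusion patterns
        return torch.nn.functional.gelu(self.conv(x))

model = ConvGelu().eval().to("xpu")
x = torch.rand(1, 3, 32, 32, device="xpu")

with torch.no_grad():
    traced = torch.jit.trace(model, x)   # graph mode lets the fusion passes apply
    traced = torch.jit.freeze(traced)
    y = traced(x)
```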
