update docs for 1.11 release (#618)

jingxu10 · web-flow · commit bd7349af9ad7 · 2022-03-16T15:01:03.000+09:00
diff --git a/README.md b/README.md
@@ -8,18 +8,18 @@ More detailed tutorials are available at [**Intel® Extension for PyTorch\* onli
 
 ## Installation
 
-Wheel files are avaiable for the following Python versions.
+You can use either of the following 2 commands to install Intel® Extension for PyTorch\*.
 
-| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 |
-| :--: | :--: | :--: | :--: | :--: |
-| 1.10.100 | ✔️ | ✔️ | ✔️ | ✔️ |
+```python
+python -m pip install intel_extension_for_pytorch
+```
 
 ```python
-python -m pip install torch==1.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
-python -m pip install intel_extension_for_pytorch==1.10.100 -f https://software.intel.com/ipex-whl-stable
-python -m pip install psutil
+python -m pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable
 ```
 
+**Note:** Intel® Extension for PyTorch\* requires a matching PyTorch version. For details, see the installation guide linked below.
+
 More installation methods can be found at [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/tutorials/installation.html)
 
 ## Getting Started
diff --git a/docs/tutorials/features.rst b/docs/tutorials/features.rst
@@ -44,6 +44,8 @@ Low precision data type BFloat16 has been natively supported on the 3rd Generati
 
 Check more detailed information for `Auto Mixed Precision (AMP) <features/amp.html>`_.
 
+BFloat16 computation can be conducted on platforms with the AVX512 instruction set. On platforms with the `AVX512 BFloat16 instruction <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-deep-learning-boost-new-instruction-bfloat16.html>`_, there will be a further performance boost.
+
 .. toctree::
    :hidden:
    :maxdepth: 1
diff --git a/docs/tutorials/installation.md b/docs/tutorials/installation.md
@@ -49,6 +49,8 @@ Prebuilt wheel files availability matrix for Python versions
 | 1.9.0 | ✔️ | ✔️ | ✔️ | ✔️ |  |
 | 1.8.0 |  | ✔️ |  |  |  |
 
+**Note:** Intel® Extension for PyTorch\* requires a matching PyTorch version. Please check the mapping table above.
+
 Starting from 1.11.0, you can use normal pip command to install the package.
 
 ```
@@ -63,7 +65,7 @@ python -m pip install intel_extension_for_pytorch -f https://software.intel.com/
 
 **Note:** For version prior to 1.10.0, please use package name `torch_ipex`, rather than `intel_extension_for_pytorch`.
 
-**Note:** To install a package with a specific version, please use the standard way of pip.
+**Note:** To install a package with a specific version, please run the following command.
 
 ```
 python -m pip install <package_name>==<version_name> -f https://software.intel.com/ipex-whl-stable
diff --git a/docs/tutorials/performance_tuning/known_issues.md b/docs/tutorials/performance_tuning/known_issues.md
@@ -1,31 +1,15 @@
 Known Issues
 ============
 
-- BFloat16 is currently only supported natively on platforms with the following instruction set. The support will be expanded gradually to more platforms in furture releases.
+- BF16 AMP (auto mixed precision) runs abnormally with the extension on AVX2-only machines if the topology contains `Conv`, `Matmul`, `Linear`, and `BatchNormalization`
 
-  | Instruction Set | Description |
-  | --- | --- |
-  | AVX512\_CORE | Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions |
-  | AVX512\_CORE\_VNNI | Intel AVX-512 with Intel DL Boost |
-  | AVX512\_CORE\_BF16 | Intel AVX-512 with Intel DL Boost and bfloat16 support |
-  | AVX512\_CORE\_AMX | Intel AVX-512 with Intel DL Boost and bfloat16 support and Intel Advanced Matrix Extensions (Intel AMX) with 8-bit integer and bfloat16 support |
+- The runtime extension does not support the case where the batch size (BS) is not divisible by the number of streams
 
-- INT8 performance of EfficientNet and DenseNet with IntelÂ® Extension for PyTorch\* is slower than that of FP32
+- Incorrect Conv and Linear results if the number of OMP threads is changed at runtime
 
-- `omp_set_num_threads` function failed to change OpenMP threads number of oneDNN operators if it was set before.
+  The oneDNN memory layout depends on the number of OMP threads, so the caller must detect changes to the number of OMP threads. This release does not implement that detection yet.
 
-  `omp_set_num_threads` function is provided in Intel® Extension for PyTorch\* to change number of threads used with openmp. However, it failed to change number of OpenMP threads if it was set before.
-
-  pseudo code:
-
-  ```
-  omp_set_num_threads(6)
-  model_execution()
-  omp_set_num_threads(4)
-  same_model_execution_again()
-  ```
-
-  **Reason:** oneDNN primitive descriptor stores the omp number of threads. Current oneDNN integration caches the primitive descriptor in IPEX. So if we use runtime extension with oneDNN based pytorch/ipex operation, the runtime extension fails to change the used omp number of threads.
+- INT8 performance of EfficientNet and DenseNet with Intel® Extension for PyTorch\* is slower than that of FP32
 
 - Low performance with INT8 support for dynamic shapes
 
diff --git a/docs/tutorials/releases.md b/docs/tutorials/releases.md
@@ -1,13 +1,95 @@
 Releases
 =============
 
+## 1.11.0
+
+We are excited to announce the Intel® Extension for PyTorch\* 1.11.0-cpu release, which closely follows the PyTorch 1.11 release. In extension 1.11, we continued to improve the out-of-box (OOB) user experience and performance. Highlights include:
+
+* Support a single binary with runtime dynamic dispatch based on AVX2/AVX512 hardware ISA detection
+* Support installing the binary with `pip` using the package name only (without specifying the URL)
+* Provide a C++ SDK installation file to ease C++ app development and deployment
+* Add more optimizations, including graph fusions that speed up Transformer-based models, CNNs, and more
+* Reduce the binary size for both the PIP wheel and C++ SDK (2X to 5X reduction from the previous version)
+
+### Highlights
+- Combine the AVX2 and AVX512 binaries into a single binary that automatically dispatches to different implementations based on hardware ISA detection at runtime. The typical use case is a data center that mixes AVX2-only and AVX512 platforms. Unlike the previous version, there is no need to deploy a separate binary for each ISA.
+
+    ***NOTE***: The extension uses the oneDNN library as the backend. However, the BF16 and INT8 operator sets and features differ between AVX2 and AVX512. Please refer to the [oneDNN documentation](https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html#processors-with-the-intel-avx2-or-intel-avx-512-support) for more details.
+
+    > When one input is of type u8, and the other one is of type s8, oneDNN assumes that it is the user’s responsibility to choose the quantization parameters so that no overflow/saturation occurs. For instance, a user can use u7 [0, 127] instead of u8 for the unsigned input, or s7 [-64, 63] instead of the s8 one. It is worth mentioning that this is required only when the Intel AVX2 or Intel AVX512 Instruction Set is used.
+
+- The extension wheel packages have been uploaded to [pypi.org](https://pypi.org/project/intel-extension-for-pytorch/). Users can install the extension directly with `pip`/`pip3` without explicitly specifying the binary location URL.
+
+<table align="center">
+<tbody>
+<tr>
+<td>v1.10.100-cpu</td>
+<td>v1.11.0-cpu</td>
+</tr>
+<tr>
+<td>
+
+```python
+python -m pip install intel_extension_for_pytorch==1.10.100 -f https://software.intel.com/ipex-whl-stable
+```
+</td>
+<td>
+
+```python
+pip install intel_extension_for_pytorch
+```
+</td>
+</tr>
+</tbody>
+</table>
+
+- Unlike the previous version, this release provides a dedicated installation file for the C++ SDK. The installation file automatically detects the PyTorch C++ SDK location and installs the extension C++ SDK files into the PyTorch C++ SDK. Users no longer need to manually add the extension C++ SDK source files and CMake to the PyTorch SDK. In addition, the installation file reduces the C++ SDK binary size from ~220MB to ~13.5MB.
+
+<table align="center">
+<tbody>
+<tr>
+<td>v1.10.100-cpu</td>
+<td>v1.11.0-cpu</td>
+</tr>
+<tr>
+<td>
+
+```python
+intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0+cpu.zip (220M)
+intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0+cpu.zip (224M)
+```
+</td>
+<td>
+
+```python
+libintel-ext-pt-1.11.0+cpu.run (13.7M)
+libintel-ext-pt-cxx11-abi-1.11.0+cpu.run (13.5M)
+```
+</td>
+</tr>
+</tbody>
+</table>
+
+- Add more optimizations, including more custom operators and fusions.
+    - Fuse the QKV linear operators into a single Linear operator to accelerate the encoder part of Transformer\* (BERT-\*) models. [#278](https://github.com/intel/intel-extension-for-pytorch/commit/0f27c269cae0f902973412dc39c9a7aae940e07b)
+    - Remove Multi-Head-Attention fusion limitations to support tensor shapes that are not 64-byte aligned. [#531](https://github.com/intel/intel-extension-for-pytorch/commit/dbb10fedb00c6ead0f5b48252146ae9d005a0fad)
+    - Fold binary operators into the Convolution and Linear operators to reduce computation. [#432](https://github.com/intel/intel-extension-for-pytorch/commit/564588561fa5d45b8b63e490336d151ff1fc9cbc) [#438](https://github.com/intel/intel-extension-for-pytorch/commit/b4e7dacf08acd849cecf8d143a11dc4581a3857f) [#602](https://github.com/intel/intel-extension-for-pytorch/commit/74aa21262938b923d3ed1e6929e7d2b629b3ff27)
+    - Replace out-of-place operators with their corresponding in-place versions to reduce memory footprint. The extension currently supports this for `silu`, `sigmoid`, `tanh`, `hardsigmoid`, `hardswish`, `relu6`, `relu`, `selu`, and `softmax`. [#524](https://github.com/intel/intel-extension-for-pytorch/commit/38647677e8186a235769ea519f4db65925eca33c)
+    - Fuse Concat + BN + ReLU into a single operator. [#452](https://github.com/intel/intel-extension-for-pytorch/commit/275ff503aea780a6b741f04db5323d9529ee1081)
+    - Optimize Conv3D for both imperative and JIT by enabling NHWC and pre-packing the weight. [#425](https://github.com/intel/intel-extension-for-pytorch/commit/ae33faf62bb63b204b0ee63acb8e29e24f6076f3)
+- Reduce the binary size. The C++ SDK is reduced from ~220MB to ~13.5MB, while the wheel package is reduced from ~100MB to ~40MB.
+- Update oneDNN and oneDNN graph to [2.5.2](https://github.com/oneapi-src/oneDNN/releases/tag/v2.5.2) and [0.4.2](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.2) respectively.
+
+### What's Changed
+**Full Changelog**: https://github.com/intel/intel-extension-for-pytorch/compare/v1.10.100...v1.11.0
+
 ## 1.10.100
 
 This release is meant to fix the following issues:
 - Resolve the issue that the PyTorch Tensor Expression(TE) did not work after importing the extension.
-- Wraps the BactchNorm(BN) as another operator to break the TE's BN-related fusions. Because the BatchNorm performance of PyTorch Tensor Expression can not achieve the same performance as PyTorch ATen BN. 
+- Wrap BatchNorm (BN) as another operator to break the TE's BN-related fusions, because the BatchNorm performance of PyTorch Tensor Expression cannot match that of PyTorch ATen BN.
 - Update the [documentation](https://intel.github.io/intel-extension-for-pytorch/)
-    - Fix the INT8 quantization example issue #205 
+    - Fix the INT8 quantization example issue #205
     - Polish the installation guide
 
 ## 1.10.0
@@ -149,7 +231,7 @@ class MyModel(nn.Module):
     def __init__(self):
         super(MyModel, self).__init__()
         self.conv = nn.Conv2d(10, 10, 3)
-        
+
     def forward(self, x):
         x = self.conv(x)
         return x