Merge branch 'main' into seq2seq
jiqing-feng committed Aug 28, 2024
2 parents 27f7d9d + 9a18ae0 commit 918a199
Showing 33 changed files with 1,042 additions and 228 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -223,15 +223,14 @@ To load your IPEX model, you can just replace your `AutoModelForXxx` class with
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("He's a dreadful magician and")

```

For more details, please refer to the [documentation](https://intel.github.io/intel-extension-for-pytorch/#introduction).


## Running the examples

Check out the [`examples`](https://github.com/huggingface/optimum-intel/tree/main/examples) directory to see how 🤗 Optimum Intel can be used to optimize models and accelerate inference.
Check out the [`examples`](https://github.com/huggingface/optimum-intel/tree/main/examples) and [`notebooks`](https://github.com/huggingface/optimum-intel/tree/main/notebooks) directories to see how 🤗 Optimum Intel can be used to optimize models and accelerate inference.

Do not forget to install requirements for every example:

9 changes: 5 additions & 4 deletions docker/Dockerfile.intel
@@ -27,6 +27,8 @@ RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
    libpng-dev \
    python3 \
    python3-pip \
    python3-dev \
    libnuma-dev \
    && rm -rf /var/lib/apt/lists/*"
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
@@ -43,12 +45,11 @@ RUN python3 -m pip install --no-cache-dir \
    torchaudio==${TORCHAUDIO_VERSION} \
    -f https://download.pytorch.org/whl/torch_stable.html && \
    python3 -m pip install intel-extension-for-pytorch==$IPEX_VERSION && \
    python3 -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
    python3 -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/ && \
    python3 -m pip install --no-cache-dir numa

ARG OMP_NUM_THREADS=1
ENV OMP_NUM_THREADS=${OMP_NUM_THREADS}
ARG KMP_BLOCKTIME=1
ENV KMP_BLOCKTIME=${KMP_BLOCKTIME}
ARG KMP_HW_SUBSET=1T
ENV KMP_HW_SUBSET=${KMP_HW_SUBSET}
ENV LD_PRELOAD="/usr/local/lib/libiomp5.so /usr/lib/x86_64-linux-gnu/libtcmalloc.so"
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc.so"
11 changes: 11 additions & 0 deletions docs/source/_toctree.yml
@@ -30,5 +30,16 @@
      title: Tutorials
      isExpanded: false
    title: OpenVINO
  - sections:
    - local: ipex/inference
      title: Inference
    - local: ipex/models
      title: Supported Models
    - sections:
      - local: ipex/tutorials/notebooks
        title: Notebooks
      title: Tutorials
      isExpanded: false
    title: IPEX
  title: Optimum Intel
  isExpanded: false
2 changes: 2 additions & 0 deletions docs/source/index.mdx
@@ -19,6 +19,8 @@ limitations under the License.

🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.

[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) (IPEX) is an open-source library that provides optimizations for both eager mode and graph mode. Compared to eager mode, graph mode in PyTorch normally yields better performance, thanks to optimization techniques such as operation fusion.
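
To make the eager/graph distinction concrete, here is a minimal sketch using only public `torch` and `intel_extension_for_pytorch` APIs; the toy module and input shape are illustrative assumptions, not part of this diff:

```python
import torch
import intel_extension_for_pytorch as ipex

# Toy module standing in for a real model (illustrative only).
model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.GELU()).eval()
example = torch.randn(1, 768)

# Eager mode: IPEX swaps in optimized kernels (e.g. prepacked weights) in place.
model = ipex.optimize(model, dtype=torch.bfloat16)

# Graph mode: tracing yields a TorchScript graph on which IPEX can apply
# fusions such as Linear + GELU.
with torch.no_grad(), torch.cpu.amp.autocast():
    traced = torch.jit.trace(model, example)
    traced = torch.jit.freeze(traced)
    traced(example)  # warm-up run triggers graph optimizations
```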

[Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the use of the most popular compression techniques, such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies so that users can easily generate quantized models. Users can apply static, dynamic and quantization-aware training approaches while specifying an expected accuracy criterion. It also supports different weight-pruning techniques, enabling the creation of pruned models that meet a predefined sparsity target.
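
As a rough illustration of that workflow (a sketch, not part of this diff; the checkpoint and quantization approach are arbitrary choices), dynamic post-training quantization through Optimum Intel's INC integration looks like this:

```python
from transformers import AutoModelForSequenceClassification
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Dynamic quantization requires no calibration dataset.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="inc_quantized_model")
```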

[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high-performance inference on Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It comes with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference using OpenVINO Runtime.
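
For example, a Transformers checkpoint can be converted to OpenVINO IR on the fly by swapping the model class (a minimal sketch; the checkpoint is an arbitrary choice):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch weights to OpenVINO IR before loading.
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(pipe("He's a dreadful magician."))
```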
3 changes: 2 additions & 1 deletion docs/source/installation.mdx
@@ -22,6 +22,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
|:-----------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------|
| [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade --upgrade-strategy eager "optimum[neural-compressor]"`|
| [Intel OpenVINO](https://docs.openvino.ai ) | `pip install --upgrade --upgrade-strategy eager "optimum[openvino]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -42,4 +43,4 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
where `extras` can be one or more of `neural-compressor`, `openvino`, `ipex`.
45 changes: 45 additions & 0 deletions docs/source/ipex/inference.mdx
@@ -0,0 +1,45 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Inference

Optimum Intel can be used to load models from the [Hub](https://huggingface.co/models) and create pipelines to run inference with IPEX optimizations (including patching with custom operators, weight prepacking and graph mode) on a variety of Intel processors. For now, support is only enabled for CPUs.


## Loading

You can load your model and apply IPEX optimizations (including weight prepacking and graph mode). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators.
For now, support is only enabled for CPUs, and the original model will be exported via TorchScript. In the future, `torch.compile` will be used instead, and export via TorchScript will be deprecated.

```diff
import torch
from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.intel import IPEXModelForCausalLM

model_id = "gpt2"
- model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+ model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("He's a dreadful magician and")
```

As shown in the table below, each task is associated with a class that loads your model automatically.

| Auto Class | Task |
|--------------------------------------|--------------------------------------|
| `IPEXModelForSequenceClassification` | `text-classification` |
| `IPEXModelForTokenClassification` | `token-classification` |
| `IPEXModelForQuestionAnswering` | `question-answering` |
| `IPEXModelForImageClassification` | `image-classification` |
| `IPEXModel` | `feature-extraction` |
| `IPEXModelForMaskedLM` | `fill-mask` |
| `IPEXModelForAudioClassification` | `audio-classification` |
| `IPEXModelForCausalLM` | `text-generation` |
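
For instance, a question-answering pipeline follows the same pattern as the text-generation example above (a sketch; the checkpoint is an arbitrary choice):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForQuestionAnswering

model_id = "distilbert-base-cased-distilled-squad"
model = IPEXModelForQuestionAnswering.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(qa_pipe(
    question="What does IPEX optimize?",
    context="IPEX applies operator fusion and weight prepacking to speed up inference on Intel CPUs.",
))
```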
46 changes: 46 additions & 0 deletions docs/source/ipex/models.mdx
@@ -0,0 +1,46 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Supported models

🤗 Optimum provides IPEX optimizations for both eager mode and graph mode, along with classes and functions to perform this step easily.
Here is the list of supported architectures; a short usage sketch follows the list.

## [Transformers](https://huggingface.co/docs/transformers/index)

- Albert
- Bart
- Beit
- Bert
- BlenderBot
- BlenderBotSmall
- Bloom
- CodeGen
- DistilBert
- Electra
- Flaubert
- GPT-2
- GPT-BigCode
- GPT-Neo
- GPT-NeoX
- Llama
- MPT
- Mistral
- MobileNet v1
- MobileNet v2
- MobileVit
- OPT
- ResNet
- Roberta
- Roformer
- SqueezeBert
- UniSpeech
- Vit
- Wav2Vec2
- XLM
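
Any architecture above can be loaded through the matching `IPEXModelForXxx` class. As a quick check (a sketch, assuming the plain `bert-base-uncased` checkpoint), feature extraction with BERT:

```python
from transformers import AutoTokenizer
from optimum.intel import IPEXModel

model_id = "bert-base-uncased"
model = IPEXModel.from_pretrained(model_id, export=True)  # feature-extraction task
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence, hidden)
```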
16 changes: 16 additions & 0 deletions docs/source/ipex/tutorials/notebooks.mdx
@@ -0,0 +1,16 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Notebooks

## Inference

| Notebook | Description | | |
|:---------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------- |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|------:|
| [How to run inference with IPEX](https://github.com/huggingface/optimum-intel/tree/main/notebooks/ipex) | Explains how to export your model to IPEX and run inference with the IPEX model on a text-generation task | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/ipex/text_generation.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/ipex/text_generation.ipynb) |
40 changes: 26 additions & 14 deletions optimum/commands/export/openvino.py
@@ -190,6 +190,24 @@ def parse_args_openvino(parser: "ArgumentParser"):
    )


def no_compression_parameter_provided(args):
    return all(
        (
            it is None
            for it in (
                args.ratio,
                args.group_size,
                args.sym,
                args.all_layers,
                args.dataset,
                args.num_samples,
                args.awq,
                args.sensitivity_metric,
            )
        )
    )


class OVExportCommand(BaseOptimumCLICommand):
    COMMAND = CommandInfo(name="openvino", help="Export PyTorch models to OpenVINO IR.")

@@ -230,23 +248,17 @@ def run(self):

        if self.args.weight_format is None:
            ov_config = None
            if not no_compression_parameter_provided(self.args):
                logger.warning(
                    "The provided compression parameters will not affect conversion because of the missing --weight-format argument."
                )
        elif self.args.weight_format in {"fp16", "fp32"}:
            ov_config = OVConfig(dtype=self.args.weight_format)
        else:
            is_int8 = self.args.weight_format == "int8"

            # For int4 quantization if not parameter is provided, then use the default config if exist
            if (
                not is_int8
                and self.args.ratio is None
                and self.args.group_size is None
                and self.args.sym is None
                and self.args.all_layers is None
                and self.args.dataset is None
                and self.args.num_samples is None
                and self.args.awq is None
                and self.args.sensitivity_metric is None
            ):
            # For int4 quantization if no parameter is provided, then use the default config if exist
            if no_compression_parameter_provided(self.args) and not is_int8:
                quantization_config = get_default_int4_config(self.args.model)
            else:
                quantization_config = {
@@ -305,7 +317,7 @@ def run(self):
            model = model_cls.from_pretrained(self.args.model, export=True, quantization_config=quantization_config)
            model.save_pretrained(self.args.output)
            if not self.args.disable_convert_tokenizer:
                maybe_convert_tokenizers(library_name, self.args.output, model)
                maybe_convert_tokenizers(library_name, self.args.output, model, task=task)
        elif task.startswith("text-generation") and quantize_with_dataset:
            from optimum.intel import OVModelForCausalLM

@@ -324,7 +336,7 @@
            preprocessors = maybe_load_preprocessors(
                self.args.model, trust_remote_code=self.args.trust_remote_code
            )
            maybe_convert_tokenizers(library_name, self.args.output, preprocessors=preprocessors)
            maybe_convert_tokenizers(library_name, self.args.output, preprocessors=preprocessors, task=task)
        else:
            # TODO : add input shapes
            main_export(
48 changes: 48 additions & 0 deletions optimum/exporters/ipex/model_patcher.py
@@ -13,6 +13,8 @@
# limitations under the License.

from transformers.models.bert.modeling_bert import BertIntermediate
from transformers.models.falcon.modeling_falcon import FalconDecoderLayer, FalconForCausalLM
from transformers.models.gpt2.modeling_gpt2 import GPT2Attention, GPT2Block, GPT2LMHeadModel
from transformers.models.llama.modeling_llama import (
    LlamaDecoderLayer,
    LlamaForCausalLM,
@@ -22,10 +24,14 @@
from transformers.models.vit.modeling_vit import ViTIntermediate

from optimum.intel.utils.import_utils import is_ipex_version, is_transformers_version
from optimum.intel.utils.modeling_utils import replace_customized_linear_with_linear

from .modeling_utils import (
    _IPEX_MINIMUM_VERSION_FOR_PATCHING,
    _gpt2_block_forward,
    _ipex_rms_layer_norm_forward,
    _IPEXFalconDecoderLayer,
    _IPEXGPT2Attention,
    _IPEXIntermediate,
    _IPEXLlamaDecoderLayer,
    _llama_model_forward,
@@ -67,18 +73,56 @@ def patch_op(m, target_m, new_op_name, new_op):


def _patch_llama_model(model):
    """
    Patch llama model:
        1. Use IPEX Rope and IAKV cache
        2. Linear fusion with (2 Linears + Silu + Mul) and (Linear + Add)
    """
    convert_functions(model, LlamaModel, "forward", _llama_model_forward)
    convert_functions(model, LlamaRMSNorm, "forward", _ipex_rms_layer_norm_forward)
    convert_class(model, LlamaDecoderLayer, _IPEXLlamaDecoderLayer, model.config)
    return model


def _patch_falcon_model(model):
    """
    Patch falcon model:
        1. Disable SDPA so the attention mask will be compatible with ipex attention.
        2. Use IPEX Rope and IAKV cache
        3. Linear fusion with (Linear + Gelu) and (Linear + Add + Add)
    """
    model.transformer._use_sdpa = False
    replace_customized_linear_with_linear(model)
    convert_class(model, FalconDecoderLayer, _IPEXFalconDecoderLayer, model.config)
    return model


def _patch_gpt2_model(model):
    """
    Patch gpt2 model:
        1. Disable SDPA so the attention mask will be compatible with ipex attention.
        2. Use IAKV cache
    """
    model.transformer._attn_implementation = "eager"
    convert_class(model, GPT2Attention, _IPEXGPT2Attention, model.config)
    convert_functions(model, GPT2Block, "forward", _gpt2_block_forward)
    return model


def _patch_bert_model(model):
    """
    Patch bert model:
        1. Linear fusion with Linear + Gelu
    """
    convert_class(model, BertIntermediate, _IPEXIntermediate)
    return model


def _patch_vit_model(model):
    """
    Patch vit model:
        1. Linear fusion with Linear + Gelu
    """
    convert_class(model, ViTIntermediate, _IPEXIntermediate)
    return model

Expand All @@ -94,6 +138,10 @@ def _patch_model(model):
        )
    if isinstance(model, LlamaForCausalLM):
        model = _patch_llama_model(model)
    elif isinstance(model, FalconForCausalLM):
        model = _patch_falcon_model(model)
    elif isinstance(model, GPT2LMHeadModel):
        model = _patch_gpt2_model(model)
    elif model.config.model_type == "bert":
        model = _patch_bert_model(model)
    elif model.config.model_type == "vit":