
Unable to export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS) #908

Open
Kamalrajkannan opened this issue Sep 20, 2024 · 5 comments

@Kamalrajkannan

Export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS)

As per the suggestion in #710 (comment), I referred to the ONNX Runtime Build Documentation and followed the steps below:

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.19.2/onnxruntime-linux-x64-1.19.2.tgz -o onnxruntime-linux-x64-1.19.2.tgz && \
tar xvzf onnxruntime-linux-x64-1.19.2.tgz && \
mv onnxruntime-linux-x64-1.19.2 ort
python build.py --config Release
python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k -e cpu -p fp16

However, I encountered the following error:
AssertionError: Flash Attention is not available, but is needed for dense attention.

Detailed Trace:

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:961: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py:2509: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)
 block_mask_dense_output = [xi.to_sparse_csr() for xi in block_mask_dense_output]
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
2024-09-20 08:54:16,513 transformers_modules.microsoft.Phi-3-small-8k-instruct.1535ae26fb4faada95c6950e8bc6e867cdad6b00.modeling_phi3_small [INFO] - Layer 2 is using dense attention since it is divisible by 2
Traceback (most recent call last):
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2872, in <module>
  create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2764, in create_model
  onnx_model.make_model(input_path)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 1762, in make_model
  model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path, cache_dir=self.cache_dir, use_auth_token=True, trust_remote_code=True, **extra_kwargs)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
  return model_class.from_pretrained(
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3832, in from_pretrained
  model = cls(config, *model_args, **model_kwargs)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 903, in __init__
  self.model = Phi3SmallModel(config)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in __init__
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in <listcomp>
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 651, in __init__
  self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 218, in __init__
  assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention

Note: I verified my build by successfully exporting the Phi-3-mini-4k-instruct model.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

Any help from members of the official repository would be greatly appreciated!

@yufenglee
Member

yufenglee commented Sep 20, 2024

Could you please try:

  • Install the flash attention package: pip install flash-attn
  • The combination of CPU + FP16 is not supported. You can change the EP to CUDA: python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k -e cuda -p fp16

BTW, these are the combinations the tool supports now:
FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML

@Kamalrajkannan
Author

Kamalrajkannan commented Sep 20, 2024

Thanks for the response @yufenglee

So, to export the Microsoft/Phi-3-small-8k-instruct ONNX model, CUDA is mandatory; we can't export it using the CPU (because flash attention needs CUDA). But can we run the resulting model, exported with -e cuda, on the CPU? Is that right?
Correct me if I am wrong.

And is there any way to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits only as the output?

@yufenglee
Member

CUDA is required to export the model for the reason you mentioned. '-e cuda' specifies that the exported ONNX model is targeted to run with the ONNX Runtime CUDA EP. You can use '-e cpu' to export the ONNX model to run with the ORT CPU EP. '-p fp16/fp32/int4' specifies the data type of the ONNX model.

position_ids and the present/past key values are required inputs and outputs of the model. We don't have an option to omit them now. However, those inputs/outputs are managed automatically by the ORT GenAI API. You can get the logits with the ORT GenAI API like this after you export the model:

logits = generator.get_output("logits")
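For context, here is a minimal end-to-end sketch of that call with the onnxruntime_genai Python API (the model folder path and prompt are placeholders, and the exact generator calls, e.g. compute_logits, can vary between onnxruntime-genai releases):

import onnxruntime_genai as og

# Placeholder path: the output folder produced by the model builder.
model = og.Model("./phi3_small8k")
tokenizer = og.Tokenizer(model)

# Feed a prompt; here the token ids are set on the generator params.
params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Hello, world!")

generator = og.Generator(model, params)
generator.compute_logits()                 # run the prompt through the model
logits = generator.get_output("logits")    # numpy array, roughly (batch, seq_len, vocab_size)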

@Kamalrajkannan
Author

Here they mentioned that Phi-3 small ONNX models can now run on the CPU.
I can't export Phi-3 small using the CPU (because flash attention needs CUDA), and I can't run the exported model on the CPU if it is targeted to run with the ONNX Runtime CUDA EP.
Could you please clarify how we can run the Phi-3 small ONNX model on the CPU? @yufenglee

@kunal-vaishnavi
Contributor

As per #710 (comment), I referred to ONNX Runtime Build Documentation and followed the steps below

Since that PR was merged, the changes have been added to the latest versions of ONNX Runtime and ONNX Runtime GenAI. You can install the latest stable versions to produce the Phi-3 small ONNX model for CPU instead of needing to build from source.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

You can make the following modifications to the model builder to achieve this.

  1. To remove the past and present key-value caches from being added as inputs and outputs to the ONNX model respectively, you can comment out this code block.

# Add KV cache to inputs and outputs
for i in range(self.num_layers):
    # Add KV cache to inputs
    key_name = f"past_key_values.{i}.key"
    inputs.append(helper.make_tensor_value_info(key_name, self.input_types["past_key_values.key"], shape=self.input_shapes["past_key_values.key"]))
    value_name = f"past_key_values.{i}.value"
    inputs.append(helper.make_tensor_value_info(value_name, self.input_types["past_key_values.value"], shape=self.input_shapes["past_key_values.value"]))
    # Add KV cache to outputs
    key_name = f"present.{i}.key"
    outputs.append(helper.make_tensor_value_info(key_name, self.output_types["present.key"], shape=self.output_shapes["present.key"]))
    value_name = f"present.{i}.value"
    outputs.append(helper.make_tensor_value_info(value_name, self.output_types["present.value"], shape=self.output_shapes["present.value"]))
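With that block commented out, the graph I/O would reduce to roughly the following (a sketch reusing the same onnx.helper call; the element types and symbolic shape names are placeholders and in practice come from the builder's config and the chosen precision):

from onnx import TensorProto, helper

# Illustrative only: the graph inputs/outputs that remain once the KV-cache entries are dropped.
inputs = [
    helper.make_tensor_value_info("input_ids", TensorProto.INT64, shape=["batch_size", "sequence_length"]),
    helper.make_tensor_value_info("attention_mask", TensorProto.INT64, shape=["batch_size", "total_sequence_length"]),
]
outputs = [
    helper.make_tensor_value_info("logits", TensorProto.FLOAT, shape=["batch_size", "sequence_length", "vocab_size"]),
]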

  2. To make sure the attention ops do not reference the past and present key-value caches, please set the following variables to empty strings.

past_k = f"past_key_values.{layer_id}.key"
past_v = f"past_key_values.{layer_id}.value"
present_k = f"present.{layer_id}.key"
present_v = f"present.{layer_id}.value"
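Concretely, that change would look something like this where those names are defined (an illustrative sketch rather than a tested patch):

# Bypass the KV-cache tensor names so the attention op is wired up
# without past/present inputs and outputs (illustrative only).
past_k = ""
past_v = ""
present_k = ""
present_v = ""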

  3. Position ids are added into the graph as an input when the RotaryEmbedding op needs to be created. The op is not created with FP32 CPU and INT4 CPU, so those configurations will produce an ONNX model that does not contain a position_ids input.
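If you want to confirm which inputs and outputs the exported graph ends up with, you can inspect it with the onnx package (the model path below is a placeholder):

import onnx

# Placeholder path to the exported model produced by the builder.
m = onnx.load("./phi3_small8k_cpu/model.onnx")
print("inputs: ", [i.name for i in m.graph.input])    # e.g. input_ids, attention_mask
print("outputs:", [o.name for o in m.graph.output])   # e.g. logits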

As mentioned above, however, the past and present key-value caches are required to run with ONNX Runtime GenAI.

AssertionError: Flash Attention is not available, but is needed for dense attention.

As mentioned above, pip install flash-attn should resolve this issue. The original Phi-3 small modeling file checks if the flash-attn package is installed because the package is required to run Phi-3 small with PyTorch. However, since the model builder only loads the model weights into memory and does not run the model, you do not need to have flash-attn installed to get the ONNX model.

Here's how you can get around this issue.

  1. Clone the repo with the Phi-3 small model that you wish to use (for example, the Phi-3 small 8K repo).
  2. Comment out the assert for flash attention (for example, this line in the Phi-3 small 8K modeling file).
  3. Run the model builder as python3 builder.py -i /path/to/repo/you/cloned/ -o /path/to/output/folder/ -p {int4 or fp32} -e cpu. The -i flag will load the model from a local folder where you commented out the assert, whereas the -m flag will download the model from Hugging Face and use the already-uploaded files (which do not comment out the assert) to load the model.
