
Unable to export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS) #908

Open
Kamalrajkannan opened this issue Sep 20, 2024 · 5 comments

@Kamalrajkannan

Export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS)

As per the suggestion in #710 (comment), I referred to the ONNX Runtime Build Documentation and followed the steps below:

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.19.2/onnxruntime-linux-x64-1.19.2.tgz -o onnxruntime-linux-x64-1.19.2.tgz && \
tar xvzf onnxruntime-linux-x64-1.19.2.tgz && \
mv onnxruntime-linux-x64-1.19.2 ort
python build.py --config Release
python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k -e cpu -p fp16

However, I encountered the following error:
AssertionError: Flash Attention is not available, but is needed for dense attention.

Detailed Trace:

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:961: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py:2509: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)
 block_mask_dense_output = [xi.to_sparse_csr() for xi in block_mask_dense_output]
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
2024-09-20 08:54:16,513 transformers_modules.microsoft.Phi-3-small-8k-instruct.1535ae26fb4faada95c6950e8bc6e867cdad6b00.modeling_phi3_small [INFO] - Layer 2 is using dense attention since it is divisible by 2
Traceback (most recent call last):
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2872, in <module>
  create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2764, in create_model
  onnx_model.make_model(input_path)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 1762, in make_model
  model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path, cache_dir=self.cache_dir, use_auth_token=True, trust_remote_code=True, **extra_kwargs)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
  return model_class.from_pretrained(
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3832, in from_pretrained
  model = cls(config, *model_args, **model_kwargs)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 903, in __init__
  self.model = Phi3SmallModel(config)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in __init__
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in <listcomp>
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 651, in __init__
  self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 218, in __init__
  assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention

Note: I verified my build by successfully exporting the Phi-3-mini-4k-instruct model.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

Any help from members of the official repository would be greatly appreciated!

@yufenglee
Member

yufenglee commented Sep 20, 2024

Could you please try:

  • Install the flash attention package: pip install flash-attn
  • The combination of CPU + FP16 is not supported. You can change the EP to CUDA: python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k -e cuda -p fp16

BTW, these are the combinations the tool supports now:
FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML

@Kamalrajkannan
Author

Kamalrajkannan commented Sep 20, 2024

Thanks for the response @yufenglee

So, to export the Microsoft/Phi-3-small-8k-instruct ONNX model, CUDA is mandatory; we can't export it using the CPU (because flash attention needs CUDA). But can we run the resulting model, exported with -e cuda, on the CPU? Is that right?
Correct me if I am wrong.

And is there any way to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits only as the output?

@yufenglee
Member

CUDA is required to export the model for the reason you mentioned. '-e cuda' specifies that the exported ONNX model is targeted to run with the ONNX Runtime CUDA EP. You can use '-e cpu' to export the ONNX model to run with the ORT CPU EP. '-p fp16/fp32/int4' specifies the data type of the ONNX model.

position_ids and the present/past key values are required inputs and outputs of the model. We don't have an option to omit them now. However, those inputs/outputs are managed automatically by the ORT GenAI API. You can get the logits with the ORT GenAI API like this after you export the model:

logits = generator.get_output("logits")
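For context, here is a minimal end-to-end sketch of that call with the onnxruntime_genai Python API (the model folder path and prompt are placeholders, and the exact generator calls, e.g. compute_logits, can vary between onnxruntime-genai releases):

import onnxruntime_genai as og

# Placeholder path: the output folder produced by the model builder.
model = og.Model("./phi3_small8k")
tokenizer = og.Tokenizer(model)

# Feed a prompt; here the token ids are set on the generator params.
params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Hello, world!")

generator = og.Generator(model, params)
generator.compute_logits()                 # run the prompt through the model
logits = generator.get_output("logits")    # numpy array, roughly (batch, seq_len, vocab_size)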

@Kamalrajkannan
Author

Here they mentioned that Phi-3 small ONNX models can now run on the CPU.
I can't export Phi-3 small using the CPU (because flash attention needs CUDA), and I can't run the exported model on the CPU if it is targeted to run with the ONNX Runtime CUDA EP.
Could you please clarify how we can run the Phi-3 small ONNX model on the CPU? @yufenglee

@kunal-vaishnavi
Contributor

As per #710 (comment), I referred to ONNX Runtime Build Documentation and followed the steps below

Since that PR was merged, the changes have been added to the latest versions of ONNX Runtime and ONNX Runtime GenAI. You can install the latest stable versions to produce the Phi-3 small ONNX model for CPU instead of needing to build from source.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

You can make the following modifications to the model builder to achieve this.

  1. To remove the past and present key-value caches from being added as inputs and outputs to the ONNX model respectively, you can comment out this code block.

# Add KV cache to inputs and outputs
for i in range(self.num_layers):
    # Add KV cache to inputs
    key_name = f"past_key_values.{i}.key"
    inputs.append(helper.make_tensor_value_info(key_name, self.input_types["past_key_values.key"], shape=self.input_shapes["past_key_values.key"]))
    value_name = f"past_key_values.{i}.value"
    inputs.append(helper.make_tensor_value_info(value_name, self.input_types["past_key_values.value"], shape=self.input_shapes["past_key_values.value"]))
    # Add KV cache to outputs
    key_name = f"present.{i}.key"
    outputs.append(helper.make_tensor_value_info(key_name, self.output_types["present.key"], shape=self.output_shapes["present.key"]))
    value_name = f"present.{i}.value"
    outputs.append(helper.make_tensor_value_info(value_name, self.output_types["present.value"], shape=self.output_shapes["present.value"]))
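With that block commented out, the graph I/O would reduce to roughly the following (a sketch reusing the same onnx.helper call; the element types and symbolic shape names are placeholders and in practice come from the builder's config and the chosen precision):

from onnx import TensorProto, helper

# Illustrative only: the graph inputs/outputs that remain once the KV-cache entries are dropped.
inputs = [
    helper.make_tensor_value_info("input_ids", TensorProto.INT64, shape=["batch_size", "sequence_length"]),
    helper.make_tensor_value_info("attention_mask", TensorProto.INT64, shape=["batch_size", "total_sequence_length"]),
]
outputs = [
    helper.make_tensor_value_info("logits", TensorProto.FLOAT, shape=["batch_size", "sequence_length", "vocab_size"]),
]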

  2. To make sure the attention ops do not reference the past and present key-value caches, please set the following variables to empty strings.

past_k = f"past_key_values.{layer_id}.key"
past_v = f"past_key_values.{layer_id}.value"
present_k = f"present.{layer_id}.key"
present_v = f"present.{layer_id}.value"
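Concretely, that change would look something like this where those names are defined (an illustrative sketch rather than a tested patch):

# Bypass the KV-cache tensor names so the attention op is wired up
# without past/present inputs and outputs (illustrative only).
past_k = ""
past_v = ""
present_k = ""
present_v = ""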

  3. Position ids are added into the graph as an input when the RotaryEmbedding op needs to be created. The op is not created with FP32 CPU and INT4 CPU, so those configurations will produce an ONNX model that does not contain a position_ids input.
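If you want to confirm which inputs and outputs the exported graph ends up with, you can inspect it with the onnx package (the model path below is a placeholder):

import onnx

# Placeholder path to the exported model produced by the builder.
m = onnx.load("./phi3_small8k_cpu/model.onnx")
print("inputs: ", [i.name for i in m.graph.input])    # e.g. input_ids, attention_mask
print("outputs:", [o.name for o in m.graph.output])   # e.g. logits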

As mentioned above, however, the past and present key-value caches are required to run with ONNX Runtime GenAI.

AssertionError: Flash Attention is not available, but is needed for dense attention.

As mentioned above, pip install flash-attn should resolve this issue. The original Phi-3 small modeling file checks if the flash-attn package is installed because the package is required to run Phi-3 small with PyTorch. However, since the model builder only loads the model weights into memory and does not run the model, you do not need to have flash-attn installed to get the ONNX model.

Here's how you can get around this issue.

  1. Clone the repo with the Phi-3 small model that you wish to use (for example, the Phi-3 small 8K repo).
  2. Comment out the assert for flash attention (for example, this line in the Phi-3 small 8K modeling file).
  3. Run the model builder as python3 builder.py -i /path/to/repo/you/cloned/ -o /path/to/output/folder/ -p {int4 or fp32} -e cpu. The -i flag will load the model from a local folder where you commented out the assert, whereas the -m flag will download the model from Hugging Face and use the already-uploaded files (which do not comment out the assert) to load the model.
