
Failed to build the model execution plan using a model architecture file #2325

Open
Skyline-23 opened this issue Aug 27, 2024 · 3 comments
Labels
bug Unexpected behaviour that should be corrected (type) PyTorch (traced)

Comments

@Skyline-23

Skyline-23 commented Aug 27, 2024

🐞Describing the bug

Hello. I'm trying to convert a PyTorch model to a stateful Core ML model.

I wrote this code based on the Mistral-7B example from the WWDC 2024 session.
The Core ML file is produced after the run, but a "Failed to build the model execution plan using a model architecture file" error appears when the Core ML class is initialized.

Stack Trace

/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
/opt/homebrew/lib/python3.11/site-packages/torch/jit/_trace.py:1116: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 12 / 90000 (0.0%)
Greatest absolute difference: 1.6361474990844727e-05 at index (0, 11, 1251) (up to 1e-05 allowed)
Greatest relative difference: 0.000991315116234805 at index (0, 12, 1660) (up to 1e-05 allowed)
  _check_trace(
Torch var valueCache is added again.
Torch var keyCache is added again.
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████| 1516/1516 [00:00<00:00, 2464.91 ops/s]
Running MIL frontend_pytorch pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 27.54 passes/s]
Running MIL default pipeline:  60%|█████████████████████████████████████████████████████▏                                  | 52/86 [00:02<00:01, 27.17 passes/s]/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:894: RuntimeWarning: overflow encountered in cast
  return input_var.val.astype(dtype=string_to_nptype(dtype_val))
Running MIL default pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:06<00:00, 13.08 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 64.39 passes/s]
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/model.py:441: RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: {
    NSLocalizedDescription = "Failed to build the model execution plan using a model architecture file '/private/var/folders/pz/rmstwmls5ls_0hrn5_jj01kh0000gn/T/tmplybl8sp_.mlmodelc/model.mil' with error code: 14.";
}
  _warnings.warn(
Model successfully converted and saved as: zenz_v1_cached.mlpackage

To Reproduce

import torch
from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel, GPT2Attention, GPT2_ATTENTION_CLASSES
from transformers import AutoTokenizer
import coremltools as ct
from typing import Optional, Tuple
import numpy as np
from transformers.cache_utils import Cache
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

class SliceUpdateKeyValueCache(Cache):
    def __init__(self, shape: Tuple[int, ...], device="cpu", dtype=torch.float32) -> None:
        """KV cache of shape (#layers, batch_size, #kv_heads, context_size, head_dim)."""
        super().__init__()
        self.past_seen_tokens: int = 0
        self.k_cache: torch.Tensor = torch.zeros(shape, dtype=dtype, device=device)
        self.v_cache: torch.Tensor = torch.zeros(shape, dtype=dtype, device=device)

    def update(self, k_state: torch.Tensor, v_state: torch.Tensor, layer_idx: int, slice_indices: torch.LongTensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Update key/value cache tensors for slice [slice_indices[0], slice_indices[1])."""
        if len(slice_indices) != 2:
            raise ValueError(f"Expect tuple of integers [start, end), got {slice_indices=}.")
        begin, end = slice_indices
        self.k_cache[layer_idx, :, : k_state.shape[1], begin:end, :] = k_state
        self.v_cache[layer_idx, :, : v_state.shape[1], begin:end, :] = v_state
        return self.k_cache[layer_idx, :, :, :end, :], self.v_cache[layer_idx, :, :, :end, :]

    def get_seq_length(self, _: int = 0) -> int:
        """Get the sequence length of the cache."""
        return self.past_seen_tokens

    def to_past_key_values(self):
        """Convert the internal cache to a format expected by GPT2."""
        return [(self.k_cache[layer], self.v_cache[layer]) for layer in range(self.k_cache.size(0))]

class SliceUpdateGPT2Attention(GPT2Attention):
    def __init__(self, config, layer_idx: Optional[int] = None):
        super().__init__(config=config, layer_idx=layer_idx)

    @torch.no_grad()
    def forward(self, hidden_states: torch.Tensor, 
                layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
                attention_mask: Optional[torch.FloatTensor] = None, 
                head_mask: Optional[torch.FloatTensor] = None,
                use_cache: bool = False) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        # Compute query, key, and value tensors
        query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        # Handle past key/value tensors using tensor-based condition
        if layer_past is not None:
            past_key, past_value = layer_past
            if past_key.size(-2) > 0:
                key = torch.cat([past_key, key], dim=-2)
                value = torch.cat([past_value, value], dim=-2)

        # Optimize attention mask handling
        if attention_mask is not None:
            attention_mask = attention_mask[:, :, :, -key.size(-2):]

        # Calculate attention output
        attn_output, _ = self._attn(query, key, value, attention_mask, head_mask)
        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = self.c_proj(attn_output)

        # Return the updated cache if use_cache is True
        present = (key, value) if use_cache else None
        return attn_output, present

# Load the model and tokenizer
model_name = "Miwa-Keita/zenz-v1-checkpoints"
GPT2_ATTENTION_CLASSES["sdpa"] = SliceUpdateGPT2Attention
model = GPT2LMHeadModel.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input data
text = "Example sentence"
inputs = tokenizer(text, return_tensors="pt")

# Model tracing
class StatefulZenz(torch.nn.Module):
    def __init__(self, model, max_context_size: int = 256, batch_size: int = 1):
        super(StatefulZenz, self).__init__()
        self.model = model
        config = self.model.config
        self.kv_cache_shape: Tuple[int, ...] = (
            config.num_hidden_layers,
            batch_size,
            config.n_head,
            max_context_size,
            config.hidden_size // config.num_attention_heads,
        )
        self.kv_cache = SliceUpdateKeyValueCache(shape=self.kv_cache_shape)
        self.register_buffer("keyCache", self.kv_cache.k_cache)
        self.register_buffer("valueCache", self.kv_cache.v_cache)

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        self.kv_cache.past_seen_tokens = attention_mask.shape[-1] - input_ids.shape[-1]
        past_key_values = self.kv_cache.to_past_key_values()

        # Reintroduce the attention mask extension logic
        attention_mask = self._extend_attention_mask(attention_mask, past_key_values)

        outputs = self.model(input_ids, attention_mask=attention_mask, past_key_values=past_key_values, use_cache=True)
        return outputs.logits

    def _extend_attention_mask(self, attention_mask, past_key_values):
        """Adjust the attention mask to match the size of the key/value cache."""
        if past_key_values is not None:
            past_length = past_key_values[0][0].size(-2)
            new_length = past_length + attention_mask.size(-1)
            extended_attention_mask = torch.ones(
                (attention_mask.size(0), 1, 1, new_length), dtype=attention_mask.dtype, device=attention_mask.device
            )
            extended_attention_mask[:, :, :, -attention_mask.size(-1):] = attention_mask
        else:
            extended_attention_mask = attention_mask
        return extended_attention_mask

# Create the traced model
stateful_zenz = StatefulZenz(model).eval()
traced_model = torch.jit.trace(stateful_zenz, (inputs['input_ids'], inputs['attention_mask']))

# Convert the model to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 256))),  # 上限を256に設定
        ct.TensorType(name="attention_mask", shape=(1, ct.RangeDim(1, 256)))  # 上限を256に設定
    ],
    outputs=[
        ct.TensorType(dtype=np.float32, name="output")
    ],
    states=[
        ct.StateType(
            wrapped_type=ct.TensorType(
                shape=stateful_zenz.kv_cache_shape
            ),
            name="keyCache",
        ),
        ct.StateType(
            wrapped_type=ct.TensorType(
                shape=stateful_zenz.kv_cache_shape
            ),
            name="valueCache",
        ),
    ],
    minimum_deployment_target=ct.target.iOS18
)

# Save the converted model
mlmodel.save("zenz_v1_cached.mlpackage")
print("Model successfully converted and saved as: zenz_v1_cached.mlpackage")

System environment (please complete the following information):

  • coremltools version: 8.0b2
  • OS (e.g. MacOS version or Linux type): Mac OS Version 15.1 Beta (24B5024e)
  • Any other relevant version information (e.g. PyTorch or TensorFlow version):
    • python 3.11 with homebrew
    • torch-2.3.0
    • torchvision-0.18.0
    • transformers-4.41.0
@Skyline-23 Skyline-23 added the bug Unexpected behaviour that should be corrected (type) label Aug 27, 2024
@TobyRoseman
Collaborator

@Skyline-23 that is a lot of code. Can you give us a more minimal example?

@Skyline-23
Author

@TobyRoseman All of the code is required to run a stateful model based on GPT-2. Sorry 😢

@lithium0003

The official documentation example says:

converted_model_kvcache = ct.convert(
    traced_model_kvcache,
    inputs=inputs,
    outputs=outputs,
    states=states,
    minimum_deployment_target=ct.target.iOS18,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)

I got the same error with compute_units=ct.ComputeUnit.ALL, but it passes with compute_units=ct.ComputeUnit.CPU_AND_GPU.
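
Applied to the conversion call in this issue, the workaround would look roughly like the sketch below; only the compute_units argument is new relative to the original script.

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 256))),
        ct.TensorType(name="attention_mask", shape=(1, ct.RangeDim(1, 256))),
    ],
    outputs=[ct.TensorType(dtype=np.float32, name="output")],
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=stateful_zenz.kv_cache_shape), name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=stateful_zenz.kv_cache_shape), name="valueCache"),
    ],
    minimum_deployment_target=ct.target.iOS18,
    # Restricting execution to CPU and GPU avoids the error in my testing.
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)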
