
Failed to build the model execution plan using a model architecture file #2325

Open
Skyline-23 opened this issue Aug 27, 2024 · 3 comments
Labels
bug Unexpected behaviour that should be corrected (type) PyTorch (traced)

Comments

@Skyline-23

Skyline-23 commented Aug 27, 2024

🐞Describing the bug

Hello. I'm trying to convert a PyTorch model to a stateful Core ML model.

I wrote this code based on the Mistral-7B example from the WWDC 2024 session.
The Core ML file is produced after the run, but a "Failed to build the model execution plan using a model architecture file" error appears when the Core ML class is initialized.

Stack Trace

/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
/opt/homebrew/lib/python3.11/site-packages/torch/jit/_trace.py:1116: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 12 / 90000 (0.0%)
Greatest absolute difference: 1.6361474990844727e-05 at index (0, 11, 1251) (up to 1e-05 allowed)
Greatest relative difference: 0.000991315116234805 at index (0, 12, 1660) (up to 1e-05 allowed)
  _check_trace(
Torch var valueCache is added again.
Torch var keyCache is added again.
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████| 1516/1516 [00:00<00:00, 2464.91 ops/s]
Running MIL frontend_pytorch pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 27.54 passes/s]
Running MIL default pipeline:  60%|█████████████████████████████████████████████████████▏                                  | 52/86 [00:02<00:01, 27.17 passes/s]/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:894: RuntimeWarning: overflow encountered in cast
  return input_var.val.astype(dtype=string_to_nptype(dtype_val))
Running MIL default pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:06<00:00, 13.08 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 64.39 passes/s]
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/model.py:441: RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: {
    NSLocalizedDescription = "Failed to build the model execution plan using a model architecture file '/private/var/folders/pz/rmstwmls5ls_0hrn5_jj01kh0000gn/T/tmplybl8sp_.mlmodelc/model.mil' with error code: 14.";
}
  _warnings.warn(
Model successfully converted and saved as: zenz_v1_cached.mlpackage

To Reproduce

import torch
from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel, GPT2Attention, GPT2_ATTENTION_CLASSES
from transformers import AutoTokenizer
import coremltools as ct
from typing import Optional, Tuple
import numpy as np
from transformers.cache_utils import Cache
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

class SliceUpdateKeyValueCache(Cache):
    def __init__(self, shape: Tuple[int, ...], device="cpu", dtype=torch.float32) -> None:
        """KV cache of shape (#layers, batch_size, #kv_heads, context_size, head_dim)."""
        super().__init__()
        self.past_seen_tokens: int = 0
        self.k_cache: torch.Tensor = torch.zeros(shape, dtype=dtype, device=device)
        self.v_cache: torch.Tensor = torch.zeros(shape, dtype=dtype, device=device)

    def update(self, k_state: torch.Tensor, v_state: torch.Tensor, layer_idx: int, slice_indices: torch.LongTensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Update key/value cache tensors for slice [slice_indices[0], slice_indices[1])."""
        if len(slice_indices) != 2:
            raise ValueError(f"Expect tuple of integers [start, end), got {slice_indices=}.")
        begin, end = slice_indices
        self.k_cache[layer_idx, :, : k_state.shape[1], begin:end, :] = k_state
        self.v_cache[layer_idx, :, : v_state.shape[1], begin:end, :] = v_state
        return self.k_cache[layer_idx, :, :, :end, :], self.v_cache[layer_idx, :, :, :end, :]

    def get_seq_length(self, _: int = 0) -> int:
        """Get the sequence length of the cache."""
        return self.past_seen_tokens

    def to_past_key_values(self):
        """Convert the internal cache to a format expected by GPT2."""
        return [(self.k_cache[layer], self.v_cache[layer]) for layer in range(self.k_cache.size(0))]

class SliceUpdateGPT2Attention(GPT2Attention):
    def __init__(self, config, layer_idx: Optional[int] = None):
        super().__init__(config=config, layer_idx=layer_idx)

    @torch.no_grad()
    def forward(self, hidden_states: torch.Tensor, 
                layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
                attention_mask: Optional[torch.FloatTensor] = None, 
                head_mask: Optional[torch.FloatTensor] = None,
                use_cache: bool = False) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        # Compute query, key, and value tensors
        query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        # Handle past key/value tensors using tensor-based condition
        if layer_past is not None:
            past_key, past_value = layer_past
            if past_key.size(-2) > 0:
                key = torch.cat([past_key, key], dim=-2)
                value = torch.cat([past_value, value], dim=-2)

        # Optimize attention mask handling
        if attention_mask is not None:
            attention_mask = attention_mask[:, :, :, -key.size(-2):]

        # Calculate attention output
        attn_output, _ = self._attn(query, key, value, attention_mask, head_mask)
        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = self.c_proj(attn_output)

        # Return the updated cache if use_cache is True
        present = (key, value) if use_cache else None
        return attn_output, present

# Load the model and tokenizer
model_name = "Miwa-Keita/zenz-v1-checkpoints"
GPT2_ATTENTION_CLASSES["sdpa"] = SliceUpdateGPT2Attention
model = GPT2LMHeadModel.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input data
text = "Example sentence"
inputs = tokenizer(text, return_tensors="pt")

# Model tracing
class StatefulZenz(torch.nn.Module):
    def __init__(self, model, max_context_size: int = 256, batch_size: int = 1):
        super(StatefulZenz, self).__init__()
        self.model = model
        config = self.model.config
        self.kv_cache_shape: Tuple[int, ...] = (
            config.num_hidden_layers,
            batch_size,
            config.n_head,
            max_context_size,
            config.hidden_size // config.num_attention_heads,
        )
        self.kv_cache = SliceUpdateKeyValueCache(shape=self.kv_cache_shape)
        self.register_buffer("keyCache", self.kv_cache.k_cache)
        self.register_buffer("valueCache", self.kv_cache.v_cache)

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        self.kv_cache.past_seen_tokens = attention_mask.shape[-1] - input_ids.shape[-1]
        past_key_values = self.kv_cache.to_past_key_values()

        # Reintroduce the attention mask extension logic
        attention_mask = self._extend_attention_mask(attention_mask, past_key_values)

        outputs = self.model(input_ids, attention_mask=attention_mask, past_key_values=past_key_values, use_cache=True)
        return outputs.logits

    def _extend_attention_mask(self, attention_mask, past_key_values):
        """Adjust the attention mask to match the size of the key/value cache."""
        if past_key_values is not None:
            past_length = past_key_values[0][0].size(-2)
            new_length = past_length + attention_mask.size(-1)
            extended_attention_mask = torch.ones(
                (attention_mask.size(0), 1, 1, new_length), dtype=attention_mask.dtype, device=attention_mask.device
            )
            extended_attention_mask[:, :, :, -attention_mask.size(-1):] = attention_mask
        else:
            extended_attention_mask = attention_mask
        return extended_attention_mask

# Create the traced model
stateful_zenz = StatefulZenz(model).eval()
traced_model = torch.jit.trace(stateful_zenz, (inputs['input_ids'], inputs['attention_mask']))

# Convert the model to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 256))),  # 上限を256に設定
        ct.TensorType(name="attention_mask", shape=(1, ct.RangeDim(1, 256)))  # 上限を256に設定
    ],
    outputs=[
        ct.TensorType(dtype=np.float32, name="output")
    ],
    states=[
        ct.StateType(
            wrapped_type=ct.TensorType(
                shape=stateful_zenz.kv_cache_shape
            ),
            name="keyCache",
        ),
        ct.StateType(
            wrapped_type=ct.TensorType(
                shape=stateful_zenz.kv_cache_shape
            ),
            name="valueCache",
        ),
    ],
    minimum_deployment_target=ct.target.iOS18
)

# Save the converted model
mlmodel.save("zenz_v1_cached.mlpackage")
print("Model successfully converted and saved as: zenz_v1_cached.mlpackage")

System environment (please complete the following information):

  • coremltools version: 8.0b2
  • OS (e.g. MacOS version or Linux type): Mac OS Version 15.1 Beta (24B5024e)
  • Any other relevant version information (e.g. PyTorch or TensorFlow version):
    • python 3.11 with homebrew
    • torch-2.3.0
    • torchvision-0.18.0
    • transformers-4.41.0
@Skyline-23 Skyline-23 added the bug Unexpected behaviour that should be corrected (type) label Aug 27, 2024
@TobyRoseman
Collaborator

@Skyline-23 that is a lot of code. Can you give us a more minimal example?

@Skyline-23
Author

@TobyRoseman All of the code is required to run a stateful model based on GPT-2. Sorry 😢

@lithium0003

The official documentation example says:

converted_model_kvcache = ct.convert(
    traced_model_kvcache,
    inputs=inputs,
    outputs=outputs,
    states=states,
    minimum_deployment_target=ct.target.iOS18,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)

I got the same error with compute_units=ct.ComputeUnit.ALL, but it passes with compute_units=ct.ComputeUnit.CPU_AND_GPU.
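
Applied to the conversion call in this issue, the workaround would look roughly like the sketch below; only the compute_units argument is new relative to the original script.

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 256))),
        ct.TensorType(name="attention_mask", shape=(1, ct.RangeDim(1, 256))),
    ],
    outputs=[ct.TensorType(dtype=np.float32, name="output")],
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=stateful_zenz.kv_cache_shape), name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=stateful_zenz.kv_cache_shape), name="valueCache"),
    ],
    minimum_deployment_target=ct.target.iOS18,
    # Restricting execution to CPU and GPU avoids the error in my testing.
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)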
