How to use pipeline to load the FastLanguageModel #1471

Open

Hyfred opened this issue Dec 24, 2024 · 0 comments

Comments

@Hyfred

Hyfred commented Dec 24, 2024

How do I use transformers.pipeline with a model loaded through FastLanguageModel?

Loading with FastLanguageModel:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# "unsloth/Llama-3.3-70B-Instruct-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit", # "unsloth/mistral-7b" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

Now I want to integrate this faster model into transformers.pipeline:

from transformers import pipeline

messages = [{"role": "user", "content": "Hello, who are you?"}]  # example chat-format input
pipe = pipeline("text-generation", model="unsloth/Llama-3.3-70B-Instruct-bnb-4bit")
pipe(messages)

How can I do that?
Any comments would be appreciated!
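
For what it's worth, my best guess (untested, so please correct me if this is wrong) is to hand the already-loaded model and tokenizer objects to pipeline instead of a repo name, so pipeline reuses them rather than running its own from_pretrained:

from unsloth import FastLanguageModel
from transformers import pipeline

# Load with Unsloth as above (4-bit, auto dtype).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Guess: pass the loaded objects directly so pipeline skips loading from the Hub.
# Unverified whether Unsloth's patched model behaves correctly inside pipeline.
pipe = pipeline("text-generation", model = model, tokenizer = tokenizer)

messages = [{"role": "user", "content": "Hello, who are you?"}]
print(pipe(messages, max_new_tokens = 64))

Is this the intended way, or does the Unsloth model need special handling inside pipeline?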
