
llama3v

llama3v is a SOTA vision model that is powered by Llama3 8B and siglip-so400m.

[ GitHub ] [ Model Weights ] [ Blog Post ]

Features

  • SOTA open-source vision-language model (VLM)
  • Model weights are available on Hugging Face
  • Fast local inference
  • Inference code released (training code is coming soon; we are cleaning it up)

Check out Hugging Face for the model weights.

Metrics

Usage

You can use llama3v with the Transformers library.

from transformers import AutoTokenizer, AutoModel
from PIL import Image

# Load the llama3v weights and tokenizer from Hugging Face and move the model to the GPU
model = AutoModel.from_pretrained("mustafaaljadery/llama3v").cuda()
tokenizer = AutoTokenizer.from_pretrained("mustafaaljadery/llama3v")

# Open the image you want to ask a question about
image = Image.open("test_image.png")

# Generate an answer conditioned on the image and the text prompt
answer = model.generate(image=image, message="What is this image?", temperature=0.1, tokenizer=tokenizer)

print(answer)

The model first passes the image through the vision model to extract image features, which are then passed through the language model to generate the answer. Here is a sample inference pipeline:
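A rough sketch of that flow is below, assuming hypothetical attribute names (vision_encoder, projection, language_model) for the model's internals; the released implementation may organize these differently.

import torch

def answer_question(model, tokenizer, image, message):
    # 1. Encode the (preprocessed) image into patch features with the vision tower (siglip-so400m)
    image_features = model.vision_encoder(image)  # hypothetical attribute name
    # 2. Project the vision features into the Llama3 embedding space
    image_embeds = model.projection(image_features)  # hypothetical attribute name
    # 3. Embed the text prompt and prepend the projected image tokens
    text_ids = tokenizer(message, return_tensors="pt").input_ids
    text_embeds = model.language_model.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    # 4. Let the language model (Llama3 8B) generate the answer autoregressively
    output_ids = model.language_model.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)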

Architecture

Training Process

In our training process, we combine the siglip-so400m model for vision with the Llama3 8B model for language, so that the model accepts multi-modal image-text input and generates text.

We add a projection layer on top of the siglip-so400m model that projects the image features into the Llama3 embedding space, so the language model can better understand the image.
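For illustration, such a projection can be as simple as a single linear layer mapping the vision feature dimension to the Llama3 hidden size; the class and dimensions below are assumptions, not the released configuration.

import torch.nn as nn

class VisionProjection(nn.Module):
    # Assumed dimensions: siglip-so400m patch features (1152) -> Llama3 8B hidden size (4096)
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from the vision tower
        return self.proj(image_features)  # (batch, num_patches, llm_dim)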

In the pretraining stage, we freeze all the weights other than the projection layer. We train on about 600K images.
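A minimal sketch of that freezing setup, again assuming attribute names like vision_encoder, projection, and language_model on the combined model:

import torch

# Pretraining: only the projection layer receives gradients
for param in model.vision_encoder.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
for param in model.projection.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,  # assumed learning rate, not from the source
)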

In the fine-tuning stage, we update the weights of the Llama3 8B model while freezing the weights of the siglip-so400m model and the projection layer. We train on approximately 1M images. Moreover, we generate synthetic multimodal data from the Yi model family for multimodal text generation, and we finetune our model on this synthetic data as well.
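Continuing the sketch above, fine-tuning inverts which parts are trainable (attribute names are still assumptions):

# Fine-tuning: update Llama3 8B, keep the vision tower and projection frozen
for param in model.language_model.parameters():
    param.requires_grad = True
for param in model.vision_encoder.parameters():
    param.requires_grad = False
for param in model.projection.parameters():
    param.requires_grad = False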

Read more about our training process here.

Acknowledgements

Citations

This was built with the help of the following resources:
