Keeping track of the performance and compatibility of models #147

Open
jrp2014 opened this issue Dec 14, 2024 · 0 comments

This is just a snapshot of my impressions of various models from the perspective of keywording / captioning.

In summary, at this point there are a couple of models that are both good and fast for this purpose; more give a good, fast description of the image but nothing else; and others give a very fast but very succinct account of the image (without keywords). Several models are not yet supported, or have config files that mlx-vlm can't use.

A few models are simply too slow, or need too much memory (even on a 128 GB Mac), to function.

I'll add to / remove from these notes as I experiment further.

import os
from pathlib import Path

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config, load_image_processor  # load_image_processor is only needed by some models (e.g. dolphin-vision)

# model_path="JosefAlbers/akemiH_MedQA_Reason"
# model_path = "HuggingFaceTB/SmolVLM-Instruct" # Fast, but too concise (eg, no keywords)
# model_path="Qwen/Qwen2-VL-7B-Instruct" # to be downloaded
# model_path="cognitivecomputations/dolphin-2.9.2-qwen2-72b" # To be downloaded
# model_path="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
# model_path="google/siglip-so400m-patch14-384" # To be downloaded
# model_path="meta-llama/Llama-3.2-11B-Vision-Instruct" # Unusably slow, gives fairly detailed captions, but is not always accurate.  Uses over 90Gb.  No keywords.
# model_path="meta-llama/Llama-3.2-90B-Vision-Instruct"
# model_path="microsoft/Phi-3.5-mini-instruct"
# model_path="microsoft/Phi-3.5-vision-instruct" # provides a good description, but that is all
# model_path="mistral-community/pixtral-12b" # Unsupported model type: pixtral
# model_path="mlx-community/Florence-2-large-ft-bf16" # Produces gibberish v quickly.  Corrupt?
# model_path="mlx-community/Llama-3.2-11B-Vision-Instruct-8bit" # Much better than the native version, but still slows down
# model_path="mlx-community/Molmo-7B-D-0924-bf16"  ModuleNotFoundError: No module named 'einops'
# model_path="mlx-community/Phi-3.5-vision-instruct-bf16" # Pretty good description, but no keywords
# model_path="mlx-community/Qwen2-VL-72B-Instruct-8bit" # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
# model_path="mlx-community/SmolVLM-Instruct-bf16" # Very good, fast description, but that is all
# model_path="mlx-community/dolphin-vision-72b-4bit" # mlx-vlm crash
# model_path="mlx-community/idefics2-8b-chatty-8bit" # ValueError: Unsupported model type: idefics2_vision
# model_path="mlx-community/llava-1.5-7b-4bit" # Vague description
# model_path="mlx-community/llava-v1.6-34b-8bit" # Pretty good
# model_path="mlx-community/llava-v1.6-mistral-7b-8bit" # V similar to the above
# model_path="mlx-community/paligemma2-3b-pt-896-4bit" # too innaccurate
# model_path="mlx-community/pixtral-12b-8bit" # A rival to llava 1.6
# model_path="mlx-community/whisper-tiny"
# model_path="mlx-community/paligemma2-10b-ft-docci-448-bf16" # Generates a good description, but no keywords
# https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3   # Very concise, no keywords
# model_path = "mlx-community/llava-1.5-7b-4bit"
# model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
# model_path = "mlx-community/pixtral-12b-8bit" # To the point
# model_path = "Qwen/Qwen2-VL-7B-Instruct"  # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 269535412224 bytes which is greater than the maximum allowed buffer size of 28991029248 bytes.###
# model_path = "mlx-community/llava-v1.6-34b-8bit" # Slower but more precise
# model_path = "mlx-community/Phi-3.5-vision-instruct-bf16" # OK, but doesn't provide keywords
# model_path = "mistral-community/pixtral-12b"
# model_path = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # needs about 95Gb, precise, but is slow
# model_path ="mlx-community/Qwen2-VL-72B-Instruct-8bit" # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
# model_path ="mlx-community/dolphin-vision-72b-4bit"  # Needs image_processor = load_image_processor(model_path) 
# model_path = "meta-llama/Llama-3.2-11B-Vision-Instruct" # Very slow, gives more detailed captions, but is not always accurate.  Uses over 90Gb.  No keywords.
model_path = "OpenGVLab/InternVL2_5-38B" # ValueError: Model type internvl_chat not supported.

print("Model: ", model_path)

# Load the model
model, processor = load(model_path)
# processor = load_image_processor(model_path)
config = load_config(model_path)

prompt = "Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily"

picpath = "/Users/x/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print("Image: ", pic)

# Apply chat template
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# Generate output
output = generate(model, processor, pic, formatted_prompt, max_tokens=500, verbose=True)
print(output)
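
Since the point of this issue is keeping track of performance across models, here is a minimal sketch layered on the script above (the run_and_record helper and the model_results.csv file name are just illustrative, not part of my script) that times each run and appends the result to a CSV so models can be compared later:

import csv
import time

def run_and_record(model_path, model, processor, config, image_path, prompt,
                   results_file="model_results.csv", max_tokens=500):
    # Same calling convention as the script above, just wrapped with a timer.
    formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

    start = time.perf_counter()
    output = generate(model, processor, image_path, formatted_prompt,
                      max_tokens=max_tokens, verbose=False)
    elapsed = time.perf_counter() - start

    # One row per run: model, image, wall-clock seconds, and the raw output.
    with open(results_file, "a", newline="") as f:
        csv.writer(f).writerow([model_path, image_path, f"{elapsed:.1f}", output])

    return output, elapsed

Appending rather than overwriting means the same CSV accumulates a row per model/image pair as the model_path lines above are swapped in and out.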