Mistral assisted visual instruction data generation (with llama-cpp-python + gguf)
Follows Chapter 3. GPT-assisted Visual Instruction Data Generation from the LLaVa paper.
A few manually designed examples from here are the only human annotations that are used as seed examples in in-context-learning to query Mistral 7b instruct v0.1.
Three types of instruction-following data are generated in this way:
- Conversation: conversation_58k.json
- Detailed description: detail_23k.json
- Complex reasoning: complex_reasoning_77k.json
In order to encode an image into its visual features to prompt a text-only Mistral, two types of symbolic representations from COCO annotations_trainval2017
are used:
- Captions
- Bounding boxes
According to Mistral-7B-Instruct-v0.1
's chat_template
specified here, system message is not supported.
One remedy as suggested in Mistral's documentation is to prepend the system message to the first user messagem, leading to an overall prompt sructure like below:
[
{"role":"user", "content":"system message" + "context 000"},
{"role":"assistant", "content":"response 000"},
{"role":"user", "content":"context 001"},
{"role":"assistant", "content":"response 001"},
...
{"role":"user", "content":"context query (Captions or Captions + Bounding boxes)"},
]
generate-language-image-instruction-following-data
├─ COCO2017 <-- COCO2017 annotation data will be downloaded here
├─ models <-- mistral 7B instruct model will be downloaded here
├─ generated_data <-- generated raw & post processed instruction following data will be stored here
├─ symbolic_rep_data <-- symbolic representations created from COCO2017 annotation data will be saved here
├─ src
│ ├─ coco_symbolic_representations.py <-- run this to create symbolic representations for prompting Mistral
│ ├─ data_utils.py
│ ├─ instruct_158k.py <-- run this to generate all three types of raw instruction following data
│ ├─ model_utils.py
│ ├─ post_process_instruct_158k.py <-- run this to recreate LLaVA-Instruct-150K using the raw instruction following data
│ ├─ pretrain_595k.py <-- run this to replace the instructions in the original LLaVA-CC3M-Pretrain-595K
| └─ run_on_test_queries.py
├─ prompts <-- manually designed seed examples from LLaVA's official repo
│ ├─ complex_reasoning
│ │ ├─ ...txt
│ │ └─ system_message.txt
│ ├─ conversation
│ │ ├─ ...txt
│ │ └─ system_message.txt
│ └─ detail_description
│ ├─ ...txt
│ └─ system_message.txt
├─ requirements.txt
├─ archive data for COCO image 000000443336.jpg
│ └─ eda.py
└─ test_queries <-- symbolic representations of one example COCO image 000000443336.jpg
├─ complex_reasoning
│ └─ 000_caps.txt
├─ conversation
│ └─ 000_caps.txt
└─ detail_description
└─ 000_caps.txt
-
download the data and gguf model
mistral-7b-instruct-v0.1.Q5_K_M.gguf
cd models wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
GCP VM spec necessary to run
mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
: g2-standard-48 (4 NVIDIA L4 GPU).COCO 2017
cd COCO2017 wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
make sure
git lfs
is install firstgit lfs install git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
git lfs install git clone https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K
-
create conda environment and install dependencies
conda create -n mistral python=3.10.0
conda activate mistral
pip install -r requirements.txt
# to run llama-cpp-python on NVIDIA GPUs
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
pip install llama_cpp_python==0.2.43 --force-reinstall --upgrade --no-cache-dir
# Use --verbose for extra assurance that cuBLAS is being used in compilation.
-
Install
Jupyter
extension first in VS Code, then runsrc/run_on_test_queries.py
interactively usingshift + return
. -
Run
pretrain_595k.py
to replace the original instructions in LLaVA-CC3M-Pretrain-595K with the ones from thereplacing_instructions_brief
list here. Results will be saved asgenerated_data/chat.json
, which has the same format as the originalchat.json
. -
Recreate the three types of instruction following data,
detailed description
,complex reasoning
andconversation
, inLLaVA-Instruct-150K
using Mistral instruct 7B.nohup python -u src/instruct_158k.py > instruct_158k.out &
- Run instruct_158k.py to created raw instruction following data
- The generated raw data is saved in
generated_data/raw_detail_23k_{%Y_%m_%d_%H_%M_%S}.csv
generated_data/raw_complex_reasoning_77k_{%Y_%m_%d_%H_%M_%S}.csv
generated_data/raw_conversation_58k_{%Y_%m_%d_%H_%M_%S}.csv
respectively.
- Then run
src/post_process_instruct_158k.py
to parse the generated text and recreate the LLaVA training data, which share the same format as those in LLaVA-Instruct-150K.- NOTE that you'll need to change
raw_resp_fp
in each post processing function to beoutput_fp
from step 2 - After post processing, there will be four new files
generated_data/complex_reasoning_77k.json
generated_data/conversation_58k.json
generated_data/detail_23k.json
generated_data/mistral_generated_llava_instruct_150k
generated_data/mistral_generated_llava_instruct_150k
is the final finetuning dataset from merging the first three files together
- NOTE that you'll need to change
For this image COCO_train2014_000000443336.jpg, results from running run_on_test_queries.py.
- Complex reasoning
Question:
What is unusual about this photo?
===
Answer:
In this photo, there is a large clock sitting on the floor and displayed in a room. The clock's design is quite unique, featuring wood and glass construction. The clock's large size and mechanical components suggest it is a grandfather clock. The clock's placement at the center of the room draws attention to it and might be used as a focal point or decorative element to enhance the room's ambiance. The presence of people near the clock is unusual because it's not common to see people interacting directly with a clock on display, especially one as large and intricate as this one.
- Conversation
Question:
What is the clock doing in the room? Is it functional or just for display?
===
Answer:
The clock in the room appears to be functional, as it is large and has a grandfather clock design. However, without more context or information on its surroundings or other objects in the room, it is difficult to determine its exact purpose or intended use.
In contrast to a clock on display in a museum, which is typically a historical or antique piece meant to be admired for its beauty, craftsmanship, or historical significance, the clock in the room seems to be more practical or functional in nature.
The clock design suggests that it may be used to indicate time or elapsed time for various purposes, such as work schedules, personal timekeeping, or even coordinating activities within a household or organization.
However, without additional details, it is not possible to determine its specific function or intended purpose with certainty.
- Detailed description
The image shows a large clock sitting in the middle of a room. The clock is likely a grandfather clock, with its intricate wooden and glass design prominent in the photograph. The clock appears to be on display in a museum or historical building, surrounded by empty space and possibly other antiques or artifacts.
There is also one person visible in the photograph, standing to the right side of the clock and facing it. The person appears to be admiring or examining the clock closely. Another person is standing nearby but out of the frame, likely also looking at the clock or taking photographs.