Image feature extraction. #213
base: release/v2.1.2
Conversation
You will need to load a vision model and its mmproj file. The settings are in the "LLM.cs" script under the "Advanced Options". You will also need llamalib 1.17 or higher.
Model used: llava-v1.6-mistral-7b.Q4_K_M, mmproj-model-f16
Thanks a lot for this PR!!
It needs some work before it is merged, I have left some comments.
@@ -0,0 +1,38 @@
using UnityEngine;
This file should be moved in a sample dir inside the Samples~ folder e.g. Samples~/ImageReceiver/ImageReceiver.cs
Also rename to ImageReceiver.cs :)
The same applies to the AndroidLlava.unity scene above.
Also rename it to Scene.unity, matching the other samples.
// This field relays the image to the AI, either as a URL or as a file path on your system.
public TextMeshProUGUI AnyImageData;
Please use a Text instead of TextMeshProUGUI element.
TextMeshProUGUI requires the TMP assets which vary between different Unity versions.
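A minimal sketch of the suggested change, assuming the sample script is renamed to ImageReceiver.cs as requested above and uses the built-in UnityEngine.UI.Text component:

```csharp
using UnityEngine;
using UnityEngine.UI;

public class ImageReceiver : MonoBehaviour
{
    // Plain UI Text avoids the TextMeshPro package dependency,
    // whose assets vary between Unity versions.
    public Text AnyImageData;
}
```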
public TextMeshProUGUI AnyImageData;

// Should work with any script that calls the Chat function on the LLMCharacter script.
public AndroidDemo AD;
Copy and paste the SimpleInteraction.cs code and modify it.
This ensures that samples are independent from each other and users can install whichever they want.
public void SendImageToAI()
{
    AD.onInputFieldSubmit(" [\r\n {\"role\": \"system\", \"content\": \"You are an assistant who perfectly describes images.\"},\r\n {\r\n \"role\": \"user\",\r\n \"content\": [\r\n {\"type\" : \"text\", \"text\": \"What's in this image?\"},\r\n {\"type\": \"image_url\", \"image_url\": {\"url\":" + AnyImageData.text + "\" } }\r\n ]");
}
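For reference, the payload this call tries to assemble follows the OpenAI-style multimodal chat format, but the string concatenation above leaves the quoting and brackets unbalanced. A hypothetical reconstruction using a verbatim string, so the escaping stays readable (the helper name is an assumption, not part of the PR):

```csharp
// Hypothetical helper: builds the multimodal chat payload the call above
// aims to produce, with balanced quotes and brackets.
string BuildImageQuery(string imageUrl)
{
    return @"[
  {""role"": ""system"", ""content"": ""You are an assistant who perfectly describes images.""},
  {
    ""role"": ""user"",
    ""content"": [
      {""type"": ""text"", ""text"": ""What's in this image?""},
      {""type"": ""image_url"", ""image_url"": {""url"": """ + imageUrl + @"""}}
    ]
  }
]";
}
```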
I see what you do here, it's better to define a function in Runtime/LLMCharacter.cs
that takes over this part and can be reused e.g.:
public async Task<string> ChatWithImage(string query, Uri url, Callback<string> callback = null, EmptyCallback completionCallback = null, bool addToHistory = true)
{
    URLContent urlText = new URLContent(){ url = url.ToString() };
    ImageURLContent urlContent = new ImageURLContent(){ type = "image_url", image_url = urlText };
    TextContent message = new TextContent(){ type = "text", text = query };
    string queryWithImage = "[" + JsonUtility.ToJson(message) + "," + JsonUtility.ToJson(urlContent) + "]";
    return await Chat(queryWithImage, callback, completionCallback, addToHistory);
}
public async Task<string> ChatWithImage(string query, Path path, Callback<string> callback = null, EmptyCallback completionCallback = null, bool addToHistory = true)
{
    string queryWithImage = ...
    return await Chat(queryWithImage, callback, completionCallback, addToHistory);
}
and inside the Runtime/LLMInterface.cs
[Serializable]
public struct TextContent
{
public string type;
public string text;
}
[Serializable]
public struct ImageURLContent
{
public string type;
public URLContent image_url;
}
[Serializable]
public struct URLContent
{
public string url;
}
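Assuming the suggested API above is adopted, a hypothetical call site might look like this (the URL is a placeholder; names come from the review comment, not merged code):

```csharp
// Hypothetical usage of the suggested ChatWithImage API.
async void DescribeImage(LLMCharacter llmCharacter)
{
    var url = new Uri("https://example.com/image.png"); // placeholder URL
    string reply = await llmCharacter.ChatWithImage("What's in this image?", url);
    Debug.Log(reply);
}
```

Since the structs are marked [Serializable] with public fields, JsonUtility.ToJson serializes them directly, e.g. a TextContent with type "text" becomes {"type":"text","text":"..."}, which is what the bracketed concatenation in ChatWithImage relies on.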
Instead of manually defining the "What's in this image?" text, you can use the existing text box in the SimpleInteraction sample.
if (remote) arguments += $" --port {port} --host 0.0.0.0";
if (numThreadsToUse > 0) arguments += $" -t {numThreadsToUse}";
if (loraPath != "") arguments += $" --lora \"{loraPath}\"";
if (MMPROJmodel != "") arguments += $" --mmproj \"{MMPROJmodel}\"";
Instead of copying a new LLM.cs file, modify Runtime/LLM.cs to add the MMPROJmodel.
The MMPROJmodel needs to be treated similarly to e.g. the loras: rather than providing it as plain text, it needs additional functionality to load it and make sure it is included in builds.
I will take over this part because it is quite involved.
mmproj model loading and image feature extraction update.