louis-jan changed the title from "bug: Unable to send image to the Moondream2 Vision model" to "bug: Unable to chat with image using Moondream2 Vision model" on Sep 19, 2024
Jan version
0.5.4
Describe the Bug
I can successfully load the model and chat with it, but as soon as I send an image, cortex crashes with an uncaught exception: "libc++abi: terminating due to uncaught exception of type std::length_error: vector" (see the logs below).
Context:
https://huggingface.co/moondream/moondream2-gguf
Same glitch on Linux here.
https://discord.com/channels/1107178041848909847/1285784195125219338/1286348026973261835
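For context, the kind of request that triggers the crash looks roughly like this. This is only a sketch: the host and port (127.0.0.1:3928), the model name, streaming, and the use of a base64-encoded image all come from the logs below, but the route and payload field names assume cortex-cpp's OpenAI-compatible chat-completions API and may differ, and <BASE64_IMAGE> is a placeholder.

curl http://127.0.0.1:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moondream2-f16.gguf",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          { "type": "image_url", "image_url": { "url": "data:image/png;base64,<BASE64_IMAGE>" } }
        ]
      }
    ]
  }'

Plain text prompts on the same route complete normally (the warm-up and "Hello" requests in the logs succeed); the crash only occurs once "Base64 image detected" is logged.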
Steps to Reproduce
1. Create a model.json file to interact with the model (a sketch follows this list).
2. Send an image and request a description.
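A minimal model.json sketch for this setup. The settings values (ctx_len, ngl, prompt_template, mmproj, vision_model) are taken from the "Loading model with params" line in the logs below; the remaining fields (id, name, sources, engine) follow Jan's usual model.json layout and are assumptions, not copied from the original report.

{
  "id": "moondream2-f16.gguf",
  "object": "model",
  "name": "Moondream2 F16",
  "version": "1.0",
  "format": "gguf",
  "sources": [
    {
      "url": "https://huggingface.co/moondream/moondream2-gguf",
      "filename": "moondream2-f16.gguf"
    }
  ],
  "settings": {
    "vision_model": true,
    "text_model": false,
    "ctx_len": 2048,
    "ngl": 100,
    "prompt_template": "{system_message}\n### Instruction: {prompt}\n### Response:",
    "llama_model_path": "moondream2-f16.gguf",
    "mmproj": "moondream2-mmproj-f16.gguf"
  },
  "engine": "nitro"
}

The mmproj entry is what puts the engine into multimodal mode ("MMPROJ FILE detected, multi-model enabled!" in the logs).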
Screenshots / Logs
2024-09-19T12:11:04.250Z [CORTEX]::Debug: Request to kill cortex
2024-09-19T12:11:04.254Z [CORTEX]::Debug: 20240919 12:10:46.430861 UTC 3549698 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159
20240919 12:10:46.676668 UTC 3549698 DEBUG [LoadModel] Request 4096 for context length for llava-1.6 - llama_server_context.cc:170
20240919 12:10:47.890831 UTC 3549698 DEBUG [Initialize] Available slots: - llama_server_context.cc:225
20240919 12:10:47.890848 UTC 3549698 DEBUG [Initialize] -> Slot 0 - max context: 4096 - llama_server_context.cc:233
20240919 12:10:47.890947 UTC 3549698 INFO Started background task here! - llama_server_context.cc:252
20240919 12:10:47.891006 UTC 3549698 INFO Warm-up model: llava-7b - llama_engine.cc:819
20240919 12:10:47.891010 UTC 3549742 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250
20240919 12:10:47.891017 UTC 3549742 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258
20240919 12:10:47.901986 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623
20240919 12:10:47.902059 UTC 3549742 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544
20240919 12:10:48.166076 UTC 3549742 DEBUG [PrintTimings] PrintTimings: prompt eval time = 172.433ms / 2 tokens (86.2165 ms per token, 11.5987079039 tokens per second) - llama_client_slot.cc:79
20240919 12:10:48.166081 UTC 3549742 DEBUG [PrintTimings] PrintTimings: eval time = 91.653 ms / 4 runs (22.91325 ms per token, 43.6428703916 tokens per second)
20240919 12:10:48.166082 UTC 3549742 DEBUG [PrintTimings] PrintTimings: total time = 264.086 ms - llama_client_slot.cc:92
20240919 12:10:48.166116 UTC 3549742 INFO slot released: id_slot: 0, id_task: 0, n_ctx: 4096, n_past: 6, n_system_tokens: 0, n_cache_tokens: 0, truncated: 0 - llama_server_context.cc:1304
20240919 12:10:48.166129 UTC 3549698 INFO {"content":",\nI recently bought","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/louis/Library/Application Support/Jan/jan/models/llava-7b/llava-v1.6-mistral-7b.Q4_K_M.gguf","n_ctx":4096,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":false,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.0,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/louis/Library/Application Support/Jan/jan/models/llava-7b/llava-v1.6-mistral-7b.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":91.653,"predicted_n":4,"predicted_per_second":43.64287039158565,"predicted_per_token_ms":22.91325,"prompt_ms":172.433,"prompt_n":2,"prompt_per_second":11.598707903939502,"prompt_per_token_ms":86.2165},"tokens_cached":6,"tokens_evaluated":2,"tokens_predicted":4,"truncated":false} - llama_engine.cc:827
20240919 12:10:48.166183 UTC 3549698 INFO Model loaded successfully: llava-7b - llama_engine.cc:216
20240919 12:10:48.171967 UTC 3549699 INFO Model status responded - llama_engine.cc:259
20240919 12:10:48.175867 UTC 3549700 INFO Request 1, model llava-7b: Generating response for inference request - llama_engine.cc:469
20240919 12:10:48.175871 UTC 3549700 INFO Request 1: Stop words:null
20240919 12:10:48.175892 UTC 3549700 INFO Request 1: Base64 image detected - llama_engine.cc:549
20240919 12:10:48.179648 UTC 3549700 INFO Request 1: Streamed, waiting for respone - llama_engine.cc:608
20240919 12:10:48.179692 UTC 3549700 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
20240919 12:10:48.182143 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 - loaded image - llama_server_context.cc:562
20240919 12:10:48.182156 UTC 3549742 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 1] - llama_server_context.cc:623
20240919 12:10:48.182167 UTC 3549742 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 1, p0: 0 - llama_server_context.cc:1544
20240919 12:10:52.482469 UTC 3549701 INFO Request 2, model llava-7b: Generating response for inference request - llama_engine.cc:469
20240919 12:10:52.482483 UTC 3549701 INFO Request 2: Stop words:null
20240919 12:11:04.251929 UTC 3549702 INFO Program is exitting, goodbye! - processManager.cc:8
2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex process is terminated
2024-09-19T12:11:04.294Z [CORTEX]::Debug: cortex exited with code: 0
2024-09-19T12:11:04.305Z [CORTEX]::CPU information - 10
2024-09-19T12:11:04.305Z [CORTEX]::Debug: Request to kill cortex
2024-09-19T12:11:04.306Z [CORTEX]::Debug: cortex process is terminated
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawning cortex subprocess...
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Spawn cortex at path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64/cortex-cpp, and args: 1,127.0.0.1,3928
2024-09-19T12:11:04.307Z [CORTEX]::Debug: Cortex engine path: /Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-09-19T12:11:04.307Z [CORTEX] PATH: /usr/bin:/bin:/usr/sbin:/sbin::/Users/louis/Library/Application Support/Jan/jan/engines/@janhq/inference-cortex-extension/1.0.17:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64:/Users/louis/Library/Application Support/Jan/jan/extensions/@janhq/inference-cortex-extension/dist/bin/mac-arm64
2024-09-19T12:11:04.410Z [CORTEX]::Debug: Loading model with params {"cpu_threads":10,"vision_model":true,"text_model":false,"ctx_len":2048,"prompt_template":"{system_message}\n### Instruction: {prompt}\n### Response:","llama_model_path":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","mmproj":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-mmproj-f16.gguf","system_prompt":"","user_prompt":"\n### Instruction: ","ai_prompt":"\n### Response:","model":"moondream2-f16.gguf","ngl":100}
2024-09-19T12:11:04.410Z [CORTEX]::Debug: cortex is ready
2024-09-19T12:11:04.419Z [CORTEX]::Debug: 20240919 12:11:04.315010 UTC 3550094 INFO cortex-cpp version: 0.5.0 - main.cc:73
20240919 12:11:04.315589 UTC 3550094 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:78
20240919 12:11:04.315590 UTC 3550094 INFO Please load your model - main.cc:79
20240919 12:11:04.315592 UTC 3550094 INFO Number of thread is:10 - main.cc:86
20240919 12:11:04.411469 UTC 3550098 INFO CPU instruction set: fpu = 0| mmx = 0| sse = 0| sse2 = 0| sse3 = 0| ssse3 = 0| sse4_1 = 0| sse4_2 = 0| pclmulqdq = 0| avx = 0| avx2 = 0| avx512_f = 0| avx512_dq = 0| avx512_ifma = 0| avx512_pf = 0| avx512_er = 0| avx512_cd = 0| avx512_bw = 0| has_avx512_vl = 0| has_avx512_vbmi = 0| has_avx512_vbmi2 = 0| avx512_vnni = 0| avx512_bitalg = 0| avx512_vpopcntdq = 0| avx512_4vnniw = 0| avx512_4fmaps = 0| avx512_vp2intersect = 0| aes = 0| f16c = 0| - server.cc:288
20240919 12:11:04.418604 UTC 3550098 INFO Loaded engine: cortex.llamacpp - server.cc:314
20240919 12:11:04.418615 UTC 3550098 INFO cortex.llamacpp version: 0.1.25 - llama_engine.cc:163
20240919 12:11:04.418638 UTC 3550098 INFO MMPROJ FILE detected, multi-model enabled! - llama_engine.cc:300
20240919 12:11:04.418667 UTC 3550098 INFO Number of parallel is set to 1 - llama_engine.cc:352
20240919 12:11:04.418670 UTC 3550098 DEBUG [LoadModelImpl] cache_type: f16 - llama_engine.cc:365
20240919 12:11:04.418672 UTC 3550098 DEBUG [LoadModelImpl] Enabled Flash Attention - llama_engine.cc:374
20240919 12:11:04.418679 UTC 3550098 DEBUG [LoadModelImpl] stop: null
{"timestamp":1726747864,"level":"INFO","function":"LoadModelImpl","line":418,"message":"system info","n_threads":10,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
2024-09-19T12:11:04.420Z [CORTEX]::Error: ggml_metal_init: allocating
2024-09-19T12:11:04.431Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro
2024-09-19T12:11:04.458Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro
2024-09-19T12:11:04.459Z [CORTEX]::Error: ggml_metal_init: using embedded metal library
2024-09-19T12:11:04.462Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
2024-09-19T12:11:04.841Z [CORTEX]::Error: llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = moondream2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: phi2.block_count u32 = 24
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
2024-09-19T12:11:04.845Z [CORTEX]::Error: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", """, "#", "$", "%", "&", "'", ...
2024-09-19T12:11:04.846Z [CORTEX]::Error: llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-09-19T12:11:04.850Z [CORTEX]::Error: llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - type f32: 147 tensors
llama_model_loader: - type f16: 98 tensors
2024-09-19T12:11:04.874Z [CORTEX]::Error: llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
2024-09-19T12:11:04.881Z [CORTEX]::Error: llm_load_vocab: special tokens cache size = 944
2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 24
2024-09-19T12:11:04.889Z [CORTEX]::Error: llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.42 B
llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
llm_load_print_meta: general.name = moondream2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.22 MiB
2024-09-19T12:11:04.890Z [CORTEX]::Error: ggml_backend_metal_log_allocated_size: allocated buffer, size = 2506.30 MiB, ( 3425.89 / 21845.34)
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 200.00 MiB
llm_load_tensors: Metal buffer size = 2506.29 MiB
2024-09-19T12:11:04.890Z [CORTEX]::Error: .....................................
2024-09-19T12:11:04.890Z [CORTEX]::Error: .....................
2024-09-19T12:11:04.890Z [CORTEX]::Error: ......................
2024-09-19T12:11:04.892Z [CORTEX]::Error: llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: found device: Apple M2 Pro
2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: picking default device: Apple M2 Pro
2024-09-19T12:11:04.893Z [CORTEX]::Error: ggml_metal_init: using embedded metal library
2024-09-19T12:11:04.894Z [CORTEX]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
2024-09-19T12:11:04.928Z [CORTEX]::Error: llama_kv_cache_init: Metal KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.20 MiB
2024-09-19T12:11:04.929Z [CORTEX]::Error: llama_new_context_with_model: Metal compute buffer size = 416.00 MiB
llama_new_context_with_model: CPU compute buffer size = 32.02 MiB
llama_new_context_with_model: graph nodes = 826
llama_new_context_with_model: graph splits = 2
2024-09-19T12:11:06.399Z [CORTEX]::Debug: Load model success with response {}
2024-09-19T12:11:06.399Z [CORTEX]::Debug: Validating model moondream2-f16.gguf
2024-09-19T12:11:06.400Z [CORTEX]::Debug: Validate model state with response 200
2024-09-19T12:11:06.401Z [CORTEX]::Debug: Validate model state success with response {"model_data":"{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","n_ctx":2048,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":false,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.0,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false}","model_loaded":true}
2024-09-19T12:11:06.408Z [CORTEX]::Error: libc++abi: terminating due to uncaught exception of type std::length_error: vector
2024-09-19T12:11:06.408Z [CORTEX]::Debug: 20240919 12:11:04.419177 UTC 3550098 DEBUG [LoadModel] Multi Modal Mode Enabled - llama_server_context.cc:159
20240919 12:11:06.301128 UTC 3550098 DEBUG [Initialize] Available slots: - llama_server_context.cc:225
20240919 12:11:06.301136 UTC 3550098 DEBUG [Initialize] -> Slot 0 - max context: 2048 - llama_server_context.cc:233
20240919 12:11:06.301210 UTC 3550098 INFO Started background task here! - llama_server_context.cc:252
20240919 12:11:06.301254 UTC 3550098 INFO Warm-up model: moondream2-f16.gguf - llama_engine.cc:819
20240919 12:11:06.301257 UTC 3550146 DEBUG [UpdateSlots] all slots are idle and system prompt is empty, clear the KV cache - llama_server_context.cc:1250
20240919 12:11:06.301262 UTC 3550146 DEBUG [KvCacheClear] Clear the entire KV cache - llama_server_context.cc:258
20240919 12:11:06.304526 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 0] - llama_server_context.cc:623
20240919 12:11:06.304589 UTC 3550146 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 0, p0: 0 - llama_server_context.cc:1544
20240919 12:11:06.397659 UTC 3550146 DEBUG [PrintTimings] PrintTimings: prompt eval time = 38.775ms / 1 tokens (38.775 ms per token, 25.7898130239 tokens per second) - llama_client_slot.cc:79
20240919 12:11:06.397667 UTC 3550146 DEBUG [PrintTimings] PrintTimings: eval time = 54.356 ms / 4 runs (13.589 ms per token, 73.5889322246 tokens per second)
20240919 12:11:06.397668 UTC 3550146 DEBUG [PrintTimings] PrintTimings: total time = 93.131 ms - llama_client_slot.cc:92
20240919 12:11:06.397727 UTC 3550146 INFO slot released: id_slot: 0, id_task: 0, n_ctx: 2048, n_past: 5, n_system_tokens: 0, n_cache_tokens: 0, truncated: 0 - llama_server_context.cc:1304
20240919 12:11:06.397739 UTC 3550098 INFO {"content":", Alien friend! Today","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","n_ctx":2048,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":false,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.0,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/louis/Library/Application Support/Jan/jan/models/moondream2-f16.gguf/moondream2-f16.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":54.356,"predicted_n":4,"predicted_per_second":73.58893222459342,"predicted_per_token_ms":13.589,"prompt_ms":38.775,"prompt_n":1,"prompt_per_second":25.78981302385558,"prompt_per_token_ms":38.775},"tokens_cached":5,"tokens_evaluated":1,"tokens_predicted":4,"truncated":false} - llama_engine.cc:827
20240919 12:11:06.397784 UTC 3550098 INFO Model loaded successfully: moondream2-f16.gguf - llama_engine.cc:216
20240919 12:11:06.400552 UTC 3550099 INFO Model status responded - llama_engine.cc:259
20240919 12:11:06.402786 UTC 3550100 INFO Request 1, model moondream2-f16.gguf: Generating response for inference request - llama_engine.cc:469
20240919 12:11:06.402791 UTC 3550100 INFO Request 1: Stop words:[
"<|END_OF_TURN_TOKEN|>",
"<end_of_turn>",
"[/INST]",
"<|end_of_text|>",
"<|eot_id|>",
"<|im_end|>",
"<|end|>"
]
20240919 12:11:06.402820 UTC 3550100 INFO Request 1: Base64 image detected - llama_engine.cc:549
20240919 12:11:06.406590 UTC 3550100 INFO Request 1: Streamed, waiting for respone - llama_engine.cc:608
20240919 12:11:06.406633 UTC 3550100 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
20240919 12:11:06.408420 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 - loaded image - llama_server_context.cc:562
20240919 12:11:06.408434 UTC 3550146 DEBUG [LaunchSlotWithData] slot 0 is processing [task id: 1] - llama_server_context.cc:623
20240919 12:11:06.408442 UTC 3550146 DEBUG [UpdateSlots] slot 0 : we have to evaluate at least 1 token to generate logits - llama_server_context.cc:1496
2024-09-19T12:11:06.409Z [CORTEX]::Debug: cortex exited with code: null
What is your OS?
MacOS
Windows
Linux