planning: Jan's path to cortex.cpp? #3690

dan-homebrew · 2024-09-17T12:00:09Z

Goal

Jan should be able to seamlessly move from Nitro to cortex.cpp
What is the scope of change?
- Different inference extensions? (e.g. nitro-extension, and cortex-extension?)
- Data Structures (old legacy folders, vs. new?)
- Separation of concerns (e.g. Jan used to be in charge of model downloads, now calls cortex.cpp instead?)
What is our strategy?
- Parallel: support both legacy and new
- Migration: move from old Nitro to new cortex.cpp?

Tasklist

Clearly articulate the architectural change that needs to happen
Clearly articulate the scope of changes we need to account for
Figure out our migration strategy


louis-jan · 2024-09-19T07:06:28Z

Scope of changes

Nitro Inference Extension
Model Extension
Monitoring Extension

Nitro inference extension

Current implementation

Register Models (pre-populate model.json files)
Any extension that registers models on load pre-populates a model.json file at /models/[model-id]/model.json.

sequenceDiagram
    participant ModelExtension
    participant BaseExtension
    participant FileSystem

    ModelExtension->>BaseExtension: Register Models
    BaseExtension->>BaseExtension: Pre-populate Data
    BaseExtension->>FileSystem: Write to /models

Load Model:
- Set additional .dll/.so PATH (for engine loading)
- Hardware Information (to decide engine binary)
- Run nitro server
- Parse prompt template
- Load a GGUF model with its file path and model settings (passed from App)

sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded

Inference (inheritance - OAIEngine.ts)
Any extensions inheriting from the Base OAI Engine class will forward requests to their respective inference endpoints.

sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions

Possible Changes

Current ❌	Upcoming ✅
Run Nitro server on model load	Run cortex.cpp daemon service on start
Kill nitro process on pre-model-load and pre-app-exit	Keep cortex-cpp alive, daemon process, stop on exit
Heavy hardware detection & prompt processing	Just send a request
So many requests (check port, check health, model load status)	One request to do the whole thing
Mixing of model management and inference - Multiple responsibilities	Single responsibility

Model extension

Current implementation

Download Model (ModelFile as payload)
Delete Model (ModelFile as payload)
Get Models (Scan through models folder and return ModelFile[])
Import Model (Generate ModelFile and download)
Fetch HF Repo Data (for HF model import selection)

App retrieves pre-populated models:

sequenceDiagram

App ->> ModelExtension: get available models
ModelExtension ->> FS: read /models
FS --> ModelExtension : ModelFile

App downloads a model:

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App imports a model

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
ModelExtension ->> model.json :generate
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App deletes a model

graph LR

App --> |remove| Model_Extension
Model_Extension --> |FS unlink| /models/model/__files__

Possible Changes

Current ❌	Upcoming ✅
Implementation - Depends on FS	Abstraction - API Forwarding
List Available Models: Scan through Model Folder	GET /models
Delete: Unlink FS	DELETE /models
Download: Download	POST /models/pull (with progress)
Broken Model Import - Using default model.json	cortex.cpp handles the model metadata
Model prediction depends on model size & available RAM/VRAM only	cortex.cpp predicts based on hardware and model.yaml
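
To make the "API forwarding" column concrete, here is a minimal sketch of how a model extension could translate its operations into request descriptors instead of touching the file system. The endpoints (GET /models, DELETE /models, POST /models/pull) come from this thread; the `ModelOperation` and `CortexRequest` types, and the way the model ID is carried (a path segment for delete, a JSON `model` field for pull), are assumptions for illustration only and not a confirmed cortex.cpp contract.

```typescript
// Hypothetical request descriptor; nothing here performs network I/O.
type HttpMethod = "GET" | "POST" | "DELETE";

interface CortexRequest {
  method: HttpMethod;
  path: string;
  body?: { model: string };
}

// The three model operations that used to depend on the file system.
type ModelOperation =
  | { kind: "list" }
  | { kind: "delete"; modelId: string }
  | { kind: "pull"; modelId: string };

// Map a model-extension operation to the endpoint named in the table above.
function toCortexRequest(op: ModelOperation): CortexRequest {
  switch (op.kind) {
    case "list":
      return { method: "GET", path: "/models" };
    case "delete":
      // Assumption: the model ID travels as a path segment.
      return { method: "DELETE", path: "/models/" + encodeURIComponent(op.modelId) };
    case "pull":
      // Assumption: the model ID travels in a JSON body under `model`.
      return { method: "POST", path: "/models/pull", body: { model: op.modelId } };
  }
}
```

Keeping the mapping as a pure function means the extension can be tested without a running daemon; the actual HTTP call would wrap the returned descriptor.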

System Monitoring extension

Current implementation

Get GPU Settings
Get System Information

App gets resource information

graph LR

App --> |getResourcesInfo| Model_Extension
Model_Extension --> |fetch| node-os-utils
Model_Extension --> |getCurrentLoad| nvidia-smi

Possible Changes

Current ❌	Upcoming ✅
Implementation - Depends on FS & CMD	Abstraction - API Forwarding
Execute CMD	GET - Hardware Information Endpoint

Overview

Current ❌

Upcoming ✅

Assumption

cortex.cpp bundles multiple engines (different CPU instructions and CUDA versions)
cortex.cpp supports /models APIs
- GET: /models (available, active status, compatibility prediction)
- POST: /models/pull (& progress?)
- DELETE: /models
cortex.cpp supports a /hardware-information API

Challenges of moving Nitro to cortex.cpp

Different Data (Folder & File) structures
Backward / Forward compatibility

The migration

How to seamlessly move from Nitro to cortex.cpp, where:
- cortex.cpp works with new Data Folder structure
- cortex.cpp works with model.yaml
- cortex.cpp works with models.list
How to maintain the data folder when users switch back to older versions?
- Older versions rely on model-extension, which searches for a model.json file within the Data Folder.
- Newer versions rely on cortex-extension, which searches for a model.yaml file within the Data Folder.

Let's think about a couple of our principles.

We don't remove or manipulate user data.
Rollback should always work.
Minimal migration

What are some of the main concerns here?

Can we use model.json and model.yaml side by side?
1. We should, since the model folder can contain anything: README.md, .gitignore, GGUF files, model.yaml, and model.json.
2. Older versions will still function with legacy model.json files.
3. Newer versions will work with the latest model.yaml files.
How to sync between those two?
1. It's hard to sync between the two, since their different structures could break the app.
2. We migrate just once, when no models.list is available. Its absence is a good trigger for migration.
3. After migrating, each app version works independently with its own model file format.
How about model pre-population? In other words, Model Hub.
1. Model pre-population is an anti-pattern. Pre-populated models do not work with versioning, and they create unwanted data that confuses users. What happens when our Model Hub lists thousands of models?
2. We implemented model import, which replaces the need for a pre-populated model file. Users can import with just the HF repo ID, so they have no reason to duplicate or edit a pre-populated model.json.
3. Model listing can be done from the extension.
4. In short, from the next version on, we don't pre-populate unwanted files in the Data Folder. Files are written only when users decide to download.
5. When users delete a model, the persisted model.yaml and model files are deleted.
How do other extensions work with their models? E.g., OpenAI
1. Remote models can be populated during the build rather than persisted. registerModels now keeps the model DTO in memory.
2. We don't pre-populate remote models, since it isn't necessary. Users are better off setting them from Extension Settings. It's more an Extension configuration than Model population.
Migration complexity and UX
1. We don't convert model.json to model.yaml. Instead, we import with a symlink. This could be faster, and it avoids adding redundant new logic to Jan. The migration stays lightweight and lower-risk. Maintaining the Model ID is key; otherwise, all threads break.
2. We don't move any files (e.g., GGUF), since that could make the migration take a long time.
3. What about new or manually added GGUFs? The model symlink feature is always there for that.
4. There were bad migration experiences in the past that we can avoid, such as:
  1. Migrating all pre-populated models
  2. Heavy file movement that drags out the migration
  3. Migrating everything at once
5. Now we just migrate downloaded models:
  1. Import downloaded models only, as symlinks (no file movement)
  2. Don't update the ID; changing it would cause data inconsistency
  3. Another thought: do we really need to wait for model.yaml creation during migration?
    1. Can cortex.cpp work with the models.list alone to provide available models?
    2. model.yaml generation is an asynchronous operation, so:
      1. It generates model.yaml as soon as the user tries to get or load a model.
      2. It generates model.yaml as soon as the user tries to import a model.
      3. Don't block the client GUI; the model list can be built from just the models.list contents. Any further operation on a given model can generate its model.yaml later.
      4. The client prioritizes the active thread's model, then the others, so that users' working threads are not blocked.
      5. If something goes wrong, the GGUF file is still there, and model.yaml can be generated later by another operation. Is the model.yaml file not strictly required, but just a cache of the model file's metadata?
Better cache mechanism
1. Model listing and details previously read from the file system; they now send API requests to cortex.cpp.
2. To prevent slow loading, the client should cache these responses on the frontend.

Summary

In short, the entire migration process just creates symlinks to downloaded models from models.list. No model.yaml or folder manipulation is involved. It should complete almost instantly?

Migration indicator: whether models.list exists.

Don't pre-populate models. Remote Extensions work with their own settings instead of pre-populated models. Cortex Extension registers the models available to pull (templates) in memory.

cortex.cpp is a daemon process that should be started alongside the app.
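
The migration rule summarized above can be sketched as a small planning function: migrate only when models.list is absent, import downloaded models only, and keep every Model ID unchanged. The `LegacyModel` and `SymlinkImport` shapes and the `planMigration` name are hypothetical; the sketch only decides what to import and does not create symlinks itself.

```typescript
interface LegacyModel {
  id: string;          // existing Model ID; threads reference it, so it must not change
  ggufPath: string;    // where the downloaded file already lives
  downloaded: boolean; // false for pre-populated entries that were never downloaded
}

interface SymlinkImport {
  modelId: string;
  sourcePath: string;
}

// Returns the imports to perform. An empty plan means there is nothing to migrate.
function planMigration(modelsListExists: boolean, legacy: LegacyModel[]): SymlinkImport[] {
  // models.list already present: the one-time migration has happened.
  if (modelsListExists) return [];
  return legacy
    .filter((m) => m.downloaded) // skip pre-populated, never-downloaded models
    .map((m) => ({ modelId: m.id, sourcePath: m.ggufPath })); // no file movement, same ID
}
```

Because the plan only references existing paths, rollback stays safe: older versions still find their model.json and GGUF files exactly where they were.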

Jan migration from 0.5.x to 0.6.0

louis-jan · 2024-09-19T07:19:27Z

Bundled Engines

Is it possible for cortex.cpp to bundle multiple engines but expose only one gateway?

E.g.
The client requests to load a llama.cpp model, and cortex.cpp determines which binary is compatible with the hardware and runs the most efficient one.

So:

Clients do not need to send extra engine parameters, or only a minimal one (the type).
Clients don't need to parse prompt templates; that's something the model should handle.
cortex.cpp owns the model metadata, allowing it to operate independently.
cortex.cpp hides the complex binary distribution behind a simple interface.
GPU ON/OFF and GPU selection can be done via engine /settings?

Eventually, that's all it needs to work with – the Model ID (aka model name).
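
As a rough illustration of that reduction, the sketch below contrasts what a client assembles today with what it would send once cortex.cpp owns the metadata. Both interfaces and all field names are hypothetical; they are not the actual Nitro or cortex.cpp payloads.

```typescript
// Hypothetical shape of what the client assembles today before loading a model.
interface LegacyLoadRequest {
  modelId: string;
  modelPath: string;      // path to the GGUF file
  promptTemplate: string; // parsed on the client
  engineBinary: string;   // chosen after client-side hardware detection
  settings: { ctxLen?: number; ngl?: number };
}

// Hypothetical minimal request once the server resolves everything from the ID.
interface SimplifiedLoadRequest {
  model: string;
  engine?: string; // optional engine type, only if the client wants to specify it
}

// Everything except the Model ID (and an optional engine type) is dropped.
function simplify(legacy: LegacyLoadRequest, engine?: string): SimplifiedLoadRequest {
  const request: SimplifiedLoadRequest = { model: legacy.modelId };
  if (engine !== undefined) request.engine = engine;
  return request;
}
```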

Simplify `model load / chat completions` request

louis-jan · 2024-09-19T08:11:52Z

Incremental Path

We do what's not related to cortex.cpp first: Remote Extensions & Pre-populated Models
1. Rather than pre-populating, enhance the model configurations.
2. registerModels now lists the models available for download and doesn't persist model.json.
Better data caching
1. Data retrieved from extensions should be cached on the frontend for subsequent loads.
2. Reduce direct API requests and perform more data synchronization operations.
3. A good cache layer would prevent a bad user experience during the later migration: the app wouldn't need to scan through the models list, but could just dump the cached data and import right away. It won't interrupt users' working threads, since asynchronous operations take care of data persistence (model.yaml), and model load requests typically have long response times anyway.
Minimal Migration Steps (cortex-cpp ready)
1. Generate models.list from cached data; there is no need to scan the Model Folder, which can be costly.
2. Send model import or symlink requests to generate models.list. It would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a models.list file. The model.yaml files can be generated asynchronously. (This would also cover the case where a user edits models.list manually.)
3. Update extensions to redirect requests.
4. The worst-case scenario is when users update from significantly older versions that lack the cache improvements. In that case, go through the model folders and send import requests during the app update.

sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"

    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models
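
The flow in the diagram can be sketched as a small in-memory registry, under the assumption that import only records the model in the list and defers model.yaml generation. The `ModelRegistry` class and its methods are hypothetical stand-ins, not cortex.cpp code; "generation" here is just bookkeeping, and the active thread's model is served first, as suggested earlier in the thread.

```typescript
class ModelRegistry {
  private listed: string[] = [];      // stands in for models.list
  private pendingYaml: string[] = []; // models still waiting for model.yaml
  private generated: string[] = [];   // models whose model.yaml exists

  // Import returns immediately; model.yaml generation is deferred.
  importModel(modelId: string): void {
    if (this.listed.indexOf(modelId) !== -1) return; // already imported
    this.listed.push(modelId);
    this.pendingYaml.push(modelId);
  }

  // The list is available before any model.yaml has been generated.
  listModels(): string[] {
    return this.listed.slice();
  }

  // Generate one pending model.yaml, preferring the active thread's model.
  generateNext(activeModelId?: string): string | undefined {
    const preferred = activeModelId === undefined ? -1 : this.pendingYaml.indexOf(activeModelId);
    const index = preferred === -1 ? 0 : preferred;
    const modelId = this.pendingYaml[index];
    if (modelId === undefined) return undefined; // nothing left to generate
    this.pendingYaml.splice(index, 1);
    this.generated.push(modelId);
    return modelId;
  }

  hasYaml(modelId: string): boolean {
    return this.generated.indexOf(modelId) !== -1;
  }
}
```

A client could call `listModels()` right after the imports and render the list, while a background loop calls `generateNext(activeModelId)` until it returns `undefined`.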

0xSage · 2024-09-29T15:04:16Z

This is really well thought through.

Questions @louis-jan :

What are the specific attributes Jan needs from the get hardware info endpoint?
- OS info
- CPU info
- RAM size (total, utilized)
- GPU SKU
- VRAM size (total, utilized)
- What else? (do you need storage info, or additional unforeseen stats?)
Do you need hardware configuration endpoints? i.e. does Jan ever need to let users change some hardware level configuration.
Dumb question, but do you need engine status endpoints? Or is the level of abstraction good enough at the model level
What are the Cortex sub-process endpoints needed? /keepalive /healthcheck?
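
To anchor the attribute list above, here is a hypothetical response shape for the hardware info endpoint, plus a rough fit check of the kind Jan performs today (model size against free RAM/VRAM). The field names and the `whereModelFits` helper are assumptions for discussion, not an agreed schema.

```typescript
interface MemoryUsage {
  totalBytes: number;
  usedBytes: number;
}

// Hypothetical shape covering OS, CPU, RAM, GPU SKU, and VRAM.
interface HardwareInfo {
  os: string;
  cpu: string;
  ram: MemoryUsage;
  gpus: { sku: string; vram: MemoryUsage }[];
}

function freeBytes(m: MemoryUsage): number {
  return Math.max(0, m.totalBytes - m.usedBytes);
}

// Rough check: does the model fit in the free VRAM of a single GPU, or else in free RAM?
function whereModelFits(modelBytes: number, hw: HardwareInfo): "gpu" | "cpu" | "none" {
  if (hw.gpus.some((g) => freeBytes(g.vram) >= modelBytes)) return "gpu";
  if (freeBytes(hw.ram) >= modelBytes) return "cpu";
  return "none";
}
```

The check shows why both total and utilized figures matter: a total-only response would not let the client tell a free GPU from a busy one.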

louis-jan · 2024-09-30T02:35:23Z

@0xSage All great questions. We discussed the Hardware Info endpoint in a related issue: janhq/cortex.cpp#1165 (comment)

Jan does not let users change hardware-level configuration, only Engine settings (CPU/GPU mode, CPU threads ...).
Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case for them in Jan, such as implementing a switch or update mechanism for the engine.
/healthcheck is needed, and is already implemented in cortex.cpp.

The one thing blocking this is download progress sync, for which we aligned on a socket approach.

0xSage · 2024-09-30T04:03:37Z

Nice, always great when the best answer is the simplest!! 🙏


louis-jan · 2024-10-02T07:56:08Z

Package cortex.cpp into Jan app:

The app will bundle all available cortex.cpp binaries, just like the cortex.cpp installer.
During an update, it executes cortex engines install --sources [path_to_binaries], or the equivalent via API. cortex will detect the hardware and install accordingly (from sources).

App Settings

Use cortex engines get to see the installed variant, then update the UI accordingly, e.g. GPU On/Off.
cortex.cpp will include a flag to select the variant, letting users choose their GPU from the app settings. It will then install the appropriate cortex engine version. Since all binaries are included, it's simply a matter of switching between variants. (idea: Options to select CPU and GPU binaries for llama-cpp engine cortex.cpp#1390)
Multiple GPU support is under investigation... TBD

Pros

No internet connection is required.
The same user experience.

Cons

The package becomes a bit larger because it now includes the CUDA DLLs. However, no additional downloads are needed.

cortex.cpp configurations on spawn

Config path
Host & Port
Log paths
Data folder path
HF Token (via Env?)
Proxy Configs (via Env?) - KIV for now
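
A minimal sketch of how Jan might collect these values before spawning cortex.cpp is shown below. The `CortexSpawnConfig` fields, the `buildSpawnPlan` helper, and the `HF_TOKEN` variable name are all hypothetical; how cortex.cpp actually receives each value (flags, config file, or environment) is not settled in this thread. Proxy settings are left out, as they are KIV.

```typescript
interface CortexSpawnConfig {
  configPath: string;
  host: string;
  port: number;
  logFolderPath: string;
  dataFolderPath: string;
  hfToken?: string; // optional; passed via environment if present
}

interface SpawnPlan {
  settings: { [key: string]: string };
  env: { [key: string]: string };
}

// Validate the config and split it into plain settings and environment variables.
function buildSpawnPlan(config: CortexSpawnConfig): SpawnPlan {
  const port = config.port;
  if (port < 1 || port > 65535 || Math.floor(port) !== port) {
    throw new Error("invalid port: " + port);
  }
  const env: { [key: string]: string } = {};
  // Assumption: the HF token is passed as an environment variable named HF_TOKEN.
  if (config.hfToken) env["HF_TOKEN"] = config.hfToken;
  return {
    settings: {
      configPath: config.configPath,
      host: config.host,
      port: String(port),
      logFolderPath: config.logFolderPath,
      dataFolderPath: config.dataFolderPath,
    },
    env,
  };
}
```

Keeping secrets such as the token in `env` rather than in `settings` avoids them showing up in process argument lists if the settings are later turned into command-line flags.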

louis-jan · 2024-10-02T16:11:22Z

Seamless Models Imports & Run

Since we no longer use models.list, imports are now simply based on the engine name.

App 0.6.x opens (given that users have updated to 0.5.5)
Trigger get models on load as usual (in the extension, asynchronously, in the background).
Go through the downloaded models (from the cache). This is very fast, since only previously downloaded models are involved.
Send model import requests for nitro models.
Persist the cache.

This operation runs asynchronously and won't affect the UX, since it works with cached data. Even if a model is not imported yet, it should still function normally (with the stateless model load endpoint). There will be a door for broken requests or attempts.

In case users are updating from a very old version, we run a scan on model.json files and persist the cache (as the current, legacy logic does), then continue with (1).
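
The steps above, including the fallback for very old versions, could look roughly like this. The `CachedModel` shape, the `imported` flag, and the `runImport` and `scanLegacy` names are hypothetical; the function only decides which import requests to send and what the persisted cache should look like afterwards.

```typescript
interface CachedModel {
  id: string;
  engine: string;    // e.g. "nitro" for legacy local models
  imported: boolean; // hypothetical flag: already imported into cortex.cpp
}

interface ImportRun {
  requests: string[];   // Model IDs to send import requests for
  cache: CachedModel[]; // cache to persist afterwards
}

function runImport(cache: CachedModel[], scanLegacy: () => CachedModel[]): ImportRun {
  // Very old versions have no cache yet: fall back to scanning model.json files.
  const models = cache.length > 0 ? cache : scanLegacy();
  const requests: string[] = [];
  const updated = models.map((m) => {
    // Only nitro models that were not imported yet need a request.
    if (m.engine !== "nitro" || m.imported) return m;
    requests.push(m.id);
    return { id: m.id, engine: m.engine, imported: true };
  });
  return { requests, cache: updated };
}
```

Running it a second time on the persisted cache yields no requests, so the import is safe to trigger on every app start.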

dan-homebrew · 2024-10-14T06:38:22Z

Current issues being faced:

Jan/Cortex Data Folder issues (backward compatible)

dan-homebrew · 2024-10-17T05:26:06Z

@dan-homebrew: will create an Implementation Issue:
@louis-jan Can you link the implementation-related issues here:

dan-homebrew changed the title ~~epic: Jan migration from Nitro to cortex.cpp~~ epic: Jan to start using cortex.cpp in addition to Nitro Sep 17, 2024

dan-homebrew changed the title ~~epic: Jan to start using cortex.cpp in addition to Nitro~~ epic: Jan's path to cortex.cpp? Sep 17, 2024

dan-homebrew mentioned this issue Sep 17, 2024

feat: Jan supports new Cortex's Model Folder and model.yaml architecture #3633

Closed

dan-homebrew assigned louis-jan Sep 18, 2024

imtuyethan added the P1: important Important feature / fix label Sep 18, 2024

louis-jan mentioned this issue Sep 25, 2024

bug: Fix Download location of Nitro TensorRT on Windows #3491

Closed

dan-homebrew added this to the v0.5.6 milestone Sep 29, 2024

louis-jan mentioned this issue Sep 30, 2024

chore: improve models and threads caching #3744

Merged

louis-jan mentioned this issue Oct 3, 2024

feat: support a local custom cortex engine #3764

Open

0xSage mentioned this issue Oct 13, 2024

bug: Jan is not using GPU #3737

Closed

3 tasks

dan-homebrew mentioned this issue Oct 13, 2024

planning: Remote API Extensions for Jan & Cortex #3786

Open

23 tasks

0xSage mentioned this issue Oct 14, 2024

planning: Jan and Cortex's Extension Framework #3773

Open

dan-homebrew changed the title ~~epic: Jan's path to cortex.cpp?~~ architecture: Jan's path to cortex.cpp? Oct 14, 2024

imtuyethan mentioned this issue Oct 14, 2024

bug: can enable GPU acceleration with cuda not installed - model fails to start #3762

Closed

4 tasks

0xSage added category: local engines category: cortex.cpp Related to cortex.cpp labels Oct 14, 2024

0xSage pinned this issue Oct 14, 2024

0xSage mentioned this issue Oct 15, 2024

epic: Jan installer & uninstaller #3557

Open

6 tasks

imtuyethan removed this from the v0.5.7 milestone Oct 15, 2024

dan-homebrew changed the title ~~architecture: Jan's path to cortex.cpp?~~ discussion: Jan's path to cortex.cpp? Oct 17, 2024

0xSage mentioned this issue Oct 17, 2024

epic: Jan integrates Cortex.cpp #3825

Closed

2 tasks

0xSage added category: providers Local & remote inference providers and removed category: local providers labels Oct 17, 2024

0xSage changed the title ~~discussion: Jan's path to cortex.cpp?~~ planning: Jan's path to cortex.cpp? Oct 17, 2024

0xSage added the type: planning Discussions, specs and decisions stage label Oct 17, 2024

imtuyethan mentioned this issue Oct 18, 2024

chore: Structure Icebox in Github Projects #3840

Open

This was referenced Oct 29, 2024

planning: Jan refactors APIs and State to Cortex #3895

Open

planning: Cortex.cpp features needed to fully support Jan janhq/cortex.cpp#1555

Closed
