Implementation of large language models in Java using the Vector API.
This implementation extends llama3.java by Alfonso² Peterssen, which in turn is based on Andrej Karpathy's llama2.c and minbpe projects. The extension adds:
- An internal HTTP server that serves OpenAI-like requests.
- An optional display of important attentions, showing which tokens the model attends to most.
- Persisting of the KV-cache to a file.
Supported models are:
- DeepSeek-R1-Distill-Qwen-1.5B-Q8_0
- Llama-3 (Llama-3.2, Llama-3.3)
- Phi-3 (CLI only, no HTTP server)
- Qwen-2.5
This project has no dependencies on other libraries. If you need a full-featured framework, you may have a look at LangChain for Java.
If the display of attentions is enabled, the attentions with long value vectors are displayed. For example, in a translation from English to Chinese, at the token "三" the model might be interested in the English word "three" (see the picture below).
Link to animation:
Options:
--model, -m <path> required, path to .gguf file
--interactive, --chat, -i run in chat mode
--instruct run in instruct (once) mode, default mode
--prompt, -p <string> input prompt
--system-prompt, -sp <string> (optional) system prompt
--temperature, -temp <float> temperature in [0,inf], default 0.1
--top-p <float> p value in top-p (nucleus) sampling in [0,1] default 0.95
--seed <long> random seed, default System.nanoTime()
--max-tokens, -n <int> number of steps to run for, < 0 = limited by context length, default 512
--stream <boolean> print tokens during generation; may cause encoding artifacts for non-ASCII text, default true
--echo <boolean> print ALL tokens to stderr; if true, it is recommended to set --stream=false, default false
--host <ip> optional IP address of the HTTP server (default 127.0.0.1)
--port <port> optional port number of the HTTP server (default 8080)
--path <path> optional path to the public HTML folder of the HTTP server
--state-cache-folder <path> optional folder to store state caches (writes .ggsc files)
--state-cache, -sc <path> optional state cache to be used (reads a .ggsc file)
--attention-trace <int> maximum number of attentions to be traced per token
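For illustration, a run that persists its state cache might look like the following (the model path and cache folder are placeholders; the name of the written .ggsc file is chosen by the program):

java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -p "Why is the sky blue?" --state-cache-folder .../state-caches

A later run could then reuse a saved cache by passing --state-cache .../state-caches/<file>.ggsc.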
Start an HTTP server serving DeepSeek-R1-Distill-Qwen-1.5B, using the web-UI of llama.cpp:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -i --path .../git/llama.cpp/examples/server/public -n 2000
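Once the server is running, it can be queried with an OpenAI-style chat-completions request. The following curl call is a sketch: the endpoint path /v1/chat/completions and the accepted JSON fields are assumptions based on the OpenAI convention, not a verified description of this server's API.

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Why is the sky blue?"}]}'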
Start an HTTP server using the internal web-UI to display attentions:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -sp "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -i --attention-trace 3 --path src/main/webapp
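With the default host and port (see the --host and --port options above), the internal web-UI should then be reachable at http://127.0.0.1:8080/ in a browser.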