llm-eval-test

A wrapper around lm-eval-harness and Unitxt designed for evaluation of a local inference endpoint.

Requirements

To Install

Python 3.10 or newer

To Run

An OpenAI API-compatible inference server, such as vLLM
A directory containing the necessary datasets for the benchmark (see example)

To Develop

PDM

Getting Started

# Create a Virtual Environment
python -m venv venv
source venv/bin/activate

# Install the package
pip install git+https://github.com/sjmonson/llm-eval-test.git

# View run options
llm-eval-test run --help

Download Usage

usage: llm-eval-test download [-h] [--catalog-path PATH] [--tasks-path PATH] [--offline | --no-offline] [-v | -q] -t TASKS [-d DATASETS] [-f | --force-download | --no-force-download]

download datasets for open-llm-v1 tasks

options:
  -h, --help            show this help message and exit

  -t TASKS, --tasks TASKS
                        comma separated tasks to download for example: arc_challenge,hellaswag
  -d DATASETS, --datasets DATASETS
                        Dataset directory
  -f, --force-download, --no-force-download
                        Force download datasets even it already exist
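
As a minimal sketch of the options above, the snippet below joins a list of task names into the comma-separated form that --tasks expects and prints the resulting download command. The task names come from the help text. The join_tasks helper and the ./datasets path are assumptions made for illustration, not part of llm-eval-test.

```shell
# Hypothetical helper: join arguments with commas, the form --tasks expects.
join_tasks() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change local.
  ( IFS=,; printf '%s\n' "$*" )
}

# Task names taken from the help text above.
TASKS=$(join_tasks arc_challenge hellaswag)

# Assumed dataset directory for this sketch.
DATASETS_DIR=./datasets

# Print the command rather than running it; run the printed line to download.
DOWNLOAD_CMD="llm-eval-test download --tasks $TASKS --datasets $DATASETS_DIR"
echo "$DOWNLOAD_CMD"
```

Printing the command first makes it easy to check the task list before any download starts.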

Run Usage

usage: llm-eval-test run [-h] [--catalog-path PATH] [--tasks-path PATH] [--offline | --no-offline] [-v | -q] -H ENDPOINT -m MODEL -t TASKS -d PATH [-T TOKENIZER] [-b INT] [-r INT] [-o OUTPUT | --no-output] [--format {full,summary}] [--chat-template | --no-chat-template]

Run tasks

options:
  -h, --help            show this help message and exit
  --catalog-path PATH   unitxt catalog directory
  --tasks-path PATH     lm-eval tasks directory
  --offline, --no-offline
                        Disable/enable updating datasets from the internet
  -v, --verbose         set loglevel to DEBUG
  -q, --quiet           set loglevel to ERROR
  -T, --tokenizer TOKENIZER
                        path or huggingface tokenizer name, if none uses model name (default: None)
  -b, --batch INT       per-request batch size
  -r, --retry INT       max number of times to retry a single request
  -o, --output OUTPUT   results output file
  --no-output           disable results output file
  --format {full,summary}
                        format of output file

required:
  -H, --endpoint ENDPOINT
                        OpenAI API-compatible endpoint
  -m, --model MODEL     name of the model under test
  -t, --tasks TASKS     comma separated list of tasks
  -d, --datasets PATH   path to dataset storage

prompt parameters:
  these modify the prompt sent to the server and thus will affect the results

  --chat-template, --no-chat-template
                        use chat template for requests
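
To show how the optional flags combine with the required ones, here is a sketch that assembles a run invocation and prints it. Every flag appears in the help text above. The specific values (batch size 8, 3 retries, results.json, the ./datasets path) are assumptions chosen for illustration, not recommended settings.

```shell
# Endpoint and model name as used elsewhere in this README.
ENDPOINT=http://127.0.0.1:8080/v1/completions
MODEL_NAME=meta-llama/Llama-3.1-8B
# Assumed dataset directory for this sketch.
DATASETS_DIR=./datasets

# Collect the arguments as positional parameters so each stays one word.
# The values for --batch, --retry, and --output are illustrative assumptions.
set -- run \
  --endpoint "$ENDPOINT" \
  --model "$MODEL_NAME" \
  --datasets "$DATASETS_DIR" \
  --tasks mmlu_pro \
  --batch 8 \
  --retry 3 \
  --output results.json \
  --format summary

# Print the command rather than running it; use: llm-eval-test "$@" to execute.
echo llm-eval-test "$@"
```

Because --chat-template changes the prompt sent to the server, it is left out here so the sketch matches the default prompt behavior.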

Example: MMLU-Pro Benchmark

# Create dataset directory
DATASETS_DIR=$(pwd)/datasets
mkdir -p $DATASETS_DIR

# Download the MMLU-Pro dataset
DATASET=TIGER-Lab/MMLU-Pro # Source dataset, for reference only; not used by the command below
llm-eval-test download --datasets $DATASETS_DIR --tasks mmlu_pro

# Run the benchmark
ENDPOINT=http://127.0.0.1:8080/v1/completions # An OpenAI API-compatible completions endpoint
MODEL_NAME=meta-llama/Llama-3.1-8B # Name of the model hosted on the inference server
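TOKENIZER=ibm-granite/granite-3.1-8b-instruct # Optional; unused below, since the tokenizer defaults to the model name unless --tokenizer is passed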
llm-eval-test run --endpoint $ENDPOINT --model $MODEL_NAME --datasets $DATASETS_DIR --tasks mmlu_pro
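
The same pattern could be repeated per task with a separate results file for each. The sketch below only prints the commands. The print_runs helper, the results_<task>.json naming, and the use of arc_challenge as a second task are assumptions for illustration.

```shell
ENDPOINT=http://127.0.0.1:8080/v1/completions
MODEL_NAME=meta-llama/Llama-3.1-8B
# Assumed dataset directory for this sketch.
DATASETS_DIR=./datasets

# Hypothetical helper: print one run command per task name given.
print_runs() {
  for task in "$@"; do
    echo "llm-eval-test run --endpoint $ENDPOINT --model $MODEL_NAME --datasets $DATASETS_DIR --tasks $task --output results_${task}.json"
  done
}

# Review the printed commands, then run them once the datasets are downloaded.
print_runs mmlu_pro arc_challenge
```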
