LLMs

This tool evaluates large language models on NLP tasks such as text classification and summarization. It implements a common API for traditional encoder-decoder models and prompt-based large language models, as well as hosted APIs such as OpenAI and Cohere.

The following features are currently available:

Prompting and truncation logic
Support for vanilla LLMs (OPT, LLaMA) and instruction-tuned models (T0, Alpaca)
Evaluation based on 🤗 Datasets or CSV files
Memoization: inference outputs are cached on disk
Parallelized computation of metrics
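
To illustrate the memoization idea, here is a minimal sketch of an on-disk cache for inference outputs. It is not this tool's actual implementation: the function name `cached_call`, the JSON file layout, and the hashing scheme are all assumptions made for the example. The only detail taken from this README is that the model name is part of the cache key.

```python
import hashlib
import json
from pathlib import Path


def cached_call(cache_dir, model_name, prompt, generate):
    """Return the cached output for (model_name, prompt); call `generate` on a miss.

    Hypothetical helper for illustration only.
    """
    # The key depends on the model name and the prompt, not on where the
    # checkpoint lives on disk.
    raw_key = json.dumps([model_name, prompt]).encode("utf-8")
    key = hashlib.sha256(raw_key).hexdigest()
    path = Path(cache_dir) / f"{key}.json"

    if path.exists():
        # Cache hit: reuse the stored output and skip inference.
        return json.loads(path.read_text(encoding="utf-8"))["output"]

    # Cache miss: run inference once and persist the result.
    output = generate(prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"output": output}), encoding="utf-8")
    return output
```

With a cache like this, re-running an evaluation over the same samples only pays the inference cost for prompts that have not been seen before.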

Setup

git clone https://github.com/thefonseca/llms.git
cd llms && pip install -e .

Classification

llm-classify \
--model_name llama-2-7b-chat \
--model_checkpoint_path path_to_llama2_checkpoint \
--model_dtype float16 \
--dataset_name imdb \
--split test \
--source_key text \
--target_key label \
--model_labels "{'Positive':1,'Negative':0}" \
--max_samples 1000
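
The value passed to --model_labels above is written as a Python dict literal that maps label names to dataset label ids. The sketch below shows one way such a string could be parsed and used to turn generated text into a label id. The helper names `parse_labels` and `text_to_label`, and the prefix-matching rule, are assumptions for illustration; the tool's own matching logic may differ.

```python
import ast


def parse_labels(spec):
    """Parse a dict literal such as "{'Positive':1,'Negative':0}" safely."""
    labels = ast.literal_eval(spec)  # evaluates literals only, never arbitrary code
    if not isinstance(labels, dict):
        raise ValueError("label spec must be a dict literal")
    return labels


def text_to_label(output, labels, default=None):
    """Map generated text to a label id by case-insensitive prefix match.

    Hypothetical matching rule: the output must start with a label name.
    """
    text = output.strip().lower()
    for name, value in labels.items():
        if text.startswith(name.lower()):
            return value
    return default


labels = parse_labels("{'Positive':1,'Negative':0}")
prediction = text_to_label(" positive.", labels)
```

Outputs that match no label name fall back to `default`, so they can be counted separately from wrong predictions.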

Summarization

Evaluating BigBird on the PubMed validation split and saving the results to the output folder:

llm-summarize \
--dataset_name scientific_papers \
--dataset_config pubmed \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name google/bigbird-pegasus-large-pubmed \
--output_dir output

where --model_name is a Hugging Face model identifier.

Summarizing a single arXiv paper with Alpaca (float16):

llm-summarize \
--arxiv_id https://arxiv.org/abs/2304.15004v1 \
--model_name alpaca-7b \
--model_checkpoint_path path_to_alpaca_checkpoint \
--budget 7 \
--budget_unit sentences \
--model_dtype float16 \
--output_dir output

Notes:

--budget controls the length of summaries from instruction-tuned models (by default, in sentences).
--model_checkpoint_path allows changing the checkpoint folder while keeping the cache key (--model_name) constant.
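
One plausible way to apply a length budget to an instruction-tuned model is to state it in the prompt. The sketch below assumes that approach; the function `build_instruction` and the prompt wording are hypothetical and are not taken from this tool's source.

```python
def build_instruction(article, budget=7, budget_unit="sentences"):
    """Build a summarization prompt that states the length budget.

    Hypothetical prompt template for illustration only.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    # Use the singular form when the budget is exactly one unit.
    unit = budget_unit
    if budget == 1 and budget_unit.endswith("s"):
        unit = budget_unit[:-1]
    return (
        f"Summarize the article below in at most {budget} {unit}.\n\n"
        f"Article: {article}\n\n"
        "Summary:"
    )


prompt = build_instruction("Large language models are ...", budget=7)
```

A budget stated in the prompt is only a request to the model, so the generated summary may still need to be checked or truncated afterwards.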

Evaluating the ChatGPT API on the arXiv validation split:

export OPENAI_API_KEY=<your_api_key>
llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name gpt-3.5-turbo \
--output_dir output

Evaluating summary predictions from a CSV file:

llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--prediction_path path_to_csv_file \
--prediction_key prediction \
--max_samples 1000 \
--output_dir output
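
As a sketch of how a predictions file for the command above could be produced, the snippet below writes one summary per row under a `prediction` column, matching the --prediction_key value. It assumes that rows follow the order of the dataset samples and that no other columns are required; neither assumption is confirmed by this README, and `write_predictions` is a hypothetical helper.

```python
import csv


def write_predictions(path, predictions, key="prediction"):
    """Write one prediction per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[key])
        writer.writeheader()
        for text in predictions:
            # The csv module quotes commas and newlines inside summaries.
            writer.writerow({key: text})
```

The resulting file path can then be passed as --prediction_path.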
