LLMs

This is a tool for evaluating large language models on NLP tasks such as text classification and summarization. It provides a common API for traditional encoder-decoder models and prompt-based large language models, as well as hosted APIs such as OpenAI and Cohere.

The following features are currently available:

  • Prompting and truncation logic
  • Support for vanilla LLMs (OPT, LLaMA) and instruction-tuned models (T0, Alpaca)
  • Evaluation based on 🤗 Datasets or CSV files
  • Memoization: inference outputs are cached on disk
  • Parallelized computation of metrics

Setup

git clone https://github.com/thefonseca/llms.git
cd llms && pip install -e .

Classification

llm-classify \
--model_name llama-2-7b-chat \
--model_checkpoint_path path_to_llama2_checkpoint \
--model_dtype float16 \
--dataset_name imdb \
--split test \
--source_key text \
--target_key label \
--model_labels "{'Positive':1,'Negative':0}" \
--max_samples 1000
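
The --model_labels value above is a Python-style dict literal that maps label names to dataset label ids. As a minimal sketch of how such a string could be parsed safely, the snippet below uses ast.literal_eval. The helper name parse_label_map is hypothetical, and this is an assumption about one reasonable approach, not necessarily how llm-classify handles the flag internally.

```python
import ast


def parse_label_map(raw: str) -> dict:
    """Parse a dict literal such as "{'Positive':1,'Negative':0}" (hypothetical helper)."""
    labels = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    if not isinstance(labels, dict):
        raise ValueError("expected a dict literal mapping label names to ids")
    return {str(name): int(idx) for name, idx in labels.items()}


label_map = parse_label_map("{'Positive':1,'Negative':0}")
# Inverse mapping, useful for turning dataset label ids back into names.
id_to_name = {idx: name for name, idx in label_map.items()}
print(label_map)
print(id_to_name)
```

Quoting the whole value in double quotes on the command line, as in the example above, keeps the shell from interpreting the braces and single quotes.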

Summarization

Evaluating BigBird on the PubMed validation split and saving the results to the output folder:

llm-summarize \
--dataset_name scientific_papers \
--dataset_config pubmed \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name google/bigbird-pegasus-large-pubmed \
--output_dir output

where --model_name is a Hugging Face model identifier.

Summarizing a single arXiv paper with Alpaca (float16), selected via --arxiv_id:

llm-summarize \
--arxiv_id https://arxiv.org/abs/2304.15004v1 \
--model_name alpaca-7b \
--model_checkpoint_path path_to_alpaca_checkpoint \
--budget 7 \
--budget_unit sentences \
--model_dtype float16 \
--output_dir output

Notes:

  • --budget controls the length of summaries from instruction-tuned models (by default, in sentences).
  • --model_checkpoint_path allows changing the checkpoint folder while keeping the cache key (--model_name) constant.
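
To illustrate why keeping the cache key constant matters, here is a minimal sketch of on-disk memoization in which the key is derived from the model name and the input text, while the checkpoint path is deliberately left out. The function names (cache_key, cached_generate, fake_generate) and the JSON file layout are hypothetical and only show the idea; the tool's actual cache format may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(model_name: str, prompt: str) -> str:
    """Hash the model name and input; the checkpoint path is not part of the key."""
    payload = json.dumps({"model": model_name, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(cache_dir: Path, model_name: str, prompt: str, generate):
    """Return a cached output if present; otherwise call `generate` and store the result."""
    path = cache_dir / f"{cache_key(model_name, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["output"]
    output = generate(prompt)
    path.write_text(json.dumps({"output": output}), encoding="utf-8")
    return output


calls = []


def fake_generate(prompt: str) -> str:
    """Stand-in for an expensive model call; records each invocation."""
    calls.append(prompt)
    return prompt.upper()


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = cached_generate(cache_dir, "alpaca-7b", "summarize this", fake_generate)
    second = cached_generate(cache_dir, "alpaca-7b", "summarize this", fake_generate)
    print(first, second, len(calls))
```

Under this scheme, moving the checkpoint to a different folder does not invalidate earlier outputs, because only the model name and the input contribute to the key.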

Evaluating the ChatGPT API on the arXiv validation split:

export OPENAI_API_KEY=<your_api_key>
llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name gpt-3.5-turbo \
--output_dir output

Evaluating summary predictions from a CSV file:

llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--prediction_path path_to_csv_file \
--prediction_key prediction \
--max_samples 1000 \
--output_dir output
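
As a rough sketch of how a predictions file for --prediction_path might be produced, the snippet below writes a CSV whose single column is named prediction, matching the --prediction_key value in the command above. The one-row-per-sample layout in dataset order is an assumption, and the summary strings are placeholders rather than real model outputs.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder summaries; in practice these would come from a model.
predictions = [
    "placeholder summary for sample 0",
    "placeholder summary for sample 1",
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "predictions.csv"

    # Write one row per sample under the column named by --prediction_key.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["prediction"])
        writer.writeheader()
        for text in predictions:
            writer.writerow({"prediction": text})

    # Read the file back to confirm the column round-trips unchanged.
    with csv_path.open(newline="", encoding="utf-8") as f:
        loaded = [row["prediction"] for row in csv.DictReader(f)]

print(loaded)
```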