WRAVAL helps evaluate LLMs on writing assistant tasks such as summarization, professional tone, witty tone, etc.
With the popularity of large language models (LLMs), the focus of language model (LM) evaluation has shifted strongly toward problem-solving and reasoning tasks, thus targeting a form of general intelligence. Small Language Models (SLMs), defined here as LMs under 10B parameters, score low on these forms of evaluation, sometimes 3-4 times lower than LLMs. We show that many of the most popular industrial uses of LLMs, including tone change (e.g., funny, serious, professional), are not accurately reflected by these metrics. This paper proposes an evaluation framework that highlights SLMs' strengths on non-reasoning tasks that lack a predefined evaluation dataset. We contribute data generation, prompt tuning, and LLM-as-a-judge evaluation, and show how this framework highlights the potential of finetuning for a set of specific tasks. Our framework helps practitioners benchmark SLMs and LLMs on the tasks they are good at, and reinforces their usefulness in edge and private computing.
```bash
pip install -r requirements.txt
python main.py generate
```
- Start by generating evaluation data for each of the writing assistant tasks (a.k.a. tones) [here](1. data_generation.ipynb).
- You can then use Bedrock-hosted models ([here](2.b. Haiku_tones.ipynb)) or self-hosted models ([here](2.a. SLM_tones.ipynb)) to play the role of a writing assistant.
- You can use an LLM-as-a-judge to evaluate these models ([here](3. judge_eval.ipynb)); a minimal sketch of this step follows the list.
- Finally, you can set up a SageMaker Ground Truth task [here](4 human_eval.py).
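For a rough idea of the judge step, the sketch below scores one rewrite with a Bedrock-hosted judge through boto3's `converse` API. The model ID, rubric wording, and 1-5 scale are illustrative assumptions, not the exact prompts used in the judge notebook.

```python
# Minimal LLM-as-a-judge sketch. Model ID, rubric, and scale are assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = """You are grading a writing assistant.
Task: rewrite the input in a {tone} tone.
Input: {source}
Output: {rewrite}
Rate the output from 1 (poor) to 5 (excellent) for tone fidelity and fluency.
Answer with the number only."""

def judge(source: str, rewrite: str, tone: str = "witty") -> int:
    """Ask a Bedrock judge model to score a single tone rewrite."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed judge model
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                tone=tone, source=source, rewrite=rewrite)}],
        }],
    )
    # The prompt asks for a bare number; a production judge would parse defensively.
    return int(response["output"]["message"]["content"][0]["text"].strip())
```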
An additional notebook is provided to benchmark models on translation tasks using open datasets here.
Generate synthetic data for tone transformation using various LLMs. Data is saved to CSV files with timestamps and can optionally be uploaded to S3.
```bash
# By default generates all tone types. A specific tone and model can be specified.
python main.py generate --type witty_sentences --model nova-lite
```
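Conceptually, the saving side of `generate` looks something like the sketch below; the helper name and the optional bucket argument are hypothetical, and `main.py` holds the actual implementation.

```python
# Hypothetical sketch of the save step: write a timestamped CSV under
# ~/data and optionally upload it to S3. Names here are illustrative.
from datetime import datetime
from pathlib import Path

import boto3
import pandas as pd

def save_tone_data(rows: list[dict], bucket: str | None = None) -> Path:
    out_dir = Path.home() / "data"
    out_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # assumed format
    out_path = out_dir / f"all-tones-{timestamp}.csv"
    pd.DataFrame(rows).to_csv(out_path, index=False)
    if bucket:  # optional S3 upload
        boto3.client("s3").upload_file(str(out_path), bucket, out_path.name)
    return out_path
```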
- `witty_sentences`: Factual sentences to be made witty
- `professional_sentences`: Casual sentences to be made professional
- `casual_sentences`: Formal sentences to be made casual
- `elaborate_sentences`: Simple sentences to be made detailed
- `shorten_sentences`: Wordy sentences to be made concise
- `improve_sentences`: Poorly written sentences to be improved
- `keypoints_sentences`: Detailed paragraphs for key point extraction
- `proofread_sentences`: Sentences with errors to be corrected
- `emojify_sentences`: Plain sentences to be enhanced with emojis
- `paragraph_summary`: Paragraph-summary pairs
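As a hypothetical illustration, the tone types above could map to generation prompts roughly like this; the real prompt library lives in the data generation notebook.

```python
# Illustrative prompt library keyed by tone type. The wording is an
# assumption; the repository's actual prompts are in the notebooks.
GENERATION_PROMPTS = {
    "witty_sentences": "Write {n} short factual sentences that could be rewritten to be witty.",
    "professional_sentences": "Write {n} casual sentences that could be rewritten professionally.",
    "shorten_sentences": "Write {n} wordy sentences that could be made concise.",
    # ...one entry per tone type listed above
}

def build_generation_prompt(tone_type: str, n: int = 10) -> str:
    """Fill in the request for a given tone type."""
    return GENERATION_PROMPTS[tone_type].format(n=n)
```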
- Generated data is saved to `~/data/all-tones-{timestamp}.csv`
- Raw outputs are saved to `~/data/{tone_type}_raw.txt`
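To inspect the most recent run, something like the following works, assuming the default output directory and that the timestamp sorts lexicographically:

```python
# Load the newest generated dataset from the default output directory.
from pathlib import Path

import pandas as pd

latest = max((Path.home() / "data").glob("all-tones-*.csv"))
df = pd.read_csv(latest)
print(df.head())
```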
You can use the CloudFormation YAML to start a SageMaker notebook with the permissions to call Bedrock models (make sure you enable the Bedrock models in your AWS console beforehand).
- run Qwen and Phi as standalone SageMaker endpoints
- requirements.txt (uv?)
- data
- 1. data generation -> prompt library
- 2.b. LLM -> implement this in a modular way in `format_prompt_as_xml`