This is the repository for ProLex, a novel benchmark that evaluates system performance on language proficiency-oriented lexical substitution, a new task that requires proposing substitutes that are not only contextually suitable but also demonstrate advanced-level proficiency. For example:
- Target word *w*: promotion (B2)
- Context *s*: This **promotion** has a beautiful and effective visual part, but they miss the real point: the product.
- Acceptable substitutes *w_a*: advertising (A2), marketing (B1), publicity (B2), campaign (B1), advertisement (A2)
- Proficiency-oriented substitutes *w_a^p*: publicity (B2)
Note that the proficiency level of each word is labeled based on the Common European Framework of Reference (CEFR). We use the CEFR Checker developed by Cathoven AI to label the CEFR level of each word in ProLex.
In general, this repository offers:
- The data format (CSV) for ProLex ✍️
- An instruction tuning pipeline with task-specific synthetic data
- A standardized evaluation pipeline
- [2024/05] ProLex gets accepted by ACL 2024 Findings! The camera-ready version is also released.
- [2024/01] 🔥 We released the very first version of ProLex. Read the paper for more details!
- Downloading the ProLex benchmark
- Environment settings
- Instruction-tuning pipelines
- Evaluating on ProLex
- Citation
- Questions
We prepare both the dev and test sets for ProLex ✍️. They can be downloaded from the following links:
ProLex is composed of quadruplets (*w*, *s*, *w_a*, *w_a^p*), each containing a target word, a context sentence, a list of acceptable substitutes, and a list of proficiency-oriented substitutes. We organize these contents into a CSV format. The columns are described as follows:

- `target word`: the target word as plain text.
- `Sentence`: the context sentence as plain text, with the target word enclosed in asterisks.
- `acc_subs`: a list of acceptable substitutes annotated by human experts.
- `unacc_subs`: a list of unacceptable substitutes annotated by human experts.
- `prof_acc_subs`: a list of advanced proficiency-oriented substitutes from `acc_subs`.
- `prof_unacc_subs`: a list of low-proficiency substitutes removed from `acc_subs`.
- `t_words_cefr`: the CEFR level of the `target word`.
- `prof_acc_cefr`: the CEFR levels of the substitutes in `prof_acc_subs`.
- `prof_unacc_cefr`: the CEFR levels of the substitutes in `prof_unacc_subs`.
Note that we encode the CEFR levels as integers ranging from 0 to 5; please refer to the following mapping to derive the CEFR labels. The CEFR checker we use currently has some limitations: for example, it cannot recognize certain uncommon words (e.g., *gasoline*), and it cannot assign CEFR levels to phrases. However, to encourage vocabulary diversity in ProLex, we retain the label `6` for acceptable but unknown words and `None` for acceptable phrases.
- `0`: A1
- `1`: A2
- `2`: B1
- `3`: B2
- `4`: C1
- `5`: C2
- `6`: unknown word
- `None`: phrases
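To make the encoding concrete, here is a minimal sketch of loading the dev set and decoding the CEFR labels. It assumes `pandas` is available and that the list-valued columns are serialized as Python-style string literals; adjust the parsing if the actual serialization differs.

```python
import ast
import pandas as pd

# Mapping from the integer encoding back to CEFR labels (6 = unknown word, None/NaN = phrase).
CEFR_LABELS = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2", 6: "unknown"}

df = pd.read_csv("data/dev/ProLex_v1.0_dev.csv")
row = df.iloc[0]

print("Target word:", row["target word"], "->", CEFR_LABELS.get(row["t_words_cefr"], "phrase"))
print("Sentence:", row["Sentence"])

# List-valued columns such as prof_acc_subs / prof_acc_cefr are assumed to be stored
# as string literals, e.g. "['publicity']"; parse them with ast.literal_eval.
subs = ast.literal_eval(row["prof_acc_subs"])
levels = ast.literal_eval(str(row["prof_acc_cefr"]))
for sub, level in zip(subs, levels):
    print(sub, "->", CEFR_LABELS.get(level, "phrase"))
```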
To set up the environment, clone the repository and create the conda environment:

```bash
git clone [email protected]:BillyZhang24kobe/LS_Proficiency.git
cd LS_Proficiency
conda env create -f environment.yml
conda activate LS_prof
```
We provide scripts to synthesize task-specific training data with GPT-4 and then fine-tune `Vicuna-1.5` and `Llama 2` on top of the synthetic data. In addition, we filter the dataset from Swords according to CEFR levels, which can also be used for training.
Get the synthetic data from GPT-4 with the following command. Note that you should create an `api_secrets.py` file in the root directory of the project and add your OpenAI API credentials to it before running the script.
```bash
python synthesize_train_gpt.py
```
Specifically, the script takes as input the raw data from `data/raw/toefl_1500.csv`, which contains sentences randomly selected from the TOEFL-11 corpus. The output is stored in `data/train/synthetic_gpt4.csv`.
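For reference, a minimal `api_secrets.py` might look like the sketch below. The variable names here are only assumptions for illustration; use whatever names `synthesize_train_gpt.py` actually imports.

```python
# api_secrets.py -- hypothetical sketch; match the variable names expected by synthesize_train_gpt.py
API_KEY = "sk-..."   # your OpenAI API key
ORG_ID = ""          # optional: your OpenAI organization ID, if applicable
```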
We take the dev and test sets from Swords and retrieve the CEFR labels of all target words and their corresponding acceptable substitutes (i.e., those with a score greater than 50%). We then remove the substitutes that demonstrate lower proficiency than the target words. The modified dataset can be downloaded here.
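The filtering step can be sketched roughly as follows. This is a minimal illustration rather than the exact script we ran: `get_cefr_level` is a hypothetical helper standing in for the CEFR Checker lookup, and each Swords entry is assumed to be a dictionary with a target word and scored substitutes.

```python
# Minimal sketch of the Swords filtering step (hypothetical data layout and helper names).
def filter_entry(entry, get_cefr_level):
    """Keep acceptable substitutes (score > 50%) whose CEFR level is at least
    that of the target word."""
    target_level = get_cefr_level(entry["target"])       # e.g. 3 for B2
    kept = []
    for substitute, score in entry["substitutes"]:       # (word, fraction of annotators accepting it)
        if score <= 0.5:
            continue                                     # not an acceptable substitute
        sub_level = get_cefr_level(substitute)
        if sub_level is not None and sub_level >= target_level:
            kept.append(substitute)                      # proficiency-preserving substitute
    return kept
```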
We use the following script to fine-tune the 7B `Vicuna-1.5` and `Llama 2` models. The configuration details are provided to replicate our experiment results. We build our experiments on top of FastChat. Please fill in `DEVICE`, `MODEL_PATH`, `DATA_PATH`, and `OUTPUT_MODEL_PATH` accordingly.
```bash
CUDA_VISIBLE_DEVICES={DEVICE} python fastchat/train/train_mem.py \
    --model_name_or_path {MODEL_PATH} \
    --data_path {DATA_PATH} \
    --bf16 True \
    --output_dir {OUTPUT_MODEL_PATH} \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False
```
Similarly, we use the following script to fine-tune the 13B `Vicuna-1.5` and `Llama 2` models. We conduct each of these experiments on two `NVIDIA A100-80G` GPUs.
```bash
CUDA_VISIBLE_DEVICES={DEVICE0},{DEVICE1} torchrun --nproc_per_node=2 --master_port=20003 fastchat/train/train_mem.py \
    --model_name_or_path {MODEL_PATH} \
    --data_path {DATA_PATH} \
    --bf16 True \
    --output_dir {OUTPUT_MODEL_PATH} \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```
We release the following fine-tuned Vicuna-1.5 checkpoints on Hugging Face:

| Size | Description | Hugging Face Repo |
|---|---|---|
| 7B | Our Vicuna-1.5 7B model fine-tuned on our task-specific synthetic data | Columbia-NLP/vicuna-7b-v1.5-syn-ProLex |
| 7B-DLS | Our Vicuna-1.5 7B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/vicuna-7b-v1.5-comb-ProLex |
| 13B | Our Vicuna-1.5 13B model fine-tuned on our task-specific synthetic data | Columbia-NLP/vicuna-13b-v1.5-syn-ProLex |
| 13B-DLS | Our Vicuna-1.5 13B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/vicuna-13b-v1.5-comb-ProLex |
We also release the following fine-tuned Llama 2 checkpoints:

| Size | Description | Hugging Face Repo |
|---|---|---|
| 7B | Our Llama-2 7B model fine-tuned on our task-specific synthetic data | Columbia-NLP/llama-2-7b-hf-syn-ProLex |
| 7B-DLS | Our Llama-2 7B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/llama-2-7b-hf-comb-ProLex |
| 13B | Our Llama-2 13B model fine-tuned on our task-specific synthetic data | Columbia-NLP/llama-2-13b-hf-syn-ProLex |
| 13B-DLS | Our Llama-2 13B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/llama-2-13b-hf-comb-ProLex |
We implement the evaluation pipeline for ProLex in `evaluate.py`. The following example demonstrates how to obtain substitutes predicted by our fine-tuned models for a given word-sentence pair. Feel free to plug in your own model checkpoints to produce the results.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from evaluate import format_test_prompt

# Load the model and tokenizer from HF
model = AutoModelForCausalLM.from_pretrained("Columbia-NLP/vicuna-7b-v1.5-syn-ProLex")
tokenizer = AutoTokenizer.from_pretrained("Columbia-NLP/vicuna-7b-v1.5-syn-ProLex")

target_word = "obligatory"
sentence = "Even though it was an **obligatory** experience, I could take part in a community program"

system_input = format_test_prompt(target_word, sentence)
input_ids = tokenizer.encode(system_input, return_tensors='pt', add_special_tokens=True)

# Generate the candidates.
model.eval()
with torch.no_grad():
    generated_ids = model.generate(
        input_ids,
        max_length=tokenizer.model_max_length,
        temperature=0.2
    )

# Decode the candidates.
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```
To evaluate on the whole dev/test set of ProLex, please run the following command, with `MODEL_PATH` being your own checkpoint to evaluate and `DATA_PATH` being either `data/dev/ProLex_v1.0_dev.csv` or `data/test/ProLex_v1.0_test.csv`.
```bash
python3 evaluate.py --model_name_or_path {MODEL_PATH} --data_path {DATA_PATH}
```
Alternatively, you can first store your model predictions in a CSV file and then evaluate it with our pipeline. An example prediction file is provided in `outputs/example_predictions.csv`; please follow its data format when constructing your own prediction file. The file path can be passed as `YOUR_FILE_PATH` in the following command:
```bash
python3 evaluate.py --model_name_or_path {YOUR_FILE_PATH} --data_path {DATA_PATH}
```
We highly appreciate your interest in our work. If you find ProLex ✍️ helpful, please consider citing our paper in your work:
Xuanming Zhang, Zixun Chen, and Zhou Yu. ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution. ACL 2024 (Findings).
```bibtex
@article{zhang2024prolex,
  title={ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution},
  author={Zhang, Xuanming and Chen, Zixun and Yu, Zhou},
  journal={arXiv preprint arXiv:2401.11356},
  year={2024}
}
```
Please reach out to us at [email protected] if you have any questions about using our benchmark. If you find an issue in either the source code or the dataset, please feel free to create a pull request and contribute to the benchmark!