This is the repository for ProLex, a novel benchmark that evaluates system performance on language proficiency-oriented lexical substitution, a new task that requires proposing substitutes that are not only contextually suitable but also demonstrate advanced-level proficiency. For example:
- Target word *w*: promotion (B2)
- Context *s*: This **promotion** has a beautiful and effective visual part, but they miss the real point: the product.
- Acceptable substitutes *w_a*: advertising (A2), marketing (B1), publicity (B2), campaign (B1), advertisement (A2)
- Proficiency-oriented substitutes *w_a^p*: publicity (B2)
Note that the proficiency level of each word is labeled based on the Common European Framework of Reference (CEFR). We use the CEFR Checker developed by Cathoven AI to label the CEFR level of each word in ProLex.
In general, this repository offers:
- The data format (CSV) for ProLex ✍️
- An instruction tuning pipeline with task-specific synthetic data
- A standardized evaluation pipeline
- [2024/05] ProLex gets accepted by ACL 2024 Findings! The camera-ready version is also released.
- [2024/01] 🔥 We released the very first version of ProLex. Read the paper for more details!
- Downloading the ProLex benchmark
- Environment settings
- Instruction-tuning pipelines
- Evaluating on ProLex
- Citation
- Questions
We prepare both the dev and test sets for ProLex ✍️. They can be downloaded from the following links:
ProLex is composed of quadruplets (*w*, *s*, *w_a*, *w_a^p*), each containing a target word, a context sentence, a list of acceptable substitutes, and a list of proficiency-oriented substitutes. We organize these contents into a CSV format. The columns are described as follows:

- `target word`: the target word as plain text.
- `Sentence`: the context sentence as plain text, with the target word enclosed in asterisks.
- `acc_subs`: a list of acceptable substitutes annotated by human experts.
- `unacc_subs`: a list of unacceptable substitutes annotated by human experts.
- `prof_acc_subs`: a list of advanced proficiency-oriented substitutes from `acc_subs`.
- `prof_unacc_subs`: a list of low-proficiency substitutes removed from `acc_subs`.
- `t_words_cefr`: the CEFR level of the `target word`.
- `prof_acc_cefr`: the CEFR levels of the substitutes in `prof_acc_subs`.
- `prof_unacc_cefr`: the CEFR levels of the substitutes in `prof_unacc_subs`.
Note that we encode the CEFR levels as integers ranging from 0 to 5; please refer to the following mapping to derive the CEFR labels. The CEFR checker we use currently has some limitations: for example, it cannot recognize certain uncommon words (e.g., *gasoline*), and it cannot assign CEFR levels to phrases. However, to encourage vocabulary diversity in ProLex, we retain the label `6` for acceptable but unknown words and `None` for acceptable phrases.
- `0`: A1
- `1`: A2
- `2`: B1
- `3`: B2
- `4`: C1
- `5`: C2
- `6`: unknown word
- `None`: phrases
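To make the encoding concrete, here is a minimal sketch of loading the dev set and decoding the CEFR labels. It assumes `pandas` is available and that the list-valued columns are serialized as Python-style string literals; adjust the parsing if the actual serialization differs.

```python
import ast
import pandas as pd

# Mapping from the integer encoding back to CEFR labels (6 = unknown word, None/NaN = phrase).
CEFR_LABELS = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2", 6: "unknown"}

df = pd.read_csv("data/dev/ProLex_v1.0_dev.csv")
row = df.iloc[0]

print("Target word:", row["target word"], "->", CEFR_LABELS.get(row["t_words_cefr"], "phrase"))
print("Sentence:", row["Sentence"])

# List-valued columns such as prof_acc_subs / prof_acc_cefr are assumed to be stored
# as string literals, e.g. "['publicity']"; parse them with ast.literal_eval.
subs = ast.literal_eval(row["prof_acc_subs"])
levels = ast.literal_eval(str(row["prof_acc_cefr"]))
for sub, level in zip(subs, levels):
    print(sub, "->", CEFR_LABELS.get(level, "phrase"))
```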
To set up the environment, clone the repository and create the conda environment:

```bash
git clone [email protected]:BillyZhang24kobe/LS_Proficiency.git
cd LS_Proficiency
conda env create -f environment.yml
conda activate LS_prof
```
We provide scripts to synthesize task-specific training data with GPT-4 and then fine-tune `Vicuna-1.5` and `Llama 2` on top of the synthetic data. In addition, we filter the dataset from Swords according to CEFR levels, which can also be used for training.
Get the synthetic data from GPT-4 with the following command. Note that you should create an `api_secrets.py` file in the root directory of the project and add your OpenAI API credentials to it before running the script.
```bash
python synthesize_train_gpt.py
```
Specifically, the script takes as input the raw data from `data/raw/toefl_1500.csv`, which contains sentences randomly selected from the TOEFL-11 corpus. The output is stored in `data/train/synthetic_gpt4.csv`.
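For reference, a minimal `api_secrets.py` might look like the sketch below. The variable names here are only assumptions for illustration; use whatever names `synthesize_train_gpt.py` actually imports.

```python
# api_secrets.py -- hypothetical sketch; match the variable names expected by synthesize_train_gpt.py
API_KEY = "sk-..."   # your OpenAI API key
ORG_ID = ""          # optional: your OpenAI organization ID, if applicable
```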
We take the dev and test sets from Swords and retrieve the CEFR labels of all target words and their corresponding acceptable substitutes (i.e., those with a score greater than 50%). We then remove the substitutes that demonstrate lower proficiency than the target words. The modified dataset can be downloaded here.
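The filtering step can be sketched roughly as follows. This is a minimal illustration rather than the exact script we ran: `get_cefr_level` is a hypothetical helper standing in for the CEFR Checker lookup, and each Swords entry is assumed to be a dictionary with a target word and scored substitutes.

```python
# Minimal sketch of the Swords filtering step (hypothetical data layout and helper names).
def filter_entry(entry, get_cefr_level):
    """Keep acceptable substitutes (score > 50%) whose CEFR level is at least
    that of the target word."""
    target_level = get_cefr_level(entry["target"])       # e.g. 3 for B2
    kept = []
    for substitute, score in entry["substitutes"]:       # (word, fraction of annotators accepting it)
        if score <= 0.5:
            continue                                     # not an acceptable substitute
        sub_level = get_cefr_level(substitute)
        if sub_level is not None and sub_level >= target_level:
            kept.append(substitute)                      # proficiency-preserving substitute
    return kept
```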
We use the following script to fine-tune the 7B `Vicuna-1.5` and `Llama 2` models. The configuration details are provided to replicate our experiment results. We build our experiments on top of FastChat. Please fill in `DEVICE`, `MODEL_PATH`, `DATA_PATH`, and `OUTPUT_MODEL_PATH` accordingly.
```bash
CUDA_VISIBLE_DEVICES={DEVICE} python fastchat/train/train_mem.py \
    --model_name_or_path {MODEL_PATH} \
    --data_path {DATA_PATH} \
    --bf16 True \
    --output_dir {OUTPUT_MODEL_PATH} \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False
```
Similarly, we use the following script to fine-tune the 13B `Vicuna-1.5` and `Llama 2` models. We conduct each of these experiments on two `NVIDIA A100-80G` GPUs.
```bash
CUDA_VISIBLE_DEVICES={DEVICE0},{DEVICE1} torchrun --nproc_per_node=2 --master_port=20003 fastchat/train/train_mem.py \
    --model_name_or_path {MODEL_PATH} \
    --data_path {DATA_PATH} \
    --bf16 True \
    --output_dir {OUTPUT_MODEL_PATH} \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```
We release the following fine-tuned Vicuna-1.5 checkpoints on Hugging Face:

| Size | Description | Hugging Face Repo |
|---|---|---|
| 7B | Our Vicuna-1.5 7B model fine-tuned on our task-specific synthetic data | Columbia-NLP/vicuna-7b-v1.5-syn-ProLex |
| 7B-DLS | Our Vicuna-1.5 7B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/vicuna-7b-v1.5-comb-ProLex |
| 13B | Our Vicuna-1.5 13B model fine-tuned on our task-specific synthetic data | Columbia-NLP/vicuna-13b-v1.5-syn-ProLex |
| 13B-DLS | Our Vicuna-1.5 13B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/vicuna-13b-v1.5-comb-ProLex |
We also release the following fine-tuned Llama 2 checkpoints:

| Size | Description | Hugging Face Repo |
|---|---|---|
| 7B | Our Llama-2 7B model fine-tuned on our task-specific synthetic data | Columbia-NLP/llama-2-7b-hf-syn-ProLex |
| 7B-DLS | Our Llama-2 7B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/llama-2-7b-hf-comb-ProLex |
| 13B | Our Llama-2 13B model fine-tuned on our task-specific synthetic data | Columbia-NLP/llama-2-13b-hf-syn-ProLex |
| 13B-DLS | Our Llama-2 13B model fine-tuned on DLS, the combination of synthetic data and the modified data from Swords | Columbia-NLP/llama-2-13b-hf-comb-ProLex |
We implement the evaluation pipeline for ProLex in `evaluate.py`. The following example demonstrates how to obtain substitutes predicted by our fine-tuned models for a given word-sentence pair. Feel free to plug in your own model checkpoints to produce the results.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from evaluate import format_test_prompt

# Load the model and tokenizer from HF
model = AutoModelForCausalLM.from_pretrained("Columbia-NLP/vicuna-7b-v1.5-syn-ProLex")
tokenizer = AutoTokenizer.from_pretrained("Columbia-NLP/vicuna-7b-v1.5-syn-ProLex")

target_word = "obligatory"
sentence = "Even though it was an **obligatory** experience, I could take part in a community program"

system_input = format_test_prompt(target_word, sentence)
input_ids = tokenizer.encode(system_input, return_tensors='pt', add_special_tokens=True)

# Generate the candidates.
model.eval()
with torch.no_grad():
    generated_ids = model.generate(
        input_ids,
        max_length=tokenizer.model_max_length,
        temperature=0.2
    )

# Decode the candidates.
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```
To evaluate on the whole dev/test set of ProLex, please run the following command, with `MODEL_PATH` being your own checkpoint to evaluate and `DATA_PATH` being either `data/dev/ProLex_v1.0_dev.csv` or `data/test/ProLex_v1.0_test.csv`.
```bash
python3 evaluate.py --model_name_or_path {MODEL_PATH} --data_path {DATA_PATH}
```
Alternatively, you can first store your model predictions in a CSV file and then evaluate it with our pipeline. An example prediction file is provided in `outputs/example_predictions.csv`; please follow its data format when constructing your own prediction file. The file path can be passed as `YOUR_FILE_PATH` in the following command:
```bash
python3 evaluate.py --model_name_or_path {YOUR_FILE_PATH} --data_path {DATA_PATH}
```
We highly appreciate your interest in our work. If you find ProLex ✍️ helpful, please consider citing our paper in your work:
Xuanming Zhang, Zixun Chen, and Zhou Yu. ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution. ACL 2024 (Findings).
```bibtex
@article{zhang2024prolex,
  title={ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution},
  author={Zhang, Xuanming and Chen, Zixun and Yu, Zhou},
  journal={arXiv preprint arXiv:2401.11356},
  year={2024}
}
```
Please reach out to us at [email protected] if you have any questions about using our benchmark. If you find an issue in either the source code or the dataset, please feel free to create a pull request and contribute to the benchmark!