
# Creativity Index

This is the official repository for the paper "AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text".

## Disclaimers

**Disclaimer 1:** The Creativity Index of texts that appear exactly in the reference corpus may be deflated. In our paper, we remove exact duplicates (including quotations and citations) from the corpus before computing the Creativity Index. However, deduplication is not applied in this repository, as it requires hosting the backend of the Infini-gram search engine.

**Disclaimer 2:** The Creativity Index of texts generated by the latest models (e.g., GPT-4) may be inflated, because we do not have access to all the data these models were trained on, and our supported corpora have earlier cutoff dates (Dolma-v1.7: October 2023; RedPajama: March 2023; Pile: 2020).

## Requirements

We suggest using conda to set up the environments. First, replace the `prefix` field in `environment_infini.yml` and `environment_vllm.yml` with your own home path. With conda installed, create the two environments, `infini-gram` and `vllm`, with:

```shell
conda env create -f environment_infini.yml
conda env create -f environment_vllm.yml
```
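If you prefer to script the `prefix` edit rather than do it by hand, a minimal sketch (the throwaway file `/tmp/env_demo.yml` and the `miniconda3` path are placeholders, not the repo's actual files, and GNU `sed` is assumed):

```shell
# Create a stand-in for environment_infini.yml with someone else's prefix.
printf 'name: infini-gram\nprefix: /home/someone-else/miniconda3/envs/infini-gram\n' > /tmp/env_demo.yml

# Rewrite the prefix line to point at your own home path.
sed -i "s|^prefix: .*|prefix: $HOME/miniconda3/envs/infini-gram|" /tmp/env_demo.yml

# Confirm the change took effect.
grep '^prefix:' /tmp/env_demo.yml
```

The same one-liner, pointed at the real `environment_*.yml` files, saves editing both files manually.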

## Creativity Index with Exact Matches

First, replace `HF_TOKEN` in `DJ_search_exact.py` with your own Hugging Face token. To compute the Creativity Index based on exact matches with the default hyperparameters, run:

```shell
conda activate infini-gram
python DJ_search_exact.py --task GPT3_book --data data/book/GPT3_book.json --output_dir outputs/book
```

The default tokenizer used by DJ Search is the Moses tokenizer. To instead use the LLaMA 2 tokenizer, which we adopt for the poem domain, include the `--lm_tokenizer` flag:

```shell
python DJ_search_exact.py --task GPT3_poem --data data/poem/GPT3_poem.json --output_dir outputs/poem --lm_tokenizer
```
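To make the exact-match idea concrete: the Creativity Index is derived from how much of a text is *covered* by verbatim spans found in the corpus. The toy sketch below is an illustration only, not the repo's `DJ_search_exact.py` (which queries the Infini-gram engine over web-scale corpora); here a small in-memory token list stands in for the corpus, and the function names are our own:

```python
def longest_match(tokens, start, corpus_tokens):
    """Length of the longest span of tokens[start:] found verbatim in the corpus."""
    best = 0
    for i in range(len(corpus_tokens)):
        n = 0
        while (start + n < len(tokens) and i + n < len(corpus_tokens)
               and tokens[start + n] == corpus_tokens[i + n]):
            n += 1
        best = max(best, n)
    return best

def coverage(text, corpus, min_ngram=3):
    """Fraction of tokens lying inside a corpus match of at least min_ngram tokens."""
    tokens, corpus_tokens = text.split(), corpus.split()
    covered = [False] * len(tokens)
    for i in range(len(tokens)):
        n = longest_match(tokens, i, corpus_tokens)
        if n >= min_ngram:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(tokens), 1)

corpus = "the quick brown fox jumps over the lazy dog"
text = "a quick brown fox sleeps all day"
print(coverage(text, corpus))  # 3 of the 7 tokens sit inside a corpus match
```

Higher coverage means more of the text is attributable to the corpus, hence a lower Creativity Index.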

## Creativity Index with Semantic Matches

Before running DJ Search, compute the lookup table of pairwise word embedding distances using the `get_lookup_table` function in `DJ_search_earth_mover.py`. This function generates and saves the lookup table as a pickle file at `data/embed_distance/Llama-3-8B-Instruct.pkl`. This step only needs to be performed once.
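A lookup table of this shape can be pictured as a dict keyed by word pairs, pickled to disk. The sketch below is purely illustrative: the real table is built from Llama-3-8B-Instruct embeddings, whereas here toy 3-dimensional vectors stand in, and all names and the on-disk layout are assumptions, not the repo's actual internals:

```python
import math
import os
import pickle
import tempfile

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy word embeddings standing in for LM-derived ones.
embeddings = {
    "cat": (1.0, 0.2, 0.0),
    "dog": (0.9, 0.3, 0.1),
    "car": (0.0, 0.1, 1.0),
}

# Pairwise distance table, keyed by word pair.
table = {
    (w1, w2): cosine_distance(e1, e2)
    for w1, e1 in embeddings.items()
    for w2, e2 in embeddings.items()
}

# Save once, then reload on later runs instead of recomputing.
path = os.path.join(tempfile.mkdtemp(), "distance_table.pkl")
with open(path, "wb") as f:
    pickle.dump(table, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded[("cat", "dog")] < loaded[("cat", "car")])  # True: "dog" is closer to "cat"
```

Precomputing the table once amortizes the cost of embedding lookups across every document the search later scores.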

Next, retrieve the most similar documents using ElasticSearch, then process the retrieved documents. Replace `API_KEY` in `retrieve_documents.py` with your own ElasticSearch API key:

```shell
conda activate vllm
python retrieve_documents.py --input_file data/book/GPT3_book.json --output_dir data/book/retrieved/ --index DOLMA --nb_documents 100
conda activate infini-gram
python process_documents.py --task GPT3_book --retrieved_data_path data/book/retrieved/GPT3_book_nbgens_100_nbdoc100_DOLMA.json --data_output_dir data/new_book/filtered
```

Finally, replace `HF_TOKEN` in `DJ_search_earth_mover.py` with your own Hugging Face token. To compute the Creativity Index based on semantic matches with the default hyperparameters, run:

```shell
python DJ_search_earth_mover.py --task GPT3_book --data_dir data/book/filtered --output_dir outputs/semantic/book --embed_table_path data/embed_distance/Llama-3-8B-Instruct.pkl
```
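To see how the distance table turns exact matching into semantic matching, consider that two n-grams can count as a "match" when their word-level distances are small in aggregate. The sketch below is a deliberate simplification: the real `DJ_search_earth_mover.py` solves an earth mover's (optimal transport) problem over the embedding distances, whereas this version just averages aligned pairs, and the threshold, table entries, and function name are all illustrative assumptions:

```python
def semantic_match(ngram_a, ngram_b, table, threshold=0.3):
    """True if two equal-length n-grams are close on average per the lookup table.

    Word pairs absent from the table are treated as maximally distant (1.0).
    """
    if len(ngram_a) != len(ngram_b):
        return False
    dists = [table.get((a, b), 1.0) for a, b in zip(ngram_a, ngram_b)]
    return sum(dists) / len(dists) <= threshold

# Toy table: identical words at distance 0, near-synonyms small.
table = {("big", "large"): 0.1, ("dog", "hound"): 0.15}
table.update({(w, w): 0.0 for w in ["a", "big", "dog", "barks"]})

print(semantic_match(("a", "big", "dog"), ("a", "large", "hound"), table))  # True
```

Under this relaxation, "a large hound" is attributed to a corpus containing "a big dog" even though no token matches exactly, which is why semantic matching yields a stricter (lower) Creativity Index than exact matching alone.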