This is the official repo for the paper "AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text"
Disclaimer 1: The Creativity Index of texts that appear exactly in the reference corpus may be deflated. In our paper, we remove exact duplicates (including quotations and citations) from the corpus before computing the Creativity Index. However, deduplication is not applied in this repository, as it requires hosting the backend of the Infini-gram search engine.
Disclaimer 2: The Creativity Index of texts generated by the latest models (e.g., GPT-4) may be inflated. This is because we do not have access to all the data these models were trained on, and our supported corpora have earlier cutoff dates (Dolma-v1.7: October 2023, RedPajama: March 2023, Pile: 2020).
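If you want to approximate the deduplication described in Disclaimer 1, one option is to check whether a (short) text or quotation occurs verbatim in the reference corpus via the public Infini-gram API before interpreting its score. The endpoint, index name, and payload below reflect our reading of the Infini-gram API documentation and are assumptions, not part of this repository:

```python
# Rough sketch (assumption): flag texts that occur verbatim in the reference
# corpus using the public Infini-gram count API. Endpoint and index name are
# assumptions; long texts may exceed the API's query length limits.
import requests

INFINI_GRAM_URL = "https://api.infini-gram.io/"  # public demo endpoint (assumption)

def appears_verbatim(text: str, index: str = "v4_dolma-v1_7_llama") -> bool:
    """Return True if `text` occurs at least once in the given Infini-gram index."""
    payload = {"index": index, "query_type": "count", "query": text}
    response = requests.post(INFINI_GRAM_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json().get("count", 0) > 0
```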
We suggest using conda to set up the environments. First, replace `prefix` in `environment_infini.yml` and `environment_vllm.yml` with your home path. With conda installed, create two environments called `infini-gram` and `vllm` with:

```bash
conda env create -f environment_infini.yml
conda env create -f environment_vllm.yml
```
Please first replace `HF_TOKEN` in `DJ_search_exact.py` with your own Hugging Face token.
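If you prefer not to hard-code the token, one alternative (our suggestion, not part of the script) is to read it from an environment variable:

```python
# Optional alternative (assumption): read the Hugging Face token from the
# environment instead of hard-coding HF_TOKEN in DJ_search_exact.py.
import os
from huggingface_hub import login

HF_TOKEN = os.environ["HF_TOKEN"]  # export HF_TOKEN=... beforehand
login(token=HF_TOKEN)
```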
To compute the Creativity Index based on exact matches with the default hyperparameters, run:

```bash
conda activate infini-gram
python DJ_search_exact.py --task GPT3_book --data data/book/GPT3_book.json --output_dir outputs/book
```
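For intuition, the exact-match score is driven by how much of a text can be covered by spans that also appear verbatim in the reference corpus. The toy sketch below illustrates that coverage idea with a hypothetical `in_corpus` oracle; it is a simplification, and the actual matching, minimum span length, and aggregation into the Creativity Index in `DJ_search_exact.py` differ in their details and hyperparameters.

```python
# Toy illustration of exact-match coverage (not the repo's implementation).
# `in_corpus` stands in for an Infini-gram lookup that reports whether a
# token span occurs verbatim in the reference corpus.
from typing import Callable, List

def coverage(tokens: List[str],
             in_corpus: Callable[[List[str]], bool],
             min_len: int = 5) -> float:
    """Fraction of tokens covered by corpus spans of length >= min_len (illustrative)."""
    n = len(tokens)
    covered = [False] * n
    for start in range(n):
        # Greedily extend the longest corpus-attested span starting at `start`.
        end = start + min_len
        longest = None
        while end <= n and in_corpus(tokens[start:end]):
            longest = end
            end += 1
        if longest is not None:
            for i in range(start, longest):
                covered[i] = True
    return sum(covered) / max(n, 1)
```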
The default tokenizer used by DJ Search is the Moses tokenizer. To use the LLaMA 2 tokenizer, which we adopt for the poem domain, please include the `--lm_tokenizer` flag:

```bash
python DJ_search_exact.py --task GPT3_poem --data data/poem/GPT3_poem.json --output_dir outputs/poem --lm_tokenizer
```
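The two tokenizers segment text quite differently, which affects what counts as a matching span. A quick way to compare them (the `sacremoses` package and the LLaMA 2 checkpoint name below are our choices for illustration, not pinned by this repo):

```python
# Compare Moses word tokenization with LLaMA 2 subword tokenization.
from sacremoses import MosesTokenizer
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog."

moses = MosesTokenizer(lang="en")
print(moses.tokenize(text))  # word-level tokens, e.g. ['The', 'quick', ...]

# Requires access to the gated LLaMA 2 weights on Hugging Face; assuming this
# checkpoint is representative of the tokenizer used with --lm_tokenizer.
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(llama.tokenize(text))  # subword tokens, e.g. ['▁The', '▁quick', ...]
```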
Before running DJ Search, compute the lookup table of pairwise word embedding distances using the `get_lookup_table` function in `DJ_search_earth_mover.py`. This function generates and saves the lookup table as a pickle file at `data/embed_distance/Llama-3-8B-Instruct.pkl`. This step only needs to be performed once.
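For reference, a lookup table of this kind can be built by embedding each word with the model's input embeddings (averaging over subword pieces) and storing pairwise cosine distances. The sketch below is our own simplified illustration over a tiny placeholder word list; the actual `get_lookup_table` may construct the table differently (e.g., over the words appearing in the task data) and is what produces the pickle file used later.

```python
# Simplified illustration (assumption): build a word-to-word cosine-distance
# table from Llama-3-8B-Instruct input embeddings and pickle it. This is NOT
# a drop-in replacement for get_lookup_table.
import pickle
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
embeddings = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

words = ["river", "stream", "mountain", "valley"]  # placeholder word list

def embed_word(word: str) -> torch.Tensor:
    """Average the input embeddings of a word's subword tokens."""
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    return embeddings[ids].mean(dim=0)

with torch.no_grad():
    vectors = torch.stack([embed_word(w) for w in words]).float()
    vectors = torch.nn.functional.normalize(vectors, dim=-1)
    distance = 1.0 - vectors @ vectors.T  # pairwise cosine distances

table = {(a, b): distance[i, j].item()
         for i, a in enumerate(words) for j, b in enumerate(words)}
with open("toy_embed_distance.pkl", "wb") as f:  # toy output path, not the repo's
    pickle.dump(table, f)
```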
Next, retrieve the most similar documents using ElasticSearch, then process the retrieved documents. Please replace `API_KEY` in `retrieve_documents.py` with your own ElasticSearch API key.
```bash
conda activate vllm
python retrieve_documents.py --input_file data/book/GPT3_book.json --output_dir data/book/retrieved/ --index DOLMA --nb_documents 100
```

```bash
conda activate infini-gram
python process_documents.py --task GPT3_book --retrieved_data_path data/book/retrieved/GPT3_book_nbgens_100_nbdoc100_DOLMA.json --data_output_dir data/new_book/filtered
```
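Under the hood, the retrieval step issues queries against an ElasticSearch index. A minimal query along these lines is sketched below; the host URL, index name, and field name are placeholders rather than the values used by `retrieve_documents.py`.

```python
# Minimal ElasticSearch query sketch (host, index, and field names are
# placeholders; see retrieve_documents.py for the actual retrieval logic).
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-elasticsearch-host:9200", api_key="API_KEY")

response = es.search(
    index="dolma",                                  # placeholder index name
    query={"match": {"text": "query text here"}},   # BM25 match on the text field
    size=100,                                       # cf. --nb_documents 100
)
hits = [hit["_source"] for hit in response["hits"]["hits"]]
```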
Finally, replace `HF_TOKEN` in `DJ_search_earth_mover.py` with your own Hugging Face token. To compute the Creativity Index based on semantic matches with the default hyperparameters, run:
```bash
python DJ_search_earth_mover.py --task GPT3_book --data_dir data/book/filtered --output_dir outputs/semantic/book --embed_table_path data/embed_distance/Llama-3-8B-Instruct.pkl
```
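As a rough illustration of the semantic-match step: with uniform weights, the earth mover's distance between two equal-length word spans reduces to the cost of the best one-to-one word alignment under the precomputed distance table. The sketch below uses SciPy for that assignment and assumes the pickled table maps `(word_a, word_b)` pairs to distances; the actual scoring, thresholds, and span search in `DJ_search_earth_mover.py` may differ.

```python
# Toy illustration (not the repo's implementation): score a candidate span
# against a corpus span via an optimal word-to-word assignment, using the
# pickled pairwise-distance lookup table.
import pickle
import numpy as np
from scipy.optimize import linear_sum_assignment

with open("data/embed_distance/Llama-3-8B-Instruct.pkl", "rb") as f:
    lookup = pickle.load(f)  # assumed format: {(word_a, word_b): distance}

def span_distance(span_a, span_b):
    """Average matched-word distance under the best one-to-one alignment."""
    cost = np.array([[lookup[(a, b)] for b in span_b] for a in span_a])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```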