Causal Representation Learning Approach for Unlearning Toxic Content in LLMs

Dataset

Dataset 1: RealToxicityPrompts [Dataset Link] | [Paper]

Notes:

The entire ROME codebase is in the directory rome-main. It was cloned from the repository for Locating and Editing Factual Associations in GPT (NeurIPS'22).
rome-main/trace_main.py is the main script for running a vanilla example of causal tracing on one of the datasets from the original paper.
The directory rome-main/dsets contains the datasets used in the original paper. This is where the RealToxicityPrompts dataset needs to be added so that it can be loaded for inference. The loader is in the file rome-main/dsets/realtoxicityprompts.py.
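As a rough illustration of what a loader in rome-main/dsets could look like, here is a minimal sketch. It is not the contents of realtoxicityprompts.py. The class name RealToxicityPromptsSketch is hypothetical, and the code assumes the dataset is stored as a JSONL file in which each row has a boolean "challenging" field and a nested "prompt" object with a "text" field.

```python
import json
from pathlib import Path


class RealToxicityPromptsSketch:
    """Hypothetical loader; the real rome-main/dsets/realtoxicityprompts.py may differ."""

    def __init__(self, jsonl_path):
        self.records = []
        with Path(jsonl_path).open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                row = json.loads(line)
                # Assumed schema: {"challenging": bool, "prompt": {"text": str, ...}, ...}
                self.records.append(
                    {
                        "prompt": row["prompt"]["text"],
                        "challenging": bool(row.get("challenging", False)),
                    }
                )

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]
```

Exposing __len__ and __getitem__ keeps the object indexable like a list, so it can be iterated over or wrapped by a data loader without further changes.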

For William:

Refer to the RealToxicityPrompts paper to determine which model the authors used, then pick one of the GPT-2 variants and run ROME with the prompts from the dataset.
Choose the "challenging" subset of prompts, i.e., the rows where dataset["challenging"] == True.
For our use case, the causal tracing part is enough for now. We don't need to worry about the editing part yet.
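The subset selection described above can be sketched as follows. This is a minimal example under assumptions: select_challenging is a hypothetical helper, the row layout (a boolean "challenging" field and a nested "prompt" object with a "text" field) is assumed, and the sample rows are made-up placeholders rather than real dataset entries.

```python
def select_challenging(rows):
    """Return the prompt text of every row flagged as challenging.

    Assumes each row is a dict with a boolean "challenging" field and a
    nested "prompt" dict holding the prompt string under "text".
    """
    return [row["prompt"]["text"] for row in rows if row.get("challenging") is True]


# Placeholder rows for illustration only; not taken from the dataset.
sample_rows = [
    {"challenging": True, "prompt": {"text": "placeholder prompt one"}},
    {"challenging": False, "prompt": {"text": "placeholder prompt two"}},
    {"challenging": True, "prompt": {"text": "placeholder prompt three"}},
]

challenging_prompts = select_challenging(sample_rows)
print(len(challenging_prompts))
```

The resulting list of prompt strings is what would be passed to the causal tracing step. How trace_main.py consumes prompts is not shown here.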
