This repository contains code to analyze the extent to which LLM-jp models memorize training data.
- Python: 3.10+
- See requirements.txt for the required Python packages.
Install the required Python packages:
pip install -r requirements.txt
First, preprocess the training data. This is done in two steps.
The first step extracts the training data seen at specific training steps.
PATH_TO_DATA_DIR=<PATH-TO-DATA-DIR>
PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
python src/preprocess.py extract --data_dir $PATH_TO_DATA_DIR --output_dir $PATH_TO_EXTRACT_DIR
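As a rough illustration of what this step does (the actual logic lives in src/preprocess.py; the batch size and checkpoint steps below are hypothetical), the samples consumed at a given training step can be located from the step number and the global batch size:

```python
# Hypothetical sketch: which training samples were consumed at a given step.
# The batch size and checkpoint steps are made up for illustration;
# the actual extraction logic is in src/preprocess.py.
GLOBAL_BATCH_SIZE = 1024  # assumed number of samples consumed per optimizer step


def sample_range_for_step(step: int, batch_size: int = GLOBAL_BATCH_SIZE) -> range:
    """Return the indices of the training samples seen at a given step."""
    return range(step * batch_size, (step + 1) * batch_size)


for step in (1_000, 10_000, 100_000):  # example checkpoint steps
    r = sample_range_for_step(step)
    print(f"step {step}: samples {r.start}..{r.stop - 1}")
```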
The second step annotates the extracted training data with additional information, such as frequency counts.
PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
python src/preprocess.py annotate --data_dir $PATH_TO_EXTRACT_DIR --output_dir $PATH_TO_ANNOTATE_DIR
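As a minimal sketch of what frequency annotation can mean here (the real annotation is implemented in src/preprocess.py; the n-gram choice and data below are hypothetical), one might count how often a sample's leading n-gram occurs across the extracted data:

```python
# Hypothetical sketch of frequency annotation (not the actual src/preprocess.py logic):
# count how often each sample's leading n-gram occurs across the extracted data.
from collections import Counter


def leading_ngram(text: str, n: int = 5) -> tuple[str, ...]:
    """Use the first n whitespace-separated tokens as a crude duplicate key."""
    return tuple(text.split()[:n])


samples = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a sleeping cat",
    "an entirely different training example",
]
counts = Counter(leading_ngram(s) for s in samples)
annotated = [{"text": s, "ngram_frequency": counts[leading_ngram(s)]} for s in samples]
print(annotated)  # the first two samples share a leading 5-gram, so their frequency is 2
```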
Evaluate memorization metrics for a model.
PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
MODEL_NAME_OR_PATH=<MODEL-NAME-OR-PATH>
python src/evaluate.py --data_dir $PATH_TO_ANNOTATE_DIR --output_dir $PATH_TO_RESULT_DIR --model_name_or_path $MODEL_NAME_OR_PATH
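For context, a common way to measure memorization (this is an illustrative sketch, not necessarily the exact metric implemented in src/evaluate.py) is to prompt the model with a prefix of a training example and check whether greedy decoding reproduces the true continuation verbatim:

```python
# Illustrative verbatim-memorization check (an assumption about the metric, not
# necessarily what src/evaluate.py computes): prompt with a training-data prefix
# and test whether greedy decoding reproduces the reference continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "<MODEL-NAME-OR-PATH>"  # same placeholder as in the command above
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model.eval()

text = "an example sentence taken from the training data"  # dummy placeholder
ids = tokenizer(text, return_tensors="pt").input_ids[0]
split = len(ids) // 2
prefix, reference = ids[:split], ids[split:]  # first half as prompt, rest as target

with torch.no_grad():
    output = model.generate(
        prefix.unsqueeze(0), max_new_tokens=len(reference), do_sample=False
    )
continuation = output[0, len(prefix):len(prefix) + len(reference)]
memorized = torch.equal(continuation, reference)  # exact-match (verbatim) memorization
print("memorized:", memorized)
```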
Visualize the results.
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
PATH_TO_PLOT_DIR=<PATH-TO-PLOT-DIR>
python src/plot.py --data_dir $PATH_TO_RESULT_DIR --output_dir $PATH_TO_PLOT_DIR
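The actual figures are produced by src/plot.py; as a rough idea of the kind of plot involved, here is a minimal matplotlib sketch with made-up numbers:

```python
# Minimal plotting sketch with made-up numbers (the real figures come from src/plot.py).
import matplotlib.pyplot as plt

frequencies = [1, 2, 4, 8, 16, 32]                         # hypothetical occurrence counts
memorization_rates = [0.01, 0.02, 0.05, 0.11, 0.22, 0.40]  # hypothetical rates

plt.plot(frequencies, memorization_rates, marker="o")
plt.xscale("log")
plt.xlabel("Frequency in training data")
plt.ylabel("Memorization rate")
plt.title("Example: memorization rate vs. frequency (dummy data)")
plt.savefig("example_plot.png")
```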
To browse the memorization metrics, run the following command and open the URL in a browser.
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
streamlit run src/browse.py -- --data_dir $PATH_TO_RESULT_DIR
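src/browse.py is the actual app; as a rough sketch of what such a browser looks like (the results.csv file name is an assumption), a Streamlit app can be as small as:

```python
# Hypothetical minimal results browser (src/browse.py is the real app;
# the results.csv file name is an assumption).
import pandas as pd
import streamlit as st

st.title("Memorization results")
df = pd.read_csv("results.csv")  # e.g. a file inside --data_dir
st.dataframe(df)
```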
Add unit tests for new code and ensure that all tests pass:
pytest -vv
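A minimal, self-contained example of the expected test style (the helper under test is defined inline here; real tests should target functions in src/):

```python
# tests/test_example.py -- self-contained illustration of the expected test style.
# Real tests should import and exercise functions from src/ instead of this inline helper.
def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


def test_normalize_whitespace_collapses_runs():
    assert normalize_whitespace("a  b\tc\n") == "a b c"
```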
The code is formatted using Ruff. To ensure that the code is formatted correctly, install the pre-commit hooks:
pre-commit install