Memorization Analysis

This repository contains code to analyze the extent to which LLM-jp models memorize training data.

Requirements

Installation

Install the required Python packages:

pip install -r requirements.txt
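If you prefer an isolated environment, you can create a virtual environment first (standard Python tooling, not specific to this repository):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt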

Preprocess

First, preprocess the training data.

Extract training data

The first step is to extract the training data at specific training steps.

PATH_TO_DATA_DIR=<PATH-TO-DATA-DIR>
PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
python src/preprocess.py extract --data_dir $PATH_TO_DATA_DIR --output_dir $PATH_TO_EXTRACT_DIR
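As a concrete illustration, the placeholders should point at your LLM-jp training data dump and a writable output directory. The paths below are hypothetical:

# Hypothetical paths for illustration only
PATH_TO_DATA_DIR=/data/llm-jp/pretrain
PATH_TO_EXTRACT_DIR=./work/extracted
python src/preprocess.py extract --data_dir $PATH_TO_DATA_DIR --output_dir $PATH_TO_EXTRACT_DIR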

Annotate training data

The second step is to annotate the extracted training data with frequency information and other statistics.

PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
python src/preprocess.py annotate --data_dir $PATH_TO_EXTRACT_DIR --output_dir $PATH_TO_ANNOTATE_DIR
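Note that the annotation step reads the files written by the extraction step, so its --data_dir should point at the extraction output. Continuing the hypothetical example above:

# Hypothetical directories continuing the example above
PATH_TO_EXTRACT_DIR=./work/extracted
PATH_TO_ANNOTATE_DIR=./work/annotated
python src/preprocess.py annotate --data_dir $PATH_TO_EXTRACT_DIR --output_dir $PATH_TO_ANNOTATE_DIR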

Evaluate memorization metrics

Evaluate memorization metrics for a model.

PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
MODEL_NAME_OR_PATH=<MODEL-NAME-OR-PATH>
python src/evaluate.py --data_dir $PATH_TO_ANNOTATE_DIR --output_dir $PATH_TO_RESULT_DIR --model_name_or_path $MODEL_NAME_OR_PATH
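As a concrete illustration, the model can be given as a local checkpoint path or, presumably, as a Hugging Face Hub identifier; the values below are illustrative only:

# Hypothetical values; the model identifier is only an example
PATH_TO_ANNOTATE_DIR=./work/annotated
PATH_TO_RESULT_DIR=./work/results
MODEL_NAME_OR_PATH=llm-jp/llm-jp-1.3b-v1.0
python src/evaluate.py --data_dir $PATH_TO_ANNOTATE_DIR --output_dir $PATH_TO_RESULT_DIR --model_name_or_path $MODEL_NAME_OR_PATH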

Visualize the results.

PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
PATH_TO_PLOT_DIR=<PATH-TO-PLOT-DIR>
python src/plot.py --data_dir $PATH_TO_RESULT_DIR --output_dir $PATH_TO_PLOT_DIR

To browse the memorization metrics, run the following command and open the URL in a browser.

PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
streamlit run src/browse.py -- --data_dir $PATH_TO_RESULT_DIR
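By default, Streamlit serves the app at http://localhost:8501 unless configured otherwise.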

Development

Add unit tests for new code and ensure that all tests pass:

pytest -vv
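To run only a subset of tests, pytest's -k option selects tests by keyword (the keyword below is hypothetical):

pytest -vv -k preprocess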

The code is formatted using Ruff. To ensure that the code is formatted correctly, install the pre-commit hooks:

pre-commit install
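The hooks can also be run manually across the whole repository with the standard pre-commit command:

pre-commit run --all-files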
