This repository contains code to analyze the extent to which LLM-jp models memorize training data.
- Python: 3.10+
- See requirements.txt for the required Python packages.
Install the required Python packages:
pip install -r requirements.txt
First, preprocess the training data. This is done in two steps.
The first step extracts the training data seen at specific training steps.
PATH_TO_DATA_DIR=<PATH-TO-DATA-DIR>
PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
python src/preprocess.py extract --data_dir $PATH_TO_DATA_DIR --output_dir $PATH_TO_EXTRACT_DIR
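As a rough illustration of what this step does (the actual logic lives in src/preprocess.py; the batch size and checkpoint steps below are hypothetical), the samples consumed at a given training step can be located from the step number and the global batch size:

```python
# Hypothetical sketch: which training samples were consumed at a given step.
# The batch size and checkpoint steps are made up for illustration;
# the actual extraction logic is in src/preprocess.py.
GLOBAL_BATCH_SIZE = 1024  # assumed number of samples consumed per optimizer step


def sample_range_for_step(step: int, batch_size: int = GLOBAL_BATCH_SIZE) -> range:
    """Return the indices of the training samples seen at a given step."""
    return range(step * batch_size, (step + 1) * batch_size)


for step in (1_000, 10_000, 100_000):  # example checkpoint steps
    r = sample_range_for_step(step)
    print(f"step {step}: samples {r.start}..{r.stop - 1}")
```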
The second step annotates the extracted training data with additional information, such as frequency counts.
PATH_TO_EXTRACT_DIR=<PATH-TO-EXTRACT-DIR>
PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
python src/preprocess.py annotate --data_dir $PATH_TO_EXTRACT_DIR --output_dir $PATH_TO_ANNOTATE_DIR
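As a minimal sketch of what frequency annotation can mean here (the real annotation is implemented in src/preprocess.py; the n-gram choice and data below are hypothetical), one might count how often a sample's leading n-gram occurs across the extracted data:

```python
# Hypothetical sketch of frequency annotation (not the actual src/preprocess.py logic):
# count how often each sample's leading n-gram occurs across the extracted data.
from collections import Counter


def leading_ngram(text: str, n: int = 5) -> tuple[str, ...]:
    """Use the first n whitespace-separated tokens as a crude duplicate key."""
    return tuple(text.split()[:n])


samples = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a sleeping cat",
    "an entirely different training example",
]
counts = Counter(leading_ngram(s) for s in samples)
annotated = [{"text": s, "ngram_frequency": counts[leading_ngram(s)]} for s in samples]
print(annotated)  # the first two samples share a leading 5-gram, so their frequency is 2
```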
Evaluate memorization metrics for a model.
PATH_TO_ANNOTATE_DIR=<PATH-TO-ANNOTATE-DIR>
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
MODEL_NAME_OR_PATH=<MODEL-NAME-OR-PATH>
python src/evaluate.py --data_dir $PATH_TO_ANNOTATE_DIR --output_dir $PATH_TO_RESULT_DIR --model_name_or_path $MODEL_NAME_OR_PATH
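For context, a common way to measure memorization (this is an illustrative sketch, not necessarily the exact metric implemented in src/evaluate.py) is to prompt the model with a prefix of a training example and check whether greedy decoding reproduces the true continuation verbatim:

```python
# Illustrative verbatim-memorization check (an assumption about the metric, not
# necessarily what src/evaluate.py computes): prompt with a training-data prefix
# and test whether greedy decoding reproduces the reference continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "<MODEL-NAME-OR-PATH>"  # same placeholder as in the command above
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model.eval()

text = "an example sentence taken from the training data"  # dummy placeholder
ids = tokenizer(text, return_tensors="pt").input_ids[0]
split = len(ids) // 2
prefix, reference = ids[:split], ids[split:]  # first half as prompt, rest as target

with torch.no_grad():
    output = model.generate(
        prefix.unsqueeze(0), max_new_tokens=len(reference), do_sample=False
    )
continuation = output[0, len(prefix):len(prefix) + len(reference)]
memorized = torch.equal(continuation, reference)  # exact-match (verbatim) memorization
print("memorized:", memorized)
```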
Visualize the results.
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
PATH_TO_PLOT_DIR=<PATH-TO-PLOT-DIR>
python src/plot.py --data_dir $PATH_TO_RESULT_DIR --output_dir $PATH_TO_PLOT_DIR
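The actual figures are produced by src/plot.py; as a rough idea of the kind of plot involved, here is a minimal matplotlib sketch with made-up numbers:

```python
# Minimal plotting sketch with made-up numbers (the real figures come from src/plot.py).
import matplotlib.pyplot as plt

frequencies = [1, 2, 4, 8, 16, 32]                         # hypothetical occurrence counts
memorization_rates = [0.01, 0.02, 0.05, 0.11, 0.22, 0.40]  # hypothetical rates

plt.plot(frequencies, memorization_rates, marker="o")
plt.xscale("log")
plt.xlabel("Frequency in training data")
plt.ylabel("Memorization rate")
plt.title("Example: memorization rate vs. frequency (dummy data)")
plt.savefig("example_plot.png")
```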
To browse the memorization metrics, run the following command and open the URL in a browser.
PATH_TO_RESULT_DIR=<PATH-TO-RESULT-DIR>
streamlit run src/browse.py -- --data_dir $PATH_TO_RESULT_DIR
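src/browse.py is the actual app; as a rough sketch of what such a browser looks like (the results.csv file name is an assumption), a Streamlit app can be as small as:

```python
# Hypothetical minimal results browser (src/browse.py is the real app;
# the results.csv file name is an assumption).
import pandas as pd
import streamlit as st

st.title("Memorization results")
df = pd.read_csv("results.csv")  # e.g. a file inside --data_dir
st.dataframe(df)
```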
Add unit tests for new code and ensure that all tests pass:
pytest -vv
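A minimal, self-contained example of the expected test style (the helper under test is defined inline here; real tests should target functions in src/):

```python
# tests/test_example.py -- self-contained illustration of the expected test style.
# Real tests should import and exercise functions from src/ instead of this inline helper.
def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


def test_normalize_whitespace_collapses_runs():
    assert normalize_whitespace("a  b\tc\n") == "a b c"
```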
The code is formatted using Ruff. To ensure that the code is formatted correctly, install the pre-commit hooks:
pre-commit install