Scaling Monosemanticity with LLaMA

This repository reproduces and extends the article "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet", applying its methods to LLaMA 3.2-3B. The project explores monosemantic neurons in large language models, investigates their scaling behavior, and implements sparse autoencoders to extract interpretable features.

Features

Extracting Activations: Analyze token-level activations from the 16th layer of LLaMA 3.2 (3B, 16-bit quantized) using the Pile dataset.
Sparse Autoencoder (SAE): Train an overcomplete SAE (3072-65536-3072) to uncover interpretable features in the latent space.
Feature Search: Identify features relevant to specific topics using multi-prompt inputs and activation metrics.
Influencing Outputs: Use steering vectors derived from the SAE to influence LLaMA’s output during inference.
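As a rough illustration of the sparse autoencoder listed above, the following NumPy sketch shows a forward pass and a reconstruction-plus-sparsity loss. The ReLU encoder, the L1 coefficient, the weight initialization, and all names (`init_sae`, `sae_forward`, `sae_loss`) are assumptions made for illustration and may differ from the repository's training code. The demo uses tiny dimensions in place of 3072 and 65536.

```python
import numpy as np

def init_sae(d_model, d_hidden, seed=0):
    """Randomly initialise encoder/decoder weights (illustrative init, an assumption)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, scale, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations x of shape (batch, d_model) into features, then reconstruct."""
    features = np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])  # ReLU
    recon = features @ params["W_dec"] + params["b_dec"]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Batch mean of squared reconstruction error plus an L1 penalty on features."""
    recon_err = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon_err + l1_coeff * sparsity

# Tiny demo: 8 -> 32 -> 8 stands in for the 3072 -> 65536 -> 3072 shape.
params = init_sae(d_model=8, d_hidden=32)
x = np.random.default_rng(1).normal(size=(4, 8))
features, recon = sae_forward(params, x)
loss = sae_loss(x, features, recon)
```

The hidden layer is wider than the input (overcomplete), and the L1 term pushes most feature activations to zero so that each input is explained by only a few features.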

Examples and Results

Prompt Search

Example prompts used for finding relevant features:

1. program on aerobic capacity and muscle strength of adults with hearing loss. Twenty-three adults with hearing loss were separated into 2 groups. Thirteen subjects
2. the effect of a traditional dance training program on aerobic capacity and muscle strength of adults with hearing loss. Twenty-three adults with hearing loss were separated into
3. been examined comprehensively. Peritoneal lavage was performed in 351 patients before curative resection of a gastric carcinoma between 1987 and
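One plausible way to turn several prompts into a feature ranking is sketched below. It assumes the SAE feature activations for each prompt are already available as per-token vectors, and it scores each feature by its peak activation per prompt, averaged over prompts. Both the metric and the function name `rank_features` are assumptions for illustration; the repository may use a different activation metric.

```python
from statistics import mean

def rank_features(prompt_activations, top_k=3):
    """Rank features by their mean peak activation across prompts.

    prompt_activations: one entry per prompt, each a list of per-token
    feature-activation vectors (all of the same length).
    """
    n_features = len(prompt_activations[0][0])
    scores = []
    for f in range(n_features):
        # Peak activation of feature f within each prompt.
        peaks = [max(token[f] for token in tokens) for tokens in prompt_activations]
        scores.append((mean(peaks), f))
    # Highest score first; ties broken by lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [(f, score) for score, f in scores[:top_k]]

# Toy data: 2 prompts, 4 features (made-up numbers, not real activations).
toy = [
    [[0.0, 2.0, 0.1, 0.0], [0.0, 3.0, 0.0, 0.5]],  # prompt 1, two tokens
    [[0.2, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.5]],  # prompt 2, two tokens
]
ranking = rank_features(toy, top_k=2)
```

Averaging over prompts favors features that fire on every prompt about a topic over features that fire strongly on only one of them.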

Feature Retrieval

Top 5 Examples for Feature 45783

The 5 examples from the dataset with the highest activation values:

[Image: top-activating examples for feature 45783]
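Retrieving the highest-activating examples for a feature can be sketched with the standard library alone. The helper name `top_examples` and the toy data are hypothetical; the sketch assumes one activation value per text snippet for the feature of interest.

```python
import heapq

def top_examples(activations, texts, k=5):
    """Return the k (activation, text) pairs with the highest activation."""
    return heapq.nlargest(k, zip(activations, texts), key=lambda pair: pair[0])

# Made-up activations for one feature over six text snippets.
acts = [0.1, 4.2, 0.0, 3.3, 0.7, 2.9]
snippets = ["a", "b", "c", "d", "e", "f"]
best = top_examples(acts, snippets, k=3)
```

`heapq.nlargest` avoids sorting the whole dataset, which matters when the activations cover many tokens.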

Automatic Feature Explanation

We pass the examples above to ChatGPT (GPT-4o) to generate a description of the feature. It returns five fields: dominant tokens, patterns, summary, context, and title.

Example: Feature Index [45783]

Dominant Tokens: 'patients', '(', 'Fifty', ';', 'into'
Patterns: Activates in medical or clinical study contexts, often quantifying patients or describing study methodologies.
Summary: Highlights patient-focused data or study details in medical literature.
Context: Found in detailed descriptions of clinical trials or patient demographics.
Title: Clinical Study Patients
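If the model's answer follows the "Label: value" layout shown above, it can be turned into a dictionary with a few lines of Python. This is a minimal sketch: the function name `parse_explanation` is hypothetical, the sample response is abbreviated from the example above, and real responses may need more forgiving parsing.

```python
FIELDS = ("Dominant Tokens", "Patterns", "Summary", "Context", "Title")

def parse_explanation(text):
    """Split a 'Label: value' response into a dict keyed by the five labels."""
    result = {}
    for line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        label, sep, value = line.partition(":")
        if sep and label.strip() in FIELDS:
            result[label.strip()] = value.strip()
    return result

response = """Dominant Tokens: 'patients', '(', 'Fifty', ';', 'into'
Patterns: Activates in medical or clinical study contexts.
Summary: Highlights patient-focused data in medical literature.
Context: Found in descriptions of clinical trials.
Title: Clinical Study Patients"""
parsed = parse_explanation(response)
```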

Top Features and their Titles:

Feature 32026: Scientific Study Purpose
Feature 57660: Academic References
Feature 45783: Clinical Study Patients
Feature 41517: Experiment Validation
Feature 64668: Action and Roles
Feature 29073: Conversational Context
Feature 14701: Quantitative Demographics
Feature 6527: Population Studies
Feature 49447: Technical Problem-Solving
Feature 52757: Medical Study Terms

Influence

Influencing LLaMA’s outputs during inference using the starting prompt:

I am a

Zero Boost:

I am a little confused about the meaning of the word 'sociology' in the title of this book. I have read the book and I am not sure what the word 's

30x Boost on Feature 45783:

I am a 20 year old female who has been diagnosed with a rare disease called SLE (systemic lupus erythematosus) and have been diagnosed with 3 cases of pulmonary
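The exact definition of a "30x boost" is not given here. One common way to steer with an SAE feature, sketched below as an assumption, is to add a scaled copy of that feature's decoder direction to the hidden states at the layer the SAE was trained on. The function name `boost_feature`, the unit-norm scaling, and the toy shapes are illustrative, not the repository's implementation.

```python
import numpy as np

def boost_feature(hidden, w_dec, feature_idx, boost):
    """Add `boost` times a feature's unit-norm decoder direction to every token's hidden state."""
    direction = w_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return hidden + boost * direction

rng = np.random.default_rng(0)
w_dec = rng.normal(size=(16, 8))   # toy decoder: 16 features, model width 8
hidden = rng.normal(size=(3, 8))   # toy hidden states for 3 tokens
steered = boost_feature(hidden, w_dec, feature_idx=5, boost=30.0)
unchanged = boost_feature(hidden, w_dec, feature_idx=5, boost=0.0)
```

With a boost of zero the hidden states pass through unchanged, which corresponds to the "Zero Boost" baseline. In a real model, a function like this would typically run inside a hook on the chosen layer during generation.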


Repository: DrejcPesjak/scaling-monosemanticity-llama