This repository reproduces and extends the work from the article "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet", applying its methods to LLaMA 3.2-3B. The project explores monosemantic neurons in large language models, investigates their scaling behavior, and implements sparse autoencoders to extract interpretable features.
- Extracting Activations: Analyze token-level activations from the 16th layer of LLaMA 3.2 (3B, 16-bit quantized) using the Pile dataset.
- Sparse Autoencoder (SAE): Train an overcomplete SAE (3072-65536-3072) to uncover interpretable features in the latent space.
- Feature Search: Identify features relevant to specific topics using multi-prompt inputs and activation metrics.
- Influencing Outputs: Use steering vectors derived from the SAE to influence LLaMA’s output during inference.
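To make the SAE step concrete, here is a minimal NumPy sketch of an overcomplete autoencoder's forward pass and loss. It is an illustration under assumptions, not the repository's training code: the ReLU encoder, the L1 sparsity penalty, the `l1_coeff` value, the initialisation, and all function names are hypothetical, and toy sizes (8 and 32) stand in for the 3072 and 65536 dimensions.

```python
import numpy as np

def init_sae(d_model, d_hidden, seed=0):
    """Create toy SAE parameters (initialisation scheme is an assumption)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.02, size=(d_model, d_hidden))
    W_dec = W_enc.T.copy()  # shape (d_hidden, d_model)
    # Normalise each decoder row to unit length.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(d_hidden),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])  # ReLU
    reconstruction = features @ params["W_dec"] + params["b_dec"]
    return features, reconstruction

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty (assumed)."""
    recon = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon + l1_coeff * sparsity

# Toy sizes; the README's SAE would use d_model=3072, d_hidden=65536.
params = init_sae(d_model=8, d_hidden=32)
x = np.random.default_rng(1).normal(size=(4, 8))  # 4 token activations
features, reconstruction = sae_forward(params, x)
loss = sae_loss(x, reconstruction, features)
```

The hidden layer is wider than the input (overcomplete), so the sparsity term is what pushes each input to be explained by only a few active features.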
Example prompts used for finding relevant features:
1. program on aerobic capacity and muscle strength of adults with hearing loss. Twenty-three adults with hearing loss were separated into 2 groups. Thirteen subjects
2. the effect of a traditional dance training program on aerobic capacity and muscle strength of adults with hearing loss. Twenty-three adults with hearing loss were separated into
3. been examined comprehensively. Peritoneal lavage was performed in 351 patients before curative resection of a gastric carcinoma between 1987 and
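One way to turn several prompts like these into a feature ranking is sketched below. The metric (per-prompt peak activation, averaged across prompts) and the name `rank_features` are assumptions chosen for illustration; the repository may use different activation metrics.

```python
import numpy as np

def rank_features(prompt_activations, top_k=10):
    """Rank SAE features by their mean per-prompt peak activation.

    prompt_activations: list of arrays, one per prompt, each shaped
    (n_tokens, n_features). Prompts may have different token counts.
    """
    # Peak activation of every feature within each prompt: (n_prompts, n_features)
    peaks = np.stack([acts.max(axis=0) for acts in prompt_activations])
    scores = peaks.mean(axis=0)
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy data: 3 prompts of different lengths, 6 features.
rng = np.random.default_rng(0)
prompts = [rng.uniform(0.0, 0.1, size=(n, 6)) for n in (5, 7, 4)]
for acts in prompts:
    acts[0, 4] = 2.0  # feature 4 fires strongly on every prompt

ranking = rank_features(prompts, top_k=3)
```

Averaging across prompts favours features that respond to the shared topic rather than to wording specific to a single prompt.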
For a candidate feature, we print the 5 dataset examples with the highest activation values.
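Selecting the highest-activating examples can be sketched with the standard library alone. The snippet assumes one activation value per text snippet has already been computed for the feature; the helper name and the toy data are hypothetical.

```python
import heapq

def top_activating_examples(examples, activations, k=5):
    """Return the k (activation, example) pairs with the largest activation."""
    pairs = zip(activations, examples)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[0])

# Toy stand-ins for dataset snippets and one feature's activation on each.
snippets = ["a", "b", "c", "d", "e", "f", "g"]
values = [0.1, 3.2, 0.0, 1.5, 2.7, 0.4, 0.9]

top5 = top_activating_examples(snippets, values, k=5)
for value, text in top5:
    print(f"{value:.2f}  {text}")
```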
We pass these examples to GPT-4o (via ChatGPT) to generate a description of the feature. It returns five pieces of information: dominant tokens, patterns, a summary, context, and a title.
Example: Feature Index [45783]
- Dominant Tokens: 'patients', '(', 'Fifty', ';', 'into'
- Patterns: Activates in medical or clinical study contexts, often quantifying patients or describing study methodologies.
- Summary: Highlights patient-focused data or study details in medical literature.
- Context: Found in detailed descriptions of clinical trials or patient demographics.
- Title: Clinical Study Patients
Top Features and their Titles:
- Feature 32026: Scientific Study Purpose
- Feature 57660: Academic References
- Feature 45783: Clinical Study Patients
- Feature 41517: Experiment Validation
- Feature 64668: Action and Roles
- Feature 29073: Conversational Context
- Feature 14701: Quantitative Demographics
- Feature 6527: Population Studies
- Feature 49447: Technical Problem-Solving
- Feature 52757: Medical Study Terms
Influencing LLaMA’s outputs during inference using the starting prompt:
I am a
- Zero Boost: I am a little confused about the meaning of the word 'sociology' in the title of this book. I have read the book and I am not sure what the word 's
- 30x Boost on Feature 45783: I am a 20 year old female who has been diagnosed with a rare disease called SLE (systemic lupus erythematosus) and have been diagnosed with 3 cases of pulmonary
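The boost can be pictured as adding a scaled copy of the feature's decoder direction to the hidden state at the hooked layer. The sketch below is one plausible reading, not the repository's confirmed method: the additive form, the unit normalisation, and the name `steer` are assumptions, and random arrays stand in for the real SAE decoder and LLaMA activations.

```python
import numpy as np

def steer(hidden, W_dec, feature_idx, boost):
    """Add `boost` times a feature's unit decoder direction to every token.

    hidden: (n_tokens, d_model) activations at the hooked layer.
    W_dec:  (n_features, d_model) SAE decoder matrix.
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit-length steering vector
    return hidden + boost * direction  # broadcast over tokens

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(32, 8))   # toy decoder: 32 features, d_model = 8
hidden = rng.normal(size=(3, 8))   # toy activations for 3 tokens

baseline = steer(hidden, W_dec, feature_idx=5, boost=0.0)   # "zero boost"
boosted = steer(hidden, W_dec, feature_idx=5, boost=30.0)   # "30x boost"
```

With a real model, the same arithmetic would typically run inside a forward hook on the chosen layer, so that every generated token is shifted toward the feature's direction.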