Weak-to-Strong Alignment for Fine-tuning Large Language Models

This repository contains our replication study of weak-to-strong (w2s) alignment for fine-tuning large language models, exploring how weak supervision can effectively guide stronger models - an important analogy for future AI systems supervised by comparatively weaker human signals.

Overview

As AI systems become increasingly powerful, we face a fundamental challenge: how can weaker human supervisors effectively guide and align superhuman AI systems? This project explores this question by replicating and extending the weak-to-strong alignment framework in the context of sentiment analysis, using the SST-2 dataset as a testbed.

Figure 1: Weak-to-Strong Analogy, comparing weak supervisors to humans training strong superhuman models (Burn et al. (2024)).

We investigate whether a weaker model (BERT-base-uncased) can effectively supervise a stronger model (GPT-4o-mini) through pseudo-labels, while examining both the performance gains and potential safety implications of this approach.

Training Results

Weak Model Training

Figure 2: Training loss of the weak model.

Figure 3: Evaluation loss of the weak model.

Figure 4: Evaluation accuracy of the weak model.

W2S Model Training

Figure 5: Training loss of the strong student + ceiling model.

Figure 6: Training accuracy of the strong student + ceiling model.

Key Findings

Performance Results

Model	Accuracy	Precision	Recall	F1-Score
Weak Model (pretrained)	0.49	0.43	0.01	0.01
Weak Model (finetuned)	0.88	0.88	0.89	0.88
Strong Model (pretrained)	0.93	0.98	0.87	0.92
Strong Student (w2s)	0.96	0.97	0.95	0.96
Strong Ceiling	0.96	0.97	0.96	0.96

Safety Evaluation Results

We evaluated ethical understanding using the ETHICS benchmark across multiple dimensions. Results show Test / Hard Test percentages, where values before the slash are normal Test set results, and values after show the adversarially filtered "Hard Test" results:

Model	Justice	Deontology	Virtue	Utilitarianism	Commonsense	Long Commonsense	Average
GPT-4o-mini baseline	52.5 / 53.5	47.0 / 45.0	93.0 / 87.0	57.0 / 40.0	63.5 / 50.5	71.0 / 66.5	64.0 / 57.1
GPT-4o-mini w2s on SST2	46.0 / 51.0	52.0 / 45.5	73.0 / 72.0	67.0 / 47.5	48.5 / 32.5	38.5 / 54.5	54.2 / 50.5
GPT-4o-mini strong ceiling on SST2	50.5 / 47.5	46.0 / 51.0	79.5 / 77.0	63.5 / 51.0	54.0 / 46.5	48.0 / 57.5	57.0 / 55.1

Key observation: The w2s model showed lower average safety scores (54.2%) compared to both the baseline (64.0%) and strong ceiling (57.0%) models.

Key Insights

Perfect Performance Recovery: Our w2s implementation achieved a Performance Gap Recovered (PGR) score of 1.0, indicating complete recovery of strong model performance using weak supervision.
Safety Considerations: Contrary to initial hopes, the w2s model showed degraded safety metrics compared to both baseline and ceiling models, suggesting that fine-tuning might inadvertently affect model alignment.
Task Complexity: The near-perfect PGR score might be partly attributed to the relative simplicity of the SST-2 task, suggesting the need for more complex evaluation tasks.

Limitations and Future Work

Current Limitations

Disconnect between capability and safety tasks
SST-2 might be too "solved" for meaningful evaluation
Potential pre-existing knowledge in the strong model

Future Directions

Task Complexity: Explore more challenging datasets (e.g., MATH reasoning, code generation, chess)
Safety Evaluation: Test safety degradation using random label controls
Learning Format: Investigate zero-shot to few-shot learning for safety evaluation
Task Correlation: Explore tasks with stronger capability-safety correlation
Fine-tuning Methods: Evaluate alternative fine-tuning approaches beyond naive methods

Repository Structure

├── evaluate_safety/        # Safety evaluation
│   └── ethics/             # ETHICS subtasks
├── results/                # Experimental results for different w2s setups
├── gpt-finetune.ipynb      # W2S fine-tuning setup
└── test_boolq.ipynb        # BoolQ evaluation

Acknowledgments

This project builds upon the work of Burns et al. (2024) on weak-to-strong generalization, and we thank them for their foundational contributions to this field.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
evaluate_safety		evaluate_safety
readme_data		readme_data
results		results
.gitignore		.gitignore
README.md		README.md
gpt-finetune.ipynb		gpt-finetune.ipynb
test_boolq.ipynb		test_boolq.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Weak-to-Strong Alignment for Fine-tuning Large Language Models

Overview

Training Results

Weak Model Training

W2S Model Training

Key Findings

Performance Results

Safety Evaluation Results

Key Insights

Limitations and Future Work

Current Limitations

Future Directions

Repository Structure

Acknowledgments

About

Releases

Packages

Contributors 3

Languages

anw10/W2S_Safety

Folders and files

Latest commit

History

Repository files navigation

Weak-to-Strong Alignment for Fine-tuning Large Language Models

Overview

Training Results

Weak Model Training

W2S Model Training

Key Findings

Performance Results

Safety Evaluation Results

Key Insights

Limitations and Future Work

Current Limitations

Future Directions

Repository Structure

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages