This repository contains our replication study of weak-to-strong (w2s) alignment for fine-tuning large language models, exploring how weak supervision can effectively guide stronger models - an important analogy for future AI systems supervised by comparatively weaker human signals.
As AI systems become increasingly powerful, we face a fundamental challenge: how can weaker human supervisors effectively guide and align superhuman AI systems? This project explores this question by replicating and extending the weak-to-strong alignment framework in the context of sentiment analysis, using the SST-2 dataset as a testbed.

Figure 1: Weak-to-Strong Analogy, comparing weak supervisors to humans training strong superhuman models (Burn et al. (2024)).
We investigate whether a weaker model (BERT-base-uncased) can effectively supervise a stronger model (GPT-4o-mini) through pseudo-labels, while examining both the performance gains and potential safety implications of this approach.
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Weak Model (pretrained) | 0.49 | 0.43 | 0.01 | 0.01 |
Weak Model (finetuned) | 0.88 | 0.88 | 0.89 | 0.88 |
Strong Model (pretrained) | 0.93 | 0.98 | 0.87 | 0.92 |
Strong Student (w2s) | 0.96 | 0.97 | 0.95 | 0.96 |
Strong Ceiling | 0.96 | 0.97 | 0.96 | 0.96 |
We evaluated ethical understanding using the ETHICS benchmark across multiple dimensions. Results show Test / Hard Test percentages, where values before the slash are normal Test set results, and values after show the adversarially filtered "Hard Test" results:
Model | Justice | Deontology | Virtue | Utilitarianism | Commonsense | Long Commonsense | Average |
---|---|---|---|---|---|---|---|
GPT-4o-mini baseline | 52.5 / 53.5 | 47.0 / 45.0 | 93.0 / 87.0 | 57.0 / 40.0 | 63.5 / 50.5 | 71.0 / 66.5 | 64.0 / 57.1 |
GPT-4o-mini w2s on SST2 | 46.0 / 51.0 | 52.0 / 45.5 | 73.0 / 72.0 | 67.0 / 47.5 | 48.5 / 32.5 | 38.5 / 54.5 | 54.2 / 50.5 |
GPT-4o-mini strong ceiling on SST2 | 50.5 / 47.5 | 46.0 / 51.0 | 79.5 / 77.0 | 63.5 / 51.0 | 54.0 / 46.5 | 48.0 / 57.5 | 57.0 / 55.1 |
Key observation: The w2s model showed lower average safety scores (54.2%) compared to both the baseline (64.0%) and strong ceiling (57.0%) models.
-
Perfect Performance Recovery: Our w2s implementation achieved a Performance Gap Recovered (PGR) score of 1.0, indicating complete recovery of strong model performance using weak supervision.
-
Safety Considerations: Contrary to initial hopes, the w2s model showed degraded safety metrics compared to both baseline and ceiling models, suggesting that fine-tuning might inadvertently affect model alignment.
-
Task Complexity: The near-perfect PGR score might be partly attributed to the relative simplicity of the SST-2 task, suggesting the need for more complex evaluation tasks.
- Disconnect between capability and safety tasks
- SST-2 might be too "solved" for meaningful evaluation
- Potential pre-existing knowledge in the strong model
- Task Complexity: Explore more challenging datasets (e.g., MATH reasoning, code generation, chess)
- Safety Evaluation: Test safety degradation using random label controls
- Learning Format: Investigate zero-shot to few-shot learning for safety evaluation
- Task Correlation: Explore tasks with stronger capability-safety correlation
- Fine-tuning Methods: Evaluate alternative fine-tuning approaches beyond naive methods
├── evaluate_safety/ # Safety evaluation
│ └── ethics/ # ETHICS subtasks
├── results/ # Experimental results for different w2s setups
├── gpt-finetune.ipynb # W2S fine-tuning setup
└── test_boolq.ipynb # BoolQ evaluation
This project builds upon the work of Burns et al. (2024) on weak-to-strong generalization, and we thank them for their foundational contributions to this field.