Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
Misc: We prompted DALL-E 3 with "Conceptual figure of 'SFT Memorizes, RL Generalizes', with trendlines and style of Hong Kong", but somehow skyscrapers dominate the picture...
- [02/24/25] Added support for the API Evaluator. Use our environments to evaluate your API-based models~
- [02/8/25] We add SFT scripts and text-only SFT data. Still updating~
- [01/28/25] Excited to shout out our paper SFT Memorizes, RL Generalizes! We release the environments, training scripts, evaluation scripts, SFT data, and initial checkpoints.
Our codebase is tested on H800 servers with Python 3.13.0 and torch 2.5.1+cu124.
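A quick way to confirm your environment roughly matches the tested setup (a minimal sanity check, not part of the repo):

```python
# Minimal environment sanity check (not part of the repo):
# confirms Python / torch / CUDA versions roughly match the tested setup.
import sys
import torch

print("python:", sys.version.split()[0])    # expected ~3.13.0
print("torch:", torch.__version__)          # expected 2.5.1+cu124
print("cuda available:", torch.cuda.is_available())
```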
- Clone this repository and navigate into the codebase
git clone https://github.com/LeslieTrue/SFTvsRL.git
cd SFTvsRL
- Install Packages
conda create -n SFTvsRL python==3.13 -y
conda activate SFTvsRL
pip install -r requirements.txt
cd gym
pip install -e . # install gym environment
cd ..
We instantiate RL experiments on top of SFT-initialized checkpoints to guarantee the model's basic instruction-following capabilities. We provide all 4 initial checkpoints for {GeneralPoints, V-IRL} × {Language (-L), Vision-Language (-VL)}.
huggingface-cli download tianzhechu/GP-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/GP-VL-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-VL-Init --local-dir YOUR_LOCAL_DIR
Downloading these checkpoints via the Hugging Face CLI is optional; you may directly specify the repo name as CKPT_NAME in the shell scripts.
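If you prefer the Python API over the CLI, the same checkpoints can be fetched with huggingface_hub (a minimal sketch; the local directory is a placeholder for your own path):

```python
# Minimal sketch: download an initial checkpoint with the huggingface_hub
# Python API instead of the CLI. YOUR_LOCAL_DIR is a placeholder path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tianzhechu/GP-L-Init",   # or GP-VL-Init, VIRL-L-Init, VIRL-VL-Init
    local_dir="YOUR_LOCAL_DIR",
)
```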
- Install packages and prepare the initial checkpoints (optional).
- Check here to download initial checkpoints for all 4 training experiments.
- You may train your own initial checkpoints following instructions here.
- We use Llama-3.2-Vision-Instruct for all our experiments. Other models might not need SFT initialization; feel free to explore~
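Below is a rough sketch of loading one of these initial checkpoints with transformers, assuming the released weights keep the standard Llama-3.2-Vision (Mllama) format; the training scripts in this repo handle model loading themselves, so this is only for quick inspection.

```python
# Rough sketch (assumption: the released checkpoints are standard
# transformers-format Llama-3.2-Vision weights). The training scripts load
# models themselves; this is only for quickly poking at a checkpoint.
from transformers import AutoProcessor, MllamaForConditionalGeneration

ckpt = "tianzhechu/GP-L-Init"  # or a local YOUR_LOCAL_DIR path
processor = AutoProcessor.from_pretrained(ckpt)
model = MllamaForConditionalGeneration.from_pretrained(ckpt, device_map="auto")
print(model.config.model_type)  # expect "mllama" for Llama-3.2-Vision
```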
- Launch RL experiments (PPO).
- For GeneralPoints, please execute the following scripts:
- Language only:
bash scripts/gp_training/language_train.sh
- With vision:
bash scripts/gp_training/vl_train.sh
- Edit training configs either in the shell scripts or in rl/configs/llama_gp_*.yaml (see the config-inspection sketch below)
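To see which hyperparameters a run will use, you can dump the YAML config before launching (a small sketch; the exact keys are whatever the files in rl/configs/ define, and the filename below is only a hypothetical match for llama_gp_*.yaml):

```python
# Small sketch: print the contents of a training config before launching.
# The exact keys are defined by rl/configs/llama_gp_*.yaml; nothing here
# assumes specific field names. The filename is hypothetical.
import yaml

with open("rl/configs/llama_gp_language.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(f"{key}: {value}")
```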
- For V-IRL, please do the following steps:
- First, download data from here.
- Then, specify paths in the training shell scripts (a sanity-check sketch for these data files appears after this list)
STREETVIEWS=YOUR_PATH/nyc_1k_routes/street_views/
GPS_TO_PANO=YOUR_PATH/nyc_1k_routes/gps_pano_mapping.pkl
ROUTE_INFO=YOUR_PATH/nyc_1k_routes/route_infos.json
- Finally, start training
- Language only:
bash scripts/virl_training/language_train.sh
- With vision:
bash scripts/virl_training/vl_train.sh
- Edit training configs either in the shell scripts or in rl/configs/llama_virl_*.yaml
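Before launching V-IRL training, it can help to sanity-check that the downloaded data is where the scripts expect it (a minimal sketch; it only verifies the files exist and load, without assuming anything about their internal structure):

```python
# Minimal sanity check for the V-IRL data paths set in the training scripts.
# It only verifies the files exist and can be loaded; it makes no assumptions
# about their internal structure.
import json
import pickle
from pathlib import Path

root = Path("YOUR_PATH/nyc_1k_routes")  # placeholder, same as in the scripts
assert (root / "street_views").is_dir()

with open(root / "gps_pano_mapping.pkl", "rb") as f:
    gps_to_pano = pickle.load(f)
with open(root / "route_infos.json") as f:
    route_infos = json.load(f)

print("gps->pano entries:", len(gps_to_pano))
print("routes:", len(route_infos))
```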
- Evaluate RL checkpoints after training.
- We have a series of evaluation scripts:
  - scripts/gp_evaluation/*.sh: evaluate GeneralPoints
  - scripts/virl_evaluation/*.sh: evaluate V-IRL
  - scripts/recog_evaluation/*.sh: evaluate GeneralPoints recognition
- Please modify CKPT_NAME in these shell scripts.
** Note that our shell scripts support slurm clusters if launched via sbatch scripts/*/*.sh. Reproducing our training experiments requires a node of 8 GPUs with 80GB of memory each.
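A quick way to check that a node meets this requirement before submitting a job (a small sketch using torch's CUDA utilities):

```python
# Small sketch: verify the node has 8 GPUs with ~80GB memory each before
# launching (or sbatch-ing) a training script.
import torch

n = torch.cuda.device_count()
print("GPUs:", n)
for i in range(n):
    props = torch.cuda.get_device_properties(i)
    print(f"  [{i}] {props.name}: {props.total_memory / 1024**3:.0f} GB")
assert n >= 8, "training experiments expect a node with 8 GPUs"
```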
If you find this project useful for your research and applications, please cite using this BibTeX:
@misc{chu2025sftmemorizesrlgeneralizes,
title={SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training},
author={Tianzhe Chu and Yuexiang Zhai and Jihan Yang and Shengbang Tong and Saining Xie and Dale Schuurmans and Quoc V. Le and Sergey Levine and Yi Ma},
year={2025},
eprint={2501.17161},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2501.17161},
}
- RL4VLM: We start our codebase from Simon's amazing project.
- Llama-3.2-Vision-Instruct: We instantiate our experiments on top of this model.
- Llama-3.2-Vision-Finetune: Our SFT code is modified from an early version of this repository.
- V-IRL: Grounding Virtual Intelligence in Real Life: We adopt this fantastic environment.