Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
Misc: We prompted DALL-E 3 with "Conceptual figure of 'SFT Memorizes, RL Generalizes', with trendlines and style of Hong Kong", but somehow skyscrapers dominate the picture...
- [02/24/25] Added support for the API Evaluator. Use our environments to evaluate your API-based models~
- [02/8/25] We add SFT scripts and text-only SFT data. Still updating~
- [01/28/25] Excited to shout out our paper SFT Memorizes, RL Generalizes! We release the environments, training scripts, evaluation scripts, SFT data, and initial checkpoints.
Our codebase is tested on H800 servers with Python 3.13.0 and torch 2.5.1+cu124.
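A quick way to confirm your environment roughly matches the tested setup (a minimal sanity check, not part of the repo):

```python
# Minimal environment sanity check (not part of the repo):
# confirms Python / torch / CUDA versions roughly match the tested setup.
import sys
import torch

print("python:", sys.version.split()[0])    # expected ~3.13.0
print("torch:", torch.__version__)          # expected 2.5.1+cu124
print("cuda available:", torch.cuda.is_available())
```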
- Clone this repository and navigate into the codebase
git clone https://github.com/LeslieTrue/SFTvsRL.git
cd SFTvsRL
- Install Packages
conda create -n SFTvsRL python==3.13 -y
conda activate SFTvsRL
pip install -r requirements.txt
cd gym
pip install -e . # install gym environment
cd ..
We instantiate RL experiments on top of SFT-initialized checkpoints to guarantee the model's basic instruction-following capabilities. We provide all 4 initial checkpoints for {GeneralPoints, V-IRL} × {Language (-L), Vision-Language (-VL)}.
huggingface-cli download tianzhechu/GP-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/GP-VL-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-VL-Init --local-dir YOUR_LOCAL_DIR
Downloading these checkpoints via the Hugging Face CLI is optional; you may directly specify the repo name as CKPT_NAME in the shell scripts.
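If you prefer the Python API over the CLI, the same checkpoints can be fetched with huggingface_hub (a minimal sketch; the local directory is a placeholder for your own path):

```python
# Minimal sketch: download an initial checkpoint with the huggingface_hub
# Python API instead of the CLI. YOUR_LOCAL_DIR is a placeholder path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tianzhechu/GP-L-Init",   # or GP-VL-Init, VIRL-L-Init, VIRL-VL-Init
    local_dir="YOUR_LOCAL_DIR",
)
```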
- Install packages and prepare the initial checkpoints (optional).
- Check here to download initial checkpoints for all 4 training experiments.
- You may train your own initial checkpoints following instructions here.
- We use Llama-3.2-Vision-Instruct for all our experiments. Other models might not need SFT initialization; feel free to explore~
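Below is a rough sketch of loading one of these initial checkpoints with transformers, assuming the released weights keep the standard Llama-3.2-Vision (Mllama) format; the training scripts in this repo handle model loading themselves, so this is only for quick inspection.

```python
# Rough sketch (assumption: the released checkpoints are standard
# transformers-format Llama-3.2-Vision weights). The training scripts load
# models themselves; this is only for quickly poking at a checkpoint.
from transformers import AutoProcessor, MllamaForConditionalGeneration

ckpt = "tianzhechu/GP-L-Init"  # or a local YOUR_LOCAL_DIR path
processor = AutoProcessor.from_pretrained(ckpt)
model = MllamaForConditionalGeneration.from_pretrained(ckpt, device_map="auto")
print(model.config.model_type)  # expect "mllama" for Llama-3.2-Vision
```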
- Launch RL experiments (PPO).
- For GeneralPoints, please execute the following scripts:
- Language only:
bash scripts/gp_training/language_train.sh
- With vision:
bash scripts/gp_training/vl_train.sh
- Edit training configs either in the shell scripts or in rl/configs/llama_gp_*.yaml (see the config-inspection sketch below)
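To see which hyperparameters a run will use, you can dump the YAML config before launching (a small sketch; the exact keys are whatever the files in rl/configs/ define, and the filename below is only a hypothetical match for llama_gp_*.yaml):

```python
# Small sketch: print the contents of a training config before launching.
# The exact keys are defined by rl/configs/llama_gp_*.yaml; nothing here
# assumes specific field names. The filename is hypothetical.
import yaml

with open("rl/configs/llama_gp_language.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(f"{key}: {value}")
```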
- For V-IRL, please do the following steps:
- First, download data from here.
- Then, specify paths in the training shell scripts (a sanity-check sketch for these data files appears after this list)
STREETVIEWS=YOUR_PATH/nyc_1k_routes/street_views/
GPS_TO_PANO=YOUR_PATH/nyc_1k_routes/gps_pano_mapping.pkl
ROUTE_INFO=YOUR_PATH/nyc_1k_routes/route_infos.json
- Finally, start training
- Language only:
bash scripts/virl_training/language_train.sh
- With vision:
bash scripts/virl_training/vl_train.sh
- Edit training configs either in the shell scripts or in rl/configs/llama_virl_*.yaml
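Before launching V-IRL training, it can help to sanity-check that the downloaded data is where the scripts expect it (a minimal sketch; it only verifies the files exist and load, without assuming anything about their internal structure):

```python
# Minimal sanity check for the V-IRL data paths set in the training scripts.
# It only verifies the files exist and can be loaded; it makes no assumptions
# about their internal structure.
import json
import pickle
from pathlib import Path

root = Path("YOUR_PATH/nyc_1k_routes")  # placeholder, same as in the scripts
assert (root / "street_views").is_dir()

with open(root / "gps_pano_mapping.pkl", "rb") as f:
    gps_to_pano = pickle.load(f)
with open(root / "route_infos.json") as f:
    route_infos = json.load(f)

print("gps->pano entries:", len(gps_to_pano))
print("routes:", len(route_infos))
```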
- Evaluate RL checkpoints after training.
- We have a series of evaluation scripts:
  - scripts/gp_evaluation/*.sh: evaluate GeneralPoints
  - scripts/virl_evaluation/*.sh: evaluate V-IRL
  - scripts/recog_evaluation/*.sh: evaluate GeneralPoints recognition
- Please modify CKPT_NAME in these shell scripts.
** Note that our shell scripts support slurm clusters if launched via sbatch scripts/*/*.sh. Reproducing our training experiments requires a node of 8 GPUs with 80GB of memory each.
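A quick way to check that a node meets this requirement before submitting a job (a small sketch using torch's CUDA utilities):

```python
# Small sketch: verify the node has 8 GPUs with ~80GB memory each before
# launching (or sbatch-ing) a training script.
import torch

n = torch.cuda.device_count()
print("GPUs:", n)
for i in range(n):
    props = torch.cuda.get_device_properties(i)
    print(f"  [{i}] {props.name}: {props.total_memory / 1024**3:.0f} GB")
assert n >= 8, "training experiments expect a node with 8 GPUs"
```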
If you find this project useful for your research and applications, please cite using this BibTeX:
@misc{chu2025sftmemorizesrlgeneralizes,
title={SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training},
author={Tianzhe Chu and Yuexiang Zhai and Jihan Yang and Shengbang Tong and Saining Xie and Dale Schuurmans and Quoc V. Le and Sergey Levine and Yi Ma},
year={2025},
eprint={2501.17161},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2501.17161},
}
- RL4VLM: We start our codebase from Simon's amazing project.
- Llama-3.2-Vision-Instruct: We instantiate our experiments on top of this model.
- Llama-3.2-Vision-Finetune: Our SFT code is modified from an early version of this repository.
- V-IRL: Grounding Virtual Intelligence in Real Life: We adopt this fantastic environment.