
SFT Memorizes, RL Generalizes:
A Comparative Study of Foundation Model Post-training


Misc: We prompted DALL-E 3 with "Conceptual figure of 'SFT Memorizes, RL Generalizes', with trendlines and style of Hong Kong", but somehow skyscrapers dominate the picture...

Release

  • [02/24/25] Support API Evaluator. Use our environments to evaluate your API-based models~
  • [02/08/25] We added SFT scripts and text-only SFT data. Still updating~
  • [01/28/25] Excited to shout out our paper SFT Memorizes, RL Generalizes! We release the environments, training scripts, evaluation scripts, SFT data, and initial checkpoints.

Installation

Prepare

Our codebase is tested on H800 servers with Python 3.13.0 and torch 2.5.1+cu124.

  1. Clone this repository and navigate into the codebase
git clone https://github.com/LeslieTrue/SFTvsRL.git
cd SFTvsRL
  2. Install Packages
conda create -n SFTvsRL python==3.13 -y
conda activate SFTvsRL
pip install -r requirements.txt
cd gym
pip install -e . # install gym environment
cd ..
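
Optionally, sanity-check the install before moving on (a minimal sketch, not part of the official scripts; it assumes only the pinned torch build above):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"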

Download Initial Checkpoints (Optional)

We instantiate RL experiments on top of SFT-initialized checkpoints to guarantee the model's basic instruction-following capability. We provide all 4 initial checkpoints for {GeneralPoints, V-IRL} × {Language (-L), Vision-Language (-VL)}.

huggingface-cli download tianzhechu/GP-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/GP-VL-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-L-Init --local-dir YOUR_LOCAL_DIR
huggingface-cli download tianzhechu/VIRL-VL-Init --local-dir YOUR_LOCAL_DIR

Downloading these checkpoints via the Hugging Face CLI is optional: you may instead specify the repo name directly as CKPT_NAME in the shell scripts.
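
For example (a minimal sketch; how CKPT_NAME is consumed depends on the individual script):

CKPT_NAME=YOUR_LOCAL_DIR          # option 1: a locally downloaded checkpoint
CKPT_NAME=tianzhechu/GP-L-Init    # option 2: the Hugging Face repo name directly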

Getting Started

  1. Install packages and prepare the initial checkpoints (optional).
    • See Download Initial Checkpoints above for the initial checkpoints of all 4 training experiments.
    • You may train your own initial checkpoints following instructions here.
    • We use Llama-3.2-Vision-Instruct for all our experiments. Other models might not need SFT initialization; feel free to explore~
  2. Launch RL experiments (PPO).
    • For GeneralPoints, please execute the following scripts:
      • Language only: bash scripts/gp_training/language_train.sh
      • With vision: bash scripts/gp_training/vl_train.sh
      • Edit training configs either in shell scripts or rl/configs/llama_gp_*.yaml
    • For V-IRL, please follow these steps:
      • First, download data from here.
      • Then, specify paths in training shell scripts
        • STREETVIEWS=YOUR_PATH/nyc_1k_routes/street_views/
        • GPS_TO_PANO=YOUR_PATH/nyc_1k_routes/gps_pano_mapping.pkl
        • ROUTE_INFO=YOUR_PATH/nyc_1k_routes/route_infos.json
      • Finally, start training
        • Language only: bash scripts/virl_training/language_train.sh
        • With vision: bash scripts/virl_training/vl_train.sh
        • Edit training configs either in shell scripts or rl/configs/llama_virl_*.yaml
  3. Evaluate RL checkpoints after training.
    • We have a series of evaluation scripts:
      • scripts/gp_evaluation/*.sh: evaluate GeneralPoints
      • scripts/virl_evaluation/*.sh: evaluate V-IRL
      • scripts/recog_evaluation/*.sh: evaluate GeneralPoints recognition
    • Please modify CKPT_NAME in these shell scripts (see the sketch after this list).
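
A minimal sketch of an evaluation launch (the script name below is a placeholder; pick a concrete file under scripts/gp_evaluation/ and edit CKPT_NAME inside it):

CKPT_NAME=YOUR_LOCAL_DIR   # or a repo name such as tianzhechu/GP-L-Init, set inside the chosen script
bash scripts/gp_evaluation/your_eval_script.sh   # hypothetical file name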

Note that our shell scripts support slurm clusters if launched via sbatch scripts/*/*.sh. Reproducing our training experiments requires a node of 8 GPUs with 80GB of memory each.
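
For example, to queue the GeneralPoints language-only training run on a slurm cluster:

sbatch scripts/gp_training/language_train.sh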

Citation

If you find this project useful for your research and applications, please cite using this BibTeX:

@misc{chu2025sftmemorizesrlgeneralizes,
      title={SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training}, 
      author={Tianzhe Chu and Yuexiang Zhai and Jihan Yang and Shengbang Tong and Saining Xie and Dale Schuurmans and Quoc V. Le and Sergey Levine and Yi Ma},
      year={2025},
      eprint={2501.17161},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2501.17161}, 
}

Acknowledgement
