TinyZeroMathVeRL

TinyZeroMathVeRL is a toy project designed to replicate DeepSeek-R1-Zero at minimal cost, using the GSM8K dataset to validate mathematical reasoning tasks.

Environment

Please follow the environment setup instructions provided in verl: Volcano Engine Reinforcement Learning for LLMs.

Running

Data Preparation

python3 ./examples/data_preprocess/gsm8k_r1_zero.py \
    --local_dir {local_dir} \
    --template_type instruct
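
If the script mirrors verl's stock GSM8K preprocessor, it writes train.parquet and test.parquet under {local_dir}. A quick sanity check is to load one split and print a prompt; note that the path and the "prompt" field name below are assumptions, not guaranteed by this project:

import pandas as pd

# Assumption: the preprocessor writes parquet splits containing a "prompt"
# column, as verl's stock gsm8k.py does. Replace the path with your --local_dir.
df = pd.read_parquet("data/gsm8k_r1_zero/train.parquet")
print(df.columns.tolist())
print(df.iloc[0]["prompt"])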

Training

bash main_grpo.sh

Experimental Setup & Results

Experimental Setup

Prompt Design: the prompt is the same as DeepSeek-R1's.

System: You are a helpful assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. 
User: {question}. Please reason step by step, and put your final answer within \boxed{}.
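
For illustration, here is a minimal sketch of assembling this template into a chat-style prompt for a single question. The helper name and message structure are assumptions, not the project's actual preprocessing code:

# Hypothetical helper that assembles the R1-Zero-style chat prompt above.
SYSTEM_PROMPT = (
    "You are a helpful assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides "
    "the user with the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>."
)

def build_prompt(question: str) -> list:
    user = f"{question}. Please reason step by step, and put your final answer within \\boxed{{}}."
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

print(build_prompt("Natalia sold 48 clips in April and half as many in May. How many in total?"))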

Models: Qwen2.5-1.5B-Instruct & Qwen2.5-7B-Instruct

Reward Settings

  • Format Reward

    • Format followed: reward +0.1 (easy to learn).
    • Format not followed: reward -1.
    • The format judgment is implemented based on Logic-RL.
  • Answer Reward

    • Correct answer: reward +2.
    • Correct answer but multiple \boxed{}: reward +1.5.
    • Incorrect answer: reward -2.
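
Taken together, here is a minimal sketch of the combined rule-based reward under these settings. The regex format check is only a simplified stand-in for the Logic-RL-based judgment, and all names are illustrative rather than the project's actual code:

import re

# Illustrative sketch of the combined rule-based reward described above.
# The real project bases its format judgment on Logic-RL; this regex is a
# simplified stand-in, and the \boxed{} pattern ignores nested braces.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def compute_reward(response: str, ground_truth: str) -> float:
    # Format reward: +0.1 if the <think>/<answer> structure is followed, -1 otherwise.
    reward = 0.1 if FORMAT_RE.match(response.strip()) else -1.0
    # Answer reward: compare the last \boxed{} content with the ground truth.
    boxed = BOXED_RE.findall(response)
    if boxed and boxed[-1].strip() == ground_truth.strip():
        reward += 2.0 if len(boxed) == 1 else 1.5  # +1.5 when multiple \boxed{}
    else:
        reward += -2.0
    return reward

# Example: correctly formatted response with one boxed answer -> 0.1 + 2.0 = 2.1
print(compute_reward("<think> 6 * 12 = 72 </think> <answer> \\boxed{72} </answer>", "72"))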

Experimental Results

  • Qwen2.5-1.5B-Instruct, Answer Reward only
    [Plots: Answer Reward, Response Length, Valid Answer Accuracy]
  • Qwen2.5-1.5B-Instruct, Answer Reward and Format Reward
    [Plots: Answer Reward, Format Reward, Response Length, Valid Answer Accuracy]
  • Qwen2.5-7B-Instruct, Answer Reward only
    [Plots: Answer Reward, Response Length, Valid Answer Accuracy]
  • Qwen2.5-7B-Instruct, Answer Reward and Format Reward
    [Plots: Answer Reward, Format Reward, Response Length, Valid Answer Accuracy]

Key Observations

  1. Ease of Format Learning:

    • Even with a very small format reward, the format is relatively easy to learn. However, learning the format does not necessarily improve answer quality on this simple math reasoning task.
  2. Impact of \boxed{} Format:

    • Requiring the \boxed{} format improved answers in both the 1.5B and 7B models, because without it the model tended to output a complete sentence within <answer> </answer> rather than a single numerical answer.
    • For the 7B model, accuracy improved from approximately 80% to 92%.
    • For the 1.5B model, accuracy improved from approximately 68% to 80% (based on reported observations; no dedicated experiments were run for the 1.5B model).
  3. Initial Performance of Qwen2.5-1.5B-Instruct:

    • The initially poor performance of the Qwen2.5-1.5B-Instruct model was largely due to its failure to adhere to the required format; a sketch classifying these failure modes follows this list. For example:
      • The final answer was placed after the </answer> tag instead of within <answer> </answer>.
      • The answer was not enclosed in \boxed{}.
  4. Self-evolution Process on GSM8K Dataset:

    • The GSM8K dataset might be too simple, as no self-evolution process was observed in either the 1.5B or the 7B model.
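
As referenced in observation 3, here is a small illustrative sketch of classifying those format-failure modes. The function name and checks are hypothetical, not the project's actual validator:

import re

# Hypothetical classifier for the failure modes in observation 3.
def classify_failure(response: str) -> str:
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return "missing <answer> </answer> tags"
    inside = m.group(1)
    after = response.split("</answer>", 1)[1]
    if "\\boxed{" not in inside and "\\boxed{" in after:
        return "final answer placed after </answer>"
    if "\\boxed{" not in inside:
        return "answer not enclosed in \\boxed{}"
    return "ok"

# Example of the first failure mode: the boxed answer trails the closing tag.
print(classify_failure("<think> ... </think> <answer> 72 </answer> \\boxed{72}"))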
