SPARKLE is a fine-grained framework for evaluating LLM reasoning improvements under reinforcement learning (RL), analyzing models along three key axes: plan-following and execution, knowledge utilization, and subproblem decomposition. We also study problem difficulty and find that hard problems remain valuable for RL training when appropriately structured with partial solution steps.
Contrary to common belief, hard problems can be effective for RL training when augmented with partial solution steps. In our curriculum-style approach, continuing training on the hardest problems, augmented with partial solutions, yields the best performance.
RL-tuned models don't just execute external plans better; they formulate and follow internal strategies better suited to their own reasoning processes. Surprisingly, providing explicit step-by-step plans degrades performance on challenging benchmarks, but RL-tuned models show greater robustness to this degradation.
RL significantly enhances the model's capacity to integrate provided knowledge into its reasoning process, leading to consistent performance improvements across diverse mathematical tasks and difficulty levels.
| Model | AIME | AMC | MATH500 | GSM8K | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen-2.5-Math-7B-Base | 16.67 | 42.50 | 44.03 | 42.53 | 28.65 | 35.23 |
| SparkleRL-Stage 1 | 46.67 (↑30.00) | 67.50 (↑25.00) | 80.00 (↑35.97) | 91.77 (↑49.24) | 39.11 (↑10.46) | 65.01 |
| SparkleRL-Stage 2 (Aug) | 50.42 (↑33.75) | 71.25 (↑28.75) | 81.00 (↑36.97) | 92.38 (↑49.85) | 40.11 (↑11.46) | 67.03 |

Table: Avg@8 accuracy (%) across benchmarks; arrows show gains over the base model. Stage 2 (Aug) uses our curriculum-style training with augmented hard problems.
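For reference, Avg@8 averages accuracy over 8 sampled generations per problem. A minimal sketch of the metric in Python (function and variable names are illustrative, not taken from the released code):
# Illustrative sketch of Avg@8: mean per-problem accuracy over k = 8 samples.
# Names are hypothetical and not from the SPARKLE codebase.
def avg_at_k(per_problem_correctness, k=8):
    # per_problem_correctness[i] holds k booleans: whether each of the k
    # sampled generations for problem i was judged correct
    per_problem = [sum(c[:k]) / k for c in per_problem_correctness]
    return 100.0 * sum(per_problem) / len(per_problem)

# Example: two problems, 8 samples each -> (0.75 + 1.0) / 2 -> 87.5
print(avg_at_k([[True] * 6 + [False] * 2, [True] * 8]))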
# Create and activate conda environment
conda create -n sparkle python==3.12
conda activate sparkle
# Install PyTorch and Flash Attention
pip3 install torch==2.4.0
pip3 install flash-attn --no-build-isolation
# Install VERL and dependencies
cd verl
pip3 install -e .
pip install wandb IPython matplotlib
pip install vertexai latex2sympy2
pip3 install -U antlr4-python3-runtime==4.9.3
# Generate parquet files in data/*.parquet
python scripts/data/prepare_stage_one_data.py
python scripts/data/prepare_stage_two_data.py --aug_version all # Recommended based on our ablation studies
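To spot-check the generated data, you can load one of the parquet files with pandas (a minimal sketch; the exact filenames and column names depend on the preparation scripts and are not guaranteed here):
# Optional spot-check of a generated parquet file.
# The filename below is hypothetical; use whatever appears under data/.
import pandas as pd

df = pd.read_parquet("data/stage_one_train.parquet")
print(len(df))
print(df.columns.tolist())
print(df.iloc[0])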
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Stage 1: Foundation RL training on full dataset
export PATH_TO_BASE_MODEL="Qwen/Qwen2.5-Math-7B"
./scripts/train/stage_one.sh --model $PATH_TO_BASE_MODEL
# Stage 2: Curriculum-style training with augmented hard problems (recommended)
export PATH_TO_STAGE_ONE_MODEL="/path/to/your/stage1/checkpoint"
./scripts/train/stage_two_aug.sh --model $PATH_TO_STAGE_ONE_MODEL
Note: Stage 2 training uses the `spk_h_aug` reward type, which handles augmented responses that continue from partial solution steps. This is crucial for the curriculum-style training approach.
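Conceptually, the reward for augmented problems must score a response that continues from the provided partial solution and still reaches a verifiable final answer. A rough sketch of that idea (not the actual `spk_h_aug` implementation in verl; names and the format check are illustrative):
# Rough conceptual sketch of a reward for augmented (partial-solution) prompts.
# This is NOT the actual spk_h_aug implementation; names are illustrative.
import re

def partial_solution_reward(response, ground_truth):
    # Score the continuation by its final boxed answer, if one is produced.
    answers = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not answers:
        return 0.0  # no final answer produced
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0

print(partial_solution_reward(r"... so n = \boxed{144}", "144"))  # -> 1.0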
# Step 1: Convert FSDP checkpoint to HuggingFace format (if using your own checkpoints)
python eval/fsdp2hf.py \
--fsdp_path /path/to/checkpoint/actor \
--base_model Qwen/Qwen2.5-Math-7B \
--output_path /path/to/output
# Step 2: Set up evaluation environment
cd eval/lm-evaluation-harness
pip install -e .
# Step 3: Run comprehensive evaluation across all benchmarks
export PATH_TO_STAGE_ONE_MODEL="/path/to/stage1/model"
export PATH_TO_STAGE_TWO_MODEL="/path/to/stage2/model"
./scripts/eval/eval_all_vllm.sh
Tip: You can also directly use our pre-trained checkpoints from HuggingFace instead of converting your own FSDP checkpoints.
We release our checkpoints on HuggingFace:
- `sparkle-reasoning/SparkleRL-7B-Stage1`: Foundation RL-tuned model trained on the large-scale full dataset
- `sparkle-reasoning/SparkleRL-7B-Stage2-aug`: Recommended; curriculum-style training with a small amount of augmented hard problems
- `sparkle-reasoning/SparkleRL-7B-Stage2-hard`: Training on hard problems only
- `sparkle-reasoning/SparkleRL-7B-Stage2-mix`: Mixed-difficulty training
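For quick inference with a released checkpoint, a standard transformers loading snippet works (a minimal sketch; the generation settings are illustrative, not the paper's evaluation setup):
# Minimal sketch: run a released checkpoint with transformers.
# Generation settings are illustrative, not the paper's evaluation setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sparkle-reasoning/SparkleRL-7B-Stage2-aug"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Find the value of n such that 133^5 + 110^5 + 84^5 + 27^5 = n^5."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))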
Our curated datasets are available on HuggingFace:
- `sparkle-reasoning/dsr40k`: Large-scale training data (40.3k problems) used for stage one foundation training
- `sparkle-reasoning/hardmath`: Challenging mathematical problems (6.5k problems) used for stage two curriculum training; specifically, questions the stage one model cannot answer, with rigorous data label cleaning
- AIME 2024, AMC 2023, MATH500, GSM8K, OlympiadBench: Standard mathematical reasoning evaluation sets
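Both training datasets can be pulled directly with the Hugging Face datasets library (a minimal sketch; split and field names are whatever the release defines, so the snippet just prints what is available):
# Load the released training data; split and field names are not assumed here.
from datasets import load_dataset

dsr40k = load_dataset("sparkle-reasoning/dsr40k")
hardmath = load_dataset("sparkle-reasoning/hardmath")
print(dsr40k)
print(hardmath)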
The SPARKLE framework evaluates mathematical reasoning along three dimensions (a schematic sketch of the corresponding probes follows this list):
- Plan-Following and Execution: How well models follow and execute reasoning plans
- Knowledge Utilization: Ability to integrate external knowledge into reasoning
- Subproblem Decomposition: Capacity to solve decomposed subproblems
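The sketch below illustrates how probes along these three axes differ in shape; it is purely schematic and not the paper's exact prompt templates:
# Schematic illustration of the three probe types; NOT the paper's templates.
def with_plan(problem, plan_steps):
    # Plan-following probe: prepend an explicit step-by-step plan to follow.
    plan = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(plan_steps))
    return f"{problem}\n\nFollow this plan:\n{plan}"

def with_knowledge(problem, facts):
    # Knowledge-utilization probe: supply relevant facts to integrate.
    return problem + "\n\nYou may use the following facts:\n- " + "\n- ".join(facts)

def as_subproblems(subproblems):
    # Subproblem-decomposition probe: pose each decomposed piece separately.
    return [f"Subproblem {i + 1}: {q}" for i, q in enumerate(subproblems)]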
Our key innovation is a two-stage curriculum approach:
- Stage 1: Train on the full dataset to build a strong foundation
- Stage 2: Continue training on the hardest problems augmented with partial solution steps
🔵 Original Problem:
One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer n such that: 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.
🎯 Augmented with Partial Solution:
One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer n such that: 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.
Taking the given equation modulo 2, 3, and 5, respectively, we have: nβ΅ β‘ 0 (mod 2), nβ΅ β‘ 0 (mod 3), nβ΅ β‘ 4 (mod 5)
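Mechanically, the augmentation just appends the first steps of a reference solution to the problem statement. A minimal sketch of that construction (the real preparation logic lives in scripts/data/prepare_stage_two_data.py and may differ):
# Minimal sketch of augmenting a hard problem with a partial solution.
# The real logic lives in scripts/data/prepare_stage_two_data.py and may differ.
def augment_with_partial_solution(problem, solution_steps, k=1):
    # Keep only the first k steps so the model must finish the reasoning itself.
    partial = "\n".join(solution_steps[:k])
    return f"{problem}\n\n{partial}"

problem = ("One of Euler's conjectures was disproved in the 1960s by three American "
           "mathematicians when they showed there was a positive integer n such that: "
           "133^5 + 110^5 + 84^5 + 27^5 = n^5. Find the value of n.")
steps = ["Taking the given equation modulo 2, 3, and 5, respectively, we have: "
         "n^5 ≡ 0 (mod 2), n^5 ≡ 0 (mod 3), n^5 ≡ 4 (mod 5)"]
print(augment_with_partial_solution(problem, steps))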
- Release test sets (ETA: July 13, 2025)
- Provide additional evaluation scripts for fine-grained analysis
If you encounter any problems, have questions, or would like to contribute to the project, please feel free to:
- Open an issue on our GitHub repository
- Contact us directly at [email protected]
We welcome contributions, bug reports, and feature requests from the community!
If you find this work useful, please consider citing:
@misc{wang2025accuracydissectingmathematicalreasoning,
title={Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning},
author={Jiayu Wang and Yifei Ming and Zixuan Ke and Caiming Xiong and Shafiq Joty and Aws Albarghouthi and Frederic Sala},
year={2025},
eprint={2506.04723},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.04723},
}
This project is licensed under the MIT License - see the LICENSE file for details.
- 📄 Paper: Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
- 🌐 Project Page: https://sparkle-reasoning.github.io/
- 🤗 Models: https://huggingface.co/sparkle-reasoning/models
- 🤗 Datasets: https://huggingface.co/sparkle-reasoning/datasets