
SPARKLE: Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

arXiv · Website · Models · Datasets · License

[SPARKLE teaser figure]

SPARKLE is a fine-grained framework for evaluating LLM reasoning improvements under reinforcement learning (RL). It analyzes models along three key axes: plan-following and execution, knowledge utilization, and subproblem decomposition. We also study problem difficulty, and find that hard problems remain valuable for RL training when appropriately structured with partial solution steps.

🔥 Key Insights

💡 Hard Problems Are Still Valuable

Contrary to common belief, hard problems can be effective for RL training when augmented with partial solution steps. Our curriculum-style approach shows that continuing training on the hardest problems, augmented with partial solutions, leads to the best performance.

💡 RL Enhances Internal Strategy Formation

RL-tuned models don't just execute external plans better; they formulate and follow internal strategies better suited to their reasoning processes. Providing explicit step-by-step plans surprisingly degrades performance on challenging benchmarks, but RL-tuned models show greater robustness.

💡 Better Knowledge Integration

RL significantly enhances the model's capacity to integrate provided knowledge into its reasoning process, leading to consistent performance improvements across diverse mathematical tasks and difficulty levels.

📊 Results

Model | AIME | AMC | MATH500 | GSM8K | OlympiadBench | Avg.
Qwen-2.5-Math-7B-Base | 16.67 | 42.50 | 44.03 | 42.53 | 28.65 | 35.23
SparkleRL-Stage 1 | 46.67 (↑30.00) | 67.50 (↑25.00) | 80.00 (↑35.97) | 91.77 (↑49.24) | 39.11 (↑10.46) | 65.01
SparkleRL-Stage 2 (Aug) | 50.42 (↑33.75) | 71.25 (↑28.75) | 81.00 (↑36.97) | 92.38 (↑49.85) | 40.11 (↑11.46) | 67.03

Table: Avg@8 performance across benchmarks. Stage 2 (Aug) uses our curriculum-style training with augmented hard problems.

🚀 Quick Start

Installation

# Create and activate conda environment
conda create -n sparkle python==3.12
conda activate sparkle

# Install PyTorch and Flash Attention
pip3 install torch==2.4.0 
pip3 install flash-attn --no-build-isolation

# Install VERL and dependencies
cd verl
pip3 install -e .
pip install wandb IPython matplotlib
pip install vertexai latex2sympy2
pip3 install -U antlr4-python3-runtime==4.9.3

Prepare Datasets

# Generate parquet files in data/*.parquet
python scripts/data/prepare_stage_one_data.py
python scripts/data/prepare_stage_two_data.py --aug_version all  # Recommended based on our ablation studies
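
To confirm the preparation scripts produced what you expect, you can peek at one of the generated parquet files with pandas. The file name and columns below are assumptions for illustration; check data/ for the actual output names.

# inspect_data.py -- quick look at a generated parquet file (file name is illustrative)
import pandas as pd

df = pd.read_parquet("data/stage_one_train.parquet")  # replace with an actual file from data/
print(df.shape)
print(df.columns.tolist())  # prompt/answer fields produced by the prep scripts
print(df.head(2))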

Training

# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS

# Stage 1: Foundation RL training on full dataset
export PATH_TO_BASE_MODEL="Qwen/Qwen2.5-Math-7B"
./scripts/train/stage_one.sh --model $PATH_TO_BASE_MODEL

# Stage 2: Curriculum-style training with augmented hard problems (recommended)
export PATH_TO_STAGE_ONE_MODEL="/path/to/your/stage1/checkpoint"
./scripts/train/stage_two_aug.sh --model $PATH_TO_STAGE_ONE_MODEL

Note: Stage 2 training uses the spk_h_aug reward type, which handles the partial-solution format of augmented problems. This is crucial for the curriculum-style training approach.
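
For intuition, an augmented example simply attaches the first steps of a reference solution to a hard problem. The helper below is an illustrative sketch, not the repository's actual augmentation code; the step-selection rule is an assumption.

# augment.py -- illustrative construction of a partially solved ("augmented") problem
def augment_with_partial_solution(problem: str, solution_steps: list[str], k: int = 1) -> str:
    """Append the first k reference solution steps to a hard problem (hypothetical helper)."""
    partial = "\n".join(solution_steps[:k])
    return f"{problem}\n\n{partial}"

prompt = augment_with_partial_solution(
    "Find the positive integer n such that 133^5 + 110^5 + 84^5 + 27^5 = n^5.",
    ["Taking the equation modulo 2, 3, and 5 gives n^5 ≡ 0, 0, and 4 respectively."],
)
print(prompt)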

Evaluation

# Step 1: Convert FSDP checkpoint to HuggingFace format (if using your own checkpoints)
python eval/fsdp2hf.py \
    --fsdp_path /path/to/checkpoint/actor \
    --base_model Qwen/Qwen2.5-Math-7B \
    --output_path /path/to/output

# Step 2: Set up evaluation environment
cd eval/lm-evaluation-harness
pip install -e .

# Step 3: Run comprehensive evaluation across all benchmarks
export PATH_TO_STAGE_ONE_MODEL="/path/to/stage1/model"
export PATH_TO_STAGE_TWO_MODEL="/path/to/stage2/model"
./scripts/eval/eval_all_vllm.sh

Tip: You can also directly use our pre-trained checkpoints from HuggingFace instead of converting your own FSDP checkpoints.
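
Once converted (or downloaded), a checkpoint loads like any Hugging Face causal LM. The model ID below is a placeholder, not a confirmed repository name; substitute the released SparkleRL checkpoint you want to use.

# load_model.py -- load a converted or released checkpoint (model ID is a placeholder)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sparkle-reasoning/SparkleRL-7B-Stage2"  # placeholder; use the actual released name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "Find the positive integer n such that 133^5 + 110^5 + 84^5 + 27^5 = n^5."
inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))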

🤗 Model Checkpoints

We release our checkpoints on HuggingFace.

📚 Datasets

Our curated datasets are available on HuggingFace:

Training Data

  • sparkle-reasoning/dsr40k - Large-scale training data (40.3k problems) used for stage one foundation training
  • sparkle-reasoning/hardmath - Challenging mathematical problems (6.5k) used for stage two curriculum training: questions the stage one model fails to answer, filtered with rigorous label cleaning (a loading sketch follows this list)
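
A minimal loading sketch (the split and field names are assumptions; consult the dataset cards for the exact schema):

# load_data.py -- load the released training data from the Hugging Face Hub
from datasets import load_dataset

dsr40k = load_dataset("sparkle-reasoning/dsr40k", split="train")      # split name assumed
hardmath = load_dataset("sparkle-reasoning/hardmath", split="train")  # split name assumed
print(len(dsr40k), dsr40k.column_names)
print(len(hardmath), hardmath.column_names)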

Evaluation Benchmarks

  • AIME 2024, AMC 2023, MATH500, GSM8K, OlympiadBench - Standard mathematical reasoning evaluation sets

🏗️ Framework Overview

The SPARKLE framework evaluates mathematical reasoning along three dimensions (an illustrative probe-construction sketch follows the list):

  1. Plan-Following and Execution: How well models follow and execute reasoning plans
  2. Knowledge Utilization: Ability to integrate external knowledge into reasoning
  3. Subproblem Decomposition: Capacity to solve decomposed subproblems
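
As a rough illustration of how such probes could be built from a single problem (these helpers and templates are assumptions, not the framework's actual prompts):

# probes.py -- illustrative construction of the three probe types for one problem
def plan_probe(problem: str, plan: list[str]) -> str:
    """Plan-following probe: ask the model to execute an explicit step-by-step plan."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(plan))
    return f"{problem}\n\nFollow this plan:\n{steps}"

def knowledge_probe(problem: str, fact: str) -> str:
    """Knowledge-utilization probe: supply a relevant fact alongside the problem."""
    return f"{problem}\n\nUseful fact: {fact}"

def subproblem_probes(subproblems: list[str]) -> list[str]:
    """Decomposition probe: pose each subproblem separately."""
    return list(subproblems)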

Curriculum-Style Training

Our key innovation is a two-stage curriculum approach:

  1. Stage 1: Train on the full dataset to build a strong foundation
  2. Stage 2: Continue training on the hardest problems augmented with partial solution steps

Example: Augmented Hard Problem

🔵 Original Problem:

One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer n such that 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.

🎯 Augmented with Partial Solution:

One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer n such that 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.

Taking the given equation modulo 2, 3, and 5, respectively, we have: n⁵ ≡ 0 (mod 2), n⁵ ≡ 0 (mod 3), n⁵ ≡ 4 (mod 5).
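
These congruences already pin down n modulo 30: n is divisible by 2 and 3, and since n⁵ ≡ n (mod 5) by Fermat's little theorem, n ≡ 4 (mod 5), so n ≡ 24 (mod 30). Because the left-hand side exceeds 133⁵, the first candidate is 144, which a few lines of Python confirm:

# verify_euler.py -- check the partial-solution reasoning and the final answer
target = 133**5 + 110**5 + 84**5 + 27**5

# n ≡ 0 (mod 2), n ≡ 0 (mod 3), n ≡ 4 (mod 5)  =>  n ≡ 24 (mod 30), and n > 133
n = 144
assert n % 30 == 24
assert n**5 == target
print("n =", n)  # 144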

🔧 TODOs

  • Release test sets - ETA by July 13, 2025
  • Provide additional evaluation scripts for fine-grained analysis

🐛 Issues & Support

If you encounter any problems, have questions, or would like to contribute to the project, please feel free to open an issue or a pull request on this repository.

We welcome contributions, bug reports, and feature requests from the community!

📖 Citation

If you find this work useful, please consider citing:

@misc{wang2025accuracydissectingmathematicalreasoning,
    title={Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning}, 
    author={Jiayu Wang and Yifei Ming and Zixuan Ke and Caiming Xiong and Shafiq Joty and Aws Albarghouthi and Frederic Sala},
    year={2025},
    eprint={2506.04723},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2506.04723}, 
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
