- Uses LLaMA (Large Language Model Meta AI) as the base model
- Wrapped in an `AdvancedModel` class for easier handling
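
The wrapper's internals aren't spelled out here; below is a minimal sketch of what `AdvancedModel` might look like, assuming a Hugging Face `transformers` causal-LM backbone (the class name comes from the repo, everything inside it is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class AdvancedModel(torch.nn.Module):
    """Thin wrapper around a causal LM: bundles the tokenizer and a
    convenience generation helper (contents are illustrative)."""

    def __init__(self, model_name: str = "meta-llama/Llama-2-7b-hf",
                 device: str = "cuda"):
        super().__init__()
        # The checkpoint name is a placeholder; the doc only says LLaMA is the base.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
        self.device = device

    def forward(self, input_ids, attention_mask=None):
        # Return raw logits so the RL losses below can compute log-probs.
        return self.model(input_ids=input_ids, attention_mask=attention_mask).logits

    @torch.no_grad()
    def generate_text(self, prompt: str, max_new_tokens: int = 512) -> str:
        enc = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        out = self.model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=True)
        # Strip the prompt tokens and return only the continuation.
        return self.tokenizer.decode(out[0, enc["input_ids"].shape[1]:],
                                     skip_special_tokens=True)
```
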
- Two-stage approach:
  - Policy Initialization (Stage I)
  - Multi-Turn Reinforcement Learning (Stage II)
- Task-specific reward functions:
  - Math: Symbolic equation checking
  - Code: Safe execution and test case validation
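
A hedged sketch of the two reward functions: symbolic equality via `sympy` for math, and test-case execution for code (function names and signatures are illustrative, not the repo's API):

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr


def math_reward(predicted: str, reference: str) -> float:
    """1.0 if the predicted expression is symbolically equal to the reference."""
    try:
        diff = sympy.simplify(parse_expr(predicted) - parse_expr(reference))
        return 1.0 if diff == 0 else 0.0
    except Exception:
        return 0.0  # unparsable answers earn no reward


def code_reward(generated_code: str, test_cases: list[str]) -> float:
    """Fraction of test cases that pass when run against the generated code.
    In practice execution should go through the sandboxed runner shown later."""
    passed = 0
    for test in test_cases:
        namespace: dict = {}
        try:
            exec(generated_code, namespace)   # define the candidate solution
            exec(test, namespace)             # run the assert-style test case
            passed += 1
        except Exception:
            pass
    return passed / max(len(test_cases), 1)
```
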
- A KL divergence penalty maintains similarity to a frozen reference model
- Prevents drastic departures from the initial policy
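
A minimal sketch of how that penalty can be computed, as a mean per-token KL between the policy and the frozen reference (the helper name is an assumption):

```python
import torch
import torch.nn.functional as F


def kl_penalty(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(policy || reference) over logits shaped
    (batch, seq_len, vocab). Keeps the policy near the frozen reference."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    kl = torch.sum(policy_logprobs.exp() * (policy_logprobs - ref_logprobs), dim=-1)
    return kl.mean()
```
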
- Stage I: Generate initial attempts
- Compute rewards
- Apply KL divergence penalty
- Update model parameters
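
Putting the four Stage-I steps above together, a simplified single-prompt REINFORCE-style update that reuses `kl_penalty` from the previous sketch (the function name, β weight, and sampling settings are assumptions):

```python
import torch


def stage_one_step(model, ref_model, tokenizer, prompt, reward_fn,
                   optimizer, beta: float = 0.1):
    """One simplified Stage-I update: sample an attempt, score it,
    and penalize drift away from the frozen reference model."""
    enc = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        seq = model.model.generate(**enc, max_new_tokens=256, do_sample=True)
    attempt = tokenizer.decode(seq[0, enc["input_ids"].shape[1]:],
                               skip_special_tokens=True)
    reward = reward_fn(attempt)

    logits = model(seq)                      # policy logits, shape (1, T, V)
    with torch.no_grad():
        ref_logits = ref_model(seq)          # frozen reference logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)

    # REINFORCE objective plus the KL penalty from the sketch above.
    # (A fuller implementation would mask out the prompt tokens.)
    loss = -reward * token_lp.mean() + beta * kl_penalty(logits, ref_logits)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward, loss.item()
```
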
- Stage II: Generate first attempt
- Create prompt for correction
- Generate second attempt
- Compute rewards for both attempts
- Apply reward shaping
- Compute KL divergence
- Update model parameters
- Bonus for improvement: `α * (reward_second - reward_first)`
- Encourages self-correction behavior
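
Putting the Stage-II steps and the shaping bonus together, a hedged sketch of one multi-turn update. It reuses the `AdvancedModel.generate_text` and `kl_penalty` helpers sketched earlier; the correction-prompt wording, the α and β values, and the choice to add the bonus to the second-attempt reward are all assumptions:

```python
import torch


def stage_two_step(model, ref_model, tokenizer, prompt, reward_fn,
                   optimizer, alpha: float = 2.0, beta: float = 0.1):
    """One simplified multi-turn update: attempt, self-correct, score both
    attempts, shape the reward, and update with a KL penalty."""
    # 1) First attempt.
    first = model.generate_text(prompt)
    # 2) Correction prompt (wording is an assumption, not the repo's template).
    correction_prompt = (
        f"{prompt}\n\nPrevious attempt:\n{first}\n\n"
        "There may be an error in the attempt above. Please provide a corrected answer."
    )
    # 3) Second attempt.
    second = model.generate_text(correction_prompt)
    # 4) Rewards for both attempts.
    r1, r2 = reward_fn(first), reward_fn(second)
    # 5) Reward shaping: second-attempt reward plus the improvement bonus.
    shaped = r2 + alpha * (r2 - r1)

    # 6)-7) Policy-gradient loss on the second attempt plus the KL penalty.
    seq = tokenizer(correction_prompt + second,
                    return_tensors="pt").to(model.device)["input_ids"]
    logits = model(seq)
    with torch.no_grad():
        ref_logits = ref_model(seq)
    token_lp = torch.log_softmax(logits[:, :-1], dim=-1).gather(
        -1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
    loss = -shaped * token_lp.mean() + beta * kl_penalty(logits, ref_logits)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return r1, r2, loss.item()
```

With α > 0, fixing a wrong first attempt pays more than merely repeating a correct one, which is what nudges the policy toward genuine self-correction rather than answer copying.
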
- Runs generated code in a separate thread
- Implements timeout mechanism for safety
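
A sketch of that sandboxed runner, assuming a thread-pool worker and a result timeout (the function name and the 5-second default are illustrative):

```python
import concurrent.futures


def run_code_safely(code: str, test: str, timeout_s: float = 5.0) -> bool:
    """Execute generated code plus one test case on a worker thread and give
    up after `timeout_s` seconds. Python threads cannot be killed, so a
    timed-out task keeps running in the background; a subprocess-based
    sandbox would be stricter."""
    def _target() -> bool:
        namespace: dict = {}
        exec(code, namespace)   # define the candidate solution
        exec(test, namespace)   # run the assert-style test case
        return True

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(_target).result(timeout=timeout_s)
    except Exception:           # timeout, syntax error, failed assertion, ...
        return False
    finally:
        pool.shutdown(wait=False)
```
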
- Uses AdamW optimizer
- Linear learning rate schedule with warmup
- Gradient accumulation for effective larger batch sizes
- Mixed precision training option for efficiency
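
A sketch of that optimization setup: `torch.optim.AdamW`, the `transformers` linear warmup/decay schedule, gradient accumulation, and optional AMP (all hyperparameter values below are placeholders):

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_linear_schedule_with_warmup


def build_optimization(model, num_training_steps: int, lr: float = 1e-5,
                       warmup_steps: int = 100, use_amp: bool = True):
    """AdamW + linear warmup/decay schedule + AMP scaler."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps,
                                                num_training_steps)
    scaler = GradScaler(enabled=use_amp)
    return optimizer, scheduler, scaler


def train_epoch(model, batches, loss_fn, optimizer, scheduler, scaler,
                accum_steps: int = 4, use_amp: bool = True):
    """Gradient accumulation: scale each loss by 1/accum_steps and only step
    the optimizer every accum_steps micro-batches."""
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        with autocast(enabled=use_amp):
            loss = loss_fn(model, batch) / accum_steps
        scaler.scale(loss).backward()
        if (i + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()
```
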
- Accuracy@t1: First attempt accuracy
- Accuracy@t2: Second attempt accuracy
- Δ(t1,t2): Net accuracy improvement from the first to the second attempt
- Δ_i→c(t1,t2): Fraction of problems that flip from incorrect to correct
- Δ_c→i(t1,t2): Fraction of problems that flip from correct to incorrect
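
These metrics follow directly from per-problem correctness flags for the two attempts; a small sketch of how they might be computed (dictionary keys are illustrative):

```python
def self_correction_metrics(first_correct: list[bool],
                            second_correct: list[bool]) -> dict:
    """Compute the evaluation metrics from per-problem correctness of the
    first and second attempts (assumes a non-empty evaluation set)."""
    n = len(first_correct)
    acc_t1 = sum(first_correct) / n
    acc_t2 = sum(second_correct) / n
    i_to_c = sum((not a) and b for a, b in zip(first_correct, second_correct)) / n
    c_to_i = sum(a and (not b) for a, b in zip(first_correct, second_correct)) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        "delta(t1,t2)": acc_t2 - acc_t1,   # net improvement
        "delta_i->c(t1,t2)": i_to_c,       # fixed by self-correction
        "delta_c->i(t1,t2)": c_to_i,       # broken by self-correction
    }
```
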
- Custom datasets for MATH, MBPP, and HumanEval tasks
- Dynamic batch preparation based on task type
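
A hedged sketch of a task-aware dataset wrapper; the field names and the task-type switch are assumptions about the repo's schema, and a task-specific `collate_fn` would pair with it for batching:

```python
from torch.utils.data import Dataset


class SelfCorrectionDataset(Dataset):
    """Wraps MATH / MBPP / HumanEval examples in a common format.
    Field names here are illustrative, not the repo's actual schema."""

    def __init__(self, examples: list[dict], task_type: str):
        self.examples = examples
        self.task_type = task_type   # "math" or "code"

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> dict:
        ex = self.examples[idx]
        if self.task_type == "math":
            return {"prompt": ex["problem"], "reference": ex["solution"]}
        # Code tasks carry test cases instead of a reference answer.
        return {"prompt": ex["prompt"], "tests": ex["test_cases"]}
```
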
- Training reward history plotting
- Edit distance ratio visualization between attempts
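
A sketch of the two plots, assuming `matplotlib` for the curves and `difflib.SequenceMatcher` as the similarity measure behind the "edit distance ratio" (names and layout are illustrative):

```python
import difflib
import matplotlib.pyplot as plt


def plot_training_curves(reward_history: list[float],
                         first_attempts: list[str],
                         second_attempts: list[str]) -> None:
    """Left: mean reward per training step. Right: histogram of similarity
    ratios between first and second attempts (a ratio near 1.0 means the
    model barely changed its answer)."""
    ratios = [difflib.SequenceMatcher(None, a, b).ratio()
              for a, b in zip(first_attempts, second_attempts)]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(reward_history)
    ax1.set_xlabel("training step")
    ax1.set_ylabel("mean reward")
    ax1.set_title("Training reward history")

    ax2.hist(ratios, bins=20, range=(0.0, 1.0))
    ax2.set_xlabel("SequenceMatcher ratio (attempt 1 vs attempt 2)")
    ax2.set_ylabel("count")
    ax2.set_title("Edit distance ratio between attempts")

    fig.tight_layout()
    plt.show()
```
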