idea: Precision scaling research #127

Open
hahuyhoang411 opened this issue Nov 20, 2024 · 7 comments
@hahuyhoang411
Contributor

hahuyhoang411 commented Nov 20, 2024

Problem Statement

Hypothesis: Increasing numerical precision during training can improve the performance of small language models (≈1B parameters), potentially enabling them to achieve capabilities comparable to larger models (3B-7B parameters).
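One mechanism that could make precision matter is sketched below. It is an illustration, not a finding from these runs. bf16 keeps only 8 significant bits, so a small weight update can round away entirely if the weights are stored and updated directly in bf16. The helper names `to_bf16` and `to_fp32` are hypothetical, the weight and update values are made up for the example, and the rounding emulation ignores NaN and infinity.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bf16 value (round-to-nearest-even).

    bf16 is the upper 16 bits of the fp32 bit pattern, so we round the
    fp32 bits to a multiple of 2**16. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias that implements round-to-nearest-even on the dropped 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Illustrative values only: a weight of 1.0 and an update of lr * grad = 3e-4.
weight = 1.0
update = 3e-4

bf16_result = to_bf16(weight - update)
fp32_result = to_fp32(weight - update)

print(bf16_result == weight)  # True: the update is lost in bf16
print(fp32_result == weight)  # False: the update survives in fp32
```

Near 1.0 the gap between adjacent bf16 values is on the order of 2^-8 to 2^-7, far larger than an update of 3e-4, while fp32 resolves steps of about 1e-7. Whether this effect explains any of the results in this thread is an open question.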

Implications

If validated, this hypothesis could:

  • Reduce the computational resources needed for training effective language models
  • Enable broader adoption of smaller, more efficient models
  • Lead to new approaches in optimizer design and implementation

Idea

Reference: https://arxiv.org/pdf/2411.04330

@hahuyhoang411 hahuyhoang411 added the type: idea Research, data, any new ideas label Nov 20, 2024
@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

@bachvudinh, as co-author, please help me add some of the exploded attempts and the MMLU scores.

We ran some tests training Llama 3.2 1B Instruct to check whether fp32 can outperform bf16.

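| Precision | Learning rate | Weight decay | Global batch size | Trained samples | Final loss | MMLU |
|---|---|---|---|---|---|---|
| fp32 | 3e-4 | 0.01 | 96 | 0.2M | 1.24 | |
| fp32 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.22 | |
| bf16 | 3e-4 | 0.01 | 96 | 0.2M | exploded | |
| bf16 | 2e-4 | 0.01 | 96 | 0.2M | exploded | |
| bf16 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.26 | |
| fp32 | 3e-4 | 0.2 | ? | 0.5M | 0.67 | 25.54 |
| fp32 | 1e-4 | 0.05 | ? | 1.7M | 1.32 | 23.18 |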
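For reference, an "exploded" run could be flagged automatically from its logged losses. The sketch below is a hypothetical helper, not the criterion used for the runs in this thread. It reports the first step where the loss is non-finite or jumps above a multiple of the recent average. The name `first_explosion_step`, the `window` and `factor` defaults, and the sample loss values are all assumptions chosen for illustration.

```python
import math
from typing import Optional, Sequence


def first_explosion_step(
    losses: Sequence[float], window: int = 5, factor: float = 3.0
) -> Optional[int]:
    """Return the first index where the loss looks diverged, else None.

    A step counts as diverged if the loss is NaN or infinite, or if it
    exceeds `factor` times the mean of the previous `window` losses.
    """
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        if i >= window:
            baseline = sum(losses[i - window:i]) / window
            if loss > factor * baseline:
                return i
    return None


# Made-up loss curves, not data from the runs above.
stable = [2.0, 1.8, 1.6, 1.5, 1.4, 1.3]
spiked = [2.0, 1.8, 1.6, 1.5, 1.4, 9.0]

print(first_explosion_step(stable))  # None
print(first_explosion_step(spiked))  # 5
```

A threshold like this is crude: a large `factor` misses slow divergence and a small one flags ordinary noise, so the values would need tuning against real loss curves.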

For the fp32 training config with a learning rate of 1e-4 and a weight decay of 0.05, the MMLU results at checkpoint steps 1000, 2000, and 3000 look odd:

  • step 1000:
    Screenshot from 2024-11-19 10-42-42

  • step 2000:
    Screenshot from 2024-11-19 10-56-42

  • step 3000:
    Screenshot from 2024-11-19 11-04-04

@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

fp32 3e-4
Screenshot 2024-11-21 at 01 09 02

@hahuyhoang411
Contributor Author

fp32 2.5e-4
Screenshot 2024-11-21 at 01 09 49

@hahuyhoang411
Contributor Author

bf16 2.5e-4
Screenshot 2024-11-21 at 01 10 39

@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

fp32 0.5M
Screenshot 2024-11-21 at 01 19 44

@hahuyhoang411
Contributor Author

fp32 1.7M
Screenshot 2024-11-21 at 01 20 14

@tikikun
Collaborator

tikikun commented Nov 21, 2024

A few pending issues:

  • Even though fp32 training is more stable, it is clearly not converging and still has hiccups. This suggests the issue is in the optimizer itself, which cannot give a good enough direction for the optimization process
  • Since we are currently hitting a wall on the optimizer itself, we will not continue scaling up the precision

Next steps:

  • @tikikun to do his own research on the optimizer
  • We will use the cluster to train Qwen 32B Instruct

cc @0xSage in case you are interested

@tikikun tikikun assigned tikikun and unassigned hahuyhoang411 and bachvudinh Nov 21, 2024
@hiento09 hiento09 added this to Menlo Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024
@tikikun tikikun moved this from Investigating to Icebox in Menlo Nov 25, 2024
@bachvudinh bachvudinh added this to the Icebox milestone Nov 25, 2024
Projects
Status: Icebox

3 participants