
task: Ichigo LLM v0.5 Training #122

Open · 2 of 5 tasks · Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 11 comments
Labels: P1: important (Important feature / fix)

hahuyhoang411 (Contributor) commented Nov 19, 2024

Goal

Experiment with the Ichigo model on multilingual data.

Tasklist

@hahuyhoang411 hahuyhoang411 changed the title Ichigo Training (Issue: ) task: Ichigo Training Nov 19, 2024
@hahuyhoang411 hahuyhoang411 added the P1: important Important feature / fix label Nov 19, 2024
hahuyhoang411 (Contributor, Author) commented:

LoRA for continued pretraining, ref: https://unsloth.ai/blog/contpretraining
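
A minimal sketch of what a LoRA continued-pretraining setup along these lines could look like with PEFT; the rank, target modules, and the choice to also train the embeddings and LM head are assumptions in the spirit of the linked post, not the exact configuration used here.

```python
# Hedged sketch: LoRA for continued pretraining with PEFT.
# Rank, alpha, and module choices are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=128,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # For continued pretraining with new tokens, the embedding matrix and LM head
    # are typically made trainable as well.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```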

@hiento09 hiento09 added this to Menlo Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Nov 25, 2024
dan-menlo (Contributor) commented:

@hahuyhoang411 - Can we combine this into #116?

@hahuyhoang411 hahuyhoang411 changed the title task: Ichigo Training task: Ichigo LLM v0.5 Training Dec 1, 2024
@hahuyhoang411 hahuyhoang411 moved this from In Progress to Scheduled in Menlo Dec 1, 2024
bachvudinh (Contributor) commented Jan 8, 2025

Pretraining phase of Ichigo v0.5

Methodology

  • Pretraining the Llama 3.1 8B model with an expanded vocabulary of 2,560 sound semantic tokens and 50 duration tokens (see the vocabulary-expansion sketch after this list).

  • Data Methodology:

    Datasets (3.392M total samples):

    • 880k samples from Vivoice
    • 112k samples from Libris Clean
    • 2.4M samples from MLS Eng 10k

    Synthetic Data Pipeline:

    • We developed an efficient pipeline using vLLM and a custom Text-to-Semantic model (based on LLaMA 3.2 1B).
    • The pipeline uses Ray for distributed processing, running vLLM instances across multiple GPUs (see the sketch after this list).

    Performance:

    • Thanks to the model's minimal VRAM requirements, the pipeline achieves high throughput, generating 250k samples in just 20-30 minutes.
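
For reference, the vocabulary expansion can be sketched with the Hugging Face API as below; the token naming scheme (`<|sound_XXXX|>`, `<|duration_XX|>`) is an illustrative assumption, not the exact tokens used in the run.

```python
# Hedged sketch: extend a Llama tokenizer with sound-semantic and duration tokens,
# then resize the embedding matrix. Token names are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

sound_tokens = [f"<|sound_{i:04d}|>" for i in range(2560)]      # 2,560 semantic tokens
duration_tokens = [f"<|duration_{i:02d}|>" for i in range(50)]  # 50 duration tokens
tokenizer.add_tokens(sound_tokens + duration_tokens, special_tokens=True)

# New rows are appended to the embedding and LM-head matrices.
model.resize_token_embeddings(len(tokenizer))
```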
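The Ray + vLLM fan-out described above can be sketched as below; the checkpoint path, GPU count, and sampling settings are illustrative assumptions, not the production pipeline.

```python
# Hedged sketch: distributed text-to-semantic generation with Ray actors,
# each pinning one vLLM instance to a single GPU.
import ray
from vllm import LLM, SamplingParams


@ray.remote(num_gpus=1)
class T2SWorker:
    def __init__(self, model_path: str):
        # One vLLM instance per GPU.
        self.llm = LLM(model=model_path, dtype="bfloat16")
        self.params = SamplingParams(temperature=0.0, max_tokens=512)

    def generate(self, prompts: list[str]) -> list[str]:
        outputs = self.llm.generate(prompts, self.params)
        return [o.outputs[0].text for o in outputs]


if __name__ == "__main__":
    ray.init()
    model_path = "path/to/text-to-semantic-llama-3.2-1b"  # hypothetical checkpoint path
    workers = [T2SWorker.remote(model_path) for _ in range(6)]  # e.g. 6 GPUs

    prompts = ["Hello, how are you?"] * 6_000
    shards = [prompts[i::len(workers)] for i in range(len(workers))]
    results = ray.get([w.generate.remote(s) for w, s in zip(workers, shards)])
```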

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 2e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 512 |
| Precision | bf16 |
| Compile | True |
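
For readers reproducing this, the table maps roughly onto Hugging Face `TrainingArguments` as in the hedged sketch below; the actual training stack used for this run is not shown in this issue, so treat the field choices (especially the per-device batch split) as assumptions.

```python
# Rough, hedged mapping of the pretraining hyperparameters above onto
# Hugging Face TrainingArguments; the team's actual trainer is not specified here.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ichigo-v0.5-pretrain",   # hypothetical output path
    num_train_epochs=1,
    # Choose per-device batch and accumulation so that
    # per_device * grad_accum * num_gpus == 256 (values below assume 8 workers).
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.01,
    max_grad_norm=1.0,                   # gradient clipping
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    bf16=True,
    torch_compile=True,
)
# Max length (512) is enforced at tokenization / collation time, not here.
```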

Results

  • Time: took ~40 hours on 6xA6000.
  • Loss: ~2.8 after training finished.

Learnings

  • The loss did not go as low as when training on normal sound-token data, but it is acceptable.

Quicklinks

bachvudinh (Contributor) commented Jan 8, 2025

Instruction tuning phase of Ichigo v0.5.

Methodology

  • Instruction fine-tuning the pretrained checkpoint.
  • Data methodology: same as Ichigo v0.4, but with the noise-rejection data removed.

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 3e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 4096 |
| Precision | bf16 |
| Compile | True |

Results

  • Time: took ~5 hours on 8xH100 (RunPod).

  • Loss Curve:

    [image: loss curve]

  • Instruction-Following Evaluation

    | Benchmark Category | Metric | Score |
    | --- | --- | --- |
    | Text-only | MMLU | 61.74 |
    | Text-only | MMLU Pro | - |
    | Text-only | VMLU | - |
    | Audio-Bench | Alpaca (Instruction Speech), GPT-4 judge | - |
    | Audio-Bench | Open-hermes (Instruction Speech), GPT-4 judge | - |
    | ASR | WER | Not benchmarked yet due to early stopping |

Learnings

  • The loss declined significantly and stabilized between 0.78 and 0.8. This may be due to the larger codebook size of the Ichigo quantizer. Beyond the loss, the model demonstrates strong instruction-following capabilities in both Vietnamese and English.
  • However, due to quality limitations in our Vietnamese training data, the model's responses tend to be brief and lack depth, so I opted for early stopping at step 2000.

Quicklinks

bachvudinh (Contributor) commented Jan 8, 2025

Data Quality Issue Resolution and Pipeline Update

Issue

The Text-to-Semantic model was undertrained on English, causing noise in a small fraction of the output data.

Analysis

  • The instruction-speech-v1 and instruction-speech-v2 subsets showed the highest percentage of affected data.
  • Used the old data pipeline to regenerate the data: Text → Audio (WhisperSpeech TTS) → Semantic tokens (Ichigo Tokenizer).

Resolution

  • Filtered out the outlier noisy data.
  • Regenerated data for the 2 affected subsets → 4.96k samples.
  • After backfilling, we have a total of 2.769M samples.
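
For reference, the regeneration path (Text → Audio → Semantic tokens) looks roughly like the sketch below; the two helpers are stubs standing in for the actual WhisperSpeech TTS and Ichigo tokenizer APIs, which are not shown in this issue.

```python
# Hedged sketch of the text -> audio -> semantic-token regeneration path.
# The two helpers below are stubs for WhisperSpeech TTS and the Ichigo tokenizer.
from typing import List

import numpy as np


def synthesize_speech(text: str) -> np.ndarray:
    """Stub for the WhisperSpeech TTS step (should return a waveform)."""
    raise NotImplementedError("replace with the actual WhisperSpeech call")


def encode_semantic_tokens(audio: np.ndarray) -> List[int]:
    """Stub for the Ichigo tokenizer step (should return sound semantic token ids)."""
    raise NotImplementedError("replace with the actual Ichigo tokenizer call")


def regenerate_subset(texts: List[str]) -> List[List[int]]:
    # Re-run the old pipeline only for the affected samples.
    return [encode_semantic_tokens(synthesize_speech(t)) for t in texts]
```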

Quick link

hahuyhoang411 (Contributor, Author) commented:

Can you add an example of the affected data, @bachvudinh?

bachvudinh (Contributor) commented Jan 14, 2025

Instruction tuning phase of Ichigo v0.5 (second attempt)

Methodology

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 3e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 4096 |
| Precision | bf16 |
| Compile | True |

Results

  • Time: took ~5 hours on 8xH100 (RunPod).

  • Loss Curve:

    [image: loss curve]

  • Instruction-Following Evaluation

    | Model Name | MMLU | MMLU Pro | VMLU | Alpaca (GPT-4 judge) | Open-hermes (GPT-4 judge) | ASR (WER) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Ichigo v0.5 checkpoint 4000 | 60.61 | - | - | - | - | - |
    | Ichigo v0.5 end of epoch | 62.27 | - | - | 2.93 | 3.28 | - |
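
For context on the GPT-4-judge columns above, a scoring call typically looks like the sketch below; the rubric wording and the 1-5 scale are assumptions, not the exact judge prompt used for these benchmarks.

```python
# Hedged sketch of a GPT-4-as-judge scoring call for instruction-speech outputs.
# The rubric text and scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def judge(instruction: str, model_answer: str) -> str:
    prompt = (
        "Rate the following answer to the instruction on a 1-5 scale for "
        "helpfulness and relevance. Reply with the score only.\n\n"
        f"Instruction: {instruction}\nAnswer: {model_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```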

Learnings

  • The loss declined significantly and stabilized between 0.78-0.8.

Quicklinks

bachvudinh (Contributor) commented Jan 14, 2025

Some very good results from the Ichigo v0.5 checkpoint at step 4000. The model's responses are now clearer and more helpful:

[image: example model response]

hahuyhoang411 (Contributor, Author) commented:

It's also good on my end, but it seems too noise-sensitive.
[Screenshot 2025-01-14 at 19:26:59]
[Screenshot 2025-01-14 at 19:26:28]

bachvudinh (Contributor) commented Jan 15, 2025

Training Issues and Budget Request Report

We are currently training a large language model using 8x H100 GPUs via RunPod. The initial budget estimate was $800 for a 32-hour training run.

Technical Issues Encountered

  • The initial 2000 training steps (~$150) were lost due to the quality of the Viettel x NVIDIA data: https://github.com/janhq/ichigo-internal/issues/8.
  • During training, we encountered an unexpected loss explosion consistently occurring around step 6000. The initial hypothesis focused on learning-rate optimization, leading to multiple experimental runs to tune this parameter. Root-cause analysis eventually identified a corrupted batch in the training data.

Solution

  • Viettel x NVIDIA data issue: temporarily shut down the H100 instances and moved data synthesis to 4xA6000s to minimize costs.
  • Loss explosion: resumed training from checkpoint 5000 after the corrupted data batch was identified and removed.

Financial Impact

| Run | Cost |
| --- | --- |
| Original budget (32 hours) | $800 |
| Viettel x NVIDIA data bug | ~$150 |
| Debugging costs (learning-rate optimization attempts) | ~$250 |
  • Current Status: Training duration has exceeded initial estimates due to necessary debugging and data cleanup.

Additional Budget Request

  • We request a $150 top-up to ensure successful completion of the model training phase:
    • Remaining training time needed: 3-4 hours
    • Current H100 cost rate: $24/hour

tikikun (Collaborator) commented Jan 21, 2025

Issues report

We are facing some issues with Ichigo v0.5; in short, it is really bad:

Issues:

  • The model can no longer follow the system prompt (the continual-pretraining effect seems to have been broken by a configuration change).
  • Speech quality is not good enough (@bachvu02 please link the speech benchmark; the quality degraded, not by much, but still).

Root causes:

  • About 5% of the data has an error (missing <|end_token|>).
  • Configuration change: the learning rate was (possibly) changed due to multiple issues during training, which seems to have damaged the "recovery" process (so the run ends up not as continual learning but as overfitting to the dataset at hand).
  • Data quality is simply poor (we realized this during training but continued anyway).

Mitigation plans:

  • We need CI/CD for mature pipelines (like speech); right now it's mostly manual (see the validation sketch at the end of this comment).
  • Switch to distillation in the future to mitigate the model-drift effect.
  • We should have run this on 8xA6000 rather than on RunPod.
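
On the CI/CD point, a minimal pre-run data check like the hedged sketch below could have caught the missing <|end_token|> rows before training; the JSONL layout and the "text" field name are assumptions about the dataset schema.

```python
# Hedged sketch of a pre-training data check for the missing <|end_token|> issue.
# The JSONL layout and "text" field name are assumptions about the dataset schema.
import json
import sys

END_TOKEN = "<|end_token|>"


def malformed_ratio(path: str) -> float:
    total, bad = 0, 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            total += 1
            if not sample["text"].rstrip().endswith(END_TOKEN):
                bad += 1
    return bad / max(total, 1)


if __name__ == "__main__":
    ratio = malformed_ratio(sys.argv[1])
    print(f"{ratio:.2%} of samples are missing {END_TOKEN}")
    # Fail the CI step if more than 1% of samples are malformed (threshold is illustrative).
    sys.exit(1 if ratio > 0.01 else 0)
```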
