
task: Ichigo LLM v0.5 Training #122

Open · 2 of 5 tasks · Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 11 comments
Labels: P1: important (Important feature / fix)

hahuyhoang411 (Contributor) commented Nov 19, 2024

Goal

Experiment with the Ichigo model on multilingual data.

Tasklist

@hahuyhoang411 hahuyhoang411 changed the title Ichigo Training (Issue: ) task: Ichigo Training Nov 19, 2024
@hahuyhoang411 hahuyhoang411 added the P1: important Important feature / fix label Nov 19, 2024
hahuyhoang411 (Contributor, Author) commented:

LoRA for continued pretraining, ref: https://unsloth.ai/blog/contpretraining
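
A minimal sketch of what a LoRA continued-pretraining setup along these lines could look like with PEFT; the rank, target modules, and the choice to also train the embeddings and LM head are assumptions in the spirit of the linked post, not the exact configuration used here.

```python
# Hedged sketch: LoRA for continued pretraining with PEFT.
# Rank, alpha, and module choices are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=128,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # For continued pretraining with new tokens, the embedding matrix and LM head
    # are typically made trainable as well.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```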

@hiento09 hiento09 added this to Menlo Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Nov 25, 2024
dan-menlo (Contributor) commented:

@hahuyhoang411 - Can we combine this into #116?

@hahuyhoang411 hahuyhoang411 changed the title task: Ichigo Training task: Ichigo LLM v0.5 Training Dec 1, 2024
@hahuyhoang411 hahuyhoang411 moved this from In Progress to Scheduled in Menlo Dec 1, 2024
bachvudinh (Contributor) commented Jan 8, 2025

Pretraining phase of Ichigo v0.5

Methodology

  • Pretraining the Llama 3.1 8B model with an expanded vocabulary of 2,560 sound semantic tokens and 50 duration tokens (see the vocabulary-expansion sketch after this list).

  • Data Methodology:

    Datasets (3.392M total samples):

    • 880k samples from Vivoice
    • 112k samples from Libris Clean
    • 2.4M samples from MLS Eng 10k

    Synthetic Data Pipeline:

    • We developed an efficient pipeline using vLLM and a custom Text-to-Semantic model (based on LLaMA 3.2 1B).
    • The pipeline uses Ray for distributed processing, running vLLM instances across multiple GPUs (see the sketch after this list).

    Performance:

    • Thanks to the model's minimal VRAM requirements, the pipeline achieves high throughput, generating 250k samples in just 20-30 minutes.
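
For reference, the vocabulary expansion can be sketched with the Hugging Face API as below; the token naming scheme (`<|sound_XXXX|>`, `<|duration_XX|>`) is an illustrative assumption, not the exact tokens used in the run.

```python
# Hedged sketch: extend a Llama tokenizer with sound-semantic and duration tokens,
# then resize the embedding matrix. Token names are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

sound_tokens = [f"<|sound_{i:04d}|>" for i in range(2560)]      # 2,560 semantic tokens
duration_tokens = [f"<|duration_{i:02d}|>" for i in range(50)]  # 50 duration tokens
tokenizer.add_tokens(sound_tokens + duration_tokens, special_tokens=True)

# New rows are appended to the embedding and LM-head matrices.
model.resize_token_embeddings(len(tokenizer))
```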
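The Ray + vLLM fan-out described above can be sketched as below; the checkpoint path, GPU count, and sampling settings are illustrative assumptions, not the production pipeline.

```python
# Hedged sketch: distributed text-to-semantic generation with Ray actors,
# each pinning one vLLM instance to a single GPU.
import ray
from vllm import LLM, SamplingParams


@ray.remote(num_gpus=1)
class T2SWorker:
    def __init__(self, model_path: str):
        # One vLLM instance per GPU.
        self.llm = LLM(model=model_path, dtype="bfloat16")
        self.params = SamplingParams(temperature=0.0, max_tokens=512)

    def generate(self, prompts: list[str]) -> list[str]:
        outputs = self.llm.generate(prompts, self.params)
        return [o.outputs[0].text for o in outputs]


if __name__ == "__main__":
    ray.init()
    model_path = "path/to/text-to-semantic-llama-3.2-1b"  # hypothetical checkpoint path
    workers = [T2SWorker.remote(model_path) for _ in range(6)]  # e.g. 6 GPUs

    prompts = ["Hello, how are you?"] * 6_000
    shards = [prompts[i::len(workers)] for i in range(len(workers))]
    results = ray.get([w.generate.remote(s) for w, s in zip(workers, shards)])
```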

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 2e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 512 |
| Precision | bf16 |
| Compile | True |
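
For readers reproducing this, the table maps roughly onto Hugging Face `TrainingArguments` as in the hedged sketch below; the actual training stack used for this run is not shown in this issue, so treat the field choices (especially the per-device batch split) as assumptions.

```python
# Rough, hedged mapping of the pretraining hyperparameters above onto
# Hugging Face TrainingArguments; the team's actual trainer is not specified here.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ichigo-v0.5-pretrain",   # hypothetical output path
    num_train_epochs=1,
    # Choose per-device batch and accumulation so that
    # per_device * grad_accum * num_gpus == 256 (values below assume 8 workers).
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.01,
    max_grad_norm=1.0,                   # gradient clipping
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    bf16=True,
    torch_compile=True,
)
# Max length (512) is enforced at tokenization / collation time, not here.
```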

Results

  • Time: took ~40 hours on 6xA6000.
  • Loss: ~2.8 after training finished.

Learnings

  • The loss did not go as low as when training on normal sound-token data, but it is acceptable.

Quicklinks

bachvudinh (Contributor) commented Jan 8, 2025

Instruction tuning phase of Ichigo v0.5.

Methodology

  • Instruction fine-tuning the pretrained checkpoint.
  • Data methodology: same as Ichigo v0.4, but with the noise-rejection data removed.

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 3e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 4096 |
| Precision | bf16 |
| Compile | True |

Results

  • Time: took ~5 hours on 8xH100 (RunPod).

  • Loss Curve:

    [image: loss curve]

  • Instruction-Following Evaluation

    | Benchmark Category | Metric | Score |
    | --- | --- | --- |
    | Text-only | MMLU | 61.74 |
    | Text-only | MMLU Pro | - |
    | Text-only | VMLU | - |
    | Audio-Bench | Alpaca (Instruction Speech), GPT-4 judge | - |
    | Audio-Bench | Open-hermes (Instruction Speech), GPT-4 judge | - |
    | ASR | WER | Not benchmarked yet due to early stopping |

Learnings

  • The loss declined significantly and stabilized between 0.78 and 0.8. This may be due to the larger codebook size of the Ichigo quantizer. Beyond the loss, the model demonstrates strong instruction-following capabilities in both Vietnamese and English.
  • However, due to quality limitations in our Vietnamese training data, the model's responses tend to be brief and lack depth, so I opted for early stopping at step 2000.

Quicklinks

bachvudinh (Contributor) commented Jan 8, 2025

Data Quality Issue Resolution and Pipeline Update

Issue

The Text-to-Semantic model was undertrained on English, causing noise in a small fraction of the output data.

Analysis

  • The instruction-speech-v1 and instruction-speech-v2 subsets showed the highest percentage of affected data.
  • Used the old data pipeline to regenerate the data: Text → Audio (WhisperSpeech TTS) → Semantic tokens (Ichigo Tokenizer).

Resolution

  • Filtered out the outlier noisy data.
  • Regenerated data for the 2 affected subsets → 4.96k samples.
  • After backfilling, we have a total of 2.769M samples.
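
For reference, the regeneration path (Text → Audio → Semantic tokens) looks roughly like the sketch below; the two helpers are stubs standing in for the actual WhisperSpeech TTS and Ichigo tokenizer APIs, which are not shown in this issue.

```python
# Hedged sketch of the text -> audio -> semantic-token regeneration path.
# The two helpers below are stubs for WhisperSpeech TTS and the Ichigo tokenizer.
from typing import List

import numpy as np


def synthesize_speech(text: str) -> np.ndarray:
    """Stub for the WhisperSpeech TTS step (should return a waveform)."""
    raise NotImplementedError("replace with the actual WhisperSpeech call")


def encode_semantic_tokens(audio: np.ndarray) -> List[int]:
    """Stub for the Ichigo tokenizer step (should return sound semantic token ids)."""
    raise NotImplementedError("replace with the actual Ichigo tokenizer call")


def regenerate_subset(texts: List[str]) -> List[List[int]]:
    # Re-run the old pipeline only for the affected samples.
    return [encode_semantic_tokens(synthesize_speech(t)) for t in texts]
```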

Quick link

hahuyhoang411 (Contributor, Author) commented:

Can you add an example of the affected data, @bachvudinh?

bachvudinh (Contributor) commented Jan 14, 2025

Instruction tuning phase of Ichigo v0.5 (second attempt)

Methodology

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 3e-4 |
| Gradient clipping | 1.0 |
| LR scheduler | Cosine |
| Optimizer | AdamW Fused |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Gradient checkpointing | Full |
| Max length | 4096 |
| Precision | bf16 |
| Compile | True |

Results

  • Time: took ~5 hours on 8xH100 (RunPod).

  • Loss Curve:

    [image: loss curve]

  • Instruction-Following Evaluation

    | Model Name | MMLU | MMLU Pro | VMLU | Alpaca (GPT-4 judge) | Open-hermes (GPT-4 judge) | ASR (WER) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Ichigo v0.5 checkpoint 4000 | 60.61 | - | - | - | - | - |
    | Ichigo v0.5 end of epoch | 62.27 | - | - | 2.93 | 3.28 | - |
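
For context on the GPT-4-judge columns above, a scoring call typically looks like the sketch below; the rubric wording and the 1-5 scale are assumptions, not the exact judge prompt used for these benchmarks.

```python
# Hedged sketch of a GPT-4-as-judge scoring call for instruction-speech outputs.
# The rubric text and scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def judge(instruction: str, model_answer: str) -> str:
    prompt = (
        "Rate the following answer to the instruction on a 1-5 scale for "
        "helpfulness and relevance. Reply with the score only.\n\n"
        f"Instruction: {instruction}\nAnswer: {model_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```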

Learnings

  • The loss declined significantly and stabilized between 0.78-0.8.

Quicklinks

bachvudinh (Contributor) commented Jan 14, 2025

Some very good results from the Ichigo v0.5 checkpoint at step 4000. The model's responses are now clearer and more helpful:

[image: example model response]

hahuyhoang411 (Contributor, Author) commented:

It's also good on my end, but it seems too noise-sensitive.
[Screenshot 2025-01-14 at 19:26:59]
[Screenshot 2025-01-14 at 19:26:28]

bachvudinh (Contributor) commented Jan 15, 2025

Training Issues and Budget Request Report

We are currently training a large language model using 8x H100 GPUs via RunPod. The initial budget estimate was $800 for a 32-hour training run.

Technical Issues Encountered

  • The initial 2000 training steps (~$150) were lost due to the quality of the Viettel x NVIDIA data: https://github.com/janhq/ichigo-internal/issues/8.
  • During training, we encountered an unexpected loss explosion consistently occurring around step 6000. The initial hypothesis focused on learning-rate optimization, leading to multiple experimental runs to tune this parameter. Root-cause analysis eventually identified a corrupted batch in the training data.

Solution

  • Viettel x NVIDIA data issue: temporarily shut down the H100 instances and moved data synthesis to 4xA6000s to minimize costs.
  • Loss explosion: resumed training from checkpoint 5000 after the corrupted data batch was identified and removed.

Financial Impact

| Run | Cost |
| --- | --- |
| Original budget (32 hours) | $800 |
| Viettel x NVIDIA data bug | ~$150 |
| Debugging costs (learning-rate optimization attempts) | ~$250 |
  • Current Status: Training duration has exceeded initial estimates due to necessary debugging and data cleanup.

Additional Budget Request

  • We request a $150 top-up to ensure successful completion of the model training phase:
    • Remaining training time needed: 3-4 hours
    • Current H100 cost rate: $24/hour

tikikun (Collaborator) commented Jan 21, 2025

Issues report

We are facing some issues with Ichigo v0.5; in short, it is really bad:

Issues:

  • The model can no longer follow the system prompt (the continual-pretraining effect seems to have been broken by a configuration change).
  • Speech quality is not good enough (@bachvu02 please link the speech benchmark; the quality degraded, not by much, but still).

Root causes:

  • About 5% of the data has an error (missing <|end_token|>).
  • Configuration change: the learning rate was (possibly) changed due to multiple issues during training, which seems to have damaged the "recovery" process (so the run ends up not as continual learning but as overfitting to the dataset at hand).
  • Data quality is simply poor (we realized this during training but continued anyway).

Mitigation plans:

  • We need CI/CD for mature pipelines (like speech); right now it's mostly manual (see the validation sketch at the end of this comment).
  • Switch to distillation in the future to mitigate the model-drift effect.
  • We should have run this on 8xA6000 rather than on RunPod.
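
On the CI/CD point, a minimal pre-run data check like the hedged sketch below could have caught the missing <|end_token|> rows before training; the JSONL layout and the "text" field name are assumptions about the dataset schema.

```python
# Hedged sketch of a pre-training data check for the missing <|end_token|> issue.
# The JSONL layout and "text" field name are assumptions about the dataset schema.
import json
import sys

END_TOKEN = "<|end_token|>"


def malformed_ratio(path: str) -> float:
    total, bad = 0, 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            total += 1
            if not sample["text"].rstrip().endswith(END_TOKEN):
                bad += 1
    return bad / max(total, 1)


if __name__ == "__main__":
    ratio = malformed_ratio(sys.argv[1])
    print(f"{ratio:.2%} of samples are missing {END_TOKEN}")
    # Fail the CI step if more than 1% of samples are malformed (threshold is illustrative).
    sys.exit(1 if ratio > 0.01 else 0)
```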
