This cut of code uses Sonnet 3.5 to reverse-engineer code from the white paper (this is for La Raza): https://www.youtube.com/watch?v=bZ8AS300WH4
MegaPortrait
https://github.com/johndpope/MegaPortrait-hack
the models here seem to be OK - they do train and converge
johndpope/MegaPortrait-hack#36
This is fresh code spat out by Sonnet 3.5. The earlier AI models were perhaps light on diffusion training data; this latest code seems to nail it.
I post progress updates here - #20
✅ dataset is good
models seem good
I had the stage 1 trainer working some time back, but I'm now running into OOM problems... https://github.com/johndpope/MegaPortrait-hack/tree/5fb4398634c0e27c60d850d4ab997f9b1df2c3fe
https://wandb.ai/snoozie/vasa?nw=nwusersnoozie
```bash
# Sanity-check the dataset loader
python dataset_testing.py

# Raise the open-file limit (the dataloader keeps many files open)
ulimit -n 65535

# Debug memory
mprof run train_stage1.py
```
```bash
# 1. STAGE 1 - basic single-GPU training
accelerate launch --mixed_precision fp16 train_stage1.py \
    --config configs/training/stage1-base.yaml

# 2. Multi-GPU training on a single machine
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 3. Distributed training across multiple machines
accelerate launch \
    --multi_gpu \
    --num_processes 8 \
    --num_machines 2 \
    --machine_rank 0 \
    --main_process_ip "master_node_ip" \
    --main_process_port 29500 \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 4. Debug mode with a smaller batch size and fewer iterations
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.batch_size=4 \
    training.base_epochs=2 \
    training.sample_interval=10

# 5. Resume training from a checkpoint
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.resume_path=checkpoints/stage1/checkpoint_epoch_10.pt

# 6. Override specific config values
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.batch_size=16 \
    training.lr=2e-4 \
    training.weight_decay=1e-2 \
    model.feature_dim=512

# 7. Training with different loss weights
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    loss.perceptual_weight=1.0 \
    loss.gan_weight=0.5 \
    loss.identity_weight=1.0 \
    loss.motion_weight=0.5

# 8. Training on specific GPU devices
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 9. Training with CPU offload (for limited GPU memory)
# (note: CPU offload normally goes through a DeepSpeed config;
#  a bare --cpu_offload flag may not exist in your accelerate version)
accelerate launch \
    --mixed_precision fp16 \
    --cpu_offload \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 10. Training with gradient checkpointing (for memory efficiency)
accelerate launch \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 8 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    model.use_gradient_checkpointing=true
```
```bash
# STAGE 2 - single-GPU training
python train_stage2.py

# Multi-GPU training with Accelerate
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage2.py

# Distributed training with a specific GPU configuration
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    --num_processes 4 \
    --num_machines 1 \
    --machine_rank 0 \
    --main_process_port 29500 \
    train_stage2.py
```
class PositionalEncoding(nn.Module):
- Adds position information to transformer embeddings
- Arguments:
  - `d_model`: dimension of the model
  - `dropout`: dropout rate
  - `max_len`: maximum sequence length
- Used for sequence position information in transformers
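
For reference, a minimal sketch of the standard sinusoidal encoding this class describes; the signature follows the arguments listed above, and the defaults are assumptions:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Minimal sketch: standard sinusoidal positional encoding (assumes even d_model)."""
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```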
class AudioEncoder(nn.Module):
- Encodes audio features using Wav2Vec2 architecture
- Components:
- Multiple convolutional layers
- Layer normalization
- Projection layer
- Processes audio input for the model
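
A hedged sketch of what this encoder might look like, assuming the Hugging Face `transformers` Wav2Vec2 backbone; the checkpoint name and `d_model` are placeholders, not the repo's actual choices:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioEncoder(nn.Module):
    """Hypothetical sketch: pretrained Wav2Vec2 backbone + norm + projection."""
    def __init__(self, d_model: int = 512, checkpoint: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(checkpoint)  # conv layers + transformer
        self.norm = nn.LayerNorm(self.backbone.config.hidden_size)
        self.proj = nn.Linear(self.backbone.config.hidden_size, d_model)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) mono audio at 16 kHz
        hidden = self.backbone(waveform).last_hidden_state  # (B, T, hidden_size)
        return self.proj(self.norm(hidden))                 # (B, T, d_model)
```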
class VASAFaceEncoder(nn.Module):
- Enhanced face encoder with disentangled representations
- Components:
- 3D Appearance Volume Encoder
- Identity Encoder
- Head Pose Encoder
- Facial Dynamics Encoder
- Creates separate representations for different face aspects
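
A structural sketch only, showing one encoder head per disentangled factor; the layer sizes, output dimensions, and volume shape are assumptions:

```python
import torch
import torch.nn as nn

def _head(out_dim: int) -> nn.Sequential:
    # tiny stand-in CNN head; the real encoders are much deeper
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class VASAFaceEncoder(nn.Module):
    """Structural sketch: one encoder per disentangled factor."""
    def __init__(self, id_dim: int = 512, pose_dim: int = 6, dyn_dim: int = 128):
        super().__init__()
        # 2D features lifted into a coarse 3D appearance volume (C=32, D=16)
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 32 * 16, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.identity = _head(id_dim)     # who the person is
        self.head_pose = _head(pose_dim)  # rotation + translation
        self.dynamics = _head(dyn_dim)    # expression / facial motion

    def forward(self, img: torch.Tensor):
        b = img.size(0)
        feat = self.appearance(img)                   # (B, 32*16, H', W')
        vol = feat.view(b, 32, 16, *feat.shape[-2:])  # (B, C, D, H', W')
        return vol, self.identity(img), self.head_pose(img), self.dynamics(img)
```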
class VASADiffusionTransformer(nn.Module):
- Diffusion Transformer with conditioning and CFG
- Features:
- 8-layer transformer architecture
- Multiple embedding layers
- Classifier-free guidance support
- Handles motion generation and conditioning
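
A hedged sketch of the conditioning and classifier-free-guidance pattern described above; `motion_dim`, `audio_dim`, and the embedding scheme are assumptions:

```python
import torch
import torch.nn as nn

class VASADiffusionTransformer(nn.Module):
    """Sketch: 8-layer transformer denoiser with CFG via condition dropout."""
    def __init__(self, d_model: int = 512, n_layers: int = 8, n_heads: int = 8,
                 motion_dim: int = 128, audio_dim: int = 512, cond_drop: float = 0.1):
        super().__init__()
        self.cond_drop = cond_drop
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x_t: (B, T, motion_dim) noisy motion; t: (B,) timesteps; audio: (B, T, audio_dim)
        cond = self.audio_in(audio)
        if self.training and self.cond_drop > 0:
            # randomly drop conditioning so the model also learns the unconditional path
            keep = (torch.rand(x_t.size(0), 1, 1, device=x_t.device) > self.cond_drop).float()
            cond = cond * keep
        h = self.motion_in(x_t) + cond + self.t_embed(t.float().view(-1, 1, 1))
        return self.out(self.blocks(h))
```

At sampling time, CFG runs the model with and without the condition and blends the two predictions, e.g. `eps = eps_uncond + s * (eps_cond - eps_uncond)` for guidance scale `s`.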
class VASATrainer:
- Main training orchestrator
- Responsibilities:
- Model initialization
- Training loop management
- Optimization handling
- Distributed training support
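
A skeleton of the orchestration pattern, assuming Hugging Face Accelerate (which the launch commands above already use); the real loss and dataloader wiring will differ:

```python
from accelerate import Accelerator

class VASATrainer:
    """Skeleton of the training orchestration described above."""
    def __init__(self, model, optimizer, dataloader, loss_fn):
        self.accelerator = Accelerator()  # device placement, DDP, mixed precision
        self.model, self.optimizer, self.dataloader = self.accelerator.prepare(
            model, optimizer, dataloader)
        self.loss_fn = loss_fn

    def train_epoch(self):
        self.model.train()
        for batch in self.dataloader:
            self.optimizer.zero_grad()
            loss = self.loss_fn(self.model, batch)
            self.accelerator.backward(loss)  # replaces loss.backward() under DDP/AMP
            self.optimizer.step()
```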
class VASAConfig:
- Configuration management
- Sections:
- Training settings
- Model parameters
- Diffusion settings
- CFG settings
- Data settings
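
An illustrative layout as a dataclass; the sections match the list above and the defaults echo the CLI overrides shown earlier, but the actual field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class VASAConfig:
    """Illustrative config layout; field names and defaults are assumptions."""
    # training
    batch_size: int = 16
    lr: float = 2e-4
    weight_decay: float = 1e-2
    # model
    feature_dim: int = 512
    # diffusion
    num_timesteps: int = 1000
    beta_start: float = 1e-4
    beta_end: float = 0.02
    # CFG
    guidance_scale: float = 2.0
    cond_drop_prob: float = 0.1
    # data
    video_dir: str = "data/videos"
    sample_rate: int = 16000
```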
class TrainingLogger:
- Logging functionality
- Features:
- Weights & Biases integration
- File logging
- Metric tracking
- Video logging
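
A sketch of the logging wrapper, assuming the standard `wandb` and `logging` APIs; project and file names are placeholders:

```python
import logging
from typing import Optional
import wandb

class TrainingLogger:
    """Sketch of the logging wrapper described above."""
    def __init__(self, project: str = "vasa", run_name: Optional[str] = None):
        wandb.init(project=project, name=run_name)
        logging.basicConfig(filename="train.log", level=logging.INFO)

    def log_metrics(self, metrics: dict, step: int) -> None:
        wandb.log(metrics, step=step)
        logging.info("step %d: %s", step, metrics)

    def log_video(self, path: str, step: int) -> None:
        wandb.log({"samples": wandb.Video(path)}, step=step)  # wandb renders local video files
```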
class CheckpointManager:
- Manages model checkpoints
- Features:
- Checkpoint saving/loading
- Best model tracking
- Checkpoint rotation
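
A sketch of save/load with best-model tracking and rotation; the filename pattern matches the resume example above (`checkpoint_epoch_10.pt`), everything else is an assumption:

```python
import os
import torch

class CheckpointManager:
    """Sketch: checkpoint saving with best-model tracking and rotation."""
    def __init__(self, ckpt_dir: str = "checkpoints/stage1", keep_last: int = 3):
        self.ckpt_dir, self.keep_last = ckpt_dir, keep_last
        self.best_metric = float("inf")  # assumes lower is better
        os.makedirs(ckpt_dir, exist_ok=True)

    def save(self, model, optimizer, epoch: int, metric: float) -> None:
        path = os.path.join(self.ckpt_dir, f"checkpoint_epoch_{epoch}.pt")
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, path)
        if metric < self.best_metric:
            self.best_metric = metric
            torch.save(model.state_dict(), os.path.join(self.ckpt_dir, "best.pt"))
        self._rotate()

    def _rotate(self) -> None:
        # keep only the newest `keep_last` rolling checkpoints
        ckpts = sorted(
            (f for f in os.listdir(self.ckpt_dir) if f.startswith("checkpoint_")),
            key=lambda f: os.path.getmtime(os.path.join(self.ckpt_dir, f)))
        for old in ckpts[: -self.keep_last]:
            os.remove(os.path.join(self.ckpt_dir, old))
```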
class VASADataset(Dataset):
- Dataset implementation
- Features:
- Video frame processing
- Audio feature extraction
- Face attribute extraction
- Gaze and emotion processing
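
A skeleton only: the real class runs face, gaze, and emotion extractors, which are stubbed out here as placeholder tensors:

```python
import torch
from torch.utils.data import Dataset

class VASADataset(Dataset):
    """Skeleton: real loading/extraction steps are stubbed as placeholder tensors."""
    def __init__(self, samples):
        self.samples = samples  # list of (video_path, audio_path) pairs

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        video_path, audio_path = self.samples[idx]
        frames = self._load_frames(video_path)  # (T, 3, H, W) floats in [0, 1]
        audio = self._load_audio(audio_path)    # (samples,) 16 kHz waveform
        return {"frames": frames, "audio": audio}

    def _load_frames(self, path: str) -> torch.Tensor:
        return torch.zeros(16, 3, 256, 256)  # stub; real loader decodes the video

    def _load_audio(self, path: str) -> torch.Tensor:
        return torch.zeros(16000)  # stub; real loader resamples the audio
```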
class VideoGenerator:
- Video generation pipeline
- Features:
- Sliding window approach
- Source image processing
- Motion sequence generation
- Frame synthesis
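
The sliding-window idea, sketched as a single function; `diffusion.sample` and the window/stride values are hypothetical names, not the repo's API:

```python
import torch

@torch.no_grad()
def generate_motion(diffusion, model, audio_feats: torch.Tensor,
                    window: int = 50, stride: int = 40) -> torch.Tensor:
    """Sliding-window generation: denoise overlapping chunks conditioned on
    audio, keeping only the new frames of each chunk after the first."""
    chunks, t = [], 0
    while t < audio_feats.size(1):
        cond = audio_feats[:, t: t + window]    # (B, <=window, D)
        motion = diffusion.sample(model, cond)  # hypothetical sampler API
        chunks.append(motion if t == 0 else motion[:, window - stride:])
        t += stride
    return torch.cat(chunks, dim=1)  # (B, total_frames, motion_dim)
```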
class CAPPScore(nn.Module):
- Contrastive Audio and Pose Pretraining score
- Components:
- Pose encoder (6-layer transformer)
- Audio encoder (Wav2Vec2-based)
- Temperature parameter
- Evaluates audio-pose synchronization
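
A hedged sketch of the CLIP-style contrastive scoring this describes; the encoders are passed in as placeholders, and the temperature initialization follows the common `ln(1/0.07)` convention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAPPScore(nn.Module):
    """Sketch: CLIP-style contrastive scoring of pose/audio windows."""
    def __init__(self, pose_encoder: nn.Module, audio_encoder: nn.Module):
        super().__init__()
        self.pose_encoder = pose_encoder    # 6-layer transformer per the notes above
        self.audio_encoder = audio_encoder  # Wav2Vec2-based per the notes above
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))  # ~ln(1/0.07) temperature

    def forward(self, pose_seq: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # both encoders are assumed to pool their sequence into (B, dim)
        p = F.normalize(self.pose_encoder(pose_seq), dim=-1)
        a = F.normalize(self.audio_encoder(audio_seq), dim=-1)
        logits = self.logit_scale.exp() * p @ a.t()  # (B, B) similarity matrix
        return logits.diag()  # matched-pair similarities = per-sample CAPP scores
```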
class Evaluator:
- Comprehensive evaluation metrics
- Metrics:
- SyncNet confidence
- CAPP score
- Pose variation intensity
- FVD (if real video available)
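
A minimal aggregation skeleton; SyncNet, CAPP, and FVD are external models, so the metrics are passed in as callables rather than implemented here:

```python
class Evaluator:
    """Skeleton: aggregates metric callables over generated clips."""
    def __init__(self, metrics: dict):
        self.metrics = metrics  # name -> callable(generated, reference) -> float

    def evaluate(self, generated, reference=None) -> dict:
        results = {}
        for name, fn in self.metrics.items():
            try:
                results[name] = fn(generated, reference)
            except Exception:
                results[name] = None  # e.g. FVD is skipped when no real video is available
        return results
```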
class VASADiffusion:
- Diffusion process handler
- Features:
- Forward diffusion sampling
- Reverse diffusion sampling
- Beta schedule management
- Noise level control
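
A minimal DDPM-style sketch of the forward and reverse processes with a linear beta schedule; the repo's actual schedule and parameterization may differ:

```python
import torch

class VASADiffusion:
    """Sketch: linear beta schedule, forward q_sample, one reverse step (sigma^2 = beta)."""
    def __init__(self, num_timesteps: int = 1000,
                 beta_start: float = 1e-4, beta_end: float = 0.02):
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

    def q_sample(self, x0: torch.Tensor, t: torch.Tensor, noise=None) -> torch.Tensor:
        # forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
        noise = torch.randn_like(x0) if noise is None else noise
        ab = self.alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
        return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

    @torch.no_grad()
    def p_sample_step(self, model, x_t: torch.Tensor, t: int, cond) -> torch.Tensor:
        # reverse process: one denoising step from the model's noise prediction
        t_batch = torch.full((x_t.size(0),), t, dtype=torch.long, device=x_t.device)
        eps = model(x_t, t_batch, cond)
        beta, alpha, ab = self.betas[t], self.alphas[t], self.alpha_bars[t]
        mean = (x_t - beta / (1 - ab).sqrt() * eps) / alpha.sqrt()
        return mean if t == 0 else mean + beta.sqrt() * torch.randn_like(x_t)
```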
class VASALoss(nn.Module):
- Combined loss implementation
- Components:
- DPE (Disentangled Pose Encoding) loss
- Identity preservation loss
- Reconstruction loss
- Configuration for loss balancing
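
A sketch of the weighted-sum balancing pattern; the component losses here are simplified stand-ins (L1 reconstruction, cosine identity), with the DPE term computed upstream:

```python
import torch.nn as nn
import torch.nn.functional as F

class VASALoss(nn.Module):
    """Sketch of the weighted-sum loss balancing; components are simplified."""
    def __init__(self, w_dpe: float = 1.0, w_id: float = 1.0, w_recon: float = 1.0):
        super().__init__()
        self.w_dpe, self.w_id, self.w_recon = w_dpe, w_id, w_recon

    def forward(self, pred, target, id_pred, id_target, dpe_loss):
        # dpe_loss is assumed to be computed upstream from the pose encodings
        loss_recon = F.l1_loss(pred, target)
        loss_id = 1 - F.cosine_similarity(id_pred, id_target, dim=-1).mean()
        total = self.w_dpe * dpe_loss + self.w_id * loss_id + self.w_recon * loss_recon
        return total, {"dpe": dpe_loss, "id": loss_id, "recon": loss_recon}
```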
The classes are designed to work together in the VASA pipeline:
- Configuration is managed by `VASAConfig`
- Data is loaded through `VASADataset`
- Training is orchestrated by `VASATrainer`
- Models (`VASAFaceEncoder`, `VASADiffusionTransformer`, etc.) process the data
- `VideoGenerator` produces the final output
- `Evaluator` assesses the results

- Most neural network classes inherit from `nn.Module`
- The system uses PyTorch as its deep learning framework
- Classes support distributed training where applicable
- Extensive use of type hints for better code clarity
- Modular design allows for easy component modification