Implementation of diffusion model from scratch in C++/CUDA, supporting both DiT and UNet architectures.


# diffusion.cu

This project is a from-scratch implementation of diffusion model training in raw C++/CUDA. It is a work in progress, with support for both the classic UNet architecture, based on *Diffusion Models Beat GANs on Image Synthesis*, and the transformer architecture (DiT), as detailed in *Scalable Diffusion Models with Transformers*. My work focuses on developing the DiT model from scratch, while also extending Chen Lu's unet.cu with distributed training support and optimizations such as mixed-precision training. The project is inspired by Andrej Karpathy's llm.c.
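For both architectures, the training objective is the standard DDPM denoising setup: sample a timestep t, corrupt a clean image x0 into x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, and train the network to predict eps. A minimal host-side C++ sketch of that forward-noising step (with an illustrative linear beta schedule, not this repo's actual code):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Cumulative alpha products for a linear beta schedule (DDPM-style).
// The schedule endpoints here are illustrative defaults, not the repo's.
std::vector<float> alpha_bar(int T, float beta_min = 1e-4f, float beta_max = 2e-2f) {
    std::vector<float> abar(T);
    float prod = 1.0f;
    for (int t = 0; t < T; ++t) {
        float beta = beta_min + (beta_max - beta_min) * t / (T - 1);
        prod *= 1.0f - beta;   // alpha_bar_t = prod_{s<=t} (1 - beta_s)
        abar[t] = prod;
    }
    return abar;
}

// Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
// The sampled eps (the denoiser's regression target) is written to eps_out.
void noise_image(const float* x0, float* xt, float* eps_out, int n,
                 float abar_t, std::mt19937& rng) {
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    float a = std::sqrt(abar_t), b = std::sqrt(1.0f - abar_t);
    for (int i = 0; i < n; ++i) {
        eps_out[i] = gauss(rng);
        xt[i] = a * x0[i] + b * eps_out[i];
    }
}
```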

## Training

UNet currently supports training. You can train it on images from the ImageNet 64x64 dataset via:

```shell
gunzip unet/data/elephant_train.bin.gz
python unet/train_diffusion.py --init_model_only True
make -C unet train_diffusion
./unet/train_diffusion
```

## Current Implementation

The implementation currently supports unconditional diffusion model training. On a single H100, the end-to-end training loop runs at about 42% of the speed of PyTorch with `torch.compile`. More detailed benchmarking is needed to identify bottlenecks and tune the implementation; mixed-precision training (FP16 with loss scaling) is one promising next step.

| Platform | Time on H100 (ms) |
| --- | --- |
| This repo (CUDA implementation) | 56.98 |
| PyTorch (w/ `torch.compile`) | 23.68 |
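The FP16-with-loss-scaling idea mentioned above follows a standard recipe: multiply the loss by a large scale so small gradients survive FP16's limited range, then unscale before the weight update, skipping any step whose gradients overflowed. A sketch of dynamic loss scaling in plain C++ (an assumption about how it could be wired in here, not the repo's implementation):

```cpp
#include <cmath>
#include <vector>

// Dynamic loss scaler: gradients computed from the scaled loss are `scale`
// times too large. If they are all finite, unscale and apply an SGD step and
// occasionally grow the scale; on overflow, skip the step and back off.
struct LossScaler {
    float scale = 65536.0f;                    // initial scale, 2^16
    float growth = 2.0f, backoff = 0.5f;
    int growth_interval = 2000, good_steps = 0;

    // Returns true if the step was applied, false if skipped due to overflow.
    bool step(std::vector<float>& grads, std::vector<float>& params, float lr) {
        bool finite = true;
        for (float g : grads)
            if (!std::isfinite(g)) { finite = false; break; }
        if (!finite) {                         // overflow: skip, shrink scale
            scale *= backoff;
            good_steps = 0;
            return false;
        }
        for (size_t i = 0; i < params.size(); ++i)
            params[i] -= lr * (grads[i] / scale);  // unscale, then SGD update
        if (++good_steps >= growth_interval) {     // stable: try a larger scale
            scale *= growth;
            good_steps = 0;
        }
        return true;
    }
};
```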

## In Progress

- support for distributed training via MPI in UNet
- support for mixed-precision training in UNet
- support for full-fledged DiT training
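The planned MPI data parallelism would follow the usual pattern: each rank computes gradients on its own shard of the batch, then an allreduce averages them so every rank applies an identical update. A plain-C++ simulation of that averaging step (in the real version the inner loops would be a call to `MPI_Allreduce` with `MPI_SUM` followed by a divide by the world size; this sketch is illustrative, not the repo's code):

```cpp
#include <cstddef>
#include <vector>

// Simulates the gradient allreduce used in data-parallel training: each
// "rank" holds its own gradient buffer; after the reduction, every rank
// holds the element-wise mean across ranks, so all ranks step identically.
void allreduce_mean(std::vector<std::vector<float>>& per_rank_grads) {
    size_t world = per_rank_grads.size();
    size_t n = per_rank_grads[0].size();
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (size_t r = 0; r < world; ++r) sum += per_rank_grads[r][i];
        float mean = sum / world;
        for (size_t r = 0; r < world; ++r) per_rank_grads[r][i] = mean;
    }
}
```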

## My Motivation

I've always been intrigued by diffusion models but found the math and implementation challenging. My interest in ML systems and GPU programming led me to start this project. Inspired by Karpathy's llm.c, I aimed to directly program the GPU for faster, more efficient training.

My goal is to develop an implementation that can eventually surpass PyTorch's `torch.compile`, which optimizes model execution on NVIDIA GPUs through techniques like JIT compilation, operator fusion, and kernel optimizations. These techniques significantly improve runtime performance by reducing overhead and maximizing hardware utilization.

## Learning Resources That Helped Me

If you're interested in learning more about diffusion models and CUDA programming, here are some resources that I found incredibly helpful:

### More CUDA/GPU Programming Resources

#### Articles/Blogs

#### Tutorials

#### Notebooks

#### Videos

## Acknowledgments
