This is a reading list of papers, videos, and repos I've personally found useful while ramping up on ML Systems, and that I wish more people would just sit and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!
- Attention Is All You Need: Start here; still one of the best intros
- Online normalizer calculation for softmax: A must-read before Flash Attention; it will help you get the main "trick"
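  A minimal sketch of that trick, in plain Python/NumPy: a single pass that keeps a running max and a running normalizer, rescaling the normalizer whenever the max changes.

  ```python
  import numpy as np

  def online_softmax(x):
      # One pass over the data: track a running max (m) and a running sum of
      # exponentials (d), rescaling d whenever the max changes.
      m, d = float("-inf"), 0.0
      for xi in x:
          m_new = max(m, xi)
          d = d * np.exp(m - m_new) + np.exp(xi - m_new)
          m = m_new
      return np.exp(x - m) / d

  x = np.random.randn(16)
  ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
  assert np.allclose(online_softmax(x), ref)
  ```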
- Self Attention does not need O(n^2) memory: Shows that exact attention can be computed over chunks of keys/values without materializing the full score matrix; a direct precursor to Flash Attention
- Flash Attention 2: The diagrams here do a better job of explaining Flash Attention 1 as well
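  To see how the online normalizer turns into tiling, here is an unfused PyTorch sketch of the Flash Attention recurrence: it streams over K/V blocks and never materializes the full n x n score matrix. The real kernel fuses all of this into on-chip SRAM, which this sketch does not attempt.

  ```python
  import torch

  def blockwise_attention(q, k, v, block=64):
      # Per query row, keep a running max (m), normalizer (l), and
      # unnormalized output (o) while streaming over K/V tiles.
      n, d = q.shape
      scale = d ** -0.5
      o = torch.zeros_like(q)
      m = torch.full((n, 1), float("-inf"))
      l = torch.zeros(n, 1)
      for s in range(0, k.shape[0], block):
          kj, vj = k[s:s+block], v[s:s+block]
          scores = (q @ kj.T) * scale                 # one (n, block) tile
          m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
          p = torch.exp(scores - m_new)
          alpha = torch.exp(m - m_new)                # rescale the old stats
          l = l * alpha + p.sum(dim=-1, keepdim=True)
          o = o * alpha + p @ vj
          m = m_new
      return o / l

  q, k, v = (torch.randn(128, 32) for _ in range(3))
  ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
  assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5)
  ```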
- Llama 2 paper: Skim it for the model details
- gpt-fast: A great repo to come back to for minimal yet performant code
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: There are tons of papers on long context lengths, but I found this to be among the clearest
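  The mechanism fits in a few lines: a static, head-specific linear penalty on query-key distance, added straight to the attention scores. The slopes below follow the paper's geometric schedule for power-of-two head counts; causal masking is left to the usual mask.

  ```python
  import torch

  def alibi_bias(num_heads, seq_len):
      # Head h gets slope 2^(-8(h+1)/num_heads); the penalty grows linearly
      # with how far back the key is from the query. No position embeddings.
      slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
      dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]  # j - i
      return slopes[:, None, None] * dist.clamp(max=0)  # (heads, q_len, k_len)

  bias = alibi_bias(num_heads=8, seq_len=16)
  # scores = q @ k.T / sqrt(d) + bias, then softmax as usual
  ```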
- Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Wonderful survey, start here
- Efficiently Scaling Transformer Inference: Introduced many ideas, most notably KV caches
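  A minimal sketch of the idea with a hypothetical `KVCache` class (not any particular library's API): cache every past token's keys and values so each decode step only projects and attends for the newest token.

  ```python
  import torch

  class KVCache:
      def __init__(self, max_len, n_heads, head_dim):
          # Preallocated storage for past keys/values, filled left to right.
          self.k = torch.zeros(max_len, n_heads, head_dim)
          self.v = torch.zeros(max_len, n_heads, head_dim)
          self.len = 0

      def append(self, k_new, v_new):
          # k_new, v_new: (t, n_heads, head_dim) for the t newest tokens.
          t = k_new.shape[0]
          self.k[self.len:self.len + t] = k_new
          self.v[self.len:self.len + t] = v_new
          self.len += t
          return self.k[:self.len], self.v[:self.len]  # attend over all past
  ```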
- Making Deep Learning go Brrr from First Principles: One of the best intros to fusions and overhead
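  The blog's core point about memory-bound pointwise ops is easy to demo: a chain of pointwise ops costs one memory round-trip if fused, several if not. A toy example using `torch.compile`:

  ```python
  import torch

  def gelu_ish(x):
      # Three pointwise ops: unfused, each reads and writes the whole tensor.
      return 0.5 * x * (1.0 + torch.tanh(x))

  fused = torch.compile(gelu_ish)    # Inductor fuses the chain into one kernel
  y = fused(torch.randn(1_000_000))  # one read and one write instead of several
  ```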
- Fast Inference from Transformers via Speculative Decoding: This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding
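  The acceptance rule at the heart of the paper, sketched below. `p_target` and `p_draft` are the two models' next-token distributions at each drafted position; accepting token x with probability min(1, p_target(x)/p_draft(x)) and resampling from the residual on the first rejection keeps the output distribution exactly that of the target model.

  ```python
  import torch

  def accept_draft(p_target, p_draft, draft_tokens):
      # p_target/p_draft: (k, vocab); draft_tokens: the k proposed token ids.
      for i, x in enumerate(draft_tokens):
          # Accept with probability min(1, p_target(x) / p_draft(x)).
          if torch.rand(()) > p_target[i, x] / p_draft[i, x]:
              # First rejection: resample from the residual distribution.
              residual = (p_target[i] - p_draft[i]).clamp(min=0)
              correction = torch.multinomial(residual / residual.sum(), 1)
              return draft_tokens[:i], correction
      return draft_tokens, None  # everything accepted
  ```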
- Grouped Query Attention: KV caches can be chunky; this is how you fix it
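  A sketch of the storage win: the cache only holds `n_kv_heads` heads, and several query heads share each one, broadcast up at attention time.

  ```python
  import torch

  def expand_kv(kv, n_query_heads):
      # kv: (seq, n_kv_heads, head_dim). The cache shrinks by a factor of
      # n_query_heads / n_kv_heads; we repeat shared heads before attention.
      n_kv_heads = kv.shape[1]
      return kv.repeat_interleave(n_query_heads // n_kv_heads, dim=1)

  k_cache = torch.randn(128, 2, 64)          # only 2 KV heads stored
  k = expand_kv(k_cache, n_query_heads=16)   # viewed as 16 heads for attention
  ```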
- Orca: A Distributed Serving System for Transformer-Based Generative Models: Introduced continuous batching (a great pre-read for the PagedAttention paper)
- Efficient Memory Management for Large Language Model Serving with PagedAttention: The most crucial optimization for high-throughput batch inference
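  A toy sketch of the bookkeeping (names are illustrative, not vLLM's API): KV entries live in fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to blocks, exactly like a page table, so memory is allocated on demand instead of reserved up front for the max length.

  ```python
  BLOCK_SIZE = 16          # tokens per physical KV block

  class BlockTable:
      # Per-sequence map from logical positions to physical blocks; blocks
      # are grabbed lazily from a shared free list as the sequence grows.
      def __init__(self, free_list):
          self.blocks, self.free_list = [], free_list

      def slot_for(self, pos):
          while pos // BLOCK_SIZE >= len(self.blocks):
              self.blocks.append(self.free_list.pop())   # allocate on demand
          return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

  free_list = list(range(1024))    # physical block ids not yet in use
  seq = BlockTable(free_list)
  print(seq.slot_for(0), seq.slot_for(17))   # two tokens, two different blocks
  ```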
- Colfax Research Blog: Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
- Sarathi LLM: Introduces chunked prefill to make workloads more balanced between prefill and decode
- Epilogue Visitor Tree: Fuse custom epilogues by adding more epilogue nodes to the same class (the visitor design pattern) and representing the whole epilogue as a tree
- A White Paper on Neural Network Quantization: Start here; this will give you the foundation to quickly skim all the other papers
- LLM.int8(): All of Dettmers' papers are great, but this is a natural intro
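  The vector-wise half of the scheme in a few lines (the paper's other half, mixed-precision outlier decomposition, is omitted here): row-wise absmax quantization into int8 with a kept scale.

  ```python
  import torch

  def quantize_rowwise_int8(x):
      # Map each row into [-127, 127] and keep the per-row scale so we can
      # dequantize after the int8 matmul.
      scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
      return torch.round(x / scale).to(torch.int8), scale

  w = torch.randn(4096, 4096)
  q, scale = quantize_rowwise_int8(w)
  w_hat = q.float() * scale          # dequantize
  print((w - w_hat).abs().max())     # error bounded by scale / 2 per entry
  ```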
- FP8 formats for deep learning: For a first-hand look at how new number formats come about
- SmoothQuant: Balancing quantization difficulty between weights and activations
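  The whole method is essentially one formula: per-channel factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha) that divide the activations and multiply the weights, leaving the matmul mathematically unchanged while taming activation outliers.

  ```python
  import torch

  def smoothing_factors(act_absmax, w_absmax, alpha=0.5):
      # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), per input channel j.
      return act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)

  act_absmax = torch.rand(4096) * 50 + 1   # toy per-channel activation ranges
  w_absmax = torch.rand(4096) + 0.5        # toy per-channel weight ranges
  s = smoothing_factors(act_absmax, w_absmax)
  # x_smooth = x / s and w_smooth = w * s: the product x @ w is unchanged,
  # but both factors now have tamer ranges and quantize with less error.
  ```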
- Mixed precision training: The OG paper describing mixed precision training strategies for half precision
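  The paper's recipe maps directly onto PyTorch's AMP utilities (this sketch assumes a CUDA device): fp32 master weights, half-precision forward/backward, and a dynamically adjusted loss scale so small gradients don't flush to zero in fp16.

  ```python
  import torch

  model = torch.nn.Linear(1024, 1024).cuda()   # fp32 master weights
  opt = torch.optim.SGD(model.parameters(), lr=1e-3)
  scaler = torch.cuda.amp.GradScaler()         # dynamic loss scaling

  x = torch.randn(32, 1024, device="cuda")
  with torch.autocast("cuda", dtype=torch.float16):
      loss = model(x).pow(2).mean()            # forward runs in fp16
  scaler.scale(loss).backward()                # backward on the scaled loss
  scaler.step(opt)                             # unscales grads; skips on inf/nan
  scaler.update()                              # grows/shrinks the scale factor
  ```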
- RoFormer: Enhanced Transformer with Rotary Position Embedding: The paper that introduced rotary positional embeddings
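  A compact sketch of the rotation: each (even, odd) feature pair is rotated by an angle proportional to the token's position, so relative offsets show up as phase differences in the q-k dot product.

  ```python
  import torch

  def rope(x, base=10000):
      # x: (seq, dim). Pair features (0,1), (2,3), ... and rotate each pair
      # by position * theta_i, with frequencies decaying across pairs.
      seq, dim = x.shape
      theta = base ** (-torch.arange(0, dim, 2).float() / dim)      # (dim/2,)
      ang = torch.arange(seq).float()[:, None] * theta[None, :]     # (seq, dim/2)
      cos, sin = ang.cos(), ang.sin()
      x1, x2 = x[:, 0::2], x[:, 1::2]
      out = torch.empty_like(x)
      out[:, 0::2] = x1 * cos - x2 * sin
      out[:, 1::2] = x1 * sin + x2 * cos
      return out

  q = rope(torch.randn(128, 64))   # applied to queries and keys, never values
  ```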
- YaRN: Efficient Context Window Extension of Large Language Models: Extend base model context lengths with finetuning
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Scale to infinite context lengths as long as you can stack more GPUs
- VENOM: A Vectorized N:M Format for sparse tensor cores when the hardware only supports 2:4
- MegaBlocks: Efficient sparse training with mixture of experts
- ReLu Strikes Back: Really enjoyed this paper as an example of doing model surgery for more efficient inference
- Singularity: Shows how to make jobs preemptible, migratable and elastic
- Local SGD: So hot right now
- OpenDiLoCo: Asynchronous training across decentralized workers
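  Both of the above reduce to the same loop, sketched here with `torch.distributed` (assumes an initialized process group; `ReduceOp.AVG` needs NCCL): take several ordinary local steps, then average weights across workers instead of all-reducing gradients every step.

  ```python
  import torch.distributed as dist

  def local_sgd_step(model, optimizer, step, sync_every=64):
      # Ordinary local step; communication happens only every sync_every steps,
      # trading gradient-sync bandwidth for slightly stale parameters.
      optimizer.step()
      optimizer.zero_grad()
      if step % sync_every == 0:
          for p in model.parameters():
              # With backends lacking AVG: all_reduce SUM, then divide by world size.
              dist.all_reduce(p.data, op=dist.ReduceOp.AVG)
  ```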
- torchtitan: Minimal repository showing how to implement 4D parallelism in pure PyTorch
- PipeDream: The pipeline parallelism paper
- Just-in-time checkpointing: A very clever alternative to periodic checkpointing
- Reducing Activation Recomputation in Large Transformer Models: The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
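  The basic mechanism is one call in PyTorch; the "selective" part of the paper is about choosing which ops get this treatment (cheap-to-recompute, activation-heavy ones like attention scores) rather than applying it wholesale.

  ```python
  import torch
  from torch.utils.checkpoint import checkpoint

  block = torch.nn.Sequential(
      torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
  )
  x = torch.randn(8, 1024, requires_grad=True)
  # Don't store this block's intermediate activations; rerun its forward
  # during backward to regenerate them, trading compute for memory.
  y = checkpoint(block, x, use_reentrant=False)
  y.sum().backward()
  ```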
- Breaking the computation and communication abstraction barrier: God tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: The ZeRO algorithm behind FSDP and DeepSpeed, intelligently reducing memory usage for data parallelism
- Megatron-LM: For an introduction to Tensor Parallelism
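  The Megatron MLP split, simulated on one device (the loop stands in for ranks, the final sum for the all-reduce): a column-parallel linear feeding a row-parallel linear needs only one collective per MLP, because the nonlinearity can be applied locally on each column shard.

  ```python
  import torch

  def mlp_tensor_parallel(x, w1_shards, w2_shards):
      # Column-parallel first linear: each rank holds a slice of w1's columns,
      # so relu applies locally. Row-parallel second linear: each rank holds
      # the matching rows of w2 and produces a partial sum of the output.
      partial = [torch.relu(x @ w1) @ w2 for w1, w2 in zip(w1_shards, w2_shards)]
      return sum(partial)   # stands in for the all-reduce across ranks

  x = torch.randn(4, 64)
  w1, w2 = torch.randn(64, 256), torch.randn(256, 64)
  out = mlp_tensor_parallel(x, w1.chunk(4, dim=1), w2.chunk(4, dim=0))
  assert torch.allclose(out, torch.relu(x @ w1) @ w2, atol=1e-3)
  ```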