Releases: pytorch/torchtune
v0.4.0
Highlights
Today we release v0.4.0 of torchtune with some exciting new additions! Some notable ones include full support for activation offloading, recipes for Llama3.2V 90B and QLoRA variants, new documentation, and Qwen2.5 models!
Activation offloading (#1443, #1645, #1847)
Activation offloading is a memory-saving technique that asynchronously moves checkpointed activations that are not currently in use to the CPU. Right before the GPU needs them for the microbatch’s backward pass, the offloaded activations are prefetched back from the CPU. To enable it, set the following options in your config:
enable_activation_checkpointing: True
enable_activation_offloading: True
In experiments with Llama3 8B, activation offloading used roughly 24% less memory while incurring a performance slowdown of under 1%.
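To put those numbers in concrete terms, here is a small back-of-the-envelope sketch. The 24% and 1% figures come from the Llama3 8B experiments above; the baseline memory and step time are hypothetical placeholders, not measurements, and the function name is made up for illustration.

```python
def estimate_with_offloading(
    baseline_peak_gib: float,
    baseline_step_s: float,
    memory_saving: float = 0.24,  # "roughly 24% less memory" (from the notes above)
    max_slowdown: float = 0.01,   # "under 1%" slowdown, used here as an upper bound
) -> tuple[float, float]:
    """Return (estimated peak memory in GiB, worst-case step time in seconds)."""
    peak_gib = baseline_peak_gib * (1.0 - memory_saving)
    step_s = baseline_step_s * (1.0 + max_slowdown)
    return peak_gib, step_s


# Hypothetical baseline: 20 GiB peak memory and 2.0 s per step without offloading.
peak, step = estimate_with_offloading(baseline_peak_gib=20.0, baseline_step_s=2.0)
print(f"estimated peak memory: {peak:.1f} GiB, step time at most: {step:.2f} s")
```

Actual savings depend on the model, sequence length, and hardware, so treat this only as a way to reason about the trade-off.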
Llama3.2V 90B with QLoRA (#1880, #1726)
We added model builders and configs for the 90B version of Llama3.2V, which outperforms the 11B version of the model across common benchmarks. Because this model is larger, we also added the ability to run it using QLoRA and FSDP2.
# Download the model first
tune download meta-llama/Llama-3.2-90B-Vision-Instruct --ignore-patterns "original/consolidated*"
# Run with e.g. 4 GPUs
tune run --nproc_per_node 4 lora_finetune_distributed --config llama3_2_vision/90B_qlora
Qwen2.5 model family has landed (#1863)
We added builders for Qwen2.5, the cutting-edge models from the Qwen family! In their own words, "Compared to Qwen2, Qwen2.5 has acquired significantly more knowledge (MMLU: 85+) and has greatly improved capabilities in coding (HumanEval 85+) and mathematics (MATH 80+)."
Get started with the models easily:
tune download Qwen/Qwen2.5-1.5B-Instruct --ignore-patterns None
tune run lora_finetune_single_device --config qwen2_5/1.5B_lora_single_device
New documentation on using custom recipes, configs, and components (#1910)
We heard your feedback and wrote up a simple page on how to customize configs, recipes, and individual components! Check it out in the torchtune documentation.
What's Changed
- Fix PackedDataset bug for seq_len > 2 * max_seq_len setting. by @mirceamironenco in #1697
- Bump version 0.3.1 by @joecummings in #1720
- Add error propagation to distributed run. by @mirceamironenco in #1719
- Update fusion layer counting logic for Llama 3.2 weight conversion by @ebsmothers in #1722
- Resizable image positional embeddings by @felipemello1 in #1695
- Unpin numpy by @ringohoffman in #1728
- Add HF Checkpoint Format Support for Llama Vision by @pbontrager in #1727
- config changes by @felipemello1 in #1733
- Fix custom imports for both distributed and single device by @RdoubleA in #1731
- Pin urllib3<2.0.0 to fix eleuther eval errors by @RdoubleA in #1738
- Fixing recompiles in KV-cache + compile by @SalmanMohammadi in #1663
- Fix CLIP pos embedding interpolation to work on DTensors by @ebsmothers in #1739
- Bump version to 0.4.0 by @RdoubleA in #1748
- [Feat] Activation offloading for distributed lora recipe by @Jackmin801 in #1645
- Add LR Scheduler to single device full finetune by @user074 in #1350
- Custom recipes use slash path by @RdoubleA in #1760
- Adds repr to Message by @thomasjpfan in #1757
- Fix save adapter weights only by @ebsmothers in #1764
- Set drop_last to always True by @RdoubleA in #1761
- Remove nonexistent flag for acc offloading in memory_optimizations.rst by @janeyx99 in #1772
- [BUGFIX] Adding sequence truncation to max_seq_length in eval recipe by @SalmanMohammadi in #1773
- Add ROCm "support" by @joecummings in #1765
- [BUG] Include system prompt in Phi3 by default by @joecummings in #1778
- Fixing quantization in eval recipe by @SalmanMohammadi in #1777
- Delete deprecated ChatDataset and InstructDataset by @joecummings in #1781
- Add split argument to required builders and set its default value to "train" by @krammnic in #1783
- Fix quantization with generate by @SalmanMohammadi in #1784
- Fix typo in multimodal_datasets.rst by @krammnic in #1787
- Make AlpacaToMessage public. by @krammnic in #1785
- Fix misleading attn_dropout docstring by @ebsmothers in #1792
- Add filter_fn to all generic dataset classes and builders API by @krammnic in #1789
- Set dropout in SDPA to 0.0 when not in training mode by @ebsmothers in #1803
- Skip entire header for llama3 decode by @RdoubleA in #1656
- Remove unused bsz variable by @zhangtemplar in #1805
- Adding max_seq_length to vision eval config by @SalmanMohammadi in #1802
- Add check that there is no PackedDataset while building ConcatDataset by @krammnic in #1796
- Add possibility to pack in _wikitext.py by @krammnic in #1807
- Add evaluation configs under qwen2 dir by @joecummings in #1809
- Fix eos_token problem in all required models by @krammnic in #1806
- Deprecating TiedEmbeddingTransformerDecoder by @SalmanMohammadi in #1815
- Torchao version check changes/BC import of TensorCoreTiledLayout by @ebsmothers in #1812
- 1810 move gemma evaluation by @malinjawi in #1819
- Consistent type checks for prepend and append tags. by @krammnic in #1824
- Move schedulers to training from modules. by @krammnic in #1801
- Update EleutherAI Eval Harness to v0.4.5 by @joecummings in #1800
- 1810 Add evaluation configs under phi3 dir by @Harthi7 in #1822
- Create CITATION.cff by @joecummings in #1756
- fixed error message for GatedRepoError by @DawiAlotaibi in #1832
- 1810 Move mistral evaluation by @Yousof-kayal in #1829
- More consistent trace names. by @krammnic in #1825
- fbcode using TensorCoreLayout by @jerryzh168 in #1834
- Remove pad_max_tiles in CLIP by @pbontrager in #1836
- Remove pad_max_tiles in CLIP inference by @lucylq in #1853
- Add vqa_dataset, update docs by @krammnic in #1820
- Add offloading tests and fix obscure edge case by @janeyx99 in #1860
- Toggling KV-caches by @SalmanMohammadi in #1763
- Caching doc nits by @SalmanMohammadi in #1876
- LoRA typo fix + bias=True by @felipemello1 in #1881
- Correct torchao check for TensorCoreTiledLayout by @joecummings in #1886
- Kd_loss avg over tokens by @moussaKam in #1885
- Support Optimizer-in-the-backward by @mori360 in #1833
- Remove deprecated GemmaTransformerDecoder by @SalmanMohammadi in #1892
- Add PromptTemplate examples by @SalmanMohammadi in #1891
- Temporarily disable building Python 3.13 version of torchtune by @joecummings in #1896
- Block on Python 3.13 version by @joecummings in #1898
- [bug] fix sharding multimodal by @felipemello1 in #1889
- QLoRA with bias + Llama 3.2 Vision QLoRA configs by @ebsmothers in #1726
- Block on Python 3.13 version by @joecummings in #1899
- Normalize CE loss by total number of (non-padding) tokens by @ebsmothers in #1875
- nit: remove (ni...
v0.3.1 (Llama 3.2 Vision patch)
Overview
We've added full support for Llama 3.2 following its announcement. This includes full and LoRA fine-tuning on the Llama3.2-1B and Llama3.2-3B base and instruct text models, and on the Llama3.2-11B-Vision base and instruct models. This means we now support the full end-to-end development of VLMs: fine-tuning, inference, and eval! We've also included a lot more goodies in a few short weeks:
- Llama 3.2 1B/3B/11B Vision configs for full/LoRA fine-tuning
- Updated recipes to support VLMs
- Multimodal eval via EleutherAI
- Support for torch.compile for VLMs
- Revamped generation utilities for multimodal support + batched inference for text only
- New knowledge distillation recipe with configs for Llama3.2 and Qwen2
- Llama 3.1 405B QLoRA fine-tuning on 8xA100s
- MPS support (beta) - you can now use torchtune on Mac!
New Features
Models
Multimodal
- Update recipes for multimodal support (#1548, #1628)
- Multimodal eval via EleutherAI (#1669, #1660)
- Multimodal compile support (#1670)
- Exportable multimodal models (#1541)
Generation
- Revamped generate recipe with multimodal support (#1559, #1563, #1674, #1686)
- Batched inference for text-only models (#1424, #1449, #1603, #1622)
Knowledge Distillation
Memory and Performance
- Compile FFT FSDP (#1573)
- Apply rope on k earlier for efficiency (#1558)
- Streaming offloading in (q)lora single device (#1443)
Quantization
- Update quantization to use tensor subclasses (#1403)
- Add int4 weight-only QAT flow targeting tinygemm kernel (#1570)
RLHF
- Adding generic preference dataset builder (#1623)
Miscellaneous
Documentation
- nits in memory optimizations doc (#1585)
- Tokenizer and prompt template docs (#1567)
- Latexifying IPOLoss docs (#1589)
- modules doc updates (#1588)
- More doc nits (#1611)
- update docs (#1602)
- Update llama3 chat tutorial (#1608)
- Instruct and chat datasets docs (#1571)
- Preference dataset docs (#1636)
- Messages and message transforms docs (#1574)
- Readme Updates (#1664)
- Model transform docs (#1665)
- Multimodal dataset builder + docs (#1667)
- Datasets overview docs (#1668)
- Update README.md (#1676)
- Readme updates for Llama 3.2 (#1680)
- Add 3.2 models to README (#1683)
- Knowledge distillation tutorial (#1698)
- Text completion dataset docs (#1696)
Quality-of-Life Improvements
- Set possible resolutions to debug, not info (#1560)
- Remove TiedEmbeddingTransformerDecoder from Qwen (#1547)
- Make Gemma use regular TransformerDecoder (#1553)
- llama 3_1 instantiate pos embedding only once (#1554)
- Run unit tests against PyTorch nightlies as part of our nightly CI (#1569)
- Support load_dataset kwargs in other dataset builders (#1584)
- add fused = true to adam, except pagedAdam (#1575)
- Move RLHF out of modules (#1591)
- Make logger only log on rank0 for Phi3 loading errors (#1599)
- Move rlhf tests out of modules (#1592)
- Update PR template (#1614)
- Update get_unmasked_sequence_lengths example 4 release (#1613)
- remove ipo loss + small fixed (#1615)
- Fix dora configs (#1618)
- Remove unused var in generate (#1612)
- remove deprecated message (#1619)
- Fix qwen2 config (#1620)
- Proper names for dataset types (#1625)
- Make q optional in sample (#1637)
- Rename JSONToMessages to OpenAIToMessages (#1643)
- update gemma to ignore gguf (#1655)
- Add Pillow >= 9.4 requirement (#1671)
- guard import (#1684)
- add upgrade to pip command (#1687)
- Do not run CI on forked repos (#1681)
Bug Fixes
- Fix flex attention test (#1568)
- Add eom_id to Llama3 Tokenizer (#1586)
- Only merge model weights in LoRA recipe when save_adapter_weights_only=False (#1476)
- Hotfix eval recipe (#1594)
- Fix typo in PPO recipe (#1607)
- Fix lora_dpo_distributed recipe (#1609)
- Fixes for MM Masking and Collation (#1601)
- delete duplicate LoRA dropout fields in DPO configs (#1583)
- Fix tune download command in PPO config (#1593)
- Fix tune run not identifying custom components (#1617)
- Fix compile error in get_causal_mask_from_padding_mask (#1627)
- Fix eval recipe bug for group tasks (#1642)
- Fix basic tokenizer no special tokens (#1640)
- add BlockMask to batch_to_device (#1651)
- Fix PACK_TYPE import in collate (#1659)
- Fix llava_instruct_dataset (#1658)
- convert rgba to rgb (#1678)
New Contributors (auto-generated by GitHub)
- @dvorjackz made their first contribution (#1558)
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Overview
We haven’t had a new release for a little while now, so there is a lot in this one. Some highlights include FSDP2 recipes for full finetune and LoRA(/QLoRA), support for DoRA fine-tuning, a PPO recipe for RLHF, Qwen2 models of various sizes, a ton of improvements to memory and performance (try our recipes with torch compile! try our sample packing with flex attention!), and Comet ML integration. For the full set of perf and memory improvements, we recommend installing with the PyTorch nightlies.
New Features
Here are highlights of some of our new features in 0.3.0.
Recipes
- Full finetune FSDP2 recipe (#1287)
- LoRA FSDP2 recipe with faster training than FSDP1 (#1517)
- RLHF with PPO (#1005)
- DoRA (#1115)
- SimPO (#1223)
Models
- Qwen2 0.5B, 1.5B, 7B model (#1143, #1247)
- Flamingo model components (#1357)
- CLIP encoder and vision transform (#1127)
Perf, memory, and quantization
- Per-layer compile: 90% faster compile time and 75% faster training time (#1419)
- Sample packing with flex attention: 80% faster training time with compile vs unpacked (#1193)
- Chunked cross-entropy to reduce peak memory (#1390)
- Make KV cache optional (#1207)
- Option to save adapter checkpoint only (#1220)
- Delete logits before bwd, saving ~4 GB (#1235)
- Quantize linears without LoRA applied to NF4 (#1119)
- Compile model and loss (#1296, #1319)
- Speed up QLoRA initialization (#1294)
- Set LoRA dropout to 0.0 to save memory (#1492)
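The chunked cross-entropy item above (#1390) rests on a simple idea: compute the loss over slices of the token dimension so that only one slice's intermediate values are live at a time. The sketch below shows that idea in pure Python. It is an illustration with hypothetical function names, not torchtune's tensor-based implementation.

```python
import math


def token_nll(logits: list[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def chunked_cross_entropy(
    logits: list[list[float]], targets: list[int], num_chunks: int
) -> float:
    """Mean cross-entropy over all tokens, accumulated one chunk of tokens at a time."""
    n = len(targets)
    if n == 0 or len(logits) != n or num_chunks <= 0:
        raise ValueError("need matching, non-empty inputs and num_chunks > 0")
    chunk_size = math.ceil(n / num_chunks)
    total = 0.0
    for start in range(0, n, chunk_size):
        # Only this slice is processed here; earlier slices are already reduced to a scalar.
        chunk_logits = logits[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        total += sum(token_nll(row, t) for row, t in zip(chunk_logits, chunk_targets))
    return total / n


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [-2.0, 3.0, 0.0], [0.0, 0.0, 5.0]]
targets = [0, 2, 1, 1, 2]
full = chunked_cross_entropy(logits, targets, num_chunks=1)
chunked = chunked_cross_entropy(logits, targets, num_chunks=3)
print(math.isclose(full, chunked))
```

Chunking changes only when the intermediate values are materialized; the resulting mean loss matches the unchunked computation up to floating-point rounding.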
Data/Datasets
- Multimodal datasets: The Cauldron and LLaVA-Instruct-150K (#1158)
- Multimodal collater (#1156)
- Tokenizer redesign for better model-specific feature support (#1082)
- Create general SFTDataset combining instruct and chat (#1234)
- Interleaved image support in tokenizers (#1138)
- Image transforms for CLIP encoder (#1084)
- Vision cross-attention mask transform (#1141)
- Support images in messages (#1504)
Miscellaneous
- Deep fusion modules (#1338)
- CometLogger integration (#1221)
- Add profiler to full finetune recipes (#1288)
- Support memory viz tool through the profiler (#1382, #1384)
- Add RSO loss (#1197)
- Add support for non-incremental decoding (#973)
- Move utils directory to training (#1432, #1519, …)
- Add bf16 dtype support on CPU (#1218)
- Add grad norm logging (#1451)
Documentation
- QAT tutorial (#1105)
- Recipe docs pages and memory optimizations tutorial (#1230)
- Add download commands to model API docs (#1167)
- Updates to utils API docs (#1170)
Bug Fixes
- Prevent pad ids, special tokens displaying in generate (#1211)
- Reverting Gemma checkpoint logic causing missing head weight (#1168)
- Fix compile on PyTorch 2.4 (#1512)
- Fix Llama 3.1 RoPE init for compile (#1544)
- Fix checkpoint load for FSDP2 with CPU offload (#1495)
- Add missing quantization to Llama 3.1 layers (#1485)
- Fix accuracy number parsing in Eleuther eval test (#1135)
- Allow adding custom system prompt to messages (#1366)
- Cast DictConfig -> dict in instantiate (#1450)
New Contributors (auto-generated by GitHub)
- @sanchitintel made their first contribution in #1218
- @lulmer made their first contribution in #1134
- @stsouko made their first contribution in #1238
- @spider-man-tm made their first contribution in #1220
- @winglian made their first contribution in #1119
- @fyabc made their first contribution in #1143
- @mreso made their first contribution in #1274
- @gau-nernst made their first contribution in #1288
- @lucylq made their first contribution in #1269
- @dzheng256 made their first contribution in #1221
- @ChinoUkaegbu made their first contribution in #1310
- @janeyx99 made their first contribution in #1382
- @Gasoonjia made their first contribution in #1385
- @shivance made their first contribution in #1417
- @yf225 made their first contribution in #1419
- @thomasjpfan made their first contribution in #1363
- @AnuravModak made their first contribution in #1429
- @lindawangg made their first contribution in #1451
- @andrewldesousa made their first contribution in #1470
- @mirceamironenco made their first contribution in #1523
- @mikaylagawarecki made their first contribution in #1315
v0.2.1 (llama3.1 patch)
v0.2.0
Overview
It’s been a while since we’ve done a release and we have a ton of cool, new features in the torchtune library including distributed QLoRA support, new models, sample packing, and more! Check out the New Contributors section for an exhaustive list of new contributors to the repo.
Enjoy the new release and happy tuning!
New Features
Here are some highlights of our new features in v0.2.0.
Recipes
- We added support for QLoRA with FSDP2! This means users can now run 70B+ models on multiple GPUs. We provide example configs for Llama2 7B and 70B sizes. Note: this currently requires you to install PyTorch nightlies to access the FSDP2 methods. (#909)
- Also by leveraging FSDP2, we see a speed up of 12% tokens/sec and a 3.2x speedup in model init over FSDP1 with LoRA (#855)
- We added support for other variants of the Meta-Llama3 recipes including:
- We introduce a quantization-aware training (QAT) recipe. Training with QAT shows significant improvement in model quality if you plan on quantizing your model post-training. (#980)
- torchtune made updates to the eval recipe including:
Models
- Phi-3 Mini-4K-Instruct from Microsoft (#876)
- Gemma 7B from Google (#971)
- Code Llama2: 7B, 13B, and 70B sizes from Meta (#847)
- @salman designed and implemented reward modeling for Mistral models (#840, #991)
Perf, memory, and quantization
- We made improvements to our FSDP + Llama3 recipe, resulting in 13% more savings in allocated memory for the 8B model. (#865)
- Added Int8 per token dynamic activation + int4 per axis grouped weight (8da4w) quantization (#884)
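The 8da4w scheme above pairs per-token dynamic int8 activation quantization with grouped int4 weight quantization. As a rough illustration of the weight side only, the sketch below applies symmetric int4 quantization per group of values. It is a simplification under assumed conventions (symmetric range, scale = max|w| / 7), not the torchao implementation, and the function names are hypothetical.

```python
def quantize_int4_grouped(weights: list[float], group_size: int) -> tuple[list[int], list[float]]:
    """Quantize a flat list of weights to int4 values, with one scale per group."""
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # assumed symmetric scaling
        scales.append(scale)
        # Round to the nearest integer and clamp to the signed int4 range [-8, 7].
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize_grouped(quantized: list[int], scales: list[float], group_size: int) -> list[float]:
    """Map int4 values back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.7]
q, scales = quantize_int4_grouped(weights, group_size=4)
restored = dequantize_grouped(q, scales, group_size=4)
print(q, scales, restored)
```

Using one scale per group rather than one per tensor keeps the rounding error of each weight bounded by half of its own group's scale.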
Data/Datasets
- We added support for a widely requested feature - sample packing! This feature drastically speeds up model training - e.g. 2X faster with the alpaca dataset. (#875, #1109)
- In addition to our instruct tuning, we now also support continued pretraining and include several example datasets like wikitext and CNN DailyMail. (#868)
- Users can now train on multiple datasets using concat datasets (#889)
- We now support OpenAI conversation style data (#890)
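To illustrate the sample packing feature mentioned above, here is a minimal greedy packing sketch in pure Python. It shows the general technique of concatenating short tokenized samples into fixed-budget sequences. It is not torchtune's packing implementation, and the function name and output layout are assumptions made for this example.

```python
def pack_samples(samples: list[list[int]], max_seq_len: int) -> list[dict]:
    """Greedily concatenate tokenized samples into packs of at most max_seq_len tokens."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    packs = []
    tokens, seq_lens = [], []
    for sample in samples:
        sample = sample[:max_seq_len]  # truncate samples that could never fit in a pack
        if not sample:
            continue  # skip empty samples
        if tokens and len(tokens) + len(sample) > max_seq_len:
            # The current pack is full: emit it and start a new one.
            packs.append({"tokens": tokens, "seq_lens": seq_lens})
            tokens, seq_lens = [], []
        tokens = tokens + sample
        seq_lens = seq_lens + [len(sample)]
    if tokens:
        packs.append({"tokens": tokens, "seq_lens": seq_lens})
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
for pack in pack_samples(samples, max_seq_len=6):
    print(pack)
```

Four samples collapse into two packs here, so fewer forward passes are spent on padding. The recorded `seq_lens` are what an attention mask would need to keep packed samples from attending to one another.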
Miscellaneous
- @jeromeku added a much more advanced profiler so users can understand the exact bottlenecks in their LLM training. (#1089)
- We made several metric logging improvements:
- Users can now save models in a safetensor format. (#1096)
- Updated activation checkpointing to support selective layer and selective op activation checkpointing (#785)
- We worked with the Hugging Face team to provide support for loading adapter weights fine tuned via torchtune directly into the PEFT library. (#933)
Documentation
- We wrote a new tutorial for fine-tuning Llama3 with chat data (#823) and revamped the datasets tutorial (#994)
- Looooooooong overdue, but we added proper documentation for the tune CLI (#1052)
- Improved contributing guide (#896)
Bug Fixes
- @Optimox found and fixed a bug to ensure that LoRA dropout was correctly applied (#996)
- Fixed a broken link for Llama3 tutorial in #805
- Fixed Gemma model generation (#1016)
- Bug workaround: to download CNN DailyMail, launch a single device recipe first and once it’s downloaded you can use the dataset for distributed recipes.
New Contributors
- @supernovae made their first contribution in #803
- @eltociear made their first contribution in #814
- @Carolinabanana made their first contribution in #810
- @musab-mk made their first contribution in #818
- @apthagowda97 made their first contribution in #816
- @lessw2020 made their first contribution in #785
- @weifengpy made their first contribution in #843
- @musabgultekin made their first contribution in #857
- @xingyaoww made their first contribution in #890
- @vmoens made their first contribution in #902
- @andrewor14 made their first contribution in #884
- @kunal-mansukhani made their first contribution in #926
- @EvilFreelancer made their first contribution in #889
- @water-vapor made their first contribution in #950
- @Optimox made their first contribution in #995
- @tambulkar made their first contribution in #1011
- @christobill made their first contribution in #1004
- @j-dominguez9 made their first contribution in #1056
- @andyl98 made their first contribution in #1061
- @hmosousa made their first contribution in #1065
- @yasser-sulaiman made their first contribution in #1055
- @parthsarthi03 made their first contribution in #1081
- @mdeff made their first contribution in #1086
- @jeffrey-fong made their first contribution in #1096
- @jeromeku made their first contribution in #1089
- @man-shar made their first contribution in #1126
Full Changelog: v0.1.1...v0.2.0
v0.1.1 (llama3 patch)
torchtune v0.1.0 (first release)
Overview
We are excited to announce the release of torchtune v0.1.0! torchtune is a PyTorch library for easily authoring, fine-tuning and experimenting with LLMs. The library emphasizes 4 key aspects:
- Simplicity and Extensibility. Native-PyTorch, componentized design and easy-to-reuse abstractions
- Correctness. High bar on proving the correctness of components and recipes
- Stability. PyTorch just works. So should torchtune
- Democratizing LLM fine-tuning. Works out-of-the-box on both consumer and professional hardware setups
torchtune is tested with the latest stable PyTorch release (2.2.2) as well as the preview nightly version.
New Features
Here are a few highlights of new features from this release.
Recipes
- Added support for running a LoRA finetune using a single GPU (#454)
- Added support for running a QLoRA finetune using a single GPU (#478)
- Added support for running a LoRA finetune using multiple GPUs with FSDP (#454, #266)
- Added support for running a full finetune using a single GPU (#482)
- Added support for running a full finetune using multiple GPUs with FSDP (#251, #482)
- Added WIP support for DPO (#645)
- Integrated with EleutherAI Eval Harness for an evaluation recipe (#549)
- Added support for quantization through integration with torchao (#632)
- Added support for single-GPU inference (#619)
- Created a config parsing system to interact with recipes through YAML and the command line (#406, #456, #468)
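As an illustration of how a YAML-derived config can be combined with command-line overrides, the sketch below merges `key=value` strings into a nested dictionary. This is a simplified stand-in with hypothetical helper names and an assumed dotted-key override syntax, not torchtune's actual parser.

```python
import copy


def parse_scalar(raw: str):
    """Interpret a command-line string as bool, int, float, or plain string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate sections as needed
        node[leaf] = parse_scalar(raw)
    return result


# `base` stands in for a config loaded from YAML; the keys are illustrative.
base = {"batch_size": 2, "optimizer": {"lr": 3e-4}}
merged = apply_overrides(
    base, ["batch_size=8", "optimizer.lr=1e-4", "enable_activation_offloading=True"]
)
print(merged)
```

Copying the base config before applying overrides keeps the file-derived defaults intact, which makes it easy to compare what a run actually used against what the YAML specified.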
Models
- Added support for Llama2 7B (#70, #137) and 13B (#571)
- Added support for Mistral 7B (#571)
- Added support for Gemma [WIP] (#630, #668)
Datasets
- Added support for instruction and chat-style datasets (#752, #624)
- Included example implementations of datasets (#303, #116, #407, #541, #576, #645)
- Integrated with Hugging Face Datasets (#70)
Utils
- Integrated with Weights & Biases for metric logging (#162, #660)
- Created a checkpointer to handle model files from HF and Meta (#442)
- Added a tune CLI tool (#396)
Documentation
In addition to documenting torchtune’s public facing APIs, we include several new tutorials and “deep-dives” in our documentation.
- Added LoRA tutorial (#368)
- Added “End-to-End Workflow with torchtune” tutorial (#690)
- Added datasets tutorial (#735)
- Added QLoRA tutorial (#693)
- Added deep-dive on the checkpointer (#674)
- Added deep-dive on configs (#311)
- Added deep-dive on recipes (#316)
- Added deep-dive on Weights & Biases integration (#660)
Community Contributions
This release of torchtune features some amazing work from the community:
- Gemma 2B model from @solitude-alive (#630)
- DPO finetuning recipe from @yechenzhi (#645)
- Weights & Biases updates from @tcapelle (#660)