20 Aug 17:51

ArthurZucker

ca56cd7

Patch release v4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
Fix: FA2 with packed training (#32487) by @zucchini-nlp
Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
add back the position ids (#32554) by @ArthurZucker
Use head_dim if in config for RoPE (#32495) @suiyoubi @ArthurZucker
Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
fix multi-gpu with static cache (#32543) by @SunMarc
Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
Fix VLM generation issues (#32836) by @zucchini-nlp
Fix generate with inputs_embeds as input (#32493) (this PR has some cherry-pick)

Full Changelog: v4.44.0...v4.44.1

Contributors

muellerzr, gante, and 7 other contributors

Assets 2

06 Aug 18:39

ArthurZucker

v4.44.0

984bc11

Release v4.44.0

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova

💥 End-to-end generation compile

Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

3-5x faster torch.compile forward compilation for autoregressive decoder models #32227* by @fxmarty .
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

Offloaded KV Cache #31325* by @n17s : you just have to set cache_implementation="offloaded" when calling from_pretrained or using this:

from transformers import GenerationConfig
gen_config = GenerationConfig(cache_implementation="offloaded", # other generation options such as num_beams=4,num_beam_groups=2,num_return_sequences=4,diversity_penalty=1.0,max_new_tokens=50,early_stopping=True)
outputs = model.generate(inputs["input_ids"],generation_config=gen_config)

📦 Torch export for static cache

pytorch team gave us a great gift: you can now use torch.export directly compatible with Executorch! Find examples here.

Make static cache compatible with torch.export #32168 by @guangy10

This also unlocks support for prompt reuse:

import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values = prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values,max_new_tokens=20) 
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache),max_new_tokens=20) 
response = tokenizer.batch_decode(outputs)[0]

Gemma2: assisted decoding

Gemma 2: support assisted generation #32357 by @gante

We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.

# transformers assisted generation reference: 
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding 
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
   reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
   assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
   "assistant_model": assistant_model,
   "do_sample": True,
   "temperature": 0.7,
   "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Nemotron support

Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:

Add Nemotron HF Support #31699

Codestral support

Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It's mamba2 architecture, was a bit of a pain to remove all einops but hope we made it better for everyone!

Add codestral mamba2 #32080 by @molbap and @vasqu

Breaking changes:

We removed the chat template in the code, they should all be on the hub!

🚨 No more default chat templates #31733 by @Rocketknight1

Long-form decoding for whisper, even faster:

Our great @sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in

[whisper] compile compatibility with long-form decoding #31772

What's Changed

Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
Updated ruff to the latest version by @Sai-Suraj-27 in #31926
fix by @gante in #32162
fix: Fixed an if condition that is always evaluating to true by @Sai-Suraj-27 in #32160
[docs] change temperature to a positive value by @faaany in #32077
adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by @rohitdwivedula in #32171
fix: default value reflects the runtime environment variables rather than the ones present at import time. by @junrae6454 in #32153
Update qwen2.md by @ArtificialZeng in #32108
Remove conversational pipeline tests by @amyeroberts in #32099
RoPE: relaxed rope validation by @gante in #32182
let's not warn when someone is running a forward by @ArthurZucker in #32176
Fix resize embedding with Deepspeed by @zucchini-nlp in #32192
Fix float8_e4m3fn in modeling_utils by @SunMarc in #32193
Support dequantizing GGUF FP16 format by @PenutChen in #31783
🚨 No more default chat templates by @Rocketknight1 in #31733
fix: Replaced deprecated unittest method with the correct one by @Sai-Suraj-27 in #32198
[whisper] fix short-form output type by @sanchit-gandhi in https://github....

Contributors

kashif, n17s, and 55 other contributors

Assets 2

05 Aug 10:57

ArthurZucker

v4.43.4

868d36d

v4.43.4 Patch Release

Patch Release v4.43.4

There was a mick mack, now deepseep issues are properly pushed with:

Resize embeds with DeepSpeed #32214

🤗 Enjoy holidays

Assets 2

26 Jul 15:30

ArthurZucker

v4.43.3

47c29cc

v4.43.3 Patch deepspeed

Patch release v4.43.3:
We still saw some bugs so @zucchini-nlp added:
~~- Resize embeds with DeepSpeed #32214~~

don't log base model architecture in wandb if log model is false #32143

Other fixes:

[whisper] fix short-form output type #32178, by @sanchit-gandhi which fixes the short audio temperature fallback!
[BigBird Pegasus] set _supports_param_buffer_assignment to False #32222 by @kashif, mostly related to the new super fast init, some models have to get this set to False. If you see a weird behavior look for that 😉

Contributors

kashif, sanchit-gandhi, and zucchini-nlp

Assets 2

24 Jul 15:50

LysandreJik

v4.43.2

38d94bf

v4.43.2: Patch release

Fix float8_e4m3fn in modeling_utils (#32193)
Fix resize embedding with Deepspeed (#32192)
let's not warn when someone is running a forward (#32176)
RoPE: relaxed rope validation (#32182)

Assets 2

23 Jul 15:55

LysandreJik

v4.43.1

782bfff

v4.43.1: Patch release

fix (#32162)

Assets 2

23 Jul 15:09

LysandreJik

v4.43.0

7fa7508

v4.43.0: Llama 3.1, Chameleon, ZoeDepth, Hiera

Llama

The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.

To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.

We release a repository of llama recipes to showcase usage for inference, total and partial fine-tuning of the different variants.

Chameleon

The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response.

Chameleon: add model by @zucchini-nlp in #31534

ZoeDepth

The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

Add ZoeDepth by @NielsRogge in #30136

Hiera

Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.

Adding hiera by @Namangarg110 in #30356

Agents

Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tools helps the llm_engine find what to return: so we generalized the final_answer tool for all agents.

Adds final answer tool for all agents by @aymeric-roucher in #31703

Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!

Code agent: allow function persistence between steps by @aymeric-roucher in #31769

This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan will be done. At later steps (like steps 3, 6, 9 if you set planning_interval=3 ), this plan will be updated by the agent depending on the history of previous steps. More detail soon!

Agents planning by @aymeric-roucher in #31702

Notable changes to the codebase

A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture.
It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.

Llama: RoPE refactor by @gante in #32135

Breaking changes

TextGenerationPipeline and tokenizer kwargs

🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. This means some models using TextGenerationPipeline previously did not add a <bos> by default, which (negatively) impacted their performance. In practice, this is a breaking change.

Example of a script changed as a result of this PR:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))

🚨🚨 TextGenerationPipeline: rely on the tokenizer default kwargs by @gante in #31747

Bugfixes and improvements

Fix post gemma merge by @ArthurZucker in #31660
Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
[docs] Llama3 by @stevhliu in #31662
[HybridCache] Fix get_seq_length method by @sanchit-gandhi in #31661
don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
Fix Gemma2 4d attention mask by @hiyouga in #31674
Fix return_dict in encodec by @jla524 in #31646
add gather_use_object arguments by @SangbumChoi in #31514
Gemma capping is a must for big models by @ArthurZucker in #31698
Add French version of run scripts tutorial by @jadechoghari in #31483
dependencies: keras-nlp<0.14 pin by @gante in #31684
remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
Move some test files (tets/test_xxx_utils.py) to tests/utils by @ydshieh in #31730
Fix mistral ONNX export by @fxmarty in #31696
[whisper] static kv cache by @sanchit-gandhi in #31166
Make tool JSON schemas consistent by @Rocketknight1 in #31756
Fix documentation for Gemma2. by @jbornschein in #31682
fix assisted decoding by @jiqing-feng in #31401
Requires for torch.tensor before casting by @echarlaix in #31755
handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
Gemma 2: Update slow tests by @gante in #31759
Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
[fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
Fix RT-DETR cache for generate_anchors by @qubvel in #31671
Fix RT-DETR weights initialization by @qubvel in #31724
pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
Fix Gemma2 types by @hiyouga in #31779
Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
Fix serialization for offloaded model by @SunMarc in #31727
Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
Fix gemma tests by @ydshieh in #31794
Add training support for SigLIP by @aliencaocao in #31495
Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
Fix galore lr display with schedulers by @vasqu in #31710
Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
Depth Anything: update conversion script for V2 by @pcuenca in #31522
Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
Add FA2 and sdpa support for SigLIP by @qubvel in #31499
Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
Fix typos by @omahs in #31819
transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
Avoid failure TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
Fix incorrect accelerator device handling for MPS in TrainingArguments by @andstor in #31812
Mamba & RecurrentGemma: enable strict signature by @gante in #31549
Deprecate vocab_size in other two VLMs by @zucchini-nlp in #31681
FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
[Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
chore: remove duplicate words by @hattizai in #31853
save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
Test loading generation config with safetensor weights by @gante in #31550
docs: typo in tf qa example by @chen-keinan in #31864
Generate: Add new decoding strategy "DoLa" in .generate() by @voidism in #29619
Fix _init_weights for ResNetPreTrainedModel by @ydshieh in #31851
Update depth estimation task guide by @merveenoyan in #31860
Bump zip...

Contributors

moses, jbornschein, and 84 other contributors

Assets 2

11 Jul 17:09

ArthurZucker

v4.42.4

fc35907

Patch release v4.42.4

Mostly gemma2 support FA2 softcapping!

but also fix the sliding window for long context and other typos.

[Gemma2] Support FA2 softcapping (#31887) by @ArthurZucker
[ConvertSlow] make sure the order is preserved for addedtokens (#31902) by @ArthurZucker
Fixes to alternating SWA layers in Gemma2 (#31775) by @turboderp
Requires for torch.tensor before casting (#31755) by @echarlaix

Was off last week could not get this out, thanks all for your patience 🥳

Contributors

turboderp, ArthurZucker, and echarlaix

Assets 2

28 Jun 15:35

ArthurZucker

v4.42.3

b7ee1e8

Patch release v4.42.3

Make sure we have attention softcapping for "eager" GEMMA2 model

After experimenting, we noticed that for the 27b model mostly, softcapping is a must. So adding it back (it should have been there, but an error on my side made it disappear) sorry all! 😭

Gemma capping is a must for big models (#31698)

Assets 2

28 Jun 06:42

ArthurZucker

v4.42.2

086c74e

Patch release v4.42.2

Patch release

Thanks to our 2 contributors for their prompt fixing mostly applies for training and FA2!

Fix Gemma2 4d attention mask (#31674) by @hiyouga
don't zero out the attention_mask when using sliding window with flash attention (#31670) by @winglian

Contributors

winglian and hiyouga

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

Contributors

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

💥 End-to-end generation compile

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

📦 Torch export for static cache

Gemma2: assisted decoding

Nemotron support

Codestral support

Breaking changes:

Long-form decoding for whisper, even faster:

What's Changed

Contributors

Patch Release v4.43.4

Contributors

Llama

Chameleon

ZoeDepth

Hiera

Agents

Notable changes to the codebase

Breaking changes

TextGenerationPipeline and tokenizer kwargs

Bugfixes and improvements

Contributors

Mostly gemma2 support FA2 softcapping!

Contributors

Make sure we have attention softcapping for "eager" GEMMA2 model

Patch release

Contributors

Releases: huggingface/transformers

Patch release v4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

Contributors

Release v4.44.0

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

💥 End-to-end generation compile

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

📦 Torch export for static cache

Gemma2: assisted decoding

Nemotron support

Codestral support

Breaking changes:

Long-form decoding for whisper, even faster:

What's Changed

Contributors

v4.43.4 Patch Release

Patch Release v4.43.4

v4.43.3 Patch deepspeed

Contributors

v4.43.2: Patch release

v4.43.1: Patch release

v4.43.0: Llama 3.1, Chameleon, ZoeDepth, Hiera

Llama

Chameleon

ZoeDepth

Hiera

Agents

Notable changes to the codebase

Breaking changes

TextGenerationPipeline and tokenizer kwargs

Bugfixes and improvements

Contributors

Patch release v4.42.4

Mostly gemma2 support FA2 softcapping!

Contributors

Patch release v4.42.3

Make sure we have attention softcapping for "eager" GEMMA2 model

Patch release v4.42.2

Patch release

Contributors