Releases: huggingface/transformers
v4.27.4: Patch release
v4.27.3: Patch release
v4.27.2: Patch release
v4.27.1: Patch release
BridgeTower, Whisper speedup, DETA, SpeechT5, BLIP-2, CLAP, ALIGN, API updates
BridgeTower
The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder, enabling comprehensive and detailed interaction at each layer of the cross-modal encoder. This achieves remarkable performance on various downstream tasks with negligible additional parameters and computational cost.
- Add BridgeTower model by @abhiwand in #20775
- Add loss for BridgeTowerForMaskedLM and BridgeTowerForImageAndTextRetrieval by @abhiwand in #21684
- [WIP] Add BridgeTowerForContrastiveLearning by @abhiwand in #21964
Whisper speedup
The Whisper model was integrated a few releases ago. This release offers significant performance optimizations when generating with timestamps, made possible by rewriting Whisper's `generate()` function, which now uses the `generation_config`, and by implementing batched timestamp prediction. The `language` and `task` can now also be set when calling `generate()`. For more details about this refactoring, check out this colab.
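As a minimal sketch of the new arguments (the checkpoint name and the `audio` variable are placeholders):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Placeholder checkpoint; any Whisper checkpoint should work the same way.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# `audio` is assumed to be a 1-D waveform sampled at 16 kHz.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Language, task and timestamp prediction can now be set at generation time.
predicted_ids = model.generate(
    inputs.input_features,
    language="en",
    task="transcribe",
    return_timestamps=True,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```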
Notably, Whisper is now also supported in Flax 🚀 thanks to @andyehrenberg! More Whisper-related commits:
- [Whisper] Refactor whisper by @ArthurZucker in #21252
- [WHISPER] Small patch by @ArthurZucker in #21307
- [Whisper] another patch by @ArthurZucker in #21324
- add flax whisper implementation by @andyehrenberg in #20479
- Add WhisperTokenizerFast by @jonatanklosko in #21222
- Remove CLI spams with Whisper FeatureExtractor by @qmeeus in #21267
- Update document of WhisperDecoderLayer by @ling0322 in #21621
- [WhisperModel] fix bug in reshaping labels by @jonatasgrosman in #21653
- [Whisper] Add SpecAugment by @bofenghuang in #21298
- Fix-ci-whisper by @ArthurZucker in #21767
- Fix `WhisperModelTest` by @ydshieh in #21883
- [Whisper] Add rescaling function with `do_normalize` by @ArthurZucker in #21263
- Refactor whisper asr pipeline to include language too. by @Narsil in #21427
- Update `model_split_percents` for `WhisperModelTest` by @ydshieh in #21922
- [Whisper] Fix feature normalization in `WhisperFeatureExtractor` by @bofenghuang in #21938
- [Whisper] Add model for audio classification by @sanchit-gandhi in #21754
- fixes the gradient checkpointing of whisper by @soma2000-lang in #22019
- Skip 3 tests for `WhisperEncoderModelTest` by @ydshieh in #22060
- [Whisper] Remove embed_tokens from encoder docstring by @sanchit-gandhi in #21996
- [`Whiper`] add `get_input_embeddings` to `WhisperForAudioClassification` by @younesbelkada in #22133
- [🛠️] Fix-whisper-breaking-changes by @ArthurZucker in #21965
DETA
DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.
- Add DETA by @NielsRogge in #20983
SpeechT5
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
XLM-V
XLM-V is a multilingual language model with a one-million-token vocabulary trained on 2.5TB of data from Common Crawl (the same as XLM-R).
- Add XLM-V to Model Doc by @stefan-it in #21498
BLIP-2
BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
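As a rough sketch of how the model can be queried (the checkpoint name, prompt, and `image` variable are assumptions for illustration):

```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint name for illustration.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# `image` is assumed to be a PIL image; the prompt follows a zero-shot VQA format.
inputs = processor(images=image, text="Question: what is shown in the image? Answer:", return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```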
- Add BLIP-2 by @NielsRogge in #21441
X-MOD
X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
Ernie-M
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.
TVLT
The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.
- Add TVLT by @zinengtang in #20725
CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet given an audio, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similarity score.
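As a rough sketch of how that similarity score can be computed (the checkpoint name and the `audio_sample` variable are assumptions):

```python
from transformers import ClapModel, ClapProcessor

# Assumed checkpoint name for illustration.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# `audio_sample` is assumed to be a 1-D waveform sampled at 48 kHz.
inputs = processor(
    text=["a dog barking", "a piano melody"],
    audios=audio_sample,
    sampling_rate=48_000,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
# Dot product of the projected audio and text features, used as the similarity score.
probs = outputs.logits_per_audio.softmax(dim=-1)
```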
- [CLAP] Add CLAP to the library by @ArthurZucker in #21370
- [`CLAP`] Fix few broken things by @younesbelkada in #21670
GPTSAN
GPTSAN is a Japanese language model based on Switch Transformer. It has the same structure as the model introduced as the Prefix LM in the T5 paper, and supports both text generation and masked language modeling tasks. These basic tasks can similarly be fine-tuned for translation or summarization.
- add GPTSAN model (reopen) by @tanreinama in #21291
EfficientNet
EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
- Add EfficientNet by @alaradirik in #21563
ALIGN
ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
- Add ALIGN to transformers by @alaradirik in #21741
Informer
Informer is a method to be applied to long-sequence time-series forecasting. This method introduces a Probabilistic Attention mechanism to select the “active” queries rather than the “lazy” queries, and provides a sparse Transformer, thus mitigating the quadratic compute and memory requirements of vanilla attention.
API updates and improvements
Safetensors
`safetensors` is a safe serialization format for tensors, which has been supported in `transformers` as a first-class citizen for the past few versions.
This change makes it possible to explicitly force the `from_pretrained` method to use or not use `safetensors`. This unlocks a few use cases, notably the possibility of enforcing loading only from this format, limiting security risks.
Example of usage:
```python
from transformers import AutoModel

# As of version v4.27.0, this loads `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')

# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)

# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)
```
- [Safetensors] Add explicit flag to from pretrained by @patrickvonplaten in #22083
Variant
This PR adds a `variant` keyword argument to the PyTorch `from_pretrained` and `save_pretrained` methods so that multiple weight variants can be saved in the same model repo.
Example of usage with the model hosted in this folder on the Hub:
```python
from transformers import CLIPTextModel

path = "huggingface/the-no-branch-repo"  # or ./text_encoder if local

# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")

# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")
```
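Saving a variant works symmetrically; as a small sketch (the output directory below is just a placeholder), the variant suffix is inserted into the weights filename:

```python
# Saves the weights as `pytorch_model.fp16.bin` in the given directory
# (the directory name is a placeholder).
model.save_pretrained("./text_encoder_local", variant="fp16")
```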
- Add variant to transformers by @patrickvonplaten in #21332
- [Variant] Make sure variant files are not incorrectly deleted by @patrickvonplaten in #21562
bitsandbytes
The `bitsandbytes` integration is overhauled, now offering a new configuration: the `BitsAndBytesConfig`.
Read more about it in the [documentation](https://huggingf...
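As a minimal sketch (the checkpoint name and threshold value are illustrative; `bitsandbytes` and a CUDA GPU are assumed to be available):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 8-bit configuration.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# The checkpoint name is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    quantization_config=quantization_config,
)
```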
v4.26.1: Patch release
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add cPython files in build by @sgugger in #21372
- Fix T5 inference in float16 + bnb error by @younesbelkada in #21281
- Fix import in Accelerate for find_exec_bs by @sgugger in #21501
- Fix inclusion of non py files in package by @sgugger in #21546
v4.26.0: Generation configs, image processors, backbones and plenty of new models!
GenerationConfig
The `generate` method has multiple arguments whose defaults previously lived in the model config. We have now decoupled these into a separate generation config, which makes it easier to store different sets of parameters for a given model, each with different generation strategies. While we will keep supporting generate arguments in the model configuration for the foreseeable future, it is now recommended to use a generation config. You can learn more about its uses here and its documentation here.
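A minimal sketch of the new workflow (the checkpoint and parameter values are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generation parameters now live in a dedicated, reusable config.
generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```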
- Generate: use `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20388
- Generate: TF uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20994
- Generate: FLAX uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #21007
ImageProcessor
In the vision integration, all feature extractor classes have been deprecated and renamed to `ImageProcessor`. The old feature extractors will be fully removed in version 5 of Transformers, and new vision models will only implement the `ImageProcessor` class, so be sure to switch your code to this new name sooner rather than later!
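As a small sketch (the checkpoint name and `image` variable are placeholders), the new class is used exactly like the old feature extractors:

```python
from transformers import AutoImageProcessor

# Placeholder checkpoint; previously this would have been an AutoFeatureExtractor.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# `image` is assumed to be a PIL image or NumPy array.
inputs = image_processor(images=image, return_tensors="pt")
```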
- Add deprecation warning when image FE instantiated by @amyeroberts in #20427
- Vision processors - replace FE with IPs by @amyeroberts in #20590
- Replace FE references by @amyeroberts in #20702
New models
AltCLIP
AltCLIP is a variant of CLIP obtained by switching the text encoder with a pretrained multilingual text encoder (XLM-Roberta). Its performance is very close to CLIP's on almost all tasks, and it extends the original CLIP’s capabilities to multilingual understanding.
BLIP
BLIP is a model that is able to perform various multi-modal tasks including visual question answering, image-text retrieval (image-text matching) and image captioning.
- Add BLIP by @younesbelkada in #20716
BioGPT
BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
- Add BioGPT by @kamalkraj in #20420
BiT
BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.
- Add BiT + ViT hybrid by @NielsRogge in #20550
EfficientFormer
EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.
- Efficientformer by @Bearnardd in #20459
GIT
GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.
- Add GIT (GenerativeImage2Text) by @NielsRogge in #20295
GPT-sw3
GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.
Graphormer
Graphormer is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.
- Graphormer model for Graph Classification by @clefourrier in #20968
Mask2Former
Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.
- Add Mask2Former by @alaradirik and @shivalikasingh95 in #20792
OneFormer
OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.
- Add OneFormer Model by @praeclarumjj3 in #20577
Roberta prelayernorm
The RoBERTa-PreLayerNorm model is identical to RoBERTa but uses the `--encoder-normalize-before` flag in fairseq.
- Implement Roberta PreLayerNorm by @AndreasMadsen in #20305
Swin2SR
Swin2SR improves the SwinIR model by incorporating Swin Transformer v2 layers, which mitigates issues such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger.
- Add Swin2SR by @NielsRogge in #19784
TimeSformer
TimeSformer is the first video transformer. It inspired many transformer-based video understanding and classification papers.
UPerNet
UPerNet is a general framework to effectively segment a wide range of concepts from images, leveraging any vision backbone like ConvNeXt or Swin.
- Add UperNet by @NielsRogge in #20648
Vit Hybrid
ViT hybrid is a slight variant of the plain Vision Transformer, by leveraging a convolutional backbone (specifically, BiT) whose features are used as initial “tokens” for the Transformer. It’s the first architecture that attains similar results to familiar convolutional architectures.
- Add BiT + ViT hybrid by @NielsRogge in #20550
Backbones
Breaking slightly with the one-model-per-file policy, we introduce backbones (mainly for vision models), which can then be re-used in more complex models like DETR, MaskFormer, Mask2Former, etc.
- [NAT, DiNAT] Add backbone class by @NielsRogge in #20654
- Add Swin backbone by @NielsRogge in #20769
- [DETR and friends] Use AutoBackbone as alternative to timm by @NielsRogge in #20833
Bugfixes and improvements
- fix cuda OOM by using single Prior by @ArthurZucker in #20486
- Add ESM contact prediction by @Rocketknight1 in #20535
- flan-t5.mdx: fix link to large model by @szhublox in #20555
- Fix torch device issues by @ydshieh in #20584
- Fix flax GPT-J-6B linking model in tests by @JuanFKurucz in #20556
- [Vision] fix small nit on `BeitDropPath` layers by @younesbelkada in #20587
- Install `natten` with CUDA version by @ydshieh in #20546
- Add entries to `FEATURE_EXTRACTOR_MAPPING_NAMES` by @ydshieh in #20551
- Cleanup some config attributes by @ydshieh in #20554
- [Whisper] Move decoder id method to tokenizer by @sanchit-gandhi in #20589
- Add `require_torch` to 2 pipeline tests by @ydshieh in #20585
- Install `tensorflow_probability` for TF pipeline CI by @ydshieh in #20586
- Ci-whisper-asr by @ArthurZucker in #20588
- cross platform from_pretrained by @ArthurZucker in #20538
- Make convert_to_onnx runable as script again by @mcernusca in #20009
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add RemBERT ONNX config by @hchings in #20520
- Fix link to Swin Model contributor novice03 by @JuanFKurucz in #20557
- Fix link to swin transformers v2 microsoft model by @JuanFKurucz in #20558
- Fix link to table transformer detection microsoft model by @JuanFKurucz in #20560
- clean up unused `classifier_dropout` in config by @ydshieh in #20596
- Fix whisper and speech to text doc by @ArthurZucker in #20595
- Replace `set-output` by `$GITHUB_OUTPUT` by @ydshieh in #20547
- [Vision] `.to` function for ImageProcessors by @younesbelkada in #20536
- [Whisper] Fix decoder ids methods by @sanchit-gandhi in #20599
- Add-whisper-conversion by @ArthurZucker in #20600
- README in Hindi 🇮🇳 by @pacman100 in #20097
- Fix code sample in preprocess by @stevhliu in #20561
- Split autoclasses on modality by @stevhliu in #20559
- Fix test for file not found by @sgugger in #20604
- Rework the pipeline tutorial by @Narsil in #20437
- Documentation fixes by @samuelzxu in #20607
- Adding anchor links to Hindi README by @pacman100 in #20606
- exclude jit time from the speed metric calculation of evaluation and prediction by @sywangyi in #20553
- Check if docstring is None before formating it by @xxyzz in #20592
- updating T5 and BART models to support Prefix Tuning by @pacman100 in #20601
- Fix `AutomaticSpeechRecognitionPipelineTests.run_pipeline_test` by @ydshieh in #20597
- Ci-jukebox by @ArthurZucker in #20613
- Update some GH action versions by @ydshieh in #20537
- Fix dtype of weights in from_pretrained when device_map is set by @sgugger in #20602
- add missing is_decoder param by @stevhliu in #20631
- Fix link to speech encoder decoder model in speech recognition readme by @JuanFKurucz in #20633
- Fix `natten` installation in docker file by @ydshieh in #20632
- Clip floating point constants to bf16 range to avoid inf conversion b...
PyTorch 2.0 support, Audio Spectrogram Transformer, Jukebox, Switch Transformers and more
PyTorch 2.0 stack support
We are very excited by the newly announced PyTorch 2.0 stack. You can enable `torch.compile` on any of our models, and get support with the `Trainer` (and in all our PyTorch examples) by using the `torchdynamo` training argument. For instance, just add `--torchdynamo inductor` when launching those examples from the command line.
This API is still experimental and may be subject to changes as the PyTorch 2.0 stack matures.
Note that to get the best performance, we recommend:
- using an Ampere GPU (or more recent)
- sticking to fixed shapes for now (so use `--pad_to_max_length` in our examples)
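The same setting can also be passed programmatically; a minimal sketch (the output directory is a placeholder) assuming the `Trainer` API:

```python
from transformers import TrainingArguments

# Equivalent to passing `--torchdynamo inductor` on the command line.
training_args = TrainingArguments(
    output_dir="./out",          # placeholder output directory
    torchdynamo="inductor",      # enables the PyTorch 2.0 inductor backend
)
```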
Audio Spectrogram Transformer
The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.
- Add Audio Spectogram Transformer by @NielsRogge in #19981
Jukebox
The Jukebox model was proposed in Jukebox: A generative model for music by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. It introduces a generative music model which can produce minute-long samples that can be conditioned on an artist, genres, and lyrics.
- Add Jukebox model (replaces #16875) by @ArthurZucker in #17826
Switch Transformers
The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.
It is the first MoE model supported in `transformers`, with the largest checkpoint currently available containing 1T parameters.
- Add Switch transformers by @younesbelkada and @ArthurZucker in #19323
RocBert
The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.
CLIPSeg
The CLIPSeg model was proposed in Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen CLIP model for zero- and one-shot image segmentation.
- Add CLIPSeg by @NielsRogge in #20066
NAT and DiNAT
NAT
NAT was proposed in Neighborhood Attention Transformer by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.
DiNAT
DiNAT was proposed in Dilated Neighborhood Attention Transformer by Ali Hassani and Humphrey Shi.
It extends NAT by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.
- Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models by @alihassanijr in #20219
MobileNetV2
The MobileNet model was proposed in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
MobileNetV1
The MobileNet model was proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
Image processors
Image processors replace feature extractors as the processing class for computer vision models.
Important changes:
- The `size` parameter is now a dictionary of `{"height": h, "width": w}`, `{"shortest_edge": s}`, or `{"shortest_edge": s, "longest_edge": l}` instead of an int or tuple.
- Addition of the `data_format` flag. You can now specify if you want your images to be returned in `"channels_first"` (NCHW) or `"channels_last"` (NHWC) format.
- Processing flags e.g. `do_resize` can be passed directly to the `preprocess` method instead of modifying the class attribute: `image_processor([image_1, image_2], do_resize=False, return_tensors="pt", data_format="channels_last")`
- Leaving `return_tensors` unset will return a list of numpy arrays.
The classes are backwards compatible and can be created using existing feature extractor configurations, with the `size` parameter converted.
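A small sketch of the new behaviour (the checkpoint and the `image_1`/`image_2` variables are placeholders):

```python
from transformers import AutoImageProcessor

# Placeholder checkpoint.
image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")

# `size` is now a dictionary rather than an int or tuple.
print(image_processor.size)  # e.g. {"shortest_edge": 224}

# Processing flags can be passed per call instead of mutating class attributes;
# `image_1` and `image_2` are assumed to be PIL images or NumPy arrays.
inputs = image_processor(
    [image_1, image_2],
    do_resize=False,
    return_tensors="pt",
    data_format="channels_last",
)
```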
- Add Image Processors by @amyeroberts in #19796
- Add Donut image processor by @amyeroberts #20425
- Add segmentation + object detection image processors by @amyeroberts in #20160
- AutoImageProcessor by @amyeroberts in #20111
Backbone for computer vision models
We're adding support for a general `AutoBackbone` class, which turns any vision model (like ConvNeXt, Swin Transformer) into a backbone to be used with frameworks like DETR and Mask R-CNN. The design is in early stages and we welcome feedback.
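As a minimal sketch (the checkpoint and stage names are illustrative), a backbone returns feature maps from the requested stages:

```python
import torch
from transformers import AutoBackbone

# Placeholder checkpoint; `out_features` selects which stages to return.
backbone = AutoBackbone.from_pretrained(
    "microsoft/resnet-50", out_features=["stage2", "stage4"]
)

pixel_values = torch.randn(1, 3, 224, 224)
outputs = backbone(pixel_values)

# One feature map per requested stage, ready to be consumed by e.g. a detection head.
for feature_map in outputs.feature_maps:
    print(feature_map.shape)
```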
- Add AutoBackbone + ResNetBackbone by @NielsRogge in #20229
- Improve backbone by @NielsRogge in #20380
- [AutoBackbone] Improve API by @NielsRogge in #20407
Support for `safetensors` offloading
If the model you are using has a `safetensors` checkpoint and you have the library installed, offloading to disk will take advantage of this to be more memory efficient and roughly 33% faster.
Contrastive search in the `generate` method
- Generate: TF contrastive search with XLA support by @gante in #20050
- Generate: contrastive search with full optional outputs by @gante in #19963
Breaking changes
- 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` by @beneyal in #15775
Bugfixes and improvements
- add dataset by @stevhliu in #20005
- Add BERT resources by @stevhliu in #19852
- Add LayoutLMv3 resource by @stevhliu in #19932
- fix typo by @stevhliu in #20006
- Update object detection pipeline to use post_process_object_detection methods by @alaradirik in #20004
- clean up vision/text config dict arguments by @ydshieh in #19954
- make sentencepiece import conditional in bertjapanesetokenizer by @ripose-jp in #20012
- Fix gradient checkpoint test in encoder-decoder by @ydshieh in #20017
- Quality by @sgugger in #20002
- Update auto processor to check image processor created by @amyeroberts in #20021
- [Doctest] Add configuration_deberta_v2.py by @Saad135 in #19995
- Improve model tester by @ydshieh in #19984
- Fix doctest by @ydshieh in #20023
- Show installed libraries and their versions in CI jobs by @ydshieh in #20026
- reorganize glossary by @stevhliu in #20010
- Now supporting pathlike in pipelines too. by @Narsil in #20030
- Add **kwargs by @amyeroberts in #20037
- Fix some doctests after PR 15775 by @ydshieh in #20036
- [Doctest] Add configuration_camembert.py by @Saad135 in #20039
- [Whisper Tokenizer] Make more user-friendly by @sanchit-gandhi in #19921
- [FuturWarning] Add futur warning for LEDForSequenceClassification by @ArthurZucker in #19066
- fix jit trace error for model forward sequence is not aligned with jit.trace tuple input sequence, update related doc by @sywangyi in #19891
- Update esmfold conversion script by @Rocketknight1 in #20028
- Fixed torch.finfo issue with torch.fx by @michaelbenayoun in #20040
- Only resize embeddings when necessary by @sgugger in #20043
- Speed up TF token classification postprocessing by converting complete tensors to numpy by @deutschmn in #19976
- Fix ESM LM head test by @Rocketknight1 in #20045
- Update README.md by @bofenghuang in #20063
- fix `tokenizer_type` to avoid error when loading checkpoint back by @pacman100 in #20062
- [Trainer] Fix model name in push_to_hub by @sanchit-gandhi in #20064
- PoolformerImageProcessor defaults to match previous FE by @amyeroberts in #20048
- change constant torch.tensor to torch.full by @MerHS in #20061
- Update READMEs for ESMFold and add notebooks by @Rocketknight1 in #20067
- Update documentation on seq2seq models with absolute positional embeddings, to be in line with Tips section for BERT and GPT2 by @jordiclive in #20068
- Allow passing arguments to model testers for CLIP-like models by @ydshieh in #20044
- Show installed libraries and their versions in GA jobs by @ydshieh in #20069
- Update defaults and logic to match old FE by @amyeroberts in #20065
- Update modeling_tf_utils.py by @cakiki in #20076
- Update hub.py by @cakiki in #20075
- [Doctest] Add configuration_dpr.py by @Saad135 in #20080
- Removing RobertaConfig inheritance from CamembertConfig by @Saad135 in #20059
- Skip 2 tests in `VisionTextDualEncoderProcessorTest` by @ydshieh in #20098
- Replace unsupported facebookresearch/bitsandbytes by @tomaarsen in #20093
- docs: Resolve many typos in the English docs by @tomaarsen in #20088
- use huggingface_hub.model_inifo() to get pipline_tag by @y-tag in #20077
- Fix `generate_dummy_inputs` for `ImageGPTOnnxConfig` by @ydshieh in #20103
- docs: Fixed variables in f-strings by @tomaarsen in #20087
- Add...
v4.24.0: ESM-2/ESMFold, LiLT, Flan-T5, Table Transformer and Contrastive search decoding
ESM-2/ESMFold
ESM-2 and ESMFold are new state-of-the-art Transformer protein language and folding models from Meta AI's Fundamental AI Research Team (FAIR). ESM-2 is trained with a masked language modeling objective, and it can be easily transferred to sequence and token classification tasks for proteins. Checkpoints exist in various sizes, from 8 million parameters up to a huge 15 billion parameter model.
ESMFold is a state-of-the-art single-sequence protein folding model which produces high accuracy predictions significantly faster. Unlike previous protein folding tools like AlphaFold2 and `openfold`, ESMFold uses a pretrained protein language model to generate token embeddings that are used as input to the folding model, and so does not require a multiple sequence alignment (MSA) of related proteins as input. As a result, proteins can be folded in a single forward pass of the model without requiring any external databases or search/alignment tools to be present at inference time. This hugely reduces the time and compute requirements for folding.
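A minimal sketch of running ESMFold end to end (the checkpoint name and toy sequence are assumptions for illustration):

```python
from transformers import AutoTokenizer, EsmForProteinFolding

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")

# A toy amino-acid sequence; no MSA or external databases are needed.
inputs = tokenizer(["MLKNVQVQLV"], return_tensors="pt", add_special_tokens=False)
outputs = model(**inputs)

# Predicted atom positions of the folded structure.
positions = outputs.positions
```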
Transformer protein language models were introduced in the paper Biological structure and function emerge from scaling
unsupervised learning to 250 million protein sequences by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
ESMFold was introduced in the paper Language models of protein sequences at the scale of evolution enable accurate structure prediction by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives.
- Add ESMFold by @Rocketknight1 in #19977
- TF port of ESM by @Rocketknight1 in #19587
LiLT
LiLT allows combining any pre-trained RoBERTa text encoder with a lightweight Layout Transformer, enabling LayoutLM-like document understanding for many languages.
It was proposed in LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding by Jiapeng Wang, Lianwen Jin, Kai Ding.
- Add LiLT by @NielsRogge in #19450
Flan-T5
FLAN-T5 is an enhanced version of T5 that has been finetuned on a mixture of tasks.
It was released in the paper Scaling Instruction-Finetuned Language Models by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.
- Add `flan-t5` documentation page by @younesbelkada in #19892
Table Transformer
Table Transformer is a model that can perform table extraction and table structure recognition from unstructured documents based on the DETR architecture.
It was proposed in PubTables-1M: Towards comprehensive table extraction from unstructured documents by Brandon Smock, Rohith Pesala, Robin Abraham.
- Add table transformer [v2] by @NielsRogge in #19614
Contrastive search decoding
Contrastive search decoding is a new state-of-the-art generation method which aims at reducing the repetitive patterns into which generation models often fall.
It was introduced in A Contrastive Framework for Neural Text Generation by Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, Nigel Collier.
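A minimal sketch of how contrastive search can be triggered through `generate()` (the checkpoint and hyperparameter values are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("DeepMind Company is", return_tensors="pt")

# Setting `penalty_alpha` > 0 together with `top_k` > 1 activates contrastive search.
outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```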
- Adding the state-of-the-art contrastive search decoding methods for the codebase of generation_utils.py by @gmftbyGMFTBY in #19477
Safety and security
We continue to explore the new pickle-free serialization format via the safetensors library, this time by adding support for TensorFlow models. More checkpoints have been converted to this format. Support is still experimental.
🚨 Breaking changes
The following changes are bugfixes that we have chosen to fix even if it changes the resulting behavior. We mark them as breaking changes, so if you are using this part of the codebase, we recommend you take a look at the PRs to understand what changes were done exactly.
- 🚨🚨🚨 TF: Remove `TFWrappedEmbeddings` (breaking: TF embedding initialization updated for encoder-decoder models) by @gante in #19263
- 🚨🚨🚨 [Breaking change] Deformable DETR intermediate representations by @Narsil in #19678
Bugfixes and improvements
- Enabling custom TF signature draft by @dimitreOliveira in #19249
- Fix whisper for `pipeline` by @ArthurZucker in #19482
- Extend `nested_XXX` functions to mappings/dicts. by @Guillem96 in #19455
- Syntax issues (lines 126, 203) by @kant in #19444
- CLI: add import protection to datasets by @gante in #19470
- Fix `TFGroupViT` CI by @ydshieh in #19461
- Fix doctests for `DeiT` and `TFGroupViT` by @ydshieh in #19466
- Update `WhisperModelIntegrationTests.test_large_batched_generation` by @ydshieh in #19472
- [Swin] Replace hard-coded batch size to enable dynamic ONNX export by @lewtun in #19475
- TF: TFBart embedding initialization by @gante in #19460
- Make LayoutLM tokenizers independent from BertTokenizer by @arnaudstiegler in #19351
- Make `XLMRoberta` model and config independent from `Roberta` by @asofiaoliveira in #19359
- Fix `get_embedding` dtype at init. time by @ydshieh in #19473
- Decouples `XLMProphet` model from `Prophet` by @srhrshr in #19406
- Implement multiple span support for DocumentQuestionAnswering by @ankrgyl in #19204
- Add warning in `generate` & `device_map=auto` & half precision models by @younesbelkada in #19468
- Update TF whisper doc tests by @amyeroberts in #19484
- Make bert_japanese and cpm independent of their inherited modules by @Davidy22 in #19431
- Added tokenize keyword arguments to feature extraction pipeline by @quancore in #19382
- Adding the README_es.md and reference to it in the others files readme by @Oussamaosman02 in #19427
- [CvT] Tensorflow implementation by @mathieujouffroy in #18597
- `python3` instead of `python` in push CI setup job by @ydshieh in #19492
- Update PT to TF CLI for audio models by @amyeroberts in #19465
- New by @IMvision12 in #19481
- Fix `OPTForQuestionAnswering` doctest by @ydshieh in #19479
- Use a dynamic configuration for circleCI tests by @sgugger in #19325
- Add multi-node conditions in trainer_qa.py and trainer_seq2seq.py by @regisss in #19502
- update doc for perf_train_cpu_many by @sywangyi in #19506
- Avoid Push CI failing to report due to many commits being merged by @ydshieh in #19496
- [Doctest] Add `configuration_bert.py` to doctest by @ydshieh in #19485
- Fix whisper doc by @ArthurZucker in #19518
- Syntax issue (line 497, 526) Documentation by @kant in #19442
- Fix pytorch seq2seq qa by @FilipposVentirozos in #19258
- Add depth estimation pipeline by @nandwalritik in #18618
- Adding links to pipelines parameters documentation by @AndreaSottana in #19227
- fix MarkupLMProcessor option flag by @davanstrien in #19526
- [Doctest] Bart configuration update by @imarekkus in #19524
- Remove roberta dependency from longformer fast tokenizer by @sirmammingtonham in #19501
- made tokenization_roformer independent of bert by @naveennamani in #19426
- Remove bert fast dependency from electra by @Threepointone4 in #19520
- [Examples] Fix typos in run speech recognition seq2seq by @sanchit-gandhi in #19514
- [X-CLIP] Fix doc tests by @NielsRogge in #19523
- Update Marian config default vocabulary size by @gante in #19464
- Make `MobileBert` tokenizers independent from `Bert` by @501Good in #19531
- [Whisper] Fix gradient checkpointing by @sanchit-gandhi in #19538
- Syntax issues (paragraphs 122, 130, 147, 155) Documentation: @sgugger by @kant in #19437
- using trunc_normal for weight init & cls_token by @mathieujouffroy in #19486
- Remove `MarkupLMForMaskedLM` from `MODEL_WITH_LM_HEAD_MAPPING_NAMES` by @ydshieh in #19534
- Image transforms library by @amyeroberts in #18520
- Add a decorator for flaky tests by @sgugger in #19498
- [Doctest] Add `configuration_yolos.py` by @daspartho in #19539
- Albert config update by @imarekkus in #19541
- [Doctest] Add `configuration_whisper.py` by @daspartho in #19540
- Throw an error if `getattribute_from_module` can't find anything by @ydshieh in #19535
- [Doctest] Beit Config for doctest by @daspartho in #19542
- Create the arange tensor on device for enabling CUDA-Graph for Clip Encoder by @RezaYazdaniAminabadi in #19503
- [Doctest] GPT2 Config for doctest by @daspartho in #19549
- Build Push CI images also in a daily basis by @ydshieh in #19532
- Fix checkpoint used in `MarkupLMConfig` by @ydshieh in #19547
- add a note to whisper docs clarifying support of long-form decoding by @akashmjn in #19497
- [Whisper] Freeze params of encoder by @sanchit-gandhi in #19527
- [Doctest] Fixing the Doctest for imageGPT config by @RamitPahwa in #19556
- [Doctest] Fixing mobile bert configuration doctest by @RamitPahwa in #19557
- [Doctest] Fixing doctest bert_generation configuration by @Threepointone4 in #19558
- [Doctest] DeiT Config for doctest by @daspartho in #19560
- [Doctest] Reformer Config for doctest by @daspartho in #19562
- [Doctest] RoBERTa Config for doctest by @daspartho in #19563
- [Doctest] Add `configuration_vit.py` by @dasparth...
v4.23.1 Patch release
Fix a revert introduced by mistake that broke the `"automatic-speech-recognition"` pipeline for Whisper.
- Fix whisper for pipeline by @ArthurZucker in #19482