
Releases: huggingface/transformers

v4.27.4: Patch release

29 Mar 17:08
4e9f6fc

This patch fixes a regression with FlauBERT and XLM models.

  • Revert "Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head))" (#21627) in #22444 by @sgugger

v4.27.3: Patch release

23 Mar 19:01
5e3b19a

Enforce max_memory for device_map strategies by @sgugger in #22311

v4.27.2: Patch release

20 Mar 17:42
6828768

v4.27.1: Patch release

15 Mar 19:50

Patches two unwanted breaking changes that were released in v4.27.0:

  • Revert 22152 MaskedImageCompletionOutput changes (#22187)
  • Regression pipeline device (#22190)

BridgeTower, Whisper speedup, DETA, SpeechT5, BLIP-2, CLAP, ALIGN, API updates

15 Mar 16:25
d941f07

BridgeTower

The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder, thus achieving remarkable performance on various downstream tasks with almost negligible additional parameters and computational costs.

Whisper speedup

The Whisper model was integrated a few releases ago. This release offers significant performance optimizations when generating with timestamps. This was made possible by rewriting the generate() function of Whisper, which now uses the generation_config, and by implementing batched timestamp prediction. The language and task can now also be set when calling generate(). For more details about this refactoring, check out this colab.
Notably, Whisper is now also supported in Flax 🚀 thanks to @andyehrenberg!
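
Example of usage (a minimal sketch; the checkpoint is chosen for illustration and `audio_array` is a hypothetical 16 kHz waveform):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# `audio_array` is a 1D float array sampled at 16 kHz (hypothetical input).
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

# Language, task and timestamp prediction can now be set directly on generate().
predicted_ids = model.generate(
    inputs.input_features, language="french", task="transcribe", return_timestamps=True
)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))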

DETA

DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.

SpeechT5

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.

XLM-V

XLM-V is a multilingual language model with a one-million-token vocabulary, trained on 2.5TB of data from Common Crawl (the same as XLM-R).

BLIP-2

BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
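
Example of usage for zero-shot visual question answering (a hedged sketch; the checkpoint, image URL and prompt are chosen for illustration):

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(images=image, text="Question: how many cats are there? Answer:", return_tensors="pt")

generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())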

X-MOD

X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.

Ernie-M

ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.

TVLT

The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.

CLAP

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected into a latent space with identical dimensions. The dot product between the projected audio and text features is then used as a similarity score.
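
Example of usage (a hedged sketch of the scoring described above; the checkpoint is chosen for illustration and `audio_array` is a hypothetical 48 kHz waveform):

from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["the sound of a dog barking", "the sound of rain"]
# `audio_array` is a 1D float array sampled at 48 kHz (hypothetical input).
inputs = processor(text=texts, audios=audio_array, sampling_rate=48000, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_audio.softmax(dim=-1)  # similarity of the audio to each text snippet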

GPTSAN

GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM in the T5 paper, and supports both text generation and masked language modeling tasks. These basic tasks can similarly be fine-tuned for translation or summarization.

EfficientNet

EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.

ALIGN

ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.

Informer

Informer is a method for long-sequence time-series forecasting. It introduces a Probabilistic Attention mechanism to select the “active” queries rather than the “lazy” queries, and provides a sparse Transformer, mitigating the quadratic compute and memory requirements of vanilla attention.

API updates and improvements

Safetensors

safetensors is a safe serialization format for tensors, which has been supported in transformers as a first-class citizen for the past few versions.

This change makes it possible to explicitly force the from_pretrained method to use, or not use, safetensors. This unlocks a few use cases, notably the possibility of enforcing loading only from this format, limiting security risks.

Example of usage:

from transformers import AutoModel

# As of version v4.27.0, this loads the `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')

# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)

# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)

Variant

This PR adds a "variant" keyword argument to the PyTorch from_pretrained and save_pretrained methods, so that multiple weight variants can be saved in the model repo.

Example of usage with the model hosted in this folder on the Hub:

from transformers import CLIPTextModel

path = "huggingface/the-no-branch-repo"  # or ./text_encoder if local

# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")

# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")

bitsandbytes

The bitsandbytes integration is overhauled, now offering a new configuration: BitsAndBytesConfig.

Read more about it in the [documentation](https://huggingf...
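
Example of usage (a minimal sketch of how the new config can be passed to from_pretrained, assuming a GPU and the bitsandbytes library are available; the checkpoint and threshold are chosen for illustration):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Tune the int8 behaviour through the config rather than individual kwargs.
quantization_config = BitsAndBytesConfig(llm_int8_threshold=6.0)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)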


v4.26.1: Patch release

09 Feb 19:30
ae54e3c

v4.26.0: Generation configs, image processors, backbones and plenty of new models!

25 Jan 18:32
820c46a

GenerationConfig

The generate method has multiple arguments whose defaults previously lived in the model config. We have now decoupled these into a separate generation config, which makes it easier to store different sets of parameters for a given model, each with a different generation strategy (see the example after the PR list below). While we will keep supporting generate arguments in the model configuration for the foreseeable future, it is now recommended to use a generation config. You can learn more about its uses here and its documentation here.

  • Generate: use GenerationConfig as the basis for .generate() parametrization by @gante in #20388
  • Generate: TF uses GenerationConfig as the basis for .generate() parametrization by @gante in #20994
  • Generate: FLAX uses GenerationConfig as the basis for .generate() parametrization by @gante in #21007
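
Example of usage (a minimal sketch of the recommended pattern; the model and parameters are chosen for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Store one set of generation parameters independently of the model config.
generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50, pad_token_id=tokenizer.eos_token_id
)

inputs = tokenizer("The new generation config", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))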

ImageProcessor

In the vision integration, all feature extractor classes have been deprecated and renamed to ImageProcessor classes. The old feature extractors will be fully removed in version 5 of Transformers, and new vision models will only implement the ImageProcessor class, so be sure to switch your code to the new name sooner rather than later!
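
The switch is typically a one-line change; a minimal sketch (the checkpoint is chosen for illustration and `image` is a hypothetical PIL image):

from transformers import AutoImageProcessor

# Previously: AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = image_processor(images=image, return_tensors="pt")  # `image` is a PIL image or numpy array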

New models

AltCLIP

AltCLIP is a variant of CLIP obtained by swapping the text encoder for a pretrained multilingual text encoder (XLM-RoBERTa). Its performance is very close to CLIP's on almost all tasks, and it extends the original CLIP's capabilities to multilingual understanding.

BLIP

BLIP is a model that is able to perform various multi-modal tasks including visual question answering, image-text retrieval (image-text matching) and image captioning.

BioGPT

BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.

BiT

BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.

EfficientFormer

EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.

GIT

GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.

GPT-sw3

GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

Graphormer

Graphormer is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention mechanism.

Mask2Former

Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.

OneFormer

OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.

RoBERTa-PreLayerNorm

The RoBERTa-PreLayerNorm model is identical to RoBERTa but uses the --encoder-normalize-before flag in fairseq.

Swin2SR

Swin2SR improves upon the SwinIR model by incorporating Swin Transformer v2 layers, which mitigates issues such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger.

TimeSformer

TimeSformer is the first video transformer. It inspired many transformer-based video understanding and classification papers.

UPerNet

UPerNet is a general framework to effectively segment a wide range of concepts from images, leveraging any vision backbone like ConvNeXt or Swin.

ViT Hybrid

ViT Hybrid is a slight variant of the plain Vision Transformer that leverages a convolutional backbone (specifically, BiT) whose features are used as initial “tokens” for the Transformer. It was the first such architecture to attain results similar to familiar convolutional architectures.

Backbones

Breaking slightly with the one-model-per-file policy, we introduce backbones (mainly for vision models), which can then be reused in more complex models like DETR, MaskFormer, Mask2Former, etc.

Bugfixes and improvements


PyTorch 2.0 support, Audio Spectrogram Transformer, Jukebox, Switch Transformers and more

02 Dec 16:03
31d452c

PyTorch 2.0 stack support

We are very excited by the newly announced PyTorch 2.0 stack. You can enable torch.compile on any of our models, and get support with the Trainer (and in all our PyTorch examples) by using the torchdynamo training argument. For instance, just add --torchdynamo inductor when launching those examples from the command line.

This API is still experimental and may be subject to changes as the PyTorch 2.0 stack matures.

Note that to get the best performance, we recommend:

  • using an Ampere GPU (or more recent)
  • sticking to fixed shapes for now (so use --pad_to_max_length in our examples)
  • Repurpose torchdynamo training args towards torch._dynamo by @sgugger in #20498
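
Outside the example scripts, the same option can be set on TrainingArguments directly; a minimal sketch (values chosen for illustration):

from transformers import TrainingArguments

# Equivalent to passing --torchdynamo inductor on the command line.
training_args = TrainingArguments(
    output_dir="out",
    torchdynamo="inductor",
    per_device_train_batch_size=8,
)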

Audio Spectrogram Transformer

The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.

Jukebox

The Jukebox model was proposed in Jukebox: A generative model for music by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. It introduces a generative music model which can produce minute-long samples that can be conditioned on an artist, genres and lyrics.

Switch Transformers

The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.

It is the first MoE model supported in transformers, with the largest checkpoint currently available containing 1T parameters.

RoCBert

The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.

CLIPSeg

The CLIPSeg model was proposed in Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen CLIP model for zero- and one-shot image segmentation.

NAT and DiNAT

NAT

NAT was proposed in Neighborhood Attention Transformer by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.

It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.

DiNAT

DiNAT was proposed in Dilated Neighborhood Attention Transformer by Ali Hassani and Humphrey Shi.

It extends NAT by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.

  • Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models by @alihassanijr in #20219

MobileNetV2

The MobileNet model was proposed in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.

MobileNetV1

The MobileNet model was proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

Image processors

Image processors replace feature extractors as the processing class for computer vision models.

Important changes:

  • The size parameter is now a dictionary of {"height": h, "width": w}, {"shortest_edge": s}, or {"shortest_edge": s, "longest_edge": l} instead of an int or tuple.
  • Addition of the data_format flag. You can now specify whether you want your images to be returned in "channels_first" (NCHW) or "channels_last" (NHWC) format.
  • Processing flags, e.g. do_resize, can be passed directly to the preprocess method instead of modifying the class attribute: image_processor([image_1, image_2], do_resize=False, return_tensors="pt", data_format="channels_last")
  • Leaving return_tensors unset will return a list of numpy arrays.

The classes are backwards compatible and can be created using existing feature extractor configurations - with the size parameter converted.
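
Example of usage (a minimal sketch using one of the new image processor classes; the checkpoint is chosen for illustration and `image_1`/`image_2` are hypothetical inputs):

from transformers import ViTImageProcessor

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# `image_1` and `image_2` are PIL images or numpy arrays (hypothetical inputs).
inputs = image_processor(
    [image_1, image_2],
    size={"height": 224, "width": 224},
    do_resize=True,
    return_tensors="pt",
    data_format="channels_last",
)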

Backbone for computer vision models

We're adding support for a general AutoBackbone class, which turns any vision model (like ConvNeXt, Swin Transformer) into a backbone to be used with frameworks like DETR and Mask R-CNN. The design is in early stages and we welcome feedback.
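
A hedged sketch using the ResNet backbone that ships alongside this feature (the checkpoint and stage names are chosen for illustration):

import torch
from transformers import AutoBackbone

# Expose selected stages as feature maps for a downstream detection/segmentation head.
backbone = AutoBackbone.from_pretrained("microsoft/resnet-50", out_features=["stage2", "stage4"])

pixel_values = torch.randn(1, 3, 224, 224)
outputs = backbone(pixel_values)
feature_maps = outputs.feature_maps  # one tensor per requested stage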

Support for safetensors offloading

If the model you are using has a safetensors checkpoint and you have the library installed, offload to disk will take advantage of this to be more memory efficient and roughly 33% faster.
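
Nothing changes in user code: disk offload is requested as usual and the safetensors checkpoint is picked up automatically when present. A minimal sketch (the checkpoint is chosen for illustration; requires accelerate):

from transformers import AutoModelForCausalLM

# Offload to disk as usual; with a safetensors checkpoint this path is
# more memory efficient and faster, as noted above.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b", device_map="auto", offload_folder="offload"
)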

Contrastive search in the generate method

  • Generate: TF contrastive search with XLA support by @gante in #20050
  • Generate: contrastive search with full optional outputs by @gante in #19963

Breaking changes

  • 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string by @beneyal in #15775

Bugfixes and improvements


v4.24.0: ESM-2/ESMFold, LiLT, Flan-T5, Table Transformer and Contrastive search decoding

01 Nov 15:45
94b3f54

ESM-2/ESMFold

ESM-2 and ESMFold are new state-of-the-art Transformer protein language and folding models from Meta AI's Fundamental AI Research Team (FAIR). ESM-2 is trained with a masked language modeling objective, and it can be easily transferred to sequence and token classification tasks for proteins. Checkpoints exist in various sizes, from 8 million parameters up to a huge 15 billion parameter model.

ESMFold is a state-of-the-art single-sequence protein folding model which produces high-accuracy predictions significantly faster than previous approaches. Unlike previous protein folding tools like AlphaFold2 and OpenFold, ESMFold uses a pretrained protein language model to generate token embeddings that are used as input to the folding model, and so does not require a multiple sequence alignment (MSA) of related proteins as input. As a result, proteins can be folded in a single forward pass of the model without requiring any external databases or search/alignment tools to be present at inference time. This hugely reduces the time and compute requirements for folding.

Transformer protein language models were introduced in the paper Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.

ESMFold was introduced in the paper Language models of protein sequences at the scale of evolution enable accurate structure prediction by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives.
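
As an illustration of transferring ESM-2 to a downstream task, a hedged sketch of sequence classification with the small 8M checkpoint (the number of labels and the input sequence are chosen for illustration):

from transformers import AutoTokenizer, EsmForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=2)

# A protein sequence is tokenized like any other string.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
logits = model(**inputs).logits  # the classification head is randomly initialized, ready for fine-tuning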

LiLT

LiLT makes it possible to combine any pre-trained RoBERTa text encoder with a lightweight Layout Transformer, enabling LayoutLM-like document understanding for many languages.

It was proposed in LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding by Jiapeng Wang, Lianwen Jin, Kai Ding.

Flan-T5

FLAN-T5 is an enhanced version of T5 that has been finetuned on a mixture of tasks.

It was released in the paper Scaling Instruction-Finetuned Language Models by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.

Table Transformer

Table Transformer is a model that can perform table extraction and table structure recognition from unstructured documents based on the DETR architecture.

It was proposed in PubTables-1M: Towards comprehensive table extraction from unstructured documents by Brandon Smock, Rohith Pesala, Robin Abraham.

Contrastive search decoding

Contrastive search decoding is a new state-of-the-art generation method which aims to reduce the repetitive patterns into which generation models often fall (see the example after the PR reference below).

It was introduced in A Contrastive Framework for Neural Text Generation by Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, Nigel Collier.

  • Adding the state-of-the-art contrastive search decoding methods for the codebase of generation_utils.py by @gmftbyGMFTBY in #19477
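
Contrastive search is activated in generate() by combining penalty_alpha with top_k; a minimal sketch (the model and values are chosen for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("DeepMind Company is", return_tensors="pt")

# penalty_alpha > 0 together with top_k > 1 activates contrastive search.
outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))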

Safety and security

We continue to explore the new pickle-free serialization format provided by the safetensors library, this time by adding support for TensorFlow models. More checkpoints have been converted to this format. Support is still experimental.

🚨 Breaking changes

The following changes are bugfixes that we have chosen to fix even if it changes the resulting behavior. We mark them as breaking changes, so if you are using this part of the codebase, we recommend you take a look at the PRs to understand what changes were done exactly.

  • 🚨🚨🚨 TF: Remove TFWrappedEmbeddings (breaking: TF embedding initialization updated for encoder-decoder models) by @gante in #19263
  • 🚨🚨🚨 [Breaking change] Deformable DETR intermediate representations by @Narsil in #19678

Bugfixes and improvements


v4.23.1: Patch release

11 Oct 12:02
bd469c4

Fixes a revert that was introduced by mistake and broke the "automatic-speech-recognition" pipeline for Whisper.