Releases: huggingface/transformers
v4.27.4: Patch release
v4.27.3: Patch release
v4.27.2: Patch release
v4.27.1: Patch release
BridgeTower, Whisper speedup, DETA, SpeechT5, BLIP-2, CLAP, ALIGN, API updates
BridgeTower
The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder, enabling comprehensive and detailed interaction at each layer of the cross-modal encoder. This achieves remarkable performance on various downstream tasks with negligible additional parameters and computational cost.
- Add BridgeTower model by @abhiwand in #20775
- Add loss for BridgeTowerForMaskedLM and BridgeTowerForImageAndTextRetrieval by @abhiwand in #21684
- [WIP] Add BridgeTowerForContrastiveLearning by @abhiwand in #21964
Whisper speedup
The Whisper model was integrated a few releases ago. This release offers significant performance optimizations when generating with timestamps, made possible by rewriting Whisper's `generate()` function, which now uses the `generation_config`, and by implementing batched timestamp prediction. The `language` and `task` can now also be set when calling `generate()`. For more details about this refactoring, check out this colab.
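As a minimal sketch of the new arguments (the checkpoint name and the `audio` variable are placeholders):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Placeholder checkpoint; any Whisper checkpoint should work the same way.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# `audio` is assumed to be a 1-D waveform sampled at 16 kHz.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Language, task and timestamp prediction can now be set at generation time.
predicted_ids = model.generate(
    inputs.input_features,
    language="en",
    task="transcribe",
    return_timestamps=True,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```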
Notably, Whisper is now also supported in Flax 🚀 thanks to @andyehrenberg! More Whisper-related commits:
- [Whisper] Refactor whisper by @ArthurZucker in #21252
- [WHISPER] Small patch by @ArthurZucker in #21307
- [Whisper] another patch by @ArthurZucker in #21324
- add flax whisper implementation by @andyehrenberg in #20479
- Add WhisperTokenizerFast by @jonatanklosko in #21222
- Remove CLI spams with Whisper FeatureExtractor by @qmeeus in #21267
- Update document of WhisperDecoderLayer by @ling0322 in #21621
- [WhisperModel] fix bug in reshaping labels by @jonatasgrosman in #21653
- [Whisper] Add SpecAugment by @bofenghuang in #21298
- Fix-ci-whisper by @ArthurZucker in #21767
- Fix `WhisperModelTest` by @ydshieh in #21883
- [Whisper] Add rescaling function with `do_normalize` by @ArthurZucker in #21263
- Refactor whisper asr pipeline to include language too. by @Narsil in #21427
- Update `model_split_percents` for `WhisperModelTest` by @ydshieh in #21922
- [Whisper] Fix feature normalization in `WhisperFeatureExtractor` by @bofenghuang in #21938
- [Whisper] Add model for audio classification by @sanchit-gandhi in #21754
- fixes the gradient checkpointing of whisper by @soma2000-lang in #22019
- Skip 3 tests for `WhisperEncoderModelTest` by @ydshieh in #22060
- [Whisper] Remove embed_tokens from encoder docstring by @sanchit-gandhi in #21996
- [`Whiper`] add `get_input_embeddings` to `WhisperForAudioClassification` by @younesbelkada in #22133
- [🛠️] Fix-whisper-breaking-changes by @ArthurZucker in #21965
DETA
DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.
- Add DETA by @NielsRogge in #20983
SpeechT5
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
XLM-V
XLM-V is a multilingual language model with a one-million-token vocabulary trained on 2.5TB of data from Common Crawl (the same as XLM-R).
- Add XLM-V to Model Doc by @stefan-it in #21498
BLIP-2
BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
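As a rough sketch of how the model can be queried (the checkpoint name, prompt, and `image` variable are assumptions for illustration):

```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint name for illustration.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# `image` is assumed to be a PIL image; the prompt follows a zero-shot VQA format.
inputs = processor(images=image, text="Question: what is shown in the image? Answer:", return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```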
- Add BLIP-2 by @NielsRogge in #21441
X-MOD
X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
Ernie-M
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.
TVLT
The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.
- Add TVLT by @zinengtang in #20725
CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet given an audio, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similarity score.
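As a rough sketch of how that similarity score can be computed (the checkpoint name and the `audio_sample` variable are assumptions):

```python
from transformers import ClapModel, ClapProcessor

# Assumed checkpoint name for illustration.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# `audio_sample` is assumed to be a 1-D waveform sampled at 48 kHz.
inputs = processor(
    text=["a dog barking", "a piano melody"],
    audios=audio_sample,
    sampling_rate=48_000,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
# Dot product of the projected audio and text features, used as the similarity score.
probs = outputs.logits_per_audio.softmax(dim=-1)
```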
- [CLAP] Add CLAP to the library by @ArthurZucker in #21370
- [`CLAP`] Fix few broken things by @younesbelkada in #21670
GPTSAN
GPTSAN is a Japanese language model based on Switch Transformer. It has the same structure as the model introduced as the Prefix LM in the T5 paper, and supports both text generation and masked language modeling tasks. These basic tasks can similarly be fine-tuned for translation or summarization.
- add GPTSAN model (reopen) by @tanreinama in #21291
EfficientNet
EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
- Add EfficientNet by @alaradirik in #21563
ALIGN
ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
- Add ALIGN to transformers by @alaradirik in #21741
Informer
Informer is a method to be applied to long-sequence time-series forecasting. This method introduces a Probabilistic Attention mechanism to select the “active” queries rather than the “lazy” queries, and provides a sparse Transformer, thus mitigating the quadratic compute and memory requirements of vanilla attention.
API updates and improvements
Safetensors
`safetensors` is a safe serialization format for tensors, which has been supported in `transformers` as a first-class citizen for the past few versions.
This change makes it possible to explicitly force the `from_pretrained` method to use or not use `safetensors`. This unlocks a few use cases, notably the possibility of enforcing loading only from this format, limiting security risks.
Example of usage:
```python
from transformers import AutoModel

# As of version v4.27.0, this loads `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')

# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)

# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)
```
- [Safetensors] Add explicit flag to from pretrained by @patrickvonplaten in #22083
Variant
This PR adds a `variant` keyword argument to the PyTorch `from_pretrained` and `save_pretrained` methods so that multiple weight variants can be saved in the same model repo.
Example of usage with the model hosted in this folder on the Hub:
```python
from transformers import CLIPTextModel

path = "huggingface/the-no-branch-repo"  # or ./text_encoder if local

# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")

# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")
```
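Saving a variant works symmetrically; as a small sketch (the output directory below is just a placeholder), the variant suffix is inserted into the weights filename:

```python
# Saves the weights as `pytorch_model.fp16.bin` in the given directory
# (the directory name is a placeholder).
model.save_pretrained("./text_encoder_local", variant="fp16")
```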
- Add variant to transformers by @patrickvonplaten in #21332
- [Variant] Make sure variant files are not incorrectly deleted by @patrickvonplaten in #21562
bitsandbytes
The `bitsandbytes` integration is overhauled, now offering a new configuration: the `BitsAndBytesConfig`.
Read more about it in the [documentation](https://huggingf...
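As a minimal sketch (the checkpoint name and threshold value are illustrative; `bitsandbytes` and a CUDA GPU are assumed to be available):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 8-bit configuration.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# The checkpoint name is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    quantization_config=quantization_config,
)
```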
v4.26.1: Patch release
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add cPython files in build by @sgugger in #21372
- Fix T5 inference in float16 + bnb error by @younesbelkada in #21281
- Fix import in Accelerate for find_exec_bs by @sgugger in #21501
- Fix inclusion of non py files in package by @sgugger in #21546
v4.26.0: Generation configs, image processors, backbones and plenty of new models!
GenerationConfig
The `generate` method has multiple arguments whose defaults previously lived in the model config. We have now decoupled these into a separate generation config, which makes it easier to store different sets of parameters for a given model, each with different generation strategies. While we will keep supporting generate arguments in the model configuration for the foreseeable future, it is now recommended to use a generation config. You can learn more about its uses here and its documentation here.
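A minimal sketch of the new workflow (the checkpoint and parameter values are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generation parameters now live in a dedicated, reusable config.
generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```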
- Generate: use `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20388
- Generate: TF uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20994
- Generate: FLAX uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #21007
ImageProcessor
In the vision integration, all feature extractor classes have been deprecated and renamed to `ImageProcessor`. The old feature extractors will be fully removed in version 5 of Transformers, and new vision models will only implement the `ImageProcessor` class, so be sure to switch your code to this new name sooner rather than later!
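As a small sketch (the checkpoint name and `image` variable are placeholders), the new class is used exactly like the old feature extractors:

```python
from transformers import AutoImageProcessor

# Placeholder checkpoint; previously this would have been an AutoFeatureExtractor.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# `image` is assumed to be a PIL image or NumPy array.
inputs = image_processor(images=image, return_tensors="pt")
```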
- Add deprecation warning when image FE instantiated by @amyeroberts in #20427
- Vision processors - replace FE with IPs by @amyeroberts in #20590
- Replace FE references by @amyeroberts in #20702
New models
AltCLIP
AltCLIP is a variant of CLIP obtained by switching the text encoder with a pretrained multilingual text encoder (XLM-Roberta). Its performance is very close to CLIP's on almost all tasks, and it extends the original CLIP’s capabilities to multilingual understanding.
BLIP
BLIP is a model that is able to perform various multi-modal tasks including visual question answering, image-text retrieval (image-text matching) and image captioning.
- Add BLIP by @younesbelkada in #20716
BioGPT
BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
- Add BioGPT by @kamalkraj in #20420
BiT
BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.
- Add BiT + ViT hybrid by @NielsRogge in #20550
EfficientFormer
EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.
- Efficientformer by @Bearnardd in #20459
GIT
GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.
- Add GIT (GenerativeImage2Text) by @NielsRogge in #20295
GPT-sw3
GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.
Graphormer
Graphormer is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.
- Graphormer model for Graph Classification by @clefourrier in #20968
Mask2Former
Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.
- Add Mask2Former by @alaradirik and @shivalikasingh95 in #20792
OneFormer
OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.
- Add OneFormer Model by @praeclarumjj3 in #20577
Roberta prelayernorm
The RoBERTa-PreLayerNorm model is identical to RoBERTa but uses the `--encoder-normalize-before` flag in fairseq.
- Implement Roberta PreLayerNorm by @AndreasMadsen in #20305
Swin2SR
Swin2SR improves the SwinIR model by incorporating Swin Transformer v2 layers, which mitigates issues such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger.
- Add Swin2SR by @NielsRogge in #19784
TimeSformer
TimeSformer is the first video transformer. It inspired many transformer-based video understanding and classification papers.
UPerNet
UPerNet is a general framework to effectively segment a wide range of concepts from images, leveraging any vision backbone like ConvNeXt or Swin.
- Add UperNet by @NielsRogge in #20648
Vit Hybrid
ViT hybrid is a slight variant of the plain Vision Transformer, by leveraging a convolutional backbone (specifically, BiT) whose features are used as initial “tokens” for the Transformer. It’s the first architecture that attains similar results to familiar convolutional architectures.
- Add BiT + ViT hybrid by @NielsRogge in #20550
Backbones
Breaking slightly with the one-model-per-file policy, we introduce backbones (mainly for vision models), which can then be re-used in more complex models like DETR, MaskFormer, Mask2Former, etc.
- [NAT, DiNAT] Add backbone class by @NielsRogge in #20654
- Add Swin backbone by @NielsRogge in #20769
- [DETR and friends] Use AutoBackbone as alternative to timm by @NielsRogge in #20833
Bugfixes and improvements
- fix cuda OOM by using single Prior by @ArthurZucker in #20486
- Add ESM contact prediction by @Rocketknight1 in #20535
- flan-t5.mdx: fix link to large model by @szhublox in #20555
- Fix torch device issues by @ydshieh in #20584
- Fix flax GPT-J-6B linking model in tests by @JuanFKurucz in #20556
- [Vision] fix small nit on `BeitDropPath` layers by @younesbelkada in #20587
- Install `natten` with CUDA version by @ydshieh in #20546
- Add entries to `FEATURE_EXTRACTOR_MAPPING_NAMES` by @ydshieh in #20551
- Cleanup some config attributes by @ydshieh in #20554
- [Whisper] Move decoder id method to tokenizer by @sanchit-gandhi in #20589
- Add `require_torch` to 2 pipeline tests by @ydshieh in #20585
- Install `tensorflow_probability` for TF pipeline CI by @ydshieh in #20586
- Ci-whisper-asr by @ArthurZucker in #20588
- cross platform from_pretrained by @ArthurZucker in #20538
- Make convert_to_onnx runable as script again by @mcernusca in #20009
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add RemBERT ONNX config by @hchings in #20520
- Fix link to Swin Model contributor novice03 by @JuanFKurucz in #20557
- Fix link to swin transformers v2 microsoft model by @JuanFKurucz in #20558
- Fix link to table transformer detection microsoft model by @JuanFKurucz in #20560
- clean up unused `classifier_dropout` in config by @ydshieh in #20596
- Fix whisper and speech to text doc by @ArthurZucker in #20595
- Replace `set-output` by `$GITHUB_OUTPUT` by @ydshieh in #20547
- [Vision] `.to` function for ImageProcessors by @younesbelkada in #20536
- [Whisper] Fix decoder ids methods by @sanchit-gandhi in #20599
- Add-whisper-conversion by @ArthurZucker in #20600
- README in Hindi 🇮🇳 by @pacman100 in #20097
- Fix code sample in preprocess by @stevhliu in #20561
- Split autoclasses on modality by @stevhliu in #20559
- Fix test for file not found by @sgugger in #20604
- Rework the pipeline tutorial by @Narsil in #20437
- Documentation fixes by @samuelzxu in #20607
- Adding anchor links to Hindi README by @pacman100 in #20606
- exclude jit time from the speed metric calculation of evaluation and prediction by @sywangyi in #20553
- Check if docstring is None before formating it by @xxyzz in #20592
- updating T5 and BART models to support Prefix Tuning by @pacman100 in #20601
- Fix `AutomaticSpeechRecognitionPipelineTests.run_pipeline_test` by @ydshieh in #20597
- Ci-jukebox by @ArthurZucker in #20613
- Update some GH action versions by @ydshieh in #20537
- Fix dtype of weights in from_pretrained when device_map is set by @sgugger in #20602
- add missing is_decoder param by @stevhliu in #20631
- Fix link to speech encoder decoder model in speech recognition readme by @JuanFKurucz in #20633
- Fix `natten` installation in docker file by @ydshieh in #20632
- Clip floating point constants to bf16 range to avoid inf conversion b...
PyTorch 2.0 support, Audio Spectrogram Transformer, Jukebox, Switch Transformers and more
PyTorch 2.0 stack support
We are very excited by the newly announced PyTorch 2.0 stack. You can enable `torch.compile` on any of our models, and get support with the `Trainer` (and in all our PyTorch examples) by using the `torchdynamo` training argument. For instance, just add `--torchdynamo inductor` when launching those examples from the command line.
This API is still experimental and may be subject to changes as the PyTorch 2.0 stack matures.
Note that to get the best performance, we recommend:
- using an Ampere GPU (or more recent)
- sticking to fixed shapes for now (so use `--pad_to_max_length` in our examples)
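The same setting can also be passed programmatically; a minimal sketch (the output directory is a placeholder) assuming the `Trainer` API:

```python
from transformers import TrainingArguments

# Equivalent to passing `--torchdynamo inductor` on the command line.
training_args = TrainingArguments(
    output_dir="./out",          # placeholder output directory
    torchdynamo="inductor",      # enables the PyTorch 2.0 inductor backend
)
```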
Audio Spectrogram Transformer
The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.
- Add Audio Spectogram Transformer by @NielsRogge in #19981
Jukebox
The Jukebox model was proposed in Jukebox: A generative model for music by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. It introduces a generative music model which can produce minute-long samples that can be conditioned on an artist, genres, and lyrics.
- Add Jukebox model (replaces #16875) by @ArthurZucker in #17826
Switch Transformers
The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.
It is the first MoE model supported in `transformers`, with the largest checkpoint currently available containing 1T parameters.
- Add Switch transformers by @younesbelkada and @ArthurZucker in #19323
RocBert
The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.
CLIPSeg
The CLIPSeg model was proposed in Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen CLIP model for zero- and one-shot image segmentation.
- Add CLIPSeg by @NielsRogge in #20066
NAT and DiNAT
NAT
NAT was proposed in Neighborhood Attention Transformer by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.
DiNAT
DiNAT was proposed in Dilated Neighborhood Attention Transformer by Ali Hassani and Humphrey Shi.
It extends NAT by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.
- Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models by @alihassanijr in #20219
MobileNetV2
The MobileNet model was proposed in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
MobileNetV1
The MobileNet model was proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
Image processors
Image processors replace feature extractors as the processing class for computer vision models.
Important changes:
- The `size` parameter is now a dictionary of `{"height": h, "width": w}`, `{"shortest_edge": s}`, or `{"shortest_edge": s, "longest_edge": l}` instead of an int or tuple.
- Addition of the `data_format` flag. You can now specify if you want your images to be returned in `"channels_first"` (NCHW) or `"channels_last"` (NHWC) format.
- Processing flags e.g. `do_resize` can be passed directly to the `preprocess` method instead of modifying the class attribute: `image_processor([image_1, image_2], do_resize=False, return_tensors="pt", data_format="channels_last")`
- Leaving `return_tensors` unset will return a list of numpy arrays.
The classes are backwards compatible and can be created using existing feature extractor configurations, with the `size` parameter converted.
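A small sketch of the new behaviour (the checkpoint and the `image_1`/`image_2` variables are placeholders):

```python
from transformers import AutoImageProcessor

# Placeholder checkpoint.
image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")

# `size` is now a dictionary rather than an int or tuple.
print(image_processor.size)  # e.g. {"shortest_edge": 224}

# Processing flags can be passed per call instead of mutating class attributes;
# `image_1` and `image_2` are assumed to be PIL images or NumPy arrays.
inputs = image_processor(
    [image_1, image_2],
    do_resize=False,
    return_tensors="pt",
    data_format="channels_last",
)
```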
- Add Image Processors by @amyeroberts in #19796
- Add Donut image processor by @amyeroberts #20425
- Add segmentation + object detection image processors by @amyeroberts in #20160
- AutoImageProcessor by @amyeroberts in #20111
Backbone for computer vision models
We're adding support for a general `AutoBackbone` class, which turns any vision model (like ConvNeXt, Swin Transformer) into a backbone to be used with frameworks like DETR and Mask R-CNN. The design is in early stages and we welcome feedback.
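As a minimal sketch (the checkpoint and stage names are illustrative), a backbone returns feature maps from the requested stages:

```python
import torch
from transformers import AutoBackbone

# Placeholder checkpoint; `out_features` selects which stages to return.
backbone = AutoBackbone.from_pretrained(
    "microsoft/resnet-50", out_features=["stage2", "stage4"]
)

pixel_values = torch.randn(1, 3, 224, 224)
outputs = backbone(pixel_values)

# One feature map per requested stage, ready to be consumed by e.g. a detection head.
for feature_map in outputs.feature_maps:
    print(feature_map.shape)
```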
- Add AutoBackbone + ResNetBackbone by @NielsRogge in #20229
- Improve backbone by @NielsRogge in #20380
- [AutoBackbone] Improve API by @NielsRogge in #20407
Support for `safetensors` offloading
If the model you are using has a `safetensors` checkpoint and you have the library installed, offloading to disk will take advantage of this to be more memory efficient and roughly 33% faster.
Contrastive search in the `generate` method
- Generate: TF contrastive search with XLA support by @gante in #20050
- Generate: contrastive search with full optional outputs by @gante in #19963
Breaking changes
- 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` by @beneyal in #15775
Bugfixes and improvements
- add dataset by @stevhliu in #20005
- Add BERT resources by @stevhliu in #19852
- Add LayoutLMv3 resource by @stevhliu in #19932
- fix typo by @stevhliu in #20006
- Update object detection pipeline to use post_process_object_detection methods by @alaradirik in #20004
- clean up vision/text config dict arguments by @ydshieh in #19954
- make sentencepiece import conditional in bertjapanesetokenizer by @ripose-jp in #20012
- Fix gradient checkpoint test in encoder-decoder by @ydshieh in #20017
- Quality by @sgugger in #20002
- Update auto processor to check image processor created by @amyeroberts in #20021
- [Doctest] Add configuration_deberta_v2.py by @Saad135 in #19995
- Improve model tester by @ydshieh in #19984
- Fix doctest by @ydshieh in #20023
- Show installed libraries and their versions in CI jobs by @ydshieh in #20026
- reorganize glossary by @stevhliu in #20010
- Now supporting pathlike in pipelines too. by @Narsil in #20030
- Add **kwargs by @amyeroberts in #20037
- Fix some doctests after PR 15775 by @ydshieh in #20036
- [Doctest] Add configuration_camembert.py by @Saad135 in #20039
- [Whisper Tokenizer] Make more user-friendly by @sanchit-gandhi in #19921
- [FuturWarning] Add futur warning for LEDForSequenceClassification by @ArthurZucker in #19066
- fix jit trace error for model forward sequence is not aligned with jit.trace tuple input sequence, update related doc by @sywangyi in #19891
- Update esmfold conversion script by @Rocketknight1 in #20028
- Fixed torch.finfo issue with torch.fx by @michaelbenayoun in #20040
- Only resize embeddings when necessary by @sgugger in #20043
- Speed up TF token classification postprocessing by converting complete tensors to numpy by @deutschmn in #19976
- Fix ESM LM head test by @Rocketknight1 in #20045
- Update README.md by @bofenghuang in #20063
- fix `tokenizer_type` to avoid error when loading checkpoint back by @pacman100 in #20062
- [Trainer] Fix model name in push_to_hub by @sanchit-gandhi in #20064
- PoolformerImageProcessor defaults to match previous FE by @amyeroberts in #20048
- change constant torch.tensor to torch.full by @MerHS in #20061
- Update READMEs for ESMFold and add notebooks by @Rocketknight1 in #20067
- Update documentation on seq2seq models with absolute positional embeddings, to be in line with Tips section for BERT and GPT2 by @jordiclive in #20068
- Allow passing arguments to model testers for CLIP-like models by @ydshieh in #20044
- Show installed libraries and their versions in GA jobs by @ydshieh in #20069
- Update defaults and logic to match old FE by @amyeroberts in #20065
- Update modeling_tf_utils.py by @cakiki in #20076
- Update hub.py by @cakiki in #20075
- [Doctest] Add configuration_dpr.py by @Saad135 in #20080
- Removing RobertaConfig inheritance from CamembertConfig by @Saad135 in #20059
- Skip 2 tests in `VisionTextDualEncoderProcessorTest` by @ydshieh in #20098
- Replace unsupported facebookresearch/bitsandbytes by @tomaarsen in #20093
- docs: Resolve many typos in the English docs by @tomaarsen in #20088
- use huggingface_hub.model_inifo() to get pipline_tag by @y-tag in #20077
- Fix `generate_dummy_inputs` for `ImageGPTOnnxConfig` by @ydshieh in #20103
- docs: Fixed variables in f-strings by @tomaarsen in #20087
- Add...
v4.24.0: ESM-2/ESMFold, LiLT, Flan-T5, Table Transformer and Contrastive search decoding
ESM-2/ESMFold
ESM-2 and ESMFold are new state-of-the-art Transformer protein language and folding models from Meta AI's Fundamental AI Research Team (FAIR). ESM-2 is trained with a masked language modeling objective, and it can be easily transferred to sequence and token classification tasks for proteins. Checkpoints exist in various sizes, from 8 million parameters up to a huge 15 billion parameter model.
ESMFold is a state-of-the-art single-sequence protein folding model which produces high accuracy predictions significantly faster. Unlike previous protein folding tools like AlphaFold2 and `openfold`, ESMFold uses a pretrained protein language model to generate token embeddings that are used as input to the folding model, and so does not require a multiple sequence alignment (MSA) of related proteins as input. As a result, proteins can be folded in a single forward pass of the model without requiring any external databases or search/alignment tools to be present at inference time. This hugely reduces the time and compute requirements for folding.
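A minimal sketch of running ESMFold end to end (the checkpoint name and toy sequence are assumptions for illustration):

```python
from transformers import AutoTokenizer, EsmForProteinFolding

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")

# A toy amino-acid sequence; no MSA or external databases are needed.
inputs = tokenizer(["MLKNVQVQLV"], return_tensors="pt", add_special_tokens=False)
outputs = model(**inputs)

# Predicted atom positions of the folded structure.
positions = outputs.positions
```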
Transformer protein language models were introduced in the paper Biological structure and function emerge from scaling
unsupervised learning to 250 million protein sequences by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
ESMFold was introduced in the paper Language models of protein sequences at the scale of evolution enable accurate structure prediction by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives.
- Add ESMFold by @Rocketknight1 in #19977
- TF port of ESM by @Rocketknight1 in #19587
LiLT
LiLT allows combining any pre-trained RoBERTa text encoder with a lightweight Layout Transformer, enabling LayoutLM-like document understanding for many languages.
It was proposed in LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding by Jiapeng Wang, Lianwen Jin, Kai Ding.
- Add LiLT by @NielsRogge in #19450
Flan-T5
FLAN-T5 is an enhanced version of T5 that has been finetuned on a mixture of tasks.
It was released in the paper Scaling Instruction-Finetuned Language Models by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.
- Add `flan-t5` documentation page by @younesbelkada in #19892
Table Transformer
Table Transformer is a model that can perform table extraction and table structure recognition from unstructured documents based on the DETR architecture.
It was proposed in PubTables-1M: Towards comprehensive table extraction from unstructured documents by Brandon Smock, Rohith Pesala, Robin Abraham.
- Add table transformer [v2] by @NielsRogge in #19614
Contrastive search decoding
Contrastive search decoding is a new state-of-the-art generation method which aims at reducing the repetitive patterns into which generation models often fall.
It was introduced in A Contrastive Framework for Neural Text Generation by Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, Nigel Collier.
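A minimal sketch of how contrastive search can be triggered through `generate()` (the checkpoint and hyperparameter values are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("DeepMind Company is", return_tensors="pt")

# Setting `penalty_alpha` > 0 together with `top_k` > 1 activates contrastive search.
outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```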
- Adding the state-of-the-art contrastive search decoding methods for the codebase of generation_utils.py by @gmftbyGMFTBY in #19477
Safety and security
We continue to explore the new pickle-free serialization format via the safetensors library, this time by adding support for TensorFlow models. More checkpoints have been converted to this format. Support is still experimental.
🚨 Breaking changes
The following changes are bugfixes that we have chosen to fix even if it changes the resulting behavior. We mark them as breaking changes, so if you are using this part of the codebase, we recommend you take a look at the PRs to understand what changes were done exactly.
- 🚨🚨🚨 TF: Remove `TFWrappedEmbeddings` (breaking: TF embedding initialization updated for encoder-decoder models) by @gante in #19263
- 🚨🚨🚨 [Breaking change] Deformable DETR intermediate representations by @Narsil in #19678
Bugfixes and improvements
- Enabling custom TF signature draft by @dimitreOliveira in #19249
- Fix whisper for `pipeline` by @ArthurZucker in #19482
- Extend `nested_XXX` functions to mappings/dicts. by @Guillem96 in #19455
- Syntax issues (lines 126, 203) by @kant in #19444
- CLI: add import protection to datasets by @gante in #19470
- Fix `TFGroupViT` CI by @ydshieh in #19461
- Fix doctests for `DeiT` and `TFGroupViT` by @ydshieh in #19466
- Update `WhisperModelIntegrationTests.test_large_batched_generation` by @ydshieh in #19472
- [Swin] Replace hard-coded batch size to enable dynamic ONNX export by @lewtun in #19475
- TF: TFBart embedding initialization by @gante in #19460
- Make LayoutLM tokenizers independent from BertTokenizer by @arnaudstiegler in #19351
- Make `XLMRoberta` model and config independent from `Roberta` by @asofiaoliveira in #19359
- Fix `get_embedding` dtype at init. time by @ydshieh in #19473
- Decouples `XLMProphet` model from `Prophet` by @srhrshr in #19406
- Implement multiple span support for DocumentQuestionAnswering by @ankrgyl in #19204
- Add warning in `generate` & `device_map=auto` & half precision models by @younesbelkada in #19468
- Update TF whisper doc tests by @amyeroberts in #19484
- Make bert_japanese and cpm independent of their inherited modules by @Davidy22 in #19431
- Added tokenize keyword arguments to feature extraction pipeline by @quancore in #19382
- Adding the README_es.md and reference to it in the others files readme by @Oussamaosman02 in #19427
- [CvT] Tensorflow implementation by @mathieujouffroy in #18597
- `python3` instead of `python` in push CI setup job by @ydshieh in #19492
- Update PT to TF CLI for audio models by @amyeroberts in #19465
- New by @IMvision12 in #19481
- Fix `OPTForQuestionAnswering` doctest by @ydshieh in #19479
- Use a dynamic configuration for circleCI tests by @sgugger in #19325
- Add multi-node conditions in trainer_qa.py and trainer_seq2seq.py by @regisss in #19502
- update doc for perf_train_cpu_many by @sywangyi in #19506
- Avoid Push CI failing to report due to many commits being merged by @ydshieh in #19496
- [Doctest] Add `configuration_bert.py` to doctest by @ydshieh in #19485
- Fix whisper doc by @ArthurZucker in #19518
- Syntax issue (line 497, 526) Documentation by @kant in #19442
- Fix pytorch seq2seq qa by @FilipposVentirozos in #19258
- Add depth estimation pipeline by @nandwalritik in #18618
- Adding links to pipelines parameters documentation by @AndreaSottana in #19227
- fix MarkupLMProcessor option flag by @davanstrien in #19526
- [Doctest] Bart configuration update by @imarekkus in #19524
- Remove roberta dependency from longformer fast tokenizer by @sirmammingtonham in #19501
- made tokenization_roformer independent of bert by @naveennamani in #19426
- Remove bert fast dependency from electra by @Threepointone4 in #19520
- [Examples] Fix typos in run speech recognition seq2seq by @sanchit-gandhi in #19514
- [X-CLIP] Fix doc tests by @NielsRogge in #19523
- Update Marian config default vocabulary size by @gante in #19464
- Make `MobileBert` tokenizers independent from `Bert` by @501Good in #19531
- [Whisper] Fix gradient checkpointing by @sanchit-gandhi in #19538
- Syntax issues (paragraphs 122, 130, 147, 155) Documentation: @sgugger by @kant in #19437
- using trunc_normal for weight init & cls_token by @mathieujouffroy in #19486
- Remove `MarkupLMForMaskedLM` from `MODEL_WITH_LM_HEAD_MAPPING_NAMES` by @ydshieh in #19534
- Image transforms library by @amyeroberts in #18520
- Add a decorator for flaky tests by @sgugger in #19498
- [Doctest] Add `configuration_yolos.py` by @daspartho in #19539
- Albert config update by @imarekkus in #19541
- [Doctest] Add `configuration_whisper.py` by @daspartho in #19540
- Throw an error if `getattribute_from_module` can't find anything by @ydshieh in #19535
- [Doctest] Beit Config for doctest by @daspartho in #19542
- Create the arange tensor on device for enabling CUDA-Graph for Clip Encoder by @RezaYazdaniAminabadi in #19503
- [Doctest] GPT2 Config for doctest by @daspartho in #19549
- Build Push CI images also in a daily basis by @ydshieh in #19532
- Fix checkpoint used in `MarkupLMConfig` by @ydshieh in #19547
- add a note to whisper docs clarifying support of long-form decoding by @akashmjn in #19497
- [Whisper] Freeze params of encoder by @sanchit-gandhi in #19527
- [Doctest] Fixing the Doctest for imageGPT config by @RamitPahwa in #19556
- [Doctest] Fixing mobile bert configuration doctest by @RamitPahwa in #19557
- [Doctest] Fixing doctest bert_generation configuration by @Threepointone4 in #19558
- [Doctest] DeiT Config for doctest by @daspartho in #19560
- [Doctest] Reformer Config for doctest by @daspartho in #19562
- [Doctest] RoBERTa Config for doctest by @daspartho in #19563
- [Doctest] Add `configuration_vit.py` by @dasparth...
v4.23.1 Patch release
Fix a revert introduced by mistake that broke the `"automatic-speech-recognition"` pipeline for Whisper.
- Fix whisper for pipeline by @ArthurZucker in #19482