Machine Learning (cool) Experiments 🔬 🤗 with Hugging Face's (HF) transformers
If you are interested in Text Generation, we have just added GPT-J 6B, which has a PPL of 3.99 and an ACC of 69.7%. We also provide GPT-Neo 1.3B and 2.7B, as well as the smaller 350M and 125M parameter models. Check here for evaluations.
The following experiments, available through the HF model hub, are supported:
- GPT-J 6B: GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX 🔥
- HuBERT: Self-supervised representation learning for speech recognition, generation, and compression
- zeroshot - NLI-based Zero Shot Text Classification (ZSL)
- nrot - Numerical reasoning over text (NRoT) pretrained models (NT5)
- vit - Vision Transformer (ViT) model pre-trained on ImageNet
- bigbird - Google's sparse-attention-based transformer, which extends Transformer-based models to much longer sequences
- msmarco - Sentence-BERT's MS MARCO models for Semantic Search and Retrieve & Re-Rank 🔥
- luke - LUKE is a RoBERTa model that does named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification 🔥
- colbert - Based on ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- audioseg - Pyannote audio segmentation and speaker diarization 🔥
- asr - automatic speech recognition
- gpt_neo - EleutherAI's replication of the GPT-3 architecture 🔥
- bert - BERT Transformer: Masked Language Modeling, Next Sentence Prediction, Extractive Question Answering 🔥
- summarization - text summarization
- translation - text translation between multiple languages
- sentiment - sentiment analysis
- emotions - emotion detection
- pokemon - Pokémon generator based on the Russian ruDALL-E 🔥
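Most of these experiments are thin wrappers around the `transformers` APIs. As a taste of what a run involves, here is a minimal, self-contained sketch of NLI-based zero-shot classification in the spirit of the `zeroshot` experiment (the model name and candidate labels are illustrative assumptions, not necessarily what the experiment actually uses):

```python
from transformers import pipeline

# NLI-based zero-shot classification: the model scores how well each
# candidate label, rephrased as a hypothesis, is entailed by the input text.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "I am looking for a new laptop with a long battery life.",
    candidate_labels=["electronics", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score
```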
We also propose some additional experiments currently not available on the HF model hub:
- audioset - YAMNet audio classification and VGGish audio embedding on the AudioSet YouTube corpus
- genre - Generative ENtity REtrieval 🔥
- mlpvision - MLP-Mixer, ResMLP, and Perceiver models for Computer Vision
- fewnerd - Few-NERD: Not Only a Few-shot NER Dataset 🔥
- skweak - Weak supervision for NLP 🔥
- projected_gan - NeurIPS 2021 "Projected GANs Converge Faster"
- fasttext - FastText, a library for efficient learning of word representations and sentence classification
- whisper - a general-purpose speech recognition model: multilingual speech recognition, speech translation, spoken language identification, and voice activity detection 🔥
- alphatensor - Discovering faster matrix multiplication algorithms with reinforcement learning, Nature 610 (2022) 🔥
To build the experiments, run

```
./build.sh
```

To build the experiments with GPU support, run

```
./build.sh gpu
```
To run an experiment, run

```
./run.sh [experiment_name] [gpu|cpu] [cache_dir_folder]
```

To run an experiment on GPU, run

```
./run.sh [experiment_name] gpu [cache_dir_folder]
```
The `experiment_name` field must be one of the supported experiment names listed above, while the `cache_dir_folder` parameter is the directory where model files are cached (see below). For example, `./run.sh asr gpu models/` runs the `asr` experiment on GPU, caching model files under `models/`.
To debug the code without running any experiment, run

```
./debug.sh
root@d2f0e8a5ec76:/app#
```

To debug with GPU support, run

```
./debug.sh gpu
```
This will enter the running image `hfexperiments`. You can now run Python scripts manually, like

```
root@d2f0e8a5ec76:/app# python src/asr/run.py
```
NOTE: For preconfigured experiments, please run the `run.py` script from the main folder `/app`, since the cache directories are relative to that path, e.g. `python src/asr/run.py`.
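For reference, here is a minimal sketch of what a script like `src/asr/run.py` might look like, assuming a wav2vec2 checkpoint and a local 16 kHz audio file (both are illustrative placeholders; the repository's actual script may differ):

```python
import os

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Honor the models cache folder passed by run.sh (see the cache_dir notes below).
cache_dir = os.getenv("cache_dir")

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h", cache_dir=cache_dir)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h", cache_dir=cache_dir)

# "sample.wav" is a placeholder; wav2vec2 expects 16 kHz mono audio.
speech, rate = librosa.load("sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```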
We are up-to-date with the latest `transformers`, `PyTorch`, `TensorFlow`, and `Keras` models, and we also provide the most common ML libraries:
```
Package                 Version
----------------------- ------------
transformers            4.5.1
tokenizers              0.10.2
torch                   1.8.1
tensorflow              2.4.1
Keras                   2.4.3
pytorch-lightning       1.2.10
numpy                   1.19.5
tensorboard             2.4.1
sentencepiece           0.1.95
pyannote.core           4.1
librosa                 0.8.0
matplotlib              3.4.1
pandas                  1.2.4
scikit-learn            0.24.2
scipy                   1.6.3
```
Common dependencies are defined in the `requirements.txt` file and currently are:

```
torch
tensorflow
keras
transformers
sentencepiece
soundfile
```
Due to the high rate of new models pushed to the Hugging Face model hub, we provide a `requirements-dev.txt` in order to install the latest `master` branch of `transformers`:

```
./debug.sh
pip install -r requirements-dev.txt
```
Experiment-level dependencies are specified in each experiment's `requirements.txt` file inside the app folder, like `src/asr/requirements.txt` for the `asr` experiment.
Where are model files saved? Model files are typically big, so it is preferable to save them to a custom folder like an external HDD or a shared disk. For this reason, a Docker environment variable `cache_dir` can be specified at run time:

```
./run.sh emotions models/
```
The `models` folder will then be assigned to the `cache_dir` variable and used as the default alternative location to download pretrained models. An `os.getenv("cache_dir")` call retrieves the environment variable in the code.
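As an illustration, a script can pick up that location and hand it to the Hugging Face loaders like this (the checkpoint name is an arbitrary example):

```python
import os

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# cache_dir is the environment variable set by run.sh (e.g. "models/");
# when it is unset, os.getenv returns None and the default HF cache is used.
cache_dir = os.getenv("cache_dir")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir=cache_dir)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", cache_dir=cache_dir)
```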
Some experiments require additional models to be downloaded that are not currently available through the Hugging Face model hub, so a courtesy download script is provided in the experiment's folder (like `genre/models.sh`) for the following experiments:

- audioset
- genre
- megatron

We do not automatically download these files, so please run in debug mode with `debug.sh` and download the models manually before running those experiments. The download only has to be done once, and the model files will be placed in the models cache folder specified by the environment variable `cache_dir`, as happens for models from the Hugging Face Model Hub.