This repository provides scripts and instructions for:
- Training a WordPiece tokenizer on a Dutch dataset (or any other dataset from the Hugging Face Hub).
- Fine-tuning the ModernBERT-base model on the same Dutch dataset, optionally using the custom-trained tokenizer.
It leverages the Hugging Face Transformers, Tokenizers, and Datasets libraries for efficient training. Note that this code currently only supports single-GPU training. Multi-GPU support may be added in the future.
This project is actively in development and welcomes contributions from the community! If you're interested in helping out, please feel free to open issues, submit pull requests, or reach out directly.
- Custom Tokenizer Training (Optional):
  - Trains a WordPiece tokenizer using the `tokenizers` library.
  - Supports streaming datasets for efficient handling of large corpora.
  - Configurable vocabulary size and number of training examples.
- Model Fine-tuning:
  - Fine-tunes the `answerdotai/ModernBERT-base` model (or another specified checkpoint).
  - Uses components from `transformers` for streamlined training.
  - Supports dynamic batching with a custom `DataCollator`.
  - Implements curriculum learning by gradually decreasing the MLM masking probability (see the sketch after this list).
  - Uses gradient accumulation to simulate larger batch sizes.
  - Uses the ADOPT optimizer for improved convergence.
  - Optionally integrates FlashAttention 2 for faster training (requires a compatible GPU; see details below).
  - Includes evaluation steps during training.
  - Automatically pushes intermediate and final models to the Hugging Face Hub.
- Weights & Biases (WandB) Integration (Optional): Tracks and visualizes training runs in real time.
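To illustrate the curriculum-style masking: a minimal sketch, using the stock `DataCollatorForLanguageModeling` as a stand-in for the repository's custom collator, of how the masking probability can be lowered chunk by chunk:

```python
# Illustrative only: apply a decreasing MLM masking probability per curriculum phase.
# train.py uses its own custom DataCollator; this sketch only shows the schedule.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
masking_probabilities = [0.3, 0.2, 0.18, 0.16, 0.14]  # one value per curriculum phase

for phase, mlm_probability in enumerate(masking_probabilities):
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=mlm_probability,
    )
    print(f"phase {phase}: masking probability {mlm_probability}")
    # ... train on the dataset chunk assigned to this phase using `collator` ...
```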
- Hugging Face Account: You need a Hugging Face account. Sign up at huggingface.co if you don't have one.
- Hugging Face API Token: Generate a User Access Token (with "write" access) from your Hugging Face profile settings.
- WandB Account (Optional): Create a free account at wandb.ai.
- WandB API Key (Optional): Get your API key from your WandB settings.
- Environment: A GPU environment is strongly recommended for model fine-tuning. Tokenizer training can be done on a CPU. Currently, only single-GPU training is supported.
- GPU Compatibility for FlashAttention 2: FlashAttention 2 requires a GPU with compute capability >= 7.0. This means Turing (e.g., T4, RTX 20xx), Ampere (e.g., A100, RTX 30xx), Ada Lovelace (e.g., RTX 40xx), or newer architectures.
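To check whether your GPU qualifies, you can query its compute capability with PyTorch (a quick sketch, assuming PyTorch with CUDA support is installed):

```python
# Check the GPU's compute capability before enabling FlashAttention 2.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    print("FlashAttention 2 eligible:", (major, minor) >= (7, 0))
else:
    print("No CUDA GPU detected; tokenizer training will still work on CPU.")
```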
- Clone the Repository:

  ```bash
  git clone https://github.com/s-smits/modernbert-finetune.git
  cd modernbert-finetune
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Set the following environment variables:

```bash
export HUGGINGFACE_TOKEN="your_huggingface_token"
export WANDB_API_KEY="your_wandb_api_key"  # Optional
```

Replace `"your_huggingface_token"` with your actual Hugging Face token and `"your_wandb_api_key"` with your WandB API key.
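For reference, a minimal sketch of how these variables can be consumed from Python, assuming the scripts authenticate via `huggingface_hub` and `wandb` (the exact login code in `train.py` may differ):

```python
# Sketch: read the tokens from the environment and log in.
import os

from huggingface_hub import login

login(token=os.environ["HUGGINGFACE_TOKEN"])  # required for pushing to the Hub

wandb_api_key = os.environ.get("WANDB_API_KEY")  # optional
if wandb_api_key:
    import wandb

    wandb.login(key=wandb_api_key)
```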
The `train.py` script defines several configurable parameters for model fine-tuning; tokenizer training parameters live in `tokenize.py`. You can modify these directly in the files or override them using environment variables.

Tokenizer Training Parameters (`tokenize.py`):
| Parameter | Default Value | Description |
|---|---|---|
| `DATASET_NAME` | `"ssmits/fineweb-2-dutch"` | The name of the dataset on the Hugging Face Hub to use for training. |
| `TOKENIZER_SAVE_PATH` | `"domain_tokenizer"` | The directory to save the trained tokenizer to. |
| `VOCAB_SIZE` | `32768` | The desired vocabulary size. |
| `NUM_EXAMPLES_TO_TRAIN` | `10000` | The number of examples from the dataset to use for training the tokenizer. |
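For reference, a minimal sketch of WordPiece training on a streaming Hub dataset with the `tokenizers` and `datasets` libraries. The actual `tokenize.py` may differ in pre-tokenization, special tokens, and the text column name (assumed to be `text` here):

```python
# Sketch: train a WordPiece tokenizer on a streaming Hugging Face dataset.
import os

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

DATASET_NAME = "ssmits/fineweb-2-dutch"
TOKENIZER_SAVE_PATH = "domain_tokenizer"
VOCAB_SIZE = 32768
NUM_EXAMPLES_TO_TRAIN = 10000

dataset = load_dataset(DATASET_NAME, split="train", streaming=True)

def text_iterator():
    # Yield at most NUM_EXAMPLES_TO_TRAIN texts from the streaming dataset.
    for i, example in enumerate(dataset):
        if i >= NUM_EXAMPLES_TO_TRAIN:
            break
        yield example["text"]  # assumes the dataset exposes a "text" column

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)

os.makedirs(TOKENIZER_SAVE_PATH, exist_ok=True)
tokenizer.save(os.path.join(TOKENIZER_SAVE_PATH, "tokenizer.json"))
```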
Model Fine-tuning Parameters (`train.py`):

| Parameter | Default Value | Description |
|---|---|---|
| `model_checkpoint` | `"answerdotai/ModernBERT-base"` | The base pre-trained ModernBERT model to use. |
| `dataset_name` | `"ssmits/fineweb-2-dutch"` | The name of the dataset on the Hugging Face Hub to use for fine-tuning. |
| `num_train_epochs` | `1` | The number of training epochs. |
| `per_device_train_batch_size` | `4` | The batch size per GPU. Adjust based on your GPU memory. |
| `gradient_accumulation_steps` | `2` | The number of steps to accumulate gradients over before performing an optimizer step. Adjust based on the desired effective batch size and your GPU memory. |
| `eval_size_ratio` | `0.05` | The proportion of the dataset to use for evaluation. |
| `masking_probabilities` | `[0.3, 0.2, 0.18, 0.16, 0.14]` | The curriculum learning masking probabilities. |
| `estimated_dataset_size_in_rows` | `86500000` | The estimated number of rows in your dataset. |
| `username` | `"ssmits"` | Your Hugging Face username. |
| `total_save_limit` | `2` | The maximum number of saved model checkpoints to keep. |
| `push_interval` | `100000` | How often to push the model to the Hugging Face Hub (in steps). |
| `eval_size_per_chunk` | `5000` | The size of the evaluation set used for each chunk in curriculum learning. |
| `learning_rate` | `5e-4` | The learning rate for the optimizer. |
| `weight_decay` | `0.01` | The weight decay for the optimizer. |
| `tokenizer_path` | `"domain_tokenizer"` | Path to the custom tokenizer directory. If it exists and contains `tokenizer.json`, the custom tokenizer is used. Otherwise, the default tokenizer from `model_checkpoint` is loaded. |
If you want to train a new tokenizer:
- Configure Parameters: Adjust tokenizer training parameters (e.g., `VOCAB_SIZE`, `NUM_EXAMPLES_TO_TRAIN`) in `tokenize.py` as needed.

- Run the Script:

  ```bash
  python tokenize.py
  ```

  This will train a tokenizer and save it to the `domain_tokenizer` directory (or the path you specified).
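To sanity-check the result before fine-tuning, you can load the saved tokenizer back with `transformers` (a quick sketch; special-token behavior depends on how `tokenize.py` configures them):

```python
# Sketch: load the freshly trained tokenizer and tokenize a Dutch sentence.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="domain_tokenizer/tokenizer.json")
print(tokenizer.tokenize("Dit is een testzin in het Nederlands."))
```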
- Configure Parameters:
  - Adjust model fine-tuning parameters (e.g., `num_train_epochs`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `repo_name`) in `train.py` as needed.
  - If you trained a custom tokenizer, make sure `tokenizer_path` points to the correct directory. Otherwise, the script will use the default tokenizer from `model_checkpoint`.
- Adjust model fine-tuning parameters (e.g.,
- Login to Hugging Face Hub:

  ```bash
  huggingface-cli login --token $HUGGINGFACE_TOKEN
  ```

- Login to WandB (Optional):

  ```bash
  wandb login --relogin
  ```

- Run the Script:

  ```bash
  python train.py
  ```
This will:
- Load the dataset.
- Load the tokenizer (either your custom tokenizer or the default one from the model checkpoint).
- Load the ModernBERT model.
- Resize the model's embedding layer if you are using a custom tokenizer with a different vocabulary size.
- Fine-tune the model on the dataset using curriculum learning.
- Evaluate the model periodically during training.
- Push intermediate and final models to the Hugging Face Hub.
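The embedding resize corresponds to the standard `transformers` pattern sketched below (assuming the custom tokenizer in `domain_tokenizer`; `train.py` may wrap this differently):

```python
# Sketch: load the base model and resize its embeddings to match a custom vocabulary.
from transformers import AutoModelForMaskedLM, PreTrainedTokenizerFast

model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="domain_tokenizer/tokenizer.json")

if len(tokenizer) != model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
```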
- WandB Dashboard: If you're using WandB, monitor training progress in real-time on your WandB project dashboard.
- Hugging Face Hub: Your fine-tuned model is automatically pushed to your Hugging Face Hub profile under the repository name specified by `repo_name` in `train.py`.
After fine-tuning, use your model for downstream tasks with the Transformers library:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "your_username/modernbert-dutch"  # Replace with your model name on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example: filling in masked tokens
inputs = tokenizer("Het weer is vandaag [MASK].", return_tensors="pt")
outputs = model(**inputs)
# ... process the outputs ...
```
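One way to process the outputs is to decode the highest-scoring token at the masked position (a sketch continuing the snippet above):

```python
# Sketch: pick the most likely token for the [MASK] position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```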
- GPU Memory: ModernBERT is relatively small. Adjust `per_device_train_batch_size` and `gradient_accumulation_steps` to fully utilize your GPU.
- Dataset Size: The script is designed for large, streaming datasets. Set `estimated_dataset_size_in_rows` to match your dataset size.
- Hyperparameter Tuning: Experiment with different hyperparameters (learning rate, masking probabilities, etc.) to find optimal settings.
- Tokenizer Training: If training a new tokenizer, choose `VOCAB_SIZE` and `NUM_EXAMPLES_TO_TRAIN` carefully.
- Evaluation: Customize the evaluation frequency using `eval_interval` in the script.
- Saving: Adjust the saving frequency of intermediate and final models with `push_interval`.
- CUDA Errors: If you get CUDA errors (typically out-of-memory), reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps`.
- Shape Errors: The `fix_batch_inputs` function and `DynamicPaddingDataCollator` handle most shape issues. If you encounter any, ensure your dataset is properly formatted and that you're using the latest `transformers` version.
- Tokenizer Issues: If you have problems loading or using your custom tokenizer, make sure it was saved correctly via `tokenizer.save` in `tokenize.py` and that `TOKENIZER_SAVE_PATH` is accurate.
- FlashAttention 2 Issues: Ensure your GPU is compatible (compute capability >= 7.0). If you encounter errors specific to FlashAttention, try disabling it by setting the environment variable `USE_FLASH_ATTENTION` to `False` (see the sketch after this list).
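The `USE_FLASH_ATTENTION` toggle presumably maps onto the `attn_implementation` argument of `from_pretrained`; a hedged sketch of what that selection could look like (the variable name and parsing follow the description above, not necessarily the exact code in `train.py`):

```python
# Sketch: choose the attention backend based on USE_FLASH_ATTENTION.
import os

from transformers import AutoModelForMaskedLM

use_flash = os.environ.get("USE_FLASH_ATTENTION", "True").lower() != "false"

model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2" if use_flash else "sdpa",  # or "eager"
)
```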
This project is licensed under the MIT License.