This repository provides scripts and instructions for:
- Training a WordPiece tokenizer on a Dutch dataset (or any other dataset from the Hugging Face Hub).
- Fine-tuning the ModernBERT-base model on the same Dutch dataset, optionally using the custom-trained tokenizer.
It leverages the Hugging Face Transformers, Tokenizers, and Datasets libraries for efficient training. Note that this code currently only supports single-GPU training. Multi-GPU support may be added in the future.
This project is actively in development and welcomes contributions from the community! If you're interested in helping out, please feel free to open issues, submit pull requests, or reach out directly.
- Custom Tokenizer Training (Optional):
  - Trains a WordPiece tokenizer using the `tokenizers` library.
  - Supports streaming datasets for efficient handling of large corpora.
  - Configurable vocabulary size and number of training examples.
- Model Fine-tuning:
  - Fine-tunes the `answerdotai/ModernBERT-base` model (or another specified checkpoint).
  - Uses components from `transformers` for streamlined training.
  - Supports dynamic batching with a custom `DataCollator`.
  - Implements curriculum learning by gradually decreasing the MLM masking probability (see the sketch after this list).
  - Uses gradient accumulation to simulate larger batch sizes.
  - Uses the ADOPT optimizer for improved convergence.
  - Optionally integrates FlashAttention 2 for faster training (requires a compatible GPU; see details below).
  - Includes evaluation steps during training.
  - Automatically pushes intermediate and final models to the Hugging Face Hub.
- Weights & Biases (WandB) Integration (Optional): Tracks and visualizes training runs in real time.
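To illustrate the curriculum-style masking: a minimal sketch, using the stock `DataCollatorForLanguageModeling` as a stand-in for the repository's custom collator, of how the masking probability can be lowered chunk by chunk:

```python
# Illustrative only: apply a decreasing MLM masking probability per curriculum phase.
# train.py uses its own custom DataCollator; this sketch only shows the schedule.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
masking_probabilities = [0.3, 0.2, 0.18, 0.16, 0.14]  # one value per curriculum phase

for phase, mlm_probability in enumerate(masking_probabilities):
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=mlm_probability,
    )
    print(f"phase {phase}: masking probability {mlm_probability}")
    # ... train on the dataset chunk assigned to this phase using `collator` ...
```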
- Hugging Face Account: You need a Hugging Face account. Sign up at huggingface.co if you don't have one.
- Hugging Face API Token: Generate a User Access Token (with "write" access) from your Hugging Face profile settings.
- WandB Account (Optional): Create a free account at wandb.ai.
- WandB API Key (Optional): Get your API key from your WandB settings.
- Environment: A GPU environment is strongly recommended for model fine-tuning. Tokenizer training can be done on a CPU. Currently, only single-GPU training is supported.
- GPU Compatibility for FlashAttention 2: FlashAttention 2 requires a GPU with compute capability >= 7.0. This means Turing (e.g., T4, RTX 20xx), Ampere (e.g., A100, RTX 30xx), Ada Lovelace (e.g., RTX 40xx), or newer architectures.
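To check whether your GPU qualifies, you can query its compute capability with PyTorch (a quick sketch, assuming PyTorch with CUDA support is installed):

```python
# Check the GPU's compute capability before enabling FlashAttention 2.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    print("FlashAttention 2 eligible:", (major, minor) >= (7, 0))
else:
    print("No CUDA GPU detected; tokenizer training will still work on CPU.")
```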
- Clone the Repository:

  ```bash
  git clone https://github.com/s-smits/modernbert-finetune.git
  cd modernbert-finetune
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Set the following environment variables:

```bash
export HUGGINGFACE_TOKEN="your_huggingface_token"
export WANDB_API_KEY="your_wandb_api_key"  # Optional
```

Replace `"your_huggingface_token"` with your actual Hugging Face token and `"your_wandb_api_key"` with your WandB API key.
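For reference, a minimal sketch of how these variables can be consumed from Python, assuming the scripts authenticate via `huggingface_hub` and `wandb` (the exact login code in `train.py` may differ):

```python
# Sketch: read the tokens from the environment and log in.
import os

from huggingface_hub import login

login(token=os.environ["HUGGINGFACE_TOKEN"])  # required for pushing to the Hub

wandb_api_key = os.environ.get("WANDB_API_KEY")  # optional
if wandb_api_key:
    import wandb

    wandb.login(key=wandb_api_key)
```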
The `train.py` script defines several configurable parameters for model fine-tuning; tokenizer training parameters live in `tokenize.py`. You can modify these directly in the files or override them using environment variables.

Tokenizer Training Parameters (`tokenize.py`):
| Parameter | Default Value | Description |
|---|---|---|
| `DATASET_NAME` | `"ssmits/fineweb-2-dutch"` | The name of the dataset on the Hugging Face Hub to use for training. |
| `TOKENIZER_SAVE_PATH` | `"domain_tokenizer"` | The directory to save the trained tokenizer to. |
| `VOCAB_SIZE` | `32768` | The desired vocabulary size. |
| `NUM_EXAMPLES_TO_TRAIN` | `10000` | The number of examples from the dataset to use for training the tokenizer. |
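For reference, a minimal sketch of WordPiece training on a streaming Hub dataset with the `tokenizers` and `datasets` libraries. The actual `tokenize.py` may differ in pre-tokenization, special tokens, and the text column name (assumed to be `text` here):

```python
# Sketch: train a WordPiece tokenizer on a streaming Hugging Face dataset.
import os

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

DATASET_NAME = "ssmits/fineweb-2-dutch"
TOKENIZER_SAVE_PATH = "domain_tokenizer"
VOCAB_SIZE = 32768
NUM_EXAMPLES_TO_TRAIN = 10000

dataset = load_dataset(DATASET_NAME, split="train", streaming=True)

def text_iterator():
    # Yield at most NUM_EXAMPLES_TO_TRAIN texts from the streaming dataset.
    for i, example in enumerate(dataset):
        if i >= NUM_EXAMPLES_TO_TRAIN:
            break
        yield example["text"]  # assumes the dataset exposes a "text" column

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)

os.makedirs(TOKENIZER_SAVE_PATH, exist_ok=True)
tokenizer.save(os.path.join(TOKENIZER_SAVE_PATH, "tokenizer.json"))
```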
Model Fine-tuning Parameters (`train.py`):

| Parameter | Default Value | Description |
|---|---|---|
| `model_checkpoint` | `"answerdotai/ModernBERT-base"` | The base pre-trained ModernBERT model to use. |
| `dataset_name` | `"ssmits/fineweb-2-dutch"` | The name of the dataset on the Hugging Face Hub to use for fine-tuning. |
| `num_train_epochs` | `1` | The number of training epochs. |
| `per_device_train_batch_size` | `4` | The batch size per GPU. Adjust based on your GPU memory. |
| `gradient_accumulation_steps` | `2` | The number of steps to accumulate gradients over before performing an optimizer step. Adjust based on the desired effective batch size and your GPU memory. |
| `eval_size_ratio` | `0.05` | The proportion of the dataset to use for evaluation. |
| `masking_probabilities` | `[0.3, 0.2, 0.18, 0.16, 0.14]` | The curriculum learning masking probabilities. |
| `estimated_dataset_size_in_rows` | `86500000` | The estimated number of rows in your dataset. |
| `username` | `"ssmits"` | Your Hugging Face username. |
| `total_save_limit` | `2` | The maximum number of saved model checkpoints to keep. |
| `push_interval` | `100000` | How often to push the model to the Hugging Face Hub (in steps). |
| `eval_size_per_chunk` | `5000` | The size of the evaluation set used for each chunk in curriculum learning. |
| `learning_rate` | `5e-4` | The learning rate for the optimizer. |
| `weight_decay` | `0.01` | The weight decay for the optimizer. |
| `tokenizer_path` | `"domain_tokenizer"` | Path to the custom tokenizer directory. If it exists and contains `tokenizer.json`, the custom tokenizer is used. Otherwise, the default tokenizer from `model_checkpoint` is loaded. |
If you want to train a new tokenizer:
- Configure Parameters: Adjust tokenizer training parameters (e.g., `VOCAB_SIZE`, `NUM_EXAMPLES_TO_TRAIN`) in `tokenize.py` as needed.

- Run the Script:

  ```bash
  python tokenize.py
  ```

  This will train a tokenizer and save it to the `domain_tokenizer` directory (or the path you specified).
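To sanity-check the result before fine-tuning, you can load the saved tokenizer back with `transformers` (a quick sketch; special-token behavior depends on how `tokenize.py` configures them):

```python
# Sketch: load the freshly trained tokenizer and tokenize a Dutch sentence.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="domain_tokenizer/tokenizer.json")
print(tokenizer.tokenize("Dit is een testzin in het Nederlands."))
```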
- Configure Parameters:
  - Adjust model fine-tuning parameters (e.g., `num_train_epochs`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `repo_name`) in `train.py` as needed.
  - If you trained a custom tokenizer, make sure `tokenizer_path` points to the correct directory. Otherwise, the script will use the default tokenizer from `model_checkpoint`.
- Adjust model fine-tuning parameters (e.g.,
- Login to Hugging Face Hub:

  ```bash
  huggingface-cli login --token $HUGGINGFACE_TOKEN
  ```

- Login to WandB (Optional):

  ```bash
  wandb login --relogin
  ```

- Run the Script:

  ```bash
  python train.py
  ```
This will:
- Load the dataset.
- Load the tokenizer (either your custom tokenizer or the default one from the model checkpoint).
- Load the ModernBERT model.
- Resize the model's embedding layer if you are using a custom tokenizer with a different vocabulary size.
- Fine-tune the model on the dataset using curriculum learning.
- Evaluate the model periodically during training.
- Push intermediate and final models to the Hugging Face Hub.
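The embedding resize corresponds to the standard `transformers` pattern sketched below (assuming the custom tokenizer in `domain_tokenizer`; `train.py` may wrap this differently):

```python
# Sketch: load the base model and resize its embeddings to match a custom vocabulary.
from transformers import AutoModelForMaskedLM, PreTrainedTokenizerFast

model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="domain_tokenizer/tokenizer.json")

if len(tokenizer) != model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
```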
- WandB Dashboard: If you're using WandB, monitor training progress in real-time on your WandB project dashboard.
- Hugging Face Hub: Your fine-tuned model is automatically pushed to your Hugging Face Hub profile under the repository name specified by `repo_name` in `train.py`.
After fine-tuning, use your model for downstream tasks with the Transformers library:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "your_username/modernbert-dutch"  # Replace with your model name on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example: filling in masked tokens
inputs = tokenizer("Het weer is vandaag [MASK].", return_tensors="pt")
outputs = model(**inputs)
# ... process the outputs ...
```
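One way to process the outputs is to decode the highest-scoring token at the masked position (a sketch continuing the snippet above):

```python
# Sketch: pick the most likely token for the [MASK] position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```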
- GPU Memory: ModernBERT is relatively small. Adjust `per_device_train_batch_size` and `gradient_accumulation_steps` to fully utilize your GPU.
- Dataset Size: The script is designed for large, streaming datasets. Set `estimated_dataset_size_in_rows` to match your dataset size.
- Hyperparameter Tuning: Experiment with different hyperparameters (learning rate, masking probabilities, etc.) to find optimal settings.
- Tokenizer Training: If training a new tokenizer, choose `VOCAB_SIZE` and `NUM_EXAMPLES_TO_TRAIN` carefully.
- Evaluation: Customize the evaluation frequency using `eval_interval` in the script.
- Saving: Adjust the saving frequency of intermediate and final models with `push_interval`.
- CUDA Errors: If you get CUDA errors (typically out-of-memory), reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps`.
- Shape Errors: The `fix_batch_inputs` function and `DynamicPaddingDataCollator` handle most shape issues. If you encounter any, ensure your dataset is properly formatted and that you're using the latest `transformers` version.
- Tokenizer Issues: If you have problems loading or using your custom tokenizer, make sure it was saved correctly via `tokenizer.save` in `tokenize.py` and that `TOKENIZER_SAVE_PATH` is accurate.
- FlashAttention 2 Issues: Ensure your GPU is compatible (compute capability >= 7.0). If you encounter errors specific to FlashAttention, try disabling it by setting the environment variable `USE_FLASH_ATTENTION` to `False` (see the sketch after this list).
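The `USE_FLASH_ATTENTION` toggle presumably maps onto the `attn_implementation` argument of `from_pretrained`; a hedged sketch of what that selection could look like (the variable name and parsing follow the description above, not necessarily the exact code in `train.py`):

```python
# Sketch: choose the attention backend based on USE_FLASH_ATTENTION.
import os

from transformers import AutoModelForMaskedLM

use_flash = os.environ.get("USE_FLASH_ATTENTION", "True").lower() != "false"

model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2" if use_flash else "sdpa",  # or "eager"
)
```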
This project is licensed under the MIT License.