This repository provides a complete pipeline to optimize a language model by truncating its vocabulary based on corpus statistics. The pipeline analyzes multi-language datasets (English and Azerbaijani), filters out rarely used tokens to create a leaner SentencePiece model, updates the tokenizer accordingly, and maintains model performance while significantly reducing size.
- Corpus Analysis: Computes token frequency statistics from English and Azerbaijani text corpora
- Vocabulary Truncation: Filters out infrequent tokens based on predefined thresholds, reducing the overall vocabulary size
- Tokenizer Update: Modifies the SentencePiece model to reflect the truncated vocabulary and updates the tokenizer
- Model Optimization: Creates a smaller version of the model while preserving performance
First, update your system and install required packages:
```bash
# Update package list
sudo apt-get update

# Install required system utilities
sudo apt-get install zip unzip
```
Download and install the Protocol Buffers compiler:
```bash
# Download protoc
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.20.3/protoc-3.20.3-linux-x86_64.zip

# Unzip to a separate directory
unzip protoc-3.20.3-linux-x86_64.zip -d protoc3

# Move binary to system path
sudo mv protoc3/bin/protoc /usr/local/bin/

# Move includes to system include directory
sudo mv protoc3/include/* /usr/local/include/
```
Install required Python packages:
```bash
pip install sentence_transformers pandas transformers protobuf torch tqdm
pip install --upgrade sentencepiece
```
- Prepare Your Data
- Place your corpus files in the project directory:
  - corpus.en - English corpus
  - corpus.az - Azerbaijani corpus
- Each file should contain one sentence per line, tab-separated
- Configure Model Settings
- Set the base model (default is 'BAAI/bge-m3')
- Adjust token frequency thresholds in the code:
- English: tokens appearing ≥10 times
- Azerbaijani: tokens appearing ≥3 times
- Special tokens and tokens with ID ≤100 are always preserved (a sketch of this filtering logic appears after the steps below)
- Run the Pipeline
- Execute the main script to:
- Analyze your corpora
- Create a truncated vocabulary
- Update the model and tokenizer
- Save the optimized model locally (a sketch of the embedding-shrinking step also appears below)
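The exact implementation lives in the main script; the snippet below is only a minimal sketch of the frequency-threshold filtering described in the configuration step. The count_token_frequencies helper and the hard-coded thresholds are illustrative assumptions, not this repository's API.

```python
# Hypothetical sketch of the frequency-based token filtering described above.
from collections import Counter

from transformers import AutoTokenizer

def count_token_frequencies(corpus_path, tokenizer):
    """Count how often each token ID appears in a one-sentence-per-line corpus."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(tokenizer.encode(line.strip(), add_special_tokens=False))
    return counts

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
en_counts = count_token_frequencies("corpus.en", tokenizer)
az_counts = count_token_frequencies("corpus.az", tokenizer)

# Always keep special tokens and IDs <= 100, plus tokens above the per-language thresholds.
keep_ids = set(tokenizer.all_special_ids)
keep_ids.update(range(101))
keep_ids.update(tid for tid, n in en_counts.items() if n >= 10)
keep_ids.update(tid for tid, n in az_counts.items() if n >= 3)
print(f"Tokens kept: {len(keep_ids)}")
```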
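Likewise, the following sketch illustrates the model-update idea: slice the embedding matrix down to the kept token IDs and save the result. The keep_ids.txt input and the bge-m3-truncated output path are assumptions for illustration, and the required SentencePiece/tokenizer rewrite is omitted.

```python
# Hypothetical sketch: shrink the embedding layer to the kept token IDs.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("BAAI/bge-m3")

# Assumed format: one kept token ID per line, produced by the corpus-analysis step.
with open("keep_ids.txt", encoding="utf-8") as f:
    keep_ids = sorted(int(line) for line in f if line.strip())

old_emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.embedding_dim)
with torch.no_grad():
    new_emb.weight.copy_(old_emb.weight[keep_ids])
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep_ids)

# The SentencePiece model and tokenizer must be rewritten to the same ID order
# before this checkpoint is usable; that step is handled by the main script.
model.save_pretrained("bge-m3-truncated")
```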
The pipeline substantially reduces model size while maintaining performance:
- Vocabulary size reduced from 250,000 to 37,367 tokens (~85% reduction)
- Embedding layer size reduced by 82.35% (0.1765 ratio)
- Overall model size reduced by 38.35% (0.6165 ratio)
- Semantic similarity scores remain identical between the original and optimized models
- Example comparison (English-Azerbaijani parallel sentences; see the reproduction sketch below):
```
# Both models produce identical similarity score
Original model similarity: 0.75040686
Optimized model similarity: 0.75040686
```
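A comparison like this can be reproduced with sentence-transformers along the lines of the sketch below; the truncated-model path is a placeholder, and the parallel pair is read from the first line of each corpus file rather than hard-coded.

```python
# Hypothetical sketch for reproducing the similarity comparison.
from sentence_transformers import SentenceTransformer, util

with open("corpus.en", encoding="utf-8") as f:
    english = f.readline().strip()
with open("corpus.az", encoding="utf-8") as f:
    azerbaijani = f.readline().strip()

for name in ("BAAI/bge-m3", "bge-m3-truncated"):  # second path is an assumed local output
    model = SentenceTransformer(name)
    emb = model.encode([english, azerbaijani], convert_to_tensor=True, normalize_embeddings=True)
    print(f"{name} similarity: {util.cos_sim(emb[0], emb[1]).item():.8f}")
```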
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.