The algorithm compresses text by first replacing the most frequent pair of characters with a newly generated token. In every following iteration the currently most frequent pair of tokens is again replaced with a new token. This loop continues until the most frequent pair of tokens occurs only once. For example, in "abababc" the pair "ab" occurs three times and is the first one to be replaced.
A look-up table of the replacements is required to rebuild the initial dataset.
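Decompression then walks that look-up table recursively. Below is a minimal sketch of the expansion, assuming the Pair look-up array described further down and an stb_ds-style dynamic char array for the output; the function name render_token is illustrative, not taken from this repository.

```c
#include <stdint.h>
#include "stb_ds.h"

typedef struct { uint32_t l, r; } Pair;

// Expand one token back into plain characters (illustrative sketch).
void render_token(Pair *pairs, uint32_t token, char **out)
{
    if (pairs[token].r == 0) {
        arrput(*out, (char)pairs[token].l);       // leaf: a plain character
    } else {
        render_token(pairs, pairs[token].l, out); // expand left half,
        render_token(pairs, pairs[token].r, out); // then right half
    }
}
```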
git clone [email protected]:toxypiks/byte_pair_encoding.git
cd byte_pair_encoding
mkdir build
cd build
cmake ..
make
./main > pair.dot
dot -Tpng pair.dot -o pair.png
Freq *freq
-> hashmap mapping token pairs (keys -> Pair {.l, .r}) to their occurrence counts (values)
uint32_t *tokens_in
-> token stream before the next compression iteration
uint32_t *tokens_out
-> compressed token stream produced by the iteration
Pair *pairs = NULL;
-> look-up array in which the indices 1-255 equal the .l value; a .r value of zero indicates that the index corresponds to the ASCII value of the character. Every newly generated token gets appended, representing the pair with the highest occurrence within its loop iteration
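The declarations behind these descriptions might look like this (a sketch assuming stb_ds-style containers, not necessarily the exact layout used in this repository):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t l, r;    // left and right token of a pair
} Pair;

typedef struct {
    Pair key;         // the token pair
    size_t value;     // how often the pair occurs in tokens_in
} Freq;               // entry type of an stb_ds hashmap (Freq *freq)
```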
- The look-up table gets initialized with the values 1-255 for .l
- The text to compress gets appended to the tokens_in array
- tokens_in gets iterated pair-wise; each pair is put into the freq hashmap if not already present, otherwise its value gets incremented, counting the occurrences of the pair
- freq gets iterated to find the index of the pair with the maximum occurrence; if that pair occurs only once the loop ends -> the algorithm is finished
- The pair with the highest occurrence gets appended to the pairs look-up array
- tokens_in gets iterated to find the pair with the maximum occurrence and replace it with the last index of the pairs look-up array in tokens_out; pairs without the maximum occurrence are not replaced, the current token is just appended to tokens_out
- tokens_in and tokens_out get swapped for the next iteration (one full loop iteration is sketched below)
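Put together, one loop iteration might look roughly like the following sketch (using the Pair/Freq types sketched above and stb_ds containers; all names are illustrative, not taken from this repository):

```c
#include <stdbool.h>
#include <stdint.h>
#include "stb_ds.h"

// One compression iteration; returns false when the most frequent
// pair occurs only once, i.e. the algorithm is finished.
bool compress_step(Freq **freq, Pair **pairs,
                   uint32_t **tokens_in, uint32_t **tokens_out)
{
    // 1. Count every adjacent pair (naive version: freq starts empty).
    for (size_t i = 0; i + 1 < arrlenu(*tokens_in); ++i) {
        Pair p = { .l = (*tokens_in)[i], .r = (*tokens_in)[i + 1] };
        ptrdiff_t k = hmgeti(*freq, p);
        if (k < 0) hmput(*freq, p, 1);
        else (*freq)[k].value += 1;
    }
    if (hmlen(*freq) == 0) return false;

    // 2. Find the pair with the maximum occurrence.
    ptrdiff_t best = 0;
    for (ptrdiff_t i = 1; i < hmlen(*freq); ++i)
        if ((*freq)[i].value > (*freq)[best].value) best = i;
    if ((*freq)[best].value <= 1) return false;

    // 3. Append the winner to the look-up array; its index is the new token.
    Pair win = (*freq)[best].key;
    arrput(*pairs, win);
    uint32_t token = (uint32_t)(arrlen(*pairs) - 1);

    // 4. Rewrite tokens_in into tokens_out, replacing the winning pair.
    arrsetlen(*tokens_out, 0);
    for (size_t i = 0; i < arrlenu(*tokens_in); ++i) {
        if (i + 1 < arrlenu(*tokens_in)
            && (*tokens_in)[i] == win.l && (*tokens_in)[i + 1] == win.r) {
            arrput(*tokens_out, token);
            i += 1;                      // consume both halves of the pair
        } else {
            arrput(*tokens_out, (*tokens_in)[i]);
        }
    }

    // 5. Swap the buffers for the next iteration.
    uint32_t *tmp = *tokens_in;
    *tokens_in = *tokens_out;
    *tokens_out = tmp;
    return true;
}
```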
- To generate a compressed .bpe binary file from a given text file, run
./txt2bpe <input.txt> <output.bpe>
- To generate a .dot file from a .bpe file, run
./bpe2dot <input.bpe> <output.dot>
Instead of building a new freq hash table in each compression iteration, the table gets updated in place while the most frequent pair in the input text is replaced with a token.
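Such an incremental update might be sketched like this (same assumptions as above: stb_ds hashmap and illustrative helper names); only the counts around each replaced position change:

```c
// Raise a pair's count by one, inserting it if necessary (sketch).
static void freq_inc(Freq **freq, Pair p)
{
    ptrdiff_t k = hmgeti(*freq, p);
    if (k < 0) hmput(*freq, p, 1);
    else (*freq)[k].value += 1;
}

// Lower a pair's count by one if it is present (sketch).
static void freq_dec(Freq **freq, Pair p)
{
    ptrdiff_t k = hmgeti(*freq, p);
    if (k >= 0 && (*freq)[k].value > 0) (*freq)[k].value -= 1;
}

// When the pair (l, r) is replaced by the new token t, with left
// neighbour p and right neighbour n where they exist:
//   freq_dec(&freq, (Pair){l, r});   // the replaced pair disappears
//   freq_dec(&freq, (Pair){p, l});   freq_inc(&freq, (Pair){p, t});
//   freq_dec(&freq, (Pair){r, n});   freq_inc(&freq, (Pair){t, n});
```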