LLMA = LLM (Large Language Model) + Arithmetic coder: a tool that uses an LLM to perform text data compression.
Figure: block diagram of LLMA
A causal LLM outputs the probability distribution of the current token conditioned on the previous tokens. This repo is an interesting attempt: adding an arithmetic coder after an LLM, which encodes each token according to its predicted probability to achieve text data compression.
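For intuition, here is a minimal sketch (not the repo's actual code) of how a causal LLM exposes the next-token probability distribution that an arithmetic coder would consume; it assumes the Hugging Face transformers API and the Qwen/Qwen2.5-0.5B checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"  # assumed Hugging Face repo id of the model used below
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

text = "Data compression is"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token

# An arithmetic coder spends about -log2(p) bits on the token that actually
# occurs, so tokens the model predicts well cost almost nothing to store.
top = torch.topk(probs, 5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(tok_id))!r}  p={p.item():.4f}")
```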
Known drawbacks:
- Very slow: roughly 16 bytes per second on a CPU using the Qwen2.5-0.5B model.
- Numerical instability: since LLM inference involves floating-point operations, there is no guarantee that data compressed on one machine can be successfully decompressed on another.
The test text is data/data.txt, which is the content of Matt Mahoney's book Data Compression Explained.
Compression Method | Compressed Size | Compress Command |
---|---|---|
LLMA (Qwen2.5-0.5B) | 32927 B | python LLMA.py -c data.txt data.llma |
CMIX (ver. 21) | 55960 B | cmix.exe -t dictionary\english.dic data.txt data.cmix |
LPAQ8 (-9) | 68765 B | LPAQ8.exe -9 data.txt data.lpaq8 |
XZ (LZMA) (-9 -e) | 86708 B | xz -zkvf -9 -e data.txt (on Linux) |
GZIP (-9) | 101497 B | gzip -kvf -9 data.txt (on Linux) |
Uncompressed | 294328 B | |
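For scale, LLMA shrinks this file to 32927 / 294328 ≈ 11.2 % of its original size, i.e. roughly 0.9 bits per byte, versus about 2.8 bits per byte for GZIP -9.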
To download the pre-trained LLM model, you need to register on Hugging Face and apply for a "read" token.
Then, set your "read" token in download_pretrained_model.py.
Then run the following command to download the Qwen2.5-0.5B model locally. The model weights total about 953 MB.
python download_pretrained_model.py
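For reference, a minimal sketch of what such a download script might contain is shown below; the actual download_pretrained_model.py may differ, and the repo id, local directory, and token value are placeholders:

```python
# Hypothetical sketch of download_pretrained_model.py; the real script may differ.
from huggingface_hub import snapshot_download

HF_TOKEN = "hf_xxxxxxxxxxxxxxxx"  # placeholder: paste your "read" token here

snapshot_download(
    repo_id="Qwen/Qwen2.5-0.5B",  # assumed Hugging Face repo id
    local_dir="Qwen2.5-0.5B",     # assumed local directory for the weights
    token=HF_TOKEN,
)
```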
Then run the following command to compress a .txt file to a .llma file:
python LLMA.py -c <input_name>.txt <output_name>.llma
Run the following command to decompress a .llma file back to a .txt file:
python LLMA.py -d <output_name>.txt <input_name>.llma
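To sanity-check a compression/decompression round trip on a single machine, a small script like the following can be used (file names are examples only; the -d argument order matches the command above):

```python
# Example round trip: compress data.txt, decompress it, and verify the result
# matches the original byte-for-byte. File names here are examples.
import filecmp
import subprocess

subprocess.run(["python", "LLMA.py", "-c", "data.txt", "data.llma"], check=True)
subprocess.run(["python", "LLMA.py", "-d", "restored.txt", "data.llma"], check=True)

assert filecmp.cmp("data.txt", "restored.txt", shallow=False), "round-trip mismatch"
print("round trip OK")
```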