LLMA

What is LLMA?

LLMA = LLM (Large Language Model) + Arithmetic coder: it uses an LLM to compress text data.

 

Figure: block diagram of LLMA

 

Insight:

A causal LLM outputs a probability distribution over the next token given the previous tokens. This repo is an interesting attempt to exploit that: an arithmetic coder is attached after the LLM and encodes each token according to its predicted probability, which compresses the text.
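
To make the idea concrete, below is a minimal, hypothetical sketch (not the repo's LLMA.py): a causal LM from Hugging Face transformers supplies next-token probabilities, and a toy arithmetic coder narrows an interval for each token. The model id and the use of exact fractions are assumptions for illustration; a real coder works on scaled integers and streams out the bits of the final interval.

# Minimal illustrative sketch of "LLM + arithmetic coder" (NOT the repo's LLMA.py).
# A causal LM predicts P(next token | context); the coder shrinks an interval
# [low, high) to the chosen token's probability slice. Exact fractions keep the
# toy coder simple; a real coder uses scaled integers and emits a bitstream.
from fractions import Fraction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"   # assumed Hugging Face model id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def encode(text: str) -> Fraction:
    """Return one number inside the final interval; its binary expansion
    (about -log2(interval width) bits) is the compressed representation."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    low, high = Fraction(0), Fraction(1)
    context = ids[:1].unsqueeze(0)                 # first token sent verbatim
    for t in ids[1:]:
        with torch.no_grad():
            logits = model(context).logits[0, -1]
        # cumulative distribution over the vocabulary at this position
        cdf = torch.softmax(logits.double(), dim=-1).cumsum(dim=-1)
        p_lo = Fraction(cdf[t - 1].item()) if t.item() > 0 else Fraction(0)
        p_hi = Fraction(cdf[t].item())
        width = high - low                         # narrow [low, high) to the
        low, high = low + width * p_lo, low + width * p_hi   # token's slice
        context = torch.cat([context, t.view(1, 1)], dim=1)
    return (low + high) / 2

print(encode("Hello, arithmetic coding with an LLM."))

High-probability tokens shrink the interval only a little, so they cost few bits; this is why a strong language model yields a high compression ratio. Decoding runs the same model and picks, at each step, the token whose slice contains the encoded number.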

Pros:

  • Very high compression ratio that beats the state-of-the-art CMIX (and, of course, XZ, GZIP, BZIP2, etc.)

Cons:

  • Very slow: ~16 bytes per second on CPU using the Qwen2.5-0.5B model.
  • Numerical instability: since LLM inference involves floating-point operations, there is no guarantee that data compressed on one machine can be successfully decompressed on another.

Test result:

The test text is data/data.txt, which is the content of Matt Mahoney's book Data Compression Explained.

| Compression method  | Compressed size | Compress command                                      |
|---------------------|-----------------|-------------------------------------------------------|
| LLMA (Qwen2.5-0.5B) | 32927 B         | python LLMA.py -c data.txt data.llma                  |
| CMIX (ver. 21)      | 55960 B         | cmix.exe -t dictionary\english.dic data.txt data.cmix |
| LPAQ8 (-9)          | 68765 B         | LPAQ8.exe -9 data.txt data.lpaq8                      |
| XZ (LZMA) (-9 -e)   | 86708 B         | xz -zkvf -9 -e data.txt (in Linux)                    |
| GZIP (-9)           | 101497 B        | gzip -kvf -9 data.txt (in Linux)                      |
| Uncompressed        | 294328 B        |                                                       |

 

Usage

To download the pre-trained LLM, you need to register a Hugging Face account and create a "read" access token.

Then, set your "read" token in download_pretrained_model.py.

Then run the following command to download the Qwen2.5-0.5B model locally. The model weights are about 953 MB.

python download_pretrained_model.py
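
For reference, a download step along these lines can also be done directly with huggingface_hub (a hypothetical sketch; the repo's download_pretrained_model.py may work differently, and the model id is an assumption):

# Hypothetical alternative to download_pretrained_model.py, for illustration only.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-0.5B",       # assumed model id on Hugging Face
    token="hf_your_read_token_here",   # your "read" access token (placeholder)
)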

Then run the following command to compress a .txt file into a .llma file:

python LLMA.py -c <input_name>.txt <output_name>.llma

Run the following command to decompress a .llma file back into a .txt file:

python LLMA.py -d <output_name>.txt <input_name>.llma
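
A quick way to sanity-check a round trip is sketched below (file names are placeholders, and the argument order matches the commands above). Because of the numerical-instability caveat, run the check on the same machine that did the compression.

# Round-trip sanity check (sketch): compress, decompress, compare bytes.
import filecmp
import subprocess

subprocess.run(["python", "LLMA.py", "-c", "data.txt", "data.llma"], check=True)
subprocess.run(["python", "LLMA.py", "-d", "restored.txt", "data.llma"], check=True)
ok = filecmp.cmp("data.txt", "restored.txt", shallow=False)
print("round trip OK" if ok else "MISMATCH")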
