
Text classification using lossless compression (gzip)

  • In short, we classify texts using gzip and k-nearest neighbors (kNN)
  • the gzip method leverages the fact that compression algorithms remove redundant information within text sequences

(1) How does the method work?

  • Suppose we have two texts txt_1 and txt_2:
import gzip

txt_1 = "hello world"
txt_2 = "some text some text some text"

# Compress each text individually
len(gzip.compress(txt_1.encode()))  # Output: 31
len(gzip.compress(txt_2.encode()))  # Output: 33

# Compress the two different texts concatenated
len(gzip.compress(" ".join([txt_1, txt_2]).encode()))  # Output: 43

# Compress each text concatenated with itself
len(gzip.compress(" ".join([txt_1, txt_1]).encode()))  # Output: 34
len(gzip.compress(" ".join([txt_2, txt_2]).encode()))  # Output: 33
  • Notice that txt_2 compresses to nearly the same size as txt_1 (33 vs. 31 bytes) despite being almost three times longer, and that concatenating a text with itself barely increases its compressed size.
  • This demonstrates how the compression algorithm takes advantage of repeated text: the more two strings have in common, the cheaper their concatenation is to compress.
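
This intuition can be formalized as the Normalized Compression Distance (NCD), which the original paper combines with kNN. Below is a minimal sketch; the helper name ncd is our own choice, not code from this repository:

import gzip

def ncd(x, y):
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C(s) is the gzip-compressed length of s
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress(" ".join([x, y]).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Overlapping texts yield a smaller distance than unrelated ones
ncd("some text some text", "some text")    # small: heavy overlap
ncd("hello world", "some text some text")  # larger: little overlap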

(2) Fixing Tie-Breaking

  • A common problem in k-nearest-neighbor classification is breaking ties, e.g., whenever k=2 and the two nearest neighbors disagree. The method in the original paper always selects the lowest class label in case of a tie, which introduces bias:
top_k_class = [0, 1]
max(set(top_k_class), key=top_k_class.count)  # Output: 0

top_k_class = [1, 0]
max(set(top_k_class), key=top_k_class.count)  # Output: 0 (nearest neighbor is class 1, but 0 still wins)

top_k_class = [1, 0, 2]
max(set(top_k_class), key=top_k_class.count)  # Output: 0 (three-way tie, lowest label wins)
  • A more principled approach uses a Counter object: since the neighbor list is sorted by ascending distance, and most_common() preserves insertion order among equally frequent labels, ties are broken in favor of the closest neighbor:
from collections import Counter

top_k_class = [0, 1]
Counter(top_k_class).most_common()[0][0]  # Output: 0 (nearest neighbor wins)

top_k_class = [1, 0]
Counter(top_k_class).most_common()[0][0]  # Output: 1 (nearest neighbor wins)

top_k_class = [1, 2, 0]
Counter(top_k_class).most_common()[0][0]  # Output: 1 (nearest neighbor wins)
  • Resolving ties in favor of the nearest neighbor rather than the lowest label can noticeably improve accuracy.

The full approach can be summarized in pseudocode; a minimal Python sketch is given below.
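
A minimal end-to-end sketch of the classifier (our own illustration rather than the repository's exact code), assuming the ncd helper from above; classify and its parameter names are hypothetical:

from collections import Counter

def classify(test_text, train_texts, train_labels, k=2):
    # Distance from the test text to every training text
    distances = [ncd(test_text, t) for t in train_texts]

    # Labels of the k nearest training texts, sorted by ascending distance
    sorted_labels = [label for _, label in sorted(zip(distances, train_labels))]
    top_k_class = sorted_labels[:k]

    # Majority vote; Counter breaks ties in favor of the closest neighbor
    return Counter(top_k_class).most_common()[0][0]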

(3) Results

  • the results strongly depend on how compressible the dataset is: compression ratio and test accuracy show a positive linear correlation, i.e., the easier a dataset can be compressed, the higher the achievable accuracy.
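
Here, a dataset's compression ratio can be estimated as raw size over compressed size (a standard definition, not code taken from the repository):

import gzip

def compression_ratio(texts):
    # Ratio of raw bytes to gzip-compressed bytes over a corpus
    raw = sum(len(t.encode()) for t in texts)
    compressed = sum(len(gzip.compress(t.encode())) for t in texts)
    return raw / compressed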

(4) Efficiency Improvements

  • Gzip's DEFLATE algorithm searches for repeated strings only within a limited sliding window (32 KB). This limitation becomes apparent when dealing with large datasets (e.g., 20News), as the compressor cannot fully exploit the large number of training samples.

  • Moreover, the computational complexity of kNN is $\Theta(n^2)$, which becomes a significant limitation as the data size increases.

  • Additionally, caching the compressed training instances and parallelizing the search across CPU cores reduce runtime significantly, from 12 to 2 hours on the IMDb movie review dataset, as sketched below.
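
A hedged sketch of both optimizations, caching the compressed lengths of the training instances and parallelizing the test-set loop with Python's multiprocessing (variable and function names are our own):

import gzip
from multiprocessing import Pool

train_texts = ["hello world", "some text some text some text"]

# Cache each training instance's compressed length once,
# instead of recompressing it for every test sample
train_cache = [len(gzip.compress(t.encode())) for t in train_texts]

def distances_to_train(test_text):
    # NCD of one test text against all cached training instances
    c_test = len(gzip.compress(test_text.encode()))
    result = []
    for t, c_train in zip(train_texts, train_cache):
        c_joint = len(gzip.compress(" ".join([test_text, t]).encode()))
        result.append((c_joint - min(c_test, c_train)) / max(c_test, c_train))
    return result

if __name__ == "__main__":
    test_texts = ["hello there", "more text"]
    # Spread the distance computations across CPU cores
    with Pool() as pool:
        all_distances = pool.map(distances_to_train, test_texts)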

(5) Final Thoughts

  • This method is appealing because it relies solely on CPU resources, avoiding the environmental costs associated with GPU usage.
  • Since the runtime does not scale well to large datasets, multiprocessing techniques can be used to speed it up.
  • Dataset overlap: some of the datasets used in the original paper contained overlapping train and test samples, which distorted the reported train/test accuracies.
