Compression

This is a helper repository to the blog post on https://schulichignite.com/blog/making-things-small that explains 2 basic compression algorithms, 1 for text, 1 for images.

Videos

There are several videos created relating to this topic you can find here:

Walkthrough of the theory (COMING SOON)
Walkthrough of text version code (COMING SOON)
Walkthrough of image version code (COMING SOON)

Running examples

Each file was created in python 3.11, but should work on all versions 3.7+. To run them simply run:

python <file>.py

No additional packages need to be installed to run the examples

Background knowledge

You should already know:

How to write and read python well
- On top of the basics you should also know comprehensions & counters
What encodings are

Text

For all of our text algorithms we made a few assumptions:

Text is in just english
The text does not have any numbers
The text is at least a few paragraphs (if it's super short compression is unlikely to be helpful)
Each character takes up 1 byte (i.e. "a" is 1 byte); This is just important when calculating efficiency

All of the text compression algorithms work off the basic principle of creating a mapping between words and indicies in lists. Here is an example:

The output in this example has been compressed by 13 bytes. For our algorithms we have 3 common steps:

Establish a set of words in a list that will be used to map words to indicies
Cleanup and sort the list in some form to make it as efficient as possible
Loop over the list of words and replace each occurance of a word in the input text with it's index in the list

In text.py there are 3 versions of the algorithm in 3 functions:

compress_by_common_words(); Uses the 100 most common english words capitalized and lowercase
compress_by_common_words_improved(); Uses the same method as 1, but with a few improvements to the filtering and sorting
compress_with_counter(); Generates the wordlist by looking at the words in the text and choosing the most frequent ones

Images

The image algorithm does not use real images. It uses a custom encoding to represent images. The encoding relies on RGB tuples, where there are 3 numbers between 0-255, each number represents the "amount" of the colour to use. For example (255,0,0) means use "%100 red and 0% green and 0% blue". Here is a chart to help visualize this:

For our code we will use an encoding of lists of "rows" of pixels, with each pixel being a list. So this list:

image_values = [ 
    # Start with third dimension being a list so the values can be rounded (tuples are immutable)
    [[255,255,255],[255,255,255],[255,255,255],[255,255,255],[255,255,255]],
    [[255,255,255],      [0,0,0],[0,0,0],[0,0,0],      [255,255,255]],
    [[255,255,255],[255,255,255],[0,0,0],[255,255,255],[255,255,255]],
    [[255,255,255],[255,255,255],[0,0,0],[255,255,255],[255,255,255]],
    [[255,255,255],[255,255,255],[0,0,0],[255,255,255],[255,255,255]],
]

Corresponds to this 5 by 5 image:

Something like this:

From this we do the same principles as the text algorithms, except with the list of lists of lists (which we convert to a list of list of tuples, because lists can't be put as keys of dictionaries), and we get a process that starts like this:

and then once the actual compression has been complete we end up with:

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
images		images
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
images.py		images.py
text.py		text.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Compression

Videos

Running examples

Background knowledge

Text

Images

About

Uh oh!

Releases

Packages

Languages

License

Descent098/compression

Folders and files

Latest commit

History

Repository files navigation

Compression

Videos

Running examples

Background knowledge

Text

Images

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages