forked from dbry/lzw-ab
-
Notifications
You must be signed in to change notification settings - Fork 1
/
output_compatibility.txt
105 lines (77 loc) · 4.32 KB
/
output_compatibility.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
We are going to be compatible with original UNIX compress utility.
Lets show how it works with ncompress (open source version of this utility).
echo -n "abc" | compress -b 9 | ruby -e "pp STDIN.read.unpack('C*')"
> [31, 157, 137, 97, 196, 140, 1]
"31" and "157" are LZW magic bytes, also known as "\037" and "\235" in octal numeral system.
"137" is a result of "128 | 9", "128" is a special value for block mode, "9" is current max code bits.
We need to write codes for "a", "b" and "c".
Codes are 97, 98 and 99, but code width is 9 bit.
How to write it?
97 equals "001100001" (left padded with zeroes).
We will write last 8 bits "01100001" = 97 into first byte.
The remainder bit is "0".
98 equals "001100010".
We will shift this code to the left by remainder bit length and add it.
"(001100010 << 1) | 0" equals "0011000100".
We will write last 8 bits "11000100" = 196 into second byte.
The remainder bits are "00".
99 equals "001100011".
"(001100011 << 2) | 00" equals "00110001100".
We will write last 8 bits "10001100" = 140 into third byte.
The remainder bits are "001".
Input is empty, we will output remainder "001" left padded with zeroes into last byte.
This method is known as LSB.
-----
97 equals "001100001" (left padded with zeroes).
We will write first 8 bits "00110000" in opposite direction ("00001100" = 12) into first byte.
The remainder bit is "1".
98 equals "001100010".
We will shift remainder to add its bits from the top.
"001100010 | (1 << 8)" equals "1001100010".
We will write first 8 bits "10011000" in opposite direction ("00011001" = 25) into second byte.
The remainder bits are "10".
99 equals "001100011".
"001100011 | (10 << 8)" equals "10001100011".
We will write first 8 bits "10001100" in opposite direction ("00110001" = 49) into third byte.
The remainder bits are "011".
Input is empty, we will output remainder "011" right padded with zeroes in opposite direction ("01100000" -> "00000110" = 6) into last byte.
This method is known as MSB.
We can see that this method requires reversing bits in byte.
-----
We have 9 bit codes.
When should we switch to 10 bit codes?
We should store current code bits (9 bits).
We received new symbol.
We are trying to find code for next symbol in dictionary.
If we found it we should keep code bits as is (9 bits).
Codes from dictionary don't affect code bits.
We added new code into dictionary and it requires 10 bits.
We have to write current code with current code bits (9 bits).
Than we can increase code bits (10 bits).
-----
https://github.com/vapier/ncompress/issues/5
There is an ancient bug in original UNIX compress and all its derivatives: ncompress, windows port, ncompress and gzip.
It was easier to implement output of bit group aligned by number of code bits multiplied by 8.
When clear code appears or code bits changed this bit group was flushed into output right padded with zeroes.
Example:
We want to write 10 9 bit codes than 5 10 bit codes.
We have to write 100 bits, than 44 zero bits, than 50 bits and than 6 zero bits.
This bug exists for more than 35 years so it becomes a part of specification.
We have to make our application compatible with this bug by default.
So we can create feature named as "unaligned" which will provide output without "code_bits * 8" alignment.
-----
UNIX compress and his friends wont be able to decompress data compressed with "unaligned".
Their decompressor will just bypass bits that it counts as zero padding.
But "unaligned" output has no any padding bits except last byte.
So they will just miss data.
It is not possible to implement decompressor that will be able to decompress both "unaligned" and "aligned" data automatically.
The problem is that right padding of bit group with zero will conflict with unaligned and aligned data.
For example we want to output 7 9 bit codes and than 1 10 bit code.
We need to write 63 bytes, than 9 zero bits, than 10 bit, than 6 zero bits.
9 zero bits + first bit of 10 bit code can be read as 10 bit code.
If remaining bits (9 last bits of 10 bit code) are zeroes, this data can decompressed in 2 ways.
So automatic detection of "unaligned" or "aligned" data is not possible.
-----
Some UNIX compress implementations writes random bits from uninitialized buffer as alignment bits.
There is no guarantee that alignment bits will be zeroes.
So in terms of compatibility decompressor have to just ignore alignment bit values.