forked from dbry/lzw-ab
-
Notifications
You must be signed in to change notification settings - Fork 1
/
decompressor.txt
155 lines (110 loc) · 3.94 KB
/
decompressor.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
Lets try to decompress [97, 98, 257].
We need to find a convenient way to simulate algorithm.
Read 97.
It means that first prefix is "a".
Remote dictionary applied code 257 for "ax" (x is unknown symbol).
Read 98.
It means that next prefix is "a".
So remote dictionary applied code 257 for "ab".
Remote dictionary applied code 258 for "bx".
Read 257.
It means that next prefix is "ab" (from dictionary).
So remote dictionary applied code 258 for "ba".
Destination is "abab".
-----
Now lets try to decompress [97, 98, 258].
97 and 98 means the same.
Read 258.
There is no 258 in dictionary.
We know that remote dictionary applied code 258 for "bx".
So destination looks like "abbx".
But it means that remote dictionary applied code 258 for "bb".
Destination is "abbb".
This is the most complex part of decompression algorithm.
We need to clarify it in general case.
-----
We have current prefix "a...b".
We know that remote dictionary applied code "c" for "a...bx".
Than we read this code "c" from source.
It means that destination looks like "a...ba...bx".
So "a...bx" is "a...ba".
So we have two identical strings that intersects by first symbol "a".
"...a...ba....."
|
"........a...ba"
We read next code (last used code + 1) from source, it is not available in current dictionary.
In this case next prefix equals to current prefix + first symbol of current prefix.
1. "a(b)(bb)": current prefix code <= 255.
2. "...(a...b)(a...ba)": current prefix code > 257.
So this situation is possible for any prefix code.
-----
What operations are required for decompressor's dictionary?
1. Is code presented in dictionary?
2. Output code.
3. Store next code in dictionary by prefix code and first symbol of current code.
We don't need to find code by symbol sequence.
Compressor requires to maintain mapping between current code and next code.
The problem is that current code has many next codes.
Decompressor requires opposite mapping.
One current code has only one previous code.
It makes decompressor simple and it looks impossible to speed up it.
So we can use single implementation of dictionary.
-----
We need just one array for code mapping.
It can be named as "previous_codes".
Indexes of this array will be >= 257.
Values are previous codes including symbols.
Any code can be previous for other code except max code.
Max code (2 ** 16) - 1 cant be previous code.
So values will be between 0 and (2 ** 16) - 2.
Undefined previous code will be equal to (2 ** 16) - 1.
We don't need to initialize or clear it.
Algorithm will access only initialized codes.
-----
We need a separate array to map between code and last symbol.
It can be named as "last_symbol_by_codes".
Indexes of this array will be >= 257.
We don't need to initialize or clear it.
Algorithm will access only initialized symbols.
-----
We have to write output in an opposite direction.
It is possible to use another array as output buffer.
It can be named as "output_buffer".
Lets imagine we compressed "aa..." data.
All codes will be equal to prev code + "a" symbol.
"aa" - 257, "aaa" - 258, ... "a...a" ((2 ** 16) - 1 - 255) - ((2 ** 16) - 1).
So max length of buffer will be (2 ** 16) - 256 = (2 ** 16) - 257 + 1.
We don't need to initialize or clear it.
Algorithm will access only initialized bytes.
-----
For example [97, 98, 257, 259, 98] -> "abababab":
Imaginary dictionary:
ab - 257
ba - 258
aba - 259
abab - 260
Previous codes:
0 (257) -> 97
1 (258) -> 98
2 (259) -> 257
3 (260) -> 259
Last symbol by codes:
0 (257) -> "b"
1 (258) -> "a"
2 (259) -> "a"
3 (260) -> "b"
Output buffer (reversed):
257 -> ba
259 -> aba
-----
We don't need to clear dictionary.
Algorithm will access only initialized data in all arrays.
-----
How much memory will it consume?
Previous codes: ((2 ** 16) - 257) * sizeof(code_t)
Last symbol by codes: (2 ** 16) - 257
Output buffer: (2 ** 16) - 257 + 1
Total: ~ 216 KB
-----
PS If block mode is disabled we don't need clear code (256).
So we can move from 257 to 256.