forked from dbry/lzw-ab
-
Notifications
You must be signed in to change notification settings - Fork 1
/
compressor_with_sparse_array.txt
74 lines (56 loc) · 2.19 KB
/
compressor_with_sparse_array.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
It is possible to speedup dictionary lookup.
We can store codes trie in single sparse array.
We can name it "next codes".
For example:
"a" (97) -> "c" (262)
-> "b" (257)
-> "a" (259) -> "c" (263)
-> "b" (260)
"b" (98) -> "a" (258)
"c" (99) -> "a" (261)
Next codes:
...
97 ("a") * 256 + 97 ("a") -> 259
97 ("a") * 256 + 98 ("b") -> 257
97 ("a") * 256 + 99 ("c") -> 262
...
98 ("b") * 256 + 97 ("a") -> 258
...
99 ("c") * 256 + 97 ("a") -> 261
...
258 (259) ("aa") * 256 + 98 ("b") -> 260
258 (259) ("aa") * 256 + 99 ("c") -> 263
...
We need to subtract 1 because we don't need to store clear code.
-----
The main idea is very simple.
We have a lookup function with params: current_code and next_symbol.
We can use: return sparse_array[current_code * 256 + next_symbol].
We have a save function with params: current_code, next_symbol, next_code.
We can use: sparse_array[current_code * 256 + next_symbol] = next_code;
Both operations has O(1) compexity.
Indexes of this array are between 0 and ((2 ** 16) - 1) * 256 - 1.
Values >= 257.
We can still use zero constant as undefined next code.
-----
How much memory will it consume?
(((2 ** 16) - 1) * 256) * sizeof(code_t) ~ 33.5 MB.
For example we have "linux.tar" with size about 850 MB.
Practice shows that we need about 2040 dictionary clears.
We need to clear 33.5 MB * 2040 times ~ 68 GB.
Speed of memset is about 10 GB per second for modern pc with DDR-4.
So we are loosing about 7 seconds just to clear sparse array.
We need to find a smarter way to clear it.
We can store used sparse array indexes in separate array.
Length of this array equals to codes length: ((2 ** 16) - 257)
Size of used index is 4 bytes.
So it will consume ((2 ** 16) - 257) * sizeof(used_index_t) ~ 261 KB.
There is another solution.
We can use decompressor dictionary: previous codes and last symbol by codes.
It requires ((2 ** 16) - 257) * sizeof(code_t) + ((2 ** 16) - 257) ~ 196 KB.
But this solution is slower: we need to maintain more complex structure.
We are using 33.5 MB for sparse array.
It would be weird try to save 65 KB.
-----
PS If block mode is disabled we don't need clear code (256).
So we don't need to subtract 1 from current code.