succinct provides several static succinct data types
中文介绍: 100行代码的压缩前缀树: 50% smaller
Set is a succinct, sorted and static string set impl with compacted trie as storage. The space cost is about half lower than the original data.
package succinct
import "fmt"
func ExampleNewSet() {
keys := []string{
"A", "Aani", "Aaron", "Aaronic", "Aaronical", "Aaronite",
"Aaronitic", "Aaru", "Ab", "Ababdeh", "Ababua", "Abadite",
}
s := NewSet(keys)
for _, k := range []string{"Aani", "Foo", "Ababdeh"} {
found := s.Has(k)
fmt.Printf("lookup %10s, found: %v\n", k, found)
}
// Output:
//
// lookup Aani, found: true
// lookup Foo, found: false
// lookup Ababdeh, found: true
}
-
200 kilo real-world words collected from web:
- the space costs is 57% of original data size.
- And a
Has()
costs about350 ns
with a zip-f workload.
Original size: 2204 KB
With comparison with string array bsearch and google btree :
Data Engine Size(KB) Size/original ns/op 200kweb2 bsearch 5890 267% 229 200kweb2 succinct.Set 1258 57% 356 200kweb2 btree 12191 553% 483 A string in go has two fields: a pointer to the text content and a length. Thus the space overhead is quite high with small strings. btree internally has more pointers and indirections(interface).
-
870 kilo real-world ipv4:
- the space costs is 67% of original data size.
- And a
Has()
costs about500 ns
with a zip-f workload.
Original size: 6823 KB
Data Engine Size(KB) Size/original ns/op 870k_ip4_hex bsearch 17057 500% 276 870k_ip4_hex succinct.Set 2316 67% 496 870k_ip4_hex btree 40388 1183% 577
It stores sorted strings in a compacted trie(AKA prefix tree). A trie node has at most 256 outgoing labels. A label is just a single byte. E.g., [ab, abc, abcd, axy, buv] is represented with a trie like the following: (Numbers are node id)
^ -a-> 1 -b-> 3 $
| | `c-> 6 $
| | `d-> 9 $
| `x-> 4 -y-> 7 $
`b-> 2 -u-> 5 -v-> 8 $
Internally it uses a packed []byte and a bitmap with len([]byte)
bits to
describe the outgoing labels of a node,:
^: ab 00
1: bx 00
2: u 0
3: c 0
4: y 0
5: v 0
6: d 0
7: ø
8: ø
9: ø
In storage it packs labels together and bitmaps joined with separator 1
:
labels(ignore space): "ab bx u c y v d"
label bitmap: 0010010101010101111
In this way every node has a 0
pointing to it(except the root node)
and has a corresponding 1
for it:
.-----.
.--. | .---|-.
|.-|--. | | .-|-|.
|| ↓ ↓ | | | ↓ ↓↓
labels(ignore space): ab bx u c y v d øøø
label bitmap: 0010010101010101111
node-id: 0 1 2 3 4 5 6 789
|| | ↑ ↑ ↑ | ↑
|| `-|-|-' `---'
|`---|-'
`----'
To walk from a parent node along a label to a child node, count the number of
0
upto the bit the label position, then find where the the corresponding
1
is:
childNodeId = select1(rank0(i))
In our impl, it is:
nodeId = countZeros(ss.labelBitmap, ss.ranks, bmIdx+1)
bmIdx = selectIthOne(ss.labelBitmap, ss.ranks, ss.selects, nodeId-1) + 1
Finally leaf nodes are indicated by another bitmap leaves
, in which a 1
at
i-th bit indicates the i-th node is a leaf:
leaves: 0001001111
This project is licensed under the MIT License - see the LICENSE file for details.