-
Notifications
You must be signed in to change notification settings - Fork 4
/
README.txt
210 lines (162 loc) · 9.68 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
# INTRODUCTION
This is BeagleHash and friends- a family of hash functions developed
using genetic-algorithm techniques. Each hash function has a different
set of trade offs, primarily relating to the size of the seed, the
size the run time state of the hash, the method of seeding the hashes
state, the block size for reads and the size of the final hash.
Hash WordBits SeedBits StateBits HashBits
BeagleHash 64 Variable 128 64
Zaphod64 64 191 192 64
StadtX 64 128 256 64
Zaphod32 32 95 96 32
Phat4 32 96 128 32
SBOX 64 128 524480 64
BeagleHash is named in honor of the the H.M.S. Beagle, the ship
which carried Charles Darwin to the Galapogos Islands. Zaphod is named
after the character from the Hitchhikers Guide to the Galaxy, and is
also a play on the fact that Microsoft has a hash called "Marvin".
StadtX is a hash function inspired by the metrohash family of hashes,
and was named accordingly, "stadt" being the German word for city.
Phat4 needs a new name. SBOX is named after how it works, which is
substitution box hashing. SBOX is a toy, designed to demonstrate how
"perfect unbreakable hashing" might behave. It hashes by looking up
a random 64 bit value for each input byte. (And then switches to Zaphod64
like hashing after 32 bytes.) The purpose is purely for testing. I do
not recommend you use it.
# WHAT IS IT FOR
The primary intention of the hash functions contained in this
repository is for use in scripting languages such as Perl and other
contexts where there is a single seed used to hash many keys,
including ones from untrusted sources, and where there may be
"leakage" of details about how the hash behaves. In particular one
of the assumptions of the hash functions contained here is that
they will be *rarely* seeded, but that we will hash many times
with the same seed.
## BeagleHash
BeagleHash is one core function exposed as a set of different
functions each intended for a different seed or hash size.
The core routine is the function beagle_hash_64_128_a() which
expects a 128 bit random seed which has been "prepared" for use.
Specifically the seed is expected to have certain properties:
- Neither the first 64 bits nor the last may be either all
zeros, nor all ones.
- There must be a reasonable mix of ones and zeros.
This function is wrapped by various other functions:
beagle_hash_64_127_a() beagle_hash_32_127_a()
beagle_hash_64_112_a() beagle_hash_32_112_a()
beagle_hash_64_96_a() beagle_hash_32_96_a()
beagle_hash_64_64_a() beagle_hash_32_64_a()
The _96, _112 and _127 variants are exposed primarily for testing
and are NOT recommended for general use as they include key
preparation as part of their implementation. Instead one should
use the functions
beagle_hash_seed_prep_127_128_a()
beagle_hash_seed_prep_112_128_a()
beagle_hash_seed_prep_96_128_a()
beagle_hash_seed_prep_64_128_a()
to prepare the seed for use with
beagle_hash_64_128_a()
beagle_hash_32_128_a()
both of which do no seed preparation at all.
Only the _64 variant is recommended for use with variable seeds,
as it uses a fast method for ensuring that seed is of sufficient
quality.
# HASH FUNCTION SECURITY
Security for a hash function is a complex matter. Many contexts do not
rely on security. If you are hashing the words in an "actual" dictionary,
or hashing pointer addresses, then you probably don't need to worry about
the security of that hash function, and can focus on speed and suitability
to the data being hashed. On the other hand, if you are using hashing for
signing messages then you really need to worry about using an academically
peer reviewed and cryptographic grade hash function.
Both of these markets are relatively well served by the existing corpus
of hash functions and it is relatively straight-forward to choose one
appropriate to the task at hand.
However there is a middle ground between these two use cases where things
are less clear. Perl is a good example of this middle ground. Perl, like
many scripting languages, makes heavy use of hasing and hash-tables. The
hash function it uses for strings is seeded once, at process start up, and
is then used for all subsequent hashing requirements, from internal purposes
like namespace management to application level logic like hashing arbitrary
user input such as building a hash of the arguments to a web-request. While
the hash values generated are considered "private" to the process, and transient
to that process, information about the hash-function being used can and does
leak in the form of printing out the keys of a hash in native key-order.
Another key point here is that the hash table size is variable. In many of
the tables only a few bits, usually the low bits, of the hash are actually used.
Since we have to supporting hashing arbitrary user data we have to assume that
sometimes that data may be hostile, and that someone may be setting up a collision
attack on the hash function, and we must ensure that our hash function is
sufficiently robust to make collision attacks impractical, which in turn means
we have to make seed-discovery attacks impractical.
## Collision attacks
### Multi-Collision Attacks
The most serious type of collision attack is the "multi-collision" attack, where
it is possible to construct colliding keys without knowing the seed of the hash.
Hashes which are based around the 64 bit Intel CRC32 intrinsic often suffer
from this problem. It takes a few seconds to find a CRC32 collision, and once you
have that, you can manufacture arbitrary numbers of keys which collide. Even if
the seed changes, and thus the hash value produced changes, all the keys will
still collide. There are other variants to this, for instance some hashes have
the property that there is a key K such that H(P)=H(PK), or have the associative
property such that H(AB)=H(BA). Any such weaknesses are a fatal blow to a hash
function that needs to be robust to untrusted input data.
At this time I do now know of any multi-collision attacks on the hash functions
in this package. I believe that they are by design robust to such attacks, always
mixing the length of the key into the state during hashing, and of sufficient
complexity and construction that such attacks are unlikely.
### Brute-Forced Collisions
An unfortunate fact of life is that if an attacker knows the seed for your hash
function, and you are only using a few bits of the hash then an attacker can
readily brute force collisions. For instance if you are using 32 bits of the hash
then you can brute force a collision of keys longer than 32 bits in about 160k
random tries. (Collisions in a hash are normal, excess collisions are not.) What
this means is that our hash function must be robust to key-discovery attacks,
or in other words the problem moves from avoiding collisions, to keeping our
seed secret.
## Seed Discovery Attacks
### Brute-Force
If the seed is small enough then we can brute force the hash function, and simply
try every seed and see what hash, or key order we get. Note the hash-size doesnt
matter much here, if the seed is 32 bits then the hash function can be brute forced
in short order. This means that any decent hash function needs to have a seed at
least 64 bits, if not longer.
All of the hash functions in this package suport a seed larger than 64 bits.
### Solver Attacks
If an attacker knows the hash function being attacked (and even if they dont to a
certain extent) they can use a solver to compute the seed from a set of input/output.
This means that the complexity of the hash function must be sufficient to make
solving the hash impractical. Modern solver technology is very advanced, and can
make short order of an insufficiently complex hash function.
I do not know the solver status of these hash functions. If I had to guess I would
say that the solver-resistance of the hash function is inversely proportional to the
hash functions speed, but obviously the relative sizes of the seed, state and hash
will play a big role. This needs further study.
## Relatively Speaking
All of these hash functions use less state, or use less complex mix functions than
SipHash, even in its 1-3 implementation. On the other hand they are faster than
SipHash or SipHash13, and probably have a higher level of security than most
"non-cryptographic" hash functions.
# STATUS
This project is a work in progress. It may change at any time until
this notice is removed. Please do not use this code in any production
systems without contacting me first.
# LICENSE
You may use this software either under the LGPL 3.0 or later, or under
express license after contacting the author.
# AUTHOR and COPYRIGHT
Yves Orton wrote and ran the software (currently not included in this
package) that evolved the internals used by BeagleHash, and wrote
the "skeleton" that the genetic-algorithm "filled in".
This software is copyright Yves Orton, 2017.
# ACKNOWLEDGEMENTS
The author would like to thank the following people who have one way or
another influenced the development of this software:
Damian Gryski - A smart cookie who has interest in this area and who
has helped with some of the analysis.
Brett Mulvey - Author of http://papa.bretmulvey.com/post/124027987928/hash-functions
JP Aumasson "veorq" - One of the authors of SipHash which has been very
influential in how this function was structured.
Ryan Bastic - Porting, Doc fixes, actually building stuff, etc.
Larry Wall - Author of Perl, where I got my chops, and why I started
on this in the first place.