<!DOCTYPE html>
<html lang="en">
<head>
<link rel="stylesheet" href="/theme/style/base.min.css?2189187c">
<title>Hyphanet</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
<link href="https://www.hyphanet.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Hyphanet Full Atom Feed" />
<link rel="alternate" hreflang="en" href="https://www.hyphanet.org/statistical-results-without-false-positives-check-are-most-likely-wrong.html" />
<link rel="alternate" hreflang="ru" href="https://www.hyphanet.org/ru/statistical-results-without-false-positives-check-are-most-likely-wrong.html" />
<link rel="alternate" hreflang="fr" href="https://www.hyphanet.org/fr/statistical-results-without-false-positives-check-are-most-likely-wrong.html" />
<link rel="alternate" hreflang="x-default" href="https://www.hyphanet.org/statistical-results-without-false-positives-check-are-most-likely-wrong.html" />
<link rel="canonical" href="https://www.hyphanet.org/statistical-results-without-false-positives-check-are-most-likely-wrong.html" />
<meta property="og:title" content="Hyphanet" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://www.hyphanet.org" />
<meta property="og:image" content="https://www.hyphanet.org/" />
<meta property="og:image:secure_url" content="https://www.hyphanet.org/theme/images/logo-blue.png" />
<meta property="og:description" content="Hyphanet is a peer-to-peer platform for censorship-resistant communication and publishing." />
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Hyphanet" />
<meta name="twitter:description" content="Hyphanet is a peer-to-peer platform for censorship-resistant communication and publishing." />
<meta name="twitter:image" content="https://www.hyphanet.org/theme/images/logo-blue.png" />
</head>
<body id="index" class="home">
<div>
<nav id="menu">
<a href="https://www.hyphanet.org/">Hyphanet</a>
<a href="https://www.hyphanet.org/pages/about.html">About</a>
<a href="https://www.hyphanet.org/pages/volunteer.html">Volunteer</a>
<a href="https://www.hyphanet.org/pages/documentation.html">Documentation</a>
<a href="https://www.hyphanet.org/pages/download.html">Download</a>
<a href="https://www.hyphanet.org/pages/help.html">Help</a>
</nav><!-- /#menu -->
<aside class="social">
<a href="https://twitter.com/freenetproject">Twitter</a>
<a rel="me" title="Hyphanet News and Info in the Fediverse" href="https://floss.social/@Freenet">Mastodon</a>
</aside>
<nav id="language">
<span>Language</span>
<a href="https://www.hyphanet.org/ru/statistical-results-without-false-positives-check-are-most-likely-wrong.html">ru</a>
<a href="https://www.hyphanet.org/fr/statistical-results-without-false-positives-check-are-most-likely-wrong.html">fr</a>
</nav>
</div>
<main>
<header id="banner" class="body">
<h1>Statistical results without false positives check are most likely wrong</h1>
</header><!-- /#banner -->
<section id="content" class="body">
<div class="post-info">
<time class="published" datetime="2019-09-09T00:00:00+02:00">
Mon 09 September 2019
</time>
<address class="vcard author">
By <a class="url fn" href="https://www.hyphanet.org/author/freenet-contributors.html">Freenet Contributors</a>
</address>
</div><!-- /.post-info -->
<div class="entry-content">
<p>Like every other privacy network, Freenet is a target of
statistical attacks to trace the activity of its users.</p>
<p>Studies that investigated tracing Freenet users were built on
unrealistic idealized setups or simplistic routing, so that their
results don’t apply to the real network.</p>
<p>Despite these shortcomings in the studies, there have been cases of
seized equipment. To prevent future cases from targeting innocents based on
these misleading statistics, we want to provide an example of a clean
calculation of the probability that some observation is a false
positive.</p>
<p>A short definition: False positives are results which look like a hit,
e.g. finding the originator of a request, but which are wrong,
e.g. pointing to the wrong person.</p>
<p>Second definition: A Freenet node is Freenet running on a computer.</p>
<p>When observing Freenet, false positives most likely happen because of
misunderstanding how Freenet routing works, how file transfer works,
or how connections in Freenet are structured in real operation.</p>
<p>In the article
<a href="/police-departments-tracking-efforts-based-on-false-statistics.html">tracking efforts based on false statistics</a>,
we already showed how false results occur due to specific
misunderstandings about the concepts used in Freenet routing.
This article shows how false positives arise from a
wrong picture of the actual structure of the Freenet network.</p>
<p>First off: In an idealized structure, each node has 6 connections, all
nodes provide the same bandwidth, and all connections are usable all
the time. Such an idealized lattice of nodes looks like the following:</p>
<pre><code> 6 6
6 6 6
6 6
6 6 6
6 6
</code></pre>
<p>In the real network at the time of writing, the number of connections
varies between 5 and 65, depending on the bandwidth available at the
nodes. A snapshot of the connection-count distribution can be seen on
the <a href="https://d6.gnutella2.info/freenet/USK@WMa1Z40iYdZZ51yctQ3toFl9zuuFEnNdsm3NejJU5KE,jCBcaNBeKD5~sSQeSkyKz737Bh5ibBGqdzfD8mgfdMY,AQACAAE/statistics/560/">Freenet statistics site</a>. Between 10% and 80% of
the connections are inactive due to overload (backoff). This increased
when groups of users started to patch their nodes to request data at a
higher rate than the rest.</p>
<p>A more realistic structure therefore looks like this:</p>
<pre><code> 80 6 6
70 6
65 6
60 6
55 13
50 24 13
42 13
39 15
36 18
33 21
30 28 24
6
70 13
55 7 20
40 30
</code></pre>
<p>The difference from the idealized structure that matters most for
this article is that almost every node has at least one connection to
another node with 8 connections or fewer. Several of these
connections are also in backoff, so they are not actually used, which
easily reduces the effective connection count to 4. From now on I will
call such nodes “small nodes”.</p>
<p>If the connection count of a node is just 4, the requests a
neighboring node forwards from a single download look very similar to
requests from the neighboring node if it has 4 downloads running.</p>
<p>Therefore whenever you see requests which could originate from a given
node, you must check how likely it is that they were actually sent
from such a small node. </p>
<p>The first step is to check whether we can exclude a small node as
likely originator. Freenet assigns the number of connections based on
the assigned bandwidth. A node with 4x the bandwidth has 2x the
connections. Therefore, if its user did not actively change its code,
a small node has low bandwidth.</p>
<p>As a simple test, I downloaded a file of roughly 20 MiB on a node
with at most 8 connections, 6 of them active on average. It
downloaded 2 MiB per minute. Scaling up, a node with 16 connections
(2x the connections, so 4x the bandwidth) should download about 8 MiB
per minute. If you observe a download of
400 MiB that takes 2 hours or more, it is possible that it comes from
a small node with around 11 peers (8-9 working at any time).</p>
<p>If you see 400 MiB take only 1 hour or you see 1 GiB downloaded in 2
hours, it is more likely that the originator has 16 connections or
more. With some tricks that can be increased, so as a rule of thumb to
exclude a small node you would have to observe a download of 400 MiB
taking only half an hour. Due to asymmetric connections a small node
is typically one with slow upload, not with slow download.</p>
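<p>The scaling rule above can be turned into a rough estimate. The following is a minimal sketch, assuming that the download rate grows with the square of the connection count (4x the bandwidth for 2x the connections) and using the single test measurement of 2 MiB per minute at 8 connections as the only reference point. The function names are made up for illustration, and real downloads vary with backoff and network load.</p>

```python
REFERENCE_CONNECTIONS = 8      # node used in the test download
REFERENCE_RATE_MIB_MIN = 2.0   # measured download rate at 8 connections


def estimated_rate_mib_per_min(connections):
    """Estimate the download rate, assuming rate ~ connections squared."""
    scale = connections / REFERENCE_CONNECTIONS
    return REFERENCE_RATE_MIB_MIN * scale ** 2


def estimated_download_minutes(size_mib, connections):
    """Estimate how long a download of size_mib takes at that rate."""
    return size_mib / estimated_rate_mib_per_min(connections)


for peers in (8, 11, 16):
    minutes = estimated_download_minutes(400, peers)
    print(f"{peers} connections: about {minutes:.0f} minutes for 400 MiB")
```

<p>Under these assumptions a node with 11 peers needs a bit less than two hours for 400 MiB and a node with 16 connections a bit less than one hour, which is in line with the rule of thumb above.</p>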
<p>With these basics in place, we’ll show the rest with a scenario.</p>
<p>Assume that there is a monitoring node that observed requests coming
from a node with 50 peers. The file in question is 400 MiB big and the
download lasted slightly more than two hours. Assume that you observed
requests for about 3% of the file from the 50-peer node.</p>
<p>In an idealized uniform network without friend-of-a-friend (FOAF)
routing, you would now assume that you are connected to the
originator. But due diligence requires that you correct for FOAF
routing and the real non-uniform structure, and check for false
positives.</p>
<p>How likely is it that the requests were started by one of the roughly
8 small nodes connected to a typical 50-peer-node?</p>
<p>A typical false positive would be that within those two hours, a node
you were connected to tried to retrieve the file. Let’s only use
information we actually received (without naming names). We’ll clearly
mark where we have to make assumptions, and where this is because
required information is lacking.</p>
<p>Let’s start with the information:</p>
<blockquote>
<p>between 3:50 PM UTC and 6:08 PM UTC the Freenet node requested 383
unique blocks. The Freenet node reported an average of 51.3
peers. To reconstruct the file requires a minimum of 12,723 blocks
of a total possible of 24,874.</p>
</blockquote>
<ul>
<li>minimum required blocks: 12,723</li>
<li>the node had 50 peers</li>
<li>the observer saw 383 block-requests sent via the connection with you</li>
</ul>
<p>The node had 50 peers.</p>
<p>At that time about 25% of nodes had fewer than 10 peers (peak at 7
peers), 15% of nodes had only 10 to 15 peers, with the rest evenly
spread between 16 and 70. Only about 20% of nodes had 50 peers or
more. See footnote <sup id="fnref:peercount"><a class="footnote-ref" href="#fn:peercount">1</a></sup> at the end for the origin of this
data.</p>
<p>Assumptions: the node was connected to a node with 7 peers, and that
node requested the file. Of the peers of that node, the node was the
only one with 50 peers or more. Then there was a node with 30 peers. Then
two nodes with 15 peers each, and three nodes with 7 peers each.</p>
<ul>
<li>assumption: actual originator had 7 peers</li>
<li>assumption: the peer-count distribution was typical</li>
</ul>
<p>This is still a typical situation (not a rare one).</p>
<p>Originator-connections:</p>
<ul>
<li>node: 50 peers.</li>
<li>A: 30 peers.</li>
<li>B: 15 peers.</li>
<li>C: 15 peers.</li>
<li>D: 7 peers.</li>
<li>E: 7 peers.</li>
<li>F: 7 peers.</li>
</ul>
<p>We do not know the number of peers of the observer node, so let’s assume
that it has many connections to see a larger share of the traffic. Let’s
assume 70, because that’s what I would do.</p>
<ul>
<li>assumption: observer had 70 peers. (information lacking)</li>
</ul>
<p>First step: The originator requests only the minimum required blocks of
the file, because all requests succeed. In absolute numbers: 12,723
requests.</p>
<p>These are distributed over the peers. In a typical situation, about one
in three peers is backed off. Let’s assume the routable hosts during the
request to be the node, A, C, D and E. B and F are backed off.</p>
<p>Routable:</p>
<ul>
<li>node: 50 peers.</li>
<li>A: 30 peers.</li>
<li>C: 15 peers.</li>
<li>D: 7 peers.</li>
<li>E: 7 peers.</li>
</ul>
<p>FOAF routing distributes those requests not evenly but in proportion
to peer count. There are in total 119 second-level peers, so the node
will receive on average 50/119 * 12723 requests, which is about 5345
requests.</p>
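<p>The first hop can be written down as a small calculation. This is only a sketch of the peer-count weighting described above, not Freenet’s routing code, and the function and variable names are made up for illustration. Note that the five routable peer counts listed above add up to 109, while the text uses 119, so the sketch takes the total as a parameter and shows both figures.</p>

```python
MIN_REQUIRED_BLOCKS = 12723   # minimum blocks needed to reconstruct the file
NODE_PEERS = 50               # peer count of the node in question


def expected_share(total_requests, target_peers, second_level_peers):
    """Requests routed to one peer when weighting by its peer count."""
    return total_requests * target_peers / second_level_peers


# Total of second-level peers as stated in the text.
STATED_TOTAL = 119
# Routable peers as listed in the text: node, A, C, D, E.
LISTED_ROUTABLE_PEER_COUNTS = [50, 30, 15, 7, 7]

expected_with_stated_total = expected_share(
    MIN_REQUIRED_BLOCKS, NODE_PEERS, STATED_TOTAL)
expected_with_listed_sum = expected_share(
    MIN_REQUIRED_BLOCKS, NODE_PEERS, sum(LISTED_ROUTABLE_PEER_COUNTS))

print(f"total 119: about {expected_with_stated_total:.0f} requests")
print(f"total {sum(LISTED_ROUTABLE_PEER_COUNTS)}: "
      f"about {expected_with_listed_sum:.0f} requests")
```

<p>With the smaller total the node would receive somewhat more requests, which would make the observed count look even less remarkable.</p>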
<p>Now we get to the node. Let’s assume a typical distribution
again. Since it has many peers, its peers will follow the overall
peer-count distribution more closely, because the sample is larger.
It will have 10 peers
with 50 or more peers, one of which is the observer node. As usual, 30%
will be backed off.</p>
<p>The routable connections (not backed off):</p>
<ul>
<li>1 Observer: 70 peers.</li>
<li>6 with 60 peers.</li>
<li>13 with 30 peers.</li>
<li>5 with 15 peers.</li>
<li>10 with 7 peers.</li>
</ul>
<p>The backed off connections:</p>
<ul>
<li>(3 with 60 but backed off).</li>
<li>(7 with 30 but backed off).</li>
<li>(2 with 15 but backed off).</li>
<li>(3 with 7 but backed off).</li>
</ul>
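<p>As a quick plausibility check, the assumed neighborhood of the node can be tallied and compared with the distribution quoted earlier (about 25% of nodes with fewer than 10 peers, about 20% with 50 or more, roughly 30% of connections backed off). The table below simply restates the two lists above; the variable names are illustrative.</p>

```python
# (peer count of the neighbor, routable neighbors, backed-off neighbors)
NEIGHBORS = [
    (70, 1, 0),    # the observer
    (60, 6, 3),
    (30, 13, 7),
    (15, 5, 2),
    (7, 10, 3),
]

routable = sum(r for _, r, _ in NEIGHBORS)
backed_off = sum(b for _, _, b in NEIGHBORS)
total = routable + backed_off

# Shares of the neighborhood, to compare with the quoted distribution.
backoff_share = backed_off / total
large_share = sum(r + b for p, r, b in NEIGHBORS if p >= 50) / total
small_share = sum(r + b for p, r, b in NEIGHBORS if 10 > p) / total

print(f"{total} peers, {routable} routable, {backed_off} backed off")
print(f"backed off: {backoff_share:.0%}, "
      f"50 or more peers: {large_share:.0%}, "
      f"fewer than 10 peers: {small_share:.0%}")
```

<p>The tally gives 50 peers in total, 30% of them backed off, 20% with 50 or more peers and 26% with fewer than 10 peers, so the assumed neighborhood stays close to the quoted distribution.</p>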
<p>This gives a total number of 965 second-level peers via routable
connections, of which the observer watches 70. So you’d expect that the
observer will see 5345 * 70 / 965 requests: Total requests you received
multiplied by the peers of the observer and divided by the total count
of routable second-level peers.</p>
<p>5345 * 70 / 965 ≈ 387.7, compared with the 383 requests that were
actually observed.</p>
<p>The observed number of requests is therefore what you would expect
from a false positive. It occurs in a typical scenario where the node is not
the requester.</p>
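<p>The second hop can be reproduced in a few lines. This is a sketch under the assumptions made in this scenario (the routable neighbors listed above, an observer with 70 peers, and 5345 requests arriving at the node); the names are illustrative and not taken from any Freenet code.</p>

```python
# Routable neighbors of the node: {peer count: number of such neighbors}
ROUTABLE_NEIGHBORS = {70: 1, 60: 6, 30: 13, 15: 5, 7: 10}

REQUESTS_AT_NODE = 5345     # requests forwarded to the node (first hop)
OBSERVER_PEERS = 70         # assumed, the real value was not provided
OBSERVED_REQUESTS = 383     # block requests reported by the observer

# Second-level peers reachable via routable connections.
second_level_peers = sum(
    peers * count for peers, count in ROUTABLE_NEIGHBORS.items())

# Requests are weighted by the peer count of each neighbor.
expected_at_observer = (
    REQUESTS_AT_NODE * OBSERVER_PEERS / second_level_peers)

print(f"second-level peers: {second_level_peers}")
print(f"expected at observer if the node only forwards: "
      f"{expected_at_observer:.1f}")
print(f"observed: {OBSERVED_REQUESTS}")
```

<p>The expected count for a node that merely forwards the requests is within a few requests of the observed 383, which is why the observation cannot distinguish the node from a small neighbor.</p>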
<p>The short of it: The argumentation does NOT show that the node is
likely the requester of the file. Not even in a typical situation. The
most likely situation is that a node connected to this node
requested the file without the node’s knowledge. If we used atypical
but frequently occurring situations, this would be even clearer.</p>
<p>Sidenote: A calculation like this is not sufficient to show that someone
is guilty. It only shows that the information provided CANNOT show
guilt, because it is very likely to be a false positive.</p>
<p>This is for a file where all blocks succeeded. For a file that’s on the
brink of dropping out, you’d expect two times as many requests. If the
actual requester had more peers, you’d expect fewer requests. If the
requester had only nodes with few peers, you’d expect more
requests. And this is without actually looking for evidence. This is
just disproving the claim using the much too limited information from
the affidavit by showing that this is most likely a false positive.</p>
<p>Besides: Argumentation like the following is false to a
seriously annoying degree:</p>
<blockquote>
<p>minimum of 12,723 blocks of a total possible of 24,874. These 383 blocks
represent 155% of the even, or expected, share of the minimum block
(12,723) required to download the file and 79% of the even or expected</p>
</blockquote>
<p>Those 3% of the file the observer saw are 155% of what you’d expect if
all of the node’s peers had the same number of peers. But that is a false
assumption, as you can already see from the example distribution given
for the originator. The number of blocks requested scales with the
peer count of each peer. So if the observer node had 60 peers while a
typical node had 20 peers, the observer would automatically receive about 3x
as many requests as with an even share.</p>
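<p>To make the difference between an even share and a peer-weighted share concrete, here is a small sketch. It assumes a node with 50 peers where the observer has 60 peers and every other peer has 20, which is an illustrative simplification and not measured data. It also recomputes the even-share percentages from the quoted figures (383 blocks, 51.3 peers on average).</p>

```python
def share_ratio(observer_peers, other_peer_counts):
    """Peer-weighted share of the observer divided by the even share."""
    all_counts = [observer_peers] + list(other_peer_counts)
    weighted_share = observer_peers / sum(all_counts)
    even_share = 1 / len(all_counts)
    return weighted_share / even_share


# Observer with 60 peers among 49 other peers with 20 peers each.
ratio = share_ratio(60, [20] * 49)
print(f"observer receives about {ratio:.1f}x the even share")

# Even-share percentages from the quoted figures.
OBSERVED_BLOCKS = 383
AVERAGE_PEERS = 51.3
vs_minimum = OBSERVED_BLOCKS / (12723 / AVERAGE_PEERS)
vs_total = OBSERVED_BLOCKS / (24874 / AVERAGE_PEERS)
print(f"share of even split, minimum blocks: {vs_minimum:.0%}")
print(f"share of even split, all blocks: {vs_total:.0%}")
```

<p>A ratio of roughly 1.5 relative to the even split is therefore well below what peer-count weighting alone produces for a well-connected observer.</p>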
<p>As a note: The peer-count can change from release to release when parameters are
optimized. The argumentation here stays the same, but the numbers change
a bit. People will have to look at the peer count distribution
during the time of the measurement.</p>
<p>Final note: The minimal information required for statistical claims
about observations of node upload or download activity in Freenet:</p>
<ul>
<li>
<p>The exact time and HTL of each watched chunk that was seen from the node</p>
<ul>
<li>per chunk: node-location of the observer at the time</li>
<li>per chunk: node-location of the observed at the time</li>
<li>per chunk: node-locations of all peers of the observer at the time</li>
<li>per chunk: node-locations of all peers of the observed at the time</li>
<li>per-chunk: the manifest it belongs to
(only size + index in some list + number of chunks in the
manifest)</li>
<li>per chunk: routing part of the key of the chunk
(no decryption possible from this info => data not accessible)</li>
</ul>
</li>
<li>
<p>The exact formula of the probability that the observed is a valid
target</p>
</li>
<li>The exact formula of the probability that the observed is not a false
positive</li>
<li>
<p>The results of applying those formulas to the data along with the data,
so they can be checked independently.</p>
</li>
<li>
<p>All chunks received at HTL &lt;= 16 which would be a match if at HTL &gt; 16</p>
</li>
<li>The peercounts they observed on that day in all nodes they connected to
(a plain list of numbers)</li>
<li>Keys for chunks should be truncated by cutting or blacking at least
4 letters, so they cannot easily be used to download the associated
data, though the full keys must be provided on request to an
independent trusted party (e.g. the defense lawyer) to verify that
they contain what is claimed. Otherwise they could just make up
claims from thin air.</li>
</ul>
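<p>One way to picture the per-chunk information listed above is as a record per watched chunk. The following is a hypothetical sketch: the class, its field names and the helper function are invented for illustration and do not correspond to any existing Freenet data format. Node locations are assumed to be stored as floating point numbers.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchedChunk:
    """One watched chunk with the per-chunk details listed above."""
    seen_at_utc: str                 # exact time the chunk was seen
    htl: int                         # HTL of the request
    observer_location: float         # node-location of the observer
    observed_location: float         # node-location of the observed
    observer_peer_locations: tuple   # locations of all observer peers
    observed_peer_locations: tuple   # locations of all observed peers
    manifest_size: int               # size of the manifest
    manifest_index: int              # index of the manifest in some list
    manifest_chunk_count: int        # number of chunks in the manifest
    routing_key: str                 # routing part of the chunk key


def truncate_key(key, hidden=4, mask="#"):
    """Black out the last `hidden` characters of a key for publication."""
    if hidden > len(key):
        raise ValueError("key is shorter than the part to hide")
    return key[:len(key) - hidden] + mask * hidden


# Example with made-up values.
chunk = WatchedChunk(
    seen_at_utc="2019-01-01T15:50:00Z", htl=18,
    observer_location=0.25, observed_location=0.75,
    observer_peer_locations=(0.1, 0.3), observed_peer_locations=(0.7, 0.9),
    manifest_size=400, manifest_index=0, manifest_chunk_count=24874,
    routing_key="abcdefgh",
)
print(truncate_key(chunk.routing_key))
```

<p>Publishing such records with truncated keys, and the full keys only to an independent trusted party, would let others rerun the statistics without gaining access to the data itself.</p>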
<p>Definition: watched chunks are those which are recorded if received from
the observed or sent to the observed, as well as those which
would be recorded if received by the observer or sent by the
observer.</p>
<p>If observers cannot provide this minimal information, they cannot get
a robust statistical result. If they do not want to provide this to
a court, they prevent the court from checking their claims.</p>
<p>Yes, it is hard to correctly trace activity in Freenet to a
specific user. Without this property, Freenet could not protect
freedom of speech and of the press, both of which are under attack
in many countries around the world.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:peercount">
<p>The peercount is taken from the statistics in June and October, versions 334 and 355 as found via the datehints for that site, counted by eye: SSK@WMa1Z40iYdZZ51yctQ3toFl9zuuFEnNdsm3NejJU5KE,jCBcaNBeKD5~sSQeSkyKz737Bh5ibBGqdzfD8mgfdMY,AQACAAE/statistics-DATEHINT-2018-9?type=text/plain SSK@WMa1Z40iYdZZ51yctQ3toFl9zuuFEnNdsm3NejJU5KE,jCBcaNBeKD5~sSQeSkyKz737Bh5ibBGqdzfD8mgfdMY,AQACAAE/statistics-DATEHINT-2018-10?type=text/plain SSK@WMa1Z40iYdZZ51yctQ3toFl9zuuFEnNdsm3NejJU5KE,jCBcaNBeKD5~sSQeSkyKz737Bh5ibBGqdzfD8mgfdMY,AQACAAE/statistics-334/plot_peer_count.png SSK@WMa1Z40iYdZZ51yctQ3toFl9zuuFEnNdsm3NejJU5KE,jCBcaNBeKD5~sSQeSkyKz737Bh5ibBGqdzfD8mgfdMY,AQACAAE/statistics-355/plot_peer_count.png <a class="footnote-backref" href="#fnref:peercount" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
</div><!-- /.entry-content -->
<a href="archives.html">News Archives</a>
</section>
</main>
<footer>
<header>
<h2>Hyphanet</h2>
<p>Navigate with Freedom</p>
</header>
<ul class="social">
<a href="https://twitter.com/freenetproject">Twitter</a>
<a rel="me" title="Hyphanet News and Info in the Fediverse" href="https://floss.social/@Freenet">Mastodon</a>
</ul>
<div id="contact">
<span style="display:inline-block; unicode-bidi:bidi-override; direction:rtl;" onmouseover="this.innerText=this.innerText.split('').reverse().join(''); this.style.unicodeBidi='';this.style.direction=''; this.removeAttribute('onmouseover');">gro.tcejorpteneerf@sserp</span></br>
<span style="display:inline-block; unicode-bidi:bidi-override; direction:rtl;" onmouseover="this.innerText=this.innerText.split('').reverse().join(''); this.style.unicodeBidi='';this.style.direction=''; this.removeAttribute('onmouseover');">gro.tcejorpteneerf@troppus</span></br>
<span>IRC: <a href="https://web.libera.chat/?nick=FollowRabbit|?#freenet">#freenet on irc.libera.chat</a></span></br>
</div>
<p id="copyright">Licensed under the <a href="https://www.gnu.org/licenses/fdl-1.3.html">GFDL</a>. <a href="https://github.com/hyphanet/website">Website source repository</a>, <a href="/pages/download.html#privacy-policy">Privacy Policy</a></p>
</footer></body>
</html>