Expand README #42

README.md: 139 additions & 43 deletions

Simhash Near-Duplicate Detection
================================
This library enables the efficient identification of near-duplicate documents using
`simhash`, implemented as a C++ extension.

`simhash` differs from most hashes in that its goal is to have two similar documents
produce similar hashes, where most hashes aim to produce very different hashes even
in the face of small changes to the input.

Overview
========
`simhash` is a bit of an overloaded word. It is often used interchangeably for:
1) a function to generate a simhash from input, and 2) the method used for identifying
near-duplicates from a set of simhashes. This document will try to preserve that
distinction.

The `simhash` hashing function accepts a list of input hashes and produces a
single hash. It doesn't matter how long the input list is - `simhash` will always
give a hash of the same size, 64 bits. The details of this transformation aren't
particularly interesting, but the input to this function is very important. **The
way the input list of hashes is computed has a huge impact on the quality of the
matches found.**
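
For intuition, the transformation amounts to a per-bit majority vote across the input
hashes: similar input lists agree on most votes and so produce mostly identical output
bits. Below is a minimal pure-Python sketch of the idea; it is illustrative only, and
the library's C++ implementation may differ in details such as tie-breaking.

```python
def simhash_sketch(hashes, bits=64):
    '''Illustrative only: a per-bit majority vote over the input hashes.'''
    # Tally a +1/-1 vote for each bit position across all input hashes.
    counts = [0] * bits
    for h in hashes:
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Set each output bit to the majority vote for that position.
    result = 0
    for i in range(bits):
        if counts[i] > 0:
            result |= 1 << i
    return result
```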

If there is one thing that we should make very clear, it's that while this library
provides tooling to help with 1) performing the simhash hashing function on an
input list of hashes and 2) performing the duplication detection on the resulting
hashes, **users of the library must be able to convert each of their documents
to a representative list of hashes in order to get good results.** How that
"representative" list is created is highly application-specific: the techniques
that could be used on a photograph are very different from those that could be
used on a webpage.

Practical Considerations
========================
There is a [writeup](https://moz.com/devblog/near-duplicate-detection/) that goes
into more detail, but we will try to summarize some important points here.

Suppose we have a bunch of text documents we want to compare. Before identifying
duplicates, we must first generate the simhash for each. To do so, we must
turn each document into a list of hashes. Let's start by just taking all the
words of the document in order and putting each through the `unsigned_hash` function:

```python
from simhash import unsigned_hash, compute

document = 'some really long block of text...'
words = document.split(' ')

# We'll see in a minute that this is a BAD technique. Do not do this.
hashes = map(unsigned_hash, words)
simhash = compute(hashes)
```

The problem with this is that a very different document could have the exact same
hash:

```python
# These documents would have the exact same simhashes if we hash individual words
a = 'one two three four five ...'
b = 'two four three five one ...'
```
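
You can check that word order never enters into this computation; the same bag of
words yields the same simhash regardless of ordering (this snippet assumes the
imports from the example above):

```python
# Same words, different order: the resulting simhashes are identical.
a_hashes = map(unsigned_hash, 'one two three four five'.split())
b_hashes = map(unsigned_hash, 'two four three five one'.split())
assert compute(a_hashes) == compute(b_hashes)
```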

To improve this situation, a common technique is to use "shingling." Just like
shingles on a roof overlap, we will consider overlapping ranges of words. For
example, for the text `one two three four five ...`, we could get shingles of
size three: `one two three`, `two three four`, `three four five`, `four five ...`.
Using this technique ensures that word order matters, not just which words are used.

We provide a `shingle` function to help with this:

```python
from simhash import shingle

words = ...
# Use four words per shingle
shingles = shingle(words, window=4)
# Use the shingles when computing the hashes, instead of words
hashes = map(unsigned_hash, shingles)
simhash = compute(hashes)
```

Two documents are considered near-duplicates if their simhashes differ by only a few
bits. Exactly how many "a few" means is highly application-specific (as much so
as the method for computing the input hashes to the simhash function).
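
The library's `num_differing_bits` helper (imported in the full example below)
computes that bit distance directly. A quick illustration, reusing the shingled
pipeline from above:

```python
from simhash import shingle, unsigned_hash, compute, num_differing_bits

def simhash_text(text):
    shingles = shingle(text.split(' '), window=4)
    return compute(map(unsigned_hash, shingles))

a = simhash_text('some really long block of text ...')
b = simhash_text('some really long block of texts ...')
# The fewer bits differ, the more similar the two documents are.
print(num_differing_bits(a, b))
```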

Example Code
============
If you were looking for a shortcut for getting good near-duplicates, this is
the closest thing to it. **However, read and understand this document or risk
sub-par duplicate-detection.** And with that, let's dive into an example using
tokenized text documents:

```python
from simhash import shingle, unsigned_hash, compute, find_all, num_differing_bits
from collections import defaultdict
import itertools

def simhashDocument(doc):
    '''Do rudimentary tokenization and produce a simhash.'''
    shingles = shingle(doc.split(' '), window=4)
    return compute(map(unsigned_hash, shingles))

# We need to keep a mapping of simhashes to the original documents. This is a
# dict of lists, because distinct documents can produce the same simhash.
simhashMap = defaultdict(list)

# The paths to each of the documents
paths = [...]

# Compute all the simhashes
for path in paths:
    with open(path) as fin:
        doc = fin.read()
        simhashMap[simhashDocument(doc)].append(path)

# Find all the matching pairs
#
# The different_bits parameter is application-specific, and we'll talk about how
# to pick a good value in the next section.
#
# The number_of_blocks parameter affects performance. It must be in the range
# [different_bits + 1, 64]. Try starting with different_bits + 2 and tweak from
# there for the best performance.
pairs = find_all(simhashMap.keys(), number_of_blocks=6, different_bits=3)

# For each pair in the matches, all associated documents are near-duplicates
for a, b in pairs:
    distance = num_differing_bits(a, b)
    aPaths = simhashMap[a]
    bPaths = simhashMap[b]
    for aPath, bPath in itertools.product(aPaths, bPaths):
        print('%s is a near-duplicate of %s (distance = %d)' % (aPath, bPath, distance))

# Technically, all the documents with the same simhash are near-duplicates as
# well, but that's left as an exercise for the reader.
```
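
The exercise mentioned in the closing comment amounts to reporting groups of
documents that collided on the very same simhash (a distance of zero). One possible
sketch, reusing `simhashMap` from above:

```python
# Documents that produced identical simhashes are near-duplicates too.
for paths in simhashMap.values():
    if len(paths) > 1:
        print('identical simhashes: %s' % ', '.join(paths))
```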

Choosing Parameters
===================

The `number_of_blocks` parameter is not particularly intuitive. It is described
in more detail in [this article](https://moz.com/devblog/near-duplicate-detection/) and in
[the paper](http://www2007.cpsc.ucalgary.ca/papers/paper215.pdf). Internally,
`find_all` makes `number_of_blocks C different_bits` (a binomial coefficient) passes
to complete, and uses `O(hashes + matches)` memory. With more blocks, the number of
passes required increases combinatorially, but each pass becomes faster. It is
important to find the correct balance for performance.
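
To make that trade-off concrete, here are the pass counts for a few values of
`number_of_blocks` with `different_bits=3`:

```python
from math import comb

# find_all makes (number_of_blocks choose different_bits) passes.
for number_of_blocks in (4, 5, 6, 8):
    print(number_of_blocks, 'blocks ->', comb(number_of_blocks, 3), 'passes')
# 4 blocks -> 4 passes, 5 -> 10, 6 -> 20, 8 -> 56
```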

The pairs returned by `find_all` are guaranteed to be _all_ the pairs whose
simhashes differ by `different_bits` bits or fewer. Whether those pairs cover all
the documents you are hoping to match comes back to the two main factors that
determine the quality of matches: 1) the way the representative document hashes
are computed, and 2) the `different_bits` parameter.

Choosing the best `different_bits` parameter is difficult. It usually involves
taking an example set of documents and a gold standard of all the near-duplicate
document pairs, and then evaluating the
[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) for
different choices of parameters. While perfect results are unlikely, it is
certainly possible to get both precision and recall to be very high. The big
upside to the `simhash` approach is that it can be easily run on datasets that
would otherwise be prohibitively large.
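
A sketch of such an evaluation, assuming a hypothetical hand-labeled `gold_pairs`
set (each entry a `frozenset` of two paths known to be near-duplicates) and the
`simhashMap` built in the example above:

```python
import itertools
from simhash import find_all

# gold_pairs is hypothetical: hand-labeled pairs of near-duplicate paths.
for different_bits in range(1, 8):
    found = set()
    pairs = find_all(simhashMap.keys(), number_of_blocks=different_bits + 2,
                     different_bits=different_bits)
    for a, b in pairs:
        for pair in itertools.product(simhashMap[a], simhashMap[b]):
            found.add(frozenset(pair))
    precision = len(found & gold_pairs) / len(found) if found else 1.0
    recall = len(found & gold_pairs) / len(gold_pairs)
    print('bits=%d precision=%.2f recall=%.2f' % (different_bits, precision, recall))
```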

Building
========