Cannonical mapreduce wordcount tutorial, implemented in python.
- Python 3
$ git clone [email protected]:thehackerati/wordcount.git
$ cd wordcount
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2
Checkout the 'iterators' branch to see a wordcount implementation using Python iterators and generators.
$ git checkout iterators
You can run it and see that it behaves the same as the first implementation in master.
Now, try downloading James Joyce's Ulysses to use as a larger data set for testing.
$ cat ulysses.txt | ./mapper.py | sort -k1,1 | ./reducer.py
! 1
"Come 1
"Defects," 1
"I 1
"Information 1
...
Notice that we're considering words with adjacent punctuation as unique. To get a better count of words, we need to strip out this punctuation. Checkout the 'strip-nonalpha' branch to see a version that uses a compiled regular expression to remove non-alphanumeric characters.
$ git checkout strip-nonalpha
$ cat ulysses.txt | ./mapper.py | sort -k1,1 | ./reducer.py
...
Dub 3
Dubedat 3
Dubedatandshedidbedad 1
Dublin 122
Dubliner 2
Dubliners 1
...
Notice now that we're much cleaner, but there are still some suspicious results, like 'Dubedatandshedidbedad' above.
- Count different capitalizations of the same word as the same.
- Address any remaining data quality issues exist in this implementation.
- Add unit tests.
- Deploy this code locally on Hadoop.
- Implement continuous integration.
- Deploy this code to AWS (EMR).
This code was inspired by an awesome tutorial by Michael Noll.