wordcount

Canonical MapReduce wordcount tutorial, implemented in Python.

Dependencies

  • Python 3

Running

$ git clone git@github.com:thehackerati/wordcount.git
$ cd wordcount
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar	1
foo	3
labs	1
quux	2
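
The pipeline above can be approximated in a single script. This is a minimal sketch, not the repository's actual mapper.py and reducer.py: the function names `map_words`, `reduce_counts`, and `wordcount` are hypothetical, and the built-in `sorted` stands in for the `sort -k1,1` step.

```python
def map_words(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    records = []
    for line in lines:
        for word in line.split():
            records.append(f"{word}\t1")
    return records


def reduce_counts(sorted_records):
    """Sum the counts of adjacent records that share a key.

    Like a streaming reducer, this relies on the input being sorted
    so that all records for one word arrive together.
    """
    results = []
    current_word, current_count = None, 0
    for record in sorted_records:
        word, count = record.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, int(count)
    # Flush the final run of records.
    if current_word is not None:
        results.append((current_word, current_count))
    return results


def wordcount(lines):
    """Hypothetical helper: map, sort, then reduce, all in one process."""
    return reduce_counts(sorted(map_words(lines)))


if __name__ == "__main__":
    for word, count in wordcount(["foo foo quux labs foo bar quux"]):
        print(f"{word}\t{count}")
```

Keeping the map and reduce steps as separate functions mirrors the two-script layout, so each step can still be exercised on its own.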

Check out the 'iterators' branch to see a wordcount implementation using Python iterators and generators.

$ git checkout iterators

You can run it and see that it behaves the same as the first implementation on master.
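
As a rough illustration of the iterator style, the following sketch uses generators and `itertools.groupby`. It is an assumption about what such an implementation might look like, not a copy of the branch; the names `read_words`, `map_pairs`, and `reduce_pairs` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def read_words(lines):
    """Lazily yield whitespace-separated tokens from an iterable of lines."""
    for line in lines:
        yield from line.split()


def map_pairs(words):
    """Lazily pair every word with a count of 1."""
    return ((word, 1) for word in words)


def reduce_pairs(sorted_pairs):
    """Group adjacent pairs by word and sum their counts.

    groupby only merges neighbouring items, so the input must be sorted by word.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# sorted() plays the role of `sort -k1,1` between the two stages.
pairs = sorted(map_pairs(read_words(["foo foo quux labs foo bar quux"])))
counts = list(reduce_pairs(pairs))
print(counts)
```

The generators process one item at a time, so only the sort step needs to hold the whole data set in memory.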

Now, try downloading James Joyce's Ulysses to use as a larger data set for testing.

$ cat ulysses.txt | ./mapper.py | sort -k1,1 | ./reducer.py
!               1
"Come           1
"Defects,"      1
"I              1
"Information    1
...

Notice that words with adjacent punctuation are counted as distinct words. To get a better count, we need to strip out this punctuation. Check out the 'strip-nonalpha' branch to see a version that uses a compiled regular expression to remove non-alphanumeric characters.

$ git checkout strip-nonalpha
$ cat ulysses.txt | ./mapper.py | sort -k1,1 | ./reducer.py
...
Dub     3
Dubedat 3
Dubedatandshedidbedad   1
Dublin  122
Dubliner        2
Dubliners       1
...

The results are now much cleaner, but some are still suspicious, like 'Dubedatandshedidbedad' above.
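
The sketch below shows one way a compiled regular expression can strip non-alphanumeric characters, and one possible explanation for fused tokens. The pattern and the names `NON_ALNUM`, `clean`, and `split_words` are assumptions for illustration; the branch may use a different expression, and the cause of the result above has not been verified against the text.

```python
import re

# Assumed pattern: any run of characters that are not ASCII letters or digits.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def clean(token):
    """Delete non-alphanumeric characters from a single token."""
    return NON_ALNUM.sub("", token)


def split_words(text):
    """Alternative: treat non-alphanumeric runs as separators, not noise."""
    return [word for word in NON_ALNUM.split(text) if word]


print(clean('"Defects,"'))     # Defects
print(clean("one-two"))        # onetwo: words joined only by punctuation get fused
print(split_words("one-two"))  # ['one', 'two']
```

Deleting punctuation fuses any words that were separated by punctuation alone, while splitting on it keeps them apart. Whether splitting is the better choice depends on how you want to count forms such as contractions.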

Exercises

  • Count different capitalizations of the same word as the same.
  • Address any remaining data quality issues that exist in this implementation.
  • Add unit tests.
  • Deploy this code locally on Hadoop.
  • Implement continuous integration.
  • Deploy this code to AWS (EMR).

Resources

Credits

This code was inspired by an awesome tutorial by Michael Noll.
