Commit
cleaned up commit 327f5de for pull request
mapred_tfidf.py:
- switched argument parsing to argparse
- if MapReduce will overwrite directories, users are now asked whether they want to continue
- added '--force' flag to automatically force directory overwrites

map_reduce_utils:
- 'numeric' words such as '14th' are filtered out

mappers/reducers:
- added documentation for each script, removed unneeded conditionals, added helper methods for print formatting
- word_count_red no longer has to re-read each file to see how long it is
- cos_sim_map now ensures that two documents are sent to the same reducer regardless of which order they arrive in
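The mapred_tfidf.py changes are not shown in this view, so the following is only a minimal sketch of how an argparse-based '--force' flag and an overwrite prompt could fit together. Only the '--force' flag name comes from the commit message; `parse_args`, `may_overwrite`, and the `--output` argument are hypothetical names chosen for illustration, and the sketch is written for Python 3 rather than the Python 2 used in the repository.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse command-line options (hypothetical layout, not the real script)."""
    parser = argparse.ArgumentParser(description='run a MapReduce job')
    # '--output' is an assumed argument used only to have a directory to check.
    parser.add_argument('--output', default='output',
                        help='directory the job writes to')
    # '--force' is the flag named in the commit message.
    parser.add_argument('--force', action='store_true',
                        help='overwrite existing directories without asking')
    return parser.parse_args(argv)


def may_overwrite(path, force, ask=input):
    """Return True if it is acceptable to write to path.

    Nothing is asked when the path does not exist or when force is set.
    Otherwise the user is prompted and anything other than y/yes is a no.
    """
    if force or not os.path.exists(path):
        return True
    answer = ask('%s already exists and will be overwritten. '
                 'Continue? [y/N] ' % path)
    return answer.strip().lower() in ('y', 'yes')
```

A caller would run `args = parse_args()` and exit early when `may_overwrite(args.output, args.force)` returns False. Passing the prompt function in as `ask` keeps the check testable without a terminal.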
Showing 16 changed files with 251 additions and 75 deletions.
@@ -1,8 +1,21 @@
 #!/usr/bin/env python
 import sys
 
+"""
+(word) (file1 file2 tfidf1*tfidf2) --> (file1 file2) (tfidf1*tfidf2)
+for each word common to two documents, removes the word from the
+key/value pair and replaces it with the two filenames so that we can
+sum up the values for each pair of documents in the reducer.
+"""
+
 for line in sys.stdin:
     key, value = line.strip().split('\t')
     doc1, doc2, product = value.strip().split()
     product = float(product)
-    print '%s %s\t%.16f' % (doc1, doc2, product)
+
+    # we want to ensure that (doc1 doc2) and (doc2 doc1) get
+    # sent to the same reducer, so we order them alphabetically
+    if doc1 > doc2:
+        doc1, doc2 = doc2, doc1
+
+    print '%s %s\t%s' % (doc1, doc2, product)
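The ordering step in this mapper can be pulled out into a small function to show why it works: whichever order the two filenames arrive in, the emitted key is the same string, so both records are routed to the same reducer. This is a Python 3 sketch for illustration only; `pair_key` and `map_line` are hypothetical names that do not appear in the commit.

```python
def pair_key(doc1, doc2):
    """Return one key for a document pair, independent of argument order."""
    # Same comparison as the mapper: put the smaller name first.
    if doc1 > doc2:
        doc1, doc2 = doc2, doc1
    return '%s %s' % (doc1, doc2)


def map_line(line):
    """Turn 'word<TAB>doc1 doc2 product' into 'doc1 doc2<TAB>product'."""
    _word, value = line.strip().split('\t')
    doc1, doc2, product = value.split()
    return '%s\t%s' % (pair_key(doc1, doc2), float(product))
```

With this, `map_line('cat\tb.txt a.txt 0.25')` and `map_line('cat\ta.txt b.txt 0.25')` produce the same output line, which is the property the commit message describes.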
@@ -1,14 +1,32 @@
 #!/usr/bin/env python
+
 from nltk.stem.porter import PorterStemmer
 from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords
 import string
+import re
 
+"""
+map_reduce_utils contains helper functions that are used in multiple
+map-reduce tasks.
+"""
+
+
 def clean_text(text):
-    # TODO remove words w/ numerals, e.g. '14th'
+    """
+    returns a 'cleaned' version of text by filtering out all words
+    that don't contain strictly alphabetic characters, converting
+    all words to lowercase, filtering out common stopwords, and
+    stemming each word using porter stemming.
+    """
     stemmer = PorterStemmer()
     result = text.lower()
     result = result.translate(None, string.punctuation)
     result = result.replace('\n', ' ')
     result = result.split()
+
+    # filter out 'numeric' words such as '14th'
+    is_alpha = re.compile('^[a-z]+$')
+    result = filter(lambda word: is_alpha.match(word), result)
+
     result = [stemmer.stem(word) for word in result]
     return filter(lambda word: word not in stopwords, result)
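To make the new filter easier to try in isolation, here is a dependency-free Python 3 sketch of the tokenizing half of `clean_text`. It is not the committed code: the function name `alphabetic_tokens` is hypothetical, the stemming and stopword steps are left out so that nltk and scikit-learn are not needed, and `str.maketrans` replaces the Python 2 form `translate(None, string.punctuation)`, which does not work on Python 3 strings.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Same pattern as the commit: one or more lowercase letters, nothing else.
_IS_ALPHA = re.compile(r'^[a-z]+$')


def alphabetic_tokens(text):
    """Lowercase text, drop punctuation, and keep purely alphabetic words."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    # split() with no argument already treats newlines as separators,
    # so no explicit replace('\n', ' ') is needed here.
    return [word for word in words if _IS_ALPHA.match(word)]
```

Because punctuation is deleted before the pattern is applied, a word such as "it's" survives as "its", while "14th" and "2015" are dropped since they still contain digits.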