"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..." Dr. Matei Zaharia, Original Creator of Apache Spark (FOREWORD by Dr. Matei Zaharia)
- Word Count is a simple, easy-to-understand algorithm that can be readily implemented as a MapReduce/Spark application. Given a set of text documents, the program counts the number of occurrences of each word.
- Word count finds the frequency of each word in a set of documents/files. The goal is to create a dictionary of `(key, value)` pairs, where `key` is a word (as a String) and `value` is an Integer denoting the frequency of that word.
- A complete set of solutions is given for the Word Count problem using
- BEFORE reduction filter: You may add `filter()` to remove undesired words (this can be done after tokenizing records).
- AFTER reduction filter: To produce the desired final word count as `(word, frequency)` pairs, you may add `filter()` to remove elements where `frequency < N`, where `N` (an integer) is your threshold. This can be done after reduction.
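
The stages described above (tokenize, filter before reduction, reduce, filter after reduction) can be sketched in plain Python. This is a minimal illustration, not one of the Spark solutions themselves: the names `tokenize`, `word_count`, `STOP_WORDS`, and `min_frequency` are hypothetical, and the stop-word set and threshold are arbitrary choices made for the example. In PySpark, the same stages would typically map onto `RDD.flatMap()`, `RDD.filter()`, and `RDD.reduceByKey()`.

```python
import re
from collections import Counter

# Hypothetical set of undesired words, chosen only for illustration.
STOP_WORDS = frozenset({"a", "the", "is", "of"})


def tokenize(record):
    """Split one text record into lowercase words."""
    return re.findall(r"[a-z']+", record.lower())


def word_count(records, stop_words=frozenset(), min_frequency=1):
    """Return a dictionary of (word, frequency) pairs.

    stop_words    -- words dropped BEFORE reduction
    min_frequency -- threshold N; pairs with frequency < N are
                     dropped AFTER reduction
    """
    # Tokenize every record into a flat stream of words.
    words = (word for record in records for word in tokenize(record))

    # BEFORE reduction filter: remove undesired words.
    kept = (word for word in words if word not in stop_words)

    # Reduction: sum the occurrences of each word.
    counts = Counter(kept)

    # AFTER reduction filter: keep only pairs with frequency >= N.
    return {word: n for word, n in counts.items() if n >= min_frequency}


if __name__ == "__main__":
    docs = ["the fox jumped", "the fox is red", "a red fox"]
    print(word_count(docs, stop_words=STOP_WORDS, min_frequency=2))
    # prints {'fox': 3, 'red': 2}
```

Filtering before the reduction shrinks the amount of data that has to be counted, while filtering after it only trims the final result, so the two filters serve different purposes and can be combined.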