Word Count

"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Introduction to Word Count

Word Count is a simple algorithm that is easy to implement as a MapReduce or Spark application. Given a set of text documents (files), the program counts the number of occurrences of each word.
The goal is to create a dictionary of (key, value) pairs, where the key is a word (a String) and the value is an Integer giving the frequency of that word across the documents.
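The idea can be sketched in plain Python before turning to MapReduce or Spark. This is a minimal illustration, not one of the solutions referenced here: the function name `word_count` and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
from collections import defaultdict


def word_count(documents):
    """Return a dict mapping each word to its frequency across all documents."""
    counts = defaultdict(int)
    for doc in documents:
        # "Map" step: split the record into words
        # (assumed tokenizer: lowercase, whitespace-delimited).
        for word in doc.lower().split():
            # "Reduce" step: add 1 to the running total for this key.
            counts[word] += 1
    return dict(counts)


# Three tiny documents used only for illustration.
docs = ["fox jumped", "fox is red", "red fox"]
frequencies = word_count(docs)
print(frequencies)
```

The result is the dictionary of (word, frequency) pairs described above. A distributed implementation follows the same two steps, but spreads the tokenizing and the per-key summing across many workers.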
A complete set of solutions is given for the Word Count problem, using two optional filters:
BEFORE reduction filter: after tokenizing the records, you may add filter() to remove undesired words.
AFTER reduction filter: after the reduction, you may add filter() to remove (word, frequency) elements where frequency < N, where the integer N is your threshold, so that the final word count contains only the desired pairs.
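The two filter positions can be sketched in plain Python as follows. This is an illustration only, not the Spark code itself: the function name `word_count_with_filters`, the `stop_words` set, and the `min_frequency` parameter (playing the role of N) are hypothetical names chosen for the example, and the tokenizer is assumed to be lowercase and whitespace-based.

```python
from collections import Counter


def word_count_with_filters(documents, stop_words, min_frequency):
    """Count words, filtering both before and after the reduction."""
    # Tokenize every record (assumed tokenizer: lowercase, whitespace-delimited).
    words = (w for doc in documents for w in doc.lower().split())
    # BEFORE-reduction filter: drop undesired words right after tokenizing.
    kept = (w for w in words if w not in stop_words)
    # Reduction: sum the occurrences of each remaining word.
    counts = Counter(kept)
    # AFTER-reduction filter: drop pairs where frequency < N (keep frequency >= N).
    return {w: n for w, n in counts.items() if n >= min_frequency}


docs = ["the fox jumped", "the fox is red", "a red fox"]
result = word_count_with_filters(
    docs, stop_words={"the", "a", "is"}, min_frequency=2
)
print(result)
```

Filtering before the reduction shrinks the data that has to be grouped by key, while filtering after the reduction only trims the final output. The two filters are independent, so either can be used without the other.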

Word Count in MapReduce

Word Count in PySpark RDDs

Word Count in PySpark DataFrames

References

1. Word count from Wiki

2. Word Count Example, Spark