Wordcount is the "Hello World" for Hadoop, yet most of the Pig and Hive wordcount examples I've seen either require UDFs or external scripts, or they just don't do a very good job of counting words. Here are my Wordcount hacks.

slimandslam/pig-hive-wordcount

pig-hive-wordcount

Wordcount is the "Hello World" for Hadoop, yet most of the Pig and Hive wordcount examples I've seen either require UDFs or external scripts, or they just don't do a very good job of counting words.

So, my goal here was not efficiency, but merely to create Pig and Hive scripts that:

  1. Use only stock functions that ship with the language (no UDFs or external scripts)
  2. Are short and simple
  3. Do a pretty good job of counting words
  4. Produce diff-able output

To make the output diff-able, I reformat the Hive output to look like the output of the Pig DUMP operator. In my few tests, the two scripts have produced identical or nearly identical output most of the time, though Hive occasionally still insists on counting some invisible character.
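
As a checking aid only (it is not part of the Pig or Hive scripts, which deliberately avoid external code), the comparison could be sketched in Python as below. This is a minimal sketch under stated assumptions: Pig's DUMP prints each tuple as `(word,count)`, the raw Hive rows are tab-separated `word<TAB>count`, and the helper names `hive_row_to_dump`, `diff_outputs`, and `show_invisible` are hypothetical names invented for this illustration.

```python
import difflib


def hive_row_to_dump(row: str) -> str:
    """Turn one tab-separated row ('word<TAB>count') into '(word,count)'.

    Assumes the Hive output has not already been reformatted.
    """
    fields = row.rstrip("\n").split("\t")
    return "(" + ",".join(fields) + ")"


def diff_outputs(pig_lines, hive_rows):
    """Return unified-diff lines between Pig DUMP lines and reformatted Hive rows."""
    pig = [line.rstrip("\n") for line in pig_lines]
    hive = [hive_row_to_dump(row) for row in hive_rows]
    return list(
        difflib.unified_diff(pig, hive, fromfile="pig", tofile="hive", lineterm="")
    )


def show_invisible(line: str) -> str:
    """Escape non-printable characters so a stray invisible 'word' becomes visible."""
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in line
    )


if __name__ == "__main__":
    print(hive_row_to_dump("hadoop\t3"))  # (hadoop,3)
    # An empty diff means the two outputs agree line for line.
    for line in diff_outputs(["(hadoop,3)"], ["hadoop\t3", "\u200b\t1"]):
        print(show_invisible(line))
```

Passing each diff line through `show_invisible` is one way to see which character is being counted when the two outputs disagree on an apparently empty word.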
