The following steps have been tested only on an Ubuntu 12.04 LTS server.
<< Download these labs >>
sudo apt-get install git
git clone https://github.com/jazzwang/hadoop_labs
<< Local Mode >>
lab000/hadoop-local-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input
Exercise:
(1) How many Java processes are there in local mode?
(2) What do you see after running "hadoop fs -ls" in local mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?
<< Pseudo-Distributed Mode >>
lab001/hadoop-pseudo-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input
Exercise:
(1) How many Java processes are there in pseudo-distributed mode?
(2) What do you see after running "hadoop fs -ls" in pseudo-distributed mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?
<< Full Distributed Mode >>
lab002/hadoop-full-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
Exercise:
(1) How many Java processes are there in full distributed mode?
(2) What do you see after running "hadoop fs -ls" in full distributed mode?
(3) In the netstat results, what is the difference between full distributed mode and pseudo-distributed mode?
<< DEBUG: Shell Script >>
file `which hadoop`
bash -x `which hadoop` fs -ls
bash -x `which hadoop` jar
bash -x `which hadoop` fsck
Exercise:
(1) Which Java class does the "hadoop fs" command invoke?
(2) Which Java class does the "hadoop jar" command invoke?
(3) Which Java class does the "hadoop fsck" command invoke?
<< DEBUG: Log4J >>
export HADOOP_ROOT_LOGGER=INFO,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=WARN,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls
unset HADOOP_ROOT_LOGGER
Exercise:
(1) In the output of "hadoop fs -ls", is there any difference between INFO and WARN?
(2) In the output of "hadoop fs -ls", is there any difference between INFO and DEBUG?
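The three HADOOP_ROOT_LOGGER settings above differ only in the threshold Log4J applies before printing a message. As a rough plain-Java sketch (an illustration of the filtering rule, not the real Log4J API):

```java
// Sketch of Log4J-style level filtering. The numeric ordering mirrors
// Log4J's DEBUG < INFO < WARN; this is an illustration, not the real API.
public class LevelFilter {
    public static final int DEBUG = 10, INFO = 20, WARN = 30;

    // A message is emitted only if its level is at or above the threshold.
    public static boolean emits(int threshold, int messageLevel) {
        return messageLevel >= threshold;
    }

    public static void main(String[] args) {
        // With HADOOP_ROOT_LOGGER=WARN,console, INFO messages are suppressed:
        System.out.println(emits(WARN, INFO));   // false
        // With HADOOP_ROOT_LOGGER=DEBUG,console, everything is shown:
        System.out.println(emits(DEBUG, INFO));  // true
    }
}
```

So raising the root logger from INFO to WARN hides messages, while lowering it to DEBUG reveals extra ones.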
<< DEBUG: Changing Modes >>
export HADOOP_CONF_DIR=~/hadoop/conf.pseudo/
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
unset HADOOP_CONF_DIR
hadoop fs -ls
Exercise:
(1) If you are currently running in full distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the pseudo-distributed mode configuration directory? Why are there errors?
(2) If you are currently running in full distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the local mode configuration directory?
<< DEBUG/Monitoring: jconsole >>
jconsole
Exercise:
(1) If you are currently running in full distributed mode, try to connect to the namenode Java process.
(2) If you are currently running in full distributed mode, try to connect to the datanode Java process.
<< HDFS: FsShell >>
lab003/FsShell
Exercise:
(1) What is the result of Path.CUR_DIR?
(2) In local mode, which class is the srcFs object?
(3) In full distributed mode, which class is the srcFs object?
(4) Which classes are updated in hadoop-core-*.jar according to the difference between the two jar files?
(5) Observe the source code layout in ${HOME}/hadoop/src/core/org/apache/hadoop/fs.
Which file systems are supported by Hadoop 1.0.4?
(A) HDFS (hdfs://namenode:port)
(B) Amazon S3 (s3://, s3n://)
(C) KFS (Kosmos File System, kfs://)
(D) Local File System (file:///)
(E) FTP (ftp://user:passwd@ftp-server:port)
(F) RAMFS (ramfs://)
(G) HAR (Hadoop Archive Filesystem, har://underlyingfsscheme-host:port/archivepath or har:///archivepath)
Reference:
(1) http://answers.oreilly.com/topic/456-get-to-know-hadoop-filesystems/
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/package-tree.html
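Hadoop selects the FileSystem implementation from the scheme of the path URI (the scheme-to-class mapping lives in core-default.xml as fs.&lt;scheme&gt;.impl properties). A quick way to see which scheme a path carries, using plain java.net.URI outside Hadoop:

```java
import java.net.URI;

// Hadoop picks a FileSystem implementation by the URI scheme of a path.
// This plain-Java sketch only extracts the scheme; a path with no scheme
// falls back to the default filesystem (fs.default.name).
public class SchemeDemo {
    public static String schemeOf(String path) {
        String s = URI.create(path).getScheme();
        return s == null ? "(default fs)" : s;
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://namenode:9000/user/me"));  // hdfs
        System.out.println(schemeOf("file:///tmp/data"));              // file
        System.out.println(schemeOf("input"));                         // (default fs)
    }
}
```

This is why the same "hadoop fs -ls input" resolves to HDFS in full distributed mode but to the local filesystem in local mode: the relative path has no scheme, so the configured default wins.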
<< Develop: Eclipse >>
wget https://raw.github.com/mkamithkumar/hadoop-eclipse-plugins/master/hadoop-eclipse-plugin-1.0.4.jar
eclipse
Reference:
(1) https://github.com/mkamithkumar/hadoop-eclipse-plugins
<< HDFS: FileSystem.copyFromLocal >>
cd ~/hadoop_labs/lab004
ant
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
touch test
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
unset HADOOP_CONF_DIR
ant clean
Exercise:
(1) In full distributed mode, what do you see after running "hadoop fs -ls"?
(2) After changing to local mode, what do you see after running "hadoop fs -ls"?
<< HDFS: FileSystem.copyToLocal >>
cd ~/hadoop_labs/lab005
ant
hadoop fs -ls
ls
hadoop jar copyToLocal.jar input input
ls
hadoop jar copyToLocal.jar file file
ls
ant clean
Exercise:
(1) In full distributed mode, what do you see after running "ls"?
(2) Try switching to local mode and see what the difference is between local mode and full distributed mode.
<< HDFS: FileSystem.exists >>
<< HDFS: FileSystem.isFile >>
<< HDFS: FileSystem.isDirectory >>
cd ~/hadoop_labs/lab006
ant
hadoop jar isFile.jar input
hadoop jar isFile.jar file
hadoop jar isFile.jar empty
Exercise:
(1) What are the differences in the results of this lab?
Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html
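The FileSystem methods exercised here behave much like their java.io.File counterparts on the local filesystem. A local-only sketch of the same three checks (plain java.io, not the Hadoop API, which takes a Path and goes through the configured FileSystem):

```java
import java.io.File;
import java.io.IOException;

// Local-filesystem analogy for FileSystem.exists / isFile / isDirectory.
// The Hadoop calls go through the configured FileSystem; this sketch uses
// plain java.io.File to show the same three distinctions.
public class IsFileDemo {
    public static String classify(File f) {
        if (!f.exists())     return "missing";
        if (f.isFile())      return "file";
        if (f.isDirectory()) return "directory";
        return "other";
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("lab006", ".txt");
        tmp.deleteOnExit();
        System.out.println(classify(tmp));                       // file
        System.out.println(classify(tmp.getParentFile()));       // directory
        System.out.println(classify(new File("no-such-path")));  // missing
    }
}
```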
<< MapReduce: WordCount (after 0.19) >>
cd ~/hadoop_labs/lab007
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
(open another console)
jps
watch -n 1 jps
hadoop job -list all
(open a browser)
http://localhost:50030
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -put ~/hadoop/conf local-input
hadoop jar WordCount local-input local-output
unset HADOOP_CONF_DIR
Exercise:
(1) In the results of "jps", which Java process stands for the main() function defined in WordCount.java?
(2) Do you see "Child" in the "jps" results? How many Java processes named "Child" do you see?
(3) Change to local mode. Do you see "mapred.LocalJobRunner" after running "hadoop jar WordCount ..."?
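Independent of how Hadoop schedules the job, the map/reduce logic of WordCount is small. A plain-Java sketch of the same tokenize-and-count behavior (no Hadoop types; a HashMap stands in for the shuffle/sort phase between map and reduce):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch of WordCount's logic: the "mapper" emits (word, 1) for
// each token, and the "reducer" sums the counts per word. A HashMap plays
// the role of Hadoop's shuffle/sort between the two phases.
public class WordCountSketch {
    public static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);  // map phase
            while (tok.hasMoreTokens()) {
                counts.merge(tok.nextToken(), 1, Integer::sum);  // reduce phase
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("A B C D", "C D A B"));
    }
}
```

The Hadoop version wraps exactly this logic in Mapper and Reducer classes so it can run over HDFS splits in parallel.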
<< MapReduce: WordCount (before 0.19) >>
cd ~/hadoop_labs/lab008
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
Exercise:
(1) Compare the two versions of the WordCount example and draw a UML class diagram for each.
<< MapReduce: Inner Classes vs. Public Classes >>
cd ~/hadoop_labs/lab009
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-r-00000
Exercise:
In this example, we would like to show you the difference between an inner class and public classes.
(1) How many Java source files are there in the lab009/src folder?
(2) How many class files are there after running ant? Compare the class names between lab007 and lab009,
then describe the differences between these two examples.
(3) How many mapper tasks are there in this job? How many reducer tasks are there in this job?
How many task attempts are there in each task?
(4) What is the value of the "mapred.reduce.tasks" property shown in ${HOME}/hadoop/src/mapred/mapred-default.xml?
<< MapReduce: Job.setNumReduceTasks() >>
cd ~/hadoop_labs/lab010
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-*
Exercise:
(1) How many reducer tasks are there in this job?
(2) How many output files are produced by this job?
(3) Observe the order of the output.
Assume the mapper output keys are {A,B,C,D}, and we know A < B < C < D.
Let's set the number of reducers to 2.
What will the output be?
(A) {A, B}, {C, D}
(B) {A, B, C}, {D}
(C) {A, C}, {B, D}
(D) {A, D}, {B, C}
Reference:
(1) org.apache.hadoop.mapreduce.Job.setNumReduceTasks(int)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)
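Which keys land in which reducer is decided by the partitioner; Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below uses String.hashCode for illustration only — Hadoop's Text type computes its own hash over the raw UTF-8 bytes, so real partition numbers can differ:

```java
// Sketch of Hadoop's default HashPartitioner logic:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// Illustrative only: String.hashCode is used here, while Hadoop's Text
// hashes the raw bytes, so actual partition numbers can differ.
public class PartitionSketch {
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"A", "B", "C", "D"}) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Note that the partitioner distributes keys by hash, not by sorted ranges; sorting happens only within each reducer's partition.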
<< Debug: setNumReduceTasks(0) >>
cd ~/hadoop_labs/lab010
sed -i 's#setNumReduceTasks(2)#setNumReduceTasks(0)#g' ~/hadoop_labs/lab010/src/WordCount.java
ant
hadoop fs -rmr output
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part*
Exercise:
Modify the Java source in lab010 to set the number of reducers to 0.
Run "ant" to compile, then observe the result in the output directory.
(1) How many mapper tasks are there in this job?
(2) How many output files are there in the output?
<< MapReduce: TextInputFormat >>
cd ~/hadoop_labs/lab011
ant
cd ~/hadoop_labs/lab010
mkdir -p my_input
echo "A B C D" > my_input/input1
echo "C D A B" > my_input/input2
hadoop fs -put my_input my_input
sed -i 's#setNumReduceTasks(0)#setNumReduceTasks(1)#g' ~/hadoop_labs/lab010/src/WordCount.java
hadoop jar WordCount.jar my_input my_output
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar my_input my_output
unset HADOOP_CONF_DIR
Exercise:
(1) In full distributed mode, how many lines of "TextInputFormat.isSplitable" message do you see while running the job?
(2) In full distributed mode, how many lines of "TextInputFormat.createRecordReader" message do you see while running the job?
(3) In local mode, how many lines of "TextInputFormat.isSplitable" message do you see while running the job?
(4) In local mode, how many lines of "TextInputFormat.createRecordReader" message do you see while running the job?
Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/InputFormat.html
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
(3) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
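TextInputFormat's record reader hands the mapper one record per line, keyed by the byte offset where the line starts. A plain-Java sketch of that (offset, line) pairing over an in-memory string (the real LineRecordReader additionally handles HDFS split boundaries; offsets here assume single-byte characters):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what TextInputFormat's record reader produces: one record per
// line, keyed by the byte offset of the line start. The real LineRecordReader
// works over HDFS split boundaries; this in-memory version only illustrates
// the (offset, line) pairing. Offsets assume single-byte characters.
public class LineRecords {
    public static List<String> records(String data) {
        List<String> out = new ArrayList<>();
        int offset = 0;
        for (String line : data.split("\n", -1)) {
            if (!line.isEmpty()) out.add(offset + "\t" + line);
            offset += line.length() + 1;  // +1 for the newline
        }
        return out;
    }

    public static void main(String[] args) {
        // Same contents as my_input/input1 and my_input/input2 above:
        for (String rec : records("A B C D\nC D A B\n")) System.out.println(rec);
    }
}
```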
<< MapReduce: KeyValueTextInputFormat >>
cd ~/hadoop_labs/lab012
ant
mkdir -p kv_input
printf "A\t1\n" > kv_input/input1
printf "B\t2\n" >> kv_input/input1
printf "C\t3\n" >> kv_input/input1
printf "A\t1\n" > kv_input/input2
printf "C\t2\n" >> kv_input/input2
printf "B\t1\n" >> kv_input/input2
hadoop fs -put kv_input kv_input
hadoop jar WordCount.jar kv_input kv_output
hadoop fs -ls kv_output
hadoop fs -cat kv_output/part-*
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar kv_input kv_output
ls -al kv_output
cat kv_output/part-*
unset HADOOP_CONF_DIR
Exercise:
(1)
Reference:
(1) org.apache.hadoop.mapreduce.Job.setInputFormatClass(Class<? extends InputFormat>)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setInputFormatClass%28java.lang.Class%29
(2) org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/KeyValueTextInputFormat.html
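KeyValueTextInputFormat splits each line at the first separator (tab by default) into a key and a value, instead of keying records by byte offset. A plain-Java sketch of that per-line split (an illustration of the behavior, not the Hadoop implementation):

```java
// Sketch of KeyValueTextInputFormat's per-line behavior: each line is split
// at the first tab into (key, value); a line without a tab becomes a key
// with an empty value. Illustration only, not the Hadoop implementation.
public class KeyValueSplit {
    public static String[] split(String line) {
        int i = line.indexOf('\t');
        if (i < 0) return new String[] { line, "" };
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("A\t1");  // one line of kv_input/input1
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

With this input format, the mapper's input key is already "A" rather than a byte offset, which changes what WordCount counts.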
<< MapReduce: Configuration >>
cd ~/hadoop_labs/lab013
ant
hadoop fs -put l13_input l13_input
hadoop jar WordCount.jar l13_input l13_output
hadoop fs -cat l13_output/part-r-00000
hadoop jar WordCount.jar -Dwordcount.case.sensitive=false l13_input l13_output2
hadoop fs -cat l13_output2/part-r-00000
Reference:
(1) org.apache.hadoop.conf.Configuration.setBoolean(String name, boolean value)
(2) org.apache.hadoop.conf.Configuration.getBoolean(String name, boolean defaultValue)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/conf/Configuration.html
(3) org.apache.hadoop.mapreduce.Mapper.setup(Mapper.Context context)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Mapper.html#setup%28org.apache.hadoop.mapreduce.Mapper.Context%29
(4) org.apache.hadoop.mapreduce.JobContext.getConfiguration()
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/JobContext.html#getConfiguration%28%29
(5) http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_7_1.html
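The -Dwordcount.case.sensitive=false flag above is a generic option: when a job is launched through Hadoop's GenericOptionsParser/ToolRunner, each -Dname=value becomes a Configuration entry before the job reads it with getBoolean(). A plain-Java sketch of that -D parsing, with java.util.Properties standing in for Configuration (illustration only; whether a given job honors -D depends on it using ToolRunner):

```java
import java.util.Properties;

// Sketch of how -Dname=value generic options become configuration entries.
// Hadoop's GenericOptionsParser writes them into a Configuration before the
// job runs; here java.util.Properties stands in for Configuration.
public class GenericOpts {
    public static Properties parse(String[] args, Properties conf) {
        for (String arg : args) {
            if (arg.startsWith("-D") && arg.contains("=")) {
                int eq = arg.indexOf('=');
                conf.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = parse(
            new String[] {"-Dwordcount.case.sensitive=false", "l13_input", "l13_output2"},
            new Properties());
        // Analogous to conf.getBoolean("wordcount.case.sensitive", true) in the job:
        boolean caseSensitive = Boolean.parseBoolean(
            conf.getProperty("wordcount.case.sensitive", "true"));
        System.out.println("case sensitive: " + caseSensitive);
    }
}
```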
<< MapReduce: Distributed Cache >>
cd ~/hadoop_labs/lab014
ant
Reference:
(1) org.apache.hadoop.filecache.DistributedCache.addCacheFile()
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
<< Query: DBInputFormat >>
<< Pig: Apache Log Analysis >>
Reference:
(1)
<< Pig: DBStorage >>
Reference:
(1)
<< Pig: HBaseStorage >>
Reference:
(1)
<< Other Ideas >>
1. lab for concurrent read and concurrent write
- local mode vs. full distributed mode
- local disk 1 to local disk 2
- RAM disk to local disk
- multiple RAM disks to full distributed mode HDFS
2. generate files to show the limitation due to the HDFS namenode HEAPSIZE
3. Use the Pig XMLLoader to load XML, and HBaseStorage to store into HBase,
or DBStorage to store results into MySQL/SQLite
4. mapper or reducer task running more than 10 minutes. or change