The following steps have been tested only on an Ubuntu 12.04 LTS server.
<< Download these labs >>
sudo apt-get install git
git clone https://github.com/jazzwang/hadoop_labs
<< Local Mode >>
lab000/hadoop-local-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input
Exercise:
(1) How many Java processes are there in local mode?
(2) What do you see after running "hadoop fs -ls" in local mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?
<< Pseudo-Distributed Mode >>
lab001/hadoop-pseudo-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input
Exercise:
(1) How many Java processes are there in pseudo-distributed mode?
(2) What do you see after running "hadoop fs -ls" in pseudo-distributed mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?
<< Full Distributed Mode >>
lab002/hadoop-full-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
Exercise:
(1) How many Java processes are there in full distributed mode?
(2) What do you see after running "hadoop fs -ls" in full distributed mode?
(3) In the netstat results, what is the difference between full distributed mode and pseudo-distributed mode?
<< DEBUG: Shell Script >>
file `which hadoop`
bash -x `which hadoop` fs -ls
bash -x `which hadoop` jar
bash -x `which hadoop` fsck
Exercise:
(1) Which Java class does the "hadoop fs" command invoke?
(2) Which Java class does the "hadoop jar" command invoke?
(3) Which Java class does the "hadoop fsck" command invoke?
<< DEBUG: Log4J >>
export HADOOP_ROOT_LOGGER=INFO,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=WARN,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls
unset HADOOP_ROOT_LOGGER
Exercise:
(1) In the output of "hadoop fs -ls", is there any difference between INFO and WARN?
(2) In the output of "hadoop fs -ls", is there any difference between INFO and DEBUG?
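The three HADOOP_ROOT_LOGGER settings above differ only in the threshold Log4J applies before printing a message. As a rough plain-Java sketch (an illustration of the filtering rule, not the real Log4J API):

```java
// Sketch of Log4J-style level filtering. The numeric ordering mirrors
// Log4J's DEBUG < INFO < WARN; this is an illustration, not the real API.
public class LevelFilter {
    public static final int DEBUG = 10, INFO = 20, WARN = 30;

    // A message is emitted only if its level is at or above the threshold.
    public static boolean emits(int threshold, int messageLevel) {
        return messageLevel >= threshold;
    }

    public static void main(String[] args) {
        // With HADOOP_ROOT_LOGGER=WARN,console, INFO messages are suppressed:
        System.out.println(emits(WARN, INFO));   // false
        // With HADOOP_ROOT_LOGGER=DEBUG,console, everything is shown:
        System.out.println(emits(DEBUG, INFO));  // true
    }
}
```

So raising the root logger from INFO to WARN hides messages, while lowering it to DEBUG reveals extra ones.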
<< DEBUG: Changing Modes >>
export HADOOP_CONF_DIR=~/hadoop/conf.pseudo/
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
unset HADOOP_CONF_DIR
hadoop fs -ls
Exercise:
(1) If you are currently running in full distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the pseudo-distributed mode configuration directory? Why are there errors?
(2) If you are currently running in full distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the local mode configuration directory?
<< DEBUG/Monitoring: jconsole >>
jconsole
Exercise:
(1) If you are currently running in full distributed mode, try to connect to the namenode Java process.
(2) If you are currently running in full distributed mode, try to connect to the datanode Java process.
<< HDFS: FsShell >>
lab003/FsShell
Exercise:
(1) What is the result of Path.CUR_DIR?
(2) In local mode, which class is the srcFs object?
(3) In full distributed mode, which class is the srcFs object?
(4) Which classes are updated in hadoop-core-*.jar according to the difference between the two jar files?
(5) Observe the source code layout in ${HOME}/hadoop/src/core/org/apache/hadoop/fs.
Which file systems are supported by Hadoop 1.0.4?
(A) HDFS (hdfs://namenode:port)
(B) Amazon S3 (s3://, s3n://)
(C) KFS (Kosmos File System, kfs://)
(D) Local File System (file:///)
(E) FTP (ftp://user:passwd@ftp-server:port)
(F) RAMFS (ramfs://)
(G) HAR (Hadoop Archive Filesystem, har://underlyingfsscheme-host:port/archivepath or har:///archivepath)
Reference:
(1) http://answers.oreilly.com/topic/456-get-to-know-hadoop-filesystems/
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/package-tree.html
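Hadoop selects the FileSystem implementation from the scheme of the path URI (the scheme-to-class mapping lives in core-default.xml as fs.&lt;scheme&gt;.impl properties). A quick way to see which scheme a path carries, using plain java.net.URI outside Hadoop:

```java
import java.net.URI;

// Hadoop picks a FileSystem implementation by the URI scheme of a path.
// This plain-Java sketch only extracts the scheme; a path with no scheme
// falls back to the default filesystem (fs.default.name).
public class SchemeDemo {
    public static String schemeOf(String path) {
        String s = URI.create(path).getScheme();
        return s == null ? "(default fs)" : s;
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://namenode:9000/user/me"));  // hdfs
        System.out.println(schemeOf("file:///tmp/data"));              // file
        System.out.println(schemeOf("input"));                         // (default fs)
    }
}
```

This is why the same "hadoop fs -ls input" resolves to HDFS in full distributed mode but to the local filesystem in local mode: the relative path has no scheme, so the configured default wins.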
<< Develop: Eclipse >>
wget https://raw.github.com/mkamithkumar/hadoop-eclipse-plugins/master/hadoop-eclipse-plugin-1.0.4.jar
eclipse
Reference:
(1) https://github.com/mkamithkumar/hadoop-eclipse-plugins
<< HDFS: FileSystem.copyFromLocal >>
cd ~/hadoop_labs/lab004
ant
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
touch test
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
unset HADOOP_CONF_DIR
ant clean
Exercise:
(1) In full distributed mode, what do you see after running "hadoop fs -ls"?
(2) After changing to local mode, what do you see after running "hadoop fs -ls"?
<< HDFS: FileSystem.copyToLocal >>
cd ~/hadoop_labs/lab005
ant
hadoop fs -ls
ls
hadoop jar copyToLocal.jar input input
ls
hadoop jar copyToLocal.jar file file
ls
ant clean
Exercise:
(1) In full distributed mode, what do you see after running "ls"?
(2) Try switching to local mode and see what the difference is between local mode and full distributed mode.
<< HDFS: FileSystem.exists >>
<< HDFS: FileSystem.isFile >>
<< HDFS: FileSystem.isDirectory >>
cd ~/hadoop_labs/lab006
ant
hadoop jar isFile.jar input
hadoop jar isFile.jar file
hadoop jar isFile.jar empty
Exercise:
(1) What are the differences in the results of this lab?
Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html
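The FileSystem methods exercised here behave much like their java.io.File counterparts on the local filesystem. A local-only sketch of the same three checks (plain java.io, not the Hadoop API, which takes a Path and goes through the configured FileSystem):

```java
import java.io.File;
import java.io.IOException;

// Local-filesystem analogy for FileSystem.exists / isFile / isDirectory.
// The Hadoop calls go through the configured FileSystem; this sketch uses
// plain java.io.File to show the same three distinctions.
public class IsFileDemo {
    public static String classify(File f) {
        if (!f.exists())     return "missing";
        if (f.isFile())      return "file";
        if (f.isDirectory()) return "directory";
        return "other";
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("lab006", ".txt");
        tmp.deleteOnExit();
        System.out.println(classify(tmp));                       // file
        System.out.println(classify(tmp.getParentFile()));       // directory
        System.out.println(classify(new File("no-such-path")));  // missing
    }
}
```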
<< MapReduce: WordCount (after 0.19) >>
cd ~/hadoop_labs/lab007
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
(open another console)
jps
watch -n 1 jps
hadoop job -list all
(open a browser)
http://localhost:50030
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -put ~/hadoop/conf local-input
hadoop jar WordCount local-input local-output
unset HADOOP_CONF_DIR
Exercise:
(1) In the results of "jps", which Java process stands for the main() function defined in WordCount.java?
(2) Do you see "Child" in the "jps" results? How many Java processes named "Child" do you see?
(3) Change to local mode. Do you see "mapred.LocalJobRunner" after running "hadoop jar WordCount ..."?
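Independent of how Hadoop schedules the job, the map/reduce logic of WordCount is small. A plain-Java sketch of the same tokenize-and-count behavior (no Hadoop types; a HashMap stands in for the shuffle/sort phase between map and reduce):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch of WordCount's logic: the "mapper" emits (word, 1) for
// each token, and the "reducer" sums the counts per word. A HashMap plays
// the role of Hadoop's shuffle/sort between the two phases.
public class WordCountSketch {
    public static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);  // map phase
            while (tok.hasMoreTokens()) {
                counts.merge(tok.nextToken(), 1, Integer::sum);  // reduce phase
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("A B C D", "C D A B"));
    }
}
```

The Hadoop version wraps exactly this logic in Mapper and Reducer classes so it can run over HDFS splits in parallel.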
<< MapReduce: WordCount (before 0.19) >>
cd ~/hadoop_labs/lab008
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
Exercise:
(1) Compare the two versions of the WordCount example and draw a UML class diagram for each.
<< MapReduce: Inner Classes vs. Public Classes >>
cd ~/hadoop_labs/lab009
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-r-00000
Exercise:
In this example, we would like to show you the difference between an inner class and public classes.
(1) How many Java source files are there in the lab009/src folder?
(2) How many class files are there after running ant? Compare the class names between lab007 and lab009,
then describe the differences between these two examples.
(3) How many mapper tasks are there in this job? How many reducer tasks are there in this job?
How many task attempts are there in each task?
(4) What is the value of the "mapred.reduce.tasks" property shown in ${HOME}/hadoop/src/mapred/mapred-default.xml?
<< MapReduce: Job.setNumReduceTasks() >>
cd ~/hadoop_labs/lab010
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-*
Exercise:
(1) How many reducer tasks are there in this job?
(2) How many output files are produced by this job?
(3) Observe the order of the output.
Assume the mapper output keys are {A,B,C,D}, and we know A < B < C < D.
Let's set the number of reducers to 2.
What will the output be?
(A) {A, B}, {C, D}
(B) {A, B, C}, {D}
(C) {A, C}, {B, D}
(D) {A, D}, {B, C}
Reference:
(1) org.apache.hadoop.mapreduce.Job.setNumReduceTasks(int)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)
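Which keys land in which reducer is decided by the partitioner; Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below uses String.hashCode for illustration only — Hadoop's Text type computes its own hash over the raw UTF-8 bytes, so real partition numbers can differ:

```java
// Sketch of Hadoop's default HashPartitioner logic:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// Illustrative only: String.hashCode is used here, while Hadoop's Text
// hashes the raw bytes, so actual partition numbers can differ.
public class PartitionSketch {
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"A", "B", "C", "D"}) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Note that the partitioner distributes keys by hash, not by sorted ranges; sorting happens only within each reducer's partition.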
<< Debug: setNumReduceTasks(0) >>
cd ~/hadoop_labs/lab010
sed -i 's#setNumReduceTasks(2)#setNumReduceTasks(0)#g' ~/hadoop_labs/lab010/src/WordCount.java
ant
hadoop fs -rmr output
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part*
Exercise:
Modify the Java source in lab010 to set the number of reducers to 0.
Run "ant" to compile, then observe the result in the output directory.
(1) How many mapper tasks are there in this job?
(2) How many output files are there in the output?
<< MapReduce: TextInputFormat >>
cd ~/hadoop_labs/lab011
ant
cd ~/hadoop_labs/lab010
mkdir -p my_input
echo "A B C D" > my_input/input1
echo "C D A B" > my_input/input2
hadoop fs -put my_input my_input
sed -i 's#setNumReduceTasks(0)#setNumReduceTasks(1)#g' ~/hadoop_labs/lab010/src/WordCount.java
hadoop jar WordCount.jar my_input my_output
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar my_input my_output
unset HADOOP_CONF_DIR
Exercise:
(1) In full distributed mode, how many lines of "TextInputFormat.isSplitable" message do you see while running the job?
(2) In full distributed mode, how many lines of "TextInputFormat.createRecordReader" message do you see while running the job?
(3) In local mode, how many lines of "TextInputFormat.isSplitable" message do you see while running the job?
(4) In local mode, how many lines of "TextInputFormat.createRecordReader" message do you see while running the job?
Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/InputFormat.html
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
(3) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
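TextInputFormat's record reader hands the mapper one record per line, keyed by the byte offset where the line starts. A plain-Java sketch of that (offset, line) pairing over an in-memory string (the real LineRecordReader additionally handles HDFS split boundaries; offsets here assume single-byte characters):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what TextInputFormat's record reader produces: one record per
// line, keyed by the byte offset of the line start. The real LineRecordReader
// works over HDFS split boundaries; this in-memory version only illustrates
// the (offset, line) pairing. Offsets assume single-byte characters.
public class LineRecords {
    public static List<String> records(String data) {
        List<String> out = new ArrayList<>();
        int offset = 0;
        for (String line : data.split("\n", -1)) {
            if (!line.isEmpty()) out.add(offset + "\t" + line);
            offset += line.length() + 1;  // +1 for the newline
        }
        return out;
    }

    public static void main(String[] args) {
        // Same contents as my_input/input1 and my_input/input2 above:
        for (String rec : records("A B C D\nC D A B\n")) System.out.println(rec);
    }
}
```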
<< MapReduce: KeyValueTextInputFormat >>
cd ~/hadoop_labs/lab012
ant
mkdir -p kv_input
printf "A\t1\n" > kv_input/input1
printf "B\t2\n" >> kv_input/input1
printf "C\t3\n" >> kv_input/input1
printf "A\t1\n" > kv_input/input2
printf "C\t2\n" >> kv_input/input2
printf "B\t1\n" >> kv_input/input2
hadoop fs -put kv_input kv_input
hadoop jar WordCount.jar kv_input kv_output
hadoop fs -ls kv_output
hadoop fs -cat kv_output/part-*
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar kv_input kv_output
ls -al kv_output
cat kv_output/part-*
unset HADOOP_CONF_DIR
Exercise:
(1)
Reference:
(1) org.apache.hadoop.mapreduce.Job.setInputFormatClass(Class<? extends InputFormat>)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setInputFormatClass%28java.lang.Class%29
(2) org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/KeyValueTextInputFormat.html
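KeyValueTextInputFormat splits each line at the first separator (tab by default) into a key and a value, instead of keying records by byte offset. A plain-Java sketch of that per-line split (an illustration of the behavior, not the Hadoop implementation):

```java
// Sketch of KeyValueTextInputFormat's per-line behavior: each line is split
// at the first tab into (key, value); a line without a tab becomes a key
// with an empty value. Illustration only, not the Hadoop implementation.
public class KeyValueSplit {
    public static String[] split(String line) {
        int i = line.indexOf('\t');
        if (i < 0) return new String[] { line, "" };
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("A\t1");  // one line of kv_input/input1
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

With this input format, the mapper's input key is already "A" rather than a byte offset, which changes what WordCount counts.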
<< MapReduce: Configuration >>
cd ~/hadoop_labs/lab013
ant
hadoop fs -put l13_input l13_input
hadoop jar WordCount.jar l13_input l13_output
hadoop fs -cat l13_output/part-r-00000
hadoop jar WordCount.jar -Dwordcount.case.sensitive=false l13_input l13_output2
hadoop fs -cat l13_output2/part-r-00000
Reference:
(1) org.apache.hadoop.conf.Configuration.setBoolean(String name, boolean value)
(2) org.apache.hadoop.conf.Configuration.getBoolean(String name, boolean defaultValue)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/conf/Configuration.html
(3) org.apache.hadoop.mapreduce.Mapper.setup(Mapper.Context context)
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Mapper.html#setup%28org.apache.hadoop.mapreduce.Mapper.Context%29
(4) org.apache.hadoop.mapreduce.JobContext.getConfiguration()
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/JobContext.html#getConfiguration%28%29
(5) http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_7_1.html
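The -Dwordcount.case.sensitive=false flag above is a generic option: when a job is launched through Hadoop's GenericOptionsParser/ToolRunner, each -Dname=value becomes a Configuration entry before the job reads it with getBoolean(). A plain-Java sketch of that -D parsing, with java.util.Properties standing in for Configuration (illustration only; whether a given job honors -D depends on it using ToolRunner):

```java
import java.util.Properties;

// Sketch of how -Dname=value generic options become configuration entries.
// Hadoop's GenericOptionsParser writes them into a Configuration before the
// job runs; here java.util.Properties stands in for Configuration.
public class GenericOpts {
    public static Properties parse(String[] args, Properties conf) {
        for (String arg : args) {
            if (arg.startsWith("-D") && arg.contains("=")) {
                int eq = arg.indexOf('=');
                conf.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = parse(
            new String[] {"-Dwordcount.case.sensitive=false", "l13_input", "l13_output2"},
            new Properties());
        // Analogous to conf.getBoolean("wordcount.case.sensitive", true) in the job:
        boolean caseSensitive = Boolean.parseBoolean(
            conf.getProperty("wordcount.case.sensitive", "true"));
        System.out.println("case sensitive: " + caseSensitive);
    }
}
```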
<< MapReduce: Distributed Cache >>
cd ~/hadoop_labs/lab014
ant
Reference:
(1) org.apache.hadoop.filecache.DistributedCache.addCacheFile()
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
<< Query: DBInputFormat >>
<< Pig: Apache Log Analysis >>
Reference:
(1)
<< Pig: DBStorage >>
Reference:
(1)
<< Pig: HBaseStorage >>
Reference:
(1)
<< Other Ideas >>
1. lab for concurrent read and concurrent write
- local mode vs. full distributed mode
- local disk 1 to local disk 2
- RAM disk to local disk
- multiple RAM disks to full distributed mode HDFS
2. generate files to show the limitation due to the HDFS namenode HEAPSIZE
3. Use the Pig XMLLoader to load XML, and HBaseStorage to store into HBase,
or DBStorage to store results into MySQL/SQLite
4. mapper or reducer task running more than 10 minutes. or change