Hadoop and Hadoop Ecosystems

Tutorials and Examples

Some notes on local Hadoop setup

If you have credits or an appropriate subscription, you can use Hadoop and its ecosystem through managed services from cloud providers, such as HDInsight, Google Dataproc, or Amazon EMR. For practice, however, you can set up Hadoop on your own machine. Many guides are available, so here we only note a few common problems you may face.

Java version and Memory

It is common to have both Oracle Java and OpenJDK installed, so make sure Hadoop uses the Java installation you intend. Pay attention to JAVA_HOME.

The system might need a lot of memory, so watch for errors caused by memory limits and the number of tasks (e.g., heap configuration, maximum memory for MapReduce tasks).
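
As a rough illustration of the JAVA_HOME advice, the following Python sketch checks that JAVA_HOME is set, that it contains bin/java, and that the java found first on PATH belongs to the same installation. The function name check_java_home is made up for this note, and the sketch assumes a Unix-like layout (bin/java under the JDK directory).

```python
import os
import shutil
from pathlib import Path


def check_java_home(env=None):
    """Return a list of warnings about the Java setup described by `env`."""
    env = os.environ if env is None else env
    warnings = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        warnings.append("JAVA_HOME is not set")
        return warnings

    java_bin = Path(java_home) / "bin" / "java"
    if not java_bin.is_file():
        warnings.append(f"no java executable at {java_bin}")
        return warnings

    # The java found first on PATH may belong to another installation
    # (for example OpenJDK while JAVA_HOME points to Oracle Java).
    on_path = shutil.which("java", path=env.get("PATH"))
    if on_path and Path(on_path).resolve() != java_bin.resolve():
        warnings.append(
            f"java on PATH ({on_path}) differs from JAVA_HOME ({java_bin})"
        )
    return warnings


if __name__ == "__main__":
    for message in check_java_home() or ["JAVA_HOME looks consistent"]:
        print(message)
```

Symbolic links are resolved before comparing, so a /usr/bin/java link that points into the JAVA_HOME directory is not reported as a mismatch.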

Check XML configuration files

There are many configuration files and parameters. Many are in XML, since these frameworks were started a long time ago. Make sure the files are well-formed, e.g.:

$ xmllint hive-default.xml
hive-default.xml:3216: parser error : xmlParseCharRef: invalid xmlChar value 8
mmands with OVERWRITE (such as INSERT OVERWRITE) acquire Exclusive locks for
                                                                               ^
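
If xmllint is not installed, a similar check can be sketched with Python's standard library. The error above appears to come from a character reference to a control character (value 8), which XML 1.0 does not allow, and the standard parser rejects such references too. The helper name find_xml_error is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_xml_error(source):
    """Parse `source` (a file path or a binary file object).

    Returns None when the document is well-formed, otherwise a tuple
    (line, column, message) describing the first parse error.
    """
    try:
        ET.parse(source)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None
```

For example, `find_xml_error("hive-default.xml")` would return the position of the first problem in that file, or None if the file parses cleanly. This only checks well-formedness; it does not tell you whether property names or values are valid.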

Hadoop NameNode and DataNode

Remember that the Hadoop Distributed File System (HDFS) has a NameNode and DataNodes, which have different configurations, e.g.:

hdfs-site.xml
<property>
   <name>dfs.namenode.name.dir</name>
   <value>/var/hadoop/data/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/var/hadoop/data/datanode</value>
</property>
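
A small sanity check for these two settings could look like the sketch below. It assumes the DataNode directory property is named dfs.datanode.data.dir (its name in hdfs-default.xml) and that both values may be comma-separated lists; the function names load_properties and check_storage_dirs are invented for this example.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def split_dirs(value):
    """Split a comma-separated directory list, ignoring empty entries."""
    return {item.strip() for item in value.split(",") if item.strip()}


def check_storage_dirs(props):
    """Return a list of problems with the NameNode/DataNode directories."""
    problems = []
    name_dir = props.get("dfs.namenode.name.dir", "")
    data_dir = props.get("dfs.datanode.data.dir", "")
    if not name_dir:
        problems.append("dfs.namenode.name.dir is not set")
    if not data_dir:
        problems.append("dfs.datanode.data.dir is not set")
    # The two roles should not write into the same directory.
    shared = split_dirs(name_dir) & split_dirs(data_dir)
    if shared:
        problems.append(f"NameNode and DataNode share: {sorted(shared)}")
    return problems
```

Calling `check_storage_dirs(load_properties(open("hdfs-site.xml").read()))` on a correct file returns an empty list. A misspelled property name shows up as "is not set", because Hadoop then falls back to its default value without complaining.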

External ZooKeeper

You can share a single ZooKeeper instance among Hadoop, HBase, etc. Make sure each service is configured to use it. For example, for HBase you can have the following configuration in hbase-site.xml, where "localhost" should be replaced by the host name of the machine running ZooKeeper:

<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>
<property>
   <name>hbase.zookeeper.property.clientPort</name>
   <value>2181</value>
</property>
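
Before starting HBase, it can help to confirm that something is actually listening on the configured host and port. The sketch below only tests TCP reachability; it does not verify that the listener is ZooKeeper or that it is healthy. The function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

With the configuration above, `port_is_open("localhost", 2181)` should return True once ZooKeeper is running.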

Hive/Hadoop access denied/impersonation issues

You might get access-denied or impersonation errors when using beeline to call HiveServer2, which in turn calls Hadoop to execute requests. You can check a simple, easy-to-understand explanation here.

For example, in hive-default.xml, you may need to look at:

<property>
   <name>hive.server2.enable.doAs</name>
   <value>false</value>
   <description>
     Setting this property to true will have HiveServer2 execute
     Hive operations as the user making the calls to it.
   </description>
</property>
<property>
   <name>hive.conf.restricted.list</name>
   <value>....</value>
   <description>Comma separated list of configuration options which are immutable at runtime</description>
</property>

In the Hadoop configuration, for example, if "truong" is used as a proxy user, core-site.xml contains the following configuration:

<configuration>
<property>
   <name>hadoop.proxyuser.truong.hosts</name>
   <value>*</value>
</property>
<property>
   <name>hadoop.proxyuser.truong.groups</name>
   <value>*</value>
</property>
</configuration>
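
To see which proxy users a core-site.xml actually declares, a short script can list the hadoop.proxyuser.* entries. This is a sketch only: the function name proxy_users is made up, it looks at the hosts and groups keys shown above and ignores any others, and it assumes user names contain no dots. Note that "*" allows impersonation from any host and for any group, which is convenient on a practice machine but too permissive elsewhere.

```python
import re
import xml.etree.ElementTree as ET

# Matches names such as hadoop.proxyuser.truong.hosts
PROXY_KEY = re.compile(r"^hadoop\.proxyuser\.(?P<user>[^.]+)\.(?P<kind>hosts|groups)$")


def proxy_users(xml_text):
    """Return {user: {"hosts": ..., "groups": ...}} found in a core-site.xml text."""
    result = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        match = PROXY_KEY.match(prop.findtext("name", "").strip())
        if match:
            entry = result.setdefault(match["user"], {})
            entry[match["kind"]] = (prop.findtext("value") or "").strip()
    return result
```

For the configuration above, the result is `{"truong": {"hosts": "*", "groups": "*"}}`. A user that has only one of the two keys stands out immediately in the output.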

Seeing logs of hiveserver2

To see the HiveServer2 logs on the console, instead of running "hiveserver2" you can run:

$ hive --service hiveserver2 --hiveconf hive.root.logger=INFO,console

Some readings