Monday, 3 August 2015

Setting up a Hadoop Cluster in Pseudo-Distributed Mode on Ubuntu


Here we’ll discuss how to set up a Hadoop cluster in pseudo-distributed mode on a Linux environment. We are using Hadoop 2.x for this.
Pre-requisites:
     -     Java 7
     -     A dedicated Hadoop user
     -     ssh installed and configured

Step 1: Install Java

324532@ubuntu:~$ sudo apt-get install openjdk-7-jdk
324532@ubuntu:~$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK Client VM (build 24.79-b02, mixed mode, sharing)
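
To see where apt placed the JDK (you will need this path later for JAVA_HOME), resolve the java symlink. The path below is what you would typically get on 64-bit Ubuntu; it may differ on your machine:

324532@ubuntu:~$ readlink -f /usr/bin/java
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java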

Step 2: Add a dedicated Hadoop user

Though it is not mandatory, we create a dedicated user to separate the Hadoop installation from other packages on the machine.
324532@ubuntu:~$ sudo addgroup hadoop
324532@ubuntu:~$ sudo adduser --ingroup hadoop hduser
This creates the user hduser with hadoop as its primary group.

Step 3: Install ssh

324532@ubuntu:~$ sudo apt-get install ssh
324532@ubuntu:~$ sudo apt-get install openssh-server
Once they are installed, make sure the ssh service is running.
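
On Ubuntu you can check it with the stock service command:

324532@ubuntu:~$ sudo service ssh status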

Step 4: Configure ssh

Hadoop uses ssh to manage its nodes, so ssh must be running and configured for passwordless (key-based) authentication, even in this single-machine setup where Hadoop connects to localhost.

First generate an SSH key for hduser.

324532@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

Once the key is generated, append the public key to the authorized_keys file.

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
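
If ssh still prompts for a password later, file permissions are the usual culprit, since sshd ignores an authorized_keys file that others can write to. Tightening them is a safe precaution:

hduser@ubuntu:~$ chmod 700 $HOME/.ssh
hduser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys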

Once the key is in place, you should be able to ssh to localhost without a password (accept the host key fingerprint on the first connection) and continue with the Hadoop setup.

hduser@ubuntu:~$ ssh localhost


Step 5: Set up the Hadoop cluster

Download a 2.x release from the Apache download mirrors and extract it into a directory, e.g. ‘/usr/local/hadoop/’.
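For example (hadoop-2.7.1 below is just an assumed release; substitute whichever 2.x tarball you downloaded):

324532@ubuntu:~$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
324532@ubuntu:~$ sudo tar -xzf hadoop-2.7.1.tar.gz -C /usr/local
324532@ubuntu:~$ sudo mv /usr/local/hadoop-2.7.1 /usr/local/hadoop
324532@ubuntu:~$ sudo chown -R hduser:hadoop /usr/local/hadoop

Then set JAVA_HOME and the other Hadoop-related environment variables in the .bash_profile of hduser: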
# set to the root of your Java installation; for the apt-installed
# OpenJDK 7 on 64-bit Ubuntu this is typically:
 export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
 export HADOOP_INSTALL=/usr/local/hadoop
 export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
 export HADOOP_COMMON_HOME=$HADOOP_INSTALL
 export HADOOP_HDFS_HOME=$HADOOP_INSTALL
 export YARN_HOME=$HADOOP_INSTALL
 export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
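
Reload the profile so the variables take effect in the current shell:

hduser@ubuntu:~$ source ~/.bash_profile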
Hadoop can run in 3 modes:
     1.    Local (standalone) mode
     2.    Pseudo-distributed mode
     3.    Fully distributed mode
  
Local (Standalone) Mode: Hadoop runs as a single Java process in a non-distributed manner; no daemons are started, and the local filesystem is used for data storage.

Pseudo-Distributed Mode: Hadoop runs on a single node, but each daemon runs as a separate Java process. This is the mode we set up here.

Fully Distributed Mode: Hadoop runs on multiple nodes in a master-slave architecture, where each daemon runs as a separate Java process.

Configuration:

Following is the minimal configuration you need to add to the files below to start a pseudo-distributed cluster. The paths are relative to the Hadoop installation directory (/usr/local/hadoop).

etc/hadoop/core-site.xml 
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <!-- URI of the NameNode; this makes HDFS the default filesystem -->
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>


etc/hadoop/hdfs-site.xml
<configuration>
    <property>
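        <!-- keep a single replica of each block; there is only one DataNode -->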
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
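
Note: in Hadoop 2.x, mapred-site.xml usually ships only as a template, so create it first:

hduser@ubuntu:/usr/local/hadoop$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml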

To run MapReduce on YARN, set the framework name in etc/hadoop/mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value> <!--other values are local, classic -->
    </property>
</configuration>

etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
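
One more common tweak: the daemons are launched over ssh in non-login shells, so they may not pick up the JAVA_HOME exported in .bash_profile. To avoid "JAVA_HOME is not set" errors, set it explicitly in etc/hadoop/hadoop-env.sh as well (the path assumes the apt-installed OpenJDK 7 on 64-bit Ubuntu):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64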

Execution
Once the above configuration is done, the next step is to format the NameNode.

To format the filesystem (run this and the following commands from the Hadoop installation directory):
  hduser@ubuntu:/usr/local/hadoop$ bin/hdfs namenode -format

To start the NameNode, DataNode and SecondaryNameNode daemons:
  hduser@ubuntu:/usr/local/hadoop$ sbin/start-dfs.sh

Browse the NameNode web interface. By default it is at http://localhost:50070

To start the YARN daemons (ResourceManager and NodeManager), run the following:
hduser@ubuntu:/usr/local/hadoop$ sbin/start-yarn.sh

You can browse the ResourceManager web UI at http://localhost:8088

If you want to start all daemons together, you can run the following (note that start-all.sh is deprecated in Hadoop 2.x in favour of start-dfs.sh and start-yarn.sh):
hduser@ubuntu:/usr/local/hadoop$ sbin/start-all.sh

Now your cluster is up and running. You can see all running Hadoop daemons using the jps command.
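
For example (jps ships with the JDK; for this setup the list should include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager):

hduser@ubuntu:/usr/local/hadoop$ jps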
Now start writing MapReduce jobs!
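
Before writing your own, a quick end-to-end check is to run one of the bundled example jobs (the jar version below is an assumption; match it to the release you extracted):

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5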
