Thursday, August 24, 2017

Installing Hadoop 2.8.1 on Ubuntu (Single Node Cluster)

 Overview

Hadoop is an open-source framework for writing and running distributed computing programs. The framework comprises HDFS (Hadoop Distributed File System) and MapReduce (a programming framework written in Java).

In Hadoop 1, only MapReduce programs (written in Java, or in Python via Hadoop Streaming) can be run on the data stored in HDFS. Therefore, it fits only batch-processing computations.

In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced, which provides APIs for requesting and allocating resources in the cluster. These APIs let applications such as Spark, Tez, and Storm process large-scale, fault-tolerant data stored in HDFS. Thus, the Hadoop ecosystem now fits batch, near-real-time, and real-time processing computations.


Today, I will discuss the steps to set up Hadoop 2 in pseudo-distributed mode on an Ubuntu machine.

Prerequisites


  • Hardware requirement
          The machine on which Hadoop is installed should have 64-128 MB of RAM and at least 1-4 GB of free hard disk space for better performance. This requirement is optional.
  • Check Java version
         The Java version on the machine should be 7 or higher. If you have a version lower than 7, or no Java installed at all, install it by following the steps in the article (a sample install command is shown after this list).

        You can check the Java version with the command below.
       
        $ java -version
             java version "1.8.0_131"
             Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
            Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
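
If Java is not installed, the commands below are one way to get it on Ubuntu. This is just a sketch assuming the OpenJDK 8 package from the standard Ubuntu repositories is acceptable; this post itself uses Oracle Java 8, which comes from a separate PPA.

$ sudo apt-get update
$ sudo apt-get install -y openjdk-8-jdk
$ java -version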

Steps for Hadoop Setup on a Single Machine

Step 1: Create a dedicated Hadoop user.

1.1 Create a group hadoop.

pooja@prod01:~$ sudo groupadd hadoop

1.2 Create a user hduser in group hadoop.

pooja@prod01:~$ sudo useradd -G hadoop -m hduser

Note: -m creates the user's home directory.

1.3 Make sure the home directory for hduser was created.

pooja@prod01:~$ sudo ls -ltr /home/

total 8
drwxr-xr-x 28 pooja  pooja  4096 Aug 24 09:23 pooja
drwxr-xr-x  2 hduser hduser 4096 Aug 24 13:34 hduser

1.4 Set a password for hduser.
pooja@prod01:~$ sudo passwd hduser

Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

1.5 Log in as hduser.
pooja@prod01:~$ su - hduser
Password: 

hduser@prod01:~$ pwd
/home/hduser
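
Optionally, confirm that hduser is a member of the hadoop group; the output of the command below should list hadoop among the groups (a quick hedged check, not required for the setup).

hduser@prod01:~$ groups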

Step 2: Set up Passwordless SSH

Hadoop's start-up scripts use SSH to launch the daemons on each node (here, just localhost), so hduser needs passwordless SSH access to localhost.

2.1 Generate an SSH key pair without a passphrase.

hduser@prod01:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
6c:c0:f4:c2:d1:d8:40:41:2b:e8:7b:8d:d4:c7:2c:62 hduser@prod01
The key's randomart image is:
+--[ RSA 2048]----+
|     oB*         |
|   . +.+o        |
|  . . * .        |
| .   o *         |
|  . E o S        |
|   + ++         |
|  . o .          |
|   .             |
|                 |
+-----------------+

2.2 Append the generated public key to authorized_keys.

hduser@prod01:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

2.3 Restrict permissions on authorized_keys so that only the owner can read and write it.

hduser@prod01:~$  chmod 0600 ~/.ssh/authorized_keys

2.4 Verify that passwordless SSH is working.

Note: When asked whether to continue connecting, answer yes as shown below.
hduser@prod01:~$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ad:3c:12:c3:b1:d2:60:a4:8f:76:00:1e:15:b3:f4:41.
Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
...Snippet
$

Step 3: Download Hadoop 2.8.1

3.1 Download the Hadoop 2.8.1 tar file from an Apache download mirror, or use the command below.

hduser@prod01:~$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

--2017-08-24 14:01:31--  http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
Resolving apache.claz.org (apache.claz.org)... 74.63.227.45
Connecting to apache.claz.org (apache.claz.org)|74.63.227.45|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 424555111 (405M) [application/x-gzip]
Saving to: ‘hadoop-2.8.1.tar.gz’
100%[=====================================================================================================>] 424,555,111 1.51MB/s   in 2m 48s
2017-08-24 14:04:19 (2.41 MB/s) - ‘hadoop-2.8.1.tar.gz’ saved [424555111/424555111]
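
Optionally, verify the integrity of the download. This is a hedged sketch: compute the SHA-256 checksum locally and compare it by eye against the checksum published for hadoop-2.8.1.tar.gz on the Apache download site (the published value is not reproduced here).

hduser@prod01:~$ sha256sum hadoop-2.8.1.tar.gz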

3.2 Untar the downloaded tar file.

hduser@prod01:~$ tar -xvf hadoop-2.8.1.tar.gz

...Snippet
hadoop-2.8.1/share/doc/hadoop/images/external.png
hadoop-2.8.1/share/doc/hadoop/images/h5.jpg
hadoop-2.8.1/share/doc/hadoop/index.html
hadoop-2.8.1/share/doc/hadoop/project-reports.html
hadoop-2.8.1/include/
hadoop-2.8.1/include/hdfs.h
hadoop-2.8.1/include/Pipes.hh
hadoop-2.8.1/include/TemplateFactory.hh
hadoop-2.8.1/include/StringUtils.hh
hadoop-2.8.1/include/SerialUtils.hh
hadoop-2.8.1/LICENSE.txt
hadoop-2.8.1/NOTICE.txt
hadoop-2.8.1/README.txt

3.3 Create a soft link, so the rest of the setup can refer to the stable path ~/hadoop regardless of the installed version.

hduser@prod01:~$ ln -s hadoop-2.8.1 hadoop
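
A quick check that the link points where we expect; the listing should end with hadoop -> hadoop-2.8.1.

hduser@prod01:~$ ls -ld hadoop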

Step 4: Configure Hadoop in Pseudo-Distributed Mode

In the Hadoop configuration below, we add only the minimum required properties; you can add more properties as needed.

4.1 Set up the environment variable.

   4.1.1 Edit .bashrc and add Hadoop to the PATH as shown below:

            hduser@pooja:~$ vi .bashrc

               #Add below lines to .bashrc
                export HADOOP_HOME=/home/hduser/hadoop
                export HADOOP_INSTALL=$HADOOP_HOME
                export HADOOP_MAPRED_HOME=$HADOOP_HOME
                export HADOOP_COMMON_HOME=$HADOOP_HOME
                export HADOOP_HDFS_HOME=$HADOOP_HOME
                export YARN_HOME=$HADOOP_HOME
                export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
               export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

  4.1.2 Source .bashrc in current login session

          hduser@pooja:~$ source ~/.bashrc
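
To confirm the environment variables took effect, the two checks below should print the Hadoop home directory and the Hadoop version banner (hadoop version does not need a running cluster).

          hduser@pooja:~$ echo $HADOOP_HOME
          hduser@pooja:~$ hadoop version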
          
4.2  Hadoop configuration file changes

   4.2.1 Changes to hadoop-env.sh (set $JAVA_HOME to installation directory)
         
           4.2.1.1 Find JAVA_HOME on machine.
                      
                        hduser@pooja:~$ which java
                         /usr/bin/java
                        
                         hduser@pooja:~$ readlink -f /usr/bin/java
                         /usr/lib/jvm/java-8-oracle/jre/bin/java

                         Note: /usr/lib/jvm/java-8-oracle is the JAVA_HOME directory.
          4.2.1.2  Edit hadoop-env.sh and set $JAVA_HOME.
         
                       hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh  
                      
                       Edit the file and change

                       export JAVA_HOME=${JAVA_HOME}
                                         to
                       export JAVA_HOME=/usr/lib/jvm/java-8-oracle
                            Note: JAVA_HOME is the path fetched in step 4.2.1.1; do not put spaces around the = sign, since shell assignments do not allow them.
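
If you prefer to make this change non-interactively, a one-liner such as the following should work (a sketch, assuming the file contains a line starting with export JAVA_HOME=):

hduser@prod01:~$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_HOME/etc/hadoop/hadoop-env.sh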
                    
4.2.2  Changes to core-site.xml 
hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Add the configuration property (NameNode property: fs.default.name). In Hadoop 2 this key is deprecated in favor of fs.defaultFS; both work, but the logs may print a deprecation warning.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

4.2.3 Changes to hdfs-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the configuration properties (dfs.replication, plus NameNode property dfs.name.dir and DataNode property dfs.data.dir; in Hadoop 2 the latter two are deprecated aliases of dfs.namenode.name.dir and dfs.datanode.data.dir, but they still work).

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
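
Optionally, you can pre-create the storage directories referenced above. This is just a convenience: the format in Step 5 creates the namenode directory, and the DataNode creates its directory on first start.

hduser@prod01:~$ mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode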


4.2.4 Changes to mapred-site.xml

Here, we first copy mapred-site.xml.template to mapred-site.xml and then add the property to it.

hduser@prod01:~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the configuration property (mapreduce.framework.name).

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Note: If you do not specify this, the ResourceManager UI (http://localhost:8088) will not show any jobs.

4.2.5 Changes to yarn-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the configuration property (yarn.nodemanager.aux-services).

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
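
As a hedged sanity check of the configuration files, the hdfs getconf tool can echo individual keys back; the first command below should print hdfs://localhost:9000 and the second should print 1.

hduser@prod01:~$ hdfs getconf -confKey fs.defaultFS
hduser@prod01:~$ hdfs getconf -confKey dfs.replication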

Step 5: Format and verify the HDFS file system

5.1 Format HDFS file system

       hduser@pooja:~$ hdfs namenode -format

       ...Snippet
           17/08/24 16:08:36 INFO util.GSet: capacity      = 2^15 = 32768 entries
           17/08/24 16:08:36 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1601791069-127.0.1.1-1503616
           17/08/24 16:08:37 INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.
            17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Saving image file       /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
           17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Image file /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
           17/08/24 16:08:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
           17/08/24 16:08:37 INFO util.ExitUtil: Exiting with status 0
           17/08/24 16:08:37 INFO namenode.NameNode: SHUTDOWN_MSG: 
          /************************************************************
            SHUTDOWN_MSG: Shutting down NameNode at pooja/127.0.1.1
          ************************************************************/

5.2 Verify the format (make sure the hadoopdata/hdfs/* folders were created).
        
       hduser@prod01:~$ ls -ltr hadoopdata/hdfs/
       
         total 4
         drwxrwxr-x 3 hduser hduser 4096 Aug 24 16:09 namenode

Note: This is the same path as specified in the hdfs-site.xml property dfs.name.dir. Only the namenode directory exists at this point; the datanode directory will be created when the DataNode starts.

Step 6: Start the single-node cluster

We will start the Hadoop cluster using the Hadoop start-up scripts. (The NativeCodeLoader warning that appears in the output below is harmless; Hadoop falls back to built-in Java classes when the native library is not available for the platform.)

6.1 Start HDFS
     
hduser@prod01:~$ start-dfs.sh 
17/08/24 16:38:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-namenode-prod01.out
localhost: starting datanode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-datanode-prod01.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is be:b3:7d:41:89:03:15:04:1c:84:e3:d9:69:1f:c8:5d.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-secondarynamenode-prod01.out
17/08/24 16:39:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

6.2 Start YARN

hduser@prod01:~$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-resourcemanager-prod01.out
localhost: starting nodemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-nodemanager-prod01.out

6.3 Verify that all processes started.

hduser@prod01:~$ jps
6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode
6651 NameNode
7339 NodeManager
7663 Jps
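
Before running a job, a quick smoke test of HDFS: create a home directory for hduser and list the file system root (hedged optional commands; the -ls output should show the new /user directory).

hduser@prod01:~$ hdfs dfs -mkdir -p /user/hduser
hduser@prod01:~$ hdfs dfs -ls /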

6.4 Run the Pi MapReduce job from the hadoop-mapreduce-examples jar.

hduser@prod01:~$ yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 4 1000
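
While the job runs, you can also watch it from the command line instead of the ResourceManager UI described in Step 7; when it finishes, the job prints an estimated value of Pi to the console.

hduser@prod01:~$ yarn application -list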




Step 7: Hadoop Web Interfaces

NameNode web UI (http://localhost:50070)

ResourceManager UI (http://localhost:8088)
It shows all running jobs and cluster resource information, which helps you monitor jobs and their progress.

Step 8: Stopping Hadoop

8.1 Stop YARN processes

hduser@prod01:~$ stop-yarn.sh

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop

8.2 Stop HDFS processes

hduser@prod01:~$ stop-dfs.sh
17/08/24 17:11:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
17/08/24 17:12:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
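
Finally, you can confirm that all the daemons are down; after both stop scripts finish, jps should list only the Jps process itself.

hduser@prod01:~$ jps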


I hope you were able to follow my instructions for the Hadoop pseudo-distributed mode setup. Please write to me if you are still facing problems.

Happy Coding!!!!
