Thursday, August 24, 2017

Installing Hadoop 2.8.1 on Ubuntu (Single Node Cluster)

 Overview

Hadoop is an open-source framework for writing and running distributed computing programs. The framework comprises HDFS (Hadoop Distributed File System) and MapReduce (a programming framework written in Java).

In Hadoop 1, only MapReduce programs (written in Java, or in Python via Hadoop Streaming) can be run on the data stored in HDFS. Therefore, it fits only batch-processing computations.

In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced, which provides APIs for requesting and allocating resources in the cluster. These APIs let applications such as Spark, Tez, and Storm process large-scale, fault-tolerant data stored in HDFS. Thus, the Hadoop ecosystem now fits batch, near-real-time, and real-time processing computations.


Today, I will discuss the steps to set up Hadoop 2 in pseudo-distributed mode on an Ubuntu machine.

Prerequisites


  • Hardware requirement
          The machine on which Hadoop is installed should have 64-128 MB of RAM and at least 1-4 GB of free hard disk space for better performance. This requirement is optional.
  • Check Java version
         The Java version on the machine should be 7 or higher. If you have a version lower than 7, or no Java installed at all, install it by following the steps in the article (a sample install command is shown after this list).

        You can check the Java version with the command below.
       
        $ java -version
             java version "1.8.0_131"
             Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
            Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
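
If Java is not installed, the commands below are one way to get it on Ubuntu. This is just a sketch assuming the OpenJDK 8 package from the standard Ubuntu repositories is acceptable; this post itself uses Oracle Java 8, which comes from a separate PPA.

$ sudo apt-get update
$ sudo apt-get install -y openjdk-8-jdk
$ java -version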

Steps for Hadoop Setup on a Single Machine

Step 1: Create a dedicated Hadoop user.

1.1 Create a group hadoop.

pooja@prod01:~$ sudo groupadd hadoop

1.2 Create a user hduser in group hadoop.

pooja@prod01:~$ sudo useradd -G hadoop -m hduser

Note: -m creates the user's home directory.

1.3 Make sure the home directory for hduser was created.

pooja@prod01:~$ sudo ls -ltr /home/

total 8
drwxr-xr-x 28 pooja  pooja  4096 Aug 24 09:23 pooja
drwxr-xr-x  2 hduser hduser 4096 Aug 24 13:34 hduser

1.4 Set a password for hduser.
pooja@prod01:~$ sudo passwd hduser

Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

1.5 Log in as hduser.
pooja@prod01:~$ su - hduser
Password: 

hduser@prod01:~$ pwd
/home/hduser
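
Optionally, confirm that hduser is a member of the hadoop group; the output of the command below should list hadoop among the groups (a quick hedged check, not required for the setup).

hduser@prod01:~$ groups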

Step 2: Set up Passwordless SSH

Hadoop's start-up scripts use SSH to launch the daemons on each node (here, just localhost), so hduser needs passwordless SSH access to localhost.

2.1 Generate an SSH key pair without a passphrase.

hduser@prod01:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
6c:c0:f4:c2:d1:d8:40:41:2b:e8:7b:8d:d4:c7:2c:62 hduser@prod01
The key's randomart image is:
+--[ RSA 2048]----+
|     oB*         |
|   . +.+o        |
|  . . * .        |
| .   o *         |
|  . E o S        |
|   + ++         |
|  . o .          |
|   .             |
|                 |
+-----------------+

2.2 Append the generated public key to authorized_keys.

hduser@prod01:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

2.3 Restrict permissions on authorized_keys so that only the owner can read and write it.

hduser@prod01:~$  chmod 0600 ~/.ssh/authorized_keys

2.4 Verify that passwordless SSH is working.

Note: When asked whether to continue connecting, answer yes as shown below.
hduser@prod01:~$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ad:3c:12:c3:b1:d2:60:a4:8f:76:00:1e:15:b3:f4:41.
Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
...Snippet
$

Step 3: Download Hadoop 2.8.1

3.1 Download the Hadoop 2.8.1 tar file from an Apache download mirror, or use the command below.

hduser@prod01:~$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

--2017-08-24 14:01:31--  http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
Resolving apache.claz.org (apache.claz.org)... 74.63.227.45
Connecting to apache.claz.org (apache.claz.org)|74.63.227.45|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 424555111 (405M) [application/x-gzip]
Saving to: ‘hadoop-2.8.1.tar.gz’
100%[=====================================================================================================>] 424,555,111 1.51MB/s   in 2m 48s
2017-08-24 14:04:19 (2.41 MB/s) - ‘hadoop-2.8.1.tar.gz’ saved [424555111/424555111]
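
Optionally, verify the integrity of the download. This is a hedged sketch: compute the SHA-256 checksum locally and compare it by eye against the checksum published for hadoop-2.8.1.tar.gz on the Apache download site (the published value is not reproduced here).

hduser@prod01:~$ sha256sum hadoop-2.8.1.tar.gz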

3.2 Untar the downloaded tar file.

hduser@prod01:~$ tar -xvf hadoop-2.8.1.tar.gz

...Snippet
hadoop-2.8.1/share/doc/hadoop/images/external.png
hadoop-2.8.1/share/doc/hadoop/images/h5.jpg
hadoop-2.8.1/share/doc/hadoop/index.html
hadoop-2.8.1/share/doc/hadoop/project-reports.html
hadoop-2.8.1/include/
hadoop-2.8.1/include/hdfs.h
hadoop-2.8.1/include/Pipes.hh
hadoop-2.8.1/include/TemplateFactory.hh
hadoop-2.8.1/include/StringUtils.hh
hadoop-2.8.1/include/SerialUtils.hh
hadoop-2.8.1/LICENSE.txt
hadoop-2.8.1/NOTICE.txt
hadoop-2.8.1/README.txt

3.3 Create a soft link, so the rest of the setup can refer to the stable path ~/hadoop regardless of the installed version.

hduser@prod01:~$ ln -s hadoop-2.8.1 hadoop
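
A quick check that the link points where we expect; the listing should end with hadoop -> hadoop-2.8.1.

hduser@prod01:~$ ls -ld hadoop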

Step 4: Configure Hadoop in Pseudo-Distributed Mode

In the Hadoop configuration below, we add only the minimum required properties; you can add more properties as needed.

4.1 Set up the environment variable.

   4.1.1 Edit .bashrc and add Hadoop to the PATH as shown below:

            hduser@pooja:~$ vi .bashrc

               #Add below lines to .bashrc
                export HADOOP_HOME=/home/hduser/hadoop
                export HADOOP_INSTALL=$HADOOP_HOME
                export HADOOP_MAPRED_HOME=$HADOOP_HOME
                export HADOOP_COMMON_HOME=$HADOOP_HOME
                export HADOOP_HDFS_HOME=$HADOOP_HOME
                export YARN_HOME=$HADOOP_HOME
                export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
               export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

  4.1.2 Source .bashrc in current login session

          hduser@pooja:~$ source ~/.bashrc
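
To confirm the environment variables took effect, the two checks below should print the Hadoop home directory and the Hadoop version banner (hadoop version does not need a running cluster).

          hduser@pooja:~$ echo $HADOOP_HOME
          hduser@pooja:~$ hadoop version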
          
4.2  Hadoop configuration file changes

   4.2.1 Changes to hadoop-env.sh (set $JAVA_HOME to installation directory)
         
           4.2.1.1 Find JAVA_HOME on machine.
                      
                        hduser@pooja:~$ which java
                         /usr/bin/java
                        
                         hduser@pooja:~$ readlink -f /usr/bin/java
                         /usr/lib/jvm/java-8-oracle/jre/bin/java

                         Note: /usr/lib/jvm/java-8-oracle is the JAVA_HOME directory.
          4.2.1.2  Edit hadoop-env.sh and set $JAVA_HOME.
         
                       hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh  
                      
                       Edit the file and change

                       export JAVA_HOME=${JAVA_HOME}
                                         to
                       export JAVA_HOME=/usr/lib/jvm/java-8-oracle
                            Note: JAVA_HOME is the path fetched in step 4.2.1.1; do not put spaces around the = sign, since shell assignments do not allow them.
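
If you prefer to make this change non-interactively, a one-liner such as the following should work (a sketch, assuming the file contains a line starting with export JAVA_HOME=):

hduser@prod01:~$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_HOME/etc/hadoop/hadoop-env.sh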
                    
4.2.2  Changes to core-site.xml 
hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Add the configuration property (NameNode property: fs.default.name). In Hadoop 2 this key is deprecated in favor of fs.defaultFS; both work, but the logs may print a deprecation warning.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

4.2.3 Changes to hdfs-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the configuration properties (dfs.replication, plus NameNode property dfs.name.dir and DataNode property dfs.data.dir; in Hadoop 2 the latter two are deprecated aliases of dfs.namenode.name.dir and dfs.datanode.data.dir, but they still work).

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
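
Optionally, you can pre-create the storage directories referenced above. This is just a convenience: the format in Step 5 creates the namenode directory, and the DataNode creates its directory on first start.

hduser@prod01:~$ mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode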


4.2.4 Changes to mapred-site.xml

Here, we first copy mapred-site.xml.template to mapred-site.xml and then add the property to it.

hduser@prod01:~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the configuration property (mapreduce.framework.name).

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Note: If you do not specify this, the ResourceManager UI (http://localhost:8088) will not show any jobs.

4.2.5 Changes to yarn-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the configuration property (yarn.nodemanager.aux-services).

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
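
As a hedged sanity check of the configuration files, the hdfs getconf tool can echo individual keys back; the first command below should print hdfs://localhost:9000 and the second should print 1.

hduser@prod01:~$ hdfs getconf -confKey fs.defaultFS
hduser@prod01:~$ hdfs getconf -confKey dfs.replication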

Step 5: Format and verify the HDFS file system

5.1 Format HDFS file system

       hduser@pooja:~$ hdfs namenode -format

       ...Snippet
           17/08/24 16:08:36 INFO util.GSet: capacity      = 2^15 = 32768 entries
           17/08/24 16:08:36 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1601791069-127.0.1.1-1503616
           17/08/24 16:08:37 INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.
            17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Saving image file       /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
           17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Image file /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
           17/08/24 16:08:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
           17/08/24 16:08:37 INFO util.ExitUtil: Exiting with status 0
           17/08/24 16:08:37 INFO namenode.NameNode: SHUTDOWN_MSG: 
          /************************************************************
            SHUTDOWN_MSG: Shutting down NameNode at pooja/127.0.1.1
          ************************************************************/

5.2 Verify the format (make sure the hadoopdata/hdfs/* folders were created).
        
       hduser@prod01:~$ ls -ltr hadoopdata/hdfs/
       
         total 4
         drwxrwxr-x 3 hduser hduser 4096 Aug 24 16:09 namenode

Note: This is the same path as specified in the hdfs-site.xml property dfs.name.dir. Only the namenode directory exists at this point; the datanode directory will be created when the DataNode starts.

Step 6: Start the single-node cluster

We will start the Hadoop cluster using the Hadoop start-up scripts. (The NativeCodeLoader warning that appears in the output below is harmless; Hadoop falls back to built-in Java classes when the native library is not available for the platform.)

6.1 Start HDFS
     
hduser@prod01:~$ start-dfs.sh 
17/08/24 16:38:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-namenode-prod01.out
localhost: starting datanode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-datanode-prod01.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is be:b3:7d:41:89:03:15:04:1c:84:e3:d9:69:1f:c8:5d.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-secondarynamenode-prod01.out
17/08/24 16:39:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

6.2 Start YARN

hduser@prod01:~$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-resourcemanager-prod01.out
localhost: starting nodemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-nodemanager-prod01.out

6.3 Verify that all processes started.

hduser@prod01:~$ jps
6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode
6651 NameNode
7339 NodeManager
7663 Jps
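
Before running a job, a quick smoke test of HDFS: create a home directory for hduser and list the file system root (hedged optional commands; the -ls output should show the new /user directory).

hduser@prod01:~$ hdfs dfs -mkdir -p /user/hduser
hduser@prod01:~$ hdfs dfs -ls /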

6.4 Run the Pi MapReduce job from the hadoop-mapreduce-examples jar.

hduser@prod01:~$ yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 4 1000
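
While the job runs, you can also watch it from the command line instead of the ResourceManager UI described in Step 7; when it finishes, the job prints an estimated value of Pi to the console.

hduser@prod01:~$ yarn application -list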




Step 7: Hadoop Web Interfaces

NameNode web UI (http://localhost:50070)

ResourceManager UI (http://localhost:8088)
It shows all running jobs and cluster resource information, which helps you monitor jobs and their progress.

Step 8: Stopping Hadoop

8.1 Stop YARN processes

hduser@prod01:~$ stop-yarn.sh

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop

8.2 Stop HDFS processes

hduser@prod01:~$ stop-dfs.sh
17/08/24 17:11:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
17/08/24 17:12:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
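
Finally, you can confirm that all the daemons are down; after both stop scripts finish, jps should list only the Jps process itself.

hduser@prod01:~$ jps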


I hope you were able to follow my instructions for the Hadoop pseudo-distributed mode setup. Please write to me if you are still facing problems.

Happy Coding!!!!
