Wednesday, September 9, 2015

Installing Hadoop 2.7 on CentOS 7 (Single Node Cluster)

Hadoop is an open-source framework, written in Java, for complex high-volume computation. Today's industry data is growing along the 3 Vs (Volume, Velocity and Variety), which makes it difficult to analyze and interpret. Hadoop's distributed, highly fault-tolerant filesystem (HDFS) addresses this 3V data growth, and MapReduce is the programming platform used to analyze the data stored in HDFS.

Today, we will walk through the steps to install and run Hadoop on a CentOS server machine.

Step 1: Installing Java
Hadoop requires Java 1.6 or higher. Check whether Java is already installed; if not, install it using the command below.

[pooja@localhost ~]$ sudo yum install java-1.7.0-openjdk
Output
......
Dependency Installed:
giflib.x86_64 0:4.1.6-3.1.el6
jpackage-utils.noarch 0:1.7.5-3.14.el6
pcsc-lite-libs.x86_64 0:1.5.2-15.el6
ttmkfdir.x86_64 0:3.0.9-32.1.el6
tzdata-java.noarch 0:2015f-1.el6
xorg-x11-fonts-Type1.noarch 0:7.2-11.el6

Complete!

[root@localhost ~]$ java -version
Output:
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.3.el6_7-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)


Step 2: Create a dedicated Hadoop user
We recommend creating a dedicated (non-root) user for the Hadoop installation.

[pooja@localhost ~]$ sudo groupadd hadoop
[pooja@localhost ~]$ sudo useradd --groups hadoop hduser
[pooja@localhost ~]$ sudo passwd hduser
[pooja@localhost ~]$ su - hduser

Hadoop requires SSH to manage its nodes, so even for a single-node setup we need to configure public-key authentication to the local machine.

[hduser@localhost ~]$ ssh-keygen -t rsa -P ""
Output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
87:21:a4:91:1e:f7:01:0b:9a:e3:a3:8a:76:8b:ab:6f hduser@localhost.localdomain
[....snipp...]

[hduser@localhost ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hduser@localhost ~]$ chmod 0600 ~/.ssh/authorized_keys
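
Note: public-key login also requires that the ~/.ssh directory itself is not group- or world-writable. If sshd still asks for a password, tightening the permissions usually helps:

[hduser@localhost ~]$ chmod 0700 ~/.ssh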

If you face any issues here, refer to "Troubleshooting: SSH Setup" at the end of this article; otherwise continue.
Now verify that SSH is set up properly. The command below should not ask for a password, but the first time it will prompt you to add the host's RSA key to the known hosts list.

[hduser@localhost ~]$ ssh localhost
Output (first time only):
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is e4:37:82:a0:68:e9:ee:1f:0f:22:2e:35:63:94:38:d3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Step 3: Download Hadoop 2.7.0
Download Hadoop from the Apache download mirrors, or use the command below.

[hduser@localhost ~]$ wget http://apache.claz.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
Output: --2016-12-16 21:51:51-- http://apache.claz.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
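
Before extracting, you can optionally confirm that the archive downloaded completely by listing its contents (any error here means the download is corrupt and should be repeated):

[hduser@localhost ~]$ tar -tzf hadoop-2.7.0.tar.gz > /dev/null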

Step 4: Untar the Hadoop archive and create a soft link

[hduser@localhost ~]$ tar -xvf hadoop-2.7.0.tar.gz

[hduser@localhost ~]$ ln -s hadoop-2.7.0 hadoop

Step 5: Configure Hadoop in Pseudo-Distributed Mode

5.1 Set Up Environment Variables

Edit ~/.bashrc and add the lines below. If you are using another shell, update its configuration file instead.

export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current shell environment:

[hduser@localhost ~]$ source ~/.bashrc
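
To confirm that the new variables are visible in the current shell (assuming Hadoop was unpacked to /home/hduser/hadoop as above), you can check:

[hduser@localhost ~]$ echo $HADOOP_HOME
[hduser@localhost ~]$ hadoop version

The second command should report the Hadoop release, which confirms that $PATH now includes the Hadoop bin directory.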

5.2 Configuration Changes

Edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable.
Change

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/usr/lib/jvm/jre-openjdk
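
Note: /usr/lib/jvm/jre-openjdk is a convenience symlink that the OpenJDK packages usually create. If it does not exist on your machine, you can locate the real JRE path as shown below and use it (without the trailing /bin/java) as JAVA_HOME:

[hduser@localhost ~]$ readlink -f $(which java)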

Hadoop has many configuration files that need to be customized for our setup. For this article we will configure a basic single-node setup.
Navigate to the configuration directory and edit the files below:

[hduser@localhost ~]$ cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
   <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Edit hdfs-site.xml

<configuration>
   <property>
       <name>dfs.replication</name>
       <value>1</value>
   </property>
   <property>
       <name>dfs.name.dir</name>
       <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
   </property>
   <property>
       <name>dfs.data.dir</name>
       <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
   </property>
</configuration>

Edit yarn-site.xml

<configuration>
   <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
   </property>
</configuration>
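
Optionally, MapReduce jobs can be told to run on YARN instead of the local runner (the wordcount output later in this article shows the LocalJobRunner, so this step is not required for the test to work). If you want YARN execution, create mapred-site.xml from the bundled template and add the property shown:

[hduser@localhost ~]$ cd $HADOOP_HOME/etc/hadoop
[hduser@localhost ~]$ cp mapred-site.xml.template mapred-site.xml

Then add inside <configuration>:

   <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
   </property>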

Step 6: Format the HDFS filesystem via the NameNode
Now format HDFS using the command below, and then make sure the HDFS directory has been created (the directory specified in the "dfs.name.dir" property of hdfs-site.xml).

[hduser@localhost ~]$ hdfs namenode -format

Output:
15/09/08 22:44:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.0
[...snipp...]
15/09/08 22:44:44 INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.
15/09/08 22:44:44 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/08 22:44:44 INFO util.ExitUtil: Exiting with status 0
15/09/08 22:44:44 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
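
As a quick sanity check, the NameNode directory configured in hdfs-site.xml should now exist and contain a current subdirectory:

[hduser@localhost ~]$ ls /home/hduser/hadoopdata/hdfs/namenode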


Step 7: Start the single-node Hadoop cluster
Let's start the Hadoop cluster using the scripts that ship with Hadoop.
Start HDFS:

[hduser@localhost ~]$ $HADOOP_HOME/sbin/start-dfs.sh

Output:
15/09/08 22:54:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is e4:37:82:a0:68:e9:ee:1f:0f:22:2e:35:63:94:38:d3.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-localhost.localdomain.out
15/09/08 22:55:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Start YARN:

[hduser@localhost ~]$ $HADOOP_HOME/sbin/start-yarn.sh

Output:
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-localhost.localdomain.out
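
At this point you can verify that all daemons are up using the jps tool (part of the JDK; on CentOS it may require the java-*-openjdk-devel package). For a healthy single-node setup, the listing should include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:

[hduser@localhost ~]$ jps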


Step 8: Hadoop web interfaces
Web UI of the NameNode daemon: http://localhost:50070/
Web UI of the Secondary NameNode: http://localhost:50090/
Web UI with cluster information (ResourceManager): http://localhost:8088/
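
If the machine has no desktop browser, a quick check from the shell (assuming curl is installed) can confirm the NameNode UI is answering; it should print 200 when the UI is up:

[hduser@localhost ~]$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/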

Step 9: Test the Hadoop setup
9.1 Create a sample data file on the local machine, or download some data from the internet.
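
For example, a minimal sample file could be created like this (the localdata directory name is an assumption that matches the hdfs put command in 9.2):

[hduser@localhost ~]$ mkdir localdata
[hduser@localhost ~]$ echo "This is a sample file. This Will do for a wordcount test." > localdata/sample.txt
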
9.2 Copy the data file from the local machine into HDFS using the commands below.

[hduser@localhost ~]$ hdfs dfs -mkdir /user
15/09/08 23:39:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[hduser@localhost ~]$ hdfs dfs -put localdata/* /user
15/09/08 23:41:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

9.3 Run the bundled MapReduce wordcount example (found in $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar) using the command below.
Note: the HDFS input directory is /user and the HDFS output directory is /user/output.

[hduser@localhost ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /user /user/output

Output:
15/09/08 23:49:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/08 23:49:55 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/08 23:49:55 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/08 23:49:56 INFO input.FileInputFormat: Total input paths to process : 1
15/09/08 23:49:56 INFO mapreduce.JobSubmitter: number of splits:1
15/09/08 23:49:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local875797856_0001
15/09/08 23:49:56 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/08 23:49:56 INFO mapreduce.Job: Running job: job_local875797856_0001
15/09/08 23:49:56 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/08 23:49:56 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
[...snipp...]

9.4 Verify the results in the HDFS directory /user/output

[hduser@localhost ~]$ hdfs dfs -ls /user/output

Output:
15/09/08 23:54:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2015-09-08 23:49 /user/output/_SUCCESS
-rw-r--r-- 1 hduser supergroup 132 2015-09-08 23:49 /user/output/part-r-00000

[hduser@localhost ~]$ hdfs dfs -cat /user/output/part-r-00000

Output:
15/09/08 23:55:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
This 2
Will 1
[...snipp...]
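
If you also want a local copy of the results, something like this should work (the local target directory name is illustrative):

[hduser@localhost ~]$ hdfs dfs -get /user/output ./wordcount-output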


Step 10: Stop the running Hadoop cluster using the commands below:

[hduser@localhost ~]$ cd $HADOOP_HOME/sbin/
[hduser@localhost ~]$ ./stop-yarn.sh

Output:
stopping yarn daemons
stopping resourcemanager
hduser@localhost's password:
localhost: stopping nodemanager
no proxyserver to stop

[hduser@localhost ~]$ ./stop-dfs.sh

Output:
15/09/08 23:59:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]

I hope everyone was able to set up the Hadoop cluster successfully. Please feel free to leave a comment if you face any issues.

Happy Coding!!!!

Troubleshooting: SSH Setup

I ran into two errors while setting up SSH.

1. Service sshd doesn't exist: "there is no such file or directory"

In this case SSH is not installed on your machine; install it using the command below.


[root@localhost ~] $ yum -y install openssh-server openssh-clients
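
On CentOS 7 you will usually also want to start the service and enable it at boot (assuming systemd, as in the status output shown further below):

[root@localhost ~] $ systemctl start sshd.service
[root@localhost ~] $ systemctl enable sshd.service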


2. ssh: connect to host localhost port 22: Connection refused

After I completed the SSH setup and typed 'ssh localhost', the above error popped up.

To resolve it, I stopped and started the service again using the commands below.


[hduser@localhost ~]$ /bin/systemctl stop sshd.service
[hduser@localhost ~]$ /bin/systemctl start sshd.service
[hduser@localhost ~]$ /bin/systemctl status sshd.service
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2016-12-16 12:41:44 PST; 16s ago
Docs: man:sshd(8)
man:sshd_config(5)
Main PID: 6192 (sshd)
CGroup: /system.slice/sshd.service
└─6192 /usr/sbin/sshd -D
