Thursday, August 24, 2017

Set Up a Multi-Node Apache Hadoop 2 Cluster

Apache Hadoop 


Hadoop is an open source framework for writing and running distributed applications. It consists of a scale-out, fault-tolerant distributed file system (HDFS) and a data processing system (MapReduce).

Today, I will walk through the steps to set up a Hadoop cluster involving 2 or more commodity machines. I will be configuring the setup using 2 machines.

Prerequisites:



Network accessibility: The machines should be connected to each other through Ethernet hubs, switches, or routers. The cluster machines should therefore be on the same subnet, with IP addresses like 192.168.1.x.
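For example, before proceeding you can confirm the machines can reach each other over the network (a quick sanity check, assuming the IP addresses 192.168.1.1 and 192.168.1.2 used in this post):

# From the master (192.168.1.1), expect replies from the slave
hduser@pooja:~$ ping -c 2 192.168.1.2

# From the slave (192.168.1.2), expect replies from the master
hduser@prod01:~$ ping -c 2 192.168.1.1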


Multi Node Hadoop Cluster Setup


1. Set up Hadoop on each machine


Please follow the steps provided in the tutorial to complete a single-node setup on each machine. Then stop the processes as shown in Step 8 of the tutorial.
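For reference, stopping the single-node daemons on each machine looks like the following (the same scripts used in Step 8 of the single-node tutorial); afterwards jps should show no Hadoop daemons:

hduser@pooja:~$ stop-yarn.sh
hduser@pooja:~$ stop-dfs.sh
hduser@pooja:~$ jps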

2. Change each node's hosts file to include all machines in the cluster.


In my case, I have just 2 machines connected over the network with IP addresses 192.168.1.1 and 192.168.1.2. Therefore, I added the 2 lines below to the file:

hduser@pooja:~$ sudo vi /etc/hosts

192.168.1.1 master
192.168.1.2 slave1

3. Set up passwordless SSH

We will be creating passwordless SSH from the master to all slave machines in the network.


3.1 Master machine ssh set up with itself


We already set up passwordless SSH to localhost/itself when configuring Hadoop on each machine. Here, we will just verify that the setup is correct.

hduser@pooja:~$ ssh master

The authenticity of host 'master (192.168.1.101)' can't be established.
ECDSA key fingerprint is ad:3c:12:c3:b1:d2:60:a4:8f:76:00:1d:15:b7:f5:41.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master,192.168.1.101' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

385 packages can be updated.
268 updates are security updates.

New release '16.04.3 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Thu Aug 24 13:51:11 2017 from localhost
$


3.2 Master machine ssh set up with slave nodes


3.2.1 Copy the master's SSH public key to all slave nodes.

          hduser@pooja:~$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave1

            The authenticity of host 'slaves (192.168.1.2)' can't be established.
            The ECDSA key fingerprint is: b3:7d:41:89:03:15:04:1c:84:e3:d1:69:1f:c8:5d.
            Are you sure you want to continue connecting (yes/no)? yes
            /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
            /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
            hduser@slave1's password:
            Number of key(s) added: 1
            Now try logging into the machine, with:   "ssh 'hduser@slave1'"
            and check to make sure that only the key(s) you wanted were added.
           
            Note: At the hduser@slave1's password prompt above, enter the password of the hduser account on the slave1 machine. 

3.2.2 Verify the authorized_keys file on the slave1 machine
         
          Make sure it contains a key entry from the master node, as shown below.

          hduser@prod01:~$ cat .ssh/authorized_keys 

            ssh-rsa AAAAB3NzaC1yc....LJ/67N+v7g8S0/U44Mhjf7dviODw5tY9cs5XXsb1FMVQL... hduser@prod01
            ssh-rsa fffA3zwdi0eWSkJvDWzD9du...kSRTRLZbzVY9ahLZNLFz+p1QU3HXuY3tLr hduser@pooja

3.2.3 Confirm passwordless ssh from master machine
     
           hduser@pooja:~$ ssh slave1
             
             Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
             * Documentation:  https://help.ubuntu.com/
             334 packages can be updated.
             224 updates are security updates.

             New release '16.04.3 LTS' available.
             Run 'do-release-upgrade' to upgrade to it.

            Last login: Thu Aug 24 13:50:50 2017 from localhost
            $ 


4. Hadoop Configuration changes


4.1 Changes to the masters file

This file lists the machine(s) on which the SecondaryNameNode should run (the NameNode itself runs on the host named in fs.default.name, i.e. the master). The SecondaryNameNode periodically merges the fsimage with the edit log to keep the edit log from growing too large.

In our case, we will specify the master machine only.

hduser@pooja:~$ vi $HADOOP_HOME/etc/hadoop/masters
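In my setup, the masters file contains just the single line below:

master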


4.2 Changes to the slaves file

This file specifies the list of machines that run the DataNode and NodeManager daemons.
In our case, we will specify both master and slave1. If you have more slaves, you can list them here and remove the master node.
hduser@pooja:~$ vi $HADOOP_HOME/etc/hadoop/slaves
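In my setup, the slaves file lists both machines, so DataNode and NodeManager daemons start on each of them:

master
slave1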



4.3 Changes to core-site.xml on all machines in the cluster.

Now, the NameNode process will be running on the master and not on localhost.
Therefore, we need to change the value of the fs.default.name property to hdfs://master:9000 as shown below.

hduser@pooja:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>

Note: Make sure you make this change to core-site.xml on the slave nodes as well.
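One convenient way to keep the file identical everywhere is to copy it from the master to each slave over the passwordless SSH set up earlier (a minimal sketch, assuming $HADOOP_HOME points to the same path on both machines):

hduser@pooja:~$ scp $HADOOP_HOME/etc/hadoop/core-site.xml hduser@slave1:$HADOOP_HOME/etc/hadoop/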

4.4 Changes to hdfs-site.xml on all slave nodes (this is an optional step)

Remove the dfs.name.dir property, since the NameNode will not be running on the slave machines.
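After removing it, a slave's hdfs-site.xml would keep only the replication and data-directory properties, roughly as sketched below (based on the single-node configuration later in this post; adjust the values to your setup):

<configuration>
<property>
     <name>dfs.replication</name>
     <value>1</value>
</property>
<property>
     <name>dfs.data.dir</name>
     <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
</property>
</configuration>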


5. Starting the Hadoop cluster


From the master machine, run the commands below.

5.1 Start HDFS  

hduser@pooja:~$ start-dfs.sh

5.2 Start Yarn

hduser@pooja:~$ start-yarn.sh

5.3 Verify the running processes

5.3.1 Processes running on the master machine.

hduser@pooja:~$ jps
6821 SecondaryNameNode
7126 NodeManager
6615 DataNode
7628 Jps
6444 NameNode
6990 ResourceManager
  
5.3.2 Processes running on the slave node
hduser@prod01:~$ jps
9749 NodeManager
9613 DataNode
9902 Jps

5.3.3 Run the pi MapReduce job from the hadoop-examples jar.
hduser@pooja:~$ yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 4 1000 


6. Stop the Cluster


On the master node, stop the processes.

6.1 Stop yarn

hduser@pooja:~$ stop-yarn.sh 
stopping yarn daemons
stopping resourcemanager
master: stopping nodemanager
slave1: stopping nodemanager
master: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
slave1: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop

6.2 Stop HDFS

hduser@pooja:~$ stop-dfs.sh
17/08/24 18:42:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [master]
master: stopping namenode
master: stopping datanode
slave1: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
17/08/24 18:42:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

I hope you are able to follow my instructions to set up the Hadoop cluster. If you are still facing issues, I would love to address them, so please do write in with your problems!

Happy Coding !!!

Installing Hadoop 2.8.1 on Ubuntu (Single Node Cluster)

 Overview

Hadoop is an open source framework for writing and running distributed computing programs. The framework comprises HDFS (Hadoop Distributed File System) and MapReduce (a programming framework written in Java).

In Hadoop 1, only MapReduce programs (written in Java or Python) can be run on the data stored in HDFS. Therefore, it only fits batch-processing computations.

In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced, which provides APIs for requesting and allocating resources in the cluster. These APIs allow applications such as Spark, Tez, and Storm to process large-scale, fault-tolerant data stored in HDFS. Thus, the Hadoop ecosystem now fits batch, near-real-time, and real-time processing computations.


Today, I will be discussing the steps to set up Hadoop 2 in pseudo-distributed mode on an Ubuntu machine.

Prerequisites


  • Hardware requirement
          The machine on which Hadoop is installed should have sufficient RAM and at least 1-4 GB of free disk space for better performance. This is an optional requirement.
  • Check Java version
         The machine's Java version should be 7 or higher. If you have a version lower than 7, or no Java installed, install it using the steps provided in the article.

        You can check the Java version with the command below.
       
        $ java -version
             java version "1.8.0_131"
             Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
            Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Steps for Hadoop Setup on a Single Machine.

Step 1: Create a dedicated hadoop user.

1.1 Create a group hadoop

pooja@prod01:~$ sudo groupadd hadoop

1.2 Create a user hduser in the group hadoop.

pooja@prod01:~$ sudo useradd -G hadoop -m  hduser

Note: -m creates the home directory.

1.3 Make sure the home directory for hduser was created.

pooja@prod01:~$ sudo ls -ltr /home/

total 8
drwxr-xr-x 28 pooja  pooja  4096 Aug 24 09:23 pooja
drwxr-xr-x  2 hduser hduser 4096 Aug 24 13:34 hduser

1.4 Define password for hduser.
pooja@prod01:~$ sudo passwd hduser

Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

1.5 Log in as hduser 
pooja@prod01:~$ su - hduser
Password: 

hduser@prod01:~$ pwd
/home/hduser

Step 2: Set up Passwordless SSH

2.1 Generate an SSH key pair without a passphrase

hduser@prod01:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
6c:c0:f4:c2:d1:d8:40:41:2b:e8:7b:8d:d4:c7:2c:62 hduser@prod01
The key's randomart image is:
+--[ RSA 2048]----+
|     oB*         |
|   . +.+o        |
|  . . * .        |
| .   o *         |
|  . E o S        |
|   + ++         |
|  . o .          |
|   .             |
|                 |
+-----------------+

2.2 Add the generated public key to authorized_keys

hduser@prod01:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

2.3 Give only the owner read and write permission on authorized_keys.

hduser@prod01:~$  chmod 0600 ~/.ssh/authorized_keys

2.4 Verify that passwordless SSH is working.

Note: When asked whether to continue connecting, type yes as shown below.
hduser@prod01:~$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ad:3c:12:c3:b1:d2:60:a4:8f:76:00:1e:15:b3:f4:41.
Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
...Snippet
$

Step 3: Download Hadoop  2.8.1

3.1 Download the Hadoop 2.8.1 tar file from an Apache download mirror, or use the command below.

hduser@prod01:~$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

--2017-08-24 14:01:31--  http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
Resolving apache.claz.org (apache.claz.org)... 74.63.227.45
Connecting to apache.claz.org (apache.claz.org)|74.63.227.45|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 424555111 (405M) [application/x-gzip]
Saving to: ‘hadoop-2.8.1.tar.gz’
100%[=====================================================================================================>] 424,555,111 1.51MB/s   in 2m 48s
2017-08-24 14:04:19 (2.41 MB/s) - ‘hadoop-2.8.1.tar.gz’ saved [424555111/424555111]

3.2 Untar the downloaded tar file.

hduser@prod01:~$ tar -xvf hadoop-2.8.1.tar.gz

...Snippet
hadoop-2.8.1/share/doc/hadoop/images/external.png
hadoop-2.8.1/share/doc/hadoop/images/h5.jpg
hadoop-2.8.1/share/doc/hadoop/index.html
hadoop-2.8.1/share/doc/hadoop/project-reports.html
hadoop-2.8.1/include/
hadoop-2.8.1/include/hdfs.h
hadoop-2.8.1/include/Pipes.hh
hadoop-2.8.1/include/TemplateFactory.hh
hadoop-2.8.1/include/StringUtils.hh
hadoop-2.8.1/include/SerialUtils.hh
hadoop-2.8.1/LICENSE.txt
hadoop-2.8.1/NOTICE.txt
hadoop-2.8.1/README.txt

3.3 Create the soft link.

hduser@prod01:~$ ln -s hadoop-2.8.1 hadoop

Step 4: Configure Hadoop in pseudo-distributed mode.

In the Hadoop configuration, we only add the minimum required properties; you can add more properties as needed.

4.1 Set up the environment variable.

   4.1.1 Edit bashrc and add hadoop in path as shown below:

            hduser@pooja:~$ vi .bashrc

               #Add below lines to .bashrc
                export HADOOP_HOME=/home/hduser/hadoop
                export HADOOP_INSTALL=$HADOOP_HOME
                export HADOOP_MAPRED_HOME=$HADOOP_HOME
                export HADOOP_COMMON_HOME=$HADOOP_HOME
                export HADOOP_HDFS_HOME=$HADOOP_HOME
                export YARN_HOME=$HADOOP_HOME
                export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
               export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

  4.1.2 Source .bashrc in current login session

          hduser@pooja:~$ source ~/.bashrc
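You can quickly confirm that the variables took effect in the current shell (a simple check, assuming the paths above):

          hduser@pooja:~$ echo $HADOOP_HOME
          /home/hduser/hadoop
          hduser@pooja:~$ which hadoop
          /home/hduser/hadoop/bin/hadoop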
          
4.2  Hadoop configuration file changes

   4.2.1 Changes to hadoop-env.sh (set $JAVA_HOME to installation directory)
         
           4.2.1.1 Find JAVA_HOME on machine.
                      
                        hduser@pooja:~$ which java
                         /usr/bin/java
                        
                         hduser@pooja:~$ readlink -f /usr/bin/java
                         /usr/lib/jvm/java-8-oracle/jre/bin/java

                         Note: /usr/lib/jvm/java-8-oracle is the JAVA_HOME directory
          4.2.1.2  Edit hadoop-env.sh and set $JAVA_HOME.
         
                       hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh  
                      
                       Edit the file and change

                       export JAVA_HOME=${JAVA_HOME}
                                         to
                       export JAVA_HOME=/usr/lib/jvm/java-8-oracle
                            Note: JAVA_HOME is the path fetched in step 4.2.1.1
                    
4.2.2  Changes to core-site.xml 
hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Add the configuration property (NameNode property: fs.default.name).

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

4.2.3 Changes to hdfs-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the configuration property (NameNode property: dfs.name.dir, DataNode property: dfs.data.dir).

<configuration>
<property>
     <name>dfs.replication</name>
       <value>1</value>
</property>
<property>
       <name>dfs.name.dir</name>
       <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
</property>
<property>
     <name>dfs.data.dir</name>
     <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
</property>
</configuration>


4.2.4 Changes to mapred-site.xml

Here, we first copy mapred-site.xml.template to mapred-site.xml and then add a property to it.

hduser@prod01:~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the configuration property (mapreduce.framework.name)
<configuration>
     <property>
         <name>mapreduce.framework.name</name>
          <value>yarn</value>
       </property>


</configuration>

Note: If you don't specify this, the ResourceManager UI (http://localhost:8088) will not show any jobs.

4.2.5 Changes to yarn-site.xml

hduser@prod01:~$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the configuration property

<configuration>
     <property>
         <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
       </property>
</configuration>

Step 5: Verify and format HDFS File system

5.1 Format HDFS file system

       hduser@pooja:~$ hdfs namenode -format

       ...Snippet
           17/08/24 16:08:36 INFO util.GSet: capacity      = 2^15 = 32768 entries
           17/08/24 16:08:36 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1601791069-127.0.1.1-1503616
           17/08/24 16:08:37 INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.
            17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Saving image file       /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
           17/08/24 16:08:37 INFO namenode.FSImageFormatProtobuf: Image file /home/hduser/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
           17/08/24 16:08:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
           17/08/24 16:08:37 INFO util.ExitUtil: Exiting with status 0
           17/08/24 16:08:37 INFO namenode.NameNode: SHUTDOWN_MSG: 
          /************************************************************
            SHUTDOWN_MSG: Shutting down NameNode at pooja/127.0.1.1
          ************************************************************/

5.2 Verify the format (make sure the hadoopdata/hdfs/* folders were created)
        
       hduser@prod01:~$ ls -ltr hadoopdata/hdfs/
       
         total 4
         drwxrwxr-x 3 hduser hduser 4096 Aug 24 16:09 namenode

Note: This is the same path as specified in the hdfs-site.xml property dfs.name.dir.

Step 6: Start the single-node cluster

We will start the Hadoop cluster using the Hadoop startup scripts.

6.1 Start HDFS
     
hduser@prod01:~$ start-dfs.sh 
17/08/24 16:38:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-namenode-prod01.out
localhost: starting datanode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-datanode-prod01.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is be:b3:7d:41:89:03:15:04:1c:84:e3:d9:69:1f:c8:5d.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hduser/hadoop-2.8.1/logs/hadoop-hduser-secondarynamenode-prod01.out
17/08/24 16:39:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

6.2 Start yarn

hduser@prod01:~$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-resourcemanager-prod01.out
localhost: starting nodemanager, logging to /home/hduser/hadoop-2.8.1/logs/yarn-hduser-nodemanager-prod01.out

6.3 Verify that all processes started

hduser@prod01:~$ jps
6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode
6651 NameNode
7339 NodeManager
7663 Jps

6.4 Run the pi MapReduce job from the hadoop-examples jar.

hduser@prod1:~$ yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 4 1000 




Step 7: Hadoop Web Interface

Web UI of the NameNode (http://localhost:50070)


ResourceManager UI (http://localhost:8088).
It shows all running jobs and cluster resource information, which helps you monitor jobs and track their progress.
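If you prefer the command line, you can also check that the web interfaces are responding (a quick sanity check; assumes curl is installed and uses the default ports above):

hduser@prod01:~$ curl -s -I http://localhost:50070 | head -n 1
hduser@prod01:~$ curl -s -I http://localhost:8088 | head -n 1

Both commands should print an HTTP status line when the daemons are up.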

Step 8: Stopping Hadoop

8.1 Stop Yarn processes

hduser@prod01:~$ stop-yarn.sh

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop

8.2 Stop HDFS processes

hduser@prod01:~$ stop-dfs.sh
17/08/24 17:11:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
17/08/24 17:12:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


I hope you are able to follow my instructions on the Hadoop pseudo-distributed mode setup. Please write to me if you are still facing problems.

Happy Coding!!!!

Wednesday, January 25, 2017

Configure IntelliJ for Android Development on CentOS

Mobile Application

In today's world, mobile application development has grown tremendously. Operations from online payments to e-shopping, digital assistants, and interactive messaging are now just a click away on a mobile device.
Mobile application user interfaces can be developed using an array of technologies such as HTML 5, CSS, JavaScript, Java, Android, or iOS.

In this post, I will be discussing how to set up an Android environment in an existing IntelliJ installation.

IntelliJ set up for Android Development

Perform below steps for setup.

Step 1. Install Java 8 or Java 7 JDK

$ java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

Step 2. Install Android SDK

[user@localhost ~]$ cd /opt
[user@localhost opt]$ sudo wget http://dl.google.com/android/android-sdk_r24.4.1-linux.tgz
[sudo] password for pooja: 
--2017-01-24 22:25:23--  http://dl.google.com/android/android-sdk_r24.4.1-linux.tgz
Resolving dl.google.com (dl.google.com)... 172.217.6.46, 2607:f8b0:4005:805::200e
Connecting to dl.google.com (dl.google.com)|172.217.6.46|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 326412652 (311M) [application/x-tar]
Saving to: ‘android-sdk_r24.4.1-linux.tgz’

100%[============================================================================================================>] 326,412,652  148KB/s   in 29m 58s

2017-01-24 22:55:21 (177 KB/s) - ‘android-sdk_r24.4.1-linux.tgz’ saved [326412652/326412652]

[user@localhost opt]$ sudo tar zxvf android-sdk_r24.4.1-linux.tgz
[user@localhost opt]$ sudo chown -R root:root android-sdk_r24.4.1-linux 
[user@localhost opt]$ sudo ln -s android-sdk_r24.4.1-linux android-sdk-linux 

# If you do not change the ownership, you will get the error "selected directory is not a valid home for android SDK" while setting the Android SDK path in IntelliJ



[user@localhost opt]$ sudo chown -R user:group /opt/android-sdk-linux/

[user@localhost opt]$ sudo vim /etc/profile.d/android-sdk-env.sh

# Add the following lines to the file:
export ANDROID_HOME=/opt/android-sdk-linux
export PATH=$ANDROID_HOME/tools:$ANDROID_HOME/platform-tools:$PATH

[user@localhost opt]$ source /etc/profile.d/android-sdk-env.sh
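After sourcing the profile script, a quick check that the variables and the SDK tools are visible on the PATH (assuming the paths used above):

$ echo $ANDROID_HOME
/opt/android-sdk-linux
$ which android
/opt/android-sdk-linux/tools/android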

Step 3: Open the SDK Manager using the Android SDK tool

[user@localhost opt]$ sudo android-sdk-linux/tools/android


Now, select the 'All Tools' option and press "Install 23 packages". The license screen then opens as shown below.

Finally, select the 'Install' button, which will start the download of the packages.


Step 4: Install IntelliJ (if not exists)

IntelliJ Community Edition is free; download it and untar the file.

Step 5: Open IntelliJ (or close the current project), which will bring up the screen below.
Now, select 'Create New Project' and then select the project type "Android" as shown below.


Now, select the option "Application Module" and click 'Next'.


Now, select the 'New' button. 

Then a browser window will open up. Select /opt/android-sdk-linux and press 'OK'.

Lastly, the Android version popup window will be shown as below.



With this, we have configured the existing IntelliJ installation for an Android development project. Now press the 'Finish' button to create the project.

I hope you are also able to configure your existing IntelliJ for Android development. If you run into any problems, please write back; I would love to hear from you.

Tuesday, January 17, 2017

Debugging Apache Hadoop (NameNode,DataNode,SNN,ResourceManager,NodeManager) using IntelliJ

In the previous blogs, I discussed setting up the environment, downloading the Apache Hadoop code, building it, and setting it up in the IDE (IntelliJ).

In this blog, I will focus on debugging the Apache Hadoop code to understand it. 

I use remote debugging to connect to and debug any of the Hadoop processes (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager).

Prerequisites
1. Apache Hadoop code on the local machine.
2. Code is built (look for hadoop/hadoop-dist being created).
3. The code is set up in IntelliJ.

Let's dive into the steps to understand the debugging process.

Step 1: Look for the hadoop-dist directory in the main Hadoop directory.
Once the Hadoop code is built, the hadoop-dist directory is created in the main Hadoop directory as shown below.

Step 2: Move into the target directory. 
 [pooja@localhost hadoop]$ cd hadoop-dist/target/hadoop-3.0.0-alpha2-SNAPSHOT

The directory structure looks as below (it is the same as the Apache download tar).

Step 3: Now, set up the Hadoop configuration.
a. Change the hadoop-env.sh file to add the JAVA_HOME path

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ vi etc/hadoop/hadoop-env.sh 
Add the below line. 
JAVA_HOME=$JAVA_HOME

b. Add the configuration parameters (Note: I am doing the minimum setup needed to run the Hadoop processes)

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ vi etc/hadoop/core-site.xml
<configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
 </property>
</configuration>

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ vi etc/hadoop/hdfs-site.xml 
<configuration>
 <property>
 <name>dfs.replication</name>
  <value>1</value>
</property>
  <property>
   <name>dfs.name.dir</name>
   <value>file:///home/pooja/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
   <name>dfs.data.dir</name>
     <value>file:///home/pooja/hadoopdata/hdfs/datanode</value>
 </property>
</configuration>

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ vi etc/hadoop/yarn-site.xml
<configuration>
<property>
    <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
 </property>
</configuration>

Place the environment properties in ~/.bashrc
export HADOOP_HOME=<hadoop source code directory>/hadoop/hadoop-dist/target/hadoop-3.0.0-alpha2-SNAPSHOT
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
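After adding these lines, reload the environment in the current shell and confirm the freshly built distribution is the one being picked up (a small sanity check):

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ source ~/.bashrc
[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ which hadoop
[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ hadoop version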

Step 4: Run all Hadoop processes

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [localhost.localdomain]
2017-01-17 20:27:44,335 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ jps
25232 SecondaryNameNode
26337 Jps
24839 DataNode
24489 NameNode
25914 NodeManager
25597 ResourceManager

Step 5: Now stop all the processes.
[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ sbin/stop-yarn.sh
[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ sbin/stop-dfs.sh

Step 6: Debug a Hadoop process (e.g. the NameNode) by making the change below in hadoop-env.sh (or hdfs-env.sh).

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ vi etc/hadoop/hadoop-env.sh  
Add below line.
export HDFS_NAMENODE_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=5000,server=y,suspend=n"

Similarly, we can debug the other daemons by setting the corresponding options:
YARN_RESOURCEMANAGER_OPTS
YARN_NODEMANAGER_OPTS
HDFS_NAMENODE_OPTS
HDFS_DATANODE_OPTS
HDFS_SECONDARYNAMENODE_OPTS 
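Once the NameNode is started again (Step 8 below), you can confirm that the JVM opened the JDWP debug port (a quick check, assuming port 5000 as configured above):

[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ jps -v | grep address=5000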

Step 7: Enable remote debugging in IDE (IntelliJ) as shown below.
Note: Identify the main class for NameNode process by looking in startup script.

Open NameNode.java class ->Run/Debug Configuration (+)->Remote-> Change 'port' to 5000 (textbox) ->Apply button


Step 8: Now start the NameNode process and put a breakpoint in the NameNode.java class as shown below.

Start the process:
[pooja@localhost hadoop-3.0.0-alpha2-SNAPSHOT]$ sbin/start-dfs.sh

Start the debugger(Shift+9):


And now you can debug the code as shown below.

I hope everyone is able to set up the code; if you hit any problems, please do write and I will be happy to help you.
In the next blog, I will be writing about the steps for creating a patch for an Apache Hadoop contribution. 

Happy Coding and Keep Learning !!!!

Importing Apache Hadoop (HDFS, YARN) modules into IntelliJ

In the previous blog, I wrote about the steps to set up the environment and download the Apache Hadoop code to our machine for understanding and contributing. In this blog, I will walk through setting up the code in the IDE (IntelliJ here).

By now, I presume the Apache Hadoop code is on your machine and that the code is compiled. If not, follow the previous blog.

Please follow the steps below to import the HDFS module into IntelliJ.

Step 1: Open IntelliJ (either using the shortcut or idea.sh) and close the project if one is already open, as shown below.


Step 2: In the screen below, choose Import Project as shown.

Step 3: Now browse to the folder you want to import. Select the Hadoop/hadoop-hdfs-project/hadoop-hdfs directory and press 'OK'.


Step 4: The screen below will be shown. Select the option "Import project from external model" and click 'Next'.



Step 5: Now keep pressing Next and then Finish. The project will be imported into IntelliJ as shown below.


Now, the Apache Hadoop HDFS module is imported into IntelliJ. You can import the other modules (YARN, Common) similarly.

I hope all viewers are able to import the Apache Hadoop project successfully into IntelliJ. If you are facing any issues, please discuss them here; I will be happy to assist you all.

In the next tutorial, I will discuss the steps for debugging Hadoop.
   

Thursday, January 12, 2017

Contribute to Apache Hadoop

For a long time, I have wanted to contribute to open source Apache Hadoop. Today I was free, so I worked on setting up the Hadoop code on my local machine for development. I am documenting the steps as they may be useful for newcomers.

Below are the steps to set up the Hadoop code for development

Step 1: Install Java JDK 8 or above

$ java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

Step 2: Install Apache Maven version 3 or later

mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_72, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_72/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-514.2.2.el7.x86_64", arch: "amd64", family: "unix"

Step 3: Install Google Protocol Buffers (version 2.5.0)
Make sure the Protocol Buffers version is 2.5.0.
I had installed the higher Protocol Buffers version 3.1.0, but when compiling the code I got the error below.
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.0.0-alpha2-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.1.0', expected version is '2.5.0' -> [Help 1]
[ERROR] 

Step 4:  Download the hadoop source code

We can either clone the repository directly or create a fork of the repository and then clone that.

a) Directly cloning the repository.
 git clone git://git.apache.org/hadoop.git
b) Create a fork as shown below:


And then download the code as shown below:

 $ git clone https://github.com/poojagpta/hadoop

Sync the forked project with the upstream project's changes:
Add the remote link:
 $git remote add upstream https://github.com/apache/hadoop

 $ git remote -v
origin https://github.com/poojagpta/hadoop (fetch)
origin https://github.com/poojagpta/hadoop (push)
upstream https://github.com/apache/hadoop (fetch)
upstream https://github.com/apache/hadoop (push)

Now, if you want to fetch the latest code:
$ git fetch upstream
$ git checkout trunk

Step 5: Compile the downloaded code
$ cd hadoop
$ mvn clean install -Pdist -Dtar -Ptest-patch -DskipTests -Denforcer.skip=true

Snippet Output:
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ hadoop-client-modules ---
[INFO] Installing /home/pooja/dev/hadoop/hadoop-client-modules/pom.xml to /home/pooja/.m2/repository/org/apache/hadoop/hadoop-client-modules/3.0.0-alpha2-SNAPSHOT/hadoop-client-modules-3.0.0-alpha2-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.780 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  2.560 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.236 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.824 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.314 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.834 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.167 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.918 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 20.083 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  7.650 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:03 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.138 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 13.088 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.138 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [ 54.973 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [01:51 min]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [  1.323 s]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 41.081 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 12.680 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  0.070 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [  0.073 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [ 35.955 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [01:38 min]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [  0.089 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [ 22.489 s]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 32.492 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [  8.606 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 20.153 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [02:26 min]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [ 55.442 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [  5.479 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [ 17.122 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [  8.654 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [  8.234 s]
[INFO] Apache Hadoop YARN Timeline Service HBase tests .... SUCCESS [02:51 min]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [  0.044 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [  8.076 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [  5.937 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [  0.077 s]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [ 11.366 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [  1.832 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [  8.590 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [  0.225 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [ 43.115 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [ 27.865 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [  9.009 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [ 24.415 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [ 14.692 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [ 29.361 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [  4.828 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [ 10.299 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 12.238 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [  4.336 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 17.591 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 13.083 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [  6.314 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [  6.982 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 12.048 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 12.327 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [  5.819 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [  5.794 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [  0.036 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [  8.138 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 53.458 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 20.452 s]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [ 11.273 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [  3.698 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [  1.618 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 12.085 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [ 27.289 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [  5.002 s]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [  7.041 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  0.052 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [02:09 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [01:21 min]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  3.431 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [03:13 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  0.329 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  1.542 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 42.013 s]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [  0.105 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 32:42 min
[INFO] Finished at: 2017-01-12T11:30:58-08:00
[INFO] Final Memory: 131M/819M
[INFO] ------------------------------------------------------------------------


I hope you are also able to set up the Hadoop project and are ready to contribute like me. Please let me know if you are still facing issues; I would love to help you.
In the next tutorial, I will set up the code in IntelliJ and go through the steps to debug the code.

Thanks and happy coding !!!

Problems encountered while compiling the code:

1. Some of the JUnit tests are failing.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.minikdc.TestMiniKdc
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.451 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestMiniKdc
testKeytabGen(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.314 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testMiniKdcStart(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.002 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testKerberosLogin(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.008 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

Running org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.289 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
testKeytabGen(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.18 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testMiniKdcStart(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.014 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testKerberosLogin(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.009 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)


Results :

Tests in error: 
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...

Tests run: 6, Failures: 0, Errors: 6, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  2.060 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.584 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.018 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.161 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.253 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.803 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.047 s]
[INFO] Apache Hadoop MiniKDC .............................. FAILURE [ 11.461 s]
[INFO] Apache Hadoop Auth ................................. SKIPPED
[INFO] Apache Hadoop Auth Examples ........................ SKIPPED
......
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-minikdc: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/pooja/dev/hadoop/hadoop-common-project/hadoop-minikdc/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-minikdc

Solution:
I fixed the problem by skipping the JUnit tests (-DskipTests) for the entire build and running JUnit only for the module whose code I want to start fixing.

2. Got the below error for the module 'Hadoop HDFS'. 

[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.612 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.507 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.033 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.937 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.312 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.670 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  8.432 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.359 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 14.786 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  5.682 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:04 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.310 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 14.775 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.074 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [01:19 min]
[INFO] Apache Hadoop HDFS ................................. FAILURE [  5.697 s]
[INFO] Apache Hadoop HDFS Native Client ................... SKIPPED
.........................................
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:48 min
[INFO] Finished at: 2017-01-11T15:43:27-08:00
[INFO] Final Memory: 90M/533M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hadoop-hdfs: Could not resolve dependencies for project org.apache.hadoop:hadoop-hdfs:jar:3.0.0-alpha2-SNAPSHOT: The following artifacts could not be resolved: org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT, org.apache.hadoop:hadoop-kms:jar:tests:3.0.0-alpha2-SNAPSHOT: Could not find artifact org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT in apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-hdfs

Solution:

The problem suggests an issue with the installation of OpenSSL.
I want to install OpenSSL to be able to use the HTTPS protocol in HDFS, curl, or other applications.
openssl (the binary) is installed, but OpenSSL (the development library required for the HTTPS protocol) is not installed.

You can install openssl using the command below:
$sudo yum install openssl openssl-devel

$ which openssl
/usr/bin/openssl

$ openssl version
OpenSSL 1.0.1e-fips 11 Feb 2013
We can solve the problem using 2 approaches

   a. Create a link to openssl path as shown below

      ln -s /usr/bin/openssl /usr/local/openssl

or
    b. Download OpenSSL and compile it as shown below

$wget https://www.openssl.org/source/openssl-1.0.1e.tar.gz
$tar -xvf openssl-1.0.1e.tar.gz
$cd openssl-1.0.1e
$./config --prefix=/usr/local/openssl --openssldir=/usr/local/openssl
$ make
$ sudo make install

3. Error with enforcer
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (depcheck) on project hadoop-hdfs-client: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Solution:
I skipped the enforcer plugin (-Denforcer.skip=true), as its constraints allow only Unix and Mac machines.