Sunday, September 20, 2015

Install JDK 7 on CentOS

This tutorial will guide you through installing the JDK (Java Development Kit) on CentOS 6.5 (the steps also apply to CentOS 5 and 7). Java is an open-source, platform-independent programming language.

There are many different Java platforms and versions available; for now we will be installing the Java SE (Standard Edition) variants below:

1. OpenJDK7
2. Oracle Java 7

Prerequisites
You are an admin, or a non-admin user with sudo rights.

Uninstall existing JDK versions
You can have multiple versions of the JDK installed on a machine, but only one can be in use at a time.
If you want, you can remove the Java versions already installed on the system using the steps below:

Note: This step is optional.

[code language="text"]
#List all existing jdk
$ rpm -aq | grep -i jdk
java-1.6.0-openjdk-1.6.0.36-1.13.8.1.el6_7.x86_64
java-1.6.0-openjdk-devel-1.6.0.36-1.13.8.1.el6_7.x86_64
java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64

# Now, uninstall the old Java versions as shown below:
sudo yum remove java-1.6.0-openjdk-1.6.0.36-1.13.8.1.el6_7.x86_64
sudo yum remove java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64
[/code]

Install OpenJDK7 JDK

To install OpenJDK7 use the below commands:
Note: The RPM package "java-1.7.0-openjdk" installs only the JRE; the "-devel" package installs the full JDK.

[code language="text"]
$ sudo yum install java-1.7.0-openjdk-devel

Loaded plugins: fastestmirror, refresh-packagekit
Setting up Install Process
Loading mirror speeds from cached hostfile
* base: mirror.supremebytes.com
* epel: mirrors.syringanetworks.net
* extras: mirror.supremebytes.com
* nux-dextop: li.nux.ro
* updates: mirror.supremebytes.com
Resolving Dependencies
--> Running transaction check
---> Package java-1.7.0-openjdk-devel.x86_64 1:1.7.0.85-2.6.1.3.el6_7 will be installed
--> Processing Dependency: java-1.7.0-openjdk = 1:1.7.0.85-2.6.1.3.el6_7 for package: 1:java-1.7.0-openjdk-devel-1.7.0.85-2.6.1.3.el6_7.x86_64
--> Running transaction check
---> Package java-1.7.0-openjdk.x86_64 1:1.7.0.85-2.6.1.3.el6_7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
Package Arch Version Repository
Size
================================================================================
Installing:
java-1.7.0-openjdk-devel x86_64 1:1.7.0.85-2.6.1.3.el6_7 updates 9.4 M
Installing for dependencies:
java-1.7.0-openjdk x86_64 1:1.7.0.85-2.6.1.3.el6_7 updates 26 M

Transaction Summary
================================================================================
Install 2 Package(s)

Total download size: 35 M
Installed size: 126 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64. | 26 MB 00:07
(2/2): java-1.7.0-openjdk-devel-1.7.0.85-2.6.1.3.el6_7.x | 9.4 MB 00:02
--------------------------------------------------------------------------------
Total 3.8 MB/s | 35 MB 00:09
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : 1:java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64 1/2
Installing : 1:java-1.7.0-openjdk-devel-1.7.0.85-2.6.1.3.el6_7.x86_64 2/2
Verifying : 1:java-1.7.0-openjdk-devel-1.7.0.85-2.6.1.3.el6_7.x86_64 1/2
Verifying : 1:java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64 2/2

Installed:
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.85-2.6.1.3.el6_7

Dependency Installed:
java-1.7.0-openjdk.x86_64 1:1.7.0.85-2.6.1.3.el6_7

Complete!
[/code]

Verify the installation using the command:

[code language="text"]
$ java -version
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.3.el6_7-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)
[/code]
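
If you later need the full installation path of the OpenJDK (for example, to set JAVA_HOME), you can resolve it from the java binary. This is just a quick check; the exact path printed varies by system and JDK build:

[code language="text"]
# Prints the resolved path of the java binary, e.g. a path under /usr/lib/jvm/java-1.7.0-openjdk-*
$ readlink -f $(which java)
[/code]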

Install Oracle JDK 7
To install Oracle JDK 7, use the commands below:

[code language="text"]
# Download the Oracle JDK tarball
$ sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u71-b14/jdk-7u71-linux-x64.tar.gz"

# untar the downloaded tarball in /opt/ directory
$ sudo tar xvf jdk-7u71-linux-x64.tar.gz -C /opt/

# change the ownership of the extracted JDK directory
$ sudo chown -R root: /opt/jdk1.7.0_71

# register the JDK tools with the alternatives system
$ sudo alternatives --install /usr/bin/java java /opt/jdk1.7.0_71/bin/java 1
$ sudo alternatives --install /usr/bin/javac javac /opt/jdk1.7.0_71/bin/javac 1
$ sudo alternatives --install /usr/bin/jar jar /opt/jdk1.7.0_71/bin/jar 1
[/code]
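
If both OpenJDK and Oracle JDK end up installed, the alternatives system lets you choose which installation the java command points to. A quick sketch of switching interactively:

[code language="text"]
# Lists the registered java alternatives and prompts for a selection
$ sudo alternatives --config java
[/code]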

Verify the installation using the command:

[code language="text"]
$ java -version

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
[/code]

Configure Environment Variables
Below are the steps to configure the Java environment variables:

[code language="text"]
$ sudo vi /etc/profile.d/java.sh

# Add the lines below to the shell script "java.sh"

# If Oracle JDK 7 is installed
export JAVA_HOME=/opt/jdk1.7.0_71
# use the JAVA_HOME below instead if the OpenJDK 7 devel package is installed
#export JAVA_HOME=/usr/lib/jvm/java-openjdk
export JAVA_PATH=$JAVA_HOME
export PATH=$PATH:$JAVA_HOME/bin
[/code]

Log in to the machine again, or source the profile script to apply the changes:
[code language="text"]
$ source /etc/profile.d/java.sh
$ echo $JAVA_HOME
[/code]


Friday, September 11, 2015

Useful commands for Github

GitHub is a social code hosting platform that provides public and private repositories (projects). It manages repositories shared among multiple developers and can be synced from anywhere. You can perform version control tasks (such as pull, push, commit, revert, and rebase) on a repository either through github.com (the web interface) or through the command-line tool Git, which is popular these days.

You can use the GitHub web interface to track changes in a repository, but if two users change the same file at the same time, the changes are overwritten by whoever checks in last. Git, on the other hand, provides a local working directory with a staging area and project history; the person who pushes last is therefore asked to pull the latest changes before their push is accepted.

We will discuss the steps for creating a public centralized repository on GitHub:

Step 1: Sign up for GitHub

You need to sign up by providing a unique username, email ID, and password.

Step 2: Creating a New Repository Using the Command Line
You can create a new repository using the web interface; in this article we will create one using the command line.

Step 2.1: Installing and Configuring Git

I installed Git using yum on a CentOS machine as shown below. You can download and install Git according to your system specification.
[code language="text"]
$ sudo yum install git
[/code]
In case you get an error like "Unable to find remote helper for 'https'", follow the steps below to build Git from source instead:
[code language="text"]
# Install the build requirements for Git
$ sudo yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
$ sudo yum install gcc perl-ExtUtils-MakeMaker
# Remove the existing Git
$ sudo yum remove git
# Download and extract the Git source code
$ sudo wget https://www.kernel.org/pub/software/scm/git/git-2.0.4.tar.gz
$ sudo tar xzf git-2.0.4.tar.gz -C /opt/
# Build and install from the extracted source directory
$ cd /opt/git-2.0.4
$ sudo make prefix=/usr/local/git all
$ sudo make prefix=/usr/local/git install
# Create a git.sh file in /etc/profile.d
$ sudo vi /etc/profile.d/git.sh
# Add the line below to git.sh, then save and exit
export PATH=$PATH:/usr/local/git/bin
# Apply the change
$ source /etc/profile.d/git.sh
[/code]

Now, configure Git using the commands below (use the same user.name and user.email that you specified during GitHub signup):
[code language="text"]
$ git config --global user.name "Your Name Here"
$ git config --global user.email "your_email@example.com"
[/code]
The configuration is stored in the .gitconfig file in your home directory; you can view it by printing that file or by running the list command, as shown below.
[code language="text"]
$ cat ~/.gitconfig
$ git config --list
[/code]

Step 2.2 Generate an SSH key
GitHub supports several protocols for accessing a repository: git (not secure, and push is not possible), https (secure, but requires passwords/tokens), and ssh (secure, but requires SSH key setup).
You can set up SSH (public key authentication) using the commands below. Refer to the GitHub documentation for details.
[code language="text"]
#List all existing ssh keys
$ ls -al ~/.ssh
#Generate new ssh key
$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/userName/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/userName/.ssh/id_rsa.
Your public key has been saved in /home/userName/.ssh/id_rsa.pub.
The key fingerprint is:
08:88:55:66:95:bd:bf:88:0e:dd:dd:46:77:f6:b9:b6 your_email@example.com
The key's randomart image is:
+--[ RSA 4096]----+
| |
| o |
| . . o . |
| + + o . |
| . + S . |
| . . . . |
| . . . o .|
| . o oooo|
| .o.++E=o|
+-----------------+
$ eval "$(ssh-agent -s)"

Output:
Agent pid 4648

$ ssh-add ~/.ssh/id_rsa
Identity added: /home/userName/.ssh/id_rsa (/home/userName/.ssh/id_rsa)
[/code]

Now, add the SSH public key to your GitHub account at https://github.com/settings/ssh as shown below.

Screenshot

You can test the SSH configuration:
[code language="text"]
$ ssh -T git@github.com
Output:
Hi userName! You've successfully authenticated, but GitHub does not provide shell access.
[/code]

Step 2.3 Adding New Repository

Create a New Repository on GitHub. Refer https://help.github.com/articles/create-a-repo/

Using the commands below, we initialize the project locally and push it to the new GitHub repository.
[code language="text"]
$ mkdir GitDemo && cd GitDemo
$ echo "# GitDemo" >> README.txt
$ git init
$ git add README.txt
$ git commit -m "first commit"
$ git remote add origin git@github.com:username/GitDemo.git
$ git push -u origin master
[/code]

Or, to push an existing repository from the command line:

[code language="text"]
$ git remote add origin git@github.com:username/GitDemo.git
$ git push -u origin master
[/code]

Or, you can create an empty repository on GitHub via the API using the command below:

[code language="text"]
$ curl -u 'xxxx' https://api.github.com/user/repos -d "{\"name\":\"GitDemo\"}"
Output:
Enter host password for user 'user.name':
{
"id": 41167919,
"name": "GitDemo",
"full_name": "user.name/GitDemo",
"owner": {
"login": "user.name",
"id": 33333,
"avatar_url": "https://avatars.githubusercontent.com/u/3756379?v=3",
"gravatar_id": "",
"url": "https://api.github.com/users/xxxx",
[...snipp----]
"default_branch": "master",
"permissions": {
"admin": true,
"push": true,
"pull": true
},
[..snipp...]
[/code]

Now, we will discuss the commands for common change-tracking tasks on the repository.

Step 1.1 Fetching a local working directory from GitHub
Using the clone command, you can create a working copy of the repository (a snapshot of the master branch at a particular time) which you can use for development.
[code language="text"]
# clone an existing repository; the working copy points to the master branch
$ git clone git@github.com:userName/GitDemo.git
# Navigate to your working directory
$ cd GitDemo
# Now, you can checkout to branch you need to work on
$ git checkout branch1
[/code]

Step 1.2 See changes between commits and the working tree
Using the diff command, you can see the changes made in the working tree relative to the last commit, or the changes between two commits.
[code language="text"]
# Show difference between working directory and GitHub files
$ git diff

Output:
diff --git a/README.txt b/README.txt
index 0e8a4bf..1a117eb 100644
--- a/README.txt
+++ b/README.txt
@@ -1 +1,2 @@
Customer portfolio will show now its age amd name.
+I am now in testing phase
diff --git a/src/java/Customer.java b/src/java/Customer.java
index 93f7581..5d50a75 100644
--- a/src/java/Customer.java
+++ b/src/java/Customer.java
@@ -1,2 +1,10 @@
+public class Customer{
+String custName="";
+String age;
+public void setCustName(String custName){
+ this.custName=custName;
+}
+}
[/code]

Step 1.3 Find out the changes made in the working directory (staging area)

After you make changes in the working directory, you can split and group related files together using the staging area. Using the status command, you can view the state of the working directory and the staging area; it shows the modified/untracked files as in the output below:

[code language="text"]
$ git status
Output:
# On branch master
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: README.txt
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# src/
no changes added to commit (use "git add" and/or "git commit -a")
[/code]
Note: Files that you want to ignore (such as .exe, .class, etc.) should be recorded in a .gitignore file in the root of the working directory. They will then no longer appear in the status output either; see the example below.
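
For example, a minimal .gitignore might look like the sketch below (the patterns are only illustrative; list whatever build artifacts your project produces):

[code language="text"]
# .gitignore (illustrative patterns)
*.class
*.exe
build/
target/
[/code]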

Step 1.4 Commit code to the local Git project history.

Before you commit your changes, you group or split related files together.

1.4(a) Using the add command, you tell Git to include the current changes to a file in the next commit. Note: No changes are permanently recorded by this command alone.
[code language="text"]
$ git add <file_name>
[/code]

After you are satisfied with the staged snapshot (check it with the git status command), you commit it.

1.4(b) Using the commit command, you tell Git to record the staged changes in the local project history:
[code language="text"]
$ git commit -m "comments"
[/code]

Step 1.5 Push changes to the remote repository (GitHub)
Once the changes are committed in the local repository, you can push them to GitHub (to the master branch or to a feature branch):
[code language="text"]
# Push the code changes directly to the master branch (not advisable)
$ git push origin master
#Push code changes to branch name "branch1"
$ git push origin branch1
[/code]

Step 1.6 Viewing old commits
You can view all previous commits made in a branch or in master.

[code language="text"]
$ git log --oneline

Output:
b0a6b8f Revert "Change to revert"
17ebb40 Change to revert
2739150 Customer.java changes
3c46b93 Merge pull request #1 from poojagpta/branch2
7bc8863 Add new file
090c318 first commit
[/code]

Step 1.7 Revert a commit pushed to the remote repository (GitHub)
There are many ways to correct a mistake; you can simply change the files, commit them, and push again. But if a whole bunch of bad files were committed, you can use the following commands.

a) Revert command: Use this command to revert a specific commit by its commit ID (use Step 1.6 to find the commit ID) and then push to the remote as shown below.
Note: The revert itself is also logged in the history, so it can be viewed with the log command shown in Step 1.6.

[code language="text"]
$ git revert 17ebb40

Output: Git will open the commit message editor showing the files to be reverted, as shown:
Revert "Change to revert"

This reverts commit 17ebb40a5c569610c37a10c412e6e1bcdbef0e71.

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch branch1
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
modified: NewFile
modified: README.txt
deleted: branchFiletoRevert
# modified: src/java/Customer.java

After you review the files to be reverted, save and exit. It will show a message like:
Finished one revert.
[branch1 b0a6b8f] Revert "Change to revert"
4 files changed, 1 insertions(+), 11 deletions(-)
delete mode 100644 branchFiletoRevert
[/code]

Now, we need to push the revert commit to the remote repository on GitHub:
[code language="text"]
$ git push origin branch1

Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 968 bytes, done.
Total 7 (delta 0), reused 0 (delta 0)
To git@github.com:poojagpta/GitDemo.git
17ebb40..b0a6b8f branch1 -> branch1
[/code]

b) Reset command: Using the reset command, you can roll back to a previous commit as shown below, but it rewrites the project history. Note: Don't use it on master or other shared remote branches; prefer to use it on local branches only.
[code language="text"]
# Move the local project history to one commit before
$ git reset HEAD^
$ git push origin branch1 -f

Output:
Total 0 (delta 0), reused 0 (delta 0)
To git@github.com:poojagpta/GitDemo.git
+ bb19386...7c0e42c branch1 -> branch1 (forced update)

# If you want to unstage the files you grouped before a commit, run reset with no arguments
$ git reset
[/code]

Branching and Merging

Creating a new branch
New branches can be created using the commands below:
[code language="text"]
# create a local branch
$ git checkout -b dev
Output:
Switched to a new branch 'dev'

#Local commit to project history (skip staging area by using -a option)
$ git commit -a -m "Fist Commit in dev"
#Push changes to remote repository
$ git push origin dev

Output:
Counting objects: 12, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 1.02 KiB, done.
Total 9 (delta 0), reused 0 (delta 0)
To git@github.com:poojagpta/GitDemo.git
* [new branch] dev -> dev
[/code]

Listing all branches in Git:
[code language="text"]
$ git branch -a
[/code]

Deleting branches

[code language="text"]
# delete the remote-tracking branch locally (to delete the branch on GitHub itself, use: git push origin --delete dev)
$ git branch -rd origin/dev
# delete the local branch
$ git branch -D dev
[/code]

Create tags in a branch
Tagging a branch for release:
[code language="text"]
$ git tag 2.0-xx3 -m "Release candidate 2.0-xx3"
# push the tags to the remote repo
$ git push origin --tags
[/code]

Rebase a branch
Suppose we need to rebase our dev branch on top of the master branch.

[code language="text"]
$ git remote -v
origin git@github.com:poojagpta/GitDemo.git (fetch)
origin git@github.com:poojagpta/GitDemo.git (push)

#fetching the latest code from the master branch
$ git fetch origin

#Rebase dev branch with master changes
$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: Fist Commit in dev
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...
Auto-merging README.txt
CONFLICT (content): Merge conflict in README.txt
Failed to merge in the changes.
Patch failed at 0001 Fist Commit in dev

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".

# Resolve the conflicts with a merge tool; I used meld to view the differences
$ git mergetool -t meld

# Then continue the rebase
$ git rebase --continue

#finally push changes to github
$ git push origin dev -f
[/code]

Fork a Repository
A fork is a copy of a repository in which you can experiment without affecting the parent repo. In many open source projects you don't have push rights, so you can contribute by forking the project into your own account, making changes there, and then sending a pull request to be reviewed and merged.

Forking
GitHub provides a good tutorial on how to fork a repo. In summary, there is a 'Fork' button at the top right of the repository you want to fork, as shown below.
forkButton

After pressing the Fork button, the forked repository appears in your own GitHub account.
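
Once you have forked a repository, a common follow-up is keeping your fork in sync with the original project. A minimal sketch, assuming the hypothetical parent repository git@github.com:parentOwner/GitDemo.git:

[code language="text"]
# Add the parent repository as a remote named "upstream" (URL is hypothetical)
$ git remote add upstream git@github.com:parentOwner/GitDemo.git
# Fetch the latest changes from the parent repository
$ git fetch upstream
# Merge the parent's master into your local master, then push to your fork
$ git checkout master
$ git merge upstream/master
$ git push origin master
[/code]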

Wednesday, September 9, 2015

Installing Hadoop2.7 on CentOS 7 (Single Node Cluster)

Hadoop is an open-source framework written in Java for complex, high-volume computation. Today's industry data grows along the 3 Vs (Volume, Velocity, and Variety), which makes it difficult to analyze and interpret. Hadoop's distributed, highly fault-tolerant filesystem (HDFS) is a solution for this 3V data growth, and MapReduce is the programming platform for analyzing data stored in HDFS.

Today, we will discuss the steps to get a simple Hadoop installation up and running on a CentOS server machine.

Step 1: Installing Java
Hadoop requires Java 1.6 or higher. Check whether Java already exists and, if not, install it using the command below.

[pooja@localhost ~]$ sudo yum install java-1.7.0-openjdk
Output
......
Dependency Installed:
giflib.x86_64 0:4.1.6-3.1.el6
jpackage-utils.noarch 0:1.7.5-3.14.el6
pcsc-lite-libs.x86_64 0:1.5.2-15.el6
ttmkfdir.x86_64 0:3.0.9-32.1.el6
tzdata-java.noarch 0:2015f-1.el6
xorg-x11-fonts-Type1.noarch 0:7.2-11.el6

Complete!

[root@localhost ~]$ java -version
Output:
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.3.el6_7-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)


Step 2: Create a dedicated Hadoop user
We recommend creating a dedicated (non-root) user for the Hadoop installation.

[pooja@localhost ~]$ sudo groupadd hadoop
[pooja@localhost ~]$ sudo useradd --groups hadoop hduser
[pooja@localhost ~]$ sudo passwd hduser
[pooja@localhost ~]$ su - hduser

Hadoop requires SSH to manage its nodes; therefore, even for a single node, we need to set up public key authentication to the local machine.

[hduser@localhost ~]$ ssh-keygen -t rsa -P ""
Output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
87:21:a4:91:1e:f7:01:0b:9a:e3:a3:8a:76:8b:ab:6f hduser@localhost.localdomain
[....snipp...]

[hduser@localhost ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hduser@localhost ~]$ chmod 0600 ~/.ssh/authorized_keys

If you are still facing issues, refer to "Troubleshooting: SSH Setup" at the end of this post; otherwise, continue.
Now, verify that SSH is set up properly. The command below should not ask for a password, but the first time it will prompt you to add the host's RSA key to the known hosts list.

[hduser@localhost ~]$ ssh localhost
Output (first time only): The authenticity of host 'localhost (::1)' can't be established. RSA key fingerprint is e4:37:82:a0:68:e9:ee:1f:0f:22:2e:35:63:94:38:d3. Are you sure you want to continue connecting (yes/no)? yes 
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Step 3: Download Hadoop 2.7.0
Download Hadoop from the Apache download mirrors or using the command below.

[hduser@localhost ~]$ wget http://apache.claz.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
Output: --2016-12-16 21:51:51-- http://apache.claz.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

Step 4: Untar the Hadoop file and create a soft link

[hduser@localhost ~]$ tar -xvf hadoop-2.7.0.tar.gz

[hduser@localhost ~]$ ln -s hadoop-2.7.0 hadoop

Step 5: Configure Hadoop in Pseudo-Distributed Mode

5.1 Set Up Environment Variables

Edit ~/.bashrc and add the lines below. If you are using any other shell, update the appropriate configuration file.

export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes in the currently running environment:

[hduser@localhost ~]$ source ~/.bashrc

5.2 Configuration Changes

Edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable.
Change

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/usr/lib/jvm/jre-openjdk

Hadoop has many configuration files that need to be customized according to our setup. We will configure a basic single-node Hadoop setup for this article.
Navigate to the configuration path and then edit the Hadoop configuration files:

[hduser@localhost ~]$ cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
   <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
     </property>
</configuration>

Edit hdfs-site.xml

<configuration>
   <property>
       <name>dfs.replication</name>
       <value>1</value>
   </property>
   <property>
       <name>dfs.name.dir</name>
        <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
    </property>
    <property>
       <name>dfs.data.dir</name>
      <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
    </property>
</configuration>

Edit yarn-site.xml

<configuration>
  <property>
      <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
    </property>
</configuration>

Step 6: Format the HDFS filesystem via the NameNode
Now format HDFS using the command below and make sure the HDFS directory is created (the directory specified in the "dfs.data.dir" property of hdfs-site.xml).

[hduser@localhost ~]$ hdfs namenode -format

Output:
15/09/08 22:44:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.0
[...snipp...]
15/09/08 22:44:44 INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.
15/09/08 22:44:44 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/08 22:44:44 INFO util.ExitUtil: Exiting with status 0
15/09/08 22:44:44 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/


Step 7: Start the single-node Hadoop cluster
Let's start the Hadoop cluster using the scripts provided by Hadoop.
Start HDFS:

[hduser@localhost ~]$ $HADOOP_HOME/sbin/start-dfs.sh

Output:
15/09/08 22:54:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is e4:37:82:a0:68:e9:ee:1f:0f:22:2e:35:63:94:38:d3.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-localhost.localdomain.out
15/09/08 22:55:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Start YARN:

[hduser@localhost ~]$ $HADOOP_HOME/sbin/start-yarn.sh

Output:
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-localhost.localdomain.out
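
Optionally, you can confirm that the daemons came up by listing the Java processes with the JDK's jps tool; once both scripts have run you should see entries for NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:

[hduser@localhost ~]$ jps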


Step 8: Hadoop Web interface
Web UI of the NameNode daemon (http://localhost:50070/)
namenode
Web UI of Secondary NameNode (http://localhost:50090/)
Secondary Name Node
Web UI of cluster information (http://localhost:8088)
Hadoop Cluster UI

Step 9: Test the Hadoop setup
9.1 Create a sample data file on the local machine (or download some data from the internet) into a local directory, e.g. localdata.
9.2 Copy the data files from the local machine to HDFS using the commands below.

[hduser@localhost ~]$ hdfs dfs -mkdir /user
15/09/08 23:39:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[hduser@localhost ~]$ hdfs dfs -put localdata/* /user
15/09/08 23:41:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

9.3 Run the existing MapReduce word count job (present in $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar) using the command below:
Note: The HDFS input directory is /user and the HDFS output directory is /user/output.

[hduser@localhost ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user /user/output

Output:
15/09/08 23:49:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/08 23:49:55 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/08 23:49:55 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/08 23:49:56 INFO input.FileInputFormat: Total input paths to process : 1
15/09/08 23:49:56 INFO mapreduce.JobSubmitter: number of splits:1
15/09/08 23:49:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local875797856_0001
15/09/08 23:49:56 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/08 23:49:56 INFO mapreduce.Job: Running job: job_local875797856_0001
15/09/08 23:49:56 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/08 23:49:56 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
[...snipp...]

9.4 Verify the result in the HDFS directory /user/output.

[hduser@localhost ~]$ hdfs dfs -ls /user/output

Output:
15/09/08 23:54:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2015-09-08 23:49 /user/output/_SUCCESS
-rw-r--r-- 1 hduser supergroup 132 2015-09-08 23:49 /user/output/part-r-00000

[hduser@localhost ~]$ hdfs dfs -cat /user/output/part-r-00000

Output:
15/09/08 23:55:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
This 2
Will 1
[...snipp...]


Step 10: Stop the running Hadoop cluster using the commands below:

[hduser@localhost ~]$ cd $HADOOP_HOME/sbin/
[hduser@localhost ~]$ ./stop-yarn.sh

Output:
stopping yarn daemons
stopping resourcemanager
hduser@localhost's password:
localhost: stopping nodemanager
no proxyserver to stop

[hduser@localhost ~]$ ./stop-dfs.sh

Output:
15/09/08 23:59:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]

I hope everyone was able to set up the Hadoop cluster successfully. Please feel free to leave comments if you face any issues.

Happy Coding!!!!

Troubleshooting: SSH Setup

I found 2 errors while setting up SSH.

1. Service sshd doesn't exist: "there is no such file or directory"

In this case SSH is not installed on your machine; install it using the command below.


[root@localhost ~] $ yum -y install openssh-server openssh-clients


2. ssh: connect to host localhost port 22: Connection refused

When I completed the SSH setup and typed the command 'ssh localhost', the above error popped up.

To resolve it, I stopped and started the sshd service again using the commands below.


[hduser@localhost ~]$ /bin/systemctl stop sshd.service
[hduser@localhost ~]$ /bin/systemctl start sshd.service
[hduser@localhost ~]$ /bin/systemctl status sshd.service
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2016-12-16 12:41:44 PST; 16s ago
Docs: man:sshd(8)
man:sshd_config(5)
Main PID: 6192 (sshd)
CGroup: /system.slice/sshd.service
└─6192 /usr/sbin/sshd -D

Monday, September 7, 2015

Docker in simple terms

Docker is open source software built on the concept of OS-level virtualization. Using it, minimal, portable operating system environments can be run inside its containers. For example, on a server we can install the Docker software and then run Ubuntu in one or two containers and CentOS in other containers. Docker's lightweight, isolated containers can be easily managed and can be stopped, started, killed, or restarted at any point in time. Internally, the containers share the same Linux kernel and use namespaces for secure isolation of resources, which makes them a better option than hardware virtualization, where we have to specify the resource allocation (such as hard disk size and RAM usage) for each VM at creation time.

A Docker container is created from a read-only template called a Docker image. For example, an image can contain the Ubuntu operating system with Apache Tomcat installed. You can either download an existing image or build your own. Public and private Docker images are stored in a Docker registry called Docker Hub.

Let's briefly go through the simple steps to set up Docker on a server machine:

a. Download Docker

On Windows 7, I installed Boot2Docker (https://github.com/boot2docker/windows-installer/releases -> docker-install.exe) and ran the exe; it installs the Docker client and VirtualBox. You then run the Docker client tool.

You can even go through the steps mentioned at http://docs.docker.com/installation/windows/.

On a CentOS machine, follow the steps below:

1. Make sure the yum packages are up to date
$ sudo yum update

2. Run the installation script
$ curl -sSL https://get.docker.com/ | sh

Output:
[sudo] password for xxx:
+ sudo -E sh -c 'sleep 3; yum -y -q install docker-engine'
warning: rpmts_HdrFromFdno: Header V4 RSA/SHA1 Signature, key ID 2c52609d: NOKEY
Importing GPG key 0x2C52609D:
Userid: "Docker Release Tool (releasedocker) <docker@docker.com>"
From : https://yum.dockerproject.org/gpg

Remember that you will have to log out and back in for this to take effect!

3. Start the docker service
$ sudo service docker start

4. Test whether Docker installed successfully:
In the Docker client tool or a bash terminal, run the command below:
$ sudo docker run hello-world

Output:
Unable to find image 'hello-world:latest' locally
latest: Pulling from hello-world

535020c3e8ad: Pull complete
af340544ed62: Pull complete
Digest: sha256:a68868bfe696c00866942e8f5ca39e3e31b79c1e50feaee4ce5e28df2f051d5c
Status: Downloaded newer image for hello-world:latest

Hello from Docker.
This message shows that your installation appears to be working correctly.

This command downloads the "hello-world" image from the registry (Docker Hub), creates a container from it, runs it, and produces the "Hello from Docker." output.

b. Managing Docker images
A Docker image can be pulled from the registry (Docker Hub) or built using a Dockerfile (shown in step e below).

Pulling the ubuntu image from the registry:
$ sudo docker pull ubuntu

Output:
latest: Pulling from ubuntu
d3a1f33e8a5a: Pull complete
c22013c84729: Pull complete
d74508fb6632: Pull complete
91e54dfb1179: Already exists
ubuntu:latest: The image you are pulling has been verified. Important: image verification is a tech preview feature and should not be relied on to provide security.
Digest: sha256:fde8a8814702c18bb1f39b3bd91a2f82a8e428b1b4e39d1963c5d14418da8fba
Status: Downloaded newer image for ubuntu:latest

List all the downloaded docker images on server:
$ sudo docker images

Output:
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
ubuntu latest 91e54dfb1179 2 weeks ago 188.3 MB
hello-world latest af340544ed62 4 weeks ago 960 MB

Delete a Docker image from the machine:
Delete the ubuntu Docker image whose image ID is 91e54dfb1179:

$ sudo docker rmi 91e54dfb1179

Output:
Untagged: ubuntu:latest
Deleted: 91e54dfb11794fad694460162bf0cb0a4fa710cfa3f60979c177d920813e267c
Deleted: d74508fb6632491cea586a1fd7d748dfc5274cd6fdfedee309ecdcbc2bf5cb82
Deleted: c22013c8472965aa5b62559f2b540cd440716ef149756e7b958a1b2aba421e87
Deleted: d3a1f33e8a5a513092f01bb7eb1c2abf4d711e5105390a3fe1ae2248cfde1391

Delete all Docker images from the machine:
If we need to remove all downloaded Docker images from the machine:

$ sudo docker rmi $(sudo docker images -q)

Output:
Untagged: ubuntu:latest
Deleted: 91e54dfb11794fad694460162bf0cb0a4fa710cfa3f60979c177d920813e267c
Deleted: d74508fb6632491cea586a1fd7d748dfc5274cd6fdfedee309ecdcbc2bf5cb82
Deleted: c22013c8472965aa5b62559f2b540cd440716ef149756e7b958a1b2aba421e87
Deleted: d3a1f33e8a5a513092f01bb7eb1c2abf4d711e5105390a3fe1ae2248cfde1391
Untagged: hello-world:latest
Deleted: af340544ed62de0680f441c71fa1a80cb084678fed42bae393e543faea3a572c
Deleted: 535020c3e8add9d6bb06e5ac15a261e73d9b213d62fb2c14d752b8e189b2b912

c. Creating a Docker container
A Docker container can be run from an image, or an existing (stopped) container can be started again.

Below, we start a Docker container from the ubuntu image (you can also use the image ID) with an interactive bash shell (-i) and a pseudo-TTY (-t) connected to the container's stdin. Note: in the command below, /bin/bash tells Docker which command to run inside the container.

$ sudo docker run -it ubuntu /bin/bash
Output:
root@0ba57ee1a68a:/# uname -a
Linux 0ba57ee1a68a 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Once we stop a container, it can be restarted, but the command the container runs cannot be modified. Note: 0ba57ee1a68a is the container ID.

$ sudo docker ps -a
Output:
[sudo] password for xxx:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ba57ee1a68a 91e54dfb1179 "/bin/bash" 10 minutes ago Exited (0) 15 seconds ago sad_cori

$ sudo docker restart 0ba57ee1a68a

0ba57ee1a68a

$ sudo docker attach 0ba57ee1a68a
Output:
root@0ba57ee1a68a:/# uname -a
Linux 0ba57ee1a68a 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

d. Managing Docker containers
Docker containers can be listed, stopped, removed, or killed.

List all Docker containers:
$ sudo docker ps -a
Output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ba57ee1a68a 91e54dfb1179 "/bin/bash" 24 minutes ago Exited (1) 15 seconds ago

Stop a running Docker container:

$ sudo docker stop 0ba57ee1a68a
Output:
0ba57ee1a68a

Remove the Docker Container:
$ sudo docker rm 0ba57ee1a68a
Output:
0ba57ee1a68a

Remove all Docker Containers:
$ sudo docker rm $(sudo docker ps -a -q)
Output:
3d8ae1bb867e

e. Building Docker images (from a Dockerfile)
Docker can build an image by reading the instructions in a Dockerfile (a text file that contains all the commands needed to build an image).
I created a simple Dockerfile with instructions to install Java 8 on an Ubuntu base image and then boot into a bash shell.
The Dockerfile is shown below:

FROM ubuntu
MAINTAINER Pooja Gupta <pooja.gupta@jbksoft.com>

# setup Java
# refresh the package index, then install wget (the update is needed on a fresh base image)
RUN RUNLEVEL=1 DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y wget
RUN mkdir /opt/java
RUN wget -O /opt/java/jdk-8u25-linux-x64.tar.gz --no-cookies --no-check-certificate --header \
"Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" \
"http://download.oracle.com/otn-pub/java/jdk/8u25-b17/jdk-8u25-linux-x64.tar.gz"

# change dir to Java installation dir

WORKDIR /opt/java/

RUN tar -zxf jdk-8u25-linux-x64.tar.gz

# setup environment variables

RUN update-alternatives --install /usr/bin/javac javac /opt/java/jdk1.8.0_25/bin/javac 100

RUN update-alternatives --install /usr/bin/java java /opt/java/jdk1.8.0_25/bin/java 100

RUN update-alternatives --display java

RUN java -version

# Expose the ports we're interested in
EXPOSE 8080 9990

# Set the default command to run on boot
CMD ["/bin/bash"]

Now, we can build the image and then run a container from that image. Note: ubuntu-with-java8 is the repository name of the image created, and the image ID is b37cc178d4c7 for me.

$ sudo docker build -t ubuntu-with-java8 .

Output:
.....
Removing intermediate container 182805e359f2
Step 11 : EXPOSE 8080 9990
---> Running in e4c2eebbb475
---> 346fd2a5f340
Removing intermediate container e4c2eebbb475
Step 12 : CMD /bin/bash
---> Running in 8372b7ae20e7
---> b37cc178d4c7
Removing intermediate container 8372b7ae20e7
Successfully built b37cc178d4c7

$ sudo docker run -it ubuntu-with-java8
Output:
root@66a414638a22:/opt/java#
root@66a414638a22:/opt/java# java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
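
If you want to share the image you just built through Docker Hub, you can tag it under your account and push it. This is only a sketch of the optional publishing step; 'yourusername' is a placeholder for your own Docker Hub account name:

$ sudo docker tag ubuntu-with-java8 yourusername/ubuntu-with-java8:latest
$ sudo docker login
$ sudo docker push yourusername/ubuntu-with-java8:latest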

Monday, December 15, 2014

Oozie Coordinator based on Data Availability

The Apache Oozie framework (a Java web application) is used for scheduling Hadoop MR, Pig, Hive, and HBase jobs. The tasks or jobs are referred to as actions. DAGs of these actions are defined as workflows in XML format.

The Oozie jobs can be divided into two types:

  1. Workflow Jobs - These jobs specify the sequence of actions to be executed using a DAG. They consist of workflow.xml, workflow.properties, and the code for the actions to be executed. A bundle of the workflow.xml and the code (as a jar) is created.

  2. Coordinator Jobs - These jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability. They have an additional coordinator.xml as part of the bundle pushed to Oozie.

The Oozie bundle needs to be copied to HDFS. Below is the structure of the bundle:


The content of the Workflow.xml is as below:


The code for oozie java action:


The workflow.properties file:


The code jar and workflow.xml are copied to HDFS:

hadoop fs -rm -r /apps/${JOB_NAME}
hadoop fs -mkdir /apps/${JOB_NAME}
hadoop fs -copyFromLocal ${TARGET_DIR}/${JOB_NAME} /apps/

# aagarwal-mbpro:OozieSample ashok.agarwal$ hadoop fs -ls /apps/OozieSample/lib/
# Found 1 items
# -rw-r--r-- 1 ashok.agarwal supergroup 8038 2014-09-11 14:22 /apps/OozieSample/lib/OozieSample.jar
# aagarwal-mbpro:OozieSample ashok.agarwal$ hadoop fs -ls /apps/OozieSample/
# Found 2 items
# drwxr-xr-x - ashok.agarwal supergroup 0 2014-09-11 14:22 /apps/OozieSample/lib
# -rw-r--r-- 1 ashok.agarwal supergroup 794 2014-08-07 12:54 /apps/OozieSample/workflow.xml
# aagarwal-mbpro:OozieSample ashok.agarwal$

oozie job -oozie http://aagarwal-mbpro.local:11000/oozie -config /apps/OozieSample/workflow.properties -run

So we have deployed the workflow job.

We can make this job recurrent by adding a coordinator.xml.


Copy this coordinator.xml to HDFS:

hadoop fs -copyFromLocal coordinator.xml /apps/OozieSample/

The workflow.properties file will not work in this case, so for the coordinator we create a coordinator.properties file.


Now push the job again using the command below:

oozie job -oozie http://aagarwal-mbpro.local:11000/oozie -config ${TARGET_DIR}/coordinator.properties -run

In order for the coordinator to trigger on data availability, the coordinator.xml is updated as below:


Copy the updated coordinator.xml to HDFS and push the job to Oozie.

This job will wait until it finds the data, i.e., the _SUCCESS signal file (an empty file) at ${appPath}/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}.

So create the file and copy it to that path:

touch _SUCCESS

hadoop fs -copyFromLocal _SUCCESS ${appPath}/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}/

Check the Oozie workflow from the UI; it will start execution as soon as the file is created at the path the coordinator is watching.
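
You can also check the coordinator's progress from the command line instead of the UI. This is just a sketch; replace <coord-job-id> with the job ID printed when the coordinator was submitted:

oozie job -oozie http://aagarwal-mbpro.local:11000/oozie -info <coord-job-id>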

Saturday, November 22, 2014

Data Mining using Java

Overview

Data mining is the process of extracting useful insights from large volumes of data by identifying patterns and establishing relationships among them. It is a very general term and is used to solve many business challenges, such as determining the best selling price of a commodity, finding suppliers of a commodity, discovering customers' purchase patterns, understanding browsing trends, or generating recommendations for customers.

Mostly, the data mining process is performed on:

  1. Data warehouses/databases (transactional/sales/purchase data).

  2. Clickstream/internet data (scraping, extraction) and logs.


In this tutorial, I will focus on a use case of extracting useful information from the web.

Web Mining

This is a useful process in which data mining algorithms are applied to large volumes of data captured from the World Wide Web.

Use Case

Recently, I worked on a Java-based project where we were scraping data and then mining useful information from it. In Java there are numerous tools to scrape a web page, parse it, and extract information from it. Below are some of the approaches we tried while solving our business scenario.

DOM Parsing

We initially created an XML-based configuration file for each web page. The scraped data is parsed into a DOM tree, which is then matched against the configuration file to extract useful information. This approach is very CPU and memory intensive. Also, for each new web page we need to configure the XML, which turns out to be a pain.

Pattern Parsing

We gradually moved to a better solution. In this approach, we first extract the text patterns surrounding the product supply information from each web page (and store them in Elasticsearch), since we know the product availability status at that point in time, using the code below:

[code language="java"]
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static String getPatterns(String productSupply, String htmlSource, int RANGE) {

    if (htmlSource == null) {
        return "";
    }
    String match_patterns = "";

    int index_price = -1;
    Pattern pattern = Pattern.compile(productSupply);
    Matcher matcher = pattern.matcher(htmlSource);

    // For every occurrence of the supply text, capture RANGE characters of
    // surrounding context and append it, delimited by "^^^".
    while (matcher.find()) {
        index_price = matcher.start();
        int beginIndex = index_price - RANGE > 0 ? index_price - RANGE : 0;
        int endIndex = index_price + RANGE < htmlSource.length() ? index_price + RANGE : htmlSource.length();
        match_patterns = match_patterns + htmlSource.substring(beginIndex, endIndex) + "^^^";
    }

    return match_patterns;
}
[/code]

Then we match each extracted pattern against the newly scraped page using Java regex and find the product availability status using the code below:

[code language="java"]
public static String[] getProductSupplyMatches(String productPatterns, String htmlSource, int RANGE) {

    if (htmlSource == null) {
        return null;
    }

    int count = 0;

    // split() expects a regex, so the "^^^" delimiter must be quoted to be treated literally
    String[] patternsMatch = productPatterns.split(Pattern.quote("^^^"));
    String[] productSupply = new String[patternsMatch.length];

    for (String patternMatch : patternsMatch) {
        Pattern pattern = Pattern.compile(patternMatch);
        Matcher matcher = pattern.matcher(htmlSource);
        int index_price = -1;
        while (matcher.find()) {
            index_price = matcher.start();
            // capture RANGE characters of context around the match, clamped to the page bounds
            int beginIndex = index_price - RANGE > 0 ? index_price - RANGE : 0;
            int endIndex = index_price + RANGE < htmlSource.length() ? index_price + RANGE : htmlSource.length();
            productSupply[count] = htmlSource.substring(beginIndex, endIndex);
        }
        count++;
    }

    return productSupply;
}
[/code]

These are a few of the approaches we adopted during development. I will keep you posted on further changes.

Conclusion

In this tutorial, I summarized my experience working on web mining algorithms.

 

Monday, September 29, 2014

Learn Apache Spark using Cloudera Quickstart VM

Apache Spark is an open-source big data computing engine. It enables applications to run up to 100x faster in memory and 10x faster even when running on disk. It provides support for Java, Scala, and Python, so applications can be rapidly developed for batch, interactive, and streaming systems.

Apache Spark is composed of a master server and one or more worker nodes. I am using the Cloudera QuickStart VM for this tutorial. The VirtualBox VM can be downloaded from the Cloudera website. This VM has Spark preinstalled, and the master and worker nodes are started as soon as the VM is up.

Master Node


The master can be started either:

  • using the master only



[code language="bash"]
./sbin/start-master.sh
[/code]


  •  using the master and one or more workers (the master must be able to access the slave nodes via passwordless SSH)



[code language="bash"]
./sbin/start-all.sh
[/code]

The conf/slaves file has the hostnames of all the worker machines (one hostname per line), for example as shown below.
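
A two-worker conf/slaves file could look like the sketch below (the hostnames are placeholders for your own worker machines):

[code language="bash"]
# conf/slaves -- one worker hostname per line (placeholder hostnames)
worker1.example.com
worker2.example.com
[/code]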

The master will print out a spark://HOST:PORT URL. This URL will be used for starting the worker nodes.

The master's web UI can be accessed using http://localhost:8080.

spark master ui

Worker/slave node


Similarly, one or more workers can be started:

  • one by one, by running the command below on each worker node.



[code language="bash"]
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
[/code]

The IP and PORT can be found out from the master’s web UI, which is http://localhost:8080 by default.

  •  or by using the script below from the master node



[code language="bash"]
./sbin/start-slaves.sh
[/code]

This will start a worker on each machine listed in the conf/slaves file.

The worker's web UI can be accessed using http://localhost:8081.

worker screen shot

Spark Scala shell


The spark scala shell can be invoked using:

[code language="bash"]
./bin/spark-shell
[/code]

OR

[code language="bash"]
./bin/spark-shell --master spark://IP:PORT
[/code]

The below figure shows the spark shell.

spark shell screen shot

[code language="bash"]

scala> var file = sc.textFile("hdfs://quickstart.cloudera:8020/user/hdfs/demo1/input/data.txt")
14/09/29 22:57:11 INFO storage.MemoryStore: ensureFreeSpace(158080) called with curMem=0, maxMem=311387750
14/09/29 22:57:11 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 154.4 KB, free 296.8 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
14/09/29 22:57:20 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
14/09/29 22:57:20 INFO mapred.FileInputFormat: Total input paths to process : 1
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at <console>:14

scala> println(counts)
MapPartitionsRDD[6] at reduceByKey at <console>:14

scala> counts
res3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at <console>:14

scala> counts.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/hdfs/demo1/output")
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/09/29 22:59:31 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:17
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at <console>:14)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Got job 0 (saveAsTextFile at <console>:17) with 1 output partitions (allowLocal=false)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:17)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at <console>:14), which has no missing parents
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at <console>:14)
14/09/29 22:59:31 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/09/29 22:59:31 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/09/29 22:59:31 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2121 bytes in 3 ms
14/09/29 22:59:31 INFO executor.Executor: Running task ID 0
14/09/29 22:59:31 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/29 22:59:31 INFO rdd.HadoopRDD: Input split: hdfs://quickstart.cloudera:8020/user/hdfs/demo1/input/data.txt:0+28
14/09/29 22:59:32 INFO executor.Executor: Serialized size of result for 0 is 779
14/09/29 22:59:32 INFO executor.Executor: Sending result for 0 directly to driver
14/09/29 22:59:32 INFO executor.Executor: Finished task ID 0
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Finished TID 0 in 621 ms on localhost (progress: 1/1)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Stage 1 (reduceByKey at <console>:14) finished in 0.646 s
14/09/29 22:59:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
14/09/29 22:59:32 INFO scheduler.DAGScheduler: running: Set()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: failed: Set()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[7] at saveAsTextFile at <console>:17), which is now runnable
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[7] at saveAsTextFile at <console>:17)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 13029 bytes in 0 ms
14/09/29 22:59:32 INFO executor.Executor: Running task ID 1
14/09/29 22:59:32 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
14/09/29 22:59:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_201409292259_0000_m_000000_1' to hdfs://quickstart.cloudera:8020/user/hdfs/demo1/output/_temporary/0/task_201409292259_0000_m_000000
14/09/29 22:59:32 INFO spark.SparkHadoopWriter: attempt_201409292259_0000_m_000000_1: Committed
14/09/29 22:59:32 INFO executor.Executor: Serialized size of result for 1 is 825
14/09/29 22:59:32 INFO executor.Executor: Sending result for 1 directly to driver
14/09/29 22:59:32 INFO executor.Executor: Finished task ID 1
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Stage 0 (saveAsTextFile at <console>:17) finished in 0.383 s
14/09/29 22:59:32 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:17, took 1.334581571 s
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Finished TID 1 in 387 ms on localhost (progress: 1/1)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

scala>
[/code]

The screenshot below provides details about the input to the word count and the output of the above Scala word count.

output word count
