Thursday, January 12, 2017

Contribute to Apache Hadoop

For a long time, I have wanted to contribute to the open source Apache Hadoop project. Today I finally had some free time, so I set up the Hadoop code on my local machine for development. I am documenting the steps here, as they may be useful for newcomers.

Below are the steps to set up the Hadoop code for development.

Step 1: Install Java JDK 8 or above

$ java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

Step 2: Install Apache Maven version 3 or later

$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_72, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_72/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-514.2.2.el7.x86_64", arch: "amd64", family: "unix"

Step 3: Install Google Protocol Buffers (version 2.5.0)

Make sure the Protocol Buffers version is exactly 2.5.0. I initially installed the newer version 3.1.0, but got the error below when compiling the code:
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.0.0-alpha2-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.1.0', expected version is '2.5.0' -> [Help 1]
[ERROR] 
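
To get exactly version 2.5.0, one common approach is to build it from source (a sketch; the download URL is where Google hosted the 2.5.0 release and may have moved since):

$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
$ tar -xzf protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ protoc --version
libprotoc 2.5.0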

Step 4: Download the Hadoop source code

We can either clone the repository directly or create a fork of it and then clone the fork.

a) Clone the repository directly:

 $ git clone git://git.apache.org/hadoop.git
b) Create a fork on GitHub, and then clone the fork as shown below (the URL is for my fork; replace it with your own):

 $ git clone https://github.com/poojagpta/hadoop

To sync the fork with the upstream project's changes, add a remote link:
 $ git remote add upstream https://github.com/apache/hadoop

 $ git remote -v
origin https://github.com/poojagpta/hadoop (fetch)
origin https://github.com/poojagpta/hadoop (push)
upstream https://github.com/apache/hadoop (fetch)
upstream https://github.com/apache/hadoop (push)

Now, whenever we want to pull in the latest upstream code:
$ git fetch upstream
$ git checkout trunk
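And then merge the upstream changes into the local trunk branch:
$ git merge upstream/trunk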

Step 5: Compile the downloaded code
$ cd hadoop
$ mvn clean install -Pdist -Dtar -Ptest-patch -DskipTests -Denforcer.skip=true

Output snippet:
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ hadoop-client-modules ---
[INFO] Installing /home/pooja/dev/hadoop/hadoop-client-modules/pom.xml to /home/pooja/.m2/repository/org/apache/hadoop/hadoop-client-modules/3.0.0-alpha2-SNAPSHOT/hadoop-client-modules-3.0.0-alpha2-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.780 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  2.560 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.236 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.824 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.314 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.834 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.167 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.918 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 20.083 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  7.650 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:03 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.138 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 13.088 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.138 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [ 54.973 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [01:51 min]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [  1.323 s]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 41.081 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 12.680 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  0.070 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [  0.073 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [ 35.955 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [01:38 min]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [  0.089 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [ 22.489 s]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 32.492 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [  8.606 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 20.153 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [02:26 min]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [ 55.442 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [  5.479 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [ 17.122 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [  8.654 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [  8.234 s]
[INFO] Apache Hadoop YARN Timeline Service HBase tests .... SUCCESS [02:51 min]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [  0.044 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [  8.076 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [  5.937 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [  0.077 s]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [ 11.366 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [  1.832 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [  8.590 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [  0.225 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [ 43.115 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [ 27.865 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [  9.009 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [ 24.415 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [ 14.692 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [ 29.361 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [  4.828 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [ 10.299 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 12.238 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [  4.336 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 17.591 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 13.083 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [  6.314 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [  6.982 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 12.048 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 12.327 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [  5.819 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [  5.794 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [  0.036 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [  8.138 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 53.458 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 20.452 s]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [ 11.273 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [  3.698 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [  1.618 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 12.085 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [ 27.289 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [  5.002 s]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [  7.041 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  0.052 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [02:09 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [01:21 min]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  3.431 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [03:13 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  0.329 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  1.542 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 42.013 s]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [  0.105 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 32:42 min
[INFO] Finished at: 2017-01-12T11:30:58-08:00
[INFO] Final Memory: 131M/819M
[INFO] ------------------------------------------------------------------------
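
With -Pdist -Dtar, the distribution tarball should now be under hadoop-dist/target/ (the exact file name depends on the version being built; for this build it should be):

$ ls hadoop-dist/target/hadoop-3.0.0-alpha2-SNAPSHOT.tar.gz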


I hope you were also able to set up the Hadoop project and are ready to contribute. Please let me know if you are still facing issues; I would love to help.
In the next tutorial, I will set up the code in IntelliJ and walk through the steps to debug it.

Thanks and happy coding !!!

Problems encountered while compiling the code:

1. Some of the JUnit tests fail:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.minikdc.TestMiniKdc
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.451 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestMiniKdc
testKeytabGen(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.314 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testMiniKdcStart(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.002 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
 (same stack trace as above)

testKerberosLogin(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.008 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
 (same stack trace as above)

Running org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.289 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
testKeytabGen(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.18 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
 (same stack trace as above)

testMiniKdcStart(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.014 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
 (same stack trace as above)

testKerberosLogin(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.009 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
 (same stack trace as above)


Results :

Tests in error: 
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...

Tests run: 6, Failures: 0, Errors: 6, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  2.060 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.584 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.018 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.161 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.253 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.803 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.047 s]
[INFO] Apache Hadoop MiniKDC .............................. FAILURE [ 11.461 s]
[INFO] Apache Hadoop Auth ................................. SKIPPED
[INFO] Apache Hadoop Auth Examples ........................ SKIPPED
......
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-minikdc: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/pooja/dev/hadoop/hadoop-common-project/hadoop-minikdc/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-minikdc

Solution:
I worked around the problem by skipping the JUnit tests (-DskipTests) for the entire build, and running them only for the module I wanted to start fixing.
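
For example, you can build everything without tests and then run the tests for a single module from the project root (-pl points Maven at the module's path):

$ mvn clean install -Pdist -Dtar -DskipTests -Denforcer.skip=true
$ mvn test -pl hadoop-common-project/hadoop-minikdc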

2. Got the error below for the module 'Apache Hadoop HDFS':

[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.612 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.507 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.033 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.937 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.312 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.670 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  8.432 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.359 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 14.786 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  5.682 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:04 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.310 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 14.775 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.074 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [01:19 min]
[INFO] Apache Hadoop HDFS ................................. FAILURE [  5.697 s]
[INFO] Apache Hadoop HDFS Native Client ................... SKIPPED
.........................................
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:48 min
[INFO] Finished at: 2017-01-11T15:43:27-08:00
[INFO] Final Memory: 90M/533M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hadoop-hdfs: Could not resolve dependencies for project org.apache.hadoop:hadoop-hdfs:jar:3.0.0-alpha2-SNAPSHOT: The following artifacts could not be resolved: org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT, org.apache.hadoop:hadoop-kms:jar:tests:3.0.0-alpha2-SNAPSHOT: Could not find artifact org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT in apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-hdfs

Solution:

The error suggests a problem with the OpenSSL installation. OpenSSL is needed to use the HTTPS protocol in HDFS, curl, and other applications. In my case the openssl binary was installed, but the OpenSSL development headers (required to build HTTPS support) were not.

You can install OpenSSL using the command below:
$sudo yum install openssl openssl-devel

$ which openssl
/usr/bin/openssl

$ openssl version
OpenSSL 1.0.1e-fips 11 Feb 2013
We can solve the problem using one of two approaches:

   a. Create a link to the openssl path as shown below:

      $ ln -s /usr/bin/openssl /usr/local/openssl

or

   b. Download OpenSSL and compile it as shown below:

$ wget https://www.openssl.org/source/openssl-1.0.1e.tar.gz
$ tar -xvf openssl-1.0.1e.tar.gz
$ cd openssl-1.0.1e
$ ./config --prefix=/usr/local/openssl --openssldir=/usr/local/openssl
$ make
$ sudo make install

3. Error from the Maven enforcer plugin:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (depcheck) on project hadoop-hdfs-client: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Solution:
I skipped the enforcer (-Denforcer.skip=true), as its constraints allow only Unix and Mac machines.

Tuesday, January 10, 2017

Spark Kafka Integration use case

Apache Spark

Apache Spark is a distributed computing platform that provides near real-time processing of data from various data sources, such as the HDFS file system, Kafka, Flume, or a relational database.

There are many Spark components that facilitate integration with various data sources, such as Spark SQL, Spark Streaming, MLlib, and GraphX.

Apache Kafka

Apache Kafka is a distributed, fault-tolerant streaming platform used to build real-time data pipelines. It works on a publisher/subscriber model.

Use Case

Recently, I worked on a Kafka-Spark integration for a simple real-time fraud detection pipeline. We tracked customer activity and purchase events on an e-commerce site. Based on the purchase events, we flagged suspiciously fraudulent customers. We then filtered the customer activity for the customers of interest and performed operations on it, with the output consumed by another stream for further processing.

We considered many implementation plans, and one of them is explained below.

Data Model (Just an example)

Suspicious Fraudulent Customer (demo1), where flag 'A' marks an active customer and 'F' an inactive one:

Customer_Id | Receive Flag | Name | Sex | Age | City
1           | A            | AAA  | F   | 23  | Union City
2           | A            | BBB  | M   | 77  | San Mateo
3           | F            | NNN  | F   | 33  | San Francisco

Customer Activity (demo2):

Customer_Id | Page Visit                   | Product
1           | store/5426/whats-new         |
1           | ip/product-page/product-Desc | 16503225
2           | ip/product-page/product-Desc | 9988334
3           | search/?query=battery        |
3           | cp/Gift-Cards                |
3           | account/trackorder           |



We need to process the above data and keep only the rows for active customers. Sample output data will be as follows:

Cus_Id | Flag | Name | Sex | Age | City       | Page Visit                   | Product
1      | A    | AAA  | F   | 23  | Union City | store/5426/whats-new         |
1      | A    | AAA  | F   | 23  | Union City | ip/product-page/product-Desc | 16503225
2      | A    | BBB  | M   | 77  | San Mateo  | ip/product-page/product-Desc | 9988334



Implementation Strategy

Kafka streaming:

In this data pipeline, we receive two Kafka input streams and produce one output stream, as described below.

  1. Suspicious Fraudulent Customer (demo1)

  2. Customer Activity (demo2)

  3. Output (test-output)

Spark Streaming component:

The Spark Streaming job integrates with the Kafka topics (demo1, demo2). The demo1 data is cached in memory and updated whenever an existing customer changes or a new customer is added. The data from demo1 and demo2 is joined together and filtered for active customers, and the result is written to 'test-output'.

  1. Subscribe to the Suspicious Fraudulent Customer stream (demo1).

  2. Subscribe to the Customer Activity stream (demo2).

  3. Update the Suspicious Fraudulent Customer data in memory (so as to reflect any updates in demo1).

  4. Join the data from demo1 and demo2, then filter based on the flag.

  5. Perform operations on the joined data.

  6. Output the result to test-output for further processing.

I have implemented the demo code in Scala; a minimal sketch of the approach is shown below.
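
This sketch shows how the six steps above can be wired together with the spark-streaming-kafka-0-10 API. The topic names match the demo; the application name, checkpoint path, and the comma-separated record format are illustrative assumptions, not the exact production code.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object FraudFilterJob {

  // Lines arrive as "customerId,rest"; split them into (id, rest) pairs.
  private def toPair(line: String): (String, String) = {
    val Array(id, rest) = line.split(",", 2)
    (id.trim, rest.trim)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("FraudFilter"), Seconds(5))
    ssc.checkpoint("/tmp/fraud-filter-checkpoint") // required by updateStateByKey

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fraud-filter")

    // Steps 1-2: subscribe to the demo1 and demo2 topics.
    def stream(topic: String) =
      KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
        .map(record => toPair(record.value))

    // Step 3: keep the latest demo1 record per customer across batches.
    val customers = stream("demo1").updateStateByKey[String](
      (updates: Seq[String], current: Option[String]) => updates.lastOption.orElse(current))

    // Step 4: join demo2 activity with the cached customers; keep flag 'A' only.
    val active = customers.join(stream("demo2"))
      .filter { case (_, (customer, _)) => customer.startsWith("A") }

    // Steps 5-6: publish each joined record to the test-output topic.
    active.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach { case (id, pair) =>
          producer.send(new ProducerRecord("test-output", id, (id, pair).toString))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}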

The working model

Let's start the Spark server and submit the Spark job to the Spark cluster as shown below.

[Screenshot: the Spark master UI after the job is submitted]

Note: the application (Id: app-20170110204548-0000) is started and running.
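
The submission itself looks something like this (a sketch; the jar name is illustrative, and FraudFilterJob is the object from the sketch above):

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/bin/spark-submit --class FraudFilterJob --master spark://localhost:7077 fraud-filter-demo.jar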

Now, start the Kafka server and create 3 topics: demo1 (producer), demo2 (producer), and test-output (consumer).

For this tutorial, we will demonstrate the use case by entering data manually.
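
The three topics can be created with the script that ships with Kafka (assuming a local single-broker setup with ZooKeeper on its default port):

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic demo1
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic demo2
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test-output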

Kafka Topic (demo1):

[Screenshot: console producer for demo1]

Kafka Topic (demo2):

[Screenshot: console producer for demo2]

Kafka Topic (test-output): receives the output as shown below:

[Screenshot: console consumer for test-output]

Note: customer 3 is inactive, so it is not shown.

Now, let's make changes in demo1: add a new active customer 4, update customer 2 to inactive, and change customer 3 to active, as shown below:

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo1
4,A NNN F 33 San Francisco
2,F TTT F 22 XXX
3,A HHH M 56 MMM

Then, input some Customer Activity (demo2):

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo2
1,store/5426/whats-new
2,ip/product-page/product-Desc 16503225
3,ip/product-page/product-Desc 9988334
4,search/?query=battery
4,cp/Gift-Cards
3,account/trackorder

Finally, the output shows the activity of all customers currently active in memory (customers 1, 3, and 4):

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-output
(4,(A NNN F 33 San Francisco,search/?query=battery ))
(4,(A NNN F 33 San Francisco,cp/Gift-Cards ))
(3,(A HHH M 56 MMM,ip/product-page/product-Desc 9988334))
(3,(A HHH M 56 MMM,account/trackorder))
(1,(A AAA F 23 Union City,store/5426/whats-new ))

I hope you were able to follow the use case. In case of any questions, please mail me; I would be glad to help.

Thursday, December 29, 2016

Processing Geospatial ShapeFile in Spark Part - 1

The geospatial shapefile is a file format for storing geospatial vector data. It consists of three mandatory files, with the .shp, .shx, and .dbf extensions. Geographical features like water wells, rivers, lakes, schools, cities, land parcels, and roads have a geographic location (lat/long) and associated information (name, area, temperature, etc.), and can be represented as points, polygons, and lines.

Other Geo Data Formats

WKT - Well Known Text


The WKT representation of the San Francisco Bay Area is:

POLYGON((-122.84912109375 38.26487165882067,-121.7889404296875 38.26487165882067,-121.7889404296875 37.274872400526334,-122.84912109375 37.274872400526334,-122.84912109375 38.26487165882067))

After applying the polygon to the Google map via Wicket:


GeoJSON



Geo data can also be expressed in a JSON format known as GeoJSON. A geographical location like Coit Tower can be expressed in GeoJSON as below.

{
    "type": "Point",
    "coordinates": [
        -122.405802,
         37.802350
    ]
}

The GeoJSON can be validated and rendered with a viewer like geojsonlint.
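
To work with these geometries programmatically (which we will eventually want to do from Spark), a library such as JTS can parse WKT and answer spatial queries. Below is a minimal sketch, assuming the JTS library (com.vividsolutions:jts) is on the classpath; it checks that the Coit Tower point from the GeoJSON above falls inside the Bay Area polygon:

import com.vividsolutions.jts.io.WKTReader

object WktContainsDemo {
  def main(args: Array[String]): Unit = {
    val reader = new WKTReader()
    // The San Francisco Bay Area polygon from the WKT example above.
    val bayArea = reader.read(
      "POLYGON((-122.84912109375 38.26487165882067,-121.7889404296875 38.26487165882067," +
        "-121.7889404296875 37.274872400526334,-122.84912109375 37.274872400526334," +
        "-122.84912109375 38.26487165882067))")
    // Coit Tower as WKT (longitude latitude), from the GeoJSON above.
    val coitTower = reader.read("POINT(-122.405802 37.802350)")
    println(bayArea.contains(coitTower)) // prints: true
  }
}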

ShapeFile



The shapefile for the SF Bay Area can be downloaded from sfgov.org.

Unzip the file

[pooja@localhost Downloads]$ unzip bayarea_cities.zip
Archive:  bayarea_cities.zip
  inflating: bayarea_cities/bay_area_cities.dbf
  inflating: bayarea_cities/bay_area_cities.prj
  inflating: bayarea_cities/bay_area_cities.sbn
  inflating: bayarea_cities/bay_area_cities.sbx
  inflating: bayarea_cities/bay_area_cities.shp
  inflating: bayarea_cities/bay_area_cities.shp.xml
  inflating: bayarea_cities/bay_area_cities.shx
[pooja@localhost Downloads]$

The extracted files can be viewed with a shapefile viewer; you can download the open source QGIS viewer.

ShapeFile Transformation


The shapefile data can be easily converted into a PostgreSQL SQL file by tools like shp2pgsql:

shp2pgsql <shapefile> <tablename> <db_name> > filename.sql

For example:

shp2pgsql bay_area_cities.shp cities gisdatabase > cities.sql
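
The generated SQL file can then be loaded into PostgreSQL with psql (assuming the gisdatabase database already exists with PostGIS enabled):

$ psql -d gisdatabase -f cities.sql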

A shapefile can be very large for some use cases, like land parcels or census data. Such a huge dataset is usually divided or sliced by some criteria, like city, state, or county.

Migrating from WordPress to Blogger

Migrating blogs (the content, images, comments, etc.) from WordPress to Blogger seems a bit complex at first. The steps below make the migration very easy.

Export the content from WordPress

  1. Log in to the WordPress dashboard by opening https://<blog>.wordpress.com/wp-admin/ in the browser.
  2. On the left nav, select Tools -> Export.
  3. Select 'All content'.
  4. Click 'Download Export File'.
  5. The XML file will be downloaded.

Convert the WordPress Export to the Blogger Format

The downloaded file needs to be converted to blogger file format so that it can be imported later to blogger.
  1. Check out the code from GitHub: https://github.com/pra85/google-blog-converters-appengine.
  2. Go to the directory 'google-blog-converters-appengine':

              [pooja@localhost dev]$ cd google-blog-converters-appengine/

  3. Run the 'bin/wordpress2blogger.sh' script with the downloaded file as input, redirecting to an output file:

              [pooja@localhost google-blog-converters-appengine]$ bin/wordpress2blogger.sh ~/Downloads/leveragebigdata.wordpress.2016-12-29.xml >> blog.xml

Import the file to Blogger

The Blogger-format file generated by the converter tool now needs to be uploaded to Blogger.

  1. Open https://www.blogger.com/ in a browser.
  2. Navigate on the left: Settings -> Other.
  3. Click 'Import content'.
  4. Click 'Import from computer' and browse to the converter-generated file.
  5. The blogs will be imported and listed.
  6. Publish the posts.

Handling Errors

  1. Error during execution of the converter tool script:
Traceback (most recent call last):
  File "bin/../src/wordpress2blogger/wp2b.py", line 26, in <module>
    import gdata
ImportError: No module named gdata             

Solution: sudo pip install gdata

      2. Images lose their alignment. From the left nav, go to Template -> Customize.

Click 'Advanced' -> 'Add CSS'.


Paste the CSS:

.post-body img {
width:100%;
height:100%;
display: block;
}

Click 'Apply to Blog' and check the blogs.


Happy blogging !!!