Thursday, January 12, 2017

Contribute to Apache Hadoop

For a long time, I have wanted to contribute to open-source Apache Hadoop. Today I had some free time, so I worked on setting up the Hadoop code on my local machine for development. I am documenting the steps, as they may be useful for newcomers.

Below are the steps to set up the Hadoop code for development:

Step 1: Install JDK 8 or above

$ java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

Step 2: Install Apache Maven version 3 or later

mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_72, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_72/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-514.2.2.el7.x86_64", arch: "amd64", family: "unix"

Step 3: Install Google Protocol Buffers (version 2.5.0)
Make sure the protocol buffer version is exactly 2.5.0.
I initially installed a higher version (3.1.0), but got the error below when compiling the code.
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.0.0-alpha2-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.1.0', expected version is '2.5.0' -> [Help 1]
[ERROR] 
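If your system ships a newer protoc, one typical fix is to build protobuf 2.5.0 from source. The sketch below is based on my assumptions about your environment (the download URL and the default /usr/local prefix), so adjust it as needed:

$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
$ tar -xzf protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ protoc --version
libprotoc 2.5.0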

Step 4: Download the Hadoop source code

We can either clone the repository directly or fork it on GitHub and then clone the fork.

a) Clone the repository directly:
 git clone git://git.apache.org/hadoop.git
b) Or fork the repository from the GitHub UI and then clone your fork:

 $ git clone https://github.com/poojagpta/hadoop

To sync the fork with upstream project changes, add the upstream remote:
 $ git remote add upstream https://github.com/apache/hadoop

 $ git remote -v
origin https://github.com/poojagpta/hadoop (fetch)
origin https://github.com/poojagpta/hadoop (push)
upstream https://github.com/apache/hadoop (fetch)
upstream https://github.com/apache/hadoop (push)

Now, whenever you want to fetch the latest code:
$ git fetch upstream
$ git checkout trunk
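To bring your fork's trunk up to date with Apache trunk and push it back to your fork, a typical follow-up (standard git workflow, shown here for completeness) is:

 $ git merge upstream/trunk
 $ git push origin trunk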

Step 5: Compile the downloaded code
$ cd hadoop
$ mvn clean install -Pdist -Dtar -Ptest-patch -DskipTests -Denforcer.skip=true

Snippet Output:
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ hadoop-client-modules ---
[INFO] Installing /home/pooja/dev/hadoop/hadoop-client-modules/pom.xml to /home/pooja/.m2/repository/org/apache/hadoop/hadoop-client-modules/3.0.0-alpha2-SNAPSHOT/hadoop-client-modules-3.0.0-alpha2-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.780 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  2.560 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.236 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.824 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.314 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.834 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.167 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.918 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 20.083 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  7.650 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:03 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.138 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 13.088 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.138 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [ 54.973 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [01:51 min]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [  1.323 s]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 41.081 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 12.680 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  0.070 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [  0.073 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [ 35.955 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [01:38 min]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [  0.089 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [ 22.489 s]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 32.492 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [  8.606 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 20.153 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [02:26 min]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [ 55.442 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [  5.479 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [ 17.122 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [  8.654 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [  8.234 s]
[INFO] Apache Hadoop YARN Timeline Service HBase tests .... SUCCESS [02:51 min]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [  0.044 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [  8.076 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [  5.937 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [  0.077 s]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [ 11.366 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [  1.832 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [  8.590 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [  0.225 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [ 43.115 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [ 27.865 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [  9.009 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [ 24.415 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [ 14.692 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [ 29.361 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [  4.828 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [ 10.299 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 12.238 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [  4.336 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 17.591 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 13.083 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [  6.314 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [  6.982 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 12.048 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 12.327 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [  5.819 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [  5.794 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [  0.036 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [  8.138 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 53.458 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 20.452 s]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [ 11.273 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [  3.698 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [  1.618 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 12.085 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [ 27.289 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [  5.002 s]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [  7.041 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  0.052 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [02:09 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [01:21 min]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  3.431 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [03:13 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  0.329 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  1.542 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 42.013 s]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [  0.105 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 32:42 min
[INFO] Finished at: 2017-01-12T11:30:58-08:00
[INFO] Final Memory: 131M/819M
[INFO] ------------------------------------------------------------------------


I hope you are also able to set up the Hadoop project and are ready to contribute. Please let me know if you are still facing issues; I would love to help.
In the next tutorial, I will set up the code in IntelliJ and walk through the steps to debug it.

Thanks and happy coding !!!

Problems encountered while compiling the code:

1. Some of the JUnit tests are failing.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.minikdc.TestMiniKdc
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.451 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestMiniKdc
testKeytabGen(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.314 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testMiniKdcStart(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.002 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testKerberosLogin(org.apache.hadoop.minikdc.TestMiniKdc)  Time elapsed: 1.008 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

Running org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3.289 sec <<< FAILURE! - in org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain
testKeytabGen(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.18 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testMiniKdcStart(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.014 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

testKerberosLogin(org.apache.hadoop.minikdc.TestChangeOrgNameAndDomain)  Time elapsed: 1.009 sec  <<< ERROR!
java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
at org.apache.kerby.kerberos.kerb.common.Krb5Parser.load(Krb5Parser.java:72)
at org.apache.kerby.kerberos.kerb.common.Krb5Conf.addKrb5Config(Krb5Conf.java:47)
at org.apache.kerby.kerberos.kerb.client.ClientUtil.getDefaultConfig(ClientUtil.java:94)
at org.apache.kerby.kerberos.kerb.client.KrbClientBase.<init>(KrbClientBase.java:51)
at org.apache.kerby.kerberos.kerb.client.KrbClient.<init>(KrbClient.java:38)
at org.apache.kerby.kerberos.kerb.server.SimpleKdcServer.<init>(SimpleKdcServer.java:54)
at org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:280)
at org.apache.hadoop.minikdc.KerberosSecurityTestcase.startMiniKdc(KerberosSecurityTestcase.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)


Results :

Tests in error: 
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestMiniKdc>KerberosSecurityTestcase.startMiniKdc:49 » Runtime Unable to parse...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...
  TestChangeOrgNameAndDomain>KerberosSecurityTestcase.startMiniKdc:49 » Runtime ...

Tests run: 6, Failures: 0, Errors: 6, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  2.060 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.584 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.018 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.161 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.253 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.803 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  9.047 s]
[INFO] Apache Hadoop MiniKDC .............................. FAILURE [ 11.461 s]
[INFO] Apache Hadoop Auth ................................. SKIPPED
[INFO] Apache Hadoop Auth Examples ........................ SKIPPED
......
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-minikdc: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/pooja/dev/hadoop/hadoop-common-project/hadoop-minikdc/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-minikdc

Solution:
I fixed the problem by skipping the JUnit tests (-DskipTests) for the entire build and running the tests only for the module I want to work on, as shown below.
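For example, to run only the tests of the failing module from above (adjust the module path to whichever module you are working on):

$ mvn test -pl hadoop-common-project/hadoop-minikdc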

2. Got the below error for the 'Apache Hadoop HDFS' module.

[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.612 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  1.507 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  2.033 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  4.937 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.312 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  1.670 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [  8.432 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.359 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 14.786 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  5.682 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:04 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 12.310 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 14.775 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.074 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [01:19 min]
[INFO] Apache Hadoop HDFS ................................. FAILURE [  5.697 s]
[INFO] Apache Hadoop HDFS Native Client ................... SKIPPED
.........................................
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:48 min
[INFO] Finished at: 2017-01-11T15:43:27-08:00
[INFO] Final Memory: 90M/533M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hadoop-hdfs: Could not resolve dependencies for project org.apache.hadoop:hadoop-hdfs:jar:3.0.0-alpha2-SNAPSHOT: The following artifacts could not be resolved: org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT, org.apache.hadoop:hadoop-kms:jar:tests:3.0.0-alpha2-SNAPSHOT: Could not find artifact org.apache.hadoop:hadoop-kms:jar:classes:3.0.0-alpha2-SNAPSHOT in apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-hdfs

Solution:

The error suggests a problem with the OpenSSL installation.
OpenSSL is needed to use the HTTPS protocol in HDFS, curl, and other applications.
In my case the openssl binary was installed, but the OpenSSL development libraries (which are required for HTTPS support) were not.

You can install openssl using the command below:
$sudo yum install openssl openssl-devel

$ which openssl
/usr/bin/openssl

$ openssl version
OpenSSL 1.0.1e-fips 11 Feb 2013
We can solve the problem using one of two approaches:

   a. Create a link to openssl path as shown below

      ln -s /usr/bin/openssl /usr/local/openssl

or
    b. Download OpenSSL and compile it as shown below

$wget https://www.openssl.org/source/openssl-1.0.1e.tar.gz
$tar -xvf openssl-1.0.1e.tar.gz
$cd openssl-1.0.1e
$./config --prefix=/usr/local/openssl --openssldir=/usr/local/openssl
$ make
$ sudo make install

3. Error with enforcer
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (depcheck) on project hadoop-hdfs-client: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Solution:
I skipped the enforcer (-Denforcer.skip=true), as this constraint allows only Unix and Mac machines.

Tuesday, January 10, 2017

Spark Kafka Integration use case

Apache Spark

Apache Spark is a distributed computing platform that provides near real-time processing of data from various data sources. The data sources can vary from the HDFS file system to Kafka, Flume, or a relational database.

There are many Spark components that facilitate integration with various data sources, such as Spark SQL, Spark Streaming, MLlib, and GraphX.

Apache Kafka

Apache Kafka is a distributed, fault-tolerant streaming platform used to build real-time data pipelines. It works on a publisher/subscriber model.

Use Case

Recently, I worked on a Kafka-Spark integration for a simple fraud-detection real-time data pipeline. In it, we tracked customer activity and purchase events on an e-commerce site. Based on the purchase events, we categorized suspiciously fraudulent customers. We then filtered the customer activity for the customers of interest and performed operations whose output is consumed by another stream for further processing.

We considered many implementation plans; one of them is explained below.

Data Model (Just an example)

Suspicious Fraudulent Customer (demo1):



Customer_Id | Receive Flag | Name | Sex | Age | City
1           | A            | AAA  | F   | 23  | Union City
2           | A            | BBB  | M   | 77  | San Mateo
3           | F            | NNN  | F   | 33  | San Francisco

Customer Activity (demo2)


Customer_Id | Page Visit                   | Product
1           | store/5426/whats-new         |
1           | ip/product-page/product-Desc | 16503225
2           | ip/product-page/product-Desc | 9988334
3           | search/?query=battery        |
3           | cp/Gift-Cards                |
3           | account/trackorder           |



We need to process the above data and keep only the activity of active customers. The sample output data will be as follows.


Cus_Id | Flag | Name | Sex | Age | City       | Page Visit                   | Product
1      | A    | AAA  | F   | 23  | Union City | store/5426/whats-new         |
1      | A    | AAA  | F   | 23  | Union City | ip/product-page/product-Desc | 16503225
2      | A    | BBB  | M   | 77  | San Mateo  | ip/product-page/product-Desc | 9988334



Implementation Strategy

Kafka streaming:

In this data pipeline, we receive two Kafka input streams and produce one output stream, as described below.

  1. Suspicious Fraudulent Customer (demo1)

  2. Customer Activity (demo2)

  3. Output (test-output)

Spark Streaming component:

The Spark Streaming job integrates with the Kafka topics (demo1, demo2). The demo1 data is cached in memory and updated whenever an active customer changes or a new customer is added. The data from demo1 and demo2 is joined together and filtered for active customers, and the result is output to 'test-output'.

  1. Subscribe Suspicious Fraudulent Customer  (demo1).

  2. Subscribe to Customer Activity (demo2).

  3. Update Suspicious Fraudulent Customer in memory (so as to reflect the update in demo1).

  4. Join data from demo1 and demo2, then filter based on flag.

  5. Perform operation on the data.

  6.  Output the result to test-output for further processing.

I have implemented the demo code in Scala; a minimal sketch of the approach is shown below.
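This is only an illustrative sketch, not the exact production code. It rests on my own assumptions: both topics carry comma-separated records keyed by the customer id (as in the console examples later in this post), the spark-streaming-kafka-0-10 direct stream is used, and the latest demo1 record per customer is held in memory with updateStateByKey. The object name and checkpoint path are made up.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object FraudPipelineSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit when the job is submitted to the cluster.
    val ssc = new StreamingContext(new SparkConf().setAppName("FraudPipelineSketch"), Seconds(5))
    ssc.checkpoint("/tmp/fraud-checkpoint") // required by updateStateByKey

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fraud-demo",
      "auto.offset.reset" -> "latest")

    // Each record is "customerId,rest"; split it into a (key, value) pair.
    def keyedStream(topic: String) =
      KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
        .map { r => val Array(id, rest) = r.value.split(",", 2); (id, rest) }

    // Steps 1 and 3: subscribe to demo1 and keep the latest record per customer in memory.
    val customers = keyedStream("demo1").updateStateByKey[String] {
      (updates: Seq[String], state: Option[String]) => updates.lastOption.orElse(state)
    }

    // Step 2: subscribe to the customer activity topic demo2.
    val activity = keyedStream("demo2")

    // Steps 4 and 5: join the two streams and keep only active ('A') customers.
    val active = customers.join(activity)
      .filter { case (_, (customer, _)) => customer.trim.startsWith("A") }

    // Step 6: the result would be published to 'test-output'; print() stands in for that here.
    active.print()

    ssc.start()
    ssc.awaitTermination()
  }
}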

The working model

Let's start the Spark cluster and submit the Spark job to it.


Note: Application Id: app-20170110204548-0000 is started and running.

Now, start the Kafka server and set up three topics: demo1 (producer), demo2 (producer), and test-output (consumer).

For this tutorial, we will demonstrate the use case with manual data entry.

Kafka Topic (demo1): enter the suspicious fraudulent customer records here.

Kafka Topic (demo2): enter the customer activity records here.

Kafka Topic (test-output): receives the filtered, joined output.

Note: Customer 3 is inactive, so it is not shown in the output.

Now, let's make some changes in demo1: add a new active customer 4, update customer 2 to inactive, and change customer 3 to active, as shown below:

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo1
4,A NNN F 33 San Francisco
2,F TTT F 22 XXX
3,A HHH M 56 MMM

Then, input some Customer Activity (demo2)

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo2
1,store/5426/whats-new
2,ip/product-page/product-Desc 16503225
3,ip/product-page/product-Desc 9988334
4,search/?query=battery
4,cp/Gift-Cards
3,account/trackorder

Finally, the output shows the activity of all the customers that are active in memory (customers 1, 3, and 4).

[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-output
(4,(A NNN F 33 San Francisco,search/?query=battery ))
(4,(A NNN F 33 San Francisco,cp/Gift-Cards ))
(3,(A HHH M 56 MMM,ip/product-page/product-Desc 9988334))
(3,(A HHH M 56 MMM,account/trackorder))
(1,(A AAA F 23 Union City,store/5426/whats-new ))

I hope you were able to follow the use case. If you have any questions, please mail me; I would be glad to help.

Thursday, December 29, 2016

Processing Geospatial ShapeFile in Spark Part - 1

A geospatial shapefile is a file format for storing geospatial vector data. It consists of three mandatory files with the extensions .shp, .shx, and .dbf. Geographical features like water wells, rivers, lakes, schools, cities, land parcels, and roads have a geographic location (lat/long) and associated information (name, area, temperature, etc.), and can be represented as points, polygons, and lines.

Other Geo Data Format 

WKT - Well Known Text


The WKT representation of the San Francisco Bay Area is:

POLYGON((-122.84912109375 38.26487165882067,-121.7889404296875 38.26487165882067,-121.7889404296875 37.274872400526334,-122.84912109375 37.274872400526334,-122.84912109375 38.26487165882067))

After applying the polygon on Google Maps via Wicket:


GeoJSON



Geo data can also be expressed in a JSON format known as GeoJSON. A geographical location like Coit Tower can be expressed in GeoJSON as below.

{
    "type": "Point",
    "coordinates": [
        -122.405802,
         37.802350
    ]
}

The GeoJSON can be validated and viewed with a GeoJSON viewer like geojsonlint.

ShapeFile



The shapefile for SF Bay area can be downloaded from sfgov.org

Unzip the file

[pooja@localhost Downloads]$ unzip bayarea_cities.zip
Archive:  bayarea_cities.zip
  inflating: bayarea_cities/bay_area_cities.dbf
  inflating: bayarea_cities/bay_area_cities.prj
  inflating: bayarea_cities/bay_area_cities.sbn
  inflating: bayarea_cities/bay_area_cities.sbx
  inflating: bayarea_cities/bay_area_cities.shp
  inflating: bayarea_cities/bay_area_cities.shp.xml
  inflating: bayarea_cities/bay_area_cities.shx
[pooja@localhost Downloads]$

The extracted files can be viewed with a shapefile viewer; you can download the open-source QGIS viewer.

ShapeFile Transformation


The shapefile data can easily be converted into a PostgreSQL SQL file with tools like shp2pgsql.

shp2pgsql <shapefile> <tablename> <db_name> > filename.sql 

for example

shp2pgsql bay_area_cities.shp cities gisdatabase > cities.sql
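The generated SQL can then be loaded with psql (a quick sketch, assuming a PostGIS-enabled database named gisdatabase already exists):

psql -d gisdatabase -f cities.sql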

For some use cases, such as land parcels or census data, the shapefile can be very large, but such data is usually divided or sliced by some criterion like city, state, or county.

Migrating from wordpress to blogger

Migrating blogs - the content, images, comments, etc. - from WordPress to Blogger seems a bit complex at first. The steps below make the migration really easy.

Export the content from Wordpress

  1. Log in to the WordPress dashboard by opening https://<blog>.wordpress.com/wp-admin/ in the browser.
  2. On left Nav, select Tools -> Export
  3. Select 'All content'
  4. Click 'Download Export File'
  5. The XML file will be downloaded.

Convert the Wordpress to Blogger Format

The downloaded file needs to be converted to the Blogger file format so that it can later be imported into Blogger.
  1. Checkout the code from the github:  https://github.com/pra85/google-blog-converters-appengine.
  2. Go to the directory 'google-blog-converters-appengine':
              [pooja@localhost dev]$ cd google-blog-converters-appengine/
 
  3. Run the 'bin/wordpress2blogger.sh' script with the downloaded file as input, redirecting the output to a new file:

              [pooja@localhost google-blog-converters-appengine]$ bin/wordpress2blogger.sh ~/Downloads/leveragebigdata.wordpress.2016-12-29.xml >> blog.xml

Import the file to blogger

The Blogger-format file generated by the converter tool now needs to be uploaded to Blogger.

  1. Open https://www.blogger.com/ in browser.
  2. In the left nav, select Settings -> Other.
  3. Click 'Import content'.
  4. Click 'Import from computer' and browse to the file generated by the converter.
  5. The blogs will be imported and listed.
  6. Publish the posts.

Handling Errors

  1. Error during execution of converter tool script:
Traceback (most recent call last):
  File "bin/../src/wordpress2blogger/wp2b.py", line 26, in <module>
    import gdata
ImportError: No module named gdata             

Solution: sudo pip install gdata

  2. Images lose their alignment. From the left nav: Template -> Customize.

Click 'Advanced' -> Add CSS 


Paste the CSS:

.post-body img {
width:100%;
height:100%;
display: block;
}

Click 'Apply to Blog' and check the blogs.


Happy blogging !!!

Thursday, December 22, 2016

Install Cloudera Hue on CentOS / Ubuntu


Introduction

Hue (Hadoop User Experience) provides a web-based interface to Hadoop and its related services. It is a lightweight web server based on the Django Python framework.

(Hue ecosystem diagram; image courtesy gethue)

Create group and user hue

[code language="java"]

[root@localhost ~]$ sudo groupadd hue
[root@localhost ~]$ sudo useradd --groups hue hue
[root@localhost ~]$ sudo passwd hue
[root@localhost ~]$ su - hue

[/code]

Download the Hue Tarball 3.11

[code language="java"]

wget https://dl.dropboxusercontent.com/u/730827/hue/releases/3.11.0/hue-3.11.0.tgz
tar xvzf hue-3.11.0.tgz

[/code]

Create soft link

[code language="java"]

ln -s hue-3.11.0 hue

[/code]

Hue needs to be built on the machine, and the following prerequisites need to be installed first.

[code language="java"]

sudo yum install ant gcc g++ libkrb5-dev libffi-dev libmysqlclient-dev libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev make libldap2-dev maven python-dev python-setuptools libgmp3-dev gcc-c++ python-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel gmp-devel openldap-devel mysql-devel krb5-devel openssl-devel python-simplejson libtidy libxml2-devel libxslt-devel

[/code]

Some of the package names in the list above are the Ubuntu (apt) names; the others are for CentOS (yum).

[code language="java"]

cd hue
make apps

[/code]

The build will take time.

The installation can be tested with the command below:

[code language="java"]

[hue@localhost hue]$ ./build/env/bin/hue runserver
Validating models...

0 errors found
December 22, 2016 - 21:59:57
Django version 1.6.10, using settings 'desktop.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

[/code]

Open http://localhost:8000/


Quit the server with Ctrl+C.

Edit hdfs-site.xml and add the property below:

[code language="java"]

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

[/code]

Edit core-site.xml and add the config below:

[code language="java"]

<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>

[/code]

Change hue/desktop/conf/hue.ini:

[code language="java"]

[hadoop]

# Configuration for HDFS NameNode
# ------------------------------------------------------------------------
[[hdfs_clusters]]
# HA support by using HttpFs

[[[default]]]
# Enter the filesystem uri
fs_defaultfs=hdfs://localhost:8020

# NameNode logical name.
## logical_name=

# Use WebHdfs/HttpFs as the communication mechanism.
# Domain should be the NameNode or HttpFs host.
# Default port is 14000 for HttpFs.
## webhdfs_url=http://localhost:50070/webhdfs/v1

[/code]

Check the value of fs.default.name in hadoop/etc/hadoop/core-site.xml and make sure fs_defaultfs above matches it:

[code language="java"]

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>

[/code]

Test the config using the command below:

[code language="java"]
[hue@localhost hue]$ build/env/bin/supervisor
[INFO] Not running as root, skipping privilege drop
starting server with options:
{'daemonize': False,
'host': '0.0.0.0',
'pidfile': None,
'port': 8888,
'server_group': 'hue',
'server_name': 'localhost',
'server_user': 'hue',
'ssl_certificate': None,
'ssl_certificate_chain': None,
'ssl_cipher_list': 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA',
'ssl_private_key': None,
'threads': 40,
'workdir': None}
[/code]

Open http://localhost:8888/


Enter the credentials admin/admin.


Copy the script from https://github.com/apache/bigtop/blob/master/bigtop-packages/src/deb/hue/hue-server.hue.init to /etc/init.d/hue:

[code language="java"]

vi /etc/init.d/hue
chmod +x /etc/init.d/hue

[/code]

You can start, stop, and check the status using:

[code language="java"]

/etc/init.d/hue start
/etc/init.d/hue stop
/etc/init.d/hue status

[/code]

Happy coding

Some References:

https://github.com/apache/bigtop/blob/master/bigtop-packages/src/deb/hue/hue-server.hue.init
https://developer.ibm.com/hadoop/2016/06/23/install-hue-3-10-top-biginsights-4-2/
https://github.com/cloudera/hue#development-prerequisites
http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/
http://gethue.com/how-to-build-hue-on-ubuntu-14-04-trusty/
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hue_installation.html

Tuesday, December 20, 2016

Integrate Spark as Subscriber with Kafka

Apache Spark

Apache Spark is a robust big data analytical computation system that uses Hadoop (HDFS) or any streaming source like Kafka, Flume, or TCP sockets as the data source for computation. It is gaining popularity because it provides the big data ecosystem with real-time processing capabilities.

In many real scenarios, for instance clickstream data processing, customer recommendations, or managing real-time video streaming traffic, there is certainly a need to move from batch processing to real-time processing. In many such use cases, there is also a constant requirement for a robust distributed messaging system such as Apache Kafka, RabbitMQ, Message Queue, NATS, and many more.

Apache Kafka

Apache Kafka is one of the well-known distributed messaging systems that act as the backbone for many data streaming pipelines and applications.

The Kafka project supports core APIs, i.e., the Producer API, Consumer API, Streams API, and Connector API. Using these core APIs we can create applications that publish data to a topic or consume data from a topic.

In this tutorial, I will discuss using Spark Streaming to receive data from Kafka.

Now, we can design the consumer using 2 approaches:

1. Receiver based: In this approach, a receiver object uses the high-level Kafka consumer API to fetch the data and store it in memory, which could be lost if a Spark node goes down, so we need to make sure the received data is fault tolerant (e.g., via write-ahead logs). Also, partitioning the Kafka topic only adds threads to a single receiver and does not help parallel processing. In this approach, the receiver object connects directly to the Kafka ZooKeeper.

2. Direct based: In this approach, the code periodically pulls data from the Kafka brokers. Kafka is queried using the Kafka simple consumer API at a specified interval for the latest offsets in each partition of a topic. Note: the starting offsets can be defined when creating the direct stream.

The direct approach has many advantages over the receiver approach.

Today, I will be discussing the direct approach.

Prerequisites:

I assume in this article that the components below are already installed on your computer; if not, please set them up before going any further.

a. Install Kafka

b. Install Spark

c. Spark Development using SBT in IntelliJ

Let's get started

Step 1: Add link to Spark-streaming-Kafka

If you are using the Scala API, add the below dependencies to the build.sbt file.

[code language="java"]
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.2"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.0.2"
[/code]

If you are using the Java API, add the below dependencies to pom.xml.

[code language="java"]

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.2</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.0</version>
</dependency>

[/code]

Step 2: Write code to pull data

In this tutorial, I have written the code in IntelliJ and run it locally from there, but you can also run it using the spark-submit command. I will show both Scala and Java code; you can choose either one.

The code below is the Scala version.

[code language="java"]
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// direct usage of the KafkaConsumer
object KafkaConsumer {
  def main(args: Array[String]): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "auto.offset.reset" -> "latest"
    ).asJava
    val topics = "demo".split(",").toList.asJava
    // key/value types must match the configured StringDeserializer
    val consumer = new KafkaConsumer[String, String](kafkaParams)

    consumer.subscribe(topics)

    while (true) {
      // poll every 512 milliseconds and print the value of each record received
      consumer.poll(512).asScala.foreach(record => println(record.value))
      Thread.sleep(1000)
    }
  }
}
[/code]

You can also write the same consumer in Java:

[code language="java"]

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Arrays;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "mygroup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("demo"));

        boolean running = true;
        while (running) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value());
            }
        }

        consumer.close();
    }
}
[/code]
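Both snippets above use the plain Kafka consumer API directly. To consume the same topic through Spark Streaming with the direct approach (which is what the spark-streaming-kafka-0-10 dependency from Step 1 is for), a minimal sketch could look like the code below; the object name, batch interval, and local[2] master are my assumptions for running it from IntelliJ, not part of the original example.

[code language="java"]
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SparkDirectConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkDirectConsumer").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "auto.offset.reset" -> "latest")

    // Direct stream: Spark itself queries the brokers for offsets, no receiver is involved
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo"), kafkaParams))

    // Print the value of every record received in each 5-second batch
    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
[/code]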

Step 3: Start kafka producer

[code language="java"]

#Start zookeeper:default start port 2181
[kafka@localhost kafka_2.11-0.10.1.0]$bin/zookeeper-server-start.sh config/zookeeper.properties &
# Start brokers: default at port 9092 else change in code
[kafka@localhost kafka_2.11-0.10.1.0]$bin/kafka-server-start.sh config/server.properties &
#Create a topic "demo" with 1 partition and a replication factor of 1
[kafka@localhost kafka_2.11-0.10.1.0]$bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic demo
#Start Producer
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo

[/code]

Step 4: Run the Subscriber code from IntelliJ

Right-click the KafkaConsumer object and select the 'Run KafkaConsumer' option.


Step 5: Verify that messages from the producer are received by our code

Type a message in the producer console window.


Verify that our code receives the message on the IntelliJ console.


I hope you were able to follow the tutorial. Let me know if I missed anything.

Happy Coding!!!!

Monday, December 19, 2016

Apache Kafka setup on CentOS

Apache Kafka

Apache Kafka is a distributed messaging system built from components such as publishers, subscribers, and brokers. It is popular because it is designed to store messages in a fault-tolerant way and because it supports building real-time streaming data pipelines and applications.

In this message broker system, we create a topic (category) and a list of producers that send messages on the topic to the brokers; the messages are then either broadcast to, or processed in parallel by, the list of consumers registered to that topic. The communication between producers and consumers uses the TCP protocol.

ZooKeeper is also an integral part of the system; it helps coordinate the distributed brokers and consumers.

This is the simple working model as shown below.

(Diagram: Kafka producers, brokers, and consumers working model)

In this tutorial, I will discuss the steps for installing a simple Kafka messaging system.

Installing Apache Kafka

Step 1: Create user (Optional Step)

[code language="java"]

[root@localhost ~]$ sudo useradd kafka
[root@localhost ~]$ sudo passwd kafka
Changing password for user kafka.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@localhost ~]$ su - kafka

[/code]

Step 2: Download tar file

Download the latest code from the link or wget the code (version 2.11) as shown below.

[code language="java"]

[kafka@localhost ~]$ wget http://apache.osuosl.org/kafka/0.10.1.0/kafka_2.11-0.10.1.0.tgz
--2016-12-19 13:10:48--  http://apache.osuosl.org/kafka/0.10.1.0/kafka_2.11-0.10.1.0.tgz
....
Connecting to apache.osuosl.org (apache.osuosl.org)|64.50.236.52|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34373824 (33M) [application/x-gzip]
Saving to: ‘kafka_2.11-0.10.1.0.tgz’

100%[======================================>] 34,373,824 2.46MB/s in 13s

2016-12-19 13:11:01 (2.60 MB/s) - ‘kafka_2.11-0.10.1.0.tgz’ saved [34373824/34373824]

[/code]

Step 3: Untar the file

Untar the file using below command

[code language="java"]

[kafka@localhost ~]$ tar -xvf kafka_2.11-0.10.1.0.tgz

[kafka@localhost ~]$ cd kafka_2.11-0.10.1.0/

[/code]

The code base has some important directories, as shown below.

Folder | Usage
bin    | Contains daemons to start the server, ZooKeeper, publisher, and subscriber, or to create topics.
config | Contains the properties files for each component.
libs   | Contains the internal jars required by the system.

Step 4: Start the server

The Kafka server requires ZooKeeper, so start ZooKeeper first, as shown below:

[code language="java"]

# Run the zookeeper in background process on port 2181.
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/zookeeper-server-start.sh config/zookeeper.properties &
[2] 29678
[1] Exit 143 nohup bin/zookeeper-server-start.sh config/zookeeper.properties > logs/zookeeper_kafka.out
nohup: ignoring input and redirecting stderr to stdout

#Verify if it process is running
[kafka@localhost kafka_2.11-0.10.1.0]$ jps
29678 QuorumPeerMain
29987 Jps

[/code]

Now, start the kafka server as shown below

[code language="java"]

#Run the kafka server in background
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-server-start.sh config/server.properties &
[3] 30228
...
[2016-12-19 14:46:39,543] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from [__consumer_offsets,48] in 1 milliseconds. (kafka.coordinator.GroupMetadataManager)
# Verify if server running
[kafka@localhost kafka_2.11-0.10.1.0]$ jps
29678 QuorumPeerMain
30501 Jps
30228 Kafka

[/code]

Step 5: Create a topic

Let's create a topic "demo" with a single partition and a single replica, as shown below.

[code language="java"]

#Create topic "demo"
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic demo
Created topic "demo".
#Verify if topic exists
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --list --zookeeper localhost:2181
demo

[/code]
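Optionally, you can also check the partition and replication details of the topic with the standard describe command (an extra sanity check, not part of the original steps):

[code language="java"]

#Describe the topic "demo"
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic demo

[/code]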

Step 6: Create a  producer

Kafka comes with a command-line producer that can take input from a file or from the keyboard.

[code language="java"]

#Run the producer to send message on topic demo
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo
[/code]

Step 7: Create a consumer

Kafka comes with a command-line consumer that displays the messages on the console.

[code language="java"]

#Receive message on consumer
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo

[/code]

I hope you were able to set up the basic Kafka messaging system. Please let me know if you face any issues while configuring it.

Happy Coding!!!!