Apache Spark
Apache Spark is a powerful analytical engine that processes huge volumes of data using distributed in-memory storage.
Apache Hadoop YARN
Hadoop is a well-known distributed computing system that consists of a distributed file system (HDFS), YARN (the resource management framework), and analytical computing jobs (such as MapReduce, Hive, Pig, Spark, etc.).
A Spark analytical job can run on a standalone Spark cluster, a YARN cluster, or a Mesos cluster.
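For reference, the cluster manager is selected through the --master option of spark-submit; below is a minimal sketch of the three variants (the host names and ports are placeholders, not values from this setup).
[code language="java"]
# Standalone Spark cluster (placeholder master host/port)
bin/spark-submit --master spark://master-host:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.2.jar 10
# YARN cluster (the cluster location is read from the Hadoop configuration via HADOOP_CONF_DIR, set up below)
bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.2.jar 10
# Mesos cluster (placeholder master host/port)
bin/spark-submit --master mesos://mesos-host:5050 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.2.jar 10
[/code]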
In this tutorial, I will go through the detailed steps and the problems I faced while setting up a Spark job to run on a remote YARN cluster. Since I have just one computer, I created two users (sparkuser and hduser): Hadoop is installed as 'hduser' and Spark is installed as 'sparkuser'.
Step 1: Install Hadoop 2.7.0 cluster with hduser
Please refer to the Hadoop standalone setup tutorial to install Hadoop as 'hduser'.
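Before moving on, it is worth confirming that the HDFS and YARN daemons from Step 1 are running; a quick sanity check, assuming a standard single-node Hadoop 2.7 install:
[code language="java"]
# Run as hduser; jps lists the running Java daemons
[hduser@localhost ~]$ jps
# Expect to see (process ids will differ): NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
[/code]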
Step 2: Install Spark with sparkuser
[code language="java"]
#Login to sparkuser
[root@localhost ~]$ su - sparkuser
#Download the Spark tarball using the command below, or pick a mirror from http://spark.apache.org/downloads.html
[sparkuser@localhost ~]$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz
#Untar above downloaded tar ball
[sparkuser@localhost ~]$ tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
[/code]
Step 3: Copy Hadoop configuration files
Copy the two Hadoop configuration files, core-site.xml and yarn-site.xml, to the Spark setup as shown below.
[code language="java"]
# As both 'hduser' and 'sparkuser' are on the same machine, we can copy via the /tmp/ folder; if the machine were remote, we could transfer the files with scp/ftp instead.
[hduser@localhost hadoop]$ cp etc/hadoop/core-site.xml /tmp/
[hduser@localhost hadoop]$ cp etc/hadoop/yarn-site.xml /tmp/
# Copy the hadoop configuration to Spark machine
[sparkuser@localhost ~]$ mkdir hadoopConf
[sparkuser@localhost ~]$ cd hadoopConf
[sparkuser@localhost hadoopConf]$ cp /tmp/core-site.xml .
[sparkuser@localhost hadoopConf]$ cp /tmp/yarn-site.xml .
[/code]
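If the Hadoop and Spark installations were on truly separate machines, the same two files could be pulled over the network instead; a sketch using scp (the host name 'hadoop-host' is a placeholder):
[code language="java"]
# Run on the Spark machine as sparkuser; copies the files from the remote Hadoop box
[sparkuser@localhost hadoopConf]$ scp hduser@hadoop-host:/home/hduser/hadoop/etc/hadoop/core-site.xml .
[sparkuser@localhost hadoopConf]$ scp hduser@hadoop-host:/home/hduser/hadoop/etc/hadoop/yarn-site.xml .
[/code]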
Step 4: Set up HADOOP_CONF_DIR
In spark-env.sh, set HADOOP_CONF_DIR to the local path where the Hadoop configuration files are stored, as shown below.
[code language="java"]
# In Spark set up machine, change the <Spark_home>/conf/spark-env.sh
[sparkuser@localhost spark-2.0.2-bin-hadoop2.7]$ nano conf/spark-env.sh
#Point HADOOP_CONF_DIR at the hadoopConf directory created earlier (if conf/spark-env.sh does not exist, copy it from conf/spark-env.sh.template)
export HADOOP_CONF_DIR=/home/sparkuser/hadoopConf/
[/code]
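A quick way to confirm the variable points at the right place (an optional sanity check):
[code language="java"]
# Both XML files should be listed
[sparkuser@localhost spark-2.0.2-bin-hadoop2.7]$ ls /home/sparkuser/hadoopConf/
core-site.xml  yarn-site.xml
[/code]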
Problem Faced: Earlier, I tried to avoid copying the files to 'sparkuser' and set HADOOP_CONF_DIR to '/home/hduser/hadoop/etc/hadoop'.
But when I submitted the Spark job, I got the error below. That is when I realized that 'sparkuser' cannot read files under 'hduser''s home directory.
[code language="java"]
[sparkuser@localhost spark-2.0.2-bin-hadoop2.7]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.0.2.jar 10
Failure Output:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
.....
16/12/16 16:19:38 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/12/16 16:19:41 INFO Client: <strong>Source and destination file systems are the same. Not copying file:/tmp/spark-6700e780-d8fa-443c-aead-7763ed18ca7d/__spark_libs__7158677467450857723.zip</strong>
....16/12/16 16:19:41 INFO SecurityManager: Changing view acls to: sparkuser
16/12/16 16:19:41 INFO Client: Submitting application application_1481925228457_0007 to ResourceManager
....16/12/16 16:19:45 INFO Client: Application report for application_1481925228457_0007 (state: FAILED)
16/12/16 16:19:45 INFO Client:
client token: N/A
diagnostics: Application application_1481925228457_0007 failed 2 times due to AM Container for appattempt_1481925228457_0007_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://localhost:8088/cluster/app/application_1481925228457_0007Then, click on links to logs of each attempt.
Diagnostics: File file:/tmp/spark-6700e780-d8fa-443c-aead-7763ed18ca7d/__spark_libs__7158677467450857723.zip does not exist
<strong>java.io.FileNotFoundException</strong>: File file:/tmp/spark-6700e780-d8fa-443c-aead-7763ed18ca7d/__spark_libs__7158677467450857723.zip does not exist
[/code]
Step 5: Change the Hadoop DFS access permissions
When the Spark job is executed on the YARN cluster, it creates a staging directory under the submitting user's home directory on HDFS. Therefore, 'sparkuser' needs write access to /user/sparkuser.
[code language="java"]
#Create /user/sparkuser directory on HDFS and also change permissions
[hduser@localhost ~]$ hadoop fs -mkdir /user/sparkuser
[hduser@localhost ~]$ hadoop fs -chmod 777 /user/sparkuser
# or you can disable permission checking on HDFS: edit hdfs-site.xml and add the property below
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
[/code]
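Instead of opening the directory up with 777, a less permissive alternative is to make 'sparkuser' the owner of its HDFS home directory; a sketch of that option, plus a check of the resulting ownership:
[code language="java"]
# Alternative to chmod 777: hand ownership of the directory to sparkuser
[hduser@localhost ~]$ hadoop fs -chown sparkuser:sparkuser /user/sparkuser
# Verify ownership and permissions
[hduser@localhost ~]$ hadoop fs -ls /user
[/code]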
Problem faced: When I submitted the Spark job, I got the permission error shown below.
[code language="java"]
[sparkuser@localhost spark-2.0.2-bin-hadoop2.7]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 1g --num-executors 1 examples/jars/spark-examples_2.11-2.0.2.jar 10
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
....
....
16/12/16 17:11:30 INFO Client: Setting up container launch context for our AM
16/12/16 17:11:30 INFO Client: Setting up the launch environment for our AM container
16/12/16 17:11:30 INFO Client: <strong>Preparing resources for our AM container</strong>
<strong>Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=sparkuser, access=WRITE, inode="/user/sparkuser/.sparkStaging/application_1481925228457_0008":hduser:supergroup:drwxr-xr-x</strong>
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
......[/code]
Step 6: Run the Spark job
Now, submit the Spark job as shown below.
[code language="java"]
#Submit the job
[sparkuser@localhost spark-2.0.2-bin-hadoop2.7]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.0.2.jar 10
Output:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
...16/12/16 23:31:51 INFO Client: Submitting application application_1481959535348_0001 to ResourceManager
16/12/16 23:31:52 INFO YarnClientImpl: Submitted application application_1481959535348_0001
16/12/16 23:31:53 INFO Client: Application report for application_1481959535348_0001 (state: ACCEPTED)
...16/12/16 23:33:09 INFO Client: Application report for application_1481959535348_0001 (state: ACCEPTED)
16/12/16 23:33:10 INFO Client: Application report for application_1481959535348_0001 (state: RUNNING)
16/12/16 23:33:10 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.1.142
ApplicationMaster RPC port: 0
queue: default
start time: 1481959911811
final status: UNDEFINED
tracking URL: http://localhost:8088/proxy/application_1481959535348_0001/
user: pooja
16/12/16 23:33:21 INFO Client: Application report for application_1481959535348_0001 (state: RUNNING)
16/12/16 23:33:22 INFO Client: Application report for application_1481959535348_0001 (state: FINISHED)
16/12/16 23:33:22 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.1.142
ApplicationMaster RPC port: 0
queue: default
start time: 1481959911811
final status: SUCCEEDED
tracking URL: http://localhost:8088/proxy/<strong>application_1481959535348_0001</strong>/
user: pooja
16/12/16 23:33:23 INFO Client: Deleting staging directory hdfs://localhost:9000/user/pooja/.sparkStaging/application_1481959535348_0001
16/12/16 23:33:24 INFO ShutdownHookManager: Shutdown hook called
16/12/16 23:33:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-d61b8ae1-3dec-4380-8a85-0c615c1e4be1
[/code]
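In cluster deploy mode the driver runs inside the YARN ApplicationMaster, so the 'Pi is roughly ...' result of SparkPi ends up in the container logs rather than on the submitting console. Assuming YARN log aggregation is enabled, the logs can be fetched with the yarn CLI:
[code language="java"]
# Run as hduser; fetch the aggregated logs for the finished application
[hduser@localhost ~]$ yarn logs -applicationId application_1481959535348_0001
[/code]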
Step 7: Verify the job on YARN
Make sure the application shown in the output above also appears in the YARN ResourceManager web console (http://localhost:8088/cluster).
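The applications can also be listed from the command line; assuming the Hadoop binaries are on the PATH of 'hduser', something like:
[code language="java"]
# The application id from the output above should appear with final state SUCCEEDED
[hduser@localhost ~]$ yarn application -list -appStates ALL
[/code]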
I hope you have successfully submitted Spark jobs on YARN. Please leave a comment if you are facing any issues.
Happy Coding !!!