Apache Spark
Apache Spark is a cluster computing framework written in Scala. It is gaining popularity because it provides fast, near-real-time processing for the big data ecosystem.
Installation
Apache Spark can be installed in standalone mode by simply placing a pre-built (compiled) version of Spark on each node, or by building it yourself from the source code.
In this tutorial, I will walk through the installation using a pre-built version of Spark.
a. Install Java 7+ on the machine (if not already installed)
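You can quickly confirm that a suitable Java runtime is on the PATH before moving on; the exact version string in the output will vary by machine.
#Check the installed Java version (any Java 7+ runtime works for Spark 2.0.x)
$java -version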
b. Download the Spark tarball
Download the Spark tarball from http://spark.apache.org/downloads.html as shown below.
We need to select the following parameters for the download.
- Choose a Spark release. You can choose the latest version.
- Choose the package type. You can select a package pre-built for a specific Hadoop version, or one built for user-provided Hadoop. Note: Spark uses the core Hadoop library to communicate with HDFS and other Hadoop-supported storage systems. Because the HDFS protocol has changed across Hadoop versions, select a build that matches the version your Hadoop cluster runs. I have selected "Pre-built for Hadoop 2.7 and later".
- Choose the download type. Select "Direct download".
- Download Spark. Click the link to download the tarball to your local machine, or fetch it from the command line as shown below.
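If you prefer working from the command line, the same tarball can also be fetched directly. The mirror URL below is only an example; use the link shown on the downloads page.
#Download the pre-built package (URL is an example; copy the actual link from the downloads page)
$wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz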
c. Extract the downloaded tar file
$tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
Below is the folder structure after you extract the tar file.
A description of the important folders:
Folder | Usage |
sbin | Contains the start and stop scripts for the master and slaves (workers) |
bin | Contains the Scala and Python Spark shells |
conf | Contains the configuration files |
data | Contains data for the graph, machine learning and streaming examples |
jars | Contains the jars included in the Spark classpath |
examples | Contains the Spark example jobs |
logs | Contains all log files |
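Optionally, you can point an environment variable at the extracted directory so that the bin and sbin scripts are available from anywhere. The path below is only an example; use wherever you extracted the tarball.
#Optional: set SPARK_HOME to the extracted directory (path is an example)
export SPARK_HOME=/opt/spark-2.0.2-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin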
d. Start the Spark standalone master using the below commands
cd <Spark Root directory>
sbin/start-master.sh
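By default, the master binds to the machine's hostname and uses port 7077 for the cluster and 8080 for the web UI. If you need different values, the start script accepts options; the values shown below are simply the defaults made explicit.
#Optional: start the master with an explicit host and ports (values shown are the defaults)
sbin/start-master.sh --host localhost --port 7077 --webui-port 8080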
e. Check if the master node is working properly.
In the browser, open the URL http://localhost:8080; it should show the screen below.
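If you are on a headless machine, a quick command-line check works too; it simply confirms that the web UI responds and reports the spark:// master URL.
#Headless check: the output should contain the spark://... master URL
$curl -s http://localhost:8080 | grep "spark://"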
f. Start a worker node
Now, we will run the sbin/start-slave.sh script, passing it the master URL, as shown below.
cd <spark-root-directory>
sbin/start-slave.sh spark://localhost:7077
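The start-slave.sh script also accepts options if you want to control the worker's web UI port or work directory; the values below are only illustrative.
#Optional: start the worker with an explicit web UI port and work directory (values are examples)
sbin/start-slave.sh spark://localhost:7077 --webui-port 8081 --work-dir /tmp/spark-worker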
g. Verify that the worker node is running.
On the http://localhost:8080 UI console, make sure you can see a new Worker Id (worker-20161215153905-192.168.1.142-57869) as shown below.
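If the worker does not show up, its log file usually explains why. The exact file name includes your user name and hostname, so the wildcard below is just a convenient pattern.
#Inspect the latest worker log (file name varies by user and hostname)
$tail -f logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out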
h. Running a Spark example
We can run the bundled SparkPi example job against the standalone master:
$./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 examples/jars/spark-examples_2.11-2.0.2.jar 1000
Verify that the console shows output similar to the below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/15 21:36:41 INFO SparkContext: Running Spark version 2.0.2
16/12/15 21:36:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/15 21:36:42 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.1.142 instead (on interface wlp18s0b1)
...........
...........
16/12/15 21:36:51 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.098445 s
Pi is roughly 3.143019143019143
16/12/15 21:36:51 INFO SparkUI: Stopped Spark web UI at http://192.168.1.142:4040
16/12/15 21:36:51 INFO StandaloneSchedulerBackend: Shutting down all executors
16/12/15 21:36:51 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
16/12/15 21:36:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/12/15 21:36:51 INFO MemoryStore: MemoryStore cleared
16/12/15 21:36:51 INFO BlockManager: BlockManager stopped
16/12/15 21:36:51 INFO BlockManagerMaster: BlockManagerMaster stopped
16/12/15 21:36:51 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
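As another quick check that the cluster accepts applications, you can open an interactive shell against the same master; once it starts, it should also appear under Running Applications on the master UI at http://localhost:8080.
#Open an interactive Spark shell connected to the standalone master
$./bin/spark-shell --master spark://localhost:7077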
Running multiple instances of the Spark Worker in Standalone Mode
In conf/spark-env.sh, set SPARK_WORKER_INSTANCES to the number of workers you want to start, then start them with the start-slave.sh script as shown below.
#Add the below line to <Spark_home>/conf/spark-env.sh
export SPARK_WORKER_INSTANCES=2
#Then start the worker instances
sbin/start-slave.sh spark://localhost:7077 --cores 2 --memory 2g
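When you are finished, the same sbin directory provides matching stop scripts; stop-all.sh should stop both the master and the workers started from this machine.
#Stop the worker(s) and the master when finished
sbin/stop-slave.sh
sbin/stop-master.sh
#Or stop everything at once
sbin/stop-all.sh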
By now, I hope you are able to configure the Spark standalone cluster successfully. If you face any issues, please reply in the comments.
Keep Reading and Learning.
Happy Coding!!!