Wednesday, December 14, 2016

Install Spark on Standalone Mode

Apache Spark

Apache Spark is a cluster computing framework written in the Scala language. It is gaining popularity because it provides real-time processing solutions for the big data ecosystem.

Installation

Apache Spark can be installed in standalone mode by simply placing a compiled version of Spark on each node, or by building it yourself from the source code.

In this tutorial, I will cover the installation using a compiled version of Spark.

a. Install Java 7+ on the machine (if not already installed)
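As a quick sanity check (a minimal sketch, assuming `java` is already on your PATH), you can verify the installed Java version before proceeding:

```shell
# Print the installed Java version; Spark 2.x requires Java 7 or later
java -version
```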

b. Download the Spark tarball

Download the Spark tarball from http://spark.apache.org/downloads.html as shown below.

[Screenshot: Spark download page]

We need to select the parameters below for the download.

  1. Choose a Spark release. You can choose the latest version.

  2. Choose the package type. You can select a build for a specific Hadoop version or one for user-provided Hadoop. Note: Spark uses the core Hadoop library to communicate with HDFS and other Hadoop-supported storage systems. Because the protocol changed across HDFS versions, select a build against the same version that your Hadoop cluster runs. I have selected "Pre-built for Hadoop 2.7 and later".

  3. Choose the download type. Select "Direct download".

  4. Download Spark. Click the link to download the tarball to your local machine.

c. Extract the downloaded tar file

$ tar -xvf spark-2.0.2-bin-hadoop2.7.tgz

Below is the folder structure after you extract the tar file.

[Screenshot: extracted folder structure]

The description of the important folders:

Folder     Usage
sbin       Contains scripts to start and stop the master and slaves
bin        Contains the Scala and Python Spark shells
conf       Contains configuration files
data       Contains data for the graph, machine learning, and streaming examples
jars       Contains the jars included in the Spark classpath
examples   Contains example Spark jobs
logs       Contains all log files

d. Start the Spark standalone master using the commands below
cd <Spark Root directory>
sbin/start-master.sh

e. Check that the master node is working properly.

In a browser, open the URL http://localhost:8080; it should show a screen like the one below.
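Alternatively, you can check from the terminal that the master web UI is responding (a sketch assuming `curl` is installed and the UI is on its default port 8080):

```shell
# Request the master web UI; an HTTP 200 response code means the master is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080
```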

[Screenshot: Spark master web UI]

f. Start a worker node

Now, we will run the script sbin/start-slave.sh, passing it the master URL, as shown below.

cd <spark-root-directory>

sbin/start-slave.sh spark://localhost:7077
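If the worker does not come up, its output lands in the logs directory noted earlier. A minimal sketch for inspecting it (the exact file name includes your user name and host, so a wildcard is used here):

```shell
# Show the most recent worker log output (run from the Spark root directory)
tail logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out
```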

g. Verify that the worker node is running.

In the UI console at http://localhost:8080, you should see a new worker ID (worker-20161215153905-192.168.1.142-57869) as shown below.

[Screenshot: Spark master web UI with the new worker listed]

h. Running a Spark example

We can run the bundled SparkPi example job:

$./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 examples/jars/spark-examples_2.11-2.0.2.jar 1000
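You can also attach an interactive shell to the same master (a sketch assuming you are in the Spark root directory and the master is running on the default URL):

```shell
# Start an interactive Scala shell connected to the standalone master
./bin/spark-shell --master spark://localhost:7077
```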


Verify that the console shows output similar to the below:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/15 21:36:41 INFO SparkContext: Running Spark version 2.0.2
16/12/15 21:36:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/15 21:36:42 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.1.142 instead (on interface wlp18s0b1)

...........

...........

16/12/15 21:36:51 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.098445 s
Pi is roughly 3.143019143019143
16/12/15 21:36:51 INFO SparkUI: Stopped Spark web UI at http://192.168.1.142:4040
16/12/15 21:36:51 INFO StandaloneSchedulerBackend: Shutting down all executors
16/12/15 21:36:51 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
16/12/15 21:36:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/12/15 21:36:51 INFO MemoryStore: MemoryStore cleared
16/12/15 21:36:51 INFO BlockManager: BlockManager stopped
16/12/15 21:36:51 INFO BlockManagerMaster: BlockManagerMaster stopped
16/12/15 21:36:51 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

Running multiple instances of the Spark Worker in Standalone Mode

In conf/spark-env.sh, set SPARK_WORKER_INSTANCES to the number of workers you want to start, then launch them with the start-slave.sh script as shown below.

#Add the below line to <SPARK_HOME>/conf/spark-env.sh
export SPARK_WORKER_INSTANCES=2
#Then start the worker instances
sbin/start-slave.sh spark://localhost:7077 --cores 2 --memory 2g
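When you are done, you can shut the cluster down with the matching stop scripts (run from the Spark root directory; stop-slave.sh stops the worker instances started on this machine):

```shell
# Stop the worker instance(s) on this machine, then the master
sbin/stop-slave.sh
sbin/stop-master.sh
```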

By now, I hope you are able to configure the Spark standalone cluster successfully. If you face any issues, please reply in the comments.

Keep Reading and Learning.

Happy Coding!!!
