Apache Spark's standalone cluster is composed of a master server and one or more worker nodes. I am using the Cloudera QuickStart VM for this tutorial; the VirtualBox VM can be downloaded from here. Spark comes preinstalled on this VM, and the master and worker nodes are started as soon as the VM is up.
Master Node
The master can be started in either of two ways:
- using the master script alone
[code language="bash"]
./sbin/start-master.sh
[/code]
- using the master together with one or more workers (the master must be able to reach the slave nodes over passwordless SSH; a setup sketch follows the command below)
[code language="bash"]
./sbin/start-all.sh
[/code]
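If passwordless SSH is not already set up, something like the following can be run on the master node; the user and worker hostname below are placeholders for your own environment:
[code language="bash"]
# Generate a key pair on the master (skip if one already exists)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to each worker listed in conf/slaves
# (replace cloudera@worker1 with your own user and hostname)
ssh-copy-id cloudera@worker1
[/code]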
The conf/slaves file contains the hostnames of all the worker machines (one hostname per line).
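For example, a conf/slaves file for a two-worker cluster might look like this (the hostnames are placeholders):
[code language="bash"]
worker1.example.com
worker2.example.com
[/code]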
Once started, the master prints out a
spark://HOST:PORT
URL. This URL is used when starting the worker nodes. The master's web UI can be accessed at http://localhost:8080.
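The standalone master listens on port 7077 by default, so on the QuickStart VM the URL will typically look like:
[code language="bash"]
spark://quickstart.cloudera:7077
[/code]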
Worker/Slave Node
Similarly, one or more workers can be started:
- one by one, by running the command below on each worker node
[code language="bash"]
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
[/code]
The IP and PORT can be found on the master's web UI, which is available at http://localhost:8080 by default.
- or by running the script below from the master node
[code language="bash"]
./sbin/start-slaves.sh
[/code]
This will start workers on all the nodes listed in the conf/slaves file.
A worker's web UI can be accessed at http://localhost:8081.
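A worker's resources can also be capped when it is started; for example, the Worker class accepts --cores and --memory options (the values below are purely illustrative):
[code language="bash"]
./bin/spark-class org.apache.spark.deploy.worker.Worker \
  --cores 2 --memory 2g spark://IP:PORT
[/code]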
Spark Scala Shell
The Spark Scala shell can be invoked using:
[code language="bash"]
./bin/spark-shell
[/code]
or, to connect it to the standalone master:
[code language="bash"]
./bin/spark-shell --master spark://IP:PORT
[/code]
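When connecting to the standalone master, resource options can be passed as well; --executor-memory and --total-executor-cores are standard spark-shell/spark-submit flags, and the values below are just examples:
[code language="bash"]
./bin/spark-shell --master spark://IP:PORT \
  --executor-memory 1g --total-executor-cores 2
[/code]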
The listing below shows a word count example run in the Spark shell.
[code language="bash"]
scala> var file = sc.textFile("hdfs://quickstart.cloudera:8020/user/hdfs/demo1/input/data.txt")
14/09/29 22:57:11 INFO storage.MemoryStore: ensureFreeSpace(158080) called with curMem=0, maxMem=311387750
14/09/29 22:57:11 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 154.4 KB, free 296.8 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
14/09/29 22:57:20 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
14/09/29 22:57:20 INFO mapred.FileInputFormat: Total input paths to process : 1
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at <console>:14
scala> println(counts)
MapPartitionsRDD[6] at reduceByKey at <console>:14
scala> counts
res3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at <console>:14
scala> counts.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/hdfs/demo1/output")
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/09/29 22:59:31 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/09/29 22:59:31 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:17
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at <console>:14)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Got job 0 (saveAsTextFile at <console>:17) with 1 output partitions (allowLocal=false)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:17)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at <console>:14), which has no missing parents
14/09/29 22:59:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at <console>:14)
14/09/29 22:59:31 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/09/29 22:59:31 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/09/29 22:59:31 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2121 bytes in 3 ms
14/09/29 22:59:31 INFO executor.Executor: Running task ID 0
14/09/29 22:59:31 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/29 22:59:31 INFO rdd.HadoopRDD: Input split: hdfs://quickstart.cloudera:8020/user/hdfs/demo1/input/data.txt:0+28
14/09/29 22:59:32 INFO executor.Executor: Serialized size of result for 0 is 779
14/09/29 22:59:32 INFO executor.Executor: Sending result for 0 directly to driver
14/09/29 22:59:32 INFO executor.Executor: Finished task ID 0
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Finished TID 0 in 621 ms on localhost (progress: 1/1)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Stage 1 (reduceByKey at <console>:14) finished in 0.646 s
14/09/29 22:59:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
14/09/29 22:59:32 INFO scheduler.DAGScheduler: running: Set()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: failed: Set()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[7] at saveAsTextFile at <console>:17), which is now runnable
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[7] at saveAsTextFile at <console>:17)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 13029 bytes in 0 ms
14/09/29 22:59:32 INFO executor.Executor: Running task ID 1
14/09/29 22:59:32 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/09/29 22:59:32 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
14/09/29 22:59:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_201409292259_0000_m_000000_1' to hdfs://quickstart.cloudera:8020/user/hdfs/demo1/output/_temporary/0/task_201409292259_0000_m_000000
14/09/29 22:59:32 INFO spark.SparkHadoopWriter: attempt_201409292259_0000_m_000000_1: Committed
14/09/29 22:59:32 INFO executor.Executor: Serialized size of result for 1 is 825
14/09/29 22:59:32 INFO executor.Executor: Sending result for 1 directly to driver
14/09/29 22:59:32 INFO executor.Executor: Finished task ID 1
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
14/09/29 22:59:32 INFO scheduler.DAGScheduler: Stage 0 (saveAsTextFile at <console>:17) finished in 0.383 s
14/09/29 22:59:32 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:17, took 1.334581571 s
14/09/29 22:59:32 INFO scheduler.TaskSetManager: Finished TID 1 in 387 ms on localhost (progress: 1/1)
14/09/29 22:59:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
scala>
[/code]
The screenshot below shows the input to the word count and the output produced by the Scala word count above.
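The output can also be inspected from the command line; saveAsTextFile writes one part-NNNNN file per partition, so something like the following should print the word counts (assuming the HDFS output path used above):
[code language="bash"]
hadoop fs -cat hdfs://quickstart.cloudera:8020/user/hdfs/demo1/output/part-*
[/code]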