Apache Spark
Apache Spark is an open-source big data computation system. It is developed in the Scala programming language, which runs on the JVM (Java Virtual Machine). Spark's popularity is growing today thanks to its in-memory data storage and real-time processing capabilities. The system provides high-level APIs in Java, Scala and Python, so we can run data analytical queries through these APIs and get the desired insights. Spark can be deployed to a standalone cluster, Hadoop 2 (YARN) or Mesos.
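As a quick taste of that API, here is a minimal word-count sketch against the Spark 1.x RDD API in Scala (the input path and app name are illustrative placeholders, not part of the project we build below):
[code language="scala"]
import org.apache.spark.{SparkConf, SparkContext}

// A minimal word count against the Spark 1.x RDD API.
object WordCount {
  def main(args: Array[String]) {
    // Run locally with all cores; on a cluster the master is
    // normally supplied by spark-submit instead.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Split each line into words, pair each word with 1, sum the pairs.
    val counts = sc.textFile("input.txt") // placeholder input path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
[/code]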
SBT Overview
SBT stands for Simple Build Tool. A build tool helps automate tasks such as compiling, testing, packaging, running and deploying. Other build tools include Maven, Ant, Gradle and Ivy. SBT is another such build tool, one that focuses mainly on Scala projects. Its most common tasks are single commands, as shown below.
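For instance, once a project is set up, each of these everyday tasks is one sbt command (we will use sbt run later in this post):
[code language="bash"]
sbt compile   # compile the sources
sbt test      # run the test suite
sbt package   # build a jar under target/
sbt run       # compile and run the main class
[/code]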
Today, I am going to explore writing a basic program using the Spark high-level API in Scala 2.10, with IntelliJ as the IDE for development.
Now, all set. Let's get our hands dirty with some actual coding.
Prerequisites (make sure your machine already has the below components installed): Java JDK, SBT and IntelliJ IDEA with the Scala plugin.
Working on Code
a. Creating project structure
There are different ways the project structure can be created; we could even use an existing project template to generate it automatically. Today, we are going to create the project structure manually. In the commands below, we create the root directory/project name (scalaProjectDemo) and the folders src/main/scala and src/main/resources inside it, as shown below:
[code language="java"]
mkdir scalaProjectDemo
cd scalaProjectDemo
mkdir project
mkdir -p src/main/scala
mkdir -p src/main/resources
touch project/build.properties
touch project/plugins.sbt
touch project/assembly.sbt
[/code]
b. Creating a build file
We will be creating the build file "build.sbt" in the root directory as shown below:
[code language="java"]
import AssemblyKeys._
assemblySettings
name := "scalaProjectDemo"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.1.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.2.0" % "provided"
resolvers ++= Seq(
"Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
"Akka Repository" at "http://repo.akka.io/releases/",
"Spray Repository" at "http://repo.spray.cc/")
[/code]
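A couple of notes on this build file: the %% operator appends the Scala version to the artifact name (so spark-core resolves to spark-core_2.10), and the "provided" scope means the dependency is available at compile time but is expected to be supplied by the runtime environment, so sbt-assembly leaves it out of the fat jar.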
The project can now be imported into IntelliJ. Please refer to the item "Importing the code into IntelliJ" below.
c. Creating a Scala file for testing
Next, we create a sample Scala file which contains just a single print statement, as shown below:
[code language="text"]
package com.xxx
object HelloWorld {
def main(args: Array[String]){
println("Hello World")
}
}
[/code]
d. Run the code
Next, we run the code and make sure it compiles successfully, as shown below:
[code language="java"]
$ cd scalaProjectDemo
$ sbt run
Getting org.scala-sbt sbt 0.13.6 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.6/jars/sbt.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt;0.13.6!sbt.jar (1481ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/main/0.13.6/jars/main.jar ...
[SUCCESSFUL ] org.scala-sbt#main;0.13.6!main.jar (3868ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.6/jars/compiler-interface-bin.jar ...
[SUCCESSFUL ] org.scala-sbt#compiler-interface;0.13.6!compiler-interface-bin.jar (1653ms)
......
[SUCCESSFUL ] org.scala-sbt#test-agent;0.13.6!test-agent.jar (1595ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/apply-macro/0.13.6/jars/apply-macro.jar ...
[SUCCESSFUL ] org.scala-sbt#apply-macro;0.13.6!apply-macro.jar (1619ms)
:: retrieving :: org.scala-sbt#boot-app
confs: [default]
44 artifacts copied, 0 already retrieved (13750kB/320ms)
[info] Loading project definition from /home/xxx/dev/scalaProjectDemo/project
[info] Updating {file:/home/xxx/dev/scalaProjectDemo/project/}scalaprojectdemo-build...
[info] Resolving org.scala-sbt.ivy#ivy;2.3.0-sbt-14d4d23e25f354cd296c73bfff40554 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.11.2/jars/sbt-assembly.jar ...
[info] [SUCCESSFUL ] com.eed3si9n#sbt-assembly;0.11.2!sbt-assembly.jar (2136ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.jar ...
[info] [SUCCESSFUL ] org.slf4j#slf4j-log4j12;1.7.5!slf4j-log4j12.jar (78ms)
1.14.v20131031!jetty-webapp.jar (103ms)
............
[info] downloading https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar ...
[info] [SUCCESSFUL ] com.google.protobuf#protobuf-java;2.4.0a!protobuf-java.jar (387ms)
[info] downloading https://repo1.maven.org/maven2/asm/asm/3.2/asm-3.2.jar ...
[info] [SUCCESSFUL ] asm#asm;3.2!asm.jar (195ms)
[info] Done updating.
[info] Compiling 1 Scala source to /home/xxx/dev/scalaProjectDemo/target/scala-2.10/classes...
[info] Running com.xxx.HelloWorld
Hello World
[success] Total time: 97 s, completed Dec 8, 2016 2:58:29 PM
[/code]
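Since the build already wires in sbt-assembly (configured in the next section), you can also package everything into a single fat jar and hand it to a cluster. A minimal sketch, assuming sbt-assembly's default jar name and that spark-submit is on your PATH:
[code language="bash"]
$ sbt assembly
# sbt-assembly names the output <name>-assembly-<version>.jar by default
$ spark-submit --class com.xxx.HelloWorld \
    target/scala-2.10/scalaProjectDemo-assembly-1.0.jar
[/code]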
e. Importing the code into IntelliJ
Edit the file project/plugins.sbt:
[code]
addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.5.2")
[/code]
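With this plugin in place, sbt gains a gen-idea task, so as an alternative to IntelliJ's own import you can generate the IDEA project files from the command line with sbt gen-idea.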
Edit project/assembly.sbt:
[code]
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
[/code]
Edit project/build.properties:
[code]sbt.version=0.13.6[/code]
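Pinning sbt.version here ensures everyone who builds the project uses the same sbt release (0.13.6), whatever sbt launcher happens to be installed locally.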
Open IntelliJ.
From the menu, choose File > Open; the Open dialog appears.
Choose the project path and click OK.
Leave the defaults and click OK.
The import will take some time as the dependencies download. Click OK.
Select "New Window" so the project opens in a new window.
The project is now imported into IntelliJ.
Expand src > main > scala.
You can now add more files, and run and debug the code, in IntelliJ.
The code is compiled by the Scala compiler and then executed.
The output appears in the Run window.
Let me know if you face any issues.
Happy Coding