Apache Spark
Apache Spark is an open-source big data computation system. It is developed in the Scala programming language, which runs on the JVM (Java Virtual Machine). Spark's popularity is growing today thanks to its in-memory data storage and real-time processing capabilities. The system provides high-level APIs in Java, Scala and Python, so we can run data analytical queries through these APIs and get the desired insights. Spark can be deployed to a standalone cluster, Hadoop 2 (YARN) or Mesos.
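As a quick taste of that API, here is a minimal word-count sketch against the Spark 1.x RDD API in Scala (the input path and app name are illustrative placeholders, not part of the project we build below):
[code language="scala"]
import org.apache.spark.{SparkConf, SparkContext}

// A minimal word count against the Spark 1.x RDD API.
object WordCount {
  def main(args: Array[String]) {
    // Run locally with all cores; on a cluster the master is
    // normally supplied by spark-submit instead.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Split each line into words, pair each word with 1, sum the pairs.
    val counts = sc.textFile("input.txt") // placeholder input path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
[/code]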
SBT Overview
SBT stands for Simple Build Tool. A build tool helps automate tasks such as compiling, testing, packaging, running and deploying. Other build tools include Maven, Ant, Gradle and Ivy. SBT is another such build tool, one that focuses mainly on Scala projects. Its most common tasks are single commands, as shown below.
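For instance, once a project is set up, each of these everyday tasks is one sbt command (we will use sbt run later in this post):
[code language="bash"]
sbt compile   # compile the sources
sbt test      # run the test suite
sbt package   # build a jar under target/
sbt run       # compile and run the main class
[/code]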
Today, I am going to explore writing a basic program using the Spark high-level API in Scala 2.10, with IntelliJ as the IDE for development.
Now, all set. Let's get our hands dirty with some actual coding.
Prerequisites (make sure your machine already has the below components installed): Java JDK, SBT and IntelliJ IDEA with the Scala plugin.
Working on Code
a. Creating project structure
There are different ways the project structure can be created; we could even use an existing project template to generate it automatically. Today, we are going to create the project structure manually. In the commands below, we create the root directory/project name (scalaProjectDemo) and the folders src/main/scala and src/main/resources inside it, as shown below:
[code language="java"]
mkdir scalaProjectDemo
cd scalaProjectDemo
mkdir project
mkdir -p src/main/scala
mkdir -p src/main/resources
touch project/build.properties
touch project/plugins.sbt
touch project/assembly.sbt
[/code]
b. Creating a build file
We will be creating the build file "build.sbt" in the root directory as shown below:
[code language="java"]
import AssemblyKeys._
assemblySettings
name := "scalaProjectDemo"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.1.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.2.0" % "provided"
resolvers ++= Seq(
"Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
"Akka Repository" at "http://repo.akka.io/releases/",
"Spray Repository" at "http://repo.spray.cc/")
[/code]
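A couple of notes on this build file: the %% operator appends the Scala version to the artifact name (so spark-core resolves to spark-core_2.10), and the "provided" scope means the dependency is available at compile time but is expected to be supplied by the runtime environment, so sbt-assembly leaves it out of the fat jar.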
The project can now be imported into IntelliJ. Please refer to the item "Importing the code into IntelliJ" below.
c. Creating a Scala file for testing
Next, we create a sample Scala file which contains just a single print statement, as shown below:
[code language="text"]
package com.xxx
object HelloWorld {
def main(args: Array[String]){
println("Hello World")
}
}
[/code]
d. Run the code
Next, we run the code and make sure it compiles successfully, as shown below:
[code language="java"]
$ cd scalaProjectDemo
$ sbt run
Getting org.scala-sbt sbt 0.13.6 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.6/jars/sbt.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt;0.13.6!sbt.jar (1481ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/main/0.13.6/jars/main.jar ...
[SUCCESSFUL ] org.scala-sbt#main;0.13.6!main.jar (3868ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.6/jars/compiler-interface-bin.jar ...
[SUCCESSFUL ] org.scala-sbt#compiler-interface;0.13.6!compiler-interface-bin.jar (1653ms)
......
[SUCCESSFUL ] org.scala-sbt#test-agent;0.13.6!test-agent.jar (1595ms)
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/apply-macro/0.13.6/jars/apply-macro.jar ...
[SUCCESSFUL ] org.scala-sbt#apply-macro;0.13.6!apply-macro.jar (1619ms)
:: retrieving :: org.scala-sbt#boot-app
confs: [default]
44 artifacts copied, 0 already retrieved (13750kB/320ms)
[info] Loading project definition from /home/xxx/dev/scalaProjectDemo/project
[info] Updating {file:/home/xxx/dev/scalaProjectDemo/project/}scalaprojectdemo-build...
[info] Resolving org.scala-sbt.ivy#ivy;2.3.0-sbt-14d4d23e25f354cd296c73bfff40554 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.11.2/jars/sbt-assembly.jar ...
[info] [SUCCESSFUL ] com.eed3si9n#sbt-assembly;0.11.2!sbt-assembly.jar (2136ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.jar ...
[info] [SUCCESSFUL ] org.slf4j#slf4j-log4j12;1.7.5!slf4j-log4j12.jar (78ms)
1.14.v20131031!jetty-webapp.jar (103ms)
............
[info] downloading https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar ...
[info] [SUCCESSFUL ] com.google.protobuf#protobuf-java;2.4.0a!protobuf-java.jar (387ms)
[info] downloading https://repo1.maven.org/maven2/asm/asm/3.2/asm-3.2.jar ...
[info] [SUCCESSFUL ] asm#asm;3.2!asm.jar (195ms)
[info] Done updating.
[info] Compiling 1 Scala source to /home/xxx/dev/scalaProjectDemo/target/scala-2.10/classes...
[info] Running com.xxx.HelloWorld
Hello World
[success] Total time: 97 s, completed Dec 8, 2016 2:58:29 PM
[/code]
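Since the build already wires in sbt-assembly (configured in the next section), you can also package everything into a single fat jar and hand it to a cluster. A minimal sketch, assuming sbt-assembly's default jar name and that spark-submit is on your PATH:
[code language="bash"]
$ sbt assembly
# sbt-assembly names the output <name>-assembly-<version>.jar by default
$ spark-submit --class com.xxx.HelloWorld \
    target/scala-2.10/scalaProjectDemo-assembly-1.0.jar
[/code]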
e. Importing the code into IntelliJ
Edit the file project/plugins.sbt:
[code]
addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.5.2")
[/code]
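With this plugin in place, sbt gains a gen-idea task, so as an alternative to IntelliJ's own import you can generate the IDEA project files from the command line with sbt gen-idea.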
Edit project/assembly.sbt:
[code]
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
[/code]
Edit project/build.properties:
[code]sbt.version=0.13.6[/code]
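Pinning sbt.version here ensures everyone who builds the project uses the same sbt release (0.13.6), whatever sbt launcher happens to be installed locally.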
Open IntelliJ.
From the menu, choose File > Open; the Open dialog appears.
Choose the project path and click OK.
Leave the defaults and click OK.
The import will take some time as the dependencies download. Click OK.
Select "New Window" so the project opens in a new window.
The project is now imported into IntelliJ.
Expand src > main > scala.
You can now add more files, and run and debug the code, in IntelliJ.
The code is compiled by the Scala compiler and then executed.
The output appears in the Run window.
Let me know if you face any issues.
Happy Coding