Apache Spark
Apache Spark is a distributed computing platform that provides near real-time processing of data from a variety of data sources, such as HDFS, Kafka, Flume, or relational databases.
Spark components such as Spark SQL, Spark Streaming, MLlib, and GraphX facilitate integration with these various data sources.
Apache Kafka
Apache Kafka is a distributed, fault-tolerant streaming platform used to build real-time data pipelines. It works on a publisher/subscriber model.
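As a small illustration of the publish side, a minimal Scala producer for the demo1 topic used later in this post could look like the sketch below (the broker address and record value come from the console examples further down; the object name is only for the sketch). The console consumer shown later in this post plays the subscriber role.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object Demo1Publisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Publish one customer record to the demo1 topic; every consumer
    // subscribed to demo1 receives it independently.
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("demo1", "1,A AAA F 23 Union City"))
    producer.close()
  }
}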
Use Case
Recently, I worked on a Kafka-Spark integration for a simple real-time fraud-detection data pipeline. In it, we tracked customer activity and purchase events on an e-commerce site. Based on the purchase events, we categorized suspiciously fraudulent customers. We then filtered the customer activity down to the customers of interest and performed operations whose output is consumed by another stream for further processing.
We considered several implementation plans; one of them is explained below.
Data Model (Just an example)
Suspicious Fraudulent Customer (demo1):
Customer_Id | Flag | Name | Sex | Age | City
1 | A | AAA | F | 23 | Union City
2 | A | BBB | M | 77 | San Mateo
3 | F | NNN | F | 33 | San Francisco
Here the flag A marks an active customer and F an inactive one.
Customer Activity (demo2)
Customer_Id | Page Visit | Product
1 | store/5426/whats-new |
1 | ip/product-page/product-Desc | 16503225
2 | ip/product-page/product-Desc | 9988334
3 | search/?query=battery |
3 | cp/Gift-Cards |
3 | account/trackorder |
We need to process the above data and filter it down to active customers only, so the sample output data will be as follows.
Cus_Id | Flag | Name | Sex | Age | City | Page Visit | Product
1 | A | AAA | F | 23 | Union City | store/5426/whats-new |
1 | A | AAA | F | 23 | Union City | ip/product-page/product-Desc | 16503225
2 | A | BBB | M | 77 | San Mateo | ip/product-page/product-Desc | 9988334
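Conceptually, this step is a join on customer id followed by a filter on the flag. Below is a minimal sketch of that logic on the sample data above, using plain Scala collections (the variable names are only for illustration); the streaming implementation follows in the next section.

// Sample records keyed by customer id, mirroring the demo1 and demo2 tables above.
val customers = Map(
  "1" -> "A AAA F 23 Union City",
  "2" -> "A BBB M 77 San Mateo",
  "3" -> "F NNN F 33 San Francisco")

val activity = Seq(
  ("1", "store/5426/whats-new"),
  ("1", "ip/product-page/product-Desc 16503225"),
  ("2", "ip/product-page/product-Desc 9988334"),
  ("3", "search/?query=battery"),
  ("3", "cp/Gift-Cards"),
  ("3", "account/trackorder"))

// Join each activity record with its customer and keep only active ('A') customers.
val output = activity.flatMap { case (id, visit) =>
  customers.get(id)
    .filter(_.startsWith("A"))
    .map(details => (id, (details, visit)))
}

output.foreach(println)
// (1,(A AAA F 23 Union City,store/5426/whats-new))
// (1,(A AAA F 23 Union City,ip/product-page/product-Desc 16503225))
// (2,(A BBB M 77 San Mateo,ip/product-page/product-Desc 9988334))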
Implementation Strategy
Kafka streaming:
In this data pipeline, we receive two Kafka input streams and produce one output stream, as described below:
- Suspicious Fraudulent Customer (demo1)
- Customer Activity (demo2)
- Output (test-output)
Spark Streaming component:
The Spark Streaming API integrates with the Kafka topics (demo1, demo2). The demo1 data is cached in memory and updated whenever an active customer changes or a new customer is added. The data from demo1 and demo2 is joined, filtered for active customers, and written to 'test-output'.
- Subscribe to Suspicious Fraudulent Customer (demo1).
- Subscribe to Customer Activity (demo2).
- Update Suspicious Fraudulent Customer in memory (so as to reflect the update in demo1).
- Join data from demo1 and demo2, then filter based on flag.
- Perform operation on the data.
- Output the result to test-output for further processing.
I have implemented the demo code in Scala.
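Below is a minimal sketch of how such a job could look, assuming the spark-streaming-kafka-0-10 connector and the comma-separated message format used in the console examples later in this post; the batch interval, checkpoint path, consumer group, and object name are assumptions made for this sketch, not the exact production code.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object FraudDetectionJob {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaSparkFraudDemo"), Seconds(10))
    ssc.checkpoint("/tmp/fraud-demo-checkpoint") // required by updateStateByKey

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fraud-demo",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // Read a topic and split each "id,rest" message into an (id, rest) pair.
    def keyedStream(topic: String): DStream[(String, String)] =
      KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
        .flatMap { record =>
          record.value.split(",", 2) match {
            case Array(id, rest) => Seq((id, rest.trim))
            case _ => Seq.empty
          }
        }

    val customers = keyedStream("demo1") // e.g. "4,A NNN F 33 San Francisco"
    val activity = keyedStream("demo2")  // e.g. "4,search/?query=battery"

    // Cache the latest record per customer in memory; newer messages overwrite older ones,
    // so flag changes and new customers in demo1 are picked up automatically.
    val customerState = customers.updateStateByKey[String] {
      (updates: Seq[String], current: Option[String]) => updates.lastOption.orElse(current)
    }

    // Join each activity record with the cached customer data and keep only active ('A') customers.
    val output = activity.transformWith(customerState,
      (acts: RDD[(String, String)], custs: RDD[(String, String)]) =>
        acts.join(custs)
          .filter { case (_, (_, details)) => details.startsWith("A") }
          .map { case (id, (visit, details)) => (id, (details, visit)) })

    // Publish every joined record to the test-output topic for further processing.
    output.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(r => producer.send(new ProducerRecord[String, String]("test-output", r.toString)))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}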
The working model
Let's start the Spark server and submit the Spark job to the Spark cluster.
Note: the application with Id app-20170110204548-0000 is started and running.
Now, start the Kafka server and create three topics: demo1 (producer), demo2 (producer), and test-output (consumer).
For this tutorial, we demonstrate the use case with manual data entry.
Kafka Topic (demo1): enter the Suspicious Fraudulent Customer records from the data model above.
Kafka Topic (demo2): enter the Customer Activity records.
Kafka Topic (test-output): receives the output.
Note: customer 3 is inactive at this point, so its activity will not be shown.
Now, let's make some changes in demo1: add a new active customer 4, update customer 2 to inactive, and change customer 3 to active, as shown below:
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo1
4,A NNN F 33 San Francisco
2,F TTT F 22 XXX
3,A HHH M 56 MMM
Then, input some Customer Activity (demo2):
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo2
1,store/5426/whats-new
2,ip/product-page/product-Desc 16503225
3,ip/product-page/product-Desc 9988334
4,search/?query=battery
4,cp/Gift-Cards
3,account/trackorder
Finally, the output shows the activity of all customers that are currently active in memory (customers 1, 3, and 4).
[kafka@localhost kafka_2.11-0.10.1.0]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-output
(4,(A NNN F 33 San Francisco,search/?query=battery ))
(4,(A NNN F 33 San Francisco,cp/Gift-Cards ))
(3,(A HHH M 56 MMM,ip/product-page/product-Desc 9988334))
(3,(A HHH M 56 MMM,account/trackorder))
(1,(A AAA F 23 Union City,store/5426/whats-new ))
I hope you were able to follow the use case. If you have any questions, please mail me; I would be glad to help.