Streaming Analytics Tutorial
Ashish Gupta (LinkedIn)
Neera Agarwal
Streaming Analytics
Before doing this tutorial, please read the main presentation:
http://www.slideshare.net/NeeraAgarwal2/streaming-analytics
Technical Requirements
● OS: Mac OS X
● Programming Language: Scala 2.10.x
● Open source software used in tutorial: Kafka and Spark 1.6.2
Tutorials
[Architecture diagram] Producers (Wiki Edit Events, Ad Events, Click Events) → Kafka (Wiki topic, Ads topic, Clicks topic) → Spark Streaming → Consumers (WikiPedia Article Edit Metrics for Tutorial 1; Impression & Click Metrics for Tutorial 2)
Step 1: Installation
At the conference we provided a USB stick with the environment ready for the
tutorials. Here are the instructions to create your own environment:
1. Check Java: java -version
java version "1.8.0_92" (Note: Java 1.7+ should work.)
2. Check Maven: mvn -v
If not installed, check instructions in the Additional slides at the end.
Step 1: Installation
3. Install Scala
curl -O http://downloads.lightbend.com/scala/2.10.6/scala-2.10.6.tgz
tar -xzf scala-2.10.6.tgz
Set SCALA_HOME to the path of the scala-2.10.6 folder. For example, on Mac:
export SCALA_HOME=/Users/<username>/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin
4. Install Spark
curl -O http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
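The downloaded Spark archive still needs to be extracted, mirroring the Scala step (the deck omits this command):
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz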
Step 1: Installation
5. Install Kafka
curl -O http://apache.claz.org/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz
tar -xzf kafka_2.10-0.10.0.0.tgz
6. Download tutorial
https://github.com/NeeraAgarwal/kdd2016-streaming-tutorial
Step 2: Start Kafka
Start a terminal window and start ZooKeeper:
> cd kafka_2.10-0.10.0.0
> bin/zookeeper-server-start.sh config/zookeeper.properties
Wait for ZooKeeper to start. Then, in another terminal window, start Kafka:
> cd kafka_2.10-0.10.0.0
> bin/kafka-server-start.sh config/server.properties
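To verify the broker is up, you can list its topics (a standard Kafka 0.10 command; the tutorial's topics are created automatically on first write, assuming the broker's default auto-create setting):
> bin/kafka-topics.sh --list --zookeeper localhost:2181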
Tutorial 1:
Bot and Human edit counts on Wikipedia Edit stream
Step 1: Listening to WikiPedia Edit Stream
Start a new terminal window and run the WikiPedia connector:
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector
After some messages appear, stop it with CTRL-C. We will run it again after
writing the streaming code.
If it does not run, build the package and try again:
> mvn package
Step 1: WikiPedia Stream message structure
[[-acylglycerol O-acyltransferase]] MB
https://en.wikipedia.org/w/index.php?diff=733783045&oldid=721976415 * BU RoBOT * (-1) /* References
*/Sort into more specific stub template based on presence in [[Category:EC 2.3]] or subcategories (Task
25)
[[City Building]] https://en.wikipedia.org/w/index.php?diff=733783047&oldid=732314994 * Hmains * (+9)
refine category structures
[[Wikipedia:Articles for deletion/Log/2016 August 10]] B
https://en.wikipedia.org/w/index.php?diff=733783051&oldid=733783026 * Cyberbot
Fields: title, flags, diffUrl, user, byteDiff, summary
Flags (2nd field): ‘M’=Minor, ‘N’ = New, ‘!’ = Unpatrolled, ‘B’ = Bot Edit
WikiPedia Stream pattern = """\[\[(.*)\]\]\s(.*)\s(.*)\s\*\s(.*)\s\*\s\(\+?(.\d*)\)\s(.*)""".r
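As an illustration, applying the pattern (with the backslashes that the slide rendering stripped restored above) to the first sample message extracts the six fields:

val pattern = """\[\[(.*)\]\]\s(.*)\s(.*)\s\*\s(.*)\s\*\s\(\+?(.\d*)\)\s(.*)""".r
val line = "[[-acylglycerol O-acyltransferase]] MB " +
  "https://en.wikipedia.org/w/index.php?diff=733783045&oldid=721976415 " +
  "* BU RoBOT * (-1) /* References */Sort into more specific stub template"
line match {
  case pattern(title, flags, diffUrl, user, byteDiff, summary) =>
    // title = "-acylglycerol O-acyltransferase", flags = "MB",
    // user = "BU RoBOT", byteDiff = "-1"
    println(s"$user edited '$title' ($byteDiff bytes)")
  case _ => println("no match")
}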
Spark: RDD
An RDD is an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
RDDs can contain objects of any type, including Python, Java, Scala, or user-
defined classes.
RDDs offer two types of operations:
● Transformations construct a new RDD from a previous one.
● Actions compute a result based on an RDD, and either return it to the driver program or
save it to an external storage system.
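A minimal illustration of the two operation types, assuming a SparkContext sc as in the spark-shell:

val nums = sc.parallelize(Seq(1, 2, 3, 4)) // RDD[Int]
val doubled = nums.map(_ * 2)              // transformation: lazily builds a new RDD
val total = doubled.reduce(_ + _)          // action: computes 20 and returns it to the driver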
Spark: DStream
A DStream is a sequence of data arriving over time.
Internally, each DStream is represented as a sequence of RDDs arriving at each time step.
DStreams offer two types of operations:
● Transformations yield a new DStream.
● Output operations write data to an external system.
Ref: https://spark.apache.org/docs/latest/streaming-programming-guide.html
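The same split for DStreams, as a minimal sketch (a hypothetical word count over a socket source, not part of the tutorial code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamDemo")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batches
val counts = ssc.socketTextStream("localhost", 9999)     // input DStream
  .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // transformations
counts.print()                                      // output operation
ssc.start()
ssc.awaitTermination()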
Step 2: Write Code
In kdd2016-streaming-tutorial
Change the code in the src/main/scala/example/WikiPediaStreaming.scala file,
using your favorite editor.
val lines = messages.foreachRDD { rdd =>
// ADD CODE HERE
}
Note: At the conference, participants were asked to write this code themselves; the GitHub repository provides the full code.
Step 2: Continued
// ...continued inside the foreachRDD { rdd => ... } block:
val linesDF = rdd.map(row => row._2 match {
  case pattern(title, flags, diffUrl, user, byteDiff, summary) =>
    WikiEdit(title, flags, diffUrl, user, byteDiff.toInt, summary)
  case _ => WikiEdit("title", "flags", "diffUrl", "user", 0, "summary")
}).filter(row => row.title != "title").toDF()
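The pattern match above assumes a WikiEdit case class with one field per capture group, and toDF() assumes an SQLContext with its implicits in scope. A minimal sketch consistent with the constructor calls (the actual definitions live in the tutorial repository):

case class WikiEdit(title: String, flags: String, diffUrl: String,
                    user: String, byteDiff: Int, summary: String)

// toDF() requires, e.g.:
// val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
// import sqlContext.implicits._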
Step 2 Continued
// Number of records in the 10-second window.
val totalCnt = linesDF.count()
// Number of bot-edited records in the 10-second window.
val botEditCnt = linesDF.filter("flags like '%B%'").count()
// Number of human-edited records in the 10-second window.
val humanEditCnt = linesDF.filter("flags not like '%B%'").count()
val botEditPct = if (totalCnt > 0) 100 * botEditCnt / totalCnt else 0
val humanEditPct = if (totalCnt > 0) 100 * humanEditCnt / totalCnt else 0
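Note that the counts are Longs and the percentages use integer division, so they come out as whole numbers. The deck does not show the reporting step; a minimal sketch that prints the window's metrics (the format is an assumption):

println("Total: %d, Bot: %d (%d%%), Human: %d (%d%%)"
  .format(totalCnt, botEditCnt, botEditPct, humanEditCnt, humanEditPct))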
Step 3: Build Program
Start a new terminal window.
> cd kdd2016-streaming-tutorial
> mvn package
Step 4: Run Programs
Run WikiPediaConnector in a terminal window. It receives data from the
Wikipedia IRC channel and writes it to Kafka.
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector
Run WikiPediaStream in a new terminal window.
> cd kdd2016-streaming-tutorial
> ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.WikiPediaStreaming target/streamingtutorial-1.0.0-jar-with-dependencies.jar
Output
Tutorial 2:
Impression & Click metrics on Ad and Click streams
Tutorials
[Architecture diagram, repeated] Producers (Wiki Edit Events, Ad Events, Click Events) → Kafka (Wiki topic, Ads topic, Clicks topic) → Spark Streaming → Consumers (WikiPedia Edit Metrics for Tutorial 1; Impression & Click Metrics for Tutorial 2)
Step 1: Listening to Ads and Clicks stream
Run the program that replays Ad and Click events from a file:
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.AdClickEventReplay
After some messages appear, stop it with CTRL-C. We will run it again after
writing the streaming code.
If it does not run, build the package and try again:
> mvn package
Step 1: Ad and Click Event message structure
Ad Event:
QueryID, AdId, TimeStamp: 6815, 48195, 1470632477761
Click Event:
QueryID, ClickId, TimeStamp: 6815, 93630, 1470632827088
We join the two streams on QueryID and show metrics by AdId.
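For the sample events above, the join on QueryID 6815 pairs the ad impression with its click, conceptually producing this DStream element (using the case classes defined in the tutorial code):

(6815, (AdEvent(6815, 48195, 1470632477761), ClickEvent(6815, 93630, 1470632827088)))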
Step 2: Write Code
In kdd2016-streaming-tutorial
Change the code in the src/main/scala/example/AdEventJoiner.scala file.
val adEventDStream = adStream.transform( rdd => {
rdd.map(line => line._2.split(",")).
map(row => (row(0).trim.toInt, AdEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong)))
})
// ADD CODE HERE..
Note: At the conference, participants were asked to write this code themselves; the GitHub repository provides the full code.
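The constructor calls assume AdEvent and ClickEvent case classes mirroring the message fields; a minimal sketch consistent with the code above (the actual definitions live in the tutorial repository):

case class AdEvent(queryId: Int, adId: Int, timeStamp: Long)
case class ClickEvent(queryId: Int, clickId: Int, timeStamp: Long)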
Step 2 Continued
//Connects Spark Streaming to Kafka Topic and gets DStream of RDDs (click event message)
val clickStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
kafkaParams, clickStreamTopicSet)
//Create a new DStream by extracting kafka message and converting it to DStream[queryId, ClickEvent]
val clickEventDStream = clickStream.transform{ rdd =>
rdd.map(line => line._2.split(",")).
map(row => (row(0).trim.toInt, ClickEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong)))
}
Step 2 Continued
// Join adEvent and clickEvent DStreams and output DStream[queryId, (adEvent, clickEvent)]
val joinByQueryId = adEventDStream.join(clickEventDStream)
joinByQueryId.print()
// Transform to DStream[adId, count(adId)] for each RDD
val countByAdId = joinByQueryId.map { case (_, (adEvent, _)) => (adEvent.adId, 1) }.reduceByKey(_ + _)
Step 2 Continued
// Update the state [adId, cumulativeCount(adId)] with values from subsequent RDDs
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum
val previousCount = state.getOrElse(0)
Some(currentCount + previousCount)
}
val countByAdIdCumm = countByAdId.updateStateByKey(updateFunc)
// Transform the (key, value) pairs to (adId, count(adId), cumulativeCount(adId))
val ad = countByAdId.join(countByAdIdCumm).map {case (adId, (count, cumCount)) => (adId, count, cumCount)}
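updateStateByKey keeps state across batches, which requires checkpointing to be enabled on the StreamingContext before the job starts (the directory here is a placeholder):

ssc.checkpoint("/tmp/kdd-streaming-checkpoint")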
Step 2 Continued
//Print report
ad.foreachRDD { rdd =>
  println("%5s %10s %12s".format("AdId", "AdCount", "AdCountCumm"))
  rdd.foreach(row => println("%5s %10s %12s".format(row._1, row._2, row._3)))
}
Step 3: Build Program
> cd kdd2016-streaming-tutorial
> mvn package
Step 4: Run Programs
Run AdClickEventReplay in a terminal window. It reads data from the Ad and
Click event files and writes it to Kafka.
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar
example.AdClickEventReplay
Run AdEventJoiner in a new terminal window.
> cd kdd2016-streaming-tutorial
> ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.AdEventJoiner
target/streamingtutorial-1.0.0-jar-with-dependencies.jar
Output
Contact Us
Ashish Gupta - ahgupta@linkedin.com
https://www.linkedin.com/in/guptash
Neera Agarwal - neera8work@gmail.com
https://www.linkedin.com/in/neera-agarwal-21b9473
Additional Notes - Install Java (Mac 10.11)
● java -version
java version "1.8.0_92"
If the java command is not found, add this line to your .bash_profile:
export JAVA_HOME=$(/usr/libexec/java_home)
(Install Java: https://java.com/en/download/help/mac_install.xml)
Additional Notes - Install Maven
● Check Maven in a terminal window
○ mvn -v
○ Apache Maven 3.2.5+
● Install Maven
○ brew install maven
OR, if you do not have brew:
1. curl -O http://mirror.nexcess.net/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
2. tar -xzvf apache-maven-3.3.9-bin.tar.gz

