Introduction to Spark Streaming
Real time processing on Apache Spark
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Real time analytics in Big data
● Unification
● Spark streaming
● DStream
● DStream and RDD
● Stream processing
● DStream transformation
● Hands on
3 Vs of Big data
● Volume
○ TBs and PBs of files
○ Drives the need for batch processing systems
● Velocity
○ TBs of stream data
○ Drives the need for stream processing systems
● Variety
○ Structured, semi-structured and unstructured data
○ Drives the need for SQL and graph processing systems
Velocity
● Speed at which we
○ Collect the data
○ Process it to get insights
● More and more big data analytics is becoming real time
● Primary drivers
○ Social media
○ IoT
○ Mobile applications
Use cases
● Twitter needs to crunch its full stream of tweets in real time to publish trending topics
● Credit card companies need to crunch millions of transactions/s to identify fraud
● Mobile applications like WhatsApp need to constantly crunch logs to monitor service availability and performance
Real Time analytics
● Ability to collect and process TBs of streaming data to get insights
● Data is consumed from one or more streams
● Historical data needs to be combined with real time data
● Ability to stream data to downstream applications
Stream processing using M/R
● Map/Reduce is inherently a batch processing system, which makes it unsuitable for streaming
● Reading input from disk and writing output back to disk adds latency to the processing
● Stream processing needs chains of transformations, which cannot be expressed effectively in M/R
● The overhead of launching a new M/R job is too high
Apache Storm
● Apache Storm is a stream processing system that is commonly deployed alongside Hadoop
● Apache Storm has its own APIs and does not use Map/Reduce
● Its core processes one message at a time; micro batching is built on top of it (Trident)
● Open-sourced by Twitter
Limitations of Streaming on Hadoop
● M/R is not suitable for streaming
● Apache Storm requires learning new APIs and a new paradigm
● No way to combine batch results from M/R with Apache Storm streams
● Maintaining two runtimes is always hard
Unified Platform for Big Data Apps
[Figure: Apache Spark as one engine for batch, interactive, and streaming workloads, running on top of Hadoop, Mesos, and NoSQL stores]
Spark streaming
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● It collects stream data into small batches and runs batch processing on each one
● A batch interval can be as small as 1s or as large as multiple hours
● Spark job creation and execution overhead is low enough that a batch can be processed in under a second
● The resulting sequence of batches is called a DStream
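The micro-batch idea can be modelled without Spark. The sketch below is plain Scala, not Spark's implementation: it assigns timestamped records to fixed batch intervals, which is the grouping step described above. The `Event` type, the `discretize` function, and the use of an arrival time in milliseconds are assumptions made for illustration.

```scala
object MicroBatch {
  // A record tagged with its arrival time in milliseconds (hypothetical type).
  final case class Event(arrivalMs: Long, payload: String)

  // Assign every event to the batch interval that contains its arrival time.
  // Returns (batchStartMs, payloads) pairs ordered by batch start.
  // Intervals in which nothing arrived are simply absent from the result.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => (e.arrivalMs / batchIntervalMs) * batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, inBatch) => (start, inBatch.sortBy(_.arrivalMs).map(_.payload)) }
}
```

With a 1000 ms interval, events arriving at 100, 900, 1200 and 5100 ms fall into three batches starting at 0, 1000 and 5000 ms. Each of those batches would then be handed to an ordinary batch job.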
Discretized streams (DStream)
The input stream is divided into multiple discrete batches. The batch interval is configurable.
[Figure: the input stream flows into Spark Streaming, which emits batch @ t1, batch @ t2, batch @ t3]
DStream
● Discretized streams
● The input stream is converted into small discrete batches
● The batch interval can range from 1s to multiple minutes
● A DStream can be constructed from
○ Sockets
○ Kafka
○ HDFS
○ Custom receivers
DStream to RDD
[Figure: the input stream flows into Spark Streaming as batch @ t1, batch @ t2, batch @ t3; each batch corresponds to one RDD: RDD @ t1, RDD @ t2, RDD @ t3]
DStream to RDD
● Each batch of a DStream is represented as an RDD underneath
● These RDDs are replicated in the cluster for fault tolerance
● Every DStream operation results in an RDD transformation
● There are APIs to access these RDDs directly
● Stream and batch processing can be combined
DStream transformation
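import org.apache.spark.streaming.{Seconds, StreamingContext}

// args(0) is the Spark master URL; use at least local[2] so that the
// receiver and the processing each get a thread. Batches are 5 seconds long.
val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))
// Read lines of text from a socket on localhost:50050
val lines = ssc.socketTextStream("localhost", 50050)
// Split each line of every batch into words
val words = lines.flatMap(_.split(" "))
words.print()          // print the first elements of each batch
ssc.start()            // start receiving and processing
ssc.awaitTermination() // block until the stream is stopped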
[Figure: the socket stream flows into Spark Streaming as batch @ t1, batch @ t2, batch @ t3; each batch is an RDD (RDD @ t1, RDD @ t2, RDD @ t3), and flatMap turns each of them into a FlatMapRDD for the same interval (FlatMapRDD @ t1, FlatMapRDD @ t2, FlatMapRDD @ t3)]
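The figure's point is that one DStream operation becomes the same transformation applied to every batch independently. A minimal plain-Scala model of that idea follows, with a `Seq` standing in for each RDD. The names `Batch`, `flatMapStream` and `splitWords` are hypothetical and are not part of the Spark API.

```scala
object PerBatchFlatMap {
  // A stream is modelled as an ordered sequence of batches; each batch
  // (a plain Seq here) plays the role of one RDD.
  type Batch[A] = Seq[A]

  // Apply the same flatMap to every batch on its own. No data moves
  // between batches, so batch @ t1 only ever produces output @ t1.
  def flatMapStream[A, B](stream: Seq[Batch[A]])(f: A => Seq[B]): Seq[Batch[B]] =
    stream.map(batch => batch.flatMap(f))

  // Split a line on single spaces, mirroring _.split(" ") from the slide.
  def splitWords(line: String): Seq[String] = line.split(" ").toSeq
}
```

The number of output batches always equals the number of input batches; only the contents of each batch change.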
Socket stream
● Can listen to a socket on any remote machine
● Needs a host and port to be configured
● Both raw and text representations of the socket data are available
● Built-in retry mechanism
Wordcount example
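The hands-on code for this slide is not part of the transcript. As a stand-in, the sketch below shows what a per-batch word count computes, using plain Scala collections rather than Spark. It is not the author's example. In Spark Streaming the same counting is usually written on a DStream as `words.map(w => (w, 1)).reduceByKey(_ + _)`. The names `countWords` and `countStream` are hypothetical.

```scala
object BatchWordCount {
  // Count the words of a single batch: split lines, drop empty tokens,
  // group equal words together and take the size of each group.
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Each batch is counted on its own; nothing carries over between batches.
  def countStream(stream: Seq[Seq[String]]): Seq[Map[String, Int]] =
    stream.map(countWords)
}
```

Because the counts restart with every batch, a word seen in two consecutive batches is reported twice with separate counts. Keeping a running total needs the stateful operations covered later in the deck.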
File Stream
● A file stream tracks new files in a given directory on HDFS
● Whenever a new file appears, Spark Streaming picks it up
● Only new files are processed; modifications to existing files are ignored
● New files are detected using the file modification time
FileStream example
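The hands-on code for this slide is not part of the transcript. In Spark Streaming a directory is typically watched with `ssc.textFileStream(directory)`. The sketch below does not call Spark. It is a simplified plain-Scala model of the selection rule from the previous slide (only files not seen before are picked up, chosen by timestamp), and it should not be read as Spark's actual tracking logic. `FileInfo`, `Tracker` and `nextBatch` are hypothetical names, and the timestamp is assumed to be the file's modification time, as described in the Spark Streaming programming guide.

```scala
object NewFileTracker {
  // Minimal description of a file in the watched directory (hypothetical type).
  final case class FileInfo(path: String, modifiedMs: Long)

  // Paths already processed, plus the end time of the previous batch.
  final case class Tracker(seen: Set[String], lastBatchEndMs: Long)

  // Pick files whose timestamp falls in (lastBatchEndMs, batchEndMs] and that
  // were not picked up before. A file that was already seen is ignored even
  // if it has been modified since.
  def nextBatch(tracker: Tracker, listing: Seq[FileInfo], batchEndMs: Long): (Seq[String], Tracker) = {
    val fresh = listing
      .filter(f => f.modifiedMs > tracker.lastBatchEndMs && f.modifiedMs <= batchEndMs)
      .filterNot(f => tracker.seen.contains(f.path))
      .map(_.path)
    (fresh, Tracker(tracker.seen ++ fresh, batchEndMs))
  }
}
```

Returning a new `Tracker` instead of mutating the old one keeps the model in line with the deck's later point that state moves from one immutable value to the next.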
Receiver architecture
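[Figure: components shown are the Spark cluster, the streaming application (driver), the receiver, the block manager, the job generator and the DStream transformations; the receiver receives data and stores it as blocks in the block manager, and the blocks of one interval form the mini-batch RDD]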
Stateful operations
● Ability to maintain arbitrary state across multiple batches
● Fault tolerant
● Exactly-once semantics
● WAL (Write Ahead Log) for receiver crashes
StatefulWordcount example
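The hands-on code for this slide is not part of the transcript. The sketch below is a plain-Scala model of a running word count, not the author's example. Its `updateCount` function has the same shape as the update function that Spark Streaming's `updateStateByKey` expects, `(Seq[V], Option[S]) => Option[S]`. The `step` and `runningCounts` helpers are hypothetical and only simulate how that function is applied batch after batch. In a real Spark job a checkpoint directory must also be set with `ssc.checkpoint(...)` before stateful operations can be used.

```scala
object StatefulWordCount {
  // Per-key update: fold this batch's new values into the previous state.
  // `previous` is None when the key has not been seen before.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + newValues.sum)

  // Apply the update function to one batch of words, producing the next state.
  // Keys missing from the batch are updated with an empty Seq, so they keep
  // their previous count.
  def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (word, ws) => word -> ws.map(_ => 1) }
    (state.keySet ++ newValues.keySet).toSeq.flatMap { key =>
      updateCount(newValues.getOrElse(key, Seq.empty[Int]), state.get(key))
        .map(count => key -> count)
    }.toMap
  }

  // Running counts as they stand after each batch.
  def runningCounts(stream: Seq[Seq[String]]): Seq[Map[String, Int]] =
    stream.scanLeft(Map.empty[String, Int])(step).tail
}
```

Returning `None` from the update function is how a key would be dropped from the state. This sketch never does so, which means the state grows with every new word.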
How do stateful operations work?
● Generally, state is maintained by mutating it in place
● In functional programming, state is instead represented as a state machine that goes from one state to the next
fn(oldState, newInfo) => newState
● In Spark, state is represented as an RDD
● A change in state is represented as a transformation of RDDs
● The fault tolerance of RDDs provides the fault tolerance of state
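The fn(oldState, newInfo) => newState idea, and why it helps with recovery, can be sketched in plain Scala. This is a model of the principle and not Spark's recovery mechanism. `Stats`, `next`, `history` and `recover` are hypothetical names. Because `next` is a pure function, a lost state can be rebuilt from any earlier state plus the batches that came after it.

```scala
object ImmutableState {
  // The state is an immutable value; it is never modified in place.
  final case class Stats(count: Long, max: Int)

  // fn(oldState, newInfo) => newState, where newInfo is one batch of readings.
  def next(old: Stats, batch: Seq[Int]): Stats =
    Stats(old.count + batch.size, (old.max +: batch).max)

  // Every batch yields a new state derived from the previous one.
  // The result includes the initial state followed by one state per batch.
  def history(initial: Stats, stream: Seq[Seq[Int]]): Seq[Stats] =
    stream.scanLeft(initial)(next)

  // Rebuild the latest state from an earlier known-good state by replaying
  // the batches that arrived after it.
  def recover(lastGood: Stats, replay: Seq[Seq[Int]]): Stats =
    replay.foldLeft(lastGood)(next)
}
```

Replaying from a saved state gives exactly the state that the uninterrupted run would have reached, which is the property the slide attributes to RDD-based state.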
Transform API
● In stream processing, the ability to combine stream data with batch data is extremely important
● The batch API and the stream API share the RDD as their abstraction
● The transform API of DStream gives direct access to the underlying RDDs
Ex: Combine customer sales data with customer information
CartCustomerJoin example
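The hands-on code for this slide is not part of the transcript. In Spark Streaming, `DStream.transform` takes a function from RDD to RDD, and that is where a join against a static RDD would go. The sketch below does not use Spark and is not the author's example. It models the same join per batch with plain Scala collections. The `Sale` and `Customer` types and both functions are hypothetical.

```scala
object CartCustomerJoin {
  // Hypothetical record types for illustration.
  final case class Sale(customerId: Int, amount: Double)
  final case class Customer(id: Int, name: String)

  // Inner join of one batch of sales with static customer data.
  // Sales whose customer id is unknown are dropped.
  def joinBatch(sales: Seq[Sale], customers: Map[Int, Customer]): Seq[(String, Double)] =
    for {
      sale     <- sales
      customer <- customers.get(sale.customerId).toList
    } yield (customer.name, sale.amount)

  // The static data is the same for every batch; only the sales change.
  def joinStream(stream: Seq[Seq[Sale]], customers: Map[Int, Customer]): Seq[Seq[(String, Double)]] =
    stream.map(batch => joinBatch(batch, customers))
}
```

An inner join is only one option. Keeping unmatched sales with a placeholder name would correspond to a left outer join instead.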
Window based operations
Window wordcount
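The hands-on code for this slide is not part of the transcript. In Spark Streaming a windowed count is usually expressed with `reduceByKeyAndWindow`, given a window duration and a slide duration that are both multiples of the batch interval. The sketch below is a plain-Scala model, not the author's example: it measures the window and the slide in whole batches. `windowedCounts` is a hypothetical name. One simplification to note is that Scala's `sliding` can emit a shorter final window when the stream ends.

```scala
object WindowWordCount {
  // Count words over a sliding window measured in batches.
  // windowLength: how many consecutive batches one window covers.
  // slide: how many batches the window advances between results.
  def windowedCounts(stream: Seq[Seq[String]], windowLength: Int, slide: Int): Seq[Map[String, Int]] =
    stream
      .sliding(windowLength, slide)
      .map { window =>
        window.flatten.groupBy(identity).map { case (word, ws) => word -> ws.size }
      }
      .toList
}
```

With a window of 2 batches sliding by 1, every batch except the first and last contributes to two consecutive results.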
References
● http://www.slideshare.net/pacoid/qcon-so-paulo-realtime-analytics-with-spark-streaming
● http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
● https://spark.apache.org/docs/latest/streaming-programming-guide.html
