Streaming Analytics
Ashish Gupta (LinkedIn)
Neera Agarwal
What is a Data Stream
● Unbounded Data
● Data arriving continuously at high rate
● Too large to first store and then process
● Needs to be processed in one pass
● May display Temporal Locality - patterns may evolve over time
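As an illustration of one-pass processing, the sketch below keeps a running mean and variance over a stream using Welford's method, so no event has to be stored. The `event_source` generator and its values are hypothetical stand-ins for an unbounded stream.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass mean and variance (Welford's method), O(1) memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; defined as 0.0 before any data arrives.
        return self.m2 / self.count if self.count else 0.0


def event_source():
    """Hypothetical stand-in for an unbounded stream of measurements."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for value in event_source():
    stats.update(value)  # each event is seen once and then discarded
```

The state is three numbers regardless of how many events arrive, which is the property a one-pass algorithm needs.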
Contrast with Batch Processing
1. Processes bounded files, grouped by ingestion time: last 15 minutes, last 1 hour, last 1 day.
2. The program can go back and forth in the data and do multipass processing.
3. Sessions and joins can span files.
4. Many machine learning algorithms need the full batch of data to train.
5. Very high latency, but very high throughput as well.
a. Must wait for files to arrive, i.e. wait for the file window to close.
b. Processing the whole file(s) takes time.
Streaming Applications
● Joining Clicks and Impressions
● Mobile applications - User activity
● Session based analysis
● Fraud detection
● Industrial IoT
● LinkedIn’s Streaming Standardization Platform
Lambda Architecture
[Diagram: New Data flows into a Speed Layer (Streaming System, Speed DB) and a Batch Layer (Batch System on Hadoop, Batch DB); the Application is served from both.]
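A minimal sketch of how an application might answer a query in a Lambda Architecture, assuming a simple page-view count as the metric. The names `batch_view`, `speed_view`, `on_event`, `query` and `on_batch_complete` are hypothetical, and resetting the whole speed view when a batch finishes is a simplification.

```python
from collections import Counter

# Batch layer output: complete but stale counts up to the last batch run.
batch_view = Counter({"home": 1000, "jobs": 250})

# Speed layer output: only events that arrived after the last batch run.
speed_view = Counter()


def on_event(page: str) -> None:
    """Speed layer: update the real-time view for each new event."""
    speed_view[page] += 1


def query(page: str) -> int:
    """Serving side: combine the batch view and the speed view."""
    return batch_view[page] + speed_view[page]


def on_batch_complete(new_batch_view: Counter) -> None:
    """A finished batch run replaces the batch view. The speed view is
    cleared on the assumption that the batch now covers those events."""
    batch_view.clear()
    batch_view.update(new_batch_view)
    speed_view.clear()


for page in ["home", "jobs", "home", "feed"]:
    on_event(page)
```

The cost of this design is that the same logic has to exist twice, once in the batch system and once in the streaming system.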
Streaming Systems Architecture
[Diagram with three stages. Sources: New Data published to Kafka. Process and Analyze: Streaming System (Spark, Flink, Storm, Samza, Kafka Streams), with the log also saved to Hadoop. Store and Serve: results published to Kafka for SQL and NoSQL DB, Real-time Consumers, and other Consumers.]
What is Streaming Analytics
“Continuous processing on unbounded data”
“Software that can filter, aggregate, enrich, and analyze a high throughput of data
from multiple disparate live data sources and in any data format to identify simple
and complex patterns to visualize business in real-time, detect urgent situations,
and automate immediate actions.” - Forrester
Streaming Concepts
● Time: Event Time, Processing Time
● Window: Fixed Window, Sliding Window, Sessions
● Order: Delayed data, Out of order data
● Correctness: Consistency, At least Once, Exactly Once, Checkpointing
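The interaction between checkpointing and at-least-once delivery can be sketched as follows. This is an illustrative simulation, not any framework's API: `LOG`, `run_with_crash` and the in-memory `store` are hypothetical, and the store is assumed to survive the crash (as an external DB would). Deduplicating on an event id is one common way to get an exactly-once result on top of at-least-once delivery.

```python
LOG = [("e1", 10), ("e2", 20), ("e3", 30), ("e4", 40)]  # (event_id, amount)


def run_with_crash(dedupe: bool) -> int:
    # Output store that survives the crash (assumption: e.g. an external DB).
    store = {"total": 0, "seen_ids": set()}
    checkpoint = 0  # offset to resume from after a restart

    def process(offset: int) -> None:
        event_id, amount = LOG[offset]
        if dedupe and event_id in store["seen_ids"]:
            return  # replayed duplicate, already applied
        store["seen_ids"].add(event_id)
        store["total"] += amount

    # First run: offsets 0..2 are processed, but the checkpoint is only
    # written after offset 1, and then the job "crashes".
    for offset in range(3):
        process(offset)
        if offset == 1:
            checkpoint = offset + 1

    # Restart: resume from the checkpoint, so offset 2 is delivered again.
    for offset in range(checkpoint, len(LOG)):
        process(offset)
    return store["total"]
```

Without deduplication the replayed event is counted twice (130 instead of 100); with it, the replay is harmless.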
Streaming Concepts - Time
[Diagram: New Data flows through Kafka into the Streaming System. Event Time is marked at the source, Ingestion Time at Kafka, and Processing Time at the Streaming System.]
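To make the distinction between the timestamps concrete, the sketch below processes events in arrival order and flags those that are too far behind in event time. The watermark rule used here (largest event time seen so far minus an allowed lateness) is an assumption chosen for illustration, and the `Event` type and `classify` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    key: str
    event_time: int      # when the event happened at the source
    ingestion_time: int  # when it was appended to the log


def classify(events: List[Event], allowed_lateness: int) -> List[Tuple[str, bool]]:
    """Return (key, is_late) per event, in arrival (processing) order."""
    max_event_time: Optional[int] = None
    result = []
    for e in events:
        if max_event_time is None:
            late = False
        else:
            watermark = max_event_time - allowed_lateness
            late = e.event_time < watermark
        result.append((e.key, late))
        if max_event_time is None or e.event_time > max_event_time:
            max_event_time = e.event_time
    return result


arrivals = [
    Event("a", event_time=100, ingestion_time=101),
    Event("b", event_time=105, ingestion_time=106),
    Event("c", event_time=98, ingestion_time=107),  # out of order, tolerated
    Event("d", event_time=90, ingestion_time=108),  # delayed beyond lateness
]
flags = classify(arrivals, allowed_lateness=10)
```

Event `c` arrives out of order but within the allowed lateness; event `d` is too old and would need separate handling, such as dropping it or correcting an earlier result.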
Streaming Concepts - Windows
● Fixed Window / Tumbling Window
● Sliding Window
● Session Window
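The three window types can be sketched as functions that map event timestamps to windows. This is a minimal illustration with integer timestamps and half-open `[start, end)` windows; the function names are hypothetical, and the session function sorts a bounded list, whereas a real streaming system would merge sessions incrementally.

```python
from typing import Iterable, List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """The single fixed window [start, end) that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """Every window of length `size`, starting every `slide`, that contains ts."""
    last_start = ts - ts % slide
    # Keep only starts greater than ts - size, so that start + size > ts.
    starts = range(last_start, ts - size, -slide)
    return sorted((s, s + size) for s in starts)


def session_windows(timestamps: Iterable[int], gap: int) -> List[Tuple[int, int]]:
    """Group timestamps into sessions; a new session starts when the
    time since the previous event exceeds `gap`."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts  # extend the current session
        else:
            sessions.append([ts, ts])
    return [(first, last) for first, last in sessions]
```

An event belongs to exactly one tumbling window, to `size / slide` sliding windows when `size` is a multiple of `slide`, and to one session whose bounds depend on the data itself.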
Open Source Streaming Systems

                     Spark                Flink                Storm          Samza          Kafka Streams
Processing Model     Mini Batch           Event level          Event level    Event level    Event level
Guarantee            Exactly Once         Exactly Once         At least once  At least once  At least once
State Management     Yes                  Yes                  No             Yes            Yes
Latency              Medium               Low                  Low            Low            Low
Built in primitives  Batch and streaming  Batch and streaming  Low Level API  Low level API  Streaming only
Back Pressure        Yes                  Yes                  No             via Kafka      via Kafka
Case Study: LinkedIn Standardization Platform
[Diagram: a user edits their LinkedIn Profile; the Standardizer standardizes fields (Company, Geo, Title, Education, Seniority, ….) for downstream products (Search, Feed, Job Recommendation, …..).]
Pattern: External Lookup / Stream to Table Join
Choose where to look up the data based on the size of the data, latency needs, and the QPS of external systems
[Diagram: lookup options for the Streaming System: Internal Cache, External Cache, External DB, Service.]
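One way to picture the trade-off is an internal cache in front of an external table, so that repeated keys do not add load to the external system. The sketch below is an assumption-laden illustration: `LruCache`, `ExternalDb`, `enrich` and the sample company data are all hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Small in-process (internal) cache with least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class ExternalDb:
    """Hypothetical external table; counts calls to show the load on it."""

    def __init__(self, rows: dict) -> None:
        self._rows = rows
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self._rows.get(key)


def enrich(events, db: ExternalDb, cache: LruCache):
    """Stream-to-table join: attach the company name to each event."""
    out = []
    for event in events:
        key = event["company_id"]
        name = cache.get(key)
        if name is None:
            name = db.get(key)  # cache miss: go to the external system
            if name is not None:
                cache.put(key, name)
        out.append({**event, "company_name": name})
    return out


db = ExternalDb({"c1": "Acme Corp", "c2": "Globex"})
cache = LruCache(capacity=2)
events = [{"company_id": k} for k in ["c1", "c2", "c1", "c1"]]
enriched = enrich(events, db, cache)
```

Four events cause only two external calls here. A cached value can go stale if the table changes, which is part of the latency versus freshness decision.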
Pattern: Stream to Stream Joining
● Joins are expensive
● If the partitions of the two streams are not co-located, the join requires an expensive shuffle
● Use a broadcast join if one side of the join is small
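A minimal sketch of a windowed join between impressions and clicks, assuming events arrive roughly in timestamp order and are keyed by an ad id. The function names, the tuple layout, and the use of CRC32 for partitioning are illustrative assumptions; the point of `partition_for` is only that both streams must use the same key, function, and partition count to be co-located.

```python
import zlib
from typing import Iterable, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioner. If both streams use the same key, the same
    function and the same partition count, matching records land in the
    same partition and no shuffle is needed."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def join_clicks_to_impressions(
    events: Iterable[Tuple[str, str, int]], window: int
) -> List[Tuple[str, int, int]]:
    """events: (kind, ad_id, ts) in arrival order.
    Emits (ad_id, impression_ts, click_ts) when a click arrives within
    `window` time units of the impression for the same ad_id."""
    impressions = {}  # join state: ad_id -> impression timestamp
    joined = []
    for kind, ad_id, ts in events:
        # Expire state that can no longer match, to keep memory bounded.
        expired = [k for k, t in impressions.items() if ts - t > window]
        for k in expired:
            del impressions[k]
        if kind == "impression":
            impressions[ad_id] = ts
        elif kind == "click" and ad_id in impressions:
            joined.append((ad_id, impressions[ad_id], ts))
    return joined


stream = [
    ("impression", "a1", 0),
    ("impression", "a2", 5),
    ("click", "a1", 8),     # within the window: joins
    ("click", "a2", 100),   # impression already expired: no join
    ("click", "a3", 101),   # no impression seen: no join
]
matches = join_clicks_to_impressions(stream, window=30)
```

The buffered `impressions` state is what makes the join expensive: it grows with the window length and the event rate.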
Pattern: Reprocessing
[Diagram: DB Log Events and a DB Snapshot reach the Streaming System (Samza) through Databus and Kafka. The Standardizer with ML Models processes Title, Geo, Company, Education ….. and publishes to Kafka for Hadoop, SQL and NoSQL DB, Real-time Consumers, and other Consumers.]
Pattern: Reprocessing
[Diagram: the same pipeline as on the previous slide, with the input annotated "rewind 2 hours" so that the Standardizer with ML Models reprocesses recent events.]
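The rewind step can be sketched with an in-memory log that maps a timestamp to an offset and replays from there. This is a simulation under stated assumptions, not Samza or Kafka code: `Log`, `standardize_v1` and `standardize_v2` are hypothetical, timestamps are seconds, and records are assumed to be appended in non-decreasing timestamp order.

```python
import bisect


class Log:
    """Append-only log with a timestamp per record."""

    def __init__(self) -> None:
        self._timestamps = []
        self._records = []

    def append(self, ts: int, record: str) -> None:
        self._timestamps.append(ts)
        self._records.append(record)

    def offset_for_time(self, ts: int) -> int:
        """First offset whose timestamp is >= ts."""
        return bisect.bisect_left(self._timestamps, ts)

    def read_from(self, offset: int):
        """Yield (offset, record) pairs starting at `offset`."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


def standardize_v1(title: str) -> str:
    return title.strip().lower()


def standardize_v2(title: str) -> str:
    """Hypothetical improved model that also expands an abbreviation."""
    cleaned = title.strip().lower()
    return {"sw engineer": "software engineer"}.get(cleaned, cleaned)


log = Log()
for ts, title in [(0, " SW Engineer"), (3600, "Data Scientist"),
                  (7200, "sw engineer"), (10800, "Manager")]:
    log.append(ts, title)

# Initial processing with the old model; output is keyed by offset.
output = {off: standardize_v1(rec) for off, rec in log.read_from(0)}

# Deploy the new model, rewind 2 hours from "now", and reprocess.
now = 10800
start = log.offset_for_time(now - 2 * 3600)
for off, rec in log.read_from(start):
    output[off] = standardize_v2(rec)  # overwrite, so replays are idempotent
```

Because the output is keyed by offset, reprocessing overwrites earlier results instead of duplicating them. Records older than the rewind point keep the old model's output.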
Apache Kafka
● Highly scalable messaging system
● Distributed commit log
● Developed at LinkedIn in 2010
● At LinkedIn: more than 1.4 trillion messages per day across more than 1400 brokers
● Distributed, partitioned, replicated
● Message retention - based on time and size
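Retention by time and size can be pictured with a single partition modeled as an append-only log. The sketch below is a simplification for illustration only: `PartitionLog` is hypothetical, and it expires individual messages, whereas Kafka applies retention to whole log segments.

```python
from collections import deque


class PartitionLog:
    """One partition as an append-only log with time and size retention."""

    def __init__(self, max_age: int, max_bytes: int) -> None:
        self.max_age = max_age
        self.max_bytes = max_bytes
        self._entries = deque()  # (offset, timestamp, payload)
        self._size = 0
        self.next_offset = 0

    def append(self, ts: int, payload: bytes) -> int:
        """Append a message and return its offset."""
        offset = self.next_offset
        self._entries.append((offset, ts, payload))
        self._size += len(payload)
        self.next_offset += 1
        self._enforce_retention(now=ts)
        return offset

    def _enforce_retention(self, now: int) -> None:
        # Drop from the head while the oldest entry is too old
        # or the log as a whole is too large.
        while self._entries and (
            now - self._entries[0][1] > self.max_age
            or self._size > self.max_bytes
        ):
            _, _, old_payload = self._entries.popleft()
            self._size -= len(old_payload)

    def earliest_offset(self) -> int:
        """Oldest offset still readable by consumers."""
        return self._entries[0][0] if self._entries else self.next_offset


plog = PartitionLog(max_age=100, max_bytes=10)
plog.append(0, b"aaaa")
plog.append(10, b"bbbb")
plog.append(20, b"cccc")           # size limit exceeded: offset 0 is dropped
after_size = plog.earliest_offset()
plog.append(200, b"dd")            # offsets 1 and 2 are now too old
after_time = plog.earliest_offset()
```

Offsets keep increasing even after old messages are dropped, so a consumer that falls behind the earliest offset has lost data and must restart from what remains.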
Some Kafka use cases
● Queuing/Messaging
● Metrics
● Auditing
● Logging
[Diagram: Producers send messages to the Brokers of a Kafka Cluster; Consumers fetch messages from the Brokers.]
References
● MillWheel: http://research.google.com/pubs/pub41378.html
● DataFlow: http://research.google.com/pubs/pub43864.html
● Samza: http://samza.apache.org/
● Spark Streaming Paper: Discretized streams
● Big Data Stream Mining (KDD’15)
● Models and Issues in Data Stream Systems
Contact Us
Ashish Gupta - ahgupta@linkedin.com
https://www.linkedin.com/in/guptash
Neera Agarwal - neera8work@gmail.com
https://www.linkedin.com/in/neera-agarwal-21b9473
