Large-Scale Stream Processing in the Hadoop Ecosystem
Gyula Fóra
gyfora@apache.org
Márton Balassi
mbalassi@apache.org
This talk
§ Stream processing by example
§ Open source stream processors
§ Runtime architecture and programming model
§ Counting words…
§ Fault tolerance and stateful processing
§ Closing
Apache: Big Data Europe, 2015-09-28
Stream processing by example
Streaming applications
ETL style operations
• Filter incoming data, log analysis
• High throughput, connectors, at-least-once processing
Window aggregations
• Trending tweets, user sessions, stream joins
• Window abstractions
[Figure: Input → Process/Enrich]
Streaming applications
Machine learning
• Fitting trends to the evolving stream, stream clustering
• Model state, cyclic flows
Pattern recognition
• Fraud detection, triggering signals based on activity
• Exactly-once processing
Open source stream processors
Apache Streaming landscape
Apache Storm
§ Started in 2010, development driven by BackType, then Twitter
§ Pioneer in large-scale stream processing
§ Distributed dataflow abstraction (spouts & bolts)
Apache Flink
§ Started in 2008 as a research project (Stratosphere) at European universities
§ Unique combination of low-latency streaming and high-throughput batch analysis
§ Flexible operator states and windowing
[Figure: stream data (Kafka, RabbitMQ, ...) and batch data (HDFS, JDBC, ...)]
Apache Spark
§ Started in 2009 at UC Berkeley, Apache since 2013
§ Very strong community, wide adoption
§ Unified batch and stream processing over a batch runtime
§ Good integration with batch programs
Apache Samza
§ Developed at LinkedIn, open sourced in 2013
§ Builds heavily on Kafka’s log-based philosophy
§ Pluggable messaging system and execution backend
System comparison
Property         Storm            Spark             Samza               Flink
Streaming model  Native           Micro-batching    Native              Native
API              Compositional    Declarative       Compositional       Declarative
Fault tolerance  Record ACKs      RDD-based         Log-based           Checkpoints
Guarantee        At-least-once    Exactly-once      At-least-once       Exactly-once
State            Only in Trident  State as DStream  Stateful operators  Stateful operators
Windowing        Not built-in     Time based        Not built-in        Policy based
Latency          Very low         Medium            Low                 Low
Throughput       Medium           High              High                High
Runtime and programming model
Native Streaming
Distributed dataflow runtime
§ Storm, Samza and Flink
§ General properties
• Long-standing operators
• Pipelined execution
• Usually possible to create cyclic flows
Pros
• Full expressivity
• Low-latency execution
• Stateful operators
Cons
• Fault-tolerance is hard
• Throughput may suffer
• Load balancing is an issue
Distributed dataflow runtime
§ Storm
• Dynamic typing + Kryo
• Dynamic topology rebalancing
§ Samza
• Almost every component pluggable
• Full task isolation, no backpressure (buffering handled by the messaging layer)
§ Flink
• Strongly typed streams + custom serializers
• Flow control mechanism
• Memory management
Micro-batching
Micro-batch runtime
§ Implemented by Apache Spark
§ General properties
• Computation broken down to time intervals
• Load-aware scheduling
• Easy interaction with batch
Pros
• Easy to reason about
• High throughput
• Fault tolerance comes for “free”
• Dynamic load balancing
Cons
• Latency depends on batch size
• Limited expressivity
• Stateless by nature
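The point that latency depends on batch size can be illustrated with a small, framework-free sketch. It is not Spark code: the class and method names are hypothetical, and timestamps are assumed to be non-negative milliseconds. Each event is assigned to the interval that contains it, and a record cannot be processed before its interval closes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy model of micro-batching, not Spark's scheduler.
class MicroBatchSketch {

    // Groups event timestamps (ms) by the start of the batch interval that contains them.
    static Map<Long, List<Long>> toBatches(List<Long> timestampsMs, long batchMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            long batchStart = (ts / batchMs) * batchMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // Time a record waits until its batch closes and processing can begin.
    static long waitUntilBatchCloses(long ts, long batchMs) {
        return ((ts / batchMs) + 1) * batchMs - ts;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 950L, 1000L, 2400L);
        System.out.println(toBatches(events, 1000L));
        System.out.println("wait for event at 100 ms: " + waitUntilBatchCloses(100L, 1000L) + " ms");
    }
}
```

With a 1000 ms interval, the record that arrives at 100 ms waits 900 ms before its batch is even handed to the engine; shrinking the interval reduces that wait at the price of more, smaller batches.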
Programming model
Declarative
§ Expose a high-level API
§ Operators are higher-order functions on abstract data stream types
§ Advanced behavior such as windowing is supported
§ Query optimization
Compositional
§ Offer basic building blocks for composing custom operators and topologies
§ Advanced behavior such as windowing is often missing
§ Topology needs to be hand-optimized
Programming model
DStream, DataStream
§ Transformations abstract operator details
§ Suitable for engineers and data analysts
Spout, Consumer, Bolt, Task, Topology
§ Direct access to the execution graph / topology
§ Suitable for engineers
Counting words…
WordCount
Input:
storm budapest flink
apache storm spark
streaming samza storm
flink apache flink
bigdata storm
flink streaming
Output:
(storm, 4)
(budapest, 1)
(flink, 4)
(apache, 2)
(spark, 1)
(streaming, 2)
(samza, 1)
(bigdata, 1)
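Before looking at the frameworks, the WordCount task itself can be written as a plain, single-process Java sketch. This is only a baseline for comparison and uses no streaming library; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process rolling word count, not tied to any framework.
class RollingWordCountSketch {

    // Consumes the lines in order and returns the running counts after the last line.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                counts.merge(word, 1, Integer::sum); // increment the running count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "storm budapest flink", "apache storm spark", "streaming samza storm",
            "flink apache flink", "bigdata storm", "flink streaming");
        System.out.println(count(lines));
    }
}
```

The framework versions that follow distribute exactly this per-word update across parallel tasks.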
Storm
Assembling the topology
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new SentenceSpout(), 5);
    builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
    // fieldsGrouping sends equal words to the same Counter task
    builder.setBolt("count", new Counter(), 12)
           .fieldsGrouping("split", new Fields("word"));
Rolling word count bolt
    // Imports omitted, as on the slide.
    public class Counter extends BaseBasicBolt {
        // Running counts for the words routed to this task (in memory only)
        Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }

        // Required by BaseBasicBolt; not shown on the slide
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }
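The in-memory map in the bolt is only correct if every occurrence of a word reaches the same Counter task, which is what grouping by the "word" field provides. A minimal sketch of the underlying idea, hash partitioning by key, is shown here. It is not Storm's implementation; the class and method names are hypothetical, and the task count of 12 simply mirrors the parallelism hint above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: hash partitioning by key, the idea behind grouping a stream by a field.
class FieldsGroupingSketch {

    // Maps a key to one of numTasks downstream task indices; equal keys always map to the same task.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "flink", "storm", "spark", "storm");
        for (String w : words) {
            System.out.println(w + " -> task " + taskFor(w, 12));
        }
    }
}
```

A shuffle grouping, by contrast, spreads tuples without regard to their content, which is fine for the stateless splitter but would scatter the counts for one word over several tasks.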
Samza
Rolling word count task
    // Imports omitted, as on the slide. The store field must be initialized before
    // process() is called; that step is not shown on the slide.
    public class WordCountTask implements StreamTask {
        private KeyValueStore<String, Integer> store;

        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String word = (String) envelope.getMessage(); // getMessage() returns Object
            Integer count = store.get(word);
            if (count == null) { count = 0; }
            count = count + 1;
            store.put(word, count);
            // Emit the updated count, as the Storm bolt does
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
        }
    }
Flink
Rolling word count
    case class Word (word: String, frequency: Int)

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .groupBy("word").sum("frequency")
      .print()
Window word count
    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
      .groupBy("word").sum("frequency")
      .print()
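To make the "5 second window, evaluated every 1 second" idea concrete, here is a framework-free Java sketch of a sliding window count. It is not Flink code. The Event type, the sample data and the choice of half-open windows [end - size, end) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: sliding window word count over timestamped events.
class SlidingWindowSketch {

    // Hypothetical event type: a word observed at a given time in milliseconds.
    static final class Event {
        final long timeMs;
        final String word;
        Event(long timeMs, String word) { this.timeMs = timeMs; this.word = word; }
    }

    // Counts words whose timestamp falls in [windowEndMs - sizeMs, windowEndMs).
    static Map<String, Integer> countWindow(List<Event> events, long windowEndMs, long sizeMs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Event e : events) {
            if (e.timeMs >= windowEndMs - sizeMs && e.timeMs < windowEndMs) {
                counts.merge(e.word, 1, Integer::sum);
            }
        }
        return counts;
    }

    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event(500, "storm"), new Event(1500, "flink"),
            new Event(4200, "storm"), new Event(6100, "flink"));
    }

    public static void main(String[] args) {
        // Evaluate a 5 s window every 1 s (the slide interval).
        for (long end = 1000; end <= 7000; end += 1000) {
            System.out.println("window ending at " + end + " ms: " + countWindow(sampleEvents(), end, 5000));
        }
    }
}
```

Unlike the rolling count, old events drop out of the result once the window has moved past them, so counts can go down as well as up.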
Spark
[Code not preserved in the extracted text: window word count; rolling word count (kind of)]
Fault tolerance and stateful processing
Fault tolerance intro
§ Fault-tolerance in streaming systems is inherently harder than in batch
• Can’t just restart computation
• State is a problem
• Fast recovery is crucial
• Streaming topologies run 24/7 for a long period
§ Fault-tolerance is a complex issue
• No single point of failure is allowed
• Guaranteeing input processing
• Consistent operator state
• Fast recovery
• At-least-once vs Exactly-once semantics
Storm record acknowledgements
§ Track the lineage of tuples as they are processed (anchors and acks)
§ Special “acker” bolts track each lineage DAG (efficient XOR-based algorithm)
§ Replay the root of failed (or timed out) tuples
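The XOR-based tracking can be sketched in a few lines of plain Java. This is a simplified illustration of the idea, not Storm's acker: the class and method names are hypothetical, and the small fixed ids stand in for the random 64-bit ids a real system would use. One value is kept per tuple tree; every tuple id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when the whole tree has been processed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: one XOR value per tuple tree, regardless of how large the tree grows.
class XorAckerSketch {
    private final Map<Long, Long> ackVals = new HashMap<>();

    // The source registers the root tuple of a new tree.
    void init(long rootId) {
        ackVals.put(rootId, rootId);
    }

    // An operator acks one input tuple and, in the same step, reports the ids of the
    // tuples it anchored to that input. Returns true once the whole tree is processed.
    boolean ack(long rootId, long inputId, long... anchoredIds) {
        long update = inputId;
        for (long id : anchoredIds) {
            update ^= id;
        }
        long value = ackVals.getOrDefault(rootId, 0L) ^ update;
        if (value == 0L) {
            ackVals.remove(rootId); // tree complete, nothing left to track
            return true;
        }
        ackVals.put(rootId, value);
        return false;
    }

    public static void main(String[] args) {
        XorAckerSketch acker = new XorAckerSketch();
        acker.init(7L);                                // sentence tuple with id 7
        System.out.println(acker.ack(7L, 7L, 11L, 13L)); // splitter acks it, emits words 11 and 13
        System.out.println(acker.ack(7L, 11L));          // counter acks word 11
        System.out.println(acker.ack(7L, 13L));          // counter acks word 13, tree is done
    }
}
```

If the value has not reached zero before a timeout, the root tuple is replayed, which is why the guarantee is at-least-once rather than exactly-once.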
Samza offset tracking
§ Exploits the properties of a durable, offset-based messaging layer
§ Each task maintains its current offset, which moves forward as it processes elements
§ The offset is checkpointed and restored on failure (some messages might be repeated)
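Why some messages might be repeated can be seen in a toy simulation. This is not Samza code: the log is a plain list, the checkpoint interval and crash position are parameters invented for the example, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a task reading a durable log and checkpointing its offset periodically.
class OffsetCheckpointSketch {

    // Processes log entries starting at 'from', checkpointing after every 'checkpointEvery'
    // messages, and "crashes" just before handling index 'crashAt' (-1 means no crash).
    // Returns the last checkpointed offset, which is where a restarted task resumes.
    static int run(List<String> log, int from, int checkpointEvery, int crashAt, List<String> processed) {
        int checkpoint = from;
        for (int offset = from; offset < log.size(); offset++) {
            if (offset == crashAt) {
                return checkpoint; // progress since the last checkpoint is forgotten
            }
            processed.add(log.get(offset));
            if ((offset + 1) % checkpointEvery == 0) {
                checkpoint = offset + 1;
            }
        }
        return log.size();
    }

    // Runs once with a crash, then restarts from the checkpoint and runs to the end.
    static List<String> runWithOneCrash(List<String> log, int checkpointEvery, int crashAt) {
        List<String> processed = new ArrayList<>();
        int resumeAt = run(log, 0, checkpointEvery, crashAt, processed);
        run(log, resumeAt, checkpointEvery, -1, processed);
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(runWithOneCrash(log, 2, 3)); // "c" is processed twice
    }
}
```

Every message is processed at least once, but anything handled between the last checkpoint and the crash is handled again after recovery.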
Flink checkpointing
§ Based on consistent global snapshots
§ Algorithm designed for stateful dataflows (minimal runtime overhead)
§ Exactly-once semantics
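The core reason a consistent snapshot yields exactly-once state updates can be shown with a deliberately reduced sketch: a single operator whose snapshot stores the input position together with a copy of the state taken at that position. This is not Flink's algorithm, which coordinates snapshots across a distributed dataflow; the names and the single-operator setting are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: restoring state and input position together after a failure.
class CheckpointSketch {

    // A snapshot pairs an input offset with a copy of the operator state at that offset.
    static final class Snapshot {
        final int offset;
        final Map<String, Integer> counts;
        Snapshot(int offset, Map<String, Integer> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts);
        }
    }

    // Counts words, snapshotting every 'checkpointEvery' records and failing once
    // just before the record at index 'crashAt' (-1 means no failure).
    static Map<String, Integer> countWithFailure(List<String> words, int checkpointEvery, int crashAt) {
        Map<String, Integer> counts = new HashMap<>();
        Snapshot last = new Snapshot(0, counts);
        boolean crashed = false;
        int offset = 0;
        while (offset < words.size()) {
            if (!crashed && offset == crashAt) {
                crashed = true;
                counts = new HashMap<>(last.counts); // roll the state back
                offset = last.offset;                // rewind the input to the matching position
                continue;
            }
            counts.merge(words.get(offset), 1, Integer::sum);
            offset++;
            if (offset % checkpointEvery == 0) {
                last = new Snapshot(offset, counts);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "flink", "storm", "spark", "storm");
        System.out.println(countWithFailure(words, 2, 3));
        System.out.println(countWithFailure(words, 2, -1));
    }
}
```

Records between the snapshot and the failure are read again, but because the state was rolled back to the same point, each record affects the final counts exactly once.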
Spark RDD recomputation
§ Immutable data model with repeatable computation
§ Failed RDDs are recomputed using their lineage
§ Checkpoint RDDs to reduce lineage length
§ Parallel recovery of failed RDDs
§ Exactly-once semantics
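Lineage-based recovery can be illustrated with a toy dataset type that remembers its parent and the deterministic step that produced it. This is not Spark's RDD API; the Dataset class, its methods and the integer-only data are assumptions made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative only: recomputing a lost result from its lineage.
class LineageSketch {

    static final class Dataset {
        private final Dataset parent;              // null for the durable input
        private final List<Integer> input;         // only set when parent == null
        private final UnaryOperator<Integer> step; // deterministic transformation
        private List<Integer> materialized;        // in-memory result, may be lost

        private Dataset(Dataset parent, List<Integer> input, UnaryOperator<Integer> step) {
            this.parent = parent;
            this.input = input;
            this.step = step;
        }

        static Dataset fromInput(List<Integer> input) {
            return new Dataset(null, input, null);
        }

        Dataset map(UnaryOperator<Integer> step) {
            return new Dataset(this, null, step);
        }

        // Simulates losing the in-memory copy of this dataset.
        void lose() {
            materialized = null;
        }

        // Returns the data, recomputing it from the parent chain if it is missing.
        List<Integer> get() {
            if (materialized == null) {
                if (parent == null) {
                    materialized = new ArrayList<>(input);
                } else {
                    List<Integer> out = new ArrayList<>();
                    for (Integer x : parent.get()) {
                        out.add(step.apply(x));
                    }
                    materialized = out;
                }
            }
            return materialized;
        }
    }

    // Builds input -> doubled -> plusOne, loses both derived datasets, then recomputes.
    static List<Integer> recoverExample() {
        Dataset doubled = Dataset.fromInput(Arrays.asList(1, 2, 3)).map(x -> x * 2);
        Dataset plusOne = doubled.map(x -> x + 1);
        plusOne.get();
        doubled.lose();
        plusOne.lose();
        return plusOne.get();
    }

    public static void main(String[] args) {
        System.out.println(recoverExample());
    }
}
```

Because the data is immutable and each step is repeatable, the recomputed result is identical to the lost one. Checkpointing a dataset corresponds to cutting this parent chain so that recovery does not have to walk all the way back to the input.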
State in streaming programs
§ Almost all non-trivial streaming programs are stateful
§ Stateful operators (in essence):
$f : (\mathit{in},\ \mathit{state}) \longrightarrow (\mathit{out},\ \mathit{state}')$
§ State hangs around and can be read and modified as the stream evolves
§ Goal: get as close as possible to this model while maintaining scalability and fault-tolerance
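The operator signature above maps directly onto code. The following Java sketch threads a state value through a stream of inputs, using a running sum as the example operator. The class and method names are hypothetical, and a Map.Entry is used as a simple pair of (out, new state).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;
import java.util.function.BiFunction;

// Illustrative only: a stateful operator as a function (in, state) -> (out, state').
class StatefulOperatorSketch {

    // Applies f to each input in order, feeding the new state into the next call.
    // The Entry key is the output, the Entry value is the updated state.
    static <I, S, O> List<O> run(List<I> inputs, S initial, BiFunction<I, S, Entry<O, S>> f) {
        List<O> outputs = new ArrayList<>();
        S state = initial;
        for (I in : inputs) {
            Entry<O, S> result = f.apply(in, state);
            outputs.add(result.getKey());
            state = result.getValue();
        }
        return outputs;
    }

    // Example operator: emit the running sum, which is also the new state.
    static List<Integer> runningSum(List<Integer> inputs) {
        BiFunction<Integer, Integer, Entry<Integer, Integer>> f =
            (in, sum) -> new SimpleEntry<Integer, Integer>(sum + in, sum + in);
        return run(inputs, 0, f);
    }

    public static void main(String[] args) {
        System.out.println(runningSum(Arrays.asList(1, 2, 3, 4)));
    }
}
```

The hard part in a distributed system is not this loop but keeping the state variable scalable and recoverable, which is what the following slides compare.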
Storm
§ State is available only in the Trident API
§ Dedicated operators for state updates and queries
§ State access methods
• stateQuery(…)
• partitionPersist(…)
• persistentAggregate(…)
§ It’s very difficult to implement transactional states
§ Exactly-once guarantee
Spark
§ Stateless runtime by design
• No continuous operators
• UDFs are assumed to be stateless
§ State can be generated as a separate stream of RDDs: updateStateByKey(…)
$f : (\mathrm{Seq}[\mathit{in}_k],\ \mathit{state}_k) \longrightarrow \mathit{state}'_k$
§ $f$ is scoped to a specific key
§ Exactly-once semantics
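The per-key update function can be sketched without Spark as follows. This only mimics the shape of the computation: the class and method names are hypothetical, a micro-batch is a plain list of words, and the state is an ordinary map that is replaced rather than mutated on every batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: state as a sequence of immutable per-batch maps, updated key by key.
class UpdateStateByKeySketch {

    // f for word count: all new values of one key plus its previous state (null if none).
    static int update(List<Integer> newValues, Integer previous) {
        int sum = previous == null ? 0 : previous;
        for (int v : newValues) {
            sum += v;
        }
        return sum;
    }

    // Applies one micro-batch and returns the next state; the old state is left untouched.
    static Map<String, Integer> applyBatch(Map<String, Integer> state, List<String> batch) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : batch) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(grouped.keySet());
        Map<String, Integer> next = new HashMap<>();
        for (String key : keys) {
            List<Integer> values = grouped.getOrDefault(key, Collections.<Integer>emptyList());
            next.put(key, update(values, state.get(key)));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> s0 = new HashMap<>();
        Map<String, Integer> s1 = applyBatch(s0, Arrays.asList("storm", "flink", "storm"));
        Map<String, Integer> s2 = applyBatch(s1, Arrays.asList("storm", "spark"));
        System.out.println(s1);
        System.out.println(s2);
    }
}
```

Each batch produces a new state map from the previous one, which is why state here behaves like one more stream of immutable datasets rather than a mutable field inside a long-running operator.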
Samza
§ Stateful dataflow operators (any task can hold state)
§ State changes are stored as a log by Kafka
§ Custom storage engines can be plugged into the log
§ $f$ is scoped to a specific task
§ At-least-once processing semantics
Flink
§ Stateful dataflow operators (conceptually similar to Samza)
§ Two state access patterns
• Local (Task) state
• Partitioned (Key) state
§ Proper API integration
• Java: OperatorState interface
• Scala: mapWithState, flatMapWithState…
§ Exactly-once semantics by checkpointing
Performance
§ Throughput/Latency
• The cost of a network hop is 25+ ms
• 1 million records/sec/core is nice
§ Size of Network Buffers/Batching
§ Buffer Timeout
§ Cost of Fault Tolerance
§ Operator chaining/Stages
§ Serialization/Types
Closing
Comparison revisited
Property         Storm            Spark             Samza               Flink
Streaming model  Native           Micro-batching    Native              Native
API              Compositional    Declarative       Compositional       Declarative
Fault tolerance  Record ACKs      RDD-based         Log-based           Checkpoints
Guarantee        At-least-once    Exactly-once      At-least-once       Exactly-once
State            Only in Trident  State as DStream  Stateful operators  Stateful operators
Windowing        Not built-in     Time based        Not built-in        Policy based
Latency          Very low         Medium            Low                 Low
Throughput       Medium           High              High                High
Summary
§ Streaming applications and stream processors are very diverse
§ Two main runtime designs
• Dataflow based (Storm, Samza, Flink)
• Micro-batch based (Spark)
§ The best framework depends on application-specific needs
§ But high-level APIs are nice :)
Thank you!
List of Figures (in order of usage)
§ https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/CPT-FSM-abcd.svg/326px-CPT-FSM-abcd.svg.png
§ https://storm.apache.org/images/topology.png
§ https://databricks.com/wp-content/uploads/2015/07/image11-1024x655.png
§ https://databricks.com/wp-content/uploads/2015/07/image21-1024x734.png
§ https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf, page 2.
§ http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014, pages 69-71.
§ http://samza.apache.org/img/0.9/learn/documentation/container/checkpointing.svg
§ https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
§ https://storm.apache.org/documentation/images/spout-vs-state.png
§ http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job.png
