Introducing
Apache Flink™
@StephanEwen
Flink’s Recent History
Timeline: April 2014, Dec 2014, April 2015
Milestone: Top-Level Project graduation
Releases: 0.5, 0.6, 0.7, 0.9-m1, 0.9
What is Apache Flink?
3
Libraries and integrations: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP), Zeppelin
Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, YARN, Tez, Embedded
A Top-Level Project of the Apache Software Foundation
Program compilation
4
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") {
      (path, edge) => Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}
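To make clear what the iteration on this slide computes, here is a minimal plain-Scala sketch of the same transitive closure step on in-memory sets. It does not use Flink; `TransitiveClosureSketch`, `step`, and `closure` are hypothetical names chosen for illustration, and a `Set` stands in for `distinct()`.

```scala
// Plain-Scala sketch (no Flink dependency); Path mirrors the case class on the slide.
case class Path(from: Long, to: Long)

object TransitiveClosureSketch {
  // One step: extend every known path by one edge, keep the old paths, deduplicate.
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from          // corresponds to where("to").equalTo("from")
    } yield Path(path.from, edge.to)
    extended ++ paths                  // union; the Set removes duplicates
  }

  // Apply the step a fixed number of times, like iterate(10) on the slide.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))
}
```

For the chain 1 -> 2 -> 3 -> 4, `TransitiveClosureSketch.closure(edges, 10)` also contains the derived paths 1 -> 3, 2 -> 4, and 1 -> 4.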
[Figure: program compilation. Pre-flight (Client): type extraction stack, optimizer. Master: task scheduling, dataflow metadata; deploys operators and tracks intermediate results. The Program becomes a Dataflow Graph that is independent of batch or streaming job. Example plan: DataSource orders.tbl, Filter, Map; DataSource lineitem.tbl; Join (Hybrid Hash: build HT / probe, hash-part [0] on both inputs); GroupRed (sort, forward).]
Native workload support
5
Flink
Workloads: streaming topologies, long batch pipelines, Machine Learning at scale, Graph Analysis
Requirements: low latency, resource utilization, iterative algorithms, mutable state
How can an engine natively support all these workloads?
And what does "native" mean?
E.g.: Non-native iterations
6
[Figure: the Client submits a sequence of Steps, one job per iteration]
for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
}
E.g.: Non-native streaming
7
[Figure: a stream discretizer cuts the Data Stream into a sequence of Jobs]
while (true) {
    // get next few records
    // issue batch job
}
Native workload support
8
Flink
Workloads: streaming topologies, long batch pipelines, Machine Learning at scale, Graph Analysis
Requirements: low latency, resource utilization, iterative algorithms, mutable state
How can an engine natively support all these workloads?
And what does "native" mean?
Ingredients for “native” support
1. Execute everything as streams
   Pipelined execution, backpressure or buffered, push/pull model
2. Special code paths for batch
   Automatic job optimization, fault tolerance
3. Allow some iterative (cyclic) dataflows
4. Allow some mutable state
5. Operate on managed memory
   Make data processing on the JVM robust
9
Stream processing in Flink
10
Stream platform architecture
11
- Gather and backup streams
- Offer streams for consumption
- Provide stream recovery
- Analyze and correlate streams
- Create derived streams and state
- Provide these to downstream systems
[Figure: server logs, transaction (Trxn) logs, and sensor logs flow in as streams; results are provided to downstream systems]
What is a stream processor?
12
Basics
1. Pipelining
2. Stream replay
State
3. Operator state
4. Backup and restore
App development
5. High-level APIs
6. Integration with batch
Large deployments
7. High availability
8. Scale-in and scale-out
See http://data-artisans.com/stream-processing-with-flink.html
Pipelining
13
Basic building block to “keep the data moving”
Note: pipelined systems do not usually transfer individual tuples, but buffers that batch several tuples!
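The note about buffers can be illustrated with a minimal sketch of a sender that ships records downstream in fixed-size batches. This is an illustration only, not Flink's network stack; `BufferingSender`, `capacity`, and `ship` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: records are collected into a buffer and shipped as a batch,
// so the per-transfer cost is paid once per buffer instead of once per tuple.
final class BufferingSender[T](capacity: Int, ship: Vector[T] => Unit) {
  private val buffer = ArrayBuffer.empty[T]

  // Add one record; ship the buffer as soon as it is full.
  def emit(record: T): Unit = {
    buffer += record
    if (buffer.size >= capacity) flush()
  }

  // Ship whatever is buffered, e.g. at end of input.
  def flush(): Unit =
    if (buffer.nonEmpty) {
      ship(buffer.toVector)
      buffer.clear()
    }
}
```

The buffer size is the knob in this sketch: larger buffers amortize transfer cost better, smaller buffers hand records downstream sooner.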
Operator state
- User-defined state
  • Flink transformations (map/reduce/etc.) are long-running operators, feel free to keep around objects
  • Hooks to include in system's checkpoint
- Windowed streams
  • Time, count, data-driven windows
  • Managed by the system (currently WiP)
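As an illustration of user-defined state with checkpoint hooks, the following sketch shows a long-running mapper that keeps a counter map as ordinary object state and exposes snapshot and restore methods. The `SnapshotHooks` trait is a hypothetical interface invented for this example and is not presented as Flink's actual API.

```scala
// Hypothetical hook interface (an assumption for illustration, not Flink's API).
trait SnapshotHooks[S] {
  def snapshotState(): S
  def restoreState(state: S): Unit
}

// A long-running mapper that keeps per-word counts as plain object state.
final class CountingMapper extends SnapshotHooks[Map[String, Long]] {
  private var counts = Map.empty[String, Long]

  // Update the state and emit the running count for the word.
  def map(word: String): (String, Long) = {
    val updated = counts.getOrElse(word, 0L) + 1
    counts = counts.updated(word, updated)
    (word, updated)
  }

  // The map is immutable, so handing it out as a snapshot is safe.
  def snapshotState(): Map[String, Long] = counts

  def restoreState(state: Map[String, Long]): Unit = {
    counts = state
  }
}
```

After restoring a snapshot, the mapper continues counting from the snapshotted values rather than from whatever it had processed since.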
14
Streaming fault tolerance
- Ensure that operators see all events
  • “At least once”
  • Solved by replaying a stream from a checkpoint, e.g., from a past Kafka offset
- Ensure that operators do not perform duplicate updates to their state
  • “Exactly once”
  • Several solutions
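The difference between the two guarantees can be shown with a small simulation, sketched here under simplifying assumptions: the stream is an in-memory log, the operator state is a running sum, and a failure happens after four events. `ReplaySketch`, `Checkpoint`, and `process` are hypothetical names.

```scala
object ReplaySketch {
  // A checkpoint pairs a replay position in the log with the operator state at that position.
  final case class Checkpoint(offset: Int, sum: Long)

  // Apply the events in [fromOffset, untilOffset) to the running sum.
  def process(log: Vector[Int], fromOffset: Int, untilOffset: Int, sum: Long): Long =
    log.slice(fromOffset, untilOffset).foldLeft(sum)(_ + _)

  // Returns (sum after at-least-once recovery, sum after exactly-once recovery, true sum).
  def run(): (Long, Long, Long) = {
    val log = Vector(1, 2, 3, 4, 5, 6)
    val checkpoint = Checkpoint(offset = 2, sum = process(log, 0, 2, 0L))
    // Two more events update the state, then the job fails.
    val stateAtFailure = process(log, 2, 4, checkpoint.sum)

    // Replay only: every event is seen, but events 3 and 4 update the surviving state twice.
    val atLeastOnce = process(log, checkpoint.offset, log.size, stateAtFailure)
    // Replay plus state restore: each event updates the state once.
    val exactlyOnce = process(log, checkpoint.offset, log.size, checkpoint.sum)
    (atLeastOnce, exactlyOnce, log.sum.toLong)
  }
}
```

In this sketch only the recovery that also rolls the state back to the checkpoint ends up with the true sum.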
15
Exactly once approaches
- Discretized streams (Spark Streaming)
  • Treat streaming as a series of small atomic computations
  • “Fast track” to fault tolerance, but does not separate application logic (semantics) from recovery
- MillWheel (Google Cloud Dataflow)
  • State update and derived events committed as atomic transaction to a high-throughput transactional store
  • Needs a very high-throughput transactional store
- Chandy-Lamport distributed snapshots (Flink)
16
Distributed snapshots in Flink
Superimpose the checkpointing mechanism on the execution instead of using the execution as the checkpointing mechanism
17
18
[Figure: JobManager. Register checkpoint barrier on master. Replay will start from here.]
19
[Figure: JobManager. Barriers “push” prior events (assumes in-order delivery in individual channels). Operator states shown: checkpointing starting, checkpointing in progress, checkpointing finished.]
20
[Figure: JobManager and state backup]
Operator checkpointing takes a snapshot of the state after all data prior to the barrier has updated the state. Checkpoints are currently synchronous; incremental and asynchronous checkpoints are WiP.
State backup: pluggable mechanism. Currently either the JobManager (for small state) or a file system (HDFS/Tachyon). In-memory grids are WiP.
21
[Figure: JobManager]
Operators with many inputs need to wait for all barriers to pass before they checkpoint their state
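A minimal sketch of this waiting step, often called barrier alignment, is given below for an operator that sums its inputs. It assumes in-order delivery per channel and, as an assumption not stated on the slide, holds back records that arrive on a channel after its barrier until the barrier has arrived on every channel. `AligningSumOperator`, `Record`, and `Barrier` are hypothetical names, and overlapping checkpoints are not handled.

```scala
import scala.collection.mutable

sealed trait Event
final case class Record(value: Int) extends Event
final case class Barrier(checkpointId: Long) extends Event

// Hypothetical multi-input operator that sums records and aligns barriers.
final class AligningSumOperator(numChannels: Int) {
  private var sum = 0L
  private val blocked = mutable.Set.empty[Int]     // channels whose barrier has arrived
  private val held = mutable.Queue.empty[Int]      // records that belong after the checkpoint
  val snapshots = mutable.LinkedHashMap.empty[Long, Long]

  def receive(channel: Int, event: Event): Unit = event match {
    case Record(v) if blocked(channel) =>
      held.enqueue(v)                              // arrived after this channel's barrier
    case Record(v) =>
      sum += v
    case Barrier(id) =>
      blocked += channel
      if (blocked.size == numChannels) {           // barrier has passed on every input
        snapshots(id) = sum                        // checkpoint the state
        blocked.clear()
        while (held.nonEmpty) sum += held.dequeue() // apply the held-back records
      }
  }

  def currentSum: Long = sum
}
```

With two channels, a record that arrives on channel 0 after its barrier is excluded from the snapshot and only added to the sum once channel 1's barrier has arrived.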
22
[Figure: JobManager and state backup]
State snapshots at sinks signal the successful end of this checkpoint.
At failure, recover the last checkpointed state and restart sources from the last barrier; this guarantees at least once.
Benefits of Flink’s approach
- Data processing does not block
  • Can checkpoint at any interval you like to balance overhead/recovery time
- Separates business logic from recovery
  • Checkpointing interval is a config parameter, not a variable in the program (as in discretization)
- Can support richer windows
  • Session windows, event time, etc.
- Best of all worlds: true streaming latency, exactly-once semantics, and low overhead for recovery
23
DataStream API
24
case class Word(word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
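The streaming word count on this slide uses a window of 5 seconds that is evaluated every second. The sketch below spells out one plausible reading of that sliding window on an in-memory list of timestamped lines, without Flink. `SlidingWordCountSketch` and `windowCounts` are hypothetical names, timestamps are assumed to be non-negative milliseconds, and the exact window alignment in Flink may differ.

```scala
object SlidingWordCountSketch {
  final case class Word(word: String, frequency: Int)

  // (timestampMillis, line) pairs stand in for the socket stream.
  // Returns one (windowEnd, counts) entry per slide; each window covers [end - sizeMs, end).
  def windowCounts(lines: Seq[(Long, String)], sizeMs: Long, slideMs: Long): Vector[(Long, Map[String, Int])] = {
    val words = lines.flatMap { case (ts, line) =>
      line.split(" ").filter(_.nonEmpty).map(w => (ts, Word(w, 1))).toSeq
    }
    if (words.isEmpty) Vector.empty
    else {
      val lastTs = words.map(_._1).max
      // Window ends at slideMs, 2 * slideMs, ... up to the first end strictly after lastTs.
      val ends = Iterator.iterate(slideMs)(_ + slideMs).takeWhile(_ <= lastTs + slideMs)
      ends.map { end =>
        val inWindow = words.collect { case (ts, w) if ts >= end - sizeMs && ts < end => w }
        // groupBy("word").sum("frequency")
        end -> inWindow.groupBy(_.word).map { case (k, ws) => k -> ws.map(_.frequency).sum }
      }.toVector
    }
  }
}
```

Because the windows overlap, a single word contributes to several consecutive results, which is the main difference from the batch word count, where one global result covers the whole input.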
Roadmap
- Short-term (3-6 months)
  • Graduate DataStream API from beta
  • Fully managed window and user-defined state with pluggable backends
  • Table API for streams (towards StreamSQL)
- Long-term (6+ months)
  • Highly available master
  • Dynamic scale in/out
  • FlinkML and Gelly for streams
  • Full batch + stream unification
25
Batch processing
Batch on Streaming
26
Batch Pipelines
27
Batch on Streaming
- Batch programs are a special kind of streaming program
28
Streaming Programs          Batch Programs
Infinite Streams            Finite Streams
Stream Windows              Global View
Pipelined Data Exchange     Pipelined or Blocking Exchange
Batch Pipelines
29
Data exchange (shuffle / broadcast) is mostly streamed
Some operators block (e.g. sorts / hash tables)
Operators Execution Overlaps
30
Memory Management
31
Memory Management
32
Smooth out-of-core performance
33
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
Other features of Flink
There is more…
34
More Engine Features
35
- Automatic Optimization / Static Code Analysis
- Closed Loop Iterations
- Stateful Iterations
[Figure: two execution plans over the same inputs. Plan 1: DataSource orders.tbl, Filter, Map; DataSource lineitem.tbl; Join (Hybrid Hash: build HT / probe, inputs shipped by broadcast and forward); Combine; GroupRed (sort). Plan 2: DataSource orders.tbl, Filter, Map; DataSource lineitem.tbl; Join (Hybrid Hash: build HT / probe, inputs hash-part [0] and hash-part [0]); hash-part [0,1]; GroupRed (sort, forward).]
Closing
36
Apache Flink: community
37
I ♥ Flink, do you?
38
If you find this exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to news@flink.apache.org, following flink.apache.org/blog, and following @ApacheFlink on Twitter
39
flink-forward.org
Bay Area Flink meetup
Tomorrow


Apache Flink Overview at SF Spark and Friends


Editor's Notes

  • #4: Flink is an entire software stack. The heart: a streaming dataflow engine; think of programs as operators and data flows. Kappa architecture: run batch programs on a streaming system. Table API: logical representation, SQL-style. SAMOA: “on-line learners”.
  • #5: Toy program: native transitive closure. Type extraction: the types that go in and out of each operator.
  • #6: Flink is an analytical system. Streaming topology: real-time, low latency. “Native”: built-in support in the system, no workarounds, no black box. Next slide: define native by some “non-native” examples.
  • #7: Used for Machine Learning: run the same job over the data multiple times to come up with parameters for an ML model. This is how you do it when treating the engine as a black box.
  • #8: If you only have a batch processor: do a lot of small batch jobs. LIMITATION: state across the small jobs (batches).
  • #9: Flink is an analytical system. Streaming topology: real-time, low latency. “Native”: built-in support in the system, no workarounds, no black box. Next slide: define native by some “non-native” examples.
  • #10: Corner points / requirements for Flink. Keep data in motion, avoid materialization. Even though it’s a streaming runtime, have special paths for batch: OPTIMIZER, CHECKPOINTING. Make the system aware of cyclic data flows, in a controlled way. Allow operators to have some state, in a controlled way (DELTA-ITERATIONS); relax the “traditional” batch assumption. Flink runs in the JVM, but we want control over memory and do not want to rely on GC.
  • #12: What are the technologies that enable streaming? The open source leaders in this space are Apache Kafka (which solves the integration problem) and Apache Flink (which solves the analytics problem, removing the final barrier). Kafka and Flink combined can remove the batch barriers from the infrastructure, creating a truly real-time analytics platform.
  • #27: structure, different title
  • #38: Other data points: Google (Cloud Dataflow), Hortonworks, Cloudera, Adatao, Concurrent, Confluent. We have been part of this open source movement with Apache Flink. Flink is a streaming dataflow engine that can run in Hadoop clusters. Flink has grown a lot over the past year, both in terms of code and community. We have added domain-specific libraries, a streaming API with streaming backend support, and more. The project is by now a very established Apache project: it has more than 140 contributors (placing it in the top 5 of Apache big data projects), and several companies are starting to experiment with it. At data Artisans we are supporting two production installations (ResearchGate and Bouygues Telecom), and are helping a number of companies that are testing Flink (e.g., Spotify, King.com, Amadeus, and a group at Yahoo). Huawei and Intel have started contributing to Flink, and interest from vendors is picking up (e.g., Adatao, Huawei, Hadoop vendors). All of this is the result of purely organic growth with very little marketing investment from data Artisans.