Community Update & Roadmap 2016
Robert Metzger
@rmetzger_
rmetzger@apache.org
Berlin Apache Flink Meetup,
January 26, 2016
January Community Update
What happened in the last month
What happened?
▪ Google proposed the Dataflow API to the Apache Incubator
▪ Proposal discussions on the mailing list:
  • SQL / StreamSQL support
  • CEP (Complex Event Processing) library
▪ Flink Kinesis Connector
▪ Chengxiang Li added as committer
▪ Discussions about releasing 1.0.0
Now merged to master (1.0-SNAPSHOT)
▪ Savepoints: manual checkpoints for restarting jobs with state
▪ Kafka 0.9.0.0 integration
▪ Job submission through the JobManager web interface
▪ Checkpoint statistics in the JobManager web interface
▪ Streaming examples are now in the binary distribution
Reading List
1. Benchmarking Streaming Computation Engines at Yahoo!
   http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. Receiving metrics from Apache Flink applications
   http://mnxfst.tumblr.com/post/136539620407/receiving-metrics-from-apache-flink-applications
3. Running Apache Flink on Amazon Elastic Mapreduce
   http://themodernlife.github.io/scala/hadoop/hdfs/sclading/flink/streaming/realtime/emr/aws/2016/01/06/running-apache-flink-on-amazon-elastic-mapreduce/
Upcoming talks
▪ FOSDEM Brussels (4 talks) (Jan. 30-31)
▪ Big Data Technology Summit Warsaw (Feb. 25-26)
▪ QCon London (March 7-9)
▪ Hadoop Summit Dublin (2 talks) (April 13-14)
▪ Strata San Jose
▪ Strata London
Global Meetup Community
▪ Brazil-Sao Paulo Apache Flink Meetup
▪ Apache Flink Taiwan User Group
▪ Also new groups in Delhi, Phoenix and Dallas
GitHub stats
▪ 900 stars
Roadmap 2016
What's next?
Overview
▪ SQL / StreamSQL
▪ CEP Library
▪ Managed Operator State
▪ Dynamic Scaling
▪ Miscellaneous
SQL and StreamSQL
SQL / StreamSQL
▪ Structured queries over data sets and streams
▪ Add support for SQL
  • Standard SQL queries over (batch) data sets
  • Continuous StreamSQL queries over data streams
▪ Keep and extend the Table API as the structured query API on data sets and streams
Proposed Architecture
[Figure: architecture diagram]
APIs: Table API; (batch) SQL query; StreamSQL query
Internals: Apache Calcite; standard SQL parser; customized StreamSQL parser; logical plan; optimizer; DataSet program; DataStream program
SQL integration into APIs
// Define Kafka input stream
val stream: DataStream[(String, Double, Int)] =
  env.addSource(new FlinkKafkaConsumer(...))
// Define table environment and register the stream as a table
val tabEnv = new TableEnvironment(env)
tabEnv.registerStream(stream, "myStream", ("ID", "MEASURE", "COUNT"))
// SQL query
val sqlQuery = tabEnv.sql(
  "SELECT ID, MEASURE FROM myStream WHERE COUNT > 17")
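To make the meaning of the query above concrete, here is a minimal plain-Java sketch with no Flink dependency. It only mimics what the query would compute per record; `StreamSqlSketch`, `Row`, and `query` are hypothetical names for illustration and are not part of the proposed Table API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not Flink's API.
final class StreamSqlSketch {

    /** One record of the registered stream: (ID, MEASURE, COUNT). */
    static final class Row {
        final String id;
        final double measure;
        final int count;

        Row(String id, double measure, int count) {
            this.id = id;
            this.measure = measure;
            this.count = count;
        }
    }

    /** Applies "SELECT ID, MEASURE FROM myStream WHERE COUNT > 17" record by record. */
    static List<String> query(List<Row> myStream) {
        List<String> result = new ArrayList<>();
        for (Row row : myStream) {
            if (row.count > 17) {                        // WHERE COUNT > 17
                result.add(row.id + "," + row.measure);  // SELECT ID, MEASURE
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> input = new ArrayList<>();
        input.add(new Row("a", 1.5, 20));
        input.add(new Row("b", 2.5, 17));
        input.add(new Row("c", 3.5, 18));
        System.out.println(query(input)); // prints [a,1.5, c,3.5]
    }
}
```

On an unbounded stream the same filter and projection would run continuously, once per arriving record, instead of over a finite list.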
Complex Event Processing
CEP Library
▪ Complex Event Processing: the analysis of complex patterns such as correlations and sequence detection from multiple sources
▪ Most current systems are not distributed (beyond multi-threading)
▪ Goal: provide an easy-to-use API for CEP, running on a distributed, high-throughput, low-latency engine
CEP Example
[Figure: real-time stock prices (15.1, 15.3, 15.2, 15.5) feed a state machine that emits alerts. State machine labels: Start, "Price drop by at least $0.50", Ignore, Alert.]
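The state machine in this example can be sketched in a few lines of plain Java. This is an illustration under one assumption that the slide does not spell out: a "drop" is measured against the immediately preceding price. `PriceDropDetector` and its methods are hypothetical names, not part of any Flink library.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a tiny two-state machine (Start -> Watching).
final class PriceDropDetector {
    private final double threshold;
    private Double previous = null; // null means "Start": no price seen yet

    PriceDropDetector(double threshold) {
        this.threshold = threshold;
    }

    /** Feeds one price; returns true if it dropped by at least the threshold. */
    boolean onPrice(double price) {
        boolean alert = previous != null && previous - price >= threshold;
        previous = price; // every event, alert or ignored, becomes the new reference
        return alert;
    }

    /** Runs a detector over a price series and collects the prices that raised alerts. */
    static List<Double> alerts(double[] prices, double threshold) {
        PriceDropDetector detector = new PriceDropDetector(threshold);
        List<Double> result = new ArrayList<>();
        for (double price : prices) {
            if (detector.onPrice(price)) {
                result.add(price);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The series from the slide never drops by 0.5, so no alert is raised.
        System.out.println(alerts(new double[] {15.1, 15.3, 15.2, 15.5}, 0.5)); // prints []
    }
}
```

In a distributed setting, one such state machine would be kept per key (for example, per stock symbol).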
Programming API for CEP
// Convert the stream into a CEPStream of Events
CEPStream<Event> cepStream = CEP.from(inputDataStream);
// Grouping
GroupedCEPStream<Event> grouped = cepStream.groupBy("id");
// Window events: by time or, alternatively, by count
WindowedCEPStream timeWindowed = grouped.timeWindow(Time.minutes(10), Time.minutes(1));
WindowedCEPStream countWindowed = grouped.countWindow(10L, 1L);
// Define a pattern to match
CEPStream<Result> resultStream =
    CEP.from(input).groupBy(0).pattern(
        Pattern.<Event>next("e1").where((evt) -> evt.id == 42)
            .followedBy("e2").where((evt) -> evt.id == 1337)
            .within(Time.minutes(10))
    ).select((Map<String, Event> patternElements) ->
        new Result(patternElements.get("e2").timestamp -
                   patternElements.get("e1").timestamp));
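As a rough illustration of what the pattern above is meant to compute, the following plain-Java sketch matches an event with id 42 followed by an event with id 1337 within a time bound and emits the timestamp difference. The matching policy (only the most recent e1 is kept, and each e1 is used at most once) is an assumption made here for brevity; `FollowedBySketch`, `Event`, and `match` are hypothetical names and do not reflect the proposed CEP API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: simplified "e1 followedBy e2 within T" over a single key.
final class FollowedBySketch {

    static final class Event {
        final int id;
        final long timestamp; // milliseconds

        Event(int id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    /** Returns e2.timestamp - e1.timestamp for every match that completes within the bound. */
    static List<Long> match(List<Event> events, long withinMillis) {
        List<Long> results = new ArrayList<>();
        Event pendingE1 = null; // partial match waiting for its e2
        for (Event evt : events) {
            if (evt.id == 42) {
                pendingE1 = evt; // (re)start a partial match
            } else if (evt.id == 1337 && pendingE1 != null) {
                long delta = evt.timestamp - pendingE1.timestamp;
                if (delta <= withinMillis) {
                    results.add(delta);
                }
                pendingE1 = null; // the partial match is consumed or timed out
            }
            // all other events are ignored, as "followedBy" tolerates gaps
        }
        return results;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(42, 0L));
        events.add(new Event(7, 1_000L));
        events.add(new Event(1337, 5_000L));
        System.out.println(match(events, 600_000L)); // prints [5000]
    }
}
```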
DSL for CEP
select e1.id, e1.price
from every e1 = Event(price > 10) → e2 = Event(date == 42) → e3 = Event(price == 10) within 10 seconds
where e1.id == e2.id
▪ No programming required
▪ Potentially integrated with SQL
Managed Operator State
State in Flink (slides 21-22)
[Figure: operator "count tweet impressions". The user function retrieves/sets the count for a tweet ID in the operator's state (impression counts).]
What happens if the job crashes? Loss of data
Solution: Checkpoints (slides 23-24)
[Figure: the same operator, with periodic checkpoints of state to HDFS and restore from HDFS in case of failure.]
This is the current state in Flink!
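The checkpoint-and-restore idea can be sketched in plain Java. This is a toy model, not Flink's implementation: the snapshot is an in-memory copy standing in for a file on HDFS, and `ImpressionCounter` with its methods is a hypothetical name chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: operator state with manual checkpoint and restore.
final class ImpressionCounter {
    private Map<String, Long> counts = new HashMap<>();

    /** Retrieve/set the count for a tweet ID. */
    void onImpression(String tweetId) {
        counts.merge(tweetId, 1L, Long::sum);
    }

    long count(String tweetId) {
        return counts.getOrDefault(tweetId, 0L);
    }

    /** Copies the state; stands in for a periodic write to durable storage. */
    Map<String, Long> checkpoint() {
        return new HashMap<>(counts);
    }

    /** Replaces the state with a previously taken snapshot. */
    void restore(Map<String, Long> snapshot) {
        counts = new HashMap<>(snapshot);
    }

    public static void main(String[] args) {
        ImpressionCounter operator = new ImpressionCounter();
        operator.onImpression("tweet-1");
        operator.onImpression("tweet-1");
        Map<String, Long> snapshot = operator.checkpoint();

        operator.onImpression("tweet-1"); // processed after the checkpoint

        // Simulated crash: a fresh operator instance starts from the snapshot.
        ImpressionCounter recovered = new ImpressionCounter();
        recovered.restore(snapshot);
        System.out.println(recovered.count("tweet-1")); // prints 2
    }
}
```

The third impression is missing after recovery because it arrived after the snapshot; in this toy model it would have to be replayed from the source to bring the count back to 3.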
State on Steroids
[Figure: the same "count tweet impressions" operator and its state (impression counts), extended step by step.]
▪ What if state grows too big? Spill to disk
▪ Checkpointing stalls processing! Async/incremental snapshots
▪ Restore from HDFS in case of failure
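One way to picture an incremental snapshot is to persist only the entries that changed since the previous snapshot and to rebuild the state by replaying those deltas in order. The sketch below assumes exactly that scheme and ignores deletions; it is not a description of how Flink implements the feature, and `IncrementalState` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: delta-based snapshots of a key/value state.
final class IncrementalState {
    private final Map<String, Long> state = new HashMap<>();
    private final Set<String> dirty = new HashSet<>(); // keys changed since last snapshot

    void put(String key, long value) {
        state.put(key, value);
        dirty.add(key);
    }

    /** Returns only the entries changed since the previous snapshot, then resets tracking. */
    Map<String, Long> snapshotDelta() {
        Map<String, Long> delta = new HashMap<>();
        for (String key : dirty) {
            delta.put(key, state.get(key));
        }
        dirty.clear();
        return delta;
    }

    /** Rebuilds the full state by applying deltas from oldest to newest. */
    static Map<String, Long> restore(List<Map<String, Long>> deltas) {
        Map<String, Long> rebuilt = new HashMap<>();
        for (Map<String, Long> delta : deltas) {
            rebuilt.putAll(delta); // newer values overwrite older ones
        }
        return rebuilt;
    }
}
```

After `put("a", 1)`, `put("b", 2)`, a first snapshot, and then `put("a", 5)`, the second delta contains only `a`, so the amount written per snapshot tracks the change rate instead of the total state size.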
Dealing with Dynamic Resources
Streams with varying data rate
[Figure: event rate (events/second) over time. With static resources, one must provision for the max. rate, which leaves idle capacity whenever the rate is lower.]
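A small worked calculation shows how much capacity sits idle under static provisioning. The sketch assumes capacity is fixed at the peak rate and that the rate is sampled at equal intervals; `IdleCapacity` and the sample numbers are hypothetical.

```java
// Illustration only: share of provisioned capacity left unused when sizing for the peak.
final class IdleCapacity {

    /** Returns the idle share in [0, 1] for equally spaced rate samples. */
    static double idleFraction(long[] eventsPerSecond) {
        long max = 0;
        long sum = 0;
        for (long rate : eventsPerSecond) {
            max = Math.max(max, rate);
            sum += rate;
        }
        if (max == 0) {
            return 0.0; // no load at all, nothing was provisioned
        }
        long provisioned = max * eventsPerSecond.length; // peak capacity held the whole time
        return 1.0 - (double) sum / provisioned;
    }

    public static void main(String[] args) {
        // Used: 800 event-units; provisioned: 4 * 400 = 1600; idle share: 0.5
        System.out.println(idleFraction(new long[] {100, 100, 400, 200})); // prints 0.5
    }
}
```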
(1) Adjust Parallelism
[Figure: initial configuration; scale out (for load); scale in (save resources)]
▪ Adjusting parallelism without (significantly) interrupting the program
▪ Initial version:
  • Checkpoint -> stop -> restart with different parallelism
▪ Stateless operators: trivial
▪ Stateful operators: repartition state
  • Transparent for key/value state and windows
  • Consistent hashing simplifies state reorganization
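To illustrate how keyed state might be repartitioned when parallelism changes, the sketch below hashes each key into a fixed number of key groups and assigns contiguous ranges of groups to operator instances. This is one possible scheme, a simple relative of consistent hashing, shown under the assumption that the number of key groups never changes; the slide does not specify the mechanism, and `KeyGroupSketch` is a hypothetical name.

```java
// Illustration only: keys -> fixed key groups -> operator instances.
final class KeyGroupSketch {

    /** Maps a key to one of numKeyGroups groups; independent of the current parallelism. */
    static int keyGroup(String key, int numKeyGroups) {
        return Math.floorMod(key.hashCode(), numKeyGroups);
    }

    /** Maps a key group to an operator instance in [0, parallelism) by contiguous ranges. */
    static int operatorIndex(int keyGroup, int numKeyGroups, int parallelism) {
        return keyGroup * parallelism / numKeyGroups;
    }

    public static void main(String[] args) {
        int numKeyGroups = 8;
        // With parallelism 2, groups 0-3 go to instance 0 and groups 4-7 to instance 1.
        // With parallelism 4, each instance owns two adjacent groups.
        for (int group = 0; group < numKeyGroups; group++) {
            System.out.println("group " + group
                + ": p=2 -> " + operatorIndex(group, numKeyGroups, 2)
                + ", p=4 -> " + operatorIndex(group, numKeyGroups, 4));
        }
    }
}
```

Because a key's group never changes, rescaling only moves whole groups between instances, so state can be reassigned group by group without rehashing individual keys.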
(2) Dynamic Worker Pool
[Figure: JobManager, Resource Manager, TaskManagers, and a pool of cluster resources (YARN/Mesos/…)]
Miscellaneous
▪ Support for Apache Mesos
▪ Security
  • Over-the-wire encryption of RPC (akka) and data transfers (netty)
▪ More connectors
  • Apache Cassandra
  • Amazon Kinesis
▪ Enhance metrics
  • Throughput / Latencies
  • Backpressure monitoring
  • Spilling / Out of Core



Editor's Notes

  • #22: Data loss happens if the job crashes
  • #23: Data loss happens if the job crashes