Apache Flink
Past, present and future
Gyula Fóra
gyfora@apache.org
What is Apache Flink
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Easy and powerful APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate and Sink operators]
What is Flink good at
It's a general-purpose data analytics system
▪ Real-time stream processing with flexible windowing
▪ Complex and heavy ETL jobs
▪ Analyzing huge graphs
▪ Machine learning on large data sets and streams
▪ …
The Flink Stack
Libraries and frontends: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Core: Batch Optimizer, Streaming Optimizer, Flink Runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
Word count in Flink
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Time.of(1,MINUTES)).every(Time.of(30,SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
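The streaming variant counts words over a one-minute window that is re-evaluated every 30 seconds, so each line contributes to two overlapping windows. The sketch below illustrates that sliding-window assignment with plain Scala collections. It is not the Flink API: `Line`, `windowStarts` and `count` are hypothetical names, timestamps are assumed to be whole seconds, and windows that would start before time 0 are ignored.

```scala
// Plain-Scala illustration of sliding-window word counts (not the Flink API).
object SlidingWordCount {
  // Hypothetical event type: a line of text with a timestamp in seconds.
  final case class Line(timestampSec: Long, text: String)

  // Start times of all windows [start, start + sizeSec) that contain `ts`,
  // where a new window starts every `slideSec` seconds (slideSec <= sizeSec).
  def windowStarts(ts: Long, sizeSec: Long, slideSec: Long): Seq[Long] = {
    val lastStart = ts - (ts % slideSec)
    (lastStart to (ts - sizeSec + 1) by -slideSec).filter(_ >= 0)
  }

  // Word frequency per (windowStart, word).
  def count(lines: Seq[Line], sizeSec: Long, slideSec: Long): Map[(Long, String), Int] = {
    val keyed = for {
      line  <- lines
      word  <- line.text.split(" ").toSeq if word.nonEmpty
      start <- windowStarts(line.timestampSec, sizeSec, slideSec)
    } yield (start, word)
    keyed.groupBy(identity).map { case (key, hits) => key -> hits.size }
  }
}
```

For example, with a 60-second window sliding by 30 seconds, a line stamped at second 45 is counted in the windows starting at 0 and at 30.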
Table API
val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)

val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio,
    'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)
▪ Execute SQL-like expressions on table data
• Tight integration with Java and Scala APIs
• Available for batch and streaming programs
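To make the semantics of the query above concrete, here is a rough equivalent written against plain Scala collections. It is only a sketch of what the expression computes, not how the Table API executes it. The record types `Order`, `LineItem` and `Result` are assumptions that reuse the slide's column names, and `Double` stands in for the slide's float literal.

```scala
// Plain-Scala equivalent of the filter / join / aggregate query (not the Table API).
object RevenueQuery {
  // Hypothetical record types; field names follow the slide's column names.
  final case class Order(oId: Int, oDate: String, shipPrio: Int)
  final case class LineItem(id: Int, extdPrice: Double, discnt: Double)
  final case class Result(oId: Int, revenue: Double, oDate: String, shipPrio: Int)

  def run(orders: Seq[Order], lineitems: Seq[LineItem]): Seq[Result] = {
    // filter('shipPrio === 5)
    val filtered = orders.filter(_.shipPrio == 5)
    // join on oId === id and compute the discounted price per line item
    val items = for {
      o <- filtered
      l <- lineitems if l.id == o.oId
    } yield (o, l.extdPrice * (1.0 - l.discnt))
    // groupBy(oId, oDate, shipPrio) and sum the revenue
    items.groupBy(_._1).toSeq.map { case (o, rows) =>
      Result(o.oId, rows.map(_._2).sum, o.oDate, o.shipPrio)
    }.sortBy(_.oId)
  }
}
```

Orders with a different shipping priority are dropped before the join, so their line items never contribute to any revenue sum.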
A trip down memory lane
April 16, 2014
Stratosphere 0.5
APIs: DataSet API (Java), DataSet API (Scala)
Core: Stratosphere Optimizer, Stratosphere Runtime
Deployment: Local, Remote, Yarn
Key new features
• New Java API
• Distributed cache
• Collection data sources and
sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support
Flink 0.7
APIs: DataSet (Java/Scala), DataStream (Java), Hadoop M/R
Core: Flink Optimizer, Stream Builder, Flink Runtime
Deployment: Local, Remote, Yarn, Embedded
Key new features
• Unification of Java and Scala
APIs
• Logical keys/POJO support
• MR compatibility
• Collections backend
• Extended filesystem support
Flink 0.8
APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R
Core: Flink Optimizer, Stream Builder, Flink Runtime
Deployment: Local, Remote, Yarn, Embedded
Key new features
• Improved filesystem support
• DataStream Scala
• Streaming windows
• Lots of performance and stability improvements
• Kryo default serializer
Current master (0.9-Snapshot)
Libraries and frontends: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Core: Batch Optimizer, Stream Optimizer, New Flink Runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming FT
Flink community
[Figure: number of unique contributors by git commits (without manual de-duplication)]
Summary
▪ The project has a lot of momentum with major
improvements every release
▪ Healthy community
▪ Project diversification
• Real-time data streaming
• Several frontends (targeting different user profiles
and use cases)
• Several backends (targeting different production
settings)
▪ Integration with open source ecosystem
Vision for Flink
What are we building?
A "use-case complete" framework to unify
batch & stream processing
Flink
Data Streams
• Kafka
• RabbitMQ
• ...
“Historic” data
• HDFS
• JDBC
• ...
Analytical Workloads
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis
What are we building? (master)
An engine that puts equal emphasis on stream and batch processing
[Diagram: Flink at the center, connected to the following inputs and workloads]
Real-time data streams (event logs): Kafka, RabbitMQ, ...
Historic data: HDFS, JDBC, ...
Workloads: ETL, Graphs, Machine Learning, Relational, …; low-latency windowing, aggregations, ...
Integrating batch with
streaming
Why?
▪ Applications need to combine streaming and
static data sources
▪ Making the switch from batch to streaming easy
will be key to boost adoption
▪ Companies are making the transition from batch
to streaming now
What is stream processing?
▪ Data stream: Infinite sequence of data arriving in a continuous fashion
▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries
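A continuous query differs from a batch query in that it cannot wait for the end of its input, so it emits an updated result for every arriving element. The following sketch models this with a lazy, never-ending Scala `Iterator`. The source `readings` and the query `runningSum` are hypothetical names chosen for illustration.

```scala
// A continuous query over an unbounded source, modeled with lazy iterators.
object ContinuousQuery {
  // Hypothetical unbounded source: 1, 2, 3, ... without end.
  def readings: Iterator[Int] = Iterator.from(1)

  // Emits the running sum after each arriving element instead of one final result.
  def runningSum(stream: Iterator[Int]): Iterator[Int] =
    stream.scanLeft(0)(_ + _).drop(1)
}
```

Taking the first five results with `ContinuousQuery.runningSum(ContinuousQuery.readings).take(5).toList` yields `List(1, 3, 6, 10, 15)`. The source itself never terminates.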
Lambda architecture
▪ "Speed layer" can be a stream processing system
▪ "Picks up" after the batch layer
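One way to read "picks up after the batch layer": the batch layer periodically recomputes a view over all events up to some cutoff, the speed layer covers only the events after that cutoff, and a query merges the two views. The sketch below is a deliberately simplified, single-process illustration. `Event`, `batchView`, `speedView` and `query` are hypothetical names, and the views are simple per-key counts.

```scala
// Simplified Lambda architecture: batch view + speed view merged at query time.
object LambdaSketch {
  final case class Event(timestamp: Long, key: String)

  private def countByKey(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.key).map { case (k, es) => k -> es.size }

  // Batch layer: counts over everything up to and including the cutoff.
  def batchView(log: Seq[Event], cutoff: Long): Map[String, Int] =
    countByKey(log.filter(_.timestamp <= cutoff))

  // Speed layer: counts only the events the batch layer has not seen yet.
  def speedView(log: Seq[Event], cutoff: Long): Map[String, Int] =
    countByKey(log.filter(_.timestamp > cutoff))

  // Serving: a query combines both views.
  def query(batch: Map[String, Int], speed: Map[String, Int], key: String): Int =
    batch.getOrElse(key, 0) + speed.getOrElse(key, 0)
}
```

Note that the counting logic exists once here only because both layers run in the same process. In practice, the two layers are often separate systems that each need their own implementation.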
Kappa architecture
▪ The need for separate batch and speed layers is not
fundamental; it is a practical consequence of current technology
▪ Idea: use a stream processing system for all
data processing
▪ They are all dataflows anyway
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
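The Kappa idea can be sketched as a single streaming job that serves both purposes: reprocessing historic data means replaying the stored log through the same job that handles live data. The example below is a minimal illustration under that assumption. `countPerKey` is a hypothetical job, and a finite `Iterator` stands in for a replayed log.

```scala
// One streaming job for all processing: a bounded replay is just a stream that ends.
object KappaSketch {
  // Emits the updated count for a key each time that key arrives.
  def countPerKey(events: Iterator[String]): Iterator[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    events.map { key =>
      counts(key) += 1
      key -> counts(key)
    }
  }
}
```

Replaying the log `a, b, a` produces `(a,1), (b,1), (a,2)`. Feeding further live events into the same iterator continues from that state without a separate batch implementation.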
Data streaming with Flink
▪ Flink is building a proper stream
processing system
• that can execute both batch and stream jobs
natively
• batch-only jobs pass through a different optimization
code path
▪ Flink is building libraries and DSLs on top
of both batch and streaming
• e.g., see recent Table API
Data streaming with Flink
▪ Low-latency stream processor
▪ Expressive APIs in Scala/Java
▪ Stateful operators and flexible windowing
▪ Efficient fault tolerance for exactly-once
guarantees
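The link between stateful operators and exactly-once guarantees can be illustrated with a toy model: if the operator state is always snapshotted together with the input position it reflects, then after a failure the job can restore the last snapshot and replay from that position, and every record affects the state exactly once. The sketch below assumes a single operator and a replayable input. `Checkpoint` and `run` are hypothetical names, and it does not model how Flink coordinates checkpoints across a distributed job.

```scala
// Toy model of checkpoint-and-replay for a single stateful operator (a running sum).
object CheckpointSketch {
  // Operator state plus the input offset that the state reflects.
  final case class Checkpoint(offset: Int, sum: Long)

  // Processes `input` starting from `from`, checkpointing every `interval` records.
  // If `failAt` is set, a crash is simulated before the record at that index and the
  // last completed checkpoint is returned; otherwise the final sum is returned.
  def run(input: Vector[Int], from: Checkpoint, interval: Int,
          failAt: Option[Int]): Either[Checkpoint, Long] = {
    var sum = from.sum
    var last = from
    var i = from.offset
    while (i < input.length) {
      if (failAt.contains(i)) return Left(last)
      sum += input(i)
      i += 1
      if (i % interval == 0) last = Checkpoint(i, sum)
    }
    Right(sum)
  }
}
```

With input `1, 2, 3, 4, 5`, a checkpoint interval of 2 and a simulated crash before index 3, the first run returns `Checkpoint(2, 3)`. Restarting from that checkpoint yields 15, the same result as a run without failure, even though the record at index 2 was read twice.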
Summary
▪ Flink is a general-purpose data analytics
system
▪ Unifies batch and stream processing
▪ Expressive high-level APIs
▪ Robust and fast execution engine
Apache Flink: Past, Present and Future
flink.apache.org
@ApacheFlink
