SlideShare a Scribd company logo
SQL on Streams powered by Apache Flink
and Apache Calcite?
Radu Tudoran
Titled Explained (motivation)
Why SQL?Why Streaming?
● API to your database and other date lakes
● Ask for what you want, system decides how
to get it
● Query planner (optimizer) converts logical
queries to physical plans
● Standard & Mathematically sound language
● Opportunity for novel data organizations &
algorithms
● Existing knowhow with rich pool of experts
● Most data is produced as a stream
● Streams are everywhere: devices, web,
services, logs, traces, (social) media
● No delay – receive, process and deliver
instantly and continuously
● Opportunity for novel services and
businesses value extraction
● Better interconnected cloud services
Why query streams?
Duality:
● “Your database is just a cache of my stream”
● “Your stream is just change-capture of my
database”
● “Data is the new oil”
● Treating events/messages as data allows you to
extract and refine them
● Declarative approach to streaming applications
Outline
Flink and Table API
Calcite
SQL batch
SQL stream
Open thoughts…
What is Apache Flink?
Python
Gelly
Table
ML
SAMOA
Flink Optimizer
DataSet (Java/Scala) DataStream (Java/Scala)
Stream Builder
Hadoop
M/R
Local Remote Yarn Tez Embedded
Dataflow
Dataflow
Flink Dataflow Runtime
HDFS
HBase
Kafka
RabbitMQ
Flume
HCatalog
JDBC
Credits to DataArtisans & Flink community 4
Table
Technology inside Flink
case class Path (from: Long, to:
Long)
val tc = edges.iterate(10) {
paths: DataSet[Path] =>
val next = paths
.join(edges)
.where("to")
.equalTo("from") {
(path, edge) =>
Path(path.from, edge.to)
}
.union(paths)
.distinct()
next
}
Cost-based
optimizer
Type extraction
stack
Task
scheduling
Recovery
metadata
Pre-flight (Client)
Master
Workers
DataSourc
e
orders.tbl
Filter
Map
DataSourc
e
lineitem.tbl
Join
Hybrid Hash
build
HT
probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
Memory
manager
Out-of-core
algos
Batch &
Streaming
State &
Checkpoints
deploy
operators
track
intermediate
results
Credits to DataArtisans & Flink community 5
Table API
6
• API for “SQL-like” queries/ expression language on the analytics pipeline
• Build as abstraction in Java/Scala on top of DataSet (batch) and extended for
DataStream (stream)
• Enables to apply relational operators: selection, aggregation, joins
• Enables to register as tables native data structures (DataSet, DataStream)
and external sources
• Tables can be converted back to native data structures
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
// existing stream
DataStream ord = …
// register the DataStream ord as table "Orders" with fields user, product, and amount
tableEnv.registerDataStream("Orders", ord, "user, product, amount");
TableSource custTS = new CsvTableSource("/path/to/file", ...)
// register a `TableSource` as external table "Customers"
tableEnv.registerTableSource("Customers", custTS)
// convert a DataSet into a Table
Table custT = tableEnv .toTable(orders, “user, amount") .where(“amount >'100'") .select("name")
Table Environment
Table Creation
Table Usage
From Program to Dataflow
7
Flink Program
Dataflow Plan
Optimized Plan
Outline
Flink and Table API
Calcite
SQL batch
SQL stream
Open thoughts…
Context
http://www.slideshare.net/julianhyde/calcite-stratany2014?qid=16ae156b-e978-486a-a3e5-b3072b6f7394&v=&b=&from_search=4
Conventional DB architecture Calcite Model
•Apache Project (incubator project May 2014; top-level project October 2015)
•Provide standard SQL parser, validator and JDBC driver
•Query planning framework
•Base all query optimization decisions on cost
•Query optimizations are modeled as pluggable rules
Calcite Architecture
http://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite?qid=16ae156b-e978-486a-a3e5-
b3072b6f7394&v=&b=&from_search=1
Calcite Planning Process
http://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite?qid=16ae156b-e978-486a-a3e5-
b3072b6f7394&v=&b=&from_search=1
Outline
Flink and Table API
Calcite
SQL Analytics
Open thoughts…
Analytics
Traditional batch analytics
• Repeated queries on finite and changing data sets
• Queries join and aggregate large data sets
• Data is fully available
 Stream analytics
• “Standing” query produces continuous results from infinite input stream
• Query computes aggregates on high-volume streams
• A StreamSQL query runs forever and produces results continuously
• Query’s focus needs to evolve with the stream
How to compute aggregates on infinite streams?
Stream-table duality
select *
from Orders
where units > 1000
select stream *
from Orders
where units > 1000
A a stream can be used as a table and
Retrieve orders from now to +∞
…and a table can be used as a stream
 Retrieve elements from -∞to now
 Duality property allows to convert one to the other
 Orders (think of an eCommerce service) is both
 Calcite syntax: use the stream keyword
Challenge: Where to actually find the data? That’s up
to the system
Stream SQL
 A StreamSQL query runs forever and produces results continuously
 Adopt SQL operators to work on continuous (infinite) streams
 Use windows to apply SQL operators to a subset of records
 New windows types are introduced:
 Group By and multi GroupBy (group-by SQL operator)
 Windows (Thumblin, Hopping, Sliding, Row, Cascading)
 Joins
Stream SQL Architecture in Flink
 SQL support via Apache
Calcite
 Translate the SQL query
to stream topologies
 Leverage the query
optimization plan and
rule engine
 Logical operators
(RelNodes) have a
mapping to one or more
Flink operators
Example
SELECT STREAM
TUMBLE_END(time, INTERVAL '1' DAY) AS day,
location AS room,
AVG((tempF - 32) * 0.556) AS avgTempC
FROM sensorData
WHERE location LIKE 'room%'
GROUP BY TUMBLE(time, INTERVAL '1' DAY), location
val avgRoomTemp: Table = tableEnv.ingest("sensorData")
.where('location.like("room%"))
.partitionBy('location)
.window(Tumbling every Days(1) on 'time as 'w)
.select('w.end, 'location, , (('tempF - 32) * 0.556).avg as 'avgTempCs)
Calcite style
Flink style
Outline
Flink and Table API
Calcite
SQL Analytics
Open thoughts…
Conclusions Observations
 Big Data trend is to move to uniform APIs (SQL, Apache Beam)
Towards complete decoupling of all functionalities across software stack
 Stream and DB have a duality property
 SQL is compatible with streams – within windows
 New data service systems
Credits for the slide materials
 Apache Flink Community and DataArtisans
Apache Calcite Community
19

More Related Content

PPTX
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
PPTX
January 2016 Flink Community Update & Roadmap 2016
PPTX
Real-time Stream Processing with Apache Flink
PDF
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
PDF
Stateful Distributed Stream Processing
PDF
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
PPTX
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
PPTX
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
January 2016 Flink Community Update & Roadmap 2016
Real-time Stream Processing with Apache Flink
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Stateful Distributed Stream Processing
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

What's hot (20)

PPTX
Flink Streaming @BudapestData
PDF
Flink Apachecon Presentation
PPTX
Taking a look under the hood of Apache Flink's relational APIs.
PDF
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
PPTX
The Stream Processor as a Database Apache Flink
PDF
Flink Gelly - Karlsruhe - June 2015
PPTX
Data Stream Processing with Apache Flink
PDF
Real-time Stream Processing with Apache Flink @ Hadoop Summit
PPTX
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
PDF
Apache Flink @ Tel Aviv / Herzliya Meetup
PPTX
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
PDF
Taking Spark Streaming to the Next Level with Datasets and DataFrames
PPTX
Google cloud Dataflow & Apache Flink
PDF
Large-Scale Stream Processing in the Hadoop Ecosystem
PDF
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
PDF
Timeline Service v.2 (Hadoop Summit 2016)
PPTX
Apache Flink Overview at SF Spark and Friends
PDF
Airstream: Spark Streaming At Airbnb
PDF
Baymeetup-FlinkResearch
PDF
Change Data Capture with Data Collector @OVH
Flink Streaming @BudapestData
Flink Apachecon Presentation
Taking a look under the hood of Apache Flink's relational APIs.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
The Stream Processor as a Database Apache Flink
Flink Gelly - Karlsruhe - June 2015
Data Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink @ Hadoop Summit
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Google cloud Dataflow & Apache Flink
Large-Scale Stream Processing in the Hadoop Ecosystem
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Timeline Service v.2 (Hadoop Summit 2016)
Apache Flink Overview at SF Spark and Friends
Airstream: Spark Streaming At Airbnb
Baymeetup-FlinkResearch
Change Data Capture with Data Collector @OVH
Ad

Viewers also liked (9)

PPTX
Apache Flink: Past, Present and Future
PDF
Apache Flink Deep Dive
PDF
Streaming SQL
PPTX
Apache Flink Community Updates November 2016 @ Berlin Meetup
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
PDF
A look at Flink 1.2
PDF
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
PPTX
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
PPTX
Spark Tips & Tricks
Apache Flink: Past, Present and Future
Apache Flink Deep Dive
Streaming SQL
Apache Flink Community Updates November 2016 @ Berlin Meetup
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
A look at Flink 1.2
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Spark Tips & Tricks
Ad

Similar to Towards sql for streams (20)

PPTX
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
PPTX
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
PPTX
Stream Analytics with SQL on Apache Flink
PPTX
Why and how to leverage the power and simplicity of SQL on Apache Flink
PPTX
Streaming SQL to unify batch and stream processing: Theory and practice with ...
PPTX
Fabian Hueske - Stream Analytics with SQL on Apache Flink
PDF
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
PDF
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
PDF
Flink's SQL Engine: Let's Open the Engine Room!
PDF
Streaming SQL
PDF
Streaming SQL w/ Apache Calcite
PDF
Streaming SQL with Apache Calcite
PPTX
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
PPTX
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
PDF
Streaming SQL
PPTX
Webinar: Flink SQL in Action - Fabian Hueske
PDF
Streaming SQL
PPTX
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Stream Analytics with SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Flink's SQL Engine: Let's Open the Engine Room!
Streaming SQL
Streaming SQL w/ Apache Calcite
Streaming SQL with Apache Calcite
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Streaming SQL
Webinar: Flink SQL in Action - Fabian Hueske
Streaming SQL
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...

Recently uploaded (20)

PDF
Modernizing your data center with Dell and AMD
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
ABU RAUP TUGAS TIK kelas 8 hjhgjhgg.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
A Day in the Life of Location Data - Turning Where into How.pdf
PDF
CIFDAQ's Token Spotlight: SKY - A Forgotten Giant's Comeback?
PDF
Top Generative AI Tools for Patent Drafting in 2025.pdf
PPTX
CroxyProxy Instagram Access id login.pptx
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Dell Pro 14 Plus: Be better prepared for what’s coming
PPTX
Belt and Road Supply Chain Finance Blockchain Solution
PDF
HCSP-Presales-Campus Network Planning and Design V1.0 Training Material-Witho...
PDF
Chapter 2 Digital Image Fundamentals.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
PPTX
Web Security: Login Bypass, SQLi, CSRF & XSS.pptx
PPTX
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
PDF
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Modernizing your data center with Dell and AMD
CIFDAQ's Market Insight: SEC Turns Pro Crypto
ABU RAUP TUGAS TIK kelas 8 hjhgjhgg.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
A Day in the Life of Location Data - Turning Where into How.pdf
CIFDAQ's Token Spotlight: SKY - A Forgotten Giant's Comeback?
Top Generative AI Tools for Patent Drafting in 2025.pdf
CroxyProxy Instagram Access id login.pptx
Understanding_Digital_Forensics_Presentation.pptx
Dell Pro 14 Plus: Be better prepared for what’s coming
Belt and Road Supply Chain Finance Blockchain Solution
HCSP-Presales-Campus Network Planning and Design V1.0 Training Material-Witho...
Chapter 2 Digital Image Fundamentals.pdf
NewMind AI Weekly Chronicles - August'25 Week I
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
Web Security: Login Bypass, SQLi, CSRF & XSS.pptx
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf

Towards sql for streams

  • 1. SQL on Streams powered by Apache Flink and Apache Calcite? Radu Tudoran
  • 2. Titled Explained (motivation) Why SQL?Why Streaming? ● API to your database and other date lakes ● Ask for what you want, system decides how to get it ● Query planner (optimizer) converts logical queries to physical plans ● Standard & Mathematically sound language ● Opportunity for novel data organizations & algorithms ● Existing knowhow with rich pool of experts ● Most data is produced as a stream ● Streams are everywhere: devices, web, services, logs, traces, (social) media ● No delay – receive, process and deliver instantly and continuously ● Opportunity for novel services and businesses value extraction ● Better interconnected cloud services Why query streams? Duality: ● “Your database is just a cache of my stream” ● “Your stream is just change-capture of my database” ● “Data is the new oil” ● Treating events/messages as data allows you to extract and refine them ● Declarative approach to streaming applications
  • 3. Outline Flink and Table API Calcite SQL batch SQL stream Open thoughts…
  • 4. What is Apache Flink? Python Gelly Table ML SAMOA Flink Optimizer DataSet (Java/Scala) DataStream (Java/Scala) Stream Builder Hadoop M/R Local Remote Yarn Tez Embedded Dataflow Dataflow Flink Dataflow Runtime HDFS HBase Kafka RabbitMQ Flume HCatalog JDBC Credits to DataArtisans & Flink community 4 Table
  • 5. Technology inside Flink case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Cost-based optimizer Type extraction stack Task scheduling Recovery metadata Pre-flight (Client) Master Workers DataSourc e orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph Memory manager Out-of-core algos Batch & Streaming State & Checkpoints deploy operators track intermediate results Credits to DataArtisans & Flink community 5
  • 6. Table API 6 • API for “SQL-like” queries/ expression language on the analytics pipeline • Build as abstraction in Java/Scala on top of DataSet (batch) and extended for DataStream (stream) • Enables to apply relational operators: selection, aggregation, joins • Enables to register as tables native data structures (DataSet, DataStream) and external sources • Tables can be converted back to native data structures StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env); // existing stream DataStream ord = … // register the DataStream ord as table "Orders" with fields user, product, and amount tableEnv.registerDataStream("Orders", ord, "user, product, amount"); TableSource custTS = new CsvTableSource("/path/to/file", ...) // register a `TableSource` as external table "Customers" tableEnv.registerTableSource("Customers", custTS) // convert a DataSet into a Table Table custT = tableEnv .toTable(orders, “user, amount") .where(“amount >'100'") .select("name") Table Environment Table Creation Table Usage
  • 7. From Program to Dataflow 7 Flink Program Dataflow Plan Optimized Plan
  • 8. Outline Flink and Table API Calcite SQL batch SQL stream Open thoughts…
  • 9. Context http://www.slideshare.net/julianhyde/calcite-stratany2014?qid=16ae156b-e978-486a-a3e5-b3072b6f7394&v=&b=&from_search=4 Conventional DB architecture Calcite Model •Apache Project (incubator project May 2014; top-level project October 2015) •Provide standard SQL parser, validator and JDBC driver •Query planning framework •Base all query optimization decisions on cost •Query optimizations are modeled as pluggable rules
  • 12. Outline Flink and Table API Calcite SQL Analytics Open thoughts…
  • 13. Analytics Traditional batch analytics • Repeated queries on finite and changing data sets • Queries join and aggregate large data sets • Data is fully available  Stream analytics • “Standing” query produces continuous results from infinite input stream • Query computes aggregates on high-volume streams • A StreamSQL query runs forever and produces results continuously • Query’s focus needs to evolve with the stream How to compute aggregates on infinite streams?
  • 14. Stream-table duality select * from Orders where units > 1000 select stream * from Orders where units > 1000 A a stream can be used as a table and Retrieve orders from now to +∞ …and a table can be used as a stream  Retrieve elements from -∞to now  Duality property allows to convert one to the other  Orders (think of an eCommerce service) is both  Calcite syntax: use the stream keyword Challenge: Where to actually find the data? That’s up to the system
  • 15. Stream SQL  A StreamSQL query runs forever and produces results continuously  Adopt SQL operators to work on continuous (infinite) streams  Use windows to apply SQL operators to a subset of records  New windows types are introduced:  Group By and multi GroupBy (group-by SQL operator)  Windows (Thumblin, Hopping, Sliding, Row, Cascading)  Joins
  • 16. Stream SQL Architecture in Flink  SQL support via Apache Calcite  Translate the SQL query to stream topologies  Leverage the query optimization plan and rule engine  Logical operators (RelNodes) have a mapping to one or more Flink operators
  • 17. Example SELECT STREAM TUMBLE_END(time, INTERVAL '1' DAY) AS day, location AS room, AVG((tempF - 32) * 0.556) AS avgTempC FROM sensorData WHERE location LIKE 'room%' GROUP BY TUMBLE(time, INTERVAL '1' DAY), location val avgRoomTemp: Table = tableEnv.ingest("sensorData") .where('location.like("room%")) .partitionBy('location) .window(Tumbling every Days(1) on 'time as 'w) .select('w.end, 'location, , (('tempF - 32) * 0.556).avg as 'avgTempCs) Calcite style Flink style
  • 18. Outline Flink and Table API Calcite SQL Analytics Open thoughts…
  • 19. Conclusions Observations  Big Data trend is to move to uniform APIs (SQL, Apache Beam) Towards complete decoupling of all functionalities across software stack  Stream and DB have a duality property  SQL is compatible with streams – within windows  New data service systems Credits for the slide materials  Apache Flink Community and DataArtisans Apache Calcite Community 19