SQL on Streams powered by Apache Flink
and Apache Calcite?
Radu Tudoran
Title Explained (motivation)
Why SQL?
● API to your database and other data lakes
● Ask for what you want; the system decides how to get it
● Query planner (optimizer) converts logical queries to physical plans
● Standard and mathematically sound language
● Opportunity for novel data organizations & algorithms
● Existing know-how with a rich pool of experts
Why Streaming?
● Most data is produced as a stream
● Streams are everywhere: devices, web, services, logs, traces, (social) media
● No delay: receive, process and deliver instantly and continuously
● Opportunity for novel services and business value extraction
● Better interconnected cloud services
Why query streams?
Duality:
● “Your database is just a cache of my stream”
● “Your stream is just change-capture of my database”
● “Data is the new oil”
● Treating events/messages as data allows you to extract and refine them
● Declarative approach to streaming applications
Outline
Flink and Table API
Calcite
SQL batch
SQL stream
Open thoughts…
What is Apache Flink?
[Figure: the Flink software stack. Libraries and compatibility layers: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R. Core APIs: DataSet (Java/Scala) with the Flink Optimizer, DataStream (Java/Scala) with the Stream Builder. Engine: Flink Dataflow Runtime. Deployment: Local, Remote, Yarn, Tez, Embedded. Connectors: HDFS, HBase, Kafka, RabbitMQ, Flume, HCatalog, JDBC.]
Credits to DataArtisans & Flink community
Technology inside Flink
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) =>
          Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
[Figure: from program to execution. Pre-flight (client): the program passes through the type extraction stack and the cost-based optimizer and becomes a dataflow graph. Master: task scheduling and recovery metadata; it deploys operators and tracks intermediate results. Workers: memory manager, out-of-core algorithms, batch & streaming, state & checkpoints. Example plan: DataSource orders.tbl → Filter → Map and DataSource lineitem.tbl are hash-partitioned on field 0 into a hybrid hash Join (build HT / probe), followed by a sort and a forward to a GroupReduce.]
Credits to DataArtisans & Flink community
Table API
• API for “SQL-like” queries / expression language on the analytics pipeline
• Built as an abstraction in Java/Scala on top of DataSet (batch) and extended to DataStream (stream)
• Lets you apply relational operators: selection, aggregation, joins
• Lets you register native data structures (DataSet, DataStream) and external sources as tables
• Tables can be converted back to native data structures
Table Environment:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
Table Creation:
// existing stream
DataStream ord = …
// register the DataStream ord as table "Orders" with fields user, product, and amount
tableEnv.registerDataStream("Orders", ord, "user, product, amount");
TableSource custTS = new CsvTableSource("/path/to/file", ...);
// register a `TableSource` as external table "Customers"
tableEnv.registerTableSource("Customers", custTS);
Table Usage:
// convert a DataSet into a Table
Table custT = tableEnv
    .toTable(orders, "user, amount")
    .where("amount > '100'")
    .select("name");
From Program to Dataflow
[Figure: Flink Program → Dataflow Plan → Optimized Plan]
Outline
Flink and Table API
Calcite
SQL batch
SQL stream
Open thoughts…
Context
http://www.slideshare.net/julianhyde/calcite-stratany2014?qid=16ae156b-e978-486a-a3e5-b3072b6f7394&v=&b=&from_search=4
[Figure: conventional DB architecture vs. the Calcite model]
• Apache project (entered the incubator in May 2014; top-level project since October 2015)
• Provides a standard SQL parser, validator and JDBC driver
• Query planning framework
• Bases all query optimization decisions on cost
• Query optimizations are modeled as pluggable rules
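To make "cost-based decisions" and "pluggable rules" concrete, here is a minimal sketch in plain Java. It does not use Calcite's classes: `Node`, `Rule`, `optimize`, the row counts and the cost formulas are all hypothetical, chosen only to show a planner that keeps a rewritten plan when a rule produces a cheaper equivalent.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of rule-driven, cost-based planning; these are not Calcite's classes. */
final class RulePlannerSketch {

    /** A relational operator with an estimated output row count and a cumulative cost. */
    interface Node {
        double rows();
        double cost();
    }

    static final class Scan implements Node {
        final String table;
        final double rows;
        Scan(String table, double rows) { this.table = table; this.rows = rows; }
        public double rows() { return rows; }
        public double cost() { return rows; }
        @Override public String toString() { return "Scan(" + table + ")"; }
    }

    static final class Filter implements Node {
        final Node input;
        final String onTable;     // table whose column the predicate refers to
        final double selectivity; // assumed fraction of rows that pass
        Filter(Node input, String onTable, double selectivity) {
            this.input = input; this.onTable = onTable; this.selectivity = selectivity;
        }
        public double rows() { return input.rows() * selectivity; }
        public double cost() { return input.cost() + input.rows(); }
        @Override public String toString() { return "Filter[" + onTable + "](" + input + ")"; }
    }

    static final class Join implements Node {
        final Node left;
        final Node right;
        Join(Node left, Node right) { this.left = left; this.right = right; }
        public double rows() { return Math.max(left.rows(), right.rows()); } // crude estimate
        public double cost() { return left.cost() + right.cost() + left.rows() + right.rows(); }
        @Override public String toString() { return "Join(" + left + ", " + right + ")"; }
    }

    /** A pluggable rewrite: returns an equivalent plan, or empty if the rule does not match. */
    interface Rule {
        Optional<Node> apply(Node plan);
    }

    /** Moves a filter on the left input's table from above the join to below it. */
    static final Rule PUSH_FILTER_BELOW_JOIN = plan -> {
        if (plan instanceof Filter && ((Filter) plan).input instanceof Join) {
            Filter f = (Filter) plan;
            Join j = (Join) f.input;
            if (j.left instanceof Scan && ((Scan) j.left).table.equals(f.onTable)) {
                return Optional.of(new Join(new Filter(j.left, f.onTable, f.selectivity), j.right));
            }
        }
        return Optional.empty();
    };

    /** Applies the rules at the plan root until no rule yields a cheaper plan. */
    static Node optimize(Node plan, List<Rule> rules) {
        boolean improved = true;
        while (improved) {
            improved = false;
            for (Rule rule : rules) {
                Optional<Node> candidate = rule.apply(plan);
                if (candidate.isPresent() && candidate.get().cost() < plan.cost()) {
                    plan = candidate.get();
                    improved = true;
                }
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Node plan = new Filter(new Join(new Scan("Orders", 1000), new Scan("Customers", 100)), "Orders", 0.1);
        Node best = optimize(plan, Arrays.asList(PUSH_FILTER_BELOW_JOIN));
        System.out.println(plan + " cost=" + plan.cost());
        System.out.println(best + " cost=" + best.cost());
    }
}
```

New optimizations are added by appending another `Rule` to the list; the planner loop itself does not change, which is the point of modeling optimizations as pluggable rules.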
Calcite Architecture
http://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite?qid=16ae156b-e978-486a-a3e5-b3072b6f7394&v=&b=&from_search=1
Calcite Planning Process
http://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite?qid=16ae156b-e978-486a-a3e5-b3072b6f7394&v=&b=&from_search=1
Outline
Flink and Table API
Calcite
SQL Analytics
Open thoughts…
Analytics
Traditional batch analytics
• Repeated queries on finite and changing data sets
• Queries join and aggregate large data sets
• Data is fully available
Stream analytics
• “Standing” query produces continuous results from infinite input stream
• Query computes aggregates on high-volume streams
• A StreamSQL query runs forever and produces results continuously
• Query’s focus needs to evolve with the stream
How to compute aggregates on infinite streams?
Stream-table duality
Orders (think of an eCommerce service) is both a stream and a table.
select *
from Orders
where units > 1000
• A stream can be used as a table: retrieve elements from -∞ to now
select stream *
from Orders
where units > 1000
• ...and a table can be used as a stream: retrieve orders from now to +∞
• The duality property allows converting one into the other
• Calcite syntax: use the stream keyword
• Challenge: where to actually find the data? That is up to the system
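The two readings of Orders can be sketched in plain Java. This is only an illustration under assumptions: `OrdersDualitySketch`, `selectTable` and `selectStream` are hypothetical names, an order is reduced to its `units` value, and everything lives in memory. The table query is evaluated once over what has arrived so far; the stream query is a standing subscription that only sees orders arriving after it was registered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntPredicate;

/** Hypothetical sketch: one Orders relation readable as a table (past) or as a stream (future). */
final class OrdersDualitySketch {
    private final List<Integer> history = new ArrayList<>();             // units of every order so far
    private final List<Consumer<Integer>> subscribers = new ArrayList<>(); // standing queries

    /** A new order arrives: it becomes part of the table and is pushed to standing queries. */
    public void append(int units) {
        history.add(units);
        for (Consumer<Integer> subscriber : subscribers) {
            subscriber.accept(units);
        }
    }

    /** Like "select * ... where": evaluated once over the orders from the beginning until now. */
    public List<Integer> selectTable(IntPredicate where) {
        List<Integer> out = new ArrayList<>();
        for (int units : history) {
            if (where.test(units)) {
                out.add(units);
            }
        }
        return out;
    }

    /** Like "select stream * ... where": a standing query over the orders from now onwards. */
    public void selectStream(IntPredicate where, Consumer<Integer> sink) {
        subscribers.add(units -> {
            if (where.test(units)) {
                sink.accept(units);
            }
        });
    }

    public static void main(String[] args) {
        OrdersDualitySketch orders = new OrdersDualitySketch();
        orders.append(500);
        orders.append(2000);

        List<Integer> future = new ArrayList<>();
        orders.selectStream(u -> u > 1000, future::add);

        orders.append(3000);
        orders.append(10);

        System.out.println(orders.selectTable(u -> u > 1000)); // prints [2000, 3000]
        System.out.println(future);                            // prints [3000]
    }
}
```

In a real system the history might sit in a database and the future in a message queue; the sketch keeps both in one object only to show that the same relation answers both kinds of query.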
Stream SQL
• A StreamSQL query runs forever and produces results continuously
• Adapt SQL operators to work on continuous (infinite) streams
• Use windows to apply SQL operators to a subset of records
• New window types are introduced:
  • Group By and multi GroupBy (group-by SQL operator)
  • Windows (Tumbling, Hopping, Sliding, Row, Cascading)
  • Joins
Stream SQL Architecture in Flink
• SQL support via Apache Calcite
• Translate the SQL query to stream topologies
• Leverage the query optimization plan and rule engine
• Logical operators (RelNodes) map to one or more Flink operators
Example
Calcite style:
SELECT STREAM
  TUMBLE_END(time, INTERVAL '1' DAY) AS day,
  location AS room,
  AVG((tempF - 32) * 0.556) AS avgTempC
FROM sensorData
WHERE location LIKE 'room%'
GROUP BY TUMBLE(time, INTERVAL '1' DAY), location
Flink style:
val avgRoomTemp: Table = tableEnv.ingest("sensorData")
  .where('location.like("room%"))
  .partitionBy('location)
  .window(Tumbling every Days(1) on 'time as 'w)
  .select('w.end, 'location, (('tempF - 32) * 0.556).avg as 'avgTempC)
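What a tumbling-window average computes can be sketched without Flink or Calcite. The plain-Java example below is an assumption-laden illustration: `TumblingWindowSketch`, `Event` and `averagePerWindow` are hypothetical names, timestamps are plain milliseconds, and a finite list stands in for the stream. Each event is assigned to the fixed-size, non-overlapping window that contains its timestamp, and a running sum and count are kept per window.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-window averages over (timestamp, value) events. */
final class TumblingWindowSketch {

    /** One stream element: an event time in milliseconds and a measured value. */
    static final class Event {
        final long timestampMs;
        final double value;
        Event(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /**
     * Assigns each event to the tumbling window [start, start + sizeMs) that contains it
     * and returns the average value per window start, in order of first appearance.
     */
    static Map<Long, Double> averagePerWindow(Iterable<Event> events, long sizeMs) {
        Map<Long, double[]> sumAndCount = new LinkedHashMap<>();
        for (Event e : events) {
            long windowStart = Math.floorDiv(e.timestampMs, sizeMs) * sizeMs;
            double[] acc = sumAndCount.computeIfAbsent(windowStart, k -> new double[2]);
            acc[0] += e.value; // running sum
            acc[1] += 1;       // running count
        }
        Map<Long, Double> result = new LinkedHashMap<>();
        for (Map.Entry<Long, double[]> entry : sumAndCount.entrySet()) {
            result.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(0, 10.0));
        events.add(new Event(400, 20.0));
        events.add(new Event(1000, 30.0));
        System.out.println(averagePerWindow(events, 1000)); // prints {0=15.0, 1000=30.0}
    }
}
```

Because only a sum and a count are stored per window, the state stays bounded even on an unbounded input, provided finished windows are emitted and dropped; a streaming engine would do that when a window closes, whereas this sketch returns all windows at the end.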
Outline
Flink and Table API
Calcite
SQL Analytics
Open thoughts…
Conclusions / Observations
• The Big Data trend is to move to uniform APIs (SQL, Apache Beam)
  • Towards complete decoupling of all functionalities across the software stack
• Streams and databases have a duality property
• SQL is compatible with streams, within windows
• New data service systems
Credits for the slide materials
• Apache Flink Community and DataArtisans
• Apache Calcite Community
