SlideShare a Scribd company logo
DON’T OPTIMIZE MY QUERIES,
ORGANIZE MY DATA!
Julian Hyde (Calcite PMC)
Don’t optimize my queries, optimize my data!
Abstract
Your queries won't run fast if your data is not organized right. Apache Calcite
optimizes queries, but can we make it optimize data? We had to solve several
challenges. Users are too busy to tell us the structure of their database, and the
query load changes daily, so Calcite has to learn and adapt. We talk about new
algorithms we developed for gathering statistics on massive database, and how
we infer and evolve the data model based on the queries.
@julianhyde
SQL
Query planning
Query federation
OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Calcite, Drill, Eagle, Kylin
Architect at Looker
Don't optimize my queries, organize my data!
A “simple” query
Data
● 2010 U.S. census
● 100 million records
● 1KB per record
● 100 GB total
System
● 4x SATA 3 disks
● Total read throughput 1 GB/s
Query
Goal
● Compute the answer to the query in
under 5 seconds
SELECT SUM(householdSize)
FROM CensusHouseholds;
Solutions
Sequential scan Query takes 100 s (100 GB at 1 GB/s)
Parallelize Spread the data over 40 disks in 10 machines
Query takes 10 s
Cache Keep the data in memory
2nd query: 10 ms
3rd query: 10 s
Materialize Summarize the data on disk
All queries: 100 ms
Materialize +
cache + adapt
As above, building summaries on demand
Lazy > Smart + Fast
(Lazy + adaptive is even better)
Overview
How do you tune a data system? How can (or should) a data system tune itself?
What problems have we solved to bring these things to Apache Calcite?
Part 1: Strategies for organizing data
Part 2: How to make systems self-organizing?
SELECT d.name, COUNT(*) AS c
FROM Emps AS e
JOIN Depts AS d USING (deptno)
WHERE e.age < 40
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Relational algebra
Based on set theory, plus operators:
Project, Filter, Aggregate, Union, Join,
Sort
Requires: declarative language (SQL),
query planner
Original goal: data independence
Enables: query optimization, new
algorithms and data structures
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
Apache Calcite
Apache top-level project
Query planning framework used in many
projects and products
Also works standalone: embedded federated
query engine with SQL / JDBC front end
Apache community development model
https://calcite.apache.org
https://github.com/apache/calcite
1. Organizing data
Ways of organizing data
Format (CSV, JSON, binary)
Layout: row- vs. column-oriented (e.g. Parquet, ORC), cache friendly (e.g. Arrow)
Storage medium (disk, flash, RAM, NVRAM, ...)
Non-lossy copy: sorted / partitioned
Lossy copies of data: project, filter, aggregate, join
Combinations of the above
Logical optimizations >> physical optimizations
Index
A sorted, projected materialized
view
Accelerates queries that use
ranges, correlated lookups, sorting,
aggregate, distinct
CREATE TABLE Emp (empno INT,
name VARCHAR(20), deptno INT);
CREATE INDEX I_Emp_Deptno
ON Emp (deptno, name);
SELECT DISTINCT deptno FROM Emp
WHERE deptno BETWEEN 20 AND 40
ORDER BY deptno;
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name rowid
10 Barney af5634.0001
10 Dino af5634.0003
20 Fred af5634.0000
30 Wilma af5634.0002
Add the remaining columns
No longer need “rowid”
Lossless
During planning, treat indexes
as tables, and index lookups
as joins
Covering index
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name empno
10 Barney 100
10 Dino 130
20 Fred 20
30 Wilma 30
CREATE INDEX I_Emp_Deptno2 (
deptno INTEGER,
name VARCHAR(20))
COVER (empno);
Materialized view
CREATE MATERIALIZED
VIEW EmpsByDeptno AS
SELECT deptno, name, deptno
FROM Emp
ORDER BY deptno, name;
Scan [Emps]
Scan [EmpsByDeptno]
Sort [deptno, name]
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name empno
10 Barney 100
10 Dino 130
20 Fred 20
30 Wilma 30
As a materialized view, an
index is now just another
table
Several tables contain the
information necessary to
answer the query - just pick
the best
Spatial query
Find all restaurants within 1.5 distance units of
where I am:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
Hilbert space-filling curve
● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together
•
•
•
•
Add restriction based on h, a restaurant’s distance
along the Hilbert curve
Must keep original restriction due to false positives
Using Hilbert index
restaurant x y h
Zachary’s pizza 3 1 5
King Yen 7 7 41
Filippo’s 7 4 52
Station burger 5 6 36
Zachary’s
pizza
Filippo’s
SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
OR r.h BETWEEN 46 AND 46)
AND ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
King
Yen
Station
burger
Telling the optimizer
1. Declare h as a generated column
2. Sort table by h
Planner can now convert spatial range
queries into a range scan
Does not require specialized spatial
index such as r-tree
Very efficient on a sorted table such as
HBase
CREATE TABLE Restaurants (
restaurant VARCHAR(20),
x DOUBLE,
y DOUBLE,
h DOUBLE GENERATED ALWAYS AS
ST_Hilbert(x, y) STORED)
SORT KEY (h);
restaurant x y h
Zachary’s pizza 3 1 5
Station burger 5 6 36
King Yen 7 7 41
Filippo’s 7 4 52
Much valuable data is “data in flight”
Use SQL to query streams (or streams + tables)
Streaming
Data center
SELECT AVG(unitPrice)
FROM Orders
WHERE units > 1000
AND orderDate
BETWEEN ‘2014-06-01’
AND ‘2015-12-31’
SELECT STREAM *
FROM Orders
WHERE units > 1000
Streaming query
Historic query
Hybrid query combines a stream with its
own history
● Orders is used as both as stream
and as “stream history” virtual table
● “Average order size over last year”
should be maintained by the system,
i.e. a materialized view
SELECT STREAM *
FROM Orders AS o
WHERE units > (
SELECT AVG(units)
FROM Orders AS h
WHERE h.productId = o.productId
AND h.rowtime
> o.rowtime - INTERVAL ‘1’ YEAR)
“Orders” used
as a stream
“Orders” used as
a “stream history”
virtual table
Summary - data optimization via
materialized views
Many forms of data optimization can be modeled as materialized views:
● Blocks in cache
● B-tree indexes
● Summary tables
● Spatial indexes
● History of streams
Allows the optimizer to “understand” the optimization and use it (if beneficial)
But who designs the optimizations?
2. Learning
How do data systems learn?
queries
DML
statistics
adaptations
recommender
Goals ● Improve response time, throughput, storage cost
● Predictable, adaptive (short and long term), allow human
intervention
How? ● Humans
● Adaptive systems
● Smart algorithms
Example
adaptations
● Cache disk blocks in memory
● Cached query results
● Data organization, e.g. partition on a different key
● Secondary structures, e.g. b-tree and r-tree indexes
Tiled, in-memory materialized views
A vision for an adaptive data system (we’re not there yet)
tables on
disk
in-memory
materializations
SELECT x, SUM(n) FROM t GROUP BY x
Building materialized views
Challenges:
● Design Which materializations to create?
● Populate Load them with data
● Maintain Incrementally populate when data changes
● Rewrite Transparently rewrite queries to use materializations
● Adapt Design and populate new materializations, drop unused ones
● Express Need a rich algebra, to model how data is derived
Initial focus: summary tables (materialized views over star schemas)
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
Designing summary tables via lattices
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
product
product
class
sales
customers
time
Many possible
summary
tables
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
() 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
raw 1m
(y, m)
60
(g, y) 10
(z, s)
43.4k
(g, y, m)
120
Fewer than you would
expect, because 5m
combinations cannot
occur in 1m row table
Fewer than you
would expect,
because state
depends on zipcode
Algorithm: Design summary tables
Given a database with 30 columns, 10M rows. Find X summary tables with under
Y rows that improve query response time the most.
AdaptiveMonteCarlo algorithm [1]:
● Based on research [2]
● Greedy algorithm that takes a combination of summary tables and tries to
find the table that yields the greatest cost/benefit improvement
● Models “benefit” of the table as query time saved over simulated query load
● The “cost” of a table is its size
[1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm
[2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
Lattice (optimized) () 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
(z, g, y,
m) 909k
(z, s, y,
m) 831k
raw 1m
(z, s, g,
m) 644k
(z, s, g,
y) 392k
(y, m)
60
(z, s)
43.4k
(z, s, g)
83.6k
(g, y) 10
(g, y, m)
120
(g, m)
24
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Data profiling
Algorithm needs count(distinct a, b, ...) for each combination of attributes:
● Previous example had 25
= 32 possible tables
● Schema with 30 attributes has 230
(about 109
) possible tables
● Algorithm considers a significant fraction of these
● Approximations are OK
Attempts to solve the profiling problem:
1. Compute each combination: scan, sort, unique, count; repeat 230
times!
2. Sketches (HyperLogLog)
3. Sketches + parallelism + information theory [CALCITE-1616]
Sketches
HyperLogLog is an algorithm that computes
approximate distinct count. It can estimate
cardinalities of 109
with a typical error rate of
2%, using 1.5 kB of memory. [3][4]
With 16 MB memory per machine we can
compute 10,000 combinations of attributes
each pass.
So, we’re down from 109
to 105
passes.
[3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"
[4] https://github.com/mrjgreen/HyperLogLog
Given Expected cardinality Actual cardinality Surprise
(gender): 2 (state): 50 (gender, state): 100.0 100 0.000
(month): 12 (zipcode): 43,000 (month, zipcode): 441,699.3 442,700 0.001
(state): 50 (zipcode): 43,000 (state, zipcode): 799,666.7 43,400 0.897
(state, zipcode): 43,400
(gender, state): 100
(gender, zipcode): 85,995
(gender, state, zipcode): 86,799
= min(86,799, 892,234, 892,228)
83,567 0.019
● Surprise = abs(actual - expected) / (actual + expected)
● E(card (x, y)) = n . (1 - ((n - 1) / n) ^ p) n = card (x) * card (y), p = row count
Combining probability & information theory
Algorithm
Three ways “surprise” can help:
● If a cardinality is not
surprising, we don’t need to
store it -- we can derive it
● If a combination’s cardinality
is not surprising, it is unlikely
to have surprising children
● If we’re not seeing surprising
results, it’s time to stop
surprise_threshold := 1
queue := {singleton combinations} // (a), (b), ...
while queue is not empty {
batch := remove first 10,000 entries in queue
compute cardinality of each combination in batch
for each actual (computed) cardinality a {
e := expected cardinality of combination
s := surprise(a, e)
if s > surprise_threshold {
store combination and its cardinality
add child combinations to queue // (x, a), (x, b), ...
}
increase surprise_threshold
}
}
Algorithm progress and “surprise” threshold
Progress of algorithm
Rejected as not
sufficiently
surprising
Surprise
threshold rises
as algorithm
progresses
Singleton
combinations
are have surprise
= 1
Surprise
threshold rises
after we have
completed the
first batch
Data profiling - summary
The algorithm defeats a combinatorial search space using sketches +
information theory + parallelism
Recommending data structures is an optimization problem; profiling provides
the cost & benefit function
As a by-product, the algorithm discovers unique keys, “almost” keys, and foreign
keys
But which tables are actually joined together in practice?
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
product
product
class
sales
customers
time
The lattice generates the
summary tables. But who
writes the lattice?
Designing summary tables via lattices (2)
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
ALTER SCHEMA Sales
INFER LATTICES;
product
product
class
sales
customers
time
Designing summary tables via lattices (3)
Lattice after Query 1 + 2
Query 2
Query 1
Growing and evolving
lattices based on queries
sales
customers
product
product
class
sales
product
product
class
sales
customers
See: [CALCITE-1870] “Lattice suggester”
Summary
Learning systems = manual tuning + adaptive + smart algorithms
Query history + data profiling→ lattices → summary tables
We have discussed summary tables (materialized views based on
join/aggregate in a star schema) but the approach can be applied to other kinds
of materialized views
Relational algebra, incorporating materialized views, is a powerful language that
allows us to combine many forms of data optimization
Thank you! Questions?
@ApacheCalcite | @julianhyde | https://calcite.apache.org
Resources
➔ [CALCITE-1616] Data profiler
➔ [CALCITE-1870] Lattice suggester
➔ [CALCITE-1861] Spatial indexes
➔ [CALCITE-1968] OpenGIS
➔ [CALCITE-1991] Generated columns
➔ Talk: “Data profiling with Apache Calcite” (Hadoop
Summit, 2017)
➔ Talk: “SQL on everything, in memory” (Strata, 2014)
➔ Zhang, Qi, Stradling, Huang (2014). “Towards a
Painless Index for Spatial Objects”
➔ Harinarayan, Rajaraman, Ullman (1996).
“Implementing data cubes efficiently”
Don't optimize my queries, organize my data!
Extra slides
Architecture
Conventional database Calcite
Planning queries
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Table: splunk
Optimized query
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Calcite framework
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more
Global transformations
Unification (materialized view)
Column trimming
De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice
Materialized views, lattices, tiles
Materialized view - A table whose contents are
guaranteed to be the same as executing a given query.
Lattice - Recommends, builds, and recognizes summary
materialized views (tiles) based on a star schema.
A query defines the tables and many:1 relationships in
the star schema.
Tile - A summary materialized view that belongs to a
lattice. A tile may or may not be materialized. Might be:
● Declared in lattice, or
● Generated via recommender algorithm, or
● Created in response to query.
CREATE MATERIALIZED VIEW t AS
SELECT * FROM emps
WHERE deptno = 10;
CREATE LATTICE star AS
SELECT *
FROM sales_fact_1997 AS s
JOIN product AS p ON …
JOIN product_class AS pc ON …
JOIN customer AS c ON …
JOIN time_by_day AS t ON …;
CREATE MATERIALIZED VIEW zg IN star
SELECT gender, zipcode, COUNT(*),
SUM(unit_sales) FROM star
GROUP BY gender, zipcode;
Combining past and future
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime > o.rowtime - interval ‘1’ year)
➢ Orders is used as both stream and table
➢ System determines where to find the records
➢ Query is invalid if records are not available
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval ‘2’ minute,
late limit 1;
select *
from Orders
emit when complete;
Other applications of data profiling
Query optimization:
● Planners are poor at estimating selectivity of conditions after N-way join
(especially on real data)
● New join-order benchmark: “Movies made by French directors tend to have
French actors”
● Predict number of reducers in MapReduce & Spark
“Grokking” a data set
Identifying problems in normalization, partitioning, quality
Applications in machine learning?
Further improvements to data profiling
● Build sketches in parallel
● Run algorithm in a distributed framework (Spark or MapReduce)
● Compute histograms
○ For example, Median age for male/female customers
● Seek out functional dependencies
○ Once you know FDs, a lot of cardinalities are no longer “surprising”
○ FDs occur in denormalized tables, e.g. star schemas
● Smarter criteria for stopping algorithm
● Skew/heavy hitters. Are some values much more frequent than others?
● Conditional cardinalities and functional dependencies
○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)

More Related Content

PDF
Tactical data engineering
Julian Hyde
 
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
PPTX
Lazy beats Smart and Fast
Julian Hyde
 
PDF
Spatial query on vanilla databases
Julian Hyde
 
PDF
Efficient spatial queries on vanilla databases
Julian Hyde
 
PDF
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 
Tactical data engineering
Julian Hyde
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Lazy beats Smart and Fast
Julian Hyde
 
Spatial query on vanilla databases
Julian Hyde
 
Efficient spatial queries on vanilla databases
Julian Hyde
 
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 

What's hot (20)

PDF
Don’t optimize my queries, optimize my data!
Julian Hyde
 
PDF
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde
 
PPT
Drill / SQL / Optiq
Julian Hyde
 
PDF
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Julian Hyde
 
PDF
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Julian Hyde
 
PDF
Cost-based Query Optimization
DataWorks Summit/Hadoop Summit
 
PDF
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Julian Hyde
 
PDF
Streaming SQL
Julian Hyde
 
PDF
Why you care about
 relational algebra (even though you didn’t know it)
Julian Hyde
 
PPTX
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
PPTX
Cost-based query optimization in Apache Hive
Julian Hyde
 
PDF
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
PDF
Fast federated SQL with Apache Calcite
Chris Baynes
 
PDF
Apache Calcite: One Frontend to Rule Them All
Michael Mior
 
PDF
Streaming SQL
Julian Hyde
 
PDF
Introduction to Apache Calcite
Jordan Halterman
 
PDF
Streaming SQL
Julian Hyde
 
PDF
Data Profiling in Apache Calcite
Julian Hyde
 
PDF
ONE FOR ALL! Using Apache Calcite to make SQL smart
Evans Ye
 
PPTX
Calcite meetup-2016-04-20
Josh Elser
 
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde
 
Drill / SQL / Optiq
Julian Hyde
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Julian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Julian Hyde
 
Cost-based Query Optimization
DataWorks Summit/Hadoop Summit
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Julian Hyde
 
Streaming SQL
Julian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Julian Hyde
 
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Cost-based query optimization in Apache Hive
Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Fast federated SQL with Apache Calcite
Chris Baynes
 
Apache Calcite: One Frontend to Rule Them All
Michael Mior
 
Streaming SQL
Julian Hyde
 
Introduction to Apache Calcite
Jordan Halterman
 
Streaming SQL
Julian Hyde
 
Data Profiling in Apache Calcite
Julian Hyde
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
Evans Ye
 
Calcite meetup-2016-04-20
Josh Elser
 
Ad

Similar to Don't optimize my queries, organize my data! (20)

PDF
Data profiling in Apache Calcite
DataWorks Summit
 
PDF
Data profiling with Apache Calcite
Julian Hyde
 
PDF
SQL on everything, in memory
Julian Hyde
 
PDF
Bill howe 2_databases
Mahammad Valiyev
 
PDF
phoenix-on-calcite-hadoop-summit-2016
Maryann Xue
 
PDF
Cost-Based query optimization
DataWorks Summit/Hadoop Summit
 
PDF
Optimized cluster index generation
Rutvik Pensionwar
 
PPTX
The Microsoft BigData Story
Lynn Langit
 
PPT
The thinking persons guide to data warehouse design
Calpont
 
PDF
Infobright Column-Oriented Analytical Database Engine
Alex Esterkin
 
PPTX
Performance By Design
Guy Harrison
 
PDF
Relational Database Design Bootcamp
Mark Niebergall
 
PDF
Cs437 lecture 1-6
Aneeb_Khawar
 
PPTX
In memory databases presentation
Michael Keane
 
PDF
QBIC
Misha Kozik
 
PDF
Delivering fast, powerful and scalable analytics #OPEN18
Kangaroot
 
PPT
What to do when one size does not fit all?!
Arjen de Vries
 
PPT
Ch10
蕭美蓮
 
PDF
query optimization
Dimara Hakim
 
PPT
Analysis technologies - day3 slides Lecture notesppt
PerumalPitchandi
 
Data profiling in Apache Calcite
DataWorks Summit
 
Data profiling with Apache Calcite
Julian Hyde
 
SQL on everything, in memory
Julian Hyde
 
Bill howe 2_databases
Mahammad Valiyev
 
phoenix-on-calcite-hadoop-summit-2016
Maryann Xue
 
Cost-Based query optimization
DataWorks Summit/Hadoop Summit
 
Optimized cluster index generation
Rutvik Pensionwar
 
The Microsoft BigData Story
Lynn Langit
 
The thinking persons guide to data warehouse design
Calpont
 
Infobright Column-Oriented Analytical Database Engine
Alex Esterkin
 
Performance By Design
Guy Harrison
 
Relational Database Design Bootcamp
Mark Niebergall
 
Cs437 lecture 1-6
Aneeb_Khawar
 
In memory databases presentation
Michael Keane
 
Delivering fast, powerful and scalable analytics #OPEN18
Kangaroot
 
What to do when one size does not fit all?!
Arjen de Vries
 
Ch10
蕭美蓮
 
query optimization
Dimara Hakim
 
Analysis technologies - day3 slides Lecture notesppt
PerumalPitchandi
 
Ad

More from Julian Hyde (14)

PPTX
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
PDF
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Julian Hyde
 
PDF
Building a semantic/metrics layer using Calcite
Julian Hyde
 
PDF
Cubing and Metrics in SQL, oh my!
Julian Hyde
 
PDF
Adding measures to Calcite SQL
Julian Hyde
 
PDF
Morel, a data-parallel programming language
Julian Hyde
 
PDF
Is there a perfect data-parallel programming language? (Experiments with More...
Julian Hyde
 
PDF
Morel, a Functional Query Language
Julian Hyde
 
PDF
Apache Calcite (a tutorial given at BOSS '21)
Julian Hyde
 
PDF
The evolution of Apache Calcite and its Community
Julian Hyde
 
PDF
What to expect when you're Incubating
Julian Hyde
 
PDF
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Julian Hyde
 
PDF
Streaming SQL with Apache Calcite
Julian Hyde
 
PDF
Streaming SQL
Julian Hyde
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Julian Hyde
 
Building a semantic/metrics layer using Calcite
Julian Hyde
 
Cubing and Metrics in SQL, oh my!
Julian Hyde
 
Adding measures to Calcite SQL
Julian Hyde
 
Morel, a data-parallel programming language
Julian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Julian Hyde
 
Morel, a Functional Query Language
Julian Hyde
 
Apache Calcite (a tutorial given at BOSS '21)
Julian Hyde
 
The evolution of Apache Calcite and its Community
Julian Hyde
 
What to expect when you're Incubating
Julian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Julian Hyde
 
Streaming SQL with Apache Calcite
Julian Hyde
 
Streaming SQL
Julian Hyde
 

Recently uploaded (20)

PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
dushyantsharma1221
 
PPTX
Azure Data management Engineer project.pptx
sumitmundhe77
 
PPTX
Measurement of Afordability for Water Supply and Sanitation in Bangladesh .pptx
akmibrahimbd
 
PDF
Digital Infrastructure – Powering the Connected Age
Heera Yadav
 
PPTX
Global journeys: estimating international migration
Office for National Statistics
 
PDF
TCP_IP for Programmers ------ slides.pdf
Souhailsouhail5
 
PPTX
Machine Learning Solution for Power Grid Cybersecurity with GraphWavelets
Sione Palu
 
PPTX
batch data Retailer Data management Project.pptx
sumitmundhe77
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
PDF
Chad Readey - An Independent Thinker
Chad Readey
 
PDF
CH2-MODEL-SETUP-v2017.1-JC-APR27-2017.pdf
jcc00023con
 
PPTX
Logistic Regression ml machine learning.pptx
abdullahcocindia
 
PPTX
GR3-PPTFINAL (1).pptx 0.91 MbHIHUHUGG,HJGH
DarylArellaga1
 
PPTX
Trading Procedures (1).pptxcffcdddxxddsss
garv794
 
PDF
Master Databricks SQL with AccentFuture – The Future of Data Warehousing
Accentfuture
 
PPTX
Moving the Public Sector (Government) to a Digital Adoption
PaulYoung221210
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PDF
AI Lect 2 Identifying AI systems, branches of AI, etc.pdf
mswindow00
 
PPT
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
JanakiRaman206018
 
PPTX
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
Major-Components-ofNKJNNKNKNKNKronment.pptx
dushyantsharma1221
 
Azure Data management Engineer project.pptx
sumitmundhe77
 
Measurement of Afordability for Water Supply and Sanitation in Bangladesh .pptx
akmibrahimbd
 
Digital Infrastructure – Powering the Connected Age
Heera Yadav
 
Global journeys: estimating international migration
Office for National Statistics
 
TCP_IP for Programmers ------ slides.pdf
Souhailsouhail5
 
Machine Learning Solution for Power Grid Cybersecurity with GraphWavelets
Sione Palu
 
batch data Retailer Data management Project.pptx
sumitmundhe77
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Chad Readey - An Independent Thinker
Chad Readey
 
CH2-MODEL-SETUP-v2017.1-JC-APR27-2017.pdf
jcc00023con
 
Logistic Regression ml machine learning.pptx
abdullahcocindia
 
GR3-PPTFINAL (1).pptx 0.91 MbHIHUHUGG,HJGH
DarylArellaga1
 
Trading Procedures (1).pptxcffcdddxxddsss
garv794
 
Master Databricks SQL with AccentFuture – The Future of Data Warehousing
Accentfuture
 
Moving the Public Sector (Government) to a Digital Adoption
PaulYoung221210
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
AI Lect 2 Identifying AI systems, branches of AI, etc.pdf
mswindow00
 
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
JanakiRaman206018
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 

Don't optimize my queries, organize my data!

  • 1. DON’T OPTIMIZE MY QUERIES, ORGANIZE MY DATA! Julian Hyde (Calcite PMC)
  • 2. Don’t optimize my queries, optimize my data! Abstract Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
  • 3. @julianhyde SQL Query planning Query federation OLAP Streaming Hadoop ASF member Original author of Apache Calcite PMC Apache Arrow, Calcite, Drill, Eagle, Kylin Architect at Looker
  • 5. A “simple” query Data ● 2010 U.S. census ● 100 million records ● 1KB per record ● 100 GB total System ● 4x SATA 3 disks ● Total read throughput 1 GB/s Query Goal ● Compute the answer to the query in under 5 seconds SELECT SUM(householdSize) FROM CensusHouseholds;
  • 6. Solutions Sequential scan Query takes 100 s (100 GB at 1 GB/s) Parallelize Spread the data over 40 disks in 10 machines Query takes 10 s Cache Keep the data in memory 2nd query: 10 ms 3rd query: 10 s Materialize Summarize the data on disk All queries: 100 ms Materialize + cache + adapt As above, building summaries on demand
  • 7. Lazy > Smart + Fast (Lazy + adaptive is even better)
  • 8. Overview How do you tune a data system? How can (or should) a data system tune itself? What problems have we solved to bring these things to Apache Calcite? Part 1: Strategies for organizing data Part 2: How to make systems self-organizing?
  • 9. SELECT d.name, COUNT(*) AS c FROM Emps AS e JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Relational algebra Based on set theory, plus operators: Project, Filter, Aggregate, Union, Join, Sort Requires: declarative language (SQL), query planner Original goal: data independence Enables: query optimization, new algorithms and data structures Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  • 10. Apache Calcite Apache top-level project Query planning framework used in many projects and products Also works standalone: embedded federated query engine with SQL / JDBC front end Apache community development model https://calcite.apache.org https://github.com/apache/calcite
  • 12. Ways of organizing data Format (CSV, JSON, binary) Layout: row- vs. column-oriented (e.g. Parquet, ORC), cache friendly (e.g. Arrow) Storage medium (disk, flash, RAM, NVRAM, ...) Non-lossy copy: sorted / partitioned Lossy copies of data: project, filter, aggregate, join Combinations of the above Logical optimizations >> physical optimizations
  • 13. Index A sorted, projected materialized view Accelerates queries that use ranges, correlated lookups, sorting, aggregate, distinct CREATE TABLE Emp (empno INT, name VARCHAR(20), deptno INT); CREATE INDEX I_Emp_Deptno ON Emp (deptno, name); SELECT DISTINCT deptno FROM Emp WHERE deptno BETWEEN 20 AND 40 ORDER BY deptno; empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name rowid 10 Barney af5634.0001 10 Dino af5634.0003 20 Fred af5634.0000 30 Wilma af5634.0002
  • 14. Add the remaining columns No longer need “rowid” Lossless During planning, treat indexes as tables, and index lookups as joins Covering index empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name empno 10 Barney 100 10 Dino 130 20 Fred 20 30 Wilma 30 CREATE INDEX I_Emp_Deptno2 ( deptno INTEGER, name VARCHAR(20)) COVER (empno);
  • 15. Materialized view CREATE MATERIALIZED VIEW EmpsByDeptno AS SELECT deptno, name, deptno FROM Emp ORDER BY deptno, name; Scan [Emps] Scan [EmpsByDeptno] Sort [deptno, name] empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name empno 10 Barney 100 10 Dino 130 20 Fred 20 30 Wilma 30 As a materialized view, an index is now just another table Several tables contain the information necessary to answer the query - just pick the best
  • 16. Spatial query Find all restaurants within 1.5 distance units of where I am: restaurant x y Zachary’s pizza 3 1 King Yen 7 7 Filippo’s 7 4 Station burger 5 6 SELECT * FROM Restaurants AS r WHERE ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 • • • • Zachary’s pizza Filippo’s King Yen Station burger
  • 17. Hilbert space-filling curve ● A space-filling curve invented by mathematician David Hilbert ● Every (x, y) point has a unique position on the curve ● Points near to each other typically have Hilbert indexes close together
  • 18. • • • • Add restriction based on h, a restaurant’s distance along the Hilbert curve Must keep original restriction due to false positives Using Hilbert index restaurant x y h Zachary’s pizza 3 1 5 King Yen 7 7 41 Filippo’s 7 4 52 Station burger 5 6 36 Zachary’s pizza Filippo’s SELECT * FROM Restaurants AS r WHERE (r.h BETWEEN 35 AND 42 OR r.h BETWEEN 46 AND 46) AND ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 King Yen Station burger
  • 19. Telling the optimizer 1. Declare h as a generated column 2. Sort table by h Planner can now convert spatial range queries into a range scan Does not require specialized spatial index such as r-tree Very efficient on a sorted table such as HBase CREATE TABLE Restaurants ( restaurant VARCHAR(20), x DOUBLE, y DOUBLE, h DOUBLE GENERATED ALWAYS AS ST_Hilbert(x, y) STORED) SORT KEY (h); restaurant x y h Zachary’s pizza 3 1 5 Station burger 5 6 36 King Yen 7 7 41 Filippo’s 7 4 52
  • 20. Much valuable data is “data in flight” Use SQL to query streams (or streams + tables) Streaming Data center SELECT AVG(unitPrice) FROM Orders WHERE units > 1000 AND orderDate BETWEEN ‘2014-06-01’ AND ‘2015-12-31’ SELECT STREAM * FROM Orders WHERE units > 1000 Streaming query Historic query
  • 21. Hybrid query combines a stream with its own history ● Orders is used as both as stream and as “stream history” virtual table ● “Average order size over last year” should be maintained by the system, i.e. a materialized view SELECT STREAM * FROM Orders AS o WHERE units > ( SELECT AVG(units) FROM Orders AS h WHERE h.productId = o.productId AND h.rowtime > o.rowtime - INTERVAL ‘1’ YEAR) “Orders” used as a stream “Orders” used as a “stream history” virtual table
  • 22. Summary - data optimization via materialized views Many forms of data optimization can be modeled as materialized views: ● Blocks in cache ● B-tree indexes ● Summary tables ● Spatial indexes ● History of streams Allows the optimizer to “understand” the optimization and use it (if beneficial) But who designs the optimizations?
  • 24. How do data systems learn? queries DML statistics adaptations recommender Goals ● Improve response time, throughput, storage cost ● Predictable, adaptive (short and long term), allow human intervention How? ● Humans ● Adaptive systems ● Smart algorithms Example adaptations ● Cache disk blocks in memory ● Cached query results ● Data organization, e.g. partition on a different key ● Secondary structures, e.g. b-tree and r-tree indexes
  • 25. Tiled, in-memory materialized views A vision for an adaptive data system (we’re not there yet) tables on disk in-memory materializations SELECT x, SUM(n) FROM t GROUP BY x
  • 26. Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data ● Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)
  • 27. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); Designing summary tables via lattices CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; product product class sales customers time
  • 28. Many possible summary tables Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 raw 1m (y, m) 60 (g, y) 10 (z, s) 43.4k (g, y, m) 120 Fewer than you would expect, because 5m combinations cannot occur in 1m row table Fewer than you would expect, because state depends on zipcode
  • 29. Algorithm: Design summary tables Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm [1]: ● Based on research [2] ● Greedy algorithm that takes a combination of summary tables and tries to find the table that yields the greatest cost/benefit improvement ● Models “benefit” of the table as query time saved over simulated query load ● The “cost” of a table is its size [1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm [2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
  • 30. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)
  • 31. Data profiling Algorithm needs count(distinct a, b, ...) for each combination of attributes: ● Previous example had 25 = 32 possible tables ● Schema with 30 attributes has 230 (about 109 ) possible tables ● Algorithm considers a significant fraction of these ● Approximations are OK Attempts to solve the profiling problem: 1. Compute each combination: scan, sort, unique, count; repeat 230 times! 2. Sketches (HyperLogLog) 3. Sketches + parallelism + information theory [CALCITE-1616]
  • 32. Sketches HyperLogLog is an algorithm that computes approximate distinct count. It can estimate cardinalities of 109 with a typical error rate of 2%, using 1.5 kB of memory. [3][4] With 16 MB memory per machine we can compute 10,000 combinations of attributes each pass. So, we’re down from 109 to 105 passes. [3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" [4] https://github.com/mrjgreen/HyperLogLog
  • 33. Given Expected cardinality Actual cardinality Surprise (gender): 2 (state): 50 (gender, state): 100.0 100 0.000 (month): 12 (zipcode): 43,000 (month, zipcode): 441,699.3 442,700 0.001 (state): 50 (zipcode): 43,000 (state, zipcode): 799,666.7 43,400 0.897 (state, zipcode): 43,400 (gender, state): 100 (gender, zipcode): 85,995 (gender, state, zipcode): 86,799 = min(86,799, 892,234, 892,228) 83,567 0.019 ● Surprise = abs(actual - expected) / (actual + expected) ● E(card (x, y)) = n . (1 - ((n - 1) / n) ^ p) n = card (x) * card (y), p = row count Combining probability & information theory
  • 34. Algorithm Three ways “surprise” can help: ● If a cardinality is not surprising, we don’t need to store it -- we can derive it ● If a combination’s cardinality is not surprising, it is unlikely to have surprising children ● If we’re not seeing surprising results, it’s time to stop surprise_threshold := 1 queue := {singleton combinations} // (a), (b), ... while queue is not empty { batch := remove first 10,000 entries in queue compute cardinality of each combination in batch for each actual (computed) cardinality a { e := expected cardinality of combination s := surprise(a, e) if s > surprise_threshold { store combination and its cardinality add child combinations to queue // (x, a), (x, b), ... } increase surprise_threshold } }
  • 35. Algorithm progress and “surprise” threshold Progress of algorithm Rejected as not sufficiently surprising Surprise threshold rises as algorithm progresses Singleton combinations are have surprise = 1 Surprise threshold rises after we have completed the first batch
  • 36. Data profiling - summary The algorithm defeats a combinatorial search space using sketches + information theory + parallelism Recommending data structures is an optimization problem; profiling provides the cost & benefit function As a by-product, the algorithm discovers unique keys, “almost” keys, and foreign keys But which tables are actually joined together in practice?
  • 37. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; product product class sales customers time The lattice generates the summary tables. But who writes the lattice? Designing summary tables via lattices (2)
  • 38. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; ALTER SCHEMA Sales INFER LATTICES; product product class sales customers time Designing summary tables via lattices (3)
  • 39. Lattice after Query 1 + 2 Query 2 Query 1 Growing and evolving lattices based on queries sales customers product product class sales product product class sales customers See: [CALCITE-1870] “Lattice suggester”
  • 40. Summary Learning systems = manual tuning + adaptive + smart algorithms Query history + data profiling→ lattices → summary tables We have discussed summary tables (materialized views based on join/aggregate in a star schema) but the approach can be applied to other kinds of materialized views Relational algebra, incorporating materialized views, is a powerful language that allows us to combine many forms of data optimization
  • 41. Thank you! Questions? @ApacheCalcite | @julianhyde | https://calcite.apache.org Resources ➔ [CALCITE-1616] Data profiler ➔ [CALCITE-1870] Lattice suggester ➔ [CALCITE-1861] Spatial indexes ➔ [CALCITE-1968] OpenGIS ➔ [CALCITE-1991] Generated columns ➔ Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017) ➔ Talk: “SQL on everything, in memory” (Strata, 2014) ➔ Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial Objects” ➔ Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
  • 45. Planning queries MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc Table: splunk
  • 46. Optimized query MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: splunk Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 47. Calcite framework Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • FilterMergeRule • AggregateUnionTransposeRule • 100+ more Global transformations Unification (materialized view) Column trimming De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • RelDistribution (partitioning) RelBuilder JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice
  • 48. Materialized views, lattices, tiles Materialized view - A table whose contents are guaranteed to be the same as executing a given query. Lattice - Recommends, builds, and recognizes summary materialized views (tiles) based on a star schema. A query defines the tables and many:1 relationships in the star schema. Tile - A summary materialized view that belongs to a lattice. A tile may or may not be materialized. Might be: ● Declared in lattice, or ● Generated via recommender algorithm, or ● Created in response to query. CREATE MATERIALIZED VIEW t AS SELECT * FROM emps WHERE deptno = 10; CREATE LATTICE star AS SELECT * FROM sales_fact_1997 AS s JOIN product AS p ON … JOIN product_class AS pc ON … JOIN customer AS c ON … JOIN time_by_day AS t ON …; CREATE MATERIALIZED VIEW zg IN star SELECT gender, zipcode, COUNT(*), SUM(unit_sales) FROM star GROUP BY gender, zipcode;
  • 49. Combining past and future select stream * from Orders as o where units > ( select avg(units) from Orders as h where h.productId = o.productId and h.rowtime > o.rowtime - interval ‘1’ year) ➢ Orders is used as both stream and table ➢ System determines where to find the records ➢ Query is invalid if records are not available
  • 50. Controlling when data is emitted Early emission is the defining characteristic of a streaming query. The emit clause is a SQL extension inspired by Apache Beam’s “trigger” notion. (Still experimental… and evolving.) A relational (non-streaming) query is just a query with the most conservative possible emission strategy. select stream productId, count(*) as c from Orders group by productId, floor(rowtime to hour) emit at watermark, early interval ‘2’ minute, late limit 1; select * from Orders emit when complete;
  • 51. Other applications of data profiling Query optimization: ● Planners are poor at estimating selectivity of conditions after N-way join (especially on real data) ● New join-order benchmark: “Movies made by French directors tend to have French actors” ● Predict number of reducers in MapReduce & Spark “Grokking” a data set Identifying problems in normalization, partitioning, quality Applications in machine learning?
  • 52. Further improvements to data profiling ● Build sketches in parallel ● Run algorithm in a distributed framework (Spark or MapReduce) ● Compute histograms ○ For example, Median age for male/female customers ● Seek out functional dependencies ○ Once you know FDs, a lot of cardinalities are no longer “surprising” ○ FDs occur in denormalized tables, e.g. star schemas ● Smarter criteria for stopping algorithm ● Skew/heavy hitters. Are some values much more frequent than others? ● Conditional cardinalities and functional dependencies ○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)