Adding measures to Calcite SQL
Julian Hyde (Google)
Apache Calcite virtual meetup, 2023-03-15
SQL vs BI
BI tools implement their own languages on top of SQL. Why not SQL?
Possible reasons:
● Semantic Model
● Control presentation / visualization
● Governance
● Pre-join tables
● Define reusable calculations
● Ask complex questions in a concise way
Processing BI in SQL
Why we should do it
● Move processing, not data
● Cloud SQL scale
● Remove data lag
● SQL is open
Why it’s hard
● Different paradigm
● More complex data model
● Can’t break SQL
Pasta machine vs Pizza delivery
Bottom-up vs Top-down query
[Figure, left panel "Relational algebra (bottom-up)": operator trees that combine Sales, Products and Suppliers using ⨝, σ, Σ and π. Right panel "Multidimensional (top-down)": dimensions Supplier, Product and Date, evaluated at the cells (Supplier: 'ACE', Date: '1994-01', Product: all) and (Supplier: 'ACE', Date: '1995-01', Product: all).]
Some multidimensional queries
● Give the total sales for each product in each quarter of 1995. (Note that quarter is a function of date).
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
● For each product give its market share in its category today minus its market share in its category in
October 1994.
● Select top 5 suppliers for each product category for last year, based on total sales.
● For each product category, select total sales this month of the product that had highest sales in that
category last month.
● Select suppliers that currently sell the highest selling product of last month.
● Select suppliers for which the total sale of every product increased in each of last 5 years.
● Select suppliers for which the total sale of every product category increased in each of last 5 years.
(From [Agrawal1997]. Assumes a database with dimensions {supplier, date, product} and measure {sales}.)
Query:
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
SQL:
  SELECT prodId,
    s95.sales,
    (s95.sales - s94.sales) / s94.sales
  FROM (
    SELECT p.prodId, SUM(s.sales) AS sales
    FROM Sales AS s
    JOIN Suppliers AS u USING (suppId)
    JOIN Products AS p USING (prodId)
    WHERE u.name = 'ACE'
    AND FLOOR(s.date TO MONTH) = '1995-01-01'
    GROUP BY p.prodId) AS s95
  LEFT JOIN (
    SELECT p.prodId, SUM(s.sales) AS sales
    FROM Sales AS s
    JOIN Suppliers AS u USING (suppId)
    JOIN Products AS p USING (prodId)
    WHERE u.name = 'ACE'
    AND FLOOR(s.date TO MONTH) = '1994-01-01'
    GROUP BY p.prodId) AS s94
  USING (prodId)

MDX:
  WITH MEMBER [Measures].[Sales Last Year] =
      ([Measures].[Sales],
       ParallelPeriod([Date], 1, [Date].[Year]))
    MEMBER [Measures].[Sales Growth] =
      ([Measures].[Sales]
       - [Measures].[Sales Last Year])
      / [Measures].[Sales Last Year]
  SELECT [Measures].[Sales Growth] ON COLUMNS,
    [Product].Members ON ROWS
  FROM [Sales]
  WHERE [Supplier].[ACE]
SQL with measures:
  SELECT p.prodId,
    SUM(s.sales) AS MEASURE sumSales,
    sumSales AT (SET FLOOR(s.date TO MONTH)
      = '1994-01-01')
      AS MEASURE sumSalesLastYear
  FROM Sales AS s
  JOIN Suppliers AS u USING (suppId)
  JOIN Products AS p USING (prodId)
  WHERE u.name = 'ACE'
  AND FLOOR(s.date TO MONTH) = '1995-01-01'
  GROUP BY p.prodId
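One way to read the measure query is that sumSales is a function of the evaluation context (the filters placed on the dimensions), and AT (SET …) re-evaluates it after overriding one dimension. The Python sketch below emulates that reading on made-up rows. The names SALES, sum_sales and growth_by_product, and the context-as-dictionary representation, are assumptions for illustration only; they are not Calcite APIs.

```python
from typing import Dict, List

# Hypothetical sample data: one row per (supplier, product, month) with a sales amount.
SALES: List[dict] = [
    {"supplier": "ACE", "product": "nut", "month": "1994-01", "sales": 100.0},
    {"supplier": "ACE", "product": "nut", "month": "1995-01", "sales": 150.0},
    {"supplier": "ACE", "product": "bolt", "month": "1994-01", "sales": 200.0},
    {"supplier": "ACE", "product": "bolt", "month": "1995-01", "sales": 180.0},
    {"supplier": "ZED", "product": "nut", "month": "1995-01", "sales": 999.0},
]


def sum_sales(context: Dict[str, str]) -> float:
    """Emulates the measure sumSales: sum over all rows that match the context."""
    return sum(
        row["sales"]
        for row in SALES
        if all(row[dim] == value for dim, value in context.items())
    )


def growth_by_product(supplier: str, month: str, base_month: str) -> Dict[str, float]:
    """Per product: (sales in month - sales in base_month) / sales in base_month."""
    result = {}
    products = sorted({r["product"] for r in SALES if r["supplier"] == supplier})
    for product in products:
        # Context implied by the WHERE and GROUP BY clauses of the query.
        context = {"supplier": supplier, "month": month, "product": product}
        current = sum_sales(context)
        # Emulates AT (SET month = base_month): override a single dimension.
        previous = sum_sales({**context, "month": base_month})
        result[product] = (current - previous) / previous
    return result


growth = growth_by_product("ACE", "1995-01", "1994-01")
print(growth)
```

The point of the sketch is that the January 1994 figure needs no second subquery or self-join: it is the same measure evaluated in a slightly different context.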
Self-joins, correlated subqueries, window aggregates, measures
Window aggregate functions were introduced to save on
self-joins.
Some DBs rewrite scalar subqueries and self-joins to
window aggregates [Zuzarte2003].
Window aggregates are more concise, easier to optimize,
and often more efficient.
However, window aggregates can only see data that is from
the same table, and is allowed by the WHERE clause.
Measures overcome that limitation.
Correlated subquery:
  SELECT *
  FROM Employees AS e
  WHERE sal > (
    SELECT AVG(sal)
    FROM Employees
    WHERE deptno = e.deptno)

Window aggregate:
  SELECT *
  FROM Employees AS e
  WHERE sal > AVG(sal)
    OVER (PARTITION BY deptno)
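The equivalence of the two forms can be checked on a small table. The sketch below uses Python's sqlite3 module with made-up employee rows. Standard SQL does not allow a window function directly in WHERE (SQLite rejects it too), so the window aggregate is computed in a subquery here; that rewrite is an assumption about how the shorthand on the slide would be written in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (ename TEXT, deptno INTEGER, sal REAL);
    INSERT INTO Employees VALUES
        ('CLARK', 10, 2450), ('KING', 10, 5000), ('MILLER', 10, 1300),
        ('SMITH', 20, 800), ('JONES', 20, 2975), ('SCOTT', 20, 3000);
""")

# Correlated scalar subquery: the inner query refers to the outer row's deptno.
correlated = conn.execute("""
    SELECT ename
    FROM Employees AS e
    WHERE sal > (SELECT AVG(sal)
                 FROM Employees
                 WHERE deptno = e.deptno)
    ORDER BY ename
""").fetchall()

# Window aggregate, computed in a subquery because SQLite does not
# accept window functions in the WHERE clause.
windowed = conn.execute("""
    SELECT ename
    FROM (SELECT ename, sal,
                 AVG(sal) OVER (PARTITION BY deptno) AS deptAvgSal
          FROM Employees)
    WHERE sal > deptAvgSal
    ORDER BY ename
""").fetchall()

print(correlated, windowed)
```

Both queries return the employees who earn more than their department's average. Adding a filter to the outer query shows the limitation described above: the window aggregate would then see only the rows that the filter lets through.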
A measure is…
… a column with an aggregate function.
  SUM(sales)
… a column that, when used as an expression, knows how to aggregate itself.
  (SUM(sales) - SUM(cost))
    / SUM(sales)
… a column that, when used as an expression, can evaluate itself in any context. Its value depends on, and only on, the predicate placed on its dimensions.
  (SELECT SUM(forecastSales)
   FROM SalesForecast AS s
   WHERE predicate(s))
  ExchService$ClosingRate(
    'USD', 'EUR', sales.date)
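To make "knows how to aggregate itself" concrete, the sketch below models a measure as a Python function of a row predicate, using the profit-margin formula from the slide. The rows and the names ROWS and profit_margin are hypothetical. The last computation shows what goes wrong when a non-additive value is calculated per row and then averaged.

```python
from typing import Callable, List

Row = dict
Predicate = Callable[[Row], bool]

# Hypothetical fact rows: dimensions (region, product), facts (sales, cost).
ROWS: List[Row] = [
    {"region": "EAST", "product": "nut", "sales": 100.0, "cost": 50.0},
    {"region": "EAST", "product": "bolt", "sales": 300.0, "cost": 270.0},
    {"region": "WEST", "product": "nut", "sales": 200.0, "cost": 100.0},
]


def profit_margin(predicate: Predicate) -> float:
    """Emulates a measure: (SUM(sales) - SUM(cost)) / SUM(sales) over matching rows."""
    matching = [r for r in ROWS if predicate(r)]
    sales = sum(r["sales"] for r in matching)
    cost = sum(r["cost"] for r in matching)
    return (sales - cost) / sales


# One definition, evaluated at different grains by changing only the predicate.
margin_east = profit_margin(lambda r: r["region"] == "EAST")
margin_nut = profit_margin(lambda r: r["product"] == "nut")
margin_all = profit_margin(lambda r: True)

# For contrast: averaging per-row margins is a different (and usually unwanted) number.
east_rows = [r for r in ROWS if r["region"] == "EAST"]
naive_east = sum((r["sales"] - r["cost"]) / r["sales"] for r in east_rows) / len(east_rows)

print(margin_east, margin_nut, margin_all, naive_east)
```

Because the function receives only a predicate over the dimensions, its result depends on that predicate and nothing else, which is the property the last bullet states.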
Table model
Tables are SQL’s fundamental model.
The model is closed – queries consume and produce tables.
Tables are opaque – you can’t deduce the type, structure or private data of a table.

Employees2:
  SELECT deptno, job,
    AVG(sal) AS avgSal
  FROM Employees
  GROUP BY deptno, job

Employees3:
  SELECT deptno, AVG(avgSal) AS avgSal2
  FROM Employees2
  GROUP BY deptno

Query on Employees3:
  SELECT MOD(deptno, 2) = 0 AS evenDeptno, avgSal2
  FROM Employees3
  WHERE deptno < 30
Table model with measures
We propose to allow any table and query to have measure columns.
The model is closed – queries consume and produce tables-with-measures.
Tables-with-measures are semi-opaque – you can’t deduce the type, structure or private data, but you can evaluate the measure in any context that can be expressed as a predicate on the measure’s dimensions.

AnalyticEmployees:
  SELECT *,
    AVG(sal) AS MEASURE avgSal
  FROM Employees

AnalyticEmployees2:
  SELECT *,
    avgSal AS MEASURE avgSal,
    avgSal AT (CLEAR deptno) AS MEASURE deptAvgSal
  FROM AnalyticEmployees

Query on AnalyticEmployees2:
  SELECT e.deptno, e.job, d.dname, e.avgSal / e.deptAvgSal
  FROM AnalyticEmployees2 AS e
  JOIN Departments AS d USING (deptno)
  WHERE d.dname <> 'MARKETING'
  GROUP BY deptno, job
Model + Query + Engine = Data system
[Figure: the three components of a data system: query language, data model, engine.]
Syntax
expression AS MEASURE – defines a measure in the SELECT clause
AGGREGATE(measure) – evaluates a measure in a GROUP BY query
expression AT (contextModifier…) – evaluates expression in a modified context
  contextModifier ::=
      CLEAR dimension
    | SET dimension = [CURRENT] expression
    | VISIBLE
    | ALL
aggFunction(aggFunction(expression) PER dimension) – multi-level aggregation
Plan of attack
1. Add measures to the table model, and allow queries to use them
◆ Measures are defined only via the Table API
2. Define measures using SQL expressions (AS MEASURE)
◆ You can still define them using the Table API
3. Context-sensitive expressions (AT)
Semantics
0. We have a measure M, value type V, in a table T.
  CREATE VIEW AnalyticEmployees AS
  SELECT *, AVG(sal) AS MEASURE avgSal
  FROM Employees
1. System defines a row type R with the non-measure columns.
  CREATE TYPE R AS
    ROW (deptno: INTEGER, job: VARCHAR)
2. System defines an auxiliary function for M. (The function is typically a scalar subquery that references the measure’s underlying table.)
  CREATE FUNCTION computeAvgSal(
      rowPredicate: FUNCTION<R, BOOLEAN>) =
    (SELECT AVG(e.sal)
     FROM Employees AS e
     WHERE APPLY(rowPredicate, e))
Semantics (continued)
3. We have a query that uses M.
  SELECT deptno,
    avgSal
      / avgSal AT (CLEAR deptno)
  FROM AnalyticEmployees AS e
  GROUP BY deptno
4. Substitute measure references with calls to the auxiliary function with the appropriate predicate.
  SELECT deptno,
    computeAvgSal(r → (r.deptno = e.deptno))
      / computeAvgSal(r → TRUE)
  FROM AnalyticEmployees AS e
  GROUP BY deptno
5. Planner inlines computeAvgSal and scalar subqueries.
  SELECT deptno, AVG(sal) / MIN(avgSal)
  FROM (
    SELECT deptno, sal,
      AVG(sal) OVER () AS avgSal
    FROM Employees)
  GROUP BY deptno
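The rewrite in steps 4 and 5 can be sanity-checked on a stock engine. The sketch below uses Python's sqlite3 module and made-up rows. Since SQLite has no function-valued parameters, step 4 is written with plain scalar subqueries standing in for the two computeAvgSal calls; that substitution is an assumption of this sketch, not the planner's actual output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (deptno INTEGER, job TEXT, sal REAL);
    INSERT INTO Employees VALUES
        (10, 'CLERK', 1000), (10, 'MANAGER', 3000),
        (20, 'CLERK', 2000), (20, 'ANALYST', 4000), (20, 'MANAGER', 5000);
""")

# Step 4: each measure reference becomes a scalar subquery with its own predicate.
# First subquery: predicate r.deptno = e.deptno. Second: predicate TRUE (CLEAR deptno).
step4 = conn.execute("""
    SELECT deptno,
           (SELECT AVG(sal) FROM Employees AS r WHERE r.deptno = e.deptno)
           / (SELECT AVG(sal) FROM Employees AS r)
    FROM Employees AS e
    GROUP BY deptno
    ORDER BY deptno
""").fetchall()

# Step 5: the inlined form from the slide, using a window aggregate.
step5 = conn.execute("""
    SELECT deptno, AVG(sal) / MIN(avgSal)
    FROM (SELECT deptno, sal,
                 AVG(sal) OVER () AS avgSal
          FROM Employees)
    GROUP BY deptno
    ORDER BY deptno
""").fetchall()

print(step4)
print(step5)
```

Both forms give each department's average salary divided by the company-wide average, which is what avgSal / avgSal AT (CLEAR deptno) asks for.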
Calculating at the right grain
Example | Formula | Grain
Computing the revenue from units and unit price | units * pricePerUnit AS revenue | Row
Sum of revenue (additive) | SUM(revenue) AS MEASURE sumRevenue | Top
Profit margin (non-additive) | (SUM(revenue) - SUM(cost)) / SUM(revenue) AS MEASURE profitMargin | Top
Inventory (semi-additive) | SUM(LAST_VALUE(unitsInStock) PER inventoryDate) AS MEASURE sumInventory | Intermediate
Daily average (weighted average) | AVG(sumRevenue PER orderDate) AS MEASURE dailyAvgRevenue | Intermediate
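The difference between row grain and intermediate grain shows up in the daily-average example. The Python sketch below assumes that PER orderDate means "evaluate the inner expression once per distinct orderDate, then apply the outer aggregate to those values"; that reading, and the sample order lines, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical order lines: (orderDate, revenue).
ORDERS: List[Tuple[str, float]] = [
    ("2023-03-01", 10.0),
    ("2023-03-01", 30.0),
    ("2023-03-01", 20.0),
    ("2023-03-02", 100.0),
]

# Row grain: a plain AVG(revenue) weights every order line equally.
avg_per_row = sum(rev for _, rev in ORDERS) / len(ORDERS)

# Intermediate grain: first sumRevenue per orderDate, then AVG over the days.
revenue_by_day: Dict[str, float] = defaultdict(float)
for order_date, rev in ORDERS:
    revenue_by_day[order_date] += rev
daily_avg_revenue = sum(revenue_by_day.values()) / len(revenue_by_day)

print(avg_per_row, daily_avg_revenue)
```

With three order lines on the first day and one on the second, the two results differ by a factor of two, so the grain at which the inner aggregate is evaluated has to be part of the measure's definition.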
Subtotals & visible
  SELECT deptno, job,
    SUM(sal), sumSal
  FROM (
    SELECT *,
      SUM(sal) AS MEASURE sumSal
    FROM Employees)
  WHERE job <> 'ANALYST'
  GROUP BY ROLLUP(deptno, job)
  ORDER BY 1,2

deptno | job | SUM(sal) | sumSal
10 | CLERK | 1,300 | 1,300
10 | MANAGER | 2,450 | 2,450
10 | PRESIDENT | 5,000 | 5,000
10 | | 8,750 | 8,750
20 | CLERK | 1,900 | 1,900
20 | MANAGER | 2,975 | 2,975
20 | | 4,875 | 10,875
30 | CLERK | 950 | 950
30 | MANAGER | 2,850 | 2,850
30 | SALES | 5,600 | 5,600
30 | | 9,400 | 9,400
| | 23,025 | 29,025

Measures by default sum ALL rows; aggregate functions sum only VISIBLE rows.
Visible
Expression | Example | Which rows?
Aggregate function | SUM(sal) | Visible only
Measure | sumSal | All
AGGREGATE applied to measure | AGGREGATE(sumSal) | Visible only
Measure with VISIBLE | sumSal AT (VISIBLE) | Visible only
Measure with ALL | sumSal AT (ALL) | All
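The visible/all distinction can be emulated in a few lines of Python. The rows below are sample data chosen to be consistent with the per-job figures in the subtotals query (two analysts in department 20 are the rows that the WHERE clause hides). The function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Sample rows (deptno, job, sal), consistent with the per-job totals shown earlier.
EMPLOYEES: List[Tuple[int, str, int]] = [
    (10, "CLERK", 1300), (10, "MANAGER", 2450), (10, "PRESIDENT", 5000),
    (20, "CLERK", 800), (20, "CLERK", 1100), (20, "MANAGER", 2975),
    (20, "ANALYST", 3000), (20, "ANALYST", 3000),
    (30, "CLERK", 950), (30, "MANAGER", 2850), (30, "SALES", 5600),
]


def visible(row: Tuple[int, str, int]) -> bool:
    """The query's WHERE clause: job <> 'ANALYST'."""
    return row[1] != "ANALYST"


def in_group(row: Tuple[int, str, int],
             deptno: Optional[int], job: Optional[str]) -> bool:
    """True if the row belongs to a ROLLUP group; None means that column is rolled up."""
    return (deptno is None or row[0] == deptno) and (job is None or row[1] == job)


def sum_visible(deptno: Optional[int] = None, job: Optional[str] = None) -> int:
    """SUM(sal), AGGREGATE(sumSal), sumSal AT (VISIBLE): only rows that pass WHERE."""
    return sum(r[2] for r in EMPLOYEES if visible(r) and in_group(r, deptno, job))


def sum_all(deptno: Optional[int] = None, job: Optional[str] = None) -> int:
    """sumSal, sumSal AT (ALL): every row of the group, regardless of WHERE."""
    return sum(r[2] for r in EMPLOYEES if in_group(r, deptno, job))


print(sum_visible(20), sum_all(20))
print(sum_visible(20, "CLERK"), sum_all(20, "CLERK"))
```

For department 20 the two functions return 4,875 and 10,875, the same pair as the department 20 subtotal row. At the (20, CLERK) grain they agree, because no hidden row falls into that group.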
Semantic models versus databases
In my opinion, a semantic model…
● … is the place to share data and calculations
● … needs a really good query language
○ (So you don’t have to change the model every time
someone has a new question)
● … doesn’t become a database just because it
speaks SQL
● … should do other things too
○ (Access control, governance, presentation defaults,
guide data exploration, transform data, tune data, …)
“Shouldn’t the semantic model be outside the database? (I don’t want to be tied to one DBMS vendor.)”
“I have a great semantic model already. Why do I need a query language? My users don’t want to write SQL.”
“What even is a semantic model?”
Summary
Concise queries without self-joins
Top-down evaluation
Reusable calculations
Doesn’t break SQL
References
Papers
● [Agrawal1997] “Modeling multidimensional databases” (Agrawal, Gupta, and Sarawagi, 1997)
● [Zuzarte2003] “WinMagic: Subquery Elimination Using Window Aggregation” (Zuzarte, Pirahesh, Ma, Cheng, Liu, and Wong, 2003)
Issues
● [CALCITE-4488] WITHIN DISTINCT clause for aggregate functions (experimental)
● [CALCITE-4496] Measure columns ("SELECT ... AS MEASURE")
● [CALCITE-5105] Add MEASURE type and AGGREGATE aggregate function
● [CALCITE-5155] Custom time frames
● [CALCITE-xxxx] PER
● [CALCITE-xxxx] AT
Thank you!
Any questions?
@julianhyde
@ApacheCalcite
https://calcite.apache.org
Slides and recording will be posted at @ApacheCalcite.
