SlideShare a Scribd company logo
Page1 © Hortonworks Inc. 2014
Discardable, In-Memory
Materialized Query for Hadoop
Julian Hyde Julian Hyde
June 3rd, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
Before we get started…
The bad news
• This software is not available to download
• I am a database bigot
• I am a BI bigot
The good news
• I believe in Hadoop
• Now is a great time to discuss where Hadoop is going
Page4 © Hortonworks Inc. 2014
Hadoop today
Brute force
Hadoop brings a lot of CPU, disk, IO
Yarn, Tez, Vectorization are making Hadoop faster
How to use that brute force is left to the application
Business Intelligence
Best practice is to pull data out of Hadoop
• Populate enterprise data warehouse
• In-memory analytics
• Custom analytics, e.g. Lambda architecture
Ineffective use of memory
Opportunity to make Hadoop smarter
Page5 © Hortonworks Inc. 2014
Brute force + diplomacy
Page6 © Hortonworks Inc. 2014
Hardware trends
Typical Hadoop server configuration
Lots of memory - but not enough for all data
More heterogeneous mix (disk + SSD + memory)
What to do about memory?
Year 2009 2014
Cores 4 – 8 24
Memory 8 GB 128 GB
SSD None 1 TB
Disk 4 x 1 TB 12 x 4 TB
Disk : memory 512 : 1 384 : 1
Page7 © Hortonworks Inc. 2014
Dumb use of memory - Buffer cache
Example:
50 TB data
1 TB memory
Full-table scan
Analogous to virtual
memory
• Great while it works, but…
• Table scan nukes the cache!
Table (disk)
Buffer cache
(memory)
Table scan
Query
operators
(Tez, or
whatever)
Pre-fetch
Page8 © Hortonworks Inc. 2014
“Document-oriented” analysis
Operate on working sets small enough to fit in memory
Analogous to working on a document (e.g. a spreadsheet)
• Works well for problems that fit into memory (e.g. some machine-learning algorithms)
• If your problem grows, you’re out of luck
Working set
(memory)
Table (disk)
Interactive user
Page9 © Hortonworks Inc. 2014
Smarter use of memory - Materialized queries
In-memory
materialized
queries
Tables
on disk
Page10 © Hortonworks Inc. 2014
Census data
Census table – 300M records
Which state has the most
males?
Brute force – read 150M
records
Smarter – read 50 records
CREATE TABLE Census (id, gender,
zipcode, state, age);
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
CREATE TABLE CensusSummary AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, c
FROM CensusSummary
WHERE gender = ‘M’
ORDER BY c DESC LIMIT 1;
Page11 © Hortonworks Inc. 2014
Materialized view – Automatic smartness
A materialized view is a
table that is declared to be
identical to a given query
Optimizer rewrites query
on “Census” to use
“CensusSummary” instead
Even smarter – read 50
records
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
Page12 © Hortonworks Inc. 2014
Indistinguishable from magic
Run query #1
System creates
materialized view in
background
Related query #2 runs
faster
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state,
COUNT(NULLIF(gender, ‘F’)) AS males,
COUNT(NULLIF(gender, ‘M’)) AS females
FROM Census
GROUP BY state
HAVING females > males;
Page13 © Hortonworks Inc. 2014
Materialized views - Classic
Classic materialized view
(Oracle, DB2, Teradata, MSSql)
1. A table defined using a SQL query
2. Designed by DBA
3. Storage same as a regular table
1. On disk
2. Can define indexes
4. DB populates the table
5. Queries are rewritten to use the
table**
6. DB updates the table to reflect
changes to source data (usually
deferred)*
*Magic required
SELECT t.year, AVG(s.units)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
GROUP BY t.year;
CREATE MATERIALIZED VIEW SalesMonthZip AS
SELECT t.year, t.month,
c.state, c.zipcode,
COUNT(*), SUM(s.units), SUM(s.price)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId)
GROUP BY t.year, t.month,
c.state, c.zipcode;
Page14 © Hortonworks Inc. 2014
Materialized views - DIMMQ
DIMMQs - Discardable, In-memory Materialized Queries
Differences with classic materialized views
1. May be in-memory
2. HDFS may discard – based on DDM (Distributed Discardable Memory)
3. Lifecycle support:
1. Assume table is populated
2. Don’t populate & maintain
3. User can flag as valid, invalid, or change definition (e.g. date range)
4. HDFS may discard
4. More design options:
1. DBA specifies
2. Retain query results (or partial results)
3. An agent builds MVs based on query traffic
Page15 © Hortonworks Inc. 2014
DIMMQ compared to Spark RDDs
It’s not “either / or”
• Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon
• Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible
Spark RDD DIMMQ
Access By reference By algebraic expression
Sharing Within session Across sessions
Execution model Built-in External
Recovery Yes No
Discard Failure or GC Failure or cost-based
Native organization Memory Disk
Language Scala (or other JVM
language)
Language-independent
Page16 © Hortonworks Inc. 2014
Data independence
This is not just about SQL standards compliance!
Materialized views are supposed to be transparent in creation, maintenance and use.
If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded
Data independence
Ability to move data around and not tell your application
Replicas
Redundant copies
Moving between disk and memory
Sort order, projections (à la Vertica), aggregates (à la Microstrategy)
Indexes, and other weird data structures
Page17 © Hortonworks Inc. 2014
Implementing DIMMQs
Relational algebra
Apache Optiq (just entered incubator – yay!)
Algebra, rewrite rules, cost model
Metadata
Hive: “CREATE MATERIALIZED VIEW”
Definitions of materialized views in HCatalog
HDFS - Discardable Distributed Memory (DDM)
Off-heap data in memory-mapped files
Discard policy
Build in-memory, replicate to disk; or vice versa
Central namespace
Evolution of existing components
Page18 © Hortonworks Inc. 2014
Tiled queries in distributed memory
Query: SELECT x, SUM(y) FROM t GROUP BY x
In-memory
materialized
queries
Tables
on disk
Page19 © Hortonworks Inc. 2014
An adaptive system
Ongoing activities:
• Agent suggests new MVs
• MVs are built in background
• Ongoing query activity uses MVs
• User marks MVs as invalid due to source data
changes
• HDFS throws out MVs that are not pulling their
weight
Dynamic equilibrium
DIMMQs continually created & destroyed
System moves data around to adapt to
changing usage patterns
Page20 © Hortonworks Inc. 2014
Lambda architecture
From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
Page21 © Hortonworks Inc. 2014
Lambda architecture in Hadoop via DIMMQs
Use DIMMQs for materialized historic & streaming data
MV on disk
MV in key-
value store
Hadoop
Query via
SQL
Page22 © Hortonworks Inc. 2014
Variations on a theme
Materialized queries don’t have to be in memory
Materialized queries don’t need to be discardable
Materialized queries don’t need to be accessed via SQL
Materialized queries allow novel data structures to be
described
Maintaining materialized queries - build on Hive ACID
Fine-grained invalidation
Streaming into DIMMQs
In-memory tables don’t have to be materialized queries
Data aging – Older data to cheaper storage
Discardable
In-memory
Algebraic
Page23 © Hortonworks Inc. 2014
Lattice
Space of possible materialized views
A star schema, with mandatory many-to-one
relationships
Each view is a projected, filtered
aggregation
• Sales by zipcode and quarter in 2013
• Sales by state in Q1, 2012
Lattice gathers stats
• “I used MV m to answer query q and avoided
fetching r rows”
• Cost of MV = construction effort + memory * time
• Utility of MV = query processing effort saved
Recommends & builds optimal MVs
CREATE LATTICE SalesStar AS
SELECT *
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId);
Page24 © Hortonworks Inc. 2014
Conclusion
Broadening the tent – batch, interactive BI, streaming, iterative
Declarative, algebraic, transparent
Seamless movement between memory and disk
Adding brains to complement Hadoop’s brawn
Page25 © Hortonworks Inc. 2014
Thank you!
My next talk:
“Cost-based query optimization in Hive”
4:35pm Wednesday
@julianhyde
http://hortonworks.com/blog/dmmq/
http://hortonworks.com/blog/ddm/
http://incubator.apache.org/projects/optiq.html

More Related Content

PDF
SQL Now! How Optiq brings the best of SQL to NoSQL data.
PDF
Streaming SQL with Apache Calcite
PDF
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
PDF
Cost-based query optimization in Apache Hive 0.14
PDF
What's new in Mondrian 4?
PDF
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
PDF
Streaming SQL
SQL Now! How Optiq brings the best of SQL to NoSQL data.
Streaming SQL with Apache Calcite
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Cost-based query optimization in Apache Hive 0.14
What's new in Mondrian 4?
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL

What's hot (20)

PDF
Apache Calcite: One planner fits all
PPT
Why is data independence (still) so important? Optiq and Apache Drill.
PDF
Optiq: A dynamic data management framework
PPT
SQL on Big Data using Optiq
PDF
ONE FOR ALL! Using Apache Calcite to make SQL smart
PDF
Apache Calcite Tutorial - BOSS 21
PPTX
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
PDF
Streaming SQL
PDF
SQL on everything, in memory
PPTX
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
PDF
Apache Calcite: One Frontend to Rule Them All
PDF
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
PDF
Apache Calcite (a tutorial given at BOSS '21)
PDF
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
PPTX
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
PDF
Why you care about
 relational algebra (even though you didn’t know it)
PPTX
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
PPTX
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
PPT
Mondrian update (Pentaho community meetup 2012, Amsterdam)
PPTX
Options for Data Prep - A Survey of the Current Market
Apache Calcite: One planner fits all
Why is data independence (still) so important? Optiq and Apache Drill.
Optiq: A dynamic data management framework
SQL on Big Data using Optiq
ONE FOR ALL! Using Apache Calcite to make SQL smart
Apache Calcite Tutorial - BOSS 21
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Streaming SQL
SQL on everything, in memory
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Apache Calcite: One Frontend to Rule Them All
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Apache Calcite (a tutorial given at BOSS '21)
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Why you care about
 relational algebra (even though you didn’t know it)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Mondrian update (Pentaho community meetup 2012, Amsterdam)
Options for Data Prep - A Survey of the Current Market
Ad

Viewers also liked (11)

PDF
The twins that everyone loved too much
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PPTX
Cost-based Query Optimization in Hive
PDF
Oracle enterprise architects day
PPTX
Cost-based query optimization in Apache Hive 0.14
PDF
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
PDF
Streaming SQL
PDF
Streaming SQL
PPTX
Apache Calcite overview
PPTX
Cost-based query optimization in Apache Hive
PDF
Streaming SQL
The twins that everyone loved too much
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Hive
Oracle enterprise architects day
Cost-based query optimization in Apache Hive 0.14
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Streaming SQL
Streaming SQL
Apache Calcite overview
Cost-based query optimization in Apache Hive
Streaming SQL
Ad

Similar to Discardable In-Memory Materialized Queries With Hadoop (20)

PPTX
Architecting Your First Big Data Implementation
PDF
Meta scale kognitio hadoop webinar
PDF
PPSX
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
PPTX
Big Data Technologies and Why They Matter To R Users
PPTX
NoSQL
PPTX
Data warehousing with Hadoop
PDF
Predictive Analytics and Machine Learning…with SAS and Apache Hadoop
PDF
So You Want to Build a Data Lake?
PPTX
The modern analytics architecture
PPTX
Modern Data Warehousing with the Microsoft Analytics Platform System
PPTX
Hadoop Data Modeling
PDF
Bi on Big Data - Strata 2016 in London
PPTX
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
PPTX
Stinger Initiative - Deep Dive
PDF
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
PPTX
Introduction to Apache Kudu
PPTX
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
PPTX
Build a modern data platform.pptx
PPTX
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
Architecting Your First Big Data Implementation
Meta scale kognitio hadoop webinar
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
Big Data Technologies and Why They Matter To R Users
NoSQL
Data warehousing with Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache Hadoop
So You Want to Build a Data Lake?
The modern analytics architecture
Modern Data Warehousing with the Microsoft Analytics Platform System
Hadoop Data Modeling
Bi on Big Data - Strata 2016 in London
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
Stinger Initiative - Deep Dive
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Introduction to Apache Kudu
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Build a modern data platform.pptx
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup

More from Julian Hyde (20)

PPTX
Measures in SQL (SIGMOD 2024, Santiago, Chile)
PDF
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
PDF
Building a semantic/metrics layer using Calcite
PDF
Cubing and Metrics in SQL, oh my!
PDF
Adding measures to Calcite SQL
PDF
Morel, a data-parallel programming language
PDF
Is there a perfect data-parallel programming language? (Experiments with More...
PDF
Morel, a Functional Query Language
PDF
The evolution of Apache Calcite and its Community
PDF
What to expect when you're Incubating
PDF
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
PDF
Efficient spatial queries on vanilla databases
PDF
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
PDF
Tactical data engineering
PDF
Don't optimize my queries, organize my data!
PDF
Spatial query on vanilla databases
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
PPTX
Lazy beats Smart and Fast
PDF
Don’t optimize my queries, optimize my data!
PDF
Data profiling with Apache Calcite
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Building a semantic/metrics layer using Calcite
Cubing and Metrics in SQL, oh my!
Adding measures to Calcite SQL
Morel, a data-parallel programming language
Is there a perfect data-parallel programming language? (Experiments with More...
Morel, a Functional Query Language
The evolution of Apache Calcite and its Community
What to expect when you're Incubating
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Efficient spatial queries on vanilla databases
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Tactical data engineering
Don't optimize my queries, organize my data!
Spatial query on vanilla databases
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Lazy beats Smart and Fast
Don’t optimize my queries, optimize my data!
Data profiling with Apache Calcite

Recently uploaded (20)

PPTX
Big Data Technologies - Introduction.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Telecom Fraud Prevention Guide | Hyperlink InfoSystem
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
KodekX | Application Modernization Development
PDF
cuic standard and advanced reporting.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
AI And Its Effect On The Evolving IT Sector In Australia - Elevate
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
CIFDAQ's Market Wrap: Ethereum Leads, Bitcoin Lags, Institutions Shift
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Chapter 2 Digital Image Fundamentals.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Sensors and Actuators in IoT Systems using pdf
Big Data Technologies - Introduction.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Telecom Fraud Prevention Guide | Hyperlink InfoSystem
The Rise and Fall of 3GPP – Time for a Sabbatical?
NewMind AI Weekly Chronicles - August'25 Week I
Diabetes mellitus diagnosis method based random forest with bat algorithm
KodekX | Application Modernization Development
cuic standard and advanced reporting.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
AI And Its Effect On The Evolving IT Sector In Australia - Elevate
20250228 LYD VKU AI Blended-Learning.pptx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Chapter 3 Spatial Domain Image Processing.pdf
CIFDAQ's Market Wrap: Ethereum Leads, Bitcoin Lags, Institutions Shift
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Chapter 2 Digital Image Fundamentals.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Sensors and Actuators in IoT Systems using pdf

Discardable In-Memory Materialized Queries With Hadoop

  • 1. Page1 © Hortonworks Inc. 2014 Discardable, In-Memory Materialized Query for Hadoop Julian Hyde Julian Hyde June 3rd, 2014
  • 2. Page2 © Hortonworks Inc. 2014 About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Optiq (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development)
  • 3. Page3 © Hortonworks Inc. 2014 Before we get started… The bad news • This software is not available to download • I am a database bigot • I am a BI bigot The good news • I believe in Hadoop • Now is a great time to discuss where Hadoop is going
  • 4. Page4 © Hortonworks Inc. 2014 Hadoop today Brute force Hadoop brings a lot of CPU, disk, IO Yarn, Tez, Vectorization are making Hadoop faster How to use that brute force is left to the application Business Intelligence Best practice is to pull data out of Hadoop • Populate enterprise data warehouse • In-memory analytics • Custom analytics, e.g. Lambda architecture Ineffective use of memory Opportunity to make Hadoop smarter
  • 5. Page5 © Hortonworks Inc. 2014 Brute force + diplomacy
  • 6. Page6 © Hortonworks Inc. 2014 Hardware trends Typical Hadoop server configuration Lots of memory - but not enough for all data More heterogeneous mix (disk + SSD + memory) What to do about memory? Year 2009 2014 Cores 4 – 8 24 Memory 8 GB 128 GB SSD None 1 TB Disk 4 x 1 TB 12 x 4 TB Disk : memory 512 : 1 384 : 1
  • 7. Page7 © Hortonworks Inc. 2014 Dumb use of memory - Buffer cache Example: 50 TB data 1 TB memory Full-table scan Analogous to virtual memory • Great while it works, but… • Table scan nukes the cache! Table (disk) Buffer cache (memory) Table scan Query operators (Tez, or whatever) Pre-fetch
  • 8. Page8 © Hortonworks Inc. 2014 “Document-oriented” analysis Operate on working sets small enough to fit in memory Analogous to working on a document (e.g. a spreadsheet) • Works well for problems that fit into memory (e.g. some machine-learning algorithms) • If your problem grows, you’re out of luck Working set (memory) Table (disk) Interactive user
  • 9. Page9 © Hortonworks Inc. 2014 Smarter use of memory - Materialized queries In-memory materialized queries Tables on disk
  • 10. Page10 © Hortonworks Inc. 2014 Census data Census table – 300M records Which state has the most males? Brute force – read 150M records Smarter – read 50 records CREATE TABLE Census (id, gender, zipcode, state, age); SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE TABLE CensusSummary AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, c FROM CensusSummary WHERE gender = ‘M’ ORDER BY c DESC LIMIT 1;
  • 11. Page11 © Hortonworks Inc. 2014 Materialized view – Automatic smartness A materialized view is a table that is declared to be identical to a given query Optimizer rewrites query on “Census” to use “CensusSummary” instead Even smarter – read 50 records CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1;
  • 12. Page12 © Hortonworks Inc. 2014 Indistinguishable from magic Run query #1 System creates materialized view in background Related query #2 runs faster SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(NULLIF(gender, ‘F’)) AS males, COUNT(NULLIF(gender, ‘M’)) AS females FROM Census GROUP BY state HAVING females > males;
  • 13. Page13 © Hortonworks Inc. 2014 Materialized views - Classic Classic materialized view (Oracle, DB2, Teradata, MSSql) 1. A table defined using a SQL query 2. Designed by DBA 3. Storage same as a regular table 1. On disk 2. Can define indexes 4. DB populates the table 5. Queries are rewritten to use the table** 6. DB updates the table to reflect changes to source data (usually deferred)* *Magic required SELECT t.year, AVG(s.units) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) GROUP BY t.year; CREATE MATERIALIZED VIEW SalesMonthZip AS SELECT t.year, t.month, c.state, c.zipcode, COUNT(*), SUM(s.units), SUM(s.price) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId) GROUP BY t.year, t.month, c.state, c.zipcode;
  • 14. Page14 © Hortonworks Inc. 2014 Materialized views - DIMMQ DIMMQs - Discardable, In-memory Materialized Queries Differences with classic materialized views 1. May be in-memory 2. HDFS may discard – based on DDM (Distributed Discardable Memory) 3. Lifecycle support: 1. Assume table is populated 2. Don’t populate & maintain 3. User can flag as valid, invalid, or change definition (e.g. date range) 4. HDFS may discard 4. More design options: 1. DBA specifies 2. Retain query results (or partial results) 3. An agent builds MVs based on query traffic
  • 15. Page15 © Hortonworks Inc. 2014 DIMMQ compared to Spark RDDs It’s not “either / or” • Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon • Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible Spark RDD DIMMQ Access By reference By algebraic expression Sharing Within session Across sessions Execution model Built-in External Recovery Yes No Discard Failure or GC Failure or cost-based Native organization Memory Disk Language Scala (or other JVM language) Language-independent
  • 16. Page16 © Hortonworks Inc. 2014 Data independence This is not just about SQL standards compliance! Materialized views are supposed to be transparent in creation, maintenance and use. If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded Data independence Ability to move data around and not tell your application Replicas Redundant copies Moving between disk and memory Sort order, projections (à la Vertica), aggregates (à la Microstrategy) Indexes, and other weird data structures
  • 17. Page17 © Hortonworks Inc. 2014 Implementing DIMMQs Relational algebra Apache Optiq (just entered incubator – yay!) Algebra, rewrite rules, cost model Metadata Hive: “CREATE MATERIALIZED VIEW” Definitions of materialized views in HCatalog HDFS - Discardable Distributed Memory (DDM) Off-heap data in memory-mapped files Discard policy Build in-memory, replicate to disk; or vice versa Central namespace Evolution of existing components
  • 18. Page18 © Hortonworks Inc. 2014 Tiled queries in distributed memory Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk
  • 19. Page19 © Hortonworks Inc. 2014 An adaptive system Ongoing activities: • Agent suggests new MVs • MVs are built in background • Ongoing query activity uses MVs • User marks MVs as invalid due to source data changes • HDFS throws out MVs that are not pulling their weight Dynamic equilibrium DIMMQs continually created & destroyed System moves data around to adapt to changing usage patterns
  • 20. Page20 © Hortonworks Inc. 2014 Lambda architecture From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
  • 21. Page21 © Hortonworks Inc. 2014 Lambda architecture in Hadoop via DIMMQs Use DIMMQs for materialized historic & streaming data MV on disk MV in key- value store Hadoop Query via SQL
  • 22. Page22 © Hortonworks Inc. 2014 Variations on a theme Materialized queries don’t have to be in memory Materialized queries don’t need to be discardable Materialized queries don’t need to be accessed via SQL Materialized queries allow novel data structures to be described Maintaining materialized queries - build on Hive ACID Fine-grained invalidation Streaming into DIMMQs In-memory tables don’t have to be materialized queries Data aging – Older data to cheaper storage Discardable In-memory Algebraic
  • 23. Page23 © Hortonworks Inc. 2014 Lattice Space of possible materialized views A star schema, with mandatory many-to-one relationships Each view is a projected, filtered aggregation • Sales by zipcode and quarter in 2013 • Sales by state in Q1, 2012 Lattice gathers stats • “I used MV m to answer query q and avoided fetching r rows” • Cost of MV = construction effort + memory * time • Utility of MV = query processing effort saved Recommends & builds optimal MVs CREATE LATTICE SalesStar AS SELECT * FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId);
  • 24. Page24 © Hortonworks Inc. 2014 Conclusion Broadening the tent – batch, interactive BI, streaming, iterative Declarative, algebraic, transparent Seamless movement between memory and disk Adding brains to complement Hadoop’s brawn
  • 25. Page25 © Hortonworks Inc. 2014 Thank you! My next talk: “Cost-based query optimization in Hive” 4:35pm Wednesday @julianhyde http://hortonworks.com/blog/dmmq/ http://hortonworks.com/blog/ddm/ http://incubator.apache.org/projects/optiq.html

Editor's Notes

  • #4: This software does not yet exist. I hope and believe that it will. I don’t believe in promoting vaporware! I want to have a discussion about where Hadoop is going so as many of us can shape it. This seems like a great time and place to have it. My background is in database. I think like a database guy. I’ve been thinking really hard about how to bring the “good stuff” from databases over to Hadoop I have a deep knowledge of BI, and what BI tools need from their underlying data system, having written Mondrian
  • #6: Hadoop today brings a lot of brute force. It can be more effective if it were a bit smarter. We cannot make effective use of memory unless we are a bit smarter.
  • #7: Lots of memory now. But still not enough
  • #25: Bringing all kinds of data into the Hadoop tent - Batch, interactive, streaming, iterative Declarative, algebraic, transparent - Applications use different copies of data without knowing it Making Hadoop smarter – Adding brains to compliment Hadoop’s formidable brawn