SlideShare a Scribd company logo
SQL on Big Data
using Optiq
@julianhyde



Real-time Big Data Meetup
at RichRelevance

April 2013
What is “SQL on Big Data?”
□ “Open-source Teradata”
□ SQL generator for Map-Reduce
□ ETL (Extract-Transform Load)
□ Scalable transaction processing
□ Querying nested data sets
□ Querying documents & populating databases
□ Continuous query/streaming


(Check one or more.)
Revolution & counter-revolution
“Big Data” was a revolution in data management.

Lots of broken things got fixed (unlimited scale,
  data anywhere & any format, late schema,
  flexible queries).

Some useful things got broken (standard
 interface, data independence, central control).

“In 5 years everyone will be using Hadoop and
  they won't even know it.” – me, a few years
  ago
Conventional DBMS architecture

              JDBC client


              JDBC server
              SQL parser /
                validator           Metadata
                 Query
               optimizer
               Data-flow
               operators

       Data                  Data
Optiq architecture

                   JDBC client


                   JDBC server
       Optional    SQL parser /            Metadata
                     validator               SPI
         Core          Query               Pluggable
                     optimizer               rules
                    3rd      3rd
       Pluggable   party party
                    ops     ops
       3rd party                   3rd party
         data                        data
SELECT p.product_name, COUNT(*) AS c
Expression                            FROM splunk.splunk AS s
                                        JOIN mysql.products AS p
tree                                    ON s.product_id = p.product_id
                                      WHERE s.action = 'purchase'
                                      GROUP BY p.product_name
                                      ORDER BY c DESC
Splunk
Table: splunk
                                                   Key: product_name
                    Key: product_id                Agg: count
                                      Condition:                        Key: c DESC
                                        action =
                                      'purchase'
scan
                         join
MySQL                                 filter           group           sort
   scan
                Table: products
SELECT p.product_name, COUNT(*) AS c
Expression                           FROM splunk.splunk AS s
                                       JOIN mysql.products AS p
tree (optimized)                       ON s.product_id = p.product_id
                                     WHERE s.action = 'purchase'
                                     GROUP BY p.product_name
                                     ORDER BY c DESC

             Splunk
                        Condition:
 Table: splunk
                          action =
                        'purchase'                     Key: product_name
                                                       Agg: count
                                                                            Key: c DESC

                                     Key: product_id
 scan                   filter

MySQL
                                     join                  group           sort
   scan
                 Table: products
Apache Drill
“Apache Drill (incubating) is a distributed system
  for interactive analysis of large-scale datasets,
  based on Google's Dremel. Its goal is to
  efficiently process nested data. It is a design
  goal to scale to 10,000 servers or more and to
  be able to process petabyes of data and
  trillions of records in seconds.”
Data model: JSON, late-binding
Optiq:
  SQL → logical plan (current)
  Logical → physical plan (proposed)
Cascading Lingual
“Cascading is the de facto Java API for creating
  complex data processing workloads and the
  engine underneath Scalding, Cascalog, and
  others.”


Lingual uses Optiq to translate SQL onto
  Cascading flows
SQL is “yet another DSL” for Cascading
Just released!
Mondrian (Pentaho Analysis)
Mondrian next-gen architecture
      mondrian               mondrian              mondrian     Optiq provides SQL
                                                                view onto hybrid
                                                                SQL + NoSQL +
    optiq               optiq                    optiq          in-memory store


        cache         data    cache       grid       cache     In-memory tables
                                                               (query results,
                                                               planned & on-the-fly
                                                               materializations)
            control




                                control




                                                     control
             cache




                                 cache




                                                      cache
                                                               Raw data +
                                                               summarized /
    HDFS                MongoDB                  DBMS          projected / sorted /
                                                               re-organized data.
                                                               Partitions.
Summary: Data independence
Logical & physical data models
Requires & allows query optimization
Allows you (or the system) to re-organize data
Query federation, data movement, caching
SQL interface for humans & machines
Optiq lets you add rules to optimize better
Thank you!
@julianhyde


optiq https://github.com/julianhyde/optiq
drill http://incubator.apache.org/drill/
lingual http://www.cascading.org/lingual/
mondrian http://mondrian.pentaho.com
slides https://github.com/julianhyde/share/tree/master/slides

More Related Content

PPT
Why is data independence (still) so important? Optiq and Apache Drill.
PDF
SQL on everything, in memory
PPT
Drill / SQL / Optiq
PPT
How to integrate Splunk with any data solution
PDF
Apache Calcite Tutorial - BOSS 21
PDF
SQL Now! How Optiq brings the best of SQL to NoSQL data.
PDF
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
PDF
SQL for NoSQL and how Apache Calcite can help
Why is data independence (still) so important? Optiq and Apache Drill.
SQL on everything, in memory
Drill / SQL / Optiq
How to integrate Splunk with any data solution
Apache Calcite Tutorial - BOSS 21
SQL Now! How Optiq brings the best of SQL to NoSQL data.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
SQL for NoSQL and how Apache Calcite can help

What's hot (20)

PDF
Apache Calcite: One planner fits all
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
PDF
Cost-based query optimization in Apache Hive 0.14
PDF
Apache Calcite: One Frontend to Rule Them All
PDF
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
PDF
ONE FOR ALL! Using Apache Calcite to make SQL smart
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
PDF
Streaming SQL with Apache Calcite
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PDF
Why you care about
 relational algebra (even though you didn’t know it)
PPT
Optiq: a SQL front-end for everything
PDF
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
PDF
What's new in Mondrian 4?
PDF
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
PDF
Introduction to Apache Calcite
PDF
Tactical data engineering
PDF
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
PPTX
Discardable In-Memory Materialized Queries With Hadoop
PDF
DataEngConf SF16 - Spark SQL Workshop
PPTX
Apache Calcite overview
Apache Calcite: One planner fits all
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Cost-based query optimization in Apache Hive 0.14
Apache Calcite: One Frontend to Rule Them All
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
ONE FOR ALL! Using Apache Calcite to make SQL smart
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL with Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Why you care about
 relational algebra (even though you didn’t know it)
Optiq: a SQL front-end for everything
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
What's new in Mondrian 4?
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Introduction to Apache Calcite
Tactical data engineering
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Discardable In-Memory Materialized Queries With Hadoop
DataEngConf SF16 - Spark SQL Workshop
Apache Calcite overview
Ad

Similar to SQL on Big Data using Optiq (20)

PDF
Common MongoDB Use Cases
PPTX
Common MongoDB Use Cases Webinar
PPT
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
PPTX
Nosql Now 2012: MongoDB Use Cases
KEY
Optimize drupal using mongo db
PDF
Common MongoDB Use Cases
PDF
Treasure Data: Big Data Analytics on Heroku
PDF
Don't Re-write Code to Get Better Analytics
PDF
Finding the Right Data Solution for your Application in the Data Storage Hays...
PPTX
How Klout is changing the landscape of social media with Hadoop and BI
PPT
Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh
PDF
Transition from relational to NoSQL Philly DAMA Day
PDF
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
PDF
Introducing the App Engine datastore
PPTX
Polyglot persistence with no sql
PPTX
Drill njhug -19 feb2013
PPTX
Solving Compliance for Big Data
PDF
Informix physical database design for data warehousing
PDF
Intro to NoSQL and MongoDB
PDF
Finding the Right Data Solution for Your Application in the Data Storage Hays...
Common MongoDB Use Cases
Common MongoDB Use Cases Webinar
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Nosql Now 2012: MongoDB Use Cases
Optimize drupal using mongo db
Common MongoDB Use Cases
Treasure Data: Big Data Analytics on Heroku
Don't Re-write Code to Get Better Analytics
Finding the Right Data Solution for your Application in the Data Storage Hays...
How Klout is changing the landscape of social media with Hadoop and BI
Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh
Transition from relational to NoSQL Philly DAMA Day
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
Introducing the App Engine datastore
Polyglot persistence with no sql
Drill njhug -19 feb2013
Solving Compliance for Big Data
Informix physical database design for data warehousing
Intro to NoSQL and MongoDB
Finding the Right Data Solution for Your Application in the Data Storage Hays...
Ad

More from Julian Hyde (20)

PPTX
Measures in SQL (SIGMOD 2024, Santiago, Chile)
PDF
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
PDF
Building a semantic/metrics layer using Calcite
PDF
Cubing and Metrics in SQL, oh my!
PDF
Adding measures to Calcite SQL
PDF
Morel, a data-parallel programming language
PDF
Is there a perfect data-parallel programming language? (Experiments with More...
PDF
Morel, a Functional Query Language
PDF
Apache Calcite (a tutorial given at BOSS '21)
PDF
The evolution of Apache Calcite and its Community
PDF
What to expect when you're Incubating
PDF
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
PDF
Efficient spatial queries on vanilla databases
PDF
Don't optimize my queries, organize my data!
PDF
Spatial query on vanilla databases
PPTX
Lazy beats Smart and Fast
PDF
Don’t optimize my queries, optimize my data!
PDF
Data profiling with Apache Calcite
PDF
Data Profiling in Apache Calcite
PDF
Streaming SQL
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Building a semantic/metrics layer using Calcite
Cubing and Metrics in SQL, oh my!
Adding measures to Calcite SQL
Morel, a data-parallel programming language
Is there a perfect data-parallel programming language? (Experiments with More...
Morel, a Functional Query Language
Apache Calcite (a tutorial given at BOSS '21)
The evolution of Apache Calcite and its Community
What to expect when you're Incubating
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Efficient spatial queries on vanilla databases
Don't optimize my queries, organize my data!
Spatial query on vanilla databases
Lazy beats Smart and Fast
Don’t optimize my queries, optimize my data!
Data profiling with Apache Calcite
Data Profiling in Apache Calcite
Streaming SQL

Recently uploaded (20)

PDF
Smarter Business Operations Powered by IoT Remote Monitoring
PDF
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
PDF
Test Bank, Solutions for Java How to Program, An Objects-Natural Approach, 12...
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
CIFDAQ's Token Spotlight: SKY - A Forgotten Giant's Comeback?
PDF
Google’s NotebookLM Unveils Video Overviews
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
PDF
Software Development Methodologies in 2025
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
PDF
Chapter 2 Digital Image Fundamentals.pdf
PDF
Transforming Manufacturing operations through Intelligent Integrations
PPTX
Telecom Fraud Prevention Guide | Hyperlink InfoSystem
PDF
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
PDF
Revolutionize Operations with Intelligent IoT Monitoring and Control
PDF
DevOps & Developer Experience Summer BBQ
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
PDF
madgavkar20181017ppt McKinsey Presentation.pdf
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
Smarter Business Operations Powered by IoT Remote Monitoring
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
Test Bank, Solutions for Java How to Program, An Objects-Natural Approach, 12...
NewMind AI Monthly Chronicles - July 2025
CIFDAQ's Token Spotlight: SKY - A Forgotten Giant's Comeback?
Google’s NotebookLM Unveils Video Overviews
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
Software Development Methodologies in 2025
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Chapter 2 Digital Image Fundamentals.pdf
Transforming Manufacturing operations through Intelligent Integrations
Telecom Fraud Prevention Guide | Hyperlink InfoSystem
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
Revolutionize Operations with Intelligent IoT Monitoring and Control
DevOps & Developer Experience Summer BBQ
GamePlan Trading System Review: Professional Trader's Honest Take
madgavkar20181017ppt McKinsey Presentation.pdf
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Understanding_Digital_Forensics_Presentation.pptx

SQL on Big Data using Optiq

  • 1. SQL on Big Data using Optiq @julianhyde Real-time Big Data Meetup at RichRelevance April 2013
  • 2. What is “SQL on Big Data?” □ “Open-source Teradata” □ SQL generator for Map-Reduce □ ETL (Extract-Transform Load) □ Scalable transaction processing □ Querying nested data sets □ Querying documents & populating databases □ Continuous query/streaming (Check one or more.)
  • 3. Revolution & counter-revolution “Big Data” was a revolution in data management. Lots of broken things got fixed (unlimited scale, data anywhere & any format, late schema, flexible queries). Some useful things got broken (standard interface, data independence, central control). “In 5 years everyone will be using Hadoop and they won't even know it.” – me, a few years ago
  • 4. Conventional DBMS architecture JDBC client JDBC server SQL parser / validator Metadata Query optimizer Data-flow operators Data Data
  • 5. Optiq architecture JDBC client JDBC server Optional SQL parser / Metadata validator SPI Core Query Pluggable optimizer rules 3rd 3rd Pluggable party party ops ops 3rd party 3rd party data data
  • 6. SELECT p.product_name, COUNT(*) AS c Expression FROM splunk.splunk AS s JOIN mysql.products AS p tree ON s.product_id = p.product_id WHERE s.action = 'purchase' GROUP BY p.product_name ORDER BY c DESC Splunk Table: splunk Key: product_name Key: product_id Agg: count Condition: Key: c DESC action = 'purchase' scan join MySQL filter group sort scan Table: products
  • 7. SELECT p.product_name, COUNT(*) AS c Expression FROM splunk.splunk AS s JOIN mysql.products AS p tree (optimized) ON s.product_id = p.product_id WHERE s.action = 'purchase' GROUP BY p.product_name ORDER BY c DESC Splunk Condition: Table: splunk action = 'purchase' Key: product_name Agg: count Key: c DESC Key: product_id scan filter MySQL join group sort scan Table: products
  • 8. Apache Drill “Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.” Data model: JSON, late-binding Optiq: SQL → logical plan (current) Logical → physical plan (proposed)
  • 9. Cascading Lingual “Cascading is the de facto Java API for creating complex data processing workloads and the engine underneath Scalding, Cascalog, and others.” Lingual uses Optiq to translate SQL onto Cascading flows SQL is “yet another DSL” for Cascading Just released!
  • 11. Mondrian next-gen architecture mondrian mondrian mondrian Optiq provides SQL view onto hybrid SQL + NoSQL + optiq optiq optiq in-memory store cache data cache grid cache In-memory tables (query results, planned & on-the-fly materializations) control control control cache cache cache Raw data + summarized / HDFS MongoDB DBMS projected / sorted / re-organized data. Partitions.
  • 12. Summary: Data independence Logical & physical data models Requires & allows query optimization Allows you (or the system) to re-organize data Query federation, data movement, caching SQL interface for humans & machines Optiq lets you add rules to optimize better
  • 13. Thank you! @julianhyde optiq https://github.com/julianhyde/optiq drill http://incubator.apache.org/drill/ lingual http://www.cascading.org/lingual/ mondrian http://mondrian.pentaho.com slides https://github.com/julianhyde/share/tree/master/slides