Apache Druid 101: Fast, Real-time, Open Source Analytics
Matt Sarrel
Developer Evangelist
● Former IT leadership/CIO roles
● Focus on data management, data analysis, network infrastructure and network security
● 15 years of startup experience (mostly open source infrastructure and datastores)
● Former PCMag Tech Director, GigaOm Pro analyst, eWeek and InfoWorld contributor
● BA (History) and MPH (ID Epi)
● CISSP certified
● Cook Competitive BBQ (KCBS)
matt.sarrel@imply.io
@msarrel on Twitter
@matt on ASF
#druid Slack
Agenda
What is Druid?
Why was Druid created?
Best Uses for Druid
How Druid Works
Druid is a high-performance, real-time analytics database
Apache Druid Powers Interactive Applications
1st gen: on-prem data warehouses
The 1st gen architecture was unscalable, complex, and expensive.
[Diagram: Data Sources (Data) → Processing (ETL) → Store and Compute (Data Warehouse) → BI tools, Reporting, Analytics]
2nd gen: cloud data warehouses
The 2nd gen, while cheaper and more flexible, still has many latency restrictions.
[Diagram: Data Sources; Storage: Data Lake (S3, Blob store, etc.); Processing: ELT (Spark, Hadoop, etc.); Compute: Data Warehouse; consumers: BI tools, Reporting, Analytics]
3rd gen: Apache Druid/Imply
The 3rd gen architecture is designed for an increasingly low latency world.
[Diagram: Data Sources feed a Message Bus (Kafka, Kinesis, Pub/Sub) and ELT (Spark Streaming, Kafka Streams, Apache Flink). Druid serves next-gen data apps, interactive GUIs, and real-time analytics; the Data Warehouse handles archiving and reporting.]
Metamarkets: The first use case
Druid was created at a startup called Metamarkets (now part of Snap)
Druid was created to power an interactive app for digital advertisers
Advertisers loaded impressions and clicks data
Advertisers used the app to optimize user/ad engagement
Druid has since expanded to many new verticals and use cases
Challenges
Data:
• Scale: millions of events/sec (batch and real-time)
• Complexity: high dimensionality & high cardinality
• Structure: semi-structured (nested, evolving schemas, etc.)
App:
• Drill downs: static reports aren’t enough (BI tools not enough)
• Multi-tenancy: thousands of concurrent users
• Self-service: many users are non-technical
Core Design
SEARCH PLATFORM
● Real-time ingestion
● Flexible schema
● Full text search
OLAP
● Batch ingestion
● Efficient storage
● Fast analytic queries
TIME SERIES DB
● Optimized storage for time-based datasets
● Time-based functions
Key features
Column oriented
High concurrency
Scalable to 1000s of servers, millions of messages/sec
Continuous, real-time ingest
Query through SQL via API
Target query latency sub-second to a few seconds
Druid in Production
User activity
Data sets: clickstreams, view streams, activity streams
Group users along any attributes, without pre-computation or pre-definition
Compare groups of users against each other
Define interesting groupings quickly through top lists
Count number of users matching any criteria
Network flows
Data set: netflow logs
View relationships between source & dest addresses
Measure flows based on protocol, interface, IP address, or any other attribute
Burstable billing: 95th percentile flow rates in 5 min buckets
Troubleshoot bottlenecks
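
The burstable-billing bullet above reduces to a small piece of arithmetic: sum traffic into 5-minute buckets, convert each bucket to a rate, and take the 95th percentile of those rates. A minimal sketch in plain Python, assuming flow records arrive as (Unix timestamp, bytes) pairs and using the nearest-rank percentile definition; the function names are hypothetical, and this is not how Druid computes quantiles internally.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute buckets, as on the slide


def bucket_rates(flows):
    """Sum bytes per 5-minute bucket and convert each total to bits per second.

    `flows` is an iterable of (unix_timestamp_seconds, byte_count) pairs
    (an assumed record shape, not a netflow format).
    """
    totals = defaultdict(int)
    for ts, nbytes in flows:
        totals[ts - ts % BUCKET_SECONDS] += nbytes
    return {start: totals[start] * 8 / BUCKET_SECONDS for start in sorted(totals)}


def percentile_95(rates):
    """Nearest-rank 95th percentile: the top 5% of samples are ignored."""
    ordered = sorted(rates)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math, 1-based
    return ordered[rank - 1]


# Twenty buckets whose rates climb from 1,000 to 20,000 bits per second.
flows = [(i * BUCKET_SECONDS, (i + 1) * 37500) for i in range(20)]
rates = bucket_rates(flows)
billable_rate = percentile_95(rates.values())
```

With twenty samples the single highest bucket is discarded, so the billable rate is the 19th-highest value.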
Digital advertising
Data sets: bids, clicks, impressions, etc.
Analyze campaign performance on ad-hoc groupings of participants
Compute quantiles and histograms for bid prices
Calculate conversion rates (impressions → clicks)
Server metrics
Data sets: server logs, application metrics, etc.
Track CPU load on servers, numbers of cache requests/hits/misses, data center performance, etc.
Aggregate time series on the fly
Compute latency percentiles over ad hoc groups of events (all ‘foo’ servers; all ‘/v1/bar’ API calls; all servers in rack 10; etc.)
Druid…it’s out there
The original Druid cluster:
• >500 TB of segments (>50 trillion raw events, >50 PB raw data)
• mean 500ms query time
• 90%ile < 1s
• 95%ile < 5s
• 99%ile < 10s
Netflix Druid Cluster:
• 100 billion+ rows/day
• 1+ trillion rows, retained for at least a year
• 100s of servers
• Sub-second to a few seconds query response
• Relies on combination of streaming and batch ingestion
Druid is designed for performance
Data sourced from: Correia, José & Costa, Carlos & Santos, Maribel. (2019). Challenging SQL-on-Hadoop Performance with Apache Druid.
There is no such thing as too fast
Is Druid Right For My Project?
Data Characteristics
Timestamp dimension
Streaming
Denormalized
Many attributes (30+ dimensions)
High cardinality
Use Case Characteristics
Large dataset
Fast query response (<1s)
Low latency data ingestion
Interactive, ad-hoc queries
Arbitrary slicing and dicing (OLAP)
Query real-time & historical data
Infrequent updates
Druid in Data Pipeline
[Diagram: Raw data (clicks, ad impressions, network telemetry, application events) → Staging and Processing (data lakes, message buses) → Analytics Database → End User Application]
Druid and Data Warehouses
Druid is not a DW
Druid augments DW to provide
• consistent, sub-second SLA
• pre-aggregation/metrics generation upon ingest
• simple schema
• high concurrency reads
Druid is for hot queries (sub-second queries on fresh data)
• Slice and dice OLAP
• Dashboards that fire dozens of queries at once
DW is for cold queries (second+ queries on historical data)
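
The "pre-aggregation/metrics generation upon ingest" point above can be illustrated with a toy rollup: raw events are truncated to a coarser time grain and collapsed to one row per combination of timestamp and dimension values. This is a sketch of the idea only, assuming events are dicts with hypothetical `timestamp`, `country`, and `clicks` fields; it does not reflect Druid's actual ingestion code.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions):
    """Pre-aggregate raw events to one row per (minute, dimension values)."""
    rows = defaultdict(lambda: {"count": 0, "clicks": 0})
    for event in events:
        # Truncate the event time to the minute (the assumed query granularity).
        minute = datetime.fromisoformat(event["timestamp"]).replace(second=0, microsecond=0)
        key = (minute.isoformat(), *(event[d] for d in dimensions))
        rows[key]["count"] += 1                # number of raw events rolled up
        rows[key]["clicks"] += event["clicks"]  # summed metric
    return dict(rows)


events = [
    {"timestamp": "2024-01-01T00:00:05", "country": "AU", "clicks": 1},
    {"timestamp": "2024-01-01T00:00:40", "country": "AU", "clicks": 0},
    {"timestamp": "2024-01-01T00:01:10", "country": "AU", "clicks": 1},
    {"timestamp": "2024-01-01T00:00:59", "country": "NZ", "clicks": 1},
]
rolled = rollup(events, ["country"])
```

Four raw events become three stored rows here; on real data with many events per minute, the reduction is what keeps dashboard queries cheap.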
Druid Architecture
Architecture (Ingestion)
[Diagram: Files and Streams → Indexers → Segments → Historicals]
The Ingestion Spec
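
The slide shows the spec only as an image, so here is a rough sketch of what a Kafka streaming ingestion spec can look like, built as a Python dict and serialized to JSON. The field names follow the general layout of Druid's Kafka supervisor spec as commonly documented, but the exact shape varies between Druid versions and should be checked against the documentation for yours. The datasource, topic, broker address, and column names are all hypothetical.

```python
import json


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Assemble an illustrative Kafka ingestion spec (all values are examples)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Which input column holds the event time, and its format.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                # Metrics are aggregated at ingest time when rollup is on.
                "metricsSpec": [
                    {"type": "count", "name": "count"},
                    {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
                ],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "MINUTE",
                    "rollup": True,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


spec = build_kafka_supervisor_spec(
    datasource="clickstream",            # hypothetical datasource name
    topic="clicks",                      # hypothetical Kafka topic
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    dimensions=["country", "page", "user_agent"],
)
spec_json = json.dumps(spec, indent=2)
```

The three parts mirror the questions a spec answers: what the data looks like (`dataSchema`), where it comes from (`ioConfig`), and how ingestion is tuned (`tuningConfig`).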
Druid segments
Segments enable a global index and write-once consistency.
Engine and data format are tightly integrated
Secondary indexes
Operate on compressed data
Late materialization
Compression
DICT
Melbourne = 0
Perth = 1
Sydney = 2
DATA
0 0 0 1 1 2 2 2
INDEX
[0,1,2] (11100000)
[3,4] (00011000)
[5,6,7] (00000111)
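
The DICT, DATA, and INDEX tables on this slide can be reproduced in a few lines. The sketch below dictionary-encodes a city column and builds one bitmap per distinct value, using plain strings of 0s and 1s for readability. Production Druid stores compressed bitmaps, so treat this as an illustration of the idea, not of the storage format; the function names are hypothetical.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer ID, assigned in sorted order."""
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    encoded = [dictionary[value] for value in values]
    return dictionary, encoded


def bitmap_index(encoded, cardinality):
    """Build one bitmap per dictionary ID; bit i is set when row i holds that ID."""
    bitmaps = [["0"] * len(encoded) for _ in range(cardinality)]
    for row, code in enumerate(encoded):
        bitmaps[code][row] = "1"
    return ["".join(bits) for bits in bitmaps]


def rows_matching(bitmaps, codes):
    """Rows where the column equals any of `codes` (a bitwise OR of their bitmaps)."""
    combined = 0
    for code in codes:
        combined |= int(bitmaps[code], 2)
    width = len(bitmaps[0])
    return [i for i, bit in enumerate(format(combined, f"0{width}b")) if bit == "1"]


cities = ["Melbourne"] * 3 + ["Perth"] * 2 + ["Sydney"] * 3
dictionary, data = dictionary_encode(cities)
index = bitmap_index(data, len(dictionary))
melbourne_or_sydney = rows_matching(index, [dictionary["Melbourne"], dictionary["Sydney"]])
```

A filter such as "city is Melbourne or Sydney" is answered from the bitmaps alone, before any row values are read, which is the sense in which the engine operates on the index and materializes data late.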
Querying
Query libraries:
• JSON over HTTP
• SQL
• R
• Python
• Ruby
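
As a concrete case of SQL over HTTP from Python, the sketch below posts a query to Druid's SQL endpoint (`/druid/v2/sql`) using only the standard library. The host and port, the `clickstream` datasource, and the helper names are assumptions for illustration; the request is built separately from the network call so it can be inspected without a running cluster.

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build a POST request carrying the SQL text as a JSON body."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(base_url, sql, timeout=10):
    """Send the query and return the decoded JSON rows (needs a reachable cluster)."""
    with urllib.request.urlopen(build_sql_request(base_url, sql), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hourly event counts over the last day for a hypothetical datasource.
sql = (
    "SELECT TIME_FLOOR(__time, 'PT1H') AS hr, COUNT(*) AS events "
    "FROM clickstream "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
    "GROUP BY 1 ORDER BY 1"
)
request = build_sql_request("http://localhost:8888", sql)  # assumed local address
```

Calling `run_sql("http://localhost:8888", sql)` would execute the query against a cluster listening at that address.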
Query Processing in Parallel
Historical Indexer Historical Indexer
Data server
Broker
Query server
Segments
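
The parallel query path on this slide follows a scatter/gather pattern: the Broker fans a query out to the data servers holding the relevant segments, each scans its segments independently, and the Broker merges the partial results. A toy version with threads standing in for data servers is sketched below; segments are assumed to be lists of dict rows, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_segment(segment, dimension):
    """Data-server side: count rows per dimension value within one segment."""
    return Counter(row[dimension] for row in segment)


def broker_query(segments, dimension):
    """Broker side: scan segments in parallel, then merge the partial counts."""
    merged = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda seg: scan_segment(seg, dimension), segments):
            merged.update(partial)
    return merged


segments = [
    [{"city": "Perth"}, {"city": "Sydney"}],
    [{"city": "Sydney"}],
    [{"city": "Melbourne"}, {"city": "Sydney"}],
]
counts = broker_query(segments, "city")
```

Because each segment is immutable and self-contained, the per-segment scans need no coordination; only the final merge happens in one place.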
Apache Druid Unified Console
Resources
druid.apache.org
druid.apache.org/community
Imply Meetup Groups https://www.meetup.com/pro/apache-druid
ASF #druid Slack channel
@druidio on Twitter
Apache Distribution https://github.com/apache/druid
Imply Distribution https://imply.io/get-started
