Apache Druid 101

Apache Druid 101: Fast, Real-time, Open Source Analytics

Matt Sarrel
Developer Evangelist
● Former IT leadership/CIO roles
● Focus on data management, data analysis, network infrastructure and network security
● 15 years of startup experience (mostly open source infrastructure and datastores)
● Former PCMag Tech Director, GigaOm Pro analyst, eWeek and InfoWorld contributor
● BA (History) and MPH (ID Epi)
● CISSP certified
● Cook Competitive BBQ (KCBS)
matt.sarrel@imply.io
@msarrel on Twitter
@matt on ASF
#druid Slack

Agenda
● What is Druid?
● Why was Druid created?
● Best Uses for Druid
● How Druid Works

Druid is a high-performance, real-time analytics database

Apache Druid Powers Interactive Applications

1st gen: on-prem data warehouses
The 1st gen architecture was unscalable, complex, and expensive.
[Diagram: Data Sources → Processing (ETL) → Store and Compute (Data Warehouse) → BI tools, Reporting, Analytics]

2nd gen: cloud data warehouses
The 2nd gen, while cheaper and more flexible, still has many latency restrictions.
[Diagram: Data Sources → Storage (Data Lake: S3, Blob store, etc.) → Processing (ELT: Spark, Hadoop, etc.) → Compute (Data Warehouse) → BI tools, Reporting, Analytics]

3rd gen: Apache Druid/Imply
The 3rd gen architecture is designed for an increasingly low latency world.
[Diagram: Data Sources → Storage (Message Bus: Kafka, Kinesis, Pub/Sub) → Processing (ELT: Spark Streaming, Kafka Streams, Apache Flink) → Druid → Next gen data apps, Interactive GUIs, Real-time Analytics. A Data Warehouse remains alongside for Archiving and Reporting.]

Metamarkets: The first use case
Druid was created at a startup called Metamarkets (now part of Snap Inc.)
Druid was created to power an interactive app for digital advertisers
Advertisers loaded impressions and clicks data
Advertisers used the app to optimize user/ad engagement
Druid has since expanded to many new verticals and use cases

Challenges
Data:
• Scale: millions of events/sec (batch and real-time)
• Complexity: high dimensionality & high cardinality
• Structure: semi-structured (nested, evolving schemas, etc.)
App:
• Drill downs: static reports aren’t enough (BI tools aren’t enough)
• Multi-tenancy: thousands of concurrent users
• Self-service: many users are non-technical

Core Design
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
● Optimized storage for time-based datasets
● Time-based functions
SEARCH PLATFORM | TIME SERIES DB | OLAP

Key features
Column oriented
High concurrency
Scalable to 1000s of servers, millions of messages/sec
Continuous, real-time ingest
Query through SQL via API
Target query latency sub-second to a few seconds
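
As a rough illustration of "Query through SQL via API": Druid exposes a SQL endpoint at `/druid/v2/sql` that accepts a JSON body with a `query` field. The sketch below only builds the HTTP request; the host and port (a local quickstart router is assumed), the `clickstream` datasource, and the `channel` column are hypothetical names chosen for the example.

```python
import json
import urllib.request

# Assumed address of a local quickstart router; adjust for a real cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql: str) -> list:
    """Send the query and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(build_sql_request(sql)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical datasource and column names, for illustration only.
EXAMPLE_SQL = """
SELECT channel, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY channel
ORDER BY events DESC
LIMIT 10
"""

request = build_sql_request(EXAMPLE_SQL)
```

Calling `run_sql(EXAMPLE_SQL)` against a live cluster would return the result rows; it is kept separate so the request can be inspected without a server.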

User activity
Data sets: clickstreams, view streams, activity streams
Group users along any attributes, without pre-computation or pre-definition
Compare groups of users against each other
Define interesting groupings quickly through top lists
Count number of users matching any criteria
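
The operations listed above can be sketched in plain Python on a handful of in-memory events. This is not how Druid executes them; it only shows the shape of the questions (group by any attribute, count distinct users matching criteria, build a top list). The event fields are hypothetical.

```python
from collections import Counter

# Hypothetical activity events; field names are illustrative only.
EVENTS = [
    {"user": "u1", "country": "AU", "device": "mobile", "page": "/home"},
    {"user": "u2", "country": "AU", "device": "desktop", "page": "/home"},
    {"user": "u1", "country": "AU", "device": "mobile", "page": "/pricing"},
    {"user": "u3", "country": "US", "device": "mobile", "page": "/home"},
    {"user": "u4", "country": "US", "device": "mobile", "page": "/docs"},
    {"user": "u3", "country": "US", "device": "mobile", "page": "/pricing"},
]


def count_users(events, **criteria):
    """Count distinct users with at least one event matching every criterion."""
    return len({
        e["user"] for e in events
        if all(e.get(k) == v for k, v in criteria.items())
    })


def top_values(events, attribute, n=3):
    """Top list: the n most frequent values of an attribute, by event count."""
    return Counter(e[attribute] for e in events).most_common(n)


def users_by(events, attribute):
    """Group distinct users along any attribute chosen at query time."""
    groups = {}
    for e in events:
        groups.setdefault(e[attribute], set()).add(e["user"])
    return groups
```

For example, `count_users(EVENTS, country="US", device="mobile")` counts the distinct US mobile users, and `users_by(EVENTS, "country")` yields groups that can then be compared against each other.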

Network flows
Data set: netflow logs
View relationships between source & dest addresses
Measure flows based on protocol, interface, IP address, or any other attribute
Burstable billing: 95th percentile flow rates in 5 min buckets
Troubleshoot bottlenecks
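
To make the burstable billing line concrete, here is a minimal sketch of a 95th percentile calculation over 5 minute buckets. It assumes flow records are `(unix_timestamp_seconds, byte_count)` pairs and uses the nearest-rank percentile definition; buckets with no traffic are simply absent, which a real billing system may treat differently.

```python
import math
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute buckets, as on the slide


def bucket_rates(flows):
    """Sum bytes per 5-minute bucket and convert each total to bits per second.

    flows: iterable of (unix_timestamp_seconds, byte_count) pairs.
    """
    totals = defaultdict(int)
    for ts, nbytes in flows:
        totals[ts - ts % BUCKET_SECONDS] += nbytes
    return {start: total * 8 / BUCKET_SECONDS for start, total in totals.items()}


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def billable_rate(flows, pct=95):
    """Percentile of the per-bucket rates; the busiest buckets above it are ignored."""
    return percentile_nearest_rank(bucket_rates(flows).values(), pct)
```

With 20 buckets, the 95th percentile under this definition is the 19th lowest rate, so the single busiest 5 minute window does not affect the bill.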

Digital advertising
Data sets: bids, clicks, impressions, etc.
Analyze campaign performance on ad-hoc groupings of participants
Compute quantiles and histograms for bid prices
Calculate conversion rates (impressions → clicks)
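
A small sketch of the conversion rate calculation (clicks divided by impressions) over a grouping chosen at query time. The event fields and sample data are hypothetical, and this runs in memory rather than in Druid.

```python
from collections import defaultdict

# Hypothetical ad events; field names are illustrative only.
AD_EVENTS = [
    {"type": "impression", "campaign": "spring", "publisher": "news"},
    {"type": "impression", "campaign": "spring", "publisher": "news"},
    {"type": "impression", "campaign": "spring", "publisher": "blog"},
    {"type": "impression", "campaign": "spring", "publisher": "blog"},
    {"type": "click", "campaign": "spring", "publisher": "news"},
    {"type": "impression", "campaign": "summer", "publisher": "news"},
    {"type": "impression", "campaign": "summer", "publisher": "blog"},
    {"type": "click", "campaign": "summer", "publisher": "blog"},
]


def conversion_rates(events, group_by):
    """Clicks divided by impressions for each value of an ad-hoc grouping key."""
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    for e in events:
        if e["type"] in ("impression", "click"):
            counts[e[group_by]][e["type"]] += 1
    # None marks groups with clicks but no impressions (rate undefined).
    return {
        key: (c["click"] / c["impression"] if c["impression"] else None)
        for key, c in counts.items()
    }
```

The same function answers "by campaign" and "by publisher" without any pre-defined grouping: `conversion_rates(AD_EVENTS, "campaign")` and `conversion_rates(AD_EVENTS, "publisher")`.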

Server metrics
Data sets: server logs, application metrics, etc.
Track CPU load on servers, numbers of cache requests/hits/misses, data center performance, etc.
Aggregate time series on the fly
Compute latency %iles over ad hoc groups of events (all ‘foo’ servers; all ‘/v1/bar’ API calls; all servers in rack 10; etc.)

Druid…it’s out there
The original Druid cluster:
• >500 TB of segments (>50 trillion raw events, >50 PB raw data)
• mean 500ms query time
• 90%ile < 1s
• 95%ile < 5s
• 99%ile < 10s
Netflix Druid Cluster:
• 100 billion+ rows/day
• 1+ trillion rows, retained for at least a year
• 100s of servers
• Sub-second to a few seconds query response
• Relies on combination of streaming and batch ingestion

Druid is designed for performance
Data sourced from: Correia, José & Costa, Carlos & Santos, Maribel. (2019). Challenging SQL-on-Hadoop Performance with Apache Druid.

There is no such thing as too fast

Is Druid Right For My Project?
Data Characteristics
Timestamp dimension
Streaming
Denormalized
Many attributes (30+ dimensions)
High cardinality
Use Case Characteristics
Large dataset
Fast query response (<1s)
Low latency data ingestion
Interactive, ad-hoc queries
Arbitrary slicing and dicing (OLAP)
Query real-time & historical data
Infrequent updates

Druid in Data Pipeline
[Diagram: Raw data (clicks, ad impressions, network telemetry, application events) → Staging and Processing (data lakes, message buses) → Analytics Database (Druid) → End User Application]

Druid and Data Warehouses
Druid is not a DW
Druid augments DW to provide
• consistent, sub-second SLA
• pre-aggregation/metrics generation upon ingest
• simple schema
• high concurrency reads
Druid is for hot queries (sub-second queries on fresh data)
• Slice and dice OLAP
• Dashboards that fire dozens of queries at once
DW is for cold queries (second+ queries on historical data)
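
The "pre-aggregation/metrics generation upon ingest" bullet can be illustrated with a simplified rollup: rows that share the same truncated timestamp and dimension values are combined into one stored row with summed metrics. This is a sketch of the idea only, with hourly granularity and hypothetical field names, not Druid's ingestion code.

```python
from collections import defaultdict
from datetime import datetime, timezone


def truncate_to_hour(iso_ts: str) -> str:
    """Floor an ISO-8601 timestamp (with offset) to the start of its hour in UTC."""
    ts = datetime.fromisoformat(iso_ts).astimezone(timezone.utc)
    return ts.replace(minute=0, second=0, microsecond=0).isoformat()


def rollup(rows, dimensions, metrics):
    """Combine rows sharing the same truncated time and dimension values."""
    agg = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = (truncate_to_hour(row["time"]),) + tuple(row[d] for d in dimensions)
        agg[key]["count"] += 1          # number of raw rows folded into this one
        for m in metrics:
            agg[key][m] += row[m]       # summed metric
    return [
        {"time": key[0], **dict(zip(dimensions, key[1:])), **dict(values)}
        for key, values in sorted(agg.items())
    ]


# Hypothetical raw events.
RAW = [
    {"time": "2024-01-01T10:05:00+00:00", "page": "/home", "bytes": 100},
    {"time": "2024-01-01T10:40:00+00:00", "page": "/home", "bytes": 250},
    {"time": "2024-01-01T10:59:00+00:00", "page": "/docs", "bytes": 50},
    {"time": "2024-01-01T11:01:00+00:00", "page": "/home", "bytes": 70},
]

ROLLED = rollup(RAW, dimensions=["page"], metrics=["bytes"])
```

Four raw rows become three stored rows here; on real data with many repeated dimension combinations the reduction can be much larger, which is part of why queries on fresh data stay fast.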

Architecture (Ingestion)
[Diagram: Files and Streams → Indexers → Segments → Historicals]

Druid segments
Enables global index and write once consistency.

Engine and data format are tightly integrated
Secondary indexes
Compression
Operate on compressed data
Late materialization
DICT
Melbourne = 0
Perth = 1
Sydney = 2
DATA
0 0 0 1 1 2 2 2
INDEX
Melbourne: rows [0,1,2] (11100000)
Perth: rows [3,4] (00011000)
Sydney: rows [5,6,7] (00000111)
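
The dictionary, data, and index example above can be reproduced with a short sketch. Bitmaps are shown as plain strings of 0s and 1s for readability; this is an assumption of the example, since production systems store them in compressed bitmap formats. Function names are hypothetical.

```python
def dictionary_encode(values):
    """Map each distinct value to an integer id (sorted order) and encode the column."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[v] for v in values]


def build_bitmaps(encoded, dictionary):
    """One bitmap per value: character i is '1' when row i holds that value."""
    return {
        value: "".join("1" if c == code else "0" for c in encoded)
        for value, code in dictionary.items()
    }


def rows_matching_any(bitmaps, values):
    """Answer `column IN values` by OR-ing bitmaps, without scanning the data column."""
    n = len(next(iter(bitmaps.values())))
    combined = 0
    for v in values:
        combined |= int(bitmaps[v], 2)
    bits = format(combined, f"0{n}b")
    return [i for i, b in enumerate(bits) if b == "1"]


# The eight-row city column from the slide.
CITIES = ["Melbourne"] * 3 + ["Perth"] * 2 + ["Sydney"] * 3

DICT, DATA = dictionary_encode(CITIES)
INDEX = build_bitmaps(DATA, DICT)
```

A filter such as "city is Melbourne or Sydney" becomes a bitwise OR of two bitmaps, `rows_matching_any(INDEX, ["Melbourne", "Sydney"])`, and only the matching rows need to be read afterwards.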

Querying
Query libraries:
• JSON over HTTP
• SQL
• R
• Python
• Ruby
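
For the "JSON over HTTP" option, a native Druid query is a JSON document posted to `/druid/v2/`. The sketch below builds a `timeseries` query body; the `clickstream` datasource and the `clicks` metric are hypothetical names, and the interval is an arbitrary example.

```python
import json


def timeseries_query(datasource, interval, granularity="hour"):
    """Build a native timeseries query: row count and summed clicks per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": granularity,
        "aggregations": [
            {"type": "count", "name": "rows"},
            # 'clicks' is a hypothetical metric column.
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        ],
    }


QUERY = timeseries_query("clickstream", "2024-01-01/2024-01-02")

# JSON body to POST with Content-Type: application/json.
PAYLOAD = json.dumps(QUERY)
```

The client libraries listed on the slide are alternative ways of reaching the same query APIs.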

Query Processing in Parallel
[Diagram: Query server (Broker) → Data servers (Historicals, Indexers) → Segments]
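
The scatter/gather pattern in this diagram can be sketched as follows: a broker fans a query out to the segments in parallel, each data server computes a partial aggregate, and the broker merges the partials. Threads stand in for remote data servers here, and the segment contents are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segments: each is a list of rows held by some data server.
SEGMENTS = [
    [{"city": "Perth", "clicks": 2}, {"city": "Sydney", "clicks": 5}],
    [{"city": "Perth", "clicks": 1}],
    [{"city": "Melbourne", "clicks": 4}, {"city": "Sydney", "clicks": 3}],
]


def scan_segment(segment, dimension, metric):
    """Data-server side: compute a partial sum per dimension value for one segment."""
    partial = Counter()
    for row in segment:
        partial[row[dimension]] += row[metric]
    return partial


def broker_query(segments, dimension, metric):
    """Broker side: scan all segments in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: scan_segment(s, dimension, metric), segments)
        merged = Counter()
        for p in partials:
            merged.update(p)  # adds counts for matching keys
    return dict(merged)
```

Because sums merge associatively, the broker never needs the raw rows, only the small partial results from each segment.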

Resources
druid.apache.org
druid.apache.org/community
Imply Meetup Groups https://www.meetup.com/pro/apache-druid
ASF #druid Slack channel
@druidio on Twitter
Apache Distribution https://github.com/apache/druid
Imply Distribution https://imply.io/get-started
