AirStream
LIYIN TANG & JINGWEI LU
Data Infrastructure at Airbnb
Event
Logs
MySQL
Dumps
Gold Cluster
HDFS
Hive
Kafka
Sqoop
Silver Cluster Spark Cluster
Spark
ReAir
Airflow Scheduling
S3
Presto Cluster
AirPal
Caravel
Tableau
Batch Infrastructure
Yarn HDFS
Hive
Yarn
Streaming at Airbnb
Event
Logging
MySQL
BINLOG
Cluster
HDFS
Hive
SpinalTap
Presto Cluster
Yarn
Kafka
HBase
Spark Streaming
Datadog
Druid
Kafka
Growing Pains
Stateless
Source → Computation → Sink
DStream DF DF
Stateful
Source → Computation
DStream DF DF
Sink1
Sink2
Sink N
State Storage
RDD
Multiple Streams
DataFrame
Sink1
Process A
Sink2
Sink3
SinkN
…
DataFrame
Sink1
Process N
Sink2
Sink3
SinkN
…
Source
DStream
Align by Time
DataFrame
DataFrame
State
Source
DStream
…
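The "Align by Time" step can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `align_by_time`, the fixed window size, and the sample events are hypothetical, and plain Python lists stand in for DStreams.

```python
from collections import defaultdict

def align_by_time(streams, window_seconds):
    """Group events from several streams into shared time windows.

    streams: dict of stream name -> list of (timestamp_seconds, payload).
    Returns {window_start: {stream_name: [payloads]}} ordered by window_start.
    """
    aligned = defaultdict(lambda: {name: [] for name in streams})
    for name, events in streams.items():
        for ts, payload in events:
            window_start = ts - ts % window_seconds  # floor to the window boundary
            aligned[window_start][name].append(payload)
    return dict(sorted(aligned.items()))

# Illustrative data only.
streams = {
    "stream_a": [(1, "a1"), (12, "a2")],
    "stream_b": [(3, "b1")],
}
windows = align_by_time(streams, window_seconds=10)
```

Each window then holds one slice per stream, so a downstream process can treat the slices as tables that cover the same time range.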
Streaming + Batch
DataFrame
Sink1
Process A
Sink2
Sink3
SinkN
…
DataFrame
State
DStream
…
Align by Time
…
DataFrame
Sink1
Process A
Sink2
Sink3
SinkN
…
Simplify and Unify
AirStream Architecture
Sources
Stream #1 Stream #N
Hive Tables HBase Tables
Virtual Table Views for Computation
Sinks
…
Customized Computation Spark SQL
Simple Config
HBase Services Streaming Sources Druid
AirStream Architecture
Sources
Stream #1 Stream #N
Hive Tables HBase Tables
Virtual Table Views for Computation
Sinks
…
Customized Computation Spark SQL
HBase Services Streaming Sources Druid
Same Computation for Batch Processing
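A rough sketch of the idea that one config can drive the same computation for streaming and batch input. Everything here is an assumption made for illustration: the config keys, the table name `bookings`, and the `run` helper are hypothetical, and Python's built-in `sqlite3` stands in for Spark SQL so the example runs on its own. Sinks are left out.

```python
import sqlite3

# Hypothetical config; these keys are illustrative, not AirStream's real schema.
CONFIG = {
    "source": {"name": "bookings", "columns": ["listing_id", "amount"]},
    "sql": (
        "SELECT listing_id, SUM(amount) AS total "
        "FROM bookings GROUP BY listing_id ORDER BY listing_id"
    ),
}

def run(config, rows):
    """Expose rows as a table view, run the configured SQL, return the result rows."""
    name = config["source"]["name"]
    columns = config["source"]["columns"]
    conn = sqlite3.connect(":memory:")  # stand-in for a Spark SQL session
    conn.execute("CREATE TABLE {} ({})".format(name, ", ".join(columns)))
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany("INSERT INTO {} VALUES ({})".format(name, placeholders), rows)
    result = conn.execute(config["sql"]).fetchall()
    conn.close()
    return result

micro_batch = [(1, 100), (2, 50), (1, 25)]  # rows from one streaming interval
batch_table = micro_batch + [(2, 10)]       # rows from a full batch table
stream_result = run(CONFIG, micro_batch)
batch_result = run(CONFIG, batch_table)
```

The point of the sketch is that `CONFIG["sql"]` is written once and does not care whether the rows came from a micro-batch or from a full table.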
Stateful
State Store
• Merge changes
• Provide fast lookup
• Fast persistent storage across streaming and batch jobs
Why HBase
Rich Functionalities
Rich Integration with Hadoop EcoSystem
Easy Management
Strong Community
Reliable and Scalable
HBase State Store
Operators in Airstream
Full Table Scan
Simple Aggregation
Bulk Upload
Key/Prefix Lookup
Update
Computation DAG
Input Data
Left Outer Join Result
Key Lookup
Key Space Design
• Hash partition key space for load balance
• Composite key for K -> V
• Support full key lookup
• Prefix lookup supported for all keys used in hash function
Hash based on key prefix: Hash | key1 | key2 | key3
Lookup based on key prefix: Hash | key1 | key2
key1 = ‘value1’ and key2 = ‘value2’
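One way to read the key space design is sketched below. The details are assumptions, not taken from the slides: the bucket count, the separator byte, the use of MD5, and the helper names are all hypothetical, and a Python dict stands in for the HBase table (a real table would serve the prefix lookup with a range scan instead of iterating over every key).

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of hash buckets
SEP = "\x00"      # assumed separator between key parts

def bucket(parts):
    """Hash the key prefix into a fixed-width bucket id for load balance."""
    digest = hashlib.md5(SEP.join(parts).encode("utf-8")).hexdigest()
    return "{:02d}".format(int(digest, 16) % NUM_BUCKETS)

def row_key(key1, key2, key3):
    """Composite key: the hash covers key1 and key2 only, key3 is appended after."""
    return SEP.join([bucket([key1, key2]), key1, key2, key3])

def scan_prefix(key1, key2):
    """Prefix shared by every row with this key1 and key2."""
    return SEP.join([bucket([key1, key2]), key1, key2]) + SEP

table = {}  # stand-in for the HBase table

def put(key1, key2, key3, value):
    table[row_key(key1, key2, key3)] = value

def prefix_lookup(key1, key2):
    prefix = scan_prefix(key1, key2)
    return sorted(v for k, v in table.items() if k.startswith(prefix))

put("value1", "value2", "a", 1)
put("value1", "value2", "b", 2)
put("value1", "value2x", "a", 9)
full = table[row_key("value1", "value2", "a")]
matches = prefix_lookup("value1", "value2")
```

Because the hash is computed only from the parts that appear in the lookup, the reader can recompute the bucket from `key1` and `key2` alone, which is what makes the prefix lookup possible.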
Write Performance
• Partition based on key before write
• Use bulk upload for large volume update
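A minimal sketch of partitioning by key before a write, assuming row keys start with a two-digit bucket followed by `|`. The key layout and the names `partition_of` and `partition_for_write` are hypothetical. Rows are also sorted inside each partition, since HBase bulk loading expects sorted keys within a file.

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    """Assumed layout: the row key starts with a numeric bucket, e.g. '03|user9'."""
    return int(key.split("|")[0]) % num_partitions

def partition_for_write(records, num_partitions):
    """Group (key, value) records by target partition and sort each group by key."""
    parts = defaultdict(list)
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return {p: sorted(rows) for p, rows in sorted(parts.items())}

records = [("01|b", 2), ("00|a", 1), ("01|a", 3)]
batches = partition_for_write(records, num_partitions=2)
```

Each resulting batch can then be written by a single writer, so writers do not contend for the same key range.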
Case Study
Experiment realtime feedback
Update
Experiment
Assignment Event
Lookup
HBase
with TTL
Booking Event
Druid Datadog
one AirStream config
job 1 job 2
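The two jobs in this case study can be sketched as follows, under stated assumptions: `TtlStore` is an in-memory stand-in for an HBase table with a TTL (not the HBase API), the event fields are hypothetical, and the returned list stands in for what would be sent to Druid or Datadog. Time is passed in explicitly so the example is deterministic.

```python
class TtlStore:
    """In-memory stand-in for an HBase table with a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.rows = {}  # key -> (value, write_time)

    def update(self, key, value, now):
        self.rows[key] = (value, now)

    def lookup(self, key, now):
        hit = self.rows.get(key)
        if hit is None or now - hit[1] > self.ttl_seconds:
            return None  # missing or expired
        return hit[0]

def assignment_job(store, events):
    """Job 1: record each user's experiment assignment."""
    for e in events:
        store.update(e["user"], e["treatment"], e["ts"])

def booking_job(store, events):
    """Job 2: look up the assignment for each booking and emit (treatment, booking)."""
    out = []
    for e in events:
        treatment = store.lookup(e["user"], e["ts"])
        if treatment is not None:
            out.append((treatment, e["booking_id"]))
    return out

store = TtlStore(ttl_seconds=3600)
assignment_job(store, [
    {"user": "u1", "treatment": "new_ui", "ts": 0},
    {"user": "u2", "treatment": "control", "ts": 0},
])
metrics = booking_job(store, [
    {"user": "u1", "booking_id": "b1", "ts": 60},
    {"user": "u2", "booking_id": "b2", "ts": 7200},  # assignment has expired
    {"user": "u3", "booking_id": "b3", "ts": 60},    # never assigned
])
```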
Realtime Data Ingestion
Realtime Ingestion on HBase
Data Infrastructure
MySQL
Analytical
Events
Kafka
Spark Streaming
HBase
HDFS
Presto/Hive/
Spark
Source
Ingest
Realtime Query
Snapshot
Batch Query
Access Data in HBase
HBase
Hive Presto
Spark
SQL
Spark
Streaming
Batch Jobs Interactive Query Streaming
HDFS
Snapshot
Table Mapping/Unified View on realtime data
Snapshot & Reseed
HBase HDFS
Snapshot (HFile Links)
Bulk Upload
Case Study 1: Events Ingestion
Kafka
topic
…
topic
topic
Spark
Executor1
…
Executor
Executor
HBase
DeDup
HDFS
Daily
Realtime
Hive
Presto
Events
Partition
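The slide only labels the HBase step "DeDup". One common way to get that effect, shown here purely as an assumption, is to use a unique event id as the row key so that a replayed event overwrites itself. The field names `event_id` and `date`, and both helper functions, are hypothetical; a dict stands in for the HBase table.

```python
from collections import defaultdict

def ingest(table, events):
    """Upsert keyed by event id, so a replayed event overwrites itself."""
    for event in events:
        table[event["event_id"]] = event

def daily_partitions(table):
    """Group the deduplicated event ids by day, as a daily export might."""
    parts = defaultdict(list)
    for event_id in sorted(table):
        parts[table[event_id]["date"]].append(event_id)
    return dict(parts)

table = {}
first_delivery = [
    {"event_id": "e1", "date": "2016-05-23"},
    {"event_id": "e2", "date": "2016-05-24"},
]
replay = [{"event_id": "e1", "date": "2016-05-23"}]  # same event delivered again
ingest(table, first_delivery)
ingest(table, replay)
partitions = daily_partitions(table)
```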
Case Study 2: Streaming DB Export
RDS: Table1, Table2, …, TableN
Kafka: Spinaltap.Table1, Spinaltap.Table2, …, Spinaltap.TableN
Spark
Executor1
…
Executor2
Executor K
HBase
Region1
…
Region2
Region M
HDFS
Daily Snapshot
Realtime Query
Case Study: Streaming DB Export
Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1
Case Study: Streaming DB Export
Binlog Order: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 102), TXN 3 (Commit_TS: 103), …, TXN N (Commit_TS: N’)
Case Study: Streaming DB Export
Binlog Order: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
NTP
Case Study: Streaming DB Export
Binlog Order: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
Point-in-Time Restore on TS 102
Case Study: Streaming DB Export
Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | bin100 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | bin101 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | bin102 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | bin100 | 1
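A small sketch of why the cell version matters when commit timestamps can disagree with binlog order. The data is illustrative: the two mutations reuse the city values from the table, and the commit timestamps 103 and 102 mirror the out-of-order slide. `latest_value` is a hypothetical helper that mimics a versioned store returning the cell with the highest version, which is how HBase resolves a default read.

```python
def latest_value(cells):
    """Return the value with the highest version from (version, value) pairs."""
    return max(cells, key=lambda cell: cell[0])[1]

# Mutations listed in binlog order; the commit timestamps are out of order.
binlog = [
    {"offset": 101, "commit_ts": 103, "city": "San Francisco"},
    {"offset": 102, "commit_ts": 102, "city": "New York"},
]

by_commit_ts = [(m["commit_ts"], m["city"]) for m in binlog]
by_offset = [(m["offset"], m["city"]) for m in binlog]

read_with_commit_ts = latest_value(by_commit_ts)  # follows the clock
read_with_offset = latest_value(by_offset)        # follows the binlog
```

With the commit timestamp as the version, the earlier write wins because its clock value is larger. With the binlog offset as the version, the read agrees with the order in which MySQL applied the changes.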
Case Study: Streaming DB Export
Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102
Operation
Job Management: Scaling up
Config Driver
Streaming
Job
Yarn
Spark Jobs
…
Config Driver
Streaming
Job
… … … …
Spark Jobs
Config Driver
Streaming
Job
Spark Jobs
Spark Job 1
Spark Job 2
Spark Job N
Concurrent
…
…
Config Driver
Streaming
Job
Yarn
Job Management: Scaling up
Job Management: Fault Tolerant
Driver
Spark Job 1
Spark Job 2
Spark Job N
Streaming
Job
Concurrent
Yarn
…
…
Offset Management
Mesos
Driver
Driver
Config
Config
Config
……
Checkpoint Rewind
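The checkpoint and rewind labels suggest offsets that are tracked outside the job and can be reset. A minimal sketch, assuming a hypothetical `OffsetManager` class and a Python list standing in for a Kafka partition. Offsets are committed only after a batch is handled, so a rewind replays records (at-least-once delivery).

```python
class OffsetManager:
    """Tracks the next offset to read per partition, with named checkpoints."""

    def __init__(self):
        self.committed = {}    # partition -> next offset to read
        self.checkpoints = {}  # label -> snapshot of committed offsets

    def commit(self, partition, next_offset):
        self.committed[partition] = next_offset

    def checkpoint(self, label):
        self.checkpoints[label] = dict(self.committed)

    def rewind(self, label):
        self.committed = dict(self.checkpoints[label])

def run_batch(log, manager, partition, handler):
    """Process everything after the committed offset, then commit the new position."""
    start = manager.committed.get(partition, 0)
    for record in log[start:]:
        handler(record)
    manager.commit(partition, len(log))

manager = OffsetManager()
seen = []
log = ["r0", "r1"]
run_batch(log, manager, "topic-0", seen.append)
manager.checkpoint("before_deploy")
log.append("r2")
run_batch(log, manager, "topic-0", seen.append)
manager.rewind("before_deploy")                  # go back to the saved position
run_batch(log, manager, "topic-0", seen.append)  # r2 is processed again
```

Because a rewind can replay records, the sinks in such a design need to tolerate duplicates, for example through keyed upserts.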
Job Management: Monitoring & Alerting
Driver
Spark Job 1
Spark Job 2
Spark Job N
Streaming
Job
Concurrent
Yarn
…
…
AirStreamListener
Summary
Simplify and Unify Stream Batch Pipeline
Rich Stateful Computation
Rich Integration with Hadoop EcoSystem
Easy Operation

Airstream: Spark Streaming At Airbnb
