Airstream: Spark Streaming At Airbnb

AirStream
LIYIN TANG & JINGWEI LU

Batch Infrastructure
Diagram components: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster, Silver Cluster, Spark Cluster, HDFS, Hive, Yarn, Spark, ReAir, Airflow Scheduling, S3, Presto Cluster, AirPal, Caravel, Tableau

Streaming at Airbnb
Diagram components: Event Logging, MySQL BINLOG, Spinaltap, Kafka, Spark Streaming, HBase, Druid, Datadog, Cluster, HDFS, Hive, Yarn, Presto Cluster

Stateless
Source -> Computation -> Sink
Data representations: DStream, DF, DF

Stateful
Source -> Computation -> Sink1, Sink2, Sink N
Data representations: DStream, DF, DF
Also shown: State Storage, RDD

Multiple Streams
Diagram components: Source -> DStream (one per stream, …), Align by Time, State, DataFrame, Process A … Process N
Each process takes a DataFrame and writes to Sink1, Sink2, Sink3, …, SinkN

Streaming + Batch
Diagram components: DStream (…), Align by Time, State, DataFrame, Process A (shown twice)
Each process takes a DataFrame and writes to Sink1, Sink2, Sink3, …, SinkN

AirStream Architecture
Sources: Stream #1, Stream #N, Hive Tables, HBase Tables
Virtual Table Views for Computation
Computation: Spark SQL, Customized Computation
Sinks: HBase, Services, Streaming Sources, Druid, …
Simple Config

AirStream Architecture
Same computation for batch processing
Sources: Stream #1, Stream #N, Hive Tables, HBase Tables
Virtual Table Views for Computation
Computation: Spark SQL, Customized Computation
Sinks: HBase, Services, Streaming Sources, Druid, …

State Store
• Merge changes
• Provide fast lookup
• Fast persistent storage across streaming and batch jobs

Why HBase
Rich Functionalities
Rich Integration with Hadoop EcoSystem
Easy Management
Strong Community
Reliable and Scalable

HBase State Store
Operators in Airstream
Full Table Scan
Simple Aggregation
Bulk Upload
Key/Prefix Lookup
Update

Computation DAG
Input Data
Left Outer Join Result
Key Lookup
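
The join step above can be sketched in a few lines. This is a minimal illustration, not AirStream's implementation: a plain dict stands in for the HBase state store, and the names `left_outer_join`, `state`, and `batch` are hypothetical. Every input row is kept, and rows without a match in the store are paired with `None`, which is what makes the join a left outer join.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for the state store; the deck uses HBase here.
StateStore = Dict[str, dict]


def left_outer_join(
    rows: Iterable[dict], store: StateStore, key_field: str
) -> List[Tuple[dict, Optional[dict]]]:
    """Pair every input row with the state stored under its key, or None."""
    return [(row, store.get(row[key_field])) for row in rows]


# Illustrative data only.
state = {"u1": {"experiment": "exp_a"}}
batch = [
    {"user": "u1", "event": "booking"},
    {"user": "u2", "event": "booking"},
]
joined = left_outer_join(batch, state, "user")
```

Input rows are never dropped: `u2` has no stored state, so it stays in the result with `None` on the right-hand side.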

Key Space Design
• Hash-partition the key space for load balance
• Composite key for K -> V
• Support full key lookup
• Prefix lookup supported for all keys used in the hash function
Row key: Hash | key1 | key2 | key3 (hash based on key prefix)
Prefix lookup: Hash | key1 | key2 (lookup based on key prefix), e.g. key1 = ‘value1’ and key2 = ‘value2’
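
One way to read the key layout above is sketched below. The bucket count, the `|` delimiter, the use of MD5, and the function names are all assumptions made for illustration; the deck only states that the hash is computed from a key prefix and that prefix lookups work for the keys used in the hash function. Because the hash covers only key1 and key2, a lookup that knows those two values can recompute the same leading bucket and scan a single contiguous key range.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of hash buckets
SEP = "|"         # assumed delimiter; key values must not contain it


def bucket(prefix_parts):
    """Map the key prefix to a bucket number in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(SEP.join(prefix_parts).encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def row_key(key1, key2, key3):
    """Full composite key: the hash is derived from (key1, key2) only."""
    return SEP.join([f"{bucket([key1, key2]):02d}", key1, key2, key3])


def scan_prefix(key1, key2):
    """Prefix shared by every row key that has the given key1 and key2."""
    return SEP.join([f"{bucket([key1, key2]):02d}", key1, key2]) + SEP


full = row_key("value1", "value2", "value3")
prefix = scan_prefix("value1", "value2")
```

If the hash also covered key3, a prefix lookup on key1 and key2 would have to probe every bucket, which is presumably why the hash is limited to the prefix.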

Write Performance
• Partition based on key before write
• Use bulk upload for large-volume updates
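
The first bullet can be illustrated with a small sketch. It assumes that "partition based on key" means grouping records by the HBase region that owns their row key, so that each writer sends sorted keys to one region; the deck does not spell this out. The split keys, record values, and function names are hypothetical.

```python
import bisect
from collections import defaultdict


def region_index(row_key, split_keys):
    """Index of the region owning row_key, given sorted region split keys.

    A region covers [start_key, end_key), so a key equal to a split key
    belongs to the region that starts at that split key.
    """
    return bisect.bisect_right(split_keys, row_key)


def partition_by_region(records, split_keys):
    """Group (row_key, value) pairs by region and sort each group by key."""
    parts = defaultdict(list)
    for row_key, value in records:
        parts[region_index(row_key, split_keys)].append((row_key, value))
    return {idx: sorted(rows, key=lambda kv: kv[0]) for idx, rows in parts.items()}


# Illustrative data: two split keys define three regions.
splits = ["g", "n"]
records = [("zebra", 1), ("apple", 2), ("grape", 3), ("mango", 4), ("g", 5)]
batches = partition_by_region(records, splits)
```

Sorted, per-region batches are also the shape a bulk upload typically expects, since HFiles store keys in sorted order.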

Case Study
Experiment realtime feedback
Diagram components: Experiment Assignment Event, Update, HBase with TTL, Booking Event, Lookup, Druid, Datadog
job 1 and job 2 in one airstream config

Realtime Ingestion on HBase
Data Infrastructure
Components: MySQL, Analytical Events, Kafka, Spark Streaming, HBase, HDFS, Presto/Hive/Spark
Stages: Source, Ingest, Realtime Query, Snapshot, Batch Query

Access Data in HBase
Diagram components: HBase, HDFS Snapshot, Hive, Presto, Spark SQL, Spark Streaming
Use cases: Batch Jobs, Interactive Query, Streaming
Table Mapping/Unified View on realtime data

Snapshot&Reseed
HBase, HDFS
Snapshot (HFile Links)
Bulk Upload

Case Study 1: Events Ingestion
Diagram components: Kafka (topic, …, topic), Spark (Executor1, …, Executor), HBase, DeDup, HDFS, Hive, Presto, Daily, Realtime, Events Partition
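
The DeDup label next to HBase suggests deduplication at write time. One common way to get that effect, shown here as an assumption rather than as the deck's method, is to use a unique event id as the row key so that a redelivered event overwrites itself instead of creating a second row. A dict stands in for the HBase table, and `ingest` and `event_id` are hypothetical names.

```python
def ingest(table: dict, events) -> int:
    """Write events keyed by event id; return how many new rows were created."""
    created = 0
    for event in events:
        key = event["event_id"]
        if key not in table:
            created += 1
        # Re-writing an existing key replaces the row, so replays are harmless.
        table[key] = event
    return created


# Illustrative data: the second batch replays event "e2".
table = {}
first = ingest(table, [{"event_id": "e1", "type": "click"},
                       {"event_id": "e2", "type": "view"}])
second = ingest(table, [{"event_id": "e2", "type": "view"},
                        {"event_id": "e3", "type": "click"}])
```

With idempotent writes like this, a streaming job that reprocesses a batch after a failure does not inflate downstream counts.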

Case Study 2: Streaming DB Export
Diagram components: RDS (Table1, Table2, …, TableN), Kafka (Spinaltap.Table1, Spinaltap.Table2, …, Spinaltap.TableN), Spark (Executor1, Executor2, …, Executor K), HBase (Region1, Region2, …, Region M), HDFS, Daily Snapshot, Realtime Query

Case Study: Streaming DB Export
Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1

Binlog Order
TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 102), TXN 3 (Commit_TS: 103), …, TXN N (Commit_TS: N’)

Binlog Order
TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
NTP: commit timestamps can be out of binlog order (103 precedes 102)

Binlog Order
TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
Point-in-Time Restore on TS 102

Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | bin100 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | bin101 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | bin102 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | bin100 | 1
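
The difference between the two versioning schemes can be shown with a small simulation. HBase returns the cell with the highest version by default, so the choice of version decides which write wins. The `VersionedCell` class below is a hypothetical in-memory stand-in, and the offsets and timestamps are illustrative values in the spirit of the slides, where a later binlog entry carries an earlier commit timestamp.

```python
class VersionedCell:
    """Toy model of one HBase cell: reads return the highest version."""

    def __init__(self):
        self._versions = {}

    def put(self, version, value):
        self._versions[version] = value

    def latest(self):
        return self._versions[max(self._versions)]


# Two updates to the same column, listed in binlog order.
# The second write happened later but has a smaller commit timestamp.
changes_in_binlog_order = [
    {"offset": 101, "commit_ts": 103, "city": "San Francisco"},
    {"offset": 102, "commit_ts": 102, "city": "New York"},
]

by_commit_ts = VersionedCell()
by_binlog_offset = VersionedCell()
for change in changes_in_binlog_order:
    by_commit_ts.put(change["commit_ts"], change["city"])
    by_binlog_offset.put(change["offset"], change["city"])

stale = by_commit_ts.latest()      # picks the write with the larger timestamp
fresh = by_binlog_offset.latest()  # picks the last write in binlog order
```

Versioning by binlog position keeps the last write in binlog order as the visible value even when commit timestamps are not monotonic.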

Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100


Job Management: Scaling up
Diagram components: Config, Driver, Streaming Job, Spark Jobs, Yarn
The Config / Driver / Streaming Job / Spark Jobs group is repeated (…)

Job Management: Scaling up
Diagram components: Config, Driver, Streaming Job, Yarn, concurrent Spark jobs (Spark Job 1, Spark Job 2, …, Spark Job N)

Job Management: Fault Tolerant
Diagram components: Config (…), Driver, Streaming Job, Yarn, Mesos, concurrent Spark jobs (Spark Job 1, Spark Job 2, …, Spark Job N)
OffsetManagement, Checkpoint, Rewind
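
The slide names offset management, checkpoint, and rewind without further detail. A minimal sketch of one way these could fit together follows: after each batch the job records the source offsets it has consumed, and on recovery it rewinds to the offsets recorded at or before a chosen time. The `OffsetLog` class, its methods, and the partition names are hypothetical and not taken from AirStream.

```python
import bisect


class OffsetLog:
    """Append-only record of consumed offsets, one entry per finished batch."""

    def __init__(self):
        self._times = []    # batch times, assumed to arrive in increasing order
        self._offsets = []  # per-partition offsets committed at each batch time

    def checkpoint(self, batch_time, offsets):
        self._times.append(batch_time)
        self._offsets.append(dict(offsets))

    def rewind(self, to_time):
        """Offsets from the latest checkpoint at or before to_time, or None."""
        i = bisect.bisect_right(self._times, to_time)
        return self._offsets[i - 1] if i > 0 else None


# Illustrative data: three batches over two partitions.
log = OffsetLog()
log.checkpoint(10, {"p0": 100, "p1": 90})
log.checkpoint(20, {"p0": 150, "p1": 140})
log.checkpoint(30, {"p0": 210, "p1": 200})

restart_from = log.rewind(25)
before_first = log.rewind(5)
```

Replaying from rewound offsets re-delivers some records, so this approach relies on the sinks tolerating repeated writes.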

Job Management: Monitoring&Alerting
Diagram components: Driver, Streaming Job, Yarn, concurrent Spark jobs (Spark Job 1, Spark Job 2, …, Spark Job N), AirStreamListener

Summary
Simplify and Unify Stream Batch Pipeline
Rich Stateful Computation
Rich Integration with Hadoop EcoSystem
Easy Operation
