Apache Samza 1.0 - What's New, What's Next
Apache Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Slack, Metamarkets, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
• Powers thousands of active jobs in production at LinkedIn
Stream Processing Architecture at LinkedIn
[Architecture diagram. Labels: Services Tier, Espresso, Oracle, MySQL, Ambry, Changes, Brooklin, Ingestion, Kafka, Processing, Near Real Time Processing (Apache Samza), HDFS, Results, Venice, Pinot, Couchbase]
Samza Scale At LinkedIn
3K+ Jobs
900B+ Msgs Processed/Day
3K+ Machines
99.99% Availability
What's New
● Faster Onboarding
○ Make it fast and simple to learn Samza and create new applications.
● Powerful APIs
○ Provide the right level of expressibility for every use case.
● Ease of Development
○ Offer the right abstractions and tools to get things done quickly.
● Better Operability
○ Make it effortless and cost-effective to run applications at any scale.
Faster Onboarding
Revamped Website and Documentation
Samza Course on YouTube
https://bit.ly/2TCS9x7
YouTube LIEngineering channel, Stream Processing Tutorials playlist.
Simpler Job Creation
● More samples in hello-samza
○ Samza SQL
○ EventHubs Consumer
○ Integration Tests
○ Running with YARN and Standalone
https://github.com/apache/samza-hello-samza
Powerful APIs
Example Application
Count number of ‘Page Views’ for each member in a 5 minute window
[Pipeline diagram: Page View → Repartition by member id → Intermediate Stream → Window → Map → SendTo → Page View Per Member]
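The counting logic of this example can be sketched in plain Java, independent of any Samza API. This is only an illustrative sketch: the class `TumblingCounter`, its method names, and the choice to key counts by window start time are assumptions, not part of Samza.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Samza): counts page views per member
// in fixed, non-overlapping 5 minute windows.
final class TumblingCounter {
    static final long WINDOW_MS = 5 * 60 * 1000L;

    // window start (ms) -> member id -> count
    private final Map<Long, Map<String, Integer>> counts = new HashMap<>();

    // Start of the 5 minute window that contains the given timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    // Records one page view for the member at the given time.
    void add(String memberId, long timestampMs) {
        counts.computeIfAbsent(windowStart(timestampMs), w -> new HashMap<>())
              .merge(memberId, 1, Integer::sum);
    }

    // Number of page views seen for the member in the window starting at windowStartMs.
    int count(String memberId, long windowStartMs) {
        Map<String, Integer> window = counts.get(windowStartMs);
        return window == null ? 0 : window.getOrDefault(memberId, 0);
    }
}
```

Each timestamp maps to exactly one window, which is what makes the windows "tumbling" rather than sliding.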
Low Level API
Job 1: Repartitioner Job
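import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Re-keys each page view by member id so that all events for a member land in the same partition.
public class PageViewRepartitioner implements StreamTask {
  private final SystemStream outputStream = new SystemStream("kafka", "pvMemberId");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
    String key = pageViewEvent.getMemberId();
    // Arguments: stream, partition key, message key, message. The member id is the partition key.
    OutgoingMessageEnvelope outMessage =
        new OutgoingMessageEnvelope(outputStream, key, key, pageViewEvent);
    collector.send(outMessage);
  }
}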
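Why the repartitioner job is needed can be illustrated with a small hash-partitioning sketch: if the partition is derived only from the member id, every event for a member reaches the same downstream task, so a per-member counter sees all of that member's page views. The class `KeyPartitioner` and the hash-modulo scheme below are assumptions for illustration; Kafka's own default partitioner uses a different hash function.

```java
// Hypothetical illustration (not Samza or Kafka code): a hash-based partitioner.
final class KeyPartitioner {
    // Maps a key to a partition in [0, numPartitions). Equal keys always map
    // to the same partition, which is the property repartitioning relies on.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```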
Low Level API
Job 2: Page view counter job
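import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounter implements StreamTask {
  private final SystemStream outputStream = new SystemStream("kafka", "pageviewCount");
  private final HashMap<String, Integer> counter = new HashMap<>();
  private Instant lastTriggerTime = Instant.now();

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
    String memberId = pageViewEvent.getMemberId();
    counter.put(memberId, counter.getOrDefault(memberId, 0) + 1);
    // Emit once 5 minutes have elapsed. The check only runs when a message arrives.
    if (Duration.between(lastTriggerTime, Instant.now()).toMinutes() >= 5) {
      counter.forEach((key, value) -> collector.send(new OutgoingMessageEnvelope(outputStream, key, value)));
      counter.clear();
      lastTriggerTime = Instant.now();
    }
  }
}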
High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
High Level API
public class PageViewCountApplication implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor appDescriptor) {
    ...
    appDescriptor.getInputStream(pageViews)
        .partitionBy(m -> m.memberId, serde)
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))
        .map(PageViewCount::new)
        .sendTo(appDescriptor.getOutputStream(pageViewCounts));
  }
}
Apache Beam
● Event Time Processing Support
● Multi-language APIs (Python)*
● Sliding Windows & Multi-Way Joins
* coming soon
Apache Beam
public class PageViewCount {
  public static void main(String[] args) {
    ...
    pipeline
        .apply(KafkaIO.<PageViewEvent>read()
            .withTopic("PageView")
            // Event time is taken from the record header.
            .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
            // The watermark trails event time by 60 seconds.
            .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
        .apply(Values.create())
        .apply(MapElements
            .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey())
        .apply(MapElements.via(newCounter()))
        .apply(KafkaIO.<Counter>write().withTopic("PageViewCount"));
    pipeline.run();
  }
}
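The role of the timestamp and watermark functions in the Beam example can be illustrated with a simplified sketch. It assumes a watermark that trails the largest event time seen so far by 60 seconds, and treats a window as complete once the watermark reaches the window's end. The class `WatermarkTracker` is hypothetical, and Beam's actual watermark handling is more involved than this.

```java
// Hypothetical illustration (not Beam code): a watermark that lags event time by 60 s.
final class WatermarkTracker {
    static final long ALLOWED_LAG_MS = 60_000L;

    private long watermarkMs = Long.MIN_VALUE;

    // Advances the watermark based on a newly observed event time.
    // The watermark never moves backwards, even if events arrive out of order.
    void observe(long eventTimeMs) {
        watermarkMs = Math.max(watermarkMs, eventTimeMs - ALLOWED_LAG_MS);
    }

    long watermark() {
        return watermarkMs;
    }

    // A window [start, end) is considered complete once the watermark reaches its end.
    boolean isWindowComplete(long windowEndMs) {
        return watermarkMs >= windowEndMs;
    }
}
```

With a 5 minute window ending at 300,000 ms, the result is emitted only after an event with time 360,000 ms or later has been observed, which leaves room for events that arrive up to a minute late.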
Samza SQL
● Declarative Streaming SQL API
● Create, Validate and Deploy in minutes using SQL Shell
● Managed Service at LinkedIn
● Capabilities: Filters, Projections, Flatten, UDFs, Stream-Table Joins
Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
Samza APIs
● Complex Processing Pipelines
● Easy Repartitioning
● Complex Windows and Joins
● Event and Arrival Time Processing
● Multi-Language APIs (Java, Python, SQL)
Samza APIs
● Low Level (StreamTask)
● High Level (StreamApplication)
● Samza SQL
● Apache Beam (event time based windowed processing): Java, Python
Easier Development
Table API
● Evolution of the KVStore API
● Local and Remote K-V data sources
● Composition through hybrid tables
● Simplifies Stream-Table joins
● Remote Tables: Async I/O, Caching, Rate-limiting, and Retry
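Rate limiting for a remote table can be pictured as a token bucket placed in front of the remote store. The sketch below is an assumption for illustration only: the class `TokenBucket` is hypothetical and is not the mechanism Samza's Table API necessarily uses. The caller passes the current time explicitly so the behavior is deterministic.

```java
// Hypothetical illustration (not the Samza Table API): a token-bucket rate limiter.
final class TokenBucket {
    private final long capacity;
    private final double tokensPerSecond;
    private double tokens;
    private long lastRefillMs;

    TokenBucket(long capacity, double tokensPerSecond, long nowMs) {
        if (capacity < 1 || tokensPerSecond <= 0) {
            throw new IllegalArgumentException("capacity and rate must be positive");
        }
        this.capacity = capacity;
        this.tokensPerSecond = tokensPerSecond;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillMs = nowMs;
    }

    // Returns true if one request may be sent to the remote store at time nowMs.
    boolean tryAcquire(long nowMs) {
        long elapsedMs = Math.max(0L, nowMs - lastRefillMs);
        // Refill in proportion to elapsed time, never exceeding the capacity.
        tokens = Math.min(capacity, tokens + elapsedMs * tokensPerSecond / 1000.0);
        lastRefillMs = Math.max(lastRefillMs, nowMs);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```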
Stream Table Joins
Enrich ‘Page Views’ with Profile Info
[Pipeline diagram: Page Views → Join (RemoteTable backed by Member Database) → SendTo → Enriched Page Views]
● Remote Table Features
○ Rate Limits to avoid DDoS
○ Async I/O
○ Caching / Retries
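Caching and retries for a remote lookup can be sketched as a thin wrapper around the remote call. This is a minimal sketch under stated assumptions: the class `CachingRetryingLookup` is hypothetical, the cache is an unbounded in-memory map, and retries happen immediately without backoff. It is not how Samza's remote tables are implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration (not the Samza Table API): cache hits skip the
// remote call, and failed remote calls are retried a bounded number of times.
final class CachingRetryingLookup<K, V> {
    private final Function<K, V> remote;
    private final int maxAttempts;
    private final Map<K, V> cache = new HashMap<>();

    CachingRetryingLookup(Function<K, V> remote, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.remote = remote;
        this.maxAttempts = maxAttempts;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                V value = remote.apply(key);
                if (value != null) {
                    cache.put(key, value); // only successful, non-null results are cached
                }
                return value;
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }
}
```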
Table API
@Override
public void describe(StreamApplicationDescriptor appDesc) {
  ...
  TableDescriptor<Integer, Profile> tableDesc =
      new RocksDbTableDescriptor<>("profiles", serde);
  Table<KV<Integer, Profile>> profilesTable = appDesc.getTable(tableDesc);
  // Populate the local table from the profiles stream.
  appDesc.getInputStream(profiles).sendTo(profilesTable);
  // Join each page view against the table by member id.
  appDesc.getInputStream(pageViews)
      .map(m -> m.memberId)
      .join(profilesTable, new MyJoinFunc())
      .sendTo(decoratedProfiles);
}
Configuration Descriptors
Specify system, stream and table properties in code instead of configuration.
Test Framework
● Test your application against in-memory data.
● No need to set up Kafka / Yarn / Zookeeper locally.
● Works for both Low Level and High Level API applications.
Test Framework
@Test
public void testApplication() throws Exception {
  // Generate Mock Data
  List<PageView> pageViews = generateMockInput(...);
  List<DecoratedPageView> expectedOutput = generateMockOutput(...);

  // Get In Memory System and Stream Descriptors
  InMemorySystemDescriptor inMemorySystem = new InMemorySystemDescriptor("test");
  InMemoryInputDescriptor<PageView> pvDescriptor =
      inMemorySystem.getInputDescriptor("page-views", new NoOpSerde<PageView>());
  InMemoryOutputDescriptor<DecoratedPageView> dpvDescriptor =
      inMemorySystem.getOutputDescriptor("decorated-page-views", new NoOpSerde<DecoratedPageView>());

  // Configure the TestRunner
  TestRunner.of(new MyApplication())
      .addInputStream(pvDescriptor, pageViews) // Associate data with the descriptor
      .addOutputStream(dpvDescriptor, 10)      // Output stream with 10 partitions
      .run(Duration.ofMillis(1000));

  // Add assertions on the output
  StreamAssert.containsInOrder(expectedOutput, dpvDescriptor, Duration.ofMillis(1000));
}
Offline Experimentation and Grandfathering
Application logic: Count number of ‘Page Views’ for each member in a 5 minute window and send the counts to ‘Page View Per Member’
[Pipeline diagram: Page View (in stream) → Repartition by member id → Window → Map → SendTo → Page View per Member (out stream)]
Same application on HDFS, with zero code changes:
PageView: hdfs://mydbsnapshot/PageViewFiles/
PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
Better Operability
Samza as a Service (YARN)
• Low Cost: Applications run over-subscribed and can use 2 to 4x more CPU than requested
• Supports Host Affinity for stateful jobs, and cleanup of state stores
• Job Management: Samza Dashboard, Metrics/Alerting dashboards, ELK for log management
• Multitenant and Fully-Managed: Applications request containers/resources, and the service manages allocation and resource isolation
• Failure Handling: YARN has built-in retries
Samza as a Library (Standalone)
• Handle Process Failures via External Monitoring Service
• Coordination via Zookeeper
• Enables canary support
• Host Affinity for stateful jobs
• Build event processing logic as part of a larger application
• Full control over how the app is hosted and over its entire life cycle
• Applications are typically hosted in VMs/Containers
Dedicated Clusters
● Dedicated machines for guaranteed capacity
● Isolation from noisy neighbors (hot machines)
● For large jobs with their own SRE teams
Heterogeneous Clusters
● Clusters with spinning disks instead of SSDs
● Lower cost to serve (C2S) for stateless jobs
Samza Diagnostics
● Error analysis for applications
○ Top N Errors
○ Latest N Errors
○ Exception Navigation
○ Application / Container Incarnations
Coming Soon
Faster Onboarding
● Bounded And Predictable Memory Usage
○ Avoid manual memory tuning during initial deploys
● More documentation, examples, and how-tos in hello-samza
Powerful APIs
● High Level API Async I/O support
● Python API via Apache Beam
● Samza SQL
○ Windowing (Aggregations)
○ Stream-Stream Joins
○ Nested data support
Sample Python Code
A Sample Pipeline: KafkaRead → Map → Window → Count → KafkaWrite
p = Pipeline(options=pipeline_options)
(p
 | 'read' >> ReadFromKafka(cluster="tracking",
                           topic="PageViewEvent", config=config)
 | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
 | 'windowing' >> beam.WindowInto(window.FixedWindows(60))
 | 'compute' >> beam.CombinePerKey(beam.combiners.CountCombineFn())
 | 'write' >> WriteToKafka(cluster="queuing",
                           topic="PageViewCount", config=config))
p.run().wait_until_finish()
Easier Development
● Table API
○ Couchbase Table
○ Batching for Remote Tables
Better Operability
● Self-Serve Checkpoints
○ Set System / Stream / Partition Level Checkpoints
○ Set Time Based Checkpoints (e.g. "5 minutes ago") for all of the above
● State Restore Performance Improvements
○ Up to 60% faster restore times!
● Standby Containers With State Replication
● Host Affinity for Standalone
○ Support for stateful apps in ZK Standalone
● Queryable Local State
○ Read RocksDB store contents for debugging
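A time-based checkpoint such as "5 minutes ago" ultimately has to be translated into an offset per partition. One way to picture that translation is a binary search over message timestamps, as in the sketch below. The class `TimeBasedCheckpoint` and the in-memory timestamp array are assumptions for illustration; a real system would ask the messaging system for the offset instead.

```java
// Hypothetical illustration (not Samza code): resolving a time-based
// checkpoint to an offset within one partition.
final class TimeBasedCheckpoint {
    // timestampsMs[i] is the timestamp of the message at offset i, in
    // non-decreasing order. Returns the first offset whose timestamp is
    // >= targetMs, or timestampsMs.length if every message is older.
    static int offsetForTime(long[] timestampsMs, long targetMs) {
        int lo = 0;
        int hi = timestampsMs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (timestampsMs[mid] < targetMs) {
                lo = mid + 1; // message is too old, search to the right
            } else {
                hi = mid;     // candidate found, look for an earlier one
            }
        }
        return lo;
    }
}
```

For "5 minutes ago", the target would be the current time minus 300,000 ms, and processing would resume from the returned offset.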
Thank You!
samza.apache.org
dev@samza.apache.org
Apache Samza
0.7 July 2014
0.8 Dec 2014
0.9 Apr 2015
0.10 Dec 2015
0.11 Oct 2016
0.12 Feb 2017
0.13 June 2017
0.14 Jan 2018
1.0 Dec 2018
Context APIs
● Clear distinction between framework-created and application-created objects.
● Clear distinction between Container-scoped and Task-scoped objects.
● Ability to provide application context factories through the ApplicationDescriptor.
Side Inputs
● Bounded (compacted) streams with periodic updates
● Bootstrap semantics (first consume "fully", then in continuous mode)
● Ideal for periodic data pushes from Hadoop
○ E.g., ML features generated offline.
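The bootstrap step for a side input can be sketched as replaying a compacted stream into an in-memory table before regular processing starts. The class `SideInputBootstrap` below is hypothetical, and the treatment of a null value as a delete marker is an assumption borrowed from how compacted Kafka topics commonly behave.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not Samza code): bootstrapping a lookup table
// from a bounded, compacted stream of (key, value) updates.
final class SideInputBootstrap {
    // Each update is a two-element array: {key, value}. The stream is
    // consumed fully, and the last update per key wins. A null value is
    // treated as a delete marker (an assumption of this sketch).
    static Map<String, String> bootstrap(List<String[]> updates) {
        Map<String, String> table = new HashMap<>();
        for (String[] update : updates) {
            String key = update[0];
            String value = update[1];
            if (value == null) {
                table.remove(key);
            } else {
                table.put(key, value);
            }
        }
        return table;
    }
}
```

Once the table is built, the main input is processed against it, and later periodic pushes would update the table in the same way.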

