WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Amanda Moran, Databricks
Simplify and Scale Data Engineering Pipelines with Delta Lake
#UnifiedDataAnalytics #SparkAISummit
Today’s Speaker
● Solutions Architect @ Databricks
● MS Computer Science, BS Biology
● Previously: HP, Teradata, DataStax, Esgyn
● PMC member and committer on Apache Trafodion
● Experience across 5 different distributed systems
● Course with Udacity on Data Engineering
Agenda
● Data Engineers’ Nightmares and Dreams
● The Data Lifecycle vs. the Delta Lifecycle
● Transitioning Data Pipelines to Delta
● How Dreams Come True
● DEMO!
● How to Use Delta
The Data Engineer’s Journey…
[Diagram: events arrive by stream and batch into a table that is written continuously; a stream feeds AI & reporting.]
The Data Engineer’s Journey…
[Diagram: a second batch job is added for reprocessing, with update & merge into a second table that gets compacted every hour; streams still feed AI & reporting.]
The Data Engineer’s Journey… into a Nightmare
[Diagram: validation jobs and a unified view are bolted onto the same pipeline; updates & merge get complex with a data lake.]
The Data Engineer’s Journey… into a Nightmare
[Diagram: the same sprawling pipeline as before.]
Can this be simplified?
A Data Engineer’s Dream...
[Diagram: CSV, JSON, TXT… and Kinesis feed a data lake, which feeds AI & reporting.]
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming.
What’s missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data alongside newly arrived data
5. Ability to handle late-arriving data without having to delay downstream processing
[Diagram: the same data lake pipeline, with a question mark over the missing piece.]
So… what is the answer?
Structured Streaming + Delta Lake = The Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
The Delta Architecture
[Diagram: CSV, JSON, TXT… and Kinesis are ingested raw into Bronze tables, filtered/cleaned/augmented into Silver tables, and aggregated into business-level Gold tables; Gold feeds streaming analytics and AI & reporting. Quality increases from Bronze to Silver to Gold.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
What does this remind you of?
The Data Lifecycle of the Past
[Diagram: the same Bronze → Silver → Gold flow, from raw ingestion to business-level aggregates feeding streaming analytics and AI & reporting.]
The Data Lifecycle of the Past
[Same diagram; the storage layer is called out as a data lake.]
The Data Lifecycle
[Same diagram; Apache Spark is the processing engine on top of the data lake.]
The Data Lifecycle
[Same diagram; a DW/OLAP system is added to serve the curated data.]
Transitioning from the Data Lifecycle to the Delta Lake Lifecycle
The Delta Architecture, revisited
[Diagram: the Bronze → Silver → Gold flow again.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
Bronze (raw ingestion):
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Silver (filtered, cleaned, augmented):
Intermediate data with some cleanup applied.
Queryable for easy debugging!
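A quick sketch of that debugging story, assuming a Silver table at the hypothetical path /delta/silver; the intermediate table can be queried directly with Spark:

  # Hypothetical path and column; point these at your own Silver table
  silver = spark.read.format("delta").load("/delta/silver")
  silver.where("event_type IS NULL").show()  # inspect suspicious rows directly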
Gold (business-level aggregates):
Clean data, ready for consumption.
Read with Spark or Presto*
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
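A minimal sketch of one such hop, assuming Bronze and Silver tables at the hypothetical paths below; the same stream can run continuously (low latency) or be triggered manually as a one-shot job:

  # Read the Bronze table as a stream, apply cleanup, append to Silver
  bronze = spark.readStream.format("delta").load("/delta/bronze")
  silver_query = (bronze
    .where("event_id IS NOT NULL")                              # example cleanup step
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/silver/_checkpoints")  # required for exactly-once
    .trigger(once=True)                                          # or processingTime="1 minute"
    .start("/delta/silver"))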
Delta Lake also supports batch jobs and standard DML:
INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• GDPR, CCPA
• Upserts
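A sketch of an upsert and a GDPR-style delete using the Python DeltaTable API (shipped in later Delta Lake releases than the 0.1.0 shown below; the table path, column names, and updates_df DataFrame are hypothetical):

  from delta.tables import DeltaTable

  events = DeltaTable.forPath(spark, "/delta/silver")  # hypothetical table path

  # Upsert (MERGE): update matching rows, insert new ones from updates_df
  (events.alias("t")
    .merge(updates_df.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

  # GDPR/CCPA: delete one user's records in place
  events.delete("user_id = 'user-123'")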
Easy to recompute when business logic changes:
• Clear tables (DELETE)
• Restart streams
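A sketch of that recompute loop, under the same hypothetical paths and DeltaTable API as above:

  from delta.tables import DeltaTable

  # Clear the derived tables (delete all rows)...
  DeltaTable.forPath(spark, "/delta/silver").delete()
  DeltaTable.forPath(spark, "/delta/gold").delete()
  # ...then restart the Silver/Gold streams with fresh checkpoint locations;
  # they replay everything from Bronze under the new business logic.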
How the Dream Comes True
Demo Time
Connecting the dots...
[Diagram: the data lake pipeline from before, with each missing ability now answered by a Delta Lake capability.]
1. Ability to read consistent data while data is being written
→ Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput
→ Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes
→ Time travel
4. Ability to replay historical data alongside newly arrived data
→ Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing
→ Stream late-arriving data as it gets added to the table
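A sketch of rollback via time travel, using the versionAsOf read option (available in later Delta Lake releases; the path is hypothetical):

  # Read the table as of an earlier version to recover from a bad write
  v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/silver")

  # Overwriting the current table with the old snapshot rolls it back
  v0.write.format("delta").mode("overwrite").save("/delta/silver")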
How do I use Delta Lake?
Get Started with Delta using Spark APIs
Instead of parquet...
dataframe
  .write
  .format("parquet")
  .save("/data")
… simply say delta:
dataframe
  .write
  .format("delta")
  .save("/data")
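Reads are symmetric; a minimal sketch against the same /data path:

  # Batch read
  df = spark.read.format("delta").load("/data")
  # Streaming read of the same table
  stream = spark.readStream.format("delta").load("/data")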
Add Spark Package
pyspark --packages io.delta:delta-core_2.12:0.1.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0
Maven
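The equivalent dependency declaration, using the same coordinates as the --packages flag above:

  <dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>0.1.0</version>
  </dependency>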
Build your own Delta Lake at https://delta.io
Join the Community
Notebook from Today
Try the notebook on Databricks Community Edition!
Download the notebook at https://dbricks.co/dlw-01
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT