Data Engineering
Patterns and Principles
Valdas Maksimavičius
Software Development Data Projects
Data engineering design patterns
Software Development Data Projects
Would you be confident in a self-driving car ...
... knowing that your software is running it?
Standardize and increase the descriptive power of engineering processes by applying patterns
Or in other words: stand on the shoulders of giants and stop reinventing the wheel
Why does my brain need patterns?
● The left side of your brain is responsible for analytical thinking, science, math, etc.
● It uses known building blocks to model the surrounding world
● If you like tabular representations of data, you will try to model everything as a table
● As an engineer, expand your tool belt by learning new patterns and new building blocks to solve business problems better
Source: https://www.health.harvard.edu/blog/right-brainleft-brain-right-2017082512222
About me
● IT Architect at Cognizant
● Data Engineering, Data Science, Cloud Computing, Agile teams
● Financial, Manufacturing, Logistics, Retail industries
● Organizer of Vilnius Microsoft Data Platform Meetup & Hack4Vilnius Hackathon
● Blogging on www.valdas.blog
Maslow’s hierarchy of needs
● Biological and physiological needs: basic life needs (air, food, drink, shelter, warmth, sex, sleep, etc.)
● Safety needs: security, employment, protection against hunger and violence
● Love and belonging needs: receive and give love, appreciation, friendship
● Esteem needs: unique individual, self-respect, etc.
● Self-actualization: personal growth and fulfillment; experience purpose and meaning, realising all inner potentials
Maslow’s hierarchy of needs for data projects
● Culture: core values, way of working
● Enterprise architecture: buy vs build, cloud readiness
● Data strategy & architecture: defensive vs offensive strategy, use cases
● Existing team skillset: databases, programming, etc.
● Design patterns, tools & principles
● Business drivers: business goals and objectives
Maslow’s hierarchy of needs for data projects - simplified view for today’s presentation
● Culture: core values, way of working
● Data architecture: ingestion, storage, consumption; how data is collected, stored, transformed, distributed, and consumed
● Tools & principles: best practices, naming, patterns
Culture, way of working, values
DevOps culture
1. Foster a Collaborative Environment
2. Impose End-to-End Responsibility - you build it, you ship it
3. Encourage Continuous Improvement
4. Automate (Almost) Everything
5. Focus on the Customer’s Needs
6. Embrace Failure, and Learn From it
7. Unite Teams — and Expertise
Source: https://www.cmswire.com/information-management/7-key-principles-for-a-successful-devops-culture/
Data engineering design patterns
Data architecture
If you are building a data platform in the cloud, remember that ...
a low barrier to entry masks the underlying complexity
Big Data cloud architecture references
Source: https://azure.microsoft.com/en-in/solutions/architecture/modern-data-warehouse/
Architecture example
[Diagram: INGEST -> STORE -> PREP & TRAIN -> DEPLOY & SERVE. Labels: CRM, Social, LOB, Graph, IoT, Image, Cloud; Data orchestration and monitoring; Big data store; Transform, Clean & Train; Results; External systems, Digital portals, Reporting, Core systems]
Data ingestion
[Architecture diagram repeated: INGEST -> STORE -> PREP & TRAIN -> DEPLOY & SERVE]
Application integration approaches
● File Transfer: have each application produce files of shared data for others to consume, and consume files that others have produced.
● Shared Database: have the applications store the data they wish to share in a common database.
● Remote Procedure Invocation: have each application expose some of its procedures so that they can be invoked remotely, and have applications invoke those to run behavior and exchange data.
● Messaging: have each application connect to a common messaging system, and exchange data and invoke behavior using messages.
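To make the Messaging approach more concrete, here is a minimal sketch. It uses Python's standard-library queue as a stand-in for a real messaging system; the function names (publish_orders, consume_orders, run_demo) and the message fields are assumptions made for illustration only.

```python
import json
import queue
import threading

# Stand-in for a common messaging system (a real setup would use a broker).
channel = queue.Queue()

def publish_orders(orders):
    """Producer application: serialise each record and put it on the channel."""
    for order in orders:
        channel.put(json.dumps(order))
    channel.put(None)  # sentinel: no more messages

def consume_orders(received):
    """Consumer application: read messages until the sentinel arrives."""
    while True:
        message = channel.get()
        if message is None:
            break
        received.append(json.loads(message))

def run_demo():
    """Run producer and consumer side by side; they only share the channel."""
    received = []
    consumer = threading.Thread(target=consume_orders, args=(received,))
    consumer.start()
    publish_orders([{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}])
    consumer.join()
    return received
```

The producer and the consumer never call each other directly, which is the main difference from Remote Procedure Invocation.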
Ingestion challenges
● Multiple data source load and prioritization -> push vs pull strategy
● Ingested data indexing and tagging -> metadata collection is mandatory
● Data validation and cleansing -> separate business from processing logic
● Data transformation and compression -> different compression and file types
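Two of the points above, mandatory metadata collection and keeping business rules apart from processing logic, could be sketched as follows. The rule set, field names, and metadata keys are hypothetical; this is one possible shape, not a prescribed one.

```python
import hashlib
import json
from datetime import datetime, timezone

# Business rules live in data, separate from the code that applies them.
# (Hypothetical rules, chosen only for the example.)
RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
}

def validate(record, rules=RULES):
    """Return the names of fields that are missing or break a rule."""
    return [name for name, check in rules.items()
            if name not in record or not check(record[name])]

def ingest(records, source):
    """Generic processing: split valid from rejected rows and collect metadata."""
    valid, rejected = [], []
    for record in records:
        (rejected if validate(record) else valid).append(record)
    payload = json.dumps(valid, sort_keys=True).encode("utf-8")
    metadata = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(valid),
        "rejected_count": len(rejected),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    return valid, rejected, metadata
```

Because the rules are passed in as data, a domain team can change them without touching the ingestion code.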
Choose privacy protection patterns
● Privacy protection at the ingress
● Privacy protection at the egress
Source: https://www.valdas.blog/2019/08/06/privacy-gdpr-implementation-in-azure/
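One possible reading of the two patterns, sketched under assumptions: at the ingress, personal fields are pseudonymised with a keyed hash before storage; at the egress, data is filtered when it is served. The key, the PII field list, and the function names are hypothetical and are not taken from the linked blog post.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-change-me"  # assumption: really loaded from a secret store
PII_FIELDS = {"email", "phone"}     # assumption: which fields count as personal data

def pseudonymise(value, key=SECRET_KEY):
    """Keyed hash: the same input maps to the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def protect_at_ingress(record):
    """Replace personal fields before the record is stored."""
    return {k: pseudonymise(v) if k in PII_FIELDS else v
            for k, v in record.items()}

def protect_at_egress(record, allowed_fields):
    """Store data as-is, but expose only the fields a consumer may see."""
    return {k: v for k, v in record.items() if k in allowed_fields}
```

Ingress protection keeps raw personal data out of the platform entirely, while egress protection keeps it available internally and relies on access rules at serving time.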
Data storage
[Architecture diagram repeated: INGEST -> STORE -> PREP & TRAIN -> DEPLOY & SERVE]
Use cloud storage offerings instead of Hadoop
Data Warehouse vs Data Lake
Aspect | Data Warehouse | Data Lake
Requirements | Relational requirements | Diverse data, scalability, low cost
Data Value | Data of recognised high value | Candidate data of potential value
Data Processing | Mostly refined calculated data | Mostly detailed source data
Business Entities | Known entities, tracked over time | Raw material for discovering entities and facts
Data Standards | Data conforms to enterprise standards | Fidelity to original format and condition
Data Integration | Data integration upfront | Data prep on demand
Transformation | Data transformed, in principle | Data repurposed later, as needs arise
Schema Definition | Schema-on-write | Schema-on-read
Metadata Management | Metadata improvement | Metadata developed on read
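The schema-on-write versus schema-on-read row can be illustrated with a small sketch. The schema, the in-memory "table", and the function names are assumptions for the example; real warehouses and lakes enforce this at the storage or query engine level.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical schema

def write_with_schema(table, row, schema=SCHEMA):
    """Schema-on-write: reject rows that do not match before storing them."""
    for column, expected in schema.items():
        if not isinstance(row.get(column), expected):
            raise ValueError(f"column {column!r} must be {expected.__name__}")
    table.append(row)

def read_with_schema(raw_lines, schema=SCHEMA):
    """Schema-on-read: keep raw text as-is, apply types only when querying."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({c: cast(record[c]) for c, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            continue  # skip records that do not fit this reader's schema
    return rows
```

In the first function bad data never lands; in the second everything lands, and each reader decides what counts as usable.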
Data Warehouse vs Data Lake
Source: Microsoft
Data preparation & training
[Architecture diagram repeated: INGEST -> STORE -> PREP & TRAIN -> DEPLOY & SERVE]
Offer self-service tools
● Self-service exploration: make hypothesis -> identify variables -> split data -> build model -> validate model
● Automated pipeline: collect raw data -> curate data -> train & score -> take insights into actions
[Diagram label: SQL]
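The "split data, build model, validate model" part of the exploration loop can be sketched with the standard library alone. The linear model, the 25% test fraction, and the function names are assumptions picked to keep the example short; a self-service tool would normally wrap a proper ML library.

```python
import random

def split_data(rows, test_fraction=0.25, seed=42):
    """Shuffle reproducibly and split into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def build_model(train):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in train)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def validate_model(model, test):
    """Mean squared error on data the model has not seen."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)
```

Once a hypothesis survives validation in exploration, the same three steps can be moved into the automated pipeline's train and score stage.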
Use on-demand resources
Serve results to end consumers
[Architecture diagram repeated: INGEST -> STORE -> PREP & TRAIN -> DEPLOY & SERVE]
Apply domain and product thinking
● Model to describe a domain
● Unified language
● Raw or transformed datasets
● Domain team is responsible for its lifecycle and SLA
● Discoverable, addressable, trustworthy, self-describing, interoperable, secure
● Each producer is responsible for sharing data products with the organization
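As a rough illustration of these properties, a data product can carry its own description and be published to a shared catalog. The DataProduct fields, the Catalog class, and the "lake://" address scheme are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProduct:
    name: str
    domain: str               # the domain the product describes
    owner_team: str           # team responsible for lifecycle and SLA
    address: str              # addressable: one stable location
    schema: dict              # self-describing: column name -> type
    freshness_sla_hours: int  # trustworthy: a promise consumers can check

class Catalog:
    """Discoverability: producers publish, consumers search by domain."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        self._products[product.address] = product

    def find(self, domain):
        return [p for p in self._products.values() if p.domain == domain]
```

A producing team would publish once and then keep the entry up to date, for example:

```python
catalog = Catalog()
catalog.publish(DataProduct("orders", "sales", "sales-data-team",
                            "lake://sales/orders", {"order_id": "int"}, 24))
```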
Data engineering design patterns
Principles, best practices, tools
Get familiar with DataOps
Get familiar with DataOps - Examples
Delay commitments and keep important decisions open
● The principle of the Last Responsible Moment originates from Lean Software Development
● It emphasises holding off on important actions and crucial decisions for as long as possible, that is, until delaying further would eliminate an important alternative.
Why is the Last Responsible Moment important in cloud analytics?
Expect new improvements and upgrades all the time
valdas@maksimavicius.eu
https://www.linkedin.com/in/valdasm/
Twitter: @VMaksimavicius