State of the Database
@HBase
http://hbase.apache.org
2015-09-28
Nick Dimiduk (@xefyr)
http://n10k.com
#apachebigdata
  
Agenda
o State of the Project
o State of the Software
o State of the Ecosystem
o Latest Releases
o Bonus Content!
o Q & A
  
STATE OF THE PROJECT
Who we are, what we do, why we do it
  
Project: Vision
Simple, steady, and powerful: “A first class high performance horizontally scalable data storage engine for Big Data, suitable as the store of record for mission critical data.”
  
Project: Usage
o Data access for medium- and high-scale services
  o Hundreds of enterprises and startups
  o Some of the largest Internet companies in the world
o Running major production workloads since 2011
o Use-cases
  o messaging, security, measurement/“IoT”, collaboration, digital media, digital advertising, telecommunications, computational biology, clinical informatics/healthcare, insurance
  
Apache Big Data EU 2015 - HBase
Project: Goals
o Availability: Always more, always faster
o Stability and operability
o Scaling up, scaling down
o Up-to-date with “commodity” hardware
o Multi-tenancy
o Diversity of ecosystem
  
STATE OF THE SOFTWARE
Regarding the codebase
  
State of the Software
o Mature codebase
  o 100+ contributors (40+ committers)
  o 1.1M lines of code (each active branch)
  o est. 1200+ human-years’ effort
o Cluster sizes from 10 to 1000+ machines
  o that we know of!
o Runs on HDFS, MapR, Gluster, GPFS, Lustre
o HBase as a Service
  o AWS/EMR, HDInsight, Qubole, Google (sort-of)
  
Software: Releases
  
Software: Semantic Versioning
MAJOR-MINOR-PATCH[-identifier]
o Client/Server wire compatibility
o Server/Server feature compatibility
o API compliance guarantees
o ABI compliance guarantees
http://hbase.apache.org/book.html#hbase.versioning
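The version scheme above can be illustrated with a small parser. This is only a sketch, not HBase code: the class name `VersionSketch` and its methods are hypothetical, and it reads the dotted form used by the release numbers quoted in this deck (for example 1.1.2), with an optional trailing identifier. It reports which component changed between two versions; what each kind of change guarantees is defined by the versioning section of the HBase book, not by this code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parses strings such as "1.1.2" or "1.2.0-SNAPSHOT". */
final class VersionSketch implements Comparable<VersionSketch> {
    private static final Pattern FORMAT =
        Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)(?:-(.+))?");

    final int major;
    final int minor;
    final int patch;
    final String identifier; // null when no "-identifier" suffix is present

    private VersionSketch(int major, int minor, int patch, String identifier) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
        this.identifier = identifier;
    }

    static VersionSketch parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("not MAJOR.MINOR.PATCH: " + text);
        }
        return new VersionSketch(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            m.group(4));
    }

    /** Names the most significant component that differs from the other version. */
    String changeFrom(VersionSketch older) {
        if (major != older.major) return "MAJOR";
        if (minor != older.minor) return "MINOR";
        if (patch != older.patch) return "PATCH";
        return "NONE";
    }

    /** Orders by major, then minor, then patch; the identifier is ignored. */
    @Override
    public int compareTo(VersionSketch other) {
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        return Integer.compare(patch, other.patch);
    }
}
```

For instance, `VersionSketch.parse("1.1.2").changeFrom(VersionSketch.parse("1.0.0"))` reports a MINOR change.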
  
Software: Active Development
o Smaller regions, more regions
  o Less write amplification
  o 1M+ region clusters
o Stability
  o ProcedureV2
  o Assignment improvements/stability
o Backup, restore tools
  o Built on snapshots, easier operations
  
Software: Active Development
o Adaption: Workloads
  o HBase as Medium Object Store (MOB)
o Tunable Availability
  o Region replicas
  o TIMELINE consistency
o Coprocessor API stability
o Less GC, more RAM (off-heap)
  
Software: Active Development
o Multi-tenancy
  o Table groups
  o Quotas
  o Priorities
o Improved machine utilization
  o More RAM (100s of GB)
  o IOPS
  o Better concurrency
  
STATE OF THE ECOSYSTEM
The whole enchilada
  
State of the Ecosystem
o OpenTSDB
o Transaction Managers
  o Themis, Tephra, Omid2, LeanXcale
o Graph engines
  o Titan, Giraph, Zen, S2Graph
o Myriad SQLs
o Other Hadoop components
o Google Cloud Bigtable
  
Ecosystem: SQL
  
Ecosystem: Hadoop Components
o YARN-2928 Application Timeline Service
o HIVE-9452 HBase to store Hive metadata
o AMBARI-5707 Ambari Metrics System
  
LATEST RELEASES
Come and get it!
  
Release: 0.94
o Last (final?) release: 0.94.27, 2015-03-26
o “ancient history”
o No new deployments
o Existing users highly encouraged to upgrade
o Requires downtime to upgrade
😫 😡 (╯°□°)╯︵ ┻━┻
  
Release: 0.98
o Last release: 0.98.14, 2015-08-31
o “legacy”
o Most production deploys (probably)
o Largest production clusters (probably)
o New features back-ported when possible
  
Release 1.x
o Last release: 1.1.2, 2015-09-01
o “stable”
o Production deploys moving here
o Active development
o Rolling upgrade from 0.98.x
😄 😍 ヽ(´ー`)ノ
  
Release 1.0
o Released 1.0.0, 2015-02-24
o Adopting semantic versioning
  o Patch releases don’t quite follow spec yet
o Client / Server API cleanup
  o Interfaces, builder pattern, @InterfaceAudience
o Region Replicas
  o Trade Consistency, resources for Availability
github.com/ndimiduk/hbase-1.0-api-examples
  
Region Replicas
o Multiple Region Servers host each region
  o Primary + N read replicas (usually N=2)
  o Primary is authority on reads and writes
  o Replicas tail replicated edits, offer TIMELINE view
o Client’s choice
  o Read primary only for “classic” strong consistency
  o Fan-out reads for faster, potentially TIMELINE results
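The trade-off on this slide can be made concrete with a toy model. The sketch below is not the HBase client API: `ReplicatedRegionSketch`, `ReadResult`, `shipEdits` and `setPrimaryReachable` are hypothetical names, and replication lag is simulated by an explicit call. Only the two consistency names, STRONG and TIMELINE, mirror the HBase client's `Consistency` enum. It shows a strong read refusing to answer without the primary, while a timeline read falls back to a replica and flags the answer as possibly stale.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one region with a primary and N read replicas. */
final class ReplicatedRegionSketch {
    enum Consistency { STRONG, TIMELINE }

    /** A value plus a flag telling the caller whether it may be out of date. */
    static final class ReadResult {
        final String value;
        final boolean stale;

        ReadResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    private String primaryValue;
    private final List<String> replicaValues = new ArrayList<>();
    private boolean primaryReachable = true;

    ReplicatedRegionSketch(int replicas) {
        for (int i = 0; i < replicas; i++) {
            replicaValues.add(null);
        }
    }

    /** Writes go to the primary only; replicas see them after shipEdits(). */
    void put(String value) {
        if (!primaryReachable) {
            throw new IllegalStateException("primary unavailable");
        }
        primaryValue = value;
    }

    /** Simulates the replicas catching up with the primary's edits. */
    void shipEdits() {
        for (int i = 0; i < replicaValues.size(); i++) {
            replicaValues.set(i, primaryValue);
        }
    }

    void setPrimaryReachable(boolean reachable) {
        primaryReachable = reachable;
    }

    ReadResult get(Consistency consistency) {
        if (primaryReachable) {
            return new ReadResult(primaryValue, false);
        }
        if (consistency == Consistency.STRONG || replicaValues.isEmpty()) {
            throw new IllegalStateException("primary unavailable");
        }
        // TIMELINE: answer from a replica and warn that the value may lag.
        return new ReadResult(replicaValues.get(0), true);
    }
}
```

The design point is that the choice is made per read by the client, so one application can mix strongly consistent and availability-first reads against the same table.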
  
Release 1.1
o Release 1.1.0, 2015-05-15
o Async RPC client
o Scanner improvements
  o RPC chunking, heartbeat messages, API
o RPC throttling
  o Quotas per user, table, namespace
o Compaction throttling, monitoring
o ProcedureV2
  o Improved operational reliability
  
ProcedureV2
o Distributed, fault-tolerant operations
  o Multiple steps on multiple machines
  o Roll-back in case of failure
o Coordination of long-running procedures
  o Compactions, splits, &c.
o Progress tracking
  o Notifications across multiple machines
  o Current status inquiries
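The multi-step, roll-back-on-failure idea can be sketched in a few lines. This is a single-process illustration under loud assumptions: `ProcedureSketch`, `Step` and `step(...)` are hypothetical names, and the sketch leaves out everything that makes ProcedureV2 distributed and fault-tolerant (state that survives a crash, coordination across machines). It only shows steps running in order, completed steps being undone in reverse order when one fails, and a progress query.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** In-memory sketch of a multi-step operation with roll-back. */
final class ProcedureSketch {
    interface Step {
        String name();
        void execute();
        void rollback();
    }

    /** Convenience factory so steps can be written as lambdas. */
    static Step step(String name, Runnable action, Runnable undo) {
        return new Step() {
            @Override public String name() { return name; }
            @Override public void execute() { action.run(); }
            @Override public void rollback() { undo.run(); }
        };
    }

    private final List<Step> steps;
    private final List<String> log = new ArrayList<>();
    private int completed = 0;

    ProcedureSketch(List<Step> steps) {
        this.steps = new ArrayList<>(steps);
    }

    /** Runs every step in order; returns false after rolling back on failure. */
    boolean run() {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                done.push(step);
                completed++;
                log.add("done " + step.name());
            } catch (RuntimeException e) {
                log.add("failed " + step.name());
                // Undo completed steps, most recent first.
                while (!done.isEmpty()) {
                    Step undo = done.pop();
                    undo.rollback();
                    completed--;
                    log.add("rolled back " + undo.name());
                }
                return false;
            }
        }
        return true;
    }

    /** Answers a "current status" inquiry as completed/total steps. */
    String progress() {
        return completed + "/" + steps.size();
    }

    List<String> log() {
        return log;
    }
}
```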
  
Branch-1.2
o Next up in 1.x line
o Java 8 support
o Native checksums
o SyncTable
o Flush-per-store
o ProcV2 all the things!
o (More) Compaction improvements
o Region normalizer
  
Region Normalizer
o Anti-entropy for region size
o Converge towards uniform size
o Complements balancer working toward uniform distribution
o Managed by Master, runs in the background (like balancer)
o Pluggable normalization strategies (“simple” default)
o Use-cases
  o Merge away regions from expired timeseries data
  o Smooth uneven bulk loads
  o Correct operator initial split guesses
  o Ease upgrades from ancient versions (0.92/1g vs. today/20g)
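One way to picture "converge towards uniform size" is a planner that compares each region with the table's average. The sketch below is an illustration only: `NormalizerSketch` is a hypothetical name, and both thresholds (split when a region exceeds twice the average, merge two neighbours when their combined size is below the average) are assumptions chosen for the example, not a statement of what the default strategy does.

```java
import java.util.List;

/** Toy planner: proposes one split or merge to even out region sizes. */
final class NormalizerSketch {
    /**
     * Takes region sizes in MB, listed in key order, and returns
     * "SPLIT i", "MERGE i+j" or "NONE".
     */
    static String plan(List<Long> regionSizesMb) {
        if (regionSizesMb.isEmpty()) {
            return "NONE";
        }
        double total = 0;
        for (long size : regionSizesMb) {
            total += size;
        }
        double average = total / regionSizesMb.size();

        // Split candidate: the largest region, if above 2x average (assumed).
        int largest = 0;
        for (int i = 1; i < regionSizesMb.size(); i++) {
            if (regionSizesMb.get(i) > regionSizesMb.get(largest)) {
                largest = i;
            }
        }
        if (regionSizesMb.get(largest) > 2 * average) {
            return "SPLIT " + largest;
        }

        // Merge candidate: the adjacent pair with the smallest combined size,
        // if that combined size is below the average (assumed).
        int best = -1;
        long bestSum = Long.MAX_VALUE;
        for (int i = 0; i + 1 < regionSizesMb.size(); i++) {
            long sum = regionSizesMb.get(i) + regionSizesMb.get(i + 1);
            if (sum < bestSum) {
                bestSum = sum;
                best = i;
            }
        }
        if (best >= 0 && bestSum < average) {
            return "MERGE " + best + "+" + (best + 1);
        }
        return "NONE";
    }
}
```

Only adjacent regions are considered for merging because regions cover contiguous key ranges; run repeatedly in the background, a planner like this nudges sizes toward the average one step at a time.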
  
Thanks!
@HBase
http://hbase.apache.org
2015-09-28
Nick Dimiduk (@xefyr)
http://n10k.com
#apachebigdata
  
BONUS CONTENT!
Ask and you shall receive
  
Agenda
o Replication
o Filters
o Coprocessors
  
Replication
o Keep data synchronized between clusters
o Supports multiple destinations
o Cyclical graphs supported
o Configurable at Column Family granularity
o Uses WAL shipping to propagate data
o Replication state, status stored in ZooKeeper
o General purpose interface for asynchronously shipping edits from a cluster
  o Other HBase clusters, Region Replicas, SOLR/ElasticSearch
hbase.apache.org/book.html#_cluster_replication
  
Filters
o Additional filtering applied to reads
  o Use in conjunction with specifying start, end rows, &c.
o Run on the Region Servers
  o Included in GET, SCAN request
o Explicitly exclude data based on criteria
  o e.g., value >= 10
o Implicitly exclude data by hinting seeks
  o INCLUDE_AND_NEXT_COL, NEXT_ROW, SEEK_NEXT_USING_HINT
o Operate on data read from BlockCache
  
Filters
o 30+ Filters included in distribution
o Mini-language for use in Thrift, REST
  o "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"
  o hbase.apache.org/book.html#thrift.filter_language
o Simple interface, Implement your own!
  
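import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.filter.FilterBase;

public class PageFilter extends FilterBase {
  // Fields implied by the slide: the page limit and a running row count.
  private long pageSize;
  private long rowsAccepted = 0;

  public PageFilter(long pageSize) {…}

  // Never rejects a row on its key alone.
  public boolean filterRowKey(Cell c) {
    return false;
  }
  // Includes every cell of a row that has not been rejected.
  public ReturnCode filterKeyValue(Cell c) {
    return ReturnCode.INCLUDE;
  }
  // Ends the scan once a full page of rows has been accepted.
  public boolean filterAllRemaining() {
    return this.rowsAccepted >= this.pageSize;
  }
  // Called once per row; returning true drops the row.
  public boolean filterRow() {
    this.rowsAccepted++;
    return this.rowsAccepted > this.pageSize;
  }
}
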
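How the two stateful callbacks of the PageFilter above cooperate is easier to see with a toy scan loop. The sketch below uses no HBase types: `PagingScanSketch`, `PageLimit` and `scan` are hypothetical names, and the call order (check `filterAllRemaining()` before a row, then ask `filterRow()` whether to drop it) is a simplification of what a region server does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy scan loop showing how a paging filter stops a scan early. */
final class PagingScanSketch {
    /** Same counting logic as the PageFilter shown above. */
    static final class PageLimit {
        private final long pageSize;
        private long rowsAccepted = 0;

        PageLimit(long pageSize) {
            this.pageSize = pageSize;
        }

        /** True once a full page has been accepted: the scan can stop. */
        boolean filterAllRemaining() {
            return rowsAccepted >= pageSize;
        }

        /** Counts the row; true means "drop this row". */
        boolean filterRow() {
            rowsAccepted++;
            return rowsAccepted > pageSize;
        }
    }

    /** Returns the row keys that survive the filter, in scan order. */
    static List<String> scan(List<String> rowKeys, PageLimit filter) {
        List<String> returned = new ArrayList<>();
        for (String row : rowKeys) {
            if (filter.filterAllRemaining()) {
                break; // page is full, skip the rest of the region
            }
            if (!filter.filterRow()) {
                returned.add(row);
            }
        }
        return returned;
    }
}
```

With a page size of 2 and five rows, the loop returns the first two rows and never inspects the last two, which is the point of `filterAllRemaining()`: it saves work rather than just discarding results.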
Coprocessors
o Extension points for HBase
  o Think Linux Kernel Module, not Stored Procedure
  o e.g., customize compactions, Table constraints
o Observers
  o pre- and post-execution logic
  o e.g., MasterObserver#preTruncateTable, RegionObserver#postScannerNext
o Endpoints
  o Cluster RPC extensions
  o e.g., RowCountEndpoint, BulkDeleteEndpoint
  
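public class RowCountEndpoint implements
    ExampleProtos.RowCountService {
  public void getRowCount(…) {
    // An unrestricted Scan covers every row of the region hosting this endpoint.
    Scan scan = new Scan();
    InternalScanner scanner =
        env.getRegion().getScanner(scan);
    …
    // Slide shorthand: one count per call to next(). In the HBase API,
    // InternalScanner#next takes a List<Cell> to fill; it is elided here.
    do {
      count++;
    } while (scanner.next());
    // return count
  }
}
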
Thanks!
@HBase
http://hbase.apache.org
2015-09-28
Nick Dimiduk (@xefyr)
http://n10k.com
#apachebigdata
  
