phillydb hbase and mapr m7 - march 2013


DESCRIPTION

http://www.meetup.com/PhillyDB/events/104465902 This presentation will provide an overview of HBase as well as MapR's M7 NoSQL database. We will begin with a discussion of the basic HBase architecture and the problems it solves. We will then discuss how MapR's M7, like M5, adds innovative features that provide tangible advantages to HBase users while maintaining API compatibility.

TRANSCRIPT

Page 1: PhillyDB Hbase and MapR M7 - March 2013

1  ©MapR  Technologies    

HBase and M7 Technical Overview
Keys Botzum, Senior Principal Technologist, MapR Technologies

March  2013  

Page 2: PhillyDB Hbase and MapR M7 - March 2013

2  ©MapR  Technologies    

HBase  MapR  M7  Containers    

Agenda  

Page 3: PhillyDB Hbase and MapR M7 - March 2013

3  ©MapR  Technologies    

 

HBase: a sparse, distributed, persistent, indexed, and sorted map, OR

a NoSQL database, OR

a columnar data store

Page 4: PhillyDB Hbase and MapR M7 - March 2013

4  ©MapR  Technologies    

Key-Value Store

§ Row key – Binary sortable value

§ Row content key (analogous to a column) – Column family (string) – Column qualifier (binary) – Version/timestamp (number)

§ A row key, column family, column qualifier, and version uniquely identify a particular cell – A cell contains a single binary value (see the sketch below)
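To make the cell coordinates concrete, here is a minimal sketch with the HBase Java client (the table name "mytable", row key, and qualifier are made up for illustration; addColumn and ConnectionFactory are the newer client API, the 0.94-era client of this talk used Put.add and HTable instead):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutCellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            byte[] rowKey  = Bytes.toBytes("user#42");        // row key: binary, sortable
            byte[] family  = Bytes.toBytes("info");           // column family (string)
            byte[] qual    = Bytes.toBytes("email");          // column qualifier (binary)
            long   version = System.currentTimeMillis();      // version: defaults to a timestamp

            Put put = new Put(rowKey);
            // (row key, family, qualifier, version) identifies exactly one cell,
            // and the cell holds a single binary value.
            put.addColumn(family, qual, version, Bytes.toBytes("dave@example.com"));
            table.put(put);
        }
    }
}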

Page 5: PhillyDB Hbase and MapR M7 - March 2013

5  ©MapR  Technologies    

A  Row    

[Diagram: a row shown as a row key with values Value1 … ValueN in columns C0 … CN; each value is addressed as (Row Key, Column Family, Column Qualifier, Version) → Value.]

Page 6: PhillyDB Hbase and MapR M7 - March 2013

6  ©MapR  Technologies    

§ Weakly typed and schema-less (unstructured or perhaps semi-structured) – Almost everything is binary

§ No constraints – You can put any binary value in any cell – You can even put incompatible types in two different instances of the same column family:column qualifier

§ Columns (qualifiers) are created implicitly

§ Different rows can have different columns

§ No transactions/no ACID

– The only unit of atomic operation is a single row

Not a Traditional RDBMS

Page 7: PhillyDB Hbase and MapR M7 - March 2013

7  ©MapR  Technologies    

§ APIs for querying (get), scanning, and updating (put) – Operate on row key, column family, qualifier, version, and values – Can partially specify and will retrieve the union of results

• If you specify just the row key, you will get all values for it (with column family and qualifier) – By default only the largest version (most recent, if a timestamp) is returned

• Specifying a row key and column family will retrieve all values for that row and column family

– Scanning is just a get over a range of row keys

§ Version – While it defaults to a timestamp, any integer is acceptable

API  
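A minimal sketch of the get/scan calls above, again with the Java client and a hypothetical table; withStartRow/withStopRow and readVersions are the newer method names, older clients use setStartRow/setStopRow and setMaxVersions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {

            // Row key only: returns all column families/qualifiers for the row,
            // newest version of each cell.
            Result whole = table.get(new Get(Bytes.toBytes("user#42")));

            // Row key + column family: returns all values for that row and family;
            // ask for up to 3 versions instead of just the newest.
            Get get = new Get(Bytes.toBytes("user#42"));
            get.addFamily(Bytes.toBytes("info"));
            get.readVersions(3);
            Result info = table.get(get);

            // A scan is just a get over a range of row keys, in sorted order.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user#"))
                    .withStopRow(Bytes.toBytes("user$"));
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result row : rs) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}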

Page 8: PhillyDB Hbase and MapR M7 - March 2013

8  ©MapR  Technologies    

§ Rather than storing table rows linearly on disk, with each row stored as a single byte range with fixed-size fields, store the columns of a row separately – Very efficient storage for sparse data sets (NULL is free) – Compression works better on similar data – Fetches of only subsets of a row are very efficient (less disk IO) – No fixed size on column values – No requirement to even define columns

§ Columns are grouped together into column families – Basically a file on disk – A unit of optimization – In HBase, adding a column is implicit; adding a column family is explicit

Columnar  
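Since column families are explicit while individual columns are not, creating a table means declaring its families up front. A sketch with the Java admin API (class names are from the HBase 1.x-era client; the table and family names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
            // Each column family is stored (and tuned) separately on disk.
            desc.addFamily(new HColumnDescriptor("info"));
            desc.addFamily(new HColumnDescriptor("activity"));
            admin.createTable(desc);
            // Qualifiers such as info:email never need to be declared; they are
            // created implicitly by the first Put that writes them.
        }
    }
}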

Page 9: PhillyDB Hbase and MapR M7 - March 2013

9  ©MapR  Technologies    

HBase Table Architecture

§ Tables are divided into key ranges (regions)
§ Regions are served by nodes (RegionServers)
§ Columns are divided into access groups (column families)

[Diagram: a table grid with column families CF1–CF5 across the top and regions R1–R4 down the side.]

Page 10: PhillyDB Hbase and MapR M7 - March 2013

10  ©MapR  Technologies    

§ Data is stored in sorted order – A table contains rows – A sequence of rows is grouped together into a region

• A region consists of various files related to those rows and is loaded into a region server

• Regions are stored in HDFS for high availability – A single region server manages multiple regions

• Region assignment can change – load balancing, failures, etc.

§ Clients connect to tables – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server

§ At any given time exactly one region server provides access to a region – The HBase Master (with ZooKeeper) manages that assignment

Storage  Model  Highlights  

Page 11: PhillyDB Hbase and MapR M7 - March 2013

11  ©MapR  Technologies    

§  Very  scalable  §  Easy  to  add  region  servers  §  Easy  to  move  regions  around  §  Scans  are  efficient  

–  Unlike  hashing  based  models  

§  Access  via  row  key  is  very  efficient  –  Note:  there  are  no  secondary  indexes  

§  No  schema,  can  store  whatever  you  want  when  you  want  §  Strong  consistency  

§ Integrated with Hadoop – Map-reduce on HBase is straightforward – HDFS/MapR-FS provides data replication

What’s  Great  About  This?  

Page 12: PhillyDB Hbase and MapR M7 - March 2013

12  ©MapR  Technologies    

§ Data from a region column family is stored in an HFile – An HFile contains row key:column qualifier:version:value entries – Index at the end into the data – 64KB "blocks" by default

§ Update – New value is written persistently to the Write Ahead Log (WAL) – Cached in memory – When memory fills, write out a new HFile

§ Read – Checks in memory, then all of the HFiles – Read data is cached in memory

§ Delete – Create a tombstone record (purged at major compaction)

 

Data  Storage  Architecture  
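The update/read/delete path above is a log-structured merge design. The toy sketch below (plain Java, not HBase or MapR code) shows its shape: writes go to a WAL and an in-memory sorted map, flushes produce immutable sorted "HFiles", and reads must consult memory plus every file, newest first.

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the write/read path described above; real HFiles are indexed,
// block-compressed files on HDFS, not in-memory maps.
public class LsmSketch {
    private static final String TOMBSTONE = "__TOMBSTONE__";
    private static final int FLUSH_THRESHOLD = 4;

    private final List<String> wal = new ArrayList<>();                     // stand-in for the WAL
    private final NavigableMap<String, String> memstore = new TreeMap<>();  // in-memory, sorted
    private final List<NavigableMap<String, String>> hfiles = new ArrayList<>(); // newest first

    public void put(String key, String value) {
        wal.add(key + "=" + value);       // 1. persist the update to the Write Ahead Log
        memstore.put(key, value);         // 2. cache it in memory
        if (memstore.size() >= FLUSH_THRESHOLD) {
            hfiles.add(0, new TreeMap<>(memstore)); // 3. memory full: write out a new "HFile"
            memstore.clear();
        }
    }

    public void delete(String key) {
        put(key, TOMBSTONE);              // deletes are tombstones, purged at major compaction
    }

    public String get(String key) {
        String v = memstore.get(key);     // check memory first...
        if (v == null) {
            for (NavigableMap<String, String> f : hfiles) {  // ...then every file, newest wins
                v = f.get(key);
                if (v != null) break;
            }
        }
        return TOMBSTONE.equals(v) ? null : v;
    }

    public static void main(String[] args) {
        LsmSketch t = new LsmSketch();
        t.put("row1", "a");
        t.put("row2", "b");
        t.delete("row2");
        System.out.println(t.get("row1")); // a
        System.out.println(t.get("row2")); // null
    }
}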

Page 13: PhillyDB Hbase and MapR M7 - March 2013

13  ©MapR  Technologies    

Apache  HBase  HFile  Structure  

64KB blocks are compressed

An index into the compressed blocks is created as a B-tree

Key-value pairs are laid out in increasing order

Each cell is an individual key + value – a row repeats the key for each column

Page 14: PhillyDB Hbase and MapR M7 - March 2013

14  ©MapR  Technologies    

HBase Region Operation

§ Typical region size is a few GB, sometimes even 10G or 20G

§ The RegionServer holds data in memory until full, then writes a new HFile

– Logical view of the database is constructed by layering these files, with the latest on top

 

[Diagram: a stack of HFiles covering the key range represented by this region, newest on top, oldest at the bottom.]

Page 15: PhillyDB Hbase and MapR M7 - March 2013

15  ©MapR  Technologies    

HBase Read Amplification

§ When a get/scan comes in, all the files have to be examined – schema-less, so where is the column? – Done in-memory and does not change what's on disk

• Bloom filters do not help in scans


With 7 files, a 1K-record get() potentially takes about 30 seeks, 7 block fetches, and decompressions from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.

Page 16: PhillyDB Hbase and MapR M7 - March 2013

16  ©MapR  Technologies    

HBase Write Amplification

§ To reduce the read amplification, HBase merges the HFiles periodically – a process called compaction – runs automatically when there are too many files – usually turned off due to I/O storms which interfere with client access – and kicked off manually on weekends

Major compaction reads all files and merges them into a single HFile

Page 17: PhillyDB Hbase and MapR M7 - March 2013

17  ©MapR  Technologies    

HBase Server Architecture

[Diagram: a client coordinates lookups through ZooKeeper and the HBase Master, then reads and writes data through an HBase RegionServer; the RegionServer stores its HFiles and WAL in HDFS, which sits on the Linux filesystem.]

Page 18: PhillyDB Hbase and MapR M7 - March 2013

18  ©MapR  Technologies    

§ A persistent record of every update/insert in sequence order – Shared by all regions on one region server – WAL files are periodically rolled to limit size, but older WALs are still needed – A WAL file is no longer needed once every region with updates in that WAL file has flushed those from memory to an HFile • Remember that more HFiles slow the read path!

§ Must be replayed as part of the recovery process since in-memory updates are "lost" – This is very expensive and delays bringing a region back online

WAL  File  

Page 19: PhillyDB Hbase and MapR M7 - March 2013

19  ©MapR  Technologies    

What’s  Not  So  Good  

Reliability
• Complex coordination between ZK, HDFS, HBase Master, and RegionServer during region movement
• Compactions disrupt operations
• Very slow crash recovery because of
  • Coordination complexity
  • WAL log reading (one log/server)

Business continuity
• Many administrative actions require downtime
• Not well integrated into MapR-FS mirroring and snapshot functionality

Page 20: PhillyDB Hbase and MapR M7 - March 2013

20  ©MapR  Technologies    

What’s  Not  So  Good  

Performance
• Very long read/write path
• Significant read and write amplification
• Multiple JVMs in the read/write path – GC delays!

Manageability
• Compactions, splits, and merges must be done manually (in reality)

• Lots of "well known" problems maintaining a reliable cluster – splitting, compactions, region assignment, etc.

• Practical limits on the number of regions per region server and the size of regions – can make it hard to fully utilize hardware

Page 21: PhillyDB Hbase and MapR M7 - March 2013

21  ©MapR  Technologies    

Region  Assignment  in  Apache  HBase  

Page 22: PhillyDB Hbase and MapR M7 - March 2013

22  ©MapR  Technologies    

Apache  HBase  on  MapR  

Limited data management, data protection and disaster recovery for tables.

Page 23: PhillyDB Hbase and MapR M7 - March 2013

23  ©MapR  Technologies    

HBase  MapR  M7  Containers    

Agenda  

Page 24: PhillyDB Hbase and MapR M7 - March 2013

24  ©MapR  Technologies    

MapR: A provider of enterprise-grade Hadoop with uniquely differentiated features

Page 25: PhillyDB Hbase and MapR M7 - March 2013

25  ©MapR  Technologies    

MapR: The Enterprise-Grade Distribution

Page 26: PhillyDB Hbase and MapR M7 - March 2013

26  ©MapR  Technologies    

One Platform for Big Data

[Diagram: one platform for a broad range of applications – MapReduce, file-based applications, SQL, database, search, and stream processing; batch, interactive, and real-time – with 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, and multi-tenancy. Example applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting.]

Page 27: PhillyDB Hbase and MapR M7 - March 2013

27  ©MapR  Technologies    

Dependable:  Lights  Out  Data  Center  Ready  

§  Automated  stateful  failover  

§ Automated re-replication

§ Self-healing from HW and SW failures

§  Load  balancing  

§  No  lost  jobs  or  data  

§ Five 9's (99.999%) of uptime

Reliable  Compute   Dependable  Storage  

§ Business continuity with snapshots and mirrors

§ Recover to a point in time
§ End-to-end checksumming
§ Strong consistency
§ Data safe
§ Mirror across sites to meet Recovery Time Objectives

Page 28: PhillyDB Hbase and MapR M7 - March 2013

28  ©MapR  Technologies    

Benchmark | MapR 2.1.1 | CDH 4.1.1 | MapR Speed Increase

Terasort (1x replication, compression disabled)
  Total  | 13m 35s | 26m 6s  | 2X
  Map    | 7m 58s  | 21m 8s  | 3X
  Reduce | 13m 32s | 23m 37s | 1.8X

DFSIO throughput/node
  Read  | 1003 MB/s | 656 MB/s | 1.5X
  Write | 924 MB/s  | 654 MB/s | 1.4X

YCSB (50% read, 50% update)
  Throughput | 36,584.4 op/s | 12,500.5 op/s | 2.9X
  Runtime    | 3.80 hr       | 11.11 hr      | 2.9X

YCSB (95% read, 5% update)
  Throughput | 24,704.3 op/s | 10,776.4 op/s | 2.3X
  Runtime    | 0.56 hr       | 1.29 hr       | 2.3X

Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

MinuteSort Record: 1.5 TB in 60 seconds on 2103 nodes

Fast:  World  Record  Performance  

Page 29: PhillyDB Hbase and MapR M7 - March 2013

29  ©MapR  Technologies    

The  Cloud  Leaders  Pick  MapR  

Google chose MapR to provide Hadoop on Google Compute Engine

Amazon EMR is the largest Hadoop provider in revenue and # of clusters

Page 30: PhillyDB Hbase and MapR M7 - March 2013

30  ©MapR  Technologies    

MapR  Supports  Broad  Set  of  Customers  

§ Log analysis
§ HBase
§ Customer targeting
§ Social media analysis
§ Customer revenue analytics
§ ETL offload
§ Advertising exchange analysis and optimization
§ Clickstream analysis
§ Quality profiling/field failure analysis
§ Enterprise-grade platform
§ COOP features
§ Monitoring and measuring online behavior
§ Fraud detection
§ Channel analytics
§ Recommendation engine
§ Fraud detection and prevention
§ Customer behavior analysis
§ Brand monitoring
§ Customer targeting
§ Viewer behavioral analytics
§ Recommendation engine
§ Family tree connections
§ Intrusion detection & prevention
§ Forensic analysis
§ Global threat analytics
§ Virus analysis
§ Patient care monitoring

(Customers include a leading retailer and a global credit card issuer.)

Page 31: PhillyDB Hbase and MapR M7 - March 2013

31  ©MapR  Technologies    

MapR Editions

§ Control System § NFS Access § Performance § High Availability § Snapshots & Mirroring § 24 x 7 Support § Annual Subscription

§  Control  System  §  NFS  Access  §  Performance  §  Unlimited  Nodes  §  Free    

Also available through: Google Compute Engine

§ All the features of M5 § Simplified administration for HBase § Increased performance § Consistent low latency § Unified snapshots, mirroring

Page 32: PhillyDB Hbase and MapR M7 - March 2013

32  ©MapR  Technologies    

Hbase  MapR  M7  Containers    

Agenda  

Page 33: PhillyDB Hbase and MapR M7 - March 2013

33  ©MapR  Technologies    

M7: An integrated system for unstructured and structured data

Page 34: PhillyDB Hbase and MapR M7 - March 2013

34  ©MapR  Technologies    

Introducing  MapR  M7  

§ An integrated system – Unified namespace for files and tables – Built-in data management & protection – No extra administration

§ Architected for reliability and performance – Fewer layers – Single hop to data – No compactions, low I/O amplification – Seamless splits, automatic merges – Instant recovery

Page 35: PhillyDB Hbase and MapR M7 - March 2013

35  ©MapR  Technologies    

Binary Compatible with HBase APIs

§ HBase applications work "as is" with M7 – No need to recompile (binary compatible)

§ Can run M7 and HBase side-by-side on the same cluster – e.g., during a migration – can access both an M7 table and an HBase table in the same program

§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:

% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable

Page 36: PhillyDB Hbase and MapR M7 - March 2013

36  ©MapR  Technologies    

M7:    Remove  Layers,  Simplify  

MapR      M7  

Take  note!  No  JVM!  

Page 37: PhillyDB Hbase and MapR M7 - March 2013

37  ©MapR  Technologies    

M7:    No  Master  and  No  RegionServers  

No  extra  daemons  to  manage  

One  hop  to  data   Unified  cache  

No  JVM  problems  

Page 38: PhillyDB Hbase and MapR M7 - March 2013

38  ©MapR  Technologies    

Region Assignment in Apache HBase: none of this complexity is present in MapR M7

Page 39: PhillyDB Hbase and MapR M7 - March 2013

39  ©MapR  Technologies    

Unified  Namespace  for  Files  and  Tables  

$ pwd
/mapr/default/user/dave
$ ls
file1  file2  table1  table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1  file2  table1  table2  table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr   16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr   22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:38 /user/dave/table3

Page 40: PhillyDB Hbase and MapR M7 - March 2013

40  ©MapR  Technologies    

Tables  for  End  Users  

§  Users  can  create  and  manage  their  own  tables  –  Unlimited  #  of  tables  

 §  Tables  can  be  created  in  any  directory  

–  Tables  count  towards  volume  and  user  quotas  

§ No admin intervention needed – I can create a file or a directory without opening a ticket with the admin team, why not a table?

– Do stuff on the fly, no need to stop/restart servers

§ Automatic data protection and disaster recovery – Users can recover from snapshots/mirrors on their own

Page 41: PhillyDB Hbase and MapR M7 - March 2013

41  ©MapR  Technologies    

M7  –  An  Integrated  System  

Page 42: PhillyDB Hbase and MapR M7 - March 2013

42  ©MapR  Technologies    

M7 Comparative Analysis with Apache HBase, LevelDB, and a BTree

Page 43: PhillyDB Hbase and MapR M7 - March 2013

43  ©MapR  Technologies    

HBase Write Amplification Analysis

§ Assume 10G per region, write 10% per day, grow 10% per week – 1G of writes – after 7 days, 7 files of 1G and 1 file of 10G (only 1G is growth)

§ IO cost – Wrote 7G to the WAL + 7G to HFiles – Compaction adds still more

• read: 17G (= 7 x 1G + 1 x 10G) • write: 11G written to the new HFile

– WAF – wrote 7G "for real" but the actual disk IO after compaction is read 17G + write 25G, and that's assuming no application reads!

§ IO cost of 1000 regions is similar to the above – read 17T, write 25T → major impact on the node

§ Best practice: limit # of regions/node → can't fully utilize storage

Page 44: PhillyDB Hbase and MapR M7 - March 2013

44  ©MapR  Technologies    

Alternative: LevelDB

§ Tiered, logarithmic increase – L1: 2 x 1M files – L2: 10 x 1M – L3: 100 x 1M – L4: 1,000 x 1M, etc.

§ Compaction overhead – avoids IO storms (I/O done in smaller increments of ~10M) – but significantly more bandwidth compared to HBase

§ Read overhead is still high – 10-15 seeks, perhaps more if the lowest level is very large – 40K-60K read from disk to retrieve a 1K record

Page 45: PhillyDB Hbase and MapR M7 - March 2013

45  ©MapR  Technologies    

BTree analysis

§ Read finds data directly, proven to be fastest

– interior nodes only hold keys – very large branching factor – values only at leaves – thus index caches work – R = logN seeks, if no caching – a 1K record read will transfer about logN blocks from disk

§ Writes are slow on inserts – inserted into the correct place right away – otherwise a read will not find it – requires the btree to be continuously rebalanced – causes extreme random I/O in the insert path – W = 2.5x + logN seeks if no caching

Page 46: PhillyDB Hbase and MapR M7 - March 2013

46  ©MapR  Technologies    

Log-Structured Merge Trees

§ LSM trees reduce insert cost by deferring and batching index changes – If you don't compact often, read performance is impacted – If you compact too often, write performance is impacted

§ B-Trees are great for reads – but expensive to update in real time

[Diagram: writes go to a log and an in-memory index; the on-disk index is updated later; reads consult both the in-memory and on-disk structures.]

Can we combine both ideas? Writes cannot be done better than W = 2.5x

(write to log + write data to somewhere + update meta-data)

Page 47: PhillyDB Hbase and MapR M7 - March 2013

47  ©MapR  Technologies    

M7 from MapR

§ Twisting BTrees – leaves are variable size (8K - 8M or larger) – can stay unbalanced for long periods of time

• more inserts will balance it eventually • automatically throttles updates to interior BTree nodes

– M7 inserts "close to" where the data is supposed to go

§ Reads – Uses the BTree structure to get "close" very fast

• very high branching with key-prefix-compression – Utilizes a separate lower-level index to find it exactly

• updated "in-place" bloom filters for gets, range maps for scans

§  Overhead  –  1K  record  read  will  transfer  about  32K  from  disk  in  logN  seeks  

Page 48: PhillyDB Hbase and MapR M7 - March 2013

48  ©MapR  Technologies    

M7 provides Instant Recovery

§ Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region

§ 0-40 microWALs per region – idle WALs are "compacted", so most are empty – a region is up before all microWALs are recovered – recovers the region in the background in parallel – when a key is accessed, that microWAL is recovered inline – 1000-10000x faster recovery

§ Never perform the equivalent of an HBase major or minor compaction

§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS – No limit to # of files on disk – No limit to # of open files – The I/O path translates random writes to sequential writes on disk

Page 49: PhillyDB Hbase and MapR M7 - March 2013

49  ©MapR  Technologies    

Approach | 1K-record read amplification | Compaction | Recovery

HBase with 7 HFiles | 30 seeks, 130K xfer | IO storms, good bandwidth | Huge WAL to recover

HBase with 3 HFiles | 15 seeks, 70K xfer | IO storms, high bandwidth | Huge WAL to recover

LevelDB with 5 levels | 13 seeks, 48K xfer | No I/O storms, very high bandwidth | WAL is tiny

BTree | logN seeks, logN xfer | No I/O storms, but 100% random | WAL is proportional to concurrency + cache

MapR M7 | logN seeks, 32K xfer | No I/O storms, low bandwidth | microWALs allow recovery < 100ms

Summary  

Page 50: PhillyDB Hbase and MapR M7 - March 2013

50  ©MapR  Technologies    

M7:    Fileservers  Serve  Regions  

§ A region lives entirely inside a container – Does not coordinate through ZooKeeper

§ Containers support distributed transactions – with replication built-in

§ The only coordination in the system is for splits – Between the region-map and the data-container – already solved this problem for files and their chunks

 

Page 51: PhillyDB Hbase and MapR M7 - March 2013

51  ©MapR  Technologies    

Hbase  MapR  M7  Containers    

Agenda  

Page 52: PhillyDB Hbase and MapR M7 - March 2013

52  ©MapR  Technologies    

     

What's  a  MapR  container?  

Page 53: PhillyDB Hbase and MapR M7 - March 2013

53  ©MapR  Technologies    

• Each container contains
  • Directories & files
  • Data blocks
  • BTrees

• 100% random writes

MapR's Containers: Files/directories are sharded into blocks and placed in containers on disks

Containers  are  ~32  GB  segments  of  disk,  placed  on  nodes  

Patent  Pending  

Page 54: PhillyDB Hbase and MapR M7 - March 2013

54  ©MapR  Technologies    

M7  Containers  

§ A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable

§ A container is replicated to servers – unit of resynchronization

§ A region lives entirely inside 1 container – all files + WALs + BTrees + bloom filters + range maps

Page 55: PhillyDB Hbase and MapR M7 - March 2013

55  ©MapR  Technologies    

Read-Write Replication

§ Writes are synchronous – All copies have the same data

§ Data is replicated in a "chain" fashion – better bandwidth, utilizes full-duplex network links well

§ Metadata is replicated in a "star" manner – better response time, bandwidth not a concern

– data can also be done this way


Page 56: PhillyDB Hbase and MapR M7 - March 2013

56  ©MapR  Technologies    

Random Writing in MapR

[Diagram: a client writing data asks the CLDB for a 64M block; the CLDB creates a container, picking a master and 2 replica slaves among servers S1-S5; the client attaches and writes, and the next chunk is written to a different server (e.g., S2), spreading the data across the cluster.]

Page 57: PhillyDB Hbase and MapR M7 - March 2013

57  ©MapR  Technologies    

• As data size increases, writes spread more, like dropping a pebble in a pond

• Larger pebbles spread the ripples farther

• Space is balanced by moving idle containers

Container Balancing
• Servers keep a bunch of containers "ready to go".
• Writes get distributed around the cluster.

Page 58: PhillyDB Hbase and MapR M7 - March 2013

58  ©MapR  Technologies    

• Heartbeat (HB) loss + upstream entity reports failure => server dead

• Increment epoch at CLDB • Rearrange replication path • Exact same code for files and M7 tables • No ZK

Failure  Handling  

Containers are managed at the CLDB (heartbeats, container reports).

Container Location Database (CLDB)

Page 59: PhillyDB Hbase and MapR M7 - March 2013

59  ©MapR  Technologies    

Architectural  Params  

§  Unit  of  I/O  –  4K/8K    (8K  in  MapR)  

§ Unit of Chunking (a map-reduce split) – 10-100's of megabytes

§ Unit of Resync (a replica) – 10-100's of gigabytes – container in MapR

[Diagram: a scale of sizes – unit of I/O, then roughly 10^3x for a map-reduce split, 10^6x for resync, 10^9x for administration – with the HDFS 'block' also marked on this scale.]

§ Unit of Administration (snap, repl, mirror, quota, backup) – 1 gigabyte - 1000's of terabytes – volume in MapR – what data is affected by my missing blocks?

Page 60: PhillyDB Hbase and MapR M7 - March 2013

60  ©MapR  Technologies    

Other  M7  Features  

§  Smaller  disk  footprint  – M7  never  repeats  the  key  or  column  name  

 §  Columnar  layout  

– M7 supports 64 column families – in-memory column families

§  Online  admin  – M7  schema  changes  on  the  fly  – delete/rename/redistribute  tables      

Page 61: PhillyDB Hbase and MapR M7 - March 2013

61  ©MapR  Technologies    

Thank  you!    

Questions?

Page 62: PhillyDB Hbase and MapR M7 - March 2013

62  ©MapR  Technologies    

Examples:  Reliability  Issues  

§ Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)

§ Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)

§ Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

§ No client throttling: The HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)

Page 63: PhillyDB Hbase and MapR M7 - March 2013

63  ©MapR  Technologies    

Examples: Business Continuity Issues

§ No Snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables, so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)

§ Complex Backup Process: backups are complex, unreliable, and inefficient. (http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html)

§ Administration Requires Downtime: The entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication, and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)

Page 64: PhillyDB Hbase and MapR M7 - March 2013

64  ©MapR  Technologies    

Examples:  Performance  Issues  

§ Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)

§ Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from local drives. (HBASE-4755, HBASE-4491)

§ Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)

§ Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)

Page 65: PhillyDB Hbase and MapR M7 - March 2013

65  ©MapR  Technologies    

Examples:  Manageability  Issues  

§ Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)

§ Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

§ Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.

§ Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)