phillydb hbase and mapr m7 - march 2013


DESCRIPTION

http://www.meetup.com/PhillyDB/events/104465902 This presentation will provide an overview of HBase as well as MapR's M7 NoSQL database. We will begin with a discussion of the basic HBase architecture and the problems it solves. We will then discuss how MapR's M7, like M5, adds innovative features that provide tangible advantages to HBase users while maintaining API compatibility.

TRANSCRIPT

Page 1: PhillyDB Hbase and MapR M7 - March 2013

1  ©MapR  Technologies    

HBase and M7 Technical Overview
Keys Botzum, Senior Principal Technologist, MapR Technologies

March  2013  

Page 2: PhillyDB Hbase and MapR M7 - March 2013

2  ©MapR  Technologies    

HBase  MapR  M7  Containers    

Agenda  

Page 3: PhillyDB Hbase and MapR M7 - March 2013

3  ©MapR  Technologies    

 

HBase: a sparse, distributed, persistent, indexed, and sorted map, OR

a NoSQL database, OR

a columnar data store

Page 4: PhillyDB Hbase and MapR M7 - March 2013

4  ©MapR  Technologies    

Key-Value Store

§ Row key – Binary sortable value

§ Row content key (analogous to a column) – Column family (string) – Column qualifier (binary) – Version/timestamp (number)

§ A row key, column family, column qualifier, and version uniquely identify a particular cell – A cell contains a single binary value (see the sketch below)
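To make the cell coordinates concrete, here is a minimal sketch with the HBase Java client (the table name "mytable", row key, and qualifier are made up for illustration; addColumn and ConnectionFactory are the newer client API, the 0.94-era client of this talk used Put.add and HTable instead):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutCellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            byte[] rowKey  = Bytes.toBytes("user#42");        // row key: binary, sortable
            byte[] family  = Bytes.toBytes("info");           // column family (string)
            byte[] qual    = Bytes.toBytes("email");          // column qualifier (binary)
            long   version = System.currentTimeMillis();      // version: defaults to a timestamp

            Put put = new Put(rowKey);
            // (row key, family, qualifier, version) identifies exactly one cell,
            // and the cell holds a single binary value.
            put.addColumn(family, qual, version, Bytes.toBytes("dave@example.com"));
            table.put(put);
        }
    }
}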

Page 5: PhillyDB Hbase and MapR M7 - March 2013

5  ©MapR  Technologies    

A  Row    

[Diagram: a row shown as a row key with values Value1 … ValueN in columns C0 … CN; each value is addressed as (Row Key, Column Family, Column Qualifier, Version) → Value.]

Page 6: PhillyDB Hbase and MapR M7 - March 2013

6  ©MapR  Technologies    

§ Weakly typed and schema-less (unstructured or perhaps semi-structured) – Almost everything is binary

§ No constraints – You can put any binary value in any cell – You can even put incompatible types in two different instances of the same column family:column qualifier

§ Columns (qualifiers) are created implicitly

§ Different rows can have different columns

§ No transactions/no ACID

– The only unit of atomic operation is a single row

Not a Traditional RDBMS

Page 7: PhillyDB Hbase and MapR M7 - March 2013

7  ©MapR  Technologies    

§ APIs for querying (get), scanning, and updating (put) – Operate on row key, column family, qualifier, version, and values – Can partially specify and will retrieve the union of results

• If you specify just the row key, you will get all values for it (with column family and qualifier) – By default only the largest version (most recent, if a timestamp) is returned

• Specifying a row key and column family will retrieve all values for that row and column family

– Scanning is just a get over a range of row keys

§ Version – While it defaults to a timestamp, any integer is acceptable

API  
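A minimal sketch of the get/scan calls above, again with the Java client and a hypothetical table; withStartRow/withStopRow and readVersions are the newer method names, older clients use setStartRow/setStopRow and setMaxVersions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {

            // Row key only: returns all column families/qualifiers for the row,
            // newest version of each cell.
            Result whole = table.get(new Get(Bytes.toBytes("user#42")));

            // Row key + column family: returns all values for that row and family;
            // ask for up to 3 versions instead of just the newest.
            Get get = new Get(Bytes.toBytes("user#42"));
            get.addFamily(Bytes.toBytes("info"));
            get.readVersions(3);
            Result info = table.get(get);

            // A scan is just a get over a range of row keys, in sorted order.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user#"))
                    .withStopRow(Bytes.toBytes("user$"));
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result row : rs) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}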

Page 8: PhillyDB Hbase and MapR M7 - March 2013

8  ©MapR  Technologies    

§ Rather than storing table rows linearly on disk, with each row stored as a single byte range with fixed-size fields, store the columns of a row separately – Very efficient storage for sparse data sets (NULL is free) – Compression works better on similar data – Fetches of only subsets of a row are very efficient (less disk IO) – No fixed size on column values – No requirement to even define columns

§ Columns are grouped together into column families – Basically a file on disk – A unit of optimization – In HBase, adding a column is implicit; adding a column family is explicit

Columnar  
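Since column families are explicit while individual columns are not, creating a table means declaring its families up front. A sketch with the Java admin API (class names are from the HBase 1.x-era client; the table and family names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
            // Each column family is stored (and tuned) separately on disk.
            desc.addFamily(new HColumnDescriptor("info"));
            desc.addFamily(new HColumnDescriptor("activity"));
            admin.createTable(desc);
            // Qualifiers such as info:email never need to be declared; they are
            // created implicitly by the first Put that writes them.
        }
    }
}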

Page 9: PhillyDB Hbase and MapR M7 - March 2013

9  ©MapR  Technologies    

HBase Table Architecture

§ Tables are divided into key ranges (regions)
§ Regions are served by nodes (RegionServers)
§ Columns are divided into access groups (column families)

[Diagram: a table grid with column families CF1–CF5 across the top and regions R1–R4 down the side.]

Page 10: PhillyDB Hbase and MapR M7 - March 2013

10  ©MapR  Technologies    

§ Data is stored in sorted order – A table contains rows – A sequence of rows is grouped together into a region

• A region consists of various files related to those rows and is loaded into a region server

• Regions are stored in HDFS for high availability – A single region server manages multiple regions

• Region assignment can change – load balancing, failures, etc.

§ Clients connect to tables – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server

§ At any given time exactly one region server provides access to a region – The HBase Master (with ZooKeeper) manages that assignment

Storage  Model  Highlights  

Page 11: PhillyDB Hbase and MapR M7 - March 2013

11  ©MapR  Technologies    

§  Very  scalable  §  Easy  to  add  region  servers  §  Easy  to  move  regions  around  §  Scans  are  efficient  

–  Unlike  hashing  based  models  

§  Access  via  row  key  is  very  efficient  –  Note:  there  are  no  secondary  indexes  

§  No  schema,  can  store  whatever  you  want  when  you  want  §  Strong  consistency  

§ Integrated with Hadoop – Map-reduce on HBase is straightforward – HDFS/MapR-FS provides data replication

What’s  Great  About  This?  

Page 12: PhillyDB Hbase and MapR M7 - March 2013

12  ©MapR  Technologies    

§ Data from a region column family is stored in an HFile – An HFile contains row key:column qualifier:version:value entries – Index at the end into the data – 64KB "blocks" by default

§ Update – New value is written persistently to the Write Ahead Log (WAL) – Cached in memory – When memory fills, write out a new HFile

§ Read – Checks in memory, then all of the HFiles – Read data is cached in memory

§ Delete – Create a tombstone record (purged at major compaction)

 

Data  Storage  Architecture  
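The update/read/delete path above is a log-structured merge design. The toy sketch below (plain Java, not HBase or MapR code) shows its shape: writes go to a WAL and an in-memory sorted map, flushes produce immutable sorted "HFiles", and reads must consult memory plus every file, newest first.

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the write/read path described above; real HFiles are indexed,
// block-compressed files on HDFS, not in-memory maps.
public class LsmSketch {
    private static final String TOMBSTONE = "__TOMBSTONE__";
    private static final int FLUSH_THRESHOLD = 4;

    private final List<String> wal = new ArrayList<>();                     // stand-in for the WAL
    private final NavigableMap<String, String> memstore = new TreeMap<>();  // in-memory, sorted
    private final List<NavigableMap<String, String>> hfiles = new ArrayList<>(); // newest first

    public void put(String key, String value) {
        wal.add(key + "=" + value);       // 1. persist the update to the Write Ahead Log
        memstore.put(key, value);         // 2. cache it in memory
        if (memstore.size() >= FLUSH_THRESHOLD) {
            hfiles.add(0, new TreeMap<>(memstore)); // 3. memory full: write out a new "HFile"
            memstore.clear();
        }
    }

    public void delete(String key) {
        put(key, TOMBSTONE);              // deletes are tombstones, purged at major compaction
    }

    public String get(String key) {
        String v = memstore.get(key);     // check memory first...
        if (v == null) {
            for (NavigableMap<String, String> f : hfiles) {  // ...then every file, newest wins
                v = f.get(key);
                if (v != null) break;
            }
        }
        return TOMBSTONE.equals(v) ? null : v;
    }

    public static void main(String[] args) {
        LsmSketch t = new LsmSketch();
        t.put("row1", "a");
        t.put("row2", "b");
        t.delete("row2");
        System.out.println(t.get("row1")); // a
        System.out.println(t.get("row2")); // null
    }
}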

Page 13: PhillyDB Hbase and MapR M7 - March 2013

13  ©MapR  Technologies    

Apache  HBase  HFile  Structure  

64KB blocks are compressed

An index into the compressed blocks is created as a B-tree

Key-value pairs are laid out in increasing order

Each cell is an individual key + value – a row repeats the key for each column

Page 14: PhillyDB Hbase and MapR M7 - March 2013

14  ©MapR  Technologies    

HBase Region Operation

§ Typical region size is a few GB, sometimes even 10G or 20G

§ The RegionServer holds data in memory until full, then writes a new HFile

– Logical view of the database is constructed by layering these files, with the latest on top

 

[Diagram: a stack of HFiles covering the key range represented by this region, newest on top, oldest at the bottom.]

Page 15: PhillyDB Hbase and MapR M7 - March 2013

15  ©MapR  Technologies    

HBase Read Amplification

§ When a get/scan comes in, all the files have to be examined – schema-less, so where is the column? – Done in-memory and does not change what's on disk

• Bloom filters do not help in scans


With 7 files, a 1K-record get() potentially takes about 30 seeks, 7 block fetches, and decompressions from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.

Page 16: PhillyDB Hbase and MapR M7 - March 2013

16  ©MapR  Technologies    

HBase Write Amplification

§ To reduce the read amplification, HBase merges the HFiles periodically – a process called compaction – runs automatically when there are too many files – usually turned off due to I/O storms which interfere with client access – and kicked off manually on weekends

Major compaction reads all files and merges them into a single HFile

Page 17: PhillyDB Hbase and MapR M7 - March 2013

17  ©MapR  Technologies    

HBase Server Architecture

[Diagram: a client coordinates lookups through ZooKeeper and the HBase Master, then reads and writes data through an HBase RegionServer; the RegionServer stores its HFiles and WAL in HDFS, which sits on the Linux filesystem.]

Page 18: PhillyDB Hbase and MapR M7 - March 2013

18  ©MapR  Technologies    

§ A persistent record of every update/insert in sequence order – Shared by all regions on one region server – WAL files are periodically rolled to limit size, but older WALs are still needed – A WAL file is no longer needed once every region with updates in that WAL file has flushed those from memory to an HFile • Remember that more HFiles slow the read path!

§ Must be replayed as part of the recovery process since in-memory updates are "lost" – This is very expensive and delays bringing a region back online

WAL  File  

Page 19: PhillyDB Hbase and MapR M7 - March 2013

19  ©MapR  Technologies    

What’s  Not  So  Good  

Reliability
• Complex coordination between ZK, HDFS, HBase Master, and RegionServer during region movement
• Compactions disrupt operations
• Very slow crash recovery because of
  • Coordination complexity
  • WAL log reading (one log/server)

Business continuity
• Many administrative actions require downtime
• Not well integrated into MapR-FS mirroring and snapshot functionality

Page 20: PhillyDB Hbase and MapR M7 - March 2013

20  ©MapR  Technologies    

What’s  Not  So  Good  

Performance
• Very long read/write path
• Significant read and write amplification
• Multiple JVMs in the read/write path – GC delays!

Manageability
• Compactions, splits, and merges must be done manually (in reality)

• Lots of "well known" problems maintaining a reliable cluster – splitting, compactions, region assignment, etc.

• Practical limits on the number of regions per region server and the size of regions – can make it hard to fully utilize hardware

Page 21: PhillyDB Hbase and MapR M7 - March 2013

21  ©MapR  Technologies    

Region  Assignment  in  Apache  HBase  

Page 22: PhillyDB Hbase and MapR M7 - March 2013

22  ©MapR  Technologies    

Apache  HBase  on  MapR  

Limited data management, data protection and disaster recovery for tables.

Page 23: PhillyDB Hbase and MapR M7 - March 2013

23  ©MapR  Technologies    

HBase  MapR  M7  Containers    

Agenda  

Page 24: PhillyDB Hbase and MapR M7 - March 2013

24  ©MapR  Technologies    

MapR: A provider of enterprise-grade Hadoop with uniquely differentiated features

Page 25: PhillyDB Hbase and MapR M7 - March 2013

25  ©MapR  Technologies    

MapR: The Enterprise-Grade Distribution

Page 26: PhillyDB Hbase and MapR M7 - March 2013

26  ©MapR  Technologies    

One Platform for Big Data

[Diagram: one platform for a broad range of applications – MapReduce, file-based applications, SQL, database, search, and stream processing; batch, interactive, and real-time – with 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, and multi-tenancy. Example applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting.]

Page 27: PhillyDB Hbase and MapR M7 - March 2013

27  ©MapR  Technologies    

Dependable:  Lights  Out  Data  Center  Ready  

§  Automated  stateful  failover  

§ Automated re-replication

§ Self-healing from HW and SW failures

§  Load  balancing  

§  No  lost  jobs  or  data  

§ Five 9's (99.999%) of uptime

Reliable  Compute   Dependable  Storage  

§ Business continuity with snapshots and mirrors

§ Recover to a point in time
§ End-to-end checksumming
§ Strong consistency
§ Data safe
§ Mirror across sites to meet Recovery Time Objectives

Page 28: PhillyDB Hbase and MapR M7 - March 2013

28  ©MapR  Technologies    

Benchmark | MapR 2.1.1 | CDH 4.1.1 | MapR Speed Increase

Terasort (1x replication, compression disabled)
  Total  | 13m 35s | 26m 6s  | 2X
  Map    | 7m 58s  | 21m 8s  | 3X
  Reduce | 13m 32s | 23m 37s | 1.8X

DFSIO throughput/node
  Read  | 1003 MB/s | 656 MB/s | 1.5X
  Write | 924 MB/s  | 654 MB/s | 1.4X

YCSB (50% read, 50% update)
  Throughput | 36,584.4 op/s | 12,500.5 op/s | 2.9X
  Runtime    | 3.80 hr       | 11.11 hr      | 2.9X

YCSB (95% read, 5% update)
  Throughput | 24,704.3 op/s | 10,776.4 op/s | 2.3X
  Runtime    | 0.56 hr       | 1.29 hr       | 2.3X

Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

MinuteSort Record: 1.5 TB in 60 seconds on 2103 nodes

Fast:  World  Record  Performance  

Page 29: PhillyDB Hbase and MapR M7 - March 2013

29  ©MapR  Technologies    

The  Cloud  Leaders  Pick  MapR  

Google chose MapR to provide Hadoop on Google Compute Engine

Amazon EMR is the largest Hadoop provider in revenue and # of clusters

Page 30: PhillyDB Hbase and MapR M7 - March 2013

30  ©MapR  Technologies    

MapR  Supports  Broad  Set  of  Customers  

§ Log analysis
§ HBase
§ Customer targeting
§ Social media analysis
§ Customer revenue analytics
§ ETL offload
§ Advertising exchange analysis and optimization
§ Clickstream analysis
§ Quality profiling/field failure analysis
§ Enterprise-grade platform
§ COOP features
§ Monitoring and measuring online behavior
§ Fraud detection
§ Channel analytics
§ Recommendation engine
§ Fraud detection and prevention
§ Customer behavior analysis
§ Brand monitoring
§ Customer targeting
§ Viewer behavioral analytics
§ Recommendation engine
§ Family tree connections
§ Intrusion detection & prevention
§ Forensic analysis
§ Global threat analytics
§ Virus analysis
§ Patient care monitoring

(Customers include a leading retailer and a global credit card issuer.)

Page 31: PhillyDB Hbase and MapR M7 - March 2013

31  ©MapR  Technologies    

MapR Editions

§ Control System § NFS Access § Performance § High Availability § Snapshots & Mirroring § 24 x 7 Support § Annual Subscription

§  Control  System  §  NFS  Access  §  Performance  §  Unlimited  Nodes  §  Free    

Also available through: Google Compute Engine

§ All the features of M5 § Simplified administration for HBase § Increased performance § Consistent low latency § Unified snapshots, mirroring

Page 32: PhillyDB Hbase and MapR M7 - March 2013

32  ©MapR  Technologies    

Hbase  MapR  M7  Containers    

Agenda  

Page 33: PhillyDB Hbase and MapR M7 - March 2013

33  ©MapR  Technologies    

M7: An integrated system for unstructured and structured data

Page 34: PhillyDB Hbase and MapR M7 - March 2013

34  ©MapR  Technologies    

Introducing  MapR  M7  

§ An integrated system – Unified namespace for files and tables – Built-in data management & protection – No extra administration

§ Architected for reliability and performance – Fewer layers – Single hop to data – No compactions, low I/O amplification – Seamless splits, automatic merges – Instant recovery

Page 35: PhillyDB Hbase and MapR M7 - March 2013

35  ©MapR  Technologies    

Binary Compatible with HBase APIs

§ HBase applications work "as is" with M7 – No need to recompile (binary compatible)

§ Can run M7 and HBase side-by-side on the same cluster – e.g., during a migration – can access both an M7 table and an HBase table in the same program

§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:

% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable

Page 36: PhillyDB Hbase and MapR M7 - March 2013

36  ©MapR  Technologies    

M7:    Remove  Layers,  Simplify  

MapR      M7  

Take  note!  No  JVM!  

Page 37: PhillyDB Hbase and MapR M7 - March 2013

37  ©MapR  Technologies    

M7:    No  Master  and  No  RegionServers  

No  extra  daemons  to  manage  

One  hop  to  data   Unified  cache  

No  JVM  problems  

Page 38: PhillyDB Hbase and MapR M7 - March 2013

38  ©MapR  Technologies    

Region Assignment in Apache HBase: none of this complexity is present in MapR M7

Page 39: PhillyDB Hbase and MapR M7 - March 2013

39  ©MapR  Technologies    

Unified  Namespace  for  Files  and  Tables  

$ pwd
/mapr/default/user/dave
$ ls
file1  file2  table1  table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1  file2  table1  table2  table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr   16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr   22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:38 /user/dave/table3

Page 40: PhillyDB Hbase and MapR M7 - March 2013

40  ©MapR  Technologies    

Tables  for  End  Users  

§  Users  can  create  and  manage  their  own  tables  –  Unlimited  #  of  tables  

 §  Tables  can  be  created  in  any  directory  

–  Tables  count  towards  volume  and  user  quotas  

§ No admin intervention needed – I can create a file or a directory without opening a ticket with the admin team, why not a table?

– Do stuff on the fly, no need to stop/restart servers

§ Automatic data protection and disaster recovery – Users can recover from snapshots/mirrors on their own

Page 41: PhillyDB Hbase and MapR M7 - March 2013

41  ©MapR  Technologies    

M7  –  An  Integrated  System  

Page 42: PhillyDB Hbase and MapR M7 - March 2013

42  ©MapR  Technologies    

M7 Comparative Analysis with Apache HBase, LevelDB, and a BTree

Page 43: PhillyDB Hbase and MapR M7 - March 2013

43  ©MapR  Technologies    

HBase Write Amplification Analysis

§ Assume 10G per region, write 10% per day, grow 10% per week – 1G of writes – after 7 days, 7 files of 1G and 1 file of 10G (only 1G is growth)

§ IO cost – Wrote 7G to the WAL + 7G to HFiles – Compaction adds still more

• read: 17G (= 7 x 1G + 1 x 10G) • write: 11G written to the new HFile

– WAF – wrote 7G "for real" but the actual disk IO after compaction is read 17G + write 25G, and that's assuming no application reads!

§ IO cost of 1000 regions is similar to the above – read 17T, write 25T → major impact on the node

§ Best practice: limit # of regions/node → can't fully utilize storage

Page 44: PhillyDB Hbase and MapR M7 - March 2013

44  ©MapR  Technologies    

Alternative: LevelDB

§ Tiered, logarithmic increase – L1: 2 x 1M files – L2: 10 x 1M – L3: 100 x 1M – L4: 1,000 x 1M, etc.

§ Compaction overhead – avoids IO storms (I/O done in smaller increments of ~10M) – but significantly more bandwidth compared to HBase

§ Read overhead is still high – 10-15 seeks, perhaps more if the lowest level is very large – 40K-60K read from disk to retrieve a 1K record

Page 45: PhillyDB Hbase and MapR M7 - March 2013

45  ©MapR  Technologies    

BTree analysis

§ Read finds data directly, proven to be fastest

– interior nodes only hold keys – very large branching factor – values only at leaves – thus index caches work – R = logN seeks, if no caching – a 1K record read will transfer about logN blocks from disk

§ Writes are slow on inserts – inserted into the correct place right away – otherwise a read will not find it – requires the btree to be continuously rebalanced – causes extreme random I/O in the insert path – W = 2.5x + logN seeks if no caching

Page 46: PhillyDB Hbase and MapR M7 - March 2013

46  ©MapR  Technologies    

Log-Structured Merge Trees

§ LSM trees reduce insert cost by deferring and batching index changes – If you don't compact often, read performance is impacted – If you compact too often, write performance is impacted

§ B-Trees are great for reads – but expensive to update in real time

[Diagram: writes go to a log and an in-memory index; the on-disk index is updated later; reads consult both the in-memory and on-disk structures.]

Can we combine both ideas? Writes cannot be done better than W = 2.5x

(write to log + write data to somewhere + update meta-data)

Page 47: PhillyDB Hbase and MapR M7 - March 2013

47  ©MapR  Technologies    

M7 from MapR

§ Twisting BTrees – leaves are variable size (8K - 8M or larger) – can stay unbalanced for long periods of time

• more inserts will balance it eventually • automatically throttles updates to interior BTree nodes

– M7 inserts "close to" where the data is supposed to go

§ Reads – Uses the BTree structure to get "close" very fast

• very high branching with key-prefix-compression – Utilizes a separate lower-level index to find it exactly

• updated "in-place" bloom filters for gets, range maps for scans

§  Overhead  –  1K  record  read  will  transfer  about  32K  from  disk  in  logN  seeks  

Page 48: PhillyDB Hbase and MapR M7 - March 2013

48  ©MapR  Technologies    

M7 provides Instant Recovery

§ Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region

§ 0-40 microWALs per region – idle WALs are "compacted", so most are empty – a region is up before all microWALs are recovered – recovers the region in the background in parallel – when a key is accessed, that microWAL is recovered inline – 1000-10000x faster recovery

§ Never perform the equivalent of an HBase major or minor compaction

§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS – No limit to # of files on disk – No limit to # of open files – The I/O path translates random writes to sequential writes on disk

Page 49: PhillyDB Hbase and MapR M7 - March 2013

49  ©MapR  Technologies    

Approach | 1K-record read amplification | Compaction | Recovery

HBase with 7 HFiles | 30 seeks, 130K xfer | IO storms, good bandwidth | Huge WAL to recover

HBase with 3 HFiles | 15 seeks, 70K xfer | IO storms, high bandwidth | Huge WAL to recover

LevelDB with 5 levels | 13 seeks, 48K xfer | No I/O storms, very high bandwidth | WAL is tiny

BTree | logN seeks, logN xfer | No I/O storms, but 100% random | WAL is proportional to concurrency + cache

MapR M7 | logN seeks, 32K xfer | No I/O storms, low bandwidth | microWALs allow recovery < 100ms

Summary  

Page 50: PhillyDB Hbase and MapR M7 - March 2013

50  ©MapR  Technologies    

M7:    Fileservers  Serve  Regions  

§ A region lives entirely inside a container – Does not coordinate through ZooKeeper

§ Containers support distributed transactions – with replication built-in

§ The only coordination in the system is for splits – Between the region-map and the data-container – already solved this problem for files and their chunks

 

Page 51: PhillyDB Hbase and MapR M7 - March 2013

51  ©MapR  Technologies    

Hbase  MapR  M7  Containers    

Agenda  

Page 52: PhillyDB Hbase and MapR M7 - March 2013

52  ©MapR  Technologies    

     

What's  a  MapR  container?  

Page 53: PhillyDB Hbase and MapR M7 - March 2013

53  ©MapR  Technologies    

• Each container contains
  • Directories & files
  • Data blocks
  • BTrees

• 100% random writes

MapR's Containers: Files/directories are sharded into blocks and placed in containers on disks

Containers  are  ~32  GB  segments  of  disk,  placed  on  nodes  

Patent  Pending  

Page 54: PhillyDB Hbase and MapR M7 - March 2013

54  ©MapR  Technologies    

M7  Containers  

§ A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable

§ A container is replicated to servers – unit of resynchronization

§ A region lives entirely inside 1 container – all files + WALs + BTrees + bloom filters + range maps

Page 55: PhillyDB Hbase and MapR M7 - March 2013

55  ©MapR  Technologies    

Read-Write Replication

§ Writes are synchronous – All copies have the same data

§ Data is replicated in a "chain" fashion – better bandwidth, utilizes full-duplex network links well

§ Metadata is replicated in a "star" manner – better response time, bandwidth not a concern

– data can also be done this way


Page 56: PhillyDB Hbase and MapR M7 - March 2013

56  ©MapR  Technologies    

Random Writing in MapR

[Diagram: a client writing data asks the CLDB for a 64M block; the CLDB creates a container, picking a master and 2 replica slaves among servers S1-S5; the client attaches and writes, and the next chunk is written to a different server (e.g., S2), spreading the data across the cluster.]

Page 57: PhillyDB Hbase and MapR M7 - March 2013

57  ©MapR  Technologies    

• As data size increases, writes spread more, like dropping a pebble in a pond

• Larger pebbles spread the ripples farther

• Space is balanced by moving idle containers

Container Balancing
• Servers keep a bunch of containers "ready to go".
• Writes get distributed around the cluster.

Page 58: PhillyDB Hbase and MapR M7 - March 2013

58  ©MapR  Technologies    

• Heartbeat (HB) loss + upstream entity reports failure => server dead

• Increment epoch at CLDB • Rearrange replication path • Exact same code for files and M7 tables • No ZK

Failure  Handling  

Containers are managed at the CLDB (heartbeats, container reports).

Container Location Database (CLDB)

Page 59: PhillyDB Hbase and MapR M7 - March 2013

59  ©MapR  Technologies    

Architectural  Params  

§  Unit  of  I/O  –  4K/8K    (8K  in  MapR)  

§ Unit of Chunking (a map-reduce split) – 10-100's of megabytes

§ Unit of Resync (a replica) – 10-100's of gigabytes – container in MapR

[Diagram: a scale of sizes – unit of I/O, then roughly 10^3x for a map-reduce split, 10^6x for resync, 10^9x for administration – with the HDFS 'block' also marked on this scale.]

§ Unit of Administration (snap, repl, mirror, quota, backup) – 1 gigabyte - 1000's of terabytes – volume in MapR – what data is affected by my missing blocks?

Page 60: PhillyDB Hbase and MapR M7 - March 2013

60  ©MapR  Technologies    

Other  M7  Features  

§  Smaller  disk  footprint  – M7  never  repeats  the  key  or  column  name  

 §  Columnar  layout  

– M7 supports 64 column families – in-memory column families

§  Online  admin  – M7  schema  changes  on  the  fly  – delete/rename/redistribute  tables      

Page 61: PhillyDB Hbase and MapR M7 - March 2013

61  ©MapR  Technologies    

Thank  you!    

Questions?

Page 62: PhillyDB Hbase and MapR M7 - March 2013

62  ©MapR  Technologies    

Examples:  Reliability  Issues  

§ Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)

§ Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)

§ Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

§ No client throttling: The HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)

Page 63: PhillyDB Hbase and MapR M7 - March 2013

63  ©MapR  Technologies    

Examples: Business Continuity Issues

§ No Snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables, so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)

§ Complex Backup Process: backups are complex, unreliable, and inefficient. (http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html)

§ Administration Requires Downtime: The entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication, and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)

Page 64: PhillyDB Hbase and MapR M7 - March 2013

64  ©MapR  Technologies    

Examples:  Performance  Issues  

§ Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)

§ Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from local drives. (HBASE-4755, HBASE-4491)

§ Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)

§ Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)

Page 65: PhillyDB Hbase and MapR M7 - March 2013

65  ©MapR  Technologies    

Examples:  Manageability  Issues  

§ Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)

§ Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

§ Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.

§ Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)