Apache HBase – Where we've been and what's upcoming
Jonathan Hsieh | @jmhsieh
Software Engineer at Cloudera | HBase PMC Member
BigData.be, Brussels, Belgium – April 4, 2014
Who Am I?
• Cloudera: Tech Lead, HBase Team; Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington: research in distributed systems
What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Diagram: applications and MapReduce clients on top of HBase, which runs on ZooKeeper and HDFS]
An HBase History
Where We’ve Been
Apache HBase Timeline
• Nov '06: Google BigTable paper at OSDI '06
• Apr '07: First Apache HBase commit, as a Hadoop contrib project
• Jan '08: Promoted to Hadoop subproject
• Summer '09: StumbleUpon goes production on HBase ~0.20
• Apr '10: Apache HBase becomes a top-level project
• Apr '11: CDH3 GA with HBase 0.90.1
• Summer '11: Facebook Messages on HBase; Web Crawl Cache
• Nov '11: Cassini on HBase
• Jan '12: 0.92.0
• May '12: 0.94.0; HBaseCon 2012
• Jan '13: Phoenix on HBase
• Jun '13: HBaseCon 2013
• Oct '13: 0.96.0
• Feb '14: 0.98.0
Developer Community
• Active community!
• Diverse committers from many organizations
Apache HBase “Nascar” Slide
Apache HBase Core Development
• Vendors • Self Service
Apache HBase Sample Users
• Inbox • Storage • Web • Search • Analytics • Monitoring
Apache HBase Ecosystem Projects
Disaster Recovery, Continuity, and MTTR
Today: Apache HBase 0.96.0
HBase Provides Low-latency Random Access
• Writes: 1-3 ms; 1k-20k writes/sec per node
• Reads: 0-3 ms cached, 10-30 ms from disk; 10k-40k reads/sec per node from cache
• Cell size: 0 B-3 MB
• Read, write, and insert data anywhere in the table (a minimal client sketch follows)
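A minimal sketch of this random read/write access, assuming the 0.96-era Java client API; the table name "t1", column family "d", and row key are illustrative, not from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");  // 0.96-era API; later versions use Connection/Table
    try {
      // Write a single cell anywhere in the keyspace.
      Put put = new Put(Bytes.toBytes("row-42"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("col1"), Bytes.toBytes("value"));
      table.put(put);

      // Read it back: typically a few milliseconds, faster when served from cache.
      Result result = table.get(new Get(Bytes.toBytes("row-42")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col1"))));
    } finally {
      table.close();
    }
  }
}
```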
Core Properties
• ACID guarantees on a row; writes are durable
• Strong consistency first, then availability: after a failure, recover and return the current value rather than a stale one
• CAS and atomic increments can be efficient (a sketch follows this list)
• Sorted by primary key: short scans are efficient
• Partitioned by primary key
• Log-structured merge tree: writes are extremely efficient, reads are efficient
• Periodic layout optimizations for reads ("compactions") are required
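A small sketch of the row-level atomicity mentioned above: checkAndPut as a compare-and-set, plus an atomic counter increment. The table "accounts" and its columns are made up for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicityExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "accounts");
    try {
      byte[] row = Bytes.toBytes("alice");
      byte[] cf  = Bytes.toBytes("d");

      // Compare-and-set: only write if the current status is "pending".
      Put put = new Put(row);
      put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("active"));
      boolean applied = table.checkAndPut(
          row, cf, Bytes.toBytes("status"), Bytes.toBytes("pending"), put);

      // Atomic increment of a counter cell; no read-modify-write on the client.
      long visits = table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);

      System.out.println("CAS applied=" + applied + ", visits=" + visits);
    } finally {
      table.close();
    }
  }
}
```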
Critical Features
Disaster Recovery
• Cluster replication
• Table snapshots
• CopyTable
• Import / export tables
• Metadata corruption repair tool (hbck)
Administrative and Continuity
• Kerberos-based authentication
• ACL-based authorization
• Config changes via rolling restart
• Within-version rolling upgrade
• Protobuf-based wire protocol for RPC future-proofing
Hardened for 0.96.0
Table Administration
• Online schema change
• Online region merging
• Continuous fault-injection testing with "Chaos Monkey"
Performance Tuning
• Alternate key encodings for efficient memory usage
• Exploring compactor policy minimizes compaction storms
• Smart and adaptive stochastic region load balancer
• Fast split policy for new tables
Mean Time to Recovery (MTTR)
• Machine failures happen in distributed systems
• MTTR is the average unavailability when automatically recovering from a failure
• Also: recovery time after an unclean data center power cycle
[Timeline diagram: detect → notify → repair → recovered; the region is unavailable, then available with the client unaware, then available with the client aware]
Fast Notification and Detection (0.96)
• Proactive notification of HMaster failure (0.96)
• Proactive notification of RS failure (0.96)
• Notify the client on recovery (0.96)
• Fast server failover (hardware)
[Timeline diagram, log splitting: detect → split (to HDFS) → assign → replay → recovered; the region is unavailable until it becomes available for RW]
Distributed Log Replay (0.96)
• Previously there were two IO-intensive passes: log splitting to intermediate files, then assign and log replay
• Now just one IO-heavy pass: assign first, then split + replay
• Improves read and write recovery times
• Off by default currently* (a hedged config sketch follows)
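A hedged sketch of turning the feature on. In practice the flag lives in hbase-site.xml; the property name below assumes the hbase.master.distributed.log.replay key used by the 0.96-era implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LogReplayConfig {
  public static Configuration withDistributedLogReplay() {
    Configuration conf = HBaseConfiguration.create();
    // Off by default in 0.96; turn it on to assign regions first and replay WAL edits after.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    return conf;
  }
}
```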
[Timeline diagram: detect → assign → split + replay → recovered; the region becomes available for replay writes early, then available for RW]
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
A Future HBase
What’s Upcoming
Outline
• Improved Mean Time to Recovery (MTTR)
• Improved Predictability
• Improved Usability
• Improved Multitenancy
Faster read recovery
Improving MTTR Further
Recap: Distributed Log Replay (0.96*)
• Previously two IO-intensive passes: log splitting to intermediate files, then assign and log replay
• Now one IO-heavy pass: assign first, then split + replay
• Improves read and write recovery times; off by default currently*
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
Distributed Log Replay with Fast Write Recovery
[Timeline diagram: detect → assign → split + replay → recovered; the region is available for all writes shortly after assignment, then available for RW]
• Writes in HBase do not incur reads
• With distributed log replay, regions are already open for writes during replay
• So allow fresh writes while replaying old logs*
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
Fast Read Recovery (proposed)
• Idea: pristine-region fast read recovery. If a region has no pending edits, it is consistent and can recover RW immediately
• Idea: shadow regions for fast read recovery. A shadow region tails the WAL of the primary region; its memstore is one HDFS block behind, so it catches up and recovers RW
• Currently some progress in trunk
[Timeline diagram: detect → assign → recovered; if we can guarantee there are no new edits, or that we already have all edits, the region is available for all RW almost immediately]
Improving the 99th Percentile
Improving Predictability
Common causes of performance variability
• Locality loss → favored nodes, HDFS block affinity
• Compaction → exploring compactor
• GC → off-heap cache
• Hardware hiccups → multi-WAL, HDFS speculative (hedged) reads
Performance Degraded After Recovery
• After recovery, reads suffer a performance hit because regions have lost locality
• To maintain performance after failover we need to regain locality; compacting the region restores it
• We can do better by using HDFS features
[Graph: service is recovered but performance is degraded, until a compaction restores locality and performance recovers]
Read Throughput: Favored Nodes (0.96*)
• Control and track where block replicas are placed: all files for a region are created so that their blocks go to the same set of favored nodes
• When failing over, assign the region to one of those favored nodes
• Currently a preview feature in 0.96, disabled by default because it doesn't work well with the latest balancer or with splits; will likely use upcoming HDFS block affinity for better operability
• Originally on Facebook's 0.89 branch, ported to 0.96
[Graph: service recovered with performance sustained, because the region was assigned to a favored node]
Read latency: HDFS hedged read (CDH5.0)
• HBase's region servers use the HDFS client to read one of the three HDFS block replicas
• If you chose the slow node, your reads are slow
• If a read is taking too long, speculatively go to another replica that may be faster (a hedged config sketch follows)
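A hedged sketch of enabling HDFS hedged reads for the region server's DFS client. The two property names assume the HDFS-5776 implementation that shipped around CDH5.0; the values are illustrative and would normally be set in hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static Configuration withHedgedReads() {
    Configuration conf = HBaseConfiguration.create();
    // Spare threads for speculative reads; 0 (the default) disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica hasn't answered within this long, try another one.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
    return conf;
  }
}
```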
[Diagram: a region server reads one of three HDFS replicas; "Slow read!" → "Too slow, read other replica"]
Read latency: Read Replicas (in progress)
• The HBase client reads from the primary region server
• If you chose the slow node, your reads are slow
• Idea: read replicas assigned to other region servers; the replicas periodically catch up (via snapshots or shadow-region memstores)
• The client specifies whether a stale read is OK; if a read takes too long, speculatively go to another replica that may be faster (a hedged client sketch follows)
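A hedged sketch of a client opting into possibly-stale reads. This feature was still in progress at the time of the talk; the sketch assumes the timeline-consistency API (Consistency.TIMELINE, Result.isStale()) that the read-replica work later shipped, so treat the names as assumptions.

```java
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class StaleReadExample {
  static Result readAllowingStale(Table table, byte[] row) throws Exception {
    Get get = new Get(row);
    get.setConsistency(Consistency.TIMELINE);  // strong consistency (the default) not required
    Result result = table.get(get);
    if (result.isStale()) {
      System.out.println("Served by a secondary replica; data may lag the primary.");
    }
    return result;
  }
}
```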
[Diagram: an HBase client reads from the primary region replica; "Slow read!" → "Too slow, read stale replica"]
Write Latency: Multiple WALs (in progress)
• HBase's HDFS client writes 3 replicas of the WAL
• Minimum write latency is bounded by the slowest of the 3 replicas
• Idea: if a write is taking too long, duplicate it on another replica set that may be faster
[Diagram: a region server writes to three HDFS replicas; "Slow write" → "Too slow, write to other replicas"]
MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required a full online table scan
• Idea: take a snapshot and run the MR job over the snapshot files
• Doesn't use the HBase client, avoids polluting HBase caches, and gives a 3-5x performance boost (a hedged job sketch follows)
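A hedged sketch of wiring a MapReduce job to a snapshot via TableSnapshotInputFormat, as exposed by TableMapReduceUtil in 0.98 / CDH5.0. The snapshot name, restore directory, and mapper are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Trivial mapper that just emits each row key it sees.
  static class RowKeyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(row, NullWritable.get());
    }
  }

  public static Job createJob(Configuration base) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(base), "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);
    // Mappers read the snapshot's HFiles directly from HDFS: no region servers,
    // no block cache pollution.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot",                      // snapshot taken beforehand (shell or Admin API)
        new Scan(),                         // scan applied over the snapshot files
        RowKeyMapper.class,
        ImmutableBytesWritable.class,       // map output key class
        NullWritable.class,                 // map output value class
        job,
        true,                               // ship HBase dependency jars with the job
        new Path("/tmp/snapshot-restore")); // scratch dir for restored snapshot references
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    return job;
  }
}
```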
[Diagram: map/reduce tasks scanning the live table vs. map/reduce tasks reading snapshot files directly]
Improving Usability
Autotuning, Tracing, and SQL
Making HBase easier to use and tune.
• Difficult to see what is happening inside HBase
• Easy to make poor design decisions early without realizing it
• New developments: memory auto-tuning, HTrace + Zipkin, frameworks for schema design
Memory Use Auto-tuning
• Memory is divided between the memstore (used for serving recent writes) and the block cache (used for read hot spots)
• You need to choose the balance for your workload (a hedged config sketch follows)
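A hedged sketch of the two knobs being balanced: the global memstore fraction and the block cache fraction of region server heap, normally set in hbase-site.xml (later releases rename the memstore key to hbase.regionserver.global.memstore.size). The values shown are an illustrative read-heavy tilt, which is the kind of split the auto-tuning work aims to pick for you.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HeapSplitConfig {
  // Tilt the heap split toward reads; together these fractions should stay
  // well under ~0.8 of the heap to leave room for everything else.
  public static Configuration readHeavy() {
    Configuration conf = HBaseConfiguration.create();
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.25f); // write buffers
    conf.setFloat("hfile.block.cache.size", 0.55f);                        // read cache
    return conf;
  }
}
```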
[Diagram: region server heap split between memstore and block cache for read-heavy, balanced, and write-heavy workloads]
HTrace
• Problem: where is time being spent inside HBase?
• Solution: the HTrace framework, inspired by Google Dapper and threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines
*Some assembly still required.
HTrace: Distributed Tracing in HBase and HDFS
• Framework inspired by Google Dapper
• Tracks time spent in RPC calls across different machines
• Threaded through HBase (0.96) and, in the future, HDFS
[Diagram: a traced request flows from the HBase client through the meta region, a region server, ZooKeeper, and the HDFS NameNode/DataNode; each RPC call is recorded as a span]
Zipkin – Visualizing Spans
• UI + visualization system, written by Twitter
• Zipkin HBase storage; Zipkin HTrace integration
• View where the time of a specific call is spent in HBase, HDFS, and ZK
HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale: how can we make this easier, and how can we reduce the expertise required?
• Today: lots of tuning knobs; developers need to understand column families, rowkey design, data encoding, …
• Some choices are expensive to change after the fact
How should I arrange my data?
• Isomorphic data representations!
Tall skinny table with a compound rowkey:

  rowkey     d:
  bob-col1   aaaa
  bob-col2   bbbb
  bob-col3   cccc
  bob-col4   dddd
  jon-col1   eeee
  jon-col2   ffff
  jon-col3   gggg
  jon-col4   hhhh

Short fat table using column qualifiers:

  rowkey   d:col1   d:col2   d:col3   d:col4
  bob      aaaa     bbbb     cccc     dddd
  jon      eeee     ffff     gggg     hhhh

Short fat table using column families:

  rowkey   col1:   col2:   col3:   col4:
  bob      aaaa    bbbb    cccc    dddd
  jon      eeee    ffff    gggg    hhhh
With great power comes great responsibility! How can we make this easier for users?
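A hedged sketch of writing the same datum under two of the layouts above: tall skinny with a compound rowkey versus short fat with column qualifiers. The family "d" and the helper names are illustrative, not from the talk.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaShapes {
  // Tall skinny: the attribute name is folded into the row key.
  static Put tallSkinny(String user, String col, String value) {
    Put put = new Put(Bytes.toBytes(user + "-" + col));     // e.g. "bob-col1"
    put.add(Bytes.toBytes("d"), Bytes.toBytes(""), Bytes.toBytes(value));
    return put;
  }

  // Short fat: one row per user, one qualifier per attribute.
  static Put shortFat(String user, String col, String value) {
    Put put = new Put(Bytes.toBytes(user));                  // e.g. "bob"
    put.add(Bytes.toBytes("d"), Bytes.toBytes(col), Bytes.toBytes(value));
    return put;
  }
}
```

The data is the same either way; the choice changes scan patterns, row sizes, and hotspotting behavior, which is why it is hard to change after the fact.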
Impala
• Scalable, low-latency SQL querying for HDFS (and HBase!)
• ODBC/JDBC driver interface
• Highlights: uses the Hive metastore and its Hive-HBase connector configuration conventions; native-code implementation that uses JIT compilation to optimize query execution; authorization via Kerberos support
• Open sourced by Cloudera: https://github.com/cloudera/impala
Phoenix
• A SQL skin over HBase targeting low-latency queries
• JDBC SQL interface
• Highlights: adds types; handles compound row-key encoding; secondary indexes in development; provides some pushed-down aggregations via coprocessors (a hedged JDBC sketch follows)
• Open sourced by Salesforce.com; work from James Taylor, Jesse Yates, et al.
• https://github.com/forcedotcom/phoenix
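A hedged sketch of querying through Phoenix's JDBC driver. The jdbc:phoenix:&lt;zookeeper-quorum&gt; URL, table, and columns are assumptions; the Phoenix client jar must be on the classpath so the driver registers (older releases may need an explicit Class.forName on the Phoenix driver class).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement()) {
      // Phoenix adds typed columns and a compound primary key on top of HBase.
      stmt.execute("CREATE TABLE IF NOT EXISTS metrics (host VARCHAR NOT NULL, "
          + "ts DATE NOT NULL, value DOUBLE CONSTRAINT pk PRIMARY KEY (host, ts))");
      // Aggregation is pushed down to the region servers via coprocessors.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT host, AVG(value) FROM metrics GROUP BY host")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
        }
      }
    }
  }
}
```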
Kite (nee Cloudera Development Kit/CDK)
• APIs that provide a Dataset abstraction: a get/put/delete API on Avro objects; HBase support is in progress
• Highlights: supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCatalog); provides types using Avro and Parquet formats for encoding entities; manages schema evolution
• Open sourced by Cloudera: https://github.com/kite-sdk/kite
Many apps and users in a single cluster
Multi-tenancy
Growing HBase
• Pre-0.96.0: scaling up HBase for single HBase applications; essentially a single user for a single app
• Ex: Facebook Messages: one application, many HBase clusters; users sharded to different pods
• Focused on continuity and disaster-recovery features: cross-cluster replication, table snapshots, rolling upgrades
[Chart: # of isolated applications vs. # of clusters (scalability); the pre-0.96 point is one giant application spread over multiple clusters]
Growing HBase
• In 0.96 we introduced primitives for supporting multitenancy: many users, many applications, one HBase cluster
• We need some control over the interactions different users cause
• Ex: managing MR analytics and low-latency serving in one cluster
[Chart: same axes; multitenancy moves from one giant application over multiple clusters toward many applications in one shared cluster]
Namespaces (0.96)
• Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance (a hedged API sketch follows)
[Diagram: tables grouped into namespaces blue, green, and orange]
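A minimal sketch of the 0.96 namespace API: create a namespace, then create a table inside it. The namespace, table, and family names are illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class NamespaceExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      admin.createNamespace(NamespaceDescriptor.create("blue").build());
      // Tables live inside a namespace; the fully qualified name is <namespace>:<table>.
      HTableDescriptor desc =
          new HTableDescriptor(TableName.valueOf("blue", "events"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    } finally {
      admin.close();
    }
  }
}
```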
Multitenancy goals
• Security (0.96): separate admin ACLs for different sets of tables
• Quotas (in progress): max tables, max regions
• Performance isolation (in progress): limit the performance impact that load on one table has on others
• Priority (future): handle some tables before others
Isolation with Region Server Groups
[Diagram: region assignment distribution without region server groups; regions from namespaces blue, green, and orange are spread across all region servers]
Isolation with Region Server Groups
[Diagram: region assignment distribution with region server groups (RSG); namespaces blue, green, and orange are each assigned to their own RSG]
Cell Tags
• A mechanism for attaching arbitrary metadata to cells
• Motivation: finer-grained isolation; used for Accumulo-style cell-level visibility (a hedged API sketch follows)
• Main feature for 0.98
• Other uses: sequence numbers to enable correct fast read/write recovery; potential for schema tags
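A hedged sketch of the cell-visibility use of tags, assuming the visibility-labels API added in 0.98 (the VisibilityController coprocessor must be enabled and the labels defined on the cluster); the label names are illustrative.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.security.visibility.Authorizations;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilityExample {

  // Write a cell visible only to readers holding both the "secret" and "finance" labels.
  static Put taggedPut() {
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    put.setCellVisibility(new CellVisibility("secret&finance"));
    return put;
  }

  // Scan with the authorizations this reader is entitled to use.
  static Scan authorizedScan() {
    Scan scan = new Scan();
    scan.setAuthorizations(new Authorizations("secret", "finance"));
    return scan;
  }
}
```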
Conclusions
Summary by Version
• 0.90 (CDH3): new-feature focus on Stability; MTTR: recovery in hours; Performance: baseline; Usability: requires HBase developer expertise
• 0.92 / 0.94 (CDH4): Reliability; MTTR: recovery in minutes; Performance: better throughput; Usability: requires HBase operational experience
• 0.96 (CDH5): Continuity; MTTR: recovery of writes in seconds, reads in tens of seconds; Performance: optimizing performance; Usability: requires distributed-systems admin experience
• Next (0.98 / 1.0.0): Multitenancy; MTTR: recovery in seconds (reads + writes); Performance: predictable performance; Usability: aimed at application developers' experience
Questions? @jmhsieh