Apache HBase – Where we've been and what's upcoming
Jonathan Hsieh | @jmhsieh
Software Engineer at Cloudera | HBase PMC Member
BigData.be, Brussels, Belgium – April 4, 2014
Who Am I?
• Cloudera: Tech Lead, HBase Team; Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington: research in distributed systems
What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Diagram: applications and MapReduce clients on top of HBase, which runs on ZooKeeper and HDFS]
An HBase History
Where We’ve Been
Apache HBase Timeline
• Nov '06: Google BigTable paper at OSDI '06
• Apr '07: First Apache HBase commit, as a Hadoop contrib project
• Jan '08: Promoted to Hadoop subproject
• Summer '09: StumbleUpon goes production on HBase ~0.20
• Apr '10: Apache HBase becomes a top-level project
• Apr '11: CDH3 GA with HBase 0.90.1
• Summer '11: Facebook Messages on HBase; Web Crawl Cache
• Nov '11: Cassini on HBase
• Jan '12: 0.92.0
• May '12: 0.94.0; HBaseCon 2012
• Jan '13: Phoenix on HBase
• Jun '13: HBaseCon 2013
• Oct '13: 0.96.0
• Feb '14: 0.98.0
Developer Community
• Active community!
• Diverse committers from many organizations
Apache HBase “Nascar” Slide
Apache HBase Core Development
• Vendors • Self Service
Apache HBase Sample Users
• Inbox • Storage • Web • Search • Analytics • Monitoring
Apache HBase Ecosystem Projects
Disaster Recovery, Continuity, and MTTR
Today: Apache HBase 0.96.0
HBase Provides Low-latency Random Access
• Writes: 1-3 ms; 1k-20k writes/sec per node
• Reads: 0-3 ms cached, 10-30 ms from disk; 10k-40k reads/sec per node from cache
• Cell size: 0 B-3 MB
• Read, write, and insert data anywhere in the table (a minimal client sketch follows)
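A minimal sketch of this random read/write access, assuming the 0.96-era Java client API; the table name "t1", column family "d", and row key are illustrative, not from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");  // 0.96-era API; later versions use Connection/Table
    try {
      // Write a single cell anywhere in the keyspace.
      Put put = new Put(Bytes.toBytes("row-42"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("col1"), Bytes.toBytes("value"));
      table.put(put);

      // Read it back: typically a few milliseconds, faster when served from cache.
      Result result = table.get(new Get(Bytes.toBytes("row-42")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col1"))));
    } finally {
      table.close();
    }
  }
}
```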
Core Properties
• ACID guarantees on a row; writes are durable
• Strong consistency first, then availability: after a failure, recover and return the current value rather than a stale one
• CAS and atomic increments can be efficient (a sketch follows this list)
• Sorted by primary key: short scans are efficient
• Partitioned by primary key
• Log-structured merge tree: writes are extremely efficient, reads are efficient
• Periodic layout optimizations for reads ("compactions") are required
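A small sketch of the row-level atomicity mentioned above: checkAndPut as a compare-and-set, plus an atomic counter increment. The table "accounts" and its columns are made up for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicityExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "accounts");
    try {
      byte[] row = Bytes.toBytes("alice");
      byte[] cf  = Bytes.toBytes("d");

      // Compare-and-set: only write if the current status is "pending".
      Put put = new Put(row);
      put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("active"));
      boolean applied = table.checkAndPut(
          row, cf, Bytes.toBytes("status"), Bytes.toBytes("pending"), put);

      // Atomic increment of a counter cell; no read-modify-write on the client.
      long visits = table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);

      System.out.println("CAS applied=" + applied + ", visits=" + visits);
    } finally {
      table.close();
    }
  }
}
```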
Critical Features
Disaster Recovery
• Cluster replication
• Table snapshots
• CopyTable
• Import / export tables
• Metadata corruption repair tool (hbck)
Administrative and Continuity
• Kerberos-based authentication
• ACL-based authorization
• Config changes via rolling restart
• Within-version rolling upgrade
• Protobuf-based wire protocol for RPC future-proofing
Hardened for 0.96.0
Table Administration
• Online schema change
• Online region merging
• Continuous fault-injection testing with "Chaos Monkey"
Performance Tuning
• Alternate key encodings for efficient memory usage
• Exploring compactor policy minimizes compaction storms
• Smart and adaptive stochastic region load balancer
• Fast split policy for new tables
Mean Time to Recovery (MTTR)
• Machine failures happen in distributed systems
• MTTR is the average unavailability when automatically recovering from a failure
• Also: recovery time after an unclean data center power cycle
[Timeline diagram: detect → notify → repair → recovered; the region is unavailable, then available with the client unaware, then available with the client aware]
Fast Notification and Detection (0.96)
• Proactive notification of HMaster failure (0.96)
• Proactive notification of RS failure (0.96)
• Notify the client on recovery (0.96)
• Fast server failover (hardware)
[Timeline diagram, log splitting: detect → split (to HDFS) → assign → replay → recovered; the region is unavailable until it becomes available for RW]
Distributed Log Replay (0.96)
• Previously there were two IO-intensive passes: log splitting to intermediate files, then assign and log replay
• Now just one IO-heavy pass: assign first, then split + replay
• Improves read and write recovery times
• Off by default currently* (a hedged config sketch follows)
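A hedged sketch of turning the feature on. In practice the flag lives in hbase-site.xml; the property name below assumes the hbase.master.distributed.log.replay key used by the 0.96-era implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LogReplayConfig {
  public static Configuration withDistributedLogReplay() {
    Configuration conf = HBaseConfiguration.create();
    // Off by default in 0.96; turn it on to assign regions first and replay WAL edits after.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    return conf;
  }
}
```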
[Timeline diagram: detect → assign → split + replay → recovered; the region becomes available for replay writes early, then available for RW]
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
A Future HBase
What’s Upcoming
Outline
• Improved Mean Time to Recovery (MTTR)
• Improved Predictability
• Improved Usability
• Improved Multitenancy
Faster read recovery
Improving MTTR Further
Recap: Distributed Log Replay (0.96*)
• Previously two IO-intensive passes: log splitting to intermediate files, then assign and log replay
• Now one IO-heavy pass: assign first, then split + replay
• Improves read and write recovery times; off by default currently*
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
Distributed Log Replay with Fast Write Recovery
[Timeline diagram: detect → assign → split + replay → recovered; the region is available for all writes shortly after assignment, then available for RW]
• Writes in HBase do not incur reads
• With distributed log replay, regions are already open for writes during replay
• So allow fresh writes while replaying old logs*
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
Fast Read Recovery (proposed)
• Idea: pristine-region fast read recovery. If a region has no pending edits, it is consistent and can recover RW immediately
• Idea: shadow regions for fast read recovery. A shadow region tails the WAL of the primary region; its memstore is one HDFS block behind, so it catches up and recovers RW
• Currently some progress in trunk
[Timeline diagram: detect → assign → recovered; if we can guarantee there are no new edits, or that we already have all edits, the region is available for all RW almost immediately]
Improving the 99th Percentile
Improving Predictability
Common causes of performance variability
• Locality loss → favored nodes, HDFS block affinity
• Compaction → exploring compactor
• GC → off-heap cache
• Hardware hiccups → multi-WAL, HDFS speculative (hedged) reads
Performance Degraded After Recovery
• After recovery, reads suffer a performance hit because regions have lost locality
• To maintain performance after failover we need to regain locality; compacting the region restores it
• We can do better by using HDFS features
[Graph: service is recovered but performance is degraded, until a compaction restores locality and performance recovers]
Read Throughput: Favored Nodes (0.96*)
• Control and track where block replicas are placed: all files for a region are created so that their blocks go to the same set of favored nodes
• When failing over, assign the region to one of those favored nodes
• Currently a preview feature in 0.96, disabled by default because it doesn't work well with the latest balancer or with splits; will likely use upcoming HDFS block affinity for better operability
• Originally on Facebook's 0.89 branch, ported to 0.96
[Graph: service recovered with performance sustained, because the region was assigned to a favored node]
Read latency: HDFS hedged read (CDH5.0)
• HBase's region servers use the HDFS client to read one of the three HDFS block replicas
• If you chose the slow node, your reads are slow
• If a read is taking too long, speculatively go to another replica that may be faster (a hedged config sketch follows)
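A hedged sketch of enabling HDFS hedged reads for the region server's DFS client. The two property names assume the HDFS-5776 implementation that shipped around CDH5.0; the values are illustrative and would normally be set in hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static Configuration withHedgedReads() {
    Configuration conf = HBaseConfiguration.create();
    // Spare threads for speculative reads; 0 (the default) disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica hasn't answered within this long, try another one.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
    return conf;
  }
}
```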
[Diagram: a region server reads one of three HDFS replicas; "Slow read!" → "Too slow, read other replica"]
Read latency: Read Replicas (in progress)
• The HBase client reads from the primary region server
• If you chose the slow node, your reads are slow
• Idea: read replicas assigned to other region servers; the replicas periodically catch up (via snapshots or shadow-region memstores)
• The client specifies whether a stale read is OK; if a read takes too long, speculatively go to another replica that may be faster (a hedged client sketch follows)
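A hedged sketch of a client opting into possibly-stale reads. This feature was still in progress at the time of the talk; the sketch assumes the timeline-consistency API (Consistency.TIMELINE, Result.isStale()) that the read-replica work later shipped, so treat the names as assumptions.

```java
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class StaleReadExample {
  static Result readAllowingStale(Table table, byte[] row) throws Exception {
    Get get = new Get(row);
    get.setConsistency(Consistency.TIMELINE);  // strong consistency (the default) not required
    Result result = table.get(get);
    if (result.isStale()) {
      System.out.println("Served by a secondary replica; data may lag the primary.");
    }
    return result;
  }
}
```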
[Diagram: an HBase client reads from the primary region replica; "Slow read!" → "Too slow, read stale replica"]
Write Latency: Multiple WALs (in progress)
• HBase's HDFS client writes 3 replicas of the WAL
• Minimum write latency is bounded by the slowest of the 3 replicas
• Idea: if a write is taking too long, duplicate it on another replica set that may be faster
[Diagram: a region server writes to three HDFS replicas; "Slow write" → "Too slow, write to other replicas"]
MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required a full online table scan
• Idea: take a snapshot and run the MR job over the snapshot files
• Doesn't use the HBase client, avoids polluting HBase caches, and gives a 3-5x performance boost (a hedged job sketch follows)
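A hedged sketch of wiring a MapReduce job to a snapshot via TableSnapshotInputFormat, as exposed by TableMapReduceUtil in 0.98 / CDH5.0. The snapshot name, restore directory, and mapper are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Trivial mapper that just emits each row key it sees.
  static class RowKeyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(row, NullWritable.get());
    }
  }

  public static Job createJob(Configuration base) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(base), "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);
    // Mappers read the snapshot's HFiles directly from HDFS: no region servers,
    // no block cache pollution.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot",                      // snapshot taken beforehand (shell or Admin API)
        new Scan(),                         // scan applied over the snapshot files
        RowKeyMapper.class,
        ImmutableBytesWritable.class,       // map output key class
        NullWritable.class,                 // map output value class
        job,
        true,                               // ship HBase dependency jars with the job
        new Path("/tmp/snapshot-restore")); // scratch dir for restored snapshot references
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    return job;
  }
}
```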
[Diagram: map/reduce tasks scanning the live table vs. map/reduce tasks reading snapshot files directly]
Improving Usability
Autotuning, Tracing, and SQL
Making HBase easier to use and tune.
• Difficult to see what is happening inside HBase
• Easy to make poor design decisions early without realizing it
• New developments: memory auto-tuning, HTrace + Zipkin, frameworks for schema design
Memory Use Auto-tuning
• Memory is divided between the memstore (used for serving recent writes) and the block cache (used for read hot spots)
• You need to choose the balance for your workload (a hedged config sketch follows)
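A hedged sketch of the two knobs being balanced: the global memstore fraction and the block cache fraction of region server heap, normally set in hbase-site.xml (later releases rename the memstore key to hbase.regionserver.global.memstore.size). The values shown are an illustrative read-heavy tilt, which is the kind of split the auto-tuning work aims to pick for you.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HeapSplitConfig {
  // Tilt the heap split toward reads; together these fractions should stay
  // well under ~0.8 of the heap to leave room for everything else.
  public static Configuration readHeavy() {
    Configuration conf = HBaseConfiguration.create();
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.25f); // write buffers
    conf.setFloat("hfile.block.cache.size", 0.55f);                        // read cache
    return conf;
  }
}
```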
[Diagram: region server heap split between memstore and block cache for read-heavy, balanced, and write-heavy workloads]
HTrace
• Problem: where is time being spent inside HBase?
• Solution: the HTrace framework, inspired by Google Dapper and threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines
*Some assembly still required.
HTrace: Distributed Tracing in HBase and HDFS
• Framework inspired by Google Dapper
• Tracks time spent in RPC calls across different machines
• Threaded through HBase (0.96) and, in the future, HDFS
[Diagram: a traced request flows from the HBase client through the meta region, a region server, ZooKeeper, and the HDFS NameNode/DataNode; each RPC call is recorded as a span]
Zipkin – Visualizing Spans
• UI + visualization system, written by Twitter
• Zipkin HBase storage; Zipkin HTrace integration
• View where the time of a specific call is spent in HBase, HDFS, and ZK
HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale: how can we make this easier, and how can we reduce the expertise required?
• Today: lots of tuning knobs; developers need to understand column families, rowkey design, data encoding, …
• Some choices are expensive to change after the fact
How should I arrange my data?
• Isomorphic data representations!
Tall skinny table with a compound rowkey:

  rowkey     d:
  bob-col1   aaaa
  bob-col2   bbbb
  bob-col3   cccc
  bob-col4   dddd
  jon-col1   eeee
  jon-col2   ffff
  jon-col3   gggg
  jon-col4   hhhh

Short fat table using column qualifiers:

  rowkey   d:col1   d:col2   d:col3   d:col4
  bob      aaaa     bbbb     cccc     dddd
  jon      eeee     ffff     gggg     hhhh

Short fat table using column families:

  rowkey   col1:   col2:   col3:   col4:
  bob      aaaa    bbbb    cccc    dddd
  jon      eeee    ffff    gggg    hhhh
With great power comes great responsibility! How can we make this easier for users?
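A hedged sketch of writing the same datum under two of the layouts above: tall skinny with a compound rowkey versus short fat with column qualifiers. The family "d" and the helper names are illustrative, not from the talk.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaShapes {
  // Tall skinny: the attribute name is folded into the row key.
  static Put tallSkinny(String user, String col, String value) {
    Put put = new Put(Bytes.toBytes(user + "-" + col));     // e.g. "bob-col1"
    put.add(Bytes.toBytes("d"), Bytes.toBytes(""), Bytes.toBytes(value));
    return put;
  }

  // Short fat: one row per user, one qualifier per attribute.
  static Put shortFat(String user, String col, String value) {
    Put put = new Put(Bytes.toBytes(user));                  // e.g. "bob"
    put.add(Bytes.toBytes("d"), Bytes.toBytes(col), Bytes.toBytes(value));
    return put;
  }
}
```

The data is the same either way; the choice changes scan patterns, row sizes, and hotspotting behavior, which is why it is hard to change after the fact.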
Impala
• Scalable, low-latency SQL querying for HDFS (and HBase!)
• ODBC/JDBC driver interface
• Highlights: uses the Hive metastore and its Hive-HBase connector configuration conventions; native-code implementation that uses JIT compilation to optimize query execution; authorization via Kerberos support
• Open sourced by Cloudera: https://github.com/cloudera/impala
Phoenix
• A SQL skin over HBase targeting low-latency queries
• JDBC SQL interface
• Highlights: adds types; handles compound row-key encoding; secondary indexes in development; provides some pushed-down aggregations via coprocessors (a hedged JDBC sketch follows)
• Open sourced by Salesforce.com; work from James Taylor, Jesse Yates, et al.
• https://github.com/forcedotcom/phoenix
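A hedged sketch of querying through Phoenix's JDBC driver. The jdbc:phoenix:&lt;zookeeper-quorum&gt; URL, table, and columns are assumptions; the Phoenix client jar must be on the classpath so the driver registers (older releases may need an explicit Class.forName on the Phoenix driver class).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement()) {
      // Phoenix adds typed columns and a compound primary key on top of HBase.
      stmt.execute("CREATE TABLE IF NOT EXISTS metrics (host VARCHAR NOT NULL, "
          + "ts DATE NOT NULL, value DOUBLE CONSTRAINT pk PRIMARY KEY (host, ts))");
      // Aggregation is pushed down to the region servers via coprocessors.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT host, AVG(value) FROM metrics GROUP BY host")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
        }
      }
    }
  }
}
```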
Kite (nee Cloudera Development Kit/CDK)
• APIs that provide a Dataset abstraction: a get/put/delete API on Avro objects; HBase support is in progress
• Highlights: supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCatalog); provides types using Avro and Parquet formats for encoding entities; manages schema evolution
• Open sourced by Cloudera: https://github.com/kite-sdk/kite
Many apps and users in a single cluster
Multi-tenancy
Growing HBase
• Pre-0.96.0: scaling up HBase for single HBase applications; essentially a single user for a single app
• Ex: Facebook Messages: one application, many HBase clusters; users sharded to different pods
• Focused on continuity and disaster-recovery features: cross-cluster replication, table snapshots, rolling upgrades
[Chart: # of isolated applications vs. # of clusters (scalability); the pre-0.96 point is one giant application spread over multiple clusters]
Growing HBase
• In 0.96 we introduced primitives for supporting multitenancy: many users, many applications, one HBase cluster
• We need some control over the interactions different users cause
• Ex: managing MR analytics and low-latency serving in one cluster
[Chart: same axes; multitenancy moves from one giant application over multiple clusters toward many applications in one shared cluster]
Namespaces (0.96)
• Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance (a hedged API sketch follows)
[Diagram: tables grouped into namespaces blue, green, and orange]
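A minimal sketch of the 0.96 namespace API: create a namespace, then create a table inside it. The namespace, table, and family names are illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class NamespaceExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      admin.createNamespace(NamespaceDescriptor.create("blue").build());
      // Tables live inside a namespace; the fully qualified name is <namespace>:<table>.
      HTableDescriptor desc =
          new HTableDescriptor(TableName.valueOf("blue", "events"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    } finally {
      admin.close();
    }
  }
}
```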
Multitenancy goals
• Security (0.96): separate admin ACLs for different sets of tables
• Quotas (in progress): max tables, max regions
• Performance isolation (in progress): limit the performance impact that load on one table has on others
• Priority (future): handle some tables before others
Isolation with Region Server Groups
[Diagram: region assignment distribution without region server groups; regions from namespaces blue, green, and orange are spread across all region servers]
Isolation with Region Server Groups
[Diagram: region assignment distribution with region server groups (RSG); namespaces blue, green, and orange are each assigned to their own RSG]
Cell Tags
• A mechanism for attaching arbitrary metadata to cells
• Motivation: finer-grained isolation; used for Accumulo-style cell-level visibility (a hedged API sketch follows)
• Main feature for 0.98
• Other uses: sequence numbers to enable correct fast read/write recovery; potential for schema tags
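A hedged sketch of the cell-visibility use of tags, assuming the visibility-labels API added in 0.98 (the VisibilityController coprocessor must be enabled and the labels defined on the cluster); the label names are illustrative.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.security.visibility.Authorizations;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilityExample {

  // Write a cell visible only to readers holding both the "secret" and "finance" labels.
  static Put taggedPut() {
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    put.setCellVisibility(new CellVisibility("secret&finance"));
    return put;
  }

  // Scan with the authorizations this reader is entitled to use.
  static Scan authorizedScan() {
    Scan scan = new Scan();
    scan.setAuthorizations(new Authorizations("secret", "finance"));
    return scan;
  }
}
```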
Conclusions
Summary by Version
• 0.90 (CDH3): new-feature focus on Stability; MTTR: recovery in hours; Performance: baseline; Usability: requires HBase developer expertise
• 0.92 / 0.94 (CDH4): Reliability; MTTR: recovery in minutes; Performance: better throughput; Usability: requires HBase operational experience
• 0.96 (CDH5): Continuity; MTTR: recovery of writes in seconds, reads in tens of seconds; Performance: optimizing performance; Usability: requires distributed-systems admin experience
• Next (0.98 / 1.0.0): Multitenancy; MTTR: recovery in seconds (reads + writes); Performance: predictable performance; Usability: aimed at application developers' experience
Questions? @jmhsieh