TriHUG January 2012 Talk by Chris Shain


DESCRIPTION

Intro to Apache HBase

TRANSCRIPT

HBase: An Introduction

Christopher Shain
Software Development Lead
Tresata

About Tresata

Hadoop for Financial Services
First completely Hadoop-powered analytics application
Widely recognized as “Big Data Startup to Watch”
Winner of the 2011 NCTA Award for Emerging Tech Company of the Year
Based in Charlotte, NC
We are hiring! curious@tresata.com

About Me

Software Development Lead at Tresata

Background in Financial Services IT: end-user applications, data warehousing, ETL

Email: chris@tresata.com Twitter: @chrisshain

OK, enough about me.

What is HBase?

From http://hbase.apache.org: “HBase is the Hadoop database.”

I think this is a confusing statement.

What isn’t HBase?

‘Database’, to many, means: transactions, joins, indexes
HBase has none of these (more on this later)

What is HBase, really?

HBase is a data storage platform designed to work hand-in-hand with Hadoop:
Distributed
Failure-tolerant
Semi-structured
Low latency
Strictly consistent
HDFS-aware
“NoSQL”

Where did HBase come from?

Need for a low-latency, distributed datastore with unlimited horizontal scale

Hadoop (MapReduce) doesn’t provide low-latency

Traditional RDBMS don’t scale out horizontally

History of HBase

November 2006: Google BigTable whitepaper published: http://research.google.com/archive/bigtable.html

February 2007: Initial HBase prototype
October 2007: First ‘usable’ HBase
January 2008: HBase becomes an Apache subproject of Hadoop
March 2009: HBase 0.20.0
May 10th, 2010: HBase becomes an Apache Top Level Project

Common Use Cases

Web Indexing

Social Graph

Messaging (Email etc.)

HBase Architecture

HBase is written almost entirely in Java
JVM clients are first-class citizens

[Diagram: JVM clients connect directly to the HBase Master and the RegionServers; non-JVM clients connect through a proxy (Thrift or REST).]

Data Organization in HBase

All data is stored in Tables
Table rows have exactly one Key, and all rows in a table are physically ordered by key
Tables have a fixed number of Column Families (more on this later!)
Each row can have many Columns in each column family
Each column has a set of values, each with a timestamp
Each row:family:column:timestamp combination represents coordinates for a Cell

What is a Column Family?

Defined by the Table
A Column Family is a group of related columns with its own name
All columns must be in a column family
Each row can have a completely different set of columns for a column family

Row     Column Family   Columns
Chris   Friends         Friends:Bob
Bob     Friends         Friends:Chris, Friends:James
James   Friends         Friends:Bob
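To make those coordinates concrete, here is a minimal sketch of writing one of the cells above with the HBase Java client (the current Connection/Table API rather than the 2012-era HTable API); the table name "users" is a hypothetical choice, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FriendsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table name
            // Row "Chris" gets a cell at coordinates Friends:Bob
            Put put = new Put(Bytes.toBytes("Chris"));
            put.addColumn(Bytes.toBytes("Friends"), Bytes.toBytes("Bob"), Bytes.toBytes(1));
            table.put(put);
        }
    }
}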

Rows and Cells

Not exactly the same as rows in a traditional RDBMS
Key: a byte array (usually a UTF-8 String)
Data: Cells, qualified by column family, column, and timestamp (not shown here)

Row Key:   Column Family:           Columns:                  Cells:
           (Defined by the Table)   (Defined by the Row;      (Created with Columns)
                                     may vary between rows)

Row Key   Column Family   Column              Cell Value
Chris     Attributes      Attributes:Age      30
                          Attributes:Height   68
          Friends         Friends:Bob         1 (Bob’s a cool guy)
                          Friends:Jane        0 (Jane and I don’t get along)
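Reading a cell uses the same coordinates. A small sketch with the Java client, again assuming the hypothetical "users" table and that the Age value was written with the client's Bytes helpers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table name
            // A cell is addressed by row key + column family + column qualifier
            Get get = new Get(Bytes.toBytes("Chris"));
            get.addColumn(Bytes.toBytes("Attributes"), Bytes.toBytes("Age"));
            Result result = table.get(get);
            byte[] age = result.getValue(Bytes.toBytes("Attributes"), Bytes.toBytes("Age"));
            System.out.println("Age: " + Bytes.toInt(age)); // assumes the value was written with Bytes.toBytes(int)
        }
    }
}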

Cell Timestamps

All cells are created with a timestamp
Column family defines how many versions of a cell to keep
Updates always create a new cell
Deletes create a tombstone (more on that later)
Queries can include an “as-of” timestamp to return point-in-time values
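Since the version count is a column family setting, it is declared when the family is created. A sketch using the modern Admin API (the 2012-era API used HTableDescriptor/HColumnDescriptor instead); the table name, family name, and version count are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateVersionedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // The column family, not the table, decides how many cell versions to retain
            ColumnFamilyDescriptor prices = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("Prices"))
                    .setMaxVersions(5)   // keep up to 5 timestamped versions per cell
                    .build();
            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("ticks")) // hypothetical table name
                    .setColumnFamily(prices)
                    .build();
            admin.createTable(table);
        }
    }
}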

Deletes

HBase deletes are a form of write called a “tombstone”

Indicates that “beyond this point any previously written value is dead”

Old values can still be read using point-in-time queries

Timestamp   Write Type        Resulting Value   Point-In-Time Value “as of” T+1
T + 0       PUT (“Foo”)       “Foo”             “Foo”
T + 1       PUT (“Bar”)       “Bar”             “Bar”
T + 2       DELETE            <none>            “Bar”
T + 3       PUT (“Foo Too”)   “Foo Too”         “Bar”
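A hedged sketch of the sequence above with the Java client: explicit timestamps on the Puts, a Delete tombstone, and a time-ranged Get. Whether the old value stays visible to as-of queries also depends on the column family retaining deleted cells (KEEP_DELETED_CELLS); the table and family names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneExample {
    public static void main(String[] args) throws Exception {
        byte[] fam = Bytes.toBytes("f");
        byte[] col = Bytes.toBytes("c");
        byte[] row = Bytes.toBytes("r1");
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) { // hypothetical table/family names
            long t = System.currentTimeMillis();
            table.put(new Put(row).addColumn(fam, col, t, Bytes.toBytes("Foo")));      // T + 0
            table.put(new Put(row).addColumn(fam, col, t + 1, Bytes.toBytes("Bar")));  // T + 1
            table.delete(new Delete(row).addColumns(fam, col, t + 2));                 // T + 2: tombstone

            // Point-in-time read "as of" T + 1 (the max timestamp of the range is exclusive)
            Get asOf = new Get(row).addColumn(fam, col);
            asOf.setTimeRange(0, t + 2);
            Result r = table.get(asOf);
            // With KEEP_DELETED_CELLS enabled on the family this can still return "Bar";
            // otherwise the tombstone masks the earlier versions
            System.out.println(Bytes.toString(r.getValue(fam, col)));
        }
    }
}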

Use Case – Time Series

Requirement: Store real-time stock tick data

Requirement: Accommodate many simultaneous readers & writers

Requirement: Allow for reading of current price for any ticker at any point in time

Ticker   Timestamp      Sequence   Bid      Ask
IBM      09:15:03:001   1          179.16   179.18
MSFT     09:15:04:112   2          28.25    28.27
GOOG     09:15:04:114   3          624.94   624.99
IBM      09:15:04:155   4          179.18   179.19

Time Series Use Case – RDBMS Solution:

Historical Prices:

Keys   Column            Data Type
PK     Ticker            Varchar
PK     Timestamp         DateTime
PK     Sequence_Number   Integer
       Bid_Price         Decimal
       Ask_Price         Decimal

Latest Prices:

Keys   Column      Data Type
PK     Ticker      Varchar
       Bid_Price   Decimal
       Ask_Price   Decimal

Time Series Use Case – HBase Solution:

Row Key: [Ticker].[Rev_Timestamp].[Rev_Sequence_Number]
Family:Columns: Prices:Bid, Prices:Ask

HBase throughput will scale linearly with the number of nodes
No need to keep a separate “latest price” table: a scan starting at “ticker” will always return the latest price row
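A sketch of how such a row key might be built and scanned with the Java client. The key format (fixed-width reversed longs), the table name, and the string-encoded prices are illustrative assumptions, not part of the original deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TickKeyExample {
    // Reversed timestamp and sequence so that newer ticks sort first within a ticker
    static byte[] tickKey(String ticker, long timestampMillis, long sequence) {
        return Bytes.toBytes(String.format("%s.%019d.%019d",
                ticker, Long.MAX_VALUE - timestampMillis, Long.MAX_VALUE - sequence));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ticks"))) { // hypothetical table name
            // Write one tick
            Put put = new Put(tickKey("IBM", System.currentTimeMillis(), 4));
            put.addColumn(Bytes.toBytes("Prices"), Bytes.toBytes("Bid"), Bytes.toBytes("179.18"));
            put.addColumn(Bytes.toBytes("Prices"), Bytes.toBytes("Ask"), Bytes.toBytes("179.19"));
            table.put(put);

            // Because timestamps are reversed, the first row under the "IBM." prefix is the latest price
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("IBM."));
            scan.setLimit(1);
            try (ResultScanner scanner = table.getScanner(scan)) {
                Result latest = scanner.next();
                System.out.println(Bytes.toString(
                        latest.getValue(Bytes.toBytes("Prices"), Bytes.toBytes("Ask"))));
            }
        }
    }
}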

HBase Data Partitioning

HBase scales horizontally

Needs to split data over many RegionServers

Regions are the unit of scale

Regions & RegionServers

All HBase tables are broken into 1 or more regions

Regions have a start row key and an end row key

Each Region lives on exactly one RegionServer

RegionServers may host many Regions

When RegionServers die, Master detects this and assigns Regions to other RegionServers

Region Distribution

.META. Table:

Table   Region                  RegionServer
Users   “Aaron” – “George”      Node01
Users   “George” – “Matthew”    Node02
Users   “Matthew” – “Zachary”   Node01

“Users” Table:

Row Keys in Region “Aaron” – “George”:     “Aaron”, “Bob”, “Chris”
Row Keys in Region “George” – “Matthew”:   “George”
Row Keys in Region “Matthew” – “Zachary”:  “Matthew”, “Nancy”, “Zachary”

HBase Architecture

Deceptively simple

HBase Architecture

[Diagram: JVM clients connect directly to the HBase Master and the RegionServers; non-JVM clients connect through a proxy (Thrift or REST). A backup HBase Master and a ZooKeeper cluster complete the picture.]

Component Roles & Responsibilities

ZooKeeper: keeps track of which server is the current HBase Master

HBase Master: keeps track of the Region/RegionServer mapping; manages the -ROOT- and .META. tables; responsible for updating ZooKeeper when these change

More Component Roles & Responsibilities

RegionServer: stores table regions

Clients: need to be smarter than RDBMS clients. They first connect to ZooKeeper to get the RegionServer for a given Table/Region, then connect directly to the RegionServer to interact with the data. All connections are over Hadoop RPC; non-JVM clients use a proxy (Thrift or REST (Stargate)).
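In code, a JVM client only needs the ZooKeeper quorum to bootstrap; region locations are looked up from there. A minimal sketch, assuming hypothetical ZooKeeper host names and a hypothetical "users" table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ClientBootstrap {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Usually read from hbase-site.xml; set explicitly here for illustration
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // From here on the connection caches region locations and talks to RegionServers directly
            System.out.println("Connected to table: " + table.getName());
        }
    }
}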

HBase Catalog Tables

-ROOT- Table
  Row key: .META.[region]
  Columns: info:regioninfo, info:server, info:serverstartcode
  Points to the RegionServer hosting the .META. region.

.META. Table
  Row key: [table],[region start key],[region id]
  Columns: info:regioninfo, info:server, info:serverstartcode
  Points to the RegionServer hosting each user table region.

Regular User Table
  ... whatever ...

HBase Master Reliability

The HBase Master is not necessarily a single point of failure (SPOF)
Multiple masters can be running
The current ‘active’ Master is controlled via ZooKeeper
Make sure you have enough ZooKeeper nodes!

The Master is not needed for client connectivity
Clients connect directly to ZooKeeper to find Regions
Everything the Master does can be put off until one is elected

HBase Master & ZooKeeper

[Diagram: a ZooKeeper quorum of three ZooKeeper nodes tracks which HBase Master is current; one Master is active and two are standbys.]

HBase RegionServer Reliability

HBase tolerates RegionServer failure when running on HDFS
Data is replicated by HDFS (the dfs.replication setting)
Lots of issues around fsync and failure before data is flushed; some are probably still not fixed
Thus, data can still be lost if a node fails after a write

The HDFS NameNode is still a SPOF, even for HBase

RegionServer Write Ahead Log

Similar to the log in many RDBMSs

All operations by default written to log before considered ‘committed’ (can be overridden for ‘disposable fast writes’)

Log can be replayed when region is moved to another RegionServer

One WAL per RegionServer

Writes

[Diagram: writes go first to the WAL, then to the MemStore, and eventually into HFiles; flushing happens periodically (10s by default) and whenever the MemStore gets too big.]
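The ‘disposable fast writes’ override mentioned above corresponds to skipping the WAL on a per-mutation basis. A sketch with the current client API (Durability.SKIP_WAL; the 2012-era call was setWriteToWAL(false)); the table and column names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DisposableWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("value"));
            // "Disposable fast write": skip the WAL entirely;
            // the data is lost if the RegionServer dies before the MemStore is flushed
            put.setDurability(Durability.SKIP_WAL);
            table.put(put);
        }
    }
}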

HBase RegionServer & HDFS

[Diagram: a RegionServer hosts a Region made up of Stores; each Store has a MemStore and StoreFiles (HFiles). The Region's log and HFiles are written through an HDFS client as blocks replicated across several HDFS DataNodes.]

HBase data locality on HDFS

A RegionServer is not guaranteed to be on the same physical node as its data
Compaction causes the RegionServer to write preferentially to the local node, but this is a function of the HDFS client, not HBase

Flushes and Compactions

All data is in memory initially (memstore)
HBase is a write-only system: modifications and deletes are just writes with later timestamps
This is a function of HDFS being append-only
Eventually old writes need to be discarded
2 types of compactions: Minor and Major

Flushes

All HBase edits are initially stored in memory (memstore)

Flushes occur when the memstore reaches a certain size
By default 67,108,864 bytes (64 MB)
Controlled by the hbase.hregion.memstore.flush.size configuration property

Each flush creates a new HFile
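This threshold is normally set in hbase-site.xml; a small sketch showing the same property set programmatically on the client-side Configuration (the 128 MB figure is just an illustrative choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSizeConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Raise the flush threshold from the deck's 64 MB default to 128 MB
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        System.out.println(conf.getLong("hbase.hregion.memstore.flush.size", -1));
    }
}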

Minor Compaction

Triggered when a certain number of HFiles have been created for a given Region Store (plus some other conditions)
By default 3 HFiles; controlled by the hbase.hstore.compactionThreshold configuration property

Compacts the most recent HFiles into one
By default, writes to the RegionServer-local HDFS node

Does not eliminate deletes; only touches the most recent HFiles

NOTE: All column families are compacted at once (this might change in the future)

Major Compaction

Triggered every 24 hours (with a random offset) or manually
Large HBase installations usually leave this to manual operation

Re-writes all HFiles into one

Processes deletes: eliminates tombstones and erases earlier entries
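Manual triggering can be done from the HBase shell (major_compact) or the Admin API. A sketch using the Java Admin API with a hypothetical table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ManualMajorCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Request a major compaction of every region of the table;
            // the work runs asynchronously on the RegionServers
            admin.majorCompact(TableName.valueOf("ticks")); // hypothetical table name
        }
    }
}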

Guarantees

HBase does not have transactions. However:

Row-level modifications are atomic: all modifications to a row will succeed or fail as a unit
Gets are consistent for a given point in time, but Scans may return 2 rows from different points in time
All data read has been ‘durably stored’, but that does NOT mean flushed to disk; it can still be lost!

Appendix

Do's and Don'ts: Schema Design

DO: Design your schema for linear range scans on your most common queries. Scans are the most efficient way to query a lot of rows quickly.

DON'T: Use more than 2 or 3 column families. Some operations (flushing and compacting) operate on the whole row.

DO: Be aware of the relative cardinality of column families. Wildly differing cardinality leads to sparsity and bad scanning results.

Do's and Don'ts: Schema Design

DO: Be mindful of the size of your row and column keys. They are used in indexes and queries and can be quite large!

DON'T: Use monotonically increasing row keys. Can lead to hotspots on writes.

DO: Store timestamp keys in reverse. Rows in a table are read in order, and usually you want the most recent first.

Do's and Don'ts: Queries

DO: Query single rows using exact-match on key (Gets) or Scans for multiple rows. Scans allow efficient I/O vs. multiple Gets.

DON'T: Use regex-based or non-prefix column filters. Very inefficient.

DO: Tune the scan cache and batch size parameters. Drastically improves performance when returning lots of rows (see the sketch below).
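A sketch of a tuned scan with the Java client; the table name, start row, and caching/batch values are illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TunedScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ticks"))) { // hypothetical table name
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes("IBM.")); // linear range scan starting at a key prefix
            scan.setCaching(500);                     // rows fetched per RPC round trip
            scan.setBatch(100);                       // max columns per Result, for very wide rows
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process each row here
                }
            }
        }
    }
}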


Use Case – User Preferences

Requirement: Store an arbitrary set of preferences for all users
Requirement: Each user may choose to store a different set of preferences
Requirement: Preferences may be of different data types (Strings, Integers, etc.)
Requirement: Developers will add new preference options all the time, so we shouldn't need to modify the database structure when adding them

User Preferences Use Case – Relational Solution:

One possible RDBMS solution: a Key/Value table, with all values as strings. Flexible, but wastes space.

Keys   Column            Data Type
PK     UserID            Int
       PreferenceName    Varchar
       PreferenceValue   Varchar

User Preferences Use Case – HBase Solution:

Store all preferences in the Preferences column family
Preference name as column name, preference value as a (serialized) byte array
The HBase client library provides methods for serializing many common data types

Row Key   Family        Column      Value
Chris     Preferences   Age         30
Chris     Preferences   Hometown    “Mineola, NY”
Joe       Preferences   Birthdate   11/13/1987
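A sketch of writing those preference rows with the Java client, using the Bytes helpers to serialize an int and Strings; the "users" table name is a hypothetical choice:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.Arrays;

public class PreferencesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table name
            byte[] prefs = Bytes.toBytes("Preferences");
            // New preference "columns" can be added at any time without changing the table definition
            Put chris = new Put(Bytes.toBytes("Chris"));
            chris.addColumn(prefs, Bytes.toBytes("Age"), Bytes.toBytes(30));                 // int
            chris.addColumn(prefs, Bytes.toBytes("Hometown"), Bytes.toBytes("Mineola, NY")); // String
            Put joe = new Put(Bytes.toBytes("Joe"));
            joe.addColumn(prefs, Bytes.toBytes("Birthdate"), Bytes.toBytes("11/13/1987"));
            table.put(Arrays.asList(chris, joe));
        }
    }
}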
