Scaling HBase for Big Data
TRANSCRIPT
Ranjeeth Kathiresan
Senior Software Engineer, Salesforce
Gurpreet Multani
Principal Software Engineer, Salesforce
Introduction
Ranjeeth Kathiresan is a Senior Software Engineer at Salesforce, where he
focuses primarily on improving the performance, scalability, and availability of
applications by assessing and tuning the server-side components in terms of
code, design, configuration, and so on, particularly with Apache HBase.
Ranjeeth is an admirer of performance engineering and is especially fond of
tuning an application to perform better.
Gurpreet Multani is a Principal Software Engineer at Salesforce, where he has
led initiatives to scale Big Data technologies such as Apache HBase, Apache
Solr, and Apache Kafka. He is particularly interested in finding ways to
optimize code to reduce bottlenecks, consume fewer resources, and achieve more
out of the available capacity in the process.
Agenda
• HBase @ Salesforce
• CAP Theorem
• HBase Refresher
• Typical HBase Use Cases
• HBase Internals
• Data Loading Use Case
• Write Bottlenecks
• Tuning Writes
• Best Practices
• Q&A
HBase @ Salesforce
Typical cluster:
• Data volume: 120 TB
• Nodes across all clusters: 2200+
Variety of use cases:
• Simple row store
• Denormalization
• Messaging
• Event log
• Analytics
• Metrics
• Graphs
• Cache
CAP Theorem
It is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
• Consistency: all clients have the same view of the data
• Availability: each client can always read and write
• Partition tolerance: the system works well despite physical network partitions
RDBMS systems sit on the consistency/availability side, Cassandra on the availability/partition-tolerance side, and HBase on the consistency/partition-tolerance side.
HBase Refresher
• Distributed database
• Non-relational
• Column-oriented
• Supports compression
• In-memory operations
• Bloom filters on a per-column basis
• Written in Java
• Runs on top of HDFS
“A sparse, distributed, persistent, multidimensional, sorted map”
Typical HBase Use Cases
• Large data volume running into at least hundreds of GBs or more (aka Big Data)
• Data access patterns are well known at design time and are not expected to change, i.e., no secondary indexes/joins need to be added at a later stage
• RDBMS-like multi-row transactions are not required
• Large "working set" of data (working set = data being accessed or updated)
• Multiple versions of data
HBase Internals: Write Operation
1. The client gets the .META. location from ZooKeeper
2. The client gets the region location from .META.
3. The client sends the Put to the region server
4. The region server writes the change to the WAL
5. The region server writes the change to the Memstore of the target region's store
When a Memstore fills up, it is flushed to a new HFile on HDFS.
HBase Internals: Compaction
The main purpose of compaction is to optimize read performance by reducing the number of disk seeks.
Minor Compaction
• Trigger: automatic, based on configuration
• Mechanism: reads a configurable number of smaller HFiles and writes them into a single larger HFile
Major Compaction
• Trigger: scheduled or manual
• Mechanism: reads all HFiles of a region and writes them to a single large HFile
• Physically deletes records
• Tries to achieve high data locality
HBase Internals: Read Operation
1. The client gets the .META. location from ZooKeeper
2. The client gets the region location from .META.
3. The client sends the Get to the region server
4. The region server reads from the Block Cache
5. On a miss, it reads from the Memstore
6. On a miss, it reads from the HFiles on HDFS
• One of the use cases is to store and process data in text format
• Lookups from HBase using the row key are the most efficient
• A subset of the data is stored in Solr to enable effective lookups into HBase
Data Loading Overview
Salesforce Application → Extract → Transform → Load → Data Insights

Key details about the data used for processing:
• Volume: 200 GB data size per cycle; 3300 GB HBase data size per cycle; 250MM records/day
• Velocity: 500 MB data influx/min; 600K records/min; throughput SLA of 175K records/min
• Variety: text, CSV, and JSON data formats
Write Operation Bottlenecks
• Influx rate: 600K records/min
• Write rate: 60K records/min
• Write operation in progress for >3 days
• Write rate dropped to <5K records/min after a few hours
Write Operation Tunings
Improved throughput by ~8x, achieving ~3x more than the required throughput.
• Initial throughput: 60K records/min
• Achieved throughput: 480K records/min
Tunings applied: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
Region Hot Spotting
Outline: region hot spotting refers to overutilizing a single region server, despite having multiple nodes in the cluster, during write operations, because sequential row keys route every write to the same region.
Scenario: the node owning the current key range is overloaded ("Hey buddy! I'm overloaded") while the other nodes sit idle ("Not our turn, yet!").
Impact: utilization over time is concentrated on Node1, while Node2 and Node3 stay idle.
Write Operation Tunings
Salting
Outline: salting helps distribute writes over multiple regions by spreading row keys across the key space.
How do I implement salting?
Salting is implemented by defining the row keys wisely: add a salt prefix (a random character or two) to the original key.
Two common ways of salting:
• Adding a random number as a prefix, based on modulo
• Hashing the row key
Salting: Prefixing a Random Number
The prefix can be derived with a modulo operation between the insertion index and the total number of buckets:
Salted Key = (++index % total buckets) + "_" + Original Key
Example with 3 salt buckets for keys 1000–1005:
• Bucket 1: 0_1000, 0_1003
• Bucket 2: 1_1001, 1_1004
• Bucket 3: 2_1002, 2_1005
Key points:
• Randomness is provided only to some extent, as it depends on the insertion order
• Salted keys stored in HBase won't be visible to the client during lookups
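The modulo scheme above can be sketched in a few lines (Python used for illustration; the bucket count and the "prefix_underscore_key" format are assumptions, not the exact production scheme):

```python
# Modulo-based salting sketch: spread sequential keys across N buckets.
TOTAL_BUCKETS = 3

def salt_key(index, original_key, total_buckets=TOTAL_BUCKETS):
    """Prefix the row key with (insertion index % bucket count)."""
    return f"{index % total_buckets}_{original_key}"

keys = ["1000", "1001", "1002", "1003", "1004", "1005"]
salted = [salt_key(i, k) for i, k in enumerate(keys)]
print(salted)  # → ['0_1000', '1_1001', '2_1002', '0_1003', '1_1004', '2_1005']

def lookup_candidates(original_key, total_buckets=TOTAL_BUCKETS):
    # The salt prefix is not derivable from the key alone, so a point
    # lookup has to fan out one Get per bucket.
    return [f"{b}_{original_key}" for b in range(total_buckets)]
```

This makes the slide's trade-off concrete: writes round-robin across buckets, but the client must try every prefix when reading a key back.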
Salting: Hashing the Row Key
Hashing the entire row key, or prefixing a few characters of its hash, can also implement salting:
Salted Key = hash(Original Key) OR firstNChars(hash(Original Key)) + "_" + Original Key
Example with 3 salt buckets: the hashed keys (AtNB/q.., B50SP.., e8aRjL.., ggEw9.., w56syI.., xwer51..) spread evenly across Buckets 1–3.
Key points:
• Randomness in the row key is ensured by the hash values
• HBase lookups remain effective, as the same hashing function can be used during lookup
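A hash-based variant might look like this sketch (MD5 and a 2-character prefix are arbitrary choices for illustration, not the scheme used in the talk; any stable hash works):

```python
import hashlib

def hash_salted_key(original_key, prefix_chars=2):
    """Prefix the key with the first few characters of its hash."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}_{original_key}"

# The same function is applied at write time and at lookup time, so a
# read stays a single-row Get — no per-bucket fan-out as with modulo.
key = hash_salted_key("1000")
```

The design advantage over modulo salting is that the prefix is a pure function of the key, which is why lookups remain effective.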
Salting
Salting alone does not resolve region hot spotting for the entire write cycle.
Reason: HBase creates only one region by default and relies on the default auto-split policy to create more regions, so initially all the salted keys still land in a single region (Bucket 1).
Does it help? Not yet: utilization over time is still concentrated on a single node until the table splits.
Write Operation Tunings
Pre-Splitting
Outline: pre-splitting creates multiple regions at table creation time, which lets writes reap the benefits of salting immediately.
How do I pre-split an HBase table?
Provide split points during table creation.
Example: create 'table_name', 'cf_name', SPLITS => ['a', 'm']
This yields three regions, and the hashed keys (AtNB/q.., B50SP.., e8aRjL.., ggEw9.., w56syI.., xwer51..) spread across them:
• Bucket 1 ['' → 'a']
• Bucket 2 ['a' → 'm']
• Bucket 3 ['m' → '']
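For the modulo-salting scheme shown earlier, split points can be derived directly from the bucket count; this sketch assumes numeric prefixes like "0_", "1_" (an assumption to match the earlier example, not the talk's exact keys):

```python
# With N salt buckets, N-1 split points give each prefix its own region.
def split_points(total_buckets):
    """Return the boundary keys between consecutive salt prefixes."""
    return [f"{b}_" for b in range(1, total_buckets)]

print(split_points(3))  # → ['1_', '2_']
```

The resulting list maps onto the shell syntax above, e.g. create 'table_name', 'cf_name', SPLITS => ['1_', '2_'].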
Pre-Splitting
Scenario: without pre-splitting, a single node is overloaded ("Hey buddy! I'm overloaded").
Improvement: with salting plus pre-splitting, utilization over time is spread evenly across Node1, Node2, and Node3. Go regions!!!
Optimization Benefit: Salting + Pre-Splitting
Throughput improvement:
• Current throughput: 60K records/min
• Improved throughput: 150K records/min
Write Operation Tunings
Configuration Tuning
Outline: default configurations may not work for all use cases; we need to tune them based on our own. (Is it a 9? No, it's a 6: it depends on your point of view.)
The following are the key configurations we tuned for our write use case:
• hbase.regionserver.handler.count — number of threads in the region server used to process read and write requests (Increased)
• hbase.hregion.memstore.flush.size — the Memstore is flushed to disk after reaching the size given in this configuration (Increased)
• hbase.hstore.blockingStoreFiles — flushes are blocked until compaction reduces the number of HFiles to this value (Increased)
• hbase.hstore.blockingWaitTime — maximum time for which clients are blocked from writing to HBase (Decreased)
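As a sketch, these tunings land in hbase-site.xml; the values below are illustrative assumptions (the talk does not give the exact production numbers), with sizes in bytes and times in milliseconds:

```xml
<!-- Illustrative hbase-site.xml fragment; values are assumptions. -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>50</value> <!-- increased from the default of 10 -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB, increased from the 128 MB default -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>30</value> <!-- increased from the default of 10 -->
</property>
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>30000</value> <!-- 30 secs, decreased from the 90-sec default -->
</property>
```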
Configuration Tuning: Region Server Handler Count
Region server handlers (default count: 10) process client read and write requests across all regions on the server.
Tuning benefit:
• Increasing the count can improve throughput by increasing concurrency
• Rule of thumb: low for high payloads, high for low payloads
Caution:
• Can increase heap utilization, eventually leading to OOM
• High GC pauses, impacting throughput
Configuration Tuning: Region Memstore Size
A background thread checks each Memstore against the flush size (default: 128 MB) and flushes it to an HFile on HDFS when the threshold is reached.
Tuning benefit:
• A larger Memstore generates larger HFiles, which minimizes compaction impact and improves throughput
Caution:
• Can increase heap utilization, eventually leading to OOM
• High GC pauses, impacting throughput
Configuration Tuning: HStore Blocking Store Files
If a store accumulates more HFiles than this limit (default: 10), writes to the region are blocked until compaction reduces the file count.
Tuning benefit:
• Increasing the limit lets clients write more with fewer pauses, improving throughput
Caution:
• Compaction can take longer, as more files can be written without blocking the client
Configuration Tuning: HStore Blocking Wait Time
The time for which writes on a region are blocked when the store-file limit is hit (default: 90 secs).
Tuning benefit:
• Decreasing the wait time lets clients write more with fewer pauses, improving throughput
Caution:
• Compaction can take longer, as more files can be written without blocking the client
Optimization Benefit: Optimal Configuration
Throughput improvement:
• Current throughput: 150K records/min
• Improved throughput: 260K records/min
Also: reduced resource utilization.
Write Operation Tunings
Optimal Read vs. Write Consistency Check: Multi-Version Concurrency Control
Multi-Version Concurrency Control (MVCC) is used to achieve row-level ACID properties in HBase.
Write steps with MVCC are described at the source: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and
Optimal Read vs. Write Consistency Check
Issue: MVCC got stuck after a few hours of write operation, drastically hurting write throughput, because each row has 140+ columns.
Scenario: the write point has a lot to write, and the read point has a lot to catch up on; the read point has to catch up with the write point to avoid a high delay between read and write versions.
Impact: throughput (records/min) collapses over time.
Optimal Read vs. Write Consistency Check
Solution: reduce the pressure on MVCC by storing all 140+ columns in a single cell.
Column representation: instead of separate columns col1=abc, col2=def, col3=ghi, store one cell:
{"col1":"abc","col2":"def","col3":"ghi"}
Improvement: throughput (records/min) stays steady over time.
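A sketch of the single-cell idea (Python and the json module used for illustration; column names are hypothetical, and the real pipeline may use a different serialization):

```python
import json

def pack_row(columns):
    """Serialize all columns into one JSON blob for a single cell."""
    return json.dumps(columns, sort_keys=True)

def unpack_row(cell_value):
    """Recover the column map from the stored cell."""
    return json.loads(cell_value)

row = {"col1": "abc", "col2": "def", "col3": "ghi"}
cell_value = pack_row(row)
# One Put now carries one cell instead of 140+ cells per row, so MVCC
# tracks one version per row write instead of one per column.
print(cell_value)  # → {"col1": "abc", "col2": "def", "col3": "ghi"}
```

The trade-off is that any read or update now touches the whole blob rather than individual columns.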
Optimization Benefit: Optimal MVCC
• Stability improvement
• Steady resource utilization
Storage is one of the important factors impacting scalability of HBase cluster
Write operation throughput is mainly dependent on the average row size as it is an I/O bound
process. Optimizing the storage will help us to achieve more throughput.
Example:
Having a Column Name as “BA_S1” instead of “Billing_Address_Street_Line_Number_One” will help in
reducing the storage and improve write throughput
Column Name #Characters Additional Bytes to each Row
Billing_Address_Street_Line_Number_One 39 78
BA_S1 5 10
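The saving is easy to estimate, since HBase stores the qualifier name with every cell, so it repeats on every row; this sketch (the per-day projection uses the 250MM records/day figure from the overview, and counts name length with len()) shows the effect:

```python
# Estimate the bytes saved per row by shortening a qualifier name.
def per_row_savings(old_name, new_name):
    return len(old_name) - len(new_name)

saved = per_row_savings("Billing_Address_Street_Line_Number_One", "BA_S1")
daily_records = 250_000_000  # 250MM records/day, from the talk
print(f"{saved} bytes/row, ~{saved * daily_records / 1e9:.1f} GB/day")
```

A byte or two per cell sounds trivial until it is multiplied by hundreds of millions of rows per day.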
Write Operation Tunings
Compression
Compression is one of the storage optimization techniques.
Commonly used compression algorithms in HBase:
• Snappy
• Gzip
• LZO
• LZ4
Compression ratio: Gzip's compression ratio is better than Snappy's and LZO's.
Resource consumption: Snappy consumes fewer resources for compression and decompression than Gzip.
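As a rough feel for why this text-format data compresses so well, the sketch below uses stdlib gzip as a stand-in for HBase's block compression on made-up repetitive JSON rows (the sample data and ratio are illustrative, not the talk's measurements):

```python
import gzip
import json

# Repetitive JSON rows: the column names repeat in every row.
rows = [json.dumps({"col1": "abc", "col2": "def", "col3": str(i)})
        for i in range(1000)]
raw = "\n".join(rows).encode("utf-8")
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)}B gzip={len(compressed)}B ratio={ratio:.1f}x")
```

Highly repetitive payloads like these compress far better than random data, which is consistent with the ~71% storage reduction reported on the next slide.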
Optimization Benefit: Compression
Reduced storage costs and improved throughput:
• Storage: 3300 GB before, 963 GB after (~71% improvement)
• Throughput: 260K records/min before, 380K records/min after (~42% improvement)
Write Operation Tunings
Row Size Optimization
A few of the 140+ columns were empty for most rows, and storing empty columns in the JSON blob increases the average row size.
Solution: avoid storing empty values when using JSON.
Example (Salesforce scenario):
Data: {"col1":"abc", "col2":"def", "col3":"", "col4":"", "col5":"ghi"}
After removing empty values, data in HBase: {"col1":"abc", "col2":"def", "col5":"ghi"}
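A minimal sketch of the fix (illustrative column names), dropping empty values before serializing the cell:

```python
import json

def drop_empty(columns):
    """Keep only columns with non-empty values."""
    return {k: v for k, v in columns.items() if v != ""}

data = {"col1": "abc", "col2": "def", "col3": "", "col4": "", "col5": "ghi"}
stored = json.dumps(drop_empty(data))
print(stored)  # → {"col1": "abc", "col2": "def", "col5": "ghi"}
```

The read path simply treats a missing key as an empty value, so no information is lost.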
Optimization Benefit: Row Size Optimization
Reduced storage costs and improved throughput:
• Storage: 963 GB before, 784 GB after (~19% improvement)
• Throughput: 380K records/min before, 480K records/min after (~26% improvement)
Recap
• Write throughput: initial 60K records/min; achieved 480K records/min (SLA: 175K records/min)
• Data format: text; influx rate: 500 MB/min
• Optimizations: salting, pre-splitting, config tuning, optimal MVCC, compression, optimal row size
• Benefits: reduced storage, improved stability, reduced resource utilization
Best Practices
Row key design
• Know your data well before pre-splitting
• Keep the row key short, but long enough for data access
Minimize IO
• Use a small number of column families
• Use short column family and qualifier names
Locality
• Review the locality of regions periodically
• Co-locate the region server and the data node
Maximize throughput
• Minimize major compactions
• Use high-throughput disks
When HBase?
HBase is for you:
• Random read/write access to high volumes of data in real time
• No dependency on RDBMS features
• Variable schema with the flexibility to add columns
• Single-key or key-range lookups on denormalized data
• Multiple versions of Big Data
HBase is NOT for you:
• As a replacement for an RDBMS
• Low data volume
• Scanning and aggregation over large volumes of data
• As a replacement for batch processing engines like MapReduce/Spark
Q & A
Thank You