HBase Accelerated: In-Memory Flush and Compaction
TRANSCRIPT
HBase Accelerated: In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov
Hadoop Summit Europe, April 13, 2016
Motivation: Dynamic Content Processing on Top of HBase
Real-time content processing pipelines
› Store intermediate results in a persistent map
› Notification mechanism is prevalent
Storage and notifications on the same platform
Sieve: Yahoo's real-time content management platform
[Diagram: Sieve pipeline with Crawl, Docproc, and Link Analysis stages connected by queues (crawl schedule, content, links) and a serving layer; processing runs on Apache Storm, storage on Apache HBase]
Notification Mechanism is Like a Sliding Window
› Small working set, but not necessarily a FIFO queue
› Short life-cycle: a message is deleted after it is processed
› High-churn workload: message state can be updated
› Frequent scans to consume messages
[Diagram: multiple producers feed messages into the store and multiple consumers process them, like a sliding window]
HBase Accelerated: Mission Definition
Goal
Real-time performance and less I/O in persistent KV-stores
How?
› Exploit redundancies in the workload to eliminate duplicates in memory
› Reduce the number of new files
› Reduce the write amplification effect (overall I/O)
› Reduce retrieval latencies
How much can we gain?
› The gain is proportional to the amount of duplication in the workload
HBase Data Model
Persistent map
› Atomic get, put, and snapshot scans
Table: a sorted set of rows
› Each row is uniquely identified by a key
Region: unit of horizontal partitioning
Column family: unit of vertical partitioning
Store: implements a column family within a region
In production, a region server (RS) can manage 100s-1000s of regions
› Regions share resources: memory, CPU, disk
[Diagram: a table of rows (key, Col1, Col2, Col3) split into regions (horizontal partitions) and column families (vertical partitions); a Store holds one column family within one region]
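To make this model concrete, here is a minimal client-side sketch using the standard HBase client API (recent 1.x/2.x class and method names); the "messages" table, the "m" column family, and the row keys are hypothetical, chosen to echo the notification use case above.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PersistentMapSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("messages"))) {   // hypothetical table

      // put: insert or update a row keyed by a message id
      Put put = new Put(Bytes.toBytes("msg-0001"));
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("state"), Bytes.toBytes("pending"));
      table.put(put);

      // get: atomic point read of that row
      Result r = table.get(new Get(Bytes.toBytes("msg-0001")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("state"))));

      // scan: the frequent "consume messages" pattern; delete each message after processing
      try (ResultScanner scanner = table.getScanner(new Scan().setLimit(100))) {
        for (Result row : scanner) {
          table.delete(new Delete(row.getRow()));   // short message life-cycle
        }
      }
    }
  }
}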
Log-Structured Merge (LSM) Key-Value Store Approach
Random writes absorbed in RAM (Cm)
› Low latency
› Built-in versioning
› Write-ahead log (WAL) for durability
Periodic batch merge to disk
› High-throughput sequential I/O
Disk components (Cd) are immutable
› Periodic merge & compact eliminates duplicate versions and tombstones
Random reads served from Cm or the Cd cache
[Diagram: random I/O is absorbed in Cm in memory; sequential I/O merges disk components C1, C2, ... into Cd]
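As a rough illustration of the merge-and-compact step, here is a minimal sketch (illustrative types, not HBase internals) that merge-sorts several components, keeps only the newest version of each key, and drops tombstones.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LsmCompactSketch {
  // Illustrative cell type: a key, a sequence id (newer = larger), a value, and a delete marker.
  record Cell(String key, long seqId, String value, boolean tombstone) {}

  static List<Cell> compact(List<List<Cell>> sortedComponents) {
    // Gather all cells, then sort by key, newest sequence id first.
    List<Cell> all = new ArrayList<>();
    sortedComponents.forEach(all::addAll);
    all.sort(Comparator.comparing(Cell::key)
                       .thenComparing(Comparator.comparingLong(Cell::seqId).reversed()));

    List<Cell> out = new ArrayList<>();
    String prevKey = null;
    for (Cell c : all) {
      if (c.key().equals(prevKey)) continue;   // older duplicate version: eliminated
      prevKey = c.key();
      if (!c.tombstone()) out.add(c);          // tombstones are eliminated as well
    }
    return out;                                // one sorted, duplicate-free component
  }
}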
HBase Writes
Random writes absorbed in the active segment (Cm)
When the active segment is full
› It becomes immutable (a snapshot)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk and the WAL is truncated
Compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: a write goes to the WAL and to Cm in memory; prepare-for-flush creates C'm, which is flushed to a new disk component Cd on HDFS; disk compaction merges components]
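A minimal sketch of this write path follows; it assumes simplified, synchronous flushing and illustrative types rather than HBase's actual MemStore classes.

import java.util.concurrent.ConcurrentSkipListMap;

public class DefaultMemStoreSketch {
  private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private final long flushSizeBytes;
  private long activeSizeBytes = 0;

  DefaultMemStoreSketch(long flushSizeBytes) { this.flushSizeBytes = flushSizeBytes; }

  synchronized void put(String key, String value) {
    appendToWal(key, value);                         // durability first
    active.put(key, value);                          // absorb the random write in memory (Cm)
    activeSizeBytes += key.length() + value.length();
    if (activeSizeBytes >= flushSizeBytes) {
      ConcurrentSkipListMap<String, String> snapshot = active;  // becomes immutable (C'm)
      active = new ConcurrentSkipListMap<>();        // a new active segment serves writes
      activeSizeBytes = 0;
      flushToDisk(snapshot);                         // sequential write of a new disk component (Cd)
      truncateWal();                                 // flushed data no longer needs the log
    }
  }

  private void appendToWal(String k, String v) { /* append a record to the write-ahead log */ }
  private void flushToDisk(ConcurrentSkipListMap<String, String> snapshot) { /* write an HFile-like file */ }
  private void truncateWal() { /* drop log entries covered by the flush */ }
}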
HBase Reads
Random reads served from Cm, C'm, or the Cd cache
When data piles up on disk
› The cache hit ratio drops
› Retrieval latency goes up
Compaction rewrites small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: reads check Cm and C'm in memory and the BlockCache; misses go to the disk components Cd on HDFS; disk compaction merges components]
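A minimal sketch of the corresponding read path (illustrative, not HBase code): a get consults the in-memory segments first, then the cache, and only then the on-disk components; the more data that lives on disk and misses the cache, the higher the retrieval latency.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class ReadPathSketch {
  private final Map<String, String> activeSegment = new ConcurrentSkipListMap<>();   // Cm
  private final Map<String, String> snapshotSegment = new ConcurrentSkipListMap<>(); // C'm
  private final Map<String, String> blockCache = new ConcurrentHashMap<>();          // cached disk blocks

  String get(String key) {
    String v = activeSegment.get(key);              // newest data: mutable in-memory segment
    if (v == null) v = snapshotSegment.get(key);    // segment prepared for flush
    if (v == null) v = blockCache.get(key);         // block cache hit avoids disk I/O
    if (v == null) v = readFromDisk(key);           // cache miss: read Cd from HDFS (slow path)
    return v;
  }

  private String readFromDisk(String key) {
    /* search the on-disk components and warm the cache */
    return null;
  }
}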
New Design: In-Memory Flush and Compaction
New compaction pipeline
› The active segment is flushed into the pipeline (in-memory flush)
› Pipeline segments are compacted in memory
› Flush to disk only when needed
[Diagram: writes go to the WAL and Cm; an in-memory flush moves the active segment into the compaction pipeline, where segments are compacted in memory; flush-to-disk writes C'm to a new disk component Cd on HDFS; the BlockCache serves reads]
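A minimal sketch of this idea, using illustrative structures rather than the actual HBASE-14920 implementation: when the active segment fills up it is moved into an in-memory pipeline instead of being written to disk, pipeline segments are periodically merged in memory (eliminating duplicate versions), and a disk flush happens only under memory pressure.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class CompactingMemStoreSketch {
  private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private final List<ConcurrentSkipListMap<String, String>> pipeline = new ArrayList<>();

  synchronized void inMemoryFlush() {
    pipeline.add(active);                          // active segment becomes an immutable pipeline segment
    active = new ConcurrentSkipListMap<>();        // a fresh active segment keeps serving writes
  }

  synchronized void inMemoryCompaction() {
    ConcurrentSkipListMap<String, String> merged = new ConcurrentSkipListMap<>();
    for (ConcurrentSkipListMap<String, String> seg : pipeline) {
      merged.putAll(seg);                          // newer segments overwrite older versions: duplicates eliminated
    }
    pipeline.clear();
    pipeline.add(merged);                          // one compacted segment replaces the pipeline contents
  }

  synchronized void flushToDiskIfNeeded(long memoryLimitBytes, long currentUsageBytes) {
    if (currentUsageBytes > memoryLimitBytes && !pipeline.isEmpty()) {
      writeToHdfs(pipeline.remove(0));             // only now pay the disk and replication I/O
    }
  }

  private void writeToHdfs(ConcurrentSkipListMap<String, String> segment) { /* write a new disk component */ }
}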
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (the compaction pipeline)
[Diagram: same architecture, with the compaction pipeline marked as the write cache and the BlockCache as the read cache]
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (the compaction pipeline)
More CPU cycles for less I/O
[Diagram: same architecture, highlighting that in-memory compaction spends CPU to save disk I/O]
Evaluation Settings: In-Memory Working Set
YCSB: compares the compacting vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 region server (RS)
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block cache
› Compacting: 192MB flush size, 36MB block cache
High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads / 50% updates, target 1,000 ops/s
› 1% (long) scans / 99% updates, target 500 ops/s
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
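For reference, the workload above maps onto YCSB CoreWorkload properties roughly as follows. This is a sketch of a workload file for the 50% read / 50% update run, not the exact file used in the experiments; in particular, fieldcount=1 with fieldlength=1024 is an assumption that matches the stated 1KB value field.

# Sketch of a YCSB workload file for the 50/50 read/update run (illustrative)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=128000
operationcount=5000000
fieldcount=1
fieldlength=1024
readproportion=0.5
updateproportion=0.5
scanproportion=0
requestdistribution=zipfian   # also run with requestdistribution=uniform
threadcount=10
target=1000                   # target throughput in ops/s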
Evaluation Results: Read Latency (Zipfian Distribution)
[Graph: read latency over time; annotations mark flush-to-disk and compaction events and the point where the data fits into the cache]
Evaluation Results: Read Latency (Uniform Distribution)
[Graph: read latency over time; an annotation marks a region split]
Evaluation Results: Scan Latency (Uniform Distribution)
Evaluation Settings: Handling Tombstones
YCSB: compares the compacting vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 region server (RS)
› 1GB heap space, MSLAB enabled (2MB chunks), 128MB flush size, 64MB block cache
› Default: minimum 4 files for compaction
› Compacting: minimum 2 files for compaction
High-churn workload, small working set with deletes
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 40% reads / 40% updates / 20% deletes (with a 50,000-update head start), target 1,000 ops/s
› 1% (long) scans / 66% updates / 33% deletes (with head start), target 500 ops/s
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
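The HBase-side settings listed above correspond to standard configuration keys; the following is a minimal sketch under the assumption that the stock knobs were used (the slides do not state the exact configuration mechanism).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class EvalConfigSketch {
  public static void main(String[] args) {
    // Standard HBase configuration keys matching the settings above;
    // whether the experiments set them exactly this way is an assumption.
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024); // 128MB flush size
    conf.setFloat("hfile.block.cache.size", 0.0625f);                      // ~64MB block cache out of a 1GB heap
    conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);         // MSLAB on (2MB chunks by default)
    conf.setInt("hbase.hstore.compaction.min", 2);                         // minimum files to start a disk compaction
  }
}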
Evaluation Results: Read Latency (Zipfian Distribution)
(totals: 2 flushes and 1 disk compaction vs. 15 flushes and 4 disk compactions)
Evaluation Results: Read Latency (Uniform Distribution)
(totals: 3 flushes and 2 disk compactions vs. 15 flushes and 4 disk compactions)
Evaluation Results: Scan Latency (Zipfian Distribution)
Work in Progress: Memory Optimizations
Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
Optimizations
› Flat layout for the index of immutable segments
› Manage (allocate, store, release) data buffers off-heap
Pros
› Locality of access to the index
› Reduced memory fragmentation
› Significantly reduced GC work
› Better utilization of memory and CPU
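A minimal sketch of what a flat, off-heap immutable segment could look like; this is illustrative only, not the HBASE-14921 code. Once a segment is immutable, its skip-list index can be replaced by a sorted array (better locality, fewer objects for the GC), and cell data can live in explicitly managed off-heap buffers.

import java.nio.ByteBuffer;
import java.util.Arrays;

public class FlatImmutableSegmentSketch {
  private final String[] sortedKeys;   // flat index: one sorted array instead of a skip-list
  private final ByteBuffer data;       // off-heap buffer holding the serialized values
  private final int[] offsets;         // offset of each value inside the buffer
  private final int[] lengths;

  FlatImmutableSegmentSketch(String[] sortedKeys, int[] offsets, int[] lengths, ByteBuffer offHeapData) {
    this.sortedKeys = sortedKeys;
    this.offsets = offsets;
    this.lengths = lengths;
    this.data = offHeapData;           // e.g. ByteBuffer.allocateDirect(...), released by its owner
  }

  byte[] get(String key) {
    int i = Arrays.binarySearch(sortedKeys, key);   // cache-friendly binary search on the flat index
    if (i < 0) return null;
    byte[] value = new byte[lengths[i]];
    ByteBuffer ro = data.asReadOnlyBuffer();        // no copy of the underlying off-heap memory
    ro.position(offsets[i]);
    ro.get(value);                                  // copy only the requested value on-heap
    return value;
  }
}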
Status
Umbrella JIRA: HBASE-14918
› HBASE-14919, HBASE-15016, HBASE-15359: infrastructure refactoring (status: committed)
› HBASE-14920: new compacting memstore (status: pre-commit)
› HBASE-14921: memory optimizations (memory layout, off-heaping) (status: under code review)
Summary
Feature intended for HBase 2.0.0
New design pros over the default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces the write amplification effect
› Less disk I/O and network traffic reduces the load on HDFS
We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu
Thank You
Evaluation Results: Write Latency