HBase Accelerated: In-Memory Flush and Compaction
TRANSCRIPT
HBase Accelerated: In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov
Hadoop Summit Europe, April 13, 2016
Motivation: Dynamic Content Processing on Top of HBase
Real-time content processing pipelines
› Store intermediate results in a persistent map
› Notification mechanism is prevalent
Storage and notifications on the same platform
Sieve: Yahoo's real-time content management platform
[Diagram: Sieve pipeline with Crawl, Docproc, and Link Analysis stages connected by queues (crawl schedule, content, links) and a serving layer; processing runs on Apache Storm, storage on Apache HBase]
Notification Mechanism is Like a Sliding Window
› Small working set, but not necessarily a FIFO queue
› Short life-cycle: a message is deleted after it is processed
› High-churn workload: message state can be updated
› Frequent scans to consume messages
[Diagram: multiple producers feed messages into the store and multiple consumers process them, like a sliding window]
HBase Accelerated: Mission Definition
Goal
Real-time performance and less I/O in persistent KV-stores
How?
› Exploit redundancies in the workload to eliminate duplicates in memory
› Reduce the number of new files
› Reduce the write amplification effect (overall I/O)
› Reduce retrieval latencies
How much can we gain?
› The gain is proportional to the amount of duplication in the workload
HBase Data Model
Persistent map
› Atomic get, put, and snapshot scans
Table: a sorted set of rows
› Each row is uniquely identified by a key
Region: unit of horizontal partitioning
Column family: unit of vertical partitioning
Store: implements a column family within a region
In production, a region server (RS) can manage 100s-1000s of regions
› Regions share resources: memory, CPU, disk
[Diagram: a table of rows (key, Col1, Col2, Col3) split into regions (horizontal partitions) and column families (vertical partitions); a Store holds one column family within one region]
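To make this model concrete, here is a minimal client-side sketch using the standard HBase client API (recent 1.x/2.x class and method names); the "messages" table, the "m" column family, and the row keys are hypothetical, chosen to echo the notification use case above.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PersistentMapSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("messages"))) {   // hypothetical table

      // put: insert or update a row keyed by a message id
      Put put = new Put(Bytes.toBytes("msg-0001"));
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("state"), Bytes.toBytes("pending"));
      table.put(put);

      // get: atomic point read of that row
      Result r = table.get(new Get(Bytes.toBytes("msg-0001")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("state"))));

      // scan: the frequent "consume messages" pattern; delete each message after processing
      try (ResultScanner scanner = table.getScanner(new Scan().setLimit(100))) {
        for (Result row : scanner) {
          table.delete(new Delete(row.getRow()));   // short message life-cycle
        }
      }
    }
  }
}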
Log-Structured Merge (LSM) Key-Value Store Approach
Random writes absorbed in RAM (Cm)
› Low latency
› Built-in versioning
› Write-ahead log (WAL) for durability
Periodic batch merge to disk
› High-throughput sequential I/O
Disk components (Cd) are immutable
› Periodic merge & compact eliminates duplicate versions and tombstones
Random reads served from Cm or the Cd cache
[Diagram: random I/O is absorbed in Cm in memory; sequential I/O merges disk components C1, C2, ... into Cd]
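As a rough illustration of the merge-and-compact step, here is a minimal sketch (illustrative types, not HBase internals) that merge-sorts several components, keeps only the newest version of each key, and drops tombstones.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LsmCompactSketch {
  // Illustrative cell type: a key, a sequence id (newer = larger), a value, and a delete marker.
  record Cell(String key, long seqId, String value, boolean tombstone) {}

  static List<Cell> compact(List<List<Cell>> sortedComponents) {
    // Gather all cells, then sort by key, newest sequence id first.
    List<Cell> all = new ArrayList<>();
    sortedComponents.forEach(all::addAll);
    all.sort(Comparator.comparing(Cell::key)
                       .thenComparing(Comparator.comparingLong(Cell::seqId).reversed()));

    List<Cell> out = new ArrayList<>();
    String prevKey = null;
    for (Cell c : all) {
      if (c.key().equals(prevKey)) continue;   // older duplicate version: eliminated
      prevKey = c.key();
      if (!c.tombstone()) out.add(c);          // tombstones are eliminated as well
    }
    return out;                                // one sorted, duplicate-free component
  }
}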
HBase Writes
Random writes absorbed in the active segment (Cm)
When the active segment is full
› It becomes immutable (a snapshot)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk and the WAL is truncated
Compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: a write goes to the WAL and to Cm in memory; prepare-for-flush creates C'm, which is flushed to a new disk component Cd on HDFS; disk compaction merges components]
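A minimal sketch of this write path follows; it assumes simplified, synchronous flushing and illustrative types rather than HBase's actual MemStore classes.

import java.util.concurrent.ConcurrentSkipListMap;

public class DefaultMemStoreSketch {
  private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private final long flushSizeBytes;
  private long activeSizeBytes = 0;

  DefaultMemStoreSketch(long flushSizeBytes) { this.flushSizeBytes = flushSizeBytes; }

  synchronized void put(String key, String value) {
    appendToWal(key, value);                         // durability first
    active.put(key, value);                          // absorb the random write in memory (Cm)
    activeSizeBytes += key.length() + value.length();
    if (activeSizeBytes >= flushSizeBytes) {
      ConcurrentSkipListMap<String, String> snapshot = active;  // becomes immutable (C'm)
      active = new ConcurrentSkipListMap<>();        // a new active segment serves writes
      activeSizeBytes = 0;
      flushToDisk(snapshot);                         // sequential write of a new disk component (Cd)
      truncateWal();                                 // flushed data no longer needs the log
    }
  }

  private void appendToWal(String k, String v) { /* append a record to the write-ahead log */ }
  private void flushToDisk(ConcurrentSkipListMap<String, String> snapshot) { /* write an HFile-like file */ }
  private void truncateWal() { /* drop log entries covered by the flush */ }
}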
HBase Reads
Random reads served from Cm, C'm, or the Cd cache
When data piles up on disk
› The cache hit ratio drops
› Retrieval latency goes up
Compaction rewrites small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: reads check Cm and C'm in memory and the BlockCache; misses go to the disk components Cd on HDFS; disk compaction merges components]
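A minimal sketch of the corresponding read path (illustrative, not HBase code): a get consults the in-memory segments first, then the cache, and only then the on-disk components; the more data that lives on disk and misses the cache, the higher the retrieval latency.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class ReadPathSketch {
  private final Map<String, String> activeSegment = new ConcurrentSkipListMap<>();   // Cm
  private final Map<String, String> snapshotSegment = new ConcurrentSkipListMap<>(); // C'm
  private final Map<String, String> blockCache = new ConcurrentHashMap<>();          // cached disk blocks

  String get(String key) {
    String v = activeSegment.get(key);              // newest data: mutable in-memory segment
    if (v == null) v = snapshotSegment.get(key);    // segment prepared for flush
    if (v == null) v = blockCache.get(key);         // block cache hit avoids disk I/O
    if (v == null) v = readFromDisk(key);           // cache miss: read Cd from HDFS (slow path)
    return v;
  }

  private String readFromDisk(String key) {
    /* search the on-disk components and warm the cache */
    return null;
  }
}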
New Design: In-Memory Flush and Compaction
New compaction pipeline
› The active segment is flushed into the pipeline (in-memory flush)
› Pipeline segments are compacted in memory
› Flush to disk only when needed
[Diagram: writes go to the WAL and Cm; an in-memory flush moves the active segment into the compaction pipeline, where segments are compacted in memory; flush-to-disk writes C'm to a new disk component Cd on HDFS; the BlockCache serves reads]
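A minimal sketch of this idea, using illustrative structures rather than the actual HBASE-14920 implementation: when the active segment fills up it is moved into an in-memory pipeline instead of being written to disk, pipeline segments are periodically merged in memory (eliminating duplicate versions), and a disk flush happens only under memory pressure.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class CompactingMemStoreSketch {
  private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private final List<ConcurrentSkipListMap<String, String>> pipeline = new ArrayList<>();

  synchronized void inMemoryFlush() {
    pipeline.add(active);                          // active segment becomes an immutable pipeline segment
    active = new ConcurrentSkipListMap<>();        // a fresh active segment keeps serving writes
  }

  synchronized void inMemoryCompaction() {
    ConcurrentSkipListMap<String, String> merged = new ConcurrentSkipListMap<>();
    for (ConcurrentSkipListMap<String, String> seg : pipeline) {
      merged.putAll(seg);                          // newer segments overwrite older versions: duplicates eliminated
    }
    pipeline.clear();
    pipeline.add(merged);                          // one compacted segment replaces the pipeline contents
  }

  synchronized void flushToDiskIfNeeded(long memoryLimitBytes, long currentUsageBytes) {
    if (currentUsageBytes > memoryLimitBytes && !pipeline.isEmpty()) {
      writeToHdfs(pipeline.remove(0));             // only now pay the disk and replication I/O
    }
  }

  private void writeToHdfs(ConcurrentSkipListMap<String, String> segment) { /* write a new disk component */ }
}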
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (the compaction pipeline)
[Diagram: same architecture, with the compaction pipeline marked as the write cache and the BlockCache as the read cache]
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (the compaction pipeline)
More CPU cycles for less I/O
[Diagram: same architecture, highlighting that in-memory compaction spends CPU to save disk I/O]
Evaluation Settings: In-Memory Working Set
YCSB: compares the compacting vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 region server (RS)
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block cache
› Compacting: 192MB flush size, 36MB block cache
High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads / 50% updates, target 1,000 ops/s
› 1% (long) scans / 99% updates, target 500 ops/s
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
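For reference, the workload above maps onto YCSB CoreWorkload properties roughly as follows. This is a sketch of a workload file for the 50% read / 50% update run, not the exact file used in the experiments; in particular, fieldcount=1 with fieldlength=1024 is an assumption that matches the stated 1KB value field.

# Sketch of a YCSB workload file for the 50/50 read/update run (illustrative)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=128000
operationcount=5000000
fieldcount=1
fieldlength=1024
readproportion=0.5
updateproportion=0.5
scanproportion=0
requestdistribution=zipfian   # also run with requestdistribution=uniform
threadcount=10
target=1000                   # target throughput in ops/s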
Evaluation Results: Read Latency (Zipfian Distribution)
[Graph: read latency over time; annotations mark flush-to-disk and compaction events and the point where the data fits into the cache]
Evaluation Results: Read Latency (Uniform Distribution)
[Graph: read latency over time; an annotation marks a region split]
Evaluation Results: Scan Latency (Uniform Distribution)
Evaluation Settings: Handling Tombstones
YCSB: compares the compacting vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 region server (RS)
› 1GB heap space, MSLAB enabled (2MB chunks), 128MB flush size, 64MB block cache
› Default: minimum 4 files for compaction
› Compacting: minimum 2 files for compaction
High-churn workload, small working set with deletes
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 40% reads / 40% updates / 20% deletes (with a 50,000-update head start), target 1,000 ops/s
› 1% (long) scans / 66% updates / 33% deletes (with head start), target 500 ops/s
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
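The HBase-side settings listed above correspond to standard configuration keys; the following is a minimal sketch under the assumption that the stock knobs were used (the slides do not state the exact configuration mechanism).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class EvalConfigSketch {
  public static void main(String[] args) {
    // Standard HBase configuration keys matching the settings above;
    // whether the experiments set them exactly this way is an assumption.
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024); // 128MB flush size
    conf.setFloat("hfile.block.cache.size", 0.0625f);                      // ~64MB block cache out of a 1GB heap
    conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);         // MSLAB on (2MB chunks by default)
    conf.setInt("hbase.hstore.compaction.min", 2);                         // minimum files to start a disk compaction
  }
}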
Evaluation Results: Read Latency (Zipfian Distribution)
(totals: 2 flushes and 1 disk compaction vs. 15 flushes and 4 disk compactions)
Evaluation Results: Read Latency (Uniform Distribution)
(totals: 3 flushes and 2 disk compactions vs. 15 flushes and 4 disk compactions)
Evaluation Results: Scan Latency (Zipfian Distribution)
Work in Progress: Memory Optimizations
Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
Optimizations
› Flat layout for the index of immutable segments
› Manage (allocate, store, release) data buffers off-heap
Pros
› Locality of access to the index
› Reduced memory fragmentation
› Significantly reduced GC work
› Better utilization of memory and CPU
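A minimal sketch of what a flat, off-heap immutable segment could look like; this is illustrative only, not the HBASE-14921 code. Once a segment is immutable, its skip-list index can be replaced by a sorted array (better locality, fewer objects for the GC), and cell data can live in explicitly managed off-heap buffers.

import java.nio.ByteBuffer;
import java.util.Arrays;

public class FlatImmutableSegmentSketch {
  private final String[] sortedKeys;   // flat index: one sorted array instead of a skip-list
  private final ByteBuffer data;       // off-heap buffer holding the serialized values
  private final int[] offsets;         // offset of each value inside the buffer
  private final int[] lengths;

  FlatImmutableSegmentSketch(String[] sortedKeys, int[] offsets, int[] lengths, ByteBuffer offHeapData) {
    this.sortedKeys = sortedKeys;
    this.offsets = offsets;
    this.lengths = lengths;
    this.data = offHeapData;           // e.g. ByteBuffer.allocateDirect(...), released by its owner
  }

  byte[] get(String key) {
    int i = Arrays.binarySearch(sortedKeys, key);   // cache-friendly binary search on the flat index
    if (i < 0) return null;
    byte[] value = new byte[lengths[i]];
    ByteBuffer ro = data.asReadOnlyBuffer();        // no copy of the underlying off-heap memory
    ro.position(offsets[i]);
    ro.get(value);                                  // copy only the requested value on-heap
    return value;
  }
}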
Status
Umbrella JIRA: HBASE-14918
› HBASE-14919, HBASE-15016, HBASE-15359: infrastructure refactoring (status: committed)
› HBASE-14920: new compacting memstore (status: pre-commit)
› HBASE-14921: memory optimizations (memory layout, off-heaping) (status: under code review)
Summary
Feature intended for HBase 2.0.0
New design pros over the default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces the write amplification effect
› Less disk I/O and network traffic reduces the load on HDFS
We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu
Thank You
Evaluation Results: Write Latency