A Survey of HBase Application Archetypes

Apache HBase Application Archetypes. Lars George | @larsgeorge | Cloudera EMEA Chief Architect | HBase PMC. Jonathan Hsieh | @jmhsieh | Cloudera HBase Tech Lead | HBase PMC. HBaseCon 2014, May 5th, 2014.

Upload: hbasecon

Post on 10-Sep-2014

Category:

Software


DESCRIPTION

Speakers: Lars George and Jon Hsieh (Cloudera). Today, there are hundreds of production HBase clusters running a multitude of applications and use cases. Many well-known implementations exercise opposite ends of HBase's capabilities, emphasizing either entity-centric schemas or event-based schemas. This talk presents these archetypes and others based on a use-case survey of clusters conducted by Cloudera's development, product, and services teams. By analyzing the data from the nearly 20,000 HBase cluster nodes Cloudera has under management, we'll categorize HBase users and their use cases into a few simple archetypes, describe workload patterns, and quantify the usage of advanced features. We'll also explain what an HBase user can do to alleviate pressure points from these fundamentally different workloads, and these results will provide insight into what lies in HBase's future.

TRANSCRIPT

Page 1: A Survey of HBase Application Archetypes

5/5/14 HBaseCon 2014; Lars George, Jon Hsieh

Apache HBase Application Archetypes

• Lars George | @larsgeorge | Cloudera EMEA Chief Architect | HBase PMC
• Jonathan Hsieh | @jmhsieh | Cloudera HBase Tech Lead | HBase PMC
• HBaseCon 2014, May 5th, 2014

Page 2: A Survey of HBase Application Archetypes

About Lars and Jon

Lars George
• EMEA Chief Architect @Cloudera
• Apache HBase PMC
• O’Reilly Author of HBase – The Definitive Guide
• Contact: [email protected], @larsgeorge

Jon Hsieh
• Tech Lead HBase Team @Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact: [email protected], @jmhsieh

Page 3: A Survey of HBase Application Archetypes

About Supporting HBase at Cloudera

• Supporting customers using HBase since 2011
  • HBase Training
  • Professional Services
• Team has experience supporting and running HBase since 2009
  • 8 committers on staff
  • 2 HBase book authors
• As of Jan 2014, ~20,000 HBase nodes (in aggregate) under management
• Information in this presentation is either aggregated customer data or from public sources.

Page 4: A Survey of HBase Application Archetypes

An Apache HBase Timeline

[Timeline figure, axis 2008 to 2015, with these entries:]
• Summer ‘09: StumbleUpon goes production on HBase ~0.20
• Apr ’11: CDH3 GA with HBase 0.90.1
• Summer ‘11: Messages on HBase
• Summer ‘11: Web Crawl Cache
• Sept ’11: HBase TDG published
• Nov ‘11: Cassini on HBase
• May ‘12: HBaseCon 2012
• Nov ’12: HBase in Action published
• Jan ‘13: Phoenix on HBase
• Jun ‘13: HBaseCon 2013
• Aug ‘13: Flurry 1k-1k node cluster replication
• Jan ’14: Cloudera has ~20k HBase nodes under management
• May ‘14: HBaseCon 2014
• Summer ‘14: HBase v1.0.0 released

Page 5: A Survey of HBase Application Archetypes

Apache HBase “Nascar” Slide

Page 6: A Survey of HBase Application Archetypes

Outline

• Definitions
• Archetypes
  • The Good
  • The Bad
  • The Maybe
• Conclusion

Page 7: A Survey of HBase Application Archetypes

Definitions

A vocabulary for HBase Archetypes

Page 8: A Survey of HBase Application Archetypes

Defining HBase Archetypes

• There are a lot of HBase applications
  • Some successful, some less so
  • They have common architecture patterns
  • They have common tradeoffs
• Archetypes are common architecture patterns
  • Common across multiple use cases
  • Extracted to be repeatable
• Our goal: define patterns à la “Gang of Four” (Gamma, Helm, Johnson, Vlissides)

Page 9: A Survey of HBase Application Archetypes

So you want to use HBase?

• What data is being stored?
  • Entity data
  • Event data
• Why is the data being stored?
  • Operational use cases
  • Analytical use cases
• How does the data get in and out?
  • Real time vs. batch
  • Random vs. sequential

Page 10: A Survey of HBase Application Archetypes

What is being stored?

There are primarily two kinds of big data workloads. They have different storage requirements.

• Entities
• Events

Page 11: A Survey of HBase Application Archetypes

Entity Centric Data

• Entity data is information about current state
  • Generally real-time reads and writes
• Examples:
  • Accounts
  • Users
  • Geolocation points
  • Click counts and metrics
  • Current sensor readings
• Scales up with # of humans and # of machines/sensors
  • Billions of distinct entities

Page 12: A Survey of HBase Application Archetypes

Event Centric Data

• Event-centric data are time-series data points recording successive points spaced over time intervals.
  • Generally real-time write, some combination of real-time read or batch read
• Examples:
  • Sensor data over time
  • Historical stock ticker data
  • Historical metrics
  • Clicks time series
• Scales up due to finer-grained intervals, retention policies, and the passage of time

Page 13: A Survey of HBase Application Archetypes

Events about Entities

• The majority of Big Data use cases deal with event-based data
  • |Entities| * |Events| = Big data
• When you ask questions, do you hone in on the entity first?
• When you ask questions, do you hone in on time ranges first?
• Your answer will help you determine where and how to store your data.

Page 14: A Survey of HBase Application Archetypes

Why are you storing the data?

• So what kind of questions are you asking the data?
• Entity-centric questions
  • Give me everything about entity e
  • Give me the most recent event v about entity e
  • Give me the n most recent events V about entity e
  • Give me all events V about e between time [t1,t2]
• Event and time-centric questions
  • Give me an aggregate on each entity between time [t1,t2]
  • Give me an aggregate on each time interval for entity e
  • Find events V that match some other given criteria
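
The entity-centric questions map naturally onto a sorted key space. The following is a minimal sketch, not code from the talk: it assumes a composite row key of entity ID plus a reversed timestamp, and uses a sorted Python list as a stand-in for an HBase table. The names `row_key`, `most_recent`, and `MAX_TS` are hypothetical. It illustrates why "the n most recent events about entity e" can be answered with a short prefix scan.

```python
import bisect

MAX_TS = 2**63 - 1  # assumed upper bound for timestamps (illustrative)

def row_key(entity: str, ts: int) -> bytes:
    """Entity first, then a reversed timestamp so newer events sort first."""
    return entity.encode() + b"\x00" + (MAX_TS - ts).to_bytes(8, "big")

def decode_ts(key: bytes) -> int:
    """Recover the original timestamp from the last 8 bytes of the key."""
    return MAX_TS - int.from_bytes(key[-8:], "big")

def most_recent(sorted_keys, entity: str, n: int):
    """Prefix scan over a sorted key list, stopping after n matching rows."""
    prefix = entity.encode() + b"\x00"
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if len(out) == n or not key.startswith(prefix):
            break
        out.append(key)
    return out

# The sorted list stands in for HBase's lexicographically ordered rows.
events = [("user1", 100), ("user2", 150), ("user1", 300), ("user1", 200)]
table = sorted(row_key(e, t) for e, t in events)
latest_two = [decode_ts(k) for k in most_recent(table, "user1", 2)]
```

A time-first key would instead favor the time-range questions, at the cost of scattering each entity's events across the key space.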

Page 15: A Survey of HBase Application Archetypes

How does data get in and out of HBase?

[Diagram: data paths in and out of HBase. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Gets, Short scan), HBase Scanner (Full Scan, MapReduce), HBase Replication.]

Page 16: A Survey of HBase Application Archetypes

How does data get in and out of HBase?

[Diagram: data paths in and out of HBase, annotated by latency and throughput. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 17: A Survey of HBase Application Archetypes

What system is most efficient?

• It is all physics
  • You have a limited I/O budget
  • Use all your I/O by parallelizing access and reading/writing sequentially.
  • Choose the system and features that reduce I/O in general
• Pick the systems best for your workload

[Figure label: IOPs/s/disk]

Page 18: A Survey of HBase Application Archetypes

The physics of Hadoop Storage Systems

[Comparison table, Workload: HBase vs. HDFS]
• Low latency. HBase: ms, cached. HDFS: mins, MR; + seconds, Impala.
• Random Read. HBase: primary index. HDFS: - index?, small files problem.
• Short Scan. HBase: sorted. HDFS: + partition.
• Full Scan. HBase: 0 live table; + (MR on snapshots). HDFS: MR, Hive, Impala.
• Random Write. HBase: log structured. HDFS: - not supported.
• Sequential Write. HBase: HBase overhead; bulk load. HDFS: minimal overhead.
• Updates. HBase: log structured. HDFS: - not supported.

Page 21: A Survey of HBase Application Archetypes

The Archetypes

HBase Applications

Page 22: A Survey of HBase Application Archetypes

HBase application use cases

• The Good
  • Simple Entities
  • Messaging Store
  • Graph Store
  • Metrics Store
• The Bad
  • Large Blobs
  • Naïve RDBMS port
  • Analytic Archive
• The Maybe
  • Time series DB
  • Combined workloads

Page 23: A Survey of HBase Application Archetypes

Archetypes: The Good

HBase, you are my soul mate.

Page 24: A Survey of HBase Application Archetypes

Archetype: Simple Entities

• Purely entity data, no relation between entities
  • Batch or real-time, random writes
  • Real-time, random reads
  • Could be a well-done denormalized RDBMS port.
  • Often from many different sources, with poly-structured data
• Schema:
  • Row per entity
  • Row key => entity ID, or hash of entity ID
  • Col qualifier => property / field, possibly time stamp
• Geolocation data
• Search index building
  • Use Solr to make text data searchable.
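
The "hash of entity ID" row key option can be illustrated with a small sketch. This is an assumption-laden toy, not the speakers' code: a Python dict stands in for the table, and `hashed_row_key`, `put`, and `get` are hypothetical helpers. The key is prefixed with a few bytes of an MD5 digest so that sequential entity IDs spread across the key space, while each entity's properties live in one row as column qualifiers.

```python
import hashlib

def hashed_row_key(entity_id: str, prefix_len: int = 4) -> bytes:
    """Prefix the entity ID with a few digest bytes to spread keys evenly."""
    digest = hashlib.md5(entity_id.encode()).digest()
    return digest[:prefix_len] + b"|" + entity_id.encode()

table = {}  # row key -> {column qualifier: value}; stand-in for an HBase table

def put(entity_id: str, qualifier: str, value: str) -> None:
    """Write one property of an entity into its row."""
    table.setdefault(hashed_row_key(entity_id), {})[qualifier] = value

def get(entity_id: str) -> dict:
    """Random read: the key is recomputed from the entity ID alone."""
    return table.get(hashed_row_key(entity_id), {})

put("user-42", "name", "Ada")
put("user-42", "geo:lat", "52.52")
put("user-43", "name", "Grace")
```

The trade-off in this sketch is that hashing gives up ordered scans over entity IDs in exchange for an even spread of random reads and writes.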

Page 25: A Survey of HBase Application Archetypes

Simple Entities access pattern

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication to Solr.]

Page 26: A Survey of HBase Application Archetypes

Archetype: Messaging Store

• Messaging data:
  • Real-time random writes: emails, SMS, MMS, IM
  • Real-time random updates: message read, starred, moved, deleted
  • Reading of top-N entries, sorted by time
  • Records are of varying size
  • Some time series, but mostly random read/write
• Schema:
  • Row = users/feed/inbox
  • Row key = UID or UID + time
  • Column qualifier = time or conversation ID + time.
  • Use column families (CFs) for indexes.
• Examples:
  • Facebook Messages, Xiaomi Messages
  • Telco SMS/MMS services
  • Feeds like Tumblr, Pinterest
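
A rough sketch of the messaging schema, under stated assumptions: one row per user, with a column qualifier built from a reversed timestamp plus a message ID so that the newest messages sort first within the row. The nested dict stands in for an HBase row, and `deliver`, `mark_read`, and `top_n` are hypothetical names, not part of any HBase API.

```python
MAX_TS = 2**63 - 1  # assumed upper bound for timestamps (illustrative)

def qualifier(ts: int, msg_id: str) -> bytes:
    """Reversed timestamp + message ID: newer messages sort first."""
    return (MAX_TS - ts).to_bytes(8, "big") + msg_id.encode()

inbox = {}  # row key (user ID) -> {qualifier: cell}

def deliver(uid: str, ts: int, msg_id: str, body: str) -> None:
    """Real-time random write of a new message."""
    inbox.setdefault(uid, {})[qualifier(ts, msg_id)] = {"body": body, "read": False}

def mark_read(uid: str, ts: int, msg_id: str) -> None:
    """Real-time random update of an existing message."""
    inbox[uid][qualifier(ts, msg_id)]["read"] = True

def top_n(uid: str, n: int) -> list:
    """Top-N read: first n qualifiers in sorted order are the newest."""
    row = inbox.get(uid, {})
    return [row[q]["body"] for q in sorted(row)[:n]]

deliver("u1", 100, "m1", "hello")
deliver("u1", 300, "m3", "newest")
deliver("u1", 200, "m2", "middle")
mark_read("u1", 300, "m3")
```

In HBase the qualifiers within a row are already kept sorted, so the `sorted` call here only mimics what the store would provide.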

Page 27: A Survey of HBase Application Archetypes

Facebook Messages - Statistics

Source: HBaseCon 2012 - Anshuman Singh

Page 28: A Survey of HBase Application Archetypes

Messages Access Pattern

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 29: A Survey of HBase Application Archetypes

Archetype: Graph Data

• Graph data: all entities and relations
  • Batch or real-time, random writes
  • Batch or real-time, random reads
  • It's an entity with relation edges
• Schema:
  • Row = node.
  • Row key => node ID.
  • Col qualifier => edge ID, or properties:values
• Examples:
  • Web caches: Yahoo!, Trend Micro
  • Titan Graph DB with HBase storage backend
  • Sessionization (financial transactions, click streams, network traffic)
  • Government (connect the bad guy)
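
The graph schema (row = node, column qualifier = edge ID or property) can be sketched as an adjacency structure. This is only an illustration with hypothetical names (`add_edge`, `set_property`, `neighbors`, `two_hop`) and an assumed `edge:` / `prop:` qualifier prefix; a dict of dicts stands in for the table.

```python
graph = {}  # row key (node ID) -> {column qualifier: value}

def add_edge(src: str, dst: str, label: str) -> None:
    """Store the edge as a column on the source node's row."""
    graph.setdefault(src, {})[f"edge:{dst}"] = label
    graph.setdefault(dst, {})  # make sure the target node has a row

def set_property(node: str, name: str, value: str) -> None:
    """Node properties share the row with the edges."""
    graph.setdefault(node, {})[f"prop:{name}"] = value

def neighbors(node: str) -> list:
    """One random read of the node's row yields all outgoing edges."""
    row = graph.get(node, {})
    return sorted(q[len("edge:"):] for q in row if q.startswith("edge:"))

def two_hop(node: str) -> list:
    """A traversal is one row read per visited node."""
    seen = set()
    for n in neighbors(node):
        seen.update(neighbors(n))
    seen.discard(node)
    return sorted(seen)

add_edge("a", "b", "follows")
add_edge("a", "c", "follows")
add_edge("b", "d", "follows")
set_property("a", "name", "Alice")
```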

Page 30: A Survey of HBase Application Archetypes

Graph Data Access Pattern

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 31: A Survey of HBase Application Archetypes

Archetype: Metrics

• Frequently updated metrics
  • Increments
  • Roll-ups generated by MR and bulk loaded to HBase
  • Poor man’s datacubes
• Examples:
  • Campaign impression/click counts (ad tech)
  • Sensor data (energy, manufacturing, auto)
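
To make the increment and roll-up idea concrete, here is a small sketch under assumptions of my own: counters are keyed by a hypothetical `campaign:hour` row key and an event-type qualifier, and a batch function derives daily roll-ups from the hourly cells. In a real deployment the roll-up step would be a MapReduce job whose output is bulk loaded, which this toy does not attempt to model.

```python
from collections import defaultdict

counters = defaultdict(int)  # (row key, qualifier) -> count

def increment(campaign: str, event: str, ts_s: int, amount: int = 1) -> None:
    """Bump the counter cell for the hour bucket containing ts_s."""
    hour = ts_s - ts_s % 3600
    counters[(f"{campaign}:{hour}", event)] += amount

def rollup_daily(campaign: str, event: str) -> dict:
    """Batch roll-up: fold hourly cells into per-day totals."""
    daily = defaultdict(int)
    for (row, qual), count in counters.items():
        name, hour = row.rsplit(":", 1)
        if name == campaign and qual == event:
            daily[int(hour) - int(hour) % 86400] += count
    return dict(daily)

increment("c1", "click", 10)
increment("c1", "click", 3700)
increment("c1", "click", 90000)
increment("c1", "impression", 20)
clicks_per_day = rollup_daily("c1", "click")
```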

Page 32: A Survey of HBase Application Archetypes

Metrics Access Pattern

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 33: A Survey of HBase Application Archetypes

Archetypes: The Bad

These are not the droids you are looking for

Page 34: A Survey of HBase Application Archetypes

Current HBase weak spots

• HBase’s architecture can handle a lot
  • We make engineering trade-offs to optimize for them.
  • HBase can still do things it is not optimal for.
  • However, other systems are fundamentally more efficient for some workloads.
• We’ve often seen some folks forcing apps into HBase.
  • If one of these is your only workload on this data, use another system
  • If you are in a mixed workload case, some of these become “maybes”.
• Just because it is not good today doesn’t mean it can’t be better tomorrow.

Page 35: A Survey of HBase Application Archetypes

Bad Archetype: Large Blob Store

• Saving large objects >3MB per cell
• Schema:
  • Normal entity pattern, but with some columns with large cells.
• Examples:
  • Raw photo or video storage in HBase
  • Large frequently updated structs as a single cell
• Problems:
  • Will get crushed due to write amplification when reoptimizing data for read (compactions on large unchanging data).
  • Will crush the write pipeline if there are large structs with frequently updated subfields. Cells are atomic, and HBase must rewrite an entire cell.
• Some work adding LOB support
  • This requires new architecture elements

Page 36: A Survey of HBase Application Archetypes

Bad Archetype: Naïve RDBMS port

• A naïve port of an RDBMS onto HBase, directly copying the schema.
• Schema:
  • Many tables, just like an RDBMS schema.
  • Row key: primary key or auto-incrementing key, like the RDBMS schema
  • Column qualifiers: field names
  • Manually do joins, or secondary indexes (not consistent)
• Solution:
  • HBase is not a SQL database.
  • No multi-region/multi-table transactions in HBase (yet).
  • You must denormalize your schema to use HBase.
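
As a hedged illustration of what "denormalize your schema" can mean, the sketch below folds a two-table relational layout (hypothetical `users` and `orders` tables) into one wide row per user, so that a single row read replaces a client-side join. The qualifier naming scheme (`info:`, `order:`) is an assumption for the example, not a prescribed convention.

```python
# Normalized, RDBMS-style input: two tables linked by user_id.
users = {1: {"name": "Ann"}}
orders = [
    {"order_id": 10, "user_id": 1, "item": "book"},
    {"order_id": 11, "user_id": 1, "item": "pen"},
]

def denormalize(users: dict, orders: list) -> dict:
    """Fold child rows into the parent's row as extra column qualifiers."""
    rows = {}
    for uid, user in users.items():
        rows[f"user:{uid}"] = {f"info:{k}": v for k, v in user.items()}
    for order in orders:
        row = rows[f"user:{order['user_id']}"]
        row[f"order:{order['order_id']}:item"] = order["item"]
    return rows

wide_rows = denormalize(users, orders)
```

Because all of a user's data sits in one row, updates to it fall under HBase's single-row consistency, which sidesteps the missing multi-table transactions noted above.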

Page 37: A Survey of HBase Application Archetypes

Large blob store, Naïve RDBMS port access patterns

[Diagram: HBase data paths for these archetypes. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 38: A Survey of HBase Application Archetypes

Bad Archetype: Analytic archive

• Store purely chronological data, partitioned by time
  • Real-time writes, chronological time as primary index
  • Column-centric aggregations over all rows.
  • Bulk reads out, generally for generating periodic reports
• Schema:
  • Row key: date+xxx or salt+date+xxx
  • Column qualifiers: properties with data or counters
• Examples:
  • Machine logs organized by date.
  • Full-fidelity clickstream

Page 39: A Survey of HBase Application Archetypes

Bad Archetype: Analytic archive Problems

• HBase is non-optimal when this is the primary use case.
  • Will get crushed by frequent full table scans.
  • Will get crushed by large compactions.
  • Will get crushed by write-side region hot spotting.
• Instead:
  • Store in HDFS; use Parquet columnar data storage + Impala/Hive
  • Build rollups in HDFS+MR; store and serve rollups in HBase
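
Write-side region hot spotting with date-leading keys can be shown with a toy model. Everything here is assumed for illustration: regions are modeled as key ranges delimited by split points, `NUM_BUCKETS` is an arbitrary salt count, and the salt is a CRC32 of the event ID. The sketch compares where one day's writes land for a `date+xxx` key and for a `salt+date+xxx` key.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # assumed number of salt buckets / pre-split regions

def date_key(date: str, event_id: str) -> str:
    """date+xxx: all of today's writes share one key prefix."""
    return f"{date}:{event_id}"

def salted_key(date: str, event_id: str) -> str:
    """salt+date+xxx: a deterministic salt spreads writes over buckets."""
    salt = zlib.crc32(event_id.encode()) % NUM_BUCKETS
    return f"{salt}:{date}:{event_id}"

def region_of(key: str, split_points: list) -> int:
    """Index of the key range (region) that contains the key."""
    return sum(1 for split in split_points if key >= split)

event_ids = [f"evt{i}" for i in range(100)]
date_splits = ["2014-05-02", "2014-05-03", "2014-05-04"]
salt_splits = ["1:", "2:", "3:"]

date_load = Counter(
    region_of(date_key("2014-05-05", e), date_splits) for e in event_ids
)
salt_load = Counter(
    region_of(salted_key("2014-05-05", e), salt_splits) for e in event_ids
)
```

In this model every date-keyed write lands in the last region, while salted writes are spread over several. The cost is on the read side: a time-range query over salted keys has to scan every bucket.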

Page 40: A Survey of HBase Application Archetypes

Analytic Archive access patterns

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 41: A Survey of HBase Application Archetypes

Archetypes: The Maybe

And this is crazy | But here’s my data, | serve it, maybe!

Page 42: A Survey of HBase Application Archetypes

The Maybes

• For some applications, doing it right gets complicated.
• These more sophisticated or nuanced cases require considering these questions:
  • When do you choose HBase vs. HDFS storage for time series data?
  • Are there times when bad archetypes are OK?

Page 43: A Survey of HBase Application Archetypes

Time Series: in HBase or HDFS?

• IO patterns:
  • Reads: colocate related data
    • Make reads cheap and fast.
  • Writes: spread writes out as much as possible
    • Maximize write throughput
• HBase: tension between these goals
  • Spreading writes spreads data, making reads inefficient
  • Colocating on write causes hotspots and underutilizes resources by limiting write throughput
• HDFS: the sweet spot.
  • Sequential writes and sequential reads.
  • Just write more files in date-dirs; this physically spreads writes but logically groups data.
  • Reads for time-centric queries just read files in the date-dir
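
The date-dir layout can be sketched as a pure path computation. The root path `/data/events` and the `YYYY/MM/DD` directory scheme are assumptions chosen for the example; the point is only that a time-range query resolves to a short list of directories to read sequentially.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

DAY_S = 86400  # seconds per day

def date_dir(root: str, ts_s: int) -> PurePosixPath:
    """Directory that holds all files for the UTC day containing ts_s."""
    d = datetime.fromtimestamp(ts_s, tz=timezone.utc)
    return PurePosixPath(root) / f"{d:%Y}" / f"{d:%m}" / f"{d:%d}"

def dirs_for_range(root: str, start_s: int, end_s: int) -> list:
    """Directories a time-range query must read: one per day touched."""
    first_day = start_s - start_s % DAY_S
    return [str(date_dir(root, t)) for t in range(first_day, end_s + 1, DAY_S)]

first = str(date_dir("/data/events", 0))
two_days = dirs_for_range("/data/events", 0, 2 * DAY_S - 1)
```

Many writers can append separate files under the same day directory, which is how this layout spreads writes physically while keeping the data logically grouped.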

Page 44: A Survey of HBase Application Archetypes

Time Series data flows

• Ingest
  • Flume or similar direct tool via app
• HDFS
  • Batch queries and generate rollups in Hive/MR
  • Faster queries in Impala
  • No user time serving
  • HBase for recent, HDFS for historical
• HBase
  • Serve individual events
  • Serve pre-computed aggregates

Page 45: A Survey of HBase Application Archetypes

Archetype: Entity Time Series

• A time series access pattern suitable for HBase
  • Random write to event data, random read of specific event or aggregate data
  • Generate aggregates via counters; don’t directly compute aggregates on query
  • HBase is the system of record
• Schema:
  • Row key: entity-timestamp or hash(entity)-timestamp, possibly with salt added after entity.
  • Col qualifiers: property
  • Use custom aggregation to consolidate old data
  • Use TTLs to bound and age off old data
• Examples:
  • OpenTSDB does this well for numeric values; it lazily aggregates cells for better performance.
  • Facebook Insights, ODS
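
A minimal sketch of the entity-timestamp row key, with assumptions stated up front: keys are the entity ID, a zero byte, and a big-endian timestamp; a sorted list stands in for the table; and `expire` only imitates the effect of a TTL (in HBase, TTL is a column family setting enforced by the store, not client code). The helper names are hypothetical.

```python
import bisect

def row_key(entity: str, ts_s: int) -> bytes:
    """Entity first, then a big-endian timestamp so rows sort by time."""
    return entity.encode() + b"\x00" + ts_s.to_bytes(8, "big")

def key_ts(key: bytes) -> int:
    """Timestamp encoded in the last 8 bytes of the key."""
    return int.from_bytes(key[-8:], "big")

def scan_range(sorted_keys, entity: str, t1: int, t2: int) -> list:
    """All rows for one entity with t1 <= timestamp <= t2, as one short scan."""
    lo = bisect.bisect_left(sorted_keys, row_key(entity, t1))
    hi = bisect.bisect_right(sorted_keys, row_key(entity, t2))
    return sorted_keys[lo:hi]

def expire(sorted_keys, now_s: int, ttl_s: int) -> list:
    """Imitate TTL age-off: keep only rows no older than ttl_s."""
    return [k for k in sorted_keys if now_s - key_ts(k) <= ttl_s]

readings = [("sensorA", 10), ("sensorA", 20), ("sensorA", 30),
            ("sensorA", 40), ("sensorB", 25)]
table = sorted(row_key(e, t) for e, t in readings)
window = [key_ts(k) for k in scan_range(table, "sensorA", 15, 35)]
kept = expire(table, now_s=50, ttl_s=25)
```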

Page 46: A Survey of HBase Application Archetypes

Entity Time Series access pattern

[Diagram: HBase data paths for this archetype, with Flume and a Custom App as additional labeled components. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 47: A Survey of HBase Application Archetypes

Archetypes: Hybrid Entity Time Series

• Essentially a combo of the Metric Archetype and the Entity Time Series Archetype, with bulk loads of rollups via HDFS.
  • Land data in HDFS and HBase
  • Keep all data in HDFS for future use
  • Aggregate in HDFS and write to HBase
  • HBase can do some aggregates too (counters)
  • Keep serve-able data in HBase.
  • Use TTL to discard old values from HBase.

Page 48: A Survey of HBase Application Archetypes

Hybrid time series access pattern

[Diagram: HBase data paths for this archetype, with Flume and HDFS as additional labeled components. Write paths: HBase Client (Put, Incr, Append), Bulk Import from Hive or MR, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 49: A Survey of HBase Application Archetypes

Meta Archetype: Combined workloads

• In these cases, the use of HBase depends on the workload
• Cases where we have multiple workload styles.
  • In many cases we want to do multiple things with the same data
  • Primary use case (real time, random access)
  • Secondary use case (analytical)
  • Pick for your primary; here are some patterns for how to do your secondary.

Page 50: A Survey of HBase Application Archetypes

Real time workloads and Analytical access

[Diagram: real-time and analytical access sharing one HBase cluster. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), annotated "poor latency! full scans interfere with latency!"; HBase Scanner (MapReduce), marked high throughput.]

Page 51: A Survey of HBase Application Archetypes

Real time workloads and Analytical access

[Diagram: real-time and analytical access separated using HBase Replication. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication, marked high throughput. Read paths: HBase Client (Get, Scan), marked low latency and isolated from full scans; HBase Scanner (MapReduce), marked high throughput.]

Page 52: A Survey of HBase Application Archetypes

MR over Table Snapshots (0.98, CDH5.0)

• Previously, MapReduce jobs over HBase required an online full table scan
• Take a snapshot and run the MR job over snapshot files
  • Doesn’t use the HBase client
  • Avoids affecting HBase caches
  • 3-5x perf boost.
  • Still requires more IOPs than HDFS raw files

[Diagram: two sets of map and reduce tasks; one is labeled "snapshot".]

Page 53: A Survey of HBase Application Archetypes

Analytic Archive access pattern

[Diagram: HBase data paths for this archetype. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (Get, Scan), with gets and short scans marked low latency; HBase Scanner (Full Scan, MapReduce), marked high throughput; HBase Replication.]

Page 54: A Survey of HBase Application Archetypes

Analytic Archive Snapshot access pattern

[Diagram: HBase data paths with a table snapshot read from HDFS. Write paths: HBase Client (Put, Incr, Append), Bulk Import, HBase Replication. Read paths: HBase Client (gets and short scans), marked low latency; HBase Scanner (Snapshot Scan, MR) over the table snapshot, marked higher throughput; HBase Replication.]

Page 55: A Survey of HBase Application Archetypes

Multitenancy (in progress)

• We want to run MR for analytics while serving low-latency requests in one cluster.
• Performance isolation
  • Limit the performance impact that load on one table has on others. (HBASE-6721)
• Request prioritization and scheduling
  • Today, the default is FIFO
  • Need to schedule some requests before others (HBASE-10994)

[Diagram: request queues. Labels: Mixed workload; Isolated workload; Delayed by long scan requests; Rescheduled so new requests get priority.]
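
The effect of request ordering can be illustrated with a toy queue model. This is not how HBase schedules requests, nor a description of the referenced JIRA work: it is a hypothetical cheapest-first policy with made-up cost numbers, used only to show why short requests suffer when queued FIFO behind a long scan.

```python
import heapq

def fifo_order(requests: list) -> list:
    """Serve requests strictly in arrival order."""
    return [name for name, _cost in requests]

def priority_order(requests: list) -> list:
    """Serve the cheapest request first (arrival index breaks ties)."""
    heap = [(cost, i, name) for i, (name, cost) in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

def mean_wait(order: list, costs: dict) -> float:
    """Average time a request spends queued before it starts."""
    waited, clock = 0, 0
    for name in order:
        waited += clock
        clock += costs[name]
    return waited / len(order)

# Assumed workload: one long scan arrives just before two short gets.
requests = [("scan", 100), ("get1", 1), ("get2", 1)]
costs = dict(requests)
fifo_wait = mean_wait(fifo_order(requests), costs)
prio_wait = mean_wait(priority_order(requests), costs)
```

In this model the FIFO queue makes both gets wait behind the scan, while reordering lets them finish first at a small cost to the scan.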

Page 56: A Survey of HBase Application Archetypes

Conclusions

Page 57: A Survey of HBase Application Archetypes

Big Data Workloads

[Chart: systems placed on a grid of latency (low latency vs. batch) against access pattern (random access, short scan, full scan). Systems shown: HBase; HBase + MR; HBase + Snapshots -> HDFS + MR; HDFS + Impala; HDFS + MR (Hive/Pig).]

Page 58: A Survey of HBase Application Archetypes

Big Data Workloads

[Chart: systems placed on a grid of latency (low latency vs. batch) against access pattern (random access, short scan, full scan), with archetypes overlaid. Systems shown: HBase; HBase + MR; HBase + Snapshots -> HDFS + MR; HDFS + Impala; HDFS + MR (Hive/Pig). Archetypes shown: Current Metrics; Graph data; Simple Entities; Messages; Entity Time series; Hybrid Entity Time series + Rollup serving; Hybrid Entity Time series + Rollup generation; Index building; Analytic archive.]

Page 59: A Survey of HBase Application Archetypes

HBase is evolving to be an Operational Database

• Excels at consistent single-row-centric operations
  • Dev efforts aimed at using all machine resources efficiently, reducing MTTR, and improving latency predictability.
  • Projects built on HBase that enable secondary indexing and multi-row transactions
  • Apache Phoenix (incubating) or Impala provide a SQL skin for simplified application development
• Analytic workloads?
  • Can be done but will be beaten by direct HDFS + MR/Spark/Impala

Page 60: A Survey of HBase Application Archetypes

Questions?

@larsgeorge, @jmhsieh