hbase design patterns @ yahoo!

HBase Design Patterns @ Y!

PRESENTED BY Francis Liu | May 5, 2014

Y! Grid

▪ Off-Stage Processing
▪ Hosted Service
▪ Multi-tenant

Batch Processing (with HDFS)

▪ Append-only
▪ Efficient full table scans
▪ Process entire data set (or partitions)

HBase

▪ Mutable
▪ Point Access
▪ Range scans
▪ Record-level processing
▪ 7 clusters, 1500 nodes, 6PB

Entity Store: Motivation

▪ Integrate data from multiple data sources
▪ Store historical data
▪ Share data
  › Analytics
  › Machine Learning
  › Consume a data source

Entity Store

▪ Records as Entities
  › Web pages
  › Celebrities
  › etc.

▪ Denormalized as a single table

Entity Store: Content Store

Entity Store: Considerations

▪ Row vs multiple rows as an entity?
  › Row in most cases

▪ Blob vs Primitives as cell values?
  › Blobs are more compact
  › Primitives work better for granular updates
  › Out of the box filters work better with primitives
  › Use a compact binary format

▪ Prepare for Schema Changes
  › Provide a DAO library

▪ Incremental Scan
  › Batch id (via version)
  › Size cache for batch
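One possible reading of "Batch id (via version)" is that each load writes its cells with the batch id as the cell version, so a consumer can later scan only the version range it has not yet processed. The sketch below illustrates that idea with a toy in-memory table; `VersionedTable` and its methods are hypothetical stand-ins, not the HBase client API. The half-open `[min_batch, max_batch)` range is chosen to resemble the time-range restriction a real scan can carry.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory stand-in for a table whose cells carry a version number."""

    def __init__(self):
        # row key -> column -> {version: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, batch_id):
        # Assumption: the batch id is stored as the cell version
        # instead of a wall-clock timestamp.
        self._rows[row][column][batch_id] = value

    def scan(self, min_batch, max_batch):
        """Yield (row, column, value) for the newest version in [min_batch, max_batch)."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                versions = self._rows[row][column]
                in_range = [v for v in versions if min_batch <= v < max_batch]
                if in_range:
                    yield row, column, versions[max(in_range)]


table = VersionedTable()
table.put("page:a", "content", "v1 of a", batch_id=1)
table.put("page:b", "content", "v1 of b", batch_id=1)
table.put("page:a", "content", "v2 of a", batch_id=2)
table.put("page:c", "content", "v1 of c", batch_id=3)

# A consumer that already processed batch 1 only asks for batches 2 and 3.
last_processed = 1
changed = list(table.scan(last_processed + 1, 4))
```

Rows untouched since the last processed batch (here `page:b`) never reach the consumer, which is the point of an incremental scan over a full one.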

Event Processing: Motivation

▪ Process a stream of events
  › Ad Targeting
  › Personalization
  › etc.

▪ Low average age of a record/model/etc

Event Processing

▪ Entity Store
▪ Incremental computation
  › Persist incremental state
▪ Stream processing framework
  › e.g. Storm
▪ Fit working set in Block Cache
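The "persist incremental state" idea can be sketched as follows: each event reads the entity's stored running state, folds the event in, and writes the state back, so nothing is recomputed from the full history. `StateStore`, `apply_event`, and the impression/click counters are hypothetical illustrations, not taken from the talk; the dictionary stands in for a table keyed by entity id.

```python
class StateStore:
    """Hypothetical stand-in for a table keyed by entity id."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def apply_event(store, user_id, clicked):
    """Fold one event into the persisted state and return the updated click rate."""
    state = store.get(user_id) or {"impressions": 0, "clicks": 0}
    state = {
        "impressions": state["impressions"] + 1,
        "clicks": state["clicks"] + (1 if clicked else 0),
    }
    store.put(user_id, state)  # persist the incremental state, not the raw events
    return state["clicks"] / state["impressions"]


store = StateStore()
events = [("u1", False), ("u1", True), ("u2", True), ("u1", False)]
rates = [apply_event(store, user, clicked) for user, clicked in events]
```

Because every event touches only one small record, the set of recently active entities is what needs to stay hot, which is presumably why the slide stresses fitting the working set in the block cache.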

Event Processing: Ad Targeting

Event Processing: Considerations

▪ Limit large compactions
▪ Deferred log flush
▪ Avoid compaction storms
▪ Async Access
  › HBase work queue
  › AsyncHBase
▪ Blobs when possible
▪ Cache optimizations
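As a rough illustration of the "work queue" style of async access, the sketch below lets the event-processing thread hand mutations to a queue and return immediately, while a background worker drains the queue and writes in batches. This is a generic write-behind pattern, not the AsyncHBase API; `WriteBehindQueue`, `sink`, and `batch_size` are hypothetical names, and `sink` stands in for whatever call performs the actual batched write.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to flush and exit


class WriteBehindQueue:
    """Buffer mutations on a queue and flush them in batches from a worker thread."""

    def __init__(self, sink, batch_size=3):
        self._queue = queue.Queue()
        self._sink = sink              # callable that receives a list of mutations
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, mutation):
        # Returns immediately; the caller never waits on the store.
        self._queue.put(mutation)

    def close(self):
        # Ask the worker to flush what is left, then wait for it to finish.
        self._queue.put(_STOP)
        self._worker.join()

    def _drain(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)


flushed = []
writer = WriteBehindQueue(sink=flushed.append, batch_size=3)
for i in range(7):
    writer.submit(i)
writer.close()
```

The trade-off mirrors deferred log flush: latency for the caller drops, but mutations still sitting in the queue are lost if the process dies before they are flushed.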

Phased Event Processing: Motivation

▪ Large/Complex event pipeline
▪ Modularization
▪ Dependency between pipelines

Phased Event Processing

▪ Notifications
  › Separate Table
  › Separate Column Family

Phased Event Processing: Personalization

Phased Event Processing: Considerations

▪ Notifications
  › Ordered
  › At least once

▪ Write to multiple regions
▪ Transactions
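Ordered, at-least-once delivery means a downstream phase can see the same notification more than once, so its handler needs to be idempotent. A minimal sketch, assuming each notification carries a per-entity sequence number (an assumption made for illustration, not something the slides specify): the consumer remembers the highest sequence applied per entity and skips anything at or below it.

```python
class NotificationConsumer:
    """Apply each notification once, relying on per-entity ordering."""

    def __init__(self):
        self.last_seq = {}  # entity key -> highest sequence number applied
        self.applied = []   # record of what was actually processed

    def handle(self, key, seq, payload):
        """Return True if the notification was applied, False if it was a redelivery."""
        if seq <= self.last_seq.get(key, -1):
            return False  # already seen: at-least-once delivery sent it again
        self.applied.append((key, seq, payload))
        self.last_seq[key] = seq
        return True


consumer = NotificationConsumer()
deliveries = [
    ("u1", 0, "profile updated"),
    ("u1", 1, "model rebuilt"),
    ("u1", 1, "model rebuilt"),    # redelivered after a retry
    ("u2", 0, "profile updated"),
    ("u1", 0, "profile updated"),  # stale redelivery
]
results = [consumer.handle(*delivery) for delivery in deliveries]
```

The check only works because notifications for one entity arrive in order; without that guarantee a late but genuinely new notification would be dropped as a duplicate.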

Time Series DB: Motivation

▪ Track/Monitor changes over time
  › Application Metrics
  › User Analytics
  › System Metrics
  › etc.

▪ Alerts/Alarms
  › Thresholds
  › Changes over time

Time Series DB: Personalization Data Quality

Time Series DB: Considerations

▪ Hot metrics
  › Namespace
  › Indexed tags

▪ Pre-compute aggregates if they are accessed often
▪ Consider using a block encoding scheme (PREFIX, FAST_DIFF, etc)
▪ Consider pre-computed aggregates in a separate table
▪ Consider OpenTSDB
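To make the row-key side of these considerations concrete, here is a minimal sketch of a time-series key layout: a numeric metric id followed by an hour-aligned base timestamp, with the offset inside the hour used as the column qualifier. It is similar in spirit to the layout OpenTSDB uses, but the field widths, `BUCKET_SECONDS`, and the function names are assumptions made for this example, and tags are left out entirely.

```python
import struct

BUCKET_SECONDS = 3600  # assumption: one row per metric per hour


def row_key(metric_id, timestamp):
    """Big-endian metric id (2 bytes) + hour-aligned base timestamp (4 bytes)."""
    base = timestamp - (timestamp % BUCKET_SECONDS)
    return struct.pack(">HI", metric_id, base)


def qualifier(timestamp):
    """Offset in seconds within the bucket, used as the column qualifier."""
    return struct.pack(">H", timestamp % BUCKET_SECONDS)


def scan_bounds(metric_id, start_ts, end_ts):
    """Start (inclusive) and stop (exclusive) row keys covering [start_ts, end_ts]."""
    start = row_key(metric_id, start_ts)
    last_base = end_ts - (end_ts % BUCKET_SECONDS)
    stop = struct.pack(">HI", metric_id, last_base + BUCKET_SECONDS)
    return start, stop


hour = 1000 * BUCKET_SECONDS
early = row_key(7, hour + 5)
late = row_key(7, hour + 3599)
next_hour = row_key(7, hour + 3600)
start, stop = scan_bounds(7, hour + 5, hour + 3600)
```

Big-endian packing makes byte order match numeric order, so one metric's rows sort chronologically and a time window becomes a single range scan. Adjacent keys also share a long common prefix, which is the situation block encodings such as PREFIX or FAST_DIFF are meant to exploit.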

HBaseCon 2014

Thank You! (We’re hiring)