HBase Design Patterns @ Yahoo!
DESCRIPTION
Speaker: Francis Liu (Yahoo!)
HBase's introduction into the Yahoo! Grid has provided our users with new ways to process and store data. In the year since it became available, usage has been varied: event processing for personalization, incremental processing for ingestion, time-based aggregations for analytics, etc. All of these were possible thanks to features HBase brings beyond working with HDFS files. This talk will review some recurring HBase design patterns at Yahoo! as well as share our learnings and experiences.
TRANSCRIPT
HBase Design Patterns @ Y!
PRESENTED BY Francis Liu | [email protected] | May 5, 2014
Y! Grid
▪ Off-Stage Processing
▪ Hosted Service
▪ Multi-tenant
Batch Processing (with HDFS)
▪ Append-only
▪ Efficient full table scans
▪ Process entire data set (or partitions)
HBase
▪ Mutable
▪ Point Access
▪ Range scans
▪ Record-level processing
▪ 7 clusters, 1500 nodes, 6PB
Entity Store: Motivation
▪ Integrate data from multiple data sources
▪ Store historical data
▪ Share data
  › Analytics
  › Machine Learning
  › Consume a data source
Entity Store
▪ Records as Entities
  › Web pages
  › Celebrities
  › etc.
▪ Denormalized as a single table
Entity Store: Content Store
Entity Store: Considerations
▪ Row vs multiple rows as an entity?
  › Row in most cases
▪ Blob vs Primitives as cell values?
  › Blobs are more compact
  › Primitives work better for granular updates
  › Out of the box filters work better with primitives
  › Use a compact binary format
▪ Prepare for Schema Changes
  › Provide a DAO library
▪ Incremental Scan
  › Batch id (via version)
  › Size cache for batch
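The "Batch id (via version)" idea can be illustrated with a small sketch: if every write stores the ingestion batch id as the cell version, a later job can scan only the cells written by a given batch instead of the whole table. The `TinyTable` class below is a hypothetical in-memory stand-in written for this illustration, not the HBase client API; in HBase the same effect would presumably come from writing with an explicit version and scanning with a version (time) range.

```python
from collections import defaultdict


class TinyTable:
    """Hypothetical in-memory stand-in for a table whose cell versions are batch ids."""

    def __init__(self):
        # row -> column -> {version (batch id): value}
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, batch_id):
        # The batch id is stored as the cell's version.
        self._cells[row][column][batch_id] = value

    def scan(self, min_batch, max_batch):
        """Yield (row, column, value) for the newest version in [min_batch, max_batch)."""
        for row in sorted(self._cells):
            for column in sorted(self._cells[row]):
                versions = self._cells[row][column]
                in_range = [v for v in versions if min_batch <= v < max_batch]
                if in_range:
                    yield row, column, versions[max(in_range)]


table = TinyTable()
table.put("page:a", "d:title", "Old title", batch_id=1)
table.put("page:b", "d:title", "B title", batch_id=1)
table.put("page:a", "d:title", "New title", batch_id=2)

# Only cells touched by batch 2 come back; page:b is skipped entirely.
incremental = list(table.scan(min_batch=2, max_batch=3))
```

A downstream job that remembers the last batch it processed only needs to pass the next batch id as `min_batch` to pick up the changes.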
Event Processing: Motivation
▪ Process a stream of events
  › Ad Targeting
  › Personalization
  › etc.
▪ Low average age of a record/model/etc
Event Processing
▪ Entity Store
▪ Incremental computation
  › Persist incremental state
▪ Stream processing framework
  › e.g. Storm
▪ Fit working set in Block Cache
Event Processing: Ad Targeting
Ad Targeting
Event Processing - Considerations
▪ Limit large compactions
▪ Deferred log flush
▪ Avoid compaction storms
▪ Async Access
  › HBase work queue
  › AsyncHBase
▪ Blobs when possible
▪ Cache optimizations
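One way to read the "Async Access" / "work queue" bullets is that event handlers should not block on each store write; they enqueue mutations and a separate writer drains the queue in batches. The sketch below shows that shape with Python's `asyncio`. The `sink` list is a hypothetical stand-in for a batched write to the store, and the batch size is an arbitrary assumption; none of this is the AsyncHBase API.

```python
import asyncio


async def writer(queue, sink, batch_size):
    """Drain the queue and hand mutations to the sink in batches."""
    batch = []
    while True:
        item = await queue.get()
        if item is None:  # sentinel: stop and flush whatever is left
            break
        batch.append(item)
        if len(batch) >= batch_size:
            sink.append(list(batch))  # stand-in for one batched write
            batch.clear()
    if batch:
        sink.append(list(batch))


async def run(events, batch_size=3):
    queue = asyncio.Queue(maxsize=100)  # bounded, so producers feel back-pressure
    sink = []  # hypothetical stand-in for the store
    task = asyncio.create_task(writer(queue, sink, batch_size))
    for event in events:
        await queue.put(event)  # the event handler returns as soon as the item is queued
    await queue.put(None)
    await task
    return sink


events = [f"evt-{i}" for i in range(7)]
batches = asyncio.run(run(events))
```

The bounded queue is a design choice in this sketch: if the store slows down, producers wait rather than letting the backlog grow without limit.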
Phased Event Processing: Motivation
▪ Large/Complex event pipeline
▪ Modularization
▪ Dependency between pipelines
Phased Event Processing
▪ Notifications
  › Separate Table
  › Separate Column Family
Phased Event Processing: Personalization
Phased Event Processing: Considerations
▪ Notifications
  › Ordered
  › At least once
▪ Write to multiple regions
▪ Transactions
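"At least once" delivery means a downstream phase may see the same notification more than once, so its processing has to be idempotent. A minimal sketch of one way to do that, assuming (this is not stated in the slides) that each notification carries a per-entity sequence number that increases in delivery order:

```python
def apply_notifications(notifications, state=None):
    """Count each (entity, seq) notification once, ignoring redeliveries.

    Assumes notifications for an entity arrive in order with increasing
    sequence numbers; both the field layout and the counting logic are
    hypothetical, chosen only to illustrate idempotent handling.
    """
    if state is None:
        state = {"last_seq": {}, "counts": {}}
    for entity, seq in notifications:
        if seq <= state["last_seq"].get(entity, -1):
            continue  # already processed: this is a redelivery
        state["counts"][entity] = state["counts"].get(entity, 0) + 1
        state["last_seq"][entity] = seq
    return state


# ("user1", 1) and ("user2", 0) are each delivered twice.
delivered = [
    ("user1", 0), ("user1", 1), ("user1", 1),
    ("user2", 0), ("user1", 2), ("user2", 0),
]
state = apply_notifications(delivered)
```

Because the last processed sequence number is kept alongside the derived state, replaying the same notifications leaves the counts unchanged.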
Time Series DB: Motivation
▪ Track/Monitor changes over time
  › Application Metrics
  › User Analytics
  › System Metrics
  › etc.
▪ Alerts/Alarms
  › Thresholds
  › Changes over time
Time Series DB: Personalization Data Quality
Time-Series: Considerations
▪ Hot metrics
  › Namespace
  › Indexed tags
▪ Pre-compute aggregates if they are accessed often
▪ Consider using a block encoding scheme (PREFIX, FAST_DIFF, etc)
▪ Consider pre-computed aggregates in a separate table
▪ Consider OpenTSDB
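The bullets above can be made concrete with a rough sketch of a time-series row key plus a pre-computed aggregate. The layout (metric name, a zero byte, then an hour-aligned big-endian timestamp, with the offset inside the hour as the column qualifier) is an assumption for illustration, loosely in the spirit of OpenTSDB but not its actual schema. Keys that share a long common prefix like this are also the case where prefix-style block encodings tend to help.

```python
import struct

BUCKET_SECONDS = 3600  # assumed bucket width: one row per metric per hour


def row_key(metric, ts):
    """Hypothetical row key: metric name + NUL + hour-aligned big-endian timestamp."""
    base = ts - ts % BUCKET_SECONDS
    return metric.encode("utf-8") + b"\x00" + struct.pack(">I", base)


def qualifier(ts):
    """Column qualifier: offset in seconds within the bucket (fits in 2 bytes)."""
    return struct.pack(">H", ts % BUCKET_SECONDS)


def add_point(aggregates, metric, ts, value):
    """Maintain a pre-computed (count, sum) per bucket, as a separate table might."""
    key = row_key(metric, ts)
    count, total = aggregates.get(key, (0, 0.0))
    aggregates[key] = (count + 1, total + value)


aggregates = {}  # stand-in for the separate aggregates table
for ts, value in [(7200, 1.0), (7205, 3.0), (10800, 10.0)]:
    add_point(aggregates, "cpu", ts, value)

count, total = aggregates[row_key("cpu", 7200)]
hourly_mean = total / count
```

Big-endian packing keeps rows for one metric sorted by time, so a range scan over a time window maps to a contiguous key range.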
HBaseCon 2014
Thank You! (We’re hiring)