HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website
DESCRIPTION
Gap Inc Direct, the online division of Gap Inc., uses HBase to serve the apparel catalog in real time for the websites of all its brands and markets. This case study reviews the business case and the key decisions on schema selection and cluster configuration. We also discuss implementation challenges and the insights we gained.
TRANSCRIPT
Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website
HBaseCon 2012
Applications Track – Case Study
Who Are We?
Suraj Varma, Director of Technology Implementation, Gap Inc Direct (GID), San Francisco, CA. IRC: svarma
Gupta Gogula, Director-IT & Domain Architect of Catalog Management & Distribution, Gap Inc Direct (GID), San Francisco, CA
Agenda - Case Study
Problem Domain
HBase Schema Specifics
HBase Cluster Specifics
Learning & Challenges
Diagram: timeline of site growth. 2005: new site launch; 2007: Piperlime; 2008: Universality; 2009: Athleta; 2010: CA & EU markets. Incoming traffic for the US, CA, and EU markets flows to application servers and databases.
Problem Domain
Evolution of the GID apparel catalog: 2005, three independent brands in the US; 2010, five integrated brands in the US, CA, and EU
Rapid expansion of the apparel catalog
However, each brand / market combination necessitated a separate logical catalog database
What We Wanted …
Single catalog store for all brands / markets: horizontally scalable over time; cross-brand business features
Access the data store directly: to take advantage of inventory awareness of items
Minimal caching, only for optimization: keeping caches in sync is a problem
Highly available
Initial Explorations
Sharded RDBMS, Memcached, etc.: significant effort was required; still had scalability limits
Non-relational alternatives considered
HBase POC (early 2010): promising results, decided to move ahead
Why HBase?
Strong consistency model
Server-side filters
Automatic sharding, distribution, failover
Hadoop integration out of the box
General purpose: other use cases outside of catalog
Strong community!
Architecture Diagram
Diagram: backend services send mutations (near-real-time inventory updates, pricing updates, item updates) to the HBase cluster; incoming requests for catalog data are served from the HBase cluster.
Cluster Traffic Patterns
Read mostly: website traffic; sync MR jobs
Write / delete bursts: catalog publish (phase out to near-real-time updates from originating systems); MR jobs on live cluster
Continuous writes: inventory updates
Data Model & Access Patterns
Hierarchical data (primarily): SKU -> Style lookups (child -> parent); cross-brand sell (sibling <-> sibling)
Data access patterns: full product graph in one read; single path of graph from root to leaf node; search via secondary indices; large feed files
Rows: 100 KB average size; 1000-5000 columns; sparse rows
Primary Access Patterns
READ FULL GRAPH
READ SINGLE PATH / EDGE
HBase Schema Management
Built custom “bean to schema mapper”: POJO graph <-> HBase qualifiers; flexibility to shorten column qualifiers; flexibility to change schema qualifiers (per environment / developer)
<…>
<association>one-to-many</association>
<prefix>SC</prefix>
<uniqueId>colorCd</uniqueId>
<beanName>styleColorBean</beanName>
<…>
Schema Example - Hierarchy
<PP>_<id1>_QQ_<id2>_RR_<id3>_name
Where PP is parent, QQ is child, RR is grandchild
cf1:VAR_1_SC_0012_colorCd
cf2:VAR_1_SC_0012_SCIMG_10_path
Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME
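The qualifier pattern above can be illustrated with a small sketch of how a nested bean might be flattened into qualifiers that carry their ancestor ids. This is not the custom mapper described in the talk: the `MAPPING` structure, the `flatten` function, and the `imageId` field are hypothetical names chosen for illustration, and column families (cf1, cf2) are left out.

```python
from typing import Any, Dict

# Hypothetical mapping, loosely modelled on the XML fragment from the mapper
# slide: each child collection has a short prefix and names its id field.
MAPPING: Dict[str, Any] = {
    "styleColors": {
        "prefix": "SC",
        "unique_id": "colorCd",
        "children": {
            # "imageId" is an assumed field name, not taken from the talk.
            "images": {"prefix": "SCIMG", "unique_id": "imageId", "children": {}},
        },
    },
}


def flatten(bean: Dict[str, Any], path: str, mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a nested bean into {qualifier: value}, embedding ancestor ids."""
    cells: Dict[str, Any] = {}
    for field, value in bean.items():
        if field in mapping:
            spec = mapping[field]
            # One-to-many association: every child extends the ancestor path.
            for child in value:
                child_path = f"{path}_{spec['prefix']}_{child[spec['unique_id']]}"
                cells.update(flatten(child, child_path, spec["children"]))
        else:
            # Plain field: the qualifier is the ancestor path plus the field name.
            cells[f"{path}_{field}"] = value
    return cells


VARIANT = {
    "styleColors": [
        {"colorCd": "0012", "images": [{"imageId": "10", "path": "/img/a.jpg"}]},
    ],
}

for qualifier, value in flatten(VARIANT, "VAR_1", MAPPING).items():
    print(qualifier, "=", value)
```

Because every qualifier starts with the ids of its ancestors, all cells of one subtree share a common prefix, which is what makes a single-row read of the full graph or of one branch possible.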
Schema – Lookups
Secondary index: <id3> => RR ; QQ ; PP
FilterList with (RR, QQ, PP) ids to get thin slice path
14444 333 22KEY_5555
Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
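A toy model of this lookup may help: a leaf id is resolved through a secondary index to its row key and ancestor ids, and those ids then select only the cells on the path from root to leaf. Plain dictionaries stand in for the HBase tables, and the prefix matching stands in for the FilterList; the talk does not say which concrete filters were combined. The names `INDEX`, `CATALOG`, and `thin_slice` are hypothetical, and the sketch assumes field names contain no underscore.

```python
from typing import Dict, List, Tuple

Path = List[Tuple[str, str]]  # [(prefix, id), ...] from root to leaf

# Hypothetical secondary index: leaf id -> (row key, ancestor path).
INDEX: Dict[str, Tuple[str, Path]] = {
    "10": ("KEY_5555", [("VAR", "1"), ("SC", "0012"), ("SCIMG", "10")]),
}

# Hypothetical catalog row: qualifier -> value.
CATALOG: Dict[str, Dict[str, str]] = {
    "KEY_5555": {
        "VAR_1_name": "slim fit",
        "VAR_1_SC_0012_colorCd": "0012",
        "VAR_1_SC_0012_SCIMG_10_path": "/img/a.jpg",
        "VAR_1_SC_0044_colorCd": "0044",
        "VAR_2_name": "regular fit",
    },
}


def path_prefixes(path: Path) -> List[str]:
    """Qualifier prefix for each level of the path, root first."""
    prefixes: List[str] = []
    current = ""
    for prefix, ident in path:
        current += f"{prefix}_{ident}_"
        prefixes.append(current)
    return prefixes


def thin_slice(leaf_id: str) -> Dict[str, str]:
    """Return only the cells that sit directly on the root-to-leaf path."""
    row_key, path = INDEX[leaf_id]
    prefixes = path_prefixes(path)

    def on_path(qualifier: str) -> bool:
        # A cell belongs to a path node when what follows the node's prefix
        # is a bare field name (assumed to contain no underscore).
        return any(
            qualifier.startswith(p) and "_" not in qualifier[len(p):]
            for p in prefixes
        )

    return {q: v for q, v in CATALOG[row_key].items() if on_path(q)}


print(thin_slice("10"))
```

In this sketch the sibling color `0044` and the second variant are skipped, so the caller receives three cells instead of the full row.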
Schema – Future Dates, Large Files
“Publish at Midnight”: future-dated PUTs; Get/Scan with time range
Large feed files: sharded into smaller chunks, < 2 MB per cell
Diagram: row KEY_nnnn with chunk columns S_1, S_2, S_3, S_4
Pattern: SHARDED CHUNKS
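Both ideas on this slide can be sketched without an HBase client. The first part splits a feed file into `S_1`, `S_2`, ... cells under the 2 MB cap and joins them again; the second part models a cell with timestamped versions, where a read bounded by the current time does not see a value written with a future timestamp. The function and class names are hypothetical, and the exclusive upper bound in `get` mirrors how HBase time ranges are generally documented rather than anything stated in the talk.

```python
from typing import Dict, List, Optional, Tuple

MAX_CELL_BYTES = 2 * 1024 * 1024  # slide: keep each cell under 2 MB


def shard(payload: bytes, chunk_size: int = MAX_CELL_BYTES - 1) -> Dict[str, bytes]:
    """Split a feed file into cells S_1, S_2, ... of at most chunk_size bytes."""
    count = max(1, -(-len(payload) // chunk_size))  # ceiling division
    return {
        f"S_{i + 1}": payload[i * chunk_size:(i + 1) * chunk_size]
        for i in range(count)
    }


def reassemble(cells: Dict[str, bytes]) -> bytes:
    """Join chunks in numeric order, so that S_10 sorts after S_9."""
    ordered = sorted(cells, key=lambda qualifier: int(qualifier.split("_")[1]))
    return b"".join(cells[qualifier] for qualifier in ordered)


class VersionedCell:
    """Toy stand-in for one cell holding several timestamped versions."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, str]] = []

    def put(self, timestamp_ms: int, value: str) -> None:
        # A future-dated write simply carries a timestamp later than "now".
        self._versions.append((timestamp_ms, value))

    def get(self, max_timestamp_ms: int) -> Optional[str]:
        """Newest value with timestamp < max_timestamp_ms (exclusive bound)."""
        visible = [(ts, v) for ts, v in self._versions if ts < max_timestamp_ms]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None


price = VersionedCell()
price.put(1_000, "current price")
price.put(5_000, "price published at midnight")
print(price.get(2_000))  # read before "midnight"
print(price.get(5_001))  # read just after "midnight"
```

The appeal of the time-range approach is that the new data can be loaded ahead of time and becomes visible purely because the readers' upper bound moves forward, with no write burst at the publish moment.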
HBase Cluster
16 slave (RS + TT + DN) nodes: 8 & 16 GB RAM
3 master (HM, ZK, JT, NN) nodes: 8 GB RAM
NN failover via NFS
Configurations – Block Cache, GC
Block cache: maximize block cache; hfile.block.cache.size: 0.6
Garbage collection: MSLAB enabled; CMSInitiatingOccupancyFraction
Configurations – Timeouts
Quick recovery on node failure: default timeouts too large
zookeeper.session.timeout
Region Server: hbase.rpc.timeout
Data Node: dfs.heartbeat.recheck.interval, heartbeat.recheck.interval
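To see why the default DataNode setting counts as "too large", a short calculation helps. As far as I understand the NameNode of Hadoop releases from that period, a silent DataNode is declared dead after twice the recheck interval plus ten heartbeat intervals, with defaults of 5 minutes and 3 seconds. The lowered value of 15 seconds below is a hypothetical example, not a figure from the talk.

```python
def datanode_expiry_ms(recheck_interval_ms: int, heartbeat_interval_s: int = 3) -> int:
    """Time before the NameNode treats a silent DataNode as dead.

    Assumed rule: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


DEFAULT_RECHECK_MS = 5 * 60 * 1000  # assumed default recheck interval: 5 minutes
LOWERED_RECHECK_MS = 15 * 1000      # hypothetical tuned value: 15 seconds

print(datanode_expiry_ms(DEFAULT_RECHECK_MS) / 1000, "seconds with defaults")
print(datanode_expiry_ms(LOWERED_RECHECK_MS) / 1000, "seconds with the lowered value")
```

Under these assumptions the defaults leave a failed node undetected for about ten and a half minutes, while the lowered value brings that down to one minute. The same reasoning applies to the ZooKeeper session and RPC timeouts, traded off against false positives during long GC pauses.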
Learnings – Regions
Block cache size tuning: block cache churn
Hot row scenarios: perf tests and phased rollouts
Hot region issues: perf tests and pre-split regions
Filters: CPU intensive, profiling needed
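Pre-splitting can be sketched as computing split keys up front so that a new table starts with several regions instead of one. The talk does not describe the row key format, so this example assumes zero-padded decimal keys of fixed width; `split_points` and `region_index` are hypothetical helpers, and the resulting keys would be passed to whatever table-creation call accepts split keys.

```python
import bisect
from typing import List


def split_points(num_regions: int, key_width: int = 8) -> List[str]:
    """Evenly spaced split keys over a zero-padded decimal key space."""
    if num_regions < 2:
        return []
    key_space = 10 ** key_width
    return [
        str(key_space * i // num_regions).zfill(key_width)
        for i in range(1, num_regions)
    ]


def region_index(row_key: str, splits: List[str]) -> int:
    """Index of the region holding row_key; a region covers [start, end)."""
    return bisect.bisect_right(splits, row_key)


splits = split_points(4, key_width=4)
print(splits)
for key in ("0001", "2500", "7499", "9999"):
    print(key, "-> region", region_index(key, splits))
```

Even splits only help when the keys themselves are evenly distributed, which is one reason the slide pairs pre-splitting with performance tests.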
Learnings – Monitoring, Hardware
Monitoring is crucial: layer by layer, what's the bottleneck; metrics to target optimization & tuning; troubleshooting
Non-uniform hardware: sub-optimal region distribution; hefty boxes lightly loaded
Learnings – Miscellaneous
M/R jobs running on live cluster: has an impact, so cannot run full throttle; go easy …
Feature enablement, phase in: don't turn on several features together; easier identification of potential hot regions / rows, overloaded RS, etc.
Phasing In Features
Enable features individually to measure impact and tune cluster accordingly
Diagram: backend services (inventory updates, pricing updates, item updates) and incoming requests reach the HBase cluster. Feature “A” enabled: additional “N” req / sec. Feature “B” enabled: additional “K” req / sec. Result: a lot more requests.
Challenges – Search, Transactions
Search: no out-of-the-box secondary indexes; custom solution with Solr
Transactions: only row-level atomicity, but … can't pack everything into a single row; Atomic Cross-Row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+)
Challenges – Optimal Schema
Orthogonal access patterns: optimize for the most frequently used pattern
Filters: may suffice, with early-out configurations; impacts CPU usage
Duplicate data for every access pattern: too drastic; effort to keep all copies in sync
Challenges - Backups
Rebuild from source data: takes time … but no data loss
Export / import based backups: faster … but stale; another MR on live cluster
Better options in future releases …
Gap Inc Direct
We’re hiring! http://www.gapinc.com