Design Patterns for Building 360-Degree Views with HBase and Kiji
DESCRIPTION
Speaker: Jonathan Natkins (WibiData)
Many companies aspire to have 360-degree views of their data. Whether they're concerned about customers, users, accounts, or more abstract things like sensors, organizations are focused on developing capabilities for analyzing all the data they have about these entities. This talk will introduce the concept of entity-centric storage, discuss what it means, what it enables for businesses, and how to develop an entity-centric system using the open-source Kiji framework and HBase. It will also compare and contrast traditional methods of building a 360-degree view on a relational database versus building against a distributed key-value store, and why HBase is a good choice for implementing an entity-centric system.
TRANSCRIPT
Design Patterns for 360º Views using HBase and Kiji
Jonathan Natkins
Who am I?
Jon “Natty” Natkins
Field Engineer at WibiData
Formerly at Cloudera/Vertica
What is a 360º View?
What is a 360º View For?
Past: What interactions has a customer had in the past?
Present: What is the customer doing right now?
Future: What is the customer likely to do next?
Past and present inform the future
What If I Don’t Care About Customers?
Generalizing the 360º View: Entity-Centric Systems
Goal of an Entity-Centric System
“Show me everything I know about Natty”
What Data Do I Need to Store?
Static data
Event-oriented data
Derived data
Building Entity-Centric Systems
Often, this is an EDW with a star schema
[Diagram: star schema with a central fact table (Fact) and four dimension tables (Dim)]
Challenges With Star Schemas
How do we answer the original question?
Full table scan + joins
OLTP systems will likely fall over from the volume
OLAP systems are usually not optimized for single-row lookups
Need Something Else…
Why HBase?
HBase rows can store both static and event-oriented data
Cell versions are key
Single-row lookups are extremely fast
Kiji is for Building Entity-Centric Systems
Often used for:
Building recommendation systems
Personalized search
Real-time HBase applications
Underlying technologies:
Designing an Entity-Centric Datastore
Ask yourself this: what is the entity?
Determine your entity by determining how you want to analyze the data
It’s ok to have data organized in multiple ways
Schema Management with Kiji
Sometimes you actually want a schema layer
Defining a schema allows for data discoverability
Column Families in Kiji
Kiji has two types of column families
Group families are similar to relational tables
Predefined set of columns
Each column has its own data type
Map families specify columns at runtime
Every column has the same data type
[Diagram: cells for sessions:2345, sessions:1234, and info:purchases]
Knowing When To Use Different Family Types
Do you know all of your columns up front?
Then use a group family
Map families are for when you don’t know your columns ahead of time
[Diagram: a row with columns info:name, info:email, sessions:1234, sessions:2345, and info:purchases]
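The difference between the two family types can be sketched in plain Python. This is an illustrative model only, not the Kiji API: the `GroupFamily` and `MapFamily` classes, their `put` methods, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class GroupFamily:
    """Fixed set of columns declared up front, each with its own type."""
    columns: Dict[str, type]
    values: Dict[str, Any] = field(default_factory=dict)

    def put(self, qualifier: str, value: Any) -> None:
        if qualifier not in self.columns:
            raise KeyError(f"unknown column: {qualifier}")
        if not isinstance(value, self.columns[qualifier]):
            raise TypeError(f"{qualifier} expects {self.columns[qualifier].__name__}")
        self.values[qualifier] = value


@dataclass
class MapFamily:
    """Columns are created at write time; every value shares one type."""
    value_type: type
    values: Dict[str, Any] = field(default_factory=dict)

    def put(self, qualifier: str, value: Any) -> None:
        if not isinstance(value, self.value_type):
            raise TypeError(f"values must be {self.value_type.__name__}")
        self.values[qualifier] = value


# Group family: the columns "name" and "email" are known ahead of time.
info = GroupFamily(columns={"name": str, "email": str})
info.put("name", "Natty")

# Map family: session IDs are only discovered as data arrives.
sessions = MapFamily(value_type=str)
sessions.put("1234", "viewed product page")
sessions.put("2345", "added item to cart")
```

Writing to an undeclared column such as `info.put("phone", "...")` raises `KeyError` in this model, while the map family accepts any new qualifier as long as the value type matches.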
Choosing a Row Key
Row keys in Kiji are componentized
[ 'component1', 'component2', 1234 ]
More efficient than byte arrays
Consider '1234567890' versus [ 1234567890 ]
Good for scanning areas of the keyspace
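A rough illustration of the '1234567890' versus [ 1234567890 ] comparison follows. The talk does not describe how Kiji encodes key components, so the fixed-width big-endian integer used here is an assumption chosen only to show the size and ordering difference.

```python
import struct

as_string = "1234567890".encode("utf-8")   # one byte per decimal digit
as_int = struct.pack(">i", 1234567890)     # 32-bit big-endian signed integer

print(len(as_string))  # 10
print(len(as_int))     # 4

# Byte-wise ordering also differs: strings compare digit by digit, while
# fixed-width big-endian integers (non-negative) compare numerically.
string_order_ok = "9".encode("utf-8") < "10".encode("utf-8")
int_order_ok = struct.pack(">i", 9) < struct.pack(">i", 10)
print(string_order_ok)  # False
print(int_order_ok)     # True
```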
A Common Use for Components
Known user IDs versus unknown IDs
On a website, how do you differentiate between a logged-in or cookie'd user and a brand-new visitor?
[ 'K', 'user1234' ] or [ 'U', 'unknown2345' ]
Physically and logically separate rows
Run jobs over all known or unknown users
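The idea of running a job over only known or only unknown users can be sketched with a sorted list standing in for the table. This assumes unhashed keys stored in sorted order; the `scan_by_type` helper and the sample keys are hypothetical.

```python
from bisect import bisect_left

# Hypothetical row keys: (type, id) tuples kept in sorted order, the way a
# sorted key-value store keeps rows ordered by key.
row_keys = sorted([
    ("K", "user1234"),
    ("U", "unknown2345"),
    ("K", "user5678"),
    ("U", "unknown9999"),
])


def scan_by_type(keys, key_type):
    """Return the contiguous slice of keys whose first component matches."""
    start = bisect_left(keys, (key_type,))
    end = bisect_left(keys, (key_type + "\uffff",))
    return keys[start:end]


known = scan_by_type(row_keys, "K")
unknown = scan_by_type(row_keys, "U")
```

Because the type component comes first, each group occupies one contiguous range of the keyspace, so a job can read it with a single range scan instead of filtering every row.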
Identifying Known Users
Problem: Users have many cookies over time.
Challenge: Ideally, we would have a single row for each user. How do we ensure that new data goes to the right row?
Finding Known Users With Lookup Tables
HBase get operations are fast
It’s easy enough to create a table that contains a mapping of cookies to known user IDs
When data is loaded, check the lookup table to determine if you should write data to an existing row or a new one
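A minimal sketch of the lookup step at load time, using a dictionary as a stand-in for the lookup table. The names `cookie_to_user`, `resolve_row_key`, and `link_cookie` are hypothetical, as is the choice to key unknown visitors by their cookie.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for a cookie -> known-user-ID lookup table.
cookie_to_user: Dict[str, str] = {"cookie-abc": "user1234"}


def resolve_row_key(cookie: str) -> Tuple[str, str]:
    """Pick the row an incoming event should be written to."""
    user_id = cookie_to_user.get(cookie)   # one get against the lookup table
    if user_id is not None:
        return ("K", user_id)              # existing known-user row
    return ("U", cookie)                   # new row keyed by the unknown cookie


def link_cookie(cookie: str, user_id: str) -> None:
    """Record a mapping, for example once a visitor logs in."""
    cookie_to_user[cookie] = user_id
```

After `link_cookie("cookie-xyz", "user5678")`, later events carrying that cookie resolve to the known user's row rather than creating another unknown row.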
Avoiding Hotspots
Unhashed Row Keys
[Diagram: regions A-B, B-C, D-E, F-G, H-I, and J-K spread across Node 1, Node 2, and Node 3]
Hash-Prefixed Row Keys
[Diagram: regions 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, and 50A-5fK spread across Node 1, Node 2, and Node 3]
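The effect of a hash prefix can be sketched as follows. The talk does not say which hash function or prefix length Kiji uses, so MD5 and a one-byte prefix are assumptions, and `hash_prefixed_key` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def hash_prefixed_key(user_id: str, prefix_bytes: int = 1) -> str:
    """Prepend the first bytes of an MD5 digest (as hex) to the key."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return digest[: prefix_bytes * 2] + user_id


# Sequential IDs share a long common prefix, so unhashed they would all sort
# into the same part of the keyspace.
ids = [f"user{n:05d}" for n in range(1000)]
keys = [hash_prefixed_key(i) for i in ids]

# Count how many keys start with each leading hex digit.
buckets = Counter(k[0] for k in keys)
print(len(buckets), "distinct leading hex digits")
```

The trade-off is that rows are no longer stored in the order of the original key, so range scans over consecutive user IDs stop being contiguous.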
Storing Event Series
360º views need easy access to all the transactions and events for a user
HBase cells may contain more than one version
Kiji leverages this to store event series data like clicks or purchases
[Diagram: a row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345; info:purchases and the sessions columns hold multiple cell versions]
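A toy model of a row whose cells keep multiple timestamped versions shows how static and event-oriented data can share one row. The `VersionedRow` class and the sample values are hypothetical and do not reflect the HBase or Kiji client APIs.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class VersionedRow:
    """Toy model of one row whose cells keep multiple timestamped versions."""

    def __init__(self) -> None:
        # column name -> list of (timestamp_ms, value), newest first
        self._cells: Dict[str, List[Tuple[int, Any]]] = defaultdict(list)

    def put(self, column: str, timestamp_ms: int, value: Any) -> None:
        versions = self._cells[column]
        versions.append((timestamp_ms, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get_latest(self, column: str) -> Any:
        return self._cells[column][0][1]

    def get_versions(self, column: str, max_versions: int) -> List[Tuple[int, Any]]:
        return self._cells[column][:max_versions]


row = VersionedRow()
row.put("info:name", 1000, "Natty")             # static data: one version
row.put("info:purchases", 1000, "book")         # event data: one version per event
row.put("info:purchases", 2000, "headphones")
row.put("info:purchases", 3000, "laptop")

latest = row.get_latest("info:purchases")
recent_two = row.get_versions("info:purchases", 2)
```

Reading the newest version answers "what is the customer doing now", while reading many versions of the same cell returns the event history.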
How Many Events is Too Many?
The HBase book warns that too many versions of a cell can cause StoreFile bloat
HBase will never split a row
Common tactic is to add a timestamp range to the row key
Kiji makes this easy with componentized row keys
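One way to picture the timestamp-range tactic is to add a time bucket as an extra key component, so a user's events are split across several rows. The monthly granularity and the `bucketed_row_key` helper are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Tuple


def bucketed_row_key(user_id: str, event_ts_ms: int) -> Tuple[str, str]:
    """Build a (user_id, 'YYYY-MM') key so each row holds one month of events."""
    dt = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return (user_id, dt.strftime("%Y-%m"))


jan = int(datetime(2013, 1, 15, tzinfo=timezone.utc).timestamp() * 1000)
feb = int(datetime(2013, 2, 3, tzinfo=timezone.utc).timestamp() * 1000)

print(bucketed_row_key("user1234", jan))  # ('user1234', '2013-01')
print(bucketed_row_key("user1234", feb))  # ('user1234', '2013-02')
```

Since the user ID remains the leading component, all of a user's buckets still sort next to each other and can be read with one range scan.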
Beware of Timestamp Misuse
A major reason the HBase book warns against mucking with timestamps is that they can be dangerous
What happens if you use a sequence number as a timestamp? Think about TTLs
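The TTL question can be made concrete with a simplified expiry check. This is a sketch of the general rule (a cell expires once its timestamp is older than the current time minus the TTL), not HBase's implementation; the function name and constants are hypothetical, and everything is expressed in milliseconds for simplicity.

```python
def is_expired(cell_timestamp_ms: int, now_ms: int, ttl_ms: int) -> bool:
    """A cell is expired once its timestamp is older than now minus the TTL."""
    return cell_timestamp_ms < now_ms - ttl_ms


NOW_MS = 1_370_000_000_000                  # a wall-clock time in 2013, in ms
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000

real_timestamp = NOW_MS - 1_000             # written one second ago
sequence_number = 42                        # misused as a timestamp

print(is_expired(real_timestamp, NOW_MS, THIRTY_DAYS_MS))   # False
print(is_expired(sequence_number, NOW_MS, THIRTY_DAYS_MS))  # True
```

A sequence number read as milliseconds since the epoch looks like a write from 1970, so under any finite TTL the cell is treated as expired as soon as it is written.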
Iterate and Evolve
Why is Evolution Necessary?
No entity-centric system will be the end-all, be-all the first time around
Data sources in large enterprises are usually heavily siloed
Start small
Incorporate new data sources over time
Putting it Together
Kiji includes a shell to use DDL to create tables
Many of the features discussed so far can be declared through the DDL
Users Table
CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.'
  )
);
In Summary…
Designing applications in an entity-centric fashion can make them easier to build and more efficient
Kiji can speed up the development process of 360º views