HBase vs Hive SRP vs Analytics 2-14-2012
TRANSCRIPT
HBase vs. Hive
Philip Wickline, Chief Technology Officer
Hadapt
Goals
Brief introduction to the differences between transactional/operational and analytical systems
Understand when to use Hive and when to use HBase and why
Databases
Datastores
Differences of Purpose : “Transaction Processing”
Operational systems
• Optimized for small, short random-access reads and writes
• E.g. record that an employee invested $100 in an S&P 500 index fund in his 401(k) *or* record that a user posted something on another user's “wall”
Traditional DB examples
• Oracle
• MySQL
NoSQL Examples
• HBase
• MongoDB
• Cassandra
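The operational access pattern described above can be sketched in a few lines. This is a toy in-memory stand-in, not the API of any of the systems listed; the names `wall_posts`, `record_post` and `read_post` are hypothetical and exist only for illustration.

```python
# Hypothetical in-memory stand-in for an operational store: each request
# touches one small record, located directly by its key.
wall_posts = {}


def record_post(wall_owner, post_id, text):
    """Small random-access write: touches a single keyed record."""
    wall_posts[(wall_owner, post_id)] = text


def read_post(wall_owner, post_id):
    """Small random-access read: fetch one record by key, with no scan."""
    return wall_posts.get((wall_owner, post_id))


record_post("alice", 1, "hello from bob")
record_post("carol", 7, "happy birthday")
print(read_post("alice", 1))
```

The point of the sketch is that the cost of each call does not depend on how many other records exist, which is the property operational systems optimize for.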
Differences of Purpose: Analytics
Analytics
• Optimized for read-only computations about large amounts of data
• E.g. compute the average amount invested in bond funds and stock funds for all employees at all employers over the last 5 years
DB Examples
• Netezza
• Vertica
NoSQL Examples
• Hive
• Pig
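The analytic example above (average amount invested per fund type over a time window) might look roughly like the sketch below. The records, field layout and the function name `average_by_fund_type` are made up for illustration; a real system would run the same kind of read-only scan over far more data.

```python
from collections import defaultdict

# Hypothetical investment records: (employer, employee, fund_type, amount, year)
records = [
    ("Acme", "e1", "stock", 100.0, 2010),
    ("Acme", "e2", "bond", 300.0, 2011),
    ("Newco", "e3", "stock", 200.0, 2009),
    ("Newco", "e4", "bond", 100.0, 2004),  # outside the 5-year window
]


def average_by_fund_type(rows, first_year, last_year):
    """Read-only full scan: every row is visited once, none is modified."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _employer, _employee, fund_type, amount, year in rows:
        if first_year <= year <= last_year:
            totals[fund_type] += amount
            counts[fund_type] += 1
    return {fund: totals[fund] / counts[fund] for fund in totals}


averages = average_by_fund_type(records, 2007, 2011)
print(averages)
```

Unlike a keyed lookup, the work here grows with the total number of records, so scan throughput matters more than per-record latency.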
[Figure: sample analytics charts, including Plan vs. Actual by month (Oct to Mar) and series for Acme, GM, Newco, Oldco, Bigcorp]
HBase Data Model : Conceptual
From the BigTable paper:
“a sparse, distributed, persistent multi-dimensional sorted map”
(row : bytestring, column family : bytestring, column : bytestring, time : int64) -> bytestring
HBase Map
{ "key_1" : {
    "columnfamily_a" : {
      "column_i" : {
        15 : "y",
        4 : "m"
      },
      "column_ii" : {
        15 : "d"
      }},
    "columnfamily_b" : {
      "column_other" : {
        6 : "w",
        3 : "o",
        1 : "w"
      }}}}
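The nested map above can be modeled directly as nested dictionaries. The sketch below is a toy model, not the HBase client API; `get_cell` and `scan_rows` are hypothetical helpers. It assumes that a read without an explicit timestamp returns the newest version, and that rows are visited in sorted key order.

```python
# Toy model of the conceptual map, mirroring the example values above.
table = {
    "key_1": {
        "columnfamily_a": {
            "column_i": {15: "y", 4: "m"},
            "column_ii": {15: "d"},
        },
        "columnfamily_b": {
            "column_other": {6: "w", 3: "o", 1: "w"},
        },
    },
}


def get_cell(table, row, family, column, timestamp=None):
    """Return the value at `timestamp`, or the newest version if omitted."""
    versions = table[row][family][column]
    if timestamp is None:
        timestamp = max(versions)  # newest version has the largest timestamp
    return versions[timestamp]


def scan_rows(table, start_row, stop_row):
    """Yield (key, row) pairs in sorted key order for start_row <= key < stop_row."""
    for row_key in sorted(table):
        if start_row <= row_key < stop_row:
            yield row_key, table[row_key]


print(get_cell(table, "key_1", "columnfamily_a", "column_i"))
print(get_cell(table, "key_1", "columnfamily_b", "column_other", timestamp=3))
```

Each lookup walks four levels (row, column family, column, timestamp), which matches the four-part key in the BigTable definition.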
Hive Data Model : Conceptual
Traditional Relational Tables
CUSTKEY  NAME     ADDRESS         NATIONKEY  PHONE         ACCTBAL     COMMENT
451234   NEWCORP  196 Broadway …  1          111-555-1212  $1,231,285  NULL
887765   ACME     1 Main st. …    2          222-555-1212  $46,945     “Top customer”
HBase Data Model : Physical
Every cell stored with row, family, column and timestamp
Allows fast lookup with low copy overhead
BUT
Space inefficient (optional compression available) and inefficient to scan
“key_1” “cf_a” “c_i” 15 “foo”
“key_1” “cf_a” “c_ii” 15 “bar”
“key_2” “cf_a” “c_ii” 4 “baz”
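To make the space trade-off concrete, the sketch below flattens a nested map into one record per cell, as in the three rows above, and compares the bytes spent on coordinates with the bytes spent on values. It is a rough illustration, not HBase's actual file format: `flatten` is a hypothetical helper, the 8 bytes per timestamp assume the int64 from the conceptual model, and the newest-first ordering of versions is an assumption of this sketch.

```python
def flatten(table):
    """Turn the nested map into sorted (row, family, column, timestamp, value) cells."""
    cells = []
    for row, families in table.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for timestamp, value in versions.items():
                    cells.append((row, family, column, timestamp, value))
    # Sort by row, family, column, then newest timestamp first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells


table = {
    "key_1": {"cf_a": {"c_i": {15: "foo"}, "c_ii": {15: "bar"}}},
    "key_2": {"cf_a": {"c_ii": {4: "baz"}}},
}
cells = flatten(table)

# Rough overhead estimate: bytes of coordinates versus bytes of values.
key_bytes = sum(len(r) + len(f) + len(c) + 8 for r, f, c, _, _ in cells)
value_bytes = sum(len(v) for *_, v in cells)
print(cells)
print(key_bytes, value_bytes)
```

Because every cell repeats its full coordinates, the key bytes dominate when values are small. Each cell can still be located on its own, which is what makes lookups fast.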
Hive Data Model : Physical
Depends on the underlying storage files
Can use flat text files, RCFiles, even use HBase for storage
Standard Row Storage
C_1 C_2 C_3 C_4
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
51 52 53 54
Hive Data Model : RCFile
Break into row groups, and then store as columns
Row Group 1
C_1 11 21 31
C_2 12 22 32
C_3 13 23 33
C_4 14 24 34
Row Group 2
C_1 41 51
C_2 42 52
C_3 43 53
C_4 44 54
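The row-group layout above can be reproduced with a short sketch. It only illustrates the idea (split the rows into groups, then store each group column by column) and is not the RCFile on-disk format; `to_row_groups`, `read_column` and the group size of 3 are assumptions chosen to match the example.

```python
rows = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]


def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk: one tuple per column.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_column(groups, column_index):
    """Read one column across all row groups, skipping the other columns."""
    values = []
    for group in groups:
        values.extend(group[column_index])
    return values


groups = to_row_groups(rows, group_size=3)
print(groups[0])
print(read_column(groups, 1))
```

A query that needs only C_2 touches one list per row group instead of every field of every row, which is why this layout suits analytic scans.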
Informal Performance Comparison
                        Hive                             HBase
Insert speed            Batch                            Fast!
Update speed            NA                               Fast!
Lookup speed            MR lower bound (10s of seconds)  Fast!
Data warehouse queries  15x faster on one test           Uh oh
THANK YOU