Hive talk at Hadoop Summit 2009
Hive - Data Warehousing & Analytics on Hadoop
Wednesday, June 10, 2009 Santa Clara Marriott
Namit Jain, Zheng Shao
Facebook
Agenda
» Introduction
» Facebook Usage
» Hive Progress and Roadmap
» Open Source Community
» Introduction
Why Another Data Warehousing System?
» Data, data and more data
› ~1TB per day in March 2008
› ~10TB per day today
Let's try Hadoop…
» Pros
› Superior in availability/scalability/manageability
› Efficiency not that great, but throw more hardware
› Partial Availability/resilience/scale more important than ACID
» Cons: Programmability and Metadata
› Map-reduce hard to program (users know sql/bash/python)
› Need to publish data in well known schemas
» Solution: HIVE
Let's try Hadoop… (continued)
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
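For readers more at home in Python than in awk, the two streaming scripts above can be sketched as a pair of small functions. This is only an illustration of the same filter-then-count logic run in a single process; the function names and the sample rows are made up, and nothing here talks to Hadoop.

```python
from collections import Counter

def map_phase(lines, threshold=100):
    """Mimic map.sh: emit the first ^A-separated field when it exceeds the threshold."""
    for line in lines:
        key = line.split("\x01", 1)[0]
        if int(key) > threshold:
            yield key

def reduce_phase(keys):
    """Mimic reducer.sh: count how many times each key was emitted."""
    return Counter(keys)

# Hypothetical rows in the ^A-delimited layout the awk script expects.
rows = ["86\x01val_86", "238\x01val_238", "311\x01val_311", "238\x01val_238"]

for key, count in sorted(reduce_phase(map_phase(rows)).items()):
    print(f"{key}\t{count}")
```

Even in this compact form, the filter and the aggregation are spread over two functions plus glue, which is the programmability cost the slide is pointing at.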
What is HIVE?
» A system for managing and querying structured data built on top of Hadoop
› Map-Reduce for execution
› HDFS for storage
› Metadata on raw files
» Key Building Principles:
› SQL as a familiar data warehousing tool
› Extensibility – Types, Functions, Formats, Scripts
› Scalability and Performance
Simplifying Hadoop
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
hive> select key, count(1) from kv1 where key > 100 group by key;
» Facebook Usage
Data Warehousing at Facebook Today
[Diagram: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Hive/Hadoop Usage @ Facebook
» Types of Applications:
› Reporting
• E.g.: Daily/Weekly aggregations of impression/click counts
• SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
• Complex measures of user engagement
› Ad hoc Analysis
• Eg: how many group admins broken down by state/country
› Data Mining (Assembling training data)
• Eg: User Engagement as a function of user attributes
› Spam Detection
• Anomalous patterns for Site Integrity
• Application API usage patterns
› Ad Optimization
Hadoop Usage @ Facebook
» Cluster Capacity:
› 600 nodes
› ~2.4PB (80% used)
» Data statistics:
› Source logs/day: 6TB
› Dimension data/day: 4TB
› Compression Factor ~5x (gzip)
» Usage statistics:
› 3200 jobs/day with 800K map-reduce tasks/day
› 55TB of compressed data scanned daily
› 15TB of compressed output data written to HDFS
› 150 active users within Facebook
» Hive Progress and Roadmap
Data Model
» CREATE TABLE clicks(key STRING, value STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
LOCATION '/hive/clicks';
» Tables: clicks
› HDFS: /hive/clicks
» Logical Partitioning
› HDFS: /hive/clicks/ds=2008-03-25
» Hash Partitioning
› HDFS: /hive/clicks/ds=2008-03-25/0
› …
» MetaStore (Metastore DB)
› Tables: Data Location, Bucketing Info, Partitioning Cols
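The mapping from table, partition, and bucket to an HDFS path can be sketched in a few lines. The helper below is hypothetical (its name and signature are not part of Hive); it only reproduces the path shapes shown on the slide, and the bare-number bucket file name is an assumption taken from the `/0` example.

```python
def partition_path(table_location, partition_spec, bucket=None):
    """Build the storage path for one partition, and optionally one bucket inside it.

    partition_spec is an ordered list of (column, value) pairs.
    """
    parts = [table_location.rstrip("/")]
    parts += [f"{col}={val}" for col, val in partition_spec]
    if bucket is not None:
        # Assumption: the bucket is stored under its number, as on the slide.
        parts.append(str(bucket))
    return "/".join(parts)

print(partition_path("/hive/clicks", []))
print(partition_path("/hive/clicks", [("ds", "2008-03-25")]))
print(partition_path("/hive/clicks", [("ds", "2008-03-25")], bucket=0))
```

Because the partition value is encoded in the directory name, a query that filters on `ds` only needs to read the matching directories.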
HIVE: Components
[Architecture diagram: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, MetaStore (DB), Parser, Planner, Optimizer, Execution, SerDe (Thrift, CSV, JSON, ..), Map Reduce, HDFS]
Hive Query Language
» SQL
› Subqueries in from clause
› Equi-joins
› Multi-table Insert
› Multi-group-by
» Sampling
› SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
FROM pv_users
INSERT INTO TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid)
GROUP BY gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age;
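The point of the multi-table insert above is that pv_users is read once while several aggregations are fed from the same scan. A minimal sketch of that idea, assuming in-memory rows with made-up values (this is not how Hive plans the query, only the shape of the computation):

```python
from collections import defaultdict

def multi_group_by_distinct(rows, group_cols):
    """Single pass over rows; for each grouping column, count distinct userids per group."""
    seen = {col: defaultdict(set) for col in group_cols}
    for row in rows:
        for col in group_cols:
            seen[col][row[col]].add(row["userid"])
    return {col: {k: len(v) for k, v in groups.items()} for col, groups in seen.items()}

# Hypothetical pv_users rows.
pv_users = [
    {"userid": 1, "gender": "female", "age": 25},
    {"userid": 1, "gender": "female", "age": 25},
    {"userid": 2, "gender": "male", "age": 32},
    {"userid": 3, "gender": "male", "age": 25},
]

result = multi_group_by_distinct(pv_users, ["gender", "age"])
print(result["gender"])  # distinct users per gender
print(result["age"])     # distinct users per age
```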
Hive Query Language (continued)
» Extensibility
› Pluggable Map-reduce scripts
› Pluggable User Defined Functions
› Pluggable User Defined Types
• Complex object types: List of Maps
› Pluggable Data Formats
• Apache Log Format
Pluggable Map-Reduce Scripts
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
REDUCE map.dt, map.uid
USING 'reduce_script'
AS date, count;
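The slide does not show what map_script and reduce_script contain. As a hedged sketch, here is one plausible pair in Python, assuming the scripts receive tab-separated rows and that the reducer counts rows per date; the function bodies and sample rows are hypothetical. They are written as functions over lines so they can be tried without a cluster; a real script would read stdin and print to stdout.

```python
from itertools import groupby

def map_script(lines):
    """Take 'userid<TAB>date' rows and emit 'dt<TAB>uid' rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Take 'dt<TAB>uid' rows already grouped by dt and emit 'date<TAB>count' rows."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda r: r[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Hypothetical input; sorted() stands in for CLUSTER BY dt.
sample = ["111\t2009-05-01", "222\t2009-05-01", "111\t2009-05-02"]
for out in reduce_script(sorted(map_script(sample))):
    print(out)
```

The reducer relies on rows with the same dt arriving next to each other, which is what the CLUSTER BY clause in the query provides.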
Map Reduce Example
[Diagram: two machines (Machine 1, Machine 2); each stage below lists one line per machine]
» Input
› <k1, v1><k2, v2><k3, v3>
› <k4, v4><k5, v5><k6, v6>
» Local Map
› <nk1, nv1><nk2, nv2><nk3, nv3>
› <nk2, nv4><nk2, nv5><nk1, nv6>
» Global Shuffle
› <nk1, nv1><nk3, nv3><nk1, nv6>
› <nk2, nv4><nk2, nv5><nk2, nv2>
» Local Sort
› <nk1, nv1><nk1, nv6><nk3, nv3>
› <nk2, nv4><nk2, nv5><nk2, nv2>
» Local Reduce
› <nk1, 2><nk3, 1>
› <nk2, 3>
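The shuffle, sort, and reduce stages of this example can be simulated in a single process. The sketch below starts from the map output shown on the slide and counts values per key; the partitioner is a toy that routes on the numeric suffix of the key, which is an assumption made only so the result is deterministic.

```python
from itertools import groupby

def partition_for(key, num_reducers):
    """Toy partitioner (assumption): route on the numeric suffix of keys like 'nk2'."""
    return int(key[2:]) % num_reducers

def run_shuffle_sort_reduce(machine_outputs, num_reducers=2):
    """machine_outputs: one list of (key, value) pairs per machine, as produced by local map."""
    # Global shuffle: every pair goes to the reducer chosen by its key.
    partitions = [[] for _ in range(num_reducers)]
    for pairs in machine_outputs:
        for key, value in pairs:
            partitions[partition_for(key, num_reducers)].append((key, value))
    results = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # local sort by key
        # Local reduce: count the values seen for each key.
        results.append([(k, sum(1 for _ in g)) for k, g in groupby(part, key=lambda kv: kv[0])])
    return results

map_output = [
    [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3")],
    [("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")],
]
print(run_shuffle_sort_reduce(map_output))
```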
Hive QL – Join
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u
ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
page_view
pageid userid time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
user
userid age gender
111 25 female
222 32 male
Map (from page_view)
key value
111 <1,1>
111 <1,2>
222 <1,1>
Map (from user)
key value
111 <2,25>
222 <2,32>
Shuffle Sort
key value
111 <1,1>
111 <1,2>
111 <2,25>
key value
222 <1,1>
222 <2,32>
Reduce
pageid age
1 25
2 25
pageid age
1 32
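The tagged values in the diagram (1 for page_view rows, 2 for user rows) are what let the reducer tell the two sides apart. A minimal single-process sketch of that reduce-side join, using the rows from the slide; the function name is made up and a dictionary stands in for the shuffle:

```python
from collections import defaultdict

def reduce_side_join(page_view, user):
    # Map: key every row by userid and tag it with its source table.
    shuffled = defaultdict(list)
    for pageid, userid, _time in page_view:
        shuffled[userid].append((1, pageid))
    for userid, age, _gender in user:
        shuffled[userid].append((2, age))
    # Reduce: for each userid, pair every tagged pageid with every tagged age.
    joined = []
    for userid in sorted(shuffled):
        pageids = [v for tag, v in shuffled[userid] if tag == 1]
        ages = [v for tag, v in shuffled[userid] if tag == 2]
        joined.extend((p, a) for p in pageids for a in ages)
    return joined

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]
print(reduce_side_join(page_view, user))
```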
Join Optimizations
» Map Joins
› User specified small tables stored in hash tables on the mapper backed by jdbm
› No reducer needed
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
» Future
› Exploit table/column statistics for deciding strategy
Hive QL – Map Join
page_view
pageid userid time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
user
userid age gender
111 25 female
222 32 male
Hash table
key value
111 <1,2>
222 <2>
pv_users
pageid age
1 25
2 25
1 32
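A map join can be sketched as a hash-table build followed by a streaming probe, with no shuffle or reduce step. The code below follows the MAPJOIN(pv) hint by building the table from page_view; it uses a plain in-memory dict where the slide mentions a jdbm-backed table, and the function name is hypothetical.

```python
def map_join(page_view, user):
    # Build: hash the hinted table (page_view) on the join key.
    hash_table = {}
    for pageid, userid, _time in page_view:
        hash_table.setdefault(userid, []).append(pageid)
    # Probe: stream the other table and look each row up; no reducer is involved.
    return [
        (pageid, age)
        for userid, age, _gender in user
        for pageid in hash_table.get(userid, [])
    ]

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]
print(map_join(page_view, user))
```

The approach only pays off when the hashed table is small enough to be held by every mapper, which is why the slide leaves the choice to the user via the hint.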
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
pv_users (first split)
pageid age
1 25
1 25
pv_users (second split)
pageid age
2 32
1 25
Map (first split)
key value
<1,25> 2
Map (second split)
key value
<1,25> 1
<2,32> 1
Shuffle Sort
key value
<1,25> 2
<1,25> 1
key value
<2,32> 1
Reduce
pageid age count
1 25 3
pageid age count
2 32 1
Group by Optimizations
» Map side partial aggregations
› Hash-based aggregates
› Serialized key/values in hash tables
› 90% speed improvement on Query
• SELECT count(1) FROM t;
» Load balancing for data skew
» Optimizations being Worked On:
› Exploit pre-sorted data for distinct counts
› Exploit table/column statistics for deciding strategy
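Map-side partial aggregation can be illustrated with the pv_users rows from the Group By example: each mapper keeps a hash table of partial counts, and the reducer only merges those partial results. This is a sketch under the assumption that a Python Counter stands in for the serialized hash table; the function names are made up.

```python
from collections import Counter

def map_side_aggregate(split):
    """Partial aggregation: one hash table of counts per mapper."""
    return Counter(split)

def reduce_merge(partials):
    """Final aggregation: add up the partial counts for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# The two pv_users splits from the Group By slide, as (pageid, age) rows.
splits = [[(1, 25), (1, 25)], [(2, 32), (1, 25)]]
partials = [map_side_aggregate(s) for s in splits]
totals = reduce_merge(partials)
print(partials)
print(totals)
```

The first mapper ships a single record with count 2 instead of two records, so the saving grows with the number of repeated keys per split.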
Columnar Storage
» CREATE TABLE columnTable
(key STRING, value STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
STORED AS RCFILE;
» Saved 25% of space compared with SequenceFile
› Based on one of the largest tables (30 columns) inside Facebook
› Both are compressed with GzipCodec
» Speed improvements in progress
› Need to propagate column-selection information to FileFormat
» *Contribution from Yongqiang He (outside Facebook)
Speed Improvements over Time
Date       SVN Revision  Major Changes                Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
» QueryA: SELECT count(1) FROM t;
» QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
» QueryC: SELECT * FROM t;
» Time measured is map-side time only (to avoid unstable shuffling time at reducer side). It includes time for decompression and compression (both using GzipCodec).
» * No performance benchmarks for Map-side Join yet.
Overcoming Java Overhead
» Reuse objects
› Use Writable instead of Java Primitives
› Reuse objects across all rows
› *40% speed improvement on Query C
» Lazy deserialization
› Only deserialize the column when asked
› Very helpful for complex types (map/list/struct)
› *108% speed improvement on Query A
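The starred percentages appear to match the timings listed under Speed Improvements over Time when "speed improvement" is read as old time divided by new time, minus one. Which pairs of rows the authors compared is an assumption here: 83 sec to 40 sec for Query A (lazy deserialization) and 182 sec to 130 sec for Query C (object reuse).

```python
def speed_improvement_pct(before_sec, after_sec):
    """Speed improvement as a percentage: how much more work per second after the change."""
    return (before_sec / after_sec - 1) * 100

lazy_query_a = speed_improvement_pct(83, 40)     # about 107.5, quoted as 108% on the slide
reuse_query_c = speed_improvement_pct(182, 130)  # about 40

print(f"Lazy deserialization, Query A: {lazy_query_a:.1f}%")
print(f"Object reuse, Query C: {reuse_query_c:.1f}%")
```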
Generic UDF and UDAF
» Let UDF and UDAF accept complex-type parameters
» Integrate UDF and UDAF with Writables
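// Result object is allocated once and reused for every row.
private final IntWritable intWritable = new IntWritable();

public IntWritable evaluate(IntWritable a, IntWritable b) {
  intWritable.set(a.get() + b.get());
  return intWritable;
}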
HQL Optimizations
» Predicate Pushdown
» Merging n-way join
» Column Pruning
» Open Source Community
Open Source Community
» 21 contributors and growing
› 6 contributors within Facebook
» Contributors from:
› Academia
› Other web companies
› Etc.
» 7 committers
› 1 external to Facebook and looking to add more here
» 50 jiras fixed in last month
» 218 jiras still open
» 125 mails in last month on hive-user@
» 600 mails in last month on hive-dev@
» Various companies/universities
› Adknowledge, Admob
› Berkeley, Chinese Academy of Science
» Demonstration in VLDB’2009
Deployment Options
» EC2
› http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
» Cloudera Virtual Machine
› http://www.cloudera.com/hadoop-training-hive-tutorial
» Your own cluster
› http://wiki.apache.org/hadoop/Hive/GettingStarted
» Hive can directly consume data on Hadoop
› CREATE EXTERNAL TABLE mytable (key STRING, value STRING) LOCATION '/user/abc/mytable';
Future Work
» Benchmark & Performance
» Integration with BI tools (through JDBC/ODBC)
» Indexing
» More on Hive Roadmap
› http://wiki.apache.org/hadoop/Hive/Roadmap
» Machine Learning Integration
» Real-time Streaming
Information
» Available as a sub project in Hadoop
- http://wiki.apache.org/hadoop/Hive (wiki)
- http://hadoop.apache.org/hive (home page)
- http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
- ##hive (IRC)
- Works with hadoop-0.17, 0.18, 0.19
» Release 0.3 is out and more are coming
» Mailing Lists:
› hive-{user,dev,commits}@hadoop.apache.org
Contributors
» Aaron Newton
» Ashish Thusoo
» David Phillips
» Dhruba Borthakur
» Edward Capriolo
» Eric Hwang
» Hao Liu
» He Yongqiang
» Jeff Hammerbacher
» Johan Oskarsson
» Josh Ferguson
» Joydeep Sen Sarma
» Kim P.
» Michi Mutsuzaki
» Min Zhou
» Namit Jain
» Neil Conway
» Pete Wyckoff
» Prasad Chakka
» Raghotham Murthy
» Richard Lee
» Shyam Sundar Sarkar
» Suresh Antony
» Venky Iyer
» Zheng Shao
» Questions