Performance Evaluation of Cloudera Impala (with Comparison to Hive)

Cloudera Impala Performance Evaluation with Comparison to Hive. Dec. 8, 2012. CELLANT Corp. R&D Strategy Division, Yukinori SUDA (@sudabon)

Upload: yukinori-suda

Post on 14-Jun-2015


Page 1: Performance evaluation of cloudera impala (with Comparison to Hive)

Cloudera Impala Performance Evaluation (with Comparison to Hive)
Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

Page 2: Performance evaluation of cloudera impala (with Comparison to Hive)

About Cloudera Impala
•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad-hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip and Bzip2
•  Accesses the data directly through a specialized distributed query engine

Page 3: Performance evaluation of cloudera impala (with Comparison to Hive)

Architecture
•  The State Store runs as the impala-state-store (statestored) daemon
•  The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon

Page 4: Performance evaluation of cloudera impala (with Comparison to Hive)

System Environment
•  Installed via Cloudera Manager Free Edition
•  Master (1 server)
   o  HDFS: NameNode, SecondaryNameNode
   o  MapReduceV1: JobTracker
   o  Impala: impalad, impala-state-store (statestored)
•  Slaves (13 servers)
   o  HDFS: DataNode
   o  MapReduceV1: TaskTracker
   o  Impala: impalad
•  All servers are connected with 1 Gbps Ethernet through an L2 switch
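The role placement on this slide can be written down as a small data structure. This is only a sketch: it assumes the diagram's "1 Server" label belongs to the master and "13 Servers" to the slaves, and the names `CLUSTER` and `hosts_running` are hypothetical, not part of any Cloudera tool.

```python
# Sketch of the cluster layout described on the slide.
# Assumption: 1 master and 13 slaves (read from the flattened diagram labels).
CLUSTER = {
    "master": {
        "count": 1,
        "daemons": ["NameNode", "SecondaryNameNode", "JobTracker",
                    "impalad", "statestored"],
    },
    "slave": {
        "count": 13,
        "daemons": ["DataNode", "TaskTracker", "impalad"],
    },
}


def hosts_running(daemon: str) -> int:
    """Return how many servers run the given daemon under this layout."""
    return sum(role["count"] for role in CLUSTER.values()
               if daemon in role["daemons"])
```

Under this reading, `hosts_running("impalad")` counts every server in the cluster, while `hosts_running("statestored")` counts only the master.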

Page 5: Performance evaluation of cloudera impala (with Comparison to Hive)

Server Specification
•  CPU
   o  Intel Core 2 Duo 2.13 GHz with Hyper Threading
•  Memory
   o  4 GB
•  Disk
   o  7,200 rpm SATA mechanical hard disk drive
•  OS
   o  CentOS 6.2

Page 6: Performance evaluation of cloudera impala (with Comparison to Hive)

Benchmark
•  Used CDH4.1 + Impala versions 0.2 and 0.3
•  Used hivebench from the open-source benchmark tool “HiBench”
   o  https://github.com/hibench
•  Reduced the datasets to 1/10 scale
   o  The default configuration generates a table with 1 billion rows
•  Modified the query
   o  Removed “INSERT INTO TABLE …” to evaluate read-only performance
   o  Removed the “datediff” function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several compression methods
   o  TextFile, SequenceFile, RCFile
   o  No compression, Gzip, Snappy
•  Compared by query latency
   o  Average job latency over 5 measurements
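The measurement procedure above (every storage format paired with every compression method, each averaged over 5 runs) could be sketched as follows. The harness is hypothetical: `make_runner` is an assumed callback that returns a function executing the query against a table stored in the given format, and none of these names come from HiBench or Impala.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, Tuple

STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["None", "Gzip", "Snappy"]
RUNS = 5  # the slides report the average over 5 measurements


def mean_latency(run_query: Callable[[], None], runs: int = RUNS) -> float:
    """Run the query `runs` times and return the mean wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark(
    make_runner: Callable[[str, str], Callable[[], None]],
) -> Dict[Tuple[str, str], float]:
    """Measure mean latency for every (storage format, compression) pair."""
    results = {}
    for fmt, codec in itertools.product(STORAGE_FORMATS, COMPRESSIONS):
        results[(fmt, codec)] = mean_latency(make_runner(fmt, codec))
    return results
```

In practice `make_runner` would submit the query to Hive or Impala. An engine that does not support a given format (for example RCFile on the Impala versions tested here) would simply be skipped for that combination.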

Page 7: Performance evaluation of cloudera impala (with Comparison to Hive)

Modified Datasets
•  Rankings table
   o  12 million rows
   o  Schema
      •  pageURL string
      •  pageRank int
      •  avgDuration int
•  Uservisits table
   o  100 million rows
   o  Schema
      •  sourceIP string
      •  destURL string
      •  visitDate string
      •  adRevenue double
      •  userAgent string
      •  countryCode string
      •  languageCode string
      •  searchWord string
      •  duration int
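To make the two schemas concrete, here is a minimal sketch that builds a few synthetic rows of each shape. It is not HiBench's data generator: the value ranges, the URL pattern and the function names are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Ranking:
    pageURL: str
    pageRank: int
    avgDuration: int


@dataclass
class UserVisit:
    sourceIP: str
    destURL: str
    visitDate: str  # stored as a 'YYYY-MM-DD' string, as in the schema
    adRevenue: float
    userAgent: str
    countryCode: str
    languageCode: str
    searchWord: str
    duration: int


def make_rankings(n: int, seed: int = 0) -> List[Ranking]:
    """Build n synthetic rankings rows (illustrative values only)."""
    rng = random.Random(seed)
    return [Ranking(f"http://example.com/page{i}", rng.randint(1, 100),
                    rng.randint(1, 60)) for i in range(n)]


def make_uservisits(n: int, rankings: List[Ranking], seed: int = 0) -> List[UserVisit]:
    """Build n synthetic uservisits rows whose destURL points at a rankings row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        date = f"{rng.randint(1995, 2005):04d}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
        rows.append(UserVisit(ip, rng.choice(rankings).pageURL, date,
                              rng.uniform(0.0, 10.0), "Mozilla/5.0", "JPN",
                              "ja", "impala", rng.randint(1, 600)))
    return rows
```

Keeping `visitDate` as a string matches the schema above and means date filters are plain lexicographic comparisons on `YYYY-MM-DD` values.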

Page 8: Performance evaluation of cloudera impala (with Comparison to Hive)

Modified Query

    SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
    FROM rankings R
    JOIN (
        SELECT sourceIP, destURL, adRevenue
        FROM uservisits UV
        WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
    ) NUV
    ON (R.pageURL = NUV.destURL)
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1
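To see what the query computes, it can be run on a handful of rows with Python's built-in `sqlite3` module. This is only an illustration of the query's semantics: SQLite stands in for Hive/Impala, the filter column is written as `visitDate` to match the Uservisits schema, the `uservisits` table keeps only the columns the query touches, and the sample rows are made up.

```python
import sqlite3

QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_on_sample():
    """Run the benchmark query on a tiny in-memory sample and return the top row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER);
        CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT,
                                 visitDate TEXT, adRevenue REAL);
    """)
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)",
                     [("a.com", 10, 5), ("b.com", 20, 7)])
    conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", [
        ("1.1.1.1", "a.com", "2000-05-01", 1.5),
        ("1.1.1.1", "b.com", "2000-06-01", 2.5),
        ("2.2.2.2", "a.com", "2000-07-01", 3.0),
        ("2.2.2.2", "b.com", "1998-01-01", 100.0),  # outside the date range
        ("3.3.3.3", "c.com", "2000-01-01", 50.0),   # no matching rankings row
    ])
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row
```

On this sample the large visits from `2.2.2.2` and `3.3.3.3` are discarded by the date filter and the join respectively, so the top source IP is `1.1.1.1` with a total revenue of 4.0 and an average page rank of 15.0.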

Page 9: Performance evaluation of cloudera impala (with Comparison to Hive)

Benchmark Result (Hive)

Page 10: Performance evaluation of cloudera impala (with Comparison to Hive)

Benchmark Result (Impala 0.2)

Page 11: Performance evaluation of cloudera impala (with Comparison to Hive)

Benchmark Result (Impala 0.3)

Page 12: Performance evaluation of cloudera impala (with Comparison to Hive)

Conclusion
•  Impala is over 10 times faster than MR + Hive
   o  Impala 0.3
      •  SequenceFile compressed with Snappy: 14.337 seconds
   o  Impala 0.2
      •  SequenceFile compressed with Gzip: 19.733 seconds
   o  Hive
      •  RCFile compressed with Snappy: 164.161 seconds
•  Hopefully Impala 1.0, included in CDH5, will be even faster
   o  Support for RCFile and the Trevni columnar format
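The speedup behind the headline claim can be checked directly from the three latencies quoted above. The dictionary keys and the helper name `speedup_over_hive` are hypothetical labels for this sketch.

```python
# Latencies in seconds, as quoted on the Conclusion slide.
LATENCY_SECONDS = {
    "Hive (RCFile + Snappy)": 164.161,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
}


def speedup_over_hive(latencies, baseline="Hive (RCFile + Snappy)"):
    """Return each configuration's speedup factor relative to the baseline latency."""
    base = latencies[baseline]
    return {name: base / seconds
            for name, seconds in latencies.items() if name != baseline}


for name, factor in speedup_over_hive(LATENCY_SECONDS).items():
    print(f"{name}: {factor:.2f}x faster than Hive")
```

With these figures Impala 0.3 comes out at about 11.45x and Impala 0.2 at about 8.32x, so the "over 10 times" figure corresponds to the Impala 0.3 result.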

Page 13: Performance evaluation of cloudera impala (with Comparison to Hive)

Thank you