tpc-h analytics' scenarios and performances on hadoop data clouds
DESCRIPTION
TRANSCRIPT
TPC-H ANALTYCS’ SCENARIOSTPC H ANALTYCS SCENARIOS
AND PERFORMANCES ON
HADOOP DATA CLOUDS
RIM [email protected] –UNIV OF TUNISLATICE –UNIV. OF TUNISTUNISIA
4th International. Conference on Networked Digital TechnologiesNDT’2012, Dubai, UAE.24th, April 2012
OUTLINELaTICE
Cloud A l tiAnalytics
1. Business Intelligence2. Motivation
data managment issues, NoSQL, cloudsg , ,
OLAP in the cloud
3. Implementation of OLAP in the cloud3. Implementation of OLAP in the cloudTPC-H Benchmark
Analytics Scenariosy
Performance Measurements
4. Related Work4. Related Work5. Conclusion6 W k
NDT’2012. Dubai. UAE 224th, Apr. 2012
6. Future Work
BUSINESS INTELLIGENCELATICE
Cloud A l ti
BIMotivationTPC‐H & COLAPPerformanceConclusionAnalytics Future work
Business intelligence aims to support better business g ppdecision-making. Common functions of business intelligence technologies areCommon functions of business intelligence technologies are
On-Line Analytical Processing,
data mining process mining data mining, process mining, Business performance managementT t i i d di ti l ti Text mining and predictive analytics, …
Market shareGartner Research Reports BI Market Revenue Hit $12.2 Billion in 2011
NDT’2012. Dubai. UAE 324th, Apr. 2012
MOTIVATIONLATICE
Cloud A l ti
BIMotivation|—Issues|— NoSQL|—CloudAnalytics ||—COLAP
Decision Support Systems• I t D t & l kl d• Incessant Data & complex workload• Complex DB schema
Scalability Issues• Ideally, Linear Speed up & Linear Scale up• DBMS do not scale linearlyy• OLAP Technologies do not scale
Hardware Hardware • I/O Bottleneck I/O-bound data storage systems• Gilder law: Thrice bandwidth every 3 years
M L T d 18 h • Moore Law: Twice computing and storage capacities every 18 months. Obsolete by 2017
• Vertical scaling cost >> Horizontal scaling cost
NDT’2012. Dubai. UAE 424th, Apr. 2012
NOSQLLATICE
Cloud A l ti
BIMotivation|—Issues|— NoSQL|—CloudAnalytics
Bi Ch ll l t d t l it
||—COLAP
Big Challenges related to velocity How fast huge volumes of data can be processed?
N SQL l tiNoSQL solutionsAdopted by Google, Facebook, Amazon, …
Dynamic horizontal scale upDynamic horizontal scale-up
Nodes are added without bringing the cluster down
Shared-nothing architectureShared nothing architecture
Independent computing& storage nodes interconnected via a high speed network
Distributed programming framework: MapReduce (Google)
NDT’2012. Dubai. UAE 524th, Apr. 2012
CLOUD COMPUTINGLATICE
Cloud A l ti
BIMotivation|—Issues|— NoSQL|—CloudAnalytics
Cloud computing is a style of computing where scalable and elastic IT-bl d biliti id d " i " t t l t
||—COLAP
enabled capabilities are provided as a service to external customers using Internet technologies.
Broad network accessBroad network access
Resource pooling (virtualization)
Self-provisioningScalableAnalyticsp g
Rapid elasticity
Measured service
Analytics
Market shareForrester Research expects the global cloud computing market to reach $241 billion in 2020. In particular, SaaS market growing to $92.8 billion by 2016.
Gartner group expects the cloud computing market will reach $US150.1 billion, with a compound annual rate of 26 5% in 2013
NDT’2012. Dubai. UAE 624th, Apr. 2012
with a compound annual rate of 26.5%, in 2013.
OLAP IN THE CLOUDLATICE
Cloud A l ti
BIMotivation|—Issues|— NoSQL|—CloudAnalytics
OLAP constraints
||—COLAP
Big data analytics’ obstaclesCurrent systems & technologies do not scale
Cloud computing
Key benefits of Cloud ComputingPerformance
NoSQLBig data analytics
Much faster data analysis,Dynamic and up-to-date hardware infrastructure,y p ,
More Economical Organizations no longer need to expend capital upfront for Organizations no longer need to expend capital upfront for hardware and software purchasesServices are provided on a pay-per-use basis,
NDT’2012. Dubai. UAE 724th, Apr. 2012
p p y p ,
TPC-HDECISION SUPPORT SYSTEM BENCHMARK
LATICE
Cloud A l ti
BIMotivationTPC‐H COLAPPerformanceDECISION-SUPPORT SYSTEM BENCHMARKAnalytics
DATA
Conclusion
• Complex DB schema• Scale factor 1, 10, …, 100,000 correspond respectively
to 1GB 10 GB 100 TBto 1GB, 10 GB, …, 100 TB• 8 data files {lineitem, customer…, region}.tbl• broad industry-wide relevancey
22 l ld b
WORKLOAD
• 22 real world business questions• High degree of complexity
• Star queries (complex joins)Star queries (complex joins)• Grouping• Nested queries
NDT’2012. Dubai. UAE 824th, Apr. 2012
TPC-H BENCHMARKLATICE
Cloud A l ti
BIMotivationTPC‐H |— E/R schemaCOLAPAnalytics Performance
NDT’2012. Dubai. UAE 924th, Apr. 2012
HADOOP/PIG LATINLATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translationPerformance
APACHE HADOOP
Analytics Conclusion
• Framework for running applications on large clusters of dit h d
APACHE HADOOP
commodity hardware. • Implements computational framework MapReduce• HDFS: (hadoop distributed file system) stores data on the ( p y )
compute nodes • Replication & job resoumissions for failures’ handling
APACHE PIG LATIN
• high-level language for expressing data analysis programs (filter, projection, join, group, sort, union, …)
NDT’2012. Dubai. UAE 1024th, Apr. 2012
PIG LATIN BENCHMARK5 TRANSLATION HINTS
LATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translation|—nominal 5 TRANSLATION HINTSAnalytics
1. Load Data for Immediate
|scenario
Processing• Better memory managementy g
2. Minimum Relation Scan• Conjunction/disjunction of Conjunction/disjunction of
predicates applied once
3 Unary operations prior to 3. Unary operations prior to binary operations• U ti ( j ti • Unary operations (projection,
restriction, ) reduce data volume
NDT’2012. Dubai. UAE 1124th, Apr. 2012
PIG LATIN BENCHMARK5 TRANSLATION HINTS CTND
LATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translation|—nominal 5 TRANSLATION HINTS -CTNDAnalytics
4. Intra-operation parallelism
|scenario
p p• partitioned join
5 Join Algorithm5. Join Algorithm• Algorithm
• h h j i • hash join, • merge join,
• S i j i d i• Star-queries: joins ordering
NDT’2012. Dubai. UAE 1224th, Apr. 2012
NOMINAL ANALYTICAL SCENARIOLATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translation|—nominal Analytics |
scenario
Pig Latin Script (Business Question)
Response
NDT’2012. Dubai. UAE 1324th, Apr. 2012
NOMINAL ANALYTICAL SCENARIOLATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translation|—nominal Analytics
High Cost
|scenario
• Measured Service, pay as you consume cloud ressources (bandwitdh, CPU, RAM)
Performance Issues• The same query (with same or different parameters) is
executed several times with no optimization
Di ti it f S iDiscontinuity of Service• Network failure/congestion
NDT’2012. Dubai. UAE 1424th, Apr. 2012
COMPLEX!!LATICE
Cloud A l ti
TPC‐H COLAP|— hadoop/pig|— translation|—nominal Analytics
How to reduce service cost?
|scenario
How to reduce service cost? How to improve performances?How to prevent discontinuity of service?
Cost• Materialized
views?• Aggregated data • Exploit
i i
CostManagement
Aggregated data replication
• OLAP or not?
W kl d S d
organizationhardware resources?
• Cloud Right size?Workload Studyfor Tuning
Cloud Right size?
NDT’2012. Dubai. UAE 1524th, Apr. 2012
TPC-H WORKLOAD NUMERICAL STUDYTYPE A
LATICE
Cloud A l ti
COLAP|— …|—nominal
scenario|—towardsTYPE AAnalytics
*Order Priority Checking*
|better scenario
135dimdim
alwaysmeasureorderscountpriorityorderdateorder ××
OLAP!+export to olap server
NDT’2012. Dubai. UAE 1624th, Apr. 2012
+export to olap server+MV
TPC-H WORKLOAD NUMERICAL STUDYTYPE B
LATICE
Cloud A l ti
COLAP|— …|—nominal
scenario|—towardsTYPE BAnalytics |
better scenario
*Large Volume Orders*
dim × measureqtylinesumorder
130083000,500,1
dim
=>
×
×
∑ SFforqtylinehaveordersofppmalmostSF
measureqtylinesumorder
1,3008.3 =>∑ SFforqtylinehaveordersofppmalmost
Not OLAP!
NDT’2012. Dubai. UAE 1724th, Apr. 2012
+MV
TPC-H WORKLOAD NUMERICAL STUDYTYPE C
LATICE
Cloud A l ti
COLAP|— …|—nominal
scenario|—towardsTYPE CAnalytics
*Minimum Cost Supplier*
|better scenario
000,000,000,2
costsupplymindimpartdimsupplier2 ×
××
SF
measureOLAP!MV storage cost
NDT’2012. Dubai. UAE 1824th, Apr. 2012
,,,MV storage costbest supplier in each region for each part!
TPC-H WORKLOAD NUMERICALSTUDY
LATICE
Cloud A l ti
COLAP|— …|—nominal
scenario|—towardsSTUDYAnalytics
Type Features TPC-H Business Questions
|better scenario
(OLAP Cube)
A • Medium dimensionality Q1, Q3, Q4, Q5, Q6, Q7, • Result is TPC-H Scale Factor
independentQ8, Q12, Q13, Q14, Q16, Q19, Q2213 business questions13 business questions
B • High dimensionality• few results, lots of empty
Q15, Q182 business questionsfew results, lots of empty
cellsq
C • High dimensionality Q2, Q9, Q10, Q11, Q17, g y• Result % of Scale Factor
, , , , ,Q20, Q217 business questions
NDT’2012. Dubai. UAE 1924th, Apr. 2012
CLOUDCOST MANAGEMENT
LATICE
Cloud A l ti
COLAP|— …|—nominal
scenario|—towardsCOST MANAGEMENTAnalytics
Measured Service
|better scenario
Measured Service pay as you goCPU + Memory + BandwidthCPU + Memory + Bandwidth
“When users understand the relationship between cost and consumption everybody wins” –Ron Millerconsumption, everybody wins –Ron MillerEmerging need to understand, manage and
ti l t l t th l dproactively control costs across the cloudResource Utilization MonitoringRight size w.r.t. both performances & cost (client and provider)
Green cloud through energy saving
NDT’2012. Dubai. UAE 2024th, Apr. 2012
BETTER SCENARIOLATICE
Cloud A l ti
COLAP|— …|—better scenarioPerformanceRelated workAnalytics Conclusion
Pig Latin Script (Generalized Business
Question)Question)
Pre-aggregated DataInteraction
OLAP Client Import Data into an on-siteinto an on-site
OLAP server
NDT’2012. Dubai. UAE 2124th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics Future work
K
• French GRID platform: a large H • TPC H S • Apache
G5K
p gscale nation wide infrastructure for Grid research.
• Bordeaux Site TPC
-H • TPC-H Benchmark
• SF=1• 1 1GB source /H
DFS • Apache
Hadoop 0.20• N=3, 5 or 8 • one Hadoop• Bordeaux Site
• Borderel: 24GB RAM, 4 AMD CPUs, 2.27 GHz,
/
T 1.1GB source files
• 4.5GB single big file Pi
g/
one HadoopMaster
• (2, 4 or 7) Workers
and 4cores/CPU.• Borderline: :
32GB RAM, 4 Intel Xeon CPUs,
g• SF=10
• 11GB source files
• Apache Pig 0.8.1
e eo C Us, 2.6 GHz, and 2 cores/CPU.
• Ethernet10Gbps
• 45GB single big file
NDT’2012. Dubai. UAE 2224th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Except business questions which do not perform join operations: No improvementwhen cluster size doubles
NDT’2012. Dubai. UAE 2324th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Improvement when cluster size doublesMore data so it’s nice to have more storage & computing nodes
NDT’2012. Dubai. UAE 2424th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Elapsed times for SF=10 (11GB) are
At maximum 5 times elapsed times obtained for SF=1 (1.1GB)
In average twice elapsed times obtained for SF=1 (1.1GB)
NDT’2012. Dubai. UAE 2524th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
Joining partitionned files is complex!
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Combine all files into one fileSF=1 4.5GBSF 1 4.5GBSF=10 45GB
Evaluation of Pig/MR without joinsDenormalization
j i tsaves join costincreases required storage space (≈ ×4 for TPC-H)
NDT’2012. Dubai. UAE 2624th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Compared to (SF=1, 1.1GB), improvements range from 10% to 80%,p ( , ), p g ,performance degradation with more than 4 workers (N=5): this is due to MR framework (before reduce phase, data is grouped and sorted which has a costwhen involving more storage and computing nodes)
NDT’2012. Dubai. UAE 2724th, Apr. 2012
when involving more storage and computing nodes)
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Original TPC H 1GB
Original TPC H 11GB
Big FileBig File 4 5GB
Big File 45GB
Original 1.1 11GB
Future work
TPC-H 1GB TPC-H 11GBg
4.5GB 45GBvs 11GB
Compared to (SF=10, 11GB), after affording more nodes we obtain similar resultsp ( , ), gCompared to (SF=1, 4.5GB), elapsed times are less than 10× for same cluster sizePerformance degradation for queries which do not perform joins
Q1 executes over 7GB lineitem file (SF=10,11GB), now it executes over 45GB file
NDT’2012. Dubai. UAE 2824th, Apr. 2012
Q1 executes over 7GB lineitem file (SF 10,11GB), now it executes over 45GB file
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Big FileBig File 4 5GB
Big File 45GB
OLAP OLAP 4.5GB OLAP 45GB
Future work
g4.5GB 45GB
Aggregated data TPC-H business questions type A (SF independent & small resultset)TPC-H business questions type A (SF independent & small resultset)TPC-H business questions type B (very very small resultset)15 business questions from 22
Tradeoff between space & computation TPC-H business questions type C Add derived fieldsAdd derived fields
Q2: check (true) minimum supplycost by supplier for a part in PartSuppQ17: average_line_quantity field for each partQ20: sum_lines_quantities_per_year for each supplierQ21: number of waiting orders for each supplier
7 business questions from 22
NDT’2012. Dubai. UAE 2924th, Apr. 2012
7 business questions from 22
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Big FileBig File 4 5GB
Big File 45GB
OLAP OLAP 4.5GB OLAP 45GB
Future work
g4.5GB 45GB
Average degradation is 5%done once + time to retreive data
NDT’2012. Dubai. UAE 3024th, Apr. 2012
PERFORMANCE MEASUREMENTSLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Big FileBig File 4 5GB
Big File 45GB
OLAP OLAP 4.5GB OLAP 45GB
Future work
g4.5GB 45GB
Average degradation is 60%done once + time to retreive data
NDT’2012. Dubai. UAE 3124th, Apr. 2012
RELATED WORKLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion Analytics
Implementation of relational operations using MR framework
Future work
Implementation of relational operations using MR frameworkKim et al. –MRBench, 2008
Nominal analytics scenarioNominal analytics scenario
TPC-H benchmarking for SF=1,3
Translation from SQL to Pig LatinTranslation from SQL to Pig LatinIu et al. –Hadoop to SQL, 2010
Lee et al. –Ysmart, 2011 ,
Other pig latin use cases
Shatzle et al, RDF data, 2011
Loebman etal. Astrophysical data, 2009
NDT’2012. Dubai. UAE 3224th, Apr. 2012
CONCLUSIONLATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusionAnalytics
TPC H in depth numerical study
Future work
TPC-H in-depth numerical studyOLAP in the cloud
ScenariosImplementation p
Pig / Hadoop Distributed File SystemPerformancesPerformances
Various cluster sizes Various data volumesVarious schemas (with and without joins)
NDT’2012. Dubai. UAE 3324th, Apr. 2012
( j )
FUTURE WORKAQP
LATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion AQPAnalytics
Approximate Query Processing in clouds
Future work
pp y gMost Distributed File Systems implement replication for high availabilityfor high availabilityMDS erasure codes outperform replication from two perspectives (i) storage cost and (ii) minimal perspectives (i) storage cost and (ii) minimal operation cost of redundant data management
N H d l New Hadoop release Facebook
Generalized framework for approximate data analytics in the cloud coping with nodes’ failure
NDT’2012. Dubai. UAE 3424th, Apr. 2012
FUTURE WORKPIG LATIN++
LATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion PIG LATIN++Analytics
Most of TPC-H business questions scripts are composed of
Future work
jobs, which execute sequentially,some branches of the DAG are unnecessarily y
blocked! namely Q1 Q3 Q4 Q9 Q10 Q11 Q12 Q13 namely Q1, Q3, Q4, Q9, Q10, Q11, Q12, Q13, Q14, Q16, Q17 and Q18
Pig Latin EnhancementsPig Latin EnhancementsInvestigate intra-operation Parallelism for better
f f Sperformances of Pig Scripts Investigate better job definitions strategies, in order
NDT’2012. Dubai. UAE 3524th, Apr. 2012
to increase inter-job parallelism
FUTURE WORKPIG LATIN++⟩⟩ Q16 EXPLE
LATICE
Cloud A l ti
TPC‐HCOLAPPerformanceRelated workConclusion PIG LATIN++⟩⟩ Q16 EXPLEAnalytics
Q16’s DAG Q16’s pig
Future work
J_0005
script
J_0004
J_0003
J_0002
J 0001
NDT’2012. Dubai. UAE 3624th, Apr. 2012
J_0001
TPC-H ANALTYCS SCENARIOS AND PERFORMANCES ON HADOOP DATACLOUDS
THANK YOU FOR YOUR ATTENTION
Q & AQ & A
24th, Apr. 2012, p
4th International. Conference on Networked Digital Technologies
NDT’12.Dubai. UAE