Page 1: Benchmarking “No One Size Fits All” Big Data Analytics

Benchmarking “No One Size Fits All” Big Data Analytics

BigFrame Team

The Hong Kong Polytechnic University

Duke University

HP Labs

Page 2: Benchmarking “No One Size Fits All” Big Data Analytics

Analytics System Landscape

• MPP DB
  o Greenplum, SQL Server PDW, Teradata, etc.

• Columnar
  o Vertica, Redshift, Vectorwise, etc.

• MapReduce
  o Hadoop, Hive, HadoopDB, Tenzing, etc.

• Streaming
  o Storm, StreamBase, etc.

• Graph
  o Pregel, GraphLab, etc.

• Multi-tenancy
  o Mesos, YARN, etc.

Page 3: Benchmarking “No One Size Fits All” Big Data Analytics

What does this mean for Big Data Practitioners?

Page 4: Benchmarking “No One Size Fits All” Big Data Analytics

Gives them a lot of power!

Page 5: Benchmarking “No One Size Fits All” Big Data Analytics

Even the mighty may need a little help

Page 6: Benchmarking “No One Size Fits All” Big Data Analytics

Challenges for Practitioners

Which system to use for the app that I am developing?

• Features (e.g., graph data)

• Performance (e.g., claims like "System A is 50x faster than B")

• Resource efficiency

• Growth and scalability

• Multi-tenancy

App Developers, Data Scientists

Page 7: Benchmarking “No One Size Fits All” Big Data Analytics

Challenges for Practitioners

Which system to use for the app that I am developing?

Different parts of my app have different requirements

Compose "best of breed" systems, or use a "one size fits all" system?

App Developers, Data Scientists

Page 8: Benchmarking “No One Size Fits All” Big Data Analytics

Challenges for Practitioners

Which system to use for the app that I am developing?

Different parts of my app have different requirements

Managing many systems is hard!

App Developers, Data Scientists

System Admins, CIO

Total Cost of Ownership (TCO)?

Page 9: Benchmarking “No One Size Fits All” Big Data Analytics

Need Benchmarks

Page 10: Benchmarking “No One Size Fits All” Big Data Analytics

One Approach

Categorize systems

Develop a benchmark per system category

Page 11: Benchmarking “No One Size Fits All” Big Data Analytics

Useful, But ...

• MPP DB, Columnar
  o TPC-H/TPC-DS, Berkeley Big Data Benchmark, etc.

• MapReduce
  o TeraSort, DFSIO, GridMix, HiBench, etc.

• Streaming
  o Linear Road, etc.

• Graph
  o Graph 500, PageRank, etc.

• ...

Page 12: Benchmarking “No One Size Fits All” Big Data Analytics

Problem: May miss the Big Picture

Page 13: Benchmarking “No One Size Fits All” Big Data Analytics

Problem: May miss the Big Picture

• Cannot capture the complexities and end-to-end behavior of big data applications and deployments:
  o Bottlenecks
  o Data conversion, transfer, & loading overheads
  o Storage costs & other parts of the data life-cycle
  o Resource management challenges
  o Total Cost of Ownership (TCO)

Page 14: Benchmarking “No One Size Fits All” Big Data Analytics

A Better Approach:

BigBench or Deep Analytics Pipeline:

• Application-driven

• Involves multiple types of data:
  o Structured
  o Semi-structured
  o Unstructured

• Involves multiple types of operators:
  o Relational operators: join, group by
  o Text analytics: sentiment analysis
  o Machine learning

Page 15: Benchmarking “No One Size Fits All” Big Data Analytics

Problem:

Give a man a fish and you will feed him for a day.

Give him fishing gear and you will feed him for life.

--Anonymous

[Slide graphic: "Benchmark" contrasted with "Benchmark Generator"]

Page 16: Benchmarking “No One Size Fits All” Big Data Analytics

BigFrame

A Benchmark Generator for Big Data Analytics

Page 17: Benchmarking “No One Size Fits All” Big Data Analytics

How a user uses BigFrame

[Diagram labels: BigFrame Interface; bigif (benchmark input format); Benchmark Generator; bigspec (benchmark specification); Benchmark Driver for System Under Test; run the benchmark; System Under Test; Hive; MapReduce; HBase; result]
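
The role of a benchmark driver on this slide can be illustrated with a small sketch. It is not BigFrame's actual API: the names `BigSpec`, `BenchmarkDriver`, and `EchoDriver` are hypothetical, the specification is reduced to a list of query strings, and the "system under test" is a stand-in that simply echoes each query.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BigSpec:
    """Hypothetical benchmark specification: a named list of queries."""
    name: str
    queries: list = field(default_factory=list)


class BenchmarkDriver(ABC):
    """Hypothetical driver that adapts a specification to one system under test."""

    @abstractmethod
    def execute(self, query):
        """Send one query to the system under test."""

    def run(self, spec):
        """Run every query in the spec and return elapsed seconds per query."""
        timings = {}
        for query in spec.queries:
            start = time.perf_counter()
            self.execute(query)
            timings[query] = time.perf_counter() - start
        return timings


class EchoDriver(BenchmarkDriver):
    """Stand-in system under test; a real driver would call Hive, HBase, etc."""

    def execute(self, query):
        return query.upper()


spec = BigSpec("demo", ["select 1", "select 2"])
result = EchoDriver().run(spec)
for query, seconds in result.items():
    print(f"{query}: {seconds:.6f}s")
```

Keeping the timing loop in the base class means each new system only needs its own `execute` method, which is one plausible way to support "drivers for more systems".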

Page 18: Benchmarking “No One Size Fits All” Big Data Analytics

bigspec: Benchmark Specification

Hive, MapReduce, HBase

Page 19: Benchmarking “No One Size Fits All” Big Data Analytics

What should be captured by the benchmark input format

• The 3Vs: Volume, Velocity, Variety

Page 20: Benchmarking “No One Size Fits All” Big Data Analytics

bigif: BigFrame's Input Format

Page 21: Benchmarking “No One Size Fits All” Big Data Analytics

Benchmark Generation

bigif (benchmark input format) → Benchmark Generator → bigspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}

1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

Benchmark generation can be addressed as a search problem within a rich application domain
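
To make the "discrete space" and "search problem" ideas concrete, here is a minimal sketch. The axis names follow the slide ({Data, Query} × {Variety, Volume, Velocity}), but the two values listed per axis are assumptions chosen for illustration, and `enumerate_points` and `search` are hypothetical helpers, not part of BigFrame.

```python
from itertools import product

# Six axes: {data, query} x {variety, volume, velocity}.
# The values on each axis are illustrative assumptions.
SPACE = {
    "data_variety": ["relational", "relational+text"],
    "data_volume": ["small", "large"],
    "data_velocity": ["static", "fast"],
    "query_variety": ["micro", "macro"],
    "query_volume": ["single stream", "many streams"],
    "query_velocity": ["exploratory", "continuous"],
}


def enumerate_points(space):
    """Yield every point in the space as a dict of axis -> value."""
    axes = list(space)
    for values in product(*(space[axis] for axis in axes)):
        yield dict(zip(axes, values))


def search(space, **wanted):
    """Return the points that match every requested axis value."""
    return [
        point
        for point in enumerate_points(space)
        if all(point[axis] == value for axis, value in wanted.items())
    ]


points = list(enumerate_points(SPACE))
matches = search(SPACE, data_velocity="fast", query_variety="macro")
print(len(points))   # 2**6 = 64 points in this toy space
print(len(matches))  # fixing two axes leaves 2**4 = 16 candidates
```

A user's requirements fix some axes and the generator explores the rest; a real generator would presumably rank candidates rather than return them all.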

Page 22: Benchmarking “No One Size Fits All” Big Data Analytics

Application Domain Modeled Currently

• E-commerce sales, promotions, recommendations

• Social media sentiment & influence

Benchmark generation can be addressed as a search problem within a rich application domain

Page 23: Benchmarking “No One Size Fits All” Big Data Analytics

Application Domain Modeled Currently

Page 24: Benchmarking “No One Size Fits All” Big Data Analytics

Application Domain Modeled Currently

[Diagram labels: Item, Web_sales, Promotion]

Page 25: Benchmarking “No One Size Fits All” Big Data Analytics

Application Domain Modeled Currently

Page 26: Benchmarking “No One Size Fits All” Big Data Analytics

Use Case 1: Exploratory BI

• Large volumes of relational data

• Mostly aggregations and few joins

• Can Spark's performance match that of an MPP DB?

BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries

Data Variety = {Relational}

Query Variety = {Micro}

Page 27: Benchmarking “No One Size Fits All” Big Data Analytics

Use Case 2: Complex BI

• Large volumes of relational data

• Even larger volumes of text data

• Combined analytics

Data Variety = {Relational, Text}

Query Variety = {Macro} (application-focused instead of micro-benchmark)

BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets

Page 28: Benchmarking “No One Size Fits All” Big Data Analytics

Use Case 3: Dashboards

• Large volume and velocity of relational and text data

• Continuously updated dashboards

Data Velocity = Fast

Query Variety = Continuous (as opposed to exploratory)

BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
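
The three use cases can be written down as input settings and turned into specifications by a toy generator. This is only a sketch under stated assumptions: `USE_CASES` and `generate_spec` are hypothetical names, the "static" data velocity for Use Cases 1 and 2 is assumed (the slides do not state it), and the output fields mirror the four items listed on the Benchmark Generation slide.

```python
# Input settings per use case. "static" velocity is an assumption for the
# first two cases; the other values are taken from the slides.
USE_CASES = {
    "exploratory_bi": {
        "data_variety": {"relational"},
        "data_velocity": "static",
        "query_variety": "micro",
    },
    "complex_bi": {
        "data_variety": {"relational", "text"},
        "data_velocity": "static",
        "query_variety": "macro",
    },
    "dashboards": {
        "data_variety": {"relational", "text"},
        "data_velocity": "fast",
        "query_variety": "continuous",
    },
}


def generate_spec(bigif):
    """Toy generator: map input settings to the parts of a specification."""
    varieties = bigif["data_variety"]
    queries = []
    if "relational" in varieties:
        queries.append("SQL-ish aggregation and join queries")
    if "text" in varieties:
        queries.append("sentiment analysis over tweets")
    fast = bigif["data_velocity"] == "fast"
    return {
        "initial_data": sorted(varieties),
        "refresh_pattern": "continuous refresh" if fast else None,
        "query_streams": queries,
        "query_mode": bigif["query_variety"],
        "metrics": ["elapsed time"],  # placeholder metric, assumed
    }


for name, bigif in USE_CASES.items():
    print(name, generate_spec(bigif))
```

Only the dashboard case yields a refresh pattern, which matches the slide's point that its queries must be re-evaluated as data arrives.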

Page 29: Benchmarking “No One Size Fits All” Big Data Analytics

Working with the community

• First release of BigFrame planned for August 2013
  o Open source with extensibility APIs

• Benchmark Driver for more systems

• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)

• Instantiate the BigFrame pipeline for more app domains

Page 30: Benchmarking “No One Size Fits All” Big Data Analytics

Take Away

• "Benchmarks shape a field (for better or worse); they are how we determine the value of change."

-- David Patterson, University of California, Berkeley, 1994

• Benchmarks meet different needs for different people
  o End customers, application developers, system designers, system administrators, researchers, CIOs

• BigFrame helps users generate benchmarks that best meet their needs

