five ways to do data analytics "the wrong way"

Five Ways to Do Data Analytics “The Wrong Way”

Talk given on August 6, 2014, at Pinterest

Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service.

Jignesh M. Patel

[email protected]

1

Definition: A computing or networking architecture suggested by the marketing department for sales purposes rather than for technical reasons. Cisco calls them "reference designs".

http://www.urbandictionary.com

Follow the markitecture

2

http://gridgaintech.wordpress.com

Technology = In-memory file system

https://spark.apache.org

Technology = In-memory caching + language bindings

http://hortonworks.com/blog/100x-faster-hive/

The Stinger Initiative: 100X Hive

Technology = caching, vectorized query execution

http://blog.cloudera.com

Technology = pin files in memory

3

http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/

Problem: Claims are too broad!

https://spark.apache.org

Problem: Claims are too broad

Venkatraman et al. EuroSys’13

Presto (not the Facebook one) vs. Spark: big wins in the R framework

4

Never fix a duct-taped solution

Embrace complexity

5

Image from: http://thewaysleueslove.blogspot.com

One has to apply duct tape to fix problems, but consider removing it later.

Stonebraker and Cetintemel, ICDE 2005

Natural instinct is to build/deploy a specialized system for each application, but that approach blows up the operational complexity.

6

Chasseur and Patel, WebDB’13

[Figure labels: JSON, Web App, Mapping Layer]

Rather than building a specialized engine for a JSON document store, a simple language translator to SQL delivers higher performance and better data integrity.
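To make the mapping-layer idea concrete, here is a minimal sketch of translating JSON documents into relational rows and querying them with plain SQL. It is not the implementation from the cited paper: the table name `doc_attrs`, its columns, and the helpers `flatten`, `load` and `find_ids` are hypothetical, and the sketch assumes documents contain only nested objects with scalar leaves (no arrays).

```python
import json
import sqlite3


def flatten(doc, prefix=""):
    """Yield (key_path, value) for every scalar leaf of a nested JSON object."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def load(conn, raw_docs):
    """Store each document as one row per attribute (hypothetical schema)."""
    conn.execute("CREATE TABLE doc_attrs (objid INTEGER, keystr TEXT, val)")
    for objid, raw in enumerate(raw_docs):
        rows = [(objid, k, v) for k, v in flatten(json.loads(raw))]
        conn.executemany("INSERT INTO doc_attrs VALUES (?, ?, ?)", rows)


def find_ids(conn, key, value):
    """Translate the document query {key: value} into a SQL selection."""
    cur = conn.execute(
        "SELECT objid FROM doc_attrs WHERE keystr = ? AND val = ? ORDER BY objid",
        (key, value),
    )
    return [row[0] for row in cur]


raw_docs = [
    '{"name": "ann", "address": {"city": "Madison"}}',
    '{"name": "bob", "address": {"city": "Austin"}, "age": 41}',
]
conn = sqlite3.connect(":memory:")
load(conn, raw_docs)
print(find_ids(conn, "address.city", "Madison"))
```

The web app keeps speaking JSON, while storage, indexing and integrity checks stay with the relational engine.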

Chasseur and Patel, WebDB’13

Similar story for graphs and linear ML models – can easily be supported on top of systems powered by relational algebra.
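As an illustration of the graph half of that claim, the sketch below answers a reachability query with an ordinary edge table and a recursive SQL query, with no dedicated graph engine. The `edges` table, the sample data and the `reachable` helper are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (5, 6)],
)


def reachable(conn, start):
    """Return all nodes reachable from `start`, including `start` itself."""
    query = """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION  -- UNION drops duplicates, so cycles terminate
            SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
    """
    return [row[0] for row in conn.execute(query, (start,))]


print(reachable(conn, 1))
```

Each recursion step is a join between the edge table and the frontier, which is exactly the kind of operation a relational engine is built to optimize.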

The network effect! But in a bad way!

Complexity Growth = O(N^2)

[Figure: two graphs of interconnected systems, one with nodes 1 to 3 and one with nodes 1 to 4]
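A quick way to see the quadratic growth: if every pair of systems may need its own integration path (a worst-case assumption made for this sketch), the number of paths is N(N-1)/2. The function name `integration_paths` is hypothetical.

```python
from itertools import combinations


def integration_paths(systems):
    """List every unordered pair of systems that may need to be connected."""
    return list(combinations(systems, 2))


for n in (3, 4, 10, 20):
    systems = range(1, n + 1)
    print(n, "systems ->", len(integration_paths(systems)), "pairwise paths")
```

Going from 3 to 4 systems doubles the pairs from 3 to 6, and 20 systems already imply 190 potential integration points.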

7

R vs. Python debate

Complexity Growth = O(N^2)

Also applies to tools and programming languages in house

R Python

5K CRAN statistically robust packages

Linear algebra, clustering, …

ETL

8

Never realize that technology is NOT the “end,” but simply the “means to a (business) end”

Think of technology as the end

9

Netflix Challenge

Example: Building a recommendation system

10

Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10

Key approach: Latent-factor Modeling

All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky

Winning insights

•  Missing ratings are not missing at random!

•  Parameters (popularity, users’ standards for rating, user tastes, …) vary over time

•  Combining sets of predictors

•  Efficient computation critical
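For readers unfamiliar with latent-factor modeling, the following is a minimal sketch of the basic idea: learn a small vector per user and per item so that their dot product approximates the observed ratings. It is a plain regularized matrix factorization trained with stochastic gradient descent, not the prize-winning model (it ignores time-varying parameters and predictor blending), and the toy ratings, function names and hyperparameters are all assumptions.

```python
import math
import random


def init_factors(n_users, n_items, k, seed=0):
    """Small random user (P) and item (Q) factor vectors of length k."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating = dot product of user and item factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def sgd_epoch(P, Q, ratings, lr=0.02, reg=0.02):
    """One pass of SGD over the observed (user, item, rating) triples only."""
    for u, i, r in ratings:
        err = r - predict(P, Q, u, i)
        for f in range(len(P[u])):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)


def rmse(P, Q, ratings):
    """Root-mean-square error on the given ratings."""
    sq = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


# Toy data: (user, item, rating); unobserved pairs are simply absent.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0),
]
P, Q = init_factors(n_users=4, n_items=3, k=2)
rmse_before = rmse(P, Q, ratings)
for _ in range(500):
    sgd_epoch(P, Q, ratings)
rmse_after = rmse(P, Q, ratings)
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

Training only on observed triples is the simplest treatment of missing ratings; the first insight in the list above is a reminder that this simplification can bias the model.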

11

Pandora’s Music Recommender by Michael Howe

Pandora: Music Genome

•  Content-filtering

•  Classification to pick the recommendation

•  Key is to “build up a neighborhood for a particular user’s preference”

Pandora.com


12

Build before you analyze the technology trend

Never use back-of-the-envelope calculations

13

Motivation for the UW Quickstep project http://quickstep.cs.wisc.edu

Hardware changes are far more non-linear than in the past

[Figure: storage hierarchy plotted by latency in cycles, with capacity and cost as the other labeled dimensions. CPU caches: ~1 to 10s of cycles; DRAM: ~100; NVRAM (e.g. SSDs): ~10^5 to 10^6; magnetic hard disk drives: ~10^7 to 10^8]

Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12

Most interactive jobs work on “small” data sets

14

15

Latency lags bandwidth

Patterson, CACM 2004

J. Dean, Latency numbers every programmer should know, 2012

[Bar chart: time in ns (log scale, axis ticks from 0 to 1,000,000,000) for the operations below]

L1 cache reference

Branch mispredict

L2 cache reference

Mutex lock/unlock

Main memory reference

Compress 1K bytes with Zippy

Send 1K bytes over 1 Gbps network

Read 4K randomly from SSD*

Read 1 MB sequentially from memory

Round trip within same datacenter

Read 1 MB sequentially from SSD*

Disk seek

Read 1 MB sequentially from disk

Send packet CA->Netherlands->CA
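A back-of-the-envelope calculation with such latency numbers can be as small as the sketch below, which estimates how long a sequential scan takes on each medium. The per-megabyte costs are assumptions taken from the commonly circulated 2012 version of the list (the values did not survive in this transcript), so treat them as order-of-magnitude inputs to re-check against current hardware. `NS_PER_MB` and `scan_seconds` are hypothetical names.

```python
# Assumed cost of reading 1 MB sequentially, in nanoseconds (2012-era figures).
NS_PER_MB = {
    "memory": 250_000,
    "ssd": 1_000_000,
    "disk": 20_000_000,
}


def scan_seconds(size_mb, medium):
    """Estimated seconds to read `size_mb` megabytes sequentially."""
    return size_mb * NS_PER_MB[medium] / 1e9


for medium in NS_PER_MB:
    print(f"scan 1,000 MB from {medium}: ~{scan_seconds(1000, medium):.2f} s")
```

Under these assumptions a 1,000 MB scan takes roughly a quarter of a second from memory, a second from SSD and twenty seconds from disk, which is often enough to decide whether a design is worth building.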

Amazing way to reason about bottlenecks

Little’s Law

L = λW
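Little's Law relates the average number of items in a system (L), the arrival rate (λ) and the average time each item spends in the system (W). The sketch below applies it in both directions; the function names and the example workload figures are assumptions chosen for illustration.

```python
def items_in_system(arrival_rate_per_s, avg_time_in_system_s):
    """L = lambda * W: average number of requests in flight."""
    return arrival_rate_per_s * avg_time_in_system_s


def max_throughput(concurrency_limit, avg_time_in_system_s):
    """lambda = L / W: sustainable request rate for a fixed concurrency."""
    return concurrency_limit / avg_time_in_system_s


# Hypothetical workload: 2,000 queries/s, each taking 50 ms on average.
print(items_in_system(2000, 0.05), "queries in flight on average")

# Hypothetical limit: 32 worker slots, 4 ms per request.
print(max_throughput(32, 0.004), "requests/s at most")
```

If the measured concurrency is far below what the law predicts for the target rate, the bottleneck is the time in system, not the arrival side.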

16

Amdahl, AFIPS 1967

Amdahl's law

DeWitt and Gray, CACM 1992

Parallel computing is hard

Speedup = Old/New
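Amdahl's law bounds that Old/New ratio: if only a fraction p of the work can be parallelized over n workers, the speedup is 1 / ((1 - p) + p / n). The sketch below evaluates the formula; the function names and the 95% figure are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Speedup (old time / new time) when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def max_speedup(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


for n in (2, 16, 128, 1024):
    print(f"{n:>5} workers: {amdahl_speedup(0.95, n):.2f}x")
print(f"limit: {max_speedup(0.95):.0f}x")
```

Even with 95% of the work parallelizable, 16 workers give only about 9x, and no number of workers gets past 20x.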

Stubbornly refuse to throw away code and platform architecture.

Fall in love with your architecture

17

Data from 2013 publicly reported numbers and Alexa

[Bubble chart: $/Active User (log scale, ticks 1 to 64) against Revenue/Employee ($M, ticks 0 to 3). Labeled companies include Google and YouTube; bubble labels read 19, 29, 18, 7 and 9.]

Problem: It’s hard to throw away something that you built, even if it

doesn’t fit anymore

18

Bubble volume based on daily time on the site

19

Summary: doing it right …

•  Markitecture: Watch for claims that are too broad

•  Complexity: Simple is beautiful – keep the building blocks of your architectural DNA simple

•  Architecture: Periodically re-evaluate your technology architecture. Also, people and processes.

•  Technology and Business: Technology must serve an end business goal

•  Back-of-the-envelope calculations: Amazingly powerful – think hard before you build!

