five ways to do data analytics "the wrong way"
DESCRIPTION
ABSTRACT: The ongoing big data revolution has changed the way technology is used to empower new business segments like social networking and to transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religions” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks in the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what the “right” way is for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect to platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards and industry research awards. He is the recipient of the Wisconsin COW teaching award and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix, a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
TRANSCRIPT
Five Ways to Do Data Analytics
“The Wrong Way”
Title of the talk, on August 6 2014, @ Pinterest
Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should
improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research,
outreach and public service.
Jignesh M. Patel
1
Definition: A computing or networking architecture
suggested by the marketing department for sales purposes
rather than for technical reasons. Cisco calls them
"reference designs".
http://www.urbandictionary.com
Follow the markitecture
2
http://gridgaintech.wordpress.com
Technology = In-memory file system
https://spark.apache.org
Technology = In-memory caching + language bindings
http://hortonworks.com/blog/100x-faster-hive/
The Stinger Initiative: 100X Hive
Technology = caching, vectorized query execution
http://blog.cloudera.com
Technology = pin files in memory
3
http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
Problem: Claims are too broad!
https://spark.apache.org
Problem: Claims are too broad
Venkataraman et al. EuroSys’13
Presto (not the FB) vs. Spark: big wins in the R framework
4
Never fix a duct-taped solution
Embrace complexity
5
Image from: http://thewaysleueslove.blogspot.com
One has to apply duct tape to fix problems, but consider
removing it later.
Stonebraker and Cetintemel, ICDE 2005
Natural instinct is to build/deploy a specialized system for each application,
but that approach blows up the operational complexity
6
Chasseur and Patel, WebDB’13
JSON
JSON
Web App
Mapping Layer
Rather than a specialized engine for a JSON document store, a
simple language translator to SQL has higher performance and
better data integrity.
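A minimal sketch of the mapping-layer idea, under stated assumptions: each JSON document is flattened into rows of one relational table, and an equality predicate on a JSON path is translated into a SQL lookup. The table layout, the helper names (`flatten`, `store`, `find_equal`), and the choice of SQLite are illustrative only; they are not the design from the cited paper.

```python
import json
import sqlite3


def flatten(value, prefix=""):
    """Yield (path, scalar) pairs for every leaf of a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def store(conn, doc_id, document):
    """Insert one JSON document as rows of (doc_id, path, value)."""
    rows = [(doc_id, path, json.dumps(leaf)) for path, leaf in flatten(document)]
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)


def find_equal(conn, path, wanted):
    """Translate the document query {path: wanted} into a SQL lookup."""
    sql = ("SELECT DISTINCT doc_id FROM docs "
           "WHERE path = ? AND value = ? ORDER BY doc_id")
    return [row[0] for row in conn.execute(sql, (path, json.dumps(wanted)))]


# Hypothetical schema: one row per leaf value of each document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER, path TEXT, value TEXT)")
store(conn, 1, {"user": {"name": "ann", "city": "Madison"}, "tags": ["a", "b"]})
store(conn, 2, {"user": {"name": "bob", "city": "Austin"}, "tags": ["b"]})

madison_ids = find_equal(conn, "user.city", "Madison")
print(madison_ids)
```

Because the documents live in an ordinary table, the relational engine's transactions, indexes, and constraints are available without building a separate store.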
Chasseur and Patel, WebDB’13
Similar story for graphs and linear ML models – can easily be
supported on top of systems powered by relational algebra
The network effect! But in a bad way!
Complexity Growth = O(N^2)
[Figure: two network diagrams, one with nodes 1 to 3 and one with nodes 1 to 4]
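The quadratic growth can be illustrated with a small calculation. This sketch assumes the worst case, in which every pair of systems may need its own integration; `pairwise_links` is a hypothetical helper name.

```python
def pairwise_links(n):
    """Number of distinct pairs among n systems: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (3, 4, 10):
    print(n, "systems ->", pairwise_links(n), "possible pairwise integrations")
```

Going from 3 to 4 systems doubles the number of possible pairs (3 to 6), and 10 systems already allow 45.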
7
R v/s Python debate
Complexity Growth = O(N^2). Also applies to tools and
programming languages in house
R Python
5K CRAN statistically robust packages
Linear algebra, clustering, …
ETL
8
Never realize that technology is NOT the “end,” but simply the “means to a (business) end”
Think of technology as the end
9
Netflix Challenge
Example: Building a recommendation system
10
Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10
Key approach: Latent-factor Modeling
All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky
Winning insights
• Missing ratings are not missing at random!
• Parameters (popularity, users’ standards for rating, user tastes, …) vary over time
• Combining sets of predictors
• Efficient computation critical
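As a rough illustration of latent-factor modeling, the sketch below fits a tiny matrix factorization with plain stochastic gradient descent. It is not the prize-winning method: it ignores the time-varying parameters and predictor blending listed above, and the ratings, the factor count `k`, and the learning-rate and regularization values are all made up for the example.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and item i's factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def init_factors(n_users, n_items, k, seed=0):
    """Small random positive starting factors (seeded for repeatability)."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    return P, Q


def train(P, Q, ratings, epochs=500, lr=0.02, reg=0.02):
    """SGD on squared error with L2 regularization; updates P and Q in place."""
    k = len(P[0])
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)


# Invented ratings: (user, item, rating). Only observed entries are listed.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

P, Q = init_factors(n_users=4, n_items=4, k=2)
error_before = rmse(P, Q, ratings)
train(P, Q, ratings)
error_after = rmse(P, Q, ratings)
print(f"training RMSE: {error_before:.2f} -> {error_after:.2f}")
```

Only observed ratings enter the loss here, which is exactly the simplification the first insight above warns about: which ratings are missing carries information too.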
11
Pandora’s Music Recommender by Michael Howe
Pandora: Music Genome
• Content-filtering
• Classification to pick the recommendation
• Key is to “build up a neighborhood for a particular user’s preference”
Pandora.com
Pandora: Music Genome
12
Build before you analyze the technology trend
Never use back-of-the-envelope calculations
13
Motivation for the UW Quickstep project http://quickstep.cs.wisc.edu
Hardware changes are far more non-linear than in the past
[Figure: storage hierarchy by latency (cycles), capacity, and cost. CPU caches: ~1-10s; DRAM: ~100; NVRAM (e.g. SSDs): ~10^5 - 10^6; magnetic hard disk drives: ~10^7 - 10^8]
Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12
Most interactive jobs work on “small” data sets
14
15
Patterson, CACM 2004
Latency lags bandwidth
J. Dean, Latency numbers every programmer should know, 2012
[Figure: bar chart of time in ns (log scale, axis ticks from 0 to 1,000,000,000) for the following operations]
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 1K bytes over 1 Gbps network
Read 4K randomly from SSD*
Read 1 MB sequentially from memory
Round trip within same datacenter
Read 1 MB sequentially from SSD*
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA
Time in ns (log scale)
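Latency figures like these feed directly into back-of-the-envelope estimates. The numeric values did not survive in this transcript, so the constants below are assumptions taken from the commonly circulated 2012 version of the list; treat them as order-of-magnitude only. The helper names are hypothetical.

```python
# Assumed nanoseconds to read 1 MB sequentially (not from the slide itself).
NS_PER_MB = {"memory": 250_000, "ssd": 1_000_000, "disk": 20_000_000}
# Assumed cost of one disk seek, in nanoseconds.
DISK_SEEK_NS = 10_000_000


def scan_seconds(megabytes, medium):
    """Estimated time to read `megabytes` sequentially from `medium`."""
    return megabytes * NS_PER_MB[medium] / 1e9


def disk_random_read_seconds(n_reads):
    """Estimated time for `n_reads` random disk reads, dominated by seeks."""
    return n_reads * DISK_SEEK_NS / 1e9


for medium in NS_PER_MB:
    print(f"scan 1 GB from {medium}: {scan_seconds(1024, medium):.2f} s")
print(f"100 random disk reads: {disk_random_read_seconds(100):.2f} s")
```

Even with rough inputs, the estimate separates designs by orders of magnitude: a 1 GB scan is sub-second from memory and tens of seconds from disk.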
Amazing way to reason about bottlenecks
Little’s Law
L = λW
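A small worked example of Little's Law, where L is the average number of items in the system, λ the average arrival rate, and W the average time each item spends in the system. The query rates and slot counts are invented for illustration, and the function names are hypothetical.

```python
def items_in_system(arrival_rate_per_s, avg_time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_time_in_system_s


def max_throughput(concurrency_limit, avg_time_in_system_s):
    """Rearranged as lambda = L / W: the rate a fixed number of slots can sustain."""
    return concurrency_limit / avg_time_in_system_s


# 200 queries/s, each spending 50 ms in the system on average.
in_flight = items_in_system(200, 0.05)
# 32 worker slots at the same 50 ms per query.
ceiling = max_throughput(32, 0.05)
print(f"average queries in flight: {in_flight:.1f}")
print(f"throughput ceiling with 32 slots: {ceiling:.0f} queries/s")
```

If the expected load exceeds the ceiling, the bottleneck is known before any system is built: either add slots or cut the time per query.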
16
Amdahl, AFIPS 1967
Amdahl's law
DeWitt and Gray, CACM 1992
Parallel computing is hard
Speedup = Old/New
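Amdahl's law can be sketched in a few lines. The formula is the standard statement of the law, speedup = 1 / ((1 - p) + p / n) for a parallelizable fraction p and n workers; the 95% figure and the worker counts are made-up inputs.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Old time / new time when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for workers in (10, 100, 1000):
    print(f"{workers:>5} workers -> {amdahl_speedup(0.95, workers):.1f}x")
```

With 5% of the work serial, the speedup can never exceed 1 / 0.05 = 20x, no matter how many workers are added.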
Stubbornly refuse to throw away code and platform architecture.
Fall in love with your architecture
17
Data from 2013 publicly reported numbers and Alexa
[Figure: bubble chart. Axes: $/Active User (log scale, ticks 1 to 64) and Revenue/Employee ($M, ticks 0 to 3); data labels 19, 29, 18, 7, 9; one bubble labeled YouTube]
Problem: It’s hard to throw away something that you built, even if it
doesn’t fit anymore
18
Bubble volume based on daily time on the site
19
Watch for claims that are too broad
Markitecture
Simple is beautiful – keep the building blocks of your architectural DNA simple
Complexity
Periodically re-evaluate your technology architecture. Also, people and processes.
Architecture
Technology must serve an end business goal
Technology and Business
Amazingly powerful – think hard before you build!
Back-of-the-envelope calculations
doing it right …
Summary