The Berkeley AMPLab - Collaborative Big Data Research
UC BERKELEY
Anthony D. Joseph
LASER Summer School September 2013
About Me Education: MIT SB, MS, PhD
Joined Univ. of California, Berkeley in 1998
Current research areas: » Cloud computing (Mesos): http://mesos.apache.org/ » Secure Machine Learning (SecML): http://radlab.cs.berkeley.edu/wiki/SecML » DETER security testbed: http://deter-project.org/ » Intel Science and Technology Center for User Security: http://scrub.cs.berkeley.edu/
Other: Peer-to-Peer networking (Tapestry), Mobile computing, Wireless/Cellular networking
Sources Driving Big Data
It’s All Happening On-line
Every: Click, Ad impression, Billing event, Fast Forward, pause, …, Friend Request, Transaction, Network message, Fault, …
User Generated (Web & Mobile)
Internet of Things / M2M
Scientific Computing
Challenge 1: Data is Big Projected Growth
[Figure: projected growth, 2010 to 2015. Y-axis: increase over 2010 (0 to 60). Series: Moore's Law, Overall Data, Particle Accel., DNA Sequencers.]
Data Grows faster than Moore’s Law [IDC report, Kathy Yelick, LBNL]
Challenge 2: Data is Dirty
• Variety of diverse sources
• Uncurated
• No schema
• Inconsistent syntax and semantics
Dirty Data worse than Big Data
Challenge 3: Complex Questions
Hard questions » What is the impact on traffic and home prices of building a new ramp?
Real-time questions » Is there a cyber attack going on?
Open-ended questions » How many supernovae happened last year?
Big Data Must Enable Decisions
Requires Multifaceted Approach
Three dimensions to improve data analysis » Improving scale, efficiency, and quality of algorithms running in datacenters (Algorithms) » Scaling up datacenters (Machines) » Leverage human activity and intelligence (People)
Need to adaptively and flexibly combine all three dimensions
Algorithms, Machines, People (AMP) • Today’s apps: fixed point in solution space
[Figure: solution space with axes Algorithms, Machines, and People; example points: search, Watson/IBM]
Need techniques to dynamically pick best operating point
The AMP Lab
[Figure: solution space with axes Algorithms, Machines, and People; points marked: search, Watson/IBM, AMP Lab]
Make sense of data at scale by tightly integrating algorithms, machines, and people
Faculty » Alex Bayen (mobile sensing platforms) » Armando Fox (systems) » Michael Franklin (databases): Director » Michael Jordan (machine learning): Co-director » Anthony Joseph (secure machine learning & privacy) » Randy Katz (systems) » David Patterson (systems) » Ion Stoica (systems): Co-director » Scott Shenker (networking)
Algorithms State-of-the-art Machine Learning (ML) algorithms do not scale » Prohibitive to process all data points
How do you know when to stop?
[Figure: estimate vs. # of data points; reference line: true answer]
Algorithms Given any problem, data and a budget » Immediate results with continuous improvement » Calibrate answer: provide error bars
Error bars on every answer!
[Figure: estimate with error bars vs. # of data points; reference line: true answer]
Algorithms
Stop when error smaller than a given threshold
[Figure: estimate vs. # of data points (time); reference line: true answer]
Given any problem, data and a budget » Immediate results with continuous improvement » Calibrate answer: provide error bars
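The idea of immediate results, error bars on every answer, and stopping once the error falls below a threshold can be illustrated with a small sketch. This is not the lab's method; it is a minimal example that assumes the quantity being estimated is a simple mean, uses a normal-approximation 95% interval (z = 1.96), and introduces the hypothetical names `estimate_mean`, `threshold`, and `min_samples` purely for illustration.

```python
import math
import random


def estimate_mean(stream, threshold, z=1.96, min_samples=30):
    """Consume values one at a time and stop once the error bar is small.

    Returns (estimate, half_width, points_used). The half-width is a
    normal-approximation confidence interval, which is an assumption of
    this sketch rather than anything stated in the talk.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's update)
    half_width = float("inf")
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= max(2, min_samples):
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # error bar is tight enough, stop reading data
    return mean, half_width, n


if __name__ == "__main__":
    rng = random.Random(0)
    data = (rng.gauss(10.0, 2.0) for _ in range(100_000))
    mean, half_width, used = estimate_mean(data, threshold=0.1)
    print(f"estimate = {mean:.3f} +/- {half_width:.3f} after {used} points")
```

In this sketch an estimate and its error bar are available after every point, so a caller with a fixed budget could also stop early and still report a calibrated answer.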
Algorithms Given any problem, data and a time budget » Automatically pick a solution on ML algorithm spectrum
[Figure: estimate vs. time for a simple and a sophisticated algorithm, relative to the true answer; annotations: error too high, pick simple, pick sophisticated]
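One way to picture picking a point on the algorithm spectrum under a time budget is sketched below. The talk does not say how the choice is made; this example assumes each candidate comes with a predicted running time and a predicted error, and all names (`Candidate`, `est_seconds`, `est_error`, `pick_algorithm`) and numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    est_seconds: float  # hypothetical predicted running time on the data
    est_error: float    # hypothetical predicted error of the answer


def pick_algorithm(
    candidates: Sequence[Candidate],
    time_budget: float,
    max_error: Optional[float] = None,
) -> Optional[Candidate]:
    """Return the lowest-error candidate that fits the time budget.

    Returns None when nothing fits the budget, or when even the best
    feasible candidate is predicted to exceed max_error.
    """
    feasible = [c for c in candidates if c.est_seconds <= time_budget]
    if not feasible:
        return None
    best = min(feasible, key=lambda c: c.est_error)
    if max_error is not None and best.est_error > max_error:
        return None  # "error too high" for this budget
    return best


if __name__ == "__main__":
    spectrum = [
        Candidate("simple", est_seconds=1.0, est_error=0.20),
        Candidate("sophisticated", est_seconds=60.0, est_error=0.05),
    ]
    for budget in (10.0, 120.0):
        print(budget, pick_algorithm(spectrum, budget))
```

With a short budget the sketch falls back to the simple candidate, and with a long one it prefers the sophisticated candidate, which mirrors the two regions annotated on the slide.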
Machines
“The datacenter as a computer” still in its infancy » Special purpose clusters, e.g., Hadoop cluster » Highly variable performance » Hard to program » Hard to debug
Machines: Problem Rapid innovation in cloud computing
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster » … to maximize utilization » … to share data between frameworks
[Figure: example frameworks: Dryad, Pregel, Cassandra, Hypertable]
Machines: A Solution Apache Mesos: a resource sharing layer supporting diverse frameworks » Fine-grained sharing: Improves utilization, latency, and data locality » Resource offers: Simple, scalable application-controlled scheduling mechanism
[Figure: separate Hadoop and Pregel clusters, each with its own nodes, versus Hadoop, Pregel, … sharing one set of nodes through Mesos]
B. Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011. http://mesos.apache.org/
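The resource-offer idea, where the sharing layer offers free resources and each framework decides for itself what to accept, can be sketched as a toy simulation. This is not the Mesos API and it ignores how Mesos decides which framework receives an offer; every class and function name here (`Offer`, `Framework`, `consider`, `run_offer_round`) is hypothetical, and frameworks are simply offered resources in list order.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    name: str
    task_cpus: int
    task_mem_gb: int
    pending_tasks: int
    # Nodes this framework prefers, e.g. where its input data lives.
    preferred_nodes: set = field(default_factory=set)

    def consider(self, offer: Offer) -> int:
        """Application-controlled decision: how many tasks to launch (0 = decline)."""
        if self.pending_tasks == 0:
            return 0
        if self.preferred_nodes and offer.node not in self.preferred_nodes:
            return 0  # decline and wait for an offer with better data locality
        fit = min(offer.cpus // self.task_cpus, offer.mem_gb // self.task_mem_gb)
        launch = min(fit, self.pending_tasks)
        self.pending_tasks -= launch
        return launch


def run_offer_round(free, frameworks):
    """Offer each node's free (cpus, mem_gb) to the frameworks in order.

    `free` maps node name to a (cpus, mem_gb) tuple and is updated in place.
    Returns a list of (framework, node, tasks_launched) placements.
    """
    placements = []
    for node in sorted(free):
        for fw in frameworks:
            cpus, mem_gb = free[node]
            if cpus == 0 or mem_gb == 0:
                break  # nothing left to offer on this node
            launched = fw.consider(Offer(node, cpus, mem_gb))
            if launched:
                free[node] = (cpus - launched * fw.task_cpus,
                              mem_gb - launched * fw.task_mem_gb)
                placements.append((fw.name, node, launched))
    return placements


if __name__ == "__main__":
    free = {"node1": (4, 8), "node2": (4, 8)}
    hadoop = Framework("hadoop", task_cpus=1, task_mem_gb=2,
                       pending_tasks=3, preferred_nodes={"node1"})
    pregel = Framework("pregel", task_cpus=2, task_mem_gb=2, pending_tasks=3)
    print(run_offer_round(free, [hadoop, pregel]))
    print(free)
```

The design point the sketch tries to show is that the sharing layer only tracks free resources, while placement decisions such as data locality stay inside each framework.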
People Make people an integrated part of the system! » Leverage human activity » Leverage human intelligence (crowdsourcing): • Curate and clean dirty data • Answer imprecise questions • Test and improve algorithms
Challenge » Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
[Figure: Machines + Algorithms, with arrows labeled data, activity, Questions, and Answers]
Our Vision: A Necessary Synergy
Challenge 1: Data is Big ✔ ✔
Challenge 2: Data is Dirty ✔ ✔ ✔
Challenge 3: Questions are complex ✔ ✔ ✔
(Columns: Algorithms, Machines, People)
Berkeley Data Analytics Stack
[Figure: stack diagram, top to bottom]
Shark, BlinkDB (SQL) / Spark Streaming / GraphX / MLBase
Apache Spark
HDFS / Hadoop Storage / Tachyon
Apache Mesos / YARN Resource Manager
Big Data in 2020
Almost Certainly:
Create a new generation of big data scientist
A real datacenter OS
ML becoming an engineering discipline
People deeply integrated in big data analysis pipeline
If We’re Lucky:
System will know what to throw away
Come up with answers in minutes no one knows
Summary Goal: Tame Big Data Problem » Get results with right quality at the right time
Approach: Holistically integrate Algorithms, Machines, and People
Huge research issues across many domains