AMPLab: Algorithms, Machines and People Michael Franklin UC Berkeley Kick-off Meeting December 8, 2010 Asilomar, CA



Page 2

Agenda

• About the retreat
• AMPLab Background
  – Context
  – People
• Project Motivation and Goals
• Research Thrusts
• Data Management Foci

Page 3

Retreat Agenda - Highlights

• Today: AMP Overviews and Applications*
• Discussion topic groups during Dinner*
• Open Mic session – Techie and other*
• Thurs: Machine Learning, Systems, and DB*
• Lunch with Discussion*
• <sun comes out here> Activity Break*
• Industry talks – Founding Sponsors*
• Report outs; Dinner/Poster Session*
• Fri: Industry talks – Cloud and Crowd*
• Group Photo*
• Industry Feedback***

* = Your Participation Needed

Page 4

Who’s Here?

• Amazon
• Cisco
• Cloudera
• Crowdflower
• eBay
• Ericsson
• Facebook
• Google
• HP
• Huawei
• IBM
• Intel
• Microsoft
• NEC
• NetApp
• O'Reilly
• Oracle
• SAP
• Twitter
• VMWare
• Yahoo!

Page 5

Our Retreat Goals

• Outline our directions and project goals
• Introduce our ideas as they currently stand
• Introduce you to the team and vice versa
• Get your ideas, guidance, feedback, directions
• Kick-off a great 5-year collaboration
• Position AMPLab to be the leading academic center for “Big Data” research

Page 6

Computing as a Commodity

Page 7

Continuous Improvement of Client Devices

Page 8

Ubiquitous Connectivity

Page 9

Algorithms, Machines & People

[Diagram labels: Adaptive/Active Machine Learning and Analytics; Cloud Computing; CrowdSourcing; Massive and Diverse Data]

Page 10

AMPLab: What is it?

A Five-Year research collaboration to develop a new generation of data analysis methods, tools and infrastructure for making sense at scale.

Page 11

“Big Data”: Working Definition

When the normal application of current technology doesn’t enable users to obtain answers of […] to their data-driven questions.

Page 12

The Scalability Dilemma

• State-of-the-art Machine Learning techniques do not scale to large data sets.
• Data Analytics frameworks can’t handle lots of incomplete, heterogeneous, dirty data.
• Processing architectures struggle with increasing diversity of programming models and job types.
• Adding people to a late project makes it later.

Exactly Opposite of what we Expect and Need

Page 13

The AMP Team

• Principal Investigators (*co-directors)
  – Alex Bayen (sensing platforms)
  – Armando Fox (systems)
  – Michael Franklin* (databases)
  – Michael Jordan* (machine learning)
  – Anthony Joseph (security & privacy)
  – Randy Katz (systems)
  – David Patterson (systems)
  – Ion Stoica* (systems)
  – Scott Shenker (networking)
• PostDocs and Visiting Researchers
  – Ali Ghodsi (KTH), Tim Kraska (ETH Zurich), Justin Ma (UCSD), Purna Sarkar (CMU), Elaine Shi (CMU/PARC)
• Lots o’ Great Students
  – Present and Future

Page 14

Background: RAD Lab

Enable 1 person to develop, deploy, operate a next-generation Internet application at scale

Five-year collaborative effort – Thru Feb 2011
• See upcoming “end of project” event and demo

Initial Technical Bet:
• Machine Learning for large-scale self-managing systems

Wave Caught: Cloud Computing

Multi-area faculty, postdocs, & students
• Systems, Networks, DB, Security, Statistical Machine Learning all in a single, open, collaborative space

Industrial Sponsorship and intensive interaction
• Including bi-annual retreats

Page 15

RAD, AMP, WTF?

• Similar cast of characters and expertise
• Leveraging RADLab structure, organizational approach and collaborative space
• Will start with much of the RADLab-developed software stack
• Focus on big data analytics and analytics-intensive applications
• “Systems for ML” rather than “ML for Systems”
• Close collaboration with application partners
• Focus on Human Element throughout data lifecycle
  – Elvis is everywhere

Page 16

AMP Themes

• Big Data
  – Scale -> Diversity
  – Scale -> You never see “all the data”
  – Scale -> Randomness looks like non-randomness
• Reliable answers from unreliable data
• Cloud, Warehouse-Scale & Ubiquitous Computing
• People involved throughout the whole data lifecycle
  – Both a part of the problem and a part of the solution
• Continuous answer improvement/Pay-as-you-go
• Smart, flexible, fair resource allocation and use

Page 17

Initial Use Cases/Applications

Crowdsourced: Sensing, Analysis, Policy, Journalism

Urban Micro-Simulation (next session)

Page 18

AMPLab: Making Sense at Scale

• A holistic view of the stack.
• Strong Industrial involvement
• A five-year plan
• Research underway now
• Public launch of lab in Feb 2011
• Funding & commitments received to date:

[Stack diagram labels, in extracted order: Data Viz Collaboration, HCI; Text analytics; Machine Learning and Stats; Database, OLAP, MapReduce; Security and Privacy; MPP, Data Centers, Networks; Multi-Core Parallelism]

Page 19

AMP: Technical Thrusts

• Machine Learning and Analytics (tomorrow AM)
  – Error Bars on all Answers
  – Active learning and continuous/adaptive answer improvement
• Infrastructure (next talk)
  – Mesos cloud OS and analytics frameworks
• Data Management
  – Pay-as-you-go integration and structure
  – ML/Analytics workload/workflow support
• Hybrid Crowd/Cloud Systems
  – “Human Tolerant Computing”
  – Incentive structures, HCI aspects

Page 20

Architectural View (strawman)

[Architecture diagram; component labels: Crowd UI; Crowd Resource Mgt; OpenFlow/NOX; Mesos; Inform. Integr.; Scalable Storage Engine; Spark; Confidence Insightful Query Language (CIQL); Privacy Control; Debugging; Machine Learning; Result Control Center; Collaborative Visualization]

Page 21

Scaling Machine Learning

• Want a systematic methodology that automatically selects an operating point along the spectrum
• Given a fixed computational budget, performance should improve monotonically as data accrue
  – immediate results with continual improvement
• Smart sampling/dropping of data
• Error bars on all answers

[Figure: a spectrum of Algorithmic Complexity with ends labeled "Simple Algorithms, Massive Data" and "Complex Alg, Less Data"; tick labels as extracted: "n^3 n log n n log n"]
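The "error bars on all answers" and "immediate results with continual improvement" bullets can be illustrated with a small sketch. Everything specific here is an assumption rather than something from the slides: the query is a plain mean, the error bar is a normal-approximation 95% interval, the data are synthetic, and the function names are hypothetical. The point is only that an answer is available after a small sample and its half-width shrinks roughly like 1/sqrt(n) as more data are processed.

```python
import math
import random
import statistics


def mean_with_error_bar(sample, z=1.96):
    """Return (estimate, half_width) for the mean of `sample`.

    Normal approximation: half_width = z * s / sqrt(n), with z = 1.96
    for an approximate 95% interval (an assumption of this sketch).
    """
    n = len(sample)
    estimate = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return estimate, half_width


def progressive_estimates(data, sample_sizes, seed=0):
    """Estimate the mean from random samples of increasing size."""
    rng = random.Random(seed)
    shuffled = data[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    # A prefix of a shuffled list is a uniform sample without replacement.
    return [(n, *mean_with_error_bar(shuffled[:n])) for n in sample_sizes]


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for a large data set (made-up numbers).
    data = [rng.gauss(100.0, 15.0) for _ in range(100_000)]
    for n, est, hw in progressive_estimates(data, [100, 1_000, 10_000]):
        print(f"n={n:>6}: mean = {est:.2f} +/- {hw:.2f}")
```

A caller with a fixed budget could stop at whichever sample size first yields an acceptable half-width; how to choose that operating point automatically is the open question the slide raises.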

Page 22

Reliable Answers

• Probabilistic Databases
• Schema Matching
• Judicious use of User Input
• Approximate Query Answering
• Uncertainty Management
• Data Model Learning
• Provenance and Annotation
• Structured + Unstructured Search

Page 23

Data Management for Interactive Web Applications

[Diagram labels: App / DB (repeated); SCADS Storage; Scads Storage Handler; Mesos; VM monitor; Director]

• Stateless
• Easy to scale
• DB Library
• PIQL – Performance Insightful Query Language
• Consistency Rationing – Consistency à la carte
• AVRO Schema - Physical/Logical data independence
• Independent of Key/Value store

• Stateful but still easy to scale (no sharding required)
• Simple Query Interface
• Reduced consistency
• Predictable performance
• Easy to price
• High availability (even across data centers)

Page 24

AMP Lab Big Data Analytics

[Diagram labels: App / DB (repeated); Moving Code to Data; Integrating the Crowd; Machine Learning Access Patterns; Functionality?]

Optimize the storage for machine learning
• Intermediate results as first class citizens
• Adaptive replication levels and layout (e.g. Fractured Mirrors)
• …

Spark as a first step

val custs = cluster.get("test")
val result = custs.map(
  a => a.title == "PostDoc" && a.incrSalary(10) > 60)

Page 25

Participatory Culture - Direct

Page 26

Participatory Culture – “Indirect”

John Murrell: GM SV 9/17/09
…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.

Page 27

Hybrid Computing – A First Step

[CrowdDB architecture diagram; component labels: Parser; Optimizer; Statistics; Executor; Query; Results; Files Access Methods; Disk 1; Disk 2; MetaData; Form Collection; Form Editor; UI Creation; HIT Management; People DB]

Page 28

CrowdSQL

• DDL Extensions: Crowdsourced columns, tables and referential integrity.

CREATE TABLE company (
  name STRING,
  hq_address CROWD STRING);

CREATE CROWD TABLE department (
  name STRING PRIMARY KEY,
  phone_no STRING);

• DML Extensions: CROWDEQUAL and CROWDORDER operators (currently UDFs).
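The slide says only that CROWDEQUAL is currently a UDF. As a rough illustration of what such a UDF might do, the sketch below asks several workers whether two values refer to the same thing and takes a majority vote. The voting scheme, the function names, and the simulated worker (an alias table standing in for a human) are all assumptions for illustration, not CrowdDB's actual implementation.

```python
from collections import Counter
from typing import Callable

# Hypothetical alias table standing in for a human worker's background knowledge.
ALIASES = {
    "i.b.m.": "ibm",
    "international business machines": "ibm",
    "big blue": "ibm",
}


def simulated_worker(left: str, right: str) -> bool:
    """Stand-in for one crowd worker judging whether two names match."""
    def canon(s: str) -> str:
        key = s.strip().lower()
        return ALIASES.get(key, key)
    return canon(left) == canon(right)


def crowd_equal(left: str, right: str,
                ask_worker: Callable[[str, str], bool],
                num_votes: int = 3) -> bool:
    """Hypothetical CROWDEQUAL-style UDF: majority vote over worker answers.

    An odd num_votes avoids ties; a tie is treated as "not equal".
    """
    votes = Counter(ask_worker(left, right) for _ in range(num_votes))
    return votes[True] > votes[False]


if __name__ == "__main__":
    companies = ["I.B.M.", "Oracle", "Big Blue", "SAP"]
    matches = [c for c in companies if crowd_equal(c, "IBM", simulated_worker)]
    print(matches)  # prints ['I.B.M.', 'Big Blue']
```

In a real deployment `ask_worker` would post a task to a crowd platform and wait for an answer, which is why each call is expensive and worth minimizing.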

Page 29

SELECT * FROM professor p, department d
WHERE p.dep = d.name AND p.name = "Carey"

[Figure panels: (a) PeopleSQL query; (b) Logical plan before optimization: selection σ name="Carey" above the join ⋈ p.dep=d.name of Professor and Department; (c) Logical plan after optimization: selection σ name="Carey" on Professor, below the join ⋈ p.dep=d.name with Department; (d) Physical plan: MTJoin(Dep) on p.dep = d.name over MTProbe(Professor) with name=Carey. Additional label as extracted: DepartmentGoodEnough. The figure also shows two generated forms: "Please fill out the missing professor data" (Name: Carey; E-Mail; Submit) and "Please fill out the missing department data" (Department: CS; Phone; Submit).]

Leveraging DB Technology

• Use Schema to drive Turker UI generation
• Create a cost model of crowd operators and plug them into the optimizer.
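To make the "cost model of crowd operators" bullet concrete, here is a minimal sketch of how an optimizer might compare two plans for the Carey query by estimated crowd spend. The cardinalities, the per-task price, and the class and function names are made up for illustration; the slides do not specify the actual cost model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CrowdOp:
    """One crowd operator in a plan, with assumed cost parameters."""
    name: str
    tasks: int             # estimated number of crowd tasks issued
    cents_per_task: float  # assumed reward paid per task

    def cost_cents(self) -> float:
        return self.tasks * self.cents_per_task


def plan_cost(plan: List[CrowdOp]) -> float:
    """Total estimated crowd spend for a plan, in cents."""
    return sum(op.cost_cents() for op in plan)


def cheapest(plans: Dict[str, List[CrowdOp]]) -> str:
    """Return the name of the plan with the lowest estimated cost."""
    return min(plans, key=lambda name: plan_cost(plans[name]))


if __name__ == "__main__":
    # Made-up numbers: 40 departments have missing data; only one of them
    # is needed once the selection name = "Carey" has been applied.
    plans = {
        "join_then_select": [CrowdOp("crowd join, all departments", 40, 2.0)],
        "select_then_join": [CrowdOp("crowd join, Carey's department", 1, 2.0)],
    }
    for name, ops in plans.items():
        print(f"{name}: {plan_cost(ops):.1f} cents")
    print("chosen:", cheapest(plans))
```

Under these assumed numbers, pushing the selection below the crowd join is cheaper because it asks the crowd about one department instead of forty, which matches the direction of the optimization shown in the figure.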

Page 30

Picture query

Select the best picture of the Golden Gate Bridge

Page 31

Integrating Clouds and Crowds

| | Interactive Cloud | Analytic Cloud | People Cloud |
| Data Acquisition | Transactional systems, Data entry | … + Sensors (physical & software) | … + Web 2.0 |
| Computation | Get and Put | Map Reduce, Parallel DBMS, Stream Processing | … + Collaborative Structures (e.g., Mechanical Turk, Intelligence Markets) |
| Data Model | Records | … + Numbers, Media | … + Text, Media, Natural Language |
| Response Time | Seconds | … + Min/Hours/Days | + Continuous |

all

Page 32

Summary

• AMPLab will investigate the confluence of Algorithms, Machines and People to solve “Big Data” analysis problems.
• Huge research issues across many domains.
• The goal of this meeting is to get the process started.
• Our approach depends on close interaction with our industrial partners.
• We look forward to your input, advice and collaboration.

Page 33

Proposed Discussion Topics

1) Using the Cloud for Big Data analytics and Machine Learning
2) Pay-as-you-go Processing - incremental answer improvement and data integration
3) Data Center Operating System
4) Debugging Big Data systems
5) Crowdsourcing Opportunities and Challenges
6) User Interface, Interaction and Data Visualization
7) Mobility and Devices - how do they impact our agenda?
8) Privacy Issues - What are they and how to address them?
9) Metrics - How does AMPLab measure success?
10) Approximate answers and answer confidence
11) Killer Apps - What other apps should AMP focus on?
12) What questions can we answer? Not answer? Can we classify questions in terms of difficulty?
13) How to define “Big Data” and should we use some other term?