Jubatus: Realtime deep analytics for BigData @ Rakuten Technology Conference 2012

Realtime deep analytics for BigData. Daisuke Okanohara, Preferred Infrastructure, Inc., co-founder, vice president. [email protected]. Oct. 20th 2012 @ Rakuten Technology Conference 2012

Upload: preferred-infrastructure-preferred-networks

Post on 05-Dec-2014


DESCRIPTION

Currently, we face new challenges in realtime analytics of BigData, such as social monitoring, M2M sensors, online advertising optimization, smart energy management, and security monitoring. To analyze these data, scalable machine learning technologies are essential. Jubatus is the open source platform for online distributed machine learning on BigData streams. We explain the technologies inside Jubatus and show how Jubatus can achieve realtime analytics in various problems.

TRANSCRIPT

Page 1: Jubatus: Realtime deep analytics for BigData@Rakuten Technology Conference 2012

Realtime deep analytics for BigData

Daisuke Okanohara

Preferred Infrastructure, Inc., co-founder, vice president

[email protected]

Oct. 20th 2012 @ Rakuten Technology Conference 2012

Page 2:

Agenda

• Introduction of PFI
• Current condition of BigData analysis
• Jubatus: concept and characteristics
• Inside Jubatus: Update, Analyze, and Mix

Page 3:

Preferred Infrastructure (PFI)

• Founded: March 2006
• Location: Hongo, Tokyo
• Employees: 26
• Our mission: Bring cutting-edge research advances to the real world
• Our products:
    • Sedue: "Modern search engine"
    • Bazil: "Machine learning for everyone"
    • Jubatus: "Realtime deep analytics for BigData"

Page 4:

Preferred Infrastructure (contd.)

• We are passionate about developing various computer science technologies
    • machine learning
    • natural language processing
    • distributed systems
    • programming languages
    • data structures
    • algorithms, etc.
• Our team includes winners of various programming contests and red coders
• Very rapid prototyping and developing good software

Page 5:

Agenda

• Introduction of PFI
• Current condition of BigData analysis
• Jubatus: concept and characteristics
• Inside Jubatus: Update, Analyze, and Mix

Page 6:

BigData!

• We see BigData everywhere
    • 3V: "Volume", "Velocity", "Variety"
• Need tools for analyzing BigData

Data sources: People, PC, Mobile, Sensors, Cars, Factories, Web, Hospitals

Data types: Text, Log, Image, Voice, Vision, Signal, Finance, Bio

Page 7:

Case 1. SNS (Twitter, Facebook, etc.)

• Jubatus classifies each tweet from the stream (6000 tps) into categories according to the tweet's content, using machine learning technologies

 

Page 8:

Case 2. Automobiles

• Services
    • Remote maintenance / security
    • Insurance: Pay As You Drive, Pay How You Drive
• Auto-driving cars
    • Equipped sensors: radar, lidar (laser radar), GPS, cameras
    • E.g. Google driverless cars
        • In Aug. 2012, they completed 480,000 km of test driving

Page 9:

Case 2. Automobiles (contd.)

• Navigation system based on real-time traffic updates: waze.com

Page 10:

Case 3. Infrastructures, factories

• Preventive maintenance for the NY City power grid
    • Learning prioritization (supervised ranking or MTBF) of candidates using approx. 300 summary features
    • The results are accurate enough to support decision making

[Figure label: OA rate = outage rate]

"Machine Learning for the New York City Power Grid", IEEE Trans. PAMI, 2012

Page 11:

Case 3. Infrastructures, factories (contd.)

[Figure: benefit vs. cost for various replacement strategies analyzed by machine learning]

"Machine Learning for the New York City Power Grid", IEEE Trans. PAMI, 2012

Page 12:

Case 4. Genome Analysis

• Next-generation sequencers bring big changes
    • Human genome sequencing: $3 billion / 10 years in 2001 becomes $7,700 / 1 day in 2012
    • GWAS (genome-wide association study) becomes popular
• Big impacts in many fields: healthcare, agriculture, medicine
    • 23andMe analyzes users' DNA and obtains information about their ancestries, health, and genetic traits

Page 13:

Agenda

• Introduction of PFI
• Current condition of BigData analysis
• Jubatus: concept and characteristics
• Inside Jubatus: Update, Analyze, and Mix

Page 14:

Increasing demand in BigData applications: Higher necessity of deeper real-time analysis

• Current: simple aggregation and pre-defined rule processing on bigger data
    • CEP, Hadoop, DSMS
• Future: deeper analysis for rapid decisions and actions

[Figure: Hadoop, CEP, and Jubatus plotted on two axes, deep analysis and decision speed]

Reference: http://web.mit.edu/rudin/www/TPAMIPreprint.pdf http://www.computerworlduk.com/news/networking/3302464/

Page 15:

Jubatus: OSS platform for Big Data analytics

• Joint development of PFI and NTT laboratory
    • Project started in April 2011
• Released as open source software
    • You can download it from: http://github.com/jubatus/

Page 16:

Key technology: Machine learning

• We need rapid decisions under uncertainties
    • Anomaly detection from M2M sensor data
    • Energy demand forecast / Smart grid optimization
    • Security monitoring on raw Internet traffic
• What is missing for fast & deep analytics on BigData?
    • Online/real-time machine learning platform + Scale-out distributed machine learning platform

[Figure labels: 1. Bigger data, 2. Real-time, 3. Deeper analysis]

Page 17:

Online machine learning

• Batch machine learning
    • Scan all data before building a model
    • Analysis becomes available only after all data is prepared
• Online machine learning
    • Model is updated instantaneously by each data sample
    • Online models converge to the batch models
    • The convergence is very fast, approx. 100 times faster than batch (1 day -> 5 min.)
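A minimal sketch of the online setting, using the classic perceptron rule. The function names and the toy stream are illustrative assumptions, not Jubatus code; the point is only that the model is updated once per sample and is usable at any moment, without scanning or storing the full data set first.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label in {-1, +1})


def predict(w: List[float], x: List[float]) -> int:
    """Return the sign of the linear score w . x (ties go to +1)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1


def online_perceptron(stream: Iterable[Sample], dim: int) -> List[float]:
    """Update the model once per sample; no sample is stored."""
    w = [0.0] * dim
    for x, y in stream:
        if predict(w, x) != y:  # mistake-driven update
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical toy stream of (features, label) pairs.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([2.0, 1.0], 1), ([1.0, 3.0], -1)]
w = online_perceptron(stream, dim=2)
```

A batch learner would instead need the whole `stream` materialized before producing its first model.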

Page 18:

Jubatus employs latest online machine learning

• Advantages: fast and memory-efficient
    • Low latency & high throughput
    • No need for large dataset storage
• E.g. online learning for linear classification
    • Perceptron (1958)
    • Passive Aggressive (2003)
    • Confidence Weighted Learning (2008)
    • AROW (2009)
    • Normal HERD (2010)
    • Soft Confidence Weighted Learning (2012)

[Slide annotation: "Very recent progress"]
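As one concrete instance from this list, here is a sketch of a single Passive Aggressive step for binary labels, in its basic form without a regularization constant. The function name `pa_update` is an assumption for illustration and this is not taken from the Jubatus source.

```python
from typing import List


def pa_update(w: List[float], x: List[float], y: int) -> List[float]:
    """One Passive Aggressive step for a binary label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)      # hinge loss on this sample
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                       # passive: margin already satisfied
    tau = loss / norm_sq               # aggressive: smallest step reaching margin 1
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = pa_update([0.0, 0.0], [1.0, 2.0], +1)
```

Only the current weight vector and the current sample are touched, which is what makes this family fast and memory-efficient.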

Page 19:

Data analysis goes Real-time/Online and Large scale

• Jubatus combines them into a unified computation framework

[Figure: tools placed on two axes, batch vs. real-time/online and small scale (stand-alone) vs. large scale (distributed/parallel computing)]
• Batch, small scale: WEKA 1993-, SPSS 1988-
• Batch, large scale: Mahout 2006-
• Real-time/online, small scale: online ML algorithms (Structured Perceptron 2001, PA 2003, CW 2008)
• Real-time/online, large scale: Jubatus 2011-

Page 20:

What Jubatus currently supports

1. Classification (multi-class)
    • Perceptron / PA / CW / AROW
2. Regression
    • PA-based regression
3. Nearest neighbor
    • LSH / MinHash / Euclid LSH
4. Recommendation
    • Based on nearest neighbor
5. Anomaly detection
    • LOF based on nearest neighbor
6. Graph analysis
    • Shortest path / Centrality (PageRank)
7. Simple statistics

We support most machine learning/data mining technologies
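To illustrate one of the nearest-neighbor techniques named on this slide, the following is a generic MinHash sketch for estimating Jaccard similarity between sets. It assumes non-empty sets of strings and uses seeded BLAKE2b hashes as a stand-in hash family; the function names are hypothetical and this is not the Jubatus implementation.

```python
import hashlib
from typing import List, Set


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions estimates |A & B| / |A | B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature({"jubatus", "online", "learning"})
sig_b = minhash_signature({"jubatus", "online", "mixing"})
similarity = estimated_jaccard(sig_a, sig_b)  # estimate of the true value 2/4
```

Signatures have a fixed length regardless of set size, so they can be compared cheaply and kept in memory.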

Page 21:

Hadoop and Mahout are not good for online learning

• Hadoop
    • Advantages
        • Many extensions for a variety of applications
        • Good for distributed data storing and aggregation
    • Disadvantages
        • No direct support for machine learning and online processing
• Mahout
    • Advantages
        • Popular machine learning algorithms are implemented
    • Disadvantages
        • Some implementations are less mature
        • Still not capable of online machine learning

Page 22:

Jubatus vs. Hadoop, RDB, and Storm: Advantage in online AND distributed ML

• Only Jubatus satisfies both of them at the same time

                        Jubatus   Hadoop        RDB               Storm
Storing BigData         -         ✓✓ (HDFS)     ✓                 -
Batch learning          ✓         ✓ (Mahout)    ✓✓ (SPSS, etc.)   -
Stream processing       ✓         -             -                 ✓✓
Distributed learning    ✓✓        ✓ (Mahout)    -                 -
Online learning         ✓✓        -             -                 -

[Slide annotation: "High importance"]

Page 23:

Agenda

• Introduction of PFI
• Current condition of BigData analysis
• Jubatus: concept and characteristics
• Inside Jubatus: Update, Analyze, and Mix

Page 24:

Distributed online learning algorithm is not trivial

• Online learning requires frequent model updates
• Naïve distributed architecture leads to too many synchronization operations

[Figure: timelines of batch learning and online learning. Batch learning: long "Learn the update" phases, each followed by a model update; easy to parallelize. Online learning: short "Learn" steps alternating with model updates; hard to parallelize due to frequent updates.]

Page 25:

Solution: Loose model sharing

• Jubatus only shares the local models in a loose manner
    • Fact: Model size << Data size
    • Does not share data sets
    • Unique approach compared to existing frameworks
• Local models can be different on the servers
    • Different models will be gradually merged

[Figure: three servers each hold a local model; after mixing, each holds the mixed model]

Page 26:

Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX

1. UPDATE
    • Receive a sample, learn and update the local model
2. ANALYZE
    • Receive a sample, apply the local model, return the result
3. MIX (automatically executed in backend)
    • Exchange and merge the local models between servers

• Cf. Map-Shuffle-Reduce operations on Hadoop
• Algorithms can be implemented independently from
    • Distribution logic
    • Data sharing
    • Failover
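The separation between an algorithm and the distribution logic can be sketched with a toy model. Everything here is a hypothetical illustration, not the Jubatus API: `CountingModel` counts labels and predicts the most frequent one, and the `mix` function knows nothing about what the model computes.

```python
from collections import Counter
from typing import List


class CountingModel:
    """Toy algorithm: per-label counts, predicting the most frequent label."""

    def __init__(self) -> None:
        self.base = Counter()   # state agreed on at the last MIX
        self.diff = Counter()   # local changes since the last MIX

    def update(self, label: str) -> None:
        """UPDATE: learn from one sample, touching only local state."""
        self.diff[label] += 1

    def analyze(self) -> str:
        """ANALYZE: answer from the local model, with no communication."""
        total = self.base + self.diff
        return total.most_common(1)[0][0]

    def get_diff(self) -> Counter:
        return Counter(self.diff)

    def put_diff(self, merged: Counter) -> None:
        self.base = self.base + merged
        self.diff = Counter()


def mix(servers: List[CountingModel]) -> None:
    """MIX: collect diffs, merge them, and hand the result back to every server."""
    merged = Counter()
    for server in servers:
        merged.update(server.get_diff())
    for server in servers:
        server.put_diff(merged)


a, b = CountingModel(), CountingModel()
for label in ["spam", "spam"]:
    a.update(label)
for label in ["ham", "ham", "ham"]:
    b.update(label)
before = (a.analyze(), b.analyze())   # local models differ
mix([a, b])
after = (a.analyze(), b.analyze())    # both now reflect all five samples
```

Swapping in a different algorithm would only require new `update`, `analyze`, `get_diff`, and `put_diff` methods; `mix` stays unchanged.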

Page 27:

UPDATE

• Each data sample is sent to one (or two) server(s)
• Local models are updated based on the sample
• Data samples are NEVER shared

[Figure: samples are distributed randomly or consistently to two servers; each server's initial model becomes local model 1 or local model 2]
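The two dispatch modes in the figure can be sketched as follows. Reading "consistently" as consistent hashing on a sample key is an assumption, and the `Ring` class, server names, and keys are hypothetical; this is a generic hash ring, not the Jubatus routing code.

```python
import bisect
import hashlib
import random
from typing import List


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    """Minimal consistent-hash ring: a key goes to the next server clockwise."""

    def __init__(self, servers: List[str], replicas: int = 50) -> None:
        # Each server is placed at several pseudo-random points on the ring.
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def route(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


def route_randomly(servers: List[str], rng: random.Random) -> str:
    """Random dispatch: any server may receive any sample."""
    return rng.choice(servers)


servers = ["server-1", "server-2", "server-3"]
ring = Ring(servers)
owner = ring.route("user-42")   # the same key always reaches the same server
```

Random dispatch suits models where any server can learn from any sample, while key-based dispatch keeps all samples for one key on the same server.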

Page 28:

MIX

• Each server sends its model diff (difference)
• Model diffs are merged and distributed
• Only model diffs are transmitted

[Figure: on each server, Model diff = Local model - Initial model (giving Model diff 1 and Model diff 2). Merged diff = Model diff 1 + Model diff 2. On each server, Mixed model = Initial model + Merged diff.]
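The arithmetic in the MIX figure can be worked through on a small example. This sketch assumes a linear model stored as a sparse dictionary of weights and merges diffs by plain summation, as the figure shows; averaging the diffs instead is a common variant. The function names and the numbers are illustrative only.

```python
from typing import Dict, List

Model = Dict[str, float]  # sparse weight vector: feature name -> weight


def model_diff(local: Model, initial: Model) -> Model:
    """Model diff = Local model - Initial model."""
    keys = set(local) | set(initial)
    return {k: local.get(k, 0.0) - initial.get(k, 0.0) for k in keys}


def merge_diffs(diffs: List[Model]) -> Model:
    """Merged diff = Model diff 1 + Model diff 2 + ..."""
    merged: Model = {}
    for diff in diffs:
        for k, v in diff.items():
            merged[k] = merged.get(k, 0.0) + v
    return merged


def apply_diff(initial: Model, merged: Model) -> Model:
    """Mixed model = Initial model + Merged diff."""
    keys = set(initial) | set(merged)
    return {k: initial.get(k, 0.0) + merged.get(k, 0.0) for k in keys}


initial = {"a": 1.0, "b": 0.0}
local_1 = {"a": 1.5, "b": 0.0}             # server 1 moved weight "a"
local_2 = {"a": 1.0, "b": -2.0, "c": 0.5}  # server 2 moved "b" and added "c"

merged = merge_diffs([model_diff(local_1, initial), model_diff(local_2, initial)])
mixed = apply_diff(initial, merged)
```

Only `merged`, which is as small as the model, crosses the network; the samples that produced `local_1` and `local_2` never leave their servers.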

Page 29:

UPDATE (iteration)

• Each server starts updating from the mixed model
• The mixed model improves gradually thanks to all of the servers

[Figure: samples are distributed randomly or consistently; on each server the mixed model becomes local model 1 or local model 2]

Page 30:

ANALYZE

• For analysis, each sample randomly goes to a server
• Server applies the current mixed model to the sample
    • Uses the model in the local server only, doesn't communicate
• The results are returned to the client

[Figure: samples are distributed randomly; each server applies its mixed model and returns a prediction]

Page 31:

Why can Jubatus work in real time?

1. Focus on online machine learning
    • Make online machine learning algorithms distributed
2. Update locally
    • Online training without communication with others
3. Mix only models
    • Small communication cost, low latency, good performance
    • Advantage compared to costly Shuffle in MapReduce
4. Analyze locally
    • Each server has the mixed model and need not communicate
    • Low latency for making predictions
5. Everything in-memory
    • Process data on-the-fly

Page 32:

Summary

• Jubatus is the first OSS platform for online distributed machine learning on BigData streams.
• Download it from http://github.com/jubatus/
• We welcome your contribution and collaboration

[Figure labels: 1. Bigger data, 2. More in real-time, 3. Deep analysis]

Page 33:

Copyright © 2006-2012 Preferred Infrastructure All Rights Reserved.