Jubatus: Realtime Deep Analytics for BigData @ Rakuten Technology Conference 2012
DESCRIPTION
We currently face new challenges in realtime analytics of BigData, such as social monitoring, M2M sensors, online advertising optimization, smart energy management and security monitoring. To analyze these data, scalable machine learning technologies are essential. Jubatus is an open source platform for online distributed machine learning on BigData streams. We explain the internal technologies of Jubatus and show how Jubatus achieves realtime analytics across various problems.
TRANSCRIPT
Realtime deep analytics for BigData
Daisuke Okanohara
Preferred Infrastructure, Inc., co-founder and vice president
Oct. 20th, 2012 @ Rakuten Technology Conference 2012
Agenda
• Introduction of PFI
• Current condition of BigData Analysis
• Jubatus: concept and characteristics
• Inside Jubatus: Update, Analyze, and Mix
Preferred Infrastructure (PFI)
• Founded: March 2006
• Location: Hongo, Tokyo
• Employees: 26
• Our mission: Bring cutting-edge research advances to the real world
• Our products:
  • Sedue “Modern search engine”
  • Bazil “Machine learning for everyone”
  • Jubatus “Realtime deep analytics for BigData”
Preferred Infrastructure (contd.)
• We are passionate about developing various computer science technologies
  • machine learning
  • natural language processing
  • distributed systems
  • programming languages
  • data structures
  • algorithms, etc.
• Our team includes winners of various programming contests and red coders
• Very rapid prototyping and development of good software
BigData !
• We see BigData everywhere
• 3V: “Volume”, “Velocity”, “Variety”
• Need tools for analyzing BigData
Data sources: People, PC, Mobile, Sensors, Cars, Factories, Web, Hospitals
Data types: Text, Log, Image, Voice, Vision, Signal, Finance, Bio
Case 1. SNS (Twitter, Facebook, etc.)
• Jubatus classifies each tweet from the stream (6,000 tps) into categories based on the tweet's content, using machine learning technologies
Case 2. Automobiles
• Services
  • Remote maintenance / security
  • Insurance: Pay As You Drive, Pay How You Drive
• Auto-driving cars
  • Equipped sensors: radar, lidar (laser radar), GPS, cameras
  • E.g. Google driverless cars
    • In Aug. 2012, they completed 480,000 km of test driving
Case 2. Automobiles (contd.)
• Navigation system based on real-time traffic updates: waze.com
Case 3. Infrastructures, factories
• Preventive maintenance for NY City power grid
• Learning prioritization (supervised ranking or MTBF) of candidates using approx. 300 summary features
• The results are accurate enough to support decision making
“Machine Learning for the New York City Power Grid”, IEEE Trans. PAMI, 2012
OA rate = outage rate
Case 3. Infrastructures, factories (contd.)
Figure: Benefit vs. cost for various replacement strategies, analyzed by machine learning
“Machine Learning for the New York City Power Grid”, IEEE Trans. PAMI, 2012
Case 4. Genome Analysis
• Next-generation sequencers bring big changes
  • Human genome sequencing: $3 billion / 10 years in 2001 becomes $7,700 / 1 day in 2012
  • GWAS (genome-wide association study) becomes popular
• Big impacts in many fields: Healthcare, Agriculture, Medicine
  • 23andMe analyzes users’ DNA and obtains information about their ancestries, health and genetic traits
Increasing demand in BigData applications: Higher necessity of deeper real-time analysis
• Current: simple aggregation and pre-defined rule processing on bigger data
  • CEP, Hadoop, DSMS
• Future: deeper analysis for rapid decisions and actions
Reference: http://web.mit.edu/rudin/www/TPAMIPreprint.pdf http://www.computerworlduk.com/news/networking/3302464/
Figure: Hadoop, CEP, and Jubatus positioned along two axes, “Deep analysis” and “Decision Speed”
Jubatus: OSS platform for Big Data analytics
• Joint development of PFI and NTT laboratory
  • Project started in April 2011
• Released as open source software
  • You can download it from: http://github.com/jubatus/
Key technology: Machine learning
• We need rapid decisions under uncertainties
  • Anomaly detection from M2M sensor data
  • Energy demand forecast / Smart grid optimization
  • Security monitoring on raw Internet traffic
• What is missing for fast & deep analytics on BigData?
  • Online/real-time machine learning platform + Scale-out distributed machine learning platform
Figure: 1. Bigger data, 2. Real-time, 3. Deeper analysis
Online machine learning
• Batch machine learning
  • Scan all data before building a model
  • Analysis becomes available only after all data is prepared
• Online machine learning
  • Model is updated instantaneously by each data sample
  • Online models converge to the batch models
  • The convergence is very fast, approx. 100 times faster than batch (1 day -> 5 min.)
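To make the online setting concrete, here is a minimal sketch of a model that is updated by each sample as it arrives. It uses the classic perceptron rule as a stand-in; the class and function names (`OnlinePerceptron`, `train_online`) are hypothetical and are not Jubatus APIs.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label in {-1, +1})


class OnlinePerceptron:
    """Linear model that is updated one sample at a time (illustrative only)."""

    def __init__(self, dim: int) -> None:
        self.w = [0.0] * dim

    def score(self, x: List[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x: List[float]) -> int:
        return 1 if self.score(x) >= 0 else -1

    def update(self, x: List[float], y: int) -> None:
        # Perceptron rule: change the model only when the sample is
        # misclassified (or lies exactly on the decision boundary).
        if y * self.score(x) <= 0:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]


def train_online(model: OnlinePerceptron, stream: Iterable[Sample]) -> None:
    # No need to store the stream: each sample is used once and then dropped,
    # and the model can serve predictions after every single update.
    for x, y in stream:
        model.update(x, y)
```

A batch learner would instead need the whole data set in hand before producing any model, which is the contrast the slide draws.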
Jubatus employs the latest online machine learning
• Advantages: fast and memory-efficient
  • Low latency & high throughput
  • No need for large dataset storage
• E.g. online learning for linear classification
  • Perceptron (1958)
  • Passive Aggressive (2003)
  • Confidence Weighted Learning (2008)
  • AROW (2009)
  • Normal HERD (2010)
  • Soft Confidence Weighted Learning (2012)
Very recent progress
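As an illustration of one algorithm in this family, the following is a minimal sketch of the basic binary Passive Aggressive update: stay passive when the sample is classified with margin at least 1, otherwise make the smallest change that reaches margin 1. This is the textbook rule without a regularization parameter; the function names are hypothetical, and the code is not taken from Jubatus.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def pa_update(w: List[float], x: List[float], y: int) -> List[float]:
    """Return the weights after one Passive Aggressive step (y in {-1, +1})."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this sample
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # passive: margin already satisfied, or empty sample
    tau = loss / norm_sq  # aggressive: smallest step that fixes the margin
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

Later algorithms in the list (CW, AROW, and so on) additionally keep a per-feature confidence; that is not shown here.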
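Data analysis goes Real-time/Online and Large scale
• Jubatus combines them into a unified computation framework
Figure (quadrants: Batch vs. Real-time/Online, Small scale vs. Large scale):
  Batch, small scale / stand-alone: WEKA 1993-, SPSS 1988-
  Batch, large scale & distributed/parallel computing: Mahout 2006-
  Real-time/online, small scale / stand-alone: Online ML alg. (Structured Perceptron 2001, PA 2003, CW 2008)
  Real-time/online, large scale & distributed/parallel computing: Jubatus 2011-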
What Jubatus currently supports
1. Classification (multi-class)
   • Perceptron / PA / CW / AROW
2. Regression
   • PA-based regression
3. Nearest neighbor
   • LSH / MinHash / Euclid LSH
4. Recommendation
   • Based on nearest neighbor
5. Anomaly detection
   • LOF based on nearest neighbor
6. Graph analysis
   • Shortest path / Centrality (PageRank)
7. Simple statistics
We support most machine learning/data mining technologies
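Several items in the list above build on approximate nearest neighbor search. As a rough illustration of one of the named techniques, the sketch below estimates Jaccard similarity between two sets with MinHash signatures. The seeded SHA-1 hashing scheme and the function names are assumptions made for this example, not Jubatus internals.

```python
import hashlib
from typing import Iterable, List


def _hash(seed: int, item: str) -> int:
    # One member of a family of hash functions, selected by `seed`.
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Iterable[str], num_hashes: int = 64) -> List[int]:
    """Keep, for each hash function, the minimum hash over a non-empty set."""
    members = list(items)
    return [min(_hash(seed, m) for m in members) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The signature has a fixed size regardless of the set size, so a server can compare many items quickly while holding only compact signatures in memory.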
Hadoop and Mahout are not good for online learning
• Hadoop
  • Advantages
    • Many extensions for a variety of applications
    • Good for distributed data storing and aggregation
  • Disadvantages
    • No direct support for machine learning and online processing
• Mahout
  • Advantages
    • Popular machine learning algorithms are implemented
  • Disadvantages
    • Some implementations are less mature
    • Still not capable of online machine learning
Jubatus vs. Hadoop, RDB, and Storm: Advantage in online AND distributed ML
• Only Jubatus satisfies both of them at the same time

                      Jubatus   Hadoop        RDB               Storm
Storing BigData       -         ✓✓ (HDFS)     ✓                 -
Batch learning        ✓         ✓ (Mahout)    ✓✓ (SPSS, etc.)   -
Stream processing     ✓         -             -                 ✓✓
Distributed learning  ✓✓        ✓ (Mahout)    -                 -
Online learning       ✓✓        -             -                 -

Slide annotation next to the lower rows: “High importance”
Distributed online learning algorithm is not trivial
• Online learning requires frequent model updates
• Naïve distributed architecture leads to too many synchronization operations
Figure: timelines of batch learning and online learning. Batch learning alternates a long “Learn” phase with a “Model update” and is easy to parallelize; online learning repeats “Learn” and “Model update” continuously and is hard to parallelize due to frequent updates.
Solution: Loose model sharing
• Jubatus only shares the local models in a loose manner
  • Fact: Model size << Data size
  • Does not share data sets
  • Unique approach compared to existing frameworks
• Local models can be different on the servers
  • Different models will be gradually merged
Figure: three servers, each with its own model, all ending up with the mixed model
Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
1. UPDATE
   • Receive a sample, learn and update the local model
2. ANALYZE
   • Receive a sample, apply the local model, return the result
3. MIX (automatically executed in backend)
   • Exchange and merge the local models between servers
• Cf. Map-Shuffle-Reduce operations on Hadoop
• Algorithms can be implemented independently from
  • Distribution logic
  • Data sharing
  • Failover
UPDATE
• Each data sample is sent to one (or two) server(s)
• Local models are updated based on the sample
• Data samples are NEVER shared
Figure: samples are distributed randomly or consistently to the servers; each server updates its initial model into a local model (Local model 1, Local model 2)
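The two ways of distributing samples for UPDATE can be sketched as follows. “Consistently” is modeled here with a simple hash-modulo scheme so that the same key always reaches the same server; that scheme, the per-label counter used as a stand-in local model, and all names are assumptions for illustration, not how Jubatus itself routes requests.

```python
import hashlib
import random
from collections import Counter
from typing import List, Tuple


def route_randomly(num_servers: int, rng: random.Random) -> int:
    """Pick any server; load is spread evenly on average."""
    return rng.randrange(num_servers)


def route_consistently(key: str, num_servers: int) -> int:
    """Map the same key to the same server every time (simplified scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


class UpdateServer:
    """Holds a local model; per-label counts stand in for a real model."""

    def __init__(self) -> None:
        self.local_model: Counter = Counter()

    def update(self, label: str) -> None:
        # Only the local model changes; the sample is neither stored
        # nor forwarded to other servers.
        self.local_model[label] += 1


def send_updates(servers: List[UpdateServer],
                 samples: List[Tuple[str, str]]) -> None:
    for key, label in samples:
        servers[route_consistently(key, len(servers))].update(label)
```

Swapping `route_consistently` for `route_randomly` in `send_updates` gives the random variant; in both cases each sample touches a single local model.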
MIX
• Each server sends its model diff (difference)
• Model diffs are merged and distributed
• Only model diffs are transmitted
Figure (for servers i = 1, 2):
  Model diff i = Local model i - Initial model
  Merged diff = Model diff 1 + Model diff 2
  Mixed model = Initial model + Merged diff
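The arithmetic in the MIX diagram can be written out as a small sketch over weight vectors. Following the diagram, the diffs are simply added; averaging or weighting them would be another possible merge rule and is not shown. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def model_diff(local: Vector, initial: Vector) -> Vector:
    """Model diff = Local model - Initial model."""
    return [l - i for l, i in zip(local, initial)]


def merge_diffs(diffs: List[Vector]) -> Vector:
    """Merged diff = Model diff 1 + Model diff 2 + ... (element-wise sum)."""
    return [sum(column) for column in zip(*diffs)]


def apply_diff(initial: Vector, merged: Vector) -> Vector:
    """Mixed model = Initial model + Merged diff."""
    return [i + m for i, m in zip(initial, merged)]


def mix(initial: Vector, local_models: List[Vector]) -> Vector:
    # Only the diffs would cross the network; data samples never do.
    diffs = [model_diff(local, initial) for local in local_models]
    return apply_diff(initial, merge_diffs(diffs))
```

For example, with an initial model `[1.0, 1.0]` and local models `[1.5, 1.0]` and `[1.0, 0.0]`, the diffs are `[0.5, 0.0]` and `[0.0, -1.0]`, so every server ends up with the mixed model `[1.5, 0.0]`.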
UPDATE (iteration)
• Each server starts updating from the mixed model
• The mixed model improves gradually thanks to all of the servers
Figure: samples are distributed randomly or consistently; each server updates the mixed model into a new local model (Local model 1, Local model 2)
ANALYZE
• For analysis, each sample randomly goes to a server
• Server applies the current mixed model to the sample
  • Uses the model on the local server only; does not communicate with other servers
• The results are returned to the client
Figure: samples are distributed randomly; each server applies its mixed model and returns a prediction
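A minimal sketch of the ANALYZE path, assuming a linear model and hypothetical names (`AnalysisServer`, `analyze`): the client picks a server at random, and that server answers from its own copy of the mixed model without contacting any other server.

```python
import random
from typing import List, Sequence


class AnalysisServer:
    """Keeps its own copy of the mixed model (illustrative linear model)."""

    def __init__(self, mixed_model: Sequence[float]) -> None:
        self.model = list(mixed_model)

    def analyze(self, x: Sequence[float]) -> int:
        # Purely local computation: no network call to other servers.
        score = sum(w * xi for w, xi in zip(self.model, x))
        return 1 if score >= 0 else -1


def analyze(servers: List[AnalysisServer], x: Sequence[float],
            rng: random.Random) -> int:
    """Send the sample to a randomly chosen server and return its prediction."""
    return rng.choice(servers).analyze(x)
```

Right after a MIX every server holds the same model, so the answer does not depend on which server is chosen; between MIX rounds the local copies may differ slightly.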
Why can Jubatus work in real time?
1. Focus on online machine learning
   • Make online machine learning algorithms distributed
2. Update locally
   • Online training without communication with others
3. Mix only models
   • Small communication cost, low latency, good performance
   • Advantage compared to costly Shuffle in MapReduce
4. Analyze locally
   • Each server has the mixed model and need not communicate
   • Low latency for making predictions
5. Everything in-memory
   • Process data on-the-fly
Summary
• Jubatus is the first OSS platform for online distributed machine learning on BigData streams.
• Download it from http://github.com/jubatus/
• We welcome your contribution and collaboration
Figure: 1. Bigger data, 2. More in real-time, 3. Deep analysis
Copyright © 2006-2012 Preferred Infrastructure All Rights Reserved.