What is Jubatus? How does it work for you?
NTT SIC Hiroki Kumazaki
Jubatus is…
• A Distributed Online Machine-Learning framework
• Distributed
  – Fault-Tolerance
  – Scale out
• Online
  – Fixed time computation
• Machine-Learning
  – More than “word count”!
Architecture
• The ML model is combined with a feature extractor
[Figure: a Jubatus server containing a Feature Extractor and a Machine Learning Model, reached via Jubatus RPC]
Architecture
• Distributed Computation
  – Shared-Everything Architecture
    • It’s fast and fault-tolerant!
[Figure: servers linked by “Mix”]
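The slides do not say what Mix does internally. As a rough sketch only, assuming Mix behaves like parameter averaging between servers that each hold a full copy of the model, it could look like this. The function name `mix` and the weight dictionaries are hypothetical and are not part of the Jubatus API.

```python
def mix(models):
    """Average sparse weight vectors (dicts) held by several servers.

    Assumption: Mix is modeled here as plain parameter averaging.
    """
    keys = set().union(*models)  # every feature any server knows about
    return {k: sum(m.get(k, 0.0) for m in models) / len(models) for k in keys}


# Hypothetical local models learned independently on two servers.
server_a = {"im": 1.0, "de": 0.5}
server_b = {"de": 1.5, "pu": -1.0}

shared = mix([server_a, server_b])
print(sorted(shared.items()))
```

After such a step every server would continue training from the same shared weights.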
Architecture
• It looks as if a single server is running.
[Figure: the client talks Jubatus RPC to a proxy in front of the servers]
Architecture
• It looks as if a single server is running
  – You can use a single local Jubatus server for development
  – A cluster of multiple Jubatus servers for production
[Figure: the client uses Jubatus RPC in both cases: “The same RPC!”]
Architecture
• With heavy load…
[Figure: client, Jubatus RPC, proxy]
Architecture
• Dynamically scale out!
[Figure: client, Jubatus RPC, proxy]
Architecture
• Whenever servers break down
  – The proxy conceals failures, so the service will continue.
[Figure: client, Jubatus RPC, proxy]
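The proxy behavior described on these slides (one entry point, scale-out, hidden failures) can be illustrated with a small in-process simulation. Everything here is hypothetical: `FakeServer`, `Proxy` and `ServerDown` are stand-ins, and round-robin selection is an assumption, not a statement about how the real Jubatus proxy picks a server.

```python
class ServerDown(Exception):
    """Raised by a simulated server that is not reachable."""


class FakeServer:
    """Hypothetical stand-in for one Jubatus server."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def classify(self, datum):
        if not self.alive:
            raise ServerDown(self.name)
        return f"{self.name} handled {datum!r}"


class Proxy:
    """Forwards each call to the next live server (round-robin assumed)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.next_index = 0

    def classify(self, datum):
        for _ in range(len(self.servers)):
            server = self.servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.servers)
            try:
                return server.classify(datum)
            except ServerDown:
                continue  # conceal the failure and try the next server
        raise ServerDown("no server available")


servers = [FakeServer("s1"), FakeServer("s2"), FakeServer("s3")]
proxy = Proxy(servers)
servers[1].alive = False                     # one server breaks down
answers = [proxy.classify("x") for _ in range(4)]

proxy.servers.append(FakeServer("s4"))       # scale out under load
more = [proxy.classify("x") for _ in range(3)]
print([a.split()[0] for a in answers + more])
```

The client code calls `proxy.classify` exactly as it would call a single server, which is the point the slides make with “The same RPC!”.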
Architecture
• Multilanguage client library
  – gem, pip, cpan, maven: ready!
  – It essentially uses messagepack-rpc.
    • So you can use OCaml, Haskell, JavaScript, or Go at your own risk.
[Figure: client, Jubatus RPC]
Architecture
• Many ML algorithms
  – Classifier
  – Recommender
  – Anomaly Detection
  – Clustering
  – Regression
  – Graph Mining
Useful!
Classifier
• Task: classification of a datum
import sys

def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)

if __name__ == "__main__":
    print(fib(int(sys.argv[1])))

def fib(a)
  if a == 1 or a == 0
    1
  else
    return fib(a-1) + fib(a-2)
  end
end

if __FILE__ == $0
  puts fib(ARGV[0].to_i)
end
Sample task: classify which programming language is used
[Figure: each sample is labeled “It’s …” followed by its language logo]
Classifier
• Set the configuration in the Jubatus server
[Figure: the configuration below is applied to the Feature Extractor in front of the Classifier]
"converter": {
  "string_types": {
    "bigram": { "method": "ngram", "char_num": "2" }
  },
  "string_rules": [
    { "key": "*", "type": "bigram", "sample_weight": "tf", "global_weight": "idf" }
  ]
}
Classifier
• Configuration JSON
  – It does “feature vector design”
  – A very important step for machine learning
"converter": {
  "string_types": {
    "bigram": { "method": "ngram", "char_num": "2" }
  },
  "string_rules": [
    { "key": "*", "type": "bigram", "sample_weight": "tf", "global_weight": "idf" }
  ]
}
Annotations on the configuration:
  – settings for extracting features from strings
  – define a function named “bigram”
  – the original embedded function “ngram”
  – pass “2” to “ngram” to create “bigram”
  – apply “bigram” to all data
  – feature weights based on tf/idf (see wikipedia/tf-idf)
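To make the "sample_weight": "tf" and "global_weight": "idf" settings concrete, here is a minimal sketch of tf-idf weighting over character bigrams. It uses the textbook form tf × log(N / df), which is an assumption: the exact formula Jubatus applies may differ. The helper names `bigrams` and `tf_idf` are made up for this example.

```python
import math
from collections import Counter


def bigrams(text):
    """Count every pair of adjacent characters (term frequency, tf)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def tf_idf(documents):
    """Weight each bigram by tf * log(N / df), a common textbook variant."""
    counts = [bigrams(doc) for doc in documents]
    n_docs = len(documents)
    # df: in how many documents each bigram appears at least once
    doc_freq = Counter(key for c in counts for key in c)
    return [
        {key: tf * math.log(n_docs / doc_freq[key]) for key, tf in c.items()}
        for c in counts
    ]


weights = tf_idf(["def fib(a):", "def fib(a)", "puts fib"])
# "fi" occurs in every document, so it carries no weight;
# "):" occurs in only one, so it is a stronger clue.
print(weights[0]["fi"], weights[0]["de"], weights[0]["):"])
```

Bigrams shared by every sample are weighted down to zero, while bigrams unique to one sample get the largest weight.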
Classifier
• The Feature Extractor becomes a “bigram extractor”
[Figure: bigram extractor in front of the Classifier]
Feature Extractor
• What does the bigram extractor do?
[Figure: the sample below goes through the bigram extractor and comes out as a feature vector]
import sys

def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)

if __name__ == "__main__":
    print(fib(int(sys.argv[1])))

Feature Vector
key  value
im   1
mp   1
po   1
...  ...
):   1
...  ...
de   1
ef   1
...  ...
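The extraction step shown above can be sketched in a few lines. This is an illustration of character n-gram counting in general, not the Jubatus implementation; the function name `char_ngrams` is hypothetical, and how Jubatus treats whitespace or newlines is not covered by the slides.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


features = char_ngrams("import sys")
print(features["im"], features["mp"], features["po"])
```

With `n=2` this plays the role of the “bigram” function defined in the configuration; passing a different `n` would correspond to a different "char_num".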
Classifier
• Training the model with feature vectors
[Figure: three feature vectors are fed to the Classifier]
key  value    key  value    key  value
im   1        pu   1        @a   1
mp   1        ut   1        $_   1
po   1        ...  ...      ...  ...
...  ...      {|   ...      my   ...
):   1        |m   1        su   1
...  ...      m|   1        ub   1
de   1        {|   1        us   1
ef   1        en   1        se   1
...  ...      nd   1        ...  ...
Classifier
• Set the configuration in the Jubatus server
"method": "AROW",
"parameter": {
  "regularization_weight": 1.0
}
[Figure: Feature Extractor (bigram extractor) followed by the Classifier]
Classifier algorithms
• Perceptron
• Passive Aggressive
• Confidence Weighted
• Adaptive Regularization of Weights
• Normal Herd
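The slides configure AROW. As a simpler illustration of how an online classifier learns from feature vectors one sample at a time, the sketch below swaps in the perceptron, the first algorithm in the list above. It is not the Jubatus code; the class `Perceptron`, the helper `bigrams` and the two toy training strings are all made up for this example.

```python
from collections import Counter


def bigrams(text):
    """Character bigram counts, standing in for the feature extractor."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


class Perceptron:
    """Multiclass perceptron over sparse feature vectors (dicts)."""

    def __init__(self):
        self.weights = {}  # label -> {feature: weight}

    def score(self, label, features):
        w = self.weights[label]
        return sum(w.get(k, 0.0) * v for k, v in features.items())

    def classify(self, features):
        # Ties go to the label that was seen first.
        return max(self.weights, key=lambda label: self.score(label, features))

    def train(self, label, features):
        self.weights.setdefault(label, {})
        predicted = self.classify(features)
        if predicted == label:
            return  # the perceptron only updates on mistakes
        for k, v in features.items():
            self.weights[label][k] = self.weights[label].get(k, 0.0) + v
            self.weights[predicted][k] = self.weights[predicted].get(k, 0.0) - v


# Toy training data, invented for this sketch.
samples = [("python", "def f(a): return a"), ("ruby", "def f(a) a end")]
model = Perceptron()
for _ in range(5):  # a few passes over the toy data
    for label, source in samples:
        model.train(label, bigrams(source))

print(model.classify(bigrams("return a")))
print(model.classify(bigrams("a end")))
```

Each call to `train` touches only the features of one sample, which is what keeps the per-sample cost fixed. AROW additionally tracks a confidence per feature, which this sketch leaves out.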
Classifier
• Use the model for a classification task
  – Jubatus will find clues for classification
[Figure: AROW receives the feature vector below and answers “It’s …”]
key  value
si   1
il   1
...  ...
{|   1
...  ...
Classifier
• Use the model for a classification task
  – Jubatus will find clues for classification
[Figure: AROW receives the feature vector below and answers “It’s …”]
key  value
re   1
):   1
...  ...
s[   1
...  ...
Via RPC
• Call feature extraction and classification from the client via RPC
lang = client.classify([sourcecode])
[Figure: the client sends the Python sample; the bigram extractor turns it into the feature vector (im 1, mp 1, po 1, ..., ): 1, ..., de 1, ef 1, ...) and AROW answers “It may be …”]
What can a classifier do?
• You can
  – estimate the topic of tweets
  – trash spam mail automatically
  – monitor server failures from syslog
  – estimate user sentiment from blog posts
  – detect malicious attacks
  – find which feature is the best clue for classification
What a classifier cannot do
• You cannot
  – train a model from data without supervised answers (labels)
  – create a class without knowledge of the class
  – get a good model without proper feature design
How to use?
• See the examples in http://github.com/jubatus/jubatus-example
  – gender
  – shogun
  – malware classification
  – language detection
Recommender
• Task: which data are similar to a given datum?
Name   Star Wars   Harry Potter   Star Trek   Titanic   Frozen
John 4 3 2 2
Bob 5 3
Erika 1 3 4 5
Jack 2 5
Ann 4 5
Emily 1 4 2 5 4
Which movie should we recommend to Ann?
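Recommender
• Do recommendation based on Nearest Neighbor
[Figure: users John, Jack, Erika, Ann, Bob and Emily plotted in the high-dimensional movie-rating space, with regions labeled Science Fiction, Star Trek lover, Star Wars lover, Love Romance and Fantasy; pairs of users are marked Near or Far]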
Recommender
• Ann and Emily are near
  – We should recommend Frozen to Ann
Name   Star Wars   Harry Potter   Star Trek   Titanic   Frozen
Ann 4 5 ★
Emily 1 4 2 5 4
I bet Ann would like it!
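A nearest-neighbor recommendation of this kind can be sketched with cosine similarity over sparse rating dictionaries. Emily's ratings are taken from the table above. Which movies Ann and Bob rated is an assumption made for this example, since the column positions of their ratings are not recoverable from the transcript. The function names and the choice of cosine similarity are likewise illustrative, not the Jubatus method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(target, ratings):
    """Return (nearest user, that user's best-rated movie the target has not rated)."""
    others = {name: r for name, r in ratings.items() if name != target}
    nearest = max(others, key=lambda name: cosine(ratings[target], others[name]))
    unseen = {m: s for m, s in others[nearest].items() if m not in ratings[target]}
    best = max(unseen, key=unseen.get) if unseen else None
    return nearest, best


ratings = {
    # Assumed column placement for Ann and Bob (hypothetical).
    "Ann": {"Harry Potter": 4, "Titanic": 5},
    "Bob": {"Star Wars": 5, "Star Trek": 3},
    # Emily rated all five movies, so her row is unambiguous.
    "Emily": {"Star Wars": 1, "Harry Potter": 4, "Star Trek": 2,
              "Titanic": 5, "Frozen": 4},
}

print(recommend("Ann", ratings))
```

Under these assumed ratings, Emily is Ann's nearest neighbor and Frozen is the highest-rated movie Emily has seen that Ann has not.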
Recommender with Feature Extractor
• The Recommender server consists of a Feature Extractor and a Recommender engine.
  – Jubatus calculates the distance between feature vectors
[Figure: Feature Extractor followed by the Recommender]
The Recommender engine can use one of the following to define distance:
• Minhash
• Locality Sensitive Hashing
• Euclid Locality Sensitive Hashing
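To give a feel for the first method in the list, here is a sketch of plain MinHash over sets of bigrams. It estimates Jaccard similarity from short signatures instead of comparing full feature vectors. This is the textbook set-based version and ignores feature weights; whether and how Jubatus handles weights is not covered here. The function names and the use of SHA-256 as the hash family are choices made for this example.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of item under a given seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function; items must be non-empty."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


python_a = minhash_signature({"im", "mp", "po", "de", "ef"})
python_b = minhash_signature({"im", "mp", "po", "de", "ef"})
ruby = minhash_signature({"pu", "ut", "en", "nd"})

print(estimated_jaccard(python_a, python_b))  # identical sets agree everywhere
print(estimated_jaccard(python_a, ruby))      # disjoint sets agree almost nowhere
```

Signatures have a fixed length regardless of how many features a datum has, which is what makes similarity search cheap on large feature vectors.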
Recommender with Feature Extractor
• Jubatus maps data into a feature space
  – There are distances between data
    • How near or far are they?
[Figure: the Feature Extractor produces the three feature vectors below; the Recommender places them as points labeled Ruby, Python and Java]
key  value    key  value    key  value
pu   1        im   1        Ma   1
ut   1        mp   1        ap   1
...  ...      ...  ...      ...  ...
{|   ...      ...  ...      in   1
|m   1        "{   1        nt   1
m|   1        fo   1        te   1
{|   1        ...  ...      er   1
What can the Recommender do?
• You can
  – create a recommendation engine for e-commerce
  – calculate the similarity of tweets
  – find NBA players of a similar type
  – visualize the distance between “Star Wars” and “Star Trek”
What can the Recommender not do?
• You cannot
  – label data (use the classifier!)
  – get a decision tree
  – get Apriori-based recommendations
Anomaly Detection
• Task: which datum is far from the others?
[Figure: a set of data points; one point away from the rest is marked “This One!”]
Anomaly Detection
• Distance-based detection is not good
  – We cannot decide an appropriate distance threshold
[Figure: caption “Distance is equal!”]
Anomaly Detection with Feature Extractor
• The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
  – Jubatus finds outliers among feature vectors
[Figure: Feature Extractor followed by Anomaly Detection]
The anomaly detection engine can use one of the following to define distance:
• Minhash
• Locality Sensitive Hashing
• Euclid Locality Sensitive Hashing
Anomaly Detection
• jubaanomaly can do it!
  – It is based on the local outlier factor algorithm
[Figure: the Feature Extractor produces the three feature vectors below; Anomaly Detection marks one of them “Outlier!”]
key  value    key  value    key  value
pu   1        im   1        Ma   1
ut   1        mp   1        ap   1
...  ...      ...  ...      ...  ...
{|   ...      ...  ...      in   1
|m   1        "{   1        nt   1
m|   1        fo   1        te   1
{|   1        ...  ...      er   1
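The local outlier factor (LOF) idea can be sketched with an exact brute-force version on 2-D points. It compares each point's local density with the density of its neighbors, so no global distance threshold is needed. This is the textbook algorithm, not the jubaanomaly implementation; the function name, the choice of k and the sample points are made up for this example, and the points are assumed to be distinct.

```python
import math


def lof_scores(points, k=2):
    """Local outlier factor for each point (brute force, distinct points)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])                # the k nearest neighbors of i
        k_distance.append(dist[i][order[k - 1]])   # distance to the k-th neighbor

    def local_reachability_density(i):
        # reachability distance from i to j: max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density
    return [sum(density[j] for j in neighbors[i]) / (k * density[i])
            for i in range(n)]


# Four points in a tight square and one point far away (invented data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points)
print(scores.index(max(scores)))
```

Points inside the cluster score about 1, while the isolated point scores far higher because its neighbors are much denser than it is.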
What can Anomaly Detection do?
• You might be able to
  – find outliers
  – grasp the trend and overview of the current data stream
  – detect or predict server failures
  – protect Web services from zero-day attacks
What can Anomaly Detection not do?
• You cannot
  – know the cluster distribution of the data
  – find every kind of outlier with 100% accuracy
  – easily understand how each outlier occurs
  – know why a datum is assigned a high outlier score
Conclusion
• Jubatus has a feature extractor embedded alongside its algorithms.
• Users should configure both the feature extractor and the algorithm properly.
• Clients use the configured machine learning via Jubatus RPC.
• Classifier, Recommender, and Anomaly Detection may be useful for your task.
DEMO
• I will try running jubatus-example.