TRANSCRIPT
Machine learning in data streams and batch
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
András Benczúr, Informatics Laboratory, „Big Data – Momentum” research group. Győr, 1/6/2016
Batch and then data streaming recommendation
● Surprisingly, reading the data only once and forgetting helps!
● Our first main observation
Context-aware recommendation/modeling tasks
Recommendation by weather, walking, jogging, biking, driving
Last.fm music and social network: friends influence
Twitter hashtag spread by geolocation
Link prediction, friend recommendation
Restaurants by location
→ Quickly react to changes in context, retrain models
Palovics, Daroczy, Benczur, Pap, Ermann, Phan, Chepelianskii, Shepelyansky, Statistical analysis of NOMAO customer votes for spots of France
Last.fm friend influence
R.Palovics, A.A.Benczur, L.Kocsis, T.Kiss, E.Frigo, "Exploiting temporal influence in online recommendation", ACM RecSys (2014)
Social media prediction
Daróczy, Pálovics, Wieszner, Farkas, Benczúr: Predicting User-specific Temporal Retweet Count Based on Network and Content Information. INRA workshop, RecSys 2015
Information spread prediction in time
Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. Location-Aware Online Learning for Top-k Hashtag Recommendation. LocalRec 2015
Terminology – all depend on data scale („Big Data”)
Batch
• Repeatedly read all training data, multiple times
• Stochastic gradient: use each example multiple times, in random order
• Elaborate optimization procedures, e.g. SVM
+ More accurate (?)
+ Easy to implement (?)
Online learning
• Update the model immediately on each new event, e.g. with a large learning rate
Data streaming
• Read training/testing data only once, no chance to store it
Real time / Interactive
+ More timely, adapts fast
- Challenging to implement
„Big Data” today mostly about software infrastructure
Batch: MapReduce, Hadoop, Spark, Flink
Streaming: Storm, Spark, Flink
MapReduce had the first open source distributed implementation, Hadoop
Limitations
Joins and more complex second-order functions
Graphs, machine learning
Alternatives
Graph processing, most notably GraphLab, but also Giraph, HAMA, …
In-memory „mini-batches” on top of Hadoop: Spark
Second-order function optimization over a streaming dataflow engine: Flink
A most challenging profession …
STREAMLINE H2020
New initiative on top of Apache Flink
A "use-case complete" framework to
unify batch and stream processing
DFKI (DE)
SICS (SE)
Portugal Telecom (PT)
Internet Memory (FR)
Rovio (FI)
NMusic (PT)
SZTAKI (HU)
B. – Volker Markl (TU Berlin)
STREAMLINE Magic Triangle
Challenge | Present Status | Goal | Action | Leader
Delayed information processing | No up-to-date, timely predictions | Reactivity | Same unified system for data at rest and data in motion | DFKI
Actionable intelligence: lack of appropriate analytics | Poor or non-timely prediction results in user churn, business losses | Prediction quality | Library for combined batch and stream machine learning | SZTAKI
Skills shortage: human latency | Multiple expertise needed for data scientists, expensive to operate | Ease of implementation | High-level declarative language | SICS
Flink: batch and stream on the same execution engine
[Architecture diagram: real-time data streams and event logs (Kafka, RabbitMQ, ...) and historic data (HDFS, JDBC, ...) feed a single engine that supports ETL, graphs, machine learning, relational processing, and low-latency windowing, aggregations, ...]
An engine that puts equal emphasis on streaming and batch
Introduction to Matrix Factorization in Recommenders
Netflix Prize
Predict the rating for a given user-item pair, based on other user-item ratings from the past
Rating prediction task: for a given (u, i) pair, predict r(u, i)
Explicit data: ratings are available
The data is static; there is no temporal evolution
[Illustration: sparse Users × Items rating matrix with a few known ratings (5, 3, 3, 4) and an unknown entry „?” to predict]
Netflix Prize vs. Present
Implicit recommendation
Instead of ratings, we only have info on interactions between users and items, e.g. clicks
Top-K recommendation
Instead of rating prediction, we consider top-k item recommendation for a given user
Online environment can be strongly temporal
E.g. news, hashtags on Twitter rapidly appear and disappear
[Illustration: user u and recommended items i1, i2, i3, i4]
Latent factor models
Items and users are described by unobserved factors
Each item is summarized by a d-dimensional vector $V_i$
Similarly, each user is summarized by $U_u$
Predicted rating for item i by user u: the inner product of $V_i$ and $U_u$,
$\hat r_{ui} = \sum_{k=1}^{d} U_{uk} V_{ik}$
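As a quick illustration of the prediction formula (a minimal sketch of my own; the array shapes and names are hypothetical):

```python
import numpy as np

d = 10                      # number of latent factors
U = np.random.rand(100, d)  # user factors, one row per user
V = np.random.rand(500, d)  # item factors, one row per item

def predict(u, i):
    """Predicted rating of item i by user u: sum_k U[u, k] * V[i, k]."""
    return U[u].dot(V[i])
```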
[Netflix illustration by Yehuda Koren: movies and users in a two-dimensional latent space, one axis running from „geared towards females” to „geared towards males”, the other from „serious” to „escapist”; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean’s 11, Sense and Sensibility, and users Gus and Dave]
General overview of MF approaches
Approximate user preferences, e.g. by
$\hat r_{u,i} = p_u^T q_i$
Objective function (error function), e.g. optimize for RMSE with regularization:
$L = \sum_{(u,i) \in Train} \left(\hat r_{u,i} - r_{u,i}\right)^2 + \lambda_U \sum_{u=1}^{S_U} \|P_u\|^2 + \lambda_I \sum_{i=1}^{S_I} \|Q_i\|^2$
Learning method, e.g.
Stochastic gradient descent (SGD)
Alternating Least Squares (ALS)
Why not SVD?
Most of the matrix values are unknown
Typically <1% filled
[Illustration: sparse rating matrix factorized into P and Q, with user factor vector $p_u$ and item factor vector $q_i$]
(S)GD
Matrix factorization algorithm
Randomly initialize U and V
While U × V does not approximate the known values of M well enough:
Choose a known value of M
Adjust the values of the corresponding row of U and column of V to improve the approximation
Upon seeing a rating $M_{ui}$, optimize for the squared error $(\sum_k U_{uk} V_{ik} - M_{ui})^2$
Example: for the row (3, 2) of U, the column (1, 3) of V and the rating 4, the error is 3·1 + 2·3 − 4 = 5
What is a good adjustment step?
1. Adjustment proportional to the error: let it be ε times the error
2. Take into account how much a value contributes to the error
For the selected row, with ε = 0.1: the value 3 is multiplied by 1, so it is adjusted by ε·5·1 = 0.5; the value 2 is multiplied by 3, so it is adjusted by ε·5·3 = 1.5
For the selected column respectively: adjustments ε·5·3 = 1.5 and ε·5·2 = 1.0
Updated row: (2.5, 0.5); updated column: (−0.5, 2.0)
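A minimal sketch reproducing the numbers above (my own illustration, not the authors' code):

```python
import numpy as np

eps = 0.1                     # learning rate from the slide

u_row = np.array([3.0, 2.0])  # selected row of U
v_col = np.array([1.0, 3.0])  # corresponding column of V
rating = 4.0                  # the known value of M

error = u_row.dot(v_col) - rating    # 3*1 + 2*3 - 4 = 5
u_new = u_row - eps * error * v_col  # [3 - 0.5, 2 - 1.5] = [2.5, 0.5]
v_new = v_col - eps * error * u_row  # [1 - 1.5, 3 - 1.0] = [-0.5, 2.0]
print(u_new, v_new)
```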
Gradient Descent Summary
We want to minimize RMSE
Same as minimizing MSE
The minimum is where the derivatives are zero, because the error surface is quadratic
SGD optimization, with prediction $\hat r_{ui} = \sum_{k=1}^{K} p_{uk} q_{ki}$:
$\mathrm{RMSE}(R^{test}) = \sqrt{\frac{1}{|R^{test}|} \sum_{(u,i) \in R^{test}} \left(r_{ui} - \hat r_{ui}\right)^2}$
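A small sketch of this evaluation (my own illustration; it assumes P is a users × K and Q an items × K array):

```python
import numpy as np

def rmse(test_ratings, P, Q):
    """RMSE over (u, i, r) triples: square root of the mean squared
    error of the predictions P[u] . Q[i]."""
    errors = [(r - P[u].dot(Q[i])) ** 2 for u, i, r in test_ratings]
    return np.sqrt(np.mean(errors))
```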
BRISMF model
Biased Regularized Incremental Simultaneous Matrix Factorization
Applies regularization to prevent overfitting
Uses bias values to further decrease RMSE
Model:
$\hat r_{ui} = b_u + c_i + \sum_{k=1}^{K} p_{uk} q_{ki}$
BRISMF Learning
Loss function:
$L = \frac{1}{2} \sum_{(u,i) \in R^{train}} \left(r_{ui} - b_u - c_i - \sum_{k=1}^{K} p_{uk} q_{ki}\right)^2 + \frac{\lambda}{2} \left(\sum_u \|p_u\|^2 + \sum_i \|q_i\|^2 + \sum_u b_u^2 + \sum_i c_i^2\right)$
SGD update rules, with error $e_{ui} = r_{ui} - \hat r_{ui}$ and learning rate $\eta$:
$p_{uk} \leftarrow p_{uk} + \eta (e_{ui} q_{ki} - \lambda p_{uk})$
$q_{ki} \leftarrow q_{ki} + \eta (e_{ui} p_{uk} - \lambda q_{ki})$
$b_u \leftarrow b_u + \eta (e_{ui} - \lambda b_u)$
$c_i \leftarrow c_i + \eta (e_{ui} - \lambda c_i)$
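A minimal sketch of one such simultaneous update (my own illustration of the rules above; the single shared λ and η are simplifying assumptions):

```python
import numpy as np

def brismf_step(p_u, q_i, b_u, c_i, r_ui, eta=0.01, lam=0.01):
    """One SGD step of BRISMF on a single rating r_ui.

    Updates the user vector, item vector and both biases
    simultaneously, all from the same prediction error."""
    e = r_ui - (b_u + c_i + p_u.dot(q_i))      # prediction error e_ui
    p_new = p_u + eta * (e * q_i - lam * p_u)  # user factor update
    q_new = q_i + eta * (e * p_u - lam * q_i)  # item factor update (old p_u)
    b_new = b_u + eta * (e - lam * b_u)        # user bias update
    c_new = c_i + eta * (e - lam * c_i)        # item bias update
    return p_new, q_new, b_new, c_new
```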
(i)ALS
Short detour: linear regression
$Ax = b$, a linear equation
$A \in \mathbb{R}^{N \times M}$ and $b \in \mathbb{R}^{N}$ are known
$x \in \mathbb{R}^{M}$ is unknown
Meaning:
Rows of $A$ are the samples
Elements of $b$ are the outputs for each sample
$x$ is a weighting vector
We assume the output is a linear combination of the inputs
Objective function: MSE,
$L = \frac{1}{N} \|b - Ax\|^2 = \frac{1}{N} \sum_{i=1}^{N} \left(b_i - A_i^T x\right)^2$
where $A_i$ denotes the $i$-th row of $A$
Solution of the linear regression
The error function is convex, so its minimum is attained where its derivative is zero
Gradient: $\frac{\partial L}{\partial x} = -\frac{2}{N} A^T (b - Ax)$
Setting it to zero: $A^T (b - Ax) = 0$
$A^T b = A^T A x$
$x = (A^T A)^{-1} A^T b$
When $A$ has full column rank, $A^T A$ is symmetric and positive definite, so its inverse exists
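A quick numerical check of the closed-form solution on synthetic data (a sketch of my own; all names and values are illustrative):

```python
import numpy as np

N, M = 100, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(N, M))          # sample matrix, rows = samples
x_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights
b = A @ x_true                       # outputs, noiseless for clarity

# x = (A^T A)^{-1} A^T b; solve() is preferred over an explicit inverse
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_hat, x_true))    # True
```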
Alternating Least Squares (ALS)
$R \approx \hat R = P^T Q$
Fix one of the matrices, say $P$
Given a fixed $P$, the $i$-th column of $\hat R$ depends only on the $i$-th column of $Q$
Problem to solve: $R_i = P^T Q_i$, a linear regression problem
Error function:
$L = \|R - \hat R\|_{Frob}^2 + \lambda_U \|P\|_{Frob}^2 + \lambda_I \|Q\|_{Frob}^2$
The derivative of $L$ with respect to $Q$ is a linear function of the columns of $Q$, therefore each column of $Q$ can be computed separately
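A compact sketch of the alternation (my own illustration for a fully observed R; real recommender data needs the weighted, implicit variant of the next slide):

```python
import numpy as np

def als(R, k=10, lam=0.1, iters=10):
    """Sketch of ALS: alternate closed-form ridge solves for P and Q
    so that R ~ P.T @ Q (P: k x users, Q: k x items)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.random((k, n_users))
    Q = rng.random((k, n_items))
    I = np.eye(k)
    for _ in range(iters):
        # Fix P; every column of Q solves (P P^T + lam I) Q_i = P R_i
        Q = np.linalg.solve(P @ P.T + lam * I, P @ R)
        # Fix Q; symmetric solve for the columns of P
        P = np.linalg.solve(Q @ Q.T + lam * I, Q @ R.T)
    return P, Q
```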
Implicit ALS – objective function
$L = \sum_{u=1}^{S_U} \sum_{i=1}^{S_I} w_{u,i} \left(\hat r_{u,i} - r_{u,i}\right)^2 + \lambda_U \sum_{u=1}^{S_U} \|P_u\|^2 + \lambda_I \sum_{i=1}^{S_I} \|Q_i\|^2$
Weighted MSE: $w_{u,i}$ is large if $(u, i) \in T$, and $w_0$ otherwise, with $w_0 \ll w_{u,i}$
Typical weights: $w_0 = 1$, $w_{u,i} = 100 \cdot \mathrm{supp}(u, i)$
What does it mean? Create two matrices from the events:
(1) Preference matrix: binary, where 1 represents the presence of an event
(2) Confidence matrix: expresses our certainty about the corresponding values in the preference matrix
Negative feedback is much less certain
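A tiny sketch of the two matrices, assuming the typical weights above ($w_0 = 1$, $w_{u,i} = 100 \cdot supp(u,i)$); the names are illustrative:

```python
import numpy as np

# Event counts supp(u, i): how often user u interacted with item i
counts = np.array([[3, 0, 1],
                   [0, 5, 0]], dtype=float)

preference = (counts > 0).astype(float)                 # binary events
confidence = np.where(counts > 0, 100.0 * counts, 1.0)  # w_ui vs. w0 = 1
```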
Experiments
Dataset: the Last.fm „30M” music listening dataset, crawled by the CrowdRec team
Implicit, timestamped music listening data
Each record contains: [ timestamp, user, artist, album, track, … ]
We investigate artist and track recommendation
We recommend and learn when the user interacts with an item for the first time
We discarded infrequent artists from the dataset
~50,000 users, ~100,000 artists, ~500,000 tracks
First we investigate artist recommendation
Online vs. Batch Matrix Factorization
● In what follows, top K = 100
● Batch has a periodic behavior
● Online strongly outperforms the batch methods
Shuffled Online vs. Batch Matrix Factorization
● Batch outperforms online after shuffling
● Gains of online MF come from non-stationary data
Batch and then Online Matrix Factorization
● Surprisingly, their combination works
● Concept: Users are non-stationary, but artists are kind of static
Sampling Positive Interactions From the Past
Something between batch and online MF (see the sketch below)
While learning online, we randomly sample interactions from the past for the given item
We emphasize the recent past by sampling with a geometric distribution
We only update the item vectors
[Illustration: current event (u, i, t) and earlier events (u1, i, t1), (u2, i, t2) on the same item]
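A minimal sketch of the sampling scheme (my own illustration; the function name and the parameter p are hypothetical):

```python
import numpy as np

def sample_past_interactions(history, n_samples, p=0.3):
    """Sample past events of one item, emphasizing the recent past.

    `history` is a non-empty, time-ordered list of
    (user, item, timestamp) events; positions are drawn from a
    geometric distribution counted backwards from the newest event."""
    offsets = np.random.geometric(p, size=n_samples) - 1  # 0 = most recent
    offsets = np.minimum(offsets, len(history) - 1)       # stay in range
    return [history[-1 - int(o)] for o in offsets]
```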
Track Recommendation
● For track recommendation, sampling is the best
Matrix factorization in a distributed system
MapReduce
Breadth-First Search (BFS)
[Illustration: BFS from a start node, with nodes labeled by their distance 1, 2, 3, 4]
MapReduce BFS
MAP:
For every node n, with its distance D from the start and its list of out-edges:
for each out-neighbor p of n, emit (p, D+1)
REDUCE:
Grouped by p
The new distance is the minimum received, if it is less than the current one (Bellman–Ford algorithm; Dijkstra is more efficient but cannot be parallelized)
Dump all data to disk, restart
Stop if no distance changes in an iteration
We need to join out-edges(p) with the new distance of p. Solution: we also emit (n, out-edges(n)) in the map phase
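A single-machine sketch of one such iteration (an illustration of the idea, not a real Hadoop job; all names are hypothetical):

```python
from collections import defaultdict

# graph: node -> (distance or None if unreached, list of out-neighbors)

def bfs_map(graph):
    for n, (dist, out_edges) in graph.items():
        yield n, ('EDGES', out_edges)        # ship the adjacency list too
        if dist is not None:
            yield n, ('DIST', dist)          # keep the current distance
            for p in out_edges:
                yield p, ('DIST', dist + 1)  # relax every out-edge

def bfs_reduce(mapped):
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    new_graph = {}
    for n, values in grouped.items():
        dists = [v for tag, v in values if tag == 'DIST']
        edges = next((v for tag, v in values if tag == 'EDGES'), [])
        new_graph[n] = (min(dists) if dists else None, edges)
    return new_graph

# One iteration; repeat until no distance changes, as on the slide:
# graph = bfs_reduce(bfs_map(graph))
```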
SGD, ALS implementations in Mahout
A single ALS iteration is easy:
$q_i = (P^T P)^{-1} P^T R_i = (P^T P)^{-1} \sum_{j=1}^{N} P_j^T R_{ij}$
Partition by i
Broadcast $P^T P$, just a k×k matrix
SGD? Updates affect both the user AND the item models
Partitioning by users or by items alone is not sufficient
There are efficient shared memory implementations, but no really nice distributed one
More iterations? Hadoop will write all information to disk; we may re-partition before writing to have it ready for the next iteration
Should we consider this efficient?
Distributed ALS needs new broadcast data primitive
$q_i = (P^T P)^{-1} P^T R_i = (P^T P)^{-1} \sum_{j=1}^{N} P_j^T R_{ij}$
For each nonzero $R_{ij}$ we have an „edge”
We need to emit $(P^T P)^{-1}$, of dimension k²
Join using i as the key, to compute Q
If we have a predefined partitioning, we should not emit the same data for ALL edges from partition x to partition y
With a naive implementation, network communication may become the bottleneck
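A toy sketch of the communication argument (my own illustration; it only counts payloads and sends nothing over a network):

```python
# Count how many k x k payloads cross the network for the nonzero R_ij
# "edges"; `partition` maps a row/column index to its partition id.
def payloads_naive(edges, partition):
    return len(edges)  # the matrix is emitted once per edge

def payloads_broadcast(edges, partition):
    pairs = {(partition[i], partition[j]) for (i, j) in edges}
    return len(pairs)  # once per (source, target) partition pair

# Example: 15 million edges across 10 x 10 partitions drop from
# 15,000,000 payloads to at most 100.
```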
Distributed ALS needs new broadcast data primitive
Generated test data, 15 million ratings (courtesy: Gravity)
[Chart: running time of plain Flink vs. Flink with the broadcast primitive]
Balassi, Palovics, Benczur, Distributed Frameworks for Alternating Least Squares, Large-Scale Recommender Systems 2014
Conclusions and Future Work
We need combinations of batch and online learning:
new users / items appear
old items can become irrelevant
users’ taste can change over time
Future work:
develop a learning algorithm that reflects how fast the user and the item parts of the MF model evolve,
e.g. sample more positive items from the past at the beginning
demonstrate online-batch learning combinations in Apache Flink
find more applications of highly temporal, fast-changing machine learning problems
András Benczúr
http://datamining.sztaki.hu
Credits: Róbert Pálovics, Levente Kocsis, Márton Balassi,
Erzsébet Frigó, Anna Oláh, … (SZTAKI)
Domonkos Tikk (Gravity)
Commercials
Data Analysis Challenges
RecSys 2016 Challenge starts at the beginning of February
Conference: September 15–19 in Boston
Task: recommend job ads to Xing users using location, area, expertise, and past view/like/delete interactions
Organizers
Fabian Abel
Daniel Kohlsdorf
Martha Larson
Róbert Pálovics
Data Analysis Challenges
ECML/PKDD Discovery Challenge 2016
Conf: Riva del Garda, Italy, September 19–23
Task: predict visit to bank branch by card use
Organizers:
CNR Italy
OTP Bank Hungary
SZTAKI
Recommender and Personalization Meetup, free GraphLab training
GraphLab Create training
Alon Palombo and Guy Rapoport from Dato, a Seattle-based startup
When: Thursday, June 9, 2016 2:00 PM to 5:00 PM
Where: MTA SZTAKI Kende u. 13-17., Budapest