TRANSCRIPT
Machine learning in data streams and batch
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
András Benczúr, Informatics Laboratory, „Big Data – Momentum” research group. Győr, 1/6/2016
Batch and then data streaming recommendation
● Surprisingly, reading the data only once and forgetting helps!
● Our first main observation
Context-aware recommendation/modeling tasks
Recommendation by weather, walking, jogging, biking, driving
Last.fm music and social network: friends influence
Twitter hashtag spread by geolocation
Link prediction, friend recommendation
Restaurants by location
→ Quickly react to changes in context, retrain models
Palovics, Daroczy, Benczur, Pap, Ermann, Phan, Chepelianskii, Shepelyansky, Statistical analysis of NOMAO customer votes for spots of France
Last.fm friend influence
R.Palovics, A.A.Benczur, L.Kocsis, T.Kiss, E.Frigo, "Exploiting temporal influence in online recommendation", ACM RecSys (2014)
Social media prediction
Daróczy, Pálovics, Wieszner, Farkas, Benczúr: Predicting User-specific Temporal Retweet Count Based on Network and Content Information. INRA workshop, RecSys 2015
Information spread prediction in time
Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. Location-Aware Online Learning for Top-k Hashtag Recommendation. LocalRec 2015
Terminology – all depend on data scale („Big Data”)
Batch
• Repeatedly read all training data, multiple times
• Stochastic gradient: use each example multiple times, in random order
• Elaborate optimization procedures, e.g. SVM
+ More accurate (?)
+ Easy to implement (?)
Online learning
• Update the model immediately on each new event, e.g. with a large learning rate
Data streaming
• Read training/testing data only once, no chance to store it
Real time / Interactive
+ More timely, adapts fast
- Challenging to implement
„Big Data” today mostly about software infrastructure
Batch: MapReduce, Hadoop, Spark, Flink
Streaming: Storm, Spark, Flink
MapReduce had the first open source distributed implementation, Hadoop
Limitations
Joins and more complex second-order functions
Graphs, machine learning
Alternatives
Graph processing, most notably GraphLab, but also Giraph, HAMA, …
In-memory „mini-batches” on top of Hadoop: Spark
Second-order function optimization over a streaming dataflow engine: Flink
A most challenging profession …
STREAMLINE H2020
New initiative on top of Apache Flink
A "use-case complete" framework to
unify batch and stream processing
DFKI (DE)
SICS (SE)
Portugal Telecom (PT)
Internet Memory (FR)
Rovio (FI)
NMusic (PT)
SZTAKI (HU)
B. – Volker Markl (TU Berlin)
STREAMLINE Magic Triangle
Challenge | Present Status | Goal | Action | Leader
Delayed information processing | No up-to-date, timely predictions | Reactivity | Same unified system for data at rest and data in motion | DFKI
Actionable intelligence: lack of appropriate analytics | Poor or non-timely prediction results in user churn, business losses | Prediction quality | Library for combined batch and stream machine learning | SZTAKI
Skills shortage: human latency | Multiple expertise needed for data scientists, expensive to operate | Ease of implementation | High-level declarative language | SICS
Flink: batch and stream on the same execution engine
[Architecture diagram: real-time data streams and event logs (Kafka, RabbitMQ, ...) and historic data (HDFS, JDBC, ...) feed a single engine that supports ETL, graphs, machine learning, relational processing, and low-latency windowing, aggregations, ...]
An engine that puts equal emphasis on streaming and batch
Introduction to Matrix Factorization in Recommenders
Netflix Prize
Predict the rating for a given user-item pair, based on other user-item ratings from the past
Rating prediction task: for a given (u, i) pair, predict r(u, i)
Explicit data: ratings are available
The data is static; there is no temporal evolution
[Illustration: sparse Users × Items rating matrix with a few known ratings (5, 3, 3, 4) and an unknown entry „?” to predict]
Netflix Prize vs. Present
Implicit recommendation
Instead of ratings, we only have info on interactions between users and items, e.g. clicks
Top-K recommendation
Instead of rating prediction, we consider top-k item recommendation for a given user
Online environment can be strongly temporal
E.g. news, hashtags on Twitter rapidly appear and disappear
[Illustration: user u and recommended items i1, i2, i3, i4]
Latent factor models
Items and users are described by unobserved factors
Each item is summarized by a d-dimensional vector $V_i$
Similarly, each user is summarized by $U_u$
Predicted rating for item i by user u: the inner product of $V_i$ and $U_u$,
$\hat r_{ui} = \sum_{k=1}^{d} U_{uk} V_{ik}$
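As a quick illustration of the prediction formula (a minimal sketch of my own; the array shapes and names are hypothetical):

```python
import numpy as np

d = 10                      # number of latent factors
U = np.random.rand(100, d)  # user factors, one row per user
V = np.random.rand(500, d)  # item factors, one row per item

def predict(u, i):
    """Predicted rating of item i by user u: sum_k U[u, k] * V[i, k]."""
    return U[u].dot(V[i])
```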
[Netflix illustration by Yehuda Koren: movies and users in a two-dimensional latent space, one axis running from „geared towards females” to „geared towards males”, the other from „serious” to „escapist”; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean’s 11, Sense and Sensibility, and users Gus and Dave]
General overview of MF approaches
Approximate user preferences, e.g. by
$\hat r_{u,i} = p_u^T q_i$
Objective function (error function), e.g. optimize for RMSE with regularization:
$L = \sum_{(u,i) \in Train} \left(\hat r_{u,i} - r_{u,i}\right)^2 + \lambda_U \sum_{u=1}^{S_U} \|P_u\|^2 + \lambda_I \sum_{i=1}^{S_I} \|Q_i\|^2$
Learning method, e.g.
Stochastic gradient descent (SGD)
Alternating Least Squares (ALS)
Why not SVD?
Most of the matrix values are unknown
Typically <1% filled
[Illustration: sparse rating matrix factorized into P and Q, with user factor vector $p_u$ and item factor vector $q_i$]
(S)GD
Matrix factorization algorithm
Randomly initialize U and V
While U × V does not approximate the known values of M well enough:
Choose a known value of M
Adjust the values of the corresponding row of U and column of V to improve the approximation
Upon seeing a rating $M_{ui}$, optimize for the squared error $(\sum_k U_{uk} V_{ik} - M_{ui})^2$
Example: for the row (3, 2) of U, the column (1, 3) of V and the rating 4, the error is 3·1 + 2·3 − 4 = 5
What is a good adjustment step?
1. Adjustment proportional to the error: let it be ε times the error
2. Take into account how much a value contributes to the error
For the selected row, with ε = 0.1: the value 3 is multiplied by 1, so it is adjusted by ε·5·1 = 0.5; the value 2 is multiplied by 3, so it is adjusted by ε·5·3 = 1.5
For the selected column respectively: adjustments ε·5·3 = 1.5 and ε·5·2 = 1.0
Updated row: (2.5, 0.5); updated column: (−0.5, 2.0)
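A minimal sketch reproducing the numbers above (my own illustration, not the authors' code):

```python
import numpy as np

eps = 0.1                     # learning rate from the slide

u_row = np.array([3.0, 2.0])  # selected row of U
v_col = np.array([1.0, 3.0])  # corresponding column of V
rating = 4.0                  # the known value of M

error = u_row.dot(v_col) - rating    # 3*1 + 2*3 - 4 = 5
u_new = u_row - eps * error * v_col  # [3 - 0.5, 2 - 1.5] = [2.5, 0.5]
v_new = v_col - eps * error * u_row  # [1 - 1.5, 3 - 1.0] = [-0.5, 2.0]
print(u_new, v_new)
```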
Gradient Descent Summary
We want to minimize RMSE
Same as minimizing MSE
The minimum is where the derivatives are zero, because the error surface is quadratic
SGD optimization, with prediction $\hat r_{ui} = \sum_{k=1}^{K} p_{uk} q_{ki}$:
$\mathrm{RMSE}(R^{test}) = \sqrt{\frac{1}{|R^{test}|} \sum_{(u,i) \in R^{test}} \left(r_{ui} - \hat r_{ui}\right)^2}$
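A small sketch of this evaluation (my own illustration; it assumes P is a users × K and Q an items × K array):

```python
import numpy as np

def rmse(test_ratings, P, Q):
    """RMSE over (u, i, r) triples: square root of the mean squared
    error of the predictions P[u] . Q[i]."""
    errors = [(r - P[u].dot(Q[i])) ** 2 for u, i, r in test_ratings]
    return np.sqrt(np.mean(errors))
```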
BRISMF model
Biased Regularized Incremental Simultaneous Matrix Factorization
Applies regularization to prevent overfitting
Uses bias values to further decrease RMSE
Model:
$\hat r_{ui} = b_u + c_i + \sum_{k=1}^{K} p_{uk} q_{ki}$
BRISMF Learning
Loss function:
$L = \frac{1}{2} \sum_{(u,i) \in R^{train}} \left(r_{ui} - b_u - c_i - \sum_{k=1}^{K} p_{uk} q_{ki}\right)^2 + \frac{\lambda}{2} \left(\sum_u \|p_u\|^2 + \sum_i \|q_i\|^2 + \sum_u b_u^2 + \sum_i c_i^2\right)$
SGD update rules, with error $e_{ui} = r_{ui} - \hat r_{ui}$ and learning rate $\eta$:
$p_{uk} \leftarrow p_{uk} + \eta (e_{ui} q_{ki} - \lambda p_{uk})$
$q_{ki} \leftarrow q_{ki} + \eta (e_{ui} p_{uk} - \lambda q_{ki})$
$b_u \leftarrow b_u + \eta (e_{ui} - \lambda b_u)$
$c_i \leftarrow c_i + \eta (e_{ui} - \lambda c_i)$
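A minimal sketch of one such simultaneous update (my own illustration of the rules above; the single shared λ and η are simplifying assumptions):

```python
import numpy as np

def brismf_step(p_u, q_i, b_u, c_i, r_ui, eta=0.01, lam=0.01):
    """One SGD step of BRISMF on a single rating r_ui.

    Updates the user vector, item vector and both biases
    simultaneously, all from the same prediction error."""
    e = r_ui - (b_u + c_i + p_u.dot(q_i))      # prediction error e_ui
    p_new = p_u + eta * (e * q_i - lam * p_u)  # user factor update
    q_new = q_i + eta * (e * p_u - lam * q_i)  # item factor update (old p_u)
    b_new = b_u + eta * (e - lam * b_u)        # user bias update
    c_new = c_i + eta * (e - lam * c_i)        # item bias update
    return p_new, q_new, b_new, c_new
```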
(i)ALS
Short detour: linear regression
$Ax = b$, a linear equation
$A \in \mathbb{R}^{N \times M}$ and $b \in \mathbb{R}^{N}$ are known
$x \in \mathbb{R}^{M}$ is unknown
Meaning:
Rows of $A$ are the samples
Elements of $b$ are the outputs for each sample
$x$ is a weighting vector
We assume the output is a linear combination of the inputs
Objective function: MSE,
$L = \frac{1}{N} \|b - Ax\|^2 = \frac{1}{N} \sum_{i=1}^{N} \left(b_i - A_i^T x\right)^2$
where $A_i$ denotes the $i$-th row of $A$
Solution of the linear regression
The error function is convex, so its minimum is attained where its derivative is zero
Gradient: $\frac{\partial L}{\partial x} = -\frac{2}{N} A^T (b - Ax)$
Setting it to zero: $A^T (b - Ax) = 0$
$A^T b = A^T A x$
$x = (A^T A)^{-1} A^T b$
When $A$ has full column rank, $A^T A$ is symmetric and positive definite, so its inverse exists
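A quick numerical check of the closed-form solution on synthetic data (a sketch of my own; all names and values are illustrative):

```python
import numpy as np

N, M = 100, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(N, M))          # sample matrix, rows = samples
x_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights
b = A @ x_true                       # outputs, noiseless for clarity

# x = (A^T A)^{-1} A^T b; solve() is preferred over an explicit inverse
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_hat, x_true))    # True
```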
Alternating Least Squares (ALS)
$R \approx \hat R = P^T Q$
Fix one of the matrices, say $P$
Given a fixed $P$, the $i$-th column of $\hat R$ depends only on the $i$-th column of $Q$
Problem to solve: $R_i = P^T Q_i$, a linear regression problem
Error function:
$L = \|R - \hat R\|_{Frob}^2 + \lambda_U \|P\|_{Frob}^2 + \lambda_I \|Q\|_{Frob}^2$
The derivative of $L$ with respect to $Q$ is a linear function of the columns of $Q$, therefore each column of $Q$ can be computed separately
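A compact sketch of the alternation (my own illustration for a fully observed R; real recommender data needs the weighted, implicit variant of the next slide):

```python
import numpy as np

def als(R, k=10, lam=0.1, iters=10):
    """Sketch of ALS: alternate closed-form ridge solves for P and Q
    so that R ~ P.T @ Q (P: k x users, Q: k x items)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.random((k, n_users))
    Q = rng.random((k, n_items))
    I = np.eye(k)
    for _ in range(iters):
        # Fix P; every column of Q solves (P P^T + lam I) Q_i = P R_i
        Q = np.linalg.solve(P @ P.T + lam * I, P @ R)
        # Fix Q; symmetric solve for the columns of P
        P = np.linalg.solve(Q @ Q.T + lam * I, Q @ R.T)
    return P, Q
```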
Implicit ALS – objective function
$L = \sum_{u=1}^{S_U} \sum_{i=1}^{S_I} w_{u,i} \left(\hat r_{u,i} - r_{u,i}\right)^2 + \lambda_U \sum_{u=1}^{S_U} \|P_u\|^2 + \lambda_I \sum_{i=1}^{S_I} \|Q_i\|^2$
Weighted MSE: $w_{u,i}$ is large if $(u, i) \in T$, and $w_0$ otherwise, with $w_0 \ll w_{u,i}$
Typical weights: $w_0 = 1$, $w_{u,i} = 100 \cdot \mathrm{supp}(u, i)$
What does it mean? Create two matrices from the events:
(1) Preference matrix: binary, where 1 represents the presence of an event
(2) Confidence matrix: expresses our certainty about the corresponding values in the preference matrix
Negative feedback is much less certain
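A tiny sketch of the two matrices, assuming the typical weights above ($w_0 = 1$, $w_{u,i} = 100 \cdot supp(u,i)$); the names are illustrative:

```python
import numpy as np

# Event counts supp(u, i): how often user u interacted with item i
counts = np.array([[3, 0, 1],
                   [0, 5, 0]], dtype=float)

preference = (counts > 0).astype(float)                 # binary events
confidence = np.where(counts > 0, 100.0 * counts, 1.0)  # w_ui vs. w0 = 1
```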
Experiments
Dataset: the Last.fm „30M” music listening dataset, crawled by the CrowdRec team
Implicit, timestamped music listening data
Each record contains: [ timestamp, user, artist, album, track, … ]
We investigate artist and track recommendation
We recommend and learn when the user interacts with an item for the first time
We discarded infrequent artists from the dataset
~50,000 users, ~100,000 artists, ~500,000 tracks
First we investigate artist recommendation
Online vs. Batch Matrix Factorization
● In what follows, top K = 100
● Batch has a periodic behavior
● Online strongly outperforms the batch methods
Shuffled Online vs. Batch Matrix Factorization
● Batch outperforms online after shuffling
● Gains of online MF come from non-stationary data
Batch and then Online Matrix Factorization
● Surprisingly, their combination works
● Concept: Users are non-stationary, but artists are kind of static
Sampling Positive Interactions From the Past
Something between batch and online MF (see the sketch below)
While learning online, we randomly sample interactions from the past for the given item
We emphasize the recent past by sampling with a geometric distribution
We only update the item vectors
[Illustration: current event (u, i, t) and earlier events (u1, i, t1), (u2, i, t2) on the same item]
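A minimal sketch of the sampling scheme (my own illustration; the function name and the parameter p are hypothetical):

```python
import numpy as np

def sample_past_interactions(history, n_samples, p=0.3):
    """Sample past events of one item, emphasizing the recent past.

    `history` is a non-empty, time-ordered list of
    (user, item, timestamp) events; positions are drawn from a
    geometric distribution counted backwards from the newest event."""
    offsets = np.random.geometric(p, size=n_samples) - 1  # 0 = most recent
    offsets = np.minimum(offsets, len(history) - 1)       # stay in range
    return [history[-1 - int(o)] for o in offsets]
```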
Track Recommendation
● For track recommendation, sampling is the best
Matrix factorization in a distributed system
MapReduce
Breadth-First Search (BFS)
[Illustration: BFS from a start node, with nodes labeled by their distance 1, 2, 3, 4]
MapReduce BFS
MAP:
For every node n, with its distance D from the start and its list of out-edges:
for each out-neighbor p of n, emit (p, D+1)
REDUCE:
Grouped by p
The new distance is the minimum received, if it is less than the current one (Bellman–Ford algorithm; Dijkstra is more efficient but cannot be parallelized)
Dump all data to disk, restart
Stop if no distance changes in an iteration
We need to join out-edges(p) with the new distance of p. Solution: we also emit (n, out-edges(n)) in the map phase
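A single-machine sketch of one such iteration (an illustration of the idea, not a real Hadoop job; all names are hypothetical):

```python
from collections import defaultdict

# graph: node -> (distance or None if unreached, list of out-neighbors)

def bfs_map(graph):
    for n, (dist, out_edges) in graph.items():
        yield n, ('EDGES', out_edges)        # ship the adjacency list too
        if dist is not None:
            yield n, ('DIST', dist)          # keep the current distance
            for p in out_edges:
                yield p, ('DIST', dist + 1)  # relax every out-edge

def bfs_reduce(mapped):
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    new_graph = {}
    for n, values in grouped.items():
        dists = [v for tag, v in values if tag == 'DIST']
        edges = next((v for tag, v in values if tag == 'EDGES'), [])
        new_graph[n] = (min(dists) if dists else None, edges)
    return new_graph

# One iteration; repeat until no distance changes, as on the slide:
# graph = bfs_reduce(bfs_map(graph))
```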
SGD, ALS implementations in Mahout
A single ALS iteration is easy:
$q_i = (P^T P)^{-1} P^T R_i = (P^T P)^{-1} \sum_{j=1}^{N} P_j^T R_{ij}$
Partition by i
Broadcast $P^T P$, just a k×k matrix
SGD? Updates affect both the user AND the item models
Partitioning by users or by items alone is not sufficient
There are efficient shared memory implementations, but no really nice distributed one
More iterations? Hadoop will write all information to disk; we may re-partition before writing to have it ready for the next iteration
Should we consider this efficient?
Distributed ALS needs new broadcast data primitive
$q_i = (P^T P)^{-1} P^T R_i = (P^T P)^{-1} \sum_{j=1}^{N} P_j^T R_{ij}$
For each nonzero $R_{ij}$ we have an „edge”
We need to emit $(P^T P)^{-1}$, of dimension k²
Join using i as the key, to compute Q
If we have a predefined partitioning, we should not emit the same data for ALL edges from partition x to partition y
With a naive implementation, network communication may become the bottleneck
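A toy sketch of the communication argument (my own illustration; it only counts payloads and sends nothing over a network):

```python
# Count how many k x k payloads cross the network for the nonzero R_ij
# "edges"; `partition` maps a row/column index to its partition id.
def payloads_naive(edges, partition):
    return len(edges)  # the matrix is emitted once per edge

def payloads_broadcast(edges, partition):
    pairs = {(partition[i], partition[j]) for (i, j) in edges}
    return len(pairs)  # once per (source, target) partition pair

# Example: 15 million edges across 10 x 10 partitions drop from
# 15,000,000 payloads to at most 100.
```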
Distributed ALS needs new broadcast data primitive
Generated test data, 15 million ratings (courtesy: Gravity)
[Chart: running time of plain Flink vs. Flink with the broadcast primitive]
Balassi, Palovics, Benczur, Distributed Frameworks for Alternating Least Squares, Large-Scale Recommender Systems 2014
Conclusions and Future Work
We need combinations of batch and online learning:
new users / items appear
old items can become irrelevant
users’ taste can change over time
Future work:
develop a learning algorithm that reflects how fast the user and the item parts of the MF model evolve,
e.g. sample more positive items from the past at the beginning
demonstrate online-batch learning combinations in Apache Flink
find more applications of highly temporal, fast-changing machine learning problems
András Benczúr
http://datamining.sztaki.hu
Credits: Róbert Pálovics, Levente Kocsis, Márton Balassi,
Erzsébet Frigó, Anna Oláh, … (SZTAKI)
Domonkos Tikk (Gravity)
Commercials
Data Analysis Challenges
RecSys 2016 Challenge starts at the beginning of February
Conference: September 15–19 in Boston
Task: recommend job ads to Xing users using location, area, expertise, and past view/like/delete interactions
Organizers
Fabian Abel
Daniel Kohlsdorf
Martha Larson
Róbert Pálovics
Data Analysis Challenges
ECML/PKDD Discovery Challenge 2016
Conf: Riva del Garda, Italy, September 19–23
Task: predict visit to bank branch by card use
Organizers:
CNR Italy
OTP Bank Hungary
SZTAKI
Recommender and Personalization Meetup, free GraphLab training
GraphLab Create training
Alon Palombo and Guy Rapoport from Dato, a Seattle-based startup
When: Thursday, June 9, 2016 2:00 PM to 5:00 PM
Where: MTA SZTAKI Kende u. 13-17., Budapest