
Page 1:

Recommender Systems and their Effects

Jakub Mareček, IBM Research - Ireland

TCD, March 20th, 2019

Page 2:

The Agenda

1. “Up the wall game”: Matrix completion with interval uncertainty sets for some entries. Sparse noise on top. Online variants.

2. “Now for something completely different”: What if everyone used the recommenders?

3. “What would the user do without the recommender?”: User modelling, when users interact with recommenders.

Based on joint work with a number of fabulous colleagues, incl. Albert Akhriev (IBM), Jonathan Epperlein (IBM), Andre Fioravanti (UNICAMP), Mark Kozdoba (Technion), Shie Mannor (Technion), Peter Richtarik (Edinburgh/KAUST), Robert Shorten (IBM/UCD), Andrea Simonetto (IBM), Matheus Souza (UNICAMP), Martin Takac (Lehigh), Tigran Tchrakian (IBM), Fabian Wirth (Passau), Jing Xu (Penn/SingTel), and Jia Yuan Yu (Concordia).

Page 3:

Matrix Completion in Recommender Systems

• Let us consider the simplest abstraction of a recommender system:

• There is a matrix, where each row corresponds to one user and each column corresponds to a product or service

• Every user rates only a modest number of products or services, i.e., only some elements of the matrix are known

• Without imposing any further requirements on the matrix, there are infinitely many completions

• The search for the most succinct explanation corresponds to rank minimisation

• Netflix prize: A 2006 challenge with a grand prize of US $1,000,000
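As an aside on the rank-minimisation view, the following is a minimal sketch of a generic nuclear-norm-style completion (soft-impute, i.e., iterative singular-value shrinkage) on a toy ratings matrix. This is a textbook baseline added for illustration, not the algorithm of this talk; the threshold tau and the iteration count are arbitrary assumptions.

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = products; np.nan = unrated.
M = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])
mask = ~np.isnan(M)

X = np.where(mask, M, 0.0)         # initial completion: zeros at unknown entries
tau = 1.0                          # singular-value shrinkage threshold (assumed)
for _ in range(200):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)   # soft-threshold the spectrum: a nuclear-norm proximal step
    X = (U * s) @ Vt
    X[mask] = M[mask]              # keep the observed ratings fixed

print(np.round(X, 2))              # low-rank guesses at the missing ratings
```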

Page 4:

Our Extensions

• The matrix changes over time: $M_k$ at time $k$.

• Instead of some elements $(i, j)$ of $M_k$, there are intervals $[\underline{M}_{k,ij}, \overline{M}_{k,ij}]$, e.g., $[x - \Delta, x + \Delta]$.

• Sparse noise in the observations, block-wise (e.g., RGB in video, sensor readings from one site).

• Upper and lower bounds on all elements, e.g., for a scale $[0, B]$ one has $[\max\{0, x - \Delta\}, \min\{x + \Delta, B\}]$.

Page 5:

Our Extensions

This is based on joint work with Albert Akhriev (IBM), Dimitrios Gunopulos (Athens), Vana Kalogeraki (Athens), Stathis Maroulis (Athens), Peter Richtarik (Edinburgh/KAUST), Andrea Simonetto (IBM), and Martin Takac (Lehigh):

• Matrix Completion under Interval Uncertainty
  European Journal of Operational Research 256(1), 35–43, 2017
  https://arxiv.org/abs/1408.2467

• Low-rank Methods in Event Detection
  https://arxiv.org/abs/1802.03649

• Pursuit of Low-Rank Models of Time-Varying Matrices Robust to Sparse and Measurement Noise
  https://arxiv.org/abs/1809.03550

Page 6:

The Off-line Setting

There exists $R_k \in \mathbb{R}^{r \times nN}$ such that our observations $x_d \in \mathbb{R}^{nN}$ for row $d$ are

$$(x_d)_i = (1_n - I_{i,k}) \circ \left[ (R_k^T c_d)_i + (e_d)_i \right] + I_{i,k} \circ s_i, \quad \text{for block } i, \qquad (1)$$

where

• the vector $c_d \in \mathbb{R}^r$ weighs the rows of matrix $R_k$,

• $e_d \in \mathbb{R}^{nN}$ is the noise vector, where each entry is uniformly distributed between known, fixed $-\Delta$ and $\Delta$,

• $s_i \in \mathbb{R}^n$ is an arbitrary noise vector,

• the Boolean vector $I_i \in \{0, 1\}^n$ has entries that are all ones or all zeros, depending on whether we receive a measurement belonging to our model or not,

• $\circ$ denotes element-wise multiplication.

Page 7:

The Off-line Setting

Assumption

For each $(i, j)$ of $M_k$ there is a finite element-wise upper bound $\overline{M}_{k,ij}$ and a finite element-wise lower bound $\underline{M}_{k,ij}$.

This assumption is satisfied even for missing values at $(i, j)$ whenever the measurements naturally lie in a bounded set, e.g., $[0, 255]$ in many computer-vision applications.

Page 8:

Algorithm 1: MACO
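The algorithm listing itself did not survive the export. As a hedged sketch of the idea behind MACO (alternating updates of the factors C and R, fitted against a reconstruction clipped into the interval uncertainty sets), one might write the following; the ridge term, iteration count, and function names are illustrative assumptions, not the paper's exact listing.

```python
import numpy as np

def maco_sketch(L, U, mask, rank=3, iters=100, lam=1e-3, seed=0):
    """Toy interval matrix completion: seek X = C @ R with
    L[i, j] <= X[i, j] <= U[i, j] on the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    m, n = L.shape
    C = rng.standard_normal((m, rank))
    R = rng.standard_normal((rank, n))
    for _ in range(iters):
        T = C @ R                                     # current reconstruction
        T[mask] = np.clip(T[mask], L[mask], U[mask])  # project into the intervals
        # Alternating ridge-regularised least squares against the target T.
        C = T @ R.T @ np.linalg.inv(R @ R.T + lam * np.eye(rank))
        R = np.linalg.inv(C.T @ C + lam * np.eye(rank)) @ C.T @ T
    return C, R
```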

Page 9:

Algorithm 2: The Sparse Noise on top

Page 10:

Convergence Rate

Theorem

There exists $\tau > 0$ such that Algorithm 1, initialized with the all-zero vector, after at most $T = O(\log \frac{1}{\varepsilon})$ steps has $f(C_T, R_T) \le f^* + \varepsilon$ with probability 1.

Notice that we do not assume $\alpha$-MSC, $\beta$-MSS, and $C$-robust bistability, but rather prove them.

Page 11:

The On-Line Setting

Assumption

The variation of the observation matrix $M_k$ at two subsequent instants $k$ and $k-1$ is such that

$$|f(C_k, R_k; M_k) - f(C_k, R_k; M_{k-1})| \le e,$$

for all instants $k > 0$.

Page 12:

A Bound on the Tracking Error

Theorem

Let the two assumptions above hold. Then Algorithm 2, starting from all-zero matrices, generates a sequence of matrices $\{(C_k, R_k)\}$ for which

$$f(C_k, R_k; M_k) - f(C_k^*, R_k^*; M_k) \le \eta_0 \left( f(C_{k-1}, R_{k-1}; M_{k-1}) - f(C_{k-1}^*, R_{k-1}^*; M_{k-1}) \right) + \eta_0 e, \qquad (2)$$

for some $\eta_0 < 1$. And in the limit,

$$\limsup_{k \to \infty} \; f(C_k, R_k; M_k) - f(C_k^*, R_k^*; M_k) \le \frac{\eta_0 e}{1 - \eta_0} =: E. \qquad (3)$$

Page 13:

Our Evaluation

• A benchmark in image in-painting

• Small Netflix: a 95,526 × 3,561 matrix, completed at rank 2 or 3

• Well-known dataset: 100,198,805 ratings from 480,189 users considering 17,770 products

• Yelp’s Academic Dataset¹, from which we have extracted a 252,898 × 41,958 matrix with 1,125,458 non-zeros, again on the 1–5 scale

• changedetection.net

¹ https://www.yelp.co.uk/academic_dataset

Page 14:

Our Results: A Benchmark in In-painting

Inst./Alg.   SVT      SVP      SoftImpute  LMaFit   ADMiRA   JS       OR1MP    EOR1MP   MACO
Barbara      26.9635  25.2598  25.6073     25.9589  23.3528  23.5322  26.5314  26.4413  23.8015
Camer.       25.6273  25.9444  26.7183     24.8956  26.7645  24.6238  27.8565  27.8283  28.9670
Clown        28.5644  19.0919  26.9788     27.2748  25.7019  25.2690  28.1963  28.2052  29.0057
Couple       23.1765  23.7974  26.1033     25.8252  25.6260  24.4100  27.0707  27.0310  27.1824
Crowd        26.9644  22.2959  25.4135     26.0662  24.0555  18.6562  26.0535  26.0510  26.1705
Girl         29.4688  27.5461  27.7180     27.4164  27.3640  26.1557  30.0878  30.0565  30.4110
Goldhill     28.3097  16.1256  27.1516     22.4485  26.5647  25.9706  28.5646  28.5101  28.6265
Lenna        28.1832  25.4586  26.7022     23.2003  26.2371  24.5056  28.0115  27.9643  28.3581
Man          27.0223  25.3246  25.7912     25.7417  24.5223  23.3060  26.5829  26.5049  26.5990
Peppers      25.7202  26.0223  26.8475     27.3663  25.8934  24.0979  28.0781  28.0723  28.8469

Page 15:

Our Results: Small Netflix

Figure: RMSE per epoch on Small Netflix, for rank 2 and rank 3, each with ∆ = 0 and ∆ = 1.

Page 16:

Our Results: A Well-Known Instance

Figure: Test and training RMSE against elapsed wall-clock time [s] on the well-known instance; curves labelled Test/Train 1, 2, 4, 8, and 16.

Page 17:

Our Results: A Well-Known Instance

Figure: Test and training RMSE against the number of epochs on the well-known instance; curves labelled Test/Train 1, 2, 4, 8, and 16.

Page 18:

Our Results: Kaggle

• We performed 10-fold cross-validation on the training set, using varying rank.

• For ranks 1, 2, 4, 8, 16, 32, and 50, the average error was 1.7958, 1.8284, 1.6464, 1.4590, 1.3395, 1.2702, and 1.2454, respectively.

• This seems to be comparable to the best results from the 2013 Recommender Systems Challenge².

² https://www.kaggle.com/c/yelp-recsys-2013

Page 19:

Our Results: changedetection.net

Page 20:

A Summary of Part 1

• Robust optimisation improves the statistical performance and does not cost much, computationally.

• Present-best statistical performance in matrix completion on an in-painting benchmark.

• In the off-line case, wall-clock run-times on a single computer comparable to those previously achieved on a large cluster.

• In the on-line case, present-best guarantees and a method practical for video data.

• What would happen if everyone received the same recommendations?

Page 21:

The Agenda

1. “Up the wall game”: Matrix completion with interval uncertainty sets for some entries. Sparse noise on top. Online variants.

2. “Now for something completely different”: What if everyone used the recommenders?

3. “What would the user do without the recommender?”: User modelling, when users interact with recommenders.

Based on joint work with a number of fabulous colleagues, incl. Albert Akhriev (IBM), Jonathan Epperlein (IBM), Andre Fioravanti (UNICAMP), Mark Kozdoba (Technion), Shie Mannor (Technion/Ford), Peter Richtarik (Edinburgh/KAUST), Robert Shorten (IBM/UCD), Andrea Simonetto (IBM), Matheus Souza (UNICAMP), Martin Takac (Lehigh), Tigran Tchrakian (IBM), Fabian Wirth (Passau), Jing Xu (Penn/SingTel), and Jia Yuan Yu (Concordia).

Page 22:

Closed-Loop Performance of Recommender Systems

Figure: Block diagram of the closed loop. A central authority broadcasts signals {µt} to Agent 1, Agent 2, …, Agent N (with parameters ω1, ω2, …, ωN); their actions a¹t, a²t, …, aᴺt are multiplexed (mux) into a histogram, i.e., the congestion profile nt; the per-resource costs c¹(n¹t), …, cᴹ(nᴹt) form c(nt), which is fed back through a one-step delay z⁻¹ to produce the next signal st.

Page 23:

Closed-Loop Performance of Recommender Systems

• Consider route recommendations to drivers in the form of public signalling.

• Even if transport authorities provide only information about travel times, the drivers pick their routes based on that information.

• Assume that travel time depends only on the number of concurrent users of the road (degree-4 polynomial link-performance functions in HCM 2010, the “Bureau of Public Roads functions”).

• Stability is sought while some measure of system performance is to be regulated.

• Solutions have to be acceptable to road users, e.g., each user is treated fairly, independent of the initial state.

Page 24:

A Simplified Model

• Transport can be modelled as a discrete-time dynamical system:
  • N agents want to travel from origin O to a destination D
  • A, B, …, denote the alternative routes from O to D
  • travel time on route A is a function $c_A : \mathbb{N} \to \mathbb{R}_+$ of the number of users on the road
  • $n^A_t$ agents pick A at time t
  • the social cost $C(n_t)$ weights the costs of actions by the proportions of agents taking them, i.e., $C(n_t) \triangleq \frac{n^A_t}{N}\, c_A(n^A_t) + \frac{n^B_t}{N}\, c_B(n^B_t) + \ldots$

Figure: An example. Left: $c_A \triangleq 1.2 + x/N$ (dashed line) and $c_B \triangleq 1 + (1.08 - x/N)^{-1/22}$ (solid line), plotted against $n/N$. Right: the social cost $C(n)$.

Page 25:

A Simplified Model

• A central agent has information about the state of the system and picks signals to send to users: $s^i_t := (y^{A,i}_t, y^{B,i}_t)$, where

$$y^{A,i}_t \triangleq c_A(n^A_{t-1}), \qquad y^{B,i}_t \triangleq c_B(n^B_{t-1})$$

(we will call this 0-scalar signalling later).

• Each driver i picks action $a^i_t$,

$$a^i_t = \begin{cases} \arg\min_{X \in \{A, B, \ldots\}} y^{X,i}_{t-1}, & \text{if } t > 1, \\ A, & \text{otherwise}, \end{cases} \qquad (4)$$

i.e., optimises a myopic objective.

Page 26:

A Simplified Model

• Let us consider two roads from O to D, e.g.:
  ▸ a local road, where the travel time increases sharply with congestion,
  ▸ a highway, where the free-flow travel time is higher than for the local road, but does not increase as much with congestion.

• Let us consider a number of agents who start travelling from O to D at the same time.

• If all the agents decide that the travel time on the local road was less than on the highway, all will use the local road.

• One obtains a limit cycle, in which the social cost (travel time, summed across all users) can be arbitrarily higher than in the social optimum; see the simulation sketch below.
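To make the limit cycle concrete, here is a minimal simulation sketch of the two-road story under the 0-scalar signalling of the previous slide, with every agent myopically choosing yesterday's cheaper road; the cost functions are illustrative assumptions, not those of the talk.

```python
N = 1000                                   # number of agents
c_local = lambda n: 1.0 + 3.0 * (n / N)    # local road: fast when empty, congests sharply
c_highway = lambda n: 2.0 + 0.5 * (n / N)  # highway: slower free-flow, congests mildly

n_local = N                                # everyone starts on the local road
for t in range(8):
    y_local, y_highway = c_local(n_local), c_highway(N - n_local)
    print(f"t={t}: n_local={n_local:4d}  cost_local={y_local:.2f}  cost_highway={y_highway:.2f}")
    # 0-scalar signalling: all agents see the same last-step costs,
    # so all of them flip to whichever road was cheaper -- a limit cycle.
    n_local = N if y_local <= y_highway else 0
```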

Page 27:

The Actual Model

• There is a finite set $\{1, 2, \ldots, M\}$ of alternative roads, from which each driver chooses exactly one at every time t.

• The travel time on road m is $c_m(n^m_t)$, where $n^m_t$ denotes the number of drivers choosing m at time t.

• The vector $n_t := [n^1_t \; \cdots \; n^M_t]$ is the congestion profile.

• The central authority knows the history of congestion profiles $\{n_\tau\}_{\tau=1}^{t-1}$.

• The goal is to minimise the aggregate travel time $\sum_{m=1}^M (n^m_t / N)\, c_m(n^m_t)$.

• Two scalars $u^m_t$ and $v^m_t$ are used to describe resource m.

Page 28:

r-Extreme Signalling

• Send an interval per road segment m at each time t:

$$u^m_t := \min_{j = t-r, \ldots, t-1} \; c_m(n^m_j) \qquad (5)$$

$$v^m_t := \max_{j = t-r, \ldots, t-1} \; c_m(n^m_j) \qquad (6)$$

• Each agent has its own myopic policy parametrized by $\omega \in [0, 1]$:

$$\pi_\omega(s_t) := \arg\min_{m = 1, \ldots, M} \; \omega u^m_t + (1 - \omega) v^m_t. \qquad (7)$$
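A small sketch of how the signals (5)–(6) and the policy (7) might be computed; the handling of the first r steps is left out as an assumption.

```python
import numpy as np

def r_extreme_signal(cost_history, r):
    """cost_history: array of shape (t, M) with past costs c_m(n^m_j).
    Returns (u, v): per-road minimum and maximum cost over the last r steps,
    as in (5) and (6)."""
    window = cost_history[-r:]               # costs at j = t-r, ..., t-1
    return window.min(axis=0), window.max(axis=0)

def myopic_policy(u, v, omega):
    """Road chosen by an agent with risk-aversion parameter omega, as in (7)."""
    return int(np.argmin(omega * u + (1.0 - omega) * v))

costs = np.array([[1.9, 2.1], [2.4, 2.0], [1.6, 2.2]])   # 3 steps, M = 2 roads
u, v = r_extreme_signal(costs, r=3)
print(myopic_policy(u, v, omega=0.5))        # a risk-neutral agent compares interval midpoints
```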

Page 29:

Weak Convergence for r-Extreme Signalling

Assumption (i.i.d. µt , “Population Renewal”)

The distribution $\mu_t$ is an i.i.d. sequence of random variables with $P(\mu_t = \eta_k) = d_k$, $0 < d_k < 1$, for all t and k.

Theorem (Marecek et al., IJC 2016)

There exists a constant $k'$ such that, under the assumption above, if the functions $\{c_m : m = 1, \ldots, M\}$ are $\ell$-Lipschitz continuous for $\ell < 1/(Nk')$, then there exists a unique limit random variable Z such that the congestion profile $n_t/N$ converges to Z in distribution as $t \to \infty$.

Page 30:

Exponential Smoothing

• Send an interval signal employing exponential smoothing on the past costs of resource m to obtain $u^m_t$, and a measure $v^m_t$ of their volatility:

$$u^m_t := (1 - q_1)\left( c_m(n^m_{t-1}) + q_1\, c_m(n^m_{t-2}) + \cdots + q_1^{t-1}\, c_m(n^m_1) \right)$$
$$v^m_t := (1 - q_2)\left( \left| c_m(n^m_{t-1}) - u^m_{t-1} \right| + q_2 \left| c_m(n^m_{t-2}) - u^m_{t-2} \right| + \cdots + q_2^{t-1} \left| c_m(n^m_1) - u^m_1 \right| \right) \qquad (8)$$
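In recursive form, (8) amounts to two exponentially weighted moving averages; a sketch might compute them as follows (the initialisation at the first step is an assumption):

```python
def smoothed_signal(costs, q1=0.8, q2=0.9):
    """costs: past costs c_m(n^m_1), ..., c_m(n^m_{t-1}) for one road m.
    Returns the sequences u, v of (8) via the equivalent recursions
    u_t = q1 u_{t-1} + (1 - q1) c_{t-1} and v_t = q2 v_{t-1} + (1 - q2) |c_{t-1} - u_{t-1}|."""
    u, v = [costs[0]], [0.0]                  # assumed initialisation
    for c in costs[1:]:
        v.append(q2 * v[-1] + (1.0 - q2) * abs(c - u[-1]))
        u.append(q1 * u[-1] + (1.0 - q1) * c)
    return u, v

u, v = smoothed_signal([2.0, 2.4, 1.6, 2.2])
print(u[-1], v[-1])    # smoothed cost and its volatility, broadcast for the next step
```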

Page 31:

Population Dynamics

• Each driver has its own myopic policy parametrized by $\omega \in [0, 1]$:

$$\pi_\omega(s_t) := \arg\min_{m = 1, \ldots, M} \; \omega u^m_t + (1 - \omega) v^m_t. \qquad (9)$$

• There is a family of possible distributions of the levels ω of risk aversion, with a finite index set $J = \{1, \ldots, K\}$.

• A Markov chain with K states and transition-probability matrix $P \in [0, 1]^{K \times K}$.

• The probability of appearance of population $\eta_j$, $j \in J$, at iteration $t + 1$ is now given by $P(\mu_{t+1} = \eta_j \mid \mu_t = \eta_i) = p_{ij}$, i.e., the probability of a specific $\eta_j$ depends on what the last observed population was.

Page 32:

Population Dynamics

A simple example:

$$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

Figure: The corresponding state diagram, a cycle over four populations 1 → 2 → 3 → 4 → 1, labelled “morning”, “noon”, “evening”, and “night”.

Page 33:

Weak Convergence for Exponential Smoothing

Theorem (Epperlein, JM, Allerton 2017)

There exists a set of constants $\kappa'_m$, $m = 1, \ldots, M$, such that if the costs $c_m(\cdot)$ are $1/\kappa'_m$-Lipschitz (with respect to the 1-norm) and $q_2 > q_1$, then the signal $s_t$ and the congestion profile $n_t$ converge in distribution as $t \to \infty$.

Page 34:

Population Dynamics II

A more complicated example:

• The probability of the next population being $\eta_j$, given that the current one is $\eta_i$, should depend on “how different” they are.

• Let $\Delta(\cdot, \cdot)$ be a metric on the space of populations, e.g., the Wasserstein metric or a substitution metric.

• The probability $p_{ij}$ should be a decreasing function of $\Delta(\eta_i, \eta_j)$.

• With a parameter $\psi \in (0, 1)$ that reflects the probability of an agent changing its policy, $p_{ij} := \psi^{\Delta(\eta_i, \eta_j)}$.

Page 35:

Weak Convergence for Exponential Smoothing

Theorem (“Asymptotic Stability with Population Dynamics II”)

There exists a set of constants $\kappa'_m$, $m = 1, \ldots, M$, such that if the costs $c_m(\cdot)$ are $1/\kappa'_m$-Lipschitz (with respect to the 1-norm) and $q_2 > q_1$, then the signal $s_t$ and the congestion profile $n_t$ converge in distribution as $t \to \infty$.

Page 36:

Further Pointers

This part is based on joint work with Jonathan Epperlein, Robert Shorten, and Jia Yuan Yu:

• Signalling and Obfuscation for Congestion Control
  https://arxiv.org/abs/1406.7639
  http://dx.doi.org/10.1080/00207179.2015.1033758

• r-Extreme Signalling for Congestion Control
  https://arxiv.org/abs/1404.2458
  http://dx.doi.org/10.1080/00207179.2016.1146968

• Distributional Robustness in Congestion Control
  https://arxiv.org/abs/1705.09152 (ITS)

• Resource Allocation with Population Dynamics
  https://arxiv.org/abs/1703.07308
  http://dx.doi.org/10.1109/ALLERTON.2017.8262886

Page 37:

The Regulation Problem

• A controller C, which represents the central authority with private state $x_c(k) \in \mathbb{R}^{n_c}$, produces a signal $\pi(k) \in \Pi \subseteq \mathbb{R}^{n_\pi}$ at time k.

• $x_i(k)$ is the use of agent i at time k (a random variable).

• $y(k) \stackrel{\text{def}}{=} \sum_{i=1}^N x_i(k)$ is the aggregate resource utilisation at time k (a random variable).

• $\bar{y}(k)$ is the observable output of a filter F applied to $y(k)$.

• r is the desired utilisation of the resource.

• $e(k)$ is the error signal $\bar{y}(k) - r$ utilised by the controller.

Page 38:

The Regulation Problem

Page 39:

The Regulation Problem

With probability 1, we would like to see:

G1 feasibility: for all $k \in \mathbb{N}$,

$$\sum_{i=1}^N x_i(k) = y(k) \le r. \qquad (10)$$

G2 predictability: for each agent i there exists a constant $\bar{r}_i$ such that

$$\lim_{k \to \infty} \frac{1}{k+1} \sum_{j=0}^k x_i(j) = \bar{r}_i, \qquad (11)$$

where this latter limit is independent of initial conditions.

Page 40:

A Generalisation of Agents’ Response

In particular:

• $W_i \in \mathbb{N}$ state-transition maps $w_{ij} : \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$, $j = 1, \ldots, W_i$, and $H_i \in \mathbb{N}$ output maps $h_{il} : \mathbb{R}^{n_i} \to \mathbb{R}$, $l = 1, \ldots, H_i$.

•

$$x_i(k+1) \in \{ w_{ij}(x_i(k)) \mid j = 1, \ldots, W_i \} \qquad (12)$$
$$y_i(k) \in \{ h_{il}(x_i(k)) \mid l = 1, \ldots, H_i \} \qquad (13)$$

where the choice of agent i’s response at time k is governed by probability functions $p_{ij} : \Pi \to [0, 1]$, $j = 1, \ldots, W_i$, respectively $p'_{il} : \Pi \to [0, 1]$, $l = 1, \ldots, H_i$. Specifically, we have for all $k \in \mathbb{N}$ and all $\pi(k) \in \Pi$ that

$$P(x_i(k+1) = w_{ij}(x_i(k))) = p_{ij}(\pi(k)), \qquad (14a)$$
$$P(y_i(k) = h_{il}(x_i(k))) = p'_{il}(\pi(k)), \qquad (14b)$$
$$\sum_{j=1}^{W_i} p_{ij}(\pi) = \sum_{l=1}^{H_i} p'_{il}(\pi) = 1. \qquad (14c)$$

Page 41:

Negative Results

• The convergence of the error $e = r - y$, $\lim_{k \to \infty} e(k) = 0$, is often assumed to be assured by controllers with integral action, such as the proportional–integral (PI) controller:

$$\pi(k) = \pi(k-1) + \kappa \left[ e(k) - \alpha e(k-1) \right], \qquad (15)$$

which means its transfer function from e to π is given by

$$C(z) \stackrel{\text{def}}{=} \frac{\pi(z)}{e(z)} = \kappa\, \frac{1 - \alpha z^{-1}}{1 - z^{-1}}. \qquad (16)$$

• The integral action may be heavily dependent on the controller’s initial state.

• Our results apply to any controller with any sort of integral action, i.e., a pole at z = 1. We also have negative results for switched controllers, etc.

Page 42:

Negative Results: An Example

• Sets $A_i$, r and the coefficients of F are rational; $p_{ij}$ continuous.

• N = 10 agents, whose states $x_i$ are in the set $\{0, 1\}$, with $x_1$ to $x_5$ having ($i = 1, \ldots, 5$)

$$p_{i1}(x_i(k+1) = 1) = 0.02 + \frac{0.95}{1 + \exp(-100(\pi(k) - 5))},$$

whereas the remaining agents ($i = 6, \ldots, 10$) have

$$p_{i1}(x_i(k+1) = 1) = 0.98 - \frac{0.95}{1 + \exp(-100(\pi(k) - 1))}.$$

• r = 5. If the control signal $\pi(k) \gg 5$, then the first five agents are more likely to be active. On the other hand, if $\pi(k) \ll 1$, then the remaining ones are more likely to take the resource.

• PI controller with κ = 0.1 and α = −4.

• Lag controller with κ = 0.1, α = −4.01 and β = 0.99.
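As a rough illustration of the example, one might simulate it as follows; the error convention e(k) = r − y(k), the identity filter, and the initial conditions are assumptions made for this sketch.

```python
import math
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def p_active(i, pi):
    """Probability that agent i takes the resource next, given the broadcast pi."""
    if i < 5:
        return 0.02 + 0.95 * sigmoid(100.0 * (pi - 5.0))
    return 0.98 - 0.95 * sigmoid(100.0 * (pi - 1.0))

rng = np.random.default_rng(0)
N, r, kappa, alpha = 10, 5.0, 0.1, -4.0
x = np.zeros(N)              # agent states in {0, 1}
pi_k, e_prev = 0.0, 0.0      # assumed initial controller state and error
for k in range(1000):
    e = r - x.sum()                               # error signal (identity filter assumed)
    pi_k = pi_k + kappa * (e - alpha * e_prev)    # PI update, cf. (15)
    e_prev = e
    x = (rng.random(N) < np.array([p_active(i, pi_k) for i in range(N)])).astype(float)

print("final aggregate:", int(x.sum()), " signal:", round(pi_k, 3))
```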

Page 43:

Negative Results: A Sample Trajectory

Figure: Filter output for a single simulation.

Page 44:

Negative Results: The Ergodic Aggregate Behaviour

Figure: Average number of active systems. Regulation is observed for the PI and for the lag controllers, within a given precision.

Page 45:

Negative Results: Non-Ergodic for the Individual

Figure: Average trajectory of the first agent for both controllers and for both initial controller states. Predictability is lost for the PI controller.

Page 46:

Negative Results: Non-Ergodic for the Individual

Figure: Average value for x1(1000) for different initial conditions of both controllers.

Page 47:

Negative Results: The Non-Ergodic Signal

Figure: Average value of the broadcast signal π(k) for both controllers and for both initial controller states.

Page 48:

Negative Results Formalised

Theorem

Consider N agents with states $x_i$, $i = 1, \ldots, N$. Assume that there is an upper bound L on the number of different values the agents can attain, i.e., for each i we have $x_i \in A_i = \{a_1, \ldots, a_{L_i}\} \subset \mathbb{R}$ for a given set $A_i$ and $1 \le L_i \le L$. Consider a system where $F : y \mapsto \bar{y}$ is a finite-memory moving-average (FIR) filter. Assume the controller $C_L$ is a linear, marginally stable single-input single-output (SISO) system with a pole $s_1 = e^{qi\pi}$ on the unit circle, where q is a rational number. In addition, let the probability functions $p_{ij} : \mathbb{R} \to [0, 1]$ be continuous for all $i = 1, \ldots, N$, $j = 1, \ldots, M_i$, i.e., if $\pi(k)$ is the output of $C_L$ at time k, then $P(x_i(k+1) = a_j) = p_{ij}(\pi(k))$. Then:

(i) The set $O_F$ of possible output values of the filter F is finite.

(ii) If the real additive group E generated by $\{r - \bar{y} \mid \bar{y} \in O_F\}$ is discrete, then the closed loop cannot be ergodic.

Page 49:

Positive Results I: Assumptions I

• An alternative linear controller:

$$C : \begin{cases} x_c(k+1) = A_c x_c(k) + B_c e(k), \\ \pi(k) = C_c x_c(k) + D_c e(k), \end{cases} \qquad (17)$$

where $x_c : \mathbb{N} \to \mathbb{R}^{n_c}$ is its internal state.

• A linear model for the filter F, e.g., for $y(k) := \sum_{i=1}^N x_i(k)$:

$$F : \begin{cases} x_f(k+1) = A_f x_f(k) + F_1 y(k) + F_2\, \mathbf{y}(k), \\ \bar{y}(k) = C_f x_f(k), \end{cases} \qquad (18)$$

where $\mathbf{y}$ stores the previous M values of y.

Page 50:

Positive Results II: Assumptions II

• That is, $\mathbf{y}$ evolves by $\mathbf{y}(k+1) = J\, y(k) + L\, \mathbf{y}(k)$ with

$$J = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad L = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}. \qquad (19)$$

Page 51:

Positive Results III

Theorem

Consider the system with C and F above. Assume that each agent $i \in \{1, \ldots, N\}$ has state $x_i$ with dynamics governed by the following affine stochastic difference equation:

$$x_i(k+1) = w_{ij}(x_i), \qquad (20)$$

where the affine mapping $w_{ij}$ is chosen at each step of time according to a Dini-continuous probability function $p_{ij}(x_i, \pi(k))$, out of $w_{ij}(x_i) \stackrel{\text{def}}{=} A_i x_i + b_{ij}$, where $A_i$ is a Schur matrix and, for all $i, \pi(k)$, $\sum_j p_{ij}(x_i, \pi(k)) = 1$. In addition, suppose that there exist scalars $\delta_i > 0$ such that $p_{ij}(x_i, \pi) \ge \delta_i > 0$; that is, the probabilities are bounded away from zero. Then, for every stable linear controller C and every stable linear filter F, the feedback loop converges in distribution to a unique invariant measure.

Page 52:

Positive Results IV

Consider a non-linear system with:

$$\begin{cases} x_i(k+1) \in \{ w_{ij}(x_i(k)) \mid j = 1, \ldots, W_i \}, \\ y_i(k) \in \{ h_{il}(x_i(k)) \mid l = 1, \ldots, H_i \}, \end{cases} \qquad (21)$$

$$y(k) = \sum_{i=1}^N y_i(k), \qquad (22)$$

$$F : \begin{cases} x_f(k+1) = w_f(x_f(k), y(k)), \\ \bar{y}(k) = h_f(x_f(k), y(k)), \end{cases} \qquad (23)$$

$$C : \begin{cases} x_c(k+1) = w_c(x_c(k), \bar{y}(k), r), \\ \pi(k) = h_c(x_c(k), \bar{y}(k), r). \end{cases} \qquad (24)$$

Page 53:

Positive Results V

• Let $X_i$, $i = 1, \ldots, N$, $X_C$ and $X_F$ be the state spaces of the agents, the controller and the filter.

• The system evolves on $X := \prod_{i=1}^N X_i \times X_C \times X_F$ as

$$x(k+1) := \begin{pmatrix} (x_i)_{i=1}^N \\ x_f \\ x_c \end{pmatrix}(k+1) \in \{ F_m(x(k)) \mid m \in \mathcal{M} \}. \qquad (25)$$

• Each map $F_m$ is of the form

$$F_m(x(k)) = \begin{pmatrix} (w_{ij_i}(x_i(k)))_{i=1}^N \\ w_f\!\left(x_f(k), \sum_{i=1}^N h_{il_i}(x_i(k))\right) \\ w_c(x_c(k), h_f(x_f(k))) \end{pmatrix}$$

• Maps $F_m$ are indexed by $m$ from (26), chosen with probability $q_m(\pi(k))$:

$$\prod_{i=1}^N \{(i, 1), \ldots, (i, W_i)\} \times \prod_{i=1}^N \{(i, 1), \ldots, (i, H_i)\}, \qquad (26)$$

$$P(x(k+1) = F_m(x(k))) = \left( \prod_{i=1}^N p_{i j_i}(\pi(k)) \right) \left( \prod_{i=1}^N p'_{i l_i}(\pi(k)) \right) =: q_m(\pi(k)).$$

Page 54:

Positive Results VI

Theorem

Assume that each agent $i$ has a state $x_i(k+1) = w_{ij}(x_i(k))$, $y_i(k) = h_{il}(x_i(k))$, where $w_{ij}$ and $h_{il}$ are globally Lipschitz-continuous functions with constants $l_{ij}$, resp. $l'_{il}$, with Dini-continuous probability functions $p_{ij}$, $p'_{il}$ and scalars $\delta, \delta' > 0$ such that $p_{ij}(\pi) \ge \delta > 0$, $p'_{il}(\pi) \ge \delta' > 0$ for all $(i, j)$, and one of the following holds:

(a) “contractivity”: for all $1 \le i \le N$, $1 \le j \le J$, $l_{ij} < 1$ and $l'_{ij} < 1$.

(b) “average contractivity”: for all $1 \le i \le N$, $\sum_{j=1}^J p_{ij}(x_i)\, l_{ij} < 1$, and for all $1 \le i \le N$, $\sum_{j=1}^J p'_{ij}(x_i)\, l'_{ij} < 1$.

(c) “marginal contractivity”: for all $1 \le i \le N$, $1 \le j \le J$, $l_{ij} \le 1$, and, with probability 1, there exist $i, j$ such that $l_{ij} < 1$; notice $p_{ij}(x_i) \ge \delta_i > 0$ by definition. Likewise for $l'_{ij}$.

Then, for every stable linear controller C and every stable linear filter F, the loop has a unique attractive invariant measure.

Page 55:

Positive Results VII

Next:

• Agents’ actions are limited to a finite set.

• The Lipschitz conditions of the previous theorems cannot be satisfied except in trivial cases.

• $X_S := \prod_{i=1}^N A_i$ is finite, and we consider the graph $G = (X_S, E)$, where there is an edge from a tuple $(x_i) \in X_S$ to $(y_i) \in X_S$ if there is a choice of maps $w_{ij}$ such that $(w_{ij}(x_i)) = (y_i)$.

Page 56:

Positive Results VIII

Theorem

Consider the feedback system depicted in Figure 1. Assume that $A_i$ is finite for each i. Assume that each agent $i \in \{1, \ldots, N\}$ has a state governed by the non-linear stochastic difference equations above. Assume we have Dini-continuous probability functions $p_{ij}$, $p'_{il}$ so that the probabilistic laws above are satisfied. Assume furthermore that there are scalars $\delta, \delta' > 0$ such that $p_{ij}(\pi) \ge \delta > 0$, $p'_{il}(\pi) \ge \delta' > 0$ for all $(i, j)$. Then, for every stable linear controller C and every stable linear filter F, the following holds: if the graph $G = (X_S, E)$ is strongly connected, then there exists an invariant measure for the feedback loop. If, in addition, the adjacency matrix of the graph is primitive, then the invariant measure is attractive and the system is ergodic.

Page 57:

Further Pointers

This part is based on joint work with Andre Fioravanti, Robert Shorten, Matheus Souza, and Fabian Wirth:

• On Classical Control and Smart Cities
  https://arxiv.org/abs/1703.07308
  https://dx.doi.org/10.1109/CDC.2017.8263852

• On the Ergodic Control of Ensembles
  https://arxiv.org/pdf/1807.03256

Page 58:

The Agenda

1. “Up the wall game”: Matrix completion with interval uncertainty sets for some entries. Sparse noise on top. Online variants.

2. “Now for something completely different”: What if everyone used the recommenders?

3. “What would the user do without the recommender?”: User modelling, when users interact with recommenders.

Based on joint work with a number of fabulous colleagues, incl. Albert Akhriev (IBM), Jonathan Epperlein (IBM), Andre Fioravanti (UNICAMP), Mark Kozdoba (Technion), Shie Mannor (Technion), Peter Richtarik (Edinburgh/KAUST), Robert Shorten (IBM/UCD), Andrea Simonetto (IBM), Matheus Souza (UNICAMP), Martin Takac (Lehigh), Tigran Tchrakian (IBM), Fabian Wirth (Passau), Jing Xu (Penn/SingTel), and Jia Yuan Yu (Concordia).

Page 59:

“What would the user do without the recommender?”

• If we want to claim that our recommender results in better performance than someone else’s recommender, we need a model of what the user would have done without the recommender.

• Such a “plain vanilla” user model is often not available: users already receive some recommendations.

• For example, consider advanced traveller information systems (ATIS) or satellite navigation systems.

Page 60:

Markov-Modulated Markov Chains

• Consider a Markov chain R with state space [R] and state $r_t \in [R]$, in which the transition probabilities

$$P(r_t = j \mid r_{t-1} = i) = a^R_{ij}(s_t)$$

depend on a latent random variable $s_t$.

• “The Markov chain is modulated by the random variable $s_t$.”

• “Markov-modulated Markov chain” (MMMC): $s_t$ is the state of another Markov chain S with transition matrix $A^S$ and state space [S].

Page 61:

Closed-Loop Markov-Modulated Markov Chains

• The current state of the visible Markov chain R also modulates the transition probabilities in S.

• The closed-loop MMMC (clMMMC) is a tuple $\mu = (A^R(\cdot), A^S(\cdot))$, where $A^R$ and $A^S$ both have pages.

• There is a partition $\Gamma = \{\Gamma_1, \ldots, \Gamma_p\}$ of [R] such that there is a page in $A^S$ for each $\Gamma_i$; $A^S : [p] \to [0, 1]^{S \times S}$, with $P(s_t = j \mid s_{t-1} = i) = a^S_{ij}(\gamma(r_t))$.
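To fix ideas, here is a minimal sketch of sampling a trajectory from a clMMMC; the two-state chains, the pages, and the partition map gamma are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pages of A^R: one transition matrix for R per latent state s in {0, 1}.
A_R = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.3, 0.7], [0.6, 0.4]]])
# Pages of A^S: one transition matrix for S per part gamma(r) of the partition of [R].
A_S = np.array([[[0.95, 0.05], [0.05, 0.95]],
                [[0.50, 0.50], [0.50, 0.50]]])
gamma = lambda r: r            # toy partition: each visible state is its own part

r, s, trajectory = 0, 0, []
for t in range(20):
    r = rng.choice(2, p=A_R[s][r])            # visible chain, modulated by s
    s = rng.choice(2, p=A_S[gamma(r)][s])     # latent chain, modulated by gamma(r)
    trajectory.append(r)
print(trajectory)    # only r is observed; the pages are what estimation must recover
```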

Page 62:

Parameter Estimation

• A parameter-estimation problem: given a sequence of observations, i.e., a realization $(r_1 r_2 \cdots r_T)$ of the process defined by S and R, what are the underlying transition probabilities $A^S(\ell)$, $\ell = 1, \ldots, p$, and $A^R(\ell)$, $\ell = 1, \ldots, S$?

• An EM-type algorithm provides local optima of the maximum likelihood.

• More sophisticated algorithms allow for regret bounds.

• Notice, however, that until recently, no regret bounds (against Kalman filters) were available even for linear dynamical systems.

Page 63:

A Linear Dynamical System

Consider a simple linear dynamical system (G, F, v, W):

$$\phi_t = G \phi_{t-1} + \omega_t \qquad (27)$$
$$Y_t = F' \phi_t + \nu_t, \qquad (28)$$

where

• $Y_t$ are scalar observations,

• $\phi_t \in \mathbb{R}^{n \times 1}$ is the hidden state,

• $G \in \mathbb{R}^{n \times n}$ is the state-transition matrix,

• $F \in \mathbb{R}^{n \times 1}$ is the observation direction,

• $\omega_t$ is the process noise, $\mathcal{N}(0, W)$,

• $\nu_t$ is the observation noise, $\mathcal{N}(0, v)$,

• $\phi_0$ is the initial state, $\mathcal{N}(m_0, C_0)$.

Page 64:

Kalman Filter

• The Kalman filter is a key tool for time-series forecasting and analysis.

• An estimate of the current hidden state, given the observations, for $t \ge 1$:

$$m_t = E(\phi_t \mid Y_0, \ldots, Y_t), \qquad (29)$$

and let $C_t$ be the covariance matrix of $\phi_t$ given $Y_0, \ldots, Y_t$.

• Forecast of the next observation, given the current data:

$$f_{t+1} = E(Y_{t+1} \mid Y_t, \ldots, Y_0) = F' G m_t. \qquad (30)$$
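A minimal numpy sketch of the filter recursions behind (29)–(30), for scalar observations; the toy system matrices are assumptions.

```python
import numpy as np

def kalman_step(m, C, Y, G, F, W, v):
    """One Kalman-filter update: returns the posterior (m_t, C_t) and the
    one-step forecast f_t = F' G m_{t-1}, cf. (30)."""
    a = G @ m                        # prior mean of the state
    R = G @ C @ G.T + W              # prior covariance
    f = float(F.T @ a)               # forecast of Y_t
    Q = float(F.T @ R @ F) + v       # forecast variance
    A = (R @ F) / Q                  # Kalman gain
    m_new = a + A.flatten() * (Y - f)
    C_new = R - Q * np.outer(A, A)   # posterior covariance
    return m_new, C_new, f

# Toy two-dimensional system (illustrative values only).
G = np.array([[0.9, 0.1], [0.0, 0.8]])
F = np.array([[1.0], [0.0]])
W, v = 0.1 * np.eye(2), 0.5
m, C = np.zeros(2), np.eye(2)
for Y in [1.0, 0.7, 1.2, 0.9]:
    m, C, f = kalman_step(m, C, Y, G, F, W, v)
    print(f"forecast {f:+.3f}, observed {Y:+.3f}")
```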

Page 65:

Kalman Filter Unrolled

$$f_{t+1} = \underbrace{F' G A_t Y_t + F' \sum_{j=0}^{s-1} \left[ \left( \prod_{i=0}^{j} Z_{t-i} \right) G A_{t-j-1} Y_{t-j-1} \right]}_{\text{AR}(s+1)} + \underbrace{F' \left( \prod_{i=0}^{s} Z_{t-i} \right) a_{t-s}}_{\text{remainder term}}. \qquad (31)$$

Page 66:

The Results

Theorem

If the covariance matrix of the process noise is non-zero, then there is $\gamma = \gamma(W, v, F, G) < 1$ such that for every $x \in \mathbb{R}^n$,

$$[(I - A \otimes F) U x, (I - A \otimes F) U x] \le \gamma\, [x, x], \qquad (32)$$

where $[x, y] = \langle R x, y \rangle$ is the inner product induced on $\mathbb{R}^n$ by the limit R of $R_t$.

Page 67:

The Results

Theorem (LDS Approximation)

Let $L = L(F, G, v, W)$ be an observable LDS with $W > 0$.

1. For any $\varepsilon > 0$ and any $B_0 > 0$, there are $T_0 > 0$, $s > 0$ and $\theta \in \mathbb{R}^s$ such that for every sequence $Y_t$ with $|Y_t| \le B_0$, and for every $t > T_0$,

$$\left| f_{t+1} - \sum_{i=0}^{s-1} \theta_i Y_{t-i} \right| \le \varepsilon. \qquad (33)$$

2. For any $\varepsilon, \delta > 0$ and any $B_1 > 0$, there are $T_0 > 0$, $s > 0$ and $\theta \in \mathbb{R}^s$ such that for every sequence $Y_t$ with $|Y_{t+1} - Y_t| \le B_1$, and for every $t > T_0$,

$$\left| f_{t+1} - \sum_{i=0}^{s-1} \theta_i Y_{t-i} \right| \le 2 \max\left( \varepsilon, \delta |Y_t| \right). \qquad (34)$$

Page 68:

The Algorithm

1: Input: regression length s, domain bound D; observations $\{Y_t\}_{t=0}^\infty$, given sequentially.
2: Set the learning rate $\eta_t = t^{-1/2}$.
3: Initialize $\theta_s$ arbitrarily in D.
4: for t = s to ∞ do
5:     Predict $y_t = \sum_{i=0}^{s-1} \theta_{t,i} Y_{t-i-1}$
6:     Observe $Y_t$ and compute the loss $\ell_t(\theta_t)$
7:     Update $\theta_{t+1} \leftarrow \pi_D(\theta_t - \eta_t \nabla \ell_t(\theta_t))$
8: end for
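A hedged sketch of this online learner over AR(s) predictors, instantiated with the squared loss and Euclidean projection onto the ball of radius D; both instantiations are assumptions, as the slide does not spell out ℓt or πD.

```python
import numpy as np

def online_ar(Y, s=4, D=10.0):
    """Online gradient descent over AR(s) predictors, cf. the algorithm above."""
    theta = np.zeros(s)                          # start at the centre of the domain
    predictions = []
    for t in range(s, len(Y)):
        window = Y[t - s:t][::-1]                # Y_{t-1}, ..., Y_{t-s}
        predictions.append(theta @ window)       # step 5: predict
        grad = 2.0 * (predictions[-1] - Y[t]) * window   # gradient of squared loss (step 6)
        theta = theta - (t + 1) ** -0.5 * grad   # step 7: gradient step, eta_t = t^(-1/2) ...
        norm = np.linalg.norm(theta)
        if norm > D:
            theta *= D / norm                    # ... followed by projection onto ||theta|| <= D
    return np.array(predictions), theta

Y = np.sin(0.3 * np.arange(200)) + 0.1 * np.random.default_rng(2).standard_normal(200)
preds, theta = online_ar(Y)
print("final AR coefficients:", np.round(theta, 3))
```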

Page 69:

The Regret Bound

Theorem

Let S be a finite family of LDSs such that every $L = L(F, G, v, W) \in S$ is observable and has $W > 0$. Let $B_0$ be given. For any $\varepsilon > 0$, there are s, D, and $C_S$ such that the following holds: for every sequence $Y_t$ with $|Y_t| \le B_0$, if $\theta_t$ is the sequence produced by the algorithm with parameters s and D, then for every $T > 0$,

$$\sum_{t=0}^T \ell_t(\theta_t) - \min_{L \in S} \sum_{t=0}^T \ell(Y_t, f_t(L)) \le C_S + 2(D^2 + B_0^2)\sqrt{T} + \varepsilon T. \qquad (35)$$

Page 70:

Further Pointers

This part was based on joint work with Jonathan Epperlein, Mark Kozdoba, Shie Mannor, Robert Shorten, Tigran Tchrakian, and Jing Xu:

• On-Line Learning of Linear Dynamical Systems: Exponential Forgetting in Kalman Filters
  AAAI 2019, https://arxiv.org/abs/1809.05870

• Parameter Estimation in Gaussian Mixture Models with Malicious Noise, without Balanced Mixing Coefficients
  Allerton 2019, https://arxiv.org/abs/1711.08082

• Recovering Markov Models from Closed-Loop Data
  https://doi.org/10.1016/j.automatica.2019.01.022
  https://arxiv.org/pdf/1706.06359.pdf

• Parameter Estimation for Closed-Loop Markov Modulated Markov Chains with Applications in Recommender Systems
  submitted

Page 71:

The Conclusions

• “Data knots”.

• Closed-loop analyses of the effects of recommender systems.

• Even if the models of decision making are linear (e.g., Markovian), the closed-loop interactions with the recommenders render the parameter estimation non-convex.

• Questions and comments most welcome!

• We are seeking PhD students for internships!

• This work has been supported in part by the EU H2020 project VaVeL [grant agreement 688380].

Page 72:

Dini Continuity from Wikipedia

• Let X be a compact subset of a metric space (such as $\mathbb{R}^n$), and let $f : X \to X$ be a function from X into itself.

• The modulus of continuity of f is $\omega_f(t) = \sup_{d(x,y) \le t} d(f(x), f(y))$.

• The function f is called Dini-continuous if $\int_0^1 \frac{\omega_f(t)}{t}\, dt < \infty$.

• An equivalent condition is that, for any $\theta \in (0, 1)$, $\sum_{i=1}^\infty \omega_f(\theta^i a) < \infty$, where a is the diameter of X.
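As a quick worked example (added here, not from the deck): any Hölder-continuous $f$, with $\omega_f(t) \le C t^\alpha$ for some $\alpha > 0$, is Dini-continuous, since

$$\int_0^1 \frac{\omega_f(t)}{t}\, dt \le C \int_0^1 t^{\alpha - 1}\, dt = \frac{C}{\alpha} < \infty.$$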

Page 73:

Weak Convergence

• Convergence in distribution of x(1), x(2), x(3), … to a random variable X: $\lim_{n \to \infty} F_n(x) = F(x)$ for all x at which F is continuous, where $F_n$ and F are the cumulative distribution functions of x(n) and X.

• Almost sure convergence to a limit: the measure of the set of sample paths which converge is 1 (with respect to a probability measure on the set of sample paths).

• Convergence in probability: for every ε > 0 and δ > 0 there exists a $k_0$ such that for all $k > k_0$ the probability of being further away from the limit than δ is smaller than ε.

• Almost sure convergence implies convergence in distribution and convergence in probability.

• Patrick Billingsley: Convergence of Probability Measures (Wiley Series in Probability and Statistics).

Page 74:

Coupling Arguments I

• The proof relies on a theorem of Hairer, a Fields Medal laureate.

• Coupling arguments provide criteria for the existence of a unique invariant measure, essentially linking the existence of a coupling with the forgetfulness of initial conditions.

• $\Sigma^\infty$ (the “path space”) is the space of trajectories of a Σ-valued Markov chain $\{X(k)\}_{k \in \mathbb{N}}$, i.e., the space of all sequences $(x(0), x(1), x(2), \ldots)$ with $x(k) \in \Sigma$, $k \in \mathbb{N}$.

• A coupling of two measures $P_{\mu_1}, P_{\mu_2} \in M(\Sigma^\infty)$ is a measure on $\Sigma^\infty \times \Sigma^\infty$ whose marginals coincide with $P_{\mu_1}, P_{\mu_2}$.

• The set $C(P_{\mu_1}, P_{\mu_2})$ of couplings of $P_{\mu_1}, P_{\mu_2} \in M(\Sigma^\infty)$ is then defined by

$$\{ \Gamma \in M(\Sigma^\infty \times \Sigma^\infty) : \Pi^{(1)} \Gamma = P_{\mu_1}, \; \Pi^{(2)} \Gamma = P_{\mu_2} \}.$$

Page 75:

Coupling Arguments II

• We say that a coupling Γ is an asymptotic coupling if Γ has full measure on the pairs of convergent sequences. To make this precise, consider the following set, denoted D:

$$D := \left\{ (x_1, x_2) \in \Sigma^\infty \times \Sigma^\infty : \lim_{k \to \infty} \| x_1(k) - x_2(k) \| = 0 \right\};$$

Γ is an asymptotic coupling if $\Gamma(D) = 1$.

Theorem (Hairer et al.)

Let P be a Markov operator admitting two ergodic invariant measures $\mu_1$ and $\mu_2$. The following are equivalent:

(i) $\mu_1 = \mu_2$.

(ii) There exists an asymptotic coupling of $P_{\mu_1}$ and $P_{\mu_2}$.

If no asymptotic coupling of $P_{\mu_1}$ and $P_{\mu_2}$ exists, then $\mu_1$ and $\mu_2$ are distinct.