Matrix completion with deterministic pattern

A. Shapiro, joint work with Yao Xie and Rui Zhang

School of Industrial and Systems Engineering, Georgia Institute of Technology,
Atlanta, Georgia 30332-0205, USA

Bridging Mathematical Optimization, Information Theory, and Data Science
Princeton, May 2018
Minimum Rank Matrix Completion (MRMC) problem:

min_{Y ∈ R^{n1×n2}} rank(Y) subject to Y_ij = M_ij, (i, j) ∈ Ω,

where Ω ⊂ {1, ..., n1} × {1, ..., n2} is an index set of cardinality m
and M_ij, (i, j) ∈ Ω, are given values. Let M be the n1 × n2 matrix
with entries M_ij at (i, j) ∈ Ω, and all other entries equal to zero.
Minimum Rank Factor Analysis (MRFA) problem:

min_{X ∈ D^p} rank(Σ − X) s.t. Σ − X ⪰ 0,

where Σ ∈ S^p is a positive definite matrix. By S^p we denote the
space of p × p symmetric matrices and by D^p the space of p × p
diagonal matrices.
Generic bounds for the minimal rank
Suppose that specified values of the minimum rank problems are
observed with noise. In the MRFA the covariance matrix Σ is es-
timated by the sample covariance matrix S based on a sample of
N observations. In the MRMC the entries Mij are observed with
random noise. In such settings there are the following generic
lower bounds for the minimal rank.
For the MRFA we have the following lower bound (Shapiro
(1982)), which holds for a.e. Σ ∈ S^p (i.e., for all Σ ∈ S^p except
in a set of Lebesgue measure zero):

rank(Σ − X) ≥ (2p + 1 − √(8p + 1)) / 2, ∀ X ∈ D^p.
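This generic bound is easy to evaluate numerically; a minimal sketch (the function name is ours, not from the slides):

```python
import math

def mrfa_generic_bound(p: int) -> float:
    """Generic lower bound on rank(Sigma - X) over diagonal X, valid for
    almost every p x p matrix Sigma (Shapiro, 1982)."""
    return (2 * p + 1 - math.sqrt(8 * p + 1)) / 2

# For p = 6: (13 - 7) / 2 = 3, so generically the minimal rank is at least 3.
print(mrfa_generic_bound(6))   # 3.0
```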
This is based on the fact that the set

W_r := {A ∈ S^p : rank(A) = r}

of matrices of rank r forms a C∞ smooth manifold, in the linear
space S^p, of dimension

dim(W_r) = p(p + 1)/2 − (p − r)(p − r + 1)/2,

and on the transversality condition. The mapping A(X) := Σ − X,
from D^p into S^p, intersects W_r transversally if for every X ∈ D^p
either A(X) ∉ W_r, or A(X) ∈ W_r and

D^p + T_{W_r}(A(X)) = S^p.
The above generic lower bound means that

p + dim(W_r) ≥ dim(S^p),

i.e., that p ≥ (p − r)(p − r + 1)/2.
For the MRMC problem we can proceed in a similar way. The set

M_r := {A ∈ R^{n1×n2} : rank(A) = r}

of matrices of rank r ≤ min{n1, n2} forms a smooth manifold, in
the linear space R^{n1×n2}, of dimension

dim(M_r) = r(n1 + n2 − r).
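The dimension formula can be sanity-checked numerically: the tangent space of M_r at a point Y = VW^⊤ is spanned by matrices of the form VA^⊤ + BW^⊤ (a standard fact, not stated on the slides), so the rank of a spanning set recovers dim(M_r). A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 4, 5, 2

# A generic rank-r matrix Y = V W^T.
V = rng.standard_normal((n1, r))
W = rng.standard_normal((n2, r))

# Tangent directions V A^T + B W^T for coordinate matrices A, B.
dirs = []
for a in range(n2):
    for b in range(r):
        A = np.zeros((n2, r)); A[a, b] = 1.0
        dirs.append((V @ A.T).ravel())
for a in range(n1):
    for b in range(r):
        B = np.zeros((n1, r)); B[a, b] = 1.0
        dirs.append((B @ W.T).ravel())

dim = np.linalg.matrix_rank(np.stack(dirs))
print(dim, r * (n1 + n2 - r))   # 14 14
```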
We use the following notation: Ω^c := {1, ..., n1} × {1, ..., n2} \ Ω is
the complement of the index set Ω, and

V_Ω := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω^c}.

This linear space represents the set of matrices that are filled
with zeros at the locations of the unobserved entries. Similarly,

V_{Ω^c} := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω}.
Consider the mapping A_M(X) := M + X, X ∈ V_{Ω^c}. The image
A_M(V_{Ω^c}) defines the space of feasible points of the MRMC
problem. The transversality condition: either A_M(X) ∉ M_r, or
A_M(X) ∈ M_r and

V_{Ω^c} + T_{M_r}(A_M(X)) = R^{n1×n2}.

It follows that generically (i.e., for a.e. values M_ij) the rank r of
a matrix Y ∈ R^{n1×n2} such that Y_ij = M_ij, (i, j) ∈ Ω, should satisfy

r(n1 + n2 − r) ≥ m.

Recall that m = |Ω|.
Equivalently, it holds generically (almost surely) that the minimal
rank satisfies

r ≥ R(n1, n2, m),

where

R(n1, n2, m) := (n1 + n2)/2 − √((n1 + n2)²/4 − m).

In particular, for n1 = n2 = n,

R(n, n, m) = n − √(n² − m).

It also follows that unless R(n1, n2, m) is an integer, almost surely
the set of optimal solutions of the MRMC problem is not a
singleton.
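The bound R(n1, n2, m) is straightforward to compute; a small helper (the function name is ours):

```python
import math

def mrmc_generic_bound(n1: int, n2: int, m: int) -> float:
    """Generic lower bound R(n1, n2, m) on the minimal rank of the MRMC
    problem with m observed entries."""
    return (n1 + n2) / 2 - math.sqrt((n1 + n2) ** 2 / 4 - m)

# n1 = n2 = 6 with all m = 30 off-diagonal entries observed:
# R = 6 - sqrt(6), so generically the minimal rank is at least 4.
print(math.ceil(mrmc_generic_bound(6, 6, 30)))   # 4
```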
For an appropriate index set Ω, of cardinality m, the minimal
rank r* of the MRMC problem can attain any value between the
bounds

R(n1, n2, m) ≤ r* ≤ ⌈√m⌉.

That is, for r < min{n1, n2} consider a data matrix M of the
following form:

M = [ M1  0
      M2  M3 ],

where the three sub-matrices M1, M2, M3 are of the respective
order r × r, (n1 − r) × r and (n1 − r) × (n2 − r).
The MRMC problem can be formulated in the following equivalent
form:

min_{X ∈ V_τ} rank(Ξ + X) subject to Ξ + X ⪰ 0,

where Ξ ∈ S^p, p = n1 + n2, is the symmetric matrix of the form

Ξ = [ 0    M
      M^⊤  0 ],

and

V_τ := {X ∈ S^p : X_ij = 0, (i, j) ∈ τ}.

Indeed, consider Y = VW^⊤, where V and W are matrices of the
respective order n1 × r and n2 × r and common rank r, and take
X := UU^⊤ − Ξ, where

U = [ V
      W ],  i.e.,  X = [ VV^⊤        Y − M
                         (Y − M)^⊤   WW^⊤ ].
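The construction can be verified numerically. A sketch, assuming for simplicity that every entry of Y is observed (so the off-diagonal block Y − M of X vanishes):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 5, 6, 2

V = rng.standard_normal((n1, r))
W = rng.standard_normal((n2, r))
Y = V @ W.T
M = Y.copy()                       # all entries observed

# Xi = [[0, M], [M^T, 0]],  U = [V; W],  X := U U^T - Xi.
Xi = np.block([[np.zeros((n1, n1)), M], [M.T, np.zeros((n2, n2))]])
U = np.vstack([V, W])
X = U @ U.T - Xi

S = Xi + X                         # equals U U^T: PSD of rank r
print(np.linalg.matrix_rank(S))               # 2
print(np.linalg.eigvalsh(S).min() >= -1e-9)   # True
print(np.allclose(X[:n1, n1:], Y - M))        # off-diagonal block of X is Y - M
```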
As a heuristic it was suggested in Fazel (2002) to approximate
the MRMC problem by the following problem:

min_{X ∈ V_τ} tr(X) subject to Ξ + X ⪰ 0.

Minimum Trace Factor Analysis (MTFA) problem (Bentler, 1972):

min_{X ∈ D^p} tr(Σ − X) s.t. Σ − X ⪰ 0.

For any Σ ∈ S^p the MTFA problem has a unique optimal solution.
When the minimal rank is less than the generic lower bound, the
minimum trace approach “often” recovers the exact minimum
rank solution.
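The trace heuristic is tied to nuclear norm minimization: for a fixed off-diagonal block Y, the minimal trace of a PSD matrix [[A, Y], [Y^⊤, B]] equals 2‖Y‖_* (a known fact from Fazel's work, not stated on the slides). A quick numerical check of the SVD-based feasible point attaining this value (this check is ours):

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((5, 6))

# SVD-based feasible point: U = [P diag(s)^{1/2}; Q diag(s)^{1/2}].
P, s, Qt = np.linalg.svd(Y, full_matrices=False)
U = np.vstack([P * np.sqrt(s), Qt.T * np.sqrt(s)])
S = U @ U.T                    # PSD, with off-diagonal block exactly Y

print(np.allclose(S[:5, 5:], Y))              # True
print(np.isclose(np.trace(S), 2 * s.sum()))   # trace = 2 * nuclear norm of Y
```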
Low-Rank Matrix Approximation
Consider the following model of the observed values
M_ij = Y*_ij + N^{−1/2} Δ_ij + ε_ij, (i, j) ∈ Ω,

where Y* is an n1 × n2 matrix of rank r (i.e., Y* ∈ M_r), Δ_ij are
some (deterministic) numbers and ε_ij are mutually independent
random variables such that N^{1/2} ε_ij converge in distribution to
normal with mean zero and variance σ²_ij, (i, j) ∈ Ω. The additional
terms N^{−1/2} Δ_ij represent a possible deviation of the population
values from the “true” model and are often referred to as the
population drift or a sequence of local alternatives. It is assumed
that r ≤ R(n1, n2, m).

It is said that the model is (globally) identified at Y* if Y* is the
unique solution of the respective matrix completion problem.
Recall that

V_Ω := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω^c},
V_{Ω^c} := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω}.

By P_Ω we denote the projection onto the space V_Ω, i.e.,
[P_Ω(Y)]_ij = Y_ij for (i, j) ∈ Ω and [P_Ω(Y)]_ij = 0 for (i, j) ∈ Ω^c.
Note that for Y ∈ R^{n1×n2} and Y′ ∈ V_{Ω^c} it follows that
P_Ω(Y + Y′) = P_Ω(Y).

Definition 1 (Well-posedness condition) We say that a matrix
Y ∈ M_r is well-posed, for the MRMC problem, if P_Ω(Y) = M
and the following condition holds:

V_{Ω^c} ∩ T_{M_r}(Y) = {0}.
[Figure: geometric illustration of the manifold M_r and the well-posedness condition.]
This condition has a simple geometrical interpretation. Suppose
that it does not hold, i.e., there exists a nonzero matrix
H ∈ V_{Ω^c} ∩ T_{M_r}(Y). Recall that P_Ω(Y) = M and note that
P_Ω(Y + tH) = M for any t ∈ R. Then there is a curve Z(t) ∈ M_r
starting at Y and tangential to H, i.e., Z(0) = Y and
‖Y + tH − Z(t)‖ = o(t). Of course, if moreover P_Ω(Z(t)) = M for
all t near 0 ∈ R, then the solution Y is not locally unique. Although
this is not guaranteed, violation of this condition implies that the
solution Y is unstable in the sense that for some matrices Y′ ∈ M_r
close to Y the distance ‖P_Ω(Y′) − M‖ is of order o(‖Y′ − Y‖).
Algebraic condition for well-posedness. The tangent space at
Y ∈ M_r is

T_{M_r}(Y) = {H ∈ R^{n1×n2} : FHG = 0},

where F is an (n1 − r) × n1 matrix of rank n1 − r such that
FY = 0 (referred to as a left side complement of Y) and G is an
n2 × (n2 − r) matrix of rank n2 − r such that YG = 0 (referred
to as a right side complement of Y).

A matrix Y ∈ M_r is well-posed if and only if for any left side
complement F and right side complement G of Y, the column
vectors g_j^⊤ ⊗ f_i, (i, j) ∈ Ω^c, are linearly independent.
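The algebraic condition can be checked directly: form the vectors g_j^⊤ ⊗ f_i for the unobserved positions and test their linear independence. A sketch (the function name and interface are ours):

```python
import numpy as np

def is_well_posed(Y, omega_c, r, tol=1e-8):
    """Check whether the rank-r matrix Y is well-posed for the MRMC problem
    with unobserved index pairs omega_c, via the vectors g_j^T (x) f_i."""
    n1, n2 = Y.shape
    P, _, Qt = np.linalg.svd(Y)
    F = P[:, r:].T                 # left side complement:  F Y = 0
    G = Qt[r:, :].T                # right side complement: Y G = 0
    K = np.stack([np.kron(G[j, :], F[:, i]) for (i, j) in omega_c], axis=1)
    return np.linalg.matrix_rank(K, tol=tol) == len(omega_c)

# Example: a generic rank-2 matrix with the diagonal positions unobserved.
rng = np.random.default_rng(4)
Y = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))
print(is_well_posed(Y, [(0, 0), (1, 1), (2, 2), (3, 3)], r=2))   # True (generically)
```

Note that well-posedness requires (n1 − r)(n2 − r) ≥ |Ω^c|, consistent with the generic bound r(n1 + n2 − r) ≥ m.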
Theorem 1 (Sufficient conditions for local uniqueness) If
Y ∈ M_r is well-posed, then Y is a locally unique solution of the
MRMC problem, i.e., there is a neighborhood V of Y such that
rank(Y′) > rank(Y) for any Y′ ∈ V satisfying P_Ω(Y′) = M, Y′ ≠ Y.
Consider the weighted least squares problem

min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (M_ij − Y_ij)²,

where w_ij := 1/σ̂²_ij, with σ̂²_ij being consistent estimates of σ²_ij
(i.e., σ̂²_ij converge in probability to σ²_ij as N → ∞).
The model

M_ij = Y*_ij + N^{−1/2} Δ_ij + ε_ij, (i, j) ∈ Ω,

is tested by the following statistic:

T_N(r) := N min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (M_ij − Y_ij)².
A sufficient condition for a feasible Y to be a strict local solution
of the MRMC problem is

tr[P_Ω(H)^⊤ P_Ω(H)] > 0, ∀ H ∈ T_{M_r}(Y) \ {0}.

This condition is equivalent to well-posedness of Y.

Theorem 2 (Asymptotic properties of the test statistic) Suppose
that the model is globally identified at Y* ∈ M_r and Y* is
well-posed. Then the test statistic T_N(r) converges in distribution
to noncentral chi-square with degrees of freedom df_r =
m − r(n1 + n2 − r) and noncentrality parameter

δ_r = min_{H ∈ T_{M_r}(Y*)} Σ_{(i,j)∈Ω} σ^{−2}_ij (Δ_ij − H_ij)².

In Factor Analysis this type of result goes back to Steiger,
Shapiro, and Browne (1985).
The noncentrality parameter can be approximated as

δ_r ≈ N min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (Y*_ij + N^{−1/2} Δ_ij − Y_ij)².

That is, the noncentrality parameter is approximately equal to
N times the fit to the “true” model of the alternative population
values Y*_ij + N^{−1/2} Δ_ij under small perturbations of order
O(N^{−1/2}).
The asymptotics of the test statistic T_N(r) depend on r and also
on the cardinality m of the index set Ω. Suppose now that more
observations become available at additional entries of the matrix.
That is, we are now testing the model with a larger index set Ω′,
of cardinality m′, such that Ω ⊂ Ω′. In order to emphasize that
the test statistic also depends on the corresponding index set we
add the index set to the respective notations.

Theorem 3 Consider index sets Ω ⊂ Ω′ of cardinality m = |Ω|
and m′ = |Ω′|. Suppose that the model is globally identified at
Y* ∈ M_r and the well-posedness condition holds at Y* for the
smaller model (and hence for both models). Then the statistic
T_N(r, Ω′) − T_N(r, Ω) converges in distribution to noncentral
chi-square with df_{r,Ω′} − df_{r,Ω} = m′ − m degrees of freedom and
noncentrality parameter δ_{r,Ω′} − δ_{r,Ω}, and T_N(r, Ω′) − T_N(r, Ω) is
asymptotically independent of T_N(r, Ω).
The (weighted) least squares problem is not convex and it may
happen that an optimization algorithm converges to a stationary
point which is not a globally optimal solution. Suppose that
we run a numerical procedure which identifies a matrix Y ∈ M_r
satisfying the (necessary) first order optimality conditions, i.e.,
the algorithm converged to a stationary point. Then if P_Ω(Y) is
sufficiently close to M (i.e., the fit Σ_{(i,j)∈Ω} (Y_ij − M_ij)² is
sufficiently small) and the well-posedness condition holds at Y, then
Y solves the least squares problem at least locally. Unfortunately
it is not clear how to quantify the “sufficiently close” condition.
Uniqueness of the minimal rank solutions
Consider the Minimum Rank Factor Analysis problem. Suppose
that it has a solution

Σ = ΛΛ^⊤ + X,

where Λ is a p × r matrix of rank r and X ∈ D^p is a diagonal
matrix. This solution is unique if for any i ∈ {1, ..., p} the matrix Λ
has two disjoint nonsingular r × r submatrices not containing the
i-th row (this result goes back at least to Anderson and Rubin,
1956). It follows that if r < p/2 and every r × r submatrix of Λ is
nonsingular, then the solution is unique.
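The last condition is easy to test for a given Λ by enumerating the r × r row-submatrices (feasible for small p; this check is our illustration):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
p, r = 5, 2                       # r < p/2

Lam = rng.standard_normal((p, r))

# Every r x r submatrix (every choice of r rows) should be nonsingular.
ok = all(
    abs(np.linalg.det(Lam[list(rows), :])) > 1e-12
    for rows in combinations(range(p), r)
)
print(ok)   # True for a generic Lambda
```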
Example 1 (Wilson and Worcester, 1939) Consider

M = [ 0     0.56  0.16  0.48  0.24  0.64
      0.56  0     0.20  0.66  0.51  0.86
      0.16  0.20  0     0.18  0.07  0.23
      0.48  0.66  0.18  0     0.30  0.72
      0.24  0.51  0.07  0.30  0     0.41
      0.64  0.86  0.23  0.72  0.41  0    ].

The two rank 3 solutions are given by the diagonal matrices

D1 = Diag(0.64, 0.85, 0.06, 0.56, 0.50, 0.93),
D2 = Diag(0.425616, 0.902308, 0.063469, 0.546923, 0.386667, 0.998).
For the MRFA the corresponding generic lower bound for p = 6
is r ≥ 3, and for the MRMC problem the corresponding generic
lower bound for n1 = n2 = 6 is r ≥ 4.
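Both solutions can be verified numerically: with either diagonal, the fourth singular value of M + D is negligible (a check of ours; the entries are as printed on the slide, so a small rounding residue is possible):

```python
import numpy as np

M = np.array([
    [0.00, 0.56, 0.16, 0.48, 0.24, 0.64],
    [0.56, 0.00, 0.20, 0.66, 0.51, 0.86],
    [0.16, 0.20, 0.00, 0.18, 0.07, 0.23],
    [0.48, 0.66, 0.18, 0.00, 0.30, 0.72],
    [0.24, 0.51, 0.07, 0.30, 0.00, 0.41],
    [0.64, 0.86, 0.23, 0.72, 0.41, 0.00],
])
D1 = np.diag([0.64, 0.85, 0.06, 0.56, 0.50, 0.93])
D2 = np.diag([0.425616, 0.902308, 0.063469, 0.546923, 0.386667, 0.998])

for D in (D1, D2):
    s = np.linalg.svd(M + D, compute_uv=False)
    print(s[3] / s[0])   # negligible: the completed matrix has rank 3
```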
Uniqueness of the minimum rank solution is invariant with respect
to permutations of rows and columns of the matrix M. This
motivates the following definition.

Definition 2 We say that the index set Ω is reducible if by
permutations of rows and columns, the matrix M can be represented
in block diagonal form, i.e.,

M = [ M1  0
      0   M2 ].

Otherwise we say that Ω is irreducible.
Theorem 4 The following holds.
(i) If the index set Ω is reducible, then any minimum rank solution
Y, such that Y_ij ≠ 0 for all (i, j) ∈ Ω^c, is not locally (and hence
not globally) unique.
(ii) Suppose that Ω is irreducible, M_ij ≠ 0 for all (i, j) ∈ Ω, and
every row and every column of the matrix M has at least one
observed element M_ij. Then any rank one solution is globally
unique.
The least squares approximation is solved by a soft-thresholded
SVD solver with no nuclear norm regularization. The algorithm
stops when either the relative change in the Frobenius norm
between two successive estimates, ‖Y^(t+1) − Y^(t)‖_F / ‖Y^(t)‖_F, is
less than some tolerance, denoted tol, or the iteration count
exceeds some number, denoted it. Nuclear norm minimization is
solved via TFOCS. The solutions of the least squares approximation
and of nuclear norm minimization are denoted Y^ls and Y^nm,
respectively. In the following, we compare these two methods by
looking at the absolute values of Y^ls_ij − Y*_ij and Y^nm_ij − Y*_ij.

In this experiment, n1 = 40, n2 = 50, r* = 10, m = 1000,
σ = 5, N = 50, and Ω is sampled until the well-posedness condition
is satisfied. Convergence parameters: tol = 10^{−20} and it = 50000.
The least squares approximation recovers the true matrix Y* with
smaller error than nuclear norm minimization.
The picture on the left is the structure of Ω, with Ω denoted by
yellow and Ω^c denoted by blue. The picture in the middle is the
absolute error between the true matrix and the matrix completed
by least squares approximation, |Y^ls_ij − Y*_ij|. The picture on the
right is the absolute error between the true matrix and the matrix
completed by nuclear norm minimization, |Y^nm_ij − Y*_ij|. The true
matrix Y* ∈ R^{40×50}, rank(Y*) = 10, |Ω| = 1000, Δ_ij = 0,
ε_ij ∼ N(0, 5²/50), and the observation matrix is M_ij = Y*_ij + ε_ij,
(i, j) ∈ Ω. For the error, blue denotes small error and yellow
denotes large error.
The Q-Q plot of {T_N^(k)(r)}_{k=1}^{200} against the corresponding
chi-square distribution. In this experiment, n1 = 40, n2 = 50,
r* = 11, m = 1000, σ = 5, N = 400, and Ω is sampled until the
well-posedness condition is satisfied. Convergence parameters:
tol = 10^{−20} and it = 50000. It can be seen that T_N(r) follows
the central chi-square distribution with df_r = m − r(n1 + n2 − r) =
131 degrees of freedom.
References
Shapiro, A., Xie, Y. and Zhang, R., “Matrix completion with
deterministic pattern - a geometric perspective”,
https://arxiv.org/pdf/1802.00047.pdf
Shapiro, A., “Statistical Inference of Semidefinite Programming”,
Mathematical Programming, online
https://link.springer.com/article/10.1007/s10107-018-1250-z