Matrix completion with deterministic pattern

A. Shapiro, joint work with Yao Xie and Rui Zhang

School of Industrial and Systems Engineering, Georgia Institute of Technology,
Atlanta, Georgia 30332-0205, USA

Bridging Mathematical Optimization, Information Theory, and Data Science
Princeton, May 2018
Minimum Rank Matrix Completion (MRMC) problem:

min_{Y ∈ R^{n1×n2}} rank(Y) subject to Y_ij = M_ij, (i, j) ∈ Ω,

where Ω ⊂ {1, ..., n1} × {1, ..., n2} is an index set of cardinality m
and M_ij, (i, j) ∈ Ω, are given values. Let M be the n1 × n2 matrix
with entries M_ij at (i, j) ∈ Ω, and all other entries equal to zero.
Minimum Rank Factor Analysis (MRFA) problem:

min_{X ∈ D^p} rank(Σ − X) s.t. Σ − X ⪰ 0,

where Σ ∈ S^p is a positive definite matrix. By S^p we denote the
space of p × p symmetric matrices and by D^p the space of p × p
diagonal matrices.
Generic bounds for the minimal rank
Suppose that specified values of the minimum rank problems are
observed with noise. In the MRFA the covariance matrix Σ is es-
timated by the sample covariance matrix S based on a sample of
N observations. In the MRMC the entries Mij are observed with
random noise. In such settings there are the following generic
lower bounds for the minimal rank.
For the MRFA we have the following lower bound (Shapiro
(1982)), which holds for a.e. Σ ∈ S^p (i.e., for all Σ ∈ S^p except
in a set of Lebesgue measure zero):

rank(Σ − X) ≥ (2p + 1 − √(8p + 1)) / 2, ∀ X ∈ D^p.
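This generic bound is easy to evaluate numerically; a minimal sketch (the function name is ours, not from the slides):

```python
import math

def mrfa_generic_bound(p: int) -> float:
    """Generic lower bound on rank(Sigma - X) over diagonal X, valid for
    almost every p x p matrix Sigma (Shapiro, 1982)."""
    return (2 * p + 1 - math.sqrt(8 * p + 1)) / 2

# For p = 6: (13 - 7) / 2 = 3, so generically the minimal rank is at least 3.
print(mrfa_generic_bound(6))   # 3.0
```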
This is based on the fact that the set

W_r := {A ∈ S^p : rank(A) = r}

of matrices of rank r forms a C∞ smooth manifold, in the linear
space S^p, of dimension

dim(W_r) = p(p + 1)/2 − (p − r)(p − r + 1)/2,

and on the transversality condition. The mapping A(X) := Σ − X,
from D^p into S^p, intersects W_r transversally if for every X ∈ D^p
either A(X) ∉ W_r, or A(X) ∈ W_r and

D^p + T_{W_r}(A(X)) = S^p.
The above generic lower bound means that

p + dim(W_r) ≥ dim(S^p),

i.e., that p ≥ (p − r)(p − r + 1)/2.
For the MRMC problem we can proceed in a similar way. The set

M_r := {A ∈ R^{n1×n2} : rank(A) = r}

of matrices of rank r ≤ min{n1, n2} forms a smooth manifold, in
the linear space R^{n1×n2}, of dimension

dim(M_r) = r(n1 + n2 − r).
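The dimension formula can be sanity-checked numerically: the tangent space of M_r at a point Y = VW^⊤ is spanned by matrices of the form VA^⊤ + BW^⊤ (a standard fact, not stated on the slides), so the rank of a spanning set recovers dim(M_r). A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 4, 5, 2

# A generic rank-r matrix Y = V W^T.
V = rng.standard_normal((n1, r))
W = rng.standard_normal((n2, r))

# Tangent directions V A^T + B W^T for coordinate matrices A, B.
dirs = []
for a in range(n2):
    for b in range(r):
        A = np.zeros((n2, r)); A[a, b] = 1.0
        dirs.append((V @ A.T).ravel())
for a in range(n1):
    for b in range(r):
        B = np.zeros((n1, r)); B[a, b] = 1.0
        dirs.append((B @ W.T).ravel())

dim = np.linalg.matrix_rank(np.stack(dirs))
print(dim, r * (n1 + n2 - r))   # 14 14
```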
We use the following notation: Ω^c := {1, ..., n1} × {1, ..., n2} \ Ω is
the complement of the index set Ω, and

V_Ω := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω^c}.

This linear space represents the set of matrices that are filled
with zeros at the locations of the unobserved entries. Similarly,

V_{Ω^c} := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω}.
Consider the mapping A_M(X) := M + X, X ∈ V_{Ω^c}. The image
A_M(V_{Ω^c}) defines the space of feasible points of the MRMC
problem. The transversality condition: either A_M(X) ∉ M_r, or
A_M(X) ∈ M_r and

V_{Ω^c} + T_{M_r}(A_M(X)) = R^{n1×n2}.

It follows that generically (i.e., for a.e. values M_ij) the rank r of
a matrix Y ∈ R^{n1×n2} such that Y_ij = M_ij, (i, j) ∈ Ω, should satisfy

r(n1 + n2 − r) ≥ m.

Recall that m = |Ω|.
Equivalently, it holds generically (almost surely) that the minimal
rank satisfies

r ≥ R(n1, n2, m),

where

R(n1, n2, m) := (n1 + n2)/2 − √((n1 + n2)²/4 − m).

In particular, for n1 = n2 = n,

R(n, n, m) = n − √(n² − m).

It also follows that unless R(n1, n2, m) is an integer, almost surely
the set of optimal solutions of the MRMC problem is not a
singleton.
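The bound R(n1, n2, m) is straightforward to compute; a small helper (the function name is ours):

```python
import math

def mrmc_generic_bound(n1: int, n2: int, m: int) -> float:
    """Generic lower bound R(n1, n2, m) on the minimal rank of the MRMC
    problem with m observed entries."""
    return (n1 + n2) / 2 - math.sqrt((n1 + n2) ** 2 / 4 - m)

# n1 = n2 = 6 with all m = 30 off-diagonal entries observed:
# R = 6 - sqrt(6), so generically the minimal rank is at least 4.
print(math.ceil(mrmc_generic_bound(6, 6, 30)))   # 4
```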
For an appropriate index set Ω, of cardinality m, the minimal
rank r* of the MRMC problem can attain any value between the
bounds

R(n1, n2, m) ≤ r* ≤ ⌈√m⌉.

That is, for r < min{n1, n2} consider a data matrix M of the
following form:

M = [ M1  0
      M2  M3 ],

where the three sub-matrices M1, M2, M3 are of the respective
order r × r, (n1 − r) × r and (n1 − r) × (n2 − r).
The MRMC problem can be formulated in the following equivalent
form:

min_{X ∈ V_τ} rank(Ξ + X) subject to Ξ + X ⪰ 0,

where Ξ ∈ S^p, p = n1 + n2, is the symmetric matrix of the form

Ξ = [ 0    M
      M^⊤  0 ],

and

V_τ := {X ∈ S^p : X_ij = 0, (i, j) ∈ τ}.

Indeed, consider Y = VW^⊤, where V and W are matrices of the
respective order n1 × r and n2 × r and common rank r, and take
X := UU^⊤ − Ξ, where

U = [ V
      W ],  i.e.,  X = [ VV^⊤        Y − M
                         (Y − M)^⊤   WW^⊤ ].
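The construction can be verified numerically. A sketch, assuming for simplicity that every entry of Y is observed (so the off-diagonal block Y − M of X vanishes):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 5, 6, 2

V = rng.standard_normal((n1, r))
W = rng.standard_normal((n2, r))
Y = V @ W.T
M = Y.copy()                       # all entries observed

# Xi = [[0, M], [M^T, 0]],  U = [V; W],  X := U U^T - Xi.
Xi = np.block([[np.zeros((n1, n1)), M], [M.T, np.zeros((n2, n2))]])
U = np.vstack([V, W])
X = U @ U.T - Xi

S = Xi + X                         # equals U U^T: PSD of rank r
print(np.linalg.matrix_rank(S))               # 2
print(np.linalg.eigvalsh(S).min() >= -1e-9)   # True
print(np.allclose(X[:n1, n1:], Y - M))        # off-diagonal block of X is Y - M
```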
As a heuristic it was suggested in Fazel (2002) to approximate
the MRMC problem by the following problem:

min_{X ∈ V_τ} tr(X) subject to Ξ + X ⪰ 0.

Minimum Trace Factor Analysis (MTFA) problem (Bentler, 1972):

min_{X ∈ D^p} tr(Σ − X) s.t. Σ − X ⪰ 0.

For any Σ ∈ S^p the MTFA problem has a unique optimal solution.
When the minimal rank is less than the generic lower bound, the
minimum trace approach “often” recovers the exact minimum
rank solution.
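The trace heuristic is tied to nuclear norm minimization: for a fixed off-diagonal block Y, the minimal trace of a PSD matrix [[A, Y], [Y^⊤, B]] equals 2‖Y‖_* (a known fact from Fazel's work, not stated on the slides). A quick numerical check of the SVD-based feasible point attaining this value (this check is ours):

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((5, 6))

# SVD-based feasible point: U = [P diag(s)^{1/2}; Q diag(s)^{1/2}].
P, s, Qt = np.linalg.svd(Y, full_matrices=False)
U = np.vstack([P * np.sqrt(s), Qt.T * np.sqrt(s)])
S = U @ U.T                    # PSD, with off-diagonal block exactly Y

print(np.allclose(S[:5, 5:], Y))              # True
print(np.isclose(np.trace(S), 2 * s.sum()))   # trace = 2 * nuclear norm of Y
```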
Low-Rank Matrix Approximation
Consider the following model of the observed values
M_ij = Y*_ij + N^{−1/2} Δ_ij + ε_ij, (i, j) ∈ Ω,

where Y* is an n1 × n2 matrix of rank r (i.e., Y* ∈ M_r), Δ_ij are
some (deterministic) numbers and ε_ij are mutually independent
random variables such that N^{1/2} ε_ij converge in distribution to
normal with mean zero and variance σ²_ij, (i, j) ∈ Ω. The additional
terms N^{−1/2} Δ_ij represent a possible deviation of the population
values from the “true” model and are often referred to as the
population drift or a sequence of local alternatives. It is assumed
that r ≤ R(n1, n2, m).

It is said that the model is (globally) identified at Y* if Y* is the
unique solution of the respective matrix completion problem.
Recall that

V_Ω := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω^c},
V_{Ω^c} := {Y ∈ R^{n1×n2} : Y_ij = 0, (i, j) ∈ Ω}.

By P_Ω we denote the projection onto the space V_Ω, i.e.,
[P_Ω(Y)]_ij = Y_ij for (i, j) ∈ Ω and [P_Ω(Y)]_ij = 0 for (i, j) ∈ Ω^c.
Note that for Y ∈ R^{n1×n2} and Y′ ∈ V_{Ω^c} it follows that
P_Ω(Y + Y′) = P_Ω(Y).

Definition 1 (Well-posedness condition) We say that a matrix
Y ∈ M_r is well-posed, for the MRMC problem, if P_Ω(Y) = M
and the following condition holds:

V_{Ω^c} ∩ T_{M_r}(Y) = {0}.
[Figure: geometric illustration of the manifold M_r and the well-posedness condition.]
This condition has a simple geometrical interpretation. Suppose
that it does not hold, i.e., there exists a nonzero matrix
H ∈ V_{Ω^c} ∩ T_{M_r}(Y). Recall that P_Ω(Y) = M and note that
P_Ω(Y + tH) = M for any t ∈ R. Then there is a curve Z(t) ∈ M_r
starting at Y and tangential to H, i.e., Z(0) = Y and
‖Y + tH − Z(t)‖ = o(t). Of course, if moreover P_Ω(Z(t)) = M for
all t near 0 ∈ R, then the solution Y is not locally unique. Although
this is not guaranteed, violation of this condition implies that the
solution Y is unstable in the sense that for some matrices Y′ ∈ M_r
close to Y the distance ‖P_Ω(Y′) − M‖ is of order o(‖Y′ − Y‖).
Algebraic condition for well-posedness. The tangent space at
Y ∈ M_r is

T_{M_r}(Y) = {H ∈ R^{n1×n2} : FHG = 0},

where F is an (n1 − r) × n1 matrix of rank n1 − r such that
FY = 0 (referred to as a left side complement of Y) and G is an
n2 × (n2 − r) matrix of rank n2 − r such that YG = 0 (referred
to as a right side complement of Y).

A matrix Y ∈ M_r is well-posed if and only if for any left side
complement F and right side complement G of Y, the column
vectors g_j^⊤ ⊗ f_i, (i, j) ∈ Ω^c, are linearly independent.
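The algebraic condition can be checked directly: form the vectors g_j^⊤ ⊗ f_i for the unobserved positions and test their linear independence. A sketch (the function name and interface are ours):

```python
import numpy as np

def is_well_posed(Y, omega_c, r, tol=1e-8):
    """Check whether the rank-r matrix Y is well-posed for the MRMC problem
    with unobserved index pairs omega_c, via the vectors g_j^T (x) f_i."""
    n1, n2 = Y.shape
    P, _, Qt = np.linalg.svd(Y)
    F = P[:, r:].T                 # left side complement:  F Y = 0
    G = Qt[r:, :].T                # right side complement: Y G = 0
    K = np.stack([np.kron(G[j, :], F[:, i]) for (i, j) in omega_c], axis=1)
    return np.linalg.matrix_rank(K, tol=tol) == len(omega_c)

# Example: a generic rank-2 matrix with the diagonal positions unobserved.
rng = np.random.default_rng(4)
Y = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))
print(is_well_posed(Y, [(0, 0), (1, 1), (2, 2), (3, 3)], r=2))   # True (generically)
```

Note that well-posedness requires (n1 − r)(n2 − r) ≥ |Ω^c|, consistent with the generic bound r(n1 + n2 − r) ≥ m.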
Theorem 1 (Sufficient conditions for local uniqueness) If
Y ∈ M_r is well-posed, then Y is a locally unique solution of the
MRMC problem, i.e., there is a neighborhood V of Y such that
rank(Y′) > rank(Y) for any Y′ ∈ V satisfying P_Ω(Y′) = M, Y′ ≠ Y.
Consider the weighted least squares problem

min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (M_ij − Y_ij)²,

where w_ij := 1/σ̂²_ij, with σ̂²_ij being consistent estimates of σ²_ij
(i.e., σ̂²_ij converge in probability to σ²_ij as N → ∞).
The model

M_ij = Y*_ij + N^{−1/2} Δ_ij + ε_ij, (i, j) ∈ Ω,

is tested by the following statistic:

T_N(r) := N min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (M_ij − Y_ij)².
A sufficient condition for a feasible Y to be a strict local solution
of the MRMC problem is

tr[P_Ω(H)^⊤ P_Ω(H)] > 0, ∀ H ∈ T_{M_r}(Y) \ {0}.

This condition is equivalent to well-posedness of Y.

Theorem 2 (Asymptotic properties of the test statistic) Suppose
that the model is globally identified at Y* ∈ M_r and Y* is
well-posed. Then the test statistic T_N(r) converges in distribution
to noncentral chi-square with degrees of freedom df_r =
m − r(n1 + n2 − r) and noncentrality parameter

δ_r = min_{H ∈ T_{M_r}(Y*)} Σ_{(i,j)∈Ω} σ^{−2}_ij (Δ_ij − H_ij)².

In Factor Analysis this type of result goes back to Steiger,
Shapiro, and Browne (1985).
The noncentrality parameter can be approximated as

δ_r ≈ N min_{Y ∈ M_r} Σ_{(i,j)∈Ω} w_ij (Y*_ij + N^{−1/2} Δ_ij − Y_ij)².

That is, the noncentrality parameter is approximately equal to
N times the fit to the “true” model of the alternative population
values Y*_ij + N^{−1/2} Δ_ij under small perturbations of order
O(N^{−1/2}).
The asymptotics of the test statistic T_N(r) depend on r and also
on the cardinality m of the index set Ω. Suppose now that more
observations become available at additional entries of the matrix.
That is, we are now testing the model with a larger index set Ω′,
of cardinality m′, such that Ω ⊂ Ω′. In order to emphasize that
the test statistic also depends on the corresponding index set we
add the index set to the respective notations.

Theorem 3 Consider index sets Ω ⊂ Ω′ of cardinality m = |Ω|
and m′ = |Ω′|. Suppose that the model is globally identified at
Y* ∈ M_r and the well-posedness condition holds at Y* for the
smaller model (and hence for both models). Then the statistic
T_N(r, Ω′) − T_N(r, Ω) converges in distribution to noncentral
chi-square with df_{r,Ω′} − df_{r,Ω} = m′ − m degrees of freedom and
noncentrality parameter δ_{r,Ω′} − δ_{r,Ω}, and T_N(r, Ω′) − T_N(r, Ω) is
asymptotically independent of T_N(r, Ω).
The (weighted) least squares problem is not convex and it may
happen that an optimization algorithm converges to a stationary
point which is not a globally optimal solution. Suppose that
we run a numerical procedure which identifies a matrix Y ∈ M_r
satisfying the (necessary) first order optimality conditions, i.e.,
the algorithm converged to a stationary point. Then if P_Ω(Y) is
sufficiently close to M (i.e., the fit Σ_{(i,j)∈Ω} (Y_ij − M_ij)² is
sufficiently small) and the well-posedness condition holds at Y, then
Y solves the least squares problem at least locally. Unfortunately
it is not clear how to quantify the “sufficiently close” condition.
Uniqueness of the minimal rank solutions
Consider the Minimum Rank Factor Analysis problem. Suppose
that it has a solution

Σ = ΛΛ^⊤ + X,

where Λ is a p × r matrix of rank r and X ∈ D^p is a diagonal
matrix. This solution is unique if for any i ∈ {1, ..., p} the matrix Λ
has two disjoint nonsingular r × r submatrices not containing the
i-th row (this result goes back at least to Anderson and Rubin,
1956). It follows that if r < p/2 and every r × r submatrix of Λ is
nonsingular, then the solution is unique.
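The last condition is easy to test for a given Λ by enumerating the r × r row-submatrices (feasible for small p; this check is our illustration):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
p, r = 5, 2                       # r < p/2

Lam = rng.standard_normal((p, r))

# Every r x r submatrix (every choice of r rows) should be nonsingular.
ok = all(
    abs(np.linalg.det(Lam[list(rows), :])) > 1e-12
    for rows in combinations(range(p), r)
)
print(ok)   # True for a generic Lambda
```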
Example 1 (Wilson and Worcester, 1939) Consider

M = [ 0     0.56  0.16  0.48  0.24  0.64
      0.56  0     0.20  0.66  0.51  0.86
      0.16  0.20  0     0.18  0.07  0.23
      0.48  0.66  0.18  0     0.30  0.72
      0.24  0.51  0.07  0.30  0     0.41
      0.64  0.86  0.23  0.72  0.41  0    ].

The two rank 3 solutions are given by the diagonal matrices

D1 = Diag(0.64, 0.85, 0.06, 0.56, 0.50, 0.93),
D2 = Diag(0.425616, 0.902308, 0.063469, 0.546923, 0.386667, 0.998).
For the MRFA the corresponding generic lower bound for p = 6
is r ≥ 3, and for the MRMC problem the corresponding generic
lower bound for n1 = n2 = 6 is r ≥ 4.
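Both solutions can be verified numerically: with either diagonal, the fourth singular value of M + D is negligible (a check of ours; the entries are as printed on the slide, so a small rounding residue is possible):

```python
import numpy as np

M = np.array([
    [0.00, 0.56, 0.16, 0.48, 0.24, 0.64],
    [0.56, 0.00, 0.20, 0.66, 0.51, 0.86],
    [0.16, 0.20, 0.00, 0.18, 0.07, 0.23],
    [0.48, 0.66, 0.18, 0.00, 0.30, 0.72],
    [0.24, 0.51, 0.07, 0.30, 0.00, 0.41],
    [0.64, 0.86, 0.23, 0.72, 0.41, 0.00],
])
D1 = np.diag([0.64, 0.85, 0.06, 0.56, 0.50, 0.93])
D2 = np.diag([0.425616, 0.902308, 0.063469, 0.546923, 0.386667, 0.998])

for D in (D1, D2):
    s = np.linalg.svd(M + D, compute_uv=False)
    print(s[3] / s[0])   # negligible: the completed matrix has rank 3
```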
Uniqueness of the minimum rank solution is invariant with respect
to permutations of rows and columns of the matrix M. This
motivates the following definition.

Definition 2 We say that the index set Ω is reducible if by
permutations of rows and columns, the matrix M can be represented
in block diagonal form, i.e.,

M = [ M1  0
      0   M2 ].

Otherwise we say that Ω is irreducible.
Theorem 4 The following holds.
(i) If the index set Ω is reducible, then any minimum rank solution
Y, such that Y_ij ≠ 0 for all (i, j) ∈ Ω^c, is not locally (and hence
not globally) unique.
(ii) Suppose that Ω is irreducible, M_ij ≠ 0 for all (i, j) ∈ Ω, and
every row and every column of the matrix M has at least one
observed element M_ij. Then any rank one solution is globally
unique.
The least squares approximation is solved by a soft-thresholded
SVD solver with no nuclear norm regularization. The algorithm
stops when either the relative change in the Frobenius norm
between two successive estimates, ‖Y^(t+1) − Y^(t)‖_F / ‖Y^(t)‖_F, is
less than some tolerance, denoted tol, or the iteration count
exceeds some number, denoted it. Nuclear norm minimization is
solved via TFOCS. The solutions of the least squares approximation
and of nuclear norm minimization are denoted Y^ls and Y^nm,
respectively. In the following, we compare these two methods by
looking at the absolute values of Y^ls_ij − Y*_ij and Y^nm_ij − Y*_ij.

In this experiment, n1 = 40, n2 = 50, r* = 10, m = 1000,
σ = 5, N = 50, and Ω is sampled until the well-posedness condition
is satisfied. Convergence parameters: tol = 10^{−20} and it = 50000.
The least squares approximation recovers the true matrix Y* with
smaller error than nuclear norm minimization.
The picture on the left is the structure of Ω, with Ω denoted by
yellow and Ω^c denoted by blue. The picture in the middle is the
absolute error between the true matrix and the matrix completed
by least squares approximation, |Y^ls_ij − Y*_ij|. The picture on the
right is the absolute error between the true matrix and the matrix
completed by nuclear norm minimization, |Y^nm_ij − Y*_ij|. The true
matrix Y* ∈ R^{40×50}, rank(Y*) = 10, |Ω| = 1000, Δ_ij = 0,
ε_ij ∼ N(0, 5²/50), and the observation matrix is M_ij = Y*_ij + ε_ij,
(i, j) ∈ Ω. For the error, blue denotes small error and yellow
denotes large error.
The Q-Q plot of {T_N^(k)(r)}_{k=1}^{200} against the corresponding
chi-square distribution. In this experiment, n1 = 40, n2 = 50,
r* = 11, m = 1000, σ = 5, N = 400, and Ω is sampled until the
well-posedness condition is satisfied. Convergence parameters:
tol = 10^{−20} and it = 50000. It can be seen that T_N(r) follows
the central chi-square distribution with df_r = m − r(n1 + n2 − r) =
131 degrees of freedom.
References
Shapiro, A., Xie, Y. and Zhang, R., “Matrix completion with
deterministic pattern - a geometric perspective”,
https://arxiv.org/pdf/1802.00047.pdf
Shapiro, A., “Statistical Inference of Semidefinite Programming”,
Mathematical Programming, online
https://link.springer.com/article/10.1007/s10107-018-1250-z