ISIT Tutorial: Information theory and machine learning
Part II

Emmanuel Abbe, Princeton University
Martin Wainwright, UC Berkeley
Inverse problems on graphs

A large variety of machine learning and data-mining problems are about inferring global properties on a collection of agents by observing local noisy interactions of these agents.

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
- ...
Inverse problems on graphs (continued)

In each case: we observe information on the edges of a network that has been generated from hidden attributes on the nodes, and try to infer back these attributes.

This is dual to the graphical model learning problem (previous part).
Inverse problems on graphs (continued)

[Figure: channel coding diagram, bits X1, ..., Xn encoded by a code C into Y1, ..., YN, each passed through a channel W]

What about graph-based codes?

Different: in coding, the code is a design parameter and typically takes specific non-local interactions of the bits (e.g., random, LDPC, polar codes).
(1) What are the relevant types of “codes” and “channels” behind machine learning problems?
(2) What are the fundamental limits for these?

Outline of the talk
1. Community detection and clustering
2. Stochastic block models: fundamental limits and capacity-achieving algorithms
3. Open problems
4. Graphical channels and low-rank matrix recovery
Community detection and clustering
Networks provide local interactions among agents:

social networks: “friendship”
biological networks: “protein interactions”
genome HiC networks: “DNA contacts”
call graphs: “calls”

One often wants to infer global similarity classes.
Tutorial motto: Can one establish a clear line-of-sight as in communications with the Shannon capacity?

[Figure: peak data efficiency vs. SNR for WAN wireless technologies (EV-DO, HSDPA, 802.16), with the Shannon capacity delimiting the infeasible region]
The challenges of community detection

A long-studied and notoriously hard problem:
- what is a good clustering? assortative and disassortative relations, many heuristics...
- how to get a good clustering? computationally hard

→ work with models
The Stochastic Block Model
The stochastic block model SBM(n, p, W)

The DMC of clustering..? (here k = 4)

p = (p1, . . . , pk) ← probability vector = relative sizes of the communities
P = diag(p)
W ← symmetric k × k matrix with entries in [0,1] = probabilities of connecting among communities:

W = ( W11 ... W1k ; ... ; Wk1 ... Wkk )

[Figure: k = 4 communities of sizes np1, ..., np4; intra-community edge probabilities W11, ..., W44 and inter-community probabilities W12, W13, W14, W23, W24, W34]
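As a concrete reference point, here is a minimal numpy sketch of how a graph is drawn from SBM(n, p, W); the function name `sample_sbm` and its signature are ours, not from the slides:

```python
import numpy as np

def sample_sbm(n, p, W, seed=None):
    """Draw (community labels, adjacency matrix) from SBM(n, p, W)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(p), size=n, p=p)      # hidden community of each node
    # each pair (u, v) is connected independently with prob. W[label_u, label_v]
    probs = W[np.ix_(labels, labels)]
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)             # undirected, no self-loops
    return labels, A

labels, A = sample_sbm(200, [0.5, 0.5], np.array([[0.5, 0.1], [0.1, 0.5]]), seed=0)
```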
The (exact) recovery problem

Let X^n = [X_1, . . . , X_n] represent the community variables of the nodes (drawn under p).

Definition. An algorithm X̂^n(·) solves (exact) recovery in the SBM if, for a random graph G under the model, lim_{n→∞} P(X^n = X̂^n(G)) = 1.

We will see weaker recovery requirements later.

Starting point: progress in science often comes from understanding special cases...
SBM with 2 symmetric communities: 2-SBM

p1 = p2 = 1/2, W11 = W22 = p, W12 = q

[Figure: two communities of n/2 nodes each, intra-community edge probability p, inter-community probability q]
Some history for 2-SBM: the recovery problem, P(X̂^n = X^n) → 1

[Figure: timeline 1983-2014, from Holland-Laskey-Leinhardt ’83 through Bui-Chaudhuri-Leighton-Sipser, Boppana, Dyer-Frieze, Snijders-Nowicki, Jerrum-Sorkin, Condon-Karp, Carson-Impagliazzo, McSherry, Bickel-Chen, Rohe-Chatterjee-Yu]

Bui, Chaudhuri, Leighton, Sipser ’84 | maxflow-mincut | p = Ω(1/n), q = o(n^{-1-4/((p+q)n)})
Boppana ’87 | spectral meth. | (p-q)/√(p+q) = Ω(√(log(n)/n))
Dyer, Frieze ’89 | min-cut via degrees | p-q = Ω(1)
Snijders, Nowicki ’97 | EM algo. | p-q = Ω(1)
Jerrum, Sorkin ’98 | Metropolis algo. | p-q = Ω(n^{-1/6+ε})
Condon, Karp ’99 | augmentation algo. | p-q = Ω(n^{-1/2+ε})
Carson, Impagliazzo ’01 | hill-climbing algo. | p-q = Ω(n^{-1/2} log^4(n))
McSherry ’01 | spectral meth. | (p-q)/√p ≥ Ω(√(log(n)/n))
Bickel, Chen ’09 | N-G modularity | (p-q)/√(p+q) = Ω(log(n)/√n)
Rohe, Chatterjee, Yu ’11 | spectral meth. | p-q = Ω(1)

The literature is algorithms-driven... Instead of ‘how’, ask: when can we recover the clusters (information-theoretically)?
Information-theoretic view of clustering (über alles)

Coding: bits X1, ..., Xn are encoded by a code C into Y1, ..., YN, each passed through the BSC W = (1-ε, ε; ε, 1-ε); rate R = n/N; reliable communication iff R < 1 - H(ε).

2-SBM: the community variables X1, ..., Xn are mapped to the N = C(n,2) edge observations Y1, ..., YN through W = (1-p, p; 1-q, q); rate R = 2/n. An unorthodox code!

reliable comm. ⇔ exact recovery, iff ???
Some history for 2-SBM: the recovery problem (continued)

[Figure: same timeline, now annotated with an infeasible region and an efficiently achievable region]

For p = a log(n)/n, q = b log(n)/n:
Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall], and the threshold is efficiently achievable.
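The threshold just stated is a one-line check; a small sketch (function name ours):

```python
import math

def exact_recovery_solvable(a, b):
    """2-SBM with p = a*log(n)/n and q = b*log(n)/n: exact recovery is
    solvable iff (a + b)/2 >= 1 + sqrt(a*b)  [Abbe-Bandeira-Hall]."""
    return (a + b) / 2 >= 1 + math.sqrt(a * b)

# a = 9, b = 1 sits above the threshold (5 >= 4); a = 2, b = 1 sits below
feasible = exact_recovery_solvable(9, 1)
infeasible = exact_recovery_solvable(2, 1)
```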
The detection problem

∃ε > 0 : P( d(X̂^n, X^n)/n < 1/2 - ε ) → 1

[Figure: timeline of detection results: Coja-Oghlan; Decelle-Krzakala-Moore-Zdeborova; Massoulié; Mossel-Neeman-Sly]

For p = a/n, q = b/n:
Detection iff (a - b)² > 2(a + b)   [Mossel-Neeman-Sly]

What about multiple/asymmetric communities? Conjecture: detection changes with 5 or more.
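The detection threshold is equally easy to check numerically (function name ours):

```python
def detection_solvable(a, b):
    """2-SBM with p = a/n and q = b/n: detection is solvable
    iff (a - b)**2 > 2*(a + b)  [Mossel-Neeman-Sly]."""
    return (a - b) ** 2 > 2 * (a + b)

# (5, 1): 16 > 12, detectable; (3, 2): 1 < 10, undetectable
above = detection_solvable(5, 1)
below = detection_solvable(3, 2)
```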
Recovery in the 2-SBM: IT limit

p = a log n / n, q = b log n / n, two communities of size n/2.

What is ML? → min-bisection.

Converse [Abbe-Bandeira-Hall ’14]: if (a+b)/2 - √(ab) < 1 then ML fails w.h.p.
ML fails if two nodes can be swapped to reduce the cut. With B_i the in-community degree and R_i the cross-community degree of node i:
P(B_1 ≤ R_1) = n^{-((a+b)/2-√(ab))+o(1)}
P(∃ i : B_i ≤ R_i) ≍ n P(B_1 ≤ R_1)   (weak correlations)
P(∃ X̂^n : d(X̂^n, X^n)/n ∉ [1/2-ε, 1/2+ε]) → 1
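Min-bisection ML decoding can be sketched by brute force on tiny graphs; this is exponential-time and for illustration only (function name ours):

```python
import itertools
import numpy as np

def min_bisection(A):
    """Brute-force min-bisection: over all balanced partitions, minimize the
    number of crossing edges (ML decoding for the 2-SBM when p > q)."""
    n = A.shape[0]
    best_side, best_cut = None, np.inf
    # fix node 0 on one side to remove the global label-swap symmetry
    for S in itertools.combinations(range(1, n), n // 2 - 1):
        side = np.zeros(n, dtype=bool)
        side[[0, *S]] = True
        cut = A[side][:, ~side].sum()       # edges crossing the partition
        if cut < best_cut:
            best_side, best_cut = side, cut
    return best_side.astype(int), int(best_cut)

# two planted triangles with no crossing edges: the optimal cut value is 0
A = np.kron(np.eye(2, dtype=int), 1 - np.eye(3, dtype=int))
side, cut = min_bisection(A)
```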
Recovery in the 2-SBM: efficient algorithms [Abbe-Bandeira-Hall ’14]

ML-decoding (NP-hard):
  max x^T A x   s.t. x_i = ±1, 1^t x = 0

spectral relaxation:
  max x^T A x   s.t. ‖x‖ = 1, 1^t x = 0

lifting X = xx^t:
  max tr(AX)   s.t. X_ii = 1, 1^t X 1 = 0, X ⪰ 0, rank(X) = 1

SDP: drop the rank constraint.
Recovery in the 2-SBM: efficient algorithms [Abbe-Bandeira-Hall ’14]

Theorem. The SDP solves recovery if 2 L_SBM + 11^t + I_n ⪰ 0, where L_SBM = D_{G+} - D_{G-} - A.

Note that the SDP can be expensive...
→ Analyze the spectral norm of a random matrix:
  [Abbe-Bandeira-Hall ’14]: Bernstein, slightly loose
  [Hajek-Wu-Xu ’15]: Seginer bound
  [Bandeira, Bandeira-Van Handel ’15]: tight bound
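The spectral relaxation above can be sketched in a few lines of numpy. This is illustrative only (the slide's guarantee is for the SDP, and practical spectral methods normalize or trim high-degree nodes); the function name and the use of the expected adjacency matrix in the example are ours:

```python
import numpy as np

def spectral_partition(A):
    """Basic spectral method for the 2-SBM: split nodes by the sign of the
    eigenvector of the second-largest eigenvalue of the adjacency matrix."""
    eigvals, eigvecs = np.linalg.eigh(A)      # eigh sorts eigenvalues ascending
    # the top eigenvector tracks degrees; the second one separates communities
    return (eigvecs[:, -2] > 0).astype(int)

# on the expected adjacency of a 2-SBM with p = 0.9, q = 0.1, the split is exact
B = np.kron(np.array([[0.9, 0.1], [0.1, 0.9]]), np.ones((3, 3)))
labels = spectral_partition(B)
```

The recovered labels are correct up to the global swap of the two community names.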
The general SBM

SBM(n, p, W): communities of sizes np_1, ..., np_k, connectivity matrix W = (W_ij).

Quiz: If a node is in community i, how many neighbors does it have in expectation in community j?
1. np_j    2. np_j W_ij    3. np_i W_ij    4. 7i

Answer: np_j W_ij; collecting these gives the “degree profile matrix” nPW.
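The quiz answer, written out entrywise (function name ours; we return the matrix with entry (i, j) = n p_j W_ij, which is the slide's nPW up to transpose for symmetric W):

```python
import numpy as np

def degree_profile_matrix(n, p, W):
    """Entry (i, j): expected number of neighbors in community j for a node
    in community i, i.e. n * p_j * W_ij."""
    return n * np.asarray(W, float) * np.asarray(p, float)[None, :]

# community 0 node: 100 * 0.5 * 0.4 = 20 in-community, 100 * 0.5 * 0.1 = 5 across
D = degree_profile_matrix(100, [0.5, 0.5], [[0.4, 0.1], [0.1, 0.4]])
```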
Back to the information-theoretic view of clustering

Coding (BSC): reliable comm. iff R < 1 - H(ε).
2-SBM: reliable comm. iff 1 < (a+b)/2 - √(ab).

General channel W: reliable comm. iff R < max_p I(p, W), a KL-divergence.
General SBM: reliable comm. iff 1 < J(p, W) ???
Main results [Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

J(p, Q) := min_{i<j} D+((PQ)_i, (PQ)_j) ≥ 1,

where

D+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} ( tμ_ℓ + (1-t)ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} ).

We call D+ the CH-divergence. Writing D_t(μ, ν) for the expression inside the max:
- D_{1/2}(μ, ν) = (1/2) ‖√μ - √ν‖²₂ is the Hellinger divergence (distance)
- -log max_t Σ_i μ_i^t ν_i^{1-t} is the Chernoff divergence
- D_t is an f-divergence Σ_i ν_i f(μ_i/ν_i) with f(x) = 1 - t + tx - x^t

For the 2-SBM this reduces to (1/2)(√a - √b)² ≥ 1   [Abbe-Bandeira-Hall ’14].
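The CH-divergence defined above is a one-dimensional maximization and is easy to evaluate on a grid; a sketch assuming strictly positive profiles (function name ours):

```python
import numpy as np

def ch_divergence(mu, nu, grid=10001):
    """D+(mu, nu) = max_{t in [0,1]} sum_l (t*mu_l + (1-t)*nu_l
    - mu_l**t * nu_l**(1-t)), evaluated on a grid over t."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    t = np.linspace(0.0, 1.0, grid)[:, None]
    vals = (t * mu + (1 - t) * nu - mu**t * nu**(1 - t)).sum(axis=1)
    return vals.max()

# for this symmetric pair the max is at t = 1/2, where D_t is the squared
# Hellinger distance: D+ = 4 - 2*sqrt(3)
mu, nu = np.array([3.0, 1.0]), np.array([1.0, 3.0])
```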
Main results [Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if J(p, Q) := min_{i<j} D+((PQ)_i, (PQ)_j) ≥ 1.

Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES:

Theorem 2. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time.
When can we extract a specific community? [Abbe-Sandon ’15]

Theorem. If community i has a profile (PQ)_i at D+-distance at least 1 from all other profiles (PQ)_j, j ≠ i, then it can be extracted w.h.p.

What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon ’15] (second paper)
Proof techniques and algorithms

A key step: hypothesis testing between degree profiles.
Hypothesis 1: d_v ~ P(log(n)(PQ)_1)   vs.   Hypothesis 2: d_v ~ P(log(n)(PQ)_2)

Theorem [Abbe-Sandon ’15]. For any θ1, θ2 ∈ (R_+ \ {0})^k with θ1 ≠ θ2 and p1, p2 ∈ R_+ \ {0},

Σ_{x ∈ Z_+^k} min( P_{ln(n)θ1}(x) p1, P_{ln(n)θ2}(x) p2 ) = Θ( n^{-D+(θ1,θ2)-o(1)} ),

where P_λ is the multivariate Poisson distribution with mean λ and D+ is the CH-divergence.
The degree-profiling algorithm (capacity-achieving) [Abbe-Sandon ’15]

Plan: put effort in recovering most of the nodes and then finish greedily with local improvements.

How to use this?
(1) Split G into two graphs: G′ carrying the log-degrees and G″ carrying the loglog-degrees.
(2) Run Sphere-comparison on G′ → gets a fraction 1 - o(1) of the nodes (see next).
(3) Now classify each node using G″ and the clustering of G′, via the hypothesis test
Hypothesis 1: d_v ~ P(log(n)(PQ)_1)   vs.   Hypothesis 2: d_v ~ P(log(n)(PQ)_2),
with error probability P_e = n^{-D+((PQ)_1,(PQ)_2)+o(1)}.
How do we get most nodes correctly?
Other recovery requirements

Partial recovery: an algorithm solves partial recovery in the SBM with accuracy c if it produces a clustering which is correct on a fraction c of the nodes with high probability.

- Exact recovery: c = 1
- Almost exact recovery: c = 1 - o(1)
- Weak recovery or detection: c = 1/k + ε for some ε > 0 (for the symmetric k-SBM)

For all the above: what are the “efficient” vs. “information-theoretic” fundamental limits?
What is a good notion of SNR? Note that the SNR should scale if Q scales!

Proposed notion of SNR: |λ_min|² / λ_max over the eigenvalues of PQ.

= (a-b)²/(2(a+b)) for 2 symmetric communities (in- vs. cross-degrees Bin(n/2, a/n) vs. Bin(n/2, b/n))
= (a-b)²/(k(a+(k-1)b)) for k symmetric communities

Partial recovery in SBM(n, p, Q/n)

Theorem (informal) [Abbe-Sandon ’15]. In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges.
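The proposed SNR can be computed directly from the eigenvalues of PQ; a sketch reading λ_min as the eigenvalue smallest in magnitude (function name ours):

```python
import numpy as np

def sbm_snr(p, Q):
    """Proposed SNR = |lambda_min|^2 / lambda_max over eigenvalues of PQ,
    with P = diag(p)."""
    evals = np.real(np.linalg.eigvals(np.diag(p) @ np.asarray(Q, float)))
    return np.abs(evals).min() ** 2 / evals.max()

# 2 symmetric communities, a = 3, b = 1: eigenvalues of PQ are 2 and 1,
# so SNR = 1/2, matching (a-b)^2 / (2(a+b)) = 4/8
snr2 = sbm_snr([0.5, 0.5], [[3, 1], [1, 3]])
# k = 3, a = 4, b = 1: matches (a-b)^2 / (k(a+(k-1)b)) = 9/18
snr3 = sbm_snr([1/3, 1/3, 1/3], [[4, 1, 1], [1, 4, 1], [1, 1, 4]])
```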
A node neighborhood in SBM(n, p, Q/n)

For a node v with σ_v = i: among the communities of sizes np_1, ..., np_k, the expected neighbor counts are p_1 Q_i1, ..., p_k Q_ik, i.e. the profile (PQ)_i.
At depth r, the neighborhood N_r(v) has expected profile ((PQ)^r)_i.
Sphere-comparison [Abbe-Sandon ’15]

Take now two nodes v and v′, with σ_v = i and σ_{v′} = j, and compare them from |N_r(v) ∩ N_{r′}(v′)|.
→ hard to analyze...
Sphere-comparison, decorrelated [Abbe-Sandon ’15]

Subsample G with probability c to get E, and compare v and v′ from
N_{r,r′}[E](v, v′) = number of edges of E crossing between N_r[G\E](v) and N_{r′}[G\E](v′)
≈ N_r[G\E](v) · (cQ/n) N_{r′}[G\E](v′).

Decorrelate: N_r[G\E](v) ≈ ((1-c)PQ)^r e_{σ_v}, so

N_{r,r′}[E](v, v′) ≈ c(1-c)^{r+r′} e_{σ_v} · Q(PQ)^{r+r′} e_{σ_{v′}} / n.

Additional steps:
1. look at several depths → Vandermonde system
2. use “anchor nodes”
A real data example: the political blogs network [Adamic and Glance ’05]

1222 blogs (left- and right-leaning); edge = hyperlink between blogs.
Our algorithm misclassifies ~60/1222 nodes: we recover 95% of the nodes correctly. The CH-divergence is close to 1.
Some open problems in community detection

I. The SBM:
a. Recovery
- Growing number of communities? Sub-linear communities?
- [Abbe-Sandon ’15] should extend to k = o(log(n))
- [Chen-Xu ’14]: k, p, q scale with n polynomially
b. Detection and broadcasting on trees

[Mossel-Neeman-Sly ’13] Converse for detection in the 2-SBM, p = a/n, q = b/n:
the local neighborhood of a node is a Galton-Watson tree with Poisson((a+b)/2) offspring, and the community labels are broadcast from the root-bit X_r through a BSC(b/(a+b)).
If (a+b)/2 ≤ 1 the tree dies w.p. 1; if (a+b)/2 > 1 the tree survives w.p. > 0.

Unorthodox broadcasting problem: when can we detect the root-bit?
If and only if c(1-2ε)² > 1   [Evans-Kenyon-Peres-Schulman ’00].
With c = (a+b)/2 and ε = b/(a+b): c(1-2ε)² > 1 ⟺ (a-b)²/(2(a+b)) > 1.
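The algebraic identity closing the loop, c(1-2ε)² = (a-b)²/(2(a+b)) under this substitution, can be checked numerically (function name ours):

```python
def kesten_stigum_check(a, b):
    """Poisson((a+b)/2) tree with BSC(b/(a+b)) broadcasting: root detection
    iff c*(1 - 2*eps)**2 > 1 [Evans-Kenyon-Peres-Schulman]; this quantity
    equals the SBM detection SNR (a-b)^2 / (2(a+b))."""
    c, eps = (a + b) / 2, b / (a + b)
    lhs = c * (1 - 2 * eps) ** 2
    snr = (a - b) ** 2 / (2 * (a + b))
    return lhs, snr

lhs, snr = kesten_stigum_check(5, 1)   # both equal 4/3 > 1: detectable
```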
For k clusters? SNR = (a-b)²/(k(a+(k-1)b))

[Decelle-Krzakala-Zdeborova-Moore ’11]: impossible / possible / efficient as the SNR crosses c_k and 1:

Conjecture. For the symmetric k-SBM(n, a, b), there exists c_k s.t.
(1) If SNR < c_k, then detection cannot be solved,
(2) If c_k < SNR < 1, then detection can be solved information-theoretically but not efficiently,
(3) If SNR > 1, then detection can be solved efficiently.
Moreover c_k = 1 for k ∈ {2, 3, 4} and c_k < 1 for k ≥ 5.
c. Partial recovery and the SNR-distortion curve

Conjecture. For the symmetric k-SBM(n, a, b) and α ∈ (1/k, 1), there exist γ_k, δ_k s.t. partial recovery of accuracy α is solvable if and only if SNR > γ_k, and efficiently solvable iff SNR > δ_k.

[Figure: distortion δ = 1 - α vs. SNR; for k = 2 the information-theoretic and efficient curves coincide, a gap is conjectured for k ≥ 5]
II. Other block models

- Censored block models / labelled block models: 2-CBM(n, p, ε) [Abbe-Bandeira-Bracher-Singer ’14]
  [Figure: two communities of n/2 nodes; edge observations carry labels in {0,1}, flipped with probability ε]
  2-CBM studied by [Abbe-Bandeira-Bracher-Singer, Xu-Lelarge-Massoulié, Saad-Krzakala-Lelarge-Zdeborova, Chin-Rao-Vu, Hajek-Wu-Xu]
  Related: “correlation clustering” [Bansal, Blum, Chawla ’04], “LDGM codes” [Kumar, Pakzad, Salavati, Shokrollahi ’12], “labelled block model” [Heimlicher, Lelarge, Massoulié ’12], “soft CSPs” [Abbe, Montanari ’13], “pairwise measurements” [Chen, Suh, Goldsmith ’14 and ’15], “bounded-size correlation clustering” [Puleo, Milenkovic ’14]
- Mixed-membership block models [Airoldi-Blei-Fienberg-Xing ’08]
- Overlapping block models [Abbe-Sandon ’15]: OSBM(n, p, f), where X_u, X_v ~ p i.i.d. on {0,1}^s, f : {0,1}^s × {0,1}^s → [0,1], and (u, v) is connected with prob. f(X_u, X_v)
- Degree-corrected block models [Karrer-Newman ’11]
- Planted community [Deshpande-Montanari ’14, Montanari ’15]
- Hypergraph models

For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps? Efficient algorithms?
Some open problems in community detection

III. Beyond block models:
a. Exchangeable random arrays and graphons: w : [0,1]² → [0,1] with
   P(E_ij = 1 | X_i = x_i, X_j = x_j) = w(x_i, x_j)
b. Graphical channels
c. Low-rank matrix recovery
A general information-theoretic model?

Figure 1: [Left] Given a graphon w : [0, 1]² → [0, 1], we draw i.i.d. samples ui, uj from Uniform[0,1] and assign Gt[i, j] = 1 with probability w(ui, uj), for t = 1, ..., 2T. [Middle] Heat map of a graphon w. [Right] A random graph generated by the graphon shown in the middle. Rows and columns of the graph are ordered by increasing ui, instead of i, for better visualization.
Wolfe [9] studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT [8]. However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours [28, 24].

The proposed approximation procedure requires w to be piecewise Lipschitz. The basic idea is to approximate w by a two-dimensional step function ŵ with diminishing intervals as n increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as the idea of using a two-dimensional step function for approximation is equivalent to using stochastic block models [10, 22, 13, 7, 25]. The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as the consistency is measured w.r.t. the graphon that generates the graphs.
2 Stochastic blockmodel approximation: Procedure
In this section we present the proposed SBA algorithm and discuss its basic properties.
2.1 Assumptions on graphons
We assume that w is piecewise Lipschitz, i.e., there exists a sequence of non-overlapping intervals Ik = [αk−1, αk] defined by 0 = α0 < . . . < αK = 1, and a constant L > 0 such that, for any (x1, y1) and (x2, y2) ∈ Iij = Ii × Ij,
|w(x1, y1)− w(x2, y2)| ≤ L (|x1 − x2|+ |y1 − y2|) .
For generality we assume w to be asymmetric i.e., w(u, v) ̸= w(v, u), so that symmetric graphonscan be considered as a special case. Consequently, a random graph G(n,w) generated by w isdirected, i.e., G[i, j] ̸= G[j, i].
2.2 Similarity of graphon slices
The intuition of the proposed SBA algorithm is that if the graphon is smooth, neighboring cross-sections of the graphon should be similar. In other words, if two labels ui and uj are close i.e.,|ui− uj| ≈ 0, then the difference between the row slices |w(ui, ·)−w(uj , ·)| and the column slices|w(·, ui) − w(·, uj)| should also be small. To measure the similarity between two labels using thegraphon slices, we define the following distance
d_ij = (1/2) [ ∫₀¹ (w(x, u_i) − w(x, u_j))² dx + ∫₀¹ (w(u_i, y) − w(u_j, y))² dy ].   (2)
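The distance in Eq. (2) is straightforward to evaluate numerically for a given graphon; a sketch with a uniform-grid approximation of the integrals (function name and example graphon are ours):

```python
import numpy as np

def slice_distance(w, ui, uj, grid=4000):
    """Numerical evaluation of d_ij in Eq. (2): averaged squared difference
    of the column and row slices of the graphon w at labels ui, uj."""
    x = np.linspace(0.0, 1.0, grid)
    col = np.mean((w(x, ui) - w(x, uj)) ** 2)   # ~ integral over x in [0,1]
    row = np.mean((w(ui, x) - w(uj, x)) ** 2)   # ~ integral over y in [0,1]
    return 0.5 * (col + row)

w = lambda u, v: u * v   # a simple Lipschitz graphon
# identical labels give distance 0; for ui = 0, uj = 1 the exact value is 1/3
d_same = slice_distance(w, 0.3, 0.3)
d_far = slice_distance(w, 0.0, 1.0)
```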
[Lovasz] SBMs can approximate graphons [Choi-Wolfe, Airoldi-Costa-Chan]
Graphical channels [Abbe-Montanari ’13]

A family of channels motivated by inference-on-graphs problems:
• G = (V, E) a k-hypergraph
• Q : X^k → Y a channel (kernel)

For x ∈ X^V and y ∈ Y^E:
P(y | x) = ∏_{e ∈ E(G)} Q(y_e | x[e])

How much information do graphical channels carry?

Theorem. lim_{n→∞} (1/n) I(X^n; G) exists for ER graphs and some kernels.
→ why not always? what is the limit?
Connection: sparse PCA and clustering

SBM and low-rank Gaussian model. Spiked Wigner model:
Y_λ = √(λ/n) X X^t + Z,   Z Gaussian symmetric

If np_n, nq_n → ∞ (large degrees) with λ_n = n(p_n - q_n)²/(2(p_n + q_n)) → λ (finite SNR):

Theorem [Deshpande-Abbe-Montanari ’15].
lim_{n→∞} (1/n) I(X; G(n, p_n, q_n)) = lim_{n→∞} (1/n) I(X; Y_λ)
= λ/4 + λ*²/(4λ) - λ*/2 + I(λ*)   (single-letter),
where λ* solves λ* = λ(1 - MMSE(λ*)) for the scalar channel Y_1(λ) = √λ X_1 + Z_1.
Proof via I-MMSE [Guo-Shamai-Verdú].

• X = (X_1, ..., X_n) i.i.d. Rademacher(1/2) → SBM (compare Y_ij = c X_i X_j + Z_ij)
• X = (X_1, ..., X_n) i.i.d. Bernoulli(ε) → sparse-PCA [Amini-Wainwright ’09, Deshpande-Montanari ’14]: does the same limit hold?
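A minimal numpy sketch of sampling from the spiked Wigner model above (function name and the noise normalization, including the diagonal, are ours):

```python
import numpy as np

def spiked_wigner(x, lam, seed=None):
    """Y = sqrt(lam/n) * x x^T + Z, with Z a symmetric Gaussian (Wigner)
    noise matrix."""
    rng = np.random.default_rng(seed)
    n = len(x)
    Z = rng.normal(size=(n, n))
    Z = (Z + Z.T) / np.sqrt(2)                 # symmetrize the noise
    return np.sqrt(lam / n) * np.outer(x, x) + Z

x = np.sign(np.random.default_rng(0).normal(size=200))   # Rademacher spike
Y = spiked_wigner(x, lam=4.0, seed=1)
```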
Conclusion

Community detection couples naturally with the channel view of information theory, and more specifically with unorthodox versions of:
- graph-based codes
- f-divergences
- broadcasting problems
- I-MMSE
- ...

More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.

Documents related to the tutorial: www.princeton.edu/~eabbe

Questions?