Compressed Sensing for Learning from Big Data over Networks
Alexander Jung, Dept. of Computer Science, Aalto University
joint work with Ayelet Heimowitz, Yonina C. Eldar (Technion)
14.02.2017
Outline
1 Big Data over Networks
2 Sparse Label Propagation (SLP)
3 Network Nullspace Property
4 Application to Image Processing
5 Wrap Up
The Data Deluge
“We’re drowning in information and starving for knowledge.” (Rutherford D. Rogers)
Big Data Fuels Artificial Intelligence
The availability of vast amounts of training data allows us to train extremely complex models, such as sparse models, deep neural networks, etc.
Andrew Ng’s Rocket Picture
Big Data + Complex Models → Modern AI / Deep Learning
AI Everywhere
Shazam identifies the ear-worm tune you are listening to
spam filters keep your inbox tidy
Google.com has become a personal genie
Shazam - Live Demo
watched Kill Bill recently
fighting scene with a cool background song
the Shazam app dug out the title in seconds!
the song is unrelated to my preferences on Spotify/FB etc.
The Power (Danger) of AI
AI can be used for psychographic profiling
FB knows you better than your own mother!
social network analysis is used for targeted marketing
used by the Trump and Brexit campaigns
A Key Principle
modern AI systems organize big data as networks
Big Data over Networks
datasets often have intrinsic network structure
chip design, internet, bioinformatics, social networks, the universe, material science
cf. L. Lovász, “Large Networks and Graph Limits”
Deep Learning
modern machine learning uses deep neural networks
Graphication of Heterogeneous Data
observe dataset D={z1, . . . , zp} with data points zi
data point zi might be chunk of audio, video or text data
structure data points by some notion of “similarity”
e.g., zi , zj similar if they belong to same user account
represent zi by node i ∈V of empirical graph G = (V, E)
edge {i , j} connects similar data points zi and zj
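As a concrete illustration of this graphication step, here is a minimal Python sketch: it connects data points whose similarity exceeds a threshold, with a Gaussian kernel on Euclidean feature vectors standing in for the application-dependent notion of "similarity" (the kernel, the threshold, and the function name are our own assumptions, not from the slides).

```python
import numpy as np

def build_empirical_graph(data, threshold=0.5):
    """Build the empirical graph G = (V, E): node i represents data point z_i,
    and an edge {i, j} with weight W_ij is added whenever the similarity of
    z_i and z_j exceeds a threshold. Similarity here is a Gaussian kernel on
    Euclidean feature distance (an illustrative assumption)."""
    p = len(data)
    W = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            sim = np.exp(-np.linalg.norm(data[i] - data[j]) ** 2)
            if sim > threshold:
                W[i, j] = W[j, i] = sim
    edges = [(i, j) for i in range(p) for j in range(i + 1, p) if W[i, j] > 0]
    return W, edges
```

The matrix W then plays the role of the edge weights Wi,j of the empirical graph G = (V, E).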
Semi-Supervised Learning (SSL)
consider a data point z, e.g., a FB user
a data point contains several features, e.g., age, nationality, ...
a data point is assigned a label, e.g., “GOOD GUY” vs. “BAD GUY”
GOAL: learn a mapping from features to label
[example data points:
z1: age 33, easily triggered: yes, difficult childhood, from Gotham City; label: BAD GUY
z2: age 70, easily triggered: ??, great childhood, from Austria; label: GOOD GUY]
Graph Signals for SSL
represent single data point zi by node i ∈ V
nodes i ,j for similar zi and zj connected by edge {i , j} ∈ E
similarity between zi and zj quantified by weight Wi ,j
data point zi has label x [i ] (e.g., 1 for “GOOD GUY” or 0 for “BAD GUY”)
entire labelling is a graph signal x [·] : V → R
graph signal x [·] maps node i ∈ V to label x [i ]∈R
Graph Signal Processing generalizes DSP
view discrete-time signals as graph signals over a chain graph
[figure: chain graph · · · - x[−1] - x[0] - x[1] - · · ·]
label x [i ] might be the presence of “clipping” at time i
(greyscale) images are signals over a grid graph
[figure: grid graph with signal value x[i] at each pixel node]
label x [i ] might encode fore-/background
Fast Algorithms on Graphs
GSP theory yields fast algorithms for large-scale graphs
generalizes FFT from chain graph to general graphs
based on product graph structure
For the Cartesian graph product, denoted as G = G1 × G2, the adjacency matrix is

A× = A1 ⊗ I_{N2} + I_{N1} ⊗ A2. (25)

Finally, for the strong product, denoted as G = G1 ⊠ G2, the adjacency matrix is

A⊠ = A1 ⊗ A2 + A1 ⊗ I_{N2} + I_{N1} ⊗ A2. (26)
The strong product can be seen as a combination of the Kronecker and Cartesian products. Since the products (23), (25), and (26) are associative, Kronecker, Cartesian, and strong graph products can be defined for an arbitrary number of graphs. Product graphs arise in different applications, including signal and image processing [32], computational sciences and data mining [33], and computational biology [34]. Their probabilistic counterparts are used in network modeling and generation [35], [36], [37]. Multiple approaches have been proposed for the decomposition and approximation of graphs with product graphs [38], [30], [31], [39].

Product graphs offer a versatile graph model for the representation of complex datasets in multi-level and multi-parameter ways. In traditional DSP, multi-dimensional signals, such as digital images and video, reside on rectangular lattices that are Cartesian products of line graphs. Fig. 2(a) shows a two-dimensional lattice formed by the Cartesian product of two one-dimensional lattices.

Another example of graph signals residing on product graphs is data collected by a sensor network over a period of time. In this case, the graph signal formed by measurements of all sensors at all time steps resides on the product of the sensor network graph with the time series graph. As the example in Fig. 2(b) illustrates, the kth measurement of the nth sensor is indexed by the nth node of the kth copy of the sensor graph (or, equivalently, the kth node of the nth copy of the time series graph). Depending on the choice of product, a measurement of a sensor is related to the measurements collected by this sensor and its neighbors at the same time and previous and following time steps. For instance, the strong product in Fig. 2(b) relates the measurement of the nth sensor at time step k to its measurements at time steps k − 1 and k + 1, as well as to measurements of its neighbors at times k − 1, k, and k + 1.

A social network with multiple communities also may be representable by a graph product. Fig. 2(c) shows an example of a social network that has three communities with similar structures, where individuals from different communities also interact with each other. This social graph may be seen as an approximation of the Cartesian product of the graph that captures the community structure and the graph that captures the interaction between communities.

Other examples where product graphs are potentially useful for data representation include multi-way data arrays that contain elements described by multiple features, parameters, or characteristics, such as publications in citation databases described by their topics, authors, and venues; or internet connections described by their time, location, IP address, port accesses, and other parameters. In this case, the graph factors in (22) represent similarities or dependencies between subsets of characteristics.
Fig. 2. Examples of product graphs indexing various data: a) Digital images reside on rectangular lattices that are Cartesian products of line graphs for rows and columns; b) Measurements of a sensor network are indexed by the strong product of the sensor network graph with the time series graph (edges of the Cartesian product are shown in blue and green, and edges of the Kronecker product are shown in grey; the strong product contains all edges); c) A social network with three similar communities is approximated by a Cartesian product of the community structure graph with the intercommunity communication graph.
Graph products are also used for modeling entire graph families. Kronecker products of scale-free graphs with the same degree distribution are also scale-free and have the same distribution [40], [35]. K- and ϵ-nearest neighbor graphs, which are used in signal processing, communications and machine learning to represent spatial and temporal location of data, such as sensor networks and image pixels, or data similarity structure, can be approximated with graph products, as the examples in Figs. 2(a) and 2(b) suggest. Other graph families, such as trees, are constructed using rooted graph products [41], which are not discussed in this article.
V. SIGNAL PROCESSING ON PRODUCT GRAPHS

In this Section, we discuss how product graphs help “modularize” the computation of filtering and Fourier transform on graphs and improve algorithms, data storage and memory access for large datasets. They lead to graph filtering and Fourier transform implementations suitable for multi-core and clustered platforms with distributed storage by taking advantage of such performance optimization techniques as parallelization and vectorization. The presented results illustrate how product graphs offer a suitable and practical model for constructing
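The product-graph formulas (25) and (26) quoted above are easy to verify numerically; the following numpy sketch builds the Kronecker, Cartesian, and strong products of two small path graphs (the choice of factor graphs is ours, for illustration only).

```python
import numpy as np

# Adjacency matrices of two small factor graphs: the paths P2 and P3.
A1 = np.array([[0, 1], [1, 0]])                    # P2
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])   # P3
N1, N2 = A1.shape[0], A2.shape[0]

A_kron = np.kron(A1, A2)                                    # Kronecker product, eq. (23)
A_cart = np.kron(A1, np.eye(N2)) + np.kron(np.eye(N1), A2)  # Cartesian product, eq. (25)
A_strong = A_kron + A_cart                                  # strong product, eq. (26)
```

The Cartesian product of P2 and P3 is exactly the 2 x 3 grid graph (7 undirected edges), matching the digital-image example of Fig. 2(a).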
Graph Models: Perfect Match for 3 Vs of Big Data
graph models lead to message passing algorithms
message passing algorithms are perfectly scalable
they cope with the volume (distributed computing) and velocity (parallel computing) of big data
“ship computation to data”, not vice versa!
graph models also allow processing of heterogeneous data
Smoothness Hypothesis of SSL
consider graph signal x [i ] representing labeled dataset D
observe labels only at sampling set M⊆ V
acquiring labels is costly
how can we recover the remaining unobserved labels x [i ] for i ∈ V \M?
central smoothness hypothesis of supervised learning
close-by data points in high-density regions have similar labels
SSL over Graphs
[figure: similarity graph with one labeled “BAD GUY” node and two labeled “GOOD GUY” nodes among unlabeled “???” nodes]
unlabelled data influences connectivity of labeled data points!
SSL without Graphs
[figure: the same data points without graph structure: “???”, “BAD GUY”, “???”, “GOOD GUY”, “GOOD GUY”]
unlabelled data does not help anymore!
Acquiring Labels (Sampling) in Marine Biology
Acquiring Labels (Sampling) in Particle Physics
Acquiring Labels (Sampling) in Pharmacology
Key Problems
given a graph signal representation of the learning problem:
how many labels (samples) do we need?
which nodes should we sample?
what are efficient learning algorithms?
Outline
1 Big Data over Networks
2 Sparse Label Propagation (SLP)
3 Network Nullspace Property
4 Application to Image Processing
5 Wrap Up
Give Me Some Orientation!
empirical graph G is weighted but undirected
we obtain a directed version G⃗ by orienting the edges:
for each undirected edge e = {i , j} we choose one node as the head e+ and the other as the tail e−
The Recovery Problem
observe few initial labels y [i ] for i ∈M
get all labels x [i ] by minimizing the total variation

‖x‖TV := Σ_{{i,j}∈E} W_{i,j} |x [i ]− x [j ]| = ‖Dx‖₁

incidence matrix D ∈ R^{E×V} of the empirical graph:

D_{e,i} = W_e if i = e+, −W_e if i = e−, 0 else.

require consistency with the initial labels, i.e., x_M = y
we end up with the recovery problem

x̂ ∈ argmin_x ‖Dx‖₁ s.t. x_M = y.
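A minimal numpy sketch of the weighted incidence matrix D and of ‖x‖TV = ‖Dx‖₁ as defined above (the function names are ours; each edge is given as a (head, tail) pair following the e+, e− orientation):

```python
import numpy as np

def incidence_matrix(edges, weights, num_nodes):
    """Weighted incidence matrix D of the oriented empirical graph:
    row e carries +W_e at the head e+ and -W_e at the tail e- of edge e."""
    D = np.zeros((len(edges), num_nodes))
    for e, ((head, tail), w) in enumerate(zip(edges, weights)):
        D[e, head] = w
        D[e, tail] = -w
    return D

def total_variation(D, x):
    # ||x||_TV = sum_e W_e |x[e+] - x[e-]| = ||D x||_1
    return np.abs(D @ x).sum()
```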
Non-Smooth Convex Optimization
recovery problem x̂ ∈ argmin_x f(x) := ‖Dx‖₁ + I(x_M = y)
objective f(x) is the sum of two non-smooth convex components
x̂ is characterized by 0 ∈ ∂f(x̂)
perfect prey for proximal methods
basic idea: reduce the condition 0 ∈ ∂f(x̂) to a fixed-point equation x = Px with some operator P
do a fixed-point iteration x^(k+1) = P x^(k)
Primal-Dual Methods
optimality condition of our recovery problem

0 ∈ ∂f(x)|_{x=x̂} = ∂[g(Dx) + h(x)]|_{x=x̂},  (1)

where g(x) := ‖x‖₁ and h(x) := I(x_M = y)

by convex duality [Rockafellar, Thm. 31.3], (1) amounts to

Dx̂ ∈ ∂g*(ŷ),  −(Dᵀŷ) ∈ ∂h(x̂)

for some dual solution ŷ

convex conjugate g*(y) defined by

g*(y) = sup_x (yᵀx − g(x))
The Conjugate of a Convex Function
conjugate g*(y) of a convex function g(x) defined by

g*(y) = sup_x (yᵀx − g(x))

[figure: graph of g(x) with the supporting line yᵀx; at the point x0 with ∇g(x0) = y the gap yᵀx0 − g(x0) is maximal, and the supporting line yᵀx − g*(y) intercepts the vertical axis at (0, −g*(y))]
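For the concrete choice g(x) = ‖x‖₁ used in our recovery problem, the conjugate can be evaluated in closed form (a standard fact of convex analysis): it is the indicator of the ℓ∞ unit ball, whose proximal map is entrywise clipping. This is why the operator T in the SLP algorithm clips to [−1, 1].

```latex
g^*(y) = \sup_{x} \bigl( y^T x - \|x\|_1 \bigr)
       = \begin{cases}
            0       & \text{if } \|y\|_{\infty} \le 1, \\
            +\infty & \text{otherwise},
         \end{cases}
\qquad
\bigl(I + \sigma \partial g^*\bigr)^{-1}(e) = \operatorname{clip}\bigl(e, [-1, 1]\bigr).
```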
Primal-Dual Method by Pock-Chambolle
optimality condition of our recovery problem

Dx̂ ∈ ∂g*(ŷ),  −(Dᵀŷ) ∈ ∂h(x̂)

equivalent to, for suitable σ, τ > 0, the resolvent equations

ŷ = (I + σ∂g*)⁻¹(ŷ + σDx̂)
x̂ = (I + τ∂h)⁻¹(x̂ − τDᵀŷ)

we arrive at a fixed-point problem in the pair (x̂, ŷ)

solve via fixed-point iteration....
Sparse Label Propagation (SLP)
Input: incidence matrix D∈R^{E×V}, initial labels {x [i ]}_{i∈M}.
Initialize: k :=0, z^(0) :=0, x^(0)_M :=y, x^(0) :=0, y^(0) :=0, x̂^(0) :=0
Repeat until convergence:
    y^(k+1) := T(y^(k) + (1/(2dmax)) D z^(k))
    r := x^(k) − (1/(2dmax)) Dᵀ y^(k+1)
    x^(k+1)[i ] := x [i ] for i ∈M, r [i ] otherwise
    z^(k+1) := 2x^(k+1) − x^(k)
    x̂^(k+1) := x̂^(k) + x^(k+1)
    k := k + 1
(the operator T clips each entry of its argument to the interval [−1, 1])
Output: x̂^(k) := (1/k) x̂^(k)
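The iteration above can be sketched in a few lines of numpy. This is a sketch under stated assumptions, not a reference implementation: T is taken as entrywise clipping to [−1, 1], dmax as the maximum weighted node degree, and the function name and the dict-based label handling are our own choices.

```python
import numpy as np

def slp(D, labels, num_iters=200):
    """Sketch of Sparse Label Propagation. D is the weighted incidence
    matrix of the oriented empirical graph; `labels` maps each sampled
    node index i in M to its observed label."""
    E, V = D.shape
    dmax = max(np.abs(D).sum(axis=0).max(), 1.0)  # max weighted degree (assumption)
    x = np.zeros(V)
    for i, lab in labels.items():                 # x^(0)_M := y
        x[i] = lab
    y = np.zeros(E)                               # dual variable
    z = np.zeros(V)                               # z^(0) := 0
    x_hat = np.zeros(V)                           # running sum for averaging
    for k in range(num_iters):
        y = np.clip(y + (D @ z) / (2 * dmax), -1.0, 1.0)  # y^(k+1) := T(...)
        r = x - (D.T @ y) / (2 * dmax)
        x_new = r
        for i, lab in labels.items():             # enforce consistency x_M = y
            x_new[i] = lab
        z = 2 * x_new - x                         # over-relaxation step
        x = x_new
        x_hat += x
    return x_hat / num_iters                      # ergodic average
```

On a 3-node chain with both endpoints labeled 1, the recovered middle label drifts toward 1, consistent with TV minimization.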
Convergence
consider recovery problem
x̂ ∈ argmin_x ‖x‖TV s.t. x_M = y
let x̂^(k) denote the output of SLP after k iterations
we have ‖x̂^(k)‖TV − ‖x̂‖TV ≤ c₁/k
the constant c₁ might depend on the underlying graph
Outline
1 Big Data over Networks
2 Sparse Label Propagation (SLP)
3 Network Nullspace Property
4 Application to Image Processing
5 Wrap Up
Sampling Clustered Signals
consider the recovery problem

x̂ ∈ argmin_{x′} ‖x′‖TV s.t. x′_M = x_M

assume the true graph signal is of the form

x = Σ_{C_l∈F} a_l t_{C_l}, with t_C = Σ_{i∈C} e_i,

using a partition F = {C1, . . . , C_|F|} with disjoint clusters C_l

[figure: clusters C1, C2 with signal values a1, a2, boundary edges ∂F, and a sampled node]

when is the solution x̂ close to x?
Circulations with Demands
consider the oriented empirical graph G⃗
a circulation f [e] with demands d [i ] is a mapping f [·] : E⃗ → R satisfying
the conservation law

Σ_{(i,j)∈E⃗} f [(i , j)] − Σ_{(j,i)∈E⃗} f [(j , i)] = d [i ], for any i ∈ V

and the capacity constraints

f [e] ≤ W_e for any oriented edge e ∈ E⃗.
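These two conditions can be checked mechanically for a candidate flow; here is a small Python sketch. The convention that each oriented edge is a (head, tail) pair with flow counted positively into the head is our assumption (the slides leave the orientation bookkeeping implicit), as is the function name.

```python
def is_valid_circulation(edges, weights, flow, demands, num_nodes, tol=1e-9):
    """Check whether `flow` is a circulation with demands `demands` on the
    oriented empirical graph. Each edge is a (head, tail) pair; the flow
    f[e] is counted as entering the head e+ and leaving the tail e-."""
    net = [0.0] * num_nodes
    for (head, tail), f, w in zip(edges, flow, weights):
        if f > w:                 # capacity constraint f[e] <= W_e
            return False
        net[head] += f            # flow arriving at the head ...
        net[tail] -= f            # ... left the tail
    # conservation law: net in-flow at node i must equal the demand d[i]
    return all(abs(net[i] - demands[i]) <= tol for i in range(num_nodes))
```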
Circulations with Demands
[figure: oriented graph with a circulation; demands d[1] < 0, d[2] = d[3] = d[4] = d[5] = 0, d[6] > 0]
The Network Nullspace Property (NNSP)

consider clustered graph signals

x_c = Σ_{C_l∈F} a_l t_{C_l}

which are sampled at the nodes in the sampling set M. The sampling set M satisfies the network nullspace property w.r.t. F, denoted NNSP-(M,F), if there exist circulations f [e] with demands

d [i ] = ±2 W_{i,j} for {i , j} ∈ ∂F,
d [i ] = 0 for every node i ∉ ∂F ∪ M.

[figure: clusters C1, C2 with signal values a1, a2, boundary edges ∂F, and sampled nodes]
NNSP implies Success of SLP
Theorem. Consider a clustered graph signal x_c = Σ_{C_l∈F} a_l t_{C_l} which is observed only at the sampling set M ⊆ V, yielding initial labels y [i ] = x_c [i ] for i ∈ M. If NNSP-(M,F) holds, then the solution of

argmin_x ‖x‖TV s.t. x_M = y

is unique and coincides with x_c.
Outline
1 Big Data over Networks
2 Sparse Label Propagation (SLP)
3 Network Nullspace Property
4 Application to Image Processing
5 Wrap Up
Fore/Background Segmentation
represent RGB bitmap image by grid graph
[figure: neighbouring pixel nodes i, j, k with rgb[i] = (255, 0, 0)ᵀ, rgb[j] = (0, 0, 255)ᵀ, rgb[k] = (0, 255, 0)ᵀ and edge weight W_{i,j}]

weight W_{i,j} := exp(−‖rgb[i ]− rgb[j ]‖²)
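A small Python sketch of this construction: it builds the 4-neighbour grid graph of an H x W image and assigns each edge the weight Wi,j = exp(−‖rgb[i] − rgb[j]‖²). The function name, the 4-neighbour connectivity, and the row-major pixel indexing i = r·W + c are our own choices.

```python
import numpy as np

def grid_graph_weights(rgb):
    """Grid graph of an H x W x 3 image: one node per pixel, edges between
    horizontal and vertical neighbours, weighted by a Gaussian kernel on
    the RGB difference, W_ij = exp(-||rgb[i] - rgb[j]||^2)."""
    H, W, _ = rgb.shape
    edges, weights = [], []
    for r in range(H):
        for c in range(W):
            for dr, dc in ((0, 1), (1, 0)):   # right and down neighbours
                r2, c2 = r + dr, c + dc
                if r2 < H and c2 < W:
                    d2 = np.sum((rgb[r, c] - rgb[r2, c2]) ** 2)
                    edges.append((r * W + c, r2 * W + c2))
                    weights.append(np.exp(-d2))
    return edges, weights
```

Note that with raw 0-255 channel values the weights underflow to 0 for any visible colour difference; in practice the channels would be scaled (e.g., to [0, 1]) before applying the kernel.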
graph signal x [i ] is likelihood for pixel i being foreground
for some pixels i we have hand-crafted initial labels
use SLP to determine entire foreground
Fore/Background Segmentation - Results I
[figure: original image with marked regions R1, R2, R3, and the extracted foreground]

sampling set M = R1 ∪ R3, initial labels x [i ] = 1 for i ∈ R1 and x [i ] = −1 for i ∈ R3
Fore/Background Segmentation - Results II
Outline
1 Big Data over Networks
2 Sparse Label Propagation (SLP)
3 Network Nullspace Property
4 Application to Image Processing
5 Wrap Up
Conclusions
formulated semi-supervised learning as a convex optimization problem
implemented the Pock-Chambolle method, yielding sparse label propagation
formulated a nullspace condition in terms of network connectivity
applied SLP to fore-/background separation
Next Steps (Wanna Join?)
study NNSP for various network models (e.g., stochastic block model)
apply linear program solvers to recovery problem
consider noisy observations
non-uniform conditions via convex geometry of TV
Material
Avery Li-Chun Wang, “An Industrial-Strength Audio Search Algorithm” (the Shazam paper)
L. Vandenberghe, Lecture Notes on Proximal Methods
A. Jung, A. Heimowitz and Y. C. Eldar, “The Network Nullspace Property for Compressed Sensing over Networks”, submitted to SAMPTA 2017; preprint available on request
M. Newman, “Networks: An Introduction”
R. T. Rockafellar, “Convex Analysis” (the bible of convex analysis!)