
Page 1: E6885 Network Science Lecture 6: Network Topology Inference

© 2013 Columbia University

E6885 Topics in Signal Processing -- Network Science

Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University

October 14th, 2013

Page 2: Course Structure

Class Date Lecture Topics Covered

09/09/13 1 Overview of Network Science

09/16/13 2 Network Representation and Feature Extraction

09/23/13 3 Network Partitioning, Clustering and Visualization

09/30/13 4 Network Analysis Use Case

10/07/13 5 Network Sampling, Estimation, and Modeling

10/14/13 6 Network Topology Inference

10/21/13 7 Network Information Flow

10/28/13 8 Graph Database

11/11/13 9 Final Project Proposal Presentation

11/18/13 10 Dynamic and Probabilistic Networks

11/25/13 11 Information Diffusion in Networks

12/02/13 12 Impact of Network Analysis

12/09/13 13 Large-Scale Network Processing System

12/16/13 14 Final Project Presentation

Page 3: Correlation Networks

Pearson product-moment correlation between the attributes of two vertices:

$\rho_{ij} = \mathrm{corr}(X_i, X_j) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}$

Empirical correlations:

$\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\,\hat{\sigma}_{jj}}}$

Edges are declared by testing $H_0: \rho_{ij} = 0$ against the evidence in $\hat{\rho}_{ij}$. If the pair of variables $(X_i, X_j)$ has a bivariate Gaussian distribution, the density of $\hat{\rho}_{ij}$ under $H_0$ has a concise closed-form expression, but it is somewhat complicated and requires numerical integration or tables to produce p-values.
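The construction above is easy to sketch in code. The following is a minimal illustration (not from the slides): it assumes an n x Nv attribute matrix X, and it uses the common Fisher z-transform approximation, rather than the exact null density, to obtain p-values for $H_0: \rho_{ij} = 0$.

    import numpy as np
    from scipy import stats

    def correlation_network(X, alpha=0.05):
        """Test H0: rho_ij = 0 for every vertex pair; return an adjacency matrix.

        X: (n, Nv) array whose columns are the vertex attributes.
        Uses the Fisher z-transform: z_ij ~ Normal(0, 1/(n-3)) under H0.
        """
        n = X.shape[0]
        R = np.corrcoef(X, rowvar=False)                 # empirical correlations
        Z = np.arctanh(np.clip(R, -0.999999, 0.999999))  # Fisher z-transform
        p = 2 * stats.norm.sf(np.abs(Z) * np.sqrt(n - 3))
        A = p < alpha                                    # declare edges
        np.fill_diagonal(A, False)
        return A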

Page 4: Partial Correlation Networks

Important -- ‘Correlation does not imply causation’

For instance:

–Two vertices may have highly correlated attributes because the vertices somehow strongly ‘influence’ each other in a direct fashion.

–Alternatively, their correlation may be high primarily because they each are strongly influenced by a third vertex.

Need more considerations!!

Page 5: Partial Correlation Networks (cont'd)

If it is felt desirable to construct a graph G where the inferred edges are more reflective of direct influence among vertices, rather than indirect influence, the notion of partial correlation becomes relevant.

The partial correlation of attributes $X_i$ and $X_j$ of vertices $i, j$, defined with respect to the attributes $X_{k_1}, \ldots, X_{k_m}$ of vertices $k_1, \ldots, k_m \in V \setminus \{i, j\}$, is the correlation between $X_i$ and $X_j$ left over after adjusting for those effects of $X_{k_1}, \ldots, X_{k_m}$ common to both.

Let $S_m = \{k_1, \ldots, k_m\}$ and $X_{S_m} = (X_{k_1}, \ldots, X_{k_m})^T$. We define the partial correlation of $X_i$ and $X_j$, adjusting for $X_{S_m}$, as

$\rho_{ij \mid S_m} = \frac{\sigma_{ij \mid S_m}}{\sqrt{\sigma_{ii \mid S_m}\,\sigma_{jj \mid S_m}}}$

Page 6: Partial Correlation Networks (cont'd)

Here, $\sigma_{ii \mid S_m}$, $\sigma_{jj \mid S_m}$, and $\sigma_{ij \mid S_m} = \sigma_{ji \mid S_m}$ are the diagonal and off-diagonal elements of the 2x2 partial covariance matrix, defined as follows. Let $W_1 = (X_i, X_j)^T$ and $W_2 = X_{S_m}$, and partition their joint covariance as

$\mathrm{Cov}\!\begin{pmatrix} W_1 \\ W_2 \end{pmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$

Then the 2x2 partial covariance matrix is

$\Sigma_{11 \mid 2} = \Sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}$

For example, for a given choice of m, we may dictate that an edge be present only when there is correlation between $X_i$ and $X_j$ regardless of which m other vertices are conditioned upon:

$E = \left\{ \{i, j\} \in V^{(2)} : \rho_{ij \mid S_m} \neq 0,\ \forall\, S_m \in V^{(m)}_{\setminus \{i, j\}} \right\}$
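As a small sketch (not part of the lecture), the partial covariance above can be computed directly from an empirical covariance matrix via the Schur complement; the function name and arguments here are illustrative only.

    import numpy as np

    def partial_correlation(X, i, j, S):
        """Empirical partial correlation of columns i and j of X, adjusting
        for the columns listed in S, via the 2x2 partial covariance matrix."""
        idx = [i, j] + list(S)
        Sig = np.cov(X[:, idx], rowvar=False)
        S11, S12 = Sig[:2, :2], Sig[:2, 2:]
        S21, S22 = Sig[2:, :2], Sig[2:, 2:]
        P = S11 - S12 @ np.linalg.solve(S22, S21)   # Sigma_{11|2}
        return P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])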

Page 7: Gaussian Graphical Model Networks

A special and popular case of the use of partial correlation coefficients is when m=Nv-2 and the attributes are assumed to have a multivariate Gaussian joint distribution.

Here the partial correlation between attributes of two vertices is defined conditional upon the attribute information at all other vertices.

The graph with edge set

$E = \left\{ \{i, j\} \in V^{(2)} : \rho_{ij \mid V \setminus \{i, j\}} \neq 0 \right\}$

is called a conditional independence graph. The overall model, combining the multivariate Gaussian distribution with the graph G, is called a Gaussian graphical model.

The partial correlation coefficients may be expressed in the form

$\rho_{ij \mid V \setminus \{i, j\}} = \frac{-\,\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$

where $\omega_{ij}$ is the (i, j)-th entry of $\Omega = \Sigma^{-1}$, the inverse of the covariance matrix of the vertex attributes.
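A minimal sketch of this recipe (assuming n is much larger than Nv so the empirical covariance is invertible; in practice a regularized estimator such as the graphical lasso is often used instead):

    import numpy as np

    def ggm_partial_correlations(X):
        """Partial correlations of each pair given all other vertices,
        computed from the precision matrix.

        X: (n, Nv) attribute matrix. Thresholding (or properly testing)
        the result yields the edge set of the conditional independence graph.
        """
        Omega = np.linalg.inv(np.cov(X, rowvar=False))   # Omega = Sigma^{-1}
        d = np.sqrt(np.diag(Omega))
        R = -Omega / np.outer(d, d)       # -w_ij / sqrt(w_ii * w_jj)
        np.fill_diagonal(R, 1.0)
        return R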

Page 8: Inferring Interests from Social Graph

Page 9: On the Quality of Inferring Interests from Social Neighbors (Wen and Lin, KDD 2010)

Modeling user interests enables personalized services:
– More relevant search/recommendation results
– More targeted advertising

Data about users are sparse:
– Many user profiles are static, incomplete and/or outdated
– <10% of employees actively participate in social software [Brzozowski 2009]

Inferring user interests from neighbors can be a solution:
– It also raises the concern of exposing users' private information

How true are "You are who you know" and "Birds of a feather flock together"?

Page 10: Challenges in Observing Users

Diverse types of media

Public social media (friending, blogs, etc.)

• Data are public but limited (esp. in enterprises)

Private communication media (email, instant messaging, face-to-face meetings, etc.)

• Much more data

• Privacy is a major issue

Page 11: Example of Diverse Types of Media

Number of people who participated in the top 3 media in an enterprise with 400K employees.

Number of entries:
• Social bookmarking: 400K
• Electronic communication: 20M
• File sharing: 140K

Page 12: Our Goals

How well can a user's interests be inferred from his/her social neighbors?

Can the diverse types of media be combined to improve inferring user interests from social neighbors?

Can the quality of the inference be predicted based on features of social neighbors?

Only sufficiently accurate inference may help personalized services

Page 13: Our Approach

Infer user interests from social neighbors

Model user interests based on multiple types of information they accessed

Construct employee social network from communication data

Infer using social influence model

Study the relationship between inference quality and network characteristics

Identify effective factors to ensure high quality results for applications

Page 14: Dataset

25,315 users' contributed content

– 20M email/chats

– 400K social bookmarks

– 20K shared public files

– Profile information

• Job role, division, news categories of interest, etc.

Infer social network based on email/chats (X′: number of emails)

Page 15: User Interests Model – Implicit Interests

Model users' interests implicitly indicated by their contributed content.

Extract latent topics from the multiple types of content using LDA.

Select the top-N distinct topics as the implicit interests model of a user, recording:
– The degree to which the user is interested
– The similarity of topics

Page 16: User Interests Model – Explicit Interests

29% of users manually specify interests in their profile:
– A list of selected terms
• From a static 1120-term taxonomy related to work

Compare implicit and explicit interests:
– Explicit interests models are more limited
• Implicit interests cover 60.4% of explicit interests
• Explicit interests cover 2.2% of implicit interests

Page 17: Infer Interests Based on Social Influence

Social influence model:
– Network autocorrelation model [Leenders 2002]
• Social influence is represented as a weighted combination of neighbors' attributes
• The weight is an exponential function of the social distance (see the sketch below)
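A toy sketch of this idea (the names, data structures, and the specific weighting are illustrative assumptions, not the paper's exact model): each user's topic-interest vector is estimated as an exponentially distance-weighted average of the neighbors' vectors.

    import numpy as np

    def infer_interests(user, neighbors, topic_vecs, dist, lam=1.0):
        """Weighted combination of neighbors' interest vectors, with weights
        decaying exponentially in social distance (illustrative only).

        topic_vecs: dict user -> topic-interest vector
        dist: dict (user, neighbor) -> social distance
        """
        w = np.array([np.exp(-lam * dist[(user, v)]) for v in neighbors])
        w /= w.sum()                                  # normalize the weights
        V = np.vstack([topic_vecs[v] for v in neighbors])
        return w @ V          # weighted average of neighbors' interests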

Page 18: Inference Quality

Condition                         Max     Mean    St. Dev.
Using social bookmark data only   59.4%   19.2%   10.7%
Using file sharing data only      44.9%   12.7%    7.2%
Using email/IM data only          62.1%   29.6%   14.1%
Using all three data sources      100%    45.1%   21.7%

Implicit interests: how close the inferred top-20 topics are to the ground truth.

– Significant advantage in combining multiple sources

– The large variance can affect practical application, so we need to predict when to infer interests

– Much better recall than precision

Explicit interests: precision and recall of inferred terms.

Measure     Mean    St. Dev.
Precision   30.1%   26.9%
Recall      61.5%   27.6%

Page 19: Can Inference Quality be Predicted?

Hypothesis: inference quality can be predicted from social network properties

–User activeness: the amount of contribution

–In-degree

–Out-degree

–Betweenness

–User management role

Use Support Vector Regression to perform prediction

Evaluate prediction

–Precision/recall of the prediction (10-fold cross validation)

–Use prediction to improve inference
• Only infer when we predict the quality will be high

Page 20: Quality Prediction Results

Precision/recall of prediction:

[Figure: precision/recall of prediction, shown for implicit interests and explicit interests.]

Page 21: Quality Prediction Results (cont'd)

Improve inference:

Measure     Improved to   Improvement (%)
Precision   60.5%         101%
Recall      85.7%         39.3%

[Results shown for implicit interests and explicit interests.]

Page 22: Feature Comparison

"Leave-one-feature-out" comparisons of prediction results:

Most social influence comes from 1st- and 2nd-degree neighbors.

Your neighbors decide how well you can be inferred.

Page 23: Feature Comparison (cont'd)

"Leave-one-feature-out" comparisons of prediction results:

Your neighbors' network positions may be even more important than how active they are.

– Formal organizational properties
• Manager neighbors are more important in inference, i.e., they exert more social influence (about 5% more)

Page 24: Conclusion of the Wen-Lin 2010 KDD Paper

There’s large variance in the quality of inferring user interests from social neighbors

The “recall” of the inference is much better than “precision”

The inference quality may be predicted from social network properties

Page 25: Tomographic Inference

Page 26: Tomographic Network Topology Inference

Inference of ‘interior’ components of a network – both vertices and edges – from data obtained at some subset of ‘exterior’ vertices.

E.g., in computer networks, desktop and laptop computers are typical instances of 'exterior' vertices, while Internet routers to which we do not have access are effectively 'interior' vertices.

Network tomography describes problems in the context of computer network monitoring, in which aspects of the 'internal' workings of the network, such as the intensity of traffic flowing between vertices, are inferred from 'external' measurements.

This is an 'ill-posed inverse problem' in mathematics: the mapping from internal characteristics to external measurements is many-to-one.

For the tomographic inference of network topologies, a key structural simplification has been the restriction to inference of networks in the form of trees.

Page 27: Tomographic Inference of Tree Topologies

[Figure: a tree structure.]

Page 28: Tomographic Network Inference Problem

Given a set of Nl vertices, suppose we have n i.i.d. observations of some random variables {X1, ..., XNl}. We aim to find the tree, among all binary trees with Nl labeled leaves, that best explains the data.

Example: Multicast probes

Page 29: Two Most Popular Classes of Methods

Hierarchical clustering and related ideas

Likelihood-based methods

Page 30: Hierarchical Clustering-based Methods

We treat the Nl leaves as the ‘objects’ to be clustered and the tree corresponding to the resulting clustering as our inferred tree.

The tree corresponding to the entire set of partitions is our focus.

Example: Ratnasamy and McCanne (1999): the observed rate of shared losses of packets should be fairly indicative of how close two leaf vertices (i.e., destination addresses) are on a multicast tree T (see the sketch after this list).

– Two different types of shared loss between a pair of leaf vertices: 'true' and 'false' shared loss.

– True shared losses are due to loss of packets on the path common to the vertices i and j.

– False shared losses refer to cases where packets were lost separately on the two distinct paths leading down to i and j.
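A rough sketch of the clustering step (my own illustration, not Ratnasamy and McCanne's actual algorithm): treat the shared-loss rate as a similarity between leaves, convert it to a dissimilarity, and feed it to standard agglomerative clustering; the resulting linkage encodes the inferred tree.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def tree_from_shared_loss(shared_loss):
        """shared_loss: (Nl, Nl) symmetric matrix of observed shared-loss
        rates between leaf pairs. Returns a SciPy linkage matrix encoding
        the inferred binary tree over the Nl leaves."""
        D = shared_loss.max() - shared_loss      # similarity -> dissimilarity
        iu = np.triu_indices(D.shape[0], k=1)    # condensed upper triangle
        return linkage(D[iu], method="average")  # agglomerative clustering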

Page 31: Likelihood-based Methods

If we are willing to specify probability models, then we have the potential for likelihood-based methods of inference.

In general, maximum likelihood inference of tree topologies may also be pursued through the use of Markov chain Monte Carlo (MCMC). MCMC is also central to applying Bayesian methods to the tomographic inference of trees.

Page 32: Bayesian Inference for Content Classification

Page 33: Application – Content Clustering

Goal – categorize the documents into topics.

Each document is a probability distribution over topics; each topic is a probability distribution over words. The topics are the mixture components, and the per-document topic proportions are the mixture weights.

[Figure: two example topics (TOPIC 1: money, loan, bank; TOPIC 2: river, stream, bank) and two documents whose words are tagged with the topic that generated them, with the document-topic mixture weights shown (0.8/0.2 and 0.3/0.7).]

Page 34: Content Analysis Prior Art I -- Latent Semantic Analysis

Latent Semantic Analysis (LSA) [Landauer, Dumais 1997]

– Descriptions:
• Captures the semantic concepts of documents by mapping words into the latent semantic space, which captures the possible synonymy and polysemy of words.
• Training is based on different levels of documents. Experiments show the synergy of the number of training documents and psychological studies of students at the 4th-grade, 10th-grade, and college levels. Used as an alternative to the TOEFL test.

– Based on the truncated SVD of the document-term matrix: the optimal least-squares projection to reduce dimensionality.

– Captures the concepts instead of the words:
• Synonymy
• Polysemy

[Figure: truncated SVD of the N x M term-document matrix, X ≈ T0 · S0 · D0, with T0 (N x K), S0 (K x K), and D0 (K x M).]
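A minimal numerical sketch of the truncated SVD (assuming X is a small dense N x M term-document matrix; for large sparse corpora a sparse or randomized SVD would be used instead):

    import numpy as np

    def lsa(X, K):
        """Rank-K LSA of an N x M term-document matrix X via truncated SVD:
        X ~ T0 @ S0 @ D0, with T0 (N x K), S0 (K x K), D0 (K x M)."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        T0 = U[:, :K]            # term -> concept mapping
        S0 = np.diag(s[:K])      # concept strengths
        D0 = Vt[:K, :]           # concept -> document mapping
        return T0, S0, D0

    # Documents can then be compared in the K-dimensional latent space
    # (the columns of D0) instead of the raw word space.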

Page 35: Traditional Content Clustering

fwj: the frequency of the word wj in a document.

Clustering: partition the feature space (fw1, fw2, fw3, ...) into segments based on training documents. Each segment represents a topic/category. (Topic detection.)

Hard clustering (e.g., K-means): $d = \{f_{w_1}, f_{w_2}, \ldots, f_{w_N}\} \rightarrow z$

Soft clustering (e.g., Fuzzy C-means): $P(Z \mid W = \mathbf{f}_w)$

[Figure: another representation of clustering, as a graph linking Documents (d1...d6) to Topics (z1...z5) to Words (w1...w6); the words are the observations.]

Page 36: Traditional Content Clustering (cont'd)

[Same content as the previous slide; the clustering graph is shown without the deterministic part.]

Page 37: Content Clustering based on Bayesian Network

Bayesian Network:
• Causality network – models the causal relationships of attributes/nodes
• Allows hidden/latent nodes

By Bayes' theorem:

$P(W \mid Z) = \frac{P(Z \mid W)\,P(W)}{P(Z)}$

Soft clustering: $P(Z \mid D)$

Hard clustering (via MLE): $h(D = d) = \arg\max_z P(Z = z \mid W = \mathbf{f}_w)$

[Figure: Bayesian network linking Documents (d1, d2, d3) to Topics (z1...z5) to Words (w1...w6); the words are the observations.]

Page 38: Content Clustering based on Bayesian Network – Hard Clustering

$P(W \mid Z) = \frac{P(Z \mid W)\,P(W)}{P(Z)} = \frac{P(Z \mid W)\,P(W)}{\int P(Z \mid W)\,P(W)\,dW}$

N: the number of words. (The number of topics, M, is pre-determined.)

Major Solution 1 -- Dirichlet Process:
• Models P(W | Z) as mixtures of Dirichlet probabilities.
• Before training, the prior of P(W | Z) can be an easy Dirichlet (uniform distribution); after training, P(W | Z) will still be Dirichlet. (The reason for using the Dirichlet.)

Major Solution 2 -- Gibbs Sampling:
• A Markov chain Monte Carlo (MCMC) method that integrates over large samples to calculate P(Z).

[Figure: plate diagram of Latent Dirichlet Allocation (LDA) (Blei 2003): a topic assignment z and observed word w per word position, repeated N times, with topic-word distributions φ governed by a Dirichlet prior β over M topics.]

Page 39: Latent Dirichlet Allocation (LDA) [Blei et al. 2003]

Goal – categorize the documents into topics.

Each document is a probability distribution over topics; each topic is a probability distribution over words.

The probability of the i-th word in a given document:

$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\,P(z_i = j)$

$P(w_i \mid z_i = j)$ are the mixture components; $P(z_i = j)$ are the mixture weights.

[Figure: the same two-topic example as on Page 33, with each document's words tagged by their generating topic and the document-topic mixture weights shown.]

Page 40: LDA (cont.)

INPUT: document-word counts (D documents, W words)

OUTPUT: mixture components $\phi_w^{(j)}$ (topic-word distributions) and mixture weights $\theta_j^{(d)}$ (document-topic distributions), so that

$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\,P(z_i = j)$

Bayesian approach -- use priors:
• Mixture weights $\theta \sim \mathrm{Dirichlet}(\alpha)$
• Mixture components $\phi \sim \mathrm{Dirichlet}(\beta)$

Parameters can be estimated by Gibbs sampling.

[Figure: LDA plate diagram with hyperparameters α and β, document-topic distribution θ, topic assignment z, and observed word w (the observations); plates over W words, D documents, and T topics, where T is the number of topics.]
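For concreteness, a small sketch of fitting LDA with an off-the-shelf library (note that scikit-learn estimates the model by variational Bayes rather than the Gibbs sampling discussed here; the two-document corpus is an invented toy example):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["money bank loan bank", "river stream bank river"]  # toy corpus
    X = CountVectorizer().fit_transform(docs)     # document-word counts

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    theta = lda.fit_transform(X)                  # mixture weights (D x T)
    phi = lda.components_                         # topic-word pseudo-counts
    phi = phi / phi.sum(axis=1, keepdims=True)    # mixture components (T x W)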

Page 41: Comparison of Dirichlet Distribution with Gaussian Mixture Models

Dirichlet Distribution:

$\rho(f_1, f_2, \ldots, f_r;\ a_1, a_2, \ldots, a_r) = \frac{\Gamma(N)}{\prod_{k=1}^{r} \Gamma(a_k)}\, f_1^{a_1 - 1} f_2^{a_2 - 1} \cdots f_r^{a_r - 1}$

where $N = \sum_{k=1}^{r} a_k$, $0 \le f_k \le 1$, and $\sum_{k=1}^{r} f_k = 1$.

Multivariate Gaussian:

$\rho(f_1, f_2, \ldots, f_{r-1};\ \mu_1, \sigma_1, \mu_2, \sigma_2, \ldots, \mu_{r-1}, \sigma_{r-1}) = \prod_{k=1}^{r-1} \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(f_k - \mu_k)^2}{2\sigma_k^2}}$

Page 42: Beyond Gaussian: Examples of Dirichlet Distribution

Page 43: Importance of Dirichlet Distribution

In 1982, Sandy Zabell proved that, if we make certain assumptions about an individual’s beliefs, then that individual must use the Dirichlet density function to quantify any prior beliefs about a relative frequency.

Page 44: Use Dirichlet Distribution to model prior and posterior beliefs (I)

Prior beliefs: $\rho(f) = \mathrm{beta}(f; a, b)$

E.g.: is a coin *fair*?
– Flipping a coin, what is the probability of getting 'heads'?

beta(1,1): we have never tossed that coin -- no prior knowledge.

Page 45: Use Dirichlet Distribution to model prior and posterior beliefs (II)

Prior beliefs: $\rho(f) = \mathrm{beta}(f; a, b)$

Posterior beliefs: $\rho(f \mid d) = \mathrm{beta}(f; a + s, b + t)$

Data d: s = 2 heads, t = 2 tails.

beta(1,1): no prior knowledge. beta(3,3): posterior knowledge after 2 head trials (s) and 2 tail trials (t) -- this coin may be fair.

Page 46: Use Dirichlet Distribution to model prior and posterior beliefs (III)

Prior beliefs: $\rho(f) = \mathrm{beta}(f; a, b)$

Posterior beliefs: $\rho(f \mid d) = \mathrm{beta}(f; a + s, b + t)$

Data d: 8 heads, 2 tails.

beta(3,3): prior knowledge -- this coin may be fair. beta(11,5): posterior belief -- the coin may not be fair, after tossing 8 heads and 2 tails.

Page 47: Use Dirichlet Distribution to model prior and posterior beliefs (IV)

Another example:
– Our belief that an event, which in general has a 5% occurrence probability, may be true in our case.

Data d: 3 occurrences, 0 non-occurrences.

beta(1/360, 19/360): prior knowledge that a 5%-chance event may be true. beta(1/360 + 3, 19/360): posterior belief that the 5%-chance event may be true.
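The conjugate update on the preceding slides is one line with SciPy (a sketch; the numbers reproduce the coin example from slide III):

    from scipy.stats import beta

    a, b = 3, 3                       # prior beta(3,3): coin may be fair
    s, t = 8, 2                       # data: 8 heads, 2 tails
    posterior = beta(a + s, b + t)    # conjugate update -> beta(11, 5)
    print(posterior.mean())           # 0.6875: belief drifts away from fair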

Page 48: Gibbs Sampling (I)

Given a PDF $\rho(X)$, it is usually difficult to compute an integration $\int f(X)\,\rho(X)\,dX$ directly.

We can use Gibbs sampling to simulate *samples* $X_s = \{X_{s,1}, X_{s,2}, \ldots, X_{s,T}\}$ that follow the PDF. Thus, we can approximate the integration by summing $f(X_s)$ over those samples.

Page 49: Gibbs Sampling (II)

Gibbs sampling: find a Markov chain $\rho_s(X_i \mid X_{i-1})$ such that the outcome samples follow the PDF $\rho(X)$.

Example: for a 2-dimensional vector $X = (x_1, x_2)$, $\rho_s(X_i \mid X_{i-1})$ is determined by the two conditional draws

$x_{1,i} \sim \rho_s(x_{1,i} \mid x_{2,i-1})$
$x_{2,i} \sim \rho_s(x_{2,i} \mid x_{1,i})$

each derived from the PDF $\rho(X)$.

[Figure: the chain X1 → X2 → X3 → X4 tracing axis-aligned steps through the (x1, x2) plane.]
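A minimal worked example (my own, not from the slides): a Gibbs sampler for a standard bivariate normal with correlation rho, where both full conditionals are themselves Gaussian, so the chain is easy to run.

    import numpy as np

    def gibbs_bivariate_normal(rho, T=10000, burn=1000):
        """Gibbs sampler for a standard bivariate normal with correlation rho.
        Returns post-burn-in samples that follow the target PDF."""
        rng = np.random.default_rng(0)
        x1, x2 = 0.0, 0.0
        samples = []
        for i in range(T):
            # draws from rho_s(x1 | x2) and rho_s(x2 | x1),
            # both derived from the joint PDF
            x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
            x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
            if i >= burn:
                samples.append((x1, x2))
        return np.array(samples)

    # An expectation E[f(X)] is then approximated by averaging f over the draws.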

Page 50: Some Insight on BN-based Content Clustering

Content clustering:
• Because documents and words are dependent, only documents that are close in the word-frequency feature space (fw1, fw2, fw3, ..., where fwj is the frequency of the word wj in a document) can be clustered together as one topic.

Incorporating human factors can possibly *link* multiple clusters together.

Bayesian Network:
• Models the *practical* causal relationships.

Page 51: Why Do We Need Simultaneous Multimodality Clustering?

Multiple-step clustering:
• e.g., a naïve way to combine content filtering and collaborative filtering: independently cluster first, combine later.

[Figure: in the word-frequency feature space, independently formed clusters sometimes combine well (OK) and sometimes do not (Not-OK).]

Simultaneous multimodality clustering is important.