Network Models and Data Analysis. Stephen E. Fienberg, Department of Statistics and Machine Learning Department. Machine Learning 10-701/15-781, Fall 2008.


Page 1:

Network Models and Data Analysis

Stephen E. Fienberg, Department of Statistics

Machine Learning Department

Machine Learning 10-701/15-781, Fall 2008

Page 2:

Making Pretty Pictures — Visualizing Networks — Is Easy

October 22, 2008

Page 3:

Page 4:

Example 1: 9/11 Terrorists


Page 5:

Lots of Probabilistic/Statistical Models

• Types of models:
– Descriptive vs. generative.
– Static vs. dynamic.
• Origins of social network models in the 1930s, integrated with graph representations in the 1950s.
• Erdos-Renyi random graph models.
– Generalized random graph models.
– Stochastic process reinterpretations.
• Sociometric models such as p1 and ERGMs.
• Machine learning / latent-variable models:
– Stochastic block models for mixed membership.

Page 6:

Applications Galore

• Small world studies
• Social networks:
– Sampson's monks
– Classroom friendships
• Organization theory:
– Branch banks
• Homeland security
• Politics:
– Voting behavior
– Bill co-sponsorship
• Public health:
– Needle sharing
– Spread of AIDS
– Obesity
• Computer science:
– Email networks (Enron)
– Internet / WWW routing systems
• Biology:
– Protein-protein interactions
– Zebras

Page 7:

But Doing Careful Statistical Analysis is Difficult

• Claims for network behavior are often based on casual empiricism:
– Power laws are everywhere, yet nowhere once we look closely at the data.
• Inferential issues are usually buried:
– Algorithms, simulations, and "experiments" are not substitutes for formal statistical representation and theory.

Page 8:

Page 9:

Power Laws & Internet Graph


Page 10:

Framework for Networks Evolving over Time

• Our representation for a network will be a graph: Gt = {Nt; Et}.
– Nodes and edges can be created and can die.
– Edges can be directed or undirected.
– Data are available to be observed beginning at time t0.
• There exists a stochastic process evolving over time which, combined with initial conditions, describes the network structure and evolution.
– May involve more than dyadic relationships.

Page 11:

Forms of Network Data

1. Observe the formation (or removal) of each edge, with a time stamp indicating when this occurs.
• Can see how the entire network or a sub-network changes with each transaction.
2. Observe the status of the network or a sub-network at T epochs.
• Represents snapshots of the network.
• Corresponds to information on the incidence of links and information on relationships.
3. Observe cumulative network links over time.
• "Prevalence" approach.

Page 12:

Example 3: Enron E-mail Database

• Node attributes (including the organization chart!) and full text of all e-mail messages.
• Multiple addressees and cc's, so observations produce structure different from dyadic edges.
• Messages contain time stamps, so we are in situation 3.
• Question: who was party to fraudulent transactions, and when?

Page 13:

Enron−Threshold 5 (151 employees)


Page 14:

Enron−Threshold 30 (151 employees)


Page 15:

Example 4: The Framingham “Obesity” Study

• Original Framingham "sample" cohort, with an offspring cohort of N0 = 5,124 individuals, measured beginning in 1971 for T = 7 epochs centered at 1971, 1981, 1985, 1989, 1992, 1997, and 1999.
• Link information on family members and one "close friend." The total number of individuals on whom we have obesity measures is N = 12,067.
• NEJM, July 2007.

Page 16:

Page 17:

Animation


Page 18:

Erdos-Renyi Random Graph Model

• Two versions:
– In the G(n, M) model, a graph is chosen uniformly at random from the collection of all graphs with n nodes and M edges.
– In the G(n, p) model, each edge is included in the graph with probability p, with the presence or absence of distinct edges being independent.
• As p increases from 0 to 1, the model becomes more and more likely to include graphs with more edges.

Page 19:

Erdos-Renyi Random Graph Model

• G(n, p) has on average n(n−1)p/2 edges, and the degree of any node is distributed Binomial(n − 1, p).
– If np < 1, G(n, p) will almost surely have no connected component of size larger than O(log n).
– If np = 1, G(n, p) will almost surely have a largest component whose size is of order n^(2/3).
– If np tends to a constant c > 1, G(n, p) will almost surely have a unique "giant" component containing a positive fraction of the nodes; no other component will contain more than O(log n) nodes.
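The giant-component phase transition in G(n, p) is easy to check by simulation. A minimal standard-library sketch (the function name and the parameter choices n = 500, c ∈ {0.5, 2.0} are illustrative, not from the slides):

```python
import random

def gnp_largest_component(n, c, seed=42):
    """Sample G(n, p) with p = c/n and return the size of the largest
    connected component, using a simple union-find over the n nodes."""
    rng = random.Random(seed)
    p = c / n
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each of the n(n-1)/2 possible edges is present independently with prob p.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                parent[find(i)] = find(j)

    sizes = {}
    for x in range(n):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())

# Below np = 1 all components stay O(log n); above it a unique "giant"
# component holding a positive fraction of the nodes emerges.
for c in (0.5, 2.0):
    print(f"np = {c}: largest component = {gnp_largest_component(500, c)}")
```

For c = 2 the giant component should cover roughly 80% of the nodes (the fraction x solving x = 1 − e^(−cx)).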

Page 20:

Page 21:

Preferential Attachment Model

• Encourages the formation of hubs in the graph.
• Degree distribution follows a power law:
– The fraction of nodes having k edges to other nodes behaves, for large k, as P(k) ~ k^(−γ).
– Linear on a log-log scale.
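A standard-library sketch of preferential attachment (function name and parameters are illustrative): each new node attaches to m existing nodes with probability proportional to current degree, which is what produces the hubs and the heavy right tail.

```python
import random

def ba_degrees(n, m=2, seed=42):
    """Grow a preferential-attachment graph to n nodes; return all degrees.
    'ends' lists every edge endpoint once, so a uniform draw from it is a
    degree-proportional draw."""
    rng = random.Random(seed)
    ends = []
    for i in range(m + 1):              # start from a clique on m+1 nodes
        for j in range(i + 1, m + 1):
            ends += [i, j]
    degree = [m] * (m + 1)
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:         # m distinct degree-weighted targets
            targets.add(rng.choice(ends))
        degree.append(m)
        for t in targets:
            degree[t] += 1
            ends += [new, t]
    return degree

deg = ba_degrees(2000)
# Hubs form: the maximum degree far exceeds the mean (which stays near 2m).
print(max(deg), sum(deg) / len(deg))
```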


Page 22:

Small World Model

• Designed to produce local clustering and triadic closures, by interpolating between an ER graph and a regular ring lattice.
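The interpolation can be sketched in plain Python (function names and sizes are illustrative): a ring lattice has high clustering, and rewiring each edge with probability beta moves it toward an ER-like graph with little clustering.

```python
import random

def watts_strogatz(n, k, beta, seed=42):
    """Ring lattice on n nodes with k neighbours on each side; each lattice
    edge is rewired (keeping its near end) with probability beta."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            adj[i].add((i + d) % n)
            adj[(i + d) % n].add(i)
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            if j in adj[i] and rng.random() < beta:
                new = rng.randrange(n)
                while new == i or new in adj[i]:
                    new = rng.randrange(n)
                adj[i].discard(j); adj[j].discard(i)
                adj[i].add(new); adj[new].add(i)
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient (fraction of closed triads)."""
    total = 0.0
    for i, nbrs in adj.items():
        nb = list(nbrs)
        if len(nb) < 2:
            continue
        links = sum(1 for a in range(len(nb)) for b in range(a + 1, len(nb))
                    if nb[b] in adj[nb[a]])
        total += 2.0 * links / (len(nb) * (len(nb) - 1))
    return total / len(adj)

# beta = 0 keeps the lattice's triangles; beta = 1 destroys most of them.
print(avg_clustering(watts_strogatz(200, 3, 0.0)),
      avg_clustering(watts_strogatz(200, 3, 1.0)))
```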


Page 23:

Example 5: Monks in a Monastery

• 18 novices observed over two years.
– Network data gathered at 4 time points, and on multiple relationships, e.g., friendship.
– Airoldi et al. (2007, 2008).

Page 24:

Page 25:

Holland-Leinhardt’s p1 Model

• n nodes; the occurrence of "directed" links is random.
• Consider the dyads Dij = (Xij, Xji) to be independent, with
– Pr(Dij = (1,1)) = mij, i < j
– Pr(Dij = (1,0)) = aij, i ≠ j
– Pr(Dij = (0,0)) = nij, i < j
where mij + aij + aji + nij = 1 for all i < j.

Page 26:

p1 Model

• If we let
– ρij = log{mij nij / (aij aji)}, i < j
– θij = log{aij / nij}, i ≠ j
then p1 assumes the probability of observing x is
p1(x) = Pr(X = x) = K exp[Σi<j ρij Xij Xji + Σij θij Xij]
– K = Πi<j 1/kij
– kij({θij}, {ρij}) is the normalizing constant for dyad Dij.
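Concretely, for a single dyad the four state probabilities come from one exponential-family expression normalized by kij, and the definitions of ρij and θij are recovered exactly. A sketch (the helper name dyad_probabilities and the parameter values are illustrative):

```python
import math

def dyad_probabilities(theta_ij, theta_ji, rho_ij):
    """Pr(D_ij = (x, y)) for the four states of one dyad, from p1's
    exponential-family form with normalizing constant k_ij."""
    weights = {(x, y): math.exp(rho_ij * x * y + theta_ij * x + theta_ji * y)
               for x in (0, 1) for y in (0, 1)}
    k = sum(weights.values())                  # k_ij({theta}, {rho})
    return {state: w / k for state, w in weights.items()}

p = dyad_probabilities(theta_ij=-1.0, theta_ji=-0.5, rho_ij=2.0)
m, null = p[(1, 1)], p[(0, 0)]                 # m_ij and n_ij
a, a_rev = p[(1, 0)], p[(0, 1)]                # a_ij and a_ji
print(math.log(m * null / (a * a_rev)))        # recovers rho_ij = 2.0
print(math.log(a / null))                      # recovers theta_ij = -1.0
```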


Page 27:

Three Common Forms of p1

• If we add restrictions:
– θij = θ + αi + βj, i ≠ j
– (i) ρij = 0, (ii) ρij = ρ, (iii) ρij = ρ + ρi + ρj
• Then for case (ii):
p1(x) = K exp[ρM + θL + Σi αi Xi+ + Σj βj X+j]
where the terms correspond to reciprocity (ρ), density (θ), expansiveness (αi), and popularity (βj).


Page 28:

Estimation for p1

• Exponential family form.
– Set the minimal sufficient statistics (MSSs) equal to their expectations.
– Iterate.
• Holland and Leinhardt explored goodness of fit of p1:
– Comparing ρij = 0 vs. ρij = ρ.
– Usual chi-square results don't apply.
– How to test ρij = ρ against a more complex model?

Page 29:

p1 As a Log-Linear Model

• p1 is expressible as a log-linear model on an "incomplete" 4-way contingency table:
– Yijkl = 1 if Xij = k and Xji = l, 0 otherwise.
• p1 with ρij = ρ corresponds to the log-linear model on Y with all two-way interactions: [12][13][14][23][24][34].
• p1 with ρij = ρ + ρi + ρj corresponds to [12][134][234].
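The indicator array Y can be built mechanically from the adjacency matrix. A toy sketch (the helper name dyad_table and the example digraph are made up for illustration):

```python
def dyad_table(X):
    """Entries Y_ijkl = 1 iff X_ij = k and X_ji = l, stored sparsely for
    i < j; each dyad fills exactly one of its four (k, l) cells, which is
    why the 4-way table is 'incomplete'."""
    n = len(X)
    return {(i, j, X[i][j], X[j][i]): 1
            for i in range(n) for j in range(n) if i < j}

# Toy 3-node digraph: 0<->1 mutual, 0->2 asymmetric, dyad (1,2) null.
X = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
Y = dyad_table(X)
mutual = sum(1 for (i, j, k, l) in Y if (k, l) == (1, 1))  # the statistic M
print(sorted(Y), mutual)
```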

Page 30:

Example 5: Monks in a Monastery

• 18 novices observed over two years.
– Network data gathered at 4 time points, and on multiple relationships, e.g., friendship.
– Airoldi et al. (2007, 2008).

Page 31:

p1 Analysis of Monk Data


Page 32:

p1 Analysis of Monk Data


Page 33:

Sampson's Monks: 3 Blocks?


Page 34:

K=3 SBMM for Friendship

• Friendship relationship among novices measured at 3 successive times.

• K=3 stochastic blocks + mixed membership:

Page 35:

Example 6: MIPS-Curated PPI in Yeast

• 871 proteins participate in 15 high-level functions
• 2,119 functional annotations (binary)

Page 36:

M = 871 nodes; M² ≈ 750K entries

The Data: Interaction Graphs

• M proteins in a graph (nodes)
• M² observations on pairs of proteins:
– Edges are random quantities, Y[n,m].
– Interactions are not independent.
– Interacting proteins form a protein complex.
• T graphs on the same set of proteins
• Partial annotations for each protein, X[n]

Page 37:

Modeling Ideas

• Hierarchical Bayes:
– Latent variables encode semantic elements.
– Assume structure on observable-latent elements.

1. Models of mixed membership
2. Network models (block models)

Models of mixed membership + block models = stochastic block models of mixed membership.

Page 38:

Graphical Model Representation

(Figure: graphical models for the stochastic blocks and mixed membership components.)

Page 39:

Hierarchical Likelihood

(Figure: for a pair of nodes i, j, the interactions yij are observed, while the mixed-membership vectors πi and the group-to-group patterns B, e.g., B23 = 0.9, are latent.)

Pr(yij = 1 | πi, πj, B) = πi^T B πj
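The MMSB link probability Pr(yij = 1 | πi, πj, B) = πi^T B πj is just a bilinear form in the two membership vectors. A minimal sketch for K = 3 blocks (the matrix B and the membership vectors are made-up illustrative values):

```python
# Group-to-group interaction patterns B (illustrative values).
B = [[0.9, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.9]]

def link_probability(pi_i, pi_j, B):
    """pi_i^T B pi_j: marginalizes over each node's latent group draw."""
    K = len(B)
    return sum(pi_i[g] * B[g][h] * pi_j[h]
               for g in range(K) for h in range(K))

# Two nodes fully in block 1 link with probability B[0][0]; spreading
# membership across blocks pulls the probability toward the off-diagonals.
print(link_probability([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], B))   # 0.9
print(link_probability([0.5, 0.5, 0.0], [0.5, 0.5, 0.0], B))
```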

Page 40:

Interactions in Yeast (MIPS)

Do PPI contain information about functions?

(Figure: binary annotation profile for protein YLD014W across the 15 functional categories.)

Page 41:

Results: Functional Annotations

Page 42:

Results: Stochastic Block Model

Page 43:

Some Results

• K = 50 blocks works well under 5-fold cross-validation, and the recovered blocks are consistent with the 15 functional categories.
• Our predictions of functional annotations are superior to others in the literature on the same database.
• Lots of technical details.

Page 44:

Example 7: Social Network of Zebras


Page 45:

Dynamical Representation

• What is the stochastic model for group formation and change?

• Groups of females and shifting males who are mating?


Page 46:

Summary

• Lots of networks and their graph representations.
• Erdos-Renyi random graph models, preferential attachment models, small world models.
• p1 and log-linear models.
– Generalization to Exponential Random Graph Models (ERGMs).
• Stochastic block models with mixed membership.

Page 47:

Some References

• Holland, P.W. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs (with discussion). Journal of the American Statistical Association, 76:33–65.
• Fienberg, S.E. and Wasserman, S.S. (1981). Categorical data analysis of single sociometric relations. Sociological Methodology, 156–192.
• Fienberg, S.E., Meyer, M.M., and Wasserman, S.S. (1985). Statistical analysis of multiple sociometric relations. Journal of the American Statistical Association, 80:51–67.
• Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg, A., Xing, E., and Zheng, A., eds. (2007). Statistical Network Analysis: Models, Issues, and New Directions. LNCS 4503, Springer-Verlag, Berlin.
• Newman, M., Barabási, A.-L., and Watts, D.J., eds. (2006). The Structure and Dynamics of Networks. Princeton University Press.
• Airoldi, E.M., Blei, D.M., Fienberg, S.E., and Xing, E.P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.