beyond ergms - scalable methods for the statistical...

Beyond ERGMsScalable methods for the statistical modeling of networks

David Hunter

Department of StatisticsPenn State University

Supported by ONR MURI Award Number N00014-08-1-1015

University of Texas at Austin, May 10, 2013

Outline

Estimation and the ERGM Framework

Statistical Estimation for Large, Time-Varying Networks

Model-Based Clustering of Large Networks

A network model is a probability distribution (or familyof distributions) on the set of all possible networks

I Thus, we assign each possible network a probability, e.g.,

−→ 11024

, −→ 11024

, and so on.

I But we’d like to avoid explicit enumeration.(Think of Occam’s Razor.)

I ERGMs are one way to allow the assignment to depend(explicitly) on a relatively small number of parameters.

ERGM = Exponential-family Random Graph Model

ERGM: Exponential-Family Random Graph Model

I An ERGM (or p-star model) says

Pθ(Y = y) =exp{θ>g(y)}

κ(θ,Y), y ∈ Y

I Y is a random networkI Y is the set of all possible networks

, , · · ·

I θ is a vector of parametersI g(y) is a known vector of network statistics on yI κ(θ,Y) makes all the probabilities sum to 1

The Gilbert-Erdos-Rényi model: The simplest ERGM

I The function κ(θ,Y) can be troublesome, but not always.I Consider the following case (Ann. Math. Stat., 1959):

Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve, and extend access toThe Annals of Mathematical Statistics.

www.jstor.org®

The Gilbert-Erdos-Rényi model: The simplest ERGM

I Let p be some fixed constant between 0 and 1.

P(Y = y) = pE(y)(1− p)E(y),

where E(y) is the number of edges in y and E(y) is thenumber of non-edges in y .

I Rewrite using θ = log p − log(1− p):

P(Y = y) =

(p

1− p

)# of edges× (1− p)const

= exp{θ × # of edges} × 1κ(θ)

Dyadic independence ERGMs are generally tractable

Gilbert-Erdos-Rényi is a special case of dyadic independence:

Pθ(Y = y) =∏i<j

Pθ(Dij = dij)

Dyad Dij , directed case:

i j

Dyad Dij , undirected case:

i j

Dyadic independence models have drawbacks but theyI facilitate estimation;I facilitate simulation;I avoid degeneracy issue (cf. Schweinberger, 2011).

Statistical inference is “Probability in reverse”

The ERGM hypothesizes:


κ(θ,Y), y ∈ Y

PROBABILITY −→

θERGM←→ , , · · ·

←− STATISTICS

Statistical Goal:Use observed data to select from the given ERGM class — i.e.,to learn about θ.We might search for a “best” θ or a density p(θ |data).

The loglikelihood function is L(θ) = Pθ(Y = yobs)

The ERGM hypothesizes:


κ(θ,Y), y ∈ Y

I To choose a θ, we might try to search for a “best” theta bymaximizing L(θ) or

`(θ) = log L(θ) = θ>g(yobs)− logκ(θ,Y)

I Alternatively, a Bayesian approach tries to describe anentire distribution over θ values, the posterior:

p(θ |Y = yobs) ∝ L(θ)× π(θ).

Computing the likelihood is sometimes very difficult

The likelihood is L(θ) = Pθ(Y = yobs), viewed as a function of θ.

7

8

7

11

8

8

7

8 8

9

7

11

9

8

11

7

9

9

10

7

7

9

89

9

7

9

7

9

7

7

8

9

7

For this undirected, 34-nodenetwork, computing `(θ) directlymay require summation of

7,547,924,849,643,082,704,483,109,161,976,537,781,833,842,440,832,880,856,752,412,600,491,248,324,784,297,704,172,253,450,355,317,535,082,936,750,061,527,689,799,541,169,259,849,585,265,122,868,502,865,392,087,298,790,653,952

terms.

The log-likelihood may be written as an expectation

Recall: `(θ) = log L(θ) = θ>g(yobs)− logκ(θ,Y)

I Suppose we fix θ0. A bit of algebra shows that

`(θ)− `(θ0) = (θ − θ0)>g(yobs)− log E θ0

[exp

{(θ − θ0)tg(Y )

}]= (θ − θ0)>g(yobs)− log E θ0 [blah blah Y blah] .

I Thus, randomly sampling networks from Pθ0 allowsapproximation of `(θ)− `(θ0).

Example Network: High School Friendship Data

School 10: 205 Students

1

1

0

98

9

97

9

8

8

1

91

7

9

8

9

9

8

77

9

7

7

7 8

7

7

9

7

00

0

2 9

−

8

9

7

0

7

1

1

1

9

0 99

0

7

70

7

7

7

1

9

9

0

9

2

8

7

7

9

7

1

7

7

9

89

−7

7

28

9

98

7

0 8

7

98

81

8

8

1

77

0

9

9

8

7

2

77

9

88

2

7

9

7

0

1

7

79

7

90

7

7

1

8

77

9

7

8

9

1

1

0

0

7

0

1

2

0

88

−

7

79

8

7

1

2

0

87

9

1

7

1

7

7

8

2

9

7

8

7

7

0

7

7

78−

2

0

8

8

7

9

8

8

1

91 7

0

7

8

8

1

2

7

7

0 8

8

9

2

2

1

88

0 7

1

0

9 89

9I An edge indicates a mutual

friendship.I Colored labels give grade

level, 7 through 12.I Circles = female,

squares = male,triangles = unknown.

I N.B.: Missing data ignoredhere, though this could bealtered.

Fitting an ERGM to the high school dataset

I ERGM parameter estimates from Hunter et al (2008):Coefficient Coefficientedges −3.49(1.92) AD (Gr.) = 1 3.41(1.42)∗

GWESP 0.83(0.13)∗∗∗ AD (Gr.) = 2 2.42(1.48)GWD −2.01(0.35)∗∗∗ AD (Gr.) = 3 1.43(1.62)GWDSP 0.50(0.09)∗∗∗

DH (Gr. 7) 6.00(1.56)∗∗∗

NF (Gr. 8) −0.34(0.78) DH (Gr. 8) 6.48(1.64)∗∗∗

NF (Gr. 9) 0.64(0.59) DH (Gr. 9) 4.52(1.58)∗∗

NF (Gr. 10) 0.55(0.59) DH (Gr. 10) 4.96(1.59)∗∗

NF (Gr. 11) 0.97(0.60) DH (Gr. 11) 4.32(1.54)∗∗

NF (Gr. 12) 1.23(0.60)∗ DH (Gr. 12) 4.11(1.58)∗∗

NF (Gr. NA) 3.86(1.30)∗∗

DH (White) 1.55(0.68)∗

NF (Black) 0.51(0.42) DH (Black) 0.92(1.55)NF (Hisp) −0.23(0.33) DH (Hisp) 0.87(0.43)∗

NF (Nat Am) −0.21(0.32) DH (Nat Am) 1.31(0.43)∗∗

NF (Other) −0.61(0.69)NF (Race NA) 1.53(0.89)

NF (Female) 0.09(0.10) UH (Sex) 0.67(0.16)∗∗∗

NF (Sex NA) −0.18(0.47)NF stands for Node Factor. AD stands for Absolute Difference.

DH stands for Differential Homophily.UH stands for Uniform Homophily.

∗ Significant at 0.05 level ∗∗ Significant at 0.01 level ∗∗∗ Significant at 0.001 level


1

1

0

98

9

97

9

8

8

1

91

7

9

8

9

9

8

77

9

7

7

7 8

7

7

9

7

00

0

2 9

−

8

9

7

0

7

1

1

1

9

0 99

0

7

70

7

7

7

1

9

9

0

9

2

8

7

7

9

7

1

7

7

9

89

−7

7

28

9

98

7

0 8

7

98

81

8

8

1

77

0

9

9

8

7

2

77

9

88

2

7

9

7

0

1

7

79

7

90

7

7

1

8

77

9

7

8

9

1

1

0

0

7

0

1

2

0

88

−

7

79

8

7

1

2

0

87

9

1

7

1

7

7

8

2

9

7

8

7

7

0

7

7

78−

2

0

8

8

7

9

8

8

1

91 7

0

7

8

8

1

2

7

7

0 8

8

9

2

2

1

88

0 7

1

0

9 89

9

But what about “Large Networks”?

I 205 nodes does notreally qualify as large inthis context.

I The estimationtechniques usedpreviously do not scalewell.


1

1

0

98

9

97

9

8

8

1

91

7

9

8

9

9

8

77

9

7

7

7 8

7

7

9

7

00

0

2 9

−

8

9

7

0

7

1

1

1

9

0 99

0

7

70

7

7

7

1

9

9

0

9

2

8

7

7

9

7

1

7

7

9

89

−7

7

28

9

98

7

0 8

7

98

81

8

8

1

77

0

9

9

8

7

2

77

9

88

2

7

9

7

0

1

7

79

7

90

7

7

1

8

77

9

7

8

9

1

1

0

0

7

0

1

2

0

88

−

7

79

8

7

1

2

0

87

9

1

7

1

7

7

8

2

9

7

8

7

7

0

7

7

78−

2

0

8

8

7

9

8

8

1

91 7

0

7

8

8

1

2

7

7

0 8

8

9

2

2

1

88

0 7

1

0

9 89

9

Outline




Idea: Use counting process theory to model networks

12

3

4

t=16

t=3

t=11

t=20

I Goal: Model adynamically evolvingnetwork using countingprocesses.

I Methods should beapplicable to largenetwork datasets (tens orhundreds of thousandsof nodes)

I Two modeling frameworks (terminology of Butts, 2008):I Egocentric: The counting process Ni (t) = cumulative

number of “events” involving the i th node by time t .I Relational: The counting process Nij (t) = cumulative

number of “events” involving the (i , j)th node pair by time t .

NB: “Events” need not be edge additions

Counting processes may be considered multivariateI Combine the Ni(t) to give a

multivariate counting process

N(t) = (N1(t), . . . ,Nn(t)).

I Genuinely multivariate; noassumption about theindependence of Ni(t).

12

3

4

t=16

t=3

t=11

t=20

0 5 10 15 20

N(t)

t

01

2

N1(t)

N2(t) N4(t) N3(t)

Citation Networks may be modeled as egocentricI 29,557 articles; 352,807 citations; 25,004 unique times

(theoretical physics articles on arXiv)I At arrival, a paper cites others that are already in the

network.I Main dynamic development: Number of citations received.

Time

I Ni(t): Number of citations to paper i by time t .I “At-risk” indicator Ri(t): Equal to I{tarr

i < t}.

Twitter behavior provides another egocentric example

I 101,853 people; 477,768vaccination-related tweets from Aug. 2009to Jan. 2010.

I More than 4 million “follower” edgesI Of particular interest: H1N1 vaccination sentiment.I Some express − or + sentiments regarding H1N1

vaccination.I N−i (t) and N+

i (t) are numbers of such tweets by time t .I Question of interest: Which predictors (e.g., past behavior

or self / followers / followees, position in directed followingnetwork) predict the propensity to tweet − or +?

Is tweeting behavior about H1N1 vaccinations contagious?

A multivariate counting process is a submartingale

Each Ni(t) is nondecreasing in time, so N(t) may beconsidered a submartingale; i.e., it satisfies

E [N(t) |past up to time s] ≥ N(s) for all t > s.

0 5 10 15 20

N(t)

t

01

2

N1(t)

N2(t) N4(t) N3(t)

The so-called Doob-Meyer Decomposition uniquelydecomposes any submartingale

N(t) =

∫ t

0λ(s) ds + M(t) :

I λ(t) is the “signal” at time t , called the intensity functionI M(t) is the “noise,” a continuous-time Martingale.

I We will model each λi(t) or λij(t).

We use standard models for the intensity processesIn the egocentric case, consider the Cox or Aalen model for thenode i process:

I Cox Proportional Hazard Model, fixed coefficients:

λi(t |Ht−) = Ri(t)α0(t) exp(β>si(t)

),

I Aalen additive model, time-varying coefficients:

λi(t |Ht−) = Ri(t)(β0(t) + β(t)>si(t)

),

whereI Ri(t) = I(t > tarr

i ) is the “at-risk indicator”I Ht− is the past of the network up to but not including time tI α0(t) or β0(t) is the baseline hazard functionI β is the p-vector of coefficients to estimateI si(t) =

(si1(t), . . . , sip(t)

)is a vector of statistics for node i

Relational case is similar (cf. Perry and Wolfe, 2010)

I Cox Proportional Hazard Model, fixed coefficients:λij(t |Ht−) = Rij(t)α0(t) exp

(β>s(i , j , t)

),

I Aalen Additive Model, time-varying coefficients:λij(t |Ht−) = Rij(t)

(β0(t) + β(t)>s(i , j , t)

)where

I Rij(t) = I(max{tarri , tarr

j } < t < teij ) is the “at-risk indicator”I Ht− is the past of the network up to but not including time tI α0(t) or β0(t) is the baseline hazard functionI β or β(t) is the vector of coefficients to estimateI s(i , j , t) is a p-vector of statistics for pair (i , j)

For large networks, maximizing the partial likelihood inthe Cox model requires some computing tricks

Recall: The intensity process for node i is

λi(t |Ht−) = Ri(t)α0(t) exp(β>si(t)

)

Treat α0 as a nuisance parameter and take a partial likelihoodapproach: Maximize

L(β) =m∏

e=1

exp(β>sie (te)

)∑n

i=1 Ri(te) exp(β>si(te)

) =m∏

e=1

exp(β>sie (te)

)κ(te)

.

Computational Trick: Write κ(te) = κ(te−1) + ∆κ(te), thenoptimize ∆κ(te) calculation.

Fitting the Aalen model uses weighted least squares

Recall: The intensity process for node i is

λi(t |Ht−) = Ri(t)(β0(t) + β(t)>si(t)

).

I We do inference not for the βk but rather for theirtime-integrals

Bk (t) =

∫ t

0βk (s)ds. (1)

I Then (basically weighted least squares)

B(t) =∑te≤t

J(te)[W(te)>W(te)

]−1W(te)>∆N(te), (2)where

I W(t) is N(N − 1)× p with (i , j)th row Rij (t)s(i , j , t)>;I J(t) is the indicator that W(t) has full column rank.

Example statistics: Preferential Attachment

For each cited paper j already in the network. . .I First-order PA: sj1(t) =

∑Ni=1 yij(t−). “Rich get richer” effect

I Second-order PA: sj2(t) =∑

i 6=k yki(t−)yij(t−).Effect due to being cited by well-cited papers

j

Statistics in red are time-dependent. Others are fixed once jjoins the network.

NB: y(t−) is the network just prior to time t.

Example statistics: Recency PA Statistic

For each cited paper j already in the network. . .I Recency-based first-order PA (we take Tw = 180 days):

sj3(t) =∑N

i=1 yij(t−)I(t − tarri < Tw ).

Temporary elevation of citation intensity after recentcitations

j



Example statistics: Triangle Statistics

For each cited paper j already in the network. . .I “Seller” statistic: sj4(t) =

∑i 6=k yki(t−)yij(t)ykj(t−).

I “Broker” statistic: sj5(t) =∑

i 6=k ykj(t)yji(t−)yki(t−).I “Buyer” statistic: sj6(t) =

∑i 6=k yjk (t)yki(t)yji(t−).

ASeller

B

C

Broker

Buyer



Example statistics: Out-Path Statistics

For each cited paper j already in the network. . .I First-order out-degree (OD): sj7(t) =

∑Ni=1 yji(t−).

I Second-order OD: sj8(t) =∑

i 6=k yjk (t−)yki(t−).

j



Example statistics: Topic Modeling Statistics

Additional statistics, using abstract text if available, as follows:I A latent Dirichlet allocation (LDA) model (Blei et al, 2003)

is learned on the training set.

Figure from Wikipedia entry for LatentDirichlet Analysis, Feb. 2012

N=words per documentM=documentsθi =topic distribution for paper i

I We construct a vector (50 dimensions) of similaritystatistics:

sLDAj (tarr

i ) = θi ◦ θj ,

where ◦ denotes the element-wise product of two vectors.

Coefficient Estimates for LDA + P2PTR180 ModelStatistics Coefficients (β)s1 (PA) 0.01362s2 (2nd PA) 0.00012s3 (PA-180) 0.02052s4 (Seller) -0.00126s5 (Broker) -0.00066s6 (Buyer) -0.00387s7 (1st OD) 0.00090s8 (2nd OD) 0.02052

All coefficientestimates aresignificant at the0.0001 level.

ASeller

B

C

Broker

Buyer

DB

CDiverse seller effect:D more likely cited than A.

ASeller

B

C

Broker

Buyer

AB

EDiverse buyer effect:E more likely cited than C.

Twitter and H1N1 Vaccination Sentiments

Starting from late August 2009, we observed a steady increase in thenumber of relevant tweets in the United States until early November2009, after which the number of tweets dropped back to previouslevels. Figure 1A shows the absolute numbers of positive (n+),negative (n-) and neutral (n0) tweets per day in the United States. Theoverall influenza A(H1N1) vaccine sentiment score, measured as therelative difference of positive and negative tweets ((n+-n-)/(n++ n- +n0)), started at a negative value in late Summer 2009 and showedrelatively large short term fluctuations. The 14 day - movingaverage turned positive in mid October (as the vaccine becameavailable) and remained positive for the rest of the year (Figure 1B).

For vaccination sentiments measured online to be meaningful,they need to be compared to empirical data for validation. A

positive correlation between the influenza A(H1N1) vaccinationsentiment score and estimated vaccination coverage would berelevant to public health efforts because it would allow for theidentification of target areas for communication interventions. Totest for such a correlation, we used estimated influenza A(H1N1)vaccination rates up to January 2010 as provided by the CDC[12]. These estimates are based on results from the BehavioralRisk Factor Surveillance System (BRFSS) and the National 2009H1N1 Flu Survey (NHFS). We found a very strong correlation onthe level of HHS regions (weighted r = 0.78, p = 0.017; regions asdefined by the US Department of Health & Human Services)using the estimated vaccination coverages for all persons olderthan 6 months (Figure 1C), and a strong correlation at the level ofstate (weighted r = 0.52, p = 0.0046). All reported correlationvalues are Pearson product-moment correlation coefficientsbecause the variables considered for analysis are normallydistributed (Shapiro-Wilk test and Anderson-Darling test), weight-ed by the total number of tweets (n+ + n- + n0) per region.

Using data on who followed whom among users in the datasetallows us to generate a directed network of information flow whosestructure (with respect to the distribution of opinions onvaccination) can provide insight into how sentiments aredistributed (see Methods). In order to investigate if userspreferentially seek information from other users who share theiropinion, we measured assortative mixing of users with aqualitatively similar opinion on vaccination (homophily) bycalculating the assortativity coefficient r which is defined [13] as

r~

Pi eii{

Pi aibi

1{P

i aibi

where

Figure 1. (A) Total number of negative (red), positive (green), and neutral (blue) tweets relating to influenza A(H1N1) vaccinationduring the Fall wave of the 2009 pandemic. (B) Daily (gray) and 14 day moving average (blue) sentiment score during the same time. (C)Correlation between estimated vaccination rates for individuals older than 6 months, and sentiment score per HHS region (black dots) and states(gray dots). Numbers represent the ten regions as defined by the US Department of Human Health & Services. Lines shows best fit of linear regression(blue for regions, red for states).doi:10.1371/journal.pcbi.1002199.g001

Author Summary

Sentiments about vaccination can strongly affect individ-ual vaccination decisions. Measuring such sentiments - andhow they are distributed in a population - is typically adifficult and resource-intensive endeavor. We use publiclyavailable data from Twitter, a popular online social mediaservice, to measure the evolution and distribution ofsentiments towards the novel influenza A(H1N1) vaccineduring the second half of 2009, i.e. the fall wave of theH1N1 (swine flu) pandemic. We find that projectedvaccination rates based on sentiments expressed onTwitter are in very good agreement with vaccination ratesestimated by the CDC with traditional phone surveys.Looking at the online social network, we find that bothnegative and positive opinions are clustered, and that anequivalent level of clustering of vaccinations in apopulation would strongly increase disease outbreak risks.

Assessing Vaccination Sentiments with Online Media

PLoS Computational Biology | www.ploscompbiol.org 2 October 2011 | Volume 7 | Issue 10 | e1002199

Negative (red), positive (green), andneutral (blue) tweets during fall wave of

2009 H1N1 pandemic

I Salathé and Khandelwal (2011)collected over 4 million tweetsfrom over 100 000 Twitter users.

I Here, counting process ofinterest is not formation of ties; itis expression of H1N1vaccination sentiments.

N+i (t) = # of positive tweets by i before time t

N−i (t) = # of negative tweets by i before time t

Cox Model Coefficients for Twitter Dataset

Intensity of Intensity ofPositive Tweeting Negative Tweeting

f +1 : # friends who tweet + 0.0029 (p = 0.129) −0.0527 (p < 10−3)

f +2 : (+ tweets) × (+ friends) 0.0143 (p = 0.234) 0.039 (p < 10−3)

f−1 : # friends who tweet − −0.0205 (p < 10−3) −0.0003 (p = 0.799)f−2 : (− tweets) × (− friends) −0.0611 (p < 10−3) 0.006 (p < 10−3)

where the statistics here are given by

f +1 (i , t) =

∑j∈F (i,t)

# of + tweets by j up to time ttotal # of tweets by j up to time t

f +2 (i , t) =

1f +1 (i , t)

∑j∈F (i,t)

(# of + tweets by j up to time t)2

total # of tweets by j up to time t

Sensitivity Analysis for Twitter DatasetI The automatic evaluation algorithm for tweets can err.I Randomly re-classify all tweets using small test dataset

comparing human classification to automatic classification.I Repeat 200 times

Response: λ−i (negative tweeting)

−0.06 −0.04 −0.02 0.00 0.02

050

100

150

200

Coefficient

Pos. Friends’ Pos. Tweets Pos. CIs = 44.5 % − Neg. CIs = 13 %

Response: λ+i (positive tweeting)

−0.10 −0.08 −0.06 −0.04 −0.02 0.00

050

100

150

200

Coefficient

Neg. Friends’ Neg. Tweets Pos. CIs = 3 % − Neg. CIs = 29 %

Outline




Epinions.com: Example of large network dataset

Unbiased Reviews by Real People

I Members of Epinions.com can decide whether to ”trust”each other.

I “Web of Trust” combined with review ratings to determinewhich reviews are shown to the user.

I Dataset of Massa and Avesani (2007):I n = 131,828 nodesI n(n − 1) = 17.4 billion observationsI 841,372 of these are nonzero (±1)

The Goal: Cluster 131,828 users

I Basis for clustering: Patterns oftrusts and distrusts in thenetwork

I If possible: understand thefeatures of the clusters byexamining parameter estimates.

Unbiased Reviews by Real People

Notation: Throughout, we let yij be rating of j by i and y = (yij).

I We’d like to restrict attention to dyadic independenceERGMs in order to model observed (yij) data.

To model dependence, add K -component mixturestructure

Suppose nodes have latent (unobserved) “colors” Z1, . . . ,Zn.

Reality: What we observe:

Simplifying assumption: P(Y = y | Z ) =∏i<j

P(Dij = dij | Zi ,Zj)

where Dij is the state in Y of the (i , j)th pair.

Consider two examples of conditional dyadicindependence for the Epinions dataset

1. The “full model” of Nowicki and Snijders (2001):

Pθ(Dij = d | Zi = k ,Zj = l) = θd ;kl

2. A more parsimonious model:

Pθ(Dij = dij | Zi = k ,Zj = l) ∝ exp{θ−(y−ij + y−ji )

+θ∆k yji + θ∆

l yij

+θ−−y−ij y−ji + θ++y+ij y+

ji }

where y−ij = I{Yij = −1} and y+ij = I{Yij = +1}.

I When K = 5 components, these models have 109 and 12parameters, respectively.

There is a problem with the “simplifying” assumption

Our conditional independence model says

I Ziiid∼ Multinomial(1; γ1, . . . , γK );

I Pθ(Y = y | Z ) =∏i<j

Pθ(Dij = dij | Zi ,Zj)

I Not so simple when we do not observe Z .

I The full (unconditional) loglikelihood is rather complicated:

`(γ, θ) = log∑

z

Pγ(Z = z)Pθ(Y = y | Z = z)

Approximate maximum likelihood estimation uses avariational EM algorithm

I For MLE, goal is to maximize theloglikelihood `(γ, θ).

I Basic idea: Establish lower bound

J(γ, θ, α) ≤ `(γ, θ) (3)

after augmenting parameters by adding α.I Create an EM-like algorithm guaranteed to

increase J(γ, θ, α) at each iteration.I If we maximize the lower bound, then we’re

hoping that the inequality (3) will be tightenough to put us close to a maximum of`(γ, θ).

We adapt the variational EM idea of Daudin, Picard, & Robin (2008).

We may derive a lower bound by simple algebra

I Clever variational idea: Augment the parameter set, letting

αik = P(Zi = k) for all 1 ≤ i ≤ n and 1 ≤ k ≤ K .

I Let Aα(Z ) =∏

i Mult(zi ;αi) denote the joint dist. of Z .I Direct calculation gives

J(γ, θ, α)def= `(γ, θ)− KL {Aα(Z ),Pγ,θ(Z |Y )}= . . .

= Eα [log Pγ.θ(Y ,Z )]− H [Aα(Z )] .

I Thus, an EM-like algorithm consists of alternately:I maximizing J(γ, θ, α) with respect to α (“E-step”)I maximizing Eα [log Pγ,θ(Y ,Z )] with respect to γ, θ

(“M-step”)

The variational E-step may be modified using a(non-variational) MM algorithm

I Idea: Use a “generalized variational E-step” in whichJ(γ, θ, α) is increased but not necessarily maximized.

I To this end, we create a surrogate function

Q(α, γ(t), θ(t), α(t))

of α, where t is the counter of the iteration number.

I The surrogate function is aminorizer of J(γ, θ, α):It has the property that maximizingor increasing its value will guaranteean increase in the value of J(γ, θ, α).

In the figure, the red curve minorizes f (x) at x0.

x0

f((x0))

Construction of the minorizer of J(γ, θ, α) usesstandard MM algorithm methods

J(γ, θ, α) =∑i<j

K∑k=1

K∑`=1

αikαjl logπdij ;kl(θ)

+n∑

i=1

C∑k=1

αik (log γk − logαik ) .

We may define a minorizing function as follows:

Q(α, γ, θ, α(t)) =∑i<j

K∑k=1

K∑`=1

α2ik

α(t)j`

2α(t)ik

+ α2j`α

(t)ik

2α(t)j`

logπdij ;kl(θ)

+n∑

i=1

K∑k=1

αik

(log γk − logα(t)

ik −αik

α(t)ik

+ 1

).

I Can be maximized (in α) using quadratic programming.

The parsimonious model for the Epinions dataset

Pθ(Dij = dij | Zi = k ,Zj = l) ∝ exp{θ−(y−ij + y−ji )

+θ∆k yji + θ∆

l yij

+θ−−y−ij y−ji + θ++y+ij y+

ji }

where y−ij = I{Yij = −1} and y+ij = I{Yij = +1}.

I NB: The term θ+(y+ij + y+

ji ) is missing to avoid perfectcollinearity.

Dyad Dij , directedcase:

i j

I θ−: Overall tendency toward distrustI θ∆

k : Category-specific trustednessI θ−−: lex talionis tendency (eye for an eye)I θ++: quid pro quo tendency (one good turn. . . )

Parameter estimates themselves are of interest

Parameter ConfidenceParameter Estimate IntervalNegative edges (θ−) −24.020 (−24.029,−24.012)Positive edges (θ+) 0 —Negative reciprocity (θ−−) 8.660 (8.614,8.699)Positive reciprocity (θ++) 9.899 (9.891,9.907)Cluster 1 Trustworthiness (θ∆

1 ) −6.256 (−6.260,−6.251)Cluster 2 Trustworthiness (θ∆




5 ) −15.212 (−15.225,−15.200)

95% Confidence intervals based on parametric bootstrap using500 simulated networks.NB: There are some strange aspects of the bootstrap wecannot explain yet.

Multiple starting points converge to the same solution

Trace plots from 100 different randomly selected startingparameter values:

0 1000 2000 3000 4000Iterations

−8400000

−7800000

−7200000

0 1000 2000 3000 4000Iterations

−16

−14

−12

−10

−8−6

Loglikelihood values Trustedness parameters

Full (109-parameter) model results look nothing like this.

We may use average ratings of reviews by other usersas a way to “ground-truth” the clustering solutions

659,290 articles categorized by author’s highest-probabilitycomponent. (Vertical axis is average article rating.)

Parsimonious (12-parameter) Model Full (109-parameter) Model

−500

050

010

00

Size of Cluster464 2424 7992 119218 1729

−500

050

010

00

Size of Cluster845 808 4050 13565 112559

ERGMs may not be a great way to model largenetworks with dependencies, but. . .

I The ERGM framework is useful because it forcesresearchers to think about which network statistics areimportant.

I Alternative models can exploit similar ways of thinkingabout networks, or even exploit ERGMs themselves.

Cited References: ERGMs

Erdos, P and Rényi, AOn Random Graphs IPublicationes Mathematicae [Debrecen], 1959.

Gilbert, E. N.Random GraphsAnnals of Mathematical Statistics, 1959.

Hunter, D. R., Goodreau, S. M., and Handcock M. S.Goodness of fit for social network modelsJ. Am. Stat. Assoc., 2008.

Cited References: Counting Processes for Networks

Brandes, U., Lerner, J., and Snijders, T. A. B.Networks evolving step by step: Statistical analysis of dyadic event data.In Advances in Social Network Analysis and Mining, pp. 200–205. IEEE, 2009.

Butts, C.T.A relational event framework for social action.Sociological Methodology, 38(1):155–200, 2008.

Cox, D. R.Regression models and life-tables.Journal of the Royal Statistical Society, Series B, 34:187–220, 1972.

Perry, P. O. and Wolfe, P. J.Point process modeling for directed interaction networksJournal of the Royal Statistical Society, Series B, to appear.

Salathé, M., Vu, D. Q., Khandelwal, S., and Hunter, D. R.The Dynamics of Health Behavior Sentiments on a Large Social NetworkEPJ Data Science, 2:4, 2013.

Vu, D. Q., Asuncion, A. U., Hunter, D. R., and Smyth, P.Dynamic Egocentric Models for Citation Networks,Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 857–864, 2011.

Vu, D. Q., Asuncion, A. U., Hunter, D. R., and Smyth, P.Continuous-Time Regression Models for Longitudinal NetworksAdvances in Neural Information Processing Systems 24 (NIPS 2011), to appear.

Cited References: Variational EM for Large Networks

Daudin, J. J., Picard, F., and Robin, S.A Mixture Model for Random Graphs.Statistics & Computing, 2008.

Nowicki, K and Snijders, T. A. B.Estimation and Prediction for Stochastic Blockstructures.Journal of the American Statistical Association, 2001.

Vu, D. Q., Hunter, D. R., and Schweinberger, M.Model-Based Clustering of Large NetworksAnnals of Applied Statistics, to appear.

A FEW EXTRA SLIDES

Maximum Pseudolikelihood: Intuition

I What if we assume that there is no dependence (or veryweak dependence) among the Yij?

I In other words, what if we approximate the marginalP(Yij = 1) by the conditional P(Yij = 1|Y c

ij = ycij )?

I Then the Yij are independent with

logP(Yij = 1)

P(Yij = 0)= θ>δ(yobs)ij ,

so we obtain an estimate of θ using straightforward logisticregression.

I Result: The maximum pseudolikelihood estimate.I For independence models, MPLE = MLE!

MLE vs. MPLE

Far better an approximate answer to the ‘right’question, which is often vague, than an ‘exact’ answerto the wrong question, which can always be madeprecise.

— John W. Tukey

I MLE (maximum likelihood estimation): Well-establishedmethod but very hard because the normalizing constantκ(α) is difficult to evaluate, so we approximate it instead.

I MPLE (maximum pseudo-likelihood estimation): Easy todo using logistic regression, but based on anindependence assumption that is often not justified.

Several authors, notably van Duijn et al. (2009), argue forcefullyagainst the use of MPLE (except when MLE=MPLE!).

Model construction and Testing

Dataset: arXiv-TH. High-energy physics theory articles,Jan. 93–Apr. 03. Timestamps are continuous time; abstract textis included.(29,557 articles; 352,807 citations; 25,004 unique times)

1. Statistics-building phase (19,004 unique event times):Construct network history and build up network statistics.

2. Training phase (1000 unique event times): Constructpartial likelihood and estimate model coefficients.

3. Test phase (5000 unique event times): Evaluatepredictive capability of the learned model.

Statistics-building is ongoing even through the training and testphases. The phases are split along citation event times.

Recall PerformanceRecall: Proportion of true citations among largest Klikelihoods.

0 5000 10000 150000

0.2

0.4

0.6

0.8

1

Cut−point K

Rec

all

PAP2PTP2PTR180LDALDA+P2PTR180

I PA: pref. attach only (s1(t)); P2PT: s1, . . . , s8 except s3;I P2PTR180: s1, . . . , s8; LDA: LDA stats only

Social networks may be modeled as “relational”

I Irvine: Online social network of studentsanteater

I 1899 users; 20 296 directed contact edgesI Links are non-recurrent; i.e., Nij(t) is either 0 or 1.I “At-risk” indicator Rij(t) = I{max(tarr

i , tarrj ) < t < teij}.

contacteecontacter date1 14155 2004-06-15 12:00:00.0001 2238 2004-06-15 12:00:00.0001 14275 2004-06-15 12:00:00.000...13099 7683 2004-06-17 16:31:51.04015231 14752 2004-06-17 16:31:51.040...45087 7610 2007-10-31 12:23:15.68316719 61 2007-10-31 13:28:38.67048758 1 2007-10-31 13:47:16.843

!

Relational Example: Modeling a network of contacts

I Irvine: Online social network of studentsanteater

I 1899 users; 20 296 directed contact edges

Some of the statistics in the model:I Sender out-degree: s1(i , j , t) =

∑h∈V ,h 6=i Nih(t−)

I Reciprocity: s5(i , j , t) = Nji(t−)

I Transitivity: s6(i , j , t) =∑

h∈V ,h 6=i,j Nih(t−)Nhj(t−)

I Shared contacters: s9(i , j , t) =∑

h∈V ,h 6=i,j Nhi(t−)Nhj(t−)

Aalen model estimates for Irvine Data Set

I Aalen coefficients suggest two distinct phases of networkevolution, consistent with an independent analysis(Panzarasa et al, 2009).

I On prediction experiments, Aalen/Cox outperforms logisticregression.

(a) Sender Out-Degree (b) Reciprocity

6/1/04 7/21/04 9/9/04Time

03e−5

Coefficient

6/1/04 7/21/04 9/9/04Time

0.02

Coefficient

(c) Transitivity (d) Shared Contacters

6/1/04 7/21/04 9/9/04Time

0.0003

Coefficient

6/1/04 7/21/04 9/9/04Time

−5e−5

0Coefficient

Adaptive LR (1:10)Adaptive LR (1:50)Adaptive CoxAalen (Uniform−1)Aalen (Uniform−10)

1 5 10 15 20Cut−Point K

0.2

0.3

0.4

Rec

all

beyond ergms - scalable methods for the statistical...

Documents