topic models for social network analysis and bibliometrics andrew mccallum computer science...

111
Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

Upload: adrian-hart

Post on 23-Dec-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Models forSocial Network Analysis

and Bibliometrics

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with Xuerui Wang, Natasha Mohanty,Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

Page 2: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Goal:

Mine actionable knowledgefrom unstructured text.

Page 3: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

From Text to Actionable Knowledge

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Page 4: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Uncertainty Info

Emerging Patterns

Joint Inference

Page 5: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

ProbabilisticModel

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Geetor…], [Domingos…]

Discriminatively-trained undirected graphical models

Complex Inference and LearningJust what we researchers like to sink our teeth into!

Unified Model

Page 6: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Context

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Leveraging Text in Social Network Analysis

Joint inferenceamong detailed steps

Page 7: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 8: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Social Network in an Email Dataset

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 9: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Clustering words into topics withLatent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Sample a distributionover topics,

For each document:

Sample a topic, z

For each word in doc

Sample a wordfrom the topic, w

Example:

70% Iraq war30% US election

Iraq war

“bombing”

GenerativeProcess:

Page 10: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

STORYSTORIESTELL

CHARACTERCHARACTERS

AUTHORREADTOLD

SETTINGTALESPLOT

TELLINGSHORTFICTIONACTIONTRUE

EVENTSTELLSTALENOVEL

MINDWORLDDREAMDREAMSTHOUGHT

IMAGINATIONMOMENT

THOUGHTSOWNREALLIFE

IMAGINESENSE

CONSCIOUSNESSSTRANGEFEELINGWHOLEBEINGMIGHTHOPE

WATERFISHSEASWIM

SWIMMINGPOOLLIKESHELLSHARKTANK

SHELLSSHARKSDIVING

DOLPHINSSWAMLONGSEALDIVE

DOLPHINUNDERWATER

DISEASEBACTERIADISEASESGERMSFEVERCAUSECAUSEDSPREADVIRUSES

INFECTIONVIRUS

MICROORGANISMSPERSON

INFECTIOUSCOMMONCAUSING

SMALLPOXBODY

INFECTIONSCERTAIN

Example topicsinduced from a large collection of text

FIELDMAGNETICMAGNETWIRE

NEEDLECURRENT

COILPOLESIRON

COMPASSLINESCORE

ELECTRICDIRECTION

FORCEMAGNETS

BEMAGNETISM

POLEINDUCED

SCIENCESTUDY

SCIENTISTSSCIENTIFIC

KNOWLEDGEWORK

RESEARCHCHEMISTRY

TECHNOLOGYMANY

MATHEMATICSBIOLOGYFIELD

PHYSICSLABORATORY

STUDIESWORLD

SCIENTISTSTUDYINGSCIENCES

BALLGAMETEAM

FOOTBALLBASEBALLPLAYERS

PLAYFIELD

PLAYERBASKETBALL

COACHPLAYEDPLAYING

HITTENNISTEAMSGAMESSPORTSBAT

TERRY

JOBWORKJOBS

CAREEREXPERIENCEEMPLOYMENTOPPORTUNITIES

WORKINGTRAININGSKILLS

CAREERSPOSITIONS

FINDPOSITIONFIELD

OCCUPATIONSREQUIRE

OPPORTUNITYEARNABLE

[Tennenbaum et al]

Page 11: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

STORYSTORIESTELL

CHARACTERCHARACTERS

AUTHORREADTOLD

SETTINGTALESPLOT

TELLINGSHORTFICTIONACTIONTRUE

EVENTSTELLSTALENOVEL

MINDWORLDDREAMDREAMSTHOUGHT

IMAGINATIONMOMENT

THOUGHTSOWNREALLIFE

IMAGINESENSE

CONSCIOUSNESSSTRANGEFEELINGWHOLEBEINGMIGHTHOPE

WATERFISHSEASWIM

SWIMMINGPOOLLIKESHELLSHARKTANK

SHELLSSHARKSDIVING

DOLPHINSSWAMLONGSEALDIVE

DOLPHINUNDERWATER

DISEASEBACTERIADISEASESGERMSFEVERCAUSECAUSEDSPREADVIRUSES

INFECTIONVIRUS

MICROORGANISMSPERSON

INFECTIOUSCOMMONCAUSING

SMALLPOXBODY

INFECTIONSCERTAIN

FIELDMAGNETICMAGNETWIRE

NEEDLECURRENT

COILPOLESIRON

COMPASSLINESCORE

ELECTRICDIRECTION

FORCEMAGNETS

BEMAGNETISM

POLEINDUCED

SCIENCESTUDY

SCIENTISTSSCIENTIFIC

KNOWLEDGEWORK

RESEARCHCHEMISTRY

TECHNOLOGYMANY

MATHEMATICSBIOLOGYFIELD

PHYSICSLABORATORY

STUDIESWORLD

SCIENTISTSTUDYINGSCIENCES

BALLGAMETEAM

FOOTBALLBASEBALLPLAYERS

PLAYFIELDPLAYER

BASKETBALLCOACHPLAYEDPLAYING

HITTENNISTEAMSGAMESSPORTSBAT

TERRY

JOBWORKJOBS

CAREEREXPERIENCEEMPLOYMENTOPPORTUNITIES

WORKINGTRAININGSKILLS

CAREERSPOSITIONS

FINDPOSITIONFIELD

OCCUPATIONSREQUIRE

OPPORTUNITYEARNABLE

Example topicsinduced from a large collection of text

[Tennenbaum et al]

Page 12: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

From LDA to Author-Recipient-Topic(ART) [McCallum et al 2005]

Page 13: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Inference and Estimation

Gibbs Sampling:- Easy to implement- Reasonably fast

r

Page 14: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Enron Email Corpus

• 250k email messages• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)From: [email protected]: [email protected]: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra PerlingiereEnron North America Corp.Legal Department1400 Smith Street, EB 3885Houston, Texas [email protected]

Page 15: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics, and prominent senders / receiversdiscovered by ARTTopic names,

by hand

Page 16: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics, and prominent senders / receiversdiscovered by ART

Beck = “Chief Operations Officer”Dasovich = “Government Relations Executive”Shapiro = “Vice President of Regulatory Affairs”Steffes = “Vice President of Government Affairs”

Page 17: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Comparing Role Discovery

connection strength (A,B) =

distribution overauthored topics

Traditional SNA

distribution overrecipients

distribution overauthored topics

Author-TopicART

Page 18: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Comparing Role Discovery Tracy Geaconne Dan McCarty

Traditional SNA Author-TopicART

Similar roles Different rolesDifferent roles

Geaconne = “Secretary”McCarty = “Vice President”

Page 19: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Traditional SNA Author-TopicART

Different roles Very differentVery similar

Blair = “Gas pipeline logistics”Watson = “Pipeline facilities planning”

Comparing Role Discovery Lynn Blair Kimberly Watson

Page 20: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

McCallum Email Corpus 2004

• January - October 2004• 23k email messages• 825 people

From: [email protected]: NIPS and ....Date: June 14, 2004 2:27:41 PM EDTTo: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.CALO registration receipt.

Thanks,Kate

Page 21: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Four most prominent topicsin discussions with ____?

Page 22: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with
Page 23: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Two most prominent topicsin discussions with ____?

Words Problove 0.030514house 0.015402

0.013659time 0.012351great 0.011334hope 0.011043dinner 0.00959saturday 0.009154left 0.009154ll 0.009009

0.008282visit 0.008137evening 0.008137stay 0.007847bring 0.007701weekend 0.007411road 0.00712sunday 0.006829kids 0.006539flight 0.006539

Words Probtoday 0.051152tomorrow 0.045393time 0.041289ll 0.039145meeting 0.033877week 0.025484talk 0.024626meet 0.023279morning 0.022789monday 0.020767back 0.019358call 0.016418free 0.015621home 0.013967won 0.013783day 0.01311hope 0.012987leave 0.012987office 0.012742tuesday 0.012558

Page 24: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with
Page 25: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Role-Author-Recipient-Topic Models

Page 26: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Results with RART:People in “Role #3” in Academic Email

• olc lead Linux sysadmin• gauthier sysadmin for CIIR group• irsystem mailing list CIIR sysadmins• system mailing list for dept. sysadmins• allan Prof., chair of “computing

committee”• valerie second Linux sysadmin• tech mailing list for dept. hardware• steve head of dept. I.T. support

Page 27: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Roles for allan (James Allan)

• Role #3 I.T. support• Role #2 Natural Language

researcher

Roles for pereira (Fernando Pereira) • Role #2 Natural Language researcher• Role #4 SRI CALO project participant• Role #6 Grant proposal writer• Role #10 Grant proposal coordinator• Role #8 Guests at McCallum’s house

Page 28: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Traditional SNA Author-TopicART

Block structured NotNot

ART: Roles but not Groups

Enron TransWestern Division

Page 29: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 30: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Groups and Topics

• Input:– Observed relations between people– Attributes on those relations (text, or categorical)

• Output:– Attributes clustered into “topics”– Groups of people---varying depending on topic

Page 31: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Discovering Groups from Observed Set of Relations

Admiration relations among six high school students.

Student Roster

AdamsBennettCarterDavisEdwardsFrederking

Academic Admiration

Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)

Page 32: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Adjacency Matrix Representing Relations

A B C D E FABCDEF

A B C D E FG1G2G1G2G3G3

G1G2G1G2G3G3

ABCDEF

A C B D E FG1G1G2G2G3G3

G1G1G2G2G3G3

ACBDEF

Student Roster

AdamsBennettCarterDavisEdwardsFrederking

Academic Admiration

Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)

Page 33: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Group Model: Partitioning Entities into Groups

2Sv

β

2Gγ α

Stochastic Blockstructures for Relations[Nowicki, Snijders 2001]

S: number of entities

G: number of groups

Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]

BetaDirichlet

Binomial

SgMultinomial

Page 34: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Two Relations with Different Attributes

A C B D E FG1G1G2G2G3G3

G1G1G2G2G3G3

A C E B D FG1G1G1G2G2G2

G1G1G1G2G2G2

ACEBDF

Student Roster

AdamsBennettCarterDavisEdwardsFrederking

Academic Admiration

Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)

Social Admiration

Soci(A, B) Soci(A, D) Soci(A, F)Soci(B, A) Soci(B, C) Soci(B, E)Soci(C, B) Soci(C, D) Soci(C, F)Soci(D, A) Soci(D, C) Soci(D, E)Soci(E, B) Soci(E, D) Soci(E, F)Soci(F, A) Soci(F, C) Soci(F, E)

ACBDEF

Page 35: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

The Group-Topic Model: Discovering Groups and Topics Simultaneously

bNw

t

B

T

φ

η

DirichletMultinomial

Uniform

2Sv

β

2Gγ α

Beta

Dirichlet

Binomial

SgMultinomial

T

[Wang, Mohanty, McCallum 2006]

Page 36: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Inference and EstimationGibbs Sampling:- Many r.v.s can be integrated out- Easy to implement- Reasonably fast

We assume the relationship is symmetric.

Page 37: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Dataset #1:U.S. Senate

• 16 years of voting records in the US Senate (1989 – 2005)

• a Senator may respond Yea or Nay to a resolution

• 3423 resolutions with text attributes (index terms)

• 191 Senators in total across 16 years

S.543 Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes. Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2) Latest Major Action: 12/19/1991 Became Public Law No: 102-242. Index terms: Banks and banking Accounting Administrative fees Cost control Credit Deposit insurance Depressed areas and other 110 terms

Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……

Page 38: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics Discovered (U.S. Senate)Education Energy

MilitaryMisc.

Economic

education energy government federalschool power military laboraid water foreign insurance

children nuclear tax aiddrug gas congress tax

students petrol aid businesselementary research law employeeprevention pollution policy care

Mixture of Unigrams

Group-Topic Model

Education

+ DomesticForeign Economic

Social Security

+ Medicareeducation foreign labor socialschool trade insurance securityfederal chemicals tax insuranceaid tariff congress medical

government congress income caretax drugs minimum medicare

energy communicable wage disabilityresearch diseases business assistance

Page 39: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Groups Discovered (US Senate)

Groups from topic Education + Domestic

Page 40: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Senators Who Change Coalition the most Dependent on Topic

e.g. Senator Shelby (D-AL) votes with the Republicans on Economicwith the Democrats on Education + Domesticwith a small group of maverick Republicans on Social Security + Medicaid

Page 41: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Dataset #2:The UN General Assembly

• Voting records of the UN General Assembly (1990 - 2003)

• A country may choose to vote Yes, No or Abstain

• 931 resolutions with text attributes (titles)

• 192 countries in total

• Also experiments later with resolutions from 1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting

The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:

In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Page 42: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics Discovered (UN)

Everything Nuclear

Human RightsSecurity

in Middle East

nuclear rights occupiedweapons human israel

use palestine syriaimplementation situation security

countries israel calls

Mixture ofUnigrams

Group-TopicModel

NuclearNon-proliferation

Nuclear Arms Race

Human Rights

nuclear nuclear rightsstates arms humanunited prevention palestine

weapons race occupiednations space israel

Page 43: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

GroupsDiscovered(UN)The countries list for each group are ordered by their 2005 GDP (PPP) and only 5 countries are shown in groups that have more than 5 members.

Page 44: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 45: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Latent Dirichlet Allocation

[Blei, Ng, Jordan, 2003]

N

n

w

z

θ

α

β

LDA 100motiondetectionfieldopticalflowsensitivemovingfunctionaldetectcontrastlightdimensionalintensitycomputermtmeasuresocclusiontemporaledgereal

“motion,some junk”

LDA 20visual modelmotionfieldobjectimageimagesobjectsfieldsreceptiveeyepositionspatialdirectiontargetvisionmultiplefigureorientationlocation

“images,motion, eyes”

Page 46: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Correlated Topic Model[Blei, Lafferty, 2005]

N

n

w

z

η

β

Square matrix ofpairwise correlations.

logisticnormal

Page 47: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Machine

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.QuickTime™ and a

TIFF (Uncompressed) decompressorare needed to see this picture.

Page 48: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model[Li, McCallum, 2005]

α22

α31 α33

α41 α42 α43 α44 α45

Model stru

cture

,

not the g

raphical m

odel

α32

word1 word2 word3 word4 word5 word6 word7 word8

Given: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children and words at leaves

For each document: Sample a multinomial from each Dirichlet

For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf.Generate the word at the leaf

α21

α11

Like a Polya tree, but DAG shaped, with arbitrary number of children.

Thanks to Michael Jordan

for suggesting the name

Page 49: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model[Li, McCallum, 2005]

Model stru

cture

,

not the g

raphical m

odel

DAG may have arbitrary structure• arbitrary depth• any number of children per node• sparse connectivity• edges may skip layers

α22

α31 α33

α41 α42 α43 α44 α45

α32

word1 word2 word3 word4 word5 word6 word7 word8

α21

α11

Page 50: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model[Li, McCallum, 2005]

Model stru

cture

,

not the g

raphical m

odel

Distributions over words (like “LDA topics”)

Distributions over topics;mixtures, representing topic correlations

Distributions over distributions over topics...

Some interior nodes could contain one multinomial, used for all documents.(i.e. a very peaked Dirichlet)

α22

α31 α33

α41 α42 α43 α44 α45

α32

word1 word2 word3 word4 word5 word6 word7 word8

α21

α11

Page 51: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model[Li, McCallum, 2005]

Model stru

cture

,

not the g

raphical m

odel

Estimate all these Dirichlets from data.

Estimate model structure from data. (number of nodes, and connectivity)

α22

α31 α33

α41 α42 α43 α44 α45

α32

word1 word2 word3 word4 word5 word6 word7 word8

α21

α11

Page 52: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Special CasesLatent Dirichlet Allocation

α41 α42 α43 α44 α45

α32

word1 word2 word3 word4 word5 word6 word7 word8

Page 53: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Special CasesHierarchical Latent Dirichlet Allocation (HLDA)

α41

α32

α51

α33

α42

word1 word2 word3 word4 word5 word6 word7 word8

α22 α23 α24α21

α11

α31 α34

Very low variance Dirichlet at root

TheHLDAhier.

Each leaf of theHLDA topic hier. hasa distr. over nodeson path to the root.

Page 54: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation on a Topic Hierarchy

α41

α32

α51

α33

α42

word1 word2 word3 word4 word5 word6 word7 word8

α22 α23 α24α21

α11

α31 α34

Combining best of HLDA and Pachinko Allocation

TheHLDAhier.

...representingcorrelations amongtopic leaves.

α12

α00

ThePAMDAG.

Page 55: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model... with two layers, no skipping layers,fully-connected from one layer to the next.

α11

α21 α23

α31 α32 α33 α34 α35

α22

word1 word2 word3 word4 word5 word6 word7 word8

“sub-topics”

“super-topics”

fixed multinomials

Another special case would select only one super-topic per document.

Page 56: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Graphical Models

N

n

w

z1

z2 zm

β…

N

n

w

z

θ

α

β

LDA PAM(with fixed multinomials for topics)

Page 57: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Pachinko Allocation Model

• Likelihood

• Estimate z’s by Gibbs sampling

• Estimate α’s by moment matching.

P(w,z,θ,ϕ |α ,β ) = P(ϕ i |β )i=1

T

∏ ×

∏ ∏∏∏= = =

−=

N

i

n

jijmj

m

kkijijk

q

jjij zwPzzPP

1 1 2)1(

1

))),|(),|(()|(( ϕθαθ

∏ ∑∑= −

−− +

+

+∝

−m

k k kijm

zijijm

k kzkiji

zzijkkiji

ijij zC

wzC

zC

zzCzwzP ijm

kij

ijkkij

2 ' '' ',)1(

,)1(

)(

),(

)(

),(),,,|(

)1(

)1(

β

β

α

αβα

Page 58: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Preliminary Experimental Results

• Topic Coherence

• Likelihood on held-out data

• Document classification

Page 59: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

NIPS Dataset

•1740 papers •13649 Words•2,301,375 tokens

NIPS Conference PapersVolumes 0-12

Spanning 1987 – 1999. Prepared by Sam Roweis.

Page 60: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Coherence Comparison

LDA 100estimationlikelihoodmaximumnoisyestimatesmixturescenesurfacenormalizationgeneratedmeasurementssurfacesestimatingestimatediterativecombinedfiguredivisivesequenceideal

LDA 20models modelparametersdistributionbayesianprobabilityestimationdatagaussianmethodslikelihoodemmixtureshowapproachpaperdensityframeworkapproximationmarkov

Example super-topic33 input hidden units function number27 estimation bayesian parameters data methods24 distribution gaussian markov likelihood mixture11 exact kalman full conditional deterministic1 smoothing predictive regularizers intermediate slope

“models,estimation, stopwords”

“estimation,some junk”

PAM 100estimationbayesianparametersdatamethodsestimatemaximumprobabilisticdistributionsnoisevariablevariablesnoisyinferencevarianceentropymodelsframeworkstatisticalestimating

“estimation”

Page 61: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Coherence Comparison

LDA 100networklayermultitrainedhighperceptronlayersgivetypenonlinearityperceptronsmodulemodifiedmatchedperformedprovideddesignedsamplesstudymode

PAM 100inputhiddenunitsfunctionnumberfunctionsnetworksoutputlinearlayersingleresultsweightinputsbasisparametersstandardnetworkpatternsstudy

LDA 20architecturenetworkinputoutputstructurepaperleveltaskworksequencessequencemultipleproblemshowsconnectionistnetworkscontextperformscalelearn

“neural networks,some junk”

“neural networks,some junk”

“neural networks,much less junk”

Page 62: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Blind Topic Evaluation

• Randomly select 25 similar pairs of topics generated from PAM and LDA

• 5 people• Each asked “select the topic

in each pair that you find most semantically coherent.”

LDA PAM

5 votes 0 5

>= 4 votes 3 8

>= 3 votes 9 16

Topic countsPrefer PAM

Page 63: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Correlations in PAM

5000 research paper abstracts, from across all CS

Numbers on edges are supertopics’ Dirichlet parameters

Page 64: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Likelihood on Held Out Data

• Likelihood comparison– NIPS abstracts– Train the model with 75% data– Calculate likelihood on 25% data

• Calculate likelihood by– Sampling many, many documents from the model– Estimating a simple mixture of multinomials from these– Calculate the likelihood of data under this simple

mixture.

Page 65: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Likelihood Comparison

Varying number of topics

Page 66: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Document Classification

Test Accuracy (%)

“Comp5” from 20 Newsgroups corpus.

Train on 25%, test on 75%Like Naive Bayes, but use LDA/PAM per-class instead of multinomial.

~2.5% increase

Page 67: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 68: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Want to Model Trends over Time

• Is prevalence of topic growing or waning?

• Pattern appears only briefly– Capture its statistics in focused way– Don’t confuse it with patterns elsewhere in time

• How do roles, groups, influence shift over time?

Page 69: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics over Time (TOT)

w t

α

Nd

z

D

T

T

Betaover time

Multinomialover words

β γ

Dirichlet

multinomialover topics

topicindex

wordtime

stamp

Dirichletprior

Uniformprior

w

t

Nd

z

D

T

Multinomialover words

β

time stamp

multinomialover topics

topicindex

word

Dirichletprior

distributionon timestamps

T

Betaover time

γ

Uniformprior

Page 70: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

State of the Union Address

208 Addresses delivered between January 8, 1790 and January 29, 2002.

To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopping was applied.

•17156 ‘documents’

•21534 words

•669,425 tokens

Our scheme of taxation, by means of which this needless surplus is takenfrom the people and put into the public Treasury, consists of a tariff orduty levied upon importations from abroad and internal-revenue taxes leviedupon the consumption of tobacco and spirituous and malt liquors. It must beconceded that none of the things subjected to internal-revenue taxationare, strictly speaking, necessaries. There appears to be no just complaintof this taxation by the consumers of these articles, and there seems to benothing so well able to bear the burden without hardship to any portion ofthe people.

1910

Page 71: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Comparing

TOT

against

LDA

Page 72: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

TOT on 17 years of NIPS proceedings

Page 73: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

TOT on 17 years of NIPS proceedings

TOT LDA

Page 74: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

TOT

versus

LDA

on my email

Page 75: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

TOT improves ability to Predict Time

Predicting the year of a State-of-the-Union address.

L1 = distance between predicted year and actual year.

Page 76: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 77: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topics Modeling Phrases

• Topics based only on unigrams often difficult to interpret

• Topic discovery itself is confused because important meaning / distinctions carried by phrases.

• Significant opportunity to provide improved language models to ASR, MT, IR, etc.

Page 78: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical N-gram Model

z1 z2 z3 z4

w1 w2 w3 w4

y1 y2 y3 y4

1

T

D

. . .

. . .

. . .

α

WTW

γ1 γ2β 2

Page 79: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

LDA Topic

LDA

algorithmsalgorithmgenetic

problemsefficient

Topical N-grams

genetic algorithmsgenetic algorithm

evolutionary computationevolutionary algorithms

fitness function

Page 80: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Comparison

learningoptimalreinforcementstateproblemspolicydynamicactionprogrammingactionsfunctionmarkovmethodsdecisionrlcontinuousspacessteppoliciesplanning

LDA

reinforcement learningoptimal policydynamic programmingoptimal controlfunction approximatorprioritized sweepingfinite-state controllerlearning systemreinforcement learning_rlfunction approximatorsmarkov decision problemsmarkov decision processeslocal searchstate-action pairmarkov decision processbelief statesstochastic policyaction selectionupright positionreinforcement learning methods

policyactionstatesactionsfunctionrewardcontrolagentq-learningoptimalgoallearningspacestepenvironmentsystemproblemstepssuttonpolicies

Topical N-grams (2) Topical N-grams (1)

Page 81: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Comparison

motionvisualfieldpositionfiguredirectionfieldseyelocationretinareceptivevelocityvisionmovingsystemflowedgecenterlightlocal

LDA

receptive fieldspatial frequencytemporal frequencyvisual motionmotion energytuning curveshorizontal cellsmotion detectionpreferred directionvisual processingarea mtvisual cortexlight intensitydirectional selectivityhigh contrastmotion detectorsspatial phasemoving stimulidecision strategyvisual stimuli

motionresponsedirectioncellsstimulusfigurecontrastvelocitymodelresponsesstimulimovingcellintensitypopulationimagecentertuningcomplexdirections

Topical N-grams (2) Topical N-grams (1)

Page 82: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Comparison

wordsystemrecognitionhmmspeechtrainingperformancephonemewordscontextsystemsframetrainedspeakersequencespeakersmlpframessegmentationmodels

LDA

speech recognitiontraining dataneural networkerror ratesneural nethidden markov modelfeature vectorscontinuous speechtraining procedurecontinuous speech recognitiongamma filterhidden controlspeech productionneural netsinput representationoutput layerstraining algorithmtest setspeech framesspeaker dependent

speechwordtrainingsystemrecognitionhmmspeakerperformancephonemeacousticwordscontextsystemsframetrainedsequencephoneticspeakersmlphybrid

Topical N-grams (2) Topical N-grams (1)

Page 83: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 84: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Social Networks in Research Literature

• Better understand structure of our own research area.

• Structure helps us learn a new field.• Aid collaboration• Map how ideas travel through social networks

of researchers.

• Aids for hiring and finding reviewers!

Page 85: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Traditional Bibliometrics

• Analyses a small amount of data(e.g. 19 articles from a single issue of a journal)

• Uses “journal” as a proxy for “research topic”(but there is no journal for information extraction)

• Uses impact measures almost exclusively based on simple citation counts.

How can we use topic models to create new, interesting impact measures?

Page 86: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Our Data

• Over 1 million research papers, gathered as part of Rexa.info portal.

• Cross linked references / citations.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 87: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Finding Topics with TNG

Traditional unigram LDArun on 1 milliontitles / abstracts

(200 topics)

...select ~300k papers onML, NLP, robotics, vision...

Find 200 TNG topics among those papers.

Page 88: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical Bibliometric Impact Measures

• Topical Citation Counts

• Topical Impact Factors

• Topical Longevity

• Topical Diversity

• Topical Precedence

• Topical Transfer

Page 89: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical DiversityEntropy of the topic distribution among

papers that cite this paper (this topic).

LowDiversity

HighDiversity

Page 90: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical Diversity

Can also be measured on particular papers...

Page 91: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical PrecedenceWithin a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Information Retrieval:

On Relevance, Probabilistic Indexing and Information Retrieval,Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems,

Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)

Page 92: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical PrecedenceWithin a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Speech Recognition:

Some experiments on the recognition of speech, with one and two ears,E. Colin Cherry (1953)

Spectrographic study of vowel reduction, B. Lindblom (1963)

Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)

Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)

Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Page 93: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical Transfer

Transfer from Digital Libraries to other topics

Other topic Cit’s Paper Title

Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.

Computer Vision 14 On being ‘Undigital’ with digital cameras: extending the dynamic...

Video 12 Lessons learned from the creation and deployment of a terabyte digital video

Graphs 12 Trawling the Web for Emerging Cyber-Communities

Web Pages 11 WebBase: a repository of Web pages

Page 94: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topical TransferCitation counts from one topic to another.

Map “producers and consumers”

Page 95: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 96: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Want a “topic model” with the advantages of CRFs

• Use arbitrary, overlapping features of the input.

• Undirected graphical model, so we don’t have to think about avoiding cycles.

• Integrate naturally with our other CRF components.

• Train “discriminatively”

• Natural semi-supervised training

What does this mean?

Topic models are unsupervised!

Page 97: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

“Multi-Conditional Mixtures”Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data,ala Latent Dirichlet Allocation & its successors

• But an undirected model,like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]

• But trained by a “multi-conditional” objective: O = P(A|B,C) P(B|A,C) P(C|A,B)e.g. A,B,C are different modalities

[McCallum, Wang, Pal, 2005], [McCallum, Pal, Wang, 2006]

Page 98: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Objective Functions for Parameter EstimationTraditional, joint training (e.g. naive Bayes, most topic models)

Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)

Conditional mixtures (e.g. Jebara’s CEM)

Multi-conditional(mostly conditional, generative regularization)

Multi-conditional(for semi-sup)

Multi-conditional(for transfer learning, 2 tasks, shared hiddens)

Tra

dit

ion

alN

ew,

mu

lti-

con

dit

ion

al

Traditional mixture model (e.g. LDA)

Page 99: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

“Multi-Conditional Learning” (Regularization)[McCallum, Pal, Wang, 2006]

Page 100: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Predictive Random Fieldsmixture of Gaussians on synthetic data

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Data, classify by color Generatively trained

Conditionally-trained [Jebara 1998]

Multi-Conditional

[McCallum, Wang, Pal, 2005]

Page 101: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Multi-Conditional Mixturesvs. Harmoniun

on document retrieval task

Harmonium, joint with words, no labels

Harmonium, joint,with class labels and words

Conditionally-trained,to predict class labels

Multi-Conditional,multi-way conditionally trained

[McCallum, Wang, Pal, 2005]

Page 102: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Page 103: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Summary

Page 104: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Assigning topics to documents

1. Build a 200 topic n-gram topic model on 300k documents

2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)

3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

Page 105: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Impact Factor

Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.

2004 Impact factors from JCR:

Nature 32.182

Cell 28.389

JMLR 5.952

Machine Learning 3.258

Page 106: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Impact Factor

Page 107: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Broad Impact: Diffusion

Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100

Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.

Page 108: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Better at capturing broad end of impact spectrum: the high diffusion topics are identical to the least frequently cited topics

Page 109: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Topic diversity can also be measured for papers:

Page 110: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Longevity: Cited Half Life

Two views:• Given a paper, what is the median age of

citations to that paper?• What is the median age of citations from

current literature?

Page 111: Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

History: Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

Information Retrieval (138):On Relevance, Probabilistic Indexing and Information Retrieval,

Kuhns and Maron (1960)Expected Search Length: A Single Measure of Retrieval

Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)Relevance feedback and the optimization of retrieval

effectiveness, Salton (1971)New experiments in relevance feedback, Ide (1971)Automatic Indexing of a Sound Database Using Self-organizing

Neural Nets, Feiten and Gunzel (1982)