Link Mining

Jerry Scripps

Overview
- What is link mining?
- Motivation
- Preliminaries
  - definitions
  - metrics
  - network types
- Link mining techniques

What is Link Mining?

[Diagram: link mining shown in relation to statistics, graph theory, social network analysis, machine learning, data mining, and databases]

What is Link Mining?
Examples:
- Discovering communities within collaboration networks
- Finding authoritative web pages on a given topic
- Selecting the most influential people in a social network

Link Mining – Motivation
Emerging Data Sets
- World wide web
- Social networking
- Collaboration databases
- etc.

Link Mining – Motivation
Direct Applications
- What is the community around msu.edu?
- What are the authoritative pages?
- Who has the most influence?
- Who is a likely member of a terrorist cell?
- Is this a news story about crime, politics or business?

Link Mining – Motivation
Indirect Applications
- Convert ordinary data sets into networks
- Integrate link mining techniques into other techniques

Preliminaries
- Definitions
- Metrics
- Network Types

Definitions
- Node (vertex, point, object)
- Link (edge, arc)
- Community

Metrics
- Node
  - Degree
  - Closeness
  - Betweenness
  - Clustering coefficient
- Node Pair
  - Graph distance
  - Min-cut
  - Common neighbors
  - Jaccard's coefficient
  - Adamic/Adar
  - Preferential attachment
  - Katz
  - Hitting time
  - Rooted PageRank
  - SimRank
  - Bibliographic metrics
- Network
  - Characteristic path length
  - Clustering coefficient
  - Min-cut
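A few of the metrics listed above can be sketched in a handful of lines. The following is a minimal illustration, assuming an undirected, connected graph stored as a dictionary of neighbor sets; the function names and the toy graph `adj` are hypothetical and not taken from the slides.

```python
from collections import deque
from itertools import combinations


def degree(adj, v):
    """Node metric: number of links attached to v."""
    return len(adj[v])


def clustering_coefficient(adj, v):
    """Node metric: fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def characteristic_path_length(adj):
    """Network metric: mean shortest-path length over all reachable node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(degree(adj, "c"))
print(clustering_coefficient(adj, "c"))
print(characteristic_path_length(adj))
```

Closeness, betweenness and min-cut need more machinery (all-pairs shortest paths or max-flow) and are left out of this sketch.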

Network Types

[Figure: regular, small-world, and random networks (Watts & Strogatz)]

Networks – Scale-free

[Figure: degree distribution of the GVSU FaceBook network, shown on linear axes and on log-log axes; x-axis: Degree, y-axis: Counts]

Barabasi & Bonabeau
- Degree follows a power law ~ 1/k^n
- Can be found in a wide variety of real-world networks

Network recap

Network type   Clustering coefficient   Characteristic path length   Power law
Random         Low                      Low                          No
Regular        High                     High                         No
Small world    High                     Low                          ?
Scale-free     ?                        ?                            Yes

Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding
- Link Completion

Link-Based Classification
- Include features from linked objects: building a single model on all features
- Fusion of link and attribute models

Link-Based Classification
Chakrabarti, et al.
- Copying data from neighboring web pages actually reduced accuracy
- Using the label from neighboring pages improved accuracy

[Figure: a page of unknown class (?) linked to neighbors labeled A and B, each with a binary attribute vector]

Link-Based Classification
Lu & Getoor
- Define vectors for attributes and links
  - Attribute data OA(X)
  - Link data LD(X), constructed using
    - mode (single feature – class of plurality)
    - count (feature for each class – count for neighbors)
    - binary (feature for each class – 0/1 if exists)

[Figure: an object of unknown class (?) with neighbors labeled A, A and B; its OA (attr) and LD (link) vectors, e.g. count 2 1 0 and binary 1 1 0, feed Model 1 and Model 2]
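The three ways of building the link vector LD(X) can be illustrated with a short sketch. This assumes the neighbors' class labels are already known; the function name `link_features`, the tie-breaking rule for the mode, and the example labels are illustrative assumptions rather than details from Lu & Getoor.

```python
from collections import Counter


def link_features(neighbor_labels, classes):
    """Build mode, count and binary link features from neighbor class labels."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors in that class
    count_vec = [counts.get(c, 0) for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: the plurality class; ties go to the class listed first (assumption)
    mode = max(classes, key=lambda c: counts.get(c, 0)) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}


# Hypothetical object with two neighbors of class A and one of class B.
features = link_features(["A", "A", "B"], classes=["A", "B", "C"])
print(features)
```

With these labels the count vector is 2 1 0 and the binary vector is 1 1 0, which matches the values visible in the slide's figure.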

Link-Based Classification
Lu & Getoor
- Define probabilities for both
  - Attribute
  - Link
- Class estimation:

  \hat{C}(X) = \arg\max_{c} P(c \mid w_o, OA(X)) \, P(c \mid w_l, LD(X))

  P(c \mid w_o, OA(X)) = \frac{1}{\exp(-w_o^T OA(X) \, c) + 1}

  P(c \mid w_l, LD(X)) = \frac{1}{\exp(-w_l^T LD(X) \, c) + 1}
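As a rough sketch of how the attribute and link models combine, the snippet below scores each class with two logistic models and picks the class with the largest product. It assumes binary classes coded as -1 and +1 and uses made-up weight vectors; it is not the authors' implementation, and training the weights is out of scope here.

```python
import math


def logistic_prob(w, x, c):
    """P(c | w, x) = 1 / (exp(-c * w.x) + 1) for a class c in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (math.exp(-c * score) + 1.0)


def classify(w_o, oa, w_l, ld):
    """Return the class maximizing P(c | w_o, OA) * P(c | w_l, LD)."""
    best_class, best_score = None, -1.0
    for c in (-1, +1):
        score = logistic_prob(w_o, oa, c) * logistic_prob(w_l, ld, c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical weights and feature vectors, for illustration only.
w_o, oa = [0.5, -1.0], [1.0, 0.0]   # attribute model and OA(X)
w_l, ld = [1.0, -1.0], [2.0, 1.0]   # link model and LD(X), e.g. count features

print(classify(w_o, oa, w_l, ld))
```

Keeping the two models separate lets the link evidence and the attribute evidence be weighted independently before they are fused in the product.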

Link-Based Classification
Summary
- Using class of neighbors improves accuracy
- Using separate models for attribute and link data further improves accuracy
- Other considerations:
  - improvements are possible by using community information
  - knowledge of network type could also benefit classifier

Link Prediction

Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
- Graph distance
- Neighborhood
  - Common neighbors
  - Jaccard's coefficient
  - Adamic/Adar
  - Preferential attachment
- Ensemble of paths
  - Katz
  - Hitting time
  - Rooted PageRank
  - SimRank
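The neighborhood-based metrics are simple enough to sketch directly. The code below assumes an undirected graph stored as a dictionary of neighbor sets and scores a single candidate pair; the function names and the toy graph are hypothetical, and the path-based metrics (Katz, hitting time, rooted PageRank, SimRank) are not covered.

```python
import math


def common_neighbors(adj, x, y):
    """Number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    """Shared neighbors divided by all neighbors of either node."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def adamic_adar(adj, x, y):
    """Shared neighbors weighted by 1 / log(degree), so rare neighbors count more.

    A neighbor shared by two distinct nodes has degree >= 2, so the log is positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def preferential_attachment(adj, x, y):
    """Product of the two degrees."""
    return len(adj[x]) * len(adj[y])


# Hypothetical collaboration graph; a and b are not yet linked.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

for metric in (common_neighbors, jaccard, adamic_adar, preferential_attachment):
    print(metric.__name__, metric(adj, "a", "b"))
```

To turn any of these into a predictor, one would score every unlinked pair and rank the pairs by score, predicting the top-ranked pairs as future links.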

Link Prediction - results

Link Prediction – summary
- There is room for growth – best predictor has accuracy of only around 9%
- Predicting collaborations is difficult
- Finding communities could help if most collaborations are intra-community
- A new problem could be to predict the direction of the link

Ranking

Ranking – Markov Chain Based
- Random-surfer analogy
- Problem with cycles
- PageRank uses random vector

Ranking – summary
- Other methods such as HITS and SALSA are also based on Markov chains
- Ranking has been applied in other areas:
  - text summarization
  - anomaly detection

Influence

Maximizing influence – model-based
- Problem – finding the k best nodes to activate to maximize the number of nodes activated
- Models:
  - independent cascade – when activated, a node i has a one-time chance to activate each neighbor j with probability p_ij
  - linear threshold – a node becomes activated when the percent of its activated neighbors crosses a threshold
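The independent cascade model can be simulated in a few lines. This is a minimal sketch, assuming a directed graph stored as a nested dictionary of activation probabilities p_ij; the function names, the Monte Carlo helper and the toy graph are illustrative assumptions.

```python
import random


def independent_cascade(graph, seeds, rng):
    """Run one cascade; graph[i][j] is the probability p_ij that i activates j."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for i in frontier:
            # Each newly activated node gets exactly one chance per neighbor.
            for j, p_ij in graph.get(i, {}).items():
                if j not in active and rng.random() < p_ij:
                    active.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return active


def expected_spread(graph, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, rng)) for _ in range(runs))
    return total / runs


# Hypothetical network with made-up probabilities.
graph = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"d": 0.3},
    "c": {"d": 0.9},
}

print(expected_spread(graph, {"a"}))
```

Because a single run is random, the spread of a seed set is usually estimated by averaging many runs, which is what `expected_spread` does.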

Maximizing influence – model-based
- Models: independent cascade & linear threshold
- A function f: S → S* can be created using either model
- Functions use Monte Carlo, hill-climbing solution
- Submodular functions,

  f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T), where S \subseteq T,

  are proven in another work to be NP-complete to maximize, but a hill-climbing solution can get to within 1 - 1/e of optimum.
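The hill-climbing idea is to add, at each step, the node with the largest marginal gain f(S ∪ {v}) - f(S). The sketch below shows this on a simple coverage function, which is submodular and deterministic; it is a stand-in chosen for clarity, not the cascade-based function from the slides, and the names `greedy_maximize`, `coverage` and `reach` are hypothetical.

```python
def greedy_maximize(f, candidates, k):
    """Pick up to k elements, each time adding the one with the best marginal gain."""
    chosen = set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for v in sorted(candidates - chosen):  # sorted for a deterministic tie-break
            gain = f(chosen | {v}) - f(chosen)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining element improves f
            break
        chosen.add(best)
    return chosen


# Hypothetical data: the set of nodes each seed would reach.
reach = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def coverage(seed_set):
    """Submodular stand-in for influence: how many nodes the seeds reach in total."""
    covered = set()
    for v in seed_set:
        covered |= reach[v]
    return len(covered)


seeds = greedy_maximize(coverage, set(reach), k=2)
print(sorted(seeds), coverage(seeds))
```

In the influence setting, `coverage` would be replaced by a Monte Carlo estimate of the expected spread under the chosen model, which is where most of the running time goes.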

Maximizing influence – cost/benefit
- Assumptions:
  - product x sells for $100
  - a discount of 10% can be offered to various prospective customers
- If a customer purchases, profit is:
  - 90 if discount is offered
  - 100 if discount is not offered
- Expected lift in profit (ELP) from offering discount is:
  90*P(buy|discount) - 100*P(buy|no discount)
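A quick worked example makes the ELP formula concrete. The purchase probabilities below are invented for illustration; only the $100 price and 10% discount come from the slide.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            price=100.0, discount=0.10):
    """ELP = discounted profit * P(buy | discount) - full profit * P(buy | no discount)."""
    discounted_profit = price * (1.0 - discount)  # 90 with the slide's numbers
    return discounted_profit * p_buy_discount - price * p_buy_no_discount


# Hypothetical customer whose purchase probability rises from 0.40 to 0.50:
# 90 * 0.50 - 100 * 0.40 = 45 - 40, a lift of about 5, so the offer pays off.
responsive = expected_lift_in_profit(0.50, 0.40)

# Hypothetical customer who barely responds, 0.40 to 0.42:
# 90 * 0.42 - 100 * 0.40 = 37.8 - 40, a negative lift, so the offer loses money.
unresponsive = expected_lift_in_profit(0.42, 0.40)

print(responsive, unresponsive)
```

The discount is only worth offering to customers whose purchase probability rises enough to offset the revenue given up.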

Maximizing influence – cost/benefit
- Goal is to find M that maximizes global ELP
- Three approximations used:
  - single pass
  - greedy
  - hill-climbing

ELP(X^k, Y, M) = \sum_{i=1}^{n} \left[ r_1 P(X_i = 1 \mid X^k, Y, f_i^1(M)) - r_0 P(X_i = 1 \mid X^k, Y, f_i^0(M)) - c \right]

- X_i is the decision of customer i to buy
- Y is vector of product attributes
- M is vector of marketing decisions
- f is a function to set the ith element of M (f_i^1 sets it to 1, f_i^0 sets it to 0)
- r_0 and r_1 are revenue gained (without and with the marketing action)
- c is the cost of marketing

Comparison of approaches

                       Cost/benefit                     Model-based
Size of starting set   variable - based on max. lift    fixed
Uses attributes        yes                              no
Probabilities          extracted from data set          assigned to links

- An extension would be to spread influence to the greatest number of communities
- Improvements can be made in speed

Communities

Gibson, Kleinberg and Raghavan

[Flowchart] Query → Search Engine → Root Set → Base Set: add forward and back links → Use HITS to find top 10 hubs and authorities
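The HITS step at the end of this pipeline can be sketched as a short power iteration. This assumes the base set is given as a dictionary of outgoing links; the function name, the fixed iteration count and the toy pages are assumptions made for illustration.

```python
import math


def hits(out_links, iterations=50):
    """Return (authority, hub) scores for a directed graph of outgoing links."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of q: sum of hub scores of the pages that link to q.
        new_auth = {n: 0.0 for n in nodes}
        for p, targets in out_links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub of p: sum of authority scores of the pages p links to.
        new_hub = {n: 0.0 for n in nodes}
        for p, targets in out_links.items():
            for q in targets:
                new_hub[p] += new_auth[q]
        # Normalize between iterations so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub


# Hypothetical base set: three hub-like pages pointing at three content pages.
out_links = {
    "h1": ["x", "y"],
    "h2": ["x", "y", "z"],
    "h3": ["x"],
}

auth, hub = hits(out_links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Sorting the two score dictionaries and keeping the top entries would give the hubs and authorities that form the core of the community.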

Reddy and Kitsuregawa
- Bipartite graph
- Given an initial set of nodes T, build I from the nodes pointed to from T
- Repeat:
  - use relax_cocite to expand T and I
  - prune T and I using dense bipartite graph function DBPG(T, I, α, β)

[Figure: bipartite graph between node sets T and I, with example nodes u, v and w]

Flake, Lawrence and Giles
- Uses Min-cut
- Start with seed set
- Add linked nodes
- Find nodes from outgoing links
- Create virtual source node
- Add virtual sink linking it to all nodes
- Find the min-cut of the virtual source and sink

Neville, Adler and Jensen
- Distance based on links and attributes
- If link exists, score is number of common attributes; zero otherwise
- Example attribute vectors: A = 0 1 1 0, B = 1 1 0 0, C = 1 1 0 1
  score(A,B)=2, score(A,C)=1, score(B,C)=0
- Used with 3 partitioning algorithms:
  - Karger's Min-Cut
  - MajorClust
  - Spectral partitioning by Shi & Malik
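The scoring rule can be checked against the slide's three example nodes. The sketch below makes two assumptions that the slide does not state outright: "common attributes" is read as the number of positions where the two vectors hold the same value, and the links are taken to be A-B and A-C (inferred from score(B,C) being zero). Both readings reproduce the scores shown on the slide.

```python
def link_attribute_score(attrs, links, u, v):
    """Score a pair: number of matching attribute values if linked, else zero."""
    if frozenset((u, v)) not in links:
        return 0
    return sum(1 for a, b in zip(attrs[u], attrs[v]) if a == b)


# Attribute vectors from the slide.
attrs = {
    "A": (0, 1, 1, 0),
    "B": (1, 1, 0, 0),
    "C": (1, 1, 0, 1),
}
# Assumed links, inferred from the slide's scores: A-B and A-C, but not B-C.
links = {frozenset(("A", "B")), frozenset(("A", "C"))}

for pair in (("A", "B"), ("A", "C"), ("B", "C")):
    print(pair, link_attribute_score(attrs, links, *pair))
```

These scores then serve as edge weights for the partitioning algorithms, so that strongly similar linked nodes tend to stay in the same cluster.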

Communities - summary
- There are many options for building communities around a small group of nodes
- Possible future directions
  - finding communities in networks having different link types
  - impact of network type on community finding techniques

Link Completion

Goldenberg, Kubica and Komarek
- Problem: given a network and n-1 members of a community, find the nth
- Methods compared:
  - random
  - counting
  - popular
  - NB
  - NN
  - cGraph
  - BayesNet
  - EBS and LR

Conclusions
- Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve
- Opportunities for improvements exist in
  - using community knowledge
  - using network knowledge

We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars.

Alan Chadwick

http://www.quotationspage.com/quote/4192.html


Ranking
- Based on Markov Chain
- Rank is sum of node weights from incoming links
- Breaks down when cycles exist

[Figure: example network with a rank value on each node]

Ranking - continued
- General approach
  - a_p = authority score for p
  - B_p = backlinks of p
- PageRank

  a_p = (1 - \epsilon) \sum_{q \in B_p} \frac{a_q}{N_q} + \epsilon E(p)

  where N_q is the number of outgoing links of q and E is the random vector
- HITS approach
  - a_p = authority score for p
  - h_p = hub score for p
  - B_p = backlinks of p
  - F_p = forward links of p

  a_p = \sum_{q \in B_p} h_q

  h_p = \sum_{q \in F_p} a_q

- Normalize between iterations
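A compact sketch of the PageRank update follows. It assumes a uniform random vector E, a jump probability of 0.15 and a fixed number of iterations, none of which are specified on the slides; the function name and the three-page cycle are likewise hypothetical.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """Iterate a_p = (1 - epsilon) * sum(a_q / N_q for backlinks q) + epsilon * E(p)."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    e = {p: 1.0 / n for p in nodes}  # uniform random vector E (assumption)
    rank = dict(e)
    for _ in range(iterations):
        new_rank = {p: epsilon * e[p] for p in nodes}
        for q in nodes:
            targets = out_links.get(q, [])
            if targets:
                # q splits its rank evenly over its N_q outgoing links.
                share = (1.0 - epsilon) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # A page with no outgoing links spreads its rank according to E.
                for p in nodes:
                    new_rank[p] += (1.0 - epsilon) * rank[q] * e[p]
        rank = new_rank
    return rank


# Hypothetical graph containing a cycle: a -> b -> c -> a, plus a -> c.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

rank = pagerank(out_links)
print(sorted(rank, key=rank.get, reverse=True))
```

The random-vector term is what keeps the iteration well behaved on graphs with cycles: every page receives a small share of rank regardless of its incoming links.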
