Link Mining
Jerry Scripps

Overview
- What is link mining?
- Motivation
- Preliminaries
  - definitions
  - metrics
  - network types
- Link mining techniques

What is Link Mining?
[Diagram: Link Mining shown alongside related fields: Statistics, Graph Theory, Social Network Analysis, Machine Learning, Data Mining, Database]

What is Link Mining? Examples:
- Discovering communities within collaboration networks
- Finding authoritative web pages on a given topic
- Selecting the most influential people in a social network

Link Mining - Motivation: Emerging Data Sets
- World wide web
- Social networking
- Collaboration databases
- etc.

Link Mining - Motivation: Direct Applications
- What is the community around msu.edu?
- What are the authoritative pages?
- Who has the most influence?
- Who is the likely member of terrorist cell?
- Is this a news story about crime, politics or business?

Link Mining - Motivation: Indirect Applications
- Convert ordinary data sets into networks
- Integrate link mining techniques into other techniques

Preliminaries
- Definitions
- Metrics
- Network Types

Definitions
- Node (vertex, point, object)
- Link (edge, arc)
- Community

Metrics
Node:
- Degree
- Closeness
- Betweenness
- Clustering coefficient
Node Pair:
- Graph distance
- Min-cut
- Common neighbors
- Jaccard’s coef
- Adamic/Adar
- Pref. attachment
- Katz
- Hitting time
- Rooted PageRank
- SimRank
- Bibliographic metrics
Network:
- Characteristic path length
- Clustering coefficient
- Min-cut

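A minimal sketch of four of the node-pair metrics listed above (common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment), using their usual textbook definitions. The toy graph `GRAPH` and the function names are hypothetical and are not taken from the slides.

```python
import math

# Hypothetical undirected toy graph: node -> set of neighbors.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}

def common_neighbors(g, x, y):
    """Number of nodes adjacent to both x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by all neighbors of either node."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors weighted by 1/log(degree); assumes x != y,
    so every shared neighbor has degree >= 2 and log(degree) > 0."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def preferential_attachment(g, x, y):
    """Product of the two node degrees."""
    return len(g[x]) * len(g[y])

if __name__ == "__main__":
    for name, fn in [("common neighbors", common_neighbors),
                     ("Jaccard", jaccard),
                     ("Adamic/Adar", adamic_adar),
                     ("pref. attachment", preferential_attachment)]:
        print(name, fn(GRAPH, "b", "d"))
```

For the pair (b, d) the shared neighbors are a and c, so the neighborhood-based scores all come from those two nodes.
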
Network Types
[Figure: Regular, Small World and Random networks (Watts & Strogatz)]

Networks - Scale-free
[Figure: GVSU FaceBook degree distribution, Counts vs. Degree, shown on linear axes and on log scale]
- Barabasi & Bonabeau
- Degree follows a power law ~ 1/k^n
- Can be found in a wide variety of real-world networks

Network recap
Network Type | Clustering coefficient | Characteristic path length | Power Law
Random       | Low                    | Low                        | No
Regular      | High                   | High                       | No
Small world  | High                   | Low                        | ?
Scale-free   | ?                      | ?                          | Yes

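A rough sketch of how the two network-level columns in the recap could be measured, assuming the usual definitions: the clustering coefficient is the average fraction of a node's neighbor pairs that are themselves linked, and the characteristic path length is the average shortest-path length over connected pairs. The ring-lattice generator and all function names are illustrative, not from the slides.

```python
from collections import deque
from itertools import combinations

def ring_lattice(n, k):
    """Regular ring: each node links to its k nearest neighbors (k even)."""
    g = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            g[i].add(j)
            g[j].add(i)
    return g

def clustering_coefficient(g):
    """Average local clustering; nodes with fewer than 2 neighbors count as 0."""
    total = 0.0
    for nbrs in g.values():
        if len(nbrs) < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in g[a])
        total += links / (len(nbrs) * (len(nbrs) - 1) / 2)
    return total / len(g)

def characteristic_path_length(g):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for s in g:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in g[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

if __name__ == "__main__":
    g = ring_lattice(12, 4)
    print(clustering_coefficient(g), characteristic_path_length(g))
```

Running the same two functions on a randomly rewired version of the lattice is one way to see the regular / small-world / random contrast from the table.
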
Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding
- Link Completion

Link-Based Classification
- Include features from linked objects: building a single model on all features
- Fusion of link and attribute models

Link-Based Classification: Chakrabarti, et al.
- Copying data from neighboring web pages actually reduced accuracy
- Using the label from neighboring page improved accuracy
[Figure: linked pages shown as binary feature vectors with class labels A and B; the label of one page is unknown (?)]

Link-Based Classification: Lu & Getoor
- Define vectors for attributes and links
  - Attribute data OA(X)
  - Link data LD(X) constructed using
    - mode (single feature: class of plurality)
    - count (feature for each class: count for neighbors)
    - binary (feature for each class: 0/1 if exists)
[Figure: the attribute vector OA (attr) of the unlabeled node feeds Model 1; its link vector LD (link), built from neighbor labels A, A, B (count 2 1 0, binary 1 1 0), feeds Model 2]

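The three ways of building the link vector LD(X) can be sketched as follows. This is an illustrative reading of the mode / count / binary descriptions, not the authors' code; the label set `CLASSES`, the function name, and the tie-breaking rule for the mode are assumptions.

```python
from collections import Counter

CLASSES = ["A", "B", "C"]  # hypothetical label set

def link_features(neighbor_labels, classes=CLASSES):
    """Build mode, count and binary link features from neighbor class labels."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors with that class
    count_vec = [counts[c] for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: single feature, the plurality class
    # (assumption: ties go to the class listed first in `classes`)
    mode = max(classes, key=lambda c: counts[c]) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

if __name__ == "__main__":
    # A node whose neighbors are labeled A, A and B
    print(link_features(["A", "A", "B"]))
```

With neighbors labeled A, A, B this gives mode A, count [2, 1, 0] and binary [1, 1, 0].
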
Link-Based Classification: Lu & Getoor
- Define probabilities for both
  - Attribute
  - Link
- Class estimation:
  $\hat{C}(X) = P(c \mid w_o, OA(X)) \, P(c \mid w_l, LD(X))$
  $P(c \mid w_o, OA(X)) = \frac{1}{\exp(-w_o^T OA(X)\, c) + 1}$
  $P(c \mid w_l, LD(X)) = \frac{1}{\exp(-w_l^T LD(X)\, c) + 1}$

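A small sketch of combining the attribute model and the link model, assuming a binary class c in {-1, +1} and the logistic form P(c | w, x) = 1 / (exp(-w^T x c) + 1) for each model. Choosing the class with the larger product of the two probabilities is an assumption of this sketch, as are the weights and feature values in the demo.

```python
import math

def logistic(w, x, c):
    """P(c | w, x) for c in {-1, +1}: 1 / (exp(-(w . x) * c) + 1)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (math.exp(-score * c) + 1.0)

def estimate_class(w_o, oa, w_l, ld):
    """Pick the class whose attribute and link probabilities have the
    larger product (assumed decision rule)."""
    return max((+1, -1),
               key=lambda c: logistic(w_o, oa, c) * logistic(w_l, ld, c))

if __name__ == "__main__":
    # Hypothetical weights: attribute model leans +1, link model leans -1 strongly
    print(estimate_class([1.0, -1.0], [1, 0], [-1.0, -1.0, 0.0], [2, 1, 0]))
```

Keeping two separate weight vectors lets the link evidence override weak attribute evidence, as in the demo call.
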
Link-Based Classification: Summary
- Using class of neighbors improves accuracy
- Using separate models for attribute and link data further improves accuracy
- Other considerations:
  - improvements are possible by using community information
  - knowledge of network type could also benefit classifier

Link Prediction

Link Prediction: Liben-Nowell and Kleinberg
- Tested node-pair metrics:
  - Graph distance
  - Neighborhood: Common neighbors, Jaccard’s coefficient, Adamic/Adar, Preferential attachment
  - Ensemble of paths: Katz, Hitting time, Rooted PageRank, SimRank

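As an example of a path-ensemble metric, here is a hedged sketch of a truncated Katz score: the sum over lengths l of beta^l times the number of length-l walks between two nodes, cut off at `max_len` instead of summed to infinity. The truncation, the parameter values and the function name are simplifications chosen for illustration.

```python
def katz_scores(nodes, edges, beta=0.5, max_len=3):
    """Truncated Katz score for every ordered pair of distinct nodes."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:  # undirected graph
        adj[index[u]][index[v]] = 1
        adj[index[v]][index[u]] = 1

    walks = [row[:] for row in adj]  # walks[i][j] = number of walks of current length
    scores = [[0.0] * n for _ in range(n)]
    for length in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** length) * walks[i][j]
        # extend every walk by one more step (matrix product walks * adj)
        walks = [[sum(walks[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return {(u, v): scores[index[u]][index[v]]
            for u in nodes for v in nodes if u != v}

if __name__ == "__main__":
    # Path graph a - b - c
    print(katz_scores(["a", "b", "c"], [("a", "b"), ("b", "c")]))
```

Short paths dominate because each extra step is damped by another factor of beta.
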
Link Prediction - results

Link Prediction - summary
- There is room for growth: best predictor has accuracy of only around 9%
- Predicting collaborations is difficult
- Finding communities could help if most collaborations are intra-community
- New problem could be to predict the direction of the link

Ranking

Ranking - Markov Chain Based
- Random-surfer analogy
- Problem with cycles
- PageRank uses random vector

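A minimal random-surfer sketch. It assumes the common damping formulation of PageRank with a uniform random-jump vector, which may differ in detail from the form used on the slides; the `damping` value and the handling of nodes without out-links are choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; links maps each node to the list of nodes it points to."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # random jump: with probability (1 - damping) the surfer restarts uniformly
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # assumption: a node with no out-links spreads its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

if __name__ == "__main__":
    # A pure cycle: the random jump keeps the iteration well defined
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

The random jump is what keeps rank from circulating indefinitely inside a cycle.
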
Ranking - summary
- Other methods such as HITS and SALSA also based on Markov chain
- Ranking has been applied in other areas:
  - text summarization
  - anomaly detection

Influence

Maximizing influence - model-based
- Problem: finding the k best nodes to activate to maximize the number of nodes activated
- Models:
  - independent cascade: when activated, a node has a one-time chance to activate each neighbor j with prob. p_ij
  - linear threshold: a node becomes activated when the percent of its activated neighbors crosses a threshold

Maximizing influence - model-based
- Models: independent cascade & linear threshold
- A function f: S → S* can be created using either model
- Functions use a Monte Carlo, hill-climbing solution
- Submodular functions satisfy, for S ⊆ T:
  $f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T)$
- Maximizing such a function is shown in another work to be NP-hard, but a hill-climbing solution can get to within 1-1/e of optimum

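The Monte Carlo, hill-climbing idea can be sketched for the independent cascade model as below. The graph format (node -> list of (neighbor, probability) pairs), the number of runs and the fixed random seed are assumptions of this sketch, not details from the cited work.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """One independent-cascade run; graph[u] is a list of (v, p_uv) pairs."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # a newly activated u gets a single chance at each inactive neighbor
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def expected_spread(graph, seeds, runs=200, seed=0):
    """Monte Carlo estimate of the number of nodes activated from `seeds`."""
    rng = random.Random(seed)
    return sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(runs)) / runs

def greedy_seeds(graph, nodes, k, runs=200):
    """Hill climbing: repeatedly add the node with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        base = expected_spread(graph, chosen, runs)
        best = max((v for v in nodes if v not in chosen),
                   key=lambda v: expected_spread(graph, chosen + [v], runs) - base)
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    g = {"a": [("b", 0.5), ("c", 0.5)], "d": [("e", 0.9)]}
    print(greedy_seeds(g, ["a", "b", "c", "d", "e"], 2))
```

Each greedy step re-estimates the spread for every candidate, which is why speed is the main practical cost of this approach.
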
Maximizing influence - cost/benefit
- Assumptions:
  - product x sells for $100
  - a discount of 10% can be offered to various prospective customers
- If customer purchases, profit is:
  - 90 if discount is offered
  - 100 if discount is not offered
- Expected lift in profit (ELP) from offering discount is:
  90*P(buy|discount) - 100*P(buy|no discount)

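A worked instance of the ELP formula above. The purchase probabilities in the demo are made-up values used only to show the arithmetic, and the function name is illustrative.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            profit_discount=90.0, profit_full=100.0):
    """ELP = 90 * P(buy | discount) - 100 * P(buy | no discount)."""
    return profit_discount * p_buy_discount - profit_full * p_buy_no_discount

if __name__ == "__main__":
    # Hypothetical customer: the discount doubles the chance of buying
    print(expected_lift_in_profit(0.5, 0.25))
    # Hypothetical customer who would buy just as often without the discount
    print(expected_lift_in_profit(0.5, 0.5))
```

The second case comes out negative: offering the discount to a customer it does not sway only gives away margin.
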
Maximizing influence - cost/benefit
- Goal is to find M that maximizes global ELP
- Three approximations used:
  - single pass
  - greedy
  - hill-climbing

$ELP(X^k, Y, M) = \sum_{i=1}^{n} r_1 P(X_i = 1 \mid X^k, Y, f_i^1(M)) - r_0 P(X_i = 1 \mid X^k, Y, f_i^0(M)) - c$

where
- X_i is the decision of customer i to buy
- Y is vector of product attributes
- M is vector of marketing decision
- f is a function to set the ith element of M
- r_0 and r_1 are revenue gained
- c is the cost of marketing

Comparison of approaches
                     | Cost/benefit                   | Model-based
Size of starting set | variable, based on max. lift   | fixed
uses attributes      | yes                            | no
probabilities        | extracted from data set        | assigned to links

- An extension would be to spread influence to the largest number of communities
- Improvements can be made in speed

Communities

Gibson, Kleinberg and Raghavan
1. Query
2. Search Engine
3. Root Set
4. Base Set: add forward and back links
5. Use HITS to find top 10 hubs and authorities

Reddy and Kitsuregawa
- Bipartite graph
- Given an initial set of nodes T, build I from the nodes pointed to from T
- Repeat:
  - use relax_cocite to expand T and I
  - prune T and I using dense bipartite graph function DBPG(T,I,α,β)
[Figure: bipartite graph with node sets T and I and example nodes u, v, w]

Flake, Lawrence and Giles
- Uses Min-cut
- Start with seed set
- Add linked nodes
- Find nodes from outgoing links
- Create virtual source node
- Add virtual sink linking it to all nodes
- Find the min-cut of the virtual source and sink

Neville, Adler and Jensen
- Distance based on links and attributes
- If link exists, score is number of common attributes, zero otherwise
- Example attribute vectors: A = 0 1 1 0, B = 1 1 0 0, C = 1 1 0 1
- score(A,B)=2, score(A,C)=1, score(B,C)=0
- Used with 3 partitioning algorithms:
  - Karger’s Min-Cut
  - MajorClust
  - Spectral partitioning by Shi & Malik

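A short sketch that reproduces the three example scores. It relies on two assumptions that are not stated explicitly on the slide: a "common attribute" is a position where both nodes hold the same value, and the only links are A-B and A-C (inferred because B and C agree on three attributes yet score 0).

```python
# Attribute vectors from the example
ATTRS = {
    "A": (0, 1, 1, 0),
    "B": (1, 1, 0, 0),
    "C": (1, 1, 0, 1),
}
# Assumed link set, inferred from the example scores
LINKS = {frozenset(("A", "B")), frozenset(("A", "C"))}

def score(x, y, attrs=ATTRS, links=LINKS):
    """Number of attribute positions with equal values if x and y are linked, else 0."""
    if frozenset((x, y)) not in links:
        return 0
    return sum(1 for a, b in zip(attrs[x], attrs[y]) if a == b)

if __name__ == "__main__":
    for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
        print(pair, score(*pair))
```

The resulting weighted graph is what the partitioning algorithms would then take as input.
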
Communities - summary
- There are many options for building communities around a small group of nodes
- Possible future directions
  - finding communities in networks having different link types
  - impact of network type on community finding techniques

Link Completion

Goldenberg, Kubica and Komarek
- Problem: given a network and n-1 members of a community, find the nth
  - random
  - counting
  - popular
  - NB
  - NN
  - cGraph
  - BayesNet
  - EBS and LR

Conclusions
- Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve
- Opportunities for improvements exist in
  - using community knowledge
  - using network knowledge

We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars.
Alan Chadwick

Ranking
- Based on Markov Chain
- Rank is sum of node weights from incoming links
- Breaks down when cycles exist
[Figure: example graph with numeric rank values on its nodes]

Ranking - continued
- General approach
  - a_p = authority score for p
  - B_p = backlinks of p
- PageRank
  - [Equation damaged in extraction; surviving terms: a_p, a_q for q in B_p, N, E(p) and a factor (1 - ...)]
- HITS approach
  - a_p = authority score for p
  - h_p = hub score for p
  - B_p = backlinks of p
  - $a_p = \sum_{q \in B_p} h_q$
  - $h_p = \sum_{q :\, p \in B_q} a_q$ (sum over the pages q that p links to)
  - Normalize between iterations

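A compact sketch of the HITS iteration, assuming the standard updates: a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both vectors are normalized between iterations. The Euclidean normalization and the iteration count are choices made for this example.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # authority: sum of hub scores over backlinks
        auth = {p: sum(hub[q] for q in nodes if p in links.get(q, ()))
                for p in nodes}
        # hub: sum of authority scores over forward links
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in nodes}
        # normalize between iterations (guard against an all-zero vector)
        a_norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

if __name__ == "__main__":
    hub, auth = hits({"h1": ["x", "y"], "h2": ["x"]})
    print(hub)
    print(auth)
```

In the demo, x is linked from both hubs and ends up with the top authority score, while h1 links to both authorities and ends up as the top hub.
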