Network Mining
Jerry Scripps
Overview
  What is network mining?
  Motivation
  Preliminaries
    definitions
    metrics
    network types
  Network mining techniques
What is Network Mining?
  [Figure: network mining shown among related fields: statistics, graph theory, social network analysis, machine learning, data mining, computer science, mathematics, pattern recognition]
What is Network Mining?
Border Disciplines
  Statistics
  Computer Science
  Physics
  Math
  Psychology
  Law Enforcement
  Sociology
  Military
  Biology
  Medicine
  Chemistry
  Business
What is Network Mining?
  Examples:
    Discovering communities within collaboration networks
    Finding authoritative web pages on a given topic
    Selecting the most influential people in a social network
Network Mining – Motivation
Emerging Data Sets
  World wide web
  Social networking
  Collaboration databases
  Customer or employee sets
  Genomic data
  Terrorist sets
  Supply chains
  Many more…
Network Mining – Motivation
Direct Applications
  What is the community around msu.edu?
  What are the authoritative pages?
  Who has the most influence?
  Who is the likely member of a terrorist cell?
  Is this a news story about crime, politics or business?
Network Mining – Motivation
Indirect Applications
  Convert ordinary data sets into networks
  Integrate network mining techniques into other techniques
Preliminaries
  Definitions
  Metrics
  Network Types
Definitions
  Node (vertex, point, object)
  Link (edge, arc)
  Community
Metrics
  Node
    Degree
    Closeness
    Betweenness
    Clustering coefficient
  Node Pair
    Graph distance
    Min-cut
    Common neighbors
    Jaccard's coefficient
    Adamic/Adar
    Preferential attachment
    Katz
    Hitting time
    Rooted PageRank
    SimRank
    Bibliographic metrics
  Network
    Characteristic path length
    Clustering coefficient
    Min-cut
    Diameter
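A few of the node and network metrics listed above can be sketched in plain Python. This is only an illustration: the toy graph and the function names are assumptions, not part of the original slides, and the graph is assumed to be undirected and connected.

```python
from collections import deque

# Hypothetical toy graph stored as undirected adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

def degree(g, v):
    """Number of links attached to node v."""
    return len(g[v])

def clustering_coefficient(g, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = list(g[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in g[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

def distances_from(g, source):
    """Graph distance (hop count) from source to every reachable node, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_length_stats(g):
    """Return (characteristic path length, diameter) of a connected graph."""
    lengths = [
        d
        for s in g
        for t, d in distances_from(g, s).items()
        if t != s
    ]
    return sum(lengths) / len(lengths), max(lengths)

cpl, diameter = path_length_stats(graph)
```

Here the characteristic path length is taken as the mean shortest-path distance over all ordered node pairs, and the diameter as the largest such distance.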
Network Types – Random
Network Types – Small World
  [Figure: regular, small-world, and random networks, after Watts & Strogatz]
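The regular, small-world, and random network types can be generated from one procedure in the spirit of the Watts & Strogatz model: start from a regular ring lattice and rewire each link with probability p. The sketch below is a simplified version under stated assumptions (k even, n > k, function and parameter names are hypothetical), not the authors' exact procedure.

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Ring lattice of n nodes, each linked to its k nearest neighbors,
    with every lattice link rewired with probability p.

    p = 0 keeps the regular lattice, p = 1 gives a random-like network,
    and small p in between gives a small-world network.
    Assumes k is even and n > k.
    """
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}

    # Build the regular ring lattice.
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)

    # Visit each lattice link once and rewire its far end with probability p.
    for j in range(1, k // 2 + 1):
        for u in range(n):
            v = (u + j) % n
            if rng.random() < p:
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj

regular = watts_strogatz(20, 4, 0.0)
small_world = watts_strogatz(20, 4, 0.1)
random_like = watts_strogatz(20, 4, 1.0)
```

Rewiring preserves the number of links, so the three networks differ only in how the links are arranged.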
Networks – Scale-free
  [Figure: GVSU Facebook degree distribution, counts vs. degree, on linear and log scales]
  Barabasi & Bonabeau
  Degree follows a power law ~ 1/k^n
  Can be found in a wide variety of real-world networks
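One rough way to check for power-law behavior is to tabulate the degree counts and fit a straight line to them on log-log axes; the negated slope is then an estimate of the exponent n. The sketch below uses hypothetical helper names and synthetic counts, and a least-squares fit on log-log data is only a quick diagnostic, not a rigorous estimator.

```python
import math
from collections import Counter

def degree_counts(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(degree)."""
    pts = [
        (math.log(k), math.log(c))
        for k, c in counts.items()
        if k > 0 and c > 0
    ]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic counts that follow count = 10000 / k^2 exactly (made-up data).
synthetic = {1: 10000, 10: 100, 100: 1}
estimated_exponent = -loglog_slope(synthetic)
```

On the synthetic counts the estimate comes out close to 2, matching the exponent used to build them.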
Network recap
  Network Type   Clustering coefficient   Characteristic path length   Power Law
  Random         Low                      Low                          No
  Regular        High                     High                         No
  Small world    High                     Low                          ?
  Scale-free     ?                        ?                            Yes
Techniques
  Link-Based Classification
  Link Prediction
  Ranking
  Influential Nodes
  Community Finding
Link-Based Classification
  Include features from linked objects: build a single model on all features
  Fusion of link and attribute models
Link-Based Classification
Chakrabarti, et al.
  Copying data from neighboring web pages actually reduced accuracy
  Using the label from neighboring pages improved accuracy
  [Figure: linked pages with binary feature vectors (e.g. 010010, 011110, 111011, 101011) and class labels A and B; one page's label (?) is unknown]
Link-Based Classification
Lu & Getoor
  Define vectors for attributes and links
    Attribute data OA(X)
    Link data LD(X) constructed using
      mode (single feature: class of plurality)
      count (feature for each class: count for neighbors)
      binary (feature for each class: 0/1 if exists)
  [Figure: a node X with unknown label (?) and neighbors labeled A, B, A; the attribute vector OA(X) feeds Model 1, and the link vector LD(X) (mode A, count 2 1 0, binary 1 1 0) feeds Model 2]
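The three ways of building the link vector LD(X) can be illustrated with a short sketch. The class list, the function name, and the tie-breaking rule (first class in list order) are assumptions made for the example.

```python
from collections import Counter

# Assumed set of class labels, for illustration only.
CLASSES = ["A", "B", "C"]

def link_features(neighbor_labels, classes=CLASSES):
    """Build mode, count, and binary link features from neighbor class labels."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors with that class
    count_vec = [counts.get(c, 0) for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: the single most common neighbor class (ties go to the earlier class)
    mode = max(classes, key=lambda c: counts.get(c, 0))
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

# A node whose neighbors are labeled A, B, A.
features = link_features(["A", "B", "A"])
# features == {"mode": "A", "count": [2, 1, 0], "binary": [1, 1, 0]}
```

The attribute vector OA(X) is left out here because it is simply the node's own features.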
Link-Based Classification
Lu & Getoor
  Define probabilities for both
    Attribute: $P(c \mid w_o, OA(X)) = \frac{1}{\exp(-w_o^T OA(X)\, c) + 1}$
    Link: $P(c \mid w_l, LD(X)) = \frac{1}{\exp(-w_l^T LD(X)\, c) + 1}$
  Class estimation: $\hat{C}(X) = \arg\max_c \; P(c \mid w_o, OA(X)) \, P(c \mid w_l, LD(X))$
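The class estimate combines the attribute model and the link model by multiplying their probabilities. The sketch below assumes a logistic form for each model and a binary class c in {-1, +1}; the weight vectors are made-up numbers, not fitted values, and the function names are hypothetical.

```python
import math

def logistic(w, x, c):
    """P(c | w, x) = 1 / (exp(-(w . x) * c) + 1), for c in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (math.exp(-score * c) + 1.0)

def estimate_class(w_o, oa, w_l, ld):
    """Pick the class maximizing P(c | w_o, OA(X)) * P(c | w_l, LD(X))."""
    return max(
        (-1, +1),
        key=lambda c: logistic(w_o, oa, c) * logistic(w_l, ld, c),
    )

# Hypothetical weights and feature vectors for one node X.
w_o, oa = [0.5, -1.0], [1, 1]            # attribute model and OA(X)
w_l, ld = [1.0, -1.0, 0.0], [2, 1, 0]    # link model and count-style LD(X)
predicted = estimate_class(w_o, oa, w_l, ld)
```

In this toy case the attribute model leans slightly toward -1 while the link model leans more strongly toward +1, so the product favors +1.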
Collective Classification
  Uses both attributes and links
  Iteratively update the unlabeled instances
    message passing, loopy belief nets, etc.
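The iterative update can be illustrated with a deliberately simple scheme: each unlabeled node repeatedly takes the majority label of its currently labeled neighbors until nothing changes. This is a plain majority-vote sketch that uses links only. It is not message passing or loopy belief propagation, and the graph and names are hypothetical.

```python
from collections import Counter

def iterative_classification(adj, labels, max_iters=10):
    """Assign labels to unlabeled nodes by repeated neighbor majority vote.

    adj maps each node to its set of neighbors; labels holds the known
    labels, which stay fixed throughout.
    """
    current = dict(labels)
    unlabeled = [v for v in adj if v not in labels]
    for _ in range(max_iters):
        changed = False
        for v in unlabeled:
            votes = Counter(current[n] for n in adj[v] if n in current)
            if not votes:
                continue  # no labeled neighbor yet, try again next round
            # Most votes wins; ties are broken alphabetically.
            best = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
            if current.get(v) != best:
                current[v] = best
                changed = True
        if not changed:
            break
    return current

adj = {
    "a": {"x"}, "b": {"x"},
    "x": {"a", "b", "y"},
    "y": {"x", "c", "d"},
    "c": {"y"}, "d": {"y"},
}
known = {"a": "A", "b": "A", "c": "B", "d": "B"}
result = iterative_classification(adj, known)
```

A fuller version would also feed each node's attributes into a local classifier at every round instead of voting on labels alone.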
Link-Based Classification
Summary
  Using the class of neighbors improves accuracy
  Using separate models for attribute and link data further improves accuracy
  Other considerations:
    improvements are possible by using community information
    knowledge of network type could also benefit the classifier
Link Prediction
Liben-Nowell and Kleinberg
  Tested node-pair metrics:
    Graph distance
    Neighborhood: common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment
    Ensemble of paths: Katz, hitting time, rooted PageRank, SimRank
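The neighborhood-based metrics are easy to compute directly from adjacency sets. The sketch below scores every unlinked node pair and ranks the pairs as candidate future links; the toy graph and function names are assumptions for illustration.

```python
import math
from itertools import combinations

# Hypothetical undirected graph as adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def jaccard(g, x, y):
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    # A common neighbor z is linked to both x and y, so its degree is >= 2.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

def rank_candidates(g, score):
    """Rank currently unlinked node pairs by the given score, best first."""
    pairs = [(x, y) for x, y in combinations(sorted(g), 2) if y not in g[x]]
    return sorted(pairs, key=lambda pair: score(g, *pair), reverse=True)

top_pair = rank_candidates(graph, common_neighbors)[0]
```

Swapping `common_neighbors` for any of the other scoring functions changes the ranking criterion without touching the rest of the code.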
Link Prediction - results
Link Prediction – newer methods
  maximum likelihood
  stochastic block model
  probabilistic
Link Prediction – summary
  There is room for growth: the best predictor has an accuracy of only around 9%
  Predicting collaborations is difficult
  A new problem could be to predict the direction of the link
Techniques
  Link-Based Classification
  Link Prediction
  Ranking
  Influential Nodes
  Community Finding
  Link Completion
Ranking
Ranking – Markov Chain Based
  Random-surfer analogy
  Problem with cycles
  PageRank uses random vector
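The random-surfer idea can be sketched as a power iteration: with probability `damping` the surfer follows an outgoing link, and otherwise jumps to a page chosen uniformly at random, which is what keeps the surfer from getting trapped in cycles. The damping value of 0.85, the handling of pages without out-links, and the tiny web graph are assumptions for the example.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages.

    Every link target is assumed to also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Random-jump share that every page receives.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

# Hypothetical four-page web.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
```

Page "c" collects links from three pages and ends up ranked highest, while "d", which nobody links to, keeps only the random-jump share.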
Ranking – summary
  Other methods such as HITS and SALSA are also based on Markov chains
  Ranking has been applied in other areas:
    text summarization
    anomaly detection
Influence
Influence Maximization
  Problem: find the best nodes to activate
  Approaches:
    degree: fast but not effective
    greedy: effective but slow
    improvements to greedy: degree heuristics and Shapley value
    use communities
    cost-benefit: probabilistic approach
Maximizing influence – model-based
  Problem: finding the k best nodes to activate to maximize the number of nodes activated
  Models:
    independent cascade: when activated, a node has a one-time chance to activate each neighbor j with prob. p_ij
    linear threshold: a node becomes activated when the percent of its active neighbors crosses a threshold
Maximizing influence – model-based
  Models: independent cascade & linear threshold
  A function f: S → S* can be created using either model
  Functions use a Monte Carlo, hill-climbing solution
  Submodular functions satisfy, for S ⊆ T:
    $f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T)$
  Maximizing them is proven in another work to be NP-complete, but a hill-climbing solution gets to within 1 - 1/e of the optimum
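A minimal sketch of the model-based approach, assuming the independent cascade model: the expected spread of a seed set is estimated by Monte Carlo simulation, and seeds are added one at a time by greedy hill-climbing. The graph, its activation probabilities, the number of runs, and the function names are all hypothetical.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """One independent-cascade run; returns the number of activated nodes.

    graph maps node i -> {neighbor j: activation probability p_ij}.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets one chance per inactive neighbor.
            for v, p in graph.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def expected_spread(graph, seeds, runs=200, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    return sum(simulate_cascade(graph, seeds, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, runs=200):
    """Hill-climbing: repeatedly add the node with the best estimated spread."""
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        best = max(
            sorted(nodes - set(chosen)),
            key=lambda v: expected_spread(graph, chosen + [v], runs),
        )
        chosen.append(best)
    return chosen

# Hypothetical directed network with made-up activation probabilities.
network = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"d": 0.5},
    "c": {"d": 0.3},
    "d": {"e": 0.4},
}
seeds = greedy_seeds(network, k=2)
```

Each greedy step re-estimates the spread for every remaining node, which is why the greedy approach is effective but slow.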
Maximizing influence – cost/benefit
  Assumptions:
    product x sells for $100
    a discount of 10% can be offered to various prospective customers
    If the customer purchases, profit is:
      90 if discount is offered
      100 if discount is not offered
  Expected lift in profit (ELP) from offering the discount is:
    90*P(buy|discount) - 100*P(buy|no discount)
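As a worked example of the ELP formula, the sketch below plugs in purchase probabilities for a single customer. The probabilities and the function name are made up for illustration.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            profit_discount=90.0, profit_full=100.0):
    """ELP = 90 * P(buy | discount) - 100 * P(buy | no discount)."""
    return profit_discount * p_buy_discount - profit_full * p_buy_no_discount

# A customer who is twice as likely to buy when offered the discount.
lift_responsive = expected_lift_in_profit(0.5, 0.25)

# A customer whose behavior the discount does not change at all.
lift_indifferent = expected_lift_in_profit(0.3, 0.3)
```

The first customer gives a positive lift, so the discount is worth offering. The second gives a negative lift, because the discount only reduces the profit on a sale that would have happened anyway.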
Maximizing influence – cost/benefit
  Goal is to find M that maximizes global ELP
    $ELP(X^k, Y, M) = \sum_{i=1}^{n} r_1 P(X_i = 1 \mid X^k, Y, f_i^1(M)) - r_0 P(X_i = 1 \mid X^k, Y, f_i^0(M)) - c$
  Three approximations used:
    single pass
    greedy
    hill-climbing
  X_i is the decision of customer i to buy
  Y is the vector of product attributes
  M is the vector of marketing decisions
  f is a function to set the ith element of M
  r_0 and r_1 are the revenue gained
  c is the cost of marketing
Comparison of approaches
                         Cost/benefit                    Model-based
  Size of starting set   variable, based on max. lift    fixed
  Uses attributes        yes                             no
  Probabilities          extracted from data set         assigned to links
  An extension would be to spread influence to the greatest number of communities
  Improvements can be made in speed
Communities
Gibson, Kleinberg and Raghavan
  Query → Search Engine → Root Set
  Base Set: add forward and back links
  Use HITS to find top 10 hubs and authorities
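The HITS step can be sketched as a mutual-reinforcement iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The small base set below and the iteration count are assumptions; building the root and base sets from a search engine is not shown.

```python
import math

def hits(out_links, iters=50):
    """Return (hub, authority) score dicts for a directed link graph."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Authority update: sum of hub scores of pages pointing at the page.
        auth = {v: 0.0 for v in nodes}
        for u, targets in out_links.items():
            for t in targets:
                auth[t] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # Hub update: sum of authority scores of the pages a page points to.
        hub = {v: sum(auth[t] for t in out_links.get(v, ())) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {v: h / norm for v, h in hub.items()}
    return hub, auth

# Hypothetical base set: three hub-like pages linking to three content pages.
base_set = {"h1": ["p1", "p2"], "h2": ["p1", "p2", "p3"], "h3": ["p1"]}
hub, auth = hits(base_set)
top_authorities = sorted(auth, key=auth.get, reverse=True)[:10]
top_hubs = sorted(hub, key=hub.get, reverse=True)[:10]
```

Scores are rescaled to unit length after each update so that they stay bounded across iterations.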
Flake, Lawrence and Giles
  Uses Min-cut
  Start with seed set
  Add linked nodes
  Find nodes from outgoing links
  Create virtual source node
  Add virtual sink linking it to all nodes
  Find the min-cut of the virtual source and sink
Community Finding
  Girvan and Newman: divisive, removes the links with the highest betweenness
  Clauset, et al.: agglomerative, uses modularity
  Shi & Malik: spectral clustering
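Modularity, the quantity that agglomerative methods such as Clauset et al. try to increase, can be computed for any given partition. The sketch below uses the common form Q = sum over communities of (internal links / m) - (total degree / 2m)^2, where m is the number of links. The example graph and function name are hypothetical, and only the score is shown, not the merging procedure.

```python
def modularity(adj, community_of):
    """Modularity of a partition of an undirected graph.

    adj maps node -> set of neighbors; community_of maps node -> community id.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2.0
    internal = {}     # number of links inside each community
    degree_sum = {}   # total degree of each community
    for u, nbrs in adj.items():
        c = community_of[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for v in nbrs:
            if community_of[v] == c:
                # Every internal link is visited from both ends.
                internal[c] = internal.get(c, 0.0) + 0.5
    return sum(
        internal.get(c, 0.0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Two triangles joined by a single link (3-4).
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
merged = {v: "all" for v in adj}

q_split = modularity(adj, split)
q_merged = modularity(adj, merged)
```

Splitting the graph into its two triangles scores higher than putting every node in one community, which scores zero.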
Communities - summary
  There are many options for building communities around a small group of nodes
  Possible future directions
    finding communities in networks having different link types
    impact of network type on community finding techniques