graph and link mining -...
TRANSCRIPT
Graphs - Basics
A graph is a powerful abstraction for modeling entities and their pairwise relationships.
G = (V, E)
Set of nodes V = {v1, …, v5}
Set of edges E = {(v1, v2), …, (v4, v5)}
Examples: Social network
Twitter Followers
Web
Collaboration graphs
[Figure: example graph on nodes v1, …, v5]
Undirected Graphs
The edges are unordered pairs – they can be traversed in either direction.
Degree of node: Number of edges incident on the node
Path: A sequence of edges from one node to another
Connected Component: A set of nodes such that there is a path between any two nodes in the set
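[Figure: example undirected graph on nodes v1, …, v5]
Adjacency matrix A (rows and columns in the order v1, …, v5):
1 0 0 1 1
0 0 1 1 1
0 1 0 1 1
1 1 0 0 1
1 1 1 1 0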
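The definitions of degree and connected component can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own graph: the node names and edges below are made up, and the helper names (build_adjacency, degree, connected_components) are hypothetical.

```python
from collections import deque

# Hypothetical undirected graph (not the one drawn on the slide).
nodes = ["v1", "v2", "v3", "v4", "v5", "v6"]
edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v4", "v5")]

def build_adjacency(nodes, edges):
    """Map each node to the set of its neighbors; each edge is stored both ways."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree(adj, v):
    """Number of edges incident on v (assumes no self-loops)."""
    return len(adj[v])

def connected_components(adj):
    """Group nodes into connected components using breadth-first search."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components

adj = build_adjacency(nodes, edges)
print(degree(adj, "v1"))
print(connected_components(adj))
```

In this made-up graph v1 has degree 2, and there are three components: {v1, v2, v3}, {v4, v5}, and the isolated node v6.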
Directed Graphs
Edges are ordered pairs – they can be traversed in the direction from first to second.
In-degree and Out-degree of a node.
Path: A sequence of directed edges from one node to another
Strongly Connected Component: A set of nodes such that there is a directed path from every node in the set to every other node in the set
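[Figure: example directed graph on nodes v1, …, v5]
Adjacency matrix A (rows and columns in the order v1, …, v5):
1 0 0 0 1
0 0 1 1 1
0 0 0 1 0
1 0 0 0 0
0 0 1 1 0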
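In-degree, out-degree, and strong connectivity can be sketched as follows. The edge list is a hypothetical example, not the slide's graph. The component of a node is found here by intersecting the nodes it can reach with the nodes that can reach it; this is a simple approach chosen for readability, not one of the standard linear-time algorithms (Tarjan, Kosaraju).

```python
# Hypothetical directed graph: each pair (u, v) is an edge from u to v.
nodes = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1"), ("v3", "v4")]

def successors(nodes, edges):
    """Map each node to the list of nodes its outgoing edges point to."""
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    return out

def reachable(adj, start):
    """All nodes reachable from start along directed edges (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

out_adj = successors(nodes, edges)
in_adj = successors(nodes, [(v, u) for u, v in edges])  # same graph, edges reversed

out_degree = {v: len(out_adj[v]) for v in nodes}
in_degree = {v: len(in_adj[v]) for v in nodes}

def strongly_connected_component(v):
    """Nodes that v can reach and that can also reach v."""
    return reachable(out_adj, v) & reachable(in_adj, v)

print(out_degree, in_degree)
print(sorted(strongly_connected_component("v1")))
```

Here v1, v2, v3 form a directed cycle and so a strongly connected component; v4 can be reached from them but has no path back, so it is a component on its own.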
Examples of Graphs we Might Mine
Airline Route Maps are useful
Info can tell you about both history and politics
Call Detail Records
Tell us about relationships between people
Who got in trouble about a decade ago for using this info?
The Web is based on (hyper)links between documents
Social Networks form Graphs
Link Analysis is the data mining technique that addresses relationships and connections
6 Degrees of Separation
Claim: there are at most 6 degrees of separation between any two people
This is important in social networks
LinkedIn tells you how you connect to others, and your network expands with each link
Stanley Milgram was not the first to note the small-world effect, but he popularized it with a famous experiment: how close are two random people?
Picked source people in Omaha, Nebraska or Wichita, Kansas, and a target person in Boston
Asked each source person to send a letter to the target person or, if they did not know the target, to someone more likely to know them
Average path length was 5.5 or 6
But only 64 of 296 letters arrived (this is often not highlighted)
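The "degrees of separation" between two people is the length of the shortest path between them, which breadth-first search computes. The sketch below uses a made-up acquaintance graph (all names are hypothetical) and a hypothetical helper degrees_of_separation. Note that Milgram's participants forwarded letters by guessing, so their chains need not be shortest paths; the true shortest path is at most as long as any chain that arrived.

```python
from collections import deque

# Hypothetical acquaintance graph; the names are made up for illustration.
friends = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dan"],
    "Cat": ["Ann", "Dan"],
    "Dan": ["Bob", "Cat", "Eve"],
    "Eve": ["Dan"],
    "Fay": [],
}

def degrees_of_separation(graph, source, target):
    """Length of the shortest path from source to target, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None

print(degrees_of_separation(friends, "Ann", "Eve"))
print(degrees_of_separation(friends, "Ann", "Fay"))
```

In this toy graph Ann reaches Eve in 3 steps (for example Ann, Bob, Dan, Eve), while Fay has no acquaintances and is unreachable.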
Examples of Applications
Identifying authoritative sources of information on the WWW by analyzing page links
Google and PageRank – we will come back to this
Understanding physician referral patterns
Analyzing telephone call patterns
MCI Friends and Family: you call Mary Smith, also on MCI, so ask her to join MCI
But your wife does not know Mary Smith! Oops!
Far-fetched? Facebook does it all of the time!
Identifying fraud: in the past, someone would purchase several stolen calling cards and use them to call the same person. That is a clue.
Mining the graph structure
A graph is a combinatorial object, with a certain structure.
Mining the structure of the graph reveals information about the entities in the graph
E.g., if in the Facebook graph I find that there are 100 people who are all linked to each other, then these people are likely to be a community
The community discovery problem
By measuring the number of friends in the Facebook graph I can find the most important nodes
The node importance problem
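A group of people who are "all linked to each other" is a clique: every pair in the group shares an edge. A minimal check, assuming a made-up set of friendships and a hypothetical helper is_clique, could look like this. It only tests a given group; finding such groups in a large graph is a much harder problem.

```python
from itertools import combinations

# Hypothetical friendship edges, stored as unordered pairs.
friendships = {
    frozenset(pair)
    for pair in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
}

def is_clique(members, friendships):
    """True if every pair of members is directly linked."""
    return all(
        frozenset(pair) in friendships for pair in combinations(members, 2)
    )

print(is_clique(["a", "b", "c"], friendships))
print(is_clique(["a", "b", "c", "d"], friendships))
```

The first group is a clique; the second is not, because a and d (and b and d) are not linked.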
Importance problem
What are the most important nodes in the graph?
What are the most authoritative pages on the web?
Who are the important users in Facebook?
What are the most influential Twitter accounts?
Link Analysis
First-generation search engines
viewed documents as flat text files
could not cope with size, spamming, user needs
Second-generation search engines
Ranking becomes critical
shift from relevance to authoritativeness
authoritativeness: the static importance of the page
a success story for network analysis and a huge commercial success
it all started with two graduate students at Stanford. Everyone knows the company, right?
Link Analysis: Intuition
A link from page p to page q denotes endorsement
page p considers page q an authority on a subject
use the graph of recommendations
assign an authority value to every page
The same idea applies to other graphs as well
Twitter graph, where user p follows user q
Constructing the graph
Goal: output an authority weight for each node
Also known as centrality or importance
[Figure: graph with an authority weight w on each node]
Rank by Popularity
Rank pages according to the number of incoming edges (in-degree, degree centrality)
1. Red Page
2. Yellow Page
3. Blue Page
4. Purple Page
5. Green Page
[Figure: five-page example graph with node weights w=1, w=1, w=2, w=3, w=2]
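Ranking by popularity amounts to counting incoming links and sorting. The sketch below uses a hypothetical set of links between pages named A to E (not the colored pages from the slide, whose edges are not recoverable here); ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter

# Hypothetical links as (source page, target page); not the slide's graph.
pages = ["A", "B", "C", "D", "E"]
links = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A"), ("E", "A"), ("A", "B")]

# In-degree of a page = number of links that point to it.
in_degree = Counter(target for _, target in links)

# Highest in-degree first; pages with equal in-degree are ordered by name.
ranking = sorted(pages, key=lambda p: (-in_degree[p], p))

for position, page in enumerate(ranking, start=1):
    print(position, page, in_degree[page])
```

With these links C ranks first (3 incoming links), then A (2), B (1), and finally D and E (0 each).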
Popularity
What matters is not only how many pages link to you, but also how important those pages are
Good authorities are pointed to by good authorities
Recursive definition of importance
PageRank
Good authorities are pointed to by good authorities
The value of a page is the value of the pages that link to it
How do we implement that?
Each node distributes its authority value equally among the nodes it links to
The authority value of each node is the sum of the authority fractions it collects from the nodes that link to it.
Solving the system of equations we get authority values for the nodes: w1 = ½, w2 = ¼, w3 = ¼
w1 + w2 + w3 = 1
w1 = w2 + w3
w2 = ½ w1
w3 = ½ w1
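The solution follows by substitution. Writing the three weights as w_1, w_2, w_3 (labels assumed here, with w_1 the node that receives from the other two), the last two equations are substituted into the normalization:

```latex
\begin{align*}
w_2 &= \tfrac{1}{2} w_1, \qquad w_3 = \tfrac{1}{2} w_1 \\
w_1 + w_2 + w_3 &= w_1 + \tfrac{1}{2} w_1 + \tfrac{1}{2} w_1 = 2 w_1 = 1
  \;\Rightarrow\; w_1 = \tfrac{1}{2} \\
w_2 &= w_3 = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4} \\
\text{check: } w_2 + w_3 &= \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} = w_1
\end{align*}
```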
A More Complex Example
[Figure: directed graph on nodes v1, …, v5]
w1 = 1/3 w4 + 1/2 w5
w2 = 1/2 w1 + w3 + 1/3 w4
w3 = 1/2 w1 + 1/3 w4
w4 = 1/2 w5
w5 = w2
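The five equations alone only fix the weights up to a common scale, so the sketch below adds the normalization w1 + … + w5 = 1, as in the three-node example (an assumption for this graph). It solves the system exactly with rational arithmetic and a hand-written Gauss-Jordan elimination; the variable names are illustrative.

```python
from fractions import Fraction as F

# Coefficients of the five equations: row i lists how much of each w_j
# flows into w_i (columns are w1..w5).
M = [
    [0,       0, 0, F(1, 3), F(1, 2)],  # w1 = 1/3 w4 + 1/2 w5
    [F(1, 2), 0, 1, F(1, 3), 0],        # w2 = 1/2 w1 + w3 + 1/3 w4
    [F(1, 2), 0, 0, F(1, 3), 0],        # w3 = 1/2 w1 + 1/3 w4
    [0,       0, 0, 0,       F(1, 2)],  # w4 = 1/2 w5
    [0,       1, 0, 0,       0],        # w5 = w2
]
n = len(M)

# Augmented system (I - M) w = 0. Each column of M sums to 1, so one equation
# is redundant; replace the last one with w1 + ... + w5 = 1.
rows = [[F(int(i == j)) - M[i][j] for j in range(n)] + [F(0)] for i in range(n)]
rows[-1] = [F(1)] * n + [F(1)]

# Gauss-Jordan elimination with exact fractions.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [x / rows[col][col] for x in rows[col]]
    for r in range(n):
        if r != col and rows[r][col] != 0:
            factor = rows[r][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]

w = [rows[i][n] for i in range(n)]
print(w)
```

Under that normalization the weights are w1 = 2/11, w2 = 3/11, w3 = 3/22, w4 = 3/22, w5 = 3/11, so v2 and v5 come out as the most important nodes.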
Random Walks on Graphs
What we described is equivalent to a random walk on the graph
Random walk:
Start from a node chosen uniformly at random
Pick one of the outgoing edges uniformly at random and move to its destination
Repeat
Some nodes will be visited more often than others. Those are more important.
Importance is based not only on the number of incoming links, but also on how often the predecessor nodes are visited
A value like Google's PageRank indicates how often a node would be visited
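The random walk can be simulated directly. The sketch below uses the out-links implied by the five-node example's equations (for instance, v1 sends half its weight to each of v2 and v3, so it links to both). The function name, the step count, and the fixed seed are illustrative choices.

```python
import random

# Out-links read off the equations of the five-node example.
out_links = {
    "v1": ["v2", "v3"],
    "v2": ["v5"],
    "v3": ["v2"],
    "v4": ["v1", "v2", "v3"],
    "v5": ["v1", "v4"],
}

def visit_frequencies(out_links, steps, seed=0):
    """Simulate the walk and return the fraction of steps spent at each node."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    nodes = sorted(out_links)
    current = rng.choice(nodes)        # start from a uniformly random node
    visits = dict.fromkeys(nodes, 0)
    for _ in range(steps):
        current = rng.choice(out_links[current])  # follow a random out-edge
        visits[current] += 1
    return {v: visits[v] / steps for v in nodes}

freq = visit_frequencies(out_links, steps=200_000)
for node, f in freq.items():
    print(node, round(f, 3))
```

For a long enough walk the visit frequencies should come out close to the weights that solve the equations for this graph, roughly 0.18, 0.27, 0.14, 0.14, 0.27 for v1 to v5.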
Random walks on graphs
Question: what is the probability of being at a specific node?
pi: probability of being at node i at this step
p'i: probability of being at node i in the next step
After many steps the probabilities converge to the stationary distribution of the random walk.
[Figure: directed graph on nodes v1, …, v5]
p'1 = 1/3 p4 + 1/2 p5
p'2 = 1/2 p1 + p3 + 1/3 p4
p'3 = 1/2 p1 + 1/3 p4
p'4 = 1/2 p5
p'5 = p2
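Applying the five update equations repeatedly shows the convergence. The sketch below starts from the uniform distribution and stops when no probability changes by more than a small tolerance; the tolerance and the iteration cap are arbitrary choices. Convergence is not guaranteed for every graph (a walk that strictly alternates between two groups of nodes keeps oscillating), but it does hold for this example.

```python
def step(p):
    """One step of the walk: apply the five update equations to (p1, ..., p5)."""
    p1, p2, p3, p4, p5 = p
    return (
        p4 / 3 + p5 / 2,        # p'1
        p1 / 2 + p3 + p4 / 3,   # p'2
        p1 / 2 + p4 / 3,        # p'3
        p5 / 2,                 # p'4
        p2,                     # p'5
    )

p = (0.2,) * 5  # start from a node chosen uniformly at random
for iteration in range(1, 1001):
    nxt = step(p)
    change = max(abs(a - b) for a, b in zip(nxt, p))
    p = nxt
    if change < 1e-12:
        break

print(iteration, [round(x, 4) for x in p])
```

The limit is the stationary distribution (2/11, 3/11, 3/22, 3/22, 3/11), which satisfies p' = p in all five equations.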
How Does PageRank Work?
Initialize all pages to an arbitrary PageRank, e.g., 1
Repeatedly recompute the value of each page from the values of the pages that link to it
Eventually the values will converge
PageRank is what caused Google to succeed
Prior to that, only content mattered, not link structure
Benefits of PageRank
It is not trivial to fool PageRank
You can create dummy pages that point to your page, but since no one points to those pages, they will have low PageRank and not help much
You can also make the dummy pages point to one another, but unless an outside authority points to them, the impact will be limited
But it is clear that Google must have many tweaks to catch cases like this (link spam or link farms)
Social Network Analysis
Social Network Analysis Overview (5 minutes): https://www.youtube.com/watch?v=fgr_g1q2ikA
What is Social Network Analysis (4 minutes): https://www.youtube.com/watch?v=xT3EpF2EsbQ