TRANSCRIPT
On Application-Aware Data Extraction for Big Data in Social Networks
Ming-Syan Chen
Research Center for Information Technology Innovation, Academia Sinica
EE Department, National Taiwan University
The Fast Increase of Social Network Activities
• Example social networks: – Twitter
– Flickr
– MSN
– Wikipedia
– Amazon.com
• Such a network – Very huge in size!
– Cannot easily be analyzed
M.-S. Chen 2
The Amount of Information is Huge!
• Twitter – 150+ million members – 50 million tweets per day
• Facebook – 800+ million users
• Amazon co-purchasing network – half a million product nodes – several million recommendation links
• Web pages – Yahoo! has over one billion Web pages
(Figure sources: Amazon, SNSP, and twitter.com)
Example of Big Data and Social Network
Volume: thousands of people!
Velocity: fast accumulating!!
Variety: eating different food!!!
Example of Big Data and Social Network
For some gossip on this occasion, veracity is an issue and the information value could be low.
Mr. Lin won the lottery!
Mrs. Chang just did a face lift!
Information Extraction for Big Data in Social Networks
• Extracting important information from large social network graphs
– To allow data analysts to mine the information in large social networks, to enable scalable storage and querying, and to facilitate the development of real-world applications
Outline
• Graph reduction
– Summarization, sampling, and extraction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Graph Reduction
Graph summarization (going thru all data)
e.g., NTU has 32K students, 20% are sushi lovers, 25% prefer steak, also 15% are artists, 20% are engineers, etc.
Graph sampling (going thru a subset)
Getting a small representative set of NTU students (which preferably fit statistics)
Graph extraction
Application/goal-oriented data extraction, e.g., only picking good eaters for feast contest.
Graph Extraction
Application/goal-oriented data extraction
Three levels of information extraction from SN graphs
• Parameter extraction (e.g., company stat.)
– Fast calculation of closeness centrality
• Information extraction (e.g., company biz.)
– Guide query
• Structure extraction (e.g., company org.)
– Decomposing SN graphs
To handle complicated things with simple skills.
執簡御繁 (a Chinese idiom: to govern the complex by grasping the simple)
Our "weapons": parameter extraction, information extraction (regarding capability), and structure extraction.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Closeness centrality
• There are several interesting quantities in SN graphs, including closeness centrality, network diameter, and degree distribution.
• Closeness centrality of node v, Cc(v): the inverse of the average shortest path distance from v to any other node in the network.
– If Cc(v) is large, v is around the center, as it requires only a few hops to reach the other nodes.
Response to Dynamic Changes
• Edge insertions and deletions are frequent in a social network.
– It is desirable to quickly update the closeness centrality of every node in response to an edge insertion/deletion.
• Example use: pick a number of people (the nodes with high CCs) who can maximize advertisement effectiveness.
Example of Closeness Centrality
Cc(v): the inverse of the average shortest path distance from v to the other nodes:

Cc(v) = (|V| − 1) / Σ_{u ∈ V} |p(v,u)|

where |p(v,u)| is the shortest-path distance between v and u. Consider an unweighted and undirected graph G with 14 nodes and 18 edges. Node v has 4 nodes at distance 1, 2 at distance 2, 1 at distance 3, 1 at distance 4, 2 at distance 5, 2 at distance 6, and 1 at distance 7:

Cc(v) = (14 − 1) / (4·1 + 2·2 + 1·3 + 1·4 + 2·5 + 2·6 + 1·7) = 13/44

Node w has 3 nodes at distance 1, 4 at distance 2, 4 at distance 3, and 2 at distance 4:

Cc(w) = (14 − 1) / (3·1 + 4·2 + 4·3 + 2·4) = 13/31

Since 13/31 > 13/44, node w is closer to all other nodes than node v.
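As a concrete illustration of the definition above, closeness centrality can be computed with one BFS per node. A minimal sketch on a toy graph (the graph below is our own example, not the slide's 14-node graph):

```python
from collections import deque

def closeness(adj, v):
    """Cc(v) = (|V| - 1) / (sum of shortest-path distances from v), via BFS."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for nb in adj[u]:
            if nb not in dist:
                dist[nb] = dist[u] + 1
                q.append(nb)
    return (len(adj) - 1) / sum(dist.values())  # distance to v itself is 0

# Toy star graph: center 'c' with leaves 'a', 'b', 'd'
adj = {'c': ['a', 'b', 'd'], 'a': ['c'], 'b': ['c'], 'd': ['c']}
print(closeness(adj, 'c'))  # 3/3 = 1.0 (the center is one hop from everyone)
print(closeness(adj, 'a'))  # 3/(1+2+2) = 0.6
```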
Calculating Closeness Centrality
• One can calculate the closeness centralities of all vertices by solving the All-Pairs Shortest Paths (APSP) problem.
– O(n(m+n)) based on the breadth-first search (BFS) method for an undirected graph, where n and m are the numbers of nodes and edges in the graph.
– In a dynamic graph, re-solving the APSP problem after each edge insertion or deletion is not efficient.
• Note that only some pairs of shortest paths will be affected by a given edge change.
– Identify them (unstable node pairs) for fast calculation of CC.
Example
For example, with the addition of edge (a,b):
Unchanged shortest paths
◦ p(b,v), p(c,t) and p(r,h), etc.
Changed shortest paths
◦ Before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c} and p(u,v)={u,l,o,d,w,r,s,v}, etc.
◦ After edge insertion (we then call these nodes unstable): p(a,b)={a,b}, p(a,c)={a,b,c} and p(u,v)={u,x,a,b,c,v}, etc.
(a): the original unweighted and undirected graph G. (b): G' = G ∪ e(a,b).
Illustration of Unstable Node Pairs
• To find V'u: the u-unstable node set, containing the nodes whose shortest paths to u changed after the edge addition.
• If we perform BFS at node u in G and G' to obtain Gu and G'u, we find that only the shortest paths p(u,b), p(u,c), p(u,h), p(u,v) and p(u,t) changed.
– Unstable node pairs: (u,b), (u,c), (u,h), (u,v) and (u,t).
– V'u = {b,c,h,v,t}
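The BFS-before/BFS-after step described above can be sketched as follows; this is our own minimal illustration on a toy path graph, not the slide's graph:

```python
from collections import deque

def bfs_dist(adj, src):
    """All shortest-path distances from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for nb in adj[u]:
            if nb not in dist:
                dist[nb] = dist[u] + 1
                q.append(nb)
    return dist

def unstable_set(adj, edge, u):
    """V'_u: nodes whose shortest-path distance to u changes when `edge` is added."""
    before = bfs_dist(adj, u)
    a, b = edge
    adj2 = {k: list(v) for k, v in adj.items()}  # copy, then insert the edge
    adj2[a].append(b)
    adj2[b].append(a)
    after = bfs_dist(adj2, u)
    return {v for v in adj if after.get(v) != before.get(v)}

# Toy path graph 0-1-2-3-4; adding edge (0,4) shortens some paths from node 0
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(unstable_set(adj, (0, 4), 0))  # {3, 4}: their distances to 0 drop
```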
(Main Theorem) After the addition of edge (a,b), every unstable node pair {v,u} (whose shortest path changed) has v ∈ V'a and u ∈ V'b. Only these shortest paths will change after the edge addition (and need to be re-calculated).
Concurrent Calculation of CC in SN
• Perform BFS in parallel at nodes a and b in G and G' to obtain Ga, G'a, and V'a = {b,c,h,v,t}, as well as Gb, G'b, and V'b = {a,x,l,u}, simultaneously.
• Then inform the nodes in these unstable pairs to re-calculate their shortest paths to others and their CC, e.g., by performing a BFS starting at each of a, x, l, u ∈ V'b.
Experiments
• To evaluate CENDY, we conducted experiments on six real unit-weighted graph datasets of different types.
• The case of edge deletion can be handled similarly (in light of a companion theorem we propose).
Experiments
Evaluation on edge insertion: from this table, we can see that the closeness centralities of all vertices and the APL can be updated by only a few BFS processes.
e.g., DBLP contains 460,413 nodes. The naïve way requires performing 460K BFS processes to update the closeness centralities and APL; CENDY requires only 4K BFS processes to finish the task.
Remark
• In response to the fast changes in SNs, CENDY is devised to efficiently update the closeness centrality of each node in the social network.
• The design of new algorithms is called for to efficiently calculate other key parameters in fast-changing social networks.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Motivation of Guide Query
Several works on information finding in social networks:
• Expert finding [Deng'08][Lappas'09]
– To find experts based on some given requirement
• Gateway finding [Koren'06][Wang'10]
– To find the gateways between the source group and the target group
• Active friending [Wu'13]
– To explore social networks to improve friend finding
• Guide query [Lin'13]
– To find the querier's informative friends regarding given attributes

[Deng'08] ICDM 2008. [Lappas'09] KDD 2009. [Koren'06] KDD 2006. [Wang'10] KDD 2010. [Wu'13] KDD 2013. [Lin'13] WAIM 2013.
Motivation of Guide Query (Cont’d)
• With expert finding, the answer is a list of experts ranked by their expertise.
• With the guide query, the answer is a list of informative friends of the querier, ranked by their ability to gather information from experts.
– Exploring social relationships
– Taking the probabilities of getting help into consideration
Guide Query: Graph Extraction based on Your Friends
(Figure: the querier and his/her friends, each labeled with the company (A–E) he/she has information about.)
"I want to know information about Company A or B."
"These two friends are who I should ask for information."
"This friend is also someone I should ask, since she can collect information from her friends."
Guide Query
• Guide query [Lin'13]
– For a user initiating the query, the answer is the set of the user's neighbors that are informative about user-assigned attributes.
– An informative neighbor either has the attributes itself or knows some other friends that have the attributes.
[Lin'13] Y.-C. Lin, P. S. Yu, M.-S. Chen, "Guide Query in Social Networks," WAIM 2013.
Problem Definition
(Example graph: query node N0 with neighbors N1–N4; second-hop nodes N11–N13, N21, N31–N34, N41; each node is labeled with its attribute set, e.g., N13 = {A, B}.)
Given a query node q and a set of keywords W = {w1, w2, …, w|W|}, the guide query is to find the top-k informative neighbors of q considering W.
q = N0, W = {A, B}
Candidates: the neighbors of q (N1, N2, N3, N4). Targets: the nodes carrying a queried attribute.
Problem (Cont’d)
(Same example graph; each edge carries a probability, e.g., P = 0.6, 0.5, 0.3, 0.2, 0.8, 0.7.)
In the model, an edge is labeled with the probability that a node successfully spreads the request to the linked node.
We rank the candidates based on how informative they are, which is evaluated by the proposed InfScore and DivScore.
InfScore
(For the running example, every edge probability is set to P = 0.5.)
InfScore: the informative level of a candidate node, i.e., its ability to spread the request to targets, modeled by the expected number of targets the candidate is able to spread the request to.
InfScore
InfRatio is defined as the probability that a specific candidate successfully spreads the request to a certain target.
e.g., the InfRatio from N1 to N13 is 0.5 × 0.5 = 0.25.
InfScore (intensity)
The InfScore is the weighted sum of InfRatios:

N   InfScore
N1  1.5
N2  0.5
N3  1.5
N4  1.5

InfScore(N1) = 0.5 (N11) + 0.5 (N12) + 0.25 × 2 (N13, which carries both keywords) = 1.5
InfScore(N4) = 1.0 (N4 itself) + 0.5 (N41) = 1.5
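A small sketch of the InfScore computation, assuming each reachable target is summarized by its path probability (InfRatio) and attribute set; the helper name and data layout are ours, with values mirroring the slide's example:

```python
def inf_score(targets, keywords):
    """InfScore = sum over reachable targets of the path probability,
    counted once for each queried keyword the target carries."""
    return sum(p * len(attrs & keywords) for p, attrs in targets)

# Candidate N1 (slide example): N11 {A} at prob 0.5, N12 {B} at 0.5,
# and N13 {A, B} two hops away at 0.5 * 0.5 = 0.25
n1_targets = [(0.5, {"A"}), (0.5, {"B"}), (0.25, {"A", "B"})]
print(inf_score(n1_targets, {"A", "B"}))  # 0.5 + 0.5 + 0.25*2 = 1.5
```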
DivScore (Diversity)
The DivScore is an entropy-like measure to evaluate the diversity of the possibly accessible target nodes.
For each candidate, the target vector XT is defined so that each entry is a normalized InfScore value, describing the probability distribution over the different targets.
With the target vector, the DivScore is defined as the entropy of this distribution.
DivScore
(Same example graph with P = 0.5 on each edge.)
We design the DivScore over the probability distribution across the possibly accessible targets.

N   DivScore
N1  1.585
N2  0.000
N3  1.918
N4  0.918

Distribution of N3: [0.5/1.5, 0.5/1.5, 0.25/1.5, 0.25/1.5] = [1/3, 1/3, 1/6, 1/6]
Example: DivScore(N3) = [−(1/3)·log2(1/3)]·2 + [−(1/6)·log2(1/6)]·2 ≈ 1.918
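The DivScore example above can be reproduced as a base-2 entropy over the normalized InfRatios; a minimal sketch (the function name is ours):

```python
import math

def div_score(inf_ratios):
    """DivScore: Shannon entropy (base 2) of the normalized InfRatio distribution."""
    total = sum(inf_ratios)
    probs = [r / total for r in inf_ratios]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# N3's InfRatios to its four targets (slide example): 0.5, 0.5, 0.25, 0.25
print(round(div_score([0.5, 0.5, 0.25, 0.25]), 3))  # 1.918
```

A candidate with a single reachable target (like N2) gets zero diversity, matching the table.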
Experimental Setup
• DBLP dataset [DBLP]
– Co-authorship network
– Edge probability: based on the WC (weighted cascade) model [Chen'10]
• p(Ni → Nj) = 1 / d(Nj)
• d(Nj) is the in-degree of Nj
– Node attributes: the conference names of an author's publications
[DBLP] http://www.informatik.uni-trier.de/~ley/db/
[Chen'10] W. Chen, et al., "Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks," KDD 2010.
Experimental Results
Query input:
• q = 'Ming-Syan Chen'
• k = 10
• W = [KDD, SDM, CIKM, ICDM, PKDD]
Suppose Ming-Syan Chen wants to discuss with people who have published papers in KDD, SDM, CIKM, ICDM, or PKDD; which co-authors should he first connect to? (i.e., either co-authors who have papers in these conferences, or co-authors who co-authored with people who do.)
Remark
• The key notion is to guide the query to the right candidates in the social network.
– For each candidate, a combination of the expertise and the social relationship with the person initiating the query is considered.
• Just like the group formation problem (KDD 2012) and this expert finding problem (WAIM 2013), more applications/tools can be enhanced by taking social relationships into consideration.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Diffusion Analysis in Social Networks
• Diffusion of information can be used to model the interaction among nodes in a network, e.g.,
– Viruses spreading over the Internet.
– Diseases spreading in a community.
– Rumors/news spreading among people.
Example Diffusion
• Information diffusion can happen in social networks, such as Facebook and Twitter.
(Figure: an example path of infection over an underlying network of nodes n1–n9, with infection times 0–3.)
The Network is Hidden
• In some situations, the underlying network is not known (due to cost or privacy issues).
• The network inference problem (NIP) is studied to discover the underlying network, i.e., to infer the network from what happened.
(Figure: the same infection times over nodes n1–n9, with the underlying network hidden.)
Network Inference Problem
• Assume there is an underlying information network.
• NIP is to infer the information network given a set of cascades.
• A cascade t^s = [t_1^s, …, t_N^s] is the time record of information s spreading over the network (N is the number of nodes), i.e., node n_i gets s (is infected) at time t_i^s.
• If a node i is never infected with s, set t_i^s = ∞.
• Ex: t^s = [∞, ∞, 2, ∞, 0, 1]
(Figure: a six-node example in which n5 is infected at time 0, n6 at time 1, and n3 at time 2.)
Clustering Cascades
• Traditionally, NIP assumes there is one underlying network, which may not always be true in reality.
– e.g., sports news, political news, and entertainment news are likely to spread in different ways.
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern.
– An SN graph is thereby decomposed into application-specific ones.
Example Cascades
(Figure: six example cascades over nodes n1–n6, each annotated with its infection times:
Cascade a (Lakers news), Cascade b (49ers news), Cascade c (Redskins news),
Cascade d (Heat news), Cascade e (Jets news), Cascade f (Celtics news).)
To Model Inference Network
• Modeling method:
– If two nodes are always infected within a short time of each other, the edge weight should be large:

w_ij = (1 / |{s : t_i^s < t_j^s}|) · Σ_{s : t_i^s < t_j^s} 1 / (t_j^s − t_i^s)

– Consider w_12 as an example: {s : t_1^s < t_2^s} = {b, c, e}, so

w_12 = (1/3) · ( 1/(∞ − 0) + 1/(1 − 0) + 1/(2 − 0) ) = (1/3) · (0 + 1 + 1/2) = 1/2
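The w_ij formula can be sketched directly; the cascade records below are reduced to the two nodes involved and follow the slide's w_12 example (the function name and dictionary layout are ours):

```python
import math

def edge_weight(cascades, i, j):
    """w_ij: average of 1/(t_j - t_i) over cascades where i is infected before j."""
    diffs = [t[j] - t[i] for t in cascades if t[i] < t[j]]
    if not diffs:
        return 0.0
    return sum(1.0 / d for d in diffs) / len(diffs)

inf = math.inf
# Infection times of nodes 0 (=n1) and 1 (=n2) in cascades b, c, e
cascades = [
    {0: 0, 1: inf},  # cascade b: n2 never infected, contributes 1/inf = 0
    {0: 0, 1: 1},    # cascade c
    {0: 0, 1: 2},    # cascade e
]
print(edge_weight(cascades, 0, 1))  # (0 + 1 + 0.5) / 3 = 0.5
```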
Example Inference Network
(Figure: the inferred six-node network over n1–n6; the inferred edge weights include 0.67, 0.5, 0.25, and 0.17.)
To Cluster Cascades by K-Means
• Transform each cascade t into an N-dimensional indicator vector based on whether each node is infected or not.
• Ex:
– t_a = [∞, ∞, ∞, ∞, 0, 1] → [0, 0, 0, 0, 1, 1]
– t_b = [0, ∞, ∞, 1, ∞, ∞] → [1, 0, 0, 1, 0, 0]
– t_c = [0, 1, 2, ∞, ∞, ∞] → [1, 1, 1, 0, 0, 0]
• Run K-means to get the clustering result.
– {a, d, f} and {b, c, e}
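The indicator transform plus a tiny 2-means pass can be sketched as below; `kmeans2` is our own minimal implementation with fixed seed centroids for determinism, not the clustering code used in the work:

```python
import math

def indicator(t):
    """Map a cascade's infection-time vector to a 0/1 infected-indicator vector."""
    return [0 if ti == math.inf else 1 for ti in t]

def kmeans2(vecs, seeds, iters=10):
    """Minimal 2-means on vectors; seeds are indices of the initial centroids."""
    centroids = [list(vecs[s]) for s in seeds]
    assign = [0] * len(vecs)
    for _ in range(iters):
        for i, v in enumerate(vecs):  # assignment step
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            assign[i] = d.index(min(d))
        for k in range(2):            # update step
            members = [vecs[i] for i in range(len(vecs)) if assign[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

inf = math.inf
t_a = [inf, inf, inf, inf, 0, 1]
t_b = [0, inf, inf, 1, inf, inf]
t_c = [0, 1, 2, inf, inf, inf]
vecs = [indicator(t) for t in (t_a, t_b, t_c)]
print(vecs[0])                      # [0, 0, 0, 0, 1, 1]
print(kmeans2(vecs, seeds=(0, 1)))  # [0, 1, 1]: cascade a vs. cascades b, c
```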
Graph Decomposition
• By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried.
(Figure: two inferred sub-networks over n1–n6, one for cascades {a, d, f} (NBA news) and one for cascades {b, c, e} (NFL news), each with its own edge weights.)
Remark
• Traditionally, NIP results in a dense and complex network, from which it is difficult to capture knowledge.
• By properly clustering cascades, we obtain a few concise networks that carry clearer information.
– These resulting networks better match the corresponding cascades than a single dense network does.
Conclusion
• Information extraction is an application/goal-oriented process to capture the key ingredients (parameters, information, structure, etc.) of a huge SN.
• The procedure of information extraction can be integrated into related processes for better efficiency in practice.
Thank you!
Graph Summarization
Condense the original graph into a more compact form.
Lossless and lossy methods exist.
Required to examine the entire network.
(Figure: graph G with nodes 1–10 summarized into a compact graph Gs with supernodes Sa = {2,3}, Sb = {1,9}, Sc = {7,8,10}, Sd = {4,5,6}, plus edge corrections − {5,10} and − {6,10}.)
A revised example from S. Navlakha et al., "Graph Summarization with Bounded Error," SIGMOD 2008.
Graph Sampling
• Graph sampling
– Selecting a subset of the original data
– Characteristics of the original graph are preserved
– Only a proportion of the nodes in the network are visited
(Figure plotted by NodeXL, an Excel template created by the NodeXL team at Microsoft Research.)
A Running Example of CENDY
Originally, we have the closeness centralities of all nodes and the average path length (APL) of the graph, an unweighted and undirected graph G with 14 nodes and 18 edges:

A = a     b     c     d     h     l     o     r     s     t     u     v     w     x
    13/40 13/35 13/37 13/33 13/46 13/47 13/40 13/33 13/40 13/56 13/57 13/44 13/31 13/47

e.g., Cc(x) = (14 − 1)/47 = 13/47, since the shortest-path distances from x to the other 13 nodes sum to 47.

L(G) = (40 + 35 + 37 + … + 47) / (14 · (14 − 1)) = 586/182 ≈ 3.22
Example (Cont’d)
For the insertion of the edge e(a,b):
• We perform BFS at node a in G and G' to obtain Ga and G'a, and then have V'a = {b,c,h,v,t}.
Example (Cont’d)
• Also, we perform BFS at node b in G and G' to obtain Gb and G'b, and then have V'b = {a,x,l,u}.
Example (Cont’d)
• Then, in light of the main theorem, we re-calculate the paths between V'a and V'b.
• For example, for node x ∈ V'b, we calculate
– (1): ||p(x,t)| − |p'(x,t)|| = 7 − (1+1+3) = 2
– (2): ||p(x,h)| − |p'(x,h)|| = 6 − 4 = 2
– (3): ||p(x,v)| − |p'(x,v)|| = 6 − 4 = 2
– (4): ||p(x,c)| − |p'(x,c)|| = 5 − 3 = 2
– (5): ||p(x,b)| − |p'(x,b)|| = 4 − 2 = 2
• and then update its new closeness centrality:

Cc(x) = 13 / (47 − (1) − (2) − (3) − (4) − (5)) = 13 / (47 − 2 − 2 − 2 − 2 − 2) = 13/37
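The arithmetic of this update step can be checked directly; `updated_closeness` is a hypothetical helper of ours reflecting the calculation above:

```python
from fractions import Fraction

def updated_closeness(n, old_dist_sum, distance_drops):
    """Incrementally update Cc(x): subtract the per-pair shortest-path
    decreases from the old distance sum, then re-take (n - 1) / sum."""
    return (n - 1) / (old_dist_sum - sum(distance_drops))

# Node x in the slide's example: old distance sum 47, five paths shortened by 2 each
print(updated_closeness(14, Fraction(47), [2, 2, 2, 2, 2]))  # 13/37
```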
Example (Cont’d)
• Finally, we update the closeness centralities of the referenced nodes and recalculate the APL.

A = a     b     c     d     h     l     o     r     s     t     u     v     w     x
    13/30 13/28 13/30 13/33 13/39 13/42 13/40 13/33 13/40 13/49 13/47 13/37 13/31 13/37

L(G') = (30 + 28 + 30 + … + 37) / (14 · (14 − 1)) = 516/182 ≈ 2.84
Example Scenario
(Same example graph as in the problem definition: query node N0, neighbors N1–N4, second-hop nodes, and per-node attribute sets.)
N0 is initiating a query to find a job in company A or company B. Which friend should N0 ask for information?
New Contributions
• Compared with M. Gomez-Rodriguez, J. Leskovec, and A. Krause, "Inferring Networks of Diffusion and Influence," KDD 2010, our work is unique in that:
1. We assume there could be many underlying networks (rather than only one).
2. We model and learn a weighted graph (rather than an unweighted one).