TRANSCRIPT
On Application-Aware Data Extraction for Big Data in Social Networks
Ming-Syan Chen
Research Center for Information Technology Innovation, Academia Sinica
EE Department, National Taiwan University
The Fast Increase of Social Network Activities
• Example social networks: – Twitter
– Flickr
– MSN
– Wikipedia
– Amazon.com
• Such a network – Very huge in size!
– Cannot easily be analyzed
M.-S. Chen 2
The Amount of Information is Huge!
• Twitter – 150+ million members – 50 million tweets per day
• Facebook – 800+ million users
• Amazon co-purchasing network – half a million product nodes – several million recommendation links
• Web pages – Yahoo! has over one billion Web pages
(Figure sources: Amazon, SNSP, and twitter.com)
Example of Big Data and Social Network
Volume: thousands of people!
Velocity: fast accumulating!!
Variety: eating different food!!!
Example of Big Data and Social Network
For some gossip on this occasion, veracity is an issue and the information value could be low.
Mr. Lin won the lottery!
Mrs. Chang just did a face lift!
Information Extraction for Big Data in Social Networks
• Extracting important information from large social network graphs
– To allow data analysts to mine the information in large social networks, to enable scalable storage and querying, and to facilitate the development of real-world applications
Outline
• Graph reduction
– Summarization, sampling, and extraction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Graph Reduction
Graph summarization (going thru all data)
e.g., NTU has 32K students, 20% are sushi lovers, 25% prefer steak, also 15% are artists, 20% are engineers, etc.
Graph sampling (going thru a subset)
Getting a small representative set of NTU students (which preferably fit statistics)
Graph extraction
Application/goal-oriented data extraction, e.g., only picking good eaters for feast contest.
Graph Extraction
Application/goal-oriented data extraction
Three levels of information extraction from SN graphs
• Parameter extraction (e.g., company stat.)
– Fast calculation of closeness centrality
• Information extraction (e.g., company biz.)
– Guide query
• Structure extraction (e.g., company org.)
– Decomposing SN graphs
To handle complicated things with simple skills.
執簡御繁 (a Chinese idiom: to govern the complex by grasping the simple)
Our "weapons": parameter extraction, information extraction (regarding capability), and structure extraction.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Closeness centrality
• There are several interesting quantities in SN graphs, including closeness centrality, network diameter, and degree distribution.
• Closeness centrality of node v, Cc(v): the inverse of the average shortest path distance from v to any other node in the network.
– If Cc(v) is large, v is around the center, as it requires only a few hops to reach the other nodes.
Response to Dynamic Changes
• Edge insertions and deletions are frequent in a social network.
– It is desirable to quickly update the closeness centrality of every node in response to an edge insertion/deletion.
• Example use: pick a number of people (the nodes with high CCs) who can maximize advertisement effectiveness.
Example of Closeness Centrality
Cc(v): the inverse of the average shortest path distance from v to the other nodes:

Cc(v) = (|V| − 1) / Σ_{u ∈ V} |p(v,u)|

where |p(v,u)| is the shortest-path distance between v and u. Consider an unweighted and undirected graph G with 14 nodes and 18 edges. Node v has 4 nodes at distance 1, 2 at distance 2, 1 at distance 3, 1 at distance 4, 2 at distance 5, 2 at distance 6, and 1 at distance 7:

Cc(v) = (14 − 1) / (4·1 + 2·2 + 1·3 + 1·4 + 2·5 + 2·6 + 1·7) = 13/44

Node w has 3 nodes at distance 1, 4 at distance 2, 4 at distance 3, and 2 at distance 4:

Cc(w) = (14 − 1) / (3·1 + 4·2 + 4·3 + 2·4) = 13/31

Since 13/31 > 13/44, node w is closer to all other nodes than node v.
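As a concrete illustration of the definition above, closeness centrality can be computed with one BFS per node. A minimal sketch on a toy graph (the graph below is our own example, not the slide's 14-node graph):

```python
from collections import deque

def closeness(adj, v):
    """Cc(v) = (|V| - 1) / (sum of shortest-path distances from v), via BFS."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for nb in adj[u]:
            if nb not in dist:
                dist[nb] = dist[u] + 1
                q.append(nb)
    return (len(adj) - 1) / sum(dist.values())  # distance to v itself is 0

# Toy star graph: center 'c' with leaves 'a', 'b', 'd'
adj = {'c': ['a', 'b', 'd'], 'a': ['c'], 'b': ['c'], 'd': ['c']}
print(closeness(adj, 'c'))  # 3/3 = 1.0 (the center is one hop from everyone)
print(closeness(adj, 'a'))  # 3/(1+2+2) = 0.6
```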
Calculating Closeness Centrality
• One can calculate the closeness centralities of all vertices by solving the All-Pairs Shortest Paths (APSP) problem.
– O(n(m+n)) based on the breadth-first search (BFS) method for an undirected graph, where n and m are the numbers of nodes and edges in the graph.
– In a dynamic graph, re-solving the APSP problem after each edge insertion or deletion is not efficient.
• Note that only some pairs of shortest paths will be affected by a given edge change.
– Identify them (unstable node pairs) for fast calculation of CC.
Example
For example, with the addition of edge (a,b):
Unchanged shortest paths
◦ p(b,v), p(c,t) and p(r,h), etc.
Changed shortest paths
◦ Before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c} and p(u,v)={u,l,o,d,w,r,s,v}, etc.
◦ After edge insertion (we then call these nodes unstable): p(a,b)={a,b}, p(a,c)={a,b,c} and p(u,v)={u,x,a,b,c,v}, etc.
(a): the original unweighted and undirected graph G. (b): G' = G ∪ e(a,b).
Illustration of Unstable Node Pairs
• To find V'u: the u-unstable node set, containing the nodes whose shortest paths to u changed after the edge addition.
• If we perform BFS at node u in G and G' to obtain Gu and G'u, we find that only the shortest paths p(u,b), p(u,c), p(u,h), p(u,v) and p(u,t) changed.
– Unstable node pairs: (u,b), (u,c), (u,h), (u,v) and (u,t).
– V'u = {b,c,h,v,t}
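The BFS-before/BFS-after step described above can be sketched as follows; this is our own minimal illustration on a toy path graph, not the slide's graph:

```python
from collections import deque

def bfs_dist(adj, src):
    """All shortest-path distances from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for nb in adj[u]:
            if nb not in dist:
                dist[nb] = dist[u] + 1
                q.append(nb)
    return dist

def unstable_set(adj, edge, u):
    """V'_u: nodes whose shortest-path distance to u changes when `edge` is added."""
    before = bfs_dist(adj, u)
    a, b = edge
    adj2 = {k: list(v) for k, v in adj.items()}  # copy, then insert the edge
    adj2[a].append(b)
    adj2[b].append(a)
    after = bfs_dist(adj2, u)
    return {v for v in adj if after.get(v) != before.get(v)}

# Toy path graph 0-1-2-3-4; adding edge (0,4) shortens some paths from node 0
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(unstable_set(adj, (0, 4), 0))  # {3, 4}: their distances to 0 drop
```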
(Main Theorem) After the addition of edge (a,b), every unstable node pair {v,u} (whose shortest path changed) has v ∈ V'a and u ∈ V'b. Only these shortest paths will change after the edge addition (and need to be re-calculated).
Concurrent Calculation of CC in SN
• Perform BFS in parallel at nodes a and b in G and G' to obtain Ga, G'a, and V'a = {b,c,h,v,t}, as well as Gb, G'b, and V'b = {a,x,l,u}, simultaneously.
• Then inform the nodes in these unstable pairs to re-calculate their shortest paths to others and their CC, e.g., by performing a BFS starting at each of a, x, l, u ∈ V'b.
Experiments
• To evaluate CENDY, we conducted experiments on six real unit-weighted graph datasets of different types.
• The case of edge deletion can be handled similarly (in light of a companion theorem we propose).
Experiments
Evaluation on edge insertion: from this table, we can see that the closeness centralities of all vertices and the APL can be updated by only a few BFS processes.
e.g., DBLP contains 460,413 nodes. The naïve way requires performing 460K BFS processes to update the closeness centralities and APL; CENDY requires only 4K BFS processes to finish the task.
Remark
• In response to the fast changes in SNs, CENDY is devised to efficiently update the closeness centrality of each node in the social network.
• The design of new algorithms is called for to efficiently calculate other key parameters in fast-changing social networks.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Motivation of Guide Query
Several works on information finding in social networks:
• Expert finding [Deng'08][Lappas'09]
– To find experts based on some given requirement
• Gateway finding [Koren'06][Wang'10]
– To find the gateways between the source group and the target group
• Active friending [Wu'13]
– To explore social networks to improve friend finding
• Guide query [Lin'13]
– To find the querier's informative friends regarding given attributes

[Deng'08] ICDM 2008. [Lappas'09] KDD 2009. [Koren'06] KDD 2006. [Wang'10] KDD 2010. [Wu'13] KDD 2013. [Lin'13] WAIM 2013.
Motivation of Guide Query (Cont’d)
• With expert finding, the answer is a list of experts ranked by their expertise.
• With the guide query, the answer is a list of informative friends of the querier, ranked by their ability to gather information from experts.
– Exploring social relationships
– Taking the probabilities of getting help into consideration
Guide Query: Graph Extraction based on Your Friends
(Figure: the querier and his/her friends, each labeled with the company (A–E) he/she has information about.)
"I want to know information about Company A or B."
"These two friends are who I should ask for information."
"This friend is also someone I should ask, since she can collect information from her friends."
Guide Query
• Guide query [Lin'13]
– For a user initiating the query, the answer is the set of the user's neighbors that are informative about user-assigned attributes.
– An informative neighbor either has the attributes itself or knows some other friends that have the attributes.
[Lin'13] Y.-C. Lin, P. S. Yu, M.-S. Chen, "Guide Query in Social Networks," WAIM 2013.
Problem Definition
(Example graph: query node N0 with neighbors N1–N4; second-hop nodes N11–N13, N21, N31–N34, N41; each node is labeled with its attribute set, e.g., N13 = {A, B}.)
Given a query node q and a set of keywords W = {w1, w2, …, w|W|}, the guide query is to find the top-k informative neighbors of q considering W.
q = N0, W = {A, B}
Candidates: the neighbors of q (N1, N2, N3, N4). Targets: the nodes carrying a queried attribute.
Problem (Cont’d)
(Same example graph; each edge carries a probability, e.g., P = 0.6, 0.5, 0.3, 0.2, 0.8, 0.7.)
In the model, an edge is labeled with the probability that a node successfully spreads the request to the linked node.
We rank the candidates based on how informative they are, which is evaluated by the proposed InfScore and DivScore.
InfScore
(For the running example, every edge probability is set to P = 0.5.)
InfScore: the informative level of a candidate node, i.e., its ability to spread the request to targets, modeled by the expected number of targets the candidate is able to spread the request to.
InfScore
InfRatio is defined as the probability that a specific candidate successfully spreads the request to a certain target.
e.g., the InfRatio from N1 to N13 is 0.5 × 0.5 = 0.25.
InfScore (intensity)
The InfScore is the weighted sum of InfRatios:

N   InfScore
N1  1.5
N2  0.5
N3  1.5
N4  1.5

InfScore(N1) = 0.5 (N11) + 0.5 (N12) + 0.25 × 2 (N13, which carries both keywords) = 1.5
InfScore(N4) = 1.0 (N4 itself) + 0.5 (N41) = 1.5
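A small sketch of the InfScore computation, assuming each reachable target is summarized by its path probability (InfRatio) and attribute set; the helper name and data layout are ours, with values mirroring the slide's example:

```python
def inf_score(targets, keywords):
    """InfScore = sum over reachable targets of the path probability,
    counted once for each queried keyword the target carries."""
    return sum(p * len(attrs & keywords) for p, attrs in targets)

# Candidate N1 (slide example): N11 {A} at prob 0.5, N12 {B} at 0.5,
# and N13 {A, B} two hops away at 0.5 * 0.5 = 0.25
n1_targets = [(0.5, {"A"}), (0.5, {"B"}), (0.25, {"A", "B"})]
print(inf_score(n1_targets, {"A", "B"}))  # 0.5 + 0.5 + 0.25*2 = 1.5
```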
DivScore (Diversity)
The DivScore is an entropy-like measure to evaluate the diversity of the possibly accessible target nodes.
For each candidate, the target vector XT is defined so that each entry is a normalized InfScore value, describing the probability distribution over the different targets.
With the target vector, the DivScore is defined as the entropy of this distribution.
DivScore
(Same example graph with P = 0.5 on each edge.)
We design the DivScore over the probability distribution across the possibly accessible targets.

N   DivScore
N1  1.585
N2  0.000
N3  1.918
N4  0.918

Distribution of N3: [0.5/1.5, 0.5/1.5, 0.25/1.5, 0.25/1.5] = [1/3, 1/3, 1/6, 1/6]
Example: DivScore(N3) = [−(1/3)·log2(1/3)]·2 + [−(1/6)·log2(1/6)]·2 ≈ 1.918
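The DivScore example above can be reproduced as a base-2 entropy over the normalized InfRatios; a minimal sketch (the function name is ours):

```python
import math

def div_score(inf_ratios):
    """DivScore: Shannon entropy (base 2) of the normalized InfRatio distribution."""
    total = sum(inf_ratios)
    probs = [r / total for r in inf_ratios]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# N3's InfRatios to its four targets (slide example): 0.5, 0.5, 0.25, 0.25
print(round(div_score([0.5, 0.5, 0.25, 0.25]), 3))  # 1.918
```

A candidate with a single reachable target (like N2) gets zero diversity, matching the table.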
Experimental Setup
• DBLP dataset [DBLP]
– Co-authorship network
– Edge probability: based on the WC (weighted cascade) model [Chen'10]
• p(Ni → Nj) = 1 / d(Nj)
• d(Nj) is the in-degree of Nj
– Node attributes: the conference names of an author's publications
[DBLP] http://www.informatik.uni-trier.de/~ley/db/
[Chen'10] W. Chen, et al., "Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks," KDD 2010.
Experimental Results
Query input:
• q = 'Ming-Syan Chen'
• k = 10
• W = [KDD, SDM, CIKM, ICDM, PKDD]
Suppose Ming-Syan Chen wants to discuss with people who have published papers in KDD, SDM, CIKM, ICDM, or PKDD; which co-authors should he first connect to? (i.e., either co-authors who have papers in these conferences, or co-authors who co-authored with people who do.)
Remark
• The key notion is to guide the query to the right candidates in the social network.
– For each candidate, a combination of the expertise and the social relationship with the person initiating the query is considered.
• Just like the group formation problem (KDD 2012) and this expert finding problem (WAIM 2013), more applications/tools can be enhanced by taking social relationships into consideration.
Outline
• Graph reduction
• Information Extraction on Social Network Graphs
– Capturing key parameters (parameter extraction)
– Guide query (information extraction)
– Decomposing SN graphs (structure extraction)
Diffusion Analysis in Social Networks
• Diffusion of information can be used to model the interaction among nodes in a network, e.g.,
– Viruses spreading over the Internet.
– Diseases spreading in a community.
– Rumors/news spreading among people.
Example Diffusion
• Information diffusion can happen in social networks, such as Facebook and Twitter.
(Figure: an example path of infection over an underlying network of nodes n1–n9, with infection times 0–3.)
The Network is Hidden
• In some situations, the underlying network is not known (due to cost or privacy issues).
• The network inference problem (NIP) is studied to discover the underlying network, i.e., to infer the network from what happened.
(Figure: the same infection times over nodes n1–n9, with the underlying network hidden.)
Network Inference Problem
• Assume there is an underlying information network.
• NIP is to infer the information network given a set of cascades.
• A cascade t^s = [t_1^s, …, t_N^s] is the time record of information s spreading over the network (N is the number of nodes), i.e., node n_i gets s (is infected) at time t_i^s.
• If a node i is never infected with s, set t_i^s = ∞.
• Ex: t^s = [∞, ∞, 2, ∞, 0, 1]
(Figure: a six-node example in which n5 is infected at time 0, n6 at time 1, and n3 at time 2.)
Clustering Cascades
• Traditionally, NIP assumes there is one underlying network, which may not always be true in reality.
– e.g., sports news, political news, and entertainment news are likely to spread in different ways.
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern.
– An SN graph is thereby decomposed into application-specific ones.
Example Cascades
(Figure: six example cascades over nodes n1–n6, each annotated with its infection times:
Cascade a (Lakers news), Cascade b (49ers news), Cascade c (Redskins news),
Cascade d (Heat news), Cascade e (Jets news), Cascade f (Celtics news).)
To Model Inference Network
• Modeling method:
– If two nodes are always infected within a short time of each other, the edge weight should be large:

w_ij = (1 / |{s : t_i^s < t_j^s}|) · Σ_{s : t_i^s < t_j^s} 1 / (t_j^s − t_i^s)

– Consider w_12 as an example: {s : t_1^s < t_2^s} = {b, c, e}, so

w_12 = (1/3) · ( 1/(∞ − 0) + 1/(1 − 0) + 1/(2 − 0) ) = (1/3) · (0 + 1 + 1/2) = 1/2
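The w_ij formula can be sketched directly; the cascade records below are reduced to the two nodes involved and follow the slide's w_12 example (the function name and dictionary layout are ours):

```python
import math

def edge_weight(cascades, i, j):
    """w_ij: average of 1/(t_j - t_i) over cascades where i is infected before j."""
    diffs = [t[j] - t[i] for t in cascades if t[i] < t[j]]
    if not diffs:
        return 0.0
    return sum(1.0 / d for d in diffs) / len(diffs)

inf = math.inf
# Infection times of nodes 0 (=n1) and 1 (=n2) in cascades b, c, e
cascades = [
    {0: 0, 1: inf},  # cascade b: n2 never infected, contributes 1/inf = 0
    {0: 0, 1: 1},    # cascade c
    {0: 0, 1: 2},    # cascade e
]
print(edge_weight(cascades, 0, 1))  # (0 + 1 + 0.5) / 3 = 0.5
```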
Example Inference Network
(Figure: the inferred six-node network over n1–n6; the inferred edge weights include 0.67, 0.5, 0.25, and 0.17.)
To Cluster Cascades by K-Means
• Transform each cascade t into an N-dimensional indicator vector based on whether each node is infected or not.
• Ex:
– t_a = [∞, ∞, ∞, ∞, 0, 1] → [0, 0, 0, 0, 1, 1]
– t_b = [0, ∞, ∞, 1, ∞, ∞] → [1, 0, 0, 1, 0, 0]
– t_c = [0, 1, 2, ∞, ∞, ∞] → [1, 1, 1, 0, 0, 0]
• Run K-means to get the clustering result.
– {a, d, f} and {b, c, e}
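The indicator transform plus a tiny 2-means pass can be sketched as below; `kmeans2` is our own minimal implementation with fixed seed centroids for determinism, not the clustering code used in the work:

```python
import math

def indicator(t):
    """Map a cascade's infection-time vector to a 0/1 infected-indicator vector."""
    return [0 if ti == math.inf else 1 for ti in t]

def kmeans2(vecs, seeds, iters=10):
    """Minimal 2-means on vectors; seeds are indices of the initial centroids."""
    centroids = [list(vecs[s]) for s in seeds]
    assign = [0] * len(vecs)
    for _ in range(iters):
        for i, v in enumerate(vecs):  # assignment step
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            assign[i] = d.index(min(d))
        for k in range(2):            # update step
            members = [vecs[i] for i in range(len(vecs)) if assign[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

inf = math.inf
t_a = [inf, inf, inf, inf, 0, 1]
t_b = [0, inf, inf, 1, inf, inf]
t_c = [0, 1, 2, inf, inf, inf]
vecs = [indicator(t) for t in (t_a, t_b, t_c)]
print(vecs[0])                      # [0, 0, 0, 0, 1, 1]
print(kmeans2(vecs, seeds=(0, 1)))  # [0, 1, 1]: cascade a vs. cascades b, c
```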
Graph Decomposition
• By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried.
(Figure: two inferred sub-networks over n1–n6, one for cascades {a, d, f} (NBA news) and one for cascades {b, c, e} (NFL news), each with its own edge weights.)
Remark
• Traditionally, NIP results in a dense and complex network, from which it is difficult to capture knowledge.
• By properly clustering cascades, we obtain a few concise networks that carry clearer information.
– These resulting networks better match the corresponding cascades than a single dense network does.
Conclusion
• Information extraction is an application/goal-oriented process to capture the key ingredients (parameters, information, structure, etc.) of a huge SN.
• The procedure of information extraction can be integrated into related processes for better efficiency in practice.
Thank you!
Graph Summarization
Condense the original graph into a more compact form.
Lossless and lossy methods exist.
Required to examine the entire network.
(Figure: graph G with nodes 1–10 summarized into a compact graph Gs with supernodes Sa = {2,3}, Sb = {1,9}, Sc = {7,8,10}, Sd = {4,5,6}, plus edge corrections − {5,10} and − {6,10}.)
A revised example from S. Navlakha et al., "Graph Summarization with Bounded Error," SIGMOD 2008.
Graph Sampling
• Graph sampling
– Selecting a subset of the original data
– Characteristics of the original graph are preserved
– Only a proportion of the nodes in the network are visited
(Figure plotted by NodeXL, an Excel template created by the NodeXL team at Microsoft Research.)
A Running Example of CENDY
Originally, we have the closeness centralities of all nodes and the average path length (APL) of the graph, an unweighted and undirected graph G with 14 nodes and 18 edges:

A = a     b     c     d     h     l     o     r     s     t     u     v     w     x
    13/40 13/35 13/37 13/33 13/46 13/47 13/40 13/33 13/40 13/56 13/57 13/44 13/31 13/47

e.g., Cc(x) = (14 − 1)/47 = 13/47, since the shortest-path distances from x to the other 13 nodes sum to 47.

L(G) = (40 + 35 + 37 + … + 47) / (14 · (14 − 1)) = 586/182 ≈ 3.22
Example (Cont’d)
For the insertion of the edge e(a,b):
• We perform BFS at node a in G and G' to obtain Ga and G'a, and then have V'a = {b,c,h,v,t}.
Example (Cont’d)
• Also, we perform BFS at node b in G and G' to obtain Gb and G'b, and then have V'b = {a,x,l,u}.
Example (Cont’d)
• Then, in light of the main theorem, we re-calculate the paths between V'a and V'b.
• For example, for node x ∈ V'b, we calculate
– (1): ||p(x,t)| − |p'(x,t)|| = 7 − (1+1+3) = 2
– (2): ||p(x,h)| − |p'(x,h)|| = 6 − 4 = 2
– (3): ||p(x,v)| − |p'(x,v)|| = 6 − 4 = 2
– (4): ||p(x,c)| − |p'(x,c)|| = 5 − 3 = 2
– (5): ||p(x,b)| − |p'(x,b)|| = 4 − 2 = 2
• and then update its new closeness centrality:

Cc(x) = 13 / (47 − (1) − (2) − (3) − (4) − (5)) = 13 / (47 − 2 − 2 − 2 − 2 − 2) = 13/37
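The arithmetic of this update step can be checked directly; `updated_closeness` is a hypothetical helper of ours reflecting the calculation above:

```python
from fractions import Fraction

def updated_closeness(n, old_dist_sum, distance_drops):
    """Incrementally update Cc(x): subtract the per-pair shortest-path
    decreases from the old distance sum, then re-take (n - 1) / sum."""
    return (n - 1) / (old_dist_sum - sum(distance_drops))

# Node x in the slide's example: old distance sum 47, five paths shortened by 2 each
print(updated_closeness(14, Fraction(47), [2, 2, 2, 2, 2]))  # 13/37
```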
Example (Cont’d)
• Finally, we update the closeness centralities of the referenced nodes and recalculate the APL.

A = a     b     c     d     h     l     o     r     s     t     u     v     w     x
    13/30 13/28 13/30 13/33 13/39 13/42 13/40 13/33 13/40 13/49 13/47 13/37 13/31 13/37

L(G') = (30 + 28 + 30 + … + 37) / (14 · (14 − 1)) = 516/182 ≈ 2.84
Example Scenario
(Same example graph as in the problem definition: query node N0, neighbors N1–N4, second-hop nodes, and per-node attribute sets.)
N0 is initiating a query to find a job in company A or company B. Which friend should N0 ask for information?
New Contributions
• Compared with M. Gomez-Rodriguez, J. Leskovec, and A. Krause, "Inferring Networks of Diffusion and Influence," KDD 2010, our work is unique in that:
1. We assume there could be many underlying networks (rather than only one).
2. We model and learn a weighted graph (rather than an unweighted one).