1 exploring blog networks patterns and a model for information propagation mary mcglohon in...

Post on 21-Dec-2015

216 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Exploring Blog Networks

Patterns and a Model for Information Propagation

Mary McGlohon

In collaboration with Jure Leskovec, Christos Faloutsos

Natalie Glance, Matthew Hurst

Sandia National Labs- July 6, 2007

(As seen at SIAM-

Data Mining 2007)

2

Long-term Goals

● How does information on the Web propagate?● With what pattern do ideas catch on, diffuse,

and decrease in popularity?● Can we build a model for this propagation?

3

Why blogs?

● Blogs are a widely used medium of information for many topics and have become an important mode of communication.

● Blogs cite one another, creating a record of how information and ideas spread through a social network.

● This record is publicly available.

4

Why do we care?

● Understanding how the blog network works is important for:– Social issues: Political mapping, social trends and

change, reactions to mass media.– Economic issues: Marketing, predicting

commercial success, discovering links between companies.

Example: blogs in the 2004 election.[Adamic, Glance 2005]

5

Immediate Goals

● Temporal questions: Does popularity have half-life? Is there periodicity?

● Topological questions: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else?

● Generative model: Can we build a generative model that mimics properties of cascades?

6

OutlineMotivation

PreliminariesConcepts and terminologyData

Temporal ObservationsTopological ObservationsCascade Generation ModelDiscussion & Conclusions

7

What is a blog?

● A blog is a frequently-updated webpage.● A blog’s author updates the blog using posts.● Each post has a permanent hyperlink, and may

contain links to other blog posts.

slashdot boingboing

8

What is a blog?

● A blog is a frequently-updated webpage.● A blog’s author updates the blog using posts.● Each post has a permanent hyperlink, and may

contain links to other blog posts.

slashdot boingboing

The iPhone is here, hooray!

9

What is a blog?

● A blog is a frequently-updated webpage.● A blog’s author updates the blog using posts.● Each post has a permanent hyperlink, and may

contain links to other blog posts.

slashdot boingboing

The iPhone is here, hooray!

At this link, Slashdot says the iPhone has arrived. But I’m not buying one, because …

10

What is a blog?

● A blog is a frequently-updated webpage.● A blog’s author updates the blog using posts.● Each post has a permanent hyperlink, and may

contain links to other blog posts.

slashdot boingboing

The iPhone is here, hooray!

At this link, Slashdot says the iPhone has arrived. But I’m not buying one, because …

Here Boingboing says they’re not

buying an iPhone. They’re just

jealous.

11

Blogosphere network

B1 B2

B4B3

B1 B2

B4B3

11

2

1 3

1

a

b c

de

From blogs to networks

1

Non-trivial vs. trivial cascades

Stars vs. chains

Nodes a,b,c,d are cascade initiators

e is a connector

Blog network Post network

slashdotboingboing

DlistedMichelleMalkin

slashdotboingboing

DlistedMichelleMalkin

12

Blogosphere network

Non-trivial vs. trivial cascades

Cascades

From networks to cascades

slashdot boingboing

DlistedMichelleMalkin

13

From networks to cascades

Non-trivial vs. trivial cascades

Cascade initiators are first sources of information

We also have stars and chains

Blogosphere network

Cascades

slashdot boingboing

DlistedMichelleMalkin

14

Dataset (Nielsen Buzzmetrics)● Gathered from August-September 2005*

● Used set of 44,362 blogs, traced cascades

● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links

Time [1 day]

Nu

mb

er

of p

ost

s

15

OutlineMotivationPreliminaries

Concepts and terminologyData

Temporal ObservationsDoes blog traffic behave periodically?How does popularity change over time?

Topological ObservationsCascade Generation ModelDiscussion & ConclusionsFuture Work

16

Temporal Observations

Does blog traffic behave periodically?• Posts have “weekend effect”, less traffic on

Saturday/Sunday.

17

Temporal Observations

Does blog traffic behave periodically?• Monday appears to compensate for this behavior, but

it is not actually the case.

• We normalize data: countnorm = count / pd

where pd is percentage of links on that day.

Same data, normalizedMonday post dropoff- days after post

Num

ber

in-li

nks

(log)

Num

ber

in-li

nks

(log)

18

Temporal Observations

How does post popularity change over time?

Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].

Observation 1: The probability that a post written at time tp acquires a link at time tp + is:

p(tp+) 1.5

Days after post

Nu

mb

er

of in

-lin

ks

19

OutlineMotivationPreliminariesTemporal Observations

Does blog traffic behave periodically?How does post popularity change over time?

Topological ObservationsWhat are graph properties for blog networks?What shapes do cascades take on? Stars, chains,

or something else?Cascade Generation ModelDiscussion & ConclusionsFuture Work

20

Topological Observations

What graph properties does the blog network exhibit?

B1 B2

B4B3

11

2

1 3

1

21

Topological Observations

What graph properties does the blog network exhibit? How connected?

● 44,356 nodes, 122,153 edges● Half of blogs belong to largest connected

component.

B1 B2

B4B3

11

2

1 3

1

22

Topological Observations

What power laws does the blog network exhibit?

Both in- and out-degree follows a power law distribution, in-link PL exponent -1.7, out-degree PL exponent near -3.

This suggests strong rich-get-richer phenomena.

Number of blog in-links (log scale) Number of blog out-links (log scale)

Co

unt

(lo

g s

cale

)

Co

unt

(lo

g s

cale

)

23

Topological Observations

How are blog in- and out-degree related?

In-links and out-links are not correlated. (correlation coefficient 0.16)

Number of blog in-links (log scale)

Nu

mb

er

of b

log

ou

t-lin

ks

(lo

g s

cale

)

24

Topological Observations

What graph properties does the post network exhibit?

a

b c

de

25

Topological Observations

a

b c

de

What graph properties does the post network exhibit?

Very sparsely connected: 98% of posts are isolated.

26

Topological Observations

• Both in-and out-degree follow power laws:• In-degree has PL exponent -2.15, out-degree

has PL exponent -2.95.

What power laws does the post network exhibit?

Post in-degree

Co

unt

Post out-degree

Co

unt

27

Topological Observations

How do we measure how information flows through the network?

We gather cascades using the following procedure:– Find all initiators (out-degree 0).

a

b c

de

28

Topological Observations

How do we measure how information flows through the network?

We gather cascades using the following procedure:– Find all initiators (out-degree 0).– Follow in-links.

a

b c

de

a

b c

de

29

Topological Observations

How do we measure how information flows through the network?

We gather cascades using the following procedure:– Find all initiators (out-degree 0).– Follow in-links.– Produces directed acyclic graph.

a

b c

de

a

b c

de

d

e

b c

e

a

30

Topological Observations

How do we measure how information flows through the network?

Common cascade shapes are extracted using algorithms in [Leskovec2006].

31

Topological Observations

How do we measure how information flows through the network?

Number of edges increases linearally with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.

Cascade size (# nodes)

Num

ber

of e

dges

Cascade size

Eff

ectiv

e di

amet

er

32

Topological Observations

How do we measure how information flows through the network?

We work with a bag of cascades– each cascade is a disconnected subgraph.

We now explore some graph properties of cascades.

33

Topological Observations

As before, in- and out-degree in bag of cascades follow power laws.

What graph properties do cascades exhibit?

Cascade node in-degree Cascade node out-degree

Cou

nt

Cou

nt

34

Topological Observations

Cascade size distributions also follow power law.

What graph properties do cascades exhibit?

35

Topological Observations

Cascade size distributions also follow power law.

What graph properties do cascades exhibit?

Observation 2: The probability of observing a cascade on n nodes follows a Zipf distribution:

p(n) n-2

Cascade size (# of nodes)

Cou

nt

36

Topological Observations

What graph properties do cascades exhibit?

Stars and chains also follow a power law, with different exponents (star -3.1, chain -8.5).

37

Topological Observations

What graph properties do cascades exhibit?

Stars and chains also follow a power law, with different exponents (star -3.1, chain -8.5).

Size of chain (# nodes)

Cou

nt

Size of star (# nodes)

Cou

nt

38

OutlineMotivationPreliminariesTemporal ObservationsTopological Observations

What are graph properties for blog networks?What shapes and patterns do cascades take on?

Cascade Generation ModelEpidemiological BackgroundProposed ModelExperimental Validation

Discussion & ConclusionsFuture Work

39

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

40

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

41

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

42

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

43

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

44

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

45

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

46

Epidemiological models

● We consider modeling cascade generation as an epidemic, with ideas as viruses.

● We use the SIS model:– At any time, an entity is in one of two states:

susceptible or infected.– One parameter determines how easily spreading

conversations are.– [Hethcote2000]

47

Cascade Generation Model

B1

3

B2

B3 B4

1

2

1

1

1

0. Begin with Blog Net.

48

Cascade Generation Model

B1 B2

B3 B4

0. Begin with Blog Net, but ignore edge weights.

Example–

B1 links to B2, B2 links to B1, B4 links to B2 and B1, as well as itself

B3 is isolated, linking to itself.

49

Cascade Generation Model

B1 B2

B3 B4

1. Randomly pick a blog to infect, add node to cascade

B1

50

Cascade Generation Model

B1 B2

B3 B4

2. Infect each in-linked neighbor with probability .

B1

51

Cascade Generation Model

B1 B2

B3 B4

2. Infect each in-linked neighbor with probability .

B1

INFECT

DO NOT INFECT

52

Cascade Generation Model

B1 B2

B3 B4

3. Add infected neighbors to cascade.

B1

B4

53

Cascade Generation Model

B1 B2

B3 B4

4. Set “old” infected nodes to uninfected.

B1

B4

54

Cascade Generation Model

B1 B2

B3 B4

4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected.

B1

B4

55

Cascade Generation Model

B1 B2

B3 B4

4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected.

B1

B4DO NOT INFECT

56

Cascade Generation Model

B1 B2

B3 B4

4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected.

B1

B4

Completed cascade!

57

CGM matches observations

● After trying several values, we decide on =.025.● 10 simulations, 2 million cascades each● Most frequent cascades: 7 of 10 matched exactly.

model

data

58

CGM matches observations

Cascade size in this model also follows a power law-- the model distribution is shown with the real data points.

Cascade size (number of nodes)

Cou

nt

59

CGM matches observations

● Stars and chains both follow power laws, close to those observed in real data.

Cou

nt

Star size

Cou

nt

Chain size

60

Results in brief

● Analyzed one of largest available collections of blog information.

● Two networks: “Post network” and “blog network”.

● Discovered several properties of the networks.● Also analyzed properties of “cascades”.● Presented generative model for cascades.

61

Immediate questions: answered

● Temporal questions: Does popularity have half-life? Is there periodicity?– Popularity dropoff follows a power-law distribution

exactly as found in response times in other work. We do find that posts follow weekly periodicity.

Days after post

Nu

mb

er

of in

-lin

ks

62

Immediate questions: answered

● Topology: What topological patterns do posts and blogs follow? What shapes to cascades take on? Stars? Chains? Something else?– We find power law distributions in almost every

topological property. In cascade shapes, stars are more common than chains, and size of cascades follow a power law. Cascades are tree-like.

Size of chain (# nodes)

Cou

nt

Size of star (# nodes)

Cou

nt

63

Immediate questions: answered

● Can a simple model replicate this behavior?– Yes. We developed a model based on the SIS

model in epidemiology. It is a simple model with only one parameter, and it produces behavior remarkably similar to that found in the dataset.

Cou

nt

Star size

Cou

nt

Chain size

64

Future work and applications

● This work suggested that ideas may behave like viruses under an SIS model.

● This may be useful for mapping social/political trends.

● Further investigation into these properties may also allow us early detection of changes in social or economic structure.

65

Related work

● For explanation of SIS model:– [Hethcote2000] H.W. Hethcote. The mathematics of

infectious diseases. SIAM Rev., 42(4):599–653, 2000.● For algorithms for extracting cascade shapes:

– [Leskovec2006] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. PAKDD 2006.

● For some modeling of power laws:– [Vazquez2006] A. Vazquez, J. G. Oliveira, Z. Dezso, K. I.

Goh, I. Kondor, and A. L. Barabasi. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, 2006.

66

Additional Info

Mary McGlohon

www.cs.cmu.edu/~mmcgloho

mcglohon@cmu.edu

6767

Acknowledgments

● Mary McGlohon was partially supported by an NSF Graduate Fellowship.

● Jure Leskovec was partially supported by a Microsoft Fellowship.

68

Questions?

69

● EXTRA SLIDES BEGIN HERE!

7070

Preliminaries- PCA

● We will work with very high-dimensional data (~9,000 dimensions).

● Principal Component Analysis is a method of dimensionality reduction.

Depth upwards

Conversation mass upwards

Hypothetically, for each blog...

7171

Preliminaries- PCA

● We will work with very high-dimensional data (~9,000 dimensions).

● Principal Component Analysis is a method of dimensionality reduction.

Depth upwards

Conversation mass upwards

Hypothetically, for each blog...

7272

Preliminaries- PCA

● We will work with very high-dimensional data (~9,000 dimensions).

● Principal Component Analysis is a method of dimensionality reduction.

Depth upwards

Hypothetically, for each blog...

Conversation mass upwards

73

Preliminaries- PCA

1 1 1 0 0

2 2 2 0 0

1 1 1 0 0

5 5 5 0 0

0 0 0 2 2

0 0 0 3 30 0 0 1 1

0.18 0

0.36 0

0.18 0

0.90 0

0 0.53

0 0.800 0.27

=9.64 0

0 5.29x

0.58 0.58 0.58 0 0

0 0 0 0.71 0.71

x

v1

We can represent any real N x M matrix X as X= U x x Vt

Det

ails

X U

Vt

74

Preliminaries- PCA

● Reduce dimensionality by setting all other components of to zero.

1 1 1 0 0

2 2 2 0 0

1 1 1 0 0

5 5 5 0 0

0 0 0 2 2

0 0 0 3 30 0 0 1 1

0.18 0

0.36 0

0.18 0

0.90 0

0 0.53

0 0.800 0.27

=9.64 0

0 5.29x

0.58 0.58 0.58 0 0

0 0 0 0.71 0.71

x

Det

ails

75

Preliminaries- PCA

Reference: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.

1 1 1 0 0

2 2 2 0 0

1 1 1 0 0

5 5 5 0 0

0 0 0 2 2

0 0 0 3 30 0 0 1 1

0.18 0

0.36 0

0.18 0

0.90 0

0 0.53

0 0.800 0.27

~9.64 0

0 0x

0.58 0.58 0.58 0 0

0 0 0 0.71 0.71

x

Det

ails

76

Preliminaries- Regularizing data

● Not everything in life is normally distributed.

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, linear-linear scale

77

Preliminaries- Regularizing data

● Not everything in life is normally distributed.

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, linear-linear scale

99.4% of points!

78

Preliminaries: Regularizing data

● Not everything in life is normally distributed.

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, linear-linear scale

Try to fit a line...

79

Preliminaries: Regularizing data

● Not everything in life is normally distributed.

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, linear-linear scale

Try to fit a line...

Outliers dramatically affect fit.

80

Preliminaries: Regularizing data

● Not everything in life is normally distributed. ● Therefore, we propose to take log(count+1).

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, log-log scale

81

Preliminaries: Regularizing data

● Not everything in life is normally distributed. ● Therefore, we propose to take log(count+1).

To

tal I

n-li

nks

Total Conversation Mass Downwards

Blog properties, log-log scale

Outliers’ effects are minimized.

82

● Suppose we want to cluster blogs based on content. What features do we use per blog?

83

CascadeType

• Perform PCA on sparse matrix.

• Use log(count+1)• Project onto 2 PC…

.01…

.07.67…

1.12.1…

5.1…

4.2…

.073.41.13.2boingboing

.092.14.6slashdot

…………

~9,000 cascade types

~44

,000

blo

gs

8484

CascadeType: Results

● Observation: Content of blogs and cascade behavior are often related.

• Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

8585

CascadeType: Results

● Observation: Content of blogs and cascade behavior are often related.

• Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

86

● Suppose we want to cluster blog posts. What features do we use?

8787

Preliminaries- Blogs

● There are several terms we use to describe cascades:

● In-link, out-link

– Green node has one out-link

– Yellow node has one in-link.● Depth downwards/upwards

– Pink node has an upward depth of 1,

– downward depth of 2.

● Conversation mass upwards/downwards

– Pink node has upward CM 1,

– downward CM 3

8888

PostFeatures

.6.1…

1.1.6boingboing-p002

6.24.2boingboing-p001

2.41.2…

4.5.2…

2.2.3slashdot-p002

4.5slashdot-p001

# in

-link

s #

out-

links

C

M u

p

C

M d

own

de

pth

up

dep

th d

own

~2,

400,

000

post

s Run PCA…

89

PostFeatures: Results

• Observation: Posts within a blog tend to retain similar network characteristics.

90

PostFeatures: Results

• Observation: Posts within a blog tend to retain similar network characteristics.

MichelleMalkin

Dlisted

– PC1 ~ CM upward– PC2 ~ CM

downward– We show this

scatter plot instead.

9191

Ranking blogs by PostFeatures

● Conversation mass up/down gives a better understanding of the blog posts than in-links and out-links.

● Therefore, we may choose to rank blogs based on these attributes.

9292

Blogs ranked by CM vs in-links

1 michellemalkin.com

2 boingboing.net

3 imao.us (75)

4 captainsquartersblog.com/mt

5 instapundit.com

6 radioequalizer.blogspot.com (53)

7 powerlineblog.com

8 waxy.org/links

9 washingtonmonthly.com

10 kottke.org/reminder

1 boingboing.net

2 michellemalkin.com

3 instapundit.com

4 waxy.org/links

5 kottke.com/reminder

6 patriotdaily.com (11)

7 captainsquartersblog.com/mt

8 powerlineblog.com

9 washingtonmonthly.com

10 petashon.com (30)

Top blogs by conversation mass Top blogs by in-links

9393

Blogs ranked by CM vs in-links

1 michellemalkin.com

2 boingboing.net

3 imao.us (75)

4 captainsquartersblog.com/mt

1 boingboing.net

2 michellemalkin.com

3 instapundit.com

4 waxy.org/links

Top blogs by conversation mass Top blogs by in-links

in-links: 2

CM: 6in-links: 5

CM: 5

– Perhaps IMAO has longer cascades, just fewer inlinks.– While petashun has “stars”.

.....10 petashon.com (30)

94

BlogTimeFractal: some time series

● Problem: time series data is nonuniform and difficult to analyze.

● Any patterns?● Any measures?

in-links over time

9595

BlogTimeFractal: Definitions

● Any patterns?● Self similarity!● The 80-20 law describes self-similarity.● For any sequence, we divide it into two equal-

length subsequences. 80% of traffic is in one, 20% in the other.– Repeat recursively.

96

Self-similarity

● The bias factor for the 80-20 law is b=0.8.20 80

Det

ails

97

Self-similarity

● The bias factor for the 80-20 law is b=0.8.20 80

Q: How do we estimate b?

Det

ails

98

Self-similarity

● The bias factor for the 80-20 law is b=0.8.20 80

Q: How do we estimate b?

A: Entropy plots!

Det

ails

9999

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.● From time series data, begin with resolution R=

T/2. ● Record entropy H

R

100100

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.● From time series data, begin with resolution R=

T/2. ● Record entropy H

R

● Recursively take finer resolutions.

101101

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.● From time series data, begin with resolution r=

T/2. ● Record entropy H

r

● Recursively take finer resolutions.

102

BlogTimeFractal: Definitions● Entropy measures the non-uniformity of histogram at

a given resolution.● We define entropy of our sequence at given R :

where p(t) is percentage of posts from a blog on interval t, R is resolution and 2R is number of intervals.

Det

ails

103103

BlogTimeFractal

● For a b-model (and self similar cases), entropy plot is linear. The slope s will tell us the bias factor.

● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation:

s= - b log2 b – (1-b) log2 (1-b)

104

Entropy Plots

● Linear plot Self-similarity

Resolution

En

tro

py

105

Entropy Plots

● Linear plot Self-similarity● Uniform: slope s=1. bias=.5● Point mass: s=0. bias=1

Resolution

En

tro

py

106

Entropy Plots

● Linear plot Self-similarity● Uniform: slope s=1. bias=.5● Point mass: s=0. bias=1

Resolution

En

tro

py

Michelle Malkin in-links, s= 0.85

By Lemma 1, b= 0.72

107107

BlogTimeFractal: Results● Observation: Most time series of interest are

self-similar.● Observation: Bias factor is approximately 0.7--

that is, more bursty than uniform (70/30 law).

in-links, b=.72 conversation mass, b=.76 number of posts, b=.70

Entropy plots: MichelleMalkin

108

● Other related work

109

[Ali-Hasen, Adamic 2007]Expressing Social Relationships on the Blog through Links and Comments

Analyzed three blog communities:

Dallas-Fort Worth

-Most links are external to community (91%)

-Low centralization

-Low reciprocity

UAE

-Fewer links external to community

-More centralization

-Obvious “hub” structure

Kuwait

-Fewest links external to community (53%)

-Highly centralized

-Much reciprocity

110

[Duarte et. al. 2007]

● Classified blogs into parlor, register, and broadcast.

Total sessions

Fra

ctio

ns o

f se

ssio

ns

with

com

men

ts

parlor

register

broadcast

111

[Adar et. al. 2004]

● Implicit Structure and the Dynamics of Blogspace

Suggested that ideas behaved like epidemics.

Presented iRank based on how “infectious” a blog was.

(giant microbes, a site infectious in more ways than one)

top related