Post on 10-Jul-2020


Page 1: A Bivariate Point Process Model with Application to Social ... · Application to Social Media User Content Generation Emma Jingfei Zhang ezhang@bus.miami.edu Yongtao Guan yguan@bus.miami.edu

A Bivariate Point Process Model with Application to Social Media User Content Generation

Emma Jingfei Zhang (ezhang@bus.miami.edu)
Yongtao Guan (yguan@bus.miami.edu)

Department of Management Science, The Miami Business School, University of Miami


Data Description: Sina Weibo Data

Source: Sina Weibo, the largest Twitter-type online social media platform in China.
The dataset contains posts from 5,913 followers of the official Beijing University Guanghua MBA Weibo account.
For each user, all of his/her posts from Jan 1st to Jan 30th, 2014 have been collected, including the time stamp of each post.
Each post is either a post with original content or a repost.


Data Description: Trump’s Twitter Data

Source: tweets from Donald Trump (@realDonaldTrump), collected from Jan 2013 to Apr 2018.
The Twitter archive of Donald Trump can be downloaded from http://www.trumptwitterarchive.com/.
Twitter shows the device used for each tweet; devices may be Android, Web Client, iPhone, and others.
We consider the tweets posted from an Android device before the election and from an iPhone after it.
This results in a total of 17,518 tweets; the average number of monthly tweets is 278.
Each tweet is either an original tweet or a retweet.


Data Description: Sina Weibo Data

Figure: The posting times of three users (Users 1, 2 and 3); horizontal axis: date, from 01/01 to 01/30.


Data Description: Sina Weibo Data

Figure: Average empirical pair correlation function; horizontal axis: hour (0 to 70).


Observations from Data

A user's posting activity may alternate between active and inactive states.
During an active state, the user may publish one or more posts (often with short inter-post time distances).
During an inactive state, no post is produced until the start of the next active state.
There may be daily patterns in posting times.
The data form a bivariate point process (i.e., posts and reposts).


Graphical Illustration: Univariate Process

Episodes: clusters of posting time locations.
Adjacent episodes are nonoverlapping and separated by the inactive period in between.


Graphical Illustration: Bivariate Process

Figure: Schematic of the bivariate process (labels: episode, post segment, repost segment, inactive).

Each episode contains subepisodes of posts and reposts.
Posts (reposts) tend to be followed by posts (reposts).
Reposts may be more clustered than posts.
The number of reposts may be related to the number of followees.


Clustered Point Process

Goal: Model the clustered posting times in social media data (without distinguishing between posts and reposts for now).

Existing methods: the Hawkes process, the Neyman-Scott process, the Bartlett-Lewis process, and the interrupted Poisson process.

We propose a new class of clustered temporal point processes that is easy to interpret and can be easily generalized to the bivariate case.


Model Formulation

For each episode, the parent event generates a Poisson number of offspring events with mean µ.

Each offspring location, relative to the location of the previous event in the same cluster, follows an exponential distribution with parameter ρ.

Once all the events in an episode have been observed, the parent event in the following episode is generated following a hazard function λ(t; β).
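The generative mechanism above can be sketched in a few lines of Python. This is only an illustrative simulation, not the authors' code: the function names are hypothetical, the hazard is passed in as a callable together with an assumed upper bound `hazard_max` (parents are drawn by thinning), and events falling after T are simply dropped.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def next_parent(rng, start, hazard, hazard_max):
    """First event after `start` for a process with the given hazard, by thinning.

    Assumption: hazard(t) <= hazard_max for all t.
    """
    t = start
    while True:
        t += rng.expovariate(hazard_max)
        if rng.random() * hazard_max <= hazard(t):
            return t


def simulate(rng, T, mu, rho, hazard, hazard_max):
    """Return (times, labels) on [0, T]; label 1 = parent, 0 = offspring."""
    times, labels = [], []
    t = next_parent(rng, 0.0, hazard, hazard_max)
    while t <= T:
        times.append(t)
        labels.append(1)
        for _ in range(poisson(rng, mu)):
            # Offspring gap is measured from the previous event in the episode.
            t += rng.expovariate(rho)
            if t > T:
                break
            times.append(t)
            labels.append(0)
        # The next parent is generated only after the episode has ended.
        t = next_parent(rng, t, hazard, hazard_max)
    return times, labels


if __name__ == "__main__":
    # Hypothetical settings: constant hazard of one episode per unit time.
    times, labels = simulate(random.Random(0), 50.0, 2.0, 10.0, lambda t: 1.0, 1.0)
    print(len(times), "events,", sum(labels), "episodes")
```

Passing the hazard as a callable keeps the sketch independent of any particular parametric form.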


Model Formulation

Given the daily cyclic pattern observed in the average pair correlation function, we may assume that

\[
\lambda(t;\beta) = \exp\Big\{ \beta_0 + \sum_{j=1}^{p} \big[ \beta_{j1}\cos(\omega_j t) + \beta_{j2}\sin(\omega_j t) \big] \Big\},
\]

where $\omega_j = 2j\pi$ and $\beta = \{\beta_0, \beta_{j1}, \beta_{j2} : j = 1, \dots, p\}$.

Alternatively, nonparametric models can be used.
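As a small illustration of the cyclic hazard, the sketch below evaluates λ(t; β) for a given set of harmonics. It assumes t is measured in days, so that ω_j = 2jπ gives a period of one day; the function names and the coefficient values in the example are hypothetical.

```python
import math


def hazard(t, beta0, harmonics):
    """exp(beta0 + sum_j [b_j1 cos(2 j pi t) + b_j2 sin(2 j pi t)]).

    harmonics is a list of (b_j1, b_j2) pairs for j = 1, ..., p.
    """
    s = beta0
    for j, (b1, b2) in enumerate(harmonics, start=1):
        w = 2.0 * j * math.pi
        s += b1 * math.cos(w * t) + b2 * math.sin(w * t)
    return math.exp(s)


def hazard_bound(beta0, harmonics):
    """Upper bound on the hazard: harmonic j contributes at most sqrt(b_j1^2 + b_j2^2)."""
    return math.exp(beta0 + sum(math.hypot(b1, b2) for b1, b2 in harmonics))


if __name__ == "__main__":
    beta0, harmonics = -2.0, [(-2.0, 2.0)]  # hypothetical coefficients, p = 1
    for hour in (0, 6, 12, 18):
        print(hour, hazard(hour / 24.0, beta0, harmonics))
```

A bound of this kind is convenient when the hazard is used inside a thinning-based simulator.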


Model Formulation

Define event time locations $\{T_l : l = 1, \dots, N\}$ and indicator variables $\{Y_l : l = 1, \dots, N\}$, where $Y_l = 1$ denotes a parent event and $Y_l = 0$ an offspring event.
Let $T_0 = 0$. Define the gap time

\[
D_l = T_l - T_{l-1}, \quad l = 1, \dots, N.
\]

Let $f_{l0}(x)$ and $f_{l1}(x)$ be the probability density functions of $D_l$ given that $Y_l = 0$ and $Y_l = 1$, respectively. Assume

\[
f_{l0}(x) = \rho \exp(-\rho x),
\]

and

\[
f_{l1}(x) = \lambda(t_{l-1} + x; \beta) \exp\left[ -\int_{t_{l-1}}^{t_{l-1}+x} \lambda(t;\beta)\,dt \right].
\]
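The two gap-time densities can be evaluated numerically as follows. This is a sketch under stated assumptions: the integral of the hazard is approximated with a composite Simpson rule (an implementation choice made here, not something specified in the slides), and all function names are hypothetical.

```python
import math


def integrate(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    if b == a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def f_offspring(x, rho):
    """Density of the gap time given that the event is an offspring (Y_l = 0)."""
    return rho * math.exp(-rho * x)


def f_parent(x, t_prev, hazard):
    """Density of the gap time given that the event is a parent (Y_l = 1).

    hazard is any callable t -> lambda(t; beta); t_prev is the previous event time.
    """
    return hazard(t_prev + x) * math.exp(-integrate(hazard, t_prev, t_prev + x))


if __name__ == "__main__":
    cyclic = lambda t: math.exp(-2.0 - 2.0 * math.cos(2 * math.pi * t)
                                + 2.0 * math.sin(2 * math.pi * t))
    print(f_offspring(0.05, 10.0), f_parent(0.5, 1.2, cyclic))
```

With a constant hazard c, `f_parent` reduces to the exponential density c exp(-c x), which gives a quick sanity check.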


Model Formulation

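Assume the first event is a parent event and all events in the last episode are contained in $[0, T]$.

The complete-data likelihood can then be written as

\[
L(\theta; \mathbf{t}, \mathbf{y}) = \prod_{l=1}^{n} \prod_{m=0}^{1} \left[ f_{lm}(d_l; \theta)^{I(y_l = m)} \right] \left[ \prod_{i=1}^{k} P(N_i = n_i) \right] P(D_{n+1} > T - t_n),
\]

where $D_{n+1}$ is the gap time between $t_n$ and the next parent event,

\[
P(N_i = n_i) = \frac{\exp(-\mu)\,\mu^{n_i}}{n_i!},
\]

and

\[
P(D_{n+1} > T - t_n) = \exp\left[ -\int_{t_n}^{T} \lambda(t;\beta)\,dt \right].
\]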
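The complete-data likelihood above translates directly into a log-likelihood routine. The sketch below is illustrative only. It assumes that k is the number of episodes and n_i the number of offspring in episode i (the slide does not define them explicitly), and that the caller supplies the hazard and its integral as callables; the names `hazard` and `cum_hazard` are hypothetical.

```python
import math


def complete_loglik(times, labels, T, mu, rho, hazard, cum_hazard):
    """Log of L(theta; t, y) for labelled events on [0, T].

    labels[l] is 1 for a parent and 0 for an offspring;
    cum_hazard(a, b) returns the integral of the hazard over [a, b].
    """
    if not times or labels[0] != 1:
        raise ValueError("the first event must be a parent")
    ll, prev, offspring_counts = 0.0, 0.0, []
    for t, y in zip(times, labels):
        if y == 1:
            # Parent gap: log f_l1(d_l), and a new episode starts.
            ll += math.log(hazard(t)) - cum_hazard(prev, t)
            offspring_counts.append(0)
        else:
            # Offspring gap: log f_l0(d_l) = log(rho) - rho * d_l.
            ll += math.log(rho) - rho * (t - prev)
            offspring_counts[-1] += 1
        prev = t
    for n_i in offspring_counts:
        # Poisson probability of n_i offspring in an episode.
        ll += -mu + n_i * math.log(mu) - math.lgamma(n_i + 1)
    # No new parent between the last event and T.
    return ll - cum_hazard(prev, T)


if __name__ == "__main__":
    c = 0.5  # hypothetical constant hazard
    print(complete_loglik([1.0, 1.5, 3.0], [1, 0, 1], 4.0, 1.0, 2.0,
                          lambda t: c, lambda a, b: c * (b - a)))
```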


Composite Likelihood Estimation

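The observed-data likelihood is $\sum_{\mathbf{y}} L(\theta; \mathbf{t}, \mathbf{y})$, where the summation is over all $2^n$ possibilities of $\mathbf{y}$, which quickly becomes infeasible as $n$ grows.
Divide $W = [0, T]$ into $J$ non-overlapping unit windows of length $s$, i.e., $W = \bigcup_{j=1}^{J} W_j$ where $W_j = [(j-1)s, js)$.
As before, we assume that the first event in $W_j$ is a parent event and that all events in the last episode of $W_j$ are contained in $W_j$.

Define $\mathbf{t}_j = \{t_i : t_i \in W_j\}$ and $\mathbf{y}_j = \{y_i : t_i \in W_j\}$. Then the observed-data likelihood on $W_j$ is

\[
\sum_{\mathbf{y}_j} L(\theta; \mathbf{t}_j, \mathbf{y}_j).
\]

We estimate $\theta$ by maximizing the composite likelihood

\[
\tilde{L}(\theta; \mathbf{t}) = \prod_{j=1}^{J} \left\{ \sum_{\mathbf{y}_j} L(\theta; \mathbf{t}_j, \mathbf{y}_j) \right\}.
\]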


Composite Likelihood Estimation

Each summation in the composite likelihood is over $2^{n_j}$ terms, where $n_j$ is the number of events in $W_j$.

Note that $\sum_{j=1}^{J} 2^{n_j} \ll 2^n$, so significant computational gains can be achieved.
There is a potential bias problem, since the first event in $W_j$ may not be a parent event and not all events in the last episode of $W_j$ may be contained in $W_j$.

The bias problem can be mitigated if we choose the blocks "wisely".
Convergence can be a problem since multiple parameters need to be estimated simultaneously and the likelihood surface is often quite flat.
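To make the computational gain concrete, the following sketch counts the number of label configurations summed over with and without windowing. The event times are made up for illustration and the function names are hypothetical.

```python
from collections import Counter


def window_counts(times, s):
    """Number of events n_j in each window W_j = [(j-1)s, js), keyed by j - 1."""
    return Counter(int(t // s) for t in times)


def enumeration_cost(times, s):
    """Return (sum_j 2^{n_j}, 2^n): terms summed with and without windowing."""
    counts = window_counts(times, s)
    return sum(2 ** n_j for n_j in counts.values()), 2 ** len(times)


if __name__ == "__main__":
    # Ten made-up event times spread over four unit windows.
    times = [0.1, 0.2, 0.9, 1.5, 1.6, 2.3, 2.4, 2.5, 3.7, 3.8]
    print(enumeration_cost(times, 1.0))  # (24, 1024)
```

Even in this tiny example the windowed sum involves 24 terms instead of 1024, and the gap widens exponentially as the number of events grows.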


A Composite Likelihood EM Algorithm

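Let $\mathbf{T}_j$ and $\mathbf{Y}_j$ be the random versions of $\mathbf{t}_j$ and $\mathbf{y}_j$.
In the E-step, we take the expectation of the log-likelihood $\ell(\theta; \mathbf{t}_j, \mathbf{Y}_j)$ with respect to the conditional distribution of $\mathbf{Y}_j \mid \mathbf{T}_j = \mathbf{t}_j, \hat{\theta}_{prev}$, i.e.,

\[
Q_j(\theta \mid \hat{\theta}_{prev}) = E_{\mathbf{Y}_j \mid \mathbf{T}_j = \mathbf{t}_j, \hat{\theta}_{prev}}\, \ell(\theta; \mathbf{t}_j, \mathbf{Y}_j).
\]

Define

\[
Q(\theta \mid \hat{\theta}_{prev}) = \sum_{j=1}^{J} Q_j(\theta \mid \hat{\theta}_{prev}).
\]

In the M-step, $Q(\theta \mid \hat{\theta}_{prev})$ is maximized with respect to $\theta$.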


A Composite Likelihood EM Algorithm

For the expectation, we need to calculate, for $t_l \in W_j$, the probability

\[
P_\theta(Y_l = m \mid \mathbf{T}_j = \mathbf{t}_j) = \frac{\sum_{\mathbf{y}_j : y_l = m} L(\theta; \mathbf{t}_j, \mathbf{y}_j)}{\sum_{\mathbf{y}_j} L(\theta; \mathbf{t}_j, \mathbf{y}_j)}.
\]

If there are a large number of events in $W_j$, we employ a standard Metropolis-Hastings algorithm to sample from the conditional distribution $\mathbf{Y}_j \mid \mathbf{T}_j = \mathbf{t}_j, \theta$ for the E-step.

Closed-form expressions can be obtained for $\hat{\theta}$ (except for $\hat{\beta}$) in the M-step.
Convergence is not an issue.
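For a window with only a few events, the E-step probabilities can be computed by brute-force enumeration, as sketched below. This is an illustration under several assumptions: the window [a, b) is treated like [0, T] with the first event forced to be a parent, the hazard and its integral are supplied as hypothetical callables, and the closed-form updates for µ and ρ in `m_step_mu_rho` are derived here from the complete-data likelihood rather than quoted from the slides.

```python
import itertools
import math


def window_loglik(times, labels, a, b, mu, rho, hazard, cum_hazard):
    """Complete-data log-likelihood of one labelling on the window [a, b)."""
    ll, prev, counts = 0.0, a, []
    for t, y in zip(times, labels):
        if y == 1:
            ll += math.log(hazard(t)) - cum_hazard(prev, t)
            counts.append(0)
        else:
            ll += math.log(rho) - rho * (t - prev)
            counts[-1] += 1
        prev = t
    ll += sum(-mu + n * math.log(mu) - math.lgamma(n + 1) for n in counts)
    return ll - cum_hazard(prev, b)


def parent_probs(times, a, b, mu, rho, hazard, cum_hazard):
    """P(Y_l = 1 | events in the window) by summing over all labellings."""
    n = len(times)
    if n == 0:
        return []
    weights, total = [0.0] * n, 0.0
    for rest in itertools.product((0, 1), repeat=n - 1):
        labels = (1,) + rest  # the first event in the window is a parent
        lik = math.exp(window_loglik(times, labels, a, b, mu, rho, hazard, cum_hazard))
        total += lik
        for l, y in enumerate(labels):
            if y == 1:
                weights[l] += lik
    return [w / total for w in weights]


def m_step_mu_rho(gaps, probs):
    """Updates maximizing the expected complete-data log-likelihood in mu and rho.

    gaps[l] is the gap before event l and probs[l] its parent probability.
    """
    exp_offspring = sum(1.0 - w for w in probs)
    exp_parents = sum(probs)
    rho = exp_offspring / sum((1.0 - w) * d for w, d in zip(probs, gaps))
    return exp_offspring / exp_parents, rho
```

In this sketch the updated µ is the expected number of offspring divided by the expected number of parents, and the updated ρ is the expected number of offspring divided by the expected total offspring gap time. The hazard parameters β would still need a numerical optimizer.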


A Composite Likelihood EM Algorithm

Theorem

The log-composite likelihood $\tilde{\ell}(\theta; \mathbf{t}) = \log \tilde{L}(\theta; \mathbf{t})$ satisfies $\tilde{\ell}(\theta_p; \mathbf{t}) \ge \tilde{\ell}(\theta_{p-1}; \mathbf{t})$, $p = 1, 2, \dots$, where $\theta_p$ is the $p$th update from the EM algorithm.

The theorem guarantees that the log-composite likelihood is nondecreasing at each EM iteration.
The convergence of $\hat{\theta}_p$ to a stationary point as $p \to \infty$ is guaranteed by Theorem 2 in Wu (1983).
Standard techniques, such as running the EM algorithm from multiple starting points, can help locate the global maximum.
Consistency and asymptotic normality can be established for the global maximum (assuming the model is correctly specified).


Extension to Bivariate Case

For each episode, the number of subepisodes is Poisson with mean γ.

Post and repost subepisodes alternate.
The first subepisode is a post subepisode with probability α.
The number of offspring in each post (repost) subepisode is Poisson with mean µ1 (µ0).
For each offspring in a post (repost) subepisode, its location relative to that of the previous event in the same episode follows an exponential distribution with parameter ρ1 (ρ0).

Once all the events in an episode have been observed, the parent event in the following episode is generated following a hazard function λ(t; β).

The composite likelihood EM algorithm can be modified to fit the model.
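One possible reading of the bivariate mechanism is sketched below for a single episode. The slide leaves some details open, so the sketch makes two loudly labelled assumptions that may differ from the authors' model: the parent event opens the first subepisode and carries its type, and an episode has one subepisode plus a Poisson(γ) number of additional ones. The function names are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_episode(rng, parent_time, alpha, gamma, mu, rho):
    """Return [(time, kind), ...] for one episode; kind 1 = post, 0 = repost.

    mu and rho are dicts keyed by kind, e.g. mu[1] is the post offspring mean.
    ASSUMPTION: the parent has the type of the first subepisode.
    ASSUMPTION: the episode has 1 + Poisson(gamma) subepisodes.
    """
    kind = 1 if rng.random() < alpha else 0
    events = [(parent_time, kind)]
    t = parent_time
    for _ in range(1 + poisson(rng, gamma)):
        for _ in range(poisson(rng, mu[kind])):
            # Gap from the previous event in the episode, at the rate for this type.
            t += rng.expovariate(rho[kind])
            events.append((t, kind))
        kind = 1 - kind  # post and repost subepisodes alternate
    return events


if __name__ == "__main__":
    episode = simulate_episode(random.Random(1), 0.0, 0.6, 0.5,
                               {1: 0.5, 0: 0.5}, {1: 10.0, 0: 15.0})
    print(episode)
```

Successive episodes could then be chained by drawing the next parent time from the hazard λ(t; β) after the last event of the current episode.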


Application to Trump’s Twitter Data

Figure: Parameters estimated from Donald Trump's monthly Twitter data, 2013 to 2018 (panels: α, γ, µ1, µ0, ρ1, ρ0, number of tweets per episode, and episode length in hours). The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.


Figure: Estimated parent event hazard functions from Donald Trump's monthly Twitter data. The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.


Figure: Goodness-of-fit plots of the model fitted for Jan 2017. From left to right: the envelope plot (first plot), with the upper and lower envelopes marked by red dashed lines, and goodness-of-fit plots for the original offspring post (second plot), offspring repost (third plot) and parent (last plot) inter-event distances. Red solid lines are calculated from the cdf of exponential distributions. The grey bands are the 95% confidence intervals.


Application to Sina Weibo Data

Figure: The posting times of three users (Users 1, 2 and 3); horizontal axis: date, from 01/01 to 01/30.


          α               γ               µ1              µ0              ρ1                 ρ0
User 1    0.343 (0.008)   0.024 (0.004)   0.099 (0.010)   0.241 (0.014)   14.444 (7.166)     43.442 (6.124)
User 2    0.387 (0.009)   0.086 (0.006)   0.101 (0.010)   0.614 (0.010)   163.026 (13.013)   618.721 (21.749)
User 3    0.644 (0.006)   0.227 (0.008)   0.445 (0.013)   0.309 (0.012)   90.983 (5.882)     152.253 (7.477)

Table: Estimated α, γ, µ1, µ0, ρ1, ρ0 of Users 1, 2 and 3.


Application to Sina Weibo Data

Figure: Parent hazard functions of Users 1, 2 and 3 (horizontal axis: time of day, from 12 am to 12 am; vertical axis: intensity).


Application to Sina Weibo Data

Figure: Plots of the mean and first three eigenfunctions of the estimated daily parent hazard functions (panels: mean function, first eigenfunction, second eigenfunction, third eigenfunction; horizontal axis: time of day, from 12 am to 12 am).


Characterize Sina Weibo User Behavior

Figure: Groups in the average daily parent hazard (left plot), average number of posts per episode (middle plot) and average length (in hours) of an episode (right plot). The percentages at the bottom of the boxplots show the percentage of users in each group: 3.2%, 15.6% and 81.2% (left); 4.2%, 20.4% and 75.4% (middle); 7.3%, 26.05% and 66.6% (right).


Social Effect on Users of Sina Weibo

For each Sina Weibo user, we were also able to collect the number of accounts the user was following (n→) and the number of accounts that were following this user (n←).

We find that there is a stronger correlation between n→ and µ0 (r = 0.205).
These observations indicate that users who follow more accounts are more likely to have more reposts.
One explanation could be that the more accounts a user follows, the more content they can repost from. Another plausible explanation is that the "followers" on social media tend to repost more.


Social Effect on Users of Sina Weibo

We find that the "popular" users, i.e., those whose accounts have many followers, tend to post more original content. They are also more likely to initiate their Weibo engagement by posting original content.

We find that users who have strong social ties, i.e., have many followers or follow many others, are more likely to use Weibo more often.

We find that users with many followers are more likely to spend more time on Weibo once they start an episode of engagement.


Simulation Study

We set the observation window length T = 100 and α = 0.6.
With each parameter configuration, we simulate 100 event trajectories.
We set the parent event hazard function as

\[
\lambda(t;\beta) = \exp\left[ \beta_{01} + \beta_{11}\cos(2\pi t) + \beta_{12}\sin(2\pi t) \right].
\]

For estimation, we use unit window length s = 1 or 5.
To model $\lambda(t;\beta)$, we consider both the true model and the nonparametric cyclic B-spline model. For the latter, we use the knot vector (0, 0.2, 0.4, 0.6, 0.8, 1).


Simulation Study

(γ, µ1, µ0, ρ1, ρ0)    (β01, β11, β12; s)   α               γ               µ1              µ0              ρ1               ρ0
(0.5,0.5,0.5,10,15)    (-2,-2,2; 5)         0.595 (0.010)   0.498 (0.013)   0.489 (0.014)   0.494 (0.014)   10.172 (0.261)   15.604 (0.365)
(0.5,0.5,0.5,10,15)    (-3,-3,3; 5)         0.594 (0.007)   0.496 (0.011)   0.510 (0.012)   0.518 (0.014)   9.867 (0.188)    15.422 (0.284)
(1.0,0.5,0.5,10,15)    (-2,-2,2; 5)         0.603 (0.009)   0.993 (0.017)   0.489 (0.011)   0.499 (0.012)   10.012 (0.176)   15.026 (0.257)
(0.5,1.0,1.0,10,15)    (-2,-2,2; 5)         0.598 (0.008)   0.511 (0.010)   0.990 (0.016)   1.025 (0.017)   10.149 (0.171)   15.084 (0.309)
(0.5,0.5,0.5,20,30)    (-2,-2,2; 5)         0.600 (0.008)   0.508 (0.012)   0.499 (0.012)   0.488 (0.013)   19.855 (0.460)   30.354 (0.717)
(0.5,0.5,0.5,10,15)    (-2,-2,2; 1)         0.601 (0.008)   0.468 (0.010)   0.495 (0.014)   0.460 (0.014)   10.795 (0.271)   16.335 (0.309)


Summary

We propose a new clustered temporal point process model for user-generated posts on social media.
The proposed model captures both inhomogeneity in the initial posting time and the clustering pattern in the subsequent posts following the initial post.
The proposed goodness-of-fit procedure shows that the proposed model fits the data reasonably well.
The fitted models provide valuable insights into a user's content-generating behavior.
