modeling the evolution of discussion topics and...

9
Modeling the Evolution of Discussion Topics and Communication to Improve Relational Classification Ryan Rossi Department of Computer Science Purdue University [email protected] Jennifer Neville Department of Computer Science Purdue University [email protected] ABSTRACT Textual analysis is one means by which to assess commu- nication type and moderate the influence of network struc- ture in predictive models of individual behavior. However, there are few methods available to incorporate textual con- tent into time-evolving network models. In particular, mod- eling both the evolution of network topology and textual content change in time-varying communication data poses a difficult challenge. In this work, we propose a Temporally- Evolving Network Classifier (TENC) to incorporate the in- fluence of time-varying edges and temporally-evolving at- tributes in relational classification models. To facilitate this, we use an evolutionary latent topic approach to automati- cally discover and label communications between individuals in a network with their corresponding latent topic. The top- ics of the messages are incorporated into the TENC along with time-varying relationships and temporally-evolving at- tributes, using weighted, exponential kernel summarization. We evaluate the utility of the TENC on a real-world classifi- cation task, where the aim is to predict the effectiveness of a developer in the python open-source developer network. We take advantage of the textual content in developer emails and bug communications, which both evolve over time. The TENC paired with the latent topics significantly improves performance over the baseline classifiers that only take into account the static properties of the topics and communi- cations. The results show that the TENC can be used to accurately model the complete-set of temporal dynamics in time-evolving communication networks. Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications— Data Mining General Terms Algorithms, Experimentation Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 1st Workshop on Social Media Analytics (SOMA ’10), July 25, 2010, Washington, DC, USA. Copyright 2010 ACM 978-1-4503-0217-3 ...$10.00. Keywords Statistical relational learning, temporal-relational models, topic modeling 1. INTRODUCTION Dependencies in relational and network domains have been successfully exploited in classification models (see e.g., [6]). This work focused primarily on modeling static network data. For the numerous relational domains that have tem- poral dynamics, researchers have generally analyzed static snapshots of the data, which consist of all the objects, links, and attributes that have occurred up to and including time t (e.g., [5, 12]). This approach ignores the temporal infor- mation present in the data and limits the applicability of the models. In many datasets there are dependencies be- tween the temporal and relational information that can be exploited to improve model performance. Recent work in statistical relational learning has started to investigate these temporal dimensions. Some initial work has focused on transforming temporal-varying links and ob- jects into aggregated features [12] and other work has in- vestigated the modeling of the temporal dynamics of net- work structure [14]. This work has focused on classification, but there has also been some efforts to exploit temporally- varying links to improve automatic discovery of relational communities or groups [3, 7]. In addition to network topology that changes over time, many relational datasets contain textual content that changes over time. As an example, consider email datasets, which naturally evolve over time and often contain noisy link in- formation (as a single email is not necessarily an indication of a strong relationship). There are also other facets of com- munication that evolve over time such as the topic of mes- sages between two users over time. Although frequency of communication may be indicative of relationship strength, frequency count is limited in its ability to capture the com- plexity of interpersonal iteractions between people. It is quite possible that the topic, sentiment, and/or intimacy of the email content is a stronger indicator of relationship type and strength, compared to simple frequency. While textual content offers a wealth of information for predictive network modeling, there has not been a lot of research investigating how to model the network topology and message content, particuarly in the case where both are evolving over time. There has been some work that mod- els a static network of documents, using bag-of-word meth- ods [4, 15]. In addition, there are also a number of clustering techniques for discovering topics, roles, and groups in social

Upload: lemien

Post on 12-Jul-2019

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

Modeling the Evolution of Discussion Topics andCommunication to Improve Relational Classification

Ryan RossiDepartment of Computer Science

Purdue [email protected]

Jennifer NevilleDepartment of Computer Science

Purdue [email protected]

ABSTRACTTextual analysis is one means by which to assess commu-nication type and moderate the influence of network struc-ture in predictive models of individual behavior. However,there are few methods available to incorporate textual con-tent into time-evolving network models. In particular, mod-eling both the evolution of network topology and textualcontent change in time-varying communication data poses adifficult challenge. In this work, we propose a Temporally-Evolving Network Classifier (TENC) to incorporate the in-fluence of time-varying edges and temporally-evolving at-tributes in relational classification models. To facilitate this,we use an evolutionary latent topic approach to automati-cally discover and label communications between individualsin a network with their corresponding latent topic. The top-ics of the messages are incorporated into the TENC alongwith time-varying relationships and temporally-evolving at-tributes, using weighted, exponential kernel summarization.We evaluate the utility of the TENC on a real-world classifi-cation task, where the aim is to predict the effectiveness of adeveloper in the python open-source developer network. Wetake advantage of the textual content in developer emailsand bug communications, which both evolve over time. TheTENC paired with the latent topics significantly improvesperformance over the baseline classifiers that only take intoaccount the static properties of the topics and communi-cations. The results show that the TENC can be used toaccurately model the complete-set of temporal dynamics intime-evolving communication networks.

Categories and Subject DescriptorsH.2.8 [Database Management]: Database Applications—Data Mining

General TermsAlgorithms, Experimentation

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.1st Workshop on Social Media Analytics (SOMA ’10), July 25, 2010,Washington, DC, USA.Copyright 2010 ACM 978-1-4503-0217-3 ...$10.00.

KeywordsStatistical relational learning, temporal-relational models,topic modeling

1. INTRODUCTIONDependencies in relational and network domains have been

successfully exploited in classification models (see e.g., [6]).This work focused primarily on modeling static networkdata. For the numerous relational domains that have tem-poral dynamics, researchers have generally analyzed staticsnapshots of the data, which consist of all the objects, links,and attributes that have occurred up to and including timet (e.g., [5, 12]). This approach ignores the temporal infor-mation present in the data and limits the applicability ofthe models. In many datasets there are dependencies be-tween the temporal and relational information that can beexploited to improve model performance.

Recent work in statistical relational learning has startedto investigate these temporal dimensions. Some initial workhas focused on transforming temporal-varying links and ob-jects into aggregated features [12] and other work has in-vestigated the modeling of the temporal dynamics of net-work structure [14]. This work has focused on classification,but there has also been some efforts to exploit temporally-varying links to improve automatic discovery of relationalcommunities or groups [3, 7].

In addition to network topology that changes over time,many relational datasets contain textual content that changesover time. As an example, consider email datasets, whichnaturally evolve over time and often contain noisy link in-formation (as a single email is not necessarily an indicationof a strong relationship). There are also other facets of com-munication that evolve over time such as the topic of mes-sages between two users over time. Although frequency ofcommunication may be indicative of relationship strength,frequency count is limited in its ability to capture the com-plexity of interpersonal iteractions between people. It isquite possible that the topic, sentiment, and/or intimacy ofthe email content is a stronger indicator of relationship typeand strength, compared to simple frequency.

While textual content offers a wealth of information forpredictive network modeling, there has not been a lot ofresearch investigating how to model the network topologyand message content, particuarly in the case where both areevolving over time. There has been some work that mod-els a static network of documents, using bag-of-word meth-ods [4, 15]. In addition, there are also a number of clusteringtechniques for discovering topics, roles, and groups in social

Page 2: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

networks [9, 8]. These approaches learn topic distributionsbased on email messages sent between users. The email mes-sages are seen to be informative of the individual’s role andconsequently the groups to which they belong.

There are also other related text mining approaches forsocial networks. Agrawal et. al [1] use newsgroups to com-pare between statistical text analysis versus using only thenetwork structure as a basis for a given classification task.Mei and Zhai [10] develop a temporal text mining techni-queto find interesting evolutionary theme patterns. The ap-proach focuses on the evolution and transition of themesacross time. In another approach by Mika [11], the tradi-tional bipartite model of ontologies is extended to incorpo-rate a social dimension, resulting in a tripartite model ofvarious actors, concepts and instances.

The goal of our work is to improve attribute prediction indynamic domains by incorporating time-varying links andtemporally-evolving attributes into statistical relational mod-els. In particular we are interested in modeling the influenceof evolving latent topics corresponding to the textual se-mantics of an individuals communication in our sequencesof graphs across time. We focus our analysis on the taskof predicting effectiveness within a python open-source de-veloper communication network. We posit that the natureor topic of the communication between two developers iscrucial in predicting the effectiveness of a developer. Tradi-tional approaches treat the communications uniformly. Ourmethod assigns latent topics to the communications and in-corporates the topic information into predictive models toimprove performance.

More specifically, we extract and analyze the publicly avail-able data about open-source software development from thePython project (www.python.org). The communications be-tween developers are each labeled with its corresponding la-tent topic. The effects of the communication patterns withrespect to the latent topics are investigated for their impacton developer effectiveness. We derive a temporally-evolvingnetwork classifier to incorporate the temporal evolution ofboth the relationships and attributes. This classifier is usedto learn predictive models of developer effectiveness and ex-plore the utility of using the latent topics to predict devel-oper effectiveness. Our analysis indicates the significanceof using the textual information and its evolution over time.Furthermore we find that modeling both temporally evolvingtopic attributes and time-varying relationships improves theaccuracy of the models even further, compared to baselinemodels that only consider static snapshots of the networks.

The main contribution of our work is two-fold. We de-velop a temporally-evolving network classifier (TENC) capa-ble of modeling the influence of the complete-set of temporaldynamics in relational domains. Secondly, textual analysistechniques are used to derive a network where the edgesand nodes contain latent information which is used as thebasis to predict individual effectiveness within a communi-cation network. These proposed techniques could be usedto explore other networks such as blogs, forums, or othersocial media. The TENC is general enough to model thecomplete-set of temporal dynamics for arbitrary relationalnetworks.

2. COMMUNICATION NETWORKIn this work, one of our goals is to derive a temporally-

evolving network classifier capable of modeling the evolu-

tion of the latent topics for the communications. In thisregards, we analyze observational data from a domain thatreflects some characteristics of communication and social re-lationships. Open-source software development is one suchdomain where the communication among developers is crit-ical to the success of the project. Email is a common formof communication among the developers, and often mailinglists are used to ensure timely delivery of messages to allinterested parties. The textual content of both emails andbug messages between developers are extracted and modeledusing our temporal classifier.

2.1 DataWe analyze email and bug communication networks ex-

tracted from the open-source Python development environ-ment (www.python.org). In Python development, the pri-mary location for communication is the python-dev mailinglist, which is publicly available for subscription or download.

We collected a subset of the python-dev mailing list archivefor the period 01/01/07−09/30/08. The sample contains13181 email messages, among 1914 users. Each message con-tains the following information: message id, sender emailaddress, sender name, timestamp, in-reply-to message id,subject, body. From the set of messages, we constructedan email communication network with nodes correspondingto each unique email address, and edges from the sender ofeach message to the author of the in-reply-to message.

In addition to mailing list discussions, the Python projectalso has a public bug-tracking database (bugs.python.org),which records information about projects errors (i.e., bugs)and their associated fixes (i.e., solutions). Each bug is as-sociated with the following information: report date, er-ror type, component, status, assigned-developer, resolutiondate. In addition, there is information describing the detailsof the bug and the history of activity related to the bug.There is also a set of messages that record discussion amongthe developers during their attempt to resolve the bug. Theformat of these messages is similar to the format of the emailmessages.

Bug reports were collected for the same period listed above(01/01/07−09/30/08) and we constructed a second bug dis-cussion network, where nodes consisted of people with uniqueemail addresses and edges consisted of bug comments sentin reply to a previous posting. The sample contained 69435bug comments among 5108 users.

In addition to the bug and communication networks wealso extracted text from the bug and email messages. Usingthe text from these messages we perform topic modelingand label the links of the network with their appropriatetopic. We also assign developers to their most frequently

Python Communication Network Attributes

Latent TopicsAll TopicsEmail TopicsBug Topics

Link EdgeCount EdgeTopicAttributes EmailEdgeCount EmailEdgeTopic

BugEdgeCount BugEdgeTopicTemporal [Aug - Oct ’07] [Nov ’07 - Jan ’08]Snapshots [Feb - April ’08] [May - July ’08]

[Aug - Sept ’08]

Table 1: Details of the Python communication net-work dataset.

Page 3: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

Figure 1: Graph and Attribute Summarization Phase: Temporally evolving edges and attributes.

communicated topic. See Section 4.1 for more detail. Forour analysis, we used the set of 185 developers that appearedin both networks (i.e., those that had at least one mailinglist message and at least one bug comment).

The defect information (i.e., bugs) associated with thePython dataset enables us to derive measures of individualeffectiveness that are consistent with the performance as-pect of effectiveness measures. In broad terms, assessingperformance on a specific problem-solving task (i.e., prob-lem: errors/bugs and solution: associated fixes) as a mea-sure of effectiveness fits well within the context of a broadertheoretical approach to effectiveness in group settings. Asa measure of developer effectiveness we used the number ofbugs resolved by a developer in a given time window. A bi-nary class label is created that recorded whether a developerhas fixed a bug during a three-month period.

2.2 Network FormulationThe Python Communication Network consists of a collec-

tion of temporal snapshots. Let D = {D1, D2, ..., Dn} bea sequence of temporal snapshots from the relational com-munication network. Every temporal snapshot correspondsto the events that occured during the time period t, wheret = 1, 2, ..., n. The size of the temporal snapshots are threemonth periods where: D1 = (Feb07,Mar07, Apr07), D2 =(May07, Jun07, Jul07), ..., D7 = (Aug08, Sep08). Note thatthe last temporal snapshot has only two months.

The latent topics of the communications between develop-ers are used to predict individual effectiness ([Has Closed]).Table 1 lists the topic features used in our models. We couldhave also generated temporal centrality or other types of at-tributes based on communication patterns. In this work,we are only concerned with using the evolutionary latenttopics of the communications between developers in orderto predict effectiveness. This allows for the significance ofthe actual semantics of the communications to be quantifiedwith respect to the measure of effectiveness.

In many relational classification methods, the textual con-tent on links is discarded and the fact that a communicationoccured between di and dj is the only aspect taken into ac-count. In this work, we are interested in using the textualcontent of the communications to label both the edges andvertices with their corresponding latent topics. Our model

uses the relational dependencies between the developers andthe latent topics extracted from the textual content of theemails and bugs. More specifically, we focus on predictingindividual effectiveness given the latent topics of the com-munications among individuals.

3. APPROACHWe represent the data as an attributed graph D = (G,X).

The graph G = (V,E) represents a set of N individuals asnodes, such that vi ∈ V corresponds to individual i and eachedge eij ∈ E corresponds to a communication event (e.g.,email) between individuals i and j. The attribute set:

X =

„XV = [X1, X2, ..., Xmv ],XE = [Xmv+1, Xmv+2, ..., Xmv+me ]

«may contain observed attributes on both the nodes (XV)and the edges (XE). Below we use Xm to refer to the genericmth attribute on either nodes or edges.

There are three aspects of relational data that may varyover time. First, the values of attribute Xm may vary overtime:

Xm =

2666666664

xm11 xm12 · · · xm1t

xm21 xm22 · · · xm2t

xm31 xm32 · · · xm3t

......

. . ....

xmn1 xmn2 · · · xmnt

3777777775Second, relationships may vary over time. This results in adifferent data graph Gt = (V,Et) for each time step t, wherethe nodes remain constant but the edge set may vary (i.e.,Eti 6= Etj for some i, j). Third, objects existence may varyover time (i.e., objects may be added or deleted). This isalso represented as a set of data graphs G′t = (Vt, Et), butin this case both the objects and edge sets may vary.

In this work, we aim to model a communication networkthat evolves over time, with both edges and attributes (e.g.,topics) changing. For simplicity, we assume that the set ofnodes remains constant. More specifically, let Dt = (Gt,Xt)refer to the dataset set at time t, where Gt = (V,Et,W

Et )

and Xt = (XVt ,X

Et ,W

Xt ). Here Wt refers to a function that

Page 4: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

assigns weights on the edges and attributes that are usedin TENC below. We define WE

t (i, j) = 1 if eij ∈ Et and 0otherwise. Similarly, we define WX

t (xmi ) = 1 if Xmi = xmi ∈

Xmt and 0 otherwise.

3.1 Temporally-Evolving Network ClassifierThe Time Varying Relational Classifier (TVRC) [14] is

a recent approach that has been developed to incorporatetemporal dependencies into statistical relational models. Wedefine the Temporally-Evolving Network Classifier (TENC)as an extension of the TVRC to incorporate the influenceof temporally evolving attributes (e.g., topics) on both theedges and the nodes. The TENC uses a two-step processthat first transforms the dynamic relational dataset into astatically weighted summary graph and set of summary at-tributes using kernel smoothing. The second phase thenincorporates the static edge and attribute weights into amodified relational classifier to moderate the conditional at-tribute dependencies throughout the relational data graph.

The graph summarization phase follows the TVRC pro-cess by summarizing a temporal sequence of relational graphsinto a weighted summary graph. Let Gt = (V,Et,W

Et )

be the relational graph at time step t, where WEt are unit

weights on the edges as defined above. We define the sum-mary graph GSt = (V,ESt ,W

ESt

) at time t as a weightedsum of the temporal graphs up to time t as follows:

ESt = E1 ∪ E2 ∪ · · · ∪ Et

WESt

= α1WE1 + α2W

E2 + · · ·+ αtW

Et =

tXi=1

KE(Gi; t, θ)

where Et is the edge set of the temporal snapshot Gt, andWEt are the unit weights associated with the edges of Gt.

The α weights determine the contribution of each temporalsnapshot in the summary graph. For weighting, we use anexponential kernel function KE with parameter θ to deter-mine the influence of each edge of snapshot graph Gi in thesummary graph:

KE(Gi; t, θ) = (1− θ)t−iθWEi

With the exponential kernel function, we can compute thesummary graph weights recursively, as a combination of thesummary graph weights at time t−1 and the snapshot graphweights at time t:

WESt

=

((1− θ)WE

St−1+ θWE

t if t > t1

θWEt if t = t1

where t1 refers to the first timestep of the data.In addition to summarizing the graph changes over time

as in the TVRC, the TENC also summarizes the attributechanges in a similar fashion. Let Xt = (XV

t ,XEt ,W

Xt ) be

the attribute data at time step t, where WXt are unit weights

on the attribute values as defined above. We define thesummary attribute set XSt = (XV

St,XE

St,WX

St) at time t as

a weighted sum of the temporal attribute sets up to time tas follows:

XVSt

= XV1 ∪XV

2 ∪ · · · ∪XVt

XESt

= XE1 ∪XE

2 ∪ · · · ∪XEt

WXSt

= β1WX1 + β2W

X2 + · · ·+ βtW

Xt =

tXi=1

KX(Xi; t, λ)

Figure 2: The TENC models the complete-setof temporal dynamics, including node attributeschanging over time (top row), edges changes overtime (middle row), and link attributes changing overtime (bottom row).

where Xt is the attribute set of the temporal dataset Dt, andWXt are the unit weights associated with the attribute val-

ues of Xt. The β weights determine the contribution of eachtemporal snapshot in the summary network. For weighting,we again use an exponential kernel function KX with pa-rameter λ to determine the influence of each attribute valueof Xi in the summary network:

KX(Xi; t, λ) = (1− λ)t−iλWXi

We can also compute the summary attribute weights recur-sively, as a combination of the summary attribute weightsat time t− 1 and the snapshot attribute weights at time t:

WXSt

=

((1− λ)WX

St−1+ λWX

t if t > t1

θWt if t = t1

The exponential kernel weights attribute values (and edges)that have occurred in the recent past highly and then de-cays those weights exponentially as time passes. The pa-rameters θ and λ determines the rate of decay for edges andattributes, respectively. The kernel smoothing operation onthe input temporal sequence {D1, D2, · · · , Dt} can also beexpressed as a recursive computation on the weights {WE

1 ,WE

2 , · · · , WEt } and {WX

1 , WX2 , · · · , WX

t } through time asshown above.

The second phase of the TENC algorithm is similar tothe weighted relational classification of the TVRC. Once wehave summarized the temporal graph and attribute infor-mation into edge and attribute weights, respectively, in thesummary network (GSt ,XSt). We then learn a predictivemodel on the summarized data using the edge and attributeweights to moderate the influence of the weighted relationaltopic attributes. This approach exploits the temporal infor-mation in both the communications and the attribute val-ues to improve the prediction results. More specifically, the

Page 5: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

(a) TVRC Model (b) TENC Model

Figure 3: (a) The TVRC feature calculation that includes only the summary edge weights. (b) The TENCfeature calculation that incorporates both the summary attribute weights and the summary edge weights.

set of attributes that are used for learning include the setof attribute values in XSt and each attribute value xmi isweighted by its summary weight WX

St(xmi ). When relational

attributes are considered by the model, the attributes areweighted by the product of their temporal attribute weightand the link weight of the appropriate link in the summarygraph. For example, if node i and j are linked in GSt andthe model uses attribute Xk on node j to predict the classlabel on node i, then if WE

St(i, j) = ω and WX

St(xkj ) = ψ,

the attribute value xkj will be given a weight of ω · ψ duringlearning.

Any arbitrary relational model that can be extended touse weighted instances is suitable for this phase. In thiswork, we use the relational probability tree (RPT) [13] tocompare roughly equivalent models, both with and withoutreasoning over the temporal dynamics.

The weights can be viewed as probabilities that a partic-ular relationship or attribute value in the network is stillactive at the current time step t, given that it was observedat time (t−k). Further if we consider the exponential kernel,then we have a geometric distribution with either parameter,θ for edges or λ for attributes. The distribution specifies theprobability that an edge or attribute value occured k timesteps in the past.

The edges in Figure 3 represent the communications be-tween individuals and the topics on the nodes are denotedτ1 (green), τ2 (blue), and τ3 (red). The thickness of the edgerepresents the temporal strength of the communication andthe colors represent attribute values of either τ1, τ2, and τ3.The TENC Model has probability distributions of attributevalues for every node. The weighted relational neighborhoodis used in the prediction process.

Figure 3 shows the feature calculations of TVRC andTENC. In particular, consider the weighted mode of thelinked topics in TVRC. It is obvious from Figure 3(a) thatthe combined weight associated with the latent topic τ3 (red)is greater than the other observed latent topics. Never-theless, consider the weighted mode of the linked topics in

TENC where both the influence of the edges and attributesare incorporated. Clearly, the combined weight associatedwith τ2 (blue) is greater than the combined weight of τ3(red) as shown in Figure 3(b). Therefore by modeling thecomplete-set of temporal dynamics we are able to more ac-curately capture and exploit the evolutionary patterns toimprove predictions of relational classifiers.

In order to incorporate the weights from the summarygraph and summary attributes, we define a weighted RPTby modifying the counts of the attribute value when com-puting the RPT aggregate functions. The weighted rela-tional neighborhood is incorporated in the feature calcula-tion. More specifically, the product of the edge weight andthe corresponding attribute weight represents the combinedinfluence of an attribute Xij and edge Eij across time.

The RPT classifier takes a sequence of temporal snapshots{D1, D2, ..., Dt} from the communication network. In everytimestep we predict effectiveness using the Has Closed at-tribute. The other objects and links form the relational net-work used as a basis for our predictions. The RPT algorithmconstructs a probability estimation tree for the temporalsample Dt to predict the effectiveness of a future Dt+1 tem-poral sample. The temporal latent topic attributes changefrom Dt to Dt+1.

The temporal snapshots in TENC are summarized andthen weighted using the exponential kernel function. Thegraph summarization phase summarizes a temporal sequenceof relational graphs which are then used to predict the ef-fectiveness or the Has Closed attribute. In TENC therelational topic attributes are moderated by the summarygraph weights and the summary attribute weights.

For every temporal snapshot Dt we learn a model andapply it to Dt+1 a future temporal snapshot to predict theeffectiveness. The TENC uses the current summarized tem-poral snapshot DSt to predict the future DSt+1 summarizedtemporal snapshot.

Page 6: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

3.2 Content AnalysisAn edge between two developers in the python communi-

cation network represents an email or bug communication.We use techniques based on Latent Dirichlet Allocation[2]to extract topics of the communications. The latent topicsare used to label the communication links between users.Let T = {τ1, τ2, ..., τκ} be a set of latent topics extractedfrom the bugs and email communications. In the task ofpredicting effectiveness between individuals we might findthat τi = {web programming} and τj = {sports} there-fore it is clear that individuals communicating about sportsmay be less effective, while communications about web pro-gramming might be a significant indicator of effectiveness.The latent topics provide context for the links instead of theuniform notion of a simple communication. This methoddisambiguates and therefore assigns useful meaning for theedges and consequently the nodes in a relational network.

The bug and email communications are from developerswho work on distributed teams. Let C = {c1, c2, ..., cm} be aset of bug and email communications andW = {w1, w2, ..., wn}be a set of words from the communications between devel-opers. We use the vector-space representation. We definean n×m matrix denoted M where the coefficents representthe frequency of wi ∈ W appearing in cj ∈ C. The matrixis of size 191607 words × 82616 communications. A stan-dard list of stopwords were removed. Every communicationis represented by an edge between two people in the rela-tional network. Using the text from the communication weassign the edge an attribute that corresponds to the topicof communication. We introduce a latent class variable forthe topics T = {τ1, τ2, ..., τκ} where κ is the number oftopics to be discovered. This can be thought of as a type

Figure 4: Communication links with uniform seman-tics.

Figure 5: Adding textual information by labelinglinks in the network with latent topics.

of dimension reduction where the communications are pro-jected into κ dimensions (or topics). Every communicationedge is labeled with its most likely topic and conversely ev-ery individual is labeled with their most frequent topic ofcommunication. The topic discovery is extended temporallyby considering every timestep separately. Therefore at timet the topic of a communication between di and dj might beτ1, but at t + 1 the topic of conversation might evolve intotopic τ2 and a similar temporal transition could occur at theindividual or attribute level as well. The next obvious step isto use the full distribution of the topic likelihood instead ofselecting the most likely topic. Further details are discussedin section 4.1 on topic discovery.

4. RESULTSIn our experiments, we evaluate the various temporal di-

mensions of communication networks, with respect to thetask of predicting node behavior. The PyComm Network isa complete-temporal real-world dataset where nodes, edges,and attributes are rapidly evolving over time. Additionally,the class label based on developer productivity is also chang-ing across time. In our empirical evaluation, we specificallyconsider the effects of modeling the latent topics of the com-munication networks and their evolution over time.

4.1 Topic DiscoveryThe number of topics to estimate depends on many fac-

tors including the type of corpus (web content, open sourcedevelopment, ..) and also on the size of the collection. Wechose κ = 20 as the majority of communications are mostlikely to focus on some aspect of the 18 Python projectsunder development.

We removed a standard list of stopwords from the com-munications and also other technical words that appearedwith high frequency (e.g., python). A topic can be viewedas a cluster of words that are frequently used together. Weused a version of Latent Dirichlet Allocation [2] to model theκ topics. To estimate the parameters we used Expectation-Maximizatiom (EM) and Gibbs sampling for inference.

We modeled the topics in three different sets of communi-cations: (1) the email communication, (2) the bug commu-nications, and (3) the joint set of email and bug messages.After extracting the latent topics from the email and bugcommunications we label the edges of our relational com-munication network with their appropriate latent topic. Theedges are labeled by performing inference on a communica-tion and assigning it to the topic with the greatest liklihoodand consequently a nodal attribute is defined as the modeof the neighbors latent topics.

We generate three object attributes {AllTopic, Email-Topic, BugTopic} where a developer is assigned to themost prevalent topic in a particular set of communications.As an example if a developer most often discusses ’testing’in her bug messages for a particular timestep then she wouldbe assigned to the BugTopic corresponding to testing.

The text of the messages between individuals in the Py-Comm network are used to predict individual effectiveness.Using this text we automatically extract hidden semanticsof the messages and assign an arbitrary message to a la-tent topic by considering the latent topic with the greatestliklihood.

In Table 2 we list the most likely words for five of the 20topics when we combine bug and email communication to-

Page 7: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

(a) Initial Network (b) Uniform semantics (c) Edges labeled (d) Nodal attributes assigned

Figure 6: Defining a network using only the evolving textual information. The edges are labeled with theircorresponding latent topic. (a) The initial network has only the textual information associated with themessages between developers. As an example, Ci1 = {w1, w2, ..., wz}. (b) All edges and topic attributes areinitially semantically uniform. (c) Edges are labeled with their corresponding latent topic using maximumliklihood. (d) Nodal attributes are assigned a latent topic based on the mode of the neighboring latent topics.

Table 2: The first five topics from the email and bugcommunications. The number of topics modeled istwenty (κ = 20).

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5dev logged gt code test

wrote patch file object libguido issue lt class view

import bugs line case svncode bug os method trunkpep problem import type revmail fix print list modules

release fixed call set buildtests days read objects ampwork created socket change error

people time path imple usrmake docu data functions includepm module error argument homeve docs open dict file

support added windows add runmodule check problem def mainthings doc traceback methods localgood doesnt mailto exception srcvan report recent ms directory

gether. The table contains words with both positive andnegative connotation such as ‘good’ or ‘doesnt’ and alsowords referring to the network domain such as bugs or ex-ception. An intersting direction to pursue would be to usesentiment analysis to automatically identify the communi-cations with positive and negative tone and use those pre-dictions in the analysis, to moderate relationships amongdevelopers.

It is difficult to subjectively assess the high level mean-ing of a given latent topic from the communications. Thisis likely due to the fact that we are analyzing communica-tion in a highly specific technical domain. Nevertheless weidentify some patterns from the topics. As an example, thefirst topic seems to be about open source development in amuch broader sense. The word ‘guido’ appears significant,since this is likely a reference to the author of the Pythonprogramming language, Guido van Rossum.

4.2 Predicting EffectivenessThe experiments below are intended to investigate the fol-

lowing hypotheses in this domain:

• Relational latent topic information can improve modelperformance over using simple intrinsic aggregate topicfeatures.

• Information about the temporal dynamics of the latenttopics can improve model performance.

• Modeling the topics influence of both the time-varyingedges and temporally-evolving attributes improves per-formance over simply modeling the temporal dynamicsof edges.

The main facets of interest are whether incorporating theinfluence of both time-varying edges and temporally-evolvingattributes (in the TENC) improves performance over simplymodeling the temporal dynamics of the time-varying edges(in the TVRC). The second facet is if by using TENC, wecan successfully predict the effectiveness through the latenttopics of the communications between developers. Alongthese lines, we evaluate the significance of TENC where theinfluence of both edges and attributes are incorporated intothe model. These experiments provide insight into model-ing the complete-set of temporal dynamics as well as usingtextual analysis techniques to incorporate the temporal dy-namics of the latent topics into networks.

We evaluate four different models for the classificationtask of predicting effectiveness using the latent topics. TheTVRC, Window, and Union Models are used as baselineclassifiers to evaluate the significance of TENC where theinfluence of the topics from both the temporally evolvingedges and attributes are incorporated to predict effective-ness:

− TENC: We present the results of the TENC algorithmusing an exponential smoothing kernel for summariza-tion of both the relationships and the attributes. TheTENC model first summarizes the network into DSt

using all snapshots up to and including t, selectingthe weighing parameters θ and λ using k-fold crossvalidation and then weights the relationships and at-tributes appropriately during learning and prediction.

Page 8: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

This model summarizes both the relationships and theattributes from the latent topics.

− TVRC: This model summarizes the graph into GSt

using all timesteps up to and including t, selecting theweighting parameter θ using k-fold cross validation.The attributes are moderated using the temporal edgestrengths only.

− Union Model: The union model is a baseline modelthat uses a graph D≤t, which consists of all objectsand links up to and including the year t of the sample,for learning. The union model does not weight theattributes or the links.

− Window Model: The window model is again a base-line model that uses the network Dt−1 for predictionon network Dt. In other words, it only uses the im-mediate past information for prediction and ignores allthe other historical data. This model corresponds tothe RPT.

The models are learned using the latent topics of the com-munications and the performance is evaluated using areaunder the ROC curve (AUC). For every timestep t, we learna model on Dt and apply the model to Dt+1. The utility ofincorporating the various granularities of temporal informa-tion is measured by comparing TENC to the TVRC. Theutility of the temporal information is further evaluated bycomparing against two baseline models that ignore the tem-poral aspects.

We first model the latent network semantics of communi-cations (i.e, textual information) between developers throughtopic discovery. Using these temporal latent topics, we applythe four relational classifiers. In Figure 7 the results indicatethe necessity of appropriately modeling the complete-set oftemporal dimensions. Our TENC model that incorporatesboth the temporal influence of the topic attributes and edgesdrastically improves model performance over the baselinemodels. Nevertheless, the TVRC model that incorporatesonly the influence of time-varying edges also improves theaccuracy of our models over the Window and Union Modelswhich ignore the temporal aspects of the dataset.

The temporal dynamics must be modeled and incorpo-rated to successfully predict effectiveness as the static mod-els are seen to be poor predictors. This is an interest-ing point since it implies that the static latent topics arenot very meaningful. The performance is substantially im-proved only when the temporal-dynamics are incorporatedinto the predictive models. Modeling the temporal influenceof attributes and edges improves performance significantlyover all the static models (Window/Union Models) in ev-ery timestep as seen above. Therefore we have shown thenecessity to model the complete-set of temporal dynamicsof datasets to maximize the accuracy of the performance ofclassifiers.

Evolutionary patterns between the topics of communica-tions and the individuals themselves are clearly present withinthe communication network. These results indicate thatproductive developers usually communicate about similartopics or aspects of development. An effective communica-tion has a specific structure or latent meaning that conse-quently enables others to become more effective. This latentstructure is captured more accurately through evolutionary

T=1 T=2 T=3 T=4 AVG

AU

C

0.65

0.70

0.75

0.80

0.85

0.90

0.95 TENC

TVRCWindow ModelUnion Model

Figure 7: AUC results for each timestep whereTENC is compared against the baseline models. Themodels use only the latent topics of the communica-tions to predict effectiveness. LDA is used to auto-matically discover the latent topics as well as anno-tating the communication links and individuals withtheir appropriate topic in the temporal networks.

topic patterns. Moreover, there are indications of evolu-tionary patterns within the topics. We posit that as thesemantics of a topic evolve over time, an effective individu-als messages also transition to corresponding effective topics.An effective topic might evolve over time into an uneffectivetopic. As an example, suppose an effective topic at timet is mainly about a specific set of functions. However, attime t+1 the functions become depricated and therefore theeffective topic has evolved into an ineffective topic of com-munication. It is likely that the individuals who are still ac-tively pursuing development and discussing the depricatedoperations are also becoming ineffective.

Our results indicate the necessity of appropriately mod-eling the full set of temporal dimensions. The TENC per-formance is significantly better than the static Window andUnion models at p < 0.01 and better than TVRC at p < 0.1level. We have shown that by modeling and incorporatingthe temporal influence of attributes and relationships leadsto a vast improvement in accuracy and an overall robustmodel.

5. CONCLUSIONSWe presented a novel approach for modeling time-varying

relational data that incorporates textual content by use oflatent topics where both the communication links and top-ics are evolving over time. The TENC framework is notrestricted to communication networks, but is applicable toa wide array of relational domains where the relationshipsbetween entities change over time as well as the entities at-tributes. Furthermore, the latent semantic labeling tech-nique is not restricted to natural languages and can be ap-

Page 9: Modeling the Evolution of Discussion Topics and ...snap.stanford.edu/soma2010/papers/soma2010_13.pdf · pare between statistical text analysis versus using only the network structure

plied to label the edges and nodes of other types of networks.We have shown the significant increase in performance by in-cluding the temporal influence of the latent topics on boththe attributes and edges.

The temporal dynamics of the latent topics is a signifi-cant aspect in the prediction of individual effectiveness. Thisis shown using the Temporally-Evolving Network Classifierwhere the performance is substantially increased when mod-eling the influence of the temporal dynamics. The utility ofmodeling a developers latent topic of communications acrosstime is important in determining future communications.This means that a developers past topics of communicationsare somewhat predictive of their future conversations withtheir codevelopers and the influence should be modeled ac-cordingly (i.e, decayed over time). These results indicate thenecessity of modeling both the time-varying communicationedges and the temporally-evolving latent topic attributes.These findings warrant a more systematic investigation ofextending relational learning algorithms to take into accountthe complete-set of temporal dimensions as well as usingsimilar textual analysis approaches for labeling edges andcreating attributes through latent temporal information.

Our future work will consider using the entire distributionof topic liklihoods instead of selecting the most likely topic.We are also interested in the temporal evolution of topicswith respect to the corresponding individuals communica-tions. In particular the temporal patterns and transitions oftopics as they relate to the productivity of the individuals.

6. ACKNOWLEDGMENTSThis research is supported by DARPA and NSF under con-tract number(s) NBCH1080005 and SES-0823313. The U.S.Government is authorized to reproduce and distribute reprintsfor governmental purposes notwithstanding any copyrightnotation hereon. The views and conclusions contained hereinare those of the authors and should not be interpreted asnecessarily representing the official policies or endorsementseither expressed or implied, of DARPA, NSF, or the U.S.Government. This research was also made with Governmentsupport under and awarded by DoD, Air Force Office of Sci-entific Research, National Defense Science and EngineeringGraduate (NDSEG) Fellowship, 32 CFR 168a.

7. REFERENCES[1] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu.

Mining newsgroups using networks arising from socialbehavior. In Proceedings of the World Wide WebConference, pages 529–535, 2003.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichletallocation. Journal of Machine Learning Research,3:993–1022, 2003.

[3] D. Cai, Z. Shao, X. He, X. Yan, and J. Han.Community mining from multi-relational networks. InProceedings of the 9th European Conference onPrinciples and Practice of Knowledge Discovery inDatabases, 2005.

[4] S. Chakrabarti, B. Dom, and P. Indyk. Enhancedhypertext categorization using hyperlinks. InProceedings of the ACM SIGMOD InternationalConference on Management of Data, pages 307–318,1998.

[5] P. Domingos and M. Richardson. Mining the networkvalue of customers. In Proceedings of the 7th ACMSIGKDD International Conference on KnowledgeDiscovery and Data Mining, pages 57–66, 2001.

[6] L. Getoor and B. Taskar, editors. Introduction toStatistical Relational Learning. MIT Press, 2007.

[7] S. Hill, D. Agarwal, R. Bell, and C. Volinsky. Buildingan effective representation of dynamic networks.Journal of Computational and Graphical Statistics,Sept, 2006.

[8] A. McCallum, X. Wang, and A. Corrada-Emmanuel.Topic and role discovery in social networks withexperiments on enron and academic email. In Journalof Artificial Intelligence Research (JAIR), pages249–272, 2007.

[9] A. McCallum, X. Wang, and N. Mohanty. Joint groupand topic discovery from relations and text. InStatistical Network Analysis: Models, Issues and NewDirections, Lecture Notes in Computer Science 4503,pages 28–44, 2007.

[10] Q. Mei and C. Zhai. Discovering evolutionary themepatterns from text - an exploration of temporal textmining. In Proceedings of the 11th ACM SIGKDDInternational Conference on Knowledge Discovery andData Mining, pages 198–207, 2005.

[11] P. Mika. Ontologies are us: A unified model of socialnetworks and semantics. In Journal of Web Semantics,pages 5–15, 2007.

[12] J. Neville, O. Simsek, D. Jensen, J. Komoroske,K. Palmer, and H. Goldberg. Using relationalknowledge discovery to prevent securities fraud. InProceedings of the 11th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining,pages 449–458, 2005.

[13] J. Neville, D. Jensen, L. Friedland, and M. Hay.Learning relational probability trees. In Proceedings ofthe 9th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, pages625–630, 2003.

[14] U. Sharan and J. Neville. Temporal-relationalclassifiers for prediction in evolving domains. InProceedings of the 8th IEEE International Conferenceon Data Mining, 2008.

[15] B. Taskar, P. Abbeel, and D. Koller. Discriminativeprobabilistic models for relational data. In EighteenthConference on Uncertainty in Artificial Intelligence,2002.