sprite: generalizing topic models with structured priors

SPRITE: Generalizing Topic Models with Structured Priors

Michael J. Paul and Mark DredzeDepartment of Computer Science

Human Language Technology Center of ExcellenceJohns Hopkins University, Baltimore, MD 21218

[email protected], [email protected]

Abstract

We introduce SPRITE, a family of topicmodels that incorporates structure intomodel priors as a function of underlyingcomponents. The structured priors canbe constrained to model topic hierarchies,factorizations, correlations, and supervi-sion, allowing SPRITE to be tailored toparticular settings. We demonstrate thisflexibility by constructing a SPRITE-basedmodel to jointly infer topic hierarchies andauthor perspective, which we apply to cor-pora of political debates and online re-views. We show that the model learns in-tuitive topics, outperforming several othertopic models at predictive tasks.

1 Introduction

Topic models can be a powerful aid for analyzinglarge collections of text by uncovering latent in-terpretable structures without manual supervision.Yet people often have expectations about topics ina given corpus and how they should be structuredfor a particular task. It is crucial for the user expe-rience that topics meet these expectations (Mimnoet al., 2011; Talley et al., 2011) yet black box topicmodels provide no control over the desired output.

This paper presents SPRITE, a family of topicmodels that provide a flexible framework for en-coding preferences as priors for how topics shouldbe structured. SPRITE can incorporate many typesof structure that have been considered in priorwork, including hierarchies (Blei et al., 2003a;Mimno et al., 2007), factorizations (Paul andDredze, 2012; Eisenstein et al., 2011), sparsity(Wang and Blei, 2009; Balasubramanyan and Co-hen, 2013), correlations between topics (Blei andLafferty, 2007; Li and McCallum, 2006), pref-erences over word choices (Andrzejewski et al.,2009; Paul and Dredze, 2013), and associations

between topics and document attributes (Ramageet al., 2009; Mimno and McCallum, 2008).

SPRITE builds on a standard topic model,adding structure to the priors over the model pa-rameters. The priors are given by log-linear func-tions of underlying components (§2), which pro-vide additional latent structure that we will showcan enrich the model in many ways. By apply-ing particular constraints and priors to the compo-nent hyperparameters, a variety of structures canbe induced such as hierarchies and factorizations(§3), and we will show that this framework cap-tures many existing topic models (§4).

After describing the general form of the model,we show how SPRITE can be tailored to partic-ular settings by describing a specific model forthe applied task of jointly inferring topic hierar-chies and perspective (§6). We experiment withthis topic+perspective model on sets of politicaldebates and online reviews (§7), and demonstratethat SPRITE learns desired structures while outper-forming many baselines at predictive tasks.

2 Topic Modeling with Structured Priors

Our model family generalizes latent Dirichlet al-location (LDA) (Blei et al., 2003b). Under LDA,there are K topics, where a topic is a categor-ical distribution over V words parameterized byφk. Each document has a categorical distributionover topics, parameterized by θm for the mth doc-ument. Each observed word in a document is gen-erated by drawing a topic z from θm, then drawingthe word from φz . θ and φ have priors given byDirichlet distributions.

Our generalization adds structure to the gener-ation of the Dirichlet parameters. The priors forthese parameters are modeled as log-linear com-binations of underlying components. Componentsare real-valued vectors of length equal to the vo-cabulary size V (for priors over word distribu-tions) or length equal to the number of topics K

(for priors over topic distributions).For example, we might assume that topics about

sports like baseball and football share a commonprior – given by a component – with general wordsabout sports. A fine-grained topic about steroiduse in sports might be created by combining com-ponents about broader topics like sports, medicine,and crime. By modeling the priors as combina-tions of components that are shared across all top-ics, we can learn interesting connections betweentopics, where components provide an additionallatent layer for corpus understanding.

As we’ll show in the next section, by imposingcertain requirements on which components feedinto which topics (or documents), we can inducea variety of model structures. For example, if wewant to model a topic hierarchy, we require thateach topic depend on exactly one parent compo-nent. If we want to jointly model topic and ide-ology in a corpus of political documents (§6), wemake topic priors a combination of one componentfrom each of two groups: a topical component andan ideological component, resulting in ideology-specific topics like “conservative economics”.

Components construct priors as follows. For thetopic-specific word distributions φ, there are C(φ)

topic components. The kth topic’s prior over φkis a weighted combination (with coefficient vectorβk) of theC(φ) components (where component c isdenoted ωc). For the document-specific topic dis-tributions θ, there are C(θ) document components.The mth document’s prior over θm is a weightedcombination (coefficients αm) of the C(θ) compo-nents (where component c is denoted δc).

Once conditioned on these priors, the modelis identical to LDA. The generative story is de-scribed in Figure 1. We call this family of modelsSPRITE: Structured PRIor Topic modEls.

To illustrate the role that components can play,consider an example in which we are modeling re-search topics in a corpus of NLP abstracts (as wedo in §7.3). Consider three speech-related topics:signal processing, automatic speech recognition,and dialog systems. Conceptualized as a hierar-chy, these topics might belong to a higher levelcategory of spoken language processing. SPRITE

allows the relationship between these three topicsto be defined in two ways. One, we can model thatthese topics will all have words in common. Thisis handled by the topic components – these threetopics could all draw from a common “spoken lan-

• Generate hyperparameters: α, β, δ, ω (§3)

• For each document m, generate parameters:

1. θ̃mk = exp(∑C(θ)

c=1 αmc δck), 1≤k≤K2. θm ∼ Dirichlet(θ̃m)

• For each topic k, generate parameters:

1. φ̃kv = exp(∑C(φ)

c=1 βkc ωcv), 1≤v≤V2. φk ∼ Dirichlet(φ̃k)

• For each token (m,n), generate data:

1. Topic (unobserved): zm,n ∼ θm2. Word (observed): wm,n ∼ φzm,n

Figure 1: The generative story of SPRITE. The differencefrom latent Dirichlet allocation (Blei et al., 2003b) is the gen-eration of the Dirichlet parameters.

guage” topic component, with high-weight wordssuch as speech and spoken, which informs theprior of all three topics. Second, we can model thatthese topics are likely to occur together in docu-ments. For example, articles about dialog systemsare likely to discuss automatic speech recognitionas a subroutine. This is handled by the documentcomponents – there could be a “spoken language”document component that gives high weight to allthree topics, so that if a document draw its priorfrom this component, then it is more likely to giveprobability to these topics together.

The next section will describe how particularpriors over the coefficients can induce variousstructures such as hierarchies and factorizations,and components and coefficients can also be pro-vided as input to incorporate supervision and priorknowledge. The general prior structure used inSPRITE can be used to represent a wide array ofexisting topic models, outlined in Section 4.

3 Topic Structures

By changing the particular configuration of the hy-perparameters – the component coefficients (α andβ) and the component weights (δ and ω) – we ob-tain a diverse range of model structures and behav-iors. We now describe possible structures and thecorresponding priors.

3.1 Component Structures

This subsection discusses various graph structuresthat can describe the relation between topic com-ponents and topics, and between document com-ponents and documents, illustrated in Figure 2.

(a) Dense DAG (b) Sparse DAG (c) Tree (d) Factored Forest

Figure 2: Example graph structures describing possible relations between components (middle row) and topics or documents(bottom row). Edges correspond to non-zero values for α or β (the component coefficients defining priors over the documentand topic distributions). The root node is a shared prior over the component weights (with other possibilities discussed in §3.3).

3.1.1 Directed Acyclic GraphThe general SPRITE model can be thought of as adense directed acyclic graph (DAG), where everydocument or topic is connected to every compo-nent with some weight α or β. When many ofthe α or β coefficients are zero, the DAG becomessparse. A sparse DAG has an intuitive interpre-tation: each document or topic depends on somesubset of components.

The default prior over coefficients that we usein this study is a 0-mean Gaussian distribution,which encourages the weights to be small. Wenote that to induce a sparse graph, one could usea 0-mean Laplace distribution as the prior over αand β, which prefers parameters such that somecomponents are zero.

3.1.2 TreeWhen each document or topic has exactly one par-ent (one nonzero coefficient) we obtain a two-leveltree structure. This structure naturally arises intopic hierarchies, for example, where fine-grainedtopics are children of coarse-grained topics.

To create an (unweighted) tree, we requireαmc ∈ {0, 1} and

∑c αmc = 1 for each docu-

ment m. Similarly, βkc ∈ {0, 1} and∑

c βkc = 1for each topic k. In this setting, αm and βk areindicator vectors which select a single component.

In this study, rather than strictly requiring αmand βk to be binary-valued indicator vectors, wecreate a relaxation that allows for easier parameterestimation. We let αm and βk to real-valued vari-ables in a simplex, but place a prior over their val-ues to encourage sparse values, favoring vectorswith a single component near 1 and others near 0.This is achieved using a Dirichlet(ρ < 1) distribu-tion as the prior over α and β, which has higherdensity near the boundaries of the simplex.1

1This generalizes the technique used in Paul and Dredze(2012), who approximated binary variables with real-valuedvariables in (0, 1), by using a “U-shaped” Beta(ρ < 1) distri-

For a weighted tree, α and β could be a productof two variables: an “integer-like” indicator vec-tor with sparse Dirichlet prior as suggested above,combined with a real-valued weight (e.g., with aGaussian prior). We take this approach in ourmodel of topic and perspective (§6).

3.1.3 Factored ForestBy using structured sparsity over the DAG, we canobtain a structure where components are groupedinto G factors, and each document or topic hasone parent from each group. Figure 2(d) illus-trates this: the left three components belong to onegroup, the right two belong to another, and eachbottom node has exactly one parent from each.This is a DAG that we call a “factored forest” be-cause the subgraphs associated with each group inisolation are trees. This structure arises in “multi-dimensional” models like SAGE (Eisenstein et al.,2011) and Factorial LDA (Paul and Dredze, 2012),which allow tokens to be associated with multiplevariables (e.g. a topic along with a variable denot-ing positive or negative sentiment). This allowsword distributions to depend on both factors.

The “exactly one parent” indicator constraint isthe same as in the tree structure but enforces atree only within each group. This can therefore be(softly) modeled using a sparse Dirichlet prior asdescribed in the previous subsection. In this case,the subsets of components belonging to each fac-tor have separate sparse Dirichlet priors. Usingthe example from Figure 2(d), the first three com-ponent indicators would come from one Dirichlet,while the latter two component indicators wouldcome from a second.

3.2 Tying Topic and Document Components

A desirable property for many situations is for thetopic and document components to correspond to

bution as the prior to encourage sparsity. The Dirichlet distri-bution is the multivariate extension of the Beta distribution.

each other. For example, if we think of the com-ponents as coarse-grained topics in a hierarchy,then the coefficients β enforce that topic word dis-tributions share a prior defined by their parent ωcomponent, while the coefficients α represent adocument’s proportions of coarse-grained topics,which effects the document’s prior over child top-ics (through the δ vectors). Consider the examplewith spoken language topics in §2: these three top-ics (signal processing, speech recognition, and di-alog systems) are a priori likely both to share thesame words and to occur together in documents.By tying these together, we ensure that the pat-terns are consistent across the two types of com-ponents, and the patterns from both types can re-inforce each other during inference.

In this case, the number of topic componentsis the same as the number of document compo-nents (C(φ) = C(θ)), and the coefficients (βcz)of the topic components should correlate with theweights of the document components (δzc). Theapproach we take (§6) is to define δ and β as aproduct of two variables (suggested in §3.1.2): abinary mask variable (with sparse Dirichlet prior),which we let be identical for both δ and β, and areal-valued positive weight.

3.3 Deep Components

As for priors over the component weights δ andω, we assume they are generated from a 0-meanGaussian. While not experimented with in thisstudy, it is also possible to allow the componentsthemselves to have rich priors which are functionsof higher level components. For example, ratherthan assuming a mean of zero, the mean could be aweighted combination of higher level weight vec-tors. This approach was used by Paul and Dredze(2013) in Factorial LDA, in which each ω compo-nent had its own Gaussian prior provided as inputto guide the parameters.

4 Special Cases and Extensions

We now describe several existing Dirichlet priortopic models and show how they are special casesof SPRITE. Table 1 summarizes these models andtheir relation to SPRITE. In almost every case, wealso describe how the SPRITE representation ofthe model offers improvements over the originalmodel or can lead to novel extensions.

Model Sec. Document priors Topic priorsLDA 4.1 Single component Single component

SCTM 4.2 Single component Sparse binary βSAGE 4.3 Single component Sparse ωFLDA 4.3 Binary δ is transpose of β Factored binary βPAM 4.4 α are supertopic weights Single componentDMR 4.5 α are feature values Single component

Table 1: Topic models with Dirichlet priors that are gen-eralized by SPRITE. The description of each model can befound in the noted section number. PAM is not equivalent,but captures very similar behavior. The described componentformulations of SCTM and SAGE are equivalent, but thesediffer from SPRITE in that the components directly define theparameters, rather than priors over the parameters.

4.1 Latent Dirichlet AllocationIn LDA (Blei et al., 2003b), all θ vectors aredrawn from the same prior, as are all φ vectors.This is a basic instance of our model with onlyone component at the topic and document levels,C(θ) = C(φ) = 1, with coefficients α = β = 1.

4.2 Shared Components Topic ModelsShared components topic models (SCTM) (Gorm-ley et al., 2010) define topics as products of “com-ponents”, where components are word distribu-tions. To use the notation of our paper, the kthtopic’s word distribution in SCTM is parameter-ized by φkv ∝

∏c ω

βkccv , where the ω vectors are

word distributions (rather than vectors in RV ), andthe βkc ∈ {0, 1} variables are indicators denotingwhether component c is in topic k.

This is closely related to SPRITE, where top-ics also depend on products of underlying com-ponents. A major difference is that in SCTM,the topic-specific word distributions are exactlydefined as a product of components, whereas inSPRITE, it is only the prior that is a product ofcomponents.2 Another difference is that SCTMhas an unweighted product of components (β is bi-nary), whereas SPRITE allows for weighted prod-ucts. The log-linear parameterization leads to sim-pler optimization procedures than the product pa-rameterization. Finally, the components in SCTMonly apply to the word distributions, and not thetopic distributions in documents.

4.3 Factored Topic ModelsFactored topic models combine multiple aspectsof the text to generate the document (instead ofjust topics). One such topic model is FactorialLDA (FLDA) (Paul and Dredze, 2012). In FLDA,

2The posterior becomes concentrated around the priorwhen the Dirichlet variance is low, in which case SPRITE be-haves like SCTM. SPRITE is therefore more general.

“topics” are actually tuples of potentially multiplevariables, such as aspect and sentiment in onlinereviews (Paul et al., 2013). Each document distri-bution θm is a distribution over pairs (or higher-dimensional tuples if there are more than two fac-tors), and each pair (j, k) has a word distribu-tion φ(j,k). FLDA uses a similar log-linear pa-rameterization of the Dirichlet priors as SPRITE.Using our notation, the Dirichlet(φ̃(j,k)) prior forφ(j,k) is defined as φ̃(j,k),v=exp(ωjv+ωkv), whereωj is a weight vector over the vocabulary for thejth component of the first factor, and ωk encodesthe weights for the kth component of the secondfactor. (Some bias terms are omitted for sim-plicity.) The prior over θm has a similar form:θ̃m,(j,k)=exp(αmj + αmk), where αmj is docu-ment m’s preference for component j of the firstfactor (and likewise for k of the second).

This corresponds to an instantiation of SPRITE

using an unweighted factored forest (§3.1.3),where βzc = δcz (§3.2, recall that δ are documentcomponents while β are the topic coefficients).Each subtopic z (which is a pair of variables inthe two-factor model) has one parent componentfrom each factor, indicated by βz which is binary-valued. At the document level in the two-factorexample, δj is an indicator vector with values of 1for all pairs with j as the first component, and thusthe coefficient αmj controls the prior for all suchpairs of the form (j, ·), and likewise δk indicatespairs with k as the second component, controllingthe prior over (·, k).

The SPRITE representation offers a benefit overthe original FLDA model. FLDA assumes that theentire Cartesian product of the different factors isrepresented in the model (e.g. φ parameters for ev-ery possible tuple), which leads to issues with effi-ciency and overparameterization with higher num-bers of factors. With SPRITE, we can simply fixthe number of “topics” to a number smaller thanthe size of the Cartesian product, and the modelwill learn which subset of tuples are included,through the values of β and δ.

Finally, another existing model family that al-lows for topic factorization is the sparse additivegenerative model (SAGE) (Eisenstein et al., 2011).SAGE uses a log-linear parameterization to defineword distributions. SAGE is a general family ofmodels that need not be factored, but is presentedas an efficient solution for including multiple fac-tors, such as topic and geography or topic and au-

thor ideology. Like SCTM, φ is exactly defined asa product of ω weights, rather than our approachof using the product to define a prior over φ.

4.4 Topic Hierarchies and Correlations

While the two previous subsections primarily fo-cused on word distributions (with FLDA being anexception that focused on both), SPRITE’s priorsover topic distributions also have useful charac-teristics. The component-specific δ vectors canbe interpreted as common topic distribution pat-terns, where each component is likely to give highweight to groups of topics that tend to occur to-gether. Each document’s α weights encode whichof the topic groups are present in that document.

Similar properties are captured by the Pachinkoallocation model (PAM) (Li and McCallum,2006). Under PAM, each document has a distri-bution over supertopics. Each supertopic is as-sociated with a Dirichlet prior over subtopic dis-tributions, where subtopics are the low level top-ics that are associated with word parameters φ.Documents also have supertopic-specific distribu-tions over subtopics (drawn from each supertopic-specific Dirichlet prior). Each topic in a documentis drawn by first drawing a supertopic from thedocument’s distribution, then drawing a subtopicfrom that supertopic’s document distribution.

While not equivalent, this is quite similar toSPRITE where document components correspondto supertopics. Each document’s α weights canbe interpreted to be similar to a distribution oversupertopics, and each δ vector is that supertopic’scontribution to the prior over subtopics. The priorover the document’s topic distribution is thus af-fected by the document’s supertopic weights α.

The SPRITE formulation naturally allows forpowerful extensions to PAM. One possibility isto include topic components for the word distri-butions, in addition to document components, andto tie together δcz and βzc (§3.2). This models theintuitive characteristic that subtopics belonging tosimilar supertopics (encoded by δ) should comefrom similar priors over their word distributions(since they will have similar β values). That is,children of a supertopic are topically related – theyare likely to share words. This is a richer alterna-tive to the hierarchical variant of PAM proposedby Mimno et al. (2007), which modeled separateword distributions for supertopics and subtopics,but the subtopics were not dependent on the super-

topic word distributions. Another extension is toform a strict tree structure, making each subtopicbelong to exactly one supertopic: a true hierarchy.

4.5 Conditioning on Document AttributesSPRITE also naturally provides the ability to con-dition document topic distributions on features ofthe document, such as a user rating in a review.To do this, let the number of document compo-nents be the number of features, and the value ofαmc is the mth document’s value of the cth fea-ture. The δ vectors then influence the document’stopic prior based on the feature values. For exam-ple, increasing αmc will increase the prior for topicz if δcz is positive and decrease the prior if δcz isnegative. This is similar to the structure used forPAM (§4.4), but here the α weights are fixed andprovided as input, rather than learned and inter-preted as supertopic weights. This is identical tothe Dirichlet-multinomial regression (DMR) topicmodel (Mimno and McCallum, 2008). The DMRtopic model define’s each document’s Dirichletprior over topics as a log-linear function of thedocument’s feature values and regression coeffi-cients for each topic. The cth feature’s regressioncoefficients correspond to the δc vector in SPRITE.

5 Inference and Parameter Estimation

We now discuss how to infer the posterior of thelatent variables z and parameters θ and φ, and findmaximum a posteriori (MAP) estimates of the hy-perparameters α, β, δ, and ω, given their hyperpri-ors. We take a Monte Carlo EM approach, using acollapsed Gibbs sampler to sample from the pos-terior of the topic assignments z conditioned onthe hyperparameters, then optimizing the hyperpa-rameters using gradient-based optimization condi-tioned on the samples.

Given the hyperparameters, the sampling equa-tions are identical to the standard LDA sampler(Griffiths and Steyvers, 2004). The partial deriva-tive of the collapsed log likelihood L of the corpuswith respect to each hyperparameter βkc is:

∂L∂βkc

=∂P (β)

∂βkc+∑v

ωcvφ̃kv × (1)(Ψ(nkv+φ̃kv)−Ψ(φ̃kv) +Ψ(

∑k′ φ̃k′v)−Ψ(

∑k′n

k′v +φ̃k′v)

)

where φ̃kv=exp(∑

c′ βkc′ωc′v), nkv is the numberof times word v is assigned to topic k (in thesamples from the E-step), and Ψ is the digamma

function, the derivative of the log of the gammafunction. The digamma terms arise from theDirichlet-multinomial distribution, when integrat-ing out the parameters φ. P (β) is the hyperprior.For a 0-mean Gaussian hyperprior with varianceσ2, ∂P (β)

∂βkc= −βkc

σ2 . Under a Dirchlet(ρ) hyper-prior, when we want β to represent an indicatorvector (§3.1.2), ∂P (β)

∂βkc= ρ−1

βkc.

The partial derivatives for the other hyperpa-rameters are similar. Rather than involving a sumover the vocabulary, ∂L

∂δcksums over documents,

while ∂L∂ωcv

and ∂L∂αmc

sum over topics.Our inference algorithm alternates between one

Gibbs iteration and one iteration of gradient as-cent, so that the parameters change gradually. Forunconstrained parameters, we use the update rule:xt+1=xt + ηt∇L(xt), for some variable x anda step size ηt at iteration t. For parameters con-strained to the simplex (such as when β is a softindicator vector), we use exponentiated gradientascent (Kivinen and Warmuth, 1997) with the up-date rule: xt+1

i ∝ xti exp(ηt∇iL(xt)).

5.1 Tightening the Constraints

For variables that we prefer to be binary buthave softened to continuous variables using sparseBeta or Dirichlet priors, we can straightforwardlystrengthen the preference to be binary by modify-ing the objective function to favor the prior moreheavily. Specifically, under a Dirichlet(ρ<1) priorwe will introduce a scaling parameter τt ≥ 1to the prior log likelihood: τt logP (β) with par-tial derivative τt ρ−1βkc

, which adds extra weight tothe sparse Dirichlet prior in the objective. Thealgorithm used in our experiments begins withτ1 = 1 and optionally increases τ over time. Thisis a deterministic annealing approach, where τcorresponds to an inverse temperature (Ueda andNakano, 1998; Smith and Eisner, 2006).

As τ approaches infinity, the prior-annealedMAP objective maxβ P (φ|β)P (β)τ approachesmaxβ P (φ|β) maxβ P (β). Annealing only theprior P (β) results in maximization of this termonly, while the outer max chooses a good β underP (φ|β) as a tie-breaker among all β values thatmaximize the inner max (binary-valued β).3

We show experimentally (§7.2.2) that annealingthe prior yields values that satisfy the constraints.

3Other modifications could be made to the objective func-tion to induce sparsity, such as entropy regularization (Bala-subramanyan and Cohen, 2013).

6 A Factored Hierarchical Model ofTopic and Perspective

We will now describe a SPRITE model that en-compasses nearly all of the structures and exten-sions described in §3–4, followed by experimen-tal results using this model to jointly capture topicand “perspective” in a corpus of political debates(where perspective corresponds to ideology) anda corpus of online doctor reviews (where perspec-tive corresponds to the review sentiment).

First, we will create a topic hierarchy (§4.4).The hierarchy will model both topics and docu-ments, where αm is documentm’s supertopic pro-portions, δc is the cth supertopic’s subtopic prior,ωc is the cth supertopic’s word prior, and βk isthe weight vector that selects the kth topic’s par-ent supertopic, which incorporates (soft) indicatorvectors to encode a tree structure (§3.1.2).

We want a weighted tree; while each βk hasonly one nonzero element, the nonzero elementcan be a value other than 1. We do this by replac-ing the single coefficient βkc with a product of twovariables: bkcβ̂kc. Here, β̂k is a real-valued weightvector, while bkc is a binary indicator vector whichzeroes out all but one element of βk. We do thesame with the δ vectors, replacing δck with bkcδ̂ck.The b variables are shared across both topic anddocument components, which is how we tie thesetogether (§3.2). We relax the binary requirementand instead allow a positive real-valued vectorwhose elements sum to 1, with a Dirichlet(ρ<1)prior to encourage sparsity (§3.1.2).

To be properly interpreted as a hierarchy, weconstrain the coefficients α and β (and by ex-tension, δ) to be positive. To optimize these pa-rameters in a mathematically convenient way, wewrite βkc as exp(log βkc), and instead optimizelog βkc ∈ R rather than βkc ∈ R+.

Second, we factorize (§4.3) our hierarchy suchthat each topic depends not only on its supertopic,but also on a value indicating perspective. For ex-ample, a conservative topic about energy will ap-pear differently from a liberal topic about energy.The prior for a topic will be a log-linear combina-tion of both a supertopic (e.g. energy) and a per-spective (e.g. liberal) weight vector. The variablesassociated with the perspective component are de-noted with superscript (P ) rather than subscript c.

To learn meaningful perspective parameters, weinclude supervision in the form of document at-tributes (§4.5). Each document includes a pos-

• bk ∼ Dirichlet(ρ < 1) (soft indicator)

• α(P ) is given as input (perspective value)

• δ(P )k = β

(P )k

• φ̃kv = exp(ω(B)v +β

(P )k ω

(P )v +

∑c bkcβ̂kcωcv)

• θ̃mk = exp(δ(B)k +α

(P )m δ

(P )k +

∑c bkcαmcδ̂ck)

Figure 3: Summary of the hyperparameters in our SPRITE-based topic and perspective model (§6).

itive or negative score denoting the perspective,which is the variable α(P )

m for document m. Sinceα(P ) are the coefficients for δ(P ), positive valuesof δ(P )

k indicate that topic k is more likely if the au-thor is conservative (which has a positive α scorein our data), and less likely if the author is liberal(which has a negative score). There is only a singleperspective component, but it represents two endsof a spectrum with positive and negative weights;β(P ) and δ(P ) are not constrained to be positive,unlike the supertopics. We also set β(P )

k = δ(P )k .

This means that topics with positive δ(P )k will also

have a positive β coefficient that is multiplied withthe perspective word vector ω(P ).

Finally, we include “bias” component vectorsdenoted ω(B) and δ(B), which act as overallweights over the vocabulary and topics, so that thecomponent-specific ω and δ weights can be inter-preted as deviations from the global bias weights.

Figure 3 summarizes the model. This includesmost of the features described above (trees, fac-tored structures, tying topic and document compo-nents, and document attributes), so we can ablatemodel features to measure their effect.

7 Experiments

7.1 Datasets and Experimental Setup

We applied our models to two corpora:

• Debates: A set of floor debates from the 109th–112th U.S. Congress, collected by Nguyen etal. (2013), who also applied a hierarchical topicmodel to this data. Each document is a tran-script of one speaker’s turn in a debate, and eachdocument includes the first dimension of theDW-NOMINATE score (Lewis and Poole, 2004),a real-valued score indicating how conservative(positive) or liberal (negative) the speaker is.This value is α(P ). We took a sample of 5,000documents from the House debates (850,374 to-kens; 7,426 types), balanced across party affilia-

tion. We sampled from the most partisan speak-ers, removing scores below the median value.

• Reviews: Doctor reviews from RateMDs.com,previously analyzed using FLDA (Paul et al.,2013; Wallace et al., 2014). The reviews con-tain ratings on a 1–5 scale for multiple aspects.We centered the ratings around the middle value3, then took reviews that had the same sign forall aspects, and averaged the scores to producea value for α(P ). Our corpus contains 20,000documents (476,991 tokens; 10,158 types), bal-anced across positive/negative scores.

Unless otherwise specified, K=50 topics andC=10 components (excluding the perspectivecomponent) for Debates, and K=20 and C=5 forReviews. These values were chosen as a qualita-tive preference, not optimized for predictive per-formance, but we experiment with different valuesin §7.2.2. We set the step size ηt according to Ada-Grad (Duchi et al., 2011), where the step size isthe inverse of the sum of squared historical gradi-ents.4 We place a sparse Dirichlet(ρ=0.01) prioron the b variables, and apply weak regulariza-tion to all other hyperparameters via a N (0, 102)prior. These hyperparameters were chosen afteronly minimal tuning, and were selected becausethey showed stable and reasonable output qualita-tively during preliminary development.

We ran our inference algorithm for 5000 itera-tions, estimating the parameters θ and φ by aver-aging the final 100 iterations. Our results are aver-aged across 10 randomly initialized samplers.5

7.2 Evaluating the Topic Perspective Model

7.2.1 Analysis of OutputFigure 4 shows examples of topics learned fromthe Reviews corpus. The figure includes the high-est probability words in various topics as well asthe highest weight words in the supertopic com-ponents and perspective component, which feedinto the priors over the topic parameters. We seethat one supertopic includes many words related tosurgery, such as procedure and performed, and hasmultiple children, including a topic about dentalwork. Another supertopic includes words describ-ing family members such as kids and husband.

4AdaGrad decayed too quickly for the b variables. Forthese, we used a variant suggested by Zeiler (2012) whichuses an average of historical gradients rather than a sum.

5Our code and the data will be available at:http://cs.jhu.edu/˜mpaul.

One topic has both supertopics as parents, whichappears to describe surgeries that saved a familymember’s life, with top words including {saved,life, husband, cancer}. The figure also illustrateswhich topics are associated more with positive ornegative reviews, as indicated by the value of δ(P ).

Interpretable parameters were also learned fromthe Debates corpus. Consider two topics aboutenergy that have polar values of δ(P ). Theconservative-leaning topic is about oil and gas,with top words including {oil, gas, companies,prices, drilling}. The liberal-leaning topic isabout renewable energy, with top words includ-ing {energy, new, technology, future, renewable}.Both of these topics share a common parent of anindustry-related supertopic whose top words are{industry, companies, market, price}. A nonparti-san topic under this same supertopic has top words{credit, financial, loan, mortgage, loans}.

7.2.2 Quantitative Evaluation

We evaluated the model on two predictive tasks aswell as topic quality. The first metric is perplex-ity of held-out text. The held-out set is based ontokens rather than documents: we trained on evennumbered tokens and tested on odd tokens. This isa type of “document completion” evaluation (Wal-lach et al., 2009b) which measures how well themodel can predict held-out tokens of a documentafter observing only some.

We also evaluated how well the model can pre-dict the attribute value (DW-NOMINATE score oruser rating) of the document. We trained a linearregression model using the document topic distri-butions θ as features. We held out half of the docu-ments for testing and measured the mean absoluteerror. When estimating document-specific SPRITE

parameters for held-out documents, we fix the fea-ture value α(P )

m = 0 for that document.These predictive experiments do not directly

measure performance at many of the particulartasks that topic models are well suited for, likedata exploration, summarization, and visualiza-tion. We therefore also include a metric that moredirectly measures the quality and interpretabilityof topics. We use the topic coherence metric intro-duced by Mimno et al. (2011), which is based onco-occurrence statistics among each topic’s mostprobable words and has been shown to correlatewith human judgments of topic quality. This met-ric measures the quality of each topic, and we

best!love!years!caring!children!really!

wonderful!hes!great!family!

comfortable!listens!thank!

amazing!…!…!went!pay!later!staff!asked!money!company!refused!pain!office!didn’t!said!told!doctor!

surgery!pain!went!dr!

surgeon!told!

procedure!months!

performed!removed!

left!fix!said!later!years!

dr!life!thank!saved!god!

husband!heart!cancer!years!helped!doctors!hospital!father!man!able!

told!hospital!

dr!blood!went!later!days!mother!said!er!

cancer!weight!home!father!months!

dentist!teeth!dental!work!tooth!root!mouth!pain!

dentists!went!filling!canal!dr!

crown!cleaning!

dr!best!

children!years!kids!cares!hes!care!old!

daughter!child!

husband!family!

pediatrician!trust!

baby!son!

pregnancy!dr!child!

pregnant!ob!

daughter!first!

delivered!gyn!birth!

delivery!section!hospital!

dr!best!years!doctor!love!cares!ive!

children!patients!hes!family!kids!seen!doctors!son!

pain!surgery!dr!went!knee!foot!neck!mri!injury!

shoulder!bone!months!told!

surgeon!therapy!

Perspective!

“Surgery”! “Family”!

Figure 4: Examples of topics (gray boxes) and components (colored boxes) learned on the Reviews corpus with 20 topics and5 components. Words with the highest and lowest values of ω(P ), the perspective component, are shown on the left, reflectingpositive and negative sentiment words. The words with largest ω values in two supertopic components are also shown, withmanually given labels. Arrows from components to topics indicate that the topic’s word distribution draws from that componentin its prior (with non-zero β value). There are also implicit arrows from the perspective component to all topics (omitted forclarity). The vertical positions of topics reflect the topic’s perspective value δ(P ). Topics centered above the middle line aremore likely to occur in reviews with positive scores, while topics below the middle line are more likely in negative reviews.Note that this is a “soft” hierarchy because the tree structure is not strictly enforced, so some topics have multiple parentcomponents. Table 3 shows how strict trees can be learned by tuning the annealing parameter.

measure the average coherence across all topics:

1

K

K∑k=1

M∑m=2

m−1∑l=1

logDF (vkm, vkl) + 1

DF (vkl)(2)

where DF (v, w) is the document frequency ofwords v andw (the number of documents in whichthey both occur), DF (v) is the document fre-quency of word v, and vki is the ith most probableword in topic k. We use the top M = 20 words.This metric is limited to measuring only the qual-ity of word clusters, ignoring the potentially im-proved interpretability of organizing the data intocertain structures. However, it is still useful as analternative measure of performance and utility, in-dependent of the models’ predictive abilities.

Using these three metrics, we compared to sev-eral variants (denoted in bold) of the full modelto understand how the different parts of the modelaffect performance:

• Variants that contain the hierarchy componentsbut not the perspective component (Hierarchyonly), and vice versa (Perspective only).• The “hierarchy only” model using only docu-

ment components δ and no topic components.This is a PAM-style model because it exhibits

similar behavior to PAM (§4.4). We also com-pared to the original PAM model.• The “hierarchy only” model using only topic

components ω and no document components.This is a SCTM-style model because it exhibitssimilar behavior to SCTM (§4.2).• The full model where α(P ) is learned rather than

given as input. This is a FLDA-style model thathas similar behavior to FLDA (§4.3). We alsocompared to the original FLDA model.• The “perspective only” model but without theω(P ) topic component, so the attribute value af-fects only the topic distributions and not theword distributions. This is identical to the DMRmodel of Mimno and McCallum (2008) (§4.5).• A model with no components except for the

bias vectors ω(B) and δ(B). This is equiva-lent to LDA with optimized hyperparameters(learned). We also experimented with usingfixed symmetric hyperparameters, using val-ues suggested by Griffiths and Steyvers (2004):50/K and 0.01 for topic and word distributions.

To put the results in context, we also compare totwo types of baselines: (1) “bag of words” base-lines, where we measure the perplexity of add-onesmoothed unigram language models, we measure

Debates ReviewsModel Perplexity Prediction error Coherence Perplexity Prediction error Coherence

Full model †1555.5 ± 2.3 †0.615 ± 0.001 -342.8 ± 0.9 †1421.3 ± 8.4 †0.787 ± 0.006 -512.7 ± 1.6Hierarchy only †1561.8 ± 1.4 0.620 ± 0.002 -342.6 ± 1.1 †1457.2 ± 6.9 †0.804 ± 0.007 -509.1 ± 1.9

Perspective only †1567.3 ± 2.3 †0.613 ± 0.002 -342.1 ± 1.2 †1413.7 ± 2.2 †0.800 ± 0.002 -512.0 ± 1.7SCTM-style 1572.5 ± 1.6 0.620 ± 0.002 †-335.8 ± 1.1 1504.0 ± 1.9 †0.837 ± 0.002 †-490.8 ± 0.9PAM-style †1567.4 ± 1.9 0.620 ± 0.002 -347.6 ± 1.4 †1440.4 ± 2.7 †0.835 ± 0.004 -542.9 ± 6.7FLDA-style †1559.5 ± 2.0 0.617 ± 0.002 -340.8 ± 1.4 †1451.1 ± 5.4 †0.809 ± 0.006 -505.3 ± 2.3

DMR 1578.0 ± 1.1 0.618 ± 0.002 -343.1 ± 1.0 †1416.4 ± 3.0 †0.799 ± 0.003 -511.6 ± 2.0PAM 1578.9 ± 0.3 0.622 ± 0.003 †-336.0 ± 1.1 1514.8 ± 0.9 †0.835 ± 0.003 †-493.3 ± 1.2FLDA 1574.1 ± 2.2 0.618 ± 0.002 -344.4 ± 1.3 1541.9 ± 2.3 0.856 ± 0.003 -502.2 ± 3.1

LDA (learned) 1579.6 ± 1.5 0.620 ± 0.001 -342.6 ± 0.6 1507.9 ± 2.4 0.846 ± 0.002 -501.4 ± 1.2LDA (fixed) 1659.3 ± 0.9 0.622 ± 0.002 -349.5 ± 0.8 1517.2 ± 0.4 0.920 ± 0.003 -585.2 ± 0.9Bag of words 2521.6 ± 0.0 0.617 ± 0.000 †-196.2 ± 0.0 1633.5 ± 0.0 0.813 ± 0.000 †-408.1 ± 0.0Naive baseline 7426.0 ± 0.0 0.677 ± 0.000 -852.9 ± 7.4 10158.0 ± 0.0 1.595 ± 0.000 -795.2 ± 13.0

Table 2: Perplexity of held-out tokens and mean absolute error for attribute prediction using various models (± std. error).† indicates significant improvement (p < 0.05) over optimized LDA under a two-sided t-test.

the prediction error using bag of words features,and we measure coherence of the unigram distri-bution; (2) naive baselines, where we measure theperplexity of the uniform distribution over eachdataset’s vocabulary, the prediction error whensimply predicting each attribute as the mean valuein the training set, and the coherence of 20 ran-domly selected words (repeated for 10 trials).

Table 2 shows that the full SPRITE model sub-stantially outperforms the LDA baseline at bothpredictive tasks. Generally, model variants withmore structure perform better predictively.

The difference between SCTM-style andPAM-style is that the former uses only topic com-ponents (for word distributions) and the latter usesonly document components (for the topic distri-butions). Results show that the structured priorsare more important for topic than word distribu-tions, since PAM-style has lower perplexity onboth datasets. However, models with both topicand document components generally outperformeither alone, including comparing the Perspec-tive only and DMR models. The former includesboth topic and document perspective components,while DMR has only a document level component.

PAM does not significantly outperform opti-mized LDA in most measures, likely because it up-dates the hyperparameters using a moment-basedapproximation, which is less accurate than ourgradient-based optimization. FLDA perplexityis 2.3% higher than optimized LDA on Reviews,comparable to the 4% reported by Paul and Dredze(2012) on a different corpus. The FLDA-styleSPRITE variant, which is more flexible, signifi-cantly outperforms FLDA in most measures.

The results are quite different under the co-herence metric. It seems that topic components

(which influence the word distributions) improvecoherence over LDA, while document compo-nents worsen coherence. SCTM-style (which usesonly topic components) does the best in bothdatasets, while PAM-style (which uses only doc-uments) does the worst. PAM also significantlyimproves over LDA, despite worse perplexity.

The LDA (learned) baseline substantially out-performs LDA (fixed) in all cases, highlighting theimportance of optimizing hyperparameters, con-sistent with prior research (Wallach et al., 2009a).

Surprisingly, many SPRITE variants also outper-form the bag of words regression baseline, eventhough the latter was tuned to optimize perfor-mance using heavy `2 regularization, which weapplied only weakly (without tuning) to the topicmodel features. We also point out that the “bagof words” version of the coherence metric (the co-herence of the top 20 words) is higher than the av-erage topic coherence, which is an artifact of howthe metric is defined: the most probable words inthe corpus also tend to co-occur together in mostdocuments, so these words are considered to behighly coherent when grouped together.

Parameter Sensitivity We evaluated the fullmodel at the two predictive tasks with varyingnumbers of topics ({12,25,50,100} for Debatesand {5,10,20,40} for Reviews) and components({2,5,10,20}). Figure 5 shows that performance ismore sensitive to the number of topics than com-ponents, with generally less variance among thelatter. More topics improve performance mono-tonically on Debates, while performance declinesat 40 topics on Reviews. The middle range of com-ponents (5–10) tends to perform better than toofew (2) or too many (20) components.

Regardless of quantitative differences, the

2 5 10 201500

1600

1700

1800

1900Pe

rple

xity

DebatesK=12K=25

K=50K=100

2 5 10 201400

1450

1500

1550

1600 ReviewsK=5K=10

K=20K=40

2 5 10 20Number of components

.605

.610

.615

.620

.625

.630

Pre

dict

ion

erro

r

2 5 10 20Number of components

.76

.80

.84

.88

.92

Figure 5: Predictive performance of full model with differ-ent numbers of topics K across different numbers of compo-nents, represented on the x-axis (log scale).

τt Debates Reviews0.000 (Sparse DAG) 58.1% 42.4%1.000 (Soft Tree) 93.2% 74.6%1.001t (Hard Tree) 99.8% 99.4%1.003t (Hard Tree) 100% 100%

Table 3: The percentage of indicator values that are sparse(near 0 or 1) when using different annealing schedules.

choice of parameters may depend on the end ap-plication and the particular structures that the userhas in mind, if interpretability is important. Forexample, if the topic model is used as a visual-ization tool, then 2 components would not likelyresult in an interesting hierarchy to the user, evenif this setting produces low perplexity.

Structured Sparsity We use a relaxation of thebinary b that induces a “soft” tree structure. Ta-ble 3 shows the percentage of b values which arewithin ε = .001 of 0 or 1 under various anneal-ing schedules, increasing the inverse temperatureτ by 0.1% after each iteration (i.e. τt = 1.001t)as well as 0.3% and no annealing at all (τ = 1).At τ = 0, we model a DAG rather than a tree, be-cause the model has no preference that b is sparse.Many of the values are binary in the DAG case, butthe sparse prior substantially increases the numberof binary values, obtaining fully binary structureswith sufficient annealing. We compare the DAGand tree structures more in the next subsection.

7.3 Structure ComparisonThe previous subsection experimented with mod-els that included a variety of structures, but didnot provide a comparison of each structure in iso-lation, since most model variants were part of acomplex joint model. In this section, we exper-

iment with the basic SPRITE model for the threestructures described in §3: a DAG, a tree, and afactored forest. For each structure, we also exper-iment with each type of component: document,topic, and both types (combined).

For this set of experiments, we included a thirddataset that does not contain a perspective value:

• Abstracts: A set of 957 abstracts from the ACLanthology (97,168 tokens; 8,246 types). Theseabstracts have previously been analyzed withFLDA (Paul and Dredze, 2012), so we includeit here to see if the factored structure that weexplore in this section learns similar patterns.

Based on our sparsity experiments in the pre-vious subsection, we set τt = 1.003t to inducehard structures (tree and factored) and τ = 0 to in-duce a DAG. We keep the same parameters as theprevious subsection: K=50 and C=10 for Debatesand K=20 and C=5 for Reviews. For the factoredstructures, we use two factors, with one factor hav-ing more components than the other: 3 and 7 com-ponents for Debates, and 2 and 3 components forReviews (the total number of components acrossthe two factors is therefore the same as for theDAG and tree experiments). The Abstracts exper-iments use the same parameters as with Debates.

Since the Abstracts dataset does not have a per-spective value to predict, we do not include predic-tion error as a metric, instead focusing on held-outperplexity and topic coherence (Eq. 2). Table 4shows the results of these two metrics.

Some trends are clear and consistent. Topiccomponents always hurt perplexity, while thesecomponents typically improve coherence, as wasobserved in the previous subsection. It has pre-viously been observed that perplexity and topicquality are not correlated (Chang et al., 2009).These results show that the choice of componentsdepends on the task at hand. Combining the twocomponents tends to produce results somewherein between, suggesting that using both componenttypes is a reasonable “default” setting.

Document components usually improve per-plexity, likely due to the nature of the documentcompletion setup, in which half of each documentis held out. The document components capturecorrelations between topics, so by inferring thecomponents that generated the first half of the doc-ument, the prior is adjusted to give more probabil-ity to topics that are likely to occur in the unseensecond half. Another interesting trend is that the

Perplexity CoherenceDAG Tree Factored DAG Tree Factored

DebatesDocument 1572.0 ± 0.9 1568.7 ± 2.0 1566.8 ± 2.0 -342.9 ± 1.2 -346.0 ± 0.9 -343.2 ± 1.0

Topic 1575.0 ± 1.5 1573.4 ± 1.8 1559.3 ± 1.5 -342.4 ± 0.6 -339.2 ± 1.7 -333.9 ± 0.9Combined 1566.7 ± 1.7 1559.9 ± 1.9 1552.5 ± 1.9 -342.9 ± 1.3 -342.6 ± 1.2 -340.3 ± 1.0

ReviewsDocument 1456.9 ± 3.8 1446.4 ± 4.0 1450.4 ± 5.5 -512.2 ± 4.6 -527.9 ± 6.5 -535.4 ± 7.4

Topic 1508.5 ± 1.7 1517.9 ± 2.0 1502.0 ± 1.9 -500.1 ± 1.2 -499.0 ± 0.9 -486.1 ± 1.5Combined 1464.1 ± 3.3 1455.1 ± 5.6 1448.5 ± 8.5 -504.9 ± 1.4 -527.8 ± 6.1 -535.5 ± 8.2

AbstractsDocument 3107.7 ± 7.7 3089.5 ± 9.1 3098.7 ± 10.2 -393.2 ± 0.8 -390.8 ± 0.9 -392.8 ± 1.5

Topic 3241.7 ± 2.1 3455.9 ± 10.2 3507.4 ± 9.7 -389.0 ± 0.8 -388.8 ± 0.7 -332.2 ± 1.1Combined 3200.8 ± 3.5 3307.2 ± 7.8 3364.9 ± 19.1 -373.1 ± 0.8 -360.6 ± 0.9 -342.3 ± 0.9

Table 4: Quantitative results for different structures (columns) and different components (rows) for two metrics (± std. error)across three datasets. The best (structure, component) pair for each dataset and metric is in bold.

factored structure tends to perform well under bothmetrics, with the lowest perplexity and highest co-herence in a majority of the nine comparisons (i.e.each row). Perhaps the models are capturing a nat-ural factorization present in the data.

To understand the factored structure qualita-tively, Figure 6 shows examples of componentsfrom each factor along with example topics thatdraw from all pairs of these components, learnedon Abstracts. We find that the factor with thesmaller number of components (left of the figure)seems to decompose into components represent-ing the major themes or disciplines found in ACLabstracts, with one component expressing compu-tational approaches (top) and the other expressinglinguistic theory (bottom). The third component(not shown) has words associated with speech, in-cluding {spoken, speech, recognition}.

The factor shown on the right seems to decom-pose into different research topics: one compo-nent represents semantics (top), another syntax(bottom), with others including morphology (topwords including {segmentation, chinese, morphol-ogy}) and information retrieval (top words includ-ing {documents, retrieval, ir}).

Many of the topics intuitively follow from thecomponents of these two factors. For example,the two topics expressing vector space models anddistributional semantics (top left and right) bothdraw from the “computational” and “semantics”components, while the topics expressing ontolo-gies and question answering (middle left and right)draw from “linguistics” and “semantics”.

The factorization is similar to what had beenpreviously been induced by FLDA. Figure 3 ofPaul and Dredze (2012) shows components thatlook similar to the computational methods andlinguistic theory components here, and the factor

with the largest number of components also de-composes by research topic. These results showthat SPRITE is capable of recovering similar struc-tures as FLDA, a more specialized model. SPRITE

is also much more flexible than FLDA. WhileFLDA strictly models a one-to-one mapping oftopics to each pair of components, SPRITE allowsmultiple topics to belong to the same pair (as inthe semantics examples above), and converselySPRITE does not require that all pairs have an as-sociated topic. This property allows SPRITE toscale to larger numbers of factors than FLDA, be-cause the number of topics is not required to growwith the number of all possible tuples.

8 Related Work

Our topic and perspective model is related to su-pervised hierarchical LDA (SHLDA) (Nguyen etal., 2013), which learns a topic hierarchy whilealso learning regression parameters to associatetopics with feature values such as political per-spective. This model does not explicitly incorpo-rate perspective-specific word priors into the top-ics (as in our factorized approach). The regressionstructure is also different. SHLDA is a “down-stream” model, where the perspective value is a re-sponse variable conditioned on the topics. In con-trast, SPRITE is an “upstream” model, where thetopics are conditioned on the perspective value.We argue that the latter is more accurate as a gen-erative story (the emitted words depend on theauthor’s perspective, not the other way around).Moreover, in our model the perspective influencesboth the word and topic distributions (through thetopic and document components, respectively).

Inverse regression topic models (Rabinovichand Blei, 2014) use document feature values (suchas political ideology) to alter the parameters of the

method!words!word!corpus!learning!

performance!approaches!training!proposed!based!

“Linguistics”!

grammar!parsing!

representation!structure!grammars!parse!syntax!

representations!semantics!formalism!

semantic!knowledge!domain!ontology!systems!words!

information!wordnet!question!dialogue!

parse!treebank!parser!penn!parsers!trees!

dependencies!acoustic!corpus!parsing!

training!learning!corpus!large!

unsupervised!corpora!method!data!

semantic!knowledge!semantics!ontology!relations!lexical!concepts!concept!

similarity!words!word!vector!

semantic!similar!based!method!

words!corpus!word!

multiword!paper!based!

frequency!expressions!

question!questions!answer!

answering!answers!

qa!systems!type!

parsing!parser!parse!

treebank!grammar!tree!trees!

structure!

german!languages!french!english!

multilingual!italian!

structure!spanish!

“Computational”! “Semantics”!

“Syntax”!

Figure 6: Examples of topics (gray boxes) and components (colored boxes) learned on the Abstracts corpus with 50 topicsusing a factored structure. The components have been grouped into two factors, one factor with 3 components (left) and onewith 7 (right), with two examples shown from each. Each topic prior draws from exactly one component from each factor.

topic-specific word distributions. This is an alter-native to the more common approach to regressionbased topic modeling, where the variables affectthe topic distributions rather than the word distri-butions. Our SPRITE-based model does both: thedocument features adjust the prior over topic dis-tributions (through δ), but by tying together thedocument and topic components (with β), the doc-ument features also affect the prior over word dis-tributions. To the best of our knowledge, this is thefirst topic model to condition both topic and worddistributions on the same features.

The topic aspect model (Paul and Girju, 2010a)is also a two-dimensional factored model that hasbeen used to jointly model topic and perspective(Paul and Girju, 2010b). However, this modeldoes not use structured priors over the parameters,unlike most of the models discussed in §4.

An alternative approach to incorporating userpreferences and expertise are interactive topicmodels (Hu et al., 2013), a complimentary ap-proach to SPRITE.

9 Discussion and Conclusion

We have presented SPRITE, a family of topic mod-els that utilize structured priors to induce pre-ferred topic structures. Specific instantiations ofSPRITE are similar or equivalent to several exist-ing topic models. We demonstrated the utility ofSPRITE by constructing a single model with manydifferent characteristics, including a topic hierar-chy, a factorization of topic and perspective, and

supervision in the form of document attributes.These structures were incorporated into the pri-ors of both the word and topic distributions, unlikemost prior work that considered one or the other.Our experiments explored how each of these var-ious model features affect performance, and ourresults showed that models with structured priorsperform better than baseline LDA models.

Our framework has made clear advancementswith respect to existing structured topic models.For example, SPRITE is more general and of-fers simpler inference than the shared compo-nents topic model (Gormley et al., 2010), andSPRITE allows for more flexible and scalable fac-tored structures than FLDA, as described in earliersections. Both of these models were motivated bytheir ability to learn interesting structures, ratherthan their performance at any predictive task. Sim-ilarly, our goal in this study was not to providestate of the art results for a particular task, butto demonstrate a framework for learning struc-tures that are richer than previous structured mod-els. Therefore, our experiments focused on un-derstanding how SPRITE compares to commonlyused models with similar structures, and how thedifferent variants compare under different metrics.

Ultimately, the model design choice depends onthe application and the user needs. By unifyingsuch a wide variety of topic models, SPRITE canserve as a common framework for enabling modelexploration and bringing application-specific pref-erences and structure into topic models.

Acknowledgments

We thank Jason Eisner and Hanna Wallach forhelpful discussions, and Viet-An Nguyen for pro-viding the Congressional debates data. MichaelPaul is supported by a Microsoft Research PhDfellowship.

ReferencesD. Andrzejewski, X. Zhu, and M. Craven. 2009. In-

corporating domain knowledge into topic modelingvia Dirichlet forest priors. In ICML.

R. Balasubramanyan and W. Cohen. 2013. Regular-ization of latent variable models to obtain sparsity.In SIAM Conference on Data Mining.

D. Blei and J. Lafferty. 2007. A correlated topic modelof Science. Annals of Applied Statistics, 1(1):17–35.

D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum.2003a. Hierarchical topic models and the nestedChinese restaurant process. In NIPS.

D. Blei, A. Ng, and M. Jordan. 2003b. Latent Dirichletallocation. JMLR.

J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, andD. Blei. 2009. Reading tea leaves: How humansinterpret topic models. In NIPS.

J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive sub-gradient methods for online learning and stochasticoptimization. JMLR, 12:2121–2159.

J. Eisenstein, A. Ahmed, and E. P. Xing. 2011. Sparseadditive generative models of text. In ICML.

M.R. Gormley, M. Dredze, B. Van Durme, and J. Eis-ner. 2010. Shared components topic models. InNAACL.

T. Griffiths and M. Steyvers. 2004. Finding scientifictopics. In Proceedings of the National Academy ofSciences of the United States of America.

Y. Hu, J. Boyd-Graber, B. Satinoff, and A. Smith.2013. Interactive topic modeling. Machine Learn-ing, 95:423–469.

J. Kivinen and M.K. Warmuth. 1997. Exponentiatedgradient versus gradient descent for linear predic-tors. Information and Computation, 132:1–63.

J.B. Lewis and K.T. Poole. 2004. Measuring bias anduncertainty in ideal point estimates via the paramet-ric bootstrap. Political Analysis, 12(2):105–127.

W. Li and A. McCallum. 2006. Pachinko alloca-tion: DAG-structured mixture models of topic cor-relations. In International Conference on MachineLearning.

D. Mimno and A. McCallum. 2008. Topic mod-els conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI.

D. Mimno, W. Li, and A. McCallum. 2007. Mixturesof hierarchical topics with Pachinko allocation. InInternational Conference on Machine Learning.

D. Mimno, H.M. Wallach, E. Talley, M. Leenders, andA. McCallum. 2011. Optimizing semantic coher-ence in topic models. In EMNLP.

V. Nguyen, J. Boyd-Graber, and P. Resnik. 2013. Lex-ical and hierarchical topic regression. In Neural In-formation Processing Systems.

M.J. Paul and M. Dredze. 2012. Factorial LDA: Sparsemulti-dimensional text models. In Neural Informa-tion Processing Systems (NIPS).

M.J. Paul and M. Dredze. 2013. Drug extraction fromthe web: Summarizing drug experiences with multi-dimensional topic models. In NAACL.

M. Paul and R. Girju. 2010a. A two-dimensionaltopic-aspect model for discovering multi-facetedtopics. In AAAI.

M.J. Paul and R. Girju. 2010b. Summarizing con-trastive viewpoints in opinionated text. In EmpiricalMethods in Natural Language Processing.

M.J. Paul, B.C. Wallace, and M. Dredze. 2013. Whataffects patient (dis)satisfaction? Analyzing onlinedoctor ratings with a joint topic-sentiment model.In AAAI Workshop on Expanding the Boundaries ofHealth Informatics Using AI.

M. Rabinovich and D. Blei. 2014. The inverse regres-sion topic model. In International Conference onMachine Learning.

D. Ramage, D. Hall, R. Nallapati, and C.D. Man-ning. 2009. Labeled LDA: a supervised topic modelfor credit attribution in multi-labeled corpora. InEMNLP.

N.A. Smith and J. Eisner. 2006. Annealing structuralbias in multilingual weighted grammar induction. InCOLING-ACL.

E.M. Talley, D. Newman, D. Mimno, B.W. Herr II,H.M. Wallach, G.A.P.C. Burns, M. Leenders, andA. McCallum. 2011. Database of NIH grants us-ing machine-learned categories and graphical clus-tering. Nature Methods, 8(6):443–444.

N. Ueda and R. Nakano. 1998. Deterministic anneal-ing EM algorithm. Neural Networks, 11(2):271–282.

B.C. Wallace, M.J. Paul, U. Sarkar, T.A. Trikalinos,and M. Dredze. 2014. A large-scale quantitativeanalysis of latent factors and sentiment in onlinedoctor reviews. Journal of the American MedicalInformatics Association, 21(6):1098–1103.

H.M. Wallach, D. Mimno, and A. McCallum. 2009a.Rethinking LDA: Why priors matter. In NIPS.

H.M. Wallach, I. Murray, R. Salakhutdinov, andD. Mimno. 2009b. Evaluation methods for topicmodels. In ICML.

C. Wang and D. Blei. 2009. Decoupling sparsityand smoothness in the discrete hierarchical Dirich-let process. In NIPS.

M.D. Zeiler. 2012. ADADELTA: An adaptive learningrate method. CoRR, abs/1212.5701.