analysis of high-dimensional longitudinal genomic data for...

1

Analysis of High-Dimensional Longitudinal Genomic Data for Monitoring Viral Infection1Lawrence Carin, 2Alfred Hero, III, 3Joseph Lucas, 4David Dunson, 1Minhua Chen,

3Ricardo Henao, 2Arnau Tibau-Puig, 3Aimee Zaas, 3Christopher W. Woods, and 3Geoffrey S. Ginsburg

1Electrical and Computer Engineering Department, Duke University2Electrical Engineering and Computer Science Department, University of Michigan

3Institute for Genome Sciences and Policy & Department of Medicine, Duke University4Department of Statistical Science, Duke University

POC: [email protected]

I. INTRODUCTION

A. Motivation

There is often interest in predicting an individual’s latenthealth status based on high-dimensional genomic biomarkersthat vary over time, for example gene-expression and pro-teomic data. Motivated by novel longitudinal gene-expressionand proteomic data we have collected in several viral challengestudies, performed with healthy human volunteers, we presentsignal processing methods for analysis of time-evolving ge-nomic biomarkers. We consider this problem from multipleperspectives related to factor analysis and dictionary learning,and in each the high-dimensional data trajectories are relatedto a relatively low-dimensional vector of latent factors ordictionary elements. The multiple analyses are employed forcross validation, to assure that the inferred biological processesare meaningful and uncovered via distinct models. The modelsinfer genes and proteins in the viral response pathway, aswell as variability among individuals in infection times. Theinferred low-dimensional space in which the high-dimensionaldata resides is used to provide biological interpretation of theinferred viral response pathways.

There has been much recent interest in the analysis ofdynamic biological processes, particularly with data fromDNA gene-expression microarray chips [1], [2]. Appropriatelyanalyzing the trajectories as multivariate functional data ischallenging due to the massive dimensionality, few obser-vations in time, low signal-to-noise ratio, and missingness.Ideally, methods would allow building a full joint model thatallows each biomarker (e.g., gene or protein) to have its owntrajectory, while accommodating dependence in these trajecto-ries across biomarkers within shared pathways and variabilityacross individuals. In such time-dependent modeling, one mustoften distinguish the observed (“wall clock”) time at whicha measurement was performed from the (latent) biological-clock time, and the difference between these two must beinferred (since the offset between the two is typically subjectdependent) [3], [4].

In this paper we consider analysis of time-evolving gene-expression and proteomic data. The proposed models explicitlyaddress issues associated with inferring the time shift betweenbiological times and “wall-clock” time, inferring the subject-dependent character of the former. We employ factor-analysisand related dictionary-learning based approaches. The use of

such methods obviates the need for explicit clustering [1] ofgenes.

The analysis techniques reviewed here are motivated by anddemonstrated with a novel data set we have measured in re-cent challenge studies. Specifically, after receiving appropriateInstitutional Review Board approval, we performed separatechallenge studies in which human volunteers were inoculatedwith two strains of influenza (H3N2 and H1N1), human rhinovirus (HRV) and respiratory syncytial virus (RSV). For eachsuch challenge study, roughly 20 healthy individuals wereinoculated with a particular influenza virus, and blood sampleswere collected at regular time intervals until the individualswere discharged. These data provide a unique opportunityto examine the time-evolving host response to such viruses.The mRNA expression levels in blood were assayed withAffymetrix GeneChip Human Genome U133A 2.0 Arrays toconstitute gene-expression values for 12,023 genes. For moredetails on the mRNA data, see [5].

B. Existing methods for time-course analysisThere have been numerous previous studies on the analysis

of time-course gene-expression data [1], [2] and almost all ofthese employ a clustering of the genes. To model the con-tinuous time dependence of the gene expression, researchershave employed the Gaussian process [3], as well as splinebasis functions [1]. Most of these methods employ mixed-effects models, where the fixed-effects component correspondsto clusters, with the genes clustered among one of C differentclasses or clusters. The random effect term typically has a con-tinuous time dependence that is a function of the specific geneand subject (inferred, for example, using a spline expansion).One may also employ hierarchical clustering of the genes [6].

Additional examples of such a mixed-effect clusteringmodel applied to time-course gene-expression data include[7], [8]. While this approach has been applied successfullyin many settings, it has limitations that restrict its utility. Forexample, we are typically interested in over 10,000 genes whenperforming microarray analysis, and therefore the numberof spline-based expansions that must be fit is significant.Additionally, for the application of interest here, we have onthe order of 20 different subjects, each manifesting a distincttime-course profile.

The proposed approaches avoid the need to explicitlyperform clustering (it is done implicitly within the factor

2

modeling and dictionary learning), and the proposed modelsinfer subject-dependent shifts in the latent biological turn-ontime. Typically only a small fraction of the genes contributeto the biology under study, and in the context of the factoranalysis these genes are inferred by imposing a sparsenessconstraint. One need only model the time dependence of thefactor scores, rather than separately model the time evolutionof each individual gene or protein.

II. TIME-DEPENDENT FACTOR SCORES

A. Basic factor model

Let Xi ∈ RP×T represent observed biomarkers (e.g., gene-expression data) for individual i, considering P markers,collected at T time points (the number of time points couldalso be subject-dependent); the jth column of Xi correspondsto the P biomarkers measured at time tij , for j ∈ {1, . . . , ni}.We assume a total of S individuals/subjects, constitutingcumulative data {Xi}i=1,S . We consider a factor model withk factors

Xi = LSi+Ei =

k∑m=1

LmS>mi+Ei; i = 1, 2, · · · , S (1)

where L ∈ RP×k is the factor loading matrix, and Lm isthe mth column of L; the factor scores for individual i areSi ∈ Rk×T , and S>mi is a row vector (mth row of Si) of time-varying scores for the ith individual and mth latent factor. Thefactor loadings are assumed fixed in time, while we allow thelatent factors, S>mi, to vary dynamically. The matrix Ei ∈RP×T is the additive noise or residual.

B. Shifted spline representation

Recall that individual i has data sampled at ni time points;let ti = (ti1, ti2, . . . , tT ) denote the time points at which datawere collected for individual i (in units of minutes/hours, etc.),with respect to a time reference shared by all S individuals.Note that these are observed times, on a universal clock, to bedistinguished from the latent biological clock of the systemunder investigation (in our specific example this correspondsto the host response to a virus), which is generally individualdependent. The rows of Si(t) are a continuous function oftime, and the matrix Si represents each such row sampled atthe T time points represented by ti.

Recall that Smi ∈ RT represents the factor score associatedwith factor m ∈ {1, . . . , k} for subject i ∈ {1, . . . , I},evaluated at the T discrete time points in ti (Smi is acolumn vector, the transpose of S>mi above). To model Smi,let b(t) ∈ Rq represent a column vector, corresponding toevaluating each of q spline functions at any time t over thesupport of the splines [1], defined here by the time windowover which data are collected. The number of splines q andtheir composition depend upon the specific application. Thefunction b(t − τ) ∈ Rq corresponds to realigning the splinefunctions to have the time origin shifted forward by τ ∈ R. Weallow a time shift τmi specific to latent factor m and individuali by characterizing the factor score trajectories as

Smi = B(ti; τmi)wm + εmi, (2)

Fig. 1. Generative process for the factors Smi = B(ti; τmi)wm +εmi. At left are shown the basis functions, corresponding to splinefunctions and a step function at earliest times (the latter represents thefactor before the virus under study causes changes to the host). Thebasis functions are weighted by wm and superposed, to constitute acontinuous-time factor, termed here a “prototypical trajectory”. Forindividual i, the trajectory is shifted by time τmi, and then sampledat the times defined by ti, manifesting the discrete samples in thesecond-to-last column. Finally, i.i.d. noise is added to each discreteobservation, manifesting the final discrete individual-dependent fac-tors for factor m (right-most column). The figures in the right twocolumns correspond to actual samples from the H3N2 challenge study(microarray data), with the “prototypical trajectory” representing theinferred typical host response, apart from the individual-dependentshift τmi. The basis functions (left column) are used for all factorsm ∈ {1, . . . , k}, and separate weights wm are used to yield theshifted factors within the box.

where [B(ti; τmi)]> = [b(ti1 − τmi), . . . , b(tini

− τmi)],[B(ti; τmi)]

> is the transpose of B(ti; τmi), and wm ∈ Rq

corresponds to the spline coefficients for the mth latent factor.An illustration of the above generative process is presented inFigure 1.

C. Temporal shift and distinguishing host-response factors

In our motivating application, all individuals are inoculatedwith a virus at the same time. Blood is drawn from all subjectsat a specified time prior to inoculation (t = −5 hours) toconstitute a baseline signature, and another (distinct) bloodsample is drawn just before inoculation (the latter occurringat what is defined as time t = 0 hours). The vector ti is definedsuch that increasing element index corresponds to increasingtime; this vector records the times at which blood sampleswere collected. Therefore, each individual shares the same firsttwo time points in ti, and since the time of inoculation isby definition at t = 0, the first element in ti corresponds tonegative time.

Our objective is to study the host (body) response to thevirus, and therefore the spline-based construction for the time-dependent factors is constituted as in Figure 2. Note that thefunction B(ti; τmi = 0)wm has a constant form for t ≤ −5

3

hours (with value of the constant inferred via the analysis), thisrepresenting the background/baseline (pre-inoculation) factorscore for a (presumably) healthy individual. Consequently,with application to our challenge studies, the shift τmi maybe viewed as the delay between inoculation of subject i andthe time at which factor m changes from its background(“normal”) value; i.e., this is the host response time forpathway m, which is expected to vary between subjects.

Fig. 2. Basis functions used for modeling the time dependence of the factorscores.

Considering Figure 2, note that large shifts τmi imply thatindividual i has a near constant host response for factor m,as function of time (“near”, but not exactly constant becauseof the addition of the εmi). This model is consistent with ourinfluenza challenge study data, as approximately half of theindividuals did not become symptomatic, and for these all ofthe associated factor scores manifested very weak temporalchanges. Therefore, the presence of large τmi for all factorsm ∈ {1, . . . , k} implies that individual i is asymptomatic.Further, if a particular factor m ∈ {1, . . . , k} is not related tothe host response to the virus for individual i, the associatedτmi will be large, implying that Smi is nearly time invariant.

Concerning the subject and factor-dependent shift τmi, weconsider a finite set of discretized τmi, finely sampled in time,and place a Dirichlet prior on the probability that each of theseshifts are selected. The model infers an approximate posteriordensity function on which time shift is most appropriate foreach subject and factor.

D. Sparse Factor Loadings

In many biological applications it is desirable to impose thatthe factor-loading matrix is sparse [9]. In the case of gene-expression data, the mth factor can be viewed as measuringoverall expression of the mth pathway, with the non-zeroelements in the mth column of the loadings matrix Lm

corresponding to the genes in that pathway. Biologically, wewould expect a small minority of the genes to play a rolein any one pathway, implying sparsity. Hence, we model theloading matrix as

L = V ◦ Z (3)

where ◦ represents a pointwise (Hadamard) matrix productbetween A ∈ RP×k and Z ∈ {0, 1}P×k. Binary matrix Z

is designed to be sparse, and therefore the factor loadings,defined by the columns of A ◦ Z, are also sparse.

The binary matrix Z is constituted via a so-called Indianbuffet process (IBP) [10], implemented in practice by settingk large, and allowing the model to infer the number of neededfactors. In the context of the IBP, each of the P genesare “customers” in a buffet restaurant, and the mth factorrepresents the mth dish. If gene g selects dish (factor) m,then Zgm = 1, and otherwise Zgm = 0.

E. Example results

Of the k factors, one of them manifested a time trajec-tory B(ti; τmi)wm that was closely aligned with the clinicalscores, and it is this factor that is examined in further detail,as it is deemed to be associated with the (time-evolving)host response to the virus. Results are shown for the gene gcorresponding to RSAD2, for the H3N2 virus. This gene hadthe strongest contribution to the loading of this factor (largestZgm|Agm|). All computations with this method are performedusing a Gibbs sampler.

We compare the individual- and time-dependent factorscore of this factor with clinical symptom score provided bymedical doctors. The clinical symptom score was recordedtwice daily using standardized symptom scoring [11]. Themodified Jackson Score requires subjects to rank symptomsof upper respiratory infection (stuffy nose, scratchy throat,headache, cough, etc) on a scale of 0-3 of “no symptoms”,“just noticeable”, “bothersome but can still do activities” and“bothersome and cannot do daily activities”. For all cohorts,modified Jackson scores were tabulated to determine if sub-jects became symptomatic from the respiratory viral challenge.A modified Jackson score of ≥ 6 over the quarantine periodwas the primary indicator of successful viral infection [12] andsubjects with such a score were denoted as “symptomatic”; thelatter individuals are represented with blue points in Figure 3.

In Figure 3 we plot the inferred time-dependent factor scorefor each of the subjects as well as the clinical symptom scores.Note that the clinical symptom score generally tracks theinferred factor score well, for this time-evolving factor. Ad-ditionally, for the asymptomatic x(m)

g (tij) = Agm[εmi(tij) +∑ql=1 wmlBl(tij ; τmi)] is almost a constant with time, but it

is not zero.We now examine the inferred mean trajectory

Agm

∑ql=1 wmlBl(tij ; τmi) of the (typical) individuals

who became symptomatic (Zgm = 1). In Figure 4 we depictthe inferred host response for this factor. Note that thistrajectory has a constant value at early time; it is used as aprototype trajectory for both symptomatic and asymptomaticsubjects, and the two are distinguished by the manner inwhich the trajectory evolves with time and the inferredtemporal shifts.

III. ORDER PRESERVING FACTOR ANALYSIS

A. Dictionary learning and factor analysis

In addition to Bayesian factor analysis, we have also exam-ined the data using non-Bayesian dictionary learning. The two

4

Fig. 3. Subject-dependent plots of average x(m)g (tij) =

ZgmAgm[εmi(tij)+∑q

l=1 wmlBl(tij ; τmi)], for gene RSAD2 fromthe factor linked to H3N2 (blue), as well as the clinically ob-served symptom score (green). We consider RSAD2 gene, for whichZgm = 1. The horizontal axes correspond to time from inoculation,in hours, and the vertical axes correspond to factor (left) or clinical(right) score. The subjects with a +1 label (top of each subfigure)corresponds to individuals who became symptomatic, and those with-1 labels were asymptomatic. Time t = 0 hour corresponds to whenthe virus inoculation occurred. To reduce clutter in the figures, theaxes are not labeled; the horizontal axes correspond to time in hours,and the vertical axes represent the factor score (left) or the clinicalscore (right).

Fig. 4. Top: Inferred average trajectory for the (presumed) fac-tor associated with the time-dependent host response to H3N2,ZgmAgm

∑ql=1 wmlBl(tij ; τmi = 0) (with standard-deviation error

bars), corresponding to the gene RSAD2 (Zgm = 1). Bottom:Inferred shifts for all individuals. Note that the shifts cluster naturallyinto two groups (red: asymptomatic, blue:symptomatic), consistentwith the clinical label information.

distinct classes of models have inferred very similar under-lying biological processes. The use of independent analysesis deemed important for accurately uncovering new biologybased on limited high-dimensional data.

Dictionary learning refers to a class of methods that seekto represent data by sparse combinations of an overcompletebasis set, called a dictionary [13]. Note that here we take adifferent approach from the factor modeling in Section II, withX>i from Section II corresponding to Yi. Dictionary learningis also called sparse coding and is widely used in neuroscience,speech, audio, and image processing. On the surface dictionarylearning resembles factor analysis in that they both seek afactored representation for the data matrix. For example, whenYi is the matrix of subject i with rows corresponding to T timepoints and columns corresponding to P gene indices dictionarylearning seeks a factorization of the form

Yi = MAi + εi, i = 1, . . . , S (4)

where S is the number of subjects, the f columns of matrixM form the universal dictionary of basis elements, and Ai isa sparse coefficient vector (the code) associated with subjecti’s particular linear combination of dictionary elements com-posing Yi. In dictionary learning, as in factor analysis, boththe dictionary and the coefficients are learned from the data{Yi}. However, while in standard factor analysis the objectiveis to find a low rank Ai, in dictionary learning the objectiveis to find a sparse matrix Ai.

For consistency we will use the standard factor analysisterminology for the dictionary learning model (5): columns ofM and Ai will be called factor loadings and factor scores,respectively. While this model can be used for a wide rangeof applications it is not applicable when there are unknowndelays among factor loadings shared by different subjects. Wedescribe a variant of the dictionary learning model, calledorder preserving factor analysis (OPFA), that accounts fortemporal misalignment, incorporates smoothness constraintson the factor loadings, and preserves their relative orderingover the subject population.

B. Order preserving factor analysis

The principle of evolutionary conservation suggests thatmajor gene regulation mechanisms, such as cell growth anddeath, operate similarly over the human population. Accordingto this principle, all healthy individuals share the same basicmechanisms of immune response. Systems biology modelsformalize this principle by modeling gene regulation accordingto a causal cascade of modules or pathways. Under sucha model, signals associated with viral sensing and antigenpresentation precede signals for the inflammation response tothe virus. The order in which these signals occur is importantto effective immune response while the precise timing of thesesignals may be less important. The order may in some casesbe known, or hypothesized based on known biology, or it maybe unspecified and learned from data [14].

The systems biology viewpoint motivates an order preserv-ing modification of the dictionary learning model (5) thatrestricts immune-related signaling to occur in a (unknown)

5

temporal precedence order. The modified model accounts fortemporal misalignments between signals up to these orderrestrictions

Yi = M(F,di)Ai + εi, (5)

delays that specifies the delay of each factor, a column ofF. When the factors satisfy a precedence order constraint thevector of delays will lie in a cone shaped region for all subjectsi, for example, di ∈

{d = [d1, . . . , df ] : ∩fj=2{dj > dj−1}

}is the region where each factor precedes the next in thenatural index order of the factors. The factors, delays andcoefficients are estimated from the data by solving a non-convex optimization problem of the form

minF,di,Ai

S∑i=1

‖Yi −M(F,di)Ai‖2F

+λP1(A1, . . . ,AS) + βP2(F) (6)

where minimization is performed over vectors of delays di

that lie in the order-preserving set for all subjects and overnon-negative matrices F, Ai. The non-negativity constrainton the factors is natural since gene expression is measuredin units of abundance of mRNA. The functions P1 and P2

are penalties that induce temporally smooth columns of F andsparse columns of Ai. For more details on the implementationof the optimization algorithm for solving the order preservingdictionary learning problem (6) the reader is referred to [15].

C. Example results

Our formulation (6) of order preserving factor analysis canbe interpreted as an extension of Sparse Factor Analysis (SFA)[16], parallel matrix factorization (PARAFAC) [17], and non-negative matrix factorization (NMF) [18] that accommodatesfactor misalignment and unknown factor ordering commonto all measurements. PARAFAC and NMF generalize PCAto higher dimensions (tensors) and to non-negative matrices,respectively. These matrix factorization methods are highlysensitive to misalignments of the factors. The order preservingrestriction of OPFA overcomes this misalignment sensitivityas illustrated in Fig. 5 for a toy example.

Figure 6 shows the result of applying OPFA to real data,in particular the set of 9 clinically sick subjects in theH3N2 challenge study described above. OPFA factors werediscovered that correspond to three characteristic temporalprofiles: suppressed response (factor 1 in red), suppressedresponse followed by recovery (factor 2 in blue), and enhancedresponse (factor 3 in green). The genes in these three groupsare associated with the JNK pathway (factor 1), ribosomalprotein (RP) expression (factor 2), and interferon inducible(IFN) genes (factor 3). Furthermore, the factor order recon-structed by OPFA indicates that the gene onset times infactor 2 occur earlier than those in factor 3, a finding thatis consistent with previous studies of temporal immune hostresponse [19]. Remarkably, OPFA discovered this ordering denovo despite subject misalignments in the gene expressiontrajectories. Furthermore, even though clinically determinedsymptom onset times reported in [20] were not used by

Fig. 5. Illustration of effect of temporal misalignment on orderpreserving factor analysis for a synthetic example with ten subjects(only four shown), 50 genes, 50 time points and f = 2 factors ina low-noise environment (SNR=10dB). The top plot shows the genetrajectories of four of the subjects, the misalignment of the signalfeatures is evident. The left-bottom plot shows the OPFA estimatedand original factors, after realignment to a common reference time-point. The right-bottom plot shows the same for Sparse Factor Anal-ysis (SFA), a model that does not account for the order-preservingmisalignments. The OPFA estimates correlate signficantly better withthe original factors than the SFA ones.

Fig. 6. Upper panel: gene expression trajectories of 13 highly-variantgenes for 4 of the 9 symptomatic subjects in the H3N2 challengestudy. Lower panel: OPFA discovers 3 factors (red blue and green),and their corresponding alignment parameters, that explain these geneexpressions trajectories by solving the optimization problem in (6).The clinical onset times, determined by physicians, are shown inblack. It is clear that the peak of the green factor predicts the onsettime and that precedence order among the three factors is consistentacross subjects.

6

Fig. 7. Scatter plots: Representation of each gene trajectory onthe first two OPFA coordinates (given by the first two columns ofAi) and the coordinates obtained through a PCA analysis of themisaligned joint covariance

∑9i=1 XiX

Ti . The left-most and right-

most figures show the correspondence between the color codingand the clusters obtained by doing hierarchical clustering on theOPFA re-aligned data (left) and the raw misaligned data (right). Thedown-regulated genes (reds) are clearly clustered away from the up-regulated genes (dark blues) in the OPFA representation, and bothgroups are separated by the genes that show little or no variation(light blue). The PCA analysis suffers from misalignments and thefirst two principal components fail to separate the up-regulated genes(reds) from the down-regulated genes (dark blues). Furthermore, thetemporal clusters obtained from OPFA re-aligned data (far left) aremore concentrated than those obtained from raw misaligned data (farright), as can be seen from the tighter confidence envelopes on theOPFA cluster means.

OPFA, the subject-dependent delays of OPFA factor 3 trackthese clinical symptom onset times. This provides independentconfirmation of the power of the OPFA method.

Figure 7 illustrates the utility of OPFA for improving clusterseparation performance for unsupervised clustering. The twoscatter plots show the coefficients of each gene over the firsttwo OPFA factors and the two first principal components ofthe covariance matrix of the misaligned data,

∑9i=1 XiX

Ti .

Each gene is color coded according to the clusters foundby performing hierarchical clustering on the OPFA-alignedand on the original misaligned data, respectively. The scatterplots show how the OPFA coefficients show better separa-tion between the up-regulation genes (reds) from the down-regulation genes (dark blues). In the OPFA scatter plot thered and dark blue groups are separated by genes with weaktemporal response (light blue). In contrast the PCA-basedrepresentation of the raw misaligned data does not separatethese groups well, reflecting the higher variance in the red anddark blue genes groups due to misalignment. This tighteningof the clusters is further illustrated by comparing the far lefttime profiles (cluster means after OPFA alignment) to the farright time profiles (cluster means without first applying OPFAalignment).

IV. METAPROTEIN EXPRESSION MODELING

A. Mass spectrometry proteomics

Unbiased mass spectrometry based proteomics has madetremendous progress since initial studies using MALDI-ToF(matrix-assisted laser desorption/ionization - time of flight)machines in the late 1980’s. Current machines are now ca-pable of splitting samples according to a number of differentfeatures such as pK, hydrophobicity, and ion mobility and theresolution of measurements of mass-to-charge ratios are nowhigh enough to detect the difference between two polypeptidesthat are identical except for the inclusion of a single extraneutron. In this section, we present a different extension ofthe factor model used in [21] that is specific for the analysisof unbiased, label free mass spectrometry proteomics data. Weincorporate multiple sources of information about correlationin the hierarchical structure of the model, and this leads tosignificant improvement in posterior estimation.

Mass spectrometry data may be summarized at a number ofdifferent levels, and the analysis of that data may be tailoredto any of these summarizations. The smallest unit that ismeasured by LC-MS-MS is a single peak, which is termeda feature, and there are typically on the order of 105 suchfeatures. This is a single peak in the 2-dimensional surfaceover a plane defined by the retention time (amount of time apolypeptide takes to pass through the liquid chromatographycolumn) and mass-to-charge ratio. The intensity of this featureis defined to be the volume under this peak. Because acertain percentage of carbon in nature has an extra neutron,each polypeptide leads to multiple features. The collection offeatures from a single polypeptide that differ in mass-to-chargeratio only by an integer number of neutrons is called an isotopegroup, and the intensity of the isotope group is the sum of theintensities of its associated features – this is the level at whichw e summarize our data in this paper. In addition to differencesin mass, polypeptides may accept a variable, integer numberof protons during electrospray ionization. Thus there may bemultiple isotope groups per peptide. Finally, for a collectionof isotope groups that are known to originate from the sameprotein one might summarize the data at the protein level.We note that, in contrast to gene expression microarray datain which each spot on the array is fully characterized, thechemical species that make up a mass spectrometry peaks areoften unknown.

B. Existing methods for analysis

There are a number of different regression models designedfor summarization of proteomics data at the protein level. Thesimplest such procedures involve direct summarization of allfeatures/isotope groups/peptides that are identified for eachprotein. This may involve averaging or robust summarizationbased on quantiles [22]. In addition to these algorithms, thereare a number of different ANOVA approaches which includefixed effects for protein, peptide and experimental group [23],include an additional random effect for cases in which subjectsare measured in replicate [24], or add additional interactioneffects between treatment and feature [25]. These may assumeconstant or varying noise levels across isotope groups, and

7

have been shown to exhibit better performance than naivesummarization approaches that do not adjust for confoundingfactors [25]. While all generally acknowledge the existenceof i ncorrect identifications, none of these approaches directlyaddress this problem. In addition, other than our previous workwhich examines an earlier factor model in greater detal in adifferent biological context [26], we are unaware of any suchtechniques that utilize correlation between features/isotopegroups/peptides in any way, nor do any of them utilizeunidentified features in protein level quantitation. We reviewin this paper a statistical model first described in [27] thatallows the direct modeling of correlation structure and itsdeconvolution into separate protein and pathway effects. Wedo not examine any improvements that might be made by theinclusion of fixed and random effects associated with treatmentgroup or replicate measurements of sample. However, themodel we describe is a regression model, and would, therefore,be amenable to the inclusion of such effects.

Fig. 8. All isotope groups originating from the protein A2GL. Thecolumns have been ordered to make the first principal componentmonotone (independently in each heatmap) and the rows have beenordered from top to bottom in order of decreasing correlation withthe first principal component in the left heatmap, with that orderingpreserved in the right heatmap. We have broken the figure by batchto demonstrate that the correlation structure is preserved even whenthe experiment is repeated on different samples months apart.

C. Factor model and hierarchical structure

There are two key sources of information we might usein order to collect isotope groups into coherent subsets –identifications and coexpression. The identifications tell uswhich isotope groups originate from the same protein, and ifwe assume that proteomics is actually measuring differentialexpression of proteins, then all of the isotope groups from thesame protein should coexpress. However, identifications areincomplete, and while those that are obtained are reasonablyhigh accuracy, there are still some mistakes. Additionally,there are a number of biological processes that add chemical

Fig. 9. A heatmap of the metaprotein showing the strongestassociation with disease. Each row is an isotope group and eachcolumn is a sample. Note that the majority of the peptides are fromthe protein A2GL (Leucine-rich alpha-2-glycoprotein), but that thereare peptides that were identified as belonging to other proteins suchas Apolipoprotein B-100 (APO B), Complement factor B (CFAB),Kallistatin (KAIN) and other unidentified isotope groups. The redcolor represents a relatively high concentration of the isotope group inthe sample while blue represents low (each row has been standardizedto have mean zero and variance 1). Samples from subjects whobecame symptomatic are labeled with + and those who remainedassymptomatic are labeled with -. The label colors, black, blue, pinkand red represent times 0, .2, .8 and 1 respectively. The samplesare ordered so that the associated factor is increasing, and becausealmost all samples from symptomatic individuals at times .8 and 1(red and pink +’s) are at the far right we can see that this factorclearly distinguishes sick from healthy individuals.

modifications to specific regions of proteins. These modi-fications change the relative abundance of the unmodifiedregions of the proteins, which exerts strong effects on theresulting measurements (thus proteomics is actually measuringsomething more subtle than just differential expression ofproteins).

Aside from biological processes that may lead to differentialexpression of individual proteins, there is technical variationthat will lead to differential measurements of expression acrosslarge portions of the data set. We will utilize a factor modelto represent the correlation structure present in the data, butwe break that model into two parts, each of which will haveits own hierarchical structure. We suppose that X ∈ RP×N

is a matrix of intensities, with columns Xi, where P is thenumber of measured isotope groups and N is the number ofsamples. We assume

X = MA+ LS+E.

Note that the basic form of this model is related to the factormodel in Section II-A, but now X corresponds to proteomicdata. Both MA and LS describe latent factors, but we havesplit them because they describe different types of correlationwith different hierarchical structures.

We represent technical noise with the MA factor structurewhere M ∈ RP×d is a factor loadings matrix and A ∈ Rd×N

is a factor scores matrix. We assume that this noise is

8

ubiquitous throughout all isotope groups and therefore do notimpose sparsity, Mi,j ∼ N(0, τ0). We may optionally includedesign variables in A if we want to control for specific knownfeatures of the data. This may include either known batcheffects (although we find that these are captured well by latentfactors) or known phenotypes of the samples. Otherwise, weassume latent factors such that Ak,j ∼ N(0, 1).

The correlation structure that is present in the data becausethere are multiple isotope groups are derived from the sameprotein is modeled by a separate factor structure LS. As beforeL ∈ RP×k is a factor loadings matrix and S ∈ Rk×N isa matrix of factor scores. This structure of L is expectedto be sparse – correlation described by this structure shouldbe largely restricted to sets of isotope groups from the sameproteins. We assume that every isotope group originates froma single protein, and therefore that every row of the loadingsmatrix L contains only one non-zero element. Thus we intro-duce a latent variable zi which identifies, for isotope group i,the metaprotein factor to which it belongs (i.e., which elementof the ith row of L is non-zero). Thus element Li,j = 0 whenj 6= zi and Li,zi ∼ N(0, τ0). Our hierarchical prior for zi is

zi ∼ Multinomial(1, qi) , qi ∼ Dir(ai). (7)

We utilize an informative choice of Dirichlet distributionparameters ai in cases where we have prior information tellingus in which protein isotope group i originated. Specifically,if isotope group i is from protein k then we assume thatai = (a0, · · · , a0, ak, a0, · · · , a0)′ where ak >> a0.

In addition to correlation between isotope groups due tooriginating from the same protein, expression of the proteinsthemselves is also correlated due to that expression beingregulated within the same biological pathways. Because wehave information about the relationships between some isotopegroups and proteins, we are able to deconvolute these twosources of structure in the data. In order to capture this“pathway level” correlation between proteins, we impose abinary tree model on the metaproteins. We suppose that eachrow, Sk of S (each metaprotein) identifies an expressionpattern that is associated with a leaf in a binary tree. Wedefine ta→b to be the “distance” between node a and its childnode, b, and wa to be an N -dimensional vector describingan expression pattern associated with node (or leaf) a. Then,assuming b is a child of a, we assume

wb ∼ N(wa, ta→bIN )

Given any pair of leaves, we a priori assume that the distancebetween one of those leaves and the first node which isan ancestor of both is exponential with rate parameter 1.This is the Kingman’s coalescent [28]. It describes a uniformdistribution on the space of binary trees, and provides a properprior distribution on child-to-parent distances, t.

We introduce a factor for each protein that has more thanone identified isotope group in the data set. This model isconjugate, and we utilize Gibbs sampling in a Markov chainMonte Carlo algorithm to obtain posterior distributions for allmodel parameters. Sampling of coalescense times for the treemodel is accomplished via belief propogation [29], [27]. Trees

Fig. 10. The tree with the highest posterior likelihood from amongthose that were visited during the MCMC chain. A2GL is shown atthe top in a sub-tree with c-reactive protein and lipopolysaccharidebinding protein, both of which are known to react to the presence ofinfection.

are constructed through the coalescence procedure describedin [28] and in [27].

D. Plasma proteomics during viral infection

Working with the same set of subjects discussed in theprevious sections (innoculated with an H3N2 strain of viralinfluenza), we obtained LC-MS/MS proteomics data fromblood plasma at baseline (time=0), at the time of maximumsymptoms (time=1) and at times .2 and .8. The resultingMS traces were then used to estimate the concentrations ofapproximately 40,000 isotope groups in each sample (with≈ 5% overall missingness). Approximately 10% of thesewere identified – the amino acid sequence of the peptidewas characterized and that sequence was mapped to a knownprotein.

In our viral infection data, this leads to a very sparsemodel with 109 latent factors, each nominally representingthe expression of a particular protein. Due to uncertainties inidentifications as well as biological perturbations of particularsections of the proteins, there are many peptides (approxi-mately half) that do not follow a pattern of expression acrossthe samples that is consistent with the majority of peptidesfrom that protein (see, for example, Figure 8).

We were able to identify a number of meta-proteins thatare associated with the disease state, the strongest of whichis shown in Figure 9. We note that the majority of isotopegroups in this factor are identified as belonging to the proteinA2GL and that it is grouped together by the tree model with c-reactive protein and lipopolysaccharide binding protein (Figure10), both of which are known to react to the presence of

9

infection. It is informative to examine the full collection ofisotope groups from A2GL (Figure 8). We note that, whilearound two thirds of the isotope groups show clear visiblecoexpression, the remainder show patterns that are not highlycorrelated. An understanding of this structure in the dataprovides a number of benefits. First, in cases where it isthe bulk of the protein that shows differential expression thatcorrelates with biological phenotypes, as is the case withA2GL and the “symptomatic versus assymptomatic” pheno-type, the aggregate expression from the metaprotein modelwill provide a much stronger predictor than a summarizationbased on isotope group identifications. Second, if our goalis the development of biosignatures then we must be carefulabout which peptides, not just which proteins, we will use forthat biosignature. Finally, in cases where we are looking forassociation between protein expression and phenotype data, wewill be able to perform many fewer hypothesis tests, and havecommensurate higher power, if we can perform those tests onjust metaproteins rather than on peptides. This is also true ofprotein summarization approaches based on identifications aswell, however, we find that there are approximately half asmany metaproteins versus proteins, that the metaproteins aretypically less correlated with each other than are proteins, andthat we can include potentially informative but unidentifiedisotope groups in targeted studies when we select them basedon the metaprotein model.

V. INFERRED BIOLOGY

As summarized in Sections II and III, two very distincttechniques were employed to analyze the time-course gene-expression data (in Section II a fully Bayesian approach wasemployed, while in Section III a non-Bayesian optimizationapproach was employed). It is encouraging that these ap-proaches agreed on the the following 50 genes as beingimportant to the host response to the virus (these contributesignificantly to the factor linked to the host response to thevirus): RSAD2, OAS1, IFI44L, RTP4, IFIT3, IFITM1, IFI44,PLSCR1, LY6E, ISG15, P2RX5, IFI27, GBP1, KIAA0125,APOBEC3A, EPB41L3, IFIT1, XAF1, PSMB9, TRIM22,SERPING1, HERC5, OASL, SCO2, IFI6, DDX60, BLK,MS4A4A, TNFRSF9, BLVRA, LOC26010, MX1, C1QA,OAS3, IRF7, VAMP5, IFIT5, SMPDL3A, FER1L3, UBE2L6,SIGLEC1, C13orf18, PSME2, IFI35, C1QB, BST2, OAS2,PNOC, RRAS and SRBD1. These same genes were foundto be important to all viruses we have studied (H3N2, H1N1,HRV and RSV), and therefore we refer to these as constitutinga “pan-viral” factor. Further, we emphasize that we only list50 genes for brevity, but hundreds of other genes are alsoinferred to play a role in the host response. In the context of thefactor analysis in Section II, for example, these genes are thosethat contribute appreciable amplitude to the factor loadinghighlighted in Figures 3 and 4. In a third distinct analysis (notcovered in this paper), based upon elastic-net and Bayesianelastic-net analyses [30], these genes were again found to playprincipal roles in the host response; we therefore emphasizethat these genes have been analyzed and re-analyzed frommultiple statistical perspectives, and their robustness suggestsbiological importance.

Fig. 11. Genes identified from a key anti-viral immune pathway.Pathway analysis (www.genego.com) illustrates the ISG15 pathway,with over-representation of genes identified by the multi-task elasticnet. ISG15 is a ubiquitin-like modifier that is induced by interferonto restrict viral replication [32]. Downstream elements of ISG15activation include activation of STAT-1.

These findings are also supported by our proteomics data.Three of the proteins (CRP, A2GL and CO9), and 35 of the50 top genes, have in their promoter regions binding sites forinterferon regulatory factor 1. This suggests that activation ofthis interferon pathway is critical for an active response toinfection, although this has yet to be tested thoroughly.

In Figure 11 we relate the aforementioned genes to aninferred pathway. This pathway is deemed to be of highaccuracy as the strength of association of the multi-task genelist with this pathway is quite robust (z-score [an indication ofhow many genes in the gene list are represented in a particularnetwork] 76.83). The top represented pathway, the ISG15pathway in Figure 11, is highly involved in viral immunity asit is activated by initial viral sensing and subsequent interferonproduction. ISG15 is known to target the influenza A proteinNS1 and result in limitation of viral replication [31].

VI. SUMMARY

In this paper we have reviewed recent progress in usinghigh-dimensional longitudinal genomic data collected fromvirus challenge studies, performed with healthy human vol-unteers. The focus of the paper has been on the statisticalsignal processing, but we also show how the results may beused to yield biological insights.

An underlying theme of the statistical analysis is constitutedby use of factor analysis to yield a small number of factors

10

responsible for the high-dimensional data. This frameworksignificantly aids analysis, as we typically have far fewersamples than genes and proteins, and therefore dimensionalityreduction is essential. The factor analysis has been imple-mented from various perspectives. Specifically, in one analysisof the gene-expression data, and when analyzing the proteomicdata, the factor loadings were related to the genes/proteins, andthe factor loadings were assumed sparse, as to infer the low-dimensional set of genes/proteins responsible for biologicalpathways. In a distinct factor analysis, related to dictionarylearning, the factor loadings were employed to model thetime dependence of the gene expressions, and in this case theloadings are not sparse. In addition to these different usages ofthe underlying model, we also employed Bayesian and non-Bayesian i nference methods. It is highly encouraging thatthese very different methods yielded very similar biologicalinterpretations, concerning the genes that play a pivotal rolein the host response to virus.

We have focused here on the H3N2 influenza virus, tosimplify the discussion. However, we have performed relatedanalyses on all virus investigated in our challenge stud-ies, and we found consistent host responses and underlyinggenes/proteins across all of them. This has led us to constitutewhat we term a “pan-viral” factor, with an associated pathwaywe have briefly discussed.

ACKNOWLEDGEMENT

The research reported here was supported by the DefenseAdvanced Research Projects Agency (DARPA), under thePredicting Health and Disease (PHD) program. The findingsof this paper are those of the authors only.

REFERENCES

[1] Z. Bar-Joseph, G. Gerber, D. Gifford, T. Jaakkola, and I. Simon, “Con-tinuous representations of time-series gene expression data,” Journal ofComputational Biology, vol. 10, pp. 3–4, 2003.

[2] Z. Bar-Joseph, “Analyzing time series gene expression data,” Bioinfor-matics, vol. 20, pp. 2493–2503, 2004.

[3] Q. Liu, K. Lin, B. Andersen, P. Smyth, and A. Ihler, “Estimatingreplicate time shifts using gaussian process regression,” Bioinformatices,vol. 26, pp. 770–776, 2010.

[4] G. James and T. Hastie, “Functional linear discriminant analysis forirregularly sampled curves,” Journal of the Royal Statistical SocietySeries B, vol. 63, pp. 533–550, 2001.

[5] Y. H. et al., “Temporal dynamics of host molecular responses dif-ferentiate symptomatic and asymptomatic influenza a infection,” PLoSGenetics, p. 7(8): e1002234. doi:10.1371/journal.pgen.1002234, 2011.

[6] N. Heard, C. Holmes, and D. Stephens, “A quantitative study of generegulation involved in the immune response of anopheline mosquitoes:An application of bayesian hierarchical clustering of curves,” Journal ofthe American Statistical Association, vol. 101, pp. 18–29, 2006.

[7] T. Scharl, B. Grun, and F. Leisch, “Mixtures of regression models fortime-course gene expression data: evaluation of initialization and randomeffects,” Bioinformatics, vol. 26, pp. 370–377, 2010.

[8] L. Wang, G. Chen, and H. Li, “Group SCAD regression analysis formicroarray time course gene expression data,” Bioinformatics, vol. 23,pp. 1486–1494, 2007.

[9] C. Carvalho, J. Chang, J. Lucas, J. Nevins, Q. Wang, and M. West,“High-dimensional sparse factor modelling: Applications in gene ex-pression genomics,” Journal of the American Statistical Association, vol.103, pp. 1438–1456, 2008.

[10] T. Griffiths and Z. Ghahramani, “Infinite latent feature models and theindian buffet process,” in Advances in Neural Information ProcessingSystems, 2005, pp. 475–482.

[11] J. Brieland, D. Essig, C. Jackson, D. Frank, D. Loebenberg, F. Menzel,B. Arnold, B. DiDomenico, and R. Hare, “Comparison of pathogenesisand host immune responses to Candida glabrata and Candida albicans insystemically infected immunocompetent mice,” Infect. Immun., vol. 69,pp. 5046–5055, 2001.

[12] R. Turner, “Ineffectiveness of intranasal zinc gluconate for preventionof experimental rhinovirus colds,” Clin. Infect. Dis., vol. 33, pp. 1865–1870, 2001.

[13] M. Lewicki and T. Sejnowski, “Learning overcomplete representations,”Neural computation, vol. 12, no. 2, pp. 337–365, 2000.

[14] M. Lee and Y. Kim, “Signaling pathways downstream of pattern-recognition receptors and their cross talk,” Biochemistry, vol. 76, no. 1,p. 447, 2007.

[15] A. Tibau-Puig, A. Wiesel, A. Zaas, G. S. Ginsburg, G. Fleury, andA. O. Hero, “Order preserving factor analysis,” in IEEE Workshop onSensor, Array and Multichannel Signal Processing (SAM), Jerusalem,Israel, Sept 2010.

[16] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principalcomponent analysis,” in Proc. AISTATS, 2009.

[17] T. Kolda and B. Bader, “Tensor decompositions and applications,” SIAMReview, vol. 51, no. 3, pp. 455–500, 2009.

[18] D. Lee and H. Seung, “Learning the parts of objects by non-negativematrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[19] M. Taylor, T. Tsukahara, J. McClintick, H. Edenberg, and P. Kwo,“Cyclic changes in gene expression induced by Peg-interferon alfa-2b plus ribavirin in peripheral blood monocytes(PBMC) of hepatitis Cpatients during the first 10 weeks of treatment,” Journal of TranslationalMedicine, vol. 6, no. 1, p. 66, 2008.

[20] A. Zaas, M. Chen, J. Lucas, T. Veldman, A. Hero, J. Varkey, R. Turner,C. Oien, S. Kingsmore, L. Carin, C. Woods, and G. Ginsburg, “Pe-ripheral blood gene expression signatures characterize symptomaticrespiratory viral infection,” Cell Host & Microbe, vol. 6, pp. 207–217,2009.

[21] M. Chen, A. Zaas, C. Woods, G. Ginsburg, J. Lucas, D. Dunson, andL. Carin, “Predicting viral infection from high-dimensional biomarkertrajectories,” submitted to J. Am. Statistical Association, 2010.

[22] A. Polpitiya, W.-J. Qian, N. Jaitly, V. Petyuk, J. Adkins, D. Camp,A. Gordon, and R. Smith, “Dante: a statistical tool for quantitativeanalysis of -omics data,” Bioinformatics, vol. 24, no. 13, pp. 1556–1558,2008.

[23] Y. Karpievitch, J. Stanley, T. Taverner, J. Huang, J. N. Adkins, C. An-song, F. Heffron, T. O. Metz, W.-J. Qian, H. Yoon, R. D. Smith,and A. R. Dabney, “A statistical framework for protein quantitation inbottom-up ms-based proteomics,” Bioinformatics, vol. 25, no. 16, pp.2028–2034, 2009.

[24] D. S. Daly, K. K. Anderson, E. A. Panisko, S. O. Purvine, R. Fang,M. E. Monroe, and S. E. Baker, “Mixed-effects statistical model forcomparative lc-ms proteomics studies,” Journal of Proteome Research,vol. 7, pp. 1209–1217, 2008.

[25] T. Clough, M. Key, I. Ott, S. Ragg, G. Schadow, and O. Vitek, “Proteinquantitation in label-free lc-ms experiments,” Journal of ProteomeResearch, vol. 8, pp. 5275–5284, 2009.

[26] J. Lucas, J. Thompson, L. Dubois, J. McCarthy, K. Patel, H. Tillman,A. Thompson, J. McHutchison, and M. Moseley, “Metaprotein expres-sion modeling for label-free quantitative proteomics,” working paper.

[27] R. Henao, J. W. Thompson, M. Moseley, G. Ginsburg,L. Carin, and J. Lucas, “Latent protein trees,” WorkingPaper, 2011. [Online]. Available: http://people.genome.duke.edu/∼jel2/workingPapers/latentProteinTrees.pdf

[28] J. F. C. Kingman, “The coalescent,” Stochastic Processesand their Applications, vol. 13, no. 3, pp. 235 – 248,1982. [Online]. Available: http://www.sciencedirect.com/science/article/B6V1B-45FT7RY-V/2/9aa9d856ebcd608a17099d47bb63319b

[29] J. Pearl, Probabilistic reasoning in intelligent systems: networks ofplausible inference. San Francisco, CA, USA: Morgan KaufmannPublishers Inc., 1988.

[30] M. Chen, D. Carlson, A. Zaas, C. Woods, G. Ginsburg, A. Hero,J. Lucas, and L. Carin., “Detection of viruses via statistical geneexpression analysis,” Biomedical Engineering, IEEE Transactions on,vol. 58, no. 3, pp. 468 –479, 2011.

[31] C. Zhao, G. Sun, S. Li, M.-F. Lang, S. Yang, W. Li, , and Y. Shi, “Mi-croRNA let-7b regulates neural stem cell proliferation and differentiationby targeting nuclear receptor TLX signaling,” Proc. Nat. Ac. Sciences,vol. 107, pp. 1876–1881, 2010.

[32] G. Versteeg, B. Hale, S. van Boheemen, T. Wolff, D. Lenschow, andA. Garca-Sastre, “Species-specific antagonism of host ISGylation by

11

the influenza B virus NS1 protein,” J. Virology, vol. 10, pp. 5423–5430,2010.

analysis of high-dimensional longitudinal genomic data for...

Documents