Post on 15-Jan-2016

Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models

Simon Rogers, Mark Girolami, Walter Kolch, Katrina M. Waters, Tao Liu, Brian Thrall and H. Steven Wiley

BIOINFORMATICS Vol. 24 no. 24 2008, pages 2894-2900

Outline

● Introduction

● The coupled mixture model

● Results and discussion

● Conclusions

Introduction

● Proteomics:

– "Proteome" is a blend of "protein" and "genome"

– Large-scale study of proteins

– More complicated than genomics

– mRNA is not always translated into protein, and the amount of protein produced for a given amount of mRNA depends on the gene it is transcribed from and on the current physiological state of the cell.

● Modern transcriptomics and proteomics enable us to survey the expression of RNAs and proteins at large scales.

Introduction

● There is increasing interest in comparing and co-analyzing transcriptome and proteome expression data.

– A major open question is whether transcriptome and proteome expression are linked and how they are coordinated.

– Make inferences and predictions about how the network of regulatory control varies at the mRNA and protein levels.

Introduction

● Two strategies:

– Concatenating:

● Given mRNA and proteomic data for some set of N genes at T time points.

● Combine both datasets into one vector of length 2T per gene.

● Groups together genes that share similar mRNA and protein profiles.

● Inflexible! What about genes that share similar mRNA profiles but have very different protein profiles (and vice versa)?

Introduction

● Many more clusters arise in the concatenated space than in either individual representation, because concatenation doubles the size of the feature space without increasing the number of data instances.

– Clustering completely independently

● Loses the explicit relationship between the two datasets
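The two representations contrasted above can be sketched in a few lines. This is only an illustration under assumptions: the array names, the sizes N and T, and the random data are hypothetical and are not taken from the paper.

```python
import numpy as np

def concatenate_profiles(mrna, protein):
    """Join (N, T) mRNA and (N, T) protein matrices into one (N, 2T) matrix."""
    if mrna.shape[0] != protein.shape[0]:
        raise ValueError("both matrices must describe the same N genes")
    return np.hstack([mrna, protein])

rng = np.random.default_rng(0)
N, T = 6, 4                          # hypothetical sizes, for illustration only
mrna = rng.normal(size=(N, T))       # stand-in for mRNA time courses
protein = rng.normal(size=(N, T))    # stand-in for protein time courses

# Strategy 1: one clustering input with 2T features per gene.
joint = concatenate_profiles(mrna, protein)
print(joint.shape)

# Strategy 2: keep `mrna` and `protein` as two separate clustering inputs;
# nothing then ties a gene's mRNA cluster to its protein cluster.
```

The feature dimension doubles while the number of rows (genes) stays at N, which is the point made on the slide.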

Introduction

● A probabilistic clustering model that couples together transcriptomic and proteomic profiles

– based on two coupled statistical mixture models.

[Figure: a scale running from "clustering independently" at one end to "concatenating" at the other]

For a particular dataset, at which point on this scale does our model naturally sit?

The coupled mixture model

● Assuming we have two separate mixture models, one for the mRNA data (with K components) and one for the proteomic data (with J components)

● Prior distribution over both sets of components : p(k,j)

● Factorize the joint prior as p(k,j) = p(k) p(j|k)

– the components of p(j|k) provide us with details of the relationship between expression at the mRNA and protein levels.

The coupled mixture model

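● Define p(k) as π_k and p(j|k) as θ_jk, and write π and θ for the complete sets of these parameters. The likelihood of a dataset X of G genes is

$$ p(X \mid \pi, \theta, \Delta) = \prod_{g=1}^{G} \sum_{k=1}^{K} \sum_{j=1}^{J} \pi_k \, \theta_{jk} \, p(x_g^{m} \mid \Delta_k) \, p(x_g^{p} \mid \Delta_j) $$

where x_g^m and x_g^p denote the mRNA and protein profiles of gene g, and Δ_k and Δ_j correspond to any parameters unique to the k-th mRNA cluster and the j-th protein cluster, respectively.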
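A minimal sketch of how such a likelihood could be evaluated, assuming each gene contributes a sum over k and j of π_k θ_jk p(mRNA profile | Δ_k) p(protein profile | Δ_j). As a simplifying assumption the component densities here are unit-variance spherical Gaussians, so Δ_k and Δ_j reduce to mean vectors; the function names and the array layout theta[k, j] = p(j|k) are hypothetical choices for this example.

```python
import numpy as np

def log_gauss(X, means):
    """Log density of every row of X under unit-variance spherical Gaussians.

    X has shape (G, d), means has shape (C, d); the result has shape (G, C).
    """
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * (d * np.log(2.0 * np.pi) + sq)

def coupled_log_likelihood(Xm, Xp, pi, theta, mu_m, mu_p):
    """Sum over genes of log sum_k sum_j pi_k theta_jk p(x_m | k) p(x_p | j).

    theta has shape (K, J) with theta[k, j] = p(j | k); all entries of pi and
    theta are assumed strictly positive.
    """
    log_joint = (np.log(pi)[None, :, None] + np.log(theta)[None, :, :]
                 + log_gauss(Xm, mu_m)[:, :, None]
                 + log_gauss(Xp, mu_p)[:, None, :])          # shape (G, K, J)
    # Log-sum-exp over (k, j) for numerical stability.
    top = log_joint.max(axis=(1, 2), keepdims=True)
    per_gene = top[:, 0, 0] + np.log(np.exp(log_joint - top).sum(axis=(1, 2)))
    return float(per_gene.sum())
```

With K = J = 1 the expression collapses to the sum of two ordinary Gaussian log-likelihoods, which is a convenient sanity check.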

The coupled mixture model

● Initialisation and re-starting

– The expectation-maximization (EM) algorithm can be used to find a local maximum of a lower bound on the likelihood function.

– Sensitive to initial conditions

● Run the algorithm from 100 random initializations and keep the one that gives the highest value of the lower bound.
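The restart strategy described above could look roughly like the following. This is a sketch under stated assumptions, not the authors' implementation: the components are unit-variance spherical Gaussians, the means are initialised from randomly chosen genes, and `em_coupled`, `best_of_restarts` and `EPS` are names invented for this example. After each E-step the lower bound equals the log-likelihood at the current parameters, so that value is used to rank restarts.

```python
import numpy as np

EPS = 1e-10  # guards against log(0) and division by zero

def log_gauss(X, means):
    """(G, C) log densities under unit-variance spherical Gaussians."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * (d * np.log(2.0 * np.pi) + sq)

def em_coupled(Xm, Xp, K, J, rng, n_iter=50):
    """One EM run; returns (log-likelihood at the last E-step, parameters)."""
    G = Xm.shape[0]
    mu_m = Xm[rng.choice(G, size=K, replace=False)]   # random initial means
    mu_p = Xp[rng.choice(G, size=J, replace=False)]
    pi = np.full(K, 1.0 / K)
    theta = np.full((K, J), 1.0 / J)                  # theta[k, j] = p(j | k)
    for _ in range(n_iter):
        # E-step: joint responsibilities q[g, k, j].
        log_joint = (np.log(pi)[None, :, None] + np.log(theta)[None, :, :]
                     + log_gauss(Xm, mu_m)[:, :, None]
                     + log_gauss(Xp, mu_p)[:, None, :])
        top = log_joint.max(axis=(1, 2), keepdims=True)
        log_norm = top + np.log(
            np.exp(log_joint - top).sum(axis=(1, 2), keepdims=True))
        q = np.exp(log_joint - log_norm)
        # M-step: re-estimate priors and means from the responsibilities.
        qk, qj = q.sum(axis=2), q.sum(axis=1)
        nk, nj = qk.sum(axis=0), qj.sum(axis=0)
        pi = (nk + EPS) / (G + K * EPS)
        theta = (q.sum(axis=0) + EPS) / (nk[:, None] + J * EPS)
        mu_m = (qk.T @ Xm) / (nk[:, None] + EPS)
        mu_p = (qj.T @ Xp) / (nj[:, None] + EPS)
    return float(log_norm.sum()), (pi, theta, mu_m, mu_p)

def best_of_restarts(Xm, Xp, K, J, n_restarts=100, seed=0):
    """Run EM from several random initialisations and keep the best run."""
    best = None
    for r in range(n_restarts):
        ll, params = em_coupled(Xm, Xp, K, J, np.random.default_rng(seed + r))
        if best is None or ll > best[0]:
            best = (ll, params)
    return best
```

A call such as `best_of_restarts(Xm, Xp, K=15, J=20)` would mirror the 100-restart procedure, with `Xm` and `Xp` being gene-by-timepoint matrices for the same genes.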

The coupled mixture model

● Reproducibility

– The symmetry of the likelihood with respect to permutations of the component labels (j and k) makes it very difficult to compare results produced from multiple restarts.

– Instead, compare the enriched GO terms across multiple restarts.

– If the results are reproducible, we would expect a significant proportion of GO terms to be enriched over many random initializations.
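Because GO terms do not depend on how components happen to be labelled, their recurrence across restarts can be tallied directly. The sketch below assumes each restart has already produced a set of enriched terms; the term names, the four example runs and the 0.75 threshold are hypothetical.

```python
from collections import Counter

def term_recurrence(enriched_per_restart):
    """Fraction of restarts in which each term was reported as enriched."""
    n_restarts = len(enriched_per_restart)
    counts = Counter(term
                     for terms in enriched_per_restart
                     for term in set(terms))
    return {term: count / n_restarts for term, count in counts.items()}

# Hypothetical output of four restarts (placeholder term names).
runs = [
    {"term_a", "term_b"},
    {"term_a"},
    {"term_a", "term_c"},
    {"term_a", "term_b"},
]
freq = term_recurrence(runs)
stable = sorted(t for t, f in freq.items() if f >= 0.75)
print(stable)
```

Terms that recur in most restarts are the ones that would count as reproducible in the sense used on the slide.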

Results and Discussion

● Apply to a large dataset of quantitative transcriptomic and proteomic expression data obtained from a human breast epithelial cell line stimulated by epidermal growth factor (EGF) over a series of timepoints corresponding to one cell cycle.

● The numbers of components K and J were determined individually using the Bayesian Information Criterion (BIC) (K=15, J=20)
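For reference, model selection by BIC can be sketched as below, using the convention BIC = p ln(n) - 2 ln(L), where lower is better. The candidate log-likelihoods and parameter counts are made-up numbers for illustration, and `select_by_bic` is a hypothetical helper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; smaller values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_by_bic(candidates, n_obs):
    """candidates maps a component count to (max log-likelihood, n_params)."""
    scores = {c: bic(ll, p, n_obs) for c, (ll, p) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fits of one mixture model with 2, 3 and 4 components.
candidates = {2: (-520.0, 10), 3: (-480.0, 15), 4: (-478.0, 20)}
best, scores = select_by_bic(candidates, n_obs=100)
print(best)
```

Going from 3 to 4 components improves the fit only slightly here, so the parameter penalty dominates and the smaller model is kept.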

Results and Discussion

● Preliminary analysis

– We first clustered the two datasets separately and analyzed the similarity between the resulting clusterings, finding a very low level of agreement.

– We looked at the number of enriched Gene Ontology (GO) terms found when the two representations are clustered individually and when they are concatenated. Significantly fewer were found when the datasets were concatenated than when they were analyzed individually.
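The slides do not say which similarity measure was used to compare the two clusterings. One common choice is the adjusted Rand index, sketched here purely as an assumption; it is label-invariant, equals 1 for identical partitions and is close to 0 for unrelated ones.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items (n >= 2)."""
    n = len(labels_a)
    # Pairs of items placed together in both clusterings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:        # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Same partition under different label names.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Applied to the mRNA-based and protein-based cluster assignments of the same genes, a value near 0 would correspond to the "very low level of agreement" reported above.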

Results and Discussion

• High-level observations

– The model defines a prior distribution over the component to which a protein profile should be assigned, conditioned on the component to which the associated mRNA profile belongs: p(j|k).

– This K×J matrix provides some insight into the level of connectivity between the two representations.

Results and Discussion

– p(j|k) is very diffuse rather than being dominated by a small number of protein clusters.

– Each mRNA cluster is connected to a large number of protein clusters, and vice versa, suggesting that the relationship between transcriptional and translational control is a very complex one.

– Quantify the level of complexity by analyzing the entropy of p(j|k).

• If there were a one-to-one relationship between mRNA and protein clusters, the entropy would be close to 0.
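The entropy argument can be made concrete with a short sketch. It assumes p(j|k) is stored as a K×J array whose rows sum to 1 (a layout chosen for this example) and reports entropy in nats.

```python
import numpy as np

def conditional_entropy_rows(theta):
    """Entropy (in nats) of each row p(j | k) of a (K, J) matrix."""
    theta = np.asarray(theta, dtype=float)
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(theta > 0, theta, 1.0)
    return -(theta * np.log(safe)).sum(axis=1)

one_to_one = np.eye(3)              # each mRNA cluster maps to one protein cluster
diffuse = np.full((3, 4), 0.25)     # each mRNA cluster spreads evenly over 4

print(conditional_entropy_rows(one_to_one))   # all zeros
print(conditional_entropy_rows(diffuse))      # log(4) in every row
```

A one-to-one mapping gives zero entropy, while an even spread over J protein clusters gives the maximum value log(J); a diffuse p(j|k) sits near the upper end of that range.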

Results and Discussion

● The fact that the decrease is so small can be partly explained by the observation that the genes appear to be organized into many small groups with homogeneous mRNA and protein profiles.

[Figure: the left curve gives the true entropy; the right curve gives the entropy obtained when the proteins are permuted.]

[Figure label: p(j|k) > 0.1]

Results and Discussion

● Cluster-cluster relationships

– The ribosomes:

● In this highly complicated network, one very strong connection stands out: the connection between protein cluster j=4 and mRNA clusters k=3 and k=11.

● P(k=3|j=4) = 0.3656 and P(k=11|j=4) = 0.2316, both among the 10 highest values out of the total K×J = 300 values.

● A total of 18 out of the 19 proteins in j=4 are ribosomal and they exhibit an exceptionally high expression homogeneity.

– These proteins must act together to form the large and small ribosomal subunits.
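The values quoted above are of the form P(k|j), whereas the model is parameterised by p(k) and p(j|k). One way to obtain the reversed conditional is Bayes' rule, P(k|j) = p(k) p(j|k) / Σ_k' p(k') p(j|k'). The sketch below assumes that route; the 2×2 numbers are invented for illustration and are not the paper's estimates.

```python
import numpy as np

def posterior_k_given_j(pi, theta):
    """Return a (K, J) array whose entry [k, j] is P(k | j).

    pi has shape (K,) with pi[k] = p(k); theta has shape (K, J) with
    theta[k, j] = p(j | k).
    """
    joint = pi[:, None] * theta                        # joint[k, j] = p(k, j)
    return joint / joint.sum(axis=0, keepdims=True)    # normalise each column

# Hypothetical two-by-two example.
pi = np.array([0.5, 0.5])
theta = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
post = posterior_k_given_j(pi, theta)
print(post)
```

Each column of the result sums to 1, so ranking all K×J entries highlights the strongest links between protein cluster j and mRNA cluster k.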

Copyright restrictions may apply.

Rogers, S. et al. Bioinformatics 2008 24:2894-2900; doi:10.1093/bioinformatics/btn553

Rather similar profile

Quite diverse expression profiles

Isolating mRNA cluster k=3

Enormous diversity of protein profiles

From these observations, it does not seem unreasonable to suggest that all of these processes are heavily regulated at the protein level.

Results and Discussion

– Cell adhesion

Results and Discussion

– The chaperonin TCP-1 complex

Results and Discussion

● Summary

– The correlation between transcription and translation seems to be generally low and to diverge with evolution.

– This correlation becomes very limited in mammals.

– These results indicate that transcriptional (mRNA) and translational (protein) networks may have evolved independently, except on the rare occasions where a strong selection factor in favor of correlation between gene transcription and protein translation was present.

Conclusion

● The model consists of two Gaussian mixtures coupled through a joint prior on the mixture components and allows us to find clusters of genes similar at the mRNA and protein levels and unravel the links between them.

Introduction to GO

● The Gene Ontology project, or GO, provides a controlled vocabulary to describe gene and gene product attributes in any organism.

● http://www.geneontology.org/ (the GO portal site)

● Three important databases in GO:

  1. FlyBase (Drosophila)

  2. Saccharomyces Genome Database

  3. Mouse Genome Database

Ontologies

● The content used in GO falls into the following three categories:

  1. molecular function

  2. biological process

  3. cellular component

● A gene product has one or more molecular functions, is used in one or more biological processes, and may be associated with one or more cellular components.

Molecular function

● In GO, a molecular function represents an activity, not an entity (molecules or complexes).

  Examples of broad functional terms are catalytic activity, transporter activity, or binding;

  Examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

Biological process

● A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions.

  Examples of broad biological process terms are cellular physiological process or signal transduction.

  Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport.

Cellular component

● A cellular component is just that, a component of a cell but with the proviso that it is part of some larger object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Gene Ontology term

● Each GO term has a unique alphanumerical identifier.

● When a term has multiple meanings depending on species, the GO uses a "sensu" tag to differentiate among them.

● Each term belongs to exactly one of the three ontologies, each of which is structured as a directed acyclic graph.
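Because each ontology is a directed acyclic graph rather than a tree, a term can have several parents, and collecting all of its ancestors requires a graph traversal. The sketch below uses a toy graph with placeholder term names; it is not real GO data and not the GO file format.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy DAG (hypothetical names): "c" has two parents, which a tree would not allow.
toy_parents = {
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}
print(sorted(ancestors("d", toy_parents)))
```

In enrichment analyses, a gene annotated with a specific term is usually also counted for every ancestor of that term, which is why this kind of traversal matters.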
