Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee presented by J. Jiang

Post on 14-Jan-2016


Page 1: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry

University of Tennessee

presented by J. Jiang

Page 2: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Outline

Brief Overview of Biomedical Literature Mining
The Gene Clustering Problem
Latent Semantic Indexing
Experiments
Conclusions and Discussions

Page 3: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Biomedical Literature Mining: Brief Overview

Goal: to find useful information from the large amount of biomedical literature

Tasks include:
  Identifying relevant literature for a given gene/protein
  Connecting genes with diseases
  Grouping genes/proteins by functions
  Reconstructing and predicting gene networks

(ISMB '05 Tutorial Proposal, H. Shatkay)

Page 4: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Biomedical Literature Mining: Brief Overview (cont.)

Approaches:
  IE & NLP: entities, relations, facts, etc. Many methods rely on co-occurrences of genes/proteins.
  IR: text categorization and summarization, etc.
  Hybrid: combining multiple techniques

Challenges include:
  No fixed nomenclature or sentence structure
  Indirect links
  Etc.

Page 5: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

The Gene Clustering Problem

To group genes based on their functions

Previous work:
  Co-occurrence of gene symbols to extract gene relationships
  Implicit textual relationships
  Gene clustering using functional information in annotated indices or MEDLINE abstracts

Page 6: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Vector Space Model for Gene Clustering

Glenisson et al., 2003:
  Bag-of-words, vector space model
  Cosine similarity
  K-medoids algorithm

This paper tries to improve the vector representation of documents using LSA.
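The bag-of-words representation and cosine similarity listed above can be sketched as follows. This is a minimal illustration, not the setup of Glenisson et al. or of this paper: the three toy documents, the gene names, and the helper names `bag_of_words` and `cosine` are all invented for the example, and raw word counts are used with no term weighting.

```python
from collections import Counter

import numpy as np

# Toy documents (hypothetical; invented for illustration only).
docs = {
    "geneA": "reelin signaling neuronal migration cortex",
    "geneB": "reelin receptor neuronal migration",
    "geneC": "tumor suppressor cell cycle",
}

# Vocabulary: every distinct word, in a fixed order.
vocab = sorted({w for text in docs.values() for w in text.split()})


def bag_of_words(text):
    """Return the raw term-count vector of `text` over `vocab`."""
    counts = Counter(text.split())
    return np.array([counts[w] for w in vocab], dtype=float)


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


vectors = {name: bag_of_words(text) for name, text in docs.items()}
sim_ab = cosine(vectors["geneA"], vectors["geneB"])  # three shared words
sim_ac = cosine(vectors["geneA"], vectors["geneC"])  # no shared words
print(sim_ab, sim_ac)
```

Documents with no words in common get similarity 0 here even if they are related in meaning, which is the limitation that motivates LSA.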

Page 7: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Background: LSA

First studied by Deerwester et al., Indexing by Latent Semantic Analysis, J Am Soc Inf Sci, 1990

Motivation: inaccuracy of term matching due to polysemy and synonymy

Assumption: existence of latent semantic structure (“artificial concepts”)

Dimension reduction. Keep the most important dimensions. Similar to PCA.

Page 8: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Singular Value Decomposition

$d$ documents, $t$ terms (in general, $t \gg d$)

$t \times d$ matrix $X = [x_{ij}]$, where $x_{ij}$ denotes the frequency of term $i$ in document $j$

$X$ can be decomposed as:

$X = T_0 S_0 D_0^\top$,

where the columns of $T_0$ are the eigenvectors of $X X^\top$, and the columns of $D_0$ are the eigenvectors of $X^\top X$. $S_0$ is diagonal, and $S_0^2$ is the matrix of (nonzero) eigenvalues of $X X^\top$ (or $X^\top X$).
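The decomposition and its link to the eigenvalues can be checked numerically. The sketch below assumes that X is stored with terms as rows and documents as columns, uses a small random count matrix as stand-in data, and relies on `numpy.linalg.svd`; the variable names are chosen to mirror the slide and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 6, 4  # toy sizes: 6 terms, 4 documents
X = rng.integers(0, 5, size=(t, d)).astype(float)

# Thin SVD: X = T0 @ diag(s0) @ D0t, with D0t the transpose of D0.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)
D0 = D0t.T

# The three factors reproduce X.
reconstruction_ok = np.allclose(X, T0 @ S0 @ D0t)

# Squared singular values are the eigenvalues of X^T X.
# eigvalsh returns ascending order, so reverse it to match s0.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
eigenvalues_ok = np.allclose(eigvals, s0 ** 2)

# Columns of D0 are eigenvectors of X^T X: (X^T X) D0 = D0 S0^2.
eigenvectors_ok = np.allclose(X.T @ X @ D0, D0 @ S0 @ S0)
print(reconstruction_ok, eigenvalues_ok, eigenvectors_ok)
```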

Page 9: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

SVD (cont.)

The diagonal elements of S0 are constructed to be positive and ordered in decreasing magnitude.

Page 10: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

SVD (cont.)

The eigenvector with the largest eigenvalue represents the dimension along which the variance of the data is maximized.

Keep the $k$ largest elements of $S_0$, set the others to zero, and remove the corresponding columns (eigenvectors) of $T_0$ and $D_0$. $X$ can then be approximated by:

$X \approx \hat{X} = T S D^\top$.
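The truncation step can be sketched as below. The helper name `truncated_svd`, the random test matrix, and the choice k = 2 are assumptions made for the example; NumPy already returns the singular values in decreasing order, so keeping the k largest amounts to keeping the first k.

```python
import numpy as np


def truncated_svd(X, k):
    """Return T (t x k), S (k x k, diagonal) and D (d x k) for the rank-k approximation."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    T = T0[:, :k]        # first k left singular vectors
    S = np.diag(s0[:k])  # k largest singular values
    D = D0t[:k, :].T     # first k right singular vectors, as columns
    return T, S, D


rng = np.random.default_rng(1)
X = rng.random((8, 5))  # toy matrix: 8 terms, 5 documents
T, S, D = truncated_svd(X, k=2)
X_hat = T @ S @ D.T     # same shape as X, but rank 2
print(X_hat.shape, np.linalg.matrix_rank(X_hat))
```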

Page 11: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

SVD (cont.)

$\hat{X}$ is the best least-squares fit to $X$ among all matrices of rank $k$.
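A small numerical check of this least-squares property (the Eckart-Young theorem) is sketched here. The random matrix and the competitor, an orthogonal projection of X onto a random k-dimensional subspace, are assumptions chosen for illustration; a single competitor illustrates the claim but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((8, 5))
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
X_hat = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]

# Frobenius error of the truncated SVD equals the root of the
# sum of the squared singular values that were discarded.
err_svd = np.linalg.norm(X - X_hat, "fro")
err_expected = np.sqrt(np.sum(s0[k:] ** 2))

# Another matrix of rank at most k: project X onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.random((8, k)))
X_other = Q @ (Q.T @ X)
err_other = np.linalg.norm(X - X_other, "fro")

print(err_svd, err_expected, err_other)
```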

Page 12: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Illustration

[Figure: illustration of the first eigenvector and the second eigenvector of a data set (taken from "A Tutorial on PCA" by Lindsay Smith)]

Page 13: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

LSA with SVD

Terms are represented by rows of $\hat{X}$ and documents are represented by columns of $\hat{X}$ in the reduced space.

Doc-to-doc similarity:

$\hat{X}^\top \hat{X} = D S^2 D^\top = (DS)(DS)^\top$.

Query is represented as a pseudo-document:

$D_q = X_q^\top T S^{-1}$,

where $X_q$ is the query vector in the original space. $D_q$ is like a row of $D$.

Query-to-doc similarity:

$D_q S (DS)^\top$.
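The similarity formulas on this slide can be sketched in NumPy as follows, assuming a term-by-document matrix X and a rank-k truncation. The data are random counts, and the query is simply the first document's own term vector, picked so that the folded-in query should land exactly on that document's row of D. Names such as `doc_coords` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
t, d, k = 10, 6, 3
X = rng.integers(0, 4, size=(t, d)).astype(float)  # toy term-by-document counts

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Documents in the reduced space are the rows of D S.
doc_coords = D @ S
doc_doc = doc_coords @ doc_coords.T  # equals X_hat^T X_hat

# Fold a query into the reduced space: D_q = X_q^T T S^{-1}.
X_q = X[:, 0]                        # query = term vector of document 0
D_q = X_q @ T @ np.linalg.inv(S)

# Query-to-document similarities: D_q S (D S)^T.
query_doc = D_q @ S @ doc_coords.T
print(np.round(query_doc, 3))
```

These are raw inner products in the reduced space; in practice they are often normalized to cosines before ranking, which is a design choice not specified on the slide.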

Page 14: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Experiments

50 genes in three categories, (1) development, (2) Alzheimer's disease, and (3) cancer biology, were selected

Gene-document: concatenation of abstracts known to be related to the gene

Gene-document represented as vectors:
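One way to build gene-documents and turn them into vectors is sketched below. The gene names and abstracts are invented, and raw term counts are used as the matrix entries; the weighting actually used in the paper is not given in this transcript, so the counts are only an assumption for the example.

```python
from collections import Counter

import numpy as np

# Hypothetical mapping from gene to its related abstracts (toy text).
abstracts_by_gene = {
    "GENE1": ["neuronal migration in cortex", "migration defects in mutant mice"],
    "GENE2": ["tumor growth and cell cycle"],
}

# A gene-document is the concatenation of all abstracts for that gene.
gene_docs = {gene: " ".join(texts) for gene, texts in abstracts_by_gene.items()}

genes = sorted(gene_docs)
terms = sorted({w for text in gene_docs.values() for w in text.split()})

# Term-by-gene-document matrix: X[i, j] = count of term i in gene-document j.
X = np.zeros((len(terms), len(genes)))
for j, gene in enumerate(genes):
    counts = Counter(gene_docs[gene].split())
    for i, term in enumerate(terms):
        X[i, j] = counts[term]

print(X.shape)
```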

Page 15: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Experiments (cont.)

Keyword query and accession number query
Reelin signaling pathway
GO classification terms and human disease
Direct genes and indirect genes
Hierarchical Clustering
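For the hierarchical clustering step, a minimal agglomerative sketch is given below. The slide does not say which linkage or distance was used, so average linkage on cosine distance is an assumption, as are the function name `average_linkage` and the four toy gene vectors.

```python
import numpy as np


def average_linkage(points, n_clusters):
    """Merge the two closest clusters (average cosine distance) until n_clusters remain."""
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T  # pairwise cosine distances
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d_ab = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d_ab < best[0]:
                    best = (d_ab, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy gene vectors in a 2-dimensional reduced space (hypothetical values).
gene_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.8],
])
groups = average_linkage(gene_vectors, n_clusters=2)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the full tree.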

Page 16: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Results

Page 17: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Results (cont.)

Page 18: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Results (cont.)

Tried 5, 25, and 50 dimensions. 50 is shown to perform the best.

Tried reducing the number of abstracts for the Reelin genes. Claimed that AP was not significantly reduced when 50% of the abstracts were removed.

Claimed that hierarchical clustering agrees with biological relationships.
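Assuming AP here denotes average precision in the usual information-retrieval sense, it can be computed for a ranked gene list as sketched below. The ranking, the set of relevant genes, and the function name are invented for the example, and the paper may use a different variant of the measure.

```python
def average_precision(ranked_items, relevant):
    """Non-interpolated average precision of a ranked list against a relevant set."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical ranking returned for a query, and the genes judged relevant.
ranking = ["g1", "g7", "g2", "g9", "g3"]
relevant_genes = {"g1", "g2", "g3"}
ap = average_precision(ranking, relevant_genes)
print(ap)
```

In this toy ranking the relevant genes sit at ranks 1, 3, and 5, so AP is the mean of 1/1, 2/3, and 3/5.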

Page 19: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Discussions

Pros:
  Gene clustering by textual information.
  Applied LSA to biomedical literature.
  Indirect linkage can be found through latent concepts.

Cons:
  Requires human annotation to construct gene-documents. Not applicable to new domain.
  Genes in the experiments are carefully chosen in 3 categories. How does the method perform in general?

Other gene clustering methods?

Page 20: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

References

S. Deerwester et al. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.

M.A. Girolami (2004). Latent Semantic Analysis: A General Tutorial Introduction. http://ir.dcs.gla.ac.uk/oldseminars/Girolami.ppt

H. Shatkay (2005). ISMB '05 Tutorial Proposal. http://www.iscb.org/ismb2005/tutorials/pm10.pdf

H. Shatkay & R. Feldman (2003). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10(6), 821-855.

Page 21: Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

The End

Questions? Thank you!