clustering of yeast genes using literature mining
![Page 1: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/1.jpg)
Clustering of yeast genes using Literature Mining
CS 6910 Project
Advisor: Dr. Venu Dasigi
Student: Maitreyee Bhise
![Page 2: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/2.jpg)
Contents
• Text mining
• Overview of the project
• Weight Metrics
• Ohio Supercomputer (MEDLINE database)
• Implementation
• Results and Analysis
• Limitations
• Future Scope
![Page 3: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/3.jpg)
What is text mining?
• Provides a mechanism to handle large amounts of data
• Integrates all sources and adds meaning to them
![Page 4: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/4.jpg)
Background Set
• Reference set against which the query set is compared
• Dictionary of documents and their respective words of importance
• Restricted Background set (used in this project)
• Unrestricted Background set
• Restricted Background set: Only those documents that satisfy a condition
• Unrestricted Background set: All the documents
![Page 5: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/5.jpg)
Query Set and Background Set used
• Query Set: 44 tables, one for each gene
• Each table holds the abstracts that contain the gene name in the title
• Restricted Background set: Union of the 44 gene tables
• Background set is a table with all keywords from the 44 tables
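The relationship between the per-gene query sets and the restricted background set can be sketched in a few lines. This is only an illustration under assumptions: the dictionary `query_sets`, the function `build_background`, and the toy keywords are hypothetical and stand in for the real gene tables.

```python
# Hypothetical toy data: gene name -> list of abstracts, each abstract
# reduced to the set of keywords it contains. The real tables would hold
# MEDLINE abstracts whose title contains the gene name.
query_sets = {
    "Cwp1": [{"cell", "wall", "mannoprotein"}, {"cell", "wall"}],
    "Cln1": [{"cyclin", "cell", "cycle"}],
    "Ste2": [{"pheromone", "receptor"}],
}


def build_background(query_sets):
    """Return the restricted background set: every keyword from every gene table."""
    background = set()
    for abstracts in query_sets.values():
        for keywords in abstracts:
            background |= keywords  # set union accumulates keywords across genes
    return background


background = build_background(query_sets)
print(sorted(background))
```

Because the background is the union of the gene tables, every keyword of every query set is guaranteed to appear in it.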
![Page 6: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/6.jpg)
Overview
• Restricted background set created using MEDLINE (collection of biological documents)
• Used to compare frequency of a word in query set against background set
• Keywords for 44 genes are extracted, analyzed and clustered using weight metrics
• Frequency of each keyword in a query set is compared with background set
• Gene characteristics are discovered on the basis of computed weights
![Page 7: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/7.jpg)
Overview (cont'd)
• The entire MEDLINE collection had been preprocessed earlier and was incorporated into this work
• Project uses a different background set
• Implementation on Ohio Supercomputer
![Page 8: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/8.jpg)
MEDLINE
• MEDLINE is a collection of biological documents provided by the National Library of Medicine (NLM)
• Consists of 23 million abstracts
• Data provided in XML format
• XML data parsed and loaded on the Ohio Supercomputer
• Oakley is the newly built server, an HP Intel Xeon machine with 8,300+ cores
![Page 9: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/9.jpg)
Portable Batch Script
• #PBS -l walltime=hh:mm:ss (running time required for the job)
• #PBS -l nodes=2:ppn=12 (number of nodes and processors per node required)
• #PBS -m abe (emails the user when the job aborts/begins/ends)
• To submit the batch job: qsub Job_Name
![Page 10: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/10.jpg)
MEDLINE Table Structure
• 3 Tables:
  • N_Word
  • N_Document
  • N_WordDocument
• This project uses only N_Document and N_WordDocument tables
![Page 11: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/11.jpg)
MEDLINE Table Structure (cont'd)
![Page 12: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/12.jpg)
Weight Metrics
• Two statistical parameters:
  • Z-Score
  • TF-IDF (Term Frequency-Inverse Document Frequency)
• Calculates frequency of a keyword in query set in comparison with some reference set
• Helps to discriminate high information content words of a document
![Page 13: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/13.jpg)
Stop Words Removal
• Words that are commonly used but don't add meaning
• Used the stop word list provided by PubMed
• Separate table created to store stop word list
• Stop words are removed using simple join with this table
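Removing stop words with a join can be sketched as an anti-join against the stop word table. The example below uses an in-memory SQLite database purely for illustration; the table names `word_document` and `stop_word` and their columns are hypothetical simplifications, not the actual MEDLINE schema used in the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified schema: one row per (document, word) pair.
cur.execute("CREATE TABLE word_document (doc_id INTEGER, word TEXT)")
cur.execute("CREATE TABLE stop_word (word TEXT PRIMARY KEY)")
cur.executemany(
    "INSERT INTO word_document VALUES (?, ?)",
    [(1, "the"), (1, "cell"), (1, "wall"), (2, "of"), (2, "cyclin")],
)
cur.executemany("INSERT INTO stop_word VALUES (?)", [("the",), ("of",)])

# Anti-join: keep only rows whose word has no match in the stop word table.
rows = cur.execute(
    """
    SELECT wd.doc_id, wd.word
    FROM word_document AS wd
    LEFT JOIN stop_word AS sw ON sw.word = wd.word
    WHERE sw.word IS NULL
    ORDER BY wd.doc_id, wd.word
    """
).fetchall()
print(rows)
```

Keeping the stop words in their own table means the list can be changed without touching the word-document data.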
![Page 14: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/14.jpg)
Z-Score
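• Z-Score of a word ‘a’ in a gene ‘n’:

$$Z_a^n = \frac{F_a^n - \bar{F}_a}{\sigma_a}$$

Where

$$\sigma_a = \sqrt{\frac{\sum_{i=1}^{N} \left(F_a^i - \bar{F}_a\right)^2}{N}}$$

• $F_a^n$ = Document frequency of a word a for gene n
• $\bar{F}_a$ = mean frequency of a word a
• $\sum_{i=1}^{N} F_a^i$ = sum of $F_a^i$ for all genes
• $\sigma_a$ = Standard Deviation of word a
• N = Number of Genes (or set of group of documents)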
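As a rough sketch of the z-score computation, the function below takes the document frequency of one word in each gene and returns the z-score per gene as (frequency minus mean) divided by standard deviation. It assumes the population standard deviation (dividing by N); the function name `z_scores` and the sample frequencies are hypothetical.

```python
import math


def z_scores(doc_freq):
    """doc_freq maps gene -> document frequency of one word; returns gene -> z-score."""
    n = len(doc_freq)
    mean = sum(doc_freq.values()) / n
    # Assumption: population standard deviation (divide by N, not N - 1).
    sigma = math.sqrt(sum((f - mean) ** 2 for f in doc_freq.values()) / n)
    if sigma == 0:
        # The word is equally frequent in every gene, so it discriminates nothing.
        return {gene: 0.0 for gene in doc_freq}
    return {gene: (f - mean) / sigma for gene, f in doc_freq.items()}


# Hypothetical document frequencies of the word "wall" across four genes.
wall = {"Cwp1": 6, "Cln1": 0, "Cln2": 1, "Ste2": 1}
scores = z_scores(wall)
print({gene: round(z, 3) for gene, z in scores.items()})
```

The gene in which the word is over-represented relative to the others receives the largest positive score.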
![Page 15: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/15.jpg)
Z-Score (cont'd) (Document Frequency calculation)
• Captures the strength of a keyword in a collection set
• Emphasis on distribution of the keyword across genes
• It is the number of documents that contain the word
• Calculated with respect to a gene (or related group of documents)
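A minimal sketch of the document frequency count for a single gene follows. The helper `document_frequency`, the whitespace tokenization, and the three sample abstracts are assumptions made for illustration only.

```python
from collections import Counter


def document_frequency(abstracts):
    """Count, for each word, how many abstracts of one gene contain it."""
    df = Counter()
    for text in abstracts:
        # set() so a word repeated inside one abstract is counted only once
        df.update(set(text.lower().split()))
    return df


# Hypothetical abstracts belonging to the gene Cwp1.
cwp1_abstracts = [
    "cell wall protein cell wall",
    "cell wall mannoprotein",
    "protein secretion",
]
df = document_frequency(cwp1_abstracts)
print(df["cell"], df["protein"], df["secretion"])
```

Note that "cell" appears three times in total but only in two abstracts, so its document frequency is 2, not 3.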
![Page 16: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/16.jpg)
Z-Score for gene Cwp1 (cont'd) (Document Frequency calculation)
![Page 17: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/17.jpg)
Z-Score for Cwp1 (cont'd) (Mean Frequency calculation)
• Sum of the document frequencies of a keyword across the different genes, divided by the total number of genes.
![Page 18: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/18.jpg)
Z-Score for gene Cwp1 (cont’d) (Numerator calculation)
![Page 19: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/19.jpg)
Sum of squares of the Numerator for gene Cwp1
![Page 20: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/20.jpg)
Z-Score (cont’d) (Standard Deviation Calculation)
• Indicates deviation from the normal (mean) value
• The lower the standard deviation, the more likely the word is a high-information-content word
• For a lower standard deviation, the z-score will be higher
![Page 21: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/21.jpg)
Z-Score for gene Cwp1 (cont’d) (Standard Deviation Calculation)
![Page 22: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/22.jpg)
Z-Score for gene Cwp1 (cont’d)
![Page 23: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/23.jpg)
TF-IGF Calculations
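• TFIGF of a word ‘a’ in gene ‘n’:

$$\mathrm{TFIGF}_a^n = \mathrm{TF}_a^n \times \mathrm{IGF}_a$$

Where $\mathrm{TF}_a^n = \sum_{d} tf_a^d$, summed over the documents d of gene n

And $\mathrm{IGF}_a = \log \frac{N}{\mathrm{GF}_a}$, where $\mathrm{GF}_a$ is the number of genes that contain the word ‘a’ and N is the number of genes

• Emphasis on importance of a keyword within a gene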
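The TF-IGF weighting can be sketched as follows: the term frequency of a word within a gene is multiplied by the log of (number of genes divided by number of genes containing the word). This is an illustration under assumptions: the natural logarithm, the function name `tf_igf`, and the toy term frequencies are all hypothetical choices.

```python
import math


def tf_igf(term_freq):
    """term_freq maps gene -> {word: total term frequency over the gene's documents}."""
    n_genes = len(term_freq)
    # GF: number of genes whose documents contain the word at least once.
    gf = {}
    for words in term_freq.values():
        for word, tf in words.items():
            if tf > 0:
                gf[word] = gf.get(word, 0) + 1
    # Assumption: IGF = log(N / GF), using the natural logarithm.
    return {
        gene: {w: tf * math.log(n_genes / gf[w]) for w, tf in words.items() if tf > 0}
        for gene, words in term_freq.items()
    }


# Hypothetical per-gene term frequencies.
tf = {
    "Cwp1": {"cell": 5, "wall": 4},
    "Cln1": {"cell": 3, "cyclin": 6},
    "Ste2": {"cell": 2, "pheromone": 7},
}
weights = tf_igf(tf)
print(weights["Cwp1"])
```

A word such as "cell" that occurs in every gene gets an IGF of log(1) = 0, so its weight vanishes no matter how frequent it is, which is how this metric filters out unwanted common keywords.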
![Page 24: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/24.jpg)
TF-IGF for gene “Cwp1” (cont’d) (Term Frequency Calculations)
![Page 25: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/25.jpg)
TF-IGF (cont’d) (Group Frequency Calculations for gene “Cwp1”)
![Page 26: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/26.jpg)
TF-IGF (cont’d) (Inverse Group Frequency Calculations for gene “Cwp1”)
![Page 27: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/27.jpg)
TF-IGF for gene “Cwp1” (cont’d)
![Page 28: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/28.jpg)
Results
• Determine high-information content words for a gene
• The higher the z-score of a keyword in a gene, the more specifically it describes the functionality of that gene
• The higher the TF-IGF value of a keyword in a gene, the more unique it is to that gene compared to other genes
• TF-IGF yields better-quality keywords because it filters out unwanted keywords
![Page 29: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/29.jpg)
Top keywords for gene “Cwp1” using z-score
• Irrespective of the document frequency, the top 75 out of 1612 keywords have the same high z-score of 6.245
• Top 75 keywords are unique to Cwp1
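One way to see why keywords unique to a gene share a single z-score: if a word occurs in only one of N genes with document frequency f, the mean is f/N, the numerator is f(N-1)/N, and, assuming the population standard deviation, the standard deviation is f√(N-1)/N, so f cancels and the z-score is √(N-1). The sketch below checks this numerically. The function `unique_word_z` is hypothetical, and N = 40 is chosen only because √39 ≈ 6.245; the slides do not say how that relates to the 44 gene tables.

```python
import math


def unique_word_z(doc_freq_in_gene, n_genes):
    """Z-score of a word that occurs in exactly one of n_genes genes."""
    freqs = [doc_freq_in_gene] + [0] * (n_genes - 1)
    mean = sum(freqs) / n_genes
    # Assumption: population standard deviation (divide by N).
    sigma = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n_genes)
    return (doc_freq_in_gene - mean) / sigma


# The z-score is the same whatever the document frequency is.
for df in (1, 5, 50):
    print(df, round(unique_word_z(df, 40), 3))
```

Under these assumptions, the z-score of a gene-specific word depends only on the number of genes, which is why the z-score alone cannot rank such words against each other.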
![Page 30: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/30.jpg)
Limitations
• Some types of parsing errors result in false positives
![Page 31: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/31.jpg)
Top keywords for gene “Cwp1” using TF-IGF
• Better quality keywords obtained from TF-IGF
![Page 32: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/32.jpg)
Cluster 3.0 Clustering Software
• Used Cluster 3.0, open-source clustering software specially designed for gene expression data analysis
• Developed at Stanford University
• Can run on Windows/Mac/Linux
![Page 33: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/33.jpg)
Yeast Genes Grouped using Z-Score

| Genes | Cluster |
|---|---|
| Gic2, Rad27, Dun1, Tel2, Cdc20, Far1 | 1 |
| Cln1, Cln2, Cdc6 | 2 |
| Gic1, Ace2, Mcm3 | 3 |
| Exg1, Htb2, Cts1 | 4 |
| Mcm2 | 5 |
| Mnn1, Och1, Hho1, Mcm6 | 6 |
| Msb2, Rsr1, Bud9, Kre6, Cwp1, Clb5, Clb6, Rnr1, Cdc21, Cdc45, Htb1, Hta1, Hta2, Hht1, Tem1 | 7 |
| Rad51 | 8 |
| Ste2 | 9 |
![Page 34: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/34.jpg)
Yeast Genes Grouped using TF-IGF

| Genes | Cluster |
|---|---|
| Cdc20 | 1 |
| Cln1, Cln2 | 2 |
| Swi5, Ace2 | 3 |
| Cdc6, Mcm3, Mcm6, Cdc46 | 4 |
| Mcm2 | 5 |
| Cdc45 | 6 |
| Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1, Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Htb1, Htb2, Hta1, Hta2, Hho1, Hht1, Tel2, Tem1, Clb2, Cts1, Gic1, Gic2 | 7 |
| Rad51 | 8 |
| Ste2, Far1 | 9 |
![Page 35: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/35.jpg)
Analysis
• Z-Score is independent of document frequency only when the word is unique to the gene
• Cln1 Gene
• Cln2 Gene
![Page 36: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/36.jpg)
Analysis (cont'd.)
• Words found with low frequency but with high z-score
![Page 37: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/37.jpg)
Analysis (cont'd.)
• Words with high frequency have low z-scores
![Page 38: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/38.jpg)
Future Scope
• A new background set can be used, for example:
  • Abstracts that contain the gene name and its related words in the title
  • Unrestricted Background set
• Applying a stemming algorithm to the MEDLINE database
• Concepts of Latent Semantic Metrics can be applied by preserving the order of words
![Page 39: Clustering of yeast genes using Literature Mining](https://reader034.vdocument.in/reader034/viewer/2022042907/587b44a51a28ab9c0e8b66ef/html5/thumbnails/39.jpg)
Special Thanks To…
Dr. Venu Dasigi
Dr. Vipa Phuntumart
Dr. Ray Kresman
Pukar Hamal