Post on 22-Dec-2015
Promoter Discovery: A Correlation Mining Approach
Yi Lu
Department of Computer Science, Wayne State University
Yi Lu Wayne State University 2
Outline
Introduction
Related Work
Problem Definition
Correlation Mining
Conclusion and Future Work
Introduction
Central dogma of gene expression: DNA -> (transcription) -> RNA -> (translation) -> protein
Introduction
The promoter region of a gene (a set of transcription factor binding sites) acts as a light switch: it signals when to turn the gene on and off.
We are interested in the relationship between the promoter region and gene expression, i.e., which kinds of binding sites determine whether a gene is expressed?
Introduction - Microarray
Microarray chips -> images scanned by laser -> datasets
Gene        Value
D26528_at   193
D26561_at   70
D26579_at   318
D26598_at   1764
D26599_at   1537
D26600_at   1204
D28114_at   707
H29189_at   899
G29183_at   9210
Gene        Day 1   Day 2   Day 3   …
D26528_at   193     4157    556
D26561_at   70      11557   476
D26579_at   318     12125   498
D26598_at   1764    8484    1211
D26599_at   1537    3537    131
D26600_at   1207    4578    94
D28114_at   707     2431    209
…
[Figure: dataset matrix with genes (D26528_at, D26561_at, D26579_at, …) as rows and D1, D2, D3, D4, … as columns.]
Introduction
Transcription factor binding sites (motifs) in the promoter region should “explain” changes in transcription.
AGCTAGCTGATTGTGCACACTGATCGAGCCCCACCATAGCTTCGTTGTGCGCTATATATTGTGCAGCTAGTAGAGCTCTGCTAGAGCTCTATTTGTGCCGATTGCGGGGCGTCTGAGCTCTTTGCTCTTTTGTGCCGCTTTTGATATTATCTCTCTGCTCGTTTGTGCTTTATTGTGGGGGTTGTGCTGATTATGCTGCTCATAGGAGATTGTGCGAGAGTCGTCGTAGTTGTGCGTCGTCGTGATGATGCTGCTGATCGATCGTTGTGCCTAGCTAGTAGATCGATGTTTGTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCTTGTGCTCGAGAGGAAGTATATATTTGTGCGCGCGCCGCGCGCACGTTGTGCAGCTGATGCATGCATGCTAGTATTGTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCTTGTGCAGTCGATCGATGCTAGTTATTGTTGTGCGTAGTAGTGCTTGTGCTCGTAGCTGTAG
[Figure: promoter sequence of time-course genes, labeled "t1 Motif" with R(t1) and "t2 Motif" with R(t2).]
Related work
Cluster gene expression profiles
Search for motifs in promoter regions of clustered genes
[Figure: clustering, followed by motif search in the promoter regions of the clustered genes.]
Promoter regions:
Gene 1  AGCTAGCTGATTGTGCACAC
Gene 2  TTCGTTGTGCGCTATATAGA
Gene 3  TTGTGCAGCTAGTAGAGCTC
Gene 4  CTAGAGCTCTATTTGTGCCG
Gene 5  ATTGCGGGGCGTCTGAGCTC
Gene 6  TTTGCTCTTTTGTGCCGCTT
Motif: TTGTGC
Related work
Clustering
Partition the N genes into a set of disjoint groups so that the expression profiles of genes in the same group are highly similar to each other, while the expression profiles of genes in different groups are dissimilar.
Most widely used algorithms: K-means clustering and hierarchical clustering.
Genetic K-means algorithms (Lu et al. 2003, 2004).
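The partitioning idea can be illustrated with a minimal sketch of plain K-means. This is not the genetic K-means variant cited above; the expression profiles, the value of k, and the choice of the first k profiles as starting centroids are all illustrative assumptions.

```python
import math

def kmeans(profiles, k, iterations=20):
    """Plain K-means: return a cluster index for each expression profile."""
    # Assumption: seed the centroids with the first k profiles (deterministic).
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression profiles, three time points per gene.
profiles = [
    (1.0, 2.0, 3.0),  # rising
    (3.0, 2.0, 1.0),  # falling
    (1.1, 2.1, 2.9),  # rising
    (2.9, 2.2, 1.0),  # falling
]
labels = kmeans(profiles, k=2)
print(labels)
```

Genes with the same label fall in the same group; here the two rising profiles and the two falling profiles end up together.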
Related work
Motif discovery after clustering
Given a set of upstream sequences of co-expressed genes, find subsequences that are overrepresented and significant enough to be distinguished from other subsequences.
MEME, Gibbs Sampling, Winnower algorithms.
PDC algorithm (Lu et al. 2006).
These methods usually have a high false positive rate.
ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCAGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCTTTGTGGTACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACCACCCCCCGCTCGATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCGCTGCTCGTAGCATGCTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTTTTTTTTTTTTGCTAGCACCCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCTCGTCGTCGATGCATCGTACGTAGCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTCGATCGTAGCTGCTCGCCCCCCCCCCCCGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAGCTGAATTATATATATATATATACGGCG
[Figure: upstream sequences of genes with occurrences of candidate motifs marked: TCGACTGC, GATAC, CCAAT, GCAGTT.]
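As a rough illustration of "overrepresented subsequences", the sketch below counts how many sequences contain each 6-mer, using the six promoter fragments shown on the earlier clustering slide. It is naive counting only, with no significance test, and is not how MEME, Gibbs Sampling, Winnower, or PDC work; the function name `kmer_support` is hypothetical.

```python
from collections import Counter

def kmer_support(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most 1 per k-mer.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support

# Promoter fragments of Gene 1 to Gene 6 from the clustering slide.
upstream = [
    "AGCTAGCTGATTGTGCACAC",
    "TTCGTTGTGCGCTATATAGA",
    "TTGTGCAGCTAGTAGAGCTC",
    "CTAGAGCTCTATTTGTGCCG",
    "ATTGCGGGGCGTCTGAGCTC",
    "TTTGCTCTTTTGTGCCGCTT",
]
support = kmer_support(upstream, k=6)
best, count = support.most_common(1)[0]
print(best, count)
```

On these fragments the most widely shared 6-mer is TTGTGC, found in five of the six sequences, which matches the motif named on that slide.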
Motivation
Research has indicated that multiple transcription factor binding sites are involved in each transcription process. This leads us to study modules (pairs of motifs) instead of single motifs.
Motivation
Not all genes that contain the same motif show the same gene expression change.
Not all genes with the same gene expression change contain the same motif.
Gene 1  TTCGTTGTGCGCTATATAGA
Gene 2  TTGTGCAGCTAGTAGAGCTC
Gene 3  CTAGAGCTCTATTTGTGCCG
Gene 4  ATTGCGGGGCGTCTGAGCTC
Gene 5  TTTGCTCTTTTGTGCCGCTT
Gene 6  ATCTTGTGCACATGTACTAC
Gene 7  AGCTAGTTGTGCACACACTT
Gene 8  AATTTCGTTGTGCGCTATAT
Gene 9  GAGCTCTTGTGCAGCTAGTA
Problem Definition
Given a list of genes with their module presence information and gene expression information, find the relationship between modules and gene expression, i.e., which modules or module combinations may be related to a change in gene expression.
Example: M1 M2 => gene expression increases from Day 1 to Day 4
Gene       ETSFETSF  NFKBSTAT  STATETSF  …  Day0    Day3    Day6    …
Mm.100117  1         0         0            16.75   65.3    119.15
Mm.100118  0         0         0            150.85  137.55  130.55
Mm.100125  0         1         0            84.55   96.9    119.15
Mm.10154   0         0         1            84.55   96.9    119.15
Mm.10174   0         1         0            223.05  181.55  200.9
Mm.10178   0         0         0            16.75   65.3    119.15
Mm.10182   1         0         1            79.6    80.3    94.75
Method - Quantify Gene Expression
Days 1 4 8 11 14 18 21 26 29 60
Mm.116803 189.9 398.3 224.1 123.4 602.7 2218 8624 9901 11748 18519
[Figure: chart of the log ratio for each interval (Day1-4, Day4-8, Day8-11, Day11-14, Day14-18, Day18-21, Day21-26, Day26-29, Day29-60); y-axis from -0.8 to 1.]
Days                 1-4     4-8    8-11   11-14  14-18  18-21   21-26   26-29  29-60
log10(D_{i+1}/D_i)   0.322   -0.25  -0.26  0.689  0.566  0.59    0.06    0.074  0.198
Mean                 0.014   0.006  0.006  0.017  0.04   0.063   0.052   0.019  0.044
Lower Bound          -0.110  -0.15  -0.12  -0.23  -0.22  -0.165  -0.225  -0.22  -0.32
Upper Bound          0.138   0.165  0.132  0.269  0.297  0.291   0.328   0.258  0.410
Method - Quantify Gene Expression
Days       1      4      8      11     14     18    21    26    29     60
Mm.116803  189.9  398.3  224.1  123.4  602.7  2218  8624  9901  11748  18519
Each log ratio Ei = log10(D_{i+1}/D_i) is coded as + if it is above the upper bound, - if it is below the lower bound, and 0 otherwise.
                          E1      E2     E3     E4     E5     E6      E7      E8     E9
Days                      1-4     4-8    8-11   11-14  14-18  18-21   21-26   26-29  29-60
Ei = log10(D_{i+1}/D_i)   0.322   -0.25  -0.26  0.689  0.566  0.59    0.06    0.074  0.198
Lower Bound               -0.110  -0.15  -0.12  -0.23  -0.22  -0.165  -0.225  -0.22  -0.32
Upper Bound               0.138   0.165  0.132  0.269  0.297  0.291   0.328   0.258  0.410
Mm.116803                 +       -      -      +      +      +       0       0      0
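The values on this slide are consistent with a simple two-step rule: take the log10 ratio of consecutive time points, then compare it with the per-interval bounds. The sketch below reproduces the row for Mm.116803 under that assumption. The bounds are copied from the slide as given (the slides do not say how they are derived), and the function names are hypothetical.

```python
import math

# Expression values of Mm.116803 at days 1, 4, 8, 11, 14, 18, 21, 26, 29, 60.
values = [189.9, 398.3, 224.1, 123.4, 602.7, 2218, 8624, 9901, 11748, 18519]
# Per-interval bounds, copied from the slide.
lower = [-0.110, -0.15, -0.12, -0.23, -0.22, -0.165, -0.225, -0.22, -0.32]
upper = [0.138, 0.165, 0.132, 0.269, 0.297, 0.291, 0.328, 0.258, 0.410]

def log_ratios(series):
    """E_i = log10(D_{i+1} / D_i) for each pair of consecutive time points."""
    return [math.log10(b / a) for a, b in zip(series, series[1:])]

def discretize(ratios, lower, upper):
    """Assumed rule: '+' above the upper bound, '-' below the lower bound, else '0'."""
    symbols = []
    for e, lo, hi in zip(ratios, lower, upper):
        if e > hi:
            symbols.append("+")
        elif e < lo:
            symbols.append("-")
        else:
            symbols.append("0")
    return symbols

ratios = log_ratios(values)
symbols = discretize(ratios, lower, upper)
print([round(e, 3) for e in ratios])
print(symbols)
```

The resulting symbols are + - - + + + 0 0 0, matching the E1 to E9 row shown for Mm.116803.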
Method – Generate Frequent Module Set
M1 M2 M3 M4
Gene 1 1 1 0 0
Gene 2 1 0 1 0
Gene 3 1 1 1 0
Gene 4 1 1 0 1
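Module sets with occurrence counts (frequent: occurrence >= 2):
1 module: M1 (4), M2 (3), M3 (2), M4 (1)
2 modules: M1M2 (3), M1M3 (2), M2M3 (1)
3 modules: M1M2M3 (1)
Frequent module sets: M1, M2, M3, M1M2, M1M3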
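The counting on this slide can be sketched as a brute-force frequent itemset search over the four example genes. This is an illustrative enumeration, not necessarily the algorithm used by the author; the function name `frequent_sets` is hypothetical.

```python
from itertools import combinations

# Module presence per gene, from the table on this slide.
genes = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}

def frequent_sets(transactions, min_count):
    """Return {item tuple: count} for every item set seen in >= min_count transactions."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count >= min_count:
                frequent[combo] = count
                found = True
        if not found:
            break  # no frequent set of this size, so no larger one can be frequent
    return frequent

module_sets = frequent_sets(list(genes.values()), min_count=2)
for combo, count in module_sets.items():
    print("".join(combo), count)
```

With a minimum occurrence of 2 this yields M1 (4), M2 (3), M3 (2), M1M2 (3), and M1M3 (2); M4, M2M3, and M1M2M3 each occur only once and are dropped.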
Method – Generate Frequent Gene Expression Set
        E1  E2  E3
Gene 1  +   +   0
Gene 2  0   -   -
Gene 3  +   -   -
Gene 4  0   -   0
        E1+  E2+  E3+  E1-  E2-  E3-
Gene 1  1    1    0    0    0    0
Gene 2  0    0    0    0    1    1
Gene 3  1    0    0    0    1    1
Gene 4  0    0    0    0    1    0
Gene expression sets with occurrence counts (frequent: occurrence >= 2):
1 item: E1+ (2), E1- (0), E2+ (1), E2- (3), E3+ (0), E3- (2)
2 items: E1+E2- (1), E1+E3- (1), E2-E3- (2)
Frequent gene expression sets: E1+, E2-, E3-, E2-E3-
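The step from the +/-/0 table to the binary table can be sketched as a small encoding function: every non-zero symbol becomes an item such as E1+ or E2-, and a 0 contributes no item. The helper name `to_items` is hypothetical.

```python
# Discretized expression changes per gene, from the first table on this slide.
expression = {
    "Gene 1": ["+", "+", "0"],
    "Gene 2": ["0", "-", "-"],
    "Gene 3": ["+", "-", "-"],
    "Gene 4": ["0", "-", "0"],
}

def to_items(symbols):
    """Map ['+', '-', '0'] to the item set {'E1+', 'E2-'}; '0' yields no item."""
    return {f"E{i}{s}" for i, s in enumerate(symbols, start=1) if s != "0"}

items = {gene: to_items(symbols) for gene, symbols in expression.items()}

# Binary table in the same column order as the slide.
columns = ["E1+", "E2+", "E3+", "E1-", "E2-", "E3-"]
binary = {gene: [int(c in s) for c in columns] for gene, s in items.items()}

# Occurrence count of the two-item set E2-E3-.
pair_support = sum(1 for s in items.values() if {"E2-", "E3-"} <= s)

for gene, row in binary.items():
    print(gene, row)
print("E2-E3- occurrences:", pair_support)
```

Once the changes are encoded as items, frequent gene expression sets can be counted in the same way as frequent module sets.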
Correlation Measure – Contingency Table
The relation between u and v in the pair (u, v) is summarized in a 2x2 contingency table of observed counts.
Liddell Measure
       E1+      ^E1+
M2     O11 = 2  O12 = 1  R1 = 3
^M2    O21 = 0  O22 = 1  R2 = 1
       C1 = 2   C2 = 2   N = 4
Liddell = (O11*O22 - O12*O21) / (C1*C2) = (2*1 - 1*0) / (2*2) = 0.5
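The worked example above is consistent with the formula Liddell = (O11*O22 - O12*O21) / (C1*C2), where C1 and C2 are the column totals. The sketch below implements that reading; the formula is inferred from the numbers on the slides, and returning 0 for an empty column is an assumption made only to avoid division by zero.

```python
def liddell(o11, o12, o21, o22):
    """Liddell measure for a 2x2 contingency table of observed counts."""
    c1 = o11 + o21  # column total: genes showing the expression pattern
    c2 = o12 + o22  # column total: genes not showing it
    if c1 == 0 or c2 == 0:
        return 0.0  # assumption: treat a degenerate table as uncorrelated
    return (o11 * o22 - o12 * o21) / (c1 * c2)

# M2 versus E1+: O11 = 2, O12 = 1, O21 = 0, O22 = 1.
value = liddell(2, 1, 0, 1)
print(value)  # 0.5
```

The measure ranges from -1 to 1, with positive values indicating that the module and the expression pattern tend to occur together.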
Method – Correlate Module Set with Gene Expression Set
Minimize the module set.
Maximize the gene expression set.
With the minimum Liddell value set to 0.5 (or -0.5 for negative correlation), the result sets are:
M2 -> E1+
M2 -> ^(E2- E3-)
M3 -> E2- E3-
        M1  M2  M3  M4  E1  E2  E3
Gene 1  1   1   0   0   +   +   0
Gene 2  1   0   1   0   0   -   -
Gene 3  1   1   1   0   +   -   -
Gene 4  1   1   0   1   0   -   0
Liddell values:
        E1+  E2-      E3-   E2-E3-
M1      0    0        0     0
M2      0.5  -0.3333  -0.5  -0.5
M3      0    0.66667  1     1
M1M2    0.5  -0.3333  -0.5  -0.5
M1M3    0    0.66667  1     1
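The three result sets can be reproduced from the example data with the sketch below. The selection rule is an interpretation of "minimize module set, maximize gene expression set": among pairs whose absolute Liddell value is at least 0.5, drop a pair if a smaller module set gives the same-sign result for the same expression set, or if a larger expression set gives the same-sign result for the same module set. The Liddell formula is the one inferred from the worked example, and all names are hypothetical.

```python
# Example data from this slide, encoded as item sets per gene.
modules = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}
expression = {
    "Gene 1": {"E1+", "E2+"},
    "Gene 2": {"E2-", "E3-"},
    "Gene 3": {"E1+", "E2-", "E3-"},
    "Gene 4": {"E2-"},
}
# Frequent sets (occurrence >= 2) from the two preceding slides.
module_sets = [{"M1"}, {"M2"}, {"M3"}, {"M1", "M2"}, {"M1", "M3"}]
expression_sets = [{"E1+"}, {"E2-"}, {"E3-"}, {"E2-", "E3-"}]

def liddell(mset, eset):
    """Build the 2x2 table for (module set, expression set) and score it."""
    o11 = o12 = o21 = o22 = 0
    for gene in modules:
        has_m = mset <= modules[gene]
        has_e = eset <= expression[gene]
        if has_m and has_e:
            o11 += 1
        elif has_m:
            o12 += 1
        elif has_e:
            o21 += 1
        else:
            o22 += 1
    c1, c2 = o11 + o21, o12 + o22
    return (o11 * o22 - o12 * o21) / (c1 * c2) if c1 and c2 else 0.0

THRESHOLD = 0.5
strong = {}
for m in module_sets:
    for e in expression_sets:
        value = liddell(m, e)
        if abs(value) >= THRESHOLD:
            strong[(frozenset(m), frozenset(e))] = value

def same_sign(a, b):
    return (a > 0) == (b > 0)

rules = {}
for (m, e), value in strong.items():
    # Assumed rule 1: prefer the smallest module set for a given expression set.
    has_smaller_module = any(
        m2 < m and e2 == e and same_sign(v2, value)
        for (m2, e2), v2 in strong.items()
    )
    # Assumed rule 2: prefer the largest expression set for a given module set.
    has_larger_expression = any(
        m2 == m and e2 > e and same_sign(v2, value)
        for (m2, e2), v2 in strong.items()
    )
    if not has_smaller_module and not has_larger_expression:
        rules[("".join(sorted(m)), "".join(sorted(e)))] = value

for (m, e), value in rules.items():
    print(m, "->", e if value > 0 else f"^({e})", value)
```

Under these assumptions the surviving pairs are M2 with E1+ (0.5), M2 with E2-E3- (-0.5, a negative correlation), and M3 with E2-E3- (1), matching the result sets listed on the slide.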
Result on Spermatogenesis
Spermatogenesis is the biological process of sperm formation.
Two gene expression datasets were downloaded from GEO (Gene Expression Omnibus).
One dataset covers days 0, 3, 6, 8, 10, 14, 18, 20, 30, 35, and 56; the other covers days 1, 4, 8, 11, 14, 18, 21, 26, 29, and 60.
[Figure: concordance (y-axis, 0 to 0.6) plotted against the Liddell value (x-axis, 0.5 to 0.8).]
System Workflow
GEO: Gene Expression Omnibus
DBTSS: DataBase of Transcriptional Start Sites
TRANSFAC: the Transcription Factor database
JASPAR: the high-quality transcription factor binding profile database
[Workflow diagram with the following elements: GEO cDNA, Gene IDs, Expression Data, Gene Expression Clustering, Clustered Genes, DBTSS, Upstream Sequences, Motif Discovery, TRANSFAC, JASPAR, Motif Matrices, Motifs, K-SPMM, Modules, Correlation Mining of Modules.]
Conclusion
Not only the same module combinations, but also the same genes containing those module combinations, were recovered from both datasets.
The promoters detected by our approach are statistically more significant than those found in randomly generated datasets.
Some promoters found by our approach are confirmed by the literature.
Future work
The concordance between the two gene expression datasets downloaded from GEO is low; a new method is needed to reconcile the differences between the two datasets.
The number of motifs found by different algorithms is overwhelming; we may incorporate weight matrices and gene ontology to identify the significant ones.
References
Gene Expression Clustering:
Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng and Susan Brown, "FGKA: A Fast Genetic K-means Clustering Algorithm", in Proceedings of the 19th ACM Symposium on Applied Computing, Nicosia, Cyprus, March 2004.
Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng and Susan Brown, "Incremental Genetic K-means Algorithm and its Application in Gene Expression Data Analysis", International Journal of BMC Bioinformatics, 5(172), October 2004.
Motif Discovery:
Yi Lu, Shiyong Lu, Farshad Fotouhi, Yan Sun and Zijiang Yang, "PDC: Pattern Discovery with Confidence in DNA Sequences", in Proceedings of the IASTED International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 2006.
Motif Extraction, Module Integration:
Adrian E. Platts, Yi Lu and Stephen A. Krawetz, "K-SPMM, an Online System for Data Mining Regulatory Elements from Murine Spermatogenic Promoter Sequences", presented at the 2006 Great Lakes Mammalian Development Meeting, Toronto, March 3-5, 2006.
Yi Lu, Adrian E. Platts, Charles G. Ostermeier and Stephen A. Krawetz, "A Database of Murine Spermatogenic Promoters Modules & Motifs", submitted to Journal of BMC Bioinformatics for publication.
Correlation Mining:
Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram and Stephen Krawetz, "Correlation Mining to Reveal the Regulation of Transcription Factor Binding Site Modules", 4th Great Lake Bioinformatics Retreat, Frankenmuth, Michigan, August 2005.
Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram and Stephen Krawetz, "Mining of Correlation Between Transcription Binding Sites and Gene Expression Profiles", in preparation.
Acknowledgements
Dr. Shiyong Lu
Dr. Stephen Krawetz
Mr. Adrian Platts
Dr. Jeffrey Ram
Dr. Youping Deng