
Promoter Discovery: A Correlation Mining Approach

Yi Lu

Department of Computer Science, Wayne State University


Outline

Introduction
Related Work
Problem Definition
Correlation Mining
Conclusion and Future Work


Introduction

Central Dogma (Gene Expression): DNA -> (Transcription) -> RNA -> (Translation) -> Protein


Introduction

The promoter region of a gene (a set of transcription factor binding sites) acts as a light switch: it signals when to turn the gene on and off.

We are interested in the relationship between the promoter region and gene expression, i.e., which kinds of binding sites determine whether a gene is expressed or not?


Introduction - Microarray

Microarray chips; images scanned by laser.

Gene        Value
D26528_at   193
D26561_at   70
D26579_at   318
D26598_at   1764
D26599_at   1537
D26600_at   1204
D28114_at   707
H29189_at   899
G29183_at   9210

Gene        Day 1   Day 2   Day 3   …
D26528_at   193     4157    556
D26561_at   70      11557   476
D26579_at   318     12125   498
D26598_at   1764    8484    1211
D26599_at   1537    3537    131
D26600_at   1207    4578    94
D28114_at   707     2431    209
…

Datasets: an expression matrix with genes as rows (D26528_at, D26561_at, D26579_at, D26598_at, D26599_at, D26600_at, D28114_at, …) and time points as columns (D1, D2, D3, D4, …).


Introduction

Transcription factor binding sites (motifs) in the promoter region should “explain” changes in transcription.

AGCTAGCTGATTGTGCACACTGATCGAGCCCCACCATAGCTTCGTTGTGCGCTATATATTGTGCAGCTAGTAGAGCTCTGCTAGAGCTCTATTTGTGCCGATTGCGGGGCGTCTGAGCTCTTTGCTCTTTTGTGCCGCTTTTGATATTATCTCTCTGCTCGTTTGTGCTTTATTGTGGGGGTTGTGCTGATTATGCTGCTCATAGGAGATTGTGCGAGAGTCGTCGTAGTTGTGCGTCGTCGTGATGATGCTGCTGATCGATCGTTGTGCCTAGCTAGTAGATCGATGTTTGTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCTTGTGCTCGAGAGGAAGTATATATTTGTGCGCGCGCCGCGCGCACGTTGTGCAGCTGATGCATGCATGCTAGTATTGTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCTTGTGCAGTCGATCGATGCTAGTTATTGTTGTGCGTAGTAGTGCTTGTGCTCGTAGCTGTAG


[Figure: time-course genes. Labels: R(t1) with the t1 motif; R(t2) with the t2 motif.]


Related work

Cluster gene expression profiles.
Search for motifs in promoter regions of clustered genes.

[Figure: clustering of genes, followed by the promoter regions listed below.]

AGCTAGCTGATTGTGCACAC

TTCGTTGTGCGCTATATAGA

TTGTGCAGCTAGTAGAGCTC

CTAGAGCTCTATTTGTGCCG

ATTGCGGGGCGTCTGAGCTC

TTTGCTCTTTTGTGCCGCTT

Promoter regions of Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, and Gene 6 (sequences above).

Motif: TTGTGC (for example, in AGCTAGCTGATTGTGCACAC).
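The motif search on this slide can be illustrated with a minimal sketch that scans the six promoter fragments shown above for exact copies of TTGTGC. The function name `find_motif` is hypothetical, and exact string matching is an assumption made for illustration; real motif finders typically allow mismatches or use weight matrices.

```python
# Promoter fragments and motif copied from the slide.
promoters = [
    "AGCTAGCTGATTGTGCACAC",
    "TTCGTTGTGCGCTATATAGA",
    "TTGTGCAGCTAGTAGAGCTC",
    "CTAGAGCTCTATTTGTGCCG",
    "ATTGCGGGGCGTCTGAGCTC",
    "TTTGCTCTTTTGTGCCGCTT",
]
MOTIF = "TTGTGC"


def find_motif(sequence, motif):
    """Return every 0-based start position of motif in sequence (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


# One list of hit positions per promoter fragment, in the order given above.
hits = [find_motif(seq, MOTIF) for seq in promoters]
for seq, positions in zip(promoters, hits):
    print(seq, positions)
```

An empty list means the fragment carries no exact copy of the motif.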


Related work - Clustering

Partition the N genes into a set of disjoint groups so that the expression profiles of genes in the same group are highly similar to each other, and the expression profiles of genes in different groups are dissimilar.

Most widely used algorithms: K-means clustering and hierarchical clustering.

Genetic K-means algorithms (Lu et al. 2003, 2004).
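As a rough illustration of the partitioning idea, here is a minimal sketch of plain (Lloyd-style) K-means on expression profiles. It is not the genetic K-means variant cited above. The toy profiles, the function name `kmeans`, and the choice to seed centroids with the first k profiles are all assumptions made for illustration.

```python
def kmeans(profiles, k, iterations=100):
    """Plain K-means; returns one cluster label per expression profile."""
    # Assumption: the first k profiles seed the centroids (deterministic, not ideal).
    centroids = [list(p) for p in profiles[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


# Hypothetical expression profiles (three time points per gene).
toy_profiles = [
    [1.0, 2.0, 3.0],
    [10.0, 9.0, 8.0],
    [1.1, 2.1, 2.9],
    [9.5, 9.0, 8.2],
]
labels = kmeans(toy_profiles, k=2)
print(labels)
```

Genes that receive the same label form one of the disjoint groups described above.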


Related work - Motif discovery after clustering

Given a set of upstream sequences of co-expressed genes, find subsequences that are overrepresented and that differ significantly from the other subsequences.

MEME, Gibbs Sampling, Winnower algorithms.
PDC algorithm (Lu et al. 2006).
These methods usually have a high false positive rate.

ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCAGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCTTTGTGGTACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACCACCCCCCGCTCGATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCGCTGCTCGTAGCATGCTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTTTTTTTTTTTTGCTAGCACCCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCTCGTCGTCGATGCATCGTACGTAGCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTCGATCGTAGCTGCTCGCCCCCCCCCCCCGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAGCTGAATTATATATATATATATACGGCG

[Figure: candidate motifs reported across genes: TCGACTGC, GATAC, CCAAT, GCAGTT.]

Motivation

Research has indicated that multiple transcription factor binding sites are involved in each transcription process. This leads us to study modules (pairs of motifs) instead of single motifs.


Motivation

Not all genes that contain the same motif show the same gene expression change.
Not all genes with the same gene expression change contain the same motif.






ATCTTGTGCACATGTACTAC

AGCTAGTTGTGCACACACTT

AATTTCGTTGTGCGCTATAT

GAGCTCTTGTGCAGCTAGTA

[Figure labels: Gene 1 through Gene 9.]


Problem Definition

Given a list of genes, the modules present in each gene, and the gene expression data, find the relationship between modules and gene expression, i.e., which modules or module combinations may relate to a gene expression change.

Example rule: M1 M2 => increased gene expression from Day 1 to Day 4

Gene        ETSFETSF  NFKBSTAT  STATETSF  …   Day0     Day3     Day6     …
Mm.100117   1         0         0             16.75    65.3     119.15
Mm.100118   0         0         0             150.85   137.55   130.55
Mm.100125   0         1         0             84.55    96.9     119.15
Mm.10154    0         0         1             84.55    96.9     119.15
Mm.10174    0         1         0             223.05   181.55   200.9
Mm.10178    0         0         0             16.75    65.3     119.15
Mm.10182    1         0         1             79.6     80.3     94.75


Method - Quantify Gene Expression

Days        1      4      8      11     14     18    21    26    29     60
Mm.116803   189.9  398.3  224.1  123.4  602.7  2218  8624  9901  11748  18519

[Figure: chart over the intervals Day1-4, Day4-8, Day8-11, Day11-14, Day14-18, Day18-21, Day21-26, Day26-29, Day29-60; vertical axis from -0.8 to 1.]

Days                  1-4     4-8     8-11    11-14   14-18   18-21   21-26   26-29   29-60
log10(D(i+1)/D(i))    0.322   -0.25   -0.26   0.689   0.566   0.59    0.06    0.074   0.198
Mean                  0.014   0.006   0.006   0.017   0.04    0.063   0.052   0.019   0.044
Lower Bound           -0.110  -0.15   -0.12   -0.23   -0.22   -0.165  -0.225  -0.22   -0.32
Upper Bound           0.138   0.165   0.132   0.269   0.297   0.291   0.328   0.258   0.410


Method - Quantify Gene Expression

Days        1      4      8      11     14     18    21    26    29     60
Mm.116803   189.9  398.3  224.1  123.4  602.7  2218  8624  9901  11748  18519

                           E1      E2      E3      E4      E5      E6      E7      E8      E9
Days                       1-4     4-8     8-11    11-14   14-18   18-21   21-26   26-29   29-60
Ei = log10(D(i+1)/D(i))    0.322   -0.25   -0.26   0.689   0.566   0.59    0.06    0.074   0.198
Lower Bound                -0.110  -0.15   -0.12   -0.23   -0.22   -0.165  -0.225  -0.22   -0.32
Upper Bound                0.138   0.165   0.132   0.269   0.297   0.291   0.328   0.258   0.410

            E1  E2  E3  E4  E5  E6  E7  E8  E9
Mm.116803   +   -   -   +   +   +   0   0   0
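The quantification step can be reproduced with a short sketch using the Mm.116803 values and the bounds from the slide. Two assumptions are made here: a value strictly above the upper bound maps to "+", and a value strictly below the lower bound maps to "-". The slide does not state how a value exactly on a bound is handled, nor how the bounds themselves are derived. The function names are hypothetical.

```python
import math

# Values copied from the slide for gene Mm.116803.
expression = [189.9, 398.3, 224.1, 123.4, 602.7, 2218, 8624, 9901, 11748, 18519]
lower = [-0.110, -0.15, -0.12, -0.23, -0.22, -0.165, -0.225, -0.22, -0.32]
upper = [0.138, 0.165, 0.132, 0.269, 0.297, 0.291, 0.328, 0.258, 0.410]


def log_ratios(values):
    """E_i = log10(D_(i+1) / D_i) for each pair of consecutive time points."""
    return [math.log10(b / a) for a, b in zip(values, values[1:])]


def discretize(ratios, lower_bounds, upper_bounds):
    """Map each ratio to '+', '-', or '0' (assumed strict comparison with the bounds)."""
    symbols = []
    for e, lo, hi in zip(ratios, lower_bounds, upper_bounds):
        if e > hi:
            symbols.append("+")
        elif e < lo:
            symbols.append("-")
        else:
            symbols.append("0")
    return symbols


ratios = log_ratios(expression)
symbols = discretize(ratios, lower, upper)
print([round(r, 3) for r in ratios])
print(" ".join(symbols))
```

The resulting symbols can be compared directly with the E1 to E9 row on the slide.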


Method – Generate Frequent Module Set

          M1  M2  M3  M4
Gene 1    1   1   0   0
Gene 2    1   0   1   0
Gene 3    1   1   1   0
Gene 4    1   1   0   1

Module set occurrences (a set is frequent if its occurrence >= 2):
M1 (4), M2 (3), M3 (2), M4 (1)
M1M2 (3), M1M3 (2), M2M3 (1)
M1M2M3 (1)

Frequent module sets: M1, M2, M3, M1M2, M1M3.
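The counts above can be checked with a minimal brute-force sketch. It enumerates every module combination and keeps those present in at least two genes. The talk does not say which mining algorithm was used, so this is an illustration and not the author's implementation. The function name `frequent_sets` is hypothetical.

```python
from itertools import combinations

# Module presence per gene, copied from the table on the slide.
genes = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}
MIN_SUPPORT = 2


def frequent_sets(transactions, min_support):
    """Return {itemset tuple: occurrence count} for all sets meeting min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
                found_any = True
        # No frequent set of this size means no larger set can be frequent.
        if not found_any:
            break
    return frequent


frequent = frequent_sets(list(genes.values()), MIN_SUPPORT)
for itemset, support in frequent.items():
    print("".join(itemset), support)
```

Brute-force enumeration is fine for four modules; for realistic module counts a pruning strategy such as Apriori would be the usual choice.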


Method – Generate Frequent Gene Expression Set

          E1  E2  E3
Gene 1    +   +   0
Gene 2    0   -   -
Gene 3    +   -   -
Gene 4    0   -   0

          E1+  E2+  E3+  E1-  E2-  E3-
Gene 1    1    1    0    0    0    0
Gene 2    0    0    0    0    1    1
Gene 3    1    0    0    0    1    1
Gene 4    0    0    0    0    1    0

Gene expression set occurrences (a set is frequent if its occurrence >= 2):
E1+ (2), E1- (0), E2+ (1), E2- (3), E3+ (0), E3- (2)
E1+E2- (1), E1+E3- (1), E2-E3- (2)
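The encoding step, from "+/-/0" symbols to binary items such as E1+ or E2-, can be sketched as follows, together with the occurrence counts. Restricting the count to single items and pairs is an assumption that mirrors what the slide lists. The helper name `to_items` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Discretized expression changes per gene, copied from the slide.
changes = {
    "Gene 1": ["+", "+", "0"],
    "Gene 2": ["0", "-", "-"],
    "Gene 3": ["+", "-", "-"],
    "Gene 4": ["0", "-", "0"],
}


def to_items(symbols):
    """Turn a row such as ['+', '-', '0'] into items such as ['E1+', 'E2-']."""
    return [f"E{i}{s}" for i, s in enumerate(symbols, start=1) if s != "0"]


support = Counter()
for symbols in changes.values():
    items = to_items(symbols)
    for size in (1, 2):  # assumption: count single items and pairs only
        for combo in combinations(items, size):
            support[combo] += 1

# Keep the expression sets that occur in at least two genes.
frequent = {combo: n for combo, n in support.items() if n >= 2}
for combo, n in frequent.items():
    print("".join(combo), n)
```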


Correlation Measure – Contingency Table

The relation between u and v in the pair (u, v).


Liddell Measure

Liddell = (2*1 - 1*0) / (2*2) = 0.5

         E1+       ^E1+
M2       O11 = 2   O12 = 1   R1 = 3
^M2      O21 = 0   O22 = 1   R2 = 1
         C1 = 2    C2 = 2    N = 4
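Reading the arithmetic against the contingency table, the numerator is the difference of the diagonal products and the denominator is the product of the column totals. Written out in general form (a reconstruction from the worked numbers on this slide, which agrees with the Liddell measure as it is usually defined):

```latex
\[
\mathrm{Liddell}(u, v) = \frac{O_{11}\,O_{22} - O_{12}\,O_{21}}{C_{1}\,C_{2}},
\qquad
\mathrm{Liddell}(M_2, E_1^{+}) = \frac{2 \cdot 1 - 1 \cdot 0}{2 \cdot 2} = 0.5
\]
```

Here $C_1 = O_{11} + O_{21}$ and $C_2 = O_{12} + O_{22}$ are the column totals from the table.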


Method – Correlate Module Set with Gene Expression Set

Minimize the module set.
Maximize the gene expression set.

With the minimum Liddell value set to 0.5 (positive correlation) or -0.5 (negative correlation), the result sets are:

M2 -> E1+
M2 -> ^(E2- E3-)
M3 -> E2- E3-

          M1  M2  M3  M4  E1  E2  E3
Gene 1    1   1   0   0   +   +   0
Gene 2    1   0   1   0   0   -   -
Gene 3    1   1   1   0   +   -   -
Gene 4    1   1   0   1   0   -   0

Liddell values:

        E1+   E2-       E3-    E2-E3-
M1      0     0         0      0
M2      0.5   -0.3333   -0.5   -0.5
M3      0     0.66667   1      1
M1M2    0.5   -0.3333   -0.5   -0.5
M1M3    0     0.66667   1      1
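The Liddell table can be recomputed from the four example genes with a short sketch. It uses the column-total form of the measure implied by the worked example, (O11*O22 - O12*O21) / (C1*C2). Returning 0 when a column total is zero is an assumption. The final pruning step (minimize module set, maximize expression set) is not implemented here.

```python
# Module and expression items per gene, copied from the slide.
modules = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}
expression = {
    "Gene 1": {"E1+", "E2+"},
    "Gene 2": {"E2-", "E3-"},
    "Gene 3": {"E1+", "E2-", "E3-"},
    "Gene 4": {"E2-"},
}


def liddell(module_set, expression_set):
    """Liddell measure from the 2x2 contingency table of the two sets."""
    o11 = o12 = o21 = o22 = 0
    for gene in modules:
        has_m = module_set <= modules[gene]
        has_e = expression_set <= expression[gene]
        if has_m and has_e:
            o11 += 1
        elif has_m:
            o12 += 1
        elif has_e:
            o21 += 1
        else:
            o22 += 1
    c1, c2 = o11 + o21, o12 + o22
    if c1 == 0 or c2 == 0:
        return 0.0  # assumption: treat an undefined value as no correlation
    return (o11 * o22 - o12 * o21) / (c1 * c2)


module_sets = [{"M1"}, {"M2"}, {"M3"}, {"M1", "M2"}, {"M1", "M3"}]
expression_sets = [{"E1+"}, {"E2-"}, {"E3-"}, {"E2-", "E3-"}]

table = {
    ("".join(sorted(m)), "".join(sorted(e))): round(liddell(m, e), 4)
    for m in module_sets
    for e in expression_sets
}
# Pairs whose absolute Liddell value reaches the 0.5 threshold.
strong = {pair: value for pair, value in table.items() if abs(value) >= 0.5}
for pair, value in strong.items():
    print(pair, value)
```

The `strong` dictionary still contains redundant pairs (for example, M1M2 duplicates M2); removing them corresponds to the minimize and maximize steps listed on the slide.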


Result on Spermatogenesis

Spermatogenesis is the biological process of sperm formation. Two gene expression datasets were downloaded from GEO (Gene Expression Omnibus).

One dataset has time points at days 0, 3, 6, 8, 10, 14, 18, 20, 30, 35, and 56; the other at days 1, 4, 8, 11, 14, 18, 21, 26, 29, and 60.

[Figure: Concordance (vertical axis, 0 to 0.6) plotted against Liddell value (horizontal axis, 0.5 to 0.8).]


System Workflow

GEO: Gene Expression Omnibus
DBTSS: DataBase of Transcriptional Start Sites
TRANSFAC: the Transcription Factor database
JASPAR: the high-quality transcription factor binding profile database

[Workflow diagram labels: GEO, cDNA, DBTSS, TRANSFAC, JASPAR, Gene IDs, Expression Data, Upstream Sequences, Gene Expression Clustering, Clustered Genes, Motif Discovery, Motif Matrices, Motifs, K-SPMM, Modules, Correlation Mining of Modules.]


Conclusion

Not only the same module combinations, but also the same genes containing those module combinations, were recovered from both datasets.

The promoters detected by our approach are statistically significant when compared with randomly generated datasets.

Some promoters found by our approach are confirmed by the literature.


Future work

The concordance between the two gene expression datasets downloaded from GEO is low; a new method is needed to reconcile the differences between the two datasets.

The number of motifs found by different algorithms is overwhelming; we may incorporate weight matrices and gene ontology to identify the significant ones.


References

Gene Expression Clustering:

Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan Brown, "FGKA: A Fast Genetic K-means Clustering Algorithm", in Proceedings of the 19th ACM Symposium on Applied Computing, Nicosia, Cyprus, March 2004.

Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan Brown, "Incremental Genetic K-means Algorithm and its Application in Gene Expression Data Analysis", BMC Bioinformatics, 5(172), October 2004.

Motif Discovery:

Yi Lu, Shiyong Lu, Farshad Fotouhi, Yan Sun, and Zijiang Yang, "PDC: Pattern Discovery with Confidence in DNA Sequences", in Proceedings of the IASTED International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 2006.

Motif Extraction, Module Integration:

Adrian E. Platts, Yi Lu, and Stephen A. Krawetz, "K-SPMM, an Online System for Data Mining Regulatory Elements from Murine Spermatogenic Promoter Sequences", presented at the 2006 Great Lakes Mammalian Development Meeting, Toronto, March 3-5, 2006.

Yi Lu, Adrian E. Platts, Charles G. Ostermeier, and Stephen A. Krawetz, "A Database of Murine Spermatogenic Promoters Modules & Motifs", submitted to BMC Bioinformatics.

Correlation Mining:

Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram, and Stephen Krawetz, "Correlation Mining to Reveal the Regulation of Transcription Factor Binding Site Modules", 4th Great Lake Bioinformatics Retreat, Frankenmuth, Michigan, August 2005.

Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram, and Stephen Krawetz, "Mining of Correlation Between Transcription Binding Sites and Gene Expression Profiles", in preparation.


Acknowledgements

Dr. Shiyong Lu
Dr. Stephen Krawetz
Mr. Adrian Platts
Dr. Jeffrey Ram
Dr. Youping Deng


Questions?
