cs262 lecture 16, win07, batzoglou gene recognition credits for slides: serafim batzoglou marina...
Post on 15-Jan-2016
![Page 1: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/1.jpg)
CS262 Lecture 16, Win07, Batzoglou
Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov
![Page 2: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/2.jpg)
Gene structure
[Figure: a gene with exon1, intron1, exon2, intron2, exon3, processed by transcription, splicing, and translation]
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
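As a minimal sketch of the codon idea, the Python snippet below splits a coding sequence into triplets and translates them. The function names and the deliberately partial codon table are assumptions made for illustration; they are not part of the lecture.

```python
# Partial codon table (illustrative only; the full genetic code has 64 entries).
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "AGC": "S",  # serine
    "AAA": "K",  # lysine
    "GTA": "V",  # valine
    "TAA": "*",  # stop
    "TAG": "*",  # stop
    "TGA": "*",  # stop
}


def split_codons(seq):
    """Split a coding sequence into consecutive triplets, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(seq):
    """Translate codon by codon until the first stop; codons missing from the table become 'X'."""
    protein = []
    for codon in split_codons(seq):
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Reading the same sequence from a different starting offset gives a different series of codons, which is why the exon models later in the lecture keep track of the reading frame.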
![Page 3: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/3.jpg)
Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence above is labeled with three exon regions, two intron regions, and flanking intergene regions]
States: Intergene State, First Exon State, Intron State
![Page 4: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/4.jpg)
![Page 5: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/5.jpg)
Duration HMM for Gene Finding
[Figure: nucleotide sequence spanning Exon1, Exon2, Exon3, with an exon duration d and position j+2 marked]
Intron: $\prod_{i} P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
Exon: $P_{\mathrm{EXON\_DUR}}(d) \prod_{i} P_{\mathrm{EXON}((i - j + 2)\,\%\,3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
5’ splice site: $P_{5'SS}(x_{i-3} \ldots x_{i+4})$
Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$
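The exon term above combines an explicit duration probability with frame-dependent emissions. A rough sketch of how one exon segment could be scored in log space follows. To stay short it assumes order-0 emissions (w = 0); the function name, argument layout, and all probability values are hypothetical.

```python
import math


def exon_log_score(seq, j, d, dur_prob, frame_emit):
    """Log-probability of emitting seq[j:j+d] from a single exon state.

    dur_prob:   dict mapping a duration d to P_EXON_DUR(d) (hypothetical values)
    frame_emit: list of 3 dicts, frame_emit[f][base] = P(base | frame f);
                order-0 emissions (w = 0) are an assumption of this sketch
    """
    score = math.log(dur_prob[d])  # pay for the duration once, up front
    for i in range(j, j + d):
        frame = (i - j + 2) % 3  # frame index as written in the exon term
        score += math.log(frame_emit[frame][seq[i]])
    return score
```

Because the duration is drawn explicitly, the exon length distribution is whatever `dur_prob` says it is, rather than the geometric distribution that a plain self-loop would impose.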
![Page 6: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/6.jpg)
HMM-based Gene Finders
• GENMARK (Borodovsky & McIninch 1993)
• GENIE (Kulp 1996)
• GENSCAN (Burge 1997)
   Big jump in accuracy of de novo gene finding
   Currently one of the best
   HMM with duration modeling for exon states
• FGENESH (Solovyev 1997)
   Currently one of the best
• HMMgene (Krogh 1997)
• VEIL (Henderson, Salzberg, & Fasman 1997)
![Page 7: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/7.jpg)
Better way to do it: negative binomial
• EasyGene: prokaryotic gene finder (Larsen TS, Krogh A)
• Negative binomial with n = 3
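A chain of n identical HMM states, each with self-loop probability p, produces total lengths that follow a negative binomial distribution; n = 1 is the ordinary geometric case. The sketch below computes that length distribution. The function name and the value of p used in any call are assumptions for illustration, not EasyGene's actual parameters.

```python
from math import comb


def neg_binomial_length_pmf(length, n, p):
    """P(total length) when n chained states each stay with self-loop probability p.

    Each state emits at least one symbol, so the total length is at least n.
    """
    if length < n:
        return 0.0
    return comb(length - 1, n - 1) * p ** (length - n) * (1 - p) ** n
```

With n = 1 the most likely length is always 1, while with n = 3 the mode moves away from the minimum length, which gives a more realistic shape for gene lengths.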
![Page 8: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/8.jpg)
GENSCAN’s hidden weapon
• C+G content is correlated with:
   Gene content (+)
   Mean exon length (+)
   Mean intron length (–)
• These quantities affect the parameters of the model
• Solution: train the parameters of the model in four different C+G content ranges!
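One way to picture this: compute the C+G fraction of the input sequence and use it to pick one of four parameter sets. The sketch below does only the binning step. The cut-offs are placeholders assumed for illustration, since the slide does not state the ranges, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, cutoffs=(0.43, 0.51, 0.57)):
    """Return the index (0..3) of the C+G range that seq falls into.

    The three cut-offs split [0, 1] into four ranges; the default values
    are illustrative placeholders, not taken from the slide.
    """
    gc = gc_fraction(seq)
    for index, cutoff in enumerate(cutoffs):
        if gc < cutoff:
            return index
    return len(cutoffs)
```

A gene finder following this idea would keep four trained parameter sets and look up the one at index `gc_bin(seq)` before running Viterbi.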
![Page 9: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/9.jpg)
Evaluation of Accuracy
(Slide by NF Samatova)
• Sensitivity (Sn): fraction of exons whose boundaries are predicted exactly (at the nucleotide level: fraction of coding nucleotides that are predicted as coding)
• Specificity (Sp): fraction of the predicted exons that are exactly correct (at the nucleotide level: fraction of predicted coding nucleotides that are coding)
• Correlation Coefficient (CC): combined measure of Sensitivity & Specificity
   Range: -1 (always wrong) to +1 (always right)
[Figure: actual and predicted coding segments along a sequence, with regions labeled TP, FP, TN, FN]

| | Actual: Coding | Actual: No Coding |
|---|---|---|
| Predicted: Coding | TP | FP |
| Predicted: No Coding | FN | TN |
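The three measures can be computed directly from the four nucleotide-level counts. The sketch below uses the gene-finding convention for specificity, TP / (TP + FP), which other fields call precision, and the standard correlation coefficient of a 2x2 table. The function name is an assumption, and the code does not guard against zero denominators.

```python
import math


def accuracy_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient) from confusion counts."""
    sn = tp / (tp + fn)  # fraction of actual coding that is predicted as coding
    sp = tp / (tp + fp)  # fraction of predicted coding that is actually coding
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc
```

For example, `accuracy_metrics(80, 20, 880, 20)` gives Sn = 0.8 and Sp = 0.8, with a CC of about 0.78.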
![Page 10: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/10.jpg)
Results of GENSCAN
• On the initial test dataset (Burset & Guigo):
   80% exact exon detection
   10% partial exons
   10% wrong exons
• In general:
   HMMs have been best in de novo prediction
   In practice they overpredict human genes by ~2x
![Page 11: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/11.jpg)
Comparison-based Methods
![Page 12: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/12.jpg)
Cross-species gene finding
[Figure: aligned human and mouse sequences, 5’ to 3’, spanning Exon1, Intron1, Exon2, Intron2, Exon3]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
![Page 13: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/13.jpg)
Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
   – exons: 84.6%
   – protein: 85.4%
   – introns: 35%
   – 5’ UTRs: 67%
   – 3’ UTRs: 69%
• 27 proteins were 100% identical
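For intuition about what a sequence-identity figure measures, the sketch below computes percent identity for a pair of already-aligned sequences. Counting only columns without gaps is an assumption of this sketch (conventions for the denominator vary), and the function name is hypothetical.

```python
def percent_identity(a, b):
    """Percent of gap-free alignment columns in which the two aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)
```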
![Page 14: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/14.jpg)
![Page 15: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/15.jpg)
Not always: HoxA human-mouse
![Page 16: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/16.jpg)
Patterns of Conservation

| | Mutations | Gaps | Frameshifts |
|---|---|---|---|
| Genes | 30% | 1.3% | 0.14% |
| Intergenic | 58% | 14% | 10.2% |
| Separation | 2-fold | 10-fold | 75-fold |
![Page 17: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/17.jpg)
Twinscan
• Twinscan is an augmented version of the Genscan HMM.
[Figure: HMM with exon (E) and intron (I) states, annotated with transitions, duration, and emissions over the sequence ACUAUACAGACAUAUAUCAU]
![Page 18: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/18.jpg)
Twinscan Algorithm
1. Align the two sequences (e.g., from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
   New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b), where b ∈ { A-, A:, A|, …, U| }
   Emission distributions ek(b) are estimated from real genes from human/mouse
   eI(x|) < eE(x|): matches favored in exons
   eI(x-) > eE(x-): gaps (and mismatches) favored in introns
![Page 19: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/19.jpg)
Example
Human: ACGGCGACGUGCACGU
Mouse: ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM:
A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall, eE(A|) > eI(A|)
eE(A-) < eI(A-)
Likely exon
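The step that turns an alignment into the 12-letter input can be sketched in a few lines. This follows the three-symbol scheme on the slides (gap, mismatch, match); the function name is an assumption, and columns where the human sequence itself has a gap are simply skipped because only human bases are labeled.

```python
def conservation_symbols(target, informant):
    """Pair each target base with '|' (match), ':' (mismatch) or '-' (gap in informant)."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue  # no target base in this column, so nothing to label
        if q == "-":
            mark = "-"
        elif t == q:
            mark = "|"
        else:
            mark = ":"
        symbols.append(t + mark)
    return symbols
```

Calling `conservation_symbols("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")` reproduces the symbol sequence of the example above, which the HMM then consumes in place of the plain nucleotide sequence.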
![Page 20: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/20.jpg)
HMMs for simultaneous alignment and gene finding:
Generalized Pair HMMs
![Page 21: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/21.jpg)
The SLAM hidden Markov model
![Page 22: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/22.jpg)
Exon GPHMM
[Figure: a pair of aligned exons with lengths d and e]
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
![Page 23: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/23.jpg)
Approximate alignment
![Page 24: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/24.jpg)
Measuring Performance
![Page 25: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/25.jpg)
Example: HoxA2 and HoxA3
[Figure: annotation tracks for SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]
![Page 26: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/26.jpg)
Gene Regulation and Microarrays
![Page 27: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/27.jpg)
Overview
• A. Gene Expression and Regulation
• B. Measuring Gene Expression: Microarrays
• C. Finding Regulatory Motifs
![Page 28: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/28.jpg)
Cells respond to environment
Cell responds to environment: various external messages
![Page 29: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/29.jpg)
Genome is fixed – Cells are dynamic
• A genome is static
   Every cell in our body has a copy of the same genome
• A cell is dynamic
   Responds to external conditions
   Most cells follow a cell cycle of division
• Cells differentiate during development
• Gene expression varies according to:
   Cell type
   Cell cycle
   External conditions
   Location
slide credits: M. Kellis
![Page 30: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/30.jpg)
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
![Page 31: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/31.jpg)
Transcriptional Regulation
• Efficient place to regulate:
No energy wasted making intermediate products
• However, slowest response time
After a receptor notices a change:
1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein
![Page 32: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/32.jpg)
Transcription Factors Binding to DNA
Transcription regulation:
Certain transcription factors bind DNA
Binding recognizes DNA substrings:
Regulatory motifs
![Page 33: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/33.jpg)
Promoter and Enhancers
• Promoter necessary to start transcription
• Enhancers can affect transcription from afar
![Page 34: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/34.jpg)
Regulation of Genes
[Figure: DNA carrying a Regulatory Element and a Gene; labels: Transcription Factor (Protein), RNA polymerase (Protein)]
![Page 35: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/35.jpg)
Regulation of Genes
[Figure: DNA carrying a Regulatory Element and a Gene; labels: Transcription Factor (Protein), RNA polymerase]
![Page 36: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/36.jpg)
Regulation of Genes
[Figure: DNA carrying a Regulatory Element and a Gene; labels: Transcription Factor, RNA polymerase, New protein]
![Page 37: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/37.jpg)
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
![Page 38: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/38.jpg)
[Figure: the same genomic sequence as on the previous slide, annotated with: Promoter motifs, 3’ UTR motifs, Exons, Introns]
![Page 39: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/39.jpg)
Example: A Human heat shock protein
• TATA box: positioning transcription start
• TATA, CCAAT: constitutive transcription
• GRE: glucocorticoid response
• MRE: metal response
• HSE: heat shock element
[Figure: promoter of heat shock hsp70, spanning positions -158 to 0 upstream of the GENE; binding-site labels: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]
![Page 40: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/40.jpg)
The Cell as a Regulatory Network
• Genes = wires
• Motifs = gates
[Figure: regulatory circuit with inputs A, B, C feeding gene D, and D feeding gene B]
gene D (Make D):
   If C then D
   If B then NOT D
   If A and B then D
gene B (Make B):
   If D then B
![Page 41: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/41.jpg)
The Cell as a Regulatory Network (2)
![Page 42: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/42.jpg)
DNA Microarrays
Measuring gene transcription in a high-throughput fashion
![Page 43: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/43.jpg)
What is a microarray
![Page 44: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/44.jpg)
What is a microarray
• Measure the level of mRNA messages in a cell
[Figure: array spots DNA 1 to DNA 6 (Gene 1 to Gene 6); RNA 4 and RNA 6 are reverse-transcribed (RT) into cDNA 4 and cDNA 6, hybridized to the array, and measured]
slide credits: M. Kellis
![Page 45: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/45.jpg)
What is a microarray
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of the same gene
• Measure number of hybridizations per spot
Result:
• Thousands of “experiments” – one per gene – in one go
• Perform many microarrays for different conditions:
   Time during cell cycle
   Temperature
   Nutrient level
![Page 46: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/46.jpg)
Goal of Microarray Experiments
• Measure level of gene expression across many different conditions:
   Expression matrix $M$: {genes} × {conditions}
   $M_{ij}$ = |gene$_i$| in condition$_j$, i.e., the measured expression level of gene $i$ under condition $j$
• Group genes into coregulated sets
Observe cells under different conditions
Find genes with similar expression profiles
• Potentially regulated by same TF
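A simple way to make "similar expression profiles" concrete is to compare rows of the expression matrix with a correlation measure. The sketch below uses Pearson correlation, which is one common choice and an assumption here rather than something the slide prescribes. The matrix is a plain dict from gene name to a list of per-condition levels, and the function names and the 0.9 threshold are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length expression profiles (undefined for constant profiles)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def similar_genes(matrix, gene, threshold=0.9):
    """Genes whose profile correlates with that of `gene` at or above the threshold."""
    return [g for g in matrix
            if g != gene and pearson(matrix[gene], matrix[g]) >= threshold]
```

Genes returned for the same query are candidates for a coregulated set; whether they actually share a transcription factor still has to be checked, for example by looking for shared motifs.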
slide credits: M. Kellis
![Page 47: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/47.jpg)
Clustering vs. Classification
• Clustering
   Idea: groups of genes that share similar function have similar expression patterns
   • Hierarchical clustering
   • k-means
   • Bayesian approaches
   • Projection techniques
      • Principal Component Analysis
      • Independent Component Analysis
• Classification
   Idea: a cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
   Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
   • Support Vector Machines
   • Decision Trees
   • Neural Networks
   • K-Nearest Neighbors
![Page 48: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/48.jpg)
Clustering Algorithms
• K-means
   [Figure: points a to h partitioned around three centers c1, c2, c3]
• Hierarchical
   [Figure: points a to h merged step by step into a tree]
slide credits: M. Kellis
![Page 49: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/49.jpg)
Hierarchical clustering
• Bottom-up algorithm:
   Initialization: each point in a separate cluster
   At each step:
   • Choose the pair of closest clusters
   • Merge
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters
[Figure: points a to h]
slide credits: M. Kellis
![Page 50: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/50.jpg)
Distance between clusters
• $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
   Single-link method
• $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
   Complete-link method
• $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
   Average-link method
• $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
   Centroid method
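The four cluster distances plug into the same bottom-up merge loop. The sketch below is a naive implementation for points given as coordinate tuples, with Euclidean distance as D (an assumption; any point distance would do). The function names are hypothetical, and the parameter `k` only says where to stop merging; it is not required by the method itself.

```python
import math
from itertools import combinations


def single_link(X, Y):
    return min(math.dist(x, y) for x in X for y in Y)


def complete_link(X, Y):
    return max(math.dist(x, y) for x in X for y in Y)


def average_link(X, Y):
    return sum(math.dist(x, y) for x in X for y in Y) / (len(X) * len(Y))


def centroid(X):
    """Coordinate-wise mean of the points in X."""
    return tuple(sum(coords) / len(X) for coords in zip(*X))


def centroid_link(X, Y):
    return math.dist(centroid(X), centroid(Y))


def agglomerate(points, cluster_dist, k=1):
    """Start with one cluster per point and merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters
```

Swapping `single_link` for `complete_link` in the call changes which pair counts as closest, and therefore the shape of the resulting tree, without touching the merge loop. This version recomputes all pairwise cluster distances at every step and is meant for clarity, not speed.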
[Figure: each of the four methods illustrated on points d, e, f, g, h]
slide credits: M. Kellis
![Page 51: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/51.jpg)
Results of Clustering Gene Expression
• CLUSTER is simple and easy to use
• De facto standard for microarray analysis
Time: $O(N^2 M)$
   N: #genes
   M: #conditions
![Page 52: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/52.jpg)
K-Means Clustering Algorithm
• Each cluster $X_i$ has a center $c_i$
• Define the clustering cost criterion
   $\mathrm{COST}(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$
• Algorithm tries to find clusters $X_1 \ldots X_k$ and centers $c_1 \ldots c_k$ that minimize COST
• K-means algorithm:
   Initialize centers
   Repeat:
   • Compute best clusters for given centers
     → Attach each point to the closest center
   • Compute best centers for given clusters
     → Choose the centroid of points in cluster
   Until the changes in COST are “small”
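The loop above can be sketched as follows for points given as coordinate tuples. The initial centers are passed in by the caller so that the example is reproducible; random initialization would work the same way. The function name, the tolerance, and the iteration cap are assumptions of this sketch.

```python
import math


def kmeans(points, centers, tol=1e-9, max_iter=100):
    """Alternate assignment and centroid updates until COST stops decreasing by more than tol."""
    centers = [tuple(c) for c in centers]
    clusters, cost = [], float("inf")
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Best clusters for the given centers: attach each point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Best centers for the given clusters: the centroid (an empty cluster keeps its old center).
        centers = [tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
        # COST = sum of squared distances from each point to its cluster center.
        cost = sum(math.dist(p, centers[i]) ** 2
                   for i, cl in enumerate(clusters) for p in cl)
        if prev_cost - cost <= tol:
            break
        prev_cost = cost
    return centers, clusters, cost
```

Each of the two steps can only lower COST or leave it unchanged, so the loop terminates, but the result depends on the initial centers and may be a local minimum.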
[Figure: points a to h grouped around centers c1, c2, c3]
slide credits: M. Kellis
![Page 53: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/53.jpg)
K-Means Algorithm
• Randomly Initialize Clusters
![Page 54: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/54.jpg)
K-Means Algorithm
• Assign data points to nearest clusters
![Page 55: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/55.jpg)
K-Means Algorithm
• Recalculate Clusters
![Page 56: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/56.jpg)
K-Means Algorithm
• Recalculate Clusters
![Page 57: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/57.jpg)
K-Means Algorithm
• Repeat
![Page 58: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/58.jpg)
K-Means Algorithm
• Repeat
![Page 59: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/59.jpg)
K-Means Algorithm
• Repeat … until convergence
Time: O(KNM) per iteration
N: #genes, M: #conditions
![Page 60: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/60.jpg)
Mixture of Gaussians – Probabilistic K-means
• Data is modeled as mixture of K Gaussians $N(\mu_1, \sigma^2 I), \ldots, N(\mu_K, \sigma^2 I)$
Prior probabilities $\pi_1, \ldots, \pi_K$
• Different $\sigma_i$ for every Gaussian i, or even different covariance matrices are possible, but learning becomes harder
$P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I))\, \pi_i$
Use EM to learn parameters
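As a rough illustration of how EM could fit such a mixture, the sketch below evaluates the mixture density and performs EM updates for the means and prior probabilities. It is a simplified sketch under stated assumptions: all Gaussians share one spherical variance `sigma2` that is held fixed rather than learned, and the function names, starting values and toy data are hypothetical, not taken from the lecture.

```python
from math import dist, exp, log, pi


def spherical_gaussian(x, mu, sigma2):
    """Density of a Gaussian with mean mu and covariance sigma2 * I at x."""
    m = len(x)
    return (2 * pi * sigma2) ** (-m / 2) * exp(-dist(x, mu) ** 2 / (2 * sigma2))


def mixture_density(x, mus, priors, sigma2):
    """P(x) as the prior-weighted sum of the component densities."""
    return sum(w * spherical_gaussian(x, mu, sigma2) for mu, w in zip(mus, priors))


def log_likelihood(points, mus, priors, sigma2):
    return sum(log(mixture_density(x, mus, priors, sigma2)) for x in points)


def em_step(points, mus, priors, sigma2):
    """One EM iteration updating means and priors (sigma2 stays fixed)."""
    k, dims = len(mus), len(points[0])
    # E-step: responsibility of each Gaussian for each point.
    resp = []
    for x in points:
        weighted = [priors[i] * spherical_gaussian(x, mus[i], sigma2) for i in range(k)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    # M-step: responsibility-weighted means and average responsibilities.
    new_mus, new_priors = [], []
    for i in range(k):
        mass = sum(r[i] for r in resp)
        new_mus.append(tuple(
            sum(r[i] * x[d] for r, x in zip(resp, points)) / mass
            for d in range(dims)))
        new_priors.append(mass / len(points))
    return new_mus, new_priors


# Hypothetical toy data and starting parameters.
points = [(0, 0), (0, 1), (1, 0), (4, 4), (4, 5), (5, 4)]
mus, priors, sigma2 = [(0.0, 0.0), (5.0, 5.0)], [0.5, 0.5], 1.0
ll_before = log_likelihood(points, mus, priors, sigma2)
for _ in range(10):
    mus, priors = em_step(points, mus, priors, sigma2)
ll_after = log_likelihood(points, mus, priors, sigma2)
print(ll_before, ll_after)
```

Unlike K-means, each point contributes fractionally to every center through its responsibilities, which is why the model is described as probabilistic K-means.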
![Page 61: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/61.jpg)
Analysis of Clustering Data
• Statistical Significance of Clusters
Gene Ontology http://www.geneontology.org/
KEGG http://www.genome.jp/kegg/
• Regulatory motifs responsible for common expression
• Regulatory Networks
• Experimental Verification
![Page 62: CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader036.vdocument.in/reader036/viewer/2022062314/56649d545503460f94a3056e/html5/thumbnails/62.jpg)
Evaluating clusters – Hypergeometric Distribution
• N experiments, p labeled ++, (N-p) labeled ––
• Cluster: k elements, m labeled ++
• P-value of single cluster containing k elements of which at least r are ++

$$P(\mathrm{pos} \ge r) = \sum_{m \ge r} \frac{\binom{p}{m}\binom{N-p}{k-m}}{\binom{N}{k}}$$

Each term: prob that a randomly chosen set of k experiments would result in m positive and k-m negative
Sum: P-value of uniformity in computed cluster
slide credits: M. Kellis
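The p-value here is a hypergeometric tail: the sum, over m from r up to min(k, p), of C(p, m) · C(N−p, k−m) / C(N, k). A minimal sketch using only the standard library follows; the function name `hypergeom_pvalue` and the example numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(N, p, k, r):
    """Probability that a random set of k out of N experiments, p of which
    are labeled positive, contains at least r positives."""
    upper = min(k, p)  # a set of k cannot contain more than k or p positives
    favorable = sum(comb(p, m) * comb(N - p, k - m) for m in range(r, upper + 1))
    return favorable / comb(N, k)


# Hypothetical example: 10 experiments, 5 positive, cluster of 5, all positive.
print(hypergeom_pvalue(10, 5, 5, 5))
```

Working with exact integer binomials and dividing once at the end avoids accumulating floating-point error across terms.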