Post on 21-Dec-2015
![Page 1: Finding Transcription Modules from large gene-expression data sets Ned Wingreen – Molecular Biology Morten Kloster, Chao Tang – NEC Laboratories America](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d5d5503460f94a3bc91/html5/thumbnails/1.jpg)
Finding Transcription Modules from large gene-expression data sets
Ned Wingreen – Molecular Biology
Morten Kloster, Chao Tang – NEC Laboratories America
Outline
• Introduction – transcription, regulation, gene chips, and transcription modules.
• Iterative Signature Algorithm (ISA).
• Advantages of Progressive Iterative Signature Algorithm (PISA).
• PISA applied to yeast data.
Transcription regulation
http://doegenomestolife.org
Gene chips
DNA microarray
Gene-expression profile
$E_{gc}$, with $g = 1, 2, \dots, N_g$ and $c = 1, 2, \dots, N_c$
But data very noisy…
Transcription module
[Figure: conditions C1–C3, transcription factors TF1–TF4, and genes G1–G7.]
A Transcription Module: a set of conditions and a set of genes connected by a transcription factor.
A gene can be in multiple transcription modules.
[Figure: expression matrix with genes $g_1, \dots, g_{N_g}$ as rows and conditions $c_1, \dots, c_{N_c}$ as columns.]
Signature of a transcription module
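Iterative Signature Algorithm (ISA)
Barkai group (2002, 2003)

Transcription Module (TM):

$$C_m(G_m) = \{\, c \in C : \langle E^G_{gc} \rangle_{g \in G_m} > t_C \,\}$$
$$G_m(C_m) = \{\, g \in G : \langle E^C_{gc} \rangle_{c \in C_m} > t_G \,\}$$

Gene vector and condition vector:

$$m^G = \begin{pmatrix} m^G_1 \\ m^G_2 \\ \vdots \\ m^G_{N_G} \end{pmatrix}, \qquad m^C = \begin{pmatrix} m^C_1 \\ m^C_2 \\ \vdots \\ m^C_{N_C} \end{pmatrix}$$

Thresholding:

$$m^C(n+1) = f_{t_C}\big( (E^G)^T \, m^G(n) \big)$$
$$m^G(n+1) = f_{t_G}\big( E^C \, m^C(n+1) \big)$$

[Figure: expression matrix with genes $g_1, \dots, g_{N_G}$ as rows and conditions $c_1, \dots, c_{N_C}$ as columns.]

Thresholding on both genes and conditions reduces noise.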
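The set-based update that ISA iterates can be sketched in plain Python. This is a minimal illustration, not the Barkai group's implementation: the toy matrix `E`, the threshold values, and the function names are all hypothetical, and the same raw matrix stands in for both normalized matrices $E^G$ and $E^C$.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def isa_step(E, genes, t_C, t_G):
    """One update: gene set -> condition set -> new gene set."""
    n_genes, n_conds = len(E), len(E[0])
    # Conditions whose average over the current gene set exceeds t_C.
    conds = {c for c in range(n_conds)
             if mean(E[g][c] for g in genes) > t_C}
    if not conds:
        return set(), set()
    # Genes whose average over those conditions exceeds t_G.
    new_genes = {g for g in range(n_genes)
                 if mean(E[g][c] for c in conds) > t_G}
    return new_genes, conds


def isa(E, seed_genes, t_C, t_G, max_iter=50):
    """Iterate from a seed gene set until the gene set stops changing."""
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        if not genes:
            break
        new_genes, conds = isa_step(E, genes, t_C, t_G)
        if new_genes == genes:
            break
        genes = new_genes
    return genes, conds


# Toy data (assumption): genes 0-2 respond to conditions 0-1, the rest is noise.
E = [
    [2.0, 2.1, 0.1, -0.2],
    [1.9, 2.2, -0.1, 0.0],
    [2.1, 1.8, 0.2, 0.1],
    [0.1, -0.1, 0.0, 0.2],
    [-0.2, 0.0, 0.1, -0.1],
]
module_genes, module_conds = isa(E, seed_genes={0, 3}, t_C=0.8, t_G=1.0)
print(sorted(module_genes), sorted(module_conds))  # [0, 1, 2] [0, 1]
```

Starting from a seed that mixes one module gene with one unrelated gene, the iteration drops the unrelated gene and recovers the planted block after two updates.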
Limitations of ISA
• Lots of spurious modules (millions…).
• Weak modules may be absorbed by strong ones.
• ISA does not make use of identified modules to find new ones.
[Figure: expression matrix with genes $g_1, \dots, g_{N_g}$ as rows and conditions $c_1, \dots, c_{N_c}$ as columns.]
Progressive Iterative Signature Algorithm (PISA)
[Figure: expression matrix with genes $g_1, \dots, g_{N_g}$ as rows and conditions $c_1, \dots, c_{N_c}$ as columns.]
Advantages of PISA over ISA
• Removing found modules reveals “hidden” modules, and reduces noise for unrelated modules.
• No positive feedback.
• Improved thresholding for genes.
• Combines coregulated and counter-regulated genes.
Example of PISA vs. ISA
[Figure: TF1, TF2, G1, and G2; panels A and B.]
The gene-score threshold
• Goal: fewer than one gene included in the module by mistake.
• Require: a threshold that is insensitive to the (unknown) module size.
Gene scores along the condition vector for some module
Eliminating false modules
For scrambled data, preliminary modules either have few genes or few contributing conditions.
True positives
PISA applied to yeast data
• Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, ~1000 conditions.
• Found ~140 different modules, including all “good” modules found by ISA.
• Found some unknown modules.
• Found many “good” small modules that ISA could not find / separate from the spurious modules.
• ~2600 genes in at least one module, ~900 genes in more than one module.
Some modules found by PISA
Example: Zinc module
Zinc module found by PISA: ZRT1, YNL254C, INO1, ZAP1, YOL154W, ADH4, ZRT3, ZRT2, YOR387C
ZAP1-regulated genes during zinc starvation (Lyons et al., PNAS 97, 7957-7962 (2000)): ZRT1, ZAP1, ZRT2, YNL254C, YOL154W, ZRT3, ADH4, RAD27, ZRC1, …
Comparison with other databases
“Gold standard”: Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
Correlations
[Figure legend: anticorrelated, correlated]
• Oxidative stress response (69); De novo purine biosyn (32); Lysine biosyn (11); Biotin syn & transport (6); Arg biosyn (6); aa biosyn (96)
• Oxidative stress response (69); aryl alcohol dehydrogenase (6); proteolysis (27); trehalose & hexose metabolism/conversion (21); COS genes (11); heat shock (52); repair of disulfide bonds (26)
• Mating genes for type a (15); Mating type a signaling genes (6); Mating (110); Mating factors/receptors: a/α difference (26)
• rRNA processing (117); Ribosomal proteins (126); Histone (19); Fatty acid syn ++ (22); Cell cycle G2/M (31); Cell cycle M/G1 (35); Cell cycle G1/S (66)
Summary
• Data from gene chips can be used to identify transcription modules (TMs).
• Iterative approach (ISA) is promising.
• PISA improves on ISA by taking out found TMs.
– PISA also improves gene thresholding, avoids positive feedback, and improves signal to noise by grouping coregulated and counter-regulated genes.
– PISA very effective for finding “secondary modules”.
http://cn.arxiv.org/abs/q-bio/0311017
Future Directions
• Input to experiment:
– new modules and new genes in old modules.
– what kinds of experiments give the most informative data?
• Improve PISA:
– better pre/post-processing of data.
• Apply PISA to other organisms.
• Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.
De novo purine biosynthesis
Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at $t_G = 5.0$; frequency 16
Galactose induced genes
Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at $t_G = 3.2$; frequency 686
Hexose transporters
Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at $t_G = 3.8$; frequency 41
Peroxide shock
Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at $t_G = 3.4$; frequency (1)
Implementation of PISA
• Normalization of gene-expression data
• Iterative algorithm to find preliminary modules (modified ISA)
– avoiding positive feedback
– gene-score threshold
• Orthogonalization
• Finding consistent modules
Normalization of expression data
Gene-score matrix $E^G$:
Condition-score matrix $E^C$:
removes reference-condition bias
normalizes total RNA levels
makes gene scores comparable
makes condition scores comparable
Iterative algorithm: modified ISA (mISA)
Start with a random set of genes $G_I$.
Produce condition-score vector $s^C$.
Produce gene-score vector $s^G$, using “leave-one-out” scoring to avoid positive feedback.
From $s^G$, calculate gene vector $m^G$ for next iteration.
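One plausible reading of “leave-one-out” scoring is sketched below; the slide does not give the formula, so this is an assumption rather than the authors' definition. Each gene is scored against condition scores recomputed without that gene's own contribution, so a gene cannot inflate its own score. The function names and the toy matrix are hypothetical, and one matrix is used for both the gene-score and condition-score matrices.

```python
def condition_scores(E_C, m_G):
    """Condition score for each condition c: sum over genes of m_G[g] * E_C[g][c]."""
    n_conds = len(E_C[0])
    return [sum(m_G[g] * E_C[g][c] for g in range(len(E_C)))
            for c in range(n_conds)]


def gene_scores_leave_one_out(E_G, E_C, m_G):
    """Score each gene along condition scores that exclude its own contribution."""
    s_C = condition_scores(E_C, m_G)
    n_conds = len(s_C)
    scores = []
    for g in range(len(E_G)):
        # Remove gene g's own term from every condition score.
        s_C_without_g = [s_C[c] - m_G[g] * E_C[g][c] for c in range(n_conds)]
        scores.append(sum(E_G[g][c] * s_C_without_g[c] for c in range(n_conds)))
    return scores


# Toy data (assumption): gene 2 has one large value in a condition no other gene shares.
E = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 5.0],
]
m_G = [1.0, 1.0, 1.0]
loo_scores = gene_scores_leave_one_out(E, E, m_G)
print(loo_scores)  # [1.0, 1.0, 0.0]
```

Without the leave-one-out correction, gene 2 would score 25 purely from its own outlier value; with it, the score is 0 because no other gene supports that condition.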
Orthogonalization
After finding each converged preliminary module ($s^G$, $s^C$), remove component along $s^C$ from all genes:
[Figure: vectors $s_1^C$, $s_2^C$, and $s'$.]
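Removing the component along a condition-score vector can be sketched as a standard vector projection applied to each gene's row of the expression matrix. The slide's own formula did not survive in the transcript, so treat this as an illustration of the general technique; `remove_component` and the toy values are hypothetical.

```python
def remove_component(E, s_C):
    """Subtract from each gene's row its projection onto s_C."""
    norm_sq = sum(x * x for x in s_C)
    if norm_sq == 0:
        return [row[:] for row in E]  # nothing to remove
    result = []
    for row in E:
        # Projection coefficient of this gene's profile onto s_C.
        coeff = sum(e * s for e, s in zip(row, s_C)) / norm_sq
        result.append([e - coeff * s for e, s in zip(row, s_C)])
    return result


# Toy data (assumption): two genes, two conditions, module along the first condition.
E_toy = [[3.0, 1.0], [0.0, 2.0]]
s_C_toy = [1.0, 0.0]
E_reduced = remove_component(E_toy, s_C_toy)
print(E_reduced)  # [[0.0, 1.0], [0.0, 2.0]]
```

After the subtraction every gene's profile is orthogonal to the removed vector, so the next search cannot rediscover the same module.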
Why does scrambled data yield large modules?
Long tails of expression data lead to single-condition modules.
Finding consistent modules
• Repeat PISA runs many times (~30).
• Tabulate preliminary modules.
• A preliminary module contributes to a module if:
– the preliminary module contains > 50% of the genes in the module,
– these genes constitute > 20% of the preliminary module.
• A gene is included in a module if it appears in >50% of the contributing modules, always with the same gene-score sign.
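The two membership rules above can be written as small set operations. This sketch covers only the stated criteria (the >50% and >20% overlap tests, and the >50% same-sign vote); how candidate modules are first proposed is not described on the slide and is left out. The gene names and function names are hypothetical.

```python
def contributes(prelim_genes, module_genes):
    """True if the preliminary module passes both overlap criteria."""
    shared = set(prelim_genes) & set(module_genes)
    return (len(shared) > 0.5 * len(module_genes)
            and len(shared) > 0.2 * len(prelim_genes))


def consensus_genes(contributing):
    """Genes seen in more than half of the contributing modules, always with one sign.

    `contributing` is a list of dicts mapping gene name -> gene-score sign (+1 or -1).
    """
    signs = {}
    for prelim in contributing:
        for gene, sign in prelim.items():
            signs.setdefault(gene, []).append(sign)
    return {gene for gene, seen in signs.items()
            if len(seen) > 0.5 * len(contributing) and len(set(seen)) == 1}


# Hypothetical preliminary modules from three runs.
prelims = [
    {"A": 1, "B": 1, "C": -1},
    {"A": 1, "B": -1},
    {"A": 1, "C": -1, "D": 1},
]
consensus = consensus_genes(prelims)
print(sorted(consensus))  # ['A', 'C']
```

Gene B is rejected because its sign flips between runs, and gene D because it appears in only one of three contributing modules.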
Comparison with other databases
Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))

p value:

$$p = 1 - \sum_{i=0}^{n-1} \frac{\binom{c}{i} \binom{N_g - c}{m - i}}{\binom{N_g}{m}}$$

$N_g$: number of genes in organism
$m$: number of genes in module
$c$: number of genes in GO category
$n$: number of genes in both module and GO category
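The p value is the upper tail of a hypergeometric distribution: the probability that a random set of m genes shares at least n genes with a category of size c. A minimal sketch using exact integer arithmetic follows; the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(N_g, m, c, n):
    """Probability of an overlap of at least n between a module and a category.

    N_g: genes in organism, m: genes in module,
    c: genes in category, n: genes in both.
    """
    total = comb(N_g, m)
    # Number of ways to draw m genes with fewer than n of them in the category.
    below = sum(comb(c, i) * comb(N_g - c, m - i) for i in range(n))
    # Subtract as exact integers before dividing to limit rounding error.
    return (total - below) / total


p_small = enrichment_p_value(N_g=10, m=3, c=4, n=2)
print(round(p_small, 4))  # 0.3333
```

For a genome-scale call such as `enrichment_p_value(6000, 32, 40, 10)` the integers become large, but Python handles them exactly until the final division.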
Correlation of modules

$$\mathrm{Corr}(m_1, m_2) = m_1^{C\prime} \cdot m_2^{C\prime}, \qquad m^{C\prime} = \frac{m^C}{|m^C|}$$

[Figure: expression matrix with genes $g_1, \dots, g_{N_g}$ as rows and conditions $c_1, \dots, c_{N_c}$ as columns.]