Transcription factor binding sites and gene regulatory network
Victor Jin
Department of Biomedical Informatics
The Ohio State University
Transcription in higher eukaryotes
Gene Expression
1. Chromatin structure
2. Initiation of transcription
3. Processing of the transcript
4. Transport to the cytoplasm
5. mRNA translation
6. mRNA stability
7. Protein activity stability
Transcriptional Regulation
[Figure labels: nuclear membrane; binding site/motif CCG__CCG; genome-wide mRNA transcript data (e.g. microarrays)]
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements
Some common approaches
• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, … to find potentially coregulated genes
– Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
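The second step of the cluster-first approach can be illustrated with a deliberately simple stand-in. The sketch below is not MEME, Consensus, a Gibbs sampler or AlignACE; it only ranks fixed-length k-mers by how much more often they occur in the promoters of one cluster than in a background set. The function names, the pseudocount and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def kmer_presence(promoters, k):
    """Count, for each k-mer, how many promoters contain it at least once."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(cluster, background, k=6, pseudocount=1.0):
    """Rank k-mers by the ratio of smoothed presence fractions (cluster / background)."""
    in_cluster = kmer_presence(cluster, k)
    in_background = kmer_presence(background, k)
    ranked = []
    for kmer, hits in in_cluster.items():
        frac_cluster = (hits + pseudocount) / (len(cluster) + 2 * pseudocount)
        frac_background = (in_background[kmer] + pseudocount) / (len(background) + 2 * pseudocount)
        ranked.append((frac_cluster / frac_background, kmer))
    return sorted(ranked, reverse=True)

# Hypothetical toy promoters: the cluster shares TGACC, the background does not.
cluster = ["AATGACCTA", "CCTGACCGG", "GTGACCTTT"]
background = ["AAAAAAAAA", "CCCCGGGGC", "ATATATATA", "GCGCGCGCG"]
top_ratio, top_kmer = enriched_kmers(cluster, background, k=5)[0]
print(top_kmer)
```

A real motif finder models degenerate positions and tests significance; an exact k-mer ratio like this one is only a first look at what a co-expressed cluster has in common.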
Training data – Features
label
promoter sequence
regulator expression
feature vector
What is PWM?
Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.
A position weight matrix (PWM) specifies how likely you are to see a given base at each index position of the motif.
PWM for ERE
Pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A    18  8  5  4  1 29  7  7  7  0  1 39  1  1  6
C     8  3  3  9 33  4 21 15 14  0  0  1 43 39 18
G    13 31 34  9  8 10 11 15 19  4 44  3  0  1  6
T     7  4  4 24  4  3  7  9  6 42  1  3  2  5 16
Con   N  G  G  T  C  A  N  N  N  T  G  A  C  C  N
1. acggcagggTGACCc
2. aGGGCAtcgTGACCc
3. cGGTCGccaGGACCt
4. tGGTCAggcTGGTCt
5. aGGTGGcccTGACCc
6. cTGTCCctcTGACCc
7. aGGCTAcgaTGACGt ...
41. cagggagtgTGACCc
42. gagcatgggTGACCa
43. aGGTCAtaacgattt
44. gGAACAgttTGACCc
45. cGGTGAcctTGACCc
46. gGGGCAaagTGACTg
Position frequency matrix (PFM) (also known as raw count matrix)
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position.
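As a small illustration of the PFM definition, the sketch below counts bases per position for the first seven ERE sites listed earlier and then normalizes each column. The function names `build_pfm` and `normalize` are made up for this example, and letter case is simply ignored.

```python
BASES = "ACGT"

# First seven aligned sites from the ERE list; case is ignored when counting.
sites = [
    "acggcagggTGACCc",
    "aGGGCAtcgTGACCc",
    "cGGTCGccaGGACCt",
    "tGGTCAggcTGGTCt",
    "aGGTGGcccTGACCc",
    "cTGTCCctcTGACCc",
    "aGGCTAcgaTGACGt",
]

def build_pfm(seqs):
    """Count each base at each position of equal-length, aligned sequences."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in BASES}
    for s in seqs:
        for i, base in enumerate(s):
            pfm[base][i] += 1
    return pfm

def normalize(pfm):
    """Divide each column by its sum so that every column adds up to one."""
    length = len(pfm["A"])
    totals = [sum(pfm[b][i] for b in BASES) for i in range(length)]
    return {b: [pfm[b][i] / totals[i] for i in range(length)] for b in BASES}

pfm = build_pfm(sites)
probs = normalize(pfm)
print(pfm["A"][0], pfm["T"][9])  # counts of A at position 1 and T at position 10
```

With only seven sites several cells are zero, which is the situation the pseudocount correction is meant to handle.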
Position weight matrix (PWM) (also known as position-specific scoring matrix)
PFM should be converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
Position Weight Matrix for ERE
Converting a PFM into a PWM
For each matrix element do:
$$w(b,i) = \log_2 \frac{p(b,i)}{p(b)} = \log_2 \frac{f_{b,i} + \sqrt{N}/4}{(N + \sqrt{N})\, p(b)}$$
$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i
N – number of sequences used to create PFM (= column sum)
$\sqrt{N}$ and $\sqrt{N}/4$ – pseudocounts (correction for small sample size)
p(b) – background frequency of nucleotide b

PFM ($f_{b,i}$):
Pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A    18  8  5  4  1 29  7  7  7  0  1 39  1  1  6
C     8  3  3  9 33  4 21 15 14  0  0  1 43 39 18
G    13 31 34  9  8 10 11 15 19  4 44  3  0  1  6
T     7  4  4 24  4  3  7  9  6 42  1  3  2  5 16
PWM (w(b,i)):
Pos       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A      0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C     -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G      0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T     -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23
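The conversion can be sketched in a few lines. The code below applies a pseudocount of sqrt(N)/4 per cell and assumes a uniform background p(b) = 0.25; that background value is an assumption, chosen because it reproduces tabulated weights such as 0.58 for A at position 1 and -2.96 for a zero count. The function name `pfm_to_pwm` is made up for this example.

```python
import math

BASES = "ACGT"

# Count matrix (PFM) for the ERE, copied from the slide; every column sums to 46.
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def pfm_to_pwm(pfm, background=None):
    """Log-odds weights with a sqrt(N)/4 pseudocount added to every cell."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform background
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        n = sum(pfm[b][i] for b in BASES)      # column sum N
        pseudo = math.sqrt(n)                   # total pseudocount for the column
        for b in BASES:
            p_bi = (pfm[b][i] + pseudo / 4) / (n + pseudo)
            pwm[b].append(math.log2(p_bi / background[b]))
    return pwm

pwm = pfm_to_pwm(PFM)
print(round(pwm["A"][0], 2), round(pwm["A"][9], 2))
```

One caveat: the tabulated last column does not seem to follow from the same formula (for A at position 15 this sketch gives about -0.78 against the tabulated -0.72), so the agreement should be read as covering positions 1 to 14.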
Scoring putative EREs by scanning the promoter with PWM
Pos       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A      0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C     -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G      0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T     -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

Site: G G G T C A G C A T G G C C A
Absolute score of the site:
$$S = \sum_{i=1}^{m} w(b,i) = 11.57$$

Pos       1      2      3      4      5      6      7      8      9     10     11     12     13     14  Row Sum
Max    0.58   1.31   1.44   0.96   1.39   1.22   0.78   0.34   0.65   1.73   1.79   1.62   1.76   1.62    17.20
Min   -0.60  -1.49  -1.49  -1.21  -2.29  -1.49  -0.60  -0.60  -0.78  -2.96  -2.96  -2.29  -2.96  -2.29   -24.02

$$\text{relative score} = \frac{\text{Absolute score} - \text{Minimum score}}{\text{Maximum score} - \text{Minimum score}} = \frac{11.57 + 24.02}{17.20 + 24.02} = 0.86$$
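The scoring and scanning steps can be sketched as follows, using the tabulated PWM. Two points are assumptions rather than statements from the slide: the totals 11.57, 17.20 and -24.02 match a sum over columns 1 to 14, so `WIDTH` is set to 14, and the relative-score cutoff of 0.8 in `scan` is an arbitrary illustrative value. All function names are hypothetical, and only the forward strand is scanned.

```python
BASES = "ACGT"

# PWM for the ERE, copied from the slide (15 columns as tabulated).
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60, -2.96, -2.29, 1.62, -2.29, -2.29, -0.72],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25, -2.96, -2.96, -2.29, 1.76, 1.62, 0.46],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65, -1.21, 1.79, -1.49, -2.96, -2.29, -0.64],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78, 1.73, -2.29, -1.49, -1.84, -0.98, 0.23],
}
WIDTH = 14  # assumption: the slide's totals correspond to columns 1-14

def absolute_score(site, pwm=PWM, width=WIDTH):
    """Sum the weight of the observed base at each scored position."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()[:width]))

def score_bounds(pwm=PWM, width=WIDTH):
    """Lowest and highest score any sequence can reach over the scored columns."""
    lo = sum(min(pwm[b][i] for b in BASES) for i in range(width))
    hi = sum(max(pwm[b][i] for b in BASES) for i in range(width))
    return lo, hi

def relative_score(site, pwm=PWM, width=WIDTH):
    lo, hi = score_bounds(pwm, width)
    return (absolute_score(site, pwm, width) - lo) / (hi - lo)

def scan(sequence, pwm=PWM, width=WIDTH, threshold=0.8):
    """Slide the PWM along the forward strand; return (offset, relative score) hits."""
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if set(window) <= set(BASES):  # skip windows containing N or other symbols
            rel = relative_score(window, pwm, width)
            if rel >= threshold:
                hits.append((start, round(rel, 2)))
    return hits

site = "GGGTCAGCATGGCCA"
print(round(absolute_score(site), 2))  # 11.57
print(round(relative_score(site), 2))  # 0.86
hits = scan("TTTT" + site + "AAAA")
```

Because the bounds are recomputed from the rounded table entries they come out as 17.19 and -24.01, one hundredth away from the slide's row sums; the relative score still rounds to 0.86.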
Yeast ESR: Biological Validation
STRE element
Universal stress repressor motif
Previous work: “Structure learning”
• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction
(Segal et al. 2003, 2004)
(Pe’er et al. 2001)
Signaling networks in a cell
Network inference
• Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, co-occurrence
[Figure: three diagrams labeled Direct binding, Indirect effect, Co-occurrence]
• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)
• Still, can determine statistically significant regulator-target relationships from regulation program
Example: oxygen sensing and regulatory network
Binding data for regulatory networks
• ChIP-chip: genome-wide protein-DNA binding data, i.e. what promoters are bound by TF?
• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs
[Figure labels: TF, P1, P2]
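One way to picture (regulator, TF-occupancy) pair features is as binary indicators that are on only when a regulator is in a given expression state in an experiment and a TF occupies the promoter of the gene in question. The sketch below is an assumption about how such features could be encoded, not the model used in the talk; the regulator names, TF names, gene names and the up/down discretization are all hypothetical.

```python
# Hypothetical toy inputs, not data from the slides.
regulator_state = {  # regulator -> state per experiment (+1 up, -1 down, 0 unchanged)
    "REG1": {"exp1": 1, "exp2": -1},
    "REG2": {"exp1": 0, "exp2": 1},
}
tf_occupancy = {     # TF -> genes whose promoter is bound according to ChIP-chip
    "TF_A": {"gene1", "gene2"},
    "TF_B": {"gene2"},
}

def pair_features(gene, experiment):
    """Binary features keyed by (regulator state, TF) for one gene in one experiment."""
    feats = {}
    for reg, states in regulator_state.items():
        for tf, bound_genes in tf_occupancy.items():
            for direction, label in ((1, "up"), (-1, "down")):
                active = states[experiment] == direction and gene in bound_genes
                feats[(f"{reg}_{label}", tf)] = int(active)
    return feats

features = pair_features("gene1", "exp1")
print(features[("REG1_up", "TF_A")])
```

Each gene-experiment pair then has one feature vector, with the gene's expression change in that experiment as the label.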
Inferring regulatory networks from the combination of expression data and binding data
An extended ER regulatory network in MCF7 cells
[Figure: network diagram. Node labels: CCNL1, BRF1, ER, FOS, MYC, CEBP, XBP1, RXRA, HSF2, PNN, NRIP1, TXNDC, IVNS1ABP, BATF, HES1, CHAF1B, CSDE1, CUTL1, PURB, ADAR, C140RF43, SP3, DDX20, ELF3, TXNIP, PAWR, BRIP1, FOXP4, ZNF394, BAZ1B, STRAP, ASCC3, MKL2, GTF2I, RUVBL1, RFC1, ZNF500, TTF2, RAB18, ZKSCAN1, MSX2, LASS2, HDAC1, ZBTB41, TBX2, THRAP1, VPS72, TLE3, BHLHB2, ZNF38, ZNF239, DNMT1, HIF1A, HEY2]
Signaling molecules -- Networks
• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features
• e.g. Hsf1 with Gac1, Gip1, Sds22 (Glc7 phosphatase complex)
• Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (Interaction supported in literature)
[Figure labels: SM, TF, mRNA]
Input Data
• FASTA file
• Contact Info
• Control data (optional)
Ab initio Motif Discovery Programs
• MEME
• MaMF
• Weeder
Statistical Methods
• Bootstrap re-sampling
• Fisher test
STAMP Matching
Results
• SeqLogo
• PWM
• P-value
• Known or novel motifs
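As an illustration of where a Fisher test can fit in such a workflow, the sketch below asks whether a motif occurs in more input sequences than control sequences, using a one-sided Fisher exact test written with the standard library only. How the server actually applies the test is not stated here, so the 2x2 layout, the function name and the counts are assumptions.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x) for x in range(a, upper + 1))
    return tail / denom

# Hypothetical counts: sequences with / without the motif in input and control sets.
input_with, input_without = 30, 20
control_with, control_without = 10, 40
p_value = fisher_exact_greater(input_with, input_without, control_with, control_without)
print(p_value)
```

A small p-value suggests the motif is enriched in the input set relative to the control; a bootstrap over resampled sequences is a complementary way to gauge how stable that enrichment is.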
http://motif.bmi.ohio-state.edu/ChIPMotifs/
http://motif.bmi-ohio-state.edu/HRTBLDb
Software Demo
• W-ChIPMotifs
• HRTargetDB