Post on 27-Dec-2015
Detection and analysis of transcriptional control sequences
Wyeth Wasserman
October VanBUG Seminar
Centre for Molecular Medicine and Therapeutics
Children’s and Women’s Hospital
University of British Columbia
CMMT
Overview of Transcription in Gene Regulation
• At the most basic level, transcriptional regulation is defined by binding of TFs to DNA
• Complexity is increased by TF interactions, chromatin structure and protein modifications
• How can we advance our understanding of regulation by computational analysis?
Representing Binding Sites for a TF (HNF1)
• A set of sites represented as a consensus
» VDRTWRWWSHDWVWH
• A matrix describing a set of sites
A 14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C  3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G  4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T  0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
• A single HNF1 site
» AAGTTAATGATTAAC
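To make the consensus representation concrete, here is a minimal sketch that checks a site against an IUPAC consensus string. The function name `matches_consensus` is hypothetical (not from the talk); the code table is the standard IUPAC nucleotide alphabet.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_consensus(site, consensus):
    """Return True if every base of `site` is allowed by `consensus`."""
    if len(site) != len(consensus):
        return False
    return all(base in IUPAC[code] for base, code in zip(site, consensus))

# The single HNF1 site from the slide, checked against the slide's consensus.
print(matches_consensus("AAGTTAATGATTAAC", "VDRTWRWWSHDWVWH"))
```

A consensus only answers yes or no; the count matrix keeps the per-position frequencies that the consensus throws away.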
PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic
f matrix
A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1
w matrix
A  1.6 -1.7 -0.2 -1.7 -1.7
C -1.7  0.5  0.5  1.3 -1.7
G -1.7  1.0 -0.2 -1.7  1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
[Formula converting the f matrix to the w matrix: Log( ) of an expression in f(b,i), p(b) and N; layout lost in extraction]
TGCTG = 0.9
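The conversion formula did not survive extraction, so the sketch below is an assumption rather than the slide's own formula. It uses a common form, a log2 odds ratio with a sqrt(N) pseudocount and a uniform background p(b) = 0.25. With those assumptions it reproduces the w matrix values above to one decimal place, and the TGCTG score of 0.9. The names `pfm_to_pwm` and `score` are hypothetical.

```python
import math

# Count matrix (f matrix) from the slide: per-position counts for each base.
F = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log2 odds, assuming a sqrt(N) pseudocount."""
    n = sum(pfm[base][0] for base in pfm)   # number of sites, N
    pseudo = math.sqrt(n)                   # assumed pseudocount weight
    pwm = {}
    for base, counts in pfm.items():
        pwm[base] = [
            math.log2(((c + pseudo * background) / (n + pseudo)) / background)
            for c in counts
        ]
    return pwm

def score(seq, pwm):
    """Sum the position-specific weights for a sequence of matrix width."""
    return sum(pwm[base][i] for i, base in enumerate(seq))

W = pfm_to_pwm(F)
for base in "ACGT":
    print(base, " ".join(f"{w:5.1f}" for w in W[base]))
print("TGCTG =", round(score("TGCTG", W), 1))
```

Working in log space is what makes the site score a simple sum over positions.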
Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Theorem
– Nearly 100% of predicted transcription factor binding sites have no function in vivo
Phylogenetic Footprinting to Identify Functional Segments
[Figure: % identity (y-axis) vs. 200 bp window start position in the human sequence (x-axis). Actin gene compared between human and mouse with DPB.]
Regulatory sites are usually conserved between orthologous genes
HUMAN ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA
      -|-|||||||||-|---|--|||-------|-|---|
MOUSE GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA

HUMAN ACATCAGCATACACGCAACTACACAGACTACGACTA
      ---|||||-||||---|-|----||-||-||||---
MOUSE CGTTCAGCTTACAGCTAGCATAGCATACGACGATAC
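As a rough illustration of how a conservation profile can be derived from an alignment like the one above, the sketch below marks identical columns and reports percent identity per sliding window. It is not the DPB program named on the earlier slide. The function names are hypothetical, `.` is assumed to be the gap character, and a 10-column window is used only because the example sequence is short (the plot uses 200 bp windows).

```python
def match_line(a, b, gap="."):
    """Mark identical aligned bases with '|' and everything else with '-'."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("|" if x == y and x != gap else "-" for x, y in zip(a, b))

def window_identity(a, b, window, step=1):
    """Percent identity for each window of aligned columns."""
    marks = match_line(a, b)
    return [
        100.0 * marks[i:i + window].count("|") / window
        for i in range(0, len(marks) - window + 1, step)
    ]

# First aligned block from the slide.
human = "ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA"
mouse = "GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA"
print(match_line(human, mouse))
print(window_identity(human, mouse, window=10, step=5))
```

Windows whose identity exceeds a chosen cutoff would be the candidate conserved segments to keep for binding site prediction.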
Choosing the “right” species...
(BONUS: What’s the ultimate sin in bioinformatics?)
[Figure: three pairwise comparison panels, HUMAN vs. COW, HUMAN vs. MOUSE and HUMAN vs. CHICKEN]
Performance: Human vs. Mouse
• Testing set: 40 experimentally defined sites in 15 well studied genes
• 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
Unraveling Transcriptional Control Mechanisms
Given a set of “co-regulated” genes, define motifs over-represented in the regulatory regions
Pattern Detection Methods
• Exhaustive – e.g. “Moby Dick” (Bussemaker, Li & Siggia)
– Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
– Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
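The exhaustive idea can be sketched in a few lines: count every oligomer in the "+" and "-" promoter collections and rank by relative frequency. This is a deliberately naive ratio, not the statistical model used by "Moby Dick". The function names, the pseudocount and the toy sequences are all invented for illustration.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer occurrence across a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(pos_seqs, neg_seqs, k, pseudo=1.0):
    """Rank k-mers by the ratio of their frequency in '+' vs. '-' sets."""
    pos, neg = kmer_counts(pos_seqs, k), kmer_counts(neg_seqs, k)
    pos_total = sum(pos.values()) or 1
    neg_total = sum(neg.values()) or 1
    ratios = {
        kmer: ((pos[kmer] + pseudo) / pos_total) / ((neg[kmer] + pseudo) / neg_total)
        for kmer in pos
    }
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy promoter fragments (made up; the "+" set shares the core GACGTC).
plus = ["TTGACGTCAAT", "CCTGACGTCAG", "ATGACGTCATT"]
minus = ["TTAGCCATAGA", "CCGATTACAGG", "ATCGGATCTTA"]
print(enrichment(plus, minus, k=6)[:3])
```

Exhaustive counting scales poorly with oligomer length and handles degenerate sites badly, which is one motivation for the sampling methods in the second bullet.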
Yeast tests of YRSA System
PDR3-regulated genes from array study
Classic cell-cycle array data re-clustered by Getz et al
DNA-damage response partially mediated by MCB
THE PROBLEM:
Pattern Detection in Long Sequences
[Figure: MEF2 similarity score (y-axis, 10 to 18) vs. sequence length (x-axis, 0 to 600) for the MEF2 set and a random set]
Four Approaches to Extend Sensitivity
• Phylogenetic Footprinting
– Human-Mouse eliminates ~75% of sequence
• Better background models
– e.g. AnnSpec
• Better definition of co-regulation
– Microarrays occasionally produce noise
• Use biochemical knowledge about TFs
– TFBS patterns are NOT random
Some characteristics have been explored…
• Segmentation: informative positions separated by variable positions (proteins bind as dimers)
• Positional Variance: subset of positions contain most of the info
• Palindromes are common in the patterns
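The palindrome property is easy to test: a site is palindromic in the DNA sense when it equals its own reverse complement. A minimal sketch, with hypothetical function names:

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(site):
    """True if the site reads the same on both strands."""
    return site == reverse_complement(site)

print(is_palindromic("CACGTG"))           # a perfect palindrome
print(is_palindromic("AAGTTAATGATTAAC"))  # the HNF1 site shown earlier
```

Real binding sites are often only approximately palindromic, so a practical check would tolerate some mismatches between the two half-sites.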
Our Hypothesis
• Point 1: Structurally-related DNA binding domains interact with similar target sequences
• Exceptions exist (e.g. Zn-fingers)
• Point 2: There are a finite number of binding domains used in human TFs
• Approximately 20-25
• Idea: We could use the shared binding properties for each family to focus pattern detection methods
• Constrain the range of patterns sought
Comparison of profiles requires alignment and a scoring function
• Scoring function based on sum of squared differences
• Align frequency matrices with modified Needleman-Wunsch algorithm
• Calculate empirical p-values based on simulated set of matrices
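The three steps above can be sketched as follows, under simplifying assumptions. The column score is taken to be 2 minus the sum of squared differences (so identical columns score 2), and the alignment is an ungapped slide of one matrix along the other rather than the modified Needleman-Wunsch algorithm the slide names. Random matrices are drawn naively. All function names are hypothetical.

```python
import random

def column_score(c1, c2):
    """2 minus the sum of squared differences between two frequency columns."""
    return 2.0 - sum((a - b) ** 2 for a, b in zip(c1, c2))

def best_ungapped_score(m1, m2):
    """Slide m2 along m1 and return the best summed column score."""
    best = float("-inf")
    for offset in range(-(len(m2) - 1), len(m1)):
        total = 0.0
        for j, col in enumerate(m2):
            i = offset + j
            if 0 <= i < len(m1):
                total += column_score(m1[i], col)
        best = max(best, total)
    return best

def random_matrix(width, rng):
    """Random frequency matrix: each column is a normalized random vector."""
    cols = []
    for _ in range(width):
        raw = [rng.random() for _ in range(4)]
        total = sum(raw)
        cols.append([x / total for x in raw])
    return cols

def empirical_p(m1, m2, trials=1000, seed=0):
    """Fraction of random matrix pairs scoring at least as well as m1 vs. m2."""
    rng = random.Random(seed)
    observed = best_ungapped_score(m1, m2)
    hits = sum(
        best_ungapped_score(random_matrix(len(m1), rng),
                            random_matrix(len(m2), rng)) >= observed
        for _ in range(trials)
    )
    return (hits + 1) / (trials + 1)

# Matrices are lists of columns, each column holding [A, C, G, T] frequencies.
m = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
print(best_ungapped_score(m, m), empirical_p(m, m, trials=200))
```

A gapped alignment would additionally let the informative halves of a dimeric site line up when the spacer length differs between two profiles.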
[Figure: frequency (y-axis) vs. score (x-axis)]
Prediction of TF Class
TF Database (JASPAR)
COMPARE
Match to bHLH
Jackknife Test 87% correct
Independent Test Set 93% correct
APPLICATION:
Cancer Protection Response
• Detoxification-related enzymes are induced by compounds present in Broccoli
• Arrays, SSH and hard work have defined a set of responsive genes
• A known element mediates the response (Antioxidant Responsive Element)
• Controversy over the type of mediating leucine zipper TF
• NF-E2/Maf or Jun/Fos
Gibbs Sampling
Application (2)
Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.
Gibbs with FBP Prior
Classify New TF Motif
Maf (p<0.02)
Jun (p<0.98)
Layers of Complexity in Metazoan Transcription
Chromatin picture used with permission of Zymogenetics.
Liver Differentiation (data mostly from studies of hepatocytes)
CEBP HNF3 HNF1 HNF4
Stem Early Fetal Mature
Models for Liver TFs…
(Data that takes 2 months to produce and 10 seconds to present)
(Or, what to do with an astrophysicist new to bioinformatics)
HNF1
C/EBP
HNF3
HNF4
Training predictive models for modules
• Limited by small size of positive training set
• We elected to use logistic regression analysis for the first models
• Your favorite statistical approach would probably do equally well
– data limited
Logistic Regression Analysis
“logit”
Optimize vector to maximize the distance between output values for positive and negative training data.
Output value is:
$p(x) = \dfrac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$
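A small sketch of the output function, assuming the logit is a weighted sum of per-TF site scores plus a bias. That linear form, the weights and the feature values here are illustrative assumptions, not the trained liver model.

```python
import math

def logit(features, weights, bias=0.0):
    """Linear combination of feature values (assumed form of the logit)."""
    return bias + sum(w * x for w, x in zip(weights, features))

def p(features, weights, bias=0.0):
    """Logistic output e^logit / (1 + e^logit), a value between 0 and 1."""
    z = logit(features, weights, bias)
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical best site scores for four TFs within one sequence window.
window_scores = [1.2, 0.4, 2.0, 0.1]
weights = [0.8, 0.5, 1.1, 0.3]   # made-up weights, not fitted values
print(p(window_scores, weights, bias=-2.0))
```

Training then amounts to choosing the weight vector so that positive windows score near 1 and negative windows near 0.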
UDPGT1 (Gilbert’s Syndrome)
[Figure: liver module model score (y-axis) vs. “window” position in sequence (x-axis), for wildtype and mutant]
PERFORMANCE
• Liver (Genome Research, 2001)
– At 1 hit per 35 kbp, identifies 60% of modules
– Limited to genes expressed late in liver development
• Skeletal Muscle (JMB, 1998)
– Set to 1 prediction per 35 000 bp
– Identifies 66% of test set correctly
LRA Models do not account for multiple sites for the same TF*
* Side-track: Newer Methods
Genome Scan
• Screened the available mouse genomic sequences (~300 MB) for modules and discarded hits for which sequence was not conserved with human (BLAST)
• Removed regions for which corresponding human sequence did not score as module
• Of ~100 predicted modules
• 20 annotated genes: 5 from training, 3 additional modules, 5 liver specific, 3 unknown and 4 not liver
Focus on regulatory modules for pattern detection
Cluster Genes by Expression
Identify and Model Contributing TFs
6 0 0 0 7 0 0
2 8 4 7 1 0 2
0 0 4 0 0 8 0
0 0 0 1 0 0 6
Predictive Models
Finding binding sites in sets of co-regulated human genes
• Sequence “space” is too large
– Narrow with Phylogenetic Footprinting
• Identify patterns in conserved blocks via Gibbs sampling
• Assess quality of patterns based on biological knowledge
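For readers unfamiliar with the sampling step, here is a bare-bones site sampler, assuming one site per sequence, a fixed motif width and a uniform background. It omits the phylogenetic filtering and the familial binding profile prior described in this talk, and the function names and toy sequences are invented for illustration.

```python
import random

BASES = "ACGT"

def build_profile(seqs, starts, width, skip, pseudo=0.5):
    """Column-wise base probabilities from all current sites except `skip`."""
    cols = [{b: pseudo for b in BASES} for _ in range(width)]
    for k, (s, st) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue
        for j in range(width):
            cols[j][s[st + j]] += 1
    for col in cols:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total
    return cols

def gibbs(seqs, width, iterations=200, seed=0, background=0.25):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for k, s in enumerate(seqs):
            profile = build_profile(seqs, starts, width, skip=k)
            # Motif-vs-background likelihood ratio for every start in s.
            weights = []
            for st in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= profile[j][s[st + j]] / background
                weights.append(w)
            # Sample the new start in proportion to its likelihood ratio.
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy conserved blocks (made up), each containing the planted word TGACGTCA.
blocks = ["ACCTTGACGTCAGGA", "TGACGTCATTCCAGA", "GGCATATGACGTCAC", "CATGACGTCAGTTAG"]
starts = gibbs(blocks, width=8)
print([b[s:s + 8] for b, s in zip(blocks, starts)])
```

Because the sampler is stochastic, it is usually run from several random starts and the resulting patterns are compared, which is where the quality assessment in the last bullet comes in.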
Skeletal Muscle Genes
• One of the most extensively studied tissues for transcriptional regulation
– 45 genes partially analyzed
– 26 genes with orthologous genomic sequence from human and rodent
• Five primary classes of transcription factors
– Principal: Myf (myoD), Mef2, SRF
– Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)
de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites
Mef2-Like SRF-Like Myf-Like
COMING SOON:
The Integrated Module Sampler
Gene1 Gene2 Gene3 Gene4 Gene5
Calls to ensEMBL
Calls to GeneLynx
Calls to BlastZ (Switch to Lagan?)
Module Sampler
Conclusions
• Evolution drives understanding in biology
– Phylogenetic Footprinting
• Biochemistry inspires Bioinformatics
– Regulatory Modules
– Familial Binding Profiles
• Analysis of regulatory sequences is improving
– Given sets of orthologous genes, one can predict regulatory regions
– Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors
• Much more work is needed…
THANKS!
Wasserman Group – CMMT
Danielle Kemmer
Several Newcomers
Wasserman Group - Sweden
Albin Sandelin
Raf Podowski (CA)
Wynand Alkema
Collaborating Students
Malin Andersson (Odeberg)
Öjvind Johansson (Lagergren)
Hui Gao (Dahlman-Wright)
Emily Hodges (Höög)
Support: Merck-Frosst, C&W, Pharmacia, EU–Marie Curie, CGDN, KI-Funder
Collaborators
Chip Lawrence (Wadsworth)
Boris Lenhard (K.I.)
Jens Lagergren (SBC)
Christer Höög (K.I.)
Brenda Gallie (OCI)
Jacob Odeberg (KTH)
Niclas Jareborg (AZ)
William Hayes (AZ)
Group Alumni
Per Engström
Elena Herzog
Annette Höglund
William Krivan
Boris Lenhard
Luis Mendoza