CSCE555 Bioinformatics

Lecture 11 Promoter Prediction

Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering
2008 www.cse.sc.edu

HAPPY CHINESE NEW YEAR



Outline

• Introduction to DNA Motif
• Motif Representations (Recap)
• Motif database search
• Algorithms for motif discovery

Search Space

• N sequences, each of length L
• Motif width = W
• Size of search space = $(L - W + 1)^N$
• L = 100, W = 15, N = 10: size ≈ $10^{19}$
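The count above can be checked directly. This is a minimal sketch; the function name `search_space_size` is made up for illustration and simply evaluates (L - W + 1)^N for the slide's numbers.

```python
def search_space_size(num_seqs: int, seq_len: int, motif_width: int) -> int:
    """Number of ways to choose one motif start in each sequence: (L - W + 1)^N."""
    starts_per_seq = seq_len - motif_width + 1
    if starts_per_seq < 1:
        raise ValueError("motif is wider than the sequence")
    return starts_per_seq ** num_seqs


# Slide values: N = 10 sequences, L = 100, W = 15, so 86 starts per sequence.
size = search_space_size(num_seqs=10, seq_len=100, motif_width=15)
print(f"{size:.1e}")  # order of magnitude of 86**10
```

Exhaustive enumeration is clearly out of reach at this size, which is why the following slides turn to sampling.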

Worked Example

Count matrix $c_{ki}$ (columns: motif positions $k = 1, \dots, W$; rows: nucleotides $i$):

     1  2  3  4
a    0  2  0  3
c    4  0  2  1
g    0  1  2  0
t    0  1  0  0

N = 4, $p_i = 1/4$

$$\text{score} = \sum_{k=1}^{W} \ln \frac{3! \, \prod_{i \in \{a,c,g,t\}} c_{ki}!}{(N+3)! \, \prod_{i \in \{a,c,g,t\}} p_i^{c_{ki}}}$$

$$\prod_{i} p_i^{c_{ki}} = \left(\tfrac{1}{4}\right)^{N} = \tfrac{1}{256}, \qquad \frac{3!}{(N+3)! \, \prod_{i} p_i^{c_{ki}}} = \frac{32}{105}$$

Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29
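The four per-column terms can be reproduced numerically. The sketch below assumes the column score is ln of 3!·∏c! divided by (N+3)!·∏p^c, which is the form that matches the slide's values; the variable and function names are illustrative only.

```python
from math import factorial, log

# Count matrix from the slide: one dict per motif position k = 1..4.
counts = [
    {"a": 0, "c": 4, "g": 0, "t": 0},
    {"a": 2, "c": 0, "g": 1, "t": 1},
    {"a": 0, "c": 2, "g": 2, "t": 0},
    {"a": 3, "c": 1, "g": 0, "t": 0},
]
background = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def column_score(column, p):
    """ln( 3! * prod_i c_i! / ((N+3)! * prod_i p_i^c_i) ) for one motif column."""
    n = sum(column.values())
    numerator = factorial(3)
    denominator = float(factorial(n + 3))
    for base, c in column.items():
        numerator *= factorial(c)
        denominator *= p[base] ** c
    return log(numerator / denominator)


column_scores = [column_score(col, background) for col in counts]
total_score = sum(column_scores)
print([round(s, 2) for s in column_scores])
print(round(total_score, 3))
```

Summing the unrounded terms gives a total slightly above the slide's 2.29, which is obtained by adding the already rounded per-column values.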

Gibbs Sampling Search

Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!)

1. Start at a random point X.
2. Randomly pick a dimension.
3. Look at all points along this dimension.
4. Move to one of them randomly, with probability proportional to its score π.
5. Repeat from step 2.
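The procedure on this slide can be sketched for a small grid. Everything here is an illustrative assumption: the grid size, the `toy_score` function (one high-scoring cell on a flat background) and the function names are invented for the example, and the score must be positive so it can serve as a sampling weight.

```python
import random


def gibbs_2d(score, width, height, steps, rng):
    """Walk a width x height grid, resampling one coordinate per step."""
    x, y = rng.randrange(width), rng.randrange(height)  # random starting point X
    for _ in range(steps):
        if rng.randrange(2) == 0:
            # Dimension 1: score every point in the current row, then move.
            weights = [score(i, y) for i in range(width)]
            x = rng.choices(range(width), weights=weights)[0]
        else:
            # Dimension 2: score every point in the current column, then move.
            weights = [score(x, j) for j in range(height)]
            y = rng.choices(range(height), weights=weights)[0]
    return x, y


def toy_score(x, y):
    """Hypothetical score pi: one favoured cell, flat elsewhere."""
    return 50.0 if (x, y) == (7, 3) else 1.0


final_x, final_y = gibbs_2d(toy_score, width=10, height=5, steps=200,
                            rng=random.Random(0))
print(final_x, final_y)
```

Because moves are random rather than greedy, the walk can leave a good cell again; over many steps it spends time in each cell roughly in proportion to that cell's score.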

Gibbs Sampling for Motif Search

1. Choose a random starting state.
2. Randomly pick a sequence.
3. Look at all motif positions in this sequence.
4. Pick one randomly, with probability proportional to exp(score).
5. Repeat from step 2.
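A minimal sketch of these steps, assuming the column score that matches the worked example earlier in the lecture and sequences written in lowercase a/c/g/t. The function names and the toy sequences are hypothetical, and this version rescores the full alignment for every candidate position, which is simple rather than fast.

```python
import random
from math import exp, factorial, log

BASES = "acgt"


def motif_score(sites, p=0.25):
    """Sum over columns of ln( 3! * prod c! / ((N+3)! * p^N) ), uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        numerator = factorial(3)
        for base in BASES:
            numerator *= factorial(column.count(base))
        total += log(numerator / (factorial(n + 3) * p ** n))
    return total


def gibbs_motif_search(seqs, width, iterations, rng):
    # Random starting state: one motif start per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                  # randomly pick a sequence
        candidates = range(len(seqs[i]) - width + 1)  # all motif positions in it
        scores = []
        for pos in candidates:
            trial = starts[:i] + [pos] + starts[i + 1:]
            sites = [s[t:t + width] for s, t in zip(seqs, trial)]
            scores.append(motif_score(sites))
        best = max(scores)
        # Weights proportional to exp(score); subtracting the max avoids overflow.
        weights = [exp(s - best) for s in scores]
        starts[i] = rng.choices(candidates, weights=weights)[0]
    return starts


# Toy input (invented): three short sequences, motif width 6.
seqs = ["ttgacgtcaa", "acgacgtctg", "ggtgacgtct"]
starts = gibbs_motif_search(seqs, width=6, iterations=50, rng=random.Random(0))
print(starts)
```

Many published Gibbs samplers build the model from all sequences except the one being resampled; the version above follows the slide's steps literally instead.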

Does it Work in Practice?

• Only successful cases get published!
• Seems more successful in microbes (bacteria & yeast) than in animals.
• The search algorithm seems to work quite well; the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance. I.e. the needle looks like hay.
• Attempts to deal with this:
  ◦ Assume the motif is an inverted palindrome (they often are).
  ◦ Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse).
• As usual, repetitive sequences cause problems.
• More powerful algorithm: MEME
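The palindrome constraint mentioned above can be made concrete. This sketch assumes "inverted palindrome" means a site that equals its own reverse complement; the function names and example sites are chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Complement each base, then reverse the string."""
    return site.upper().translate(COMPLEMENT)[::-1]


def is_inverted_palindrome(site: str) -> bool:
    """True if the site reads the same on both strands (5' to 3')."""
    return site.upper() == reverse_complement(site)


print(is_inverted_palindrome("TGACGTCA"))  # reads the same on both strands
print(is_inverted_palindrome("TATAAA"))    # does not
```

Imposing this symmetry roughly halves the number of free motif columns, which is one way such a constraint can help separate real sites from chance matches.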

1. Go to our MEME server: http://molgen.biol.rug.nl/meme/website/meme.html

2. Fill in your email address and a description of the sequences

3. Open the FASTA-formatted file you just saved with Genome2d (click “Browse”)

4. Select the number of motifs, number of sites and the optimum width of the motif

5. Click “Search given strand only”

6. Click “Start search”
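For readers preparing the input by hand rather than with Genome2d, the sketch below shows what a FASTA-formatted file looks like. The record name and sequence are invented, and `to_fasta` is a hypothetical helper, not part of MEME or Genome2d.

```python
def to_fasta(records, line_width=60):
    """Format (name, sequence) pairs as FASTA text: '>name' then wrapped sequence."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])
    return "\n".join(lines) + "\n"


# Hypothetical upstream region, 80 bp long.
fasta_text = to_fasta([("geneA_upstream", "ACGT" * 20)])
print(fasta_text)
```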

Something like this will appear in your email. The results are quite self-explanatory.

Promoter Prediction

• What are promoters?
• Three strategies for promoter prediction
  ◦ Signal based
  ◦ Comparative genomics/phylogenetic footprinting
  ◦ Expression profile based de novo motif discovery algorithms

What is a Promoter?

Region of a gene that binds RNA polymerase and transcription factors to initiate transcription

Promoters: What signals are there?

Simple ones in prokaryotes

Prokaryotic promoters

• RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site
• RNA polymerase complex binds directly to these, with no requirement for “transcription factors”
• Prokaryotic promoter sequences are highly conserved
  ◦ -10 region
  ◦ -35 region
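Because these two boxes are so well conserved, a naive scan for them is easy to write. The sketch below is an illustration under stated assumptions: it uses the textbook E. coli σ70 consensus hexamers TTGACA (-35) and TATAAT (-10) and a spacer of 17 ± 1 bp, none of which appear on the slide, and the function names and test sequence are hypothetical.

```python
MINUS_35 = "TTGACA"  # assumed -35 consensus (E. coli sigma-70)
MINUS_10 = "TATAAT"  # assumed -10 consensus (E. coli sigma-70)


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_like(seq, max_mismatch=1, spacers=range(16, 19)):
    """Return (start of -35 box, start of -10 box) pairs, 0-based."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 6 + 1):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            if mismatches(seq[j:j + 6], MINUS_10) <= max_mismatch:
                hits.append((i, j))
    return hits


# Invented test sequence: both boxes separated by a 17 bp spacer.
test_seq = "GC" + MINUS_35 + "A" * 17 + MINUS_10 + "GCGC"
hits = find_sigma70_like(test_seq)
print(hits)
```

Real promoters rarely match both consensus hexamers perfectly, so the mismatch allowance matters, and loosening it quickly increases chance matches.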

What signals are there? Complex ones in eukaryotes

Eukaryotic genes are transcribed by 3 different RNA polymerases

Recognize different types of promoters & enhancers:

Eukaryotic promoters & enhancers

• Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!)
• Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)
• RNA polymerase complexes do not specifically recognize promoter sequences directly
• Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Eukaryotic transcription factors

• Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
• TFs contain characteristic “DNA binding motifs” http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
• TFs recognize specific short DNA sequence motifs, “transcription factor binding sites”
  ◦ Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors

• Common in eukaryotic proteins

• Estimated 1% of mammalian genes encode zinc-finger proteins

• In C. elegans, there are 500!

• Can be used as highly specific DNA binding modules

• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Predicting Promoters

• Overview of strategies
  ◦ What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• Promoter prediction software
• 3 major types
• many, many programs

Promoter prediction: Eukaryotes vs prokaryotes

Promoter prediction is easier in microbial genomes

Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously, again mostly HMM-based
• Now:
  ◦ similarity-based
  ◦ comparative methods (because so many genomes available)
  ◦ De novo motif discovery

Predicting promoters: Steps & Strategies

Closely related to gene prediction
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)

FirstEF

Automated promoter prediction strategies

1) Pattern-driven algorithms

2) Sequence-similarity based algorithms

3) Combined "evidence-based"

BEST RESULTS? Combined, sequential

1: Promoter Prediction: Pattern-driven algorithms

• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
• Tend to produce huge numbers of false positives (FPs)
• Why?
  ◦ Binding sites (BS) for specific TFs often variable
  ◦ Binding sites are short (typically 5-15 bp)
  ◦ Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  ◦ One binding site often recognized by multiple TFs
  ◦ Biology is complex: promoters often specific to organism/cell/stage/environmental condition
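A pattern-driven scan typically turns a collection of annotated sites into a weight matrix and slides it along the sequence. The sketch below is a generic log-odds scan, not the method of any particular program; it reuses the 4-column count matrix from the worked example, and the pseudocount, threshold, function names and query sequence are assumptions made for illustration.

```python
from math import log2

# Counts per motif position (taken from the worked example, upper-cased).
site_counts = [
    {"A": 0, "C": 4, "G": 0, "T": 0},
    {"A": 2, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 2, "G": 2, "T": 0},
    {"A": 3, "C": 1, "G": 0, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2( (c + pseudocount) / total / background )."""
    matrix = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        matrix.append({b: log2((col[b] + pseudocount) / total / background)
                       for b in "ACGT"})
    return matrix


def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        s = sum(matrix[k][seq[i + k]] for k in range(width))
        if s >= threshold:
            hits.append((i, round(s, 2)))
    return hits


pwm = log_odds_matrix(site_counts)
pwm_hits = scan("GGTTCACAGG", pwm, threshold=3.0)  # invented query sequence
print(pwm_hits)
```

With a matrix this short, lowering the threshold even slightly admits many windows, which is the false-positive problem the bullets describe.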

Solutions to problem of too many FP predictions?

Take sequence context/biology into account
• Eukaryotes: clusters of TFBSs are common
• Prokaryotes: knowledge of σ factors helps
• Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
  ◦ But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally mapped
• CpG islands before promoter around TSS
• TATA Box, CCAAT box
• Content information: hexamer frequency
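One of the context signals listed above, CpG islands, is usually detected from two simple statistics. The sketch below computes GC content and the observed/expected CpG ratio; the thresholds (length at least 200 bp, GC content above 0.5, ratio above 0.6) are the commonly quoted defaults and are an assumption here, as are the function names.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC-content and ratio thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_ratio


print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```

In practice these statistics are computed in a sliding window along the genome rather than on one fixed string.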

Why can't we rely on consensus sequences?

• Inr (Initiator) consensus sequence will appear once every 512 bp in random sequences
• For TATA box, one for every 120 bp
• Short-sequence patterns can appear by chance with high likelihood (false positives)
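The arithmetic behind such "once every N bp" figures can be sketched as follows, assuming equal base frequencies and a single strand. The two example patterns are illustrative assumptions and are not meant to reproduce the 512 bp and 120 bp figures on the slide, which depend on the exact consensus definitions used there.

```python
# IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(pattern: str) -> float:
    """Chance that a random position matches, with each base at frequency 1/4."""
    prob = 1.0
    for symbol in pattern.upper():
        prob *= len(IUPAC[symbol]) / 4
    return prob


def expected_spacing(pattern: str) -> float:
    """Average distance in bp between chance matches in random sequence."""
    return 1.0 / match_probability(pattern)


print(expected_spacing("TATAAA"))   # fully specified hexamer
print(expected_spacing("YYANWYY"))  # degenerate 7-mer
```

The degenerate pattern is longer yet matches far more often by chance, which is the point of the slide: short or degenerate consensus sequences alone give many false positives.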

2: Promoter Prediction: Phylogenetic Footprinting

• Assumption: common functionality can be deduced from sequence conservation
• Comparative promoter prediction: "Phylogenetic footprinting"
  ◦ rVista, ConSite, PromH, FootPrinter
• For comparative (phylogenetic) methods
  ◦ Must choose appropriate species
  ◦ Different genomes evolve at different rates
  ◦ Classical alignment methods have trouble with translocations, inversions in order of functional elements
  ◦ If the entire region is highly conserved (high background conservation), comparison is useless
  ◦ Not enough data (Prokaryotes >>> Eukaryotes)
• Biology is complex: many (most?) regulatory elements are not conserved across species!
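The core idea of footprinting, looking for short windows that are more conserved than their surroundings, can be sketched on an existing pairwise alignment. This is not how rVista, ConSite, PromH or FootPrinter work internally; the function name, window size, identity cutoff and toy sequences are all assumptions.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above min_identity.

    Both inputs are rows of one pairwise alignment (equal length, '-' for gaps).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for i in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[i:i + window], aln_b[i:i + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits


# Toy alignment (invented): an identical 8 bp block flanked by unrelated sequence.
human = "ACGTA" + "TGACGTCA" + "GGTAC"
mouse = "TTACG" + "TGACGTCA" + "ACCGT"
footprints = conserved_windows(human, mouse, window=8, min_identity=1.0)
print(footprints)
```

If the flanking sequence were also nearly identical, every window would pass, which illustrates why highly conserved regions make the comparison uninformative.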

3: Promoter Prediction: Co-expression based algorithms

Problems:
• Need sets of co-regulated genes
• Genes experimentally determined to be co-regulated (using microarrays??)
  Careful: How to determine co-regulation?
• Alignments of co-regulated genes should highlight elements involved in regulation

Algorithms: MEME, AlignACE, PhyloCon

Examples of promoter prediction/characterization software

• MATCH, MatInspector
• TRANSFAC
• MEME & MAST
• BLAST, etc.

Others?
• FIRST EF
• Dragon Promoter Finder (these are links in PPTs)
• also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc)
• JASPAR

TRANSFAC matrix entry: for TATA box

Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info

Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

Check out optional review & try associated tutorial:

Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

D Dobbs ISU - BCB 444/544X: Promoter Prediction (really!)

Check this out: http://www.phylofoot.org/NRG_testcases/

Annotated lists of promoter databases & promoter prediction software

• URLs from Mount Chp 9, available online: Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs for Baxevanis & Ouellette, Chp 5: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links

More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/

Summary

• Promoter & gene regulation
• 3 types of methods for promoter prediction
• Many programs have sensitivity and specificity less than 0.5
• Integrative algorithms are more promising
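For reference, the two measures can be computed from prediction counts as below. The slide does not say which definition of specificity it uses: in gene and promoter prediction papers the word often means TP / (TP + FP), while the classical definition is TN / (TN + FP), so the sketch shows both. All counts are invented.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real promoters that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of predictions that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


def specificity(tn: int, fp: int) -> float:
    """Classical specificity: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical benchmark: 100 real promoters, 40 found, 120 false predictions.
sn = sensitivity(tp=40, fn=60)
ppv = positive_predictive_value(tp=40, fp=120)
sp = specificity(tn=880, fp=120)
print(sn, ppv, sp)
```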

Acknowledgement

Zhiping Weng (Boston Uni.)