
Post on 18-Jan-2018

Special Topics in Genomics

Motif Analysis

Sequence motif – a pattern of nucleotide or amino acid sequences

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA

TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA

TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG

AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

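[Figure: TF (transcription factor) labels marking the binding site in each of the six sequences above]

Aligned binding sites, one per sequence (positions 1-9):

123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC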

Transcription Factor Binding Sites (TFBS)
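As a small illustration, the six aligned sites can be summarized as a position frequency matrix and a majority consensus. This is a minimal sketch; the names `SITES`, `frequency_matrix` and `consensus` are chosen here for illustration and do not come from the slides.

```python
from collections import Counter

# The six aligned 9-bp binding sites listed in the slides.
SITES = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def frequency_matrix(sites):
    """Return {base: [frequency at position 1, 2, ...]} for aligned sites."""
    n = len(sites)
    columns = [Counter(col) for col in zip(*sites)]
    return {b: [col[b] / n for col in columns] for b in "ACGT"}

def consensus(pfm):
    """Pick the most frequent base at each position."""
    width = len(pfm["A"])
    return "".join(max("ACGT", key=lambda b: pfm[b][i]) for i in range(width))

pfm = frequency_matrix(SITES)
for base in "ACGT":
    print(base, " ".join(f"{f:.2f}" for f in pfm[base]))
print(consensus(pfm))
```

Position 9 gives C = 4/6, which prints as 0.67 here and appears as 0.66 in the motif matrix used later in these slides.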

DNA motif:

Protein motif:

Motif representation

Consensus sequence

Example: CACSTG (S = C or G)

Sequence Logo

Schneider & Stephens, Nucleic Acids Res. 18:6097-6100 (1990)

Entropy (Shannon) – a measurement of uncertainty

The amount of uncertainty reduced by observing sequences is the amount of information (or information content) we obtained:
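In the usual formulation for DNA, and assuming a uniform background with no small-sample correction, the two quantities can be written as below. The symbols f_{b,i} (frequency of base b at position i), H_i (entropy) and R_i (information content) are notation chosen here for illustration. A fully conserved position has H_i = 0 and R_i = 2 bits; a position where all four bases are equally frequent has H_i = 2 and R_i = 0.

```latex
H_i = -\sum_{b \in \{A,C,G,T\}} f_{b,i} \log_2 f_{b,i}
\qquad
R_i = \log_2 4 - H_i = 2 - H_i \quad \text{(bits)}
```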

This is the height of each position in the logo plot.

Height of each nucleotide is proportional to its frequency
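The logo heights can be computed directly from a frequency matrix. The sketch below uses the 9-column motif matrix that appears later in these slides and assumes a uniform background and no small-sample correction; `PFM`, `column_information` and `logo_heights` are illustrative names.

```python
from math import log2

# Motif frequency matrix as given in the slides (rows: base, columns: positions 1-9).
PFM = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

def column_information(freqs):
    """Information content (bits) of one DNA column: 2 minus Shannon entropy."""
    entropy = -sum(f * log2(f) for f in freqs if f > 0)
    return 2.0 - entropy

def logo_heights(pfm):
    """For each position, return {base: letter height}, height = frequency * information."""
    heights = []
    for i in range(len(pfm["A"])):
        column = {b: pfm[b][i] for b in "ACGT"}
        info = column_information(column.values())
        heights.append({b: f * info for b, f in column.items()})
    return heights

for i, column in enumerate(logo_heights(PFM), start=1):
    print(i, {b: round(h, 2) for b, h in column.items() if h > 0})
```

Fully conserved positions (1, 2, 4, 6, 7, 8) reach the maximum of 2 bits, while the mixed positions 3, 5 and 9 are shorter.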

Two questions in motif analysis

• Known motif mapping

Finding occurrences of a motif in nucleotide or amino acid sequences

• De novo motif discovery

Finding motifs that are previously unknown

Known motif mapping

• Consensus mapping

STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)

STEP 2: specify the number of mismatches allowed (e.g. <=1)

STEP 3: scan the sequence

CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT

Example windows: m=3 mismatches, no; m=1 mismatch, yes
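The three steps can be sketched as a sliding-window scan. This is a minimal illustration rather than how CisGenome implements it: it scans the forward strand only, and `IUPAC` and `scan_consensus` are names introduced here.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def scan_consensus(seq, consensus, max_mismatches):
    """Return (start, window, mismatches) for every window within the mismatch limit.

    Start positions are 0-based; only the given strand is scanned.
    """
    k = len(consensus)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        mismatches = sum(base not in IUPAC[code]
                         for base, code in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits

SEQ = "CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT"
print(scan_consensus(SEQ, "CACSTG", 1))  # [(30, 'CACATG', 1)]
```

With at most one mismatch the only forward-strand hit in the example sequence is CACATG; with zero mismatches allowed there is none. A practical scan would usually also check the reverse complement.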

A useful tool: CisGenome (http://www.biostat.jhsph.edu/~hji/cisgenome)

Known motif mapping

• Motif matrix mapping (CisGenome)

STEP 1: provide a motif and background model

STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)

STEP 3: scan the sequence

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA

LR>=500: yes; LR<500: no

Background (4 x 4 matrix over A, C, G, T; each row sums to 1):

    A   C   G   T
A  .3  .2  .2  .3
C  .2  .3  .3  .2
G  .2  .3  .3  .2
T  .3  .2  .2  .3

Motif (rows: nucleotide; columns: motif positions 1-9):

      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
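A likelihood-ratio scan with these two matrices can be sketched as follows. Several points are assumptions made for illustration and are not stated in the slides: the 4 x 4 background is read as first-order Markov transition probabilities (row = previous base, column = current base), the very first base of the sequence is given probability 0.25, and only the forward strand is scanned. The function names are hypothetical and this is not CisGenome's code.

```python
# Background transition probabilities from the slides (assumed: row = previous base).
BACKGROUND = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

# Motif matrix from the slides (rows: base, columns: positions 1-9).
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

def likelihood_ratio(seq, start, motif, background):
    """P(window | motif) / P(window | background) for the window at `start`."""
    p_motif, p_background = 1.0, 1.0
    for j in range(len(motif["A"])):
        pos = start + j
        base = seq[pos]
        p_motif *= motif[base][j]
        # Assumption: uniform probability for a base with no predecessor.
        p_background *= 0.25 if pos == 0 else background[seq[pos - 1]][base]
    return p_motif / p_background

def scan_matrix(seq, motif, background, cutoff=500.0):
    """Return (start, window, LR) for every window with LR >= cutoff (0-based starts)."""
    width = len(motif["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        lr = likelihood_ratio(seq, start, motif, background)
        if lr >= cutoff:
            hits.append((start, seq[start:start + width], lr))
    return hits

S = "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA"
for start, window, lr in scan_matrix(S, MOTIF, BACKGROUND):
    print(start, window, round(lr))
```

Because the motif matrix contains exact zeros, any window with a disallowed base gets LR = 0. In practice pseudocounts are commonly added to the matrix so that near-matches receive a small but nonzero score.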

• Another tool for matrix mapping: MAST (http://meme.sdsc.edu/meme/mast-intro.html)

De novo motif discovery

• Two major classes of methods:

1. Word enumeration

2. Matrix updating

Word enumeration

Example: Sinha & Tompa, Nucleic Acids Res. 30: 5549-5560 (2002)

STEP 1: enumerate possible words;

STEP 2: count word occurrences;

STEP 3: compare observed word count with random expectation.
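A heavily simplified sketch of the three steps is shown below. It is not the Sinha & Tompa method: it enumerates only the words that actually occur, uses an independent-base background estimated from the input itself, and scores each word with a plain binomial z-score. All function names are illustrative.

```python
from collections import Counter
from math import sqrt

SEQS = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]

def count_words(seqs, k):
    """STEPS 1-2: enumerate the k-mers present in the sequences and count them."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def score_words(seqs, k):
    """STEP 3: compare observed counts with the expectation under independent bases."""
    text = "".join(seqs)
    freq = {b: text.count(b) / len(text) for b in "ACGT"}
    n_windows = sum(max(len(s) - k + 1, 0) for s in seqs)
    scored = []
    for word, observed in count_words(seqs, k).items():
        p = 1.0
        for b in word:
            p *= freq[b]
        expected = n_windows * p
        z = (observed - expected) / sqrt(expected * (1.0 - p))
        scored.append((z, word, observed, expected))
    return sorted(scored, reverse=True)

for z, word, observed, expected in score_words(SEQS, 6)[:5]:
    print(f"{word}  observed={observed}  expected={expected:.3f}  z={z:.1f}")
```

With only six short sequences the z-scores are illustrative rather than statistically meaningful; published methods use more careful background models and account for overlapping occurrences.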

Matrix updating

• CONSENSUS (Stormo & Hartzell, PNAS, 86: 1183-1187, 1989)

STEP 1: use all k-mers in the first sequence as seeds;

STEP 2: find matches (often only the best matches) of each seed in the second sequence;

STEP 3: update seed matrices, exclude matrices with low information content;

STEP 4: repeat steps 2 and 3 for each remaining sequence.
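The greedy procedure in these four steps can be sketched as below. This is a simplified illustration in the spirit of CONSENSUS, not the published program: it keeps exactly one best match per sequence, measures information content against a uniform background without small-sample correction, and retains a fixed number of alignments (`keep`, a hypothetical parameter) after each sequence.

```python
from collections import Counter
from math import log2

SEQS = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]

def information_content(sites):
    """Total information (bits) of an ungapped alignment of equal-length DNA sites."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        entropy = -sum((c / n) * log2(c / n) for c in Counter(column).values())
        total += 2.0 - entropy
    return total

def greedy_consensus(seqs, k, keep=10):
    """Return the highest-information alignment found, one k-mer per sequence."""
    # STEP 1: every k-mer of the first sequence seeds one alignment.
    alignments = [[seqs[0][i:i + k]] for i in range(len(seqs[0]) - k + 1)]
    # STEP 4: process the remaining sequences one at a time.
    for seq in seqs[1:]:
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        # STEP 2: extend each alignment with the window that fits it best.
        alignments = [max((a + [w] for w in windows), key=information_content)
                      for a in alignments]
        # STEP 3: discard alignments with low information content.
        alignments.sort(key=information_content, reverse=True)
        alignments = alignments[:keep]
    return alignments[0]

best = greedy_consensus(SEQS, 9)
print(best, round(information_content(best), 2))
```

Because the search is greedy, the result depends on the order of the sequences, and a small `keep` can discard the alignment that would have been best in the end.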

Matrix updating

• Mixture model

Parameters: θ0 (background), Θ (motif matrix), W (motif width)

EM:

Lawrence and Reilly (1990)

Bailey and Elkan (1994), etc.

Gibbs Sampler:

Lawrence et al. (1993)

Liu (1994), Liu et al. (1995), etc.

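S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: 000000000000001000000000000000000000000001000000000000000000000000000000

S is the observed sequence; A is the motif indicator (1 marks a position where a motif site starts).

q = [q0, q1], the mixing proportions for background (q0) and motif (q1)

$$ f(A, \Theta, W, q \mid S, \theta_0) \propto f(S, A \mid \Theta, W, q, \theta_0) \, \pi(\Theta, W, q) $$

Background (θ0):

    A   C   G   T
A  .3  .2  .2  .3
C  .2  .3  .3  .2
G  .2  .3  .3  .2
T  .3  .2  .2  .3

Motif (Θ, width W = 9):

      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00

Inference by iterative estimation/sampling, alternating between (Θ, W, q) and A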
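The estimation side of this iteration can be illustrated with a small EM loop for a two-component mixture over sequence windows. This is a simplified sketch and not the algorithm of any of the cited papers. Its assumptions: W is fixed by a seed word, the background θ0 is an independent-base model estimated once from the input and held fixed, overlapping windows are treated as independent, and the starting values (0.7/0.1 seed weights, `q1=0.05`, `pseudo=0.1`) are arbitrary illustrative choices.

```python
from math import prod

BASES = "ACGT"

SEQS = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]

def em_motif(seqs, seed, n_iter=50, q1=0.05, pseudo=0.1):
    """Estimate the motif matrix Theta and motif proportion q1 by EM."""
    width = len(seed)
    windows = [s[i:i + width] for s in seqs for i in range(len(s) - width + 1)]
    # Background theta0: independent base frequencies, held fixed (assumption).
    text = "".join(seqs)
    theta0 = {b: text.count(b) / len(text) for b in BASES}
    # Initialise Theta from the seed: 0.7 on the seed base, 0.1 on each other base.
    theta = [{b: (0.7 if b == c else 0.1) for b in BASES} for c in seed]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability that each window is a motif site (A = 1).
        post = []
        for w in windows:
            p_motif = q1 * prod(theta[j][b] for j, b in enumerate(w))
            p_background = (1.0 - q1) * prod(theta0[b] for b in w)
            post.append(p_motif / (p_motif + p_background))
        # M-step: re-estimate Theta from expected counts (plus pseudocounts) and q1.
        theta = []
        for j in range(width):
            counts = {b: pseudo for b in BASES}
            for w, p in zip(windows, post):
                counts[w[j]] += p
            total = sum(counts.values())
            theta.append({b: counts[b] / total for b in BASES})
        q1 = sum(post) / len(post)
    return theta, q1, list(zip(windows, post))

theta, q1, posteriors = em_motif(SEQS, "TGGGTGGTC")
print("".join(max(BASES, key=col.get) for col in theta), round(q1, 3))
```

A Gibbs sampler follows the same alternation but draws A and the parameters from their conditional distributions instead of computing expectations. In either case the result depends on the starting point, so several seeds are usually tried.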

Other issues

• Dependencies within motif

• Functions of novel motifs
