Biological sequence analysis (bio.lundberg.gu.se/courses/ht08/bio1/seq_2008.pdf)


Biological sequence analysis

Tore Samuelsson, 26 Sep 2008

Concepts in sequence analysis - Translation

Protein:  A   V   S   L   M
DNA:     GCA GTA AGC TTG ATG

Concepts in sequence analysis - Pattern matching
(example: identification of restriction sites)

HindIII (recognition site AAGCTT)
G C A G T A A G C T T G A T G

Concepts in sequence analysis - Probabilistic models

G C A G T A A G C T T G A T G

P(A) = 0.23
P(T) = 0.24
P(C) = 0.26
P(G) = 0.27
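The three concepts above can be tried out on the slide's example sequence. What follows is a minimal Python sketch, not the course's own code: the function names are hypothetical, the codon table is the standard genetic code, AAGCTT is the HindIII recognition site, and treating every base as an independent draw from the listed frequencies is a simplifying assumption.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate DNA in reading frame 1, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_sites(dna, site):
    """Return the 0-based start of every exact occurrence of site in dna."""
    return [i for i in range(len(dna) - len(site) + 1)
            if dna[i:i + len(site)] == site]

def sequence_probability(dna, freqs):
    """Probability of dna if each base is drawn independently from freqs (assumption)."""
    p = 1.0
    for base in dna:
        p *= freqs[base]
    return p

dna = "GCAGTAAGCTTGATG"
freqs = {"A": 0.23, "T": 0.24, "C": 0.26, "G": 0.27}

print(translate(dna))               # AVSLM
print(find_sites(dna, "AAGCTT"))    # [5]
print(sequence_probability(dna, freqs))
```

The probability printed last is very small for any 15-base sequence; such values are normally compared between models rather than read on their own.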


Concepts in sequence analysis - Probabilistic models

Modeling of splicing signals

       3' end of exon | 5' end of intron
         A      G     |   G      U      A
G       0.11   0.74   |  1.00   0.00   0.29
A       0.64   0.09   |  0.00   0.00   0.61
U       0.13   0.12   |  0.00   1.00   0.07
C       0.11   0.06   |  0.00   0.00   0.02
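A position-specific model like this one can score a candidate splice site by multiplying the per-position probabilities. The sketch below is an illustration under stated assumptions: the 20 values are read as four rows (G, A, U, C) of five positions each, which is the only reading in which every column sums to about 1, and the function name is hypothetical.

```python
# Per-position base probabilities from the slide.
# Positions: last two exon bases, then first three intron bases.
# Row assignment (G, A, U, C) is an assumption based on the column sums.
SPLICE_MODEL = {
    "G": [0.11, 0.74, 1.00, 0.00, 0.29],
    "A": [0.64, 0.09, 0.00, 0.00, 0.61],
    "U": [0.13, 0.12, 0.00, 1.00, 0.07],
    "C": [0.11, 0.06, 0.00, 0.00, 0.02],
}

def site_probability(site):
    """Probability of a 5-base RNA site under the position-specific model."""
    p = 1.0
    for position, base in enumerate(site):
        p *= SPLICE_MODEL[base][position]
    return p

print(site_probability("AGGUA"))  # consensus A G - G U A
print(site_probability("AGGCA"))  # 0.0, because U is required at intron position 2
```

Unlike the single set of base frequencies on the previous slide, each position here has its own distribution, so the invariant GU at the start of the intron is captured directly.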

Concepts in sequence analysis - Long range dependencies

[Figure: tRNA]

Concepts in sequence analysis - Alignments

G C A G T A A G C T T G A T G
G C A G T A A - C T T T A T G
* * * * * * *   * * *   * * *

Why are sequence alignments important?

• Sequence assembly
• Prediction of function
• Protein family analysis
• Comparative genomics
• Phylogeny / Evolutionary history


Sequence assembly

Prediction of function

We have a ‘new’ sequence. Is it similar to a previously known sequence?

Alignment to all previously known sequences. (Many of these have annotation, such as a description of function.)

[Diagram: the search ends in either "similarity" or "no similarity" to a known sequence]

Predicting the molecular basis of disease

BRCA1 gene - genetic factor in breast cancer

Cloning and sequencing of the gene revealed a protein remotely related to a yeast protein (Rad9) involved in cell cycle control

RAD9_YEAST : GNVFDKCIFVLTS-LFENReELRQTIESQGGTVIeSGfstlfnfthplakslvnkgntdn
BRC1_HUMAN : ERVNKRMSMVVSGLTPEEFmLVYKFARKHHITLTnLI-----------------------

RAD9_YEAST : irelalklawkphslfaDCRFACLITKRHLrSLKYLET------LALGWPTLHWKFISAC
BRC1_HUMAN : -----------------TEETTHVVMKTDA-EFVCERTLKyflGIAGGKWVVSYFWVTQS

RAD9_YEAST : IEKKRIVPHLIYQY
BRC1_HUMAN : IKERKMLNEHDFEV

Protein family analysis

Analysis of the human genome reveals a large number of olfactory receptor proteins


Comparative genomics - reveals biologically significant regions of the genome

Phylogeny / Evolutionary history

Human evolution
• Origin of man
• Closest primate relatives of man?
• Did modern humans originate in Africa?
• Relationship between human populations
• How are genes of medical interest distributed among human populations?

Evolution of viruses / microorganisms that cause human disease
• HIV (AIDS)
• H5N1 (Bird Flu)

Early days of sequence analysis

Margaret Dayhoff

• Substitution matrix
• Scoring of alignments

A G L C E
| | | | |
A A L C D
4+0+4+9+2 = 19
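The alignment score above is simply the sum of one substitution-matrix entry per aligned pair. A minimal sketch of that arithmetic, using only the five pair scores shown on the slide (they coincide with the corresponding BLOSUM62 entries; the function names are hypothetical):

```python
# Pair scores as they appear on the slide; a full substitution matrix
# would hold an entry for every pair of amino acids.
SLIDE_SCORES = {
    ("A", "A"): 4,
    ("G", "A"): 0,
    ("L", "L"): 4,
    ("C", "C"): 9,
    ("E", "D"): 2,
}

def pair_score(x, y):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in SLIDE_SCORES:
        return SLIDE_SCORES[(x, y)]
    return SLIDE_SCORES[(y, x)]

def alignment_score(top, bottom):
    """Sum the substitution scores of an ungapped alignment, column by column."""
    return sum(pair_score(x, y) for x, y in zip(top, bottom))

print(alignment_score("AGLCE", "AALCD"))  # 19
```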


Frequently used methods in sequence analysis that are based on the principle of sequence alignment using dynamic programming

• BLAST - searches in databases for sequence similarity, p. 93-103
• ClustalW - multiple alignment of sequences, p. 89-93

FASTA, 1988
William Pearson

BLAST, 1990
David Lipman
Stephen Altschul

Searching databases for sequence similarity - local alignment using Smith-Waterman method is too slow

    M A K L Q G A L G K R Y
M   *
A     *         *
K       *             *
I
Q           *
G             *     *
A     *         *
L         *       *
A     *         *
K       *             *
R                       *
Y                         *
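To see why a full Smith-Waterman search is slow, note that it fills one matrix cell for every pair of query and database residues. A minimal sketch of the scoring recurrence follows; the match, mismatch and gap values are arbitrary assumptions (a real search would use a substitution matrix), and the two sequences are the ones labelling the matrix above.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    Only the previous row is kept, but every one of the len(a) * len(b)
    cells is still computed, which is what makes the method slow
    against a large database.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

# Column and row sequences of the dot matrix on the slide.
print(smith_waterman_score("MAKLQGALGKRY", "MAKIQGALAKRY"))  # 8
```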

BLAST

Searching databases for sequence similarity - heuristics of BLAST

Improvement of speed as compared to local alignment algorithm:

• Initial search is for word hits.
• Word hits are then extended in either direction.

First step in BLAST - obtaining a list of words based on the query sequence

Query sequence: MSGTWAMA ....

Words derived from query sequence:
MSG, SGT, GTW, TWA .... etc

Each word extracted from the query sequence is matched against words derived from the database sequence.
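The word step can be sketched in a few lines of Python. This is an illustration only, not BLAST's actual implementation: the function names and the database sequence are hypothetical, and only exact word matches are looked up.

```python
from collections import defaultdict

def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def index_words(seq, w=3):
    """Map each word of a database sequence to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words(seq, w)):
        index[word].append(position)
    return index

def word_hits(query, subject, w=3):
    """(query position, subject position) pairs where a word matches exactly."""
    index = index_words(subject, w)
    return [(q, s)
            for q, word in enumerate(words(query, w))
            for s in index.get(word, [])]

query = "MSGTWAMA"
subject = "KLGTWAMS"   # hypothetical database sequence

print(words(query))               # ['MSG', 'SGT', 'GTW', 'TWA', 'WAM', 'AMA']
print(word_hits(query, subject))  # [(2, 2), (3, 3), (4, 4)]
```

Indexing the database words once makes each lookup a dictionary access, which is where the speed-up over filling a full alignment matrix comes from.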


First step in BLAST - obtaining a list of words based on the query sequence - improving sensitivity by considering 'word neighbors'

Consider the word GTW:

compile a list of words scoring at least T with query word:

GTW (6+5+11=22)
GSW (6+1+11=18)
GNW (6+0+11=17)
GAW (6+0+11=17)
ATW (0+5+11=16)
DTW (-1+5+11=15)
GTF (6+5+1=12)
------------------ threshold T
GTM (6+5-1=10)
DAW (-1+0+11=10)

Exact matches to the words scoring at least T will be searched against the database sequence.
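The neighbor list can be reproduced with a short sketch. Assumptions are flagged in the comments: only the substitution scores printed on the slide are included (they match BLOSUM62 entries), the candidate words are the nine listed above rather than all 8000 possible three-letter words, and T = 11 is an assumed threshold that separates GTF (12) from GTM (10).

```python
# Substitution scores as printed on the slide, keyed (query residue, candidate residue).
SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("T", "N"): 0, ("T", "A"): 0,
    ("G", "A"): 0, ("G", "D"): -1,
    ("W", "F"): 1, ("W", "M"): -1,
}

def word_score(query_word, candidate):
    """Sum of per-position substitution scores between two words."""
    return sum(SCORES[(q, c)] for q, c in zip(query_word, candidate))

# Candidate words taken from the slide; a real search enumerates every word.
CANDIDATES = ["GTW", "GSW", "GNW", "GAW", "ATW", "DTW", "GTF", "GTM", "DAW"]
T = 11  # assumed threshold

neighbors = [w for w in CANDIDATES if word_score("GTW", w) >= T]
print(neighbors)  # ['GTW', 'GSW', 'GNW', 'GAW', 'ATW', 'DTW', 'GTF']
```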

BLAST output

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq
         (75 letters)

Database: nr
           457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6

Expect value (E)

Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is.

High Scoring Pair (HSP)

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;
 cDNA EST EMBL:D71338 comes from this gene; cDNA EST
 EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852
 comes from this gene; cDNA EST EMBL:C07354 comes from
 this gene; cDNA EST EMBL:C0...
 Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
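Two of the numbers in this HSP can be checked with a short sketch. The identity count is plain column counting. The E-value function uses the textbook relation E = m * n * 2^(-S') with the raw query and database lengths; this is a simplifying assumption, because BLAST itself applies effective-length corrections, so the sketch gives a value of the same rough magnitude rather than reproducing the reported 1e-13. Function names are hypothetical.

```python
def count_identities(query_row, subject_row):
    """Number of alignment columns with identical residues (gap columns never count)."""
    return sum(1 for q, s in zip(query_row, subject_row) if q == s and q != "-")

def approximate_expect(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), ignoring BLAST's effective-length corrections (assumption)."""
    return query_length * database_length * 2.0 ** (-bit_score)

# The two aligned rows of the HSP, with both blocks joined.
query_row = "QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM" + "G"
subject_row = "QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM" + "G"

columns = len(query_row)
identities = count_identities(query_row, subject_row)
print(identities, columns)   # 33 61

# Query of 75 letters against 140,871,481 database letters, 74.1 bits.
print(approximate_expect(74.1, 75, 140_871_481))
```

Because the database size n enters the formula directly, the same bit score becomes less significant as the database grows.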


The different variants of BLAST

Program   Query     Database
blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA

Using BLAST in a unix environment

blastall -i input_sequence -d database -p blast_version

(blast_version is the name of one of the variants above, for example blastp)

bl2seq uses the BLAST algorithm but matches two sequences instead of matching one sequence against a database

bl2seq -i 1st_sequence -j 2nd_sequence -p blast_version

Large scale alignments and further improvement of computational efficiency - BLAT

(http://genome.ucsc.edu/cgi-bin/hgBlat?command=start)


BLAST will reveal evolutionary relationships. DNA or protein sequences are homologous if they are related by divergence from a common ancestor.

Two kinds of homology:

Orthology: Sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.

Paralogy: Sequences that diverged after a duplication event. Paralogous genes perform different but related functions within one organism.

Evolutionary relationships revealed by database searches

Orthologs
[Diagram: an ancestral organism carries gene X. After speciation, and as time goes on, organism A carries X1 and organism B carries X2. X1 and X2 are orthologs.]

Paralogs
[Diagram: gene X undergoes gene duplication. As time goes on, the two copies diverge into Xa and Xb. Xa and Xb are paralogs.]

Example of orthology / paralogy relationships

Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
   paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin


ClustalW

• Construction of tree based on pairwise alignments
• Progressive alignment guided by tree.

[Figure: guide tree joining sequences A and B, C and D, and E]
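The first of the two steps, building a tree from pairwise comparisons, can be illustrated with a small sketch. It uses simple average-linkage clustering on a made-up distance table; this is a stand-in chosen for brevity and is not claimed to be the tree method ClustalW itself uses. All names and distances are hypothetical.

```python
from itertools import combinations

def cluster_distance(c1, c2, dist):
    """Average of the pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

def guide_tree(names, dist):
    """Repeatedly join the two closest clusters; return the tree as nested tuples."""
    clusters = [(name,) for name in names]
    subtree = {(name,): name for name in names}
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], dist))
        merged = c1 + c2
        subtree[merged] = (subtree[c1], subtree[c2])
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return subtree[clusters[0]]

# Hypothetical pairwise distances (for example 1 - fraction identity).
close = {("A", "B"): 0.1, ("C", "D"): 0.2}
dist = {}
for x, y in combinations("ABCDE", 2):
    if "E" in (x, y):
        d = 0.8
    else:
        d = close.get((x, y), 0.5)
    dist[frozenset((x, y))] = d

tree = guide_tree("ABCDE", dist)
print(tree)  # ('E', (('A', 'B'), ('C', 'D')))
```

In a progressive alignment the most similar pairs (here A with B, then C with D) are aligned first, and the resulting alignments are merged in the order the tree dictates.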

Viruses - dependent on living cells for propagation

HIV

Introduction to the "Exercises with biological sequences - examining HIV genes and proteins"


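EMBOSS programs in this practical

• sixpack - translation of a nucleotide sequence in six reading frames
• plotorf - plot of open reading frames
• water - Smith-Waterman alignment
• needle - Needleman-Wunsch alignment
• dottup - dotplot analysis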

Translation of a nucleotide sequence using ‘sixpack’

(F1-F3: reading frames on the forward strand; F4-F6: reading frames on the reverse strand)

M A K R K L K K N L K T F V A F S A I T   F1
W Q R E S * K R T * K L L L H L V L L L   F2
G K E K V K K E L K N F C C I * C Y Y C   F3
  1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
    ----:----|----:----|----:----|----:----|----:----|----:----|
  1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
X A F L F N F F F K F V K T A N L A I V   F6
X P L S F T L F S S L F K Q Q M * H * *   F5
H C L S L * F L V * F S K N C K T S N S   F4

A L L L T N G I P I S A L T Q S S N T T   F1
L Y C * L M V F Q L V L * L S L P I Q L   F2
F I V N * W Y S N * C F N S V F Q Y N *   F3
 61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
    ----:----|----:----|----:----|----:----|----:----|----:----|
 61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
A K N N V L P I G I L A K V * D E L V V   F6
Q K I T L * H Y E L * H K L E T K W Y L   F5
S * Q * S I T N W N T S * S L R G I C S   F4

E I T S Q A T T G L R N V M Y Y G D W S   F1
R L L H K L L Q G Y V M * C I M V T G L   F2
D Y F T S Y Y R V T * C N V L W * L V Y   F3
121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
    ----:----|----:----|----:----|----:----|----:----|----:----|
121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
S I V E C A V V P N R L T I Y * P S Q D   F6
Q S * K V L * * L T V Y H L T N H H S T   F5
L N S * L S S C P * T I Y H I I T V P R   F4
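A six-frame translation can be sketched as follows. This is not sixpack's code: the function names are hypothetical, the table is the standard genetic code, and the reverse-strand frames are labelled R1-R3 because no attempt is made to reproduce sixpack's own F4-F6 numbering.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate from the first base, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translations of the three forward (F) and three reverse-strand (R) frames."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"F{offset + 1}"] = translate(dna[offset:])
        frames[f"R{offset + 1}"] = translate(reverse[offset:])
    return frames

# First 60 bases of the sequence shown in the sixpack output.
dna = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
frames = six_frames(dna)
print(frames["F1"])  # MAKRKLKKNLKTFVAFSAIT
print(frames["F2"])  # WQRES*KRT*KLLLHLVLL
```

Frame 2 is one residue shorter than in the sixpack listing because this sketch translates only the 60-base excerpt, whereas sixpack continues into the following bases.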

Deviations from the standard genetic code

# Yeast mitochondria

UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria

UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*

# Drosophila mitochondria

UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# Mycoplasma

UGA = Trp

# Ciliate protozoa

UAA = Gln:Q
UAG = Gln:Q
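In code, a deviating genetic code is just the standard table with a few entries overridden. The sketch below applies the mammalian mitochondrial changes listed above; codons are written with T instead of U because the input is DNA, the test sequence is made up, and the names are hypothetical.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Mammalian mitochondria: only the codons that differ from the standard code.
MAMMALIAN_MITO = dict(STANDARD)
MAMMALIAN_MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})

def translate(dna, table):
    """Translate from the first base with the given codon table."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

dna = "ATGTGAATAAGA"  # hypothetical sequence: ATG TGA ATA AGA
print(translate(dna, STANDARD))        # M*IR
print(translate(dna, MAMMALIAN_MITO))  # MWM*
```

Translating a mitochondrial gene with the standard table would stop at the first UGA, so choosing the right table matters when running programs such as sixpack on organelle sequences.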

Plotorf to show open reading frames
(in this case ORF is defined as starting with AUG codon)

Unnamed protein            416-1522
Ribosomal protein S16     1771-2019
tRNA methyltransferase    2617-3384
Ribosomal protein L19     3426-3773
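An ORF search with the definition used here (start at an AUG codon, end at the next in-frame stop codon) can be sketched as below. Assumptions: only the forward strand is scanned, reported coordinates are 1-based and include the stop codon, which may differ from how plotorf reports them, and the example sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Forward-strand ORFs as (start, end), 1-based inclusive, stop codon included.

    An ORF begins at the first ATG seen in a frame and ends at the next
    in-frame stop codon; min_codons counts the codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Hypothetical sequence: CC ATG AAA TGG TAA CC
print(find_orfs("CCATGAAATGGTAACC"))  # [(3, 14)]
```

Raising min_codons filters out the many short ORFs that occur by chance, which is why plots of real genes show only a few long open frames.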


Gag

Gag-Pol fusion (5%)

Global alignment of mRNA sequence to genomic DNA sequence

Effect of gap parameters

[Figure: mature, spliced mRNA aligned to genomic DNA]
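The effect of the gap parameters can be seen in a small sketch of global alignment scoring with separate gap-opening and gap-extension penalties. This is a generic affine-gap dynamic programming recurrence, not the code of any particular program; the scoring values, function name and toy sequences are assumptions. With a cheap extension penalty, one long gap spanning the intron is much less costly than with a penalty that charges every gap position equally.

```python
NEG_INF = float("-inf")

def global_score(a, b, match=1.0, mismatch=-1.0, gap_open=10.0, gap_extend=0.5):
    """Global alignment score with affine gaps (first gap position costs
    gap_open, each further position in the same gap costs gap_extend)."""
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # a[i] aligned to b[j]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # a[i] aligned to a gap
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # b[j] aligned to a gap
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# Toy gene: two 8-base exons separated by a 12-base intron (all hypothetical).
exon1, intron, exon2 = "ACGTACGT", "GTAAGTTTTCAG", "TTGCATGC"
mrna = exon1 + exon2
genomic = exon1 + intron + exon2

print(global_score(mrna, genomic))                              # 0.5
print(global_score(mrna, genomic, gap_open=2, gap_extend=2))    # -8.0
```

In both runs the 16 exon bases match; the scores differ only in how the single 12-base gap is charged (10 + 11 x 0.5 versus 12 x 2).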

Dot plot analysis (dottup) reveals repeats


Introduction to the "Exercises with biological sequences - examining HIV genes and proteins" - important biological questions addressed.

BLAST

* Identifying orthologues and paralogues.
* Non-viral homologues to any HIV proteins?
* Are we able to identify a relationship between human HIV and the monkey SIV?

ClustalW

* How does HIV drug resistance develop?
* What is the origin of HIV - relationship to monkey SIV?
* Using a multiple alignment to compute a phylogenetic tree