general nucleic acid sequence databases embl:(european molecular biology laboratory) genbank: ncbi...


Post on 15-Jan-2016


Page 1: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

General nucleic acid sequence databases

• EMBL (European Molecular Biology Laboratory): http://www.ebi.ac.uk/Information/
• GenBank, NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
• DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/

Entry name; accession number; version number

Page 2

General protein sequence databases

• SWISS-PROT
• PIR
• PRF/SEQDB
• PDB: the largest data bank of three-dimensional (3-D) biological macromolecular structure data.

Translated from coding sequences (CDS):
• TrEMBL
• GenPept

Page 3

• SWISS-PROT is a highly curated database with excellent documentation. It systematically merges variants and fragments into a single entry, but it lags far behind the growth of the DNA data banks.

• PIR contains more sequences, including numerous “really sequenced” oligopeptides, but is not that tightly curated.

• The “automatic” data banks such as TrEMBL and GenPept are even larger, but contain little documentation and sometimes conceptual translations that are not actually found in nature.

Page 4

BLAST (Basic Local Alignment Search Tool)

• The BLAST algorithm breaks the query sequence into short fragments, or “words,” and looks for an identical or close match between those words and words from the database sequences. When such a match, or “hit,” is encountered, the hit is extended in both directions to generate a local alignment segment. The quality of each alignment is quantified in a score, and the high-scoring segment pairs (HSPs) are reported in a table.
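The seed-and-extend idea described above can be sketched in a few lines of Python. This is a toy illustration, not the BLAST implementation: it uses exact word matches only (BLAST also accepts close matches), extends without gaps, and the function name `find_hsps`, the scoring values, and the `x_drop` cutoff are all assumptions made for the example.

```python
from collections import defaultdict

def find_hsps(query, subject, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Toy seed-and-extend search: exact word hits, ungapped extension."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    def extend(i, j, step):
        # Walk away from the seed in one direction and remember the best
        # score seen; stop once the score falls x_drop below that best.
        score = best = best_len = steps = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            score += match if query[i] == subject[j] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score > x_drop:
                break
            i += step
            j += step
        return best, best_len

    hsps = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            right_score, right_len = extend(i + word_size, j + word_size, 1)
            left_score, left_len = extend(i - 1, j - 1, -1)
            score = word_size * match + left_score + right_score
            length = left_len + word_size + right_len
            # (score, query start, subject start, segment length)
            hsps.add((score, i - left_len, j - left_len, length))
    return sorted(hsps, reverse=True)

for hsp in find_hsps("ACGTACGT", "TTACGTACGTTT"):
    print(hsp)
```

Several seeds on the same diagonal extend to the same segment, which is why the results are collected in a set before sorting by score.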

Page 5

• BLASTN compares a nucleotide query sequence with a nucleotide sequence database.
• BLASTP compares a protein query sequence with a protein sequence database.
• BLASTX compares a nucleotide query sequence translated in all six reading frames with a protein sequence database.
• TBLASTN compares a protein query sequence with a nucleotide sequence database dynamically translated in all six reading frames.
• TBLASTX compares the six-frame translation of a nucleotide query sequence with the six-frame translations of a nucleotide sequence database.
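BLASTX, TBLASTN and TBLASTX all rely on six-frame translation: three reading frames on the given strand and three on its reverse complement. The sketch below shows one way to compute the six frames, assuming the standard genetic code; the helper names are hypothetical and are not taken from any BLAST source.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    # Unknown codons (for example those containing N) become X.
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

print(six_frame_translation("ATGGCCTAA"))
```

A stop codon appears as `*` in the output, so an open reading frame shows up as a stretch without `*`.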

http://www.ddbj.nig.ac.jp/

Page 7

Sequence alignment

Chapter 5

Measuring Genetic Change

Page 10

D = s + wg

where s is the number of substitutions (mismatches), g is the number of gaps, and w is the gap penalty.

Page 12

w = 1    P1: 0 + 1 × 2 = 2    P2: 2 + 1 × 1 = 3

D = s + wg

Page 13

w = 1    P1: 0 + 1 × 2 = 2    P2: 2 + 1 × 1 = 3
w = 2    P1: 0 + 2 × 2 = 4    P2: 2 + 2 × 1 = 4
w = 3    P1: 0 + 3 × 2 = 6    P2: 2 + 3 × 1 = 5

Small w: each gap has little impact, so alignments contain more gaps.
Large w: each gap has a large impact, so alignments contain fewer gaps.

D = s + wg

Many gaps or highly variable sequences: a small w can be chosen.
Few gaps or conserved sequences: a large w can be chosen.
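The worked numbers above can be reproduced with a short script. It assumes, as the arithmetic suggests, that alignment P1 has 0 substitutions and 2 gaps and that P2 has 2 substitutions and 1 gap; the function names are illustrative only.

```python
def dissimilarity(substitutions, gaps, w):
    """Return D = s + w*g for one candidate alignment."""
    return substitutions + w * gaps

# (s, g) for the two candidate alignments in the worked example.
ALIGNMENTS = {"P1": (0, 2), "P2": (2, 1)}

def preferred(w):
    """Return the alignment name(s) with the smallest D for gap penalty w."""
    scores = {name: dissimilarity(s, g, w) for name, (s, g) in ALIGNMENTS.items()}
    best = min(scores.values())
    return sorted(name for name, d in scores.items() if d == best)

for w in (1, 2, 3):
    print(w, preferred(w))
```

With a small w the gap-rich alignment P1 wins, with a large w the gap-poor alignment P2 wins, and at w = 2 the two tie.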

Page 14

D = s + wg

Page 15

Blosum62mt

     A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A    8  -4   0  -4  -2  -4   0  -4  -2  -2  -2  -2  -4  -2  -2  -2   2   0   0  -6   0  -4  -2
B   -4   8  -6   8   2  -6  -2   0  -6   0  -8  -6   6  -4   0  -2   0  -2  -6  -8  -2  -6   2
C    0  -6  18  -6  -8  -4  -6  -6  -2  -6  -2  -2  -6  -6  -6  -6  -2  -2  -2  -4  -4  -4  -6
D   -4   8  -6  12   4  -6  -2  -2  -6  -2  -8  -6   2  -2   0  -4   0  -2  -6  -8  -2  -6   2
E   -2   2  -8   4  10  -6  -4   0  -6   2  -6  -4   0  -2   4   0   0  -2  -4  -6  -2  -4   8
F   -4  -6  -4  -6  -6  12  -6  -2   0  -6   0   0  -6  -8  -6  -6  -4  -4  -2   2  -2   6  -6
G    0  -2  -6  -2  -4  -6  12  -4  -8  -4  -8  -6   0  -4  -4  -4   0  -4  -6  -4  -2  -6  -4
H   -4   0  -6  -2   0  -2  -4  16  -6  -2  -6  -4   2  -4   0   0  -2  -4  -6  -4  -2   4   0
I   -2  -6  -2  -6  -6   0  -8  -6   8  -6   4   2  -6  -6  -6  -6  -4  -2   6  -6  -2  -2  -6
K   -2   0  -6  -2   2  -6  -4  -2  -6  10  -4  -2   0  -2   2   4   0  -2  -4  -6  -2  -4   2
L   -2  -8  -2  -8  -6   0  -8  -6   4  -4   8   4  -6  -6  -4  -4  -4  -2   2  -4  -2  -2  -6
M   -2  -6  -2  -6  -4   0  -6  -4   2  -2   4  10  -4  -4   0  -2  -2  -2   2  -2  -2  -2  -2
N   -4   6  -6   2   0  -6   0   2  -6   0  -6  -4  12  -4   0   0   2   0  -6  -8  -2  -4   0
P   -2  -4  -6  -2  -2  -8  -4  -4  -6  -2  -6  -4  -4  14  -2  -4  -2  -2  -4  -8  -4  -6  -2
Q   -2   0  -6   0   4  -6  -4   0  -6   2  -4   0   0  -2  10   2   0  -2  -4  -4  -2  -2   6
R   -2  -2  -6  -4   0  -6  -4   0  -6   4  -4  -2   0  -4   2  10  -2  -2  -6  -6  -2  -4   0
S    2   0  -2   0   0  -4   0  -2  -4   0  -4  -2   2  -2   0  -2   8   2  -4  -6   0  -4   0
T    0  -2  -2  -2  -2  -4  -4  -4  -2  -2  -2  -2   0  -2  -2  -2   2  10   0  -4   0  -4  -2
V    0  -6  -2  -6  -4  -2  -6  -6   6  -4   2   2  -6  -4  -4  -6  -4   0   8  -6  -2  -2  -4
W   -6  -8  -4  -8  -6   2  -4  -4  -6  -6  -4  -2  -8  -8  -4  -6  -6  -4  -6  22  -4   4  -6
X    0  -2  -4  -2  -2  -2  -2  -2  -2  -2  -2  -2  -2  -4  -2  -2   0   0  -2  -4  -2  -2  -2
Y   -4  -6  -4  -6  -4   6  -6   4  -2  -4  -2  -2  -4  -6  -2  -4  -4  -4  -2   4  -2  14  -4
Z   -2   2  -6   2   8  -6  -4   0  -6   2  -6  -2   0  -2   6   0   0  -2  -4  -6  -2  -4   8

Page 16

PAM500

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A    1  -1   0   1  -2   0   1   1   0   0  -1   0  -1  -3   1   1   1  -6  -3   0   1   0   0  -9
R   -1   5   1   0  -4   2   0  -1   2  -2  -2   4   0  -4   0   0   0   4  -4  -2   0   1   0  -9
N    0   1   1   2  -3   1   1   1   1  -1  -2   1  -1  -4   0   1   0  -5  -3  -1   1   1   0  -9
D    1   0   2   3  -5   2   3   1   1  -2  -3   1  -2  -5   0   1   0  -7  -5  -1   2   2   0  -9
C   -2  -4  -3  -5  22  -5  -5  -3  -4  -2  -6  -5  -5  -3  -2   0  -2  -9   2  -2  -4  -5  -2  -9
Q    0   2   1   2  -5   2   2   0   2  -1  -2   1  -1  -4   1   0   0  -5  -4  -1   2   2   0  -9
E    1   0   1   3  -5   2   3   1   1  -2  -3   1  -1  -5   0   1   0  -7  -5  -1   2   2   0  -9
G    1  -1   1   1  -3   0   1   4  -1  -2  -3   0  -2  -5   1   1   1  -8  -5  -1   1   1   0  -9
H    0   2   1   1  -4   2   1  -1   4  -2  -2   1  -1  -2   0   0   0  -2   0  -2   1   2   0  -9
I    0  -2  -1  -2  -2  -1  -2  -2  -2   3   4  -2   3   2  -1  -1   0  -5   0   3  -2  -2   0  -9
L   -1  -2  -2  -3  -6  -2  -3  -3  -2   4   7  -2   4   4  -2  -2  -1  -1   1   3  -3  -2  -1  -9
K    0   4   1   1  -5   1   1   0   1  -2  -2   4   0  -5   0   0   0  -3  -5  -2   1   1   0  -9
M   -1   0  -1  -2  -5  -1  -1  -2  -1   3   4   0   4   1  -1  -1   0  -4  -1   2  -1  -1   0  -9
F   -3  -4  -4  -5  -3  -4  -5  -5  -2   2   4  -5   1  13  -4  -3  -3   3  13   0  -4  -5  -2  -9
P    1   0   0   0  -2   1   0   1   0  -1  -2   0  -1  -4   4   1   1  -6  -5  -1   0   1   0  -9
S    1   0   1   1   0   0   1   1   0  -1  -2   0  -1  -3   1   1   1  -3  -3  -1   1   0   0  -9
T    1   0   0   0  -2   0   0   1   0   0  -1   0   0  -3   1   1   1  -6  -3   0   0   0   0  -9
W   -6   4  -5  -7  -9  -5  -7  -8  -2  -5  -1  -3  -4   3  -6  -3  -6  34   2  -6  -6  -6  -4  -9
Y   -3  -4  -3  -5   2  -4  -5  -5   0   0   1  -5  -1  13  -5  -3  -3   2  15  -1  -4  -4  -2  -9
V    0  -2  -1  -1  -2  -1  -1  -1  -2   3   3  -2   2   0  -1  -1   0  -6  -1   3  -1  -1   0  -9
B    1   0   1   2  -4   2   2   1   1  -2  -3   1  -1  -4   0   1   0  -6  -4  -1   2   2   0  -9
Z    0   1   1   2  -5   2   2   1   2  -2  -2   1  -1  -5   1   0   0  -6  -4  -1   2   2   0  -9
X    0   0   0   0  -2   0   0   0   0   0  -1   0   0  -2   0   0   0  -4  -2   0   0   0   0  -9
*   -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9   1

Page 17

• The cost for every pair of possible amino acid replacements defines a cost matrix that can be used to score an alignment. Protein sequence alignment programmes typically use matrices derived from empirical comparisons of protein sequences.
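As a rough illustration of scoring with such a matrix, the sketch below sums pairwise scores over an alignment. It uses only a handful of Blosum62mt entries (for A, C, D and E) and a gap penalty of -8 that is purely an assumption for the example. Blosum-style entries are similarity scores, so a higher total means a better alignment.

```python
# A few Blosum62mt entries (A, C, D, E only); the matrix is symmetric.
SCORES = {
    ("A", "A"): 8, ("A", "C"): 0, ("A", "D"): -4, ("A", "E"): -2,
    ("C", "C"): 18, ("C", "D"): -6, ("C", "E"): -8,
    ("D", "D"): 12, ("D", "E"): 4,
    ("E", "E"): 10,
}

def pair_score(a, b):
    """Look up a residue pair in either order."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def alignment_score(row1, row2, gap_penalty=-8):
    """Sum matrix scores over aligned columns; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap_penalty  # hypothetical flat cost per gap position
        else:
            total += pair_score(a, b)
    return total

print(alignment_score("ACDE", "ADDE"))
print(alignment_score("ACDE", "A-DE"))
```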

Page 18

• #z97619 AATCAA-TAG TTTTTTAATT GAAAACTGGA ATGAATGGTT TGACGAG-AA

• #z97620 AATCAA-TAG TTTTTTAATT GGAAACTGGG ATGAATGGTT TGACGAA-AA

• #u18065 TAATCATTAG TTTCTTAATT AGGGGCTTGA ATGAAGGGAT TGACGAGAAA

• #u18066 TAATCATTAG TTTCTTAATT AGGGGCTTGA ATGAATGGAT TGACGAGAAA

• #u18069 AATCA-TTAG TCTCTTAATT AGAGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18070 AATCA-TTAG TCTCTTAATT GGGGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18071 AATCA-TTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT T-ACGAG-AA

• #u18068 AATCAGTTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18073 AATCA-TTAG TTTCTTAATT AGGGGCTTGT ATGAATGGTT TGACGAG-AA

• #u18074 AATCA-TTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT TCACGAG-AA

• #u18072 AATCA-TTAG TTTCTTAATT AGAGGCTTGT ATGAATGGTT TGACGAG-AA

• #u18064 AATCA-TTAG TTTCTTAATT AGAGGCTGGA ATGAATGGTT TGACGAG-AA

• #u18067 AATCA-TTAG TTTCTTAATT AGAGGCTGGA ATGAATGGTT TGACGAG-AA

• #af514505 AATCA-TTAG TTTCTTAATT GGGGACTGGA ATGAATGGTT TGACGAG-AA

• #z97617 AATCA-TTAG TCTCTTAATT AGAGACTGGA ATGAAGGGTT TAACAAG-AA

• #z97621 AATCA-TTAG TCTTTTAATT GAAGGCTGGT ATGAATGGTT TGACGAG-GA

• #z97623 AATCA-TTAG TCTTTTAATT GAAGACTGGA ATGAATGGTT TGACGAG-GA

Given such an alignment, how should w be selected in D = s + wg?

Page 19

• #z97619 TTATATAAAA TTTTATGTTT ACTTTATTTT TATAT---TT TATATATATT

• #z97620 ATAT---AAT TTTGTTTTTA CTTTTATTTT TATAT---TA AAAAAATATT

• #u18065 GATTTTATAT TATTTTAGTT TAGATTTTTA AATATAATTT TTATAATGTT

• #u18066 GATTTTATAT TATTTTAGTT TATATTTTTA AATATAATTT TTATAATGTT

• #u18069 ATTTTTATAT TATTTTGGTT T--ATTTTAA AATAAAATTT TTATAATGTT

• #u18070 ATTTTTATAT TATTTTGGTT T--ATTTTAA AATAAAATTT TTATAGTGTT

• #u18071 ATTTTTATAT TATTTTGGTT T--ATTTTTA AGTATAATTT TTATAATGTT

• #u18068 AATTTTATAT TATTTTGGTT T--ATTTTTA AATATAATTT TTACTATGTT

• #u18073 AAATTTATAT TATTTTAGTT T--ATTTTTA AGTATAAATT TTTAAATGTT

• #u18074 AGTTTTGTAT TATTTTAGCT T--ATCTTTT AATATAAGTT TTTTAATGTT

• #u18072 AATTTTTTAT TATTTTAGTT T--ATCTTTT AATATAGATT TTT-AATGTT

• #u18064 ATTTAATATT TCTTTTA--- -TTATCTTTT TATATTAAAT GT-TGATGTT

• #u18067 ATTTAATATT TTTTTTA--- -TTATCTTTT TATATTAATT GT-TGATGTT

• #af514505 AATTAATTTT TATTATATAG TTTATTTTTT AATGTTAATT TT-TATTGTT

• #z97617 -ATTTAATTT TGTTTTTTTG TAAATTTTGT TACTATTAAT TCAAAATATT

• #z97621 TGTAATGTAT TTTTGGATTG ----TTTTTT TACATGCATT A-GTTATATT

• #z97623 TTTATATTTG TATATGATAG ----TTTTGA AATATATTTT ATATTATATT

Page 21

If indels were weighted 4, transversions 2, and transitions 1, the morphological character data were weighted 4. Leading and trailing gaps were weighted one-half as much as internal gaps. These parameters, the insertion:deletion cost (indel) and the transversion:transition ratio (Tv:Ti), were varied. In all cases where morphological data were included, character transformations for morphology were weighted as equal to the indel cost.
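To make the weighting scheme concrete, here is a minimal sketch that costs one pair of aligned DNA rows with indel 4, transversion 2, transition 1, and leading or trailing gaps at half the indel cost. The function names and the handling of edge cases are assumptions for illustration; the sketch expects each row to contain at least one base and ignores morphological characters.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """True when both bases are purines or both are pyrimidines."""
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES

def non_gap_span(row):
    """Return (first, last) indices of the non-gap characters in a row."""
    positions = [i for i, c in enumerate(row) if c != "-"]
    return positions[0], positions[-1]

def weighted_cost(row1, row2, indel=4, transversion=2, transition=1):
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    spans = (non_gap_span(row1), non_gap_span(row2))
    total = 0.0
    for i, (a, b) in enumerate(zip(row1, row2)):
        if a == b:
            continue  # identical bases (or a shared gap column) cost nothing
        if a == "-" or b == "-":
            first, last = spans[0] if a == "-" else spans[1]
            terminal = i < first or i > last
            total += indel / 2 if terminal else indel
        elif is_transition(a, b):
            total += transition
        else:
            total += transversion
    return total

print(weighted_cost("--ACGT", "TTACAT"))
print(weighted_cost("AC-GT", "ACTGC"))
```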

Page 24

[Figure: substitutions accumulating in the sequence ATCGATATGCTT. The number of changes grows while the number of observed differences stays the same.]

3 changes, 3 differences
4 changes, 3 differences
5 changes, 3 differences

Page 25

A body at rest stays at rest; a body in motion stays in motion.

Page 27

[Figure: Tv or Ts (axis 0 to 0.15) plotted against an axis running from 0 to 0.2, with series Total Tv and Total Ts, for comparisons within sibling species, among species in the same genus, and among genera]

Tv: transversional substitution
Ts: transitional substitution

Page 32

JC69

Page 34

Still A

Change to A

Page 36

pA(0) = 1
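The equations on these slides did not survive in the transcript. The sketch below therefore relies on the standard Jukes-Cantor (1969) formulation rather than on the slides themselves. It assumes α is the rate of change to each of the three other nucleotides, so that a site starting as A (pA(0) = 1) is still A at time t with probability 1/4 + (3/4)e^(-4αt). The function names are illustrative.

```python
import math

def prob_same(t, alpha):
    """P(site is in its starting state at time t) under JC69, with p(0) = 1."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)

def jc69_distance(p):
    """Estimated substitutions per site from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(prob_same(0.0, 0.01))
print(jc69_distance(0.1))
```

The corrected distance is always at least as large as the observed proportion p, because some sites have changed more than once.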

Page 43

Transition

Transversion

Remain identical

Page 44

Saturation effect in DNA mutation

1. GTTCTCAGAATC
2. GATCACAGAAAC

T A T C G A
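Observed differences between two sequences can be counted directly, but once a site has changed more than once the count underestimates the real number of changes. A minimal sketch for counting the observed transitions and transversions between the two sequences on this slide follows; the function name is hypothetical and gap-free sequences of equal length are assumed.

```python
PURINES = {"A", "G"}

def count_differences(seq1, seq2):
    """Return (transitions, transversions) observed between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions

ts, tv = count_differences("GTTCTCAGAATC", "GATCACAGAAAC")
print(ts, tv, (ts + tv) / 12)
```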

Page 46

Substitution trend in the coat protein gene:

[Figure: CPTS and CPTV plotted against Total (axes 0 to 0.2 and 0 to 0.25), with linear fits y = 0.8027x for CPTS and y = 0.1969x for CPTV]

The rate of transitional substitution is 4 times that of transversional substitution (0.8/0.2).

Ts: transitional substitution
Tv: transversional substitution

Page 47

Felsenstein 81

Page 48

Taxon               G%      A%     T%     C%
Palaeopteran
  Ephemerida        22.0    33.8   32.5   11.7
Orthoptroid
  Isoptera          20.1a   24.9   42.0   12.0
  Grylloblateria    21.3a
  Orthoptera-Loxo.  21.9a   30.4   35.9   11.8
  Blattaria         17.5b   33.1   39.0   10.5
  Phasimid          17.9b
Hemipteroid
  Homoptera         16.4    32.3   40.7   10.6
Holometabolous
  Diptera           15.0    35.4   40.5   9.1
  Coleoptera        16.4    35.6   39.4   8.6
  Lepidoptera       13.6    39.7   39.1   7.6
  Hymenoptera       9.6     44.5   39.3   6.7
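Percentages like those in the table can be computed from a sequence with a few lines of code. The sketch below is a generic base-composition counter, not the procedure used for the table; the function name and the choice to ignore ambiguous characters are assumptions.

```python
from collections import Counter

def base_composition(sequence):
    """Return the percentage of G, A, T and C among the unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "GATC")
    if total == 0:
        raise ValueError("sequence contains no G, A, T or C")
    return {base: round(100.0 * counts[base] / total, 1) for base in "GATC"}

print(base_composition("AAAATTTTGC"))
```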

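Page 49

Conclusion: RNA secondary structure constrains the variability of the 5' UTR

• Both short-term (within-group) and long-term (between-group) evolutionary results show that the RNA secondary structure is required.
• The RNA secondary structure is stable; an isolate is likely to be retained only if it has a specific secondary structure (between groups).
• The secondary structures of the positive and negative strands of the 5' UTR are each stable in their own way.
• The secondary structures of the positive and negative strands of the 5' UTR have different functions; these two different evolutionary forces make the promoter a hypervariable region.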

Page 50

Evolutionary trend of the 5' UTR secondary structure of satBaMV associated with Ma bamboo in the botanical garden

[Figure: diagram labeled with BSL6 (1995); DL-I, DL-II, DL-III, DL-IV, DL-6V (1997); I (1998); groups I, II, III, IV, I-6, I-8, 6Va, 6Vb; and structure types A1, A2, A3, A5]

Page 51

Evolutionary trend in base composition of mitochondrial 16S ribosomal DNA

Page 52

Using 16S rRNA secondary structure information to align the DNA sequences of the various insect groups together

Gypsy moth    E. coli

Page 53

[Figure: bar chart of A (axis 0 to 60) by insect group: Thys, Emph, Ordo, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 54

[Figure: bar chart of T (axis 0 to 50) by insect group: Thys, Emph, Ordo, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 55

[Figure: bar chart of G (axis 0 to 30) by insect group: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 56

[Figure: bar chart of C (axis 0 to 16) by insect group: Thys, Emph, Ordo, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 57

[Figure: bar chart (axis 0 to 25) by insect group: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Are the extra G's all located at loop positions?

Page 58

[Figure: bar chart of STEM-G (axis 0 to 35) by insect group: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 59

[Figure: bar chart of A (axis 0 to 60) by insect group: Thys, Emph, Ordo, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 60

[Figure: bar chart of LOOP-A (axis 0 to 60) by insect group: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 61

[Figure: bar chart of STEM-A (axis 0 to 50) by insect group: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme]

Page 62

[Figure: plot of G against A; axis ranges 0 to 35 and 0 to 50]

Page 63

The decline of base G across the insect groups, viewed from the perspective of base T:

In the more primitive insects, base T pairs with G.
In the more derived insects, base T pairs with A.

The secondary structure of the RNA stem is therefore not affected.

Page 64

HKY85

Page 65

C G

Page 66

General reversible model

Page 73

+ + + + + + + + + + + + + + + +
I-1 (A)  GAAAACTCACCGCAACGAAACGAAAACAATCGTTCAGAAATACTTGACCACGAGGGGTCCCCTATAGTCCGCTTTGGCGGTGCGGCAGCCCCCGTGCGATAGGCTAACTGCGGTATTCCCCGCACTCCGTCGAGCGGTTAATACGACGCTTACCAAGACG
II-1 (A) .....................................................................T..........................................................................................
II-9 (A) .........................................................C......CTA..T..G...................................C.............................A.....................
III-1(A) ..........................T..........................................T..........................................................................................
IV-1 (A) ........................................................-.T.............G..........T...A-.................................................A.....................
BB21 (A) .......................................G................-.T..T..........G..........T...G-.................................................A.....................
6V-1 (A) .........................................................C......CTA..T..G..........................C........C.............................A.....................
6V-6 (A) .........................CA..............................C...........T..........................................................................................
BSL3 (A) .........................CA..............................-T.............G.....-.................................................................................
BSL6 (A) .........................CA..............................C...........T........................................A.................................................
DL11 (A) T........................CA.............................C............T..................G.......................................................................
DL12 (A) .........................CA..........................................T..................-.......................................................................
DL15 (A) .........................CA.............................CC............-AG..-TT........-.G.......................................................................
DL16 (A) .........................CA..........................................T..........................................................................................
DL23 (A) .................G.......CA..........................................T..........................................................................................
BB18 (B) .........................CA.............C...A...........CA......C.A-..-.---AAG..........G.....................T...G.............................................
BB23 (B) .........................CA.G...........C...A...........CC............-TG...AA..........G.....................T...G.............................................
BB25 (B) ..............-..........CA.............C...A...........CC............-.GC..AG..........GT....................T...G.............................................
BB28 (B) .....................-----A.............C...................T.......A.-.G.-CAG.T..............................T...G.....................................TT......
BO20 (B) .........................CA.............C...A.....T.....T.............-.GC..AG.......T-.G.....................T...G.............................................
BO23 (B) .........................CAG............C...A.....T.....C.............-.GC..AG..........G.....................T...G..-..........................................
BV17 (B) .........................CA.............C...A...........CC......C..-..-.GC..AG..........G.....................T...G.............................................
DL17 (B) .........................CA..............................C...........T......AG....T..T..G.....................T...G.............................................
DL19 (B) .........................CA.............C...A.........A.TC............-.G..-AG..........G.T...................T...G.............................................
DL20 (B) -------------------------------------------.A...........CC............-.GC..AG..........GT........................G.............................................
DL21 (B) .........................CA.............C...A.....T.....CC..........A.-.G...AG.T........G.....................T...G.............................................
BSF4 (B) .........................CA.............C...A...........CC............-.GC..AG....T.....G.....................T...G.............................................
BSL4 (B) .........................CA.............C...A.....T.....C..T..........-.GC..AG..........G.....................T...G.............................................
USA1 (B) .........................CA.............C...A.....T.....CC............-.GC..A-.....T....G.....................T...G.............................................
BSL2 (B) .........................CA.................A...........CC......C..C..-.GC..AG..........G.....................T...G.............................................
BSL1 (B) .........................CA.................A...........CC..........A.-.G...AG.T........G.....................T.................................................

Secondary structure simulation by Mfold program

Page 74

[Figure: predicted positive-strand stem-loop structures for each satBaMV isolate; differing positions are marked with ▃]

Group A: DL6V6, BSL6, DL12, DL11, DL16, DL23, III-1, DLI-1, DLII-1, BB21, DLIV-1, DL15, BSL3, DL6V1
Group B: DL20, BB25, BO20, BB23, BO23, BSF4, BV17, BSL1, DL21, BSL4, DL19, USA1, BB18, DL17, BB28, BSL2

RNA secondary structure simulation of satBaMV isolates (positive strand)

Page 75

[Figure: representative positive-strand stem-loop structures for Group A, Group B, and IV-*]

Positive strand

Page 76

RNA secondary structure constrains the variability of the 5' UTR

The 5' UTR, the 3' UTR, and their negative strands are recognition regions for the replicase and usually have specific secondary structures.

[Figure: 3' UTR of BaMV, with p20 and the 5' and 3' ends labeled (Tsai et al.)]

Page 77

[Figure: negative-strand stem-loop structures of types A1, A2, A3, A4 and A5, numbered at positions 60, 70, 80 and 90, with the variant sites (A2) to (A5) marked]

Isolates shown: BSL6; 6V4,6,7,9; 6V1,8,10,12; I-6,8; II-5,9; I-1,2,3,4,5,7,9,10; III-5; III-1,4,6,7,8,9; II-1; DL23; DL16; II-3,6,7,8; III-2,3; BB21; IV1~12

Negative strand

Page 78

[Figure: negative-strand stem-loop structures of types B1, B2, B3 and B4, numbered at positions 60, 70, 80 and 90, with the variant sites (B2), (B3) and (B4, a deletion) marked]

Isolates shown: BSF4, BSL1, DL21, BB23, BB25; DL15; BO23, DL11, BSL4, BB18; BO20, DL19

Negative strand

Page 79: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

[Figure: negative-strand stem-loop structures (nucleotides 60-90) of Group A sequences BSL6 and DLIV and Group B sequence BSF4]

Page 80: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

[Figure: negative- and positive-strand stem-loop structures compared for Group A (BSL6) and Group B (BSF4)]

Page 81: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 82: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 83: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Small values of α result in an L-shaped distribution with extreme variation of rates; most sites are invariable but a few have very high rates of substitution

Parameter α: the range of rate variation among sites

Rates are equal among sites

Conserved region

Rates differ greatly among sites

Variable region
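The effect of α can be illustrated with a small simulation. This is only a sketch: it assumes site rates follow a gamma distribution with shape α and mean 1 (a common convention, not stated on the slide), and the function names and the 0.1 threshold are made up for the illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha and scale = 1/alpha give mean 1 and variance 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_nearly_invariable(rates, threshold=0.1):
    """Fraction of sites whose relative rate falls below an arbitrary threshold."""
    return sum(r < threshold for r in rates) / len(rates)

for alpha in (0.2, 5.0):
    rates = sample_site_rates(alpha, 10_000)
    print(f"alpha={alpha}: "
          f"nearly invariable={fraction_nearly_invariable(rates):.2f}, "
          f"largest rate={max(rates):.1f}")
```

With a small α a large share of sites barely change while a few have very high rates; with a large α the rates cluster around the mean.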

Page 84: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• This is primarily because the majority of substitutions happen at the same sites; that is, the variable positions.

• Obviously the more distantly related the sequences, the more pronounced this phenomenon becomes.

Page 85: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 86: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• Jin and Nei (1990) followed a similar approach, but assumed that substitution rates were Γ-distributed among sites. Using this evolutionary model, which involves a parameter α that describes the extent of the rate variation, they derived several equations to compute the evolutionary distance from the observed sequence dissimilarities.
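As an illustration of how such a correction behaves, the sketch below compares the plain Jukes-Cantor distance with its gamma-corrected counterpart. The slide does not say which of the equations is meant, so the Jukes-Cantor form is an assumption here, and the function names are illustrative.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates (shape parameter alpha)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p = 0.3  # hypothetical observed dissimilarity
print(jc_distance(p))             # equal rates at all sites
print(jc_gamma_distance(p, 0.5))  # strong rate variation gives a larger distance
```

As α grows, the gamma-corrected value approaches the equal-rates Jukes-Cantor distance.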

Page 87: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• Relative nucleotide substitution rates in the SRC method are estimated by observing the frequencies with which sequence pairs differ at homologous positions.

• For an alignment of n sequences, TREECON computes n(n-1)/2 pairwise evolutionary distances d according to the Jukes and Cantor equation.

• When all pairwise distances have been computed, they are classified in several distance intervals (e.g., four).

• For each distance interval, the fraction of sequence pairs possessing a different nucleotide is plotted and a curve obeying the following equation:

Page 88: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 89: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 90: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 91: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• This is accomplished for all alignment positions

• The probability pi that an alignment position i contains a different nucleotide in two sequences, as a function of the evolutionary distance d separating these sequences.

• The slope of the curve through the origin yields the specific nucleotide substitution rate vi for the position under consideration.

• After estimation of all vi values, alignment positions are grouped into sets of similar variability and form a spectrum of relative nucleotide substitution rates.

Page 92: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 93: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 94: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 95: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 96: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 97: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 98: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Inferring Molecular Phylogeny

Distance methods first convert aligned sequences into a pairwise distance matrix, then input that matrix into a tree-building method, whereas discrete methods consider each nucleotide site directly.

The parsimony tree gives us the additional information of which sites contribute to the length of each branch. Once we convert sequences into distances we lose this information.

Page 99: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 100: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Clustering methods

Page 101: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• Tree-building methods in the second class use optimality criteria to choose among the set of all possible trees. This criterion is used to assign to each tree a ‘score’ or rank which is a function of the relationship between tree and data.

Page 102: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Tree-building methods in the second class use optimality criteria to choose among the set of all possible trees (Fig. 6.3). This criterion is used to assign to each tree a ‘score’ or rank which is a function of the relationship between tree and data (examples include maximum parsimony and maximum likelihood).

Page 103: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• What is the value of the optimality criterion for that tree?

• Which tree requires the fewest evolutionary events?

• While for small numbers of sequences (e.g. no more than 20) it is often possible to find the optimal tree (or trees), in many cases this is not feasible, in which case we have to rely on heuristic methods.

• A typical heuristic strategy is to start with a tree and rearrange it, keeping any rearrangement that produces a better tree. Such algorithms are often called ‘hill-climbing’.
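To see why exhaustive search quickly becomes infeasible, one can count the candidate trees. The sketch below assumes unrooted, fully bifurcating trees with labelled tips, for which the count is the product 3 × 5 × ... × (2n-5); the function name is illustrative.

```python
def count_unrooted_trees(n):
    """Number of distinct unrooted, fully bifurcating trees for n >= 3 labelled tips."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

The count grows faster than exponentially, which is why heuristic rearrangement searches are used for larger data sets.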

Page 104: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 105: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Efficiency; Power; Consistency; Robustness; Falsifiability

Page 106: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Unweighted pair group method with arithmetic means (UPGMA)

• In an ultrametric tree all the tips are equidistant from the root of the tree, which is equivalent to assuming a molecular clock.
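A minimal UPGMA sketch follows. The distance matrix is a made-up toy example, not the one from the slides, and the data layout (a dictionary keyed by pairs of tip names) is an assumption chosen for brevity. Each merge is placed at half the average distance between the two clusters, which is what makes the resulting tree ultrametric.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster with UPGMA; return a list of (cluster_a, cluster_b, height) merges.

    dist maps frozenset({x, y}) to the distance between tips x and y.
    """
    clusters = [(name,) for name in names]

    def avg(a, b):
        # Arithmetic mean of all tip-to-tip distances between two clusters.
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b) / 2))  # node height = half the distance
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy distances for four sequences.
toy = {
    frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.6,
    frozenset("AD"): 1.0, frozenset("BD"): 1.0, frozenset("CD"): 1.0,
}
for a, b, height in upgma("ABCD", toy):
    print(a, b, round(height, 3))
```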

Page 107: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 108: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 109: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 110: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 111: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 112: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 113: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

0.1715/2

0.2192/2

0.2795/2

Page 114: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Distance methods

• Distances are rarely exact tree metrics, and hence one class of ‘goodness of fit’ methods seeks the metric tree that best accounts for the ‘observed’ distances.

• The goodness of fit F between the observed distances d_ij and the tree distances p_ij for each pair of sequences i and j is given by:

• In the example just given we were fitting an additive tree with (2n-3) branches to $\binom{n}{2} = n(n-1)/2$ pairwise distances.
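The idea of comparing observed and tree distances can be sketched for four sequences. The exact form of F is not reproduced in this transcript, so the unweighted sum of |d_ij - p_ij| raised to a power (2 for least squares) used below is an assumption, as are the function names and the branch lengths.

```python
from itertools import combinations

def tree_distances(a, b, c, d, e):
    """Path lengths on the unrooted tree ((A,B),(C,D)): tip branches a-d, internal e."""
    return {
        ("A", "B"): a + b, ("C", "D"): c + d,
        ("A", "C"): a + e + c, ("A", "D"): a + e + d,
        ("B", "C"): b + e + c, ("B", "D"): b + e + d,
    }

def goodness_of_fit(observed, fitted, power=2):
    """Assumed unweighted fit measure: sum over pairs of |d_ij - p_ij| ** power."""
    return sum(abs(observed[pair] - fitted[pair]) ** power for pair in observed)

n = 4
pairs = list(combinations("ABCD", 2))
print(2 * n - 3, "branches fitted to", len(pairs), "pairwise distances")

fitted = tree_distances(0.1, 0.2, 0.3, 0.4, 0.05)  # hypothetical branch lengths
observed = dict(fitted)
observed[("A", "B")] += 0.1                         # one distance that does not fit
print(goodness_of_fit(observed, fitted))
```

Because there are more pairwise distances than branches, a perfect fit (F = 0) is only possible when the distances are exactly additive on the tree.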

Page 115: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 116: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Minimum evolution

• Given an unrooted metric tree for n sequences there are (2n-3) branches, each with length e_i. The sum of these branch lengths is the length L of the tree:

$$L = \sum_{i=1}^{2n-3} e_i$$

The minimum evolution tree (ME) is the tree which minimizes L.

• More commonly, the branch lengths of the minimum evolution tree are estimated using least-squares methods. The branch lengths are estimated in the same way as for goodness of fit measures; however, rather than compare the fit of the observed distances the least squares branch lengths are added together to give the length of the tree.

Page 117: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 118: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 119: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 120: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 121: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Neighbour joining clustering

• Neighbour joining (NJ) is a widely used method for tree building which combines computational speed with uniqueness of result - most implementations give a single tree.

• One strategy for finding the ME tree is to first compute the NJ tree, then see if any local rearrangement of the NJ tree produces a shorter tree.

Page 122: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 123: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 124: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Terminal node i to all other taxa

New node u to the terminal taxa i and j

Page 125: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

0.2795/2 + (0.3959-0.4525)/2 = 0.1114

0.2795 - 0.1114 = 0.1681

Page 126: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

● node 1

Page 127: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

(0.2147+0.3091-0.2795)/2=0.1222
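The two neighbour-joining calculations shown on these slides can be written as small functions. This is a sketch under the assumption that 0.3959 and 0.4525 are the averaged divergences of taxa i and j from all other taxa; the function and argument names are illustrative.

```python
def nj_branch_lengths(d_ij, r_i, r_j):
    """Branch lengths from the joined taxa i and j to their new node u."""
    v_i = d_ij / 2 + (r_i - r_j) / 2
    v_j = d_ij - v_i          # the two branches must add up to d_ij
    return v_i, v_j

def nj_distance_to_new_node(d_ik, d_jk, d_ij):
    """Distance from the new node u to any remaining taxon k."""
    return (d_ik + d_jk - d_ij) / 2

v_i, v_j = nj_branch_lengths(0.2795, 0.3959, 0.4525)
print(v_i, v_j)
print(nj_distance_to_new_node(0.2147, 0.3091, 0.2795))
```

Unlike UPGMA, the two branches leading to the new node need not be equal, so the method does not assume a molecular clock.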

Page 128: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 129: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Objections to distance methods

• Summarizing a set of sequences by a pairwise distance matrix loses information;

• Branch lengths estimated by some distance methods may not be evolutionarily interpretable.

Page 130: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Discrete methods operate directly on the sequences, rather than on pairwise distances.

• The two major discrete methods are maximum parsimony (MP) and maximum likelihood (ML).

• Maximum parsimony chooses the tree (or trees) that requires the fewest evolutionary changes.

• Maximum likelihood chooses the tree (or trees) that, of all trees, is most likely to have produced the observed data.

Page 131: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

1 ATATT
2 ATCGT
3 GCAGT
4 GCCGT

The total number of evolutionary changes on a tree is simply the sum of the number of changes at each site.
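Counting changes site by site can be sketched with Fitch's algorithm, using the four sequences from the slide. The slide does not name a counting algorithm, so Fitch's method is a choice made here; trees are written as nested tuples with an arbitrary root, which does not affect the count.

```python
SEQS = {1: "ATATT", 2: "ATCGT", 3: "GCAGT", 4: "GCCGT"}

def fitch(node, site):
    """Return (possible states, minimum number of changes) for one site."""
    if isinstance(node, int):                 # leaf: the observed nucleotide
        return {SEQS[node][site]}, 0
    left, right = node
    left_states, left_cost = fitch(left, site)
    right_states, right_cost = fitch(right, site)
    common = left_states & right_states
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def tree_length(tree):
    """Total changes on the tree: the sum of the per-site minimum changes."""
    return sum(fitch(tree, site)[1] for site in range(len(SEQS[1])))

for tree in [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]:
    print(tree, tree_length(tree))
```

Of the three possible trees for four sequences, the one grouping sequences 1 and 2 against 3 and 4 requires the fewest changes.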

Page 132: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

1 ATATT
2 ATCGT
3 GCAGT
4 GCCGT

Some sites are phylogenetically uninformative; invariant sites and sites where only one sequence has a different nucleotide are examples.
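A short sketch that picks out the informative sites in the alignment above. It uses the usual criterion that a site is informative when at least two different nucleotides each occur in at least two sequences; the function name is illustrative.

```python
from collections import Counter

def informative_sites(seqs):
    """Return 1-based positions of parsimony-informative alignment columns."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # Informative: at least two states, each present in two or more sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(position)
    return sites

alignment = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
print(informative_sites(alignment))
```

Site 4 (only sequence 1 differs) and site 5 (invariant) are left out because they add the same number of changes to every tree.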

Page 133: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 134: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 135: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

This is equivalent to saying that transversions are rarer than transitions, and therefore may be more reliable indicators of phylogeny.

Page 136: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Maximum likelihood requires three elements: a model of sequence evolution, a tree, and the observed data.

• For a given tree topology, what set of branch lengths makes the observed data most likely?

• Which tree of all the possible trees has the greatest likelihood?

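The log likelihood of obtaining the observed sequences is the sum of the log likelihoods of each individual site:

$$\ln L = \sum_{i=1}^{N} \ln L_i$$

where N is the number of sites and L_i is the likelihood of site i.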
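The sum over sites can be sketched for the simplest case of two sequences joined by a single branch. The Jukes-Cantor model and equal base frequencies are assumptions made for the illustration, and the sequences are hypothetical.

```python
import math

def jc_prob(same, t):
    """Jukes-Cantor probability of ending in the same or a different base after t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def log_likelihood(seq_a, seq_b, t):
    """Sum of per-site log likelihoods for two sequences separated by branch length t."""
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        site_likelihood = 0.25 * jc_prob(a == b, t)  # base frequency times change probability
        total += math.log(site_likelihood)
    return total

seq_a, seq_b = "ACGTACGTAC", "ACGTACGTCA"  # hypothetical pair, 2 of 10 sites differ
for t in (0.1, 0.2326, 0.5):
    print(t, log_likelihood(seq_a, seq_b, t))
```

The branch length that maximizes this sum is the maximum likelihood estimate; for two sequences under this model it coincides with the Jukes-Cantor distance (about 0.2326 here).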

Page 137: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

The 16 possible combinations of ancestral sites for a tree for four sequences.
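The 16 combinations arise because an unrooted tree for four sequences has two internal nodes, each of which could hold any of four bases. A sketch of the likelihood of one site, summed over those combinations, follows; the Jukes-Cantor model, the tree ((1,2),(3,4)), and the branch lengths are all assumptions.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of base x changing to base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, t_tip=0.1, t_internal=0.05):
    """Likelihood of one site on the tree ((1,2),(3,4)) with internal nodes x and y."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):      # the 16 ancestral combinations
        total += (0.25 * jc_prob(x, s1, t_tip) * jc_prob(x, s2, t_tip)
                  * jc_prob(x, y, t_internal)
                  * jc_prob(y, s3, t_tip) * jc_prob(y, s4, t_tip))
    return total

print(site_likelihood("AAGG"))  # pattern that agrees with the tree
print(site_likelihood("AGAG"))  # pattern that conflicts with the tree
```

A site pattern that fits the grouping of the tree receives a higher likelihood than one that conflicts with it.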

Page 138: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for
Page 139: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

• Obtaining the maximum likelihood estimate of branch lengths for a given tree is computationally time consuming, and in practice this has limited the application of the method to fairly small data sets.

• This model may include parameters for the transition/transversion ratio (TS/TV), base composition, and variation in rate among sites.

Page 140: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Objections to likelihood

• Which model to use, and what values of the parameters, such as transition/transversion ratio, should be employed.

• This is computationally time consuming, and more than one maximum likelihood value may exist for a given tree.

Page 141: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for

Putting confidence limits on phylogenies: bootstrap analysis

• Because we are sampling with replacement some sites may occur more than once in the pseudoreplicate, while others may not be represented at all.

• From this pseudoreplicate we would then build a tree using.

• We then repeat this two-step process a large number of times (anywhere from 100 to 1000), resulting in a set of bootstrap trees.
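The resampling procedure described above can be sketched as follows. The tree-building step is left as a caller-supplied function (`build_tree` is a hypothetical placeholder that must return something hashable, such as a Newick string), since the slides do not fix a particular method.

```python
import random
from collections import Counter

def bootstrap_pseudoreplicate(seqs, rng):
    """Resample alignment columns with replacement; return the resampled sequences."""
    n_sites = len(seqs[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in seqs]

def bootstrap_trees(seqs, build_tree, n_replicates=100, seed=0):
    """Repeat resampling and tree building; count how often each tree is found."""
    rng = random.Random(seed)
    return Counter(build_tree(bootstrap_pseudoreplicate(seqs, rng))
                   for _ in range(n_replicates))

alignment = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
print(bootstrap_pseudoreplicate(alignment, random.Random(1)))
```

The pseudoreplicate has the same length as the original alignment, but some columns appear several times and others are missing.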

Page 142: General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory)  GenBank: NCBI (National Center for