Bioinformatics before, during and after the era of genomics, one person's view - Terry Speed

Bioinformatics before, during and after the era of genomics. One person's view. BioInfoSummer, 2013

Upload: australian-bioinformatics-network

Post on 11-May-2015


Page 1: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Bioinformatics before, during and after the era of genomics

One person's view

BioInfoSummer, 2013

Page 2: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Caveats

All statements are implicitly prefaced by “In my humble (but perhaps not humble enough) opinion”.

This is a personal potted history, along with commentary, opinions and summary remarks. For the real history, perhaps start with Wikipedia, then go to secondary or primary sources.

Finally, I will use a lot of terms that I do not properly define. My apologies, but there is insufficient time for such care. This is a superficial overview: care with details comes from others here.

Page 3: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Bioinformatics is driven by two things:

New methods for generating molecular data

and

The desires of scientists to use newly generated data to address questions of interest to them

Page 4: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

As a result, there are general problems in the field that need revisiting, and general issues that need attention, with each new kind of data. There are general techniques, which can be adapted to new contexts, and there are a host of specific problems arising with each new kind of data and each new application, many of which demand new techniques.

Page 5: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Challenges with Bioinformatics

•  It is highly interdisciplinary
   Pro: there is always lots of interesting stuff to learn
   Con: no clear boundaries. It merges into biology, biochemistry, biophysics, biomathematics, biostatistics, biocomputing, ...
•  It sits within a rapidly changing landscape
   Pro: there is always something new to dive into
   Con: by the time you eventually solve your problem, no-one may care (so work quickly!)
•  It is currently not so well funded in Australia
   Pro: this should focus your mind on doing really good work
   Con: you might not realize your dreams too readily (dguydj)

Page 6: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Before
•  Molecular evolution
•  Protein structure and classification
•  Sequence alignment
•  Database searching

During
•  Genetic mapping
•  Physical mapping
•  DNA sequencing
•  Genome assembly
•  Gene-finding

After
•  Sequencing, sequencing, sequencing, sequencing
•  Functional genomics, …
•  Proteomics, Metabolomics, …
•  Imaging, …

Page 7: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Before: 1950s

Page 8: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Proteins are single chains of amino acids

This point of view arose out of the work of Frederick Sanger, who determined the amino acid sequence of (bovine) insulin at the beginning of the decade.

Page 9: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Protein structure by X-ray crystallography

The view of proteins as amino acid (polypeptide) chains was strengthened by later work of Max Perutz and John Kendrew, who determined the structure of (horse) haemoglobin and (sperm-whale) myoglobin respectively, publishing in 1960. At that time, the amino acid sequence of these globins was not known.

Page 10: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Another beginning: DNA

Arising from James Watson and Francis Crick’s famous determination of the structure of DNA in 1953.

But it took another two decades to have a method for determining DNA sequences (Sanger again).

Page 11: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

The 20 common amino acids

Page 12: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Before: 1960s, Molecular evolution

“The idea that macromolecules have an evolutionary history fully as traceable as that of bone structure or any other large-scale feature developed on the heels of the first determinations of amino acid sequences of proteins.”

Hemoglobin: Structure, Function, Evolution and Pathology by RE Dickerson and I Geis, 1983, p.66.

Page 13: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

From Perutz et al, Nature 1959

The polypeptide chain…first discovered in sperm whale myoglobin has since been found in seal myoglobin. Its appearance in horse haemoglobin suggests that all haemoglobins and myoglobins of vertebrates follow a similar pattern. How is this possible? ….This suggests the occurrence of similar sequences throughout this group of proteins….

…. their structural similarity suggests that they have developed from a common genetic precursor.

Page 14: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Comparison between the Amino-Acid Sequences of Sperm Whale Myoglobin and Human Haemoglobin

Watson & Kendrew, Nature 1961

Page 15: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Was this the first a.a. sequence alignment?

In the paper we find the sentence:

“..the human haemoglobin chains have been placed alongside that of myoglobin ...in what seems to us the most plausible manner….we have tentatively assumed that interpolations or deletions occur only in or near inter-helical regions since even a single change in the middle of a helix would.…”

A modern alignment will be given later.

Page 16: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Evolution using molecules

•  DNA is passed from parents to offspring more or less unchanged.

•  Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection (neutral hypothesis: Kimura, 1968)

•  Mutations accumulate at fairly steady rates in surviving lineages (molecular clock hypothesis: Zuckerkandl & Pauling, 1962)

•  We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.

Page 17: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Margaret O Dayhoff (1925-1983)

Has been called the mother and father of bioinformatics.

“She anticipated the potential of computers to the current theories of Zuckerkandl & Pauling and the method which Sanger had engineered. With Richard Eck, she published the first reconstruction of a phylogeny (evolutionary tree) by computers from molecular sequences, using a maximum parsimony method. She also formulated the first probability model of protein evolution, the PAM model, in 1966.

She initiated the collection of protein sequences in the Atlas of Protein Sequence and Structure, a book collecting all known protein sequences that she published in 1965.”

All from Wikipedia!

Page 18: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Protein evolution: some terms

Two proteins that share a common ancestry are said to be homologues. (Important to distinguish from similarity.)

We have two different types of homologous proteins.

Orthologues: the same gene in different organisms; common ancestry goes back to speciation.

A common mode of protein evolution is by duplication.

Paralogues: different genes in the same organism; common ancestry goes back to a gene duplication.

Lateral gene transfer gives another form of homology, one not wholly describable by a tree (need networks).

Page 19: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Some human globins (paralogues)

                 10                  20                  30                  40
- V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T   alpha
V H . T . E E . S A . T . L . . . . - - N V D . V . G . . . G . L L V V Y . W .   beta
V H . T . E E . . A . N . L . . . . - - N V D A V . G . . . G . L L V V Y . W .   delta
V H F T A E E . A A . T S L . S . M - - N V E . A . G . . . G . L L V V Y . W .   epsilon
G H F T E E . . A T I T S L . . . . - - N V E D A . G . T . G . L L V V Y . W .   gamma
- G . . D G E W Q L . L N V . . . . E . D I P G H . Q . V . I . L . K G H . E .   myoglobin

The sequences continue up to ~146 a.a.
‘.’ means ditto, ‘-’ means deletion.

Producing multiple sequence alignments like this (and much longer, and with many more sequences) remains a major continuing task for bioinformatics.

Page 20: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Some vertebrate beta-globins (orthologues)

                 10                  20                  30                  40
V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q R   human
. . . . . . . . N . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . .   macaque
- - . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . . .   cow
. . . S G G . . . . . . N . . . . . . I N . L . . . . . . . . . . . . . . . . .   platypus
- - . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . . .   chicken
. . W S E V . L H E I . T T . K S I D K H S L . A K . . A . M F I . . . . . T .   shark

                 50                  60                  70                  80
F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D N   human
. . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . N . . . .   macaque
. . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . . D   cow
. . . A . . . . . S A G . . . . . . . . . . . . A . . . T S . G . A . K N . . D   platypus
. . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . . D   chicken
Y . G N L K E F T A C S Y G - - - - - . . E . A . . . T . . L G V A V T . . G D   shark

                 90                 100                 110                 120
L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G K   human
. . . . . . Q . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . . .   macaque
. . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . . .   cow
. . . . . . K . . . . . . . . . . . . . . . . N R . . . . . I V . . . R . . S .   platypus
. . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . . .   chicken
V . S Q . T D . . K K . A E E . . . . V . S . K . . A K C F . V E . G I L L K D   shark

                130                 140
E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H   human
. . . . Q . . . . . . . . . . . . . . . . . . . . .   macaque
. . . . V L . . D F . . . . . . . . . . . . . R . .   cow
D . S . E . . . . W . . L . S . . . H . . G . . . .   platypus
. . . . V L . . D F . . . . . . . . . . . . . R . .   chicken
K . A . Q T . . I W E . Y F G V . V D . I S K E . .   BG-shark

Page 21: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Beta-globins: Uncorrected pairwise distances

Distances: between protein sequences. Calculated over: residues 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

        hum   mac   cow   pla   chi   sha
hum    ----     5    16    23    31    65
mac       7  ----    17    23    30    62
cow      23    24  ----    27    37    65
pla      34    34    39  ----    29    64
chi      45    44    52    42  ----    61
sha      91    88    91    90    87  ----

Page 22: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Beta-globins: Corrected pairwise distances

Distances: between protein sequences. Calculated over: residues 1 to 147
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor (1969)

        hum   mac   cow   pla   chi   sha
hum    ----     5    17    27    37   108
mac       7  ----    18    27    36   102
cow      23    24  ----    32    46   110
pla      34    34    39  ----    34   106
chi      45    44    52    42  ----    98
sha      91    88    91    90    87  ----
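The step from uncorrected to corrected distances can be sketched as follows. This is a minimal illustration, assuming the correction is the 20-state (amino acid) analogue of the Jukes-Cantor formula, d = -(19/20) ln(1 - (20/19) p); the slide names the method but does not spell out the formula, and the function names here are hypothetical. With p = 0.65 (human vs shark) it gives a value close to, though not exactly, the tabulated one, as expected when starting from a rounded p.

```python
import math


def p_distance(differences: int, sites: int) -> float:
    """Observed proportion of differing sites between two aligned sequences."""
    if sites <= 0:
        raise ValueError("sites must be positive")
    return differences / sites


def jukes_cantor_protein(p: float, states: int = 20) -> float:
    """Jukes-Cantor style correction for `states` equally likely residues.

    Assumption: the 20-state analogue of the original nucleotide formula.
    Returns the estimated number of substitutions per site.
    """
    b = (states - 1) / states  # 19/20 for amino acids
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (states - 1) / states)")
    return -b * math.log(1 - p / b)


if __name__ == "__main__":
    # Uncorrected human-shark distance from the table: 65 per 100 residues.
    corrected = jukes_cantor_protein(0.65)
    print(f"{100 * corrected:.1f} substitutions per 100 residues")
```

The correction grows quickly as p approaches 19/20, which is why the shark distances change far more than the human-macaque one.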

Page 23: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Evolution of the globins

[Figure: The Rates of Macromolecular Evolution, after Dickerson (1971). Corrected amino acid changes per 100 residues plotted against millions of years since divergence. Labels include fibrinopeptides (1.1 MY) and cytochrome c (20.0 MY); marked divergence points run from the mammals back to the separation of the ancestors of plants and animals, with geological periods along the time axis.]

Page 24: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Unrooted UPGMA tree for beta-globins

[Figure: tree with leaves cow, human, macaque, platypus, chicken and shark.]

A gene tree across species
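How a UPGMA tree is built from a distance matrix can be sketched in a few lines. This is an illustrative average-linkage implementation, not the software used for the slide; the `upgma` function and the `upper` dictionary are names introduced here, and the numbers are the corrected distances (above the diagonal) from the beta-globin table. Branch lengths are omitted; only the joining order is shown.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a nested-tuple tree built by UPGMA (average linkage).

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b
    """
    # Each cluster is (subtree, tuple_of_leaf_labels).
    clusters = [(n, (n,)) for n in names]

    def avg(c1, c2):
        # Mean of all leaf-to-leaf distances between the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the closest pair of clusters and replace them by their union.
        c1, c2 = min(combinations(clusters, 2), key=lambda pr: avg(*pr))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


names = ["hum", "mac", "cow", "pla", "chi", "sha"]
# Corrected distances per 100 residues, read row by row above the diagonal.
upper = {
    "hum": [5, 17, 27, 37, 108],
    "mac": [18, 27, 36, 102],
    "cow": [32, 46, 110],
    "pla": [34, 106],
    "chi": [98],
}
dist = {}
for i, a in enumerate(names[:-1]):
    for b, d in zip(names[i + 1:], upper[a]):
        dist[frozenset((a, b))] = d

tree = upgma(names, dist)
print(tree)
```

With these distances human and macaque are joined first, then cow, platypus, chicken and finally shark.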

Page 25: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Before: 1970s, Sequence alignment

Gibbs & McIntyre (1970): The Diagram (now the DotPlot)

Needleman & Wunsch (1970): Global alignment using a dynamic programming algorithm

Page 26: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Cytochrome c dot plots from Gibbs & McIntyre
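The idea behind a dot plot can be sketched as follows: put one sequence along the rows and the other along the columns, and mark every cell where the residues agree; related regions then show up as diagonal runs. This is a generic text-mode sketch, not the procedure of Gibbs & McIntyre; the `window` and `min_matches` parameters are additions for illustration.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1, min_matches: int = 1):
    """Return the dot plot as a list of strings.

    Row i, column j is '*' when the length-`window` words starting at
    seq_a[i] and seq_b[j] agree in at least `min_matches` positions.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    for line in dot_plot("GAATTCAGTTA", "GGATCGA"):
        print(line)
```

Raising `window` and `min_matches` filters out isolated chance matches, which matters for real protein sequences.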

Page 27: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Needleman and Wunsch (1970)

[Figure 1 and Figure 2 from the paper.]

Page 28: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Human beta haemoglobin vs. pea leghaemoglobin

 1 VHLTPEEKSAVTALWG.KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL 78
   :|..:.. |.. .: | |:.: : .. :| | .. :|. : | ..:| :.||:.||:..|:| ..|: |:|
 1 .GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKD...TAGVEDSPKLQAHAEQVFGLVRDSAAQL 74

79 ...DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL..AHKYH 146
   ::: . |||:.:|.:| |. .:| :: :.|: .: . |.::...:..|:: . .|:|.|: | |
75 RTKGEVVLGNATLGAIHVQK.GVTNPHFVVVKEALLQTIKKASGNNWSEELNTAWEVAYDGLATAIKKAMKTA 146

Page 29: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Globin fold: α protein, myoglobin (PDB: 1MBN)

Page 30: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

β sandwich: β protein, immunoglobulin (PDB: 7FAB)

Page 31: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Before: 1980s, Molecular databases

Page 32: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Sequence databases

In the mid-1970s, Sanger and Maxam-Gilbert DNA sequencing were invented.

From 1965 M.O. Dayhoff published her first Atlas of Protein Sequence and Structure. It became the PIR.

By the end of the decade, a rapidly increasing amount of DNA sequence data joined the steadily growing amount of sequence in protein databases.

Page 33: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Bioinformatics comes of age (but the name doesn't yet exist)

•  1982 GenBank opens

   Many developments in database searching, pairwise and multiple alignment and protein structure prediction

•  1987 First of many high-resolution human genetic maps
•  1988 NCBI starts

Page 34: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Substitution/Scoring Matrices

PAM matrices (Dayhoff et al. 1978): phylogeny-based.

PAM250 matrix, log-odds representation

PAM1: expected number of mutations = 1%

Page 35: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Databases, cont.

Not only are databases valuable repositories for knowledge transfer, they can help generate new knowledge.

With them we can see the Darwinian notion of descent with modification very clearly, see that the dominant mode of molecular evolution was duplicate and modify, and find unexpected match-ups, such as the early one on the next page.

Page 36: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor

However, previous efforts to demonstrate any functional or evolutionary relatedness between transforming gene products and any growth factor have been unsuccessful.

Doolittle et al, Science 1983

Page 37: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor

Doolittle et al, Science 1983

Page 38: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

1990s: The explosion

The decade of FASTA, BLAST, CLUSTAL W and HMMs

Database searching on a large scale

Whole genome mapping and sequencing

Proteomics, gene expression arrays, and much more.

Page 39: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)

Derived from a set (2000) of aligned and ungapped regions from protein families; it puts more emphasis on chemical similarity (versus how easy it is to mutate from one residue to another). BLOSUMx is derived from the set of segments of x% identity.

BLOSUM62 Matrix, log-odds representation
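The "log-odds representation" used for both PAM and BLOSUM matrices can be illustrated with a toy calculation: a score compares how often two residues are seen aligned with how often they would be paired by chance. The sketch below is an assumption-laden illustration; the function name, the scale factor of 2 on a base-2 logarithm and the example frequencies are all hypothetical, not values from either matrix.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded, scaled log2 odds of residues a and b being aligned.

    pair_freq: observed frequency of the aligned pair (a, b)
    freq_a, freq_b: background frequencies of the two residues
    scale: multiplier applied before rounding (an assumption here)
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


if __name__ == "__main__":
    # Hypothetical pair seen four times more often than chance: positive score.
    print(log_odds_score(0.04, 0.1, 0.1))
    # Hypothetical pair seen half as often as chance: negative score.
    print(log_odds_score(0.005, 0.1, 0.1))
```

Positive entries mark substitutions seen more often than chance in the aligned data, negative entries those seen less often.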

Page 40: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

DNA markers: RFLPs

Page 41: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

From Donis-Keller et al, Cell 1987

The point: there were lots of these markers

Page 42: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

From Donis-Keller et al, Cell 1987

Chromosomes are mosaics of grandparental chromosomes, and the breakpoints can be inferred.

Page 43: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Bioinformatics during the genome era

Life went on as before, but a lot of bioinformatic effort was devoted to the development of tools to deal with the growing amounts of complete genome sequence data. Why?

Why do people rob banks?

•  Genetic mapping
•  Physical mapping
•  DNA sequencing
•  Genome assembly
•  Gene-finding

Page 44: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

DNA sequencing: the march of technology

Page 45: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Sanger DNA sequencing ~1975: radio-labelled chain terminators; slab gel electrophoresis

http://en.wikipedia.org/wiki/Sequencing

Page 46: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Automated DNA sequencing machines: initially slab gel, later capillary electrophoresis

Page 47: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Automated Sanger DNA sequencing ~2000: fluorescently labelled chain terminators, capillary electrophoresis

Page 48: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Sequencing technology growth

[Figure: axis labels "/machine" and "/year".]

Page 49: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Helicos

Illumina

AB SOLiD

Roche 454

Pacific Biosciences

Oxford Nanopore

Page 50: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

HiSeq2000

MiSeq

Page 51: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Post-genome bioinformatics

A very large fraction is about sequencing!

•  Take some “cells”
•  Do something to them
•  Sequence the resulting DNA
•  Do something with the resulting reads

Page 52: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

The “cells” you take can be

•  Very heterogeneous pools, e.g. from soil, gut, seawater, …
•  Mixed cell populations, e.g. PBMC from blood
•  Cell lines
•  Sorted cells, e.g. CD4+ T-cells from blood
•  Single cells, e.g. neurons, B-cells, …
•  Not cells at all, e.g. cell-free DNA from plasma

Page 53: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

The something you do to the cells may be

•  Not much, e.g. metagenomics
•  Chromatin immunoprecipitation, e.g. for transcription factor binding, histone modification
•  RNA extraction, to do RNA-seq, to find non-coding RNAs
•  RNA interference, e.g. siRNA to silence genes
•  Bisulphite conversion to measure methylation
•  … (much more)

Page 54: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

What you do with the resulting reads

•  Map them back to a reference genome
•  Assemble them de novo into a genome
•  Try to sort out what is there (a mix of spp)
•  Map them to a custom transcriptome
•  Map them to some other custom database
•  …

Page 55: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Metagenomics @ Melbourne
Wed 27 November 2013, 12.30pm to 6.00pm
Bio21 Molecular Science and Biotechnology Institute, 30 Flemington Road, Parkville
Enquiries to kholt@unimelb.edu.au, www.bio21.org

Guest Speakers
•  Ian Paulsen, Macquarie University
•  Gene Tyson, UQ Centre for Ecogenomics
•  Aaron Darling, i3 Institute, UTS
•  Brendan Burns, UNSW
•  Rob Moore, CSIRO

Topics
•  Human microbiome
•  Animal & plant microbiome
•  Ecogenomics
•  Marine metagenomics
•  Bioinformatics

Plus short talks from 10 local researchers

Register by November 20: research.mdhs.unimelb.edu.au/event/advancing systems biology

Registration is FREE and includes lunch, afternoon tea and evening refreshments.

All welcome.

A Bio21 Institute Structural and Cellular Biology Theme Event. Part of Advancing Systems Biology @ Melb.

Page 56: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Forshew et al. Sci Transl Med (2012)

Page 57: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

And then there is lots of other stuff, too numerous to mention

•  The other omics (prote-, metabol-, lipid-, phen-…)
•  many imaging modalities (is this bioinformatics?)
•  …
•  and, perhaps hardest of all, combinations of the above

Page 58: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

High content imaging screens

•  Robotic instruments treat cells (e.g. RNAi, drugs, peptides)

•  Phenotype measurement is image-based

image from Drug Discovery World

Page 59: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Combining image and molecular data

Yuan et al. Sci Transl Med 4, (2012)

Page 60: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

A general problem in the field that needs revisiting with each new kind of data:

Sequence alignment

Page 61: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

first sequence  = GAATTCAGTTA
second sequence = GGATCGA

An optimal alignment:
    G A A T T C A G T T A
    G G A T _ C _ G _ _ A
(score: match = 2, mismatch = -1, indel = -2)

A different optimal alignment:
    _ G A A T T C A G T T A
    G G A _ T _ C _ G _ _ A
(score: match = 1, mismatch = 0, indel = 0)
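The optimal scores for the two scoring schemes above can be checked with a short dynamic programming sketch in the style of Needleman and Wunsch. This is an illustrative implementation with a linear indel cost, written for this example (the function name is hypothetical); it returns only the best score, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int, mismatch: int,
                           indel: int) -> int:
    """Score of the best global alignment of a and b with a linear indel cost."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    first, second = "GAATTCAGTTA", "GGATCGA"
    print(needleman_wunsch_score(first, second, 2, -1, -2))  # prints 3
    print(needleman_wunsch_score(first, second, 1, 0, 0))    # prints 6
```

Under the first scheme the alignment shown has six matches, one mismatch and four indels (12 - 1 - 8 = 3); under the second the score is simply the number of matches, six in the alignment shown.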

Page 62: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Sequence alignment methods

Page 63: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

General issues that need attention with each new kind of data

•  Data quality assessment
•  Data cleaning/normalization/adjusting
•  Data visualization

Page 64: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Some general techniques which can be adapted to new contexts

•  Dynamic programming
•  Hidden Markov models
•  Dealing with multiplicity (p-value adjustment)
•  Resampling/bootstrapping
•  Network analyses
•  Bayesian inference

Page 65: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Some applications of hidden Markov models in bioinformatics

•  mapping chromosomes
•  aligning biological sequences
•  predicting sequence structure
•  inferring evolutionary relationships
•  finding genes in DNA sequence

Page 66: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

A very short profile HMM

M = Match state, I = Insert state, D = Delete state.
To operate, go from left to right. I and M states output amino acids; B, D and E states are “silent”.

Page 67: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

How profile HMMs work, in brief

•  Instances of the motif are identified by calculating
   log{pr(sequence | M) / pr(sequence | B)},
   where M and B are the motif and background HMMs.
•  Alignments of instances of the motif to the HMM are found by calculating
   arg max_states pr(states | instance, M).
•  Estimation of HMM parameters is by calculating
   arg max_parameters pr(sequences | M, parameters).

In all cases, we use the efficient HMM algorithms.
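The first of these calculations, the log-odds of a sequence under a motif model versus a background model, can be sketched with the forward algorithm. The example below is deliberately simplified: it uses a plain HMM with no silent (B, D, E) states rather than a real profile HMM, a two-letter alphabet, and toy parameters invented for illustration. It also works with raw probabilities, so long sequences would need scaling or log-space arithmetic.

```python
import math


def forward_prob(seq, start, trans, emit):
    """pr(seq | HMM) by the forward algorithm.

    start[s]    : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][x]  : probability that state s emits symbol x
    """
    states = list(start)
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {t: emit[t][x] * sum(f[s] * trans[s][t] for s in states)
             for t in states}
    return sum(f.values())


def log_odds(seq, motif, background):
    """log{pr(seq | motif) / pr(seq | background)}; a model is (start, trans, emit)."""
    return math.log(forward_prob(seq, *motif) / forward_prob(seq, *background))


# Toy motif model (hypothetical): position 1 prefers A, position 2 prefers C.
motif = (
    {"m1": 1.0, "m2": 0.0},
    {"m1": {"m1": 0.0, "m2": 1.0}, "m2": {"m1": 0.0, "m2": 1.0}},
    {"m1": {"A": 0.9, "C": 0.1}, "m2": {"A": 0.1, "C": 0.9}},
)
# Toy background model (hypothetical): one state, uniform emissions.
background = (
    {"b": 1.0},
    {"b": {"b": 1.0}},
    {"b": {"A": 0.5, "C": 0.5}},
)

if __name__ == "__main__":
    for s in ("AC", "CA"):
        print(s, round(log_odds(s, motif, background), 3))
```

Replacing the sum over previous states by a maximum (and keeping back-pointers) turns the same recursion into the Viterbi algorithm, which is one way to obtain the most probable state path in the second bullet.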

Page 68: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Dealing with multiplicity (p-value adjustment)

BLAST (Basic Local Alignment Search Tool) aligns short sequences against databases. It uses extreme value theory in calculating its E-value.

Mascot uses mass spectrometry data to identify proteins from peptide sequence databases. It has calibrated its probability score to deal with the huge multiplicity. The details are not fully known, but it works.

With microarray data the Benjamini-Hochberg false discovery rate (FDR) has become widely used. In other areas Bonferroni still reigns.
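The two named adjustments, Bonferroni and Benjamini-Hochberg, are simple enough to sketch directly. The functions below are a minimal pure-Python illustration with hypothetical names; in practice one would normally use an established statistics package.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(pvals))
    print(benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive, while Benjamini-Hochberg controls the expected proportion of false positives among the discoveries, which is why it is less conservative when thousands of genes are tested.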

Page 69: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

A specific problem arising with a new kind of data and a new application, which demanded a new technique:

Gene set tests

Page 70: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Suppose that we do thousands of tests, comparing gene expression levels between treated and control cells, and find no (“significantly”) differentially expressed genes. What might this mean?

Or, suppose we find 100s of (“significantly”) differentially expressed genes. What might this mean?

In each case there are gene-set tests that can be used which might illuminate the situation.
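One simple kind of gene-set test, suited to the second situation (a long list of differentially expressed genes), asks whether a predefined gene set is over-represented in that list. The slide does not say which tests are meant; the sketch below is only one common choice, a hypergeometric upper-tail probability, with a hypothetical function name and made-up counts. The first situation (no individually significant genes) calls for tests that use the statistics of all genes in the set, which this sketch does not cover.

```python
from math import comb


def hypergeom_upper_tail(total: int, in_set: int, selected: int,
                         overlap: int) -> float:
    """P(X >= overlap) for a hypergeometric draw.

    total:    number of genes tested
    in_set:   number of tested genes belonging to the gene set
    selected: number of genes called differentially expressed
    overlap:  number of differentially expressed genes in the gene set
    """
    upper = min(in_set, selected)
    denom = comb(total, selected)
    return sum(comb(in_set, k) * comb(total - in_set, selected - k)
               for k in range(overlap, upper + 1)) / denom


if __name__ == "__main__":
    # Made-up counts: 10,000 genes tested, a set of 50, 300 called DE, 8 shared.
    print(hypergeom_upper_tail(10_000, 50, 300, 8))
```

A small probability suggests the set is enriched among the differentially expressed genes; with many gene sets tested, the multiplicity adjustments discussed earlier apply again.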

Page 71: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Finally

Page 72: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

Two important general themes

•  Evolution and the comparative method

   “Nothing in Biology Makes Sense Except in the Light of Evolution” Theodosius Dobzhansky 1973

•  The use of positive and negative controls (“truth”) in order to see whether methods work in practice.

   “In theory, there is no difference between theory and practice. But in practice, there is.” Yogi Berra (undated)

Page 73: Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

“The Deluge” (1840) by Francis Danby (1793-1861)

THANKS FOR LISTENING!