Bioinformatics before, during and after the era of genomics, one person's view - Terry Speed

Bioinformatics before, during and after the era of genomics

One person's view

BioInfoSummer, 2013

Caveats

All statements are implicitly prefaced by "In my humble (but perhaps not humble enough) opinion".

This is a personal potted history, along with commentary, opinions and summary remarks. For the real history, perhaps start with Wikipedia, then go to secondary or primary sources.

Finally, I will use a lot of terms that I do not properly define. My apologies, but there is insufficient time for such care. This is a superficial overview: care with details comes from others here.

Bioinformatics is driven by two things:

New methods for generating molecular data
and
the desires of scientists to use newly generated data to address questions of interest to them.

As a result, there are general problems in the field that need revisiting, and general issues that need attention, with each new kind of data. There are general techniques, which can be adapted to new contexts, and there are a host of specific problems arising with each new kind of data and each new application, many of which demand new techniques.

Challenges with Bioinformatics

•  It is highly interdisciplinary
   Pro: there is always lots of interesting stuff to learn
   Con: no clear boundaries. It merges into biology, biochemistry, biophysics, biomathematics, biostatistics, biocomputing, ...
•  It sits within a rapidly changing landscape
   Pro: there is always something new to dive into
   Con: by the time you eventually solve your problem, no-one may care (so work quickly!)
•  It is currently not so well funded in Australia
   Pro: this should focus your mind on doing really good work
   Con: you might not realize your dreams too readily (dguydj)

Before

During

After

•  Molecular evolution
•  Protein structure and classification
•  Sequence alignment
•  Database searching
•  Genetic mapping
•  Physical mapping
•  DNA sequencing
•  Genome assembly
•  Gene-finding

•  Sequencing, sequencing, sequencing, sequencing

•  Functional genomics, …
•  Proteomics, Metabolomics, …
•  Imaging, …

Before: 1950s

Proteins are single chains of amino acids

This point of view arose out of the work of Frederick Sanger, who determined the amino acid sequence of (bovine) insulin at the beginning of the decade.

Protein structure by X-ray crystallography

The view of proteins as amino acid (polypeptide) chains was strengthened by later work of Max Perutz and John Kendrew, who determined the structure of (horse) haemoglobin and (sperm-whale) myoglobin respectively, publishing in 1960. At that time, the amino acid sequence of these globins was not known.

Another beginning: DNA

Arising from James Watson and Francis Crick's famous determination of the structure of DNA in 1953.
But it took another two decades to have a method for determining DNA sequences (Sanger again).

[Figure: The 20 common amino acids]

Before: 1960s, Molecular evolution

"The idea that macromolecules have an evolutionary history fully as traceable as that of bone structure or any other large-scale feature developed on the heels of the first determinations of amino acid sequences of proteins."
Hemoglobin: Structure, Function, Evolution and Pathology, by RE Dickerson and I Geis, 1983, p. 66.

From Perutz et al, Nature 1959

"The polypeptide chain ... first discovered in sperm whale myoglobin has since been found in seal myoglobin. Its appearance in horse haemoglobin suggests that all haemoglobins and myoglobins of vertebrates follow a similar pattern. How is this possible? ... This suggests the occurrence of similar sequences throughout this group of proteins ...

... their structural similarity suggests that they have developed from a common genetic precursor."

Comparison between the Amino-Acid Sequences of Sperm Whale Myoglobin and Human Haemoglobin

Watson & Kendrew, Nature 1961

Was this the first a.a. sequence alignment?

In the paper we find the sentence:

"... the human haemoglobin chains have been placed alongside that of myoglobin ... in what seems to us the most plausible manner ... we have tentatively assumed that interpolations or deletions occur only in or near inter-helical regions since even a single change in the middle of a helix would ..."

A modern alignment will be given later.

Evolution using molecules

•  DNA is passed from parents to offspring more or less unchanged.

•  Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection (neutral hypothesis: Kimura, 1968).

•  Mutations accumulate at fairly steady rates in surviving lineages (molecular clock hypothesis: Zuckerkandl & Pauling, 1962).

•  We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.

Margaret O Dayhoff (1925-1983)

Has been called the mother and father of bioinformatics.

"She anticipated the potential of computers to the current theories of Zuckerkandl & Pauling and the method which Sanger had engineered. With Richard Eck, she published the first reconstruction of a phylogeny (evolutionary tree) by computers from molecular sequences, using a maximum parsimony method. She also formulated the first probability model of protein evolution, the PAM model, in 1966.
She initiated the collection of protein sequences in the Atlas of Protein Sequence and Structure, a book collecting all known protein sequences that she published in 1965."

All from Wikipedia.

Protein evolution: some terms

Two proteins that share a common ancestry are said to be homologues. (Important to distinguish from similarity.)
We have two different types of homologous proteins.
Orthologues: the same gene in different organisms; common ancestry goes back to speciation.
A common mode of protein evolution is by duplication.
Paralogues: different genes in the same organism; common ancestry goes back to a gene duplication.
Lateral gene transfer gives another form of homology, one not wholly describable by a tree (need networks).

Some human globins (paralogues)

Positions 1 to 40:

alpha      - V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T
beta       V H . T . E E . S A . T . L . . . . - - N V D . V . G . . . G . L L V V Y . W .
delta      V H . T . E E . . A . N . L . . . . - - N V D A V . G . . . G . L L V V Y . W .
epsilon    V H F T A E E . A A . T S L . S . M - - N V E . A . G . . . G . L L V V Y . W .
gamma      G H F T E E . . A T I T S L . . . . - - N V E D A . G . T . G . L L V V Y . W .
myoglobin  - G . . D G E W Q L . L N V . . . . E . D I P G H . Q . V . I . L . K G H . E .

The sequences continue up to ~146 a.a.
'.' means ditto, '-' means deletion.

Producing multiple sequence alignments like this (and much longer, and with many more sequences) remains a major continuing task for bioinformatics.

Some vertebrate beta-globins (orthologues)

Positions 1 to 40:

human      V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q R
macaque    . . . . . . . . N . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . .
cow        - - . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . . .
platypus   . . . S G G . . . . . . N . . . . . . I N . L . . . . . . . . . . . . . . . . .
chicken    - - . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . . .
shark      . . W S E V . L H E I . T T . K S I D K H S L . A K . . A . M F I . . . . . T .

Positions 41 to 80:

human      F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D N
macaque    . . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . N . . . .
cow        . . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . . D
platypus   . . . A . . . . . S A G . . . . . . . . . . . . A . . . T S . G . A . K N . . D
chicken    . . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . . D
shark      Y . G N L K E F T A C S Y G - - - - - . . E . A . . . T . . L G V A V T . . G D

Positions 81 to 120:

human      L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G K
macaque    . . . . . . Q . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . . .
cow        . . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . . .
platypus   . . . . . . K . . . . . . . . . . . . . . . . N R . . . . . I V . . . R . . S .
chicken    . . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . . .
shark      V . S Q . T D . . K K . A E E . . . . V . S . K . . A K C F . V E . G I L L K D

Positions 121 to 146:

human      E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H
macaque    . . . . Q . . . . . . . . . . . . . . . . . . . . .
cow        . . . . V L . . D F . . . . . . . . . . . . . R . .
platypus   D . S . E . . . . W . . L . S . . . H . . G . . . .
chicken    . . . . V L . . D F . . . . . . . . . . . . . R . .
BG-shark   K . A . Q T . . I W E . Y F G V . V D . I S K E . .

Beta-globins: Uncorrected pairwise distances

Distances: between protein sequences. Calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

       hum   mac   cow   pla   chi   sha
hum   ----     5    16    23    31    65
mac      7  ----    17    23    30    62
cow     23    24  ----    27    37    65
pla     34    34    39  ----    29    64
chi     45    44    52    42  ----    61
sha     91    88    91    90    87  ----

Beta-globins: Corrected pairwise distances

Distances: between protein sequences. Calculated over: residues 1 to 147
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor (1969)

       hum   mac   cow   pla   chi   sha
hum   ----     5    17    27    37   108
mac      7  ----    18    27    36   102
cow     23    24  ----    32    46   110
pla     34    34    39  ----    34   106
chi     45    44    52    42  ----    98
sha     91    88    91    90    87  ----
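The step from observed to corrected distances can be sketched in a few lines. This is a minimal sketch, assuming the 20-state (amino acid) form of the Jukes-Cantor correction, d = -(19/20) ln(1 - (20/19) p), where p is the observed proportion of differing sites. The function names are illustrative, and the slides do not say which software produced the tables.

```python
import math

def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p, n_states=20):
    """Estimated substitutions per site under an equal-rates model with n_states letters."""
    b = (n_states - 1) / n_states
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (n-1)/n) for the correction to be defined")
    return -b * math.log(1 - p / b)

# Observed differences per 100 residues -> estimated substitutions per 100 residues
for p in (0.05, 0.16, 0.61):
    print(f"{100 * p:.0f} observed -> {100 * jukes_cantor(p):.1f} estimated")
```

With p = 0.61 this gives roughly 98 substitutions per 100 residues, in line with the chicken-shark entry. Small discrepancies for other entries are to be expected, because the tabulated percentages are themselves rounded. The correction grows quickly as p approaches 19/20, which is why the shark distances change the most.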

[Figure: The Rates of Macromolecular Evolution. Evolution of the globins, with fibrinopeptides (1.1 MY) and cytochrome c (20.0 MY) also labelled: corrected amino acid changes per 100 residues plotted against millions of years since divergence, with geological periods and the separation of ancestors of plants and animals marked. After Dickerson (1971).]

Unrooted UPGMA tree for beta-globins

[Figure: tree with leaves human, macaque, cow, platypus, chicken and shark]

A gene tree across species

Before: 1970s, Sequence alignment

Gibbs & McIntyre (1970): The Diagram (now the DotPlot)
Needleman & Wunsch (1970): Global alignment using a dynamic programming algorithm

[Figure: Cytochrome c dot plots from Gibbs & McIntyre]

Needleman and Wunsch (1970)

[Figures 1 and 2 from Needleman and Wunsch (1970)]

Human beta haemoglobin vs. pea leghaemoglobin

  1 VHLTPEEKSAVTALWG.KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL 78
    :|..:.. |.. .: | |:.: : .. :| | .. :|. : | ..:| :.||:.||:..|:| ..|: |:|
  1 .GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKD...TAGVEDSPKLQAHAEQVFGLVRDSAAQL 74

 79 ...DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL..AHKYH 146
    ::: . |||:.:|.:| |. .:| :: :.|: .: . |.::...:..|:: . .|:|.|: | |
 75 RTKGEVVLGNATLGAIHVQK.GVTNPHFVVVKEALLQTIKKASGNNWSEELNTAWEVAYDGLATAIKKAMKTA 146

Globin fold α protein myoglobin PDB: 1MBN

β sandwich β protein immunoglobulin PDB: 7FAB

Before: 1980s, Molecular databases

Sequence databases

In the mid-1970s, Sanger and Maxam-Gilbert DNA sequencing were invented.

From 1965 M.O. Dayhoff published her first Atlas of Protein Sequence and Structure. It became the PIR.

By the end of the decade, a rapidly increasing amount of DNA sequence data joined the steadily growing amount of sequence in protein databases.

Bioinformatics comes of age (but the name doesn't yet exist)

•  1982 GenBank opens

Many developments in database searching, pairwise and multiple alignment and protein structure prediction

•  1987 First of many high-resolution human genetic maps
•  1988 NCBI starts

Substitution/Scoring Matrices

PAM matrices (Dayhoff et al. 1978): phylogeny-based.

[Figure: PAM250 matrix, log-odds representation]

PAM1: expected number of mutations = 1%

Databases, cont.

Not only are databases valuable repositories for knowledge transfer, they can help generate new knowledge.

We can see the Darwinian notion of descent with modification very clearly, see that the dominant mode of molecular evolution was duplicate and modify, and find unexpected match-ups, such as the early one on the next page.

Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor

However, previous efforts to demonstrate any functional or evolutionary relatedness between transforming gene products and any growth factor have been unsuccessful.

Doolittle et al, Science 1983

1990s: The explosion

The decade of FASTA, BLAST, CLUSTAL W and HMMs
Database searching on a large scale

Whole genome mapping and sequencing
Proteomics, gene expression arrays, and much more.

BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)

Derived from a set (2000) of aligned, ungapped regions from protein families; it places more emphasis on chemical similarity (versus how easy it is to mutate from one residue to another). BLOSUMx is derived from the set of segments of x% identity.

[Figure: BLOSUM62 matrix, log-odds representation]

DNA markers: RFLPs

From Donis-Keller et al, Cell 1987
The point: there were lots of these markers

From Donis-Keller et al, Cell 1987

Chromosomes are mosaics of grandparental chromosomes, and the breakpoints can be inferred.

Bioinformatics during the genome era

Life went on as before, but a lot of bioinformatic effort was devoted to the development of tools to deal with the growing amounts of complete genome sequence data. Why?

Why do people rob banks?

•  Genetic mapping
•  Physical mapping
•  DNA sequencing
•  Genome assembly
•  Gene-finding

DNA sequencing: the march of technology

http://en.wikipedia.org/wiki/Sequencing

Sanger DNA sequencing ~1975: radio-labelled chain terminators; slab gel electrophoresis

Automated DNA sequencing machines: initially slab gel, later capillary electrophoresis

Automated Sanger DNA sequencing ~2000: fluorescently labelled chain terminators, capillary electrophoresis

[Figure: Sequencing technology growth (axis label fragments: /machine, /year)]

Helicos

Illumina

AB SOLiD

Roche 454

Pacific Biosciences

Oxford Nanopore

HiSeq2000

MiSeq

Post-genome bioinformatics

A very large fraction is about sequencing!

•  Take some "cells"
•  Do something to them
•  Sequence the resulting DNA
•  Do something with the resulting reads

The "cells" you take can be

•  Very heterogeneous pools, e.g. from soil, gut, seawater, …
•  Mixed cell populations, e.g. PBMC from blood
•  Cell lines
•  Sorted cells, e.g. CD4+ T-cells from blood
•  Single cells, e.g. neurons, B-cells, …
•  Not cells at all, e.g. cell-free DNA from plasma

The something you do to the cells may be

•  Not much, e.g. metagenomics
•  Chromatin immunoprecipitation, e.g. for transcription factor binding, histone modification
•  RNA extraction, to do RNA-seq, to find non-coding RNAs
•  RNA interference, e.g. siRNA to silence genes
•  Bisulphite conversion to measure methylation
•  … (much more)

What you do with the resulting reads

•  Map them back to a reference genome
•  Assemble them de novo into a genome
•  Try to sort out what is there (a mix of spp)
•  Map them to a custom transcriptome
•  Map them to some other custom database
•  …

Metagenomics @ Melbourne
Wed 27 November 2013, 12.30pm to 6.00pm
Bio21 Molecular Science and Biotechnology Institute, 30 Flemington Road, Parkville
Enquiries to kholt@unimelb.edu.au, www.bio21.org

Guest speakers: Ian Paulsen, Macquarie University; Gene Tyson, UQ Centre for Ecogenomics; Aaron Darling, i3 Institute, UTS; Brendan Burns, UNSW; Rob Moore, CSIRO

Topics: Human microbiome; Animal & plant microbiome; Ecogenomics; Marine metagenomics; Bioinformatics

Plus short talks from 10 local researchers

Register by November 20: research.mdhs.unimelb.edu.au/event/advancing systems biology

Registration is FREE and includes lunch, afternoon tea and evening refreshments. All welcome.

A Bio21 Institute Structural and Cellular Biology Theme Event. Part of Advancing Systems Biology @ Melb.

Forshew et al. Sci Transl Med (2012)

And then there is lots of other stuff, too numerous to mention

•  The other omics (prote-, metabol-, lipid-, phen-, …)
•  Many imaging modalities (is this bioinformatics?)
•  …
•  and, perhaps hardest of all, combinations of the above

High content imaging screens

•  Robotic instruments treat cells (e.g. RNAi, drugs, peptides)

•  Phenotype measurement is image-based

Image from Drug Discovery World

Combining image and molecular data

Yuan et al. Sci Transl Med 4, (2012)

A general problem in the field that needs revisiting with each new kind of data

Sequence alignment

first sequence = GAATTCAGTTA
second sequence = GGATCGA

An optimal alignment:
G A A T T C A G T T A
G G A T _ C _ G _ _ A
(score: match = 2, mismatch = -1, indel = -2)

A different optimal alignment:
_ G A A T T C A G T T A
G G A _ T _ C _ G _ _ A
(score: match = 1, mismatch = 0, indel = 0)
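The two alignments above can be reproduced, up to ties, with the global dynamic programming algorithm of Needleman and Wunsch. This is a minimal sketch under a linear gap penalty. The function name and the tie-breaking order in the traceback are choices made here, so the alignment it prints may differ from the slide while having the same score.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + indel,     # gap in b
                              score[i][j - 1] + indel)     # gap in a
    # Trace back from the bottom-right corner; ties are broken in a fixed order
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

for scheme in ((2, -1, -2), (1, 0, 0)):
    s, x, y = needleman_wunsch("GAATTCAGTTA", "GGATCGA", *scheme)
    print(scheme, s)
    print(x)
    print(y)
```

Under the first scoring scheme the optimal score is 3 (six matches, one mismatch, four indels); under the second it is 6, the length of the longest common subsequence. The point of the slide carries over: which alignment counts as optimal depends on the scoring scheme.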

Sequence alignment methods

General issues that need attention, with each new kind of data

•  Data quality assessment
•  Data cleaning/normalization/adjusting
•  Data visualization

Some general techniques which can be adapted to new contexts

•  Dynamic programming
•  Hidden Markov models
•  Dealing with multiplicity (p-value adjustment)
•  Resampling/bootstrapping
•  Network analyses
•  Bayesian inference

Some applications of hidden Markov models in bioinformatics

•  mapping chromosomes
•  aligning biological sequences
•  predicting sequence structure
•  inferring evolutionary relationships
•  finding genes in DNA sequence

A very short profile HMM

M = Match state, I = Insert state, D = Delete state.
To operate, go from left to right. I and M states output amino acids; B, D and E states are "silent".

How profile HMMs work, in brief

•  Instances of the motif are identified by calculating
      log{pr(sequence | M) / pr(sequence | B)},
   where M and B are the motif and background HMMs.
•  Alignments of instances of the motif to the HMM are found by calculating
      arg max_{states} pr(states | instance, M).
•  Estimation of HMM parameters is by calculating
      arg max_{parameters} pr(sequences | M, parameters).

In all cases, we use the efficient HMM algorithms.
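The first of these calculations, the log-odds score, can be sketched with the forward algorithm, which sums over all state paths in time linear in the sequence length. This is a minimal sketch for a generic discrete HMM without silent states, not a full profile HMM. As a simplifying assumption, the background B is taken to be an i.i.d. residue model instead of an HMM, and all function and parameter names are illustrative.

```python
import math

def safe_log(x):
    """log(x), with log(0) treated as minus infinity."""
    return math.log(x) if x > 0 else -math.inf

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_loglik(seq, start, trans, emit):
    """log pr(seq | HMM), summed over all state paths (forward algorithm).

    start[s], trans[s][t] and emit[s][symbol] are ordinary probabilities.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    states = list(start)
    # alpha[s] = log pr(symbols so far, current state = s)
    alpha = {s: safe_log(start[s]) + safe_log(emit[s].get(seq[0], 0)) for s in states}
    for symbol in seq[1:]:
        alpha = {
            t: logsumexp([alpha[s] + safe_log(trans[s].get(t, 0)) for s in states])
               + safe_log(emit[t].get(symbol, 0))
            for t in states
        }
    return logsumexp(list(alpha.values()))

def log_odds(seq, start, trans, emit, background):
    """log{pr(seq | M) / pr(seq | B)}, with B an i.i.d. background (an assumption)."""
    log_b = sum(safe_log(background[c]) for c in seq)
    return forward_loglik(seq, start, trans, emit) - log_b
```

Replacing the sum over predecessor states by a maximum (and keeping back-pointers) gives the Viterbi algorithm for the second calculation, the most probable state path. Parameter estimation, the third, is usually done iteratively with the Baum-Welch (EM) algorithm, which reuses the forward quantities computed here.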

Dealing with multiplicity (p-value adjustment)

BLAST (Basic Local Alignment Search Tool) aligns short sequences against databases. It uses extreme value theory in calculating its E-value.

Mascot uses mass spectrometry data to identify proteins from peptide sequence databases. It has calibrated its probability score to deal with the huge multiplicity. The details are not fully known, but it works.

With microarray data the Benjamini-Hochberg false discovery rate (FDR) has become widely used. In other areas Bonferroni still reigns.
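The two adjustments named last are short enough to write out. This is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment and the Bonferroni adjustment; the function names are illustrative and the p-values in the usage example are made up.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values (each multiplied by the number of tests)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Made-up p-values, for illustration only
example = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example))
print(bonferroni(example))
```

Calling a test significant when its BH-adjusted value is below 0.05 controls the expected proportion of false discoveries at 5% (under independence or positive dependence), whereas the Bonferroni version controls the chance of any false positive at all and is correspondingly more conservative.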

A specific problem arising with a new kind of data and a new application, which demanded a new technique

Gene set tests

Suppose that we do thousands of tests, comparing gene expression levels between treated and control cells, and find no ("significantly") differentially expressed genes. What might this mean?

Or, suppose we find 100s of ("significantly") differentially expressed genes. What might this mean?

In each case there are gene-set tests that can be used which might illuminate the situation.
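One simple kind of gene-set test, relevant to the second scenario, asks whether a predefined gene set is over-represented among the differentially expressed (DE) genes. This is a minimal sketch using the hypergeometric upper tail; it is only one of many gene-set tests and not necessarily the one intended here. Tests for the first scenario, which use the statistics for all genes in a set and do not require a list of significant genes, are not shown. The function name and the numbers in the usage example are hypothetical.

```python
from math import comb

def overrepresentation_pvalue(n_genes, n_in_set, n_de, n_overlap):
    """P(X >= n_overlap) when n_de genes are drawn at random, without replacement,
    from n_genes genes of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_de)
    total = comb(n_genes, n_de)
    tail = sum(comb(n_in_set, k) * comb(n_genes - n_in_set, n_de - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 10000 genes tested, 200 called DE,
# a gene set of 50 genes of which 8 are among the DE genes
print(overrepresentation_pvalue(10000, 50, 200, 8))
```

A small p-value suggests that the set contains more DE genes than expected by chance. When many gene sets are tested, the resulting p-values themselves need adjusting for multiplicity.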

Finally

Two important general themes

•  Evolution and the comparative method

"Nothing in Biology Makes Sense Except in the Light of Evolution" Theodosius Dobzhansky 1973

•  The use of positive and negative controls ("truth") in order to see whether methods work in practice.

"In theory, there is no difference between theory and practice. But in practice, there is." Yogi Berra (undated)

"The Deluge" (1840) by Francis Danby (1793-1861)

THANKS FOR LISTENING!
