What are Math and Computer Science doing in Biology? Dan Gusfield UC Davis March 29, 2012 Denison University

Page 1: What are  Math  and  Computer Science  doing in  Biology ?

What are Math and Computer Science doing in Biology?

Dan Gusfield

UC Davis

March 29, 2012

Denison University

Page 2: What are  Math  and  Computer Science  doing in  Biology ?

One limited perspective

Page 3: What are  Math  and  Computer Science  doing in  Biology ?
Page 4: What are  Math  and  Computer Science  doing in  Biology ?

Short Answer:

• Bioinformatics

• Computational Biology

• Statistical Biology

• Mathematical Biology

• …..

Page 5: What are  Math  and  Computer Science  doing in  Biology ?

Short Answer:

• Bioinformatics

• Computational Biology

• Statistical Biology

• Mathematical Biology

• …..

My focus

Page 6: What are  Math  and  Computer Science  doing in  Biology ?

Computational biology: “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)

Biology

Computer Science

Math & Statistics

Computational biology, Bioinformatics

Page 7: What are  Math  and  Computer Science  doing in  Biology ?

How can non-biologists, non-chemists understand or contribute to biology?

Where does our license come from?

Page 8: What are  Math  and  Computer Science  doing in  Biology ?

My fear 30 years ago was that I would first need to master material like:

Page 9: What are  Math  and  Computer Science  doing in  Biology ?

Citric Acid Cycle

Page 10: What are  Math  and  Computer Science  doing in  Biology ?

Amylase + starch substrate

Page 11: What are  Math  and  Computer Science  doing in  Biology ?

Bond representation of triplex DNA. This view is down the long axis. The “third” strand is colored.

Page 12: What are  Math  and  Computer Science  doing in  Biology ?

MYOGLOBIN - An oxygen carrier in muscle

Here is another way of visualising the tertiary structure

Tertiary Structure

Spot the Tertiary folding.

Quaternary Structure

Spot the Haem group

Page 13: What are  Math  and  Computer Science  doing in  Biology ?

LYSOZYME

Including the Side chains.

Can you see any active site now?

Page 14: What are  Math  and  Computer Science  doing in  Biology ?

It looked very daunting!

But,

Page 15: What are  Math  and  Computer Science  doing in  Biology ?

By some wonderful fact or fluke of nature, a huge simplification is possible and very productive.

Page 16: What are  Math  and  Computer Science  doing in  Biology ?

Molecular information is (partially) Digital.

And, nature takes notes (leaves historical footnotes).

Page 17: What are  Math  and  Computer Science  doing in  Biology ?

PRIMARY STRUCTURE

This diagram shows the primary structure of PIG INSULIN, a protein hormone as discovered by Frederick Sanger.

He was given a Nobel prize in 1958.

Primary structure is described by the sequence of Amino Acids in the chain

Page 18: What are  Math  and  Computer Science  doing in  Biology ?

Hemoglobin – Primary Structure

NH2-Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-…..

beta subunit amino acid sequence
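To make the "molecule as text" idea concrete, here is a minimal sketch that turns the three-letter residue notation above into a plain one-letter string. The function name `to_text` and the constant `HBB_START` are hypothetical names chosen for this illustration; the lookup table uses the standard one-letter amino acid codes, restricted to the residues that appear in the fragment.

```python
# Standard three-letter to one-letter amino acid codes
# (only the residues needed for this fragment).
THREE_TO_ONE = {
    "Ala": "A", "Asn": "N", "Asp": "D", "Glu": "E", "Gly": "G",
    "His": "H", "Leu": "L", "Lys": "K", "Pro": "P", "Ser": "S",
    "Thr": "T", "Trp": "W", "Val": "V",
}


def to_text(residues: str) -> str:
    """Convert 'Val-His-Leu' style notation into a one-letter string."""
    return "".join(THREE_TO_ONE[name] for name in residues.split("-"))


# Opening residues of the hemoglobin beta subunit, as listed on the slide.
HBB_START = (
    "Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-"
    "Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu"
)

print(to_text(HBB_START))
```

Once the protein is an ordinary string, any text-processing tool can collect, compare, and search it.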

Page 19: What are  Math  and  Computer Science  doing in  Biology ?

It has been amazingly productive to treat protein and DNA molecules just as text: collecting, comparing, creating molecular sequences.

Page 20: What are  Math  and  Computer Science  doing in  Biology ?

No hard-core chemistry or biology - just text comparison and analysis.

Fluke of nature? An imposition of the human mind? Lucky break for us?

Page 21: What are  Math  and  Computer Science  doing in  Biology ?

The first major success story:

Page 22: What are  Math  and  Computer Science  doing in  Biology ?

Simian Sarcoma Virus onc Gene, v-sis, is derived from the Gene (or Genes) of a Platelet-Derived Growth Factor.

R.F. Doolittle et al., Science 1983

Page 23: What are  Math  and  Computer Science  doing in  Biology ?

“The transforming protein of a primate sarcoma virus and a platelet-derived growth factor are derived from the same or closely related cellular genes. This conclusion is based on the demonstration of extensive sequence similarity.”

From the abstract

Page 24: What are  Math  and  Computer Science  doing in  Biology ?

Sequence similarity suggested that genes involved in cancer were functionally related to genes involved in blood platelet growth, two biological phenomena that had previously seemed unrelated.

This was a very surprising result, and a novel kind of reasoning. But,

Page 25: What are  Math  and  Computer Science  doing in  Biology ?

Biology via Sequence Analysis is now completely accepted, mainstream.

Some biologists have even replaced their wet-labs with computer labs, doing biology only by sequence analysis.

Page 26: What are  Math  and  Computer Science  doing in  Biology ?

“The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains …” J. Monod

“The Rosetta stone of modern biology appears to be sequence comparative analysis.” T. Smith

Page 27: What are  Math  and  Computer Science  doing in  Biology ?

Success stories from sequence analysis are now routine. Why?

Mostly shared history and duplication with modification, but also shared physical, chemical constraints.

Page 28: What are  Math  and  Computer Science  doing in  Biology ?

“We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.”

Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine

Page 29: What are  Math  and  Computer Science  doing in  Biology ?

Take-home message


High sequence similarity implies significant functional and/or structural similarity

Ancestor

Species A

Species B

paralogs

orthologs

Page 30: What are  Math  and  Computer Science  doing in  Biology ?

Can we reverse the statement?


Two sequences with high functional similarity should have similar sequences.

Page 31: What are  Math  and  Computer Science  doing in  Biology ?

The success of sequence comparison and analysis, and the development of efficient DNA sequencing, have led to huge projects to capture, accumulate, store, curate, and annotate bio-molecular sequences.

GenBank, BLAST, Human Genome Project, specialized databases.

Page 32: What are  Math  and  Computer Science  doing in  Biology ?

Today it has around 300 trillion bases!

Page 33: What are  Math  and  Computer Science  doing in  Biology ?

Examples of large-scale sequencing projects

1,000 Genomes Project. http://www.1000genomes.org/.

BGI, 10,000 whole human genomes.

BGI, 1,000 individuals with IQ>145 versus 1,000 random individuals.

BGI, Autism Genetic Resource Exchange, 10,000 individuals.

BGI, CHOP, many childhood diseases.

Genome Institute, Washington U. St. Louis, 600 childhood cancer patients; $65 million over three years. 150 tumor & normal cancer genome pairs.

Epitwin: TwinsUK & BGI $30 million for epigenetic differences in 5,000 twins.

Netherlands Genome Project: BGI 750 genomes (250 trios) in Dutch biobanks.

Epi4K: Duke et al. $25M to sequence 4,000 genomes for epilepsy research.

U. Michigan Cancer Center: Clinical next-gen sequencing of cancer patients.

R. Michelmore

Page 34: What are  Math  and  Computer Science  doing in  Biology ?

$1,000 ($100?) human genome coming =>

• $1,000 genome for many animals and plants

• $100 genome for fungi

• $10 genome for bacteria en masse

Metagenomics: sequencing of communities

• biomes (humans = 100x more bacteria)

• novel & unculturable organisms

• characterization of diversity & unique genes

Not just genomic DNA sequence:

• DNA modifications

• epigenomics & copy number variation (CNV)

• expression analysis (RNAseq not arrays)

Enormous amounts of sequence data

• Need for major data handling capabilities

• Vital role for bioinformatics just to manage the data

In near future: DNA sequence = an inexpensive commodity generated on a variety of platforms

R. Michelmore

Page 35: What are  Math  and  Computer Science  doing in  Biology ?

More recently: Metagenomics, metabolomics, proteomics, microbiomics, epigenomics, transcriptomics, methylomics….

High-throughput biology is generating massive amounts of data, sometimes too large even to store.

Page 36: What are  Math  and  Computer Science  doing in  Biology ?

NYT November 30, 2011:

“The Beijing Genome Center has enough sequencing capacity to sequence 2,000 human genomes per day.”

“World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high.”

Page 37: What are  Math  and  Computer Science  doing in  Biology ?

OK, so sequences and sequence analysis are important, but where’s the promised computer science and math?

Page 38: What are  Math  and  Computer Science  doing in  Biology ?

Simple sequence comparison, comparing new sequences against sequences in databases, has been extremely productive.

But how do we extract the most biological value from sequences?

The Larger Challenge and Opportunity: How to utilize the deluge of sequence data?

Page 39: What are  Math  and  Computer Science  doing in  Biology ?

What significant patterns do you see in:

Page 40: What are  Math  and  Computer Science  doing in  Biology ?
Page 41: What are  Math  and  Computer Science  doing in  Biology ?

Making sense of the code


Page 42: What are  Math  and  Computer Science  doing in  Biology ?
Page 43: What are  Math  and  Computer Science  doing in  Biology ?

Damien Peltier

Page 44: What are  Math  and  Computer Science  doing in  Biology ?

How do we know that patterns we see are meaningful? How do we know that similarities we see are based in biology and not just random happenstance?

Humans are good at seeing patterns, even in random events and data.

How do we analyze so much data?

Page 45: What are  Math  and  Computer Science  doing in  Biology ?

From Mars

Page 46: What are  Math  and  Computer Science  doing in  Biology ?

From the bible code

Page 47: What are  Math  and  Computer Science  doing in  Biology ?

What we need:

• Clear, biologically meaningful definitions of similarity, patterns. Biological models of mutation and evolution - how sequences evolve.

• Metrics - how similar, how good the fit.

• Efficient methods to compute similarities, and find patterns, and compute the metrics.

• Efficient methods to assess the “significance” of the finds.

Page 48: What are  Math  and  Computer Science  doing in  Biology ?

For those tasks, we need

• Biology - to define and model meaningful types of similarities and patterns to look for.

• Mathematics - to propose and understand the models and metrics.

• Computer Science - for efficient sequence analysis and search algorithms.

• Statistics - to measure the “significance” (deviation from random happenstance) of the finds.

Page 49: What are  Math  and  Computer Science  doing in  Biology ?

Computational biology: “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)

Biology

Computer Science

Math & Statistics

Computational biology, Bioinformatics

Page 50: What are  Math  and  Computer Science  doing in  Biology ?

“It costs more to analyze a genome than to sequence a genome.” D. Haussler

Page 51: What are  Math  and  Computer Science  doing in  Biology ?

A small part of the story in greater detail

Page 52: What are  Math  and  Computer Science  doing in  Biology ?

Basic problem: define and compute the similarity of two sequences

• Biological-Mathematical model: Two sequences are similar when…

• Algorithmic problem: How do you compute the sequence similarity of two sequences S1 and S2?

Page 53: What are  Math  and  Computer Science  doing in  Biology ?

“All models are wrong, but some are useful.”

George Box

Page 54: What are  Math  and  Computer Science  doing in  Biology ?

Modeling sequence evolution

S1: AATCCAGTTTTACAGATCCTC length m=21

S2: AATAGTTTTACAGACTCAT length n=19

Alignment: Insert spaces into, or before or after, the two sequences to make them the same length.

S1: -AATCCAGTTTTACAGA-TCCTC aligned length 23

S2: AATA--GTTTTACAGACTCAT-- aligned length 23

Match, Mismatch, Space, Gap

One measure of the goodness of the alignment is (# of matches) - (# of mismatches) - (# of spaces)
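The scoring rule on this slide can be sketched in a few lines. This is an illustrative sketch, not code from the talk: the function name `alignment_score` and the use of "-" as the space character are assumptions, and a column holding two spaces is rejected because it carries no information.

```python
def alignment_score(row1: str, row2: str, space: str = "-") -> int:
    """Score one fixed alignment as matches - mismatches - spaces."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = spaces = 0
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column cannot contain two spaces")
        if a == space or b == space:
            spaces += 1          # one sequence has a space in this column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mismatches - spaces


# Toy example (hypothetical sequences): 3 matches, 1 mismatch, 1 space.
print(alignment_score("AC-GT", "ACTGA"))
```

The function only evaluates an alignment that is already given; choosing the best alignment is a separate problem.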

Page 55: What are  Math  and  Computer Science  doing in  Biology ?

Given a metric to measure the goodness of any specific alignment, we define the Similarity of two sequences S1 and S2 as:

The Maximum of (# matches) - (# mismatches) - (# spaces) over all possible alignments of S1 and S2.

But how do we compute similarity?

Page 56: What are  Math  and  Computer Science  doing in  Biology ?

Mathematics finds a formula:

So there are a huge number of alignments.

Page 57: What are  Math  and  Computer Science  doing in  Biology ?

Mathematics counts the number of alignments

Length of the sequences    Number of alignments

10                         184,756

20                         ~1.4e11

100                        ~9.0e58
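The formula itself did not survive in this transcript, but the three table entries are consistent with the central binomial coefficient C(2n, n) for two sequences of length n. The sketch below checks that reading; treating C(2n, n) as the slide's formula is an assumption, and `count_alignments` is a hypothetical name.

```python
import math


def count_alignments(n: int) -> int:
    """Central binomial coefficient C(2n, n).

    Assumption: this is the count tabulated on the slide; it reproduces
    all three table entries.
    """
    return math.comb(2 * n, n)


for n in (10, 20, 100):
    print(n, f"{count_alignments(n):.2e}")
```

Under this count, each added position roughly quadruples the number of alignments, which is why explicit enumeration is hopeless even for short sequences.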

Page 58: What are  Math  and  Computer Science  doing in  Biology ?

There are too many alignments to try each one out, but clever, efficient algorithms, using the technique of Dynamic Programming, allow the efficient computation of similarity. (Computer Science contribution).

Page 59: What are  Math  and  Computer Science  doing in  Biology ?

For any length n, the number of operations needed to compute the similarity of two n-length sequences, via Dynamic Programming, is proportional to n squared (i.e., n^2).
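A minimal sketch of such a dynamic program, assuming the scoring used earlier in the talk (+1 per match, -1 per mismatch, -1 per space). The function name `similarity` and the table name `best` are illustrative. The table has (m+1) x (n+1) cells and each cell is filled with a constant amount of work, which is where the n^2 bound comes from.

```python
def similarity(s1: str, s2: str) -> int:
    """Maximum of matches - mismatches - spaces over all alignments."""
    m, n = len(s1), len(s2)
    # best[i][j] = similarity of the prefixes s1[:i] and s2[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = -i                      # s1[:i] against spaces only
    for j in range(1, n + 1):
        best[0][j] = -j                      # s2[:j] against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = 1 if s1[i - 1] == s2[j - 1] else -1
            best[i][j] = max(
                best[i - 1][j - 1] + pair,   # align the two characters
                best[i - 1][j] - 1,          # s1 character opposite a space
                best[i][j - 1] - 1,          # s2 character opposite a space
            )
    return best[m][n]


# The two sequences from the alignment slide.
S1 = "AATCCAGTTTTACAGATCCTC"
S2 = "AATAGTTTTACAGACTCAT"
print(similarity(S1, S2))
```

For these two sequences the table has 22 x 20 = 440 cells, compared with the astronomically many alignments an explicit enumeration would have to score.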

Page 60: What are  Math  and  Computer Science  doing in  Biology ?

Number of operations needed to compute Similarity

Length of the sequences    Explicit enumeration    Dynamic Programming

10                         184,756                 100

20                         ~1.4e11                 400

100                        ~9.0e58                 10,000 (1e4)

So similarity can be found quickly, but

Page 61: What are  Math  and  Computer Science  doing in  Biology ?

Is the similarity significant?

Elegant statistical methods can be used to determine the probability that two random sequences would have that level of similarity or more.

We don’t reject the possibility that two sequences are similar due only to chance, unless the computed probability is very low.
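The talk does not say which statistical method it has in mind. As a simple stand-in, the sketch below uses a shuffle (permutation) test: it scrambles one sequence many times and counts how often the scrambled version is at least as similar as the real one. The names `shuffle_p_value`, `trials`, and `seed` are hypothetical, and the similarity function repeats the +1/-1/-1 scoring so the block runs on its own.

```python
import random


def similarity(s1: str, s2: str) -> int:
    """Best alignment score with +1 match, -1 mismatch, -1 space."""
    prev = [-j for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        cur = [-i]
        for j, b in enumerate(s2, start=1):
            pair = 1 if a == b else -1
            cur.append(max(prev[j - 1] + pair, prev[j] - 1, cur[j - 1] - 1))
        prev = cur
    return prev[-1]


def shuffle_p_value(s1: str, s2: str, trials: int = 200, seed: int = 0) -> float:
    """Estimate how often a shuffled s2 is at least as similar to s1."""
    rng = random.Random(seed)        # fixed seed keeps the sketch repeatable
    observed = similarity(s1, s2)
    letters = list(s2)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)         # same composition, random order
        if similarity(s1, "".join(letters)) >= observed:
            at_least += 1
    # Add-one estimate so the reported probability is never exactly zero.
    return (at_least + 1) / (trials + 1)


print(shuffle_p_value("AATCCAGTTTTACAGATCCTC", "AATAGTTTTACAGACTCAT"))
```

Shuffling keeps the letter composition of the sequence fixed, so a low estimate says the observed similarity is unlikely to come from composition alone. It is far slower than analytic methods and is meant only to illustrate the idea.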

Page 62: What are  Math  and  Computer Science  doing in  Biology ?

Extensions: Finding patterns in multiple sequences

ACTAACCGGGAGATTTCAGA human

AAGTTCCGGGAGATTTCCA chimp

TAGTTATCCGGGAGATTAGA mouse

AAAACCGGTAGATTTCAGG rat

Page 63: What are  Math  and  Computer Science  doing in  Biology ?

Multiple Sequence Alignment

AC--TAACCGGGAGATTTCAGA human

AAGTT--CCGGGAGATTTCC-A chimp

TAGTTATCCGGGAGATT--AGA mouse

AA---AACCGGTAGATTTCAGG rat
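Once the sequences are aligned, shared patterns can be read off column by column. The sketch below finds the columns where all four species carry the same base; the names `ALIGNMENT` and `conserved_columns` are illustrative, and "fully conserved" is just one simple notion of a pattern.

```python
# The four aligned rows from the slide.
ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}


def conserved_columns(rows, space="-"):
    """Return 0-based indices of columns with one shared base and no space."""
    return [
        i
        for i, column in enumerate(zip(*rows))
        if space not in column and len(set(column)) == 1
    ]


columns = conserved_columns(list(ALIGNMENT.values()))
human = ALIGNMENT["human"]
print(columns)
print("".join(human[i] for i in columns))
```

The conserved columns cluster in the middle of the alignment, which is the kind of shared block that is hard to spot in the unaligned sequences.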

Page 64: What are  Math  and  Computer Science  doing in  Biology ?

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Page 65: What are  Math  and  Computer Science  doing in  Biology ?

Again we need a model of what multiple sequence alignments are biologically meaningful; a metric to score the goodness of a multiple alignment; an algorithm to compute multiple alignments, based on the metric; and statistical methods to evaluate the significance of an alignment.
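The slides do not name a specific metric for multiple alignments. One widely used choice, shown here only as an assumed example, is the sum-of-pairs score: score every pair of rows with the pairwise rule (+1 match, -1 mismatch, -1 space) and add the results. The names `pair_score` and `sum_of_pairs` are illustrative, and ignoring columns where both rows have a space is a convention adopted for this sketch.

```python
from itertools import combinations


def pair_score(row1: str, row2: str, space: str = "-") -> int:
    """Pairwise score induced by two rows of a multiple alignment."""
    score = 0
    for a, b in zip(row1, row2):
        if a == space and b == space:
            continue                 # both rows have a space: ignore column
        if a == space or b == space:
            score -= 1               # space opposite a character
        elif a == b:
            score += 1               # match
        else:
            score -= 1               # mismatch
    return score


def sum_of_pairs(rows) -> int:
    """Add the induced pairwise scores over every pair of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = [
    "AC--TAACCGGGAGATTTCAGA",   # human
    "AAGTT--CCGGGAGATTTCC-A",   # chimp
    "TAGTTATCCGGGAGATT--AGA",   # mouse
    "AA---AACCGGTAGATTTCAGG",   # rat
]
print(sum_of_pairs(rows))
```

Scoring a given multiple alignment this way is cheap; finding the alignment that maximizes the score is the hard algorithmic part.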

Page 66: What are  Math  and  Computer Science  doing in  Biology ?

Summarizing

• Biology by sequence analysis opens the door widely to non-biologists.

• Models of sequence evolution and metrics used in sequence analysis are articulated by biology and Mathematics.

• Computer Science contributes efficient algorithms to do the analysis and compute the metrics.

• Statistics is needed to evaluate the significance of the computed results.

• Sequence analysis is just one of many ways that computer science and mathematics have entered biology.

Page 67: What are  Math  and  Computer Science  doing in  Biology ?

In general: The computational-biology work flow

Biological Knowledge. E.g. DNA mutates

Biological model. E.g. DNA replication infidelity model, mutagens, radiation models, etc.

Mathematical model and assumptions. E.g. assumption about mutation distribution or preferential attachment

Mathematical problem. E.g. Given the mathematical model, find spots where mutation rates are high or low in a statistically significant way

Algorithmic problem. E.g. What algorithm should I develop to efficiently find hotspots?

Programming problem. E.g. Data storage, Memory, OOP and languages, optimizations, GUI

Page 68: What are  Math  and  Computer Science  doing in  Biology ?

Another illustration, involving phylogenetic trees rather than sequences.

Page 69: What are  Math  and  Computer Science  doing in  Biology ?

Comparing Trees: Tanglegrams

• A Tanglegram is a pair of phylogenetic trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.

• One tree represents the evolution of a set of species, and the other tree represents the evolution of a set of parasites that inhabit the species.

• A straight line connects each leaf in one tree to the leaf with the same label in the other tree.

• The number of crossing lines is a measure of the similarity of the trees.

• A small measure suggests that the species and parasites co-evolved.
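For one fixed drawing, the number of crossings depends only on the order of the leaves along the two parallel lines: two connecting lines cross exactly when their leaves appear in opposite orders. A minimal sketch, with `count_crossings` as an illustrative name and hypothetical leaf labels:

```python
def count_crossings(left_order, right_order) -> int:
    """Count pairs of leaves whose relative order differs between the lines."""
    position = {leaf: i for i, leaf in enumerate(right_order)}
    ranks = [position[leaf] for leaf in left_order]
    # Each inversion in `ranks` is one pair of crossing lines.
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


# Hypothetical leaf orders for four species and their parasites.
print(count_crossings(["a", "b", "c", "d"], ["b", "a", "c", "d"]))
```

This quadratic count is enough for small trees; it measures one particular drawing rather than the trees themselves.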

Page 70: What are  Math  and  Computer Science  doing in  Biology ?


Images courtesy of NTBG

Page 71: What are  Math  and  Computer Science  doing in  Biology ?


Images courtesy of NTBG

Page 72: What are  Math  and  Computer Science  doing in  Biology ?

But the trees can be redrawn to reduce the number of crossings.

So we have the algorithmic problem of finding planar layouts of the two trees, to minimize the number of crossings of the lines between the leaves. That minimum number is the metric of similarity. How do we compute it, and how can we evaluate significance?
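To illustrate what "redrawing" means, the sketch below tries every planar layout of two small binary trees by optionally swapping the two children at each internal node, and keeps the layout pair with the fewest crossings. This brute force is an assumption made for illustration, not the method from the talk: it takes exponential time and is only usable on tiny trees. Trees are written as nested 2-tuples with string leaves, and all names are hypothetical.

```python
def count_crossings(left_order, right_order) -> int:
    """Count pairs of leaves whose relative order differs."""
    position = {leaf: i for i, leaf in enumerate(right_order)}
    ranks = [position[leaf] for leaf in left_order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def layouts(tree):
    """Yield every leaf order obtainable by swapping children at nodes."""
    if isinstance(tree, str):        # a leaf
        yield [tree]
        return
    first, second = tree
    for a in layouts(first):
        for b in layouts(second):
            yield a + b              # children in the given order
            yield b + a              # children swapped


def min_crossings(tree1, tree2) -> int:
    """Fewest crossings over all layout pairs (exponential brute force)."""
    return min(
        count_crossings(a, b)
        for a in layouts(tree1)
        for b in layouts(tree2)
    )


# Hypothetical host and parasite trees on the same four labels.
hosts = (("a", "b"), ("c", "d"))
parasites = (("a", "c"), ("b", "d"))
print(min_crossings(hosts, parasites))
```

A tree with k internal nodes has 2^k layouts, so the search space grows quickly, which is why an efficient algorithm is needed for realistic trees.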

Page 73: What are  Math  and  Computer Science  doing in  Biology ?

Thank you