basics of molecular biology - tandy warnowtandy.cs.illinois.edu/bio-basics.pdf · a multiple...

21
Basics of Molecular Biology Tandy Warnow December 29, 2016

Upload: others

Post on 02-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basics of Molecular Biology

Tandy Warnow

December 29, 2016

Page 2: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Introduction to CS 466 Tandy Warnow

Page 3: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Introduction to CS 466

This course is about:

I Understanding the tools used in biological sequence analysis

I Understanding the mathematical models underlying thesetools

I Designing better methods for biological sequence analysis

Fundamentally this requires computer science and statistics, butnot too much biology!

Page 4: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Biological topics we’ll cover

I Genome assembly and search

I Multiple sequence alignment and applications

I Phylogenetics

I Protein sequence analysis

I Metagenomics

I Systems biology

Page 5: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

A multiple sequence alignment

220 Multiple sequence alignment

P1 a1 - a2 a3 a4 a5P2 b1 b2 b3 b4 b5 b6

Table 9.9 An optimal alignment of profiles P1 and P2 from Table 9.8, where ai and bi

represent the unadjusted frequency vectors for the ith positions in P1 and P2, respectively.

We define the cost of putting ai and b j in the same column to be the expected cost ofaligning a randomly generated nucleotide in the ith position of P1 and the jth position ofP2. Hence,

cost(ai,b j) = ∑x 6=y

P(x|ai)P(y|b j).

For example, cost(a1,b1) = 1/3, cost(a2,b1) = 1, cost(a2,b3) = 0, and cost(a5,b6) =11/15. Note also that the cost of aligning a non-empty column against an entirely gappedcolumn is 1.

After we compute cost(ai,b j) for all i, j, we use those as the cost of each “substitution”of ai by b j, and we run the Needleman-Wunsch algorithm to find a minimum cost align-ment of the two profiles. If we apply this technique to the two profiles in Table 9.8 weobtain the pairwise alignment given in Table 9.9, which has total cost 1/3+1+0+1/4+1/3+(2/15+3/5) = 2.65.

Note that the pairwise alignment of P1 and P2 defines a merger of alignments A1 andA2, and hence a multiple sequence alignment of {s1,s2, . . . ,s8} (shown in Table 9.10) thatis consistent with A1 and A2. Hence, during this profile-profile alignment procedure, wenever changed the alignments A1 and A2. This is a property of profile-profile alignmentand also HMM-HMM alignment strategies: the underlying alignments that are representedby profiles or profile HMMs are not modified during the process.

s1 - - - T A Cs2 - - A T A Cs3 C - A - - Gs4 C - A A T Gs5 C - - T - Gs6 C T - - A Cs7 C - A T A Cs8 G - A - A T

Table 9.10 The final multiple sequence alignment A of {s1,s2, . . . ,s8} obtained byaligning A1 and A2 from Table 9.7, computing their profiles P1 and P2 (see Table 9.8),and then aligning the profiles (see Table 9.9). Recall that A1 is an alignment of{s1,s2,s3,s4,s5}, and that A2 is an alignment of {s6,s7,s8}. Note that A agrees with A1,but includes an all-gap column. Similarly, A agrees with A2.

Page 6: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

A phylogeny

From https://en.wikipedia.org/wiki/File:Phylogenetic tree.svg

Page 7: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Systems Biology

From http://www.assignmentpoint.com/science/biology/systems-biology.html

Page 8: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Regulatory Network

From http://en.wikipedia.org/wiki/Gene regulatory network

Page 9: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Protein-Protein Interaction Network

From https://ocw.mit.edu/courses/biology/7-343-network-medicine-using-systems-biology-and-signaling-

networks-to-create-novel-cancer-therapeutics-fall-2012/

Page 10: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Steps in a phylogenomic analysis

1. Decide which species and genes are needed

2. Collect specimens and obtain sequence data

3. Compute multiple sequence alignments and phylogenetic treesfor each gene

4. Estimate the species tree1

5. Estimate branch support and dates

6. Answer biological questions

1or network

Page 11: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Proteins

I Proteins can be seen as strings of amino acids, but they arealso fundamental to function.

I Proteins form structures that enable them to interact withother proteins and with DNA molecules. The inference ofprotein structure is closely related to the inference of proteinfunction.

Page 12: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Protein structure

From http://www.biologyexams4u.com/2011/10/protein-structure.html#.WGUOyjKZPrI

Page 13: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Inferring protein structure and/or function

A simple and common technique to infer the structure or functionof a newly discovered protein sequence:

I Examine protein sequences in a public database (perhaps withknown structure and/or function) and find the ones that areclosest (in terms of its AA sequence), using methods likeBLAST. These are the “homologous” sequences, and the setof homologous sequences forms a protein “family”.

I Then compute a pairwise sequence alignment for the newsequence and the closest homolog, and transfer theinformation.

Page 14: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Inferring protein structure and/or function

Or you could try one of these:

I Better: Compute a multiple sequence alignment and tree forthe new sequence and its homologs, and see where the newsequence is located in the tree. Infer function and structure atthe internal nodes of the tree, and transfer the information tothe new sequence.

I Even better (if available): Use a profile Hidden MarkovModel or other statistical model for the family that isannotated with structural features, and align the newsequence to the model. Transfer structural information.

The same could be done with RNAs, which also form structuresand have functions.

Page 15: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Repeating themes

I Genome sequencing data

I Multiple sequence alignment

I Phylogenetic tree

I Protein and RNA structure and function

I BLAST (or other database search tools)

I profile Hidden Markov Models (HMMs)

Page 16: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

This course

This course is about:

I designing better algorithms – ones that are more accurate andare scalable to large datasets.

I proving theorems about methods under statistical models(especially for phylogenies)

I dataset analysis

You don’t need to know any biology for this; you’ll learn as you go.Most of the work is really a combination of computer science andstatistics.

Page 17: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basic vocabulary, page 1

I Nucleotide sequences (RNA and DNA): think of them asstrings over a four-letter alphabet (ACTG for DNA, ACUG forRNA)

I Protein sequences: composed of amino acids, of which thereare 20

I Coding sequences: only some nucleotide sequences are used tocreate proteins. Those sequences are called exons.

I Codons: three nucleotides in a row, that are used to createamino acids. Every codon makes a single amino acid. Hence,this is a many-to-one mapping, called the Genetic code.

I Gene: used to mean something specific, and it’s no longerclear what it means. However, it’s safe to think of this as acollection of regions of a genome that may have somefunction.

Page 18: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basic vocabulary, page 2

I Genome: the collection of chromosomes that carry geneticinformation (in the form of DNA), which is inherited.

I Alleles: in diploid organisms, we have two copies of eachchromosome, and so two copies of the gene at a given locus(position) in the genome; each of these copies is called anallele. The collection of alleles is the genotype.

I Transcription: the action of changing a DNA sequence intomessenger RNA.

I Translation: the action of changing a string of messengerRNA into a string of amino acids. (Think of this as DNAmakes RNA, and RNA makes proteins...)

I Exons: the portion of the gene that remains after the intronsare removed during transcription.

Page 19: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basic vocabular, page 3

I Mutations: substitutions of nucleotides by other nucleotides inthe genome.

I Synonymous mutations (or silent mutations): Mutations thatoccur in coding regions (i.e., exons) and that do not changethe resultant amino acids, due to the many-to-one mapping ofthe genetic code.

I Non-synonymous mutations: mutations that do change theresultant amino acid.

I Natural selection: preferential survival and reproduction orpreferential elimination of individuals with certain genotypes

I Neutral evolution: evolutionary changes that are not impactedby selection

Page 20: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basic vocabulary, page 4

I Phylogeny: an evolutionary tree or possibly evolutionarynetwork

I Gene tree: a phylogeny based on a single gene

I Species tree: a phylogeny that represents how species evolved

I Hybridization: when two different species mate and haveoffspring, called hybrids

I Horizontal gene transfer: when genetic material is transferredfrom one organism into another

I Homology: related by descent from a common ancestor

I Species: too complicated to answer

I Genus, Family, Order, Class, Phylum, Kingdom, Domain - thehierarchies in a taxonomy

Page 21: Basics of Molecular Biology - Tandy Warnowtandy.cs.illinois.edu/bio-basics.pdf · A multiple sequence alignment 220 Multiple sequence alignment P1 a1 - a2 a3 a4 a5 P2 b1 b2 b3 b4

Basic vocabulary, page 5

I Exome: the part of the genome that is comprised of exons

I Transcriptome: all the messenger RNA that is expressed bythe genes in an organism

I Genome assembly: putting together a genome from lots ofshort DNA strings, generated by a sequencing project

I Reads: the short DNA strings produced by the sequencingtechnology

I Contigs: somewhat longer DNA strings produced bycombining reads