Stat 877 (992): Statistical Methods in Molecular Biology
Course plans
• Team taught: Newton, Larget, Ane, Keles, Kendziorski, Broman, Yandell
• Per-instructor homework sets (six at 12 pts each)
• Final project, poster presentation (28 pts)
National Research Council Report, 2004
Mathematics and 21st Century Biology
“Progress in the biosciences will increasingly depend on deep and broad integration of mathematical analysis into studies at all levels of biological organization…: molecules, cells, organisms, populations, and ecosystems.”
“The committee regards the interface between mathematics and biology as biology-driven.”
Some definitions [first approximations!]
cell: structural/functional unit of all living organisms
protein: organic compound produced and used by cell
amino acid: protein building block
nucleic acid: chainlike molecule involved in preservation, replication, and expression of hereditary information in every living cell
nucleotide: nucleic acid building block
Example function: oxygen transport
2-3 x 10^13 red blood cells/body
2 x 10^6 new cells/second
95% of dry weight is protein hemoglobin
sequence of amino acids in hemoglobin
• alpha chain (141 amino acids) [2 subunits]
• VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
• beta chain (146 amino acids) [2 subunits]
• VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH
A few amino acids (among 20 standard)
V = Val = Valine
L = Leu = Leucine
M = Met = Methionine
more about amino acids
Amino acids are concatenated into protein by the translation of information stored in messenger RNA (mRNA)
Ribonucleic acid (RNA)
Nucleotide bases
A = adenine C = cytosine U = uracil G = guanine
single stranded
[Figure: mRNA codons translated in order to Met, Thr, Glu, Leu, Arg, Ser, stop]
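The translation step can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, not anything taken from the course materials. The mRNA string `example` is an assumed sequence, chosen only because it encodes the Met, Thr, Glu, Leu, Arg, Ser, stop pattern from the slide; the names `CODON_TABLE` and `translate` are likewise hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order
# (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string (read 5' to 3') into one-letter amino acids.

    Reading starts at position 0 and stops at the first stop codon.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: release the chain
            break
        protein.append(aa)
    return "".join(protein)


# Assumed example sequence (not from the slides).
example = "AUGACGGAGCUUCGGAGCUAG"
print(translate(example))  # prints "MTELRS"
```

In one-letter codes, MTELRS is Met, Thr, Glu, Leu, Arg, Ser. A real transcript also carries untranslated regions on both sides of the coding sequence, which this sketch ignores.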
mRNA structure
orientation 5’ to 3’
UTR = untranslated region; affects mRNA stability, mRNA localization, and translational efficiency
Primary transcripts are produced by the transcription of DNA
Deoxyribonucleic acid (DNA)
double stranded
4 nucleotide bases ATGC
base pairing: A-T, C-G
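The base-pairing rule makes the second DNA strand fully determined by the first, and it is also what lets an RNA copy be made from DNA. The sketch below illustrates both ideas under simplifying assumptions: sequences are plain uppercase strings, and transcription is modeled as copying the coding strand with T replaced by U. The function names are hypothetical.

```python
# Watson-Crick pairing for DNA: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(coding_strand):
    """Return the RNA with the same sequence as the coding strand (T -> U)."""
    return coding_strand.replace("T", "U")


coding = "ATGACGGAGCTT"          # assumed toy sequence
template = reverse_complement(coding)
print(template)                   # the strand the RNA is actually copied from
print(transcribe(coding))         # AUGACGGAGCUU
```

Taking the reverse complement twice returns the original strand, which is a quick sanity check on the pairing table.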
Chromosomes are organized structures of DNA and proteins that are found in cells. Each chromosome contains a single continuous piece of DNA.
In diploid species, chromosomes are paired.
Human chromosome        total number base pairs
1                       247,200,000
2                       242,750,000
3                       199,450,000
4                       191,260,000
5                       180,840,000
6                       170,900,000
7                       158,820,000
8                       146,270,000
9                       140,440,000
10                      135,370,000
11                      134,450,000
12                      132,290,000
13                      114,130,000
14                      106,360,000
15                      100,340,000
16                       88,820,000
17                       78,650,000
18                       76,120,000
19                       63,810,000
20                       62,440,000
21                       46,940,000
22                       49,530,000
X (sex chromosome)      154,910,000
Y (sex chromosome)       57,740,000
Estimates from Sanger’s Vertebrate Genome Annotation (VEGA) database, 7/07
A genome equals the sequence of one full copy
3 Gbp, or 100 yrs at 1 bp/second
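The "3 Gbp, or 100 years at 1 bp/second" figure can be checked directly from the chromosome table. The sketch below sums the tabulated lengths (22 autosomes plus X and Y, as listed) and converts the total to years of reading at one base per second. The 365.25-day year is an assumption made for the arithmetic.

```python
# Chromosome lengths in base pairs, copied from the table above.
CHROMOSOME_BP = {
    "1": 247_200_000, "2": 242_750_000, "3": 199_450_000,
    "4": 191_260_000, "5": 180_840_000, "6": 170_900_000,
    "7": 158_820_000, "8": 146_270_000, "9": 140_440_000,
    "10": 135_370_000, "11": 134_450_000, "12": 132_290_000,
    "13": 114_130_000, "14": 106_360_000, "15": 100_340_000,
    "16": 88_820_000, "17": 78_650_000, "18": 76_120_000,
    "19": 63_810_000, "20": 62_440_000, "21": 46_940_000,
    "22": 49_530_000, "X": 154_910_000, "Y": 57_740_000,
}

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length

total_bp = sum(CHROMOSOME_BP.values())
years_at_1bp_per_second = total_bp / SECONDS_PER_YEAR

print(f"total: {total_bp / 1e9:.2f} Gbp")
print(f"reading time: {years_at_1bp_per_second:.1f} years")
```

The total comes to about 3.08 Gbp and a little under 100 years, consistent with the rounded figures on the slide.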
1% of bases are in exons
24% of bases are in introns
2001: drafts of the human genome sequence published
2007: pilot phase of ENCODE (Encyclopedia Of DNA Elements) project completed
• majority of bases are transcribed
• extensive transcript overlap
• functions poorly understood
Evolving definition of gene
1860s-1900s: a discrete unit of heredity (Mendel)
1910s: a distinct locus (Morgan)
1940s: the blueprint for a protein (Beadle & Tatum)
1960s: a transcribed code (Watson & Crick)
Genome era: a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions
Post ENCODE: the gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products (Gerstein et al. 2007)
Statistics supports the development of genomic resources
• In accommodating sequencing errors for genome assembly
• In rating the significance of sequence matches by alignment algorithms
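One simple way to see what "rating the significance of a sequence match" means is a permutation test: compare the observed score with scores obtained after shuffling one sequence. The sketch below is only an illustration of that idea, with a deliberately crude score (count of matching positions in two equal-length, ungapped sequences). It is not the method used by any particular alignment tool; production tools typically rely on analytic score distributions instead. All names and the toy sequence are assumptions.

```python
import random


def match_score(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def permutation_p_value(a, b, n_perm=999, seed=0):
    """Estimate P(score >= observed) when b's letters are randomly shuffled.

    Shuffling keeps b's base composition but destroys its order, so the
    shuffled scores show what chance similarity alone would produce.
    """
    rng = random.Random(seed)
    observed = match_score(a, b)
    shuffled = list(b)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if match_score(a, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_perm + 1)


seq = "ACGTTGCAAGCTTAGC"               # assumed toy sequence
print(permutation_p_value(seq, seq))   # a perfect match: smallest possible p
```

With 999 permutations the smallest attainable p-value is 1/1000, so finer resolution needs more permutations or an analytic approximation.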
Statistics supports analyses to determine the function of genes/transcripts/proteins
• Gene regulation
• Gene expression
• Network considerations (many processes/functions)
Example: oxygen transport
According to the Gene Ontology (GO) project, 46 different genes are involved in this biological process
Statistics is critical in analyzing patterns of genomic variation within populations, and in relating this variation to disease states or other phenotypes
• Genomes differ from the reference copy (single nucleotide polymorphisms, structural variants)
• Gene mapping by linkage and association methods
Statistics is critical in analyzing patterns of genomic variation between populations/species
• Phylogenetic analysis
“Nothing in biology makes sense except in the light of evolution”
-T. Dobzhansky
Tree of life project
“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with reproduction; Inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less improved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows.”
- Charles Darwin