![Page 1: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/1.jpg)
CS 598SS
Probabilistic Methods in Biological Sequence Analysis
Saurabh Sinha
![Page 2: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/2.jpg)
What is the course about?
• Bioinformatics / Computational Biology
• Tools for analyzing genomes
• Probabilistic methods
![Page 3: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/3.jpg)
What is the course format?
• Research course
• Lectures by instructor
• Student presentations of research papers
– 1 or 2 paper(s) per student
• Research project & presentation
– Typically, 2 students per project
– 30 mins presentation at end of course.
![Page 4: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/4.jpg)
Grading
• Project: 40%
• Paper presentation: 25%
• Assignments and/or tests: 25%
• Participation: 10%
• Grade distribution
![Page 5: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/5.jpg)
Expectations
• Programming skills (for the project)
• Basic exposure to probability theory
• Basic exposure to algorithms
![Page 6: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/6.jpg)
What you can do at the end of the course
• Start working on research projects in bioinformatics: biological sequence analysis
• Use principled approaches, supported by probability theory, instead of ad hoc methods
• Join me as a graduate advisee?
![Page 7: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/7.jpg)
Administrative Details
• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]
• Class hrs: Tue & Thurs, 2:00pm - 3:15pm, 1131SC
• CRN: 43781
• Credits: 4 graduate hrs
• Welcome to sit in, if not taking for credit
![Page 8: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/8.jpg)
Books
• Not required
1. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids -- Durbin, Eddy, Krogh, Mitchison
2. Bioinformatics: The Machine Learning Approach -- Baldi, Brunak
3. Statistical Methods in Bioinformatics -- Ewens and Grant
4. Bioinformatics -- Polanski and Kimmel
![Page 9: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/9.jpg)
Why study bioinformatics?
• Molecular biology is the new frontier of 21st century science
• Computer science is the crown prince of 20th century engineering
• Bioinformatics is the application and development of computer science with the goal of supporting molecular biology
![Page 10: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/10.jpg)
Why study bioinformatics?
• Flood of data: several gigabytes (terabytes?) of sequence and gene expression data
• Noise in the data
– Biological
– Experimental
• Algorithms needed to make discoveries
– Probabilistic methods
– Need for efficiency
![Page 11: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/11.jpg)
Why study bioinformatics?
• The big picture:
– Human health and quality of life
– Fundamental science
• Billions of dollars being spent
– Health research gets the major chunk of the US Govt’s funds
– Fundamental health research is at the molecular level
– Molecular biology research increasingly a quantitative science
![Page 12: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/12.jpg)
Why study bioinformatics?
• Recent issue of Science: top 25 questions
>What Is the Universe Made Of?
>What is the Biological Basis of Consciousness?
>Why Do Humans Have So Few Genes?
>To What Extent Are Genetic Variation and Personal Health Linked?
>Can the Laws of Physics Be Unified?
>How Much Can Human Life Span Be Extended?
>What Controls Organ Regeneration?
>How Can a Skin Cell Become a Nerve Cell?
>How Does a Single Somatic Cell Become a Whole Plant?
>How Does Earth's Interior Work?
>Are We Alone in the Universe?
>How and Where Did Life on Earth Arise?
>What Determines Species Diversity?
>What Genetic Changes Made Us Uniquely Human?
>How Are Memories Stored and Retrieved?
>How Did Cooperative Behavior Evolve?
>How Will Big Pictures Emerge from a Sea of Biological Data?
>How Far Can We Push Chemical Self-Assembly?
>What Are the Limits of Conventional Computing?
>Can We Selectively Shut Off Immune Responses?
>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
>Is an Effective HIV Vaccine Feasible?
>How Hot Will the Greenhouse World Be?
>What Can Replace Cheap Oil -- and When?
>Will Malthus Continue to Be Wrong?
![Page 13: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/13.jpg)
Basic Molecular Biology
![Page 14: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/14.jpg)
Life, Cells, Proteins
• The study of life is the study of cells
• Cells are born, do their job, duplicate, die
– What is “their job”?
– Break down nutrients, produce energy, produce required molecules
• All these processes controlled by proteins
![Page 15: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/15.jpg)
Protein functions
• “Enzymes” (catalysts)
– Control chemical reactions in cell
• Transfer of signals/molecules between and inside cells
– E.g., sensing of environment
• Regulate production of other proteins
![Page 16: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/16.jpg)
Protein molecule
• Protein is a sequence of amino-acids
• 20 possible amino acids
• The amino-acid sequence “folds” into a 3-D structure called protein
![Page 17: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/17.jpg)
Protein Structure
Protein
DNA
The DNA repair protein MutY (blue) bound to DNA (purple).
PNAS cover, courtesy Amie Boal
![Page 18: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/18.jpg)
DNA
• Deoxyribonucleic acid: a molecule that is involved in production of proteins
• Double helical structure (discovered by Watson, Crick, Wilkins & Franklin)
• Chromosomes are densely coiled and packed DNA
![Page 19: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/19.jpg)
SOURCE: http://www.microbe.org/espanol/news/human_genome.asp
Chromosome
DNA
![Page 20: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/20.jpg)
The DNA Molecule
G -- C
A -- T
T -- A
G -- C
C -- G
G -- C
T -- A
G -- C
T -- A
T -- A
A -- T
A -- T
C -- G
T -- A
Base = Nucleotide
5’
3’
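The base pairing on this slide (A with T, G with C) means that one strand determines the other. A minimal sketch of that rule follows. The function names are illustrative, and the left-hand strand is copied from the slide, read top to bottom. Which end of the slide's ladder is 5' is not recoverable here, so `reverse_complement` simply assumes its input is given 5' to 3'.

```python
# Watson-Crick pairing as listed on the slide: A-T and G-C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base paired with each base of `strand`, position by position."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction.

    The two strands of DNA run antiparallel, so the partner of a strand
    given 5'->3' is the complement read backwards.
    """
    return complement(strand)[::-1]


left = "GATGCGTGTTAACT"           # left-hand column of the slide, top to bottom
print(complement(left))           # right-hand column of the slide, top to bottom
print(reverse_complement(left))   # the same partner strand, read the other way
```

Because the pairing is one-to-one, applying `complement` twice returns the original strand.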
![Page 21: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/21.jpg)
SRC:http://www.biologycorner.com/resources/DNA-RNA.gif
Cell
From DNA to Amino-acid sequence
![Page 22: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/22.jpg)
From DNA to Protein: In words
1. DNA = nucleotide sequence
• Alphabet size = 4 (A,C,G,T)
2. DNA --> mRNA (single stranded)
• Alphabet size = 4 (A,C,G,U)
3. mRNA --> amino acid sequence
• Alphabet size = 20
4. Amino acid sequence “folds” into 3-dimensional molecule called protein
![Page 23: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/23.jpg)
Central Dogma
• “Information” flows from DNA to RNA to Protein
• Why “information” ?
• The DNA in a cell has complete information of which proteins will be present in the cell
![Page 24: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/24.jpg)
DNA and genes
• DNA is a very “long” molecule
• Human DNA has 3 billion base-pairs
– A string of 3 billion characters!
• DNA harbors “genes”
– A gene is a substring of the DNA string
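The "DNA as a string, gene as a substring" view maps directly onto ordinary string slicing. The sketch below uses a made-up toy sequence and hypothetical gene coordinates, so neither is real data. It also estimates how much space 3 billion characters would need if each of the 4 letters were packed into 2 bits.

```python
# Toy DNA string (made up); a real genome is billions of characters long.
dna = "TTGACAATGGCCTAAGGCT"

# Hypothetical gene coordinates (0-based, end-exclusive), chosen for illustration.
gene_start, gene_end = 6, 15

gene = dna[gene_start:gene_end]
print(gene)          # the substring occupied by the gene
print(gene in dna)   # a gene is a substring of the DNA string

# Storage estimate: an alphabet of 4 letters needs 2 bits per base.
GENOME_LENGTH = 3_000_000_000
packed_bytes = GENOME_LENGTH * 2 // 8
print(packed_bytes)  # number of bytes if bases are packed 4 per byte
```

Stored as plain text at one byte per character, the same string would take four times as much space.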
![Page 25: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/25.jpg)
Genes code for proteins
• DNA --> mRNA --> protein can actually be written as Gene --> mRNA --> protein
• A gene is typically a few hundred base-pairs (bp) long
![Page 26: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/26.jpg)
Transcription
• Process of making a single stranded mRNA using double stranded DNA as template
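At the string level, transcription can be sketched in two equivalent ways: pair each base of the template strand with its RNA partner, or copy the other (coding) strand and write U in place of T. The function names and the toy sequence below are assumptions made for illustration. The sketch ignores everything the cell does around this step, such as where transcription starts and stops.

```python
# RNA base that pairs with each DNA base of the template strand.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_from_template(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') by pairing each base of the template strand,
    given here in the 3'->5' direction in which it is read."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Shortcut: the mRNA has the coding strand's sequence with U in place of T."""
    return coding_5_to_3.replace("T", "U")


coding = "ATGGCCTAA"      # toy coding strand, 5'->3' (made up)
template = "TACCGGATT"    # its base-paired partner, written 3'->5'

print(transcribe_from_template(template))
print(transcribe_from_coding(coding))
```

Both calls produce the same mRNA, which is why sequence analysis tools usually work with the coding strand directly.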
![Page 27: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/27.jpg)
Step 1: From DNA to mRNA
Transcription
![Page 29: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/29.jpg)
Translation
• Process of making an amino acid sequence from (single stranded) mRNA
• Each triplet of bases translates into one amino acid: each such triplet is called “codon”
• The translation is basically a table lookup
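The "table lookup" can be written out as a dictionary from codons to one-letter amino acid codes. This sketch builds the standard genetic code from the usual compact 64-letter listing in U, C, A, G order, where `*` marks a stop codon. The function name is illustrative. For simplicity it assumes the reading frame starts at the first base, whereas a real translation would first locate a start codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons enumerated in UCAG order
# (UUU, UUC, UUA, UUG, UCU, ...). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon.

    Assumes the reading frame starts at position 0; trailing bases that do
    not form a full codon are ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))   # AUG -> M, GCC -> A, UAA -> stop
```

The table has 64 entries for 20 amino acids, so several codons map to the same amino acid.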
![Page 30: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/30.jpg)
![Page 31: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/31.jpg)
The Genetic Code
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm
![Page 32: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/32.jpg)
Step 2: mRNA to Amino acid sequence
Translation
![Page 33: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/33.jpg)
Review so far
• Proteins: important molecules, amino acid sequences
• DNA: structure, base-pairing.
• Genes: substrings of DNA
• Gene --> mRNA (transcription)
• mRNA --> amino acid sequence (translation), genetic code.
![Page 34: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/34.jpg)
Gene expression
• Process of making a protein from a gene as template
• Transcription, then translation
• Can be regulated
![Page 35: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/35.jpg)
GENE
ACAGTGA
TRANSCRIPTION FACTOR
PROTEIN
Transcriptional regulation
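The figure shows a transcription factor next to the short DNA string ACAGTGA near a gene. A first, deliberately naive way to look for such sites is exact string matching, sketched below. The function name and the example sequence are made up for illustration. Treating a binding site as one exact string is a simplifying assumption, since real sites tolerate variation.

```python
def find_sites(dna: str, motif: str) -> list:
    """Return the 0-based start position of every exact occurrence of `motif`."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)   # step by 1 so overlapping hits count
    return positions


upstream = "GGTACAGTGATTCCACAGTGAAC"   # made-up sequence for illustration
print(find_sites(upstream, "ACAGTGA"))
```

A fuller search would also scan the reverse complement, because a factor can bind either strand.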
![Page 37: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/37.jpg)
The importance of gene regulation
![Page 38: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/38.jpg)
Genetic regulatory network controlling the development of the body plan of the sea urchin embryo
Davidson et al., Science, 295(5560):1669-1678.
![Page 39: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/39.jpg)
• That was the “circuit” responsible for development of the sea urchin embryo
• Nodes = genes
• Switches = gene regulation
• Change the switches and the circuit changes
• Gene regulation significance:
– Development of an organism
– Functioning of the organism
– Evolution of organisms
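One way to make the "nodes and switches" picture concrete is a toy Boolean network, in which each gene is on or off and each switch is a rule computed from the current states. This is a modeling assumption chosen for illustration, not the method of the cited paper. The three genes and their wiring are invented.

```python
# Hypothetical three-gene circuit; names and wiring are invented for illustration.
# Each rule is a "switch": it computes a gene's next state from the current states.
RULES = {
    "geneA": lambda s: s["geneA"],                      # stays as it is
    "geneB": lambda s: s["geneA"],                      # activated by geneA
    "geneC": lambda s: s["geneB"] and not s["geneA"],   # activated by B, repressed by A
}


def step(state: dict, rules: dict) -> dict:
    """Apply every rule to the current state at once and return the next state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}


start = {"geneA": True, "geneB": False, "geneC": False}
print(step(start, RULES))

# "Change the switches and the circuit changes": rewire geneC to follow geneA.
REWIRED = dict(RULES, geneC=lambda s: s["geneA"])
print(step(start, REWIRED))
```

With the same genes and the same starting state, changing a single rule changes which genes are on after one step.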
![Page 40: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/40.jpg)
Genome
• The entire sequence of DNA in a cell
• All cells of an organism have the same genome
– All cells came from repeated duplications starting from initial cell (zygote)
• Human genome is 99.9% identical among individuals
• Human genome is 3 billion base-pairs (bp) long
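The two numbers on this slide combine into a rough count of how many positions differ between two people. The sketch below is back-of-the-envelope arithmetic that assumes the 0.1% difference is counted in single base-pair positions, which is a simplification.

```python
GENOME_LENGTH_BP = 3_000_000_000   # from the slide
IDENTICAL_PER_THOUSAND = 999       # 99.9% identical, kept as an integer ratio

# Positions expected to differ between two individuals.
differing_bp = GENOME_LENGTH_BP * (1000 - IDENTICAL_PER_THOUSAND) // 1000
print(differing_bp)   # 3000000
```

So even at 99.9% identity, two genomes differ at roughly 3 million positions.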
![Page 41: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha](https://reader036.vdocument.in/reader036/viewer/2022081519/56649d245503460f949fa8fc/html5/thumbnails/41.jpg)
Genome features
• Genes
• Regulatory sequences
• The above two make up 5% of human genome
• What’s the rest doing?
– We don’t know for sure
• “Annotating” the genome
– Task of bioinformatics