BNFO 615 Data Analysis in Bioinformatics
Instructor: Zhi Wei
Outline Cell Genome Gene mRNA Proteins Systems biology
Cells
Fundamental working units of every living system.
Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells.
Prokaryotes and eukaryotes are descended from the same primitive cell.
All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.
Prokaryotes vs. Eukaryotes
Different structures
Different components
Different biological processes
Prokaryotes vs. Eukaryotes
Prokaryotes                                  Eukaryotes
Single cell                                  Single or multi cell
No nucleus                                   Nucleus
No membrane-bound organelles                 Membrane-bound organelles
One piece of circular DNA                    Chromosomes
No mRNA post-transcriptional modification    Exon/intron splicing
Prokaryotes vs. Eukaryotes
Prokaryotes: bacteria, archaea
Example: E. coli cell
  5 × 10^6 base pairs
  > 90% of DNA encodes protein
  5400 genes
  Lacks a membrane-bound nucleus
  Circular DNA
  Histones are unknown
Eukaryotes: plants, animals, protista, and fungi
Example: yeast cell
  12.4 × 10^6 base pairs
  A small fraction of the total DNA encodes protein; many repeats of non-coding sequences
  5800 genes
  All chromosomes are contained in a membrane-bound nucleus
  DNA is divided between 16 chromosomes
  A set of five histones: DNA packaging and gene expression regulation
Cells: chemical composition
Chemical composition by weight
  70% water
  7% small molecules: salts, lipids, amino acids, nucleotides
  23% macromolecules: proteins, polysaccharides, lipids
We have different cells
Cells differ in size, shape, and weight.
Q: What is the biggest cell in the human body?
Cell Cycle
Born, eat, replicate, and die
Lodish et al. Molecular Cell Biology (5th ed.). W.H. Freeman & Co., 2003.
Sexual Reproduction vs. Cell Division
Cell division: cells reproduce by duplicating their contents and dividing in two.
Sexual reproduction
  Formation of a new individual by a combination of two haploid sex cells (gametes).
  Gametes for fertilization usually come from separate parents.
  Both gametes are haploid, with a single set of chromosomes. The new individual is called a zygote, with two sets of chromosomes (diploid).
  Meiosis is a process that converts a diploid cell to a haploid gamete and causes a change in the genetic information to increase diversity in the offspring.
Meiosis vs. mitotic cell division
Outline Cell Genome Gene mRNA Proteins Systems biology
Genome
A genome is an organism’s complete set of DNA (including its genes).
However, in humans less than 3% of the genome actually encodes genes.
Part of the rest of the genome serves as control regions (though that is also a small part).
The function of the rest of the genome is unknown (junk DNA? An open question).
Comparison of Different Organisms
Organism    Genome size (bp)    Num. of genes
E. coli     0.05 × 10^8         5,400
Yeast       0.12 × 10^8         5,800
Worm        0.15 × 10^8         18,400
Fly         1.8 × 10^8          13,600
Human       30 × 10^8           25,000
Plant       1.3 × 10^8          25,000
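The comparison can be made concrete by computing gene density (genes per megabase). The following is a minimal sketch that uses three rows of the table as given on the slide; the function name `genes_per_megabase` is an illustrative choice, not part of the course material.

```python
# Genome size (bp) and gene count, copied from three rows of the slide's table.
genomes = {
    "E. coli": (0.05e8, 5_400),
    "Yeast": (0.12e8, 5_800),
    "Human": (30e8, 25_000),
}

def genes_per_megabase(size_bp, n_genes):
    """Gene density: number of genes per 10^6 base pairs."""
    return n_genes / (size_bp / 1e6)

density = {name: genes_per_megabase(*row) for name, row in genomes.items()}

for name, value in density.items():
    print(f"{name}: {value:.1f} genes per Mb")
```

With these numbers the bacterial genome comes out more than a hundred times denser in genes than the human genome, which matches the point that only a small fraction of human DNA is coding.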
Outline Cell Genome Gene mRNA Proteins Systems biology
What is a gene?
Genomic DNA: promoter, protein coding sequence, terminator
DNA: deoxyribonucleic acid
Example of a Gene: Gal4 DNA
ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAGTGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTACTCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGGCTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATTTTGAAAATGGATTCTTTACAGGATATAAAAGCATTGTTAACAGGATTATTTGTACAAGATAATGTGAATAAAGATGCCGTCACAGATAGATTGGCTTCAGTGGAGACTGATATGCCTCTAACATTGAGACAGCATAGAATAAGTGCGACATCATCATCGGAAGAGAGTAGTAACAAAGGTCAAAGACAGTTGACTGTATCGATTGACTCGGCAGCTCATCATGATAACTCCACAATTCCGTTGGATTTTATGCCCAGGGATGCTCTTCATGGATTTGATTGGTCTGAAGAGGATGACATGTCGGATGGCTTGCCCTTCCTGAAAACGGACCCCAACAATAATGGGTTCTTTGGCGACGGTTCTCTCTTATGTATTCTTCGATCTATTGGCTTTAAACCGGAAAATTACACGAACTCTAACGTTAACAGGCTCCCGACCATGATTACGGATAGATACACGTTGGCTTCTAGATCCACAACATCCCGTTTACTTCAAAGTTATCTCAATAATTTTCACCCCTACTGCCCTATCGTGCACTCACCGACGCTAATGATGTTGTATAATAACCAGATTGAAATCGCGTCGAAGGATCAATGGCAAATCCTTTTTAACTGCATATTAGCCATTGGAGCCTGGTGTATAGAGGGGGAATCTACTGATATAGATGTTTTTTACTATCAAAATGCTAAATCTCATTTGACGAGCAAGGTCTTCGAGTCA
A sequence of A, C, G, T
Example of a Gene: Gal4 AA
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLLIFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKGQRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDMSDGLPFLKTDPNNNGFFGDGSLLCILRSIGFKPENYTNSNVNRLPTMITDRYTLASRSTTSRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQILFNCILAIGAWCIEGESTDIDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHSFSIRMAISLGLNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTTTGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQMDISTTALTNLLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMSVSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSNAENNETAQLLQQINTVLMLLKKLATFKIQTCEKYIQVLEEVCAPFLLSQCAIPLPHISYNNSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSVGPSPVPLKSGASFSDLVKLLSNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANFNQSGNIADSS
A sequence over the 20 amino acids {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
The Central Dogma
DNA → RNA: Gene Transcription (promoter)
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription: transcription factor, binding site, RNA polymerase
Transcription factors recognize transcription factor binding sites and bind to them, forming a complex.
RNA polymerase binds the complex.
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
The two strands are separated.
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template.
DNA  5’ G A T T A C A . . . 3’
DNA  3’ C T A A T G T . . . 5’
RNA  5’ G A U U A C A 3’
Gene Transcription
pre-mRNA  5’ G A U U A C A . . . 3’
DNA       5’ G A T T A C A . . . 3’
DNA       3’ C T A A T G T . . . 5’
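The transcription slides can be mimicked in a few lines of code. This is a minimal sketch, assuming the template strand is written 3’→5’ as in the slides; the function name `transcribe` and the lookup table are illustrative choices.

```python
# Pairing rules used when RNA is built against a DNA template:
# A pairs with U (not T) in RNA, T pairs with A, G with C, C with G.
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Return the RNA (5'->3') built against a DNA template read 3'->5'."""
    return "".join(RNA_PAIR[base] for base in template_3_to_5)

template = "CTAATGT"          # template strand from the slides, written 3'->5'
rna = transcribe(template)
print(rna)                    # GAUUACA
```

The result equals the other DNA strand (GATTACA) with T replaced by U, which is why the RNA is described as a copy of the 5’→3’ sequence.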
RNA Processing (Eukaryotes): 5’ cap, polyadenylation, exon, intron, splicing, UTR
[Figure: pre-mRNA with 5’ cap, exons, introns, and poly(A) tail; spliced mRNA with 5’ UTR and 3’ UTR]
Mammalian Gene Structure
[Figure: gene from 5’ to 3’ with promoter, 5’ UTR, exons (coding) separated by introns (non-coding), and 3’ UTR]
Only 1.5% of human DNA is coding!
Regulatory regions: up to 50 kb upstream of the +1 site
Exons: protein coding and untranslated regions (UTR)
  1 to 178 exons per gene (mean 8.8)
  8 bp to 17 kb per exon (mean 145 bp)
Introns: splice donor (GU) and acceptor (AG) sites, junk DNA
  average 1 kb – 50 kb per intron
Gene size: largest 2.4 Mb (Dystrophin); mean 27 kb.
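Splicing can be illustrated with a toy pre-mRNA. The sketch below assumes the intron positions are already known and only checks that each intron starts with GU and ends with AG; the sequence, the coordinates, and the function name `splice` are made up for illustration and are not taken from a real gene.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open intervals and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        # Canonical introns begin with GU and end with AG.
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"non-canonical intron at {start}-{end}: {intron}")
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)

# Toy transcript: exon 1 + intron + exon 2 (hypothetical sequence).
pre_mrna = "AUGGCU" + "GUAAGUCCUUAG" + "UGUUUACGA"
mature = splice(pre_mrna, [(6, 18)])
print(mature)  # AUGGCUUGUUUACGA
```

Real splice site prediction is much harder, because GU and AG also occur by chance throughout the sequence.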
Identifying Genes in Sequence Data
Predicting the start and end of genes, as well as the introns and exons in each gene, is one of the basic problems in computational biology.
Gene prediction methods look for ORFs (open reading frames).
  These are (relatively long) DNA segments that start with the start codon, end with one of the stop codons, and do not contain any other in-frame stop codon in between.
Splice site prediction has received a lot of attention in the literature.
Comparative genomics
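The ORF definition above translates directly into a scan over the three forward reading frames. This is a simplified sketch, not a real gene finder: it ignores the reverse strand and introns, and the `min_codons` threshold and the example sequence are arbitrary illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the index of the ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs

dna = "CCATGGCTTGTTTACGAATTTAGGC"   # hypothetical example sequence
orfs = find_orfs(dna)
for start, end, frame in orfs:
    print(frame, dna[start:end])    # 2 ATGGCTTGTTTACGAATTTAG
```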
Outline Cell Genome Gene mRNA Proteins Systems biology
RNA
RNA is similar to DNA chemically. It is usually only a single strand. T(hymine) is replaced by U(racil).
Some forms of RNA can form secondary structures by “pairing up” with themselves. This can change their properties dramatically.
DNA and RNA can pair with each other.
tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
RNA, continued
Several types exist, classified by function:
mRNA – messenger RNA. This is what is usually being referred to when a bioinformatician says “RNA”. It carries a gene’s message out of the nucleus.
tRNA – transfer RNA. An adaptor that reads mRNA codons and delivers the matching amino acids during translation.
rRNA – ribosomal RNA. Part of the ribosome, which is involved in translation.
Messenger RNA
Basically, an intermediate product
Transcribed from the genome and translated into protein
Number of copies correlates well with number of proteins for the gene.
Unlike DNA, the amount of messenger RNA (as well as the number of proteins) differs between different cell types and under different conditions.
Complementary base-pairing
mRNA is transcribed from the DNA.
mRNA (like DNA, but unlike proteins) binds to its complement.
Quantify mRNA levels
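Because an mRNA binds to its complement, a complementary DNA strand can serve as a probe for it. The following is a minimal sketch of computing such a probe; the function name `complementary_dna_probe` is hypothetical, and the short sequence is the toy transcript from the transcription slides.

```python
# DNA base that pairs with each RNA base.
DNA_PAIR = {"A": "T", "U": "A", "G": "C", "C": "G"}

def complementary_dna_probe(mrna):
    """Return the DNA strand (5'->3') that base-pairs with the given mRNA (5'->3')."""
    # Paired strands run antiparallel, so the complement is read in reverse.
    return "".join(DNA_PAIR[base] for base in reversed(mrna))

probe = complementary_dna_probe("GAUUACA")
print(probe)  # TGTAATC
```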
Outline Cell Genome Gene mRNA Proteins Systems biology
Proteins: Workhorses of the Cell
Proteins are polypeptide chains of amino acids.
20 different amino acids
  Different chemical properties cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
Proteins do all essential work for the cell
  Build cellular structures
  Digest nutrients
  Execute metabolic functions
  Mediate information flow within a cell and among cellular communities
Genes Make Proteins
genome -> genes -> proteins (form cellular structures and carry out life functions) -> pathways & physiology
Genes Encode for Proteins
The genetic code (rows: first letter; columns: second letter; last column: third letter)
First  U                              C                    A                        G                     Third
U      UUU Phenylalanine (Phe)        UCU Serine (Ser)     UAU Tyrosine (Tyr)       UGU Cysteine (Cys)    U
U      UUC Phe                        UCC Ser              UAC Tyr                  UGC Cys               C
U      UUA Leucine (Leu)              UCA Ser              UAA STOP                 UGA STOP              A
U      UUG Leu                        UCG Ser              UAG STOP                 UGG Tryptophan (Trp)  G
C      CUU Leucine (Leu)              CCU Proline (Pro)    CAU Histidine (His)      CGU Arginine (Arg)    U
C      CUC Leu                        CCC Pro              CAC His                  CGC Arg               C
C      CUA Leu                        CCA Pro              CAA Glutamine (Gln)      CGA Arg               A
C      CUG Leu                        CCG Pro              CAG Gln                  CGG Arg               G
A      AUU Isoleucine (Ile)           ACU Threonine (Thr)  AAU Asparagine (Asn)     AGU Serine (Ser)      U
A      AUC Ile                        ACC Thr              AAC Asn                  AGC Ser               C
A      AUA Ile                        ACA Thr              AAA Lysine (Lys)         AGA Arginine (Arg)    A
A      AUG Methionine (Met) or START  ACG Thr              AAG Lys                  AGG Arg               G
G      GUU Valine (Val)               GCU Alanine (Ala)    GAU Aspartic acid (Asp)  GGU Glycine (Gly)     U
G      GUC Val                        GCC Ala              GAC Asp                  GGC Gly               C
G      GUA Val                        GCA Ala              GAA Glutamic acid (Glu)  GGA Gly               A
G      GUG Val                        GCG Ala              GAG Glu                  GGG Gly               G
Triplet → one amino acid
4^3 = 64 combinations mapped to 20 amino acids (plus STOP)
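The codon table can be used to check that the Gal4 DNA shown earlier really encodes the Gal4 protein. The sketch below builds the standard genetic code in DNA letters (T instead of U) and translates the first 14 codons of the Gal4 sequence; the compact table construction and the function name `translate` are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks STOP.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence from its first base, stopping at a STOP codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# First 42 bases of the Gal4 DNA from the earlier slide.
gal4_prefix = "ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGC"
peptide = translate(gal4_prefix)
print(peptide)  # MKLLSSIEQACDIC
```

The output matches the first 14 residues of the Gal4 amino acid sequence, and the table has 64 entries covering 20 amino acids plus STOP.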
Open Reading Frames
G C U U G U U U A C G A A U U A G
Frame 1: GCU UGU UUA CGA AUU AG
Frame 2: G CUU GUU UAC GAA UUA G
Frame 3: GC UUG UUU ACG AAU UAG
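The same sequence can be read in three different frames depending on where the first codon starts. This is a minimal sketch of that idea for the slide's sequence; the function name `codons_in_frame` is an illustrative choice.

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

rna = "GCUUGUUUACGAAUUAG"
frames = [codons_in_frame(rna, offset) for offset in range(3)]

for offset, codons in enumerate(frames):
    print(f"Frame {offset + 1}:", " ".join(codons))
```

Only the third frame ends in a stop codon (UAG) here, which shows why the choice of frame matters when looking for open reading frames.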
Synonymous Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
Substitution: A → G (UUA → UUG)
G C U U G U U U G C G A A U U A G
Ala Cys Leu Arg Ile
Missense Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
Substitution: U → G (UGU → UGG)
G C U U G G U U A C G A A U U A G
Ala Trp Leu Arg Ile
Nonsense Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
Substitution: U → A (UGU → UGA)
G C U U G A U U A C G A A U U A G
Ala STOP
Frameshift
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
Deletion of one U shifts the reading frame
G C U U G U U A C G A A U U A G
Ala Cys Tyr Glu Leu
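The four mutation types can be told apart by translating the sequence before and after the change. The sketch below is a simplified classifier under stated assumptions: translation starts at the first base, there is one mutation per sequence, and the names `translate` and `classify` are illustrative.

```python
from itertools import product

# Standard genetic code in RNA letters; "*" marks a STOP codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate from the first base; the result ends with '*' if a STOP is reached."""
    protein = ""
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        protein += amino_acid
        if amino_acid == "*":
            break
    return protein

def classify(original, mutated):
    """Label a mutation by comparing lengths and translated proteins."""
    if len(original) != len(mutated):
        shifted = abs(len(original) - len(mutated)) % 3 != 0
        return "frameshift" if shifted else "in-frame indel"
    before, after = translate(original), translate(mutated)
    if before == after:
        return "synonymous"
    # A new STOP that cuts the protein short is a nonsense mutation.
    if after.endswith("*") and len(after) <= len(before.rstrip("*")):
        return "nonsense"
    return "missense"

original = "GCUUGUUUACGAAUUAG"
examples = {
    "GCUUGUUUGCGAAUUAG": None,   # UUA -> UUG
    "GCUUGGUUACGAAUUAG": None,   # UGU -> UGG
    "GCUUGAUUACGAAUUAG": None,   # UGU -> UGA
    "GCUUGUUACGAAUUAG": None,    # one U deleted
}
for mutated in examples:
    examples[mutated] = classify(original, mutated)
    print(mutated, examples[mutated])
```

Run on the sequences from the slides, this labels them synonymous, missense, nonsense, and frameshift, in that order.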
Protein Structure
Proteins work together with other proteins or nucleic acids as "molecular machines"
  Structures fit together and function in highly specific, lock-and-key ways.
Four levels of structure:
  Primary structure: the sequence of the protein
  Secondary structure: local structure in regions of the chain (alpha helix, beta sheet)
  Tertiary structure: three-dimensional structure
  Quaternary structure: multiple subunits
Assigning Function to Proteins
While 25,000 genes have been identified in the human genome, relatively few have known functional annotation.
Determining the function of a protein can be done in several ways:
  Sequence similarity to other (known) proteins
  Using domain information
  Using three-dimensional structure
  Based on high-throughput experiments (when it functions and what it interacts with)
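As a rough illustration of sequence similarity, percent identity counts the positions at which two sequences agree. The sketch below assumes the two sequences are already aligned and of equal length; real tools first compute an alignment. The second sequence is a made-up variant of the first ten Gal4 residues.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues in two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

known = "MKLLSSIEQA"      # first ten residues of Gal4
unknown = "MKLLSAIEQA"    # hypothetical query with one substitution
identity = percent_identity(known, unknown)
print(f"{identity:.1f}% identical")  # 90.0% identical
```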
Summary: DNA (Gene) → RNA → Protein
Replication: DNA → DNA
Transcription: DNA → RNA
Translation: RNA → Protein
Outline Cell Genome Gene mRNA Proteins Systems biology
Biological pathway/gene networks
Instead of having brains, cells make decisions through complex networks of interactions, called pathways
  Synthesize new materials
  Break other materials down for spare parts
  Signal to eat or die
In order to fulfill their function, proteins interact with other proteins in a number of ways, including:
  Regulation
  Signaling
  Pathways, for example A -> B -> C
  Post-translational modifications
  Forming protein complexes
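A pathway such as A -> B -> C is naturally represented as a directed graph. The sketch below stores the toy pathway as an adjacency list and collects everything downstream of a given protein; the dictionary layout and the function name `downstream` are illustrative choices, and A, B, C are the placeholder names from the slide.

```python
from collections import deque

# Each protein maps to the proteins it acts on directly.
pathway = {"A": ["B"], "B": ["C"], "C": []}

def downstream(graph, start):
    """Return all nodes reachable from start, in breadth-first order."""
    reached = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in reached:
            continue                      # already visited (handles cycles)
        reached.append(node)
        queue.extend(graph.get(node, []))
    return reached

targets = downstream(pathway, "A")
print(targets)  # ['B', 'C']
```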
An Example
Systems Biology
We now have many sources of data, each providing a different view on the activity in the cell
  Sequence (genes)
  DNA motifs
  Gene expression
  Protein interactions
  Protein-DNA interaction
  Etc.
Putting it all together: Systems Biology
Next week
Introduction to R programming
You need to do in-class exercises
Acknowledgments
Ziv Bar-Joseph: for some of the slides adapted or modified from his lecture slides at Carnegie Mellon University
Neil Jones: for some of the slides adapted or modified from his slides for the book An Introduction to Bioinformatics Algorithms