BNFO 615 Data Analysis in Bioinformatics Instructor Zhi Wei

Upload: bernardo-royle

Post on 14-Dec-2015

BNFO 615 Data Analysis in Bioinformatics

Instructor: Zhi Wei

Outline
- Cell
- Genome
- Gene
- mRNA
- Proteins
- Systems biology

Cells
- Fundamental working units of every living system.
- Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells.
- Prokaryotes and eukaryotes are descended from the same primitive cell.
- All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.

Prokaryotes vs. Eukaryotes
- Different structures
- Different components
- Different biological processes

Prokaryotes vs. Eukaryotes

Prokaryotes                                  Eukaryotes
Single cell                                  Single or multi cell
No nucleus                                   Nucleus
No organelles                                Organelles
One piece of circular DNA                    Chromosomes
No mRNA post-transcriptional modification    Exon/intron splicing

Prokaryotes vs. Eukaryotes

Prokaryotes (bacteria, archaea)
Example: E. coli cell
- 5 × 10^6 base pairs
- > 90% of DNA encodes protein
- 5,400 genes
- Lacks a membrane-bound nucleus
- Circular DNA
- Histones are unknown

Eukaryotes (plants, animals, protista, and fungi)
Example: yeast cell
- 12.4 × 10^6 base pairs
- A small fraction of the total DNA encodes protein; many repeats of non-coding sequences
- 5,800 genes
- All chromosomes are contained in a membrane-bound nucleus
- DNA is divided between 16 chromosomes
- A set of five histones: DNA packaging and gene expression regulation

Cells: chemical composition

Chemical composition by weight:
- 70% water
- 7% small molecules: salts, lipids, amino acids, nucleotides
- 23% macromolecules: proteins, polysaccharides, lipids

We have different cells
- Cells differ in size, shape, and weight.
- Q: What is the biggest cell in the human body?

Cell Cycle
- Born, eat, replicate, and die

Lodish et al. Molecular Cell Biology (5th ed.). W.H. Freeman & Co., 2003.

Sexual Reproduction vs. Cell Division
- Cell division: cells reproduce by duplicating their contents and dividing in two.
- Sexual reproduction:
  - Formation of a new individual by a combination of two haploid sex cells (gametes).
  - Gametes for fertilization usually come from separate parents.
  - Both gametes are haploid, with a single set of chromosomes.
  - The new individual is called a zygote, with two sets of chromosomes (diploid).
  - Meiosis converts a diploid cell into haploid gametes and changes the genetic information to increase diversity in the offspring.

Meiosis vs. Mitotic Cell Division

Genome
- A genome is an organism’s complete set of DNA (including its genes).
- However, in humans less than 3% of the genome actually encodes genes.
- Part of the rest of the genome serves as control regions (though that is also a small part).
- The function of the rest of the genome is unknown (junk DNA? An open question).

Comparison of Different Organisms

Organism   Genome size (bp)   Num. of genes
E. coli    0.05 × 10^8        5,400
Yeast      0.12 × 10^8        5,800
Worm       1.0 × 10^8         18,400
Fly        1.8 × 10^8         13,600
Human      30 × 10^8          25,000
Plant      1.3 × 10^8         25,000
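The comparison above suggests that gene count does not scale with genome size. A minimal sketch of that arithmetic for three of the listed organisms follows; the function name `genes_per_megabase` is only an illustrative choice, and the numbers are the ones quoted in the slides.

```python
# Genome size (bp) and gene count for three organisms, as quoted in the slides.
genomes = {
    "E. coli": (0.05e8, 5_400),
    "Yeast": (0.12e8, 5_800),
    "Human": (30e8, 25_000),
}


def genes_per_megabase(genome_size_bp, num_genes):
    """Average number of genes per million base pairs."""
    return num_genes / (genome_size_bp / 1e6)


density = {
    name: genes_per_megabase(size, genes)
    for name, (size, genes) in genomes.items()
}

for name, value in density.items():
    print(f"{name}: {value:.1f} genes per Mb")
```

With these figures the gene density drops by roughly two orders of magnitude from E. coli to human, which matches the point that only a small fraction of the human genome is coding.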

What is a gene?

Genomic DNA: [ Promoter | Protein coding sequence | Terminator ]

DNA: deoxyribonucleic acid

Example of a Gene: Gal4 DNA

ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAGTGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTACTCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGGCTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATTTTGAAAATGGATTCTTTACAGGATATAAAAGCATTGTTAACAGGATTATTTGTACAAGATAATGTGAATAAAGATGCCGTCACAGATAGATTGGCTTCAGTGGAGACTGATATGCCTCTAACATTGAGACAGCATAGAATAAGTGCGACATCATCATCGGAAGAGAGTAGTAACAAAGGTCAAAGACAGTTGACTGTATCGATTGACTCGGCAGCTCATCATGATAACTCCACAATTCCGTTGGATTTTATGCCCAGGGATGCTCTTCATGGATTTGATTGGTCTGAAGAGGATGACATGTCGGATGGCTTGCCCTTCCTGAAAACGGACCCCAACAATAATGGGTTCTTTGGCGACGGTTCTCTCTTATGTATTCTTCGATCTATTGGCTTTAAACCGGAAAATTACACGAACTCTAACGTTAACAGGCTCCCGACCATGATTACGGATAGATACACGTTGGCTTCTAGATCCACAACATCCCGTTTACTTCAAAGTTATCTCAATAATTTTCACCCCTACTGCCCTATCGTGCACTCACCGACGCTAATGATGTTGTATAATAACCAGATTGAAATCGCGTCGAAGGATCAATGGCAAATCCTTTTTAACTGCATATTAGCCATTGGAGCCTGGTGTATAGAGGGGGAATCTACTGATATAGATGTTTTTTACTATCAAAATGCTAAATCTCATTTGACGAGCAAGGTCTTCGAGTCA

A sequence of A, C, G, T

Example of a Gene: Gal4 AA

MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLLIFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKGQRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDMSDGLPFLKTDPNNNGFFGDGSLLCILRSIGFKPENYTNSNVNRLPTMITDRYTLASRSTTSRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQILFNCILAIGAWCIEGESTDIDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHSFSIRMAISLGLNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTTTGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQMDISTTALTNLLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMSVSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSNAENNETAQLLQQINTVLMLLKKLATFKIQTCEKYIQVLEEVCAPFLLSQCAIPLPHISYNNSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSVGPSPVPLKSGASFSDLVKLLSNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANFNQSGNIADSS

A sequence over the 20 amino acids {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
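To connect the two views of Gal4, here is a small sketch that translates the first 27 bases of the DNA sequence shown earlier and checks that they give the first nine residues of the amino acid sequence. It assumes the standard genetic code; the names `CODON_TABLE` and `translate` are illustrative, not from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


gal4_dna_start = "ATGAAGCTACTGTCTTCTATCGAACAA"  # first 27 bases of the Gal4 DNA above
gal4_protein_start = translate(gal4_dna_start)
print(gal4_protein_start)
```

The result should match the start of the protein sequence on the slide (MKLLSSIEQ...).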

The Central Dogma

DNA → RNA: Gene Transcription

[Diagram label: promoter]

5’  G A T T A C A . . .  3’
3’  C T A A T G T . . .  5’

Gene Transcription
Key terms: transcription factor, binding site, RNA polymerase

5’  G A T T A C A . . .  3’
3’  C T A A T G T . . .  5’

- Transcription factors recognize transcription factor binding sites and bind to them, forming a complex.
- RNA polymerase binds the complex.

Gene Transcription

The two strands are separated.

5’  G A T T A C A . . .  3’

3’  C T A A T G T . . .  5’

Gene Transcription

An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template.

DNA  5’  G A T T A C A . . .  3’

DNA  3’  C T A A T G T . . .  5’
RNA  5’  G A U U A C A        3’
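The transcription step can be sketched in a few lines: each base of the template strand (read 3’→5’) is paired with its RNA complement, which yields the same sequence as the 5’→3’ strand with T replaced by U. The function name `transcribe` is an illustrative assumption.

```python
# RNA base that pairs with each DNA base on the template strand.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA (written 5'->3') made from a template written 3'->5'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3_to_5)


coding_strand = "GATTACA"    # top strand on the slide, 5'->3'
template_strand = "CTAATGT"  # bottom strand on the slide, 3'->5'

rna = transcribe(template_strand)
print(rna)

# The RNA matches the 5'->3' DNA strand with T replaced by U.
same_as_coding = rna == coding_strand.replace("T", "U")
```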

Gene Transcription

pre-mRNA  5’  G A U U A C A . . .  3’

DNA       5’  G A T T A C A . . .  3’
DNA       3’  C T A A T G T . . .  5’

RNA Processing (Eukaryotes)
Key terms: 5’ cap, polyadenylation, exon, intron, splicing, UTR

[Diagram: pre-mRNA with exons and introns, processed into mRNA with a 5’ cap, 5’ UTR, 3’ UTR, and poly(A) tail]
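Splicing can be illustrated with a toy example: introns are cut out of the pre-mRNA and the exons are joined. The sequence below is made up for illustration, the intron coordinates are given by hand rather than predicted, and the only biological rule checked is that an intron typically begins with GU and ends with AG.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as (start, end) half-open index pairs, and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GU...AG boundaries")
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Hypothetical toy transcript: two exons separated by one intron.
exon1 = "AUGGCU"
intron = "GUAAGUUUUCAG"
exon2 = "UGUUAA"
pre_mrna = exon1 + intron + exon2

intron_coords = [(len(exon1), len(exon1) + len(intron))]
mature_mrna = splice(pre_mrna, intron_coords)
print(mature_mrna)
```

Capping and polyadenylation are left out of this sketch; they add a modified nucleotide at the 5’ end and a run of A residues at the 3’ end.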

Mammalian Gene Structure

[Diagram, 5’ to 3’: promoter, 5’ UTR, exons separated by introns, 3’ UTR; regions marked as coding or non-coding]

Only 1.5% of human DNA is coding!

- Regulatory regions: up to 50 kb upstream of the +1 site
- Exons: protein-coding and untranslated regions (UTR)
  - 1 to 178 exons per gene (mean 8.8)
  - 8 bp to 17 kb per exon (mean 145 bp)
- Introns: splice donor (GU) and acceptor (AG) sites, junk DNA
  - average 1 kb to 50 kb per intron
- Gene size: largest is 2.4 Mb (dystrophin); mean is 27 kb.
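A quick back-of-the-envelope calculation with the means quoted above shows how little of a typical gene is exon. Multiplying two means is only a rough estimate, not an exact statistic.

```python
# Means quoted on the slide.
mean_exons_per_gene = 8.8
mean_exon_length_bp = 145
mean_gene_length_bp = 27_000

# Rough estimate of exonic sequence in an average gene.
mean_exonic_bp = mean_exons_per_gene * mean_exon_length_bp
exonic_fraction = mean_exonic_bp / mean_gene_length_bp

print(f"about {mean_exonic_bp:.0f} bp of exon per gene, "
      f"or {exonic_fraction:.1%} of the mean gene length")
```

Under these assumptions an average gene has roughly 1.3 kb of exon within 27 kb, so the large majority of the gene is intron.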

Identifying Genes in Sequence Data
- Predicting the start and end of genes, as well as the introns and exons in each gene, is one of the basic problems in computational biology.
- Gene prediction methods look for ORFs (open reading frames).
  - These are (relatively long) DNA segments that start with the start codon, end with one of the stop codons, and do not contain any other in-frame stop codon in between.
- Splice site prediction has received a lot of attention in the literature.
- Comparative genomics
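The ORF definition above translates fairly directly into code. The following is a minimal sketch that scans only the forward strand in all three frames; real gene finders also scan the reverse strand and handle introns. The function name `find_orfs`, the `min_codons` threshold, and the test sequence are illustrative assumptions.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive
    and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Made-up sequence: two padding bases, an ORF of seven codons, two padding bases.
sequence = "CC" + "ATG" + "GCT" + "TGT" + "TTA" + "CGA" + "ATT" + "TAG" + "GC"
orfs = find_orfs(sequence)
for start, end, frame in orfs:
    print(frame, sequence[start:end])
```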

RNA
- RNA is chemically similar to DNA. It is usually only a single strand, and T(hymine) is replaced by U(racil).
- Some forms of RNA can form secondary structures by “pairing up” with themselves. This can change their properties dramatically.
- DNA and RNA can pair with each other.

tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

RNA, continued
- Several types exist, classified by function:
  - mRNA (messenger RNA): this is what is usually being referred to when a bioinformatician says “RNA”. It carries a gene’s message out of the nucleus.
  - tRNA (transfer RNA): transfers genetic information from mRNA to an amino acid sequence.
  - rRNA (ribosomal RNA): part of the ribosome, which is involved in translation.

Messenger RNA
- Basically, an intermediate product
- Transcribed from the genome and translated into protein
- The number of copies correlates well with the number of proteins for the gene.
- Unlike DNA, the amount of messenger RNA (as well as the number of proteins) differs between different cell types and under different conditions.

Complementary base-pairing
- mRNA is transcribed from the DNA.
- mRNA (like DNA, but unlike proteins) binds to its complement.
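Because an mRNA binds to its complement, a strand that pairs with a given mRNA can be computed as its reverse complement. The sketch below works on RNA letters only; the helper names are illustrative, and the example sequence is the one used on the transcription slides.

```python
# Watson-Crick pairs for RNA: A-U and G-C.
RNA_COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(rna):
    """Return the strand (written 5'->3') that pairs with the given RNA."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def can_pair(strand_a, strand_b):
    """True if the two strands are perfect reverse complements of each other."""
    return reverse_complement(strand_a) == strand_b


mrna = "GAUUACA"
probe = reverse_complement(mrna)
print(probe, can_pair(mrna, probe))
```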

Quantify mRNA levels

Proteins: Workhorses of the Cell
- Proteins are polypeptide chains of amino acids.
- 20 different amino acids: their different chemical properties cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
- Proteins do all essential work for the cell:
  - build cellular structures
  - digest nutrients
  - execute metabolic functions
  - mediate information flow within a cell and among cellular communities

Genes Make Proteins

genome -> genes -> proteins (form cellular structures and carry out life functions) -> pathways & physiology

Genes Encode for Proteins

        Second letter
First   U                 C           A           G           Third
U       UUU Phe           UCU Ser     UAU Tyr     UGU Cys     U
        UUC Phe           UCC Ser     UAC Tyr     UGC Cys     C
        UUA Leu           UCA Ser     UAA STOP    UGA STOP    A
        UUG Leu           UCG Ser     UAG STOP    UGG Trp     G
C       CUU Leu           CCU Pro     CAU His     CGU Arg     U
        CUC Leu           CCC Pro     CAC His     CGC Arg     C
        CUA Leu           CCA Pro     CAA Gln     CGA Arg     A
        CUG Leu           CCG Pro     CAG Gln     CGG Arg     G
A       AUU Ile           ACU Thr     AAU Asn     AGU Ser     U
        AUC Ile           ACC Thr     AAC Asn     AGC Ser     C
        AUA Ile           ACA Thr     AAA Lys     AGA Arg     A
        AUG Met (START)   ACG Thr     AAG Lys     AGG Arg     G
G       GUU Val           GCU Ala     GAU Asp     GGU Gly     U
        GUC Val           GCC Ala     GAC Asp     GGC Gly     C
        GUA Val           GCA Ala     GAA Glu     GGA Gly     A
        GUG Val           GCG Ala     GAG Glu     GGG Gly     G

Abbreviations: Phe = Phenylalanine, Ser = Serine, Tyr = Tyrosine, Cys = Cysteine, Leu = Leucine, Trp = Tryptophan, Pro = Proline, His = Histidine, Arg = Arginine, Gln = Glutamine, Ile = Isoleucine, Thr = Threonine, Asn = Asparagine, Lys = Lysine, Met = Methionine, Val = Valine, Ala = Alanine, Asp = Aspartic acid, Glu = Glutamic acid, Gly = Glycine

Triplet → one amino acid: 4^3 = 64 combinations mapped to 20 amino acids
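The mapping of 64 triplets onto 20 amino acids can be checked with a short sketch. It encodes the standard genetic code as a 64-character string in U, C, A, G order (one-letter amino acid codes, `*` for STOP); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (the degeneracy of the code).
codons_per_residue = Counter(CODON_TABLE.values())
num_stop_codons = codons_per_residue.pop("*")

print(len(CODON_TABLE), "codons,", num_stop_codons, "stop codons,",
      len(codons_per_residue), "amino acids")
print("Leu:", codons_per_residue["L"], "codons; Met:", codons_per_residue["M"])
```

The redundancy is uneven: leucine, serine, and arginine each have six codons, while methionine and tryptophan have one.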

Open Reading Frames

G C U U G U U U A C G A A U U A G

[Figure: the sequence is shown repeatedly with different reading frames marked]
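The three forward reading frames of the sequence above can be enumerated by starting at offsets 0, 1, and 2 and reading complete triplets. This is a small sketch; the function name `reading_frames` is illustrative.

```python
def reading_frames(rna):
    """Split a sequence into complete codons for each of the three forward frames."""
    frames = []
    for offset in range(3):
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


rna = "GCUUGUUUACGAAUUAG"
frames = reading_frames(rna)
for offset, codons in enumerate(frames):
    print(f"frame {offset}:", " ".join(codons))
```

Only one of the frames ends in a stop codon (UAG) for this sequence, which illustrates why the choice of frame matters when searching for ORFs.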

Synonymous Mutation

Original:  GCU UGU UUA CGA AUU AG
           Ala Cys Leu Arg Ile

Mutation:  A -> G (UUA -> UUG)

Mutated:   GCU UGU UUG CGA AUU AG
           Ala Cys Leu Arg Ile

Missense Mutation

Original:  GCU UGU UUA CGA AUU AG
           Ala Cys Leu Arg Ile

Mutation:  U -> G (UGU -> UGG)

Mutated:   GCU UGG UUA CGA AUU AG
           Ala Trp Leu Arg Ile

Nonsense Mutation

Original:  GCU UGU UUA CGA AUU AG
           Ala Cys Leu Arg Ile

Mutation:  U -> A (UGU -> UGA)

Mutated:   GCU UGA UUA CGA AUU AG
           Ala STOP

Frameshift

Original:  GCU UGU UUA CGA AUU AG
           Ala Cys Leu Arg Ile

Mutation:  one U deleted

Mutated:   GCU UGU UAC GAA UUA G
           Ala Cys Tyr Glu Leu
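The four mutation types on these slides can be told apart by translating the sequence before and after the change. The sketch below assumes the standard genetic code and uses the slide's example sequence; `classify_substitution` and `is_frameshift` are illustrative names, and cases such as stop-codon loss are not handled separately.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna):
    """Translate complete codons from position 0; '*' marks a stop and ends translation."""
    residues = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        residues.append(aa)
        if aa == "*":
            break
    return "".join(residues)


def classify_substitution(original, mutated):
    """Classify a base substitution between two sequences of equal length."""
    if len(original) != len(mutated):
        raise ValueError("substitutions do not change the sequence length")
    before, after = translate(original), translate(mutated)
    if before == after:
        return "synonymous"
    for old, new in zip(before, after):
        if old != new:
            return "nonsense" if new == "*" else "missense"
    return "other"


def is_frameshift(original, mutated):
    """An insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return (len(original) - len(mutated)) % 3 != 0


original = "GCUUGUUUACGAAUUAG"
results = {
    "GCUUGUUUGCGAAUUAG": classify_substitution(original, "GCUUGUUUGCGAAUUAG"),
    "GCUUGGUUACGAAUUAG": classify_substitution(original, "GCUUGGUUACGAAUUAG"),
    "GCUUGAUUACGAAUUAG": classify_substitution(original, "GCUUGAUUACGAAUUAG"),
}
deletion = "GCUUGUUACGAAUUAG"  # one U removed
print(results, is_frameshift(original, deletion), translate(deletion))
```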

Protein Structure
- Proteins work together with other proteins or nucleic acids as "molecular machines".
  - Structures fit together and function in highly specific, lock-and-key ways.
- Four levels of structure:
  - Primary structure: the sequence of the protein
  - Secondary structure: local structure in regions of the chain (alpha helix, beta sheet)
  - Tertiary structure: three-dimensional structure
  - Quaternary structure: multiple subunits

Assigning Function to Proteins
- While 25,000 genes have been identified in the human genome, relatively few have known functional annotation.
- Determining the function of a protein can be done in several ways:
  - Sequence similarity to other (known) proteins
  - Using domain information
  - Using three-dimensional structure
  - Based on high-throughput experiments (when it functions and what it interacts with)

Summary: DNA (Gene) -> RNA -> Protein

- Replication: DNA -> DNA
- Transcription: DNA -> RNA
- Translation: RNA -> Protein

Biological pathway/gene networks
- Instead of having brains, cells make decisions through complex networks of interactions, called pathways:
  - Synthesize new materials
  - Break other materials down for spare parts
  - Signal to eat or die
- In order to fulfill their function, proteins interact with other proteins in a number of ways, including:
  - Regulation
  - Signaling
  - Pathways, for example A -> B -> C
  - Post-translational modifications
  - Forming protein complexes
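One common way to work with a pathway such as A -> B -> C computationally is to store it as a directed graph and ask which components lie downstream of a given one. The sketch below uses the placeholder names A, B, and C from the slide; the adjacency-list representation and the function name `downstream` are assumptions made for illustration.

```python
from collections import deque

# Directed interactions: each key activates (or acts on) the listed targets.
pathway = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}


def downstream(graph, source):
    """Return every node reachable from source, in breadth-first order."""
    reached = []
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in reached:
            continue
        reached.append(node)
        queue.extend(graph.get(node, []))
    return reached


targets_of_a = downstream(pathway, "A")
print(targets_of_a)
```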

An Example

Systems Biology
- We now have many sources of data, each providing a different view on the activity in the cell:
  - Sequence (genes)
  - DNA motifs
  - Gene expression
  - Protein interactions
  - Protein-DNA interaction
  - Etc.
- Putting it all together: Systems Biology

Next week
- Introduction to R programming
- You need to do in-class exercises

Acknowledgments
- Ziv Bar-Joseph: for some of the slides adapted or modified from his lecture slides at Carnegie Mellon University
- Neil Jones: for some of the slides adapted or modified from his slides for the book An Introduction to Bioinformatics Algorithms