Gene expression analysis, DNA chips and genetic networks
TRANSCRIPT
Bioinformatics
• The information science of biology: organize, store, analyze and visualize biological data
• Responds to the explosion of biological data, and builds on the IT revolution
Paradigm shift in biological research
Classical biology: focus on a single gene or sub-system. Hypothesis driven
Systems biology: model the behavior of an entire biological system. Hypothesis generating
Large-scale data;
Bioinformatics
Administration
• ~5 home assignments as part of a home exam, to be done independently (50%)
• Final exam (50%)
• Classes: Tue 12:15-13:30; Thu 14:45-16
• TA: Yaron Orenstein (Thu 16-17).
Administration (cont.)
• Web page of course: http://www.cs.tau.ac.il/~rshamir/cg/12/
• Includes slides and scribes of previous years on each of the classes.
The Cell
• Basic unit of life.
• Carries complete characteristics of the species.
• All cells store hereditary information in DNA.
• All cells transform DNA to proteins, which determine cell’s structure and function.
• Two classes: eukaryotes (with nucleus) and prokaryotes (without).
http://regentsprep.org/Regents/biology/units/organization/cell.gif
Nucleotide Chain, Double helix
• sugar
• phosphate
• Nucleotides/Bases: Adenine (A), Guanine (G), Cytosine (C), Thymine (T).
• Weak hydrogen bonds between base pairs
• Strong covalent bonds between sugars
Gregor Mendel: laws of inheritance, “gene” (1866)
Watson and Crick: DNA structure (1953)
kidzsoft/src/rnadnatutor.html -5-10/98s/f2http://www.cs.utexas.edu/users/almstrum/s
DNA (Deoxyribonucleic acid)
• Bases:
– Purines: Adenine (A), Guanine (G)
– Pyrimidines: Cytosine (C), Thymine (T)
• Bonds:
– G - C
– A - T
• Oriented from 5’ to 3’.
• Located in the cell nucleus
DNA and Chromosomes
• DNA is packaged (10000-fold)
• Chromatin: complex of DNA and proteins that pack it (histones)
• Chromosome: contiguous stretch of DNA
• Diploid: two homologous chromosomes, one from each parent
• Genome: totality of DNA material
Genes
• Gene: a segment of DNA that specifies a protein.
• The transformation of a gene into a protein is called expression.
• Genes are 3% of human DNA
• The rest - non-coding “junk DNA”:
– RNA elements (62%)
– Regulatory regions
– Retrotransposons
– Pseudogenes
– and more…
Proteins
• Build the cell and drive most of its functions.
• Polymers of amino-acids (20 total), linked by peptide bonds.
• Oriented (from amino to carboxyl group).
• Fold into 3D structure of lowest energy.
DNA (the hard disk) → transcription → RNA (one program) → translation → protein (its output)
http://www.ornl.gov/hgmis/publicat/tko/index.htm
RNA (Ribonucleic acid)
• Bases:
– Adenine (A)
– Guanine (G)
– Cytosine (C)
– Uracil (U); replaces T
• Oriented from 5’ (phosphate) to 3’ (sugar).
• Single-stranded => flexible backbone => secondary structure => catalytic role.
Transcription
http://www.iacr.bbsrc.ac.uk/notebook/courses/guide/words/transcriptiongif.htm
Template
The Genetic Code
• Codon - a triplet of bases, codes a specific amino acid (except the stop codons)
• Stop codons - signal termination of the protein synthesis process
• Different codons may code the same amino acid
http://ntri.tamuk.edu/cell/ribosomes.html
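The codon rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it builds the standard genetic code from its usual compact 64-letter form (bases ordered U, C, A, G; one-letter amino-acid codes; `*` for stop), and the names `CODON_TABLE` and `translate` are chosen here for illustration only.

```python
# Standard genetic code in compact form: codons are enumerated with bases in
# the order U, C, A, G; '*' marks a stop codon; letters are one-letter
# amino-acid codes.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For instance, `CODON_TABLE["UAC"]` and `CODON_TABLE["UAU"]` both give `Y` (tyrosine), which illustrates that different codons may code the same amino acid.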
Translation
http://biology.kenyon.edu/courses/biol114/Chap05/Chapter05.html#Protein
DNA → transcription → RNA → translation → Protein
Expression and Regulation
Gene
Transcription factors (TFs) control transcription by binding to specific DNA sequence motifs.
The Human Genome: numbers
• 23 pairs of chromosomes
• ~3,200,000,000 bases
• ~21,000 genes
• Gene length: 1000-3000 bases, spanning 30-40K bases
Model Organisms
• Eukaryotes; increasing complexity • Easy to store, manipulate.
Budding yeast
• 1 cell
• 6K genes
Nematode worm
• 959 cells
• 19K genes
Fruit fly
• vertebrate-like
• 14K genes
mouse
• mammal
• 30K genes
Biotechnology - Restriction Enzymes
• Restriction Enzyme:
– Natural role: break foreign DNA entering the cell.
– Ability:
  • Breaks the phosphodiester bonds of a DNA upon appearance of a certain cleavage sequence.
  • Different cleavage for each enzyme
  • Hundreds of different enzymes known.
– Digestion = application of restriction enzymes to a sequence.
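Digestion can be pictured as cutting a string at every occurrence of a cleavage sequence. The sketch below is only an illustration under simplifying assumptions: it models one strand as a Python string, and the function name `digest` and its `cut_offset` parameter are hypothetical. The example site `GAATTC`, cut after the first `G`, is the well-known EcoRI site.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    The cut is placed `cut_offset` bases into the site; the returned
    fragments, joined in order, give back the original sequence.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments
```

Calling `digest("AAGAATTCGGGAATTCTT", "GAATTC", 1)` splits the sequence at its two sites; the fragment lengths are what a gel would later measure.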
Cloning
• Cloning vector (plasmids)
• Foreign DNA
• Recombinant DNA
• Introduction into host cell
• Use of antibiotics to grow recombinant cells
PCR
[Figure: strands marked 5’ and 3’ through three cycles (Cycle 1, Cycle 2, Cycle 3); steps: denaturation, annealing, extension.]
Biotechnology - Gel Electrophoresis
• Use: “race” digested DNA fragments through electrically charged gel
• Goals:
– Separate a mixture of DNA fragments
– Measure length of DNA fragments
• How does it work:
– smaller molecules travel faster than larger ones
– molecules of the same size and shape move at the same speed
http://dlab.reed.edu/projects/vgm/vgm/VGMProjectFolder/VGM/RED/RED.ISG/mapping.html
The Double Digest Problem
Given 3 sets of distances $\{X_i\}$, $\{Y_i\}$, $\{Z_i\}$, reconstruct cut sites $A_1 < \dots < A_n$ and $B_1 < \dots < B_m$ s.t.
– $\{A_i - A_{i-1}\} = \{X\}$, $\{B_i - B_{i-1}\} = \{Y\}$
– for $C = A \cup B$ (ordered), $\{C_i - C_{i-1}\} = \{Z\}$
Complexity: NP-hard, many heuristics.
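To make the problem statement concrete, here is a brute-force sketch that tries every ordering of the two single-digest fragment sets and checks whether the merged cut sites reproduce the double-digest lengths. It is exponential and only meant for tiny inputs; the function names are hypothetical, and it assumes the fragment lengths in `x` and `y` sum to the same total.

```python
from itertools import permutations


def cut_sites(fragments):
    """Positions of the cuts implied by laying the fragments end to end."""
    sites, pos = [], 0
    for length in fragments[:-1]:
        pos += length
        sites.append(pos)
    return sites


def double_digest(x, y, z):
    """Return orderings (order_x, order_y) consistent with z, or None.

    x, y: fragment lengths from the two single digests.
    z: fragment lengths from the double digest.
    """
    target = sorted(z)
    total = sum(x)
    for order_x in set(permutations(x)):
        for order_y in set(permutations(y)):
            cuts = sorted(set(cut_sites(order_x) + cut_sites(order_y)))
            points = [0] + cuts + [total]
            merged = sorted(b - a for a, b in zip(points, points[1:]))
            if merged == target:
                return list(order_x), list(order_y)
    return None
```

A solution and its mirror image are indistinguishable from the data, so which of the two is returned is not significant.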
The Partial Digest Problem
• Problem: Given a (multi-)set of distances $\{|X_i - X_j|\}_{1 \le i < j \le n}$, reconstruct the original series $X_1, \dots, X_n$
• Complexity: unknown (yet)
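One common way to attack small instances is backtracking: the largest remaining distance must be measured from one of the two end points, so a new point is tried at either position. The sketch below follows that idea; it is not taken from the slides, the name `partial_digest` is hypothetical, and the first point is fixed at 0. It can take exponential time in the worst case.

```python
from collections import Counter


def partial_digest(distances):
    """Reconstruct points 0 = x1 < ... < xn from their pairwise distances.

    Returns a sorted list of points, or None if no placement fits.
    """
    remaining = Counter(distances)
    width = max(remaining)
    remaining[width] -= 1
    points = [0, width]

    def needed_for(candidate):
        """Distances from candidate to all placed points, if still available."""
        needed = Counter(abs(candidate - p) for p in points)
        if all(remaining[d] >= c for d, c in needed.items()):
            return needed
        return None

    def solve():
        if sum(remaining.values()) == 0:
            return sorted(points)
        largest = max(d for d, c in remaining.items() if c > 0)
        # The largest unexplained distance is measured from 0 or from width.
        for candidate in (largest, width - largest):
            needed = needed_for(candidate)
            if needed is None:
                continue
            remaining.subtract(needed)
            points.append(candidate)
            result = solve()
            if result is not None:
                return result
            points.pop()
            remaining.update(needed)
        return None

    return solve()
```

As with the double digest, a point set and its mirror image produce the same distances, so either may be returned.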
Sequencing
• Sequencing: determining the nucleotide sequence of a given molecule.
• Classical approach: gel electrophoresis
• Creating DNA strands of different lengths: catalyzing replication in an environment with “terminator” A*.
• Repeat separately with C*, G*, T*
• Abilities: reconstructs sequences of 500-1000 nucleotides.
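The read-out step can be sketched as follows: each of the four terminator reactions yields fragments whose lengths mark the positions of that base, and sorting all lengths recovers the sequence. This is an idealized illustration (no noise, every position observed exactly once); the name `read_gel` and the input layout are assumptions made here.

```python
def read_gel(lanes):
    """Reconstruct a sequence from terminated-fragment lengths.

    lanes maps each base to the lengths of the fragments that ended
    with that base's terminator (length 1 = first position).
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))
```

For example, lanes `{"A": [2, 5, 7], "C": [6], "G": [1], "T": [3, 4]}` read out as `GATTACA`.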
The Sequence Assembly Problem
• Problem: Given a set of substrings, find the shortest (super)string containing all the members of the set.
http://www.ornl.gov/hgmis/graphics/slides/images1.html
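A simple heuristic for this problem is to merge, again and again, the two strings with the largest suffix-prefix overlap. The sketch below implements that greedy idea; it is a heuristic that returns a superstring, not necessarily the shortest one, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates and reads that are contained in another read.
    unique = sorted(set(reads))
    strings = [r for r in unique if not any(r != s and r in s for s in unique)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0] if strings else ""
```

With the reads `ACGTAC`, `GTACGG` and `ACGGTT`, the two overlaps of length 4 are merged in turn and the result has 10 letters instead of 18.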
The Gene Finding Problem
Given a DNA sequence, predict the location of genes (open reading frames), exons and introns.
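The simplest ingredient of gene finding is scanning for open reading frames. The following sketch scans the three reading frames of one strand for an `ATG` followed in frame by a stop codon. It is a deliberately naive illustration: it ignores the reverse strand, introns and any statistical signal, and `find_orfs` and `min_codons` are names invented here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the given strand.

    An ORF starts at an ATG and ends just after the first in-frame
    stop codon; it must contain at least min_codons codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs
```

On `CCATGGCGTACTAAGG` the only hit is the slice `dna[2:14]`, i.e. `ATGGCGTACTAA`.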
Wild-Type
DNA: 5’ CAA GTA TAC CGC 3'
3’ GTT CAT ATG GCG 5'
mRNA: 5’ CAA GUA UAC CGC 3'
Polypeptide : Gln Val Tyr Arg
Substitution
DNA: 5’ CAC GTA TAC CGC 3'
3’ GTG CAT ATG GCG 5'
RNA: 5’ CAC GUA UAC CGC 3'
Polypeptide: His Val Tyr Arg
In this example of a base substitution, a "C-G" base pair has been substituted for the "A-T" in the original DNA strand. The resulting mRNA transcript has a "C" in place of the original "A". Thus the first three-letter codon is now "CAC" instead of "CAA". The polypeptide encoded by this mRNA is different from the original, because the codon "CAC" codes for the amino acid histidine, whereas the original codon "CAA" codes for glutamine.
Mutations - Substitution
http://morgan.rutgers.edu/morganwebframes/level1/page2/MutationFrameset.html
• Mutations are a major source of genetic variation
• Mutation does not always change the polypeptide
• Mutation types:
– substitution
– insertion
– deletion
– rearrangement
Wild-Type
DNA: 5’ CAA GTA TAC CGC 3'
3’ GTT CAT ATG GCG 5’
mRNA: 5’ CAA GUA UAC CGC 3'
Polypeptide : Gln Val Tyr Arg
Duplication
DNA: 5’ CAA GTA TAC TAC CGC 3’
3’ GTT CAT ATG ATG GCG 5'
RNA: 5’ CAA GUA UAC UAC CGC 3'
Polypeptide: Gln Val Tyr Tyr Arg
In this example, there has been an addition to the DNA molecule. A sequence of three base pairs has been duplicated. The mRNA transcript now contains another "UAC" codon, and the resulting polypeptide is longer by one amino acid -- a second tyrosine has been included.
Mutations - Duplication
Wild-Type
DNA: 5’ CAA GTA TAC CGC 3'
3’ GTT CAT ATG GCG 5’
mRNA: 5’ CAA GUA UAC CGC 3'
Polypeptide :Gln Val Tyr Arg
Deletion
T-A
DNA: 5’ CAA GAT ACC GC 3'
3’ GTT CTA TGG CG 5’
mRNA: 5’ CAA GAU ACC GC 3'
polypeptide: Gln Asp Thr
In this example, one of the base pairs in the DNA molecule has been deleted. A "T-A" has been lost, as shown above. The mRNA is now missing the "U" that was in position 5. This has caused a shift in the reading frame of the mRNA; thus, a mutation of this type is often referred to as a "frameshift mutation". The second and third amino acids are different from those in the original polypeptide.
Mutations - Deletion
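The three mutation examples (substitution, duplication, deletion) can be replayed in a few lines. The sketch below uses only the seven codons that appear in these examples rather than the full genetic code; the variable and function names are chosen for illustration.

```python
# Only the codons that occur in the examples (not the full genetic code).
CODONS = {"CAA": "Gln", "CAC": "His", "GUA": "Val", "UAC": "Tyr",
          "CGC": "Arg", "GAU": "Asp", "ACC": "Thr"}


def transcribe(coding_strand):
    """The mRNA matches the 5'->3' coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read complete codons only; a trailing partial codon is ignored."""
    return [CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]


wild_type = "CAAGTATACCGC"
substitution = "CAC" + wild_type[3:]                    # A -> C at position 3
duplication = wild_type[:9] + "TAC" + wild_type[9:]     # second TAC codon
deletion = wild_type[:4] + wild_type[5:]                # T at position 5 lost

for name, dna in [("wild type", wild_type), ("substitution", substitution),
                  ("duplication", duplication), ("deletion", deletion)]:
    print(name, transcribe(dna), " ".join(translate(transcribe(dna))))
```

The deletion case shows the frameshift: every codon after the lost base is read differently, and the last two bases no longer form a codon.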
Sequence Alignment problems
Given two sequences, find their best alignment: Match with insertion/deletion of min cost.
Many variants…
“Workhorse” of Bioinformatics! Key challenge: huge volume of data (more on this later)
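A minimal sketch of minimum-cost alignment by dynamic programming follows. It computes only the cost, not the alignment itself, and assumes a simple scoring scheme (match 0, mismatch 1, insertion/deletion 1); the scoring values and the name `alignment_cost` are assumptions made for this illustration, and real tools use richer scoring.

```python
def alignment_cost(s, t, mismatch=1, gap=1):
    """Minimum total cost of a global alignment of s and t.

    Row-by-row dynamic programming: entry j of the current row holds the
    best cost of aligning the first i letters of s with the first j of t.
    """
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

The running time is proportional to the product of the two lengths, which is exactly why the data volume mentioned above becomes the key challenge.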
Before
After
Rearrangement is a change in the order of complete segments along a chromosome.
Mutations - Rearrangement
Genome Rearrangements
M. Ozery-Flato
Challenges:
• Reconstruct the evolutionary path of rearrangements
• Shortest sequence of rearrangements between two permutations
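For very small permutations, the shortest rearrangement sequence can be found by breadth-first search. The sketch below assumes that the only allowed operation is reversing a contiguous segment of an unsigned permutation; this choice of operation is an assumption made here, and the search explores up to n! states, so it only illustrates the problem.

```python
from collections import deque


def reversal_distance(source, target):
    """Fewest segment reversals turning source into target (BFS).

    Returns None if target is not a rearrangement of source.
    """
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, steps = queue.popleft()
        if perm == target:
            return steps
        # Try reversing every segment perm[i:j] of length at least 2.
        for i in range(len(perm)):
            for j in range(i + 2, len(perm) + 1):
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None
```

For example, `[3, 2, 1]` is one reversal away from `[1, 2, 3]`, while `[2, 3, 1]` needs two.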
More problems in sequencing data
Solve all the problems above (alignment, gene finding, rearrangements,…) on really huge datasets
• Need to handle practical problems of efficiency: time and space
• Need to overcome large noise (false hits) due to data size
Hybridization
• DNA double strands form by “gluing” of complementary single strands
• Complementarity rule: A-T, G-C
ACTCCG
||||||
TGAGGC
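The complementarity rule is easy to express in code. In this sketch the function names are chosen for illustration, and both strands passed to `hybridizes` are assumed to be written 5’ to 3’, which is why the second one is compared against the reverse complement of the first.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base complement (A-T, G-C), in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Complementary strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


def hybridizes(strand, other):
    """True if two strands, both written 5'->3', are perfectly complementary."""
    return other == reverse_complement(strand)
```

`complement("ACTCCG")` gives `TGAGGC`, the lower strand of the pairing shown above.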
Gene Expression Arrays
• Assumption: transcription level indicates gene’s importance in a specific condition.
• Variants enable measuring absolute or relative levels of a transcript.
Given the expression profiles of normal vs. disease:
– Build a classifier to recognize the disease
– Cluster disease profiles into sub-classes; or cluster genes into functional groups
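As one very simple way to build such a classifier, the sketch below averages the training profiles of each class into a centroid and assigns a new profile to the nearest centroid. The nearest-centroid method is picked here purely for illustration and is not claimed to be the method behind any result in these slides; the function names are hypothetical and any numbers fed to it in examples are made-up toy values.

```python
def centroid(profiles):
    """Gene-by-gene average of a list of expression profiles."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(labelled_profiles):
    """labelled_profiles maps a class label to its list of profiles."""
    return {label: centroid(profiles)
            for label, profiles in labelled_profiles.items()}


def classify(model, profile):
    """Label of the class whose centroid is closest to the profile."""
    return min(model, key=lambda label: squared_distance(model[label], profile))
```

Usage: `model = train({"normal": [...], "disease": [...]})` followed by `classify(model, new_profile)`, where each profile is a list of expression levels in a fixed gene order.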
Classifier construction
[Figure: patients classified by a 70-gene signature; classes: No met. and Met.]
• Classify to minimize incorrect assignments in the no met. class.
• 17/19 correct predictions on test set.
Human Variation
• DNA of two human beings is ~99.9% identical
• Phenotype and disease variation is due to these 1/1000 mutations
Challenges:
• Associate mutations to specific disease
• Deal with huge datasets
• Overcome the multiple hypothesis testing problem
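To illustrate the multiple hypothesis testing problem: when many mutations are each tested for association, some small p-values appear by chance alone. One standard and conservative remedy is the Bonferroni correction, which divides the significance level by the number of tests. The sketch below shows it; the slides do not name a specific correction, so this choice, the function name and the default `alpha` are assumptions.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Indices of tests that stay significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests.
    """
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]
```

With four tests and `alpha=0.05` the per-test threshold is 0.0125, so p-values such as 0.02 or 0.04 no longer count as significant.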
Protein interaction detection: yeast two-hybrid
http://www.bioteach.ubc.ca/MolecularBiology/AYeastTwoHybridAssay/
Challenges in network analysis: function prediction
[Plot: GO semantic similarity (y-axis, 0 to 1.2) against network distance (x-axis, 0 to 9)]
Complexity summary
• ~21,000 genes in the genome
• Hard to identify
• Harder to figure their function
• Even harder to figure how they work together
Promise and problems in comparative genomics (2009)
               genes   arabidopsis  c. elegans  h. sapiens  s. cerevisiae
arabidopsis    26207   -            0.19        0.24        0.42
c. elegans     19992   0.26         -           0.38        0.38
h. sapiens     21673   0.30         0.28        -           0.43
s. cerevisiae   5884   0.21         0.13        0.19        -
a(x,y) = fraction of genes in genome y that have strict orthologs in genome x
Source: 10/2009