genomics lecture 3
DESCRIPTION
Background to genomics - based on the C. elegans genome project.
TRANSCRIPT
>CEK06A5acaagagagggcgcctcggccgtatgttgaatgggagatcgatggaaccgagacaacgagaaaaggaatagagacggagaaagagagagagagcgcgcgttgttggaaggatgaaaaagaaaaaagacatgagctgcttcacaagagcttggcgaaagcaaagggcaaagtgttgacagcttagtggtggtagttggatcttctctcctcgttctctgctcacaactcgtctatcactcatatcacatttatttcccaatatcattttaacaacatcttccgatgcatgttcgtcaatattgcgcaaccactttgcaatattgtcaaaacttttcgcatttgtgatatcgtaaaccagcataattcccattgctccgcggtaatatgatgttgtgattgtgtggaatcgttcttgtccagctgtgtcccagatttgtaatttaatcttttttccttttaattcgatagttttaattttgaagtcgattcctgaatgaaaaaagaaaattattttgaaatcactagattctgaataaaaactaaccaatagttgagatgaatgtggtgttaaaggcatcatccgaaaatctgtacagaatgcaagtttttccaactcctgagtcgcctattagcagcaatttgaagagcatgtcatacggtcggcgagccatttttcttctgaaatgagaaaaagttgagaactaaagttgcacaaaagtaagagaaaagcacttgagtcatggcaaatagaacgaacactttgagatttcgaagaagttatcaagagttgacaattggaagatatttggaagaactttctaatttttttctagttttccaaaattaggtttttgtcataaaatgttgtcaaagaaaaaacaggacaaaatagttaattgttgtttccattataacaaaaaaaaatttgaacggagctattaacgcgtgcatgcgcaaatcacatcgattagctgtttctgggaaattctcgggaaaaggtgaacagcagctgctggcttcctctgcgggtcacgaaaacacaaagagatcattataattgttatttggaaaggaagcgaatctaaaacgggtacaggtggacgtttattgatcgaaagtgctttttatttgaaattgaatggtgaactttgcaattttgtaatgcaaagtacgttatcagatggcatgagatgtgtgaagtgataaggaataaaatgtgaacgacatgttcaagaaactgtgatttttcaataatttgtgatgaaatattttaggaacagaaatgaacatattaattgatataaaaacaataggaacactaactcataattatgataggtgaatatcaaaatgtgctagattttttgaagttaaaaaatacatttctaatattttttcaaataataagtttcagctgaaatttcagggtgatttcagaaagctatgttttgataaattgttttgaaaattaaaagaagctacagcaaaaaaaaattaaagagaacatcgctccctcgtagtgtataatttttgattatcgaaaaaaatgagtcaatgatgaaaaggaagtcgcaatctcaaaacttcaaaaatcaaaagaagccgttgcctctgtcatcaaaaattcagaagacaaggttgttgacaagggtcaattctcagtggtggagggcattgggcgtggtgaaatttttgaaggctagtgtggttggacctctactagatagacaaaacccccgaaatagacgtttaatttgatgagatggtggagaaagaaaaggactcattctctagatgatagagagaccagagatacagacaagagagggcgcctcggccgtatgttgaatgggagatcgatggaaccgagacaacgagaaaaggaatagagacggagaaagagagagagagcgcgcgttgttggaaggatgaaaaagaaaaaagacatgagctgcttcacaagagcttggcgaaagcaaagggcaaagtgtt
gacagcttagtggtggtagttggatcatgtgtttttatgtttccggtgggagaaggttcaacaaaaaatgaaaagaaaaagttcaagcggcatgaatcattctgagtttaaaacaaaattattgcgaaaattaatattaaaaccttttcacaaaacttcaagctaatctgttcatgaaaatttgaataatagttttttcccacctatttagaattaacttcatattaacgaaattaattaacgaatcgaaaattatgacttttcagaatcatctgaagttttttcacattccatgctgcatggaataatttgatcctggaatcgatatgtttttatggtatactttttaaccttcaatttagctggaaaagtatggaataaataattcccgaagctatgtacatatatgtagaattattgaatgattgtgagaacaacttgactttagcttgagtaggaatcggaatggctatcgaccgatcaacacttaggattgtaagaatggcagtaagaatatattgaagaaagaatgtttgttcataggaagagaaagagtattgcgaaatcatcatcgcccactttagaatggacgggcggtgagcggacatagagaattgtgaatgactaatgcttttgcagaatctagggcaaaatcgtaggaacaaacaattgtaatacggagaaaacaatcatatcgatcgatgatcatggagaaaaatgtgatttaagtgagtagacttggaaaaattaataaaagcatgaattgtcgatatttttcatttattttcattataaagctctttaaaaacaaattaaatattgagaatggcttcgaagaatattgtttcaaatatgttcaatggtgacaccttgcggataaaattaatgtaaaaatcatggaacacagattcactgatatctcattatctcaagcagtgtaattagagattttttggaacaattattttataaaactataaataaaccgtttatactactcaaagccaaatattcaagctattaccattttttttctaactaattcttgagcaattaaagtattccccagtttttattttgcaacgactccaggcaaacacgctccgttgcacttgccgccaaggcgttgcattcaaatcagagagacatctcattccgatttctgtttttcttccaataaacggtattttatgcctaatgggtgatacggaaattgttcctcttcgagtacaaaatgtacttgatagcgaaatcattcgtctcaacttgtggtccatgaaggtaactgtctagtttttttaagttttcatgatttcaatatttttacagtttaacgcgaccagtttcaaactcgaaggttttgtgagaaatgaagaaggcactatgatgcagaaagtttgttccgaatttatttgtgtaagtcgagaaacatattcgtcaacaattttcattaaatattcagagacgcttcacttctacgttgcttttcgatgtttccggacgtttcttcgacttggtcggacagattgatcgggaatatcaacaaaaaatgggaatgcctagtagaattattgatgaattttcaaatggaattcctgaaaattgggccgaccttatctattcctgcatgtcagccaaccaaagaagcgcacttcgccctatccaacaggctccaaaagaaccaattagaactagaacagaaccaattgttacgttggcagatgaaaccgagctaactggaggatgccagaaaaattccgaaaacgagaaagaaaggaacagacgtgagcgtgaagaacagcaaacaaaggaacgtgagagaagattagaagaagaaaaacaacgacgagatgctgaagctgaggctgaaagaaggcgaaaagaagaggaagagctggaagaagctaattacacccttcgtgctccgaaatctcagaacggcgagccaatcactccgataaga
C. elegans cosmid K06A5, 24323 bp.
Flat sequence file – 3955 bp shown.
Genome sequence of C.elegans.
Sequence of entire genome.
Sequence of cDNA clones.
Approximately 19,500 PREDICTED protein coding gene sequences.
Large number of various kinds of functional RNAs – not discussed further.
For this lecture – focus on predicted proteins.
Gene prediction? How?
Science, December 1998.
Computer based predictions
GENEFINDER (C.elegans), BLAST (all genomes) and other computer programs.
Biases in coding sequence - in C. elegans non-coding is AT rich.
Splice site signals, initiator methionines, termination codons.
Likely exons and probable/possible splice patterns.
BLAST – compare the Translation of all 6 reading frames.
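A minimal Python sketch of what "all 6 reading frames" means: three frames on the strand as given, plus three on its reverse complement, each translated with the standard genetic code. The function names and the frame labels (+1 to -3) are assumptions made for illustration; this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame label (+1..+3, -1..-3) to its translation."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Each of the six translations can then be compared against a protein database, so a coding region is found whichever strand and frame it lies in.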
• Evidence that a prediction is correct?
• Homology with genes in other organisms – homologues.
• Known protein families.
• Experimental evidence.
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
http://www.ncbi.nlm.nih.gov/ The National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine.
mqnpmillifclfcavicsrgtdsdiphef
How does BLAST work?
BLAST compares small sequential blocks – or WINDOWS- of sequence against massive databases. It looks for regions of similarity and scores them.
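The window idea can be sketched in a few lines of Python. This is a simplified illustration, assuming exact matching of 3-letter windows followed by extension without gaps; real BLAST also accepts similar (not just identical) words and scores extensions with a substitution matrix. The function names are hypothetical.

```python
from collections import defaultdict


def find_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a window matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, word_size=3):
    """Extend a matching window left and right while residues stay identical."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + word_size, j + word_size
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return query[start_i:end_i], start_i, start_j


# Toy example: the shared local region is found even though the ends differ.
hits = find_word_hits("mqnpmillif", "xxxnpmillyy")
region = extend_hit("mqnpmillif", "xxxnpmillyy", *hits[0])
```

Indexing the database sequence by window first is what makes the search fast: each query window is looked up rather than compared against every position.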
Figure: protein sequence in single letter code, with search windows moving along a large protein.
More BLAST
High similarity BLAST score
Low similarity BLAST score
Conserved regions
Small windows of comparison - detect LOCAL regions of similarity.
Output - % identity and % similarity (similarity permits conservative substitutions of amino acids).
Gives overall score and probability of relatedness.
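As a rough illustration of the difference between % identity and % similarity, here is a Python sketch for two aligned windows of equal length. The conservative groups below are a simplified assumption chosen for the example; BLAST itself uses a substitution matrix such as BLOSUM62 rather than fixed groups.

```python
# Simplified groups of amino acids treated as conservative substitutions.
# This grouping is an assumption for illustration only.
CONSERVATIVE_GROUPS = [
    set("ILVM"), set("FYW"), set("KRH"), set("DE"),
    set("ST"), set("NQ"), set("AG"),
]


def is_similar(a, b):
    """True if residues are identical or fall in the same conservative group."""
    return a == b or any(a in g and b in g for g in CONSERVATIVE_GROUPS)


def identity_and_similarity(seq_a, seq_b):
    """Return (% identity, % similarity) for two aligned, equal-length windows."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("windows must be non-empty and of equal length")
    pairs = list(zip(seq_a.upper(), seq_b.upper()))
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


# I/I identical, L/V and K/R conservative, D/A neither.
print(identity_and_similarity("ILKD", "IVRA"))  # (25.0, 75.0)
```

Similarity is always at least as high as identity, because every identical position also counts as similar.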
If the entire protein sequence was compared in one go, you may get a relatively low overall similarity.
How did genes and gene families evolve and what is meant by protein domains?
We need to come back to this – remember the question!
Non-conserved regions
HOMEWORK
Below is the sequence of a protein:
mqnpmillif clfcavicsr gtdsdiphef hkmlkhaksl nsllrdlhvi yspemtnrhv
ektdkhgaal slksgsmsaq rivsiqnisd demdgytlfh lqsmkdikqg ndtcnlqsvc
vpipqlsddp qvlmypkcye vkqcvgsccn svetchpgti nlvkkhvael lyigngrfmf
nmtkeitmee htscscfdcg sntpqcapgf vvgrsctcec ankeernncv gnatwnaetc
kcecdlkcee gkilhkdrcd cvrrrqhhgg prghhghrhh hrsrpidtee vqkigqlkvg
rigg
1. Go to NCBI http://www.ncbi.nlm.nih.gov/
2. Go to BLAST, then look down the left for “Choose a BLAST program to run”.
3. From within that section, select “protein blast”.
4. Copy the above protein sequence and paste it into the box on the top left of the web page.
5. Scroll down the page and click the big blue BLAST button.
Have a look at the outcome – any questions, post to the Forum on Moodle.
BLAST is one of the powerful computational tools for comparative genomics.
Expressed sequence tags (ESTs) – cDNA clones.
To make cDNA, mRNA is copied into DNA by reverse transcriptase.
RNA → DNA
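In sequence terms, the first-strand cDNA made by reverse transcriptase is the DNA reverse complement of the mRNA. A minimal Python sketch, assuming an mRNA written 5' to 3' with only A, C, G and U (the function name is hypothetical):

```python
def first_strand_cdna(mrna):
    """Return first-strand cDNA (5'->3') complementary to an mRNA (5'->3')."""
    complement = {"A": "T", "U": "A", "G": "C", "C": "G"}
    # The cDNA is antiparallel to the mRNA, so read the mRNA backwards.
    return "".join(complement[base] for base in reversed(mrna.upper()))


# A poly-A tail on the mRNA becomes a run of T at the 5' end of the cDNA.
print(first_strand_cdna("AUGGCCAAAA"))  # TTTTGGCCAT
```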
“The Central Dogma” of Molecular Biology
DNA → mRNA → Protein
Retroviruses (e.g. HIV).
RNA genome → DNA → integration → mRNA → protein
Computational biology is mostly predictive – not EXPERIMENTAL
Let's look at simple experimental evidence for the existence of genes.
Making cDNA
Typical eukaryotic gene - double stranded DNA (exons and introns)
1. RNA polymerase
Primary transcript – single sense strand RNA, 5’ to 3’OH – introns present
2. Capping, splicing, poly-adenylation
Messenger RNA (mRNA): 5’ CAP, RNA exons, AAAAAAAAAAA 3’OH
3. First strand cDNA synthesis - reverse transcriptase, from a DNA primer (OH-TTTTTTTT-5’)
RNA/cDNA duplex
4. Second strand cDNA – DNA polymerase
Double stranded cDNA
EST sequencing was carried out in parallel to genome sequencing.
Simplest experimental evidence that a bit of genomic DNA contains a gene.
Making cDNA
cDNA synthesis by oligo dT priming
Messenger RNA (mRNA): AAAAAAAAAAA 3’OH
DNA primer: OH-TTTTTTTT-5’
cDNA synthesis by random priming
Messenger RNA (mRNA): AAAAAAAAAAA 3’OH
DNA primer: OH-NNNNNNNNN-5’ (random 6-mers or 9-mers)
The advantage of random priming is that cDNA clones are not biased towards the 3’ end of the gene.
EST sequences
Sequence data from random primed cDNA – ESTs (or EST tags)
Figure: typical eukaryotic gene (double stranded DNA) with EST 1, EST 2, EST 3 and EST 4 shown against it.
The sequencing of ESTs uncovered frequent examples of differential splicing.
Common examples are exon skipping (above), alternative 5’ exons, alternative splicing that alters stop codons, genes within genes, etc.
The above is true for C. elegans, humans, flies and many other species.
• C. elegans EST data from approximately 50,000 cDNA clones.
• Identified 9,356 different genes.
1. Grind up thousands of worms.
2. Prepare mRNA – convert to cDNA with reverse transcriptase – clone in plasmid.
3. Some mRNAs exist at extremely low levels of abundance.
4. Low abundance cDNAs may be impossible to clone randomly.
Reverse transcriptase PCR – very sensitive.
cDNA from mRNA using reverse transcriptase.
Amplify cDNA by PCR – primers designed from predicted genes.
Clone and analyse products.
Experimentally confirmed genes raised to > 18,000.
Full length cDNA – valuable for confirming intron/exon structure.
Figure: gene and its mRNA (with poly-A tail), showing Primer A and Primer B.
Summary of predicted and known gene sequences in C. elegans
1. Predicted 19,500 genes.
2. At least 18,000 expressed as RNA.
3. Average of 1 gene per 5 kb.
4. ~ 42% have detectable homologies to genes/proteins outside Nematoda.
Genome Size
Organism                        Genome      Genes
E. coli (bacteria)              4.64 Mb     4,377
S. cerevisiae (fungal)          12.1 Mb     6,163
C. elegans (metazoan)           100 Mb      19,300
Arabidopsis (plant)             118 Mb      ~20,000
D. melanogaster (fruit fly)     135.6 Mb    13,472
Mus musculus (mouse)            3059 Mb     ~25,000
Homo sapiens (obvious)          3286 Mb     ~25,000
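The "1 gene per 5 kb" figure in the summary can be checked from the Genome Size table. A small Python sketch of the arithmetic, using the table values as given (approximate gene counts marked "~" are entered as plain numbers):

```python
# (organism, genome size in Mb, gene count) copied from the Genome Size table.
GENOMES = [
    ("E. coli", 4.64, 4377),
    ("S. cerevisiae", 12.1, 6163),
    ("C. elegans", 100.0, 19300),
    ("Arabidopsis", 118.0, 20000),
    ("D. melanogaster", 135.6, 13472),
    ("Mus musculus", 3059.0, 25000),
    ("Homo sapiens", 3286.0, 25000),
]


def kb_per_gene(genome_mb, genes):
    """Average genome length per gene, in kilobases (1 Mb = 1000 kb)."""
    return genome_mb * 1000.0 / genes


for name, size_mb, genes in GENOMES:
    print(f"{name:16s} {kb_per_gene(size_mb, genes):8.1f} kb per gene")
```

C. elegans comes out at about 5.2 kb per gene, whereas human comes out at about 131 kb per gene: gene number rises far less than DNA content.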
The C. elegans Top 20 protein homologies
Number  Description
650     7 TM chemoreceptor
410     Eukaryotic protein kinase domain
240     Zinc finger, C4 (transcription factor)
170     Collagen
140     7 TM receptor
130     Zinc finger, C2H2 (transcription factor)
120     Lectin C-type domain short and long forms
100     RNA recognition motif (RRM, RBD, or RNP domain)
90      Zinc finger, C3HC4 type (transcription factor)
90      Protein-tyrosine phosphatase
90      Ankyrin repeat
90      WD domain, G-beta repeats
80      Homeobox domain (transcription factor)
80      Neurotransmitter-gated ion channel
80      Cytochrome P450
80      Helicases conserved C-terminal domain
80      Alcohol/other dehydrogenases, short-chain type
70      UDP-glucuronosyl and UDP-glucosyl transferases
70      EGF-like domain
70      Immunoglobulin superfamily
Does the “Top 20” list tell us anything?
Previous slide looked rather boring?
Test your memory – what was on the list?
Many of the large gene families are implicated in developmental control.
Core set of proteins needed for general cell biology/metabolism to make a cell – e.g. S. cerevisiae ~6,163 genes.
Evolution of developmental complexity – amplification of families of regulatory molecules.
The above in part explains the increase in number of genes in multicellular organisms – it does not explain fully the increase in DNA content.
How much does DNA sequence teach us?
Remember that what we can learn from protein similarities is limited by what we know about the similar proteins.
We still need to connect genes/proteins with functions.
C. elegans mutants
dpy-7: Short fat worm – exoskeletal defect.
ced-4: Programmed cell death defective.
unc-51: Paralysed - abnormal axons.
dec-2: long defecation cycle – genetically constipated.
Wild Type
How has genomics influenced genetics?
Figure: genetic map of Chromosome I (left arm, central cluster, right arm), scale -15 to 25 m.u., showing bli-3, egl-30, mab-20, fog-1, unc-73, unc-57, dpy-5, dpy-14, fer-1, unc-29, lin-11, unc-75, unc-101, glp-4 and unc-54.
Genetic mapping.
m.u. = map unit.
Genetic mapping – recombination.
1 m.u. is 1% recombination per meiosis.
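The definition translates directly into arithmetic: map distance in m.u. is the percentage of recombinant chromosomes among those scored. A minimal Python sketch; the counts in the example are hypothetical, and the simple percentage is only a good estimate over short distances, where double crossovers are rare.

```python
def map_distance(recombinants, total):
    """Map distance in m.u. = % of scored chromosomes that are recombinant."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return 100.0 * recombinants / total


# Hypothetical cross: 30 recombinant chromosomes out of 1000 scored.
print(map_distance(30, 1000))  # 3.0
```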
Figure: parent chromosomes (fog-1 glp-4 and + +) and recombinant chromosomes (fog-1 + and + glp-4).
We wanted to investigate the molecular detail of genes defined by mutation.
We knew where mutant genes mapped and we knew their phenotype.
Figure: genetic map of Chromosome I, -15 to 25 m.u., showing bli-3, egl-30, mab-20, fog-1, unc-73, dpy-5, fer-1, lin-11, unc-75, unc-101, glp-4 and unc-54.
How can the physical and genetic maps be aligned?
Identify the sequence of genes defined by mutation.
Sequence of genomes – individual chromosomes: AGCCTTTATGGCGAGATGGATAGCT…TATAA
Physical Map of clones
Figure: genetic map of Chromosome I (bli-3 to unc-54, -15 to 25 m.u.) drawn above the physical map.
• An association or alignment between the physical and genetic maps.
Figure: genetic map of Chromosome I (bli-3 to unc-54, -15 to 25 m.u.) drawn above the physical map.
Positional cloning of genes defined by mutation.
Imagine lin-11 and unc-101 had both been cloned.
Where on the physical map might unc-75 be?
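One way to make a first guess is linear interpolation between the two cloned flanking genes. The Python sketch below assumes that recombination frequency is roughly uniform between the markers, which is often not true, so the result only narrows down which cosmids to try first. All positions in the example are hypothetical placeholders, not real map data.

```python
def interpolate_physical_position(target_mu, left, right):
    """Estimate a physical position (bp) from genetic positions (m.u.).

    left and right are (genetic_position_mu, physical_position_bp) tuples
    for two cloned genes that flank the target gene on the genetic map.
    """
    (left_mu, left_bp), (right_mu, right_bp) = left, right
    if not left_mu < target_mu < right_mu:
        raise ValueError("target must lie between the two flanking markers")
    fraction = (target_mu - left_mu) / (right_mu - left_mu)
    return left_bp + fraction * (right_bp - left_bp)


# Hypothetical values: flanking genes at 0 and 10 m.u., target at 5 m.u.
estimate = interpolate_physical_position(5.0, (0.0, 1_000_000), (10.0, 3_000_000))
print(estimate)  # 2000000.0
```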
Transgenic C.elegans – rescue of mutant phenotype.
DNA is injected into the gonads of adult hermaphrodites.
The injected DNA forms large heritable DNA molecules termed "free arrays".
1. Inject cosmid into the mutant.
2. Observe transgenic progeny for phenotypic rescue.
3. Subclone individual genes from cosmid.
4. Observe transgenic progeny for phenotypic rescue.
Cosmid sequence
Genes
Phenotypic Rescue
Inject unc-75 mutant worms.
Figure: genetic map of Chromosome I (bli-3 to unc-54, -15 to 25 m.u.) drawn above the physical map.
Positional cloning of genes defined by mutation.
Attempt phenotypic rescue with cosmids.
• The standard route to clone C. elegans genes defined by mutation.
• The more genes are cloned the easier it becomes to clone others.
We can’t make transgenic humans, but the same positional information is used to identify human disease genes.
RNA Interference (RNAi)
RNAi - sequence-specific inactivation of gene function by either double stranded RNA (dsRNA) or siRNA.
Since its discovery in C.elegans, it has been found to work in many organisms – e.g. cultured vertebrate cells, plants, trypanosomes, Drosophila.
Mediators of RNAi - short interfering RNAs (siRNAs)
21-23 nt dsRNA duplexes.
DICER – Highly conserved family of RNaseIII enzymes.
Targets double stranded RNA.
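To get a feel for the sizes involved, the Python sketch below chops the sense strand of a long dsRNA into consecutive 21 nt pieces. This is a deliberately crude model under stated assumptions: it ignores the second strand and the overhangs of real siRNA duplexes, and the function name is hypothetical.

```python
def dice(dsrna_sense, length=21):
    """Cut a sense strand into consecutive fragments of siRNA length.

    Only whole fragments are returned; a shorter leftover piece is dropped.
    """
    if not 21 <= length <= 23:
        raise ValueError("siRNAs are 21-23 nt long")
    return [
        dsrna_sense[i:i + length]
        for i in range(0, len(dsrna_sense) - length + 1, length)
    ]


# A 50 nt trigger gives two whole 21 nt fragments (8 nt left over).
fragments = dice("A" * 50)
print(len(fragments))  # 2
```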
Argonaute
Single Stranded interfering RNA
RNAi in C.elegans.
ds RNA
Observe phenotype of F1 offspring.
Noticed that site of injection did not matter – intestine works??
How could that affect embryos?
Systemic RNAi
Bacterial Feeding Method in C. elegans
Express dsRNA of a cloned C. elegans gene in a strain of E. coli. Worms eat the bacteria as food.
RNAi of the gene can be obtained both in the worms that feed on the dsRNA expressing bacteria, and in the F1 progeny of these worms.
Transport of dsRNA into Cells by the Transmembrane Protein SID-1
Science 301, 1545 (2003)
sid-1 mutants are defective in systemic RNAi
SID-1 protein
Loss of function phenotype can be estimated by RNAi.
RNAi by feeding method – whole genome RNAi projects.
Clones of 16,757 predicted genes tested in genome wide screen.
10.3% gave obvious phenotype.
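The percentage is easier to picture as a count of genes. A short Python sketch of the arithmetic, using only the two numbers quoted above (the function name is hypothetical):

```python
def genes_with_phenotype(genes_tested, percent):
    """Number of genes corresponding to a percentage of those tested."""
    return round(genes_tested * percent / 100.0)


hits = genes_with_phenotype(16757, 10.3)
print(hits, 16757 - hits)  # 1726 15031
```

So roughly 1,700 genes gave an obvious phenotype and about 15,000 did not, which is where redundancy between genes becomes relevant.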
RNAi as a tool for genetic analysis
Redundancy between genes.
RNAi can inactivate more than one gene at a time.
Permits analysis of functionally redundant genes.
Summary, C. elegans Genomics
Permits comparisons with human genes.
Most human disease genes have C. elegans homologues.
Powerful genetic tools – experiments on genes.
Detailed anatomy – relate gene to function.
Examples of processes investigated.
Programmed cell death.
Signalling.
Cell adhesion.
Axonal guidance.
Oncogene function.
Insulin pathway.
Ageing.
How did genes evolve and what are gene/protein families?
Early genomes
– Early genomes made of RNA
  • RNA world - no cells (in the modern sense), just RNA, starting with 1 gene
  • Ribonucleotide polymerase activity - catalyses own synthesis
  • Later on - translation - encoded information for the production of proteins
– Involves nucleic acids ‘coding for’ proteins
– Later emergence of DNA as the information store - genome stability - less labile
– Modern functions of nucleic acids
  • coding - proteins via mRNA
  • catalytic – ribozymes
  • structural – rRNA, tRNA
  • regulatory - miRNAs
Figure labels: inorganic surface, nucleotides, RNA, DNA, mRNA, tRNA, rRNA, protein.
‘Tree of Life’ - Tree of all Animals
Common ancestor => common genome
• Each species’ genome descended with modification from genome of ancestor
Reconstruction of picture of ‘ancestral genome’?
Comparative genomics - tells us about state of ancestor and changes along each branch
Where did our genome come from?
Genes and Genome evolution
• What processes lead to genome evolution?
Initial ligation to form early chromosomes
Inversion
Duplication / deletion
Accumulation of point mutations
Invasion - horizontal gene transfer & transposable elements
Structure of a typical eukaryotic gene
Figure: gene (promoter, TSS, ATG, stop; Exon 1, Intron 1, Exon 2, Exon 3, Exon 4), mRNA (5’-UTR, 3’-UTR, poly A tail) and protein (Domain 1, Domain 2).
What features of all genes are missing from this diagram?