CSCI6904
Genomics and Biological Computing
Lecture 2 – Genomics
Encoding Molecules
Sequencing DNA
Genome Projects
Finding Genes
How to encode information into molecules
Adenine A
Guanine G
Thymine T
Cytosine C
Nucleotides also are building blocks for energy and signaling pathways
Adenosine A
ATP
AMP
The tale of a structure
A structure for Deoxyribose Nucleic Acid
J. D. WATSON F. H. C. CRICK
2 April 1953
MOLECULAR STRUCTURE OF NUCLEIC ACIDS
http://www.chemheritage.org/EducationalServices/chemach/ppb/cwwf.html
A double helix as encoding medium
Protection against environment
The informative unit is stowed inside, away from water-soluble toxins. Cancer-causing agents typically can penetrate this defense.
Redundancy
Proof reading by comparing to the complementary strand
Mechanical Protection
Torque, stretching…
Control of information flow
“Archive” the information when not in use
Fancy control structures
Hairpins, turns, twists and other frills.
Transcription
Retrieve copies of accessible genes
All genes on the chromosome that are exposed, for any reason, are transcribed into mobile and unstable molecules called messenger RNAs [A, U, C, G].
Exporting, editing, processing
These are exported out of the nucleus; some are edited according to some control scheme.
Taken up by the translation machinery
Each messenger enters a complex molecular machine that translates the gene into a protein written in a 20-character alphabet.
Destroyed quickly
This avoids the need for a separate control mechanism at this step.
Translation
Convert words of three characters into protein chains
These three-character words are called “codons”.
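As a minimal sketch of this step, the snippet below reads a DNA coding strand three characters at a time and looks each codon up in the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative, and the choice to stop at the first stop codon is an assumption of this sketch, not something stated on the slide.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding strand codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # '*' marks a stop codon (TAA, TAG, TGA)
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads ATG, GCC, TAA and yields the two-character protein string `MA`.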
Translation
The code is universal and degenerate
Most organisms use the universal code.
Different organisms have different codon frequencies.
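Codon frequencies are easy to tabulate once a coding sequence is in hand. The sketch below is one simple way to do it; `codon_frequencies` is a hypothetical helper name, and it assumes the input already starts in the correct reading frame.

```python
from collections import Counter

def codon_frequencies(coding_dna):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Running this over all known genes of two organisms and comparing the tables is one way to see the different codon preferences mentioned above.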
Translation
From there, the new chain spontaneously adopts a 3D structure and starts to do something. Other proteins require further editing and are exported to their destination.
Protein Alphabet
20 universal characters.
Genetically encoded extra AA are very rare.
Each character has a set of properties:
• Electrostatic charge (+/-)
• “Hydrophobicity” (doesn’t mix with water)
• Chemical reactivity
Sequencing DNA
Why
All genetic information is encoded in the DNA molecule. Sequencing DNA is necessary to create a representation of the information on which computation can be performed.
Principle
Reading cannot be done directly, as the individual nucleotides cannot be visually resolved.
Using a natural protein cloned from a bacterium, a collection of molecules of every possible length is generated. These artificial replicates are separated on the basis of their size in a gel matrix using a powerful electric field (electrophoresis). Individual replicates are then resolved because they are tagged with either fluorescent or radioactive markers.
Replication in vivo
Principle
A protein, called a DNA polymerase, steps through a single strand of the DNA, finding the complementary nucleotide for each position and attaching it to the growing new chain.
Polymerase enzyme (proteins)
Replication in vitro and sequencing
Sequencing DNA
Nowadays
Create a mixture of chains of all possible lengths by using unreactive capped ends.
Separate the four mixtures together on the basis of their size.
Read the sequence as a string (anywhere between 300 – 900 characters*).
*includes 0!
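The logic of reading a sequence from capped fragments can be mimicked in a few lines. This is a toy model, not a simulation of the chemistry: it assumes one reaction per base, that every possible capped fragment is produced, and that sizes are resolved perfectly. The function name `sanger_read` is made up for the illustration.

```python
def sanger_read(new_strand):
    """Recover a sequence from the fragment lengths seen in four capped reactions."""
    # One reaction per base: it holds a fragment ending at every position
    # where that base occurs in the newly synthesized strand.
    lanes = {
        base: [i + 1 for i, b in enumerate(new_strand) if b == base]
        for base in "ACGT"
    }
    # Separate all fragments together by size, shortest first.
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    # Reading which reaction each band came from, in size order, spells the sequence.
    return "".join(base for _, base in bands)
```

The point of the sketch is that fragment length encodes position and the reaction (or dye) encodes the character, so sorting by size is enough to read the string.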
Sequencing DNA
Errors
Errors in reading are more frequent at the extremities of the readable sequence.
Very compact reads (early)
Less defined reads (late) due to “smearing”
Polymerase has an intrinsic rate of replication error. In nature, these would be called mutations. In a lab, these are just called annoying.
Proof reading DNA
Since there are two single strands, both are usually sequenced and cross-checked for inconsistencies.
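A cross-check of the two strands can be sketched as follows. The helper names are hypothetical, and the sketch assumes both reads cover exactly the same region with no insertions or deletions, which real reads rarely guarantee.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def strand_disagreements(forward_read, reverse_read):
    """List 0-based positions on the forward strand where the two reads disagree."""
    expected = reverse_complement(reverse_read)
    return [
        i for i, (a, b) in enumerate(zip(forward_read, expected)) if a != b
    ]
```

An empty list means the two reads are consistent; any reported position is a candidate for re-sequencing.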
Representation
FASTA format
Most basic format to store sequence information only. This is what is usually downloaded from a database.
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
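Because the format is so simple, a reader for it fits in a few lines. The sketch below is a minimal parser, assuming the whole file fits in memory and that header lines are unique; `parse_fasta` is an illustrative name, not a library function.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}
```

Applied to the example above, this would give two entries keyed by `Example1 envelope protein` and `Example2 synthetic peptide`, each holding its sequence as a single string.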
Representation
ASN.1 format
This is the format NCBI uses to store GenBank records. I’m not sure who uses it directly.
XML formats
There is a collection of markup-language derivatives for sequence information, which may be more convenient to deal with.
GenPept format
Human-readable formatting of the content; this is the default representation if one uses the online portal to the GenBank database.
Seq-entry ::= set {
  class pop-set ,
  descr {
    pub {
      pub {
        gen {
          cit "Unpublished" ,
          authors {
            names std {
              {
                name name {
                  last "Burda" ,
                  first "Sherri" ,
                  initials "S.T." } } ,
              {
                name name {
                  last "Konings" ,
                  first "Frank" ,
                  initials "F.A.J." } } ,
              …
Size of genomes
Human 3.31 Gbp
Mouse 3.3 Gbp
Corn 5 Gbp
Fruit fly 180 Mbp
Frog 3.1 Gbp
E. coli 4.7 Mbp
HIV-2 9.6 Kbp
Size of genomes
Variance in size
It costs energy to replicate a genome. Organisms with a short generation time (~minutes) are under strong pressure to dump garbage and duplicated DNA so as to maximize their efficiency.
This is less of a problem for higher life forms, for which the availability of energy is not limiting.
Some plants have the largest genomes, although only a small fraction of these actually encodes genes.
Vocabulary
Plasmid
Artificial construct used to manipulate sequences.
Cloning
Make a copy of a segment of DNA
Reading a whole genome
BAC and YAC
These are artificial chromosomes or plasmids from bacteria and yeast, respectively.
Library
Extract whole cell DNA, clean it up, break it into random fragments:
150 kbp (BAC)
0.15-1.5 Mbp (YAC)
Paste the fragments into BAC or YAC.
Introduce BAC or YAC into host cell.
Chromosome walk Sequencing
Principle
Isolate a DNA fragment / chromosome.
Create a specific replication primer.
Sequence as far as possible.
Use a region near the end of the current read to design a new primer.
The initial primers are known because they are located on the BAC/YAC.
Slow and expensive.
Shotgun Sequencing
Principle
Start with a BAC/YAC construct from a library, again.
Create random replication primers.
Sequence an arbitrarily large number of samples.
Assemble based on sequence identity.
Shotgun Sequencing
Assembly into CONTIGS
A case of the shortest common superstring problem.
Aided by knowledge of Sequence-Tagged Sites (STS). STSs are pretty much just substrings unique to a genome that have been mapped to a chromosome.
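To make the assembly idea concrete, here is a sketch of the classic greedy heuristic for the shortest common superstring problem: repeatedly merge the two reads with the largest suffix-prefix overlap. This is a textbook approximation swapped in for illustration, not the method of any particular assembler; it assumes error-free reads from a single strand and ignores STS information. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Merge reads greedily by best overlap; return the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(contigs)
            for j, b in enumerate(contigs)
            if i != j
        )
        if best_k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[i] + contigs[j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)
    return contigs
```

With the reads `ATGCGTA`, `CGTACGT` and `ACGTTAG`, the sketch rebuilds the single contig `ATGCGTACGTTAG`. Being greedy, it can commit to a locally best merge that is globally wrong.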
Shotgun Sequencing
Assembly into CONTIGS
The main caveat with the method is that it tends to delete repetitive regions, or to get stuck in local minima in situations like the following:
...ADGHKJGKJXXXXXXXXXSDGDKJHDGFXXXXXXXSADGUYDSSDGK…
Public vs Private Genomes
Shotgun vs. Systematic walk
The Human genome project broke into two components somewhere along the road:
Private company: Shotgun sequencing only
Public project: Chromosome mapping.
Public vs Private Genomes
TIGR:
Non-profit.
First full genome: Haemophilus influenzae
Human Genome Draft 1 (Public data, 2001)
Public vs Private Genomes
CELERA:
Drosophila Genome (Public collaboration)
Bacterial genomes (Proof of concept)
Human Genome Draft 1 (Proprietary data, 2001)
Both still have gaps and typos.
Public vs Private Genomes
Which one is the best?
CELERA’S draft has been shown to collapse regions of high sequence identity.
CELERA has access to the public database to correct this problem.
CELERA charges a high price for access to their data!
Whose DNA was sequenced
Nine persons (Anonymous)
8 Males
1 Female
Males have a Y chromosome, females don’t.
3/9 were from germ line cells (sperm)
Some genes are known to be pre-processed in non-germ line cells directly on the DNA.
Craig Venter, CELERA’s CEO, admitted that ~3/5 of CELERA’s DNA is his! Sigh.
What is in the HGP
OK, so what is a gene?
STOP codon
TAA, TGA, TAG
START codon
ATG (also codes for the protein character M)
Open Reading Frame (ORF).
Definition
Any segment of DNA which starts with a start codon and ends with a stop codon in phase.
The purple protein in this figure is responsible for finding stop codons.
Open Reading Frame (ORF).
There are six possible translational frames to worry about
Sequence as in the DB
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG
First three frames
TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A
T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA
TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA
But DNA is a double strand…
5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’
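The six frames can be enumerated mechanically: three offsets on the sequence as stored, and three on its reverse complement. The sketch below is one way to do this; the frame labels `+1` to `-3` and the function name `six_frames` are conventions chosen for this example, and incomplete codons at either end are simply dropped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(dna):
    """Split a sequence into codons for all six translational frames."""
    # The other strand must be read 5' to 3', hence complement then reverse.
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)
            ]
    return frames
```

For the database sequence above, frame `+1` begins TCC AAC TCG, frame `+2` begins CCA ACT CGG, and frame `-1` begins CGG CTT, read from the 3’ end of the stored strand.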
What are 5’ and 3’?
They are derived from the chemical numbering of the carbon atoms in the sugar molecule (deoxyribose in DNA, ribose in RNA).
Directionality of the Chain
Open Reading Frame (ORF).
Principle (in bacteria)
1. Find the longest possible sequence beginning with an ATG and terminating with a TAA, TAG, or TGA.
2. There may be multiple ATGs inside the gene, but only a single stop codon.
3. Real genes will have regulatory regions upstream of the ORF. Use pattern detection to find them.
4. Real genes are typically 100-500 codons long.
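Steps 1, 2 and 4 translate fairly directly into a scan over each frame. The following is a minimal sketch for one strand only (the reverse complement would be scanned the same way); step 3, the search for regulatory regions, is not attempted. The name `find_orfs` and the `min_codons` parameter are assumptions of the sketch, with the default taken from the lower bound in step 4.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for the longest ORFs on one strand; end is exclusive."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # keep only the first ATG: gives the longest ORF
            elif codon in STOPS:
                # Length counts every codon from the ATG through the stop.
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # a stop codon closes any open ORF in this frame
    return orfs
```

On the toy input `CCATGAAATTTTAGGG` with `min_codons=3`, the sketch reports one ORF, `ATGAAATTTTAG`, in the third frame. Later ATGs inside an open ORF are ignored, as in step 2.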
Open Reading Frame (ORF).
The regulatory regions cannot be searched for using plain string matching, or even regular expressions.
The following slides will give you an idea of how this can be done.
Promoter regions
Calculating a Sequence Logo

$$H = -\sum_{i=1}^{M} P_i \log_2 P_i$$

$$R = H_{\text{random}} - H_{\text{observed}}$$

Information theory (Sequence Logos)
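As a sketch of how the information content behind a sequence logo can be computed, the code below takes a set of aligned sites and, for each column, subtracts the observed Shannon entropy from the entropy of a uniformly random column. It assumes an alphabet of M = 4 nucleotides, takes P_i as the observed frequency of character i in the column, and applies no small-sample correction. The function names are illustrative.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content R (in bits) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Observed entropy: H = -sum(P_i * log2(P_i)) over characters present.
    h_observed = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    # Entropy of a random column: all M characters equally likely.
    h_random = math.log2(alphabet_size)
    return h_random - h_observed

def logo_information(aligned_sites):
    """Information content for every column of equal-length aligned sites."""
    return [column_information(col) for col in zip(*aligned_sites)]
```

A perfectly conserved column scores 2 bits and a column with all four bases equally represented scores 0 bits, which is what makes conserved promoter positions stand out.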
Finding Real Start codon
Something like an HMM (hidden Markov model) can be trained to classify whether an ATG codon really is a start codon.
Yin and Wang, GeneScout paper, see course website
Things get complicated with eukaryotes
Eukaryotic genes contain sub-strings of spliced-out junk called introns.
Things get complicated with eukaryotes
However, the splicing sites are made of statistically correlated sub-strings.
Things get complicated with eukaryotes
A similar HMM strategy can be used to find all splicing sites.
Things get complicated with eukaryotes
The weights in the model are calculated based on a so-called coding potential:
- No stop codon
- Codon preferences in an organism (the right frame will give a much better score)
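One simple way to turn these two criteria into a number is a log-odds score of the observed codons against a uniform background, with any in-frame stop codon disqualifying the frame. This is a toy stand-in for a coding potential, not the weighting used by any specific gene finder: the add-one smoothing, the uniform background and all function names are assumptions of the sketch.

```python
import math
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}

def codons_in_frame(dna, frame):
    """Split a sequence into complete codons starting at the given offset."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

def train_codon_model(known_genes):
    """Estimate codon probabilities from in-frame genes, with add-one smoothing."""
    counts = Counter(c for gene in known_genes for c in codons_in_frame(gene, 0))
    total = sum(counts.values())
    return lambda codon: (counts[codon] + 1) / (total + 64)

def coding_potential(dna, frame, codon_prob):
    """Log2-odds of the frame's codons versus a uniform 1/64 background."""
    score = 0.0
    for codon in codons_in_frame(dna, frame):
        if codon in STOPS:
            return float("-inf")  # criterion 1: no stop codon inside
        score += math.log2(codon_prob(codon) * 64)  # criterion 2: codon preference
    return score
```

The region scored should exclude the gene's own terminal stop codon. Scoring the three offsets of the same stretch and keeping the best one illustrates why the right frame stands out.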
Open Reading Frame (ORF).
Things that can go wrong with ORFs
The N-terminal part of the gene is truncated because of the presence of a downstream ATG.
Random occurrence of ORF-like patterns. These would not have a “promoter” pattern upstream of the ATG.
Eukaryotic genes are internally spliced which complicates the story, a lot.
Real genes vs. ORF
Real genes are likely to be already documented in the databases (I know, a circular argument), as we will see in the next series of slides.
If not, an ORF has to be sequenced from a cDNA library instead of a genomic DNA library to be proven to be a gene.
BLASTING sequences
Using Blast searches
Tells whether a mystery piece of sequence belongs to a gene existing in other organisms.
Who has a copy of the gene.
Is likely to have a hit to something already characterized.
What it does.
Is likely to be similar to other genes doing a different task following functional divergence during evolution.
How it came to be.
We will look at BLAST in more detail in the following week.
What else can be said?
Hydropathy plots on protein sequences
Can tell you if the protein is soluble or attached to the cell (through the oily membrane).
What else can be said?
Hydropathy plots on protein sequences
Leptin receptor is an example of a protein with a transmembrane helix (plot window = 19).
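A hydropathy plot is just a sliding-window average of a per-residue hydrophobicity score. The sketch below uses the Kyte-Doolittle scale, which is an assumption here since the slide does not say which scale produced the plot; the window of 19 matches the one quoted above. Unknown characters such as X are not handled and would raise a `KeyError`.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Average hydropathy over each window of consecutive residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Plotting the returned list against window position gives the hydropathy plot; a sustained stretch of high positive values is the signature usually read as a candidate transmembrane helix.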
What else can be said?
Signal sequences
Small sub-sequences that are used to dispatch a protein to a specific location in (or out of) the cell.
From sequence information alone, it is possible to tell where a gene’s product acts in the cell.
Summary
DNA is sequenced in relatively small steps (<1000 characters)
Genomes have to be broken down into smaller DNA clones and amplified in bacteria/yeast cells.
Strategies have to be devised to efficiently assemble small reads into large DNA sequences.
Genes can be found from sequence information as Open Reading Frames.
Sequences can be characterized on the basis of their similarity to other existing sequences.
Sequences can also be characterized on the sole basis of the chemical properties of the amino acids they encode.