csci6904 genomics and biological computing lecture 2 – genomics encoding molecules sequencing dna...

CSCI6904

Genomics and Biological Computing

Lecture 2 – Genomics

Encoding Molecules

Sequencing DNA

Genome Projects

Finding Genes

How to encode information into molecules

Adenine A

Guanine G

Thymine T

Cytosine C

Nucleotides also are building blocks for energy and signaling pathways

Adenosine A

ATP

AMP

The tale of a structure

A Structure for Deoxyribose Nucleic Acid

J. D. WATSON F. H. C. CRICK

2 April 1953

MOLECULAR STRUCTURE OF NUCLEIC ACIDS

http://www.chemheritage.org/EducationalServices/chemach/ppb/cwwf.html


A double helix as encoding medium

Protection against environment
The informative unit is stowed inside, away from water-soluble toxins. Cancer-causing agents typically can penetrate this defense.

Redundancy
Proofreading by comparing to the complementary strand.

Mechanical protection
Torque, stretching…

Control of information flow
“Archive” the information when not in use.

Fancy control structures
Hairpins, turns, twists and other frills.

Transcription

Retrieve copies of accessible genes
All genes on the chromosome that are exposed, for any reason, are transcribed into mobile and unstable molecules called messenger RNAs [A, U, C, G].

Exporting, editing, processing
These are exported out of the nucleus; some are edited according to some control scheme.

Taken up by the translation machinery
Each messenger enters a complex molecular machine that translates the gene into a protein written in a 20-character alphabet.

Destroyed quickly
So that no separate control mechanism is required at this step.

Translation

Convert words of three characters into protein chains
These three-character words are called “codons”.

Translation

The code is universal and degenerate
Most organisms use the universal code. Degenerate means that several codons can encode the same amino acid.
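As an illustration of codon-by-codon translation, here is a minimal Python sketch. It assumes the standard genetic code written on the DNA alphabet (T instead of U); the names `CODON_TABLE` and `translate` are made up for this example, and the input sequence is a toy string rather than a real gene.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))
# Degeneracy: count how many codons map to leucine (L).
print(sum(1 for aa in CODON_TABLE.values() if aa == "L"))
```

The table holds 64 codons for 20 amino acids plus stop, which is why several codons have to share an amino acid.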

Translation

The code is universal and degenerate
Different organisms have different codon frequencies.
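Codon frequencies can be tallied directly from a coding sequence. The sketch below is only an illustration: `codon_frequencies` is a hypothetical helper, and the input is a toy string, not data from a real organism.

```python
from collections import Counter


def codon_frequencies(dna):
    """Return the relative frequency of each codon in frame 0 of a coding sequence."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy example: CTG and TTA both encode leucine, but CTG is used three times as often.
freqs = codon_frequencies("CTGCTGCTGTTA")
print(freqs)
```

Comparing such tables between organisms shows which of the synonymous codons each one prefers.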

Translation
From there, the new chain spontaneously adopts a 3D structure and starts to do something. Other proteins require further editing and are exported to their destination.

Protein Alphabet
20 universal characters.

Genetically encoded extra AA are very rare.

Each character has a set of properties:

• Electrostatic charge (+/-)
• “Hydrophobicity” (doesn’t mix with water)
• Chemical reactivity

Sequencing DNA
Why
All genetic information is encoded in the DNA molecule. Sequencing DNA is necessary to create a representation of the information on which computation can be performed.

Principle
Reading cannot be done directly, as the individual nucleotides cannot be visually resolved.

Using a natural protein cloned from a bacterium, a collection of molecules of every possible length is generated. These artificial replicates are separated on the basis of their size in a gel matrix using a powerful electric field (electrophoresis). Individual replicates are then resolved because they are tagged with either fluorescent or radioactive markers.

Replication in vivo
Principle
A protein called a DNA polymerase steps along a single strand of the DNA, finding the complementary character to each nucleotide and attaching it to the growing new chain.

Polymerase enzyme (proteins)

Replication in vitro and sequencing

Sequencing DNA
Nowadays
Create a mixture of chains of all possible lengths by using unreactive capped ends.
Separate the four mixtures together on the basis of their size.
Read the sequence as a string (anywhere between 300 – 900 characters*).
*includes 0!
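The read-out logic can be mimicked with a toy simulation. This sketch assumes an idealized reaction in which every possible capped fragment is produced exactly once and sizes are resolved perfectly; the function names are invented for the example.

```python
def chain_terminated_fragments(new_strand):
    """Return (length, terminal_base) for every capped fragment of the copied strand."""
    fragments = []
    for base in "ACGT":  # one reaction mixture per capped base
        for length, b in enumerate(new_strand, start=1):
            if b == base:
                # A fragment of this length ends with the capped (labelled) base.
                fragments.append((length, base))
    return fragments


def read_by_size(fragments):
    """Sort the pooled fragments by size and read off the terminal labels."""
    return "".join(base for _, base in sorted(fragments))


fragments = chain_terminated_fragments("GATTACA")
print(read_by_size(fragments))
```

Sorting by length plays the role of the gel: the shortest fragment reveals the first base, the next shortest the second, and so on.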

Sequencing DNA
Errors
Errors in reading are more frequent at the extremities of the readable sequence.

Very compact reads (early)
Less defined reads (late) due to “smearing”

Polymerase has an intrinsic rate of replication error. In nature, these would be called mutations. In a lab, these are just called annoying.

Proofreading DNA
Since there are two single strands, both are usually sequenced and cross-checked for inconsistencies.
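A cross-check between the two strands could look like the following sketch. It assumes both reads cover exactly the same region and have the same length, which real reads rarely do; `cross_check` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cross_check(forward_read, reverse_read):
    """Return positions (0-based, on the forward read) where the two strands disagree."""
    expected = reverse_complement(reverse_read)
    return [
        i
        for i, (a, b) in enumerate(zip(forward_read.upper(), expected))
        if a != b
    ]


print(cross_check("ATGCCG", "CGGCAT"))  # consistent reads
print(cross_check("ATGCCG", "CGGAAT"))  # one disagreement to re-examine
```

A non-empty result only says that the strands disagree at those positions; deciding which read is wrong needs more evidence.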

Representation
FASTA format
Most basic format, storing sequence information only. This is what is usually downloaded from a database.

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
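Reading this format takes only a few lines. The sketch below is a minimal parser, assuming each record starts with a `>` header line followed by one or more sequence lines; `parse_fasta` is a made-up name, and established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines of one record are simply concatenated.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


example = """>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQ
IESIWAAELDRYKLVKTNCSNVS
"""
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```

The line breaks inside a record carry no meaning, so the parser joins them back into one string.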

Representation
ASN.1 format
This is the structured data format in which GenBank records are stored and exchanged. I’m not sure who uses this directly.

XML formats
There is a collection of markup-language derivatives for sequence information, which may be more convenient to deal with.

GenPept format
Human-readable formatting of the content; this is the default representation if one uses the online portal to the GenBank database.

Seq-entry ::= set {
  class pop-set ,
  descr {
    pub {
      pub {
        gen {
          cit "Unpublished" ,
          authors {
            names std {
              {
                name name {
                  last "Burda" ,
                  first "Sherri" ,
                  initials "S.T." } } ,
              {
                name name {
                  last "Konings" ,
                  first "Frank" ,
                  initials "F.A.J." } } ,
              …

Size of genomes

Human 3.31 Gbp

Mouse 3.3 Gbp

Corn 5 Gbp

Fruit fly 180 Mbp

Frog 3.1 Gbp

E. coli 4.7 Mbp

HIV-2 9.6 Kbp

Size of genomes

Variance in size
It costs energy to replicate a genome. Organisms with a short generation time (~minutes) are under strong pressure to dump garbage and duplicated DNA so as to maximize their efficiency.

This is not a problem for higher life forms, for which the availability of energy isn’t a problem.

Some plants have the largest genomes, although only a small fraction of these genomes actually encodes genes.

Vocabulary

Plasmid
Artificial construct used to manipulate sequences.

Cloning
Make a copy of a segment of DNA.

Reading a whole genome
BAC and YAC
Artificial chromosomes or plasmids from bacteria and yeast, respectively.

Library
Extract whole-cell DNA, clean it up, and break it into random fragments:

150 kbp (BAC)
0.15-1.5 Mbp (YAC)

Paste the fragments into a BAC or YAC.
Introduce the BAC or YAC into a host cell.

Chromosome walk Sequencing

Principle
Isolate a DNA fragment / chromosome.
Create a specific replication primer.
Sequence as far as possible.
Use a region near the end of the current read to design a new primer.

The initial primers are known because they are located on the BAC/YAC.

Slow and expensive.

Shotgun Sequencing
Principle
Start with a BAC/YAC construct from a library, again.
Create random replication primers.
Sequence an arbitrarily large number of samples.
Assemble based on sequence identity.

Shotgun Sequencing
Assembly into CONTIGS
A case of the shortest common superstring problem.

Aided by the knowledge of Sequence-Tagged Sites (STS). STS are essentially substrings that are unique within a genome and have been mapped to a chromosome.
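To make the assembly step concrete, here is a small greedy sketch: repeatedly merge the two reads with the longest suffix/prefix overlap. This is a simple heuristic for the shortest common superstring problem, not necessarily what any production assembler does; it assumes error-free reads from a single strand, ignores STS information, and uses invented names (`overlap`, `greedy_assemble`, `min_len`).

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads into contigs by always taking the best overlap first."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # the remaining reads do not overlap: they stay separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Each pass compares every pair of reads, so the cost grows quickly with the number of reads; this is fine for a toy example but not for a genome.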

Shotgun Sequencing
Assembly into CONTIGS
The main caveat with the method is that it tends to delete repetitive regions, or to get stuck in local minima in situations like the following:

...ADGHKJGKJXXXXXXXXXSDGDKJHDGFXXXXXXXSADGUYDSSDGK…

Public vs Private Genomes

Shotgun vs. Systematic walk

The Human genome project broke into two components somewhere along the road:

Private company: Shotgun sequencing only
Public project: Chromosome mapping.


TIGR:
Non-profit.
First full genome: Haemophilus influenzae

Human Genome Draft 1 (Public data, 2001)


CELERA:
Drosophila genome (Public collaboration)
Bacterial genomes (Proof of concept)

Human Genome Draft 1 (Proprietary data, 2001)

Both still have gaps and typos.


Which one is the best?
CELERA’s draft has been shown to collapse regions of high sequence identity.

CELERA has access to the public database to correct this problem.

CELERA charges a high price for access to their data!

Whose DNA was sequenced

Nine persons (anonymous)
8 males
1 female

Males have a Y chromosome, females don’t.

3/9 were from germ line cells (sperm)

Some genes are known to be pre-processed in non-germ line cells directly on the DNA.

Craig Venter, CELERA’s CEO, admitted that ~3/5 of CELERA’s DNA is his! Sigh.

What is in the HGP

OK, so what is a gene?

STOP codon
TAA, TGA, TAG

START codon
ATG (also codes for protein character M)

Open Reading Frame (ORF)
Definition
Any segment of DNA which starts with a start codon and ends with a stop codon in phase.


The purple protein in this figure is responsible for finding stop codons.

Open Reading Frame (ORF).

There are six possible translational frames to worry about
Sequence as in the DB

TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG

Three first frames

TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A
T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA
TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA

But DNA is a double strand…

5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’
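The six frames can be enumerated mechanically: three offsets on the strand as stored, and three on its reverse complement. The following is a minimal sketch using the database sequence shown above; the frame labels (`+1` … `-3`) and function names are conventions chosen for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a DNA sequence into codons for all six translational frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Incomplete codons at the end of a frame are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


sequence = "TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG"
for name, codons in six_frames(sequence).items():
    print(name, " ".join(codons))
```

The three `-` frames are read from the other strand, which is why the reverse complement has to be taken before splitting into codons.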

What is 5’ and 3’?

This is derived from the chemical numbering of the carbon atoms in the sugar molecule (deoxyribose in DNA, ribose in RNA).

Directionality of the Chain


But DNA is a double strand

5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’

Principle (in bacteria)

1. Find the longest possible sequence beginning with an ATG and terminating with a TAA, TAG, or TGA.

2. There may be multiple ATGs inside the gene, but only a single stop codon.

3. Real genes will have a regulatory region upstream of the ORF. Use pattern detection to find it.

4. Real genes are typically 100-500 codons long.
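Steps 1, 2 and 4 can be sketched as a scan over all six frames. This is only an illustration under simplifying assumptions: it ignores the regulatory-region check of step 3, and the names `find_orfs` and `min_codons` are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_codons=100):
    """Return (strand, start, orf) for the longest ORF ending at each in-frame stop.

    start is 0-based on the strand being read; orf includes the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i  # keep the most upstream ATG: longest possible ORF
                elif codon in STOP_CODONS and start is not None:
                    # Length is counted in codons, excluding the stop codon.
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, s[start:i + 3]))
                    start = None
    return orfs


# Toy sequence with two in-frame ATGs before a single stop codon.
print(find_orfs("CCATGAAAATGCCCTAAGG", min_codons=2))
```

With the default `min_codons=100`, the toy sequence yields nothing, which reflects the length filter in step 4.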


The regulatory regions cannot be found using exact string matching, or even regular expressions.

The following slides will give you an idea of how this can be done.

Promoter regions

Calculating a Sequence Logo

$$H = -\sum_{i=1}^{M} P_i \log_2 P_i$$

$$R = H_{\text{random}} - H_{\text{observed}}$$

Information theory (Sequence Logos)
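The information content behind a sequence logo can be computed per column of an alignment, as the entropy H = -Σ P_i log2 P_i of the observed base frequencies subtracted from the entropy of a random column. The sketch below assumes a uniform background over the four bases (so the random entropy is 2 bits) and applies no small-sample correction; the aligned sites are made-up toy data, not real promoters.

```python
import math
from collections import Counter


def information_content(aligned_sites, alphabet="ACGT"):
    """Return R = H_random - H_observed (in bits) for each alignment column."""
    h_random = math.log2(len(alphabet))  # uniform background assumed
    result = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column)
        h_observed = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        result.append(h_random - h_observed)
    return result


# Hypothetical aligned sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
for position, bits in enumerate(information_content(sites), start=1):
    print(position, round(bits, 2))
```

Fully conserved columns reach the maximum of 2 bits, while variable columns score lower; a logo draws each column with a height proportional to this value.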


Finding Real Start codon

Something like a HMM can be trained to classify whether an ATG codon really is a start codon.

Yin and Wang, GeneScout paper, see course website

Things get complicated with eukaryotes

Eukaryote genes contain sub-strings of self-splicing junk called introns.


However, the splicing sites are made of statistically correlated sub-strings.


A similar HMM strategy can be used to find all splicing sites.


The weights in the model are calculated based on a so-called coding potential:

- No stop codon
- Codon preferences in an organism (the right frame will give a much better score)
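The two criteria above can be turned into a rough frame score. The sketch below is not the HMM itself: it simply assumes a log-odds score of codon usage against a uniform background, with codon preferences estimated from known genes. The training gene and candidate are toy strings, and all names (`codon_log_odds`, `coding_potential`, `pseudocount`) are hypothetical.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq, frame=0):
    """Split seq into complete codons starting at the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_log_odds(known_genes, pseudocount=1.0):
    """Build a scoring function from codon usage observed in known genes."""
    counts = Counter(c for gene in known_genes for c in codons(gene))
    total = sum(counts.values()) + 64 * pseudocount

    def score(codon):
        # log2 of (estimated codon frequency / uniform frequency of 1/64)
        return math.log2(((counts[codon] + pseudocount) / total) * 64)

    return score


def coding_potential(seq, frame, score):
    """Score one frame: minus infinity if it contains a stop codon, else summed log-odds."""
    frame_codons = codons(seq, frame)
    if any(c in STOP_CODONS for c in frame_codons):
        return float("-inf")
    return sum(score(c) for c in frame_codons)


score = codon_log_odds(["ATGGCTGCTGCTAAAGCT"])  # toy training gene
candidate = "GCTGCTAAAGCT"
scores = [coding_potential(candidate, f, score) for f in range(3)]
print(scores.index(max(scores)))  # index of the best-scoring frame
```

In this toy case the frame that matches the training gene's codon usage scores highest, and the frame containing a stop codon is ruled out entirely.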


Things that can go wrong with ORFs

The N-terminal part of the gene is truncated because of the presence of a downstream ATG.

Random occurrence of ORF-causing patterns. These would not have a “promoter” pattern upstream of the ATG.

Eukaryotic genes are internally spliced which complicates the story, a lot.

Real genes vs. ORF

Real genes are likely to be already documented in the databases (I know, a circular argument), as we will see in the next series of slides.

If not, an ORF has to be sequenced from a cDNA library instead of a genomic DNA library to be proven to be a gene.

BLASTING sequences

Using Blast searches

Who has a copy of the gene?
Tells whether a mystery piece of sequence belongs to a gene existing in other organisms.

What does it do?
Is likely to have a hit to something already characterized.

How did it come to be?
Is likely to be similar to other genes doing a different task following functional divergence during evolution.

We will look at BLAST in more detail in the following week.

What else can be said?

Hydropathy plots on protein sequences

Can tell you if the protein is soluble or attached to the cell (through the oily membrane).


Hydropathy plots on protein sequences

Leptin receptor is an example of a protein with a transmembrane helix (plot window = 19).
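A hydropathy profile is a sliding-window average over a per-residue hydrophobicity scale. The sketch below assumes the Kyte-Doolittle scale, which the slides do not name, together with the window of 19 mentioned above; the peptide is an invented toy sequence, not the leptin receptor.

```python
# Kyte-Doolittle hydropathy values (assumed scale; higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Return the mean hydropathy of every window along the protein.

    Raises KeyError for characters outside the 20 standard amino acids (e.g. X).
    """
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy peptide: a hydrophobic stretch flanked by charged residues.
peptide = "KKKK" + "L" * 19 + "DDDD"
profile = hydropathy_profile(peptide)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak, round(profile[peak], 2))
```

A sustained peak over a window of about this size is what suggests a membrane-spanning helix; plotting `profile` against position gives the hydropathy plot.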


Signal sequences

Small sub-sequences that are used to dispatch a protein to a specific location in (or out of) the cell.

From sequence information alone, it’s possible to tell where a gene product acts in the cell.

Summary

DNA is sequenced in relatively small steps (<1000 characters)

Genomes have to be broken down into smaller DNA clones and amplified in bacteria/yeast cells.

Strategies have to be devised to efficiently assemble small reads into large DNA sequences.

Genes can be found from sequence information as Open Reading Frames.

Sequences can be characterized on the basis of their similarity to other existing sequences.

Sequences can also be characterized on the sole basis of the chemical properties of the amino acids they encode.
