CSCI6904 Genomics and Biological Computing Lecture 2 – Genomics Encoding Molecules Sequencing DNA Genome Projects Finding Genes

Upload: howard-boyd

Post on 28-Dec-2015

Page 1:

CSCI6904

Genomics and Biological Computing

Lecture 2 – Genomics

Encoding Molecules

Sequencing DNA

Genome Projects

Finding Genes

Page 2:

How to encode information into molecules

Adenine A

Guanine G

Thymine T

Cytosine C

Page 3:

Nucleotides are also building blocks for energy and signaling pathways

Adenosine A

ATP

AMP

Page 4:

The tale of a structure

A structure for Deoxyribose Nucleic Acid

J. D. WATSON F. H. C. CRICK

2 April 1953

MOLECULAR STRUCTURE OF NUCLEIC ACIDS

http://www.chemheritage.org/EducationalServices/chemach/ppb/cwwf.html

Page 5:

A double helix as encoding medium

Protection against environment
The informative unit is stowed inside, away from water-soluble toxins. Cancer-causing agents typically can penetrate this defense.

Redundancy
Proofreading by comparing to the complementary strand

Mechanical protection
Torque, stretching…

Control of information flow
“Archive” the information when not in use

Fancy control structures
Hairpins, turns, twists and other frills.

Page 6:

Transcription

Retrieve copies of accessible genes
All genes on the chromosome that are exposed, for any reason, are transcribed into mobile and unstable molecules called messenger RNAs [A, U, C, G].

Exporting, editing, processing
These are exported out of the nucleus; some are edited according to some control scheme.

Taken up by the translation machinery
Each enters a complex molecular machine that translates the gene into a protein written in a 20-character alphabet.

Destroyed quickly
So that no control mechanism is required at this step.
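
As a string operation, transcription can be sketched very simply. The snippet below is only an illustration: it assumes the input is the coding (sense) strand written 5' to 3', and the function name `transcribe` is made up for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Return the messenger RNA for a DNA coding strand by replacing T with U."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected only A, C, G, T")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```

Export, editing and degradation of the messenger are not modeled here.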

Page 7:

Translation

Convert words of three characters into protein chains
These three-character words are called “codons”.
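
Reading codons can be sketched as a table lookup. This is a minimal example, assuming the standard genetic code and a DNA coding sequence that is already in frame; `CODON_TABLE` and `translate` are names chosen for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # '*' marks the three stop codons
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # MAK
```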

Page 8:

Translation

The code is universal and degenerate
Most organisms use the universal code.

Page 9:

Translation

The code is universal and degenerate
Different organisms have different codon frequencies.
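
Codon frequencies can be estimated by counting. The sketch below assumes an in-frame coding sequence; the function name and the toy sequence are made up for illustration. In the toy sequence, CTG, CTC and TTA all encode leucine, which is what "degenerate" means in practice.

```python
from collections import Counter


def codon_frequencies(coding_seq: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy sequence: ATG CTG CTG CTC TTA TAA
freqs = codon_frequencies("ATGCTGCTGCTCTTATAA")
print(freqs["CTG"])
```

Run over all known genes of an organism, such a table gives the codon preferences that gene finders use later in the lecture.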

Page 10:

Translation
From there, the new chain spontaneously adopts a 3D structure and starts to do something. Other proteins require further editing and are exported to their destination.

Page 11:

Protein Alphabet
20 universal characters.

Genetically encoded extra amino acids (AA) are very rare.

Each character has a set of properties:

• Electrostatic charge (+/-)
• “Hydrophobicity” (doesn’t mix with water)
• Chemical reactivity

Page 12:

Sequencing DNA

Why
All genetic information is encoded in the DNA molecule. Sequencing DNA is necessary to create a representation of the information on which computation can be performed.

Principle
Reading cannot be done directly as the individual nucleotides cannot be visually resolved.

Using a natural protein cloned from a bacterium, a collection of molecules of every possible length is generated. These artificial replicates are separated on the basis of their size in a gel matrix using a powerful electric field (electrophoresis). Individual replicates are then resolved because they are tagged with either fluorescent or radioactive markers.

Page 13:

Replication in vivo

Principle
A protein, called a DNA polymerase, steps through a single strand of the DNA, finding the complementary character to each nucleotide and attaching it to the growing new chain.

Polymerase enzyme (proteins)

Page 14:

Replication in vitro and sequencing

Page 15:

Sequencing DNA

Nowadays
Create a mixture of chains of all possible lengths by using unreactive capped ends.
Separate the four mixtures together on the basis of their size.
Read the sequence as a string (anywhere between 300 and 900 characters*).
*Includes 0!
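
The three steps above can be sketched as a toy simulation. It is heavily idealized: it assumes exactly one terminated chain per position and perfect size separation, and the function names are invented for this example.

```python
def sanger_fragments(product_strand: str) -> dict[str, list[int]]:
    """For each base, list the lengths of the chains that terminate on that base.

    Idealized: every position yields exactly one terminated chain.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(product_strand.upper(), start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Merge the four mixtures, sort by chain length, and read the bases in order."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_fragments("GATTACA")
print(lanes["A"])       # [2, 5, 7]
print(read_gel(lanes))  # GATTACA
```

Sorting by length plays the role of electrophoresis: the shortest chain is the first band, and the base that capped it is the first character of the read.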

Page 16:

Sequencing DNA

Errors
Errors in reading are more frequent at the extremities of the readable sequence.

Very compact reads (early)
Less defined reads (late) due to “smearing”

Polymerase has an intrinsic rate of replication error. In nature, these would be called mutations. In a lab, these are just called annoying.

Proofreading DNA
Since there are two single strands, both are usually sequenced and cross-checked for inconsistencies.
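
Cross-checking the two strands amounts to comparing one read with the reverse complement of the other. A minimal sketch, assuming both reads cover exactly the same region with no insertions or deletions (real reads would need an alignment first); the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse, to get the opposite strand 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def strand_disagreements(forward_read: str, reverse_read: str) -> list[int]:
    """0-based positions (forward coordinates) where the two reads disagree."""
    expected = reverse_complement(reverse_read)
    if len(expected) != len(forward_read):
        raise ValueError("reads must have the same length")
    return [i for i, (a, b) in enumerate(zip(forward_read.upper(), expected)) if a != b]


print(strand_disagreements("ATGCCG", "CGGCAT"))  # []
print(strand_disagreements("ATGCCG", "CGGTAT"))  # [2]
```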

Page 17:

Representation

FASTA format
Most basic format, storing sequence information only. This is what is usually downloaded from a database.

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
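
Reading such a file takes only a few lines. The sketch below is a bare-bones parser written for this example (it is not a library API): a line starting with ">" opens a record, and every following line is appended to that record's sequence.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record's header line (without '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# The second record from the slide, wrapped over two lines on purpose.
example = """>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQ
IESIWAAELDRYKLVKTNCSNVS
"""
records = parse_fasta(example)
print(records["Example2 synthetic peptide"])
```

Records with identical header lines would overwrite each other in this sketch; a production parser would keep a list of records instead.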

Page 18:

Representation

ASN.1 format
This is the underlying data format for GenBank records. I’m not sure who uses it directly.

XML formats
There is a collection of markup language derivatives for sequence information which may be more convenient to deal with.

GenPept format
Human-readable formatting of the content; this is the default representation if one uses the online portal to the GenBank database.

Seq-entry ::= set {
  class pop-set ,
  descr {
    pub {
      pub {
        gen {
          cit "Unpublished" ,
          authors {
            names std {
              {
                name name {
                  last "Burda" ,
                  first "Sherri" ,
                  initials "S.T." } } ,
              {
                name name {
                  last "Konings" ,
                  first "Frank" ,
                  initials "F.A.J." } } ,
              …

Page 19:

Size of genomes

Human 3.31 Gbp

Mouse 3.3 Gbp

Corn 5 Gbp

Fruit fly 180 Mbp

Frog 3.1 Gbp

E. coli 4.7 Mbp

HIV-2 9.6 kbp

Page 20:

Size of genomes

Variance in size
It costs energy to replicate a genome. Organisms with a short generation time (~minutes) are under strong pressure to dump garbage and duplicate DNA so as to maximize their efficiency.

This is not a problem for higher life forms, for which the availability of energy is not limiting.

Some plants have the largest genomes, although only a small fraction of these genomes actually encodes genes.

Page 21:

Vocabulary

Plasmid
Artificial construct used to manipulate sequences.

Cloning
Make a copy of a segment of DNA

Page 22:

Reading a whole genome

BAC and YAC
Artificial chromosomes or plasmids from bacteria and yeast, respectively.

Library
Extract whole-cell DNA, clean it up, break it into random fragments:

150 kbp (BAC)
0.15-1.5 Mbp (YAC)

Paste the fragments into BAC or YAC.
Introduce BAC or YAC into host cell.

Page 23:

Chromosome walk sequencing

Principle
Isolate a DNA fragment / chromosome.
Create a specific replication primer.
Sequence as far as possible.
Use a region near the end of the current read to design a new primer.

The initial primers are known because they are located on the BAC/YAC.

Slow and expensive.
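
The walk can be sketched as a loop in which each round depends on the previous one, which is why it is slow. This toy version assumes error-free reads and primers that occur only once in the template; `primer_walk`, `read_len` and `primer_len` are illustrative names, and real reads and primers are far longer.

```python
def primer_walk(template: str, first_primer: str, read_len: int = 8, primer_len: int = 4):
    """Reconstruct the template downstream of first_primer, one read per round.

    Returns the assembled sequence and the number of sequencing rounds used.
    """
    assembled = first_primer
    primer = first_primer
    search_from = 0
    rounds = 0
    while True:
        site = template.index(primer, search_from)  # the primer anneals to its site
        start = site + len(primer)
        read = template[start:start + read_len]     # sequence as far as possible
        if not read:                                # reached the end of the fragment
            return assembled, rounds
        assembled += read
        rounds += 1
        primer = assembled[-primer_len:]            # new primer from the end of the read
        search_from = site + 1


template = "AAGCTTGGATCCGTACGATTAGC"
sequence, rounds = primer_walk(template, "AAGC")
print(sequence == template, rounds)  # True 3
```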

Page 24:

Shotgun Sequencing

Principle
Start with a BAC/YAC construct from a library, again.
Create random replication primers.
Sequence an arbitrarily large number of samples.
Assemble based on sequence identity.

Page 25:

Shotgun Sequencing

Assembly into CONTIGS
A case of the shortest common superstring problem.

Aided by the knowledge of Sequence-Tagged Sites (STS). STS are pretty much just substrings unique to a genome which have been mapped to a chromosome.
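
A common way to approach this problem is a greedy heuristic: repeatedly merge the two reads with the largest suffix/prefix overlap. The sketch below is only that heuristic, not an exact solution and not necessarily what any genome project used; it assumes error-free reads from one strand, and the names are made up for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain; return the contigs."""
    contigs = sorted(set(reads))
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return sorted(contigs)
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["GCGTACGT", "ACGTTAGC", "ATGGCGTA"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

When several contigs remain at the end, STS positions are the kind of outside information that can order them along a chromosome.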

Page 26:

Shotgun Sequencing

Assembly into CONTIGS
The main caveat with the method is that it tends to delete repetitive regions, or to get stuck in local minima in situations like the following:

...ADGHKJGKJXXXXXXXXXSDGDKJHDGFXXXXXXXSADGUYDSSDGK…

Page 27:

Public vs Private Genomes

Shotgun vs. Systematic walk

The Human Genome Project broke into two components somewhere along the road:

Private company: Shotgun sequencing only
Public project: Chromosome mapping.

Page 28:

Public vs Private Genomes

TIGR:
Non-profit.
First full genome: Haemophilus influenzae

Human Genome Draft 1 (Public data, 2001)

Page 29:

Public vs Private Genomes

CELERA:
Drosophila genome (public collaboration)
Bacterial genomes (proof of concept)

Human Genome Draft 1 (Proprietary data, 2001)

Both still have gaps and typos.

Page 30:

Public vs Private Genomes

Which one is the best?
CELERA’s draft has been shown to collapse regions of high sequence identity.

CELERA has access to the public database to correct this problem.

CELERA charges a high price for access to their data!

Page 31:

Whose DNA was sequenced

Nine persons (anonymous)
8 males
1 female

Males have a Y chromosome, females don’t.

3/9 were from germ line cells (sperm)

Some genes are known to be pre-processed in non-germ line cells directly on the DNA.

Craig Venter, CELERA’s CEO, admitted that ~3/5 of CELERA’s DNA is his! Sigh.

Page 32:

What is in the HGP

Page 33:

OK, so what is a gene?

STOP codon
TAA, TGA, TAG

START codon
ATG (also codes for protein character M)

Page 34:

Open Reading Frame (ORF).

Definition
Any segment of DNA which starts with a start codon and ends with a stop codon in phase.

Page 35:

Open Reading Frame (ORF).

Definition
Any segment of DNA which starts with a start codon and ends with a stop codon in phase.

The purple protein in this figure is responsible for finding stop codons.

Page 36:

Open Reading Frame (ORF).

There are six possible translational frames to worry about

Sequence as in the DB

TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG

First three frames

TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A
T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA
TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA

But DNA is a double strand…

5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’

Page 37:

What is 5’ and 3’?

This is derived from the chemical notation for the sugar molecule ribose.

Directionality of the Chain

Page 38:

Open Reading Frame (ORF).

But DNA is a double strand

5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’

Principle (in bacteria)

1. Find the longest possible sequence beginning with an ATG and terminated by a TAA, TAG or TGA.

2. There may be multiple ATGs inside the gene, but only a single stop codon.

3. Real genes will have a regulatory region upstream of the ORF. Use pattern detection to find it.

4. Real genes are typically 100-500 codons long.
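
Rules 1, 2 and 4 can be sketched as a scan over the six frames. This is a naive illustration rather than a real gene finder: it ignores rule 3 (the regulatory region), assumes an unambiguous A/C/G/T sequence, and uses names invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end, frame) for the longest ORF ending at each in-frame stop codon."""
    for frame in range(3):
        start = None  # first ATG seen since the previous stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None


def find_orfs(dna: str, min_codons: int = 100):
    """Scan all six frames; '-' strand coordinates refer to the reverse complement."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for start, end, frame in orfs_in_strand(seq, min_codons):
            found.append((strand, frame, start, end, seq[start:end]))
    return found


# Tiny example, so the length threshold is lowered to 3 codons.
print(find_orfs("GGATGAAACCCTAGGG", min_codons=3))
```

Keeping only the first ATG after the previous stop is what makes each reported ORF the longest one for its stop codon; inner ATGs are treated as ordinary methionine codons.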

Page 39:

Open Reading Frame (ORF).

The regulatory regions cannot be searched using string matching, or even regular expressions.

The following slides will give you an idea of how this can be done.

Page 40:

Promoter regions

Page 41:

Calculating a Sequence Logo

$$H = -\sum_{i=1}^{M} P_i \log_2 P_i$$

$$R = H_{\mathrm{random}} - H_{\mathrm{observed}}$$

Information theory (Sequence Logos)
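
The height of each logo column is the information R at that position. A minimal sketch of the calculation, assuming equal background frequencies for the M = 4 nucleotides and no small-sample correction; the aligned sites are invented for the example.

```python
from collections import Counter
from math import log2


def information_content(aligned_sites: list[str], alphabet: str = "ACGT") -> list[float]:
    """Per-column information R = H_random - H_observed, in bits."""
    h_random = log2(len(alphabet))  # 2 bits when all four bases are equally likely
    result = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column)
        h_observed = -sum((n / total) * log2(n / total) for n in counts.values())
        result.append(h_random - h_observed)
    return result


# Hypothetical aligned binding sites, one string per site.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
print([round(r, 2) for r in information_content(sites)])
# [2.0, 2.0, 1.19, 1.19, 2.0, 2.0]
```

Fully conserved columns reach the maximum of 2 bits; a column with no preference at all would score 0.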

Page 42:

Promoter regions

$$H = -\sum_{i=1}^{M} P_i \log_2 P_i$$

$$R = H_{\mathrm{random}} - H_{\mathrm{observed}}$$

Information theory (Sequence Logos)

Page 43:

Finding Real Start codon

Something like an HMM can be trained to classify whether an ATG codon really is a start codon.

Yin and Wang, GeneScout paper, see course website

Page 44:

Things get complicated with eukaryotes

Eukaryotic genes contain sub-strings of junk, called introns, that are spliced out of the transcript.

Page 45:

Things get complicated with eukaryotes

However, the splicing sites are made of statistically correlated sub-strings.

Page 46:

Things get complicated with eukaryotes

A similar HMM strategy can be used to find all splicing sites.

Page 47:

Things get complicated with eukaryotes

The weights in the model are calculated based on a so-called coding potential:

- No stop codon
- Codon preferences in an organism (the right frame will give a much better score)
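
One simple way to turn these two criteria into a number is a log-odds score per frame. This is a sketch under stated assumptions, not the scoring used by any particular gene finder: the null model is uniform codon usage, unseen codons get an arbitrary pseudo-frequency of 1e-4, and the `usage` table is invented for the example.

```python
from math import log

STOPS = {"TAA", "TAG", "TGA"}


def coding_potential(window: str, codon_freq: dict[str, float], frame: int = 0) -> float:
    """Log-odds score of a window read in one frame; -inf if it contains a stop codon."""
    background = 1 / 64  # uniform codon usage as the null model
    score = 0.0
    for i in range(frame, len(window) - 2, 3):
        codon = window[i:i + 3].upper()
        if codon in STOPS:
            return float("-inf")
        score += log(codon_freq.get(codon, 1e-4) / background)
    return score


# Hypothetical codon frequencies for some organism.
usage = {"ATG": 0.03, "CTG": 0.05, "GCC": 0.04, "AAA": 0.03}
window = "ATGCTGGCCAAA"
scores = [coding_potential(window, usage, frame) for frame in range(3)]
print(scores.index(max(scores)))  # 0: the true frame scores best
```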

Page 48:

Open Reading Frame (ORF).

Things that can go wrong with ORFs

The N-terminal part of the gene is truncated because of the presence of a downstream ATG.

Random occurrence of ORF-causing patterns. These would not have a “promoter” pattern upstream from the ATG.

Eukaryotic genes are internally spliced, which complicates the story a lot.

Page 49:

Real genes vs. ORF

Real genes are likely to be already documented in the databases (I know, a circular argument), as we will see in the next series of slides.

If not, an ORF has to be sequenced from a cDNA library instead of a genomic DNA library to be proven to be a gene.

Page 50:

BLASTING sequences

Page 51:

BLASTING sequences

Page 52:

Using Blast searches

Tells whether a mystery piece of sequence belongs to a gene existing in other organisms.
Who has a copy of the gene.

Is likely to have a hit to something already characterized.
What it does.

Is likely to be similar to other genes doing a different task following functional divergence during evolution.
How it came to be.

We will look at BLAST in more detail in the following week.

Page 53:

What else can be said?

Hydropathy plots on protein sequences

Can tell you if the protein is soluble or attached to the cell (through the oily membrane).

Page 54:

What else can be said?

Hydropathy plots on protein sequences

Leptin receptor is an example of a protein with a transmembrane helix (plot window = 19).
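
A hydropathy plot is a sliding-window average over a per-residue scale. The sketch below assumes the Kyte-Doolittle scale, which is a common choice but is not named on the slide; the peptide is invented, and the function accepts only the 20 standard characters (an X would raise a KeyError).

```python
# Kyte-Doolittle hydropathy values; positive means hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[float]:
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


# Hypothetical peptide: a 19-residue hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKKK" + "L" * 19 + "DDDD")
print(len(profile), round(max(profile), 2))  # 9 3.8
```

With a window of 19, a sustained peak well above zero (a value around 1.6 is often quoted as a rule of thumb) suggests a stretch long and oily enough to span the membrane.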

Page 55:

What else can be said?

Signal sequences

Small sub-sequences that are used to dispatch a protein to a specific location in (or out of) the cell.

From sequence information alone, it’s possible to tell where a gene product is acting in the cell.

Page 56:

Summary

DNA is sequenced in relatively small steps (<1000 characters).

Genomes have to be broken down into smaller DNA clones and amplified in bacteria/yeast cells.

Strategies have to be devised to efficiently assemble small reads into large DNA sequences.

Genes can be found from sequence information as Open Reading Frames.

Sequences can be characterized on the basis of their similarity to other existing sequences.

Sequences can also be characterized on the sole basis of the chemical properties of the amino acids they encode.