DNA is Data - BioInfoSummer 2012 (Dave Adelson)
TRANSCRIPT
![Page 2: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/2.jpg)
2
What is Bioinformatics?
• The mathematical, statistical and computing methods that aim to solve biological problems using DNA and protein sequences and related information.
• My main interest is genome analysis of mammals.
![Page 3: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/3.jpg)
3
G-gnome vs Genome
Thanks to Ernie Bailey
![Page 4: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/4.jpg)
4
What is a Genome?
• The genome is the total genetic content of the individual/cell.
• All mammal genomes are about the same size.
• Made up of chromosomes, each of which is a single molecule of DNA.
• Total genome length is about 3,000,000,000 base pairs.
Image courtesy NHGRI
![Page 5: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/5.jpg)
5
Central paradigm of Molecular Biology
DNA → RNA → Protein → Phenotype
DNA bases: Guanine (G), Adenine (A), Thymine (T), Cytosine (C)
RNA bases: Guanine (G), Adenine (A), Uracil (U), Cytosine (C)
Protein: 20 amino acids, for example:
G Glycine Gly
P Proline Pro
A Alanine Ala
V Valine Val
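As a minimal sketch of the DNA to RNA step, the snippet below treats a DNA coding strand as a plain string and swaps thymine for uracil. The function name `transcribe` and the example sequence are illustrative assumptions, not part of the original slides.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA that carries the same sequence as the DNA coding strand."""
    dna = coding_strand.upper()
    # Reject anything outside the four-letter DNA alphabet shown on the slide.
    if set(dna) - set("GATC"):
        raise ValueError("DNA must contain only G, A, T and C")
    # RNA uses uracil (U) wherever DNA uses thymine (T).
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```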
![Page 6: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/6.jpg)
6
Central paradigm of Molecular Biology
![Page 7: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/7.jpg)
7
Gene vs Genome
• Each chromosome is a single, long DNA molecule.
• Genes are the basic unit of heredity.
• Genes are specific DNA sequences located on chromosomes.
• Genome contains approximately 20,000 protein coding genes.
• The 20,000 genes fill up about 2% of the genome.
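The figures on this slide can be combined in a quick back-of-envelope calculation. The sketch below assumes a 3,000,000,000 bp genome (the size quoted earlier in the talk) and simply spreads the 2% evenly over 20,000 genes; the resulting per-gene figure is an average implied by these round numbers, not a measured value.

```python
GENOME_BP = 3_000_000_000   # approximate mammalian genome size
N_GENES = 20_000            # approximate number of protein-coding genes
GENE_PERCENT = 2            # percent of the genome filled by those genes

# Integer arithmetic keeps the results exact.
gene_bp = GENOME_BP * GENE_PERCENT // 100
mean_bp_per_gene = gene_bp // N_GENES

print(f"Sequence attributed to genes: {gene_bp:,} bp")
print(f"Average per gene: {mean_bp_per_gene:,} bp")
```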
![Page 8: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/8.jpg)
8
DNA Sequences - three bases and stop codons
http://www.genome.gov/EdKit/bio2b.html
![Page 9: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/9.jpg)
9
Genetic Code
http://plato.stanford.edu/entries/information-biological/GeneticCode.png
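The genetic code can be written down as a lookup table from three-base codons to amino acids. The sketch below builds the standard genetic code from the usual 64-letter summary string (codons ordered with bases T, C, A, G) and translates a toy sequence. The function name `translate` and the example sequence are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":   # TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence: Met, Gly, Pro, Ala, Val, then a stop codon.
print(translate("ATGGGACCAGCAGTATAA"))  # MGPAV
```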
![Page 10: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/10.jpg)
10
Sense Strand / Antisense Strand
http://www.genome.gov/EdKit/bio2c.html
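Because the two strands pair G with C and A with T and run in opposite directions, one strand can be computed from the other. A minimal sketch, assuming sequences are written 5' to 3' as plain strings (the function name is illustrative):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    # Complement each base, then reverse the string.
    return seq.upper().translate(COMPLEMENT)[::-1]


sense = "ATGGCC"
antisense = reverse_complement(sense)
print(sense, antisense)  # ATGGCC GGCCAT
```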
![Page 11: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/11.jpg)
11
Open reading frames
http://www.genome.gov/EdKit/bio2d.html
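An open reading frame can be found by scanning codons from a start codon (ATG) to the next in-frame stop codon. The sketch below is a simplified illustration that only scans the three forward frames of one strand; `find_orfs` and its `min_codons` parameter are hypothetical names, and real gene prediction involves much more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ATG...stop stretches, end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None                   # look for the next ORF
    return orfs


sequence = "AAATGGCCTAAGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 11 ATGGCCTAA
```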
![Page 12: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/12.jpg)
12
Reading frames
http://www.genome.gov/EdKit/bio2e.html
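A double-stranded sequence can be read in six frames: three starting offsets on each strand. As a rough sketch, the helper below splits a sequence into codons for every frame; the frame labels (`+1` to `-3`) and function names are illustrative choices.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete three-base codons starting at offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Return the codons of all six reading frames, keyed by frame label."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(seq, offset)
    return frames


for label, frame in six_frames("ATGGCCTAA").items():
    print(label, frame)
```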
![Page 13: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/13.jpg)
13
Exons and Introns
http://www.genome.gov/EdKit/bio2i.html
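Splicing removes introns and joins the exons. A minimal sketch, assuming exon positions are already known and given as 0-based, end-exclusive coordinates; the toy gene and its coordinates below are invented for illustration.

```python
def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments of a gene in order, dropping everything else."""
    return "".join(gene[start:end] for start, end in sorted(exons))


exon1 = "ATGGCC"
intron = "GTAAGTTTTCAG"   # toy intron, removed by splicing
exon2 = "GTATAA"
gene = exon1 + intron + exon2

# Exon coordinates within the toy gene.
exons = [(0, 6), (18, 24)]
print(splice(gene, exons))  # ATGGCCGTATAA
```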
![Page 14: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/14.jpg)
14
Genes from different animals are similar
Query = human actin
Subject = fruit fly actin
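Similarity between two aligned sequences is often summarised as percent identity. The sketch below uses one common convention (matches divided by the number of aligned, non-gap columns); other definitions exist. The two short strings are toy data, not actin sequences.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent of non-gap alignment columns where both sequences agree."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where either sequence has a gap character.
    columns = [(q, s) for q, s in zip(query, subject) if q != "-" and s != "-"]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(q == s for q, s in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT-A", "ACCTTA"))  # 80.0
```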
![Page 15: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/15.jpg)
15
Bioinformatics: what we do
![Page 16: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/16.jpg)
16
Bioinformatics: What we really do
• Once the sequencing has been done, every other part of the process is bioinformatics.
– Genome Assembly
– Gene Prediction
– Sequence Analysis
![Page 17: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/17.jpg)
17
Bioinformatics: Why do we do it?
• It’s the only way to make sense of billions of base pairs of DNA sequence.
• To understand the mechanistic basis of biological trait determination.
![Page 18: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/18.jpg)
18
Cost of DNA sequencing
![Page 19: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/19.jpg)
GenBank: 1982-2008
• The number of entries in databases of gene sequences has increased exponentially
19
![Page 20: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/20.jpg)
GenBank: latest release
• On October 15, 2012, Release 192.0:
– 145,430,961,262 bases
– 157,889,737 reported sequences
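Dividing the two release figures gives a feel for how short a typical database entry is. This is simple arithmetic on the numbers quoted above, offered as a rough illustration only:

```python
TOTAL_BASES = 145_430_961_262
TOTAL_SEQUENCES = 157_889_737

mean_length = TOTAL_BASES / TOTAL_SEQUENCES
print(f"Mean sequence length: about {mean_length:.0f} bases")  # about 921 bases
```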
20
![Page 21: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/21.jpg)
21
Growth of GenBank
Nucleic Acids Res. 2011 Jan;39(Database issue):D32-7.
![Page 22: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/22.jpg)
Current status of genome projects
http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Statistics
![Page 23: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/23.jpg)
23
Mammalian Genes
Unique genes
Known genes
Cow Dog Man Mouse Rat Opossum Platypus
![Page 24: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/24.jpg)
24
Mammal Family Tree Based on Genes
![Page 25: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/25.jpg)
25
Key things about DNA sequencing
• Only short sequences can be generated (up to 1000bp long, depending on technology)
• Typical mammalian genome is 3 × 10^9 bp.
• Sequencing a genome means stitching together millions of short reads.
• To assemble reads, one must be able to identify overlap by aligning sequences.
• Sequence alignment tools are fundamental to bioinformatics.
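The numbers above show why millions of reads are needed. As a rough sketch, the number of reads is coverage times genome length divided by read length. The 10-fold coverage used below is an assumed value for illustration, not a figure from the talk.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Reads required so that each base is sequenced `coverage` times on average."""
    return math.ceil(coverage * genome_bp / read_bp)


GENOME_BP = 3 * 10**9   # typical mammalian genome
READ_BP = 1000          # upper end of the read lengths mentioned above

print(f"{reads_needed(GENOME_BP, READ_BP, 1):,} reads for 1x coverage")
print(f"{reads_needed(GENOME_BP, READ_BP, 10):,} reads for 10x coverage (assumed)")
```

Shorter reads raise the count in proportion, which is part of why overlap detection has to be automated.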
![Page 26: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/26.jpg)
26
Shotgun sequencing
1. Create libraries of the whole genome.
2. Sequence millions of fragments.
3. Look for overlap between reads.
4. Assemble reads based on overlaps into contigs.
ED Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583
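Steps 3 and 4 can be illustrated with a toy greedy assembler: find the pair of reads with the longest exact suffix-prefix overlap, merge them, and repeat. This is a deliberately simplified sketch with invented reads and an assumed minimum overlap of 3 bases. Real assemblers must cope with sequencing errors, both strands, and repeats, and use more sophisticated methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs   # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCCT", "GCCTAAG", "TAAGGTC"]
print(greedy_assemble(reads))  # ['ATGGCCTAAGGTC']
```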
![Page 27: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/27.jpg)
27
Shotgun assembly steps
• Remove bad sequences, trim adapters from reads.
• Identify repeats.
• Identify overlaps by sequence alignment (excluding repeats).
• Build contigs from overlapping sequences.
• Use paired-end reads to assemble contigs into scaffolds.
• Use additional marker information to order and orient scaffolds into super-scaffolds (chromosomes).
![Page 28: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/28.jpg)
28
Shotgun sequencing problems
• Leaves gaps.
• Contigs have to be ordered and oriented.
• Occasional misassembled contigs.
ED Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583
![Page 29: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/29.jpg)
29
Old style paired end libraries
• To take advantage of information from paired ends, multiple libraries are made:
• Small insert ~2kb (plasmid)
• Medium insert ~10kb (plasmid)
• Large insert ~40kb (fosmid)
• Tight control of insert size is paramount. Use random shearing, not restriction digest, to generate inserts. Small inserts may well be sequenced through with overlap from ends.
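Known insert sizes are what make paired ends useful: if the two end reads of one insert land in different contigs, the insert size constrains the gap between those contigs. The sketch below shows the arithmetic under simplifying assumptions (exact insert size, read 1 pointing towards the end of contig A, read 2 pointing back from contig B). All names and numbers are hypothetical.

```python
def estimate_gap(insert_size: int, contig_a_len: int,
                 read1_start: int, read2_end: int) -> int:
    """Estimate the gap between contig A and contig B from one read pair.

    read1_start: position in contig A where the insert begins.
    read2_end:   position in contig B where the insert ends.
    """
    covered_in_a = contig_a_len - read1_start   # insert bases inside contig A
    covered_in_b = read2_end                    # insert bases inside contig B
    return insert_size - covered_in_a - covered_in_b


# A ~2kb small-insert library: 800 bp fall in contig A and 700 bp in contig B.
print(estimate_gap(insert_size=2000, contig_a_len=5000,
                   read1_start=4200, read2_end=700))  # 500
```

In practice insert sizes vary around the nominal value, which is why the slide stresses tight control of insert size; many read pairs would be combined to refine the estimate.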
![Page 30: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/30.jpg)
30
Scaffold (long range) assembly
E Myers et al. (2000) Science, v287, p2196
![Page 31: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/31.jpg)
31
Current shotgun sequencing
![Page 32: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/32.jpg)
32
Repeat sequences cause problems
![Page 33: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/33.jpg)
“Junk DNA”, an unfortunate choice of words
http://www.junkdna.com/ohno.html
Used to describe the mostly repetitive DNA between genes
![Page 34: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/34.jpg)
Adelson GENE3111/3110 34
LINEs and SINEs
These are typical of Eukaryotes, in particular mammals.
Intact autonomous elements are about 6kb long.
Non-autonomous truncated (SINE) elements that share the same tail make use of the autonomous elements' insertion machinery.
![Page 35: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/35.jpg)
Retrotransposition
[Figure labels: Cell, Cytoplasm, Nucleus]
• Retrotransposons are ancient, retrovirus-like pieces of DNA that copy themselves around the genome.
• They cannot “infect” other individuals or cells because they lack key components that viruses have.
![Page 36: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/36.jpg)
36
Repeats and genome assembly
• Repeats can align to many places in the genome.
• Many repeats are longer than the sequence reads produced by current sequencers.
• To avoid many to many mapping, leading to incorrect contig assembly, repeats must be identified and masked prior to alignment.
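Masking means replacing repeat positions with a placeholder so that alignment ignores them. A minimal sketch of "hard" masking, assuming the repeat coordinates have already been identified by some other tool; the sequence and interval below are invented.

```python
def hard_mask(seq: str, repeat_intervals: list[tuple[int, int]]) -> str:
    """Replace bases inside each (start, end) interval with 'N' (end exclusive)."""
    masked = list(seq)
    for start, end in repeat_intervals:
        # Clamp the interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)


print(hard_mask("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```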
![Page 37: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/37.jpg)
37
Repeats affect anything requiring alignment.
• Any sequence data needing to be aligned must have repeats masked.
– Transcriptome data
– Structural variation (SV)/mutation mapping
![Page 38: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/38.jpg)
38
Resequencing is the norm
• Sequencing of patient samples to determine mutations underlying disease.
• Must be able to detect a range of mutation events (of various sizes).
• Applies to germ line mutations/variations or somatic mutations/variations (e.g. cancer).
![Page 39: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/39.jpg)
39
Different classes of mutation operating in the human genome.
Freeman J L et al. Genome Res. 2006;16:949-961
Copyright © 2006, Cold Spring Harbor Laboratory Press
![Page 40: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/40.jpg)
Genome resequencing for SV
40
http://www.sciencemag.org/cgi/content/full/318/5849/420
![Page 41: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/41.jpg)
Summary and Challenges Ahead
• DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law (the rate at which computing gets faster and cheaper).
• The ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.
• Data handling is now the bottleneck.
• It costs more to analyze a genome than to sequence a genome.
• The cost of sequencing a human genome (all three billion bases of DNA in a set of human chromosomes) plunged to under $10,000 this year from $8.9 million in July 2007.
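The comparison with Moore's law can be made concrete with a little arithmetic on the two cost figures. The sketch below assumes roughly five years between the two prices and takes Moore's law as a doubling every two years. Both are assumptions for illustration, since the slide does not state the exact end date or the doubling period.

```python
import math

COST_JULY_2007 = 8_900_000   # dollars per human genome
COST_NOW = 10_000            # "under $10,000", so an upper bound
YEARS_ELAPSED = 5.0          # assumption: about five years between the prices
MOORE_DOUBLING_YEARS = 2.0   # assumption: common statement of Moore's law

fold_drop = COST_JULY_2007 / COST_NOW
halvings = math.log2(fold_drop)
halving_time = YEARS_ELAPSED / halvings
moore_fold = 2 ** (YEARS_ELAPSED / MOORE_DOUBLING_YEARS)

print(f"Sequencing cost fell at least {fold_drop:.0f}-fold")
print(f"That is a halving roughly every {halving_time:.2f} years")
print(f"Moore's law over the same period: about {moore_fold:.1f}-fold")
```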
![Page 42: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/42.jpg)
Summary and Challenges Ahead
• Storage of and access to data cause issues.
– Not all data is in GenBank or in a format that can be easily accessed.
• Demand from health care system for tools to visualize, understand and interpret patient genomic data.
![Page 43: DNA is Data - BioInfoSummer 2012 (Dave Adelson)](https://reader036.vdocument.in/reader036/viewer/2022062303/554f481ab4c905b9508b4693/html5/thumbnails/43.jpg)
Biggest driver for bioinformatics
43