Informatics tools for next-generation sequence analysis
Gabor Marth, Boston College Biology. Next-Generation Sequencing MiniSymposium, CHOP, Philadelphia, PA, April 6, 2009.
![Page 1: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/1.jpg)
Informatics tools for next-generation sequence analysis
Gabor Marth
Boston College Biology
Next-Generation Sequencing MiniSymposium
CHOP Philadelphia, PA
April 6, 2009
![Page 2: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/2.jpg)
New sequencing technologies…
![Page 3: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/3.jpg)
… offer vast throughput
[Figure: bases per machine run (1 Mb to 100 Gb) versus read length (10 bp to 1,000 bp)]
• ABI capillary sequencer
• Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)
![Page 4: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/4.jpg)
Roche / 454
• pyrosequencing technology
• variable read-length
• the only new technology with >100 bp reads
![Page 5: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/5.jpg)
Illumina / Solexa
• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
![Page 6: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/6.jpg)
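AB / SOLiD
[Figure: 2-base encoding matrix. Rows: 1st base (A, C, G, T); columns: 2nd base (A, C, G, T); each cell holds one of the four colors 0, 1, 2, 3, and each color appears four times.]
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics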
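The 2-base encoding can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly published SOLiD color scheme, in which the color of a dinucleotide equals the XOR of the two base codes with A=0, C=1, G=2, T=3. The function names are hypothetical and are not taken from any vendor software.

```python
BASES = "ACGT"  # assumed base codes: A=0, C=1, G=2, T=3


def color(first: str, second: str) -> int:
    """Color (0-3) of a dinucleotide: XOR of the two 2-bit base codes."""
    return BASES.index(first) ^ BASES.index(second)


def encode(seq: str) -> str:
    """Leading base followed by one color per adjacent base pair."""
    return seq[0] + "".join(str(color(a, b)) for a, b in zip(seq, seq[1:]))


def decode(color_read: str) -> str:
    """Recover bases by walking the colors from the known leading base."""
    bases = [color_read[0]]
    for c in color_read[1:]:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(c)])
    return "".join(bases)


if __name__ == "__main__":
    read = encode("ATGGC")
    print(read, decode(read))
```

Because each decoded base depends on the one before it, a single miscalled color changes every downstream base, whereas a true single-base difference changes two adjacent colors. This is one reason color-space reads are usually aligned in color space rather than decoded first.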
![Page 7: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/7.jpg)
Helicos / Heliscope
• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
![Page 8: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/8.jpg)
Many applications
• organismal resequencing & de novo sequencing
Ruby et al. Cell, 2006
Jones-Rhoades et al. PLoS Genetics, 2007
• transcriptome sequencing for transcript discovery and expression profiling
Meissner et al. Nature 2008
• epigenetic analysis (e.g. DNA methylation)
![Page 9: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/9.jpg)
Data characteristics
![Page 10: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/10.jpg)
Read length
[Figure: read length [bp], axis 0 to 400, one bar per platform]
• ~200-450 (variable)
• 25-100 (fixed)
• 25-50 (fixed)
• 25-60 (variable)
![Page 11: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/11.jpg)
Error characteristics (Illumina)
• Insertions: 1.43%
• Deletions: 3.23%
• Substitutions: 95.34%
![Page 12: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/12.jpg)
Error characteristics (454)
![Page 13: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/13.jpg)
Coverage bias
~2X genome read coverage
~20X genome read coverage
![Page 14: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/14.jpg)
Genome re-sequencing
![Page 15: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/15.jpg)
Complete human genomes
![Page 16: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/16.jpg)
The re-sequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
[Figure labels: REF, IND; GigaBayes]
![Page 17: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/17.jpg)
Read mapping
![Page 18: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/18.jpg)
2. Read mapping…
… is like a jigsaw puzzle
… you get the pieces…
… and they give you the picture on the box
Big and unique pieces are easier to place than others…
![Page 19: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/19.jpg)
Challenge: non-uniqueness
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
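One way to see read-length-scale non-uniqueness is to count how many k-mers of a reference occur exactly once. The sketch below is a toy illustration only: the function name and the example sequence are hypothetical, and it ignores the reverse strand and sequencing errors, both of which a real mapper has to handle.

```python
from collections import Counter


def unique_kmer_fraction(reference: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs once in the reference."""
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


if __name__ == "__main__":
    toy_reference = "ACGTACGTACGGATCC"  # hypothetical sequence with a short repeat
    for k in (4, 8):
        print(k, round(unique_kmer_fraction(toy_reference, k), 3))
```

Longer k-mers span the short repeat in the toy sequence, so a larger fraction of positions can be placed uniquely.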
![Page 20: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/20.jpg)
Non-unique mapping
![Page 21: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/21.jpg)
SE short-read alignments are error-prone
0.35%
![Page 22: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/22.jpg)
Paired-end (PE) reads
fragment length: 100 – 600bp
Korbel et al. Science 2007
fragment length: 1 – 10kb
![Page 23: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/23.jpg)
PE alignment statistics (simulated data)
[Figure values: 0.00%, 7.6%, 0.09%, 0.35%, 0.03%]
![Page 24: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/24.jpg)
The MOSAIK read mapper/aligner
Michael Strömberg
![Page 25: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/25.jpg)
Gapped alignments
![Page 26: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/26.jpg)
Aligning multiple read types together
ABI/capillary
454 FLX
454 GS20
Illumina
![Page 27: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/27.jpg)
SNP / short-INDEL discovery
![Page 28: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/28.jpg)
Polymorphism detection
sequencing error polymorphism
![Page 29: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/29.jpg)
Allele calling in multi-individual data
Reads per individual:
• individual 1: a, a, c, c (B1 = aacc)
• individual i: a, a, a, a, c (Bi = aaaac)
• individual n: c, c, c, c (Bn = cccc)

“genotype likelihoods”
• P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac)
• P(Bi=aaaac|Gi=aa), P(Bi=aaaac|Gi=cc), P(Bi=aaaac|Gi=ac)
• P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)

Prior(G1,..,Gi,..,Gn)

“genotype probabilities”
• P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
• P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
• P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(SNP)
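The flow from genotype likelihoods through a joint prior to genotype probabilities and P(SNP) can be sketched as follows. This is a minimal illustration and not the GigaBayes implementation. It assumes a biallelic a/c site, a uniform per-base error rate that flips one allele into the other, and an independent Hardy-Weinberg prior with a fixed allele frequency. All function names and default values are hypothetical.

```python
from itertools import product
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(reads: str, genotype: str, error: float = 0.01) -> float:
    """P(B | G): probability of the observed bases given a diploid genotype."""
    # Probability of reading 'a' from a randomly chosen chromosome of this genotype.
    p_a = sum((1 - error) if allele == "a" else error for allele in genotype) / 2
    return prod(p_a if base == "a" else 1 - p_a for base in reads)


def hwe_prior(combo, freq_a: float = 0.5) -> float:
    """Assumed joint prior: independent Hardy-Weinberg genotypes at a fixed frequency."""
    hwe = {"aa": freq_a ** 2, "ac": 2 * freq_a * (1 - freq_a), "cc": (1 - freq_a) ** 2}
    return prod(hwe[g] for g in combo)


def joint_posterior(samples, prior=hwe_prior, error: float = 0.01):
    """Posterior over every joint genotype assignment (G1, .., Gn) given all reads."""
    weights = {}
    for combo in product(GENOTYPES, repeat=len(samples)):
        likelihood = prod(genotype_likelihood(r, g, error) for r, g in zip(samples, combo))
        weights[combo] = prior(combo) * likelihood
    total = sum(weights.values())
    return {combo: w / total for combo, w in weights.items()}


def genotype_probabilities(posterior, index: int):
    """Marginal P(G_index | all reads) for one individual."""
    marginal = dict.fromkeys(GENOTYPES, 0.0)
    for combo, p in posterior.items():
        marginal[combo[index]] += p
    return marginal


def p_snp(posterior) -> float:
    """P(SNP): one minus the probability that everyone is homozygous for the same allele."""
    monomorphic = sum(p for combo, p in posterior.items() if set(combo) in ({"aa"}, {"cc"}))
    return 1 - monomorphic


if __name__ == "__main__":
    post = joint_posterior(["aacc", "aaaac", "cccc"])
    print(genotype_probabilities(post, 0), p_snp(post))
```

Enumerating all joint genotype assignments grows as 3^n, so this brute-force form is only practical for a handful of individuals.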
![Page 30: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/30.jpg)
SNP calling in deep sample sets
Population Samples Reads Allele detection
![Page 31: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/31.jpg)
Capturing the allele in the samples
[Figure: Prob(allele captured in sample), 0 to 1, versus population AF on a log scale from 0.0001 to 0.5, with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
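The shape of such curves follows from simple sampling arithmetic. The sketch below assumes that n counts diploid individuals drawn at random from the population, so that 2n chromosomes are sampled independently. The slide does not state whether n counts individuals or chromosomes, and the function name is hypothetical.

```python
def capture_probability(allele_freq: float, n_samples: int, ploidy: int = 2) -> float:
    """Probability that at least one sampled chromosome carries the allele."""
    return 1 - (1 - allele_freq) ** (ploidy * n_samples)


if __name__ == "__main__":
    for n in (100, 200, 400, 800, 1600):
        row = [round(capture_probability(af, n), 3) for af in (0.0001, 0.001, 0.01)]
        print(n, row)
```

Under these assumptions, an allele at 1% population frequency is very likely to be present among 100 diploid samples, while an allele at 0.01% frequency is still missed most of the time even with 1600 samples.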
![Page 32: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/32.jpg)
The ability to call rare alleles
| reads | Q30 | Q40 | Q50 | Q60 |
|---|---|---|---|---|
| 1 | 0.01 | 0.01 | 0.1 | 0.5 |
| 2 | 0.82 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |

[Figure: read pileups at the candidate site]
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

GigaBayes
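Why a second supporting read matters so much can be illustrated with a two-hypothesis calculation. This sketch does not reproduce the values reported on the slide and is not the GigaBayes model. It assumes a single individual, a choice between homozygous reference and heterozygous only, a Phred-scaled base quality shared by all reads, and a hypothetical prior of 0.001 for a heterozygous site.

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def het_posterior(alt_reads: int, depth: int, q: float, prior: float = 0.001) -> float:
    """Posterior probability of a heterozygous site versus homozygous reference."""
    e = phred_to_error(q)
    # Heterozygote: each read shows either allele with probability 1/2.
    lik_het = 0.5 ** depth
    # Homozygous reference: every alternate read must be an error to that specific base.
    lik_ref = (1 - e) ** (depth - alt_reads) * (e / 3) ** alt_reads
    numerator = prior * lik_het
    return numerator / (numerator + (1 - prior) * lik_ref)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, [round(het_posterior(k, 4, q), 3) for q in (30, 40, 50, 60)])
```

With one supporting read, a single sequencing error remains a competitive explanation unless the base quality is very high. With two or more, the error hypothesis requires several independent miscalls to the same base and its probability collapses.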
![Page 33: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/33.jpg)
Allele calling in 400 samples
![Page 34: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/34.jpg)
Detecting de novo mutations
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12 or 22) given the genotypes of the two parents, with terms for a de novo mutation]
• the child inherits one chromosome from each parent
• there is a small probability for a de novo (germ-line or somatic) mutation in the child
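The two statements above translate directly into a transmission model. The sketch below is an assumption-laden illustration, not the formula from the slide: it uses a biallelic site with alleles 1 and 2, lets each transmitted allele mutate to the other allele with a per-transmission probability mu, and treats the two parents independently. The function names and the default mu are hypothetical.

```python
from itertools import product


def transmit(parent: str, mu: float):
    """Distribution of the allele a parent with genotype '11', '12' or '22' passes on."""
    dist = {"1": 0.0, "2": 0.0}
    for allele in parent:
        other = "2" if allele == "1" else "1"
        dist[allele] += 0.5 * (1 - mu)  # chosen allele transmitted unchanged
        dist[other] += 0.5 * mu         # chosen allele mutated de novo
    return dist


def child_genotype_probs(mother: str, father: str, mu: float = 1e-8):
    """Pr(child genotype | mother genotype, father genotype) for a biallelic site."""
    from_mother, from_father = transmit(mother, mu), transmit(father, mu)
    probs = {"11": 0.0, "12": 0.0, "22": 0.0}
    for a, b in product("12", repeat=2):
        probs["".join(sorted(a + b))] += from_mother[a] * from_father[b]
    return probs


if __name__ == "__main__":
    print(child_genotype_probs("11", "11", mu=0.01))
    print(child_genotype_probs("12", "12", mu=0.0))
```

With mu set to zero the function reduces to ordinary Mendelian ratios. A child genotype that is impossible under Mendelian inheritance, such as 12 from two 11 parents, receives a probability on the order of mu, which is what makes it a de novo candidate.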
![Page 35: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/35.jpg)
Capture sequencing
![Page 36: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/36.jpg)
Targeted mammalian re-sequencing
• Deep sequencing of complete human genomes is still too expensive
• There is a need to sequence target regions, typically genes, to follow up on GWAS studies
• Targeted re-sequencing with DNA fragment capture offers a potentially cost-effective alternative
• Solid phase or liquid phase capture
• 454 or Illumina sequencing
• Informatics pipeline must account for the peculiarities of capture data
![Page 37: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/37.jpg)
On/off target capture
ref allele*: 45%
non-ref allele*: 54%
Target region
SNP (outside target region)
![Page 38: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/38.jpg)
Reference allele bias
(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
ref allele*: 54%
non-ref allele*: 45%
![Page 39: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/39.jpg)
SNP example
Amit Indap
![Page 40: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/40.jpg)
Structural Variation discovery
![Page 41: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/41.jpg)
Structural variations
![Page 42: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/42.jpg)
SV/CNV detection – SNP chips
• Tiling arrays and SNP-chips made whole-genome CNV scans possible
• Probe density and placement limit resolution
• Balanced events cannot be detected
![Page 43: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/43.jpg)
SV/CNV detection – resolution
[Figure: relative numbers of events versus CNV event length [bp]; series: Expected CNVs, Karyotype, Micro-array, Sequencing]
![Page 44: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/44.jpg)
Read depth
![Page 45: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/45.jpg)
CNV events found using RD
[Figure: x-axis: Chromosome 2 Position [Mb]]
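A read-depth (RD) scan can be sketched as counting read starts in fixed windows and flagging windows whose depth departs from the typical value. This is a simplified illustration, not the method used to produce the slide. The window size, the ratio thresholds of 0.5 and 1.5, the use of the median as baseline, and the function names are all assumptions, and the sketch ignores the coverage bias discussed earlier in the talk.

```python
from statistics import median


def window_depths(read_starts, chrom_length: int, window: int):
    """Count read start positions (0-based) in consecutive fixed-size windows."""
    counts = [0] * ((chrom_length + window - 1) // window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def call_cnv_windows(counts, low: float = 0.5, high: float = 1.5):
    """Flag windows whose depth ratio to the median suggests a copy-number change."""
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot compute depth ratios")
    calls = []
    for index, count in enumerate(counts):
        ratio = count / baseline
        if ratio <= low:
            calls.append((index, "loss"))
        elif ratio >= high:
            calls.append((index, "gain"))
    return calls


if __name__ == "__main__":
    depths = [100, 98, 102, 45, 47, 101, 160, 99]  # hypothetical per-window counts
    print(call_cnv_windows(depths))
```

Adjacent flagged windows would normally be merged into a single event and tested for significance before being reported.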
![Page 46: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/46.jpg)
PE read mapping positions
[Figure: read-pair mapping pattern on the DNA reference for each event type, drawn with the lengths LM and LF]
• Deletion (Ldel): LM ~ LF+Ldel & depth: low
• Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
• Inversion (Linv): LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
• Translocation (LT1, LT2): LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2; depth: normal
• Insertion (Lins): un-paired read clusters & depth: normal
• Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
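The signatures above suggest a first-pass classifier for individual read pairs. The sketch below is a rough illustration and not the Spanner algorithm. It assumes that LM is the distance between the mapped mates and LF the expected fragment length, and it uses a hypothetical `ReadPair` record and a fixed tolerance in place of a proper fragment-length distribution. A real caller would also cluster pairs and check read depth before reporting an event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadPair:
    chrom1: str
    chrom2: Optional[str]  # None when the mate could not be mapped
    mapped_distance: int   # LM: distance spanned by the mapped mates
    ends_flipped: bool     # True when mate orientation is inverted


def classify_pair(pair: ReadPair, fragment_length: int, tolerance: int) -> str:
    """Assign one read pair to the candidate event type its mapping pattern suggests."""
    if pair.chrom2 is None:
        return "insertion candidate"
    if pair.chrom1 != pair.chrom2:
        return "chromosomal translocation candidate"
    if pair.ends_flipped:
        return "inversion candidate"
    if pair.mapped_distance > fragment_length + tolerance:
        return "deletion candidate"  # LM ~ LF + Ldel
    if pair.mapped_distance < fragment_length - tolerance:
        return "tandem duplication candidate"  # LM ~ LF - Ldup
    return "concordant"


if __name__ == "__main__":
    pairs = [
        ReadPair("chr2", "chr2", 900, False),
        ReadPair("chr2", "chr7", 0, False),
        ReadPair("chr2", None, 0, False),
    ]
    for p in pairs:
        print(classify_pair(p, fragment_length=300, tolerance=100))
```

A stretched pair on the same chromosome is also consistent with the intra-chromosomal translocation pattern, so the labels are candidates to be confirmed by clusters of pairs and by depth, not final calls.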
![Page 47: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/47.jpg)
The SV/CNV “event display”
Chip Stewart
![Page 48: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/48.jpg)
Spanner – specificity
![Page 49: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/49.jpg)
Data standards
![Page 50: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/50.jpg)
Data types with standard formats
SRF/FASTQ
SAM/BAM
GLF
![Page 51: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/51.jpg)
Transcriptome sequencing
![Page 52: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/52.jpg)
Data highly reproducible
Michele Busby
![Page 53: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/53.jpg)
Comparative data
Michele Busby
![Page 54: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/54.jpg)
Biological questions
Michele Busby
![Page 55: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/55.jpg)
Our software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Software_Release
![Page 56: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/56.jpg)
Credits
Elaine Mardis
Andy Clark
Aravinda Chakravarti
Doug Smith
Michael Egholm
Scott Kahn
Francisco de la Vega
Patrice Milos
John Thompson
![Page 57: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/57.jpg)
Lab
Several postdoc positions are available!
![Page 58: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/58.jpg)
Mutational profiling
![Page 59: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/59.jpg)
Chemical mutagenesis
![Page 60: Informatics tools for next-generation sequence analysis](https://reader034.vdocument.in/reader034/viewer/2022051219/568161ca550346895dd1b05c/html5/thumbnails/60.jpg)
Mutational profiling: deep 454/Illumina/SOLiD data
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had high conversion efficiency
• determine which mutations caused this phenotype
• 15 Mb genome: 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
Pichia stipitis reference sequence
Image from JGI web site
10-15X genome coverage required
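The coverage requirement converts into a sequencing budget with simple arithmetic: 10-15X coverage of a 15 Mb genome means 150-225 Mb of sequence. The sketch below turns that into a read count. The read lengths of 50 bp and 250 bp are hypothetical examples, and the calculation assumes that every base of every read maps and counts toward coverage.

```python
import math


def reads_needed(genome_size: int, coverage: float, read_length: int) -> int:
    """Number of reads so that total sequenced bases equal coverage x genome size."""
    return math.ceil(genome_size * coverage / read_length)


if __name__ == "__main__":
    genome = 15_000_000  # 15 Mb genome, as stated on the slide
    for fold in (10, 15):
        for length in (50, 250):  # hypothetical read lengths in bp
            print(f"{fold}X, {length} bp reads: {reads_needed(genome, fold, length):,}")
```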