Next Generation Sequencing Exome Sequencing
Marcela Davila
Genomics Core Facility
NGS methods
First generation (great cost, intense human effort)
  1954 – Sequencing by degradation (Whitfeld PR)
  1975 – Chain termination method (Sanger & Coulson)
  1977 – Chemical modification (Maxam and Gilbert)
Second generation (synchronized washing/scanning)
  SBS – Illumina
  Pyrosequencing – Roche
  SBL – AB SOLiD
Third generation (increased sequencing speed, high throughput, no optics)
  Semiconductor: Ion Torrent
  SBS-single molecule: Helicos
  SBS-single molecule-real time: Pacific Biosciences
  SBH/SBL: Complete Genomics
  FRET: VisiGen
  Protein nanopores: Oxford Nanopore
  TEM: Halcyon Molecular and ZS Genetics
  Transistor mediated: IBM
  STM: Reveo
Sanger method
Dye-labeled terminator
DNA template
Laser beam
Chromatogram
Capillary electrophoresis
Next generation sequencing
For cyclic array sequencing
1. DNA library preparation (ligation of adapters)
2. Amplification (ePCR, bridge PCR)
3. Sequencing reaction
4. Imaging
5. Decoding
SeqBySynthesis - Illumina
Pyrosequencing - Roche
Pyrogram
SeqByLigation – AB/SOLiD
First round
Second round
IonSensitiveFieldEffectTransistors – Ion Torrent
[Figure: nucleotide flows A, C, G, T over flow cycles 1-3]
SeqBySynthesis - single molecule - Helicos
Single Molecule Real Time – Pacific Biosciences
combinatorialProbeAnchorLigation – Complete Genomics
FluorescenceResonanceEnergyTransfer – VisiGen
Protein nanopores – Oxford Nanopore Tech
Exonuclease
Transmission Electron Microscopy – Halcyon Molecular/ZS Genetics
Electronic fingerprint
STM - Transistor mediated – IBM
metal
Dielectric layers
ScanningTunnelingMicroscope– Reveo
Different applications, different pipelines
Common steps:
  Quality check, quality filter
  Mapping to reference genome, realignment and recalibration
Application-specific step:
  Resequencing: SNV detection
  RNA-seq: transcript abundance estimation
  ChIP-Seq: peak detection
AAGCCTA
AAGCTTA
Human genome: 3 billion bp
3 million differences (0.1%)
AAGCTA
AAG-TA
AAG-TA
AAGCTA
UAG GGU ACU
* G T
Splice sites/branch site UTRs Coding regions
SNPs
Biotin probes
Streptavidin beads
DNA library
Hybridization
Capture
Targeted resequencing
Different recipes
Single end (SE): R1
Paired-end (PE): R1 and R2, 200-500 bp
Mate-pair (MP): R1 and R2, 2-5 Kb
@HWI-H200:53:D08U2ACXX:5:1101:1231:2012 1:N:0:
GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA
+
@CCFFFFFHFFHHJJJJJ#3<FGIJJJJJ#1?###################
@HWI-H200:53:D08U2ACXX:5:1101:1184:2013 1:N:0:
TATATTTAATGTACTTTCNTATTTTATATNCANTATNTNATANANNNNTTG
+
CC@FFFFFHHHFFFFHIG#3AFGIIIHIJ#2A#1:C###############
@HWI-H200:53:D08U2ACXX:5:1101:1151:2035 1:N:0:
TTTTGCCTTGTTGCCCAGGTTGGTCTCGAACTCCTGGGCTCAAGGGATATG
+
@CCFFFFFHHHHHJJJGIGJJIIBHHHHIIGBGHGCHIIIHHGIGIJGHIF
@HWI-H200:53:D08U2ACXX:5:1101:1248:2055 1:N:0:
CAGGAACAGAATGAATGAGCGAAACAAATTCCCCTTGAGCTTCACTTGTTG
+
CCCFFFFFHHHHHIJJJJJJJJJJIJJIJJIJJJJJIJJIJJJJJJJIIIH
@HWI-H200:53:D08U2ACXX:5:1101:1235:2080 1:N:0:
ATGGTCTATTAAGTATGCAATAGTATTTTGTCTAAAACAATAATGTACATA
+
@@@FADDFHHHGHFHHGEIHIJGAIFHIIIIJIHIIJHIJIJJJHFHDHII
@HWI-H200:53:D08U2ACXX:5:1101:1165:2081 1:N:0:
ATAACAATGACAATAGAATTTGGGGACTCAGGAGGAAAGGGAGGGAAGCGG
+
CCCFFFFFGHHHHJGHIIIJJJJJJJJIIIJJIGGIJJJJGIIGIIIIIJJ
@HWI-H200:53:D08U2ACXX:5:1101:1231:2012 2:N:0:
TACTNNTANNTNCAGANCAGTTTAAATAAATAAAACATNCACCAGTATGTA
+
@BCF##22##2#2<CG#2AEFGIHJIIJJJFIJJJJJJ#0?GGGBFHIJGH
@HWI-H200:53:D08U2ACXX:5:1101:1184:2013 2:N:0:
ACATNNAANNTNAAAGNTCACAAACTATATATTATATANTGTACATAAAAT
+
B@@F##22##2#3<CG#3AFHIJJJGJJJJJJJJIJJJ#0?FGHJJJJGJG
@HWI-H200:53:D08U2ACXX:5:1101:1151:2035 2:N:0:
CAAACTAACCANGCGGACTTCATTGCTTTTAGAGGACACAATTAATTCTCT
+
CCCFFFFFHHH#2<CGIJBHJJIJJGIGJIIFGGIJJJIIJHIJIGIJIJI
@HWI-H200:53:D08U2ACXX:5:1101:1248:2055 2:N:0:
TATACAATCAANGCACAATCTATTAGAATGGGAAGAGACCCTGGAGATAAT
+
CCCFFFFFHHH#2AFHIJIHHHJJJJJJIJJJJJJJJJJJJIJJHEGHGG<
@HWI-H200:53:D08U2ACXX:5:1101:1235:2080 2:N:0:
AATCCCAACACTTTGGGAGGCTGAGGTGGGTGGATCACTTGGGGTCAGGAG
+
B@?DFBFFHHHHHIJJIJIJJJJIGI:DGI?F@GBFGIIGAGIIBF>HGIH
@HWI-H200:53:D08U2ACXX:5:1101:1165:2081 2:N:0:
GCTGTGTTAGCTTCTTTGTCCTATTGAAATGCAAAGATAGGCTGACTAACT
+
CC@FFFFFHHHHHJJJJI?CHFHGJJJJJIIJJJJIIJJGFHIJJJJJJJE
R1 R2
@HWI-H200:53:D08U2ACXX:5:1101:1231:2012 1:N:0:
GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA
+
@CCFFFFFHFFHHJJJJJ#3<FGIJJJJJ#1?###################
31 37 39 18 16 2
Fastq format
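The numbers 31 37 39 18 16 2 shown with the record above are consistent with reading the quality line as Phred+33 encoded characters. A minimal sketch of that decoding, assuming the Phred+33 (Sanger) offset; the function name is only illustrative:

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a FASTQ quality string, assuming the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality]


# Quality line of the first record shown above.
quality = "@CCFFFFFHFFHHJJJJJ#3<FGIJJJJJ#1?###################"
scores = phred33_scores(quality)

# Six characters that occur in that line and the scores they encode.
for ch in "@FH31#":
    print(repr(ch), ord(ch) - 33)
```

Each character encodes one base call, so the quality line always has the same length as the sequence line.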
LIMS
Phred quality score
Probability that the base has been erroneously called

Phred score   P(called wrong)   Accuracy of base call
10            1 in 10           90%
20            1 in 100          99%
30            1 in 1000         99.9%
40            1 in 10000        99.99%
50            1 in 100000       99.999%

[Figure: example base calls at Phred = 50 and Phred = 10]
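The table follows the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A small sketch that converts in both directions; the function names are illustrative:

```python
import math


def error_probability(q: float) -> float:
    """Probability that a base with Phred score q was called wrongly."""
    return 10 ** (-q / 10)


def phred_from_probability(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p):,} wrong, accuracy {100 * (1 - p):.3f}%")
```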
Exome pipeline
Quality check, quality filter
Mapping to reference genome, realignment and recalibration
SNV detection (resequencing): variant calling, annotation, custom filtering of variants
(Other applications: peak detection for ChIP-Seq, transcript abundance estimation for RNA-seq)
Quality check - FastQC
@HWI-H200:53:D08U2ACXX:5:1101:1231:2012 1:N:0:
GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA
+
@CCFFFFFHFFHHJJJJJ#3<FGIJJJJJ#1?###################
X nts
Low quality
Ambiguous bases
Quality filter - FastX
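The filtering ideas on this slide (trimming nucleotides, low quality, ambiguous bases) can be sketched in a few lines. This is not how the FASTX-Toolkit is implemented; the function names and the thresholds (`min_q`, `min_len`, `max_n_fraction`) are assumptions chosen for illustration, and Phred+33 encoding is assumed:

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq, min_len=30, max_n_fraction=0.1):
    """Reject reads that are too short or contain too many ambiguous bases (N)."""
    if len(seq) < min_len:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction


# The record shown above: a good start followed by a low-quality, N-rich tail.
seq = "GCATTTTAGTAGAACCAGNCATTTCCCCCNACNTCNNTNCGNNANNNNTAA"
qual = "@CCFFFFFHFFHHJJJJJ#3<FGIJJJJJ#1?###################"

trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
print(trimmed_seq, trimmed_qual, passes_filter(trimmed_seq))
```

Trimming only from the 3' end is a deliberate simplification: low-quality positions inside the read (the `#` characters in the middle of this example) are left in place and would need a separate rule.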
CTACGATCGATCGA AGACGCAGCTACTACACG
CTACGATCGATCTACGCAGCTACTACACGTGCTGGGACGC REF
ACCACACGTGCAGG TCGATCGACG
CTACG ATCGACGCAGCTACCA AGGGACGT
READS
WHERE to place the reads?
  a) Unique reads
  b) Everywhere possible
  c) Choose one randomly
  d) Use paired-end data
HOW to place the reads?
  a) Ungapped
  b) Gapped
ACTACACGTGCAGGGACGT
Mapping
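To make the WHERE/HOW questions concrete, here is a brute-force sketch of ungapped mapping that keeps only uniquely placed reads. Real mappers use indexed data structures and support gaps; the function names and the `max_mismatches` threshold are assumptions for illustration only:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_ungapped(read, reference, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement within the limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits


def best_unique_hit(hits):
    """Option a) on the slide: keep the read only if its best placement is unique."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: two placements are equally good
    return ranked[0]


# Reference and one read taken from the slide.
reference = "CTACGATCGATCTACGCAGCTACTACACGTGCTGGGACGC"
read = "ACTACACGTGCAGGGACGT"
hits = map_ungapped(read, reference)
print(hits, best_unique_hit(hits))
```

A gapped mapper would additionally allow insertions and deletions in the alignment, which this position-by-position comparison cannot represent.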
Local realignment around indels
HWI-H200:53:D08U2ACXX:6:1108:18555:16623 99 chr1 10001 0 45M6S = 10174 224 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAGATCG @?@DDDBDAH??FHDGFFFHIIIGDGEHHI<ABHICHIEHCDD3BDEDGEC MD:Z:45 RG:Z:1 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:M
HWI-H200:53:D08U2ACXX:6:1101:9568:123823 99 chr1 10003 11 1S46M1S = 10204 252 GACCCTGACCCTGACCCTAACCCTAACCCTAACCCTAACCCCAAACCC @@CFBDFFDFHHFGIIEHGGGD@GGHDGGFHGGEHEGCGHGGHGEHGC MD:Z:5A5A28T2C2 RG:Z:1 XG:i:0 AM:i:11 NM:i:4 SM:i:11 XM:i:4 XO:i:0 XT:A:M
HWI-H200:53:D08U2ACXX:6:1302:17187:33007 97 chr1 10003 0 51M chrM 430 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC CCCFFFFFHHHHHJJJJJJIIIIJJJJJIIJJJJJJJJJJJJJJJJJIIGI X0:i:513 MD:Z:51 RG:Z:1 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
HWI-H200:53:D08U2ACXX:6:1104:2930:78353 177 chr1 10004 0 51M chr22 38431286 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC IIGAF?JJIGADJIGGD?GHGEEEIHGCCGIIHIHHIHFDHDHDDDDB@@B X0:i:515 MD:Z:51 RG:Z:1 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
HWI-H200:53:D08U2ACXX:6:1205:3665:10423 99 chr1 10054 0 51M = 10366 363 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA CCCFFFFFDBHHHIGEGEHEHIJJIHIFGIIGEHIGH9FGHHIIJJGGI=C X0:i:502 MD:Z:51 RG:Z:1 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
HWI-H200:53:D08U2ACXX:6:1101:4778:107011 163 chr1 10056 0 51M = 10355 350 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCCFFFFFHHHHGJJJJJJJJJJIJJIJJIIHGIJECEHIJ;FGEIIEHCA X0:i:508 MD:Z:51 RG:Z:1 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
SAM (Sequence Alignment/Map) BAM
Query name HWI-H200:53:D08U2ACXX:6:1101:1233:2037
Flag 83
Reference name chr15
Leftmost position 47933389
Mapping quality 29
CIGAR string 51M
Mate reference =
Mate position 47933089
Insert size 351
Query sequence AATGAATGNCCATGGNCAGCAGCAGGACAGCAGGAACCACGTCT
Quality 00#9DG?:1#FB@>E@BGHHCGCFABIIHEIGFHFDC7;ADB@?@
Optional fields XT:A:U NM:i:2 SM:i:29 AM:i:29 X0:i:1 X1:i:0
BAM format
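Two of the fields above are compact encodings: the flag is a bit field and the CIGAR string is a run-length description of the alignment. A short sketch of how they can be unpacked, using the bit meanings defined in the SAM specification; the function names are illustrative:

```python
import re

# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bit is set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


print(decode_flag(83))        # the flag from the record above
print(parse_cigar("51M"))     # the CIGAR from the record above
print(parse_cigar("45M6S"))   # 45 aligned bases followed by 6 soft-clipped bases
```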
ACTACACGTGCAGGGACGT CTACGATCGATCGA AGACGCAGCTACTACACG
CTACGATCGATCTACGCAGCTACTACACGTGCTGGGACGC REF
ACCACACGTGCAGG TCGATCGACG
CTACG ATCGACGCAGCTACCA AGGGACGT
READS
Is it a variant allele?
P(CC|D) = 0.06
P(CT|D) = 0.94
P(TT|D) = 3 × 10^-11
What is the most likely genotype?
Variant calling - GATK
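Posterior probabilities such as P(CT|D) come from combining the observed bases at a site with a prior over genotypes. The sketch below uses a deliberately simplified diploid model (independent reads, one error rate per base, errors spread evenly over the three other bases). It is not GATK's actual model, and the bases, error rates and priors are made-up illustration values:

```python
from math import prod


def base_likelihood(base, allele, err):
    """P(observed base | true allele), with errors spread over the 3 other bases."""
    return 1 - err if base == allele else err / 3


def genotype_posteriors(bases, errors, genotypes, priors):
    """Posterior P(genotype | data) for a diploid site under the simplified model."""
    unnormalised = {}
    for gt in genotypes:
        a1, a2 = gt
        likelihood = prod(
            0.5 * base_likelihood(b, a1, e) + 0.5 * base_likelihood(b, a2, e)
            for b, e in zip(bases, errors)
        )
        unnormalised[gt] = priors[gt] * likelihood
    total = sum(unnormalised.values())
    return {gt: value / total for gt, value in unnormalised.items()}


# Hypothetical pileup: 4 reads support C (reference) and 4 support T.
bases = "CCTCTTCT"
errors = [0.001] * len(bases)                        # Phred 30 for every base
genotypes = ["CC", "CT", "TT"]
priors = {"CC": 0.998, "CT": 0.0015, "TT": 0.0005}   # hypothetical priors

posteriors = genotype_posteriors(bases, errors, genotypes, priors)
best = max(posteriors, key=posteriors.get)
print(posteriors, best)
```

Even with a prior that strongly favours the homozygous reference genotype, an even mix of high-quality C and T reads makes the heterozygous genotype the most likely one.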
VCF format
In which gene is it located?
  Name, description, OMIM, pathway, GO, expression profiles . . .
Where in the gene is it located?
  Intron, exon, UTR, intergenic region, splice site
Is there any AA change?
  GAA -> GAG = E->E
  GTT -> CTT = V->L
  TGG -> TGA = W->X
  TGA -> CGA = X->R
What impact does the AA change have?
  Damaging, benign
Is it a known SNP?
Annotation – Annovar, SIFT, PERL
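The "Is there any AA change?" examples can be reproduced with the standard genetic code. The sketch below is not how Annovar or SIFT work internally; it only classifies a single codon substitution, the function name is illustrative, and stop codons are written `*` where the slide writes X:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(ref_codon, alt_codon):
    """Return (reference AA, variant AA, consequence) for a codon substitution."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense (stop gained)"
    elif ref_aa == "*":
        kind = "stop lost"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


for ref, alt in [("GAA", "GAG"), ("GTT", "CTT"), ("TGG", "TGA"), ("TGA", "CGA")]:
    print(ref, "->", alt, classify_change(ref, alt))
```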
300,000 SNPs - 10,000 Indels
Variants list
Exome sequencing
cases
Coding variants
Controls Genetic variation DBs
Disease model Disease knowledge
Candidate genes
Family filters
The real work begins…
http://www.sciencedirect.com/science/article/pii/S0002929711003946
Variants filtering
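One way to read the filtering scheme is as a series of set operations: start from the variants shared by the cases, keep the coding ones, then remove anything seen in controls or in genetic variation databases. The sketch below illustrates that idea only; the function name, the variant tuples and all example data are hypothetical, and disease-model and family filters are left out:

```python
def filter_candidates(case_variants, control_variants, known_variants, coding_variants):
    """Keep coding variants shared by all cases and absent from controls and databases.

    case_variants and control_variants are lists of sets, one set per individual;
    case_variants must contain at least one set.
    """
    candidates = set.intersection(*case_variants) & coding_variants
    for control in control_variants:
        candidates -= control
    return candidates - known_variants


# Hypothetical variants written as (chromosome, position, ref, alt).
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr2", 2000, "C", "T")
v3 = ("chr3", 3000, "G", "A")
v4 = ("chr4", 4000, "T", "C")

cases = [{v1, v2, v3, v4}, {v1, v2, v3}]
controls = [{v2}]
known = {v3}                      # e.g. already listed in a variation database
coding = {v1, v2, v3}

print(filter_candidates(cases, controls, known, coding))
```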
Data visualization