module 4 mcpherson 2012 - files.bioinformatics.ca · 120529 2 module’4’...
TRANSCRIPT
12-05-29
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 4 Mapping and Genome Rearrangement
John McPherson, Ph.D.
[Title figure: DNA fragment with paired-end reads ATCAA and CTAAG.]
Module 1 bioinformatics.ca
Platform complexity
• Moving away from a “one-size-fits-all” platform
• Cross-platform data integration needed.
[Figure: sequencing platforms arranged by increasing run time and increasing data per run. Throughput labels shown: 100Mb/1h, 150Mb/3h, 700Mb/23h, 2Gb/27h, 120Gb/1d, 90Gb/10d, 100Gb/15d, 600Gb/10d, 14TB/run. Proton? GridION?]
[Figure labels: Single Molecule, Amplified Template, Long Read, Short Read.]
Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?
Spectral overlap
http://www.olympusfluoview.com/theory/bleedthrough.html
Sources of error
Illumina: pre-phasing and phasing
What is a base quality?
Base Quality    Perror(obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
PHRED values:
- Sequence a known template and align reads
- Determine error rate
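The table above follows the standard Phred relation Q = -10 * log10(P). A minimal Python sketch of the conversion in both directions; the function names are illustrative and not taken from the slides:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P to a Phred quality score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Reproduce the rows of the base quality table.
    for q in (3, 5, 10, 20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4%}")
```

The 50 % and 32 % entries in the table are rounded values of roughly 50.1 % and 31.6 %.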
Calibrating Base Qualities
[Figure panels: “Original” and Recalibrated.]
Mark DePristo, Broad Institute, June 2009
Error Profiles
Roche 454
• error rate is low (< 0.5%)
• most errors are INDELs
[Pie chart of error types. Labels in slide order: 72%, 24%, 4%; insertions, deletions, substitutions.]
Illumina
• error rate is also low
• most errors are substitutions
Error rate over all bases: Correct 99.5%, Error 0.5%
Slides by M. Stromberg
Mismatch by cycle
Fasta files
ASF-1.fa ASF-2.fa
• Reads are often stored in fasta files
• Separate file for forward and reverse pairs
• header line: read name/pairing info
• sequence line: nucleotides
Fastq files
ASF-1.fastq ASF-2.fastq
• Most reads are stored in fastq
• 4 lines per read:
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line
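The four-line record layout can be read with a few lines of Python. This is a minimal sketch, not the workshop's tooling: it assumes Phred+33 quality encoding (the slide does not say which offset is used), and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, i.e. quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical two-read example; not data from ASF-1.fastq.
EXAMPLE = "@read1/1\nACGT\n+\nII5#\n@read2/1\nGGCA\n+\n!!II\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Older Illumina pipelines used a different offset (Phred+64), so the offset should be confirmed before decoding real data.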
Alignment
• Reference-based alignment:
• Goal: find position in genome from which read was sampled
– Comparison is to the human reference genome (e.g. hg19)
• Can't we use BLAT or BLAST?
– optimized for long reads
– slow
• Things to consider:
– support for your technology
– speed / sensitivity
– parallelism
– tolerance for gapped alignment
– handling of multiple good mappings
Aligner feature matrix
Columns: Illumina, AB SOLiD, Roche 454, Helicos, gapped, all alignments, multithreaded
[Marks are listed per aligner in slide order; the column each mark belongs to is not recoverable from this transcript.]
Bowtie: X X X X
BWA: X X X X
BFAST: X X X X X X X
Corona Lite: X X
ELAND: X
GenomeMapper: X X X X
gnumap: X X X X
karma: X X X *
MAQ: X X
MOSAIK: X X X X X X X
MrFAST: X X X
MrsFAST: X X
Novoalign: X X X *
RMAP: X X
SeqMap: X X X
SHRiMP: X X X X X X
Slider: X X
SOAP2: X X X
SSAHA2: X X X X
SOCS: X X
SXOligoSearch: X X X X
Zoom: X X * X
Slides by M. Stromberg
Performance
Illumina 37 bp (human genome)
program      aligned reads/s
Karma        15,635
Bowtie       13,889
SOAP         13,580
BWA          10,314
ELAND2       8,859
MOSAIK       6,792
Srprism      2,768
BFAST        1,125
Novoalign    1,095
[Bar chart: speed (reads/s) by aligner, axis from 0 to 20,000.]
Slides by M. Stromberg
Genome Mapping
De novo assembly
Reference alignment
Reference alignments
[Figure build across three slides: a sequence read placed against the reference genome with "?"; then with mismatches marked "x x x"; then with "x x x x" and "?".]
Alignment Quality
[Plot: actual alignment quality (y axis, 0 to 70) against assigned alignment quality (x axis, 0 to 60). Series: optimal; MOSAIK using PE AQs.]
Slides by M. Stromberg
Alignment Quality
alignment errors due to heuristic algorithm
probability that the best hit is wrong
Slides by M. Stromberg
INDEL Cleaning
“You like tomato and I like tomahto” (George Gershwin)
• NGS010 BRCA1 deletion variant
Reference:    …CGCTTTAATTTATTTGTG…
Variant:      …CGCTTTATTTGTG…
Alignment 1:  …CGC-----TTTATTTGTG…   c.1500_1504delTTTAA
Alignment 2:  …CGCTTTA-----TTTGTG…   c.1504_1508delATTTA
Sources compared on the slide: CAP/CLIA Sanger sequence, NGS pipeline
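The two deletion names on this slide describe the same variant sequence. A small Python check of that claim, using the reference fragment from the slide; the offsets are 0-based positions within the fragment (an illustration choice), not the c. coordinates:

```python
REFERENCE = "CGCTTTAATTTATTTGTG"  # reference fragment shown on the slide


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Delete `length` bases from `ref`, starting at 0-based offset `start`."""
    return ref[:start] + ref[start + length:]


# Offsets are relative to the fragment; they differ by 4, like 1500 and 1504.
call_a = apply_deletion(REFERENCE, 3, 5)  # removes TTTAA, as in c.1500_1504delTTTAA
call_b = apply_deletion(REFERENCE, 7, 5)  # removes ATTTA, as in c.1504_1508delATTTA

print(call_a, call_b, call_a == call_b)
```

Because the deleted bases sit in a short repeat, either placement yields the variant sequence CGCTTTATTTGTG, which is why indel calls need to be normalized before pipelines are compared.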
De novo assembly
[Figure build across several slides; labels shown: Read from a repeat; Long Reads.]
What are Paired Reads?
ATCAA CTAAG
Insert size (IS)
DNA fragment
Paired-end Reads
Slides by M. Brudno
De novo assembly
SAM/BAM
• SAM = text, BAM = binary
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77
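Fields labeled on the slide, in order: read name (SRR013667.1), flag (99), reference (19), position (8882171), CIGAR (76M), mate position (8882214), bases, base qualities.
The unlabeled values are the mapping quality (60), the mate reference (=, meaning the same reference as the read), and the template length (119).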
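As an illustration of the record layout, the sketch below splits a SAM line into its eleven mandatory fields and decodes the flag. The field order and flag bits follow the SAM format specification; the function names are hypothetical, real SAM is tab-separated (the slide shows spaces), and the bases and qualities are replaced by "*" here to keep the example short.

```python
from typing import Dict, List

# SAM flag bits as defined in the SAM format specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one tab-separated SAM alignment line into its mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(str(record[key]))
    return record


def decode_flag(flag: int) -> List[str]:
    """List the names of the bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# First nine fields from the slide; SEQ and QUAL are given as "*" for brevity.
line = "\t".join(["SRR013667.1", "99", "19", "8882171", "60", "76M",
                  "=", "8882214", "119", "*", "*"])
rec = parse_sam_line(line)
print(rec["qname"], decode_flag(rec["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64: a paired read in a proper pair, first in the pair, with its mate on the reverse strand. The template length also checks out: the mate starts 43 bases downstream and is 76 bases long, giving 119.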
CIGAR
MD:3^A03C5
samtools
• used for low level processing of BAM/SAM files
• convert between BAM and SAM
• sort alignments by position
• create index for sorted BAM file
What kinds of variation are there?
• Single Nucleotide Polymorphisms (SNPs)
• Short indels (< read length)
• Structural variations
– Large scale insertions and deletions
– Inversions
– Translocations
– Copy number variation
Structural variants
Mate-pair and paired-end reads can be used to detect structural variants.
Mate-Pairs (1 - 20kb), starting from genomic DNA:
1. Fragmentation & circularization to an internal adaptor
2. Shear
3. Isolate internal adaptors and fragment ends
4. Add amplification and sequencing adaptors
5. Sequence
Paired-Ends (200 - 500bp), starting from genomic DNA:
1. Fragmentation
2. Add amplification and sequencing adaptors
3. Sequence
Clusters of aberrantly aligned read pairs
Mapping of read pairs to reference
• Spanning unexpected distance
• Unexpected orientation
Detection of:
• Deletions
• Insertions
• Translocations
• Inversions
[Figure: histogram of fragment number against fragment size, and read-pair orientation diagrams (sequence vs. map, Build35) for the concordant, insertion, deletion, inversion, and translocation (ChrA to ChrB) cases.]
Insertion: signature
Mapped distance < IS - 2 s.d., where IS is the insert size
Size of insertion = Insert size - Mapped distance
[Figure: matepair on the donor (don) and the reference (ref), showing insert size and mapped distance.]
Slides by M. Brudno
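The insertion signature amounts to a threshold test followed by a subtraction. A minimal sketch, with an illustrative function name and made-up library numbers (insert size 400, s.d. 30):

```python
from typing import Optional


def insertion_size(mapped_distance: float, insert_size: float, sd: float) -> Optional[float]:
    """Return the estimated insertion size if the pair shows the insertion
    signature (mapped distance < IS - 2 s.d.), otherwise None."""
    if mapped_distance < insert_size - 2 * sd:
        return insert_size - mapped_distance
    return None


# Hypothetical library: mean insert size 400 bp, standard deviation 30 bp.
print(insertion_size(250, 400, 30))  # pair maps 150 bp closer than expected
print(insertion_size(350, 400, 30))  # within 2 s.d. of the insert size
```

A single pair is weak evidence; the following slides require several consistent pairs before a call is made.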
Insertion: consistency
1. Overlap
2. Size of insertion explained by X = Size of insertion explained by Y
[Figure: matepairs X and Y on the donor (don) and the reference (ref).]
Slides by M. Brudno
Insertion: narrowing down the location
• Insertion lies within spanning region of matepair
• For clusters, it lies within the intersection
[Figure: donor (don) and reference (ref), with the possible location of the insertion marked.]
Slides by M. Brudno
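For a cluster of matepairs, the candidate location is the intersection of their spanning regions. A minimal sketch, assuming each spanning region is a closed interval of reference coordinates; the coordinates below are invented:

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference, start <= end


def cluster_intersection(spans: List[Interval]) -> Optional[Interval]:
    """Intersect the spanning regions of all matepairs in a cluster.

    Returns None when the cluster is empty or the regions share no position."""
    if not spans:
        return None
    start = max(s for s, _ in spans)
    end = min(e for _, e in spans)
    return (start, end) if start <= end else None


# Hypothetical cluster of three matepairs supporting the same insertion.
print(cluster_intersection([(1000, 1400), (1100, 1500), (1200, 1450)]))
```

An empty intersection means the pairs fail the overlap requirement and should not be clustered together.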
SV summary
Type                  Mapped Distance          Orientation
Insertion             too small                correct
Deletion              too big                  correct
Inversion             *
Tandem duplication    *
Interchromosomal      different chromosomes    N/A
Slides by M. Brudno
Where can we go wrong: missed insertion
Insertions larger than IS cannot be detected with basic method
[Figure: donor (don) and reference (ref), with the insert size (IS) marked.]
Somatic vs. Germline
• tumor vs. normal sequencing
• approach 1:
– find SVs separately in two samples
– filter out somatic SVs that overlap germline SVs
• approach 2:
– find somatic SVs
– for each somatic SV, find any type of evidence in germline
• a single discordant but non-consistent matepair?
– filter out anything with evidence
Slides by M. Brudno
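Approach 1 can be sketched as an overlap filter between two call sets. The `SV` record, the overlap rule (same chromosome, intersecting intervals), and the example calls are all assumptions made for illustration:

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "deletion", "insertion", "inversion"


def overlaps(a: SV, b: SV) -> bool:
    """True if two calls are on the same chromosome and their intervals intersect."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def somatic_candidates(tumor: List[SV], normal: List[SV]) -> List[SV]:
    """Keep tumor SVs that overlap no SV called in the matched normal sample."""
    return [t for t in tumor if not any(overlaps(t, n) for n in normal)]


# Hypothetical calls from a tumor and its matched normal.
tumor_calls = [SV("chr1", 100, 500, "deletion"), SV("chr2", 1000, 1200, "insertion")]
normal_calls = [SV("chr1", 450, 800, "deletion")]
print(somatic_candidates(tumor_calls, normal_calls))
```

Approach 2 is stricter: it would also discard a tumor call when the normal sample shows even one supporting discordant matepair, which requires going back to the read level rather than comparing call sets.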
Variant detection - distinguishing novel variants from errors
Reference: ACGT …
• Germline variants: 50% Ref : 50% Var; 40% for : 60% rev
• Somatic mutations: 80% Ref : 20% Var
• Strand bias
• Misaligned reads: not consistent, so sometimes seen as somatic
• PCR artifact
Structural Variants and Split Reads
Paired Short Reads
Align
Most of these pairs can be aligned to the reference genome
For some paired-end reads, one read of the pair may not map because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno
Split read signatures
[Figure: four donor (don) vs. reference (ref) diagrams; the first two are labeled Deletion and Insertion, and the labels of the other two are not in the transcript.]
Slides by M. Brudno
Pair informed split mapping
[Figure: deletion; the two parts of the split read map to reference region 1 and reference region 2 on ref.]
• searching the whole genome for split mappings
– gives a lot of false mappings
– too slow
• can exactly estimate breakpoint and indel sizes
• can detect very small deletions
Slides by M. Brudno
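The idea can be sketched as a search restricted to a local window near the mapped mate, rather than the whole genome. This toy version uses exact string matching and a made-up window and read; a real tool would tolerate mismatches and handle repeats:

```python
from typing import Optional, Tuple


def find_deletion_split(window: str, read: str, min_anchor: int = 4) -> Optional[Tuple[int, int]]:
    """Look for a split of `read` whose two parts match `window` exactly,
    in order, with a gap between them.

    Returns (breakpoint, deletion_size) as 0-based window coordinates,
    or None if no such split exists. `min_anchor` (an assumed parameter)
    is the minimum length of each part."""
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:k], read[k:]
        left_pos = window.find(left)
        if left_pos < 0:
            continue
        left_end = left_pos + k
        # The right part must start at least one base after the left part ends.
        right_pos = window.find(right, left_end + 1)
        if right_pos >= 0:
            return left_end, right_pos - left_end
    return None


# Hypothetical reference window; the donor lacks the six bases TTAATA.
WINDOW = "GATTACAGGC" + "TTAATA" + "GGTTTACGCATG"
READ = "ACAGGC" + "GGTTTA"  # read spanning the deletion breakpoint
print(find_deletion_split(WINDOW, READ))
```

Restricting the search to the window implied by the mapped mate is what keeps the number of false split mappings low and the search fast.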
SV Software and Exercise
• There are many:
• GASV (will use this today)
– http://compbio.cs.brown.edu/software.html
– S. Sindi, E. Helman, A. Bashir, B.J. Raphael. (2009) A Geometric Approach for Classification and Comparison of Structural Variants. Bioinformatics. 25: i222-i230
• Breakdancer
– http://breakdancer.sourceforge.net/
– http://www.nature.com/nmeth/journal/v6/n9/abs/nmeth.1363.html
Gene fusions
• if a linking signature connects two genes, this might indicate a gene fusion
ChrA
ChrB
Gene X
Gene Y
Gene XY Protein
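A linking signature can be screened for gene fusions by checking whether the two ends of a read pair fall in different genes. The sketch below is illustrative only: the gene coordinates, the pair format, and the `min_pairs` support threshold are assumptions, not part of the slides.

```python
from typing import Dict, List, Optional, Tuple

Gene = Tuple[str, int, int]       # (chromosome, start, end)
Pair = Tuple[str, int, str, int]  # (chrom of end 1, pos of end 1, chrom of end 2, pos of end 2)


def gene_at(genes: Dict[str, Gene], chrom: str, pos: int) -> Optional[str]:
    """Return the name of the first gene containing the position, if any."""
    for name, (g_chrom, start, end) in genes.items():
        if g_chrom == chrom and start <= pos <= end:
            return name
    return None


def fusion_candidates(genes: Dict[str, Gene], pairs: List[Pair],
                      min_pairs: int = 2) -> Dict[Tuple[str, str], int]:
    """Count read pairs whose ends lie in two different genes and keep
    gene pairs supported by at least `min_pairs` read pairs."""
    counts: Dict[Tuple[str, str], int] = {}
    for c1, p1, c2, p2 in pairs:
        g1, g2 = gene_at(genes, c1, p1), gene_at(genes, c2, p2)
        if g1 and g2 and g1 != g2:
            first, second = sorted((g1, g2))
            counts[(first, second)] = counts.get((first, second), 0) + 1
    return {key: n for key, n in counts.items() if n >= min_pairs}


# Hypothetical annotation and read pairs.
GENES = {"GeneX": ("chrA", 1000, 5000), "GeneY": ("chrB", 20000, 26000)}
PAIRS = [
    ("chrA", 1200, "chrB", 21000),  # links GeneX and GeneY
    ("chrB", 20500, "chrA", 4800),  # links GeneY and GeneX
    ("chrA", 1500, "chrA", 1800),   # both ends in GeneX
    ("chrA", 9000, "chrB", 21000),  # first end is intergenic
]
print(fusion_candidates(GENES, PAIRS))
```

A hit here only says the signature connects two genes; whether a functional fusion protein results depends on orientation and reading frame, which this sketch does not check.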
Things we have set up:
• Loaded data files to an S3 bucket
• We brought up an Ubuntu (Linux) instance, and loaded a whole bunch of software for NGS analysis.
• We then cloned this, and made separate instances for everybody in the class.
• We’ve simplified the security: you basically all have the same login and file access, and opened ports. In your own world you would be more secure.
All on Wiki!
http://bioinformatics.ca/workshop_wiki/
Login: FirstnameLastname
Password: guest
On Mac: Control+
# is your assigned student number
Ask your question, and then gather the data, the tools and hardware you need
• Data and Databases: you will take workshops, you will read papers, and you will go online: SeqAnswers & maybe the bioinformatics.ca Links Directory
• Tools: you will take workshops, you will read papers, and you will go online: SeqAnswers & maybe the bioinformatics.ca Links Directory
• Hardware: you need to decide?
We are now going to start an exercise in mapping and structural variant detection.
We are on a Coffee Break & Networking Session
Cryptic Fusion Oncogene
• Welch et al. JAMA 305; 1577-1584 – April 20, 2011
• Acute Promyelocytic Leukemia (APL)
– >90% associated with gene fusion PML-RARA
– Rapid diagnosis is essential as adding all-trans retinoic acid to chemotherapy leads to substantially improved outcome (5-year event-free survival of 69% compared with 29% with chemotherapy alone)
• 39 year old patient diagnosed with acute myeloid leukemia (AML) in first remission referred for allogeneic stem cell transplantation.
• Cytogenetics indicated a poor prognosis and absence of a PML-RARA fusion.
• Leukemic cytomorphology consistent with APL
• Course of treatment uncertain – APL or AML with poor prognosis?
• Whole genome sequencing revealed a 77kb insertion from chromosome 15 into the second intron of the RARA gene on chromosome 17, resulting in a classic PML-RARA fusion.
• 7 week turnaround; ~$40,000
• Patient received ATRA and is in remission at 15 months.