PipeCleaner: Sanitation for your NGS Pipeline
Ken Doig, Jason Ellul
Peter MacCallum Bioinformatics Core
BigData – April 2013
What we do
• Molecular pathology services
• Blood and tumour tissue samples
• Targeted genetic sequencing using amplicon panels
• Between 4 and 48 cancer-specific genes
• Looking for needles in haystacks
• Very sensitive assays
19 Apr 2013 BigData 2
...
... AAAAGCAGGT TATATAGGCT AAATAGAACT AATCATTGTT TTAGACATAC TTATTGACTC TAAGAGGAAA GATGAAGTAC TATGTTTTAA AGAATATTAT ATTACAGAAT TATAGAAATT AGATCTCTTA CCTAAACTCT TCATAATGCT TGCTCTGATA GGAAAATGAG ATCTACTGTT TTCCTTTACT TACTACACCT CAGATATATT TCTTCATGAA GACCTCACAG TAAAAATAGG TGATGTTGGT AGCTAGGAGT GAAATCTCGA TGGAGTGGGT CCCATCAGTT TGAACAGTTG TCTGGATCCA TTTTGTGGAT GGTAAGAATT GAGGCTATTT TTCCACTGAT TAAATTTTTG GCCCTGAGAT GCTGCTGAGT TACTAGAAAG TCATTGAAGG TCTCAACTAT AGTATTTTCA TAGTTCCCAG TATTCACAAA AATCAGTGTT CTTATTTTTT ATGTAAATAG ATTTTTTAAC TTTTTTCTTT ... ...
Why: Acquisition of mutations
B Vogelstein et al. Science 2013;339:1546-1558
[Figure labels: Driver mutations; Somatic mutations]
Allele distribution: Cancer 2015 study data
[Chart categories: Known Polymorphisms (dbSNP); Known Cancerous (Cosmic); VOUS]
The problem
• Ageing population – more cancer
• NGS means more data / more variants
• Need faster turnaround
• Need audited processes
• Replace manual paper trail
• Get rid of uncontrolled data
Software Qualities
Software should be:
• Correct
• Efficient
• Robust
• Flexible
• Repeatable
• Maintainable
• Reusable
• Using debugged methods/libraries
• Logging all activity
• Version controlled
Groovy
http://www.youtube.com/watch?v=7jZsEUMeU94
Sample Data Flow
[Flow diagram. Lanes: Wet Lab, Bioinformatics, Clinical Informatics. Legend: Manual Step, Automatic Step.]
• Input: Patient Sample
• Steps: Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering
• Decision: Known Variant? (Yes / No)
• Further steps: Publish Report, Curation, Variant Normalisation, Document Assembly, Report Editing and Signoff
• Data stores: Peter Mac Mutation DB, External and Locus Specific DBs
• Output: Patient Clinical Report
Report Assembly
Images
Generated graphics
PathOS Database
• Analysis data
• Patient details
• Lab QC
• Pharmacogenomics
• Clinical reports
• Mutation descriptions
“TransMute”
Document Assembly System
PipeCleaner
[Diagram: PipeCleaner test harness]
• Components: Test Controller, Variant Generator, Read Generator (wgsim), Pipeline Under Test, Result Comparator
• Inputs: reference genome; SNPs, indels (VCF file(s)) and known variants (dbSNP, Cosmic)
• Intermediate data: mutated genomes; simulated read data; reference genomes (if mapping to reference)
• Pipeline results: variant calls, assemblies, alignments
• Output: Test results
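The Variant Generator step takes a reference genome plus a list of variants and emits a mutated genome. The slides do not show how PipeCleaner does this, so the following is only a minimal sketch for SNPs. The function name `apply_snps` and the `(position, ref, alt)` tuple layout are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Snp = Tuple[int, str, str]  # (1-based position, reference base, alternate base)


def apply_snps(reference: str, snps: Iterable[Snp]) -> str:
    """Return a copy of `reference` with each SNP substituted in."""
    bases = list(reference)
    for pos, ref, alt in snps:
        index = pos - 1  # VCF-style positions are 1-based
        if bases[index] != ref:
            raise ValueError(
                f"reference mismatch at {pos}: expected {ref}, found {bases[index]}"
            )
        bases[index] = alt
    return "".join(bases)


mutated = apply_snps("ACGTACGT", [(2, "C", "T"), (8, "T", "A")])
print(mutated)  # ATGTACGA
```

Checking the reference base before substituting catches coordinate or genome-build mix-ups early, before any reads are simulated.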
Simulating tumour genetics
• Amplicon region
• Germline variant: homozygous
• Variant allele 40%: detected
• Variant allele 20%: detected
• Low frequency allele 10%: not detected
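One way to picture simulating a tumour variant at a given allele fraction (10%, 20%, 40%, or 100% for a homozygous germline variant) is to decide, read by read, whether each read is drawn from the mutated genome or the reference. The sketch below assumes independent draws and a fixed seed; `simulate_allele_counts` is a hypothetical helper, not part of PipeCleaner.

```python
import random
from typing import Tuple


def simulate_allele_counts(depth: int, allele_fraction: float,
                           seed: int = 0) -> Tuple[int, int]:
    """Draw `depth` reads; each carries the variant with probability `allele_fraction`.

    Returns (variant read count, reference read count).
    """
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    variant_reads = sum(1 for _ in range(depth) if rng.random() < allele_fraction)
    return variant_reads, depth - variant_reads


for fraction in (0.10, 0.20, 0.40, 1.00):
    variant, reference = simulate_allele_counts(depth=200, allele_fraction=fraction)
    print(f"target {fraction:.0%}: {variant} variant reads, {reference} reference reads")
```

At modest depth the observed count scatters around the target, which is one reason a low frequency allele can fall below a caller's detection threshold.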
Pipeline Validation (PipeCleaner)
Input Parameters (expected):
• Germline mutation: dbSNP = rs1050171 @ 100%
• Tumour mutation: cosmic = cosm476 @ 10%
• Readlen: 200
• Read depth: 200
• Readerr: 1.0%
• Indel fraction: 10.0%
[Figure: Pipeline Output (actual)]
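The read length, read depth, read error and indel fraction listed on this slide look like settings for the wgsim read generator. The sketch below shows how they might be turned into a wgsim command line. The mapping of slide parameters to options, the paired-end read-pair calculation, the file names and the 1000 bp region length are all assumptions; the slides do not show the actual invocation. The function only builds the argument list and does not run anything.

```python
import math
from typing import List


def wgsim_command(ref_fasta: str, out_prefix: str, region_length: int,
                  depth: int = 200, read_length: int = 200,
                  error_rate: float = 0.01, indel_fraction: float = 0.10,
                  seed: int = 1) -> List[str]:
    """Build a wgsim argument list; nothing is executed here."""
    # Each read pair contributes 2 * read_length bases towards the target depth.
    read_pairs = math.ceil(depth * region_length / (2 * read_length))
    return [
        "wgsim",
        "-e", str(error_rate),      # base error rate
        "-R", str(indel_fraction),  # fraction of simulated mutations that are indels
        "-1", str(read_length),     # length of the first read
        "-2", str(read_length),     # length of the second read
        "-N", str(read_pairs),      # number of read pairs
        "-S", str(seed),            # random seed, for repeatable runs
        ref_fasta,                  # input: (mutated) reference FASTA
        f"{out_prefix}_1.fq",       # output: first reads
        f"{out_prefix}_2.fq",       # output: second reads
    ]


print(" ".join(wgsim_command("mutated.fa", "sim", region_length=1000)))
```

The returned list can be passed to `subprocess.run` once wgsim is installed; keeping command construction separate makes the parameters easy to log and audit.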
SNPs – simulated reads (PipeCleaner)
[Chart: SNP Accuracy (err=0.003, depth=2000). X axis: Allele Frequency %, 0.2 to 10. Y axis: SNP Count, 0 to 120. Series: True SNPs; VarScan 2; VS false -ve; GATK 2.1; GK false -ve; FreeBayes; FB false -ve.]
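The false negative series in this chart can be thought of as, for each simulated allele frequency, the number of true SNPs a caller failed to report. A minimal sketch of that tally follows, assuming variants are keyed by `(chromosome, position, ref, alt)`; the function name and the example data are made up for illustration and are not results from the talk.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def false_negatives_by_af(truth: Dict[Variant, float],
                          called: Set[Variant]) -> Dict[float, Tuple[int, int]]:
    """Map each simulated allele frequency (%) to (true SNP count, missed SNP count)."""
    totals: Dict[float, int] = defaultdict(int)
    missed: Dict[float, int] = defaultdict(int)
    for variant, af_percent in truth.items():
        totals[af_percent] += 1
        if variant not in called:
            missed[af_percent] += 1  # simulated but not reported: a false negative
    return {af: (totals[af], missed[af]) for af in sorted(totals)}


# Made-up example data, not results from the talk.
truth = {
    ("chr1", 100, "A", "G"): 0.2,
    ("chr1", 250, "C", "T"): 0.2,
    ("chr2", 75, "G", "A"): 5.0,
}
called = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")}
print(false_negatives_by_af(truth, called))  # {0.2: (2, 1), 5.0: (1, 0)}
```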
Deletions – simulated reads (PipeCleaner)
[Chart: Deletion Validation 6-Dec-2012. X axis: Deletion Size (bp), 1 to 124. Y axis: Number of Variants, 0 to 25. Series: True variants; VarScan 2; GATK 2.1; FreeBayes.]
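To test deletion calling across a range of sizes, each simulated deletion needs both a mutated sequence and a truth record to compare calls against. The sketch below assumes the VCF convention of anchoring a deletion on the base immediately before the deleted bases; `make_deletion` is a hypothetical helper, not PipeCleaner's own code.

```python
from typing import Tuple


def make_deletion(reference: str, start: int,
                  size: int) -> Tuple[str, Tuple[int, str, str]]:
    """Delete `size` bases beginning at 1-based `start`.

    Returns the mutated sequence and a VCF-style (POS, REF, ALT) record, in which
    POS is the anchor base immediately before the deleted bases.
    """
    if size < 1 or start < 2 or start - 1 + size > len(reference):
        raise ValueError("deletion must lie inside the sequence and leave an anchor base")
    anchor_pos = start - 1                    # 1-based position of the anchor base
    anchor = reference[anchor_pos - 1]
    deleted = reference[start - 1:start - 1 + size]
    mutated = reference[:start - 1] + reference[start - 1 + size:]
    return mutated, (anchor_pos, anchor + deleted, anchor)


mutated, record = make_deletion("ACGTACGT", start=3, size=2)
print(mutated, record)  # ACACGT (2, 'CGT', 'C')
```

Looping `size` over the range on the chart's x axis would give one truth record per deletion size.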
Inserts – simulated reads (PipeCleaner)
[Chart: Insert validation 6-Dec-2012. X axis: Insert Size (bp), 1 to 129. Y axis: Number of Variants, 0 to 30. Series: True variants; VarScan 2; GATK 2.1.]
PipeCleaner Outcomes
• Regression testing
  – Unit testing
  – Control samples
    • End to end test from PCR to Database
    • Two control samples per run
    • 4 variants per control (3 x 2.5% af, 1 x 100% af)
    • Test for ref/alt, gene, HGVS, allele freq., conseq.
  – Set of dbSNP and Cosmic mutations
    • Test for annotation and allele freq.
• Performance testing
  – Increased insert/deletion size from 15 bp to 100 bp
  – Calculating false +ve and false –ve rates of pipeline
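The control sample checks listed above (ref/alt, gene, HGVS, allele frequency, consequence) could be automated as a regression test along the following lines. This is a sketch under stated assumptions: the record layout, the `check_control` name, the allele frequency tolerance of 1 percentage point and the placeholder variants are all hypothetical, not taken from the talk.

```python
from typing import Dict, List

FIELDS = ("ref", "alt", "gene", "consequence")  # compared for exact equality


def check_control(expected: List[Dict], observed: Dict[str, Dict],
                  af_tolerance: float = 1.0) -> List[str]:
    """Return a list of failure messages; an empty list means the control passed."""
    failures = []
    for exp in expected:
        name = exp["hgvs"]
        obs = observed.get(name)
        if obs is None:
            failures.append(f"{name}: not reported")
            continue
        for field in FIELDS:
            if obs.get(field) != exp[field]:
                failures.append(
                    f"{name}: {field} is {obs.get(field)!r}, expected {exp[field]!r}"
                )
        af = obs.get("af")
        if af is None or abs(af - exp["af"]) > af_tolerance:
            failures.append(f"{name}: allele frequency {af} not within "
                            f"{af_tolerance} of {exp['af']}")
    return failures


# Placeholder control variants (hypothetical names and values).
expected = [
    {"hgvs": "GENE_A:c.100A>G", "ref": "A", "alt": "G", "gene": "GENE_A",
     "consequence": "missense_variant", "af": 2.5},
    {"hgvs": "GENE_B:c.200C>T", "ref": "C", "alt": "T", "gene": "GENE_B",
     "consequence": "synonymous_variant", "af": 100.0},
]
observed = {
    "GENE_A:c.100A>G": {"ref": "A", "alt": "G", "gene": "GENE_A",
                        "consequence": "missense_variant", "af": 2.9},
}
for message in check_control(expected, observed):
    print(message)
```

Returning messages rather than raising on the first mismatch lets one run report every failing control variant, which suits an audited process.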